

# Guaranteed DGEMM Accuracy While Using Reduced Precision Tensor Cores Through Extensions of the Ozaki Scheme

Angelika Schwarz  
NVIDIA Corporation  
Stockholm, Sweden  
abschwarz@nvidia.com

Harun Bayraktar  
NVIDIA Corporation  
Santa Clara, California, USA  
hbayraktar@nvidia.com

RuQing G. Xu  
NVIDIA Corporation  
Minato, Tokyo, Japan  
ruqingx@nvidia.com

Pawel Tabaszewski  
NVIDIA Corporation  
Warsaw, Poland  
ptabaszewski@nvidia.com

Anton Anders  
NVIDIA Corporation  
Santa Clara, California, USA  
anton@nvidia.com

John Gunnels  
NVIDIA Corporation  
Santa Clara, California, USA  
jgunnels@nvidia.com

Samuel Rodriguez  
NVIDIA Corporation  
Santa Clara, California, USA  
srodriguezbe@nvidia.com

Cole Brower  
NVIDIA Corporation  
Santa Clara, California, USA  
cbrower@nvidia.com

Kate Clark  
NVIDIA Corporation  
Santa Clara, California, USA  
mclark@nvidia.com

Sebastien Cayrols  
NVIDIA Corporation  
Knoxville, Tennessee, USA  
scayrols@nvidia.com

Victor Podlozhnyuk  
NVIDIA Corporation  
Reading, United Kingdom  
vpodlozhnyuk@nvidia.com

## Abstract

The rapid growth of artificial intelligence (AI) has made low-precision formats such as FP16, FP8, and, most recently, block-scaled FP4 the primary focus of modern GPUs, where Tensor Cores now deliver orders-of-magnitude higher throughput than traditional FP64 pipelines. This hardware shift has sparked a new line of algorithm research: using low-precision units to emulate double-precision accuracy through schemes such as Ozaki decompositions. We advance this direction with Automatic Dynamic Precision (ADP), a fully GPU-resident framework that makes emulated FP64 matrix multiplication both efficient and reliable. At its core is the Exponent Span Capacity (ESC), a hardware-agnostic estimator that conservatively determines the decomposition parameter (a.k.a., slices) required to achieve FP64-level accuracy. Built on ESC, ADP integrates exception handling, run time heuristics, and seamless fallback to native FP64, ensuring correctness without host-device synchronization or user intervention. Additionally, we further improve Ozaki-style decompositions with an unsigned integer slicing scheme, which increases representational efficiency and reduces computational waste. Validated against recently proposed BLAS grading tests, ADP consistently preserves FP64 fidelity on challenging inputs while incurring less than 10% run time overhead. In a 55-bit mantissa setting, our approach achieves up to 2.3 $\times$  and 13.2 $\times$  speedups over native FP64 GEMM on NVIDIA Blackwell GB200 and the RTX Pro 6000 Blackwell Server Edition, respectively. Our results demonstrate

that low-precision accelerators can serve as a practical, production-ready foundation for high-fidelity and high-performance scientific computing workloads.

## CCS Concepts

- Mathematics of computing → Mathematical software performance;
- Computer systems organization → Multicore architectures.

## Keywords

Matrix Multiplication, Emulation, High Performance Computing, Power Efficiency

## ACM Reference Format:

Angelika Schwarz, Anton Anders, Cole Brower, Harun Bayraktar, John Gunnels, Kate Clark, RuQing G. Xu, Samuel Rodriguez, Sebastien Cayrols, Pawel Tabaszewski, and Victor Podlozhnyuk. 2026. Guaranteed DGEMM Accuracy While Using Reduced Precision Tensor Cores Through Extensions of the Ozaki Scheme. In *SCA/HPCAsia 2026: Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region (SCA/HPCAsia 2026), January 26–29, 2026, Osaka, Japan*. ACM, New York, NY, USA, 11 pages. <https://doi.org/10.1145/3773656.3773670>

## 1 Introduction

Over the past decade, the rapid adoption of artificial intelligence has shifted GPU hardware design toward *low-precision formats* such as FP16, FP8, INT8, and, most recently, block-scaled MXFP8 and MXFP4 formats [26]. Specialized accelerators such as Tensor Cores now deliver orders-of-magnitude higher matrix multiplication throughput in these formats than in traditional FP32 and FP64 pipelines [10, 15]. This rift has motivated a new line of algorithm



This work is licensed under a Creative Commons Attribution 4.0 International License.  
*SCA/HPCAsia 2026, Osaka, Japan*  
© 2026 Copyright held by the owner/author(s).  
ACM ISBN 979-8-4007-2067-3/2026/01  
<https://doi.org/10.1145/3773656.3773670>

research: using low-precision units to emulate double-precision accuracy, most notably through *Ozaki-style decompositions* that split FP64 values into low-precision slices, perform multiplications at high throughput, and reassemble the result [19, 22, 25, 29].

Recent work has demonstrated that such approaches can achieve FP64 accuracy on GPUs [23, 28]. However, existing implementations require the user to specify the decomposition parameters, dictated by the numerical properties of the input data, required to maintain accuracy; as such, these are *unsafe* and *impractical* for production deployment [1]. By *unsafe*, we mean they cannot guarantee FP64-level accuracy under all inputs. By *impractical*, we mean they lack mechanisms to ensure efficiency—either wasting compute on benign inputs or failing to perform competitively on difficult ones. Without safeguards for both accuracy and efficiency, Ozaki scheme-based emulation can only be used by a limited set of applications or for research, and cannot replace existing matrix multiplication implementations currently used in applications that rely on hardware-native FP64 capabilities.

## Our Contributions

This work addresses these aforementioned gaps with four contributions that together make FP64 emulation both safe and efficient on modern GPUs:

- **Bit-efficient slicing.** We introduce an *unsigned integer slicing representation* that eliminates redundant sign bits, increasing mantissa utilization and improving the accuracy of Ozaki-style decomposition on INT8 Tensor Cores.
- **Exponent Span Capacity (ESC).** We describe a hardware-agnostic estimator that based on input data, conservatively determines the number of bits needed to achieve FP64 accuracy. ESC eliminates reliance on fixed slice counts and provides the theoretical basis for adaptive precision.
- **Automatic Dynamic Precision (ADP).** We develop a GPU-resident workflow built on ESC that integrates exception handling, run time heuristics, and seamless fallback to native FP64. ADP guarantees accuracy by construction while ensuring performance does not fall below that of native FP64.
- **Structured validation.** We validate ADP with recently introduced *BLAS grading tests* [7, 8], providing a reproducible and rigorous assessment of numerical behavior across adversarial inputs.

The techniques described in this paper are general and can be extended to support a variety of numerical schemes for FP64 emulation. In this work, we focus on Ozaki Scheme I [23] as a representative basis, as it is widely studied and provides a clear foundation for evaluating ESC and ADP. Extensions to other formulations, such as Ozaki Scheme II [25], are possible but are beyond the scope of this study.

To encourage reproducibility and adoption, we will release open-source implementations of ESC and our Ozaki-I-based DGEMM together with this paper.

## Results and Impact

On recent NVIDIA GPU architectures, ADP consistently achieves *componentwise accuracy indistinguishable from FP64 GEMM* (*i.e.*, DGEMM) while introducing less than 10% overhead. In favorable

cases, our approach accelerates cuBLAS’s [21] DGEMM by 2.3× and 13.2× on Blackwell GB200 and RTX Pro 6000 Blackwell Server Edition GPUs, respectively. Together, ESC and ADP establish a safe, efficient, and production-ready path for FP64 emulation on low-precision accelerators, bridging AI hardware trends with the accuracy demands of high-performance scientific computing workloads.

## 2 Background and Related Work

High-performance scientific computing has long depended on IEEE double-precision (FP64) arithmetic for accuracy, stability, and reproducibility. However, motivated by AI applications such as Large Language Models (LLMs) [17, 26], modern GPUs have increasingly prioritized low-precision matrix multiplication and accumulation (MMA) execution units called Tensor Cores for types such as FP16, BF16, INT8, FP8, and, most recently, block-scaled FP4, which deliver orders-of-magnitude higher throughput than native FP64 pipelines. This architectural imbalance has motivated a growing body of work on emulating FP64 arithmetic with low-precision units.

### 2.1 FP64 Emulation on Low-Precision Units

A major milestone in this area was the adaptation of the *Ozaki scheme*, originally proposed for reproducible summations [24], to GPUs. The scheme partitions FP64 numbers into multiple fixed-point *mantissa slices*, computes products slice-by-slice, and recomposes the results. Mukunoki et al. demonstrated that this approach could deliver FP64-accurate DGEMM on NVIDIA Tensor Cores [19]. Subsequent work showed that Ozaki-style decomposition could be generalized to higher-precision formats, including binary128 emulation [30], and adapted to GPU integer matrix multiplication units [23]. Uchino, Ozaki, and Imamura further optimized the scheme to improve efficiency, achieving superior performance-per-watt compared to native FP64 hardware [30].

### 2.2 Mixed-Precision Frameworks

In parallel, mixed-precision solvers have been proposed as an alternative path to FP64 accuracy. Iterative refinement schemes use low-precision GEMM kernels as inner solvers while correcting residuals in FP64 [11]. Domke et al. examined the practicality of such approaches in matrix engines for HPC workloads [9]. More recently, techniques such as *MixPert* [16] have demonstrated systematic ways to compose low-precision GEMM kernels into effective FP64 emulators. Surveys confirm that Tensor Core acceleration with precision refinement is becoming a mainstream strategy for sustaining FP64-level accuracy at a significantly lower energy cost [15].

### 2.3 Recent Advances and Open Challenges

Several recent contributions have refined FP64 emulation techniques. Lightweight, cache-aware refinements have been explored in LE-GEMM [33], while modular arithmetic has been used to improve slice utilization in Ozaki Scheme II [25]. Other directions include portable, hardware-agnostic estimators for precision selection [1] and INT8-based GEMM emulation for energy-sensitive scenarios [29], the acceleration of quantum chemistry through combinations of emulation and mixed precision on AI-centric GPUs [6],

and low-order orthogonal voxel finite elements with INT8 tensor cores for GPU-based elastic wave propagation analysis [14].

Together, these works show that *emulation is viable in principle*: FP64 accuracy can be reconstructed from low-precision units, and several techniques improve the efficiency of doing so. However, *critical gaps remain* for deployment in production HPC software as a default implementation. The three limitations in current approaches are:

- **Representational inefficiency.** Signed-slice encodings waste capacity by redundantly storing sign bits, forcing higher slice counts than necessary and reducing computational efficiency.
- **No universal safety guarantees.** Other methods use a fixed slice count that the user has to input and cannot ensure FP64-level accuracy under adversarial inputs or extreme exponent ranges.
- **Lack of a run time framework.** Existing methods do not provide mechanisms to detect numerical corner cases (e.g., NaN/Inf propagation), predict performance, or decide adaptively between emulation and native FP64. Without such guardrails, emulation remains risky outside of controlled benchmarks.

These open challenges explain why Ozaki-style emulation has not yet been adopted in production numerical libraries. Bridging this gap requires both a theoretical foundation for safe precision selection and a practical system that integrates safety checks, performance heuristics, and fallbacks into a GPU-resident workflow. This motivation underpins our contributions in Sections 3, 4, and 5.

### 3 Unsigned Slice Encoding

When decomposing FP64 matrices into integer slices for emulation on INT8 Tensor Cores, a naïve approach is to store every slice as signed 8-bit integers (s8). This introduces an inefficiency: each sub-leading slice redundantly encodes a sign bit, even though the sign of the overall number is already captured in the most significant slice. As a result, each sub-leading slice contributes only 7 additional bits of effective mantissa precision, rather than the full 8 bits. To achieve a target precision equivalent to FP64’s 53-bit mantissa, this approach requires eight slices (i.e. 64 bit), increasing both storage and compute costs.

In principle, the sign need only be stored in the leading slice, with subsequent slices encoded as unsigned 8-bit integers (u8). To compute the leading signed slice, round-to-negative-infinity rounding is used, ensuring that the remainder is positive for representations using unsigned integer slices. Under such a scheme, the FP64 mantissa could be represented in just seven slices rather than eight—a 22% reduction in compute overhead. This optimization is feasible, since NVIDIA’s integer Tensor Cores natively support mixed signed-unsigned arithmetic (s8 × u8). We prototyped this approach and confirmed its correctness.

However, our production implementation adopts an alternative method that leverages properties of two’s complement arithmetic to achieve the same effect while avoiding the need for mixed signed-unsigned paths. The key insight is that we encode the full range of a u8 slice by redistributing values across multiple s8 slices:

- If the original u8 slice lies in [0, 127], it is stored directly in s8.
- If the u8 slice lies in [128, 255], it is re-expressed as  $256 - x$ , where  $x \in [0, 127]$ . The higher-order slice is incremented by 256, while the lower slice stores  $-x$  in s8.

Two’s complement ensures that this remapping preserves bitwise compatibility: the negative s8 value  $-x$  has the same bit-string representation as the original u8 value. For example, a mantissa contribution of

$$123 \times 256 + 200 \text{ (u8)}$$

can be equivalently represented as

$$124 \times 256 - 56 \text{ (s8),}$$

where both 200 (u8) and  $-56$  (s8) share the bit pattern 0b11001000 (see Figure 1 for an illustration of this approach).

This unsigned slice encoding eliminates wasted sign bits, reduces slice count, and avoids specialized hardware paths, enabling a more efficient use of integer Tensor Cores for high-precision emulation.



**Figure 1: Unsigned slice encoding using two’s complement arithmetic.** Case 1: u8 values in [0,127] map directly to s8 without modification. Case 2: u8 values in [128,255] are remapped as  $256 - x$  with a +256 carry to the higher slice, while storing  $-x$  in s8. Bit patterns are preserved, e.g., 200 (u8)  $\equiv$   $-56$  (s8) = 0b11001000.

### 4 ESC: Exponent Span Capacity Estimation

When utilizing the Ozaki scheme, the number of bits needed to maintain the desired accuracy is dependent on the numerical contents of the constituent matrices [31]. However, to provide users of emulated DGEMM with a familiar interface, it is necessary for the software to analyze the matrices and determine the number of bits needed in the intermediate representation. We will refer to this as computing the Exponent Span Capacity (ESC).

As described in Section 3, our FP64 emulation is currently based on the use of integer-based (fixed-point) compute. Because there is no exponent in this representation, when converting from FP64 to a multi-slice INT8 representation, a substitute for exponent functionality is required. This involves shifting the mantissa bits in an extended storage format instead of altering an exponent field, shifting left for a greater exponent or right for one that is lesser. For reasons of efficiency, the space to do this must be set aside at the outset of the matrix multiplication algorithm and must be uniform for the entire matrix. Assume, for example, that we are to preserve

53 mantissa bits during conversion and the greatest exponent field in a given row of the  $A$  matrix is 00001100400 (dec. 100). For an entry in that row whose exponent is  $100 - p$ , we would set aside  $p$  extra bits in the INT8 intermediate format. The dynamic range of FP64 implies a potential storage requirement of approximately 2100 bits (the  $2^{11}$  exponent bit range and the 53 additional bits needed to represent an IEEE FP64 mantissa with the implicit bit made explicit). However, using such a fixed representation would entail a prohibitive amount of both storage and emulative compute, as the Ozaki-I scheme's compute requirements are quadratic in the number of slices used. However, it is necessary to determine how many extra "padding bits" are needed to compute with the same precision used by IEEE FP64, without using excess resources.

The formula used to compute the ESC is deterministic, reproducible, and easy to understand, both intuitively and formally. Since matrix multiplication,  $C_{m \times n} = A_{m \times k} B_{k \times n}$ , is an unordered set of dot products, we will isolate and examine the ESC for a single dot product,  $x^T \cdot y$ , where  $x^T$  is a row vector of  $k$  FP64 values and  $y$  is a column vector of  $k$  FP64 values.

The ESC for matrix multiplication is the maximum ESC of the  $mn$  component dot products. For purposes of exposition, it is helpful to view the dot product evaluation as a two-step process, though these operations are fused in practice. These steps are:

- (1) Multiply each  $x_i \odot y_i \rightarrow z_i$ , the Hadamard product
- (2) Reduce all  $z_i \rightarrow s$ , a scalar value ( $s = \sum_{i=1}^k z_i$ )

In computing the ESC, the mantissa field is ignored, so all values with the same exponent are viewed as equivalent and the elements of  $z$  are the result of adding the exponents of the elements in  $x$  and  $y$ , not element-wise multiplication. We will discuss why this is safe to do after the formula is presented. There are five designated values of interest:

- (1)  $x_p$ , entry with the maximum exponent in  $x$
- (2)  $y_q$ , entry with the maximum exponent in  $y$
- (3)  $z_r$ , entry with the maximum exponent in  $z$
- (4)  $x_r$ , entry in  $x$ , giving rise to  $z_r$
- (5)  $y_r$ , entry in  $y$ , giving rise to  $z_r$

Note that  $p$ ,  $q$ , and  $r$  may not be unique, but the exponents of  $x_p$ ,  $y_q$ , and  $z_r$  are each sets with a single value.

The ESC is calculated as  $\text{ESC} = \exp(x_p) + \exp(y_q) - \exp(z_r)$ , where  $\exp(z_r) = \exp(x_r) + \exp(y_r)$  and  $\exp(\cdot)$  denotes the exponent component of  $\cdot$ .

The algorithm described above is for a single dot product in a matrix multiplication composed of  $mn$  dot products of length  $k$ . For performance reasons, it is highly desirable to reduce the time to compute the ESC without sacrificing the safety it provides. This is done by computing an estimate of  $z_r$  as follows. Both the  $x$  and  $y$  vectors are coarsened, broken into blocks of length  $b$ , where the greatest and least exponent in each block,  $\text{Max}(b_i)$  and  $\text{Min}(b_i)$  are stored as representatives of that block. Then,  $z_r$  is approximated by finding the maximal value, over all blocks,  $i$ , of  $\text{Max}(xb_i) + \text{Min}(yb_i)$  and  $\text{Min}(xb_i) + \text{Max}(yb_i)$ . This may underestimate  $z_r$ , resulting in a larger ESC value than the non-coarsened, but computationally impractical, version of the algorithm. However, the primary goal is safety and it is straightforward to demonstrate that the coarsened algorithm cannot overestimate the exponent of  $z_r$ , derived from the non-coarsened algorithm, as follows.

Assume, for some block index,  $i$ ,  $z_r < \text{Max}(xb_i) + \text{Min}(yb_i)$ , as the other case is symmetric. Let  $\exp(x_j) = \text{Max}(xb_i)$  for one or more  $j$  in the span of  $xb_i$ . For all such  $j$ ,  $\exp(x_j) + \text{Min}(yb_i)$  is greater than  $\exp(x_j) + \exp(y_j)$  by the assumption that the non-coarsened ESC,  $z_r$ , is less than the coarsened ESC. Also,  $\text{Min}(yb_i) \leq \exp(y_j)$  by definition. Therefore,  $\exp(x_j) + \text{Min}(yb_i) \leq \exp(x_j) + \exp(y_j)$  and  $\exp(x_j) + \exp(y_j) \leq z_r$  by the definition of  $z_r$ . Given that  $\exp(x_j) = \text{Max}(xb_i)$ , by transitivity,  $\text{Max}(xb_i) + \text{Min}(yb_i) \leq z_r$ , contradicting the assumption.

This formulation supplies us with some valuable guardrails. First, it gives us a guarantee that the maximal (in terms of exponent magnitude) contribution to the dot product is captured with full fidelity. That is, the number of bits dictated by the user (the default being 53 bits), are all used in the calculation of  $z_r$ . Put another way, all of the mantissa bits of  $x_r$  and  $y_r$  will be extracted from the corresponding FP64 input values, placed in fixed-point components, and used to calculate  $z_r$  using the Ozaki scheme. This is because we will reserve enough fixed-point slices to accommodate the requisite mantissa bits as well as all of the placeholder bits needed for shifting these representations, as described above. Notice that this applies to all such contributions to the dot product. As  $r$  is not necessarily unique,  $\exp(z_r) = F$ , can arise from multiple  $\exp(x_r)$  and  $\exp(y_r)$  values. This formula gives the "full fidelity guarantee" for all combinations of these exponents, where each summand is the same (maximal) value,  $F$ , and simultaneously removes any need to track the indices,  $r$ . Since the resultant emulation utilizes all source mantissa bits for even the most asymmetric pairings, it also captures the "funnel" of values around  $F$  (i.e. the method will discard no more than  $j$  bits when the Hadamard exponent is  $F - j$ ). This is also the reason why the mantissa field can be ignored, as regards  $z_r$ . Multiplying the mantissa values of two scalars can result in an exponent increase of 1 (the product of the mantissas is always less than 4.0), we increment the ESC value by 1 to guarantee the margin of safety.

Assuming that all values are of the same sign, this approach yields the same precision characteristics as IEEE FP64. However, the cancellation characteristics are different. Emulation results are invariant with respect to the ordering of summed values, while those utilizing IEEE FP64 arithmetic are not. The strength of native FP64 in this regard rests in the dynamic exponent. There are instances where the shifting of the exponent, due to either gradual or dramatic cancellation, provides an advantage. In other cases, the extended precision enabled by the extra bits utilized in the intermediate form of the summand will lead to improved accuracy.

## 5 ADP: Automatic Dynamic Precision

While emulation provides a path to recover FP64 accuracy on low-precision Tensor Cores, its practicality depends on integrating precision estimation, numerical safety checks, and performance-aware run time decisions. In this section we describe our fully GPU-resident workflow, which leverages the Exponent Span Capacity (ESC) estimator together with lightweight pre-processing kernels to enable seamless run time integration.

## 5.1 Pre-processing and Safety Checks

The implementation described in this paper adheres to FP64 levels of accuracy for both normal and denormalized values, but does not strictly adhere to the IEEE 754 standard in its treatment of signed zeros, Infs, and NaNs. Negative zeros in the input are simply treated as a zero, since our implementation does not handle signed zeros. Our handling of emergent NaNs and Infs is similar to IEEE 754 and is described below. However, a key limitation of Ozaki-style emulation is its inability to handle input that includes special values such as Inf and NaN. If left unchecked, these values propagate incorrectly through slice decompositions, leading to undefined behavior. To ensure numerical safety, we scan matrices  $A$  and  $B$  prior to multiplication. If either scan detects Inf or NaN, the workflow falls back to native FP64 GEMM, guaranteeing correctness. In practice, this scanning occurs while preparing for the coarsened ESC calculation and fallback can be incurred before any  $O(n^3)$  algorithms are executed. The separate treatment of the exponent in the Ozaki method leads to emergent signed Infs that reflect the Inf of greater magnitude and does not produce emergent NaNs in all cases where FP64 would do so. While it is possible to anticipate (imperfectly) the possibility of emergent NaNs and Infs, possibly invoking FP64 fallback in such cases, we do not do this. The Ozaki algorithm can give rise to +Inf and -Inf when the result of the dot product is converted back to an FP64 representation, but Infs are not “sticky,” as this is the terminal step in our emulation algorithm. The benefit to this is that accumulated values may be brought back into the normal range of FP64 values. Though the Ozaki method itself does not address emergent NaNs, the necessity to aggregate partial results, so as to avoid overflowing accumulators, can give rise to sticky Infs and NaNs if that intermediate aggregation includes conversion to FP64 values. Our implementation will generate Infs when they are necessary, though not in all of the cases where an FP64-based computation would do so, and NaNs in a subset of the instances where FP64 would generate NaNs, given the disparate generation of emergent Infs.

## 5.2 Exponent Span Estimation

The exponent span estimator begins by pre-processing the  $A$  and  $B$  matrices to find the max and min exponent per block for our coarsened Exponent Span Capacity. After the inputs have been preprocessed, a dedicated kernel implements the coarsened exponent span capacity. The keen observer may notice that this  $O(n^3)$  algorithm and is reminiscent of a GEMM. We extend the CUTLASS library [4] and leverage dynamic programming (DPX) instructions [2] available on Hopper and newer GPUs to significantly accelerate this process. Thanks to these hardware features and safety guarantees, we are able to have an accurate *and* cost-effective method to estimate the exponent span capacity.

## 5.3 Heuristic run time Selection

After ESC reports the required bit width, a lightweight heuristic evaluates whether emulation is expected to outperform native FP64. If the slice count remains within a performance-efficient range, the workflow dispatches the emulated GEMM kernel. Otherwise, native FP64 GEMM is launched instead. This mechanism ensures that

emulation is only applied when beneficial, avoiding performance regressions for difficult cases with large exponent spans.

## 5.4 GPU-Resident Integration

Crucially, all of the above steps—safety scans, ESC estimation, and heuristic selection—are implemented as CUDA kernels running entirely on the GPU. No host-device synchronization is required, allowing our approach to integrate transparently with existing GPU libraries such as cuBLAS. The result is a run time framework that provides FP64-accurate GEMM with dynamic precision tuning and negligible decision overhead, while ensuring correctness in the presence of exceptional values.

## 6 Numerical Accuracy

This section evaluates the numerical accuracy of our implementation of ADP-enabled DGEMM in the cuBLAS library. We demonstrate that our implementation with guardrails enabled indistinguishably behaves like a native floating-point implementation.

It is well understood that different implementations of matrix multiplication satisfy different accuracy bounds. In response to the appearance of fixed point implementations [22], Demmel et al. [7, 8] devised tests that detect the type of the underlying matrix multiplication algorithm.

They considered two dimensions. First, an implementation can either be  $O(n^3)$  or Strassen-like. Second, it can use either floating-point or fixed point arithmetic. The following tests, when executed as a tree, cover the four possible combinations.

- **Test 1:** Distinguish a conventional  $O(n^3)$  implementation from a Strassen-like implementation.
- If Test 1 detects an  $O(n^3)$  implementation, use **Test 2:** Distinguish a  $O(n^3)$  floating-point implementation from an  $O(n^3)$  fixed point implementation.
- If Test 1 detects a Strassen-like implementation, use **Test 3:** Distinguish a Strassen-like floating-point implementation from a Strassen-like fixed point implementation.

The tests rely on differences in the computed results to determine the type of algorithm, and not on (asymptotic) run time measurements.

In addition to the algorithm discovery, they summarized accuracy criteria for further assigning a grade to a matrix multiplication implementation. We evaluate the grade A criterion, which measures the componentwise relative error.

Following the work by Demmel et al. [8], we demonstrate the following two aspects through numerical experiments.

- A1 Test 2 cannot distinguish the Ozaki based DGEMM from an  $O(n^3)$  floating-point implementation if it has the guardrails enabled and may fall back to native FP64 DGEMM. This is the default behavior.
- A2 Ozaki based DGEMM with automatic fallback meets the grade A criterion and attains an accuracy similar to native  $O(n^3)$  FP64 DGEMM implementations provided in highly optimized libraries.

*Aspect A1.* Test 2 is most important to our evaluation since the Ozaki based DGEMM is an  $O(n^3)$  implementation (and not Strassen-like). Test 2 [7] exploits the fact that a fixed point implementation



**Figure 2: ADP-enabled DGEMM, configured with six distinct mantissa bit counts for the Ozaki-I algorithm, on Test 2, where  $n = 1024$ . For each mantissa bit count, we display the variant without the option to fall back to native FP64 DGEMM (solid lines) and the variant with guardrails and automatic fallback to native FP64 DGEMM (dashed lines).**

can lose accuracy when not all necessary bits are covered. For that reason, the test chooses the input matrices to have a wide exponent span. As a result, fixed point implementations that use a predetermined, fixed number of bits likely do not cover all bits required for an accurate computation.

Specifically, the matrices  $\mathbf{A} \in \mathbb{R}^{n \times n}$  and  $\mathbf{B} \in \mathbb{R}^{n \times n}$  are constructed starting from a vector with entries uniformly distributed in (1,2)

$$\mathbf{x} \sim \mathcal{U}(1, 2)^{n \times 1}.$$

The exponent range is determined by the diagonal matrix

$$\mathbf{D} = \text{diag}(2^{j_1}, \dots, 2^{j_n}),$$

where the exponents of the diagonal entries are chosen as integers

$$j_{i+1} = -b + \text{round}(i\Delta).$$

Demmel et al. [7] select  $b$  and  $\Delta = 2b/(n-1)$  such that  $-j_1 = j_n$  and  $b \approx \lfloor \log_2(\sqrt{\Omega}) \rfloor - \lceil \log_2(n) \rceil - 1$ , where  $\Omega$  is the overflow threshold. We generalize Test 2 and treat  $b$  as a parameter that controls the exponent range.

Then, the  $k$ th row of  $\mathbf{A}$  and the  $k$ th column of  $\mathbf{B}$  are computed as

$$\mathbf{A}_{k,:} = \mathbf{x}^T \mathbf{D} \mathbf{P}_k, \quad \mathbf{B}_{:,k} = \mathbf{P}_k^{-1} \mathbf{D}^{-1} \mathbf{x}.$$

$\mathbf{P}_k$  is the permutation matrix that realizes a cyclic row shift of  $\mathbf{A}$  by  $k$  entries. The permutations ensure that implementations cannot game the test by strategically rescaling the input matrices.

Figure 2 shows the results of ADP-enabled DGEMM with and without guardrails and automatic fallback, respectively, on Test 2 for increasing values of  $b$ . The relative error is computed as

$$\max_{ij} e_{ij}, \quad e_{ij} = \begin{cases} \frac{|\mathbf{x}^T \mathbf{x} - c_{ii}|}{|\mathbf{x}^T \mathbf{x}|}, & \text{if } i = j \\ \frac{|c_{ij}^{\text{ref}} - c_{ij}|}{|c_{ij}^{\text{ref}}|}, & \text{otherwise.} \end{cases}$$

For that, we use a reference  $O(n^3)$  floating-point implementation of GEMM to compute  $\mathbf{C}^{\text{ref}} = \mathbf{AB}$ . In addition, we compute the diagonal entries through  $\mathbf{x}^T \mathbf{x}$  using long double (FP80).

Our implementation of emulated DGEMM based on Ozaki-I scheme, by default, falls back to native FP64 DGEMM if emulation is infeasible. For Test 2, this occurs when the ESC exceeds the available number of bits. For this type of matrix, the ESC with and without coarsening computes the required number of mantissa bits accurately without overestimation. Without guardrails, the emulated DGEMM reaches a high relative error if  $b$  is sufficiently large and the available mantissa bits cannot represent all bits required for an accurate computation and, hence, fails Test 2. As such, Test 2 as designed by Demmel et al [7, 8] is reliable. With guardrails and fallback option, the emulated DGEMM attains a small relative error comparable to what can be expected from a standard  $O(n^3)$  floating-point GEMM.

**Aspect A2.** Implementations of GEMM can be evaluated according to different accuracy criteria. Demmel et al. [8] established a framework that defines three grades. Grade C is the weakest grade and is defined as a norm-wise bound that can be satisfied by Strassen-like algorithms. Grade B is a mixed componentwise and norm-wise bound, which is invariant under certain diagonal scalings. Grade A represents the most stringent criterion. It considers the componentwise relative error bound

$$|\text{fl}(\mathbf{AB})_{ij} - (\mathbf{AB})_{ij}| \leq f(n)\epsilon(|\mathbf{A}||\mathbf{B}|)_{ij}, \quad (1)$$

where  $f(n)$  measures how the error accumulates. Inequality (1) is considered the gold standard for the evaluation of numerical accuracy and is used by Higham [13, Eq. (3.13)] and reference (“netlib”) LAPACK [3].

The function  $f(n)$  in inequality (1) is critical for the assessment of how accurate a GEMM implementation is. For grade A compliance,  $f(n)$  must not exceed linear growth. Miller [18] demonstrates that enforcing strong numerical stability in matrix multiplication – defined as bounded error growth relative to input size – necessitates  $O(n^3)$  computational complexity, thereby excluding faster algorithm such as Strassen’s from satisfying such stability guarantees.

In our numerical experiments, we compare the accuracy of emulated DGEMM, native FP64 DGEMM, and a floating-point Strassen matrix multiplication. Emulated DGEMM is configured to use up to 200 mantissa bits. We have validated that the emulated DGEMM never falls back to native FP64 DGEMM in any of the experiments. The floating-point Strassen implementation is a simple reference implementation that we include for comparison purposes. The experiment multiplies two square matrices whose entries are uniformly distributed in (0,1). Each routine is executed with five distinct seeds. Figure 3 shows the maximum componentwise relative error. Emulated DGEMM consistently meets the Grade A criterion across all matrix sizes, with the maximum componentwise relative error growing well below the maximum allowed slope. The floating-point Strassen implementation, as expected, exceeds the maximum slope allowed for Grade A. Native FP64 DGEMM changes its error accumulation behavior at  $n = 4096$ , which indicates an algorithm switch. For large matrices, the emulated DGEMM and native FP64 DGEMM show comparable error accumulation behavior.



**Figure 3: Maximum componentwise relative error when multiplying two random uniformly distributed matrices.**



**Figure 4: Average componentwise relative error when multiplying two random uniformly distributed matrices.**

Figure 4 shows the average componentwise relative error. The error pattern of emulated DGEMM closely follows the theoretically expected growth proportional to  $\sqrt{n}$  for random uniformly distributed matrices. For larger matrices, native FP64 DGEMM follows the same expected pattern. This experiment supports the claim that our implementation of Ozaki DGEMM accumulates errors in a manner similar to native double-precision arithmetic and as dictated by theoretical error analysis.

## 7 Performance Evaluation

We evaluate ADP-enabled DGEMM on recent NVIDIA architectures to assess both its efficiency and deployability. The experiments cover microbenchmarks, end-to-end DGEMM throughput, and integration into NVIDIA’s cuSOLVER [20] library. The results highlight

that ADP’s run time guardrails incur modest overhead while preserving most of the performance gains of emulation, and that these benefits extend transparently to HPC workloads.

### 7.1 Performance Breakdown

Figure 5 reports the breakdown of DGEMM performance when ADP is configured to emulate 55 mantissa bits. In this experiment, ADP is forced to always use 55 bits, regardless of input characteristics, in order to expose its overhead under the most challenging conditions.

This setup deliberately places ADP at its largest disadvantage. The cost of ADP’s guardrails (fallback checks, exception handling, heuristics) is essentially constant, while the cost of slicing and low-precision multiplications grows with the number of mantissa bits. By fixing the precision at 55 bits, the relative share of ADP overhead is maximized. For smaller mantissa counts, the total computation shrinks but ADP’s cost remains flat, so its relative contribution is higher.

This experiment confirms that even in this worst-case configuration, ADP adds less than 10% overhead to the total DGEMM run time. The majority of execution time is spent in integer matrix multiplications and slice recomposition, while ADP’s run time checks account for only a small and predictable fraction of cost.

Overall, Figure 5 demonstrates that ADP overhead is modest and bounded: even when forced to operate at 55 mantissa bits, where its relative impact is maximized, ADP consumes less than 10% of run time. In more typical cases, its cost is smaller still, making ADP practical as a default safeguard in FP64 emulation.

### 7.2 Performance

Figure 6 summarizes end-to-end GEMM speedups over cuBLAS’s native DGEMM for two GPU NVIDIA platforms: GB200 (top row) and RTX Pro 6000 Blackwell Server Edition (bottom row). Each row presents two values: (left) emulated DGEMM without ADP and (right) emulated DGEMM with ADP forced to use 55 mantissa bits.

The left plot establishes the performance upper bound of this experiment: emulated DGEMM with 55 mantissa bits and no guardrails achieves the highest throughput, but does not provide any safety guarantees. The right chart shows the same emulation with ADP enabled. Here, performance is nearly identical to the no-ADP baseline, with only a small difference accounting for run time guardrails. This confirms that ADP preserves almost all of the performance benefit of emulation while adding accuracy guarantees.

The bottom row shows analogous results on the RTX Pro 6000 Blackwell Server Edition. Again, the left plot (55-bit emulation without ADP) represents the performance ceiling, and the right chart shows that ADP closely matches this ceiling while ensuring safe deployment. Across both architectures, the additional cost of ADP remains well below 10% of total run time, even under the forced 55-bit configuration where its overhead is maximized.

Taken together, Figure 6 demonstrates that ADP is practical and deployable: it enables emulated DGEMM to outperform native FP64 by up to 2.3× and 13.2× speedups over native FP64 GEMM on NVIDIA Blackwell GB200 and the RTX Pro 6000 Blackwell Server Edition, respectively, while guaranteeing accuracy.

### 7.3 Application-Level Integration

To evaluate the practicality of ADP beyond synthetic DGEMM benchmarks, we integrated our framework into NVIDIA’s cuSOLVER library, and specifically the `cusolverDnGeqr` routine. The QR factorization in cuSOLVER is implemented using the compact *WY* representation of block Householder reflectors [27], as outlined in Algorithm 1. Lines 6–8 correspond to the trailing matrix update—the BLAS3 component that dominates run time for medium and large problem sizes—which we redirected to use emulated DGEMM with ADP guardrails.

**Algorithm 1** Blocked Householder-QR factorization

**Input:**  $A \in \mathbb{R}^{m \times n}$ , assume  $\min(m, n)$  is an integer multiple of the panel width  $b$

**Outputs:** Matrix of Householder reflectors  $Y \in \mathbb{R}^{m \times n}$ , upper triangular matrix  $R \in \mathbb{R}^{m \times n}$ , both overwriting  $A$

```

1: Partition  $A \rightarrow [A_1 \ A_s]$  where  $A_1$  is  $m \times b$ 
2: for  $i \leftarrow 1, \dots, \min(m, n)/b$  do
3:   Factor  $A_i = Q_i R_i$  such that  $Q_i = I - Y_i T_i Y_i^T$ 
4:   Store  $A_i \leftarrow \begin{bmatrix} R_i \\ \diagdown \\ Y_i \end{bmatrix}$ 
5:   Update  $A_s$  by performing  $Q_i^T A_s$  as follows:
6:      $W \leftarrow Y_i^T A_s$                                 ▷ GEMM
7:      $W \leftarrow T_i^T W$                                 ▷ GEMM for efficiency
8:      $A_s \leftarrow A_s - Y_i W$                           ▷ GEMM
9:   Partition  $A_s \rightarrow \begin{bmatrix} R_{i,2} & R_{i,3} \\ A_{i+1} & A_s \end{bmatrix}$  where  $A_{i+1}$  is  $m - ib \times b$ 
10:  end for
```

Figure 7 presents results for random input matrices across a range of sizes on RTX Pro 6000 Blackwell Server Edition. The left chart reports speedups relative to native FP64 for two configurations: (i) fixed emulation with 55 mantissa bits and no ADP (upper-bound performance), and (ii) ADP in dynamic mode, where the framework selects the number of mantissa bits at run time. The right chart shows the distribution of slice counts that ADP chooses across all matrix-matrix multiplications in the factorization.

Several observations emerge. First, ADP achieves end-to-end speedups of up to  $3.7\times$  compared to native FP64, while still delivering accuracy comparable to FP64 in the overall factorization. Second, although the fixed 55-bit emulation achieves slightly higher speedups, its residuals degrade gradually with matrix size, whereas ADP consistently matches the accuracy of native FP64 across all problem sizes. Third, the performance gap between the fixed 55-bit ceiling and ADP is modest, demonstrating that the run time guardrails do not significantly reduce throughput. Fourth, the slice distribution in the right chart highlights how ADP adapts dynamically: most GEMMs require only a small number of slices (e.g., 8–9), while higher slice counts are used only rarely when numerical fidelity demands it. Finally, for very small problem sizes, ADP recognizes that the overhead of emulation outweighs its benefits and therefore falls back to native FP64, ensuring that performance never drops below the baseline. Together, these results show that ADP’s heuristics are both conservative enough to guarantee accuracy and efficient enough to avoid unnecessary cost.



**Figure 5: Breakdown of DGEMM performance when emulating 55 mantissa bits on NVIDIA Tensor Cores.** For this experiment, ADP is forced to always use 55 bits, regardless of input characteristics, in order to maximize its relative overhead (worst-case configuration).

It is worth noting that ADP introduces two types of overhead. The first is fixed: run time checks and guardrails that apply to every GEMM call. The second is dynamic: in cases where ADP decides to use more slices, the extra slicing and integer multiplications proportionally increase cost. Despite these factors, the net impact remains favorable: ADP consistently delivers large end-to-end speedups while ensuring FP64-level reliability under all inputs.

Overall, this experiment demonstrates that ADP is not only effective in isolated GEMM benchmarks but also deployable in production solvers. Its integration into cuSOLVER shows that ADP can serve as a transparent and safe drop-in replacement for native FP64 in real HPC workflows.

## 8 Discussion

The evaluation results confirm that FP64 emulation guided by ESC and realized through ADP is both safe and efficient. Taken together, our findings show that Ozaki-style schemes—once confined to research prototypes—can now be deployed as practical building blocks of production HPC libraries. We highlight three main outcomes.

### 8.1 Safety through ADP Guardrails

A central contribution of this work is demonstrating that ADP can serve as a lightweight, GPU-resident guardrail for emulation. By analyzing exponent spans and detecting exceptional values, ESC determines the minimal number of slices needed to guarantee FP64 accuracy. If the requirement exceeds what is practical or if it encounters any corner cases, ADP falls back automatically to native FP64. This ensures that users and applications are never exposed to silent accuracy loss. The result is that ADP-enabled DGEMM can be adopted as a drop-in replacement without requiring application developers to reason about slice counts or precision tuning.

### 8.2 Efficiency in Practice

Despite introducing guardrails, ADP incurs modest overhead: less than 10% of run time even under worst-case configurations, and smaller still in typical use cases. Our results show speedups of up to  $2.3\times$  on GB200 and  $13.2\times$  on RTX Pro 6000 Blackwell Server Edition over native FP64, respectively.



**Figure 6: End-to-end DGEMM performance comparison on two GPU platforms: (top) GB200 and (bottom) RTX Pro 6000 Blackwell Server Edition. Each row shows speedups over cuBLAS’s native DGEMM for two cases: (left) emulated DGEMM with 55 mantissa bits and no ADP (upper bound on performance, but no safety guarantees) and (right) emulated DGEMM forced to use 55 mantissa bits.**

These results demonstrate that HPC workloads can be accelerated safely with minimal friction: existing solvers can benefit from ADP without substantial code modifications. As a case study, we modified the QR factorization routines in cuSOLVER so that their trailing matrix updates were dispatched to ADP-enabled GEMM. This shows how ADP can be integrated into production solvers transparently, while delivering both performance gains and accuracy guarantees. More importantly, once ADP-enabled GEMM is provided in cuBLAS, higher-level libraries such as cuSOLVER, MAGMA [12], and PETSc [5] can inherit these benefits automatically by linkage. This productization of ADP-enabled DGEMM into cuBLAS underscores the confidence in the method: what began as a research technique is now positioned as a core technology of NVIDIA’s math libraries.

### 8.3 Implications for HPC and AI Hardware

Our results also carry broader implications. On the software side, they suggest a path toward mainstreaming emulated FP64 as a first-class execution mode in GPU libraries, rather than a research-only alternative. On the hardware side, they illustrate that existing low-precision accelerators can be made trustworthy for double-precision matrix multiplication oriented workloads, and point toward future designs where precision modes are flexibly exposed and managed at run time by guardrails such as ESC.

### 8.4 Limitations and Future Directions

While ADP ensures correctness and efficiency, the conservative nature of ESC means that its bounds are not always tight. In some cases, this leads to overestimation of the number of mantissa bits



**Figure 7: Application-level evaluation of ADP in QR factorization on RTX Pro 6000 Blackwell Server Edition.** We modified `cusolverDnGeqr` so that the trailing matrix updates are performed with emulated DGEMM. (Left) End-to-end speedups relative to native FP64 for two configurations: fixed emulation at 55 mantissa bits with no ADP (performance upper bound) and ADP in dynamic mode, where the number of mantissa bits is selected at run time. The absolute residual of the factorization is shown. (Right) Distribution of slice counts chosen by ADP across all GEMMs in the factorization.



**Figure 8: Flowchart for GEMM decision process.**

required. The consequence is twofold: first, ADP may incur additional operations by using more slices than strictly necessary, and second, in extreme cases, the estimate may exceed what is practical, triggering a fallback to native FP64. Both outcomes preserve correctness but can reduce performance. Developing tighter bounds and refined run time heuristics to mitigate overestimation remains an important avenue for future work.

Overall, this discussion underscores that ADP bridges the gap between the hardware reality of low-precision accelerators and the software demand for reliable FP64. It offers a safe, efficient, and transparent foundation for the next generation of numerical libraries and HPC applications.

## 9 Conclusion

Modern GPUs deliver their highest performance through low-precision Tensor Cores, yet much of scientific computing continues to demand FP64 accuracy. Bridging this gap requires techniques that are not only fast, but also safe and practical for production use.

This work introduced the *Exponent Span Capacity (ESC)* estimator and the *Automatic Dynamic Precision (ADP)* framework as a path to deployable FP64 emulation for matrix multiplication. ESC provides a hardware-agnostic method for conservatively estimating the number of slices needed to guarantee accuracy, while ADP builds on ESC to integrate run time heuristics, exception handling, and automatic fallback to native FP64. Together, they ensure that emulated GEMM is both efficient and numerically reliable. The accuracy experiments demonstrate that our Ozaki-I DGEMM implementation behaves in every aspect like standard native DGEMM. We further improved Ozaki-style decomposition with an unsigned slicing representation that increases bit utilization, reducing slice counts and improving throughput.

Our evaluation shows that ADP overhead remains below 10% even under worst-case configurations, and that emulated GEMM achieves up to 2.3 $\times$  speedups on GB200 and 13.2 $\times$  on RTX Pro 6000 Blackwell Server Edition compared to native FP64. Application-level experiments with QR factorization demonstrate that ADP-enabled GEMM implementations can be integrated into HPC workloads with minimal friction, achieving up to 3.7 $\times$  end-to-end speedups while maintaining accuracy on par with FP64. Importantly, once ADP-enabled GEMM is adopted in cuBLAS, higher-level libraries and frameworks will benefit automatically through linkage, seamlessly into scientific computing applications.

While ADP guarantees correctness, the conservative nature of ESC means that its bounds are not always tight. Overestimation of required slices can lead to unnecessary overhead or, in extreme cases, fallback to native FP64. Both outcomes preserve accuracy but reduce performance. Tightening ESC's estimates and refining heuristics to minimize overestimation remain important directions for future work. Extending ADP beyond dense GEMM also represents a promising path forward.

Although our evaluation focused on Ozaki Scheme I, the techniques introduced in this paper are general and can be extended to other emulation schemes such as Ozaki Scheme II, underscoring the broader applicability of ESC and ADP beyond the specific instantiation studied here. Additionally, it is straightforward to extend the emulation of DGEMM, including the APD framework, to ZGEMM via the 4M method [32].

Implementations of ESC and our Ozaki-I emulation framework will be made available as open source, enabling the community to build upon this work and integrate these techniques into HPC libraries and applications.

In conclusion, ADP and ESC transform FP64 matrix multiplication emulation from a research prototype into a safe deployable library implementation. They enable low-precision accelerators—originally designed for AI—to serve as trustworthy building blocks for high-fidelity HPC workloads. This work demonstrates that scientific computing can reap the performance benefits of AI-driven hardware without compromising numerical accuracy.

## References

- [1] Ahmad Abdelfattah, Jack Dongarra, Massimiliano Fasi, Mantas Mikaitis, and Fran oise Tisseur. 2025. Analysis of Floating-Point Matrix Multiplication Computed via Integer Arithmetic. arXiv:2506.11277 [math.NA] <https://arxiv.org/abs/2506.11277>
- [2] Joe Eaton Ajay Tirumala and Matt Tyrlik. 2022. Boosting Dynamic Programming Performance Using NVIDIA Hopper GPU DPX Instructions. <https://developer.nvidia.com/blog/boosting-dynamic-programming-performance-using-nvidia-hopper-gpu-dpx-instructions/>
- [3] E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, Jack J. Dongarra, J. Du Croz, S. Hammarling, A. Greenbaum, A. McKenney, and D. Sorensen. 1999. *LAPACK Users' guide (third ed.)*. Society for Industrial and Applied Mathematics, USA.
- [4] Julien Demouth Andrew Kerr, Duane Merrill and John Tran. 2017. CUTLASS: Fast Linear Algebra in CUDA C++. <https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/>
- [5] S. Balay, S. Abhyankar, M. F. Adams, S. Benson, J. Brown, P. Brune, K. Buschelman, E. M. Constantinescu, L. Dalcin, A. Dener, et al. 2024. *PETSc/TAO Users Manual V.3.21*. Technical Report. Argonne National Laboratory (ANL), Argonne, IL (United States). doi:10.2172/2337606
- [6] William Dawson, Katsuhisa Ozaki, Jens Domke, and Takahito Nakajima. 2024. Reducing Numerical Precision Requirements in Quantum Chemistry Calculations. *Journal of Chemical Theory and Computation* 20, 24 (2024), 10826–10837. doi:10.1021/acs.jctc.4c00938
- [7] Jim Demmel et al. 2025. More aggressive (sparse) BLAS testing, to identify aggressive optimizations. Private communication. Unpublished manuscript, referenced with author approval. Citation details will be updated once published.
- [8] Jim Demmel, Xiaoye Li, Julien Langou, Wesley Pereira, Mark Gates, and Cindy Rubio Gonzalez. 2024. How to grade the accuracy of an implementation of the BLAS. Presentation at BLIS Retreat. [https://www.cs.utexas.edu/~flame/BLISRetreat2024/slides/Grading\\_BLAS.pdf](https://www.cs.utexas.edu/~flame/BLISRetreat2024/slides/Grading_BLAS.pdf)
- [9] Jens Domke, Rio Yokota, and Kengo Ozaki. 2021. Matrix engines for HPC: promise vs reality. In *Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER)*. 64–75. doi:10.1109/CLUSTER49102.2021.00015
- [10] Jack Dongarra, John Gunnels, Harun Bayraktar, Azzam Haidar, and Dan Ernst. 2024. Hardware Trends Impacting Floating-Point Computations In Scientific Applications. arXiv:2411.12090 [math.NA] <https://arxiv.org/abs/2411.12090>
- [11] Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J Higham. 2018. Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers. In *SC18: International Conference for High Performance Computing, Networking, Storage and Analysis*. IEEE, 603–613.
- [12] Michael Heroux, Ahmad Abdelfattah, Natalie Beams, Robert Carson, Pieter Ghysels, Tzanis Kolev, Thomas Stitt, Arturo Vargas, Stanimire Tomov, and Jack Dongarra. 2024. MAGMA: Enabling exascale performance with accelerated BLAS and LAPACK for diverse GPU architectures. *Int. J. High Perform. Comput. Appl.* 38, 5 (Sept. 2024), 468–490. doi:10.1177/10943420241261960
- [13] Nicholas J Higham. 2002. *Accuracy and stability of numerical algorithms*. SIAM.
- [14] Tsuyoshi Ichimura, Kohei Fujita, Muneo Hori, and Maddegedara Lalith. 2025. Fast and Power Efficient GPU-Based Explicit Elastic Wave Propagation Analysis by Low-Ordered Orthogonal Voxel Finite Element with INT8 Tensor Cores. *Journal of Computational Science* 91 (2025), 102659. doi:10.1016/j.jocs.2025.102659
- [15] Ali Kashi, Nicholas J. Higham, and Jack Dongarra. 2024. Mixed-precision numerics: a survey of algorithms and architectures. *arXiv preprint* (2024). arXiv:2412.19322
- [16] Zeyu Lin, Yan Chen, and Hao Zhao. 2024. MixPert: Mixed-precision emulation on tensor cores for high-performance GEMM. In *Proceedings of the ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCSES)*. 65–76. doi:10.1145/3642345.3646789
- [17] Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamal, Naveen Mellemputti, Stuart Oberman, Mohammad Shoeybi, Michael Siu, and Hao Wu. 2022. FP8 Formats for Deep Learning. arXiv:2209.05433 [cs.LG] <https://arxiv.org/abs/2209.05433>
- [18] Webb Miller. 1974. Computational complexity and numerical stability. In *Proceedings of the Sixth Annual ACM Symposium on Theory of Computing* (Seattle, Washington, USA). 317–322. doi:10.1145/800119.803910
- [19] Daisuke Mukunko, Kengo Ozaki, and Toshiyuki Imamura. 2020. DGEMM using Tensor Cores and its accurate versions. In *Proceedings of the 26th International European Conference on Parallel and Distributed Computing (Euro-Par)*. 205–218. doi:10.1007/978-3-030-57675-2\_16
- [20] NVIDIA Corporation. 2024. cuSOLVER. <https://docs.nvidia.com/cuda/cusolver/index.html>
- [21] NVIDIA Corporation. 2025. cuBLAS. <https://docs.nvidia.com/cuda/cublas/index.html>
- [22] Hiroyuki Ootomo, Katsuhisa Ozaki, and Rio Yokota. 2024. DGEMM on integer matrix multiplication unit. *The International Journal of High Performance Computing Applications* 38, 4 (2024), 297–313.
- [23] Shohie Ootomo, Kengo Ozaki, and Rio Yokota. 2024. DGEMM on integer matrix multiplication units with Ozaki decomposition. *International Journal of High Performance Computing Applications* (2024). doi:10.1177/1094342024120467
- [24] Katsuhisa Ozaki, Takeshi Ogita, Shin'ichi Oishi, and Siegfried M. Rump. 2012. Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications. *Numerical Algorithms* 59 (2012), 95–118. doi:10.1007/s11075-011-9478-1
- [25] Kengo Ozaki, Daisuke Uchino, and Toshiyuki Imamura. 2025. Ozaki Scheme II: GEMM-oriented floating-point emulation with modular arithmetic. *arXiv preprint* (2025). arXiv:2504.08009
- [26] Bita Darvish Rouhani, Nitin Garegrat, Tom Savell, Ankit More, Kyung-Nam Han, Ritchie Zhao, Mathew Hall, Jasmine Klar, Eric Chung, Yuan Yu, Michael Schulte, Ralph Wittig, Ian Bratt, Nigel Stephens, Jelena Milanovic, John Brothers, Pradeep Dubey, Marius Cornea, Alexander Heinecke, Andres Rodriguez, Martin Langhammer, Maxim Den Summer an dNaumov, Paulius Micikevicius, Michael Siu, and Colin Verrillli. 2023. *OCP Microscaling Formats (MX) Specification*. Technical Report. Open Compute Project. <https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf> Revision 1.0.
- [27] Robert Schreiber and Charles Van Loan. 1989. A storage-efficient WY representation for products of Householder transformations. *SIAM J. Sci. Statist. Comput.* 10, 1 (1989), 53–57. doi:10.1137/0910004
- [28] Daisuke Uchino and Toshiyuki Imamura. 2025. High-performance INT8 GEMM emulation with Ozaki decomposition. *arXiv preprint* (2025). arXiv:2508.03984
- [29] Daisuke Uchino, Kengo Ozaki, and Toshiyuki Imamura. 2025. Performance enhancement of the Ozaki Scheme on integer units for high-performance GEMM. *International Journal of High Performance Computing Applications* (2025). doi:10.1177/1094342024123456
- [30] Daisuke Uchino, Kengo Ozaki, and Toshiyuki Imamura. 2025. Performance enhancement of the Ozaki Scheme on integer units for high-performance GEMM. *International Journal of High Performance Computing Applications* (2025). doi:10.1177/1094342024123456
- [31] Yuki Uchino, Katsuhisa Ozaki, and Toshiyuki Imamura. 2024. Performance Enhancement of the Ozaki Scheme on Integer Matrix Multiplication Unit. arXiv:2409.13313 [cs.DC] <https://arxiv.org/abs/2409.13313>
- [32] Field G. Van Zee and Tyler M. Smith. 2017. Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods. *ACM Trans. Math. Softw.* 44, 1, Article 7 (July 2017), 36 pages. doi:10.1145/3086466
- [33] Wei Zhang, Peng Li, and Jian Wang. 2025. LE-GEMM: Lightweight GEMM emulation with precision refinement. *Journal of Systems Architecture* (2025). doi:10.1016/j.sysarc.2025.103017

Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009