

# HERO-Sign: Hierarchical Tuning and Efficient Compiler-Time GPU Optimizations for SPHINCS<sup>+</sup> Signature Generation

Yaoyun Zhou

Department of EECS

University of California, Merced

yzhou96@ucmerced.edu

Qian Wang

Department of Electrical Engineering

University of California, Merced

qianwang@ucmerced.edu

**Abstract**—SPHINCS<sup>+</sup> is a stateless hash-based signature scheme known for its strong post-quantum security, but it suffers from slow signature speed due to the heavy use of hash operations. The parallel architecture of GPUs offers a potential advantage for accelerating the computation of SPHINCS<sup>+</sup> signatures. However, existing GPU-based optimization efforts for SPHINCS<sup>+</sup> either do not fully exploit the inherent parallelism of its Merkle Tree-based structure, or lack fine-grained, compiler-level customization tailored to its diverse computational kernels.

This paper proposes HERO-Sign, which adopts hierarchical tuning methodologies and efficient compiler-time GPU optimizations for SPHINCS<sup>+</sup>. HERO-Sign rethinks the parallelization potential arising from data independence in SPHINCS<sup>+</sup>' components, including FORS (Forest of Random Subsets), MSS (Merkle Signature Scheme, and WOTS<sup>+</sup> (Winternitz One-Time Signature Plus). First, it introduces a Tree Fusion strategy for FORS, whose structure contains a large number of branches. Our FORS Fusion strategy is supported by an automated Tree Tuning search algorithm, allowing it to adapt and optimize fusion schemes across various GPU platforms. To further enhance performance, HERO-Sign adopts an adaptive compilation strategy that accounts for the varying effectiveness of compiler optimizations across different SPHINCS<sup>+</sup> component kernels (`FORS_Sign`, `TREE_Sign`, `WOTS+_Sign`). This strategy automatically selects between PTX and native branches during the compilation phase to maximize efficiency. For multiple batches of message signatures, HERO-Sign focuses on optimizing kernel-level overlapping and employs a Task Graph-based construction strategy to minimize multi-stream idle time and reduce kernel launch overhead. Compared to state-of-the-art GPU implementations, under the SPHINCS<sup>+</sup>-128f, SPHINCS<sup>+</sup>-192f and SPHINCS<sup>+</sup>-256f parameter sets, HERO-Sign demonstrates an enhanced throughput of  $1.28 \times - 3.13 \times$ ,  $1.28 \times - 2.92 \times$ , and  $1.24 \times - 2.60 \times$  on RTX 4090. Similar performance improvements have also been achieved on other architectures, including the A100, H100, and GTX 2080. HERO-Sign also achieves a two-order-of-magnitude reduction in kernel launch latency.

## I. INTRODUCTION

Quantum computing has emerged as a transformative technology with profound implications across multiple domains, including cryptography, drug discovery, materials science, and artificial intelligence [4], [8], [9], [29]. However, with the improvement of quantum error correction techniques, widely used public-key encryption systems such as Rivest-Shamir-Adleman (RSA) and Elliptic Curve Cryptography (ECC) are

facing new security challenges. Shor's algorithm threatens the security foundations of RSA and ECC by enabling efficient solutions to integer factorization and discrete logarithm problems, while Grover's algorithm presents a moderate risk to symmetric-key systems through quadratic speedups [13], [33]. The advent of IBM's 1,121-qubit Condor quantum processor significantly advances the practical feasibility of implementing such quantum cryptanalysis algorithms [17].

Based on this, NIST launched the Post-Quantum Cryptography (PQC) standardization initiative to evaluate quantum-resistant cryptographic algorithms, culminating with the standardization of lattice-based schemes for key encapsulation (CRYSTALS-Kyber, Hamming Quasi-Cyclic) and digital signatures (CRYSTALS-Dilithium and FALCON), alongside the hash-based signature scheme SPHINCS<sup>+</sup> [6], [22], [27]. Among these algorithms, SPHINCS<sup>+</sup> uses more than 100,000 hash computations in Hypertree Structure. Although it provides strong security guarantees based solely on the hardness of hash functions, it suffers from low computational efficiency. For instance, SPHINCS<sup>+</sup>-128f has larger signature sizes of 17,088 bytes, compared to lattice-based schemes such as FALCON-512 (690 bytes) and Dilithium-2 (2,420 bytes). On the Arm Cortex-A72 platform, SPHINCS<sup>+</sup>-256f signature generation is approximately 18 times slower than FALCON-512 and 259 times slower than Dilithium-2 [12]. Specifically, in high-throughput applications such as blockchain, authentication, VPNs, and IoT devices, the signatures speed of SPHINCS<sup>+</sup> directly impacts the performance of the system.

Prior main optimization works for SPHINCS<sup>+</sup> can be categorized into three approaches: FPGA-based custom accelerators: Optimizing latency-critical paths [2], [7]. CPU-level enhancements: Using SIMD extensions, e.g., AVX2/AVX-512/Neon, cryptographic instruction sets (e.g., Intel SHA-NI) [1], [14], [28], [39]. GPU-centric parallelization: Exploiting the algorithm's inherent parallelism in SPHINCS<sup>+</sup> structure [19], [34], [37]. GPUs offer thousands of CUDA Cores, a fine-grained SIMT (Single Instruction, Multiple Threads) execution model and hierarchical memory subsystems (Global Memory, Shared memory, and Constant Memory, etc.), which provide a flexible framework for us to optimize SPHINCS<sup>+</sup>'s

compute-intensive operations.

Different from previous GPU-based works, our work not only introduces a hierarchical optimization framework but also provides a more flexible auto-tuning strategy to adapt to various GPU platforms. We select SHA-256 as the hash function baseline for SPHINCS<sup>+</sup> due to its widespread adoption and standardization. Our optimization strategies are algorithm-agnostic and do not depend on specific hash function. Therefore, users can flexibly adopt other standardized hash primitives (e.g., SHA-512, SHA-3), according to their security and performance requirements. Furthermore, we focus on the  $-f$  parameter sets, which feature a larger number of FORS trees and longer WOTS chains than the  $-s$  variants, thus presenting greater optimization challenges.

Our work contributions are as follows:

- We propose HERO-Sign, a CUDA-based implementation on GPU, which adopts Hierarchical Tuning and Efficient CompileR-Time Optimizations for SPHINCS<sup>+</sup> Signature Generation. HERO-Sign rethinks the potential of inherent data parallelism in SPHINCS<sup>+</sup>'s components, where MSS are parallelized based on tree-level parallelism, and WOTS<sup>+</sup> is parallelized using chain-level parallelism.
- HERO-Sign leverages shared memory as a bridge to optimize the highly branched structure of FORS by introducing an automated Tree Tuning search algorithm. Further, HERO-Sign develops a more general memory padding strategy for mitigating shared memory multi-bank conflicts, supporting single-threaded access patterns of 16-, 24-, and 32-byte widths.
- HERO-Sign allows fine-grained tuning via PTX-based SHA2 implementations. At compile time, each component kernel adaptively selects between native and PTX-optimized branches based on the effectiveness of compiler optimizations. In contrast to runtime branching, which can increase register pressure and lead to a higher memory footprint due to multiple active code paths, our compile-time branching strategy allocates a single, fixed execution path per kernel with only a small increase in compilation overhead.
- For multiple batches of message signatures, HERO-Sign employs a Task Graph-based construction strategy, minimizing idle time and reducing kernel launch overhead. HERO-Sign also explores the impact of varying batch sizes on kernel-level overlapping efficiency.

TABLE I: The SPHINCS<sup>+</sup>-f parameter sets.  $n$ : The bytes of hash outputs, secret keys and public seeds.  $h$ : The height of the hypertree.  $d$ : The number of layers in the hypertree.  $\log(t)$ : The height of each FORS tree.  $k$ : The number of FORS trees.  $w$ : The Winternitz parameter used in WOTS<sup>+</sup>.

| Scheme                     | $n$ | $h$ | $d$ | $\log(t)$ | $k$ | $w$ |
|----------------------------|-----|-----|-----|-----------|-----|-----|
| SPHINCS <sup>+</sup> -128f | 16  | 66  | 22  | 6         | 33  | 16  |
| SPHINCS <sup>+</sup> -192f | 24  | 66  | 22  | 8         | 33  | 16  |
| SPHINCS <sup>+</sup> -256f | 32  | 68  | 17  | 9         | 35  | 16  |

## II. BACKGROUND AND MOTIVATION

### A. SPHINCS<sup>+</sup> Signature Scheme

SPHINCS<sup>+</sup> is characterized by a modular design comprising FORS, WOTS<sup>+</sup>, MSS, and a multi-layer Hypertree structure. Figure 1 illustrates the overall structure and signature generation flow. In the following sections, we analyze the parallelization in each component of the signature process. Table I presents the specific values for the  $f$  parameter.

1) *Winternitz One-Time Signature Scheme*: The generation of a WOTS<sup>+</sup> (Winternitz One-Time Signature Plus) signature involves computing a sequence of hash chains over a message-derived input. The total number of chains, denoted as  $\text{WOTS}_{\text{len}}$ , is determined by both the chosen Winternitz parameter  $w$  and the output length of the underlying hash function. Specifically,  $\text{WOTS}_{\text{len}} = \ell_1 + \ell_2$ , where  $\ell_1 = \left\lceil \frac{8 \cdot n}{\log_2 w} \right\rceil$  encodes the message and  $\ell_2$  ensures checksum integrity. Each element of the WOTS<sup>+</sup> signature is computed as a hash chain of length determined by the base- $w$  message representation. Since each WOTS<sup>+</sup> chain operates independently, the signature generation process is highly parallelizable, making it well-suited for acceleration using CUDA multi-threading programming.

2) *FORS Signature*: The FORS (Forest of Random Subsets) signature scheme constructs a signature by selecting and computing authentication paths in a forest of Merkle trees. The scheme consists of  $k$  binary Merkle trees, each with height  $\log t$ , where  $t = 2^{\log t}$  denotes the number of leaves in each tree. To generate a signature, the message is first mapped to  $k$  indices, each indicating a leaf in one of the  $k$  trees, so that each Merkle Tree can be computed independently. For each index, the corresponding leaf node and its authentication path (comprising  $\log t$  sibling nodes) are computed to form the signature. Due to the structure of the Merkle trees, the computation of nodes at each layer in each FORS subtree is independent. This property allows multiple nodes to be computed in parallel using multi-threading.

3) *Merkle Signature Scheme*: The Merkle Signature Scheme (MSS) is a hash-based digital signature scheme that constructs a single binary Merkle tree to authenticate multiple one-time signatures. At the core of MSS is a Merkle tree built from the public keys of a set of one-time signature (OTS) key pairs, such as WOTS<sup>+</sup>. The root of this Merkle tree is the public key of MSS. To generate a signature for a message, a one-time key pair is selected, and the message is signed using the corresponding OTS scheme. Along with the OTS signature, an authentication path from the used leaf to the Merkle root is included to prove its validity. This structure enables MSS to generate many secure signatures using a compact public key and the robustness of hash-based cryptography.

4) *Hypertree Signature Scheme*: The Hypertree signature scheme extends the MSS by organizing multiple MSS trees into a hierarchical structure composed of  $d$  layers of subtrees, with a total height  $h$ . Each layer consists of several Merkle trees of height  $\frac{h}{d}$ , where the root of each subtree in a lower layer is signed using a one-time signature and incorporated as a leaf node in the Merkle tree of the layer above. The



Fig. 1: The overall structure of SPHINCS<sup>+</sup>, highlighting the key components: FORS, WOTS+, MSS and the resulting Hypertree structure. → represents the flow of signature computation and storage from left to right. ● ■ denote the generated signatures. Merkle tree signatures are computed bottom-up and WOTS+ signatures are derived from multiple WOTS+ chains. □ marks the parallelizable regions of the structure. It further illustrates the workflow the Tree Tuning algorithm and the compile-time execution path selection mechanism.

TABLE II: Time Breakdown (ms) for TCAS-SPHINCS<sup>+</sup>

| Time Measure (ms)          | FORS  | Idle Time | MSS   | WOTS+ |
|----------------------------|-------|-----------|-------|-------|
| SPHINCS <sup>+</sup> -128f | 1.89  | 2.27      | 6.57  | 0.93  |
| SPHINCS <sup>+</sup> -192f | 7.75  | 2.31      | 10.06 | 1.33  |
| SPHINCS <sup>+</sup> -256f | 13.25 | 2.29      | 26.55 | 1.47  |

root of the top-level tree serves as the global public key. To generate a signature, the scheme traverses from the bottom layer to the top, producing one-time signatures and corresponding authentication paths at each level. Due to the tree-based structure, both the computation of each MSS subtree and the generation of their internal nodes and authentication paths can be performed independently. This independence allows for effective parallelization of the Hypertree signature generation.

### B. Motivation

For lattice-based PQC algorithms, prior efforts on GPU acceleration of CRYSTALS-Kyber, CRYSTALS-dilithium, and Falcon have been explored [30], [32], [36]. For hash-based schemes like SPHINCS<sup>+</sup>, Sun et al. [34] proposed an initial GPU-based optimization of SPHINCS, the predecessor of SPHINCS<sup>+</sup>, where the original HORST (Hash to Obtain a Random Subset with Tree) was employed as a few-time signature scheme. This work implemented parallelization strategies for both WOTS+ and HORST over a single Merkle tree. Kim

et al. [19] introduced the first GPU-parallel implementation of the Hypertree structure in SPHINCS<sup>+</sup>. This work extended the single Merkle tree parallelism to support simultaneous computation across multiple Merkle trees, significantly improving throughput. However, their implementation focused primarily on the Hypertree component and supported only single FORS subtree parallelism. Table II presents the detailed time breakdown of their GPU-based SPHINCS<sup>+</sup> implementation. We observe that the MSS phase dominates the most computation, followed by FORS, while WOTS<sup>+</sup> is relatively lightweight. Meanwhile, a non-negligible amount of idle time is observed across all three parameter sets. CBPSPX introduces a theoretically estimated Thread Utilization Efficiency Index to guide parallel methods selection, but lacks empirical profiling of actual parallel execution efficiency [38].

Furthermore, batch message signatures based on multiple streams has been widely adopted in cryptographic systems, as it enables efficient overlapping of computation and data movement. CUSPX estimates the number of streams by dividing the total number of tasks by the available CUDA cores [37]. While this approach enables effective performance gain, it does not eliminate kernel-level idle time during execution. BatchZK assigns each Merkle tree to a separate CUDA stream and overlaps their execution to improve throughput [21]. While this approach performs well for a small number of Merkle trees, it does not scale efficiently to scenarios like Hypertree

in SPHINCS<sup>+</sup>, which involve up to 35 Merkle trees due to the high latency induced by excessive stream count. Moreover, the degree of overlap between streams is constrained by hardware resources such as register usage, thread allocation, and shared memory, making it difficult to eliminate idle time through stream-level parallelism completely.

To address the above challenges, HERO-Sign rethinks the inherent parallelism within SPHINCS<sup>+</sup> and introduces an automated Tree Tuning search algorithm specifically tailored for FORS, which involves a large number of subtrees. For batch message signatures, HERO-Sign explores message-level collaborative parallelism and employs a Task Graph-based kernel construction method to reduce kernel launch overhead and mitigate idle time during concurrent multiple kernels execution. HERO-Sign also prioritizes the use of shared memory for fast memory access whenever possible and selectively tunes instruction choices to align with different component kernels. HERO-Sign is adaptable to performance tuning across diverse GPU architectures.

### III. THE OVERVIEW OF HERO-SIGN

SPHINCS<sup>+</sup> is a compute-intensive cryptographic scheme, but the signature generation process of its constituent kernel functions vary significantly. For instance, the generation of leaf nodes in the Hypertree layers heavily relies on repeated calls to the `wots_gen_leaf` function. Taking SPHINCS<sup>+</sup>-128f as an example, the generation of a single node involves approximately 560 iterations of WOTS<sup>+</sup> chain hashing in once `wots_gen_leaf` function call. The high register demand per thread increases pressure on the limited register file within each SM, reducing warp occupancy. In contrast, FORS node generation is relatively lightweight per node with only a few hash computations, but due to the large number and increased height of its subtrees, the overall parallel performance depends critically on efficient thread utilization and careful management of shared memory. For WOTS<sup>+</sup> signature generation, although it also relies on `wots_gen_leaf`, it is invoked only once per layer, making it more amenable to thread-level parallelization.

Given these computational characteristics, HERO-Sign follows the approach of Kim et al [19] and decompose the SPHINCS<sup>+</sup> signature process into three separate CUDA kernels: `TREE_Sign`, `FORS_Sign` and `WOTS+_Sign`. HERO-Sign integrates parallel Merkle tree processing, a Tree Tuning algorithm for FORS, adaptive compilation branches and Task Graph-based batch message signatures, collectively enhancing GPU resource utilization. In the following Sections A–H, we detail the optimization techniques employed in HERO-Sign.

#### A. Multiple Merkle Trees Parallelization

As described in the code snippet of Figure 2, each Merkle tree within the Hypertree structure of SPHINCS<sup>+</sup> can be computed independently without data dependencies. Similarly, the signature computations for each FORS subtree are also independent. If given sufficient thread blocks and adequate register and shared memory resources, HERO-Sign could

---

```

hash_message(mhash, &tree, &leaf_idx, sig, pk,
m, ...);
    ... /* Set WOTS+ Address */
/* Generate the fors's indices array from
message */
message_to_indices(indices, m);
fors_sign(sig, root, mhash, indices...);
sig += SPX_FORS_BYTES;
for (i = 0; i < SPX_D; i++) {
    merkle_sign(sig, root, ..., leaf_idx);
    ... /* Set Tree Address */
    sig += SPX_WOTS_BYTES + SPX_TREE_HEIGHT *
        → SPX_N;
/* Update the indices for the next layer. */
leaf_idx = (tree & ((1 << SPX_TREE_HEIGHT)
- 1));
tree = tree >> SPX_TREE_HEIGHT;
}

```

---

Listing 1: SPHINCS<sup>+</sup> Signature Generation Snippet

Fig. 2: SPHINCS<sup>+</sup> Signature Generation Snippet. Both the `leaf_idx` and the `indices` array can be precomputed from a fragment of the input message. The `indices` array determines the selected signature indices for each FORS subtree, the `leaf_idx` specifies the signature index selected within each Hypertree layer.

parallelize Hypertree and FORS across multiple Merkle trees. In the case of WOTS<sup>+</sup>, the kernel can be launched once the root nodes of the FORS and Hypertree Merkle trees are computed, enabling further concurrency in the signature generation process. HERO-Sign also applies the `__launch_bounds__` qualifier to constrain register usage and ensure reliable kernels’ execution under GPU limitations resource.

#### B. FORS Fusion

1) *Parallelism Feasibility*: Considering SPHINCS<sup>+</sup> parameter sets 128f, 192f and 256f, the corresponding Hypertree structures comprise 176, 176, and 272 leaf nodes, respectively. HERO-Sign assigns each leaf node to a dedicated CUDA thread within a block (up to 1024 threads), resulting in shared memory usage of approximately 1KB, 4.125KB, and 8.5KB per block. These allocations remain well within the 48KB per-block static shared memory limit of the RTX 4090, thus enabling fully parallel multiple Merkle tree computations during signature generation. While the FORS contains 2,112, 8,448 and 17,920 leaf nodes respectively, exceeding the 1,024-thread capacity of one block. This results in shared memory consumption of 33 KB, 198 KB and 560 KB, with the latter two configurations surpassing 48 KB shared memory limit per block.

2) *FORS Fusion Process*: As shown in Table III, the theoretical occupancy of FORS in the 128f parameter set is 3.89 times higher than the practical occupancy, this suggests notable underutilization of GPU resources. To address this inefficiency, HERO-Sign introduces a FORS Fusion strategy. HERO-Sign first groups multiple Merkle trees that can be

TABLE III: Warp Occupancy, Theoretical Occupancy, Registers allocated Per Thread for different component kernels in SPHINCS<sup>+</sup>-128f in TCAS-SPHINCSp on RTX4090 [19], which are profiled by Nsight System. The TREE\_Sign has low theoretical and practical occupancies with higher register per thread. This high register usage of TREE\_Sign also limits the potential for kernel fusion within SPHINCS<sup>+</sup> [15], [35]. We explored with several fusion strategies combining TREE\_Sign, WOTS+\_Sign and FORS\_Sign, but observed performance degradation as a result.

|                       | FORS_Sign | TREE_Sign | WOTS+_Sign |
|-----------------------|-----------|-----------|------------|
| Warp Occupancy        | 17%       | 25%       | 46%        |
| Theoretical Occupancy | 66.67%    | 25%       | 52.08%     |
| Registers Per Thread  | 64        | 128       | 72         |

Occupancy reflects the ratio between the number of active warps and the maximum supported warps per Streaming Multiprocessor (SM).

### Algorithm 1 Auto Tree Tuning Algorithm under SEME and Thread Constraints

```

Require: FORS parameters ( $k$ ,  $\log_2 t$ ,  $n$ )
Require: SEMEPERBLOCK()  $\triangleright$  Dynamic/Static
Ensure: Optimal configuration ( $T^*$ ,  $F^*$ )
1:  $T_{\min} \leftarrow 2^{\log_2 t}$   $\triangleright$  Minimum threads per FORS tree
2:  $T_{\max} \leftarrow 1024$ ,  $S_{\max} \leftarrow \text{SEMEPERBLOCK}()$ 
3:  $Cand \leftarrow \emptyset$ 
4: for  $T_{\text{Set}} = T_{\min}$  to  $T_{\max}$  step  $T_{\min}$  do
5:    $N_{\text{tree}} \leftarrow T_{\text{set}}/T_{\min}$ 
6:    $S_{\text{set}} \leftarrow N_{\text{tree}} \cdot t \cdot n$ 
7:   if  $S_{\text{set}} > S_{\max}$  then
8:     continue
9:   end if
10:   $F_{\max} \leftarrow \min \left( \left\lfloor \frac{S_{\max}}{S_{\text{set}}} \right\rfloor, \left\lfloor \frac{k}{N_{\text{tree}}} \right\rfloor \right)$ 
11:  for  $F = 1$  to  $F_{\max}$  do
12:     $T_{\text{used}} \leftarrow T_{\text{set}}$   $\triangleright$  Threads Fixed per Set
13:     $S_{\text{used}} \leftarrow F \cdot S_{\text{set}}$   $\triangleright$  SEME used after Fusing Sets
14:    if  $T_{\text{used}} > T_{\max}$  or  $S_{\text{used}} > S_{\max}$  then
15:      continue
16:    end if
17:     $U_T \leftarrow T_{\text{used}}/T_{\max}$ ,  $U_S \leftarrow S_{\text{used}}/S_{\max}$ 
18:    if  $U_T = 1$  and  $U_S = 1$  or  $U_T < \alpha$  then  $\triangleright$  The
        $\alpha$  value is an optional tune factor and it may vary across
       different GPU architectures.
19:      continue
20:    end if
21:     $sync \leftarrow \frac{\log_2 t \cdot \lceil \frac{k}{N_{\text{tree}}} \rceil}{F}$   $\triangleright$  Sync points after fusion.
22:     $Cand \leftarrow Cand \cup \{(T_{\text{set}}, F, U_T, U_S, sync)\}$ 
23:  end for
24: end for
25:  $(T^*, F^*, U_T^*, U_S^*) \leftarrow \arg \min_{(T, F, U_T, U_S, sync)} (sync, -U_T, -U_S)$  over
    $Cand$ 
26: return  $(T^*, F^*, U_T^*, U_S^*)$ 

```



Fig. 3: FORS Fusion Process. In Unfused state, for instance,  $Set_1$  must wait for  $Set_0$  to release SEME One, resulting in a sequential execution pattern in one Block. In Fused state, multiple sets including  $Set_0$ ,  $Set_1$ ,  $Set_2$ , and up to  $Set_F$  are fused to a new  $Set$ . If sufficient shared memory is available, the value of  $F$  could be increased accordingly.

processed in parallel within a block into a unit denoted as  $Set$ . By leveraging the available shared memory, HERO-Sign allows FORS Fusion operations across consecutive  $Set$ s, thus enabling parallel execution of multiple  $Set$ s. This strategy also helps reduce the synchronization overhead typically introduced with multiple independent  $Set$ s. Figure 3 illustrates the FORS Fusion process. During the fusion process, HERO-Sign introduces an  $\text{OFFSET}$  variable to ensure a one-to-one correspondence between the Merkle trees fused from different  $Set$ s. Since computing  $\text{OFFSET}$  incurs some overhead due to high branching, HERO-Sign reuses the same variable across fusion process.

3) *Fusion Trade-offs in FORS*: Determining how many  $Set$ s can be fused involves a trade-off between thread allocation and shared memory constraints. HERO-Sign incorporates an offline Auto Tree Tuning algorithm 1 to automatically determine the optimal Tree Fusion strategy. The Tree Tuning algorithm automatically queries the maximum shared memory per block available on the target GPU and leverages FORS-specific parameters to compute the optimal fusion strategy. It also identifies a set of near-optimal candidates, among which the final configuration can be selected based on empirical profiling results. The searching process follows a set of prioritized heuristics. First, valid configurations must allocate more threads than the total number of nodes in a single FORS subtree (Line 1), ensuring full coverage. Second, configurations that fully utilize 1024 threads and approach the per-block shared memory limit are excluded (Lines 18-19), as practical results show that such settings tend to increase resource contention and reduce warp occupancy. Third, as the number of fused  $Set$ s increases, one synchronization will cover more branches. When multiple near-optimal solutions are found, HERO-Sign prioritizes the configurations that require fewer

synchronization points within the candidate set (Lines 21-25). The searching results for the 128f and 192f configurations on RTX4090 are summarized in Table IV.

TABLE IV: Searching Results of Shared memory utilization(Static), thread utilization, and F (Number of fused Sets).

| Parameter-Sets             | Shared Memory Utilization | Thread Utilization | F |
|----------------------------|---------------------------|--------------------|---|
| SPHINCS <sup>+</sup> -128f | 0.6875                    | 0.6875             | 3 |
| SPHINCS <sup>+</sup> -192f | 0.75                      | 0.75               | 2 |

4) *Relax-FORS model*: Furthermore, considering the 256f parameter set, each FORS tree contains 512 leaf nodes, consuming up to 16 KB of shared memory per tree. Assigning one thread per leaf node allows at most two subtrees to run in parallel, while 256f requires processing 35 FORS trees in total. This leads to excessive synchronization and resource contention in terms of both threads and shared memory, thereby limiting warp occupancy.

We propose *Relax\_FORS* model to address these constraints. Observing that each tree's shared-memory usage is halved when moving one level upward in the Merkle reduction (see Figure 7), we introduce a *Relax\_Buffer* at the bottom layer of each *Set* and utilize complete shared memory for the upper layers. Instead of dedicating one thread per leaf node, a single thread is responsible for generating two leaf nodes and then storing them in the *Relax\_Buffer*. This buffer is implemented in registers, whose per-thread organization aligns naturally with CUDA's fine-grained SIMT execution model. Each thread maintains a private register array, and we constrain its register usage to a threshold  $R_t$  to reduce register pressure, avoid register spilling, and limit SM resource consumption. The overhead of this buffering strategy is negligible due to the low-latency register access. The *Relax\_FORS* model shows that moderately reducing memory usage can improve performance by alleviating memory pressure and enhancing data locality. Figure 4 provides an overview of the *Relax\_FORS* workflow.

### C. Adaptive Selection of PTX branches

The hash-based structure of SPHINCS<sup>+</sup> inherently with intra-chain dependencies and inter-chain aggregation patterns, which pose a challenge for efficient thread allocation. HERO-Sign addresses this challenge by shifting the optimization of SHA2 computations to the compiler level. HERO-Sign leverages PTX-level instruction tuning to balance aggressive and conservative compiler optimizations. HERO-Sign further adopts an adaptive branch selection strategy tailored to each kernel's behavior.

1) *The selection of PTX-level instructions*: The SHA-2 function processes each 512-bit message block through 64 rounds of bitwise logic, shift operations, and modular additions to produce a 256-bit digest. While these logical operations can be effectively optimized by the compiler, we observed that each round requires 16 big-endian data loads, traditionally implemented using multiple shift operations. We replace these shift-based conversions with the `prmt` instruction. While the



Fig. 4: The Workflow of *Relax\_FORS*. The bottom layer of each *Set* relies on a *Relax\_buffer*: for each FORS tree, once a thread acquires it, it quickly computes the parent node and writes the parent node back to the shared memory of the upper layer. Thus, it reduces shared memory usage by half. The per-thread register constraint effectively reduces contention. In this *Set*, *Relax\_buffer* reserves local space for eight leaf nodes to accommodate four trees.

`prmt` instruction has a higher latency than a simple `shl` instruction, it can replace multiple `shl` instructions with one single byte-level permutation, thus reducing `shl` register pressure. `IADD3` instruction also has high usage rate in the SPHINCS<sup>+</sup>-128f parameter set, we replaced it with the `mad` instruction, however, we observe that the compiler will still aggressively optimize the `mad` instruction into `IADD3` at SASS level. We introduce an auxiliary parameter  $m$  into the `mad` function, so the compiler will be misled to interpret the function as having four input parameters, which alters its optimization heuristics and will retain `mad` instructions at the SASS level shown in Figure 5.

2) *PTX Branch Selection at Kernel-Level*: Since the signature computation procedures vary across different kernels, and PTX does not always outperform native version due to restricted compiler optimization space, HERO-Sign considers PTX branch selection from a coarser-grained, kernel-centric perspective. Given the opaque and hardware-specific nature of the compilation pipeline (e.g., from nvcc or ptxas to different GPU architectures), quantitative analysis of PTX-level optimizations based solely on compiler rules remains infeasible. HERO-Sign adopts a more intuitive approach by providing both PTX and native branches, allowing profiling-based comparison to select the more performant configuration. The branch selection applied to different kernels are summa-

```

// Permutation on a 32-bit value
asm volatile (
    "{\n\t"
    " prmt.b32 %0, %1, 0, 0x0123;\n\t"
    "}\n"
    : "=r"(result)
    : "r"(input)
);

// Permutation on a 64-bit value
asm volatile (
    "{\n\t"
    " .reg .b32 x, y;\n\t"
    " mov.b64 {x, y}, %1;\n\t"
    " prmt.b32 x, x, 0, 0x0123;\n\t"
    " prmt.b32 y, y, 0, 0x0123;\n\t"
    " mov.b64 %0, {y, x};\n\t"
    "}\n"
    : "=l"(result)
    : "l"(input)
);

// MAD usage (example with m = 1)
asm volatile (
    "mad.lo.u32 %0, %1, %4, %2;\n\t"
    "mad.lo.u32 %0, %0, %4, %3;\n\t"
    : "=r"(result)
    : "r"(a), "r"(b), "r"(c), "r"(m)
);

```

Fig. 5: Examples of PTX inline assembly used for 32-bit and 64-bit permutation, as well as mad.lo.u32 usage. All selected instructions operate on `uint32_t` to maximize efficiency without compromising security.

rized in Table V. We also consider register-level factors based on the occupancy model in Equation 1. Due to the limited register file per SM on GPUs, increased register usage per thread leads to reduced warp occupancy.

$$\text{Occupancy} = \frac{1}{W_{\max}} \cdot \left\lfloor \frac{\mathcal{R}_{\text{total}}}{\mathcal{R}_{\text{thread}} \times \mathcal{T}_{\text{block}}} \right\rfloor \cdot \left( \frac{\mathcal{T}_{\text{block}}}{32} \right) \quad (1)$$

where  $\mathcal{R}_{\text{total}}$  is the total number of registers per SM,  $\mathcal{R}_{\text{thread}}$  is the number of registers allocated per thread,  $\mathcal{T}_{\text{block}}$  is the number of threads per block, and  $W_{\max}$  denotes the maximum number of warps supported per SM.

For instance, in SPHINCS<sup>+</sup>-256f, the baseline TREE\_Sign kernel requires 168 registers per thread, with a warp occupancy of only 19%. By introducing the PTX branch, the register usage is reduced to 95 per thread, thereby improving occupancy to 37.5%, a 1.97× increase compared to the native version. Furthermore, in TREE\_Sign kernel, generating a single leaf node with one thread involves 560, 816 and 1072 SHA-2 computations in SPHINCS<sup>+</sup>-128f, SPHINCS<sup>+</sup>-192f and SPHINCS<sup>+</sup>-256f, respectively, the better performance of PTX in SPHINCS<sup>+</sup>-256f suggests that PTX can help alleviate aggressive compiler optimizations.

3) *Compilation-Level Branching Strategy*: Runtime branch selection inevitably results in multiple active code paths, which not only increases device memory usage but also imposes additional pressure on the instruction cache. HERO-Sign lever-

TABLE V: PTX branch selection across signature kernels under different SPHINCS<sup>+</sup> parameter sets on RTX 4090 (Block Size = 1024). A ✓ indicates that the PTX version outperformed native; ✗ indicates the native version was retained.

| Parameter Set              | FORS_Sign | TREE_Sign | WOTS+_Sign |
|----------------------------|-----------|-----------|------------|
| SPHINCS <sup>+</sup> -128f | ✓         | ✗         | ✗          |
| SPHINCS <sup>+</sup> -192f | ✓         | ✗         | ✗          |
| SPHINCS <sup>+</sup> -256f | ✓         | ✓         | ✓          |

```

// Define two selectable execution paths
template <bool> UseOptimizedPath>
__device__ uint32_t SHA2(uint8_t* x) {
    if constexpr(UseOptimizedPath)
        { /* PTX Version */ ... }
    else { /* Native */ ... }
}

// Each kernel forwards the compile-time policy to
// the SHA-2 routine.

template <bool> UseOptimizedPath>
__global__ void FORS_Sign(/* args */) {
    // ...
    (void) sha2_round<UseOptimizedPath>(/* */);
    // ...
}

template <bool> UseOptimizedPath>
__global__ void TREE_Sign(/* args */) {
    // ...
    (void) sha2_round<UseOptimizedPath>(/* */);
    // ...
}

template <bool> UseOptimizedPath>
__global__ void WOTS+_Sign(/* args */) {
    // ...
    (void) sha2_round<UseOptimizedPath>(/* */);
    // ...
}

// Predefine a bool variable to select between PTX
// and native paths.
dim3 grid(/*...*/), block(/*...*/);
// SPHINCS+-128f selection as follows
FORS_Sign<true><< <grid, block, ...>> >(...);
TREE_Sign<false><< <grid, block, ...>> >(...);
WOTS+_Sign<false><< <grid, block, ...>> >(...);

```

Fig. 6: The Snippet illustrates the overview of the compile-time branch selection, from the kernel launch to the specific branches within the SHA-2 implementation.

ages `constexpr` if specialization to statically determine branch path at compile time, ensuring that each kernel includes only one single fixed execution path in the final binary. When all kernel selections resolve uniformly to either the native or PTX version, HERO-Sign opts to use a branch-free specialized copy, thereby eliminating unnecessary compilation overhead. Figure 6 shows the implementation logic of compile-time branching.

#### D. Hybrid Memory Allocation

HERO-Sign also exploits shared memory to accelerate internal array accesses by placing frequently accessed arrays in shared memory, effectively reducing the number of off-

chip memory accesses. During FORS Fusion process, where memory access is more intensive, excessive usage of shared memory can reduce the number of active warps per SM, HERO-Sign mitigates shared memory pressure by allocating frequently used, read-only data, such as the initial state values, `sk_seed`, `pk_seed` and `state_seed` to constant memory. Leveraging its broadcast capabilities, constant memory delivers access latency comparable to that of on-chip SRAM. In SPHINCS<sup>+</sup>-192f, for the `Tree_Sign` kernel, memory access is relatively infrequent, HERO-Sign allocates the corresponding read-only data directly in global memory and moderately leverages vectorized types such as `int4` and `int2` to enable coalesced memory accesses via `ldg.128` and `ldg.64` instructions.

### E. A Generalized Strategy for Eliminating Bank Conflicts



Fig. 7: Tree-Based Reduction Process. ■ highlights the signature nodes generated during the bottom-up process. ▲ represents the selected `leaf_idx`, which guides the authentication path and signature nodes selection. Fast shared memory access enabled by on-chip SRAM facilitates frequent data access in Merkle Tree-based signature computation. By leveraging an even-odd access and storage pattern, shared memory can be utilized more efficiently without causing RAW data hazards and allow this computation to approach register-level performance [31].

We utilize shared memory extensively during Tree-based Reduction process shown in Figure 7. Bank conflicts in shared memory will occur when multiple threads within a warp access different addresses that reside in the same memory bank, causing serialized access and increased latency [24]. Moreover, SPHINCS<sup>+</sup> defines three security levels, with each thread required to access 16 bytes, 24 bytes, and 32 bytes of data for Level 1, Level 2 and Level 3, respectively. Therefore, our bank conflict elimination strategy is designed to meet two key criteria: (1) it must remain effective during Reduction process; and (2) it must be compatible with multi-bank, since each bank typically holds only 4 bytes.

1) *From Known Results to New Observations:* In CUDA architectures, a single shared memory transaction is typically aligned to 128 bytes. While bank conflicts often arise during

a single shared memory transaction, they are not confined to accesses within a single warp [10], [26]. By inserting an extra bank after every 128 bytes, threads can be redirected to different memory banks, effectively eliminating bank conflicts, which much like conventional memory padding strategies. During Reduction process, where each thread *loads* two nodes and *stores* their parent node, since the applied padding alters the memory bank layout (see Figure 8), ensuring that all *load/store* operations within a single transaction remain conflict-free. The padding strategy applies seamlessly to SPHINCS<sup>+</sup>-128f and SPHINCS<sup>+</sup>-256f, which involve 16-byte and 32-byte accesses per thread, respectively. We define the following padding formula:

$$\underbrace{128}_{\text{A single memory transaction}} = \mathcal{B}_n \times 4 \times \mathcal{T}_h \quad (2)$$

where  $\mathcal{B}_n$  denotes the number of banks each thread accesses, and  $\mathcal{T}_h$  indicates the thread interval after which a padding bank is inserted.



Fig. 8: Each row represents a 128-byte transaction. The padding bank alters the access addresses of individual threads.

2) *Extension to 24-Byte Bank Padding:* In SPHINCS<sup>+</sup>-192f, each thread access 24 bytes, which is not a multiple of 128; therefore, Equation 2 does not directly apply. Figure 9 illustrates that 24-byte accesses break memory alignment, causing non-contiguous transactions, e.g., thread six access crosses a 128-byte boundary, and since its bank usage overlaps with thread one (banks 0–3), it is unclear whether new bank conflicts occur. We assume that the GPU hardware-level employs a coalescing strategy, allowing limited strided 128-byte accesses to be merged into a larger memory transaction. If this condition holds, the resulting access pattern within a warp induces only a 2-way bank conflict, which is more aligned with hardware-friendly scheduling strategies. Based on this, we derive the extended formula:

$$\underbrace{128 \times \mathcal{R}}_{\text{A single memory transaction region}} = \mathcal{B}_n \times 4 \times \mathcal{T}_h \quad (3)$$

where  $\mathcal{R}$  is a positive integer representing the number of contiguous 128-byte rows.

As shown in Table VI, our padding strategy reduces bank conflicts to near zero during Reduction process. This observation aligns with our initial hypothesis; to the best of our knowledge, this is the first time it has been observed that NVIDIA GPUs support a single shared memory transaction in multiples of 128 bytes.



Fig. 9: Address layout for 24-byte accesses. Illustration of cross-memory transaction redundancy in thread accesses, which persists until thread 16. Following Equation 3, a padding bank is inserted after the 16th thread.

TABLE VI: Bank conflicts comparison between baseline and our padding strategy during Reduction process profiled by Nsight System.

| Block = 1      | FORS_Sign  |            |              |       | TREE_Sign |       |              |       |
|----------------|------------|------------|--------------|-------|-----------|-------|--------------|-------|
|                | Baseline   |            | With Padding |       | Baseline  |       | With Padding |       |
|                | Load       | Store      | Load         | Store | Load      | Store | Load         | Store |
| SPHINCS+ -128f | 22,099,968 | 12,435,456 | 0            | 0     | 1568      | 704   | 1            | 0     |
| SPHINCS+ -192f | 64,152     | 30,096     | 0            | 0     | 1203      | 408   | 1            | 0     |
| SPHINCS+ -256f | 400,960    | 192,640    | 0            | 0     | 11,905    | 5,377 | 0            | 0     |

#### F. Graph-based Multiple Batches of Message Signatures

Existing task-based graph construction frameworks, such as CUDA Graph [23], Intel oneAPI DPC++ [18], StarPU [5], MLIR Task DAG [20], and Taskflow [16], which enable efficient kernel packaging and concurrent execution. While traditional CUDA streams depend on host-initiated asynchronous submissions and GPU-side scheduling, it can lead to substantial kernel launch overhead when processing multiple batches. Task graph-based methods generally capture stream operations during execution and perform static analysis on kernels' dependencies and resource usage, effectively reducing kernel launch overhead at runtime. Once instantiated, the graph can be launched with a single call to execute multiple operations.

**Graph Construction Strategies:** In analyzing kernel independence, only the signature generation of `WOTS_Sign` depends on the root nodes generated by `FORS_Sign` and `TREE_Sign`, therefore, `FORS_Sign` and `TREE_Sign` can be scheduled earlier and execute concurrently. We assign one block to represent each message. We adopt CUDA Graph to construct the task graph, as it provides native CUDA support and enables hardware-level execution optimizations. For scenarios involving a large number of messages, we partition them into multiple batches and explore appropriate batch sizes for efficient overlapping (Figure 10). For instance, in Block-based strategy,  $m \times T$  blocks are distributed across  $T$  CUDA graphs, each bound to a non-blocking stream, allowing concurrent execution and maximizing overlap across graphs. While orthogonal to traditional multi-streaming, this design benefits from CUDA Graph's preallocation resource mechanism, enabling lower launch latency across multiple graph streams.



Fig. 10: Block-based CUDA Graph construction strategies. `FORS_Sign`, `TREE_Sign` and `WOTS+_Sign` kernels are organized into DAGs by adding explicit dependencies between nodes.

## IV. EXPERIMENTAL EVALUATION

### A. Implementation of HERO-Sign

HERO-Sign first employs an offline Tree Tuning algorithm to generate a candidate set of optimal FORS fusion configurations tailored to different GPU architectures. The most performance configuration is then selected based on profiling results. Leveraging the same profiling insights, HERO-Sign further selects the appropriate PTX or native implementation for each of the three major kernel types. Finally, the selected kernels are captured and assembled into a CUDA Graph for efficient graph-based execution.

### B. Evaluation Methodology

1) *Baseline Selection:* We select TCAS-SPHINCSp as our baseline for comparison, given that it is the fastest and the state-of-the-art (SOTA) publicly available GPU implementation of SPHINCS<sup>+</sup> [19].

2) *Evaluation Metrics:* We define throughput, measured in kilo-signature operations per second (KOPS), as our primary performance metric. Kernel launch latency (in microseconds) is measured using Nsight Systems. We use Nsight Compute to obtain low-level profiling metrics, including warp occupancy, compute throughput and memory throughput [25].

3) *Experimental Platform:* Table VII summarizes our experimental platform configurations. We begin our evaluation on RTX 4090, where we present detailed performance results and optimization steps in Sections IV-C, IV-D and IV-E. In Section IV-F, we further extend HERO-Sign's optimization strategies across multiple GPU architectures. To ensure a fair comparison, we maintain the same input length across all tests and compile each configuration using the appropriate sm version for the corresponding architecture.

### C. Performance Steps of FORS\_Sign

Figure 11 illustrates the optimization steps of `FORS_Sign` kernel. We begin by enabling parallel construction of multiple

FORS\_Sign(Block = 1024) Performance Optimization Comparison



Fig. 11: FORS\_Sign Performance Optimization Steps. Step Speedup refers to performance improvement introduced by each individual optimization over the previous stage, Cumulative Speedup measures the total acceleration compared to the baseline.

TABLE VII: GPU architecture and corresponding CPU configurations. We use the PCIe interface for the following GPUs.

| GPU         | Architecture | SM Version | Base Clock (MHz) | CPU (INTEL® XEON®) |
|-------------|--------------|------------|------------------|--------------------|
| GTX 1070    | Pascal       | SM61       | 1506             | CPU E5-2620        |
| V100        | Volta        | SM70       | 1230             | PLATINUM 8575C     |
| RTX 2080 Ti | Turing       | SM75       | 1350             | CPU E5-2698        |
| A100        | Ampere       | SM80       | 1095             | GOLD 6330          |
| RTX 4090    | Ada          | SM89       | 2235             | GOLD 6530          |
| H100        | Hopper       | SM90       | 1035             | PLATINUM 8470      |

Merkle trees, *MMTP* (Section III-A). We then apply a FORS fusion strategy, *+FS* (Section III-B). Next, we apply PTX-level instruction tuning, *+PTX* (Section III-C). To mitigate shared memory contention, we introduce a hybrid memory allocation scheme, *+HybridME* (Section III-D). Finally, we eliminate shared memory bank conflicts through a lightweight bank padding technique, *+FreeBank* (Section III-E). For both 128f and 192f, the most effective performance improvement comes from *MMTP*, which fully leverages the inherent independence across Merkle tree computations to improve CUDA Core utilization. In case of 256f, we further enhance its parallelism by introducing the *Relax\_FORS* model, resulting in a 1.38× speedup over Baseline. *+FreeBank* offers limited benefit, as it mainly helps workloads sensitive to shared memory bank conflicts. For 192f, eliminating 24-byte bank conflicts yields only a 1% speedup. We hypothesize that such inefficiency may stem from limited native instruction support for aligned to more than 128-byte boundary memory transactions on modern GPU architectures. We also observe that different optimization strategies yield varying benefits across parameter sets. For instance, *+HybridME* achieves the most pronounced step speedup of 1.22× under 128f. This is attributed to a higher degree of FORS tree fusion, where each thread frequently accesses overlapping memory regions. By leveraging constant memory broadcasting, *HybridME* effectively mitigates the pressure from intensive memory reads.

#### D. Overall Kernel Performance Comparison of SPHINCS<sup>+</sup>

When integrated, our final performance improvements yield up to 2.14×, 1.26× and 2.02× speedups in FORS\_Sign, TREE\_Sign and WOTS+\_Sign, respectively, across the

128f, 192f and 256f parameter sets shown in Table VIII. Benefiting from *+PTX* instruction tuning, the TREE\_Sign kernel under 256f achieves a 3.37× increase in warp occupancy. In contrast, we observe a reduction in compute throughput for WOTS+\_Sign under 128f and 192f. This is expected, as our optimization rewrites expensive division and modulo operations into more efficient bitwise shifts and & operations. Additionally, we observe a decrease in memory throughput across several kernels, with the maximum reduction reaching 0.54×. This is primarily due to *HybridME* effectively reducing off-chip memory traffic. We also observe a 1.91× improvement in memory throughput for TREE\_Sign under the 256f configuration, where shared memory bank conflicts are particularly pronounced. This improvement stems from our 32-byte bank conflict elimination strategy, which effectively enhances the efficiency of shared memory transactions. We compare the overall performance of all kernels under three configurations (Figure 12): Baseline, HERO-Sign without CUDA Graph and HERO-Sign with CUDA Graph. HERO-Sign with CUDA Graph consistently achieves the best performance, demonstrating that task graph construction can significantly reduce inter-kernel idle time. In 128f, idle time is largely mitigated through constraining register allocation. However, in 192f and 256f, resource contention among kernels introduces idle periods that cannot be fully resolved in software. By packaging multiple kernels into a single execution graph relying on the CUDA driver, CUDA Graph effectively addresses this issue and reduces kernel launch latency by up to two orders of magnitude (221.3×).

**Comparison with FPGA and ASIC Implementations shown in Table IX.** Quentin Berhet *et al.* focus on minimizing resource utilization on *Xilinx XZU3EG* devices [7]. In contrast, HERO-Sign achieves up to 10<sup>3</sup>× higher throughput, while also demonstrating substantial energy efficiency improvements. Specifically, HERO-Sign reduces the PPS (per-signature power consumption) by 133× and 158× compared to their implementation. Similarly, Amiet *et al.* optimize SPHINCS<sup>+</sup> on the *FPGA Artix-7* platform using the *SHAKE* instance [2]. Despite this difference in the hash function,

TABLE VIII: Kernel Performance comparison between Baseline and HERO-Sign (Block = 1024).

| Parameter                  | Kernel     | Performance[KOPS] |           |         | Occupancy [%] |           |       | Compute Throughput[%] |           |       | Memory Throughput[%] |           |       |
|----------------------------|------------|-------------------|-----------|---------|---------------|-----------|-------|-----------------------|-----------|-------|----------------------|-----------|-------|
|                            |            | Baseline          | HERO-Sign | Speedup | Baseline      | HERO-Sign | Imp   | Baseline              | HERO-Sign | Imp   | Baseline             | HERO-Sign | Imp   |
| SPHINCS <sup>+</sup> -128f | FORS_Sign  | 442.9             | 946.3     | 2.14x   | 27.09         | 36.02     | 1.33x | 45.18                 | 56.37     | 1.25x | 11.26                | 9.83      | 0.87x |
|                            | TREE_Sign  | 125.2             | 157.7     | 1.26x   | 23.65         | 23.88     | 1.01x | 92.87                 | 97.67     | 1.05x | 247.0                | 1.88      | 0.76x |
|                            | WOTS+_Sign | 2493.1            | 4915.7    | 1.97x   | 42.36         | 46.54     | 1.10x | 43.63                 | 34.55     | 0.79x | 73.70                | 69.94     | 0.95x |
| SPHINCS <sup>+</sup> -192f | FORS_Sign  | 128.9             | 222.0     | 1.72x   | 32.74         | 47.05     | 1.44x | 44.69                 | 54.48     | 1.22x | 10.21                | 8.26      | 0.81x |
|                            | TREE_Sign  | 88.2              | 93.6      | 1.06x   | 23.83         | 23.87     | 1.00x | 95.57                 | 97.76     | 1.02x | 4.73                 | 2.54      | 0.54x |
|                            | WOTS+_Sign | 1457.6            | 2464.9    | 1.69x   | 31.44         | 35.09     | 1.12x | 24.50                 | 22.37     | 0.91x | 82.49                | 84.23     | 1.02x |
| SPHINCS <sup>+</sup> -256f | FORS_Sign  | 66.6              | 116.4     | 1.75x   | 32.60         | 63.76     | 1.96x | 42.42                 | 66.37     | 1.56x | 20.71                | 13.55     | 0.65x |
|                            | TREE_Sign  | 36.4              | 44.9      | 1.23x   | 18.53         | 62.43     | 3.37x | 72.38                 | 96.17     | 1.33x | 5.46                 | 10.42     | 1.91x |
|                            | WOTS+_Sign | 776.8             | 1570.9    | 2.02x   | 35.37         | 35.47     | 1.00x | 11.93                 | 12.77     | 1.07x | 88.19                | 86.80     | 0.98x |



Fig. 12: Performance (KOPS) and Kernel Launch Latency (μs) comparison between the Baseline and the fully optimized HERO-Sign, with and without CUDA Graph integration (Graph instantiation time is excluded).

 TABLE IX: Cross-Platform Comparison of SPHINCS<sup>+</sup> Variants (Throughput(KOPS), PPS(Watt)). PPS: Power Consumption Per signature.

| Variant           | GPU (RTX4090) | FPGA               | FPGA             | ASIC            |
|-------------------|---------------|--------------------|------------------|-----------------|
| HASH Algorithm    | SHA256        | SHA256             | SHAKE256         | SHA256          |
| Throughput (KOPS) | HERO-Sign     | Berthet et al. [7] | Amiet et al. [3] | SPHINCSLET [11] |
| SPHINCS+128f      | 119.47        | 0.016              | 0.99             | 0.52            |
| 128f PPS (Watt)   | 0.003         | 0.4                | 9.76             | No Support      |
| SPHINCS+192f      | 65.43         | No Support         | 0.85             | 0.20            |
| 192f PPS (Watt)   | 0.002         | No Support         | 9.69             | No Support      |
| SPHINCS+256f      | 33.88         | 0.00057            | 0.40             | 0.10            |
| 256f PPS (Watt)   | 0.003         | 0.474              | 9.80             | No Support      |

TABLE X: CPU AVX2 Performance Comparison (KOPS) [1]

| Implementation       | SPHINCS+-128f | SPHINCS+-192f | SPHINCS+-256f |
|----------------------|---------------|---------------|---------------|
| AVX2 (Single Thread) | 0.143         | 0.087         | 0.044         |
| AVX2 (16 Threads)    | 0.828         | 0.560         | 0.356         |

HERO-Sign still achieves significant throughput improvements of  $120.68\times$ ,  $76.98\times$  and  $84.70\times$  for the 128f, 192f and 256f parameter sets, respectively, while maintaining notably lower PPS. We further include the limited available ASIC-optimized implementation, *SPHINCSLET*, for reference [11]. Compared to this design, HERO-Sign achieves remarkable throughput improvements of  $229.75\times$ ,  $327.15\times$  and  $338.8\times$ , respectively, highlighting the substantial efficiency advantage of our GPU-based architecture over existing hardware counterparts.

**Comparison with AVX2 Implementations.** Table X presents the performance results of the AVX2-based CPU implementations [1]. When compared to the multi-threaded AVX2 baseline, HERO-Sign achieves substantial throughput improvements of  $144.29\times$ ,  $116.84\times$  and  $95.17\times$  for the 128f,

192f and 256f parameter sets, respectively.

#### E. Sensitivity Study

**1) Sensitivity on Block Sizes & Batch Sizes:** We evaluate the performance impact of varying block size from 2 to 1024(Figure 13). HERO-Sign maintains consistently high speedups across all three parameter sets within the 2–64 block size range, achieving  $3.10\times$ – $3.10\times$  for 128f,  $2.92\times$ – $2.48\times$  for 192f, and  $2.60\times$ – $2.48\times$  for 256f. As the block size increases, speedup begins to degrade due to approaching the GPU’s resource limits. For 256f, the speedup drops most noticeably at block size 128, as the introduction of the *Relax\_Buffer* intensifies thread contention. However, at block size 512, speedup improves, reflecting the effectiveness of *Relax\_Buffer* in alleviating shared memory pressure. Therefore, for RTX 4090, when considering maximizing throughput, larger batch sizes (e.g.,  $\geq 512$ ) are preferred unless PCIe transfer becomes the bottleneck, to enable better overlap between host-device data transfers and computation, a smaller batch size near 64 is optimal.

**2) Sensitivity on Compilation overhead:** As discussed earlier, we introduce compile-time branching to enable automatic selection between PTX and native paths. To evaluate its impact, we measure the average compilation time across batch sizes from 2 to 1024 for all three parameter sets in Table XI. Interestingly, our approach reduces compilation time by  $1.28\times$ ,  $1.07\times$  and  $1.26\times$  compared to the Baseline. In other words, the reduction in compilation time from limiting the compiler’s optimization space via PTX outweighs the overhead introduced by template instantiation at the compilation stage.

**3) Sensitivity on Input Sizes:** We also evaluate performance across input lengths of 1K, 2K, 3K and 4K, with the block



Fig. 13: Performance (KOPS) comparison between Baseline and HERO-Sign (with Graph) under varying Block Sizes.

TABLE XI: Average compilation time (in seconds) from Block Size 2 to 1024 for Baseline and HERO-Sign.

| Parameter     | Baseline | HERO-Sign | Speedup |
|---------------|----------|-----------|---------|
| SPHINCS+-128f | 18.68    | 14.61     | 1.28×   |
| SPHINCS+-192f | 23.25    | 21.72     | 1.07×   |
| SPHINCS+-256f | 24.19    | 19.18     | 1.26×   |

size fixed at 1024. HERO-Sign achieves average speedups of  $1.30\times$ ,  $1.28\times$  and  $1.45\times$  on 128f, 192f and 256f, respectively. In practice, input messages of varying lengths are first hashed to produce a digest, from which the `leaf_idx` values are derived. These indices determine the signature path in SPHINCS<sup>+</sup>. However, since the tree structure and number of signing operations are fixed, the overall computational workload remains constant regardless of the input length.

#### F. Extension to different GPU Architectures

We extend HERO-Sign to a range of GPU architectures and apply the offline Tree Tuning algorithm to determine the maximum configurable dynamic shared memory per block on each platform. This allows us to generate a candidate set of optimized FORS fusion strategies tailored to the architectural constraints, from which the best-performing configuration is selected. To mitigate performance degradation caused by control-flow divergence in irregular fusion configurations, we prefer fusion strategies that minimize conditional branches.

Figure 14 presents the performance comparison across architectures. The Pascal architecture exhibits the lowest throughput and speedup due to its limited number of CUDA cores (1920) and significantly smaller shared memory per SM (64 KB), which restricts warp-level concurrency and fusion depth. In contrast, the Hopper architecture provides up to 228 KB of shared memory per SM, which allows for a larger number of active warps to be resident concurrently. Leveraging this, HERO-Sign achieves its highest performance gain on 256f, with a speedup of  $1.88\times$ . However, across all architectures, RTX 4090 consistently delivers the highest absolute performance. For instance, on 256f, HERO-Sign

achieves 33.88 KOPS on RTX 4090 versus 26.63 KOPS on H100. Although H100 has slightly more CUDA cores (16,896 vs. 16,384), its base clock frequency is significantly lower (1,035 MHz vs. 2,235 MHz on RTX 4090). Given that SPHINCS<sup>+</sup> is an arithmetic instruction-bound workload, performance scales closely with core frequency. Assuming ideal IPC (Instructions Per Cycle), the instruction throughput can be approximated as:

$$\text{Throughput} \propto \text{Core Count} \times \text{Frequency}$$

Thus, despite a smaller core count, RTX 4090 achieves a higher instruction throughput due to its  $2.16\times$  frequency advantage, which aligns with the observed performance superiority.



Fig. 14: Performance Comparison: Baseline vs HERO-Sign (with Graph) (Block=1024) Across GPU Architectures.

## V. CONCLUSION

In this paper, we propose HERO-Sign, a GPU-based accelerated framework for SPHINCS<sup>+</sup> signature generation. HERO-Sign achieves significant speedups across multiple GPU architectures. Maximizing SPHINCS<sup>+</sup> throughput remains a critical step toward its practical deployment as a post-quantum cryptography scheme, and our work serves as a foundation for continued exploration in this direction.

## ACKNOWLEDGMENT

This work is supported by the United States National Science Foundation under Grant No. 2530705.

## REFERENCES

- [1] D. M. Alter, P. Schwabe, and J. Daemen, “Optimizing the nist post quantum candidate sphincs+ using avx-512,” 2021.
- [2] D. Amiet, A. Curiger, and P. Zbinden, “FPGA-based accelerator for post-quantum signature scheme SPHINCS-256,” *IACR Transactions on Cryptographic Hardware and Embedded Systems*, vol. 2018, no. 1, pp. 18–39, 2018. [Online]. Available: <https://tches.iacr.org/index.php/TCHES/article/view/831>
- [3] D. Amiet, L. Leuenberger, A. Curiger, and P. Zbinden, “Fpga-based sphincs+ implementations: Mind the glitch,” in *2020 23rd Euromicro Conference on Digital System Design (DSD)*, 2020, pp. 229–237.
- [4] A. Aspuru-Guzik, A. D. Dutoi, P. J. Love, and M. Head-Gordon, “Simulated quantum computation of molecular energies,” *Science*, vol. 309, no. 5741, pp. 1704–1707, 2005. [Online]. Available: <https://www.science.org/doi/abs/10.1126/science.1113479>
- [5] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, “Starpu: A unified platform for task scheduling on heterogeneous multicore architectures,” *Concurrency and Computation: Practice and Experience*, vol. 23, no. 2, pp. 187–198, 2011.

- [6] D. J. Bernstein, A. Hülsing, S. Kölbl, R. Niederhagen, J. Rijneveld, and P. Schwabe, "The sphincs+ signature framework," in *Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security*, ser. CCS '19. New York, NY, USA: Association for Computing Machinery, 2019, p. 2129–2146. [Online]. Available: <https://doi.org/10.1145/3319535.3363229>
- [7] Q. Berthet, A. Upegui, L. Gantel, A. Duc, and G. Traverso, "An area-efficient sphincs+ post-quantum signature coprocessor," in *2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)*. IEEE, 2021, pp. 180–187.
- [8] J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd, "Quantum machine learning," *Nature*, vol. 549, no. 7671, pp. 195–202, 2017.
- [9] Y. Cao, J. Romero, and A. Aspuru-Guzik, "Potential of quantum computing for drug discovery," *IBM Journal of Research and Development*, vol. 62, no. 6, pp. 6:1–6:20, 2018.
- [10] H. Code, "Cuda shared memory access mechanism with vectorized instructions," <https://code.hitorimoe/post/cuda-shared-memory-access-mechanism-with-vectorized-instructions/>, Dec. 2023, accessed: Jul. 9, 2025.
- [11] S. Deshpande, Y. Lee, C. Karakuzu, J. Szefer, and Y. Paek, "Sphincslet: An area-efficient accelerator for the full sphincs+ digital signature algorithm," *ACM Trans. Embed. Comput. Syst.*, vol. 24, no. 5, Sep. 2025. [Online]. Available: <https://doi.org/10.1145/3728469>
- [12] B. Dong and Q. Wang, "Evaluating post-quantum cryptography on embedded systems: A performance analysis," 2024. [Online]. Available: <https://arxiv.org/abs/2409.05298>
- [13] L. K. Grover, "A framework for fast quantum mechanical algorithms," in *Proceedings of the thirtieth annual ACM symposium on Theory of computing(STOC)*, 1998, pp. 53–62.
- [14] T. Hanson, Q. Wang, S. Ghosh, F. Virdia, A. Reinders, and M. R. Sastry, "Optimization for sphincs+ using intel secure hash algorithm extensions," Cryptology ePrint Archive, Paper 2022/1726, 2022. [Online]. Available: <https://eprint.iacr.org/2022/1726>
- [15] S. Hong and H. Kim, "An analytical model for a gpu architecture with memory-level and thread-level parallelism awareness," in *Proceedings of the 36th Annual International Symposium on Computer Architecture*, ser. ISCA '09. New York, NY, USA: Association for Computing Machinery, 2009, p. 152–163. [Online]. Available: <https://doi.org/10.1145/1555754.1555775>
- [16] T.-W. Huang, "Taskflow: A general-purpose parallel and heterogeneous task programming system," <https://taskflow.github.io/>, 2023.
- [17] IBM Research, "Ibm quantum roadmap to 2033," <https://www.ibm.com/quantum/blog/quantum-roadmap-2033>, 2023, accessed: 2025-06-29.
- [18] Intel Corporation, "Intel oneapi dpc++ compiler," <https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compiler.html>, 2023.
- [19] D. Kim, H. Choi, and S. C. Seo, "Parallel implementation of sphincs+ with gpus," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. PP, no. 99, pp. 1–14, 2024. [Online]. Available: [https://www.researchgate.net/publication/378790147\\_Parallel\\_Implementation\\_of\\_SPHINCS\\_With\\_GPUs](https://www.researchgate.net/publication/378790147_Parallel_Implementation_of_SPHINCS_With_GPUs)
- [20] LLVM Project, "MLIR: Multi-level intermediate representation — linalg dialect and task dag," <https://mlir.llvm.org/>, 2024.
- [21] T. Lu, Y. Chen, Z. Wang, X. Wang, W. Chen, and J. Zhang, "Batchzk: A fully pipelined gpu-accelerated system for batch generation of zero-knowledge proofs," in *Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1*, ser. ASPLOS '25. New York, NY, USA: Association for Computing Machinery, 2025, p. 100–115. [Online]. Available: <https://doi.org/10.1145/3669940.3707270>
- [22] D. Moody, "Nist pqc standardization update," *National Institute of Standards and Technology*, pp. 2021–10, 2021.
- [23] NVIDIA, "Cuda graphs," <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs>, 2022, accessed: 2025-07-09.
- [24] NVIDIA Corporation, *CUDA C Programming Guide*, 2023, version 12.x. [Online]. Available: <https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>
- [25] NVIDIA Corporation, *NVIDIA Nsight Compute*, 2025, <https://developer.nvidia.com/nsight-compute>.
- [26] NVIDIA Developer Forum, "How to understand the bank conflict of shared memory?" <https://forums.developer.nvidia.com/t/how-to-understand-the-bank-conflict-of-shared-mem/260900/2>, 2023, accessed: Jul. 9, 2025.
- [27] N. I. of Standards and Technology, "NIST Selects HQC as Fifth Algorithm for Post-Quantum Encryption," March 2025, accessed: 2025-04-06. [Online]. Available: <https://www.nist.gov/news-events/news/2025/03/nist-selects-hqc-fifth-algorithm-post-quantum-encryption>
- [28] Open Quantum Safe, "liboqs: C library for prototyping and experimenting with quantum-resistant cryptography," 2025. [Online]. Available: <https://github.com/open-quantum-safe/liboqs>
- [29] J. Preskill, "Quantum computing in the nisq era and beyond," *Quantum*, vol. 2, p. 79, Aug. 2018. [Online]. Available: <http://dx.doi.org/10.22331/q-2018-08-06-79>
- [30] D. C. Seog Chung Seo, Sang Woo An, "Accelerating falcon post-quantum digital signature algorithm on graphic processing units," *Computers, Materials & Continua*, vol. 75, no. 1, pp. 1963–1980, 2023. [Online]. Available: <http://www.techscience.com/cmc/v75n1/51452>
- [31] D. Shasha and M. Snir, "Efficient and correct execution of parallel programs that share memory," *ACM Trans. Program. Lang. Syst.*, vol. 10, no. 2, p. 282–312, Apr. 1988. [Online]. Available: <https://doi.org/10.1145/42190.42277>
- [32] S. Shen, H. Yang, W. Dai, H. Zhang, Z. Liu, and Y. Zhao, "High-throughput gpu implementation of dilithium post-quantum digital signature," *IEEE Transactions on Parallel and Distributed Systems*, 2024.
- [33] P. Shor, "Algorithms for quantum computation: discrete logarithms and factoring," in *Proceedings 35th Annual Symposium on Foundations of Computer Science(STOC)*, 1994, pp. 124–134.
- [34] S. Sun, R. Zhang, and H. Ma, "Efficient parallelism of post-quantum signature scheme sphincs," *IEEE Transactions on Parallel and Distributed Systems*, vol. 31, no. 11, pp. 2542–2555, 2020.
- [35] M. Wahib and N. Maruyama, "Scalable kernel fusion for memory-bound gpu applications," in *SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis*, 2014, pp. 191–202.
- [36] L. Wan, F. Zheng, G. Fan, R. Wei, L. Gao, Y. Wang, J. Lin, and J. Dong, "A novel high-performance implementation of crystals-kyber with ai accelerator," in *Computer Security – ESORICS 2022: 27th European Symposium on Research in Computer Security, Copenhagen, Denmark, September 26–30, 2022, Proceedings, Part III*. Berlin, Heidelberg: Springer-Verlag, 2022, p. 514–534. [Online]. Available: [https://doi.org/10.1007/978-3-031-17143-7\\_25](https://doi.org/10.1007/978-3-031-17143-7_25)
- [37] Z. Wang, X. Dong, H. Chen, Y. Kang, and Q. Wang, "Cuspx: Efficient gpu implementations of post-quantum signature sphincs+," *IEEE Transactions on Computers*, vol. 74, no. 1, pp. 15–28, 2025.
- [38] J. Wu, Y. Yu, Z. Chen, H. Yang, C. Li, and Z. Liu, "Cbpspx: A cuda-based batch parallel optimization of post-quantum signature sphincs+," *IEEE Internet of Things Journal*, pp. 1–1, 2025.
- [39] Y. Zhou, K. Rajasekaran, and Q. Wang, "Exploring parallel implementation of sphincs+ using advanced vector extensions (avx) sets," in *2025 26th International Symposium on Quality Electronic Design (ISQED)*, 2025, pp. 1–8.