

# Probabilistic Computers for MIMO Detection: From Sparsification to 2D Parallel Tempering

M Mahmudul Hasan Sajeeb,<sup>1</sup> Corentin Delacour,<sup>1</sup> Kevin Callahan-Coray,<sup>1</sup>  
Sanjay Seshan,<sup>2</sup> Tathagata Srimani,<sup>2</sup> and Kerem Y. Camsari<sup>1</sup>

<sup>1</sup>*Department of Electrical and Computer Engineering,  
University of California, Santa Barbara, Santa Barbara, CA, 93106, USA*

<sup>2</sup>*Department of Electrical and Computer Engineering,  
Carnegie Mellon University, Pittsburgh, PA, 15213, USA*

(Dated: January 15, 2026)

Probabilistic computers built from p-bits offer a promising path for combinatorial optimization, but the dense connectivity required by real-world problems scales poorly in hardware. Here, we address this through graph sparsification with auxiliary copy variables and demonstrate a fully on-chip parallel tempering solver on an FPGA. Targeting MIMO detection, a dense, NP-hard problem central to wireless communications, we fit 15 temperature replicas of a 128-node sparsified system (1,920 p-bits) entirely on-chip and achieve bit error rates significantly below conventional linear detectors. We report complete end-to-end solution times of 4.7 ms per instance, with all loading, sampling, readout, and verification overheads included. ASIC projections in 7 nm technology indicate about 90 MHz operation with less than 200 mW power dissipation, suggesting that massive parallelism across multiple chips could approach the throughput demands of next-generation wireless systems. However, sparsification introduces sensitivity to the copy-constraint strength. Employing Two-Dimensional Parallel Tempering (2D-PT), which exchanges replicas across both temperature and constraint dimensions, we demonstrate over 10× faster convergence without manual parameter tuning. These results establish an on-chip p-bit architecture and a scalable algorithmic framework for dense combinatorial optimization.

## I. INTRODUCTION

Probabilistic computers built from p-bits have emerged as a promising architecture for combinatorial optimization and probabilistic sampling [1]. By mapping hard problems to Ising Hamiltonians and leveraging the natural dynamics of coupled stochastic units, the so-called Ising machines [2] can explore solution spaces efficiently. However, scaling p-bit systems to real-world problem sizes faces a fundamental challenge: the dense, all-to-all connectivity required by many applications grows quadratically, causing routing congestion and limiting clock frequency in CMOS and FPGA implementations.

This challenge is not unique to p-bits. Quantum annealers [3, 4], coherent Ising machines [5, 6], memristive arrays [7–9], and coupled oscillator networks [10–13] all confront the same connectivity bottleneck when targeting dense problem graphs. Prior work has addressed this through minor embedding [14, 15] or problem decomposition, but these approaches incur significant overhead in qubit/spin count [16] or require iterative off-chip communication. An alternative is graph sparsification whereby auxiliary copy nodes distribute connections across a sparse physical network with bounded local degree [17, 18].

MIMO detection exemplifies this challenge. The maximum-likelihood detector for this wireless communications problem maps to a dense Ising Hamiltonian and scales exponentially with system size [19, 20], while linear approaches such as MMSE sacrifice accuracy. This complexity-accuracy tradeoff has motivated numerous physics-inspired approaches for MIMO, including coherent Ising machines [19, 20], oscillator networks [21, 22], and quantum annealers [23, 24], but these typically require dense couplings that limit embeddable problem size or incur off-chip communication overhead that negates latency

benefits. MIMO detection thus serves as an ideal benchmark for evaluating sparsified on-chip architectures.

While sparsification solves the routing problem and enables constant-frequency scaling, it fundamentally alters the optimization landscape (FIG. 1a-b). The logical identity between a node and its copies must be enforced via ferromagnetic couplings with strength  $P$ . If  $P$  is too weak, copies disagree and the solution becomes infeasible; if  $P$  is too strong, the energy landscape becomes rigid, trapping the system in local minima. This critical tuning problem has limited the practical deployment of sparsified Ising machines [25].

Here, we address this challenge with both hardware and algorithms. We first implement a fully on-chip parallel tempering solver on an FPGA, fitting 15 temperature replicas of a 128-node sparsified system (1,920 p-bits total) entirely on-chip (FIG. 1c). Targeting  $64 \times 64$  MIMO detection, we achieve bit error rates significantly below conventional linear detectors with complete end-to-end solution times of 4.7 ms per instance including all loading, sampling, readout, and verification overheads. ASIC projections in 7 nm technology indicate 89 MHz operation at 185 mW, suggesting that massive parallelism across low-power chips could approach the throughput demands of next-generation wireless systems. However, on-chip experiments reveal that performance is highly sensitive to the copy-constraint strength  $P$ , degrading sharply outside a narrow optimal window. To eliminate this tuning bottleneck, we employ Two-Dimensional Parallel Tempering (2D-PT) [26], which exchanges replicas across both temperature and constraint dimensions (FIG. 1d). Validating on the Sherrington-Kirkpatrick spin glass and  $128 \times 128$  MIMO instances, we show that 2D-PT achieves over 10× faster convergence without manual parameter optimization, establishing a robust path toward scalable, tuning-free probabilistic computers.



**FIG. 1.** (a) Multiple-input multiple-output (MIMO) detection: transmitted symbols pass through a noisy fading channel and must be recovered at the receiver. (b) The maximum-likelihood detection cost maps to an Ising Hamiltonian with all-to-all connectivity (left). Sparsification introduces auxiliary copy nodes connected by ferromagnetic couplings of strength  $P$  (right). (c) On-chip parallel tempering with 15 temperature replicas instantiated entirely on an FPGA. Replicas exchange states along the inverse temperature ( $\beta$ ) axis while  $P$  remains fixed. (d) Two-Dimensional Parallel Tempering (2D-PT) arranges replicas on a grid, enabling exchanges along both  $\beta$  and  $P$  axes. This eliminates the need for manual  $P$ -tuning while improving mixing across energy barriers.

## II. MIMO DETECTION WITH P-BITS

In a multiple-input multiple-output (MIMO) communication system,  $N_t$  users each transmit a BPSK symbol  $x_i \in \{-1, +1\}$ , forming the vector  $\mathbf{x} \in \{-1, +1\}^{N_t}$ . The received signal is  $\mathbf{y} = \mathbf{Hx} + \mathbf{n}$ , where  $\mathbf{H} \in \mathbb{R}^{N_r \times N_t}$  is the Rayleigh fading channel and  $\mathbf{n}$  is additive white Gaussian noise. Maximum-likelihood detection seeks:

$$\hat{\mathbf{x}}_{\text{ML}} = \arg \min_{\mathbf{x} \in \{-1, +1\}^{N_t}} \|\mathbf{y} - \mathbf{Hx}\|^2 \quad (1)$$

Expanding and dropping constants, this maps to an Ising Hamiltonian  $E(\mathbf{s}) = -\sum_i h_i s_i - \sum_{i < j} J_{ij} s_i s_j$  [19, 20] with parameters

$$\mathbf{J} = -\mathbf{H}^T \mathbf{H}, \quad \mathbf{h} = \mathbf{H}^T \mathbf{y} \quad (2)$$

where any global prefactor is absorbed into the inverse temperature  $\beta$ . We solve this using p-bits updated according to:

$$I_i = \sum J_{ij} m_j + h_i, \quad m_i = \text{sgn}(\tanh(\beta I_i) - r_i) \quad (3)$$

where  $r_i \in [-1, 1]$  is a uniform random number. This update rule samples from the Boltzmann distribution at inverse temperature  $\beta$  [27]. To accelerate convergence and escape local minima, we employ parallel tempering with multiple replicas spanning a range of  $\beta$  values (Section III), and later extend to two-dimensional exchanges across both  $\beta$  and constraint strength  $P$  (Section V).

For reference, we compare against the minimum mean square error (MMSE) detector [28, 29],  $\hat{\mathbf{x}}_{\text{MMSE}} = (\mathbf{H}^T \mathbf{H} + \lambda \mathbf{I})^{-1} \mathbf{H}^T \mathbf{y}$ , a standard linear baseline that balances interference suppression against noise amplification. Note that the unregularized limit ( $\lambda \rightarrow 0$ ) corresponds to continuously relaxing the Ising spins, illustrating how discrete optimization can outperform linear detection.

## III. ON-CHIP PARALLEL TEMPERING

To map the dense MIMO problem onto hardware, we sparsify the all-to-all graph by introducing two auxiliary copies per node, connected by ferromagnetic couplings of



**FIG. 2.** (a) Preprocessing: a 64-node all-to-all MIMO problem is sparsified to 128 p-bits (two copies per node) and replicated across 15 inverse temperatures ( $\beta$  values). All replicas are loaded onto the FPGA in  $t_{\text{load}} = 160$  ns. (b) On-chip parallel tempering: each step performs Monte Carlo sweeps, energy evaluation, and replica swaps, with total step time  $t_{\text{step}}$ . (c) Verification: the best sparse configuration is projected to the original all-to-all graph via majority voting, and the energy is verified against the dense weight matrix in  $t_{\text{verify}} = 40$  ns. The complete solution time is  $t_{\text{instance}} = t_{\text{load}} + N_{\text{steps}} \times t_{\text{step}} + t_{\text{read}} + t_{\text{verify}}$ ; all timing values are averaged over 13,000 instances. (d) Bit error rate versus SNR for 64 × 64 BPSK MIMO detection. On-chip parallel tempering significantly outperforms MMSE, achieving complete solution times of 4.7 ms ( $N_{\text{steps}} = 1000$ ) and 47 ms ( $N_{\text{steps}} = 10000$ ). (e) Residual energy  $\rho_E^f = (E_{\text{meas}} - E_{\text{ground}})/N$  versus  $N_{\text{steps}}$  for several SNR values; upper axis shows corresponding wall-clock time. Error bars denote 95% bootstrap confidence intervals.

strength  $P$  (FIG. 1b). This yields 128 sparse p-bits from the original 64-node problem. We then implement Parallel Tempering (PT) [30, 31] directly on an FPGA similar to designs in Ref. [32, 33], with the distinction that all Monte Carlo sweeps, energy calculation, and Metropolis replica exchanges are entirely executed on-chip (FIG. 2b).

The sparse problem is expanded into a 15-temperature ladder, for a total of  $128 \times 15 = 1,920$  p-bits instantiated concurrently (FIG. 2a). An optimized copy strength  $P = 3.5$  is used to achieve the best bit-error rate for the computational budget. Weights and biases for all replicas are loaded once per instance with latency  $t_{\text{load}} = 160$  ns; after this one-time load, the dynamics proceed without further host communication.

During the run, the lowest-energy replica at each swap is tracked, and the final readout corresponds to the best configuration encountered. The sparse state is reduced to 64 logical p-bits via majority voting (ties resolved by coin flip) and verified against the original dense Ising Hamiltonian

in  $t_{\text{verify}} = 40$  ns. Each parallel tempering step comprises Monte Carlo sweeps, energy computation, and a swap attempt:

$$t_{\text{step}} = t_{\text{sweeps}} + t_{\text{energy}} + t_{\text{swap}} \quad (4)$$

The complete end-to-end time per instance is:

$$t_{\text{inst}} = t_{\text{load}} + N_{\text{steps}} t_{\text{step}} + t_{\text{read}} + t_{\text{verify}} \quad (5)$$

where  $t_{\text{read}} = 80$  ns. All timing values are averaged over 13,000 instances, and the reported solution times (4.7 ms and 47 ms) include all overheads: load, sweep, swap, read, and verify, with nothing hidden or neglected.

As shown in FIG. 2d, the on-chip solver achieves substantially lower bit error rates than the conventional MMSE detector. With  $N_{\text{steps}}=1,000$  and  $10,000$  (100 sweeps per step), measured runtimes are 4.7 ms and 47 ms per instance, respectively, including all overheads. Results



**FIG. 3.** Graphic Database System II (GDSII) layout for a  $64 \times 64$  MIMO problem, mapped to a 64-node all-to-all graph and sparsified with two copies per node ( $N = 128$  physical p-bits). The design was generated using the ASAP7 predictive 7 nm Process Design Kit (PDK) via the *mflowgen* toolchain and follows a three-level hierarchy: (a) Top-level floorplan integrating 15 parallel replicas for 1D-PT, occupying a total integrated chip area of  $10.6 \times 10.6 \text{ mm}^2$  (including power, pins, and other overhead), with logic area occupying  $55.9 \text{ mm}^2$ . (b) Layout of a single replica macro ( $1.9 \times 1.9 \text{ mm}^2$ ). (c) Detailed layout of an individual p-bit macro ( $0.12 \times 0.12 \text{ mm}^2$ ).

TABLE I. Key metrics of synthesized p-computer designs: maximum frequency, chip area, and total expected power usage.

| Design Size              | No. of P-bits | Expected Freq. (MHz) | Total Area (mm <sup>2</sup> ) | Total Power (mW) | Total Gate Instances |
|--------------------------|---------------|----------------------|-------------------------------|------------------|----------------------|
| $64 \times 64, 2$ copies | 1920          | 88.6                 | 112.36                        | 185.2            | 1.2M                 |

are averaged over 100 independent channel realizations, each decoding 10 transmitted vectors. The residual energy (FIG. 2e) decreases systematically with  $N_{\text{steps}}$ , saturating earlier at low SNR and following power-law scaling at high SNR.

#### IV. ASIC DESIGN & IMPLEMENTATION

To project the performance of our solver to a dedicated hardware platform, we targeted an Application-Specific Integrated Circuit (ASIC) implementation of the same  $64 \times 64$  BPSK MIMO instance evaluated on the FPGA. We explored the configuration with two copies per node using the ASAP7 predictive 7 nm FinFET Process Design Kit (PDK) [34]. To manage the complexity of this large-scale system, we employed a custom hierarchical synthesis flow via *mflowgen* [35] that partitions the design into three physical levels: individual p-bit macros, replica blocks, and the full top-level array of sparsely interconnected p-bits. This top-level design was fully synthesized, including a realistic clock-tree distribution network, to ensure the rigorous characterization of area, timing, and power metrics.

Based on a 100 MHz reference clock, timing closure was extrapolated from the worst-case setup time. The  $64 \times 64$  BPSK MIMO system, sparsified with two copies per node and instantiating 15 parallel replicas, achieves an operating frequency of 89 MHz with a total power consumption of 185.2 mW (total of static and dynamic at 50% gate activity/swapping factor) and an effective logic area (not including power, pins, and other overhead that is present in the layout) of  $56 \text{ mm}^2$ . The corresponding, fully integrated, physical layout (GDS mask) is illustrated in FIG 3.

Results for the two-copy ( $N = 128$  sparse p-bits per replica) configuration are summarized in Table I.

#### V. 2D PARALLEL TEMPERING

The optimal coupling strength  $P$  between copy nodes depends on the Monte Carlo budget and typically requires significant preprocessing. FIG. 4a shows the residual energy of 64-spin Sherrington-Kirkpatrick (SK) instances found by 1D Parallel Tempering (1D-PT) for various copy strengths and number of swaps (100 sweeps per swap). Regardless of the Monte Carlo budget, small copy strengths cannot find good solutions due to copy disagreement. Large copy strengths are ideal in theory as they enforce copy constraints. However, hard copy constraints hinder the search and require a large number of samples to find low-energy states, such as 50,000 swaps for reaching  $\rho_E^f \approx 3 \times 10^{-5}$  at  $P = 2$ . In contrast, a similar energy residual is found for only 2,000 swaps at the optimal copy strength  $P \approx 1.25$ —a  $25\times$  speedup compared to  $P = 2$ . As we increase the number of swaps, the energy minimum decreases with larger copy strengths (shifts to the right), consistent with the hard-constraint limit that recovers the original SK energy minima.

Instead of co-optimizing penalty strength and Monte Carlo budget, we use Two-Dimensional Parallel Tempering (2D-PT) [26], which automatically enforces hard constraints in the latest replicas, while benefiting from fast-mixing replicas having small to medium penalty strengths. As illustrated in FIG. 1d, 2D-PT explores various copy strengths in a new dimension dedicated to constraint exploration, orthogonal to the inverse temperature axis ( $\beta$ ) and resulting in an array of replicas. Each column of the 2D array has a fixed penalty strength  $P$ , interpolating from soft constraints in the first column to hard constraints in the last one. In addition to vertical configuration exchanges along the inverse temperature ( $\beta$ -swaps), new horizontal exchanges in the penalty direction ( $P$ -swaps) transfer feasible states found in the lower columns (fast mixing) to the higher columns that



**FIG. 4.** (a) Residual energy per spin  $\rho_E^f$  for the Sherrington–Kirkpatrick (SK) spin glass ( $N = 64$ , two-copy sparsification) under 1D parallel tempering as a function of copy strength  $P$ . For smaller swap budgets,  $\rho_E^f$  decreases with increasing  $P$ , reaches a minimum near  $P \approx 1.5$ , then increases as excessive copy strength traps the system in local minima. For larger swap budgets, the algorithm escapes these traps, yielding lower residual energies even at large  $P$ . (b) Copy-agreement percentage across the 2D replica grid ( $\beta$  rows,  $P$  columns) with  $P \in [0.5, 2]$  reported in Table II. The multi-column  $P$  ladder enforces near-perfect agreement ( $> 99\%$ ) in the rightmost (high- $P$ ) column while allowing state exploration in low- $P$  columns. (c) Residual energy versus number of swaps. 2D-PT reaches the ground state for all instances in  $\approx 10^3$  swaps, whereas 1D-PT (using optimized  $P$  at each budget) requires  $> 10^5$  swaps to reach a residual energy of  $10^{-5}$ . All data averaged over 100 instances  $\times$  100 trials, error bars denote 95% bootstrap confidence intervals.

enforce constraints (slow mixing). FIG. 4b shows the  $10 \times 10$  array used for the next SK experiments, where the last-column bottom replica enforces a mean copy-agreement of 99.99% and holds the lowest-energy solution during the search.

We first apply 2D-PT on 64-spin SK instances, which, like MIMO, are all-to-all graphs with Gaussian coupling elements, but have no noise or external fields. As for MIMO, the SK graphs are sparsified to 128 spins with two copies per spin to fit in the FPGA. The 2D-PT  $\beta$  and  $P$  schedules are found with the adaptive algorithm described in the Method Section and reported in Table II. 1D-PT employs the same ten  $\beta$  values as 2D-PT. To ensure a fair comparison, each 1D-PT data point from FIG. 4c uses the copy strength  $P$  optimized for that specific swap budget (provided by FIG. 4a), resulting in the best possible 1D-PT configuration for the given  $\beta$  schedule. The 2D-PT experiments in FIG. 4c are performed in Matlab simulation as an initial exploration; hardware implementation is left for future ASIC designs. While 1D-PT requires more than  $10^5$  swaps to reach a mean residual energy of  $\rho_E^f \approx 10^{-5}$ , 2D-PT reaches the same residual energy in roughly 200 swaps, providing a  $500\times$  speedup. Moreover, 2D-PT finds the constrained ground states ( $\rho_E^f = 0$ ) in fewer than 1,000 swaps. Here, 2D-PT does not just trade space with runtime: it employs 10 $\times$  more replicas than 1D-PT but is at least 500 $\times$  faster to find low-energy states, highlighting a significant computational advantage compared to 1D-PT.

We then apply 2D-PT to  $128 \times 128$  BPSK MIMO instances using a  $16 \times 13$  replica array (16  $\beta$  values, 13  $P$  values) with 5 sweeps per swap; schedules are given in Table II. For 2,000 swaps, 2D-PT significantly outperforms MMSE across the entire SNR range, with fully decoded signals (BER=0) at large SNR values ( $> 15$  dB). In contrast, 1D-PT barely improves MMSE for the same Monte Carlo budget and has a larger BER at high SNR as shown in FIG 5a. When increasing the number

of swaps to 20,000, 1D-PT recovers 2D-PT’s performance in noisy regimes (low SNR), but cannot perfectly decode all instances at high SNR (error floor of around  $10^{-5}$ ). Beyond accelerating MIMO decoding by more than 10 $\times$  compared to 1D-PT, 2D-PT has the additional advantage of not requiring tuning the penalty strength for every Monte Carlo budget (as in FIG 4a), owing to its last high-penalty column that enforces copy-agreement.

2D-PT’s only requirement for high performance is to enable some state swaps between adjacent replica pairs, acting as global moves to escape local minima. The exchange probability between replicas depends on the  $\beta$  and  $P$  schedules, obtained by averaging individual schedules from 10 random instances and reported in Table II. The high swap probabilities in FIG. 5b confirm efficient mixing via frequent state exchanges along the  $P$ -axis, enabling feasible states to migrate into hard-constraint replicas. The  $\beta$ -axis swap probability remains high across all temperature pairs (FIG. 5c), transferring low-energy solutions to the final  $\beta$  row. While our current FPGA only implements 1D-PT due to capacity limits, its modular architecture, built from Monte Carlo sweeps, energy calculation, and swap logic blocks, is designed to support 2D-PT in future ASIC implementations.

## VI. CONCLUSION

We have demonstrated a fully on-chip probabilistic computer for dense combinatorial optimization, using MIMO detection as a benchmark. By sparsifying the all-to-all problem graph and implementing parallel tempering entirely on an FPGA, we fit 1,920 p-bits (15 temperature replicas of a 128-node sparse system) on a single device. The solver achieves bit error rates significantly below conventional MMSE detectors with complete end-to-end solution times of 4.7 ms per instance, including all loading, sampling,



**FIG. 5.** (a) Bit error rate (BER) versus signal-to-noise ratio (SNR) for  $128 \times 128$  BPSK MIMO detection, averaged over 200 channels and 10 transmitted symbols per channel. All instances are sparsified with two copies per node (256 p-bits per replica). 1D-PT results are shown for two swap budgets ( $N_{\text{swaps}} = 2,000$  and 20,000) using a 16-replica  $\beta$  ladder; increasing the budget by 10× improves performance but 1D-PT still exhibits an error floor at high SNR. 2D-PT uses a  $16 \times 13$  replica array with the same  $\beta$  values and a distinct  $P$  per column (values in Table II), achieving substantially lower BER and reaching zero errors at high SNR. For both methods, the final solution is selected as the minimum-energy state after mapping back to the all-to-all problem; for 2D-PT, this selection is restricted to the coldest replicas (last row). (b,c) Mean swap probabilities along the  $P$ -axis and  $\beta$ -axis, respectively, averaged over all rows and columns.

readout, and verification overheads. ASIC projections in 7 nm technology indicate 89 MHz operation at 185 mW, suggesting that parallel arrays of such chips could approach the throughput demands of next-generation wireless systems.

However, sparsification introduces sensitivity to the copy-constraint strength  $P$ , which must be tuned for each computational budget. We addressed this using Two-Dimensional Parallel Tempering (2D-PT), which exchanges replicas across both temperature and constraint dimensions. On the Sherrington–Kirkpatrick spin glass, 2D-PT achieves about 500× faster performance. On  $128 \times 128$  MIMO, 2D-PT achieves zero bit errors at high SNR where 1D-PT exhibits an error floor. These results establish a complete on-chip p-bit architecture and a robust algorithmic framework for scalable probabilistic computing on dense optimization problems.

## VII. METHODS

### A. FPGA design

The on-chip parallel tempering solver was implemented on a *Corigine* FPGA accelerator card, communicating with a host CPU via PCIe for initialization and readout. Coupling weights  $J_{ij}$  and biases  $h_i$  (pre-multiplied by inverse temperature  $\beta$ ) use 10-bit fixed-point precision (1 sign, 6 integer, 3 fractional bits). We instantiated 15 parallel replicas, each containing 128 p-bits (a sparsified  $N = 64$  graph with two copies per node), yielding 1,920 p-bits on-chip. This density represents the maximum feasible packing on a single device without

violating timing closure or power constraints.

To minimize logic footprint and avoid DSP usage, replica temperatures are preprocessed: the CPU uploads  $\beta_r J_{ij}$  and  $\beta_r h_i$  directly, eliminating on-chip multiplication. The p-bit update becomes:

$$\beta_r I_i = \sum_j (\beta_r J_{ij}) m_j + \beta_r h_i \quad (6)$$

The system energy is reconstructed from local contributions:

$$\beta_r E = \frac{1}{2} \sum_i -m_i (\beta_r I_i + \beta_r h_i) \quad (7)$$

This allows each p-bit's energy contribution to be computed with a single adder and multiplexer in LUT logic, bypassing DSP multipliers. Local terms are accumulated via a shift register and adder tree, enabling  $\mathcal{O}(N/d)$  energy evaluation where  $d$  is the shift-register stride.

As illustrated in the block-level view in Fig. 6, multiple replicas operate in parallel at different inverse temperatures, and their total energies are streamed into a shared swap controller to evaluate replica exchanges. The Metropolis acceptance probability for swapping replicas  $R_a$  and  $R_b$  is defined as  $p_{\text{swap}} = \min(1, \exp(\Delta))$ , where  $\Delta = (\beta_b - \beta_a)(E_b - E_a)$ . Since replica modules are only capable of providing a temperature-based energy  $\beta_r E_r$ , the log-probability argument  $\Delta$  is expanded to accommodate the



**FIG. 6.** A simplified block-level view of the on-chip FPGA implementation of parallel tempering. Each replica contains a full graph of p-bits, where every p-bit computes its influence field ( $I_i$ ), applies the tanh activation, generates stochastic updates via an LFSR, and produces a local energy contribution. These contributions are pushed to a shift register to be accumulated into the replica energy. All replicas at different inverse temperatures ( $\beta_1, \beta_2, \beta_3, \dots$ ) operate in parallel. Swap controllers are shared between three replicas; edge replicas can be connected to two swap controllers to cover all swap pairs. The controller evaluates the Metropolis acceptance rule  $e^{\Delta\beta \cdot \Delta E}$  with an approximate method, compares it to a hardware LFSR, and issues a swap direction signal that exchanges the states of adjacent replicas. This architecture enables fully on-chip execution of sweeps, energy calculations, and replica exchanges without host intervention.

provided information:

$$\Delta = \left(1 - \frac{\beta_a}{\beta_b}\right) (\beta_b E_b) + \left(1 - \frac{\beta_b}{\beta_a}\right) (\beta_a E_a) \quad (8)$$

where scaling factors are precomputed by the CPU. This optimization limits the log-probability calculation to 2 multiplies and 1 add per replica pair.

While many exponential function implementations exist, this system required a single-cycle function that utilized minimal area. We replace the standard exponential with a base-2 approximation derived from a first-order Taylor expansion, optimized for fixed-point arithmetic:

$$e^x = 2^{x \log_2 e} \approx 2^{\lfloor \frac{23}{16}x \rfloor} \left(1 + \frac{23}{16}x - \left\lfloor \frac{23}{16}x \right\rfloor\right) \quad (9)$$

The fractional multiplication factor  $\frac{23}{16}$  is implemented efficiently using bit-shifts ( $x + x/2 + x/4 + x/16$ ) and additions. This approximation provides sufficient numerical fidelity for the Metropolis criterion while drastically reducing resource usage.

Finally, a central Finite State Machine (FSM) controls the parallel execution of all 15 replicas through a pipelined sequence: (i) **Sweep Phase**, where p-bits update in parallel according to a graph-colored schedule; (ii) **Energy Phase**, where local energies are accumulated via the pipelined adder trees; and (iii) **Swap Phase**, where the Swap Controller evaluates the swap probability and triggers instantaneous state transfers between adjacent replicas. The total reference clock cycles required for one swap step ( $C_{step}$ ) is determined by the cumulative latency of these phases:

$$C_{step} = N_{color} \cdot N_{sweep} + \lceil N/d \rceil + 7 \quad (10)$$

Here,  $N_{color}$  is the number of graph colors required for the update schedule,  $N_{sweep}$  represents the sweep-to-swap ratio (Monte Carlo sweeps performed between swap attempts), and  $\lceil N/d \rceil$  represents the cycles needed for energy accumulation with stride  $d$ . The constant ‘7’ accounts for the pipeline overhead cycles required by the swap controller (this pipeline is not strictly optimized). Readout is handled in the PCIe’s clock domain; consequently, the absolute wall-clock time per swap step ( $t_{step}$ ) can be estimated as:

$$t_{step} = \left\lceil C_{step} \times \frac{f_{PCIe}}{f_{sys}} \right\rceil \times \frac{1}{f_{PCIe}} \quad (11)$$

This deterministic timing model allows for precise performance characterization and synchronization across different FPGA platforms and clock domains. For our case, the PCIe clock frequency is 125 MHz.

## B. 2D-PT parameters

The  $\beta$  and  $P$  schedules for 2D-PT are determined iteratively using the adaptive algorithm proposed in Ref. [26]. Beginning with the initial  $\beta$  and  $P$  values at the top-left replica, the parameter grid is expanded according to:

$$\begin{aligned} \beta(i+1, j) &= \beta(i, j) + \alpha_\beta / \sigma_E \\ P(i, j+1) &= P(i, j) + \alpha_P / (\beta(i, j) \sigma_g) \end{aligned} \quad (12)$$

Here,  $i$  and  $j$  denote the indices along the inverse temperature ( $\beta$ ) and constraint ( $P$ ) dimensions, respectively, while  $\alpha_\beta$  and  $\alpha_P$  control the step size between consecutive values. The terms  $\sigma_E$  and  $\sigma_g$  represent the standard deviations of the energy and the constraint function ( $g = 0$  indicating satisfied

TABLE II. 2D-PT parameters for the SK ( $N = 64$ ) and MIMO ( $N = 128$ ) experiments, where both problems are sparsified with 2 copies per original node.  $\alpha_\beta$  and  $\alpha_P$  are the parameters used in the adaptive algorithm to obtain the  $\beta$  and  $P$  values iteratively [26].

| Problem                | Param.                | Row/Column     | 1     | 2     | 3     | 4     | 5     | 6     | 7    | 8    | 9    | 10   | 11   | 12   | 13   | 14   | 15   | 16  |
|------------------------|-----------------------|----------------|-------|-------|-------|-------|-------|-------|------|------|------|------|------|------|------|------|------|-----|
| (SK<br>$(N = 64)$ )    | $\alpha_\beta = 2.5$  | $\beta$ -value | 0.500 | 0.801 | 1.13  | 1.52  | 2.05  | 2.92  | 4.62 | 8.60 | 24.6 | 138  | –    | –    | –    | –    | –    |     |
|                        | $\alpha_P = 0.4$      | $P$ -value     | 0.500 | 0.572 | 0.649 | 0.740 | 0.846 | 0.970 | 1.12 | 1.33 | 1.59 | 2.05 | –    | –    | –    | –    | –    |     |
| (MIMO<br>$(N = 128)$ ) | $\alpha_\beta = 1.25$ | $\beta$ -value | 0.500 | 0.564 | 0.647 | 0.743 | 0.859 | 1.00  | 1.18 | 1.40 | 1.71 | 2.16 | 2.88 | 4.24 | 7.46 | 18.0 | 72.0 | 332 |
|                        | $\alpha_P = 0.75$     | $P$ -value     | 0.800 | 0.970 | 1.15  | 1.33  | 1.54  | 1.76  | 2.02 | 2.32 | 2.69 | 3.21 | 3.96 | 4.85 | 9.44 | –    | –    | –   |

constraints) for replica  $(i, j)$ , estimated using 20 independent chains of  $10^3$  Monte Carlo sweeps. The adaptive algorithm stops row expansion when  $\sigma_E$  falls below a defined threshold (one-tenth of the mean weight amplitude) and halts column expansion once all samples satisfy the constraints. The final  $\beta$  and  $P$  vectors are derived from the median values of the arrays along the column and row directions, respectively. The step parameters  $\alpha_\beta$  and  $\alpha_P$  were initially tuned on a single instance to ensure a minimum of 10 values in each direction. For the experiments shown in FIG. 4 and FIG. 5, the final schedules are averaged over 10 instances of SK and MIMO problems (Table II). In all cases, the single-column 1D-PT employs the same  $\beta$  schedule as 2D-PT.

#### ACKNOWLEDGMENT

MMHS, KCC, CD, and KYC acknowledge support from the Office of Naval Research (ONR), Multidisciplinary University Research Initiative (MURI) grant N000142312708. MMHS and KYC acknowledge support from the Semiconductor Research Corporation (SRC) grant. SS and TS acknowledge support from CMU CIT dean's fellowship.

#### DATA AVAILABILITY

The data that support the findings of this article are not publicly available. The data are available from the authors upon reasonable request.

#### CODE AVAILABILITY

The computer code used in this study is available from the corresponding author upon reasonable request.

#### AUTHOR CONTRIBUTIONS

MMHS and KYC conceived the study. KYC supervised the study. MMHS performed on-chip (FPGA) 1D/PT experiments. CD conducted 2D-PT experiments in simulation. MMHS and CD performed all of the benchmarking. KCC developed RTL code to implement on-chip PT. SS and TS implemented the ASIC physical design. All authors discussed and analyzed the experiments and participated in writing the manuscript.

#### COMPETING INTERESTS

The authors declare no competing interests.

#### REFERENCES

- [1] Shuvro Chowdhury, Jasper Pieterse, Navid Anjum Aadit, Johan H Mentink, and Kerem Y Camsari. Probabilistic computers for neural quantum states. *arXiv preprint arXiv:2512.24558*, 2025.
- [2] Naeimeh Mohseni, Peter L McMahon, and Tim Byrnes. Ising machines as hardware solvers of combinatorial optimization problems. *Nature Reviews Physics*, 4(6):363–379, 2022.
- [3] Mark W Johnson, Mohammad HS Amin, Suzanne Gildert, Trevor Lanting, Firas Hamze, Neil Dickson, Richard Harris, Andrew J Berkley, Jan Johansson, Paul Bunyk, et al. Quantum annealing with manufactured spins. *Nature*, 473(7346):194–198, 2011.
- [4] Andrew D King, Jack Raymond, Trevor Lanting, Richard Harris, Alex Zucca, Fabio Altomare, Andrew J Berkley, Kelly Boothby, Sara Ejtemaei, Colin Enderud, et al. Quantum critical dynamics in a 5,000-qubit programmable spin glass. *Nature*, 617(7959):61–66, 2023.
- [5] Peter L McMahon, Alireza Marandi, Yoshitaka Haribara, Ryan Hamerly, Carsten Langrock, Shuhei Tamate, Takahiro Inagaki, Hiroki Takesue, Shoko Utsunomiya, Kazuyuki Aihara, et al. A fully programmable 100-spin coherent Ising machine with all-to-all connections. *Science*, 354(6312):614–617, 2016.
- [6] Toshimori Honjo, Tomohiro Sonobe, Kensuke Inaba, Takahiro Inagaki, Takuya Ikuta, Yasuhiro Yamada, Takushi Kazama, Koji Enbusu, Takeshi Umeki, Ryoichi Kasahara, Ken-ichi Kawarabayashi, and Hiroki Takesue. 100,000-spin coherent Ising machine. *Science Advances*, 7(40):eabhw0952, October 2021. ISSN 2375-2548.
- [7] Z. Fahimi, M. R. Mahmoodi, H. Nili, Valentin Polishchuk, and D. B. Strukov. Combinatorial optimization by weight annealing in memristive hopfield networks. *Scientific Reports*, 11(1):16383, August 2021. ISSN 2045-2322.
- [8] Mingrui Jiang, Keyi Shan, Chengping He, and Can Li. Efficient combinatorial optimization by quantum-inspired parallel annealing in analogue memristor crossbar. *Nature Communications*, 14(1):5927, September 2023. ISSN 2041-1723.
- [9] Yihan He, Ming-Chun Hong, Qiming Ding, Chih-Sheng Lin, Chih-Ming Lai, Chao Fang, Xiao Gong, Tuo-Hung Hou, and Gengchiau Liang. A hardware demonstration of a universal programmable rram-based probabilistic computer for molecular docking. *Nature Communications*, 2025.
- [10] Antik Mallick, Mohammad Khairul Bashar, Daniel S Truesdell, Benton H Calhoun, Siddharth Joshi, and Nikhil Shukla. Using

- synchronized oscillators to compute the maximum independent set. *Nature communications*, 11(1):1–7, 2020.
- [11] William Moy, Ibrahim Ahmed, Po-wei Chiu, John Moy, Sachin S Sapatnekar, and Chris H Kim. A 1,968-node coupled ring oscillator circuit for combinatorial optimization problem solving. *Nature Electronics*, 5(5):310–317, 2022.
- [12] Hao Lo, William Moy, Hanzhao Yu, Sachin Sapatnekar, and Chris H Kim. An ising solver chip based on coupled ring oscillators with a 48-node all-to-all connected array architecture. *Nature Electronics*, 6(10):771–778, 2023.
- [13] Markus Graber and Klaus Hofmann. An integrated coupled oscillator network to solve optimization problems. *Communications Engineering*, 3(1):116, Aug 2024. ISSN 2731-3395.
- [14] Vicky Choi. Minor-embedding in adiabatic quantum computation: II. Minor-universal graph design. *Quantum Information Processing*, 10(3):343–353, 2011.
- [15] Yuya Sugie, Yuki Yoshida, Normann Mertig, Takashi Takemoto, Hiroshi Teramoto, Atsuyoshi Nakamura, Ichigaku Takigawa, Shin-ichi Minato, Masanao Yamaoka, and Tamiki Komatsuzaki. Minor-embedding heuristics for large-scale annealing processors with sparse hardware graphs of up to 102,400 nodes. *arXiv preprint arXiv:2004.03819*, 2020.
- [16] Ryan Hamerly, Takahiro Inagaki, Peter L. McMahon, Davide Venturelli, Alireza Marandi, Tatsuhiro Onodera, Edwin Ng, Carsten Langrock, Kensuke Inaba, Toshimori Honjo, Koji Enbutsu, Takeshi Umeki, Ryoichi Kasahara, Shoko Utsunomiya, Satoshi Kako, Ken-ichi Kawarabayashi, Robert L. Byer, Martin M. Fejer, Hideo Mabuchi, Dirk Englund, Eleanor Rieffel, Hiroki Takesue, and Yoshihisa Yamamoto. Experimental investigation of performance differences between coherent Ising machines and a quantum annealer. *Science Advances*, 5(5):eaau0823, May 2019. ISSN 2375-2548.
- [17] Navid Anjum Aadit, Andrea Grimaldi, Mario Carpentieri, Luke Theogarajan, John M Martinis, Giovanni Finocchio, and Kerem Y Camsari. Massively parallel probabilistic computing with sparse ising machines. *Nature Electronics*, 5(7):460–468, 2022.
- [18] M Mahmudul Hasan Sajeeb, Navid Anjum Aadit, Shuvro Chowdhury, Tong Wu, Cesely Smith, Dhruv Chinmoy, Atharva Raut, Kerem Y Camsari, Corentin Delacour, and Tathagata Srimani. Scalable connectivity for Ising machines: Dense to sparse. *Physical Review Applied*, 24(1):014005, 2025.
- [19] Abhishek Kumar Singh, Kyle Jamieson, Peter L McMahon, and Davide Venturelli. Ising machines’ dynamics and regularization for near-optimal MIMO detection. *IEEE Transactions on Wireless Communications*, 21(12):11080–11094, 2022.
- [20] Abhishek Kumar Singh, Ari Kapelyan, Minsung Kim, Davide Venturelli, Peter L McMahon, and Kyle Jamieson. Uplink MIMO detection using Ising machines: A multi-stage Ising approach. *IEEE Transactions on Wireless Communications*, 2024.
- [21] Shreesha Sreedhara, Jaijeet Roychowdhury, Joachim Wabnig, and Pavan Koteswaran Srinath. MU-MIMO Detection Using Oscillator Ising Machines. In *2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD)*, pages 1–9. IEEE, 2023.
- [22] Naomi Sagan and Jaijeet Roychowdhury. Das: Implementing dense ising machines using sparse resistive networks. In *Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design*, pages 1–9, 2022.
- [23] Ioannis Krikidis. MIMO with analogue 1-bit phase shifters: A quantum annealing perspective. *IEEE Wireless Communications Letters*, 13(6):1571–1575, 2024.
- [24] Minsung Kim, Abhishek Kumar Singh, Davide Venturelli, John Kaewell, and Kyle Jamieson. X-ResQ: Reverse annealing for quantum MIMO detection with flexible parallelism. *arXiv preprint arXiv:2402.18778*, 2024.
- [25] Corentin Delacour. Self-adaptive ising machines for constrained optimization. In *2025 Design, Automation & Test in Europe Conference (DATE)*, pages 1–7. IEEE, 2025.
- [26] Corentin Delacour, M Mahmudul Hasan Sajeeb, João P Hespanha, and Kerem Y Camsari. Two-dimensional parallel tempering for constrained optimization. *Physical Review E*, 112(2):L023301, 2025.
- [27] Kerem Yunus Camsari, Rafatul Faria, Brian M Sutton, and Supriyo Datta. Stochastic p-bits for invertible logic. *Physical Review X*, 7(3):031014, 2017.
- [28] H Vincent Poor. Turbo multiuser detection: An overview. In *2000 IEEE Sixth International Symposium on Spread Spectrum Techniques and Applications. ISSTA 2000. Proceedings (Cat. No. 00TH8536)*, volume 2, pages 583–587. IEEE, 2000.
- [29] Sergio Verdu. *Multiuser detection*. Cambridge university press, 1998.
- [30] Robert H Swendsen and Jian-Sheng Wang. Replica monte carlo simulation of spin glasses. *Physical review letters*, 57(21):2607–2609, 1986.
- [31] Koji Hukushima and Koji Nemoto. Exchange monte carlo method and application to spin glass simulations. *Journal of the Physical Society of Japan*, 65(6):1604–1608, 1996.
- [32] Navid Anjum Aadit, Masoud Mohseni, and Kerem Y Camsari. Accelerating Adaptive Parallel Tempering with FPGA-based p-bits. In *2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits)*, pages 1–2. IEEE, 2023.
- [33] Srijan Nikhar, Sidharth Kannan, Navid Anjum Aadit, Shuvro Chowdhury, and Kerem Y Camsari. All-to-all reconfigurability with sparse and higher-order ising machines. *Nature Communications*, 15(1):8977, 2024.
- [34] Lawrence T Clark, Vinay Vashishtha, Lucian Shifren, Aditya Guja, Saurabh Sinha, Brian Cline, Chandrasekaran Ramamurthy, and Greg Yeric. Asap7: A 7-nm finfet predictive process design kit. *Microelectronics Journal*, 53:105–115, 2016.
- [35] Alex Carsello, James Thomas, Ankita Nayak, Po-Han Chen, Mark Horowitz, Priyanka Raina, and Christopher Tornq. mflowgen: A modular flow generator and ecosystem for community-driven physical design. In *Proceedings of the 59th ACM/IEEE Design Automation Conference*, pages 1339–1342, 2022.