

# Training Deep Boltzmann Networks with Sparse Ising Machines

Shaila Niazi,<sup>1,\*</sup> Navid Anjum Aadit,<sup>1</sup> Masoud Mohseni,<sup>2</sup> Shuvro Chowdhury,<sup>1</sup> Yao Qin,<sup>1,3</sup> and Kerem Y. Camsari<sup>1,†</sup>

<sup>1</sup>*Department of Electrical and Computer Engineering,  
University of California, Santa Barbara, Santa Barbara, CA, 93106, USA*

<sup>2</sup>*Google Quantum AI, Venice, CA 90291, USA*

<sup>3</sup>*Google Research*

The slowing down of Moore’s law has driven the development of unconventional computing paradigms, such as specialized Ising machines tailored to solve combinatorial optimization problems. In this paper, we show a new application domain for probabilistic bit (p-bit) based on Ising machines by training deep generative AI models with them. Using sparse, asynchronous, and massively parallel Ising machines we train deep Boltzmann networks in a hybrid probabilistic-classical computing setup. We use the full MNIST and Fashion MNIST (FMNIST) dataset without any downsampling and a reduced version of CIFAR-10 dataset in hardware-aware network topologies implemented in moderately sized Field Programmable Gate Arrays (FPGA). For MNIST, our machine using only 4,264 nodes (p-bits) and about 30,000 parameters achieves the same classification accuracy (90%) as an optimized software-based restricted Boltzmann Machine (RBM) with approximately 3.25 million parameters. Similar results follow for FMNIST and CIFAR-10. Additionally, the sparse deep Boltzmann network can generate new handwritten digits and fashion products, a task the 3.25 million parameter RBM fails at despite achieving the same accuracy. Our hybrid computer takes a measured 50 to 64 billion probabilistic flips per second, which is at least an order of magnitude faster than superficially similar Graphics and Tensor Processing Unit (GPU/TPU) based implementations. The massively parallel architecture can comfortably perform the contrastive divergence algorithm (CD- $n$ ) with up to  $n=10$  million sweeps per update, beyond the capabilities of existing software implementations. These results demonstrate the potential of using Ising machines for traditionally hard-to-train deep generative Boltzmann networks, with further possible improvement in nanodevice-based realizations.

## I. INTRODUCTION

The slowing down of Moore’s Law is ushering in an exciting new era of electronics where the traditionally separate layers of the computing stack are becoming increasingly intertwined. The rise of domain-specific computing hardware and architectures is driving unconventional computing approaches. One approach that generated great excitement recently is the field of Ising machines, where special-purpose hardware is developed to solve combinatorial optimization problems [1]. The goal of Ising machines is to improve energy efficiency, time to solution, or some other useful metric to solve optimization problems by co-designing all layers in the computing stack.

In this paper, we draw attention to another possibility of using probabilistic Ising machines, beyond combinatorial optimization, to demonstrate their application to deep generative AI models. We focus on deep Boltzmann Machines (BM) that are multi-layer generalizations of the original Boltzmann Machine [2, 3]. Despite being powerful models, BMs fell out of favor from mainstream deep learning praxis [4], primarily because they are computationally hard to train with widely available hardware [5]. Our goal in this paper is to illustrate how a sparse version of deep BMs can be efficiently trained using special-purpose hardware systems that provide orders of magnitude improvement over commonly used software implementations in the computationally hard probabilistic sampling task.

With minor modifications, our core computational kernel – fast probabilistic Markov Chain Monte Carlo sampling – could support a large family of energy-based models, including restricted and unrestricted BMs [6], contrastive Hebbian learning [7], Gaussian-Bernoulli BMs [8, 9], equilibrium propagation [10], predictive coding [11] and related algorithms.

We design a probabilistic bit (p-bit) [12] based realization of Boltzmann networks, as their lowest level realization in hardware. Using FPGAs, we physically construct a network of binary stochastic neurons (BSN) in hardware and connect them to one another in a fixed hardware topology. We also design an asynchronous architecture where p-bits (BSNs) dynamically evolve in parallel, much like an interacting collection of particles without a synchronizing global clock. Such a low-level realization of a Boltzmann network provides up to 5 orders of magnitude improvement in generating samples from the Boltzmann distribution, even in moderately sized FPGAs. An intense amount of work is currently underway to design scaled probabilistic computers out of magnetic nanodevices [13–16] which can scale probabilistic computers to much larger densities in energy-efficient implementations. Despite our FPGA-specific design in this paper, much of our results are applicable to scaled p-computers as well as other Ising machines based on many different physical realizations [1]. Our broader goal is to help stimulate the development of physics-inspired probabilistic hardware [17, 18] which can lead to energy-efficient systems to reduce the rapidly growing costs of conventional deep learning based on graphics and tensor processing units (GPU/TPU) [19].

---

\* sniazi@ucsb.edu

† camsari@ece.ucsb.edu



**Fig. 1.** (a) Hybrid computing scheme with probabilistic computer and classical computer implemented on a CPU. The p-computer generates samples according to the Boltzmann-Gibbs distribution and provides them to the CPU. Then CPU computes gradients, updates the weights ( $J$ ) and biases ( $h$ ), and sends them back to the p-computer until convergence. (b) The p-computer illustrated here is based on digital CMOS implementation (FPGA) and can have a measured sampling speed of  $\approx 50$  to  $64$  flips/ns. (c) Nanodevice-based p-computer: Various analog implementations have been proposed [17]. (d) Hardware-aware sparse Deep Boltzmann Machines (DBMs) are represented with visible and hidden p-bits (examples of the Pegasus [20] and Zephyr graphs [21] are shown). (e) The sparse DBMs shown in (d) are illustrated with two layers of hidden units (Left) where both the interlayer and intralayer (not shown) connections are allowed. (see Supplementary section 15 for a full view of the networks used in this work. The graph density and vertex degree distribution of the sparse DBMs are shown in the Supplementary Section 3.) When a particular label p-bit corresponding to a digit is activated (clamping that label p-bit to 1 and clamping the rest to 0), the network evolves to an image of that digit as shown in the example (Right). (f) All 10 digits are generated with sparse DBM after training the network with the full MNIST dataset.

## II. A HYBRID PROBABILISTIC-CLASSICAL COMPUTING SCHEME

The approach we take in this paper to train deep Boltzmann networks is to construct a hybrid probabilistic and classical computing setup (FIG. 1a). The role of the classical computer is to compute gradients and the new set of weights and biases given a set of samples from the probabilistic computer. The role of the probabilistic computer is to generate equilibrium samples for a given network (defined by the set of weights and biases) according to the Boltzmann law:

$$p_i = \frac{1}{Z} \exp(-\beta E_i) \quad (1)$$

where  $E_i$  is the configuration dependent energy and  $\beta$  is the inverse (algorithmic) temperature. In our context of probabilistic sampling,  $\beta$  is typically set to 1, unlike the optimization setting where it is gradually increased to find the configuration with minimum energy. In general, the

configuration-dependent energy can be expressed as a  $k$ -local Hamiltonian [22], in this paper, we focus on the 2-local energy that is given by:

$$E = - \left( \sum_{i < j} J_{ij} m_i m_j + \sum_i h_i m_i \right) \quad (2)$$

where  $J_{ij}$  and  $h_i$  represent the network topology and  $m_i$  represents the bipolar state of nodes that are either  $+1$  or  $-1$ . The probabilistic computer we design approximates the Boltzmann law by the following dynamical equations, where the effective field  $I_i$  and the activation of  $m_i$  are given by [12, 23]:

$$I_i(t + \Delta t) = \sum J_{ij} m_j(t) + h_i \quad (3)$$

and the activation of a p-bit is given:

$$m_i(t) = \text{sgn}(\tanh[\beta I_i(t)]) - \text{rand}_{U,[-1,1]} \quad (4)$$

The iterated evolution of Eq. (3) and Eq. (4) with a predefined (or random) update order generates samples approximating the Boltzmann law defined by Eq. (1) [24]. Note that in the rest of this paper  $\beta$  is always set to 1, except in image generation experiments where we anneal the network.

An important requirement to reach the Boltzmann equilibrium is that connected p-bits are updated serially (re-computing Eq. (3) every time) rather than in parallel [25] so that each p-bit updates with the most up-to-date information. This iterative process is called Gibbs Sampling [26] and is a fundamental Markov Chain Monte Carlo (MCMC) algorithm used in many machine learning applications [27]. The physical implementation of Eq. (3) and Eq. (4) to perform MCMC introduces several challenges. The primary difficulty is the serial updating requirement of connected p-bits, prohibiting the parallelization of updates in dense networks. The second difficulty is to ensure p-bits receive all the latest information from their neighbors before updating, otherwise, the network does not sample from the true Boltzmann distribution [28].

### III. HARDWARE-AWARE SPARSE NETWORKS

Both of these difficulties are more easily addressed in sparse networks. Sparsity limits the number of neighbors between p-bits allowing parallel and asynchronous updates on unconnected p-bits. Indeed, we show that as long as the chosen network topology is sparse, a massively parallel architecture where the frequency of probabilistic samples that linearly depends on the number of nodes in the network can be constructed [25] (see Section IX for details). Our present FPGA implementation of the probabilistic computer can take up to 50 to 64 flips per nanosecond (flips/ns) and projections indicate stochastic magnetic tunnel junction-based (sMTJ) implementations can take this number to about a million flips/ns or more (FIG. 1b,c) [29, 30]. These projections have not been realized, however, nanosecond fluctuations with sMTJs [15, 31] have been demonstrated. Given the gigabit densities in MTJ-based memory [32], large-scale integration of p-computers remains plausible, with MTJ-based prototypes demonstrating architectures of the type we discuss here [13, 14, 16].

In this paper, we adopt the Pegasus [20] and the Zephyr [21] topologies developed by D-Wave's quantum annealers to train hardware-aware sparse deep BMs (FIG. 1d). Even though our approach is applicable to any sparse graph (regular and irregular), we focus on such hardware-aware networks with limited connectivity where maximum degrees range between 15 and 20. Our choice of sparse models is motivated by scaled but connectivity-limited networks such as the human brain and advanced microprocessors.

Despite the common use of full connectivity in BM-based networks where inter-layer connections are typically fully connected [33], both advanced microprocessors with networks of billion transistors and the human brain exhibit a large degree of sparsity [34]. In fact, most hardware implementations of RBMs [35–37] suffer from scaling problems due to large fan-outs, requiring off-chip memory

access or distributed computation in multiple chips [37]. On the other hand, sparse connectivity in hardware neural networks often exhibits energy and area advantages [38].

FIG. 1e shows a typical sparse DBM that we use in this paper with 2-layers of hidden bits. This graph is obtained by randomly assigning visible and hidden bits in the Pegasus (or Zephyr) graphs of various sizes. Unlike standard deep BMs [39, 40], sparse DBMs do not have fully-connected interlayer connections. On the other hand, they do allow connections between nodes in a given layer, increasing the representative capability of the network. In Section VIII, we systematically study the effect of distributing visible/hidden nodes in such sparse networks, which introduces new challenges that do not exist in fully connected networks.

Unlike standard deep BM training where training is typically done layer-by-layer [41], in sparse DBMs, we tackle the training directly on the full network, by relying on our massively parallel architecture and the efficient mixing of sparse graphs.

As we discuss in Section V, we reach about 90% classification accuracy in 100 epochs with the full MNIST dataset without any downsampling, coarse-graining, or the use of much simpler datasets, typically performed in alternative hardware-based approaches [42–45]. To support our conclusions, we also train the harder fashion MNIST dataset and a reduced version of CIFAR-10, in the Supplementary Section 13–14.

Moreover, unlike RBMs, the sparse DBM learns the images well enough that for any given label, it can generate a new handwritten digit (or an FMNIST sample) as shown in FIG. 1f, when a single one-hot encoded output p-bit is clamped to a given digit. Image generation is an important feature of physics-inspired algorithms such as diffusion models [46], and the fact that RBMs fail at this task even when they have 100 $\times$  more parameters is surprising (both in MNIST and FMNIST), stressing the potential of sparse DBM models, as we discuss further in Section VI.

### IV. TRAINING SPARSE DBMS WITH SPARSE ISING MACHINES

As our network model, we use the Pegasus [20] and Zephyr [21] graphs at different sizes as a fixed network. Boltzmann networks are typically used for unsupervised learning without any explicit labels. To use deep BMs for classification, we follow a similar approach to [47] where we create additional visible bits, calling them “label bits”. We use one-hot encoding to specify 10 digits with 10 label bits, such that each image in the MNIST data set is paired with these label bits (FIG. 1e). We then use our fast Gibbs sampler (probabilistic computer) to perform the contrastive divergence (CD) algorithm [41, 48] that minimizes the KL divergence between the data and the model distributions. An equivalent formulation from a maximum likelihood estimation viewpoint [26, 49] can also be used to obtain the following learning rule



**Fig. 2.** (a) MNIST accuracy vs training epochs: with sparse DBM, 90% accuracy is achieved in 100 epochs. Full MNIST (60,000 images) is trained on sparse DBM (Pegasus 4,264 p-bits) with CD- $10^5$ , batch size = 50, learning rate = 0.003, momentum = 0.6 and epoch = 100 where the total number of parameters is 30,404. Each epoch is defined as the network seeing the entire 60,000 images with 1,200 weight updates. Test accuracy shows the accuracy of all the 10,000 images from the MNIST test set and the training accuracy represents the accuracy of 10,000 images that are randomly chosen from the training dataset. (b) MNIST accuracy with Restricted Boltzmann Machine (RBM) using 43 hidden units and CD-1 (CPU implementation) where the total number of parameters is 34,142. The accuracy of this RBM is less than 90% but sparse DBM can reach 90% with approximately the same number of parameters. (c) MNIST accuracy of RBM with 4,096 hidden units. Here the total number of parameters is 3,252,224 and the accuracy is 90% in 100 epochs which can be achieved using sparse DBM with around  $100 \times$  fewer parameters. (d) Test accuracy of MNIST as a function of the number of parameters with sparse DBMs (Pegasus) and RBMs. We trained full MNIST with 5 different sizes of Pegasus graphs for 100 epochs using the same set of hyperparameters and collected the test accuracy of the whole test set. When the number of parameters is only 6,464 with the smaller Pegasus (960 p-bits), test accuracy could not reach beyond 50%. On larger graphs with increased parameters, accuracy starts to increase and  $\approx 90\%$  accuracy is achieved with the largest Pegasus (4264 p-bits) that fits into our FPGA. RBM reached 90% accuracy with around 200,000 parameters but the increased number of parameters (up to 3.25 million) could not help go beyond  $\approx 92\%$  accuracy.

(see Supplementary Section 9),

$$\Delta J_{ij} = \varepsilon \left( \langle m_i m_j \rangle_{\text{data}} - \langle m_i m_j \rangle_{\text{model}} \right) \quad (5)$$

$$\Delta h_i = \varepsilon \left( \langle m_i \rangle_{\text{data}} - \langle m_i \rangle_{\text{model}} \right) \quad (6)$$

where  $\Delta J_{ij}$  and  $\Delta h_i$  represent the weight and bias updates per iteration and the terms in the parentheses represent the negative gradient of the KL divergence between data and the model distributions.  $\varepsilon$  is the learning rate,  $\langle m_i m_j \rangle_{\text{data}}$  and  $\langle m_i m_j \rangle_{\text{model}}$  are the average correlation between p-bits  $i$  and  $j$  in the “positive” (data) and “negative” (model) phases, respectively. During the positive phase of sampling, the p-computer clamps the visible p-bits to the corresponding training image one after the other, taking  $N$  sweeps for each image for a total of  $N \times B$  sweeps where  $B$  is the batch size. Using these sweeps, the CPU then computes the data correlations  $\langle m_i m_j \rangle_{\text{data}}$ . In the negative phase, the p-computer is allowed to run freely without any clamping, and the CPU computes the model correlations  $\langle m_i m_j \rangle_{\text{model}}$  by taking  $N \times B$  sweeps. Then the connection weights are updated according to Eq. (5) and Eq. (6). In actual training,

we also use a momentum modification to Eqs. (5,6) (see Supplementary Section 8). A pseudocode of the algorithm is presented in Algorithm 1.

For the sparse DBMs we consider in this work, establishing correlations between the data requires executing Gibbs sampling even for the positive phase, which is obtained in a single inference step in RBMs. Our machine can be configured to implement the persistent contrastive divergence (PCD) [6, 50, 51] algorithm. PCD maintains a long-running Markov chain such that small changes in weights do not take the equilibrium state of the new network far from the old one. We discuss the possible benefits of PCD vs CD in the context of our results in Section VII.

## V. RESULTS ON THE FULL MNIST DATASET

The dataset that we used for training sparse DBMs is the full MNIST handwritten digit dataset [52, 53] without any reduction or downsampling. We show results on FMNIST and CIFAR-10 in the Supplementary Section 13-14. MNIST consists of 60,000 training images and 10,000 test images with  $28 \times 28$  pixels having digits from 0 to 9 and we use black/white images by rounding up the pixel intensities. We

---

**Algorithm 1:** Training sparse DBMs

---

**Input :** number of samples  $N$ , batch size  $B$ , number of batches  $N_B$ , epochs  $N_L$ , learning rate  $\varepsilon$

**Sampler:** Classic Gibbs CPU, Graph-colored Gibbs CPU, Graph-colored Gibbs FPGA

**Output :** trained weights  $J_{\text{out}}$  and biases  $h_{\text{out}}$

```

 $J_{\text{out}} \leftarrow \mathcal{N}(0, 0.01);$ 
 $h_{\text{out}, \text{hidden}} \leftarrow 0;$ 
 $h_{\text{out}, \text{visible}} \leftarrow \log(p_i / (1 - p_i));$ 
for  $i \leftarrow 1$  to  $N_L$  do
    for  $j \leftarrow 1$  to  $N_B$  do
         $J_{\text{Sampler}} \leftarrow J_{\text{out}};$ 
        /* positive phase */ 
        for  $k \leftarrow 1$  to  $B$  do
             $h_B \leftarrow \text{clamping to batch images};$ 
             $h_{\text{Sampler}} \leftarrow h_B + h_{\text{out}};$ 
             $\{m\} \leftarrow \text{Sampler}(N);$       ▷ p-computer
        end
         $\langle m_i m_j \rangle_{\text{data}} = \{m\} \{m\}^T / (N \times B);$       ▷ CPU
        /* negative phase */ 
         $h_{\text{Sampler}} \leftarrow h_{\text{out}};$ 
         $\{m\} \leftarrow \text{Sampler}(N \times B);$       ▷ p-computer
         $\langle m_i m_j \rangle_{\text{model}} = \{m\} \{m\}^T / (N \times B);$       ▷ CPU
        /* update weights and biases */ 
         $J_{\text{out}} \leftarrow J_{\text{out}} + \Delta J_{ij};$       ▷ CPU
         $h_{\text{out}} \leftarrow h_{\text{out}} + \Delta h_i;$       ▷ CPU
    end
end

```

---

set the initial values of weights and biases according to the recipe that Hinton suggested for RBMs in [6]. The weights are initialized to small random values chosen from a zero-mean Gaussian with a standard deviation of 0.01 for every p-bit. The initial values of hidden biases are set to zero and visible biases are set to  $\log[p_i / (1 - p_i)]$  where  $p_i$  is the proportion of training vectors in which unit  $i$  is on. The values of hyperparameters used while training are  $\beta = 1$ , learning rate  $\varepsilon = 0.003$ , and momentum  $\alpha = 0.6$ .

The sparse DBM network used here (the largest size Pegasus that we can fit into our FPGA) consists of 834 visible p-bits ( $834 = 28 \times 28 + 10 \times 5$ ; we used 5 sets of labels each containing 10 p-bits) and 3,430 hidden p-bits arranged in 2-layers as shown in the inset of FIG. 2a. Then we randomly distribute the visible and hidden units on the sparse DBMs to ensure the label indices are delocalized (see Section VIII for details of this process and the original network in Supplementary FIG. S13).

To train the network efficiently, we divide the training set into 1,200 mini-batches having 50 images in each batch. The weights are updated after each mini-batch following the CD algorithm. We train MNIST for 100 epochs with  $CD-10^5$ , where  $10^5$  sweeps of the entire network are taken in the negative phase ( $N \times B$ ). The weight precision in the FPGA is 10 bits (1 sign, 6 integer, and 3 fraction bits, i.e., s{6}{3}) while the CPU uses double-precision floating-point with 64 bits, to compute the gradients. Before the new weights are loaded into the FPGA, however, they are reduced to s{6}{3}

to fit into the FPGA. A systematic study of the effect of weight precision is shown in Supplementary Section 5 along with image completion experiments. In short, we do not observe any significant differences at higher precision in the FPGA, indicating that the 10-bit weight precision is adequate.

During inference, the 784 p-bits that correspond to the pixels are clamped to the test data and the label p-bits fluctuate freely. To test classification accuracy, we use  $10^5$  sweeps and perform a softmax classification scheme as follows: as we have 50 label p-bits for 5 sets of labels, by time-averaging the corresponding label bits we finally have the 10 labels for 10 digits. The p-bit with the highest probability of being ‘1’ is used for the classified digit. For comparison, we also train an optimized RBM model using CD-1 in the CPU. The label, testing, and training details of RBMs are very similar to those of sparse DBMs.

FIG. 2 shows our main results. We see that the sparse DBM architecture in the Pegasus graph with 4,264 p-bits reaches about 90% accuracy in 100 epochs (see Supplementary Section 4 where the training accuracy can reach 100% for MNIST/100 images). To compare the sparse DBM architecture with a standard RBM, we perform two tests, one at “iso-parameter” and the other at “iso-accuracy”. The iso-parameter test uses an RBM with about the same number of parameters (with an all-to-all interlayer connection). This RBM falls short of reaching 90% in this setting. Then, we choose an RBM with  $100 \times$  more parameters and observe that the results saturate at about 90% accuracy. We also note that increasing CD-1 to CD- $n$  ( $n$  up to 100) does not result in an appreciable difference in accuracy while making the training computationally much harder.

Detailed testing in both models (sparse DBM and RBM) indicates that marginal improvements are possible with more training epochs, however, both models show similar asymptotic behavior in 100 epochs, this is why we stop training around 100 epochs (FIG. 2d shows experiments at various network sizes). Note that this is still a computationally intense process where 60,000 images are shown to the network for a total of 6,000,000 times and the weights are updated a total of  $100 \times N_B = 120,000$  times since  $N_B = 1200$ .

To investigate the effect of total parameters of sparse DBMs on the accuracy, we used five Pegasus graphs of different sizes to train MNIST using our massively parallel architecture. These include 960, 1,664, 2,560, 3,080, and 4,264 p-bit graphs with a varying number of parameters from  $\approx 6,000$  to  $\approx 30,000$  as shown in FIG. 2d. We trained full MNIST on each of these five sparse DBMs with  $CD-10^5$  using the same hyperparameters and reported the classification accuracy for the entire test set. Similarly, we also trained eight different RBMs with full MNIST for 100 epochs to compare their accuracy with the number of parameters (FIG. 2d). Increasing the number of parameters to millions could not increase the test accuracy significantly whereas 90% accuracy is achieved with around 200,000 parameters.

Based on these experimental results, we arrive at the following two important conclusions: First, the sparse DBM architecture despite having a much smaller degree of connections between its layers (limited to a graph degree of



**Fig. 3.** (a) Images generated with sparse DBM by annealing the network from  $\beta=0$  to  $\beta=5$  with 0.125 steps after training the full MNIST dataset. The labels for a particular digit are clamped to show how the visible p-bits evolve to that specific image. Examples of digits ‘0’ and ‘7’ are shown here. (b) The same procedure for image generation is applied to the RBM network (with 4,096 hidden units) that achieves 90% test accuracy. Using the same annealing schedule, RBM does not produce the correct digits, unlike the sparse DBM. (c) Generated images of fashion products (e.g. ‘Trouser’ and ‘Pullover’) with sparse DBM by annealing the network from  $\beta=0$  to  $\beta=5$  with 0.125 steps after training full Fashion MNIST. (d) RBM with 4096 hidden units can not generate the correct images according to the labels despite achieving around 83% test accuracy.

$\approx 15$  to 20) matches the classification accuracy of a fully-connected RBM. Second, the sparse DBM requires far fewer parameters (about 30,000) to reach 90% accuracy in the MNIST dataset. Both of these indicate the potential of sparse DBMs which can be directly tackled by the orders of magnitude acceleration obtained in the hardware. We show in the Supplementary Section 13- 14 that similar results with the same order of magnitude differences between sparse DBMs and RBMs hold for the full FMNIST and a reduced version of the CIFAR-10 dataset.

Compared to more powerful standard DNN algorithms such as CNNs, sparse DBMs do not reach state-of-the-art classification accuracy in MNIST at these modest network sizes and depths. Further improvements should be possible by algorithmic techniques and at larger sizes as discussed in Ref’s [50, 54]. Surprisingly, however, in a head-to-head comparison using the same contrastive divergence algorithm, the sparse DBM architecture matches the performance of highly optimized RBMs, despite the severely limited connectivity. More detailed comparisons may reveal the true potential of hardware-aware sparse DBMs which can be implemented on Ising Machines. It is important to note that the generative nature of BMs allows applications beyond classification, such as representing many-body quantum

wavefunctions [55, 56].

## VI. IMAGE GENERATION

Given their generative nature, a natural question to ask is whether the sparse DBM and RBM can generate new images when they are run in the “invertible mode”. This is similar to the image generation from “noise” discussed in diffusion models [46]. We test this idea post-training by clamping the label bits to a given digit and annealing the network using the inverse temperature,  $\beta$ .

Here we present an example of image synthesis with sparse DBMs with  $\approx 30,000$  parameters and the optimized RBM with  $\approx 3.25$  million parameters (FIG. 3a-b for MNIST and FIG. 3c-d for Fashion MNIST). For this process, we clamp the label bits for digits ‘0’ or ‘7’ in the case of MNIST while all other p-bits run freely. Using the final weights and biases and by annealing the network slowly from  $\beta=0$  to  $\beta=5$  with a 0.125 increment and we check the 784 image p-bits at various time steps. At lower values of  $\beta$  when the system is in a high-temperature state, the model is sampling from noise (first column of FIG. 3a). With increasing  $\beta$  values, digits gradually become recognizable, and at final  $\beta=5$  we see clear images of a ‘0’ or ‘7’ (leftmost column of FIG. 3a). This example

is demonstrated with the Pegasus graph using about 30,000 parameters from FIG. 2, similar results are obtained with the Zephyr graph as we discuss in Supplementary Section 7. Using the same approach of image generation, the fashion products ‘Trouser’ or ‘Pullover’ from FMNIST are generated as shown in FIG. 3c (details in Supplementary Section 13).

In contrast, the images generated with the RBM (4,096 hidden units) are not recognizable even after careful annealing (FIG. 3b-d), despite multiple experiments with different trials. Similarly, the annealing schedule for RBM varies from  $\beta = 0$  to  $\beta = 5$ . To test whether RBMs can generate images with better gradient calculations, we also trained an RBM with 4,096 hidden p-bits with CD- $10^2$  but this did not lead to any success in image generation. Interestingly, however, the RBM with 4,096 hidden p-bits and CD- $10^2$  can accomplish a simpler “image completion” task when it is presented half of a given digit (see Supplementary Section 11).

These results seem to be in keeping with the idea that in freely “dreaming” or image-completing networks, it may be possible to generate images with RBMs [57]. In our experiments, the image generation task is forced by clamping only the label bits, without giving the network any other lead. In this stricter sense, the failure of the RBM to generate images is consistent with the general understanding on the subject [58, 59]. We believe that accelerating the Gibbs sampling by orders of magnitude can enable the training of even deeper, previously untrainable deep BMs. The potential of physics-inspired RBMs for image generation is also seen in the recent interest in Gaussian-Bernoulli (GRBMs) [9], whose sparse and deep variants could be even more powerful.

## VII. MIXING TIMES

One of the key difficulties that are often cited in the training of Boltzmann networks is the computational intractability of the partition function,  $Z$ , that appears in the Boltzmann law, Eq. (1). Formally, what is required to ensure an exact calculation of the gradient is that the calculation of correlations  $\langle m_i m_j \rangle$  and averages  $\langle m_i \rangle$  come from the equilibrium states of a given network defined by  $J_{ij}$  and  $h_i$ . The time it takes for an undirected network to reach equilibrium is defined as the “mixing time”. A formal analysis of how long it takes for a given graph to mix can be extremely difficult to calculate and is unknown for all but the simplest, most regular networks [60]. Here, we empirically study the mixing time of the Pegasus graph that we used in generating our main results in FIG. 2. The fact that there is no a priori method to determine mixing times, a lot of hyperparameter optimization might be necessary to squeeze the maximum results out of these networks (see, for example, [6]).

In FIG. 4, we observe that the test set accuracy of our network increases significantly if the probabilistic sampler takes  $10^4$  or more sweeps per weight update. Above this value, there seem to be diminishing returns in improving the accuracy. This suggests that taking more sweeps does not improve the estimation of the averages and correlations because these samples are already in equilibrium and  $\approx 10^4$  sweeps at this size (with 4,264 p-bits) of the Pegasus



**Fig. 4.** (a) Test accuracy after training full MNIST (up to only 40 epochs for computational simplicity) with different numbers of sweeps per iteration is shown. For our sparse graph, to mix the Markov chain properly we need a minimum CD- $10^4$ . Reducing the number of sweeps to  $10^3$  or  $10^2$  degrades the quality of mixing the chain significantly. (b) Test accuracy as a function of CD-n at epoch 40 showing the equilibrium and non-equilibrium samples.

graph can be empirically defined as the mixing time of the network (Supplementary Section 6 shows the mixing time study of different size Pegasus). As mentioned earlier, our probabilistic computer could be modified to perform persistent CD algorithm (PCD) [50]. Beyond CD- $10^4$ , this may have diminishing returns since the chain mixes and starts sampling from the equilibrium distribution, even if it starts from a random state, as shown in FIG. 4.

The reason for the saturating classification accuracy of sparse DBMs at around 90% is likely that the network is not deep or wide enough and not because of the intractability of the algorithm. In fact, considering our hardware architecture FPGA is able to take  $\approx 64$  billion samples per second, and obtaining  $10^5$  sweeps from our machine can be done in mere milliseconds (Table I shows comparisons of sampling rates between standard CPUs and our graph colored (GC) architecture, where our probabilistic computer (GC-FPGA) demonstrates  $\approx 4$  to  $6$  orders of magnitude improvement over the optimized and standard CPU implementations of Gibbs sampling, respectively). In Supplementary Section 10, we



**Fig. 5.** (a) The sparse DBMs (Pegasus and Zephyr) where all the p-bits are distributed in a serial manner such as 1 to 784 are the visible p-bits, 785 to 834 are the label p-bits (50 bits for 5 sets of labels), and the rest are hidden p-bits. (b) The sparse DBMs with randomized indices are shown here. (c) Test accuracy of full MNIST as a function of training epochs for two different sparse DBMs. In both cases, training the sparse DBMs with the serial distribution (no randomization) of indices could not achieve an accuracy of more than 50%. In contrast, randomization of indices helps the network to reach 90% accuracy.

show how the performance reported in Table I fares against superficially similar Ising solvers in highly optimized GPU and TPU implementations.

These results suggest that our machine can be used to sample much more “difficult” networks that require many more samples to mix, enabling the training of previously untrainable networks with potentially richer representation.

It is important to note that in case of the contrastive divergence algorithm where the goal is to estimate model correlations and averages, one does not need to compute the effect of all samples. After the network reaches equilibrium, a small number of samples may be used to estimate the

averages and correlations (with  $\mathcal{O}(1/\sqrt{N})$  complexity). This significantly eases practical read-out constraints of a fast probabilistic sampler (See Supplementary Section 2 for our detailed read-out architecture).

## VIII. RANDOMIZATION OF INDICES

A very important point that arises in training Boltzmann network models on a given sparse network is the notion of graph distance between visible, hidden, and label nodes. Typically, if the layers are fully connected, the graph distance between any given two nodes is a constant. On the other hand, when training sparse DBMs, the placement of visible, hidden, and label p-bits plays a crucial role. FIG. 5 shows the comparison of how the indexing of p-bits affects the classification accuracy in the Pegasus and Zephyr graphs. We observe that if the hidden, visible, and label bits are clustered and too close, the classification accuracy suffers greatly. This is likely because the correlation between the label bits and the visible bits gets weaker if their graph distance is too large. On the other hand, randomizing the indices seems to solve this problem, repeatable in completely different but sparse graphs. We performed further experiments with different random indices and essentially observed the same behavior. For example, FIG. 2d shows a monotonically increasing accuracy with different sizes of sparse DBMs (Pegasus) even

TABLE I. Comparison of the FPGA-based MCMC sampler with standard CPU and graph-colored CPU implementations. All data points are measured, as discussed in the Methods.

| Sampling method      | topology | size  | max. degree | flips/ns              |
|----------------------|----------|-------|-------------|-----------------------|
| Standard Gibbs (CPU) | Pegasus  | 4,264 | 15          | $2.18 \times 10^{-5}$ |
| GC Gibbs (CPU)       | Pegasus  | 4,264 | 15          | $8.50 \times 10^{-3}$ |
| GC Gibbs (FPGA)      | Pegasus  | 4,264 | 15          | $6.40 \times 10^1$    |
| Standard Gibbs (CPU) | Zephyr   | 3,360 | 20          | $2.67 \times 10^{-5}$ |
| GC Gibbs (CPU)       | Zephyr   | 3,360 | 20          | $4.40 \times 10^{-3}$ |
| GC Gibbs (FPGA)      | Zephyr   | 3,360 | 20          | $5.04 \times 10^1$    |

though each graph has a different randomized index set.

To reduce the graph distance between the label bits, visible and hidden bits, we chose 5 sets of label bits ( $5 \times 10 = 50$  p-bits) using one-hot encoding per digit. Experiments with more label bits did not show significant differences. Also, experiments with multiple label bits in the RBM did not show any difference. This suggests that randomization of indices is particularly important for sparse models, but is unnecessary for fully-connected networks whose graph distance between any two nodes is a constant.

## IX. P-COMPUTER ARCHITECTURE

On the sparse DBM, we color the graph using the heuristic graph-coloring algorithm DSatur [61] to exploit parallel updating of unconnected p-bits. This approach involves assigning different colors to connected p-bits and the same color to unconnected p-bits as shown in FIG. 6a to implement Gibbs sampling in a massively parallel manner on sparse and irregular graphs [25]. Finding the minimum number of colors is an NP-hard problem, however, the minimum number of colors is not a strict requirement as sparse graphs require only a limited number of colors, and for our purpose, heuristic coloring algorithms like DSatur with polynomial complexity can color the graph efficiently.

In the case of the Pegasus graph with 4,264 p-bits, where the maximum number of neighbors is 15, only four colors are used as shown in FIG. 6a. Therefore we need four equally phase-shifted and same-frequency clocks for updating the p-bits in each color block one by one. Similarly, the Zephyr graph (3,360 p-bits and the maximum number of neighbors is 20) can also be colored with five colors using this procedure. In this approach, a graph comprised of  $N$  p-bits is able to perform a full sweep in a single clock cycle ( $T_{\text{clk}}$ ). We refer to this architecture as the pseudo-asynchronous Gibbs sampling [17]. The key advantage of this approach is that the p-computer becomes faster as the graph size grows as shown in FIG. 6b and Table I for both graph-colored FPGA and graph-colored CPU.

Parallelism offers many more samples to be taken at a clock cycle (scales as  $N$ , being the number of p-bits in the network as shown in FIG. 6b), however, we also establish that this parallelism does not introduce any approximations or errors by performing an “inference” experiment as discussed in Supplementary Section 12.

## CONCLUSION

In this work, we have presented a hybrid probabilistic-classical computing setup to train sparse and deep Boltzmann networks. We used a sparse Ising machine with a massively parallel architecture that achieves a sampling speed orders of magnitude faster than traditional CPUs. Unlike similar hardware approaches where the MNIST dataset is downsampled, reduced, or replaced entirely for smaller datasets, we trained the full MNIST dataset without any simplifications. We also trained harder datasets like Fashion MNIST and a reduced version of CIFAR-10 with a grayscale scheme. Our sparse DBM matched the accuracy of

RBM while using  $100\times$  fewer parameters and successfully generated new images while the RBM failed to do so. We systematically studied the mixing time of the hardware-aware network topologies and showed that the classification accuracy of our model is not limited by the computational tractability of the algorithm but limited by the moderately sized FPGAs that we were able to use in this work.

Further improvements may involve using deeper, wider, and perhaps “harder to mix” network architectures using more complex datasets, the use of higher-order energy functions beyond quadratic interactions [22] and the use of Gaussian visible units to better model continuous data [9]. The use of higher-order interactions in the context of combinatorial optimization has generated significant attention [62–64], and its application to learning problems could be possible due to the simplified nature of multiplication with binary neurons. Moreover, combining layer-by-layer training techniques of conventional DBMs with our approach could lead to further possible improvements. Nanodevice implementations of such sparse Ising machines, for example, using stochastic magnetic tunnel junctions may significantly change this picture, potentially changing established wisdom on the practical use of deep Boltzmann networks.

## METHODS

### 1. FPGA and CPU specifications

In this article, Xilinx Alveo U250, a data center accelerator card (Virtex UltraScale+ XCU250 FPGA) with peripheral component interconnect express (PCIe) connectivity has been used [65]. PCIe interface performs data transfer at the rate of 2.5 gigatransfers per second (GT/s). The classical computer used in this study is equipped with an 11th Gen Intel Core i7-11700 processor with a clock speed of up to 4.90 GHz and 64 GB of random access memory (RAM).

The digital implementation of p-bits consists of a pseudorandom number generator (PRNG), a lookup table for the activation function (tanh), and a threshold to generate a binary output (details in the Supplementary Section 1). The read-out architecture with mirror p-bits is discussed in the Supplementary Section 2. Weights and biases with fixed point precision of 10 bits (1 sign bit, 6 integer bits, and 3 fraction bits) are used to provide tunability through the activation function.

### 2. MNIST data, D-Wave graphs and RBM code

MNIST files are downloaded from [52]. Then the image data are converted to binary form (black and white) by rounding up the pixel intensities in MATLAB. While we focused on black and white images for our main results, in the Supplementary Section 13-14, we show how the learning algorithm can be extended to learn grayscale images, following a similar time-averaging approach discussed in Ref. [66]. The Pegasus and Zephyr graphs are extracted using the procedure described in [67]. The RBM code used in this work is similar to the one available in [68].



**Fig. 6.** (a) An example of massively parallel architecture with four parallel same-frequency and equally phase-shifted clocks to trigger the colored p-bit blocks. The sparse DBM (Pegasus 4,264 p-bits) is colored with four colors using the graph-coloring algorithm to exploit parallel updating of unconnected p-bits and the input for each p-bit is computed using Eq. (3). (b) Measured flips/ns as a function of graph size (number of p-bits) showing ideal parallelism scaling linearly with the system size in the case of the graph-colored FPGA (top). The graph-colored CPU flips/ns as a function of the graph size (bottom).

### 3. Data transfer between FPGA and CPU

A PCIe interface is used to communicate between FPGA and CPU through MATLAB interface for the ‘read/write’ operations (see Supplementary FIG. S1c). A global ‘disable/enable’ signal broadcast from MATLAB to the FPGA is used to freeze/resume all p-bits. Before a ‘read’ instruction, the p-bit states are saved to the local block memory (BRAM) with a snapshot signal. Then the data are read once from the BRAM using the PCIe interface and sent to MATLAB for post-processing i.e., computing gradients and updating the weights. For the ‘write’ instruction, the ‘disable’ signal is sent from MATLAB to freeze the p-bits before sending the updated weights. After the ‘write’ instruction is done, p-bits are enabled again with the ‘enable’ signal sent from MATLAB. The data transfer efficiency is influenced by this back-and-forth communication between the FPGA and MATLAB. Furthermore, the conversion of bipolar to binary weights and biases during each epoch (as explained in Supplementary 1 b) adds some time overheads while sending them from MATLAB to FPGA. Even though sampling is very fast in FPGA, due to these overheads it takes  $\approx 20$  hours to train full MNIST on 4,264 p-bit Pegasus with CD- $10^5$  for 100 epochs. This issue can be improved significantly by updating the weights and biases inside the FPGA. To understand the improvement introduced by our hybrid approach, we also note that the corresponding equivalent version with CD- $10^4$  and with the same graph coloring on a CPU took  $\approx 2.4$  days (57.5 hours) to complete only 10 epochs (projected time for 100 epochs is  $\approx 24$  days).

### 4. Measurement of flips per nanosecond

To measure the flips/ns, one p-bit in each color block is designed with a programmable counter in the FPGA to count the flip attempts. A reference counter running parallelly is set to count up to a preset value at the positive edge of a reference clock. When the reference counter is done counting, the p-bit counters are stopped. Comparing the p-bit counter outputs (representing the total number of attempted flips in each color block) with the reference counter preset value, the time for the total flips is obtained. With this data, the

flips/ns of the p-computer is measured experimentally. To determine the flips/ns for the standard CPU and graph-colored CPU, MATLAB built-in ‘tic’ and ‘toc’ functions are used to measure the elapsed time while counting the total flips. The flips/ns is measured in real-time using this data. The error bars in FIG. 6b are obtained by taking 500 measurements of flips/ns.

### ACKNOWLEDGEMENTS

We gratefully acknowledge discussions with Dr. Jan Kaiser. We are thankful to the Xilinx University Donation Program (XUP) for the FPGA development boards and G. Eschemann for useful discussions on airhdl. This work is partially supported by an Office of Naval Research Young Investigator Program grant and a National Science Foundation CCF 2106260 grant.

### DATA AVAILABILITY

The data that support the plots within this paper and other findings of this study are available from the corresponding author upon reasonable request.

### CODE AVAILABILITY

The computer code used in this study is available from the corresponding author upon reasonable request.

### AUTHOR CONTRIBUTIONS

SN and KYC conceived the study. KYC supervised the study. SN and NAA developed the hybrid FPGA-CPU implementation. SN and SC performed the benchmark RBM training. SN and NAA performed the FPGA experiments to train sparse DBMs. SN, NAA, MM, SC, YQ, KYC discussed, analyzed the experiments, and participated in writing the manuscript.

### COMPETING INTERESTS

The authors declare no competing interests.

## REFERENCES

- [1] N. Mohseni, P. L. McMahon, and T. Byrnes, Ising machines as hardware solvers of combinatorial optimization problems, *Nature Reviews Physics* **4**, 363 (2022).
- [2] G. E. Hinton, T. J. Sejnowski, and D. H. Ackley, *Boltzmann machines: Constraint satisfaction networks that learn* (Carnegie-Mellon University, Department of Computer Science Pittsburgh, PA, 1984).
- [3] P. Huembeli, J. M. Arrazola, N. Killoran, M. Mohseni, and P. Wittek, The physics of energy-based models, *Quantum Machine Intelligence* **4**, 1 (2022).
- [4] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, *nature* **521**, 436 (2015).
- [5] I. Goodfellow, Y. Bengio, and A. Courville, *Deep learning* (MIT press, 2016).
- [6] G. E. Hinton, A Practical Guide to Training Restricted Boltzmann Machines, in *Neural Networks: Tricks of the Trade: Second Edition*, edited by G. Montavon, G. B. Orr, and K.-R. Müller (Springer Berlin Heidelberg, Berlin, Heidelberg, 2012) pp. 599–619.
- [7] X. Xie and H. S. Seung, Equivalence of backpropagation and contrastive hebbian learning in a layered network, *Neural computation* **15**, 441 (2003).
- [8] K. H. Cho, T. Raiko, and A. Ilin, Gaussian-bernoulli deep boltzmann machine, in *The 2013 International Joint Conference on Neural Networks (IJCNN)* (IEEE, 2013) pp. 1–7.
- [9] R. Liao, S. Kornblith, M. Ren, D. J. Fleet, and G. Hinton, Gaussian-Bernoulli RBMs Without Tears, arXiv preprint arXiv:2210.10318 (2022).
- [10] B. Scellier and Y. Bengio, Equilibrium propagation: Bridging the gap between energy-based models and backpropagation, *Frontiers in computational neuroscience* **11**, 24 (2017).
- [11] B. Millidge, Y. Song, T. Salvatori, T. Lukasiewicz, and R. Bogacz, Backpropagation at the infinitesimal inference limit of energy-based models: unifying predictive coding, equilibrium propagation, and contrastive hebbian learning, arXiv preprint arXiv:2206.02629 (2022).
- [12] K. Y. Camsari, R. Faria, B. M. Sutton, and S. Datta, Stochastic p-bits for invertible logic, *Physical Review X* **7**, 031014 (2017).
- [13] W. A. Borders, A. Z. Pervaiz, S. Fukami, K. Y. Camsari, H. Ohno, and S. Datta, Integer factorization using stochastic magnetic tunnel junctions, *Nature* (2019).
- [14] A. Grimaldi, K. Selcuk, N. A. Aadit, K. Kobayashi, Q. Cao, S. Chowdhury, G. Finocchio, S. Kanai, H. Ohno, S. Fukami, and K. Y. Camsari, Experimental evaluation of simulated quantum annealing with MTJ-augmented p-bits, in *2022 IEEE International Electron Devices Meeting (IEDM)* (2022).
- [15] K. Hayakawa, S. Kanai, T. Funatsu, J. Igarashi, B. Jinnai, W. Borders, H. Ohno, and S. Fukami, Nanosecond random telegraph noise in in-plane magnetic tunnel junctions, *Physical Review Letters* **126**, 117202 (2021).
- [16] J. Kaiser, W. A. Borders, K. Y. Camsari, S. Fukami, H. Ohno, and S. Datta, Hardware-aware in situ learning based on stochastic magnetic tunnel junctions, *Physical Review Applied* **17**, 014016 (2022).
- [17] S. Chowdhury, A. Grimaldi, N. A. Aadit, S. Niazi, M. Mohseni, S. Kanai, H. Ohno, S. Fukami, L. Theagarajan, G. Finocchio, *et al.*, A full-stack view of probabilistic computing with p-bits: devices, architectures and algorithms, *IEEE Journal on Exploratory Solid-State Computational Devices and Circuits* (2023).
- [18] P. J. Coles, Thermodynamic AI and the fluctuation frontier, arXiv preprint arXiv:2302.06584 (2023).
- [19] D. Patterson, J. Gonzalez, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean, Carbon emissions and large neural network training, arXiv preprint arXiv:2104.10350 (2021).
- [20] N. Dattani, S. Szalay, and N. Chancellor, Pegasus: The second connectivity graph for large-scale quantum annealing hardware, arXiv preprint arXiv:1901.07636 (2019).
- [21] K. Boothby, A. King, and J. Raymond, Zephyr topology of D-Wave quantum processors, *D-Wave Tech. Rep. Ser.* **1** (2021).
- [22] T. J. Sejnowski, Higher-order boltzmann machines, in *AIP Conference Proceedings*, Vol. 151 (American Institute of Physics, 1986) pp. 398–403.
- [23] R. Faria, J. Kaiser, K. Y. Camsari, and S. Datta, Hardware design for autonomous bayesian networks, *Frontiers in computational neuroscience* **15**, 14 (2021).
- [24] E. Aarts and J. Korst, *Simulated annealing and Boltzmann machines: a stochastic approach to combinatorial optimization and neural computing* (John Wiley & Sons, Inc., 1989).
- [25] N. A. Aadit, A. Grimaldi, M. Carpentieri, L. Theagarajan, J. M. Martinis, G. Finocchio, and K. Y. Camsari, Massively parallel probabilistic computing with sparse Ising machines, *Nature Electronics* **5**, 460 (2022).
- [26] D. Koller and N. Friedman, *Probabilistic graphical models: principles and techniques* (MIT press, 2009).
- [27] C. Andrieu, N. De Freitas, A. Doucet, and M. I. Jordan, An introduction to MCMC for machine learning, *Machine learning* **50**, 5 (2003).
- [28] A. Z. Pervaiz, L. A. Ghantasala, K. Y. Camsari, and S. Datta, Hardware emulation of stochastic p-bits for invertible logic, *Scientific reports* **7**, 10994 (2017).
- [29] B. Sutton, R. Faria, L. A. Ghantasala, R. Jaiswal, K. Y. Camsari, and S. Datta, Autonomous probabilistic coprocessing with petaflips per second, *IEEE Access* **8**, 157238 (2020).
- [30] K. Y. Camsari, S. Salahuddin, and S. Datta, Implementing p-bits with embedded MTJ, *IEEE Electron Device Letters* **38**, 1767 (2017).
- [31] C. Safranski, J. Kaiser, P. Trouilloud, P. Hashemi, G. Hu, and J. Z. Sun, Demonstration of nanosecond operation in stochastic magnetic tunnel junctions, *Nano letters* **21**, 2040 (2021).
- [32] K. Lee, J. Bak, Y. Kim, C. Kim, A. Antonyan, D. Chang, S. Hwang, G. Lee, N. Ji, W. Kim, *et al.*, 1Gbit high density embedded STT-MRAM in 28nm FDSOI technology, in *2019 IEEE International Electron Devices Meeting (IEDM)* (IEEE, 2019) pp. 2–2.
- [33] R. Salakhutdinov and G. Hinton, Deep boltzmann machines, in *Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics*, Proceedings of Machine Learning Research, Vol. 5, edited by D. van Dyk and M. Welling (PMLR, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 2009) pp. 448–455.
- [34] D. S. Bassett and E. Bullmore, Small-world brain networks, *The neuroscientist* **12**, 512 (2006).
- [35] M. N. Bojnordi and E. Ipek, Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning, in *High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on* (IEEE, 2016) pp. 1–13.
- [36] C.-H. Tsai, W.-J. Yu, W. H. Wong, and C.-Y. Lee, A 41.3/26.7 pJ per neuron weight RBM processor supporting on-chip learning/inference for IoT applications, *IEEE Journal of Solid-State Circuits* **52**, 2601 (2017).
- [37] S. K. Kim, L. C. McAfee, P. L. McMahon, and K. Olukotun, A highly scalable restricted Boltzmann machine FPGA implementation, in *2009 International Conference on Field Programmable Logic and Applications* (IEEE, 2009) pp. 367–

- 372.
- [38] A. Ardakani, C. Condo, and W. J. Gross, Sparsely-Connected Neural Networks: Towards Efficient VLSI Implementation of Deep Neural Networks, arXiv preprint arXiv:1611.01427 (2016).
- [39] R. Salakhutdinov and H. Larochelle, Efficient learning of deep Boltzmann machines, in *Proceedings of the thirteenth international conference on artificial intelligence and statistics* (JMLR Workshop and Conference Proceedings, 2010) pp. 693–700.
- [40] I. J. Goodfellow, A. Courville, and Y. Bengio, Joint training deep Boltzmann machines for classification, arXiv preprint arXiv:1301.3568 (2013).
- [41] G. E. Hinton, Training products of experts by minimizing contrastive divergence, *Neural computation* **14**, 1771 (2002).
- [42] S. H. Adachi and M. P. Henderson, Application of quantum annealing to training of deep neural networks, arXiv preprint arXiv:1510.06356 (2015).
- [43] H. Manukian, F. L. Traversa, and M. Di Ventra, Accelerating deep learning with memcomputing, *Neural Networks* **110**, 1 (2019).
- [44] V. Dixit, R. Selvarajan, M. A. Alam, T. S. Humble, and S. Kais, Training Restricted Boltzmann Machines With a D-Wave Quantum Annealer, *Frontiers in Physics* **9**, 589626 (2021).
- [45] F. Böhm, D. Alonso-Urquijo, G. Verschaffelt, and G. Van der Sande, Noise-injected analog Ising machines enable ultrafast statistical sampling and machine learning, *Nature Communications* **13**, 5847 (2022).
- [46] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, Deep unsupervised learning using nonequilibrium thermodynamics, in *International Conference on Machine Learning* (PMLR, 2015) pp. 2256–2265.
- [47] H. Larochelle and Y. Bengio, Classification using discriminative restricted Boltzmann machines, in *Proceedings of the 25th international conference on Machine learning* (2008) pp. 536–543.
- [48] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio, An empirical evaluation of deep architectures on problems with many factors of variation, in *Proceedings of the 24th international conference on Machine learning* (2007) pp. 473–480.
- [49] M. A. Carreira-Perpiñán and G. Hinton, On contrastive divergence learning, in *Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics*, Proceedings of Machine Learning Research, Vol. R5, edited by R. G. Cowell and Z. Ghahramani (PMLR, 2005) pp. 33–40, reissued by PMLR on 30 March 2021.
- [50] T. Tieleman, Training restricted Boltzmann machines using approximations to the likelihood gradient, in *Proceedings of the 25th international conference on Machine learning* (2008) pp. 1064–1071.
- [51] A. Fischer and C. Igel, Training restricted Boltzmann machines: An introduction, *Pattern Recognition* **47**, 25 (2014).
- [52] Y. LeCun, The MNIST database of handwritten digits, <http://yann.lecun.com/exdb/mnist/> (1998).
- [53] L. Deng, The MNIST database of handwritten digit images for machine learning research, *IEEE Signal Processing Magazine* **29**, 141 (2012).
- [54] H. Larochelle, M. Mandel, R. Pascanu, and Y. Bengio, Learning algorithms for the classification restricted boltzmann machine, *The Journal of Machine Learning Research* **13**, 643 (2012).
- [55] A. Dawid and Y. LeCun, Introduction to latent variable energy-based models: A path towards autonomous machine intelligence, arXiv preprint arXiv:2306.02572 (2023).
- [56] G. Carleo and M. Troyer, Solving the quantum many-body problem with artificial neural networks, *Science* **355**, 602 (2017).
- [57] H. Hu, L. Gao, and Q. Ma, Deep restricted boltzmann networks, arXiv preprint arXiv:1611.07917 (2016).
- [58] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial networks, *Communications of the ACM* **63**, 139 (2020).
- [59] J. Sleeman, J. Dorband, and M. Haleem, A hybrid quantum enabled RBM advantage: convolutional autoencoders for quantum image compression and generative learning, in *Quantum information science, sensing, and computation XII*, Vol. 11391 (SPIE, 2020) pp. 23–38.
- [60] D. A. Levin and Y. Peres, *Markov chains and mixing times*, Vol. 107 (American Mathematical Soc., 2017).
- [61] D. Brélaž, New methods to color the vertices of a graph, *Communications of the ACM* **22**, 251 (1979).
- [62] M. K. Bashar and N. Shukla, Designing ising machines with higher order spin interactions and their application in solving combinatorial optimization, *Scientific Reports* **13**, 9558 (2023).
- [63] C. Bybee, D. Kleyko, D. E. Nikonov, A. Khosrowshahi, B. A. Olshausen, and F. T. Sommer, Efficient optimization with higher-order ising machines, *Nature Communications* **14**, 6033 (2023).
- [64] Y. He, C. Fang, S. Luo, and G. Liang, Many-body effects-based invertible logic with a simple energy landscape and high accuracy, *IEEE Journal on Exploratory Solid-State Computational Devices and Circuits* (2023).
- [65] Xilinx, U250 Data Sheet, <https://www.xilinx.com/products/boards-and-kits/alveo/u250.html#documentation>.
- [66] T. Hirtzlin, B. Penkovsky, M. Bocquet, J.-O. Klein, J.-M. Portal, and D. Querlioz, Stochastic computing for hardware implementation of binarized neural networks, *IEEE Access* **7**, 76394 (2019).
- [67] D-Wave Systems Inc., D-Wave Ocean Documentation: DNX Generators, [https://docs.ocean.dwavesys.com/en/latest/docs\\_dnx/reference/generators.html](https://docs.ocean.dwavesys.com/en/latest/docs_dnx/reference/generators.html) (2021).
- [68] G. Hinton and R. Salakhutdinov, Training a deep autoencoder or a classifier on MNIST digits (2014).
- [69] D. Blackman and S. Vigna, Scrambled linear pseudorandom number generators, *ACM Transactions on Mathematical Software (TOMS)* **47**, 1 (2021).
- [70] airhdl.com, airhdl VHDL/SystemVerilog Register Generator, <https://airhdl.com>.
- [71] B. Block, P. Virnau, and T. Preis, Multi-GPU accelerated multi-spin Monte Carlo simulations of the 2D Ising model, *Computer Physics Communications* **181**, 1549 (2010).
- [72] T. Preis, P. Virnau, W. Paul, and J. J. Schneider, GPU accelerated Monte Carlo simulation of the 2D and 3D Ising model, *Journal of Computational Physics* **228**, 4468 (2009).
- [73] K. Yang, Y.-F. Chen, G. Roumpos, C. Colby, and J. Anderson, High performance Monte Carlo simulation of Ising model on TPU clusters, in *Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis* (2019) pp. 1–15.
- [74] Y. Fang, S. Feng, K.-M. Tam, Z. Yun, J. Moreno, J. Ramanujam, and M. Jarrell, Parallel tempering simulation of the three-dimensional Edwards–Anderson model with compact asynchronous multispin coding on GPU, *Computer Physics Communications* **185**, 2467 (2014).
- [75] H. Xiao, K. Rasul, and R. Vollgraf, Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,

- arXiv preprint arXiv:1708.07747 (2017).
- [76] A. Krizhevsky and G. Hinton, Cifar-10 and cifar-100 datasets, <https://www.cs.toronto.edu/~kriz/cifar.html> (2009), accessed: [Insert Date of Access].
- [77] X. Qin, X. Luo, Z. Wu, and J. Shang, Optimizing the sediment classification of small side-scan sonar images based on deep learning, *IEEE Access* **9**, 29416 (2021).

## SUPPLEMENTARY INFORMATION

### 1. FPGA Implementation

We present the experimental results in the main paper using a hybrid classical-probabilistic computer. Here, we discuss the technical details of the FPGA-based p-computer implemented on a Xilinx Alveo U250 data center accelerator card. The basic architecture is presented in FIG. S1.

#### a. *p-bit and MAC Unit*

A single p-bit consists of a tanh lookup table (LUT) for the activation function, a pseudorandom number generator (PRNG), and a comparator to implement Eq. (4). Digital input with a fixed point precision (10 bits with 1 sign, 6 integers, and 3 fraction bits) provides tunability through the activation function. The effect of different fixed point precisions has been explored in Supplementary Section 5. In this work, we used a high-quality 32-bit PRNG (xoshiro [69]). There is also a multiplier–accumulator (MAC) unit to compute Eq. (3) based on the neighbor p-bit states and provide input signal for the LUT as shown in FIG. S1a.

#### b. *Bipolar to binary conversion*

As described in Section II of the main article, Eq. (2)-(4) are represented with the bipolar state of the variables. However, for the FPGA-based implementation of p-computer, using a binary state of variables is more convenient since digital CMOS operates with gnd (0) and VDD (1) and Boolean logic can naturally represent the p-bit state. Therefore, bipolar to binary conversion is done on the weights and biases before being sent to the FPGA, according to the following mapping:

$$J_{\text{binary}} = 2J_{\text{bipolar}} \quad (\text{S.1})$$

$$h_{\text{binary}} = h_{\text{bipolar}} - J_{\text{bipolar}} \mathbf{1} \quad (\text{S.2})$$

where,  $\mathbf{1}$  is a vector of ones of size  $N \times 1$ , and  $N$  is the number of p-bits. All the variables in the MAC unit are also calculated in binary notations. To convert the stored activation function values of LUT from bipolar to binary representation, the  $\tanh(x)$  is mapped to  $[1 + \tanh(x)]/2$ . Finally, the output p-bit is represented with binary states  $m_i \in \{0, 1\}$ .



**Fig. S1.** (a) The MAC (multiplier–accumulator) unit implements Eq. (3). The p-bit unit consists of a xoshiro pseudorandom number generator (PRNG), a lookup table for the activation function ( $\tanh$ ), and a comparator to generate a binary output. (b) A built-in clocking unit generates equally phase-shifted and same-frequency parallel clocks to trigger the PRNGs inside the colored p-bit blocks. (c) A PCIe interfacing unit transfers data between MATLAB and the FPGA.

#### c. *Clocking unit*

In Section IX of the main paper, we discussed graph coloring to color p-bit blocks achieving massive parallelism, which we define as the linear scaling of probabilistic flips/ns with respect to increasing graph size. To achieve this parallelism, we utilized multiple built-in clocks inside the FPGA board to drive the PRNGs within the p-bit blocks, as shown in FIG. S1a. The mixed-mode clock manager (MMCM) block, available in the Xilinx Alveo U250 data center accelerator card, generates equally phase-shifted and same-frequency parallel stable clocks. The input to the MMCM is a 300 MHz differential pair system clock created on a low-voltage differential signaling (LVDS) clock-capable pin (FIG. S1b). These clocks are highly precise with minimal noise or phase deviation. By triggering the colored p-bit blocks with these phase-shifted clocks, we achieve massive parallelism.

#### d. Interfacing unit

We use MATLAB as an Advanced eXtensible Interface (AXI) manager to communicate with the FPGA board through a PCI express interface (FIG. S1c). We have designed an AXI manager integrated IP on the board that transfers data with a 32-bit memory-mapped slave register IP via the fourth-generation AXI (AXI4) protocol. An external website ‘airhdl’ [70] is used to manage the memory mapping of the registers into the block rams (BRAMs) inside the FPGA. We used two BRAMs to write the weights and biases from MATLAB and another BRAM to save the p-bit states that MATLAB reads out.

## 2. Readout architecture with mirror p-bit

FIG. S2 shows our readout architecture that can take a “snapshot” of the system at a given time. The nature of the learning algorithm is that only a handful of equilibrium samples are needed to estimate correlations and averages without requiring continuous samples. By way of mirror (or copy) p-bit registers, we decouple system p-bits (that are always on) from snapshot states that are saved to memory. In this way, while the system goes through a large number of samples, we are able to take a sufficient number of samples that are saved to local memory (at our own accord) which can then be read by a CPU.



**Fig. S2.** (a) Block diagram showing the mirror (or copy) p-bit architecture with a snapshot signal. The controller block generates the snapshot signal at which time the original p-bit states are copied to local memory (registers) and at the inverted snapshot signal those states are saved into block memory (BRAM), only once. Subsequent zero signals from the snapshot signal do nothing to mirror/copy p-bits or to BRAM. (b) Conceptual diagram to visualize the operation of the snapshot signal.

## 3. Graph density of sparse DBMs

The networks we present in the manuscript (Pegasus 4,264 p-bits and Zephyr 3,360 p-bits) are highly sparse, which we quantitatively show in FIG. S3. We use the typical graph density metric, measured by

$$\text{graph density} = \frac{\text{number of edges in the network}}{\text{number of edges in a fully-connected network}} \times 100 \quad (\text{S.3})$$



**Fig. S3.** The graph density and the neighbor distribution of sparse DBMs (Pegasus and Zephyr) where graph density,  $\rho = 2|E|/(|V|^2 - |V|)$  where  $|E|$  = the number of edges and  $|V|$  = the number of vertices in the graph. The density of Pegasus 4,264 p-bits is 0.33% and 3,256 p-bits have the maximum number of neighbors 15. Zephyr (3,360 p-bits) has a density of 0.56% and 2,432 p-bits have the 20 maximum neighbors.

By this metric, the network we have shown in FIG. 1e (Pegasus, 4264) only has a graph density of 0.33%. Moreover, we also show a vertex degree distribution of the network providing a histogram of nodes and the number of their neighbors. On the right, the same metrics are shown for the Zephyr graphs with similar results.

#### 4. Training accuracy of MNIST/100

We have studied the effect of training the sparse DBMs (Pegasus 4,264 p-bits) with a small subset of data before training the full MNIST. In this setting, we chose 100 images from MNIST to train on the sparse network with our massively parallel architecture. To train these 100 images, we used 10 mini-batches, having 10 images in each batch and the same set of hyperparameters as in training full MNIST. The training accuracy reached 100% within 1,000 epochs as illustrated in FIG. S4. We also explored different values of the ‘regularization’ parameter ( $\lambda = 0.005, 0.001$ ) which is generally used to keep the weights from growing too much. In our case of sparse DBM, we did not observe any significant difference between a small regularization and without any regularization. The poor test accuracy here is an indication of overfitting due to the small size of the training set (only 100 images). We observed similar accuracy on Zephyr graphs with 3,360 p-bits for 100 images.



**Fig. S4.** Accuracy of training 100 images with sparse DBMs up to 1,000 epochs. Training is accomplished with 10 mini-batches and CD- $10^5$ . All the trained 100 images and 20 unseen images have been tested here.

#### 5. Effect of weight precision

The results we have in Section V in the main paper utilized a weight precision of 10 bits (1 sign, 6 integer, and 3 fraction bits i.e., s{6}{3}). Here, we explore different weight precisions by changing the fraction bit width and compare the results to identify the effect of weight precision on accuracy. We trained full MNIST with 1,200 mini-batches for 200 epochs, using five different weight precisions of s{6}{2}, s{6}{3}, s{6}{5}, s{4}{2} and s{3}{2}. We chose a Pegasus 2,560 p-bit graph as a sparse DBM for this experiment that fits into the FPGA since increasing bit precision reduces the available resources. The weight update is accomplished in MATLAB (double-precision floating point with 64 bits), but before the new weights are loaded to the FPGA, they are converted to the corresponding fixed-point precision. The choice of hyperparameters remains the same for all cases. The test accuracy goes to  $\approx 88\%$  in each case (with 17,984 parameters) and there is no remarkable difference among the accuracy of the different weight precisions between s{6}{5} to s{6}{3}, accuracy starts degrading at or below s{4}{2} (FIG. S5a). We also trained full MNIST on RBM (512 hidden units) using both float64 and s{6}{3} weight precision for 200 epochs. The test accuracy remains the same for these two different precisions as shown in FIG. S5b.

To further study the impact of weight precision on the *generative* properties of the sparse DBM network, we have conducted image completion experiments. FIG. S5c shows inference experiments where we obscure half of an image and let a trained network evolve to the corresponding minima (by annealing the network from  $\beta=0$  to  $\beta=5$ ). We observe that while s{6}{3} can complete this task, precisions below s{4}{2} start failing.

#### 6. Mixing times in Pegasus graph (3,080 p-bits)

We described the mixing times in Section VII of the main paper showing the results (FIG. 4) from our largest size Pegasus (4,264 p-bits). Here, we show another graph, Pegasus with 3,080 p-bits to measure the mixing time of the network. Unlike the main model, for this experiment, we trained full MNIST for only 50 epochs (instead of 100) using the same hyperparameters as mentioned in the main Section V with different numbers of sweeps starting from CD- $10^2$  to CD- $10^6$ . Test accuracy improves significantly when we take more than CD- $10^4$  per epoch.

#### 7. Image generation with Zephyr

In the main paper, FIG. 1f displays the images generated with Pegasus (4,264 p-bits) graph and the procedure is described in Section VI. Here we explored image generation with a different type of sparse DBM, Zephyr (3,360 p-bits) that also reached



**Fig. S5.** (a) Test accuracy of full MNIST with sparse DBMs (Pegasus 2,560 p-bits) up to 200 epochs with five different fixed point precisions of weights ( $s\{6\}\{2\}$ ,  $s\{6\}\{3\}$ ,  $s\{6\}\{5\}$ ,  $s\{4\}\{2\}$  and  $s\{3\}\{2\}$ ). (b) Test accuracy of full MNIST for RBM (512 hidden units) with double-precision floating point 64 bits and  $s\{6\}\{3\}$ . (c) Image completion examples with sparse DBMs for fixed point precisions of weights  $s\{6\}\{3\}$  and  $s\{4\}\{2\}$  (annealing schedule varies from  $\beta=0$  to  $\beta=5$  with 0.125 steps). With  $s\{6\}\{3\}$  precision, the network can complete the images where the right half of the image starts from random noise. Below  $s\{4\}\{2\}$ , the network fails to complete the images.

$\approx 90\%$  accuracy with randomized indices as demonstrated in the main FIG. 5c (bottom). The generated images with Zephyr as shown in FIG. S7 are slightly different from the Pegasus ones.

### 8. Momentum to the learning rule

In our training, we used the momentum in our update rules, which are empirically added to the learning rule we discuss in the next section. By retaining a portion of the last update to the weights, momentum helps increase the effective learning rate [6]. The effective increase in the learning rate is equivalent to multiplying it by a factor of  $1/(1-\alpha)$  where  $\alpha$  is denoted as momentum. Using this process, the algorithm can increase the effective learning rate without causing unstable oscillations, which ultimately speeds up the convergence of the training process [50]. We modify the learning rule equations in the main Eq. (5) and Eq. (6) by



**Fig. S6.** (a) Test accuracy after training full MNIST up to 50 epochs with different numbers of sweeps using sparse DBMs (Pegasus 3,080 p-bits). For our sparse graph, to mix the Markov chain properly we need minimum CD-10<sup>4</sup>. Reducing the number of sweeps significantly degrades the quality of mixing in the chain. (b) Test accuracy as a function of CD-n at epoch 50 showing the equilibrium and non-equilibrium samples.



**Fig. S7.** Image generation examples with sparse DBM Zephyr (3,360 p-bits) after training the network with the full MNIST dataset.

introducing the momentum term as follows:

$$\Delta J_{ij}(n) = \varepsilon \left( \langle m_i m_j \rangle_{\text{data}} - \langle m_i m_j \rangle_{\text{model}} \right) + \alpha \Delta J_{ij}(n-1) \quad (\text{S.4})$$

$$\Delta h_i(n) = \varepsilon \left( \langle m_i \rangle_{\text{data}} - \langle m_i \rangle_{\text{model}} \right) + \alpha \Delta h_i(n-1) \quad (\text{S.5})$$

where  $n$  represents the  $j^{\text{th}}$  index (ranging from 1 to the number of batches) in Algorithm 1 in the main paper.

### 9. Maximum Likelihood for Boltzmann Networks

The basic idea of Boltzmann networks is to start from a physics-inspired variational guess, that the data distribution will be approximated by a model whose probability  $p^M$  for a given input vector  $m^{(i)}$  ( $i$ , being the input index) obeys the Boltzmann law (ignoring biases in our derivation for simplicity):

$$p^M \left( m^{(i)} \right) = \frac{1}{Z} \exp \left[ \sum_{er} J_{er} m_e^{(i)} m_r^{(i)} \right] \quad (\text{S.6})$$

In our setting, we have a system of  $M$  fully visible p-bits connected in some arbitrary graph topology. The problem is learning a “truth table” with exactly  $N_T$  lines of inputs in it. The model is going to try to select these  $N_T$  states in the space of  $2^M$  possible discrete probabilities. Like in any other ML model, fitting every line of the truth table exactly will overfit, but the Boltzmann formulation given by Eq. (S.6) smooths out the sum of “delta function”-like data vectors in the  $2^M$  space, which can later be used for generating new samples.

We define a  $p^V$  as the probability distribution of the data, corresponding to the visible bits. Then, a maximum likelihood estimation minimizing the Kullback–Leibler divergence between the data and the model can be used to derive the learning rule, by taking the negative derivative of  $KL(p^V \| p^M)$ :

$$KL(p^V \| p^M) = \sum_{i=1}^{N_T} p_i^V \log \left[ \frac{p_i^V}{p_i^M} \right] \quad (\text{S.7})$$

where  $i$  is the index of truth table lines  $N_T$ . To simplify analysis, we consider fully visible networks where  $p_i^V$  is independent of  $J_{mn}$  for any network topology since  $p_i^V$  represents the data distribution.

$$\frac{\partial KL(p^V \| p^M)}{\partial J_{mn}} = \frac{\partial p_i^V}{\partial J_{mn}} \log \left[ \frac{p_i^V}{p_i^M} \right] = \sum_{i=1}^{N_T} \frac{p_i^V}{p_i^M} \left( \frac{-\partial p_i^M}{\partial J_{mn}} \right) \quad (\text{S.8})$$

$$\frac{\partial p_i^M}{\partial J_{mn}} = \frac{\partial}{\partial J_{mn}} \left[ \frac{1}{Z} \exp \left[ \sum_{er} J_{er} m_e^{(i)} m_r^{(i)} \right] \right] \quad (\text{S.9})$$

$$Z = \sum_{j=1}^{2^N} \exp \left[ \sum_{er} J_{er} m_e^{(j)} m_r^{(j)} \right] \quad (\text{S.10})$$

where the index  $j$  represents all possible states from 1 to  $2^M$  for the model.

$$\begin{aligned} \frac{\partial p_i^M}{\partial J_{mn}} &= \frac{-1}{Z^2} \cdot \frac{\partial Z}{\partial J_{mn}} \cdot \left[ \exp \sum_{er} J_{er} m_e^{(i)} m_r^{(i)} \right] + \frac{1}{Z} \cdot \frac{\partial}{\partial J_{mn}} \left[ \exp \sum_{er} J_{er} m_e^{(i)} m_r^{(i)} \right] \\ &= \frac{-\partial Z}{\partial J_{mn}} \cdot \frac{1}{Z} \cdot p_i^M + m_m^{(i)} \cdot m_n^{(i)} \cdot p_i^M \\ &= - \sum_{j=0}^{2^N-1} m_m^{(j)} m_n^{(j)} \cdot p_j^M \cdot p_i^M + m_m^{(i)} m_n^{(i)} \cdot p_i^M \end{aligned} \quad (\text{S.11})$$

$$\begin{aligned} -\frac{\partial KL(p^V \| p^M)}{\partial J_{mn}} &= \sum_{i=1}^{N_T} \frac{p_i^V}{p_i^M} \cdot \frac{\partial p_i^M}{\partial J_{mn}} \\ &= \sum_{i=1}^{N_T} \frac{p_i^V}{p_i^M} \left[ m_m^{(i)} \cdot m_n^{(i)} \cdot p_i^M - \sum_{j=0}^{2^M-1} m_m^{(j)} \cdot m_n^{(j)} \cdot p_j^M \cdot p_i^M \right] \\ &= \left[ \sum_{i=1}^{N_T} p_i^V \cdot m_m^{(i)} m_n^{(i)} - \sum_{i=1}^{N_T} p_i^V \sum_{j=0}^{2^M-1} m_m^{(j)} \cdot m_n^{(j)} \cdot p_j^M \right] \\ &= [\langle m_m m_n \rangle_{\text{data}} - \langle m_m m_n \rangle_{\text{model}}] \end{aligned} \quad (\text{S.12})$$

which gives the familiar learning rule. A similar learning rule in terms of the averages can be derived by accounting for the biases in the energy, which we ignored for simplicity.

## 10. Comparison with state-of-the-art GPUs and TPUs

In the main paper, we show how graph-colored FPGA achieves massive parallelism to provide a few orders of magnitude faster sampling throughput than traditional CPUs. Here in Supplementary Table S1, we also compare the sampling speed to some state-of-the-art (SOTA) Ising machines implemented on the latest GPUs and TPUs. The throughput reported in this work up to 64 billion flips per second outperforms the numbers reported by the SOTA Ising solvers in GPUs and TPUs. It is also important that this comparison is not completely accurate and favors the GPU/TPU implementations for two reasons: First, all the GPUs and TPUs discussed here are solving simple, nearest-neighbor chessboard lattices, unlike the irregular and relatively high-degree (with up to 20 neighbors) graphs used in this work. Second, GPU/TPU implementations generally use low precision  $\{+1, -1\}$  weights (compared to 10 bits of weight precision in our work) and thus can explore only a few discrete energy levels. Both of these features are heavily exploited in reporting a large degree of flips/ns in these solvers and their performance would presumably be much worse if they were implemented in the same graphs with the same precision we discuss in this work.

TABLE S1. Optimized GPU and TPU implementations of Markov Chain Monte Carlo sampling with regular chessboard lattices. It is important to note that these TPU and GPU implementations solve Ising problems in sparse graphs, however, their graph degrees are usually restricted to 4 or 6, unlike more irregular and higher degree graphs.

| Sampling method                 | topology   | max. degree | flips/ns |
|---------------------------------|------------|-------------|----------|
| Nvidia Tesla C1060 GPU [71, 72] | Chessboard | 4           | 7.98     |
| Nvidia Tesla V100 GPU [73]      | Chessboard | 4           | 11.37    |
| Google TPU [73]                 | Chessboard | 4           | 12.88    |
| Nvidia Fermi GPU [74]           | Chessboard | 4           | 29.85    |

### 11. Image completion with DBM and RBM

A typical task for energy-based generative models is that of image completion as opposed to image generation. Image completion is relatively easier since the network is clamped near local minima. In the main manuscript, we show how an iso-accuracy RBM with around 3M parameters cannot perform image generation. Here, we show that an RBM can complete the easier task of image completion. We clamp half of the visible bits along with the labels and clamp the other half to noise. Then we let the trained network evolve to the corresponding minima by annealing the network from  $\beta=0$  to  $\beta=5$ . Results are shown for a sparse DBM (4,264 p-bits) and RBM (4,096 hidden units, trained with CD-100). We observe that despite failing at image generation, the RBM performs similarly to our sparse DBM in image completion as shown in FIG. S8.



**Fig. S8.** Image completion examples with RBM (4,096 hidden units and CD-100) and sparse DBM (Pegasus 4,264 p-bits). Only the left half of the images is shown (clamped) to the networks while the other half is obscured. The label bits are also clamped and the annealing schedule varies from  $\beta=0$  to  $\beta=5$  with 0.125 steps.

### 12. Quality of parallel samples

To explicitly show the quality of our parallel samples in our graph-colored architecture (in the main Section IX), we have performed the following “inference” experiment in the CPU performing exact Gibbs sampling vs. our parallelized FPGA using a sparse DBM (Pegasus 3,080 p-bits):

- We start with an MNIST-trained Pegasus network with (3080 p-bits) with known weights and biases.
- We initialize all p-bits to the +1 state at time step 1 and define a “network magnetization”,  $\langle m \rangle_k = \sum_i^N (m_i)/N$
- We perform Exact (sequential) Gibbs Sampling in a CPU and our parallelized Gibbs Sampling in the FPGA for M=100 times, measuring  $\langle m \rangle_k$  for each run,  $k \in 1, \dots, M$ . We obtain an ensemble-averaged  $\langle m \rangle$ .
- We then compare these averaged magnetizations, as a function Monte Carlo sweeps taken in the FPGA and CPU.



**Fig. S9.** Average magnetization as a function of Monte Carlo sweeps in a CPU (exact Gibbs sampling) vs. parallelized FPGA. See text for details.

FIG. S9 shows the results of this experiment showing near identical relaxation between the FPGA and the CPU. FPGA takes about 0.067 seconds to take a million samples as opposed to a projected 21.1 hours from the CPU (we did not take more than 10,000 samples over 100 iterations in the CPU, since at that point both models converged). These numbers are in accordance with our expectations from their relative flips/second numbers and they establish that the samples taken by the FPGA follow the Gibbs sampling process.

### 13. Grayscale training of full Fashion MNIST

To test the differences between sparse DBMs and RBMs, we used our largest network (Pegasus 4264 p-bits) to train full Fashion MNIST [75], a more challenging dataset than MNIST [9]. Fashion MNIST consists of  $28 \times 28$  grayscale images of 70,000 fashion products (60,000 in the training set and 10,000 in the test set) from 10 categories (e.g. t-shirt/top, trouser, sneaker, bag, pullover, and others), with 7,000 images per category. We have trained this dataset using our sparse DBM on the Pegasus graph. There are 4264 p-bits with 30,404 parameters in this network where the number of visible units is 784, 50 label units, and 3430 hidden units.

Our approach to grayscale images is based on time-averaging inspired by the stochastic computing approach of Ref. [66], where grayscale images between 0 and 1 are treated as the time-averaged probability of activation for p-bits. During the positive phase, we choose N (e.g., 20, 50, 100) binary samples from this probability and clamp the visible nodes as described in the main Section IV and Section V. Here, we used 1200 mini-batches containing 50 grayscale images in each batch during training. To train Fashion MNIST, we used 20 binary (black and white) samples for each grayscale image resulting in a total of  $60,000 \times 20$  training images. We found that the number of black and white samples can vary depending on the dataset as a hyperparameter.

TABLE S2. Fashion MNIST accuracy with different sizes of RBMs.

| Number of hidden units | number of parameters | maximum accuracy (%) |
|------------------------|----------------------|----------------------|
| 43                     | $34 \times 10^3$     | 71.90                |
| 64                     | $50 \times 10^3$     | 76.56                |
| 128                    | $101 \times 10^3$    | 77.45                |
| 256                    | $203 \times 10^3$    | 78.45                |
| 1264                   | $1 \times 10^6$      | 85.56                |
| 2048                   | $2 \times 10^6$      | 84.72                |
| 4096                   | $3.25 \times 10^6$   | 82.64                |

To test classification accuracy, we also used 20 black and white samples for each grayscale image to perform the same softmax classification as described in the main Section V. After the network sees those 20 black and white samples to form a grayscale image, we check the labels to establish classification accuracy. Using this scheme of grayscale images, our sparse DBM with 30,404 parameters can reach around 80% in 120 epochs as shown in FIG. S10a. We trained RBMs with different numbers of parameters as listed in Table S2 using the same approach as sparse DBM. The iso-parameter RBM (43 hidden units and 34k parameters) can reach a maximum of 72% accuracy on full Fashion MNIST while the million-parameter RBMs can go to around 85% test accuracy. As in MNIST, we see that RBM requires the order of a million parameters to reach sparse DBM accuracy.

Using a similar approach to image generation as described in the main Section VI, we can also generate images of fashion products with our sparse DBM as shown in FIG. S10b. Similar to other cases, for image generation, we only clamp label bits for



**Fig. S10.** (a) Fashion MNIST accuracy can reach around 80% in 120 epochs with sparse DBM. Full Fashion MNIST (60,000 images) with 20 black and white samples per grayscale image is trained on sparse DBM (Pegasus 4,264 p-bits and 30,404 parameters) using  $CD-10^5$ , batch size = 50, learning rate = 0.003, momentum = 0.6 and epoch = 120. Test accuracy shows the accuracy of the whole test set (10,000 images) and the training accuracy represents the accuracy of 10,000 images randomly chosen from the training set. (b) Image generation examples with sparse DBM after training the network with full Fashion MNIST dataset by annealing the network from  $\beta=0$  to  $\beta=1$  with 0.125 steps. A particular image. We anneal the network slowly from  $\beta=0$  to  $\beta=5$  with a 0.125 increment using the final weights and biases. Then we check the 784 visible p-bits after time-averaging the collected samples to obtain grayscale images. Using a similar procedure, we observe that none of the RBMs can generate images despite having a maximum of 85% accuracy. We observed that different annealing schedules (e.g., with slower  $\beta$  changes) do not help RBM image generation.

#### 14. Grayscale CIFAR-10/100

To see whether the same conclusions hold for our architectural and algorithmic ideas for sparse DBMs, we trained 100 images (10 images from each class) from the CIFAR-10 dataset [76]. Due to resource limitations in our FPGA, we could not increase the number of parameters, hence we used Pegasus 4,264 p-bits as sparse DBM to train the grayscale CIFAR-10/100. The CIFAR-10 dataset consists of  $32 \times 32$  color images of 10 different classes (i.e., airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck), with 6000 images per class. There are 50000 training images and 10000 test images.

We converted the color images into grayscale [77] using the ‘rgb2gray’ function in MATLAB. We used 1024 visible units, 5 sets of labels (50 p-bits), and 3190 hidden units arranged in 2 layers as shown in the inset of FIG. S11a. Utilizing the same approach of binary transformation from grayscale images as described in Section 13, we have trained the CIFAR-10/100 dataset with 100 black and white samples per grayscale image with 10 mini-batches (having randomly selected 10 grayscale images in each batch). Training accuracy of this dataset with sparse DBM can reach around 90% in 2000 epochs as shown in FIG. S11a while the iso-parameter RBM (40 hidden units) accuracy is only 68% as listed in Table S3. RBMs reach the same levels of accuracy using between 264,000 and 1 million parameters, in line with our earlier results.



**Fig. S11.** (a) Training accuracy of 100 images from CIFAR-10 is around 90% in 2000 epochs with sparse DBM. Training is accomplished with 100 black and white samples per grayscale image using  $CD-10^5$ , batch size = 10, learning rate = 0.006, momentum = 0.6, and epoch = 2000. (b) Image completion examples with RBM (4096 hidden units) and sparse DBM. Only the left half of the grayscale images are shown (clamped) to the networks (using 100 black and white samples) while the other half is obscured. The label bits are also clamped and the annealing schedule varies from  $\beta=0$  to  $\beta=5$  with 0.125 steps, and for the other case  $\beta$  is kept to 1.

Image generation for CIFAR-10 in this reduced setting with only 100 images in the training set failed both for sparse DBM

and RBM. For this reason, we also examined the image completion task (details in Section 11) with RBM and sparse DBM as shown in FIG. S11b. We clamped only the left half of a grayscale image (using 100 black and white samples) along with the corresponding label bits and checked the right half of that image. In this case, both RBM (4096 hidden units) and sparse DBM performed similarly, in this much harder setting.

TABLE S3. CIFAR-10/100 accuracy with different sizes of RBMs.

| Number of hidden units | number of parameters | maximum accuracy (%) |
|------------------------|----------------------|----------------------|
| 40                     | $41 \times 10^3$     | 68                   |
| 64                     | $66 \times 10^3$     | 83                   |
| 128                    | $132 \times 10^3$    | 88                   |
| 256                    | $264 \times 10^3$    | 88                   |
| 968                    | $1 \times 10^6$      | 99                   |
| 2048                   | $2 \times 10^6$      | 100                  |
| 4096                   | $4 \times 10^6$      | 100                  |

### 15. Full graph topologies: Pegasus 4,264

Below we show the full Pegasus network topology with 4,264 p-bits and its sparse deep BM representation.



**Fig. S12.** Layered embedding of the 4,264 p-bit Pegasus graph of FIG. S13, illustrating the sparse DBM architecture: the first layer is visible p-bits with 834 nodes, second and third layers are the hidden p-bits with 3,226 and 204 nodes respectively. There are also some intralayer connections within each layer. An example is shown in the right circle which shows the neighboring connections around node 3,443. The number next to a line represents the number of wires grouped in that branch, the total number being the fan-out of a given p-bit (vertex).



**Fig. S13.** The original sparse DBM network (Pegasus: 4,264 p-bits) used in this work with marked-up visible (blue), hidden (orange), and label (yellow) units.