

# Photonic Multiply-Accumulate Operations for Neural Networks

Mitchell A. Nahmias<sup>1</sup>, Thomas Ferreira de Lima<sup>1</sup>, Alexander N. Tait<sup>1</sup>, Hsuan-Tung Peng<sup>1</sup>, Bhavin J. Shastri<sup>2</sup>, *Member, IEEE*, and Paul R. Prucnal<sup>1</sup>, *Fellow, IEEE*

**Abstract**—It has long been known that photonic communication can alleviate the data movement bottlenecks that plague conventional microelectronic processors. More recently, there has also been interest in its ability to implement low-precision linear operations, such as matrix multiplications, quickly and efficiently. We characterize the performance of the photonic and electronic hardware underlying neural network and deep learning models using multiply-accumulate operations. First, we investigate the fundamental limits of analog electronic crossbar arrays and on-chip photonic linear computing systems. Photonic processors are shown to be superior in the limit of large processor sizes ( $>100\text{ }\mu\text{m}$ ), large vector sizes ( $N > 500$ ), and low noise precision ( $\leq 4$  bits). We discuss several proposed tunable photonic MAC systems, and provide a concrete comparison between deep learning and photonic hardware using several empirically-validated device and system models. We show significant potential improvements over digital electronics in energy ( $> 10^2$ ), speed ( $> 10^3$ ), and compute density ( $> 10^2$ ).

**Index Terms**—Artificial intelligence, Neural networks, Analog computers, Analog processing circuits, Optical computing

## I. INTRODUCTION

Photonics has been well studied for its role in communication systems. Fiber optic links currently form the backbone of the world's telecommunications infrastructure, vastly overshadowing the best electronic communication standards in information capacity. Light signals have many advantageous properties for the transfer of information. For one, a photonic waveguide, with diameters ranging from those in fiber ( $\sim 80\text{ }\mu\text{m}$ ) to those fabricated on-chip ( $\sim 500\text{ nm}$ ), can carry information at enormous bandwidth densities—i.e., terabits per second—with an energy efficiency that scales nearly independent of distance. This density is possible thanks to signal parallelization in photonic waveguides, in which hundreds of high speed, multiplexed channels can be independently modulated and detected. Photonic channels also experience less distortion, jitter, and crosstalk between one another compared to their electrical counterparts.

Photonic technology has traditionally been used for long distance communication. However, modern bandwidth requirements and the standardization of silicon photonic integrated circuits (PICs) have led to the proliferation of shorter distance photonic links. For example, silicon photonic transceivers are now a pervasive component in data-centers. In addition, the efficiency of a photonic link, which is dominated by the E/O and O/E conversion costs between the electrical and photonic domains, is rapidly encroaching on the efficiency of electronic links: the cost to move data photonically between nodes at a data-center ( $\sim 1\text{ pJ/bit}$  [1]) is now within order unity of the cost to move data from a modern DRAM memory stack to a processor [2].

At the same time, there has been a substantial increase in the use of many-core parallel processing systems for a variety of tasks in high performance computing (HPC). Artificial intelligence (AI), in particular, is growing at an alarming pace: deep learning models have been doubling in size every 3.5 months, far outpacing Moore's law [3]. These systems have much greater communication overheads than classical von Neumann architectures such as CPUs, resulting in a dramatic increase of both the area and energy consumption of metal interconnects (see, for example, Ref. [4]). They are also bottlenecked computationally by the ability to perform matrix multiplications efficiently, which represent the most common operations in HPC.

The most computationally expensive task in current AI models is the implementation of neural networks. Current deep learning models require dense, low-precision matrix computations. Digital instantiations of matrix (or tensor) units typically suffer from high communication overheads, expensive digital operations, and high latencies. On the other hand, photonic linear operations, such as passive Fourier transforms [5] or matrix operations [6], exhibit stark advantages in bandwidth density, latency, and energy. As mentioned in [7], [8], photonic computations are passive, exhibiting favorable energy scaling costs which are potentially  $O(N)$  for  $O(N^2)$  fixed point operations. Photonic matrix multiplication occurs in a single step, bottlenecked only by the periphery of modulation and detection. A more surprising observation is the computational density of such an approach: despite the large sizes of photonic devices, such systems can deliver more operations per second in a given area than digital electronics.

This manuscript analyzes the merits of using photonics for simulating neural networks. We begin by exploring the implementation of multiply-accumulate operations (which take the form  $a' \leftarrow a + w \cdot x$ ) in various platforms in Section III, discussing the costs and benefits of digital electronics, analog electronics, and photonics. We provide a comparison of the fundamental limits of electronic crossbar arrays and photonic linear computing systems in Section IV, and analyze the performance of these models across metrics such as energy, speed, and computational density. We consider the general performance of photonic MACs along these metrics based on practical devices that are compatible with large-scale silicon photonic foundries. In the last section, we provide a concrete comparison between fully-tunable neuromorphic photonic networks, based on known photonic device models and principles, and state-of-the-art digital deep learning chips.

<sup>1</sup>Department of Electrical Engineering, Princeton University, Princeton, NJ, 08544 USA

<sup>2</sup>Department of Physics, Engineering Physics & Astronomy, 64 Bader Lane, Queen's University, Kingston, ON, Canada, K7L 3N6

## II. MULTIPLY-ACCUMULATE OPERATIONS

The multiply-accumulate (MAC) operation calculates the product of two numbers and adds the result to an accumulator. For a given accumulation variable  $a$  and modified state  $a'$ , the operation takes the following form:

$$a' \leftarrow a + (w \times x) \quad (1)$$

MACs are constituents of a number of linear mathematical operations, including dot products, matrix multiplication, function evaluation, Fourier transforms, and convolutions. MACs have traditionally characterized the performance of digital signal processing (DSP) applications [9], [10], but have become increasingly prominent in modern HPC.

We are most interested in a specific use case: the simulation of neural network models. For a set of input variables  $x_i$  and output variables  $y_j$ , each node  $j$  (or neuron) receives signals from a large number  $M$  of other nodes  $i$ . The inputs are combined via a *weighted sum* of the form  $\sum w_i x_i$  and then input into some nonlinear scalar function  $f\{x\}$ :

$$x'_j = f \left\{ \sum_1^M w_i x_i; x_j \right\}, j = 1, 2, 3 \dots N \quad (2)$$

The function  $f\{x\}$  can represent any nonlinear function (e.g., a ReLU or spiking neuron), and can be simulated in the analog or digital domains. The weighted sum can be broken down into a series of MAC operations of the form  $a_i = a_{i-1} + w_i x_i$  for  $i = 1 \dots M$ . Each neuron requires  $M$  parallel MAC operations. Therefore, a neural network of size  $N$  requires  $M \times N$  MAC operations per time step, or one operation per synapse. In a fully interconnected network with  $N$  nodes ( $M = N$  case), the number of MAC operations required per time step  $\Delta t$  (or characteristic time constant  $\tau$  in analog hardware) is  $N^2$ . The nonlinear function  $f\{x\}$  can also consume energy, but since this operation scales with  $O(N)$  rather than  $O(N^2)$ , it does not represent the most costly operation. As the size of the network  $N$  grows large, MACs (i.e., *synaptic computations*) become the most burdensome hardware bottleneck in neural networks [11]. It is therefore no surprise that MACs are the most ubiquitous computations in deep learning hardware acceleration [12], [13].
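The decomposition of a weighted sum into MACs, and the resulting count of $M \times N$ operations per time step, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy weights, and the choice of a ReLU for $f$ are all assumptions, and the state argument $x_j$ of Eq. (2) is omitted for brevity.

```python
def weighted_sum_via_macs(weights, inputs):
    """Accumulate sum_i w_i * x_i one MAC at a time: a <- a + w * x."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    accumulator = 0.0
    mac_count = 0
    for w, x in zip(weights, inputs):
        accumulator = accumulator + w * x  # one MAC operation, as in Eq. (1)
        mac_count += 1
    return accumulator, mac_count


def relu(value):
    """Illustrative choice for the nonlinearity f (an assumption)."""
    return max(0.0, value)


def network_step(weight_rows, inputs):
    """One time step of N neurons, each with a fan-in of M inputs."""
    outputs = []
    total_macs = 0
    for row in weight_rows:  # one row of weights per neuron j
        pre_activation, macs = weighted_sum_via_macs(row, inputs)
        outputs.append(relu(pre_activation))
        total_macs += macs
    return outputs, total_macs


# Hypothetical toy network: N = 2 neurons, M = 3 inputs each.
example_weights = [[0.5, -1.0, 2.0], [1.0, 1.0, -3.0]]
example_inputs = [1.0, 2.0, 0.5]
example_outputs, example_macs = network_step(example_weights, example_inputs)
```

The nonlinearity is applied once per neuron, while the MAC loop runs once per synapse, which is why the MAC count dominates as the network grows.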

### A. Data Movement

Although the operations in neural network processors are dominated by MACs, the vast majority of the energy consumption in digital systems is in data movement [14], [15], as illustrated in Fig. 1. Activations and weights must be shuttled to and from various memory caches and buffers to the matrix multiplication units and back. The immense energy cost primarily stems from the need to charge and discharge metal wires. Metal lines have a capacitance per unit length of around  $100 \text{ aF}/\mu\text{m}$ ; charging energy scales according to $\sim CV^2$ , and since the voltage  $V$  is fixed by the fabrication node, conventional digital electronics must necessarily pay this cost [16] (see discussion in Section III-A).

Fig. 1: Standard configuration of a modern AI chip. Information is passed from the memory to the processor, which consists primarily of MAC operations. Moving data (blue arrows) takes up the majority of the energy budget, followed by MAC operations.
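To give a feel for the $\sim CV^2$ wire-charging cost discussed in this subsection, the following sketch evaluates it for a wire of a given length. The capacitance per unit length is the figure quoted in the text; the 1 V supply and the wire lengths are assumptions chosen only for illustration.

```python
CAPACITANCE_PER_UM = 100e-18  # F/um, the ~100 aF/um figure quoted in the text


def wire_charging_energy(length_um, voltage):
    """Energy scale C * V^2 (joules) for charging a wire of the given length."""
    capacitance = CAPACITANCE_PER_UM * length_um
    return capacitance * voltage ** 2


# Assumed values: 1 mm and 10 mm wires at a 1 V supply (not from the text).
energy_1mm = wire_charging_energy(length_um=1000.0, voltage=1.0)
energy_10mm = wire_charging_energy(length_um=10000.0, voltage=1.0)
```

Under these assumptions a 1 mm wire costs on the order of 100 fJ per transition, and the cost grows linearly with length, which is the distance dependence that photonic links avoid.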

It has long been known that photonic links have the potential to solve the data movement problems that have plagued electronic chips [17]–[20]. Instead of paying an energy cost proportional to the length of each connection, photonic links pay the majority of the cost upfront, converting from the electrical domain to the photonic domain and back (length-dependent energy loss is typically negligible in comparison). Waveguides can thus beat metal wires in efficiency, provided that the cost of E/O/E conversion is less than the cost of charging a metal wire over the same distance.

Addressing the data movement problem alone, however, may not be worthwhile—one still pays the E/O/E cost ( $\sim 1 \text{ pJ}$  [1]) communicating between cores, which is within order unity of the current cost of performing each MAC operation (see Table II). Instead, we can use photonics for both data movement and matrix multiplication simultaneously, interfacing modulators and detectors in close proximity with local memory directly to a photonic neural network processor. Various photonic memory architectures have been explored in the literature (see for example Ref. [21]–[23]). This manuscript focuses primarily on the neural network processor. A key advantage, in this case, is that the memory retrieval cost is amortized over the neural network, which can lead to significant energy savings, and ultimately, huge performance gains over digital processors.

### B. Precision

Neural networks have unique properties that make them well suited for analog computations: they are one of the few widespread computational paradigms that work effectively with both low precision and fixed point operations. Analog computing is far more resolution limited than standard floating-point operations. For example, representing a 16-bit value on an optical signal requires detecting at least  $2^{32}$  photons per time step to stay above the shot noise limit. At a typical telecommunications wavelength ( $\lambda \sim 1.55 \mu\text{m}$ ), this puts us above the energy consumption of current digital processors ( $\sim 550 \text{ pJ}$  per sample, leading to  $> 1 \text{ pJ/MAC}$ , see Table II). In addition, although analog variables can represent the mantissa of a floating point number, it is far more difficult to represent exponents. Analog operations thus rely on mantissa-only representations, in which the exponent remains fixed throughout the computation.
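The shot-noise figures quoted above can be reproduced with a short calculation. The sketch assumes Poisson photon statistics, so that the signal-to-noise ratio of $n$ detected photons is $\sqrt{n}$ and resolving $2^{N_b}$ levels needs $n \geq 2^{2N_b}$ photons; the function names are hypothetical.

```python
PLANCK = 6.62607015e-34      # J*s
LIGHT_SPEED = 299792458.0    # m/s


def shot_noise_photons(bits):
    """Photons needed so that sqrt(n) >= 2**bits (shot-noise-limited SNR)."""
    return 2 ** (2 * bits)


def energy_per_sample(bits, wavelength_m=1.55e-6):
    """Optical energy (J) carried by that many photons at the given wavelength."""
    photon_energy = PLANCK * LIGHT_SPEED / wavelength_m
    return shot_noise_photons(bits) * photon_energy


energy_16_bit = energy_per_sample(16)   # about 5.5e-10 J, i.e. ~550 pJ
energy_4_bit = energy_per_sample(4)     # about 3.3e-17 J, i.e. ~33 aJ
```

Because the photon count grows as $4^{N_b}$, dropping from 16 bits to 4 bits reduces the minimum optical energy per sample by roughly seven orders of magnitude, which is why low-precision workloads suit analog photonics.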

Thankfully, empirical research has shown that neural networks can operate effectively with both low precision and fixed point operations. Inference models work nearly as well with 4-8 bits of precision (sometimes even down to 1-2 bits [24], [25]), and training with about 8-16 bits of precision per computation [26], [27]. Training can even work with binary weight evaluations, as long as high resolution stored weights are applied stochastically during training [28]. There is also evidence that fixed point arithmetic within the matrix core is effective for both inference [27] and training [29]. However, many of these studies have focused on *quantized precision*, in which signals are resolved deterministically via a set of threshold values. Analog systems can be far more random, both as a function of noise and fabrication variation. In the digital domain, there are strict conditions on the number of bit errors that systems can handle (typically, we want SNR  $\sim 10 \text{ dB}$  for a digital channel with forward error correction [30]).

The degree of noise or fault tolerance can vary significantly across different neural network models [31], but interestingly, such models can *be made* robust via proper construction and training [32]–[35]. In some cases, unbiased noise added during training results in a more robust model, effectively acting as a form of regularization [36]. The resulting network becomes more noise-tolerant with an accuracy that is equivalent to a network trained without noise [37]. In practice, noise levels can also approach deterministic precision thresholds: for example, stochastic rounding across signals and weights has many theoretical advantages [38], which is effectively similar to setting the SNR  $\sim 0 \text{ dB}$  relative to the quantization level. In this sense, robustly constructed neural networks can operate with far more noise than standard digital links.

For the remainder of this manuscript, we characterize our precision with respect to the analog noise in each channel. We define a parameter  $\text{SNR} \equiv 2^{N_b}$ , where  $N_b$  represents the number of *bits of noise precision* for a given computation. We will also define a parameter  $\rho$  (see Ref. [39] for an equivalent definition), which represents the loss of precision in the analog domain from the digital domain. For  $\rho = N$ , we have fixed point arithmetic, in which the precision is only defined with respect to the dynamic range of the output after summation  $\sum x_i w_i$ . This leads to scaling advantages, as discussed in Sections IV and VI. For  $\rho = 1$ , we guarantee that every output  $w_i x_i$  maintains full precision  $N_b$ , even if the weight  $w_i$  is small. We can also have  $1 < \rho < N$ , where the desired precision is in some way dependent on the amplitude of the signal:  $\rho = \sqrt{N}$  represents an interesting case, guaranteeing that a signal in a prior layer maintains the same precision in the next layer after a  $1/N$  fan-out loss. Importantly, we will consider only the fixed point case ( $\rho = N$ ), since neural networks can function natively in fixed point, and it leads to great efficiency in the analog domain.

### C. Compute Density

Throughout this manuscript, we define a figure-of-merit that we can use to compare various architectures with one another. This metric (previously used to benchmark power-performant floating point operations in digital electronics [40]) will be referred to as *compute density*, and is defined as follows:

$$D = \frac{\text{Speed (MACs/s)}}{\text{Area per MAC unit (mm}^2\text{)}} \quad (3)$$

Compute density is related to several other well established metrics. For example, since it is limited by the ability to communicate across each MAC unit, it is upper bounded by *bandwidth density* (bits/s/mm<sup>2</sup>; see Ref. [18]). It is also affected by energy efficiency, since we must keep our system within a reasonable power density (<1 W/mm<sup>2</sup> [40]) to prevent thermal runaway. We analyze these limits in Section IV.
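As a rough illustration of Eq. (3) and of the power density constraint mentioned above, the sketch below computes the compute density of a single MAC unit and the heat it would dissipate per unit area. The unit's speed, footprint, and energy per MAC are hypothetical numbers, not values from the text.

```python
def compute_density(macs_per_second, area_per_mac_mm2):
    """Eq. (3): speed of a MAC unit divided by the area it occupies."""
    return macs_per_second / area_per_mac_mm2


def power_density(macs_per_mm2_per_second, energy_per_mac_joules):
    """Power dissipated per mm^2 when running at a given compute density."""
    return macs_per_mm2_per_second * energy_per_mac_joules


# Hypothetical unit: 1 GMAC/s in 100 um x 100 um (0.01 mm^2) at 1 pJ/MAC.
density = compute_density(1e9, 0.01)      # MACs/s/mm^2
heat = power_density(density, 1e-12)      # W/mm^2
within_thermal_budget = heat < 1.0        # the <1 W/mm^2 bound cited in the text
```

Multiplying compute density by energy per MAC gives power density directly, so for a fixed thermal budget any gain in energy efficiency translates into headroom for higher compute density.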

There are a number of reasons why compute density is useful, particularly when we are comparing different kinds of architectures that may multiplex signals differently or run at vastly different clock rates. When we look at crossbar arrays (such as Ref. [41]) or digital matrix configurations such as systolic arrays [12], [42], there are well defined notions of MAC area, MAC density, memory density, and speed. However, digital architectures that use time multiplexing strategies (e.g., TrueNorth [43]), or photonic strategies that could use either time or wavelength multiplexing (e.g., those described in Sections IV, V) do not necessarily have the same clear definitions, because there are many more MACs being implemented than the number of physical units. It depends on whether we consider these “virtual” MACs as part of the density calculation or not, which can complicate our comparisons.

Defining a compute density metric remedies these ambiguities, providing a grounded way to speak about processing power that is relatively invariant to the multiplexing or channelization strategy. We will also see that, as with bandwidth density, it is not necessary to define how we divide the spectrum up into independent channels in order to talk intelligently about the limits of compute density. And ultimately, we are interested in the total amount of computational power (op/s) that a given system can exhibit. Microprocessor areas are fairly invariant: they tend to occupy 100s of mm<sup>2</sup>, for cost and yield reasons. From this perspective, compute density also acts as a measure of the compute power of a microprocessor that uses a given architecture, since chipsets likely occupy areas that are within order unity of one another.

## III. PHYSICAL IMPLEMENTATIONS OF NEURAL NETWORK HARDWARE

In order to compare electronic and photonic processing with one another, we will use the multiply-accumulate (MAC) operation, defined in Section II. Below, we explore the advantages and disadvantages of implementing these operations in digital microelectronics, analog electronics, and photonics.

### A. Electronic Implementations

1) *Digital Electronics*: Conventional digital computers are based on the von Neumann architecture [44] (also called the Princeton architecture), and are typically implemented in silicon microelectronics. They include a memory bank that stores both data and instructions, and a central processing unit (CPU) that performs nonlinear operations. Instructions and data stored in the memory unit lie behind a shared multiplexed bus which means that both cannot be accessed simultaneously. This leads to the well known von Neumann bottleneck [45] which fundamentally limits the performance of the system—a problem that is aggravated as processors run memory-bound algorithms. Nonetheless, this computing paradigm has dominated for over 60 years, driven in part by the continual progress dictated by Moore’s law for CPU scaling—the number of transistors that can be put on a microchip doubles every 18 months to 24 months [46]—and Koomey’s law—the number of computations per joule of energy dissipated doubles approximately every 1.57 years [47].

These limitations have led to the massive parallelization and specialization of hardware architectures [48]. CPUs used to be the most common choice for most applications, but in recent years, many-core architectures such as GPUs and FPGAs have expanded to encompass general purpose tasks in the high performance computing arena. Concurrently, specialized ASICs are becoming increasingly popular for the implementation of artificial intelligence algorithms, which require low precision, high density matrix computations; a notable example is Google’s Tensor Processing Unit (TPU) [12]. Although parallelization can break down tasks that are highly distributable [49], its performance gains eventually hit diminishing returns as a result of Amdahl’s law [50]. As a separate issue, I/O latency and sequential processing capabilities cannot exceed the time resolution of the processor itself, which is ultimately bounded by its *clock rate*. Even MAC units need to serialize the summands to perform weighted addition (Fig. 2(a)).

Although digital microelectronics continue to increase in performance as lower nodes are introduced, an increasing number of practical barriers are inhibiting the scaling of processing and energy densities. As an illustrative example, clock rates have saturated to around 500 MHz to 4 GHz [51], and chip designers have been forced to increase parallelism instead [52]. Attempting to drive processors faster, or with higher compute density, results in several runaway effects, including:

- **Energy Consumption:** The scalability of modern microprocessors is largely limited by power density, or energy dissipation per unit area ( $\text{W/mm}^2$ ). There is a trade-off between bandwidth and energy consumption in electronic devices. Ideally, the energy lost is almost entirely due to capacitive discharging. Dynamic power scales according to:

$$P = \alpha_{0 \rightarrow 1} \frac{1}{2} C V^2 f_s \quad (4)$$

for node transition activity factor  $\alpha_{0 \rightarrow 1}$ , capacitance  $C$ , driving voltage  $V$ , and switching frequency  $f_s$  [53].

Fig. 2: Parallel MAC operations (i.e. weighted addition  $\sum_i w_i x_i$ ) in different electronic implementations. (a) Digital: a typical MAC unit today includes separate multiply and accumulate operations, implemented in digital logic. (b) Analog DC: an analog implementation could use tunable impedance to implement weights, which can be instantiated in dense crossbar arrays.

However, at higher frequencies, secondary effects such as short circuit current and leakage become more pronounced, causing  $\alpha_{0 \rightarrow 1}$  to decrease and  $P$  to hit a floor value. Different material structures, device architectures, or higher driving voltages  $V$  can offset these effects, but typically increase energy consumption. This, in turn, produces more heat, which can manifest in runaway thermal effects. These thermal limitations are often the dominant limiting factor for chip scalability [54].

The largest energy contribution originates from communication, which primarily involves charging and discharging many metal wires. Metal lines, like electronic devices, dissipate energy resistively (via Eq. 4). In many processors with high communication overheads, such as FPGAs or deep learning chips, communication can easily account for more than half the energy cost [55], [56]. As it stands, digital architectures are far from optimal: the energy cost of biological systems is estimated to be  $<1\text{ aJ}$  per MAC operation [57], six orders of magnitude lower than the  $\sim 1\text{ pJ}$  per MAC of current state-of-the-art machine learning chips (see Fig. 7).

- **Signal Bandwidth:** Since interconnects are restricted by geometric constraints, microelectronic circuits typically rely on some form of temporal multiplexing for widespread, parallel data distribution between processors. For example, many neuromorphic architectures employ a digitization scheme called address event representation (AER) to communicate events between different neural processor cores [58], [59]. Unfortunately, electronic connections experience harsh trade-offs between bandwidth and interconnectivity. Signal bandwidth for both capacitive and inductive lines scales according to

$$B_l \propto \frac{A}{L^2} \quad (5)$$

for bandwidth  $B_l$ , cross sectional area  $A$ , and length  $L$  [60], [61]. As a result, metal wires are typically limited to signals no faster than several gigahertz in frequency. Temporal multiplexing strategies lead to even harsher trade-offs, since multiplexing  $N$  channels each with channel bandwidths  $B_c$  requires a total bandwidth of at least  $B_l \geq NB_c$  per multiplexed line.
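The two scaling relations in the list above, dynamic power (Eq. 4) and the bandwidth requirement of a time-multiplexed line, can be evaluated with a small sketch. All numerical values (activity factor, capacitance, voltage, channel counts) are assumptions chosen for illustration only.

```python
def dynamic_power(activity, capacitance, voltage, frequency):
    """Eq. (4): P = alpha * (1/2) * C * V^2 * f_s, in watts."""
    return activity * 0.5 * capacitance * voltage ** 2 * frequency


def min_line_bandwidth(num_channels, channel_bandwidth):
    """Lower bound B_l >= N * B_c for a time-multiplexed line."""
    return num_channels * channel_bandwidth


# Assumed values for illustration: alpha = 0.1, C = 1 fF, V = 1 V.
power_1ghz = dynamic_power(0.1, 1e-15, 1.0, 1e9)
power_2ghz = dynamic_power(0.1, 1e-15, 1.0, 2e9)

# 100 channels at 10 MHz each need at least 1 GHz on the shared line.
required_bandwidth = min_line_bandwidth(100, 10e6)
```

Both relations are linear: doubling the switching frequency doubles the dynamic power of a node, and doubling the fan-in of a multiplexed line doubles the bandwidth that line must support, which quickly collides with the few-gigahertz limit of metal wires.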

### B. Analog Electronics: Spatial Multiplexing

One way to avoid digital bottlenecks is to use an analog networking configuration in which each connection is represented by a physical wire. Dense connections can be instantiated in space-efficient topology such as crossbar arrays [62], [63]. Summation and multiplication can both be performed simultaneously using resistive elements together with Kirchhoff's current law (Fig. 2(b)). However, closely spaced wires also experience bandwidth-distance trade-offs. As an illustrative example, for a cluster of adjacent wires with pitch  $P$ , width  $P/2$ , thickness  $T$ , length  $L$ , RC bandwidth scales according to [60]:

$$B_l \propto \frac{1}{L^2} \left( \frac{1}{P^2} + \frac{1}{T^2} \right)^{-1} \quad (6)$$

This can become particularly problematic for large  $L > 1 \text{ mm}$ , and is responsible for the enormous energy costs seen for off-chip communication in electronics. That being said, if  $L$  is kept small, the bandwidth can actually be quite high and the energetic communication cost low [17]. One must be careful to confine the cores to a small area to keep the efficiency as high as possible (this point is discussed in more detail in Section IV).
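The bandwidth-distance trade-off of Eq. (6) is easiest to see through ratios. The sketch below evaluates the right-hand side of Eq. (6) up to its unspecified proportionality constant; the wire dimensions are assumed values, so only the relative scaling is meaningful.

```python
def relative_rc_bandwidth(length, pitch, thickness):
    """Right-hand side of Eq. (6), up to an unspecified proportionality constant."""
    return (1.0 / length ** 2) / (1.0 / pitch ** 2 + 1.0 / thickness ** 2)


# Assumed geometry in micrometres; only ratios are meaningful here.
short_wire = relative_rc_bandwidth(length=100.0, pitch=0.2, thickness=0.2)
long_wire = relative_rc_bandwidth(length=1000.0, pitch=0.2, thickness=0.2)
length_penalty = short_wire / long_wire     # 10x longer wire -> ~100x less bandwidth

fine_pitch = relative_rc_bandwidth(length=100.0, pitch=0.1, thickness=0.1)
pitch_penalty = short_wire / fine_pitch     # halving P and T -> ~4x less bandwidth
```

The quadratic penalty in both length and wire dimensions is why dense crossbars remain fast only when the core is kept physically small.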

One of the primary difficulties of analog electronic arrays is finding a good linear and tunable resistive element: traditional transistors, optimized for digital operations, do not have the linear transconductance profiles to make this tenable. New materials or fabrication approaches are therefore a necessity in creating efficient analog electronic arrays. To this end, memristive devices have been explored quite extensively (see Ref. [41], [64]–[66]) along with phase change memory (see Ref. [67] for a good review), which have yielded a number of interesting approaches for high-density storage and computing. For example, memristive memory now beats traditional flash memory in performance along many metrics, including density, reliability, speed, and endurance (see for example Ref. [68]). Nonetheless, for tunable resistive elements to take full advantage of the possibilities that crossbar arrays have to offer (as discussed in Section IV), we need to see additional performance improvements, and there needs to be a low-cost way to integrate them into standard fabrication processes.

### C. Photonic Implementations

Photonic signals can support much greater bandwidth densities and consume less energy over longer distances than their electrical counterparts [16]. This has motivated the development of fiber optic technology in telecommunication networks and now, interconnections in datacenters and processors [61]. The advantages of photonics are especially relevant for systems with high communication or bandwidth overheads. There are several unique physical properties that allow optical signals to manifest these advantages:

- **Bandwidth:** Optical carrier waves possess different orthogonal features, including wavelength, spatial mode and polarization, which do not interact with each other in passive devices. The total complex electric field  $\vec{E}(x, y, z, t)$  in a waveguide or fiber optic communication channel can be described as a sum over every optical mode  $m$ , polarization  $p$ , and wavelength  $n$ :

$$\vec{E}(x, y, z, t) = \sum_m \sum_p \sum_n \vec{e}_p A_{mp}(x, y) B_{mnp}(t') \times \cos(\omega_n t - \beta_n z + \phi_n)$$

for unit vector  $\vec{e}_p$ , mode profile  $A_{mp}$ , time-dependent term  $B_{mnp}$ , angular frequency  $\omega_n$ , propagation vector  $\beta_n$ , phase  $\phi_n$ ,  $t' = t - z/v_g$ , and group velocity  $v_g$ . Each term can be modulated independently via  $B_{mnp}$  and, in the absence of interference, can be separated using linear photonic devices. The optical telecommunication band itself has  $\Delta f \sim 5 \text{ THz}$  of spectral bandwidth, which provides approximately  $5 \text{ Tb/s}$  of information capacity for every mode  $m$  and polarization  $p$ . Unlike in electronics, bandwidth and linear separability are *intrinsic properties* of the electromagnetic wave, i.e. they are *independent* of design constraints such as waveguide length or proximity.

- **Impedance:** In optical systems, one only needs to match the refractive index to prevent reflections. In addition, since electric/optical (E/O) and optical/electrical (O/E) conversion is an inherently quantum process, electric nodes which communicate using photonic edges need not be electrically impedance matched with one another [69]. This reduces many of the design constraints that typically limit microwave electronic circuits.
- **Energy:** Since photonic signals are not subject to Joule heating, waveguides and fibers can be designed with very low signal attenuation (i.e.  $< 0.1 \text{ dB/cm}$  [70] and  $< 0.1 \text{ dB/m}$  in some cases [71], [72]), allowing for communication costs that scale independently of distance. This allows for the propagation of higher power signals without the associated contribution to thermal runaway. In addition, communication or computations in the optical domain could be performed with minimal or theoretically even *zero* energy consumption—especially for linear or unitary operations.

In addition to these physical benefits, there are also practical ones. While there has been research on photonic integration for some time, in the past five years, there has been a paradigm-shift in photonic integration that could garner the manufacturing benefits enjoyed by digital microelectronics [73], [74], namely:

- **Performance:** Shrinking devices reduces their energy requirements, and allows for continuous performance scaling. Furthermore, the high yields attainable only in foundries enable the fabrication of complex photonic systems.
- **Economics:** The presence of large markets driving silicon photonics (e.g., data-center transceivers) enables economies of scale in production, amortizing the cost of fabrication and packaging.
- **Standardization:** Every foundry line has a standard library of heavily optimized device designs, through which smaller enterprises can effectively utilize the fruits of millions of dollars worth of industrial research.

Silicon photonics offers a combination of foundry compatibility, device compactness, and cost that enables the creation of scalable photonic systems on chip. Its heavy use for data-center transceivers has led to a decrease in overall packaging costs. Of course, the industry is still new, so photonic chips are not without their challenges. A prime example is that tunable photonic devices are currently energetically expensive: microring resonators and phase shifters currently use heaters for coarse tuning, which can consume significant energy. This point is discussed more in Section V-A.

## IV. ANALOG MATRIX MULTIPLICATION: A COMPARISON BETWEEN OPTICS AND ELECTRONICS

It is clear that analog computing in both the electronic and photonic domains offers many advantages over digital microelectronics. So which one will win in the end? To get a better sense of fundamental performance bounds, we will compare an electronic crossbar array (the most common architecture for devices in Section III-B) with a hypothetical dense photonic matrix core, in which MACs are performed using a resistive approach in electronics and a passive linear approach in photonics. Inputs for the electronic core are analog voltages and currents, whereas the inputs and outputs for the photonic core are optically multiplexed signals with analog light intensities.

We use an example of performing a single, square matrix-vector operation, consisting of  $N$  input channels and  $N$  output channels ( $N^2$  MAC operations) with a fixed preconfigured matrix. We implicitly assume that there is a set of devices that can fully tune resistance or optical loss locally and selectively without a significant quiescent power overhead. A schematic of these models is shown in Fig. 3.

### A. Bandwidth Density

We first consider how our bandwidth density limits the overall *compute density* (see Section II-C) of each approach. A given compute core must simultaneously address both processing within the core (i.e., an efficient implementation of a MAC operation  $a = a + w \times x$ ) and data movement *across* the core (i.e., each MAC operation requires a result from a prior MAC unit in order to perform a full dot product  $\sum w_i x_i$  at the end of each row). As we will see, the data movement constraint can bound the performance of each of the cores.

We assume that there is a tunable, resistive element at the interface between metal crossbars, and each tunable element emulates a simple resistor associated with a fixed weight $w$. Kirchhoff's current law performs the summation $\sum w_i x_i$ with the weights within each matrix, determined by the relative resistance values along each wire. A standard formula for assessing the bandwidth of on-chip metal interconnects is $B_E \leq B_{RC}A/L^2$ per wire, for constant bit rate $B_E$, architecture-dependent constant $B_{RC}$ (typically $B_{RC} \sim 10^{16}$ for on-chip RC interconnects [61]), cross sectional area $A$, and length $L$ of the wire. Extending this analysis to crossbars, we make the simple observation that the area occupied by each resistive element is approximately equivalent to the cross-sectional wiring area $A$ in two dimensions. Computing over an $N \times N$ matrix multiply array with $L = NP_l$ for crossbar line pitch $P_l$, this gives us our bandwidth-limited electronic compute density $D_E$:

$$D_E \leq \frac{B_{RC}}{L^2} \quad (7)$$

in units of  $1/\text{s/mm}^2$ .

In the optical domain, each waveguide has an intrinsic bandwidth  $B_O$  upper bounded by the speed of the wave itself—for standard telecommunications wavelengths (1550 nm), this upper bound is in the range of  $B_O \sim 100 \times 10^{12} \text{ s}^{-1}$  for multiplexed signals (from  $f = 193 \text{ THz}$ ), but more realistically  $\sim 5 \text{ THz}$  for WDM-multiplexed systems in the  $1.3 \mu\text{m}$  or  $1.55 \mu\text{m}$  wavelength bands. Photonic waveguides are limited by the evanescent field coupling overlap between adjacent modes, which is a function of the wavelength of light. We can thus derive a minimum pitch  $P_\lambda$  between waveguides. This leads to a maximum bandwidth-limited photonic compute density  $D_O$  of:

$$D_O \leq \frac{B_O}{P_\lambda^2} \quad (8)$$

There is a critical difference here: electronic crossbars decrease in bandwidth density as the size of the crossbar ($L^2$) grows larger, whereas photonic systems maintain their density, independent of size. For fairly reasonable values based on the gain bandwidths in typical III-V devices and preventing crosstalk between waveguides ($B_O \sim 3 \times 10^{12} \text{ bits/s}$, $P_\lambda \sim 2\ \mu\text{m}$), the crossover point at which $D_O > D_E$ occurs near $L > 100\ \mu\text{m}$. Put another way, optics is expected to exhibit a greater on-chip bandwidth density limit than electronics for cores that occupy more than $L^2 > 0.01 \text{ mm}^2$ of area.
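
The crossover between Eqs. (7) and (8) can be checked numerically. The following is a minimal sketch, assuming $B_{RC} = 10^{16}$, $B_O = 3 \times 10^{12}$ bits/s and a waveguide pitch of $P_\lambda = 2\ \mu\text{m}$ (the pitch used later in Section IV-D); the function names are illustrative only.

```python
# Bandwidth-limited compute density, following Eqs. (7) and (8).
B_RC = 1e16          # architecture constant for on-chip RC interconnects
B_O = 3e12           # optical bandwidth per waveguide (bits/s)
P_LAMBDA_M = 2e-6    # assumed waveguide pitch (m), i.e. 2 um


def electronic_density(core_length_m: float) -> float:
    """Upper bound D_E = B_RC / L^2, in operations/s/m^2."""
    return B_RC / core_length_m ** 2


def photonic_density(pitch_m: float = P_LAMBDA_M) -> float:
    """Upper bound D_O = B_O / P_lambda^2, independent of core size."""
    return B_O / pitch_m ** 2


def crossover_length_m() -> float:
    """Core length where D_E = D_O, i.e. L = P_lambda * sqrt(B_RC / B_O)."""
    return P_LAMBDA_M * (B_RC / B_O) ** 0.5


if __name__ == "__main__":
    length = crossover_length_m()
    print(f"crossover core length: {length * 1e6:.0f} um")
    print(f"photonic density bound: {photonic_density() * 1e-6:.2e} /s/mm^2")
```

With these inputs the crossover length comes out slightly above $100\ \mu\text{m}$, consistent with the estimate in the text.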

There are a number of factors that this analysis did not take into account. Channel crosstalk becomes a bigger problem for electronic systems, but this can be greatly reduced by placing an isolating ground wire between each signal wire, keeping the bandwidth density still within order unity. Also, both optical and metal crossbar arrays can be scaled vertically using 3D stacking technology (see [75] for the optical case), and optical waveguides can also include mode multiplexing, which may shrink the effective pitch $P_\lambda$. Nonetheless, the analysis above provides a good first principles look at the bandwidth density, and shows that they are both capable of enormous compute densities, with optics winning in the large $L$ limit.

### B. Switching & Driving Energy

Here, we consider contributions from the *driving energy* (i.e., the amplitude of the signals required to drive any output circuitry) and the capacitive switching energy for both analog electronic and photonic cores. We will assume that the input and output voltages are compatible with transistors, restricting values to $\sim 0.5$ V or larger to prevent thermal leakage (see discussion in Ref. [16]).

Fig. 3: Schematic of analog matrix cores in electronics and photonics. (a) A schematic of an $N \times N$ resistive crossbar array, with a tunable resistive element at each junction that represents the matrix element being applied for input voltages (or currents) $x_i$ and output currents (or voltages) $y_j$. The size of the core scales with $(NP)^2$ for wire pitch $P$. (b) A schematic of a hypothetical $N \times N$ evanescent-field-limited wavelength multiplexed optical matrix core, with wavelength multiplexed inputs $X_{i,\lambda}$ and outputs $Y_{j,\lambda}$ along $k$ wavelengths (labeled by $\lambda$) in different waveguides (labeled by $i$). $M$ applies some linear function to the inputs to create the output vector, using local operations at waveguide junctions. The size of the core scales with $([N/k]P_\lambda)^2$ for waveguide pitch $P_\lambda$.

Given this voltage condition, the main way through which electronic crossbar arrays lose energy is capacitance discharge across the array. The energy lost per cycle is  $\frac{1}{2}CV_l^2$ , where  $C$  is the capacitance of the array and  $V_l$  is the line voltage. To arrive at a per-operation metric, we consider the contribution of charging each group of metal wires surrounding each resistive element: for a wire pitch  $P_l$ , this is  $L = 2P_l$ . Given standard capacitances of about  $c_l = 200 \text{ aF}/\mu\text{m}$  [61] and with each wire charging and discharging according to  $\frac{1}{2}C_lV_r^2$  for total capacitance  $C_l = 2c_lP_l$ , our energy consumption becomes:

$$E_{\text{MAC(E)}} = c_l P_l V_r^2 \quad (9)$$

per operation. For a standard line pitch $P_l \sim 80$ nm and $V_r \sim 0.5$ V, we arrive at $E_{\text{MAC(E)}} \sim 4$ aJ. This is quite low, and may be brought lower if advanced techniques are employed to reduce this pitch (e.g., $P_l \sim 12$ nm in Ref. [76]).
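
As a quick arithmetic check of Eq. (9), the sketch below evaluates the switching energy for the two pitches mentioned above, assuming the quoted wire capacitance of 200 aF/$\mu$m; the function name is illustrative.

```python
# Capacitive switching energy of a resistive crossbar, Eq. (9).
C_PER_LENGTH = 200e-18 / 1e-6   # wire capacitance: 200 aF/um, expressed in F/m


def e_mac_electronic(pitch_m: float, v_r: float = 0.5) -> float:
    """E_MAC(E) = c_l * P_l * V_r^2, in joules per MAC."""
    return C_PER_LENGTH * pitch_m * v_r ** 2


if __name__ == "__main__":
    for pitch_nm in (80, 12):
        energy_aj = e_mac_electronic(pitch_nm * 1e-9) * 1e18
        print(f"P_l = {pitch_nm} nm -> {energy_aj:.2f} aJ/MAC")
```

The 80 nm pitch gives about 4 aJ/MAC, and shrinking the pitch to 12 nm would bring this below 1 aJ/MAC under the same assumptions.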

The optical case has a potential scaling advantage, because metal wires need not be charged at each junction. In particular, photonics only requires charging  $N$  detectors for  $N^2$  operations. However, we must generate enough light to *drive* the detector with sufficient charge, which can be significantly limiting [61]. This depends on the amount of light that each detector receives, which can be affected by the precision loss  $\rho$ . For example, in a conservative estimate, a given signal in an  $N \times N$  matrix is split to  $1/N$ , and we must multiply our light power by  $N$  to make up for the loss if we are to maintain the same input precision ( $\rho = \sqrt{N}$ ). In a better case (i.e., fixed point arithmetic with  $\rho = N$ ), we care less about the signal and more about the full dynamic range of the output.

For some power $P_L$ driving a laser with efficiency $\eta_L$, some loss through the optical system $\eta_{wg}$, and detection efficiency $\eta_d$, the current we see at the detector is $I_d = e\,\eta_L \eta_{wg} \eta_d P_L / E_{ph}$ for photon energy $E_{ph} = h\nu$ and elementary charge $e$. Lumping the efficiencies into a single quantum efficiency $\eta = \eta_L \eta_{wg} \eta_d$, this gives us a minimum energy of:

$$E_{\text{MAC(O)}} \geq \frac{N}{\rho^2} \cdot C_d V_r \cdot \frac{h\nu}{e\eta} \quad (10)$$

for photon energy  $h\nu$  and elementary charge  $e$ . Note that we also have capacitive discharge from the detector (scaling according to  $(1/N) \cdot (1/2)CV_r^2$  per operation), although it typically has a smaller effect on the energy consumption than the driving condition above.

If we consider a deep learning framework compatible with fixed point arithmetic ($\rho = N$), we see that, unlike in the electronic case, the capacitive charging scales with $N$ rather than $N^2$. Choosing a high performance detector with $C_d \sim 1\text{ fF}$ [77], $V_r = 0.5\text{ V}$ (bringing the optical link energy to $<500\text{ aJ}$, see Ref. [16] for further discussion), and assuming a fairly efficient laser source ($\eta = 0.2$), we start to see a difference around $N > 500$ as shown in Fig. 4. We once again observe optical matrix multiplication cores gaining an advantage as the matrix becomes larger; in this case, we have a direct dependence on the $N \times N$ matrix size. Note that the single digit aJ/MAC bound is still a factor of $1 \times 10^5$ out of range relative to current state-of-the-art technologies (which are $>100\text{ fJ/MAC}$, see Section VI), so it is a far cry from limits we are seeing in the near term. Nonetheless, it is clear that both approaches have the potential for very low energy operations, with optics exhibiting a greater overall advantage in the large $N$ limit for fixed point operations.
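
The crossover near $N \sim 500$ follows directly from Eq. (10). The sketch below is a rough check under the stated parameters ($C_d = 1$ fF, $V_r = 0.5$ V, $\eta = 0.2$, $\nu = 193$ THz), compared against the $\sim 4$ aJ electronic switching energy of Eq. (9); the function names are illustrative.

```python
# Swing-limited optical driving energy, Eq. (10), and the vector size at
# which it drops below the electronic switching energy of Eq. (9).
H = 6.62607015e-34    # Planck constant (J s)
Q = 1.602176634e-19   # elementary charge (C)


def e_mac_optical_drive(n, rho, c_d=1e-15, v_r=0.5, eta=0.2, nu=193e12):
    """E_MAC(O) >= (N / rho^2) * C_d * V_r * h*nu / (e * eta), in joules."""
    return (n / rho ** 2) * c_d * v_r * H * nu / (Q * eta)


def crossover_vector_size(e_mac_electronic=4e-18):
    """N at which the fixed point (rho = N) photonic energy, which falls
    as 1/N, equals the given electronic energy per MAC."""
    return e_mac_optical_drive(1, 1) / e_mac_electronic


if __name__ == "__main__":
    print(f"crossover N ~ {crossover_vector_size():.0f}")
    for n in (64, 512, 1024):
        print(n, f"{e_mac_optical_drive(n, rho=n) * 1e18:.2f} aJ/MAC")
```

With $\rho = \sqrt{N}$ instead, the factor $N/\rho^2$ equals one and the driving energy stays near 2 fJ/MAC regardless of $N$, which is why the fixed point assumption matters for the scaling advantage.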

Fig. 4: Various scaling laws near the fundamental limits for photonic (red) and electronic (blue) fixed point compute cores ($\eta = 0.2$, $V_r = 0.5$ V, $\rho = N$, $T = 300$ K, $C_d = 1$ fF, $\nu = 193$ THz). (a) $N$ scaling at 4-bit precision. (b) $N_b$ scaling at $N = 1024$. We neglect periphery costs, including the capacitive charging and discharging of drivers and receivers. Solid lines represent total energy/MAC, while dotted lines represent the noise power contribution to this energy.

### C. Noise

Noise affects analog precision during computations and has a strong effect on the energy consumption of each analog core. Reading values from a resistive crossbar with some SNR is fundamentally limited by thermal noise [39]. Using  $\rho$  and  $N_b$  as defined in Section II, this gives us the following expression for the energy per MAC operation:

$$E_{\text{MAC(E)}} \geq \frac{N}{\rho^2} \cdot 4k_B T \cdot 2^{2N_b} \quad (11)$$

We again consider the case of full fixed point precision, where we define the precision with respect to the total output dynamic range (as discussed in Section II) and set  $\rho = N$ . Our MAC energy numbers become  $E_{\text{MAC(E)}} \sim 4$  aJ/N for 4-bit operations, and  $E_{\text{MAC(E)}} \sim 1$  fJ/N for 8-bit.

In the case of the optical matrix multiplier, we need to consider the noise on the E/O and O/E interfaces to and from the input and output. At the detector, the fundamental limit is shot noise, resulting from photon fluctuations from the incoming wave. Considering the total quantum efficiency  $\eta$ , we arrive at an analogous expression as above, but for shot noise:

$$E_{\text{MAC(O)}} \geq \frac{N}{\rho^2} \cdot \frac{2h\nu}{\eta} \cdot 2^{2N_b} \quad (12)$$

Using a fixed point representation ( $\rho = N$ ) with an efficient laser ( $\eta \sim 0.2$ ) in the C-band, this gives us 0.33 fJ/N at 4-bit and 84 fJ/N for 8-bit.

Comparing these two quantities directly, the optical shot noise factor  $\frac{2h\nu}{\eta}$  is about an order of magnitude off from the thermal noise factor  $4k_B T$ . If we let our E/O/E efficiency  $\eta \rightarrow 1$  in the best case, the ratio between the energies is still  $E_{\text{MAC(O)}}/E_{\text{MAC(E)}} \sim 15$ , which is larger than order unity. So we see that in the limit of noise power limited operation at high precision, electrical crossbars have an advantage over optics.
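
The numbers quoted for Eqs. (11) and (12) can be reproduced with a few lines. This is a minimal sketch assuming $T = 300$ K, $\nu = 193$ THz and $\eta = 0.2$ as in Fig. 4; passing $N = \rho = 1$ returns the per-$N$ prefactor for the fixed point case.

```python
# Noise-limited energy per MAC: thermal noise (Eq. 11) vs. shot noise (Eq. 12).
H = 6.62607015e-34    # Planck constant (J s)
K_B = 1.380649e-23    # Boltzmann constant (J/K)


def e_noise_electronic(n, rho, n_bits, temp=300.0):
    """E_MAC(E) >= (N / rho^2) * 4 k_B T * 2^(2 N_b), in joules."""
    return (n / rho ** 2) * 4 * K_B * temp * 2 ** (2 * n_bits)


def e_noise_optical(n, rho, n_bits, eta=0.2, nu=193e12):
    """E_MAC(O) >= (N / rho^2) * (2 h nu / eta) * 2^(2 N_b), in joules."""
    return (n / rho ** 2) * (2 * H * nu / eta) * 2 ** (2 * n_bits)


if __name__ == "__main__":
    for bits in (4, 8):
        e_el = e_noise_electronic(1, 1, bits)
        e_op = e_noise_optical(1, 1, bits)
        print(f"{bits}-bit: electronic {e_el:.3e} J/N, optical {e_op:.3e} J/N")
    # Best-case ratio with unit quantum efficiency.
    print(e_noise_optical(1, 1, 4, eta=1.0) / e_noise_electronic(1, 1, 4))
```

The ratio in the last line depends only on $2h\nu / 4k_B T$, so it is the same at any precision $N_b$.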

### D. Discussion

We have considered the bandwidth density, switching energy, and noise at the physical limits of both electronic and optical matrix multiplier cores. We see that photonic cores exhibit scaling advantages over electronics for large core areas ($L > 100\ \mu\text{m}$) or large channel counts ($N > 500$, see Fig. 4), but perform worse, in the limit, if the system is noise-power limited.

To illustrate performance differences between the two approaches, let's set a vector size of $N = 1024$, which is within an order of magnitude of current conventions [12]. We calculate the maximum compute density with both 4-bit and 8-bit operations. For a given energy $E_{\text{MAC}}$, our power density is $D_P = E_{\text{MAC}}\Delta f/P^2$ and our computational density (ops/s/mm<sup>2</sup>) is $D = \Delta f/P^2$ for pitch $P$ between MAC elements and signal bandwidth $\Delta f$. We restrict the power density below a critical threshold $D_P < 1$ W/mm<sup>2</sup> [40] to prevent anticipated thermal issues that would otherwise result. We use the following parameters: a pitch of $P_E = 80$ nm for electronic crossbars, a driving voltage of $V_l = 0.5$ V, $B_O = 5$ THz, $P_\lambda = 2\ \mu\text{m}$, $C_d = 1$ fF and $\eta = 0.2$ (assuming a fairly efficient laser source). The results are shown in Table I.

For 4-bit operations, switching energy largely dominates over noise energy for both photonics and electronics. Optics exhibits an advantage here: electronic cores hit the thermal density limit, but photonic cores are able to saturate their full bandwidth density limit before that point. In the 8-bit case, we see noise energy becoming significantly larger. There is a large jump in the photonic energy consumption as we move to higher precision, thanks to a quadratic dependence on the relative noise power of each signal. In cases in which high precision is necessary, operating in a noise power limited regime results in electronic crossbars performing better.

Note that although electrical crossbars are less noise-bound than photonic cores, it is unclear if this increased precision capacity is important for artificial intelligence. Ref. [27], [28] have shown that the forward compute step does not need high precision even during training, as long as the underlying weight storage and gradient rules maintain granularity. Also, since shot and thermal noise are unbiased, batching can be used to average the noise over a given set of training data (where the effective precision over the batch with $M$ samples is equal to $N'_b = N_b + \log_2 \sqrt{M}$).

| Technology          | Noise Precision | Energy (aJ/MAC) | Compute Density (PMACs/s/mm<sup>2</sup>) |
|---------------------|-----------------|-----------------|------------------------------------------|
| Electronic Crossbar | 4 bit           | 4.0             | 250                                      |
|                     | 8 bit           | 5.0             | 198                                      |
| Photonic Core       | 4 bit           | 2.0             | 513                                      |
|                     | 8 bit           | 81.9            | 12.2                                     |

TABLE I: Compute density performance for idealized electronic and photonic matrix cores with $N = 1024$, subject to power density $<1 \text{ W/mm}^2$ to avoid thermal runaway. Photonic cores perform better at lower precision, i.e., in regimes not limited by noise power.
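
The entries of Table I can be approximately reproduced from the expressions in this section. The sketch below rests on two assumptions that are inferred from the tabulated values rather than stated in the text: the electronic energy is taken as the sum of the switching (Eq. 9) and noise (Eq. 11) terms, while the photonic energy is taken as the larger of the swing-limited (Eq. 10) and shot-noise-limited (Eq. 12) terms. Only the 1 W/mm<sup>2</sup> power density cap is applied; the bandwidth density bounds of Eqs. (7) and (8) are not modeled.

```python
# Rough reproduction of Table I for N = 1024 with rho = N (fixed point).
H = 6.62607015e-34    # Planck constant (J s)
K_B = 1.380649e-23    # Boltzmann constant (J/K)
Q = 1.602176634e-19   # elementary charge (C)

N = 1024              # vector size
V_R = 0.5             # driving voltage (V)
ETA = 0.2             # lumped quantum efficiency
NU = 193e12           # optical carrier frequency (Hz)
TEMP = 300.0          # temperature (K)
C_D = 1e-15           # detector capacitance (F)
C_WIRE = 200e-12      # wire capacitance, 200 aF/um in F/m
PITCH_E = 80e-9       # electronic crossbar pitch (m)
POWER_CAP = 1.0       # thermal limit (W/mm^2)


def electronic_energy(n_bits):
    """Assumed model: switching energy plus thermal noise energy (J/MAC)."""
    switching = C_WIRE * PITCH_E * V_R ** 2
    noise = 4 * K_B * TEMP * 2 ** (2 * n_bits) / N
    return switching + noise


def photonic_energy(n_bits):
    """Assumed model: larger of swing-limited and shot-noise-limited light."""
    swing = C_D * V_R * H * NU / (Q * ETA) / N
    noise = (2 * H * NU / ETA) * 2 ** (2 * n_bits) / N
    return max(swing, noise)


def thermal_limited_density_pmacs(energy_j):
    """Compute density (PMACs/s/mm^2) at the power density cap."""
    return POWER_CAP / energy_j / 1e15


if __name__ == "__main__":
    for name, model in (("electronic", electronic_energy),
                        ("photonic", photonic_energy)):
        for bits in (4, 8):
            e = model(bits)
            d = thermal_limited_density_pmacs(e)
            print(f"{name:10s} {bits}-bit: {e * 1e18:6.1f} aJ/MAC, "
                  f"{d:6.1f} PMACs/s/mm^2")
```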

The limits discussed here are a far cry from current technology: compute densities in the range of 100s of PMACs/s/mm<sup>2</sup> are a factor of $>1 \times 10^5$ from the current state-of-the-art as discussed in Section VI. This shows that both electronic and photonic arrays have immense computational capacity, and what may ultimately differentiate them may be short term technological developments, i.e., cheap, high endurance, and tunable weight elements, or the efficiency of the nonlinear periphery surrounding each matrix core.

An interesting note is that optical systems are pitch limited by $P_\lambda$, and electrical crossbars can have much smaller pitches ($<100 \text{ nm}$). This means that, in the limit, photonic devices will be much larger but run at much higher speeds. This can actually be a significant practical advantage: larger photonic devices may not be as sensitive to device variations or yield in a given fabrication process. We shall see that this size difference also occurs in nearer term systems, explored more closely in the sections that follow.

## V. PHOTONIC MULTIPLY-ACCUMULATE OPERATIONS

Here, we consider the practical performance of photonic MACs based on existing photonic devices. There are a variety of methods for implementing photonic multiply-accumulate operations using tunable photonic elements [8], [78], [80], [81] and also in many fixed network implementations in reservoir computing approaches [82]–[85]. We will distinguish between two primary mechanisms for implementing linear summations: *coherent* and *incoherent*, as defined in Ref. [7]. The former uses interferometry to implement linear operations via constructive and destructive interference, changing the relative power levels of a coherent beam. The latter utilizes excited carriers to perform summations or nonlinear operations, and can potentially accept multiple wavelengths or modes.

Coherent approaches can implement linear, unitary operations while only consuming energy resulting from passive loss. However, operations must be performed within a single wavelength and mode for a given matrix (or else constructive and destructive interference would not occur between interacting lightwaves), and all-optical nonlinearities are generally challenging to implement at low optical signal intensities. Systems that fall under the interference-mediated approach include the passive reservoir [85] and the interference-based processor described in Ref. [8].

Incoherent photonic MAC units are capable of operating across different wavelengths, modes, or polarizations. For dot product functionality, filter banks (described in [78], [79]) can apply weights via the partial transmission of signals to one (or more) detectors. This can greatly increase the information density on-chip, since many independent channels can coexist in a single waveguide. Performing a MAC is also passive in the incoherent approach: for a fixed filter topology, the computations are performed as lightwaves flow to their respective destinations. Unlike in the coherent approach, semiconductor devices (and therefore, O/E conversions) are required at each nonlinear processing stage. Systems that occupy this category include those described in Ref. [78], [82], [86], [87]. A more detailed discussion of these relationships is also provided in Ref. [7].

For both approaches, we will speak broadly about photonic MAC operations in the context of an  $N \times N$  matrix operation. We consider the energy per MAC, speed (signal bandwidth and latency), and computation density (i.e., MACs/s/mm<sup>2</sup>).

### A. Energy

Photonic devices, much like their resistive electronic counterparts, implement matrix operations passively and linearly. This leads to a number of advantages—in particular, for an  $N \times N$  matrix, many of the most expensive energy costs scale with the size of the vector  $O(N)$  rather than the size of the matrix  $O(N^2)$ . Below, we outline a general framework for understanding energy consumption in passive  $N \times N$  photonic arrays, and provide some analysis on the trade-offs between various tunable devices.

First, we consider the cost of driving the system with a light source. An unavoidable, fundamental contribution is from shot noise, as explored in Section IV-C. We can also have relative intensity noise (RIN) on each laser input, which can affect our precision $N_b$. However, this is typically close to the shot noise level for sufficiently high modulation frequencies. In addition, we must drive the capacitor of the detector with enough light to switch it (see Section IV-B). The main point to consider is whether these energies scale with $O(N)$, $O(N^2)$ or something worse, which depends on the precision loss $\rho$. As mentioned in Section II, it is likely that deep learning algorithms work well in fixed point arithmetic, allowing us to recover an $O(N)$ scaling law for our light input with $\rho = N$. Therefore, we potentially have a favorable scaling law for our light source, depending on the nature of the computations being performed.

Secondly, we consider costs that scale only with $O(N)$ rather than $O(N^2)$, which are those involving the periphery around the $N \times N$ matrix. Since we must first retrieve data from memory, modulate $N$ signals on the input and detect $N$ such signals at the output to place back into memory, we must consider the intrinsic costs associated with the driving and receiving circuitry, the modulators, detectors, and memory I/O. These energies are similar to those in digital photonic links (see Ref. [23], [88]), which include both driving and tuning the modulating device and the amplification and the recovery circuitry in the electronic receiver. Energy per sample can reach the hundreds of fJ for co-optimized photonic platforms [88], [89].

Fig. 5: Schematics for incoherent (top) [78], [79] and coherent (bottom) [8] implementations of tunable photonic multiply-accumulate operations. (a) Incoherent approaches can directly perform dot products on optically multiplexed signals. However, they rely on detectors and O/E conversion for summation. (b) The ability to multiplex allows for network flexibility, which can enable larger-scale networks with minimal waveguide usage. (c) Coherent approaches can apply a unitary rotation to incoming lightwaves. This unit can perform a tunable $2 \times 2$ unitary rotation denoted by $\mathbb{U}$. (d) Example of scaling the system to perform a matrix operation in a feedforward topology, using a $\mathbb{U}$ unit at each crossing together with singular value decomposition.

Lastly, we consider what can be the largest contribution to energy: costs that scale with  $O(N^2)$  with every photonic device. Although fixed systems can implement a pre-defined weight matrix  $W$  passively with low loss, tunable systems require a way of modifying the weight  $w$ . Photonic devices currently use heaters for coarse tuning, which consume a significant amount of power. Phase shifters in coherent approaches typically consume 10 mW to 20 mW per unit for thermal shifting [90], while microring heaters can consume  $\sim 1$  mW [91]. However, given the nature of passive photonic systems, these limits are not inherent. There are a variety of device modifications that promise to alleviate these problems that could see integration into foundries very soon. For example, phase shifters can be greatly enhanced with slow light cavities [92], and microresonators can be trimmed to the desired value using foundry-compatible techniques, negating the need for a heater [93]–[95].

Considering all these factors, our full energy per MAC equation is as follows:

$$\begin{aligned} E_{\text{MAC}} = & \frac{N}{\rho^2} \cdot \frac{h\nu}{\eta} \cdot \max \left[ 2^{2N_b+1}, \frac{C_d V_r}{e} \right] \\ & + \frac{1}{N} \cdot (E_{\text{mod}} + E_{\text{rec}} + E_{\text{mem}}) \\ & + P_q \Delta\tau \end{aligned} \quad (13)$$

The first term accounts for the optical power supplied to the system, which may either be noise limited (left) or swing-limited (right). The second term accounts for the capacitive switching and driving circuitry for the modulators ( $E_{\text{mod}}$ ), detectors ( $E_{\text{rec}}$ ) and the memory retrieval cost ( $E_{\text{mem}}$ ). The last term is the quiescent power use  $P_q$  for each element, which includes the power of coarse tuned heaters and the leakage power across diode junctions. We operate our system over some characteristic sampling time window  $\Delta\tau$  with some effective sampling rate  $1/\Delta\tau$ .

In practice, for heater-tuned resonators and phase shifters, the primary source of energy consumption is from tuning each element. If we operate the system at 10 GS/s (see Ref. [88] for various photonic link speeds), this puts the energy squarely in the range of 150 fJ/MAC to 1.5 pJ/MAC for resonators (on the low end) and phase shifters (on the high end). If we use techniques to remedy this cost as discussed above, our next primary contributions are the link energy $E_L$ (which is typically in the 100s of fJ range) and the capacitive charge of the detector, which consumes several fJ, even with conservative assumptions on precision ($\rho = \sqrt{N}$). The former quantity divides by $N$, so with channel counts in the hundreds, we are quickly brought to the low fJ/MAC range. This means that with $N > 100$ and the eradication of power hungry heaters, the single digit fJ/MAC range becomes tenable, a $> 10^2$ improvement over the current state-of-the-art in energy efficiency.
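
To make these ranges concrete, the sketch below evaluates the three terms of Eq. (13), taking the quiescent contribution as the quiescent power multiplied by the sampling window (equivalently, divided by the sampling rate). The periphery energy of 300 fJ and the heater powers of 1.5 mW and 15 mW are illustrative assumptions chosen to fall within the ranges quoted in the text, not measured values.

```python
# Energy per MAC for a tunable photonic array, following the three terms of
# Eq. (13): light source, O(N) periphery, and per-element quiescent power.
H = 6.62607015e-34    # Planck constant (J s)
Q = 1.602176634e-19   # elementary charge (C)


def e_mac_total(n, rho, n_bits, e_periphery, p_quiescent, sample_rate,
                c_d=1e-15, v_r=0.5, eta=0.2, nu=193e12):
    """Return (light, periphery, quiescent) energies in joules per MAC."""
    # Photons per sample: noise-limited or swing-limited, whichever is larger.
    photons = max(2 ** (2 * n_bits + 1), c_d * v_r / Q)
    light = (n / rho ** 2) * (H * nu / eta) * photons
    periphery = e_periphery / n              # shared across N MACs per row
    quiescent = p_quiescent / sample_rate    # P_q times sampling window
    return light, periphery, quiescent


if __name__ == "__main__":
    n = 100
    common = dict(n=n, rho=n ** 0.5, n_bits=4,
                  e_periphery=300e-15, sample_rate=10e9)
    scenarios = {
        "microring heater (assumed 1.5 mW)": 1.5e-3,
        "phase shifter heater (assumed 15 mW)": 15e-3,
        "heater-free (trimmed)": 0.0,
    }
    for label, p_q in scenarios.items():
        terms = e_mac_total(p_quiescent=p_q, **common)
        print(f"{label}: {sum(terms) * 1e15:.1f} fJ/MAC")
```

Under these assumptions the heater term dominates whenever it is present, and removing it leaves a total in the single digit fJ/MAC range at $N = 100$.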

In order for us to go beyond into the $\sim \text{aJ}$ range that we have explored at the fundamental limits (Section IV), we rely on (I) the creation of very low energy optoelectronic devices to reduce $E_L$ significantly as discussed in Ref. [16], (II) fixed point operations with $\rho = N$ to reduce the energy cost of the light source, which reduces both the shot noise contribution and the light required to drive each detector, and (III) memory-localized photonic links that shuttle data between the memory and neural network processor at a relatively low cost. We explore an architecture aimed at bringing aJ/MAC efficiencies in Section VI.

### B. Speed

Photonic MACs can be done at very high speeds, limited only by the optoelectronic devices that encode and decode the signals on the input and output. An $N \times N$ matrix only requires one time step to compute the result. We can divide speed into two primary components: signal bandwidth and latency. If the system is bandwidth-limited by multiple parts of the signal pathway with time constants $\tau_1, \tau_2, \tau_3 \dots$, we can approximate the total time constant $\tau$, which sets the overall bandwidth, as

$$\tau^2 \sim \tau_1^2 + \tau_2^2 \dots$$

The delay for each component is on the order of its time constant (about $\tau_1/2, \tau_2/2 \dots$), and the total latency is the addition of all the delays, such that

$$T \sim \tau_1 + \tau_2 \dots$$
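
The two approximations above can be sketched as follows. The time constants used here (20 ps, 10 ps and 20 ps for a modulator, weight element and detector) are hypothetical values picked purely for illustration.

```python
import math


def combined_time_constant(time_constants_s):
    """Overall time constant: individual contributions add in quadrature."""
    return math.sqrt(sum(t ** 2 for t in time_constants_s))


def total_latency(time_constants_s):
    """Order-of-magnitude latency: the delays along the pathway add linearly."""
    return sum(time_constants_s)


if __name__ == "__main__":
    pathway = [20e-12, 10e-12, 20e-12]   # hypothetical device time constants (s)
    tau = combined_time_constant(pathway)
    latency = total_latency(pathway)
    print(f"combined time constant: {tau * 1e12:.0f} ps")
    print(f"total latency estimate: {latency * 1e12:.0f} ps")
```

Because the time constants add in quadrature, the slowest component largely determines the overall bandwidth, while every component contributes fully to the latency.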

Several properties of photonic devices lead to their operation at much higher speeds compared to digital and analog electronic devices: (1) they do not suffer from data movement and clock distribution costs along metal wires, reducing the thermal barrier and allowing for higher clock rates, (2) a small number of photonic devices are required to perform the same MAC operations, greatly reducing latency, (3) photonic devices have a larger footprint than analog electrical devices and thus run faster to saturate the available bandwidth density, and (4) photonic arrays do not suffer from the clock jitter problems that plague metal wires and cause inconsistent signal arrival times. With typical bandwidths of $> 20 \text{ GHz}$ per photonic device and only several photonic devices in a signal pathway for a given $N \times N$ matrix operation, the signal bandwidth of each input can readily exceed $10 \text{ GS/s}$. Similarly, a $< 50 \text{ ps}$ delay time for most photonic components and only several devices per pathway (see, for example, Fig. 6) results in a delay that is $< 100 \text{ ps}$. In other words, the entire matrix is effectively computed in less than a *single digital electronic clock cycle*. This contrasts quite sharply with the $\sim \mu\text{s}$ delays seen in current digital approaches [12]. We thus see a stark $> 10^3$ decrease in latency, meaning that any practical system will be limited more by the periphery circuitry than the neural network core itself.

### C. Compute Density

We use the same compute density metric defined in Section II-C: the number of operations (MACs) performed in a given area ( $\text{mm}^2$ ) per unit of time (seconds). The underlying density of a photonic compute core can be quite high using standard photonic components, which we will illustrate with a simple example: suppose we took the  $512 \times 512$  AWG prototyped in Ref. [96] and used it to apply  $N^2$  linear operations over a vectorized set of input light intensities. Suppose that there were multiple sets of these signals at different wavelengths s.t. they were multiplexed across the entire  $\sim 5 \text{ THz}$  wavelength band. If we took the number of operations and divided by the area of the chip, we get the rather large compute density of  $6.8 \text{ PMAC/s/mm}^2$ , exceeding the state-of-the-art in digital electronics by  $> 10^4$ . This gives a picture for the capacity of optics—the large value stems largely from the ability to multiplex both signals and connections, a technique exploited quite often by optical reservoir computing approaches (see for example Ref. [82], [85]).

However, making matrices with adjustable weight values $w_{ij}$ can be more challenging: tunable photonic systems typically require $N^2$ photonic devices, since there must be a device for every weight $w_{ij}$. As discussed in Section V-A, there are a couple of tunable approaches that have received significant attention: the coherent and incoherent approaches, which require $2N^2$ Mach-Zehnder interferometers (MZIs) or $N^2$ resonators, respectively. The former currently loses on compute density, since each MZI requires significantly more area ($\sim 10000\ \mu\text{m}^2$ in Ref. [8]) compared to microresonators ($\sim 250\ \mu\text{m}^2$ or much smaller). Miniaturizing each MZI relies on some complex modifications, such as slow-light enhanced structures [92] or perhaps inverse design [97], [98], whereas resonators can increase in performance as they are shrunk in size [99].

To get a better sense of what $N^2$ photonic devices can achieve, we can look towards prototyped devices that are compatible with silicon photonic foundry models. Standard microrings of size $50\ \mu\text{m} \times 50\ \mu\text{m}$ operating at a sampling speed of $10 \text{ GS/s}$ result in a computational density of $10 \text{ TMACs/s/mm}^2$. This is a major improvement over current digital electronic densities, which are around $150 \text{ GMACs/s/mm}^2$ (see Table II). A key point is that even though photonic devices are much larger than individual transistors, a single MAC unit in digital electronics is actually composed of many hundreds or thousands of transistors, occupying $> 100\ \mu\text{m}^2$ in area [100], comparable to the one (or several) elements that can accomplish the same operation in analog photonics. With a higher energy efficiency, photonic elements can be clocked much faster without hitting energy density limits, leading to the overall larger compute density seen here.

What compute densities will photonics be able to attain in the near future? This is considered in the last part of Section VI, in which photonic crystal defect states [101] that occupy close to $2\ \mu\text{m}^2$ per resonator are closely packed together. As shown in Table II, this can lead to an enormous photonic compute density ($5 \text{ PMACs/s/mm}^2$). In conclusion, we can expect photonic devices will exceed current digital electronic systems by $> 10^2$ in compute density with miniaturized resonator components. In the future, more exotic structures (such as PhCs) could reach $> 10^3$ as photonic devices reach their fundamental limits in size.

Fig. 6: Schematic of the neuromorphic photonic models under comparison. The abstract neuron model (above) can be represented using: (A) A hybrid spiking laser neuron, investigated in Ref. [7], [102]. (B) A co-integrated silicon modulator neuron, based on the system in Ref. [80], [103]. (C) A sub-$\lambda$ photonic crystal neuron, running close to fundamental photonic limits. Photonically connected memory refers to models such as [21]–[23]. \*Note that A does not require the off-chip laser source since it generates its own light.
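
These density estimates amount to dividing the sampling rate by the area occupied per weight. The sketch below is a simplified count that assumes one MAC per weight element per sample, ignores routing and periphery area, and assumes a 10 GS/s rate for every device type, including the photonic crystal case.

```python
# Compute density of a tunable photonic array: one MAC per weight per sample.
UM2_PER_MM2 = 1e6


def compute_density(device_area_um2, sample_rate_hz, devices_per_mac=1.0):
    """Return MACs/s/mm^2 for weight elements of the given footprint."""
    area_mm2 = device_area_um2 * devices_per_mac / UM2_PER_MM2
    return sample_rate_hz / area_mm2


if __name__ == "__main__":
    rate = 10e9  # assumed sampling rate (samples/s)
    cases = {
        "50 um x 50 um microring": compute_density(50 * 50, rate),
        "MZI mesh (2 per weight, 10000 um^2 each)":
            compute_density(10000, rate, devices_per_mac=2),
        "photonic crystal cavity (2 um^2)": compute_density(2, rate),
    }
    for label, density in cases.items():
        print(f"{label}: {density / 1e12:.2f} TMACs/s/mm^2")
```

For the $50\ \mu\text{m} \times 50\ \mu\text{m}$ microring this simple count gives roughly 4 TMACs/s/mm<sup>2</sup>, the same order of magnitude as the figure quoted above, and for a $2\ \mu\text{m}^2$ cavity it gives 5 PMACs/s/mm<sup>2</sup>.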

## VI. NEURAL NETWORK HARDWARE COMPARISON

This section provides comparisons between neuromorphic photonic processing models and digital electronic processing systems. For concreteness, we focus specifically on Broadcast-and-Weight architectures [78], [102], which have been developed enough for a comparison to be possible, in particular through the empirical validation of both tunable weight systems [104]–[107] and nonlinear processors that have a direct functional correspondence with neuron models [7], [103]. Nonetheless, given that photonic architectures are bound by the same physical constraints and underlying devices, this comparison provides some insights for the performance of neuromorphic photonic systems in the more general case. For the photonic platforms, we choose three models with distinct characteristics: 1) a laser neural network based on an instantiation in a hybrid spiking III-V/silicon platform [108], [109], 2) a silicon photonics platform with tight co-integration with digital electronic drivers, controllers, and amplifiers [80], and 3) a nanophotonic platform operating close to fundamental noise limits. These hardware platforms are depicted in Fig. 6. A list of computed values is included in Table II, and a graph depicting the compute density and energy efficiency (along with some of the analog limits discussed in Section IV) is shown in Fig. 7.

1) *Hybrid Laser Neural Network*: This model, which is largely the focus of Ref. [102], uses currently available silicon photonic technology together with integrated III-V lasers to emulate biological spiking behavior. It has been proposed together with the Broadcast-and-Weight networking framework [78], and has also received considerable experimental validation, both in the tunable weight units [105] and the nonlinear processors that communicate using such units [7], [110], [111]. These systems are limited by two primary sources of energy consumption: the quiescent power of the laser and amplifier units (which can be as large as $200 \text{ mW}$), and the static power consumption of the heaters used within each filter bank (which can be as large as $2 \text{ mW}$ each). For the comparison, we assume an all-to-all network with a channel number of $N = 56$, based on limits discussed in Ref. [105]. The precision is based on experimentally validated measurements of microring filters [107]. We assume that, for excitable operation, lasers are biased close to threshold. We also consider a semiconductor optical amplifier (SOA) on the output port to generate enough output power for the next stage. For an $N \times N$ fully-connected network, the energy consumption per MAC operation can be expressed as:

$$E_{\text{MAC}} = \frac{1}{N} \cdot \underbrace{\frac{P_{\lambda(\text{th})} + P_{\text{SOA}}}{\tau_s}}_{\text{node energy}} + \underbrace{\frac{P_h + P_l}{\tau_s}}_{\text{edge energy}} \quad (14)$$

where $P_{\lambda(\text{th})} = I_{\text{th}} V_L$ represents the laser power consumption at threshold current $I_{\text{th}}$, $P_{\text{SOA}} = I_{\text{SOA}} V_{\text{SOA}}$ is the power consumption of each output SOA, $P_h = I_h^2 R_h$ is the average power dissipation of each microring heater, and $P_l = I_l V_{\text{MRR}}$ is the power dissipated by the current $I_l$ across the junction biased at $V_{\text{MRR}}$. $\tau_s$ represents the effective sampling rate, determined by the bandwidth of the real-time signal pathway and I/O (i.e., $\sim 10 \text{ GHz}$ [109]) during operation. We distinguish between power use at each node (which scales with $O(N)$ for $N^2$ operations) and power use at each edge (which scales with $O(N^2)$ for a MAC performed at each network edge), and omit the memory I/O cost, since it is not dominant here.

| Technology                           | Energy/MAC  | Compute Density            | Vector Size | Precision | Latency/Speed* |
|--------------------------------------|-------------|----------------------------|-------------|-----------|----------------|
| Google TPU [12]                      | 0.43 pJ/MAC | 150 GMACs/s/mm<sup>2</sup> | 256         | 8 bits    | 2 $\mu$s/1 ns  |
| Hybrid Laser NN [7], [109]           | 0.22 pJ/MAC | 4.5 TMACs/s/mm<sup>2</sup> | 56          | 5.1+ bits | <100 ps        |
| Co-Integrated Silicon NN [80], [103] | 2.7 fJ/MAC  | 50 TMACs/s/mm<sup>2</sup>  | 148         | 5.1+ bits | <100 ps        |
| Sub-$\lambda$ Nanophotonics          | 30.6 aJ/MAC | 5 PMACs/s/mm<sup>2</sup>   | 300         | 5.1+ bits | <50 ps         |

TABLE II: Comparison of various photonic hardware approaches with a well-known deep learning accelerator during mean operating conditions. Density is computed with respect to the core(s) only. \*Latency is defined as the time it takes to do a single matrix multiplication operation at the given vector size. Speed is defined as the time between subsequent matrix multiplies.

In this system, energy efficiency is primarily bottlenecked by the quiescent power consumption of the optical amplifier and that of the heaters. In practice, the remaining contributions (the laser threshold power and leakage terms) are negligible in comparison. In particular, the amplifier must provide enough energy to drive the next stage, meeting *cascadability* conditions as discussed in Ref. [7]. With our assumed channel density $N = 56$, and other parameters based on current photonic devices ($1/\tau_s \sim 100$ ps), we arrive at the 0.22 pJ/MAC shown in Table II. This system is comparable to deep learning chips and neuromorphic electronic systems in energy consumption, fan-in, and compute density. In the following subsections, we explore the improvements that can manifest in systems better optimized for energy efficiency.
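The structure of Eq. (14) can be sketched numerically as follows. This is a minimal illustration rather than the calculation behind Table II: the function name and argument names are hypothetical, the sampling period is taken as the reciprocal of $\tau_s$, and the example plugs in the "as large as" power figures quoted above while setting the laser threshold and leakage terms to zero as a simplifying assumption.

```python
def hybrid_laser_energy_per_mac(n, p_laser_th, p_soa, p_heater, p_leak, sample_period):
    """Energy per MAC (joules) following the form of Eq. (14).

    Powers are in watts. sample_period is the time per sample in seconds,
    i.e. the reciprocal of the effective sampling rate tau_s.
    """
    node_energy = (p_laser_th + p_soa) * sample_period  # spent once per node
    edge_energy = (p_heater + p_leak) * sample_period   # spent once per edge
    # Each node serves N edges, so the node cost is shared across N MACs.
    return node_energy / n + edge_energy


# Ceiling estimate using the upper power figures quoted in the text.
e_upper = hybrid_laser_energy_per_mac(
    n=56,
    p_laser_th=0.0,        # assumption: folded into the 200 mW node budget
    p_soa=200e-3,          # laser + amplifier quiescent power, upper figure
    p_heater=2e-3,         # heater static power, upper figure
    p_leak=0.0,            # assumption: leakage neglected
    sample_period=100e-12, # 100 ps per sample (10 GHz)
)
print(f"{e_upper * 1e12:.3f} pJ/MAC")
```

With these inputs the sketch gives about 0.56 pJ/MAC. This is a ceiling rather than the Table II entry, since the individual device parameters behind the 0.22 pJ figure are not itemized in the text. It does show how the node term shrinks as $1/N$ while the edge term does not.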

2) *Co-Integrated Neuromorphic Silicon Photonic Network:* This platform (first discussed in Ref. [80], [103]) uses continuous models and can vastly reduce the energy consumption via a close interface between digital electronic and photonic systems. This interface allows easy E/O and O/E conversions between electrical nonlinearities and photonic linear computation elements. This system also uses silicon photonic technologies that are currently available in foundries, but its performance depends critically on several new developments and insights, including: (I) the use of active electronic amplification to sidestep the gain-bandwidth trade-off in each nonlinear processing unit, and (II) the reduction of static power in microring filters by minimizing the use of heaters. For the remainder of this analysis, we also assume a close proximity, low capacitance interface between electronics and photonics (i.e., TOVs with <50 fF [112]), and low-node electronics (i.e., FinFETs [113]).

One of the first challenges is minimizing the quiescent power usage that results from each filter (scaling with $O(N^2)$) requiring a power-hungry heater. Note that this is not a problem inherent in photonic elements, since a pre-fabricated fixed photonic network performs the same computations without consuming power. To avoid the immense cost of tuning across the fabrication variation that occurs across microresonators, we assume that each element is trimmed to avoid the use of heaters, as discussed in Section V-A. Integrating these approaches into the fabrication process would allow for a tremendous reduction ($P_h \rightarrow 0$) in energy consumption.

Next, we consider the limitations imposed by amplitude cascadability. In a *passive* neuron configuration in which a detector directly drives a modulator with no intermediate circuitry (i.e., Ref. [103]), each nonlinear element must replenish the energy lost from the previous layer. In an all-to-all $N$-node network with $N^2$ connections, we must ensure that the small-signal gain from layer to layer is greater than unity (i.e., $g = dP_{\text{out}}/dP_{\text{in}} > 1$). This puts the following lower bound on the energy consumption per MAC operation:

$$E_{\text{MAC}} \geq \frac{N}{\rho^2} \cdot \underbrace{\frac{h\nu}{\eta}}_{\text{quantum efficiency}} \cdot \underbrace{\frac{1}{e} [2\pi V_s(C_{\text{mod}} + C_{\text{PD}})]}_{\substack{\text{switching charge} \\ \text{for gain cascadability}}} \quad (15)$$

where $\eta = \eta_L \eta_{wg} \eta_d$ is the product of the laser efficiency, photonic link efficiency, and photodetector efficiency, respectively; $V_s$ is the inverse slope of the modulator's voltage-to-transmission curve $T(V)$; and $C_{\text{mod}}, C_{\text{PD}}$ are the capacitances of the modulator and photodetector, respectively. In a typical foundry model where $V_s(C_{\text{mod}} + C_{\text{PD}}) \sim 70$ fC and $\eta \sim 0.07$, even with $\rho = N$ in fixed point systems, we arrive at a floor of approximately $E_{\text{MAC}} \geq 30$ fJ/MAC. This does not include the passive losses through the weight banks, although these can be made quite small (i.e., drop loss can be < 0.5 dB [114]).
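As a rough check of this floor, the bound in Eq. (15) can be evaluated directly. The sketch below makes several assumptions that are not stated in this passage: the photon energy is taken as $h\nu$ at a wavelength of 1550 nm, the vector size is set to $N = 148$ (the Co-Integrated Silicon NN row of Table II), and the function name is hypothetical.

```python
import math

H = 6.62607015e-34     # Planck constant (J s)
C = 299_792_458.0      # speed of light in vacuum (m/s)
Q_E = 1.602176634e-19  # elementary charge (C)


def passive_cascadability_floor(n, rho, eta, switching_charge, wavelength=1550e-9):
    """Lower bound on energy per MAC (joules) in the form of Eq. (15).

    switching_charge is V_s * (C_mod + C_PD) in coulombs.
    wavelength is an assumed operating wavelength in meters.
    """
    photon_energy = H * C / wavelength                     # h * nu
    electrons = 2.0 * math.pi * switching_charge / Q_E     # (1/e) * 2*pi*V_s*C
    return (n / rho**2) * (photon_energy / eta) * electrons


# Foundry-model figures from the text, with rho = N (fixed point).
floor = passive_cascadability_floor(n=148, rho=148, eta=0.07, switching_charge=70e-15)
print(f"{floor * 1e15:.1f} fJ/MAC")
```

Under these assumptions the bound comes out at roughly 34 fJ/MAC, in line with the approximately 30 fJ/MAC floor quoted above. Because the prefactor is $N/\rho^2$, the result depends on the assumed $N$ whenever $\rho = N$.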

Going beyond this barrier requires the use of an active trans-impedance amplifier (TIA) placed between the detector and modulator, which can be instantiated using digitally-compatible circuitry in a number of different configurations. This serves several functions: it can separate the capacitive contributions of the photodetector and modulator, and it also reduces the impedances associated with each stage. In a low-node electronics platform with TOVs, the energy consumption per sample can be quite low (<100 fJ) for a good TIA (see, for example, the analysis in Ref. [88] or Ref. [89], [115]). Given that the signal-to-noise ratio must exceed the given bit precision $N_b$ (i.e., $\text{SNR} = I_p/\sigma_i > 2^{N_b}$ for received current $I_p$ and RMS shot noise current $\sigma_i$ at each detector), we arrive at a new energy-per-MAC metric:

$$E_{\text{MAC}} = \frac{N}{\rho^2} \cdot \underbrace{\frac{2h\nu}{\eta}}_{\text{quantum efficiency}} \cdot \underbrace{2^{2N_b}}_{\text{noise and resolution}} + \underbrace{\frac{E_{\text{samp}}}{N}}_{\substack{\text{switching} \\ \text{energy/MAC}}} + \underbrace{\frac{P_l}{\tau_s}}_{\text{leakage}} \quad (17)$$

Here, $E_{\text{samp}}$ includes contributions from the active TIA, the modulator switching energy per sampling interval $1/\tau_s$ (typically expressed in J/bit), and the energy associated with memory I/O. Note that we neglect the effect of nonlinearity on noise reduction, which can have positive effects on the resulting precision. With fixed-point-like precision ($\rho = 1$), power dissipation is dominated by E/O and O/E interfaces together with digital circuitry. Given the similarity between each modulator neuron and the E/O/E interface in a standard photonic link (both require the same electrical interfaces, amplification, and driver circuitry), we can use $E_{\text{samp}}$ estimates from digital links [88], [116] and those from photonically interconnected DRAM memory, using the figure of 100 fJ/bit in Ref. [21] over 4 bits (100 fJ/S) as a relatively accurate proxy for the energy consumption per node. We arrive at our energy consumption of 2.7 fJ as shown in Table II. We assume an improvement in areal density and channel density by shrinking the resonators to $\sim 10 \mu\text{m}$ in diameter [99] and by using high-fidelity photonic two-pole filters as described in [106].
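The three terms of Eq. (17) can be separated in a short sketch to see which one dominates. Everything specific here is an assumption made for illustration: the function and argument names are hypothetical, the wavelength is taken as 1550 nm, $\rho$ is set to $N$, the leakage power is set to zero, $\eta = 0.07$ is carried over from the passive estimate, and the per-sample energy is read as 100 fJ/bit times 4 bits (400 fJ per sample), which is the reading under which the switching term lands on the Table II value.

```python
H = 6.62607015e-34  # Planck constant (J s)
C = 299_792_458.0   # speed of light in vacuum (m/s)


def energy_per_mac_terms(n, rho, eta, n_bits, e_samp, p_leak, sample_period,
                         wavelength=1550e-9):
    """Return the (shot-noise, switching, leakage) terms of Eq. (17) in joules.

    e_samp is the O/E/O energy per sample (J), p_leak the leakage power per
    weight (W), and sample_period the reciprocal of the sampling rate tau_s (s).
    """
    photon_energy = H * C / wavelength
    shot = (n / rho**2) * (2.0 * photon_energy / eta) * 2.0 ** (2.0 * n_bits)
    switching = e_samp / n          # node cost shared across N MACs
    leakage = p_leak * sample_period
    return shot, switching, leakage


shot, switching, leakage = energy_per_mac_terms(
    n=148, rho=148, eta=0.07, n_bits=5.1,
    e_samp=4 * 100e-15,   # assumption: 100 fJ/bit over 4 bits
    p_leak=0.0,           # assumption: heaters and leakage neglected
    sample_period=100e-12,
)
print(f"shot: {shot * 1e18:.1f} aJ, switching: {switching * 1e15:.2f} fJ")
```

With these inputs the switching term is about 2.7 fJ and the shot-noise term is roughly two orders of magnitude smaller, which is consistent with the statement that the E/O and O/E interfaces dominate.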

3) *Sub-$\lambda$ Nanophotonics*: Here, we consider the performance of photonic devices as they begin to hit their physical limits in the B&W architecture. The basic principle of operation of each unit is similar to the co-integrated silicon photonic network. The platform is assumed to include both low-node electronics and photonics on the same platform (i.e., a variant of [117]) to avoid additional capacitances at the interfaces. Additionally, we assume that there are a significant number of layers in the network ($M \sim 100$) before the information is passed back to memory, further amortizing the energy cost (100 fJ/S) of the photonic memory link by the factor $1/M$. Each sub-$\lambda$ neuron (I) uses a nanophotonic photodetector such as [77] with $<1 \text{ fF}$ of capacitance, (II) operates in the "near-receiverless" regime discussed in [16], i.e., with a minimal gain stage, if any, between the detector and modulator, such as a single inverter amplifier (see Ref. [115], [118]), and (III) instantiates its filters and modulators efficiently using more exotic enhancement techniques [119], [120]. We utilize devices that have been empirically prototyped, but not yet scaled in foundries. Our metrics are based on several insights:

- **Compute Density:** Photonic devices can be shrunk significantly in size compared to where they are now. The smallest known resonators are photonic crystal defect states [101], which can occupy small footprints: if we pack them very tightly, they can be as small as $\sim 2 \mu\text{m}^2$. A single defect state can potentially perform a weight multiplication. This has significant ramifications for compute density ($\sim 10^3$) compared to microring filter banks, even if the effective sampling rate is kept constant.
- **Channel Number:** The number of channels is limited by the total bandwidth available in the optical spectrum. At 10 GS/s, we can fit about 300 channels in a 30 nm spectral gain curve. Although channel number can be extended further through the use of heterogeneous laser sources or frequency combs, this goes beyond the scope of this work. We also assume low precision, fixed point operations ($\rho = N$).
- **Energy Consumption:** There are many vectors for improvement in Eq. (17). We will assume the reverse-biased filter leakage can be brought down from microamperes [99] to nanoamperes with better manufacturing. The O/E/O switching energy $E_{\text{samp}}$, which shares many properties with digital links, can be improved significantly using a variety of techniques to reach the $\sim 1 \text{ fJ}$ level [16]. Modulators, for example, can reach the $\sim 100 \text{ aJ}$ per bit range [121]. We also assume that optical losses through the system are small, which can be achieved via passive device engineering. With this in place, the system is now bottlenecked by shot noise at the detector, which limits precision for a given input power, and by the cost of the I/O to memory. With more efficient laser sources, the total quantum efficiency can be as high as $\eta \sim 20\%$. All together, this leads to $E_{\text{MAC}} = 17.3 \text{ aJ} + 13.3 \text{ aJ} (\text{memory}) = 30.6 \text{ aJ}$.

Fig. 7: Comparison of deep learning hardware accelerators with photonic platforms discussed in Section VI, modified from Ref. [7]. Photonic systems can support high bandwidth densities on-chip while consuming minimal energy both transporting data and performing computations. Metrics for digital electronic architectures taken from various sources [12], [122]–[125]. Also included are the analog limits for photonic and electronic matrix cores with $N = 1024$ and 4 bits of precision, from Table I.
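The headline sub-$\lambda$ figures can be cross-checked with simple arithmetic. The sketch below is illustrative only and rests on assumptions of ours: one weight multiplication per $2\ \mu\text{m}^2$ resonator per sample, a center wavelength of 1550 nm for converting the 30 nm gain curve to frequency, and a memory-link cost of 100 fJ/bit over 4 bits that is amortized over both the $M$ layers and the $N$ MACs per node. All function names are hypothetical.

```python
C = 299_792_458.0  # speed of light in vacuum (m/s)


def compute_density_per_mm2(footprint_um2, sample_rate):
    """MACs per second per mm^2, assuming one MAC per resonator per sample."""
    units_per_mm2 = 1e6 / footprint_um2  # 1 mm^2 = 1e6 um^2
    return units_per_mm2 * sample_rate


def channel_spacing_hz(gain_bandwidth_m, n_channels, center_wavelength_m=1550e-9):
    """Frequency spacing per channel for a given optical gain bandwidth."""
    optical_bandwidth_hz = C * gain_bandwidth_m / center_wavelength_m**2
    return optical_bandwidth_hz / n_channels


def amortized_memory_energy(e_link_per_sample, n_layers, n):
    """Memory-link energy per MAC, shared over n_layers layers and n MACs."""
    return e_link_per_sample / (n_layers * n)


density = compute_density_per_mm2(footprint_um2=2.0, sample_rate=10e9)
spacing = channel_spacing_hz(gain_bandwidth_m=30e-9, n_channels=300)
e_memory = amortized_memory_energy(4 * 100e-15, n_layers=100, n=300)
e_total = 17.3e-18 + e_memory  # compute term quoted in the text plus memory

print(f"density: {density / 1e15:.1f} PMACs/s/mm^2")
print(f"channel spacing: {spacing / 1e9:.1f} GHz")
print(f"memory: {e_memory * 1e18:.1f} aJ, total: {e_total * 1e18:.1f} aJ")
```

Under these assumptions the density works out to 5 PMACs/s/mm<sup>2</sup>, the 300 channels sit roughly 12.5 GHz apart (leaving some margin over the 10 GS/s signal bandwidth), and the memory term is about 13.3 aJ, giving the 30.6 aJ total of Table II.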

## VII. SUMMARY AND CONCLUDING REMARKS

Historically, both electronic neuromorphic systems and electronic emulations of neural networks have been constrained by the inherent scaling laws of digital systems and metal interconnects. In particular, energy scales with  $O(N^2)$ , where  $N$  is the number of neurons, and for systems of large numbers of neurons, this becomes untenable for modern applications. Photonics provides a solution, alleviating the energy consumption of both data movement across metal wires and of multiply-accumulate (MAC) computation itself, both of which are major bottlenecks in neural computing.

We have extensively compared the fundamental limits of electronic crossbar arrays with photonic linear compute cores, and have shown that optics exhibits advantages in the limit of large processor sizes ($>100 \mu\text{m}$), large vector sizes ($N > 500$), and low precision ($\leq 4$ bits). We have discussed the myriad advantages that photonic multiply-accumulate (MAC) operations possess over their digital electronic counterparts in energy ($> 10^2$), speed ($> 10^3$), and compute density ($> 10^2$). We have analyzed how they can manifest in practical models, based on empirically validated, foundry compatible photonic devices. Although we considered resonator-based methods for networking and linear operations, the advantages of photonic MACs remain relevant for many architectures beyond those presented in this work.

Although photonics has traditionally been studied for its role in communication, there is great potential to address new and emerging bottlenecks in computing. Artificial intelligence has brought unique challenges to processor architectures: modern GPUs and machine learning ASICs now implement high volume, high density, low precision matrix operations with specialized compute cores. These processors are subject to trade-offs that are significantly more communication bottlenecked than traditional von Neumann architectures. There are still many challenges towards seeing functional analog computing systems: for example, one must consider the cost of the periphery, the cost of A/D and D/A conversion, and the higher-level communication protocols between multiple neuromorphic cores. Nonetheless, photonics has the potential to address the major bottlenecks of hardware neural network simulation, providing a means to simultaneously move data across a chip and perform matrix multiplication with little cost.

## REFERENCES

- [1] B. R. Moss, C. Sun, M. Georgas, J. Shainline, J. S. Orcutt, J. C. Leu, M. Wade, Y. Chen, K. Nammari, X. Wang, H. Li, R. Ram, M. A. Popovic, and V. Stojanovic, "A 1.23pj/b 2.5gb/s monolithically integrated optical carrier-injection ring modulator and all-digital driver circuit in commercial 45nm soi," in *2013 IEEE International Solid-State Circuits Conference Digest of Technical Papers*, 2013, pp. 126–127.
- [2] J. Jeddeloh and B. Keeth, "Hybrid memory cube new dram architecture increases density and performance," in *2012 Symposium on VLSI Technology (VLSIT)*, 2012, pp. 87–88.
- [3] D. Amodei and D. Hernandez, "Ai and compute." [Online]. Available: <https://blog.openai.com/ai-and-compute/> (Accessed May 16, 2018).
- [4] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G.-J. Nam, B. Taba, M. Beakes, B. Brezzo, J. Kuang, R. Manohar, W. Risk, B. Jackson, and D. Modha, "Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 34, no. 10, pp. 1537–1557, Oct 2015. [Online]. Available: <http://dx.doi.org/10.1109/TCAD.2015.2474396>
- [5] C. Dragone, "Efficient n\*n star couplers using fourier optics," *Journal of Lightwave Technology*, vol. 7, no. 3, pp. 479–489, 1989.
- [6] R. A. Athale and W. C. Collins, "Optical matrix–matrix multiplier based on outer product decomposition," *Applied Optics*, vol. 21, no. 12, pp. 2089–2090, 1982. [Online]. Available: <http://ao.osa.org/abstract.cfm?URI=ao-21-12-2089>
- [7] H. Peng, M. A. Nahmias, T. F. de Lima, A. N. Tait, and B. J. Shastri, "Neuromorphic photonic integrated circuits," *IEEE Journal of Selected Topics in Quantum Electronics*, vol. 24, no. 6, pp. 1–15, 2018.
- [8] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljačić, "Deep learning with coherent nanophotonic circuits," *Nature Photonics*, vol. 11, p. 441, Jun. 2017. [Online]. Available: <http://dx.doi.org/10.1038/nphoton.2017.93>
- [9] S. W. Smith *et al.*, "The scientist and engineer's guide to digital signal processing," 1997.
- [10] G. Frantz, "Digital signal processor trends," *IEEE micro*, vol. 20, no. 6, pp. 52–59, 2000.
- [11] J. Hasler and H. B. Marr, "Finding a roadmap to achieve large neuromorphic hardware systems," *Front. Neurosci.*, vol. 7, no. 118, 2013.
- [12] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers *et al.*, "In-datacenter performance analysis of a tensor processing unit," in *Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on*. IEEE, 2017, pp. 1–12.
- [13] J. Schmidhuber, "Deep learning in neural networks: An overview," *Neural networks*, vol. 61, pp. 85–117, 2015.
- [14] S. R. Agrawal, S. Idicula, A. Raghavan, E. Vlachos, V. Govindaraju, V. Varadarajan, C. Balkesen, G. Giannikis, C. Roth, N. Agarwal, and E. Sedlar, "A many-core architecture for in-memory data processing," in *Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture*, ser. MICRO-50 '17. New York, NY, USA: ACM, 2017, pp. 245–258. [Online]. Available: <http://doi.acm.org/10.1145/3123939.3123985>
- [15] P. Jawandhiya, "Hardware design for machine learning," *International Journal of Artificial Intelligence and Applications (IJAI)*, vol. 9, no. 1, pp. 63–84, 2018.
- [16] D. A. B. Miller, "Attojoule optoelectronics for low-energy information processing and communications," *J. Lightwave Technol.*, vol. 35, no. 3, pp. 346–396, Feb 2017. [Online]. Available: <http://jlt.osa.org/abstract.cfm?URI=jlt-35-3-346>
- [17] D. A. B. Miller, "Device requirements for optical interconnects to silicon chips," *Proceedings of the IEEE*, vol. 97, no. 7, pp. 1166–1185, 2009.
- [18] D. A. B. Miller, "Rationale and challenges for optical interconnects to electronic chips," *Proceedings of the IEEE*, vol. 88, no. 6, pp. 728–749, June 2000. [Online]. Available: <http://dx.doi.org/10.1109/5.867687>
- [19] C. Gunn, "CMOS photonics for high-speed interconnects," *IEEE Micro*, vol. 26, no. 2, pp. 58–66, Mar. 2006.
- [20] K. Preston, N. Sherwood-Droz, J. S. Levy, and M. Lipson, "Performance guidelines for wdm interconnects based on silicon microring resonators," in *CLEO:2011 - Laser Applications to Photonic Applications*. Optical Society of America, 2011, p. CThP4. [Online]. Available: [http://dx.doi.org/10.1364/CLEO\\_SI.2011.CThP4](http://dx.doi.org/10.1364/CLEO_SI.2011.CThP4)
- [21] C. Batten, A. Joshi, J. Orcutt, A. Khilo, B. Moss, C. W. Holzwarth, M. A. Popovic, H. Li, H. I. Smith, J. L. Hoyt, F. X. Kartner, R. J. Ram, V. Stojanovic, and K. Asanovic, "Building many-core processor-to-dram networks with monolithic cmos silicon photonics," *IEEE Micro*, vol. 29, no. 4, pp. 8–21, 2009.
- [22] S. Beamer, C. Sun, Y.-J. Kwon, A. Joshi, C. Batten, V. Stojanović, and K. Asanović, "Re-architecting dram memory systems with monolithically integrated silicon photonics," *SIGARCH Comput. Archit. News*, vol. 38, no. 3, pp. 129–140, Jun. 2010. [Online]. Available: <http://doi.acm.org/10.1145/1816038.1815978>
- [23] A. V. Krishnamoorthy, R. Ho, X. Zheng, H. Schwetman, J. Lexau, P. Koka, G. Li, I. Shubin, and J. E. Cunningham, "Computer systems based on silicon photonic interconnects," *Proceedings of the IEEE*, vol. 97, no. 7, pp. 1337–1361, 2009.
- [24] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in *Advances in neural information processing systems*, 2016, pp. 4107–4115.
- [25] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, "Finn: A framework for fast, scalable binarized neural network inference," in *Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*. ACM, 2017, pp. 65–74.
- [26] R. Banner, I. Hubara, E. Hoffer, and D. Soudry, "Scalable methods for 8-bit training of neural networks," in *Advances in Neural Information Processing Systems*, 2018, pp. 5145–5153.
- [27] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in *International Conference on Machine Learning*, 2015, pp. 1737–1746.
- [28] M. Courbariaux, Y. Bengio, and J.-P. David, "Binaryconnect: Training deep neural networks with binary weights during propagations," in *Advances in neural information processing systems*, 2015, pp. 3123–3131.
- [29] U. Köster, T. Webb, X. Wang, M. Nassar, A. K. Bansal, W. Constable, O. Elibol, S. Gray, S. Hall, L. Hornof *et al.*, "Flexpoint: An adaptive numerical format for efficient training of deep neural networks," in *Advances in neural information processing systems*, 2017, pp. 1742–1752.
- [30] F. Chang, K. Onohara, and T. Mizuuchi, "Forward error correction for 100 g transport networks," *IEEE Communications Magazine*, vol. 48, no. 3, pp. S48–S55, 2010.
- [31] B. Reagen, U. Gupta, L. Pentecost, P. Whatmough, S. K. Lee, N. Mulholland, D. Brooks, and G. Wei, "Ares: A framework for quantifying the resilience of deep neural networks," in *2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)*, 2018, pp. 1–6.
- [32] D. Rolnick, A. Veit, S. Belongie, and N. Shavit, “Deep learning is robust to massive label noise,” *arXiv preprint arXiv:1705.10694*, 2017.
- [33] S. Sukhbaatar and R. Fergus, “Learning from noisy labels with deep neural networks,” *arXiv preprint arXiv:1406.2080*, vol. 2, no. 3, p. 4, 2014.
- [34] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” *The journal of machine learning research*, vol. 15, no. 1, pp. 1929–1958, 2014.
- [35] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, “Regularization of neural networks using dropconnect,” in *Proceedings of the 30th International Conference on Machine Learning*, ser. Proceedings of Machine Learning Research, S. Dasgupta and D. McAllester, Eds., vol. 28, no. 3, Atlanta, Georgia, USA: PMLR, 17–19 Jun 2013, pp. 1058–1066. [Online]. Available: <http://proceedings.mlr.press/v28/wan13.html>
- [36] C. M. Bishop, “Training with noise is equivalent to tikhonov regularization,” *Neural Computation*, vol. 7, no. 1, pp. 108–116, 1995. [Online]. Available: <https://doi.org/10.1162/neco.1995.7.1.108>
- [37] A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens, “Adding gradient noise improves learning for very deep networks,” *arXiv preprint arXiv:1511.06807*, 2015.
- [38] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” *J. Mach. Learn. Res.*, vol. 18, no. 1, pp. 6869–6898, Jan. 2017. [Online]. Available: <http://dl.acm.org/citation.cfm?id=3122009.3242044>
- [39] S. Agarwal, T.-T. Quach, O. Parekh, A. H. Hsia, E. P. DeBenedictis, C. D. James, M. J. Marinella, and J. B. Aimone, “Energy scaling advantages of resistive memory crossbar based computation and its application to sparse coding,” *Frontiers in Neuroscience*, vol. 9, p. 484, 2016. [Online]. Available: <https://www.frontiersin.org/article/10.3389/fnins.2015.00448>
- [40] S. Galal and M. Horowitz, “Energy-efficient floating-point unit design,” *IEEE Transactions on Computers*, vol. 60, no. 7, pp. 913–922, 2011.
- [41] J. J. Yang, D. B. Strukov, and D. R. Stewart, “Memristive devices for computing,” *Nature Nanotechnology*, vol. 8, p. 13, Dec. 2012. [Online]. Available: <https://doi.org/10.1038/nnano.2012.240>
- [42] H.-T. Kung, “Why systolic architectures?” *IEEE computer*, vol. 15, no. 1, pp. 37–46, 1982.
- [43] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G. Nam, B. Taba, M. Beakes, B. Brezzo, J. B. Kuang, R. Manohar, W. P. Risk, B. Jackson, and D. S. Modha, “Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip,” *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 34, no. 10, pp. 1537–1557, 2015.
- [44] J. von Neumann, “First draft of a report on the edvac,” *IEEE Annals of the History of Computing*, vol. 15, no. 4, pp. 27–75, Oct. 1993. [Online]. Available: <http://dx.doi.org/10.1109/85.238389>
- [45] J. Backus, “Can programming be liberated from the von neumann style?: A functional style and its algebra of programs,” *Communications of the ACM*, vol. 21, no. 8, pp. 613–641, Aug. 1978. [Online]. Available: <http://dx.doi.org/10.1145/359576.359579>
- [46] G. E. Moore, “Readings in computer architecture,” M. D. Hill, N. P. Jouppi, and G. S. Sohi, Eds. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2000, ch. Cramming More Components Onto Integrated Circuits, pp. 56–59. [Online]. Available: <http://dl.acm.org/citation.cfm?id=333067.333074>
- [47] J. Koomey, S. Berard, M. Sanchez, and H. Wong, “Implications of historical trends in the electrical efficiency of computing,” *IEEE Annals of the History of Computing*, vol. 33, no. 3, pp. 46–54, March 2011. [Online]. Available: <http://dx.doi.org/10.1109/MAHC.2010.28>
- [48] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco, “GPUs and the future of parallel computing,” *IEEE Micro*, vol. 31, no. 5, pp. 7–17, Sept 2011.
- [49] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” *Communications of the ACM*, vol. 51, no. 1, pp. 107–113, 2008.
- [50] P. J. Denning and T. G. Lewis, “Exponential laws of computing growth,” 2017.
- [51] V. Agarwal, M. Hrishikesh, S. W. Keckler, and D. Burger, “Clock rate versus ipc: The end of the road for conventional microarchitectures,” in *ACM SIGARCH Computer Architecture News*, vol. 28, no. 2. ACM, 2000, pp. 248–259.
- [52] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark Silicon and the End of Multicore Scaling,” *IEEE Micro*, vol. 32, no. 3, pp. 122–134, may 2012. [Online]. Available: <http://ieeexplore.ieee.org/document/6175879/>
- [53] A. P. Chandrakasan and R. W. Brodersen, “Minimizing power consumption in digital cmos circuits,” *Proceedings of the IEEE*, vol. 83, no. 4, pp. 498–523, 1995.
- [54] I. L. Markov, “Limits on fundamental limits to computation,” *Nature*, vol. 512, no. 7513, p. 147, 2014.
- [55] E. Kadric, D. Lakata, and A. Dehon, “Impact of parallelism and memory architecture on fpga communication energy,” *ACM Trans. Reconfigurable Technol. Syst.*, vol. 9, no. 4, pp. 30:1–30:23, Aug. 2016. [Online]. Available: <http://doi.acm.org/10.1145/2857057>
- [56] M. Horowitz, “1.1 computing’s energy problem (and what we can do about it),” in *2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC)*, 2014, pp. 10–14.
- [57] J. Hasler and B. Marr, “Finding a roadmap to achieve large neuromorphic hardware systems,” *Frontiers in Neuroscience*, vol. 7, no. 7 SEP, p. 118, 2013. [Online]. Available: <http://dx.doi.org/10.3389/fnins.2013.00118>
- [58] V. Chan, S.-C. Liu, and A. van Schaik, “Aer ear: A matched silicon cochlea pair with address event representation interface,” *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 54, no. 1, pp. 48–59, 2007.
- [59] J. Park, T. Yu, S. Joshi, C. Maier, and G. Cauwenberghs, “Hierarchical address event routing for reconfigurable large-scale neuromorphic systems,” *IEEE transactions on neural networks and learning systems*, vol. 28, no. 10, pp. 2408–2422, 2017.
- [60] M. T. Bohr, “Interconnect scaling—the real limiter to high performance ulsi,” in *Electron Devices Meeting, 1995. IEDM’95., International*. IEEE, 1995, pp. 241–244.
- [61] D. A. B. Miller, “Device requirements for optical interconnects to silicon chips,” *Proceedings of the IEEE*, vol. 97, no. 7, pp. 1166–1185, 2009.
- [62] D. B. Strukov and K. K. Likharev, “Cmol fpga: a reconfigurable architecture for hybrid digital circuits with two-terminal nanodevices,” *Nanotechnology*, vol. 16, no. 6, p. 888, 2005.
- [63] J. Borghetti, G. S. Snider, P. J. Kuekes, J. J. Yang, D. R. Stewart, and R. S. Williams, “memristive switches enable ‘stateful’ logic operations via material implication,” *Nature*, vol. 464, no. 7290, p. 873, 2010.
- [64] H. Akinaga and H. Shima, “Resistive random access memory (reram) based on metal oxides,” *Proceedings of the IEEE*, vol. 98, no. 12, pp. 2237–2251, 2010.
- [65] S. H. Jo, T. Chang, I. Ebong, B. B. Bhadviya, P. Mazumder, and W. Lu, “Nanoscale memristor device as synapse in neuromorphic systems,” *Nano Letters*, vol. 10, no. 4, pp. 1297–1301, Apr. 2010. [Online]. Available: <https://doi.org/10.1021/nl904092h>
- [66] C. Li, M. Hu, Y. Li, H. Jiang, N. Ge, E. Montgomery, J. Zhang, W. Song, N. Dávila, C. E. Graves, Z. Li, J. P. Strachan, P. Lin, Z. Wang, M. Barnell, Q. Wu, R. S. Williams, J. J. Yang, and Q. Xia, “Analogue signal and image processing with large memristor crossbars,” *Nature Electronics*, vol. 1, no. 1, pp. 52–59, 2018. [Online]. Available: <https://doi.org/10.1038/s41928-017-0002-z>
- [67] G. W. Burr, M. J. Brightsky, A. Sebastian, H. Cheng, J. Wu, S. Kim, N. E. Sosa, N. Papandreou, H. Lung, H. Pozidis, E. Eleftheriou, and C. H. Lam, “Recent progress in phase-change memory technology,” *IEEE Journal on Emerging and Selected Topics in Circuits and Systems*, vol. 6, no. 2, pp. 146–162, 2016.
- [68] B. Govoreanu, G. S. Kar, Y. Chen, V. Paraschiv, S. Kubicek, A. Fantini, I. P. Radu, L. Goux, S. Clima, R. Degraeve, N. Josart, O. Richard, T. Vandeweyer, K. Seo, P. Hendrickx, G. Pourtois, H. Bender, L. Altimime, D. J. Wouters, J. A. Kittl, and M. Jurczak, “10×10nm² Hf/HfOx crossbar resistive ram with excellent performance, reliability and low-energy operation,” in *2011 International Electron Devices Meeting*, 2011, pp. 31.6.1–31.6.4.
- [69] D. A. Miller, “Optics for low-energy communication inside digital processors: quantum detectors, sources, and modulators as efficient impedance converters,” *Optics Letters*, vol. 14, no. 2, pp. 146–148, 1989.
- [70] A. Gondarenko, J. S. Levy, and M. Lipson, “High confinement micron-scale silicon nitride high q ring resonator,” *Opt. Express*, vol. 17, no. 14, pp. 11 366–11 370, Jul 2009. [Online]. Available: <http://www.opticsexpress.org/abstract.cfm?URI=oe-17-14-11366>
- [71] J. F. Bauters, M. J. R. Heck, D. D. John, J. S. Barton, C. M. Bruinink, A. Leinse, R. G. Heideman, D. J. Blumenthal, and J. E. Bowers, "Planar waveguides with less than 0.1 db/m propagation loss fabricated with wafer bonding," *Opt. Express*, vol. 19, no. 24, pp. 24 090–24 101, Nov 2011. [Online]. Available: <http://www.opticsexpress.org/abstract.cfm?URI=oe-19-24-24090>
- [72] H. Lee, T. Chen, J. Li, O. Painter, and K. J. Vahala, "Ultra-low-loss optical delay line on a silicon chip," *Nature Communications*, vol. 3, p. 867, 05 2012. [Online]. Available: <https://doi.org/10.1038/ncomms1876>
- [73] R. Soref, "The past, present, and future of silicon photonics," *IEEE Journal of Selected Topics in Quantum Electronics*, vol. 12, no. 6, pp. 1678–1687, 2006.
- [74] W. Bogaerts, M. Fiers, and P. Dumon, "Design challenges in silicon photonics," *IEEE Journal of Selected Topics in Quantum Electronics*, vol. 20, no. 4, pp. 1–8, 2014.
- [75] J. Chiles, S. M. Buckley, S. W. Nam, R. P. Mirin, and J. M. Shainline, "Multiplanar dielectric waveguides for neural communication," in *2018 IEEE 15th International Conference on Group IV Photonics (GFP)*, 2018, pp. 1–2.
- [76] S. Pi, C. Li, H. Jiang, W. Xia, H. Xin, J. J. Yang, and Q. Xia, "Memristor crossbar arrays with 6-nm half-pitch and 2-nm critical dimension," *Nature Nanotechnology*, vol. 14, no. 1, pp. 35–39, 2019. [Online]. Available: <https://doi.org/10.1038/s41565-018-0302-0>
- [77] K. Nozaki, S. Matsuo, T. Fujii, K. Takeda, M. Ono, A. Shakoor, E. Kuramochi, and M. Notomi, "Photonic-crystal nano-photodetector with ultrasmall capacitance for on-chip light-to-voltage conversion without an amplifier," *Optica*, vol. 3, no. 5, pp. 483–492, May 2016. [Online]. Available: <http://www.osapublishing.org/optica/abstract.cfm?URI=optica-3-5-483>
- [78] A. N. Tait, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, "Broadcast and weight: An integrated network for scalable photonic spike processing," *Journal of Lightwave Technology*, vol. 32, no. 21, pp. 3427–3439, Nov 2014. [Online]. Available: <http://dx.doi.org/10.1109/JLT.2014.2345652>
- [79] L. Yang, R. Ji, L. Zhang, J. Ding, and Q. Xu, "On-chip cmos-compatible optical signal processor," *Opt. Express*, vol. 20, no. 12, pp. 13 560–13 565, Jun 2012. [Online]. Available: <http://www.opticsexpress.org/abstract.cfm?URI=oe-20-12-13560>
- [80] A. N. Tait, "Silicon photonic neural networks," Ph.D. dissertation, Princeton University, April 2018.
- [81] J. M. Shainline, S. M. Buckley, R. P. Mirin, and S. W. Nam, "Superconducting optoelectronic circuits for neuromorphic computing," *Physical Review Applied*, vol. 7, p. 034013, Mar 2017. [Online]. Available: <https://link.aps.org/doi/10.1103/PhysRevApplied.7.034013>
- [82] L. Appeltant, M. C. Soriano, G. Van der Sande, J. Danckaert, S. Massar, J. Dambre, B. Schrauwen, C. R. Mirasso, and I. Fischer, "Information processing using a single dynamical node as complex system," *Nature Communications*, vol. 2, p. 468, 09 2011. [Online]. Available: <http://dx.doi.org/10.1038/ncomms1476>
- [83] L. Larger, M. C. Soriano, D. Brunner, L. Appeltant, J. M. Gutierrez, L. Pesquera, C. R. Mirasso, and I. Fischer, "Photonic information processing beyond turing: an optoelectronic implementation of reservoir computing," *Optics Express*, vol. 20, no. 3, pp. 3241–3249, Jan 2012. [Online]. Available: <http://dx.doi.org/10.1364/OE.20.003241>
- [84] M. C. Soriano, S. Ortín, D. Brunner, L. Larger, C. R. Mirasso, I. Fischer, and L. Pesquera, "Optoelectronic reservoir computing: tackling noise-induced performance degradation," *Optics Express*, vol. 21, no. 1, pp. 12–20, Jan 2013. [Online]. Available: <http://www.opticsexpress.org/abstract.cfm?URI=oe-21-1-12>
- [85] K. Vandoorne, P. Mechet, T. Van Vaerenbergh, M. Fiers, G. Morthier, D. Verstraeten, B. Schrauwen, J. Dambre, and P. Bienstman, "Experimental demonstration of reservoir computing on a silicon photonics chip," *Nat Commun*, vol. 5, 03 2014.
- [86] K. Vandoorne, J. Dambre, D. Verstraeten, B. Schrauwen, and P. Bienstman, "Parallel reservoir computing using optical amplifiers," *IEEE Transactions on Neural Networks*, vol. 22, no. 9, pp. 1469–1481, Sept. 2011.
- [87] Y. Paquot, F. Duport, A. Smerieri, J. Dambre, B. Schrauwen, M. Haelterman, and S. Massar, "Optoelectronic reservoir computing," *Scientific Reports*, vol. 2, p. 287, 02 2012. [Online]. Available: <http://dx.doi.org/10.1038/srep00287>
- [88] M. Georgas, J. Leu, B. Moss, C. Sun, and V. Stojanović, "Addressing link-level design tradeoffs for integrated photonic interconnects," in *2011 IEEE Custom Integrated Circuits Conference (CICC)*, 2011, pp. 1–8.
- [89] L. Szilagyi, J. Pliva, R. Henker, D. Schoeniger, J. P. Turkiewicz, and F. Ellinger, "A 53-gbit/s optical receiver frontend with 0.65 pj/bit in 28-nm bulk-cmos," *IEEE Journal of Solid-State Circuits*, vol. 54, no. 3, pp. 845–855, 2019.
- [90] N. C. Harris, Y. Ma, J. Mower, T. Baehr-Jones, D. Englund, M. Hochberg, and C. Galland, "Efficient, compact and low loss thermo-optic phase shifter in silicon," *Opt. Express*, vol. 22, no. 9, pp. 10 487–10 493, May 2014. [Online]. Available: <http://www.opticsexpress.org/abstract.cfm?URI=oe-22-9-10487>
- [91] C. Sun, M. Wade, M. Georgas, S. Lin, L. Alloatti, B. Moss, R. Kumar, A. H. Atabaki, F. Pavanello, J. M. Shainline, J. S. Orcutt, R. J. Ram, M. Popović, and V. Stojanović, "A 45 nm cmos-soi monolithic photonics platform with bit-statistics-based resonant microring thermal tuning," *IEEE Journal of Solid-State Circuits*, vol. 51, no. 4, pp. 893–907, 2016.
- [92] Y. Lai, M. S. Mohamed, B. Gao, M. Minkov, R. W. Boyd, V. Savona, R. Houdré, and A. Badolato, "Ultra-wide-band structural slow light," *Scientific Reports*, vol. 8, no. 1, p. 14811, 2018. [Online]. Available: <https://doi.org/10.1038/s41598-018-33090-x>
- [93] G. J. Sharp, C. Klitis, V. Biryukova, B. Holmes, and M. Sorel, "Trimming of silicon-on-insulator micro-ring resonators by laser irradiation," in *2017 Conference on Lasers and Electro-Optics Europe & European Quantum Electronics Conference (CLEO/Europe-EQEC)*, 2017, pp. 1–1.
- [94] M. M. Milosevic, X. Chen, W. Cao, A. F. J. Runge, Y. Franz, C. G. Littlejohns, S. Mailis, A. C. Peacock, D. J. Thomson, and G. T. Reed, "Ion implantation in silicon for trimming the operating wavelength of ring resonators," *IEEE Journal of Selected Topics in Quantum Electronics*, vol. 24, no. 4, pp. 1–7, 2018.
- [95] A. P. Knights, "Device and method for post-fabrication trimming of an optical ring resonator using a dopant-based heater," Apr. 17 2018, US Patent 9,946,027.
- [96] S. Cheung, T. Su, K. Okamoto, and S. Yoo, "Ultra-compact silicon photonic 512×512 25 ghz arrayed waveguide grating router," *IEEE Journal of Selected Topics in Quantum Electronics*, vol. 20, no. 4, pp. 310–316, 2014.
- [97] J. Lu and J. Vučković, "Inverse design of nanophotonic structures using complementary convex optimization," *Opt. Express*, vol. 18, no. 4, pp. 3793–3804, Feb 2010. [Online]. Available: <http://www.opticsexpress.org/abstract.cfm?URI=oe-18-4-3793>
- [98] J. Peurifoy, Y. Shen, L. Jing, Y. Yang, F. Cano-Renteria, B. G. DeLacy, J. D. Joannopoulos, M. Tegmark, and M. Soljačić, "Nanophotonic particle simulation and inverse design using artificial neural networks," *Science Advances*, vol. 4, no. 6, 2018. [Online]. Available: <https://advances.sciencemag.org/content/4/6/eaar4206>
- [99] E. Timurdogan, C. M. Sorace-Agaskar, J. Sun, E. Shah Hosseini, A. Biberman, and M. R. Watts, "An ultralow power athermal silicon modulator," *Nature Communications*, vol. 5, p. 4008, 06 2014. [Online]. Available: <https://doi.org/10.1038/ncomms5008>
- [100] J. Johnson, "Rethinking floating point for deep learning," *arXiv preprint arXiv:1811.01721*, 2018.
- [101] J. D. Joannopoulos, P. R. Villeneuve, and S. Fan, "Photonic crystals: putting a new twist on light," *Nature*, vol. 386, no. 6621, pp. 143–149, 1997. [Online]. Available: <https://doi.org/10.1038/386143a0>
- [102] P. R. Prucnal and B. J. Shastri, *Neuromorphic Photonics*. Boca Raton, FL, USA: CRC Press, Taylor & Francis Group, 2017.
- [103] A. N. Tait, T. F. de Lima, M. A. Nahmias, H. B. Miller, H.-T. Peng, B. J. Shastri, and P. R. Prucnal, "A silicon photonic modulator neuron," *arXiv preprint arXiv:1812.11898*, 2018.
- [104] A. N. Tait, T. Ferreira de Lima, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, "Multi-channel control for microring weight banks," *Optics Express*, vol. 24, no. 8, pp. 8895–8906, Apr 2016. [Online]. Available: <http://dx.doi.org/10.1364/OE.23.012758>
- [105] A. N. Tait, A. X. Wu, T. F. de Lima, E. Zhou, B. J. Shastri, M. A. Nahmias, and P. R. Prucnal, "Microring weight banks," *IEEE Journal of Selected Topics in Quantum Electronics*, vol. 22, no. 6, pp. 312–325, Nov 2016.
- [106] A. N. Tait, A. X. Wu, T. F. de Lima, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, "Two-pole microring weight banks," *Opt. Lett.*, vol. 43, no. 10, pp. 2276–2279, May 2018. [Online]. Available: <http://ol.osa.org/abstract.cfm?URI=ol-43-10-2276>
- [107] A. N. Tait, H. Jayatilleka, T. F. De Lima, P. Y. Ma, M. A. Nahmias, B. J. Shastri, S. Shekhar, L. Chrostowski, and P. R. Prucnal, "Feedback control for microring weight banks," *Optics Express*, vol. 26, no. 20, pp. 26 422–26 443, 2018. [Online]. Available: <http://www.opticsexpress.org/abstract.cfm?URI=oe-26-20-26422>
- [108] M. A. Nahmias, B. J. Shastri, A. N. Tait, and P. R. Prucnal, "A Leaky Integrate-and-Fire Laser Neuron for Ultrafast Cognitive Computing," *IEEE Journal of Selected Topics in Quantum Electronics*, vol. 19, no. 5, 2013. [Online]. Available: <http://dx.doi.org/10.1109/JSTQE.2013.2257700>

- [109] M. A. Nahmias, A. N. Tait, B. J. Shastri, T. F. de Lima, and P. R. Prucnal, "Excitable laser processing network node in hybrid silicon: analysis and simulation," *Optics Express*, vol. 23, no. 20, pp. 26 800–26 813, Oct 2015. [Online]. Available: <http://dx.doi.org/10.1364/OE.23.026800>
- [110] M. A. Nahmias, B. J. Shastri, A. N. Tait, and P. R. Prucnal, "A leaky integrate-and-fire laser neuron for ultrafast cognitive computing," *IEEE J. Sel. Top. Quant. Electron.*, vol. 19, no. 5, pp. 1–12, Sept 2013.
- [111] B. J. Shastri, M. A. Nahmias, A. N. Tait, A. W. Rodriguez, B. Wu, and P. R. Prucnal, "Spike processing with a graphene excitable laser," *Scientific Reports*, vol. 6, p. 19126, 01 2016. [Online]. Available: <http://dx.doi.org/10.1038/srep19126>
- [112] G. Katti, M. Stucchi, K. D. Meyer, and W. Dehaene, "Electrical modeling and characterization of through silicon via for three-dimensional ics," *IEEE Transactions on Electron Devices*, vol. 57, no. 1, pp. 256–262, 2010.
- [113] M. Jurczak, N. Collaert, A. Veloso, T. Hoffmann, and S. Biesemans, "Review of finfet technology," in *2009 IEEE International SOI Conference*, 2009, pp. 1–4.
- [114] S. Xiao, M. H. Khan, H. Shen, and M. Qi, "Multiple-channel silicon micro-resonator based filters for wdm applications," *Opt. Express*, vol. 15, no. 12, pp. 7489–7498, Jun 2007. [Online]. Available: <http://www.opticsexpress.org/abstract.cfm?URI=oe-15-12-7489>
- [115] I. Ozkaya, A. Cebrero, P. A. Francese, C. Menolfi, T. Morf, M. Brändli, D. M. Kuchta, L. Kull, C. W. Baks, J. E. Proesel, M. Kossel, D. Luu, B. G. Lee, F. E. Doany, M. Meghelli, Y. Leblebici, and T. Toifl, "A 64-gb/s 1.4-pj/b nrz optical receiver data-path in 14-nm cmos finfet," *IEEE Journal of Solid-State Circuits*, vol. 52, no. 12, pp. 3458–3473, 2017.
- [116] C. Sun, C. O. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L. Peh, and V. Stojanovic, "Dsent - a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling," in *2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip*, 2012, pp. 201–210.
- [117] A. H. Atabaki, S. Moazeni, F. Pavanello, H. Gevorgyan, J. Notaros, L. Alloatti, M. T. Wade, C. Sun, S. A. Kruger, H. Meng, K. Al Qubaisi, I. Wang, B. Zhang, A. Khilo, C. V. Baiocco, M. Popović, V. M. Stojanović, and R. J. Ram, "Integrating photonics with silicon nanoelectronics for the next generation of systems on a chip," *Nature*, vol. 556, no. 7701, pp. 349–354, 2018. [Online]. Available: <https://doi.org/10.1038/s41586-018-0028-z>
- [118] G.-S. Jeong, W. Bae, and D.-K. Jeong, "Review of cmos integrated circuit technologies for high-speed photo-detection," *Sensors*, vol. 17, no. 9, p. 1962, 2017.
- [119] V. J. Sorger, N. D. Lanzillotti-Kimura, R.-M. Ma, and X. Zhang, "Ultra-compact silicon nanophotonic modulator with broadband response," *Nanophotonics*, vol. 1, no. 1, pp. 17–22, 2012.
- [120] V. J. Sorger, R. Amin, J. B. Khurgin, Z. Ma, H. Dalir, and S. Khan, "Scaling vectors of attojoule per bit modulators," *Journal of Optics*, vol. 20, no. 1, p. 014012, 2017.
- [121] R. Amin, Z. Ma, R. Maiti, S. Khan, J. B. Khurgin, H. Dalir, and V. J. Sorger, "Attojoule-efficient graphene optical modulators," *Applied Optics*, vol. 57, no. 18, pp. D130–D140, 2018. [Online]. Available: <http://ao.osa.org/abstract.cfm?URI=ao-57-18-D130>
- [122] "Groq," November 2017. [Online]. Available: <https://groq.com/> (Accessed May 11, 2018).
- [123] R. Smith, "Nvidia volta unveiled: Gv100 gpu and tesla v100 accelerator announced," May 2017. [Online]. Available: <https://www.anandtech.com/show/11367/nvidia-volta-unveiled-gv100-gpu-and-tesla-v100-accelerator-announced>
- [124] S. Knowles, "Scalable silicon compute," NIPS 2017 Workshop on Deep Learning at Supercomputer Scale, December 2017.
- [125] P. Wijesinghe, A. Ankit, A. Sengupta, and K. Roy, "An all-memristor deep spiking neural network: A step towards realizing the low power, stochastic brain," *arXiv preprint arXiv:1712.01472*, 2017.



**Mitchell A. Nahmias** received a B.S. (Honors) in Electrical Engineering with a Certificate in Engineering Physics in 2012, and an M.A. in Electrical Engineering in 2014, both from Princeton University. He is currently finishing his Ph.D. degree as a member of the Princeton Lightwave Communications Laboratory. His research interests include photonic integrated circuits, unconventional computing, and neuromorphic photonics. Mr. Nahmias has authored or co-authored more than 80 journal papers, has been cited over 1000 times, and is an inventor on several patents. He was the recipient of the Best Engineering Physics Independent Work Award (2012), the National Science Foundation Graduate Research Fellowship (NSF GRFP), the Best Paper Award at IEEE Photonics Conference 2014 (third place), and the Best Paper Award at the 2015 IEEE Photonics Society Summer Topicals Meeting Series (first place). He is also a contributing author to the textbook *Neuromorphic Photonics* (2017).



**Bhavin J. Shastri** is an Assistant Professor of Engineering Physics at Queen's University, Canada. He is a co-author of the book, *Neuromorphic Photonics* (Taylor & Francis, CRC Press). He was an Associate Research Scholar (2016-2018) and Banting and NSERC Postdoctoral Fellow (2012-2016) at Princeton University. He received the Ph.D. degree in electrical engineering (photonics) from McGill University in 2012. Dr. Shastri is a recipient of the 2014 Banting Postdoctoral Fellowship from the Government of Canada, the 2012 D. W. Ambridge Prize for the top graduating Ph.D. student, an IEEE Photonics Society 2011 Graduate Student Fellowship, a 2011 NSERC Postdoctoral Fellowship, a 2011 SPIE Scholarship in Optics and Photonics, and a 2008 NSERC Alexander Graham Bell Canada Graduate Scholarship, as well as the Best Student Paper Awards at the 2014 IEEE Photonics Conference and the 2010 IEEE Midwest Symposium on Circuits and Systems, the 2004 IEEE Computer Society Lance Stafford Larson Outstanding Student Award, and the 2003 IEEE Canada Life Member Award.



**Thomas Ferreira de Lima** received a bachelor's degree and the Ingénieur Polytechnicien master's degree from Ecole Polytechnique, Palaiseau, France, with a focus on Physics for Optics and Nanosciences. He is working toward the Ph.D. degree in Electrical Engineering in the Lightwave Communications Group, Department of Electrical Engineering, Princeton University, Princeton, New Jersey. His research interests include integrated photonic systems, nonlinear signal processing with photonic devices, spike-timing-based processing, ultrafast cognitive computing, and dynamical light-matter neuro-inspired learning and computing. He has authored or co-authored more than 40 journal or conference papers, contributes to four major open-source projects, and is a contributing author to the textbook *Neuromorphic Photonics* (2017).



**Alexander Tait** received his PhD in the Lightwave Communications Research Laboratory, Department of Electrical Engineering, Princeton University, Princeton, NJ, USA, advised by Professor Paul Prucnal. He also received the B.Sci.Eng. (Honors) at Princeton in Electrical Engineering in 2012. His research interests include silicon photonics, optical signal processing, optical networks, and neuromorphic engineering.

Dr. Tait is a recipient of the National Science Foundation Graduate Research Fellowship and is a Student Member of the IEEE Photonics Society and the Optical Society of America (OSA). He is the recipient of the Award for Excellence from the Princeton School of Engineering and Applied Science (SEAS), the Optical Engineering Award of Excellence from the Princeton Department of Electrical Engineering, the Best Student Paper Award at the 2016 IEEE Summer Topicals Meeting Series, and the Class of 1883 Writing Prize from the Princeton Department of English. He has authored 9 refereed papers and a book chapter, presented research at 13 technical conferences, and contributed to the textbook *Neuromorphic Photonics* (2017).



**Paul R. Prucnal** received his A.B. in mathematics and physics from Bowdoin College, graduating *summa cum laude*. He then earned M.S., M.Phil. and Ph.D. degrees in electrical engineering from Columbia University. After his doctorate, Prucnal joined the faculty at Columbia University, where, as a member of the Columbia Radiation Laboratory, he performed groundbreaking work in OCDMA and self-routed photonic switching. In 1988, he joined the faculty at Princeton University. His research on optical CDMA initiated a new research field in which more than 1000 papers have since been published, exploring applications ranging from information security to communication speed and bandwidth. In 1993, he invented the "Terahertz Optical Asymmetric Demultiplexer," the first optical switch capable of processing terabit per second (Tb/s) pulse trains. Prucnal is author of the book, *Neuromorphic Photonics*, and editor of the book, *Optical Code Division Multiple Access: Fundamentals and Applications*. He was an Area Editor of IEEE Transactions on Communications. He has authored or co-authored more than 350 journal articles and book chapters and holds 28 U.S. patents. He is a Life Fellow of the Institute of Electrical and Electronics Engineers (IEEE), the Optical Society of America (OSA) and the National Academy of Inventors (NAI), and a member of honor societies including Phi Beta Kappa and Sigma Xi. He was the recipient of the 1990 Rudolf Kingslake Medal for his paper entitled "Self-routing photonic switching with optically-processed control," received the Gold Medal from the Faculty of Mathematics, Physics and Informatics at the Comenius University for leadership in the field of Optics in 2006, and has won multiple teaching awards at Princeton, including the E-Council Lifetime Achievement Award for Excellence in Teaching, the School of Engineering and Applied Science Distinguished Teacher Award, and the President's Award for Distinguished Teaching. He has been instrumental in founding the field of Neuromorphic Photonics and developing the "photonic neuron", a high speed optical computing device modeled on neural networks, as well as integrated optical circuits to improve wireless signal quality by cancelling radio interference.



**Hsuan-Tung Peng** received the B.S. degree in physics from National Taiwan University in 2015, and an M.A. in Electrical Engineering from Princeton University in 2018. He is now pursuing a Ph.D. degree at Princeton University, Princeton, NJ, USA. His current research interests include neuromorphic photonics, photonic integrated circuits, and optical signal processing.