

# Session 12 Overview: *High Performance Optical Receivers*

## WIRELINE SUBCOMMITTEE



**Session Chair:** Byungsub Kim  
POSTECH, Pohang, Korea



**Session Co-Chair:** Thomas Toifl  
Cisco Systems, Thalwil, Switzerland

Data rate and power efficiency of optical links have surged impressively in recent years, making them the workhorse communication medium for data center networking and high-performance computing. The papers in this session reflect key developments in how to increase data rates as well as how to enable low-power coherent optical solutions. The first paper in the session demonstrates a low-power ring-resonator-based silicon photonic WDM receiver module at 350Gb/s aggregate data rate. The second paper describes the state of the art for an optical RX frontend, where -14dBm sensitivity was achieved for a TIA operating with a 106.25Gb/s PAM-4 signal. Analog implementations of coherent optical links are attractive since they enable lower power consumption and smaller area than DSP-based designs. The third paper in this session demonstrates how to achieve a low-latency, high-bandwidth carrier-phase recovery (CPR) loop for a coherent optical RX using a purely analog approach.

8:30 AM

**12.1 A 0.96pJ/b 7×50Gb/s-per-Fiber WDM Receiver with Stacked 7nm CMOS and 45nm Silicon Photonic Dies**

Mayank Raj, AMD, San Jose, CA

In Paper 12.1, AMD demonstrates a 7×50Gb/s NRZ WDM receiver module, which achieves 350Gb/s aggregate data rate at <1e-12 BER without forward error correction. The RX incorporates stacked 7nm CMOS and 45nm silicon photonic dies and uses an array of cascaded optical ring resonators to receive optical data on a 1.5nm-spaced laser grid. High sensitivity (-11.1dBm median) was measured at 0.96pJ/b energy efficiency.




9:00 AM

**12.2 A 7 pA/√Hz Asymmetric Differential TIA for 100Gb/s PAM-4 links with -14dBm Optical Sensitivity in 16nm CMOS**

Kadaba Lakshmikumar, CISCO Systems, Allentown, PA

In Paper 12.2, Cisco Systems describes a linear PAM-4 CMOS differential TIA with asymmetric signal paths that uses the signal currents from both terminals of the PD to improve the SNR by 3dB. Designed in 16nm FinFET, the TIA has a gain of 77dBΩ, a bandwidth of 18.4GHz, and an input-referred noise density of 7pA/√Hz to achieve a -14dBm optical sensitivity at 106.25Gb/s.



9:30 AM

**12.3 A Carrier-Phase-Recovery Loop for a 3.2pJ/b 24Gb/s QPSK Coherent Optical Receiver**

Ahmed E. Abdelrahman, University of Illinois, Urbana, IL

In Paper 12.3, University of Illinois at Urbana-Champaign describes a wide bandwidth analog carrier-phase recovery (CPR) method for short-reach QPSK coherent optical links. Using 16-phase switched-inverter-based harmonic-rejection complex mixers (HRMs) and low-latency phase detection circuits, the prototype QPSK receiver fabricated in 28nm CMOS achieves 100MHz bandwidth, 600MHz tracking range, and recovers 24Gb/s QPSK data without errors. The power efficiency of the QPSK coherent receiver is 3.2pJ/b.



## 12.1 A 0.96pJ/b 7×50Gb/s-per-Fiber WDM Receiver with Stacked 7nm CMOS and 45nm Silicon Photonic Dies

Mayank Raj<sup>1</sup>, Chuan Xie<sup>1</sup>, Ade Bekele<sup>1</sup>, Adam Chou<sup>1</sup>, Wenfeng Zhang<sup>1</sup>, Ying Cao<sup>1</sup>, Jae Wook Kim<sup>1</sup>, Nakul Narang<sup>2</sup>, Hongyuan Zhao<sup>2</sup>, Yipeng Wang<sup>2</sup>, Kee Hian Tan<sup>2</sup>, Winson Lin<sup>1</sup>, Jay Im<sup>1</sup>, David Mahashin<sup>1</sup>, Santiago Asuncion<sup>1</sup>, Parag Upadhyaya<sup>1</sup>, Yohan Frans<sup>1</sup>

<sup>1</sup>AMD, San Jose, CA

<sup>2</sup>AMD, Singapore, Singapore

Emerging applications such as machine learning, high-performance computing, and cloud storage continue to push compute demands at the data center. To keep up, distributed computing architectures are being increasingly adopted where the physical locations of the CPU, GPU, FPGA, memory, and storage may span several meters. In-package silicon-photonics-based optical links with wavelength division multiplexing (WDM) and non-return-to-zero (NRZ) signaling provide a power-efficient, high-bandwidth, and low-latency interface between these components. In this paper, we present a low-power (0.96pJ/b), high-sensitivity (-11.1dBm median), high-bandwidth (7×50Gb/s NRZ WDM) receiver (RX) that achieves <1e-12 bit-error-rate (BER) without forward-error-correction (FEC).

Figure 12.1.1 shows the multi-chip module (MCM) prototype consisting of an electrical chip (EIC), a Si photonics chip (PIC), integrated fan-out (InFO), an organic interposer, and an organic laminate substrate. To achieve low power and high bandwidth, the EIC RX input is placed directly on top of the PIC using copper-pillar bumps with 55µm bump pitch. This allows the TIA to connect to the photodetector (PD) with minimal parasitic effects. InFO is used to connect the EIC to the proxy Core IC through a parallel interface (not used in this work). Light is coupled in and out of the PIC using a 15-fiber array with a 250µm pitch via V-grooves.

The PIC RX architecture is detailed in Fig. 12.1.1. To enable WDM channel selection, a cascaded ring resonator (CRR) is used. A CRR filter is formed by coupling two ring resonators to create a second-order flat-top filter that is centered around a given wavelength and has low crosstalk from adjacent wavelengths. The free spectral range (FSR) of the CRRs is designed to be 8× the channel spacing of 1.5nm. The drop port for each of the RX1-RX7 CRRs is connected to a SiGe-based waveguide photodetector (PD) with a responsivity of 0.9A/W. The drop port of the last CRR (RX8) is not connected to a PD and is used for debugging purposes in this test chip. The CRRs are designed to have an insertion loss of <1.5dB and crosstalk of <15dB. A racetrack design with a straight coupling section is chosen to increase coupling distances and thereby improve tolerance to process variation. The second-order filter provides lower crosstalk between adjacent wavelengths than a first-order filter (a single ring resonator) owing to its -40dB/dec gain roll-off. Additionally, the flat-top characteristic is less susceptible to laser wavelength spacing variations. This allows for collective thermal tuning of the CRR array to match the laser wavelengths. For the best overall BER performance, each CRR passband must be matched to its corresponding laser wavelength while maintaining a 1.5nm grid spacing. To optimize the laser grid and filter alignment, the through-port power (which comes for free in this design) for RX1-RX7 is summed and the lowest-loss middle value of 1313nm is used as the wavelength for RX1 (Fig. 12.1.2). Laser wavelengths for RX2-RX7 are determined by decrementing from this value in multiples of 1.5nm. The same operation can be achieved by collectively tuning the CRR array with fixed laser wavelengths.
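
The grid arithmetic described above can be sketched in a few lines. This is only an illustration under stated assumptions: the constant and function names are hypothetical, and the channel ordering (RX2 one step below RX1, RX3 two steps below, and so on) is assumed from the phrase "multiples of 1.5nm".

```python
# Illustrative grid arithmetic. Names are hypothetical; values come from the text.
CHANNEL_SPACING_NM = 1.5    # laser grid spacing
FSR_MULTIPLE = 8            # FSR is designed to be 8x the channel spacing
RX1_WAVELENGTH_NM = 1313.0  # lowest-loss middle value chosen for RX1
NUM_RX = 7                  # RX1..RX7 carry data; RX8 is a debug port


def laser_grid(start_nm, spacing_nm, count):
    """Wavelengths for RX1..RXcount, each one spacing step below the previous.

    Assumption: RXk sits (k - 1) * spacing below RX1.
    """
    return [start_nm - k * spacing_nm for k in range(count)]


fsr_nm = FSR_MULTIPLE * CHANNEL_SPACING_NM
grid_nm = laser_grid(RX1_WAVELENGTH_NM, CHANNEL_SPACING_NM, NUM_RX)

for idx, wavelength in enumerate(grid_nm, start=1):
    print(f"RX{idx}: {wavelength:.1f} nm")
print(f"CRR FSR: {fsr_nm:.1f} nm")
```

Under these assumptions the seven data channels occupy 9nm of the 12nm FSR, which leaves room for the eighth (debug) ring on the same grid.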

Figure 12.1.3 shows the complete RX architecture. A low-power, highly sensitive TIA is essential to an optical RX design. To utilize the $g_m/C$ scaling benefits offered by the 7nm technology node, a 3-stage inverter-based TIA [1] design with T-coil peaking [2] is chosen. The TIA output ( $V_{outb}$ ) is directly fed to the strong-arm latch-based slicers. The positive terminal of the slicers ( $V_{cm}$ ) is connected to a fixed slicer DC potential generated by a self-biased inverter with PMOS/NMOS strength control. To match the kickback from the sampling clocks on the  $V_{cm}$  and  $V_{outb}$  nodes, a programmable, transmission-gate-based switch is added to the  $V_{cm}$  node. Instead of a conventional continuous-time linear equalizer (CTLE) circuit, NMOS-based switches are added to resistors R1 and R3 of the TIA to implement programmable bandwidth and peaking. A DC current cancellation loop with 10MHz bandwidth is added to ensure that the input common-mode of the slicer is maintained at  $V_{cm}$  regardless of the average photodiode current level. The DC loop can be disabled by powering off the operational transconductance amplifier (OTA) (Fig. 12.1.3) and operating the TIA in an open loop. In this mode, the output voltage of the TIA after the RC filter is monitored using a static analog probe (SAP) to characterize the insertion loss of the CRR filters. The TIA is designed for a transimpedance ( $Z_t$ ) between  $4\text{k}\Omega$  and  $6\text{k}\Omega$ , >35GHz bandwidth, and  $3\mu\text{A}$  integrated input-referred noise. The power supply sensitivity of the TIA is improved by using an on-chip regulated supply, which provides >25dB of power supply rejection ratio (PSRR). To avoid mid-band (10-200MHz) differential PSRR peaking through the  $V_{cm}$  path due to tracking from the DC loop, a low-pass filter is added (Fig. 12.1.3).

The four data slicers and one error slicer are clocked by four phases of a quarter-rate clock. The error slicer samples the peak of the input signal and is used in combination with the data samples to perform baud-rate clock and data recovery (CDR) [2]. The CDR loop directs the sampling clocks to their optimum positions by digitally controlling an 8b phase interpolator (PI). The 8-phase clock to the PI is generated by a ring-based injection-locked oscillator (ILO1) (Fig. 12.1.3), which uses the 12.5GHz clock from the common PLL as a reference. The 4-phase clock for the slicers is generated by ILO2, which uses the PI output as a reference. A quadrature locked loop (QLL) is used to correct the phase mismatch of the 4-phase sampling clock and generate the supply voltage for the PI and the two ILOs [3]. In this design, the voltage-to-current (V-to-I) circuit driven by the quadrature error detector (QED) directly drives the NMOS source follower of the regulator. This allows us to remove the OTA and the filter capacitor (C2). Eliminating C2 pushes the second pole out, so C1 can be reduced further, thereby saving silicon area.
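
As a rough check of the clocking numbers, the PI step size can be estimated from the clock rate and PI resolution given above. This is a minimal sketch that assumes the 8b PI spans exactly one period of the 12.5GHz quarter-rate clock; that assumption is not stated explicitly in the text, but it reproduces the 312.5fs-per-code resolution quoted for the 2-D eye scans. Variable names are illustrative.

```python
# Assumption: the 8-bit PI covers one full period of the 12.5 GHz clock.
REF_CLOCK_HZ = 12.5e9   # quarter-rate clock from the common PLL
PI_BITS = 8             # phase interpolator resolution
DATA_RATE_BPS = 50e9    # per-lane NRZ data rate

clock_period_s = 1.0 / REF_CLOCK_HZ           # 80 ps
pi_step_s = clock_period_s / (2 ** PI_BITS)   # time shift per PI code
ui_s = 1.0 / DATA_RATE_BPS                    # one unit interval, 20 ps
codes_per_ui = ui_s / pi_step_s               # PI codes spanning one UI

print(f"PI step: {pi_step_s * 1e15:.1f} fs")
print(f"PI codes per UI: {codes_per_ui:.0f}")
```

With these values one unit interval corresponds to 64 PI codes, i.e. a quarter of the PI range, as expected for quarter-rate clocking.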

The EIC is fabricated in a 7nm FinFET process with a per-channel RX active area of  $165 \times 186 \mu m^2$ . The PIC is fabricated in a 45nm process. The measurement setup is shown in Fig. 12.1.4. All 7 receivers are characterized using a reference optical modulator, which generates a 50Gb/s PRBS7 pattern with a 3.7dB extinction ratio (ER). The wavelength is set by a tunable laser. A tunable optical attenuator is used to control the power sent to the PIC. The measured sensitivity of RX1-RX7 varies from -11.4dBm (OMA) to -10.1dBm (OMA) (Fig. 12.1.4), with a median of -11.1dBm (OMA). The variation in optical sensitivity is due to the loss variation in the CRR filters. This is verified by disabling the DC loop of the TIA and measuring the output voltage of the TIA for different optical attenuation settings. Figure 12.1.4 shows that the worst channel (RX1) response matches that of the best channel (RX7) with 1.5dB additional power input. This corresponds to the 1.3dB OMA loss in RX1. Figure 12.1.5 shows the measured BER bathtub curves for RX1-RX7 with 1dB link margin. All channels meet BER <1e-12 for >14% UI opening at 50Gb/s. Thus, in aggregate we achieve 350Gb/s operation at BER <1e-12 in a single fiber without any FEC. The measured 2-D eye scans (Fig. 12.1.5) with 1e10 bits show the internal eye shapes. The horizontal resolution for the 2-D eye scan is one PI code (312.5fs at 50Gb/s) and the vertical resolution is 4.4mV. The advantage of using a second-order flat-top CRR filter is demonstrated by RX5, which is error-free at <1e-12 BER despite its laser wavelength being off from the filter center by 200pm (Fig. 12.1.2). To quantify the impact of optical crosstalk among neighboring channels, bathtub BER and 2-D BER measurements are done for RX2 and RX3 with a two-laser source. Measurements show minimal impact on horizontal and vertical eye openings (Fig. 12.1.6), despite the narrow channel spacing of 1.5nm.

The RX consumes 40mW while the common PLL consumes 8.2mW (amortized across 7 lanes). The combined energy efficiency for the RX is measured to be 0.96pJ/b at 50Gb/s. Figure 12.1.6 summarizes the system performance and compares it to previously reported CMOS receiver frontends. We achieve the highest aggregate data rate while being the most power efficient when compared to works with integrated SerDes. Figure 12.1.7 shows the die photos of the CRR array in the PIC, the RX circuits in the EIC, and the packaged module.
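
The energy-efficiency figure follows from simple arithmetic. The sketch below assumes that the quoted 8.2mW is already the per-lane share of the common PLL power; this reading is an assumption, chosen because it reproduces the reported 0.96pJ/b. The function and variable names are illustrative.

```python
RX_POWER_MW = 40.0      # per-lane RX power
PLL_SHARE_MW = 8.2      # assumed to be the per-lane share of the common PLL
LANE_RATE_GBPS = 50.0   # per-lane data rate
LANES = 7               # WDM channels per fiber


def energy_per_bit_pj(power_mw, rate_gbps):
    """Energy per bit in pJ/b: mW / (Gb/s) = 1e-3 W / 1e9 b/s = 1e-12 J/b."""
    return power_mw / rate_gbps


lane_energy_pj = energy_per_bit_pj(RX_POWER_MW + PLL_SHARE_MW, LANE_RATE_GBPS)
aggregate_rate_gbps = LANES * LANE_RATE_GBPS

print(f"Energy efficiency: {lane_energy_pj:.2f} pJ/b")
print(f"Aggregate data rate: {aggregate_rate_gbps:.0f} Gb/s")
```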

### Acknowledgement:

The work is funded by DARPA PIPES contract HR0011-19-3-0004. The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

### References:

- [1] M. Rakowski et al., "A 4×20Gb/s WDM ring-based hybrid CMOS silicon photonics transceiver," *ISSCC*, pp. 408-409, Feb. 2015.
- [2] M. Raj et al., "Design of a 50-Gb/s Hybrid Integrated Si-Photonic Optical Link in 16-nm FinFET," *IEEE JSSC*, vol. 55, no. 4, pp. 1086-1095, Apr. 2020.
- [3] C. F. Poon et al., "A 1.24-pJ/b 112-Gb/s (870 Gb/s/mm) Transceiver for In-Package Links in 7-nm FinFET," *IEEE JSSC*, vol. 57, no. 4, pp. 1199-1210, Apr. 2022.
- [4] H. Li et al., "A 100Gb/s -8.3dBm-Sensitivity PAM-4 Optical Receiver with Integrated TIA, FFE and Direct-Feedback DFE in 28nm CMOS," *ISSCC*, pp. 190-191, Feb. 2021.
- [5] I. Ozkaya et al., "A 64Gb/s 1.4-pJ/b NRZ Optical Receiver Data Path in 14nm CMOS FinFET," *IEEE JSSC*, vol. 52, no. 12, pp. 3458-3473, Dec. 2017.



Figure 12.1.1: MCM prototype (top left), EIC block diagram (top right) and PIC architecture (bottom).



Figure 12.1.2: Measured through port power (top) and optimizing 1.5nm laser grid to minimize overall loss (bottom).



Figure 12.1.3: Detailed receiver architecture.



Figure 12.1.4: Measurement setup and measured sensitivity for RX1-RX7 at 50Gb/s.



Figure 12.1.5: Bathtub BER (top) and 2-D eye scan (bottom) measurement for RX1-RX7 at 50Gb/s.



\*Best across 7 channels \*\*No SerDes

Figure 12.1.6: BER bathtub and 2-D eye scan measurement with and without crosstalk (top). Performance summary and comparison (bottom).



Figure 12.1.7: EIC micrograph (top), PIC micrograph (bottom left), and packaged module (bottom right).

## 12.2 A 7 pA/√Hz Asymmetric Differential TIA for 100Gb/s PAM-4 links with -14dBm Optical Sensitivity in 16nm CMOS

Kadaba Lakshmikumar<sup>\*1</sup>, Alexander Kurylak<sup>\*1</sup>, Romesh Kumar Nandwana<sup>\*1</sup>, Bibhu Das<sup>1</sup>, Joe Pampanin<sup>1</sup>, Mike Brubaker<sup>1</sup>, Pavan Kumar Hanumolu<sup>2</sup>

<sup>1</sup>CISCO Systems, Allentown, PA

<sup>2</sup>University of Illinois, Urbana, IL

\*Equally Credited Authors (ECAs)

A transimpedance amplifier (TIA) is a critical building block that impacts the noise, bandwidth, and power consumption of intensity modulation and direct detection (IMDD) optical links used in data centers. CMOS TIAs using the shunt-feedback (SF) topology (Fig. 12.2.1) have recently been shown to achieve adequate noise and bandwidth performance to facilitate 100Gb/s receivers [1-4]. However, the SF-TIA suffers from debilitating tradeoffs between its noise and bandwidth, which make it fundamentally challenging to improve noise/bandwidth performance beyond what has already been achieved. An alternative that has the potential to overcome the fundamental shortcomings of the single-ended (SE) SF-TIA is a differential TIA. Recognizing that the SE-SF-TIA only uses the photo-current flowing out of one terminal of the photodiode (PD), a differential TIA seeks to double the signal current by using the current coming out of the PD's other terminal (Fig. 12.2.1). As the PD current also flows in the complementary branch, the signal increases by 6dB at the cost of a 3dB increase in noise, resulting in a theoretical 3dB increase in SNR. However, achieving this 3dB SNR improvement in practice is difficult. To understand the reasons behind it, consider the conventional differential TIA as shown in Fig. 12.2.1. It employs capacitively coupled signal paths to bring the PD current to the TIAs. Resistors  $R_{B1}$  ( $R_{B2}$ ) are used to reverse bias the PD and need to be chosen such that the corner frequency ( $F_C$ ) of the high-pass filter formed by  $R_{B1}$ - $C_{C1}$  ( $R_{B2}$ - $C_{C2}$ ) is low enough to pass the low-frequency components of the PAM-4 data. However, the maximum value of  $R_{B1}$  ( $R_{B2}$ ) is limited by the tolerable voltage drop caused by the average PD current. For example, even a 300µA average current with  $R_{B1} = R_{B2} = 20\text{k}\Omega$  would entail a 6V drop, which is prohibitively large in fine-line CMOS processes. 
In [5], the bias resistor was replaced by a regulator to alleviate the voltage headroom issue. But this approach is severely limited by the conflicting regulator output impedance ( $R_{OUT}$ ) requirements: achieving a low  $F_C$  requires a large  $R_{OUT}$ ; achieving good line/load regulation and power supply rejection (PSR) mandates a low  $R_{OUT}$ . Even if  $R_{OUT}$  is made as high as 20kΩ and  $C_{C1} = C_{C2} = 20\text{pF}$ , the high-pass corner would be nearly 400kHz, which is about an order of magnitude higher than what is needed for low baseline wander. Consequently, further increasing  $C_{C1}/C_{C2}$  is the only viable option for lowering  $F_C$ . However, the top/bottom plate parasitic capacitors  $C_{PT}/C_{PB}$  of the coupling capacitors severely degrade the TIA performance in two critical ways. First, they shunt the photocurrent, significantly reducing the signal current flowing into the TIA and lowering the effective transimpedance at high frequencies. Second, they add to the TIA input capacitance, reducing the TIA bandwidth and increasing its noise [6]. Because of these drawbacks, practical differential TIA performance is not superior to an SE-TIA.
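
The three numerical claims in this argument (the 3dB SNR gain, the 6V bias drop, and the roughly 400kHz corner) can be reproduced with a short calculation. This is a sketch under two assumptions that the text does not spell out: the noise of the two TIA branches is equal and uncorrelated, so noise power doubles, and the high-pass corner is that of a first-order RC network. Function names are illustrative.

```python
import math


def db_amplitude(ratio):
    """Decibels for an amplitude (current or voltage) ratio."""
    return 20.0 * math.log10(ratio)


def db_power(ratio):
    """Decibels for a power ratio."""
    return 10.0 * math.log10(ratio)


def highpass_corner_hz(resistance_ohm, capacitance_f):
    """Corner frequency of a first-order RC high-pass filter."""
    return 1.0 / (2.0 * math.pi * resistance_ohm * capacitance_f)


# Doubling the signal current is an amplitude ratio of 2 (about 6 dB).
signal_gain_db = db_amplitude(2.0)
# Two equal, uncorrelated noise sources double the noise power (about 3 dB).
noise_gain_db = db_power(2.0)
snr_gain_db = signal_gain_db - noise_gain_db

# Voltage drop across the two 20 kOhm bias resistors at 300 uA average current.
bias_drop_v = 300e-6 * (20e3 + 20e3) / 2 * 1  # per resistor: 300 uA * 20 kOhm
# Corner frequency with R_OUT = 20 kOhm and a 20 pF coupling capacitor.
corner_hz = highpass_corner_hz(20e3, 20e-12)

print(f"SNR gain: {snr_gain_db:.2f} dB")
print(f"Bias drop per resistor: {bias_drop_v:.1f} V")
print(f"High-pass corner: {corner_hz / 1e3:.0f} kHz")
```

Because the corner scales as 1/(RC), pushing it an order of magnitude lower with a fixed 20kΩ would require roughly 200pF of coupling capacitance, which is what makes the parasitic plate capacitances so costly.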

This paper presents the first CMOS differential TIA that overcomes the fundamental shortcomings of differential TIAs described above. Using asymmetric signal paths, wherein capacitive coupling is employed only on the PD's cathode side, and cascaded regulators to bias the PD, the proposed differential TIA achieves an  $F_C$  similar to that of an SE-TIA without using large capacitors or compromising the regulator's PSR, and provides about 1.5dB optical sensitivity improvement over its SE-TIA counterpart. Simulated eye diagrams for SSPRQ data with the conventional differential TIA are shown in Fig. 12.2.1, highlighting the degradation in eye quality compared to an SE-TIA.

Figure 12.2.2 shows the detailed schematic of the proposed differential TIA when it is flip-chipped on a photonic IC. It comprises a PD, a wideband regulator, a narrowband regulator, two SE-SF-TIAs, a small on-chip coupling capacitor ( $C_c$ ), DC-feedback (DCFB) loops, programmable-gain amplifiers (PGAs), and 50Ω output buffers. The PD anode is direct-coupled to the first SF-TIA and is biased at the common-mode voltage ( $\approx V_{SUP}/2$ ) of the self-biased TIA, as in a conventional SE-TIA [1]. The PD cathode is capacitively coupled to the second SF-TIA and biased by a narrowband regulator (NBR) that is designed to present a large  $R_{OUT}$  at high frequencies ( $> 100\text{kHz}$ ). As described later, the wideband regulator (WBR) improves the high-frequency PSR, thus overcoming the PSR versus  $R_{OUT}$  tradeoff in a conventional differential TIA. Thanks to the large  $R_{OUT}$  of the NBR and the proposed hybrid DC/AC coupling, a small  $C_c$  of only 2pF is sufficient to satisfy the low  $F_C$  requirement. To elucidate this point further, note that summing the DC- and AC-coupled paths lowers the corner frequency by ~2× compared to the conventional case in which both the anode and cathode nodes are AC-coupled. In addition, as illustrated by the simulated AC response (Fig. 12.2.3), the low-frequency attenuation up to the DCFB corner frequency is limited to 6dB, irrespective of the high-pass corner frequency, significantly lowering the baseline wander.

The small  $C_c$  in our design alleviates the two critical issues facing the conventional differential TIAs described earlier. First, the parasitic capacitors increase the second TIA's input capacitance ( $C_T$ ) only slightly and therefore have a negligible impact on the TIA bandwidth. This benefit can be directly observed in the simulated AC response (Fig. 12.2.3) and the cleaner eye diagrams compared to Fig. 12.2.1. Second, the small parasitic capacitors minimally shunt the signal current, which translates to negligible gain reduction, as seen in Fig. 12.2.3. It is important to note that the asymmetry in the frequency response of the DC- and AC-coupled signal paths only amounts to a small common-mode signal at low frequencies, which can be easily suppressed by a downstream differential stage.

The WB and NB regulators are implemented using a PMOS pass-device-based low-dropout (LDO) topology. The WBR operates from a 3.3V module supply and has its dominant pole at the output node ( $V_{SUP_1}$ ) to achieve good power supply rejection over a wide band. The small PD average current makes it possible to make the output pole dominant by using a relatively small compensation capacitor placed at  $V_{SUP_1}$ . On the other hand, the NBR is frequency-compensated by placing the dominant pole at the output of the error amplifier ( $V_{G2}$ ) so that  $R_{OUT}$  beyond its bandwidth approximately equals the output resistance ( $r_{ds}$ ) of the pass transistor. A pass transistor with a large channel length is used to achieve high  $r_{ds}$ . Simulated impedances at the cathode, looking into the NB regulator and the TIA for different coupling capacitor values, are shown in Fig. 12.2.3. The intersection of these curves is the high-pass corner frequency of the signal in the AC-coupled path. A large output impedance helps to reduce the capacitor value for a given high-pass corner. For frequencies higher than the corner, the PD signal flows predominantly through  $C_c$  to the TIA as desired.

The prototype TIA chip is fabricated in a 16nm FinFET CMOS technology. To minimize noise, the TIA bandwidth was targeted to be about a third of the baud rate. The DSP receiver equalizes the ISI created by this low bandwidth. An SE-SF-TIA with similar gain and bandwidth was designed alongside it to accurately measure the improvement offered by the asymmetric differential TIA. The characterization setup is shown in Fig. 12.2.4. The electrical ICs were flip-chipped on a silicon photonics IC containing the photodiode. The optical input to the chip-on-chip assembly is generated by modulating a laser source with a LiNbO<sub>3</sub> modulator. No filter or equalizer was applied for the bandwidth and noise measurements in the sampling scope. The measured bandwidths of the SE-SF-TIA and the differential TIA at the maximum gain setting of the PGAs are very similar (18.4GHz), as shown in Fig. 12.2.4. The output noise of the differential TIA was measured by turning off the laser source and plotting the output voltage histogram on a sampling scope (Fig. 12.2.4). The output noise is 6.88mV<sub>rms</sub>, including 0.6mV<sub>rms</sub> scope noise, which translates to an input-referred noise of 1.14µA<sub>rms</sub> and an average noise density of 7pA/√Hz with 77dBΩ transimpedance and 26.5625GHz noise bandwidth. This is the lowest noise reported for any CMOS TIA for 100Gb/s PAM-4 links. The time-domain noise measurements were corroborated using a spectrum analyzer's noise density plot. The input-referred current noise spectral densities for the SE and differential TIAs are shown in Fig. 12.2.4.
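
The average noise density quoted above can be cross-checked from the integrated input-referred noise and the noise bandwidth. The sketch below assumes the density is a flat average over the stated bandwidth, and notes that the 26.5625GHz noise bandwidth equals half the PAM-4 baud rate; both points are inferred from the numbers rather than stated in the text. Names are illustrative.

```python
import math

DATA_RATE_BPS = 106.25e9    # PAM-4 line rate
BITS_PER_SYMBOL = 2         # PAM-4 carries two bits per symbol
INPUT_NOISE_A_RMS = 1.14e-6 # integrated input-referred noise from the text

baud_rate_hz = DATA_RATE_BPS / BITS_PER_SYMBOL   # 53.125 GBd
noise_bandwidth_hz = baud_rate_hz / 2.0          # 26.5625 GHz, as in the text
target_bandwidth_hz = baud_rate_hz / 3.0         # "about a third of the baud rate"

# Average density, assuming the noise is spread evenly over the bandwidth.
noise_density_a_per_rthz = INPUT_NOISE_A_RMS / math.sqrt(noise_bandwidth_hz)

print(f"Baud rate: {baud_rate_hz / 1e9:.3f} GBd")
print(f"Target TIA bandwidth: {target_bandwidth_hz / 1e9:.1f} GHz")
print(f"Average noise density: {noise_density_a_per_rthz * 1e12:.2f} pA/sqrt(Hz)")
```

The one-third-baud target works out to about 17.7GHz, close to the measured 18.4GHz bandwidth.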

The setup for measuring the symbol-error-rate (SER) vs. optical-modulation-amplitude (OMA) is shown in Fig. 12.2.4. A low-pass filter with a corner frequency of 0.75× the baud rate was instantiated in software at the front end of the sampling scope. This is followed by a 12-tap FFE and 1-tap DFE equalizer to emulate the receiver characteristics. The SER vs. OMA is shown for both the SE and differential TIAs in Fig. 12.2.5. There is nearly a 1.5dB improvement in optical sensitivity at the FEC limit, enabling a -14dBm sensitivity for the proposed asymmetric differential TIA at the pre-FEC SER limit of 4.8E-4. The eye diagrams for the SE and proposed asymmetric differential TIA are shown for PRBS13Q and SSPRQ patterns. There is no degradation in the quality of the SSPRQ eye for the differential TIA, demonstrating the proposed architecture's effectiveness with high-stress patterns. Figure 12.2.6 shows the performance summary and comparison with state-of-the-art TIAs, indicating the lowest reported input-referred noise with only 108mW of power consumption from 3.3V and 1.8V supplies. Die micrographs of both TIAs are shown in Fig. 12.2.7.

### Acknowledgement:

The authors thank S. Sunder, C. Appel, V. Bocuzzi, M. Traverso, M. Mazzini, and K. Patel for their contribution and support.

### References:

- [1] K. R. Lakshmikumar et al., "A Process and Temperature Insensitive CMOS Linear TIA for 100 Gb/s/λ PAM-4 Optical Links," *IEEE JSSC*, vol. 54, no. 11, pp. 3180-3190, Nov. 2019.
- [2] H. Li et al., "A 112 Gb/s PAM4 Linear TIA with 0.96 pJ/bit Energy Efficiency in 28nm CMOS," *IEEE ESSCIRC*, pp. 238-241, Sept. 2018.
- [3] H. Li et al., "A 100-Gb/s PAM-4 Optical Receiver With 2-Tap FFE and 2-Tap Direct-Feedback DFE in 28-nm CMOS," *IEEE JSSC*, vol. 57, no. 1, pp. 44-53, Jan. 2022.
- [4] D. Patel et al., "A 112 Gb/s -8.2 dBm Sensitivity 4-PAM Linear TIA in 16nm CMOS with Co-Packaged Photodiodes," *IEEE CICC*, Apr. 2022.
- [5] J. Lambrecht et al., "A 106-Gb/s PAM-4 Silicon Optical Receiver," *IEEE Photonics Technology Letters*, vol. 31, no. 7, pp. 505-508, Apr. 2019.
- [6] E. Säckinger, *Analysis and Design of Transimpedance Amplifiers for Optical Receivers*. John Wiley & Sons, 2017.



Figure 12.2.1: Block diagram of conventional single-ended and differential TIA with simulated SSPRQ performance.



Figure 12.2.2: Simplified schematic of the proposed asymmetric differential linear CMOS TIA.



Figure 12.2.3: Simulated AC response comparison between different TIA architectures, cathode node impedance and internal node eye diagrams of the proposed asymmetric differential TIA.



Figure 12.2.4: Optical measurement setup, large signal bandwidth measurement, output noise and input-referred noise density measurement of the proposed TIA.



Figure 12.2.5: Optical sensitivity measurement, PRBS-13Q, and SSPRQ output eye diagram comparison between single-ended TIA and proposed asymmetric differential TIA.

| | JSSC'19 [1] | ESSCIRC'18 [2] | JSSC'22 [3] | CICC'22 [4] | PTL'19 [5] | <b>This work</b> |
|---|---|---|---|---|---|---|
| Technology | 16nm FinFET | 28nm Bulk | 28nm Bulk | 16nm FinFET | 55nm SiGe | 16nm FinFET |
| Data Rate (Gb/s) | 106.25 | 112 | 100 | 112 | 106 | 106.25 |
| Modulation Format | PAM-4 | PAM-4 | PAM-4 | PAM-4 | PAM-4 | PAM-4 |
| Supply Voltage (V) | 1.8 | 2.5/1.2 | 1.5 | 0.9 | 3.3/2.5 | 1.8 |
| TIA Architecture | Single-Ended | Single-Ended | Single-Ended | Single-Ended | Differential | Single-Ended |
| Optical Measurements | Yes | No | Yes | Yes | Yes | Yes |
| Bandwidth (GHz) | 27 | 60 | 20 | 32 | N/A | 18.4 |
| Transimpedance (dBΩ) | 78 | 65 | 66 | 63 | 66 | 75.5 |
| Input ref. noise ( $\mu\text{A}_{\text{rms}}$ ) | 2.7<sup>1</sup> | 4.7 | 2.5 | 3 | 3.2 | 1.5 |
| Input ref. noise density ( $\text{pA}/\sqrt{\text{Hz}}$ ) | 16.7 | 19.3 | 17 | 16.9 | N/A | 9.2 |
| Power (mW) | 60.8 | 107 | 117 | 47<sup>4</sup> | 160 | 103.6 |
| Output Swing (mVpp diff.) | 600<sup>2</sup> | 300 | 600<sup>2</sup> | 450 | N/A | 300 |
| Sensitivity at KP4 pre-FEC SER (dBm) | -11<br>(5-tap FFE) | -5.1<sup>3</sup><br>(5-tap FFE) | -8.9<br>(2-tap FFE + 2-tap DFE on-chip) | -9.6<br>(4-tap FFE + 4-tap DFE) | -5<sup>5</sup>, -7<sup>6</sup><br>(12-tap FFE + 1-tap DFE) | -12.48<br>(12-tap FFE + 1-tap DFE) |

<sup>1</sup>Calculated <sup>2</sup>Without 50Ω termination <sup>3</sup>Simulated <sup>4</sup>No supply regulation <sup>5</sup>In-fiber OMA <sup>6</sup>Calculated with PD responsivity

Figure 12.2.6: Performance summary and comparison with the state-of-the-art TIA designs.



Figure 12.2.7: Die micrographs of single-ended and proposed asymmetric differential TIAs.

## 12.3 A Carrier-Phase-Recovery Loop for a 3.2pJ/b 24Gb/s QPSK Coherent Optical Receiver

Ahmed E. Abdelrahman<sup>1</sup>, Mostafa G. Ahmed<sup>1,2</sup>, Mahmoud A. Khalil<sup>1</sup>, Mohamed Badr Younis<sup>1</sup>, Kyu-Sang Park<sup>1</sup>, Pavan Kumar Hanumolu<sup>1</sup>

<sup>1</sup>University of Illinois, Urbana, IL, <sup>2</sup>now at Ain Shams University, Cairo, Egypt

The increasing intra-datacenter traffic is pushing the demand for ultra-high-speed optical interconnects that maximize both power efficiency and data rate per wavelength. Intensity modulation-direct detection (IM-DD) links are used in these short-reach applications because of their simplicity and low power consumption; however, increasing their data rates is becoming exceedingly difficult due to technology- and packaging-imposed constraints. Coherent links, traditionally used in long-reach applications, are gaining traction as an alternative to short-reach IM-DD links. Compared to IM-DD, coherent links can deliver 4× the spectral efficiency by utilizing three degrees of freedom of the optical signal (i.e., intensity, phase, and polarization states). Still, this comes at the expense of the receiver complexity needed to perform polarization demultiplexing, chromatic dispersion (CD) compensation, and carrier phase recovery (CPR). Such complex functions are usually implemented on dedicated DSP chips separate from the analog front-end, resulting in very high power consumption. Recently, analog-based implementations of polarization demultiplexing, CD compensation and CPR have been successfully demonstrated [1–4]. However, the CPR in [1] suffers from limited phase tracking bandwidth (~100kHz) and requires high-quality tunable lasers with very narrow linewidth to avoid adding excessive phase noise, which would degrade the phase recovery capability. A wide CPR loop bandwidth (~1.1GHz) was achieved in [4], but at the expense of high power consumption (75pJ/b). Moreover, the feedback signals are routed off-chip with external loop filters, making the sensitive control signal susceptible to external noise.

Given the above drawbacks, this paper presents a monolithic CPR using 16-phase switched-inverter-based harmonic-rejection complex mixers (HRMs) and low-latency QPSK phase-detection circuits to achieve a wide bandwidth of 100MHz, which is three orders of magnitude wider than that reported in [1], with 23x better energy efficiency than [4]. This makes the proposed analog receiver well suited for energy-efficient coherent links in short-reach data center applications.

Figure 12.3.1 shows the system-level block diagram of the proposed receiver. A photonic IC (PIC) with a 90° hybrid and four unbalanced photodiodes converts the beat note between the received modulated optical signal and the local RX laser into differential I and Q current signals ($I_{IP}/I_{IN}$, $I_{QP}/I_{QN}$). Because the TX and RX laser frequencies are typically not equal, this beat note appears at the intermediate frequency (IF) $F_{IF} = F_{TX} - F_{RX}$ in the form of a constantly rotating I/Q constellation, requiring accurate de-rotation to maintain a stationary constellation and maximize the SNR. This is achieved by down-converting from $F_{IF}$ to baseband using the circuitry shown in Fig. 12.3.1.
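The rotation and de-rotation described above can be illustrated with a short numerical sketch. It is an idealized baseband model, not the receiver circuit: the function names are hypothetical, the symbol rate of 12GBaud is an assumption derived from 24Gb/s QPSK at 2 bits per symbol, and noise, laser phase noise, and the initial phase offset are ignored.

```python
import cmath
import math

# Ideal unit-amplitude QPSK points at 45, 135, 225, and 315 degrees.
QPSK = [cmath.exp(1j * (math.pi / 4 + k * math.pi / 2)) for k in range(4)]

def received_sample(symbol, f_if_hz, t_s):
    """Symbol as seen at the hybrid output: rotated by the IF beat note."""
    return symbol * cmath.exp(2j * math.pi * f_if_hz * t_s)

def derotate(sample, f_lo_hz, t_s):
    """Multiply by a complex LO at f_lo_hz to undo the rotation."""
    return sample * cmath.exp(-2j * math.pi * f_lo_hz * t_s)

F_IF = 1.8e9   # IF used in the reported measurements
BAUD = 12e9    # assumed symbol rate (24Gb/s QPSK, 2 bits per symbol)

# Constellation rotation accumulated over one symbol period, in degrees.
rotation_per_symbol_deg = 360.0 * F_IF / BAUD

if __name__ == "__main__":
    # The same symbol is sent four times; its received angle keeps moving
    # until the complex LO at F_IF is applied.
    for n in range(4):
        t = n / BAUD
        rx = received_sample(QPSK[0], F_IF, t)
        bb = derotate(rx, F_IF, t)
        print(n,
              round(math.degrees(cmath.phase(rx)), 1),
              round(math.degrees(cmath.phase(bb)), 1))
```

With these assumed numbers the constellation advances by a fixed angle every symbol, which is why a free-running slicer cannot recover the data without the de-rotating LO.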

The four input currents are converted to voltage signals ($V_{IP}/V_{IN}$, $V_{QP}/V_{QN}$) by four high-gain, wide-bandwidth AFEs, each consisting of a transimpedance amplifier (TIA) followed by two variable-gain amplifier stages (VGA1, VGA2). The amplified signals are fed to the CPR loop, which comprises a bank of 16-phase HRMs, a phase detector (PD), a $G_M$ stage, a loop filter (LF), an 8-phase ring-VCO (RVCO), and a 16-phase divide-by-2 divider. A clean external clock is fed to an I/Q clock generator that produces four quarter-rate sampling clocks ($CLK_{0,90,180,270}$) for the I/Q data slicers and the CPR PD slicers. The HRMs mix the $V_{IP}/V_{IN}$ and $V_{QP}/V_{QN}$ signals with an effective complex LO to generate differential NRZ voltage signals ($V_{IP}'/V_{IN}'$, $V_{QP}'/V_{QN}'$), which represent the de-rotated QPSK constellation. The data slicers sample them to recover the two data bits in the received QPSK signal. The PD, comprising two track-and-hold (T/H) circuits, two slicers, and a mixing unit, extracts the phase error, $\Phi_e$, by measuring $V_{D_0}D_0 - V_{D_1}D_1 = K_{PD}\sin(\Phi_e)$, where $K_{PD}$ is the PD gain, as depicted in the upper left of Fig. 12.3.2. The error signal, $V_{PD}$, is converted into a current by the $G_M$ stage and fed to a conventional proportional-integral LF. The filtered error voltage, $V_{CTRL}$, drives the RVCO toward phase lock. Once locked, the RVCO frequency is $2F_{IF}$ ($F_{IF}$ after the divider), and the constellation seen by the data slicers becomes stationary.
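To make the $K_{PD}\sin(\Phi_e)$ characteristic concrete, the sketch below models a textbook decision-directed (Costas-style) QPSK phase detector: each held analog sample is multiplied by the sliced decision of the other rail and the two products are subtracted. This is an assumed behavioural model that may differ in detail from the circuit in Fig. 12.3.2; the function names and the unit signal amplitude are hypothetical.

```python
import math

def sign(x):
    """Ideal slicer decision (+1 or -1)."""
    return 1.0 if x >= 0 else -1.0

def qpsk_phase_detector(v_i, v_q):
    """Cross-multiply each analog sample with the other rail's decision."""
    return v_q * sign(v_i) - v_i * sign(v_q)

def pd_output(symbol_index, phi_e_rad, amplitude=1.0):
    """PD output for QPSK symbol 0..3 received with phase error phi_e_rad."""
    theta = math.pi / 4 + symbol_index * math.pi / 2 + phi_e_rad
    return qpsk_phase_detector(amplitude * math.cos(theta),
                               amplitude * math.sin(theta))

if __name__ == "__main__":
    # While the slicer decisions are correct (|phi_e| < 45 degrees), every
    # symbol gives the same output, sqrt(2) * amplitude * sin(phi_e).
    for phi_deg in (-30, -10, 0, 10, 30):
        outs = [pd_output(k, math.radians(phi_deg)) for k in range(4)]
        print(phi_deg, [round(v, 4) for v in outs])
```

In this model the output does not depend on which symbol was sent, so the detector needs no knowledge of the data, and the gain corresponds to $K_{PD} = \sqrt{2}$ times the signal amplitude.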

The phase-error transfer function from the receiver inputs to the HRM outputs is high-pass due to the CPR, with a corner frequency equal to the CPR loop bandwidth ($BW_{CPR}$). The value of $BW_{CPR}$ is dictated by the performance of the TX and RX laser sources and can be set anywhere from 10 to 100MHz in the prototype receiver. A high $BW_{CPR}$ is desirable in phase-noise-limited regimes dominated by laser linewidths, where it suppresses the phase noise and maintains a stationary constellation. On the other hand, in amplitude-noise-limited regimes where laser shot noise or AFE thermal noise dominates, it is beneficial to lower $BW_{CPR}$ to minimize the self-generated phase noise [5].
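The benefit of a wide loop in the phase-noise-limited regime can be estimated with a simple model. The sketch assumes a first-order high-pass error response with its corner at $BW_{CPR}$ (the prototype uses a proportional-integral filter, so this is only an approximation) and Lorentzian laser phase noise with one-sided PSD $\Delta\nu/(\pi f^2)$, for which the residual phase variance integrates to $\Delta\nu/(2\,BW_{CPR})$. The 1MHz combined linewidth is a hypothetical value, and the amplitude-noise contribution that favors a narrower loop is not modeled.

```python
import math

def error_highpass_mag(f_hz, bw_cpr_hz):
    """|H_e(f)| of an assumed first-order high-pass with corner bw_cpr_hz."""
    return f_hz / math.hypot(f_hz, bw_cpr_hz)

def residual_phase_rms_deg(linewidth_hz, bw_cpr_hz):
    """RMS residual phase for Lorentzian phase noise after the loop.

    Integrating linewidth / (pi * (f^2 + bw^2)) over f from 0 to infinity
    gives a variance of linewidth / (2 * bw) in rad^2.
    """
    return math.degrees(math.sqrt(linewidth_hz / (2.0 * bw_cpr_hz)))

if __name__ == "__main__":
    combined_linewidth_hz = 1e6  # hypothetical TX + RX laser linewidth
    for bw in (10e6, 50e6, 100e6):  # bandwidth settings of the prototype
        print(f"BW_CPR = {bw / 1e6:.0f} MHz: "
              f"{residual_phase_rms_deg(combined_linewidth_hz, bw):.2f} deg rms")
```

Under these assumptions the residual phase error falls with the square root of the loop bandwidth, so a tenfold increase in $BW_{CPR}$ reduces it by about a factor of 3.2.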

At the core of the CPR is the I/Q mixing stage, implemented as 16-phase HRMs, shown in Fig. 12.3.2. The HRMs consist of four pseudo-differential switched-inverter-based 8-phase mixing units that perform complex LO signal multiplication with the input signal through a weighted sum of inverter currents. 16-phase 50% duty-cycle square waves at $F_{IF}$ control all switched-inverters to create the output currents ($I_{HRMIP}/I_{HRMIN}$, $I_{HRMQP}/I_{HRMQN}$), which are converted into the voltage signals ($V_{IP}'/V_{IN}'$, $V_{QP}'/V_{QN}'$) through shunt-feedback TIAs. The weights and number of phases are chosen such that the effective LO is as close as possible to a sinusoid with higher-order harmonic cancellation up to $15F_{IF}$. For a 16-phase HRM, the effective LO has a THD of 11% due to uncancelled higher-order harmonics and is fixed across PVT and the RVCO tuning range. A DC offset cancellation loop (DCOC2) eliminates DC components at the input of the HRMs and prevents DC up-conversion to $F_{IF}$.
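The 11% THD figure can be checked against an idealized model of the effective LO. The sketch treats the weighted sum of 16 square-wave phases as a 16-step sampled-and-held cosine with ideal sinusoidal weights, which is an assumption (a real implementation uses quantized inverter weights), and computes the exact Fourier coefficients of that staircase. All function names are hypothetical.

```python
import cmath
import math

def staircase_lo(n_phases):
    """One period of a sampled-and-held cosine with n_phases equal steps."""
    return [math.cos(2 * math.pi * (k + 0.5) / n_phases)
            for k in range(n_phases)]

def fourier_coeff(levels, h):
    """Exact complex Fourier coefficient of harmonic h (period = 1)."""
    n = len(levels)
    total = 0j
    for k, level in enumerate(levels):
        start = cmath.exp(-2j * math.pi * h * k / n)
        stop = cmath.exp(-2j * math.pi * h * (k + 1) / n)
        total += level * (start - stop) / (2j * math.pi * h)
    return total

def thd(levels):
    """THD from Parseval: all power outside the fundamental (zero DC)."""
    mean_square = sum(v * v for v in levels) / len(levels)
    fundamental_power = 2 * abs(fourier_coeff(levels, 1)) ** 2
    return math.sqrt(mean_square / fundamental_power - 1.0)

def residual_harmonics(levels, h_max=40, tol=1e-9):
    """Harmonic indices (2..h_max) that the staircase does not cancel."""
    return [h for h in range(2, h_max + 1)
            if abs(fourier_coeff(levels, h)) > tol]

if __name__ == "__main__":
    lo = staircase_lo(16)
    print(f"THD = {100 * thd(lo):.1f}%")
    print("residual harmonics:", residual_harmonics(lo))
```

In this idealized model harmonics 2 through 14 cancel and the first residual terms appear at the 15th and 17th harmonics. The THD agrees with the closed form $\sqrt{(\pi/N)^2/\sin^2(\pi/N) - 1}$ for $N = 16$, about 11.4%, and depends only on the number of phases, consistent with the statement that it is fixed across PVT and the tuning range.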

Figure 12.3.3 shows the details of the RVCO, the 16-phase divide-by-2 divider, and their unit cells. The 8-phase RVCO generates eight signals at $F_{RVCO} = 2F_{IF}$, regardless of the data rate, and the following divide-by-2 stage creates 16-phase LO square waves at $F_{IF}$. This multi-phase generation scheme guarantees a 50% duty cycle and avoids RVCO frequency pulling by nearby circuits running at $F_{IF}$. RVCO frequency tuning is performed by controlling the gate voltages of the differential MOS varactors, $C_{VAR}$, while MOM capacitors, $C_{FIX}$, set the center frequency. The center frequency and $K_{RVCO}$ are made programmable by switching the varactor and fixed capacitor banks using switches $SW_{VAR}$ and $SW_{FIX}$, respectively, while $SW_{1-4}$ set the initial RVCO states. MOS resistors, $TG_{1,2}$, control the coupling strength between the differential signals and ensure oscillation. The divider is built as an 8b differential shift register with an optimized differential latch cell that minimizes the CLK-to-Q delay. The reset signal sets the initial divider states.
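The phase bookkeeping behind this scheme can be written out as a small timing model. It is a behavioural sketch under the assumption of ideal, evenly spaced RVCO phases and ideal differential divider outputs; it does not model the shift-register latches, and the function names are hypothetical.

```python
def rvco_frequency_hz(f_if_hz):
    """The RVCO runs at twice the IF when the loop is locked."""
    return 2.0 * f_if_hz

def divider_output_phases_deg(n_vco_phases=8):
    """LO phases at F_IF produced by dividing each RVCO phase by two.

    Adjacent RVCO phases are 360/n degrees apart at F_RVCO, which is half
    that angle at F_IF. Each differential divider output contributes a
    true phase and its complement 180 degrees away.
    """
    step = 360.0 / n_vco_phases / 2.0
    true_outputs = [k * step for k in range(n_vco_phases)]
    complements = [(p + 180.0) % 360.0 for p in true_outputs]
    return sorted(true_outputs + complements)

if __name__ == "__main__":
    print(rvco_frequency_hz(1.8e9) / 1e9, "GHz")
    print(divider_output_phases_deg())
```

Eight differential outputs therefore cover the full period in 22.5° steps, giving the 16 LO phases that drive the HRMs.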

A prototype coherent receiver is fabricated in a 28nm CMOS process. Even though the prototype is designed to operate up to 64Gb/s, all measurements were performed using 24Gb/s PRBS15 QPSK data upconverted to $F_{IF}$ due to equipment limitations; the maximum sampling rate of the available dual-channel arbitrary waveform generator (Tektronix AWG7122C) is 12GS/s/channel. The analog receiver power is 76.3mW, which translates to 3.2pJ/b, including the CPR and excluding the input clock and output data buffers. The measured RVCO output phase-noise plots and spectra under different operating conditions are shown in Fig. 12.3.4. The upper half shows the RVCO output phase noise when the CPR is locked to a 1.8GHz $F_{IF}$ with and without data modulation for three different bandwidth settings (i.e., 10, 50, and 100MHz). When the bandwidth is 100MHz, data-dependent errors degrade the residual PM noise from 0.8° to 4.9°, integrated from 100kHz to 1GHz, which translates to 2.5° after the feedback divider. To test the CPR frequency tracking range, $F_{IF}$ was modulated around 1.8GHz with an 11kHz sinusoid of ±300MHz amplitude. The bottom half of Fig. 12.3.4 shows the output spectra when the RVCO is locked without frequency modulation (left) and with frequency modulation (right). This emulates laser frequency jumps or drifts due to noise and temperature variations and shows the tracking ability the proposed receiver needs to suppress them.
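The headline figures quoted above follow from simple arithmetic, reproduced here as a sketch using only numbers stated in the text. The helper names are hypothetical, and the divider is assumed ideal, so that it halves phase deviations along with the frequency.

```python
import math

def energy_per_bit_pj(power_mw, data_rate_gbps):
    """mW divided by Gb/s gives pJ/b directly."""
    return power_mw / data_rate_gbps

def phase_after_divider_deg(phase_deg, divide_ratio=2):
    """An ideal frequency divider scales phase deviations by 1/ratio."""
    return phase_deg / divide_ratio

this_work_pj_per_bit = energy_per_bit_pj(76.3, 24.0)  # receiver power, data rate
improvement_vs_ref4 = 75.0 / 3.2                      # [4] versus this work
bandwidth_ratio_vs_ref1 = 100e6 / 100e3               # 100MHz versus ~100kHz in [1]
lo_phase_error_deg = phase_after_divider_deg(4.9)     # RVCO PM noise seen at F_IF

if __name__ == "__main__":
    print(f"{this_work_pj_per_bit:.2f} pJ/b")
    print(f"{improvement_vs_ref4:.1f}x better than [4]")
    print(f"10^{math.log10(bandwidth_ratio_vs_ref1):.0f} wider than [1]")
    print(f"{lo_phase_error_deg:.2f} deg after the divider")
```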

Figure 12.3.5 shows bathtub curves under three different conditions. In the top left plot, the receiver is operating at zero $F_{IF}$ with the CPR disabled, and the data is recovered by the two NRZ receivers clocked by a fixed clean clock. This sets the baseline for the receiver performance, where the error-free phase margin at $BER < 10^{-12}$ is 0.6UI. When the CPR is enabled, this margin reduces to 0.36UI due to uncompensated skews between the sampling clocks of the PD slicers and the data slicers and any I/Q noise injected through the loop. Most of this degradation is attributed to 20ps of fixed skew between the sampling clocks, since the CPR can achieve lock even in the presence of many errors made by the PD slicers. Lastly, the bathtub plot measured with $F_{IF}$ frequency modulation shows a drop in the phase margin due to fast phase variations that the CPR cannot fully correct. The steep rise in BER on the left side of the curve is due to phase slipping, where the CPR fails to track the phase and the constellation rotates by 90°, 180°, or 270°, which affects all the data slicer patterns instantly, whereas the degradation on the right is due to errors made by the data slicers. The receiver and CPR power consumption breakdown is shown in Fig. 12.3.5, and receiver performance is summarized and compared with recent prior art in Fig. 12.3.6. The die photo is shown in Fig. 12.3.7. The active area is 0.22mm<sup>2</sup>.
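The attribution of the margin loss to clock skew can be sanity-checked by expressing the 20ps skew in unit intervals. The sketch assumes that one UI equals the symbol period of the 12GBaud I and Q streams (24Gb/s QPSK at 2 bits per symbol); the function name is hypothetical.

```python
def skew_in_ui(skew_ps, baud_gbd):
    """Convert a timing skew to unit intervals at the given symbol rate."""
    ui_ps = 1e3 / baud_gbd  # symbol period in ps
    return skew_ps / ui_ps

baud_gbd = 24.0 / 2.0                 # assumed: 2 bits per QPSK symbol
skew_ui = skew_in_ui(20.0, baud_gbd)  # fixed skew between sampling clocks
margin_loss_ui = 0.6 - 0.36           # baseline margin minus CPR-enabled margin

if __name__ == "__main__":
    print(f"skew = {skew_ui:.2f} UI, margin loss = {margin_loss_ui:.2f} UI")
```

Under this assumption the skew alone corresponds to 0.24UI, the same size as the measured reduction in phase margin, which supports the statement that skew rather than PD slicer errors dominates the degradation.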

#### Acknowledgement:

This work was partly supported by the Andrew T. Yang Research Award.

#### References:

- [1] K. Sheng et al., "A 4.6pJ/b 200Gb/s Analog DP-QPSK Coherent Optical Receiver in 28nm CMOS," ISSCC, pp. 282-283, Feb. 2022.
- [2] L. A. Valenzuela et al., "A 50-GBaud QPSK Optical Receiver With a Phase/Frequency Detector for Energy-Efficient Intra-Data Center Interconnects," IEEE OJ-SSCS, vol. 2, pp. 50-60, Feb. 2022.
- [3] R. Ashok et al., "Analog Domain Carrier Phase Synchronization in Coherent Homodyne Data Center Interconnects," Journal of Lightwave Technology, vol. 39, no. 19, pp. 6204-6214, Oct. 2021.
- [4] M. Lu et al., "An Integrated 40 Gbit/s Optical Costas Receiver," Journal of Lightwave Technology, vol. 31, no. 13, pp. 2244-2253, July 2013.
- [5] L. Kazovsky, "Decision-driven phase-locked loop for optical homodyne receivers: Performance analysis and laser linewidth requirements," Journal of Lightwave Technology, vol. 3, no. 6, pp. 1238-1247, Dec. 1985.



Figure 12.3.1: Block diagram of the proposed coherent receiver.



Figure 12.3.2: Schematics of the CPR phase detector and 16-phase inverter-based I/Q HRMs, a single 8-phase mixing unit, and an illustration of higher-order harmonics generated by the HRM.



Figure 12.3.3: Circuit details of the RVCO and divide-by-2 multi-phase generator.



Figure 12.3.4: RVCO output phase-noise for different loop bandwidth settings with and without data modulation and RVCO output spectra with and without IF frequency modulation.



Figure 12.3.5: Measured bathtub curves for three different received signal scenarios and power breakdown.

|                                | This work        | ISSCC 2022 [1]   | OJSSCS 2022 [2] | JLT 2021 [3]      | JLT 2013 [4]    |
|--------------------------------|------------------|------------------|-----------------|-------------------|-----------------|
| Technology                     | 28nm CMOS        | 28nm CMOS        | 130nm SiGe HBT  | 130nm SiGe BiCMOS | 500nm HBT       |
| RX architecture                | Analog/Intradyne | Analog/Intradyne | Analog/Homodyne | Analog/Homodyne   | Analog/Homodyne |
| Tunable RX laser               | Not required     | Required         | Required        | Required          | Required        |
| Modulation                     | QPSK             | DP-QPSK          | QPSK            | DP-QPSK           | BPSK            |
| Data rate [Gb/s]               | 24               | 200              | 100             | 40                | 40              |
| CPR bandwidth [MHz]            | 10–100           | 0.1*             | 382**           | 8.3**             | 1,100           |
| CPR tracking range [MHz]       | 600              | 12               | Not reported    | 200               | 30,000          |
| Power consumption [mW]         | 76.8             | 920              | 534             | 412               | 3,000           |
| Energy efficiency [pJ/b]       | 3.2              | 4.6              | 5.34            | 10.3              | 75              |
| BER                            | 1.0E-12          | 1.0E-10          | 1.0E-12         | 1.1E-03           | 1.0E-12         |
| Active area [mm<sup>2</sup>]   | 0.22             | 0.06             | 2.8             | 0.07              | Not reported    |

\* Estimated from phase locking plot

\*\* Simulation results

Figure 12.3.6: Performance summary and comparison with prior works.



Figure 12.3.7: Die micrograph of the QPSK coherent optical receiver.