

## 4.2 A 60GHz 144-Element Phased-Array Transceiver with 51dBm Maximum EIRP and $\pm 60^\circ$ Beam Steering for Backhaul Application

Tirdad Sowlati, Saikat Sarkar, Bevin Perumana, Wei Liat Chan, Bagher Afshar, Michael Boers, Donghyup Shin, Timothy Mercer, Wei-Hong Chen, Anna Papio Toda, Alfred Grau Besoli, Seunghwan Yoon, Sissy Kyriazidou, Phil Yang, Vipin Aggarwal, Nooshin Vakilian, Dmitriy Rozenblit, Masoud Kahrizi, Joy Zhang, Alan Wang, Padmanava Sen, David Murphy, Mohyee Mikhemar, Ali Sajjadi, Alireza Mehrabani, Brima Ibrahim, Bo Pan, Kevin Juan, Shelley Xu, Claire Guan, Guy Geshwindman, Khim Low, Namik Kocaman, Hans Eberhart, Koji Kimura, Igor Elgorriaga, Vincent Roussel, Hongyu Xie, Leo Shi, Venkat Kodavati

Broadcom, Irvine, CA

The 802.11ad standard (WiGig) provides multi-Gb/s throughput over tens of meters and uses beamforming in the four 2GHz-wide channels in the 60GHz ISM band. Conventional backhaul solutions, on the other hand, use high-gain directional antennas with no electronic beam steering and incur high costs for antenna installation and alignment. This paper presents a full-featured 802.11ad chipset with a 144-element phased array using a tiled approach and CMOS IPs developed for WiGig to address low-cost municipal WiFi, small-cell backhauling, and last-mile broadband to the home. Designed with a link budget of 120dB+, the phased-array solution can be mounted on top of lamp posts and building rooftops covering a 200m LOS, and with tens of hops a range of 2km can be covered from a single point of fiber. Moreover, the steerable phased arrays enable dynamic routing optimization between hopping nodes in a mesh network.

A Master-Slave configuration for the 60GHz transceiver is used that expands a 12-element phased-array transceiver chip to a 144-element tiled phased-array chipset, Fig. 4.2.1. The Master-60GHz chip performs the up/downconversion from IF (10.56GHz) to RF (60GHz) and has twelve separate 60GHz I/Os, each driving a Slave-60GHz chip with twelve independently phase-controlled TX/RX paths. Each path is connected to two antennas, creating a 288-antenna-element phased-array solution. The Slave-60GHz uses the same IC as the Master-60GHz. However, it has a direct 60GHz interface port that bypasses the IF-to-RF up/downconversion, and the unused blocks (IF gain stages, RF-PLL, LOGen and the RF mixers) are powered down. The SOC chip includes a PCIe Gen2x1 interface, MAC, PHY, and BB-to-IF (10.56GHz) radio. The chips have integrated power management units to connect directly to a single 2.7V to 5.5V power supply.

The RF VCO is an NMOS cross-coupled LC oscillator that drives a multiply-by-four in LOGen to create the RF LO signal. The PA block, Fig. 4.2.2, is a four-stage differential wideband design with capacitive neutralization for achieving higher stability and Gmax. It has a gain of 30dB, OP<sub>1dB</sub> of +2dBm, PSAT of 6.5dBm with 30mA from a 0.9V supply. The LNA block is a four-stage single-ended wideband cascode design. It has 30dB gain with a NF of 4dB at midband and 4.5dB across all four channels, OP<sub>1dB</sub> of -3dBm with 18mA from 0.9V. The TX and RX phase shifters are based on passive vector summation and provide a 6° phase-shift step with 11dB loss. The T/R switch uses triple-well transistors with large resistors on the gate and substrate connections, and has negligible nonlinearity by having only ON switches in TX mode.
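To illustrate what a 6° phase-shift step means for beam steering, the following Python sketch computes the ideal progressive phase between adjacent elements for a given steering angle and rounds it to the nearest 6° state. It is a textbook uniform-linear-array calculation, not the chipset's actual beam-table generation; the function names are hypothetical, and the example pitch (2.35mm) and frequency (60GHz) are taken from the antenna description in this paper.

```python
import math

C_MM_PER_S = 299_792_458e3   # speed of light in mm/s
STEP_DEG = 6.0               # phase-shifter resolution quoted in the text

def progressive_phase_deg(pitch_mm, freq_hz, steer_deg):
    """Ideal phase increment between adjacent elements of a uniform linear array."""
    wavelength_mm = C_MM_PER_S / freq_hz
    return 360.0 * pitch_mm / wavelength_mm * math.sin(math.radians(steer_deg))

def quantize_deg(phase_deg, step_deg=STEP_DEG):
    """Round a phase to the nearest available shifter state, wrapped to [0, 360)."""
    return (round(phase_deg / step_deg) * step_deg) % 360.0

def element_phases(n_elements, pitch_mm, freq_hz, steer_deg):
    """Quantized phase setting for each element along one row (illustrative)."""
    dphi = progressive_phase_deg(pitch_mm, freq_hz, steer_deg)
    return [quantize_deg(k * dphi) for k in range(n_elements)]

n_states = int(360.0 / STEP_DEG)        # number of distinct phase states
worst_case_error_deg = STEP_DEG / 2.0   # maximum rounding error per element
dphi_60 = progressive_phase_deg(2.35, 60e9, 60.0)
phases = element_phases(18, 2.35, 60e9, 60.0)
```

With a 6° step there are 60 phase states and the per-element rounding error is at most 3°, which is small compared with the roughly 147° element-to-element increment needed at the 60° scan edge.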

The antenna array is designed on an LTCC substrate with dielectric constant of 6, loss tangent of 0.002, and insertion loss of 0.8dB/cm at 60GHz. The unit antenna cell has a main microstrip patch that is parasitically loaded by four auxiliary patches (forming an H-shape) to broaden the bandwidth of the antenna, Fig. 4.2.3a. A stripline feed is used to excite a slot that couples to the radiating patch above it. This aperture coupling feeding mechanism avoids using any vias to connect the distribution feed-line at the lower metal layers to the top radiator. A single tile of 48 elements (8 rows by 6 columns) is designed with vertical and horizontal element spacing of 2.8mm and 2.35mm, respectively, giving a tile size of 22.4mm×14.1mm shown in Fig. 4.2.3b. A single RF port drives two vertically aligned antenna patches via a 2-way divider/combiner realized in the LTCC substrate. Two Slave-60GHz WLCSP chips are mounted on the back of the LTCC substrate, Fig. 4.2.3c, and their 60GHz I/O ports are routed to the antenna array. Distributed matching circuits are used in the transitions to maintain low reflection. Ground-via-fencing of the transitions and antenna feed layers is used across the array to suppress the excitation of substrate modes, reduce transition losses and avoid sharp resonances over the wide frequency band.

Six of these tiles are placed together with the same uniform spacing between elements on the boundary of adjacent tiles to form a larger array of 16 rows by 18 columns with a total array area of 44.8mm×42.3mm, Fig. 4.2.3d. The tiles are mounted on an organic laminate PCB with a 60GHz routing loss of 1.2dB/cm. Twelve Slave-60GHz WLCSP chips are routed to the Master-60GHz FCBGA chip soldered on the back of the PCB with a loss of <15dB in the signal path, Fig. 4.2.3e. The systematic gain and phase offsets between the Slave-60GHz chips are compensated in the Master-60GHz chip. The daisy-chain LVDS routings from the Master to the Slave chips for the control signaling are done on the PCB. A single coax cable connects the Master-60GHz chip to the SOC. The programming and triggering of all the Slave-60GHz chips are synchronized and the six tiled arrays operate as a single large array. A return loss <-10dB across 57 to 66GHz is satisfied over the targeted beam scanning specification of $\pm 60^\circ$ in azimuth and $\pm 10^\circ$ in elevation.
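The tiling arithmetic above can be cross-checked with a short Python sketch. It recomputes the tile and array footprints and the element, RF-path and chip counts from the quoted pitches, assuming each element occupies exactly one pitch cell (which is how the quoted dimensions appear to be derived). All names are illustrative.

```python
ROW_PITCH_MM = 2.8     # vertical element spacing
COL_PITCH_MM = 2.35    # horizontal element spacing
TILE_ROWS, TILE_COLS = 8, 6
ARRAY_ROWS, ARRAY_COLS = 16, 18
PATHS_PER_CHIP = 12    # TX/RX paths per Slave-60GHz chip
PATCHES_PER_PATH = 2   # one RF port feeds two vertically aligned patches

def extent_mm(rows, cols):
    """Footprint (height, width), assuming one pitch cell per element."""
    return (round(rows * ROW_PITCH_MM, 2), round(cols * COL_PITCH_MM, 2))

tile_elements = TILE_ROWS * TILE_COLS            # patches per tile
array_elements = ARRAY_ROWS * ARRAY_COLS         # patches in the full array
tiles = array_elements // tile_elements          # tiles needed
rf_paths = array_elements // PATCHES_PER_PATH    # independently phased paths
slave_chips = rf_paths // PATHS_PER_CHIP         # Slave-60GHz chips in total
chips_per_tile = slave_chips // tiles

tile_size = extent_mm(TILE_ROWS, TILE_COLS)
array_size = extent_mm(ARRAY_ROWS, ARRAY_COLS)
```

The results reproduce the figures in the text: 48 patches per 22.4mm×14.1mm tile, six tiles, 288 patches and 144 phased paths over 44.8mm×42.3mm, driven by twelve Slave chips at two per tile.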

The SOC chip performs internal calibration of IQ mismatch, LOFT, DC offset, and frequency response in both TX and RX using on-die IF-TSSI and multiple loopback paths. The 60GHz chip uses a separate IF test port, RF-TSSI, IF-RSSI and loopbacks to perform TX and RX frequency response calibration and TX/RX RF functionality tests without external 60GHz equipment. Additionally, the SOC and 60GHz chips can perform a combined TX/RX frequency response and TX gain calibration using CW tone generation, RF-TSSI and a digital engine in the SOC. The RX AGC gain calibration is performed using thermal noise at the 60GHz input ports, amplified through the chain and digitized by the ADC.

The SOC chip was fabricated in a 28nm CMOS process and the 60GHz chips were fabricated in a 40nm CMOS process. The PN measured at 60GHz TX output was -100dBc/Hz at 1MHz offset, Fig. 4.2.4a. The TX integrated PN from 1MHz to 1GHz was <-34dBc in all four channels. The measured MCS12 constellation (SC-16QAM 4.6Gb/s) for the tiled mode TX at EVM floor of <-24dB is shown in Fig. 4.2.4b. The measured PSAT and OP<sub>1dB</sub> for the full transmitter at chip port were +5dBm and 0dBm, respectively, with MCS12 EVM of -22dB at -2dBm output power. Measured NF for the full receiver at chip port was 6.5dB, with MCS12 sensitivity of -56dBm at PER of 10<sup>-2</sup> for a single path (no array gain). The measured EVM versus P<sub>out</sub>, and PER versus P<sub>in</sub> for a single TX/RX path at chip port are shown in Figs. 4.2.4c and 4.2.4d, respectively. The tiled phased array in Fig. 4.2.3 had a de-embedded measured gain of 45dB in TX mode and 23dB gain in RX mode, 1dB to 1.5dB lower than the ideal array gain and spatial power-combining gain. With antenna unit element gain of 5dB and transition/routing loss of 3.5dB, a total TX plus RX phased-array gain of 71dB was obtained. The tiled phased-array solution provided a link budget of 123.5dB for MCS12 considering the -2dBm P<sub>out</sub> and 6.5dB NF at chip port, total TX plus RX phased-array gain of 71dB, thermal noise of -81.5dBm, and SNR of 20.5dB for PER of 10<sup>-5</sup>. After calibration, the transmitter power variation at -22dB EVM and the RX sensitivity variation were both within  $\pm 1.5$ dB over PVT. The worst-case PN at 60GHz degraded by less than 2dB over PVT. A maximum EIRP of 51dBm was measured in the full array at broadside. The phased-array solution incorporated a complete 802.11ad feature set including Sector Level Sweep, Beam Refinement and Beam Tracking. 
Figure 4.2.5 shows the measured co- and cross-polarization beam-steering radiation patterns for one tile array for  $\pm 60^\circ$  in Azimuth and  $\pm 10^\circ$  in Elevation. Figure 4.2.6 shows the performance comparison summary [1-5]. The die photos for the SOC and the 60GHz ICs are shown in Fig. 4.2.7.
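The link-budget figures quoted above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the 71dB total gain is read as the TX and RX array gains plus, on each side, the element gain minus the transition/routing loss; the thermal noise is taken as kTB at -174dBm/Hz over the 1.76GHz channel bandwidth listed in the comparison table; and the EIRP line is only a rough estimate from the chip-port PSAT. The function name is illustrative.

```python
import math

def link_budget_db(p_out_dbm, total_array_gain_db, thermal_noise_dbm, nf_db, snr_db):
    """Allowable path loss: radiated-side power plus gain minus RX sensitivity."""
    sensitivity_dbm = thermal_noise_dbm + nf_db + snr_db
    return p_out_dbm + total_array_gain_db - sensitivity_dbm

TX_ARRAY_GAIN_DB = 45.0   # de-embedded, measured
RX_ARRAY_GAIN_DB = 23.0   # de-embedded, measured
ELEMENT_GAIN_DB = 5.0
ROUTING_LOSS_DB = 3.5

# Assumed decomposition of the quoted 71dB total TX plus RX gain
total_gain_db = (TX_ARRAY_GAIN_DB + RX_ARRAY_GAIN_DB
                 + 2.0 * (ELEMENT_GAIN_DB - ROUTING_LOSS_DB))

# Thermal noise floor, assuming kTB over a 1.76GHz channel
thermal_noise_dbm = -174.0 + 10.0 * math.log10(1.76e9)

budget_db = link_budget_db(-2.0, total_gain_db, -81.5, 6.5, 20.5)

# Rough EIRP estimate from the +5dBm chip-port PSAT (same assumptions)
eirp_estimate_dbm = 5.0 + TX_ARRAY_GAIN_DB + ELEMENT_GAIN_DB - ROUTING_LOSS_DB
```

Under these assumptions the sketch returns 71dB of total gain, a noise floor of about -81.5dBm, and a 123.5dB link budget, and the EIRP estimate of 51.5dBm is close to the measured 51dBm.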

### Acknowledgements:

The authors would like to thank the SOC, CAD, Modeling, Layout, DVT, Packaging, PMU and Software teams for their contributions.

### References:

- [1] M. Boers, et al., "A 16TX/16RX 60GHz 802.11ad Chipset with Single Coaxial Interface and Polarization Diversity," *ISSCC*, pp.344-345, Feb. 2014.
- [2] E. Cohen, et al., "A CMOS Bidirectional 32-Element Phased-Array Transceiver at 60 GHz With LTCC Antenna," *IEEE TMTT*, pp. 1359-1375, March 2013.
- [3] B. Sadhu, et al., "A 28GHz 32-Element Phased-Array Transceiver IC with Concurrent Dual Polarized Beams and 1.4 Degree Beam-Steering Resolution for 5G Communication," *ISSCC*, pp.128-129, Feb. 2017.
- [4] K. Kibaroglu, et al., "An Ultra Low-Cost 32-Element 28 GHz Phased-Array Transceiver with 41 dBm EIRP and 1.0-1.6 Gbps 16-QAM Link at 300 Meters," *IEEE RFIC*, pp. 73-76, 2017.
- [5] S. Zihir, et al., "60-GHz 64- and 256-Elements Wafer-Scale Phased-Array Transmitters Using Full-Reticle and Subreticle Stitching Techniques," *IEEE TMTT*, pp. 4701-4719, Dec. 2016.



Figure 4.2.1: Block diagram of the tiled phased array.



Figure 4.2.2: 60GHz front-end circuits.



Figure 4.2.3: 144-element tiled phased-array transceiver.



Figure 4.2.4: a) Phase Noise b) Constellation c) TX EVM d) RX PER measured at 60GHz at chip port.



Figure 4.2.5: Normalized radiation pattern measured for one tiled array.

| Reference               | [1]                               | [2]                             | [3]                              | [4]                             | [5]                              | This work                        |
|-------------------------|-----------------------------------|---------------------------------|----------------------------------|---------------------------------|----------------------------------|----------------------------------|
| Process                 | 40nm CMOS                         | 90nm CMOS                       | 0.13μm SiGe BiCMOS               | 0.13μm SiGe BiCMOS              | 0.13μm SiGe BiCMOS               | 28nm / 40nm CMOS                 |
| Frequency               | 57-66 GHz                         | 57-62 GHz                       | 28 GHz                           | 28-32 GHz                       | 58-64 GHz                        | 57-66 GHz                        |
| Level of Integration    | w/ PCIe Interface                 | TRX with IF input at 12GHz      | TRX with IF input at 3GHz        | TRX with 28GHz input            | TX only w/ 60GHz input           | MAC/PHY/TRX w/ PCIe Interface    |
| Modulation / Channel BW | 16QAM / 1.76GHz                   | 16QAM / 1.76GHz                 | N/A                              | 16QAM / 400MHz                  | 16QAM / 1.76GHz                  | 16QAM / 1.76GHz                  |
| TX Chain OP1dB/Element  | 0dBm                              | 0dBm                            | 14dBm                            | 10dBm                           | 0dBm                             | 0dBm                             |
| RX Chain NF             | < 10dB                            | < 10dB                          | 6.6dB                            | 4.6dB (Front End only)          | No RX                            | < 7dB                            |
| PN @1MHz Offset         | -94dBc/Hz @60GHz                  | -93dBc/Hz @46GHz                | Ext. LO                          | Ext. LO                         | Ext. LO                          | -100dBc/Hz @60GHz                |
| No. of Antenna Elements | 16                                | 32                              | 32                               | 32                              | 256                              | 288                              |
| Max EIRP                | 29dBm                             | 32dBm                           | 28dBm/IC                         | 41dBm                           | 45dBm                            | 51dBm                            |
| Power Consumption       | 1.2W in TX<br>0.96W in RX         | 1.2W in TX<br>0.85W in RX       | 4.6W in TX / IC                  | 6.4W in TX<br>4.2W in RX        | 32W in TX<br>No RX               | 8.4W in TX<br>6.6W in RX         |
| Total Si Area           | 32.5 mm <sup>2</sup> (Full Radio) | 29 mm <sup>2</sup> (IF-RF only) | 166 mm <sup>2</sup> (IF-RF only) | 94 mm <sup>2</sup> (IF-RF only) | 1740mm <sup>2</sup> (RF-RF Only) | 292 mm <sup>2</sup> (Full Radio) |

Figure 4.2.6: Performance comparison summary.



Figure 4.2.7: Die micrographs: a) SOC IC b) 60GHz IC.

## 4.3 A 23-to-30GHz Hybrid Beamforming MIMO Receiver Array with Closed-Loop Multistage Front-End Beamformers for Full-FoV Dynamic and Autonomous Unknown Signal Tracking and Blocker Rejection

Min-Yu Huang, Taiyun Chi, Fei Wang, Tso-Wei Li, Hua Wang

Georgia Institute of Technology, Atlanta, GA

Millimeter-wave massive MIMOs leverage large array size to enhance the link budget and spatial selectivity, but their resulting narrow beamwidth substantially complicates the transmitter-receiver (TX-RX) alignment. Unlike most existing “static” applications (e.g., mm-wave HDTV transmission), many future mm-wave links will operate in highly “dynamic” environments, such as wireless AR/VR and vehicle-/drone-/machine-based links, necessitating rapid and precise beamforming/-tracking for high link reliability and low latency. Densely deployed mm-wave nodes will also result in future congested/contested environments, requiring spatially tracking/rejecting unknown blockers (unknown frequency, angle-of-arrival AoA, or modulation).

Most existing RF/analog beamformers (BFs) are open-loop circuits *per se*, which require phase control signals from extensive baseband computation, while digital beamforming relies even more on the baseband DSP. A recent mm-wave link in an almost idealistic setting requires 45ms for beamforming and cannot meet a <1ms 5G latency target [1]. Self-steering arrays (SSA) perform closed-loop and rapid front-end beamforming without DSP. However, existing SSAs mostly use PLL-/coupled-oscillator-architectures that are inherently narrowband with limited Field-of-View (FoV) and cannot support multibeams or blocker suppression. In parallel, although array-based spatial filtering is extensively studied [2,3], most are open-loop circuits, whose notch syntheses require prior knowledge of the blockers or complete phase/amplitude controls from DSP.

We present a broadband scalable full-FoV MIMO RX array with hybrid beamforming using mm-wave/IF front-end SSA BFs and baseband digital beamforming (Fig. 4.3.1). The SSA BFs achieve DSP-free beamforming with DLL-like phase-domain negative feedback loops to cover broad bandwidth and full FoV. They are cascadable for rapid yet accurate multibeam operations to reject unknown blockers and align desired signals. The digital beamforming is for fine beam alignment. A proof-of-concept 8-element RX array chip includes two 4-element SSA unit arrays, scalable for massive MIMOs (Fig. 4.3.1). Each SSA unit array is composed of two parallel mm-wave SSA BFs (1<sup>st</sup> stage) and one IF SSA BF (2<sup>nd</sup> stage). The 1<sup>st</sup>-stage mm-wave BF contains two signal-path I/Q phase shifters with continuous tuning, a mm-wave power-aware phase detector (PD) as a 90° coupler followed by VGAs and voltage rectifiers, and an on-chip combiner/subtractor. The 2<sup>nd</sup>-stage IF BF has two downconversion mixers, IF amplifiers, an IF PD, two LO-path I/Q continuously tunable phase shifters, and an off-chip combiner/subtractor. This paper focuses on the design and performance of the SSA front-end BFs.

Next, the DLL-like operation of the SSA BF stages is explained. Based on the incident angle, the received signals in the adjacent two paths exhibit phase-shifts, which are then detected by the PD to generate corresponding differential DC control voltages that are fed back to adjust the phase shifters. This creates a phase-domain negative feedback loop, which, like a DLL, autonomously phase-shifts the two signal paths and equalizes the phase difference of their outputs [4]. Then, if the output combiner (or subtractor) is selected, the BF will perform constructive beamforming of the desired signal (or spatial notching of the blocker). Although blocker/signal classification needs one-step DSP demodulation, the SSA front-end BF does not require DSP for beam scanning or computation, drastically accelerating the beamforming. Notably, when multiple signals are received concurrently, the power-aware PD output is dominated by the signal with the largest power due to its nonlinear rectification, so that each BF stage only responds to the strongest blocker or the desired signal it receives. The amplifiers in the feedback ensure a large loop gain and a near-zero path-to-path residual phase error even for end-fire receiving to enable full FoV coverage [4]. Unlike existing PLL-/oscillator-based SSAs, our DLL-like SSA BFs do not require resonators/oscillators and are intrinsically broadband, and they can be cascaded to process concurrent multiple signals/blockers.
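The behavior of such a phase-domain negative feedback loop can be illustrated with a heavily idealized numerical model. The Python sketch below is not the authors' circuit: it assumes a linear phase detector, unit-amplitude signals, a single dominant tone, and a simple proportional loop with gain `loop_gain`, for which the settled residual phase error is the input phase difference divided by (1 + loop gain). All names and the gain value are hypothetical.

```python
import cmath
import math

def settle_loop(delta_in_deg, loop_gain=100.0, steps=200):
    """Iterate an idealized proportional phase loop; return the residual error (deg)."""
    relax = 0.5 / (1.0 + loop_gain)    # relaxation factor chosen so the iteration is stable
    theta = 0.0                        # phase applied by the phase shifter
    for _ in range(steps):
        error = delta_in_deg + theta   # phase difference seen by the (linear) PD
        target = -loop_gain * error    # amplified PD output drives the shifter
        theta += relax * (target - theta)
    return delta_in_deg + theta

def output_magnitude(residual_deg, subtract=False):
    """Normalized output of the combiner (or subtractor) for two unit-amplitude paths."""
    path_a = 1.0
    path_b = cmath.exp(1j * math.radians(residual_deg))
    out = (path_a - path_b) if subtract else (path_a + path_b)
    return abs(out) / 2.0

residual = settle_loop(150.0)                        # large input phase offset
beam = output_magnitude(residual)                    # constructive beamforming
notch = output_magnitude(residual, subtract=True)    # spatial notch
notch_db = 20.0 * math.log10(notch)
```

In this toy model a loop gain of 100 leaves about 1.5° of residual error for a 150° input offset, so the combiner output is essentially full scale while the subtractor output falls by more than 30dB. This mirrors the qualitative point that a large loop gain yields both good beamforming and a deep notch.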

The 2-stage front-end SSA BFs support various operation modes (Fig. 4.3.2). Mode I: The 1<sup>st</sup> and 2<sup>nd</sup> BFs both use output combiners. The RX array serves as an 8-element SSA per chip and autonomously tracks one desired signal by beamforming over full FoV. Mode II: The 1<sup>st</sup> BF uses subtractors, and the 2<sup>nd</sup> BF uses combiners. The RX array first creates a spatial notch to autonomously reject one unknown in-band blocker, and then performs beamforming for one desired signal. The power-aware PD ensures that the 1<sup>st</sup> BF only tracks and cancels the blocker, not the desired signal. Mode III: The 1<sup>st</sup> and 2<sup>nd</sup> BFs both use subtractors. The RX array can reject one strong in-band blocker twice by a deep spatial notch (Mode III-A), or it can create two independent spatial notches to reject two in-band blockers (Mode III-B). In Modes II and III, spatial filtering largely suppresses the in-band blockers, relaxes the following RX dynamic range (e.g., ADC), and enables digital beamforming.

An 8-element 23-to-30GHz scalable RX array was implemented in a 0.13μm SiGe BiCMOS process (Fig. 4.3.7). The measurements were all based on autonomous SSA beamforming/notching with no phase/amplitude controls from DSP except mode selection.

The 1<sup>st</sup> and 2<sup>nd</sup> SSA BFs were first tested individually (Fig. 4.3.3). Over 23 to 30GHz and full FoV, the 1<sup>st</sup> SSA BF achieved a flat normalized array factor >-0.6dB for the desired signal beamforming and a 20-to-41dB spatial notch for blocker rejection. The 2<sup>nd</sup> SSA BF also achieved a flat normalized array factor >-0.53dB for desired signal beamforming and a 21-to-36dB spatial notch for blocker rejection over full FoV and a wide 0.1-to-4GHz IF range. The full FoV and wide bandwidth enable autonomous beamforming of in-band signals or cancellation of co-channel blockers, even if their carrier frequency, AoA, and modulations are unknown. The response time of each closed-loop SSA BF stage was <1μs over full FoV, which ensures rapid beam-forming/-tracking in dynamic low-latency applications and is 100 to 1000× faster than existing DSP-controlled BFs. The equivalent single-element double-sideband noise figure ( $NF_{DSB,eq}$ ) [2] was 4.2 to 6.3dB over 23 to 30GHz.

The 8-element RX array was then tested for various operation modes (Fig. 4.3.4). In Mode I, the 4-element SSA unit showed a flat normalized array factor >-0.8dB over full FoV. A high-quality 8-element SSA hybrid beamforming was further achieved via digital beamforming. In Mode II, one desired signal and one blocker were concurrently injected; with the 2-stage SSA BFs both turned on, the 1<sup>st</sup> SSA BF suppressed the blocker, and the 2<sup>nd</sup> SSA BF performed beamforming on the signal. The measured blocker-free RX  $P_{in,1dB}$  was -25dBm/element, while a -15dBm blocker ( $P_{in,1dB}+10dB$ ) degraded the RX  $P_{in,1dB}$  to <-48dBm. After the 1<sup>st</sup> BF stage was enabled to suppress the blocker, the RX  $P_{in,1dB}$  was largely restored over a wide FoV, except when the signal and blocker had similar incident angles, and both were suppressed by the 1<sup>st</sup> BF spatial notch. In Mode III-A, one strong blocker and one desired signal were concurrently injected, and the 2-stage SSA BFs both locked to the blocker and suppressed it twice. For different blocker cases (e.g., -40° and 53° incidence), a deep spatial notch with maximum 54dB rejection was achieved. In Mode III-B, two medium-power blockers and one desired signal were concurrently injected; the 2-stage SSA BFs sequentially suppressed the blockers with two independent spatial notches of maximum 40dB rejection.

The RX array (Mode-II) was also tested under a wideband modulated co-channel blocker (-36dBm) and desired signal (-46dBm) with no digital beamforming. In Fig. 4.3.5, after enabling the 2-stage SSA BFs for autonomous blocker rejection and signal beamforming, the desired signal was autonomously beamformed in the 2<sup>nd</sup> stage BF and successfully demodulated, showing -27.2dB EVM for 3Gb/s 64QAM and -33.9dB EVM for 0.8Gb/s 256QAM. The desired signal was then swept for its incident angle and its frequency offset from the blocker, demonstrating a clear spatial filtering effect on the blocker signal. The RX array also supports 6Gb/s 64QAM and 1.6Gb/s 256QAM as the state-of-the-art demonstrated modulation scheme and data-rate for wideband modulated blocker rejection and signal beamforming. In conclusion, the MIMO RX array achieves (1) autonomous rejection of unknown blockers and beamforming on unknown desired signals, (2) 64-/256-QAM co-channel blocker rejection and desired signal beamforming both with Gb/s wideband modulation, and (3) fast response time <1μs per BF stage, advancing the state of the art (Fig. 4.3.6).
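The EVM values are quoted in dB in the text and in percent in the comparison table (Fig. 4.3.6). The two are related by the standard conversion EVM(dB) = 20·log10(EVM(%)/100), which the following Python helper sketches; the function names are illustrative.

```python
import math

def evm_percent_to_db(evm_percent):
    """Convert RMS EVM from percent to dB."""
    return 20.0 * math.log10(evm_percent / 100.0)

def evm_db_to_percent(evm_db):
    """Convert RMS EVM from dB to percent."""
    return 100.0 * 10.0 ** (evm_db / 20.0)

evm_3g_64qam_db = evm_percent_to_db(4.39)    # table entry for 3Gb/s 64-QAM
evm_256qam_pct = evm_db_to_percent(-33.9)    # text value for 0.8Gb/s 256QAM
```

The 4.39% table entry for 3Gb/s 64-QAM converts to about -27.2dB, matching the value in the text, and -33.9dB corresponds to roughly 2% EVM.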

### Acknowledgements:

The authors would like to thank GlobalFoundries for chip fabrication.

### References:

- [1] W. Roh, et al., “Millimeter-Wave Beamforming as An Enabling Technology for 5G Cellular Communications: Theoretical Feasibility and Prototype Results,” *IEEE Commun. Mag.*, vol. 52, no. 2, pp. 106-113, Feb. 2014.
- [2] L. Zhang, et al., “A 0.1-to-3.1GHz 4-Element MIMO Receiver Array Supporting Analog/RF Arbitrary Spatial Filtering,” *ISSCC*, pp. 410-412, Feb 2017.
- [3] S. Jain, et al., “A 10GHz CMOS RX Frontend with Spatial Cancellation of Co-Channel Interferers for MIMO/Digital Beamforming Arrays,” *IEEE RFIC*, pp. 99-102, 2016.
- [4] M. Huang, et al., “An All-Passive Negative Feedback Network for Broadband and Wide Field-of-View Self-Steering Beam-Forming with Zero DC Power Consumption,” *IEEE JSSC*, vol. 52, no. 5, pp. 1260-1273, May 2017.



Figure 4.3.1: System architecture of the 8-element full-FoV MIMO RX array with hybrid beamforming using mm-wave/IF front-end SSA beamformers and baseband digital beamforming.



Figure 4.3.3: Measured wideband and full-FoV autonomous desired signal beamforming and blocker rejection in the 1<sup>st</sup> mm-wave and 2<sup>nd</sup> IF SSA BFs, dynamic response time over full FoV, and $NF_{DSB,eq}$ with the 2-stage SSA BFs both turned on.



Figure 4.3.5: Mode-II demonstration for autonomous blocker rejection and desired signal beamforming when the blocker and desired signal are both broadband modulated at the same scheme and speed. After the 2-stage SSA BFs are enabled, the desired signal is successfully demodulated, showing autonomous spatial cancellation of co-channel blocker.



Figure 4.3.2: Various operation modes of the RX array by reconfiguring the output combiners or subtractors in the 1<sup>st</sup> and 2<sup>nd</sup> SSA front-end BF stages.



Figure 4.3.4: Measured operation of the receiver array in various modes, including: Mode I (an 8-element hybrid beamformer), Mode II (the RX $P_{in,1dB}$ with the in-band blocker), Mode III-A (one deep notch), and Mode III-B (two independent notches).

|                                               | Spatial Notch Array RX |                    | SSA RX                    |                    | Mm-Wave Beam-Forming Array RX |                      | This Work                                                              |
|-----------------------------------------------|------------------------|--------------------|---------------------------|--------------------|-------------------------------|----------------------|------------------------------------------------------------------------|
|                                               | L. Zhang<br>ISSCC 17   | S. Jain<br>RFIC'16 | M. Huang<br>JSSC 17       | A. Gupta<br>TMTT14 | W. Roh<br>Commun.<br>Mag. 14  | B. Sadhu<br>ISSCC'17 |                                                                        |
| Technology                                    | 65nm CMOS              | 65nm CMOS          | 130nm CMOS                | 45nm SOI           | NR                            | 130nm SiGe BiCMOS    | 130nm SiGe BiCMOS                                                      |
| Frequency (GHz)                               | 0.1 – 3.1              | 10                 | 4 – 5.68                  | 7.4 – 9.4          | 28                            | 28                   | 23 – 30                                                                |
| Element No./Chip                              | 4                      | 4                  | 4                         | 4                  | 32                            | 32                   | 8                                                                      |
| Blocker Rejection                             | Open Loop with DSP     | Open Loop with DSP | No                        | No                 | No                            | No                   | Closed Loop with SSA front-end BFs                                     |
| Beam-Forming                                  | Open Loop with DSP     | Open Loop with DSP | Closed Loop SSA           | Closed Loop SSA    | Open Loop with DSP            | Open Loop with DSP   | Closed Loop with SSA front-end BFs                                     |
| Single-Element Conversion Gain (dB)           | 43                     | 14                 | -8                        | NR                 |                               | 34                   | 33                                                                     |
| $NF_{DSB,eq}$ (dB) <sup>1</sup>               | 3.4 – 5.8              | 9.5                | NR                        | NR                 |                               | 6                    | 4.2 – 6.3                                                              |
| Spatial Blocker Suppression (dB)              | 56                     | 32                 | No                        | No                 | No                            | No                   | 41 in Mode II and 54 in Mode III-A                                     |
| Blocker Modulation Scheme                     | CW                     | CW <sup>2</sup>    | 200Mbps QPSK <sup>3</sup> | No                 | No                            | No                   | 6Gb/s 64-QAM and 1.6Gb/s 256-QAM <sup>4</sup>                          |
| Desired Signal Modulation Scheme              | 2Mbps QPSK             | 200Mbps QPSK       | NR                        | NR                 | 2.1Gb/s 16QAM                 | NR                   | 6Gb/s 64-QAM and 1.6Gb/s 256-QAM <sup>4</sup>                          |
| EVM of Desired Signal after Blocker Rejection | 2Mbps QPSK 20.5%       | NR                 | NR                        | No                 | No                            | No                   | 0.6Gb/s 64-QAM – 2.16%<br>3Gb/s 64-QAM – 4.39%<br>6Gb/s 64-QAM – 5.62% |
| Response Time                                 | NR                     | NR                 | 3ms                       | NR                 | 45ms                          | NR                   | < 1μs per Beamformer Stage                                             |
| Power Consumption per Element (mW)            | 28.5 – 36.75           | 36.25              | 0                         | 35.75              | NR                            | 206                  | 70                                                                     |
| Area (mm <sup>2</sup> )                       | 2.25                   | 3.8                | 4.1                       | 3.5                |                               | 165.9 <sup>5</sup>   | 21.6                                                                   |

NR: Not reported. <sup>1</sup> Equivalent single-element $NF_{DSB,eq}$ = $NF_{DSB}$ (measured with single-element excitation and output-side beamforming) - 10log(*n*), where *n* is the number of elements. <sup>2</sup> The modulation test is based on Mode II operation without any baseband digital beamforming. <sup>3</sup> $P_{blocker} = -36dBm$ and $P_{desired\_signal} = -46dBm$. <sup>4</sup> Based on the RFIC'16 presentation slides, it presents CW blocker (when $P_{blocker} > P_{desired\_signal}$) and modulated blocker (when $P_{blocker} = P_{desired\_signal}$). <sup>5</sup> The area includes its transmitter design.

Figure 4.3.6: Comparison with the state-of-the-art.



Figure 4.3.7: Die micrograph.

## 4.4 A 28GHz Bulk-CMOS Dual-Polarization Phased-Array Transceiver with 24 Channels for 5G User and Basestation Equipment

J. D. Dunworth<sup>1</sup>, A. Homayoun<sup>1</sup>, B-H. Ku<sup>1</sup>, Y-C. Ou<sup>1</sup>, K. Chakraborty<sup>1</sup>, G. Liu<sup>1</sup>, T. Segoria<sup>1</sup>, J. Lerdworatawee<sup>1</sup>, J. W. Park<sup>1</sup>, H-C. Park<sup>2</sup>, H. Hedayati<sup>3</sup>, D. Lu<sup>1</sup>, P. Monat<sup>1</sup>, K. Douglas<sup>1</sup>, V. Aparin<sup>1</sup>

<sup>1</sup>Qualcomm, San Diego, CA

<sup>2</sup>now with Samsung Electronics, Suwon, Korea

<sup>3</sup>now with Atlazo, San Diego, CA

Developing next-generation cellular technology (5G) in the mm-wave bands will require low-cost phased-array transceivers [1]. Even with the benefit of beamforming, space constraints in the mobile form factor make it important to increase TX output power while maintaining acceptable PA PAE, LNA NF, and overall transceiver power consumption, both to maximize the allowable path loss in the link budget and to minimize handset case temperature. Further, the phased-array transceiver will need to support dual-polarization communication. An IF interface to the analog baseband is desired for low power consumption in the handset or user equipment (UE) active antenna and to enable use of arrays of transceivers for customer premises equipment (CPE) or basestation (BS) antenna arrays with a low-loss IF power-combining/splitting network implemented on an antenna backplane carrying multiple tiled antenna modules.

Recent publications have demonstrated single-polarization RFICs for smart phones in 28nm bulk CMOS [2], dual-polarization 28GHz phased arrays for picocells in 0.13μm SiGe [3], and single-polarization RFICs with RF interfaces for large-scale arrays in 0.18μm SiGe and 0.13μm SiGe [4,5]. The RFIC reported here significantly reduces die size and power consumption per antenna element compared to the SiGe solutions, and improves TX $P_{out}$ and NF in similar area per antenna element compared to the CMOS solution. The RFIC supports switching or combining among three 4-channel sub-array groups in each polarization, enabling both smart-phone and small-cell applications to be served with a single RFIC design.

Figure 4.4.1 shows the IC architecture and two antenna modules into which the IC is integrated. The UE module uses the IC in pairs, to enable testing of different UE antenna arrays such as 1x4 dipole, 1x4 patch, 2x2 patch and 2x4 patch. In the BS array tile, a 4x4 patch array is active with 2 rows of dummy patches on one edge. The signal paths are fully differential except for the two-stage LNA and the TX IF VGAs. The IC is divided into six groups of 4-channel sub-arrays, one for each polarization on the left, top and right of the die. Each sub-array has one TX and one RX RFVGA that can be bypassed in both TX and RX modes. The RFVGAs are copies of the PA with the output transformer modified to match to the 100Ω on-chip differential transmission lines. Passive 3b phase shifters are used in each channel. Lumped element Wilkinsons are used in the sub-arrays, and configurable power combiners/splitters (CPCS) in the center of the IC allow either combining or switching of the sub-arrays. A 6.6-to-8GHz Synthesizer with multiply by 3 is included in the chip, as is a test port to inject a 20-to-23GHz external LO. IF is 6.5GHz and designed RF range is 26.5 to 29.5GHz. Digital control is implemented through an RFFE interface. An external resistor is used to provide process invariant bandgap bias current.

Figure 4.4.2 shows the details of the CPCS. In the combining mode, the CPCS acts as a conventional Wilkinson with parasitic capacitance of transistors  $T_1$  providing the lumped capacitance. In the switched mode, transistors  $T_2$  in the isolation resistor are opened, creating high impedance, and  $T_1$  are closed, creating a large inductive impedance looking into the unused port. The simulated loss in the combining mode is 0.8dB and in the switched mode 1.5dB vs. 3.7dB for a conventional Wilkinson with only 1 port driven.

Figure 4.4.3 shows one front-end channel (1XCH). T/R switching losses are minimized by connecting the PA and LNA in shunt with each other. The PA output stage is a differential pair with capacitive neutralization while the driving stages are differential pairs with capacitive neutralization and AM-PM compensation. In the RX mode, the PA is disabled to a high-impedance state by grounding the output stage bias and  $L_1$  is part of the LNA input match. In the TX mode, switch  $M_{sh}$  is enabled, protecting the LNA from the voltage swings generated by the differential PA and presenting a large inductive impedance to the PA, which along with the flip-chip pad capacitance is incorporated into the PA output match. A variant die was used to measure the 1XCH (LNA, PA and Phase shifter only) performance. All measurements except the array scanning in Fig. 4.4.5 are made on connectorized evaluation boards on which the IC is directly flip-chip mounted. The LNA and PA implement 9dB and 7dB of 1dB gain-control steps to be used for channel equalization and array taper. Peak LNA and PA gain are cascaded with ~7dB loss of the phase shifter and  $S_1$  in the 1XCH measurement. The 3dB BW of the RX 1XCH is 5.5GHz and NF is 3.2 to 4.4dB. TX 1XCH  $P_{sat}$  is 14dBm, peak PAE is 20% and -25dB EVM is met for 6dBm (OFDM) and 8dBm (SCFDM) with PAE of 7.5% (OFDM) and 12% (SCFDM).

Figure 4.4.4 shows full chip performance for the 4-channel sub-array across multiple boards. Measurements are made using an external LO at 20 to 23GHz. The RX implements 4 gain steps in the LNA and maintains 400MHz-BW 64-QAM OFDM EVM >25dB from -62dBm to -20dBm  $P_{in}$ , with NF of 3.8 to 4.6dB at the max gain. RX  $P_{dc}$  from 1V is 155mW/140mW. Peak TX gain is 35 to 44dB across 5 samples. 4XCH  $P_{dc}$  from 1V at 6dBm  $P_{out}$  per PA is 350mW (OFDM) and 375mW (SCFDM) at 8dBm per PA.  $P_{sat}$  is reduced compared to 1XCH measurement because of uncompensated  $V_{DD}$  IR drop in the measurement setup,  $OP_{1dB}$  is likewise reduced to ~11dBm, and TX  $P_{dc}$  at  $OP_{1dB}$  is 475mW. RX/TX modes draw 11.75mW/12mW from 1.8V for bias generation. This bias power is included in the comparison table in Fig. 4.4.6.

EIRP and array scanning patterns were measured on a UE antenna module and are shown in Fig. 4.4.5. Uncalibrated scanning to +/- 45 degrees is demonstrated on H-pol/V-pol patch and dipoles. Peak EIRP is 35dBm for H-pol, 34dBm for V-pol on a 2x4 array, and 32dBm for a 1x4 dipole array when dipoles are driven differentially by 2 adjacent PAs. With 5dBi simulated patch gain and ~1dB feed loss, the back-estimated  $P_{sat}$  per PA is 13dBm for H-pol and 12dBm for V-pol.
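The back-estimation of per-PA power from EIRP can be reproduced with a short calculation. The sketch below assumes ideal coherent combining of all 8 elements of the 2x4 array at boresight, so that EIRP grows as $20\log_{10}N$ over a single PA; the function names are illustrative and not taken from the paper.

```python
import math

def eirp_dbm(p_pa_dbm, n_elements, element_gain_dbi, feed_loss_db):
    """EIRP of a uniformly driven array (assumed ideal coherent combining).

    20*log10(N) = 10*log10(N) for the summed PA power
                + 10*log10(N) for the array gain.
    """
    return p_pa_dbm + 20 * math.log10(n_elements) + element_gain_dbi - feed_loss_db

def back_estimated_psat_dbm(eirp, n_elements, element_gain_dbi, feed_loss_db):
    """Invert eirp_dbm() to recover the output power of one PA."""
    return eirp - 20 * math.log10(n_elements) - element_gain_dbi + feed_loss_db

# Values quoted in the text: 2x4 patch array, 5dBi patch gain, ~1dB feed loss.
h_pol = back_estimated_psat_dbm(35.0, 8, 5.0, 1.0)
v_pol = back_estimated_psat_dbm(34.0, 8, 5.0, 1.0)
print(f"H-pol: {h_pol:.1f} dBm, V-pol: {v_pol:.1f} dBm")
```

Under these assumptions the result lands within 0.1dB of the 13dBm and 12dBm quoted for H-pol and V-pol.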

Figure 4.4.6 compares this work with state-of-the-art 28GHz phased-array chips with at least 4 channels. Where possible, the current consumption is normalized to a single channel when 4 channels are enabled, and the die size is compared as the total die size including pads. Only one reference is known to the authors in 28nm CMOS [2], and this work has higher  $P_{out}$ , lower NF, and lower power consumption per channel for RX and for TX at 64-QAM max  $P_{out}$ . Compared to the SiGe transceivers [3-5], this IC is significantly smaller, uses significantly less RX power, achieves slightly better NF, and similar TX  $P_{sat}$  and  $OP_{1dB}$ . Direct comparison of TX mode power is difficult because the SiGe references do not report power at linear  $P_{out}$ . We compare instead the  $P_{dc}$  at  $OP_{1dB}$ , and this IC consumes ~487mW or 122mW/channel, which is still less than the SiGe counterparts.
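The per-channel normalization used in the comparison can be checked directly from the numbers reported above. This is a minimal sketch that assumes the 1.8V bias power is simply added to the 1V supply power and the sum divided by the four enabled channels; for RX it takes the first of the two quoted figures (155mW), which is an assumption about which value the table uses.

```python
def per_channel_mw(core_mw, bias_mw, n_channels=4):
    """Total DC power (1V core + 1.8V bias) normalized to one channel."""
    return (core_mw + bias_mw) / n_channels

tx_64qam = per_channel_mw(350.0, 12.0)    # TX at 6dBm per PA (OFDM)
tx_op1db = per_channel_mw(475.0, 12.0)    # TX at OP1dB
rx_max_gain = per_channel_mw(155.0, 11.75)

print(f"TX 64-QAM: {tx_64qam:.1f} mW/ch")
print(f"TX OP1dB:  {tx_op1db:.1f} mW/ch")
print(f"RX:        {rx_max_gain:.1f} mW/ch")
```

These agree, after rounding, with the 90mW and 42mW table entries and the ~122mW/channel figure in the text.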

Figure 4.4.7 is a die micrograph. The die size is 5.97mm×4.865mm. The die layout mirrors the schematic representation of the architecture in Fig. 4.4.1, with 3 sides of the die having 8 PA/LNA/PS each. The bump pitch for the G-S-G antenna ports is 250μm, which constrains the minimum die size. Elsewhere the minimum pitch is 150μm for bumps connected to the same net, and 240μm for bumps connected to different nets. This allowed meeting the bump density rules and flip-chip mounting of the die on a low-loss PCB-based antenna module and evaluation board.

##### Acknowledgements:

The authors thank the digital, layout and antenna design teams of Qualcomm Research for their support of this work, Drew Arnett, Dave Palmer and Hector Hernandez for their support of measurements and Jeremy Goldblatt and Sherif Shakib for useful discussions.

##### References:

- [1] S. Shakib, et al., "A Wideband 28GHz Power Amplifier Supporting 8×100MHz Carrier Aggregation for 5G in 40nm CMOS," *ISSCC*, pp. 44-45, Feb. 2017.
- [2] H. T. Kim, et al., "A 28GHz CMOS direct conversion transceiver with packaged antenna arrays for 5G cellular system," *IEEE RFIC*, pp. 69-72, June 2017.
- [3] B. Sadhu, et al., "A 28GHz 32-Element Phased-Array Transceiver IC with Concurrent Dual Polarized Beams and 1.4 Degree Beam-Steering Resolution for 5G Communication," *ISSCC*, pp. 128-129, Feb. 2017.
- [4] K. Kibaroglu, et al., "A Quad-Core 28-32 GHz Transmit/Receive 5G Phased Array IC with Flip-Chip Packaging in SiGe BiCMOS," *IEEE IMS*, pp. 1-3, June 2017.
- [5] Y. S. Yeh, et al., "A 28-GHz phased-array transceiver with series-fed dual-vector distributed beamforming," *IEEE RFIC*, pp. 65-68, June 2017.



**Figure 4.4.1:** IC architecture and block diagram, exploded views of user equipment (UE) and basestation (BS) antenna modules.



**Figure 4.4.2:** Schematic diagram of conventional Wilkinson and configurable power combiner and splitter with switch operation.



**Figure 4.4.3:** Schematic diagram of 1XCH showing PA/LNA switch and interface to flip-chip bump, and measured performance.



**Figure 4.4.4:** 4XCH Sub-array measured performance on conducted flip-chip evaluation board. RX EVM and gain measurements are with 400MHz 64-QAM OFDM signal. TX measurements are with CW signal.



**Figure 4.4.5:** Radiated performance measurements on UE antenna module of peak TX EIRP and TX beam-steering performance at  $P_{sat}$ .

| Parameter                               | This Work            | [2]RFIC17              | [3]ISSCC17         | [4]IMS17           | [5]RFIC17          |
|-----------------------------------------|----------------------|------------------------|--------------------|--------------------|--------------------|
| Technology                              | 28nm LP-RF CMOS 1P7M | 28nm LP CMOS 1P7M      | 0.13um SiGe BiCMOS | 0.18um SiGe BiCMOS | 0.13um SiGe BiCMOS |
| Simultaneous Polarizations per IC       | 2                    | 1                      | 2                  | 1                  | 1                  |
| Front end channels per IC               | 24                   | 8                      | 32                 | 4                  | 4                  |
| TX Input / RX output interface          | 6.5GHz IF            | Analog IQ BB           | 3GHz IF            | RF                 | RF                 |
| Phase Shifter Resolution                | 3 bit                | 3 bit                  | 5 bit              | 6 bit              | 5 bit              |
| RF 3dB BW (GHz)                         | 7.5 (TX), 5.5 (RX)   | 2.2                    | 1.5                | 7 (TX), 6(RX)      | 4                  |
| TX Gain (dB)                            | 34-44                | 50-52 (4 patches)      | 24-32              | 12                 | 9.4-14.3           |
| TX Psat (dBm)                           | >14                  | 10.5                   | 16                 | -                  | >12.5              |
| TX OP1dB (dBm)                          | >12                  | 9.5                    | 13.5               | 10.5               | >5.5               |
| TX 64QAM Pout (dBm)                     | 6                    | 3                      | -                  | 2.5                | -                  |
| TX 64QAM PAE                            | 7.5%                 | 3%                     | -                  | -                  | -                  |
| TX Total Power (W)                      | 0.36 (4xCH)          | 0.416 (4xCH)           | 4.6                | 0.8                | 1.08               |
| TX Power per Channel (mW)               | 90                   | 104                    | 287.5              | 200                | 270                |
| RX Gain (dB)                            | 32-34dB (4xCH max)   | 49-50 (4 patches)      | 28-34              | 18                 | 8.7 to 11.5        |
| LNA NF w/SW & PS (dB)                   | 3.8-4.4              | 5.6                    | 6.0-6.9            | 4.6                | 4.5-6.9            |
| RX NF (dB)                              | 4.4-4.7              | 6.7                    | -                  | 4.6                | 4.5-6.9            |
| RX Total Power per IC (W)               | 0.167 (4xCH)         | 0.291 (4xCH 100MHz BW) | 3.3                | 0.42               | 0.68               |
| RX Power per channel (mW)               | 42                   | 73                     | 206                | 105                | 170                |
| Die Size (mm x mm)                      | 4.65 x 5.97          | 2.6 x 2.8              | 10.5 x 15.8        | 2.5 x 4.7          | 2.93 x 2.35        |
| Die Size per channel (mm <sup>2</sup> ) | 1.16                 | 0.91                   | 5.18               | 2.94               | 1.72               |

**Figure 4.4.6:** Comparison table with state of the art published 28GHz mm-wave front-end transceivers with at least 4 channels.



**Figure 4.4.7:** Die micrograph.

## 4.5 A Reconfigurable 28/37GHz Hybrid-Beamforming MIMO Receiver with Inter-Band Carrier Aggregation and RF-Domain LMS Weight Adaptation

Susnata Mondal, Rahul Singh, Jeyanandh Paramesh

Carnegie Mellon University, Pittsburgh, PA

This paper presents a hybrid beamforming mm-wave MIMO receiver with two key innovations. First, it can be configured into three modes: two single-band multistream modes at 28 or 37 GHz that can support single- or multi-user MIMO, and a concurrent 28 and 37GHz dual-band single-stream phased-array inter-band carrier-aggregation mode. In all modes, the receiver features full connectivity from each antenna element input to each output stream, thereby maximizing usage of the available aperture. Second, the digitally programmable RF beamforming weights can be controlled by an external serial interface, or by an on-chip “one-port” mixed-signal adaptation loop that implements a technique that we call double-sampling time-multiplexed LMS (DS-TM-LMS). Unlike conventional LMS-type adaptation algorithms that require access to the individual array inputs and the combined output, and are therefore not easily amenable to a hybrid beamformer, DS-TM-LMS updates the RF-domain weights by accessing only the combined downconverted array outputs. Such adaptation is valuable for adaptive main-lobe, side-lobe or null steering, but more importantly, it can assist/augment codebook-based beam acquisition/tracking algorithms, which may fail in the presence of multipath, on- or off-channel interferers.

A simplified architecture of the four-element, two-stream hybrid beamforming receiver is shown in Fig. 4.5.1. In each element, a concurrent dual-band LNA is shared between the two streams. Each stream comprises 28/37 GHz dual-band per-element, per-stream RF-domain complex-weights, and signal combiners. This is followed by two image-reject downconverters (one per stream) that select either the lower or the higher band using high-side or low-side LO injection, respectively. While the frequencies of the desired signals in the two bands can in general be chosen to be at some offset from their image locations, independent LO's would be required for each downconverter, which adds complexity. Here, the desired signals in the two bands are chosen to be at the image frequency of each other. This allows the LO generation circuitry to be shared between the downconversion chains, which facilitates inter-band carrier aggregation without hardware overhead (Fig. 4.5.2). It is important to note that any interferer can be attenuated by spatial filtering or null steering. For interference at the image location, this image-reject architecture enables additional suppression.
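The shared-LO condition can be made concrete: two bands are images of each other when the first LO sits midway between them. The sketch below is illustrative only (function names are hypothetical) and considers just the first mixing stage with nominal band centers of 28 and 37GHz.

```python
def shared_lo_plan(f_low_ghz, f_high_ghz):
    """Return (LO, IF) such that the two bands are mutual images."""
    lo = (f_low_ghz + f_high_ghz) / 2
    f_if = (f_high_ghz - f_low_ghz) / 2
    return lo, f_if

def image_of(f_rf_ghz, lo_ghz):
    """Image frequency of f_rf about the LO."""
    return 2 * lo_ghz - f_rf_ghz

lo, f_if = shared_lo_plan(28.0, 37.0)
print(f"LO = {lo} GHz, IF = {f_if} GHz")
print(f"image of 28 GHz: {image_of(28.0, lo)} GHz")
```

For these band centers the LO is 32.5GHz and the IF is 4.5GHz. The 28GHz band is then selected with high-side injection and the 37GHz band with low-side injection from the same LO.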

The Cartesian-combining technique [1] is well-suited to implement programmable RF-domain complex weighting at two widely separated frequencies. The complex weights are set by the gain ratio of a pair of programmable-gain amplifiers (PGA) in conjunction with complex-quadrature downconversion, and therefore do not require frequency-selective elements unlike conventional phase-shifters [2,3]. The image-reject Cartesian-combining architecture introduced here (Fig. 4.5.1 bottom) performs Cartesian complex weighting at the output of the first complex-quadrature mixing stage, while the cascade of the two complex-quadrature mixing stages enables image-rejection.

A schematic of the four-element, two-stream hybrid beamforming receiver prototype is shown in Fig. 4.5.2. It uses LNA's with a  $g_m$ -boosted common-gate input stage and a common-source second stage. Each LNA is followed by two pairs of 5b PGAs where each pair determines the complex weight in each stream [4]. The output currents of corresponding PGAs from all elements are summed (see Fig. 4.5.2) using a two-stage active combiner. In order to achieve a dual narrowband response at widely separated frequencies, the load networks in the LNA, PGA and combiner stages use coupled resonators with transformers that have moderate  $k$  values ( $\sim 0.4$ ). The complex-quadrature image-reject mixers are of the Gilbert cell type. Quadrature LO phases are generated using a polyphase-filter (PPF) and a static CML divide-by-2 in the first and second mixing stages. In each stream, baseband (BB) I/Q outputs are filtered by  $g_m$ -C filters, digitized by 4b flash ADC's, and fed to the digital adaptation logic.

In an ordinary image-reject receiver, quadrature error (QE) in both mixing stages can be consolidated and corrected at BB. However, in a Cartesian-combining image-reject receiver, the first stage QE, when captured at BB, varies with weight settings. To maintain high image-rejection across all complex-weight settings, QE from each mixing stage should be calibrated separately. In the first stage where significant QE is expected due to the high frequency and PPF-based quadrature generation, the following technique is used to extract and calibrate the QE in LO<sub>1</sub> separately. First, the LO<sub>1</sub> QE is translated to IF using the top mixer pair of the first mixing stage (Fig. 4.5.3). Then, the QE between two mixer outputs at IF (4.5GHz in the measurement) is converted to a voltage using a cross-coupled mixer pair (measured step of  $\sim 20\text{mV}/^\circ$  is shown in Fig. 4.5.3). Cross-coupled mixers are used in order to equalize the loading at two IF outputs, and thus reduce imperfections due to RF and LO trace mismatches inside the QE-extraction circuit. The sign of the voltage representing the QE is extracted using a comparator and fed to a digital calibration engine that minimizes the average comparator output by increasing or decreasing the 5b control words of the capacitor banks in tuned-LC I/Q LO buffers, which can tune the I/Q phases with  $\sim 0.75^\circ/\text{LSB}$  phase resolution. This calibration can reduce raw QE of over  $20^\circ$  in a 30-to-36GHz LO frequency range to below  $1^\circ$  (Fig. 4.5.3 inset). The LO<sub>2</sub> QE is corrected at BB using a phase rotator (Fig. 4.5.3). Image-reject-ratio (IRR) measurement (Fig. 4.5.4) shows that, whereas calibrating only the 2<sup>nd</sup> stage achieves >35dB IRR for a limited number of complex-weight settings, calibration of both stages results in >35dB IRR over the entire range of weights in both 28 and 37GHz bands.
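The sign-driven calibration loop can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the on-chip engine: the linear tuning model, the single unipolar 5b code starting at zero, and all function names are hypothetical; only the 5b word width and the ~0.75°/LSB resolution come from the text.

```python
LSB_DEG = 0.75   # phase resolution per LSB quoted in the text
N_CODES = 32     # 5b control word

def residual_qe_deg(raw_qe_deg, code):
    # Hypothetical linear model: each LSB removes LSB_DEG of quadrature error.
    return raw_qe_deg - code * LSB_DEG

def comparator(qe_deg):
    # Only the sign of the detected QE voltage reaches the digital engine.
    return 1 if qe_deg > 0 else -1

def calibrate(raw_qe_deg, n_iter=64):
    """Step the code up or down by one LSB per iteration, following the sign."""
    code = 0
    for _ in range(n_iter):
        step = comparator(residual_qe_deg(raw_qe_deg, code))
        code = min(N_CODES - 1, max(0, code + step))
    return code

code = calibrate(20.0)
residual = residual_qe_deg(20.0, code)
print(f"code = {code}, residual QE = {residual:+.2f} deg")
```

In this model the loop ends up toggling between the two codes that bracket zero error, so the residual stays within one LSB, which is consistent with the sub-1° result reported after calibration.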

The entire four-element, two-stream receiver chip (Fig. 4.5.7) occupies  $2.9 \times 1.55\text{mm}^2$  (2.2mm<sup>2</sup> front-end core including LO path and mixers) in 65nm CMOS. It consumes 310mA (40/100/100/70 mA in LNA/combiners/mixers/LO path) from 1V. For a single element, the receiver achieves 33dB (26.5dB) peak conversion gain, 2.75GHz (3.75GHz) RF bandwidth, 5.7dB (8.5dB) NF, input return loss >10dB (>15dB), and P<sub>1dB</sub> of -30dBm (-23dBm) at 28GHz (37GHz). Simple baseband filters and ADCs were implemented for functionality, and can be further optimized. Concurrent dual-band signal reception is demonstrated with both the streams turned on, where a (28+0.03)GHz tone and a (37+0.06)GHz tone are combined and applied to a single element. Stream #1 (#2) is configured to reject the 28GHz (37GHz) band, and >35dB IRR is simultaneously achieved (Fig. 4.5.5) in both bands. Figure 4.5.5 (inset) shows how IRR can be further improved by steering a spatial null towards an image-band interferer. It is seen that with two channels, null-steering improves the IRR from 35dB to 48dB. Comparisons with state-of-the-art single-band beamformers are summarized in Fig. 4.5.7 (bottom).

DS-TM-LMS array adaptation (Fig. 4.5.6) is performed as follows: (1) In the 1<sup>st</sup> half-cycle of each symbol period, beamformer weight of a single element ( $n^{th}$ ) is set to unity and others to zero. This allows the  $n^{th}$  element's input ( $x_n$ ) to be extracted from combined BB outputs. (2) In the 2<sup>nd</sup> half-cycle, the current beamforming weights are applied to all the elements, and the combined output-mean-square-error-gradient w.r.t. the  $n^{th}$  weight is calculated from the beamformer output  $y$  and the desired signal  $d$  as  $\text{grad}_n = (d-y)^*x_n$ . (3) Error gradients w.r.t. all other weights are extracted sequentially in time-multiplexed fashion (one per symbol period). (4) Beamforming weights are updated using LMS, and the above operations are repeated until the algorithm converges. Note that the DS-TM-LMS algorithm enables MMSE adaptation of the weights without any hardware overhead in the front-end. Moreover, in a hybrid beamformer, multiple streams can simultaneously perform training/tracking, and can augment each other for faster training. An example of adaptation is shown in Fig. 4.5.6 (bottom), where a directional signal is emulated by benchtop components and applied to two elements. It is seen that the BB output matches the transmitted symbols after adaptation, which indicates that the DS-TM-LMS has successfully steered the array in the direction of the signal.
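The four steps above can be sketched as a behavioral simulation in which the loop only ever observes the combined output. Several things here are assumptions for illustration: a noiseless single plane wave, known unit-magnitude training symbols, the convention $y = \sum_k w_k^* x_k$ (chosen so that the update direction matches $\text{grad}_n = (d-y)^* x_n$), and the step size and inter-element phase, which are hypothetical.

```python
import cmath

def combine(weights, x):
    """One-port observable: combined output y = sum(conj(w_k) * x_k)."""
    return sum(w.conjugate() * xk for w, xk in zip(weights, x))

def ds_tm_lms(steer, symbols, mu=0.1):
    """Adapt weights using only combined outputs, one gradient per symbol."""
    n_el = len(steer)
    w = [0j] * n_el
    for start in range(0, len(symbols) - n_el + 1, n_el):
        grads = []
        for n in range(n_el):
            d = symbols[start + n]              # known training symbol
            x = [d * a for a in steer]          # element inputs, hidden from the loop
            one_hot = [1 + 0j if k == n else 0j for k in range(n_el)]
            x_n = combine(one_hot, x)           # 1st half-cycle: extract x_n
            y = combine(w, x)                   # 2nd half-cycle: current weights
            grads.append((d - y).conjugate() * x_n)
        # LMS update once all n_el gradients have been collected
        w = [wk + mu * g for wk, g in zip(w, grads)]
    return w

phi = 0.9  # hypothetical inter-element phase of the incoming wave (radians)
steer = [cmath.exp(1j * n * phi) for n in range(4)]
qpsk = [cmath.exp(1j * cmath.pi / 4 * (2 * k + 1)) for k in range(4)]
symbols = [qpsk[i % 4] for i in range(200)]

weights = ds_tm_lms(steer, symbols)
response = combine(weights, steer)  # array response toward the signal direction
print(f"|response - 1| = {abs(response - 1):.2e}")
```

After adaptation the array response toward the signal direction approaches unity, meaning the combined output reproduces the transmitted symbols, which mirrors the measured behavior described for Fig. 4.5.6.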

### Acknowledgements:

This work was supported in part by the National Science Foundation under grants CCF-1314876 and ECCS-1343324.

### References:

- [1] J. Paramesh, et al., "A Four-Antenna Receiver in 90-nm CMOS for Beamforming and Spatial Diversity," *IEEE JSSC*, vol. 40, no. 12, pp. 2515-2524, Dec. 2005.
- [2] B. Sadhu, et al., "A 28GHz 32-Element Phased-Array Transceiver IC with Concurrent Dual Polarized Beams and 1.4 Degree Beam-Steering Resolution for 5G Communication," *ISSCC*, pp. 128-129, 2017.
- [3] Y. S. Yeh, et al., "A 28 GHz Phased-Array Receiver Front End with Dual-Vector Distributed Beamforming," *IEEE JSSC*, vol. 52, no. 5, pp. 1230-1244, May 2017.
- [4] S. Mondal, et al., "A 25-30 GHz 8-Antenna 2-Stream Hybrid Beamforming Receiver for MIMO Communication," *IEEE RFIC*, pp. 112-115, 2017.



Figure 4.5.1: Concurrent dual-band beamforming receiver (top); Image-reject Cartesian complex-weighting (bottom).



Figure 4.5.2: 28/37GHz four-element, two-stream, LMS adaptive hybrid beamforming receiver (calibration details are shown in next schematic).



Figure 4.5.3: LO quad-error (QE) calibration (calibration hardware is shown in green) (top); QE detector characterization, and LO<sub>1</sub> QE vs. freq. (bottom).



Figure 4.5.4: Measured image-reject ratio (left); Single-element receiver characterization (right).



Figure 4.5.5: Concurrent dual-band reception (left); Two-element null-steering assisted image-rejection (right).



Figure 4.5.6: DS-TM-LMS weight-update-cycle for a four-element array (top); Measured BB output for two elements (bottom).



|                                   | Guan<br>JSSC'04    | Yu<br>RFIC'09          | Yeh<br>RFIC'16         | Sadhu<br>ISSCC'17  | Kim<br>RFIC'17    | Kibaroglu<br>RFIC'17 | Mondal<br>RFIC'17 | This Work<br>ISSCC'18 |
|-----------------------------------|--------------------|------------------------|------------------------|--------------------|-------------------|----------------------|-------------------|-----------------------|
| Technology (nm)                   | 180-SiGe           | 130-CMOS               | 120-SiGe               | 130-SiGe/1.5GHz    | 28-CMOS           | 180-SiGe             | 65-CMOS           | 65-CMOS               |
| Freq. (GHz)                       | 24                 | 24-27                  | 28-32                  | 28(1.5GHz)         | 25.8-28           | 28-32                | 25-30             | 27-29.75              |
| Gain (dB)                         | 43 <sup>#</sup>    | 12 <sup>*</sup>        | 9.4 <sup>*</sup>       | 34                 | 39 <sup>*</sup>   | 20                   | 34 <sup>#</sup>   | 33 <sup>#</sup>       |
| NF<sub>RX</sub> (dB)              | 7.4 <sup>#</sup>   | 7.8 <sup>*</sup>       | 5.1 <sup>*</sup>       | -                  | 6.7 <sup>#</sup>  | 4.6 <sup>*</sup>     | 7.3 <sup>#</sup>  | 5.7 <sup>#</sup>      |
| S11 (dB)                          | <-10               | <-10                   | <-10                   | -                  | -                 | <-10                 | <-8               | <-10                  |
| I <sub>P1dB</sub> (dBm)           | -27 <sup>#</sup>   | -19 - -22 <sup>*</sup> | -16 - -13 <sup>*</sup> | -22.5              | -                 | -22                  | -29 <sup>#</sup>  | <-15                  |
| Power (mW)                        | 113 <sup>#</sup>   | 57.5 <sup>*</sup>      | 136.5 <sup>*</sup>     | 206.2 <sup>#</sup> | 50 <sup>#</sup>   | 130 <sup>*</sup>     | 27.5 <sup>#</sup> | 52.5 <sup>#</sup>     |
| Area (mm <sup>2</sup> )           | 1.44 <sup>#</sup>  | 0.45 <sup>*</sup>      | 0.45 <sup>*</sup>      | 2.59 <sup>#</sup>  | 0.91 <sup>#</sup> | 1.5 <sup>#</sup>     | 0.32 <sup>#</sup> | 0.46 <sup>#</sup>     |
| No. of Elements                   | 8                  | 4                      | 4                      | 16                 | 8                 | 4                    | 8                 | 4                     |
| No. of Streams                    | 1                  | 1                      | 1                      | 1                  | 1                 | 1                    | 2                 | 2                     |
| Power (W)                         | 2.9 <sup>#</sup>   | 0.23 <sup>*</sup>      | 0.54 <sup>*</sup>      | 3.3 <sup>*</sup>   | 0.4 <sup>#</sup>  | 0.52 <sup>#</sup>    | 0.34 <sup>#</sup> | 0.31 <sup>#</sup>     |
| Die Area (mm <sup>2</sup> )       | 11.55 <sup>#</sup> | 4 <sup>*</sup>         | 7.2 <sup>*</sup>       | 41.5 <sup>#</sup>  | 7.28 <sup>#</sup> | 5.87 <sup>**</sup>   | 6.16 <sup>#</sup> | 4.5 <sup>#</sup>      |

<sup>#</sup> Including down-conversion   \* RF front-end only   <sup>\*</sup>Estimated   Includes PA area

Figure 4.5.7: Four-element 28/37GHz prototype die micrograph (top); Comparison table (bottom).

## 4.6 A Fully Integrated Scalable W-Band Phased-Array Module with Integrated Antennas, Self-Alignment and Self-Test

Shahriar Shahramian<sup>1</sup>, Mike Holyoak<sup>2</sup>, Amit Singh<sup>1</sup>, Bahar Jalali Farahani<sup>1</sup>, Yves Baeyens<sup>1</sup>

<sup>1</sup>Bell Laboratories, New Providence, NJ; <sup>2</sup>LGS Innovations, Florham Park, NJ

Advanced SiGe BiCMOS and CMOS processes continue to push the frontier on millimeter-wave (mm-wave) and highly integrated phased-array systems for a variety of communication applications [1,3]. Furthermore, next-generation mobile technology (5G) demands ultra-low latency and high data-rates with ubiquitous deployment supporting multi-users through the use of pico-cells. These cells may require up to hundreds of active elements capable of producing thousands of beam patterns. In order to make wide adoption of such mm-wave systems a reality, the overall cost of the system must be significantly reduced. This can be accomplished through several means. First, producing highly-integrated phased arrays eliminates the need for additional external components (such as expensive mm-wave synthesizers, amplifiers and switches), which reduces the overall system costs. Second, eliminating exotic packaging processes and materials would allow low-cost traditional manufacturing techniques to be applied to mm-wave systems. Lastly, incorporating self-test, fault-detection, health-monitoring and self-calibration into the RFIC significantly reduces the costs of factory testing (by eliminating the need for any mm-wave verifications) and enables remote-maintenance and system-reconfiguration in case of failures.

The new generation of phased-array modules presented in this paper addresses the above challenges. Each W-band phased-array IC comprises a full transceiver architecture with either 16TX/8RX or 16RX/8TX calibrated active elements (Fig. 4.6.1). Each chip is also equipped with direct up- and downconverter mixers as well as a fully integrated Phase-Locked Loop (PLL) with a prime-ratio mm-wave multiplier in order to avoid VCO pulling. A high-dynamic-range analog baseband block with programmable filter bandwidth is also included. Passive and active signal distributions provide the RF signal path to all phase-shifter elements and a distributed digital Serial Peripheral Interface (SPI) network is used for calibration, self-test and health monitoring. Each chipset includes >175 internal monitor and diagnostic points with 29 on-chip ADCs for digital readouts as well as a 32-slot beam lookup table for rapid beam hopping.

Simplified block diagrams of the transmit and receive phase-shifter elements are shown in Fig. 4.6.2. The 5b active phase shifting is accomplished through vector summation, whose performance was demonstrated in the previous-generation RFICs in [4]. In this chipset, each receive element additionally employs a built-in coupler, amplifier and VGA chain followed by a power detector in order to independently monitor the performance of the element during self-test procedures. The transmitter element uses an extra power detector on the isolation port of the balanced power amplifier to detect antenna failures or strong reflected signals. All diagnostics information (including power supply, bias-point, temperature monitor and power detectors) is digitized locally in each element as well as routed on a global analog bus. Active elements are also equipped with calibration memory and beam-lookup tables where broadcast SPI commands can be used to switch to a new beam-state for all the active elements in a large-scale phased array system at the same time. Calibration and beam lookup tables are populated at the time of power-on-self-test and calibration.

A major challenge in the design of large-scale multichip phased-array systems operating at mm-waves is LO distribution and chip-to-chip synchronization. Furthermore, due to large radiated power in systems employing hundreds of active transmitter elements, integrated PLLs are strongly susceptible to VCO pulling and LO signal pollution. Additionally, routing mm-wave LO signals to multiple RFICs from a single source becomes increasingly impractical for systems using tens of phased-array modules. The presented RFICs in this work are equipped with the PLL shown in Fig. 4.6.3 to address these issues. In a multichip phased-array system, the synchronization LO signal is daisy-chained between RFICs using the integrated PLL's two modes of operation. In the first mode (referred to as master PLL), the built-in dual-VCOs are locked to a reference at 1/16<sup>th</sup> of the VCO frequency using a second-order loop architecture. The VCO fundamental frequency is chosen to be at a non-integer multiple of the carrier frequency,  $f_{VCO} = (3 \times f_{Carrier})/10$ . This ensures that no mixing terms of the RF carrier and VCO frequency fall into the PLL loop-bandwidth or within several GHz of the VCO frequency. The locked VCO signal is further multiplied by a prime-ratio (5/3) multiplier using the shown triple-mixer architecture, producing an  $f_{Carrier}/2$  signal. A 180-degree 9b phase shifter is included in this path to allow for multichip LO phase alignment. Additionally, the locked VCO signal is also routed externally for use by the next RFIC in the chain. The final PLL stage uses two doublers to generate the  $f_{Carrier}$  signal required by the direct-conversion transmit and receive mixers. In the second mode (referred to as slave PLL) the received  $(3 \times f_{Carrier})/10$  signal is amplified internally and routed to the next RFIC in the chain as well as multiplied internally to reach  $f_{Carrier}$ . The built-in phase shifter is used in a similar fashion to align the received LO signal to the master PLL phase. Figure 4.6.3 also shows the measured system phase-noise at a carrier of 90.67GHz, which corresponds to a VCO frequency of 27.2GHz and a PLL reference frequency of 1.7GHz generated by an Analog Devices HMC1034 synthesizer equipped with a 100MHz crystal reference. The locking range of the master PLL is between 81.5 and 97.5GHz with an integrated jitter of <100fs between 12kHz and 20MHz. No degradation is observed in the phase noise of the synchronization LO signal due to daisy-chaining two RFICs.
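The frequency plan can be checked from the measured operating point. The sketch below starts from the 1.7GHz reference and applies the ratios stated in the text (×16 to the VCO, ×5/3 to half the carrier, ×2 to the carrier); exact rational arithmetic is used so the $3/10$ relation can be verified without rounding. Variable names are illustrative.

```python
from fractions import Fraction

f_ref = Fraction(17, 10)            # 1.7 GHz PLL reference
f_vco = 16 * f_ref                  # VCO locked at 16x the reference
f_half = f_vco * Fraction(5, 3)     # prime-ratio (5/3) multiplier output
f_carrier = 2 * f_half              # doubler output used by the mixers

print(f"VCO       = {float(f_vco):.2f} GHz")
print(f"carrier/2 = {float(f_half):.2f} GHz")
print(f"carrier   = {float(f_carrier):.2f} GHz")
print("f_vco == 3/10 * f_carrier:", f_vco == Fraction(3, 10) * f_carrier)
```

The result reproduces the 27.2GHz VCO and 90.67GHz carrier quoted for the phase-noise measurement.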

Two unique interposers were designed to accommodate the two RFIC chipsets. The interposers integrated 16-element and 8-element Aperture-Stacked Patch (ASP) antenna sub-arrays to interface with the 24 TX/RX elements. An ASP antenna architecture provides medium gain and large bandwidth, and can be constructed using standard Printed Circuit Board (PCB) processes. A series of coupling apertures and stacked vias were used to realize the complete antenna feed-lines, which are all phase-matched to 180-degree increments at 90GHz. The antenna elements were spaced at  $0.63\lambda$  (~2.1mm) at 90GHz. Typically, the elements would be spaced at  $0.5\lambda$ ; however, due to physical constraints set by the RFIC as well as PCB process limitations, such a scheme was not possible. Figure 4.6.4 shows the 3D model of the 16TX/8RX 5x5 ASP antenna array and its simulated performance.
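The quoted spacing, and what it implies for scanning, can be estimated with the textbook grating-lobe condition for a uniform array, $d/\lambda\,(1 + |\sin\theta|) < 1$. This relation is a standard result and is not taken from the paper; the sketch treats the array as ideal and ignores element pattern and finite-array effects.

```python
import math

C_M_PER_S = 299_792_458.0

def wavelength_mm(f_ghz):
    """Free-space wavelength in millimetres."""
    return C_M_PER_S / (f_ghz * 1e9) * 1e3

def max_scan_deg(d_over_lambda):
    """Largest scan angle before a grating lobe enters visible space."""
    s = 1.0 / d_over_lambda - 1.0
    return 90.0 if s >= 1.0 else math.degrees(math.asin(s))

spacing_mm = 0.63 * wavelength_mm(90.0)
scan_limit = max_scan_deg(0.63)

print(f"spacing = {spacing_mm:.2f} mm")
print(f"grating-lobe-free scan limit = {scan_limit:.1f} deg")
```

Under this idealization the $0.63\lambda$ spacing corresponds to roughly 2.1mm and limits grating-lobe-free scanning to about 36° off boresight, whereas $0.5\lambda$ spacing would impose no such limit.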

A block diagram of the baseband is shown in Fig. 4.6.5. It comprises a tunable lowpass active filter, two VGAs each with 22dB of gain control, and an output driver with $50\Omega$ on-chip termination. DC offsets are cancelled by continuous feedback loops both at the input of the baseband (i.e. mixer output) and at the input of the first VGA. The gain of the VGAs can be controlled independently, either by analog control or digitally through the SPI interface. Carrier-induced DC offsets generated at the output of the mixer are cancelled by injecting a correction current at the input of the tunable lowpass filter. The strength of this signal is proportional to the internal carrier feedthrough as well as to any externally received carrier tone. Therefore, by intentionally inducing carrier leakage from an adjacent phased-array module in the daisy-chain architecture, the relative phase of the local carrier versus that of the adjacent module can be measured. Consequently, the built-in LO phase shifter can be used to align two adjacent phased-array modules. Repeating this procedure along the daisy-chain aligns all RFICs, and the procedure can be fully automated through the SPI interface without any mm-wave measurements or factory calibration. Figure 4.6.5 also shows the measured DC offset cancellation signals at the I and Q ports of the baseband block as a function of the LO phase-shifter setting. An equal I and Q signal can be designated as a reference alignment point for all RFICs in the chain.
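The alignment procedure can be sketched as a sweep of the 9b phase shifter that searches for the code where the I and Q offset-cancellation signals are equal. The Python sketch below is a toy model under stated assumptions: the cosine/sine dependence of the I/Q signals on phase, the function names, and the unknown starting phases are all hypothetical, and on the real chip the values would be read back through SPI rather than computed.

```python
import math

N_BITS = 9                      # phase-shifter resolution (from the text)
RANGE_DEG = 180.0               # phase-shifter range (from the text)
STEP_DEG = RANGE_DEG / 2 ** N_BITS


def dc_offsets(code, unknown_phase_deg, amplitude=1.0):
    """Assumed model: I/Q offset-cancellation signals versus shifter code."""
    phi = math.radians(unknown_phase_deg + code * STEP_DEG)
    return amplitude * math.cos(phi), amplitude * math.sin(phi)


def iq_imbalance(code, unknown_phase_deg):
    i, q = dc_offsets(code, unknown_phase_deg)
    return abs(i - q)


def find_alignment_code(unknown_phase_deg):
    """Sweep every code and return the one where I and Q are closest."""
    return min(range(2 ** N_BITS),
               key=lambda c: iq_imbalance(c, unknown_phase_deg))


for start_phase in (20.0, 130.0):   # hypothetical per-module phase errors
    code = find_alignment_code(start_phase)
    print(start_phase, code, dc_offsets(code, start_phase))
```

In this model the I = Q condition repeats every 180°, so a 180° shifter range always contains exactly one such point. The sign of I (or Q) would still have to be checked to resolve the remaining 180° ambiguity; how the actual chipset handles this is not stated in the text.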

The phased-array chipset is implemented in the TowerJazz 0.18μm SiGe BiCMOS process with $f_T/f_{MAX}$ of 240/270GHz. The process offers 0.18μm CMOS transistors and 6 layers of aluminum metallization. Each IC occupies $8 \times 4.4\text{mm}^2$ and has 525 flip-chip pads that are used for power supply, digital I/O, RF and IF signals. The die micrograph of the chipset is shown in Fig. 4.6.7. Each chip operates from 1.5V and 2.5V supplies. The transmit/receive phased-array elements consume on average 275mW/225mW, while the PLL requires 200mW in master mode and 75mW in slave mode. The baseband, up- and downconverters consume 500mW in total. The receiver NF is better than 8dB at 90GHz with a gain range between 10 and 80dB per element.

Figure 4.6.6 shows the measured radiated broadband image rejection of the 16TX/8RX phased-array module mounted in a socket. The module offers better than 35dB of image rejection across an 8GHz frequency band centered at 90.67GHz. The master PLL is active during all measurements with all 16TX elements activated. Figure 4.6.6 also shows a module-to-module 64-QAM 18Gb/s loop-back test with an additional 16RX/8TX phased-array module mounted in an adjacent socket. Figure 4.6.7 also shows a photograph of the front and back of an assembled phased-array module. The construction of a large-scale 384-element phased-array system employing 16 phased-array modules is currently underway.

*Acknowledgements:* The authors would like to acknowledge Bradley Farnsworth from LGS Innovations and Hernan Castro from Bell Laboratories for their support with system simulations and layout.

### References:

- [1] B. Sadhu, et al., "A 28GHz 32-Element Phased-Array Transceiver IC with Concurrent Dual Polarized Beams and 1.4-Degree Beam-Steering Resolution for 5G Communication," *ISSCC*, pp. 128-129, Feb. 2017.
- [2] A. Valdes-Garcia, et al., "A Fully-Integrated Dual-Polarization 16-Element W-Band Phased Array Transceiver in SiGe BiCMOS," *IEEE RFIC*, pp. 375-378, 2013.
- [3] S.Y. Kim, et al., "A 76-84 GHz 16-Element Phased Array Receiver with a Chip-Level Built-in-Self-Test System," *IEEE TMTT*, pp. 3083-3098, 2013.
- [4] S. Shahramian, et al., "A 16-Element W-band Phased-Array Transceiver Chipset with Flip-Chip PCB Integrated Antennas for Multi-Gigabit Data Links," *IEEE RFIC*, pp. 27-30, 2015.



Figure 4.6.1: Simplified block diagram of the 16TX / 8RX W-band phased array chipset. The IC fully integrates a transceiver with analog base-band, beam lookup tables, self-test, self-calibration & multi-chip LO synchronization.



Figure 4.6.2: Block diagrams of the receiver (top) and transmitter (bottom) phased array elements. Each element incorporates local beam look-up tables, calibration memory, self-diagnostics, continuous health monitoring as well as 5-bit phase shifting capability.



Figure 4.6.3: (left) Block diagram of the integrated W-band PLL which incorporates a prime-ratio (5/3) LO multiplier to prevent VCO pulling as well as a 9b phase-shifter for multichip synchronization. The measured PLL locking range is 81.5 to 97.5GHz with an integrated jitter of <100fs between 12kHz and 20MHz (right).



Figure 4.6.4: (A) 3D model of 24-element antenna array (B) 3D radiation pattern with phase shift set to 0° (C) H-plane radiation patterns with phase shifts varied from -135° to +135° (D) Antenna array directivity, gain and efficiency.



Figure 4.6.5: (top) Block diagram of the integrated 44dB dynamic-range, 3GHz-bandwidth analog baseband. The DC offset cancellation path in conjunction with internal LO phase shifter is used for multichip alignment (bottom).



Figure 4.6.6: (top) Measured radiated broadband upper- and lower-sideband image rejection and loop-back performance between two IC stamps. (bottom) Comparison table with state-of-the-art W-band phased arrays.



Figure 4.6.7: (left) Die micrograph of the 16TX/8RX W-band phased-array IC, which is 8x4.4mm<sup>2</sup>. (right) Stamp drawing showing both 16TX/8RX and 16RX/8TX antenna configurations. The front and back of a manufactured module are shown, occupying a total of 10.15x10.15mm<sup>2</sup>.

## 4.7 A 64GHz Full-Duplex Transceiver Front-End with an On-Chip Multifeed Self-Interference-Canceling Antenna and an All-Passive Canceler Supporting 4Gb/s Modulation in One Antenna Footprint

Taiyun Chi, Jong Seok Park, Sensen Li, Hua Wang

Georgia Institute of Technology, Atlanta, GA

Millimeter-wave full-duplex (FD) transceivers (TRXs) have the potential to unlock the full throughput of future 5G links. A major challenge in mm-wave FD TRXs is to suppress the wideband modulated self-interference (SI) of the transmitter (TX) to its own receiver (RX) with ~100dB cancellation over a large instantaneous bandwidth. This often requires distributed cancellations across the antenna, RF, and digital domains [1].

The antenna-domain self-interference cancellation (SIC) is critical since it directly relaxes the RX linearity requirement [1]. Practical antenna SIC solutions should provide low loss, a compact footprint, and a large modulation bandwidth. Several antenna-interface SIC techniques have been demonstrated in silicon, each with advantages and limitations for wideband high mm-wave applications. First, reciprocal electrical-balance duplexers [2] suffer from 3dB signal loss, directly degrading TX output power ($P_{out}$)/efficiency and RX noise figure (NF). Second, non-reciprocal circulators [3] are also lossy in practice (~3dB) and require large LO driving power, which is challenging for high mm-wave bands. Finally, a 60GHz TX/RX antenna pair with an auxiliary reflective termination has been reported [1]; it requires frequency-dependent tuning and introduces 0.5 to 1dB additional loss for both TX and RX paths. Dual TX/RX antennas also occupy a large area and are undesirable for massive MIMOs.

The RF-domain SIC is equally important to further suppress the SI and relax the dynamic range of the downstream RX blocks, e.g., the ADC. For future FD massive MIMOs, desirable RF SIC solutions should offer ultra-low power, large dynamic range, orthogonal amplitude/phase tuning, and large instantaneous bandwidth.

In this paper, we propose a multifeed SIC antenna that provides a high TX-RX isolation (>35dB in measurement), an instantaneous broad bandwidth (60 to 75GHz), and no additional TX/RX-path signal loss, in only one antenna footprint. An all-passive zero-power RF canceler is integrated with nearly orthogonal amplitude/phase tunability to further enhance SIC. As a proof of concept, 2 FD TRX front-end chips are used to establish a 4Gb/s FD wireless link before using any digital SIC.

The proposed FD TRX front-end consists of an on-chip 4-feed SIC slot-loop antenna, 2 parallel TX paths, 2 parallel RX paths, and a passive canceler (Fig. 4.7.1). The 4-feed antenna supports 2 concurrent radiation modes with orthogonal polarizations, one for TX and one for RX. For the TX, 2 TX feeds are driven differentially, synthesizing a standing-wave voltage distribution on the slot loop. Due to the symmetry, the TX signals have 2 voltage nulls at the 2 RX input feeds, naturally providing high TX/RX antenna isolation. The 2 TXs are directly power-combined on the antenna for low-loss high-efficiency radiation [4], while the 2 RX feeds enable low-loss on-antenna power splitting to extend RX linearity and ensure its sensitivity. More importantly, the proposed antenna SIC relies on its symmetry and is inherently frequency independent, yielding an instantaneous broadband SIC with no frequency-tuning element. Phase shifters (PS) and variable-gain amplifiers (VGA) are integrated in the 2 sub-TXs to compensate their potential mismatch. Leveraging the SOI process high-resistivity substrate, the EM-simulated 4-feed antenna radiation efficiency is 91.2% at 70GHz with the 2 RX feeds loaded, even assuming 1dB amplitude and 10° phase mismatches between the 2 sub-TXs. This radiation efficiency closely matches that of a 2-feed antenna with only 2 TX feeds. Therefore, compared to the dual TX/RX antennas [1], the multifeed SIC antenna enables FD TX/RX operation in only one antenna footprint but without major loss penalty on TX  $P_{out}$  and RX NF.
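To illustrate why the phase shifters and VGAs in the two sub-TXs matter, the Python sketch below uses a toy two-path model: the two TX feeds are assumed to couple to an RX feed with equal magnitude and opposite sign, so any gain or phase mismatch between the sub-TXs leaves a residual. This model and the function name `residual_db` are assumptions made here for illustration; they are not the EM simulation used in the paper.

```python
import cmath
import math


def residual_db(gain_mismatch_db, phase_mismatch_deg):
    """Residual of two nominally cancelling paths, in dB relative to the
    coupling of a single path (toy model, assumed for illustration)."""
    g = 10.0 ** (gain_mismatch_db / 20.0)
    phi = math.radians(phase_mismatch_deg)
    residual = abs(1.0 - g * cmath.exp(1j * phi))
    return float("-inf") if residual == 0.0 else 20.0 * math.log10(residual)


print(round(residual_db(1.0, 10.0), 1))   # uncompensated 1dB / 10 degree case
print(round(residual_db(0.1, 1.0), 1))    # after a hypothetical calibration
print(residual_db(0.0, 0.0))              # perfect symmetry gives a full null
```

In this simplified picture, the depth of the null at the RX feeds is set entirely by how well the two sub-TXs are matched, which is consistent with the text's use of PS and VGA tuning to compensate sub-TX mismatch.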

The all-passive RF canceler is shown in Fig. 4.7.2. Its zero DC power consumption and high linearity are particularly suitable for massive MIMOs, where each RX element needs multiple cancelers to suppress its self- and neighbor-interference. A small TX replica feeds the RF canceler via a -20dB capacitive coupler, and is amplitude-scaled by cascaded reflection-type attenuators (RTA) and phase-shifted by reflection-type phase shifters (RTPS). Then, the cancellation signal is combined with the RX balun output by a Wilkinson combiner [1]. Both RTA and RTPS use wideband compact transformer-based 90° couplers (Fig. 4.7.2). The RTA is inherently phase-invariant during amplitude tuning, and it offers 0°/180° phase shift when the tunable NMOS resistors are set above/below the coupler characteristic impedance (50Ω). The simulated total attenuation and total phase shift of the RF canceler at 70GHz are shown in Fig. 4.7.2. Since the last-stage RTA adds 0°/180° phase shift, the required RTPS continuous phase tuning is reduced from 360° to 180°, enabling low loss variation (<3dB) during its phase shifting. Thus, the all-passive canceler supports nearly orthogonal and continuous amplitude/phase tuning, greatly easing the canceler adjustment.
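The 0°/180° behavior of the RTA follows from the sign of the reflection coefficient of its resistive loads. The Python sketch below assumes an ideal, lossless 90° coupler terminated by two identical resistors, so that the through response is proportional to $\Gamma = (R - Z_0)/(R + Z_0)$. The function name and the example resistor values are hypothetical.

```python
import math

Z0_OHM = 50.0  # coupler characteristic impedance (from the text)


def rta_response(r_load_ohm):
    """Attenuation (dB) and relative phase (degrees) of an ideal RTA.

    Assumes a lossless coupler and identical purely resistive loads.
    """
    gamma = (r_load_ohm - Z0_OHM) / (r_load_ohm + Z0_OHM)
    atten_db = float("inf") if gamma == 0.0 else -20.0 * math.log10(abs(gamma))
    phase_deg = 0.0 if gamma >= 0.0 else 180.0
    return atten_db, phase_deg


for r in (150.0, 75.0, 50.0 / 3.0):   # hypothetical NMOS resistance settings
    print(r, rta_response(r))
```

Because the load is real, tuning the resistance changes only the magnitude of $\Gamma$ until it crosses $Z_0$, where the sign flips. This is the sense in which the RTA is phase-invariant during amplitude tuning while still offering a discrete 180° step.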

To support Gb/s wireless communication with a sufficient link budget, large TX  $P_{out}$  and high efficiency are essential. In the TX, each sub-PA contains one common-source driver and one cascode output stage both with neutralization (Fig. 4.7.3). The differential sub-TX output signals at the 2 TX antenna feeds are generated by exchanging the signal/ground terminals of the PA output transformers, which are co-designed with the on-chip antenna for broadband matching (Fig. 4.7.3). The TX PS is a varactor-loaded transmission line, providing a simulated 20° phase tuning range with <0.5dB loss variation at 70GHz. The VGA is a one-stage cascode amplifier with a PMOS tunable load. An extra 400Ω load reduces the output quality factor and minimizes VGA phase variation over gain settings. The simulated gain tuning range is 1 to 6dB with <3° phase variation. The small PS loss variation and small VGA phase variation also enable nearly orthogonal amplitude/phase tuning and fast sub-TX mismatch calibrations. In the RX, the LNA has 2 cascode stages with input inductive degeneration, and the 2 sub-RX outputs are combined by a broadband balun.

The FD TRX front-end was implemented in a 45nm CMOS SOI process. The individual TX/RX performance was measured first (Fig. 4.7.4). In the TX continuous-wave (CW) tests, a horn antenna and a power sensor measured the TX EIRP in the far field. The 3D EM-simulated antenna gain was used to calculate TX $P_{out}$ from the measured EIRP. At 64GHz, the TX achieved 18.5dBm $P_{sat}$, 16.5dBm $P_{1dB}$, 23.5% PAE<sub>max</sub>, and 19.4% PAE<sub>1dB</sub>, demonstrating high efficiency and linearity. The $P_{sat}$ 1dB bandwidth is 62 to 71GHz. The TX also supports high-quality 1Gsym/s (6Gb/s) 64-QAM wireless transmission with no digital pre-distortion (DPD). In the individual RX tests, a far-field horn antenna transmits a CW signal to the FD TRX chip. The RX output is amplified by an external LNA, down-converted, and monitored by a spectrum analyzer. The RX achieves a measured gain of 10.9dB and a minimum NF of 4.8dB.

Next, both TX and RX were turned on to measure the SIC over frequency using a CW signal (Fig. 4.7.5). An FD TRX chip without RF canceler was used to test the individual antenna SIC, showing measured antenna SIC >35dB at 60 to 75GHz. Then, an FD TRX chip with RF canceler was measured. Under 3 different canceler settings, the total antenna+RF SIC was >60dB at 63 to 65GHz, 65 to 66GHz, and 71.7 to 72.3GHz, supporting reconfigurable and instantaneous wideband SIC.

Finally, an FD link was demonstrated over 0.5m between 2 FD TRX chips (Fig. 4.7.5). Chip1 transmits the target modulated signal, while Chip2 simultaneously transmits an independent signal as TX SI with the same modulation scheme and rate at the same carrier frequency. The demodulated Chip2 RX outputs demonstrate successful FD communication using 4Gb/s 16-QAM and 3Gb/s 64-QAM with no digital SIC or DPD. If a symmetric Gb/s mm-wave FD link is needed, digital SIC (typically with ~20dB SIC [1]) can be added to readily equalize the two chips' EIRP. A performance summary and comparison with state-of-the-art mm-wave FD TRX front-ends is shown in Fig. 4.7.6.
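The quoted data rates map to symbol rates through the number of bits per QAM symbol. The short Python sketch below makes that relation explicit; it assumes raw, uncoded bit rates, and the function name is hypothetical.

```python
import math


def symbol_rate_gsym(bit_rate_gbps, qam_order):
    """Symbol rate in Gsym/s for a raw bit rate and a square QAM order."""
    return bit_rate_gbps / math.log2(qam_order)


print(symbol_rate_gsym(4.0, 16))   # FD link, 16-QAM
print(symbol_rate_gsym(3.0, 64))   # FD link, 64-QAM
print(symbol_rate_gsym(6.0, 64))   # TX-only test, 64-QAM
```

Under this assumption, the 4Gb/s 16-QAM link runs at the same 1Gsym/s symbol rate as the 6Gb/s TX-only test, while the 3Gb/s 64-QAM link corresponds to half that rate.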

*Acknowledgements:* The authors would like to thank members of the Georgia Tech GEMS Lab for helpful technical discussions and GlobalFoundries for chip fabrication.

### References:

- [1] T. Dinc, et al., "A 60GHz CMOS Full-Duplex Transceiver and Link with Polarization-Based Antenna and RF Cancellation," *IEEE JSSC*, vol. 51, no. 5, pp. 1125-1140, May 2016.
- [2] B. van Liempd, et al., "A +70dBm IIP3 Single-Ended Electrical Balance Duplexer in 0.18μm SOI CMOS," *ISSCC*, pp. 32-33, Feb. 2015.
- [3] T. Dinc and H. Krishnaswamy, "A 28GHz Magnetic-Free Non-Reciprocal Passive CMOS Circulator Based on Spatio-Temporal Conductance Modulation," *ISSCC*, pp. 294-295, Feb. 2017.
- [4] T. Chi, et al., "A 60GHz On-Chip Linear Radiator with Single-Element 27.9dBm  $P_{sat}$  and 33.1dBm Peak EIRP Using Multifeed Antenna for Direct On-Antenna Power Combining," *ISSCC*, pp. 296-297, Feb. 2017.



Figure 4.7.1: System architecture, EM-simulated antenna radiation efficiency, and full-duplex operation principles of the on-chip multifeed SIC antenna.



Figure 4.7.2: Schematic and 3D EM model of the all-passive canceler, and simulated canceler total attenuation and phase shift at 70GHz, showing nearly orthogonal amplitude/phase tuning.



Figure 4.7.3: Schematic of the sub-PA, TX PS, TX VGA, RX LNA, and simulated sub-PA load impedance ($Z_L$) including the sub-PA output capacitances and on-chip SIC antenna load.



Figure 4.7.4: Measured individual TX and RX CW performance, and measured modulation results for 1Gsym/s (6Gb/s) 64-QAM signal in the TX radiation mode.



Figure 4.7.5: Measured antenna SIC and total SIC, and a wireless FD link demonstration over 0.5m between two FD TRX chips.

|                  |                                          | This Work                                                      | [1]                                                                |
|------------------|------------------------------------------|----------------------------------------------------------------|--------------------------------------------------------------------|
| Implementation   | Antenna-Domain SIC                       | Multifeed SIC Antenna in One Architecture                      |                                                                    |
|                  | Technology                               | 45nm CMOS SOI                                                  | 45nm CMOS SOI                                                      |
|                  | Size                                     | 2.7mm $\times$ 2.7mm                                           | 1.3mm $\times$ 3.4mm (TRX Chip)<br>2.4mm $\times$ 3.4mm (PCB Ant.) |
| TRX Metrics      | Frequency Range                          | 60-75GHz                                                       | 57-66GHz                                                           |
|                  | TX $P_{\text{sat}}$, Peak PAE            | 18.5dBm, 23.5%                                                 | 14.3dBm, 18.9%                                                     |
|                  | TX $P_{\text{1dB}}$                      | 16.5dBm                                                        | N/A                                                                |
|                  | RX Minimum NF                            | 4.8dB                                                          | 4.52dB                                                             |
| FD Metrics       | Antenna SIC                              | >35dB (60-75GHz)                                               | N/A                                                                |
|                  | Antenna + RF SIC                         | >60dB (Reconfigurable, 63-65GHz, 65-66GHz, 71.7-72.3GHz, etc.) | >65dB (58.5-59.5GHz)                                               |
|                  | Additional TX/RX Loss Due to Antenna SIC | ~0 (Sim.) / ~0 (Sim.)                                          | 1.1dB (Sim.) / 0.52dB (Sim.)                                       |
|                  | Canceler DC Power                        | 0                                                              | 44mW                                                               |
| FD Demonstration | Link Distance                            | 0.5m\*\*\*                                                     | 0.8m\*\*\*                                                         |
|                  | Wireless Link Setup                      | FD Chip to FD Chip                                             | Horn Ant. to FD Chip                                               |
|                  | Carrier Frequency                        | 63.1GHz                                                        | 63.1GHz                                                            |
|                  | Target Received Signal                   | 4Gb/s                                                          | 3Gb/s                                                              |
|                  | TX SI                                    | 64-QAM                                                         | 64-QAM                                                             |
|                  | Target Signal and TX SI EIRP Difference  | 7dB\*\*\*                                                      | 4dB\*\*\*                                                          |
|                  | RX Output                                | 15.9dB SINR                                                    | 16.2dB SINR                                                        |
|                  | Off-Chip Digital SIC                     | No                                                             | No                                                                 |

\* Graphically estimated.  
\*\* Includes the simulated signal loss due to the auxiliary reflective termination on the dual TX/RX antennas.  
\*\*\* Adding digital SIC (typically with ~20dB SIC) will further extend the FD link distance. Digital SIC will also equalize the EIRP difference between the target signal and TX SI, if a symmetric FD link is needed.

Figure 4.7.6: Performance summary and comparison with state-of-the-art mm-wave FD TRX front-end.



Figure 4.7.7: Die micrographs of the FD TRX front-end with RF canceler (left) and without RF canceler (right).

# Session 5 Overview: *Image Sensors*

## IMMD SUBCOMMITTEE



**Session Chair:**  
**Hayato Wakabayashi**  
*Sony Electronics, San Jose, CA*



**Associate Chair:**  
**Makoto Ikeda**  
*University of Tokyo, Tokyo, Japan*

**Subcommittee Chair: Makoto Ikeda, University of Tokyo, Tokyo, Japan**

The session presents advances in image sensors covering BSI, global shuttering, organic photoconductive film, pixel scaling, dynamic vision, high frame rate imaging, 3D time-of-flight, and SPADs. The first paper, by Sony, presents a BSI global shutter with in-pixel ADC. Then, Panasonic presents a global shutter using organic film with in-pixel noise cancellation. Samsung presents a 0.9 $\mu$ m pixel with complete deep-trench isolation. Sony presents a low-power event-driven imager with motion detection. TSMC presents a 13.5Mpixel BSI image sensor with a readout subsampling architecture that allows 514fps at 720p. NHK presents a high-speed image sensor achieving 8K video up to 480fps. Toshiba presents a LiDAR SoC enabling range measurements up to 200m. Microsoft presents a BSI time-of-flight image sensor with 3.5 $\mu$ m global-shutter pixels with modulation frequencies up to 320MHz. Delft University presents a direct time-of-flight image sensor with modular SPAD-based pixel arrays fabricated in 3D-stacked 45/65nm CMOS. Finally, FBK presents a SPAD array coupled with TDCs to measure spatial correlations of entangled photons at a rate of 800kHz.



**1:30 PM**  
**5.1 A Back-Illuminated Global-Shutter CMOS Image Sensor with Pixel-Parallel 14b Subthreshold ADC**

*M. Sakakibara, Sony Semiconductor Solutions, Atsugi, Japan*

In Paper 5.1, Sony presents a 1.46MP BSI global shutter CMOS image sensor using a pixel-parallel single-slope ADC. Using a Cu-Cu bonding pixel unit, positive feedback, and the digital bucket relay of a repeater through multistage flip-flop connection, all pixels are converted simultaneously with a 14b single-slope ADC having a size of 6.9 $\times$ 6.9 $\mu$ m<sup>2</sup>, in a subthreshold region with operating current of 7.74nA.



**2:00 PM**  
**5.2 An 8K4K-Resolution 60fps 450ke- Saturation-Signal Organic-Photoconductive-Film Global-Shutter CMOS Image Sensor with In-Pixel Noise Canceller**

*K. Nishimura, Panasonic, Moriguchi, Japan*

In Paper 5.2, Panasonic presents an 8K4K resolution organic photoconductive film CMOS image sensor operating in both rolling shutter and global shutter mode at 60fps using in-pixel capacitive-coupled noise cancellation. The noise canceller is also used to expand the full-well capacity up to 450ke- with a density of 50ke-/ $\mu$ m<sup>2</sup>, which is 10dB higher than that of a silicon global-shutter CMOS image sensor.



**2:15 PM**  
**5.3 A 1/2.8-inch 24Mpixel CMOS Image Sensor with 0.9 $\mu$ m Unit Pixels Separated by Full-Depth Deep-Trench Isolation**

*W. Choi, Samsung Electronics, Hwaseong, Korea*

In Paper 5.3, Samsung presents a 1/2.8-inch 24Mpixel CMOS image sensor with 0.9 $\mu$ m unit pixels separated by full-depth DTI. Full-well capacity is increased up to 6,000e-, which is even larger than a conventional 1.0 $\mu$ m pixel; dark noise characteristics are also improved. Better optical performance is also achieved by using an optimized back-side design and a higher aspect ratio of full-depth DTI.



**2:30 PM**  
**5.4 A 1/4-inch 3.9Mpixel Low-Power Event-Driven Back-Illuminated Stacked CMOS Image Sensor**

*O. Kumagai, Sony Semiconductor Solutions, Atsugi, Japan*

In Paper 5.4, Sony presents a 1/4-inch 3.9Mpixel low-power event-driven back-illuminated stacked CMOS image sensor deployed with a readout circuit that detects motion for each pixel under lighting conditions from 1lux to 64,000lux. Utilizing pixel summation in a shared floating diffusion (FD) for each pixel block, moving object detection is realized at 10fps, while consuming only 1.1mW, a 99% reduction in power from the full resolution 60fps power of 95mW.



**3:15 PM**  
**5.5 A 1.1μm-Pitch 13.5Mpixel 3D-Stacked CMOS Image Sensor Featuring 230fps Full-High-Definition and 514fps High-Definition Videos by Reading 2 or 3 Rows Simultaneously Using a Column-Switching Matrix**

*P.-S. Chou, TSMC, Hsinchu, Taiwan*

In Paper 5.5, TSMC presents a new architecture to achieve 4x and 9x higher frame rates for 4-to-1 and 9-to-1 subsampled videos implemented in a 1.1μm pitch, 13.5MP 3D-stacked CMOS image sensor using 1 bank of column ADCs. A digitally controlled column-switching matrix combined with a hard-wired vertical signal routing is designed to utilize all the ADCs in subsampling operation.



**3:30 PM**  
**5.6 A 2.1μm 33Mpixel CMOS Imager with Multi-Functional 3-Stage Pipeline ADC for 480fps High-Speed Mode and 120fps Low-Noise Mode**

*T. Yasue, NHK Science & Technology Research Laboratories, Tokyo, Japan*

In Paper 5.6, NHK presents a 2.1μm 33Mpixel CMOS image sensor for 8K video using a column-parallel 3-stage pipeline ADC composed of Folding-Integration (FI), dual-cyclic and SAR. In the 120fps 14b mode, the 6-times sampling in the FI and digital CDS reduce random noise and VFPN to 3.2e- and 0.24e-, respectively. In the 480fps mode, the dual-cyclic and SAR achieve 480fps operation.



**3:45 PM**  
**5.7 A 20ch TDC/ADC Hybrid SoC for 240×96-Pixel 10%-Reflection <0.125%-Precision 200m-Range Imaging LiDAR with Smart Accumulation Technique**

*K. Yoshioka, Toshiba, Kawasaki, Japan*

In Paper 5.7, Toshiba presents a TDC/ADC hybrid LiDAR SoC with a smart accumulation technique (SAT) to achieve 200m range imaging with 240×96 pixel resolution for reliable self-driving systems. The SAT using ADC information enhances the effective pixel resolution with an accumulation activated by recognizing only the target reflection, while the hybrid architecture enables a wide measurement range from 0 to 200m.



**4:15 PM**  
**5.8 1Mpixel 65nm BSI 320MHz Demodulated TOF Image Sensor with 3.5μm Global Shutter Pixels and Analog Binning**

*C. S. Bamji, Microsoft, Mountain View, CA*

In Paper 5.8, Microsoft presents a 1024×1024 Time-of-Flight image sensor with 3.5×3.5μm<sup>2</sup> global shutter pixels with analog binning in a TSMC 65nm 1P8M BSI CMOS image sensor process with modulation frequencies of up to 320MHz. The pixels have modulation contrast of 87%@200MHz, 78%@320MHz and QE=44%@860nm, while the readout chain implements adaptive gain with either 9b 3.4GS/s or 10b 1.7GS/s ADC.



**4:30 PM**  
**5.9 A 256×256 45/65nm 3D-Stacked SPAD-Based Direct TOF Image Sensor for LiDAR Applications with Optical Polar Modulation for up to 18.6dB Interference Suppression**

*A. Ronchini Ximenes, Delft University of Technology, Delft, The Netherlands*

In Paper 5.9, Delft presents a direct time-of-flight image sensor with modular 2×8×8 SPAD-based pixel arrays, 14b range, 500μW, 60ps always-on TDCs shared through 6-level decision trees, in-pixel 21b memories, and in-locus data processing, fabricated in a 3D-stacked 45/65nm CMOS technology. A maximum distance of 430m and a worst-case accuracy of 0.4% were recorded, while 256×256 3D images were obtained through laser scanning.



**4:45 PM**  
**5.10 A 32×32-Pixel Time-Resolved Single-Photon Image Sensor with 44.64μm Pitch and 19.48% Fill-Factor with On-Chip Row/Frame Skipping Features Reaching 800kHz Observation Rate for Quantum Physics Applications**

*L. Gasparini, Fondazione Bruno Kessler (FBK), Trento, Italy*

In Paper 5.10, FBK presents a 32×32-pixel image sensor fabricated in standard 150nm CMOS for the measurement of spatial correlations of entangled photons. Each 44.64μm pixel includes a SPAD coupled to a 205ps 8b TDC to correlate simultaneous photons. On-chip mechanisms allow the sensor to observe events at 800kHz and to read out at 250kfps, skipping irrelevant data.

## 5.1 A Back-Illuminated Global-Shutter CMOS Image Sensor with Pixel-Parallel 14b Subthreshold ADC

Masaki Sakakibara<sup>1</sup>, Koji Ogawa<sup>1</sup>, Shin Sakai<sup>1</sup>, Yasuhisa Tochigi<sup>1</sup>, Katsumi Honda<sup>1</sup>, Hidekazu Kikuchi<sup>1</sup>, Takuya Wada<sup>1</sup>, Yasunobu Kamikubo<sup>1</sup>, Tsukasa Miura<sup>1</sup>, Masahiko Nakamizo<sup>1</sup>, Naoki Jyo<sup>2</sup>, Ryo Hayashibara<sup>2</sup>, Yohei Furukawa<sup>3</sup>, Shinya Miyata<sup>3</sup>, Satoshi Yamamoto<sup>1</sup>, Yoshiyuki Ota<sup>1</sup>, Hirotsugu Takahashi<sup>1</sup>, Tadayuki Taura<sup>1</sup>, Yusuke Oike<sup>1</sup>, Keiji Tatani<sup>1</sup>, Takashi Nagano<sup>1</sup>, Takayuki Ezaki<sup>1</sup>, Teruo Hirayama<sup>1</sup>

<sup>1</sup>Sony Semiconductor Solutions, Atsugi, Japan

<sup>2</sup>Sony Semiconductor Manufacturing, Kumamoto, Japan

<sup>3</sup>Sony LSI Design, Fukuoka, Japan

Rolling-shutter CMOS image sensors (CISs) are widely used [1,2]. However, the distortion of moving subjects remains an unresolved problem, regardless of the speed at which these sensors are operated. It has been reported that by adopting in-pixel analog memory (MEM), a global shutter (GS) can be achieved by saving all pixels simultaneously as stored charges [3,4]. However, as signals from a storage unit are read in a column-wise sequence, a light-shielding structure is required for the MEM to suppress the influence of parasitic light during the reading period. Pixel-parallel ADCs have been reported as a circuit-level method of implementing GS [5,6]. However, these techniques have not scaled to megapixel arrays because they do not address two issues that grow with the number of pixels: the timing constraint for writing and reading digital signals to and from the in-pixel ADCs, and the total power consumption of the massively parallel comparators (CMs).

In this paper, we report a stacked back-illuminated CIS with 1.46Mpixel pixel-parallel 14b ADCs using pixel-level bonding technology. Implementing a positive feedback (PFB) circuit in the CM allows it to operate in the subthreshold region with a current of 7.74nA. Moreover, we use a repeater circuit in which flip-flops (FFs) are connected in successive stages. The propagation delay, which includes the sense-amp in the conventional method [5,6], is reduced to the delay between FFs. This structure enables the 15b signals of the 408 repeater columns to be read out in parallel at 148.5MHz.

Figure 5.1.1 shows the simplified block diagram of this prototype. It consists of a pixel wafer on which light is incident from the back, and a logic wafer that performs signal processing. The pixel wafer consists of 1632×896 pixels. The size of a single pixel is 6.9×6.9μm<sup>2</sup>. It contains a part of the CM along with a bias circuit that generates current ( $I_{cm}$ ). In the logic wafer, a part of the CM and a 15b latch for digital memory are arranged in each ADC, and 408 repeaters write and read gray codes (GCs) in the vertical direction for each column. In addition, the wafer is equipped with one frame of SRAM for CDS and Scalable Low Voltage Signaling with Embedded Clock (SLVS-EC) for north and south I/Os. A vertical driver to drive the logic block, a global driver to drive the pixel block, and a DAC to generate slope voltage are also implemented.

Figure 5.1.2 shows a simplified schematic of the pixel-parallel ADC and repeater. The part of the CM in the pixel has a differential configuration and comprises AMP-Tr, whose gate is connected to the FD; REF-Tr; BIAS-Tr for generating the bias current; TG-Tr for charge transfer; and OFG-Tr for discharging photo-electrons. The pixel is reset by RST-Tr, and the fixed-pattern noise (FPN) of each pixel and comparator is stored in the FD together with the reset kT/C noise. The part of the CM in the pixel is connected to a PMOS current mirror on the logic wafer, and together they function as the CM. The short signal return path and the differential structure of the CM provide immunity to common-mode noise. As the supply voltage of the CM affects the initial FD voltage of the pixel, it is set to 2.9V to ensure sufficient dynamic range. The first-stage output node ($V_{cr}$) of the CM is connected to the high-voltage PMOS, and its drain is connected to one NMOS Tr with the gate connected to the logic power supply (1.1V) to limit the voltage swing. This contributes to circuit area reduction because all subsequent circuits can use low-voltage Trs. PFB is applied to the floating node ($V_{pfb}$) of the succeeding CM. The floating node is charged at low speed by the subthreshold current of the high-voltage PMOS during AD conversion. It is charged more rapidly when positive feedback is applied to the low-voltage PMOS connected to the VDDL after the threshold of the NOR circuit is exceeded; this leads to a high-speed transition. High PVT tolerance is achieved using a static latch instead of DRAM [5,6] for the signal storage element in the latter stage of the CM.

The GC writing and reading operations to the latch are carried out by a repeater. The repeater comprises 28 cascaded stages of 15b FFs, each of which is part of a cluster. A cluster is formed by a 15b FF stage together with 128 pixels. The code is supplied to the latch through a digital bucket relay (DBR). The GC generated on the side opposite to the ADCK is input to the repeater via the FF. The GC has timing error tolerance because only 1b transitions per ADCK cycle, and it keeps the power supply fluctuation constant during the operation. Starting from the side closer to the GC up to the output, the code is shifted by 1 LSB per cluster, with a maximum of 27 LSBs for 28 stages, which forms an FPN in this configuration. However, this FPN is canceled along with the reset noise and the other circuit FPN in the CDS.
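The repeater hierarchy can be cross-checked against the array size with a few lines of arithmetic. The sketch below uses only figures quoted in the text (408 repeaters, 28 stages per repeater, 128 pixels per cluster, 1632×896 pixels); the assumption that every pixel belongs to exactly one cluster is ours and is not stated in the text.

```python
# Figures quoted in the text for the pixel wafer and the repeater hierarchy.
H_PIXELS, V_PIXELS = 1632, 896
REPEATERS = 408
STAGES_PER_REPEATER = 28   # cascaded 15b FF stages, one per cluster
PIXELS_PER_CLUSTER = 128


def pixels_from_array() -> int:
    """Pixel count derived from the array dimensions."""
    return H_PIXELS * V_PIXELS


def pixels_from_clusters() -> int:
    """Pixel count derived from the repeater/cluster hierarchy
    (assumes each pixel belongs to exactly one cluster)."""
    return REPEATERS * STAGES_PER_REPEATER * PIXELS_PER_CLUSTER


def max_cluster_offset_lsb(stages: int = STAGES_PER_REPEATER) -> int:
    """1 LSB of shift per cluster: the first cluster sees no shift,
    so the last of `stages` clusters sees stages - 1 LSBs."""
    return stages - 1


print(pixels_from_array(), pixels_from_clusters(), max_cluster_offset_lsb())
```

Both counts come out to 1,462,272 pixels, which matches the 1.46-megapixel figure, and the largest cluster offset is 27 LSBs.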

The timing waveform is shown in Fig. 5.1.3. First, the DAC is set to the initial voltage, and the CM is reset using RST. The PFB part is initialized by controlling INI, and writing is enabled by WEN. Then, the slope begins. ADCK is supplied to the repeater and data are written to the latch. The 15b GC is transferred through the DBR of the repeater. When the FD voltage and the DAC slope voltage reach the same potential, the CM flips and the GC of the reset level is stored in the latch by terminating the writing of the GC. To suppress the collision due to the CM not flipping at the time of signal reading, the CM is flipped by controlling FORCE at the end of the slope. By controlling WORD[127:0] and REN, the data of each latch are read through the repeater, and the signal is stored in SRAM as a reset signal after conversion from the GC to binary code (BC). TG-Tr is then driven to transfer the charge of the PD to the FD, and similar processing is carried out to obtain the signal level. The CM is flipped with the voltage corresponding to the photo signal level, and the GC is stored in the latch. Likewise, data are output to the SRAM through the repeater. CDS is performed on the reset BC that is read from SRAM during the conversion of the signal from GC to BC. The calculated BC is written back into SRAM as 14b BC. Then, this 14b BC is output off-chip through the SLVS-EC interface.
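The GC-to-BC conversion and the digital CDS step can be illustrated with a minimal sketch. It uses the standard reflected-binary Gray code; the function names, the example counts, and the sign convention (signal minus reset) are assumptions for illustration, and the mapping from the 15b code to the 14b output is not modeled.

```python
def binary_to_gray(n: int) -> int:
    """Encode a non-negative integer as reflected-binary Gray code."""
    return n ^ (n >> 1)


def gray_to_binary(g: int) -> int:
    """Decode reflected-binary Gray code by XOR-ing all right shifts."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b


def cds(reset_gc: int, signal_gc: int) -> int:
    """Digital CDS: decode both latched Gray codes and subtract
    (signal minus reset is an assumed sign convention)."""
    return gray_to_binary(signal_gc) - gray_to_binary(reset_gc)


# Hypothetical counts: the same per-cluster offset appears in both
# conversions, so it drops out of the difference.
cluster_offset = 5
reset_count, signal_count = 100, 1100
reset_gc = binary_to_gray(reset_count + cluster_offset)
signal_gc = binary_to_gray(signal_count + cluster_offset)
net = cds(reset_gc, signal_gc)
print(net)  # 1000, independent of cluster_offset
```

Because consecutive Gray codes differ in exactly one bit, a latch that closes during a code transition captures either the old or the new value, which is the timing tolerance the text refers to.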

By setting the DAC to low voltage when AD conversion is not being performed, the current of the CM can be turned off, thus reducing the average power consumption. As a result, an operation current of 7.74nA during active time corresponds to a time-averaged current of 1.67nA at 660fps.
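The ratio of the two currents gives the effective duty cycle of the comparator. The sketch below assumes the comparator current is constant while active and zero otherwise, which is a simplification of ours; the 111nA and 23.9nA values are taken from Fig. 5.1.5.

```python
FRAME_RATE_FPS = 660


def duty_cycle(active_current_na: float, average_current_na: float) -> float:
    """Fraction of time the comparator draws its active current,
    assuming zero current when it is turned off."""
    return average_current_na / active_current_na


frame_period_us = 1e6 / FRAME_RATE_FPS

duty_low = duty_cycle(7.74, 1.67)    # low-current setting
duty_high = duty_cycle(111.0, 23.9)  # high-current setting (Fig. 5.1.5)

# Active time per frame implied by the duty cycle.
active_time_us = duty_low * frame_period_us

print(f"duty cycle: {duty_low:.3f} / {duty_high:.3f}")
print(f"frame period: {frame_period_us:.0f} us, active: {active_time_us:.0f} us")
```

Both bias settings give a duty cycle of roughly 0.22, which is consistent with the same conversion timing being used in both cases.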

Figure 5.1.4 shows the captured image, demonstrating that the pixel-parallel subthreshold ADC is successfully realized. Figure 5.1.5 summarizes the characteristics of the chip for two bias current settings. Figure 5.1.6 shows the relationship among power, noise, dynamic range, and ADC resolution using two figures of merit (FOMs). The best performance of FOM2 is achieved at a comparator operation current of 111nA. These results imply that sensor performance can be adjusted considerably using the comparator current.

The die micrograph is shown in Fig. 5.1.7.
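The trade-off between comparator current and dynamic range can be checked against the values in Fig. 5.1.5. The sketch assumes the common definition of dynamic range as the ratio of saturation signal to rms random noise; the text does not state which definition is used, so the small difference from the tabulated values should not be over-interpreted.

```python
import math

SATURATION_E = 16_600  # saturation signal [e-], Fig. 5.1.5


def dynamic_range_db(saturation_e: float, noise_e_rms: float) -> float:
    """Dynamic range as 20*log10(saturation / rms noise) (assumed definition)."""
    return 20.0 * math.log10(saturation_e / noise_e_rms)


dr_low = dynamic_range_db(SATURATION_E, 8.77)   # 7.74nA setting, table: 65.7dB
dr_high = dynamic_range_db(SATURATION_E, 5.15)  # 111nA setting, table: 70.2dB

print(f"{dr_low:.1f} dB, {dr_high:.1f} dB")
```

Under this assumption the computed values agree with the reported 65.7dB and 70.2dB to within about 0.2dB.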

A CIS with a 1.46-megapixel parallel subthreshold ADC is developed and tested. The subthreshold comparator with PFB helps in reducing comparator operation current and minimizes circuit area to reduce power consumption. The logic part is dominant in the power, and this is likely to be improved by power saving technologies, such as clock gating, which have not yet been implemented.

### References:

- [1] T. Yasue, et al., "A 14-bit, 33-Mpixel, 120-fps Image Sensor with DMOS Capacitors in 90-nm/65-nm CMOS," *ISSW*, pp. 200–203, June 2015.
- [2] A. Suzuki, et al., "A 1/1.7-inch 20Mpixel Back-Illuminated Stacked CMOS Image Sensor for New Imaging Applications," *ISSCC Dig. Tech. Papers*, pp. 110–111, Feb. 2015.
- [3] Y. Oike, et al., "8.3 M-Pixel 480-fps Global-Shutter CMOS Image Sensor with Gain-Adaptive Column ADCs and Chip-on-Chip Stacked Integration," *IEEE JSSC*, vol. 52, no. 4, pp. 985–993, Apr. 2017.
- [4] M. Kobayashi, et al., "4.5 A 1.8e-rms Temporal Noise over 110dB Dynamic Range 3.4μm Pixel Pitch Global Shutter CMOS Image Sensor with Dual-Gain Amplifiers, SS-ADC and Multiple-Accumulation Shutter," *ISSCC Dig. Tech. Papers*, pp. 74–75, Feb. 2017.
- [5] S. Kleinfelder, et al., "A 10,000 Frames/s CMOS Digital Pixel Sensor," *IEEE JSSC*, vol. 36, no. 12, pp. 2049–2059, Dec. 2001.
- [6] H. Sugo, et al., "A Dead-Time Free Global Shutter CMOS Image Sensor with In-Pixel LOFIC and ADC Using Pixel-Wise Connections," *IEEE Symp. VLSI Circuits*, pp. C224–C225, June 2016.



Figure 5.1.1: Simplified block diagram of pixel-parallel ADC.



Figure 5.1.2: Simplified circuit schematic of pixel-parallel ADC and block diagram of repeater.



Figure 5.1.3: One-frame timing diagram of pixel-parallel ADC.



Figure 5.1.4: Captured image at bias current of 7.74nA/CM.

| Item                                           | Data                                     |                                               |
|------------------------------------------------|------------------------------------------|-----------------------------------------------|
| Process                                        | CIS wafer: 90nm 1 Poly 4 Metal Layer     | Logic wafer: 65nm 1 Poly 7 Metal Layer        |
| Supply Voltage                                 | 2.9 [V]                                  | 1.1 [V]                                       |
| Num. of pixels                                 | 1632 <sup>(H)</sup> x 896 <sup>(V)</sup> |                                               |
| Pixel size                                     | 6.9 [ $\mu\text{m}$ ]                    | x 6.9 [ $\mu\text{m}$ ]                       |
| Output interface                               | 16ch x 4.752 [Gbps/ch]                   | SLVS-EC                                       |
| Max frame rate                                 | 660 [fps]                                |                                               |
| Saturation signal                              | 16.6k [e-]                               |                                               |
| Sensitivity                                    | 61,500 [e-/lx· s]                        | (green pixel, 3200K light with IR cut filter) |
| PLS                                            | -75 [dB]                                 |                                               |
| Conversion gain                                | 60 [ $\mu\text{V/e-}$ ]                  |                                               |
| Comparator operation current                   | 7.74 [nA]                                | 111 [nA]                                      |
| Comparator current<br>(time average @660[fps]) | 1.67 [nA]                                | 23.9 [nA]                                     |
| Power consumption                              | 654 [mW]                                 | 746 [mW]                                      |
| Rms random noise<br>@Analog Gain 0[dB]         | 8.77 [e <sub>rms</sub> ]                 | 5.15 [e <sub>rms</sub> ]                      |
| Dynamic range                                  | 65.7 [dB]                                | 70.2 [dB]                                     |
| ADC resolution                                 | 14 [bit]                                 |                                               |

Figure 5.1.5: Chip characteristics.



Figure 5.1.6: Performance comparison.



Figure 5.1.7: Die micrograph and the part of cross-section.

## 5.2 An 8K4K-Resolution 60fps 450ke<sup>-</sup>Saturation-Signal Organic-Photoconductive-Film Global-Shutter CMOS Image Sensor with In-Pixel Noise Canceller

Kazuko Nishimura<sup>1</sup>, Sanshiro Shishido<sup>1</sup>, Yasuo Miyake<sup>1</sup>, Masaaki Yanagida<sup>1</sup>, Yoshiaki Satou<sup>1</sup>, Makoto Shouho<sup>1</sup>, Hidenari Kanehara<sup>1</sup>, Ryota Sakaida<sup>1</sup>, Yoshihiro Sato<sup>1</sup>, Junji Hirase<sup>1</sup>, Yuko Tomekawa<sup>1</sup>, Yutaka Abe<sup>2</sup>, Hiroshi Fujinaka<sup>2</sup>, Yoshiyuki Matsunaga<sup>3</sup>, Masashi Murakami<sup>1</sup>, Mitsuru Harada<sup>1</sup>, Yasunori Inoue<sup>1</sup>

<sup>1</sup>Panasonic, Moriguchi, Japan

<sup>2</sup>Panasonic Semiconductor Solutions, Nagaokakyō, Japan

<sup>3</sup>Kyoto, Japan

There is a growing demand for high-resolution and high-reality cameras for use in broadcasting, surveillance, and various other systems. Previous papers report on research and development of 8K ultra-high-definition television (UHDTV) systems, 8K full-resolution cameras [1], and 8K 240fps cameras that employ stacked sensors [2]. In these camera systems, a rolling-shutter method is used for scanning, since a global-shutter method has an area tradeoff between the photoelectric conversion region and the charge storage region [3-5]. However, this leads to a shutter distortion problem during high-speed imaging and synchronization of multi-viewpoint imaging. To overcome this problem, a CMOS image sensor is developed that has an 8K4K resolution, a 60fps frame rate, and a 450ke<sup>-</sup> saturation signal, with an organic photoconductive film (OPF) laminated on the pixel circuits. Even with small (e.g., 3μm) pixels, a global shutter can be realized without degradation of the saturation signal [6]. However, there still remains a requirement to achieve 8K4K resolution at 60fps readout speed. Three requirements must be met to achieve this: 1) high-speed cancellation of reset noise in single storage-type global shutter pixels, 2) high-speed readout with a long vertical signal line, and 3) high saturation in global-shutter mode.

For 1), in the conventional case, a feedback amplifier (FBAMP) is allocated to each column to effect noise cancellation, but there is a long noise suppression time due to the large time constant of the vertical signal line. An in-pixel noise canceller is developed to shorten the noise suppression time even when the vertical signal line becomes four times longer than in a Full High Definition (FHD) sensor. Moreover, to utilize the advantages of the OPF image sensor's stacked structure, high-capacitance Metal-Insulator-Metal (MIM) capacitors are allocated in the metal interconnect area. For 2), since the photoelectric conversion film is present at the upper layer, the photoelectric conversion characteristics are not affected, even if the number of vertical signal lines is increased. Therefore, two pairs of quadruple vertical signal lines are used for each vertical 8 pixels, and 16 sample-and-hold (S/H) capacitors are provided in each column. For 3), a high saturation circuit is developed without increasing the pixel size.

Figure 5.2.1 shows a block diagram of the OPF image sensor. Every column has multiple high-speed readout circuits (MHRCs) and analog-to-digital converters (ADCs).

Figure 5.2.2 shows a cross-section of the OPF image sensor pixel featuring these key technologies. Each pixel has an "in-pixel capacitive-coupled noise canceller (IP-CCNC)" with MIM capacitors, multiple high-speed readout lines for the "MHRC" to realize 8K4K resolution and 60fps readout, and a "global shutter" function.

The global shutter is controlled by switching the voltage applied to the indium tin oxide (ITO) electrode. In the 8K4K image sensor, there is a large pixel area of 26,400μm (H) × 13,644μm (V), but, by controlling one ITO electrode that is globally and simultaneously connected to all pixels, a shutter speed of 1/65000 second is achieved. Moreover, the effective function of an "electrical Neutral Density (ND) filter" is realized by setting the voltage applied to the ITO electrode according to the sensitivity characteristics of the OPF. This leads to easy, smooth, and continuous control of sensitivity globally, without an external ND filter.

Figure 5.2.3 shows a schematic of the IP-CCNC, which consists of four transistors and two capacitors (SF: amplifier transistor, SEL: select transistor, RST: reset transistor, FB: feedback transistor, Cs: stabilized capacitor, Cc: coupled capacitor) in each pixel, and four switches (S1, S1b, R1, R1b), two bias voltage lines (AVDD, Vbias), and two current sources in each column. To perform 8K4K resolution readout and high-speed noise cancellation, a FBAMP should be allocated to each pixel, not to each column, even though the pixel area is limited. To shrink the pixel size, a reconfigurable pixel circuit architecture is developed that has two operation modes: a FBAMP for noise cancellation and a source follower amplifier for signal readout. During the noise cancellation period, the in-pixel SF and the column current source configure the in-pixel common source inverting amplifier, by setting S1 and R1 to the ON state and S1b and R1b to the OFF state. This allows noise cancelling for each pixel to be realized. First, the floating diffusion (FD) node is set to the reset voltage by simultaneously turning on RST and FB; then RST and FB are turned off sequentially. During this period, reset noise caused by RST and FB is suppressed using a negative feedback loop that includes FB, which is bandwidth-controlled using the voltage Vfb. When the gain of the negative feedback loop is set to -A, the reset noise of RST and FB can be suppressed proportionally to 1/(A×Cc/Cfd) and 1/(A×Cs/Cc), respectively. In this case, the gain of the in-pixel FBAMP is smaller than that of the column FBAMP. To increase this gain, high-capacitance MIM capacitors are developed and the Cs value is set high. This results in a reduction of the 8K4K image sensor's reset noise from 23e<sup>-</sup> to 4.9e<sup>-</sup>, within 4μs. During the reset and signal readout period, the in-pixel SF and the column current source configure the source follower amplifier, by setting S1 and R1 to the OFF state, and S1b and R1b to the ON state. The signal readout operation from the FD node is then realized. Additionally, this reconfigurable pixel circuit has one more mode. When the amount of incident light is small, the circuit operates in noise cancellation mode, but when the amount of incident light is large, this circuit operates in high saturation mode by increasing the capacitance of FD to connect the Cs capacitor to the FD node while setting the gate voltage of RST Vrst to the ON state. A saturation signal of 450ke<sup>-</sup> is thus achieved. The saturation signal per unit square is 10dB higher than that of the silicon global shutter image sensor [5].
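The two column switch configurations and the quoted suppression terms can be written down compactly. This is a minimal sketch: the switch states and the expressions 1/(A×Cc/Cfd) and 1/(A×Cs/Cc) come from the text, while the dictionary layout, function names, and the example component values are hypothetical and chosen only to make the arithmetic visible.

```python
import math

# Column switch states for the two operating modes described in the text.
SWITCH_STATES = {
    "noise_cancel": {"S1": True, "R1": True, "S1b": False, "R1b": False},
    "readout": {"S1": False, "R1": False, "S1b": True, "R1b": True},
}


def suppression_terms(a: float, cc: float, cs: float, cfd: float):
    """Return the (RST, FB) reset-noise suppression terms quoted in the text:
    1/(A*Cc/Cfd) and 1/(A*Cs/Cc). Capacitances share any common unit."""
    return 1.0 / (a * cc / cfd), 1.0 / (a * cs / cc)


def reduction_db(before_e: float, after_e: float) -> float:
    """Noise reduction expressed in dB (20*log10 of the ratio)."""
    return 20.0 * math.log10(before_e / after_e)


# Hypothetical values (not from the paper): A = 10, Cc = 2, Cs = 20, Cfd = 1.
rst_term, fb_term = suppression_terms(a=10, cc=2, cs=20, cfd=1)

# Reported reset noise before and after cancellation.
measured_reduction_db = reduction_db(23.0, 4.9)
print(rst_term, fb_term, round(measured_reduction_db, 1))
```

A larger Cs shrinks the FB term, which is why the text emphasizes high-capacitance MIM capacitors; the reported drop from 23e<sup>-</sup> to 4.9e<sup>-</sup> corresponds to roughly 13dB.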

Figure 5.2.4 shows a schematic of the MHRC. To realize high-speed readouts, 1) signal readout, 2) noise cancellation, and 3) reset readout are performed sequentially by a four-line set. The readout signals are held in the S/H capacitors of the MHRC, and the held signals are sequentially converted from analog to digital. In this architecture, 8-column FBAMPs are unnecessary, which brings the benefit of a smaller area and lower power consumption.

A die micrograph is shown in Fig. 5.2.7. The 8K4K OPF global shutter image sensor is fabricated using 65nm 1P4Cu1Al CMOS technology. The sensor is designed with a pixel size of 3μm, a total pixel number of 8,800 (H) × 4,548 (V), and an effective pixel number of 8,192 (H) × 4,320 (V). It realizes 60fps digital readout using a 1.404Gb/s Sub-LVDS interface. The supply voltages are 3.3V (analog) and 1.2V (digital).
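The array dimensions follow directly from the pixel pitch and the pixel counts. The sketch below uses only the numbers quoted in the text (3μm pitch, 8,800 × 4,548 total pixels, 8,192 × 4,320 effective pixels, 60fps); the pixel-rate figure is a raw count of pixels per second and makes no assumption about bit depth or lane count, neither of which is given here.

```python
PIXEL_PITCH_UM = 3
TOTAL_PIXELS = (8800, 4548)      # horizontal, vertical
EFFECTIVE_PIXELS = (8192, 4320)  # horizontal, vertical
FRAME_RATE_FPS = 60


def array_size_um(pixels, pitch_um=PIXEL_PITCH_UM):
    """Physical extent of the pixel array in micrometres (H, V)."""
    return pixels[0] * pitch_um, pixels[1] * pitch_um


def pixel_count(pixels) -> int:
    """Number of pixels in an H x V array."""
    return pixels[0] * pixels[1]


size_um = array_size_um(TOTAL_PIXELS)
effective_count = pixel_count(EFFECTIVE_PIXELS)
total_rate = pixel_count(TOTAL_PIXELS) * FRAME_RATE_FPS

print(size_um)          # (26400, 13644)
print(effective_count)  # 35389440
print(total_rate)       # 2401344000 pixels per second
```

The computed extent of 26,400μm × 13,644μm matches the pixel area quoted for the global shutter electrode.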

Figure 5.2.5 shows images captured by the OPF image sensor. High-definition 8K4K resolution images are obtained. It also shows high-saturation mode ON and OFF images, rolling shutter mode and global shutter mode images using this sensor.

Finally, Fig. 5.2.6 lists a performance summary of the 8K4K OPF image sensor. The global shutter speed of 1/65000 second, parasitic light sensitivity (PLS) of -110dB, and saturation signal of 450ke<sup>-</sup> under global shutter mode are the highest performances reported to date based on the table, and will contribute to applications that require high-quality and accurate images even in demanding environments.

### Acknowledgment:

We would like to thank engineers in Panasonic Semiconductor Solutions Co., Ltd. Semiconductor Business Unit and Panasonic Corporation Automotive & Industrial Systems Company Engineering Division for supporting chip design.

### References:

- [1] R. Funatsu, et al., "133Mpixel 60fps CMOS Image Sensor with 32-Column Shared High-Speed Column-Parallel SAR ADCs", *ISSCC Dig. Tech. Papers*, pp. 112-113, Feb. 2015.
- [2] T. Arai, et al., "A 1.1μm 33Mpixel 240fps 3D-Stacked CMOS Image Sensor with 3-Stage Cyclic-Based Analog-to-Digital Converters", *ISSCC Dig. Tech. Papers*, pp. 126-127, Feb. 2016.
- [3] G. Meynants, et al., "A 47 MPixel 36.4 × 27.6 mm<sup>2</sup> 30 fps Global Shutter Image Sensor", *Dig. IISW*, pp. 410-413, June 2017.
- [4] M. Kobayashi, et al., "A 1.8erms Temporal Noise Over 110dB Dynamic Range 3.4μm Pixel Pitch Global Shutter CMOS Image Sensor with Dual-Gain Amplifiers, SS-ADC and Multiple-Accumulation Shutter", *ISSCC Dig. Tech. Papers*, pp. 74-75, Feb. 2017.
- [5] Y. Sakano, et al., "224-ke Saturation Signal Global Shutter CMOS Image Sensor with In-pixel Pinned Storage and Lateral Overflow Integration Capacitor", *IEEE Symp. VLSI Circuits*, pp. C250-C251, June 2017.
- [6] S. Shishido, et al., "210ke Saturation Signal 3μm-Pixel Variable-Sensitivity Global-Shutter Organic Photoconductive Image Sensor for Motion Capture", *ISSCC Dig. Tech. Papers*, pp. 112-113, Feb. 2016.



Figure 5.2.1: Block diagram of the OPF image sensor.



Figure 5.2.3: Dual-function schematic of IP-CCNC.



Figure 5.2.5: Images captured by the OPF image sensor.



Figure 5.2.7: Die micrograph.

## 5.3 A 1/2.8-inch 24Mpixel CMOS Image Sensor with 0.9μm Unit Pixels Separated by Full-Depth Deep-Trench Isolation

Yitae Kim, Wonchul Choi, Donghyuk Park, Heegeun Jeoung, Bumsuk Kim, Youngsun Oh, Sunghoon Oh, Byungjun Park, Euiyeol Kim, YunKi Lee, Taesub Jung, Yongwoon Kim, Sukki Yoon, Seokyong Hong, Jesuk Lee, Sangil Jung, Chang-Rok Moon, Yongin Park, Duckhyung Lee, Duckhyun Chang

Samsung Electronics, Hwaseong, Korea

CMOS image sensors (CIS) have attracted much attention for the emerging mobile market, and the demand for high-resolution image sensors in mobile applications continues to increase [1-3]. For this reason, pixel pitch has been reduced down to 1.0μm for mass production. Nevertheless, CISs are continuously scaling down to meet the strong demand for higher-resolution images. However, when the pixel size is reduced down to the sub-micron regime (possibly smaller than the diffraction limit), it is very important to consider photo sensitivity and crosstalk, which determine signal-to-noise ratio (SNR). To minimize degradation of photo sensitivity, back-side illumination (BSI), which collects light at the back side, is widely used instead of front-side illumination. In addition to BSI technology, deep-trench isolation (DTI) has emerged as a leading candidate to suppress crosstalk since it physically isolates the pixel. Previous work shows that partial-depth DTI can be applied in a 1.12μm-pitch pixel [4]. Furthermore, full-depth DTI has been demonstrated in a 1.12μm pixel with 24% larger full-well capacity (FWC), 30% smaller YSNR10, 2.0dB higher SNR, and notably lower crosstalk (12.5%) compared with a conventional one [5]. In this work, a 24-Mpixel CIS with 0.9μm unit pixels that takes advantage of full-depth DTI is demonstrated.

Figure 5.3.1 shows a schematic diagram of the full-depth DTI structure compared with that of a conventional partial-depth DTI structure. Both full- and partial-depth DTI structures adopt BSI technology to maximize the photodiode fill factor. The pixel with full-depth DTI has the following advantages over the partial-depth DTI. First, full-depth DTI completely separates pixels, both electrically and optically. Full-depth DTI is considered as a light guide to trap light within each pixel. The insulating layers formed on the sidewall of each pixel can totally reflect the light, and no light can penetrate through DTI. As a result, the full-depth DTI structure ideally has no optical crosstalk to neighboring pixels. In addition, this isolated structure ensures that electrical crosstalk does not happen between pixels. However, in partial-depth DTI, there is a pathway to neighboring pixels for charges generated in deep silicon by light with long wavelength; full-depth DTI has no risk of electrical crosstalk even in the deep region. Thus, blooming, which comes from electrical interference between pixels, can be prevented. Second, the voltage applied to the transfer gate and floating diffusion can be lower than for partial-depth DTI. The infinite barrier that prevents the blooming path ensures far greater flexibility in designing a photodiode. Furthermore, the maximum potential of the photodiode is around 1.0V for the full-depth DTI structure, whereas the partial-depth structure has a maximum potential of >2.0V. This indicates that a lower bias is needed to transfer electrons from the photodiode to the floating diffusion. In this experiment, an almost zero lag is measured until the transfer gate voltage is reduced down to 1.8V. In addition, the reset voltage of the floating diffusion can be less than for partial-depth DTI because the maximum potential of the photodiode is much lower than for the partial-depth DTI structure. 
Thus, lower power consumption and better settling margin for high speed operation can be achieved. These are major advantages for future devices.

The structures of the conventional 1.0μm pixel and the 0.9μm pixel are compared in Fig. 5.3.2 with cross-sectional potential profile and TEM images. It is seen that full-depth DTI is formed along the periphery of each pixel for complete isolation. Insulating oxides and polysilicon are used to fill these DTIs. The important points of this figure are the total silicon thickness and the DTI width. In our process for the 0.9μm pixel, thicker Si and a narrower DTI width are adopted than those for the conventional 1.0μm one, in order to minimize the loss of silicon area that could be used as photodiode. Thickness is increased up to 48% for photo sensitivity. As is well known, an increase in pixel thickness improves quantum efficiency (QE), especially for long wavelengths, since more light has a chance to contribute to charge generation. In our approach, to make the best use of this thicker Si, higher-energy ion implantation is employed to optimize the deeper photodiode, and the simulated potential profile is shown in Fig. 5.3.2. Furthermore, the narrower DTI width also significantly contributes to improving FWC for the 0.9μm pixel. As a result, the FWC of the 0.9μm pixel is 6,000 e-, which is measured as 11% larger than that of the conventional 1.0μm one.

The measured dark and electrical characteristics of the 0.9μm pixel are compared with those of the conventional 1.0μm pixel, as shown in Fig. 5.3.3. The first and second parts of the figure show the fixed-pattern and temporal noise histograms in dark condition, respectively. Signals around the second peak are dramatically decreased in the fixed-pattern histogram. In addition to the fixed-pattern noise, the temporal noise and random telegraph signal are also significantly suppressed. Consequently, white spot, dark current, and temporal noise, which are key factors in CIS performance, are successfully improved. Optimized fabrication processes, including plasma doping [6], and layout architecture design effectively lead to the enhancement of dark characteristics.

Figure 5.3.4 shows the optical properties of the 0.9μm pixel, and we clearly notice that the 0.9μm pixel is superior to the conventional 1.0μm one. Even though computational simulation predicts SNR degradation due to the shrinkage of the pixel and the increase of crosstalk, the 0.9μm pixel shows better performance in practice. This result indicates that advanced sensor technologies work effectively for lower crosstalk, lower temporal noise, and lower fixed-pattern noise. The key elements for high SNR, namely higher QE and lower crosstalk, are also shown. As shown in Fig. 5.3.4, the measured QE of the 0.9μm pixel is significantly improved in the whole range of wavelengths while crosstalk is dramatically suppressed. The non-diagonal elements of the color correction matrix are closer to zero, which indicates reduced crosstalk of the 0.9μm pixel. The improved optical characteristics are obtained as a result of optimized dimensions and materials of the anti-reflecting layer, color filter array, and micro lens.

The summary of pixel performance is listed in Fig. 5.3.5. FWC is increased up to 6,000 e- and crosstalk is reduced from 17.2% to 14%, which are 11% and 20% improvements, respectively. Although sensitivity is degraded from 2920 e-/lux-s to 2600 e-/lux-s due to the shrinkage of the pixel size, the 0.9μm pixel with full-depth DTI achieves enhanced characteristics overall compared with the conventional one. Figure 5.3.6 shows sample images taken with the 0.9μm pixels and the conventional 1.0μm ones.
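The relative changes can be recomputed from the entries of Fig. 5.3.5. The sketch below uses the tabulated values only; from these rounded entries the FWC gain and crosstalk reduction come out at roughly 9% and 19%, close to but not identical to the 11% and 20% quoted in the text. The difference presumably reflects rounding in the table or the use of unrounded measurement data, which is an assumption here.

```python
def relative_change_percent(conventional: float, new: float) -> float:
    """Signed change of `new` relative to `conventional`, in percent."""
    return (new - conventional) / conventional * 100.0


# Entries from Fig. 5.3.5: (conventional 1.0um pixel, 0.9um pixel).
FWC_E = (5500, 6000)
CROSSTALK_PCT = (17.2, 14.0)
SENSITIVITY = (2920, 2600)

fwc_change = relative_change_percent(*FWC_E)
crosstalk_change = relative_change_percent(*CROSSTALK_PCT)
sensitivity_change = relative_change_percent(*SENSITIVITY)

print(f"FWC {fwc_change:+.1f}%, crosstalk {crosstalk_change:+.1f}%, "
      f"sensitivity {sensitivity_change:+.1f}%")
```

For comparison, the pixel area shrinks by 19% (0.9² versus 1.0²), while the tabulated sensitivity drops by about 11%.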

In conclusion, we have demonstrated a 1/2.8-inch 24 Mpixel high-resolution CMOS image sensor with 0.9μm unit pixels separated by full-depth DTI. The 0.9μm pixel is superior to the conventional 1.0μm one in terms of electrical and optical characteristics. FWC is increased up to 6,000 e-, which is even larger than for conventional 1.0μm pixels, and dark noise characteristics are also improved. Moreover, better optical performance is also achieved by using an optimized back-side design and the higher aspect ratio of full-depth DTI.

#### References:

- [1] S. Choi, et al., "An All Pixel PDAF CMOS Image Sensor with 0.64 μm x 1.28 μm Photodiode Separated by Self-Aligned In-Pixel Deep Trench Isolation for High AF Performance," *IEEE Symp. VLSI Tech.*, pp. 104-105, June 2017.
- [2] J. C. Ahn, et al., "Advanced Image Sensor Technology for Pixel Scaling Down Toward 1.0 μm (Invited)," *Proc. IEDM*, pp. 1-4, 2008.
- [3] A. Suzuki, et al., "6.1 A 1/1.7-inch 20Mpixel Back-Illuminated Stacked CMOS Image Sensor for New Imaging Applications," *ISSCC Dig. Tech. Papers*, pp. 110-111, Feb. 2015.
- [4] Y. Kitamura, et al., "Suppression of Crosstalk by Using Backside Deep Trench Isolation for 1.12 μm Backside Illuminated CMOS Image Sensor," *Proc. IEDM*, 2012.
- [5] J. C. Ahn, et al., "A 1/4-inch 8Mpixel CMOS Image Sensor with 3D Backside-Illuminated 1.12 μm Pixel with Front-Side Deep-Trench Isolation and Vertical Transfer Gate," *ISSCC Dig. Tech. Papers*, pp. 124-125, Feb. 2014.
- [6] C. R. Moon, et al., "Application of Plasma-Doping (PLAD) Technique to Reduce Dark Current of CMOS Image Sensors," *IEEE Electron Device Letters*, pp. 114-116, Feb. 2007.



Figure 5.3.1: Schematic diagram and operation.



Figure 5.3.2: Pixel structure and full-well capacity.



Figure 5.3.3: Electrical characteristics.



Figure 5.3.4: Optical characteristics.

|                          | unit      | Conventional 1.0 $\mu\text{m}$ | 0.9 $\mu\text{m}$ |
|--------------------------|-----------|--------------------------------|-------------------|
| YSNR10*                  | lux       | 185                            | 175               |
| G-sensitivity @D65-light | e-/lux·sec | 2920                           | 2600              |
| Crosstalk                | %         | 17.2                           | 14                |
| Linear full well         | e-        | 5500                           | 6000              |
| Image lag                | e-        | < 1.0                          | < 1.0             |
| Dark temporal noise      | e-        | 1.8                            | 1.4               |
| Dark fixed pattern noise | e-        | 0.75                           | 0.55              |
| Dynamic range            | dB        | 63.9                           | 64.9              |
| Dark current @T=60°C     | e-/s      | 7.5                            | 2.0               |
| White spot**             | ea/Mp     | 265                            | 90                |
| RTS***                   | ea/Mp     | 350                            | 1                 |

\*YSNR10 : illumination for YSNR=10

With AWB, CCM, F.2.8, 3200K light, 18%reflectance gray patch

92% lens transmittance, IR filter of 98% maximum transmittance and 660±2 (0.9  $\mu\text{m}$ ), 650±5 (1.0  $\mu\text{m}$ ) nm cut-off

Color accuracy ( $\Delta E 2000$ )=2.5

\*\*White spot : # of pixels≥80LSB @1-frame, gain x8, 200msec, Ta=60°C

\*\*\*RTS : # of pixels≥30LSB @difference between 2-frame, gain x8, 0msec, room temp

Figure 5.3.5: Sensor performance comparison.



Figure 5.3.6: Sample images.



Figure 5.3.7: Die micrograph.

## 5.4 A 1/4-inch 3.9Mpixel Low-Power Event-Driven Back-Illuminated Stacked CMOS Image Sensor

Oichi Kumagai<sup>1</sup>, Atsumi Niwa<sup>1</sup>, Katsuhiko Hanzawa<sup>2</sup>, Hidetaka Kato<sup>1</sup>, Shinichiro Futami<sup>1</sup>, Toshio Ohyama<sup>1</sup>, Tsutomu Imoto<sup>1</sup>, Masahiko Nakamizo<sup>1</sup>, Hirotaka Murakami<sup>2</sup>, Tatsuki Nishino<sup>1</sup>, Anas Bostamam<sup>1</sup>, Takahiro Inuma<sup>1</sup>, Naoki Kuzuya<sup>1</sup>, Kensuke Hatsukawa<sup>3</sup>, Frederick Brady<sup>2</sup>, William Bidermann<sup>2</sup>, Toshiyumi Wakano<sup>1</sup>, Takashi Nagano<sup>1</sup>, Hayato Wakabayashi<sup>2</sup>, Yoshikazu Nitta<sup>1</sup>

<sup>1</sup>Sony Semiconductor Solutions, Atsugi, Japan

<sup>2</sup>Sony Electronics, San Jose, CA

<sup>3</sup>Sony LSI Design, Atsugi, Japan

Wireless products such as smart home-security cameras, intelligent agents, and virtual personal assistants, are evolving rapidly to satisfy our needs. Small size, extended battery life, transparent machine interfaces: all these are required of the camera system in these applications. These applications, in battery-limited environments, can profit from an event-driven approach for moving-object detection. This paper presents a 1/4-inch 3.9Mpixel low-power event-driven (ED) back-illuminated stacked CMOS image sensor (CIS) deployed with a pixel readout circuit that detects moving objects for each pixel under lighting conditions ranging from 1 to 64,000lux. Utilizing pixel summation in a shared floating diffusion (FD) for each pixel block, moving object detection is realized at 10 frames per second while consuming only 1.1mW, a 99% reduction in power from the same CIS at a full-resolution 60fps power of 95mW.
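The quoted power saving can be verified from the two power figures in the text. This is a one-line check rather than a model of the sensor; the function name is ours.

```python
def power_reduction_percent(reference_mw: float, reduced_mw: float) -> float:
    """Percentage reduction of `reduced_mw` relative to `reference_mw`."""
    return (1.0 - reduced_mw / reference_mw) * 100.0


# 95mW at full-resolution 60fps versus 1.1mW for 10fps moving-object detection.
reduction = power_reduction_percent(95.0, 1.1)
print(f"{reduction:.1f}%")  # about 98.8%, i.e. the quoted 99% after rounding
```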

Figure 5.4.1 shows a block diagram of the low-power ED sensor. The sensor consists of a pixel array, row drivers, row decoders, single-slope column-parallel 10b ADCs, a DAC for slope generation, motion detection (MD)/optical detection (OPD) blocks, an image signal processor, SRAM for frame memory, a Mobile Industry Processor Interface (MIPI), and a CPU connected to the sensor control block. A phase-locked loop (PLL) is used to generate all internal clocks from a single master input clock running between 12 and 27MHz. The clock generation circuitry is programmable through an I2C interface. The sensor is operated by three power supplies: a 1.8V pixel supply, a 1.0V digital supply, and a 1.8V supply for both analog and low-power I/O. The pixel array consists of visible pixels and optically black pixels used for on-chip image processing. The array has vertical signal lines (VSLs), used for readout of each 2x4 shared pixel unit. The analog pixel readout signal is converted to a digital signal using a column-parallel ADC. The MD and OPD blocks are used while in the sensing binning mode. In this mode, a very low resolution of 80 binned pixel units (1 output for each 16x5 pixel subarray) is read out on a linear scale. If a moving object is detected, the CPU generates an external interrupt signal and triggers the capture of a high-quality image using on-chip auto exposure (AE) with zero latency. The maximum output data rate is 1.6Gb/s per MIPI lane. The chip incorporates 2 MIPI lanes. The sensor includes various camera functions such as sensing binning mode, pixel binning mode, and all-pixel scan image mode, and is capable of recording full-HD movies with high sensitivity at 60fps.

Figure 5.4.2 shows the overall readout architecture in sensing binning mode. The sensor consists of an array of 2560(H) x 1536(V) pixels. Each pixel in the array is addressed through a horizontal word line and VSLs for each 2x4 shared pixel unit readout. Pixel readout is performed one row at a time through the pixel source follower (SF) amplifier. The entire row is then converted to digital output using the column comparators and counters. The sensor supports 2x1 analog binning (2-column binning, called SF binning mode), 2x2 analog binning (2-column/2-row binning, called SF/FD binning mode) and 160x154 analog/digital binning (8-column/154-row SF/FD binning and 20 horizontal digital binning, called sensing binning mode). Binning can be enabled when the sensing binning mode is enabled (SMEN = 1). In this mode, all pixels are read through vertical floating diffusion lines (VFLs) to avoid blind-spots [1]. A combination of short-exposure pixels and long-exposure pixels alternating every two lines is used to achieve 160x154 pixel analog/digital binning with high dynamic range [2]. To reduce power consumption, each column's ADC is controlled by column-enable signals (COLEN).

There are two new on-chip functions: an MD function that detects a change of object position relative to its surroundings, and an auto exposure (AE) function executed 1 frame before normal mode operation. In the ED system of this sensor, MD and AE are used in sensing binning mode. Figure 5.4.3 shows the MD and AE processes in this sensor. Two event-driven methods are employed: a frame difference method and a background difference method. The MD function is used to minimize power and trigger a full-resolution capture. In the MD process, the luminance difference between 2 frames is evaluated, and a moving object is detected when that difference exceeds a certain threshold. If a moving object is detected, the sensor generates an interrupt signal to wake up an application processor (AP). The AE function determines the appropriate exposure parameters: shutter exposure time and gain. The OPD value is used to calculate the current illuminance level (evaluation value); the exposure value required to reach the brightness target, called the AE scale, is then calculated and input to the AE diagram. The appropriate shutter exposure time and gain are obtained by scaling the AE value from the minimum gain point in the AE diagram. The calculated exposure time and gain are applied to the normal readout frame that immediately follows the detection of a moving object.
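The thresholded luminance comparison described above can be illustrated with a minimal software sketch. This is not the on-chip MD implementation: the function names, the threshold value, and the example luminance codes are hypothetical, and the same comparison is simply reused for the frame difference method (reference = previous frame) and the background difference method (reference = stored background frame).

```python
# Minimal sketch of thresholded luminance differencing on binned frames.
# Assumptions: each frame is a flat list of 80 binned luminance codes;
# THRESHOLD and all sample values are hypothetical.

NUM_BINNED_UNITS = 80
THRESHOLD = 32


def changed_units(reference, current, threshold=THRESHOLD):
    """Indices of binned units whose luminance differs by more than threshold."""
    return [
        i for i, (ref, cur) in enumerate(zip(reference, current))
        if abs(cur - ref) > threshold
    ]


def motion_detected(reference, current, threshold=THRESHOLD):
    """True if at least one binned unit exceeds the difference threshold."""
    return len(changed_units(reference, current, threshold)) > 0


# Frame difference method: the reference is the previous frame.
previous_frame = [100] * NUM_BINNED_UNITS
current_frame = list(previous_frame)
current_frame[17] = 180  # an object brightens one binned unit

if motion_detected(previous_frame, current_frame):
    print("interrupt: wake up application processor, units",
          changed_units(previous_frame, current_frame))
```

For the background difference method, the same call would be made with a stored background frame as `reference` instead of the previous frame.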

As shown in Fig. 5.4.4, images in both the sensing binning mode and the normal mode are successfully captured using the fabricated ED sensor. The first image captures the entire scene with low resolution and high contrast. A moving object is detected with high accuracy in sensing binning mode, and a high-quality image in normal mode follows.

A performance summary and comparison to recently published devices are shown in Fig. 5.4.5. The chip consumes 95mW of total power at a frame rate of 60fps in normal mode, achieving an FOM1 of 0.7e-·nJ for the ADC. Furthermore, the ADC demonstrates an FOM2 of 1.5e-·nJ/DRU. The chip characteristics are summarized in Fig. 5.4.6. The top chip uses a 90nm 1Al-4Cu CMOS process with specialized add-on steps to enable backside illumination. The bottom chip uses a 40nm 1Al-6Cu logic process technology. The total number of effective pixels is 3.9Mpixels (2560 (H) x 1536 (V)), excluding the optically black pixels. The pixel pitch is 1.5μm, and each pixel incorporates an on-chip micro-lens. A saturation signal of 7800e- at 60°C, a sensitivity of 8033e-/lx·s, a dynamic range of 67dB at 10b in normal mode, and a dynamic range of 96dB in sensing binning mode are measured. An RMS random noise of 1.8e- and a conversion gain of 55.8μV/e- are confirmed. The die micrograph is shown in Fig. 5.4.7. The chip size is 4.959mm (H) x 4.401mm (V).

In conclusion, this sensor has an intelligent sensing function for real-time moving-object detection within predefined areas. This low-power ED sensor facilitates event recording, which significantly reduces the power consumption and data bandwidth of camera systems while in low-power sensing mode. These features enhance device usability and satisfy the new demand for a low-resolution always-on sensing device that is also capable of high-quality imaging.

### Acknowledgments:

The authors would like to thank the members of Sony Semiconductor Manufacturing, Sony Electronics Inc., Sony LSI Design Inc. and Sony Semiconductor Solutions Co. for their support of this work.

### References:

- [1] G. Kim, et al., "A 467nW CMOS Visual Motion Sensor with Temporal Averaging and Pixel Aggregation," *ISSCC Dig. Tech. Papers*, pp. 480-481, Feb. 2013.
- [2] S.K. Nayar, et al., "High Dynamic Range Imaging: Spatially Varying Pixel Exposures," *IEEE Conf. Computer Vision and Pattern Recognition*, vol. 1, pp. 472-479, June 2000.
- [3] Y. Chae, et al., "A 2.1Mpixel 120frame/s CMOS Image Sensor with Column-Parallel  $\Delta\Sigma$  ADC Architecture," *ISSCC Dig. Tech. Papers*, pp. 394-395, Feb. 2010.
- [4] A. Suzuki, et al., "A 1/1.7-inch 20Mpixel Back-Illuminated Stacked CMOS Image Sensor for New Imaging Applications," *ISSCC Dig. Tech. Papers*, pp. 110-111, Feb. 2015.
- [5] T. Arai, et al., "A 1.1μm 33Mpixel 240fps 3D-Stacked CMOS Image Sensor with 3-Stage Cyclic-Based Analog-to-Digital Converters," *ISSCC Dig. Tech. Papers*, pp. 126-127, Feb. 2016.
- [6] M. Kobayashi, et al., "A 1.8erms Temporal Noise Over 110dB Dynamic Range 3.4μm Pixel Pitch Global Shutter CMOS Image Sensor with Dual-Gain Amplifiers, SS-ADC and Multiple-Accumulation Shutter," *ISSCC Dig. Tech. Papers*, pp. 74-75, Feb. 2017.






|                         | ISSCC 2010 [3] | ISSCC 2015 [4] | ISSCC 2016 [5] | ISSCC 2017 [6] | This work |
|-------------------------|----------------|----------------|----------------|----------------|-----------|
| ADC resolution [bit]    | 12.5           | 12             | 12             | 10             | 10        |
| Frame rate [fps]        | 120            | 30             | 240            | 120            | 60        |
| # of V pixels           | 1212           | 3934           | 4320           | 2054           | 1536      |
| # of H pixels           | 1696           | 5256           | 7680           | 2592           | 2560      |
| Random noise [e-rms]    | 2.4            | 1.3            | 3.6            | 1.8            | 1.8       |
| Power [mW]              | 180            | 532            | 3000           | 450            | 95        |
| FoM1 [e-·nJ] for ADC    | 1.8            | 1.1            | 1.4            | 1.3            | 0.7       |
| FoM2 [e-·nJ/DRU] for ADC | 0.4           | 3.3            | 3.4            | 0.3            | 1.5       |

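$$\mathrm{FoM1} = \frac{\mathrm{Power\,[W]} \times \mathrm{Noise\,[e^-]} \times 10^{9}}{\mathrm{FPS\,[s^{-1}]} \times \mathrm{Num.\ of\ Eff.\ Pixels}}$$

$$\mathrm{FoM2} = \frac{\mathrm{Power\,[W]} \times \mathrm{Noise\,[e^-]} \times 10^{12}}{\mathrm{FPS\,[s^{-1}]} \times \mathrm{Num.\ of\ Eff.\ Pixels} \times \mathrm{DRU}}$$

$$\mathrm{DRU} = \frac{\mathrm{Saturation\ signal\,[e^-]}}{\mathrm{Noise\,[e^-]} \times \mathrm{Gain\,[times]}}$$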
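As a cross-check of the FoM1 definition, the short sketch below recomputes FoM1 from the power, noise, frame rate, and pixel counts listed in the comparison table, reading the denominator as the product of frame rate and pixel count. The dictionary layout and function name are illustrative choices; the numbers are taken from the table.

```python
# Recompute FoM1 = Power[W] * Noise[e-] * 1e9 / (FPS * number of pixels)
# from the comparison table entries.

TABLE = {
    # name: (power_mW, noise_e_rms, fps, v_pixels, h_pixels, reported_fom1)
    "ISSCC 2010 [3]": (180, 2.4, 120, 1212, 1696, 1.8),
    "ISSCC 2015 [4]": (532, 1.3, 30, 3934, 5256, 1.1),
    "ISSCC 2016 [5]": (3000, 3.6, 240, 4320, 7680, 1.4),
    "ISSCC 2017 [6]": (450, 1.8, 120, 2054, 2592, 1.3),
    "This work": (95, 1.8, 60, 1536, 2560, 0.7),
}


def fom1(power_mw, noise_e, fps, v_pixels, h_pixels):
    """FoM1 in e-*nJ: energy per pixel conversion weighted by random noise."""
    power_w = power_mw / 1000.0
    return power_w * noise_e * 1e9 / (fps * v_pixels * h_pixels)


computed = {
    name: fom1(p, n, f, v, h) for name, (p, n, f, v, h, _) in TABLE.items()
}

for name, value in computed.items():
    reported = TABLE[name][5]
    print(f"{name:15s} computed {value:.2f}  reported {reported}")
```

Rounded to one decimal place, the recomputed values agree with the FoM1 row of the table, which supports this reading of the formula.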

|                                 |                                                                                                                                                                       |
|---------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Fabrication Process             | 90nm 1AL4Cu CIS/40nm 1AL6Cu Logic                                                                                                                                     |
| Number of effective pixels      | 2560 (H) x 1536 (V) 3.9 M pixels                                                                                                                                      |
| Image size                      | Diagonal 4.48 mm (1/4-type)                                                                                                                                           |
| Pixel size                      | 1.5 μm (H) x 1.5 μm (V)                                                                                                                                               |
| Supply voltage                  | 1.8V / 1.8V / 1.0V                                                                                                                                                    |
| Frame rate                      | All-pixel scan 10 bits 60 fps<br>Full HD 10 bits 60 fps<br>Sensing 16x5 10 bits 10 fps                                                                                |
| Power consumption               | All-pixel scan 95 mW at 10 bits 60fps<br>57 mW at 10 bits 30fps<br>Full HD 71 mW at 10 bits 60 fps<br>41 mW at 10 bits 30 fps<br>Sensing 16x5 1.1 mW at 8 bits 10 fps |
| Saturation signal               | 7800 e- at 60°C                                                                                                                                                       |
| Sensitivity(Typical value F5.6) | 8033 e-/lx·sec(Green Pixel)                                                                                                                                           |
| RMS random noise                | 1.8 e-rms (analog gain:18dB)                                                                                                                                          |
| Dynamic range                   | All-pixel scan 67 dB at 10 bits<br>Sensing 16x5 96 dB (1~64klx)                                                                                                       |
| Conversion Gain                 | 55.8μV/e-                                                                                                                                                             |
| Input clock frequency           | 6~27 MHz                                                                                                                                                              |
| Image output format             | Bayer RAW                                                                                                                                                             |
| Outputs                         | MIPI(CSI-2) 2 lane DPHY 1.6 Gbps/lane                                                                                                                                 |

Figure 5.4.5: Performance comparison.

Figure 5.4.6: Chip characteristics.






Figure 5.4.7: Die micrograph.

## 5.5 A 1.1 $\mu$ m-Pitch 13.5Mpixel 3D-Stacked CMOS Image Sensor Featuring 230fps Full-High-Definition and 514fps High-Definition Videos by Reading 2 or 3 Rows Simultaneously Using a Column-Switching Matrix

Po-Sheng Chou, Chin-Hao Chang, Manoj M. Mhala,  
Charles Chih-Min Liu, Calvin Yi-Ping Chao, Chiao-Yi Huang, Honyih Tu,  
Thomas Wu, Shang-Fu Yeh, Seiji Takahashi, Yimin Huang

TSMC, Hsinchu, Taiwan

Slow-motion video is a desirable feature for state-of-the-art smartphones. The effect is achieved by capturing a video at a higher frame rate and playing it back at a lower frame rate. While the still-image resolution of smartphone cameras ranges from 8MP to 25MP, standard videos are limited to 3 formats: 3840×2160 (4K2K, 2160p), 1920×1080 (Full High Definition, FHD, 1080p), and 1280×720 (High Definition, HD, 720p). CMOS image sensors using various column-parallel ADC architectures have been reported to reach high frame rates [1-5]. The single-slope (SS) ADC is an attractive choice for a balanced performance among high speed, low noise, small area, and low power consumption. However, in a conventional SS ADC design, each ADC is hardwired to a column signal line. ADCs for skipped columns are left idle during the subsampling operation, so the potential to reach a higher frame rate is not exploited. In this paper, we develop an approach in which all the column ADCs are fully utilized in both the 2-to-1 and 3-to-1 subsampling modes, such that up to 4× faster FHD and 9× faster HD videos are demonstrated with reference to the 1-to-1 non-subsampled 4K2K video.

The 1.1 $\mu$ m-pitch, 13.5Mpixel (4224×3200) chip consists of a top pixel array layer and a bottom readout circuit layer, fabricated in a 45nm backside illumination (BSI) CIS process and a 65nm low-power logic process, respectively. The two layers are wafer-to-wafer, face-to-face bonded together using a 3D stacking technology. For the 4K2K video, the 3840×2160 sub array is cropped from the center region of the full array. For FHD and HD videos, the 1920×1080 and the 1280×720 arrays are vertically and horizontally 2-to-1 and 3-to-1 subsampled from the 3840×2160 array while maintaining the 2×2 Bayer color patterns, described as the (V:1/2, H:1/2) and (V:1/3, H:1/3) modes in Fig. 5.5.1.

The key idea behind this design is that for 2-to-1 and 3-to-1 subsampling, the ADCs associated with the skipped columns can be used to read additional rows simultaneously to maximize the frame rates. Figure 5.5.1 shows that the 1 row, 2 rows, or 3 rows of pixels can be read out at the same time in each of the 2160p, 1080p, and 720p modes. Three column-signal lines (C1, C2, and C3) are required for every 2 columns. A column switching matrix (CSM) between the pixel array and the column ADCs is used to direct the column signals to different ADCs row by row.

Because of the 2×2 shared pixels, the even- and odd-column pixels on the same row are read out sequentially using 1 column ADC. Pixels on the same column but different rows can be read out by 1, 2, or 3 ADCs simultaneously. The CSM comprises 3 types of column-switching cells (CSC) with a pattern repeating every 6 ADCs, or 12 columns. The CSM and CSCs are controlled by a set of switching and enable signals, S[1:5] and E[0:3]. Figure 5.5.2 shows the active columns, the inactive columns, and the routing paths of the column signals to different ADC groups in each of the 2160p, 1080p, and 720p modes.

In the vertical 3-to-1 subsampling case, 2 rows are read out and 4 rows are skipped for every 6 rows. To route the column signals to 3 adjacent ADCs, an 18-row period pattern is needed. Similarly, in the vertical 2-to-1 subsampling case, 2 rows are read out and 2 rows are skipped for every 4 rows. To route the column signals to 2 neighboring ADCs, an 8-row period pattern is needed. Combining these 2 requirements, a vertical routing pattern with a period of 72 rows is needed, since the least common multiple of 18 and 8 is 72. Figure 5.5.3 shows the detailed implementation together with the tabulated switching control signals S[1:5] for each of the 3 modes. Note that such an implementation is not unique; indeed, there are many other ways to achieve the same effect. For each different hardwired vertical connection to C1, C2, and C3, a set of matching control signals needs to be designed accordingly.

Take the 720p case in Fig. 5.5.3 as an example, where a total of 24 rows out of every 72 rows are read out in the following triple-row sequence: (0, 6, 12), (1, 7, 13), (18, 24, 30), (19, 25, 31), (36, 42, 48), (37, 43, 49), (54, 60, 66), and (55, 61, 67). The corresponding column signal lines are (C1, C3, C2), (C1, C3, C2), (C2, C1, C3), (C2, C1, C3), (C2, C3, C1), (C2, C3, C1), (C1, C2, C3), and (C1, C2, C3), respectively. The on-chip digital block uses a finite state machine to generate the switching control signals and an external FPGA rearranges the output data back to the correct order using SRAM line buffers.
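The 72-row routing period and the 720p triple-row sequence quoted above can be reproduced with a few lines of Python. This is only an illustrative sketch of the row bookkeeping; the function and constant names are hypothetical, and the on-chip finite state machine is not modeled.

```python
# Reproduce the 72-row routing period and the 720p triple-row readout order.
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


PERIOD_3TO1 = 18   # row period needed to route to 3 adjacent ADCs
PERIOD_2TO1 = 8    # row period needed to route to 2 neighboring ADCs
ROUTING_PERIOD = lcm(PERIOD_3TO1, PERIOD_2TO1)


def triple_row_sequence(period_rows=ROUTING_PERIOD):
    """Rows read together in 3-to-1 mode: 2 of every 6 rows, 3 rows at a time."""
    triples = []
    for base in range(0, period_rows, PERIOD_3TO1):
        for offset in (0, 1):  # the 2 rows kept in each 6-row group
            triples.append((base + offset, base + offset + 6, base + offset + 12))
    return triples


TRIPLES_720P = triple_row_sequence()
rows_read = sum(len(t) for t in TRIPLES_720P)

print("routing period:", ROUTING_PERIOD)
print("rows read per period:", rows_read)
for triple in TRIPLES_720P:
    print(triple)
```

The generated list matches the sequence given in the text, with 24 of every 72 rows read out.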

The analog signal chain is illustrated in Fig. 5.5.4. The 4 shared pixels are buffered by a source follower (SF), enabled by a row-select device (Rsel), and biased by a constant current source (Cbias). The column signal lines are controlled by the CSM and fed into 12b SS column ADCs.

A continuous ramp generator is designed using a switched-capacitor charge-integration structure. It achieves low sensitivity to process variation, high linearity, and small area, and it consumes only 5mA of current. The current integrator buffer is capable of driving the heavy load of 2,112 comparator capacitor arrays while maintaining a large output swing. The offset of the buffer is cancelled by an auto-zero (AZ) switch during the reset phase. The ADC gain can be adjusted by changing the ramp slope via programming the integration current.

A two-stage column comparator compares the pixel output with the ramp voltage. Two unbalanced transistors M3 and M4 are used to increase the input swing headroom from ( $V_{dd}$ - $V_{gs}$ ) to ( $V_{dd}$ - $V_{ds}$ ) of M1 and M2. The cascode device M4 is used to reduce the kickback noise from the comparator output and to enhance the gain for offset reduction.

The column hybrid counter design is shown in Fig. 5.5.5. The total 2,112 columns are partitioned into 11 groups of 192 columns. A semi-global 5b Gray code counter in each group is driven by a global 2GHz PLL clock. Correlated double sampling (CDS) is performed by dual ramping and AD conversions for the reset voltage and the signal voltage sequentially. The reset data are obtained by latching the 5b Gray codes. The signal data are obtained by the 5 LSBs from the Gray code counter and the 8 MSBs from the in-column ripple counter. The total 18b of data per column are transmitted to the digital block at 20MHz for CDS subtraction. Finally, the chip outputs at a 6.144Gb/s data rate using 12 parallel LVDS channels at 512MS/s each.
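The bookkeeping of the hybrid counter can be sketched in software as follows. The Gray-code conversions are standard; how the 8 MSBs and 5 LSBs are concatenated and how the CDS subtraction is applied are assumptions of this sketch (the helper names and the sample counts are hypothetical), not a description of the actual digital block.

```python
# Sketch of hybrid-counter data handling: 5b Gray-coded LSBs plus 8b ripple MSBs.
# Assumption: the ripple counter counts wrap-arounds of the 5b Gray counter,
# so the signal count is (msb << 5) | lsb.

GRAY_BITS = 5
RIPPLE_BITS = 8


def binary_to_gray(n):
    """Standard reflected binary Gray code."""
    return n ^ (n >> 1)


def gray_to_binary(g):
    """Inverse of binary_to_gray (XOR of all right shifts)."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def signal_count(ripple_msb, gray_lsb):
    """Combine 8 ripple-counter MSBs with the decoded 5b Gray LSBs."""
    return (ripple_msb << GRAY_BITS) | gray_to_binary(gray_lsb)


def cds(reset_gray, ripple_msb, signal_gray):
    """Digital CDS: signal count minus reset count."""
    return signal_count(ripple_msb, signal_gray) - gray_to_binary(reset_gray)


# Bits transmitted per column: 5b reset + 5b signal LSBs + 8b signal MSBs.
bits_per_column = GRAY_BITS + GRAY_BITS + RIPPLE_BITS

# Output interface: 12 LVDS channels at 512MS/s each (1 bit per sample).
output_rate_bps = 12 * 512e6

example = cds(binary_to_gray(9), 17, binary_to_gray(21))
print("bits per column:", bits_per_column)
print("CDS example:", example)
print("output rate: %.3f Gb/s" % (output_rate_bps / 1e9))
```

The bit count and output rate reproduce the 18b per column and 6.144Gb/s figures quoted above; 11 groups of 192 columns likewise give the 2,112 columns.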

The performance comparison and chip characteristics are tabulated in Fig. 5.5.6, together with a sequence of cropped 720p images at 514fps. In addition to the high video rates of 200fps 12b FHD, 230fps 11b FHD, 450fps 12b HD, and 514fps 11b HD, this chip achieves an excellent FOM of 1.01e-·nJ based on the full-frame operation. The photos of the top CIS die and the bottom ASIC die are shown in Fig. 5.5.7. In summary, an architecture and circuit implementation that achieve high-frame-rate FHD and HD video are successfully demonstrated. Only 1 bank of column ADCs is used. All ADCs are 100% utilized in the 2-to-1 and 3-to-1 subsampling modes by means of an analog switching matrix and a set of digital control signals.
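The quoted FOM and the frame-rate gains can be checked against the chip characteristics in Fig. 5.5.6 with simple arithmetic. The sketch below assumes the FOM is Power × Noise / (number of pixels × frame rate), expressed in e-·nJ and evaluated at the full-array 12b frame rate; variable names are illustrative.

```python
# Arithmetic check of the FOM and of the subsampling speed-ups.

POWER_W = 0.258                 # 258mW
NOISE_E = 1.8                   # random noise in electrons
FULL_ARRAY_PIXELS = 4224 * 3200
FULL_ARRAY_FPS = 34             # 13.5MP, 12b ADC mode

fom_e_nj = POWER_W * NOISE_E / (FULL_ARRAY_PIXELS * FULL_ARRAY_FPS) * 1e9

# Frame rates in 12b mode for the three video formats.
FPS_2160P, FPS_1080P, FPS_720P = 50, 200, 450
speedup_fhd = FPS_1080P / FPS_2160P
speedup_hd = FPS_720P / FPS_2160P

print(f"FOM = {fom_e_nj:.2f} e-*nJ")
print(f"FHD speed-up = {speedup_fhd:.0f}x, HD speed-up = {speedup_hd:.0f}x")
```

The result reproduces the 1.01e-·nJ figure and the 4× and 9× gains relative to the non-subsampled 4K2K video.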

### References:

- [1] T. Watabe, et al., "A 33Mpixel 120fps CMOS Image Sensor Using 12b Column-Parallel Pipelined Cyclic ADCs," *ISSCC Dig. Tech. Papers*, pp. 388-389, Feb. 2012.
- [2] H. Honda, et al., "A 1-inch Optical Format, 14.2M-pixel, 80fps CMOS Image Sensor with a Pipelined Pixel Reset and Readout Operation," *IEEE Symp. VLSI Circuits*, pp. C4-C5, June 2013.
- [3] A. Suzuki, et al., "A 1/1.7-inch 20Mpixel Back-Illuminated Stacked CMOS Image Sensor for New Imaging Applications," *ISSCC Dig. Tech. Papers*, pp. 110-111, Feb. 2015.
- [4] T. Arai, et al., "A 1.1 $\mu$ m 33Mpixel 240fps 3D-Stacked CMOS Image Sensor with 3-Stage Cyclic-Based Analog- to-Digital Converters," *ISSCC Dig. Tech. Papers*, pp. 126-127, Feb. 2016.
- [5] T. Haruta, et al., "A 1/2.3inch 20Mpixel 3-Layer Stacked CMOS Image Sensor with DRAM," *ISSCC Dig. Tech. Papers*, pp. 76-77, Feb. 2017.



Figure 5.5.1: Video formats and subsampling.



Figure 5.5.2: Column-switching matrix (CSM).



Figure 5.5.3: Vertical signal routing and CSM control.



Figure 5.5.4: Pixel and peripheral circuit architecture, ramp generator, and comparator design.



Figure 5.5.5: Column hybrid counter.

| Reference                                                                                                  | [1] 2012 ISSCC | [2] 2013 VLSI-C | [3] 2015 ISSCC | [4] 2016 ISSCC | [5] 2017 ISSCC | This work |
|------------------------------------------------------------------------------------------------------------|----------------|-----------------|----------------|----------------|----------------|-----------|
| Power [mW]                                                                                                 | 2540           | 1100            | 532            | 3000           | > 424          | 258       |
| FOM [e-·nJ] = $\frac{\text{Power} \cdot \text{Noise}}{\# \text{of Pixels} \cdot \text{Fps}}$               | 2.33           | 1.65            | 1.15           | 1.36           | N/A            | 1.01      |

† Estimated from 1Gb DRAM to store 64 2MP frames (1Gb=2M\*42\*12b)

\*\* 240fps HD is a continuous video; 960fps FHD is a clip of 42~78 frames

† FOM calculation is based on full-array frame rate

| Item                | This work                                |
|---------------------|------------------------------------------|
| Process             | 45nm 1P4M Sensor                         |
| Supply voltage      | 2.8V pixel / 2.8V pixel / 1.4V digital   |
| # of pixels         | 4224 × 3200, 13.5MP                      |
| Pixel size          | 1.1μm × 1.1μm                            |
| Frame rate (13.5MP) | 34fps/12b ADC mode                       |
| Frame rate (2160p)  | 50fps/12b, 58fps/11b                     |
| Frame rate (1080p)  | 200fps/12b, 230fps/11b                   |
| Frame rate (720p)   | 450fps/12b, 514fps/11b                   |
| Counter speed       | 2.048GHz                                 |
| Random noise        | 1.8e-                                    |
| CFPN                | 0.28e-                                   |
| FWC                 | 4458e-                                   |
| Conversion gain     | 118μV/e-                                 |



Figure 5.5.6: Performance comparison (top left), chip characteristics (top right), and images at 514fps (bottom).



Figure 5.5.7: Chip micrograph top sensor chip (left) and bottom ASIC chip (right).

## 5.6 A 2.1 $\mu$ m 33Mpixel CMOS Imager with Multi-Functional 3-Stage Pipeline ADC for 480fps High-Speed Mode and 120fps Low-Noise Mode

Toshio Yasue<sup>1</sup>, Kohei Tomioka<sup>1</sup>, Ryohei Funatsu<sup>1</sup>, Tomohiro Nakamura<sup>1</sup>, Takahiro Yamasaki<sup>1</sup>, Hiroshi Shimamoto<sup>1</sup>, Tomohiko Kosugi<sup>2</sup>, Sung-Wook Jun<sup>2</sup>, Takashi Watanabe<sup>2,3</sup>, Masanori Nagase<sup>2</sup>, Toshiaki Kitajima<sup>2</sup>, Satoshi Aoyama<sup>2</sup>, Shoji Kawahito<sup>2,3</sup>

<sup>1</sup>NHK Science & Technology Research Laboratories, Tokyo, Japan

<sup>2</sup>Brookman Technology, Hamamatsu, Japan

<sup>3</sup>Shizuoka University, Hamamatsu, Japan

High-resolution video has rapidly become part of our daily life, driven by progress in camera, display, signal processing, and communication technologies. The most demanding video parameters standardized at present include 8K, 120fps, 12b RGB, wide color gamut, and HDR. Although a camera that fulfills all these parameters has been reported based on 1.7-inch 33-Mpixel CMOS imagers [1], a smaller form factor that maintains image quality is required from the standpoint of mobility, lens design, and depth of focus. In general, miniaturization of the imager degrades image quality metrics such as sensitivity, dynamic range, and resolution. We deliberated on these difficulties and set a target optical format of 1.25 inch.

This paper describes a 1.25-inch 33-Mpixel CMOS imager supporting three operation modes. The first mode is 120fps 14b with noise suppression by folding-integration (FI) and digital CDS. The second mode is 240fps 12b for slow-motion imaging without noise suppression. The third mode is 480fps 10b for slow-motion imaging with further decreased speed.

A block diagram of the image sensor is shown in Fig. 5.6.1. We use a 110nm FSI technology to fabricate the imager. The pixel array with a 2.1 $\mu$ m pitch has a vertically shared 2.5-transistor structure. Each pixel column has duplicated source followers and a CDS. The CDS circuit is connected to a 3-stage pipelined ADC. For the 1<sup>st</sup> stage, we use an FI ADC (whose analog core is similar to a  $\Delta\Sigma$  modulator) to improve the S/N by multiple sampling [2]. For the 2<sup>nd</sup> stage, we use a cyclic ADC to achieve high-speed operation and accuracy. For the 3<sup>rd</sup> stage, we use an SAR ADC to improve speed-to-power efficiency. The details on the ADC operation and pixel readout are described below.

After the ADC, there are two sets of 18b registers that hold the binary-code outputs of the ADC for the digital-CDS. Two sets of registers are used because the CDS operation is duplicated across the two source followers. The 18b reset data (signal after the reset of a pixel) held in the register is subtracted from the signal data (signal after the charge transfer in a pixel). The subtracted code is output in parallel in a 20b binary format (signed 6b for FI and 12b for cyclic and SAR). The 20b signals from the 46 columns are scanned by a horizontal scanner and transferred to the digital processing circuit. The digital processing circuit includes a format converter, a sync code inserter, and a parallel-to-serial converter. The format converter converts the 20b data into a word length that depends on the operation mode. In the 120fps and 240fps modes, a word length of 16b is selected (3b from the 1<sup>st</sup> ADC, 13b from the 2<sup>nd</sup> and 3<sup>rd</sup> ADCs). In the 480fps mode, although the ADC outputs a 10b code, a word length of 8b is selected. The 10b code is converted to an 8b word using a floating-point format. The effective bit depth is 10, 9, and 6b when the 10b data is within the ranges of 0-127, 128-255, and 256-1023, respectively. When the imager operates at frame rates lower than 120fps, the word length can be extended to a maximum of 20b, corresponding to 30x multi-sampling in the FI. After the format conversion, the blank and sync codes are embedded into the signal, and the parallel data is serialized. The serialized data is output from the imager through a scalable low-voltage signaling (SLVS) driver. The imager has 184 data channels of 864Mb/s and 12 clock channels. The aggregate data rate output from the imager is approximately 159Gb/s.
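To make the 10b-to-8b floating-point conversion concrete, the sketch below implements one piecewise mapping that is consistent with the stated effective bit depths: a step of 1 LSB for 0-127, 2 LSBs for 128-255, and 16 LSBs for 256-1023. The actual code assignment used in the imager is not given in the text, so the segment table, the code bases, and the function names here are assumptions for illustration only.

```python
# Hypothetical 10b -> 8b companding consistent with effective depths of
# 10b (0-127), 9b (128-255), and 6b (256-1023). The code assignment is an
# assumption of this sketch, not the imager's documented format.

SEGMENTS = [
    # (range_start, range_end_inclusive, step_in_10b_lsb, first_8b_code)
    (0, 127, 1, 0),
    (128, 255, 2, 128),
    (256, 1023, 16, 192),
]


def encode_10b_to_8b(value):
    """Map a 10b code to an 8b code with range-dependent precision."""
    if not 0 <= value <= 1023:
        raise ValueError("value must be a 10b code")
    for start, end, step, base in SEGMENTS:
        if value <= end:
            return base + (value - start) // step
    raise AssertionError("unreachable")


def decode_8b_to_10b(code):
    """Return the lower edge of the 10b interval represented by an 8b code."""
    for start, _end, step, base in reversed(SEGMENTS):
        if code >= base:
            return start + (code - base) * step
    raise ValueError("code must be non-negative")


for sample in (5, 127, 128, 255, 256, 1023):
    code = encode_10b_to_8b(sample)
    print(sample, "->", code, "->", decode_8b_to_10b(code))

# Link budget check from the text: 184 channels at 864Mb/s each.
aggregate_bps = 184 * 864e6
print("aggregate rate: %.1f Gb/s" % (aggregate_bps / 1e9))
```

With this particular assignment only codes 0 to 239 are used, so the mapping fits in 8b with some codes left over. The final lines confirm that 184 channels at 864Mb/s give approximately 159Gb/s.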

Figure 5.6.2 shows a circuit diagram of the pixel signal readout. Vertically aligned 2-shared pixels are connected alternately to two source followers, and the two source followers are connected to a CDS circuit via a switch. The switched-capacitor CDS has an analog gain that can be varied in steps up to 4x. As shown in Fig. 5.6.3, the two source followers work in parallel, and their operation phases differ by 1/4 cycle in the 120fps mode. Each source follower begins settling in advance and is connected to the CDS circuit during the latter half of its settling period. In this configuration, the CDS circuit receives the reset levels of the two pixels in succession, and then the signal levels of the two pixels in succession. Since the CDS has one sampling capacitor, the reset noise arising from the floating diffusion must be eliminated using digital CDS.

To achieve a 14b conversion in a sampling period of 0.93 $\mu$ s, the 1<sup>st</sup> ADC has two sampling capacitors (C1A and C1B) as shown in Fig. 5.6.2. The operation procedure for all switches is also shown in Fig. 5.6.2. The FI ADC works as follows. In the 1<sup>st</sup> sampling phase, C1A and C2 are charged by the input voltage. At that time, the comparator (CMP) compares the voltage on C2 with VC. In the 2<sup>nd</sup> sampling phase, the charge sampled in C1A is transferred to C2 while C1B simultaneously samples the input voltage. In the following phases, the ADC works in the same manner as in the 2<sup>nd</sup> sampling phase, but with the roles of C1A and C1B exchanged. In the 120fps mode, the ADC samples the input voltage 6x and outputs a digital code from 0 to 6 with a quadrupled and folded residue voltage. The 2<sup>nd</sup> ADC also duplicates its sampling capacitor, so we call this method dual-cyclic. By using the two sampling capacitors C1A and C1B, the amplified voltage of C1A can be captured by C1B simultaneously, and vice versa in the next phase. In this way, the ADC saves the time otherwise needed to feed the amplified voltage back to the sampling capacitor. The 2<sup>nd</sup> ADC performs a 1.5b cyclic conversion 6x and outputs a redundant 7b binary code. The 3<sup>rd</sup> ADC samples the residue voltage from the 2<sup>nd</sup> ADC and outputs a 6b code.
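The principle of a 1.5b-per-cycle cyclic conversion can be illustrated with a textbook behavioral model. The sketch below is not the circuit of this imager: it assumes an ideal, normalized bipolar input range of ±V_ref and comparator thresholds at ±V_ref/4, and it ignores the dual-capacitor timing entirely. All names are hypothetical.

```python
# Textbook behavioral model of a 1.5b/cycle cyclic ADC (ideal components).
# Assumptions: input normalized to [-v_ref, +v_ref], thresholds at +/- v_ref/4.


def cyclic_1p5b(v_in, v_ref=1.0, cycles=6):
    """Return the ternary digits (-1, 0, +1) and the final residue voltage."""
    digits = []
    residue = v_in
    for _ in range(cycles):
        if residue > v_ref / 4:
            digit = 1
        elif residue < -v_ref / 4:
            digit = -1
        else:
            digit = 0
        # Amplify by 2 and subtract the reference selected by the digit.
        residue = 2 * residue - digit * v_ref
        digits.append(digit)
    return digits, residue


def digits_to_code(digits):
    """Merge redundant ternary digits (MSB first) into one signed integer."""
    code = 0
    for digit in digits:
        code = 2 * code + digit
    return code


def reconstruct(digits, residue, v_ref=1.0):
    """Input voltage implied by the digits plus the remaining residue."""
    return (digits_to_code(digits) * v_ref + residue) / 2 ** len(digits)


digits, residue = cyclic_1p5b(0.3)
print("digits:", digits)
print("code:", digits_to_code(digits))
print("residue:", residue)
print("reconstructed:", reconstruct(digits, residue))
```

With 6 cycles the merged code spans -63 to +63, that is 127 levels, which fits in 7b; the residue that remains after the last cycle is what a following stage would digitize.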

In the 240fps mode, the digital-CDS is omitted to reduce the pixel processing time to half of that in the 120fps mode. Without the digital-CDS, the two source followers do not operate in parallel. Thus, the settling time for the source followers is reduced to 1/4 of that in the 120fps mode. Then, the 1<sup>st</sup> ADC is bypassed, and the output voltage from the CDS is directly sampled by the 2<sup>nd</sup> ADC. The 2<sup>nd</sup> and 3<sup>rd</sup> ADCs operate as a 12b ADC.

In the 480fps mode, the operation of the 1<sup>st</sup> ADC changes significantly. The two source followers are directly connected to the two sampling capacitors in the 1<sup>st</sup> ADC. In this way, the 1<sup>st</sup> ADC works as a CDS circuit with two sampling capacitors. While one sampling capacitor is in the signal transfer phase, the other sampling capacitor can sample a reset level from the other source follower. Therefore, the pixel and source follower can use the same amount of time as in the 240fps mode. The output voltage from the 1<sup>st</sup> ADC (working as a CDS) is converted to a 10b code by the 2<sup>nd</sup> and 3<sup>rd</sup> ADCs.

The measured performance and specifications are listed in Fig. 5.6.4. The developed pixel exhibits a full well capacity of 7,600e<sup>-</sup> and a sensitivity of 16,800e<sup>-</sup>/lx·s (CIE A-light with IRC). The measured input-referred noise is 3.2e<sup>-</sup>, 4.3e<sup>-</sup>, and 27e<sup>-</sup> in the 120fps, 240fps, and 480fps modes, respectively. The FI suppresses random noise in the 120fps mode despite the aggressive high-speed design of the ADC. Thanks to the digital-CDS, the VFPN is as small as 0.24e<sup>-</sup> in the 120fps mode. Although the power consumption is larger than our prediction, we have already found the cause of this discrepancy and will correct it shortly. The image captured in the 480fps mode is shown in Fig. 5.6.5.

Figure 5.6.6 shows a comparison between the performance of recently reported 4K and 8K imagers [3-7]. Based on the figure, the imager described in this work achieves the highest pixel rate, lowest noise among the 8K imagers, and multi-functionality. A die micrograph is shown in Fig. 5.6.7.
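The pixel-rate figures quoted for this work follow directly from the active pixel count and the frame rate. The short check below is only an illustrative sketch; the function name `pixel_rate_gpix` is ours and does not come from the paper.

```python
# Pixel rate = active columns x active rows x frame rate.
ACTIVE_H = 7680   # active pixels per row (Fig. 5.6.4)
ACTIVE_V = 4320   # active rows (Fig. 5.6.4)

def pixel_rate_gpix(fps, h=ACTIVE_H, v=ACTIVE_V):
    """Return the pixel rate in Gpixel/s for a given frame rate (hypothetical helper)."""
    return h * v * fps / 1e9

# One entry per drive mode of this work.
rates = {fps: round(pixel_rate_gpix(fps), 2) for fps in (120, 240, 480)}
print(rates)
```

The three values agree with the 3.98, 7.96, and 15.9 Gpix/s entries listed for the 120fps, 240fps, and 480fps modes in Fig. 5.6.6.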

### References:

- [1] K. Kitamura, et al., "Full-Specification 8K Camera System," *2016 NAB BEC Proceedings*, Apr. 2016.
- [2] M.-W. Seo, et al., "A Low-Noise High Intrascene Dynamic Range CMOS Image Sensor with a 13 to 19b Variable-Resolution Column-Parallel Folding-Integration/Cyclic ADC," *IEEE J. Solid-State Circuits*, vol.47, no.1, pp. 272-283, 2012.
- [3] Y. Oike, et al., "An 8.3M-pixel 480fps Global-Shutter CMOS Image Sensor with Gain-Adaptive Column ADCs and 2-on-1 Stacked Device Structure," in *Symp. VLSI Circuits Dig. Tech. Papers*, June 2016.
- [4] T. Arai, et al., "A 1.1 $\mu$ m 33Mpixel 240fps 3D-Stacked CMOS Image Sensor with 3-Stage Cyclic-Based Analog-to-Digital Converters," *ISSCC Dig. Tech. Papers*, Feb. 2016.
- [5] T. Yasue, et al., "A 14-bit, 33-Mpixel, 120-fps Image Sensor with DMOS Capacitors in 90-nm/65-nm CMOS," in *Proc. of IISW*, June 2015.
- [6] R. Funatsu, et al., "133Mpixel 60fps CMOS Image Sensor with 32-Column Shared High-Speed Column-Parallel SAR ADCs," *ISSCC Dig. Tech. Papers*, Feb. 2015.
- [7] T. Toyama, et al., "A 17.7Mpixel 120fps CMOS Image Sensor with 34.8Gb/s Readout," *ISSCC Dig. Tech. Papers*, Feb. 2011.



Figure 5.6.1: Block diagram and signal flow of the image sensor.



Figure 5.6.3: Timing diagram of pixel readout, CDS, and A/D conversion in each operation mode.



Figure 5.6.5: Image captured in 480fps mode.



Figure 5.6.2: Pixel readout structure and ADC operation in the 120fps mode.

| Drive Mode        | 120fps with FI and Digital-CDS                                       | 240fps             | 480fps                          |
|-------------------|----------------------------------------------------------------------|--------------------|---------------------------------|
| Process           | 110 nm 1P4M CIS (FSI)                                                |                    |                                 |
| Chip size         | 22.3 mm (H) x 30.9 mm (V)                                            |                    |                                 |
| Pixel size        | 2.1 $\mu\text{m} \times 2.1 \mu\text{m}$                             |                    |                                 |
| Pixel count       | Active 7,680 x 4,320 (Total 8,464 x 4,352 include OB and test pixel) |                    |                                 |
| Output            | 184ch 864 Mbps SLVS + 12ch 432 MHz Clock                             |                    |                                 |
| Frame Freq.       | 120 Hz RS                                                            | 240 Hz RS          | 480 Hz RS                       |
| ADC resolution    | 14 bit                                                               | 12 bit             | 10 bit<br>(floating point code) |
| Integration times | 6 times                                                              | -                  | -                               |
| FWC               | 7,600 e (linear max)                                                 |                    |                                 |
| Sensitivity       | 16,800 e/lxs (monochrome) CIE A-light with IR-cut filter             |                    |                                 |
| Conversion Gain   | 85.4 $\mu\text{V/e}$                                                 |                    |                                 |
| Random Noise      | 3.2 e (at gain x2)                                                   | 4.3 e (at gain x2) | 27 e (at gain x1)               |
| VFPN              | 0.24 e                                                               | 8.1 e              | 24 e                            |
| Power             | 12.5 W                                                               | 9.8 W              | 9.0 W                           |
| A/D Conv. Period  | 0.93 $\mu\text{s}$<br>(2 conversions for 1 pixel)                    | 0.93 $\mu\text{s}$ | 0.46 $\mu\text{s}$              |

Figure 5.6.4: Performance summary.

|                   | This work       |       | [3]     | [4]      | [5]     | [6]    | [7]     |
|-------------------|-----------------|-------|---------|----------|---------|--------|---------|
| year              | 2018            |       | 2016    | 2016     | 2015    | 2015   | 2011    |
| Optical format    | 1.25-inch       |       | Super35 | 2/3-inch | Super35 | Full35 | Super35 |
| Pixel pitch       | $\mu\text{m}$   | 2.1   | 5.86    | 1.1      | 3.2     | 2.45   | 4.2     |
| Pixel count (H)   | pixels          | 7,680 | 3,840   | 7,680    | 7,680   | 15,360 | 8,192   |
| Pixel count (V)   | pixels          | 4,320 | 2,160   | 4,320    | 4,320   | 8,640  | 2,160   |
| Frame rate        | fps             | 480   | 240     | 120      | 480     | 240    | 60      |
| Pixel rate        | Gpix/s          | 15.9  | 7.96    | 3.98     | 7.96    | 3.98   | 2.12    |
| Power consumption | mW              | 9,000 | 9,800   | 12,500   | 5,230   | 3,000  | 3,200   |
| Conversion gain   | $\mu\text{V/e}$ | 85.4  | 85.4    | 85.4     | 30.3    | 92.0   | 61.0    |
| Analog gain       | times           | 1.0   | 2.0     | 2.0      | 1.0     | 4.0    | 3.5     |
| Noise             | e               | 27    | 4.3     | 3.2      | 4.6     | 3.6    | 5.2     |
| Saturation        | e               | 7,600 | 7,600   | 7,600    | 30,450  | 5,700  | 15,300  |
|                   |                 |       |         |          |         |        | 21,000  |

Figure 5.6.6: Comparison with recent 8K and 4K image sensors.



Figure 5.6.7: Die micrograph.

## 5.7 A 20ch TDC/ADC Hybrid SoC for 240×96-Pixel 10%-Reflection <0.125%-Precision 200m-Range Imaging LiDAR with Smart Accumulation Technique

Kentaro Yoshioka<sup>1</sup>, Hiroshi Kubota<sup>1</sup>, Tomonori Fukushima<sup>1</sup>, Satoshi Kondo<sup>1</sup>, Tuan Thanh Ta<sup>1</sup>, Hidenori Okuni<sup>1</sup>, Kaori Watanabe<sup>1</sup>, Yoshinari Ojima<sup>1</sup>, Katsuyuki Kimura<sup>1</sup>, Sohichiro Hosoda<sup>1</sup>, Yutaka Oota<sup>1</sup>, Tomohiro Koizumi<sup>1</sup>, Naoyuki Kawabe<sup>1</sup>, Yasuhiro Ishii<sup>1</sup>, Yoichiro Iwagami<sup>1</sup>, Seitaro Yagi<sup>1</sup>, Isao Fujisawa<sup>2</sup>, Nobuo Kano<sup>1</sup>, Tomohiro Sugimoto<sup>1</sup>, Daisuke Kurose<sup>1</sup>, Naoya Waki<sup>1</sup>, Yumi Higashi<sup>2</sup>, Tetsuya Nakamura<sup>1</sup>, Yoshikazu Nagashima<sup>1</sup>, Hirotomo Ishii<sup>1</sup>, Akihide Sai<sup>1</sup>, Nobu Matsumoto<sup>1</sup>

<sup>1</sup>Toshiba, Kawasaki, Japan; <sup>2</sup>Toshiba Memory, Kawasaki, Japan

Long-range and high-pixel-resolution LiDAR systems, which use the Time-of-Flight (ToF) of photons reflected from the target, are essential for launching safe and reliable self-driving programs of Level 4 and above. A 200m long-range distance measurement (DM) is required to sense preceding vehicles and obstacles as early as possible in a highway situation. To realize safe and reliable self-driving in city areas, LiDAR systems combining a wide angle of view with high pixel resolution are required to fully perceive surrounding events. Moreover, this performance must be achieved under strong background light (e.g., sunlight), which is the most significant noise source for LiDAR systems. To accomplish a 100m-range DM, an accumulation of the DM results over several pixels is utilized to improve the S/N ratio with 70klux background light [1]. Here, S is the number of photons reflected from the target and N is the number of background light photons. However, if the range is extended to 200m under similar conditions of laser power and frame rate (FPS), 16$\times$ more pixel accumulation is required. Such pixel accumulation blurs the range image, and hence a serious oversight of surrounding events, such as a pedestrian darting out, may occur, which makes it unsuitable for self-driving applications. Furthermore, the Time-to-Digital Converter (TDC) based ToF measurement is activated only when 2 or more photons are detected simultaneously [1], and thus is not suitable for the 200m long-range DM, where few photons are reflected from the target. On the other hand, ToF measurements using ADCs, which can continuously quantize the silicon photomultiplier (SiPM) output and can sense single-photon events, suit long-range measurement well [2]. However, a number of accumulations is still required to accomplish 200m-range DM, and hence low resolution is inevitable. In addition, the SoC cost is critical. To enhance the short-range DM resolution by using ADCs, the required sampling rate is over 10GS/s; in a 20ch AFE, such an ADC array alone may occupy an area of over 10mm<sup>2</sup> and consume huge power [3].
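The 16$\times$ figure can be reproduced with a simple scaling argument. The sketch below rests on assumptions that are ours rather than statements from the paper: the received signal falls off as the inverse square of distance, the background level per pixel does not depend on distance, and accumulating M pixels improves the S/N ratio by $\sqrt{M}$. The function name is hypothetical.

```python
def accumulation_factor(range_ratio):
    """Extra pixel accumulation needed when the range grows by range_ratio.

    Assumptions (ours, not stated in the paper):
      * received signal photons S fall off as 1/d^2,
      * background photons N per pixel do not depend on distance,
      * accumulating M pixels improves S/N by sqrt(M).
    """
    signal_loss = range_ratio ** 2   # S drops by this factor
    return signal_loss ** 2          # sqrt(M) must make up for signal_loss

print(accumulation_factor(2.0))  # range extended from 100 m to 200 m
```

Under these assumptions, doubling the range costs a factor of 4 in signal and therefore a factor of 16 in accumulated pixels, which is why plain accumulation trades away so much pixel resolution.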

A TDC/ADC hybrid LiDAR SoC with a Smart Accumulation Technique (SAT) is reported that achieves both 200m range and high-pixel-resolution range imaging for reliable self-driving programs. SAT recognizes and accumulates only the target reflection data by utilizing the intensity and background light information from the ADC, which enhances the effective pixel resolution 4$\times$ compared to conventional accumulation. The TDC/ADC hybrid architecture relaxes the ADC sampling rate requirements for short-range DM precision, and moreover, a residue-quantizing noise-shaping (RQNS) SAR ADC further reduces the ADC cost. The proof-of-concept 200m-range LiDAR system achieves a 2$\times$ wider DM range compared to the conventional designs [1,5], and also realizes 240×96 pixel resolution and 0.125% DM precision.

Figure 5.7.1 shows a block diagram of the TDC/ADC hybrid LiDAR SoC. The SoC consists of an ADC-based long-range (LR) DM block (20 to 200m), a TDC-based short-range (SR) DM block (0 to 20m), and a digital block for range image construction including SAT. DM is carried out by measuring the ToF of the round-trip photons captured via the SiPM, and laser scanning is performed by a mechanical mirror. The ADC is an essential block for the LiDAR SoC to realize 200m-range DM with higher effective pixel resolution. However, to obtain 0.1% DM precision for SR measurement closer than 20m (corresponding to 2cm at 20m), an ADC sampling rate of over 10GS/s is required, which is unrealistic due to chip cost and power consumption. The TDC/ADC hybrid architecture overcomes this dilemma. The 12b 40ps-LSB TDC achieves 1cm precision and, moreover, the hybrid architecture significantly relaxes the ADC sampling rate to 400MS/s. The reflected photons from the target are directed onto two kinds of SiPM (LR and SR), each customized for its range requirement, where the SR-SiPM has a higher input-photon-saturation tolerance than the LR-SiPM. Using the techniques described above, an SoC for a 200m-range, 0.125%-DM-precision, 240×96-pixel-resolution LiDAR system for >Level 4 self-driving programs is realized with satisfactory cost.
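The timing numbers in this paragraph can be related to distance through the round-trip relation d = c·t/2. The following sketch converts the TDC LSB and the ADC sample periods into raw distance steps. It ignores any interpolation between samples, which is an assumption on our part, and the helper name `range_step_m` is hypothetical.

```python
C = 299_792_458.0  # speed of light in m/s

def range_step_m(time_step_s):
    """Distance change corresponding to one round-trip time step."""
    return C * time_step_s / 2.0

tdc_lsb_m = range_step_m(40e-12)            # 40 ps TDC LSB
tdc_span_m = range_step_m(40e-12 * 2**12)   # full 12b TDC code span
adc_400m_m = range_step_m(1 / 400e6)        # one sample period at 400 MS/s
adc_10g_m = range_step_m(1 / 10e9)          # one sample period at 10 GS/s

print(f"TDC LSB: {tdc_lsb_m * 1e3:.1f} mm, TDC span: {tdc_span_m:.1f} m")
print(f"ADC step: {adc_400m_m * 100:.1f} cm at 400 MS/s, "
      f"{adc_10g_m * 100:.1f} cm at 10 GS/s")
```

Read this way, a 40ps LSB corresponds to a sub-centimeter step and the 12b code span covers slightly more than the 0 to 20m SR range, while a raw 400MS/s sample period is far coarser than the 2cm needed at 20m and only a rate around 10GS/s brings the step below 2cm.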

Figure 5.7.2 shows the main concept behind SAT. SAT recognizes the target reflection data and conducts smart accumulation by utilizing the intensity and background light information from the ADC; the S/N ratio is efficiently improved and long-range high-pixel-resolution DM is achieved. Since few photons are reflected from a 200m-range 10%-reflectivity target, the S/N ratio can be <0dB for a worst-case condition with a background light of 70klux. In such conditions, M-times accumulation of the DM results through the surrounding M pixels is necessary for the S/N improvement. However, if pixels “watching” a different target are used for the accumulation, both S/N ratio and pixel resolution are degraded [1]. In Fig. 5.7.2, we assume that E is the measuring pixel (MP) and the surrounding pixels are candidates for accumulation. Note that accumulating the pixels D, F, G, H, I “watching” the same target (a car) as E contributes to the S/N improvement. On the other hand, by accumulating pixels A, B, C “watching” a non-target object (an electric pole), the ToF measurement can easily fail. To prevent this problem, SAT pre-processes the ADC results and tags the peak level (PL) and floor level (FL). Note that the amplitude swing of the ADC depends on the number of captured photons, and hence PL has a strong correlation with the reflectivity and the distance of the target. Moreover, FL also has a strong correlation with the target, since its photon source is sunlight reflected from the target. SAT accumulates a candidate pixel only when the correlations of both its PL and FL with those at the MP exceed an appropriate value (e.g., 70%). Also, SAT enhances not only the accuracy but also the environmental robustness; PL is effective for nighttime and FL for daytime imaging, depending on the amount of the background light. Simulation results show that, compared to the conventional simple accumulation [1], SAT can detect 1/4-size objects; hence, SAT enhances the effective pixel resolution by 4$\times$ and thus suits longer-range DM.
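The selection rule described above can be sketched in a few lines. This is only an illustration under assumptions of ours: the paper does not define how PL, FL, or their "correlation" are computed, so the sketch takes PL as the waveform maximum, FL as the waveform median, and uses a hypothetical min/max ratio as the similarity measure. All function names and the toy waveforms are made up for the example.

```python
from statistics import median

def peak_and_floor(waveform):
    """Tag a waveform with a peak level (PL) and floor level (FL).
    Assumption: PL = maximum sample, FL = median sample."""
    return max(waveform), median(waveform)

def similarity(a, b):
    """Hypothetical similarity in [0, 1]: smaller value over larger value."""
    hi = max(a, b)
    return 1.0 if hi == 0 else min(a, b) / hi

def smart_accumulate(measuring, neighbours, threshold=0.7):
    """Sum the measuring pixel with neighbours whose PL and FL both match it."""
    pl_mp, fl_mp = peak_and_floor(measuring)
    total = list(measuring)
    used = []
    for name, wf in neighbours.items():
        pl, fl = peak_and_floor(wf)
        if similarity(pl, pl_mp) >= threshold and similarity(fl, fl_mp) >= threshold:
            total = [t + s for t, s in zip(total, wf)]
            used.append(name)
    return total, used

# Toy ADC waveforms: E and D see the same target (echo at sample 5); A sees a
# nearer, brighter object (echo at sample 2) with a different background level.
E = [4, 5, 4, 5, 4, 9, 4, 5]
D = [5, 4, 5, 4, 5, 10, 5, 4]
A = [9, 8, 30, 9, 8, 9, 8, 9]
total, used = smart_accumulate(E, {"D": D, "A": A})
print(used, total.index(max(total)))
```

In this toy case only pixel D is accumulated with E, so the echo position of the accumulated waveform stays at the target's sample, whereas blindly adding A would pull the peak toward the non-target object.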

Figure 5.7.3 shows the 400MS/s RQNS-SAR ADC. Generally, in a SAR ADC, the C-DAC area is dominant because it increases exponentially with the ADC bit-width. Moreover, the number of SAR conversion cycles also increases with the ADC bit-width. Hence, to raise the sampling rate to our 400MS/s target, an area-consuming time-interleaving technique would be required. Therefore, a SAR ADC with noise shaping (NS) [4] is used for our LiDAR SoC, which enhances the conversion accuracy per cycle by NS, hence requiring fewer cycles and ADC bits, as well as a smaller area. On top of that, a 2<sup>nd</sup>-order residue-quantizing NS (RQNS) technique is proposed for further area reduction. In the conventional NS-SAR ADC, the residue signal processing requires a number of area-consuming analog circuits, such as sampling circuits and amplifiers. Here, the RQNS shifts the residue processing to the digital domain, where signal processing has low area cost: the amplification can be performed by bit-shifts and the sampling by a single flip-flop. This enables the NS order enhancement with minimum area penalty. A residue quantizer composed of a SAR ADC samples and quantizes the residue prior to the next conversion in a pipelined fashion. Finally, the 2<sup>nd</sup>-order NS reduces the number of ADC bits by 3 and reduces the C-DAC area by 80%. Since the ADC sampling rate, rather than the input bandwidth, is crucial for DM precision, over-sampling is acceptable. The measurement results confirm the 2<sup>nd</sup>-order NS effect: the ADC achieves SNDR=37.7dB at OSR=2 and the smallest area among ADCs previously reported at ISSCC with SNDR>35dB and BW of 50 to 400MHz (see Fig. 5.7.3).
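To illustrate what 2<sup>nd</sup>-order noise shaping does in general, the sketch below implements a textbook error-feedback quantizer with noise transfer function $(1 - z^{-1})^2$. It is not a model of the RQNS circuit, whose loop details are not given in the text; the unit-step quantizer, the constant test input, and all names are assumptions chosen for the example.

```python
import math

def quantize(u):
    """Uniform quantizer with unit step (round half up)."""
    return math.floor(u + 0.5)

def plain(samples):
    """Quantize each sample independently (no noise shaping)."""
    return [quantize(x) for x in samples]

def second_order_shaped(samples):
    """Error-feedback quantizer with noise transfer function (1 - z^-1)^2."""
    e1 = e2 = 0.0               # quantization errors of the previous two samples
    out = []
    for x in samples:
        u = x - 2.0 * e1 + e2   # subtract the fed-back past errors
        y = quantize(u)
        e2, e1 = e1, y - u      # shift the error history
        out.append(y)
    return out

def mean(seq):
    return sum(seq) / len(seq)

x = [0.3] * 1000                # slowly varying (here: constant) input
err_plain = abs(mean(plain(x)) - 0.3)
err_shaped = abs(mean(second_order_shaped(x)) - 0.3)
print(err_plain, err_shaped)
```

With shaping, the low-frequency (averaged) error shrinks because the quantization error is pushed to high frequencies, which is what allows a coarser quantizer to be used. Note that the feedback path here needs only a multiplication by two and two stored values, operations that are inexpensive once the residue is available in digital form.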

The TDC/ADC hybrid LiDAR SoC is fabricated in 28nm CMOS (Fig. 5.7.7). Figure 5.7.4 shows the LiDAR DM performance, where the 10%-reflectivity target was moved from 20 to 220m under 100klux sunlight (70klux at the target surface). A DM success rate of 100% up to 180m and 92.7% at 200m is achieved. Moreover, the $\sigma$ error of DM is consistently under 0.125% within the 20-to-200m range, representing a very high precision over a wide range. The range images captured by the LiDAR with SAT, without accumulation, and with simple accumulation are shown in Fig. 5.7.5(a), (b), and (c), respectively. Figure 5.7.5(a) demonstrates that SAT significantly improves the pixel resolution and can recognize the pedestrian, making it suitable for self-driving applications. Figure 5.7.6 shows the LiDAR performance comparison. Based on the figure, the LiDAR using the hybrid SoC and SAT achieves 2$\times$ longer DM range and 4$\times$ higher effective pixel resolution than conventional designs with almost equivalent FPS. Even if the same laser power is used as in [1], our LiDAR can achieve 1.4$\times$ longer DM range due to the hybrid architecture and SAT.

### References:

- [1] C. Niclass, et al., “A 0.18 $\mu$ m CMOS SoC for a 100m-Range 10fps 200×96 Pixel Time-of-Flight Depth Sensor,” *ISSCC Dig. Tech. Papers*, pp. 488-489, Feb. 2013.
- [2] Velodyne, “High Definition Lidar White Paper”. Accessed on Nov. 19, 2017, [http://velodynelidar.com/docs/papers/HDL%20white%20paper\\_OCT2007\\_web.pdf](http://velodynelidar.com/docs/papers/HDL%20white%20paper_OCT2007_web.pdf)
- [3] D. Cui, et al., “A 320mW 32Gb/s 8b ADC-Based PAM-4 Analog Front-End with Programmable Gain Control and Analog Peaking in 28nm CMOS,” *ISSCC Dig. Tech. Papers*, pp. 58-59, Feb. 2016.
- [4] J. Fredenburg and M. Flynn, “A 90MS/s 11MHz Bandwidth 62dB SNDR Noise-Shaping SAR ADC,” *ISSCC Dig. Tech. Papers*, pp. 468-469, Feb. 2012.
- [5] H. Akita, “An Imager Using 2-D Single-Photon Avalanche Diode Array in 0.18- $\mu$ m CMOS for Automotive LIDAR Application,” *IEEE Symp. VLSI Circuits*, pp. 290-291, June 2017.



Figure 5.7.1: Block diagram of the 200m range Hybrid LiDAR SoC.



Figure 5.7.2: Smart Accumulation Technique (SAT) concept and simulation based comparison against conventional simple accumulation technique.



Figure 5.7.3: Block diagram and measured result of the 400MS/s Residue-Quantizing Noise-Shaping (RQNS) SAR ADC.



Figure 5.7.4: Measured LiDAR Distance Measurement (DM) performance for 10% reflectivity object.



Figure 5.7.5: Range image acquired by the LiDAR system: with SAT (a) and without SAT (b, c).

|                                            | This Work         | [1]               | [5]         |
|--------------------------------------------|-------------------|-------------------|-------------|
| Technology                                 | 28nm              | 180nm             | 180nm       |
| SPADs                                      | Off-chip          | On-chip           | On-chip     |
| Optical System                             | Mechanical Mirror | Mechanical mirror | MEMS mirror |
| Pixel-Resolution                           | 240x96            | 202x96            | N.A.        |
| Effective Pixel-Resolution w/ accumulation | 4x                | 1x                | N.A.        |
| Laser Wavelength [nm]                      | 905               | 870               | 870         |
| Laser power [mW]                           | 50                | 21                | N.A.        |
| FPS                                        | 10                | 10                | N.A.        |
| Target Reflectivity                        | 10%               | 9%                | 10%         |
| Background light [klux]                    | 70                | 70                | 75          |
| Distance range [m]                         | 200               | 100               | 20          |
| 1 sigma error @max distance                | 0.125%            | 0.14%             | 0.5%        |

Figure 5.7.6: Comparison results with recently published LiDAR systems.



Figure 5.7.7: Chip photo.

## 5.8 1Mpixel 65nm BSI 320MHz Demodulated TOF Image Sensor with 3.5 $\mu$ m Global Shutter Pixels and Analog Binning

Cyrus S. Bamji, Swati Mehta, Barry Thompson, Tamer Elkhatib, Stefan Wurster, Onur Akkaya, Andrew Payne, John Godbaz, Mike Fenton, Vijay Rajasekaran, Larry Prather, Satya Nagaraja, Vishali Mogallapu, Dane Snow, Rich McCauley, Mustansir Mukadam, Iskender Agi, Shaun McCarthy, Zhanping Xu, Travis Perry, William Qian, Vei-Han Chan, Prabhu Adepu, Gazi Ali, Muneeb Ahmed, Aditya Mukherjee, Sheetal Nayak, Dave Gampell, Sunil Acharya, Lou Kordus, Pat O'Connor

Microsoft, Mountain View, CA

The quest for accurate, high-resolution, low-power-consumption, and small-footprint 3D depth cameras has driven a rapid improvement in Continuous-Wave (CW) Time-of-Flight (ToF) technology. Commercially available 3D image acquisition techniques include Stereo Vision, Structured Light, and ToF. CW ToF imaging systems offer excellent mechanical robustness, no baseline requirement, high effective depth image resolution, low computational cost, and simultaneous IR ambient light invariant intensity capture (Active Brightness). In a CW ToF camera, light from an amplitude modulated light source is backscattered by objects in the camera's field of view, and the phase delay of the amplitude envelope is measured between the emitted and reflected light. This phase difference is translated into a distance value for each pixel in the imaging array.
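The translation from phase to distance follows the standard CW ToF relation d = c·φ/(4π·f_mod), with the phase wrapping every c/(2·f_mod). The sketch below evaluates this for two modulation frequencies; the function names are ours and the code is only an illustration of the general relation, not of this sensor's processing pipeline.

```python
import math

C = 299_792_458.0  # speed of light in m/s

def phase_to_distance(phase_rad, f_mod_hz):
    """Distance for a measured envelope phase delay (round trip)."""
    return C * phase_rad / (4.0 * math.pi * f_mod_hz)

def unambiguous_range(f_mod_hz):
    """Distance at which the measured phase wraps by 2*pi."""
    return C / (2.0 * f_mod_hz)

for f in (200e6, 320e6):
    print(f"{f / 1e6:.0f} MHz: phase wraps at {unambiguous_range(f):.3f} m")
```

A higher modulation frequency gives more phase change per unit distance, which helps depth uncertainty, but it also shortens the distance at which the phase wraps.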

Considerable effort has been applied to improve the spatial resolution, accuracy, and operating range of CW ToF cameras while lowering power consumption [1-4]. Uncertainty, range, and power consumption are improved by increasing Modulation Contrast (MC), Quantum Efficiency (QE), and modulation frequency, and by reducing read noise and the number of analog-to-digital conversions; smaller pixels reduce the optical stack height.

This paper presents a ToF image sensor fabricated in a TSMC 65nm 1P8M backside illumination (BSI) CMOS technology. This 1024 $\times$ 1024 pixel ToF global shutter image sensor achieves a high MC of 87% at 200MHz. The small pixel size of 3.5 $\times$ 3.5 $\mu$ m<sup>2</sup> is competitive with commercial global-shutter RGB sensors and facilitates a small optical stack height in mobile devices. A DLL-controlled clock-distribution system spreads the column clock peak currents across the columns thus reducing overall chip peak currents allowing a high 320MHz modulation frequency. The readout circuitry is fully differential and supports per-pixel adaptive gain selection, analog Correlated Double Sampling (CDS), analog pixel binning modes (e.g., 2 $\times$ 2) and a 9b or 10b Analog-to-Digital Converter (ADC).

Our differential ToF pixel photodetector operates in a *quasi-digital* demodulation mode. In this scheme, two polysilicon gates compete to collect photo-charges and the gate with a higher bias voltage captures almost all of the photo-charges. The gates also create a strong drift field allowing fast charge collection, resulting in a high photodetector MC of 87% at 200MHz and 78% at 320MHz. QE is 44% at 860nm with a standard TSMC BSI process. Lower detector gate capacitance and voltage swing result in less than half the power consumption per unit area compared to our previous work [2]. Figure 5.8.1 shows a simplified cut through the photodetector structure and the collected photo current when polysilicon gate PGB is at a higher potential than polysilicon gate PGA. Simulation shows almost all charges are collected by gate PGB. Pixel effective fill factor is close to 100% due to the use of BSI technology and an additional optimized micro-lens structure.

Figure 5.8.2 presents the differential pixel schematic and timing diagram. The schematic includes two in-pixel memory storage elements which store collected photo charges as minority carriers suitable for analog CDS. The pixel layout has centroid symmetry, minimizing offsets and noise. *Global reset* clears charges from the gates and the memory elements. During *integration*, modulation gates PGA and PGB are driven with complementary column clocks and collected photo charges accumulate into in-pixel memories A and B.

A DLL-based clock driver system generates uniformly time-spaced pixel column clocks for the sensor array, avoiding the large peak current transients often generated by balanced clock trees. Each delay line element incorporates a feed-forward component crossing from the A domain to the B domain, speeding up the delay performance. This achieves a guaranteed column-to-column time delay of 9.5ps while creating two delay paths (Clock A and Clock B that drive pixel gates PGA and PGB) with guaranteed orthogonality at speeds beyond 320MHz. As a result, the clocking system provides enforced symmetric zero crossings with Process Voltage and Temperature (PVT) independent fast delay times and interpolated A to B to A spacing of 9.5ps.

After integration, global-shutter functionality is enabled during the *anti-blooming* period to reduce shot noise from ambient light. Shutter efficiency is about 99.8%. During a *row readout*, kT/C noise is cancelled by first resetting the floating diffusions FDA, FDB and reading out a corresponding kT/C and offset value. Then the values stored in the in-pixel memories A, B are dumped into floating diffusions FDA, FDB and the final signal value is read out through bitlines A, B. The just-read kT/C and offset value is then subtracted in the analog domain from this final value by the column readout circuitry and a single analog-to-digital conversion is performed on the analog subtracted value.

A block diagram of the readout circuit is shown in Fig. 5.8.3. The column multiplexer performs both pixel selection and electrical pixel binning functions. Six 1.2V, 4 $\mu$ A gain selection comparators simultaneously check both input signal polarities versus three decision thresholds to select one of four amplifier (AMP) gains. The column amplifier gain is programmed with adjustable input and feedback capacitors to allow gains from 0.25 $\times$  to 24 $\times$ . The analog CDS operation requires the amplifier to subtract the sampled pixel reset voltage from the pixel voltage before amplification while rejecting common-mode differences and cancelling amplifier offset. As shown in Fig. 5.8.3, during  $\phi_1$  and  $\phi_2$ , the pixel reset and amplifier offset are sampled on C3 and C5 and the image data and amplifier offset are sampled on C4 and C6. During  $\phi_3$  the charge on C3-C6 is moved to feedback capacitors C1 and C2.
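The gain selection step can be pictured as a small decision function: three thresholds are tested against both signal polarities (six comparator outputs), and the number of thresholds crossed picks one of four gains. In the sketch below the threshold voltages and the two intermediate gain values are placeholders invented for illustration; only the six-comparator, three-threshold, four-gain structure and the 0.25$\times$ to 24$\times$ span come from the text.

```python
def select_gain(v_diff,
                thresholds=(0.05, 0.2, 0.8),      # hypothetical thresholds in volts
                gains=(24.0, 8.0, 2.0, 0.25)):    # 8.0 and 2.0 are placeholder values
    """Pick an amplifier gain from six comparator decisions.

    thresholds must be ascending; gains are ordered from the smallest
    input signal (highest gain) to the largest input signal (lowest gain).
    """
    # Three comparators test the positive polarity, three the negative one.
    pos = [v_diff > t for t in thresholds]
    neg = [-v_diff > t for t in thresholds]
    # A threshold counts as crossed if either polarity trips its comparator.
    crossed = sum(p or n for p, n in zip(pos, neg))
    return gains[crossed]

for v in (0.01, -0.1, 0.5, -1.0):
    print(v, select_gain(v))
```

Checking both polarities lets the same thresholds serve a differential signal of either sign, so small signals of either polarity receive the highest gain and large ones the lowest.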

Each AMP spans two pixel columns, and two ADCs sample and convert the processed data from the AMP. Linearity of the 9b or 10b single-slope ADCs, which use a 2GHz counter, is improved with distributed feedback source followers in a global ramp generator. Differential read noise is 3e<sup>-</sup> (un-binned) in 10b mode. The 512 readout circuits, each including a pixel binning multiplexer, gain selection comparators, a programmable gain amplifier, and two ADCs, are implemented on a 7$\mu$m pitch with a height of about 1.2mm and provide 3.4GS/s 9b or 1.7GS/s 10b digital data.

Figure 5.8.4 shows the overall system accuracy and uncertainty at various sunlight equivalent ambient levels up to 25kLux as a function of distance. Figure 5.8.5 compares the system in high-performance and low-power operating modes with our previous work [2] and prior art [3]. The work presented in this paper extends the state of the art for ToF image sensors into the MPixel range. Based on the comparison table, it achieves the smallest pixel size, highest MC at high frequency, lowest noise, and lowest power as compared to previously reported work including our own [2,3]. Figure 5.8.6 shows a 1024 $\times$ 1024 image of 24 people over 40cm to 7m at 120° $\times$ 120° Field of View (FOV). The images are from a single frame of data for: Active Brightness, color coded depth (mm), depth point cloud and a depth point cloud zoomed in on a ping pong ball at 2m that still shows a good spherical depth shape. A die micrograph with labeled sub-blocks is shown in Fig. 5.8.7. We believe the presented ToF chip provides the performance, footprint and power envelope for demanding ToF scenarios including mobile devices.

### References:

- [1] S.-M. Han, et al., "A 413 $\times$ 240-Pixel Sub-Centimeter Resolution Time-of-Flight CMOS Image Sensor with In-Pixel Background Canceling Using Lateral-Electric-Field Charge Modulators," *ISSCC Dig. Tech. Papers*, pp. 130-131, Feb 2014.
- [2] A. Payne, et al., "A 512 $\times$ 424 CMOS 3D Time-of-Flight Image Sensor with Multi-Frequency Photo-Demodulation up to 130MHz and 2GS/s ADC," *ISSCC Dig. Tech. Papers*, pp. 134-135, Feb 2014.
- [3] Y. Kato, et al., "320x240 Back-Illuminated 10 $\mu$ m CAPD Pixels for High Speed Modulation Time-of-Flight CMOS Image Sensor," *IEEE Symp. VLSI Circuits*, pp. C288-C289, June 2017.
- [4] S. Kim, et al., "Time of Flight Image Sensor with 7 $\mu$ m Pixel and 640x480 Resolution," *IEEE Symp. VLSI Technology*, pp. T146-T147, June 2013.



Figure 5.8.1: Pixel structure and collected photo-charges.



Figure 5.8.2: Pixel schematic and timing diagram.



Figure 5.8.3: Schematic of Amplifier (AMP) and Analog-to-Digital Converter (ADC).



Figure 5.8.4: Measured system accuracy and uncertainty for 20% reflective target at varied ambient light levels in center region of field of view.

| Process Technology                  | TSMC 65nm BSI 1P8M            | Kinect v2           | Sony 2D17                        |
|-------------------------------------|-------------------------------|---------------------|----------------------------------|
| Pixel Pitch                         | 3.5µm x 3.5µm                 | 15µm x 10µm         | 10 µm x 10 µm                    |
| Chip Size                           | 1024 x 1024 (active pixel)    | 512 x 424           | 320 x 240                        |
| System Dynamic Range                | 5.4mV x 9.8mV                 | 8.2mV x 14.2mV      | ..                               |
| Modulation Contrast                 | >2500 x 68dB                  | >2500 x 68dB        | ..                               |
| Modulation Frequency                | 87% @ 860nm (2.20MHz)         | 68% @ 850nm @ 50MHz | 85% @ 850nm @ 100MHz (thin epil) |
| System Average Modulation Frequency | 10 to 130MHz                  | 10 to 130MHz        | 100MHz                           |
| Responsivity                        | 0.305A/W @ 860nm              | 0.144A/W            | 0.34A/W                          |
| FOV (H x V)                         | 120° x 120°                   | 70° x 60°           | ..                               |
| Depth Uncertainty                   | < 0.2% of range (3kLux)       | 0.5% of range       | 0.1% of range                    |
| F#                                  | 1.2                           | 1.07                | ..                               |
| Frame Rate                          | 39 fps                        | 30 fps              | ..                               |
| Global Shutter                      | (>60fps w/reduced resolution) | max 60fps           | ..                               |
| ADC resolution                      | 10 bits                       | 10 bits             | ..                               |
| ADC speed                           | Selectable 9 bits or 10 bits  | 205/s               | ..                               |
| Effective Fill Factor               | 100%                          | 60%                 | >80%                             |
| Reflectivity                        | 35% to 100%                   | 15% to 95%          | ..                               |
| Operating Mode                      | Low Power                     | High Performance    | ..                               |
| Operating Distance                  | 0.2 to 3.8m                   | 0.4 to 4.2m         | 0.8 to 4.2m                      |
| Readout Resolution                  | 512x512                       | 1024x1024           | ..                               |
| Readout Noise                       | (analog binning)              | (210uV)             | (320uV)                          |
| Chip Power                          | 150mW                         | 650mW               | 2.1W                             |
| Total System Power [chip and laser] | 225mW                         | 950mW               | 870mW                            |

Figure 5.8.5: Comparison table.



Figure 5.8.6: 1024x1024 Image: Active brightness, depth [mm], depth point cloud, depth point cloud zoomed in on bouncing ping pong ball.



Figure 5.8.7: Die micrograph (from chip front side).

## 5.9 A 256×256 45/65nm 3D-Stacked SPAD-Based Direct TOF Image Sensor for LiDAR Applications with Optical Polar Modulation for up to 18.6dB Interference Suppression

Augusto Ronchini Ximenes<sup>1</sup>, Preethi Padmanabhan<sup>2</sup>, Myung-Jae Lee<sup>2</sup>, Yuichiro Yamashita<sup>3</sup>, D. N. Yang<sup>3</sup>, Edoardo Charbon<sup>1,2</sup>

<sup>1</sup>Delft University of Technology, Delft, The Netherlands

<sup>2</sup>EPFL, Neuchatel, Switzerland; <sup>3</sup>TSMC, Hsinchu, Taiwan

Light detection and ranging (LiDAR) systems based on direct time-of-flight (DTOF) are used in spacecraft navigation, assembly-line robotics, augmented and virtual reality (AR/VR), (drone-based) surveillance, advanced driver assistance systems (ADAS), and autonomous cars. Common requirements are accuracy and speed, while ensuring long operating distance, high tolerance to background illumination and robustness to interference from other LiDAR systems. To meet these demands, the DTOF sensor community has provided numerous architectures, typically making use of resource sharing that often introduces tradeoffs between pixel count and speed. If resource sharing is not used, reduced fill factor, high non-uniformity, and pile-up distortion generally arise, thus limiting overall performance [1].

In this paper, we propose a 3D-stacked CMOS DTOF image sensor comprising a back-illuminated single-photon avalanche diode (SPAD) array fabricated in a 45nm standard CMOS image sensor (CIS) process, which is electrically connected to a digital processing and communication unit (DPCU) fabricated in a 65nm standard (1P4M) low-power CMOS process. The image sensor is part of a LiDAR system featuring a dual-axis laser scanner and a laser signature identification technique, which applies digital polar modulation based on phase-shift keying (PSK) to the outgoing light and to the detected photon timing information for proper signal recovery. As a result, light from any source other than the system's own laser is spread in time across the histogram that reconstructs the echo of the light pulse, as depicted in Fig. 5.9.1.

A block diagram of the image sensor at the core of the LiDAR is shown in Fig. 5.9.2. The top tier (Tier 1) contains solely the SPAD array and the bottom tier (Tier 2) hosts the quenching circuits and the DPCU. The image sensor is conceived in a modular fashion. Each module comprises an array of 2×8×8 SPADs sharing a single time-to-digital converter (TDC) through a decision tree, which ensures that every SPAD is connected to the TDC through a 6-level path of *identical length*, virtually free of skew. Each node of the tree acts as an arbiter, or *decision maker*, directing the SPAD pulse by way of a first-come-win-all policy, while the ID of the originating SPAD, formed on the decision tree, is the address of a pixel-wise 21b memory holding the timestamp. Unlike previous approaches, where the TDC is activated by photon arrival [1,2], our TDC runs continuously, thus avoiding dead time zones and intensity-dependent power dissipation; this leads to a virtually calibration-free system. To minimize the overall power consumption and to ensure scalability, the always-on TDC is designed to dissipate less than 500μW. Each SPAD can be individually masked via controllable passive quenching. Two operation modes are available, pulse and state: the former generates a pulse proportional to the dead time of the SPAD, whereas the latter generates a state that is only reset at the end of a frame, thus ensuring operation flexibility in different scenarios and applications. In this work, we present two operating modules, totaling 8×32 pixels, which are completely autonomous and serve as building blocks of a larger system built in a completely modular fashion.

The decision makers, shown in the inset of Fig. 5.9.2, are constructed using D-type Flip-Flops (DFF) with reset. Upon arrival of concurrent digital inputs associated with distinct photon detections, the fixed state '1' is sampled, and the output of the earlier DFF resets its later counterpart, thus blocking it from subsequent detection. Although the DFF has no metastability at the clock inputs, the potential conflict of the DFF outputs is mitigated by an nMOS latch, which reaches an overall metastability of 14ps. The outputs of the DFFs are then combined using a symmetric OR gate to generate output Q and through an SR-latch to generate address A. The Q's are combined in a 6-level tree (64-to-1) to generate the final DTOF output, whereas the A's are used to generate a 6b ID address through a series of multiplexers. This architecture enables high uniformity (<1%) and low dead time (<2.4ns) between photon detections.
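The first-come-win-all arbitration can be illustrated with a small behavioral model. The sketch below is not the authors' circuit: it only assumes that each two-input node forwards the earlier pulse and appends one address bit, so that a 6-level tree over 64 inputs yields a 6b ID. The function name `arbitrate` and the tie-breaking rule (lower index wins) are hypothetical choices made for illustration.

```python
from typing import Optional, Sequence, Tuple


def arbitrate(arrivals: Sequence[Optional[float]]) -> Optional[Tuple[int, float]]:
    """Return (winner_id, arrival_time) for a power-of-two list of SPAD arrival times.

    None means that the SPAD did not fire. Ties go to the lower index, which is an
    arbitrary modeling choice; the real circuit resolves conflicts in hardware.
    """
    n = len(arrivals)
    assert n and (n & (n - 1)) == 0, "number of inputs must be a power of two"
    # Each live entry is (address_so_far, time); silent inputs are None.
    level = [(0, t) if t is not None else None for t in arrivals]
    bit = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            a, b = level[i], level[i + 1]
            if a is None and b is None:
                nxt.append(None)
            elif b is None or (a is not None and a[1] <= b[1]):
                nxt.append((a[0], a[1]))                 # input 0 wins: address bit stays 0
            else:
                nxt.append((b[0] | (1 << bit), b[1]))    # input 1 wins: address bit set to 1
        level = nxt
        bit += 1
    return level[0]


if __name__ == "__main__":
    spads = [None] * 64
    spads[37] = 5.0   # earliest photon
    spads[12] = 7.5   # later photon, blocked by the tree
    print(arbitrate(spads))
```

With 64 inputs the loop runs six times, matching the 6-level tree, and the accumulated address equals the index of the winning SPAD.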

The DTOF pulse samples the TDC and the address is used to read the previous value stored in the corresponding pixel memory. The arithmetic logic unit (ALU) combines these timestamps and the result is stored in the next independent DTOF event. After each detection, the decision tree is reset and made available to the next event. The TDC consists of a current-starved 8-stage, pseudo-differential ring oscillator (RO) connected to a 10b ripple counter, to form a 14b TDC with a resolution (LSB) of 60ps and a range of 1μs. The remaining 7b of memory can be configured to perform photon counting and/or to store the fractional part of the low-pass filter (IIR), provided by the ALU. A maximum of 1.7Gevents/s is achieved, including intensity and concurrent data processing.
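The bit budget quoted above can be checked with a few lines of arithmetic. This is only a consistency sketch: it assumes that each pseudo-differential RO stage provides two distinguishable phases (an assumption, not a statement from the text), which gives 16 phases and therefore 4 fine bits.

```python
import math

RO_STAGES = 8        # pseudo-differential stages (from the text)
COUNTER_BITS = 10    # ripple counter width (from the text)
LSB_S = 60e-12       # TDC resolution (from the text)
MEMORY_BITS = 21     # pixel-wise memory word (from the text)

# Assumption: two usable phases per pseudo-differential stage.
phases = 2 * RO_STAGES
fine_bits = int(math.log2(phases))
tdc_bits = COUNTER_BITS + fine_bits
tdc_range_s = (2 ** tdc_bits) * LSB_S
spare_bits = MEMORY_BITS - tdc_bits

if __name__ == "__main__":
    print(f"TDC depth: {tdc_bits} b")
    print(f"TDC range: {tdc_range_s * 1e9:.2f} ns")
    print(f"Memory bits left for counting/IIR: {spare_bits} b")
```

Under this assumption the depth is 14b, the range is about 983ns (quoted as 1μs), and 7b of the 21b word remain, in line with the figures in the text.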

Figure 5.9.3 demonstrates the performance of the sensor. At room temperature and an excess bias voltage ( $V_E$ ) of 2.5V, the SPAD achieves a dark count rate (DCR) of 55.4cps/μm<sup>2</sup>, a maximum photon detection probability (PDP) of 31.8% at 600nm and over 5% in the 420-to-920nm spectrum, and a full-width at half-maximum (FWHM) timing jitter of 107.7ps. Although the native fill factor is 31.3%, two types of microlenses are implemented to improve this figure. Conventional microlenses increase it to 50.6%, whereas Fresnel microlenses show negligible improvement. Figure 5.9.3 also shows the TDC linearity across the entire range, measured via a statistical code density test. The worst-case DNL is +0.8/-0.7LSB and INL is +3.4/-0.8LSB. To comply with space applications, which require radiation tolerance, the imager was exposed to a <sup>60</sup>Co gamma source. At a dose rate of 73 krad/h, the DCR increases from 2.8kcps to 5.8kcps over a 90-minute exposure, as plotted in Fig. 5.9.3.

The chip can operate in two modes: high and low resolution. In high-resolution mode, a 150m range is achieved with 7cm accuracy (0.3% non-linearity) and 15cm precision ( $\sigma = 0.1\%$ ); in low-resolution mode, a 430m range is achieved with 80cm accuracy (0.4% non-linearity) and 47cm precision ( $\sigma = 0.11\%$ ). Figure 5.9.4 shows single-point telemetry with a 50% reflectivity target. The system uses a 532nm laser with an average power of 6mW at 1MHz repetition rate in high resolution and 1mW at 300kHz in low resolution.
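The quoted ranges and relative precisions can be cross-checked with the standard pulsed time-of-flight relation, in which the maximum unambiguous distance is c/(2·f_rep). The text does not state that the range is set by the repetition rate; the sketch below only shows that the numbers are mutually consistent. Function names are hypothetical.

```python
C_M_PER_S = 299_792_458.0  # speed of light in vacuum


def unambiguous_range_m(rep_rate_hz: float) -> float:
    """Largest distance whose echo returns before the next laser pulse is emitted."""
    return C_M_PER_S / (2.0 * rep_rate_hz)


def relative_percent(error_m: float, distance_m: float) -> float:
    """Express an absolute distance error as a percentage of the measured distance."""
    return 100.0 * error_m / distance_m


if __name__ == "__main__":
    print(f"1 MHz   -> {unambiguous_range_m(1e6):.1f} m")     # high-resolution mode
    print(f"300 kHz -> {unambiguous_range_m(300e3):.1f} m")   # low-resolution mode
    print(f"15 cm at 150 m -> {relative_percent(0.15, 150):.2f} %")
    print(f"47 cm at 430 m -> {relative_percent(0.47, 430):.2f} %")
```

At 1MHz the unambiguous range is just under 150m, matching the high-resolution range, and at 300kHz it is roughly 500m, which covers the 430m reported in low-resolution mode. The precision percentages reproduce the quoted 0.1% and 0.11%.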

Figure 5.9.4 also shows the histogram of a DTOF measurement in presence of an interference laser (637nm), identical in strength to the reference signal (532nm), with no filter and 300lux of background. The effects of different modulation indices are presented, where the interference peak is reduced by 12.6dB, 15.6dB and 18.6dB, when an 8-, 16- and 32-PSK is applied, respectively, whereas the target echo strength is unaffected.
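The 3dB step between successive modulation orders is what one would expect if the interferer's counts were spread evenly over M histogram positions, since the peak then falls as 1/M. The sketch below compares the measured values with this idealized model. The model is an assumption introduced here for illustration, not the authors' analysis, and it does not account for the absolute offset between the measured and ideal figures.

```python
import math

# Measured interference-peak reduction from the text, keyed by PSK order M.
MEASURED_DB = {8: 12.6, 16: 15.6, 32: 18.6}


def ideal_spreading_db(m: int) -> float:
    """Peak reduction if the interferer's counts are split evenly over m positions."""
    return 10.0 * math.log10(m)


if __name__ == "__main__":
    for m, measured in MEASURED_DB.items():
        ideal = ideal_spreading_db(m)
        print(f"{m:2d}-PSK: measured {measured:.1f} dB, ideal spreading {ideal:.2f} dB")
    # Step per doubling of M, measured versus ideal.
    print(f"measured step: {MEASURED_DB[16] - MEASURED_DB[8]:.1f} dB")
    print(f"ideal step:    {ideal_spreading_db(16) - ideal_spreading_db(8):.2f} dB")
```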

The LiDAR scanner enables configurable spatial resolution, based on the target scene-of-interest, where fast sweep can be traded for fine 3D mapping. To demonstrate both scenarios, a 256×256 3D image was acquired from objects at 4.5 meters (DTOF and intensity acquired simultaneously), as shown in Fig. 5.9.5, obtaining millimetric details. Furthermore, a 32×32 depth map was also acquired, featuring targets with different reflectivities and shapes, without compromising ranging accuracy, at distances ranging from 4 to 10m. A cross-section, along with the ground truth measurement of the scene, confirms the accuracy of the measurement.

Figure 5.9.6 presents a performance summary of the sensor and a comparative table of the state-of-the-art DTOF CMOS image sensors. Figure 5.9.7 shows the micrograph of the sensor and of the microlenses with details of the SPAD. The sensor measures 0.25×0.8mm<sup>2</sup> with a sensitive area of 0.1mm<sup>2</sup>.

### Acknowledgements:

The authors wish to thank PicoQuant GmbH for the laser loan, TSMC for chip fabrication, CEA-Leti for radiation measurement, and the Netherlands Organization for Scientific Research for funding, in part, this research. The first two authors contributed equally.

### References:

- [1] Veerappan, C., et al., "A 160×128 Single-Photon Image Sensor with On-Pixel 55ps 10b Time-to-Digital Converter", *ISSCC Dig. Tech. Papers*, pp. 312-313, Feb. 2011.
- [2] Perenzoni, M., et al., "A 64×64-Pixel Digital Silicon Photomultiplier Direct ToF Sensor with 100MPhotons/s/pixel Background Rejection and Imaging/Altimeter Mode with 0.14% Precision up to 6km for Spacecraft Navigation and Landing", *ISSCC Dig. Tech. Papers*, pp. 118-119, Feb. 2016.
- [3] Niclass, C., et al., "A 100-m Range 10-Frame/s 340×96-Pixel Time-of-Flight Depth Sensor in 0.18-μm CMOS", *IEEE J. Solid-State Circuits*, vol. 48, no. 2, pp. 559-572, Feb. 2013.
- [4] Dutton, N., et al., "A Time-Correlated Single-Photon-Counting Sensor with 14GS/s Histogramming Time-to-Digital Converter", *ISSCC Dig. Tech. Papers*, pp. 204-205, Feb. 2015.



Figure 5.9.1: Conceptual dual-axis scanning LiDAR system, featuring laser polar modulation for interference suppression.



Figure 5.9.3: PDP, DCR as a function of  $^{60}\text{Co}$  gamma irradiation dose, TDC non-linearity, and SPAD normalized timing jitter.



Figure 5.9.2: Block and timing diagram of the digital processing and communication unit of the bottom tier (Tier 2) chip.



Figure 5.9.4: Single-point optical measurement in high- and low-resolution modes and PSK laser modulation.



Figure 5.9.5: 3D reconstruction of multiple targets using the dual-axis scanner at various distances and target reflectivities.

| Parameter                     | Unit                | This Work                             | [2]                                | [3]                    | [4]                      |
|-------------------------------|---------------------|---------------------------------------|------------------------------------|------------------------|--------------------------|
| Technology                    | nm                  | 45/65nm CMOS                          | 150nm CMOS                         | 180nm                  | 130nm CIS                |
| Architecture                  | -                   | Always-on, shared TDC & decision tree | Event-driven, single TDC per pixel | Column wise shared TDC | Histogramming shared TDC |
| Depth resolution              | -                   | 256 x 256 <sup>(1)</sup>              | 64 x 64                            | 340 x 96               | 32 x 32                  |
| Sensor characteristics        |                     |                                       |                                    |                        |                          |
| Pixel pitch                   | µm                  | 19.8                                  | 60                                 | 25                     | 21                       |
| Pixel fill factor             | %                   | 31.3/50.6 <sup>(2)</sup>              | 26.5                               | 70                     | 43                       |
| SPAD median DCR @ $V_E$       | cps/µm <sup>2</sup> | 55.4 @ 2.5 V                          | 57 @ 3 V                           | 6 @ 3.3 V              | N/A                      |
| TDC depth                     | bit                 | 14                                    | 16/15                              | 12                     | 8                        |
| TDC resolution                | ps                  | 60 - 320                              | 250 - 20000                        | 208                    | 71.4                     |
| TDC power                     | mW                  | 0.5 - 0.1                             | N/A                                | N/A                    | 14.1                     |
| TDC area                      | µm <sup>2</sup>     | 550                                   | N/A                                | 31,000 <sup>(5)</sup>  | 30,000                   |
| TDC linearity                 | DNL (LSB)           | +0.8/-0.7                             | +1.2/-1 <sup>(3)</sup>             | +0/-0.52               | +0.75/-0.61              |
|                               | INL (LSB)           | +3.4/-0.8                             | +4.8/-3.2 <sup>(3)</sup>           | +0.73/-0.49            | +0.65/-0.2               |
| Measured distance performance |                     |                                       |                                    |                        |                          |
| Distance range                | m                   | 150 - 430                             | 367 - 5862 <sup>(4)</sup>          | 128                    | 2.82 - 3.375             |
| Precision ( $\sigma$ )        | m                   | 0.15 - 0.47                           | 0.2 - 0.5 <sup>(4)</sup>           | 0.1 <sup>(6)</sup>     | N/A                      |
| (Repeatability)               | %                   | 0.1 - 0.11                            | 0.13 - 0.14 <sup>(4)</sup>         | 0.1 <sup>(6)</sup>     | N/A                      |
| Accuracy                      | m                   | 0.07 - 0.8                            | 1.5 - 35 <sup>(4)</sup>            | 0.37 <sup>(6)</sup>    | N/A                      |
| (Non-linearity)               | %                   | 0.3 - 0.4                             | 0.37 - 1.9 <sup>(4)</sup>          | 0.37 <sup>(6)</sup>    | N/A                      |

<sup>(1)</sup> Flexible resolution depending on scanner and available laser optical power. <sup>(2)</sup> Without microlenses / with conventional microlenses.

<sup>(3)</sup> Measured over 5% of the total range. <sup>(4)</sup> Emulated results. <sup>(5)</sup> Estimated by layout. <sup>(6)</sup> Measured at 100m.

Figure 5.9.6: Performance summary and comparison with state-of-the-art DTOF image sensors.



Figure 5.9.7: Conventional and Fresnel zone plate microlens arrays and complete chip micrograph.

## 5.10 A 32×32-Pixel Time-Resolved Single-Photon Image Sensor with 44.64μm Pitch and 19.48% Fill-Factor with On-Chip Row/Frame Skipping Features Reaching 800kHz Observation Rate for Quantum Physics Applications

Leonardo Gasparini<sup>1</sup>, Majid Zarghami<sup>1</sup>, Hesong Xu<sup>1</sup>, Luca Parmesan<sup>1</sup>, Manuel Moreno Garcia<sup>1</sup>, Manuel Unternährer<sup>2</sup>, Bänz Bessire<sup>2</sup>, André Stefanov<sup>2</sup>, David Stoppa<sup>1,3</sup>, Matteo Perenzoni<sup>1</sup>

<sup>1</sup>Fondazione Bruno Kessler (FBK), Trento, Italy

<sup>2</sup>University of Bern, Bern, Switzerland; <sup>3</sup>now with ams AG, Rüschlikon, Switzerland

Entangled photons, beyond the classical physics understanding, show quantum correlations in some of their degrees of freedom. They find application in quantum computing, quantum key distribution, and super-resolution (i.e., beyond the diffraction limit) microscopy, but they are indistinguishable from ordinary photons when using conventional image sensors.

The optical process of spontaneous parametric down-conversion (SPDC) occurring in a non-linear (NL) crystal pumped with an intense laser beam is a common way to generate pairs of photons that are spatially entangled. They show correlations in their emission location and anticorrelations in the direction of emission (momentum) and are generated simultaneously (<1 ps). Image sensors with high detection efficiency and time resolution are needed to detect both the photons and identify them as an entangled pair. A first experiment that combines an SPDC source with a time-resolved single-photon imager is described in [1].

This work presents a CMOS imager based on single-photon avalanche diodes (SPAD) with a per-pixel time-to-digital converter (TDC) for recording the spatial correlation functions of a flux of entangled photons, with on-chip features to increase the duty cycle. The 32×32-pixel array (1.69×1.88 mm<sup>2</sup>), combining a 44.64-μm pitch with 19.48% fill-factor, is fabricated in a standard 150nm 1P6M CMOS technology. A current-based mechanism requiring only 2 transistors per pixel exploits low photon rates to avoid reading empty frames, thus allowing the sensor to open 50-ns-long observation windows at up to 800kHz. An additional transistor per pixel is used to sense the absence of SPAD activity in each row and reduce the readout time.

Figure 5.10.1 shows the pixel schematic and the applied waveforms. Initially, the SPAD is off ( $V_{An} = V_{SPAD} - V_{3V3} < V_{Breakdown}$ ). Then, a 5ns pulse on CHARGE drives  $V_{An}$  to 0V, turning the SPAD on, and if it fires due to a photon or a dark count, a voltage step is observed at  $V_{An}$ . The signal propagates through the clamping transistor M4, separating the 3.3V and the 1.8V domains. The NOR/NAND gates operate as a pulse generator triggering the START signal if the event occurs while GATEn = L.

The TDC has been designed to aggressively minimize its area (402.7μm<sup>2</sup>). It is based on a 3-stage ring oscillator (RO), enabled by START. A sample-and-hold mechanism controlled by STOP stores the phase of the RO, which is then encoded into 2b (fine timestamp). At the same time, a 6b digital counter of RO periods produces a coarse timestamp.
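A short sketch shows how the two fields could combine into the 8b timestamp. It assumes that the 2b fine code occupies the least significant bits and uses the mean LSB of 204.5ps reported in the TDC characterization; the function names are hypothetical.

```python
COARSE_BITS = 6   # counter of RO periods (from the text)
FINE_BITS = 2     # encoded RO phase (from the text)
LSB_PS = 204.5    # mean resolution reported in the TDC characterization


def timestamp_code(coarse: int, fine: int) -> int:
    """Concatenate coarse and fine fields, assuming the fine code sits in the LSBs."""
    if not (0 <= coarse < 2 ** COARSE_BITS and 0 <= fine < 2 ** FINE_BITS):
        raise ValueError("field out of range")
    return (coarse << FINE_BITS) | fine


def code_to_ns(code: int) -> float:
    """Convert a TDC code to time using the mean LSB."""
    return code * LSB_PS / 1000.0


if __name__ == "__main__":
    full_scale_codes = 2 ** (COARSE_BITS + FINE_BITS)
    print(f"codes: {full_scale_codes}, full scale: {code_to_ns(full_scale_codes):.2f} ns")
    print(f"coarse=10, fine=3 -> code {timestamp_code(10, 3)}")
```

The resulting full scale of about 52.4ns agrees with the 52ns range listed in the performance summary and covers the 50-ns-long observation window.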

Noise is mitigated in multiple ways: DISCHn turns off the SPAD at the end of the observation window to minimize afterpulsing, noisy SPADs are disabled through a programmable 1b SRAM, and no SPAD well-sharing scheme, such as the one used in [4] to recover fill-factor, is adopted, as it negatively impacts crosstalk and the sensor modulation transfer function (MTF).

Figure 5.10.2 shows the overall sensor architecture. Full 1 kpixel readout is performed in 11.20μs, streaming 8b digital data at 100MHz. Additional circuitry is used to speed up the readout process by skipping empty rows (highlighted in blue in Fig. 5.10.2). All pixels in the i<sup>th</sup> row share a ROWEMPTY[i] net, which is pulled up when the pixels are reset. During the observation, the first firing pixel in the row sets the flag low through transistor M10 in Fig. 5.10.1. At the end of the observation, the value of the flag is sampled in the row decoder and is provided at the output to speed up the readout process. For example, 8 triggered pixels lead to a readout time ≤3.52μs.
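The two readout times quoted above are consistent with a simple linear model: a fixed overhead plus one 8b word per pixel at 100MHz for every row that is actually read. The 0.96μs overhead below is inferred from those two figures and is not stated in the text. The bound for 8 triggered pixels corresponds to the worst case in which they fall in 8 different rows.

```python
PIXELS_PER_ROW = 32
ROWS = 32
CLOCK_MHZ = 100.0          # one 8b word per clock cycle (from the text)
FIXED_OVERHEAD_US = 0.96   # inferred from the quoted 11.20 us and 3.52 us, not stated


def readout_time_us(rows_read: int) -> float:
    """Estimated readout time when only `rows_read` non-empty rows are streamed out."""
    if not 0 <= rows_read <= ROWS:
        raise ValueError("rows_read out of range")
    words = rows_read * PIXELS_PER_ROW
    return FIXED_OVERHEAD_US + words / CLOCK_MHZ


if __name__ == "__main__":
    print(f"full frame (32 rows): {readout_time_us(32):.2f} us")
    print(f"8 rows (worst case for 8 pixels): {readout_time_us(8):.2f} us")
    print(f"1 row: {readout_time_us(1):.2f} us")
```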

A second mechanism is implemented to skip entire frames when the total number of triggered pixels is below a user-defined threshold. Each pixel in the array contains a tunable current source (M11 and M12 in Fig. 5.10.1) sinking a current  $I_{SPAD}$  from a global TRIG<sub>int</sub> net only when the pixel is triggered (highlighted in red in Fig. 5.10.2). At the periphery of the array, replicas of the current source generate a reference current  $I_{th}$  that corresponds to the requested minimum number of triggered pixels and is subtracted from the global current  $I_{SPAD\ array}$ . The difference  $I_{diff}$  is provided to a current comparator that outputs a logic '0' or a logic '1' if  $I_{diff}$  is negative or positive, respectively. The value is then sampled in TRIG so that the external controller can skip the readout phase and start a new acquisition. When observing rare events (e.g., groups of 4 entangled photons), the observation rate approaches the limit of 800kHz. The timing diagram of Fig. 5.10.2 shows two consecutive acquisitions. The first frame is ignored due to an insufficient number of triggered pixels, while the second one is read out, skipping empty rows.
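The frame-skipping decision can be summarized as a behavioral model of the current comparison. The sketch below is idealized: the unit current and the half-unit margin that places the reference between two integer pixel counts are hypothetical parameters, chosen only so that the comparison is unambiguous when the count equals the threshold.

```python
def trig_flag(triggered_pixels: int, threshold_pixels: int,
              i_unit_ua: float = 1.0, margin_units: float = 0.5) -> int:
    """Return 1 when the frame should be read out, 0 when it can be skipped.

    i_unit_ua and margin_units are illustrative assumptions, not values from the text.
    """
    i_array = triggered_pixels * i_unit_ua                 # sum of per-pixel currents
    i_th = (threshold_pixels - margin_units) * i_unit_ua   # reference from replica sources
    i_diff = i_array - i_th
    return 1 if i_diff > 0 else 0


if __name__ == "__main__":
    for n in range(6):
        action = "read out" if trig_flag(n, threshold_pixels=4) else "skip"
        print(f"{n} triggered pixels -> {action}")
```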

The TDCs are characterized in terms of full-scale variation and differential and integral nonlinearity (DNL, INL) over a 45ns window for 64 pixels randomly distributed across the array. Results are shown in Fig. 5.10.3. The TDC resolution is 204.5±2.7ps. DNL is in the -0.73 to +0.77 LSB range. All the INL plots lie in the -0.93 to +1.24 LSB range.

The sensor accuracy is measured for 64 equally spaced pixels, with a pulsed 470nm laser source with 70ps FWHM. Figure 5.10.4 summarizes the results, including raw and calibrated data. Calibration includes DNL correction (bin oversampling and resizing) and compensation of the offset caused by device mismatch and by the propagation of critical timing signals. The figure shows that the calibration procedure effectively corrects the spread that increases with longer delays. Single-photon timing precision is 240ps rms.

The sensor is tested by acquiring first- ( $G^{(1)}(\rho)$ ) and second-order ( $G^{(2)}(\rho_s, \rho_i)$ ) correlation functions of a flux of spatially entangled photons. Figure 5.10.5(top) shows a block diagram of the experimental setup. A NL crystal pumped with a 405nm continuous-wave laser generates entangled photon pairs at 810nm. Being entangled, both photons exit the crystal with an angle, symmetric with respect to the propagation axis, so that their barycenter lies on the axis itself. To verify this property, the detector is placed in the far-field of the photons, thus measuring their direction of emission. A coincidence window of 5 TDC codes is used to identify entangled photons. Figure 5.10.5(middle, right) shows a model of the photon flux, with two photons of a pair, referred to as signal photon and idler photon, respectively.  $G^{(1)}(\rho)$ , where  $\rho = (x_p, y_p)$  represents the photon position in the SPAD array, describes the spatial distribution of entangled photons. Figure 5.10.5(middle, left) shows the expected  $G^{(1)}$ , consisting of a circle, and the measured one. The circle is clearly visible, with hot spots due to dark counts and cold spots due to disabled SPADs.  $G^{(2)}(\rho_s, \rho_i)$ , where  $\rho_s, \rho_i$  represent the linearized coordinates ( $\rho_{s,i} = 32 \times x_p + y_p$ ) of the signal and idler photon, respectively, is obtained by building the 2D histogram of spatial coincidences for all possible pixel combinations. The model of the expected function in Fig. 5.10.5(bottom) shows an anticorrelation pattern due to the entanglement. The obtained  $G^{(2)}$  shows multiple antidiagonals: since the system is not ideal, given one photon detection, the other photon may fall in the anticorrelated pixel or in one of its neighbors. In linear coordinates, pixels in a column lying on adjacent rows become separated by 32 points. The measured histogram also shows correlation diagonals caused by crosstalk events, which are temporally coincident, as entangled photons are. In typical conditions, fewer than 8 SPADs trigger for each observation window and the sensor consumes 11.1mW at 250kfps. Figure 5.10.6 summarizes the sensor performance and compares it with the state of the art.
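The construction of  $G^{(2)}$  from time-stamped detections can be sketched as follows. This is an illustrative reconstruction rather than the authors' processing code: the frame format (a list of `(x, y, tdc_code)` tuples), the inclusive comparison against the 5-code window, and the symmetric filling of the histogram are all assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple

N = 32             # array side (from the text)
WINDOW_CODES = 5   # coincidence window in TDC codes (from the text)

Detection = Tuple[int, int, int]  # (x_p, y_p, tdc_code), hypothetical frame format


def linear_index(x: int, y: int) -> int:
    """Linearized pixel coordinate rho = 32 * x_p + y_p, as defined in the text."""
    return N * x + y


def accumulate_g2(frames: Iterable[List[Detection]],
                  window: int = WINDOW_CODES) -> Counter:
    """Histogram pixel pairs whose timestamps differ by at most `window` codes."""
    g2: Counter = Counter()
    for detections in frames:
        for (x1, y1, t1), (x2, y2, t2) in combinations(detections, 2):
            if abs(t1 - t2) <= window:   # assumption: the window bound is inclusive
                a, b = linear_index(x1, y1), linear_index(x2, y2)
                g2[(a, b)] += 1
                g2[(b, a)] += 1          # either photon may be labeled signal or idler
    return g2


if __name__ == "__main__":
    # One frame: an anticorrelated pair plus an unrelated late detection.
    frame = [(3, 4, 100), (28, 27, 102), (10, 10, 200)]
    for pair, count in sorted(accumulate_g2([frame]).items()):
        print(pair, count)
```

In this toy frame, pixel (3, 4) and its mirror image (28, 27) map to linear indices 100 and 923, which sum to 1023 and therefore fall on the main antidiagonal of the 1024×1024 histogram.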

### Acknowledgements:

We thankfully acknowledge the support of the European Commission through the SUPERTWIN project, id. 686731.

### References:

- [1] M. Unternährer, et al., "Coincidence Detection of Spatially Correlated Photon Pairs with a Monolithic Time-Resolving Detector Array," *Opt. Express*, vol. 24, pp. 28829-28841, 2016.
- [2] C. Veerappan, et al., "A 160×128 Single-Photon Image Sensor with On-Pixel 55ps 10b Time-to-Digital Converter," *ISSCC Dig. Tech. Papers*, pp. 312-313, Feb. 2011.
- [3] F. Villa, et al., "CMOS Imager with 1024 SPADs and TDCs for Single-Photon Timing and 3-D Time-of-Flight," *IEEE J. Sel. Topics in Quantum Electronics*, vol. 20, no. 6, pp. 364-373, Dec. 2014.
- [4] M. Perenzoni, et al., "A 64 × 64-Pixels Digital Silicon Photomultiplier Direct TOF Sensor With 100-MPhotons/s/pixel Background Rejection and Imaging/Altimeter Mode With 0.14% Precision Up To 6 km for Spacecraft Navigation and Landing," *IEEE JSSC*, vol. 52, no. 1, pp. 151-160, Jan. 2017.



Figure 5.10.1: Pixel high-level schematic with detailed inset for the TDC and timing diagram.



Figure 5.10.2: Architecture of the imager and timing diagram. The frame and row skipping mechanisms are highlighted in red and blue, respectively. Externally driven/readable global signals are bold, while the italic names are for internal signals.



Figure 5.10.3: TDC characterization including histogram of the resolution for 64 randomly distributed pixels, DNL and INL obtained from a code density test. Typical, maximum and minimum values are shown.



Figure 5.10.4: Sensor accuracy (histograms of pixel mean values) sweeping a 470nm 70ps pulsed laser within a 16ns time range in 9 steps of 2ns. 64 pixels are enabled at a time. Calibrated data means corrected for DNL and offset. The insets show the laser statistics for a given pixel.



Figure 5.10.5: Experimental setup (top) and measurement results of a flux of entangled photons, including: model of the flux with expected and measured  $G^{(1)}$  (middle), and expected and measured  $G^{(2)}$ , with zoom-in of the central region (bottom). Anticorrelated diagonals due to entanglement and correlated diagonals due to crosstalk are highlighted in green and sky-blue, respectively.

| Parameter                     | This work                                   | [2]                                   | [3]   | [4]        |
|-------------------------------|---------------------------------------------|---------------------------------------|-------|------------|
| Year                          |                                             | 2011                                  | 2014  | 2017       |
| Architecture                  | TDC                                         | TDC                                   | TDC   | TDC        |
| Process                       | 150nm                                       | 130nm                                 | 350nm | 150nm      |
| Pixel                         |                                             |                                       |       |            |
| Pixel pitch ( $\mu\text{m}$ ) | 44.64                                       | 50                                    | 150   | 60         |
| SPAD size ( $\mu\text{m}$ )   | 19.8                                        | N.A.                                  | 30    | N.A.       |
| Pixel fill factor (%)         | 19.48                                       | 1                                     | 3.14  | 26.5       |
| Sharing of SPAD well          | No                                          | No                                    | No    | Yes        |
| SPAD excess bias voltage (V)  | 3.0                                         | 0.73                                  | 6.0   | 3.0        |
| SPAD DCR, median (Hz)         | 600                                         | 50                                    | 120   | 6800       |
| TDC                           |                                             |                                       |       |            |
| TDC area ( $\mu\text{m}^2$ )  | 402.7                                       | N.A.                                  | N.A.  | N.A.       |
| Range (ns)                    | 52                                          | 55                                    | 360   | 16k        |
| Time resolution / 1LSB (ps)   | 204.5±2.7                                   | 55                                    | 350   | 250        |
| Depth (bit)                   | 8                                           | 10                                    | 10    | 16         |
| DNL (LSB)                     | -0.73..+0.77                                | -0.3..+0.3                            | ±0.02 | -0.4..+0.5 |
| INL (LSB)                     | -0.93..+1.24                                | -2.3..+1.7                            | ±0.10 | -0.9..+1.2 |
| Single-photon precision (ps)  | 240 (raw)<br>205 (after calibration)        | 170                                   | 254   | N.A.       |
| Chip                          |                                             |                                       |       |            |
| Array size                    | 32x32                                       | 160x128                               | 32x32 | 64x64      |
| Chip size ( $\text{mm}^2$ )   | 1.69x1.88                                   | 11.0x12.3                             | 9x9   | 4.4x4.4    |
| Max frame rate (fps)          | 80k (full readout)<br>250k (row skipping)*  | 250k (by design)<br>50 (demonstrated) | 100k  | 17.9k      |
| Max observation rate (Hz)     | 800k (frame skipping)                       | -                                     | -     | -          |
| Power consumption (mW)        | 11.1 (row skipping)<br>4.8 (frame skipping) | 550                                   | 400   | 93.5       |

Figure 5.10.6: Chip performance summary table. \*Values are measured ensuring 8 detected photons in each frame, which is a worst-case condition for the quantum physics experiment described in Fig. 5.10.5.



Figure 5.10.7: Die micrograph.

# Session 6 Overview: *Ultra-High-Speed Wireline*

## WIRELINE SUBCOMMITTEE



**Session Chair:** *Mounir Meghelli*  
*IBM T. J. Watson Research Center,  
 Yorktown Heights, NY*



**Associate Chair:** *Hyeon-Min Bae*  
*KAIST, Daejeon, Korea*

**Subcommittee Chair:** *Frank O'Mahony, Intel, Hillsboro OR*

High-speed serial I/Os continue to be pushed to higher bandwidth and density for every new generation of systems, enabling the scaling of data centers, fueled by a world that is becoming increasingly connected and digital. This session starts with the presentation of two low-power transmitters demonstrating a data rate of 112Gb/s using PAM-4 modulation, both implemented in advanced CMOS FinFET technologies. It continues with a presentation of a multi-standard 4-lane 1.25-to-28.05Gb/s transceiver designed in 14nm CMOS FinFET technology and supporting up to 40dB of channel loss at a power efficiency of 6pJ/b. Three papers describing PAM-4 transceivers are presented next, two implemented in 16nm CMOS FinFET technology targeting long-reach links at 56Gb/s and 64Gb/s respectively, and one implemented in 28nm CMOS FDSOI targeting 64Gb/s short-reach links. Finally, the session concludes with a paper describing a 4.16pJ/b 32Gb/s PAM-4 transceiver implemented in 65nm CMOS technology.



1:30 PM

### 6.1 A 112Gb/s PAM-4 Transmitter with 3-Tap FFE in 10nm CMOS

*J. Kim, Intel, Hillsboro, OR*

In Paper 6.1, Intel presents a reconfigurable 56GS/s 3-tap FFE TX that operates up to 112Gb/s with PAM-4 or at 56Gb/s with NRZ. The transmitter employs a quarter-rate architecture, a 1UI pulse-generator-based 4-to-1 serializer combined with a CML driver, a multi-segment  $\pi$ -coil for pad bandwidth extension, and per-lane duty-cycle correction and quadrature-error correction circuits. Implemented in a 10nm FinFET CMOS technology, the TX achieves 2.07pJ/b efficiency at 112Gb/s with 0.0302mm<sup>2</sup> area.



2:00 PM

### 6.2 A 112Gb/s 2.6pJ/b 8-Tap FFE PAM-4 SST TX in 14nm CMOS

*C. Menolfi, IBM Zurich Research Laboratory, Rueschlikon, Switzerland*

In Paper 6.2, IBM Research describes a 112Gb/s PAM-4 transmitter that is based on a quarter-rate 56GS/s 8b SST DAC along with a digital 8-tap FIR filter for channel equalization. Implemented in a 14nm bulk CMOS FinFET technology the circuit occupies an area of 0.095 mm<sup>2</sup> and consumes 286mW from a 0.95V supply.
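The efficiency and data-rate figures in these summaries are related by simple arithmetic: energy per bit is power divided by bit rate (1mW per Gb/s equals 1pJ/b), and PAM-4 carries two bits per symbol. The sketch below only cross-checks the numbers quoted above; the function names are hypothetical.

```python
def energy_per_bit_pj(power_mw: float, rate_gbps: float) -> float:
    """Energy efficiency in pJ/b; 1 mW per Gb/s equals 1 pJ/b."""
    return power_mw / rate_gbps


def power_mw(energy_pj_per_bit: float, rate_gbps: float) -> float:
    """Power implied by a given efficiency and data rate."""
    return energy_pj_per_bit * rate_gbps


def pam4_bit_rate_gbps(symbol_rate_gbaud: float) -> float:
    """PAM-4 encodes 2 bits per symbol."""
    return 2.0 * symbol_rate_gbaud


if __name__ == "__main__":
    print(f"56 GS/s PAM-4 -> {pam4_bit_rate_gbps(56):.0f} Gb/s")
    print(f"Paper 6.2: 286 mW at 112 Gb/s -> {energy_per_bit_pj(286, 112):.2f} pJ/b")
    print(f"Paper 6.1: 2.07 pJ/b at 112 Gb/s -> {power_mw(2.07, 112):.1f} mW")
```

The 286mW figure for Paper 6.2 works out to about 2.55pJ/b, consistent with the 2.6pJ/b in its title, and the 2.07pJ/b of Paper 6.1 corresponds to roughly 232mW at 112Gb/s.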



2:30 PM

### 6.3 A 4-Lane 1.25-to-28.05Gb/s Multi-Standard 6pJ/b 40dB Transceiver in 14nm FinFET with Independent TX/RX Rate Support

*M. van Ierssel, Rambus, Toronto, Canada*

In Paper 6.3, Rambus presents a multi-standard, long-reach, low-power 4-lane 1.25-to-28.05Gb/s transceiver implemented in 14nm CMOS FinFET technology. By using a per-lane PLL-based CDR, the receiver supports independent data rates across the four lanes. Independent rates for the transmitter and the receiver of a single lane are also supported. The measured reach of this design is 40dB at 28.05Gb/s at a power efficiency of 6pJ/b. The area of each lane is 0.38mm<sup>2</sup>.



3:15 PM

### 6.4 A Fully Adaptive 19-to-56Gb/s PAM-4 Wireline Transceiver with a Configurable ADC in 16nm FinFET

*P. Upadhyaya, Xilinx, San Jose, CA*

In Paper 6.4, Xilinx describes a 19-to-56Gb/s PAM-4 transceiver in 16nm FinFET. The transceiver features a fully adaptive receiver consisting of a multi-stage CTLE, a configurable 3-to-7b ADC, a 14-tap FFE/1-tap DFE DSP, and a baud-rate CDR. It also includes a 4-tap FIR voltage mode transmitter. For 56Gb/s transmission over a 32dB channel, the design achieves <1e-12 BER without explicitly added crosstalk and <1e-6 with 2mV<sub>rms</sub> of added crosstalk, while consuming 9.7pJ/b. With the ADC operating in 3b mode, the transceiver achieves 6.4pJ/b over a 7.4dB channel.



3:45 PM

### 6.5 A 64Gb/s PAM-4 Transceiver Utilizing an Adaptive Threshold ADC in 16nm FinFET

*L. Wang, University of Toronto, Toronto, Canada*

In Paper 6.5, the University of Toronto presents a 64Gb/s PAM-4 transceiver including a 1.39pJ/b 1Vppd transmitter with 3-tap FFE and a 32GS/s 4.41pJ/b ADC-based receiver front-end with half-rate sampling CTLE and 6b ADC with 1b folding. The transmitter has programmable nonlinearity compensation and achieves an RLM of 99%, while generating 162fs<sub>rms</sub> jitter. The ADC quantizer is reconfigurable to allow power scaling over channels with 9-to-30dB loss. A greedy search algorithm is used to seek BER-optimal non-uniformly spaced quantizer thresholds.



4:15 PM

### 6.6 A 4.9pJ/b 16-to-64Gb/s PAM-4 VSR Transceiver in 28nm FDSOI CMOS

*E. Depaoli, STMicroelectronics, Pavia, Italy*

In Paper 6.6, STMicroelectronics describes a low-power PAM-4 transceiver in 28nm CMOS FDSOI, targeting CEI-56G-VSR applications. The voltage-mode transmitter yields larger eye openings compared to the current-mode alternative. The receiver includes CDR, eye monitor, adaptation logic and a flexible CTLE meeting the tight PAM-4 equalization demands through optimal adaptation at low, mid, and high frequency. A BER of 1e-12 is measured at 64Gb/s over a 16dB-loss channel with a power efficiency of 4.9pJ/b.



4:45 PM

### 6.7 A 32Gb/s 133mW PAM-4 Transceiver with DFE Based on Adaptive Clock Phase and Threshold Voltage in 65nm CMOS

*L. Tang, Peking University, Beijing, China*

In Paper 6.7, Peking University introduces a 32Gb/s 133mW PAM-4 transceiver implemented in 65nm CMOS. A one-tap DFE featuring a phase-adaptive clock is proposed to alleviate the tight timing constraints of the direct feedback implementation. The threshold voltages at the receiver slicers are adaptively optimized for different channels. Transmission over a 23dB loss channel achieves better than 1e-12 BER at a power efficiency of 4.16pJ/b.

## 6.1 A 112Gb/s PAM-4 Transmitter with 3-Tap FFE in 10nm CMOS

Jihwan Kim, Ajay Balankutty, Rajeev Dokania, Amr Elshazly, Hyung Seok Kim, Sandipan Kundu, Skyler Weaver, Kai Yu, Frank O'Mahony

Intel, Hillsboro, OR

The rapidly growing demand for high-bandwidth data communication infrastructure has fueled the industry to develop ultra-high-speed/density wireline links compliant with electrical interface standards such as CEI-56G and 802.3bs-400GbE. Recent publications have demonstrated CMOS transmitters (TX) operating from 50-64Gb/s [1-4], and early planning for the next generation of 100Gb/s+ wireline standards is underway. Long-reach wireline standards at 56Gb/s have largely adopted a PAM-4 modulation scheme that maintains the same symbol rate as the previous generation of 28Gb/s NRZ transceivers. For 112Gb/s transceivers, however, higher order modulation (e.g., PAM-8/16) is unlikely to be adopted due to the tradeoff in SNR and backward compatibility. Therefore, the symbol rate for 112Gb/s PAM-4 must be doubled relative to the previous generation, which requires circuit bandwidth and jitter performance to improve by roughly a factor of two. In addition, the energy efficiency of the link must be maximized to keep local power delivery and system power consumption within practical limits. This paper presents a reconfigurable 56GS/s 3-tap FFE TX that operates up to 112Gb/s with PAM-4 or at 56Gb/s with NRZ modulation. The TX employs a quarter-rate architecture, a 1-UI pulse-generator-based 4:1 serializer combined with a CML driver, a multi-segment $\pi$-coil for pad bandwidth extension, and per-lane duty-cycle detection/correction (DCD/DCC) and quadrature-error detection/correction (QED/QEC) circuits. The TX is implemented in a 10nm FinFET CMOS technology and benefits from improvements to transistor drive strength, interconnect electro-migration/resistance, and overall area scaling [5].
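The symbol-rate argument above can be checked with a short calculation. The following is a minimal sketch, not part of the paper; the helper names `symbol_rate_gbaud` and `unit_interval_ps` are hypothetical and introduced only for illustration.

```python
import math

def symbol_rate_gbaud(bit_rate_gbps: float, levels: int) -> float:
    """Symbol rate in GBd for a PAM-N link carrying log2(N) bits per symbol."""
    bits_per_symbol = math.log2(levels)
    return bit_rate_gbps / bits_per_symbol

def unit_interval_ps(symbol_rate: float) -> float:
    """One unit interval (UI) in picoseconds for a symbol rate given in GBd."""
    return 1e3 / symbol_rate

# Compare the previous generations with the 112Gb/s options discussed in the text.
for name, rate, levels in [("28Gb/s NRZ", 28, 2), ("56Gb/s PAM-4", 56, 4),
                           ("112Gb/s PAM-4", 112, 4), ("112Gb/s PAM-8", 112, 8)]:
    baud = symbol_rate_gbaud(rate, levels)
    print(f"{name}: {baud:.2f} GBd, UI = {unit_interval_ps(baud):.2f} ps")
```

Keeping PAM-4 at 112Gb/s halves the unit interval relative to 56Gb/s PAM-4, which is why bandwidth and jitter must improve by roughly a factor of two.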

The TX architecture is shown in Fig. 6.1.1. The quarter-rate clock is generated by an LC-PLL and a quadrature generator. The quadrature clocks are distributed using regulated pseudo-differential CMOS buffers to minimize deterministic jitter and clocking power. Per-lane DCD/QED and DCC/QEC maintain accurate phase spacing between clocks. The MSB and LSB data streams are generated with a 32:4 serializer followed by a 3-tap FFE retimer and segmented CML driver that performs 4:1 serialization and summing of equalization taps. The driver is connected to ESD and pad through a bandwidth-extending “ $\pi$ -coil”. The quarter-rate architecture maintains a relaxed timing window (3UI) in the critical path of data serialization preserving timing integrity across PVT [6, 7]. It also reduces the clock frequency by a factor of two compared to a half-rate architecture, which results in lower-power clock generation/distribution. Compared with the half-rate architecture, the main challenges are designing quadrature phase detection/correction circuits and the high bandwidth 4:1 serializer.

The CML output stage (Fig. 6.1.2) consists of three equalization tap slices connected in parallel ($C_{-1}/C_0/C_{+1} = 0.25/1/0.25$) and operates as a 2b DAC in PAM-4 mode. The tap coefficients are controlled using current DACs in the bias circuit. Each MSB/LSB driver segment contains four parallel unit slices (quad1-4) that simultaneously multiplex and drive the quarter-rate data onto the pad using a differential CML driver. Each differential pair is driven by a 1UI pulse-generator that creates a low-pulse when the quadrature clocks overlap to switch the current at the pad. PMOS devices in the pulse-generator keep nodes X and Y pre-charged high during the idle 3UI period to minimize ISI. The combination of the 1UI pulse-generator and the CML MUX is similar in concept to the ones used in [4] and [7], but the pulse-generator is modified to avoid series stacking of clock/data switches to improve bandwidth and swing saturation into the CML driver. This configuration also allows removal of the clock switches from the high current-density driver output net, and serializes the data directly at the pad. Rather than having a separate 4:1 serializer and a full-rate driver as in [4] and [7], serializing the data at the pad minimizes the number of full-rate nets within the TX to reduce the number of inductors and overall area.
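The effect of a 3-tap FFE on a PAM-4 stream can be illustrated behaviorally. The sketch below is an assumption-laden model rather than the paper's circuit: the amplitude mapping `PAM4_LEVELS`, the function name `ffe_3tap`, and the negative sign of the pre- and post-cursor taps (the usual de-emphasis convention, not stated in the text) are all hypothetical. The 0.25 value is taken from the slice ratio and treated here as the maximum tap weight.

```python
PAM4_LEVELS = (-3, -1, 1, 3)  # assumed amplitude mapping for 2b symbols 0..3

def ffe_3tap(levels, c_pre, c_main, c_post):
    """y[n] = c_pre*x[n+1] + c_main*x[n] + c_post*x[n-1], zero-padded at the ends."""
    out = []
    for n, x in enumerate(levels):
        nxt = levels[n + 1] if n + 1 < len(levels) else 0
        prev = levels[n - 1] if n > 0 else 0
        out.append(c_pre * nxt + c_main * x + c_post * prev)
    return out

symbols = [3, 3, 0, 0, 2, 1]                 # hypothetical 2b symbol stream
levels = [PAM4_LEVELS[s] for s in symbols]
# Pre/post taps at the maximum 0.25 ratio; the negative sign is an assumption.
print(ffe_3tap(levels, -0.25, 1.0, -0.25))
```

In the actual driver the three taps are summed as currents at the pad, so this arithmetic corresponds to the combined output of the three slices rather than to a separate digital filter.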

The TX output network uses a 3-segment inductor ($\pi$-coil) to extend the driver bandwidth and meet the stringent S11 specification. The $\pi$-coil absorbs the parasitic capacitance associated with the driver, ESD, termination resistor, and C4 bumps. Figure 6.1.3 illustrates an equivalent circuit diagram of the output network with all parasitic components. The series inductors $L_1$ and $L_2$ resonate with the primary diode ($D_{P1}$ and $D_{N1}$) capacitances and the driver-associated capacitance. The shunt inductor $L_3$ performs additional impedance boosting toward the termination resistor while canceling out its parasitic capacitance as well. The secondary ESD protection formed by the series resistance of $L_2$, diode $D_{N2}$, and parasitic diode $D_{P2}$ reduces the total diode size for the same ESD target (CDM-250V/HBM-1kV). The three inductors in the $\pi$-coil, $L_1$, $L_2$, and $L_3$, are implemented in a nested configuration to reduce area by leveraging their mutual inductance. The conductor widths for different segments are chosen to match the different current densities in each segment, resulting in lower area and a higher self-resonant frequency. Simulations show that the proposed $\pi$-coil network achieves a $2.7\times$ extension of the 3dB bandwidth compared with the RC bandwidth.

Low-jitter quadrature clock generation and accurate phase control are critical for a high-speed quarter-rate TX. A four-stage injection-locked open-loop quadrature generator similar to [8] converts the 14GHz differential clocks generated by an LC-PLL to quadrature clocks with a phase accuracy of  $\pm 4^\circ$  and distributes them to the TX. Figure 6.1.4 shows the per-lane DCD/QED and DCC/QEC circuits. The duty cycle and quadrature errors are detected using a statistical asynchronous detector [6] at the input of the 1UI pulse-generator. The duty-cycle is adjusted by injecting additional up/down currents into the output of an inverter clock buffer. Sub-100fs duty-cycle correction resolution is achieved using 6b voltage DACs that control the injection current into each leg. A cascode device is added to shield the capacitance of the current source from the clock path so that the buffer bandwidth is approximately independent of the control code. The two-stage QEC uses a device capacitor DAC for coarse correction and a metal capacitor DAC for fine correction to implement tunable clock delay.

The TX is implemented in a 10nm FinFET CMOS process and characterized with a 65GHz-bandwidth real-time oscilloscope. The TX output is measured through a top-side package connector, 6-inch cable, adapter, and DC blocking capacitor, which add approximately 4dB of loss at 28GHz. Figure 6.1.5 shows the 56Gb/s NRZ/PAM-4 and 112Gb/s PAM-4 eye diagrams. The measured 56Gb/s clock pattern RJ is 185fs<sub>rms</sub> and TJ (BER = $10^{-12}$) is 3.7ps with a 10MHz CDR bandwidth. The unequalized TX swing is 750mV<sub>ppd</sub>. The TJ for the 56Gb/s NRZ PRBS7 eye is 7.8ps, of which the residual ISI is 3.4ps. The measured 112Gb/s PRBS7 PAM-4 eye heights are >23mV (BER = $10^{-6}$). The level mismatch (RLM) is 98.5% without calibration and SNR is 31dB. The measured duty-cycle and quadrature-error correction range and resolution are $\pm 8$ps and <80fs, respectively. The TX consumes 232mW at 112Gb/s PAM-4 (2.07pJ/b) with PLL and quadrature clock generation power amortized over four lanes (for a symmetric 2+2 quad configuration). The TX front-end including regulated clock distribution, DCC/QEC, and local clock buffers consumes 193mW (1.72pJ/b). The die area of the TX front-end is $110 \times 275\mu\text{m}^2$. As shown in Fig. 6.1.6, this TX achieves the lowest area and better energy efficiency at about twice the data rate of previously published PAM-4 TXs.
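The energy-efficiency and area figures quoted above follow directly from the measured power, data rate, and layout dimensions. A minimal check, with the helper name `energy_per_bit_pj` introduced only for illustration:

```python
def energy_per_bit_pj(power_mw: float, data_rate_gbps: float) -> float:
    """mW divided by Gb/s gives pJ/b directly (1e-3 W / 1e9 b/s = 1e-12 J/b)."""
    return power_mw / data_rate_gbps

total = energy_per_bit_pj(232, 112)       # TX with amortized PLL and quadrature generation
front_end = energy_per_bit_pj(193, 112)   # TX front-end only
area_mm2 = 110 * 275 * 1e-6               # 110um x 275um converted from um^2 to mm^2

print(f"total: {total:.2f} pJ/b, front-end: {front_end:.2f} pJ/b, area: {area_mm2:.4f} mm^2")
```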

### Acknowledgements:

The authors thank A. Jimenez, P. N. Le, K. Ren, D. Shi, D. John, A. C. Durgun, R. Garcia, B. D. Grossnickle, J. Bondie, and D. Baker for their contribution to this work.

### References:

- [1] G. Steffan, et al., “A 64Gb/s PAM-4 transmitter with 4-tap FFE and 2.26pJ/b energy efficiency in 28nm CMOS FDSOI,” *ISSCC*, pp. 116-117, Feb. 2017.
- [2] T. O. Dickson, et al., “A 1.8pJ/b 56Gb/s PAM-4 transmitter with fractionally spaced FFE in 14nm CMOS,” *ISSCC*, pp. 118-119, Feb. 2017.
- [3] M. Bassi, et al., “A 45Gb/s PAM-4 transmitter delivering 1.3Vppd output swing with 1V supply in 28nm CMOS FDSOI,” *ISSCC*, pp. 66-67, Feb. 2016.
- [4] Y. Frans, et al., “A 40-to-64Gb/s NRZ transmitter with supply-regulated front-end in 16nm FinFET,” *ISSCC*, pp. 68-69, Feb. 2016.
- [5] C. Auth, et al., “A 10nm high performance and low-power CMOS technology featuring 3rd generation FinFET transistors, self-aligned quad patterning, contact over active gate and cobalt local interconnects,” *IEDM*, Dec. 2017.
- [6] J. Kim, et al., “A 16-to-40Gb/s quarter-rate NRZ/PAM4 dual-mode transmitter in 14nm CMOS,” *ISSCC*, pp. 60-61, Feb. 2015.
- [7] A. A. Hafez, et al., “32-48Gb/s serializing transmitter using multiphase serialization in 65nm CMOS technology,” *IEEE JSSC*, vol. 50, no. 3, pp. 763-775, Mar. 2015.
- [8] K. H. Kim, et al., “A 2.6mW 370MHz-to-2.5GHz open-loop quadrature clock generator,” *ISSCC*, pp. 458-459, Feb. 2008.



Figure 6.1.1: Block diagram of the NRZ/PAM-4 reconfigurable TX.



Figure 6.1.2: Diagrams of the output driver and 1UI pulse-generator timing.



Figure 6.1.3: Diagrams of output pad-network with π-coil.



Figure 6.1.4: Diagrams and measurement of duty-cycle and quadrature-error detector/corrector.



Figure 6.1.5: Measured TX eye diagrams at 28GS/s and 56GS/s.

|                         | [1]                    | [2]                  | [3]                 | [4]                  | This Work             |           |
|-------------------------|------------------------|----------------------|---------------------|----------------------|-----------------------|-----------|
| Technology              | 28nm FDSOI             | 14nm FinFET          | 28nm FDSOI          | 16nm FinFET          | 10nm FinFET           |           |
| Architecture            | Quarter-rate           | Half-rate            | Half-rate           | Quarter-rate         | Quarter-rate          |           |
| Modulation              | PAM-4                  | PAM-4                | PAM-4               | NRZ                  | NRZ                   | PAM-4     |
| Data Rate               | 64Gb/s                 | 56Gb/s               | 45Gb/s              | 64Gb/s               | 56Gb/s                | 112Gb/s   |
| Clock Source            | On-chip PLL            | External             | External            | On-chip PLL          | On-chip PLL           |           |
| FFE                     | 4-tap                  | 3-tap                | 4-tap               | 3-tap                | 3-tap                 |           |
| Driver Type             | CML                    | SST                  | SST-Hybrid          | CML                  | CML                   |           |
| Output Network          | Double T-coil          | None                 | T-coil              | T-coil               | π-coil                |           |
| Output Swing w/o FFE    | 1.2V <sub>ppd</sub>    | 0.9V <sub>ppd</sub>  | 1.3V <sub>ppd</sub> | 0.8V <sub>ppd</sub>  | 0.75V <sub>ppd</sub>  |           |
| RJ - Clock Pattern      | 290fs <sub>rms</sub>   | 318fs <sub>rms</sub> | NA                  | 150fs <sub>rms</sub> | 185fs <sub>rms</sub>  |           |
| RLM                     | 94%                    | NA                   | 94%                 | NA                   | NA                    | 99.3%     |
| SNR                     | NA                     | NA                   | NA                  | NA                   | NA                    | >31dB     |
| Energy/bit (Tx FE only) | 2.26pJ/bit** (w/o PLL) | 1.8pJ/bit (w/o PLL)  | NA                  | 5.31pJ/bit           | 4.14pJ/bit            | 1.9pJ/bit |
| Energy/bit (Tx FE only) | NA                     | NA                   | 2.66pJ/bit          | 3.51pJ/bit           | 3.44pJ/bit            | 1.7pJ/bit |
| TX Area (w/o PLL)       | NA                     | 0.035mm <sup>2</sup> | 0.28mm <sup>2</sup> | 0.32mm <sup>2</sup>  | 0.0302mm <sup>2</sup> |           |

\* SNR is measured at 100Gb/s

\*\* I/O clock generation power is excluded

Figure 6.1.6: TX performance comparison.



Figure 6.1.7: Chip microphoto.

## 6.2 A 112Gb/s 2.6pJ/b 8-Tap FFE PAM-4 SST TX in 14nm CMOS

Christian Menolfi<sup>1</sup>, Matthias Braendli<sup>1</sup>, Pier Andrea Francese<sup>1</sup>, Thomas Morf<sup>1</sup>, Alessandro Cevrero<sup>1</sup>, Marcel Kossel<sup>1</sup>, Lukas Kull<sup>1</sup>, Danny Luu<sup>1,2</sup>, Ilter Ozkaya<sup>1,3</sup>, Thomas Toifl<sup>1</sup>

<sup>1</sup>IBM Zurich Research Laboratory, Rueschlikon, Switzerland

<sup>2</sup>ETH Zurich, Zurich, Switzerland

<sup>3</sup>EPFL, Lausanne, Switzerland

The ongoing demand for higher data rates in wireline and optical communications has led to emerging standards in the 100Gb/s+ regime [1]. Although these standards are still in the definition phase they will rely on multi-level signaling such as PAM-4 along with an increasing amount of digital signal processing. In the foreseeable future, a high-performance TX will consist of a CMOS DSP frontend followed by a high sampling rate data converter [2,3], whose design remains a significant challenge. This paper presents a 112Gb/s PAM-4 SST TX that is based on a quarter-rate 56GS/s 8b SST DAC along with a digital 8-tap FIR filter for channel equalization.

A block diagram of the TX is shown in Fig. 6.2.1. The TX consists of a 56GS/s 8b DAC and a DSP that implements an 8-tap FIR equalization filter. Key components of the DAC shown in the dashed frame of Fig. 6.2.1 include a 32:4 serializer data path, an array of weighted quarter-rate SST driver segments and a clock generator that derives the required sub-rate clocks from a half-rate input clock at 28GHz. A four-phase quarter-rate clocking scheme has been utilized that relaxes timing requirements in the final 4:1 serializer stage [4,5]. Static CMOS logic has been employed throughout most of the circuit, which facilitates compact and dense layout for minimum wiring parasitics. The DAC runs from two separate supplies for data and clock related circuitry, both at a nominal value of 950mV. ESD protection diodes have been applied at the output along with asymmetric T-coils to improve wideband impedance matching.

At the DAC input the  $32 \times 8$  parallel data samples are captured at a 1/32 sub-rate clock,  $c_{32}$  (1.75GHz). A dedicated 32:4 serializer per bit weight then converts the 32 sub-rate bit weights into corresponding quarter-rate bit streams for bit0, bit1, bit2 etc., which are then distributed to the weighted quarter-rate SST driver segments. A 2b-thermometer (bit6,7), 6b-binary (bit0-5) driver segmentation scheme has been implemented to mitigate excessive DNL transitions due to MSB segment switching at the moderate cost of an additional 32:4 serializer (9x instead of 8x) and the associated quarter-rate data stream. Each weighted quarter-rate SST driver segment shown in Fig. 6.2.1 consists of a 4:1 multiplexer, a full data-rate pre-driver and a full-data-rate SST driver. The driver segments are based on a unit layout cell at the 1/64 weight (bit2). Higher weights are achieved by using multiple identical copies of the unit cell while the two lowest weight segments (bit0,1) are custom scaled cells. Each weighted quarter-rate SST driver segment may be considered as its own driver sub-system driven with its own stream of quarter-rate bit data, and their output is combined according to their relative weight.
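The segmentation described above can be sketched as a code-splitting function. This is an illustrative model only: the function names are hypothetical, and the weight of 64 LSBs per thermometer segment is inferred from a 2b thermometer section sitting on top of a 6b binary section.

```python
def segment_code(code: int):
    """Split an 8b DAC code into 3 thermometer segments (bit6,7) and 6 binary bits (bit0-5)."""
    if not 0 <= code <= 255:
        raise ValueError("code must fit in 8 bits")
    msb = code >> 6                                  # value of the two MSBs, 0..3
    thermo = [1 if msb > i else 0 for i in range(3)]
    binary = [(code >> b) & 1 for b in range(6)]     # binary[0] is bit0
    return thermo, binary

def reconstruct(thermo, binary):
    """Recombine the segment states into the DAC code (64 LSBs per thermometer segment)."""
    return 64 * sum(thermo) + sum(bit << b for b, bit in enumerate(binary))

for code in (63, 64, 127, 128, 255):
    print(code, segment_code(code))
```

The three thermometer segments plus six binary bits give the nine quarter-rate data streams mentioned in the text, and the largest element that ever switches has a weight of 64 LSBs instead of 128.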

One of the key circuits in this quarter-rate design is the 4:1 serializing multiplexer shown in Fig. 6.2.2. The design uses a pulsed pass-gate multiplexer scheme and the selection pulses  $NSEL\_0$ ,  $PSEL\_0$  are shaped with dynamic AND/OR gates driven by two consecutive overlapping quarter-rate clock phases  $CK4\_180$  and  $CK4\_270$  or  $CK4\_0$  and  $CK4\_90$ , respectively. Unlike in other proposed high-speed 4:1 multiplexer implementations [4,5], the selection pulse shaping has been strictly separated from the data selection operation, which in turn minimizes data-history-dependent components at the multiplexer output.
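The idea of deriving 1UI selection pulses from two consecutive overlapping quarter-rate phases can be modeled with ideal clocks. The sketch below is a behavioral assumption, not the paper's dynamic-gate circuit: clocks are ideal 50% duty square waves sampled once per UI, the pulse is a plain AND of two adjacent phases, and the lane numbering is hypothetical.

```python
def ck4(phase_deg: int, ui: int) -> int:
    """Ideal 50% duty quarter-rate clock sampled once per UI (period = 4 UI)."""
    shift = phase_deg // 90                # 90 degrees corresponds to 1 UI
    return 1 if (ui - shift) % 4 < 2 else 0

def select_pulse(k: int, ui: int) -> int:
    """AND of two consecutive clock phases: high for exactly 1 UI out of every 4."""
    return ck4(90 * k, ui) & ck4(90 * ((k + 1) % 4), ui)

def mux_4to1(lanes, n_ui):
    """lanes: four quarter-rate bit lists; returns the full-rate bit stream."""
    out = []
    for ui in range(n_ui):
        active = [k for k in range(4) if select_pulse(k, ui)]
        assert len(active) == 1            # exactly one pass-gate conducts per UI
        out.append(lanes[active[0]][ui // 4])
    return out

print(mux_4to1([[0, 1], [1, 1], [0, 0], [1, 0]], 8))
```

Because the pulse depends only on the clocks, the selection timing in this model is independent of the data pattern, which mirrors the stated goal of separating pulse shaping from data selection.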

The multiplexed full-data-rate 56Gb/s bit stream is then powered up in a pre-driver stage before it drives the weighted SST segment. At such high data rates, it becomes increasingly difficult to complete settling within 1UI using static CMOS inverters with a reasonable fanout, which results in data-dependent ISI within the individual weighted segments. To overcome this technology limitation and improve the pre-driver performance, an active peaking scheme has been employed [6], which is shown in Fig. 6.2.3a. The feedback loop shown within the dashed lines in Fig. 6.2.3a implements an RLC impedance at the inverter output. The net effect of this addition is an inverter stage with lower gain and higher bandwidth thanks to RLC peaking. This circuit enhancement entails some additional static power consumption; nevertheless, it minimizes ISI substantially as shown in the simulations in Fig. 6.2.3a. In our pre-driver implementation, peaking can be switched on or off depending on speed requirements. The measured power penalty for pre-driver peaking at 56GS/s operation is on the order of 20mW, which is about 9% of the DAC power consumption, while the measured reduction of ISI is on the order of $3.2\text{ps}_{\text{pp}}$.

Figure 6.2.3b shows a schematic of an SST driver segment. To achieve the static linearity requirements of an 8b DAC, the driver impedance consists of 74% linear resistance while the remaining 26% are contributed by MOS resistance. Both pull-up and pull-down branch impedances can be fine-tuned separately with 4b binary tuning codes  $\text{tunen}[3:0]$  and  $\text{tunep}[3:0]$  that are shared among all SST driver segments.

The DAC clocking shown in Fig. 6.2.1 is derived from a half-rate input clock at 28GHz that is divided in a first CPL divider to four-phase quarter-rate clock signals  $c4i$ ,  $c4q$ . The lower speed sub-rate clocks are then obtained using a synchronous divider based on conventional master-slave flip-flops. Given the quarter-rate operation, the timing accuracy of the quarter-rate clocks has a direct impact on the output jitter performance. Duty-cycle correction (DCC) circuits shown in Fig. 6.2.4a have been introduced in both quarter-rate clock paths  $c4i$  and  $c4q$  as well as in the half-rate input clock where the latter is intended for quarter-rate clock I/Q mismatch fine-tuning. The amount of correction can be set by intentional injection of an offset current into the summing input node of INV1 in Fig. 6.2.4a. In our implementation, a MOS-diode-based voltage DAC has been chosen, which has the advantage of low parasitic capacitance at the sensitive input of INV1.

To implement the TX, the DAC has been combined with a digital 8-tap PAM-4 FIR filter front-end. Synthesized in a standard digital design flow, the FIR filter operates using a 1/32 sub-rate clock,  $c_{32}$ , (1.75GHz) and processes 32 parallel 2b input data symbols to generate 32 parallel filtered data samples at 8b resolution. The 8 FIR filter taps are realized with a combination of 4 lookup tables containing pre-calculated tap values of two consecutive data symbols and summing adders. Furthermore, a bit pattern generator has been included for testing.
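One way to read the lookup-table structure described above is that each table is indexed by two consecutive 2b symbols and returns the pre-computed contribution of two adjacent taps, so four lookups and a sum give one 8-tap output sample. The sketch below follows that reading; the tap values, the PAM-4 amplitude mapping, and the function names are hypothetical, and the 8b output scaling and 32-way parallelism of the real filter are omitted.

```python
LEVELS = (-3, -1, 1, 3)   # assumed mapping of 2b symbols to PAM-4 amplitudes

def build_luts(taps):
    """One 16-entry table per pair of taps, indexed by two consecutive 2b symbols."""
    assert len(taps) == 8
    luts = []
    for j in range(4):
        c0, c1 = taps[2 * j], taps[2 * j + 1]
        luts.append([[c0 * LEVELS[s0] + c1 * LEVELS[s1] for s1 in range(4)]
                     for s0 in range(4)])
    return luts

def fir_lut(symbols, luts):
    """Output samples for n >= 7, each formed from four table lookups and a sum."""
    return [sum(luts[j][symbols[n - 2 * j]][symbols[n - 2 * j - 1]] for j in range(4))
            for n in range(7, len(symbols))]

def fir_direct(symbols, taps):
    """Reference multiply-accumulate implementation of the same 8-tap filter."""
    return [sum(taps[k] * LEVELS[symbols[n - k]] for k in range(8))
            for n in range(7, len(symbols))]

taps = [-2, 20, -5, 1, 0, 0, 1, -1]              # hypothetical integer tap weights
symbols = [0, 3, 1, 2, 3, 3, 0, 1, 2, 0, 3, 1]   # hypothetical 2b symbol stream
print(fir_lut(symbols, build_luts(taps)))
```

The table approach replaces eight multipliers per sample with four small memories and an adder tree, which suits a synthesized block running at the 1/32 sub-rate clock.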

The TX is implemented in a 14nm bulk CMOS FinFET technology. A test chip micrograph is shown in Fig. 6.2.7. The active areas of the DAC and the digital filter are $258 \times 282\mu\text{m}^2$ and $150 \times 150\mu\text{m}^2$, respectively, resulting in a total area of $0.095\text{mm}^2$. The chip has been characterized using wafer probe needles. Measured power consumption at a nominal supply of 0.95V and 112Gb/s PAM-4 operation is 230mW for the DAC and 56mW for the FIR filter, corresponding to 286mW or 2.6pJ/b for the whole TX. Figure 6.2.4b shows the measured tuning range of the quarter-rate DCC circuit at a 14GHz clock frequency. A +/-2ps duty-cycle tuning range could be achieved within 60 steps. Figure 6.2.5 shows a typical DAC static INL and DNL measurement. Figure 6.2.6 (left) shows measured PRBS15 NRZ data eyes at 50Gb/s and 56Gb/s operation without equalization. Figure 6.2.6 (right) shows measured equalized PRBS7 PAM-4 data eyes at 100Gb/s and 112Gb/s, respectively.

### Acknowledgements:

The authors would like to thank GlobalFoundries for test site access and chip manufacturing.

### References:

- [1] Optical Internetworking Forum (OIF) [Online]. Available: <http://www.oiforum.com/technical-work/current-oif-work/>
- [2] Y. M. Greshishchev, et al., "A 56GS/s 6b DAC in 65nm CMOS with 256x6b Memory," *ISSCC*, pp. 194-195, Feb. 2011.
- [3] J. Cao, et al., "A Transmitter and Receiver for 100Gb/s Coherent Networks with Integrated 4x64GS/s 8b ADCs and DACs in 20nm CMOS," *ISSCC*, pp. 484-485, Feb. 2017.
- [4] A. A. Hafez, et al., "A 32-to-48Gb/s Serializing Transmitter Using Multiphase Sampling in 65nm CMOS," *ISSCC*, pp. 38-39, Feb. 2013.
- [5] J. Kim, et al., "A 16-to-40Gb/s Quarter-Rate NRZ/PAM4 Dual-Mode Transmitter in 14nm CMOS," *ISSCC*, pp. 60-61, Feb. 2015.
- [6] H. W. Lu, et al., "A Scalable Digitized Buffer for Gigabit I/O," *IEEE CICC*, pp. 241-244, Sept. 2008.



Figure 6.2.1: Block diagram of the implemented transmitter.



Figure 6.2.2: Pulsed pass-gate multiplexer implementation using 4-phase quarter-rate clocks CK4\_{0,90,180,270}.



Figure 6.2.3: a) Full-data-rate active-peaking pre-driver, b) full-data-rate SST driver segment.



Figure 6.2.4: a) Duty-cycle correction circuit used to fine-tune the CK4 duty-cycles, b) measured quarter-rate clock duty-cycle tuning range @14GHz.



Figure 6.2.5: Measured DAC INL and DNL.



Figure 6.2.6: (left) measured NRZ PRBS15 data eyes at 50Gb/s and 56Gb/s (no equalization), (right) measured PAM-4 PRBS7 data eyes at 100Gb/s and 112Gb/s (with equalization).



Figure 6.2.7: TX test chip die micrograph.

## 6.3 A 4-Lane 1.25-to-28.05Gb/s Multi-Standard 6pJ/b 40dB Transceiver in 14nm FinFET with Independent TX/RX Rate Support

Mohammad Sadegh Jalali, Mohammad Hossein Taghavi, Angus McLaren, Jennifer Pham, Kamran Farzan, Dominic Diclemente, Marcus van Ierssel, William Song, Saman Asgarian, Chris Holdenried, Saman Sadr

Rambus, Toronto, Canada

The scaling of CMOS technology together with continued innovations in circuit and system design techniques is fueling a rising demand for increasingly high throughput serial data interfaces. However, advances in CMOS technology have little impact on channel performance, making channel impairments a bottleneck in wireline links. Furthermore, links are typically designed to cover multiple standards and are expected to operate over a wide range of data rates, making their design challenging [1-5]. This work presents a 4-lane 1.25-to-28.05Gb/s transceiver in 14nm FinFET technology. We measure a bit error rate (BER) lower than 1e-15 with a channel loss of 40dB at 28.05Gb/s.

An overview of this design is shown in Fig. 6.3.1. It includes four 28.05Gb/s lanes that share the same clock multiplier unit (CMU). The transmitter serializes the parallel data from the digital backend and transmits it to the channel. In the receiver, the data goes through the termination block, which has a loss of 1dB and a bandwidth of 35GHz. The continuous-time linear equalizer (CTLE) partially equalizes the received signal. The CTLE output is then fed to an 8-tap half-rate decision feedback equalizer (DFE) where the transmitted bits are recovered. Although the design of a low-jitter, low-area PLL-based CDR is challenging, this CDR topology is used in this work since the receivers of the four lanes are required to support independent data rates. Furthermore, the transmitter and receiver of a single lane are also needed to support independent data rates, which is required in some applications such as cellular base and fiber channels. Supporting these requirements with a PI-based CDR increases the number of PLLs in the CMU, increasing its area and power consumption, as well as the power consumption of the clock distribution network. Finally, the PLL-based CDR allows for consistent jitter tolerance performance within the supported RX/TX PPM offset range.

Figure 6.3.2 shows the details of the TX design. The transmit driver is based on a source-series terminated (SST) architecture with a 3-tap feed-forward equalizer (FFE) [1]. The FFE has one pre- and one post-cursor tap. To lower the power of the TX FFE, the tap delay generator block gets the quarter-rate data from the serializer (P2S) and generates three sets of quarter-rate outputs to implement TX equalization. The tap-delay block is designed such that the cascaded flip-flops have clocks that are at least 3UIs apart from each other (the delay of each signal compared to  $D_{IN\_BUF}$  is shown in the figure), easing the speed requirements of the flip-flop.

Figure 6.3.3 shows the RX equalization path. To support a wide range of input common-mode values, the output of the termination block is AC-coupled to the CTLE. To avoid baseline wander, the 3dB frequency of this high-pass filter is designed to be lower than 75kHz. The simulated peaking of the CTLE is 22dB at 20GHz. Although peaking above the Nyquist frequency ( $f_N$ ) amplifies noise, improvement in SNR is observed as the signal has considerable energy up to  $1.4 \times f_N$ . To achieve this high peaking frequency, four source-degenerated CML-based equalization stages are followed by a gain stage. Each equalizer stage uses shunt peaking inductors and negative capacitors to increase its bandwidth, as well as low capacitance biasing to lower the parasitic capacitance at the source of the input pair. The gain stage uses both shunt and series peaking inductors to drive the DFE load. To cancel the long tail of the pulse response, source-degeneration-based mid-band shaping is used. At lower data rates, the zero of the mid-band shaping circuitry is used as the main zero of the equalizer.

The block diagram of the even half of the half-rate DFE is shown in Fig. 6.3.3. The first DFE tap is loop-unrolled, making tap 2 the critical timing path. Tap 1 correction is applied using AC-coupling capacitors [1]. To avoid baseline wander from non-DC-balanced data, resistors are connected in parallel with the coupling capacitors. The feedback paths use CML signaling, which eliminates the timing challenge of conversion to full CMOS levels in the feedback loop. The performance of the latches following the MUX is a limiting factor in the equalization capability of the DFE. To mitigate the hysteresis effects from these latches, a third clock phase,  $\phi_{P_E}$ , is used to start clearing the latch before the start of the track phase.
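Loop unrolling of the first tap can be illustrated with a behavioral model: two comparisons are made speculatively, one for each possible value of the previous bit, and the previous decision then selects between them, so no analog feedback is needed for tap 1. The sketch below is a full-rate NRZ model under stated assumptions (ideal slicers, decisions of +1/-1, initial history of +1); it does not represent the paper's half-rate, capacitively coupled implementation, and the names are hypothetical.

```python
def dfe_unrolled(samples, h1, taps_2_to_n):
    """NRZ DFE with a speculative (loop-unrolled) first tap and direct feedback for taps 2..N."""
    history = [1] * (1 + len(taps_2_to_n))   # history[0] is the previous decision
    decisions = []
    for x in samples:
        # Direct feedback removes post-cursor ISI from taps 2..N.
        corrected = x - sum(h * d for h, d in zip(taps_2_to_n, history[1:]))
        # Two speculative comparisons, one per possible previous bit.
        if_prev_high = 1 if corrected > +h1 else -1
        if_prev_low = 1 if corrected > -h1 else -1
        d = if_prev_high if history[0] == 1 else if_prev_low
        decisions.append(d)
        history = [d] + history[:-1]
    return decisions

# Hypothetical channel with two post-cursors that close the eye for a plain slicer.
bits = [1, 1, -1, 1, 1, -1, -1, -1, 1, -1]
h1, h2 = 0.8, 0.4
prev = [1, 1]
rx = []
for b in bits:
    rx.append(b + h1 * prev[0] + h2 * prev[1])
    prev = [b, prev[0]]

print(dfe_unrolled(rx, h1, [h2]))
```

With tap 1 resolved by selection, the first feedback path that must settle within the timing budget is tap 2, which is why the text identifies it as the critical path.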

Figure 6.3.4 illustrates the PLL-based CDR. Each CDR uses a ring-VCO in place of an LC-VCO to avoid magnetic coupling between adjacent data lanes and to minimize area. While the CDR loop suppresses VCO noise, the VCO frequency deviations caused by random noise need to remain below the +/-75MHz correction capability of the CDR. The VCO uses a current-starved topology with a low-voltage regulator [6], and is calibrated to the approximate desired frequency at start-up using a reference clock. The bandwidth of LPF1 is set to 60kHz after start-up calibration to limit the open-loop integrated noise at the  $V_{osc}$  node to 3mVpp for a BER of  $10^{-17}$ , which is within the correction capability of the CDR feedback loop. The VCO has a phase detection and accumulation path that is used to adjust the center frequency in response to the data. This accumulator tunes the VCO frequency by tuning the positive terminal of the second op-amp using the digital low-pass filter (DLPF) code, and with a bandwidth of several megahertz in order to track spread-spectrum modulated data. Furthermore, each stage of the ring-VCO has an input that is controlled by the output of the phase detector, which operates using a half-rate clock. If the TX and RX rates are the same, IQ calibration is performed by injecting a half-rate clock from the CMU into the CTLE, and setting the open-loop CDR frequency to a small intentional frequency offset. The IQ mismatch is sensed by the phase detector in the digital back-end while the CDR is open-loop. In the absence of IQ mismatch, the average output of the open loop CDR phase detector remains close to zero due to cycle slipping. The detected IQ mismatch is corrected by adjusting the clock phase after the VCO. For cases where the TX and RX rates are different, IQ calibration is performed by adjusting the phase of the 0° and 90° clocks after the VCO until the average of the XOR of these clocks is zero.
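The XOR-based IQ calibration described for unequal TX and RX rates relies on a simple property: for two 50% duty-cycle clocks, the XOR output is high for a fraction of the period proportional to their phase difference, so its average (with levels mapped to -1/+1) crosses zero at exactly 90°. The following sketch assumes ideal square-wave clocks and a hypothetical function name; it does not model the actual detector circuit.

```python
def xor_average(phase_deg: float, n: int = 3600) -> float:
    """Average of XOR(I, Q) over one period, with XOR levels mapped to -1/+1."""
    acc = 0
    for k in range(n):
        t = k / n                                           # time in clock periods
        i = 1 if t % 1.0 < 0.5 else 0                       # 0-degree clock
        q = 1 if (t - phase_deg / 360.0) % 1.0 < 0.5 else 0 # delayed clock
        acc += 1 if i ^ q else -1
    return acc / n

for phase in (80, 90, 100):
    print(phase, round(xor_average(phase), 3))
```

The sign of the average indicates whether the quadrature clock is early or late, so a calibration loop can step the clock phase until the average changes sign.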

Figure 6.3.5 shows the measured performance of the transceiver. The 28.05Gb/s TX output eye is measured through the package, trace and MMPX cables, which have an estimated loss of 6dB. The cables and package trace are software de-embedded and the TX FFE is set to 2dB of post-cursor to cancel package ISI, yielding a measured random jitter (RJ) of 200fs<sub>rms</sub>. To measure the equalization capability of the transceiver, we use a channel whose overall loss, including trace and package, is 40.1dB at 28.05Gb/s, as shown in the figure. The RX output eye with a PRBS31 pattern shows an inner eye height of 86mVppd and an eye width of 0.58UI after equalization. A BER better than 1e-15 is measured for this link over a 15-hour period. The bathtub curve also shows that the extrapolated BER is better than 1e-15 (to reduce measurement time, the on-chip bathtub curve measurements are performed with a BER of 1e-8). The measured jitter tolerance (JT) of the receiver with the 40dB channel at 100MHz with a PRBS31 pattern is 0.2UIpp. This JT is consistently achieved across a range of +/-1000ppm frequency offset between the BERT and PHY reference clocks. At 10Gb/s, the measured equalization capability of the transceiver is 43dB. Each lane has a total power consumption of 170mW at 28.05Gb/s. Including a quarter of the CMU power, the power of each lane increases to 195mW.
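The 15-hour measurement window can be related to the 1e-15 BER figure with a short calculation. The sketch below assumes independent bit errors and an error-free run (the text does not state the error count), and uses the standard zero-error confidence expression $1 - e^{-N \cdot \text{BER}}$; the function names are hypothetical.

```python
import math

def bits_observed(rate_gbps: float, hours: float) -> float:
    """Number of bits transferred at rate_gbps over the given number of hours."""
    return rate_gbps * 1e9 * hours * 3600

def confidence_zero_errors(n_bits: float, ber_limit: float) -> float:
    """Confidence that the true BER is below ber_limit after n_bits with no errors."""
    return 1.0 - math.exp(-n_bits * ber_limit)

n = bits_observed(28.05, 15)
cl = confidence_zero_errors(n, 1e-15)
print(f"bits observed: {n:.3e}, confidence for BER < 1e-15: {cl:.2f}")
```

Under these assumptions the run covers about 1.5e15 bits, which is the order of magnitude needed for a direct statement about 1e-15, and explains why the extrapolated bathtub curve is used as supporting evidence.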

Figure 6.3.6 summarizes the results and compares this work with previous works. Figure 6.3.7 illustrates the die micrograph. The area of each lane, as well as that of the CMU, is 0.38mm<sup>2</sup>.

#### Acknowledgement:

The authors would like to thank the digital, layout, application and test engineering teams at Rambus for their support. Also, they thank present and past colleagues in the analog team for their contribution to this work.

#### References:

- [1] J. Bulzacchelli, et al., "A 28Gb/s 4-Tap FFE/15-Tap DFE Serial Link Transceiver in 32-nm SOI CMOS Technology", ISSCC, pp. 324-326, Feb. 2012.
- [2] H. Kimura, et al., "28Gb/s 560mW multi-standard SerDes with single stage analog front-end and 14-tap decision-feedback equalizer in 28nm CMOS," ISSCC, pp. 38-39, Feb. 2014.
- [3] P. Upadhyaya et al., "A 0.5-to-32.75Gb/s flexible-reach wireline transceiver in 20nm CMOS," ISSCC, pp. 56-57, Feb. 2015.
- [4] H. Miyaoka et al., "A 28.3Gb/s 7.3pJ/bit 35dB Backplane Transceiver with Eye Sampling Phase Adaptation in 28nm CMOS," IEEE Symp. VLSI Circuits, pp. 1-2, 2016.
- [5] T. Norimatsu, et al., "A 25Gb/s multi-standard serial link transceiver for 50dB-loss copper cable in 28nm CMOS," ISSCC, pp. 60-61, Feb. 2016.
- [6] A. Loke et al., "An 8Gb/s hyper-transport transceiver for 32nm SOI-CMOS server processors," IEEE JSSC, Vol. 47, No. 11, pp. 2627-2642, Nov. 2012.



Figure 6.3.1: Transceiver overview.

Figure 6.3.2: TX overview. DATA<sub>CM1</sub>, DATA<sub>C0</sub>, and DATA<sub>C1</sub> have 6UI, 7UI, and 8UI latency with respect to D<sub>IN\_BUF</sub>.

Figure 6.3.3: CTLE and DFE circuits.



Figure 6.3.4: Clock and data recovery block diagram.



Figure 6.3.5: Measurement results: TX eye diagram, RX eye diagram, channel response and bathtub curve.

|                              | JSSC<br>2012 [1] | ISSCC<br>2014 [2] | ISSCC<br>2015 [3] | VLSI<br>2016 [4] | ISSCC<br>2016 [5] | This work             |
|------------------------------|------------------|-------------------|-------------------|------------------|-------------------|-----------------------|
| Technology                   | 32nm<br>SOI-CMOS | 28nm<br>CMOS      | 20nm<br>CMOS      | 28nm<br>CMOS     | 28nm<br>CMOS      | FinFet                |
| Power supply [V]             | 1.2/1.05/0.85    | 1.5/1.05/0.85     | 1.2/1/0.95        | 1.5/0.96         | 1.5/0.9           | 1.2/0.9               |
| Data rate [Gb/s]             | 14-28.05         | 1.25-28.5         | 0.5-32.75         | 28.3             | 0.3-28            | <b>1.25-28.05</b>     |
| Channel loss at Nyquist [dB] | 35 @ 28Gb/s      | 30 @ 28Gb/s       | 27 @ 28Gb/s       | 35 @ 28.3Gb/s    | 51 @ 25.78Gb/s    | <b>40 @ 28.05Gb/s</b> |
| TX FFE taps                  | 4                | 3                 | -                 | 3                | 4                 | 3                     |
| RX DFE taps                  | 15               | 14                | 15                | 2                | 14                | 8                     |
| CDR Type                     | PI-Based         | PI-Based          | PI-Based          | PI-Based         | PI-Based          | <b>PLL-Based</b>      |
| TX RJ [fsrms]                | 450              | 250               | 205               | 237              | -                 | 200                   |
| RX CTLE                      | 2-stage          | 1-stage           | 3-stage           | -                | 6-stage           | <b>5-stage</b>        |
| Power/Lane [mW]              | 693              | 560               | 785               | 207              | 403               | 170                   |
| Area/Lane [mm<sup>2</sup>]   | 0.81             | 0.835             | 1.49              | 0.42             | 3.2               | 0.38                  |
| Power Efficiency [pJ/bit]    | 24.7             | 19.6              | 23.96             | 7.3              | 14.4              | 6                     |
| FoM [pJ/bit/dB]*             | 0.7              | 0.65              | 0.89              | 0.21             | 0.29              | 0.15                  |

\* FoM (pJ/bit/dB) = Power/Data\_rate/Channel\_loss
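As a quick check of the footnote formula, the following minimal sketch recomputes the power efficiency and FoM for two rows of the table. The row values are transcribed from the table above; the function names are illustrative, and the choice of which data rate each prior work used for its own entry is an assumption.

```python
# Rows transcribed from the comparison table: (label, power_mW, rate_Gbps, loss_dB).
ROWS = [
    ("[4]", 207.0, 28.3, 35.0),
    ("This work", 170.0, 28.05, 40.0),
]


def energy_per_bit_pj(power_mw, rate_gbps):
    """Power efficiency in pJ/bit: mW divided by Gb/s gives pJ/bit directly."""
    return power_mw / rate_gbps


def fom_pj_per_bit_per_db(power_mw, rate_gbps, loss_db):
    """FoM = Power / Data_rate / Channel_loss, as defined in the table footnote."""
    return energy_per_bit_pj(power_mw, rate_gbps) / loss_db


for label, power, rate, loss in ROWS:
    eff = energy_per_bit_pj(power, rate)
    fom = fom_pj_per_bit_per_db(power, rate, loss)
    print(f"{label}: {eff:.2f} pJ/bit, {fom:.3f} pJ/bit/dB")
```

Rounded to the precision used in the table, both rows reproduce the tabulated power efficiency and FoM entries.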

Figure 6.3.6: Performance summary and comparison to previous works.



Figure 6.3.7: Die micrograph.

## 6.4 A Fully Adaptive 19-to-56Gb/s PAM-4 Wireline Transceiver with a Configurable ADC in 16nm FinFET

Parag Upadhyaya<sup>1</sup>, Chi Fung Poon<sup>1</sup>, Siok Wei Lim<sup>2</sup>, Junho Cho<sup>1</sup>, Arianne Roldan<sup>2</sup>, Wenfeng Zhang<sup>1</sup>, Jin Namkoong<sup>1</sup>, Toan Pham<sup>1</sup>, Bruce Xu<sup>1</sup>, Winson Lin<sup>1</sup>, Hongtao Zhang<sup>1</sup>, Nakul Narang<sup>2</sup>, Kee Hian Tan<sup>2</sup>, Geoff Zhang<sup>1</sup>, Yohan Frans<sup>1</sup>, Ken Chang<sup>1</sup>

<sup>1</sup>Xilinx, San Jose, CA

<sup>2</sup>Xilinx, Singapore, Singapore

Trends in IoT and cloud computing continue to accelerate bandwidth demand, requiring technology innovation to cover 50G, 100G and 400G ports without significant increase in cost or power per bit. In order to mitigate the cost of infrastructure upgrade, the industry has proposed a standard for 56Gb/s PAM-4 interfaces [1] that can support legacy channels using Forward Error Correction (FEC). Recent publications achieve good pre-FEC BER performance using ADC-based receivers [2,3], but the power consumption (e.g. >550mW per lane at 56Gb/s excluding DSP/digital power in [2]) could be prohibitive for products with large numbers of transceivers. This paper demonstrates a fully integrated and adaptive 19-to-56Gb/s PAM-4 (9.5-to-28Gb/s in NRZ mode) transceiver implemented in 16nm FinFET technology that consumes significantly lower power.

Figure 6.4.1 shows a quad transceiver architecture with two fractional-N LC PLLs per quad. The receiver analog front-end consists of a three-stage continuous-time linear equalizer (CTLE) and one variable-gain amplifier (VGA) stage. In addition to high-frequency boosting CTLE stages, a mid-frequency peaking stage is used for long-tail cancellation [4], which reduces the total power consumption by reducing the number of post-cursor FFE taps while also improving SNR at the ADC input. The resolution of the 28GS/s 32-way time-interleaved SAR ADC (TI-ADC) can be configured from 3b to 7b. The lower ADC resolution is used to opportunistically save ADC and DSP power when the transceiver is used on a clean, low-loss channel. The DSP consists of an ADC gain/offset correction block, a 14-tap FFE, and a 1-tap speculative DFE. Fully integrated equalization adaptation, ADC calibration, and baud-rate CDR logic concurrently and dynamically adjust CTLE/VGA settings, TI-ADC front-end sampling phases, ADC gain/offset, and FFE/DFE coefficients in the DSP.

The TI-ADC consists of a 4-way (e.g. 7GHz for 56Gb/s) front-end interleaving stage followed by an 8-way (875MHz) interleaving stage [2]. Figure 6.4.2 shows a single slice of the 875MHz asynchronous SAR ADC. Since the input signal level can be higher than the maximum allowed voltage for thin-oxide devices, a track-and-hold (T&H) stage with a boot-strapped NMOS switch is used to meet long-term reliability requirements and to maintain good linearity. The SAR ADC can be configured from maximum 7b operation down to 3b operation by inserting a resolution control logic block. Taking advantage of the sequential nature of asynchronous SAR ADCs, the resolution control is performed by simply gating the asynchronous clock after the desired number of conversions is done.
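The interleaving arithmetic and the resolution gating can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: the unipolar input range, the function name `sar_convert`, and the convention that unresolved LSBs are left at zero are all hypothetical and not taken from the paper.

```python
# Interleaving arithmetic for 56Gb/s PAM-4 (28GS/s aggregate sampling rate).
SAMPLE_RATE_HZ = 28_000_000_000
FRONT_END_HZ = SAMPLE_RATE_HZ // 4   # 4-way front-end stage -> 7GHz
SUB_ADC_HZ = FRONT_END_HZ // 8       # 8-way second stage    -> 875MHz


def sar_convert(vin, vref=1.0, max_bits=7, active_bits=7):
    """Behavioural SAR conversion of vin in [0, vref).

    Only the first `active_bits` decisions are made; stopping the loop early
    stands in for gating the asynchronous clock after the desired number of
    conversions. Unresolved LSBs stay at 0 (an assumption of this sketch).
    """
    if not 1 <= active_bits <= max_bits:
        raise ValueError("active_bits must be between 1 and max_bits")
    code = 0
    for step in range(active_bits):
        bit = max_bits - 1 - step            # MSB first
        trial = code | (1 << bit)
        if vin >= trial * vref / (1 << max_bits):
            code = trial                     # keep the bit
    return code


print(FRONT_END_HZ, SUB_ADC_HZ)
print(sar_convert(0.7), sar_convert(0.7, active_bits=3))
```

In this model the 3b result is simply the 7b result with its four LSBs never resolved, which is why no extra hardware beyond clock gating is needed to change resolution.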

The DSP, ADC calibration, equalization adaptation, and CDR architecture is shown in Fig. 6.4.3. The data from the TI-ADC is deserialized by a factor of 2 to allow the DSP to operate at a frequency that is optimized for power consumption. The ADC calibration block performs offset, gain and timing skew calibration. The ADC offset and gain are corrected digitally in the DSP and the timing skew is corrected in the TI-ADC front-end interleaving stage. A 14-tap FFE is used to equalize 4 pre-cursor taps ( $h_{m4} - h_{m1}$ ) and 10 post-cursor taps ( $h_2 - h_{11}$ ). A speculative 1-tap DFE is used to equalize the 1<sup>st</sup> post-cursor ( $h_1$ ). A bank of digital slicers slices the FFE-corrected digital data into speculative PAM-4 data and error samples. The speculative samples are fed to the DFE multiplexer, which performs 1-tap decision feedback data selection. The CDR logic drives a phase interpolator (PI) with 1/64 UI resolution, whose output is subsequently divided by 2 to generate the 4 clock phases for the TI-ADC front-end interleaving stage. When the ADC is configured in 3b mode, the FFE/DFE blocks in the DSP are completely bypassed. The 3b ADC functions as a conventional bank of PAM-4 data/error slicers used in the analog/mixed-signal PAM-4 receiver. In this mode, the VGA is set/adapted such that the input of the ADC spans the entire full-scale range of the ADC as shown on the right side of Fig. 6.4.3. Since FFE/DFE equalization is not used, the 3b mode can only be used for a clean, low-loss channel that can be equalized using only TX-FIR and a multi-stage CTLE.
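The speculative (loop-unrolled) 1-tap DFE described above can be sketched behaviourally as follows. This is a minimal illustration, not the paper's implementation: the normalized PAM-4 levels, the slicer thresholds, and all function names are assumptions.

```python
LEVELS = (-3, -1, 1, 3)  # normalized PAM-4 levels (assumed)


def slice_pam4(v):
    """Nearest-level slicer with thresholds at -2, 0 and +2."""
    if v < -2:
        return -3
    if v < 0:
        return -1
    if v < 2:
        return 1
    return 3


def dfe_direct(samples, h1, d_init=-3):
    """Reference 1-tap DFE: feedback subtraction sits inside the decision loop."""
    prev, out = d_init, []
    for y in samples:
        prev = slice_pam4(y - h1 * prev)
        out.append(prev)
    return out


def dfe_speculative(samples, h1, d_init=-3):
    """Speculative DFE: pre-compute one decision per possible previous symbol,
    then let the previous decision select among them (the DFE multiplexer)."""
    prev, out = d_init, []
    for y in samples:
        candidates = {lvl: slice_pam4(y - h1 * lvl) for lvl in LEVELS}
        prev = candidates[prev]
        out.append(prev)
    return out


# Toy channel with a single post-cursor h1 and no noise.
tx = [3, -1, 1, -3, 3, 3, -1, 1]
h1 = 0.3
rx = [s + h1 * p for s, p in zip(tx, [-3] + tx[:-1])]
print(dfe_speculative(rx, h1) == dfe_direct(rx, h1) == tx)
```

Both forms give identical decisions; the speculative form only moves the arithmetic out of the feedback path, leaving a multiplexer as the sole element in the timing-critical loop.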

Figure 6.4.4 shows the 4-tap PAM-4 voltage-mode transmitter with programmable swing from 0.25V to 1V and linearity adjustment capability. In comparison to the CML implementation in [2], it consumes 45% less power. It utilizes a half-rate clock architecture where the last stage of serialization uses a 2-to-1 latch-mux structure. The transmitter employs a 128-to-16 serialization followed by a 4-tap FIR (pre2, pre, main, post), which is generated using a 4-symbol clock domain. This data is then fed into a 16:4 serialization to generate even and odd data for both MSBs and LSBs. After retiming, the even and odd data are serialized into a full-rate data stream using a single-ended latch-mux for clock power saving. To achieve good DC linearity, as measured using the PAM-4 eye-height mismatch ratio (RLM) [1], and to achieve better eye height/width, the driver slices are partitioned such that the MSB and LSB drivers have a ratio of ~2:1 to generate PAM-4 levels. The driver structure consists of 90x driver slices, where the main driver slices can be enabled or disabled from 16.5x–28.5x slices to control the impedance of the driver. The post and pre cursor driver slices are kept on whereas the MSB and LSB slices of pre2 can be enabled or disabled individually, such that the RLM of the driver can be altered if needed. Instead of a fixed MSB:LSB ratio of 2:1, the slices are programmed such that the MSB:LSB ratio can be altered slightly to (2-x):(1-y) to improve RLM. The swing of the driver is controlled through a double regulator structure, with regulator P that sets Vrefp from ~0.5V to 1V and regulator N that sets Vrefn from 0 to ~0.3V to support swing levels from 250mVppd to 1Vppd. 
The impedance control scheme consists of a coarse loop that tracks passive high-resolution TiN resistors by selecting the number of driver slices and a closed fine loop which further adjusts the impedance via tuneable transistor switches, maintaining good return loss and a ~100 Ohm differential output impedance regardless of swing/FIR settings.
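To make the link between the MSB:LSB ratio and RLM concrete, the sketch below evaluates the level separation mismatch ratio for four PAM-4 levels using the expression found in the IEEE 802.3 PAM-4 clauses; that reference [1] uses an identical definition is an assumption. The ideal linear driver model (each level is a signed sum of the MSB and LSB weights) and the function names are illustrative only.

```python
def rlm(v0, v1, v2, v3):
    """Level separation mismatch ratio for PAM-4 levels v0 < v1 < v2 < v3."""
    vmid = (v0 + v3) / 2.0
    es1 = (v1 - vmid) / (v0 - vmid)   # normalized inner level, lower half
    es2 = (v2 - vmid) / (v3 - vmid)   # normalized inner level, upper half
    return min(3 * es1, 3 * es2, 2 - 3 * es1, 2 - 3 * es2)


def ideal_levels(msb, lsb):
    """Hypothetical linear driver: level = (+/-)msb (+/-)lsb."""
    return (-msb - lsb, -msb + lsb, msb - lsb, msb + lsb)


print(rlm(*ideal_levels(2.0, 1.0)))   # exact 2:1 weighting
print(rlm(*ideal_levels(2.0, 0.9)))   # LSB weight reduced by 10%
```

With a perfectly linear output stage the exact 2:1 weighting gives RLM = 1 and any deviation lowers it. In a real driver whose output compresses, a slightly different ratio is what restores equal eye heights, which is the motivation for the programmable (2-x):(1-y) weighting.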

The 4-lane transceiver is fabricated in a 16nm FinFET process. Figure 6.4.5 shows the channel insertion loss (IL) profiles used to characterize the transceiver performance, showing 32dB IL and 7.5dB IL at 14GHz. The RX bathtub in 7b ADC mode, with different crosstalk (XT) levels for 32dB IL, achieves <1e-12 BER without explicitly added crosstalk and <1e-6 with 2mV<sub>rms</sub> crosstalk (injected using a random noise generator with a crest factor of 10). This is sufficient margin to meet the CEI-56G-LR-PAM-4 standard, which requires a pre-FEC BER <1e-4 [1]. The transceiver consumes 545mW per lane (~9.7pJ/b), including DSP/digital power. Eye scan after DSP shows an open PAM-4 eye for the same link. The receiver clears the jitter tolerance mask with the baud-rate CDR tracking 200ppm of frequency difference.

Figure 6.4.6 shows that the transceiver configured in 3b ADC mode over a 7.5dB channel achieves a pre-FEC BER of <1E-12 without explicitly added cross-talk. It consumes 360mW per lane (~6.4pJ/b). The PAM-4 TX achieves a measured RLM of 0.98, which is better than the required 0.95 in [1], with RJ of 180fs. The performance summary for the transceiver compares well with previously published works [2,3,5] in terms of BER but with significantly lower power. The transceiver power (TX/RX/Clocking) excluding DSP for 32dB IL is ~325mW/lane which corresponds to ~5.8pJ/b. Figure 6.4.7 shows the micrograph of the transceiver die.
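The per-bit energy figures quoted above follow directly from dividing per-lane power by the 56Gb/s line rate. The short sketch below reproduces them; the dictionary labels are descriptive only, and the DSP/digital share is an inference from the difference between the two 32dB IL numbers rather than a figure stated in the text.

```python
RATE_GBPS = 56.0  # PAM-4 line rate used for all three figures

POWER_MW = {
    "7b mode, 32dB IL, incl. DSP/digital": 545.0,
    "3b mode, 7.5dB IL": 360.0,
    "TX/RX/clocking only, 32dB IL": 325.0,
}


def pj_per_bit(power_mw, rate_gbps=RATE_GBPS):
    """mW divided by Gb/s gives pJ/b."""
    return power_mw / rate_gbps


for label, power in POWER_MW.items():
    print(f"{label}: {pj_per_bit(power):.1f} pJ/b")

# Inferred, not stated in the text: DSP/digital share at 32dB IL.
dsp_digital_mw = 545.0 - 325.0
print(f"DSP/digital (inferred): {dsp_digital_mw:.0f} mW")
```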

### Acknowledgments:

The authors would like to thank the entire Xilinx SERDES design team as well as Sai Lalith Chaitanya Ambatipudi, Santiago Asuncion, and Vaibhav Kamdar for their valuable lab support.

### References:

- [1] Optical Internetworking Forum (OIF), “CEI-56G-LR-PAM4 Long Reach Implementation Agreement Draft Text”, Optical Internetworking Forum Contribution OIF2014.380.03, June 2016.
- [2] Y. Frans, et al., “A 56-Gb/s PAM4 Wireline Transceiver Using a 32-way Time-Interleaved SAR ADC in 16-nm FinFET,” *IEEE JSSC*, vol. 52, no. 4, pp. 1101-1110, Apr. 2017.
- [3] K. Gopalakrishnan, et al., “A 40/50/100Gb/s PAM-4 Ethernet Transceiver in 28nm CMOS,” *ISSCC*, pp. 62-63, Feb. 2016.
- [4] S. Parikh, et al., “A 32Gb/s Wireline Receiver with a Low Frequency Equalizer, CTLE and 2-Tap DFE in 28nm CMOS,” *ISSCC*, pp. 28-29, Feb. 2013.
- [5] Pen-Hui Peng, et al., “A 56Gb/s PAM4/NRZ Transceiver in 40nm CMOS,” *ISSCC*, pp. 110-111, Feb. 2017.



Figure 6.4.1: Transceiver showing 4 TX/RX lanes with two fractional-N LC PLLs.



Figure 6.4.2: Reconfigurable 7b to 3b asynchronous SAR ADC slice.



Figure 6.4.3: Simplified DSP and equalization architecture and low-power 3b mode data & error decode bypassing FFE & DFE for low-loss links.



Figure 6.4.4: 4-Tap voltage-mode PAM4 transmitter.



Figure 6.4.5: Channel response showing IL at 14GHz and corresponding jitter tolerance, RX BER bathtub and internal eye scan after equalizing a 32 dB LR channel.



Figure 6.4.6: A 3b ADC mode bathtub for a 7.4dB channel without FFE/DFE, PAM-4 voltage-mode TX eye with PRBS-31 and transceiver performance summary.



Figure 6.4.7: Die photo (2.2830mm × 3.858mm).

## 6.5 A 64Gb/s PAM-4 Transceiver Utilizing an Adaptive Threshold ADC in 16nm FinFET

Luke Wang<sup>1</sup>, Yingying Fu<sup>2</sup>, MarcAndre LaCroix<sup>3</sup>, Euhan Chong<sup>3</sup>, Anthony Chan Carusone<sup>1</sup>

<sup>1</sup>University of Toronto, Toronto, Canada

<sup>2</sup>Huawei, Markham, Canada

<sup>3</sup>Huawei, Ottawa, Canada

ADC-based transceivers having up to 8 bits of resolution have been reported for PAM-4 links above 50Gb/s [1,2], although fewer bits are sufficient and offer lower power for short reach (SR) channels. To further reduce the power consumption of ADC-based wireline transceivers, non-uniform quantization has been explored [3,4] using performance metrics for the complete link, such as bit-error rate (BER), to optimize the quantizer thresholds. Both [3,4] are PAM-2 (NRZ) receivers, demonstrating non-uniform quantization specifically for a decision feedback equalizer (DFE) at 10Gb/s and a feedforward equalizer (FFE) at 4Gb/s respectively. An LMS algorithm in [4] adjusts the threshold levels requiring fine-tuning (8b resolution). This paper presents a 64Gb/s PAM-4 transceiver utilizing an ADC-based receiver (RX), with an analog front-end (AFE) based on a 6b, 1b folding, flash ADC with adaptive threshold levels. A fast greedy-search algorithm is used to choose the optimal quantizer thresholds for minimum BER over a given channel. This provides a near-optimal way of power-scaling the ADC when the channel loss doesn't require the ADC's full resolution. The optimization can work in the background for any equalizer structure, does not place additional requirements on the ADC design, and never diverges, unlike LMS-based approaches [4].

The transmitter, Fig. 6.5.1, consists of separate 32:4 serializers for the MSB and LSB data streams and 3 identical driver clusters. The MSB serializer drives 2 driver clusters while the LSB serializer drives 1 cluster. In NRZ mode, identical data streams are applied to both MSB and LSB serializers. Each driver cluster comprises 11 SST slices: 6 dedicated to the main pre-emphasis tap only, 2.5 driven by either the main or pre-cursor tap, and 2.5 driven by either the main or post-cursor tap. The half slices are needed to increase the pre-emphasis tap weight resolution in PAM-4 mode. The slice-based design enables termination calibration and facilitates nonlinearity compensation: relative weights of MSB and LSB slices can be tuned to adjust the center eye in steps of  $36mV_{ppd}$ , and the upper and lower eyes in steps of  $18mV_{ppd}$ . This allows the RLM to change from 0.99 to 0.89 if an asymmetrical eye is desired. A CML stage in parallel with the SST driver slices increases output voltage swing by a further  $100mV_{ppd}$  (10%), similar to [5]. The TX uses quarter-rate (QR) clocks for the 4:2 multiplexer (MUX) and a half-rate (HR) clock for the final 2:1 MUX. The HR clock is created by XORing the quadrature QR clocks. The skew between the QR and HR clocks is minimized by design and verified in post-layout simulations across corners. A duty cycle correction (DCC) circuit provides fine adjustment with 250fs resolution by controlling delay on the rising edge of *clock* and falling edge of *clockb* as shown.

The RX AFE, shown in Fig. 6.5.2, consists of an 8-way time-interleaved 32GS/s 6b flash ADC with 1b folding. The front-end sampler based on [6] is fitted with degeneration to provide  $\sim$ 6dB boost at  $f_s/2$ . It is driven by an HR clock and acts as a 2-way time-interleaver. Hence, as long as accurate DCC is performed on the HR clock, up to 1UI skew is tolerable on the subsequent  $f_s/8$  sub-ADC clocks. The duty-cycle corrector is designed with a range of  $\pm 10\%$  and a resolution of approximately 0.35%. Each sampler output directly drives 4 sub-ADCs. Each sub-ADC has a PMOS track-and-hold (T&H) at its input, which is in track mode for  $<2$ UI per conversion to ensure that only 1 T&H loads the sampler (on either side) at a time. The BW of the entire sampling network including interconnect, at the minimum CTLE boost and maximum gain setting, is approximately 20GHz in simulation (RCC extracted). In each sub-ADC, a VGA follows the T&H on a 1.2V supply to accept the high common-mode from the sampler. All other circuits operate from a 0.9V supply. After the VGA is an MSB comparator, a differential folding stage, and a 5b flash. The 1b folding stage rectifies the signal based on the MSB decision, reducing the number of comparators by half, hence saving power. Each comparator in the 5b flash can be individually disabled (clock-gated) for non-uniform quantization. Note that the 1b folding stage does not affect the non-uniform quantizing effect of the 5b flash since the sampled input distribution is almost symmetric in a differential link. This fact also reduces the search space for the optimal ADC threshold levels. The flash utilizes a Wallace Tree encoder and a resistor ladder for reference generation. In order to interpret the output of the flash when non-uniform quantization is employed, an encoder is needed and can be constructed as a LUT [4] (simulated power of  $\sim 1.5$ mW per sub-ADC in this technology). In this prototype, the encoder and subsequent digital equalizer are implemented off-chip in software. In order to power scale the ADC for channels not requiring the full 6b resolution, a fast greedy-search algorithm is used. Starting with a uniformly quantized 6b ADC, the back-end digital equalizer (EQ) taps are adapted using an LMS algorithm. Next, each quantization level is removed in turn by deactivating one comparator at a time and the EQ taps are re-converged. The level whose removal causes the smallest increase in BER is removed. This procedure is repeated until a power consumption or BER target is reached.
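The greedy threshold search can be sketched as a backward-elimination loop. This is a simplified illustration under loudly stated assumptions: there is no equalizer and therefore no LMS re-convergence step, the cost is a proxy (symbol errors when each quantizer region is decoded to its majority symbol, in the spirit of a LUT encoder) rather than measured BER, and the stopping criterion is a comparator count standing in for a power target. All names and the toy data set are hypothetical.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def lut_symbol_errors(thresholds, samples, symbols):
    """Proxy cost: errors when each quantizer region decodes to its majority symbol."""
    regions = defaultdict(Counter)
    for x, s in zip(samples, symbols):
        regions[bisect_right(thresholds, x)][s] += 1
    return sum(sum(c.values()) - max(c.values()) for c in regions.values())


def greedy_prune(thresholds, cost, keep):
    """Disable one comparator at a time, always the one whose removal costs least."""
    active = sorted(thresholds)
    while len(active) > keep:
        # Evaluate the cost with each remaining threshold removed in turn.
        trials = [(cost(active[:i] + active[i + 1:]), i) for i in range(len(active))]
        _, drop = min(trials)          # ties resolve to the lowest index
        del active[drop]
    return active


# Toy PAM-4 data: each level with a fixed set of small offsets (no randomness).
symbols, samples = [], []
for level in (-3, -1, 1, 3):
    for offset in (-0.6, -0.2, 0.2, 0.6):
        symbols.append(level)
        samples.append(level + offset)

uniform = [-3, -2, -1, 0, 1, 2, 3]     # uniform 3b-style starting grid


def cost(thresholds):
    return lut_symbol_errors(thresholds, samples, symbols)


pruned = greedy_prune(uniform, cost, keep=3)
print(pruned, cost(pruned))
```

On this toy data the search retains only the thresholds that separate adjacent symbols. Because every step only removes a level and re-evaluates the cost, the procedure terminates after a bounded number of evaluations, which matches the text's point that it cannot diverge.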

Prototypes are fabricated in TSMC 16nm FinFET CMOS. The TX occupies an active area of  $500\mu m \times 180\mu m$ , while the RX occupies an active area of  $650\mu m \times 250\mu m$  (Fig. 6.5.7). TX measurements are shown in Fig. 6.5.3. The PAM-4 PRBS15 eye, with package and PCB losses de-embedded at 64.375Gb/s, shows that the TX achieves an  $RLM > 0.99$  at  $1V_{ppd}$  full swing. Linearity control is possible to increase the lower eye heights by  $\sim 20mV$  each. Clock pattern jitter decomposition at 32.1875Gb/s shows an RJ of  $162fs_{rms}$  and a TJ of  $2.82ps$  at  $1e-12$ . The TX draws 89.7mW including clock distribution (simulated 26% of total) from a 1.2V supply.

The RX AFE measurement results are shown in Fig. 6.5.4. An SNDR of 27.8dB at a Nyquist input frequency of 16GHz is observed after static nonlinearity compensation is performed using an off-chip DAC to tune comparator thresholds. Peak-to-peak gain and offset mismatch between the 8 sub-ADCs up to 2.2% and 2.18LSB respectively are corrected with a combination of coarse analog correction (on-chip), and fine digital correction (off-chip) based on a single-tone low-frequency calibration. DCC suppresses the skew tone to  $\sim 45$ dBFS. The CTLE sampler measurements exhibit  $\sim$ 6dB boost with the 2<sup>nd</sup> pole at  $>20$ GHz. The complete RX-AFE consumes 283.89mW (including clocking, excluding retimer) from 0.9V/1.2V supplies. Link measurements are shown in Fig. 6.5.5 using 3 channels with different loss at 16GHz: A) 8.6dB; B) 13.6dB; and C) 29.5dB. In performing these measurements, the phase is frozen and swept using a phase interpolator on the TX side. For A, the TX equalization ( $-13.6\%$  precursor and  $-9.1\%$  postcursor) in combination with RX CTLE is enough to open the eye as shown. Using the ADC in 2b mode (as a slicer) is sufficient to obtain a  $BER < 1e-6$ , yielding a power efficiency of  $3.3pJ/b$  (TX+RX, excluding CDR, PLL). For C, the eye is completely closed at the input to the ADC. Hence, the entire 6b uniform ADC is used in conjunction with CTLE at maximum boost and TX equalization ( $-13.6\%$  precursor and  $-22.7\%$  postcursor) to achieve a  $BER < 1e-4$  with a 16-tap FFE (off-chip), yielding  $6.2pJ/b$  (excluding CDR, PLL, and equalizer). Channel B is used to investigate non-uniform quantization targeting a  $BER < 1e-5$  (limited by ADC memory acquisition time). The ADC is first calibrated as a uniform 6b quantizer with a 6-tap FFE. Comparators are then disabled using a greedy search with BER as the optimization metric. 
The final 31 non-uniform levels yield a  $BER < 4.2e-6$  at  $4.26pJ/b$  compared to a  $BER < 2.8e-5$  using the ADC in 5b uniform mode. Comparisons with recent state-of-the-art TX and RX are shown in Fig. 6.5.6. This TX achieves the best jitter, RLM and power efficiency compared to [1,2,5]. The ADC-based RX-AFE is comparable in power efficiency to existing work and its folding flash architecture allows for aggressive power scaling and non-uniform quantization that the SAR architectures used in [1,2] do not.

### Acknowledgments:

The authors would like to acknowledge the financial and logistic support provided by Huawei, especially Dustin Dunwell, David Cassan, and Davide Tonietto, and the Huawei layout team, especially Muhammad Ali Khan, Diana Ilieva, Mark Roberts, and Trevor Monson.

### References:

- [1] Y. Frans, et al., "A 56-Gb/s PAM4 Wireline Transceiver Using a 32-Way Time-Interleaved SAR ADC in 16-nm FinFET," *IEEE JSSC*, vol. 52, no. 4, pp. 1101-1110, Apr. 2017.
- [2] K. Gopalakrishnan, et al., "A 40/50/100 Gb/s 4-PAM Ethernet Transceiver in 28nm CMOS," *ISSCC*, pp. 62-63, Feb. 2016.
- [3] E. H. Chen, et al., "Power optimized ADC-based serial link receiver," *IEEE JSSC*, vol. 47, no. 4, pp. 938-951, Apr. 2012.
- [4] Y. Lin, et al., "A Study of BER-Optimal ADC-Based Receiver for Serial Links," *IEEE TCAS-I*, vol. 63, no. 5, pp. 693-704, May 2016.
- [5] G. Steffan, et al., "A 64Gb/s 4-PAM Transmitter with 4-Tap FFE and 2.26pJ/b Energy Efficiency in 28nm CMOS FDSOI," *ISSCC*, pp. 116-117, Feb. 2017.
- [6] Y. Duan, E. Alon, "A 12.8 GS/s Time-Interleaved ADC With 25 GHz Effective Resolution Bandwidth and 4.6 ENOB," *IEEE JSSC*, vol. 49, no. 8, pp. 1725 - 1738, Aug. 2014.



Figure 6.5.1: TX block diagram: 3 identical driver clusters, of which 2 driven by the MSB serializer. A DCC circuit provides adjustment (250fs res.) by controlling delay on the rising edge of *clock* and falling edge of *clockb*.



Figure 6.5.2: RX block diagram: 32GS/s 8-way time-interleaved ADC with CTLE half-rate sampler, sub-ADC with 1b folding followed by 5b full flash with individual enables for each comparator.



Figure 6.5.3: TX measurements: De-embedded PAM-4 PRBS15 eyes without and with FFE weights [-1.5, 30, -1.5], and with linearity control; jitter decomposition with clock pattern.



Figure 6.5.4: RX measurements: SNDR/SFDR; 16k point FFT with Nyquist frequency input; front-end sampling CLK DCC CTRL; CTLE responses relative to flat setting.



Figure 6.5.5: Link measurements: Insertion loss; Channel A eye (52k bits) at ADC output; Channel B ADC output histogram (black) and optimized 5b non-uniform thresholds (blue); BER bathtub for channels A and C.

|                                  | [1]         | [2]      | [5]        | This work                         |
|----------------------------------|-------------|----------|------------|-----------------------------------|
| Data Rate (Gb/s)                 | 56          | 64       | 64         | 64.375                            |
| RX ADC Res (bits)                | 8           | 8        | -          | 6                                 |
| RX AFE ENOB at Nyquist (bits)    | 4.9         | 4.9**    | -          | 4.31                              |
| RX AFE Power (mW)                | 370*        | -        | -          | 100@ChA<br>184.9@ChB<br>283.9@ChC |
| TX Swing (V <sub>pp-diff</sub> ) | 1.2         | 1.4      | 1.2        | 1                                 |
| TX RJ (fs <sub>rms</sub> )       | 200         | 240      | 290        | 162                               |
| TX RLM (%)                       | 97          | -        | 94         | 99                                |
| TX Power (mW)                    | 140         | -        | 145        | 89.7                              |
| Supplies (V)                     | 0.9/1.2/1.8 | 0.9/1.2  | 1.0/1.2    | 0.9/1.2                           |
| Chip Area (mm <sup>2</sup> )     | 2.8         | 30.87*** | -          | TX: 0.09<br>RX: 0.1625            |
| Technology                       | 16nm FinFET | 28nm     | 28nm FDSOI | 16nm FinFET                       |

\*Including retimer, \*\*at 10GHz in figure, \*\*\*total area including 2 RX/TX/PLL & DSP.

Figure 6.5.6: Comparison Table: this transceiver is the only one which allows for non-uniform quantization and has the best power scaling capability.



Figure 6.5.7: Die Photos: TX active area  $500\mu\text{m} \times 180\mu\text{m}$ , RX AFE active area  $650\mu\text{m} \times 250\mu\text{m}$ ; RX-AFE layout view with sub-blocks annotated.

## 6.6 A 4.9pJ/b 16-to-64Gb/s PAM-4 VSR Transceiver in 28nm FDSOI CMOS

Emanuele Depaoli<sup>1</sup>, Enrico Monaco<sup>1</sup>, Giovanni Steffan<sup>1</sup>, Marco Mazzini<sup>1</sup>, Hongyang Zhang<sup>2</sup>, Walter Audoglio<sup>1</sup>, Oscar Belotti<sup>1</sup>, Augusto Andrea Rossi<sup>1</sup>, Guido Albasini<sup>1</sup>, Massimo Pozzoni<sup>1</sup>, Simone Erba<sup>1</sup>, Andrea Mazzanti<sup>2</sup>

<sup>1</sup>STMicroelectronics, Pavia, Italy

<sup>2</sup>University of Pavia, Pavia, Italy

PAM-4 modulation paired with forward error correction schemes has been introduced in recent wireline communication standards operating up to 56Gb/s per-lane. PAM-4 enables a more efficient use of the available link bandwidth but, compared to NRZ, design of low-power transceivers entails new challenges. Transmitters must deliver high swing with wide bandwidth and high linearity [1-3]. The multilevel signal suffers from heightened sensitivity to channel loss and reflections, because transitions between adjacent levels are impaired from ISI generated by 3× larger pk-to-pk transitions [4]. As a result, enhanced equalization accuracy is mandatory before symbol detection.

ADC-based receivers benefit from powerful equalization and detection in the digital domain, proving robust operation over channels with high loss (>30dB) [3]. But for very-short-reach (VSR) links, i.e. 10–15dB loss chip-to-module interconnects, analog solutions are attractive for power and area savings [5,6]. In these kinds of receivers, the PAM-4 decision feedback equalizer (DFE) needs hardware triplication and improved resolution, compared to NRZ, raising difficulties in satisfying critical timing constraints with low area and power consumption.

In this work a compact, low-power analog PAM-4 transceiver (TX+RX) is proposed in 28nm CMOS FDSOI. The transmitter embeds a flexible 4-tap FIR and employs a low-power voltage-mode driver, yielding larger eye openings compared to the current-mode alternative. The quarter-rate receiver comprises a flexible continuous-time linear equalizer (CTLE), CDR, eye monitor and adaptation logic. The CTLE features a transfer function optimally adapted at low, mid, and high frequencies, allowing the link to meet the performance requirements of the CEI-56G-VSR standard without a DFE. At the maximum speed of 64Gb/s the minimum BER is  $10^{-12}$  across a 16.8dB-loss channel, and the horizontal eye opening at a BER=10<sup>-6</sup> is 0.19UI. Power dissipation is 315mW from a 1V supply, corresponding to the lowest reported link energy of 4.9pJ/b.

The receiver architecture is shown in Fig. 6.6.1. A T-coil peaking network compensates for the pad, ESD and input capacitance of the analog front-end. The latter consists of two variable gain amplifiers (VGA-1,2) and the CTLE. VGA-1, with coarse gain control, adjusts the signal swing to keep the CTLE in the linear range while VGA-2 is used for fine amplitude control at the sampler inputs. Independent phase interpolators (PIs) with 7b resolution generate the clocking signals for the edge, data and eye monitor paths. A 2-to-8GHz PLL, based on LC-VCOs and shared among multiple RX and TX slices, drives a self-calibrated injection-locked oscillator providing the eight quarter-rate phases feeding the PIs [7]. Edge and data samplers for clock and data recovery are realized with track-and-hold amplifiers followed by strong-arm comparators. After sampling, thermometer-to-binary decoders provide 4<sub>MSB</sub>+4<sub>LSB</sub> NRZ streams, further parallelized by 4:40 de-muxes. Early-late information for the second-order CDR driving the PIs is derived after the de-muxes, allowing selective removal of undesired PAM-4 transitions in the digital domain. Offsets are calibrated in the analog front-end and in each comparator with dedicated routines at start-up. The integrated eye monitor builds PAM-4 signal statistics for adaptation of the sampler thresholds, VGA gains and CTLE frequency response.

The CTLE, drawn in Fig. 6.6.2, features a transfer function that can be shaped at low, mid and high frequencies independently, enabling accurate channel inversion while greatly simplifying adaptation. The feed-forward path in the first stage ( $R_2$ - $C_2$  and  $gm_2$ ) introduces a mild ~1.5-2dB peaking at low frequency, with a zero-pole pair that can be shifted across the 0.2-1GHz range by changing  $R_2$ . The shape of the response at mid frequency (~1-10GHz) is tunable through the degeneration capacitance  $C_S$  of the transconductor in the main path. Both  $R_2$  and  $C_S$  are tuned through an iterative algorithm at start-up, leveraging eye monitor measurements. The second stage of the CTLE provides 6dB maximum boost at high frequency. A feedback topology is devised for two reasons: (1) the zero in the transfer function  $H_{HF}(s)$  (see Fig. 6.6.2) is fixed, and peaking is controlled by  $G_{LOOP}$  shifting the gain and pole positions only. As a result, a selective control of the high-frequency boost is achieved with only a mild impact at mid frequency. (2)  $H_{HF}(s)$  can be optimally adapted with LMS, by taking the CTLE output signal as gradient information and using the eye monitor for error slicing.

The transmitter (block diagram in Fig. 6.6.3) receives 40<sub>MSB</sub>+40<sub>LSB</sub> bit streams from the on-chip pattern generator. After 40:8 muxes, a shift register delays the data for 5 FIR taps (of which up to 4 are selected and combined at TX output). Finally, 8:1 muxes provide full-rate MSB and LSB data to the output drivers. The TX is clocked at quarter-rate, and the clock path includes circuits for quadrature error and duty-cycle distortion (DCD) correction. Two different TX versions, with the current-mode (CM) and voltage-mode (VM) drivers in Fig. 6.6.3, have been realized for experimental comparison. The CM driver consists of current sources, switched on/off by inverters and loaded by the termination resistors ($R_T$). The VM driver is made of 72 identical inverters in series with $R_T$, allocated to MSB and LSB data according to FIR and output voltage settings. The small programmable resistors $R_C$ are used for trimming the TX termination across process variations, while $C_C$ shorts $R_C$ at high frequency, increasing speed. The CM driver allows the use of a higher voltage supply, yielding larger output swing, while the VM driver provides better performance at high speed. In fact, the time constant of the large device parasitic capacitance $C_D$, arising from the many driver paths in parallel, is $\tau \sim R_T C_D$ in the CM driver, while it is drastically reduced to $\tau \sim r_{on} C_D$ ($r_{on}$ being the inverter resistance) in the VM topology.

The transceiver operates from 16 to 64Gb/s in PAM-4 and from 8 to 32Gb/s in NRZ. Chips comprising several transceivers are encapsulated in a BGA package and mounted on a PCB. Eye diagrams at the outputs of the CM and VM TX are compared at maximum speed in Fig. 6.6.4. In both cases the TX FIR is enabled, with the same coefficients, to recover ~3.5dB loss of the measurement setup. The two TX alternatives perform similarly in NRZ mode, but with PAM-4 the speed advantage of the VM driver yields a remarkable improvement in both vertical and horizontal eye openings.

Link tests have been carried out with the VM TX feeding the RX through a 10.6cm-long channel comprising test-board traces, connectors and cables, with a BGA-to-BGA loss of 16.8dB at 16GHz. The TX FIR is configured for 2.8dB precursor pre-emphasis. Signal quality at the samplers after RX adaptation is measured through eye-opening BER contours and bathtub curves for both NRZ and PAM-4. The results are shown in Fig. 6.6.5. In NRZ, the horizontal eye opening at BER=10<sup>-12</sup> is 0.35UI. In PAM-4, the horizontal opening is 0.19UI at BER=10<sup>-6</sup> and the bathtub is still open at BER=10<sup>-12</sup>. As shown in Fig. 6.6.5, only marginal degradation due to crosstalk is observed when two adjacent transceivers are operating simultaneously. Jitter tolerance tests, performed at maximum speed, meet the CEI-56G-VSR mask.

Finally, measured results are summarized and compared to prior works at similar data rates in Fig. 6.6.6, not considering clock generation power. The proposed transceiver demonstrates the maximum speed at the lowest link energy of 4.9pJ/b (2.1pJ/b TX, 2.8pJ/b RX), with compact silicon area occupation. Accounting for the 60mW power for clock generation, shared between eight transceivers, the link energy rises to 5.02pJ/b. A chip photo, with a breakdown of power consumption, is shown in Fig. 6.6.7.
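The energy figures above can be cross-checked with a few lines of arithmetic. The sketch below uses the TX and RX powers listed in Fig. 6.6.6 and the 64Gb/s PAM-4 rate; the variable and function names are illustrative only.

```python
DATA_RATE_GBPS = 64.0        # maximum PAM-4 data rate
TX_POWER_MW = 135.0          # TX power from Fig. 6.6.6
RX_POWER_MW = 180.0          # RX power from Fig. 6.6.6
CLOCK_POWER_MW = 60.0        # clock generation power quoted in the text
SHARING_TRANSCEIVERS = 8     # transceivers sharing the clock generation


def energy_pj_per_bit(power_mw, rate_gbps):
    """Power in mW divided by rate in Gb/s is numerically equal to pJ/b."""
    return power_mw / rate_gbps


tx_e = energy_pj_per_bit(TX_POWER_MW, DATA_RATE_GBPS)
rx_e = energy_pj_per_bit(RX_POWER_MW, DATA_RATE_GBPS)
clk_e = energy_pj_per_bit(CLOCK_POWER_MW / SHARING_TRANSCEIVERS, DATA_RATE_GBPS)

# The quoted 5.02pJ/b follows from adding the clock share to the rounded 4.9pJ/b.
link_quoted = round(tx_e, 1) + round(rx_e, 1)
link_with_clock = link_quoted + clk_e
print(f"TX {tx_e:.2f}, RX {rx_e:.2f}, clock share {clk_e:.3f}, total {link_with_clock:.2f} pJ/b")
```

The clock share comes out at roughly 0.12pJ/b per transceiver. Using the unrounded TX and RX values instead of 2.1 and 2.8 would give a slightly higher total of about 5.04pJ/b.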

### References:

- [1] T. O. Dickson, et al., "A 1.8pJ/b 56Gb/s PAM-4 Transmitter with Fractionally Spaced FFE in 14nm CMOS," ISSCC, pp. 118–119, Feb. 2017.
- [2] G. Steffan, et al., "A 64Gb/s PAM-4 Transmitter with 4-Tap FFE and 2.26pJ/b Energy Efficiency in 28nm CMOS FDSOI," ISSCC, pp. 116–117, Feb. 2017.
- [3] Y. Frans, et al., "A 56-Gb/s PAM4 Wireline Transceiver Using a 32-Way Time-Interleaved SAR ADC in 16-nm FinFET," IEEE JSSC, vol. 52, no. 4, pp. 1101–1110, Apr. 2017.
- [4] A. R.-Zamir, et al., "A Reconfigurable 16/32Gb/s Dual-Mode NRZ/PAM4 SerDes in 65-nm CMOS," IEEE JSSC, in press.
- [5] J. Im, et al., "A 40-to-56Gb/s PAM-4 Receiver with 10-Tap Direct Decision-Feedback Equalization in 16nm FinFET," ISSCC, pp. 114–115, Feb. 2017.
- [6] P.-J. Peng, et al., "A 56Gb/s PAM-4/NRZ Transceiver in 40nm CMOS," ISSCC, pp. 110–111, Feb. 2017.
- [7] E. Monaco, et al., "A 2–11GHz 7-Bit High-Linearity Phase Rotator Based on Wideband Injection-Locking Multi-Phase Generation for High-Speed Serial Links in 28-nm CMOS FDSOI," IEEE JSSC, vol. 52, no. 7, pp. 1739–1752, July 2017.



Figure 6.6.1: Receiver block diagram.



Figure 6.6.2: CTLE schematic and simulated transfer functions.



Figure 6.6.3: TX block diagram and simplified single-ended schematic of the current-mode (CM) and voltage-mode (VM) driver.



Figure 6.6.4: Measured TX eye diagrams.



Figure 6.6.5: Measured RX eye diagrams and bathtubs at maximum speed in NRZ and PAM-4.

| <b>TX summary and comparison</b> | [1] Dickson ISSCC 2017 | [3] Frans JSSC 2017 | [6] Peng ISSCC 2017 | <b>This work</b>  |
|----------------------------------|------------------------|---------------------|---------------------|-------------------|
| Technology                       | 14nm CMOS              | 16nm FinFET         | 40nm CMOS           | <b>28nm FDSOI</b> |
| Data Rate [Gb/s]                 | 56                     | 56                  | 56                  | <b>64</b>         |
| Driver                           | Voltage                | Current             | Current             | <b>Voltage</b>    |
| Swing [V]                        | 0.9                    | 1.2                 | N/A                 | <b>1</b>          |
| FFE                              | 3-tap                  | 3-tap               | 3-tap               | <b>4-tap</b>      |
| RLM [%]                          | N/A                    | 0.97                | N/A                 | <b>&gt;94</b>     |
| Area [mm²]                       | 0.035                  | (TX+RX)             | 1.4                 | <b>0.12</b>       |
| Supply [V]                       | 0.95                   | 0.9/1.2             | 1                   | <b>1</b>          |
| Power [mW]                       | 101                    | 140                 | 200                 | <b>135</b>        |
| Efficiency [pJ/bit]              | 1.8                    | 2.18                | 3.57                | <b>2.1</b>        |

| <b>RX summary and comparison</b> | [3] Frans JSSC 2017                                      | [5] Im ISSCC 2017            | [6] Peng ISSCC 2017         | <b>This work</b>       |
|----------------------------------|----------------------------------------------------------|------------------------------|-----------------------------|------------------------|
| Technology                       | 16nm FinFET                                              | 16nm FinFET                  | 40nm CMOS                   | <b>28nm FDSOI</b>      |
| Data rate [Gb/s]                 | 56                                                       | 56                           | 56                          | <b>64</b>              |
| Loss [dB]                        | 31                                                       | 10                           | 24                          | <b>16.8</b>            |
| Equalization                     | TX FIR<br>CTLE<br>24-tap FFE<br>1-tap DFE<br>(ADC based) | TX FIR<br>CTLE<br>10-tap DFE | TX FIR<br>CTLE<br>3-tap DFE | <b>TX FIR<br/>CTLE</b> |
| Min BER                          | ~1E-8                                                    | ~1E-12                       | < 1E-12                     | <b>~1E-12</b>          |
| H @ 10^-5 BER [UI]               | 0.15                                                     | 0.2                          | 0.36                        | <b>0.19</b>            |
| Area [mm²]                       | 1.4                                                      | 0.36                         | 1.26                        | <b>0.32</b>            |
| Supply [V]                       | 0.9/1.2                                                  | 0.9/1.2                      | 1/1.5                       | <b>1</b>               |
| Power [mW]                       | 370*                                                     | 230                          | 382                         | <b>180</b>             |
| Efficiency [pJ/bit]              | 6.6                                                      | 4.1                          | 6.8                         | <b>2.8</b>             |

\*DSP power not included

Figure 6.6.6: Performance summary and comparison.



Figure 6.6.7: Chip photograph.

## 6.7 A 32Gb/s 133mW PAM-4 Transceiver with DFE Based on Adaptive Clock Phase and Threshold Voltage in 65nm CMOS

Liangxiao Tang, Weixin Gai, Linqi Shi, Xiao Xiang, Kai Sheng, Ai He

Peking University, Beijing, China

With the proliferation of the Internet of Things and mobile computing, network speed is accelerating to support data-rich services. This drives an explosion in the bandwidth required of backplane interconnects, while channel length and power efficiency must be maintained. This paper presents a 32Gb/s PAM-4 transceiver fabricated in a 65nm CMOS process. It achieves a BER<10<sup>-12</sup> through 23dB channel loss at 8GHz while consuming 133mW of power.

Figure 6.7.1 shows the block diagram of the transceiver. The transmitter (TX) incorporates a low-voltage differential signaling (LVDS) PAM-4 driver in which the “Main tap” and “Post tap” work as a 4b DAC, along with a 64:8 multiplexer (MUX), 8 encoders, a clock divider and a built-in PRBS generator. The encoders generate 64b control code for different pre-emphasis strengths and signal amplitudes. Unlike the current mode logic (CML) often used in PAM-4 transmitters [1], which requires constant current to prevent fluctuation of common mode voltage, the current of this LVDS driver is proportional to the signal level of the output data. This reduces the power consumption of the driver by more than 66% for random data patterns. The data is encoded prior to the 64:8 MUX at low speed (1GHz) for further power savings. The receiver (RX) shown in Fig. 6.7.1 consists of a continuous-time linear equalizer (CTLE), a 4-way interleaved one-tap decision feedback equalizer (DFE), a PAM-4 decoder and a de-multiplexer (DEMUX). The bandwidth of the CTLE is limited to a quarter of the baud rate to suppress the half-rate and higher frequency noise. It is used to cancel the long-tail inter-symbol interference (ISI). The one-tap DFE cancels the first post-cursor ISI without amplifying channel noise. The “Phase-adaptation” module alleviates the timing constraints of direct feedback for the DFE. The threshold voltage is adaptively controlled by an off-chip adaptation engine.

It is well known that adaptive threshold voltages are desirable for PAM-4 signaling, as the received signal amplitude depends on the channel. The threshold voltage is calculated from the average peak-to-peak swing [1]. However, it may suffer from the skewed distribution of the level histogram resulting from non-linearity and offset of the sampling point decided by clock and data recovery [2]. This challenge is solved in this work as shown in Fig. 6.7.2. Instead of binary code, Gray code is used for 4 signal levels  $V_4 \sim V_1$ , representing 10, 11, 01, 00. Three threshold levels are  $+V_T$ , 0 and  $-V_T$ . The differences between adjacent levels are equal, therefore the final  $+V_T$  and  $-V_T$  shall be two times  $V_3$  and  $V_2$ , respectively. The adaptation operates as follows. In the coarse tuning, the numbers of ones and zeros in the LSBs after the PAM-4 decoder are counted. Thanks to the Gray code, the LSBs are all zeros or all ones if the thresholds are too small ( $V_2 < V_T < V_3$ ) or too large ( $V_4 < V_T$  or  $-V_T < V_1$ ) as shown in the top-right graph of Fig. 6.7.2. The threshold voltages increase or decrease until the difference between counts of ones and zeros is small enough in a bit length that tolerates the running disparity of popular data encoding. However, the threshold voltage may wander in a wide range due to the small loop gain of coarse tuning. As shown in the top-right graph of Fig. 6.7.2, the gain is approximately zero when  $V_T$  approaches the middle level of  $V_2$  and  $V_1$  or  $V_4$  and  $V_3$ .

Fine tuning follows by summing the signals with level shifting 0 and  $-V_T$ , and the signals with level shifting 0 and  $+V_T$ , as shown in the respective blocks “ $-V_T$  Sampler” and “ $+V_T$  Sampler”. After the slicer, the signs (+1 or -1) are produced. If the corresponding input signal is  $V_3$  (11) or  $V_2$  (01), the output,  $V_{out}$ , of the “Pattern Selector” is  $\text{sign}((V_3-V_T)+V_3)$  or  $-\text{sign}((V_2+V_T)+V_2)$ , respectively. As illustrated in Fig. 6.7.2, the principle of the fine tuning dictates that when  $V_3-V_T+V_3$  reaches 0,  $V_T$  is optimal. After accumulating the digital  $V_{out}$ , a DLPF sends an UP/DN signal to the counter. Threshold voltage,  $V_T$ , changes accordingly until  $V_{out}$  equals zero, statistically. As shown in the bottom-right graph of Fig. 6.7.2, the fine tuning has large gain when  $V_T$  lies between either  $V_1$  and  $V_2$  or  $V_3$  and  $V_4$ . Fine tuning is also data-pattern tolerant, which keeps the threshold voltage stable even if no  $V_3$  or  $V_2$  signal is detected. To avoid undesired convergence to ground or power supply in the case of either  $V_T < V_1$  or  $V_T > V_4$  and dead-lock of the loop in the case of  $V_2 < V_T < V_3$ , the coarse tuning is necessary to limit the effective range of the fine tuning.
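The coarse and fine tuning steps described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the level amplitudes, noise level, step sizes and block lengths are invented for the example, and the on-chip DLPF and counter are reduced to a majority vote per block. The 25% hand-over criterion is the one quoted with the measurements.

```python
import random

LEVELS = (-0.3, -0.1, 0.1, 0.3)   # V1..V4 in volts (illustrative amplitudes)
IDEAL_VT = 2 * LEVELS[2]          # equally spaced levels give V_T = 2*V3


def sample(rng, sigma=0.005):
    """One received PAM-4 sample: a random level plus Gaussian noise (assumed)."""
    return rng.choice(LEVELS) + rng.gauss(0.0, sigma)


def lsb(x, vt):
    """Gray-coded LSB: 1 for the inner levels (11, 01), 0 for the outer ones (10, 00)."""
    return 1 if -vt < x < vt else 0


def coarse_tune(rng, vt, step=0.01, block=256, tol=0.25, max_blocks=100):
    """Move V_T until the counts of ones and zeros in the LSBs differ by less than tol."""
    for _ in range(max_blocks):
        ones = sum(lsb(sample(rng), vt) for _ in range(block))
        imbalance = (2 * ones - block) / block     # (ones - zeros) / block
        if abs(imbalance) < tol:
            break
        vt += -step if imbalance > 0 else step     # mostly ones means V_T is too large
    return vt


def fine_tune(rng, vt, step=0.002, block=64, n_blocks=200):
    """Bang-bang loop driven by V_out = sign((V3-V_T)+V3) or -sign((V2+V_T)+V2)."""
    for _ in range(n_blocks):
        acc = 0
        for _ in range(block):
            x = sample(rng)
            if 0 < x < vt:                         # decoded as V3 (11)
                acc += 1 if (x - vt) + x > 0 else -1
            elif -vt < x < 0:                      # decoded as V2 (01)
                acc -= 1 if (x + vt) + x > 0 else -1
        if acc > 0:
            vt += step                             # UP
        elif acc < 0:
            vt -= step                             # DN
    return vt


rng = random.Random(0)
vt_coarse = coarse_tune(rng, vt=0.02)              # start with thresholds far too small
vt_final = fine_tune(rng, vt_coarse)
print(f"coarse: {vt_coarse:.3f} V, fine: {vt_final:.3f} V, ideal: {IDEAL_VT:.3f} V")
```

In this model the coarse loop stops as soon as $V_T$ clears $V_3$, well short of the optimum, and the fine loop then walks $V_T$ towards $2V_3$. If no sample is ever decoded as $V_3$ or $V_2$, the accumulator stays at zero and $V_T$ does not move, which mirrors the dead-lock case that coarse tuning is meant to prevent.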

Instead of a current-mode amplifier, a sense-amplifier is used to reduce the power consumption of the DFE. However, the eye height of a PAM-4 signal is only one third of the NRZ signal with nearly three times the loading. Consequently, the delay of a single-stage sense amplifier (SSSA) is too much to meet the timing constraints of direct feedback. A DFE based on a two-stage sense amplifier (TSSA) is proposed in this work, as shown in Fig. 6.7.3. The smaller loading of the first sense amplifier shortens the output delay so as to provide much larger differential input for the second sense amplifier. Better timing performance is achieved even taking into account the delay between clocks. The simulation result shows that the TSSA has 2.5 $\times$  larger output than the SSSA at a sampling time of 0.8UI.  $CKD_0$  is a clock lagging  $CK_0$ . The timing diagram in Fig. 6.7.3 shows that it takes 32.4% less time for the output of the TSSA to reach half of the supply voltage.

A phase-adaptive clock is implemented for the TSSA, as shown in Fig. 6.7.4.  $CKD$  is the clock lagging  $CK$ . In the data-tracking module, there is a replica of the DFE loop with zero input. The polarity of the DFE cell with  $CK_{90}$  is turned over to achieve a 180-degree phase shift. When the timing constraints of the DFE are satisfied, all DFE cells act as clocked inverters and the DFE loop works like a ring oscillator. Either two continuous ones or zeros generate a positive pulse on phase error signal  $ERR_{DT}$ , which indicates the timing failure of the DFE loop. The phase of  $CKD_0$  must lie between  $CK_0$  and  $CK_{90}$  to avoid false locking of the DFE loop. A positive pulse occurs on error signal  $ERR_{CT}$  in the case of the violation of the  $CKD_0$  phase. As the clock and data tracking modules work at 4GHz, whereas the DLPF works at 125MHz, a two-way interleaved “Pulse Catcher” is proposed and shown in the inset of Fig. 6.7.4. Node X is pre-charged to VDD during the low portion of  $CK_{125}$ . During the high portion of  $CK_{125}$ , the pulses on either  $ERR_{DT}$  or  $ERR_{CT}$  pull node X down to ground, which makes the phase interpolator (PI) shift the phase of  $CKD$ . Two Pulse Catchers controlled by complementary clocks  $CK_{125}$  and  $CKB_{125}$  are guaranteed to receive all the pulses on  $ERR_{DT}$  and  $ERR_{CT}$ .
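The reason for interleaving two Pulse Catchers can be shown with a simple timing model. The sketch below assumes an ideal 50% duty cycle with $CK_{125}$ high during the first half of each 8ns period; the pulse times and the function name are hypothetical.

```python
CK125_PERIOD_NS = 8.0   # 125MHz DLPF clock period


def pulse_catchers(pulse_times_ns, n_periods):
    """Return, per CK125 period, whether each catcher's node X was pulled low."""
    caught_ck = [False] * n_periods    # evaluates while CK125 is high
    caught_ckb = [False] * n_periods   # evaluates while CKB125 is high
    for t in pulse_times_ns:
        period = int(t // CK125_PERIOD_NS)
        in_first_half = (t % CK125_PERIOD_NS) < CK125_PERIOD_NS / 2
        if in_first_half:
            caught_ck[period] = True   # CKB125 catcher is pre-charging and misses it
        else:
            caught_ckb[period] = True  # CK125 catcher is pre-charging and misses it
    return caught_ck, caught_ckb


pulses = [1.0, 5.25, 9.5, 22.75]       # hypothetical ERR_DT / ERR_CT pulse times in ns
ck, ckb = pulse_catchers(pulses, n_periods=3)
flagged = [a or b for a, b in zip(ck, ckb)]
print(ck, ckb, flagged)
```

With a single catcher on $CK_{125}$, the pulses at 5.25ns and 22.75ns would fall in the pre-charge phase and be lost; combining both catchers flags every period that contains a pulse.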

The transceiver was implemented in 65nm CMOS. The eye diagram at the transmitter end is shown in Fig. 6.7.5 with an eye height of 78mV and an eye width of 0.6UI at 32Gb/s. The top-right graph of Fig. 6.7.5 shows that the threshold voltage converges for either large or small initial codes through the off-chip adaptation engine operating at low speed. Fine tuning kicks in when the counts of ones and zeros have less than 25% difference. Fig. 6.7.6 shows two measured bathtub curves for two channels. When the DFE is active, BER<10<sup>-12</sup> may be achieved for both channels, whose measured losses are 18dB and 23dB at 8GHz, respectively. The measured adaptation of the clock phase for the DFE is shown in the bottom-right of Fig. 6.7.6. The phase delay of $CKD$ lagging $CK$ converges to 28 degrees and BER reaches less than 10<sup>-12</sup>. TX and RX occupy 0.028mm<sup>2</sup> and 0.16mm<sup>2</sup> with measured power consumptions of 53mW and 80mW, respectively. The bottom table in Fig. 6.7.5 summarizes the performance of this design compared with the latest published results. This design achieves the best power efficiency of 4.16pJ/b with comparable equalization capability. Figure 6.7.7 shows the die photo and power breakdown.
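As a quick consistency check, the quoted power efficiency follows directly from the measured TX and RX powers and the 32Gb/s data rate:

$$\frac{P_{TX} + P_{RX}}{\text{Data rate}} = \frac{53\text{mW} + 80\text{mW}}{32\text{Gb/s}} = \frac{133\text{mW}}{32\text{Gb/s}} \approx 4.16\text{pJ/b}$$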

### Acknowledgements:

This work was supported in part by National Natural Science Foundation of China (61774006, 61376035). The authors thank C.K. Ken Yang and Hongfei Ye for technical discussions and their valuable advice. The authors also thank Jichao Huang, Xiaoting Zhi, Shihao Li and Tong Zhao for their help in simulation and layout.

### References:

- [1] P.-J. Peng, et al., “A 56Gb/s PAM-4/NRZ Transceiver in 40nm CMOS,” ISSCC, pp. 110-111, Feb. 2017.
- [2] K.-L. J. Wong, et al., “Edge and Data Adaptive Equalization of Serial-Link Transceivers,” IEEE JSSC, vol. 43, no. 9, pp. 2157-2169, Sept. 2008.
- [3] D. Cui, et al., “A 320mW 32Gb/s 8b ADC-Based PAM-4 Analog Front-End with Programmable Gain Control and Analog Peaking in 28nm CMOS,” ISSCC, pp. 58-59, Feb. 2016.
- [4] K. Gopalakrishnan, et al., “A 40/50/100Gb/s PAM-4 Ethernet Transceiver in 28nm CMOS,” ISSCC, pp. 62-63, Feb. 2016.
- [5] A. Roshan-Zamir, et al., “A Reconfigurable 16/32 Gb/s Dual-Mode NRZ/PAM4 SerDes in 65-nm CMOS,” IEEE JSSC, vol. 52, no. 9, pp. 2430-2447, Sept. 2017.



Figure 6.7.1: Block diagram of the transceiver.



Figure 6.7.2: Adaptive threshold voltage control.



Figure 6.7.3: Simulated performance of the TSSA.



Figure 6.7.4: Adaptation of clock phase for the DFE.



Figure 6.7.5: Eye diagram of TX, adaptive code for threshold voltage, and performance comparison.



Figure 6.7.6: Measured loss of two channels, bathtub curves and adaptation of phase delay.



Figure 6.7.7: Die photo and power breakdown.

# Session 7 Overview: *Neuromorphic, Clocking and Security Circuits*

## DIGITAL CIRCUITS SUBCOMMITTEE



**Session Chair:**  
**Youngmin Shin**  
Samsung, Hwasung, Korea



**Associate Chair:**  
**Phillip Restle**  
IBM T. J. Watson Research Center,  
Yorktown Heights, NY

### Subcommittee Chair: **Edith Beigné**, CEA-LETI, Grenoble, France

The eight papers in this session highlight developments in neuromorphic acceleration, clocking circuits and security building blocks. A highlighted paper demonstrates a neuromorphic accelerator with stochastic synapses and embedded online reinforcement learning in autonomous micro-robots. The clocking papers presented demonstrate an all-digital multiplying DLL, a synthesizable fractional-N PLL and a synthesizable period-jitter sensor. Improvements to random-number generators and physically unclonable functions provide lower error rates and lossless stabilization by a novel remapping scheme.



8:30 AM

**7.1 A 0.0056mm<sup>2</sup> All-Digital MDLL Using Edge Re-Extraction, Dual-Ring VCOs and a 0.3mW Block-Sharing Frequency Tracking Loop Achieving 292fs<sub>rms</sub> Jitter and -249dB FOM**

*S. Yang*, University of Macau, Macau, China

In Paper 7.1, the University of Macau presents an MDLL with an all-digital frequency tracking loop in 28nm CMOS occupying 0.0056mm<sup>2</sup>. The frequency tracking loop consumes 0.3mW and the MDLL achieves 292fs integrated jitter with -249dB FoM.



9:00 AM

**7.2 A 0.02mm<sup>2</sup> Fully Synthesizable Period-Jitter Sensor Using Stochastic TDC Without Reference Clock and Calibration in 10nm CMOS Technology**

*K. Choo*, Samsung Electronics, Hwaseong, Korea

In Paper 7.2, Samsung Electronics describes a fully synthesizable period-jitter sensor that does not require a reference clock or calibration. Three jitter sensors with different sizes are fabricated in 10nm CMOS; they consume 1.5mW, 6mW and 24mW and occupy 0.0012mm<sup>2</sup>, 0.005mm<sup>2</sup> and 0.02mm<sup>2</sup>, respectively.



9:30 AM

**7.3 A 0.3-to-1.2V Frequency-Scalable Fractional-N ADPLL with a Speculative Dual-Referenced Interpolating TDC**

*M. Lee*, Pohang University of Science and Technology, Pohang, Korea

In Paper 7.3, Pohang University demonstrates a fractional-N ADPLL for DVFS with a frequency-tracking speculating dual-interpolation TDC. The PLL, implemented in  $0.0043\text{mm}^2$  with 28nm CMOS, achieves a wide frequency/voltage range without calibration. The PLL FOM is  $-225\text{dBc/Hz}$  at 2GHz 1.0V and  $-203\text{dBc/Hz}$  at 20MHz 0.3V.



10:15 AM

**7.4 A 55nm Time-Domain Mixed-Signal Neuromorphic Accelerator with Stochastic Synapses and Embedded Reinforcement Learning for Autonomous Micro-Robots**

*A. Raychowdhury*, Georgia Institute of Technology, Atlanta, GA

In Paper 7.4, Georgia Institute of Technology presents a  $3.4\text{mm}^2$  55nm CMOS test chip demonstrating online reinforcement learning in autonomous micro-robots. Ultra-low-power operation is achieved through time-domain mixed-signal circuit design and stochastic synaptic connections. Measured performance is 3.12TOPS/W and peak energy-efficiency is 1.25pJ/MAC.



10:45 AM

**7.5 An Enhanced-Security Buck DC-DC Converter with True-Random-Number-Based Pseudo Hysteresis Controller for Internet-of-Everything (IoE) Devices**

*L.-C. Chu*, National Chiao Tung University, Hsinchu, Taiwan

In Paper 7.5, National Chiao Tung University presents a buck converter in 55nm with a true random-number-based pseudo hysteresis controller enhancing security against power-side-channel attacks (PSCA) and power-injection attacks (PIA) simultaneously. Measured peak electromagnetic interference noise of $54.32\text{dB}\mu\text{V}$ meets the EN 55032 Class B requirement.



11:15 AM

**7.6 A Secure Camouflaged Logic Family Using Post-Manufacturing Programming with a 3.6GHz Adder Prototype in 65nm CMOS at 1V Nominal  $V_{DD}$**

*N. E. C. Akkaya*, Carnegie Mellon University, Pittsburgh, PA

In Paper 7.6, Carnegie Mellon University presents a secure camouflaged logic family to protect IP in ICs during manufacturing and defend against reverse engineering. A camouflaged 4b carry-select adder was implemented in 65nm bulk CMOS running at 3.6GHz at 1V  $V_{DD}$  and room temperature. Circuits can be performance-boosted or securely erased in the field.



11:30 AM

**7.7 A PUF Scheme Using Competing Oxide Rupture with Bit Error Rate Approaching Zero**

*M.-Y. Wu*, eMemory, Hsinchu, Taiwan

In Paper 7.7, eMemory presents a physically unclonable function (PUF) scheme using an oxide rupture mechanism, with output proven to be uniformly random and reliable under varying operating conditions. Bit error rate is consistently low regardless of voltage, temperature, aging and fab corner in 55nm CMOS with a unit size of $0.66\mu\text{m}^2$.



11:45 AM

**7.8 A 445F<sup>2</sup> Leakage-Based Physically Unclonable Function with Lossless Stabilization Through Remapping for IoT Security**

*J. Lee*, Sungkyunkwan University, Suwon, Korea

In Paper 7.8, Sungkyunkwan University presents a leakage-based physically unclonable function (PUF) with  $445\text{F}^2$  area per bit in 180nm CMOS. Lossless stabilization is achieved by a remapping scheme, where PUF cells in unstable challenge-response pairs (CRPs) are remapped to construct stable CRPs, avoiding costly CRP loss in a conventional trimming approach. Lowest achieved BER is 0.004%.

## 7.1 A 0.0056mm<sup>2</sup> All-Digital MDLL Using Edge Re-Extraction, Dual-Ring VCOs and a 0.3mW Block-Sharing Frequency Tracking Loop Achieving 292fs<sub>rms</sub> Jitter and -249dB FOM

Shiheng Yang<sup>1</sup>, Jun Yin<sup>1</sup>, Pui-In Mak<sup>1</sup>, Rui P. Martins<sup>1,2</sup>

<sup>1</sup>University of Macau, Macau, China

<sup>2</sup>Instituto Superior Tecnico/University of Lisboa, Lisbon, Portugal

Multiplying delay-locked loops (MDLL) and injection-locked clock multipliers (ILCM) have shown improved jitter performance in recent years [1-5], but their PLL-based frequency-tracking loops (FTLs) for securing performance against frequency and PVT variations are area and power hungry. In [1], the frequency ( $F_{OUT}$ ) is tracked by a replica-delay cell of the VCO, such that the intrinsic phase information is preserved under reference injection (REF-INJ). Regrettably an analog FTL is susceptible to noise and mismatch from the charge pump, demanding more area (loop filter: 0.01mm<sup>2</sup>) and power (charge pump: 0.8mW) to limit the in-band jitter deterioration.

For the digital FTLs, using double REF-INJ and pulse-width comparison [2] can track  $F_{OUT}$  while expanding the REF-INJ bandwidth. Yet, two extra blocks (digital-to-analog and time-to-voltage converters: 0.015mm<sup>2</sup>) consume significant area for calibrating the duty-cycle mismatch and comparator offset, which otherwise could cause deterministic error. In [3], the deterministic error is resolved by calculating the period's error rate, but this entails the use of a 3.5mW algorithm for in-situ time-domain mismatch detection. In [4], a pulse-gating method skips the REF-INJ periodically every 8 REF cycles, such that the jitter can be accumulated for phase-error comparison. The main expense is a reduced  $F_{REF}$  in terms of the REF-INJ bandwidth (12.5%) and FTL bandwidth (78.5%).

The presented all-digital MDLL (Fig. 7.1.1) features edge re-extraction, dual-ring VCOs and a low-power (0.3mW) block-sharing FTL. Together, they improve the jitter performance (292fs<sub>rms</sub> at 3GHz) and stabilize it (<9% variation) against voltage, temperature and frequency changes, while requiring less area (0.0056mm<sup>2</sup>) and power (1.45mW) than recent art [1-5].

In our MDLL, the FTL incorporates a frequency selector for frequency coarse tuning (Fig. 7.1.1) and tracks $(FCW-1) \times T_0 + T_1 = T_{REF}$, where FCW is the frequency command word, and $T_0$, $T_1$ and $T_{REF}$ are the free-running, REF-INJ and REF periods, respectively. The edge extractor and time-interval comparator are used for frequency fine tuning to achieve $T_0=T_1$. The edge extractor is reused to extract the specific OUT edges from $E_1$ and $E_2$ containing the information of $T_1$ and $T_0$. For frequency fine tuning, the time-interval comparator directly detects the difference ($\Delta T$) between $T_0$ and $T_1$. The coarse tuning and fine tuning loops run automatically in the background to accomplish frequency locking ($T_{REF}=FCW \times T_0$), even under a large frequency disturbance ($>F_{REF}$), which the FTLs in [2,3] cannot handle.

Figure 7.1.2 details the calibration steps at FCW=4. The digitally controlled delay line (DCDL)'s offset calibration is realized by: i) extracting the 1<sup>st</sup> and 2<sup>nd</sup> edges via the edge extractor controlled by  $SEL_{REF}$ , such that  $T_0$  is not affected by REF-INJ ( $t^{1st}-t^{2nd}=T_0$ ); ii) sending the 1<sup>st</sup> edge via the DCDL to delay it by  $T_{DCDL}$ , and comparing it with the 2<sup>nd</sup> edge via a bang-bang phase detector (BBPD); and, iii) returning the BBPD's decision to adjust the DCDL until  $T_{DCDL}=T_0$ .

Next, the fine-tuning calibration is enabled to compare  $T_0$  and  $T_1$  in 3 steps: i) extract the 3<sup>rd</sup> and 4<sup>th</sup> edges by  $SEL_{(FCW-2)}$  sharing from the counter output; ii) delay the 3<sup>rd</sup> edge by  $T_0$  via the DCDL, and compare it with the 4<sup>th</sup> edge via the same BBPD ( $t^{3rd}-t^{4th}=T_1$ ); and, iii) send the decision to the fine-tuning path to adjust  $F_{OUT}$ . After  $F_{OUT}$  is finely tuned, the DCDL offset calibration will execute for the subsequent  $T_{REF}$  to work for  $T_{DCDL}=T_0$ .

The above two calibration steps run separately in each  $T_{REF}$  to continuously track  $F_{OUT}$ , while sharing the same blocks. When  $T_0=T_1$ ,  $\Delta T$  is nullified and  $F_{OUT}$  is driven to  $FCW \times F_{REF}$ . The overall time-domain calibration of the DCDL has a wide timing window (200-810ps) and a fine resolution ( $K_{DCDL}$  of 0.35ps/LSB) to ensure a wide  $F_{OUT}$  coverage (1.6-3.2GHz). The area (0.0015mm<sup>2</sup>) is 10x more efficient than its voltage-domain counterpart in [3] (0.015mm<sup>2</sup>). The edge extractor and the BBPD are reused for  $T_0$  and  $T_1$  to avoid offset and mismatch caused by the DFF and BBPD.
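The two alternating calibration steps can be sketched as a pair of bang-bang loops. The model below is a simplification under stated assumptions: the BBPD is an ideal sign detector, noise is ignored, the operating point (FCW=15 with a 200MHz reference, i.e. a 3GHz target) and the initial values are chosen for illustration, and the $T_0$ step per fine-tuning decision is derived from the 120kHz/LSB DAC resolution at 3GHz rather than taken from the text.

```python
T_REF_PS = 5000.0      # 200MHz reference period
FCW = 15               # target F_OUT = 3GHz (illustrative operating point)
K_DCDL_PS = 0.35       # DCDL resolution per LSB, from the text
K_T0_PS = 0.0133       # assumed T0 step: about 120kHz/LSB at 3GHz (dT = dF / F^2)


def run_ftl(t0_ps, t_dcdl_ps, n_ref_cycles):
    """Alternate DCDL offset calibration and fine tuning, one step per T_REF."""
    for cycle in range(n_ref_cycles):
        if cycle % 2 == 0:
            # DCDL offset calibration: delayed 1st edge vs. 2nd edge (both free-running)
            t_dcdl_ps += K_DCDL_PS if t_dcdl_ps < t0_ps else -K_DCDL_PS
        else:
            # REF-INJ realigns the phase, so the injected period absorbs the error
            t1_ps = T_REF_PS - (FCW - 1) * t0_ps
            # Fine tuning: a long T1 means the VCO runs fast, so T0 is increased
            t0_ps += K_T0_PS if t1_ps > t_dcdl_ps else -K_T0_PS
    return t0_ps, t_dcdl_ps


t0, t_dcdl = run_ftl(t0_ps=340.0, t_dcdl_ps=300.0, n_ref_cycles=4000)
f_out_ghz = 1e3 / t0
print(f"T0 = {t0:.2f} ps, T_DCDL = {t_dcdl:.2f} ps, F_OUT = {f_out_ghz:.3f} GHz")
```

Because the same delay line and detector serve both steps in this model, any fixed offset in them would cancel between the $T_0$ and $T_1$ measurements, which is the motivation given for block sharing.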

Figure 7.1.3 (upper-left) depicts the seamless frequency-tuning scheme. It features coarse tuning (B<13:11>) directly applied to the VCO's switched varactor banks, while moderate (MSB<10:6>) and fine (LSB<5:0>) tuning are applied as the varactor's control voltage through the DAC. The voltage step of one MSB (~25mV) is around half of the voltage range of the whole LSB band. Once the LSB is full (<111111>), the MSB will increment, and the LSB is set to the middle (<100000>). This undertaking not only improves the linearity and monotonicity, but also avoids the unwanted voltage step (~25mV with 50% overlapping) that otherwise could cause frequency fluctuation (4MHz) when all LSB bits are fully switched on/off. The DAC (Fig. 7.1.3, upper-right) is implemented by 2 banks of coarse-fine current steering: MSB (LSB) for coarse (fine) control, and is 20x more area efficient (0.001mm<sup>2</sup>) than the unit-element-based tuning in [3] (0.02mm<sup>2</sup>). The DAC for VCO tuning has a fine resolution ($K_{DAC-VCO}=120\text{kHz}/\text{LSB}$) to match with the $K_{DCDL}$,

$$K_{DAC-VCO} = K_{DCDL} \times \frac{(F_{OUT})^2}{FCW}$$
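The MSB/LSB hand-over of the seamless tuning scheme can be sketched as follows. The model assumes ideal DAC weights in which one MSB step equals exactly half of the LSB range (the text says "around half"), and it only covers the upward hand-over that the text describes; the function names are illustrative.

```python
LSB_BITS = 6
LSB_MAX = (1 << LSB_BITS) - 1      # <111111>
LSB_MID = 1 << (LSB_BITS - 1)      # <100000>
MSB_MAX = 31                       # MSB<10:6> is a 5b field
MSB_WEIGHT = LSB_MID               # assumed: one MSB step = half of the LSB range


def control_level(msb, lsb):
    """DAC output expressed in LSB units, assuming ideal weights."""
    return msb * MSB_WEIGHT + lsb


def step_up(msb, lsb):
    """Raise the control word by one LSB, handing over to the MSB when the LSB is full."""
    if lsb < LSB_MAX:
        return msb, lsb + 1
    if msb < MSB_MAX:
        return msb + 1, LSB_MID    # MSB increments, LSB is re-centred
    return msb, lsb                # saturate at the top of the range


msb, lsb = 0, LSB_MID
levels = [control_level(msb, lsb)]
for _ in range(200):
    msb, lsb = step_up(msb, lsb)
    levels.append(control_level(msb, lsb))
steps = {b - a for a, b in zip(levels, levels[1:])}
naive_jump = control_level(1, 0) - control_level(0, LSB_MAX)   # plain carry, LSB reset to 0
print(steps, naive_jump)
```

With re-centring, every update moves the control level by a single LSB. A plain binary carry that reset the LSB to zero would instead jump by about 31 LSBs, roughly 3.7MHz at 120kHz/LSB, which is in line with the ~4MHz fluctuation mentioned above.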

Small  $K_{DAC-VCO}$  and  $K_{DCDL}$  are essential to reduce the reference spur (e.g. -65dBc at 1.8GHz and FCW=9),

$$\text{REF. Spur}\big|_{K_{DAC-VCO}} \approx 20 \log \left( \frac{K_{DAC-VCO}}{F_{REF}} \right)$$

and

$$\text{REF. Spur}\big|_{K_{DCDL}} \approx 20 \log \left( \frac{FCW \times K_{DCDL}}{T_{REF}} \right)$$

while the VCO's phase noise further aids to randomize the reference spur.
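The three expressions above can be evaluated at the quoted operating point (1.8GHz, FCW=9) using the 0.35ps/LSB DCDL resolution. The snippet is a plain numerical check; the variable names are illustrative.

```python
import math

K_DCDL_S = 0.35e-12            # DCDL resolution per LSB, from the text
F_OUT_HZ = 1.8e9               # operating point quoted for the spur estimate
FCW = 9
F_REF_HZ = F_OUT_HZ / FCW      # 200MHz reference
T_REF_S = 1.0 / F_REF_HZ

# DAC-VCO resolution that matches the DCDL resolution
k_dac_vco_hz = K_DCDL_S * F_OUT_HZ**2 / FCW

# Reference-spur estimates from the two resolutions
spur_dac_dbc = 20 * math.log10(k_dac_vco_hz / F_REF_HZ)
spur_dcdl_dbc = 20 * math.log10(FCW * K_DCDL_S / T_REF_S)

print(f"K_DAC-VCO = {k_dac_vco_hz / 1e3:.0f} kHz/LSB")
print(f"spur (DAC-VCO) = {spur_dac_dbc:.1f} dBc, spur (DCDL) = {spur_dcdl_dbc:.1f} dBc")
```

The matched resolution evaluates to about 126kHz/LSB, close to the 120kHz/LSB of the implemented DAC, and both spur estimates land near -64dBc, consistent with the approximate -65dBc quoted above. The two estimates coincide because the resolutions are matched through the first equation.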

The dual-ring VCOs (Fig. 7.1.3, lower-left) are based on the REF-INJ MDLL with different sizes, covering two bands with overlap: 1.55-2.47GHz and 2.35-3.35GHz. A replica cell is inserted before the MUX to match the slopes between the injection signals and the outputs of the ring VCO. The pseudo-differential delay cells are implemented using inverters coupled in a feed-forward manner, with varactor loading for frequency tuning. $F_{OUT}$ is tuned by the varactor to preserve a constant jitter performance over the tuning range, instead of tuning $V_{DD}$ [2, 4] or current [1], which suffer from a degraded phase noise at low frequency (since the carrier power decreases faster than the noise power). The SEL for the multiplexer is controlled directly by REF, generating a pulse like an ILCM. This technique solves the timing bottleneck of the conventional MDLLs [2, 3] that utilize the output signal via a frequency divider. The SEL's window size ($T_{SEL}$) for REF-INJ is properly controlled, since otherwise the deterministic error (glitch) from an oversize (undersize) window will disturb the OUT period.

With the proposed block-sharing FTL, the MDLL occupies 0.0056mm<sup>2</sup> in 28nm CMOS. It consumes 1.45mW at 0.8V over a 1.55-to-3.35GHz range (VCO: 1.15mW, FTL: 0.3mW). With a 200MHz  $F_{REF}$ , the integrated jitter (10kHz to 40MHz) at 3.0GHz improves from 463 to 292fs<sub>rms</sub> when the FTL is enabled, and the result is dominated by the BBPD's and REF's noise (Fig. 7.1.4). The reference spur is -44dBc, dominated by reference power coupling in the I/O, and can be improved by upsizing the reference buffer. As for the robustness of the MDLL, the worst-case variation of the integrated jitter is <9% against  $V_{DD}$  (0.74 to 0.9V), temperature (-40 to 120°C) and frequency (1.6 to 3.2GHz), as shown in Fig. 7.1.5.

Benchmarked against the recent ring-VCO-based MDLLs and ILCMs [1-5] in Fig. 7.1.6, this work shows improved area efficiency (>4.2x) and FoM (>2dB). The FTL's power (0.3mW) is the lowest reported. The experimental setup and die photo are shown in Fig. 7.1.7.

### Acknowledgements:

The authors thank Macau Science and Technology Development Fund (FDCT) - SKL Fund and University of Macau - MYRG2017-00185-AMSV for financial support.

### References:

- [1] S. Choi, et al., "A 185fs<sub>rms</sub>-Integrated-Jitter and -245dB FOM PVT-Robust Ring-VCO-Based Injection-Locked Clock Multiplier with a Continuous Frequency-Tracking Loop Using a Replica-Delay Cell and a Dual-Edge Phase Detector," *ISSCC*, pp. 194-195, 2016.
- [2] H. Kim, et al., "A 2.4GHz 1.5mW Digital MDLL Using Pulse-Width Comparator and Double Injection Technique in 28nm CMOS," *ISSCC*, pp. 328-329, 2016.
- [3] S. Kundu, et al., "A 0.2-to-1.45GHz Subsampling Fractional-N All-Digital MDLL with Zero-Offset Aperture PD-Based Spur Cancellation and In-Situ Timing Mismatch Detection," *ISSCC*, pp. 326-327, 2016.
- [4] D. Coombs, et al., "A 2.5-to-5.75GHz 5mW 0.3ps<sub>rms</sub>-Jitter Cascaded Ring-Based Digital Injection-Locked Clock Multiplier in 65nm CMOS," *ISSCC*, pp. 152-153, 2017.
- [5] H. Ngo, et al., "A 0.42ps-Jitter -241.7dB-FOM Synthesizable Injection-Locked PLL with Noise-Isolation LDO," *ISSCC*, pp. 150-151, 2017.



Figure 7.1.1: Proposed all-digital MDLL with a block-sharing FTL for better area and power efficiency. Edge re-extraction equalizes the periods  $T_0$  and  $T_1$  of OUT, being immune to the timing offset in the DFF ( $T_{DFF\_os}$ ) and BBPD ( $T_{BBPD\_os}$ ).



Figure 7.1.2: Timing diagram of the FTL with block sharing in each calibration step. Reusing the same DFF and BBPD logic not only cancels the timing offset and path mismatch, but also saves area and power. FCW=4.



Figure 7.1.3: Upper:  $F_{OUT}$  tuning for seamless decoding and coarse-fine DAC block to optimize area and linearity; Lower: ring-VCO with direct REF-INJ to avoid the divider time constraint and SEL size effect to OUT.



Figure 7.1.4: Measured phase noise: free-running ring-VCO, MDLL with and without the FTL. REF=200MHz and output at 3.0GHz.



Figure 7.1.5: Measured RMS-jitter variation: <9% against supply voltage and temperature, and <5% versus frequency. REF=200MHz and output at 3.0GHz.

|                                            | This Work                          | [5] ISSCC'17                            | [4] ISSCC'17                     | [3] ISSCC'16            | [2] ISSCC'16           | [1] ISSCC'16          |
|--------------------------------------------|------------------------------------|-----------------------------------------|----------------------------------|-------------------------|------------------------|-----------------------|
| Technology (nm)                            | 28                                 | 65                                      | 65                               | 65                      | 28                     | 65                    |
| Architecture                               | Ring-MDLL                          | Ring-ILCM                               | Ring-ILCM                        | Ring-MDLL               | Ring-MDLL              | Ring-ILCM             |
| Calibration Method                         | Edge Re-extraction & Block-Sharing | Synthesizable Symmetric PD Cancellation | Pulse Gating & 2x REF            | Error-Rate Calculation  | PWC & Double Injection | Replica Delay Cell    |
| Supply Voltage $V_{DD}$ (V)                | 0.8                                | 1.2                                     | 0.9-1.1                          | 1.2                     | N/A                    | 1.1                   |
| Freq. Range (GHz)                          | 1.55 to 3.35<br>(73.5%)            | 0.52 to 1.15<br>(75.4%)                 | 2.5 to 5.75<sup>3</sup><br>(78.8%) | 0.2 to 1.45<br>(151.5%) | 2.4                    | 0.96 to 1.44<br>(40%) |
| Output Freq. (GHz)                         | 3.0                                | 0.9                                     | 5.0                              | 1.4                     | 2.4                    | 1.2                   |
| Ref Freq. (MHz)                            | 200                                | 150                                     | 125                              | 87.5                    | 75                     | 120                   |
| Multiply Ratio                             | 15                                 | 6                                       | 40                               | 16                      | 32                     | 10                    |
| Ref. Spur (dBc)                            | -44                                | N/A                                     | -45                              | -45                     | -51.4                  | -53                   |
| Output Integrated Jitter (ps<sub>rms</sub>) | 0.292                              | 0.42                                    | 0.34                             | 2.8                     | 0.7                    | 0.185                 |
| Total Power (mW)                           | 1.45                               | 3.8                                     | 5.3                              | 8                       | 1.51                   | 9.5                   |
| FTL Power (mW)                             | 0.3                                | 2.9                                     | 2.0                              | 3.5                     | 0.43                   | 4.75                  |
| FoM<sup>1</sup> (dB)                       | -249.1                             | -241.7                                  | -242.4                           | -225                    | -241.3                 | -244.9                |
| FoM<sub>r</sub><sup>2</sup> (dB)           | -249.1                             | -242.9                                  | -244.4                           | -228.6                  | -245.6                 | -247.1                |
| Active Area (mm <sup>2</sup> )             | 0.0056                             | 0.062                                   | 0.09                             | 0.054                   | 0.024                  | 0.06                  |

1:  $FoM = 10\log \left[ \left( \frac{\sigma_{rms}}{1\,\text{s}} \right)^2 \cdot \frac{\text{Power}}{1\,\text{mW}} \right]$

2:  $FoM_r = 10\log \left[ \left( \frac{\sigma_{rms}}{1\,\text{s}} \right)^2 \cdot \frac{\text{Power}}{1\,\text{mW}} \cdot \frac{F_{REF}}{200\,\text{MHz}} \right]$ , normalized to  $F_{REF} = 200\,\text{MHz}$

3:  $V_{DD}$  tuning included.

Figure 7.1.6: Comparison with the recent integer-N MDLLs and ILCMs.
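The two figures of merit in the table footnotes can be recomputed from the jitter, power and reference-frequency rows. The sketch below does this for three columns of Fig. 7.1.6; the function and variable names are illustrative, and the numeric inputs are copied from the table.

```python
import math

def fom_db(jitter_s, power_mw):
    """FoM = 10*log10[(sigma_rms / 1 s)^2 * (Power / 1 mW)]."""
    return 10 * math.log10(jitter_s ** 2 * power_mw)

def fom_r_db(jitter_s, power_mw, f_ref_hz, f_norm_hz=200e6):
    """FoM_r: FoM with the reference frequency normalized to 200 MHz."""
    return fom_db(jitter_s, power_mw) + 10 * math.log10(f_ref_hz / f_norm_hz)

# (integrated jitter in s, total power in mW, reference frequency in Hz)
entries = {
    "This Work": (0.292e-12, 1.45, 200e6),
    "[2]": (0.7e-12, 1.51, 75e6),
    "[1]": (0.185e-12, 9.5, 120e6),
}

results = {
    name: (fom_db(j, p), fom_r_db(j, p, f))
    for name, (j, p, f) in entries.items()
}

for name, (fom, fom_r) in results.items():
    print(f"{name}: FoM = {fom:.1f} dB, FoM_r = {fom_r:.1f} dB")
```

The recomputed values agree with the tabulated FoM and FoM<sub>r</sub> entries for these columns to within rounding. Because this work already uses a 200MHz reference, its two figures of merit coincide.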



Figure 7.1.7: Experimental setup, chip photo and breakdown of total area.

## 7.2 A 0.02mm<sup>2</sup> Fully Synthesizable Period-Jitter Sensor Using Stochastic TDC Without Reference Clock and Calibration in 10nm CMOS Technology

Kangyeop Choo, Hyunik Kim, Wooseok Kim, Jihyun Kim, Taeik Kim, Hyungjung Ko

Samsung Electronics, Hwaseong, Korea

With the increasing clock speed and complexity of SoCs, measuring clock jitter becomes challenging. To effectively manage the tight jitter performance required by an SoC, the clock quality should be directly evaluated at every point where the clock is used in the SoC. In previous work [1-3], on-chip clock measurements were carried out, but these circuits are difficult to integrate into a variety of systems because they require low-noise analog circuitry, a clean reference clock and additional calibration. We propose a fully synthesizable period jitter sensor (PJS) that requires no analog circuitry, reference clock or calibration, by improving upon a stochastic time-to-digital converter (TDC) [4]. The inverter delay chain of the stochastic TDC in [4] has a weakness: it may lose delayed signals due to its lengthy delay chain. It is replaced with a short-pulse delay chain, which does not lose delayed signals under any process, voltage and temperature (PVT) conditions. In addition, the proposed PJS is specified in RTL code and is fully synthesizable. It achieves 0.5ps<sub>pp</sub> resolution for real-time period jitter measurements when its input clock frequency is 1GHz.

Figure 7.2.1 shows the block diagram and timing diagram of the synthesizable PJS, which is composed of a CLOCK\_GEN, 4096 UNITS and a 4096-to-12 ADDER. The CLOCK\_GEN receives the input clock (FIN) and generates DIV2, RST and WIN. DIV2 is the divided-by-2 clock of FIN. RST is the reset pulse for the DFFs in the UNITS. WIN has a pulse width ( $\Delta t$ ) equal to one period of FIN. The period jitter of FIN can be measured by repeatedly quantizing  $\Delta t$ . First, DIV2 is input to the short-pulse delay chain, which generates 4096 delayed short pulses denoted by CK[0:4095]. The short-pulse chain is realized by connecting each DFF's Q port to its R port. Because of PVT variation and mismatches within the short-pulse delay chain, the delays of CK[0:4095] are all different and uncertain. However, their frequencies are always equal to the frequency of DIV2. Therefore, each of CK[0:4095] has exactly one rising edge during any time interval equal to the period of DIV2, denoted by  $P_{DIV2}$ , and the total number of rising edges of CK[0:4095] during  $P_{DIV2}$  is always 4096. The following 4096 DFFs check whether each of CK[0:4095] has a rising edge during  $\Delta t$ , and generate 4096 one-bit results (Q[0:4095]). Finally, the 4096-to-12 ADDER adds up all of Q[0:4095] and generates a 12b output signal (OUT[11:0]), which carries the digitally quantized value of  $\Delta t$  according to the proportional expression (1). Since this proportional expression holds regardless of PVT variations, the proposed architecture does not require any extra calibration or additional blocks that would increase power and area.

$$P_{DIV2} : 4096 = \Delta t : OUT[11:0] \quad (1)$$
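Expression (1) can be rearranged into a code-to-time conversion, $\Delta t = OUT \times P_{DIV2} / 4096$, so one output code corresponds to $P_{DIV2}/4096$. The sketch below evaluates this for a 1GHz input clock; the helper names are illustrative and an ideal, jitter-free DIV2 is assumed.

```python
def p_div2_s(f_in_hz):
    """Period of DIV2, the divide-by-2 clock of FIN, in seconds."""
    return 2.0 / f_in_hz

def resolution_s(f_in_hz, n_units=4096):
    """Time represented by one output code: P_DIV2 / n_units."""
    return p_div2_s(f_in_hz) / n_units

def out_to_delta_t_s(out_code, f_in_hz, n_units=4096):
    """Rearranged expression (1): delta_t = OUT * P_DIV2 / n_units."""
    return out_code * resolution_s(f_in_hz, n_units)

f_in = 1e9  # 1 GHz input clock
lsb_ps = resolution_s(f_in) * 1e12
# Code expected for a nominal WIN width of one FIN period (1 ns)
nominal_code = round(1e-9 / resolution_s(f_in))

print(f"one code ~ {lsb_ps:.3f} ps, nominal OUT ~ {nominal_code}")
```

With these numbers one code is about 0.49ps, consistent with the 0.5ps<sub>pp</sub> resolution stated for a 1GHz input, and a jitter-free period maps to mid-scale (2048).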

Figure 7.2.2 describes how the proposed PJS generates the reference time interval from the moving average of DIV2's period, so that  $\Delta t$  is compared with the moving average. Under typical PVT conditions, the total delay of the 4096-stage short-pulse delay chain is about 800ns. When the frequency of FIN is 1GHz and  $P_{DIV2}$  is 2ns, the rising edges of CK[0:4095] during  $P_{DIV2}$  are generated from the previous 400 rising edges of DIV2. For example, the rising edge of the final short pulse, CK[4095], during  $P_{DIV2}$  is generated from the rising edge of DIV2 that occurred 800ns earlier. Due to the jitter of FIN and DIV2, each rising edge of DIV2 generates a different number of rising edges of CK[0:4095] during  $P_{DIV2}$ , as shown in the proportional expression on the right side of the figure. However, the ratio between Avg(P<sub>DIV2</sub>[k:k-399]) and the total number of rising edges converges to the proportional expression (2). As the total delay of the short-pulse chain increases, the moving average converges to the ideal period of DIV2 and a more accurate jitter measurement becomes available.

$$Avg(P_{DIV2}[k:k-399]) : 4096 = \Delta t : OUT[11:0] \quad (2)$$
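The averaging span and the edge-counting behavior can be illustrated with a toy Monte Carlo model. In the sketch below, the Gaussian per-stage delay spread (10%), the fixed seed, the window starting at phase zero and the function name are all assumptions made for illustration; DIV2 is taken as jitter-free, so every CK[i] shows one rising edge per period at its cumulative delay modulo $P_{DIV2}$.

```python
import random

def simulate_out(n_units=4096, total_delay_s=800e-9, p_div2_s=2e-9,
                 delta_t_s=1e-9, sigma_frac=0.1, seed=0):
    """Count how many CK[i] rising edges fall inside a window of width delta_t."""
    rng = random.Random(seed)
    mean_stage = total_delay_s / n_units
    t = 0.0
    count = 0
    for _ in range(n_units):
        # Each stage adds an uncertain delay (mismatch modeled as Gaussian).
        t += max(0.0, rng.gauss(mean_stage, sigma_frac * mean_stage))
        # DIV2 is periodic, so CK[i] rises once per period at phase t mod P_DIV2.
        if (t % p_div2_s) < delta_t_s:
            count += 1
    return count

# Number of earlier DIV2 edges that contribute to one measurement
edges_spanned = round(800e-9 / 2e-9)

out = simulate_out()
expected = 4096 * 1e-9 / 2e-9  # from expression (2) with Avg(P_DIV2) = 2 ns

print(f"edges spanned: {edges_spanned}, simulated OUT: {out}, expected ~ {expected:.0f}")
```

Even though the individual delays are random, the count stays close to $4096 \times \Delta t / P_{DIV2}$, which is the property that lets the sensor work without calibration.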

Figure 7.2.3 briefly summarizes the fully synthesizable RTL code of the PJS, which has the same hierarchy as the block diagram of Figure 7.2.1. The top module has 3 sub-modules: CLOCK\_GEN, UNIT and 4096-to-12 ADDER. The functions of the CLOCK\_GEN module and the 4096-to-12 ADDER module are realized with a few lines of RTL code. The UNIT module is composed of only 4 standard cells. When the names of the standard cells in the RTL code are matched with those in the Process Development Kit, this RTL code becomes fully synthesizable. This offers convenient process portability and scalability. Three PJSs with different sizes (JS08, JS10 and JS12) are designed in 10nm CMOS technology by editing a few numbers in the RTL code; they have 256, 1024 and 4096 UNITS, respectively. Figure 7.2.3 also shows Verilog simulation results when a 1GHz input clock is modulated with a 10MHz sine wave with a modulation amplitude of  $\pm 10\%$ . Compared to JS12 and JS10, JS08 has coarse measurement accuracy due to its small number of UNITS. Fine-grained period jitter measurement requires a larger number of UNITS.

Figure 7.2.4 shows a comparison of period jitter histograms, measured with a commercial oscilloscope, DSAV254A, and the three fabricated PJSs (JS08, JS10 and JS12). To verify that the PJS measures input jitter rather than self-generated jitter, 0.5GHz clocks with various types of jitter are generated using a waveform generator, 33250A, and a pulse/pattern generator, 81134A. Without any calibration, the PJS generates a 12b digital signal, which is collected by a logic analyzer, 16832A. The histograms of 2<sup>19</sup> sampled digital outputs show that the more UNITS a PJS has, the more closely its histogram resembles the one measured with the commercial oscilloscope. The calculated root-mean-square (RMS) and peak-to-peak (P2P) period jitter from JS12 are 3.85% and 33.7% of the input clock period, respectively, and they are nearly identical to the values from the oscilloscope.
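One way the RMS and P2P figures could be derived from the collected output codes is sketched below. It relies on expression (1), $\Delta t / P_{DIV2} = OUT / 4096$, together with the assumption that the nominal FIN period is $P_{DIV2}/2$; the sample codes and the function name are hypothetical and are not measured data.

```python
import statistics

def period_jitter_percent(out_codes, n_units=4096):
    """Return (RMS, P2P) period jitter as a percentage of the nominal FIN period."""
    # delta_t / P_DIV2 = OUT / n_units and P_FIN = P_DIV2 / 2,
    # so delta_t / P_FIN = 2 * OUT / n_units.
    periods = [2.0 * code / n_units for code in out_codes]
    rms = statistics.pstdev(periods) * 100.0
    p2p = (max(periods) - min(periods)) * 100.0
    return rms, p2p

# Hypothetical output codes, for illustration only
samples = [2048, 2040, 2056, 2048, 2032, 2064, 2048, 2048]
rms_pct, p2p_pct = period_jitter_percent(samples)

print(f"RMS = {rms_pct:.3f}% of period, P2P = {p2p_pct:.4f}% of period")
```

Expressing both quantities as a fraction of the input period matches the way the JS12 results are reported in the paragraph above.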

The jitter measurement performance of the three PJSs is compared with that of the commercial oscilloscope in Fig. 7.2.5. The frequency of the input clock is fixed at 1GHz and the amount of its Gaussian jitter is controlled using a waveform generator and a pulse/pattern generator. The RMS and P2P period jitter from JS08 show only rough correlation because of its low number of UNITS. In contrast, the period jitter measured by JS12 shows accuracy comparable to that of the commercial oscilloscope. Measurements across 32 chips show that JS12 also has smaller chip-to-chip variation than the smaller PJSs.

The comparison table in Fig. 7.2.6 indicates that the proposed architecture has structural merits compared to previously published work. It does not require a reference clock for measuring the input clock jitter and does not require any calibration to compensate for PVT variation. As it is fully synthesizable, it can even be integrated inside a logic block. The key performance metrics of the three PJSs are also summarized in the comparison table. JS08, JS10 and JS12 have measurement resolutions of 8ps<sub>pp</sub>, 2ps<sub>pp</sub> and 0.5ps<sub>pp</sub>, respectively. Their power consumption and chip area increase with the number of UNITS. They consume 1.5mW, 6mW and 24mW, respectively, with a 0.75V power supply and occupy 0.0012mm<sup>2</sup>, 0.005mm<sup>2</sup> and 0.02mm<sup>2</sup>, respectively, as shown in Fig. 7.2.7.
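The scaling with the number of UNITS can be made concrete by normalizing the reported figures per UNIT. The sketch below uses the power, area and UNIT counts quoted above, and estimates the resolution as $P_{DIV2}$ divided by the UNIT count for a 1GHz input; the dictionary layout and names are illustrative.

```python
P_DIV2_PS = 2000.0  # DIV2 period in ps for a 1 GHz input clock

designs = {
    "JS08": {"units": 256, "power_mw": 1.5, "area_mm2": 0.0012},
    "JS10": {"units": 1024, "power_mw": 6.0, "area_mm2": 0.005},
    "JS12": {"units": 4096, "power_mw": 24.0, "area_mm2": 0.02},
}

def per_unit_metrics(design):
    """Resolution estimate plus power and area normalized to one UNIT."""
    n = design["units"]
    return {
        "resolution_ps": P_DIV2_PS / n,
        "power_uw_per_unit": design["power_mw"] * 1e3 / n,
        "area_um2_per_unit": design["area_mm2"] * 1e6 / n,
    }

summary = {name: per_unit_metrics(d) for name, d in designs.items()}

for name, m in summary.items():
    print(f"{name}: {m['resolution_ps']:.2f} ps/code, "
          f"{m['power_uw_per_unit']:.2f} uW/UNIT, "
          f"{m['area_um2_per_unit']:.2f} um^2/UNIT")
```

The per-UNIT power and area come out nearly constant across the three sizes, so each 4x increase in UNITS buys a 4x finer resolution at roughly 4x the power and area.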

### References:

- [1] K. Niitsu, et al., "CMOS Circuits to Measure Timing Jitter Using a Self-Referenced Clock and a Cascaded Time Difference Amplifier with Duty-Cycle Compensation," *IEEE JSSC*, vol. 47, no. 11, pp. 2701-2710, 2012.
- [2] B. Dehlaghi, et al., "A 12.5-Gb/s On-Chip Oscilloscope to Measure Eye Diagrams and Jitter Histograms of High-Speed Signals," *IEEE TVLSI*, vol. 22, no. 5, pp. 1127-1137, 2014.
- [3] J. Liang, et al., "On-Chip Measurement of Clock and Data Jitter with Sub-Picosecond Accuracy for 10 Gb/s Multilane CDRs," *IEEE JSSC*, vol. 50, no. 4, pp. 845-855, 2015.
- [4] S. Kim, et al., "A 0.6V 1.17ps PVT-Tolerant and Synthesizable Time-To-Digital Converter Using Stochastic Phase Interpolation with 16x Spatial Redundancy in 14nm FinFET Technology," *ISSCC*, pp. 280-281, 2015.



**Figure 7.2.1: Block diagram and timing diagram of period jitter sensor.**



**Figure 7.2.2: Generating a reference from the moving average of DIV2's period.**

```

module JS12 ( FIN, OUT );
    CLOCK_GEN U1 ( FIN, DIV2, RST, WIN );
    // UNIT[0] is clocked by DIV2; UNIT[i] is clocked by CK[i-1] of the previous UNIT
    UNIT U2 [4095:0] ( Q[4095:0], {CK[4094:0], DIV2}, CK[4095:0], RST, WIN );
    ADDER U3 ( Q[4095:0], OUT[11:0] );
endmodule

module CLOCK_GEN ( FIN, DIV2, RST, WIN ); // Generate DIV2, RST and WIN from FIN
    ...
endmodule

module UNIT ( Q, CK0, CK1, RST, WIN ); // Generate short pulse & check rising edge
    OR2 U1 ( .A(WIN), .B(Q), .Y(WIN0) );              // hold Q once it has been set
    DFF U2 ( .D(WIN0), .CK(CK00), .R(RST), .Q(Q) );   // capture WIN on the pulse edge, cleared by RST
    DFF U3 ( .D(HI), .CK(CK0), .R(CK00), .Q(CK00) );  // Q tied to R produces a short pulse
    INV U4 ( .A(CK00), .Y(CK1) );                     // pass the pulse on to the next UNIT
endmodule

module ADDER ( Q, OUT ); // 4096-to-12 ADDER
    assign OUT[11:0] = Q[0]+Q[1]+Q[2]+...+Q[4095];
endmodule

```



**Figure 7.2.3: Synthesizable RTL code and its Verilog simulation.**



**Figure 7.2.4: Period jitter histograms measured from commercial oscilloscope and three PJSs when  $F_{FIN}=0.5\text{GHz}$ .**

**Figure 7.2.5: Measured period jitter comparison between commercial oscilloscope and three PJSs and mass measurement result when  $F_{FIN}=1.0\text{GHz}$ .**

|                         | JSSC12 [1]            | TVLSI14 [2] | JSSC15 [3]           | This Work (JS08)           | This Work (JS10)           | This Work (JS12)           |
|-------------------------|-----------------------|-------------|----------------------|----------------------------|----------------------------|----------------------------|
| Process                 | 65nm                  | 65nm        | 65nm                 | 10nm                       | 10nm                       | 10nm                       |
| Ref. Clock              | Not Necessary         | Necessary   | Not Necessary        | **Not Necessary**          | **Not Necessary**          | **Not Necessary**          |
| Type                    | All-digital           | Analog      | Analog               | **Synthesizable RTL Code** | **Synthesizable RTL Code** | **Synthesizable RTL Code** |
| Calibration             | Necessary             | Necessary   | Necessary            | **Not Necessary**          | **Not Necessary**          | **Not Necessary**          |
| Output type             | Histogram             | Histogram   | Histogram            | **Histogram & Realtime**   | **Histogram & Realtime**   | **Histogram & Realtime**   |
| Power                   | -                     | 1.9mW       | 132.8mW              | 1.5mW                      | 6mW                        | 24mW                       |
| Resolution              | 0.03ps<sub>rms</sub>  | -           | 1.0ps<sub>rms</sub>  | 8.0ps<sub>pp</sub>         | 2.0ps<sub>pp</sub>         | 0.5ps<sub>pp</sub>         |
| Input Frequency         | 0.82GHz               | 10GHz       | 5GHz                 | 1GHz                       | 1GHz                       | 1GHz                       |
| Area (mm<sup>2</sup>)   | 0.0013                | 0.0024      | 0.32                 | 0.0012                     | 0.005                      | 0.02                       |

**Figure 7.2.6: Comparison table.**