

# Session 6 Overview: *Advanced Wireline Links and Techniques*

## WIRELINE SUBCOMMITTEE



**Session Chair:** Friedel Gerfers  
TU Berlin, Berlin, Germany



**Session Co-Chair:** Takashi Takemoto  
Hitachi, Sapporo, Japan

Increasing demand for bandwidth in networking and computing drives wireline links to push data-rate limits while improving energy efficiency, bit-error rate, and throughput per millimeter of chip edge and silicon area. The first paper in the session describes the design of a 112Gb/s PAM-4 transceiver system dealing with 43dB channel loss by utilizing a 3-tap FFE and 18-tap DFE. The second paper demonstrates the current state of the art in 112Gb/s ADC-DSP-based transceivers, achieving transmission over a 48dB-loss channel while consuming only 4.63pJ/b. The following paper shows how an on-chip coplanar waveguide is used to realize a 5-tap FFE receiver for 200Gb/s transmission, compensating 17.2dB channel loss with only 0.43pJ/b. Paper 6.4 demonstrates a 32Gb/s die-to-die chiplet transceiver, achieving 8Tb/s/mm beach-front bandwidth while consuming 0.44pJ/b. The next two papers showcase advanced clock-and-data recovery (CDR) techniques: one applies a network of autocorrelators for flash frequency acquisition, and the other achieves low-power operation by adopting a pattern-based PD with simple baud-rate operation for short links. The last two short papers describe advanced techniques to implement >100Gb/s PAM-4 and PAM-8 transmitters in 28nm and 40nm processes.

1:30 PM



### 6.1 A 112Gb/s Serial Link Transceiver With 3-tap FFE and 18-tap DFE Receiver for up to 43dB Insertion Loss Channel in 7nm FinFET Technology

Bo Zhang, Broadcom, Irvine, CA

In Paper 6.1, Broadcom demonstrates a flexible transceiver for a wide range of data rates up to 112Gb/s. The RX features a 3-tap FFE and an 18-tap DFE equalizer, while the TX applies a 7b DAC driver with a 6-tap FFE, achieving an exceptional RLM of 0.999. The PLL has a tuning range of 40-to-60GHz with only 0.12ps<sub>rms</sub> jitter measured at 56GHz. The transceiver is able to compensate up to 43.9dB channel loss at 112Gb/s. The 7nm CMOS TX/RX consumes 690mW and occupies a silicon area of 0.63mm<sup>2</sup> per TX/RX.

2:00 PM



### 6.2 A 4.63pJ/b 112Gb/s DSP-Based PAM-4 Transceiver for a Large-Scale Switch in 5nm FinFET

Henry Park, MediaTek, Irvine, CA

In Paper 6.2, MediaTek presents a high-performance DSP-based transceiver achieving 48dB and 49dB loss compensation at 112.5Gb/s and 106.25Gb/s data rates, respectively. The system has two PLLs per lane for fully independent TX and RX data rate programming. Robustness bottlenecks arising from PVT corner variation and EM coupling are handled by an adaptive biasing technique and a wideband PLL.

2:30 PM

### 6.3 A 0.43pJ/b 200Gb/s 5-Tap Delay-Line-Based Receiver FFE with Low-Frequency Equalization in 28nm CMOS

Bingyi Ye, Peking University, Beijing, China

In Paper 6.3, Peking University describes a 200Gb/s 5-tap delay-line-based receiver FFE in 28nm CMOS. The FFE employs on-chip grounded coplanar waveguides as delay lines and a one-stage topology for higher bandwidth and lower power. The H0 and H1 taps are implemented using distributed amplifiers to alleviate reflection. The RX FFE achieves a power efficiency of 0.43pJ/b and compensates for a 17.2dB-loss channel.




3:15 PM

### 6.4 A 4nm 32Gb/s 8Tb/s/mm Die-to-Die Chiplet Using NRZ Single-Ended Transceiver With Equalization Schemes And Training Techniques

Kihwan Seong, Samsung Electronics, Hwasung, Korea

In Paper 6.4, Samsung Electronics demonstrates a 32Gb/s per-lane die-to-die chiplet transceiver in 4nm CMOS. The TX features a reflection cancellation driver for suppressing the effects of impedance mismatch in the interposer. The RX has a 1-tap direct DFE with timing relaxation scheme for compensating ISI. The transceiver achieves a power efficiency of 0.44pJ/b and beach-front bandwidth density of 8Tb/s/mm.



3:45 PM

### 6.5 A 37.8dB Channel Loss 0.6μs Lock Time CDR with Flash Frequency Acquisition in 5nm FinFET

Chien-Kai Kao, MediaTek, Hsinchu, Taiwan

In Paper 6.5, MediaTek addresses the difficult problem of quickly locking a CDR to input data with significant frequency offset under 37.8dB channel loss. The proposed CDR uses a network of autocorrelators for flash frequency acquisition, adopting open-loop frequency acquisition to avoid noise accumulation. The CDR achieves 0.6μs lock time independent of frequency offset under 37.8dB channel loss.



4:15 PM

### 6.6 A 0.83pJ/b 52Gb/s PAM-4 Baud-Rate CDR with Pattern-Based Phase Detector for Short-Reach Applications

Seungwoo Park, Korea University, Seoul, Korea

In Paper 6.6, Korea University presents a 52Gb/s PAM-4 baud-rate CDR in 28nm CMOS for short-reach applications. To attain a power-efficient RX by reducing the number of comparators through baud-rate sampling, the proposed CDR employs a pattern-based phase detector and shares a single path between clock recovery and data recovery. The CDR achieves a power efficiency of 0.83pJ/b under 7.1dB channel loss while occupying only 0.011mm<sup>2</sup>.



4:45 PM

### 6.7 A 128Gb/s PAM-4 Transmitter with Programmable-Width Pulse Generator and Pattern-Dependent Pre-Emphasis in 28nm CMOS

Kai Sheng, Peking University, Beijing, China

In Paper 6.7, Peking University describes a 128Gb/s PAM-4 transmitter that mitigates the signal degradation caused by transitions between non-adjacent levels and by data-dependent jitter. The transmitter has programmable-width pulse generation for edge optimization and pattern-dependent pre-emphasis for enlarging the eye opening. The 28nm CMOS transmitter IC achieves a power efficiency of 1.4pJ/b and a silicon area of 0.137mm<sup>2</sup>.



5:00 PM

### 6.8 A 100Gb/s 1.6V<sub>ppd</sub> PAM-8 Transmitter with High-Swing 3+1 Hybrid FFE Taps in 40nm

Jeonghyu Yang, Hanyang University, Seoul, Korea

In Paper 6.8, Hanyang University demonstrates a 100Gb/s PAM-8 high-swing transmitter with 3-tap FFE in 40nm CMOS. The proposed transmitter adopts a current-mode driver with current bleeders and high-voltage protection cascode to obtain a 1.6V<sub>ppd</sub> output swing. The transmitter achieves a power efficiency of 3.35pJ/b with sufficient eye opening under 9.4dB channel loss and compact silicon area of 0.36mm<sup>2</sup>.



## 6.1 A 112Gb/s Serial Link Transceiver With 3-tap FFE and 18-tap DFE Receiver for up to 43dB Insertion Loss Channel in 7nm FinFET Technology

Bo Zhang<sup>1</sup>, Anand Vasani<sup>1</sup>, Ashutosh Sinha<sup>2</sup>, Alireza Nilchi<sup>1</sup>, Haitao Tong<sup>1</sup>, Lakshmi Rao<sup>1</sup>, Karapet Khanoyan<sup>1</sup>, Hamid Hatamkhani<sup>1</sup>, Xiaochen Yang<sup>1</sup>, Xin Meng<sup>1</sup>, Alexander Wong<sup>2</sup>, Jun Kim<sup>2</sup>, Ping Jing<sup>2</sup>, Yehui Sun<sup>2</sup>, Ali Nazemi<sup>1</sup>, Dean Liu<sup>2</sup>, Anthony Brewster<sup>1</sup>, Jun Cao<sup>1</sup>, Afshin Momtaz<sup>1</sup>

<sup>1</sup>Broadcom, Irvine, CA

<sup>2</sup>Broadcom, San Jose, CA

Social media, video streaming and working at home fuel the demand for bandwidth in metro networks and data centers and push serial link data rates into 100Gb/s territory. To cope with severe channel impairments at 112Gb/s PAM-4 with >30dB loss at Nyquist, recently published transceivers [1-4] adopted ADC/DSP-based receivers in advanced FinFET processes. This work presents a low-power and area-efficient non-ADC/DSP-based transceiver that employs a fully adaptive 3-tap FFE/18-tap DFE at the receiver (RX) and a 7b DAC-based transmitter (TX), operating up to 112Gb/s in 7nm FinFET technology for large-scale ASIC applications.

The transceiver block diagram is illustrated in Fig. 6.1.1. The RX input signal goes to the 50Ω termination, which utilizes a T-coil and a shunt inductor to extend the bandwidth and includes AC capacitors to enable direct attachment to an external transmitter. The first two stages are continuous-time linear equalizers (CTLE) providing separate peaking controls for Nyquist and long-tail equalization. They are followed by a programmable-gain amplifier (PGA) and a fixed-gain stage, both with T-coils. The 4x-interleaved sample/hold amplifiers (S/H) enable a 3-tap FFE. The outputs drive an 18-tap DFE featuring 1+D PAM-4 Tap1 and NRZ loop-unrolled MUXes to further equalize the channel loss and reflections. Six quarter-rate phase interpolators (PI) provide flexibility for sampling-based baud-rate and non-sampling-based operations, and generate the needed clocks for data, phase and LMS channels. The DMUX sends de-serialized information into the digital side for CTLE/PGA/FFE/DFE adaptation and timing recovery (CDR). Multiple analog calibrations, such as resistor/offset calibrations, clock phase alignment and PI phase calibration, are built in to improve the performance. The TX adopts a half-rate MUX topology and a 7b DAC-based source-series terminated (SST) driver. One common PLL utilizes two LC VCOs to achieve a wide operation range. It distributes half-rate clocks to RX/TX clock generators.

One advantage of fully analog RX equalization (non-ADC/DSP-based) is that more DFE taps can be implemented, boosting the signal without increasing noise and crosstalk. Moreover, because all analog equalization and CDR functions are in a single loop, a more optimal sampling point can be achieved. Finally, the non-ADC/DSP-based RX provides a higher CDR bandwidth because the latencies of the ADC and DSP are eliminated. This CDR bandwidth advantage is critical for applications that require higher jitter tolerance. The addition of the FFE not only provides the pre-cursor equalization that a CTLE/DFE cannot offer, but also enhances equalization stability over voltage and temperature variation. With interleaved S/H amplifiers after the PGA, the FFE can be reasonably implemented, even for pre-cursor taps. In this design, 4x interleaving is adopted as a trade-off between the complexity of DFE routing and FFE availability. As each data sample is 4UI wide, a 3-tap FFE is implemented: one pre-cursor and one post-cursor with the main tap. Each interleave block uses the same clock and is placed together for ease of clock routing (Fig. 6.1.2). As a result, the FFE summer inputs for each interleave come from different S/H outputs. For example, S/H output "A" from the falling edge of "clkd" is used for the pre-cursor of interleave D, the main tap of interleave A and the post-cursor of interleave B. As shown in the timing diagram of Fig. 6.1.2, the overlapping periods between the three data inputs "D" (post), "A" (main), and "B" (pre) for FFE interleave A, along with the use of the rising edge of "clkA" for the slicers, provide a relaxed 2-UI settling time for the FFE/DFE summers and lead to a lower-power implementation. The FFE pre/post cells, which are implemented using a differential NMOS pair with an NMOS degeneration resistor, are controlled by tap-weight DAC bits and subtract from the main-tap differential stage shown in Fig. 6.1.2. The FFE tap ranges are set to [-0.25, 0] for pre and [-0.25, 0.25] for post. The S/H function can be bypassed by stopping its clock (Fig. 6.1.2).
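The sample routing and tap ranges described above can be illustrated with a short behavioral sketch. It assumes ideal held samples and models the tap weights as signed coefficients added to a unity main tap; the names `slice_sources` and `ffe_3tap` and the example pulse are hypothetical and are not taken from the paper.

```python
SLICES = "ABCD"  # 4x interleaving: one S/H output per slice (labels as in the text)

def slice_sources(k, n=len(SLICES)):
    """Return the (post, main, pre) S/H outputs feeding the FFE summer of slice k."""
    return SLICES[(k - 1) % n], SLICES[k], SLICES[(k + 1) % n]

def ffe_3tap(samples, c_pre, c_post):
    """Behavioral 3-tap FFE: y[i] = x[i] + c_pre*x[i+1] + c_post*x[i-1].

    The main tap is fixed at 1. The weight limits follow the ranges quoted
    in the text; treating them as additive coefficients is an assumption.
    """
    if not -0.25 <= c_pre <= 0.0:
        raise ValueError("pre-cursor weight outside [-0.25, 0]")
    if not -0.25 <= c_post <= 0.25:
        raise ValueError("post-cursor weight outside [-0.25, 0.25]")
    return [
        samples[i] + c_pre * samples[i + 1] + c_post * samples[i - 1]
        for i in range(1, len(samples) - 1)
    ]

# Hypothetical pulse response with one pre-cursor and one post-cursor of 0.2.
pulse = [0.0, 0.2, 1.0, 0.2, 0.0]
equalized = ffe_3tap(pulse, c_pre=-0.2, c_post=-0.2)
```

In this toy case the two neighbouring cursors are cancelled at the cost of a slightly smaller main cursor, and `slice_sources(0)` reproduces the "D" (post), "A" (main), "B" (pre) assignment quoted for interleave A.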

As discussed in [3, 6], 1+D pulse-shaping equalization is an approach to handle high-loss channels, e.g. >30dB. 1+D equalization is equivalent to having the main cursor and tap1 equal to 1. It reduces the overall equalization needed by the 1+D transfer function, $1+z^{-1}$. On the other hand, it doesn't work well for low-loss channels, e.g. <25dB. 1+D equalization produces a PAM-7 signal (6 eyes). To decode 1+D, six data slicers, $D[5:0]_n$, with references of [-5, -3, -1, 1, 3, 5] are used to sample each eye. The 1+D decoder converts the six inputs $D[5:0]_n$ to a 3b PAM-4 output $Q[2:0]_n$ using the previous $Q[2:0]_{n-1}$ according to the truth table in Fig. 6.1.3, which is equivalent to $Q[2:0]_n = D[5:0]_n - Q[2:0]_{n-1}$, assuming other ISIs are compensated by the FFE/DFE. This is similar to a loop-unrolled DFE implementation [5, 6] with tap1=1 in PAM-4. A 4:1 MUX controlled by 3 inputs is a key block. The conventional 1+D decoder implementation (Fig. 6.1.3) requires the propagation delay of a MUX and a DFF to be less than 1 UI, e.g. 17.8ps for 56GBaud. To alleviate the timing constraint, a look-ahead approach is utilized that relaxes the timing constraint to a 2-UI period by expanding the recursion one step:

$$Q[2:0]_n = D[5:0]_n - Q[2:0]_{n-1} = D[5:0]_n - (D[5:0]_{n-1} - Q[2:0]_{n-2}) = D[5:0]_n - D[5:0]_{n-1} + Q[2:0]_{n-2}$$

In the proposed implementation, intermediate signals $X[11:0]_n$ are generated through 12 additional MUXes between the current $D[5:0]_n$ and the previous $D[5:0]_{n-1}$, and $Q[2:0]_n$ can be converted from $Q[2:0]_{n-2}$ as shown in Fig. 6.1.3. In the interleave topology, $D[5:0]_{n-1}$ is available from the adjacent interleave of $D[5:0]_n$. Although the PAM-4 output from the decoder is the input to the DFE feedback path of tap 8 and beyond, for earlier taps the decoder latency requires that the PAM-7 signals, $D[5:0]$, be used directly. Furthermore, Tap12-18 serve as floating taps, which can be programmed between 12 and 38 UI to deal with reflections. For low-loss channels, e.g. <30dB, only three data slicers with references [-2, 0, 2] are enabled and direct PAM-4 DFE feedback without tap1 is used.
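The equivalence between the direct recursion and the look-ahead form can be checked numerically. The sketch below is a simplification under stated assumptions: it works on integer symbol levels (PAM-4 levels -3, -1, 1, 3) rather than on the thermometer-coded slicer outputs $D[5:0]$ and the truth table of Fig. 6.1.3, and the function names and initial state are hypothetical.

```python
def encode_1pd(q, q_init=-3):
    """Form the 1+D (PAM-7) sequence d[n] = q[n] + q[n-1] from PAM-4 symbols."""
    prev, d = q_init, []
    for sym in q:
        d.append(sym + prev)
        prev = sym
    return d

def decode_direct(d, q_init=-3):
    """Conventional decoder: q[n] = d[n] - q[n-1] (one-symbol feedback loop)."""
    prev, out = q_init, []
    for x in d:
        prev = x - prev
        out.append(prev)
    return out

def decode_lookahead(d, q_init=-3):
    """Look-ahead decoder: q[n] = d[n] - d[n-1] + q[n-2] (two-symbol feedback loop).

    Assumes q[-1] = q[-2] = q_init, so that d[-1] = 2 * q_init.
    """
    q_hist = [q_init, q_init]  # q[n-2], q[n-1]
    d_prev = 2 * q_init
    out = []
    for x in d:
        q_n = x - d_prev + q_hist[-2]
        q_hist.append(q_n)
        out.append(q_n)
        d_prev = x
    return out

symbols = [3, -1, 1, -3, -3, 3, 1, -1]
pam7 = encode_1pd(symbols)
```

Both decoders recover `symbols` from `pam7`; the look-ahead form only depends on a decision made two symbols earlier, which is what relaxes the loop timing to 2 UI.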

The TX uses a half-rate 40:1 multiplexer followed by a 7b SST DAC driver capable of putting out $1V_{pp}$ from a 0.9V supply (Fig. 6.1.4). Most of the critical circuits in the TX run from the 0.9V supply, with some on 0.75V to save additional power. The half-rate TX architecture reduces the complexity in clocking and results in significant power saving. The clocking includes the ability to track ppm offsets using a phase interpolator and has duty-cycle correction (DCC) circuits. A driver segmentation scheme with a 2b thermometer code for bit[6:5] and a 5b binary code for bit[4:0] has been implemented to mitigate excessive DNL transitions due to MSB segment switching and to improve linearity. The TX FIR taps are implemented in the DSP domain. At the DSP interface, a 40b-wide data/DAC bit stream is sent into the 40:1 mux, which consists of a 40:8 mux followed by an 8:2 mux driving a 2:1 mux. The driver receives full-rate data from the final 2:1 mux, which utilizes both edges of the half-rate clock. At the driver output, the bandwidth is extended using a T-coil. The pre-drivers are actively peaked to further enhance the bandwidth of the TX. The driver is also equipped with a continuous-time high-pass filter implemented as a capacitive path in parallel with the main driver path. This high-pass filter sharpens the TX output edges and improves the bandwidth further. The TX consumes 95mW at 106Gb/s PAM-4 transmission.
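The segmentation scheme can be sketched as a code mapping. This is an illustrative model only: it assumes the 2b thermometer field decodes bit[6:5] into three equal segments of weight 32 and that bit[4:0] drives binary-weighted segments; the function names are hypothetical and the actual driver slice sizing is not given in the text.

```python
def segment_code(code):
    """Split a 7b DAC code into 3 thermometer segments and 5 binary bits."""
    if not 0 <= code <= 127:
        raise ValueError("code must fit in 7 bits")
    msb = code >> 5                                        # bit[6:5], value 0..3
    thermo = [1 if i < msb else 0 for i in range(3)]       # weight 32 each
    binary = [(code >> b) & 1 for b in range(4, -1, -1)]   # weights 16..1
    return thermo, binary

def reconstruct(thermo, binary):
    """Sum the enabled segment weights back into the DAC code."""
    weights = [16, 8, 4, 2, 1]
    return 32 * sum(thermo) + sum(w * b for w, b in zip(weights, binary))

# At the mid-scale transition 63 -> 64, only one extra weight-32 segment turns on;
# a plain binary DAC would instead switch a single weight-64 element.
before = segment_code(63)
after = segment_code(64)
```

With this mapping the largest element that ever switches has weight 32 rather than 64, which is the intuition behind the reduced DNL at MSB transitions.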

The PLL is common to the 8 RX/TX pairs, as shown in Fig. 6.1.5. A fractional-N multi-modulus divider (MMD) with a 3<sup>rd</sup>-order delta-sigma modulator is implemented for flexibility of the reference frequency. Two LC VCOs cover a 41-57GHz operation frequency range, one for 41-50GHz and another covering 50-57GHz. The latch with built-in VCO selection inside the 56GHz divide-by-2 circuit, shown in Fig. 6.1.5, results in power and area savings.

The transceiver is fabricated in 7nm FinFET technology and tested against multiple 100Gb/s and legacy 50G/25G/10G standards. It passes the 2kV HBM ESD test in a 60×60mm<sup>2</sup> package and meets 400GBASE-KR4/CR4/C2M and CEI112G specifications. For a link with insertion loss of 43.9dB at 28GHz for 112Gb/s PAM-4 (Fig. 6.1.6), the transceiver achieves raw BER better than 1E-5 with frequency offset. The JTOL of 400GBASE-KR4 shows a CDR bandwidth of more than 16MHz, and this can be doubled if needed thanks to its low latency. The 16MHz bandwidth is 2x higher than similar published solutions (Fig. 6.1.6). The measured TX 112Gb/s eye diagram in Fig. 6.1.6 shows an RLM of 0.999, J4U of 119mUI, $J_{rms}$ of 15.6mUI, EOJ of 7.5mUI, and exhibits an SNDR better than 37dB. The PLL has a lock range from 40 to 60GHz and the two VCOs have 5GHz overlap for robust operation, with measured rms jitter of 0.12ps and 0.15ps for integer and fractional-N modes, respectively. The transceiver consumes 600mW for analog and 690mW overall at 0.9V/0.75V supply per RX/TX, which is among the lowest compared to published ADC/DSP-based solutions with similar data rates and processes [1, 3], as shown in Fig. 6.1.6. The die area is 0.47mm<sup>2</sup> for analog and 0.63mm<sup>2</sup> overall per RX/TX, and a micrograph of 8 transceivers with a PLL is shown in Fig. 6.1.7.

### References:

- [1] J. Im et al., "A 112Gb/s PAM-4 Long-Reach Wireline Transceiver Using a 36-Way Time-Interleaved SAR-ADC and Inverter-Based RX Analog Front-End in 7nm FinFET", ISSCC, pp. 116-117, Feb. 2020.
- [2] T. Ali et al., "A 460mW 112Gb/s DSP-Based Transceiver with 38dB Loss Compensation for Next-Generation Data Centers in 7nm FinFET Technology", ISSCC, pp. 118-119, Feb. 2020.
- [3] M. LaCroix et al., "A 116Gb/s DSP-Based Wireline Transceiver in 7nm CMOS Achieving 6pJ/b at 45dB Loss in PAM-4/Duo-PAM-4 and 52dB in PAM-2", ISSCC, pp. 132-133, Feb. 2021.
- [4] Z. Guo et al., "A 112.5Gb/s ADC-DSP-Based PAM-4 Long-Reach Transceiver with >50dB Channel Loss in 5nm FinFET", ISSCC, pp. 116-117, Feb. 2022.
- [5] S. Kasturia et al., "Techniques for High-speed Implementation of Nonlinear Cancellation", IEEE Journal on Selected Areas in Communications, vol. 9, no. 5, June 1991.
- [6] N. Kocaman et al., "An 182mW 1-60Gbps Configurable PAM4/NRZ Transceiver for Large Scale ASIC Integration in 7-nm FinFET Technology", ISSCC, pp. 120-121, Feb. 2022.



Figure 6.1.1: Block diagram of the 112Gb/s transceiver.



Figure 6.1.2: 3-tap FFE block diagram with FFE schematic and S/H, summing, slicer clock timing diagram.



Figure 6.1.3: 1+D decoder truth table, 4:1 MUX by 3 selection, conventional 1+D decoder and proposed 1+D decoder with look-ahead MUXs to relax timing constraints.



Figure 6.1.4: TX block diagram features a 7b DAC SST driver.



Figure 6.1.5: PLL block diagram with two LC VCOs and schematic of a built-in MUX latch for 56GHz divider 2.



Figure 6.1.6: Channel loss (bump to bump), measurement BER/FEC projected BER at 112Gb/s, a TX 112Gb/s PAM-4 eye, 400GBase-KR4 JTOL and performance comparison table of recent published 112Gb/s transceivers.



Figure 6.1.7: Die photo of 8 transceivers with a common PLL.

## 6.2 A 4.63pJ/b 112Gb/s DSP-Based PAM-4 Transceiver for a Large-Scale Switch in 5nm FinFET

Henry Park<sup>\*1</sup>, Mohammed Abdullatif<sup>\*1</sup>, Ehung Chen<sup>1</sup>, Ahmed Elmallah<sup>1</sup>, Qaiser Nehal<sup>1</sup>, Miguel Gandara<sup>1</sup>, Tsz-Bin Liu<sup>2</sup>, Amr Khashaba<sup>1</sup>, Joonyeong Lee<sup>1</sup>, Chih-Yi Kuan<sup>2</sup>, Dhinessh Ramachandran<sup>1</sup>, Ruey-Bo Sun<sup>2</sup>, Atharav Atharav<sup>1</sup>, Yusang Chun<sup>1</sup>, Mantian Zhang<sup>1</sup>, Deng-Fu Weng<sup>2</sup>, Chung-Hsien Tsai<sup>2</sup>, Chen-Hao Chang<sup>2</sup>, Chia-Sheng Peng<sup>2</sup>, Sheng-Tsung Hsu<sup>2</sup>, Tamer Ali<sup>1</sup>

<sup>1</sup>MediaTek, Irvine, CA

<sup>2</sup>MediaTek, Hsinchu, Taiwan

\*Equally Credited Authors (ECAs)

In hyperscale data centers, high-speed links beyond 100Gb/s are required by applications such as XSR (co-packaged optics and die-to-die interconnects) or LR (Ethernet switches, ASICs, and retimers). This work presents a 112Gb/s LR SerDes system with a DSP-based PAM-4 transceiver for large-scale switching ASICs (25.6–51.2Tb/s). For a compact and low-cost system configuration, each lane is recommended to drive more than 40dB channel loss [1–4] without repeaters. The maximum heat capacity of a package comprising a die with hundreds of transceiver lanes imposes a strict limit on the maximum power consumption per lane. In addition, large-scale integration leads to noisy operating conditions due to crosstalk (Xtalk) and supply noise. Concurrent support of multiple standard specifications (for example Ethernet and Fibre Channel) requires independent lane speeds as well as lane swapping. This flexible clocking requirement can be met by utilizing a TX-PLL and an RX-PLL per lane [4]. However, electromagnetic (EM) coupling between inductors in neighboring lanes is a challenge, especially when an LC oscillator is necessary for low-jitter clocking. This work overcomes those challenges by using a low-power and long-reach DSP-based transceiver with careful modeling of aggressive ASIC impairments on sensitive circuits.

The TX block diagram shown in Fig. 6.2.1 has a dedicated frac-N digital PLL. The TX PLL operating speed can be controlled independently from the RX PLL of the same lane. The PLL outputs a fully differential 14GHz clock that feeds an IQ generator circuit implemented with digitally controlled delay lines. The 4-phase clocks are used for 75% pulse generation to drive the 4:1 high-speed MUX. The pulse generator outputs are constantly monitored by duty-cycle and skew detectors. For area saving, the monitoring circuits (error amplifier and LPF) are shared between each clock phase detection. The 4:1 MUX adopts a near-CML architecture [5], whose output transition is faster than any other architecture with stacked devices. A replica MUX with a 1010 input pattern is used to set the output common-mode level to $V_{DD\_DRV}/2$ using a P-type load and an error amplifier. The amplifier output ($V_{ctrl}$) biases the main data path MUX array. The MUX output amplitude can be expressed as $2 \times (V_{DD\_MUX} - V_{DD\_DRV}/2)$ and it is programmable by changing the target output CM level or $V_{DD\_MUX}$. A smaller pull-up resistance is necessary to suppress data path ISI, while the output swing must be sufficiently large to fully switch the pre-driver (PDRV) inverter. The PDRV then produces a full CMOS level swing at the SST driver input. The TX driver can output $1.1V_{ppd}$ swing.

Figure 6.2.2 shows the RX block diagram. The RX front-end includes a passive element attenuator (ATT), VGA, CTLE, and T/H buffer. The wideband attenuator scales down the incoming signal swing to avoid RX data path saturation when a link partner's TX swing is as large as  $1.2V_{ppd}$ . The attenuator is made of shunt elements in parallel. Each unit has a capacitor and a resistor in series (Fig. 6.2.3(a)) whose gain equation can be approximated as Gain\_LF at low frequency (cap dominant) and Gain\_HF at high frequency (resistor dominant). Mismatch between Gain\_LF and Gain\_HF leads to long post-cursors in the RX pulse response that may degrade timing error detection (TED) quality if timing recovery FFE has a limited tap number. The VGA is placed before the CTLE to minimize high-frequency reflection and to improve data path bandwidth. The VGA gain code is adaptively controlled (AGC) to maintain the ADC input swing amplitude at  $400mV_{ppd}$ .

The CTLE output signal is sampled by 8-way interleaved T/H switches and each path drives a 7GS/s ADC. The interleaved T/H bandwidth mismatch impacts the residual ISI unless each path's FFE/DFE coefficients are independently controlled. A wideband (>28GHz) buffer and T/H are designed to minimize the ISI tap value mismatch. The sampling error power of the interleaving path can be analytically estimated by: 1) the sampled signal power or the power of its derivative (pulse response and PAM-4 scaler 5/9 in Fig. 6.2.2); and 2) the expected variance of the skew/gain residual error after calibration. The expressions in the figure can accurately model the skew-induced error power that is strongly affected by the timing recovery locking point ($t_{TR}$). The error power scales up from the PAM-4 eye center to the transition edge in proportion to the square of the slope of the pulse response ($h'$). Using the sampling error models, the gain/skew control step size is determined by keeping their noise power smaller than the ADC Q-noise power. The 7GS/s ADC is made of $7 \times 1$GS/s asynchronous 7b SAR ADCs with 1b redundancy for fast cycling and a smaller DNL. The timing recovery's MM phase detection can either use a low-latency short FFE to minimize peaking in the jitter transfer curve or use the main FFE/DFE output for very high loss channels (>40dB).
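The sampling-error budget described above can be sketched numerically. The code follows the form of the expressions given with Fig. 6.2.2, taking the skew term to depend on the slope $h'$ as the text states. The factor 5/9 is the mean power of PAM-4 levels normalized to ±1 and ±1/3. The function names, the use of pre-sampled cursor lists, and the uniform-quantizer model for the ADC Q-noise are assumptions made for illustration, not details from the paper.

```python
PAM4_POWER = 5.0 / 9.0  # mean of (1, 1/9, 1/9, 1): PAM-4 levels at +/-1 and +/-1/3

def skew_error_power(h_slopes, sigma_dt, n_th):
    """Residual skew-induced error power.

    h_slopes: slope of the pulse response h'(n*T_b + t_TR) at each cursor.
    sigma_dt: rms residual timing skew after calibration.
    n_th:     number of interleaved T/H paths.
    """
    return sum(s * s for s in h_slopes) * PAM4_POWER * sigma_dt ** 2 * (n_th - 1) / n_th

def gain_error_power(h_samples, sigma_dg, n_adc):
    """Residual gain-mismatch error power from the sampled pulse response."""
    return sum(h * h for h in h_samples) * PAM4_POWER * sigma_dg ** 2 * (n_adc - 1) / n_adc

def adc_q_noise_power(full_scale, bits):
    """Quantization noise power of an ideal uniform quantizer (LSB^2 / 12)."""
    lsb = full_scale / (2 ** bits)
    return lsb * lsb / 12.0

def step_is_acceptable(error_power, full_scale, bits):
    """True when the residual error power stays below the ADC Q-noise power."""
    return error_power < adc_q_noise_power(full_scale, bits)
```

A calibration step size would be chosen so that `step_is_acceptable` holds for both the skew and gain terms; the actual thresholds and pulse-response data used by the authors are not given in the text.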

The unit 7b ADC (1GS/s) and its S/H switch can resolve  $400mV_{ppd}$  signal swing. Since the core devices operate under 0.75V supply, the ADC S/H switch linearity may limit RX performance due to either settling error or leakage. An efficient way to improve the ADC S/H switch linearity is by adaptive common-mode (CM) biasing. A slow corner device favors a high CM level as the switch leakage is inherently small, while a fast corner device can benefit from a low CM level (Fig. 6.2.3(b)). Figure 6.2.3(c) shows an ADC input CM generation circuit for a PVT-insensitive S/H switch performance. This circuit generates a minimum tolerable input voltage of a replica S/H switch ( $V_{SH\_MIN}$ ) with a target resistance " $R_{REF}$ " whose value is set for at least 5 tau settling accuracy. The main data path CM level ( $V_{ADC\_CM}$ ) is 100mV greater than  $V_{SH\_MIN}$  as its swing can be as large as  $400mV_{ppd}$ . As shown in Fig. 6.2.3(c), the ADC S/H switch and the input device of the ADC buffer use the same P-type device, which stabilizes the CM level of the T/H switch over PVT corners. The ADC buffer has local discrete-time feedback with two non-overlapping clocks ( $\phi_1$  and  $\phi_2$ ) to keep its bias current stable.

A compact PLL is a key enabler for flexible per-lane TX/RX speed programming. A digital PLL provides compact area and a technology scaling advantage. The TX/RX DPLL block diagram employing an LC VCO is drawn in Fig. 6.2.4(a). The TDC gain curve is automatically calibrated to suppress its clipping rate to lower than 0.3% and to minimize its Q-noise. The main loop filter of the PLL is implemented in the digital domain. A bandwidth control loop monitors the autocorrelation of the TDC output for optimum phase noise shaping. The fractional-N divider Q-noise is bypassed to the digitally controlled delay line (DCDL) in the feedback path to suppress high-frequency delta-sigma noise. Aggressive LC VCO integration results in electromagnetic (EM) coupling between the VCOs of neighboring lanes (Fig. 6.2.4(b)). Ring-based VCOs do not suffer from this phenomenon [4], but LC VCOs have superior power efficiency, noise, and supply insensitivity. The EM coupling noise at the VCO output is suppressed by the feedback loop, and the noise shape is characterized by the PLL bandwidth and the coupling strength [6]. In wireline communication standards such as 802.3ck, the worst-case EM coupling arises from neighboring lanes operating at the same speed but with frequency offsets less than 200ppm (2.8MHz for a 14GHz oscillator). The DPLLs are designed with a minimum bandwidth of 10MHz that can reliably reject EM coupling noise with at least 10dB suppression. If the neighboring lanes operate at two different frequencies, the EM coupling is filtered by the Q of the LC tank.
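The worst-case coupling offset quoted above is simple arithmetic, sketched here for reference. The helper name `ppm_offset_hz` is hypothetical; the numeric inputs are the ones stated in the text.

```python
def ppm_offset_hz(f_osc_hz, ppm):
    """Frequency offset in Hz for an oscillator frequency and a ppm deviation."""
    return f_osc_hz * ppm * 1e-6

F_VCO_HZ = 14e9         # oscillator frequency from the text
MAX_OFFSET_PPM = 200    # worst-case offset between neighboring lanes
PLL_MIN_BW_HZ = 10e6    # minimum DPLL bandwidth from the text

offset_hz = ppm_offset_hz(F_VCO_HZ, MAX_OFFSET_PPM)  # about 2.8 MHz
offset_inside_loop_bw = offset_hz < PLL_MIN_BW_HZ
```

The worst-case offset lies below the 10MHz minimum loop bandwidth, which is consistent with the feedback loop being relied on to suppress this coupling.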

Figure 6.2.5 shows the TX/RX measurement data. The 112.5Gb/s TX PAM-4 eye diagram is shown in Fig. 6.2.5(a) with DC de-emphasis filtering to compensate 4.2dB loss from the package and off-chip connection. Measured RLM is higher than 99% and SNDR is 39dB. Figure 6.2.5(b) is a measured RX front-end response over frequency with VSR mode, LR mode, and MAX peaking setup (PKG loss is de-embedded). RX SNDR at DC is 35dB when the RXFE is set to LR mode (Fig. 6.2.5(c)). The interleaving path's offset/gain calibration is active in this single-tone test, but the skew calibration is disabled. The high-frequency SNDR is limited by clocking path RJ (150fs). The measured RX JTOL has 150mUI margin at 112.5Gb/s data rate (Fig. 6.2.5(d)). Figures 6.2.6(a) and 6.2.6(b) show long reach link performance with TX to RX loopback. In Fig. 6.2.6(a), TX and RX from the same lane are connected via 20"~26" board trace. Loss compensation with  $BER < 1e-5$  is 48dB and 49dB under 112.5Gb/s and 106.25Gb/s data rate, respectively. A more comprehensive compliance test is demonstrated in Fig. 6.2.6(b) by using a backplane channel. The measured IL is 36.2~39dB over temperature sweep and the test setup has 6 neighboring lanes in cable loopback to inject MAX near-end Xtalk (NEXT) from PKG, PCB, and MXP connectors. The DUT's TX and RX data rate has 100ppm offset to exercise RX phase rotation. KR mode BER varies by 8x over temperature sweep primarily because of the loss variation. The table in Fig. 6.2.6(c) compares power, performance, area, and analog/digital architecture of this work to the state of the art. The transceiver consumes 521mW per lane at 112.5Gb/s data rate and its analog+digital area is 0.461mm<sup>2</sup>/lane. A die micro-photo is included in Fig. 6.2.7.
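The energy-efficiency figure in the paper title follows directly from the reported power and data rate. The short check below uses only numbers stated in this session's text; the function name is a hypothetical helper.

```python
def pj_per_bit(power_w, data_rate_bps):
    """Energy per bit in pJ/b from total power (W) and data rate (b/s)."""
    return power_w / data_rate_bps * 1e12

# Paper 6.2: 521mW per lane at 112.5Gb/s.
eff_6_2 = pj_per_bit(0.521, 112.5e9)

# For comparison, Paper 6.1 reports 690mW overall per RX/TX at 112Gb/s.
eff_6_1 = pj_per_bit(0.690, 112e9)
```

Rounded to two decimals, `eff_6_2` gives the 4.63pJ/b quoted in the title, while the Paper 6.1 numbers correspond to roughly 6.2pJ/b.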

### References:

- [1] M. LaCroix et al., "A 116Gb/s DSP-Based Wireline Transceiver in 7nm CMOS Achieving 6pJ/b at 45dB Loss in PAM-4/Duo-PAM-4 and 52dB in PAM-2," ISSCC, pp. 132-133, Feb. 2021.
- [2] P. Mishra et al., "A 112Gb/s ADC-DSP-Based PAM-4 Transceiver for Long-Reach Applications with >40dB Channel Loss in 7nm FinFET," ISSCC, pp. 138-139, Feb. 2021.
- [3] Z. Guo et al., "A 112.5Gb/s ADC-DSP-Based PAM-4 Long-Reach Transceiver with >50dB Channel Loss in 5nm FinFET," ISSCC, pp. 116-117, Feb. 2022.
- [4] A. Varzaghi et al., "A 1-to-112Gb/s DSP-Based Wireline Transceiver with a Flexible Clocking Scheme in 5nm FinFET," IEEE Symp. VLSI Circuits, pp. 26-27, June 2022.
- [5] Z. Toprak-Deniz et al., "6.6 A 128Gb/s 1.3pJ/b PAM-4 Transmitter with Reconfigurable 3-Tap FFE in 14nm CMOS," ISSCC, pp. 122-123, Feb. 2019.
- [6] C. -J. Li et al., "A Rigorous Analysis of a Phase-Locked Oscillator Under Injection," IEEE TMTT, vol. 58, no. 5, pp. 1391-1400, May 2010.



Figure 6.2.1: TX block diagram.



Residual sampling error due to skew and gain mismatch:

$$E\left\{\sigma^2(\tilde{v}_{e,\Delta t}(t_{TR}))\right\} \approx \sum_{n=0}^{\infty} \left(h'(nT_b + t_{TR})\right)^2 \cdot \frac{5}{9} \cdot \sigma^2(\Delta t) \cdot \frac{N_{TH}-1}{N_{TH}}$$

$$E\left\{\sigma^2(\tilde{v}_{e,\Delta g}(t_{TR}))\right\} \approx \sum_{n=0}^{\infty} \left(h(nT_b + t_{TR})\right)^2 \cdot \frac{5}{9} \cdot \sigma^2(\Delta g) \cdot \frac{N_{ADC}-1}{N_{ADC}}$$

\* $h$: pulse response of the channel plus TX and RX, $h'$: its time derivative (slope), $T_b$: unit interval, $t_{TR}$: TR locking point ($0 \sim T_b$), $\Delta t$ and $\Delta g$: residual timing and gain mismatch after calibration.

Figure 6.2.2: RX block diagram.

Figure 6.2.3: (a) RXFE attenuator design target (GAIN\_LF = GAIN\_HF), (b) 1GS/s ADC S/H ENOB with 400mV<sub>ppd</sub> input swing over PVT corners, (c) T/H switch, ADC buffer, and ADC CM level generator for constant S/H switch resistance.

Figure 6.2.4: (a) DPLL block diagram (digital blocks in purple color), (b) EM coupling model.

Figure 6.2.5: TX/RX performance plots, (a) 112.5Gb/s TX eye diagram, (b) RX frequency response (PKG de-embedded), (c) RX single tone test (LR mode, 360mV<sub>ppd</sub> @ADC), (d) 112.5Gb/s RX JTOL.

Figure 6.2.6: (a) 106.25/112.5Gb/s T2R loopback BER vs. loss (dB), (b) 106.25Gb/s KR BER (36.2~39dB loss over temperature) under thermal cycle, 100ppm frequency offset, and 6 NEXTs, (c) comparison table.



Figure 6.2.7: Transceiver micro-photograph.

## 6.3 A 0.43pJ/b 200Gb/s 5-Tap Delay-Line-Based Receiver FFE with Low-Frequency Equalization in 28nm CMOS

Bingyi Ye, Guangdong Wu, Weixin Gai, Kai Sheng, Yandong He

Peking University, Beijing, China

The ever-increasing demand for greater I/O bandwidth has pushed the transceiver data rate to 200Gb/s [1]. At this rate, the implementation of decision-feedback equalizers faces severe timing constraints. Discrete-time feed-forward equalizers (FFEs) in receivers (RXs) break the timing loop and compensate for electrical and optical impairments [2-3]. However, they rely on accurate, multiphase, high-speed sampling clocks. RX FFEs implemented in the continuous-time domain use active [4-5] or passive [5-6] delay lines, which eliminate the clock and the interleaved sample-and-hold circuits. In addition, a continuous-time FFE preserves edge information and therefore supports oversampling clock and data recovery (CDR). This paper presents a 5-tap delay-line-based receiver FFE operating at 200Gb/s and equalizing a 17.2dB-loss channel.

The top of Fig. 6.3.1 shows the insertion loss of two on-chip grounded coplanar waveguides (GCPWs) with 50ps and 10ps delays, corresponding to one unit interval (UI) at 20GBaud and 100GBaud, respectively. The length of a 50ps GCPW is approximately 7.5mm, which results in significant DC loss and area occupation when used in a 20GBaud FFE. To avoid long on-chip transmission lines, some designs utilize lumped inductors and capacitors [5-6] as passive delay lines. When the baud rate reaches 100GBaud, the DC and Nyquist losses of a single-UI GCPW are only 0.35dB and 1.15dB, respectively, which are much less than those of the single-UI GCPW for 20GBaud. This significant reduction in loss and area makes the 100GBaud transmission-line-based FFE feasible. Although the transmission line has a constant delay, our design can operate over a wide range of data rates when configured as a fractionally spaced FFE. The bottom of Fig. 6.3.1 shows the block diagram of the proposed RX FFE. The input data $V_i$ goes through a 2-UI delay line and is terminated by the AC-grounded resistors $R_{\text{Termt}}$. Five taps, $H_{-1}$ to $H_3$, receive the incoming signals at 0.5-UI spacing and amplify them using programmable gain coefficients. The amplified signals are summed using another 0.5-UI-spaced delay line to generate the FFE output $V_o$. This FFE supports traditional coefficient-based mid- and high-frequency equalization, as well as low-frequency equalization based on RC source degeneration. To characterize the FFE performance, an off-chip oscilloscope was used to measure $V_o$ in this work.
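The delay and length scaling described above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the length-per-delay constant is inferred from the quoted 7.5mm for 50ps, the 5ps tap delay is taken as 0.5 UI at 100GBaud, and the function names are hypothetical.

```python
# Illustrative arithmetic for the GCPW delay-line scaling (names are hypothetical).

def pam4_baud_gbaud(bit_rate_gbps: float) -> float:
    """PAM-4 carries 2 bits per symbol, so the symbol rate is half the bit rate."""
    return bit_rate_gbps / 2.0

def unit_interval_ps(baud_gbaud: float) -> float:
    """Unit interval in picoseconds for a symbol rate given in GBaud."""
    return 1000.0 / baud_gbaud

# Assumption: length scales linearly with delay, using the quoted 7.5mm for 50ps.
MM_PER_PS = 7.5 / 50.0

def gcpw_length_mm(delay_ps: float) -> float:
    """Approximate GCPW length needed for a given delay."""
    return MM_PER_PS * delay_ps

def tap_spacing_in_ui(tap_delay_ps: float, baud_gbaud: float) -> float:
    """Fixed physical tap delay expressed as a fraction of the UI at a given rate."""
    return tap_delay_ps / unit_interval_ps(baud_gbaud)

TAP_DELAY_PS = 5.0  # 0.5 UI at 100GBaud

for rate_gbps in (200.0, 168.0):
    baud = pam4_baud_gbaud(rate_gbps)
    print(f"{rate_gbps:.0f}Gb/s PAM-4: UI = {unit_interval_ps(baud):.2f}ps, "
          f"tap spacing = {tap_spacing_in_ui(TAP_DELAY_PS, baud):.2f} UI")

print(f"1-UI GCPW at 20GBaud:  {gcpw_length_mm(unit_interval_ps(20.0)):.1f}mm")
print(f"1-UI GCPW at 100GBaud: {gcpw_length_mm(unit_interval_ps(100.0)):.1f}mm")
```

Because the physical tap delay is fixed, lowering the symbol rate only changes the fractional spacing, which is why the same delay line remains usable as a fractionally spaced FFE.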

The FFE employs a one-stage topology to achieve a higher bandwidth and lower power. Because realizing an amplifier with a bandwidth of >50GHz is difficult in a 28nm CMOS process, neither an input buffer nor an output driver is implemented in this design. Therefore, large lumped amplifiers in the $H_0$ and $H_1$ taps, which account for 2/3 of the total capacitive loading, would induce impedance discontinuities and cause significant reflection, as depicted at the top left of Fig. 6.3.2. This problem can be alleviated by using distributed amplifiers in the $H_0$ and $H_1$ taps, as shown at the bottom left. Although there is a 1ps intra-tap delay mismatch, it only reduces the Nyquist gain by 0.1dB. The other three taps, $H_{-1}$, $H_2$, and $H_3$, are smaller and are implemented with lumped amplifiers. The simulation results in the top right of Fig. 6.3.2 show that distributing the $H_0$ and $H_1$ tap amplifiers improves the return loss of the FFE by 5dB at the Nyquist frequency. The bottom right of Fig. 6.3.2 shows the layout of the $H_0$ and $H_1$ taps. The two delay lines for input $V_i$ and output $V_o$ are implemented using GCPWs, with a 20μm-wide ground wire in between for crosstalk suppression. The impedance of the delay line is designed to be 60Ω instead of 50Ω to compensate for the drop in impedance caused by the capacitive loading of the tap amplifiers. The shielding plane (shown in green) beneath the high-speed delay lines provides a low-resistance ground that significantly mitigates the ground-bounce noise induced by the common-mode return current of the tap amplifiers.

The left of Fig. 6.3.3 presents three types of tap amplifiers. The top left is the conventional cell-based tap amplifier [5]. Because each transconductance ($G_m$) cell operates at the same optimized tail current, good linearity is achieved with any coefficient. However, a tap amplifier consisting of multiple $G_m$ cells suffers from large wiring parasitics because each $G_m$ cell must support RC degeneration independently. The tap amplifier for $H_{-1}$, $H_0$, and $H_1$ utilizes a tail-current-controlled variable-$G_m$ cell to achieve smaller wiring parasitics, as shown in the middle left. The coefficients of these three taps are much larger than zero, preventing the severe nonlinearity caused by near-threshold-voltage operation. The tap amplifier for $H_2$ or $H_3$ is composed of two variable-$G_m$ cells with their outputs cross-connected, as shown in the bottom left. Subtracting the $G_m$ of the two cells enables small coefficients. For large coefficients, only one $G_m$ cell is used and the other is turned off. As a result, the tap amplifier maintains good linearity for all coefficients. The right of Fig. 6.3.3 shows the schematic of the variable-$G_m$ cell. Its gain is controlled by a tail current with a resolution of four bits. RC source degeneration with 2b programmable resistors provides low-frequency equalization. Compared to adding more FFE taps, the source degeneration consumes no extra power. In addition, the linearity of the differential pair is improved, which is favorable for PAM-4 signaling. The $H_0$ and $H_1$ taps are composed of four and two variable-$G_m$ cells, respectively, to support larger coefficients and finer low-frequency equalization adjustment.

The FFE is fabricated in a 28nm CMOS technology. The time-domain performance is characterized using an arbitrary waveform generator (AWG) and a real-time oscilloscope, and the frequency-domain performance is measured using a vector network analyzer. The top left of Fig. 6.3.4 shows the measured pulse response of the $H_0 + H_1$ taps, with the $H_1$ coefficient swept from zero to its maximum, $H_0$ at its typical setting, and the other taps turned off. The pulse response shows that the $H_1$ tap has a maximum normalized gain of 0.6 and a delay of approximately 10ps. The top right of Fig. 6.3.4 shows the frequency response of the $H_0$ tap, which has a 3.4dB tuning range of the low-frequency gain. The frequency response of the 5-tap FFE can be configured to provide a wide range of peaking up to 15dB with a Nyquist gain of 1.7dB, as shown in the bottom left of Fig. 6.3.4. A larger peaking value can be achieved by reducing the $H_0$ gain. A flat frequency response is measured and serves as the baseline, which is removed from the other frequency responses to make the curves more distinguishable. The bottom right of Fig. 6.3.4 shows the measured output noise of the FFE, which is less than 1.1mV<sub>rms</sub> for all peaking configurations. This figure would increase somewhat in a complete RX because of the power-supply noise induced by other circuits.
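How tap coefficients translate into peaking can be illustrated with an idealized FIR model of a five-tap, 0.5-UI-spaced FFE. This is a sketch under stated assumptions: delay-line loss, amplifier bandwidth, and RC source degeneration are ignored, and the coefficient values are hypothetical rather than measured settings.

```python
import cmath
import math

# Idealized model (assumption): lossless delay line, frequency-flat tap gains.
TAP_SPACING_S = 5e-12           # 0.5 UI at 100GBaud
TAP_INDICES = (-1, 0, 1, 2, 3)  # one pre-cursor, main, three post-cursor taps

def ffe_response(coeffs, freq_hz):
    """Complex response H(f) = sum_k c_k * exp(-j*2*pi*f*k*tau)."""
    return sum(
        c * cmath.exp(-2j * math.pi * freq_hz * k * TAP_SPACING_S)
        for c, k in zip(coeffs, TAP_INDICES)
    )

def peaking_db(coeffs, nyquist_hz=50e9):
    """Gain at the Nyquist frequency relative to the DC gain, in dB."""
    dc_gain = abs(ffe_response(coeffs, 0.0))
    nyquist_gain = abs(ffe_response(coeffs, nyquist_hz))
    return 20.0 * math.log10(nyquist_gain / dc_gain)

# Hypothetical coefficients, normalized to the main tap.
example_coeffs = (-0.10, 1.00, -0.40, -0.10, 0.05)

print(f"DC gain:      {abs(ffe_response(example_coeffs, 0.0)):.3f}")
print(f"Nyquist gain: {abs(ffe_response(example_coeffs, 50e9)):.3f}")
print(f"Peaking:      {peaking_db(example_coeffs):.2f} dB")
```

In this model, negative neighboring coefficients reduce the DC gain (the plain sum of the taps) more than the Nyquist gain, which is what produces the peaking. Reducing the main-tap coefficient increases the ratio further, in line with the remark that larger peaking is obtained by reducing the $H_0$ gain.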

The 200Gb/s QPRBS9 PAM-4 signal was generated using the AWG, passed through the RX FFE, and captured by the oscilloscope to measure the eye diagrams. Two test channels, including RF probes and cables, are measured and have insertion losses of 9.0dB and 17.2dB, respectively, at 50GHz. Figure 6.3.5 shows the eye diagrams before and after equalization. The eye diagram is completely closed after a 7.1dB-loss cable. By optimizing FFE coefficients and low-frequency equalization, the eye diagrams are opened for both the 9.0dB and 17.2dB loss channels with no error observed over 2E5 symbols. However, the scope does not support the PAM-4 CDR and induces a large low-frequency jitter when recording for a longer time. The bit-error rate calculated with eye heights and noise amplitude is lower than 1E-6 for the 17.2dB-loss channel. When low-frequency equalization is turned off, FFE coefficients can hardly recover the PAM-4 signal for the 17.2dB-loss channel. The top right of Fig. 6.3.5 shows an equalized 100Gb/s NRZ eye diagram of the same channel. The FFE also compensates for a 15.4dB loss at 42GHz when operating at a data rate of 168Gb/s. For 17.2dB equalization, powered by a 2.2V supply, the chip consumes 115mW for the FFE and 2mW for the bias and serial interface. The supply voltage and total power can be reduced to 1.6V and 86mW, respectively, when the FFE is implemented in the receiver and draws current from 50Ω resistors at both ends of the output delay line.
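The energy-efficiency figures follow directly from the quoted power and data rate. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def energy_per_bit_pj(power_mw: float, data_rate_gbps: float) -> float:
    """Energy per bit in pJ/b: mW divided by Gb/s gives pJ/b directly."""
    return power_mw / data_rate_gbps

# Standalone test chip at 2.2V: FFE plus bias and serial interface.
standalone_pj = energy_per_bit_pj(115.0 + 2.0, 200.0)

# Projected in-receiver operation at 1.6V with 86mW total power.
in_receiver_pj = energy_per_bit_pj(86.0, 200.0)

print(f"Standalone:  {standalone_pj:.3f} pJ/b")
print(f"In receiver: {in_receiver_pj:.2f} pJ/b")
```

The 0.43pJ/b figure in the title corresponds to the 86mW in-receiver case, not to the 117mW drawn by the standalone test chip.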

Figure 6.3.6 shows the measured performance and a comparison with previously published continuous-time FFEs and an ADC-based 224Gb/s RX [1]. This work is the first continuous-time FFE operating at 200Gb/s. Here, low-frequency equalization enhances channel loss compensation. Furthermore, the passive delay lines and the one-stage topology enable low output noise. When implemented in a 200Gb/s receiver, the proposed FFE provides a low-power solution for short-reach transmission. The 0.43pJ/b energy efficiency is only 31% of the ADC-based RX and the 0.32mm<sup>2</sup> core area is comparable although passive transmission lines are implemented in the chip. The die micrograph is shown in Fig. 6.3.7.

#### Acknowledgement:

This work was supported by the National Key R&D Program of China under Grant 2018YFB2202301.

#### References:

- [1] Y. Segal et al., "A 1.41pJ/b 224Gb/s PAM-4 SerDes Receiver with 31dB Loss Compensation," ISSCC, pp. 114-115, Feb. 2022.
- [2] B. Ye et al., "A 2.29pJ/b 112Gb/s Wireline Transceiver with RX 4-Tap FFE for Medium-Reach Applications in 28nm CMOS," ISSCC, pp. 118-119, Feb. 2022.
- [3] H. Li et al., "A 100Gb/s -8.3dBm-Sensitivity PAM-4 Optical Receiver with Integrated TIA, FFE and Direct-Feedback DFE in 28nm CMOS," ISSCC, pp. 190-191, Feb. 2021.
- [4] E. Mammei et al., "A Power-Scalable 7-Tap FIR Equalizer with Tunable Active Delay Line for 10-to-25Gb/s Multi-Mode Fiber EDC in 28nm LP-CMOS," ISSCC, pp. 142-143, Feb. 2014.
- [5] A. Momtaz and M. M. Green, "An 80 mW 40 Gb/s 7-Tap T/2-Spaced Feed-Forward Equalizer in 65 nm CMOS," IEEE JSSC, vol. 45, no. 3, pp. 629-639, March 2010.
- [6] J. Sewter and A. C. Carusone, "A CMOS Finite Impulse Response Filter With a Crossover Traveling Wave Topology for Equalization up to 30 Gb/s," IEEE JSSC, vol. 41, no. 4, pp. 909-917, April 2006.



| Baud rate (Gbaud) | 20   | 100 (5X)    |
|-------------------|------|-------------|
| TL Delay (ps)     | 50   | 10 (0.2X)   |
| TL Length (mm)    | 7.5  | 1.5 (0.2X)  |
| DC Loss (dB)      | 1.63 | 0.35 (0.2X) |
| Nyquist Loss (dB) | 2.64 | 1.15 (0.4X) |



Figure 6.3.1: Insertion loss of two on-chip GCPWs and FFE block diagram.



Figure 6.3.5: Measured eye-diagrams before and after equalization.

Figure 6.3.6: Performance comparison with prior works.



Figure 6.3.7: Die micrograph.

### 6.4 A 4nm 32Gb/s 8Tb/s/mm Die-to-Die Chiplet Using NRZ Single-Ended Transceiver With Equalization Schemes And Training Techniques

Kihwan Seong, Donguk Park, Gyeomje Bae, Hyunwoo Lee, Youngseob Suh, Wooseuk Oh, Hyemun Lee, Juyoung Kim, Takgun Lee, Geonhoo Mo, Sukhyun Jung, Dongcheol Choi, Byoung-Joo Yoo, Sanghune Park, Hyo-Gyuem Rhew, Jongshin Shin

Samsung Electronics, Hwasung, Korea

Recently, the demand for multi-chip solutions, such as chip-on-wafer-on-substrate (CoWoS) and embedded multi-die interconnect bridge (EMIB), has been increasing because they reduce chip size and cost for high-performance computing (HPC), artificial intelligence (AI), and big data applications [1]. Following this trend, the industry is establishing standard specifications for die-to-die interfaces, such as Universal Chiplet Interconnect Express (UCIe) and Open High-Bandwidth Interface (OpenHBI), and developing various chiplets with high data transmission bandwidth per unit width, low latency, and low power consumption [2-5]. This work implements a die-to-die (D2D) chiplet compatible with the UCIe specification using 2.5D packaging technology for die-to-die communication. The transmitter (TX) adopts a reflection-cancellation driver (RCD) that cancels the reflections caused by impedance mismatch when the receiver (RX) has no on-die termination. In addition, a TX clock phase training scheme with a synchronous reset generator is implemented to achieve low latency. In the RX, a direct decision-feedback equalizer (DFE) combined with a double-tail latch compensates for inter-symbol interference (ISI) while reducing its feedback time even at high-speed operation. All circuits needed for offset calibration, duty-cycle-distortion correction, and skew calibration are implemented fully in digital logic to eliminate static power consumption. This transceiver in 4nm FinFET CMOS technology operates at 32Gb/s/wire with 0.44pJ/b energy efficiency and achieves 8Tb/s/mm beach-front bandwidth.

Figure 6.4.1 shows the 2.5D package structure of the D2D. Each chiplet has the D2D stacked in two rows to achieve higher total beach-front bandwidth, and each D2D consists of 4 slices of TX and RX. Each slice of the TX and the RX consists of 39 DQs and a DQS, which operate up to 32Gb/s. Every channel has an equal length to minimize the skew among channels, connecting the inner transceiver in chiplet1 to the outer transceiver in chiplet2. The number of metal layers required for the interposer is determined by the amount of transmitted data, the channel characteristics, and the length of die edge [5]. Logic circuits for training are implemented in the PCS layer of the D2D; the offset calibration logic achieves better voltage margin, and the skew calibration increases the timing margin in the RX sense amplifier. Also, data bus inversion encoders and decoders reduce the simultaneous switching noise. In addition, the lane repair function improves the reliability by lowering the yield loss caused by defects.

Figure 6.4.2 shows the implemented single-ended NRZ transmitter and receiver architecture of the D2D PHY. The D2D PHY adopts a source-synchronous architecture and has 39 DQs for data and a DQS for strobe clock in the slice of TX and RX. The TX includes low-swing output drivers for low power consumption; the output drivers consist of the push-pull NMOS-type voltage-mode driver for low swing and the reflection cancellation driver to eliminate the reflected waves. The DQS of the TX includes a synchronous reset generator (TX SYNC GEN) for low latency; by using the TX SYNC GEN, a FIFO used for the asynchronous interface of each DQ can be eliminated. As a result, the latency is also reduced. The RX includes a DFE that can operate at high-speed and de-skew circuits to align the clock with the data from the TX. The DQs and DQS of RX include local de-skew circuits and global de-skew circuits, respectively. The local and global de-skew circuits are implemented for per-bit and per-slice de-skew respectively so that they can be used depending on different data rates. At low data rates with sufficient timing margin, the global de-skew circuits are used to save power consumption; they consist of NAND-based coarse delay lines and phase-interpolator-based fine delay lines. The combination of the two delay lines can improve resolution while operating seamlessly. One of the 39 DQs is used for the periodic skew calibration (PSC) to compensate for the skew between the data and clock due to either voltage or temperature variations. The DQs for data align the clock to the center of the data from the TX, while the DQ for PSC aligns the clock to the edge of the data from the TX and tracks the change between the data and the clock due to either voltage or temperature variations. The DQ for PSC uses the clock pattern as TX data input (TXDATA[15:0]).

The synchronous reset generator (sync gen) for TX clock phase training is shown in Fig. 6.4.3. The purpose of the TX clock phase training is to find an optimal clock phase for sampling data transmitted from the asynchronous clock domain; in previous work [1], a FIFO was used to capture the data, which resulted in additional latency. Instead of shifting the clock directly in the TX clock phase training, the D2D shifts the reset signal used by the divider of each DQ to find the optimal clock position for data capturing. The reset signal is shifted by psel[1:0]; the shift range is $T_{\text{async\_clk}} + N \times 4UI$. All psel values are swept, and the psel code that generates the clock with the most timing margin is selected. The TX clock phase training can reduce latency to one cycle by eliminating the FIFO. A circuit diagram of the RCD is also shown in Fig. 6.4.3. Due to the impedance mismatch in the interposer, the input at the RX (RX PAD) can see reflected waves on transitions. To eliminate the effect of these reflected waves on the RX PAD, the D2D uses the RCD. A pre-driver in the RCD finds a compensation location to effectively remove the signal distortion caused by the reflected waves.
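The psel sweep amounts to an exhaustive search over four codes. The sketch below shows that selection logic only. The margin-measurement callback and the margin values are hypothetical stand-ins, since the text does not describe how the hardware evaluates timing margin.

```python
# Sketch of the psel sweep: try every code, keep the one with the most margin.
# `measure_margin` is a hypothetical callback returning timing margin in ps.

def train_tx_clock_phase(measure_margin, psel_codes=range(4)):
    """Sweep all psel[1:0] codes and return (best_code, margins_by_code)."""
    margins = {psel: measure_margin(psel) for psel in psel_codes}
    best_psel = max(margins, key=margins.get)
    return best_psel, margins

# Made-up margins for illustration; a negative value models a failing phase.
fake_margins_ps = {0: 3.0, 1: 11.5, 2: 7.0, 3: -2.0}

best_psel, margins = train_tx_clock_phase(fake_margins_ps.get)
print(f"selected psel = {best_psel:02b} with {margins[best_psel]} ps margin")
```

With only two control bits the sweep is four measurements long, so a brute-force search is sufficient and no gradient or binary search is needed.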

Figure 6.4.4 shows the implemented direct decision-feedback equalizer. In general, the direct DFE is limited in its operating speed by the feedback time ($T_{fb}$). A look-ahead DFE or loop-unrolled DFE architecture that mitigates the limitation from the feedback time has been used in previous work. However, the loop-unrolled DFE has disadvantages in size and power consumption compared to the direct DFE. The D2D proposes a direct DFE combined with a double-tail latch for high-data-rate operation. In previous work [6], the feedback data ($E_{\text{OP}}/E_{\text{ON}}, O_{\text{OP}}/O_{\text{ON}}$) is connected to the first latch to compensate for ISI; the feedback time ($T_{fb}$) of the DFE is required to be less than 1UI for proper operation. In the implemented DFE, the feedback data ($E_{\text{OP}}/E_{\text{ON}}$ and $O_{\text{OP}}/O_{\text{ON}}$) are connected to the gates of the pull-down transistors of the second latch ($M_1, M_2$). The implemented scheme compensates for ISI by controlling the pull-down strength of the second latch. When ISI is present in the input signal of the RX (RX\_PAD), the differential input from the first stage of the slicer becomes small, which slows down the second-stage decision. The DFE feedback to the second stage of the slicer helps speed up the second-stage operation. The control circuits for calculating the DFE coefficient are fully implemented in digital, eliminating static current.
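The ISI cancellation performed by a direct DFE can be illustrated with a behavioral one-tap model. This is a generic textbook sketch, not the latch-level scheme of Fig. 6.4.4: it subtracts the post-cursor contribution arithmetically instead of modulating the pull-down strength of a second latch, and the channel coefficient and bit pattern are made up.

```python
# Behavioral one-tap direct DFE on NRZ data (generic model, hypothetical values).

def to_symbols(bits):
    """Map bits {0, 1} to NRZ levels {-1.0, +1.0}."""
    return [1.0 if b else -1.0 for b in bits]

def channel_with_postcursor(symbols, h1, initial=-1.0):
    """Each received sample is the current symbol plus h1 times the previous one."""
    out, prev = [], initial
    for d in symbols:
        out.append(d + h1 * prev)
        prev = d
    return out

def direct_dfe(received, h1, initial=-1.0):
    """Subtract h1 times the previous decision before slicing the current sample."""
    decisions, corrected, prev = [], [], initial
    for x in received:
        y = x - h1 * prev
        d = 1.0 if y >= 0 else -1.0
        corrected.append(y)
        decisions.append(d)
        prev = d
    return decisions, corrected

def worst_case_margin(samples, symbols):
    """Smallest signed distance from the slicing threshold, in symbol units."""
    return min(s * d for s, d in zip(samples, symbols))

bits = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0]
H1 = 0.6  # hypothetical first post-cursor

tx = to_symbols(bits)
rx = channel_with_postcursor(tx, H1)
decisions, equalized = direct_dfe(rx, H1)

print(f"worst-case margin without DFE: {worst_case_margin(rx, tx):.2f}")
print(f"worst-case margin with DFE:    {worst_case_margin(equalized, tx):.2f}")
```

The feedback in this loop must settle before the next sample is sliced, which is the 1UI feedback-time constraint discussed above.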

The measurement results of the D2D with the interposer channel at 32Gb/s are shown in Fig. 6.4.5. To monitor the TX eye diagram, we also designed a 2D package using the D2D and measured the TX eye diagram. The measured eye diagram at 32Gb/s, with a 24.2ps/90mV margin, shows that the implemented TX operates up to 32Gb/s. The measured waveform reflects the characteristics of the package, the probe card, and the internal VSS termination of the equipment. The post-processed on-chip eye diagram of the D2D in a 2.5D package at 32Gb/s shows a horizontal/vertical eye opening of 16ps/170mV.

The performance summary is shown in Fig. 6.4.6. The D2D is implemented in 4nm CMOS FinFET technology for a 2.5D package with silicon interposer. The D2D operates up to 32Gb/s/wire with 3mm silicon interposer channels. The channel loss is -4dB at 16GHz. The D2D achieves the best beach-front bandwidth and figure-of-merit (FoM) compared to previous works. The testchip micrograph with the interposer in a 2.5D package is shown in Fig. 6.4.7.

#### References:

- [1] M.-Shan Lin et al., "A 7nm 4GHz Arm-core-based CoWoS Chiplet Design for High Performance Computing," *IEEE Symp. VLSI Technology*, pp. C28-C29, July 2019.
- [2] K. McCollough et al., "A 480Gb/s/mm 1.7pJ/b Short-Reach Wireline Transceiver Using Single-Ended NRZ for Die-to-Die Application," *ISSCC*, pp. 184-185, Feb. 2021.
- [3] G. Gangasani et al., "A 1.6Tb/s Chiplet over XSR-MCM Channels using 113Gb/s PAM-4 Transceiver with Dynamic Receiver-Driven Adaptation of TX-FFE and Programmable Roaming Taps in 5nm CMOS," *ISSCC*, pp. 122-123, Feb. 2022.
- [4] Y.-Y. Hus et al., "A 7nm 0.46pJ/bit 20Gbps with BER 1E-25 Die-to-Die Link Using Minimum Intrinsic Auto Alignment and Noise-Immunity Encode," *IEEE Symp. VLSI Technology*, pp. JFS1-3, June 2021.
- [5] Y. Nishi et al., "A 0.297-pJ/bit 50.4-Gb/s/wire Inverter-Based Short-Reach Simultaneous Bidirectional Transceiver for Die-to-Die Interface in 5nm CMOS," *IEEE Symp. VLSI Technology*, pp. 154-155, June 2022.
- [6] J. Seo et al., "A 7.8-Gb/s 2.9-pJ/b Single-Ended Receiver With 20-Tap DFE for Highly Reflective Channels," *IEEE TVLSI*, pp. 818-822, March 2020.



Figure 6.4.1: Implemented die-to-die (D2D) chiplet.



Figure 6.4.2: Implemented TX and RX of the D2D PHY.



Figure 6.4.3: The implemented TX scheme: TX synchronous reset generator and reflection-cancellation driver (RCD).



Figure 6.4.4: Conceptual view of the implemented direct decision-feedback equalizer.



Figure 6.4.5: Measured TX eye-diagram at 32Gb/s and post-processing on-chip eye diagram at 32Gb/s.



Figure 6.4.6: Performance summary.



Figure 6.4.7: Testchip micrograph with the interposer in a 2.5D package.

### 6.5 A 37.8dB Channel Loss 0.6μs Lock Time CDR with Flash Frequency Acquisition in 5nm FinFET

Chien-Kai Kao, Shih-Che Hung, Tse-Hsien Yeh, Chen-Yu Hsiao

MediaTek, Hsinchu, Taiwan

High-speed SerDes is accompanied by high channel loss. Channel loss is usually compensated by transmitter feed-forward equalization (FFE), receiver continuous-time linear equalization (CTLE), and receiver decision-feedback equalization (DFE). However, the FFE, CTLE, and DFE can only adjust compensation strength after the CDR is locked. Without well-compensated channel loss, it is difficult for a CDR to lock to input data with significant frequency offset. Several frequency detection techniques are presented in [1-3], but the effect of channel loss on CDR locking behavior has received less attention. Achieving fast lock time under high channel loss is a challenge for CDR circuit design. In [3], a stochastic PFD weights three consecutive patterns, data-edge-data, to obtain frequency information. However, the calculated pattern weighting may vary with different channel losses. Moreover, like conventional approaches, it can only obtain the sign of the frequency offset. In this work, a CDR with a flash frequency acquisition (FFA) technique is proposed. The proposed CDR adopts open-loop frequency acquisition to avoid noise accumulation in closed-loop operation. The sampled data from a bang-bang phase detector forms a time series. The FFA uses an autocorrelation network with weighting information and an autocorrelation function, borrowing concepts from a recurrent neural network (RNN), to acquire the frequency offset. By using FFA, this CDR acquires the magnitude of the frequency offset under 37.8dB channel loss. Because the magnitude of the frequency offset is known, the proposed CDR achieves a lock time independent of frequency offset.

As illustrated in Fig. 6.5.1, the receiver front-end circuit consists of AFE circuits, a 12-tap DFE, and a phase detector (PD). The AFE circuits include an attenuator circuit, a variable-gain amplifier, and a CTLE. The CTLE can provide a boost of up to 15dB at 16GHz. The PD uses 2× oversampling to generate an early/late signal. The proposed CDR consists of a digital controller and a phase interpolator. The digital controller uses deserialized sampled data to generate UP/DN signals. However, in a high-loss environment, the UP/DN information of the sampled data could be wrong because of low SNR. A single burst of wrong sampled data during frequency pulling can easily cause the CDR to lose lock. To deal with the high-loss data signal, the proposed CDR adopts open-loop frequency acquisition. When the flash frequency acquisition operates, the autocorrelation network in the digital controller stores UP/DN information, weights the stored UP/DN information, and uses the autocorrelation function to acquire the frequency offset. After the frequency acquisition completes, the CDR enters closed-loop operation with the result of the frequency acquisition as the initial frequency. The loop settling time is set by a pre-defined duration alone, without requiring additional circuits or detectors. Unlike conventional methods [1-3], the FFA obtains the magnitude of the frequency offset and achieves a lock time independent of channel loss and frequency offset.

Figure 6.5.2 shows the design concept of the autocorrelation network for flash frequency acquisition. The sampled data from the sense amplifier are forwarded to a digital controller. Two consecutive data samples and an edge sample are used to produce the UP and DN signals. $UP[0]$ minus $DN[0]$ is written as $S[0]$, representing the current phase information. $S[n]$ denotes the phase information $n$ time periods apart from now. Because we sample the input data without changing the phase or frequency of the clock, if the frequency offset is non-zero, the phase information forms a periodic signal rotating like a circle in the phasor diagram, as shown in Fig. 6.5.2. The autocorrelation function (ACF), $R(n)$, measures how strongly a lagged copy of the signal is related to the original in a time series. The stored $S[n]$ can be used to find the magnitude of the frequency offset. $R(n)$ close to zero means the dot product of $S[0]$ and $S[n]$ is zero. If the dot product of two periodic signals is zero, the two signals are orthogonal. With orthogonality, the delay number, $n$, and the sampling clock period, $T_{ck}$, we can obtain the magnitude of the frequency offset from the formula shown in Fig. 6.5.2. In other words, each ACF block has a zero output at a corresponding frequency offset. For example, if the input frequency offset is 2000ppm, the first zero of $R(n)$ happens when $n$ is 125. As we can see, the normalized value of $R(41)$ is larger than $R(82)$, and $R(n)$ gradually decreases as $n$ goes from 0 toward 125. The ACF block for $n$ equal to 125 represents a 2000ppm frequency offset. As expected from $R(n)$, only one ACF result will have a value close to zero. The other ACF results will increase or decrease toward the boundary. The one-hot activation block finds the $R(n)$ closest to zero and raises the corresponding bit. By designing multiple ACF blocks, we can acquire multiple frequency detection results at the same time, enabling fast, constant-time acquisition.
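The 2000ppm example can be reproduced with an idealized model. The sketch below assumes that the phase information is a noise-free sinusoid that advances by the frequency offset every sampling period, so that the first ACF zero falls at a quarter rotation, $n = 1/(4\,\Delta f)$. The exact formula is in Fig. 6.5.2 and is not reproduced here, and the candidate lag set is hypothetical.

```python
import math

def phase_signal(offset_ppm, num_samples):
    """Idealized S[k]: a sinusoid rotating by offset_ppm*1e-6 cycles per sample."""
    delta = offset_ppm * 1e-6
    return [math.cos(2.0 * math.pi * delta * k) for k in range(num_samples)]

def normalized_acf(signal, lag, window):
    """R(lag) over a fixed window, normalized so that R(0) = 1."""
    num = sum(signal[k] * signal[k + lag] for k in range(window))
    den = sum(signal[k] * signal[k] for k in range(window))
    return num / den

def lag_to_offset_ppm(lag):
    """Quarter-rotation assumption: the ACF zero at lag n maps to 1/(4n)."""
    return 1e6 / (4.0 * lag)

def one_hot_select(signal, lags, window):
    """Return the candidate lag whose ACF result is closest to zero."""
    return min(lags, key=lambda lag: abs(normalized_acf(signal, lag, window)))

CANDIDATE_LAGS = (31, 62, 125, 250)  # hypothetical ACF blocks
WINDOW = 20000                       # 40 full rotations at 2000ppm

s = phase_signal(2000.0, WINDOW + max(CANDIDATE_LAGS))
for lag in (41, 82, 125):
    print(f"R({lag}) = {normalized_acf(s, lag, WINDOW):+.3f}")

selected = one_hot_select(s, CANDIDATE_LAGS, WINDOW)
print(f"selected lag {selected} -> {lag_to_offset_ppm(selected):.0f}ppm")
```

Each candidate lag is evaluated on the same stored samples, so all ACF blocks can run in parallel and the acquisition time does not depend on the offset being measured.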

Figure 6.5.3 shows the simulated $R(n)$ for different input frequency offsets, and $R(n)$ for two input frequency offsets with 10dB and 37.8dB channel loss. As seen from $R(n)$ with different input frequency offsets, the lag at which $R(n)$ crosses zero is inversely proportional to the magnitude of the input frequency offset. In other words, less information needs to be stored if the frequency offset is large. As shown in the 2000ppm and 4000ppm cases with two channels, the zero crossing of $R(n)$ is almost independent of channel loss, which allows this technique to support high channel loss. In other words, channel loss only degrades the SNR of $S[0]$ but does not affect the angle between $S[0]$ and $S[n]$. In simulations of the 37.8dB channel loss case, the deviation of the zero crossing of $R(n)$ causes only 35ppm of frequency acquisition uncertainty.
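The claim that loss degrades SNR without moving the zero crossing can be illustrated with a crude noise model. The sketch below is an assumption-laden stand-in for the simulations in Fig. 6.5.3: channel loss is modeled only as additive Gaussian noise ahead of a hard (bang-bang) decision, and the noise levels, seed, and lag set are hypothetical.

```python
import math
import random

def noisy_bangbang_phase(offset_ppm, noise_sigma, num_samples, seed=1):
    """Hard-limited phase information: sign of (rotating sinusoid + noise)."""
    rng = random.Random(seed)
    delta = offset_ppm * 1e-6
    out = []
    for k in range(num_samples):
        v = math.cos(2.0 * math.pi * delta * k) + rng.gauss(0.0, noise_sigma)
        out.append(1 if v >= 0.0 else -1)
    return out

def acf(signal, lag, window):
    """Average product of the signal with its lagged copy."""
    return sum(signal[k] * signal[k + lag] for k in range(window)) / window

def closest_to_zero(signal, lags, window):
    """Lag whose ACF magnitude is smallest (the one-hot decision)."""
    return min(lags, key=lambda lag: abs(acf(signal, lag, window)))

LAGS = (31, 62, 125, 250)  # hypothetical ACF blocks
WINDOW = 40000             # 80 full rotations at 2000ppm

results = {}
for label, sigma in (("low noise", 0.1), ("high noise", 1.0)):
    s = noisy_bangbang_phase(2000.0, sigma, WINDOW + max(LAGS))
    results[label] = closest_to_zero(s, LAGS, WINDOW)
    slope_proxy = abs(acf(s, 62, WINDOW))
    print(f"{label}: zero at lag {results[label]}, |R(62)| = {slope_proxy:.2f}")
```

In this model the noise shrinks the magnitude of the non-zero ACF results, but the quarter-rotation lag still yields the result closest to zero.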

The measured results of the frequency acquisition are shown in Fig. 6.5.4. In the measurement, the TX FIR is disabled and the CTLE provides an 8dB boost at the Nyquist rate. Background impairments include a 0.16UI sinusoidal jitter at 100MHz, a 100mV common-mode noise at 210MHz, and a 0.5ps<sub>rms</sub> random jitter. Frequency acquisition results for 10dB and 37.8dB channel loss with a PRBS31 pattern running at 32Gb/s are shown in bubble charts, in which larger circles represent higher probability. We apply the 32Gb/s data with a specified input frequency offset and record the acquisition result of the FFA as one sample. Every input frequency offset is measured with 50 samples for each channel loss case. A total of 500 samples are measured for the two channel cases. With the designed number of ACF blocks, the expected frequency resolution is 800ppm. The measured average residual frequency error is less than 500ppm. Under 37.8dB channel loss with a 1000ppm input frequency offset, a 1300ppm residual frequency error occurs with 4% probability. With a residual error of this size, the CDR can still achieve lock smoothly. The accuracy can be easily improved by increasing the number of ACF blocks and interpolating the ACF results before the one-hot activation block.

Figure 6.5.5 shows the measured frequency acquisition of two channels for various ACF blocks with their corresponding frequency offsets, the CDR locking behavior with different input frequency offsets under 37.8dB channel loss, and the CDR locking behavior for a 4000ppm input frequency offset with and without FFA. Frequency acquisition results of the 10dB and 37.8dB channel loss cases show that 4124ppm is the closest frequency offset for both channels because its ACF result is close to zero. For ACF blocks of other frequency offsets, the results keep increasing or decreasing until the designed boundary is reached. In the 10dB channel loss case, the ACF saturates within 280ns. In the 37.8dB channel loss case, the ACF result changes more slowly than in the 10dB case. As seen from Fig. 6.5.5, a 280ns acquisition time is enough to distinguish the ACF result of the target frequency offset from the others. As expected from $R(n)$, channel loss only affects the slope of the non-zero ACF results, not the zero-crossing position. The measured CDR locking waveforms are also shown in Fig. 6.5.5. In the initial open-loop operation, a 6500ppm frequency shift is applied to avoid an ambiguous sign of the frequency offset. After frequency acquisition, the residual frequency error is tracked by closed-loop operation. The lock time is 600ns and is independent of frequency offset. The CDR locking behavior with and without FFA is also shown in Fig. 6.5.5. By using FFA, the CDR achieves lock under 37.8dB channel loss without TX FIR and DFE compensation. Without FFA and with a 4000ppm input frequency offset, the CDR is not able to achieve lock because the CDR bandwidth required to pull in a large frequency offset degrades SNR.

Figure 6.5.6 summarizes the results and comparison with other frequency detection techniques. The flash frequency acquisition achieves the fastest lock time under the highest channel loss. This work obtains the magnitude of the frequency offset. Thanks to foreground open-loop operation, the flash frequency acquisition doesn't have static power consumption or performance impact. Figure 6.5.7 provides the die micrograph.

#### Acknowledgement:

The authors would like to thank Yu-Hsuan Tu and Hsien-Sheng Huang for their valuable suggestions and review.

#### References:

- [1] W. Rahman et al., "A 22.5-to-32Gb/s 3.2pJ/b Referenceless Baud-Rate Digital CDR with DFE and CTLE in 28nm CMOS," *ISSCC*, pp. 120-121, Feb. 2017.
- [2] C. Yu et al., "A 6.5-12.5-Gb/s Half-Rate Single-Loop All-Digital Referenceless CDR in 28-nm CMOS," *IEEE JSSC*, vol. 55, no. 10, pp. 2831-2841, Oct. 2020.
- [3] K. Park et al., "Design Techniques for a 6.4-32-Gb/s 0.96-pJ/b Continuous-Rate CDR With Stochastic Frequency-Phase Detector," *IEEE JSSC*, vol. 57, no. 2, pp. 573-585, Feb. 2022.



Figure 6.5.1: Receiver block diagram.



Figure 6.5.2: Autocorrelation network for flash frequency acquisition design concept.



Figure 6.5.3: Autocorrelation with different cases.



Figure 6.5.4: Measured frequency acquisition results.



Figure 6.5.5: Measured ACF results with two channels and CDR locking behavior.

|                                 | ISSCC 17[1]      | JSSC 20[2]     | JSSC 22[3]        | This work               |
|---------------------------------|------------------|----------------|-------------------|-------------------------|
| Technology [nm]                 | 28               | 28             | 40                | 5                       |
| Architecture                    | Quarter rate     | Half rate      | Quarter rate      | Quarter rate            |
| Data Rate [Gbps]                | 22.5 - 32        | 6.5 - 12.5     | 6.4 - 32          | 1.25 - 32               |
| Methodology                     | FD               | FD             | Pattern Weighting | Autocorrelation Network |
| Extra Clock Phase or Comparator | Extra Comparator | Extra Phase    | Free              | Free                    |
| Open Loop Detection             | No               | No             | No                | Yes                     |
| Acquire Frequency Magnitude     | No               | No             | No                | Yes                     |
| Constant Locking Time           | No               | No             | No                | Yes                     |
| Unlimited Locking Range         | No               | No             | Yes               | No                      |
| External Impairments            | Not reported     | Not reported   | Not reported      | SJ = 0.16UI@100MHz      |
| Data Pattern                    | PRBS31           | PRBS31         | PRBS31            | PRBS31                  |
| Channel Loss @Nyquist Rate [dB] | 14.8             | 8.6            | 10                | 37.8                    |
| Locking Time [us]               | <10100           | <1.5           | <11               | <0.6                    |
| Area [mm²]                      | 0.213            | 0.031          | 0.041             | 0.008▲                  |
| Power [mW]                      | 102 @ 32Gbps     | 21.13 @ 10Gbps | 30.8 @ 32Gbps     | 30.2 @ 32Gbps▲          |
| FOMp [pJ/bit/dB]                | 0.22             | 0.25           | 0.1               | 0.025                   |
| FOMs [us/dB]                    | 682              | 0.174          | 1.1               | 0.016                   |

▲ CDR only

Figure 6.5.6: Summary and comparison table.
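The two figures of merit in the last rows of the table are not defined in the text. The listed values are consistent with normalizing energy per bit and locking time by channel loss, so the following is a minimal sketch under that assumption; the function names `fom_p` and `fom_s` are illustrative and not taken from the paper.

```python
def fom_p(power_mw: float, rate_gbps: float, loss_db: float) -> float:
    """Energy per bit per dB of channel loss in pJ/bit/dB (assumed definition)."""
    # mW divided by Gb/s gives pJ/bit.
    return power_mw / rate_gbps / loss_db


def fom_s(lock_time_us: float, loss_db: float) -> float:
    """Locking time per dB of channel loss in us/dB (assumed definition)."""
    return lock_time_us / loss_db


# (power mW, data rate Gb/s, channel loss dB, locking-time bound us), copied from the table.
ROWS = {
    "ISSCC 17 [1]": (102.0, 32.0, 14.8, 10100.0),
    "JSSC 20 [2]": (21.13, 10.0, 8.6, 1.5),
    "JSSC 22 [3]": (30.8, 32.0, 10.0, 11.0),
    "This work": (30.2, 32.0, 37.8, 0.6),
}

for name, (p_mw, rate, loss, t_lock) in ROWS.items():
    print(f"{name}: FOMp={fom_p(p_mw, rate, loss):.3f}, FOMs={fom_s(t_lock, loss):.3f}")
```

Because the table reports locking time as an upper bound, the FOMs values computed this way are upper bounds as well.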



Figure 6.5.7: Die micrograph.

## 6.6 A 0.83pJ/b 52Gb/s PAM-4 Baud-Rate CDR with Pattern-Based Phase Detector for Short-Reach Applications

Seungwoo Park, Yoonjae Choi, Jincheol Sim, Jonghyuck Choi, Hyunsu Park, Youngwook Kwon, Chulwoo Kim

Korea University, Seoul, Korea

With increasing demand for 50Gb/s+ transceivers, PAM-4 modulation has become dominant over NRZ modulation [1-5], [7], and multiphase clocking is used to maximize data rate in a given process technology. However, the use of high-resolution phase interpolators (PIs) [1] or a 2$\times$-frequency oscillator (OSC) and multiple PIs [2] results in high power consumption. Therefore, baud-rate clock and data recovery (CDR) is an attractive replacement for 2$\times$ oversampling CDR in an energy-efficient receiver (RX). It requires only a single phase per UI, reducing the burden of multiphase clock generation and distribution. Recently, the Mueller-Müller (MM) phase detector (PD) has been widely utilized for baud-rate CDR in ADC-based PAM-4 RXs for elaborate equalization in a high-loss channel [3]. However, using a time-interleaved SAR ADC results in high power consumption, which is unsuitable for short-reach applications with low channel loss.

This paper proposes a simple baud-rate PAM-4 CDR to realize an energy-efficient RX design. Figure 6.6.1 compares the conventional 2$\times$ oversampling CDR and the proposed baud-rate CDR. In the conventional CDR, three comparators are utilized to distinguish the four PAM-4 levels, and one comparator with an additional phase is used to obtain clock information. The proposed CDR uses four comparators with only a single phase, which reduces the burden of multiphase clocking while maintaining the same load capacitance as a conventional CDR.

Figure 6.6.2 illustrates the top block diagram of the proposed CDR, which is designed using a quarter-rate architecture. The attenuated PAM-4 signal is equalized by a continuous-time linear equalizer (CTLE) with a boost of up to 8dB at 14GHz, set by tuning the resistor. Subsequently, the signal amplified by a variable-gain amplifier (VGA) is sampled by four comparators with four reference voltages ($V_{REFP}$, $V_{REFN}$, $V_{REFPM}$, and $V_{REFNM}$) per phase. The comparators are denoted $COMP_P$, $COMP_N$, $COMP_{PM}$, and $COMP_{NM}$. In the phase detection path, $COMP_{PM}$ and $COMP_{NM}$ are used for baud-rate phase detection. The acquired clock phase information is classified into three statuses (early, late, and stay) based on three consecutive data symbols. The majority voter (MV) selects the dominant status among four consecutive PD outputs, which relaxes the complexity and bandwidth of the clock recovery (CR) logic. The proportional path directly from the MV and the integral path from digitally synthesized logic control the digitally controlled ring OSC. In the data recovery path, if the voltage level of the signal is higher than $V_{REFP}$, +3 is decoded. Conversely, if the voltage level of the signal is lower than $V_{REFN}$, -3 is decoded; this is the conventional method that uses $COMP_P$ and $COMP_N$. The difference lies in the decoding of -1 and +1. $COMP_{PM}$ and $COMP_{NM}$, already used for phase detection, are reused instead of an additional comparator with $V_{REFM}$. Then, a time-based decoder decodes -1 and +1. Therefore, the number of comparators following the VGA can be reduced. The time-based technique, which converts the voltage domain to the time domain, is used in [7] to mitigate the voltage variations of a comparator and to reduce the number of comparators. However, the time-based decoder of [7] has a small timing margin owing to the nonlinearity of the comparator decision time.
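The majority voter described above can be sketched behaviorally. The text does not say how "stay" outputs or ties are handled, so this sketch assumes that "stay" outputs abstain and that a tie between early and late resolves to "stay"; the names `EARLY`, `STAY`, `LATE`, and `majority_vote` are hypothetical.

```python
EARLY, STAY, LATE = -1, 0, 1


def majority_vote(pd_outputs):
    """Reduce consecutive PD outputs to one status.

    Assumption: STAY outputs abstain, and an early/late tie yields STAY.
    """
    n_early = sum(1 for s in pd_outputs if s == EARLY)
    n_late = sum(1 for s in pd_outputs if s == LATE)
    if n_early > n_late:
        return EARLY
    if n_late > n_early:
        return LATE
    return STAY


# Four consecutive PD outputs, one per quarter-rate phase.
print(majority_vote([EARLY, STAY, STAY, STAY]))  # EARLY wins with a single vote
print(majority_vote([EARLY, LATE, STAY, STAY]))  # tie, so STAY
```

Voting once per group of four outputs means the downstream CR logic only has to process one decision per four UIs, which is the bandwidth relaxation the text refers to.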

Figure 6.6.3 shows the operation principle of the pattern-based baud-rate PD. Eight cases (C1-C8) based on three consecutive data patterns are selected to obtain the clock phase information on whether the clock is early or late. C5 to C8 represent the differential signal versions of C1 to C4, respectively. If the current data ($d[n]$) is +1, clock phase information is acquired based on $V_{REFPM}$, and if it is -1, clock phase information is acquired based on $V_{REFNM}$. For example, if the clock is early in C1, the sampled current error ($E[n]$) is lower than $V_{REFPM}$, and if the clock is late in C1, $E[n]$ is higher than $V_{REFPM}$. The numbers of all cases are listed in the table. For all patterns other than the selected eight cases, the PD maintains the clock phase, so the transition density is 0.125 (=8/4<sup>3</sup>, since three consecutive PAM-4 symbols form 64 patterns). Unlike NRZ MM-CDR [6], the slope of the sampling point is secured because three consecutive rising or falling patterns are used, even if pre- and post-cursor ISI is close to zero. To explain the clock lock point with ISI, the dominant 1-tap pre-cursor ($h_{-1}$) and post-cursor ($h_1$) ISI are considered. Three consecutive bit responses of C1 to C8 are shown by combining the simulated single bit responses (SBRs) of the 7.8dB channel loss with and without CTLE. Because C1 to C4 (and likewise C5 to C8) do not cross at a single point, the PD has two lock options for lowering the recovered clock jitter. First, the PD adopts four cases (C1, C2, C5, and C6) out of the eight cases (C1 to C8) in the under-equalized situation. Then, the PD locks to the one point where $h_1=h_{-1}$ at the expense of reduced transition density. Second, the PD adopts all eight cases (C1 to C8) in the equalized situation. With proper equalization in the low-loss channel, ISI can be well compensated to $h_1, h_{-1} \approx 0$, so the four lock points gather near one point. $V_{REFPM}$ is set slightly above $h_0$, and $V_{REFNM}$ is set slightly under $-h_0$ for the optimal point. In this work, a single-stage CTLE is implemented for an energy-efficient equalizer, and the second lock option is chosen.
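The transition density quoted above can be checked by enumeration. The exact definition of C1 to C8 appears only in Fig. 6.6.3, so this sketch assumes that the eight cases are the three-symbol sequences that rise or fall strictly, which matches the description of "three consecutive rising or falling patterns"; the helper `is_selected` is hypothetical.

```python
from itertools import product

LEVELS = (-3, -1, 1, 3)  # PAM-4 symbol levels


def is_selected(prev, cur, nxt):
    """Assumed selection rule: three consecutive strictly rising or falling symbols."""
    return prev < cur < nxt or prev > cur > nxt


patterns = list(product(LEVELS, repeat=3))          # all 4**3 = 64 three-symbol patterns
selected = [p for p in patterns if is_selected(*p)]
density = len(selected) / len(patterns)

print(len(selected), len(patterns), density)
# Under this rule the middle symbol d[n] of every selected pattern is +1 or -1.
print(sorted({p[1] for p in selected}))
```

Under this assumption the middle symbol is always an inner level, which agrees with the PD using only $V_{REFPM}$ (for $d[n]$ = +1) and $V_{REFNM}$ (for $d[n]$ = -1).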

Figure 6.6.4 shows the circuit implementation and the operation of the data recovery path. The detailed operation is illustrated for two phases (CK0,90) in simulated transient waveforms. A conventional four-input strong-arm latch comparator is implemented, and the difference in decision time between $COMP_{PM}$ and $COMP_{NM}$ is used for -1 and +1 decoding. When the data is +1, there is little voltage difference between the sum of $V_{INP}$ and $V_{REFNM}$ and the sum of $V_{INN}$ and $V_{REFPM}$ at $COMP_{PM}$. The output nodes $PM_P$ and $PM_N$ are therefore slowly resolved to 1 and 0, or to 0 and 1, depending on noise sources such as ISI, data jitter, clock jitter, and reflection. They may not even be resolved (metastability) before the pre-charge action. Meanwhile, because there is a sufficient voltage difference between the sum of $V_{INP}$ and $V_{REFPM}$ and the sum of $V_{INN}$ and $V_{REFNM}$ at $COMP_{NM}$, output nodes $NM_P$ and $NM_N$ are quickly resolved to 1 and 0, respectively. For these reasons, the decision time of $COMP_{NM}$ is shorter than that of $COMP_{PM}$, and node $I_{NM_P}$ rises first among nodes $I_{PM_{P,N}}$ and $I_{NM_{P,N}}$. By the same mechanism, when the data is -1, the decision time of $COMP_{PM}$ is shorter than that of $COMP_{NM}$, so node $I_{PM_P}$ rises first. The time-based decoder detects which rises first, the positive input nodes ($I_{PM_P}$, $I_{PM_N}$) or the negative input nodes ($I_{NM_P}$, $I_{NM_N}$). When CK is low, the output nodes (C and $\bar{C}$) are pre-charged to VDD, and the input nodes are pre-charged to 0 by the comparator and inverter. When CK is high, the input node with the faster rise turns on its MOSFET, and C and $\bar{C}$ are then determined by latch operation. Finally, $\bar{A}$, B (from $COMP_P$ and $COMP_N$), C, and $\bar{C}$ (from the time-based decoder) decode the four-level PAM-4, as presented in the decoding table.
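The decoding flow above can be illustrated with a behavioral model. This is a sketch, not the circuit: it assumes the common first-order latch model in which the decision time grows as the logarithm of the inverse input difference, and all voltage values, `tau`, and the function names are hypothetical.

```python
import math


def decision_time(v_in, v_ref, tau=1.0, v_full=1.0, v_min=1e-6):
    """First-order regeneration model: t = tau * ln(v_full / |v_in - v_ref|).

    A smaller input difference gives a longer decision time; v_min caps the
    time in the near-metastable case.
    """
    dv = max(abs(v_in - v_ref), v_min)
    return tau * math.log(v_full / dv)


def decode_pam4(v_in, v_refp, v_refn, v_refpm, v_refnm):
    """Decode one sample into -3, -1, +1, or +3."""
    if v_in > v_refp:       # outer level, voltage-domain decision (COMP_P)
        return 3
    if v_in < v_refn:       # outer level, voltage-domain decision (COMP_N)
        return -3
    t_pm = decision_time(v_in, v_refpm)   # COMP_PM
    t_nm = decision_time(v_in, v_refnm)   # COMP_NM
    # Inner levels, time-domain decision: the faster comparator wins.
    return 1 if t_nm < t_pm else -1


# Hypothetical levels of +/-0.1 V and +/-0.3 V; references chosen for illustration only.
refs = dict(v_refp=0.2, v_refn=-0.2, v_refpm=0.11, v_refnm=-0.11)
print([decode_pam4(v, **refs) for v in (0.3, 0.1, -0.1, -0.3)])
```

The inner-level decision depends only on which comparator resolves first, so no separate comparator at the mid-level reference is needed in this model.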

The proposed CDR is fabricated in a 28nm CMOS process, and the test chip is wire bonded to a PCB. The channel loss of the cable, power divider, and PCB trace is 7.1dB at 13GHz. To confirm the CDR performance, a 52Gb/s PRBS7 PAM-4 input signal is generated by using two input programmable pattern generators (PPG) and a power divider. Once the initial frequency of the ring OSC is manually controlled close to 6.5GHz, the CR logic is turned on. The reference voltages are manually controlled so that the CDR is locked and the BER is minimized. The measured results are shown in Fig. 6.6.5. When the CDR is locked, the phase noise of the recovered clock (divided by 2 for the BER measurement) is measured using a signal source analyzer. The integrated jitter from 1kHz to 100MHz is 430fs. Additionally, jitter tolerance (JTOL) is measured by adding sinusoidal jitter in the PPG from 100kHz to 100MHz. At a BER <10<sup>-12</sup>, the measured jitter tolerance exceeds the jitter tolerance mask for CEI-56G-VSR.
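The frequencies quoted in the measurement setup follow from the data rate. As a short worked check (the function name `clocking` is illustrative):

```python
BITS_PER_PAM4_SYMBOL = 2


def clocking(data_rate_gbps, ways=4):
    """Return (baud rate in GBd, Nyquist frequency in GHz, per-phase clock in GHz)."""
    baud = data_rate_gbps / BITS_PER_PAM4_SYMBOL
    nyquist = baud / 2          # channel loss is quoted at this frequency
    clock = baud / ways         # quarter-rate architecture when ways=4
    return baud, nyquist, clock


print(clocking(52.0))
```

For 52Gb/s PAM-4 this gives 26GBd, a 13GHz Nyquist frequency, and a 6.5GHz quarter-rate clock, matching the frequency at which the channel loss is reported and the initial ring OSC frequency.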

The proposed CDR performance is summarized and compared with the recent 48-64Gb/s PAM-4 receivers in Fig. 6.6.6. The pattern-based PD has a simple baud-rate operation suitable for short-reach applications, which can reduce the power consumption of the clocking. Moreover, owing to the shared path for data recovery and phase detection, only four comparators are utilized per UI, and a sufficient bandwidth is obtained without using area-consuming inductors in the CTLE and VGA. With the proposed technique, best energy efficiency is achieved with the smallest active area among the compared works. Figure 6.6.7 shows a chip micrograph and area breakdown. The core circuit occupies an area of 0.011mm<sup>2</sup>.

### References:

- [1] E. Depaoli et al., "A 4.9pJ/b 16-to-64Gb/s PAM-4 VSR Transceiver in 28nm FDSOI CMOS," ISSCC, pp. 112-113, Feb. 2018.
- [2] S. Shahramian et al., "A 1.41pJ/b 56Gb/s PAM-4 Wireline Receiver Employing Enhanced Pattern Utilization CDR and Genetic Adaptation Algorithms in 7nm CMOS," ISSCC, pp. 482-483, Feb. 2019.
- [3] B.-J. Yoo et al., "A 56Gb/s 7.7mW/Gb/s PAM-4 Wireline Transceiver in 10nm FinFET Using MM-CDR-Based ADC Timing Skew Control and Low-Power DSP with Approximate Multiplier," ISSCC, pp. 122-123, Feb. 2020.
- [4] C. Wang et al., "A 52-Gb/s Sub-1pJ/bit PAM4 Receiver in 40-nm CMOS for Low-Power Interconnects," IEEE Symp. VLSI Circuits, June 2019.
- [5] H. Ju et al., "Design Techniques for 48-Gb/s 2.4-pJ/b PAM-4 Baud-Rate CDR With Stochastic Phase Detector," IEEE JSSC, vol. 57, no. 10, pp. 3014-3024, Oct. 2022.
- [6] M.-C. Choi et al., "A 0.1-pJ/b/dB 28-Gb/s Maximum-Eye Tracking, Weight-Adjusting MM CDR and Adaptive DFE with Single Shared Error Sampler," IEEE Symp. VLSI Circuits, June 2020.
- [7] H. Park et al., "A 56-Gb/s PAM-4 receiver using time-based LSB decoder and S/H technique for robustness to comparator voltage variations," IEEE JSSC, vol. 57, no. 2, pp. 562-572, Feb. 2022.



Figure 6.6.1: Comparison of the conventional 2x oversampling CDR and the proposed baud-rate CDR, shown in a full-rate system.



Figure 6.6.2: Top block diagram of the proposed CDR.



Figure 6.6.3: Operation principle of the pattern-based baud-rate phase detector (top), clock lock point considering ISI (bottom).



Figure 6.6.4: Circuit implementation and simulated transient waveforms of the data recovery path.



Figure 6.6.5: Measured results: power breakdown (top left), phase noise of the recovered clock (top right) and jitter tolerance (bottom).

|                         | ISSCC'18 [1]          | ISSCC'19 [2]          | VLSI'19 [4]            | ISSCC'20 [3]             | JSSC'22 [5]               | This work                 |
|-------------------------|-----------------------|-----------------------|------------------------|--------------------------|---------------------------|---------------------------|
| Technology              | 28nm FDSOI-CMOS       | 7nm FinFET            | 40nm CMOS              | 10nm FinFET              | 40nm CMOS                 | 28nm CMOS                 |
| Modulation              | PAM-4                 | PAM-4                 | PAM-4                  | PAM-4                    | PAM-4                     | PAM-4                     |
| Data Rate [Gb/s]        | 64                    | 56.25                 | 52                     | 56                       | 48                        | 52                        |
| Channel Loss [dB]       | 16.8                  | 17.8                  | 7.3                    | 38                       | 4                         | 7.1                       |
| Clock architecture      | Quarter-rate PI-based | Quarter-rate PI-based | Quarter-rate PLL-based | 32-way PI-based          | Half-rate PI-based        | Quarter-rate PLL-based    |
| PD type                 | 2x oversampling       | 2x oversampling       | 2x oversampling        | Baud-rate (MM PD)        | Baud-rate (Stochastic PD) | Baud-rate (Pattern-based) |
| # of COMPs per UI       | 4                     | 4                     | 3.5*                   | 32 x 7b ADC              | 5                         | 4                         |
| BER                     | $10^{-12}$            | $10^{-12}$            | $10^{-12}$             | $10^{-10}$               | $10^{-11}$                | $10^{-12}$                |
| Equalization            | TX FIR CTLE           | TX FIR CTLE           | CTLE 1-tap FFE         | CTLE 9-tap FFE 1-tap DFE | CTLE 1-tap DFE            | CTLE                      |
| Inductor-less           | No                    | No                    | No                     | No                       | No                        | Yes                       |
| Power [mW]              | 180                   | 79                    | 48                     | 321.2**                  | 116.3                     | 43.1                      |
| Energy Eff. [pJ/bit]    | 2.8                   | 1.41                  | 0.92                   | 7.7**                    | 2.42                      | 0.83                      |
| Area [mm<sup>2</sup>]   | 0.32                  | 0.13                  | 0.72                   | 0.72 / Lane              | 0.24                      | 0.011                     |

\*To save power, only two edge comparators are utilized for clock recovery

\*\*TX+RX

Figure 6.6.6: Performance summary and comparison with previous works.



Figure 6.6.7: Chip microphotograph and area breakdown.

## 6.7 A 128Gb/s PAM-4 Transmitter with Programmable-Width Pulse Generator and Pattern-Dependent Pre-Emphasis in 28nm CMOS

Kai Sheng, Weixin Gai, Zeze Feng, Haowei Niu, Bingyi Ye, Hang Zhou

Peking University, Beijing, China

The ever-growing demands for high-bandwidth communications continuously push wireline links to operate at higher speeds. Recently reported transmitters (TXs) have achieved a data rate of more than 100Gb/s [1-6]. PAM-4 modulation, which doubles the data rate at the same symbol rate, has been widely adopted to make use of the link bandwidth more efficiently. However, the complex transitions introduce greater data-dependent jitter, decreasing the horizontal eye-opening. In addition, the transitions between non-adjacent levels produce two or three times the inter-symbol interference (ISI) of transitions between adjacent levels, resulting in reduced vertical eye-opening. Although a feed-forward equalizer (FFE) can be used to mitigate these issues, it is usually implemented in a de-emphasis manner in PAM-4 TXs, which reduces the output swing and lowers the signal-to-noise ratio. The proposed TX incorporates a pulse generator with programmable width for optimizing transition edges and a pattern-dependent pre-emphasis scheme that performs equalization without sacrificing output swing.

The block diagram of the TX is presented in Fig. 6.7.1. The MSB and LSB data are generated with a pattern generator and serialized with two 64:16 serializers. The data are then sent to six segments for FFE and three segments for pattern-dependent pre-emphasis. Each FFE segment can be configured as the pre-tap, the main-tap, the 1<sup>st</sup> post-tap, or the 2<sup>nd</sup> post-tap through the data selection logic, which is followed by a 16:4 serializer and a retimer. The final stage combines a 1-UI pulse-based 4:1 MUX with a CML driver (4:1 & Driver). The current of the driver in different segments can be adjusted independently to finely tune the FFE tap coefficients. The three segments for pattern-dependent pre-emphasis apply to three kinds of transitions. In each segment, the pattern detection logic finds specific transitions and produces two flag signals, which are serialized and retimed to match the delay in FFE segments. The 4:1 MUX and current-injection switch (4:1 & Switch) apply the pre-emphasis to the output signal accordingly. The data selection and pattern detection logic are inserted before the 16:4 serializer to minimize the power dissipation while maintaining sufficient timing margin. The output network incorporates shunt peaking and a T-coil to extend the output bandwidth. The TX receives 4-phase quarter-rate clocks and utilizes clock selection to address the timing issue for the retimer.

The schematics of the final 4:1 MUX and pulse generator are shown in the top half of Fig. 6.7.2. Four CML differential pairs driven by 1-UI pulse generators share one tail current and the pulses P0~P3 (PB0~PB3) alternately arise to switch the current. The switching speed is the key to realizing faster transitions and it depends on the voltage of the crossing point of two adjacent pulses, which can be considered as the common mode voltage of the CML driver. As shown in the bottom left of Fig. 6.7.2, the gain of the driver reaches its maximum at 0.7V common mode voltage, which is 3dB higher compared to the gain at 0.4V common mode voltage, leading to 40% faster transition edges. In order to adjust the common-mode voltage, we propose a pulse generator with programmable width as shown in the top right of Fig. 6.7.2. The node X drops to "0" and the input signal D0 is transferred to the output node P0 when CK4I and CK4Q are both high. Five units of discharging branches are added to node X to adjust the pulse width. When CK4I is high and CK4Q is low (1UI before the output pulse is generated), the branches are turned on to discharge node X and the output pulse starts to rise earlier than the rising edge of CK4Q. By increasing the discharging strength through PG\_CTRL<4:0>, the overlap of adjacent pulses becomes wider and the common-mode voltage becomes higher. The simulated waveforms with different discharging strengths are illustrated in the bottom of Fig. 6.7.2, indicating a tuning range of 0.4V to 0.85V. A 13% increase in eye width is obtained according to the measured eye diagrams with minimal and optimal common mode voltages. Since the charge on node X will all be discharged eventually, the discharging branches only pre-discharge a small fraction of the charge and they do not cause any extra power dissipation.
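The 3dB gain difference and the 40% faster edges quoted above are numerically related if the gain is read as a voltage gain. The following is a quick check under that assumption; the link between driver gain and edge rate is taken from the text, and the helper name is illustrative.

```python
def db_to_voltage_ratio(db: float) -> float:
    """Convert a gain in dB to a linear voltage ratio (20*log10 convention)."""
    return 10 ** (db / 20)


ratio = db_to_voltage_ratio(3.0)
print(f"3 dB corresponds to a voltage ratio of {ratio:.3f}")
```

A ratio of about 1.41 is in line with the roughly 40% improvement in transition speed reported for the 0.7V common-mode setting relative to 0.4V.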

Figure 6.7.3 shows the implementation and operation principle of the pattern-dependent pre-emphasis. The transitions are decomposed into three categories as depicted in the top left of Fig. 6.7.3. Three pattern detection logics in each segment monitor the data streams and produce pull-up and pull-down flags. The pull-up flag becomes "+1" when a rising transition is detected and the pull-down flag becomes "-1" when a falling transition is detected; otherwise, both flags stay "0". After serializing and retiming, the pull-up and pull-down flags control the switches M2 and M1 that are connected to OUTN and OUTP, respectively. When the pull-up flag is "+1", a positive pulse appears at node YN, and switch M2 injects a single-ended current to OUTN. As a result, the falling edge of OUTN speeds up and the rising edge of the differential output gets enhanced. The pull-down flag works in the same way. The transistors below M1/M2 serve the purpose of strength tuning. The high voltage of the control signal EQ\_CTRL<3:0> is limited to half V<sub>DD</sub> so that the transistors are biased in the saturation region and the output impedance of the TX is not affected. The injection currents in three segments are proportional to the transition amplitudes to avoid distortions. The operation principle and the equalizing effects are shown in the bottom half of Fig. 6.7.3. The three kinds of pre-emphasis accelerate corresponding transitions and eliminate ISI while keeping the swing unchanged. Given that the ISI and transitions depend on more than two consecutive symbols for lossy channels, increasing the length of the monitored patterns can provide better equalization. However, this leads to an exponential growth in the number of current-injection switches, degrading the bandwidth. In this work, the length of the monitored patterns is chosen to be two.
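The flag generation can be sketched at the symbol level. The exact decomposition into three categories is shown only in Fig. 6.7.3, so this sketch assumes that the categories correspond to transitions of one, two, and three level steps, consistent with injection currents that are proportional to the transition amplitude. The function names are hypothetical.

```python
def detect(prev, cur):
    """Return (segment, pull_up, pull_down) for one transition between PAM-4 symbols.

    Symbols are -3, -1, +1, +3. Assumed mapping: segment k handles k-step transitions.
    """
    step = (cur - prev) // 2          # signed number of level steps, -3..+3
    if step == 0:
        return None, 0, 0             # no transition, both flags stay 0
    return abs(step), (1 if step > 0 else 0), (-1 if step < 0 else 0)


def flag_streams(symbols):
    """Per-segment (pull_up, pull_down) flags for a symbol sequence (pattern length two)."""
    flags = {1: [], 2: [], 3: []}
    for prev, cur in zip(symbols, symbols[1:]):
        seg, up, down = detect(prev, cur)
        for s in flags:
            flags[s].append((up, down) if s == seg else (0, 0))
    return flags


print(flag_streams([-3, 3, 1, 1]))
```

Only two consecutive symbols are inspected, which mirrors the monitored pattern length chosen in this work; a longer window would multiply the number of flag streams and injection switches.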

Figure 6.7.4 depicts the block diagram of the clock path. The quarter-rate clocks are received through TIA-based buffers to match the impedance. Two resonant buffers are employed for low-power clock distribution, which use inductors to neutralize the capacitance from wires and transistor loading. To meet the stringent timing constraint for the retimer, a phase rotator is used in [6] at the cost of extra power. This work utilizes a low-power selection-based solution, as shown in the bottom left of Fig. 6.7.4, which passes through two of the four CK4 phases. With different S1 and S2 settings, two complementary clocks are selected to feed the divider and the selected CK4 has a tuning step of 1UI. Therefore, CK8 and the output data of the 8:4 MUX have a flexible delay relative to CK4I that triggers the first stage in the retimer. The clock relations are shown in the right half of Fig. 6.7.4. Simulation results indicate that the clock selection circuit only consumes 3mW of power and the 1-UI tuning step is enough for 64Gbaud operation in 28nm CMOS technology across PVT corners. In addition, the symmetry of this selection circuit guarantees that the loading seen by 4-phase clocks remains the same and no phase error is caused.

The TX is implemented in 28nm CMOS technology and tested using a real-time oscilloscope. The TX output signals are measured through the probe, cables, and DC blocks, which introduce 4dB loss. The differential output swing is 0.84V. The top left of Fig. 6.7.5 shows the measured 64Gb/s NRZ eye diagram with an eye width of 0.63UI and an eye height of 246mV. In the bottom of Fig. 6.7.5, two measured 128Gb/s PAM-4 eye diagrams are presented to show the effectiveness of the pattern-dependent pre-emphasis. When the pre-emphasis is turned on, the equalization provided by the FFE is reduced and hence a higher swing is achieved. The eye-opening areas of the upper, middle, and lower eyes increase by 26%, 23%, and 25%, respectively. The performance summary and comparison with recently published 100+Gb/s TXs are given in Fig. 6.7.6. This work presents an edge optimization technique and a pre-emphasis scheme for PAM-4 TX with an energy efficiency of 1.4pJ/b. The die photo is shown in Fig. 6.7.7.

### Acknowledgement:

This work was supported by the National Key R&D Program of China under Grant 2018YFB2202301.

### References:

- [1] J. Kim et al., "A 112Gb/s PAM-4 transmitter with 3-Tap FFE in 10nm CMOS," ISSCC, pp. 102-103, Feb. 2018.
- [2] Z. Toprak-Deniz et al., "A 128Gb/s 1.3pJ/b PAM-4 Transmitter with Reconfigurable 3-Tap FFE in 14nm CMOS," ISSCC, pp. 122-123, Feb. 2019.
- [3] P.-J. Peng et al., "A 112Gb/s PAM-4 Voltage-Mode Transmitter with 4-Tap Two-Step FFE and Automatic Phase Alignment Techniques in 40nm CMOS," ISSCC, pp. 124-125, Feb. 2019.
- [4] E. Groen et al., "A 10-to-112Gb/s DSP-DAC-Based Transmitter with 1.2Vppd Output Swing in 7nm FinFET," ISSCC, pp. 120-121, Feb. 2020.
- [5] M. A. Kossel et al., "An 8b DAC-Based SST TX Using Metal Gate Resistors with 1.4pJ/b Efficiency at 112Gb/s PAM-4 and 8-Tap FFE in 7nm CMOS," ISSCC, pp. 130-131, Feb. 2021.
- [6] J. Kim et al., "A 224Gb/s DAC-Based PAM-4 Transmitter with 8-Tap FFE in 10nm CMOS," ISSCC, pp. 126-127, Feb. 2021.



Figure 6.7.1: Block diagram of the TX.



Figure 6.7.2: Schematics of the 4:1 MUX and pulse generator and effects of the pulse-width adjustment.



Figure 6.7.3: Implementation and operation principle of the pattern-dependent pre-emphasis.



Figure 6.7.4: Block diagram of the clock path and relationship between CK4 and CK8.



Figure 6.7.5: Measured eye diagrams: 64Gb/s NRZ (top left), 128Gb/s PAM-4 (bottom left), and 128Gb/s PAM-4 with pre-emphasis (bottom right).

|                                      | This work                                  | ISSCC'18 [1] | ISSCC'19 [2] | ISSCC'19 [3] | ISSCC'20 [4] | ISSCC'21 [5] |
|--------------------------------------|--------------------------------------------|--------------|--------------|--------------|--------------|--------------|
| Technology                           | 28nm                                       | 10nm         | 14nm         | 40nm         | 7nm          | 7nm          |
| Modulation                           | PAM-4                                      | PAM-4        | PAM-4        | PAM-4        | PAM-4        | PAM-4        |
| Data Rate (Gb/s)                     | 128                                        | 112          | 128          | 112          | 112          | 112          |
| Driver Type                          | CML (with edge optimization)               | CML          | Tailless CML | SST          | H-bridge     | SST          |
| Output Swing ( $V_{ppd}$ )           | 0.84                                       | 0.75         | 1            | 1            | 1.2          | -            |
| Equalization                         | 4-tap FFE + Pattern-Dependent Pre-Emphasis | 3-tap FFE    | 3-tap FFE    | 4-tap FFE    | 7-tap FFE    | 8-tap FFE    |
| Efficiency (pJ/b) (with clocking)    | 1.4                                        | 2.07         | 1.3          | 3.89         | 1.56         | 1.4          |
| Efficiency (pJ/b) (without clocking) | 0.9                                        | 1.72         | -            | -            | 1.05         | -            |
| Area ( $\text{mm}^2$ )               | 0.137                                      | 0.030        | 0.048        | 0.560        | 0.193        | 0.032        |

Figure 6.7.6: Performance summary and comparison with prior works.



Figure 6.7.7: Die photo.

## 6.8 A 100Gb/s 1.6V<sub>ppd</sub> PAM-8 Transmitter with High-Swing 3+1 Hybrid FFE Taps in 40nm

Jeonghyu Yang\*, Eunji Song\*, Seungwook Hong, Dongjun Lee, Sangwan Lee, Hyunwoo Im, Taeho Shin, Jaeduk Han

Hanyang University, Seoul, Korea

\*Equally Credited Authors (ECAs)

To match pace with the performance enhancement of computing systems for data-centric applications, data rates of high-speed I/O transceivers for the computing systems are increasing beyond 100Gb/s/lane. Four-level pulse-amplitude modulation (PAM-4) signaling has been widely adopted due to its energy efficiency and high data rate under finite channel bandwidth [1–4]. To overcome the timing constraints related to the dynamic power consumption and horizontal eye margin, transmitters with higher modulation complexities such as PAM-8 are required for next-generation wireline communication systems. However, as observed in previous PAM-4 transmitter designs, multi-level signaling schemes suffer from their limited signal-to-noise ratios (SNR) and increased bit-error ratios (BER), as the dynamic range is divided down due to maximum swing constraints. To achieve a high data rate (100Gb/s/lane) without sacrificing the eye-margin performance, this paper presents a reconfigurable high-swing current-mode transmitter with hybrid FFE tap configurations and high-speed frontend serializers in 40nm CMOS technology.

Figure 6.8.1 shows the overall architecture of the PAM-8 transmitter. The external differential 16.65GHz clock drives an internal C2MOS clock divider for quadrature clocking (C4[0:3]). The clock phase selectors and subsequent dividers produce serialization clocks with sufficient timing margin for the frontend 4:1 serialization and retiming stages. The data-serialization path receives 128×4-bit patterns. Among them, 128×3 bits (MSB, MIDB, LSB) are used to generate PAM-8 3-tap FFE signals, and the remaining 128 bits (AUXB) are applied to the sliding-tap feed-forward equalization path for the most-significant bit (MSB) signal, which produces the largest amount of ISI among the 3-bit words. Then the input data streams are serialized in the subsequent serializer chain and transformed to tap switching signals in the 3-tap FIR shuffler block for the selected FFE preset. The final serialization is done by ten 4-to-1 high-speed serializers (HSSERs), which are composed of retimers, 1-UI pulse generators, a single-stack 4:1 high-speed multiplexer (HSMUX), and predrivers. Finally, the ten pairs of 33.3Gbaud differential signals are applied to the high-swing FFE driver, which is connected to the output pads via T-coils.
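The symbol rate and data rate follow from the external clock frequency. This worked check assumes that the 16.65GHz clock is a half-rate clock (two symbols per clock period) and that the quadrature phases C4[0:3] run at half that frequency; both are inferences from the quoted numbers rather than statements in the text.

```python
import math

EXT_CLOCK_GHZ = 16.65      # external differential clock, from the text
PAM_LEVELS = 8

bits_per_symbol = math.log2(PAM_LEVELS)        # 3 bits for PAM-8
baud_gbd = 2 * EXT_CLOCK_GHZ                   # assumption: half-rate clocking
data_rate_gbps = baud_gbd * bits_per_symbol
quadrature_clock_ghz = EXT_CLOCK_GHZ / 2       # assumption: divide-by-2 to four phases

print(f"{baud_gbd:.1f} GBd, {data_rate_gbps:.1f} Gb/s, C4 at {quadrature_clock_ghz:.3f} GHz")
```

Under these assumptions the link runs at 33.3GBd and about 99.9Gb/s, and each of the four quadrature phases serves one UI of the final 4:1 serialization.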

To compensate for the SNR penalty arising from the use of high-order PAM signaling, the transmit driver needs to support a high output voltage swing. Figure 6.8.2 shows the schematic diagram of the cascode current-mode PAM-8 FFE driver with current bleeders and their associated predrivers, which achieves a 1.6V<sub>ppd</sub> output swing. As shown in the previous work on the PAM-4 driver without FFE [6], the Vgd stress of the driver input transistors is regulated by the protective cascode transistors (M0 and M1), in combination with current bleeders (M4, M5, and IBLD) that provide a leakage discharge path and prevent the Vgd of the switch transistors (M2 and M3) from rising when they are turned off. The driver consists of 9 cascode current-mode logic (CML) slices for PAM-8 signaling with 3-tap normal FFE and one auxiliary tap for MSB FFE. The driver tap slices are sized based on the digit and tap strength ratio, while the current bleeders do not scale with their associated tap strength to avoid excessive area consumption in high-order PAM drivers. The auxiliary tap modulates the MSB pulse response only, as the impact of the MSB on the inter-symbol interference (ISI) and eye-opening is much higher than that of the other bits, requiring a higher equalization capability. The auxiliary tap receives the N-cycle delayed bit stream of the MSB from the digital backend for additional flexibility such as sliding-tap operations. The interconnect loading capacitances (denoted as Cw capacitors in Fig. 6.8.2) do not scale down linearly with the tap sizes, which increases the loading effects for smaller taps. The input loadings of the driver taps are therefore precisely matched with the driving capabilities of their predriver counterparts by properly upsizing the predrive inverters and attaching dummy transistors for small taps (Fig. 6.8.2). Similar techniques are applied to determine the sizing of the skewed inverters that precede the predrive inverters, as shown in Fig. 6.8.2.
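The sizing rule for the nine normal-tap slices can be illustrated numerically. The sketch below is not the authors' sizing procedure: it assumes binary digit weights (4:2:1 for MSB, MIDB, LSB) and takes the 1x/4x/1x tap strengths from the table header in Fig. 6.8.3. The auxiliary tap is omitted because its strength is not given in the text.

```python
# Assumption: binary-weighted digits for the 3-bit PAM-8 word.
DIGIT_WEIGHT = {"MSB": 4, "MIDB": 2, "LSB": 1}
# Tap strengths as listed in the header of the table in Fig. 6.8.3.
TAP_WEIGHT = {"TAPA": 1, "TAPB": 4, "TAPC": 1}


def slice_weights():
    """Relative size of each (digit, tap) slice as digit weight times tap weight."""
    return {
        (bit, tap): dw * tw
        for bit, dw in DIGIT_WEIGHT.items()
        for tap, tw in TAP_WEIGHT.items()
    }


weights = slice_weights()
largest = max(weights.values())
smallest = min(weights.values())
print(len(weights), "slices, size ratio", largest, ":", smallest)
```

With these assumed weights the largest and smallest slices differ by 16:1, which is consistent with the need for dummy loading on the small taps described above.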

The FFE tap constellations can be (re)configured based on the channel characteristics and potential receiver-side equalization capabilities, by allocating (delayed) data patterns to proper driver taps in the 3-tap data shuffler. Unlike the segmented FFE approaches with full data and/or clock phase controls [4], which provide a full manipulation capability of FFE tap assignments at the expense of power and complexity overhead, the FIR shuffler chooses one of the five representative data allocation settings summarized in the table in Fig. 6.8.3. Then the shuffler receives the 8×3-bit patterns (MSB8, MIDB8, and LSB8) and selectively maps them to proper tap paths (TAPA8, TAPB8, and TAPC8) to implement the selected FFE setting, with proper shift operations involved. This approach minimizes the area and power overhead for the FFE multiplexing, while still covering a sufficient range of configurations for representative channels.
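The five presets can be expressed as a small behavioral model. The sketch below transcribes the signs and delays from the table in Fig. 6.8.3 and applies the 1x/4x/1x tap strengths from its header. The function name `ffe_output` and the use of a single signed symbol stream (rather than separate MSB/MIDB/LSB streams) are simplifying assumptions, not the hardware implementation.

```python
# (sign, delay) per tap for each preset, transcribed from the table in Fig. 6.8.3.
PRESETS = {
    0: {"TAPA": (-1, 0), "TAPB": (+1, 1), "TAPC": (-1, 2)},
    1: {"TAPA": (+1, 1), "TAPB": (+1, 1), "TAPC": (+1, 1)},  # FFE off
    2: {"TAPA": (-1, 1), "TAPB": (+1, 0), "TAPC": (-1, 2)},
    3: {"TAPA": (+1, 1), "TAPB": (+1, 1), "TAPC": (-1, 2)},
    4: {"TAPA": (-1, 0), "TAPB": (+1, 1), "TAPC": (+1, 1)},
}
TAP_WEIGHT = {"TAPA": 1, "TAPB": 4, "TAPC": 1}


def ffe_output(mode, symbols):
    """Weighted tap sum for each index n >= 2 of a signed symbol sequence."""
    out = []
    for n in range(2, len(symbols)):
        y = sum(
            sign * TAP_WEIGHT[tap] * symbols[n - delay]
            for tap, (sign, delay) in PRESETS[mode].items()
        )
        out.append(y)
    return out


symbols = [1, 1, -1, 1]
print(ffe_output(1, symbols))  # FFE off: full-strength copy of D[n-1] -> [6, -6]
print(ffe_output(0, symbols))  # pre/main/post preset -> [4, -6]
```

In Mode 1 all three taps carry the same data, so the output is simply six times the delayed symbol. In Mode 0 the repeated symbol is de-emphasized (4 instead of 6) while the transition keeps the full amplitude.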

The FIR controller output and the N-bit shifted MSB signals are further upconverted by the following ten 8:4 serializers, producing a 40-bit data stream (TA4, TB4, and TC4 for the normal taps, and TAUX4 for the aux tap) for the final retiming, 4:1 serialization, and output driving operations (Fig. 6.8.4). Although the output multiplexing technique used by PAM-4 transmitter designs in planar technologies [4–5] would relax the speed requirement imposed by the relatively low transit frequency of 40nm transistors, it is not applicable to the PAM-8 driver due to its excessive number of output slices. Instead, the final 4:1 serialization is performed separately before the output driver, using high-speed pulse generation and single-stack multiplexing [2] to avoid stacked transistors at the critical node (the 4:1 serializer output). The high-speed pulse generators are composed of dynamic NAND+NOR gates that trim the incoming retimed NRZ signals (R[1:4]) to 0.25UI pulse patterns (P[1:4]). The four pulses per tap then enable the following pseudo-NMOS multiplexer branches, one at a time, generating the differential 33.3Gb/s data streams. The multiplexer outputs are then amplified and buffered by the following inverter chain to deliver high-speed signals to the final driver array. The PMOS transistor width of the multiplexer is sized such that the multiplexer output swing is high enough to avoid static current flows at subsequent inverters. The P/N width ratios of the predrive inverters are skewed to tune their duty cycles for the output driver's operating point.
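The pulse-based 4:1 serialization can be modeled at the bit level. The following is a purely behavioral sketch with hypothetical function names: it treats each pulse as a one-hot enable that gates one retimed lane per UI slot and ignores the signal inversion, timing, and analog behavior of the dynamic gates and the pseudo-NMOS multiplexer.

```python
def pulse_enables(num_phases=4):
    """One-hot enable patterns: phase k is active only during UI slot k."""
    return [
        [1 if slot == k else 0 for slot in range(num_phases)]
        for k in range(num_phases)
    ]


def serialize_4to1(lanes):
    """Interleave four quarter-rate bit lanes (R1..R4) into one full-rate stream."""
    enables = pulse_enables(len(lanes))
    out = []
    for word in zip(*lanes):              # one quarter-rate clock period
        for slot in range(len(lanes)):    # the four UI slots inside it
            bit = 0
            for k, r in enumerate(word):
                # Pulse generation: the lane bit is gated by its one-hot enable.
                # The OR models the shared single-stack multiplexer output node.
                bit |= r & enables[k][slot]
            out.append(bit)
    return out


lanes = [[1, 0], [0, 0], [1, 1], [0, 1]]
print(serialize_4to1(lanes))  # [1, 0, 1, 0, 0, 0, 1, 1]
```

Because only one enable is active per slot, each multiplexer branch needs a single switching transistor, which is the point of the single-stack arrangement described above.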

To demonstrate the high-swing 100Gb/s PAM-8 transmitter operation, a prototype transmitter chip is implemented in a 40nm CMOS technology, and its chip micrograph is shown in Fig. 6.8.7. The fabricated design occupies a compact die area of 0.362mm<sup>2</sup>. The test chip is flip-chip attached to the GCPW board trace structure to measure the transmitter performance in a realistic environment. The channel loss estimated from the measured pulse response [1,4] is about 9.4dB at 16.65GHz, as plotted in Fig. 6.8.5. The figure also shows the measured PAM-8 output eye diagram of the transmitter at 100Gb/s and its power breakdown. The eye diagram is constructed by overlapping PRBS7 and extra patterns to populate all possible signal transitions for PAM-8. The transmitter achieves a 1.6V differential peak-to-peak output swing without FFE, and a 922mV<sub>ppd</sub> swing with 57.7mV worst-case eye-opening with all FFE taps enabled, which is large enough to achieve low bit-error rates. The measured ratio of level mismatch (RLM) is 0.94. The transmitter consumes 335mW, which corresponds to 3.35pJ/b efficiency, with a 2.2V termination voltage and a 1.2V supply voltage for the rest of the analog and mixed-signal circuits. The performance of the PAM-8 transmitter is summarized and compared with previous transmitter designs that achieved over 100Gb/s in Fig. 6.8.6. Despite its implementation in the 40nm process, the design achieves a 100Gb/s data rate and a 1.6V<sub>ppd</sub> swing with sufficient eye-openings for the highest modulation order.
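The reported efficiency follows directly from the measured power and data rate. The short sketch below reproduces that arithmetic and, as an illustrative figure that is not reported in the paper, compares the worst-case eye opening with the ideal level spacing of a uniform eight-level signal at the equalized swing.

```python
# Measured values quoted in the text.
POWER_MW = 335.0
DATA_RATE_GBPS = 100.0
SWING_FFE_MVPPD = 922.0     # swing with all FFE taps enabled
EYE_OPENING_MV = 57.7       # worst-case eye opening
PAM_LEVELS = 8

# mW divided by Gb/s gives pJ/b directly.
energy_pj_per_bit = POWER_MW / DATA_RATE_GBPS

# Assumption: uniformly spaced levels, so eight levels span seven equal gaps.
ideal_spacing_mv = SWING_FFE_MVPPD / (PAM_LEVELS - 1)
eye_fraction = EYE_OPENING_MV / ideal_spacing_mv

print(f"{energy_pj_per_bit:.2f} pJ/b")
print(f"ideal spacing {ideal_spacing_mv:.1f} mV, eye uses {eye_fraction:.0%} of it")
```

Under the uniform-spacing assumption, the measured eye retains a little under half of the ideal level spacing after the 9.4dB channel.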

### Acknowledgement:

The research is sponsored in part by Samsung Research Funding & Incubation Center of Samsung Electronics under Project Number SRFC-IT2001-02, Samsung Electronics (Chip Interconnect Solutions), and Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-01307).

### References:

- [1] T. Dickson et al., "A 72GS/s, 8-bit DAC-based Wireline Transmitter in 4nm FinFET CMOS for 200+Gb/s Serial Links," *IEEE Symp. VLSI Circuits*, pp. 28–29, June 2022.
- [2] J. Kim et al., "A 224Gb/s DAC-Based PAM-4 Transmitter with 8-Tap FFE in 10nm CMOS," *ISSCC*, pp. 126–127, Feb. 2021.
- [3] M. A. Kossel et al., "An 8b DAC-Based SST TX Using Metal Gate Resistors with 1.4pJ/b Efficiency at 112Gb/s PAM-4 and 8-Tap FFE in 7nm CMOS," *ISSCC*, pp. 130–131, Feb. 2021.
- [4] M. Choi et al., "An Output-Bandwidth-Optimized 200Gb/s PAM-4 100Gb/s NRZ Transmitter with 5-Tap FFE in 28nm CMOS," *ISSCC*, pp. 128–129, Feb. 2021.
- [5] P.-J. Peng et al., "A 100Gb/s NRZ Transmitter with 8-Tap FFE Using a 7b DAC in 40nm CMOS," *ISSCC*, pp. 130–131, Feb. 2020.
- [6] E. Song et al., "A 32-Gb/s High-Swing PAM-4 Current-Mode Driver with Current-Bleeding Cascode Technique and Capacitive-Coupled Pre-drivers in 40-nm CMOS for Short-Reach Wireline Communications," *IEEE Int. Midwest Symp. on Circuits and Systems*, pp. 1–4, Aug. 2022.



Figure 6.8.1: Architecture of the high-voltage swing PAM-8 FFE transmitter with variable tap configurations.

| Preset | TAPA (1x) | TAPB (4x) | TAPC (1x) | Tap Assignments        |
|--------|-----------|-----------|-----------|------------------------|
| Mode 0 | -D[n]     | +D[n-1]   | -D[n-2]   | [Pre, Main, Post]      |
| Mode 1 | +D[n-1]   | +D[n-1]   | +D[n-1]   | FFE off                |
| Mode 2 | -D[n-1]   | +D[n]     | -D[n-2]   | [Main, Post, 2nd Post] |
| Mode 3 | +D[n-1]   | +D[n-1]   | -D[n-2]   | [Main, Main, Post]     |
| Mode 4 | -D[n]     | +D[n-1]   | +D[n-1]   | [Pre, Main, Main]      |



Figure 6.8.3: 3-tap FIR shuffler and its data allocation table.



Figure 6.8.2: Cascode CML tap schematic and the frontend structure composed of load-matched inverter chains and driver taps.



Figure 6.8.4: Schematics of 4:1 HSSER and skewed predriver (left) and timing diagrams (right).



Figure 6.8.5: 100Gb/s PAM-8 TX eye diagram (top), measured power breakdown (bottom-left) and the channel response extracted from the pulse response (bottom-right).

|                                       | This work    | Dickson VLSI22 | Kim ISSCC21 | Kossel ISSCC21 | Choi ISSCC21 | Peng ISSCC20 |
|---------------------------------------|--------------|----------------|-------------|----------------|--------------|--------------|
| Technology                            | 40nm         | 4nm            | 10nm        | 7nm            | 28nm         | 40nm         |
| Data Rate [Gb/s]                      | 100          | 144 / 216      | 224         | 112            | 200          | 100          |
| Modulation                            | PAM-8        | PAM-4 / 8      | PAM-4       | PAM-4          | PAM-4        | NRZ          |
| Architecture                          | Mixed Signal | 8b DAC         | 7b DAC      | 8b DAC         | Mixed Signal | 7b DAC       |
| Driver                                | CML          | SST            | CML         | SST            | CML          | Tail-less    |
| Max. Output Swing [V<sub>ppd</sub>]   | 1.6          | 0.92           | 1.0         | 0.92           | 0.8          | 0.56         |
| Vertical Eye Opening [mV]             | 57           | N/A            | 90          | 59             | 53           | 73           |
| FFE Taps                              | 3+1          | 8              | 8           | 8              | 5            | 8            |
| Energy Efficiency [pJ/bit]            | 3.35         | 2 / 1.33       | 1.74        | 1.4            | 4.63         | 6.19         |
| Analog Supply [V]                     | 1.2 / 2.2    | 0.95           | 0.8/1.5     | 0.96           | 1.4          | 1.1/2.1.5    |
| Channel Loss @ Nyq. Frequency [dB]    | -9.4         | -8.8           | -4.0        | -15.1          | -6.0         | -7.1         |
| Packaging                             | Flip-chip    | Flip-chip      | Flip-chip   | Bare die*      | Bare die*    | Bare die*    |

\* RF probes used for signal acquisition.

Figure 6.8.6: Performance summary and comparisons with recent 100-224Gb/s TX designs.



Figure 6.8.7: Chip micrograph of TX in 40nm CMOS technology.