



ISSCC 2023

# SESSION 6

# Advanced Wireline Links and Techniques

# **A 112Gb/s Serial Link Transceiver With 3-tap FFE and 18-tap DFE Receiver for up to 43dB Insertion Loss Channel in 7nm FinFET Technology**

**Bo Zhang, Anand Vasani, Ashutosh Sinha, Alireza Nilchi, Haitao Tong,  
Lakshmi Rao, Karapet Khanoyan, Hamid Hatamkhani, Xiaochen Yang,  
Xin Meng, Alexander Wong, Jun Kim, Ping Jing, Yehui Sun, Ali  
Nazemi, Dean Liu, Anthony Brewster, Jun Cao, Afshin Momtaz**

**Broadcom, Irvine/San Jose, CA**



# Outline

- **Introduction**
- **112Gb/s Multiple Rate Transceiver**
  - Receiver
  - Transmitter
  - PLL
- **Measurement Results**
- **Summary and Conclusion**

# Internet Traffic Growth

■ **27% CAGR, 2017-2022**

- Video streaming
- Cloud traffic
- Mobile Internet
- Digital commerce

■ **66% population penetration**

- 6% CAGR



© Statista

# Data Center Transition: 100GE/200GE to 400GE

## ■ Fast adoption of 400GE

- IEEE 802.3 working group
  - 25G: 100GBASE-KR4 (NRZ)
  - 50G: 200GBASE-KR4/CR4/C2M (PAM4)
  - 100G: 400GBASE-KR4/CR4/C2M (PAM4)
- OIF-CEI
  - 28G: CEI-28G-LR/SR/VSR(NRZ)
  - 56G: CEI-56G-LR/SR/VSR (PAM4)
  - 112G: CEI-112G-LR/MR/SR/VSR (PAM4)



# Outline

- **Introduction**
- **112Gb/s Multiple Rate Transceiver**
  - Receiver
  - Transmitter
  - PLL
- **Measurement Results**
- **Summary and Conclusion**

# ADC Based Solution Vs. Analog RX Solution

## ■ Advantage Comparison

| ADC based receiver                            | Analog receiver                  |
|-----------------------------------------------|----------------------------------|
| FFE and DFE have no PVT dependence            | Higher CDR bandwidth             |
| DSP power/area scale well with process node   | More DFE taps can be implemented |

**This work is an analog-based receiver solution!**

# Equalization Basics

## ■ Channel equalization

- Tradeoffs
  - FFE equalizes both pre- and post-cursor ISI
  - TX FFE reduces signal level
  - RX CTLE/FFE amplifies noise and crosstalk
  - CTLE/DFE equalize post-cursor ISI
  - DFE doesn't enhance noise and crosstalk, but may suffer from error propagation
- Combine all to maximize the equalization performance



# Receiver Block Diagram



# Sampling Based FFE and DFE

## ■ 4X interleaving topology

- Each 2-stage sample/hold (S/H) holds its data for 4UI
- 3-tap FFE is implemented
  - The three samples feeding each FFE summer overlap for 2UI
- Cascaded FFE and DFE summers



# Sampling Based FFE and DFE In Details

## ■ 4X Interleaving

- Interleaving: A, B, C, D
- 4X 4T clocks: clka, clkb, clkc, clkd
- S/H data:  $S_A, S_B, S_C, S_D$

## ■ 3-tap FFE

- 1 pre-cursor and 1 post-cursor, e.g. $FFE_A$ is the sum of $S_B, S_A, S_D$

## ■ Clocking

- Place S/H (falling) and slicers (rising) under the same clock for easy clocking, e.g. $S_B$ and interleaving A are both clocked by clka

## ■ Multi-mode operation

- Baud-rate mode by enabling S/H
- Oversampling mode by bypassing S/H



# FFE Timing Diagram

## ■ 2UI overlapping at FFE sum output

- FFE has 2UI settling time
  - Low power
- Slicers sample at the end of the 2UI overlap window

$$FFE(z) = -a_1 z^1 + 1 - b_1 z^{-1}$$
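As a minimal sketch of the transfer function above, the following Python applies $FFE(z) = -a_1 z^1 + 1 - b_1 z^{-1}$ to a sampled pulse response. The function name `ffe_3tap`, the coefficient values, and the pulse-response numbers are illustrative assumptions, not values from the design.

```python
def ffe_3tap(samples, a1, b1):
    """Apply y[n] = -a1*x[n+1] + x[n] - b1*x[n-1] to the interior samples."""
    return [
        -a1 * samples[n + 1] + samples[n] - b1 * samples[n - 1]
        for n in range(1, len(samples) - 1)
    ]


# Hypothetical pulse response: pre-cursor 0.2, main cursor 1.0, post-cursor 0.3.
pulse = [0.0, 0.2, 1.0, 0.3, 0.0]

# Choosing a1 and b1 equal to the pre- and post-cursor cancels both neighbors
# and slightly reduces the main cursor.
equalized = ffe_3tap(pulse, a1=0.2, b1=0.3)
print(equalized)
```

In this toy case the pre-cursor and post-cursor samples are cancelled while the main cursor shrinks a little, which is the usual cost of linear equalization.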



# FFE Summer Schematic

- Unit cell: differential pair degenerated with NFET resistor → compact area



# DFE Timing

- Direct feedback for Tap<sub>1</sub> is not feasible for 100G PAM4
  - $T_{d2q} + T_{dfe\_sum} < 1\text{UI}$
- Loop unrolled Tap<sub>1</sub> requires more slicers
  - PAM4: 12 slicers vs. 3 slicers
- Trade off: 1+D pulse shaping
  - Reduce overall equalization by  $1 + z^{-1}$
  - Tap<sub>1</sub>=1 for PAM4 → PAM7
    - PAM4: symbol [-3, -1, 1, 3] with Tap<sub>1</sub>=1 → PAM7: symbol [-6, -4, -2, 0, 2, 4, 6] in **thermometer format**
    - 6 slicers with reference of [-5, -3, -1, 1, 3, 5]
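The PAM7 levels and slicer references listed above follow directly from adding two PAM4 symbols. The sketch below enumerates them; the helper names are hypothetical and the model is noise-free.

```python
PAM4_LEVELS = [-3, -1, 1, 3]


def pam7_levels():
    """All values of a[n] + a[n-1] when both symbols are PAM4 and Tap1 = 1."""
    return sorted({cur + prev for cur in PAM4_LEVELS for prev in PAM4_LEVELS})


def slicer_references(levels):
    """Midpoints between adjacent levels (integers for this level set)."""
    return [(low + high) // 2 for low, high in zip(levels, levels[1:])]


def thermometer(value, references):
    """One slicer output per reference: 1 when the input exceeds it."""
    return [1 if value > ref else 0 for ref in references]


levels = pam7_levels()
refs = slicer_references(levels)
print(levels)
print(refs)
print(thermometer(2, refs))
```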



# PAM7 to PAM4 Conversion Diagram

## ■ Critical timing loop: 1+D MUX

- PAM7 $D[5:0]_n$ is converted to PAM4 $Q[2:0]_n$ using $Q[2:0]_{n-1}$, so the loop must close within one symbol period ($1T_s$)
  - $T_{\text{mux4:1}} + T_{\text{latch}} < 17.8\text{ps}$ @ 56GBaud, **not feasible for 7nm**
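One plausible behavioral reading of the 1+D MUX is sketched below: the previous PAM4 decision selects which three of the six slicer outputs form the current thermometer code. This is an assumption about the slide's intent rather than the actual circuit, the function names are hypothetical, and the channel is modeled as ideal and noise-free.

```python
import random

REFS = [-5, -3, -1, 1, 3, 5]  # PAM7 slicer references


def slice_pam7(y):
    """Six slicer outputs D[5:0] in thermometer form."""
    return [1 if y > ref else 0 for ref in REFS]


def mux_decode(d, q_prev):
    """Three 4:1 MUXes: q_prev picks which three slicer outputs become Q[2:0]."""
    shift = sum(q_prev)  # 0..3, one value per previous PAM4 symbol
    return [d[k + shift] for k in range(3)]


def to_symbol(q):
    return 2 * sum(q) - 3


def to_thermometer(symbol):
    ones = (symbol + 3) // 2
    return [1] * ones + [0] * (3 - ones)


def decode(received, first_prev_symbol):
    q_prev = to_thermometer(first_prev_symbol)
    out = []
    for y in received:
        q_prev = mux_decode(slice_pam7(y), q_prev)
        out.append(to_symbol(q_prev))
    return out


random.seed(0)
tx = [random.choice([-3, -1, 1, 3]) for _ in range(1000)]
rx = [tx[n] + tx[n - 1] for n in range(1, len(tx))]  # 1+D shaped samples
decoded = decode(rx, first_prev_symbol=tx[0])
print(decoded == tx[1:])

# One symbol period at 56GBaud, in picoseconds.
ui_ps = 1e12 / 56e9
print(round(ui_ps, 2))
```

Each decision feeds the next MUX selection, which is why the MUX and latch delays must fit inside one symbol period of roughly 17.86ps.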



# Look Ahead Approach

- Look ahead approach relaxes critical timing to 2UI



# Look Ahead 1+D MUX Diagram

## Traditional



## Look Ahead



- $X[11:0]_n$  are not in critical timing path!
- 3X 4:1 MUX → 15X 4:1 MUX



Critical timing is relaxed from 1UI to 2UI
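The look-ahead idea can be checked behaviorally. In the sketch below, which is an interpretation and not the authors' netlist, the result for symbol $n$ is precomputed for all four possible values of $Q_{n-2}$ and then selected once $Q_{n-2}$ is known, so the feedback loop spans 2UI. All names are hypothetical and the data is noise-free.

```python
import random

REFS = [-5, -3, -1, 1, 3, 5]
# Thermometer codes for PAM4 symbols -3, -1, 1, 3.
CANDIDATES = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]


def slice_pam7(y):
    return [1 if y > ref else 0 for ref in REFS]


def mux(d, q_prev):
    """Select three of six slicer outputs based on the previous decision."""
    shift = sum(q_prev)
    return [d[k + shift] for k in range(3)]


def lookahead_step(d_n, d_nm1, q_nm2):
    # x holds 4 candidates x 3 bits = 12 bits, computed without knowing q_nm2.
    x = [mux(d_n, mux(d_nm1, cand)) for cand in CANDIDATES]
    # Only this final selection depends on the fed-back decision.
    return x[sum(q_nm2)]


random.seed(1)
tx = [random.choice([-3, -1, 1, 3]) for _ in range(500)]
d = [slice_pam7(tx[n] + tx[n - 1]) for n in range(1, len(tx))]

# Reference: direct recursion with a 1UI loop.
q_direct = []
q_prev = CANDIDATES[(tx[0] + 3) // 2]
for d_n in d:
    q_prev = mux(d_n, q_prev)
    q_direct.append(q_prev)

# Look-ahead recursion with a 2UI loop, seeded with the first two decisions.
q_look = q_direct[:2]
for n in range(2, len(d)):
    q_look.append(lookahead_step(d[n], d[n - 1], q_look[n - 2]))

matches = q_look == q_direct
print(matches)
```

Under this reading, the inner `mux(d_nm1, cand)` with a constant candidate is fixed wiring, the outer `mux` costs three 4:1 MUXes per candidate (12 in total), and the final selection adds three more, which would be consistent with the 15 MUXes quoted on the slide.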

# 4X Interleaving 1+D MUX Diagram

## ■ 4X interleaving 1+D MUXs

- Group interleaving A/C and interleaving D/B
  - Same clock domain
  - Data are 2UI apart: A|C and B|D
- MUX between adjacent data for X
  - $X_A = D_A - D_D$
  - $X_B = D_B - D_A$
  - $X_C = D_C - D_B$
  - $X_D = D_D - D_C$
- 2UI timing closure
  - $A \xleftarrow{MUX} C$  and  $B \xleftarrow{MUX} D$



# Transmitter Block Diagram

## ■ TX highlight

- 6-tap TX FFE
- Half-rate topology
  - 2:1 MUX by 2T clock before  $50\Omega$  driver
- 7-bit segmented DAC at  $50\Omega$  driver
  - 2-bit thermometer, 5-bit binary
  - Capacitor based active peaking
  - T-Coil at output



# PLL Diagram

## ■ Two low noise LC-VCOs

- Fully differential LC-VCO design to extend tuning range
- Low band: 41-50GHz
- High band: 50-57GHz

## ■ 3<sup>rd</sup>-order $\Delta\Sigma$ fractional-N divider

- A CML divide-by-2 comes first, trading off MMD feasibility against $\Delta\Sigma$ noise



# Transceiver Operation Mode Summary

## ■ Main operation modes

- 100G PAM4 long channels: 1+D pulse shaping
  - RX 3-tap FFE and 18-tap DFE
- 100G PAM4 short and moderate channels: normal DFE
  - RX 3-tap FFE and 17-tap DFE without Tap1
- Legacy 50G PAM4 channels: similar to 100G, with or without FFE
- Legacy NRZ channels: 25G and 10G
  - Unrolled Tap1 based 18-tap DFE, no FFE
- Taps 12-18 are floating taps

# Outline

- **Introduction**
- **112Gb/s Multiple Rate Transceiver**
  - Receiver
  - Transmitter
  - PLL
- **Measurement Results**
- **Summary and Conclusion**

# Transceiver BER Test

- **112Gbps PAM4 BER test with extreme channel loss**
  - 43.9dB loss @ 28GHz (bump to bump)
  - BER < 1E-5 with frequency offset



| ppm         | 0        | 50       | 100      | 150      | 200      |
|-------------|----------|----------|----------|----------|----------|
| PreFEC BER  | 5.42E-06 | 8.02E-06 | 8.51E-06 | 8.71E-06 | 9.59E-06 |
| PostFEC BER | 3.24E-35 | 5.83E-33 | 5.68E-32 | 7.46E-32 | 1.55E-31 |

# Receiver Jitter Tolerance Test

- **400G-KR4 JTOL: 106.25Gb/s per CL163.9.2**
  - PRBS31Q @ BER=1E-4 with Channel IL @26.5625G = 27.5dB~28.5dB
  - CDR bandwidth of 16MHz



# Receiver Interference Tolerance Test

## ■ 400G-KR4 ITOL: 106.25Gb/s per CL163.9.2

- PRBS31Q @ BER<=1E-4 (Spec.)
  - Test1: with Channel IL @26.5625G = 13.5dB~14.5dB
  - Test2: with Channel IL @26.5625G = 27.5dB~28.5dB



# PLL Jitter Measurement in Integer Mode

## ■ 312.5MHz reference clock is used



# PLL Jitter Measurement in Fractional-N Mode

- 622.08MHz reference clock is used



# Transmitter NRZ Eye Diagram

- **53.125Gbps (no de-embedding)**
  - RJ is 170fs, ISI 2ps, TJ@1E-12 is 4.2ps



# Transmitter PAM4 Eye Diagram

## ■ 112.5Gbps (no de-embedding)

- J4U is 119mUI (128mUI spec)
- Jrms 15.6mUI (23mUI spec)
- EOJ is 7.5mUI (25mUI spec)
- RLM is 0.999 (0.95 spec)



# Transceiver Die Photo

- **8X RX and TX with one PLL**
  - Digital is about 25% and RXs are about 50% of total area



# Outline

- **Introduction**
- **112Gb/s Multiple Rate Transceiver**
  - Receiver
  - Transmitter
  - PLL
- **Measurement Results**
- **Summary and Conclusion**

# Performance Comparison Table

| Design                  | Im ISSCC 2020 | Ali ISSCC 2020 | LaCroix ISSCC 2021 | Guo ISSCC 2022 | This work      | Comment         |
|-------------------------|---------------|----------------|--------------------|----------------|----------------|-----------------|
| Technology              | 7nm           | 7nm            | 7nm                | 5nm            | 7nm            |                 |
| Data rate               | 112Gbps       | 112Gbps        | 112Gbps            | 112Gbps        | 112Gbps        |                 |
| RX topology             | 7-bit ADC     | 7-bit ADC      | 7-bit ADC          | 7-bit ADC      | Analog FFE/DFE | Analog RX       |
| TX topology             | 7-bit DAC     | 7-bit DAC      | 7-bit DAC          | 7.5-bit DAC    | 7-bit DAC      |                 |
| RX DFE tap              | 1             | 1              | 2                  | 1              | 18             | More DFE taps   |
| RX FFE tap              | 31            | 24             | 25                 | 30             | 3              |                 |
| CDR BW                  | ~4MHz (est)   | ~6MHz (est)    | NA                 | ~7MHz (est)    | 16MHz          | Highest CDR BW  |
| TX FFE tap              | 4             | 6              | 7                  | 6              | 6              |                 |
| TX RLM                  | NA            | 0.99           | NA                 | 0.96           | 0.99           | Best RLM        |
| TX SNDR                 | NA            | 36dB           | NA                 | 36dB           | 37dB           | Best SNDR       |
| Channel loss            | 37.5dB        | 38dB           | 45dB               | 50dB           | 43.9dB         |                 |
| TXVR power              | 602mW*        | 460mW*         | 662mW              | 504mW          | 600mW*/690mW   |                 |
| Area (mm<sup>2</sup>)   | 0.4*          | 0.385*         | 0.53*              | 0.49           | 0.47*/0.63     | Smallest in 7nm |

\* Analog only

# Summary

## ■ A 112Gbps multi-rate transceiver is presented

- It features RX analog 3-tap FFE / 18-tap DFE and TX 6-tap FFE using 7-bit DAC
- It can compensate up to 43dB loss channel at 112Gbps PAM4 signaling with high CDR bandwidth
- It consumes 690mW and occupies 0.63mm<sup>2</sup> (per RXTX) including digital, in 7nm FinFET technology

# Acknowledgements

- The authors would like to thank the Broadcom DSP, ASIC, and Layout groups for their support with the design and layout, and the DVT groups for the measurements

# A 4.63pJ/b 112Gb/s DSP-Based PAM-4 Transceiver for a Large-Scale Switch in 5nm FinFET

Henry Park<sup>\*1</sup>, Mohammed Abdullatif<sup>\*1</sup>, Ehung Chen<sup>1</sup>, Ahmed Elmallah<sup>1</sup>,  
Qaiser Nehal<sup>1</sup>, Miguel Gandara<sup>1</sup>, Tsz-Bin Liu<sup>2</sup>, Amr Khashaba<sup>1</sup>, Joonyeong Lee<sup>1</sup>,  
Chih-Yi Kuan<sup>2</sup>, Dhinessh Ramachandran<sup>1</sup>, Ruey-Bo Sun<sup>2</sup>, Atharav Atharav<sup>1</sup>,  
Yusang Chun<sup>1</sup>, Mantian Zhang<sup>1</sup>, Deng-Fu Weng<sup>2</sup>, Chung-Hsien Tsai<sup>2</sup>,  
Chen-Hao Chang<sup>2</sup>, Chia-Sheng Peng<sup>2</sup>, Sheng-Tsung Hsu<sup>2</sup>, Tamer Ali<sup>1</sup>

<sup>1</sup>Mediatek USA, Irvine, CA, <sup>2</sup>Mediatek, Hsinchu, Taiwan



# Outline

- Introduction
- TX Overview
- RX Overview
- PLL Overview
- Measurement

# ASIC Serdes Design Challenges



## ■ General purpose LR Serdes

- 0~40+dB loss, 1Gbps ~ 112Gbps, >1.2Vppd swing
- High port density & large package (Xtalk, supply noise, EM coupling)
- BER/SNR margin
- Power efficient DSP-based Serdes



\*Image from Cisco



6.2: A 4.63pJ/b 112Gb/s DSP-Based PAM-4 Transceiver for a Large-Scale Switch

# ASIC Serdes Overview



- TX: 7b DAC, 4:1 HSMUX (tail-less CML), voltage-mode driver
- RX: 7b ADC, up to 34dB compensation RXFE, PVT insensitive datapath design
- TX/RXPLL: wideband, digital PLL

# Outline

- Introduction
- TX Overview
- RX Overview
- PLL Overview
- Measurement

# TX Block Diagram



- One DPLL per TX lane
- CML style 4:1 MUX [Z. Toprak-Deniz, ISSCC 2019]



# TX Block Diagram



- 4:1 MUX Bandwidth control by the pull-up resistor
- Replica MUX provides adaptive control voltage



# Outline

- Introduction
- TX Overview
- RX Overview
- PLL Overview
- Measurement

# RX Block Diagram



## ■ RX analog front-end

- Attenuator, VGA(AGC), CTLE (CQE), TH buffer
- Robust datapath design

# Attenuator



**Gain\_LF (cap dominant)**

$$\left( \frac{R_2}{R_1 + R_2} \right) \times \left( \frac{C_1}{C_1 + N \cdot C_u + C_p} \right)$$

**Gain\_HF (resistor dominant)**

$$\left( \frac{R_2 \cdot R_u}{R_2 \cdot R_u + R_1 \cdot R_u + N \cdot R_1 \cdot R_2} \right)$$
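The two gain expressions above can be evaluated numerically to see how the attenuation changes with $N$. The sketch below assumes $N$ is the number of enabled unit branches (each contributing $C_u$ and $R_u$); that reading, the function names, and all component values are hypothetical.

```python
def gain_lf(r1, r2, c1, c_u, c_p, n):
    """Gain_LF as written on the slide (cap-dominant expression)."""
    return (r2 / (r1 + r2)) * (c1 / (c1 + n * c_u + c_p))


def gain_hf(r1, r2, r_u, n):
    """Gain_HF as written on the slide (resistor-dominant expression)."""
    return (r2 * r_u) / (r2 * r_u + r1 * r_u + n * r1 * r2)


# Hypothetical component values, chosen only to exercise the formulas.
R1, R2, RU = 50.0, 50.0, 200.0
C1, CU, CP = 100e-15, 25e-15, 0.0

for n in range(5):
    print(n, round(gain_lf(R1, R2, C1, CU, CP, n), 4), round(gain_hf(R1, R2, RU, n), 4))
```

With these example values the two gains do not track each other as $N$ changes, so in practice the unit sizes would need to be chosen so that both expressions agree if a flat response is wanted.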

## ■ Wideband attenuator

- Adaptive  $V_{on}$  from a replica bias: 1) PVT insensitive gain control, 2) No pole-zero doublet
- Negative  $V_{off}$  (baseline wander)

# ADC S/H Switch Distortion Under a Low VDD



## S/H switch on-resistance

- Level dependent settling error
- Target  $R_{REF} \rightarrow 5\tau$  settling at MIN level

# Adaptive CM Biasing



## ■ Adaptive CM generator

- Replica S/H switch
- Programmable  $R_{REF}$

# S/H ENOB with Adaptive CM Biasing



## ■ 1GS/s ADC S/H ENOB simulation

- Optimum CM level: settling error vs. leakage
- Controlled by  $R_{REF}$

# RX Block Diagram



## ■ RX digital backend

- 24+8 (sliding) tap FFE + 1 tap DFE
- Low latency TR FFE (JTOL) or TED from the main slicer (>45dB)

# Interleaved Path Sampling Error

$$x(t) + \boxed{x(t) \cdot \Delta g} \rightarrow P\{x(nT_b + t_{TR})\} \cdot \sigma^2(\Delta g) \cdot \frac{N_{ADC} - 1}{N_{ADC}}$$
$$x(t) + \boxed{x'(t) \cdot \Delta t} \rightarrow P\{x'(nT_b + t_{TR})\} \cdot \sigma^2(\Delta t) \cdot \frac{N_{TH} - 1}{N_{TH}}$$

- Autocorrelation model: [El-Chammas, TCAS 2009]
- Simple models by using statistical approach

# N-Way Sampling Error (PAM-4 Signal)



$$P(Skew) \approx \sum_{-\infty}^{\infty} (h'_n)^2 \cdot \frac{5}{9} \cdot \sigma^2(\Delta t) \cdot \frac{N_{TH}-1}{N_{TH}}$$

$$P(Gain) \approx \sum_{-\infty}^{\infty} (h_n)^2 \cdot \frac{5}{9} \cdot \sigma^2(\Delta g) \cdot \frac{N_{ADC}-1}{N_{ADC}}$$

$$P(Offset) \approx \sigma^2(\Delta v_{offset}) \cdot \frac{N_{ADC}-1}{N_{ADC}}$$

- Pulse response and mismatch information → sampling error
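A minimal sketch of the three error-power estimates above is given below. The pulse-response samples and mismatch sigmas are made-up numbers; the 5/9 factor appears consistent with the mean-square value of equiprobable PAM-4 levels normalized to ±1 and ±1/3, which the code checks as an assumption.

```python
def pam4_mean_square():
    """Mean-square value of equiprobable levels -1, -1/3, 1/3, 1."""
    levels = [-1.0, -1.0 / 3.0, 1.0 / 3.0, 1.0]
    return sum(v * v for v in levels) / len(levels)


def skew_error_power(h_prime, sigma_dt, n_th):
    """P(Skew): sum of squared pulse-response slopes times timing variance."""
    return sum(d * d for d in h_prime) * (5.0 / 9.0) * sigma_dt ** 2 * (n_th - 1) / n_th


def gain_error_power(h, sigma_dg, n_adc):
    """P(Gain): sum of squared pulse-response samples times gain variance."""
    return sum(v * v for v in h) * (5.0 / 9.0) * sigma_dg ** 2 * (n_adc - 1) / n_adc


def offset_error_power(sigma_offset, n_adc):
    """P(Offset): offset variance scaled by (N-1)/N."""
    return sigma_offset ** 2 * (n_adc - 1) / n_adc


# Hypothetical baud-spaced pulse-response samples (volts) and slopes (volts/second).
h = [0.1, 0.5, 0.2]
h_prime = [2.0e10, 0.0, -1.5e10]

print(pam4_mean_square())
print(skew_error_power(h_prime, sigma_dt=100e-15, n_th=4))
print(gain_error_power(h, sigma_dg=0.01, n_adc=4))
print(offset_error_power(sigma_offset=1e-3, n_adc=4))
```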


# Residual Sampling Error w/ Calibration

Signal swing = 400mVdpp



- Residual-error PDF: a uniform PDF (quantization noise) convolved with a Gaussian PMF (dynamic effects from dither, the feedback loop's SNR, and the interaction between skew and gain control)

# Outline

- Introduction
- TX Overview
- RX Overview
- PLL Overview
- Measurement

# PLL Block Diagram



- LC VCO with wide loop BW (>10MHz)
- Frac-N divider noise cancelled by DCDL

# EM Coupling



[C. -J. Li, T-MTT 2010]

## ■ EM coupling suppression

- Neighboring lanes at the same frequency
- EM modeling ( $\omega_{LR} \downarrow$ )
- MAX PLL BW (> 10MHz)

# Outline

- Introduction
- TX Overview
- RX Overview
- PLL Overview
- Measurement

# Chip Overview

- 2 PLLs per lane
- 521mW at 112.5Gbps LR mode w/ 40dB loss channel
- 0.461mm<sup>2</sup> per lane
  - Includes amortized power/area of bandgap & test port
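The energy-efficiency figure in the title follows from the power and data rate on this slide. The short check below uses a hypothetical helper name.

```python
def energy_per_bit_pj(power_w, data_rate_bps):
    """Energy per bit in picojoules: power divided by data rate."""
    return power_w / data_rate_bps * 1e12


# 521mW per lane at 112.5Gb/s, as quoted on the slide.
efficiency = energy_per_bit_pj(521e-3, 112.5e9)
print(round(efficiency, 2))
```

Rounded to two decimals, this reproduces the 4.63pJ/b in the paper title.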




# TX Measurement



- TX FIR applied (4.2dB)
- 1.1Vppd, RLM > 99%, 39dB SNDR @112.5Gbps

# RX Datapath Measurement



- RXFE AC response with PKG & PCB de-embedded
- Single tone test with CTLE LR mode

# PLL/CDR Measurement



- TX PLL phase noise (1kHz ~ 100MHz): 150fs\_rms
- RX MIN JTOL: 150mUI

# EM Coupling Test



- The wideband PLL suppresses the spur-induced PJ to < 1mUIpp

# T2R Test Setup (Clean Channel)



- TX PKG, PCB, MXP, cable, 26" ISI, cable, MXP, PCB, RX PKG

# T2R BER w/ LR Channels



## ■ Demo session 2 (Tuesday 5pm)

- 112.5Gbps with 48dB loss channel

# Compliance Test (Real Field Condition)



- Backplane with 6 aggressors from neighboring lanes, TX-RX 100ppm frequency offset
- Temperature control (~3dB loss variation)

# KR Channel T2R Performance



- DFE coefficient is less than 0.2
- BER varies by 8x over temperature sweep (-10°C ~ 125°C)

# Comparison Summary

|                              | [1] M. LaCroix           | [2] P. Mishra        | [3] Z. Guo          | [4] A. Varzaghani   | This Work                       |
|------------------------------|--------------------------|----------------------|---------------------|---------------------|---------------------------------|
| Process                      | 7nm                      | 7nm                  | 5nm                 | 5nm                 | <b>5nm</b>                      |
| Data Rate (Gb/s)             | 112                      | 112                  | 112                 | 112.5               | <b>112.5</b>                    |
| BER @Loss                    | BER < 1e-5 @45dB         | BER < 1e-6 @41.5dB   | BER < 1e-9 @40dB    | BER < 1e-5 @42dB    | <b>BER &lt; 7e-6 @48dB</b>      |
| Power/Lane (mW)              | A+D: 662                 | A+D: 729.1           | <b>A+D: 504</b>     | A+D: 632            | <b>A+D: 521</b>                 |
| Area (mm<sup>2</sup>/lane)   | A: 0.531                 | A+D: 0.92            | A+D: 0.49           | <b>A+D: 0.372</b>   | <b>A+D: 0.461</b>               |
| TX Architecture              | Analog: 7-bit SST+CML    | Analog: 7-bit CML    | Analog: 7.5-bit SST | Analog: 7-bit CML   | <b>Analog: 7-bit SST</b>        |
|                              | DSP: -                   | DSP: 5-Tap FIR       | DSP: 6-tap FIR      | DSP: 5-Tap FIR      | <b>DSP: 6-tap FIR</b>           |
| PLL/Lane                     | TX/RXPLL shared          | TXPLL shared/1 RXPLL | 1 PLL               | 2 PLLs              | <b>2 PLLs</b>                   |
| RX Architecture              | Analog: ATT / CTLE / VGA | Analog: CTLE / VGA   | Analog: CTLE        | Analog: VGA1 / VGA2 | <b>Analog: ATT / VGA / CTLE</b> |
|                              | ADC: 7 bit               | ADC: 7 bit           | ADC: 7 bit          | ADC: 7 bit          | <b>ADC: 7 bit</b>               |
|                              | DSP: 25 FFE / 2 DFE      | DSP: -               | DSP: 22+8 FFE/1 DFE | DSP: 30 FFE/1 DFE   | <b>DSP: 24+8 FFE/1 DFE</b>      |

■ Power measured at 112.5Gbps data rate with 40+dB loss channel

# Acknowledgement

- Mediatek layout and QA team's support

# References

- [1] M. LaCroix et al., "A 116Gb/s DSP-Based Wireline Transceiver in 7nm CMOS Achieving 6pJ/b at 45dB Loss in PAM-4/Duo-PAM-4 and 52dB in PAM-2," ISSCC, pp. 132-134, Feb. 2021.
- [2] P. Mishra et al., "A 112Gb/s ADC-DSP-Based PAM-4 Transceiver for Long-Reach Applications with >40dB Channel Loss in 7nm FinFET," ISSCC, pp. 138-140, Feb. 2021.
- [3] Z. Guo et al., "A 112.5Gb/s ADC-DSP-Based PAM-4 Long-Reach Transceiver with >50dB Channel Loss in 5nm FinFET," ISSCC, pp. 116-118, Feb. 2022.
- [4] A. Varzaghani et al., "A 1-to-112Gb/s DSP-Based Wireline Transceiver with a Flexible Clocking Scheme in 5nm FinFET," IEEE Symp. VLSI Circuits, pp. 26-27, June 2022.
- [5] Z. Toprak-Deniz et al., "A 128Gb/s 1.3pJ/b PAM-4 Transmitter with Reconfigurable 3-Tap FFE in 14nm CMOS," ISSCC, pp. 122-124, Feb. 2019.
- [6] C.-J. Li et al., "A Rigorous Analysis of a Phase-Locked Oscillator Under Injection," IEEE TMTT, vol. 58, no. 5, pp. 1391-1400, May 2010.

# A 0.43pJ/b 200Gb/s 5-Tap Delay-Line-Based Receiver FFE with Low-Frequency Equalization in 28nm CMOS

Bingyi Ye, Guangdong Wu, Weixin Gai, Kai Sheng, Yandong He



Peking University, Beijing, China

# Outline

- Motivation
- FFE Architecture
- Delay Lines
- Distributed Tap Amplifier
- Variable- $G_m$  Circuits
- Measurement results
- Comparisons and Conclusions

# DFE Critical Path at 200Gb/s



Direct Feedback



Unroll the 1<sup>st</sup> Tap

Hard to meet at 200Gb/s !

# RX Analog Feed-forward Equalizers



| Types of FFE Delay Element | Time-interleaved Sample-and-hold | Active delay line | Passive delay line ✓ |
|----------------------------|----------------------------------|-------------------|----------------------|
| BW requirement             | Low                              | High (cascaded)   | Moderate                                               |
| Sampling Clock             | Required                         | Not required      | Not required                                           |
| Area                       | Small                            | Small             | Moderate at 200Gb/s                                    |
| CDR support                | Baud-rate                        | Baud-rate & 2X    | Baud-rate & 2X                                         |

# FFE Architecture

- Input signal $V_I$ goes through a 2-UI delay line and is terminated
- Five taps ($H_{-1}$ to $H_3$) receive the incoming signals at 0.5-UI spacing
- The incoming signals are amplified and summed using another 2-UI delay line to generate the output signal $V_O$
- Terminations match the delay line to absorb the signals and prevent reflections
- Five replicas of the input signal with different delays and gains are summed at the output: the principle of the 5-tap FFE
- Supports both coefficient-based HF equalization and source-degeneration-based LF equalization
- Off-chip scope for $V_O$ measurement
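To illustrate how five delayed and weighted replicas produce high-frequency peaking, the sketch below evaluates the FFE response at DC and at Nyquist. It assumes an effective tap-to-tap delay of 1UI (0.5UI on the input line plus 0.5UI on the output line), which is an interpretation, although the measured delay of approximately 10ps between the $H_1$ and $H_0$ taps reported in the measurement slides is consistent with it at 100GBaud. The coefficient values and function names are hypothetical.

```python
import cmath
import math


def ffe_response(coeffs, freq_hz, tap_delay_s):
    """H(f) = sum over taps of H_k * exp(-j*2*pi*f*k*tap_delay)."""
    return sum(
        c * cmath.exp(-2j * math.pi * freq_hz * k * tap_delay_s)
        for k, c in coeffs.items()
    )


def peaking_db(coeffs, nyquist_hz, tap_delay_s):
    """Gain at Nyquist relative to DC, in dB."""
    dc = abs(ffe_response(coeffs, 0.0, tap_delay_s))
    nyquist = abs(ffe_response(coeffs, nyquist_hz, tap_delay_s))
    return 20.0 * math.log10(nyquist / dc)


UI = 10e-12  # 100GBaud symbol period (200Gb/s PAM-4)
# Hypothetical coefficients for taps H_-1 .. H_3.
coeffs = {-1: -0.05, 0: 1.0, 1: -0.3, 2: -0.05, 3: 0.0}

peak = peaking_db(coeffs, nyquist_hz=50e9, tap_delay_s=UI)
print(round(peak, 2))
```

Negative neighbor coefficients lower the DC gain more than the Nyquist gain, which is where the peaking comes from in this simple model.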

# Outline

- Motivation
- FFE Architecture
- **Delay Lines**
- Distributed Tap Amplifier
- Variable- $G_m$  Circuits
- Measurement results
- Comparisons and Conclusions

# Artificial delay line or T-Line?



## Artificial (LC) delay line

- Smaller area
- Carries large capacitance
- BW depends on the LC values; requires very small L for 100GBaud
- Parasitic L
- Crosstalk limits area reduction

## Transmission line

- Unlimited BW
- Lower crosstalk by shielding
- Accurate EM model, favorable for design evaluation
- Large area and loss for ~20GBaud
- Impedance drop due to C loading



# GCPW Delay Line



- On-chip grounded coplanar waveguide (GCPW)
- Impedance is designed to be  $60\Omega$  instead of  $50\Omega$  to compensate for the impedance drop

# GCPW as Delay Elements



| Baud rate (GBaud) | 20   | 100  |
|-------------------|------|------|
| 1-UI Delay (ps)   | 50   | 10   |
| Area              | 1X   | 0.2X |
| DC Loss (dB)      | 1.63 | 0.35 |
| Nyquist Loss (dB) | 2.64 | 1.15 |

- Area, DC loss, and Nyquist loss all decrease significantly when the baud rate increases to 100GBaud
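The delay and area rows of the table can be reproduced with simple arithmetic, under the assumption that the delay-line area scales in proportion to the required 1-UI delay. The helper name is hypothetical.

```python
def ui_ps(baud_gbaud):
    """One unit interval in picoseconds for a baud rate given in GBaud."""
    return 1e3 / baud_gbaud


delay_20 = ui_ps(20)
delay_100 = ui_ps(100)
# Assumed: line length, and therefore area, is proportional to the delay.
area_ratio = delay_100 / delay_20

print(delay_20, delay_100, area_ratio)
```

The resulting ratio of 0.2 matches the 0.2X area entry in the table.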

# Outline

- Motivation
- FFE Architecture
- Delay Lines
- **Distributed Tap Amplifier**
- Variable- $G_m$  Circuits
- Measurement results
- Comparisons and Conclusions

# Distributed Tap Amplifier



■ Lumped tap amplifier: severe impedance discontinuities cause significant reflections

■ Distributed tap amplifier: spreads the capacitive loading over the delay line and reduces reflections

# Distributed Tap Amplifier



- **Distributed  $H_0$  and  $H_1$  tap amplifiers improve the input return loss by 5dB at 50GHz**

# Distributed Tap Amplifier



Layout of the  $H_0$  and  $H_1$  taps

- 20 $\mu\text{m}$ -wide ground wire for crosstalk suppression
- Shielding plane (green) mitigates the ground bounce noise induced by the common-mode return current

# Distributed Tap Amplifier



- Intra-tap delay mismatch caused by different signal paths
- Mismatch $\approx 1\text{ps} \ll 1\text{UI}$, which only reduces the Nyquist gain by 0.1dB

# Outline

- Motivation
- FFE Architecture
- Delay Lines
- Distributed Tap Amplifier
- **Variable- $G_m$  Circuits**
- Measurement results
- Comparisons and Conclusions

# Coefficient Control Method



Cell-based Variable- $G_m$



Tail-current-based Variable- $G_m$

| Coefficient Control       | Cell-based | Tail-current-based ✓ |
|---------------------------|------------|----------------------|
| Linearity for large coef. | Good       | Good                 |
| Linearity for small coef. | Good       | Poor                 |
| Wiring parasitics         | Larger     | Smaller              |

# Cross-connected Variable- $G_m$ Cells



(Large coef.)



(Small coef.)

- Cross-connected cells for  $H_2$  and  $H_3$  tap amplifiers
- Large coefficients: only one  $G_m$  cell
- Small coefficients: Subtraction of  $G_{mp}$  and  $G_{mn}$
- Linear for all coefficients

# The Variable- $G_m$ Cell



- 4b gain control with programmable tail current
- 2b low-frequency equalization with RC source degeneration; unlike adding more FFE taps, this costs no extra power
- Improved linearity for PAM-4
- Four cells are used in the  $H_0$  tap for larger gain and finer equalization

( $H_0$ : 6b gain control, 4b LF EQ)

# Outline

- Motivation
- FFE Architecture
- Delay Lines
- Distributed Tap Amplifier
- Variable-Gm Circuits
- **Measurement results**
- Comparisons and Conclusions

# Frequency Responses



- Measured using a vector network analyzer (R&S ZNA67)
- 3.4dB tuning range of the low-frequency gain
- Configurable HF peaking up to 15dB (can be larger if $H_0$ is reduced)
- The baseline is removed from the others to make the curves more distinguishable

# Single Pulse Responses



- Measured using an AWG (Keysight M8199A) and a scope (DSAZ594A)
- $H_0$  fixed,  $H_{-1}=H_2=H_3=0$ ,  $H_1$  coefficient varies from zero to the maximum
- Max  $H_1$  coefficient: 0.6 (typical  $H_0$ )
- Delay between  $H_1$  and  $H_0$  taps: approximately 10ps

# Output Noise



- Measured by the oscilloscope with 59.9GHz Bandwidth
- Scope noise floor and the influence of the cable loss are corrected
- Less than 1.1mV<sub>rms</sub> for all peaking configurations

# Measurement Setup For Eye Diagrams



- Two probes: **-1.8dB@50GHz**
- Channel 1: **-9.0dB@50GHz**
- Channel 2: **-17.2dB@50GHz** **(-15.4dB@42GHz)**

# Measured Eye Diagrams

200Gb/s QPRBS9 PAM-4  
7.1dB Loss, No EQ



200Gb/s QPRBS9 PAM-4  
9.0dB Loss, FFE w/ LF EQ



# Measured Eye Diagrams

200Gb/s QPRBS9 PAM-4  
17.2dB Loss, FFE w/ LF EQ



200Gb/s QPRBS9 PAM-4  
17.2dB Loss, FFE w/o LF EQ



# Measured Eye Diagrams

**100Gb/s PRBS9 NRZ  
17.2dB Loss, FFE w/ LF EQ**



**168Gb/s QPRBS9 PAM-4  
15.4dB Loss, FFE w/ LF EQ**



# Outline

- Motivation
- FFE Architecture
- Delay Lines
- Distributed Tap Amplifier
- Variable-Gm Circuits
- Measurement results
- Comparisons and Conclusions

# Comparison with Prior Continuous-time FFEs

|                              | ISSCC'14 [4] | JSSC'10 [5] | JSSC'06 [6] | This Work      |
|------------------------------|--------------|-------------|-------------|----------------|
| Technology                   | 28-nm CMOS   | 65-nm CMOS  | 90-nm CMOS  | 28-nm CMOS     |
| Data Rate (Gb/s)             | 10 - 25      | 40          | 30          | 200            |
| Modulation                   | NRZ          | NRZ         | NRZ         | PAM-4          |
| Delay Element                | Active       | Active + LC | LC          | GCPW           |
| Number of taps               | 7            | 7           | 3           | 5              |
| LF Equalization              | No           | No          | No          | Yes            |
| Channel Loss (dB)            | -            | 9           | 15          | 17.2           |
| Power (mW)                   | 55 - 90      | 80          | 25          | 117 (86 in RX) |
| Noise (mV<sub>rms</sub>)     | <4*          | -           | -           | <1.1           |
| Core Area (mm<sup>2</sup>)   | 0.085        | 1           | 0.3         | 0.32           |

\* With a tunable delay line

# Comparison with an ADC-based 224Gb/s RX

|                              | ISSCC'22 [1]       | This Work           |
|------------------------------|--------------------|---------------------|
| Technology                   | 5-nm CMOS          | 28-nm CMOS          |
| Data Rate (Gb/s)             | 224                | 200                 |
| Equalization                 | CTLE + Digital FFE | Analog FFE w/ LF EQ |
| Channel Loss (dB)            | 31.6 (w/ TX FFE)   | 17.2                |
| Efficiency (pJ/b)            | 1.41 **            | 0.43 ***            |
| Core Area (mm<sup>2</sup>)   | 0.34 **            | 0.32 ***            |



\*\* AFE, ADC and Clocking, Excluding DSP

\*\*\* RX FFE only

**Die micrograph**  
**Core area: 0.32 mm<sup>2</sup>**

# Conclusions

- We presented a 200Gb/s delay-line-based receiver FFE
- Passive delay lines become feasible at higher data rates
- Distributed tap amplifiers improve the return loss
- RC source degeneration provides efficient low-frequency equalization

# A 4nm 32Gb/s 8Tb/s/mm Die-to-Die Chiplet Using NRZ Single-Ended Transceiver With Equalization Schemes And Training Techniques

**Kihwan Seong**, Donguk Park, Gyeomje Bae, Hyunwoo Lee,  
Youngseob Suh, Wooseok Oh, Hyemun Lee, Juyoung Kim,  
Takgun Lee, Gunhu Mo, Sukhyun Jung, Dongcheol Choi,  
Byoung-Joo Yoo, Sanghune Park, Hyo-Gyuem Rhew, Jongshin Shin



Samsung Electronics, Hwaseong, Korea

# Outline

- Motivation
- Overall Architecture
- Transmitter (TX)
- Receiver (RX)
- Measurement Results
- Performance Comparison
- Conclusion

# Motivation (1/2)

## Monolithic vs Chiplets



ref) S-H. You, et al., IEDM 2020

## Low Cost



ref) S. Naffziger, et al., ISSCC 2020

- **Monolithic chip problems:** cost, yield, and integration on a single die
- **Chiplet adoption in the industry is increasing**

# Motivation (2/2)



- Goal: develop high-bandwidth, energy-efficient chiplets

# Outline

- Motivation
- Overall Architecture
- Transmitter (TX)
- Receiver (RX)
- Measurement Results
- Performance Comparison
- Conclusion

# Overall Architecture (1/3)



- 2.5D package structure for die-to-die communication
- Channels implemented on a Si interposer (~3mm)

# Overall Architecture (2/3)



- Stacked in two rows to achieve higher beach-front BW
- A stack has four TX and RX Slices

# Overall Architecture (3/3)



- **PHY (H/M)**
  - A slice with 39 DQs and a DQS
- **PCS (S/M)**
  - Training logic (de-skew, alignment, etc.)
  - Encoder/Decoder for SSN

# Outline

- Motivation
- Overall Architecture
- **Transmitter (TX)**
- Receiver (RX)
- Measurement Results
- Performance Comparison
- Conclusion

# Transmitter (TX)



- NRZ single-ended transmitter with low-swing
- TX SYNC GEN and RCD used for low latency and low power

# TX synchronous reset generator (TX SYNC GEN)



- Search optimal sampling point
  - Step 1) Insert Phase update
  - Step 2) Change the PSEL code
  - Step 3) Check the BIST results
- Achieve low latency
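The search steps above could be sketched as a sweep over PSEL codes that checks a BIST result at each code. Everything in this sketch is an assumption: the `bist_pass` callback, the number of codes, and the choice of the center of the widest passing window as the sampling point. The phase-update step is not modeled.

```python
def find_psel(num_codes, bist_pass):
    """Sweep PSEL codes and return the center of the widest passing window.

    bist_pass is a hypothetical callback that returns True when the BIST
    check passes at the given PSEL code. Returns None if no code passes.
    """
    best_start, best_len = None, 0
    start = None
    # Iterate one step past the last code so a window ending there is closed.
    for code in range(num_codes + 1):
        ok = code < num_codes and bist_pass(code)
        if ok and start is None:
            start = code
        elif not ok and start is not None:
            if code - start > best_len:
                best_start, best_len = start, code - start
            start = None
    if best_start is None:
        return None
    return best_start + (best_len - 1) // 2


# Hypothetical example: codes 5 through 11 pass the BIST check.
chosen = find_psel(16, lambda code: 5 <= code <= 11)
print(chosen)
```

Centering inside the passing window is a common way to maximize margin on both sides, but the slides do not state which criterion the design uses.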

# Reflection Cancellation Driver (RCD)



- Impedance mismatch in the Si interposer causes reflections
- Eliminate the effect of reflected waves on the RX\_PAD

# Outline

- Motivation
- Overall Architecture
- Transmitter
- **Receiver**
- Measurement Results
- Performance Comparison
- Conclusion

# Receiver (RX)



- Direct decision feedback equalizer (DFE)
- DQs and DQS include de-skew circuits to support a wide range of operating frequencies

# De-Skew Circuits : Global de-skew



- Implemented in **DQS**
- Used when the data rate is low
- Consists of coarse and fine delay cells

# De-Skew Circuits : Local de-skew



- Implemented in **each DQ**
- Used when the data rate is high
- Consists of a phase rotator

# Global De-Skew Circuits

## Previous work



## This work



- Seamless and monotonic operation, unlike the previous work

# Wide Range De-Skew Circuits



- Wide-range delay line implemented using a multi-phase clock

# Direct Decision Feedback Equalizer (DFE)



# Periodic Skew Calibration (PSC)



- One *RX DQ* is used for *RX\_PSC*
- The sampling clock (*sam\_clk*) is aligned to the edge of the input data (*RX\_PSC*)

# Periodic Skew Calibration (PSC)

Normal Case



Drift Case by Temp variation



- ***RX\_PSC*** uses a toggle pattern (1010) as input
- ***RXPSC[15:0]*** is compared to monitor clock drift caused by temperature variation

# Outline

- Motivation
- Overall Architecture
- Transmitter
- Receiver
- **Measurement Results**
- Performance Comparison
- Conclusion

# Measurement Results and Chip Photo



# Outline

- Motivation
- Overall Architecture
- Transmitter
- Receiver
- Measurement Results
- **Performance Comparison**
- Conclusion

# Performance comparison



|                                    | [1] VLSI19          | [2] ISSCC21 | [3] ISSCC22      | [4] VLSI21        | [5] VLSI22            | This work         |
|------------------------------------|---------------------|-------------|------------------|-------------------|-----------------------|-------------------|
| Technology                         | 7nm                 | 7nm         | 5nm              | 7nm               | 5nm                   | 4nm               |
| Channel length (mm)                | 0.5<br>(Interposer) | 20<br>(MCM) | 5-to-80<br>(MCM) | 1<br>(Interposer) | 1.2<br>(Interposer**) | 3<br>(Interposer) |
| Bump pitch (um)                    | 40                  | 130         | -                | 40                | 55                    | 50                |
| Data rate (Gbps/pin)               | 8                   | 40          | 113              | 20                | 50.4                  | 32                |
| Bandwidth of beach front (Tbps/mm) | 0.625               | 0.45        | 0.46             | 5.31              | 2.68                  | 8                 |
| Power efficiency (pJ/bit)          | 0.56                | 1.17        | 1.55             | 0.46              | 0.297                 | 0.44              |
| FoM ((Tbps/mm)/(pJ/bit))           | 1.11                | 0.38        | 0.296            | 11.5              | 9                     | 18.2              |

\*\* An on-chip channel that simulates the characteristics of the interposer was used.
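
The FoM row is the beachfront bandwidth divided by the power efficiency. The short check below recomputes it from the table values; the dictionary layout and function name are illustrative only, and the quoted FoM values appear to be rounded or truncated, so a small tolerance is used.

```python
# (beachfront BW in Tbps/mm, power efficiency in pJ/bit, FoM quoted in the table)
ROWS = {
    "[1]": (0.625, 0.56, 1.11),
    "[2]": (0.45, 1.17, 0.38),
    "[3]": (0.46, 1.55, 0.296),
    "[4]": (5.31, 0.46, 11.5),
    "[5]": (2.68, 0.297, 9.0),
    "This work": (8.0, 0.44, 18.2),
}


def figure_of_merit(bw_tbps_per_mm, pj_per_bit):
    """FoM = (Tbps/mm) / (pJ/bit), as defined in the table's last row."""
    return bw_tbps_per_mm / pj_per_bit


for name, (bw, eff, quoted) in ROWS.items():
    print(f"{name:10s} computed={figure_of_merit(bw, eff):7.3f} quoted={quoted}")
```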

# Outline

- Motivation
- Overall Architecture
- Transmitter
- Receiver
- Measurement Results
- Performance Comparison
- Conclusion

# Conclusion

- Implemented **chiplet for 2.5D package in 4nm**
  - NRZ single-ended transceiver with 32Gb/s
  - RCD to reduce the effect of impedance mismatch
  - direct DFE with relaxed feedback time
  - training techniques for high-speed operation
- Achieved **8Tb/s/mm BW of beachfront**

# Thank you

# A 37.8dB Channel Loss 0.6us Lock Time CDR with Flash Frequency Acquisition in 5nm FinFET

**Chien-Kai Kao, Shih-Che Hung, Tse-Hsien Yeh, Chen-Yu Hsiao**

*MediaTek, Hsinchu, Taiwan*



# Outline

- Motivation
- Proposed CDR with Flash Frequency Acquisition
- Measurement Results
- Conclusions

# Motivation



- With high channel loss and frequency offset:
  - Difficult to predict CDR lock time
  - The CDR may end up unlocked because of ISI, RJ, DJ, or a different initial phase

# Motivation



- Conventional Frequency Detector:
  - Bang-bang result (UP/DN) → Slow 😞
  - Lock time increases with channel loss → Slow 😞
  - Could induce additional jitter → Performance degradation 😞
  - May need additional circuit and clock phase → Power increase 😞
- **Is it possible to achieve fast and constant lock time?**

# Observing PD Output

- Open loop sampling
- Assume frequency offset = 2000ppm:
- At T = 0 sampling cycle:



# Observing PD Output

- Open loop sampling
- Assume frequency offset = 2000ppm:
- At T = 125 sampling cycles:



# Observing PD Output

- Open loop sampling
- Assume frequency offset = 2000ppm:
- At T = 250 sampling cycles:



# Observing PD Output

- Open loop sampling
- Assume frequency offset = 2000ppm:
- At T = 375 sampling cycles:



# Observing PD Output

- Open loop sampling
- Assume frequency offset = 2000ppm:
- At T = 500 sampling cycles:



- UP/DN information forms a periodic signal

# Observing PD Output

- $\Delta\phi = nT_{CK} \cdot \Delta freq$
- If we store past phase information

$$(UP[0] - DN[0]) = S[0] \rightarrow z^{-n} \rightarrow S[n]$$



- Different phases of the periodic signal are obtained
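
As a rough numerical illustration of $\Delta\phi = nT_{CK} \cdot \Delta freq$, the sketch below expresses the accumulated phase drift in clock periods for the 2000ppm example. It assumes one sampling cycle equals one clock period $T_{CK}$ and that the offset is constant; the function name is illustrative.

```python
def phase_drift_cycles(n_cycles, offset_ppm):
    """Phase drift after n sampling cycles, in units of one clock period.

    n * T_CK * delta_freq reduces to n * (delta_freq / f_CK) = n * ppm / 1e6.
    """
    return n_cycles * offset_ppm / 1e6


for n in (0, 125, 250, 375, 500):
    print(n, phase_drift_cycles(n, offset_ppm=2000))
```

Under these assumptions the drift reaches one full clock period after 500 cycles, which is why the UP/DN pattern repeats with that period in the 2000ppm example.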

# Flash Frequency Acquisition: $R(n)$

- How to find frequency offset?
- $\Delta freq = (\Delta\phi / n) \cdot f_{CK}$
- $R(n) = E[S[0] \cdot S[n]]$



# Flash Frequency Acquisition: Orthogonality

- Find  $R(n) = E[S[0] \cdot S[n]] = 0$ 
  - $R(250) = 0$  for 1000ppm case
  - $R(125) = 0$  for 2000ppm case
- $\langle S[0], S[n] \rangle = 0 \Leftrightarrow S[0] \perp S[n]$   $\rightarrow \Delta freq = (1/4) \cdot (1/n) \times f_{CK}$
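
The orthogonality condition can be tried on a toy model. The sketch below is not the authors' circuit: it assumes the open-loop PD output $S[k]$ is an ideal ±1 square wave whose period is set by the frequency offset, evaluates $R(n)$ at a few candidate lags, and converts the lag with $R(n)$ closest to zero into an offset magnitude using $\Delta freq = (1/4) \cdot (1/n) \times f_{CK}$. All function names, the starting phase, and the candidate lags are illustrative choices.

```python
import math


def pd_stream(offset_ppm, num_samples, phase0=0.1):
    """Idealized open-loop PD output S[k] = UP - DN, modeled as a +/-1 square wave."""
    step = 2 * math.pi * offset_ppm * 1e-6  # phase advance per sampling cycle
    return [1 if math.sin(step * k + phase0) >= 0 else -1 for k in range(num_samples)]


def autocorr(s, n):
    """Sample estimate of R(n) = E[S[0] * S[n]]."""
    pairs = len(s) - n
    return sum(s[k] * s[k + n] for k in range(pairs)) / pairs


def estimate_offset_ppm(s, lags):
    """Pick the lag whose R(n) is closest to zero and map it to |offset| in ppm."""
    best = min(lags, key=lambda n: abs(autocorr(s, n)))
    return best, 1e6 / (4 * best)


lags = (62, 125, 250)  # candidate lags, evaluated side by side
print(estimate_offset_ppm(pd_stream(2000, 4000), lags))
print(estimate_offset_ppm(pd_stream(1000, 4000), lags))
```

Only the magnitude of the offset is recovered in this model, since $R(n)$ is the same for positive and negative offsets. The candidate lags are kept at or below a quarter of the longest expected period because $R(n)$ crosses zero again at three quarters of a period.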



# Flash Frequency Acquisition: Effect of Channel

- Channel loss doesn't change orthogonality
- With orthogonality, lock time can be independent of channel loss



# Flash Frequency Acquisition Design

- Flash-like design: using multiple ACF blocks at the same time
- One-hot activation circuit finds the ACF result closest to zero



# Proposed CDR with FFA

- Open loop frequency acquisition
- Autocorrelation network: a synthesized circuit



# Proposed CDR with FFA

- Overall RXFE response with maximum VGA gain
- Tunable CTLE peaking range: ~17dB



# Measurement Results

- M8040: generating pattern and adjusting frequency offset
- ISI board: providing various channel loss



# Measurement Results

- Channel loss only affects integration speed
- Correlators saturate around 280ns for 10dB case



# Measurement Results

- CDR lock time is independent of frequency offset
- With FFA, CDR can lock with 4000ppm frequency offset



# Measurement Results



- Measured at 32Gbps with PRBS31
- TX FFE and RX DFE are disabled
- CTLE provides 8dB boost at 16Gbps
- Each frequency offset: 50 samples
- Acquire frequency offset magnitude

# Measurement Results

## ■ Recovered clock jitter histogram at 32Gbps



# Performance Comparison

|                                        | ISSCC 17[1]      | JSSC 20[2]    | JSSC 22[3]        | This work                 |
|----------------------------------------|------------------|---------------|-------------------|---------------------------|
| <b>Technology [nm]</b>                 | 28               | 28            | 40                | 5                         |
| <b>Architecture</b>                    | Quarter rate     | Half rate     | Quarter rate      | Quarter rate              |
| <b>Data Rate [Gbps]</b>                | 22.5 - 32        | 6.5 - 12.5    | 6.4 - 32          | 1.25 - 32                 |
| <b>Methodology</b>                     | FD               | FD            | Pattern Weighting | Autocorrelation Network   |
| <b>Extra Clock Phase or Comparator</b> | Extra Comparator | Extra Phase   | Free              | Free                      |
| <b>Open Loop Detection</b>             | No               | No            | No                | Yes                       |
| <b>Acquire Frequency Magnitude</b>     | No               | No            | No                | Yes                       |
| <b>Constant Locking Time</b>           | No               | No            | No                | Yes                       |
| <b>Unlimited Locking Range</b>         | No               | No            | Yes               | No                        |
| <b>External Impairments</b>            | Not reported     | Not reported  | Not reported      | SJ = 0.16UI@100MHz        |
| <b>Data Pattern</b>                    | PRBS31           | PRBS31        | PRBS31            | PRBS31                    |
| <b>Channel Loss @Nyquist Rate [dB]</b> | 14.8             | 8.6           | 10                | <b>37.8</b>               |
| <b>Locking Time [us]</b>               | <10100           | <1.5          | <11               | <b>&lt;0.6</b>            |
| <b>Area [mm<sup>2</sup>]</b>           | 0.213            | 0.031         | 0.041             | 0.008 <sup>▲</sup>        |
| <b>Power [mW]</b>                      | 102 @32Gbps      | 21.13 @10Gbps | 30.8 @32Gbps      | 30.2 @32Gbps <sup>▲</sup> |
| <b>FOMp [pJ/bit/dB]</b>                | 0.22             | 0.25          | 0.1               | <b>0.025</b>              |
| <b>FOMs [us/dB]</b>                    | 682              | 0.174         | 1.1               | <b>0.016</b>              |

▲ CDR only
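
The two figures of merit can be reproduced from the other rows, assuming FOMp is power divided by the data rate at which it was measured and by channel loss, and FOMs is locking time divided by channel loss (as the units pJ/bit/dB and us/dB suggest). The locking times in the table are upper bounds, so the bound itself is used here; the data layout is illustrative.

```python
# name: (power in mW, data rate in Gbps for that power, channel loss in dB, lock-time bound in us)
CDRS = {
    "ISSCC 17 [1]": (102.0, 32.0, 14.8, 10100.0),
    "JSSC 20 [2]": (21.13, 10.0, 8.6, 1.5),
    "JSSC 22 [3]": (30.8, 32.0, 10.0, 11.0),
    "This work": (30.2, 32.0, 37.8, 0.6),
}


def fom_p(power_mw, rate_gbps, loss_db):
    """Energy per bit per dB of channel loss; mW / Gbps equals pJ/bit."""
    return power_mw / rate_gbps / loss_db


def fom_s(lock_time_us, loss_db):
    """Locking time per dB of channel loss, in us/dB."""
    return lock_time_us / loss_db


for name, (p, r, loss, t) in CDRS.items():
    print(f"{name:13s} FOMp={fom_p(p, r, loss):.3f} pJ/bit/dB  FOMs={fom_s(t, loss):.3f} us/dB")
```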

# Conclusions

- A 1.25-32Gbps CDR with FFA is designed and verified
- The flash frequency acquisition technique:
  - Operate in open loop
  - Acquire frequency offset magnitude
  - Use past sampling information
  - Lock with constant time
- Achieve 0.6us lock time under 37.8dB channel loss at 32Gbps

# A 0.83pJ/b 52Gb/s PAM-4 Baud-Rate CDR with Pattern-Based Phase Detector for Short-Reach Applications

Seungwoo Park, Yoonjae Choi, Jincheol Sim, Jonghyuck Choi,  
Hyunsu Park, Youngwook Kwon, and Chulwoo Kim



Korea University, Seoul, Korea

# Outline

- **Introduction**
- **Conventional Baud-Rate CDR**
  - Mueller-Muller PD (MMPD)
- **Proposed Baud-Rate CDR**
  - Pattern-Based PD
  - Time-Based Decoder
- **Measurement Results**
- **Conclusion**

# Quarter-Rate RX System



- **2x oversampling CDR**
  - Data + clock recovery (4-phase + 4-phase)
  - 8-phase generation & distribution
  - 8-phase skew calibration

😢 **Power-consuming 8-phase clocking**



- **Baud-rate CDR**
  - Data and clock recovery (4-phase)
  - 4-phase generation & distribution
  - 4-phase skew calibration

😊 **Power efficient 4-phase clocking**

# PAM-4 RX for MR and LR

## ■ CEI-56G-MR-PAM4

- Chip to Chip
- ~25dB channel loss @  $f_b/2$

## ■ CEI-56G-LR-PAM4

- Backplane
- ~35dB channel loss @  $f_b/2$

48~64Gbps PAM-4 TRX or RX\*:

\*ISSCC, VLSI, and JSSC publications from the past five years



# PAM-4 RX for SR (XSR+VSR) (1)

## ■ CEI-56G-XSR-PAM4

- Chip to Optical engine
- ~4dB channel loss @  $f_b/2$

## ■ CEI-56G-VSR-PAM4

- Chip to Module
- ~10dB channel loss @  $f_b/2$

48~64Gbps PAM-4 TRX or RX\*:

\*ISSCC, VLSI, and JSSC publications from the past five years



# PAM-4 RX for SR (XSR+VSR) (2)

## ■ CEI-56G-XSR-PAM4

- Chip to Optical engine
- ~4dB channel loss @  $f_b/2$

## ■ CEI-56G-VSR-PAM4

- Chip to Module
- ~10dB channel loss @  $f_b/2$

48~64Gbps PAM-4 TRX or RX\*:

\*ISSCC, VLSI, and JSSC publications from the past five years



# Outline

## ■ Introduction

## ■ Conventional Baud-Rate CDR

- Mueller-Muller PD (MMPD)

## ■ Proposed Baud-Rate CDR

- Pattern-Based PD
- Time-Based Decoder

## ■ Measurement Results

## ■ Conclusion

# NRZ Baud-Rate CDR



| $d[n-1]$        | $d[n]$ | $E[n-1]$ | $E[n]$ | PD    |
|-----------------|--------|----------|--------|-------|
| 0               | 1      | 1        | 0      | Early |
| 0               | 1      | 0        | 1      | Late  |
| 1               | 0      | 1        | 0      | Early |
| 1               | 0      | 0        | 1      | Late  |
| All other cases |        |          |        | Stay  |



- **Mueller-Muller PD (MMPD)**
  - Baud-rate operation using ISI
  - # of COMPs/UI: 1(data)+2(clock)
  - Reads two consecutive data samples and two error samples, so the output depends on four consecutive data bits
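
The truth table above maps directly onto a small lookup function. This is a behavioral sketch of the table only; the function and argument names are illustrative.

```python
def mmpd_nrz(d_prev, d_cur, e_prev, e_cur):
    """NRZ Mueller-Muller PD decision from two data bits and two error bits.

    Follows the table: a data transition with (E[n-1], E[n]) = (1, 0) is
    'Early', with (0, 1) it is 'Late', and every other case is 'Stay'.
    """
    if d_prev != d_cur:
        if (e_prev, e_cur) == (1, 0):
            return "Early"
        if (e_prev, e_cur) == (0, 1):
            return "Late"
    return "Stay"


print(mmpd_nrz(0, 1, 1, 0))
print(mmpd_nrz(1, 0, 0, 1))
print(mmpd_nrz(1, 1, 1, 0))
```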

# Transition Density of MMPD (1)



## Case study (0011,1100)

- Early clock: PD is 'Early'
- Late clock: PD is 'Late'
- 100% update



# Transition Density of MMPD (2)



## Case study (0010,1101)

- Early clock: PD is 'Early'
- Late clock: PD is 'Stay'
- 50% update

Error transition X



# Transition Density of MMPD (3)



## Case study (1011,0100)

- Early clock: PD is 'Stay'
- Late clock: PD is 'Late'
- 50% update

Error transition X



# Transition Density of MMPD (4)



## Case study (Total)

- Total of 16( $=2^4$ ) cases
- Transition density:  
 $(2+2\times0.5+2\times0.5)/16=1/4$

Reduced transition density:  
 $1/4 \leftrightarrow 1/2$  (2x oversampling)



# MMPD Extended to PAM-4 CDR for LR (1)



## ■ ADC based RX for LR

- High channel loss
- 7b ADC-DSP equalizer
- # of COMPs/UI: 7(DSP)

# MMPD Extended to PAM-4 CDR for LR (2)



## ■ ADC based RX for LR

- High channel loss
- 7b ADC-DSP equalizer
- # of COMPs/UI: 7(DSP) also used for 3(data)+4(clock)

# MMPD Extended to PAM-4 CDR for SR



Reduced # of COMPs

Two examples for SR



## ■ ADC based RX for LR

- High channel loss
- 7b ADC-DSP equalizer
- # of COMPs/UI: 3(data)+4(clock)

## ■ Analog RX for SR

- Low channel loss
- Analog equalizer
- # of COMPs/UI: 3(data)+2(clock)

# Transition Density of MMPD in PAM-4



| Pattern (rising) | Pattern (falling) | PD update           |
|------------------|-------------------|---------------------|
| -3, -3, +3, +3   | +3, +3, -3, -3    | Early & Late (100%) |
| -3, -3, +3, +1   | +3, +3, -3, -1    | 50%                 |
| -3, -3, +3, -1   | +3, +3, -3, +1    | 50%                 |
| -3, -3, +3, -3   | +3, +3, -3, +3    | 50%                 |
| +3, -3, +3, +3   | -3, +3, -3, -3    | 50%                 |
| +1, -3, +3, +3   | -1, +3, -3, -3    | 50%                 |
| -1, -3, +3, +3   | +1, +3, -3, -3    | 50%                 |

| $d[n-1]$        | $d[n]$ | $E[n-1]$ | $E[n]$ | PD    |
|-----------------|--------|----------|--------|-------|
| -3              | +3     | 1        | 0      | Early |
| -3              | +3     | 0        | 1      | Late  |
| +3              | -3     | 1        | 0      | Early |
| +3              | -3     | 0        | 1      | Late  |
| All other cases |        |          |        | Stay  |

## Case study

- Total of 256( $=4^4$ ) cases
- Transition density:  
 $(2+6 \times 0.5 + 6 \times 0.5)/256 = 1/32$
- Reduced transition density:  
 $1/32 \leftrightarrow 1/4$  (2x oversampling)

# Drawbacks of MMPD in PAM-4



## 😢 x1/8 transition density

- Reduced PD gain ( $K_{PD}$ )
  - Increased loop filter gains ( $K_I$ ,  $K_P$ )
  - Increased quantization noise
- Degraded recovered clock jitter

## 😢 x5/4 # of COMPs/UI

- Increased COMPs power
- Limited bandwidth of VGA

# Outline

- Introduction
- Conventional Baud-Rate CDR
  - Mueller-Muller PD (MMPD)
- Proposed Baud-Rate CDR
  - Pattern-Based PD
  - Time-Based Decoder
- Measurement Results
- Conclusion

# NRZ Baud-rate CDR



Four data patterns      Error based on  $V_{REFP,N}$

| $d[n-1]$        | $d[n]$ | $d[n+1]$ | $E[n]$ | PD         |
|-----------------|--------|----------|--------|------------|
| 0               | 0      | 1        | -/+    | Early/Late |
| 1               | 0      | 0        | -/+    | Late/Early |
| 1               | 1      | 0        | -/+    | Late/Early |
| 0               | 1      | 1        | -/+    | Early/Late |
| All other cases |        |          |        | Stay       |



- **Pattern-based PD**
  - Baud-rate operation using ISI
  - # of COMPs/UI: 1(data)+2(clock)
  - Reads three consecutive data samples and one error sample, so the output depends on only three consecutive data bits
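
A behavioral sketch of the NRZ pattern-based PD table, together with a count of how many of the eight 3-bit patterns produce an update. The names are illustrative, and the error input is assumed to be encoded as a signed value whose sign selects the first or second entry of each table row.

```python
from fractions import Fraction
from itertools import product

# (d[n-1], d[n], d[n+1]) -> (decision for E[n] = '-', decision for E[n] = '+')
PATTERN_TABLE = {
    (0, 0, 1): ("Early", "Late"),
    (1, 0, 0): ("Late", "Early"),
    (1, 1, 0): ("Late", "Early"),
    (0, 1, 1): ("Early", "Late"),
}


def pattern_pd(d_prev, d_cur, d_next, err_sign):
    """Return 'Early', 'Late' or 'Stay' for one 3-bit pattern and one error sign."""
    entry = PATTERN_TABLE.get((d_prev, d_cur, d_next))
    if entry is None:
        return "Stay"
    return entry[0] if err_sign < 0 else entry[1]


active = sum(p in PATTERN_TABLE for p in product((0, 1), repeat=3))
density = Fraction(active, 2 ** 3)
print(pattern_pd(0, 0, 1, -1), pattern_pd(1, 1, 0, +1), density)
```

Four of the eight patterns are active, which gives the 1/2 transition density for equiprobable random data.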

# Transition Density of Pattern-Based PD

- Case study
  - Total of 8( $=2^3$ ) cases
  - Transition density:  $4/8=1/2$
  - Increased transition density:  
  $1/2 \leftrightarrow 1/4$  (MMPD)

100% update



→ Extend pattern-based PD to PAM-4 signaling!

# $V_{REF}$ s for Pattern-Based PD in PAM-4 (1)

- Patterns -3+3+3 and +3+3-3 used for phase detection



# $V_{REF}$ s for Pattern-Based PD in PAM-4 (2)

- Patterns -3+3+3 and +3+3-3 used for phase detection



Fully equalized

Increased eye margin  
But, reduced slope



- Patterns -3+1+3 and +3+1-3 used for phase detection (adopted)



Fully equalized

Increased eye margin  
And, secured slope



# PAM-4 Pattern-Based PD (1)

- Option 1 @ under-equalized → Adopt C1,2,5,6



|                 | $d[n-1]$ | $d[n]$ | $d[n+1]$ | $E[n]$ | PD   |
|-----------------|----------|--------|----------|--------|------|
| C1              | -3       | +1     | +3       | -/+    | E/L  |
| C2              | +3       | +1     | -3       | -/+    | L/E  |
| C3              | -1       | +1     | +3       | -/+    | E/L  |
| C4              | +3       | +1     | -1       | -/+    | L/E  |
| C5              | +3       | -1     | -3       | -/+    | L/E  |
| C6              | -3       | -1     | +3       | -/+    | E/L  |
| C7              | +1       | -1     | -3       | -/+    | L/E  |
| C8              | -3       | -1     | +1       | -/+    | E/L  |
| All other cases |          |        |          |        | Stay |

😊 Increased transition density:  $1/16(=4/4^3) \leftrightarrow 1/32$  (MMPD)

# PAM-4 Pattern-Based PD (2)

## ■ Lock point of option1 @ under-equalized



- Adopting C1,2,5,6
- One-point lock
- $h_{-1}, h_1 > 0$
- Shifted to  $-3h_{-1} + 3h_1 = 3h_{-1} - 3h_1 \rightarrow h_{-1} = h_1$



# PAM-4 Pattern-Based PD (3)

- Option 2 @ equalized → Adopt C1~C8



|                 | $d[n-1]$ | $d[n]$ | $d[n+1]$ | $E[n]$ | PD   |
|-----------------|----------|--------|----------|--------|------|
| C1              | -3       | +1     | +3       | -/+    | E/L  |
| C2              | +3       | +1     | -3       | -/+    | L/E  |
| C3              | -1       | +1     | +3       | -/+    | E/L  |
| C4              | +3       | +1     | -1       | -/+    | L/E  |
| C5              | +3       | -1     | -3       | -/+    | L/E  |
| C6              | -3       | -1     | +3       | -/+    | E/L  |
| C7              | +1       | -1     | -3       | -/+    | L/E  |
| C8              | -3       | -1     | +1       | -/+    | E/L  |
| All other cases |          |        |          |        | Stay |

😊 Increased transition density:  $1/8 (=8/4^3) \leftrightarrow 1/32 (MMPD)$
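
The C1~C8 table and the two adoption options can be written as a lookup, with the transition densities obtained by counting active patterns among the $4^3$ possible symbol triples. This is a behavioral sketch with illustrative names; the 1/32 MMPD figure is taken from the earlier slide rather than recomputed, and the error input is assumed to be a signed value.

```python
from fractions import Fraction

# name: ((d[n-1], d[n], d[n+1]), (decision for E[n] = '-', decision for E[n] = '+'))
CASES = {
    "C1": ((-3, +1, +3), ("Early", "Late")),
    "C2": ((+3, +1, -3), ("Late", "Early")),
    "C3": ((-1, +1, +3), ("Early", "Late")),
    "C4": ((+3, +1, -1), ("Late", "Early")),
    "C5": ((+3, -1, -3), ("Late", "Early")),
    "C6": ((-3, -1, +3), ("Early", "Late")),
    "C7": ((+1, -1, -3), ("Late", "Early")),
    "C8": ((-3, -1, +1), ("Early", "Late")),
}
OPTION_1 = ("C1", "C2", "C5", "C6")  # under-equalized
OPTION_2 = tuple(CASES)              # equalized: C1~C8


def pam4_pattern_pd(d_prev, d_cur, d_next, err_sign, adopted=OPTION_2):
    """Return 'Early', 'Late' or 'Stay' using only the adopted cases."""
    for name in adopted:
        pattern, (on_minus, on_plus) = CASES[name]
        if pattern == (d_prev, d_cur, d_next):
            return on_minus if err_sign < 0 else on_plus
    return "Stay"


def transition_density(adopted):
    """Fraction of the 4**3 equiprobable symbol triples that update the PD."""
    return Fraction(len(adopted), 4 ** 3)


MMPD_DENSITY = Fraction(1, 32)  # quoted on the MMPD slide
for label, option in (("option 1", OPTION_1), ("option 2", OPTION_2)):
    d = transition_density(option)
    print(label, d, "gain over MMPD:", d / MMPD_DENSITY)
```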

# PAM-4 Pattern-Based PD (4)

## ■ Lock point of option2 @ equalized



- Adopting C1~C8 (8cases)
- Four-point lock
- $h_{-1}, h_1 \approx 0$
- Lock points gather near one point, $T_{d[n]}$



# Top Block Diagram



## ■ Clock path

- Pattern-based PD
- PLL-based CDR (ring OSC)
  - ✓ Proportional path directly from MV
  - ✓ Integration path from digital logic

## ■ Data path

- AFE (CTLE+VGA)
- Decoder (COMP+time-based dec.)

# Outline

- Introduction
- Conventional Baud-Rate CDR
  - Mueller-Muller PD (MMPD)
- Proposed Baud-Rate CDR
  - Pattern-Based PD
  - Time-Based Decoder
- Measurement Results
- Conclusion

# Shared Path for Data Recovery and PD



## ■ Data recovery

- -3, +3 decoded by  $COMP_P$  and  $COMP_N$
- -1, +1 decoded by time-based dec. and shared path  
→ Reduced # of COMPs

# Comparator Operation When Data = +1 (1)



$$V_{INP} = +1$$



- The decision time of  $COMP_{NM}$  is faster than  $COMP_{PM}$
- Node  $NM_N$  falls first

# Comparator Operation When Data = +1 (2)



# Comparator Operation When Data = -1 (1)



$V_{INP} = -1$



Waveform nodes: $PM_P$, $PM_N$, $NM_P$, $NM_N$



- The decision time of  $COMP_{PM}$  is faster than  $COMP_{NM}$
- Node  $PM_P$  falls first

# Comparator Operation When Data = -1 (2)



# Time-Based Decoder Operation



Time-based decoder

- Fastest rise time detection between positive input nodes ( $I_{PM_P}$ ,  $I_{PM_N}$ ) and negative input nodes ( $I_{NM_P}$ ,  $I_{NM_N}$ )
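
The idea of resolving the inner levels by comparing decision times can be illustrated with a toy behavioral model. Everything numeric here is an assumption rather than the authors' design: the outer comparators are given thresholds at ±2, $COMP_{PM}$ and $COMP_{NM}$ are given references at +1 and -1, and the regeneration time follows the common $\tau \ln(V_{full}/|V_{in}-V_{ref}|)$ latch approximation.

```python
import math

V_OUTER = 2.0               # assumed threshold magnitude for COMP_P / COMP_N
V_REFP, V_REFN = +1.0, -1.0  # assumed references of COMP_PM / COMP_NM


def regen_time(v_in, v_ref, tau=1.0, v_full=10.0):
    """Toy latch model: a smaller input overdrive gives a longer decision time."""
    overdrive = max(abs(v_in - v_ref), 1e-12)  # guard against zero overdrive
    return tau * math.log(v_full / overdrive)


def decode_pam4(v_in):
    """Decode one PAM-4 sample (nominal levels -3, -1, +1, +3)."""
    if v_in > V_OUTER:
        return +3
    if v_in < -V_OUTER:
        return -3
    t_pm = regen_time(v_in, V_REFP)
    t_nm = regen_time(v_in, V_REFN)
    # The comparator whose reference is farther from the input resolves first.
    return +1 if t_nm < t_pm else -1


for v in (3.1, 0.9, -1.2, -2.8):
    print(v, decode_pam4(v))
```

In this model a +1 input sits close to the $COMP_{PM}$ reference, so $COMP_{NM}$ decides first, which matches the ordering stated on the slides; a -1 input gives the opposite ordering.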



# Comparison of PAM-4 CDR (1)



# Comparison of PAM-4 CDR (2)



# Outline

## ■ Introduction

## ■ Conventional Baud-Rate CDR

- Mueller-Muller PD (MMPD)

## ■ Proposed Baud-Rate CDR

- Pattern-Based PD
- Time-Based Decoder

## ■ Measurement Results

## ■ Conclusion

# Chip Microphotograph and Power Breakdown



|   | Blocks         | Area (μm<sup>2</sup>)  |
|---|----------------|------------------------|
| A | AFE            | 987                    |
| B | COMPs + TB DEC | 891                    |
| C | Clock Dist.    | 668                    |
| D | Ring OSC       | 351                    |
| E | CR Logic       | 2690                   |

# Measurement Setup



1. 52Gb/s PRBS7 PAM-4 input signal is generated from (1) and confirmed at (2)
2. The  $V_{REF}$ s are manually controlled so that the CDR is locked and the BER is minimized.
3. After the CDR is locked, the phase noise of the recovered clock is measured in (3)
4. Jitter tolerance (JTOL) is measured in (4) by adding sinusoidal jitter at (1)

# Measurement Results



- CDR bandwidth is estimated at 10~20 MHz by comparison with the phase noise of the free-running oscillator
- Integrated jitter from 1kHz to 100MHz is 430fs
- Jitter tolerance mask is satisfied

# Comparison Table with Previous Works

|                         | ISSCC'18 [1]          | ISSCC'19 [2]          | VLSI'19 [4]            | ISSCC'20 [3]             | JSSC'22 [5]               | This work                        |
|-------------------------|-----------------------|-----------------------|------------------------|--------------------------|---------------------------|----------------------------------|
| Technology              | 28nm FDSOI-CMOS       | 7nm FinFET            | 40nm CMOS              | 10nm FinFET              | 40nm CMOS                 | 28nm CMOS                        |
| Modulation              | PAM-4                 | PAM-4                 | PAM-4                  | PAM-4                    | PAM-4                     | PAM-4                            |
| Data Rate [Gb/s]        | 64                    | 56.25                 | 52                     | 56                       | 48                        | 52                               |
| Channel Loss [dB]       | 16.8                  | 17.8                  | 7.3                    | 38                       | 4                         | 7.1                              |
| Clock architecture      | Quarter-rate PI-based | Quarter-rate PI-based | Quarter-rate PLL-based | 32-way PI-based          | Half-rate PI-based        | Quarter-rate PLL-based           |
| PD type                 | 2x oversampling       | 2x oversampling       | 2x oversampling        | Baud-rate (MM PD)        | Baud-rate (Stochastic PD) | <b>Baud-rate (Pattern-based)</b> |
| # of COMPs per UI       | 4                     | 4                     | 3.5*                   | 32 x 7b ADC              | 5                         | <b>4</b>                         |
| BER                     | $10^{-12}$            | $10^{-12}$            | $10^{-12}$             | $10^{-10}$               | $10^{-11}$                | $10^{-12}$                       |
| Equalization            | TX FIR CTLE           | TX FIR CTLE           | CTLE 1-tap FFE         | CTLE 9-tap FFE 1-tap DFE | CTLE 1-tap DFE            | CTLE                             |
| Inductor-less           | No                    | No                    | No                     | No                       | No                        | Yes                              |
| Power [mW]              | 180                   | 79                    | 48                     | 321.2**                  | 116.3                     | <b>43.1</b>                      |
| Energy Eff. [pJ/bit]    | 2.8                   | 1.41                  | 0.92                   | 7.7**                    | 2.42                      | <b>0.83</b>                      |
| Area [mm <sup>2</sup> ] | 0.32                  | 0.13                  | 0.72                   | 0.72 / Lane              | 0.24                      | <b>0.011</b>                     |

\*To save power, only two edge comparators are utilized for clock recovery

\*\*TX+RX

# Conclusion

- A PAM-4 baud-rate CDR for short-reach applications is proposed to reduce clocking power compared to a 2x oversampling CDR
- The proposed CDR is implemented in a 28nm CMOS process and achieves 0.83pJ/b energy efficiency at 52Gb/s
  - Pattern-based PD: baud-rate operation with 4x the transition density of MMPD
  - Time-based decoder: reduced # of COMPs/UI (5→4)

# **A 128Gb/s PAM-4 Transmitter with Programmable-Width Pulse Generator and Pattern-Dependent Pre-Emphasis in 28nm CMOS**

**Kai Sheng, Weixin Gai, Zeze Feng, Haowei Niu, Bingyi Ye, Hang Zhou**



**Peking University, Beijing, China**

# Outline

- Motivation
- TX Architecture
- Programmable-Width Pulse Generator
- Pattern-Dependent Pre-Emphasis
- TX Clocking
- Measurement Results
- Summary

# PAM-4 Signaling Challenges



- Complex transitions introduce larger data-dependent jitter (DDJ)
  - Optimize driver to achieve faster transitions
- Eye height is reduced to 1/3
  - Avoid extra swing reduction caused by de-emphasis FFE

# Outline

- Motivation
- TX Architecture
- Programmable-Width Pulse Generator
- Pattern-Dependent Pre-Emphasis
- TX Clocking
- Measurement Results
- Summary

# TX Architecture



- 6 segments for FFE
- 3 segments for pattern-dependent pre-emphasis
- Selection-based clock path

# Outline

- Motivation
- TX Architecture
- **Programmable-Width Pulse Generator**
- Pattern-Dependent Pre-Emphasis
- TX Clocking
- Measurement Results
- Summary

# Effect of Common-Mode Voltage



- Crossing points of pulses ( $V_{cm}$ ) set the gain of driver
- $V_{cm} \approx V_{DD}/2$  for most pulse generator designs
- Increasing  $V_{cm}$  yields higher gain / faster edges

# Programmable-Width Pulse Generator (1/2)



- Pre-discharge node X to adjust the pulse width
- Wider pulses produce a higher  $V_{cm}$

# Programmable-Width Pulse Generator (2/2)



- Pre-discharge node X to adjust the pulse width
- Wider pulses produce a higher $V_{cm}$
- 13% increase in eye width

# Outline

- Motivation
- TX Architecture
- Programmable-Width Pulse Generator
- **Pattern-Dependent Pre-Emphasis**
- TX Clocking
- Measurement Results
- Summary

# Pattern Detection Logic



| Transition | Direction | Pull-up flag | Pull-down flag |
|------------|-----------|--------------|----------------|
| Major      | Rising    | 1            | 0              |
| Major      | Falling   | 0            | 1              |
| Middle     | Rising    | 1            | 0              |
| Middle     | Falling   | 0            | 1              |
| Minor      | Rising    | 1            | 0              |
| Minor      | Falling   | 0            | 1              |


- Each segment monitors one kind of transition
- Pull-up / pull-down flags become 1 at the corresponding transitions

# 4:1 & Current-Injection Switch



- Inject single-ended current based on pull-up / pull-down flags
- Injection currents are proportional to transition amplitudes
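
The flag table can be sketched as a function of two consecutive PAM-4 symbols. The mapping of "major", "middle" and "minor" to transitions of three, two and one level steps is an assumption made for this illustration, as are the function name and the returned amplitude, which stands in for the relative injection current.

```python
KIND_BY_STEPS = {3: "major", 2: "middle", 1: "minor"}  # assumed naming


def detect_transition(prev_sym, cur_sym):
    """Return (kind, pull_up_flag, pull_down_flag, amplitude_in_level_steps).

    Symbols are PAM-4 levels from (-3, -1, +1, +3). A rising transition
    sets the pull-up flag, a falling one sets the pull-down flag.
    """
    steps = (cur_sym - prev_sym) // 2  # adjacent levels are 2 apart
    if steps == 0:
        return None, 0, 0, 0
    return KIND_BY_STEPS[abs(steps)], int(steps > 0), int(steps < 0), abs(steps)


symbols = [-3, +3, +1, +1, -3]
for prev_sym, cur_sym in zip(symbols, symbols[1:]):
    print(prev_sym, "->", cur_sym, detect_transition(prev_sym, cur_sym))
```

Only the previous and current symbols are inspected, in line with the monitored pattern length of two stated for this work.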

# Equalization Effect of Pre-Emphasis

Simulated eye diagrams without and with pre-emphasis



- Output swing does not shrink
- Increasing the length of monitored pattern
  - ✓ provides better equalization
  - ✗ degrades output bandwidth
  - The length is two in this work

# Outline

- Motivation
- TX Architecture
- Programmable-Width Pulse Generator
- Pattern-Dependent Pre-Emphasis
- TX Clocking
- Measurement Results
- Summary

# TX Clocking



- CK4: TIA-based clock receiver & resonant buffer
- CK8: Use selected CK4 to address timing issue

# Outline

- Motivation
- TX Architecture
- Programmable-Width Pulse Generator
- Pattern-Dependent Pre-Emphasis
- TX Clocking
- **Measurement Results**
- Summary

# Measurement Setup



6.7: A 128Gb/s PAM-4 Transmitter with Programmable-Width Pulse Generator and Pattern-Dependent Pre-Emphasis in 28nm CMOS

# Die Photo



■ Area: 0.137mm<sup>2</sup>

# 64Gb/s NRZ Eye Diagram



- Eye width = 0.63UI
- Eye height = 246mV

# 128Gb/s PAM-4 Eye Diagrams



with FFE



with FFE and pre-emphasis

## PAM-4 Eye-Opening

| Pre-emphasis | OFF                                       | ON                                        | Eye Area Increase |
|--------------|-------------------------------------------|-------------------------------------------|-----------------|
| Upper Eye    | $EH = 41\text{mV}$ , $EW = 0.17\text{UI}$ | $EH = 44\text{mV}$ , $EW = 0.2\text{UI}$  | $26\% \uparrow$ |
| Middle Eye   | $EH = 37\text{mV}$ , $EW = 0.18\text{UI}$ | $EH = 41\text{mV}$ , $EW = 0.2\text{UI}$  | $23\% \uparrow$ |
| Lower Eye    | $EH = 36\text{mV}$ , $EW = 0.16\text{UI}$ | $EH = 40\text{mV}$ , $EW = 0.18\text{UI}$ | $25\% \uparrow$ |
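
The area increase in the last column is consistent with approximating each eye's area as eye height times eye width. The check below recomputes it under that rectangular-area assumption; the data layout is illustrative.

```python
# eye: ((EH in mV, EW in UI) with pre-emphasis OFF, (EH, EW) with pre-emphasis ON)
EYES = {
    "Upper": ((41, 0.17), (44, 0.20)),
    "Middle": ((37, 0.18), (41, 0.20)),
    "Lower": ((36, 0.16), (40, 0.18)),
}


def area_increase_percent(off, on):
    """Relative growth of EH * EW when pre-emphasis is switched on."""
    area_off = off[0] * off[1]
    area_on = on[0] * on[1]
    return (area_on / area_off - 1.0) * 100.0


for name, (off, on) in EYES.items():
    print(f"{name:6s} eye: {area_increase_percent(off, on):.1f}% larger")
```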

# Performance Comparison

|                                         | This work                                  | ISSCC'18 [1] | ISSCC'19 [2] | ISSCC'19 [3] | ISSCC'20 [4] | ISSCC'21 [5] |
|-----------------------------------------|--------------------------------------------|--------------|--------------|--------------|--------------|--------------|
| Technology                              | 28nm                                       | 10nm         | 14nm         | 40nm         | 7nm          | 7nm          |
| Modulation                              | PAM-4                                      | PAM-4        | PAM-4        | PAM-4        | PAM-4        | PAM-4        |
| Data Rate (Gb/s)                        | 128                                        | 112          | 128          | 112          | 112          | 112          |
| Driver Type                             | CML<br>(with edge optimization)            | CML          | Tailless CML | SST          | H-bridge     | SST          |
| Output Swing ( $V_{ppd}$ )              | 0.84                                       | 0.75         | 1            | 1            | 1.2          | -            |
| Equalization                            | 4-tap FFE + Pattern-Dependent Pre-Emphasis | 3-tap FFE    | 3-tap FFE    | 4-tap FFE    | 7-tap FFE    | 8-tap FFE    |
| Efficiency (pJ/b)<br>(with clocking)    | 1.4                                        | 2.07         | 1.3          | 3.89         | 1.56         | 1.4          |
| Efficiency (pJ/b)<br>(without clocking) | 0.9                                        | 1.72         | -            | -            | 1.05         | -            |
| Area (mm <sup>2</sup> )                 | 0.137                                      | 0.030        | 0.048        | 0.560        | 0.193        | 0.032        |

# Outline

- Motivation
- TX Architecture
- Programmable-Width Pulse Generator
- Pattern-Dependent Pre-Emphasis
- TX Clocking
- Measurement Results
- Summary

# Summary

- Programmable-width pulse generator improves the eye width by 13%
- Pattern-dependent pre-emphasis enlarges the eye-opening by ~25%
- The TX achieves 128Gb/s with a power efficiency of 1.4pJ/b and area of 0.137mm<sup>2</sup> in 28nm CMOS

# Thank you!

# A 100Gb/s 1.6Vppd PAM-8 Transmitter with High-Swing 3+1 Hybrid FFE Taps in 40nm

Jeonghyu Yang, Eunji Song, Seungwook Hong, Dongjun Lee,  
Sangwan Lee, Hyunwoo Im, Taeho Shin, and Jaeduk Han



Hanyang University, Seoul, Republic of Korea

# Outline

- Introduction
- Architecture
- Circuit Implementation
- Measurement Results
- Comparison & Conclusion

# Outline

- **Introduction**
- **Architecture**
- **Circuit Implementation**
- **Measurement Results**
- **Comparison & Conclusion**

# Key Challenges in PAM-8 Modulation



NRZ



PAM-4



PAM-8

- Degraded SNR ( $\propto \frac{\sqrt{\#\_of\_bits}}{2^{\#\_of\_bits}-1}$ ) for the same signal power
  - 12.13dB reduction in PAM-8 (relative to NRZ)
  - To overcome the SNR penalty, the **signal power constraint** needs to be resolved
- **High-voltage driver**
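
The quoted penalty follows from the proportionality above when the ratio is treated as an amplitude ratio, that is, converted with $20\log_{10}$. The sketch below evaluates it for 1, 2 and 3 bits per symbol; the function name is illustrative.

```python
import math


def snr_penalty_db(bits_per_symbol):
    """20*log10 of sqrt(bits) / (2**bits - 1), relative to NRZ (1 bit/symbol)."""
    ratio = math.sqrt(bits_per_symbol) / (2 ** bits_per_symbol - 1)
    return 20 * math.log10(ratio)


for name, bits in (("NRZ", 1), ("PAM-4", 2), ("PAM-8", 3)):
    print(f"{name:5s} {snr_penalty_db(bits):7.2f} dB")
```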

# Key Challenges in High-Swing Driver Design



- Max. swing is hard-constrained by process technology
  - Reliability:  $V_{TT} \leq BV_{GD}$
  - Current drive:  $V_{TT} - V_{sw,max} \geq V_{DRV,min}$ (to drive current through $M_0$)
  - $\rightarrow V_{sw,max} \leq BV_{GD} - V_{DRV,min}$
- Circuit techniques to overcome the constraints need to be investigated

# Outline

- Motivation
- **Architecture**
- Circuit Implementation
- Measurement Results
- Comparison & Conclusion

# Summary of Design Considerations

**100 Gb/s TX in 40nm CMOS**

| Item               | Design Choice | Consideration Items  |
|--------------------|---------------|----------------------|
| Modulation         | PAM-8         | Data-rate, Baud-rate |
| Input Clock Freq.  | 16.65 GHz     | Power                |
| Final Multiplexing | 4:1           | BW, Power            |
| Pre-driver         | Inverter      | BW, Power            |
| FFE Taps           | 3+1           | Hardware overhead    |
| Driver             | CML           | Linearity, BW        |
| Frontend           | T-coil        | BW                   |
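The modulation and clock entries in this table are linked by simple arithmetic. The sketch below derives the symbol rate and Nyquist frequency for 100 Gb/s PAM-8. It then checks the listed input clock under the assumption, not stated on the slide, that the clock runs at half the symbol rate.

```python
import math

def pam_link_rates(data_rate_gbps: float, levels: int) -> dict:
    """Symbol rate and Nyquist frequency for a PAM-N link."""
    bits_per_symbol = math.log2(levels)
    baud_gbd = data_rate_gbps / bits_per_symbol
    return {
        "bits_per_symbol": bits_per_symbol,
        "baud_gbd": baud_gbd,
        "nyquist_ghz": baud_gbd / 2.0,
    }

rates = pam_link_rates(100.0, 8)
print(f"{rates['baud_gbd']:.2f} GBd, Nyquist {rates['nyquist_ghz']:.2f} GHz")

# Assumption: a half-rate clock, so data rate = 2 * f_clk * bits per symbol.
input_clock_ghz = 16.65
implied_rate_gbps = 2.0 * input_clock_ghz * rates["bits_per_symbol"]
print(f"Implied data rate: {implied_rate_gbps:.1f} Gb/s")
```

With a 16.65 GHz clock this gives about 99.9 Gb/s, consistent with the nominal 100 Gb/s rate and with the 16.65 GHz frequency at which the channel loss is reported.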

# Transmitter Architecture



- **3-tap FIR shuffler**
  - Variable tap configuration
- **Auxiliary tap control**
  - Reconfigurable MSB FFE
- **High-speed MUX**
  - Single-stack
- **Limiting pre-driver**
  - Skewed P/N ratio
- **High-swing driver**
  - Protective cascoding
  - Current bleeding

# Outline

- Introduction
- Architecture
- **Circuit Implementation**
  - 3-tap FIR Shuffler
  - Auxiliary Tap Control
  - High-Speed 4:1 Serializer
  - Skewed Pre-driver
  - High-Swing PAM-8 FFE Driver
- Measurement Results
- Comparison & Conclusion

# 3-tap FIR Shuffler



- Reconfigurable sign & data allocation
- Area & power efficient
  - Device overhead ↓
- Supports five preset modes

| Preset | TAPA (1x) | TAPB (4x) | TAPC (1x) | Tap Assignments     |
|--------|-----------|-----------|-----------|---------------------|
| Mode 0 | -D[n]     | +D[n-1]   | -D[n-2]   | [Pre, Main, Post]   |
| Mode 1 | +D[n-1]   | +D[n-1]   | +D[n-1]   | FFE off             |
| Mode 2 | -D[n-1]   | +D[n]     | -D[n-2]   | [Main, Post, Post2] |
| Mode 3 | +D[n-1]   | +D[n-1]   | -D[n-2]   | [Main, Main, Post]  |
| Mode 4 | -D[n]     | +D[n-1]   | +D[n-1]   | [Pre, Main, Main]   |
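The preset table can be read as a small behavioral FIR model. The sketch below transcribes each mode as (sign, delay) pairs and uses the 1x/4x/1x sizes from the table header as fixed relative weights. Treating those sizes as the actual FFE coefficients, and the symbols as normalized numbers, are simplifying assumptions. All names are illustrative.

```python
# (sign, delay) per tap, in TAPA/TAPB/TAPC order; delay k means the tap carries D[n-k].
MODES = {
    0: [(-1, 0), (+1, 1), (-1, 2)],  # [Pre, Main, Post]
    1: [(+1, 1), (+1, 1), (+1, 1)],  # FFE off
    2: [(-1, 1), (+1, 0), (-1, 2)],  # [Main, Post, Post2]
    3: [(+1, 1), (+1, 1), (-1, 2)],  # [Main, Main, Post]
    4: [(-1, 0), (+1, 1), (+1, 1)],  # [Pre, Main, Main]
}
TAP_WEIGHTS = (1, 4, 1)  # relative sizes of TAPA, TAPB, TAPC (assumed fixed here)

def shuffler_output(data, mode, weights=TAP_WEIGHTS):
    """Weighted sum of the three taps for every symbol index n."""
    out = []
    for n in range(len(data)):
        acc = 0
        for (sign, delay), weight in zip(MODES[mode], weights):
            idx = n - delay
            sample = data[idx] if idx >= 0 else 0  # zero history before the first symbol
            acc += sign * weight * sample
        out.append(acc)
    return out

def effective_coefficients(mode, weights=TAP_WEIGHTS):
    """Net coefficient per delay after taps carrying the same data are merged."""
    coeffs = {}
    for (sign, delay), weight in zip(MODES[mode], weights):
        coeffs[delay] = coeffs.get(delay, 0) + sign * weight
    return dict(sorted(coeffs.items()))

print(shuffler_output([1, 0, 0, 0], mode=0))  # [-1, 4, -1, 0]
print(effective_coefficients(3))              # {1: 5, 2: -1}
```

In this model Mode 1 merges all three taps into a single coefficient of 6 on $D[n-1]$, which matches the "FFE off" entry, while Modes 3 and 4 lend one of the 1x taps to the main cursor.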

# Auxiliary Tap Control



- Additional flexibility
- Modulate MSB only
  - Handles **most-significant ISI**
  - With minimal **circuit** and **output loading** overheads



# 1-UI Pulse Generator



- **Dynamic logic (NAND + NOR)**
  - Higher speed owing to smaller parasitics
- **Convert *retimed* 4-UI pulses ( $R[0:3]$ ) to 1-UI pulses ( $P[0:3]$ )**



# 4:1 High-Speed MUX



- **Pseudo-NMOS based MUX**
- **Always-turned-on PMOS for load device**
  - More area- and parasitic-efficient than passive resistors
  - Sized to achieve about 0.3VDD logic-low level (close to  $V_{TH}$  of NMOS)
    - The output swing is amplified to rail-to-rail in subsequent stages
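The pulse generator and the 4:1 MUX together form the final serializer. The sketch below is an idealized logic-level model of that path, not the circuit itself. It assumes lane $i$ owns UI slot $i$ of every 4-UI word and ignores the MUX output polarity and analog levels. The function names are hypothetical.

```python
def one_ui_pulses(lanes):
    """Gate each quarter-rate lane R[i] into a data-carrying 1-UI pulse stream P[i]."""
    num_ui = 4 * len(lanes[0])
    pulses = []
    for i, lane in enumerate(lanes):
        # Assumption: lane i is only allowed through during UI slot i of each word.
        pulses.append([lane[t // 4] if t % 4 == i else 0 for t in range(num_ui)])
    return pulses

def mux_4to1(pulses):
    """Combine the four pulse streams; at most one lane is active in any UI."""
    return [p0 | p1 | p2 | p3 for p0, p1, p2, p3 in zip(*pulses)]

# Four lanes, two quarter-rate bits each (R[0]..R[3]).
lanes = [[1, 0], [0, 1], [1, 1], [0, 0]]
serial = mux_4to1(one_ui_pulses(lanes))
print(serial)  # [1, 0, 1, 0, 0, 1, 1, 0]
```

Because the pulses never overlap in this model, a simple wired combination is enough to interleave the lanes, which is consistent with a single-stack MUX sharing one load device.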

# Limiting Pre-driver



- Converts the HSMUX outputs to rail-to-rail signals
  - Inverters with skewed P/N widths to adjust the logic threshold
- Drives lengthy interconnects and driver taps
  - Interconnect loading capacitances ( $C_w$ ) do not scale down along with tap input capacitances
  - $\rightarrow$ Additional dummy transistors to remove timing skews between taps
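The effect of skewing the P/N ratio on an inverter's logic threshold can be illustrated with the textbook long-channel (square-law) model. This is a first-order sketch only: the model ignores velocity saturation, and the supply, threshold voltages, and strength ratios below are hypothetical values, not taken from the slides.

```python
import math

def inverter_threshold(vdd, vtn, vtp_abs, beta_p_over_n):
    """Switching threshold V_M of a CMOS inverter in the square-law model.

    Obtained by equating NMOS and PMOS saturation currents at Vin = Vout = V_M.
    """
    r = math.sqrt(beta_p_over_n)
    return (vtn + r * (vdd - vtp_abs)) / (1.0 + r)

VDD, VTN, VTP_ABS = 1.2, 0.4, 0.4  # hypothetical supply and threshold voltages

for ratio in (1.0, 4.0, 9.0):
    vm = inverter_threshold(VDD, VTN, VTP_ABS, ratio)
    print(f"beta_p/beta_n = {ratio:.0f}: V_M = {vm:.3f} V ({vm / VDD:.2f} VDD)")
```

A stronger PMOS moves the threshold above VDD/2. That is the direction one would expect to need when the preceding MUX swings between roughly 0.3VDD and VDD, although the actual sizing is not given in the slides.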

# High-Swing PAM-8 FFE Driver



- Without protective cascode

$$V_{sw,max} \leq BV_{GD} - V_{DRV,min}$$

- With protective cascode

- (1)  $V_{TT} - V_C \leq BV_{GD}$  (Reliability)
- (2)  $V_{TT} - V_{sw,max,cas} \geq V_{DRV,min,cas}$  (Current drive)
- (3)  $V_{sw,max,cas} \leq BV_{GD} + V_C - V_{DRV,min,cas}$  (from (1) and (2))

\*  $BV_{GD}$  : maximum gate-drain voltage to meet reliability constraints
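The two swing bounds on this slide can be compared with a short calculation. The sketch below evaluates both inequalities at their limits. All voltages are hypothetical placeholders chosen for illustration and are not values from the design.

```python
def max_swing_no_cascode(bv_gd, v_drv_min):
    """Upper bound on swing without a cascode: BV_GD - V_DRV,min."""
    return bv_gd - v_drv_min

def max_swing_with_cascode(bv_gd, v_c, v_drv_min_cas):
    """Upper bound on swing with a protective cascode: BV_GD + V_C - V_DRV,min,cas."""
    return bv_gd + v_c - v_drv_min_cas

def max_vtt_with_cascode(bv_gd, v_c):
    """Reliability limit on the termination supply with a cascode: BV_GD + V_C."""
    return bv_gd + v_c

# Hypothetical numbers (volts), for illustration only.
BV_GD, V_C = 1.5, 1.0
V_DRV_MIN, V_DRV_MIN_CAS = 0.5, 0.7

plain = max_swing_no_cascode(BV_GD, V_DRV_MIN)
cascoded = max_swing_with_cascode(BV_GD, V_C, V_DRV_MIN_CAS)
print(f"Without cascode: {plain:.2f} V, with cascode: {cascoded:.2f} V")
print(f"Swing headroom gained: {cascoded - plain:.2f} V")
```

The cascode raises the bound by $V_C$ but costs extra headroom, since $V_{DRV,min,cas}$ is larger than $V_{DRV,min}$. The net gain is $V_C - (V_{DRV,min,cas} - V_{DRV,min})$.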

# High-Swing PAM-8 FFE Driver



- Rise of internal voltages ( $V_{INTP}$  and  $V_{INTN}$ ) due to leakage current
  - Current bleeders to retain  $V_{GS}$  of cascode transistors
- Maximum cascode bias voltage ( $V_C$ ) without bleeders:

$$V_C < BV_{GD} - V_{IL}$$

- This limits  $V_C$  and  $V_{sw,max}$
- $V_C$  with bleeders:

$$V_C < BV_{GD} - V_{IL} + V_{CGSO}$$

- $V_{CGSO}$ :  $V_{GS}$  of  $M_0$  when  $M_2$  is off
- The extra term improves  $V_C$  and  $V_{sw,max}$



6.8: A 100Gb/s 1.6Vppd PAM-8 Transmitter with High-Swing 3+1 Hybrid FFE Taps in 40nm

# Current Source with Stacked Transistor



[T. No, ISOCC 2020]

## ■ 2-stack current source

- Non-uniform width ratio between stack transistors ( $W_{top}:W_{bot}=2:1$ )
- Yields higher  $r_o \cdot I_D$  (normalized resistance)
  - Higher output resistance for the same current drive
  - Better current density than higher stacks



|   | Stack | Width ratio |
|---|-------|-------------|
| A | 1     | -           |
| B | 2     | 1:1         |
| C | 2     | 2:1         |
| D | 3     | 1:1:1       |

# Outline

- Introduction
- Architecture
- Circuit Implementation
- **Measurement Results**
- Comparison & Conclusion

# Chip Micrograph



- 40nm CMOS
- Total area =  $0.362\,\text{mm}^2$
- Bumps for flip-chip

|   | Block               | Area ( $mm^2$ ) |
|---|---------------------|-----------------|
| 1 | Test Pattern Memory | 0.133           |
| 2 | Clock Path          | 0.0055          |
| 3 | Data Path           | 0.032           |
| 4 | HSSER + PDRV        | 0.0029          |
| 5 | HVDRV + T-coil      | 0.0237          |

# Measurement Setup

[Figure: measurement setup]

- Test pattern generator (Wavegen Xpress)
- DC power supply (Keysight N6705B)
- Clock generator (Keysight E8257C)
- Balun
- Test board: flip-chip on board, AC coupled
- Sampling oscilloscope (Keysight N1000A)
- Connection labels: COM, POWER, TESTING, CLK, DATA

# Measurement Results (1)



- **100 Gb/s data-rate**
- **Overlapped PRBS7 & extra patterns**
- **Swing =  $922 \text{ mV}_{\text{dpp}}$** 
  - $1.6 \text{ V}_{\text{dpp}}$  without FFE
- **Eye opening  $\geq 57.7 \text{ mV}$**
- **Measured RLM = 0.94**
- **VTT = 2.2 V**

# Measurement Results (2)

| Supply            | Voltage |
|-------------------|---------|
| Driver supply     | 2.2 V   |
| Pre-driver supply | 1.2 V   |
| SER/CLK supply    | 1.2 V   |



- Higher supply and protective cascode to support 1.6Vdpp output swing
- Power efficiency = 3.35 pJ/bit
- Channel loss estimated from measured pulse response: 9.4 dB @ 16.65 GHz
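The reported energy efficiency implies a total power figure that the slide does not state directly. The sketch below derives it from the two quoted numbers. The result is a back-of-the-envelope value and assumes the efficiency covers the whole transmitter at the full 100 Gb/s rate.

```python
def total_power_mw(energy_pj_per_bit: float, data_rate_gbps: float) -> float:
    """Power in mW: (pJ/bit) * (Gb/s) = 1e-12 J * 1e9 1/s = 1e-3 W."""
    return energy_pj_per_bit * data_rate_gbps

power_mw = total_power_mw(3.35, 100.0)
print(f"Implied total power: {power_mw:.0f} mW")  # Implied total power: 335 mW
```

The same helper can be applied to the other energy-efficiency entries in the comparison table to compare absolute power at each design's own data rate.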

# Outline

- Introduction
- Architecture
- Circuit Implementation
- Measurement Results
- **Comparison & Conclusion**

# Comparison

|                                       | This work    | Dickson<br>VLSI22 | Kim<br>ISSCC21 | Kossel<br>ISSCC21 | Choi<br>ISSCC21 | Peng<br>ISSCC20 |
|---------------------------------------|--------------|-------------------|----------------|-------------------|-----------------|-----------------|
| Technology                            | 40nm         | 4nm               | 10nm           | 7nm               | 28nm            | 40nm            |
| Data Rate [Gb/s]                      | 100          | 144 / 216         | 224            | 112               | 200             | 100             |
| Modulation                            | PAM-8        | PAM-4 / 8         | PAM-4          | PAM-4             | PAM-4           | NRZ             |
| Architecture                          | Mixed Signal | 8b DAC            | 7b DAC         | 8b DAC            | Mixed Signal    | 7b DAC          |
| Driver                                | CML          | SST               | CML            | SST               | CML             | Tail-less       |
| Max. Output Swing [Vdpp]              | 1.6          | 0.92              | 1.0            | 0.92              | 0.8             | 0.56            |
| Vertical Eye Opening [mV]             | 57.7         | N/A               | 90             | 59                | 53              | 73              |
| FFE Taps                              | 3+1          | 8                 | 8              | 8                 | 5               | 8               |
| Energy Efficiency [pJ/bit]            | 3.35         | 2 / 1.33          | 1.74           | 1.4               | 4.63            | 6.19            |
| Analog Supply [V]                     | 1.2 / 2.2    | 0.95              | 0.8/1/1.5      | 0.96              | 1.4             | 1.1/1.2/1.5     |
| Channel Loss<br>@ Nyq. Frequency [dB] | -9.4         | -8.8              | -4.0           | -15.1             | -6.0            | -7.1            |
| Packaging                             | Flip-chip    | Flip-chip         | Flip-chip      | Bare die*         | Bare die*       | Bare die*       |

\* RF probes used for signal acquisition.

# Conclusion

- A 100-Gb/s PAM-8 transmitter fabricated in 40-nm CMOS technology
- Key design techniques
  - Hybrid 3+1 tap FFE for efficient channel equalization
  - 3-tap shufflers for minimized area & power overheads
  - Single stack multiplexer with PMOS load for output BW
  - High-swing driver for better SNR in PAM-8 constellations
- This design achieved a 57.7-mV eye-opening and 0.94 RLM under 9.4-dB channel loss

# Acknowledgement

- **Samsung Research Funding & Incubation Center of Samsung Electronics**
  - Project Number SRFC-IT2001-02
- **IITP grant funded by the Korea government(MSIT)**
  - Project Number 2020-0-01307

# References

- [1] Dickson, T. et al. A 72GS/s, 8-bit DAC-based Wireline Transmitter in 4nm FinFET CMOS for 200+Gb/s Serial Links. *IEEE Symp. VLSI Circuits*, 28-29 (2022)
- [2] Choi, M. et al. An Output-Bandwidth-Optimized 200Gb/s PAM-4 100Gb/s NRZ Transmitter with 5-Tap FFE in 28nm CMOS. *ISSCC*, 128-130 (2021)
- [3] Kim, J. et al. A 224Gb/s DAC-Based PAM-4 Transmitter with 8-Tap FFE in 10nm CMOS. *ISSCC*, 126-128 (2021)
- [4] Kossel, M. A. et al. An 8b DAC-Based SST TX Using Metal Gate Resistors with 1.4pJ/b Efficiency at 112Gb/s PAM-4 and 8-Tap FFE in 7nm CMOS. *ISSCC*, 130-132 (2021)
- [5] Loke, A. L. S. et al. Analog/mixed-signal design challenges in 7-nm CMOS and beyond. *CICC*, 1-8 (2018)
- [6] Song, E., Yang, J., Hong, S. & Han, J. A 32-Gb/s High-Swing PAM-4 Current-Mode Driver with Current-Bleeding Cascode Technique and Capacitive-Coupled Pre-drivers in 40-nm CMOS for Short-Reach Wireline Communications. *MWSCAS*, 1-4 (2022)
- [7] No, T. & Han, J. Design Techniques for Robust and Area-efficient Current Sources in Nanometer CMOS Technology. *ISOCC*, 232-233 (2020)
- [8] Peng, P.-J., Lai, S.-T., Wang, W.-H., Lin, C.-W., Huang, W.-C. & Shih, T. A 100Gb/s NRZ Transmitter with 8-Tap FFE Using a 7b DAC in 40nm CMOS. *ISSCC*, 130-132 (2020)