



# High-Speed Link Circuits and Systems for Chiplet

---

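Jiang Wenning (江文宁)  
Frontier Institute of Chip and System

Rich in knowledge and tenacious of purpose; inquiring with earnestness and reflecting with self-practice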

# Contents

01

### Introduction

What is a high-speed link for chiplets?

What are the advanced features?

Why do we need it?

02

### Signal Integrity

What is signal integrity?

The sources

The impact

03

### TX/RX

What is RX?

The architecture and circuit level implementation

The advanced techniques

04

### TX/RX (Continued)

More concerns

The examples

1. ECEN720 from Sam Palermo, TAMU, "High speed wireline links circuit design"
2. ECE 546 from Jose E. Schutt-Aine, UIUC, "High-Speed Links"
3. Other internet info



FUDAN UNIVERSITY

04

TX/RX (Continued)




# TX FIR Adaptation Error Extraction

- While we are adapting the TX FIR, we need to measure the response at the receiver input
- Equalizer adaptation (error) information is often obtained by comparing the receiver input versus the desired symbol levels,  $dLev$
- This necessitates additional samplers at the receiver with programmable threshold levels



[Stojanovic JSSC 2005]

[Sam Palermo, Texas A&M]

# TX FIR Adaptation Algorithm

- The sign-sign LMS algorithm is often used to adapt equalization taps due to implementation simplicity

$$w_{n+1}^k = w_n^k + \Delta_w \text{sign}(d_{n-k}) \text{sign}(e_n)$$

$w$  = tap coefficients,  $n$  = time instant,  $k$  = tap index,  $d_n$  = received data,

$e_n$  = error with respect to desired data level,  $dLev$

- As the desired data level is a function of the transmitter swing and channel loss, the desired data level is not necessarily known and should also be adapted

$$dLev_{n+1} = dLev_n - \Delta_{dLev} \text{sign}(e_n)$$

[Stojanovic JSSC 2005]
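The two update rules above can be exercised with a small behavioral model. The sketch below is only an illustration under stated assumptions, not the cited implementation: the two-tap channel `h`, the fixed main TX tap of 1, error-free decisions, the step sizes, and the noise level are all made-up values, and the error $e_n = dLev - y_n$ is evaluated only on symbols equal to +1 (where a single $dLev$ sampler is meaningful).

```python
import random


def sign(x):
    """Return +1 for non-negative values and -1 otherwise."""
    return 1 if x >= 0 else -1


def adapt_tx_fir(num_symbols=50_000, h=(0.6, 0.2), step_w=1e-3,
                 step_dlev=1e-3, noise_rms=0.01, seed=1):
    """Sign-sign LMS sketch: adapt one post-cursor TX tap and dLev.

    h is an assumed channel pulse response (main cursor, first post-cursor).
    The main TX tap is held at 1; only w1 (tap index k = 1) is adapted.
    """
    rng = random.Random(seed)
    w1 = 0.0        # post-cursor tap coefficient, k = 1
    dlev = 0.3      # initial guess for the desired data level
    d_prev = 1      # d_{n-1}
    x_prev = 0.0    # previous TX FIR output
    for _ in range(num_symbols):
        d_n = rng.choice((-1, 1))
        x_n = d_n + w1 * d_prev                      # TX FIR output
        y_n = h[0] * x_n + h[1] * x_prev + rng.gauss(0.0, noise_rms)
        if d_n == 1:                                 # error sampler valid on +1 symbols
            e_n = dlev - y_n                         # error with respect to dLev
            w1 += step_w * sign(d_prev) * sign(e_n)  # tap update for k = 1
            dlev -= step_dlev * sign(e_n)            # dLev update
        x_prev, d_prev = x_n, d_n
    return w1, dlev


w1, dlev = adapt_tx_fir()
print(f"w1 = {w1:.3f}, dLev = {dlev:.3f}")
```

With these assumed values the tap moves to a negative (de-emphasis) value and $dLev$ moves toward the equalized main-cursor amplitude. Both then dither, which is the usual price of sign-only updates.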




# CTLE Tuning with PSD Measurement

- One approach to CTLE tuning is to compare low-frequency and high-frequency spectrum content of random data
- For ideal random data, there is a predictable ratio between the low-frequency power and high-frequency power
- The error between these power components can be used in a servo loop to tune the CTLE



$$s_x(f) = T_b \left[ \frac{\sin(\pi f T_b)}{\pi f T_b} \right]^2$$

$$\int_0^{f_m} s_x(f) df = \int_{f_m}^{\infty} s_x(f) df = \frac{1}{4}$$

$$\text{where } f_m \approx \frac{0.28}{T_b}$$



[Lee JSSC 2006]
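The split frequency $f_m$ can be checked numerically. The sketch below integrates the NRZ spectrum $s_x(f)$ with a plain trapezoid rule and bisects for the frequency below which one quarter of the total (two-sided, unit) power lies. The function names and the normalization $T_b = 1$ are choices made here for illustration.

```python
import math


def nrz_psd(f, tb=1.0):
    """PSD of ideal random NRZ data: Tb * sinc^2(pi * f * Tb)."""
    x = math.pi * f * tb
    return tb if x == 0 else tb * (math.sin(x) / x) ** 2


def power_below(f_hi, tb=1.0, steps=2000):
    """Trapezoid-rule integral of the PSD from 0 to f_hi."""
    h = f_hi / steps
    total = 0.5 * (nrz_psd(0.0, tb) + nrz_psd(f_hi, tb))
    for i in range(1, steps):
        total += nrz_psd(i * h, tb)
    return total * h


def split_frequency(tb=1.0):
    """Bisect for f_m such that the power in [0, f_m] equals 1/4."""
    lo, hi = 0.0, 1.0 / tb
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if power_below(mid, tb) < 0.25:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


print(f"f_m * Tb = {split_frequency():.3f}")
```

The numerical result lands at roughly $0.27/T_b$, in line with the approximate value quoted above.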


# CTLE Tuning with Output Amplitude

- CTLE tuning can also be done by comparing low-frequency and high-frequency average amplitude
- Approximating the equalized data as a sine wave, a predictable ratio exists between the low frequency average and high-frequency average
- Equalizer settings are adjusted until the high frequency peak-to-peak swing matches the low-frequency peak-to-peak swing



# CTLE Tuning with Data Edge Distribution

- The width and shape of the data edge distribution can be used to reliably calibrate an equalizer
- By oversampling the data bits with sub-period accuracy, this information can be obtained
- The objective is to maximize the eye opening or, equivalently, to minimize the standard deviation of the edge distribution



[Gerfers JSSC 2008]



# CTLE Tuning - FIR Feedback



- 2x oversampling the equalized signal at the edges can be used to extract information to adapt a DFE and drive a CDR loop
- Sign-sign LMS algorithm used to adapt DFE tap values

[Payne JSSC 2005]

# CTLE Tuning - IIR Feedback



[Huang ISSCC 2011]




| Data pattern                | Edge ISI sign  | Adaptation action       |
|-----------------------------|----------------|-------------------------|
| Case 1<br>D2,D1,D0= 110     | $ISI_{E0} > 0$ | Increases amplitude     |
|                             | $ISI_{E0} < 0$ | Decreases amplitude     |
| Case 2<br>D3,D2,D1,D0= 1110 | $ISI_{E0} > 0$ | Increases time constant |
|                             | $ISI_{E0} < 0$ | Decreases time constant |

[Hidaka ISSCC 2007]

- Also called regenerative amplifier, sense-amplifier, flip-flop, latch
- Samples the continuous input at clock edges and resolves the differential to a binary 0 or 1



- Offset and hysteresis
- Sampling aperture, timing resolution, uncertainty window
- Regeneration gain, voltage sensitivity, metastability
- Random decision errors, input-referred noise



Strong-Arm Latch

- To form a flip-flop
  - After strong-arm latch, cascade an R-S latch
  - After CML latch, cascade another CML latch
- Strong-Arm flip-flop has the advantage of no static power dissipation and full CMOS output levels



CML Latch



- 4 operating phases: reset, sampling, regeneration, and decision

- Sampling phase starts when clk goes high,  $t_0$ , and ends when PMOS transistors turn on,  $t_1$
- M1 pair discharges  $X/X'$
- M2 pair discharges out+/-

$$\frac{v_{out}(s)}{v_{in}(s)} = \frac{g_{m1}g_{m2}}{sC_{out}C_x \left( s + \frac{g_{m2}(C_{out} - C_x)}{C_{out}C_x} \right)}$$

$$\approx \frac{g_{m1}g_{m2}}{s^2C_{out}C_x} = \frac{1}{s^2\tau_{s1}\tau_{s2}}$$

where  $\tau_{s1} \equiv C_x/g_{m1}$ ,  $\tau_{s2} \equiv C_{out}/g_{m2}$



- Regeneration phase starts when PMOS transistors turn on,  $t_1$ , until decision time,  $t_2$
- Assume M1 is in linear region and circuit no longer sensitive to  $v_{in}$
- Cross-coupled inverters amplify signals via positive-feedback:

$$G_R = \exp\left(\frac{t_2 - t_1}{\tau_R}\right)$$

$$\tau_R = C_{out} / (g_{m2,r} + g_{m3,r})$$



[Jeeradit VLSI 2008]



- Sampling time of SA latch varies with VDD, while CML isn't affected much

[Jeeradit VLSI 2008]



- CML latch has higher sampling gain with a small input pair
- Strong-Arm latch has higher sampling bandwidth
  - For the CML latch, increasing the input-pair size also directly increases the output capacitance
  - For the Strong-Arm (SA) latch, increasing the input-pair size makes the transconductance grow faster than the capacitance



- Device noise causes random decisions even with zero input signal
- Noise variance can be found by fitting output to a Gaussian CDF as the input is swept and transient noise is enabled
- Noise can also be simulated with PSS+PAC+PNOISE, but requires post processing to find ISF from sideband transfer function [Kim TCAS-I 2009]
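The Gaussian-CDF fitting step can be illustrated with a behavioral stand-in for the transient-noise simulation. In the sketch below the comparator is reduced to a threshold with Gaussian input-referred noise; the 1 mV noise, 0.3 mV offset, sweep range, and trial count are hypothetical values, and the fit is an ordinary least-squares line in probit space rather than any particular tool's method.

```python
import random
from statistics import NormalDist


def ones_fraction(v_in_mv, sigma_mv, offset_mv, trials, rng):
    """Fraction of '1' decisions for a comparator with Gaussian input noise."""
    ones = sum(1 for _ in range(trials)
               if v_in_mv + rng.gauss(0.0, sigma_mv) > offset_mv)
    return ones / trials


def fit_gaussian_cdf(v_points_mv, p_ones):
    """Fit P(1) = Phi((v - offset) / sigma) with a straight line in probit space."""
    pts = [(v, NormalDist().inv_cdf(p))
           for v, p in zip(v_points_mv, p_ones) if 0.0 < p < 1.0]
    n = len(pts)
    mean_v = sum(v for v, _ in pts) / n
    mean_z = sum(z for _, z in pts) / n
    slope = (sum((v - mean_v) * (z - mean_z) for v, z in pts)
             / sum((v - mean_v) ** 2 for v, _ in pts))
    intercept = mean_z - slope * mean_v
    return 1.0 / slope, -intercept / slope   # (sigma, offset)


rng = random.Random(0)
sweep_mv = [-2.0 + 0.5 * i for i in range(9)]            # input sweep, -2 mV .. +2 mV
measured = [ones_fraction(v, 1.0, 0.3, 10_000, rng) for v in sweep_mv]
sigma_est, offset_est = fit_gaussian_cdf(sweep_mv, measured)
print(f"sigma = {sigma_est:.2f} mV, offset = {offset_est:.2f} mV")
```

The same fit returns the input-referred offset as the input voltage at which the decisions are split evenly.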



$$t_{reg} \sim \tau_{comp} \ln\left(\frac{V_{DD}}{V_{in}}\right)$$

$$t_{samp} \sim \frac{C_{out} V_{THP}}{I_D}$$



- Comparator evaluation time grows proportional to  $\ln(V_{in}^{-1})$
- Metastability occurs when the input is too small and the comparator doesn't have sufficient time to fully evaluate
- This metastability window is a major component of the comparator  $V_{min}$
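The logarithmic dependence of the evaluation time on input amplitude is easy to tabulate. The sketch below uses the $t_{reg}$ expression above with hypothetical values ($V_{DD}$ = 1 V, $\tau_{comp}$ = 10 ps) and also inverts it to estimate the smallest input that resolves within a given time budget.

```python
import math


def regeneration_time(v_in, vdd, tau):
    """t_reg ~ tau * ln(VDD / Vin)."""
    return tau * math.log(vdd / v_in)


def min_resolvable_input(t_available, vdd, tau):
    """Invert t_reg: smallest input that regenerates to VDD in t_available."""
    return vdd * math.exp(-t_available / tau)


VDD = 1.0       # V, assumed
TAU = 10e-12    # s, assumed regeneration time constant

for v_in in (100e-3, 10e-3, 1e-3, 0.1e-3):
    t_ps = regeneration_time(v_in, VDD, TAU) * 1e12
    print(f"Vin = {v_in * 1e3:6.1f} mV -> t_reg = {t_ps:5.1f} ps")
```

Each 10x reduction in input amplitude adds a fixed $\tau_{comp}\ln 10 \approx 2.3\,\tau_{comp}$ to the evaluation time, which is why very small inputs run out of time and the latch goes metastable.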

# RX Sensitivity & Offset Correction

- RX sensitivity is a function of the input referred noise, offset, and min latch resolution voltage

$$v_S^{pp} = 2v_n^{rms} \sqrt{SNR} + v_{\min} + v_{\text{offset}*}$$

Typical Values :  $v_n^{rms} = 1mV_{rms}$ ,  $v_{\min} + v_{\text{offset}*} < 6mV$

$$\text{For BER} = 10^{-12} (\sqrt{SNR} = 7) \Rightarrow v_S^{pp} = 20mV_{pp}$$

- Circuitry is required to reduce input offset from a potentially large uncorrected value (>50mV) to near 1mV
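The sensitivity budget above is simple arithmetic and can be reproduced directly. In the sketch below the function names are illustrative, and the BER helper assumes a pure Gaussian noise tail, $BER = \tfrac{1}{2}\,\mathrm{erfc}(\sqrt{SNR}/\sqrt{2})$, which is a common model rather than something stated on the slide.

```python
import math


def rx_sensitivity_mvpp(vn_rms_mv, sqrt_snr, vmin_plus_offset_mv):
    """v_S^pp = 2 * v_n^rms * sqrt(SNR) + (v_min + v_offset)."""
    return 2.0 * vn_rms_mv * sqrt_snr + vmin_plus_offset_mv


def ber_from_sqrt_snr(sqrt_snr):
    """Gaussian-tail BER estimate for a given sqrt(SNR) (assumed model)."""
    return 0.5 * math.erfc(sqrt_snr / math.sqrt(2.0))


sens = rx_sensitivity_mvpp(vn_rms_mv=1.0, sqrt_snr=7.0, vmin_plus_offset_mv=6.0)
print(f"sensitivity = {sens:.1f} mVpp, BER = {ber_from_sqrt_snr(7.0):.2e}")
```

With the typical values quoted above, 14 mV of the 20 mVpp budget goes to noise and the remaining 6 mV to $v_{\min}$ and residual offset, so an uncorrected offset of tens of millivolts would dominate the budget.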



- The input referred offset is primarily a function of  $V_{th}$  mismatch and a weaker function of  $\beta$  (mobility) mismatch



$$\sigma_{V_t} = \frac{A_{V_t}}{\sqrt{WL}}, \quad \sigma_{\Delta\beta/\beta} = \frac{A_\beta}{\sqrt{WL}}$$

- To reduce input offset 2x, we need to increase area 4x
- Not practical due to excessive area and power consumption
- Offset correction necessary to efficiently achieve good sensitivity

# Offset Correction Range & Resolution



- Generally circuits are designed to handle a minimum variation range of  $\pm 3\sigma$  for 99.7% yield
- Example: Input differential transistors  $W=4\mu m$ ,  $L=150nm$

$$\sigma_{V_t} = \frac{A_{V_t}}{\sqrt{WL}} = \frac{2.8mV\mu m}{\sqrt{4\mu m \cdot 150nm}} = 3.6mV, \quad \sigma_{\Delta\beta/\beta} = \frac{A_\beta}{\sqrt{WL}} = \frac{2\%\mu m}{\sqrt{4\mu m \cdot 150nm}} = 2.6\%$$

- If we assume (optimistically) that the input offset is only dominated by the input pair  $V_t$  mismatch, we would need to design offset correction circuitry with a range of about  $\pm 11mV$
- If we want to cancel within  $1mV$ , we would need an offset cancellation resolution of 5 bits, resulting in a worst-case offset of

$$1LSB = \frac{\text{Offset Correction Range}}{2^{\text{Resolution}} - 1} = \frac{22mV}{2^5 - 1} = 0.71mV$$
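The sizing steps in this example can be scripted so that other device sizes are easy to try. The sketch below uses the matching coefficients and dimensions from the example; the helper names are illustrative, and the 22 mV range is the rounded $\pm 11mV$ figure from the text.

```python
import math


def pelgrom_sigma(a_coeff, w_um, l_um):
    """Mismatch sigma = A / sqrt(W * L), with W and L in micrometers."""
    return a_coeff / math.sqrt(w_um * l_um)


def bits_for_resolution(range_mv, target_lsb_mv):
    """Smallest N such that range / (2**N - 1) <= target LSB."""
    n = 1
    while range_mv / (2 ** n - 1) > target_lsb_mv:
        n += 1
    return n


sigma_vt_mv = pelgrom_sigma(2.8, 4.0, 0.15)      # A_Vt = 2.8 mV*um
half_range_mv = 3.0 * sigma_vt_mv                # +/-3 sigma coverage
full_range_mv = 22.0                             # +/-11 mV, rounded as in the text
bits = bits_for_resolution(full_range_mv, 1.0)   # cancel to within 1 mV
lsb_mv = full_range_mv / (2 ** bits - 1)

print(f"sigma_Vt = {sigma_vt_mv:.2f} mV, +/-3 sigma = {half_range_mv:.1f} mV")
print(f"bits = {bits}, LSB = {lsb_mv:.2f} mV")
```

Four bits would give an LSB of about 1.47 mV, which misses the 1 mV target, so five bits is the smallest resolution that works; 22 mV / 31 evaluates to roughly 0.71 mV.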

# Current-Mode Offset Correction Example

- Differential current injected into input amplifier load to induce an input-referred offset that can cancel the inherent amplifier offset
  - Can be made with extended range to perform link margining
- Passing a constant amount of total offset current for all the offset settings allows for constant output common-mode level
- Offset correction performed both at input amplifier and in individual receiver segments of the 2-way interleaved architecture

[Balamurugan JSSC 2008]



[Sam Palermo, Texas A&M]

# Capacitive Offset Correction Example

- A capacitive imbalance in the sense-amplifier internal nodes induces an input-referred offset
- Pre-charges internal nodes to allow more integration time for a larger offset range
- Additional capacitance does increase sense-amp aperture time
- Offset is trimmed by shorting the inputs to a common-mode voltage and adjusting settings until an even distribution of "1"s and "0"s is observed
- Offset correction settings can be sensitive to input common-mode



[Sam Palermo, Texas A&M]

ADC requirements for polarization-multiplexed 112-Gb/s transmission

| Modulation Format | Nyquist BW | Baud Rate  | Sampling Rate | ENOB  |
|-------------------|------------|------------|---------------|-------|
| DP-QPSK           | 28 GHz     | 28 Gbaud   | 56 GS/s       | 3.8 b |
| DP-16QAM          | 14 GHz     | 14 Gbaud   | 28 GS/s       | 4.9 b |
| DP-64QAM          | 9.25 GHz   | 9.25 Gbaud | 18.5 GS/s     | 5.7 b |
| DP-256QAM         | 7 GHz      | 7 Gbaud    | 14 GS/s       | 7 b   |

ADC requirements for polarization-multiplexed 224-Gb/s transmission

| Modulation Format | Nyquist BW | Baud Rate  | Sampling Rate | ENOB  |
|-------------------|------------|------------|---------------|-------|
| DP-QPSK           | 56 GHz     | 56 Gbaud   | 112 GS/s      | 3.8 b |
| DP-16QAM          | 28 GHz     | 28 Gbaud   | 56 GS/s       | 4.9 b |
| DP-64QAM          | 18.5 GHz   | 18.5 Gbaud | 37 GS/s       | 5.7 b |
| DP-256QAM         | 14 GHz     | 14 Gbaud   | 28 GS/s       | 7 b   |



- Multiple noise sources can degrade link timing and amplitude margin



- Circuits draw current from the VDD supply nets and return current to the GND nets
- Supply networks have finite impedance
- Time-varying (switching) currents induce variations on the supply voltage
- Supply noise a circuit sees depends on its location in supply distribution network

Bad – Block B will experience excessive supply noise



Better – Block B will experience 1/2 supply noise, but at the cost of double the power routing through blocks



Even Better – Block A & B will experience similar supply noise



Best – Block A & B are more isolated



[Hodges]


- Common “noise” sources
  - Power supply noise
  - Receiver offset
  - Crosstalk
  - Inter-symbol interference
  - Random noise
- Power supply noise
  - Switching current through finite supply impedance causes supply voltage drops that vary with time and physical location
- Receiver offset
  - Caused by random device mismatches
- Crosstalk
  - One signal (aggressor) interfering with another signal (victim)
  - On-chip coupling (capacitive)
  - Off-chip coupling (t-line)
    - Near-end
    - Far-end
- Inter-symbol interference
  - Signal dispersion causes signal to interfere with itself
- Random noise
  - Thermal & shot noise
  - Clock jitter components

# Noise in High-Speed Links

- Bounded or *deterministic* noise sources
  - Have theoretically predictable values with defined worst-case bounds
  - Allows for simple (but pessimistic) worst-case analysis
  - Examples
    - Crosstalk to small channel count
    - ISI
    - Receiver offset
- Statistical or *random* noise sources
  - Treat noise as a random process
    - Source may be pseudo-random
  - Often characterized w/ Gaussian stats
    - RMS value
    - Probability density function (PDF)
  - Examples
    - Thermal noise
    - Clock jitter components
    - Crosstalk to large channel count
- Understanding whether noise source is bounded or random is critical to accurate link performance estimation

- Some noise is *proportional* to signal swing
  - Crosstalk
  - Simultaneous switching power supply noise
  - ISI
- Can't overpower this noise
  - Larger signal = more noise
- Some noise is *independent* of signal swing
  - RX offset
  - Non-IO power supply noise
- Can overpower this noise

$$V_N = K_N V_S + V_{NI}$$

Total noise →  $V_N$

Independent noise →  $V_{NI}$

Proportional noise constant →  $K_N$

Signal swing →  $V_S$
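A short numerical sketch shows why proportional noise cannot be overpowered. It evaluates $V_N = K_N V_S + V_{NI}$ for increasing swing and reports the ratio $V_S/V_N$, which is bounded by $1/K_N$. The values $K_N = 0.2$ and $V_{NI} = 20$ mV are hypothetical.

```python
def total_noise(v_swing, k_n, v_ni):
    """V_N = K_N * V_S + V_NI."""
    return k_n * v_swing + v_ni


def swing_to_noise_ratio(v_swing, k_n, v_ni):
    """V_S / V_N; approaches 1 / K_N as the swing grows."""
    return v_swing / total_noise(v_swing, k_n, v_ni)


K_N = 0.2      # proportional noise constant (assumed)
V_NI = 0.020   # independent noise in volts (assumed)

swings = (0.1, 0.2, 0.4, 0.8, 1.6)
ratios = [swing_to_noise_ratio(v, K_N, V_NI) for v in swings]
for v, r in zip(swings, ratios):
    print(f"V_S = {v:.1f} V -> V_S / V_N = {r:.2f} (limit {1 / K_N:.1f})")
```

Doubling the swing helps a lot while the independent term dominates, then yields diminishing returns as the proportional term takes over.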





- A clock and data recovery system (CDR) produces the clocks to sample incoming data
- The clock(s) must have an effective frequency equal to the incoming data rate
  - 10GHz for 10Gb/s data rate
  - OR, multiple clocks spaced at 100ps
  - Additional clocks may be used for phase detection
- Sampling clocks should have sufficient timing margin to achieve the desired bit-error-rate (BER)
- CDR should exhibit small effective jitter
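The relation between data rate, number of sampling phases, and phase spacing can be written down in a few lines. The helper below is a hypothetical illustration: with N evenly spaced phases, each clock runs at the data rate divided by N while adjacent sampling instants stay one bit period apart.

```python
def sampling_clock_plan(data_rate_bps, num_phases):
    """Return (per-phase clock frequency in Hz, spacing between sampling instants in s)."""
    clock_freq_hz = data_rate_bps / num_phases
    phase_spacing_s = 1.0 / data_rate_bps
    return clock_freq_hz, phase_spacing_s


for phases in (1, 2, 4):
    freq, spacing = sampling_clock_plan(10e9, phases)
    print(f"{phases} phase(s): {freq / 1e9:.1f} GHz clocks, {spacing * 1e12:.0f} ps spacing")
```

For 10 Gb/s this gives a single 10 GHz clock, or for example four 2.5 GHz clocks, with sampling instants 100 ps apart in every case.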

## PLL-based CDR



## Dual-Loop CDR



- Clock frequency and optimum phase position are extracted from incoming data
- Phase detection continuously running
- Jitter tracking limited by CDR bandwidth
  - With technology scaling we can make CDRs with higher bandwidths, so the jitter-tracking advantages of source-synchronous systems are diminished
- Possible CDR implementations
  - Stand-alone PLL
  - “Dual-loop” architecture with a PLL or DLL and phase interpolators (PI)
  - Phase-rotator PLL



- A primary difference between CDRs and PLLs is that the incoming data signal is not periodic like the incoming reference clock of a PLL
- A CDR phase detector must operate properly with missing transition edges in the input data sequence

# Hogge Phase Detector



| A | B | Output Y |
|---|---|----------|
| 0 | 0 | 0   |
| 0 | 1 | 1   |
| 1 | 0 | 1   |
| 1 | 1 | 0   |

- Linear phase detector
- With a data transition and assuming a full-rate clock
  - The late signal produces a signal whose pulse width is proportional to the phase difference between the incoming data and the sampling clock
  - A  $Tb/2$  reference signal is produced with a  $Tb/2$  delay
- If the clock is sampling early, the late signal will be shorter than  $Tb/2$  and vice-versa


[Razavi]



- XOR outputs can directly drive the charge pump
- Need a relatively high-speed charge pump

CDR PLL bandwidth in various standards

| Data rate                       | PLL bandwidth                                        | Standard     |
|---------------------------------|------------------------------------------------------|--------------|
| 1x/2x/4x Fibre Channel          | = data rate/1667                                     | FC-PI-2 Rev6 |
| 10 Gb Ethernet (3.125x4 lanes)  | = 1.875 MHz                                          | 802.3ae-2002 |
| 10 Gb Ethernet                  | = 4 MHz                                              | 802.3ae-2002 |
| 1 Gb Ethernet                   | < 637 kHz @ Keysight Technologies 302.3-2002 high... |              |

- Low-frequency jitter (below the CDR bandwidth) can be tracked by the CDR → no bit errors
- High-frequency jitter (above the CDR bandwidth) cannot be tracked by the CDR → bit errors
- Large CDR bandwidth → short lock time and high jitter tolerance, but larger recovered-clock jitter
- SONET (Synchronous Optical Networking) / SDH (Synchronous Digital Hierarchy) standards limit jitter
  - A large CDR bandwidth is desirable, but it is limited by the standard



Smaller CDR bandwidth



Larger CDR bandwidth

# Alexander (2x-Oversampled) PD

- Most commonly used CDR phase detector
- Non-linear (Binary) "Bang-Bang" PD
  - Only provides sign information of phase error (not magnitude)
- Phase detector uses 2 data samples and one "edge" sample
- Data transition necessary

$$D_n \oplus D_{n+1}$$

- If "edge" sample is same as second bit (or different from first), then the clock is sampling "late"

$$E_n \oplus D_n$$

- If "edge" sample is same as first bit (or different from second), then the clock is sampling "early"

$$E_n \oplus D_{n+1}$$



[Sheikholeslami]
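The three XOR expressions above map directly onto a small decision function. The sketch below is a bit-level illustration with made-up names; it takes the two data samples and the edge sample between them and reports whether the clock is early, late, or whether no information is available.

```python
def alexander_pd(d_n, e_n, d_next):
    """Bang-bang phase decision from two data samples and the edge sample between them.

    Returns "late", "early", or "hold" (no data transition, so no phase information).
    """
    transition = d_n ^ d_next      # D_n xor D_{n+1}
    if not transition:
        return "hold"
    late = e_n ^ d_n               # edge sample already equals the second bit
    early = e_n ^ d_next           # edge sample still equals the first bit
    assert late != early           # with a transition, exactly one of them is set
    return "late" if late else "early"


for bits in ((0, 1, 1), (0, 0, 1), (1, 0, 0), (1, 1, 0), (1, 1, 1)):
    print(bits, "->", alexander_pd(*bits))
```

Only the sign of the phase error comes out of this logic, so the decision pulse carries no magnitude information.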



- Phase detector only outputs phase error sign information in the form of a late OR early pulse whose width doesn't vary
- Phase detector gain is ideally infinite at zero phase error
  - Finite gain will be present with noise, clock jitter, sampler metastability, ISI

# Alexander PD Characteristic (w/ noise)

- Total transfer characteristic is the convolution of the ideal PD transfer characteristic and the noise PDF
- Noise linearizes the phase detector over a phase region corresponding to the peak-to-peak jitter

$$K_{PD} \approx \frac{2}{J_{PP}}(TD)$$

- TD is the transition density – no transitions, no information
  - A value of 0.5 can be assumed for random data
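The linearizing effect of jitter can be reproduced with a small Monte Carlo experiment. The sketch below assumes uniformly distributed jitter with peak-to-peak width $J_{PP}$ (an assumption made here so that the slope is constant over the jitter range), a transition density of 0.5, and phase expressed in UI. It estimates the average bang-bang output around zero phase error and compares the slope with $K_{PD} \approx 2\,TD/J_{PP}$.

```python
import random


def average_pd_output(phase_error_ui, jitter_pp_ui, transition_density, trials, rng):
    """Mean bang-bang PD output (+1 / -1 / 0) at a given static phase error."""
    total = 0
    for _ in range(trials):
        if rng.random() < transition_density:          # a data transition occurred
            jitter = rng.uniform(-0.5 * jitter_pp_ui, 0.5 * jitter_pp_ui)
            total += 1 if phase_error_ui + jitter > 0 else -1
    return total / trials


def estimate_kpd(jitter_pp_ui, transition_density, trials=100_000, seed=0):
    """Slope of the averaged PD characteristic near zero phase error (per UI)."""
    rng = random.Random(seed)
    delta = 0.1 * jitter_pp_ui
    up = average_pd_output(+delta, jitter_pp_ui, transition_density, trials, rng)
    down = average_pd_output(-delta, jitter_pp_ui, transition_density, trials, rng)
    return (up - down) / (2.0 * delta)


J_PP = 0.2   # UI, assumed
TD = 0.5     # transition density for random data
kpd_sim = estimate_kpd(J_PP, TD)
kpd_formula = 2.0 * TD / J_PP
print(f"simulated K_PD = {kpd_sim:.2f} /UI, formula = {kpd_formula:.2f} /UI")
```

Less jitter gives a steeper averaged characteristic, so the effective gain of a bang-bang loop depends on the noise present in the link.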



# Mueller-Müller Baud-Rate PD



[Spagna ISSCC 2010]



- Simplified MM-PD only considers transition patterns
- If consecutive error samples are different, phase error polarity is given by  $e_j$

[Balamurugan JSSC 2008]



- Differential input stage followed by high-swing output stage
- Can be sensitive to power-supply noise and reduce jitter benefits of low-swing distribution techniques
- Often require some type of duty-cycle control

[Kossel JSSC 2008]



- AC-coupled self-biased inverter input stages and cross-coupled buffer stages can help improve duty cycle performance

# ILO-Based Multi-Phase Clock Generation



- ILO generates multiple output phases from differential injected clock
- Coarse frequency tuning loop ensures that the ILO will lock
- Fine quadrature-locked loop minimizes phase error

# Clock Error Calibration Loop



- Clocks are AC-coupled to input inverters that are biased at the trip point with feedback resistors
- $I_{DC}$  injected at inverter input shifts trip point and output duty cycle
- Monotonic control achieved with pull-up/down diodes
- $R_{DC}$  can also be adjusted to change tuning range



# A 4nm 48Gb/s/wire Single-Ended NRZ Transceiver with Offset-Calibration and Equalization Schemes for Next-Generation Memory Interfaces and Chiplets

Kihwan Seong, Wooseok Oh, Hyunwoo Lee, Gyeomje Bae,  
Youngseob Suh, Hyemun Lee, Juyoung Kim, Eunsu Kim, Yeongeon Kang,  
Gunhu Mo, Youjin Lee, Mingyeong Kim, Seongno Lee, Donguk Park,  
Byoung-Joo Yoo, Hyo-Gyuem Rhew, Jongshin Shin

**SAMSUNG**

Samsung Electronics, Hwaseong, Korea

# Example

## Motivation



- Use single-ended transceivers with a 2.5D package (PKG) to carry large amounts of data
- Difficult to use general equalization schemes such as T-Coil and CTLE

# Overall Architecture



- Implement **high-speed NRZ transceivers with standard package**
- Each slice includes 10 DQs for Data and 1 DQS for Clock.

# Transmitter (TX)



- Adopt source synchronous single-ended transmitter with low-swing
- **Implement On-chip Feedback EQ and HCRG** for high-speed operation

# On-chip TX Feedback Equalizer (OFE)



- ISI occurs due to the parasitic capacitance of the 4:1 serializer
- To shorten the feedback time, a **source follower (SF)** is used in the proposed OFE

# Duty Cycle Corrector (DCC)



- DCC with ac-coupling capacitance used to improve clock performance
- **Minimize the settling time and eliminate the dynamic voltage stress**

# High-speed Clock and Reset Generator (HCRG)



- IQ divider needed for quarter rate clock scheme
- Implement high-speed clock and reset generator

## Receiver (RX)



- Local and global de-skew circuits compensate the skew between data and clock
- Implement **digitally controlled offset calibration scheme**

# RX AFE – Offset Calibration Scheme



<Conventional 1<sup>st</sup> Tail offset calibration>



<Proposed 1<sup>st</sup> Tail w/ offset calibration>



- A DAC is widely used for offset calibration of the RX AFE
- To reduce power and area, a **digitally controlled offset calibration scheme** is used

# Measurement Results and PKG Photo



# Example

Development of short-reach D2D interconnect interfaces over the past three years

Published in 2024

Published in 2023

Published in 2022

|               | Clocking improvement | TX equalization | RX equalization | Speed / energy efficiency | Modulation |
|---------------|----------------------|-----------------|-----------------|---------------------------|------------|
| SK_Hynix_13.1 | High-SNR, low-power clocking architecture (fast wake-up) | ---- | ---- | 35.4Gb/s/p | PAM3 |
| SUSTech_13.5  | Jitter reduced by 87% (CIJ) | XTC (5b-C-peaking), reconfigurable FS-FFE | ---- | 64Gb/s/p<br>1.27pJ/b | PAM4 |
| Samsung_13.6  | WCK CDN optimization (60mA/die reduction, 16Gb/s) | Gain-controlled FFE (improved PVT performance), ZQ calibration to optimize RLM | CTLE merged with 1-tap DFE, simplifying the feedback path | 37Gb/s | PAM3 |
| SK_Hynix_13.8 | Clock calibration (four-phase skew reduced by 65%) | ---- | Offset_calib (w 1-tap DFE), IO used for detection reduces DQ PAD parasitic capacitance by 39% | 10.5Gb/s/p | ---- |
| Samsung_13.10 | HCRG, DCC | Cap_EQ<br>On chip EQ (shorter cycle) | Offset_calib (digitally controlled) | 48Gb/s<br>0.67pJ/b | NRZ |
| Samsung_6.4   | Phase-training(SYNC_G) | Cap_EQ(XTC) | DFE w double tail latch (reduced feedback time) | 32Gb/s | NRZ |
| Samsung_28.3  | SWJC (adjusts thermometer-code transitions) | FS-FFE<br>3bit C-peaking<br>Relaxed impedance matching | Relaxed impedance matching | 16Gb/s | PAM4/NRZ |

|              | Clocking improvement | TX equalization | RX equalization | Speed / energy efficiency | Modulation |
|--------------|----------------------|-----------------|-----------------|---------------------------|------------|
| Seoul University_28.6 | CEC (edge corrector; corrects 4-phase phase error, reduces power) | Driver (PN over NP, improved nonlinearity)<br>2-tap Edge boosting (cap and T-coil) | ---- | 32Gb/s<br>0.51pJ/b | NRZ |
| Samsung_28.2 | WCK training (current adjustment, 3ps) | ZQ coding (merged multiplexing; improved ISI and PSIJ; removed 2 T-coils) | ---- | 27Gb/s | ---- |
| POSTECH_28.4 | ---- | 4-tap A-FFE (inverter-based; improved area, power, and swing) | ---- | 20Gb/s<br>1.18pJ/b | NRZ |
| Korea University_28.5 | ---- | Di-code EQ (provides a suitable CM voltage) | INV-BASED TIA<br>Di-code ECC (lower hardware overhead)<br>No capacitor-mismatch calibration | 10Gb/s<br>0.385pJ/b | Di-code |
| Seoul National University_28.6 | ---- | Capacitively driven link (FFE) combined with Ground forced-bias technique | ---- | 12Gb/s | NRZ |
| Samsung_28.7 | No high-speed DCC/CDR required | ---- | ---- | 20Gb/s<br>1.24pJ/b | DECS |