

# ECEN720: High-Speed Links Circuits and Systems

## Spring 2023

---

### Lecture 6: RX Circuits



Sam Palermo  
Analog & Mixed-Signal Center  
Texas A&M University

# Announcements

---

- Lab 4 report and Prelab 5 due Mar 6
- Exam 1 Mar 7
  - Covers material through Lecture 6
  - Previous years' exam 1s are posted on the website for reference
- Sampler and comparator papers are posted on the website

# Outline

---

- Receiver parameters
- T-coils at RX examples
- Analog front-end
- Clocked comparators
- Sensitivity & offset correction
- Demultiplexing
- PAM4 RX example

# High-Speed Electrical Link System



# Receiver Parameters

---

- RX sensitivity, offsets in voltage and time domain, and aperture time are important parameters
- Minimum eye width is determined by aperture time plus peak-to-peak timing jitter
- Minimum eye height is determined by sensitivity plus peak-to-peak voltage offset



# RX Block Diagram

---



- RX must sample the signal with high timing precision and resolve input data to logic levels with high sensitivity
- Input pre-amp can improve signal gain and improve input referred noise
  - Can also be used for equalization and offset correction, and to set the sampler common-mode
  - Must provide gain at high-bandwidth corresponding to full data rate
- Comparator can be implemented with static amplifiers or clocked regenerative amplifiers
  - Clocked regenerative amplifiers are more power efficient for high gain
- Decoder used for advanced modulation (PAM4, Duo-binary)

# 56Gb/s PAM4 Input Network

[Pisati ISSCC 2019]




- T-coil isolates ESD and input stage capacitance
- Shunt peaking with termination network provides bandwidth extension

# 100Gb/s PAM4 Input Network



[Loi ISSCC 2019]

- Bridged T-coil isolates ESD and provides further bandwidth extension
- Series peaking isolates input stage capacitance

# Analog Front-End (AFE)

[Pisati ISSCC 2019]




- AFE provides equalization (CTLE) and gain stages (VGA) to optimize the signal for symbol detection (mixed-signal RX) or quantization (ADC-based RX)
- Shrinking supply voltages make it difficult to efficiently achieve gain

# RX Static Differential Amplifiers

- Differential input amplifiers often used as input stage in high performance serial links
  - Rejects common-mode noise
  - Sets input common-mode for the following comparator
- Input stage type (n or p) often set by termination scheme
- High gain-bandwidth product necessary to amplify full data rate signal
- Offset correction and equalization can be merged into the input amplifier



$$A_v = g_{m1} (R_L \| r_{o1}) \approx g_{m1} R_L$$



$$A_v = \frac{g_{m1}}{g_{m3} + g_{o3} + g_{o4} + g_{o1}} \approx \frac{g_{m1}}{g_{m3}}$$
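The two gain expressions above can be checked numerically. The sketch below is only an illustration: all device values (`gm1`, `r_load`, `ro1`, `gm3`, and the output conductances) are assumed numbers, not values from the lecture.

```python
def parallel(a, b):
    """Parallel combination of two resistances in ohms."""
    return a * b / (a + b)

def gain_resistive_load(gm1, r_load, ro1):
    """A_v = gm1 * (R_L || ro1) for the resistor-loaded stage."""
    return gm1 * parallel(r_load, ro1)

def gain_gm3_load(gm1, gm3, go1, go3, go4):
    """A_v = gm1 / (gm3 + go3 + go4 + go1) for the gm3-loaded stage."""
    return gm1 / (gm3 + go3 + go4 + go1)

# Assumed example values (hypothetical, for illustration only)
gm1 = 5e-3       # input pair transconductance, S
r_load = 400.0   # load resistor, ohms
ro1 = 10e3       # input device output resistance, ohms
gm3 = 2e-3       # load device transconductance, S
go = 50e-6       # each output conductance, S

a_res = gain_resistive_load(gm1, r_load, ro1)   # exact expression
a_res_approx = gm1 * r_load                     # gm1 * R_L approximation
a_gm3 = gain_gm3_load(gm1, gm3, go, go, go)     # exact expression
a_gm3_approx = gm1 / gm3                        # gm1 / gm3 approximation

print(f"resistive load: {a_res:.3f} (approx {a_res_approx:.3f})")
print(f"gm3 load:       {a_gm3:.3f} (approx {a_gm3_approx:.3f})")
```

With these assumed values the approximations overestimate the gain by a few percent, since the neglected output conductances always reduce the exact gain.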

# Low-Voltage Gm-TIA Amplification

---



[Pisati ISSCC 2019]

- Two-stage topology consisting of an input transconductance (Gm) stage followed by an output transimpedance (TIA) stage allows for low-voltage operation
- Both NMOS and PMOS transconductance can be utilized
- TIA stage allows for improved gain with better linearity, as mostly signal current passes through  $R_F$

# eSilicon 56Gb/s PAM4 CTLE Gm-Stage

- Input AC-coupling for optimal common-mode to utilize both NMOS and PMOS Gm
- RC degeneration at main input transistors' sources provides high-frequency peaking
- Additional tunable bias resistor at the NMOS input provides an additional zero for low-frequency channel compensation
- Gain control achieved through bias programmability

[Pisati ISSCC 2019]



# eSilicon 56Gb/s PAM4 CTLE TIA-Stage

- Inverter-based gain stage with feedback resistor
- Supply noise rejection achieved with a replica-bias regulated power supply
- As mostly signal current flows through RF, good linearity is achieved with high signal swing

[Pisati ISSCC 2019]



# Inverter-Based Design

[Zheng CICC 2018]



- Inverter-based design allows for both NMOS and PMOS transconductance
- Cells can also be used as resistive and active-inductor loads

# 56Gb/s Inverter-Based CTLE Replica Bias Loop

[Zheng CICC 2018]



- Replica-biasing with ring oscillator-based process monitor

# RX Clocked Comparators

---

- Also called regenerative amplifier, sense-amplifier, flip-flop, latch
- Samples the continuous input at clock edges and resolves the differential to a binary 0 or 1



[J. Kim]

# Important Comparator Characteristics

---

- Offset and hysteresis
- Sampling aperture, timing resolution, uncertainty window
- Regeneration gain, voltage sensitivity, metastability
- Random decision errors, input-referred noise

# Dynamic Comparator Circuits



**Strong-Arm Latch**



**CML Latch**

- To form a flip-flop
  - After strong-arm latch, cascade an R-S latch
  - After CML latch, cascade another CML latch
- Strong-Arm flip-flop has the advantage of no static power dissipation and full CMOS output levels

# StrongARM Latch Operation

[J. Kim TCAS1 2009]



- 4 operating phases: reset, sampling, regeneration, and decision

# StrongARM Latch Operation – Sampling Phase

[J. Kim TCAS1 2009]

- Sampling phase starts when clk goes high,  $t_0$ , and ends when PMOS transistors turn on,  $t_1$
- M1 pair discharges  $X/X'$
- M2 pair discharges  $\text{out}+/-$

$$\frac{v_{out}(s)}{v_{in}(s)} = \frac{g_{m1}g_{m2}}{sC_{out}C_x \left( s + \frac{g_{m2}(C_{out} - C_x)}{C_{out}C_x} \right)}$$

$$\approx \frac{g_{m1}g_{m2}}{s^2C_{out}C_x} = \frac{1}{s^2\tau_{s1}\tau_{s2}}$$

where  $\tau_{s1} \equiv C_x/g_{m1}$ ,  $\tau_{s2} \equiv C_{out}/g_{m2}$
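The approximate $1/(s^2\tau_{s1}\tau_{s2})$ response is a double integrator, so a step input applied at $t_0$ gives a magnitude gain of $t^2/(2\tau_{s1}\tau_{s2})$ after time $t$. A minimal sketch of that relation follows; the capacitance and transconductance values are assumptions for illustration, and signal polarity is ignored.

```python
def sampling_phase_gain(t, gm1, gm2, c_x, c_out):
    """Magnitude of v_out/v_in at time t after the sampling phase starts,
    for a step input and the double-integrator approximation."""
    tau_s1 = c_x / gm1      # first integrator time constant
    tau_s2 = c_out / gm2    # second integrator time constant
    return t ** 2 / (2.0 * tau_s1 * tau_s2)

# Assumed example values (hypothetical)
gm1 = 2e-3      # S
gm2 = 2e-3      # S
c_x = 5e-15     # F, gives tau_s1 = 2.5 ps
c_out = 10e-15  # F, gives tau_s2 = 5 ps

gain_10ps = sampling_phase_gain(10e-12, gm1, gm2, c_x, c_out)
gain_20ps = sampling_phase_gain(20e-12, gm1, gm2, c_x, c_out)
print(gain_10ps, gain_20ps)
```

Doubling the sampling-phase duration quadruples the gain in this model, which is why the sampling phase contributes significant gain before regeneration begins.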



# StrongARM Latch Operation – Regeneration

[J. Kim TCAS1 2009]

- Regeneration phase starts when PMOS transistors turn on,  $t_1$ , until decision time,  $t_2$
- Assume M1 is in linear region and circuit no longer sensitive to  $v_{in}$
- Cross-coupled inverters amplify signals via positive-feedback:

$$G_R = \exp\left(\frac{t_2 - t_1}{\tau_R}\right)$$

$$\tau_R = C_{out} / (g_{m2,r} + g_{m3,r})$$
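To get a feel for the exponential regeneration gain, the sketch below evaluates $G_R$ and the time needed to reach a target gain. The capacitance and transconductance numbers are assumed, not taken from the referenced paper.

```python
import math

def regeneration_tau(c_out, gm2_r, gm3_r):
    """tau_R = C_out / (gm2_r + gm3_r)."""
    return c_out / (gm2_r + gm3_r)

def regeneration_gain(t_regen, c_out, gm2_r, gm3_r):
    """G_R = exp((t2 - t1) / tau_R), with t_regen = t2 - t1."""
    return math.exp(t_regen / regeneration_tau(c_out, gm2_r, gm3_r))

def time_for_gain(target_gain, c_out, gm2_r, gm3_r):
    """Regeneration time needed to reach target_gain."""
    return regeneration_tau(c_out, gm2_r, gm3_r) * math.log(target_gain)

# Assumed example values (hypothetical): tau_R = 10 fF / 2 mS = 5 ps
c_out, gm2_r, gm3_r = 10e-15, 1.2e-3, 0.8e-3

gain_25ps = regeneration_gain(25e-12, c_out, gm2_r, gm3_r)
t_gain_100 = time_for_gain(100.0, c_out, gm2_r, gm3_r)
print(f"gain after 25 ps: {gain_25ps:.1f}")
print(f"time for gain of 100: {t_gain_100 * 1e12:.2f} ps")
```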



# StrongARM Latch Operation – Diff. Output

[J. Kim TCAS1 2009]

---



# Conventional RS Latch

- RS latch holds output data during latch pre-charge phase



- Conventional RS latch rising output transitions first, followed by falling transition

# Optimized RS Latch

[Nikolic JSSC 2000]

- Optimizing RS latch for symmetric pull-up and pull-down paths allows for considerable speed-up
- During evaluation, large driver transistors are activated to change output data and the keeper path is disabled
- During pre-charge, large driver transistors are tri-stated and small keeper cross-coupled inverter activated to hold data



# Delay Improvement w/ Optimized RS Latch

[Nikolic JSSC 2000]



- Strong-Arm flip-flop delay improves by close to a factor of two
- Has better delay performance than other advanced flip-flop topologies

# Sampler Analysis

- Sampler analysis provides insight into comparator operation

[Johansson JSSC 1998]



$$v_{sample} = \int_{-\infty}^{\infty} v_{in}(\tau) h(\tau) d\tau$$

Figure labels: input, sampling function $h(\tau)$, sampling clock



- Switch can be modeled as a device which determines a weighted average over time of the input signal
- The weighting function is called the sampling function
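The weighted-average view can be tried numerically. This is a sketch under an assumed Gaussian-shaped sampling function (the shape and its 5 ps width are illustrative choices, not a model of a specific switch): a DC input is reproduced exactly, while a fast sinusoid is attenuated because it is averaged over the window.

```python
import math

def gaussian_sf(tau, t_center, sigma):
    """Assumed Gaussian sampling function with unit area."""
    z = (tau - t_center) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def sample(v_in, h, t_start, t_stop, n=20001):
    """Riemann-sum approximation of v_sample = integral v_in(tau) h(tau) dtau."""
    dt = (t_stop - t_start) / (n - 1)
    total = 0.0
    for i in range(n):
        tau = t_start + i * dt
        total += v_in(tau) * h(tau) * dt
    return total

sigma = 5e-12   # assumed sampling-function width
freq = 20e9     # assumed input tone frequency

def h(tau):
    return gaussian_sf(tau, 0.0, sigma)

v_dc = sample(lambda tau: 0.1, h, -50e-12, 50e-12)
v_ac = sample(lambda tau: 0.1 * math.cos(2.0 * math.pi * freq * tau),
              h, -50e-12, 50e-12)
print(f"DC input sampled as {v_dc * 1e3:.2f} mV")
print(f"20 GHz tone peak sampled as {v_ac * 1e3:.2f} mV")
```

For a Gaussian sampling function the attenuation of a tone sampled at its peak is $\exp(-2\pi^2 f^2 \sigma^2)$, which the numerical result reproduces.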

# Sampling Function Properties

---

- Sampling function should (ideally) integrate to 1

$$\int_{-\infty}^{\infty} h(\tau) d\tau = 1$$

- Ideal sampling function is a delta function
  - Sampled value is only a function of exact sampling time



ideal  $h(\tau) = \delta(\tau)$



$$v_{sample} = \int_{-\infty}^{\infty} v_{in}(\tau) h(\tau) d\tau$$

# Sampling Function Example

- Practical sampling function will weight the input signal near the nominal sampling time



# Sampler Frequency Response

- Fourier transform of the sampling function yields the sampler frequency response
- Sampler bandwidth is a function of sample clock transition time



# Sampler Aperture Time

- Aperture time is defined as the width of the sampling function (SF) peak where a certain percentage (80%) of the sensitivity is confined

$$w_{80} = t_{90} - t_{10}$$

$$0.1 = \int_{-\infty}^{t_{10}} h(\tau) d\tau$$

$$0.9 = \int_{-\infty}^{t_{90}} h(\tau) d\tau$$
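The $w_{80}$ definition can be evaluated by accumulating the sampling function until 10% and 90% of its area are reached. The sketch below assumes a Gaussian sampling function with $\sigma = 5$ ps purely as a test case; for a Gaussian, $w_{80} \approx 2.563\sigma$.

```python
import math

def aperture_w80(h, t_start, t_stop, n=20001):
    """Return w80 = t90 - t10 from the cumulative integral of h(tau)."""
    dt = (t_stop - t_start) / (n - 1)
    values = [h(t_start + i * dt) for i in range(n)]
    area = sum(values) * dt          # normalize in case h is not unit-area
    cumulative = 0.0
    t10 = None
    t90 = None
    for i, value in enumerate(values):
        cumulative += value * dt / area
        if t10 is None and cumulative >= 0.1:
            t10 = t_start + i * dt
        if cumulative >= 0.9:
            t90 = t_start + i * dt
            break
    return t90 - t10

sigma = 5e-12  # assumed width of the example sampling function

def h_gauss(tau):
    return math.exp(-0.5 * (tau / sigma) ** 2)

w80 = aperture_w80(h_gauss, -50e-12, 50e-12)
print(f"w80 = {w80 * 1e12:.2f} ps")
```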



# Clocked Comparator LTV Model



- Comparator can be viewed as a noisy nonlinear filter followed by an ideal sampler and slicer (comparator)
- Small-signal comparator response can be modeled with an impulse sensitivity function (ISF)  $\Gamma(\tau) = h(t_{obs}, \tau)$

# Clocked Comparator ISF

- Comparator ISF is a subset of a time-varying impulse response  $h(t, \tau)$  for LTV systems:

$$y(t) = \int_{-\infty}^{\infty} h(t, \tau) \cdot x(\tau) d\tau$$

- $h(t, \tau)$ : system response at  $t$  to a unit impulse arriving at  $\tau$

- ISF  $\Gamma(\tau) = h(t_{obs}, \tau)$

- For comparators,  $t_{obs}$  is before decision is made
- Output voltage of comparator

$$v_o(t_{obs}) = \int_{-\infty}^{\infty} v_i(\tau) \cdot \Gamma(\tau) d\tau$$

- Comparator decision

$$D_k = \text{sgn}(v_k) = \text{sgn}(v_o(t_{obs} + kT)) = \text{sgn}\left(\int_{-\infty}^{\infty} v_i(\tau) \cdot \Gamma(\tau) d\tau\right)$$



Fig. 1. LTV system is characterized either with time-varying impulse response  $h(t, \tau)$  or with time-varying transfer function  $H(j\omega; t)$ .

# Clocked Comparator ISF



- ISF is defined with respect to  $t_{obs}$ , or the decision time
- The comparator provides the most gain during the sampling phase

[J. Kim]

# Clocked Comparator ISF

- ISF shows sampling aperture or timing resolution
- In frequency domain, it shows sampling gain and bandwidth



# Characterizing Comparator ISF

[Jeeradit VLSI 2008]

1. Find metastable  $V_{MS}(\tau) = V_{os}(t \rightarrow \infty, \tau)$  such that  $V(out+) = V(out-)$



2. Measure  $V_{MS}$  for varying  $\tau$



3. Derive ISF

$$SSF_{norm}(\tau) = \frac{V_{MS}(\tau) - V_L}{V_H - V_L}$$

$$ISF_{norm}(\tau) = \frac{d}{d\tau} SSF_{norm}(\tau)$$
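Step 3 can be sketched as post-processing on a table of measured $V_{MS}(\tau)$ points: normalize to get the step sensitivity function, then differentiate numerically. The data below is synthetic (an error-function-shaped transition with assumed $V_L$, $V_H$, and an 8 ps width), standing in for simulation results.

```python
import math

def derive_isf(taus, v_ms, v_low, v_high):
    """Normalize V_MS(tau) to SSF_norm and differentiate to get ISF_norm."""
    ssf = [(v - v_low) / (v_high - v_low) for v in v_ms]
    isf = []
    last = len(taus) - 1
    for i in range(len(taus)):
        lo = max(i - 1, 0)       # one-sided difference at the ends
        hi = min(i + 1, last)    # central difference elsewhere
        isf.append((ssf[hi] - ssf[lo]) / (taus[hi] - taus[lo]))
    return ssf, isf

# Synthetic stand-in for measured data (assumed values)
v_low, v_high = 0.4, 0.6           # step levels in volts
width_ps = 8.0                     # assumed transition width
taus_ps = [float(t) for t in range(-50, 51)]
v_ms = [v_low + (v_high - v_low) * 0.5 *
        (1.0 + math.erf(t / (width_ps * math.sqrt(2.0))))
        for t in taus_ps]

ssf_norm, isf_norm = derive_isf(taus_ps, v_ms, v_low, v_high)
isf_area = sum(isf_norm) * 1.0     # 1 ps spacing between tau points
peak_tau = taus_ps[isf_norm.index(max(isf_norm))]
print(f"ISF area = {isf_area:.3f}, peak at tau = {peak_tau} ps")
```

Because the SSF is normalized to run between 0 and 1, the resulting ISF integrates to approximately 1, matching the ideal sampling-function property.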



- For more details, see [http://www.ece.tamu.edu/~spalermo/ecen689/ECEN720_lab4_2017.pdf](http://www.ece.tamu.edu/~spalermo/ecen689/ECEN720_lab4_2017.pdf)

Strong-Arm Latch



[J. Kim]

CML Latch



[Toifl]

# Comparison of SA & CML Comparator (1)

[Jeeradit VLSI 2008]



- Sampling time of SA latch varies with VDD, while CML isn't affected much

# Comparison of SA & CML Comparator (2)

[Jeeradit VLSI 2008]



- CML latch has higher sampling gain with small input pair
- StrongARM latch has higher sampling bandwidth
  - For CML latch increasing input pair also directly increases output capacitance
  - For SA latch increasing input pair results in transconductance increasing faster than capacitance

# Low-Voltage SA – Schinkel ISSCC 2007



- Does require  $\text{clk}$  &  $\text{clk\_b}$ 
  - How sensitive is it to skew?

# Low-Voltage SA – Schinkel ISSCC 2007



# Low-Voltage SA – Schinkel ISSCC 2007

---



90nm CMOS simulations.  $\Delta V_{in} = 50\text{mV}$ .

Circuits designed for equal offset  $\sigma_{os} = 10\text{mV}$  at  $V_{cm} = 1.1\text{V}$

# Low-Voltage SA – Goll TCAS2 2009



- Similar stacking to conventional SA latch
- However, now P0 and P1 are initially on during evaluation which speeds up operation at lower voltages
- Does require  $\text{clk}$  &  $\text{clk\_b}$
  - How sensitive is it to skew?

# Low-Voltage SA – Goll TCAS2 2009

---



# Charge-Steering Latch



[Chiang 2013 VLSI, Bai 2014 ISSCC]

- 😊 First stage has small aperture time, but both outputs discharge to GND
- 😊 Second stage has small delay, provides gain, and latches the differential output
- 😊 Only requires one clock phase
- 😢 Gain is limited

# Charge-Steering Latch Headroom at Low- $V_{DD}$



- Only one effective transistor stack
  - Maximizes  $g_m$  of active transistors
  - Allows for low-voltage operation

# Charge-Steering Latch w/ Common-Mode Restore

[Bai 2014 ISSCC]



- Differential output swing is proportional to output voltage common-mode ( $V_{CM}$ ) drop
- However, excessive  $V_{CM}$  drop can limit subsequent stages' speed
- Addition of PMOS capacitors allows for larger overall gain

# 65nm Charge-Steering Latch Performance



[Bai 2014 ISSCC]



- Sampling aperture is  $\sim 17\text{ps}$  (post-layout)
- Latch has a gain  $>2$
- Also possible to configure the structure as a fast sample-and-hold (S/H)

# Charge-Steering Latch w/ Regeneration



- 😊 Addition of small  $Mp3/Mn3$  regeneration stage in parallel with second stage output provides a full-swing output
- 😊 Regeneration current set with an NMOS transistor
- 😊 Only requires one clock phase
- 😊 Overall, smaller delay relative to other low-voltage regenerative comparators (Schinkel latch)
- Utilized in a 32Gb/s PAM4 DFE receiver [Elhadidy 2015 VLSI]

# RX Sensitivity

---

- RX sensitivity is a function of the input referred noise, offset, and minimum latch resolution voltage

$$v_S^{pp} = 2v_n^{rms} \sqrt{SNR} + v_{\min} + v_{offset}^{*}$$

- Gaussian (unbounded) input referred noise comes from input amplifiers, comparators, and termination
  - A minimum signal-to-noise ratio (SNR) is required for a given bit-error-rate (BER)

For BER =  $10^{-12}$  ( $\sqrt{SNR} = 7$ )

- Minimum latch resolution voltage comes from hysteresis, finite regeneration gain, and bounded noise sources

Typical  $v_{\min} < 5mV$

- Input offset is due to circuit mismatch (primarily  $V_{th}$  mismatch) & is most significant component if uncorrected
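The link between $\sqrt{SNR} = 7$ and a BER of about $10^{-12}$ follows from the Gaussian tail probability. A minimal sketch, assuming equiprobable binary symbols, additive Gaussian noise, and $\sqrt{SNR}$ defined as the ratio of signal amplitude to rms noise:

```python
import math

def ber_from_sqrt_snr(sqrt_snr):
    """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(sqrt_snr / math.sqrt(2.0))

for q in (6.0, 7.0, 8.0):
    print(f"sqrt(SNR) = {q:.0f} -> BER = {ber_from_sqrt_snr(q):.2e}")
```

The steep dependence is the point here: moving $\sqrt{SNR}$ by one unit changes the BER by roughly three orders of magnitude.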

# Front-End Noise



$$\text{Output Noise Power Spectrum: } V_{n,FE}^2 = |H(f)|^2 \cdot V_{n,in}^2(f)$$

Integrating this noise spectrum over the decision circuit bandwidth  $BW_D$  gives the total noise power experienced by the decision circuit

$$\overline{v_{n,FE}^2} = \int_0^{BW_D} |H(f)|^2 \cdot V_{n,in}^2(f) df$$

- Note that since  $H(f)$  generally rolls off quickly, the exact upper bound is not critical and could be set to a very high value (infinity)
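The noise integral can be evaluated numerically for any front-end response. The sketch below assumes a single-pole $H(f)$ with DC gain 2 and a 20 GHz pole, driven by a white input noise density of 1 nV/√Hz; these are illustrative values only. For this case the integral has the closed form $A^2 v_n^2 f_p \arctan(BW_D/f_p)$, which serves as a check.

```python
import math

def integrated_noise_rms(h_mag, vn_density, bw, n=20000):
    """sqrt of integral_0^bw |H(f)|^2 * Vn^2(f) df using the midpoint rule.
    vn_density(f) is the input noise density in V/sqrt(Hz)."""
    df = bw / n
    total = 0.0
    for i in range(n):
        f = (i + 0.5) * df
        total += (h_mag(f) ** 2) * (vn_density(f) ** 2) * df
    return math.sqrt(total)

# Assumed front-end model (hypothetical values)
dc_gain = 2.0
f_pole = 20e9
vn_in = 1e-9   # V/sqrt(Hz), white

def h_mag(f):
    return dc_gain / math.sqrt(1.0 + (f / f_pole) ** 2)

def vn_density(f):
    return vn_in

rms_10x = integrated_noise_rms(h_mag, vn_density, 10.0 * f_pole)
rms_100x = integrated_noise_rms(h_mag, vn_density, 100.0 * f_pole)
print(f"BW_D = 10 f_p:  {rms_10x * 1e6:.1f} uVrms")
print(f"BW_D = 100 f_p: {rms_100x * 1e6:.1f} uVrms")
```

Even for this slowly rolling single-pole response, raising the upper bound by 10x changes the rms noise by only about 3%; a steeper roll-off makes the bound matter even less.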

# 56Gb/s Front-End Output Noise Example



- Iterating front-end configuration (DC gain, peaking, bandwidth) to compensate for a 37dB channel
- While front-end ISI is reduced with higher bandwidth and peaking, the rms noise also grows
- The optimum bandwidth is generally near or slightly higher than the Nyquist frequency

# Comparator Noise



- Device noise causes random decisions even with zero input signal
- Noise variance can be found by fitting output to a Gaussian CDF as the input is swept and transient noise is enabled
- Noise can also be simulated with PSS+PAC+PNOISE, but requires post processing to find ISF from sideband transfer function [Kim TCAS-I 2009]
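The Gaussian-CDF fitting approach can be mimicked with a behavioral model. In this sketch the comparator is replaced by an ideal slicer with an assumed 1 mV rms input-referred noise and 0.3 mV offset; the probability of a "1" is recorded as the input is swept, and a straight-line fit in the probit domain recovers both values. The unweighted least-squares fit is a simple choice for illustration, not the method of any cited paper.

```python
import random
from statistics import NormalDist

def probability_of_one(v_in, offset, sigma_n, trials, rng):
    """Fraction of decisions that resolve to 1 for a noisy ideal slicer."""
    ones = 0
    for _ in range(trials):
        if v_in - offset + rng.gauss(0.0, sigma_n) > 0.0:
            ones += 1
    return ones / trials

def fit_gaussian_cdf(v_points, p_points):
    """Fit P(1) = Phi((v - mu) / sigma) via a line through probit(p) vs v."""
    pairs = [(v, NormalDist().inv_cdf(p))
             for v, p in zip(v_points, p_points) if 0.0 < p < 1.0]
    n = len(pairs)
    mean_v = sum(v for v, _ in pairs) / n
    mean_z = sum(z for _, z in pairs) / n
    slope = (sum((v - mean_v) * (z - mean_z) for v, z in pairs)
             / sum((v - mean_v) ** 2 for v, _ in pairs))
    sigma = 1.0 / slope
    mu = mean_v - mean_z * sigma
    return mu, sigma

rng = random.Random(1)                 # fixed seed for repeatability
true_sigma, true_offset = 1e-3, 0.3e-3 # assumed behavioral-model values
sweep = [i * 0.5e-3 for i in range(-4, 5)]
probs = [probability_of_one(v, true_offset, true_sigma, 20000, rng)
         for v in sweep]
offset_est, sigma_est = fit_gaussian_cdf(sweep, probs)
print(f"offset = {offset_est * 1e3:.2f} mV, sigma = {sigma_est * 1e3:.2f} mVrms")
```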

# Comparator Metastability



$$t_{samp} \sim \frac{C_{out} V_{THP}}{I_D}$$

$$t_{reg} \sim \tau_{comp} \ln \left( \frac{V_{DD}}{V_{in}} \right)$$



- Comparator evaluation time grows proportional to  $\ln(V_{in}^{-1})$
- Metastability occurs when the input is too small and the comparator doesn't have sufficient time to fully evaluate
- This metastability window is a major component of the comparator  $V_{min}$
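The logarithmic dependence of regeneration time on input amplitude can be made concrete with a short calculation. The time constant and supply below are assumed values; the second function inverts the same expression to estimate the smallest input that resolves within a given time budget.

```python
import math

def regeneration_time(v_in, v_dd, tau_comp):
    """t_reg ~ tau_comp * ln(V_DD / V_in)."""
    return tau_comp * math.log(v_dd / v_in)

def min_resolvable_input(t_available, v_dd, tau_comp):
    """Smallest input that regenerates to V_DD within t_available."""
    return v_dd * math.exp(-t_available / tau_comp)

tau_comp = 5e-12   # assumed comparator time constant
v_dd = 1.0         # assumed supply

t_10mv = regeneration_time(10e-3, v_dd, tau_comp)
t_1mv = regeneration_time(1e-3, v_dd, tau_comp)
v_min_40ps = min_resolvable_input(40e-12, v_dd, tau_comp)
print(f"10 mV input: {t_10mv * 1e12:.1f} ps")
print(f" 1 mV input: {t_1mv * 1e12:.1f} ps")
print(f"smallest input resolved in 40 ps: {v_min_40ps * 1e6:.0f} uV")
```

Each 10x reduction in input amplitude adds $\tau_{comp}\ln(10)$ of regeneration time in this model.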

# RX Sensitivity & Offset Correction

- RX sensitivity is a function of the input referred noise, offset, and min latch resolution voltage

$$v_S^{pp} = 2v_n^{rms} \sqrt{SNR} + v_{\min} + v_{\text{offset}}^{*}$$

Typical values:  $v_n^{rms} = 1mV_{rms}$ ,  $v_{\min} + v_{\text{offset}}^{*} < 6mV$

$$\text{For BER} = 10^{-12} (\sqrt{\text{SNR}} = 7) \Rightarrow v_S^{pp} = 20mV_{pp}$$

- Circuitry is required to reduce input offset from a potentially large uncorrected value (>50mV) to near 1mV
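The sensitivity budget is simple enough to script, which makes it easy to see how much an uncorrected offset costs. The sketch uses the typical values quoted above; the 50 mV uncorrected case reuses the ">50mV" figure as a single illustrative number.

```python
def rx_sensitivity_pp(vn_rms, sqrt_snr, v_min_plus_offset):
    """v_S^pp = 2 * vn_rms * sqrt(SNR) + (v_min + v_offset)."""
    return 2.0 * vn_rms * sqrt_snr + v_min_plus_offset

corrected = rx_sensitivity_pp(1e-3, 7.0, 6e-3)     # typical values from the slide
uncorrected = rx_sensitivity_pp(1e-3, 7.0, 50e-3)  # illustrative uncorrected offset
print(f"corrected:   {corrected * 1e3:.0f} mVpp")
print(f"uncorrected: {uncorrected * 1e3:.0f} mVpp")
```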



# Comparator Offset



- The input referred offset is primarily a function of  $V_{th}$  mismatch and a weaker function of  $\beta$  (mobility) mismatch

$$\sigma_{V_t} = \frac{A_{V_t}}{\sqrt{WL}}, \quad \sigma_{\Delta\beta/\beta} = \frac{A_\beta}{\sqrt{WL}}$$

- To reduce input offset 2x, we need to increase area 4x
- Not practical due to excessive area and power consumption
- Offset correction necessary to efficiently achieve good sensitivity

# Offset Correction Range & Resolution

---

- Generally circuits are designed to handle a minimum variation range of  $\pm 3\sigma$  for 99.7% yield
- Example: Input differential transistors  $W=4\mu m$ ,  $L=150nm$

$$\sigma_{V_t} = \frac{A_{V_t}}{\sqrt{WL}} = \frac{2.8mV\mu m}{\sqrt{4\mu m \cdot 150nm}} = 3.6mV, \quad \sigma_{\Delta\beta/\beta} = \frac{A_\beta}{\sqrt{WL}} = \frac{2\%\mu m}{\sqrt{4\mu m \cdot 150nm}} = 2.6\%$$

- If we assume (optimistically) that the input offset is only dominated by the input pair  $V_t$  mismatch, we would need to design offset correction circuitry with a range of about  $\pm 11mV$
- If we want to cancel within  $1mV$ , we would need an offset cancellation resolution of 5 bits, resulting in a worst-case offset of

$$1LSB = \frac{\text{Offset Correction Range}}{2^{\text{Resolution}} - 1} = \frac{22mV}{2^5 - 1} \approx 0.71mV$$
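The range and resolution arithmetic above can be reproduced with a few lines, using the mismatch coefficient and device size from the example. Evaluating $22mV/(2^5 - 1)$ numerically gives about 0.71 mV. The helper names are hypothetical.

```python
import math

def pelgrom_sigma(a_coeff, w_um, l_um):
    """sigma = A / sqrt(W * L), with W and L in micrometers."""
    return a_coeff / math.sqrt(w_um * l_um)

def correction_lsb(total_range, bits):
    """LSB size for a correction DAC spanning total_range with 2^bits codes."""
    return total_range / (2 ** bits - 1)

def bits_needed(total_range, target_lsb):
    """Smallest resolution whose LSB does not exceed target_lsb."""
    return math.ceil(math.log2(total_range / target_lsb + 1.0))

sigma_vt_mv = pelgrom_sigma(2.8, 4.0, 0.15)   # A_Vt = 2.8 mV*um, W = 4 um, L = 150 nm
range_mv = 2.0 * math.ceil(3.0 * sigma_vt_mv) # +/-3 sigma rounded up to +/-11 mV
lsb_mv = correction_lsb(range_mv, 5)
n_bits = bits_needed(range_mv, 1.0)
print(f"sigma_Vt = {sigma_vt_mv:.2f} mV, range = {range_mv:.0f} mV")
print(f"5-bit LSB = {lsb_mv:.2f} mV, bits for <= 1 mV LSB: {n_bits}")
```

The same function shows the area penalty of sizing alone: quadrupling $W \cdot L$ only halves $\sigma_{V_t}$.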

# Current-Mode Offset Correction Example

- Differential current injected into input amplifier load to induce an input-referred offset that can cancel the inherent amplifier offset
  - Can be made with extended range to perform link margining
- Passing a constant amount of total offset current for all the offset settings allows for constant output common-mode level
- Offset correction performed both at input amplifier and in individual receiver segments of the 2-way interleaved architecture

[Balamurugan JSSC 2008]



# Capacitive Offset Correction Example

- A capacitive imbalance in the sense-amplifier internal nodes induces an input-referred offset
- Pre-charges internal nodes to allow more integration time for increased offset range
- Additional capacitance does increase sense-amp aperture time
- Offset is trimmed by shorting inputs to a common-mode voltage and adjusting settings until an even distribution of “1”s and “0”s is observed
- Offset correction settings can be sensitive to input common-mode



# Demultiplexing RX

- Demultiplexing allows for lower clock frequency relative to data rate
- Gives extra regeneration and pre-charge time in comparators
- Need precise phase spacing, but not as sensitive to duty-cycle as TX multiplexing



# 1:4 Demultiplexing RX Example



- Increased demultiplexing allows for higher data rate at the cost of increased input or pre-amp load capacitance
- Higher demultiplexing factor is more sensitive to phase offsets (in degrees)
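The sensitivity to phase offsets in degrees follows from the clock period growing with the demultiplexing factor. The sketch assumes a 1:N demultiplexer clocked by N equally spaced phases, each running at the data rate divided by N; the 10 Gb/s rate is an arbitrary example.

```python
def demux_clock_hz(data_rate_bps, demux_factor):
    """Per-phase clock frequency for a 1:N demultiplexing receiver."""
    return data_rate_bps / demux_factor

def phase_spacing_deg(demux_factor):
    """Nominal spacing between adjacent sampling clock phases."""
    return 360.0 / demux_factor

def phase_error_in_ui(phase_error_deg, demux_factor):
    """Timing error in unit intervals caused by a clock phase error.
    One clock period spans N unit intervals, so 1 degree = N/360 UI."""
    return phase_error_deg / 360.0 * demux_factor

data_rate = 10e9  # assumed example data rate
for n in (2, 4, 8):
    print(f"1:{n}  clock = {demux_clock_hz(data_rate, n) / 1e9:.2f} GHz, "
          f"spacing = {phase_spacing_deg(n):.0f} deg, "
          f"1 deg error = {phase_error_in_ui(1.0, n) * 100:.2f}% UI")
```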

# Low-Voltage Serial I/O Transceiver



- Utilizes a high TX output multiplexing (4:1) and RX input de-multiplexing (1:8) factor for low-voltage operation

# 1:8 Input De-Multiplexing RX



- 1:8 input de-multiplexing allows input comparators to operate at low voltages
- Injection-locked-oscillator is used for efficient multi-phase clock generation and de-skew

# 0.47-0.66pJ/bit, 4.8-8Gb/s GP 65nm CMOS Prototype



## Testing with 20cm FR-4 Channel



### TX Power Breakdown (6.4Gb/s at 0.65V)

| Block                                            | Power / Energy |
|--------------------------------------------------|----------------|
| LDO & Output Driver (150mV<sub>ppd</sub>)        | 793uW          |
| Serializer, Pre-drivers, Clocking                | 933uW          |
| Global Impedance Control (amortized across 9 TX) | 193uW          |
| TX Energy Efficiency                             | 0.3pJ/b        |

### RX Power Breakdown (6.4Gb/s at 0.65V)

| Block                   | Power / Energy |
|-------------------------|----------------|
| CTLE, Quantizers, ILRO  | 1.07mW         |
| Clock Distribution      | 38uW           |
| RX Energy Efficiency    | 0.17pJ/b       |
| Total Energy Efficiency | 0.47pJ/b       |

- Optimal 0.47pJ/b energy efficiency achieved at 6.4Gb/s
  - At low data rates, less amortization of static current
  - At high data rates, higher voltage required for serialization timing
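The energy-efficiency entries follow directly from power divided by data rate. A short check using the table values at 6.4Gb/s:

```python
def energy_per_bit_pj(power_w, data_rate_bps):
    """Energy efficiency in pJ/bit = power / data rate."""
    return power_w / data_rate_bps * 1e12

data_rate = 6.4e9                      # 6.4 Gb/s operating point
tx_power = 793e-6 + 933e-6 + 193e-6    # TX breakdown entries, W
rx_power = 1.07e-3 + 38e-6             # RX breakdown entries, W

tx_pj = energy_per_bit_pj(tx_power, data_rate)
rx_pj = energy_per_bit_pj(rx_power, data_rate)
total_pj = tx_pj + rx_pj
print(f"TX {tx_pj:.2f} pJ/b, RX {rx_pj:.2f} pJ/b, total {total_pj:.2f} pJ/b")
```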

Y.-H. Song, R. Bai, P. Chiang, and S. Palermo, "A 0.47-0.66pJ/bit, 4.8-8Gb/s I/O Transceiver in 65nm-CMOS," IEEE JSSC, vol. 48, no. 5, pp. 1276-1289, May 2013.

# PAM4 RX Example

[Roshan Zamir JSSC 2019]



- 2b flash ADC (3 comparators) for PAM4 symbol decisions
- Swept error sampler for PAM4 threshold adaptation
- Edge samplers provide information for CDR & equalization adaptation
- CTLE & DFE cancel ISI
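The 3-comparator symbol decision can be sketched behaviorally: each comparator slices against one threshold and the count of "1" outputs (a thermometer code) gives the PAM4 symbol. The signal levels are assumed values, and the thresholds are simply placed midway between nominal levels as a stand-in for adapted thresholds.

```python
def mid_eye_thresholds(levels):
    """Place one threshold midway between each pair of adjacent levels."""
    ordered = sorted(levels)
    return [(a + b) / 2.0 for a, b in zip(ordered, ordered[1:])]

def pam4_decide(v_sample, thresholds):
    """2b flash decision: count comparators whose threshold is exceeded."""
    return sum(1 for th in thresholds if v_sample > th)

# Assumed nominal PAM4 levels at the slicer input, in volts
levels = [-0.3, -0.1, 0.1, 0.3]
thresholds = mid_eye_thresholds(levels)

samples = [-0.25, -0.05, 0.05, 0.25, 0.12, -0.31]
symbols = [pam4_decide(v, thresholds) for v in samples]
print(symbols)
```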

# PAM4 Slicer Threshold Adaptation



- Fully adaptive background calibration to position slicers in the middle of the PAM4 eyes
- Error sampler tracks eye edges and finds eye heights



[Roshan Zamir JSSC 2019]

# PAM4 Slicer Threshold Adaptation



# 56Gb/s PAM4 RX Test Setup

Channel w/ 20.8dB of loss @ 14GHz

**GP 65nm Prototype**



# 56Gb/s PAM4 RX Measured Results

## GP 65nm Prototype



- 20.8dB channel
- 4.6pJ/b

## Equalization Adaptation



## Threshold Adaptation



## Timing Margin



## Voltage Margin



## Jitter Tolerance



# RX Take-Away Points

---

- AFE provides equalization and gain stages to optimize the signal for symbol detection (mixed-signal RX) or quantization (ADC-based RX)
- Gm-TIA and inverter-based front-ends allow for higher gain with shrinking supply voltages
- Achieving good RX sensitivity requires careful front-end noise analysis and sampler offset correction
- Higher input stage demultiplexing relaxes clock frequencies at the cost of front-end loading and clock phase generation
- PAM4 receivers require extra threshold adaptation

# Next Time

---

- Equalization theory and circuits
  - Equalization overview
  - Equalization implementations
    - TX FIR
    - RX FIR
    - RX CTLE
    - RX DFE
  - Setting coefficients
  - Equalization effectiveness
  - Alternate/future approaches