



**ISSCC 2020**

**SESSION 6**

**Ultra-High-Speed Wireline**

# A 112Gb/s PAM-4 Long-Reach Wireline Transceiver Using a 36-Way Time-Interleaved SAR-ADC and Inverter-Based RX Analog Front-End in 7nm FinFET

Jay Im<sup>1</sup>, Kevin Zheng<sup>1</sup>, Adam Chou<sup>1</sup>, Lei Zhou<sup>1</sup>, Jae Wook Kim<sup>1</sup>, Stanley Chen<sup>1</sup>, Yipeng Wang<sup>2</sup>,  
Hao-Wei Hung<sup>2</sup>, KeeHian Tan<sup>2</sup>, Winson Lin<sup>1</sup>, Arianne Roldan<sup>1</sup>, Declan Carey<sup>3</sup>, Ilias Chlis<sup>3</sup>,  
Ronan Casey<sup>3</sup>, Ade Bekele<sup>1</sup>, Ying Cao<sup>1</sup>, David Mahashin<sup>1</sup>, Hong Ahn<sup>1</sup>, Hongtao Zhang<sup>1</sup>,  
Yohan Frans<sup>1</sup>, Ken Chang<sup>1</sup>

<sup>1</sup>Xilinx, San Jose, CA, <sup>2</sup>Xilinx, Singapore, Singapore, <sup>3</sup>Xilinx, Cork, Ireland



# Outline

---

- Background
- Transceiver architecture
- Circuit highlights
  - Inverter-based RX AFE & ADC SFE
  - 56GS/s time-interleaved SAR ADC
  - RX clock generation
- Measurement results
- Conclusions
- References

# Outline

---

- **Background**
- Transceiver architecture
- Circuit highlights
  - Inverter-based RX AFE & ADC SFE
  - 56GS/s time-interleaved SAR ADC
  - RX clock generation
- Measurement results
- Conclusions
- References

# 112Gb/s wireline transceivers in advanced PDK

Data rate vs. process node & year\*



- IO area & power efficiency
  - TB/s aggregate BW per VLSI needed for high throughput apps
- 112Gb/s transceiver in advanced PDK
  - Esp. for ASICs & FPGAs with monolithic IO integration
- PAM-4 remains dominant modulation scheme\*\*
  - Better spectral efficiency than NRZ
  - Reuse 56Gb/s learning & IPs

\* [6] ISSCC 2018 Trends; \*\* [1-5]
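
As a rough check on the spectral-efficiency point above, the following sketch computes the symbol rate and Nyquist frequency for NRZ and PAM-4 at 112Gb/s. It assumes ideal M-level PAM carrying log2(M) bits per symbol; the function name is illustrative and not from the slides.

```python
import math

def symbol_rate_and_nyquist(bit_rate_bps, levels):
    """Return (baud, nyquist_hz) for an ideal M-level PAM signal."""
    bits_per_symbol = math.log2(levels)
    baud = bit_rate_bps / bits_per_symbol
    return baud, baud / 2

nrz = symbol_rate_and_nyquist(112e9, 2)   # 1 bit per symbol
pam4 = symbol_rate_and_nyquist(112e9, 4)  # 2 bits per symbol

print(f"NRZ  : {nrz[0] / 1e9:.0f} GBaud, Nyquist {nrz[1] / 1e9:.0f} GHz")
print(f"PAM-4: {pam4[0] / 1e9:.0f} GBaud, Nyquist {pam4[1] / 1e9:.0f} GHz")
```

PAM-4 halves the Nyquist frequency relative to NRZ at the same bit rate, which is where the 28GHz figures later in the deck come from.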

# More recent 100+ Gb/s TX & RX publications

- Trending toward 7/10nm nodes



| Data rate<br>(Gb/s) | PDK<br>(FinFET) | Baud-rate CDR:<br>PLL | Baud-rate CDR:<br>PI | RX AFE:<br>ADC | RX AFE:<br>Analog | Quarter-rate TX:<br>Pad drv | Quarter-rate TX:<br>FIR | Reference | Notes |
|---------------------|-----------------|-----|-----|-----|--------|---------|-------------|----------------------|--------------|
| 112                 | 7nm             |     |     |     |        |         |             | Rambus, ISSCC 2020   | Paper 6.3    |
| 112                 | 7nm             |     |     |     |        |         |             | MediaTek, ISSCC 2020 | Paper 6.2    |
| 112                 | 7nm             |     | v   | v   |        | CML     | 4-tap       | Xilinx, ISSCC 2020   | (this work)  |
| 112                 | 10nm            | v   |     | v   |        |         |             | Intel, VLSI 2019     |              |
| 128                 | 14nm            |     |     |     |        | CML     | 3-tap       | IBM, ISSCC 2019      |              |
| 106                 | 16nm            | v   |     | v   |        | CML     | (DAC)       | Inphi, ISSCC 2019    | optical link |
| 100                 | 14nm            |     | v   |     | v      |         |             | IBM, ISSCC 2019      | PR (1+0.5D)  |
| 112                 | 16nm            |     | v   | v   |        | SST     | 4-tap       | Xilinx, VLSI 2018    |              |
| 112                 | 10nm            |     |     |     |        | CML     | 3-tap       | Intel, ISSCC 2018    |              |
| 112                 | 14nm            |     |     |     |        | SST     | 8-tap (DAC) | IBM, ISSCC 2018      | external clk |

# Outline

---

- Background
- **Transceiver architecture**
- Circuit highlights
  - Inverter-based RX AFE & ADC SFE
  - 56GS/s time-interleaved SAR ADC
  - RX clock generation
- Measurement results
- Conclusions
- References

# Transceiver top level



- Shared LC-PLL sends 14GHz clock to 2 TRX channels via supply-regulated CMOS clock buffers\*
- Designed with production specs
  - Yield: PVT tolerance, STA margin
  - EMIR, self-heating, ESD/LU, BTI/HCI/TDDB reliability

\* [9] Im, VLSI 2018

(Shared bias/rcal blocks not shown.)

# TX sub-system



- Quarter-rate architecture
  - Background IQ clock calibration using a replica mux
- 4-tap FFE
  - 2 pre & 1 post
- Tail-less CML driver\*
  - Compatible with power efficient CMOS 4:1 MUX & pre-driver\*\*
  - TX pad node capacitance distributed by series inductor & T-coil

\* [5] Toprak-Deniz, ISSCC 2019; \*\* [4] Tan, VLSI 2018
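
To make the tap arrangement above concrete, here is a minimal sketch of a 4-tap FFE with 2 pre-cursor taps, a main cursor, and 1 post-cursor tap applied to a symbol stream. The tap weights and the function name are hypothetical placeholders, not values from this design; symbols outside the sequence are treated as zero.

```python
def tx_ffe(symbols, taps, n_pre=2):
    """Apply an FIR feed-forward equalizer.

    taps[n_pre] is the main cursor; taps before it weight future symbols
    (pre-cursor) and taps after it weight past symbols (post-cursor).
    """
    out = []
    for n in range(len(symbols)):
        acc = 0.0
        for k, c in enumerate(taps):
            idx = n + n_pre - k  # k < n_pre selects a future symbol
            if 0 <= idx < len(symbols):
                acc += c * symbols[idx]
        out.append(acc)
    return out

# Hypothetical weights: [pre2, pre1, main, post1], with sum(|c|) = 1
taps = [0.02, -0.10, 0.78, -0.10]

# A single unit symbol reproduces the tap weights around its position
impulse = [0, 0, 0, 1, 0, 0]
response = tx_ffe(impulse, taps)
print(response)
```

Keeping the sum of absolute tap weights at 1 is a common way to model a fixed peak output swing; whether this transmitter normalizes its taps that way is not stated in the slides.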

# RX sub-system



- CTLE: 2 stages
  - HF/MF + HF/MF
- PGA: 2 stages
  - PGA2 = TH1 buffer
- ADC: 56GS/s 7b TI-ADC
  - 36 \* 1.56GS/s SAR ADC
- DSP (PnR)
  - ADC cal, FFE, DFE
  - 31-tap FFE
  - Speculative h1 DFE
- Baud-rate CDR
  - PI-based
- Off-chip FPGA controller
  - Background adaptation & calibration

# Outline

---

- Background
- Transceiver architecture
- **Circuit highlights**
  - **Inverter-based RX AFE & ADC SFE**
  - 56GS/s time-interleaved SAR ADC
  - RX clock generation
- Measurement results
- Conclusions
- References

# Inverter-based RX AFE and ADC SFE stages

- Area and power efficient in advanced PDK as demonstrated in [8]\*
  - Gm/Gml ratio-metric nature reduces PVT sensitivity
  - Linearity is not a limiting factor in the whole link performance

Each box (stage) implemented as inverter (Gm) driving diode load (Gml)

Unit Gm/inverse-Gm-load cell with hybrid active-passive peaking
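
The ratio-metric gain noted above can be written as a first-order small-signal sketch. It assumes an inverter of transconductance $G_m$ driving a diode-connected load of transconductance $G_{ml}$, and lumps the output conductances of both cells into $g_o$ (a symbol introduced here for illustration, not taken from the slides):

$$
|A_v| \approx \frac{G_m}{G_{ml} + g_o} \approx \frac{G_m}{G_{ml}} \quad \text{for } G_{ml} \gg g_o
$$

If driver and load use the same unit devices at the same bias, the ratio reduces to a device-size ratio, so process, voltage, and temperature shifts cancel to first order.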



\* No series inductor for TH1 & TH2 stages



\* [8] Zheng, VLSI 2018

# CTLE implementation (single stage shown)

- Design based on [8]\*

[8] 56Gb/s PAM-4 in 16nm



- Area reduction: triode devices used as resistor; LF Cz reduced by inserting a series resistor
- 28GHz peaking: passive series inductor added

This work: 112Gb/s PAM-4 in 7nm



\* [8] Zheng, VLSI 2018

(\*\* Excluding series inductor area)

# PGA and ADC-SFE implementation

- Design based on [8]\*

## [8] 56Gb/s PAM-4 in 16nm

- 4x8 time interleaving
- Source follower TH2 buffer
- bootstrapped TH2 switch



- All CMOS implementation: source follower & bootstrap switches → inverter-based buffer & CMOS PG
- BW extension: unit cell layout & FO optimization

## This work: 112Gb/s PAM-4 in 7nm

- 6x6 time interleaving
- Inverter-based TH buffers
- CMOS PG switches



\* [8] Zheng, VLSI 2018

# 7nm CMOS CTLE layout example

- Single CTLE stage = 30µm × 15µm
  - CMOS devices only
    - No TiN resistor, bias circuits, or CMFB
  - Uniform min PO pitch & fin grid
  - Includes everything (Gm, Gml, Rf, Rz, Cz, test circuit, & dummies) but passive inductor



\* [7] Chen, ISSCC 2018; [11] Chang, ISSCC Forum 2018

# Outline

---

- Background
- Transceiver architecture
- **Circuit highlights**
  - Inverter-based RX AFE & ADC SFE
  - **56GS/s time-interleaved SAR ADC**
  - RX clock generation
- Measurement results
- Conclusions
- References

# ADC: time interleave (TI) architecture selection

- TI architecture determines the overall area and power efficiency
- Assume 2-rank interleaving: $N \times M$
- Limiting factors for TI-ADC
  - Sub-ADC sampling speed: $F_s = (56\text{GS/s})/(N \times M)$
    - Technology dependent
    - SAR ADC: also considering metastability effect on BER\*
  - Rank1 sampler speed & settling BW
  - Clock generation & distribution complexity
  - Power/area cost



\* [10] A. Yu, D. Bankman, K. Zheng and B. Murmann, "Understanding Metastability in SAR ADCs: Part II: Asynchronous," IEEE Solid-State Circuits Magazine, vol. 11, no. 3, 2019

# ADC: TI architecture selection

## Design tradeoffs for 4x8, 8x8, & 6x6 architectures

|                         | 4x8                                                                                                                                                               | 8x8*                                                                                                                                                  | 6x6 [this work]                                                                                                                              |
|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------|
| Total # of sub-ADCs     | 32                                                                                                                                                                | 64                                                                                                                                                    | 36                                                                                                                                           |
| Sub-ADC sampling speed  | 1.75GS/s                                                                                                                                                          | 875MS/s                                                                                                                                               | 1.56GS/s                                                                                                                                     |
| Rank1 sampler frequency | 14GHz                                                                                                                                                             | 7GHz                                                                                                                                                  | 9.33GHz                                                                                                                                      |
| Rank1 settling BW (7b)  | ~28GHz                                                                                                                                                            | ~13GHz                                                                                                                                                | ~18GHz                                                                                                                                       |
| # of T/H amplifiers     | 2 + 4                                                                                                                                                             | 4 + 8                                                                                                                                                 | 3 + 6                                                                                                                                        |
| Area penalty            | Small                                                                                                                                                             | Large                                                                                                                                                 | Medium                                                                                                                                       |
| Clock distribution      | Easy                                                                                                                                                              | Difficult                                                                                                                                             | Moderate                                                                                                                                     |
| Challenges              | Sub-ADC & rank1 speed<br>Rank1 settling BW requirement | Area & power penalty<br>Clock distribution | Manageable area & power penalty (compared to 4x8) |
| Advantages              | Compact layout<br>Easy clocking | Relaxed BW & timing requirement | Good trade-off between process $f_T$ & design margin |

\* [2] Hudner, VLSI 2018
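
The first three rows of the table follow directly from the $N \times M$ split of a 56GS/s aggregate rate. A minimal sketch, assuming $N$ is the rank1 interleaving factor and $M$ the number of sub-ADCs per rank1 sampler (the function and key names are illustrative):

```python
def ti_plan(n_rank1, m_rank2, fs_total=56e9):
    """Derive basic rates for a 2-rank N x M time-interleaved ADC."""
    n_sub = n_rank1 * m_rank2
    return {
        "sub_adcs": n_sub,
        "sub_adc_rate": fs_total / n_sub,    # samples/s per sub-ADC
        "rank1_clock": fs_total / n_rank1,   # Hz per rank1 sampler
    }

for n, m in [(4, 8), (8, 8), (6, 6)]:
    p = ti_plan(n, m)
    print(f"{n}x{m}: {p['sub_adcs']} sub-ADCs, "
          f"{p['sub_adc_rate'] / 1e9:.3f} GS/s each, "
          f"rank1 clock {p['rank1_clock'] / 1e9:.2f} GHz")
```

The settling-bandwidth, area, and clocking rows are design judgments from the slides and are not derived here.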

# Outline

---

- Background
- Transceiver architecture
- **Circuit highlights**
  - Inverter-based RX AFE & ADC SFE
  - 56GS/s time-interleaved SAR ADC
  - **RX clock generation**
- Measurement results
- Conclusions
- References

# RX clock generation



- RX clock generation
  - PLL: 14GHz diff
  - ILO: 14GHz 8 phases
  - PI: 14GHz diff
  - DIV6: 2.33GHz 4 phases
  - MILO (multiplying ILO): 9.33GHz 6 phases
- ADC clocks
  - Rank1 clock = 9.33GHz
  - Rank2 clock = 1.56GHz
  - TH1 tracking time = 2.4UI @112Gb/s
  - TH2 tracking time = 4UI
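
The clock frequencies listed above are related by simple ratios. The sketch below walks the chain from the 14GHz PLL output to the rank2 clock. The MILO multiplication factor of 4 is an assumption inferred from 9.33GHz / 2.33GHz, and the rank2 division by 6 is taken from the 6x6 interleaving; variable names are illustrative.

```python
F_PLL = 14e9   # Hz, shared LC-PLL output
BAUD = 56e9    # symbols/s for 112Gb/s PAM-4

f_div6 = F_PLL / 6             # DIV6 output
MILO_MULT = 4                  # assumption: inferred from 9.33GHz / 2.33GHz
f_rank1 = f_div6 * MILO_MULT   # rank1 sampling clock
f_rank2 = f_rank1 / 6          # rank2 clock, one per sub-ADC

ui_per_rank1_period = BAUD / f_rank1

print(f"DIV6 : {f_div6 / 1e9:.2f} GHz")
print(f"Rank1: {f_rank1 / 1e9:.2f} GHz ({ui_per_rank1_period:.1f} UI per period)")
print(f"Rank2: {f_rank2 / 1e9:.2f} GHz")
```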

# RX clock generation: timing diagram



- DIV6 input duty cycle error translates to output IQ error
- Sense IQ error & digitally correct DCD
  
- Multiplying ILO (MILO)
  - Analog phase alignment loop
  - ADC clkgen corrects residual skew



(\* Only cycle-wise accurate, not in delay, edge rate, duty cycle, or overlaps/non-overlaps)

# RX clock generation: DIV6 & MILO



# ADC clock generation

- Rank1 clock: MILO output → deskew → DCC (40% duty cycle at 112Gb/s)
- Rank2 clock: Rank1 clk (deskewed; no DCC) → Div6 (rotating 6UI-long pulse) → Trim by 2UI
- Timing diagram:\*



(\* Only cycle-wise accurate, not in delay, edge rate, duty cycle, or overlaps/non-overlaps)
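
The tracking times quoted for TH1 and TH2 can be recovered from the duty cycle and pulse trimming described above. This is a sketch that assumes the track phase equals the clock high time, that the rank1 period is 6UI (56GBaud / 9.33GHz), and that the rank2 pulse is 6UI trimmed by 2UI; variable names are illustrative.

```python
UI_PS = 1e12 / 56e9            # one unit interval in picoseconds at 56GBaud

rank1_period_ui = 6            # 56GBaud / 9.33GHz
rank1_duty = 0.40              # duty cycle after DCC
th1_track_ui = rank1_period_ui * rank1_duty

rank2_pulse_ui = 6             # rotating 6UI-long pulse from Div6
rank2_trim_ui = 2
th2_track_ui = rank2_pulse_ui - rank2_trim_ui

print(f"TH1 tracking: {th1_track_ui:.1f} UI = {th1_track_ui * UI_PS:.1f} ps")
print(f"TH2 tracking: {th2_track_ui} UI = {th2_track_ui * UI_PS:.1f} ps")
```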

# ADC clock generation: circuit implementation

- 3 copies of 2x groups for rank1 clock conditioning & rank2 clock generation



# Outline

---

- Background
- Transceiver architecture
- Circuit highlights
  - Inverter-based RX AFE & ADC SFE
  - 56GS/s time-interleaved SAR ADC
  - RX clock generation
- **Measurement results**
- Conclusions
- References

# (Visit DS1 for live demonstration)



# Bench test setup: (TX to RX link test example)

## 37.5dB die-to-die (LR)



**TX → RX Link via LR (37.5 dB die-to-die) Channel:**  
20mm package traces,  
2\*1.5" PCB trace, 4 SMK connectors, DC-blocking capacitor, 2\*12" HUBER+SUHNER cables;  
2 break-out boards; 0.8m cable w/ sockets



- PLL ref clk source
- DUT
  - Testchip BGA package
- FPGA controller board
- Eye-scan, BER sweep (on-die PRBS error checker)
- Closed loop calibrations & adaptations (ADC OS, gain, rank1 clk phase, CTLE peaking, PGA gain, BLW/OS, DSP slicer levels, FFE, & DFE coeffs)
- (Visit DS1 for live demonstration)

# AFE TF measurement vs. simulation

- CTLE1+CTLE2 aggregate TF (w.r.t. min peaking code TF)



- 17.5dB max peaking at 30GHz
- Good correlation between silicon & sim
  - <~0.5dB mismatch from DC to 30GHz
- Supports **model accuracy & sim methodology for CMOS CTLE at high freq (device, layout parasitic, field solver)**

# AFE TF measurement vs. simulation

- PGA1+PGA2 aggregate TF (w.r.t. min gain code TF)



- 10.5dB gain tuning range
- Good correlation between silicon & sim
  - <~0.5dB mismatch
  - Small mismatch may be due to slower process corner

# TX eyes: 112Gb/s PAM-4

PRBS-7



PRBS-31



- TX output to Keysight DSA-X 96204Q
  - 2" Megtron-6 PCB trace & 24" HUBER+SUHNER cable; using built-in equalization of oscilloscope
- Max swing = 1V diff-pp; RJ = 140fs-rms (clk pattern)

# RX performance: 112Gb/s PRBS-31 over LR (37.5dB)



- RX eye
  - Y-axis = slicer input; X-axis = sampling phase (PI code)
  - 8K PAM-4 symbols
- Histogram
  - Clear 4-level PAM-4 symbols at CDR lock position



- BER bathtub by sweeping PI code around CDR lock position
  - CDR loop disabled
  - Consistent with live BER (→)



# JTOL and X-talk noise performance



OIF-CEI-05.0 mask in dashed line

\* OIF-CEI-05.0 pre-FEC BER specification.



Async 56Gb/s NRZ from BERT as an aggressor

# Outline

---

- Background
- Transceiver architecture
- Circuit highlights
  - Inverter-based RX AFE & ADC SFE
  - 56GS/s time-interleaved SAR ADC
  - RX clock generation
- Measurement results
- **Conclusions**
- References

# Die micrograph and power breakdown



| Supply | Usages                                                                           |
|--------|----------------------------------------------------------------------------------|
| 0.88V  | Digital (SAR ADC, TX serializer/FIR/pre-DRV)                                     |
| 1.2V   | RX AFE (CTLE & PGA), ADC SFE (T/H buffers), regulated CMOS HS clock, ADC Vrefgen |
| 1.5V   | TX pad driver, regulator OTA                                                     |

(\* Not including the DSP power)

602mW/ch @112Gb/s (37.5dB)\*



- ADC slice active power
  - 2.72mW @1.56GS/s (measured)
  - including slicer, SAR logic, CDAC, Vref, & retimers
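
The measured slice power can be converted into energy per conversion and an aggregate figure for all 36 slices. A small sketch of that arithmetic, using the exact slice rate 56GS/s / 36 rather than the rounded 1.56GS/s (variable names are illustrative):

```python
P_SLICE_W = 2.72e-3   # measured active power per SAR ADC slice
N_SLICES = 36
FS_TOTAL = 56e9       # aggregate sample rate, samples/s

fs_slice = FS_TOTAL / N_SLICES
energy_per_sample_pj = P_SLICE_W / fs_slice * 1e12
total_slice_power_mw = P_SLICE_W * N_SLICES * 1e3

print(f"Per-slice rate      : {fs_slice / 1e9:.3f} GS/s")
print(f"Energy per sample   : {energy_per_sample_pj:.2f} pJ")
print(f"All 36 slices active: {total_slice_power_mw:.1f} mW")
```

This covers only the slice blocks listed above (slicer, SAR logic, CDAC, Vref, retimers), not the sampling front-end or clocking.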

# Performance summary

|                          | This work                                            | Ref [1]                         | Ref [2][4]                                          | Ref [5]                       | Ref [3]                       |
|--------------------------|------------------------------------------------------|---------------------------------|-----------------------------------------------------|-------------------------------|-------------------------------|
| Technology               | 7nm FinFET                                           | 10nm CMOS                       | 16nm FinFET                                         | 14nm FinFET                   | 10nm FinFET                   |
| Data Rate                | 112Gb/s                                              | 112Gb/s                         | 112Gb/s                                             | 128Gb/s                       | 112Gb/s                       |
| TX Architecture          | Quarter-rate CML<br>4-Tap FFE                        | -                               | Quarter-rate SST<br>4-Tap FFE                       | Quarter-rate CML<br>3-Tap FFE | Quarter-rate CML<br>3-Tap FFE |
| RX Equalization          | CTLE<br>31-Tap FFE<br>1-Tap DFE                      | CTLE<br>16-Tap FFE<br>1-Tap DFE | CTLE<br>31-Tap FFE<br>1-Tap DFE                     | -                             | -                             |
| Active Area (excl. DSP)  | TX: 0.136mm<sup>2</sup><br>RX: 0.265mm<sup>2</sup> | RX: 0.281mm<sup>2</sup>        | TX: 0.38mm<sup>2</sup><br>RX: 0.674mm<sup>2</sup> | TX: 0.048mm<sup>2</sup>      | TX: 0.03mm<sup>2</sup>       |
| TX Power (excl. PLL)     | 1.34pJ/b                                             | -                               | 3.08pJ/b                                            | 1.0pJ/b (112Gb/s)             | 1.72pJ/b                      |
| RX Power (excl. DSP)     | ADC: 4.8pJ/S<br>RX: 3.6pJ/b*                         | ADC: 6.71pJ/S<br>RX: 4.2pJ/b    | ADC: 8.48pJ/S<br>RX: 5.27pJ/b                       | -                             | -                             |
| PLL and Clk Distribution | 0.44pJ/b                                             | -                               | 1.47pJ/b                                            | -                             | -                             |
| Supplies                 | 0.88V/1.2V/1.5V                                      | 0.95V/1.5V                      | 0.9V/1.2V/1.8V                                      | 0.85V/0.9V                    | 1.0V                          |
| Channel                  | 37.5dB                                               | 35dB (BERT)                     | 20dB (BERT)                                         | -                             | -                             |
| BER (PRBS31)             | 1E-8                                                 | 1E-6                            | 2E-5                                                | -                             | -                             |

\* Including contribution from shared channel clock blocks.



- 60% area reduction and 40% power reduction compared to the previous 16nm transceiver [2][4]
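
The summary figures can be cross-checked against the table entries. The sketch below sums TX, RX, and clocking energy per bit and the TX and RX areas for this work and for the 16nm design [2][4]. Which blocks the slides include in their own comparison is not stated, so the simple sums here are an assumption.

```python
this_work = {"tx_area": 0.136, "rx_area": 0.265,
             "tx_pj": 1.34, "rx_pj": 3.6, "clk_pj": 0.44}
prev_16nm = {"tx_area": 0.38, "rx_area": 0.674,
             "tx_pj": 3.08, "rx_pj": 5.27, "clk_pj": 1.47}

def totals(d):
    """Return (area in mm^2, energy in pJ/b) summed over TX, RX, and clocking."""
    return d["tx_area"] + d["rx_area"], d["tx_pj"] + d["rx_pj"] + d["clk_pj"]

area_new, pj_new = totals(this_work)
area_old, pj_old = totals(prev_16nm)

power_mw = pj_new * 112            # pJ/b * Gb/s = mW
area_reduction = 1 - area_new / area_old
power_reduction = 1 - pj_new / pj_old

print(f"Power at 112Gb/s: {power_mw:.0f} mW")
print(f"Area reduction  : {area_reduction:.0%}")
print(f"Power reduction : {power_reduction:.0%}")
```

Under these assumptions the per-channel power agrees with the quoted 602mW/ch, and both reductions come out slightly above the rounded 60% and 40%.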

# Conclusions

---

- Long-reach capable 112Gb/s PAM-4 transceiver implemented in 7nm FinFET CMOS PDK
- All CMOS RX analog front-end & ADC sampling front-end
  - Area and power efficient
  - Simpler circuit topologies (no T/H bootstrap, current mirrors, or CMFB)
  - Less layout time → easier PDK porting
- 56GS/s 36-way time-interleaved SAR ADC
  - Max SAR ADC speed for 7nm PDK → optimal SFE & clocking solution

# Acknowledgement

---

- The authors thank Xilinx SERDES design, layout, and verification teams for having made this project successful.

# References

---

- [1] Y. Krupnik et al., "112 Gb/s PAM4 ADC Based SERDES Receiver for Long-Reach Channels in 10nm Process," IEEE Symp. VLSI Circuits, pp. 266–267, June 2019.
- [2] J. Hudner et al., "A 112Gb/s PAM4 Wireline Receiver using a 64-way Time-Interleaved SAR ADC in 16nm FinFET," IEEE Symp. VLSI Circuits, pp. 47–48, June 2018.
- [3] J. Kim et al., "A 112Gb/s PAM-4 Transmitter with 3-Tap FFE in 10nm CMOS," ISSCC, pp. 102–103, Feb. 2018.
- [4] K. Tan et al., "A 112-Gb/s PAM4 Transmitter in 16nm FinFET," IEEE Symp. VLSI Circuits, pp. 45–46, June 2018.
- [5] Z. Toprak-Deniz et al., "A 128Gb/s 1.3pJ/b PAM-4 Transmitter with Reconfigurable 3-Tap FFE in 14nm CMOS," ISSCC, pp. 122–123, Feb. 2019.
- [6] "Through the Looking Glass-The 2018 Edition: Trends in solid-state circuits from the 65th ISSCC," IEEE Solid-State Circuits Magazine, Winter 2018.
- [7] S. Chen et al., "A 4-to-16GHz Inverter-Based Injection-Locked Quadrature Clock Generator with Phase Interpolators for Multi-Standard I/Os in 7nm FinFET," ISSCC, pp. 390–391, Feb. 2018.
- [8] K. Zheng et al., "An Inverter-based Analog Front End for a 56 Gb/s PAM4 Wireline Transceiver in 16nm CMOS," IEEE Symp. VLSI Circuits, pp. 269–270, June 2018.
- [9] J. Im et al., "A 0.5-28Gb/s Wireline Transceiver with 15-Tap DFE and Fast-Locking Digital CDR in 7nm FinFET," IEEE Symp. VLSI Circuits, pp. 145–146, 2018.
- [10] A. Yu, D. Bankman, K. Zheng and B. Murmann, "Understanding Metastability in SAR ADCs: Part II: Asynchronous," IEEE Solid-State Circuits Magazine, vol. 11, no. 3, pp. 16–32, Summer 2019.
- [11] M. Erett et al., "A 126mW 56Gb/s NRZ Wireline Transceiver for Synchronous Short-Reach Applications in 16nm FinFET," ISSCC, pp. 274–275, Feb. 2018.
- [12] "Through the Looking Glass-2020 Edition: Trends in Solid-State Circuits From ISSCC," IEEE Solid-State Circuits Magazine, Spring 2020.

# A 460mW 112Gbps DSP-Based Transceiver with 38dB Loss Compensation for Next Generation Data Centers in 7nm FinFET technology

Tamer Ali, Ehung Chen, Henry Park, Ramy Yousry, Yu-Ming Ying,  
Mohammed Abdullatif, Miguel Gandara, Chun-Cheng Liu, Po-Shuan Weng,  
Huan-Sheng Chen, Mohammad Elbadry, Qaiser Nehal, Kun-Hung Tsai,  
Kevin Tan, Yi-Chieh Huang, Chung-Hsien Tsai, Yuyun Chang, Yuan-Hao Tung



# Outline

- Motivation
- Receiver Blocks
- Transmitter Blocks
- Testing and Performance
- Conclusion

# Continued Growth inside Data Center



Source: Cisco Global Cloud Index, 2016-2021

- 25% Compound Annual Growth rate
- Traffic inside the data center to reach 20.6 ZB
- ~70% of data traffic resides inside the data center

# Wireline Growth Challenges



- State of the art 12.8Tbps/25Tbps switches w/ packages  $\sim 8.5 \times 8.5 \text{ cm}^2$
- Total power approaching package thermal capacity
- Total throughput scales much faster than package size...4X

# The Case for 112G DSP Based Transceiver

- Up to ~40dB ch loss → Large # of FFE taps + 1 DFE
- Discontinuity in ~30mm pkg traces → Sliding taps for reflection cancellation
- Technology scaling → ~50% reduction in DSP power & area
- Wide range of V & T variation → Stable DSP performance (<10X BER drop)

# Transceiver Overview



# Outline

- Motivation
- System Overview
- **Receiver Blocks**
- Transmitter Blocks
- Testing and Performance
- Conclusion

# Rx Block Diagram

- AFE: CTLE and VGA
  - 8dB peaking & 8dB gain range
  - CTLE and VGA are calibrated to maximize SNR



Enough to relax ADC resolution/noise to 7bits with moderate RXFE power



# Rx Block Diagram

- 56GS/s time-interleaving (TI) receiver
  - Two levels time interleaving architecture
  - 8-way T/Hs  $\times$  7 SAR ADCs  $\rightarrow$  56 SAR ADCs
- 8-way T/H is lowest possible TI to simplify clock routing & skew calibration
  - Limited by T/H clock quality



- 7 SARs/channel: Lowest # of SARs to minimize area without high noise penalty



# Rx Block Diagram (Contd.)

- GM-TIA cells for CTLE & VGA
- VGA before CTLE to improve noise
- Each buffer drives one alternate T/H phase:
  - Minimize loading → Lower power
  - Reduces kickback on following phase
- 1GSps Async. SAR with double tail comparator similar to [1]



6.2: A 460mW 112Gbps DSP-Based Transceiver with 38dB Loss Compensation for Next Generation Data Centers in 7nm FinFET technology

# SAR ADC Slice Timing



- Clk\_r\_n resets the SAR CDAC after convergence → Relaxed buffer settling → Lower power

# Reference Buffer

- Use 1 ref. buffer per 28Gsps for better power efficiency
- Dominant pole at buffer output
  - Better stability & Reduced ripple voltage
- Star connection to optimize routing resistance



# Reference Buffer

- Main loop for DC precision
- Flipped source follower loop for HF performance
- Diode connected loop for PVT stability
  - FSF loop gain ($LG = gm_2/gm_{2f}$)
  - Tight PM control over PVT
- Voltage shift FB resistor to keep M1&M1f in saturation



# DSP Block Diagram : CQE

- Channel Quality Estimator (CQE) calculates the channel pulse response at bring-up
- CTLE gear is chosen accordingly to enable CDR initial training.
- CQE also performs equalizer (EQ) initial coarse adjustment



# DSP Block Diagram: CDR

- Baud rate CDR for lower power
- CDR loop bypasses equalizer to reduce latency and achieve 4MHz BW
- FIR filter restores pulse shape to improve decision quality
- Channel Loss Estimator (CLE) uses CQE to adjust CDR BW



# Channel Loss Estimator (CLE)



- CDR BW depends on gain of TED (Timing Error Detector) of MM CDR
  - TED gain depends on channel loss
- Relationship between CDR BW & CLE transfer function is based on statistics sims. & verified by silicon
- TED gain variation: $-0.37 \sim -0.92$ (2.49×) → $-0.38 \sim -0.59$ (1.55×)
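
To illustrate why the TED gain moves with the channel, the sketch below uses the textbook Mueller-Müller type-A detector, whose average output is proportional to $h(\tau + T) - h(\tau - T)$ for pulse response $h$ and timing offset $\tau$. The Gaussian pulse shapes and all names are assumptions made for illustration; they are not the channels or the detector implementation of this design.

```python
import math

def mm_ted(y_now, y_prev, d_now, d_prev):
    """Mueller-Muller type-A timing error for one pair of adjacent symbols."""
    return y_now * d_prev - y_prev * d_now

def timing_function(pulse, tau, ui=1.0):
    """Average mm_ted output per unit symbol power, for uncorrelated data
    and correct decisions: h(tau + T) - h(tau - T)."""
    return pulse(tau + ui) - pulse(tau - ui)

def ted_gain(pulse, ui=1.0, step=1e-4):
    """Slope of the timing function at tau = 0 (central difference)."""
    hi = timing_function(pulse, step, ui)
    lo = timing_function(pulse, -step, ui)
    return (hi - lo) / (2 * step)

def gaussian_pulse(sigma_ui):
    """Hypothetical symmetric pulse response; wider sigma = more dispersion."""
    return lambda t: math.exp(-t * t / (2 * sigma_ui ** 2))

gains = {s: ted_gain(gaussian_pulse(s)) for s in (0.6, 1.0, 2.0)}
for sigma, g in gains.items():
    print(f"sigma = {sigma} UI -> TED gain = {g:+.3f} per UI")
```

The gain is negative for these pulses and its magnitude changes with pulse width, so a fixed loop filter would see a different CDR bandwidth on each channel unless the gain is compensated.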

# DSP Block Diagram: EQ

- Programmable 8-24 taps FFE with 1-tap loop unrolled DFE
- FIR factorization



- Find common factors between parallel paths
- Fewer multipliers (& more adders) needed by calculating common factors
- Power ↓ 30%
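
The 1-tap loop-unrolled DFE mentioned above can be sketched as follows: instead of waiting for the previous decision before subtracting its ISI, the equalizer slices each sample once per possible previous symbol and then selects. This is a generic behavioral model with a hypothetical tap weight `h1` and ideal PAM-4 levels, not the implementation of this design.

```python
PAM4_LEVELS = (-3, -1, 1, 3)

def slice_pam4(x):
    """Return the PAM-4 level nearest to x."""
    return min(PAM4_LEVELS, key=lambda level: abs(x - level))

def dfe_direct(samples, h1, d_init=-3):
    """Reference 1-tap DFE: feedback subtraction sits inside the decision loop."""
    prev, decisions = d_init, []
    for y in samples:
        prev = slice_pam4(y - h1 * prev)
        decisions.append(prev)
    return decisions

def dfe_unrolled(samples, h1, d_init=-3):
    """Loop-unrolled 1-tap DFE: speculate on each previous symbol, then select."""
    # Stage 1: no feedback, so every sample can be processed in parallel
    speculative = [{p: slice_pam4(y - h1 * p) for p in PAM4_LEVELS}
                   for y in samples]
    # Stage 2: only a 4:1 selection remains in the feedback path
    prev, decisions = d_init, []
    for candidates in speculative:
        prev = candidates[prev]
        decisions.append(prev)
    return decisions

# Noise-free channel with one post-cursor, h1 is a hypothetical value
tx = [3, -1, 1, -3, 3, 3, -1]
h1 = 0.4
rx, prev = [], -3
for d in tx:
    rx.append(d + h1 * prev)
    prev = d

print(dfe_direct(rx, h1) == tx, dfe_unrolled(rx, h1) == tx)
```

Both functions make the same decisions; the unrolled form trades four slicers per sample for a shorter critical path.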

# Outline

- Motivation
- System Overview
- Receiver Blocks
- **Transmitter Blocks**
- Testing and Performance
- Conclusion

# TX Block Diagram



- SST 7-bit DAC with DSP Pre-emphasis for configurability
- Quarter rate architecture to simplify clocking
- DAC sized to balance power and DAC linearity
- DAC supply is 1.2V through LDO; rest of TX is powered by 0.85V

# Skew calibration & Timing Diagram

- Pulse Gen. produces 25% duty-cycle quadrature clocks at 14GHz
- Replica 4:1 MUX used to calibrate 4 phases skew, 2 at a time:  $(\Phi_0, \Phi_{180})$ ,  $(\Phi_{90}, \Phi_{270})$ , and  $(\Phi_0, \Phi_{90})$
- At end of Reset phase, D0 & D180 polarity changes  $\rightarrow$  2X amplification
- Amplifier outputs error sign  $\rightarrow$  Digital loop provides correction values to clock buffers with cap. DAC



# Outline

- Motivation
- System Overview
- Receiver Blocks
- Transmitter Blocks
- Testing and Performance
- Conclusion

# Implementation & Die Photo

- Implemented in TSMC N7
- Quad lane & a common lane that comprises 3 PLLs, biasing, and debug functions
- Analog occupies  $0.385\text{mm}^2/\text{lane}$ 
  - 20% larger than the 56Gbps design [1]
- DSP has:
  - Tx: 6-tap pre-emphasis
  - Rx: programmable 8-24 taps FFE with 1-tap loop unrolled DFE
  - Baud rate timing recovery
  - Background adaptive equalization & AGC




# Measurement & Test Setup



- Large ASIC Package → ~5dB IL
- TX eye is measured at TP0a
- ISI board to add loss
- Neighboring TX/RX lanes are in loopback to act as NEXT/FEXT aggressors

# Tx Eye Diagram



Package+PCB loss is ~5dB, FIR applied to open eye  
36dB SNDR & 99% RLM

# TX DAC Performance



- THD limits TX DAC performance at low freq.
- Beyond 15GHz quadrature phase skew limits performance
  - ~200fs phase skew mismatch → 4.2ENOB at 28GHz

# Rx ADC Measurements

THD=51.40dB SNR=34.37dB SNDR=34.29dB ENOB=5.40b



THD=40.34dB SNR=29.94dB SNDR=29.56dB ENOB=4.62b



- 34.3dB SNDR (5.4 ENOB) at DC
  - 2X BW with < 3dB SNR degradation relative to [1] (5.7 ENOB at 56Gbps)
- Nyquist performance is 29.6dB (4.6 ENOB)
- Offset, gain, and skew are background calibrated to ~7bits
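
The ENOB figures above follow from the standard relation ENOB = (SNDR − 1.76dB) / 6.02dB. A short check, which also converts the 5.7 ENOB of the earlier 56Gbps design back to SNDR for comparison (function names are illustrative):

```python
def enob_from_sndr(sndr_db):
    """Effective number of bits from SNDR in dB."""
    return (sndr_db - 1.76) / 6.02

def sndr_from_enob(enob_bits):
    """SNDR in dB from effective number of bits."""
    return enob_bits * 6.02 + 1.76

enob_dc = enob_from_sndr(34.29)
enob_nyquist = enob_from_sndr(29.56)
degradation_db = sndr_from_enob(5.7) - 34.29

print(f"ENOB at DC     : {enob_dc:.2f} b")
print(f"ENOB at Nyquist: {enob_nyquist:.2f} b")
print(f"Low-frequency SNDR drop vs. 5.7 ENOB: {degradation_db:.1f} dB")
```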

# Rx ADC Measurements ...Contd



- SNDR is limited by noise
- THD improves >10GHz due to 3<sup>rd</sup> harmonic filtering >28GHz
- >15GHz Jitter limits the SNR & SNDR performance
  - Estimated jitter is 150fs
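
The jitter estimate can be sanity-checked with the usual jitter-limited SNR bound for a sine input, SNR = −20·log10(2π·f·σ). The sketch assumes the low-frequency SNR (34.37dB) represents the jitter-independent noise floor and that the two noise sources add in power; both are simplifying assumptions, and the function names are illustrative.

```python
import math

def jitter_snr_db(f_in_hz, sigma_s):
    """SNR limit set by RMS sampling jitter sigma_s for a sine at f_in_hz."""
    return -20 * math.log10(2 * math.pi * f_in_hz * sigma_s)

def combine_snr_db(*snrs_db):
    """Combine independent noise contributions given as SNRs in dB."""
    noise = sum(10 ** (-s / 10) for s in snrs_db)
    return -10 * math.log10(noise)

snr_jitter = jitter_snr_db(28e9, 150e-15)
snr_total = combine_snr_db(34.37, snr_jitter)

print(f"Jitter-limited SNR at 28GHz: {snr_jitter:.2f} dB")
print(f"Combined with 34.37dB floor: {snr_total:.2f} dB")
```

Under these assumptions the combined value lands within a few tenths of a dB of the 29.94dB SNR measured near Nyquist, which is consistent with the 150fs estimate.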

# CTLE Freq. Response




- PRBS Convolution → Pulse response → FT → Freq. response
- Socket introduces 5dB loss
- 8dB peaking at 28GHz after de-embedding socket

# Link & JTOL Measurement



- BER < 5E-7 with 38.9dB IL including 4 NEXT & FEXT aggressors
  - 21dB SNR
- 80mUI high freq. JTOL

# Comparison Table

|                                 | [3][4]                                   | [5]                                     | [6][7]                                             | This Work                                           |
|---------------------------------|------------------------------------------|-----------------------------------------|----------------------------------------------------|-----------------------------------------------------|
| Tech.                           | 16nm                                     | 16nm                                    | 10nm                                               | 7nm                                                 |
| DR (Gb/s)                       | 112                                      | 106                                     | 112                                                | 112                                                 |
| Loss at Nyq.<br>(dB) / BER      | 20 @ BER<2e-5                            | DR4/FR4 optical                         | 35 @ BER=1e-6                                      | 38.9 @ BER <5e-7                                    |
| Power<br>(mW/Lane)              | 935 (analog only)                        | 900                                     | 232 (TX & PLL)<br>470 (RX analog only)             | 460mW (Analog)                                      |
| Area<br>(mm <sup>2</sup> /lane) | 1.06 (analog only)                       | 1.54                                    | 0.0325 (TX only, no PLL)<br>0.281 (RX analog only) | 0.385 (Analog Only)                                 |
| TX Arch.                        | 4-tap FIR                                | CML 7-bit DAC<br>3-Tap FIR              | CML<br>3-Tap FIR                                   | 7-bit DAC<br>6-tap FIR                              |
| RX Arch.                        | Analog: 7-bit<br>ADC/CTLE/VGA<br>DSP:N/A | Analog: 7-bit<br>ADC/VGA<br>DSP: 10 FFE | Analog: 6-bit<br>ADC/CTLE/VGA<br>DSP: 16 FFE/1 DFE | Analog:7-bit ADC<br>CTLE/VGA<br>DSP: 8-24 FFE/1-DFE |

- **4dB more loss and 2X better BER than prior work**
- **35% lower power than prior work**
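
The two headline claims can be checked against the table with simple arithmetic. The sketch below assumes that "prior work" means the [6][7] column and that its TX & PLL and RX analog powers add; both are assumptions made for illustration, and the BER entries are bounds rather than exact values.

```python
# Numbers copied from the comparison table above.
this_work = {"loss_db": 38.9, "ber": 5e-7, "power_mw": 460.0}
# Assumed baseline: column [6][7], with TX & PLL (232 mW) + RX analog (470 mW).
baseline = {"loss_db": 35.0, "ber": 1e-6, "power_mw": 232.0 + 470.0}

def compare(work, ref):
    """Return extra channel loss (dB), BER ratio, and power saving (%)."""
    return {
        "extra_loss_db": work["loss_db"] - ref["loss_db"],
        "ber_ratio": ref["ber"] / work["ber"],
        "power_saving_pct": 100.0 * (1.0 - work["power_mw"] / ref["power_mw"]),
    }

if __name__ == "__main__":
    for name, value in compare(this_work, baseline).items():
        print(f"{name}: {value:.2f}")
```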

# Summary

- We presented a state-of-the-art 112G DSP-based transceiver in N7 FinFET
- Power and area are optimized for next-generation ASICs with high port counts and high total throughput
- The transceiver can equalize up to 40dB of channel loss with worst-case crosstalk aggressors
- The transceiver outperforms prior work by 35% in power, 2X in BER, and 4dB in channel reach

# A 10-to-112Gb/s DSP-DAC-Based Transmitter with 1.2V ppd Output Swing in 7nm FinFET

**E. Groen<sup>1</sup>, C. Boecker<sup>1</sup>, M. Hossain<sup>2</sup>, R. Vu<sup>1</sup>, S. Vamvakos<sup>1</sup>, H. Lin<sup>1</sup>, S. Li<sup>1</sup>, M. van Ierssel<sup>3</sup>, P. Choudhary<sup>1</sup>, N. Wang<sup>1</sup>, M. Shibata<sup>3</sup>, M. Taghavi<sup>3</sup>, N. Nguyen<sup>1,4</sup>, S. Desai<sup>1</sup>**

<sup>1</sup>Rambus, Sunnyvale, California

<sup>2</sup>Univ. of Alberta, Edmonton

<sup>3</sup>Rambus, Toronto

<sup>4</sup>San Jose State Univ., San Jose

# Outline

- Motivation
- Proposed 1.2V ppd driver:
  - Soft switching
  - Source follower pre-driver
  - 2:1 Mux at 112 Gb/s
- Clocking to support multi protocol SerDes
- Measured results
- Comparison and Conclusion

# Motivation



- Target best performance for highest data rate 112 Gb/s.
- Needs to support other data rates and modulation such as PAM-4 and NRZ

# Challenges in conventional Tx Architecture



- Limited flexibility: modulator and equalizer are fixed by design
- SNR Limitation: Transmit swing is set by  $V_{DD}$  ( $\sim 900$  mV @ 7nm)
- Mux speed limitation: Quarter rate mux adds complexity
- High speed clock distribution and jitter targets

# Overall Block Diagram of DSP-DAC Solution



- Flexibility: move equalization to DSP to add flexibility
- Improve SNR by targeting 1.2 V Tx swing
- Flexible low power clock distribution

# Conv. Voltage Mode Driver



Creating 1.2Vppd swing consumes only 6mA Current

# Conv. Voltage Mode Driver



- Termination resistance is divided between device resistance and an explicit resistor, but the device resistance is only a small part of the total.
- OFF devices are exposed past maximum reliable voltage ( $\sim 950$  mV)

# Conv. Current Mode Driver



Current mode driver can accommodate 1.2V ppd swing  
without device over-voltage

# Conv. Current Mode Driver



But the current consumption increases by 4x (24 mA)

# Improve Signaling Efficiency



- Push-pull drivers can improve the efficiency by 2x.
- Shunt termination meets breakdown criteria but reduces efficiency compared with voltage-mode drivers
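
The 6mA, 24mA, and 2x figures on the driver slides follow from standard relations for doubly terminated differential links. The sketch below reproduces them, assuming 50Ω single-ended termination at both ends (100Ω differential); that impedance is an assumption, since the slides do not state it.

```python
def driver_supply_current(vppd, z0=50.0):
    """DC current (A) needed for a given peak-to-peak differential swing.

    Assumes a doubly terminated link with single-ended impedance z0.
    - voltage mode: supply drives z0 + 2*z0 + z0 in series, and Vppd equals
      the supply voltage, so I = Vppd / (4*z0)
    - current mode (CML): each side sees z0 || z0, so Vppd = I * z0
    - push-pull current mode: current reverses through the load, halving I
    """
    return {
        "voltage_mode": vppd / (4.0 * z0),
        "current_mode": vppd / z0,
        "push_pull": vppd / (2.0 * z0),
    }

if __name__ == "__main__":
    for style, amps in driver_supply_current(1.2).items():
        print(f"{style}: {amps * 1e3:.0f} mA")
```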

# 'Soft Switching' Driver



- Soft switching of the driver devices allows elimination of the current source
- Pre-driver swing requirement is relaxed
- Generation of the bias point ' $V_{GSX}$ ' requires careful consideration

# Source Follower Pre-driver



- Source Follower Pre-driver:
  - Generates 400mV pp swing
  - Provides level shift
  - Isolates driver Capacitance
  - BW concern

# Improved Source Follower Pre-driver



- Source Follower Pre-driver:
  - Generates 400mV pp swing
  - Provides level shift
  - Isolates driver Capacitance
  - High frequency content is amplified similar to CS amplifier
  - Avoid using inductor

# Driver, pre-driver and ESD



# 2:1 Mux and Level Shifter



Provides muxing and level shifting from 900mV to 1.2 V domain

# High Speed Signal Path



- The blue color reflects the signal path that can be replicated for gate voltage bias of the driver

# Replica Bias generation



- $M_1, M_2, M_3$  mux replica “on path”
- $I_1, M_4, R_1$  source follower replica
- $M_5 + I_2$  set target “on” gate voltage for output device
  - $M_5$  replica of output device
- Feedback changes mux current to have source follower output match target voltage
- $Vfb$  controls main circuit mux current as well to provide correct output current

# Serializer



- No material on the serializer
- Mostly standard cells, with a few “customized” standard cells

# Digital Signal Processing (DSP)



# Equalization DSP



LUT for each tap

Each LUT has 4 entries corresponding to the PAM4 symbols

4 levels  
Without  
Equalization



# Equalization DSP

Values are + and - stored as  
2's complement, can use a  
simple adder



Add an offset value in all pre2  
entries to re-center to 0-127
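
The LUT-based equalizer described on these slides can be sketched as follows: one 4-entry table per tap, signed entries summed by a plain adder, and an offset folded into the pre2 table so the result lands in the 0-127 DAC range. The tap set, the integer coefficients, and the offset of 64 are hypothetical values chosen for illustration; only the per-tap LUT, the signed sum, and the pre2 offset come from the slides.

```python
PAM4_LEVELS = (-3, -1, 1, 3)                 # symbol index 0..3 -> signed level
TAPS = ("pre2", "pre1", "main", "post1")     # hypothetical tap set
SHIFT = {"pre2": 2, "pre1": 1, "main": 0, "post1": -1}  # symbol offset per tap
COEFFS = {"pre2": 1, "pre1": -2, "main": 16, "post1": -2}  # hypothetical

def build_luts(coeffs, offset=64):
    """One 4-entry LUT per tap; the offset is folded into the pre2 entries."""
    luts = {}
    for tap in TAPS:
        entries = [coeffs[tap] * level for level in PAM4_LEVELS]
        if tap == "pre2":
            entries = [e + offset for e in entries]
        luts[tap] = entries
    return luts

def dac_codes(luts, symbols):
    """Sum the LUT outputs for every symbol that has all its neighbours."""
    codes = []
    for n in range(1, len(symbols) - 2):
        total = sum(luts[tap][symbols[n + SHIFT[tap]]] for tap in TAPS)
        if not 0 <= total <= 127:
            raise ValueError("coefficients overflow the 7-bit DAC range")
        codes.append(total)
    return codes

if __name__ == "__main__":
    print(dac_codes(build_luts(COEFFS), [3, 3, 3, 3, 3]))
```

Because the signed entries can be stored in 2's complement, the sum needs no special handling of negative taps, and the pre2 offset removes the need for a separate re-centering stage.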

# Transmitter Clocking



# Basic Clocking



- Supporting 10 to 112 Gb/s requires a wide VCO tuning range, 22 GHz to 29 GHz:
  - Requires 2 LC VCOs
  - Challenging to achieve High Q
  - With reasonable CDR BW of 4 MHz untracked jitter can be limited to 150 fs

# Clocking with Distribution



- After distribution Jitter becomes difficult to manage:
  - Each CMOS FO2 adds 15 to 20 fs
  - Quadrature correction, DCD adds more jitter
  - Eventually we are noise floor limited

# Clocking with Cascaded PLL



Local Ring PLL can re-set the jitter budget with lower noise floor but it adds ring VCO's phase noise → we need to find a way to lower this.

# Clocking with Cascaded PLL



# Flexible Clocking System



Common and Local PLLs follow same partition

- 1<sup>st</sup> order loop is linear for best performance
- 2<sup>nd</sup> order loop is digital to ease integration/filter requirements

Low frequency distribution clock from a clean reference is multiplied locally to get final frequency

# Clocking – Common PLL and Clock Dist.



Phase noise Profile of distributed clock from LC PLL



# Lane Ring PLL Bandwidth Impact

As bandwidth increases, the noise decreases at low frequencies and increases slightly at high frequencies



# Optimized Phase Noise



# Implementation



| Components                | Power Consumption |
|---------------------------|-------------------|
| Driver/Signalling Power   | 15 mW             |
| Pre-driver                | 17 mW             |
| Level Shifter + 2:1 Mux   | 20 mW             |
| Serializer                | 11 mW             |
| Tx FIR                    | 15 mW             |
| Clock Buffer, DCD         | 40 mW             |
| Tx PLL                    | 45 mW             |
| Bias circuit              | 5 mW              |
| Amortized global clocking | 48 mW/4 = 12mW    |
| <b>Total</b>              | <b>175 mW</b>     |

## Implemented 7nm FinFET Prototype

### Sampling scope



# Measured 28 Gbps NRZ Results



# Measured PAM4 Results



# Performance Comparison

|                         | This Work                                                                           | ISSCC 2018 [1]                            | ISSCC 2018 [2]                                         | VLSI 2016 [3]             |
|-------------------------|-------------------------------------------------------------------------------------|-------------------------------------------|--------------------------------------------------------|---------------------------|
| Technology              | 7nm FinFET                                                                          | 14nm FinFET                               | 10nm FinFET                                            | 16 nm FinFET              |
| Architecture            | DSP-DAC                                                                             | DSP-DAC                                   | Analog                                                 | Analog                    |
| Supported Data Rate     | 10 Gb/s-to-112 Gb/s                                                                 | 112 Gb/s                                  | 56 Gb/s,<br>112 Gb/s                                   | 56 Gb/s,<br>112 Gb/s      |
| Transmit Swing          | 1.2 Vppd                                                                            | 0.92 Vppd                                 | 0.75Vppd                                               | 1 Vppd                    |
| Clocking                | Shared LC-PLL<br>Local SSRPLL                                                       | External                                  | Shared LC<br>Local I/Q gen                             | LC-PLL                    |
| Clock Distribution      | 2.25 GHz to 3.625 GHz<br>Over 2.1 mm                                                | ----                                      | 14 GHz LC PLL<br>510 um                                | ----                      |
| Random Jitter           | LC PLL 130 fs (RMS)<br>Ring PLL 171 fs (RMS)<br>(4 MHz to 23 GHz)                   | ----                                      | 154 fs (RMS)                                           | LC PLL 130 fs (RMS)       |
| Power Consumption (mW)  | Transmitter: 118 mW<br>Tx SSRPLL: 45 mW<br>Shared clocking: 12 mW<br>Total : 175 mW | Total:286 mW                              | Transmitter: 193 mW<br>Clocking:39 mW<br>Total: 232 mW | Total: 345 mW             |
| FoM (pJ/bit)            | 1.05 pJ/bit w/o clocking<br>1.56 pJ/bit with clocking                               | 2.55 pJ/bit w/o clocking                  | 1.72 pJ/bit w/o clocking<br>2.07 pJ/bit w clocking     | 3.08 pJ/bit with clocking |
| Area (mm <sup>2</sup> ) | 0.193 mm <sup>2</sup>                                                               | 0.183 mm <sup>2</sup><br>Without clocking | 0.03 mm <sup>2</sup>                                   | 0.3825 mm <sup>2</sup>    |
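
The FoM row is power divided by data rate, since 1 mW per 1 Gb/s equals 1 pJ/bit. A minimal sketch, assuming all entries are evaluated at 112 Gb/s (the function name is illustrative):

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """mW divided by Gb/s gives pJ/bit."""
    return power_mw / data_rate_gbps

if __name__ == "__main__":
    # Power figures copied from the table above, all taken at 112 Gb/s.
    for label, power_mw in [
        ("This work, w/o clocking", 118),
        ("This work, with clocking", 175),
        ("ISSCC 2018 [1]", 286),
        ("ISSCC 2018 [2], with clocking", 232),
        ("VLSI 2016 [3]", 345),
    ]:
        print(f"{label}: {energy_per_bit_pj(power_mw, 112):.2f} pJ/bit")
```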

# Conclusion

- **Proposed 1.2V ppd driver:**
  - Soft switching
  - Source follower pre-driver
  - 2:1 Mux at 112 Gb/s
- Clocking to support multi protocol SerDes
- Measured results

# A 56Gb/s 7.7mW/Gb/s PAM-4 Wireline Transceiver in 10nm FinFET Using MM-CDR-Based ADC Timing Skew Control and Low-Power DSP with Approximate Multiplier

Byoung-Joo Yoo<sup>1</sup>, Dong-Hyuk Lim<sup>1</sup>, Hyonguk Pang<sup>1</sup>, June-Hee Lee<sup>1</sup>, Seung-Yeob Baek<sup>1</sup>, Nixin Kim<sup>1</sup>, Dong-Ho Choi<sup>1</sup>, Young-Ho Choi<sup>1</sup>, Hyeyeon Yang<sup>1</sup>, Taehun Yoon<sup>1</sup>, Sang-Hyeok Chu<sup>1</sup>, Kangjik Kim<sup>1</sup>, Woochul Jung<sup>1</sup>, Bong-Kyu Kim<sup>1</sup>, Jaechol Lee<sup>1</sup>, Gunil Kang<sup>1</sup>, Sang-Hune Park<sup>1</sup>, Michael Choi<sup>1</sup>, Jongshin Shin<sup>1</sup>, Jaehong Park<sup>1</sup>

<sup>1</sup>Samsung Foundry, Hwaseong, Republic of Korea

# Outline

- Motivations
- 56Gb/s Transceiver Architecture and Implementation
  - ADC-DSP-DAC based Transceiver
  - Transmitter with Gm-Boosted Switches
  - Receiver Front-End with Passive Equalizer
  - DSP-Based ADC Skew Calibration Technique
  - DSP using Approximate Multipliers
- Measurement Results
- Summary and Conclusion

# Outline

- **Motivations**
- 56Gb/s Transceiver Architecture and Implementation
  - ADC-DSP-DAC based Transceiver
  - Transmitter with Gm-Boosted Switches
  - Receiver Front-End with Passive Equalizer
  - DSP-Based ADC Skew Calibration Technique
  - DSP using Approximate Multipliers
- Measurement Results
- Summary and Conclusion

# Motivation for Low-Power Transceivers

- Heavy ADC-DSP based architecture for LR transceivers is common
  - A few hundred transceiver lanes are needed for terabit Ethernet, which complicates the power design
- Such hyper-scale interfaces absolutely require low-power transceivers



6.4: A 56Gb/s 7.7mW/Gb/s PAM-4 Wireline Transceiver in 10nm FinFET using MM-CDR-Based ADC Timing Skew Control and Low-Power DSP with Approximate Multiplier

# Outline

- Motivations
- **56Gb/s Transceiver Architecture and Implementation**
  - ADC-DSP-DAC based Transceiver
  - Transmitter with Gm-Boosted Switches
  - Receiver Front-End with Passive Equalizer
  - DSP-Based ADC Skew Calibration Technique
  - DSP using Approximate Multipliers
- Measurement Results
- Summary and Conclusion

# Overall Transceiver Architecture

- 4-lane transceivers and common LCPLL with dual LC-VCOs
- 7-bit ADC-DSP Rx and 7-bit DAC-DSP Tx
- 1-tap DFE and 9-tap FFE (pre 2 & post 7) with passive equalizer(PEQ) assistance
- Digital dual mode CDR (BB&MM) with 128 steps per 1-UI phase interpolators




# Transmitter with Look-Up Table

- 7-bit DAC-DSP with Look-Up-Table for FIR and PAM
- VM-driver with thermometer encoded 2-bit and binary encoded 5-bit



# Serializer with Cascaded IQ-Dividers

- One re-timer and late differential conversion for low-power serialization
- Evenly distributed multiple data with cascaded two IQ-dividers



# Gm-Boosted Transmitter

- 31.75 segments, with the unit slice broken down to form the 2 LSB slices
- Feed-forward path for the gm-boosted switch transistors
- Eye-opening area enlarged by 32%




# Receiver Front-End Stage

- T-coil and passive equalizer without conducting current
- On-chip MIM AC-coupling capacitors
- 1 CTLE and 2 VGAs
- 8x4 time interleaved SAR-ADC with 7-bit resolution



# Eye-Opening vs. ADC ENOB vs. Skew

- At least 5.0-ENOB ADC to achieve enough eye-opening
- IQ-mismatch should be less than 0.6-ps to meet 5.0-ENOB




# MMCDR-Based ADC Skew Calibration

- An MM-PD has many hold conditions
- Statistically, the hold conditions carry jitter and skew information
- The otherwise discarded ("garbage") hold transitions are recycled for skew calibration



| Case | Transition        | $D_j$ | $D_{j+1}$ | $E_j$ | $E_{j+1}$ | Conventional | Proposal       |
|------|-------------------|-------|-----------|-------|-----------|--------------|----------------|
| 1    | $A \rightarrow d$ | +     | -         | +     | +         | $D_n$        | $D_n$          |
| 2    | $B \rightarrow c$ | -     | +         | -     | -         | $D_n$        | $D_n$          |
| 3    | $a \rightarrow D$ | +     | -         | -     | -         | $Up$         | $Up$           |
| 4    | $b \rightarrow C$ | -     | +         | +     | +         | $Up$         | $Up$           |
| 5    | $A \rightarrow D$ | +     | -         | +     | -         | <b>Hold</b>  | <b>Skew --</b> |
| 6    | $B \rightarrow C$ | -     | +         | -     | +         | <b>Hold</b>  | <b>Skew --</b> |
| 7    | $a \rightarrow d$ | +     | -         | -     | +         | <b>Hold</b>  | <b>Skew ++</b> |
| 8    | $b \rightarrow c$ | -     | +         | +     | -         | <b>Hold</b>  | <b>Skew ++</b> |
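
The decision table above can be written as a direct lookup on the signs of two consecutive data and error samples. The sketch below transcribes the eight rows. Treating any combination not listed in the table as "Hold", and counting each "Skew ++" / "Skew --" as one step, are assumptions added for illustration.

```python
# Keys are the signs of (D_j, D_j+1, E_j, E_j+1); values are
# (conventional MM-PD output, proposed output), transcribed from the table.
DECISIONS = {
    ("+", "-", "+", "+"): ("Dn", "Dn"),
    ("-", "+", "-", "-"): ("Dn", "Dn"),
    ("+", "-", "-", "-"): ("Up", "Up"),
    ("-", "+", "+", "+"): ("Up", "Up"),
    ("+", "-", "+", "-"): ("Hold", "Skew--"),
    ("-", "+", "-", "+"): ("Hold", "Skew--"),
    ("+", "-", "-", "+"): ("Hold", "Skew++"),
    ("-", "+", "+", "-"): ("Hold", "Skew++"),
}

def mm_pd(d_j, d_j1, e_j, e_j1, proposed=True):
    """Look up the phase-detector output for one pair of samples."""
    # Assumption: combinations outside the table produce no update.
    conventional, new = DECISIONS.get((d_j, d_j1, e_j, e_j1), ("Hold", "Hold"))
    return new if proposed else conventional

def skew_steps(events):
    """Net skew-code change over a sequence of sign tuples (one step each)."""
    step = {"Skew++": 1, "Skew--": -1}
    return sum(step.get(mm_pd(*event), 0) for event in events)

if __name__ == "__main__":
    print(mm_pd("+", "-", "+", "-"), mm_pd("+", "-", "+", "-", proposed=False))
```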

# Before vs. After ADC Skew Calibration

- Skew in the ADC produces an offset in every fourth converted sample
- Skew calibration loop should be faster than CDR loop due to PI INL



# Approximate Multipliers in DSP

- The probability that multiple Generate Signals( $G_{XY}$ ) have two or more outputs '1' is very low
- $G_{XY}$  in partial product can be approximated with an OR gate



| Number of Generate Signals | P(all 0s) [%] | P(one 1) [%] | P(two 1s) [%] | P(three 1s) [%] |
|----------------------------|---------------|--------------|---------------|-----------------|
| 2                          | 87.89         | 11.72        | 0.39          | -               |
| 3                          | 82.40         | 16.48        | 1.10          | 0.024           |
| 4                          | 77.25         | 20.60        | 2.06          | 0.093           |

\* S. Venkatachalam and S.-B. Ko "Design of Power and Area Efficient Approximate Multipliers," IEEE Trans. Very Large Scale Integr. Syst., vol 25, no. 5, pp. 1782-1786, May 2017
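
The probabilities in the table closely match a binomial model in which each generate signal is '1' independently with probability 1/16. That probability is an assumption here: it corresponds to a generate signal being the AND of two partial-product bits, each itself the AND of two uniformly random input bits. A minimal sketch:

```python
from math import comb

def ones_distribution(n_signals, p_one=1 / 16):
    """Percent probability of k generate signals being '1', for k = 0..n.

    Assumes independent signals, each '1' with probability p_one.
    """
    return [
        100.0 * comb(n_signals, k) * p_one**k * (1 - p_one) ** (n_signals - k)
        for k in range(n_signals + 1)
    ]

if __name__ == "__main__":
    for n in (2, 3, 4):
        row = ", ".join(f"{p:.3f}" for p in ones_distribution(n))
        print(f"{n} generate signals: {row}")
```

Since two or more simultaneous '1's are rare under this model, replacing the exact sum of generate signals with an OR gate is wrong only in a small fraction of cases.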

# Approximate vs. Accurate Multipliers

- Less than 2% difference in JTOL simulation between approximate and accurate multipliers
- 32.5% area reduction and 40.4% power saving with the approximate multipliers in Rx DSP



| Architecture Example for EQ | Area ( $\mu\text{m}^2$ ) | Power (mW) |
|-----------------------------|--------------------------|------------|
| Accurate Multiplier         | 42454                    | 85.364     |
| Approximate Multiplier      | 28669                    | 50.868     |
| Approximate / Accurate      | 67.5%                    | 59.6%      |

# Outline

- Motivations
- 56Gb/s Transceiver Architecture and Implementation
  - ADC-DSP-DAC based Transceiver
  - Transmitter with Gm-Boosted Switches
  - Receiver Front-End with Passive Equalizer
  - DSP-Based ADC Skew Calibration Technique
  - DSP using Approximate Multipliers
- **Measurement Results**
- Summary and Conclusion

# PLL Jitter Measurement

- RJ is 87.7 fs rms at 10.3125GHz, 125.7 fs rms at 12.890625GHz, and 165.0 fs rms at 13.28125GHz



[ 10.3125-GHz LCPLL ]

[ 12.890625-GHz LCPLL ]

[ 13.28125-GHz LCPLL ]

# Transmitter Eye-Diagrams

- The transceiver was verified from 10Gb/s to 56.25Gb/s




# Lossy Channel for LR Test

- 27-inch and 40-inch backplanes(B/P) to test LR
- -39dB@14GHz overall channel loss including -1.2dB package IL



# BER at 56.25Gb/s through LR Channel

- BER 3.9E-07 at -35dB and 1.8E-05 at -39dB ILs respectively



# JTOL through LR Channel

- 53.125Gb/s and 56.25Gb/s meet the JTOL spec in IEEE 802.3cd standard



# BER vs. IL and ICR

- BER below 1E-06 for channels with less than 35dB of loss
- Relatively constant BER when the ICR is above 30dB



# BER vs. Temperature and Power Noise

- BER stays below 1E-05 even at 125°C
- BER stays below 1E-06 even with 50mV, 2MHz supply noise



# Eye-Diagrams for the Best Sample

- The transmitter supports up to 64.0625Gb/s



# Outline

- Motivations
- 56Gb/s Transceiver Architecture and Implementation
  - ADC-DSP-DAC based Transceiver
  - Transmitter with Gm-Boosted Switches
  - Receiver Front-End with Passive Equalizer
  - DSP-Based ADC Skew Calibration Technique
  - DSP using Approximate Multipliers
- Measurement Results
- **Summary and Conclusion**

# Power and Performance Summary



|                              | [2] (ISSCC 2019) | [3] (ISSCC 2019)  | [4] (VLSI 2019) | <b>This Work</b>      |
|------------------------------|------------------|-------------------|-----------------|-----------------------|
| Technology                   | 7nm              | 7nm               | 7nm             | 10nm                  |
| Supply [V]                   | 0.9 / 0.75       | 1.3 / 0.85 / 0.75 | N/A             | 1.2 / 0.85 / 0.75     |
| Power [mW/Gb/s]              | 4.17             | 8.1               | 11.25           | <b>7.7</b>            |
| Spec.                        | 56G-LR           | 56G-LR            | 56G-LR          | 56G-LR                |
| IL@Nyquist [dB]              | 42.5             | 38                | 33.6            | 38                    |
| Area/lane [mm <sup>2</sup> ] | 0.468            | 0.84              | 0.37            | 0.72                  |
| Adaptation                   | MCU+Embedded     | No Info           | No Info         | <b>Fully Embedded</b> |


# Summary

- The SerDes, fabricated in 10nm FinFET technology, supports up to 56.25Gb/s/lane through a -39dB IL channel
- This work achieves 7.7mW/Gb/s power efficiency by adopting MM-CDR-based ADC skew calibration, a DSP with approximate multipliers, and a gm-boosted transmitter

# Acknowledgement

- The authors would like to thank Jaehong Park in Samsung Foundry for his technical discussions and contributions
- Samsung Foundry design technology team, package development team, layout group and IP solution group in IP development team

# Appendix

- Why Direct-Drive Electrical Transceiver for Long-Reach
- Why Power-Efficient DSP
- Why Passive Equalizer
- Pre-FEC vs. Post-FEC

# A 6.4-to-32Gb/s 0.96pJ/b Referenceless CDR Employing ML-Inspired Stochastic Phase-Frequency Detection Technique in 40nm CMOS

Kwanseo Park, Minkyo Shim, Han-Gon Ko, and  
Deog-Kyoon Jeong

Seoul National University, Korea

# Outline

- Motivation
- Proposed Design Procedure
- Circuit Implementation
- Measurement Results
- Conclusions

# Motivation



- Referenceless CDR
  - Eliminates the external reference clock, lowering system cost
  - Suitable for a wide-range continuous-rate CDR
  - Frequency information should be obtained only from the input random data

# Motivation

## Conventional 2x oversampling BBPD



- 2x oversampling bang-bang phase detector
  - Detect a transition between data and edge samples
  - Simple and robust, but only phase information is obtained

# Motivation

Conventional 2x oversampling BBPD



**Detect transition**

Proposed phase-frequency detector



**Monitor sequential pattern**

- **What if we monitored the sequential pattern?**

# Motivation

Conventional 2x oversampling BBPD: **Detect transition**

- Based on the premise that a data transition indicates the direction of the phase error
- Deductive and logical method

Proposed phase-frequency detector: **Monitor sequential pattern**

- Monitor many histograms of the sequential pattern and determine the weight
- Inductive and stochastic method
- Uses basic methodologies of ML

# Proposed Design Procedure

- Step1: Collect histograms of the sequential patterns under various conditions
- Step2: Select representative histograms
- Step3: Calculate weight using Bayes' Theorem
- Step4: Apply weight and check PD & FD gain curves

# Step1: Collect Histograms

- Monitoring sequential pattern, “Data-Edge-Data”



Ex)  $D[n] = 0, E[n] = 1, D[n+1] = 1 \rightarrow \text{Pattern} = 3$

- Histograms under various conditions
  - Phase difference:  $P_{DIFF}$
  - Frequency difference:  $F_{DIFF}$  ( $= F_D - F_C$ )
  - Data pattern: PRBS7, PRBS15, and PRBS31
  - Noise: FM / PM jitters in DCO modeling
  - ISI: various channel models
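
The "Data-Edge-Data" pattern number in the example above is the 3-bit value formed with D[n] as the most significant bit. The sketch below encodes it and builds a normalized histogram from sampled streams. The function names and the list-based input format are assumptions for illustration.

```python
from collections import Counter

def pattern_index(d_n, e_n, d_next):
    """Encode (D[n], E[n], D[n+1]) as a 3-bit number, D[n] as MSB."""
    return (d_n << 2) | (e_n << 1) | d_next

def pattern_histogram(data, edge):
    """Probability of each pattern 0..7 over aligned data/edge samples.

    edge[i] is the edge sample taken between data[i] and data[i+1].
    """
    if len(data) < 2 or len(edge) < len(data) - 1:
        raise ValueError("need at least two data samples and matching edges")
    counts = Counter(
        pattern_index(data[i], edge[i], data[i + 1])
        for i in range(len(data) - 1)
    )
    total = sum(counts.values())
    return [counts[p] / total for p in range(8)]

if __name__ == "__main__":
    print(pattern_index(0, 1, 1))
    print(pattern_histogram([0, 1, 1, 0], [1, 1, 1]))
```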

# Step1: Collect Histograms

- Phase difference:  $P_{DIFF}$

  - CK leading: EARLY ( $P_{DIFF} < 0$ )
  - CK lagging: LATE ( $P_{DIFF} > 0$ )



# Step1: Collect Histograms

- Frequency difference:  $F_{\text{DIFF}}$  ( $= F_D - F_C$ )

  - Fast CK: EARLY ( $F_{\text{DIFF}} < 0$ )
  - Slow CK: LATE ( $F_{\text{DIFF}} > 0$ )



# Step1: Collect Histograms

- Pattern histograms with various  $P_{\text{DIFF}}$  ( $F_{\text{DIFF}} = 0$ )

x-axis: sequential pattern  
(ex. 3 = '0 1 1')  
y-axis: probability



Larger  
 $P_{\text{DIFF}}$





More Obvious →



- Pattern of 1, 6: EARLY (confidence level  $\approx 1$ )
- Pattern of 3, 4: LATE (confidence level  $\approx 1$ )

# Step1: Collect Histograms

- Pattern histograms with various  $F_{\text{DIFF}}$  ( $F_C$  is fixed to 10GHz)



Larger  
 $F_{\text{DIFF}}$





More  
Obvious →



- Pattern of 0, 7: EARLY (confidence level = ?)
- Pattern of 1, 2, 3, 4, 5, 6: LATE (confidence level = ?)

# Step2: Select Representative Histograms

- Case I: Single histogram
  - $P_{\text{DIFF}} = \pm 20\text{ps}$ ,  $F_{\text{DIFF}} = 0$
  - Only phase information



# Step3: Calculate Weight Using Bayes' Theorem

- Case I: Single histogram

- $P_{\text{DIFF}} = \pm 20\text{ps}, F_{\text{DIFF}} = 0$



Weight for pattern '1'

$$\begin{aligned} &= \Pr(\text{LATE} | 1) - \Pr(\text{EARLY} | 1) \\ &= \frac{\Pr(1 | \text{LATE}) \Pr(\text{LATE})}{\Pr(1)} - \frac{\Pr(1 | \text{EARLY}) \Pr(\text{EARLY})}{\Pr(1)} \\ &= \frac{0 \times 0.5}{(0 + 0.25)/2} - \frac{0.25 \times 0.5}{(0 + 0.25)/2} = -1 \end{aligned}$$

| Pattern  | 0 | 1  | 2 | 3  | 4  | 5 | 6  | 7 |
|----------|---|----|---|----|----|---|----|---|
| Weight   | 0 | -1 | 0 | +1 | +1 | 0 | -1 | 0 |
| Decision | X | E  | X | L  | L  | X | E  | X |

E: EARLY   L: LATE   X: No output
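
The weight calculation above can be sketched as a small function applied to every pattern. The two histograms below are idealized assumptions (random data, no noise, edge sample equal to the previous bit when EARLY and to the next bit when LATE); they agree with the $\Pr(1|\text{LATE}) = 0$ and $\Pr(1|\text{EARLY}) = 0.25$ values used in the worked example, but they are not measured data from the slides.

```python
def bayes_weight(p_late, p_early, prior_late=0.5):
    """Pr(LATE | pattern) - Pr(EARLY | pattern) via Bayes' theorem.

    p_late / p_early are Pr(pattern | LATE) and Pr(pattern | EARLY).
    Returns 0.0 when the pattern never occurs in either histogram.
    """
    prior_early = 1.0 - prior_late
    evidence = p_late * prior_late + p_early * prior_early
    if evidence == 0:
        return 0.0
    return (p_late * prior_late - p_early * prior_early) / evidence

# Idealized Case I histograms for patterns 0..7 (assumption, see lead-in).
LATE_HIST = [0.25, 0.0, 0.0, 0.25, 0.25, 0.0, 0.0, 0.25]
EARLY_HIST = [0.25, 0.25, 0.0, 0.0, 0.0, 0.0, 0.25, 0.25]

if __name__ == "__main__":
    weights = [bayes_weight(l, e) for l, e in zip(LATE_HIST, EARLY_HIST)]
    print(weights)
```

With equal priors the weight reduces to (p_late - p_early) / (p_late + p_early), so it always lies between -1 and +1.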

# Step4: Apply weight and check PD & FD curves

- Case I: Single histogram
  - $P_{\text{DIFF}} = \pm 20\text{ps}$ ,  $F_{\text{DIFF}} = 0$
  - Frequency detection is impossible



# Step2: Select Representative Histograms

- Case II: Combined histogram
  - $P_{DIFF} = \pm 20\text{ps}$ ,  $F_{DIFF} = 0$  &  $F_{DIFF} = \pm 9.7\text{GHz}$
  - Both phase and frequency information



# Step3: Calculate Weight Using Bayes' Theorem

- Case II: Combined histogram

- $P_{\text{DIFF}} = \pm 20\text{ps}$ ,  $F_{\text{DIFF}} = 0$  &  $F_{\text{DIFF}} = \pm 9.7\text{GHz}$



Weight for pattern '0'

$$\begin{aligned} &= \Pr(\text{LATE} | 0) - \Pr(\text{EARLY} | 0) \\ &= \frac{\Pr(0 | \text{LATE})\Pr(\text{LATE})}{\Pr(0)} - \frac{\Pr(0 | \text{EARLY})\Pr(\text{EARLY})}{\Pr(0)} \\ &= \frac{0.192 \times 0.5}{(0.192 + 0.374)/2} - \frac{0.374 \times 0.5}{(0.192 + 0.374)/2} = -0.32 \end{aligned}$$

| Pattern  | 0     | 1     | 2  | 3     | 4     | 5  | 6     | 7     |
|----------|-------|-------|----|-------|-------|----|-------|-------|
| Weight   | -0.32 | -0.34 | +1 | +0.98 | +0.98 | +1 | -0.34 | -0.32 |
| Decision | E     | E     | L  | L     | L     | L  | E     | E     |

E: EARLY   L: LATE   X: No output

# Step4: Apply weight and check PD & FD curves

- Case II: Combined histogram
  - $P_{\text{DIFF}} = \pm 20\text{ps}$ ,  $F_{\text{DIFF}} = 0$  &  $F_{\text{DIFF}} = \pm 9.7\text{GHz}$
  - Phase-frequency detection is achieved
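
One way to picture Step 4 is to treat a single point on a gain curve as the probability-weighted average of the pattern weights. The sketch below does this with the Case II weights from the table. The averaging rule and the idealized phase-only histograms are assumptions made for illustration; the slides do not specify how the gain curves were computed.

```python
# Case II weights for patterns 0..7, copied from the table above.
CASE2_WEIGHTS = [-0.32, -0.34, 1.0, 0.98, 0.98, 1.0, -0.34, -0.32]

def decision(weight):
    """Map a weight to the table's decision: L (late), E (early), X (none)."""
    if weight > 0:
        return "L"
    if weight < 0:
        return "E"
    return "X"

def mean_output(weights, histogram):
    """Expected detector output for a given pattern histogram (assumed rule)."""
    return sum(w * p for w, p in zip(weights, histogram))

# Idealized phase-only histograms (assumption): random data, no noise.
EARLY_HIST = [0.25, 0.25, 0.0, 0.0, 0.0, 0.0, 0.25, 0.25]
LATE_HIST = [0.25, 0.0, 0.0, 0.25, 0.25, 0.0, 0.0, 0.25]

if __name__ == "__main__":
    print("".join(decision(w) for w in CASE2_WEIGHTS))
    print(mean_output(CASE2_WEIGHTS, EARLY_HIST), mean_output(CASE2_WEIGHTS, LATE_HIST))
```

Under these assumptions the average output is negative for an early clock and positive for a late clock, so the phase-detection polarity survives the change from Case I to Case II weights.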



# Effect of Data Pattern

- PD and FD gain curves showing effect of data pattern



- PRBS7, PM -115dB, No channel
- PRBS15, PM -115dB, No channel
- ▲ PRBS31, PM -115dB, No channel

# Effect of Noise and ISI

- PD and FD gain curves showing effect of noise and ISI



- ▲ PRBS31, PM -115dB, No channel
- ◆ PRBS31, PM -115dB, FM -90dB@1MHz, No channel
- PRBS31, PM -115dB, Channel loss -8.5dB@Nyquist

# Effect of Averaging Time

- PD and FD gain curves showing effect of averaging time
- **Gain curves spread, but maintain the direction**



# Effect of Harmonic Locking

- PD and FD gain curves showing effect of harmonic locking
- **It avoids harmonic locking with random data**



# Implementation of Proposed CDR

- **Proposed Digital CDR**
  - Referenceless operation
  - Quarter-rate clocking
  - 2x oversampling
  - Stochastic phase-frequency detector



6.5: A 6.4-to-32Gb/s 0.96pJ/b Referenceless CDR Employing ML-inspired Stochastic Phase-Frequency Detection Technique in 40nm CMOS

# Chip Photomicrograph



| Label | Block              | Power [mW] |
|-------|--------------------|------------|
| A     | CTLE               | 8.1        |
| B     | Sampler & DEMUX    | 8.0        |
| C     | DCO & Clock Buffer | 9.2        |
| D     | CDR Digital        | 5.5        |
|       | Total @ 32Gb/s     | 30.8       |

- Fabricated in 40-nm LP CMOS process
- Active area of 0.041 mm<sup>2</sup>

# Measurement Setup



- Signal Quality Analyzer: jitter tolerance
- Oscilloscope: frequency acquisition behavior and jitter histogram

# Measured Frequency Acquisition



- Frequency acquisition behavior with various initial clock frequency
- Tested with 32Gb/s PRBS31 input pattern

# Measured Frequency Acquisition



- Frequency acquisition behavior with various initial clock frequency
- Tested with 20Gb/s PRBS31 input pattern

# Measured Frequency Acquisition



- Frequency acquisition behavior with various initial clock frequency
- Tested with 6.4Gb/s PRBS31 input pattern

# Measured Acquisition Time



- Acquisition time with frequency difference at various data rates
- Acquisition time < 11μs

# Measured Frequency Acquisition



- Frequency acquisition behavior with various data patterns, sinusoidal jitter, and a worse channel

# Measured Jitter Histogram



- Jitter histogram of the recovered clock at 6.4Gb/s and 32Gb/s

# Measured Jitter Tolerance



- Jitter tolerance at 32Gb/s ( $\text{BER} < 10^{-12}$ )
- Similar JTOL performance with BBPD and SPFD

# Performance Comparison

|                          | ISSCC 16      | ISSCC 17     | JSSC 18          | VLSI 19       | This work     |
|--------------------------|---------------|--------------|------------------|---------------|---------------|
| Technology               | 28nm          | 28nm         | 65nm             | 65nm          | 40nm          |
| Supply [V]               | 0.9           | 0.9          | 1.0              | 1.2           | 1.1           |
| Architecture             | Half rate     | Quarter rate | Full / half rate | Half rate     | Quarter rate  |
| Samples/UI               | 2             | 2.5          | 4                | 4             | 2             |
| Channel Loss [dB]        | 5             | 14.8         | Not reported     | Not reported  | 10            |
| Data Pattern             | PRBS9         | PRBS31       | PRBS7            | PRBS31        | PRBS31        |
| Data Rate [Gb/s]         | 7.4 – 11.5    | 22.5 – 32    | 0.75 – 3         | 4 – 20        | 6.4 – 32      |
| Unlimited Detection      | No            | No           | No               | Yes           | Yes           |
| Lock Time [ $\mu$ s]     | < 18          | < 10100      | < 100            | < 25          | < 11          |
| Area [ $\text{mm}^2$ ]   | 0.21          | 0.213        | 0.35             | 0.045         | 0.041         |
| Power [mW]               | 22.9 @ 12Gb/s | 102 @ 32Gb/s | 21.6 @ 3Gb/s     | 37.3 @ 20Gb/s | 30.8 @ 32Gb/s |
| Energy Efficiency [pJ/b] | 1.9           | 3.19         | 7.2              | 1.87          | 0.96          |

# Conclusions

- A 6.4-to-32Gb/s Referenceless CDR is designed and verified in 40nm CMOS
- A simple phase-frequency detection technique is proposed
  - Monitors histograms of the sequential patterns
  - Uses an inductive and stochastic design procedure
  - Achieves an unlimited frequency detection capability and avoids harmonic locking
- The measured energy efficiency is 0.96pJ/b and acquisition time is less than 11μs

# Reference-Noise Compensation Scheme for Single-Ended Package-to-Package Links

Xi Chen, Nikola Nedovic, Stephen G. Tell, Sudhir S. Kudva,  
Brian Zimmer, Thomas H. Greer, John W. Poulton, Sanquan Song,  
Walker J. Turner, John M. Wilson, C. Thomas Gray

NVIDIA Corp.

# Outline

- Background and Motivation
- Reference noise
- Compensation circuit design
  - Offset and delay compensation loops
  - Digital compensation
  - 1-UI delay circuit
- Measurement results
- Summary

# Short Reach Link over Multiple Levels

High-speed interconnects enable new levels of integration



# Off-Package I/O Requirements

- Challenges:
  - Bandwidth requirement
  - Power limitation
  - Bump limitation
- Single-ended signaling
  - Lower power
  - Higher bump efficiency



# Reference Noise in Single-Ended Link



Low-power links for both intra- and inter-package communication

# On-Package Noise Sources

- On-chip activity and PDN resonance cause on-package noise
- Chip rails may have hundreds of mV noise at tens of MHz



# Reference Noise Effect

- Loop-back test with experimental board
  - Inject sinusoidal noise through PCB plane to modulate reference



# Reference Noise Extraction

- Forwarded clock carries reference noise information, which can be extracted and applied to all parallel signal lanes for compensation




# Clock Duty-Cycle Modulation

- Reference mismatch appears as clock duty-cycle distortion after the RX amplifier
- Positive reference error => larger than 50% duty-cycle



# Self-Sampling Clock

- Clock self-sampling with **1 Unit-Interval (UI) delay** works like a low-pass filter
- Reference error is converted to digital information
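
The conversion from duty-cycle error to a digital value can be illustrated with a toy model. The sketch below assumes an idealized forwarded clock with a period of 2 UI and no jitter (both assumptions, not taken from the slides), and the function name `self_sample` is hypothetical:

```python
def self_sample(duty_cycle, delay_ui=1.0):
    """Sample a clock with a copy of itself delayed by delay_ui.

    Assumed model: the clock has a period of 2 UI and is high for the first
    2 * duty_cycle UI of each period. The sample taken at the delayed rising
    edge is 1 when that edge still falls inside the high phase.
    """
    period_ui = 2.0
    high_time_ui = period_ui * duty_cycle
    return 1 if (delay_ui % period_ui) < high_time_ui else 0


# Duty cycle above 50% (positive reference error) samples as 1,
# duty cycle below 50% samples as 0.
print(self_sample(0.52), self_sample(0.48))
```

In this model the sampler acts as a 1-bit quantizer of the reference error when the inserted delay is exactly 1 UI.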



# Offset Compensation Loop



- Dynamically trim RX front end offset, based on the de-serialized reference error information.

## Offset Compensation



# Delay-Locked Loop

- Locks the inserted clock delay to “1UI”
- Responds to VT variations
- Shares resource with offset compensation loop



# Digital Domain Compensation

- **Bit Sum Detection** logic monitors clock lane de-serialized outputs, for possible reference error and delay error
- Incrementally changes offsets/delay control codes, from default settings
- **Pro:** No impact to analog front-end designs; maximum flexibility
- **Con:** Routing latency and processing time limit bandwidth

Compensation Logic



# Reference Error Quantization

Reference error to digital data transfer is not always ideal



Ideal Case



Real Case (example)



# Error Detections – Example.1



Ref. error = 5mV,  
Inserted delay >1UI



$$\text{Thresholds: } 8 \geq T_{\text{osh}} > T_{\text{osl}} \geq 0; \quad T_{\text{dly}} = 8 - T_{\text{osh}} + T_{\text{osl}}$$

- RX offset setting is adjusted when parallel-bit-sum (count of “1” bits) is out of the range defined by thresholds
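
The threshold rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, "out of range" is read as strictly above $T_{\text{osh}}$ or strictly below $T_{\text{osl}}$, and the sign of the returned step is an arbitrary convention. How $T_{\text{dly}}$ is applied inside the delay loop is not modeled.

```python
def delay_threshold(t_osh, t_osl, width=8):
    """T_dly = width - T_osh + T_osl, with width >= T_osh > T_osl >= 0."""
    if not (width >= t_osh > t_osl >= 0):
        raise ValueError("thresholds must satisfy width >= T_osh > T_osl >= 0")
    return width - t_osh + t_osl


def offset_step(bits, t_osh, t_osl):
    """Return an offset-code step from one word of de-serialized clock samples.

    Assumed convention: -1 lowers the RX offset code when there are too many
    1 bits, +1 raises it when there are too few, and 0 is the dead-zone.
    """
    bit_sum = sum(bits)  # parallel-bit-sum: count of 1 bits
    if bit_sum > t_osh:
        return -1
    if bit_sum < t_osl:
        return +1
    return 0


print(delay_threshold(6, 2))                 # threshold derived from the formula
print(offset_step([1] * 8, 6, 2))            # all ones: outside the range
print(offset_step([1, 0] * 4, 6, 2))         # balanced word: inside the dead-zone
```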

# Error Detections – Example.2



Ref. error = 2.5mV,  
Inserted delay >1UI



$$\text{Thresholds: } 8 \geq T_{\text{osh}} > T_{\text{osl}} \geq 0; \quad T_{\text{dly}} = 8 - T_{\text{osh}} + T_{\text{osl}}$$

- Large delay error increases the dead-zone of offset detection temporarily, and triggers delay tuning within offset dead-zone

# Error Detections – Example.3



Ref. error = 2.5mV,  
Inserted delay  $\approx 1\text{UI}$



$$\text{Thresholds: } 8 \geq T_{\text{osh}} > T_{\text{osl}} \geq 0; \quad T_{\text{dly}} = 8 - T_{\text{osh}} + T_{\text{osl}}$$

- Programmable thresholds tolerate finite resolutions, and avoid competition or dead-lock between offset and delay loops

# Loop Bandwidth Optimization

- **Latency** in the compensation loops may impact the stability and introduce additional jitter

- “**Holdoff cycles**” added to improve stability and accuracy, at the cost of lower bandwidth
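
One way to picture holdoff cycles is a filter that lets an update through and then ignores further detections for a fixed number of cycles, so that the loop does not react again before the previous update has propagated. The sketch below is an assumed behavior for illustration, not the logic on the chip, and `apply_holdoff` is a hypothetical name:

```python
def apply_holdoff(decisions, holdoff_cycles):
    """Pass a nonzero decision, then suppress the next holdoff_cycles decisions."""
    applied = []
    remaining = 0  # cycles left during which updates are ignored
    for decision in decisions:
        if remaining > 0:
            applied.append(0)
            remaining -= 1
        elif decision != 0:
            applied.append(decision)
            remaining = holdoff_cycles
        else:
            applied.append(0)
    return applied


# With 2 holdoff cycles, at most one update in every 3 cycles is applied.
print(apply_holdoff([1, 1, 1, 1, 0, -1], 2))
```

A larger holdoff value lowers the update rate, which matches the stated trade of bandwidth for stability.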



6.6: Reference-Noise Compensation Scheme for Single-Ended Package-to-Package Links

# 1-UI Delay Implementation

## Requirements:

1. Monotonicity
2. Tuning range (>25% UI)
3. Fine resolution (0.8ps avg.)

Plus: good linearity (DNL = 0.3ps) and low power (<1mA)
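
As a rough sizing check of these requirements, the tuning range can be converted to picoseconds and divided by the average step. The sketch assumes the 25Gb/s lane rate quoted later in the deck, and the result is an estimate, not a figure from the slides:

```python
import math

LANE_RATE_GBPS = 25.0        # assumed from the 25Gb/s/lane GRS links
TUNING_RANGE_FRACTION = 0.25 # requirement: more than 25% of a UI
AVG_STEP_PS = 0.8            # requirement: 0.8ps average resolution

ui_ps = 1000.0 / LANE_RATE_GBPS                  # one unit interval in ps
range_ps = TUNING_RANGE_FRACTION * ui_ps         # minimum tuning range in ps
min_steps = math.ceil(range_ps / AVG_STEP_PS)    # steps needed to cover it

print(f"UI = {ui_ps} ps, range > {range_ps} ps, at least {min_steps} steps")
```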



Delay cells assignment

| DEn | Stage 1    | Stage 2    | Stage 3    | Stage 4 |
|-----|------------|------------|------------|---------|
| <4> | Delay <3>  |            |            |         |
| <3> | Delay <3>  |            |            |         |
| <2> | Delay <2>  |            |            |         |
| <1> | Delay <1>  | Delay <0>  | Delay <1>  | Tied-lo |
| <0> | Dither <0> | Dither <1> | Dither <1> | Tied-lo |

# Sampling Clock Dithering

## Problem:

Samplers work as 1-bit quantizers, which may cause high gain (and jitter) in low-noise situations

## Solution:

Dither the sampling clock for better linearity
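
The linearizing effect of dither on a 1-bit quantizer can be shown with a toy model. The sketch below sweeps a set of evenly spaced dither offsets and counts the fraction of 1 decisions. The units are arbitrary, the uniform sweep is an assumption, and `ones_fraction` is a hypothetical name:

```python
def ones_fraction(error, dither_span, n_phases=8):
    """Fraction of 1 decisions from a 1-bit sampler over one dither sweep.

    error and dither_span share the same arbitrary unit. The dither offsets
    are evenly spaced and centered on zero (an assumed dither pattern).
    """
    offsets = [((k + 0.5) / n_phases - 0.5) * dither_span for k in range(n_phases)]
    return sum(1 for d in offsets if error + d > 0) / n_phases


# Without dither the output saturates for any positive error.
print(ones_fraction(0.25, 0.0))
# With dither the fraction of ones tracks the size of the error.
print(ones_fraction(0.25, 1.0), ones_fraction(0.125, 1.0))
```

Without dither, the average output only reports the sign of the error. With dither, it becomes a multi-level estimate of the error within the dither span.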



Clock delay variation (LSB) over Pclk cycle




# GRS Receiver Offset Trim

Pseudo-differential amplifier stage provides very stable offset tuning



ISSCC 2018, JSSC 2019

RX offset tuning range



Ground: Lowest Z, most robust reference

Signed thermometer code

# Chip Photo and Floorplan

- Use GRS links (4+1 lanes @ 25Gb/s/lane) to connect DNN accelerator chips over package or PCB channels
- TSMC 16FF, negligible area for reference noise compensation

**Chip Photo**



**GRS Transceivers**



**RxClk Block**



# Ground Noise Injection Setup

- Inject low-frequency sinusoidal current through PCB ground plane, emulating the environment of large multi-package system
- Run die-to-die link test, without and with compensation



# Ground Noise Injection (60Hz) Results



- **50A noise** consumes half of the margin; **100A noise** “kills” the link



- Reference noise **compensation** recovers almost all eye closure

# Ground Noise Injection (260kHz) Results



- Reference noise compensation provides obvious improvement.
- 260kHz noise causes less degradation than 60Hz noise, due to noise coupling to channels at higher frequency (in this board)
- Driving the PCB GND plane at high frequency is hard because of its low impedance ($70\mu\Omega$ between PKGs)

# Transmitter Noise Injection Setup

- Embed higher-frequency noise into data and clock signals, and inject the “modulated” signals by probing the PCB TX pads.
- Emulate real TX output amplitude, with pre-emphasis EQ.



# Transmitter Noise Injection Results

- With higher noise frequency and amplitude, the compensation loop can still recover most of the lost margin, while uncompensated eyes were closed



# Transmitter Noise Injection Results

Compensation loop can be optimized for noise frequency

- Reducing the holdoff cycles trades low-frequency performance for larger loop bandwidth
- Could be adjusted adaptively with multi-bit offset quantization (sampling clock dithering)



# Transmitter Noise Injection Results

The clock dithering improves noise compensation performance at various frequencies



# Power Breakdown

## 4 Data & 1 Clock Lanes [mW]



- Power overhead for reference noise compensation in a 4+1-lane GRS link is ~1%
- Only adds tens of logic gates in digital domain

# Performance Comparison

|                                    | [3] ISSCC16  | [2] ISSCC18  | This Work    |           |
|------------------------------------|--------------|--------------|--------------|-----------|
| **Data Lane #**                    | 6            | 8            | 4            | 8         |
| **Signaling**                      | CNRZ-5       | Differential | Single-Ended |           |
| **Data Rate / Pin [Gb/s]**         | 20.83        | 28 (56/2)    | 25           |           |
| **Reach**                          | $\leq 12$ mm | N/A          | $\leq 80$ mm |           |
| **Channel Type**                   | MCM          | PCB          | PCB          |           |
| **Technology**                     | 28 nm        | 16 nm        | 16 nm        |           |
| **Energy / Bit [pJ/bit]**          | 0.94         | 2.25         | **1.65**     | **1.18**  |
| **Per Lane Area [mm<sup>2</sup>]** | 0.105        | 0.33         | **0.02**     | **0.01**  |
| **Channel Loss**                   | 3 dB         | 8 dB         | 8.5 dB       |           |
| **Power Overhead \***              | 17.92%       | 17.80%       | **1.13%**    | **0.83%** |

\* Compared to implementations without reference noise tolerant designs:  
**[3] 5bit in 5 wires, [2] single-ended TX, [This Work] without compensation hardware.**

# Summary

- Demonstrated a very efficient reference noise compensation scheme for single-ended links
- Digital compensation method provides maximum design flexibility, with virtually no impact to analog front end.
- Reference noise compensation enables energy-efficient, high-density package-to-package communication in noisy environments and provides margin for performance scaling

# An 8Gb/s/$\mu$m FFE-Combined Crosstalk-Cancellation Scheme for HBM on Silicon Interposer with 3D-Staggered Channels

Han-Gon Ko, Soyeong Shin, Jonghyun Oh,  
Kwanseo Park, Deog-Kyoon Jeong

Seoul National University, Seoul, Korea

# Outline

- Motivation
- Proposed Crosstalk Cancellation Scheme
- Transmitter Architecture
- 3D-Staggered Channel
- Measurement Results
- Conclusion

# Motivation



$$\text{Throughput (Gb/s/um)} = \frac{\text{Data rate (Gb/s)}}{\text{Channel pitch (um)}}$$

- Crosstalk limits scaling of channel pitch
- To increase throughput further, crosstalk cancellation is required

# Proposed Crosstalk Cancellation Scheme



- Distort TX FFE output to cancel XT
- Amplitude of FFE output is adjusted by summing edge detector's outputs

# Proposed Crosstalk Cancellation Scheme



- Waveform of XT is confined within transition time ( $T_{TR}$ )
- XT is cancelled out by adjusting the amplitude of FFE output

# Transmitter Architecture



- Amplitude of FFE is encoded into 5 bits and converted to 70ps-wide pulses
- Charge-injecting FFE driver (ISSCC'09)
- 3D-staggered on-chip channel to mimic silicon interposer

# Transmitter Architecture



- Required strength of driver is calculated based on numerical simulation
- Regulated-cascode transimpedance amplifier is used for low input-Z
- Low input-Z RX increases bandwidth at the cost of signal swing

# Transmitter Architecture



- Encoder receives 1:4 deserialized data and calculates amp. of FFE output
- Change the combination of switched-on segments to produce required strength of driver

# 3D-staggered Channel



Cross-section view of 3D-staggered channel



Lumped element channel model

- 6mm channels are vertically staggered without ground shield
- $R_{tot} = 1.27\text{k}\Omega$ ,  $C_{tot} = 476\text{fF}$  ( $C_{g,tot} = 262\text{fF}$ ,  $C_{1,tot} = 96\text{fF}$ ,  $C_{2,tot} = 118\text{fF}$ )
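
A quick arithmetic check of the lumped values, plus a first-order RC product for comparison with the 250ps unit interval at the 4Gb/s rate used in the measurements. The single-pole product is a rough estimate added for illustration and is not part of the authors' analysis:

```python
R_TOT_OHM = 1.27e3
C_G_TOT_F = 262e-15   # capacitance to ground
C_1_TOT_F = 96e-15    # coupling capacitance, first neighbor group
C_2_TOT_F = 118e-15   # coupling capacitance, second neighbor group

c_tot_f = C_G_TOT_F + C_1_TOT_F + C_2_TOT_F   # should match the quoted 476fF
tau_ps = R_TOT_OHM * c_tot_f * 1e12           # lumped RC product in ps
ui_ps = 1e3 / 4.0                             # unit interval at 4Gb/s in ps

print(f"C_tot = {c_tot_f * 1e15:.0f} fF")
print(f"R_tot * C_tot = {tau_ps:.1f} ps, UI = {ui_ps:.0f} ps")
```

The RC product exceeds one UI, which is consistent with the need for equalization and a low input-impedance receiver on this channel.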

# 3D-staggered Channel



- XT from 4 adjacent channels (FEXT1,2) contributes almost equally
- Low input-Z RX boosts high frequency component by 7.7dB

# Die Photo



|   | Block        | Area (6ch)    |
|---|--------------|---------------|
| 1 | Pattern gen. | 371um X 166um |
| 2 | Encoder      | 225um X 135um |
| 3 | TX           | 110um X 145um |
| 4 | Channel      | 439um X 186um |
| 5 | RX           | 50um X 80um   |



- 65nm CMOS
- 6 channels, 4Gb/s per channel
- 6mm on-chip wire
- Power consumption
  - 36.6mW @ 6×4Gb/s

# Test setup



- Internal pattern generator generates 4Gb/s PRBS7 pattern
- Internal data sampler and BERT are used to measure eye

# Measurement Results



- Eye diagram with 4Gb/s PRBS7 input pattern

# Measurement Results



- Eye diagram with 4Gb/s PRBS7 input pattern

# Measurement Results



- Measured bathtub curve and crosstalk-induced jitter (CIJ)
- 0.32UI eye opening with proposed XTC

# Comparison with Multi-channel XTC schemes

|                             | Oh<br>JSSC'13 | Lee<br>JSSC'13 | Aprile<br>JSSC'18 | This work |
|-----------------------------|---------------|----------------|-------------------|-----------|
| Technology (nm)             | 65            | 130            | 32                | 65        |
| Channel number              | 4             | 3              | 8                 | 6         |
| Data rate (Gb/s)            | 12            | 5              | 7                 | 4         |
| Energy efficiency<br>(pJ/b) | 1.8*          | 4.3            | 5.9*              | 1.5       |
| CIJ reduction ratio         | 90%**         | 75%            | 63%**             | 78%       |
| CIJ reduction (ps)          | N/A           | 36             | N/A               | 245       |

\* RX only

\*\* Estimated from the reduction ratio of the crosstalk noise amplitude

# Comparison with On-chip Serial Links

|                                           | Seo<br>ISSCC'10 | Walter<br>ISSCC'12 | Chen<br>VLSI'15 | Chiu<br>JSSC'18 | Wei<br>JSSC'18 | This work |
|-------------------------------------------|-----------------|--------------------|-----------------|-----------------|----------------|-----------|
| Technology (nm)                           | 90              | 65                 | 65              | 65              | 65             | 65        |
| Link length (mm)                          | 5               | 6                  | 5               | 10              | 5              | 6         |
| Data rate per wire<br>(Gb/s/ch)           | 4.9             | 10                 | 4               | 10              | 10             | 4         |
| Channel pitch (um)                        | 1.12            | 3.9                | 1               | 5               | 2              | 0.5       |
| Throughput<br>(Gb/s/um)                   | 4.4             | 2.56               | 4               | 2               | 5              | 8         |
| Energy efficiency<br>per length (fJ/b/mm) | 68              | 174                | 48.4            | 77.2            | 148            | 254       |
| XT compensation                           | No              | No                 | No              | No              | No             | Yes       |
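
The throughput row follows from the definition given in the Motivation slide (data rate divided by channel pitch), and the energy-per-length entry for this work follows from the power figure on the Die Photo slide. A minimal Python sketch that recomputes both, with illustrative function names:

```python
# (name, data rate per wire in Gb/s, channel pitch in um, reported Gb/s/um)
LINKS = [
    ("Seo ISSCC'10", 4.9, 1.12, 4.4),
    ("Walter ISSCC'12", 10, 3.9, 2.56),
    ("Chen VLSI'15", 4, 1, 4),
    ("Chiu JSSC'18", 10, 5, 2),
    ("Wei JSSC'18", 10, 2, 5),
    ("This work", 4, 0.5, 8),
]


def throughput_gbps_per_um(rate_gbps, pitch_um):
    """Throughput density: data rate per wire divided by channel pitch."""
    return rate_gbps / pitch_um


def energy_per_bit_per_mm_fj(power_mw, total_rate_gbps, length_mm):
    """mW / (Gb/s) is pJ/b; multiply by 1000 for fJ/b, then divide by length."""
    return power_mw / total_rate_gbps * 1000.0 / length_mm


for name, rate, pitch, reported in LINKS:
    print(f"{name}: {throughput_gbps_per_um(rate, pitch):.2f} Gb/s/um (reported {reported})")

# This work: 36.6mW for 6 channels at 4Gb/s over 6mm.
print(f"{energy_per_bit_per_mm_fj(36.6, 6 * 4, 6):.0f} fJ/b/mm")
```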

# Conclusion

- An 8Gb/s/$\mu$m transceiver on silicon interposer is designed and verified in 65nm CMOS
- FFE-combined XT cancellation scheme efficiently reduces CIJ using the existing FFE
- The prototype achieves **the highest throughput of 8Gb/s/$\mu$m** by significantly reducing the channel pitch

# *Paper 6.8*

## **A 100Gb/s NRZ Transmitter with 8-Tap FFE Using a 7b DAC in 40nm CMOS**

**Pen-Jui Peng<sup>1</sup>, Sheng-Tsung Lai<sup>1</sup>, Wei-Hung Wang<sup>1</sup>,  
Chiang-Wei Lin<sup>1</sup>, Wei-Chien Huang<sup>1</sup>, Ted Shih<sup>2</sup>**

**<sup>1</sup>Yuan Ze University, Taoyuan, Taiwan**

**<sup>2</sup>Teletrix, Taipei, Taiwan**

# Outline

---

- **Background**
- **Transmitter Architecture**
- **Building Blocks**
- **Experimental Results**
- **Comparison**
- **Conclusion**

# Applications

## DAC-Based TX

- Flexible for FFE's number of taps/resolution.
- Can be adapted to multilevel modulation (e.g., PAM-4) or coherent modulation for optical links.
- A 100GS/s DAC is demanding to implement in CMOS technology.
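
To illustrate why a DAC-based TX makes the FFE flexible, the sketch below computes the FFE output digitally as a FIR sum over NRZ symbols and maps it to a 7-bit DAC code. It is a behavioral illustration only: the tap values, the ±1 symbol mapping, the zero initial history, and the normalization are assumptions, and `ffe_dac_codes` is a hypothetical name.

```python
def ffe_dac_codes(bits, taps, dac_bits=7):
    """Map a bit stream to DAC codes after an FIR feed-forward equalizer.

    bits: iterable of 0/1 NRZ data.
    taps: FFE coefficients, taps[0] applied to the current bit.
    Output codes span 0 .. 2**dac_bits - 1.
    """
    full_scale = 2 ** dac_bits - 1
    norm = sum(abs(t) for t in taps)           # keeps the output within [-1, 1]
    symbols = [1 if b else -1 for b in bits]   # NRZ mapping (assumed)
    codes = []
    for n in range(len(symbols)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:                     # bits before the start count as 0
                acc += tap * symbols[n - k]
        level = acc / norm
        codes.append(round((level + 1) / 2 * full_scale))
    return codes


# Hypothetical 2-tap example: main cursor plus one negative post-cursor.
print(ffe_dac_codes([1, 1, 0], [0.75, -0.25]))
```

Changing the number of taps or their resolution only changes the digital computation, not the output stage.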

## Coherent Optical System



⇒ Propose high-frequency and power-efficient design!

# TX Architecture




# 2:1 MUX-Driver with Quarter-Rate Clocking



- One zero-crossing is achieved from the preceding MUX stage.
- Clock is ac-coupled and  $V_b$  is used to adjust the output swing.




# 2:1 MUX-Driver with Quarter-Rate Clocking



- PMOSs  $M_{1-2}$  are used to turn off each branch rapidly.




# 128:2 Serializer




# Clock Path



- It can drive 250fF load with only 8mW of power.
- Duty-cycle mismatch is eliminated by the LC resonator.
- A switched-capacitor array adjusts the resonant frequency from 20 to 25GHz.
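
The capacitance range that such a tuning span implies follows from the ideal LC resonance $f = 1/(2\pi\sqrt{LC})$. The sketch below uses a hypothetical 100pH inductance purely for illustration. The ratio of capacitances does not depend on that choice.

```python
import math

L_HENRY = 100e-12  # hypothetical inductance, not taken from the slides


def resonant_frequency_hz(l_h, c_f):
    """Ideal LC tank resonance, ignoring losses and parasitics."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_h * c_f))


def capacitance_for(f_hz, l_h):
    """Capacitance that resonates with l_h at f_hz."""
    return 1.0 / ((2.0 * math.pi * f_hz) ** 2 * l_h)


c_at_20g = capacitance_for(20e9, L_HENRY)
c_at_25g = capacitance_for(25e9, L_HENRY)
ratio = c_at_20g / c_at_25g   # equals (25/20)**2 for any inductance

print(f"C(20GHz)/C(25GHz) = {ratio:.4f}")
```

Covering 20 to 25GHz therefore needs the total tank capacitance to vary by roughly a factor of 1.56 in this ideal model.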

# Clock Path



- QEC for 12.5GHz clock



- Range: +3ps ~ -3ps
- Resolution: <185fs



- Resolution: <200fs

# Die Photo & Power Breakdown



Total: 619mW @ 100Gb/s



- TSMC's 40-nm CMOS Technology
- Core Area :  $0.97 \times 0.52 \text{ mm}^2$
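
These figures tie directly to the efficiency and core-area entries quoted for this work in the comparison table. A short arithmetic check:

```python
TOTAL_POWER_MW = 619.0
DATA_RATE_GBPS = 100.0
CORE_W_MM, CORE_H_MM = 0.97, 0.52

efficiency_pj_per_bit = TOTAL_POWER_MW / DATA_RATE_GBPS  # mW / (Gb/s) = pJ/b
core_area_mm2 = CORE_W_MM * CORE_H_MM

print(f"{efficiency_pj_per_bit:.2f} pJ/bit, {core_area_mm2:.3f} mm^2")
```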

# Measurement Setup



- Approximately 7-dB loss @ 50 GHz from testing setup.
  - Probe, DC blocks, cables, sampling module


# Output Waveforms

## Without FFE

PRBS31 @ 100Gb/s



- DAC Swing: 560mV<sub>ppd</sub>

## With Proper FFE Coefficient

PRBS31 @ 100Gb/s



- RMS Jitter < 760fs
- $t_r/t_f < 6\text{ps}$

# DAC DNL/INL



- **DNL < 0.78LSB**



- **INL < 1LSB**

# Comparison Table

|                              | [1]<br>ISSCC '18             |          | [2]<br>ISSCC '18             |         | [5]<br>ISSCC '19          |          | [6]<br>ISSCC '16            | [8]<br>ISSCC '14            | This Work                    |
|------------------------------|------------------------------|----------|------------------------------|---------|---------------------------|----------|-----------------------------|-----------------------------|------------------------------|
| **Architecture**             | **Quarter-rate**             |          | **Quarter-rate**             |         | **Quarter-rate**          |          | **Half-rate**               | **Half-rate**               | **Quarter-rate**             |
| **Signalling**               | PAM-4                        | NRZ      | PAM-4                        | NRZ     | PAM-4                     | NRZ      | NRZ                         | NRZ                         | **NRZ**                      |
| **Data-rate (Gb/s)**         | 112                          | **56**   | 112                          | **56**  | 112                       | **56**   | **64**                      | **60**                      | **80** / **100**             |
| **Clock Source**             | **On-Chip PLL**              |          | **External**                 |         | **On-Chip PLL**           |          | **On-Chip PLL**             | **On-Chip PLL**             | **External**                 |
| **Driver Topology**          | **CML**                      |          | **SST**                      |         | **SST**                   |          | **CML**                     | **CML**                     | **CML (tailless)**           |
| **Output Swing w/o FFE**     | **0.75 V<sub>ppd</sub>**     |          | **0.92 V<sub>ppd</sub>**     |         | **1 V<sub>ppd</sub>**     |          | **1.2 V<sub>ppd</sub>**     | **0.5 V<sub>ppd</sub>**     | **0.56 V<sub>ppd</sub>**     |
| **TX FFE**                   | **3-tap**                    |          | **8-tap**                    |         | **4-tap**                 |          | **4-tap**                   | **None**                    | **8-tap**                    |
| **Efficiency (pJ/bit)\***    | 1.72                         | **3.44** | 2.6                          | **5.2** | 3.62                      | **7.24** | **4.6**                     | **6.25**                    | **4.23** / **6.19**          |
| **Core Area (mm<sup>2</sup>)** | **0.0302**                 |          | **0.095**                    |         | **0.56**                  |          | **0.32**                    | **2.1**                     | **0.504**                    |
| **Technology**               | **10nm FinFET**              |          | **14nm FinFET**              |         | **40nm CMOS**             |          | **16nm FinFET**             | **65nm CMOS**               | **40nm CMOS**                |

\* Excluding PLL


# Conclusion

---

- A 100-Gb/s DAC-based NRZ transmitter with 8-tap FFE is designed and verified in 40nm CMOS.
- The quarter-rate 2:1 MUX-driver samples the data with one zero-crossing to improve the bandwidth.
- A tuned amplifier is adopted throughout the 100-Gb/s MUX-driver, yielding a power-efficient design.
- The DAC-based TX can be further applied to PAM-4 or coherent systems to support a 100G baud rate.