

## 22.1 A 0.275pJ/b 42Gb/s/pin Clock-Referenced PAM3 Transceiver Tolerant to Supply Noise, Reference Offset and Crosstalk for Chiplets and Short-Reach Memory Interfaces

Kahyun Kim<sup>1</sup>, Jung-Hun Park<sup>2</sup>, Ha-Jung Park<sup>1</sup>, Jia Park<sup>1</sup>, Jihee Kim<sup>1</sup>, Woo-Seok Choi<sup>1</sup>

<sup>1</sup>Seoul National University, Seoul, Korea

<sup>2</sup>University of California, Berkeley, CA

The trend toward multi-chip modules (MCMs), die-to-die (D2D), and chiplet interfaces (e.g., UCIe) requires high bandwidth density at minimal power consumption [1,3,4,11-13]. Single-ended (SE) PAM3 signaling has been adopted in GDDR7 [2] as a high-bandwidth solution, increasing the data rate to 150% of that of NRZ (PAM2) at the same baud rate. SE PAM3 signaling can also be applied to chiplet interfaces, but it faces challenges (Fig. 22.1.1): (1) The RX requires large-area reference-voltage ($V_{refp}$ & $V_{refn}$) generators. (2) Matching RX and TX references is challenging, and discrepancies result in a TX-RX reference offset. (3) TX drivers generate simultaneous switching noise (SSN) of up to a few hundred millivolts on $V_{DD}$ in the TX die [3]; this noise couples to the TX data and cannot be rejected by the RX, thereby degrading voltage margins. (4) Reducing channel pitch increases crosstalk, and PAM3 is 2x more vulnerable to crosstalk than NRZ.

To address these difficulties, a $V_{ref}$ tracking method that extracts common-mode information has been proposed [4]. However, its tracking bandwidth is limited by its complicated feedback loop. Several differential data-encoding methods, such as C-PHY [5] and CNRZ [6], have also been suggested, but multi-lane encoding schemes necessitate complex encoding and decoding hardware, which increases area and power overhead. This paper presents a 42Gb/s/pin clock-referenced SE PAM3 (CR-PAM3) transceiver that provides tolerance to supply noise, reference offset, and crosstalk in an area- and power-efficient manner. The proposed transceiver's RX does not include $V_{ref}$ generation; rather, the forwarded clocks are used as the PAM3 $V_{ref}$. The forwarded clocks (CLKp & CLKn) toggle between $V_{refp}$ and $V_{refn}$ (Fig. 22.1.1), rather than between $V_{DD}$ and $V_{SS}$. Consequently, for any given clock phase, one of the differential clocks provides $V_{refp}$ and the other $V_{refn}$. Two RX samplers compare the incoming data with the two differential clocks for data recovery (see Fig. 22.1.1, bottom). Since both data and clock voltages exhibit a common dependence on the TX $V_{DD}$, the proposed transceiver is structurally immune to TX-RX reference offset, and TX $V_{DD}$ noise is effectively canceled. In addition to CR-PAM3 signaling, several circuit-level techniques are proposed to reduce the power consumption and BER: a differentially weighted driver with a fractional-spaced puller (FS-puller), fractional-spaced crosstalk cancellation (FS-XTC), and a DFE-embedded sampler.
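
The clock-referenced slicing described above can be illustrated with a minimal behavioral sketch. It is not the authors' circuit: it assumes an ideal channel, data and clock levels that scale linearly with the TX supply, and clock levels placed midway between adjacent data levels. The function names, the level fractions, and the 0.5V drooped supply are hypothetical choices made only for this example.

```python
# Idealized behavioral model of clock-referenced PAM3 slicing (no channel loss, no jitter).
# Assumptions for illustration only: every TX level scales linearly with the TX supply,
# and the clock levels sit midway between adjacent data levels.
LEVEL_FRAC = {"L": 0.0, "M": 0.5, "H": 1.0}  # data level as a fraction of the TX supply
VREFP_FRAC, VREFN_FRAC = 0.75, 0.25          # clock high/low levels as fractions of the TX supply


def tx_data(symbol, vddo):
    """Voltage of a transmitted PAM3 symbol for a given TX supply."""
    return LEVEL_FRAC[symbol] * vddo


def forwarded_clocks(phase, vddo):
    """Return (CLKp, CLKn); the two clocks swap between Vrefp and Vrefn each half period."""
    vrefp, vrefn = VREFP_FRAC * vddo, VREFN_FRAC * vddo
    return (vrefp, vrefn) if phase == 0 else (vrefn, vrefp)


def cr_pam3_decode(data_v, clkp, clkn):
    """Two samplers compare data with CLKp and CLKn; the count of 'above' decisions picks the symbol."""
    above = int(data_v > clkp) + int(data_v > clkn)
    return "LMH"[above]


def worst_margin(vddo, vrefp, vrefn):
    """Smallest distance between any data level and either reference."""
    return min(abs(tx_data(s, vddo) - ref) for s in "LMH" for ref in (vrefp, vrefn))


NOMINAL, DROOPED = 0.6, 0.5  # TX supply in volts; the droop value is hypothetical
# Conventional RX: references fixed at their nominal values while the TX supply droops.
fixed_margin = worst_margin(DROOPED, VREFP_FRAC * NOMINAL, VREFN_FRAC * NOMINAL)
# Clock-referenced RX: references are generated by the same (drooped) TX supply.
tracking_margin = worst_margin(DROOPED, *forwarded_clocks(0, DROOPED))
```

In this idealized model, the worst-case voltage margin at the drooped supply is about 50mV with fixed references and 125mV with references carried by the forwarded clocks. These are properties of the toy model, not measured results.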

Figure 22.1.2 shows a block diagram of the proposed CR-PAM3 transceiver. TX data is processed through two paths after the serializer: one operating at the UI rate for data level information and the other at a sub-UI rate for level transition information. Per-pin de-skew circuits compensate for skew among data channels. The quarter-rate (7GHz) forwarded-clock signal levels are set by ratioing digitally controlled pull-up and pull-down transistors (see Fig. 22.1.2 bottom). A capacitive-peaking equalizer aids fast clock-edge transitions. The forwarded clocks are provided to the quadrature generator, quadrature error corrector (QEC), and digitally controlled delay line (DCDL) to generate the 4-phase sampling clocks used by the RX samplers. RX samplers compare the received data to the received forwarded clocks. The RX is unterminated to reduce power consumption.
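
The relationship between the per-pin data rate and the 7GHz quarter-rate clock can be checked with a short calculation. The sketch below assumes that PAM3 carries 1.5 bits per symbol (3 bits mapped onto 2 symbols), which is consistent with the figures quoted here; the function and key names are illustrative.

```python
# Rate arithmetic for a PAM3 link with a quarter-rate forwarded clock.
BITS_PER_SYMBOL = 1.5  # assumption: 3 bits carried by 2 PAM3 symbols
CLOCK_DIVISION = 4     # quarter-rate forwarded clock


def link_rates(data_rate_gbps):
    """Derive baud rate, forwarded-clock frequency, and unit interval from the data rate."""
    baud_gbd = data_rate_gbps / BITS_PER_SYMBOL
    return {
        "baud_gbd": baud_gbd,                    # symbol rate in GBaud
        "clock_ghz": baud_gbd / CLOCK_DIVISION,  # forwarded-clock frequency in GHz
        "ui_ps": 1000.0 / baud_gbd,              # unit interval in picoseconds
    }


rates = link_rates(42.0)  # 28 GBaud, 7 GHz clock, roughly 35.7 ps UI
```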

Conventional voltage-mode PAM3 drivers conduct a significant current between $V_{DD}$ and $V_{SS}$ [7] when transmitting the mid-level symbol (**M**) due to the large driver transistor sizes. This short-circuit current can be reduced by decreasing the driver strength, but at the expense of a lower driving capability at higher data rates. The proposed PAM3 driver, illustrated in Fig. 22.1.3, addresses this tradeoff by differentially weighting the driver strength based on the symbol transmitted. Strong drivers are used to drive $V_{DD}$ and $V_{SS}$ for **H** and **L** symbols. To drive **M** symbols, moderately sized transistors, which form the FS-puller, generate a sub-UI spaced pulse that immediately switches the voltage level to **M**. The sub-UI spaced pulse accelerates edge transitions and is active only during signal transitions to avoid static power consumption. The weak transistors then hold the mid-voltage level using a reduced static current. Combined with the FS-puller, the FS-XTC [8] compensates for FEXT from two adjacent channels. There exist 9 crosstalk (XT) levels in PAM3: from a 4-level rise (+4 XT) to a 4-level fall (-4 XT). For example, when one aggressor rises from **L** to **H**, and the other from **M** to **H**, the victim suffers a 3-level rise (+3 XT) due to capacitive coupling. To cancel out FEXT, a sub-UI spaced pulse with the opposite polarity is reused. Unlike in [8], XT is canceled only when the victim's transition polarity is opposite to the sum of the aggressors, to further enlarge the worst-case eye height. Data-dependent XTC logic is added after the 4:1 serializer to reduce the number of 4:1 serializers. The FS-puller and FS-XTC utilize a 0.5UI spaced pulse to coincide with edge transition times. A capacitive-peaking equalizer in the TX compensates for channel loss.
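
The crosstalk-level count and the selective cancellation rule can be sketched as follows. This is one reading of the description above rather than the implemented logic: the integer level encoding, the treatment of a non-switching victim (no cancellation), and a cancellation weight equal to the negated XT level are all assumptions made for illustration.

```python
from itertools import product

LEVELS = {"L": 0, "M": 1, "H": 2}  # integer encoding of PAM3 levels (illustrative)


def step(prev, curr):
    """Signed level change of one lane between consecutive symbols (-2..+2)."""
    return LEVELS[curr] - LEVELS[prev]


def xt_level(agg1, agg2):
    """Combined crosstalk level from two adjacent aggressors (-4..+4)."""
    return step(*agg1) + step(*agg2)


def sign(x):
    return (x > 0) - (x < 0)


def cancel_pulse(victim, agg1, agg2):
    """Return the cancellation weight: nonzero only when the victim's transition
    polarity is opposite to the summed aggressor transition (assumed rule)."""
    xt = xt_level(agg1, agg2)
    v = step(*victim)
    if v != 0 and xt != 0 and sign(v) == -sign(xt):
        return -xt  # opposite-polarity pulse; proportional weighting is an assumption
    return 0


# Enumerate every pair of aggressor transitions to list the distinct XT levels.
transitions = list(product("LMH", repeat=2))
xt_levels = sorted({xt_level(a, b) for a in transitions for b in transitions})
```

Enumerating all aggressor transition pairs yields the nine levels from -4 to +4, and the example in the text (**L** to **H** together with **M** to **H**) evaluates to +3.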

Figure 22.1.4 shows the RX implementation. Each data signal is provided to two samplers: one compares the data with CLKp, and the other with CLKn. Typical DFEs employ CML summers and taps that consume static current, whereas the proposed DFE-embedded sampler eliminates these CML circuits and integrates the DFE tap as another input pair to the StrongArm (SA) sampler. Two decisions from previous sampling phases are fed to the added input pair, thereby reducing the parasitic capacitance at the RB and SB nodes and the subsequent CLK-to-Q delay. In addition, the inverter positioned between the SA and the tap branch has a P:N transistor drive ratio of 1:2 to minimize feedback time. After being precharged, the sampler starts evaluating, and either the S or R node discharges from $V_{DD}$ to $V_{SS}$. Therefore, a stronger NMOS enhances the pull-down capability, reducing the tap feedback time and securing timing margins. A CMOS-based clock-edge corrector (CEC) is employed for quarter-rate phase error correction, instead of an area- and power-intensive RC-based QEC [9]. For robust die-to-die communication, an on-chip foreground training sequence is introduced to sequentially calibrate the TX per-lane skew, clock-swing levels, and RX CEC. For training, a preset pattern is transmitted through the data channel, and the calibrated parameter is sent through the sideband channel. SS-LMS logic performs calibration based on early/late and high/low information. A digital block is used to emulate the low-speed sideband channel in chiplet standards. Once link initialization begins, clock alignment, per-lane deskew, and clock-level training are executed sequentially (see Fig. 22.1.4, bottom).
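
A sign-based calibration loop of the kind described above can be sketched in a few lines. This is a generic one-LSB-per-step update in the spirit of SS-LMS, not the on-chip logic: the 6-bit code range, the step count, and the error detector (a stand-in for early/late or high/low information) are hypothetical.

```python
def sign(x):
    return (x > 0) - (x < 0)


def ss_lms_calibrate(code, measure_error, steps=64, lo=0, hi=63):
    """Move an integer control code by one LSB per iteration using only the error sign."""
    for _ in range(steps):
        code -= sign(measure_error(code))  # step against the sign of the error
        code = max(lo, min(hi, code))      # clamp to the valid code range
    return code


# Hypothetical detector: reports 'late' (positive) when the code exceeds the ideal setting.
IDEAL_CODE = 37
final_code = ss_lms_calibrate(code=0, measure_error=lambda c: c - IDEAL_CODE)
```

Because only the sign of the error is used, the loop needs just a one-bit early/late or high/low decision per iteration, at the cost of a convergence time that grows with the initial code error.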

The prototype chip, including the proposed transceiver and on-chip test channels, is fabricated in 28nm CMOS. Each of the 6 test channels is implemented as a metal trace that is 2mm long and 0.5μm wide, with a 2.5μm channel pitch. The total insertion loss is measured to be 8.5dB at 12GHz, and the worst-case FEXT from adjacent channels is -15.2dB at Nyquist. The transceiver utilizes a 1.0V $V_{DD}$ and a 0.6V $V_{DDO}$. The measured data rate is 42Gb/s/pin. Each lane has its own PRBS7 generator. To emulate a noisy chiplet environment, a 200mV<sub>pp</sub> supply noise is injected on the TX $V_{DDO}$ via capacitive coupling on the PCB. To compare the supply-noise tolerance, a conventional PAM3 transceiver with $V_{ref}$ generated by the RX is implemented in a replica channel. Figure 22.1.5 shows the measured eye diagrams and bathtub curves for the proposed transceiver, with and without injected noise, EQ, and XTC. Without EQ and XTC, a BER less than $10^{-12}$ cannot be achieved. When both EQ and XTC are enabled, the horizontal eye margin increases by up to 0.38UI. With a 120MHz 200mV<sub>pp</sub> sinusoidal $V_{DDO}$ supply noise, the CR-PAM3 transceiver achieves a 0.34UI horizontal and a 121mV vertical eye margin, whereas the conventional PAM3 transceiver shows no measurable vertical or horizontal eye margin. With a 60Hz 200mV<sub>pp</sub> sinusoidal $V_{DDO}$ noise, the CR-PAM3 transceiver shows measured margins of 0.36UI and 118mV, whereas only 0.08UI and 26mV margins are measured for the conventional PAM3 transceiver.

The performance summary and power breakdown of the proposed transceiver are shown in Fig. 22.1.6. The energy efficiency of the proposed transceiver is 0.275pJ/b, which is the lowest among prior state-of-the-art chiplet transceivers. The proposed CR-PAM3 transceiver enhances supply-noise and reference-offset tolerance, cancels crosstalk, and reduces power consumption through a differentially weighted driver, an FS-puller, and a DFE-embedded sampler. The unit-transceiver area is 1187$\mu\text{m}^2$, which is the smallest among prior chiplet transceivers, and the beach-front bandwidth is 9.16Tb/s/mm.
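
As a rough cross-check of the headline numbers, the power implied per data pin follows from the energy efficiency and the per-pin data rate. This assumes the quoted energy per bit is defined as transceiver power divided by per-pin data rate; the symbols $P_{\text{pin}}$, $E_b$, and $R_b$ are introduced here only for this estimate.

$$P_{\text{pin}} \approx E_b \times R_b = 0.275\,\text{pJ/b} \times 42\,\text{Gb/s} \approx 11.55\,\text{mW}$$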

### Acknowledgement:

This work was supported in part by the Creative-Pioneering Researchers Program through Seoul National University, by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant (No. 2021-0-00871, Development of DRAM-Processing-In-Memory Chip for DNN Computing), and by IITP under the artificial intelligence semiconductor support program to nurture the best talents (IITP-2023-RS-2023-00256081) grant funded by the Korea government (MSIT). The EDA tool was supported by the IC Design Education Center (IDEC), Korea. Woo-Seok Choi (wooseokchoi@snu.ac.kr) is the corresponding author.



Figure 22.1.1: Reference-noise issues for conventional SE chiplet interfaces (top). Proposed clock-referenced PAM3 operation (bottom).



Figure 22.1.2: Proposed PAM3 transceiver architecture (top) and circuit implementation of the amplitude-modulating CLK driver (bottom).



Figure 22.1.3: Proposed differentially weighted N-N driver with FS puller operation (top) and the proposed PAM3 FS-XTC (bottom).



Figure 22.1.4: Implementation of the proposed DFE-embedded SA (top), and the link-training state-flow diagram (bottom).



Figure 22.1.5: Measured CR-PAM3 TRX eye diagrams and bathtub curves.



Figure 22.1.6: Channel configuration, power breakdown, and key-metric comparison table.



Figure 22.1.7: Chip microphotograph.

#### References:

- [1] S.-Y. Cho et al., "A 16Gb 37Gb/s GDDR7 DRAM with PAM3-Optimized TRX Equalization and ZQ Calibration," *ISSCC*, pp. 242-244, 2024.
- [2] B. Dehlaghi and A. Chan Carusone, "A 0.3 pJ/bit 20 Gb/s/Wire Parallel Interface for Die-to-Die Communication," *IEEE JSSC*, vol. 51, no. 11, pp. 2690-2701, Nov. 2016.
- [3] J. W. Poulton et al., "A 0.54 pJ/b 20 Gb/s Ground-Referenced Single-Ended Short-Reach Serial Link in 28 nm CMOS for Advanced Packaging Applications," *IEEE JSSC*, vol. 48, no. 12, pp. 3206-3218, Dec. 2013.
- [4] X. Chen et al., "Reference-Noise Compensation Scheme for Single-Ended Package-to-Package Links," *ISSCC*, pp. 126-128, 2020.
- [5] W. Choi et al., "A 1V 7.8mW 15.6Gb/s C-PHY transceiver using tri-level signaling for post-LPDDR4," *ISSCC*, pp. 402-403, 2017.
- [6] A. Shokrollahi et al., "A pin-efficient 20.83Gb/s/wire 0.94pJ/bit forwarded clock CNRZ-5-coded SerDes up to 12mm for MCM packages in 28nm CMOS," *ISSCC*, pp. 182-183, 2016.
- [7] H. Park et al., "30-Gb/s 1.11-pJ/bit Single-Ended PAM-3 Transceiver for High-Speed Memory Links," *IEEE JSSC*, vol. 56, no. 2, pp. 581-590, Feb. 2021.
- [8] H.-G. Ko et al., "An 8Gb/s/μm FFE-Combined Crosstalk-Cancellation Scheme for HBM on Silicon Interposer with 3D-Staggered Channels," *ISSCC*, pp. 128-130, 2020.
- [9] J.-H. Park et al., "A 32Gb/s/pin 0.51pJ/b Single-Ended Resistor-less Impedance-Matched Transmitter with a T-Coil-Based Edge-Boosting Equalizer in 40nm CMOS," *ISSCC*, pp. 410-412, 2023.
- [10] Y. Kwon et al., "A 33-Gb/s/Pin 1.09-pJ/Bit Single-Ended PAM-3 Transceiver With Ground-Referenced Signaling and Time-Domain Decision Technique for Multi-Chip Module Memory Interfaces," *IEEE JSSC*, vol. 58, no. 8, pp. 2314-2325, Aug. 2023.
- [11] Y. Nishi et al., "A 0.297-pJ/Bit 50.4-Gb/s/Wire Inverter-Based Short-Reach Simultaneous Bi-Directional Transceiver for Die-to-Die Interface in 5-nm CMOS," *IEEE JSSC*, vol. 58, no. 4, pp. 1062-1073, April 2023.
- [12] K. Seong et al., "A 4nm 32Gb/s 8Tb/s/mm Die-to-Die Chiplet Using NRZ Single-Ended Transceiver With Equalization Schemes And Training Techniques," *ISSCC*, pp. 114-116, 2023.
- [13] J. Jin et al., "A 4nm 16Gb/s/pin Single-Ended PAM4 Parallel Transceiver with Switching-Jitter Compensation and Transmitter Optimization," *ISSCC*, pp. 404-406, 2023.
- [14] J. Seo et al., "A 20-Gb/s/pin 0.0024-mm<sup>2</sup> Single-Ended DECS TRX with CDR-less Self-Slicing/Auto-Deserialization to Improve Tolerance on Duty Cycle Error and RX Supply Noise for DCC/CDR-less Short-Reach Memory Interfaces," *ISSCC*, pp. 456-458, 2022.