

# 30-Gb/s 1.11-pJ/bit Single-Ended PAM-3 Transceiver for High-Speed Memory Links

Hyunsu Park, *Graduate Student Member, IEEE*, Junyoung Song<sup>✉</sup>, *Member, IEEE*, Jincheol Sim, *Student Member, IEEE*, Yoonjae Choi, *Graduate Student Member, IEEE*, Jonghyuck Choi<sup>✉</sup>, *Graduate Student Member, IEEE*, Jeongsik Yoo, and Chulwoo Kim<sup>✉</sup>, *Senior Member, IEEE*

**Abstract**—A 30-Gb/s three-level pulse amplitude modulation (PAM-3) transceiver with a one-tap tri-level decision feedback equalizer (DFE) is designed in a 28-nm CMOS process to realize a high-speed dynamic random access memory (DRAM) interface. A pin efficiency of 1.5 bit/pin is achieved by encoding and decoding 3-bit data in two unit intervals (UIs). The half-rate PAM-3 transmitter modulates single-ended pseudorandom binary sequence (PRBS) 7/15 data using a low-power encoding logic and an output driver. The receiver achieves a bit error rate (BER) of less than 1E-12 over an 80-mm FR-4 printed circuit board (PCB) channel. At the maximum data rate, the energy efficiency of the transceiver is 1.11 pJ/bit, with a power consumption of 33.4 mW. In the receiver, the attenuated PAM-3 data are equalized by a continuous-time linear equalizer (CTLE) and a one-tap tri-level DFE, which has the same complexity as a DFE for non-return-to-zero (NRZ) signaling. Tri-state buffers with a floating PMOS switch convert the comparator outputs into NRZ data, reducing delay and power dissipation. Four channels of the transceiver operate at data rates of up to $30 \times 4$ Gb/s, and the measured horizontal eye margin of the PAM-3 data is 0.14 UI for the PRBS-7 pattern at the maximum data rate.

**Index Terms**—Decision feedback equalizer (DFE), double data rate (DDR), high-speed memory interface, pulse amplitude modulation (PAM-3), single-ended interface.

## I. INTRODUCTION

MEMORY interfaces have employed single-ended signaling because they require numerous DQ pins. Their bandwidths are being increased to allow for a high data throughput, while the input/output (I/O) supply voltage has been scaled down. The I/O bandwidths of dynamic random access memory (DRAM) have been extended by increasing the clock frequency across double data rate (DDR), low-power double data rate (LPDDR), and graphics double data rate (GDDR) interfaces, as shown in Fig. 1 [1]–[4]. However, the increased frequency degrades the clock performance through duty distortion and jitter amplification. This degradation is more severe in a DRAM process than in a logic process because of its lower performance. To cope with these design constraints, the power dissipation of the clock path has been increased, and a duty-cycle corrector (DCC) has become an essential building block for securing a voltage margin, although it increases the hardware cost [5].

Manuscript received February 19, 2020; revised May 11, 2020; accepted June 28, 2020. Date of publication July 15, 2020; date of current version January 28, 2021. This article was approved by Guest Editor Jonathan Chang. This work was supported by the Institute of Information and Communications Technology Planning and Evaluation (IITP) Grant funded by the Korean Government (MSIT) (Development of LPDDR5 Memory Interface for A.I Application Processor) under Grant 2019-0-01370. (*Corresponding author: Chulwoo Kim*.)

Hyunsu Park is with the Department of Semiconductor System Engineering, Korea University, Seoul 02841, South Korea.

Junyoung Song is with the Department of Electrical Engineering, Incheon National University, Incheon 22012, South Korea.

Jincheol Sim, Yoonjae Choi, Jonghyuck Choi, Jeongsik Yoo, and Chulwoo Kim are with the Department of Electrical Engineering, Korea University, Seoul 02841, South Korea (e-mail: ckim@korea.ac.kr).

Color versions of one or more of the figures in this article are available online at <https://ieeexplore.ieee.org>.

Digital Object Identifier 10.1109/JSSC.2020.3006864

In graphics memory applications, a large parallel I/O has been used to reduce the operating clock frequency and channel loss [6]. However, the fabrication cost is higher than that of DRAMs with a GDDR due to the through-silicon-via (TSV) process. Because of the high design cost involved in the TSV process, the aforementioned parallel I/O cannot be easily employed in DDR and LPDDR interfaces. In the state-of-the-art DDR interface, a phase-rotator-based delay-locked loop (DLL) is proposed instead of the conventional DLL to suppress the degradation of clock performance. However, this increases the design complexity, causing a higher power consumption and occupied area compared with the case of using a conventional DLL [1].

Pulse-amplitude modulation (PAM) is a popular candidate for high-bandwidth memory interfaces. Four-level pulse-amplitude modulation (PAM-4) signaling, which transmits 2 bits during one unit interval (UI), is commonly adopted in high-speed wireline interfaces [7]. The voltage margin of PAM-4 is theoretically equal to 1/3 of that of non-return-to-zero (NRZ) signaling and is even lower in actual designs due to simultaneous switching noise (SSN), crosstalk, and random noise in single-ended signaling. Eventually, the reduced voltage margin degrades the signal-to-noise ratio (SNR), causing the link to fail to meet the bit error rate (BER) specification of the memory interface. To achieve the target BER, the peak-to-peak voltage should be increased, although this causes substantial power dissipation [8].

Non-linearity is another critical issue in single-ended PAM-4 signaling because the second harmonic is much larger than that in differential signaling, which degrades the voltage margin. As shown in Fig. 2(a), the eyes of PAM-4 can be easily distorted by non-ideal amplifier factors in the analog front end of the receiver. When the gain of the amplifier is larger than the designed value, the upper and lower eyes are compressed due to the voltage headroom and current-mode-logic driver limitations [9], [10]. In addition, gain compression distorts the signal levels of PAM-4. When the amplifier has an excessive voltage gain of 3 dB or a gain distortion of 1 dB, the ratio of level mismatch (RLM) is degraded from 100% to 79.3% and 86.4%, respectively.

Fig. 1. Trends of DRAM interfaces.

Fig. 2. Distorted RLM of (a) PAM-4 and (b) PAM-3 signaling.

Three-level pulse-amplitude modulation (PAM-3) signaling can be an alternative to PAM-4 signaling because it is better suited to a single-ended interface. The ideal voltage margin of PAM-3 signaling is 1.5 times that of PAM-4 signaling. As shown in Fig. 2(b), the PAM-3 eye is less sensitive to excessive amplifier gain because the excess gain only amplifies the middle-level noise voltage and maintains the same RLM, in contrast to the PAM-4 eye. PAM-3 is still affected by gain compression from the channel response and the analog-front-end circuits of the receiver, although the distortion is lower than that observed in PAM-4 signaling and the RLM is higher. Therefore, PAM-3 signaling is more feasible than PAM-4 signaling for a single-ended interface.

As a form of three-level signaling, duo-binary signaling has been used in interfaces that have a high channel loss [11]. As shown in Table I, duo-binary signaling has three voltage levels, but its data rate is the same as that of NRZ signaling because the additional level is used for embedded equalization. Therefore, duo-binary signaling cannot benefit from the reduced clock frequency of PAM-N ($N \geq 3$) signaling.

TABLE I  
COMPARISON WITH PAM-N SIGNALING

|                                      | NRZ | Duo-binary | PAM-4 | PAM-3<br>(3-bit/2-UI) |
|--------------------------------------|-----|------------|-------|-----------------------|
| No. of data levels                   | 2   | 3          | 4     | 3                     |
| Ideal Voltage margin                 | 1   | 1/2        | 1/3   | 1/2                   |
| No. of bits per UI                   | 1   | 1          | 2     | 1.5                   |
| Half-rate frequency (D is data-rate) | D/2 | D/2        | D/4   | D/3                   |
| Crosstalk                            | 1 X | 1 X        | 3 X   | 2 X                   |
| Sensitivity to non-linearity         | Low | Middle     | High  | Middle                |



Fig. 3. Top block diagram of the transmitter.

In this article, a PAM-3 signaling method with a 1.5-bit/pin efficiency, which transmits three bits during two UIs, is proposed, and several sub-circuits are revised to increase the maximum data rate over that of the previous work [12]. For the same data rate, the proposed PAM-3 signaling method reduces the clock frequency to 2/3 of that of NRZ signaling (D/3 instead of D/2 in Table I). The clock frequency reduction enables a lower jitter amplification and duty distortion, as well as a lower channel attenuation. In terms of non-linearity and the noise margin, the proposed method has a higher feasibility than PAM-4 signaling. In addition, this article presents several circuit techniques, such as the transmitter output driver and receiver equalizers. Notably, the equalization is optimized for PAM-3 signaling, thereby lowering hardware costs. This article is organized as follows. In Sections II and III, the transmitter and receiver circuit designs for 3-bit/two-UI PAM-3 signaling are elucidated, respectively. In Section IV, the test environment and the measurement results of the transceiver are described, and this article is concluded in Section V.
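The clock-frequency relationship in Table I can be checked with a few lines of arithmetic. The sketch below assumes a half-rate architecture in which one clock period spans two UIs; the function and dictionary names are illustrative and do not come from the article.

```python
from fractions import Fraction

# Bits carried per unit interval (UI), taken from Table I.
BITS_PER_UI = {
    "NRZ": Fraction(1),
    "Duo-binary": Fraction(1),
    "PAM-4": Fraction(2),
    "PAM-3 (3-bit/2-UI)": Fraction(3, 2),
}


def half_rate_clock_ghz(data_rate_gbps, bits_per_ui):
    """Half-rate clock frequency: one clock period spans two UIs."""
    symbol_rate = Fraction(data_rate_gbps) / bits_per_ui  # in GBaud
    return symbol_rate / 2


# Evaluate every scheme at the 30-Gb/s data rate used in this work.
clocks = {name: half_rate_clock_ghz(30, b) for name, b in BITS_PER_UI.items()}
for name, f in clocks.items():
    print(f"{name:20s} {float(f):5.2f} GHz")

# Ratio of the PAM-3 clock to the NRZ clock at the same data rate.
ratio = clocks["PAM-3 (3-bit/2-UI)"] / clocks["NRZ"]
print("PAM-3 clock relative to NRZ:", ratio)
```

At 30 Gb/s this gives 15 GHz for NRZ, 7.5 GHz for PAM-4, and 10 GHz for the proposed PAM-3 scheme, so the PAM-3 clock is 2/3 of the NRZ clock.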

## II. PAM-3 TRANSMITTER DESIGN

A top block diagram of the PAM-3 transmitter is shown in Fig. 3. In the proposed PAM-3 signaling, three bits are transmitted during two UIs. This operation is implemented using the full-rate data and the half-rate encoder/decoder. As a PAM-3 modulator, a low-swing $N$-over-$N$ driver is used to achieve a high energy efficiency as well as a simple implementation. A detailed explanation of the encoding block and the output driver is presented in Sections II-A and II-B, respectively. The maximum data rate is 30 Gb/s using a clock frequency of 10 GHz. Three 8-bit parallel data streams are generated by an internal pseudorandom-binary-sequence (PRBS) generator. The generated parallel data are serialized to full-rate data (A–C) by three 8:1 serializers. Then, a PAM-3 encoder converts the full-rate input data into half-rate output data for PAM-3 modulation.

Fig. 4. Block diagram of 3-bit/two-UI PAM-3 encoder.

#### A. 3-bit/Two-UI PAM-3 Encoder

PAM-3 signaling can theoretically transmit $\log_2 3$ ($\approx 1.58$) bits during one UI. However, a 1.5-bit efficiency of the PAM-3 signaling cannot be achieved using a single wire and a single UI. Two methods can be employed to increase the bit efficiency of PAM-3 signaling. First, multi-wire encoding techniques can be adopted for PAM signaling [13]. However, per-pin skews limit the sampling margin, thereby degrading the BER. The skews between the encoded wires are caused by channel mismatch and local variation; furthermore, compensating for such skews incurs substantial hardware costs. The second method for bit-efficiency extension is a multi-window encoding technique. Unlike multi-wire encoding, multi-window encoding is free from the issue of per-pin skew. As an example of multi-window encoding, 8b/10b encoding has been used to achieve dc balance and clock embedding [15].

In PAM-3 signaling, there are nine transition cases during two UIs. Out of these, eight cases are used for 3-bit encoding to achieve the desired pin efficiency of 1.5 bit/(wire·UI). A two-UI encoding can be implemented with a half-rate operation, which has been commonly used in conventional DDR interfaces. The three full-rate data streams, namely, A–C, are encoded into four signals (odd/even and high/low data), from which the output driver generates each PAM-3 data pattern, as shown in Fig. 4. In the encoder, D flip-flops sample the four outputs of the encoding logic to secure the timing margin of the 2:1 multiplexer. The 2:1 multiplexer converts the four encoded signals into half-rate outputs for the PAM-3 modulation of the output driver. Table II shows the encoding map of the PAM-3 transmitter. Among the nine transition cases, the high-to-high transition has the highest power consumption in a low-voltage swing terminated logic (LVSTL) driver with a ground termination. Therefore, all cases except the high–high case are mapped to the eight 3-bit codes.

TABLE II  
ENCODING TABLE AND CURRENT DISSIPATIONS

| Input Data   | Encoded Data (Odd) |   | Encoded Data (Even) |                | Output <sub>TX</sub> (Odd+Even) | Current Dissipations |                |     |                    |                    |
|--------------|--------------------|---|---------------------|----------------|---------------------------------|----------------------|----------------|-----|--------------------|--------------------|
|              | A                  | B | C                   | H <sub>O</sub> | L <sub>O</sub>                  | H <sub>E</sub>       | L <sub>E</sub> | Odd | Even               |                    |
| Not Assigned | -                  |   | -                   |                |                                 |                      |                |     | I <sub>0</sub>     | I <sub>0</sub>     |
| 1            | 1                  | 1 | 1                   | 1              | 0                               | 0                    | 0              |     | I <sub>0</sub>     | 0.75I <sub>0</sub> |
| 1            | 1                  | 0 | 1                   | 0              | 1                               | 0                    | 1              |     | I <sub>0</sub>     | 0                  |
| 0            | 0                  | 1 | 0                   | 0              | 0                               | 1                    | 0              |     | 0.75I <sub>0</sub> | I <sub>0</sub>     |
| 1            | 0                  | 1 | 0                   | 0              | 0                               | 0                    | 0              |     | 0.75I <sub>0</sub> | 0.75I <sub>0</sub> |
| 1            | 0                  | 0 | 0                   | 0              | 0                               | 0                    | 1              |     | 0.75I <sub>0</sub> | 0                  |
| 0            | 1                  | 1 | 0                   | 1              | 1                               | 1                    | 0              |     | 0                  | I <sub>0</sub>     |
| 0            | 1                  | 0 | 0                   | 0              | 1                               | 0                    | 0              |     | 0                  | 0.75I <sub>0</sub> |
| 0            | 0                  | 0 | 0                   | 1              | 0                               | 1                    | 0              |     | 0                  | 0                  |

By using this encoding table, the total power consumption of the output driver decreases by 8.92% compared with that of the conventional case, which uses all of the PAM-3 transition cases. The unassigned case is used to detect one-UI shifted windowing. If the high–high case is detected in the receiver, the clock phase is shifted by 180° to correct the windowing.
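The saving obtained by leaving the high–high pattern unassigned can be reproduced from the per-level currents listed in Table II. The sketch below assumes that all patterns are equally likely and that the driver power is proportional to the supply current; the variable and function names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Driver current per level, normalized to I0 (values from Table II).
CURRENT = {"H": Fraction(1), "M": Fraction(3, 4), "L": Fraction(0)}

# All nine two-UI patterns, and the eight that remain once the
# high-high pattern is left unassigned.
all_cases = list(product("HML", repeat=2))
used_cases = [c for c in all_cases if c != ("H", "H")]


def mean_current(cases):
    """Average current per two-UI window for equiprobable patterns."""
    return sum(CURRENT[a] + CURRENT[b] for a, b in cases) / len(cases)


baseline = mean_current(all_cases)   # all nine patterns in use
encoded = mean_current(used_cases)   # high-high excluded
saving = 1 - encoded / baseline

print(len(all_cases), "patterns in total,", len(used_cases), "used for 3 bits")
print(f"average current: {baseline} I0 -> {encoded} I0")
print(f"relative saving: {float(saving):.4%}")
```

Under these assumptions the average current drops from 7/6 I0 to 17/16 I0, a saving of 5/56, or roughly 8.9%, in line with the figure quoted above.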

#### B. PAM-3 Output Driver

The encoded data from the 3-bit/two-UI PAM-3 encoder are applied to the LVSTL output driver to modulate the PAM-3 data. In a single-ended voltage-mode driver with a single supply voltage, a short-circuit current is essential for generating the middle voltage level of PAM-3 signaling. The output driver employed in the previous work dissipates a large short-circuit current for the middle level because it uses only two NMOS transistors to perform pull-up and pull-down [12]. Fig. 5(a) shows the modified schematic of the PAM-3 output driver, which consists of $M_{N1}$–$M_{N4}$. Three different input cases are used for the PAM-3 modulation, as shown in Fig. 5(b). To reduce the short-circuit current, two additional branches, $M_{N3}$ and $M_{N4}$, are added for the middle-level modulation. The NMOS transistors are sized to generate the three output levels, namely, 0.5 V<sub>DD</sub>, 0.25 V<sub>DD</sub>, and zero, with a ground termination resistor $R_{TERM}$. In the case of a high level, both $M_{N2}$ and $M_{N4}$ are turned on with a series resistance of $R_{TERM}$, generating 0.5 V<sub>DD</sub>. As for the middle level, $M_{N3}$ and $M_{N4}$ are turned on with a series resistance of $2R_{TERM}$. These resistances are set for the 0.25-V<sub>DD</sub> output level and for impedance matching with $R_{TERM}$. In this case, the total current is $3V_{DD}/(8R_{TERM})$, which is 0.75 times that for a high level. In the case of a low level, the size of $M_{N1}$ is set to match the falling and rising times. In this work, the PAM-3 output driver uses a supply voltage of 0.6 V, modulating the data with a 300-mV peak-to-peak voltage swing.
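The output levels and currents quoted above follow from a resistive-divider view of the driver. The sketch below is one reading of the text, not the authors' netlist: it assumes that the high level uses a single pull-up path of resistance $R_{TERM}$, that the middle level uses a pull-up and a pull-down path of $2R_{TERM}$ each, and that the low-level pull-down equals $R_{TERM}$. The function names are illustrative.

```python
from fractions import Fraction


def parallel(a, b):
    """Resistance of two resistors in parallel."""
    return a * b / (a + b)


def driver_state(r_up, r_down, r_term=Fraction(1), vdd=Fraction(1)):
    """Return (output voltage, supply current, output impedance).

    r_up and r_down are the on-resistances of the pull-up and pull-down
    paths (None means the path is off). Everything is normalized to
    R_TERM and VDD; the far end is terminated to ground through r_term.
    """
    load = r_term if r_down is None else parallel(r_down, r_term)
    if r_up is None:
        # No pull-up path: the output sits at ground and draws no current.
        return Fraction(0), Fraction(0), r_down
    i_supply = vdd / (r_up + load)
    v_out = i_supply * load
    z_out = r_up if r_down is None else parallel(r_up, r_down)
    return v_out, i_supply, z_out


high = driver_state(r_up=Fraction(1), r_down=None)
mid = driver_state(r_up=Fraction(2), r_down=Fraction(2))
low = driver_state(r_up=None, r_down=Fraction(1))  # assumed pull-down value

for name, (v, i, z) in (("high", high), ("mid", mid), ("low", low)):
    print(f"{name:4s} Vout = {v} VDD, I = {i} VDD/R_TERM, Zout = {z} R_TERM")

print("mid/high current ratio:", mid[1] / high[1])
print("levels at VDD = 0.6 V:", [float(s[0]) * 0.6 for s in (high, mid, low)])
```

With these assumptions the model returns 0.5 V<sub>DD</sub> and 0.25 V<sub>DD</sub> for the high and middle levels, a middle-level current of $3V_{DD}/(8R_{TERM})$, and an output impedance of $R_{TERM}$ in both cases, which matches the values stated in the text.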

## III. PAM-3 RECEIVER DESIGN

Fig. 6 shows the top architecture of the PAM-3 receiver. For single-ended multi-level signaling, equalization is a critical issue. Notably, the feedback delay of a decision feedback equalizer (DFE) limits the maximum data rate as well as the energy efficiency. To deal with these design constraints, a novel DFE structure is proposed. The one-tap half-rate DFE reduces the hardware cost by optimizing the structure of a unit tap for the PAM-3 signaling. Because the proposed DFE operates with differential inputs, a single-to-differential (S2D) amplifier is designed, followed by a conventional continuous-time linear equalizer (CTLE). The clock signal used in the DFE is shared with the eye-opening monitor (EOM) circuit, and the clock edge timing is controlled by a digitally controlled delay line (DCDL). The DCC and the four-phase generator improve the sampling margin of the receiver. The inter-symbol interference (ISI)-compensated PAM-3 data are decoded to the three original data streams, namely, A–C, using the 3-bit/two-UI half-rate PAM-3 decoder. To avoid one-UI shifted windowing, a one-UI shift detector and a 2:1 multiplexer are included in the receiver. Because the high–high case is not included in the encoding logic, a high–high case detected in the receiver can be used to solve the windowing issue. In the half-rate structure, the window is easily shifted by one UI by inverting the sampling clock. The select bit of the 2:1 multiplexer is set by the one-UI shift detector.

Fig. 5. Single-ended voltage-mode PAM-3 driver. (a) Schematic. (b) Operations.

Fig. 6. Top architecture of the receiver.

Fig. 7. Schematic of S2D amplifier.

Fig. 8. CTLE. (a) Schematic. (b) AC response.

#### A. S2D Amplifier and CTLE

The S2D amplifier enables the use of differential equalizers in single-ended memory interfaces. The output driver of the transmitter generates PAM-3 data with a 0.3-V peak-to-peak voltage and a 0.15-V common-mode voltage. The first stage, which is designed as a conventional current-mode-logic (CML) driver, receives the transmitted PAM-3 data with a reference voltage of 0.15 V, as shown in Fig. 7. In the CTLE of Fig. 8(a), the feed-forward path through the series resistors $R_F$ boosts the positive gain. The CTLE, designed as per the conventional structure, equalizes the attenuated PAM-3 data. As shown in Fig. 8(b), the peaking point of the CTLE is set slightly above 10 GHz, and the peaking gain is controlled via an analog voltage.

#### B. PAM-3 DFE

A DFE is an essential circuit for single-ended interfaces because it can eliminate ISI without amplifying high-frequency noise, such as crosstalk and SSN. In PAM-3 signaling, there are three decision levels: high, middle, and low. These decision levels can be expressed using two bits, as shown in Fig. 9(a); consequently, the conventional PAM-3 DFE has the same complexity as a PAM-4 DFE. Because one case of the unit DFE tap is unused, the hardware cost of the PAM-3 DFE is higher than necessary. If the number of DFE taps, $N$, is increased to compensate for a large ISI arising from a high channel loss, the total number of current branches increases to $2N$, resulting in a large parasitic capacitance at the output nodes [16], [17].



Fig. 9. Schematics and operations of PAM-3 DFE. (a) Conventional approach. (b) Proposed approach.

TABLE III  
COMPARISON OF THE CONVENTIONAL AND THE PROPOSED DFE IMPLEMENTATION

| Implementation | Subtracted Current (Decided Value: High) | Subtracted Current (Decided Value: Mid) | Subtracted Current (Decided Value: Low) | No. of Current Branches for N-tap Full-rate DFE Implementation | No. of Feedback Signals |
|----------------|------------------------------------------|-----------------------------------------|-----------------------------------------|----------------------------------------------------------------|-------------------------|
| Conventional   | $I_{TAP}$                                | $0.5I_{TAP}$                            | 0                                       | $2 \times N$<br>(Same as that of PAM-4)                        | $4 \times N$            |
| Proposed       | $I_{TAP}$                                | $0.5I_{TAP} \pm \alpha$                 | 0                                       | $N$<br>(Same as that of NRZ)                                   | $2 \times N$            |

To reduce the hardware cost for the PAM-3 DFE, a new structure for the unit tap, comprising a single current branch, is proposed, as shown in Fig. 9(b). The unit tap uses an additional case in which both input PMOS transistors are fully turned on. In this case, half of the tail current flows through each PMOS, representing the intermediate level. As for the differential input cases, the operation of the unit tap is the same as that of the conventional structure. Using these three cases, the unit tap can express the three different input cases of PAM-3 signaling with the same complexity as that of NRZ signaling.
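The three operating cases of the proposed unit tap can be summarized with a small behavioral model. This is a sketch under ideal-switch assumptions: the names `tap_currents`, `gate_a`, and `gate_b` are hypothetical, a PMOS input is taken to conduct when its gate is low, and the small offset $\alpha$ listed in Table III is ignored.

```python
def tap_currents(gate_a, gate_b, i_tap=1.0):
    """Split the tail current of one unit tap between its two PMOS inputs.

    gate_a and gate_b are logic levels (0 or 1) on the two PMOS gates.
    A PMOS conducts when its gate is 0. Returns (i_a, i_b), the current
    steered into each output branch.
    """
    on_a = gate_a == 0
    on_b = gate_b == 0
    if on_a and on_b:
        # Additional case: both inputs on, so the tail current splits evenly.
        return i_tap / 2, i_tap / 2
    if on_a:
        return i_tap, 0.0
    if on_b:
        return 0.0, i_tap
    raise ValueError("both inputs off: not a valid PAM-3 decision state")


# The two differential cases plus the additional both-on case.
for gates in ((0, 1), (0, 0), (1, 0)):
    print(gates, "->", tap_currents(*gates))
```

The current in one branch takes the values $I_{TAP}$, $0.5I_{TAP}$, and 0 across the three cases, which is the "Proposed" row of Table III with a single current branch per tap.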

Fig. 10. Simulated DFE offset current.

Fig. 11. Schematic of a one-tap half-rate PAM-3 DFE.

Table III presents a comparison between the conventional and the proposed unit taps of PAM-3 DFEs. Both can express three different input cases. The number of current branches of the proposed unit tap is half that of the conventional structure. Due to the reduced number of current branches, the number of feedback signals is also decreased to 50% of that of the conventional structure. The proposed DFE brings greater benefits as the number of taps increases because the feedback path and the DFE summer can be designed area-efficiently. However, the currents flowing through the two PMOS inputs differ slightly because of channel-length modulation. The current mismatch is affected by the common-mode voltage and the peak-to-peak voltage swing of the DFE output nodes, as shown in Fig. 10. According to the simulated results, the maximum offset current is 7% of the tap current when the common-mode voltage is set to 0.2 V, and this value is small enough to be ignored. When both the previous and current data are middle, the offset current is zero because the two PMOS inputs have the same $V_{SD}$.

To implement the proposed DFE unit tap, Comp\_H and Comp\_LB, which are the outputs of the comparators, are applied to the two inputs instead of a differential signal pair. When the decided value is middle, both Comp\_H and Comp\_LB become zero, which turns on the two PMOS inputs. The one-tap half-rate DFE is designed as shown in Fig. 11. The half-rate operation is implemented by using PMOS switches and quadrature clocks, which are delayed by 90° from the sampling clocks of the comparators. Only two current branches are used for the one-tap half-rate operation, resulting in a lower parasitic capacitance compared with that of the conventional structure. The DFE coefficient can be controlled from 0 to 0.4.
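A behavioral model helps to show what a one-tap tri-level DFE contributes to the voltage margin. The sketch below assumes PAM-3 levels normalized to −1, 0, and +1, slicer thresholds at ±0.5, and a noise-free channel with a single post-cursor `h1`. These values and all function names are illustrative assumptions, not measured data; only the Comp\_H/Comp\_LB coding of the middle level (both zero) is taken from the text.

```python
import random

LEVELS = (-1.0, 0.0, 1.0)
TH_HIGH, TH_LOW = 0.5, -0.5


def slice_pam3(v):
    """Two-comparator slicer using the Comp_H / Comp_LB convention."""
    comp_h = int(v > TH_HIGH)          # 1 only for a high decision
    comp_lb = int(not (v > TH_LOW))    # 1 only for a low decision
    # (comp_h, comp_lb): (1, 0) = high, (0, 0) = middle, (0, 1) = low
    return float(comp_h - comp_lb)


def worst_margin(symbols, h1, coeff):
    """Return (smallest distance to a threshold, number of symbol errors)."""
    prev_tx = prev_dec = 0.0
    margin = float("inf")
    errors = 0
    for s in symbols:
        rx = s + h1 * prev_tx           # channel adds the first post-cursor
        eq = rx - coeff * prev_dec      # DFE subtracts the scaled decision
        dec = slice_pam3(eq)
        errors += dec != s
        margin = min(margin, abs(eq - TH_HIGH), abs(eq - TH_LOW))
        prev_tx, prev_dec = s, dec
    return margin, errors


rng = random.Random(0)
symbols = [rng.choice(LEVELS) for _ in range(2000)]

margin_off, errors_off = worst_margin(symbols, h1=0.35, coeff=0.0)
margin_on, errors_on = worst_margin(symbols, h1=0.35, coeff=0.35)
print(f"DFE off: worst margin {margin_off:.3f}, errors {errors_off}")
print(f"DFE on : worst margin {margin_on:.3f}, errors {errors_on}")
```

With the assumed post-cursor of 0.35, the worst-case distance to a threshold is about 0.15 without the DFE and returns to about 0.5 once the tap coefficient matches the post-cursor.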

The outputs of the strong-arm latch comparators are return-to-zero (RZ) data because of the pre-charge states. In conventional high-speed wireline interfaces, set–reset (SR) latches have been used for NRZ conversion. However, CMOS-based SR latches have a bandwidth limitation due to the feedback time. To overcome the speed limitation, CML-based latches have commonly been used in high-speed links, although they consume a large static current [18].

Fig. 12. RZ-to-NRZ converter.

Fig. 13. 3-bit/two-UI PAM-3 decoder.

Fig. 14. DCC.

To achieve a high speed and low power dissipation, a CMOS-based RZ-to-NRZ converter is designed with floating PMOS switches, as shown in Fig. 12. The RZ-to-NRZ conversion is performed by floating the output nodes during the pre-charge state of the strong-arm latch comparators. The feedback-less RZ-to-NRZ conversion reduces delay and power consumption. Compared with the CML latch, it operates at a high data rate without static current dissipation. However, the clock duty distortion should be minimized to secure the timing margin.

The half-rate structure is effective for two-UI encoding because an additional register is not required. The 3-bit/two-UI decoder receives four outputs from the comparators, as shown in Fig. 13. Using the decoding logic, the original full-rate data are decoded. When the unassigned data pattern, shown in Table II, is detected, the sampling clock phase is shifted for proper operation. The BER of the recovered data is measured using the output driver and BER test equipment.
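The hold behavior of the feedback-less RZ-to-NRZ conversion described above can be illustrated with a purely logical model. The sketch assumes that a floating output node keeps its previous value for the duration of the pre-charge phase; `rz_to_nrz` and the use of `None` to mark pre-charge are illustrative choices, not part of the design.

```python
def rz_to_nrz(rz_samples, initial=0):
    """Hold the last evaluated value while the comparator is pre-charged.

    rz_samples: sequence in which None marks a pre-charge phase (the
    tri-state buffer is off and the output node floats) and 0 or 1 marks
    an evaluated comparator decision.
    """
    out = []
    held = initial
    for sample in rz_samples:
        if sample is not None:
            held = sample       # evaluation phase: the buffer drives the node
        out.append(held)        # pre-charge phase: the node keeps its value
    return out


rz = [1, None, 0, None, 0, None, 1, None]
print(rz_to_nrz(rz))  # [1, 1, 0, 0, 0, 0, 1, 1]
```

Because no feedback loop is involved, the conversion delay in this view is only the buffer delay, at the cost of relying on the floating node to hold its value for half a clock period. This is consistent with the requirement above that clock duty distortion be minimized.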

#### C. Clock Path

The receiver also uses the external clock, as shown in Fig. 6. The DCDL comprises a NAND-based coarse delay line and a phase-interpolation-based fine delay line. The sampling clock phase is controlled manually through an external register. Then, a DCC compensates for duty distortion to increase the sampling margin in the half-rate structure. As shown in Fig. 14, the DCC consists of a digitally controlled duty-cycle adjuster (DCA) and a bang-bang duty-cycle detector (DCD). The back-to-back inverter and shunt capacitors amplify the small duty difference of the differential input clocks to a digital output. By averaging the upper and lower bounds of the DCA codes, a 50% clock duty is achieved. After the duty-cycle correction, the DCD is disabled to save power.

In the one-tap half-rate DFE, quadrature clocks are required because post-cursor cancellation should be completed within 0.5 UI to achieve an optimum eye margin. Fig. 15 shows a block diagram of a quadrature-phase clock generator based on phase interpolation. There are two NAND-based DCDLs that have a coarse delay resolution of 20 ps at a typical corner. A D-flip-flop-based phase detector sets the delays of the DCDLs to 90° by aligning the clock phases of CKP(t) and CKN(t-2T<sub>D</sub>). The DCDLs do not have to be set to an exact 90° phase delay because the quadrature phases are achieved by interpolating the two clocks, which have an average phase of 90°. To match the load, CK<sub>I</sub> and CK<sub>IB</sub> also use the same phase interpolator, and dummy loads are added to the output nodes, except for CKP(t-T<sub>D</sub>) and CKN(t-T<sub>D</sub>). The generated CK<sub>I</sub> and CK<sub>IB</sub> are used for the half-rate comparators, and CK<sub>Q</sub> and CK<sub>QB</sub> are used for the unit taps of the DFE summer.

Fig. 15. Quadrature-phase clock generator.

Fig. 16. EOM. (a) Block diagram. (b) Operations.

#### D. Eye-Opening Monitor (EOM)

The functionality of the PAM-3 DFE is verified by using the EOM circuit, as shown in Fig. 16. To minimize bandwidth degradation, parasitic capacitors are used as sampling capacitors. The operating clock is divided by 1024 to generate a sampling clock. Then, an internal DCDL controls the sampling timing to cover the two UIs of the PAM-3 data. The sampled analog data are measured through the unity-gain buffers. By externally processing the data, the eye diagrams of the DFE output nodes are recovered.

Fig. 17. Test environment.

Fig. 18. Chip microphotograph.

## IV. MEASUREMENT

#### A. Test Environment

The test environment of the PAM-3 transceiver is shown in Fig. 17. The PAM-3 transceiver includes four channels to account for the SSN and crosstalk. In the transmitter, internal PRBS data (A–C) are generated by using an external clock provided by a programmable pattern generator (PPG). Then, the three PRBS data streams are modulated to PAM-3 data by the 3-bit/two-UI PAM-3 encoder and the output driver. The PAM-3 data of the transmitter are applied to the receiver via four single-ended printed circuit board (PCB) channels. The receiver also uses an external clock of the same frequency as that of the transmitter for the DFE and the comparators. The recovered data, namely, A–C, are measured by the sampling oscilloscope and the programmable error detector (PED). The sampled EOM data are also measured by a real-time oscilloscope to recover the internal eye diagram of the DFE output nodes. The internal test options of the transceiver are controlled through an inter-integrated circuit ($I^2C$) interface. As shown in Fig. 18, the chip areas of the transmitter and the receiver are 0.00338 and 0.0106 $\text{mm}^2$ per channel, respectively, including the test blocks.



Fig. 19. Measured PAM-3 eye diagrams of the transmitter. PRBS-7 input data at (a) 24 Gb/s, (b) 27 Gb/s, (c) 30 Gb/s, and (d) PRBS-15 input data at 30 Gb/s.



Fig. 20. Plot of voltage margins versus the number of DQs (PRBS7).

#### B. Measurement Results of Transmitter

The proposed 3-bit/two-UI PAM-3 transmitter operates with the half-rate clock. The data rate of the output of the transmitter is three times the clock frequency because the two-UI PAM-3 symbol contains three bits. The LVSTL output driver uses the supply voltage of 0.6 V, modulating the PAM-3 outputs with a peak-to-peak voltage of 0.3 V. The measured root-mean-square (rms) jitter of the input clock is 0.33 ps at 10 GHz. A real-time oscilloscope is used to measure the output signal of the transmitter. The measured eye diagrams of the transmitter without a channel attenuation are shown in Fig. 19. The worst eye for the PAM-3 data has a horizontal eye width of 0.576 UI and a voltage margin of 69.6 mV for a 30-Gb/s PRBS-7 pattern. In the case of the PRBS-15 pattern, the eye width and height are 0.55 UI and 66.7 mV, respectively. In accordance with the encoding logic presented in Table II, the high-to-high transition between odd and even UIs is not observed in the measured results; this decreases the modulation power dissipation.
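The relation between the clock frequency, the data rate, and the eye widths quoted above can be made explicit with a short calculation. The sketch assumes the half-rate operation described in the text, with two UIs per clock period; the constant and function names are illustrative.

```python
CLOCK_GHZ = 10.0        # external half-rate clock
UIS_PER_CLOCK = 2       # half-rate: two UIs per clock period
BITS_PER_WINDOW = 3     # three bits are encoded ...
UIS_PER_WINDOW = 2      # ... into every two-UI window

symbol_rate_gbaud = CLOCK_GHZ * UIS_PER_CLOCK
data_rate_gbps = symbol_rate_gbaud * BITS_PER_WINDOW / UIS_PER_WINDOW
ui_ps = 1000.0 / symbol_rate_gbaud   # UI duration in picoseconds


def ui_to_ps(width_ui):
    """Convert a width given in UI to picoseconds."""
    return width_ui * ui_ps


print(f"symbol rate: {symbol_rate_gbaud:.0f} GBaud, UI = {ui_ps:.0f} ps")
print(f"data rate  : {data_rate_gbps:.0f} Gb/s")
print(f"PRBS-7 eye width : {ui_to_ps(0.576):.1f} ps")
print(f"PRBS-15 eye width: {ui_to_ps(0.55):.1f} ps")
```

With a 10-GHz clock the UI is 50 ps, so the measured eye widths of 0.576 UI and 0.55 UI correspond to about 28.8 ps and 27.5 ps, respectively.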

The voltage margins of the transmitter under SSN are shown in Fig. 20. The SSN varies with the number of enabled channels. When all the transmitters are enabled, the height of the lowest eye decreases from 69.6 to 61.5 mV. The effect of the SSN cannot be easily eliminated, although scaling down the driving current and using a decoupling capacitor can mitigate the degradation caused by SSN.



Fig. 21. RX recovered data. (a) A. (b) B. (c) C.



Fig. 22. Measured bathtubs (24–30 Gb/s; PRBS7/15).

#### C. Measurement Results of Receiver

The proposed 3-bit/two-UI PAM-3 receiver is also tested with the external half-rate clock. The attenuated PAM-3 data are amplified to differential outputs using an S2D amplifier and a ground-terminated 50-$\Omega$ resistor. Then, the CTLE and DFE compensate for the ISI of the PAM-3 signal. The control voltage of the CTLE is set to 0.43 V for 6.5-dB peaking at 10 GHz, and the DFE coefficient is adjusted to 0.35 for the smallest BER. The PRBS data recovered from the 3-bit/two-UI full-rate decoder are measured, as shown in Fig. 21. The 30-Gb/s PAM-3 data are recovered to 10-Gb/s full-rate data, namely, A–C.

The BER of the recovered data is measured by the PED, and bathtub curves are plotted by changing the sampling clock phase. According to Fig. 13, data B and C have more complex decoder logic than data A. Among the measured data of B and C, C has the worst BER, and the measured bathtub curves of the recovered data C are shown in Fig. 22. According to the bathtub curves, the horizontal eye margin at a BER of $10^{-12}$ is 0.14 UI at the maximum data rate through an 80-mm FR-4 channel that has 6.6-dB attenuation at 10 GHz. When the PRBS-15 pattern is enabled, the horizontal margin is measured to be 0.103 UI. The recovered on-chip eye diagrams of the DFE summer output from the EOM circuit are shown in Fig. 23. When the DFE is disabled, there is no voltage sensing margin due to the post-cursor of the PAM-3 data. In contrast, when the proposed PAM-3 DFE is enabled, it cancels the first post-cursor, resulting in a larger sampling margin than in the DFE-disabled case.

### D. Power Breakdown and Comparison Table

Fig. 23. Recovered 30-Gb/s eye diagram. (a) DFE disabled and (b) enabled eye diagrams of DFE outputs.

Fig. 24. Power dissipations of (a) transmitter and (b) receiver.

Fig. 24 shows the power breakdowns of the transmitter and receiver. The transmitter consumes 7.9 mW at the maximum data rate. Clocking accounts for the largest portion of the power in both the transmitter and the receiver. The output driver consumes 2.17 mW, which is 46.5% lower than that of the previous design [12], owing to the revised encoder and driver. The power efficiency of the transmitter is 0.26 pJ/bit at 30 Gb/s. The receiver consumes 25.5 mW, and its power efficiency is 0.85 pJ/bit at the maximum data rate. The power dissipation of each sub-circuit is estimated from the measured total power dissipation and the simulated results.
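The energy-per-bit figures follow directly from the reported power and data rate, since 1 mW divided by 1 Gb/s equals 1 pJ/bit. A minimal Python sketch of this arithmetic is given below; the function and variable names are illustrative.

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Energy per bit in pJ/bit: mW / (Gb/s) = pJ/bit."""
    return power_mw / data_rate_gbps


DATA_RATE_GBPS = 30.0  # maximum data rate
TX_POWER_MW = 7.9      # measured transmitter power
RX_POWER_MW = 25.5     # measured receiver power

tx_pj = energy_per_bit_pj(TX_POWER_MW, DATA_RATE_GBPS)
rx_pj = energy_per_bit_pj(RX_POWER_MW, DATA_RATE_GBPS)
total_pj = energy_per_bit_pj(TX_POWER_MW + RX_POWER_MW, DATA_RATE_GBPS)
```

Rounded to two decimal places, these evaluate to 0.26, 0.85, and 1.11 pJ/bit for the transmitter, the receiver, and the complete transceiver, respectively.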

A comparison with state-of-the-art memory interfaces is shown in Table IV. Compared with NRZ signaling [18], [19], the proposed PAM-3 single-ended memory interface has a pin efficiency of 150%, and the forwarded clock frequency is reduced to 2/3 of that of NRZ signaling at the same data rate. Unlike duo-binary signaling, the proposed 3-bit/two-UI PAM-3 signaling has a superior pin efficiency, resulting in a lower clock frequency. The transceiver is robust to per-pin skews because it uses multi-window coding instead of multi-wire coding [13], [14]. The power efficiency of the single-ended PAM-3 transceiver is measured to be 1.11 pJ/bit at the maximum data rate. It also uses the lowest supply voltage of the output driver, thereby reducing the PAM-3 modulation power dissipation.

TABLE IV  
COMPARISON OF STATE-OF-THE-ART MEMORY INTERFACES

|                           | [11]<br>JSSC 2014                | [13]<br>ISSCC 2016           | [14]<br>ISSCC 2017           | [19]<br>TCAS-I 2018       | [20]<br>JSSC 2019         | [12]<br>ISSCC 2019<br>(Previous work) | <b>This work</b>            |
|---------------------------|----------------------------------|------------------------------|------------------------------|---------------------------|---------------------------|---------------------------------------|-----------------------------|
| Signaling Style           | Duo-binary<br>(1 bit/1-UI·1-pin) | CNRZ-5<br>(5 bit/1-UI·6-pin) | C-PHY<br>(16 bit/7-UI·3-pin) | NRZ<br>(1 bit/1-UI·1-pin) | GRS<br>(1 bit/1-UI·1-pin) | PAM-3<br>(3 bit/2-UI·1-pin)           | PAM-3<br>(3 bit/2-UI·1-pin) |
| Data Rate per pin         | 6.4 Gb/s                         | 20.83 Gb/s                   | 6.85 Gb/s                    | 10 Gb/s                   | 25 Gb/s                   | 27 Gb/s                               | <b>30 Gb/s</b>              |
| Energy per bit            | 0.56 pJ/bit                      | 0.94 pJ/bit                  | 0.5 pJ/bit                   | 4.18 pJ/bit               | 1.17 pJ/bit               | 1.03 pJ/bit                           | <b>1.11 pJ/bit</b>          |
| Forwarded Clock Frequency | 3.2 GHz<br>(half-rate)           | 3.125 GHz<br>(RX PLL)        | -                            | 5 GHz<br>(1/2-rate)       | 12.5 GHz<br>(1/2-rate)    | 9 GHz<br>(1/3 rate)                   | 10 GHz<br>(1/3 rate)        |
| Pin Efficiency            | 100%                             | 83%                          | 76%                          | 100%                      | 100%                      | 150%                                  | <b>150%</b>                 |
| VDDQ                      | 1.05 V                           | 0.925 V                      | 1 V                          | 1 V                       | 0.7 V                     | 0.6 V                                 | <b>0.6 V</b>                |
| Equalization              | Duo-binary                       | T-Coil + CTLE                | Active Inductor              | FFE + IIR + DFE           | T-Coil + CTLE             | DFE                                   | CTLE + DFE                  |
| Reach                     | 15 mm                            | 12 mm                        | 40 mm                        | 100 mm                    | 80 mm                     | 20 mm                                 | 80 mm                       |
| Area                      | 0.0333 mm<sup>2</sup>            | 0.629 mm<sup>2</sup>         | 0.023 mm<sup>2</sup>         | 0.0091 mm<sup>2</sup>     | 0.3875 mm<sup>2</sup>     | 0.0135 mm<sup>2</sup>                 | 0.014 mm<sup>2</sup>        |
| Technology                | 65 nm CMOS                       | 65 nm CMOS                   | 65 nm CMOS                   | 65 nm CMOS                | 16 nm CMOS                | 28 nm CMOS                            | 28 nm CMOS                  |
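The pin-efficiency and forwarded-clock entries of Table IV can be reproduced from the signaling definitions with the short Python sketch below. It assumes that pin efficiency is the number of bits per UI per pin and that a half-rate forwarded clock runs at half the symbol rate; the helper names are illustrative, and the clock helper applies only to the single-pin schemes that forward a half-rate clock.

```python
from fractions import Fraction

# (bits, UIs, pins) per code word, taken from the "Signaling Style" row of Table IV.
SCHEMES = {
    "Duo-binary [11]": (1, 1, 1),
    "CNRZ-5 [13]": (5, 1, 6),
    "C-PHY [14]": (16, 7, 3),
    "NRZ [19]": (1, 1, 1),
    "PAM-3 (this work)": (3, 2, 1),
}


def pin_efficiency(bits, uis, pins):
    """Bits transferred per UI per pin, as an exact fraction."""
    return Fraction(bits, uis * pins)


def half_rate_clock_ghz(data_rate_gbps, bits_per_ui):
    """Half-rate clock frequency in GHz for a single-pin link: (symbol rate) / 2."""
    symbol_rate_gbaud = data_rate_gbps / bits_per_ui
    return symbol_rate_gbaud / 2


# Pin efficiency of every scheme, in percent.
efficiency_pct = {
    name: float(100 * pin_efficiency(*code)) for name, code in SCHEMES.items()
}

pam3_clock = half_rate_clock_ghz(30.0, 1.5)  # 3 bits per 2 UIs
nrz_clock = half_rate_clock_ghz(30.0, 1.0)   # hypothetical NRZ link at the same rate
```

With these definitions, `pam3_clock` is 10 GHz and `nrz_clock` is 15 GHz, so the PAM-3 forwarded clock runs at two-thirds of the NRZ clock frequency for the same 30-Gb/s data rate.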

## V. CONCLUSION

In this article, a 3-bit/two-UI PAM-3 transceiver was proposed to realize a high-speed memory interface. In single-ended memory interfaces, PAM-3 signaling is more feasible than PAM-4 signaling. Compared with NRZ signaling, the proposed single-ended PAM-3 interface achieves a 1.5 times higher pin efficiency. Furthermore, the forwarded clock frequency is reduced to 2/3 of that of NRZ signaling at the same data rate. The half-rate structure, commonly adopted in DRAM interfaces, is used for the implementation of 3-bit/two-UI PAM-3 encoding and decoding. To mitigate the inefficiency of PAM-3 circuits, a tri-level DFE is proposed with the same complexity as that of NRZ signaling. The feedback delay of the one-tap DFE is decreased by using a CMOS-based RZ-to-NRZ converter with low power dissipation. The power dissipations of the transmitter and the receiver are 7.9 and 25.5 mW, respectively, at the maximum data rate of 30 Gb/s. An 80-mm FR-4 channel is used for the measurement of the transceiver. Four channels of PAM-3 transceivers are implemented for a total data rate of 4 × 30 Gb/s.

In high-speed DRAM interfaces, channel attenuation and clock performance degradation have been critical design issues. PAM-N signaling has been extensively employed in high-speed differential links, such as Ethernet interfaces. In this regard, the proposed 3-bit/two-UI PAM-3 signaling can be a good candidate for application in future memory interfaces.

## REFERENCES

- [1] D. Kim *et al.*, “23.2 A 1.1 V 1ynm 6.4Gb/s/pin 16Gb DDR5 SDRAM with a phase-rotator-based DLL, high-speed SerDes and RX/TX equalization scheme,” in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2019, pp. 380–381.
- [2] K.-S. Ha *et al.*, “23.1 A 7.5Gb/s/pin LPDDR5 SDRAM with WCK clocking and non-target ODT for high speed and with DVFS, internal data copy, and deep-sleep mode for low power,” in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2019, pp. 378–379.
- [3] Y.-J. Kim *et al.*, “A 16-Gb, 18-Gb/s/pin GDDR6 DRAM with per-bit trainable single-ended DFE and PLL-less clocking,” *IEEE J. Solid-State Circuits*, vol. 54, no. 1, pp. 197–209, Jan. 2019.
- [4] C. Kim, *High-Bandwidth Memory Interface*. New York, NY, USA: Springer, Nov. 2014.
- [5] W.-J. Yun *et al.*, “17.7 A digital DLL with hybrid DCC using 2-step duty error extraction and 180° phase aligner for 2.67Gb/s/pin 16Gb 4-H stack DDR4 SDRAM with TSVs,” in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2015, pp. 322–323.
- [6] J. H. Cho *et al.*, “A 1.2 V 64Gb 341GB/S HBM2 stacked DRAM with spiral point-to-point TSV structure and improved bank group data control,” in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2018, pp. 208–209.
- [7] A. Cevrero *et al.*, “6.1 A 100Gb/s 1.1pJ/b PAM-4 RX with dual-mode 1-tap PAM-4 / 3-Tap NRZ speculative DFE in 14nm CMOS FinFET,” in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2019, pp. 112–113.
- [8] J. L. Zerbe *et al.*, “1.6 Gb/s/pin 4-PAM signaling and circuits for a multidrop bus,” *IEEE J. Solid-State Circuits*, vol. 36, no. 5, pp. 752–760, May 2001.
- [9] K.-L. Fu and S.-I. Liu, “A 56Gbps PAM-4 optical receiver front-end,” in *Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC)*, Nov. 2017, pp. 77–80.
- [10] C.-T. Hung, Y.-P. Huang, and W.-Z. Chen, “A 40 Gb/s PAM-4 receiver with 2-Tap DFE based on automatically non-even level tracking,” in *Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC)*, Nov. 2018, pp. 213–214.
- [11] S.-M. Lee *et al.*, “An 80 mV-swing single-ended duobinary transceiver with a TIA RX termination for the point-to-point DRAM interface,” *IEEE J. Solid-State Circuits*, vol. 49, no. 11, pp. 2618–2630, Nov. 2014.
- [12] H. Park, J. Song, Y. Lee, J. Sim, J. Choi, and C. Kim, “23.3 A 3-bit/2UI 27Gb/s PAM-3 single-ended transceiver using one-tap DFE for next-generation memory interface,” in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2019, pp. 382–383.
- [13] A. Shokrollahi *et al.*, “10.1 A pin-efficient 20.83Gb/s/wire 0.94pJ/bit forwarded clock CNRZ-5-coded SerDes up to 12 mm for MCM packages in 28nm CMOS,” in *IEEE ISSCC Dig. Tech. Papers*, Jan. 2016, pp. 182–183.
- [14] W. Choi, T. Kim, J. Shim, H. Kim, G. Han, and Y. Chae, “23.8 A 1 V 7.8 mW 15.6Gb/s C-PHY transceiver using tri-level signaling for post-LPDDR4,” in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2017, pp. 402–403.
- [15] A. X. Widmer and P. A. Franaszek, “A DC-balanced, partitioned-block, 8B/10B transmission code,” *IBM J. Res. Develop.*, vol. 27, no. 5, pp. 440–451, Sep. 1983.

- [16] W. C. Chen *et al.*, “A 56Gb/s PAM-4 receiver with voltage pre-shift CTLE and 10-tap DFE of tap-1 speculation in 7nm FinFET,” in *Proc. Symp. VLSI Circuits*, Jun. 2019, pp. 272–273.
- [17] J. Im *et al.*, “A 40-to-56 Gb/s PAM-4 receiver with ten-tap direct decision-feedback equalization in 16-nm FinFET,” *IEEE J. Solid-State Circuits*, vol. 52, no. 12, pp. 3486–3502, Dec. 2017.
- [18] T. Toifl *et al.*, “A 22-Gb/s PAM-4 receiver in 90-nm CMOS SOI technology,” *IEEE J. Solid-State Circuits*, vol. 41, no. 4, pp. 954–965, Apr. 2006.
- [19] J. Song, S. Hwang, H.-W. Lee, and C. Kim, “A 1-V 10-Gb/s/pin single-ended transceiver with controllable active-inductor-based driver and adaptively calibrated cascaded-equalizer for post-LPDDR4 interfaces,” *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 65, no. 1, pp. 331–342, Jan. 2018.
- [20] J. W. Poulton *et al.*, “A 1.17-pJ/b, 25-Gb/s/pin ground-referenced single-ended serial link for off- and on-package communication using a process- and temperature-adaptive voltage regulator,” *IEEE J. Solid-State Circuits*, vol. 54, no. 1, pp. 43–54, Jan. 2019.



**Hyunsu Park** (Graduate Student Member, IEEE) received the B.S. degree in electronics engineering and the M.S. degree in semiconductor system engineering from Korea University, Seoul, South Korea, in 2016 and 2018, respectively, where he is currently pursuing the integrated Ph.D. degree in integrated circuits and systems.

His research interests include memory interfaces, high-speed transceivers, and clock generation circuit design.



**Junyoung Song** (Member, IEEE) received the B.S. and M.S. degrees in electronics engineering and the Ph.D. degree in electrical and computer engineering from Korea University, Seoul, South Korea, in 2008, 2010, and 2014, respectively.

In 2012, he was a Visiting Scholar with the University of California at Los Angeles, Los Angeles, CA, USA. In 2014, he joined the Analog Serial I/O Group, Intel Corporation, San Jose, CA, USA, where he was involved in the wireline transceiver design for high-performance FPGA. Since 2018, he has been with the School of Electronics Engineering, Incheon National University, Incheon, South Korea, where he is currently an Assistant Professor. He has coauthored the book *High-Bandwidth Memory Interface* (Springer, 2013). His research interests include the high-speed wireline transceiver, memory, and clock generator.

Dr. Song received the IEEE Seoul Section Student Paper Contest Bronze Award in 2011 and 2013 and the Minister of Ministry of Education, Science and Technology Award at the Korea Semiconductor Design Contest in 2011. He is serving on the Technical Program Committee of the IEEE Asian Solid-State Circuits Conference.



**Jincheol Sim** (Student Member, IEEE) was born in Seoul, South Korea, in 1992. He received the B.S. degree in electrical engineering from Korea University, Seoul, in 2017, where he is currently pursuing the integrated M.S. and Ph.D. degree.

His current research interest includes high-speed wireline transceivers.



**Yoonjae Choi** (Graduate Student Member, IEEE) was born in Seoul, South Korea, in 1992. He received the B.S. degree in electrical engineering from Korea University, Seoul, in 2016, where he is currently pursuing the integrated M.S. and Ph.D. degree.

His current research interest includes high-speed wireline transceivers.

Mr. Choi was a recipient of the Ministry of Trade, Industry and Energy Award at the Korea Semiconductor Design Contest in 2017. He was granted a student scholarship by the Korea Semiconductor Industry Association in 2016.



**Jonghyuck Choi** (Graduate Student Member, IEEE) received the B.S. degree in electrical engineering from Korea University, Seoul, South Korea, in 2017, where he is currently pursuing the integrated M.S. and Ph.D. degree.

His research interests include memory interfaces, high-speed transceivers, and energy-efficient wireline systems.

Mr. Choi was a recipient of the Prime Minister Award at the Korea Semiconductor Design Contest in 2019 and the Ministry of Trade, Industry and Energy Award at the Korea Semiconductor Design Contest in 2017. He was granted a student scholarship by the Korea Semiconductor Industry Association in 2018.



**Jeongsik Yoo** received the B.S. degree in electronics engineering from Chungnam National University, Daejeon, South Korea, in 2007, and the M.S. degree from the Korea Advanced Institute of Science and Technology, Daejeon, in 2009. He is currently pursuing the Ph.D. degree in industrial–educational cooperation with the AISL Laboratory, Korea University, Seoul, South Korea.

He has been with the Memory Division, Samsung Electronics, Hwaseong, South Korea. His current research interests include signal integrity, power integrity, and transceiver design for high-speed interfaces.



**Chulwoo Kim** (Senior Member, IEEE) received the B.S. and M.S. degrees in electronics engineering from Korea University, Seoul, South Korea, in 1994 and 1996, respectively, and the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana–Champaign, IL, USA, in 2001.

In 1999, he was a Summer Intern with Design Technology, Intel Corporation, Santa Clara, CA, USA. In 2001, he joined the IBM Microelectronics Division, Austin, TX, USA, where he was involved in cell processor design. Since 2002, he has been with the School of Electrical Engineering, Korea University, where he is currently a Professor. He was a Visiting Professor with the University of California at Los Angeles, Los Angeles, CA, USA, in 2008, and the University of California at Santa Cruz, Santa Cruz, CA, USA, in 2012. He has coauthored two books, namely, *CMOS Digital Integrated Circuits: Analysis and Design* (McGraw Hill, Fourth Edition 2014) and *High-Bandwidth Memory Interface* (Springer, 2013). His current research interests are in the areas of wireline transceivers, memories, data converters, and power management.

Dr. Kim received the Samsung HumanTech Thesis Contest Bronze Award in 1996; the ISLPED Low-Power Design Contest Award in 2001 and 2014; the DAC Student Design Contest Award in 2002; the SRC Inventor Recognition Award in 2002; the Young Scientist Award from the Ministry of Science and Technology, South Korea, in 2003; the Seoktop Award for excellence in teaching in 2006 and 2011; the ASP-DAC Best Design Award in 2008; the Special Feature Award in 2014; and the Korea Semiconductor Design Contest: Ministry of Trade, Industry and Energy Award in 2013. He was selected as a Distinguished Lecturer of the IEEE Solid-State Circuits Society from 2015 to 2016. He served on the Technical Program Committee of the IEEE International Solid-State Circuits Conference and as a Guest Editor for the IEEE JOURNAL OF SOLID-STATE CIRCUITS. He serves on the Editorial Board of the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION SYSTEMS.