

# Design Techniques for a 60-Gb/s 288-mW NRZ Transceiver With Adaptive Equalization and Baud-Rate Clock and Data Recovery in 65-nm CMOS Technology

Jaeduk Han, *Student Member, IEEE*, Nicholas Sutardja, *Student Member, IEEE*, Yue Lu, *Member, IEEE*, and Elad Alon, *Senior Member, IEEE*

**Abstract**—Design techniques for a complete 60-Gb/s non-return-to-zero transceiver with adaptive equalization as well as baud-rate clock and data recovery (CDR) are demonstrated. A complete equalization front end with per-path adaptation and per-sampler offset calibration enables 60-Gb/s operation over realistic channels. Current integration in the front end for energy-efficient equalization is combined with integration phase dithering to realize a robust baud-rate CDR. Correlation of the adaptive error sampler output with the phase dithering sequence indicates the direction of phase offset, and the resulting baud-rate CDR saves power and complexity compared to an oversampling CDR by not requiring additional clock phases/deserializers. The proposed 65-nm CMOS transceiver operates at 60 Gb/s with an eye opening of 30% UI and consumes 288 mW while equalizing 21 dB of loss at 30 GHz over a 0.7-m Twinax cable.

**Index Terms**—Chip-to-chip communication, clock and data recovery (CDR), current integration, decision feedback equalizer (DFE), feed-forward equalizer (FFE), high-speed links.

## I. INTRODUCTION

As the rapid growth of Internet connectivity has required systems to become more distributed and data oriented, the demand for high-bandwidth wireline communication systems continues to increase. Industrial standards have responded to this trend by increasing the data-rate of chip I/Os, doubling the per-pin data-rate roughly every four years [1]. Following this trend, 60-Gb/s chip-to-chip transceiver systems will soon (within 2–3 years) need to be widely supported. While bandwidth continues to increase rapidly, the power consumption budgeted for these high-speed transceivers remains relatively constant, which implies that improving energy efficiency is a must. Specifically, transceivers operating around 60 Gb/s should ideally achieve $\sim 5$ pJ/bit efficiency to stay within the budget ($\sim 300$ mW/transceiver) of current designs.

Manuscript received April 18, 2017; revised July 17, 2017; accepted August 7, 2017. Date of publication September 4, 2017; date of current version November 21, 2017. This paper was approved by Guest Editor Mounir Meghelli. (*Corresponding author: Jaeduk Han.*)

J. Han, N. Sutardja, and E. Alon are with the University of California, Berkeley, CA 94704 USA (e-mail: jdhan@eecs.berkeley.edu).

Y. Lu is with Qualcomm Atheros Inc., San Jose, CA 95110 USA.

Color versions of one or more of the figures in this paper are available online at <http://ieeexplore.ieee.org>.

Digital Object Identifier 10.1109/JSSC.2017.2740268

Complex modulation schemes, such as PAM4, have been widely explored by recent transceiver designs [2]–[4] to meet the data-rate requirement. While PAM4 transceivers are capable of handling high-loss channels, most PAM4 designs require error coding and/or complex digital equalization, which typically pushes their power consumption above the aforementioned 5-pJ/bit target. Therefore, conventional non-return-to-zero (NRZ) signaling may remain an attractive alternative if its simpler modulation scheme can be translated to reduced power. If the overall insertion loss remains within a reasonable range ($\sim 20$ dB), which presumably can be achieved by improving the channels themselves [5]–[7], this goal could be achieved by relaxing the circuit complexity.

Several high-speed NRZ receivers [8]–[11] and transmitters [12] have been published recently with various levels of equalization. These designs primarily focus on critical building blocks for equalization, implementing continuous-time linear equalizer (CTLE), decision feedback equalizer (DFE), and receive feed-forward equalizer (FFE) circuits that demonstrate the ability to cancel inter-symbol interference (ISI) while operating close to the intrinsic speed limits of the underlying technologies. Recently published NRZ transceivers operating up to 56.5 Gb/s with a one-tap DFE and/or CTLE [13], [14] in 28–40 nm CMOS processes have been demonstrated for $<20$-dB loss channels with energy efficiencies ranging from 4.4 to 11.96 pJ/bit. These designs represent progress toward complete transceivers at 50–60 Gb/s speeds, and our previous work [11] demonstrated a receiver with more complex equalization while retaining energy efficiency in order to address channels with higher losses.

Continuing in this direction and in order to demonstrate that a complete transceiver can be realized while meeting the bandwidth and power targets, this paper presents techniques for a 60-Gb/s NRZ transceiver implemented in a 65-nm process (Fig. 1) that includes transmit FFE, CTLE, receive FFE, DFE, output slicers, clock generation and distribution, adaptation, and baud-rate clock and data recovery (CDR) [15]. Notably, despite the 65-nm process, the receiver is able to support reasonable equalization (CTLE, two-tap FFE, and three-tap DFE) and robust baud-rate CDR while only consuming 136 mW. This paper begins by briefly reviewing the data-path circuitry and techniques used in the transceiver (with the receiver data-path design adapted from [11]) and the per-path adaptation and calibration loops necessary for finding optimal equalization coefficients. Then, Section III describes in detail the baud-rate CDR, specifically addressing phase wandering, the CDR logic, and a phase dithering technique. Section IV describes the measurement results, showing that the transceiver supports 60 Gb/s under a 21-dB loss channel while consuming 288 mW.

Fig. 1. Transceiver architecture.

## II. 60-Gb/s TRANSCEIVER ARCHITECTURE

This section presents all the components in the transceiver data-path used for cancelling various types of ISI at 60 Gb/s.

### A. Transmitter Design

Since the TX generally must be terminated to match the channel and hence drive relatively low impedance, removing a substantial amount of ISI in terms of amplitude relative to the cursor is generally more expensive than doing so at the RX. However, since the delay elements within a TX FFE can simply consist of digital latches (as opposed to analog latches in a mixed-signal RX FFE), expanding the number of ISI taps covered by the TX is generally less costly than doing so at the RX. The transceiver balances these competing considerations by designing the TX FFE to include three taps (as opposed to the RX FFE's two taps), while having the TX cancel only a small fraction of the precursor ISI. The maximum first and second tap strengths of the TX FFE are designed to be 25% and 12.5% of the main tap strength, respectively, while the two-tap adaptive RX FFE can cancel up to 50% of the main cursor.

The transmitter's block-level architecture in Fig. 2 shows the three phases of serialization used in this design to achieve a serialization ratio of 128:1. A 64:4 CMOS shift register-based serializer and a 4:1 mux-based serializer [12] make up the first two phases of serialization in each (odd and even) path. Shift register-based serializers were selected for the lower-frequency portion for their simplicity, and because of the lower frequency, the absolute power penalty of this choice (as compared to a multiplexer-based design) is relatively small. Following the 4:1 mux-based serializer, current mode logic (CML) muxes [Fig. 3(a)] combined with CML latches [Fig. 3(b)] [8] and drivers [Fig. 3(c)] make up the final 2:1 serialization and three taps of equalization necessary for achieving the desired equalization at the data-rate. The transmitter clocking path is composed of an injection-locked *LC* oscillator that generates a differential 30-GHz clock, a following resonant buffer, and a divide-by-64 chain; this structure is nearly identical to the receive-side clock generation, distribution, and division circuits from [11]. For the TX, the injection-locking devices are always turned on and driven by an external 10-GHz reference clock source. The divider chain consists of a dynamic-latch-based 30-GHz divider, a phase interpolator, a 15-GHz CML divider, a CML2CMOS converter, and following CMOS divider stages.

Fig. 2. Transmitter architecture.

Fig. 3. Transmitter front-end circuits. (a) CML MUX. (b) CML latch. (c) Three-tap FFE.



Fig. 4. RX data-path architecture with adaptation, offset calibration, and dLev tracking loops.

### B. Receiver Design

The complete receiver is adapted from the receiver front end described in detail in [11], using the same equalization methods while adding adaptation and calibration loops. The receiver takes advantage of current integration to provide $\sim 3 \times$ power savings compared to resistively loaded stages [16], [17]. First, the source-degenerated, current-integrating CTLE provides long-tail ISI cancellation while demultiplexing the input signal so that the following stages can operate at half-rate with wider integration windows. Then, two stages of dynamic latches following the CTLE convert the RZ outputs generated from the current-integrating stage to NRZ outputs and also provide the UI-delayed signals for the two-tap FFE. Following the CTLE and latches, two half-rate, current-integrating two-tap FFE and three-tap DFE stages perform precursor and postcursor ISI equalization. Programmable weights in the current-integrating FFE are achieved through variable cascode gate-voltage bias [11]. The current-integrating DFE uses a separate dynamic-latch-based stage to meet the stringent latency requirements on the first postcursor of the DFE feedback path [8]. A retiming latch is added following the DFE in only the odd path, which minimizes the number of high-frequency latches and thereby saves power. Finally, the adaptation and calibration loops necessary for the receiver to find optimal equalization coefficients and compensate for path mismatches due to the retiming structure are described in the following section.

### C. Equalizer Adaptation

In order to support various channel characteristics, the sign-sign-least-mean-squares algorithm [18], [19] is used for adapting DFE coefficients and the zero-forcing algorithm [20] is used for FFE coefficients. As in any adaptation engine, the equalizer coefficients should solely be determined by the channel response, and there should only be one optimal value for each tap. However, since our design incorporates a retiming latch in only the odd path, loading and bandwidth mismatches are present between the even and odd paths; note that the prior summer stages (gm0 in Fig. 4) would have to be substantially overdesigned in order to suppress these mismatches. Instead of incurring a power penalty by overdesigning the summers, dedicated loops for each path (as shown in Fig. 4) are used to handle these mismatches along with any other mismatches due to bias DAC variations, parasitic capacitors, and clock skew. As a result, the even and odd paths will each have their own separate adaptation loops and therefore converge to different equalizer coefficients. In total, 26 adaptation and offset calibration loops (8 for FFE+DFE, 2 for data level (dLev) tracking, and 16 for the samplers' offset calibration) are implemented on chip, as shown in Fig. 4. While this introduces additional hardware, it should be noted that these loops are in the lower-frequency domain, and their power consumption is trivial compared to the power overhead necessary to handle these factors by upsizing and/or adding additional high-frequency stages to the signaling path.
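As a rough behavioral illustration of per-path adaptation, the sketch below runs a sign-sign LMS update on a three-tap DFE together with a dLev tracker for a single path. The channel samples, step size `mu`, initial values, and the function name `adapt_path` are all assumptions made for illustration, decisions are assumed error-free, and the FFE zero-forcing loops and offset calibration loops are not modeled. It is not the on-chip implementation.

```python
import random


def sign(x):
    """Two-level sign, as produced by a sampler (zero is treated as +1)."""
    return 1 if x >= 0 else -1


def adapt_path(channel, n_symbols=20000, mu=0.002, seed=1):
    """Sign-sign LMS for a 3-tap DFE plus dLev tracking on one path.

    channel: [cursor, post1, post2, post3] pulse-response samples (assumed).
    Returns (taps, dlev) after adaptation.
    """
    rng = random.Random(seed)
    taps = [0.0, 0.0, 0.0]   # DFE feedback weights
    dlev = 0.5               # data-level estimate
    hist = [1, 1, 1]         # previous symbols d[n-1], d[n-2], d[n-3]
    for _ in range(n_symbols):
        d = rng.choice((-1, 1))
        rx = channel[0] * d + sum(c * h for c, h in zip(channel[1:], hist))
        y = rx - sum(t * h for t, h in zip(taps, hist))   # DFE summer output
        if d == 1:                       # adapt only on +1 symbols
            e = sign(y - dlev)           # error sampler output
            dlev += mu * e               # dLev tracking
            taps = [t + mu * e * h for t, h in zip(taps, hist)]
        hist = [d] + hist[:2]
    return taps, dlev


if __name__ == "__main__":
    # Hypothetical even/odd paths with slightly different responses.
    print("even:", adapt_path([1.00, 0.40, 0.20, 0.10]))
    print("odd: ", adapt_path([0.95, 0.43, 0.18, 0.10]))
```

Running the loop separately on two slightly different responses mirrors the point made above: each path settles on its own coefficients rather than sharing a compromise value.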

## III. BAUD-RATE CLOCK AND DATA RECOVERY

### A. Clock and Data Recovery Design Considerations

In order to compensate for the phase skew between the received signal and the receiver's internal clock, a CDR was included in the design. Dual-loop CDRs [21]–[23] are currently the most popular choice because they have the advantage (compared to their single-loop CDR counterparts) of allowing separate loop bandwidth selection for clock generation and phase tracking. In contrast, this design uses a single-loop CDR with an *LC* oscillator and resonant buffer in order to generate a low-jitter differential clock and distribute its energy efficiently at a high frequency. The reason for this choice (single-loop CDR) is that although dual-loop CDRs can reject voltage-controlled oscillator noise by having a higher bandwidth in the oscillator loop compared to the phase tracking (often implemented with a phase interpolator) loop, a dual-loop CDR is not necessary in this scenario because the *LC* oscillator in this design already has relatively low phase noise. Furthermore, the power overhead of distributing a 30-GHz clock over multiple lanes would be substantial, while the physical dimension of an on-chip inductor for 30-GHz clock generation is only $\sim 50 \mu\text{m} \times 50 \mu\text{m}$, and is hence easy to fit into the area of a single lane. Therefore, even for multi-lane implementations, once the clock rate is high enough that the physical area of the inductor is relatively small, a single-loop CDR is an attractive option for high-speed applications.

Fig. 5. (a) 2$\times$ oversampling CDR. (b) Baud-rate CDR.

Turning now to the phase detection scheme, commonly adopted 2 $\times$  oversampling CDRs capture phase difference information by sampling the edge signals and comparing them with neighboring data signals, which implies that the CDR requires additional samplers and clock phases for this edge sampling. In particular, four-phase clock generation and distribution (instead of differential clock generation and distribution) and an additional edge sampler as well as deserializer would be necessary for half-rate operation (Fig. 5). These additional samplers and clock phases can be very expensive at data-rates close to the limit of the process technology (which 60 Gb/s is for 65 nm). For this reason, alternative approaches such as baud-rate CDRs have been investigated [14], [24]–[26] to remove the need for edge sampling. The most commonly adopted baud-rate design is based on the Mueller–Muller CDR (MMCDR) [27]. While the MMCDR scheme will find a locking point that represents equal levels of precursor and first tap postcursor ISI, it is not guaranteed that this condition corresponds to an optimum point, or even to a unique point. This issue is denoted as phase wandering. In other words, an MMCDR relies on specific characteristics in the pulse response, and these conditions cannot be guaranteed over variations in channel environment or process. Consider for example the extreme case where the input pulse is a perfect square wave; the first postcursor tap will be equal in level to the first precursor tap over the entire width of the pulse (i.e., the entire UI), and hence the phase selected by the MMCDR is effectively free to wander anywhere (i.e., the design would not operate robustly). To avoid these kinds of issues, we set out to realize a baud-rate CDR with a unique locking point, and we describe our proposed approach in the following section.



Fig. 6. Integration-and-reset front-end behavior.

### B. Baud-Rate CDR Design

This design proposes to augment traditional baud-rate CDRs by utilizing an integration-and-reset front end to create a unique locking point for the CDR. If the received signal is filtered by the integration-and-reset front end, the front-end output will be maximized when the integration window perfectly overlaps with the incoming pulse, which translates to a single locking point, even when the input pulses have wide flat regions.

To demonstrate how the integrate-and-reset front end helps to address the issue of non-unique locking, let us consider the expected value of the output of the front-end  $y(t)$  given a perfectly square input pulse (ranging from 0 to some amplitude)

$$E[y(t)] = \begin{cases} (V_a/T) \cdot [(t - T_0) + T], & \text{if } T_0 - T < t \leq T_0 \\ (V_a/T) \cdot [(T + T_0) - t], & \text{if } T_0 < t \leq T_0 + T \\ 0, & \text{otherwise} \end{cases} \quad (1)$$

where $V_a$ is the amplitude, $T$ is the bit period, and $T_0$ represents the phase of the incoming pulse relative to the receiver. Equation (1) has two implications. First, the average output of the integration-and-reset stage has a triangular waveform with a single peak at $t = T_0$. Second, the integrator output offers the maximum sampling margin at phase $T_0$, since this is the point where the sampler will see the peak output voltage.

Fig. 6 visually depicts the integration-and-reset operation. Noting that the integration happens from  $t$  to  $t + T$  and the sampling is done at  $t + T$ , there is no need for additional clock phases with  $T/2$  spacing (i.e., the receiver is operating at baud-rate). Furthermore, in our receiver design, we already have the integration-and-reset stage in the front-end CTLE [11] (which is denoted as intg0 in Fig. 4); therefore, no additional hardware is required for this operation.
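The triangular shape described by (1) can be reproduced with a short sketch. The function `expected_output` evaluates the piecewise expression, with the second branch taken as the falling edge $T_0 < t \leq T_0 + T$ so that the triangle peaks at $t = T_0$. The function `window_average` averages a square pulse over the integration window from $t$ to $t + T$. The pulse extent $[T_0, T_0 + T)$ and both function names are assumptions chosen for illustration.

```python
def expected_output(t, t0, T=1.0, va=1.0):
    """Expected integrator output E[y(t)] from (1) for a square input pulse."""
    if t0 - T < t <= t0:
        return (va / T) * ((t - t0) + T)   # rising edge of the triangle
    if t0 < t <= t0 + T:
        return (va / T) * ((T + t0) - t)   # falling edge of the triangle
    return 0.0


def window_average(t, t0, T=1.0, va=1.0):
    """Average of a square pulse on [t0, t0 + T) over the window [t, t + T).

    The pulse extent is an assumption chosen so the peak lands at t = t0.
    """
    overlap = max(0.0, min(t + T, t0 + T) - max(t, t0))
    return va * overlap / T


if __name__ == "__main__":
    t0 = 0.3
    grid = [i / 100.0 for i in range(-100, 200)]
    peak_t = max(grid, key=lambda t: expected_output(t, t0))
    print("integration phase with the largest expected output:", peak_t)
```

Because the overlap between the window and the pulse shrinks linearly on either side of $T_0$, the maximum is unique even though the input pulse itself is flat.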

The above discussion points to the idea that once we have applied the front-end integration and reset, we should implement the CDR so that it converges to the peak point of the (postintegration) waveform. In order to lock to the phase at which the integrator reaches its peak expected output value, a signal that is highly correlated with the output amplitude is necessary as an indicator. Fortunately, the dLev tracking loop that is typically used for equalizer adaptation [18] can serve this purpose as well.

Fig. 7. (a) Phase dithering and error sampling. (b) Truth table of the proposed phase detector.

The most straightforward way to implement the baud-rate CDR using the ideas outlined above would be to adjust the receiver's phase setting (possibly in both directions), measure the cursor amplitude from the dLev output, and then move the phase in the direction that achieved a higher dLev value. However, since the CDR loop would then need to wait until the dLev loop converges and produces a correct value, this approach would limit the phase tracking bandwidth to be substantially slower than the dLev tracking bandwidth. Unfortunately, the dLev loop bandwidth is usually set to be very low in order to reject residual ISI.

Instead, the approach taken in this design decouples the CDR tracking bandwidth from the dLev tracking bandwidth by directly using the output of the error (adaptive) samplers (which, from the standpoint of the dLev loop, represents the change in the dLev) rather than the output of the dLev tracking loop. Specifically, in order to find the setting with the maximum (postintegrator) dLev, the integration and sampling phase is dithered by a controlled amount, and then the error sampler output is correlated with the digital dither signal [Fig. 7(a)]. This approach achieves the desired result because the sign of the error sampler's output will be opposite to that of the dithering sequence if the integration phase is late, and the signs will be aligned if the integration phase is early. In other words, the phase detector's output is found by XOR'ing the error sampler output with the dithering sequence [Fig. 7(b)]. Additional data filtering is applied on top of the XOR operation, since the dLev tracking loop is configured to be activated only by +1 symbols (filtering out -1 symbols) in order to reduce the number of error samplers [18]. However, it should be noted that this data filtering does not degrade the pattern coverage of the proposed CDR scheme, since it is guaranteed that there is at least one +1 symbol whenever there is an NRZ data transition.
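The correlation described above can be sketched behaviorally as follows. The encoding (+1 for early, -1 for late, 0 when the symbol is filtered out), the convention that a positive dither delays the integration phase, the idealized triangular amplitude `1 - |phase|`, and the function names are all assumptions for illustration. Jitter and noise are omitted, and dLev is held fixed at its assumed locked value.

```python
import random


def phase_detector(err_sign, dither_sign, data):
    """Return +1 (early), -1 (late), or 0 (symbol filtered out).

    err_sign, dither_sign, and data are each +1 or -1. Only +1 data symbols
    are used; aligned signs mean early and opposite signs mean late.
    """
    if data != 1:
        return 0
    return 1 if err_sign == dither_sign else -1


def average_pd(phase_offset, delta=0.02, n=1000, seed=0):
    """Average detector output for a static phase offset (in UI).

    Positive phase_offset means the integration phase is late (assumed).
    """
    rng = random.Random(seed)
    dlev = 1.0 - delta                 # amplitude seen at +/-delta when locked
    total = 0
    for _ in range(n):
        s = rng.choice((-1, 1))        # dither sign
        d = rng.choice((-1, 1))        # data symbol
        y = 1.0 - abs(phase_offset + s * delta)   # triangular amplitude
        err = 1 if y >= dlev else -1   # error sampler compares against dLev
        total += phase_detector(err, s, d)
    return total / n


if __name__ == "__main__":
    print("late by 0.01 UI  ->", average_pd(0.01))
    print("early by 0.01 UI ->", average_pd(-0.01))
```

With a small late offset the average is negative, and with a small early offset it is positive, so the sign of the averaged output indicates the direction in which the phase should move.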

Fig. 8. Phase detector statistics under dither ($\Delta$), random jitter ($\sigma_j$), and input phase ($\tau$). (a) In the locking condition. (b) When there is a phase shift by $\tau$.

In order to elucidate how design decisions (such as dithering amplitude) should be made for this type of CDR, as well as what its limitations/tradeoffs are, we will next derive a small-signal model for the baud-rate phase detector as a function of the input jitter (as well as the dither amplitude). First, the CDR should lock to the phase where the average of the phase detector output is zero, which can be expressed as the following condition:

$$P(E) = P(L) \quad (2)$$

where  $P(E)$  and  $P(L)$  are the probabilities of having early and late outputs from the phase detector, respectively. In the locked condition, the dLev loop will converge to the value that meets the following equation:

$$P(+e) = P(-e) \quad (3)$$

where  $P(+e)$  and  $P(-e)$  are the probabilities of having positive error [ $+e$  in Fig. 7(a)] and negative error [ $-e$  in Fig. 7(a)] outputs from the error sampler, respectively. In the presence of the phase dither for the CDR,<sup>1</sup> the probabilities in (2) and (3) are given by

$$P(E) = P(E|-\Delta) \cdot P(-\Delta) + P(E|\Delta) \cdot P(\Delta) \quad (4)$$

$$P(L) = P(L|-\Delta) \cdot P(-\Delta) + P(L|\Delta) \cdot P(\Delta) \quad (5)$$

$$\begin{aligned} P(+e) &= P(+e|-\Delta) \cdot P(-\Delta) + P(+e|\Delta) \cdot P(\Delta) \\ &= P(L|-\Delta) \cdot P(-\Delta) + P(E|\Delta) \cdot P(\Delta) \end{aligned} \quad (6)$$

$$\begin{aligned} P(-e) &= P(-e|-\Delta) \cdot P(-\Delta) + P(-e|\Delta) \cdot P(\Delta) \\ &= P(E|-\Delta) \cdot P(-\Delta) + P(L|\Delta) \cdot P(\Delta) \end{aligned} \quad (7)$$

where  $P(E|\Delta)$  and  $P(L|\Delta)$  are the conditional probabilities of having early and late outputs from the phase detector when the integration phase is dithered by  $\Delta$ , respectively. Since  $P(E|\Delta) = 1 - P(L|\Delta)$ ,  $P(E|-\Delta) = 1 - P(L|-\Delta)$ , and  $P(\Delta) = P(-\Delta) = 0.5$  (assuming that the dither is balanced), combining (2)–(7) gives the following locking condition:

$$P(E|-\Delta) = P(L|-\Delta) = P(E|\Delta) = P(L|\Delta) = 0.5 \quad (8)$$
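The step from (2)–(7) to (8) can be checked numerically. The sketch below evaluates both balance conditions for given conditional early probabilities. The function name `balances` and its argument names are illustrative assumptions.

```python
def balances(p_early_plus, p_early_minus, p_plus=0.5):
    """Return (P(E) - P(L), P(+e) - P(-e)) following (4)-(7).

    p_early_plus  : P(E | +Delta)
    p_early_minus : P(E | -Delta)
    p_plus        : P(+Delta); P(-Delta) = 1 - p_plus
    """
    p_minus = 1.0 - p_plus
    p_late_plus = 1.0 - p_early_plus
    p_late_minus = 1.0 - p_early_minus
    p_e = p_early_minus * p_minus + p_early_plus * p_plus     # (4)
    p_l = p_late_minus * p_minus + p_late_plus * p_plus       # (5)
    p_pos = p_late_minus * p_minus + p_early_plus * p_plus    # (6)
    p_neg = p_early_minus * p_minus + p_late_plus * p_plus    # (7)
    return p_e - p_l, p_pos - p_neg


if __name__ == "__main__":
    print(balances(0.5, 0.5))   # both conditions balanced
    print(balances(0.6, 0.4))   # CDR balanced, dLev loop not balanced
```

The second call shows why both loops are needed: condition (2) alone is satisfied by any pair of probabilities that sum to one, and only adding (3) forces each of them to 0.5.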

<sup>1</sup>Note that in order to simplify the analysis, we do not explicitly include here the effects of random voltage variations (which would introduce dependence on the exact shape of the integrator output). However, the measured jitter tolerance results presented later match fairly closely (within ~20%) with the predictions from the simplified analysis.



Fig. 9. (a) Relationship between  $\Delta/\sigma_j$  and  $\tau_0/\sigma_j$  from (9). (b) Phase detector gain from (14).

This condition is illustrated in Fig. 8(a), with the converged dLev (i.e.,  $dLev_0$ ) indicated by the thick dashed line. If the jitter follows a Gaussian distribution with standard deviation  $\sigma_j$  [and its probability density function is  $p(t)$ , as in Fig. 8(a)], then as shown in Fig. 8(a), the phase  $\tau_0$  at which the integrator output  $Y(t)$  is equal to  $dLev_0$  [i.e.,  $Y(\tau_0) = dLev_0$ ] can be found by:

$$Q\left(\frac{-\tau_0 + \Delta}{\sigma_j}\right) - Q\left(\frac{\tau_0 + \Delta}{\sigma_j}\right) = Q\left(\frac{-\tau_0 - \Delta}{\sigma_j}\right) - Q\left(\frac{\tau_0 - \Delta}{\sigma_j}\right) = 0.5. \quad (9)$$

Here, $Q(\cdot)$ denotes the Gaussian tail probability. Fig. 9(a) shows the relationship between $\Delta/\sigma_j$ and $\tau_0/\sigma_j$. Note that as $\Delta$ increases, $\tau_0$ converges to $\Delta$ [and $dLev_0$ approaches $Y(\Delta)$].
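The relationship in (9) has no closed form for $\tau_0$, but it can be solved numerically. The sketch below uses bisection on the right-hand equality of (9), with $Q(x)$ taken as the standard Gaussian tail probability. The function names, the bracketing interval, and the tolerance are assumptions for illustration.

```python
import math


def q(x):
    """Gaussian tail probability Q(x) = P(N(0, 1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def solve_tau0(delta, sigma=1.0, tol=1e-10):
    """Solve Q((-tau0 - delta)/sigma) - Q((tau0 - delta)/sigma) = 0.5 for tau0."""
    def f(tau0):
        return q((-tau0 - delta) / sigma) - q((tau0 - delta) / sigma) - 0.5

    lo, hi = 0.0, delta + 10.0 * sigma   # f(lo) = -0.5 and f(hi) is about +0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid                     # f increases monotonically with tau0
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for ratio in (0.0, 0.5, 1.0, 2.0, 4.0):
        print("delta/sigma =", ratio, " tau0/sigma =", round(solve_tau0(ratio), 4))
```

For large $\Delta/\sigma_j$ the solution approaches $\Delta$, which matches the trend noted for Fig. 9(a).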

Let us next examine the case where the input data's phase has been temporarily shifted by  $\tau$ ; this will change the integrator's output waveform, but due to its (intentionally) low tracking bandwidth, dLev will remain stationary at  $dLev_0$  [Fig. 8(b)]. In this case, the probabilities become

$$\begin{aligned} P(E|\tau) &= P(E|-\Delta, \tau) \cdot P(-\Delta) + P(E|\Delta, \tau) \cdot P(\Delta) \\ &= \frac{1}{2} \left[ 1 - Q\left(\frac{-\tau_0 - \tau + \Delta}{\sigma_j}\right) + Q\left(\frac{\tau_0 - \tau + \Delta}{\sigma_j}\right) \right. \\ &\quad \left. + Q\left(\frac{-\tau_0 - \tau - \Delta}{\sigma_j}\right) - Q\left(\frac{\tau_0 - \tau - \Delta}{\sigma_j}\right) \right] \end{aligned} \quad (10)$$

$$\begin{aligned} P(L|\tau) &= P(L|-\Delta, \tau) \cdot P(-\Delta) + P(L|\Delta, \tau) \cdot P(\Delta) \\ &= \frac{1}{2} \left[ Q\left(\frac{-\tau_0 - \tau + \Delta}{\sigma_j}\right) - Q\left(\frac{\tau_0 - \tau + \Delta}{\sigma_j}\right) \right. \\ &\quad \left. + 1 - Q\left(\frac{-\tau_0 - \tau - \Delta}{\sigma_j}\right) + Q\left(\frac{\tau_0 - \tau - \Delta}{\sigma_j}\right) \right] \end{aligned} \quad (11)$$



Fig. 10. (a) Intrinsic path latency difference between the dither sequence (①) and the received signal modulated by the dither sequence (②), with their clock domains annotated. (b) Proposed CDR architecture to compensate the path mismatch by using two separate dither generators and a phase interpolator.

The expected value of the phase detector output is thus

$$\begin{aligned} P\text{Dout} &= TD \cdot FD \cdot [P(E|\tau) - P(L|\tau)] \\ &= TD \cdot FD \cdot \left[ Q\left(\frac{\tau_0 - \tau + \Delta}{\sigma_j}\right) - Q\left(\frac{-\tau_0 - \tau + \Delta}{\sigma_j}\right) \right. \\ &\quad \left. + Q\left(\frac{-\tau_0 - \tau - \Delta}{\sigma_j}\right) - Q\left(\frac{\tau_0 - \tau - \Delta}{\sigma_j}\right) \right] \end{aligned} \quad (12)$$

where TD is the transition density (0.5 for PRBS), and FD is the data filtering density (0.5 for +1 filtering). The small-signal PD gain is therefore given by finding the derivative of (12) at  $\tau = 0$ , which for Gaussian input jitter evaluates to

$$PD\text{gain} = TD \cdot FD \cdot \sqrt{\frac{2}{\pi}} \left( e^{-\frac{(\Delta+\tau_0)^2}{2\sigma_j^2}} - e^{-\frac{(\Delta-\tau_0)^2}{2\sigma_j^2}} \right). \quad (13)$$

If the dithering amplitude  $\Delta$  increases,  $\tau_0$  approaches  $\Delta$  [as indicated in Fig. 9(a)], and (12) and (13) can be approximated as

$$\begin{aligned} P\text{Dout} &= TD \cdot FD \cdot \left[ Q\left(\frac{-\tau + 2\Delta}{\sigma_j}\right) + Q\left(\frac{-\tau - 2\Delta}{\sigma_j}\right) \right. \\ &\quad \left. - 2Q\left(\frac{-\tau}{\sigma_j}\right) \right] \end{aligned} \quad (14)$$

$$PD\text{gain} = TD \cdot FD \cdot \sqrt{\frac{2}{\pi}} \left( e^{-\frac{2\Delta^2}{\sigma_j^2}} - 1 \right). \quad (15)$$
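Expressions (12), (13), and (15) can be evaluated directly, as in the sketch below. Differentiating (12) with respect to $\tau$ produces an additional $1/\sigma_j$ factor, so (13) as printed is read here as the slope per unit of $\tau/\sigma_j$. That reading is an inference from the algebra rather than something stated in the text, and the function names are illustrative.

```python
import math


def q(x):
    """Gaussian tail probability Q(x) = P(N(0, 1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def pd_out(tau, tau0, delta, sigma, td=0.5, fd=0.5):
    """Expected phase detector output from (12)."""
    a = q((tau0 - tau + delta) / sigma)
    b = q((-tau0 - tau + delta) / sigma)
    c = q((-tau0 - tau - delta) / sigma)
    d = q((tau0 - tau - delta) / sigma)
    return td * fd * (a - b + c - d)


def pd_gain(tau0, delta, sigma, td=0.5, fd=0.5):
    """Small-signal gain from (13), taken per unit of tau/sigma."""
    e1 = math.exp(-((delta + tau0) ** 2) / (2.0 * sigma ** 2))
    e2 = math.exp(-((delta - tau0) ** 2) / (2.0 * sigma ** 2))
    return td * fd * math.sqrt(2.0 / math.pi) * (e1 - e2)


def pd_gain_large_dither(delta, sigma, td=0.5, fd=0.5):
    """Approximation (15), valid when tau0 is close to delta."""
    return td * fd * math.sqrt(2.0 / math.pi) * (math.exp(-2.0 * delta ** 2 / sigma ** 2) - 1.0)


if __name__ == "__main__":
    sigma = 1.0
    for delta in (0.5, 1.0, 2.0):
        print("delta =", delta, " gain per (15) =", pd_gain_large_dither(delta, sigma))
```

A finite-difference slope of `pd_out` around `tau = 0` with `sigma = 1` can be compared against `pd_gain` as a sanity check on the sign conventions.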



Fig. 11. CDR implementation.



Fig. 12. Dither strength code versus simulated dithering amplitude.

Equations (12)–(15) reveal the dependence of the CDR's loop gain (and hence tracking bandwidth) upon the dithering amplitude $\Delta$. Specifically, from (13), a higher $\Delta$ gives a larger phase detector gain [as also shown in Fig. 9(b)], and from (12), the phase detector output diminishes to zero as the instantaneous phase shift $\tau$ becomes much larger (by $\sim 3\times$) than the dithering amplitude $\Delta$ [because all $Q$ values in (12) converge to one]. However, increasing $\Delta$ gives a narrower horizontal eye margin due to the (intentional) pseudo-randomized dithering operation. In this paper, 2% dithering amplitude is selected to have a negligible impact on the horizontal eye opening, while still supporting megahertz-range<sup>2</sup> tracking capability.

<sup>2</sup>For practical values of dither amplitude  $\Delta$ , the achievable absolute bandwidth of this type of CDR will be lower than e.g. a 2x over-sampled design. However, since most 50+Gb/s standards continue to require CDR tracking bandwidth only in the  $\sim$ MHz range – which is easily achievable at these sample rates and with reasonable values of  $\Delta$  – sacrificing the maximum achievable CDR bandwidth for energy-efficiency is typically a favorable trade-off.



Fig. 13. Die photograph.

Fig. 14. Measurement setup for (a) channel frequency response, (b) pulse response, and (c) equalizer and CDR characterizations.

Fig. 15. Measured channel frequency response.

Fig. 10(a) highlights the signaling and clocking hardware associated with the dithering and phase detector functions. Note that there is a latency difference between the two inputs to the phase detector: the dithering sequence (①) and the signal modulated by the dithering sequence (②). Separate dither generators that produce the same sequence but with a time offset are used to compensate for this latency difference. Specifically, the output of one dither generator is fed directly to the phase detector, while the other modulates the actual phase used in the high-speed data-path. Altering the seeds of these two dither generators (which simply produce PRBS sequences) enables alignment with step sizes of 8-UI, and as shown in Fig. 10(a), an additional phase interpolator enables finer 1-UI steps within this 8-UI window. If there is $x$ UI of timing mismatch between these two paths, the effective gain of the phase detector will be reduced by a fraction of $x/8$ (since the dither generators are triggered by the eight-UI period/7.5-GHz clock); the one-UI step phase interpolator therefore limits the reduction in phase detector gain to <12.5%.
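The gain bound above can be illustrated with a small sketch that reads the $x/8$ factor as the fractional loss in gain, consistent with the <12.5% limit for a residual mismatch below one UI. It assumes each dither value is held for eight UI and that successive dither values are uncorrelated. The function names are illustrative.

```python
import random


def pd_gain_scale(mismatch_ui, dither_period_ui=8):
    """Fraction of the phase detector gain retained for a given mismatch."""
    overlap = max(0.0, dither_period_ui - abs(mismatch_ui))
    return overlap / dither_period_ui


def simulated_scale(mismatch_ui, period_ui=8, n_periods=20000, seed=0):
    """Correlation between a held random +/-1 sequence and its delayed copy.

    mismatch_ui must be a non-negative integer number of UI.
    """
    rng = random.Random(seed)
    values = [rng.choice((-1, 1)) for _ in range(n_periods)]
    wave = [v for v in values for _ in range(period_ui)]   # hold each value
    n = len(wave) - mismatch_ui
    return sum(wave[i] * wave[i + mismatch_ui] for i in range(n)) / n


if __name__ == "__main__":
    for x in (0, 1, 4, 8):
        print(x, "UI mismatch:", pd_gain_scale(x), "closed form,",
              round(simulated_scale(x), 3), "simulated")
```

During the misaligned portion of each dither period the detector correlates against a neighboring dither value, which averages to zero, so only the overlapping portion contributes to the gain.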

As shown in Fig. 10(b), the CDR loop is then closed by applying additional filtering and feeding the PD's output signal back to control the clock generator. The resulting CDR and clocking path hardware is shown in Fig. 11. The dithering is achieved by modulating the load capacitance of the clock driver, and the dithering capacitor is chosen based on simulated results with S-parameter models of the inductor and clock distribution network, extracted from an electromagnetic simulator [28]. In this paper, the dithering capacitor is constructed as a differential capacitive DAC (CDAC) array with 7.5-fF unit capacitor cells for programmability. The capacitance modulation circuit supports up to ~2% dithering amplitude, as shown in the simulated characteristic (Fig. 12).

Fig. 16. Transmitter and channel characterizations. (a) Eye diagram. (b) Pulse response.

While the integral path of the CDR is implemented with an accumulator in the digital domain (which operates on deserialized data), the proportional path is implemented in the 7.5-GHz (UI/8) domain with analog summation in order to minimize its latency.



Fig. 17. Equalizer measurements. (a) On-chip eye diagram. (b) Bathtub curve.



Fig. 18. Converged receive equalizer coefficients (output referred).

## IV. MEASUREMENT RESULTS

Fig. 19. Measured divided clock outputs.

TABLE I  
TRANSCEIVER POWER BREAKDOWN

|                   | Item                                   | Power (mW) |
|-------------------|----------------------------------------|------------|
| TX                | TXFFE, 2:1SER                          | 63         |
|                   | 128:2SER, Pattern Generator            | 20         |
|                   | OSC, CKDRV, DIV4                       | 51         |
|                   | DIV16                                  | 18         |
|                   | TX Total                               | 152        |
| RX                | CTLE, RXFFE, DFE                       | 46         |
|                   | 2:128DES, CDR, Adaptation, Error Count | 17         |
|                   | OSC, CKDRV, CKDIV/4                    | 55         |
|                   | CKDIV/16                               | 18         |
|                   | RX Total                               | 136        |
| Transceiver Total |                                        | 288        |

The transceiver test chip was designed and fabricated in a 65-nm CMOS process (Fig. 13). The entire transceiver occupies $2.48 \text{ mm}^2$ (TX: $0.45 \text{ mm}^2$ and RX: $2.03 \text{ mm}^2$). Fig. 14 shows the measurement setup for the test chip and channel environment. In order to maintain signal integrity and minimize unwanted reflections and/or additional losses from parasitic elements, a 0.7-m differential 32AWG twinax cable is directly soldered on the test board made of Megtron6-5670 material. The chip is also directly attached to the board via flip-chip bumps to minimize parasitic loading. Fig. 15 shows the measured S21 of the channel, from the configuration shown in Fig. 14(a). The insertion loss at 30 GHz is 21 dB, without significant reflections from complex package and connector structures [5]. Fig. 16 shows the associated pulse response and eye diagram, measured with Keysight 86118A sampling heads [Fig. 14(b)]. Non-zero precursor ISI, $\sim 3$ taps of significant postcursor ISI, and long-tail ISI are observed in the measured pulse response, which necessitates the use of the complete CTLE+FFE+DFE equalizer chain. The transmitter is configured to generate a 60 Gb/s $2^7-1$ PRBS pattern, and a Keysight E8267D is used to generate a 10-GHz reference clock to injection lock the transmitter's 30-GHz

TABLE II  
PERFORMANCE SUMMARY

| Reference                        | [2]<br>Frans 2016           | [3]<br>Peng 2017                  | [13]<br>Lee 2015      |           | [14]<br>Shibasaki 2016            | This work                                                   |
|----------------------------------|-----------------------------|-----------------------------------|-----------------------|-----------|-----------------------------------|-------------------------------------------------------------|
| <b>Modulation</b>                | PAM4                        | PAM4                              | PAM4                  | NRZ       | NRZ                               | <b>NRZ</b>                                                  |
| <b>Process</b>                   | 16nm                        | 40nm                              | 40nm                  | 40nm      | 28nm                              | <b>65nm</b>                                                 |
| <b>Data-rate (Gb/s)</b>          | 56                          | 56                                | 54.1~56.8             | 55.5~56.5 | 56                                | <b>60</b>                                                   |
| <b>Channel loss (dB)</b>         | 25                          | 24                                | N/A                   | N/A       | 18.4                              | <b>21</b>                                                   |
| <b>VISI/VCURSOR</b>              | -                           | -                                 |                       |           | -                                 | <b>2.9*</b>                                                 |
| <b>Equalizer</b>                 | 3-tap TX FFE<br>CTLE<br>DSP | 3-tap TX FFE<br>CTLE<br>3-tap DFE | CTLE                  | CTLE      | 2-tap TX FFE<br>CTLE<br>1-tap DFE | <b>3-tap TX FFE<br/>2-tap RX FFE<br/>CTLE<br/>3-tap DFE</b> |
| <b>SERDES ratio</b>              | 1:32                        | 1:64                              | 1:16 (TX)<br>1:2 (RX) | 1:8       | 1:32                              | <b>1:128</b>                                                |
| <b>Adaptation</b>                | Y                           | N                                 | N                     | N         | Y                                 | <b>Y (per-path)</b>                                         |
| <b>Eye opening</b>               | N/A                         | 25% @ 1e-9                        | N/A                   | N/A       | 28% @ 1e-9                        | <b>35% @ 1e-9<br/>30% @ 1e-12</b>                           |
| <b>Tx Power (mW)</b>             | 140                         | 200                               | 290                   | 450       | 104.7                             | <b>152</b>                                                  |
| <b>Rx Power (mW)</b>             | 370                         | 382                               | 420                   | 220       | 141.7                             | <b>136</b>                                                  |
| <b>Tot. Power (mW)</b>           | 550◊                        | 602‡                              | 710                   | 670       | 246.4†                            | <b>288</b>                                                  |
| <b>Tx Efficiency (pJ/bit)</b>    | 2.5                         | 3.57                              | 5.17                  | 8         | 1.87                              | <b>2.53</b>                                                 |
| <b>Rx Efficiency (pJ/bit)</b>    | 6.61                        | 6.82                              | 7.5                   | 3.93      | 2.53                              | <b>2.26</b>                                                 |
| <b>Total Efficiency (pJ/bit)</b> | 9.82                        | 10.75                             | 12.67                 | 11.96     | 4.4                               | <b>4.8</b>                                                  |

\*VISI/VCURSOR is measured from a probing setup with additional <2dB loss @ 30GHz

◊DSP power not included, 40mW clocking power

‡20mW clocking power

†Clocking power is amortized over 2 lanes



Fig. 20. (a) Measured dLev values. (b) Subsampled phase detector output.

*LC* oscillator. Significant ISI from channel loss and a completely closed eye diagram are observed, corroborating the measured ratio between ISI and cursor of 2.9.
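The link between a closed eye and an ISI-to-cursor ratio above 1 can be sketched with a peak-distortion calculation. The Python below is a minimal illustration, not the measurement flow used here. It assumes the ratio is defined as the sum of ISI tap magnitudes divided by the cursor, assumes the common PRBS7 polynomial $x^7 + x^6 + 1$ (the text only specifies a $2^7-1$ pattern), and uses a made-up six-tap pulse response `PULSE` whose values were chosen only so that the ratio equals 2.9. All function names are hypothetical.

```python
def prbs7(n_bits, seed=0x7F):
    """Generate n_bits of PRBS7 (assumed polynomial: x^7 + x^6 + 1)."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero")
    bits = []
    for _ in range(n_bits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        bits.append(new_bit)
        state = ((state << 1) | new_bit) & 0x7F
    return bits


def isi_to_cursor_ratio(pulse, cursor_index):
    """Sum of ISI tap magnitudes divided by the cursor tap (assumed definition)."""
    isi = sum(abs(h) for k, h in enumerate(pulse) if k != cursor_index)
    return isi / pulse[cursor_index]


def min_margin_over_pattern(pulse, cursor_index, bits):
    """Smallest value of (received sample x transmitted symbol) over one
    period of a repeating pattern; a negative value means a closed eye."""
    symbols = [2 * b - 1 for b in bits]  # NRZ mapping: 0 -> -1, 1 -> +1
    n = len(symbols)
    worst = float("inf")
    for i in range(n):
        sample = sum(
            h * symbols[(i - (k - cursor_index)) % n]
            for k, h in enumerate(pulse)
        )
        worst = min(worst, sample * symbols[i])
    return worst


# Illustrative taps only (NOT the measured pulse response):
# one precursor, the cursor, then four postcursor taps.
PULSE = [0.04, 0.10, 0.09, 0.07, 0.05, 0.04]
CURSOR_INDEX = 1

ratio = isi_to_cursor_ratio(PULSE, CURSOR_INDEX)
margin = min_margin_over_pattern(PULSE, CURSOR_INDEX, prbs7(127))
print(f"ISI/cursor ratio: {ratio:.2f}")
print(f"worst-case margin over PRBS7: {margin:.2f}")
```

Because one period of PRBS7 contains every six-bit window, the worst-case margin found by the sweep equals the peak-distortion bound, cursor minus the sum of ISI magnitudes. Any ratio above 1 makes that bound negative, which corresponds to an unequalized eye that is closed.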

Fig. 21. Measured jitter tolerance.

Following channel characterization, on-chip eye diagrams and a bit error rate (BER) bathtub curve were measured to test the adaptive equalizer [Figs. 14(c) and 17]. All sampler offsets are first cancelled by the on-chip offset calibration loops using an algorithm similar to the one used in [29]. The equalizer coefficients are then adaptively converged by the internal tracking loops while the TX transmits a PRBS7 sequence. Notably, the design achieves  $>0.3$  UI opening at  $10^{-12}$  BER for both the odd and even paths. The converged equalizer coefficients are plotted in Fig. 18. When the CDR is turned on, the RX clock is locked to the TX clock, and their divide-by-64 clocks are plotted in Fig. 19. Fig. 20 shows the measured output of the dLev tracking loop and the phase detector output while sweeping the receiver's clock phase. The CDR loop is designed to converge to the zero-crossing point of Fig. 20(b), where the converged dLev output (and the cursor amplitude) is close to its maximum value. In order to prevent unwanted interactions between the equalizer adaptation and CDR, their tracking bandwidths are substantially separated ( $\sim 100$  kHz for the adaptation and  $\sim 3$  MHz for the CDR) [30], [31].
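As an illustration of how an eye opening at a given BER can be read off a bathtub curve, the following Python sketch finds the widest contiguous range of sampling phases whose BER meets a target. The bathtub data are synthetic placeholders, not the measured curve of Fig. 17, and the function name `eye_opening_ui` is hypothetical. BER values are handled as base-10 exponents to avoid floating-point comparisons.

```python
def eye_opening_ui(phases_ui, log10_ber, target_log10_ber):
    """Width (in UI) of the widest contiguous run of phase points whose
    log10(BER) is at or below the target. No interpolation between points,
    so the result is a conservative estimate."""
    best = 0.0
    start = None
    for phase, value in zip(phases_ui, log10_ber):
        if value <= target_log10_ber:
            if start is None:
                start = phase
            best = max(best, phase - start)
        else:
            start = None
    return best


# Synthetic bathtub: 17 phase points across one UI (illustrative values only).
PHASES_UI = [k / 16 for k in range(17)]
LOG10_BER = [-1, -2, -4, -6, -9, -11, -13, -14, -15,
             -14, -13, -12, -9, -6, -4, -2, -1]

opening_1e12 = eye_opening_ui(PHASES_UI, LOG10_BER, -12)
opening_1e9 = eye_opening_ui(PHASES_UI, LOG10_BER, -9)
print(f"opening at 1e-12 BER: {opening_1e12:.4f} UI")
print(f"opening at 1e-9 BER:  {opening_1e9:.4f} UI")
```

A finer phase step or interpolation in the log-BER domain would tighten the estimate; the point-based version above only ever underestimates the opening.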

Fig. 21 shows the measured jitter tolerance curve. As expected from the bathtub curve, the jitter tolerance curve confirms  $\sim 0.3$  UI of margin at high frequency, and  $\sim 3$  MHz of tracking bandwidth. The transceiver power breakdown is shown in Table I, and Table II shows a performance summary and comparisons between various state-of-the-art transceivers. Particularly considering the 65-nm process technology, the RX efficiency highlights the benefits of the energy-efficient adaptive signaling path and the proposed baud-rate CDR.
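The efficiency figures follow directly from the power breakdown: power in mW divided by data rate in Gb/s gives energy per bit in pJ. The short Python check below reproduces the Table I totals and the corresponding efficiencies; the dictionary and function names are illustrative only.

```python
DATA_RATE_GBPS = 60.0

# Power entries copied from Table I (mW).
TX_POWER_MW = {
    "TXFFE, 2:1SER": 63,
    "128:2SER, Pattern Generator": 20,
    "OSC, CKDRV, DIV4": 51,
    "DIV16": 18,
}
RX_POWER_MW = {
    "CTLE, RXFFE, DFE": 46,
    "2:128DES, CDR, Adaptation, Error Count": 17,
    "OSC, CKDRV, CKDIV/4": 55,
    "CKDIV/16": 18,
}


def energy_per_bit_pj(power_mw, data_rate_gbps):
    """mW / (Gb/s) = 1e-3 W / 1e9 bit/s = 1e-12 J/bit = pJ/bit."""
    return power_mw / data_rate_gbps


tx_total = sum(TX_POWER_MW.values())
rx_total = sum(RX_POWER_MW.values())
total = tx_total + rx_total

for name, power in (("TX", tx_total), ("RX", rx_total), ("Total", total)):
    eff = energy_per_bit_pj(power, DATA_RATE_GBPS)
    print(f"{name}: {power} mW -> {eff:.2f} pJ/bit")
```

Values in Table II may differ from this computation in the last digit because of rounding.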

## V. CONCLUSION

This paper described design techniques enabling the realization of an efficient ( $<5$  pJ/bit) and complete 60-Gb/s NRZ transceiver in 65-nm CMOS technology. The transceiver incorporates a CTLE, FFE, and DFE, output slicing, clocking, equalizer adaptation, and a baud-rate CDR, allowing the design to support operation over a 21-dB loss channel. An energy-efficient yet robust baud-rate CDR is realized by making use of an integrating front end and phase dithering circuits. The design achieves 60-Gb/s operation with  $>0.3$  UI opening at  $10^{-12}$  BER (and error-free operation over  $10^{13}$  bits at the center of the eye), while consuming 288 mW and occupying 2.48 mm<sup>2</sup>.

## ACKNOWLEDGMENT

The authors would like to thank Systems on Nanoscale Information fabriCs, BWRC sponsors, students, staff, and faculty, Berkeley Design Automation, Integrand EMX, Lorentz PeakView, the TSMC University Shuttle Program for chip fabrication donation, and B. Casper and J. Jaussi of Intel, K. Chang and J. Lim of Xilinx, F. Liu, Z. S. Shehadeh, and V. Lee of Oracle Labs, P. Y. Chiang of OSU, C.-K. K. Yang of UCLA, P. K. Hanumolu of UIUC, and V. Stojanovic of UC Berkeley.

## REFERENCES

- [1] S. Narendra, L. Fujino, and K. Smith, "Through the looking glass - the 2015 edition: Trends in solid-state circuits from ISSCC," *IEEE Solid-State Circuits Mag.*, vol. 7, no. 1, pp. 14–24, Feb. 2015.
- [2] Y. Frans *et al.*, "A 56 Gb/s PAM4 wireline transceiver using a 32-way time-interleaved SAR ADC in 16nm FinFET," in *Proc. IEEE Symp. VLSI Circuits*, Jun. 2016, pp. 1–2.
- [3] P. J. Peng, J. F. Li, L. Y. Chen, and J. Lee, "A 56 Gb/s PAM-4/NRZ transceiver in 40nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2017, pp. 110–111.
- [4] J. Im *et al.*, "A 40-to-56 Gb/s PAM-4 receiver with 10-tap direct decision-feedback equalization in 16nm FinFET," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2017, pp. 114–115.
- [5] B. Casper, G. Balamurugan, J. E. Jaussi, J. Kennedy, and M. Mansuri, "Future microprocessor interfaces: Analysis, design and optimization," in *Proc. IEEE Custom Integr. Circuits Conf.*, Sep. 2007, pp. 479–486.
- [6] H. Braunisch *et al.*, "High-speed flex-circuit chip-to-chip interconnects," *IEEE Trans. Adv. Packag.*, vol. 31, no. 1, pp. 82–90, Feb. 2008.
- [7] M. Mansuri *et al.*, "A scalable 0.128–1 Tb/s, 0.8–2.6 pJ/bit, 64-lane parallel I/O in 32-nm CMOS," *IEEE J. Solid-State Circuits*, vol. 48, no. 12, pp. 3229–3242, Dec. 2013.
- [8] Y. Lu and E. Alon, "Design techniques for a 66 Gb/s 46 mW 3-tap decision feedback equalizer in 65 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 48, no. 12, pp. 3243–3257, Dec. 2013.
- [9] T. Shibasaki *et al.*, "A 56-Gb/s receiver front-end with a CTLE and 1-tap DFE in 20-nm CMOS," in *Symp. VLSI Circuits Dig. Tech. Papers*, Jun. 2014, pp. 1–2.
- [10] A. Awny, L. Moeller, J. Junio, J. C. Scheytt, and A. Thiede, "Design and measurement techniques for an 80 Gb/s 1-tap decision feedback equalizer," *IEEE J. Solid-State Circuits*, vol. 49, no. 2, pp. 452–470, Feb. 2014.
- [11] J. Han, Y. Lu, N. Sutardja, K. Jung, and E. Alon, "Design techniques for a 60 Gb/s 173 mW wireline receiver frontend in 65 nm CMOS technology," *IEEE J. Solid-State Circuits*, vol. 51, no. 4, pp. 871–880, Apr. 2016.
- [12] M.-S. Chen and C.-K. K. Yang, "A 50-64 Gb/s serializing transmitter with a 4-Tap, LC-ladder-filter-based FFE in 65 nm CMOS technology," *IEEE J. Solid-State Circuits*, vol. 50, no. 8, pp. 1903–1916, Aug. 2015.
- [13] J. Lee, P.-C. Chiang, and C.-C. Weng, "56 Gb/s PAM4 and NRZ SerDes transceivers in 40nm CMOS," in *Proc. IEEE Symp. VLSI Circuits*, Jun. 2015, pp. 118–119.
- [14] T. Shibasaki *et al.*, "A 56 Gb/s NRZ-electrical 247mW/lane serial-link transceiver in 28nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2016, pp. 64–65.
- [15] J. Han, Y. Lu, N. Sutardja, and E. Alon, "A 60 Gb/s 288mW NRZ transceiver with adaptive equalization and baud-rate clock and data recovery in 65nm CMOS technology," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2017, pp. 112–113.
- [16] M. Park, J. Bulzacchelli, M. Beakes, and D. Friedman, "A 7 Gb/s 9.3 mW 2-tap current-integrating DFE receiver," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2007, pp. 230–239.
- [17] B. Kim, Y. Liu, T. O. Dickson, J. F. Bulzacchelli, and D. J. Friedman, "A 10-Gb/s compact low-power serial I/O with DFE-IIR equalization in 65-nm CMOS," *IEEE J. Solid-State Circuits*, vol. 44, no. 12, pp. 3526–3538, Dec. 2009.
- [18] V. Stojanovic *et al.*, "Autonomous dual-mode (PAM2/4) serial link transceiver with adaptive equalization and data recovery," *IEEE J. Solid-State Circuits*, vol. 40, no. 4, pp. 1012–1026, Apr. 2005.
- [19] B. S. Leibowitz *et al.*, "A 7.5 Gb/s 10-Tap DFE receiver with first tap partial response, spectrally gated adaptation, and 2nd-order data-filtered CDR," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2007, pp. 228–239.
- [20] Y. Hidaka *et al.*, "A 4-channel 1.25-10.3 Gb/s backplane transceiver macro with 35 dB equalizer and sign-based zero-forcing adaptive control," *IEEE J. Solid-State Circuits*, vol. 44, no. 12, pp. 3547–3559, Dec. 2009.
- [21] J. F. Bulzacchelli *et al.*, "A 10-Gb/s 5-tap DFE/4-tap FFE transceiver in 90-nm CMOS technology," *IEEE J. Solid-State Circuits*, vol. 41, no. 12, pp. 2885–2900, Dec. 2006.
- [22] P. Upadhyaya *et al.*, "A 0.5-to-32.75 Gb/s flexible-reach wireline transceiver in 20nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2015, pp. 1–3.
- [23] B. Zhang *et al.*, "A 28 Gb/s multi-standard serial-link transceiver for backplane applications in 28nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2015, pp. 1–3.
- [24] F. Spagna *et al.*, "A 78mW 11.8 Gb/s serial link transceiver with adaptive RX equalization and baud-rate CDR in 32nm CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2010, pp. 366–367.
- [25] C. Thakkar *et al.*, "A 32 Gb/s bidirectional 4-channel 4 pJ/b capacitively coupled link in 14 nm CMOS for proximity communication," *IEEE J. Solid-State Circuits*, vol. 51, no. 12, pp. 3231–3245, Dec. 2016.
- [26] V. Balan *et al.*, "A 4.8-6.4-Gb/s serial link for backplane applications using decision feedback equalization," *IEEE J. Solid-State Circuits*, vol. 40, no. 9, pp. 1957–1967, Sep. 2005.
- [27] K. Mueller and M. Muller, "Timing recovery in digital synchronous data receivers," *IEEE Trans. Commun.*, vol. 24, no. 5, pp. 516–531, May 1976.
- [28] S. Kapur and D. E. Long, "Modeling of integrated RF passive devices," in *Proc. IEEE Custom Integr. Circuits Conf.*, Sep. 2010, pp. 1–8.
- [29] J. E. Jaussi *et al.*, "8-Gb/s source-synchronous I/O link with adaptive receiver equalization, offset cancellation, and clock de-skew," *IEEE J. Solid-State Circuits*, vol. 40, no. 1, pp. 80–88, Jan. 2005.
- [30] R. Payne *et al.*, "A 6.25-Gb/s binary transceiver in 0.13-$\mu$m CMOS for serial data transmission across high loss legacy backplane channels," *IEEE J. Solid-State Circuits*, vol. 40, no. 12, pp. 2646–2657, Dec. 2005.
- [31] H. Kimura *et al.*, "A 28 Gb/s 560 mW multi-standard SerDes with single-stage analog front-end and 14-tap decision feedback equalizer in 28 nm CMOS," *IEEE J. Solid-State Circuits*, vol. 49, no. 12, pp. 3091–3103, Dec. 2014.



**Jaeduk Han** (S’15) received the B.S. and M.S. degrees in electrical engineering from Seoul National University, Seoul, South Korea, in 2007 and 2009, respectively. He is currently pursuing the Ph.D. degree in electrical engineering with the University of California, Berkeley, CA, USA.

He was a Circuit Design Engineer at TLI, Seongnam, South Korea, from 2009 to 2012, and has held engineering intern positions at Altera, Intel, Xilinx, and Apple, in 2012, 2014, 2015, and 2016, respectively, where he was involved in high-speed wireline communication circuits and power management circuits. His current research interests include high-speed wireline communication circuit design and analog circuit design automation.



**Nicholas Sutardja** (S’12) received the B.S. degree in electrical engineering and computer science from the University of California at Berkeley, Berkeley, CA, USA, in 2012, and the B.A. degree in applied mathematics from the University of California at Berkeley in 2012, where he is currently pursuing the Ph.D. degree in electrical engineering.

Additionally, he was involved in high-speed wireline receivers at Altera in 2011, sensors for pulse oximetry at ADI, Seongnam, South Korea, in 2014, and field-programmable gate array architecture at Palo Alto Networks, Seongnam, South Korea, in 2016. His current research interests include mixed-signal ICs and low-latency, energy-efficient high-speed link systems.



**Yue Lu** (S’08–M’14) received the B.E. degree from Shanghai Jiao Tong University, Shanghai, China, in 2008, and the Ph.D. degree from the University of California, Berkeley, CA, USA, in 2014.

He joined Carnegie Mellon University, Pittsburgh, PA, USA, in 2007, as an Undergraduate Exchange Student. He is currently with Qualcomm Atheros Inc., San Jose, CA, USA.

Dr. Lu was a recipient of the 2013–2014 IEEE Solid-State Circuits Society Predoctoral Achievement Award, the 2013 James H. Eaton Memorial Scholarship from UC Berkeley, the 2013 ADI Outstanding Student Designer Award, and the 2012 Custom Integrated Circuits Conference Best Student Paper Award.



**Elad Alon** (M’06–SM’12) received the B.S., M.S., and Ph.D. degrees in electrical engineering from Stanford University, Stanford, CA, USA, in 2001, 2002, and 2006, respectively.

He is a Professor of electrical engineering and computer sciences with the University of California, Berkeley, CA, USA, as well as a Co-Director of the Berkeley Wireless Research Center. He has held advisory, consulting, or visiting positions at Locix, Lion Semiconductor, Cadence, Xilinx, Wilocity (now Qualcomm), Oracle, Intel, AMD, Rambus, Hewlett Packard, and IBM Research, where he was involved in digital, analog, and mixed-signal integrated circuits for computing, test and measurement, power management, and high-speed communications. His current research interests include energy-efficient integrated systems, including the circuit, device, communications, and optimization techniques used to design them.

Prof. Alon was a recipient of the IBM Faculty Award in 2008, the 2009 Hellman Family Faculty Fund Award, as well as the 2010 and 2017 UC Berkeley Electrical Engineering Outstanding Teaching Awards, and has co-authored papers that received the 2010 ISSCC Jack Raper Award for Outstanding Technology Directions Paper, the 2011 Symposium on VLSI Circuits Best Student Paper Award, the 2012 as well as the 2013 Custom Integrated Circuits Conference Best Student Paper Awards, and the 2010–2016 Symposium on VLSI Circuits Most Frequently Cited Paper Award.