

# A 60-Gb/s 1.9-pJ/bit NRZ Optical Receiver With Low-Latency Digital CDR in 14-nm CMOS FinFET

Ilter Ozkaya<sup>✉</sup>, Student Member, IEEE, Alessandro Cevrero, Member, IEEE, Pier Andrea Fransese, Senior Member, IEEE, Christian Menolfi, Member, IEEE, Thomas Morf, Senior Member, IEEE, Matthias Brändli, Daniel M. Kuchta, Senior Member, IEEE, Lukas Kull, Senior Member, IEEE, Christian W. Baks, Jonathan E. Proesel, Senior Member, IEEE, Marcel Kossel, Senior Member, IEEE, Danny Luu, Student Member, IEEE, Benjamin G. Lee, Senior Member, IEEE, Fuad E. Doany, Mounir Meghelli, Member, IEEE, Yusuf Leblebici, Fellow, IEEE, and Thomas Toifl, Senior Member, IEEE

**Abstract**—This paper presents an analysis of the loop dynamics of digital clock and data recovery (CDR) circuits and, based on that analysis, the design details of a non-return-to-zero optical receiver (RX) in a 14-nm bulk CMOS finFET technology with high jitter tolerance (JTOL) performance. The digital CDR logic is designed full custom so that it runs at a quarter-rate clock of 15 GHz at 60 Gb/s, which minimizes the CDR loop latency. The RX is characterized in a vertical cavity surface emitting laser-based link, recovering a 7-bit pseudo-random bit sequence pattern at 60 Gb/s with a JTOL corner frequency of around 80 MHz while maintaining an energy efficiency of 1.9 pJ/bit.

**Index Terms**—Clock and data recovery (CDR), decision feedback equalization (DFE), I/O link, non-return to zero (NRZ), optical receiver (RX), phase rotator (PR), RX, self-timed comparator, sensitivity, shunt feedback, transimpedance amplifier (TIA), variable gain amplifier (VGA).

## I. INTRODUCTION

SERIAL data rates beyond 50 Gb/s/lane will eventually be required in wireline communications with the ever increasing growth of cloud computing and big data applications. Optical links become more and more competitive at high data rates against their electrical counterparts due to huge losses of copper interconnects even at relatively short ranges (<1 m).

In our previous work [1], [2], we presented an optical receiver (RX) data path that runs up to 64 Gb/s with an energy efficiency of 1.4 pJ/bit in a 14-nm finFET technology, enabling the direct integration of the proposed optical RX into larger CMOS chips. This paper introduces an improved version that includes a complete clock and data recovery (CDR) circuit. The CDR performance of the RX is measured up to 60 Gb/s with a jitter tolerance (JTOL) corner frequency of around 80 MHz. High JTOL performance is important because it relaxes the specifications for reference clock generation. Moreover, it allows the RX to be used for spread spectrum clocking applications with more stringent requirements.

Manuscript received August 8, 2017; revised October 13, 2017; accepted November 16, 2017. Date of publication February 7, 2018; date of current version March 23, 2018. This paper was approved by Guest Editor Ken Chang. This work was supported by the European Union's Seventh Framework Program (FP7/2007-2013) through the ADDAPT Project under Grant 619197. (*Ilter Ozkaya and Alessandro Cevrero contributed equally to this work.*) (*Corresponding author:* Ilter Ozkaya.)

I. Ozkaya is with IBM Research—Zurich, 8803 Rüschlikon, Switzerland, and also with the Microelectronic Systems Laboratory, Swiss Federal Institute of Technology, 1015 Lausanne, Switzerland (e-mail: ilt@zurich.ibm.com).

A. Cevrero, P. A. Fransese, C. Menolfi, T. Morf, M. Brändli, L. Kull, M. Kossel, D. Luu, and T. Toifl are with IBM Research—Zurich, 8803 Rüschlikon, Switzerland (e-mail: ace@zurich.ibm.com).

D. M. Kuchta, C. W. Baks, J. E. Proesel, B. G. Lee, F. E. Doany, and M. Meghelli are with the IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA.

Y. Leblebici is with the Microelectronic Systems Laboratory, Swiss Federal Institute of Technology, 1015 Lausanne, Switzerland.

Color versions of one or more of the figures in this paper are available online at <http://ieeexplore.ieee.org>.

Digital Object Identifier 10.1109/JSSC.2017.2778286

Fig. 1. Linearized CDR model.

The remainder of this paper is organized as follows. Section II analyzes the effects of various parameters of a linearized CDR model, such as the proportional and integral gain coefficients and the delays introduced in those paths. Section III presents the RX architecture and the CDR loop, describing in detail the circuit implementation of each building block. Section IV describes the experimental setup and presents measurement results of the fabricated optical RX. Finally, Section V provides the conclusion.

## II. EFFECTS OF LATENCY ON CLOCK AND DATA RECOVERY CIRCUITS

In this section, we will provide an insight into the loop dynamics of the CDR to motivate the design choices explained in Section III. The linearized model to be used for the analysis is given in Fig. 1. The open loop transfer function of this model is given by

$$\text{OLTF} = G_{\text{PD}} \left( G_{\text{prop}} + G_{\text{int}} \frac{z^{-N_{\text{int}}}}{1 - z^{-1}} \right) \frac{z^{-N_{\text{EL}}}}{1 - z^{-1}} G_{\text{DPC}} \quad (1)$$



Fig. 2. CDR model open-loop transfer function for different  $N_{\text{EL}}$  ( $G_{\text{int}} = 0$ ) values.

where $G_{\text{prop}}$ is the proportional path gain, $G_{\text{int}}$ is the integral path gain, $N_{\text{int}}$ is the extra delay on the integral path in unit intervals (UI), $N_{\text{EL}}$ is the complete loop delay in UI, $G_{\text{DPC}}$ is the gain of the digital-to-phase converter (a phase rotator (PR) in our implementation), and $G_{\text{PD}}$ is the phase detector (PD) gain, which in our implementation covers a bang-bang PD followed by a tree-style 4-to-1 majority voter. Also, $\phi_{\text{in}}$ is the input phase, $\phi_{\text{CDR}}$ is the sampling phase, and $\phi_{\text{err}}$ is the phase error. $\phi_{\text{err}}$ defines the performance of the CDR loop, as the JTOL is inversely proportional to $\phi_{\text{err}}$.

Although bang-bang phase detection and majority voting both have non-linear responses, in the presence of noise (and jitter) a linear model can be extracted for those operations, as shown in [3]. Although our linearized CDR model is based on the model provided in [3], we concentrate on the impact of the loop latency on the CDR dynamics.

We will start our analysis by simplifying the CDR model by setting  $G_{\text{int}} = 0$ , which results in an open loop transfer function of

$$\text{OLTF} = G_{\text{PD}} G_{\text{prop}} \frac{z^{-N_{\text{EL}}}}{1 - z^{-1}} G_{\text{DPC}}. \quad (2)$$

The Bode plot that corresponds to this model for various loop delay values ($N_{\text{EL}}$) is given in Fig. 2. The loop latency values are chosen around our simulated latency of 60–70 UI. Three main parameters determine the open-loop transfer function: 1) the integration, which shapes the amplitude response to $1/s$ and adds a $-90^\circ$ phase shift; 2) the loop gain, which moves the amplitude response vertically; and 3) the loop delay ($N_{\text{EL}}$), which contributes a negative phase shift proportional to frequency and thus bends the phase response downward. The larger $N_{\text{EL}}$ is, the lower the frequency at which the bend occurs. As a result, as $N_{\text{EL}}$ increases, the phase margin reduces, resulting in peaking in the closed-loop transfer function. The impact of this peaking on the JTOL is shown in Fig. 3.



Fig. 3. JTOL for different  $N_{\text{EL}}$  ( $G_{\text{int}} = 0$ ) values.



Fig. 4. JTOL for different loop gains ( $N_{\text{EL}} = 128$  UI and  $G_{\text{int}} = 0$ ).

In order to have an optimum settling behavior that does not create peaking in JTOL,  $60^\circ$  phase margin must be satisfied by changing the loop gain. The change in JTOL with respect to loop gain (at a fixed loop latency of 128 UI) is given in Fig. 4. As the loop gain is reduced, the phase margin increases and the peaking disappears, whereas the JTOL corner frequency also drops.
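The dependence of phase margin on loop delay can be checked numerically from (2). The following is a minimal sketch, not the authors' simulation setup. It assumes the model is sampled once per UI (so that $z = e^{j\omega}$ with $\omega$ in rad/UI), lumps $G_{\text{PD}} G_{\text{prop}} G_{\text{DPC}}$ into a single hypothetical gain `k`, and treats the unity-gain frequency only as a rough proxy for the JTOL corner. All function names are illustrative.

```python
import math

def crossover_rad_per_ui(k):
    """Unity-gain frequency of (2) in rad/UI for the lumped gain k (assumed name)."""
    # |1 - exp(-jw)| = 2*sin(w/2), so |OLTF| = 1 when 2*sin(w/2) = k.
    return 2.0 * math.asin(k / 2.0)

def phase_margin_deg(k, n_el):
    """Phase margin of (2) at its unity-gain frequency."""
    w = crossover_rad_per_ui(k)
    # arg(OLTF) = -n_el*w (loop delay) - (pi/2 - w/2) (discrete-time integrator)
    phase = -n_el * w - (math.pi / 2.0 - w / 2.0)
    return math.degrees(math.pi + phase)

def gain_for_phase_margin(n_el, pm_deg=60.0):
    """Lumped gain k that gives the requested phase margin for a delay of n_el UI."""
    w = (math.pi / 2.0 - math.radians(pm_deg)) / (n_el - 0.5)
    return 2.0 * math.sin(w / 2.0)

def crossover_hz(k, data_rate):
    """Unity-gain frequency in Hz, assuming one model sample per UI."""
    return crossover_rad_per_ui(k) / (2.0 * math.pi) * data_rate

if __name__ == "__main__":
    for n_el in (32, 64, 70, 128):
        k = gain_for_phase_margin(n_el)
        print(f"N_EL={n_el:4d}  k={k:.5f}  "
              f"crossover={crossover_hz(k, 60e9) / 1e6:6.1f} MHz")
```

Under these assumptions, a 70-UI delay with a $60^\circ$ phase margin gives a unity-gain frequency of roughly 72 MHz at 60 Gb/s, which is of the same order as the corner frequency discussed in the text.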

Now that the relation between loop latency and the JTOL function is explained, the integral path can be introduced to study its effects. The Bode plot for the open-loop transfer function of the CDR for various integral path gains ($G_{\text{int}}$), with the other parameters held constant, is given in Fig. 5. The integral path delay is assumed to be 0. The introduction of an integral path increases the slope of the amplitude response at lower frequencies to $-40$ dB/decade and shifts the phase by another $-90^\circ$ at zero input frequency. As $G_{\text{int}}$ is increased, the gain at lower frequencies increases, whereas the phase margin decreases. The JTOL functions corresponding to the same $G_{\text{int}}$ values are given in Fig. 6. The introduction of an integral gain improves the JTOL response at lower frequencies significantly. On the other hand, an excessively large $G_{\text{int}}$ value starts to create peaking (as expected from the open-loop transfer function), compromising the JTOL performance.

In the final part of the analysis,  $N_{\text{int}}$  will be increased while keeping the other parameters constant at their nominal values.

Fig. 5. Open-loop transfer function for different $G_{\text{int}}$ ($N_{\text{EL}} = 64$ UI) values.

Fig. 6. JTOL for different integral gains ($N_{\text{EL}} = 64$ UI).

Fig. 7. JTOL function for different $N_{\text{int}}$ ($N_{\text{EL}} = 64$ UI) values.

The effect of this additional integral path delay on JTOL is given in Fig. 7. The graph indicates only a slight change in the jitter response with increasing  $N_{\text{int}}$ . Even a delay of 256 UI does not change JTOL function significantly.
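The weak dependence on $N_{\text{int}}$ can be reproduced from (1) with a short numerical sketch. Since the JTOL is inversely proportional to $\phi_{\text{err}}$, the quantity $|1 + \text{OLTF}|$ is used here as a proxy for the JTOL shape. The gains below are illustrative assumptions, not the values used for Fig. 7: `KP` is chosen for a $60^\circ$ phase margin at $N_{\text{EL}} = 64$ UI, and `KI` places the integral corner a factor of 50 below the unity-gain frequency. The model is again assumed to be sampled once per UI.

```python
import cmath
import math

def oltf(w, kp, ki, n_el, n_int):
    """Evaluate (1) at z = exp(jw); kp and ki lump G_PD*G_DPC with G_prop and G_int."""
    integ = 1.0 / (1.0 - cmath.exp(-1j * w))
    integral_path = ki * cmath.exp(-1j * w * n_int) * integ
    return (kp + integral_path) * cmath.exp(-1j * w * n_el) * integ

def jtol_shape(w, kp, ki, n_el, n_int):
    """|1 + OLTF|, proportional to the tolerable jitter amplitude in the linear model."""
    return abs(1.0 + oltf(w, kp, ki, n_el, n_int))

# Illustrative (assumed) operating point, not taken from the paper.
N_EL = 64
W_C = (math.pi / 6.0) / (N_EL - 0.5)   # unity-gain frequency for 60 deg margin, rad/UI
KP = 2.0 * math.sin(W_C / 2.0)
KI = KP * W_C / 50.0
GRID = [10.0 ** (-6.0 + 5.0 * i / 400.0) for i in range(401)]  # 1e-6 .. 1e-1 rad/UI

def max_relative_change(n_int):
    """Largest relative change of the JTOL proxy caused by an integral-path delay."""
    return max(abs(jtol_shape(w, KP, KI, N_EL, n_int) /
                   jtol_shape(w, KP, KI, N_EL, 0) - 1.0) for w in GRID)

if __name__ == "__main__":
    for n_int in (0, 64, 128, 256):
        print(n_int, f"{max_relative_change(n_int):.3f}")
```

With these assumed gains, even a 256-UI integral-path delay changes the JTOL proxy only modestly over the whole frequency grid, in line with the observation above.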

The conclusion from this analysis is that the proportional path delay directly determines the JTOL performance of a CDR loop, whereas the integral path delay has almost no impact as long as the $G_{\text{int}}$ coefficient is kept small enough not to compromise the phase margin. Thus, when designing a CDR circuit, much of the effort must be spent on minimizing the proportional path delay, while digital synthesis can be used for the integral path, as it is quicker than custom design and makes the parameters easier to control.

Fig. 8. RX top-level block diagram.

Our implementation whose details are explained in the following includes a full custom CDR circuit that runs at a quarter rate clock of 15 GHz at the maximum data rate of 60 Gb/s to minimize the proportional path delay. The integral path was omitted from this version. However, the design of the integral path is relatively easy as discussed earlier. In the transistor level simulation of the RX, the proportional path latency of the CDR (including the analog delay of the clock path) is observed to be approximately 70 UI, which according to the analysis provided above results in a corner frequency of around 80 MHz at 60-Gb/s data rate that is supported by the measurement results as will be shown in Section IV.

## III. ARCHITECTURE AND CIRCUITS

The top-level block diagram of the RX is shown in Fig. 8. The input current signal from the photodiode is converted into a voltage signal by a transimpedance amplifier (TIA) and then amplified by a variable gain amplifier (VGA). This signal is sampled by four-way time-interleaved data and edge comparators, which provide the signal and phase information for the CDR. Each data sampler consists of two comparators that generate speculative decisions for 1-tap decision feedback equalization (DFE). All the signals are then aligned to a single quarter-rate clock. The speculative decisions ($D_H$ and $D_L$) are resolved in the look ahead DFE to generate the output data signals $D$, and the edge signals ($E'$) are delayed by the same amount as the look ahead DFE to keep the $D$ and $E$ signals in sync. Finally, the output data ($D$) is sent to an on-chip pseudo-random bit sequence (PRBS) checker via a 4-to-32 demultiplexer to measure the bit error rate (BER).

The CDR logic block receives the data ($D$) and edge ($E$) signals to detect the phase information, and it generates the gated clock ($c4_G$) and up-down ($U\_D$) signals that drive the PR control ($PR_C$) block. $PR_C$ generates the 64-bit control signals for the 128-step PR itself. The PR receives in-phase (I) and quadrature (Q) clocks from a frequency divider and generates an output clock consisting of differential signals $C_p$ and $C_n$. Then, a current mode logic (CML)-based IQ generator generates eight signals with a nominal phase spacing of $45^\circ$ that correspond to the data and edge phases in quarter-rate sampling. Finally, the sampling phases are adjusted in the IQ calibration block and converted into CMOS levels via CML2CMOS converters.

Fig. 9. Comparator schematic [2].

#### A. Data Path

The data path of the RX includes an analog front end, comparators, aligner, and look ahead DFE.

1) AFE: The analog front end consists of TIA, VGA, and a current digital-to-analog converter (DAC). The current DAC is used to cancel out the average photocurrent of the photodiode. It has 12 bits of control and has a resolution of  $<1 \mu\text{A}$  current with a range of 1 mA. The ac photocurrent of the photodiode is converted into a voltage by a CMOS inverter-based shunt feedback resistor TIA. The input node of the TIA is used as a negative output node to create a pseudo-differential signal to drive the VGA [1]. This configuration has several advantages over the one with a replica-TIA: it creates less noise and the output signal amplitude is higher, whereas the layout area and power consumption are half the replica configuration [2].

The bandwidth and the transimpedance gain of the TIA are optimized for a 1-tap DFE in order to maximize the signal-to-noise ratio at its output. In our implementation, the optimum TIA bandwidth was around 15 GHz [1].

The VGA consists of two-stage Cherry-Hooper amplifiers [4] and amplifies the TIA output signal by 20 dB with a bandwidth of 20 GHz while driving the data and edge comparators (a total load of 100 fF). It also converts the pseudo-differential signal at its input into fully differential signals with output common mode adjusted to match the input common mode level of the comparators.

2) Comparator: The schematic of the implemented comparator is given in Fig. 9. It is driven by two differential signals: input ($IN_p$ and $IN_n$) and reference ($REF_p$ and $REF_n$). Each comparator has a dedicated 9-bit resistor ladder-based voltage DAC connected to the reference inputs to adjust its slicing level. The clock transistors are connected in cascode to the input transistors, as in a Lewis-Gray comparator [5]. The advantage of this configuration over a standard StrongArm latch is that it can operate at a higher common mode voltage, which matches well with the implemented analog front end (AFE). The second stage of the comparator is a self-timed latch that generates rail-to-rail output signals.

3) Look Ahead DFE: In this implementation, the DFE is moved into the digital domain to avoid the timing requirement of

$$t_{c2q} + t_{\text{mux}} + t_{\text{setup}} < 1\text{UI} \quad (3)$$

that conventional DFEs face due to the recursive nature of this equalization. A digital solution that relaxes this timing requirement was proposed by Parhi [6], and the implementation in [7] proposes a similar solution (look ahead DFE).

Fig. 10. Block diagram of look ahead DFE and comparators [2].

Fig. 11. Block diagram of CDR logic.

The block diagram of the implemented look ahead DFE is given in Fig. 10 together with the comparators and an aligner. The main idea is to generate two parallel speculative words of 4 bits  $L_H[0 : 3]$  and  $L_L[0 : 3]$  based on the assumptions  $D(3) = 1$  and  $D(3) = 0$ , respectively. At the end, those 4-bit speculations can be resolved in parallel with a single bit feedback at quarter rate, relaxing the timing limitation to

$$t_{c2q} + t_{\text{mux}} + t_{\text{setup}} < 4\text{UI}. \quad (4)$$

It must be noted that this relaxation of timing comes with a cost of power, circuit complexity, and latency. Also for higher order DFE implementations, the complexity of this implementation grows exponentially and becomes impractical even at 3-tap DFE.
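The equivalence between direct speculative resolution and the 4-bit look ahead scheme can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the implemented logic: `dh[i]` and `dl[i]` stand for the speculative decisions $D_H$ and $D_L$ of bit `i` (previous bit assumed 1 or 0, respectively), and the function names are hypothetical.

```python
import random

def resolve_serial(dh, dl, d_init=0):
    """Direct 1-tap speculative DFE: every bit selects DH or DL using the previous bit."""
    out, prev = [], d_init
    for h, l in zip(dh, dl):
        prev = h if prev else l
        out.append(prev)
    return out

def resolve_look_ahead(dh, dl, d_init=0, width=4):
    """Look ahead version: build both candidate words per block, then select once."""
    out, prev = [], d_init
    for i in range(0, len(dh), width):
        bh, bl = dh[i:i + width], dl[i:i + width]
        word_h = resolve_serial(bh, bl, 1)  # L_H: previous block assumed to end in 1
        word_l = resolve_serial(bh, bl, 0)  # L_L: previous block assumed to end in 0
        word = word_h if prev else word_l   # the only feedback: one bit per block
        out.extend(word)
        prev = word[-1]
    return out

if __name__ == "__main__":
    random.seed(0)
    dh = [random.randint(0, 1) for _ in range(32)]
    dl = [random.randint(0, 1) for _ in range(32)]
    print(resolve_look_ahead(dh, dl) == resolve_serial(dh, dl))
```

In this model the candidate words depend only on the current block, so they can be computed feed-forward, and only the final selection sits in the feedback loop. That is the source of the relaxed timing in (4), and the doubled candidate logic reflects the cost in power and complexity noted above.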

#### B. CDR Logic

In order to minimize the latency of the CDR loop, the CDR logic is implemented with a quarter rate clock. All of the blocks in the CDR path were custom designed to satisfy timing requirements of this high update rate. The CDR logic consists of three stages: a bang-bang PD, a majority voter, and a loop filter. Its block diagram is given in Fig. 11.

1) Bang-Bang Phase Detector: The bang-bang PD uses two consecutive data bits $D[n]$ and $D[n+1]$ to detect the existence of a transition, and the edge signal $E[n]$, whose sampling phase lies between $D[n]$ and $D[n+1]$, to decide whether the phase is early or late. It has two output signals: *EARLY* and *LATE*. When there is no transition, both are at logic 0. The logic equations that realize this function are

$$\begin{aligned} \text{early}'[n] &= (D[n] \oplus D[n+1])D[n]E[n] \\ \text{late}'[n] &= (D[n] \oplus D[n+1])D[n+1]E[n]. \end{aligned} \quad (5)$$

Fig. 12. Block diagram and logic implementation of the majority voter.

2) *Majority Voter*: The majority voter receives four early and four late signals from the four PDs and, as its name suggests, decides whether the early or the late signals are in the majority. The block diagram and the logic implementation of the majority voter are given in Fig. 12. It consists of three 2-to-1 majority voters (MV\_21) and resolves the early or late information in a tree-like structure in two steps. This divide-and-conquer approach minimizes the circuit complexity and increases modularity. The same approach can be extended to an 8-bit parallel input.

It must be noted that this block “loses” some valid information for certain input conditions. Let us assume the input is “EELO”: in that case, the first MV\_21 with “EE” at its input will generate an early output, whereas the second one with “LO” at its input will generate a late output. Thus, the final MV\_21 will have an “EL” signal at its input, resulting in an undecided output of “00.” This is a loss of information, as the number of early inputs was larger than the number of late inputs. This loss of information corresponds to a slight decrease in the phase detection gain around the optimum sampling point. As the sampling point moves away from the edge, the probability of receiving that sort of input from the RX drops and the phase detection gain increases again.
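The information loss of the tree structure can be made concrete with a symbolic model. The sketch below is an illustration only: it uses the letters “E,” “L,” and “O” (no decision) from the example above rather than the actual early/late signal pairs, and the function names are hypothetical.

```python
def mv21(a, b):
    """2-to-1 voter on symbols 'E' (early), 'L' (late), 'O' (no decision)."""
    if a == b:
        return a
    if a == "O":
        return b
    if b == "O":
        return a
    return "O"  # one early and one late vote cancel each other

def tree_vote(votes):
    """Two-level tree of three MV_21 cells on four PD outputs."""
    a, b, c, d = votes
    return mv21(mv21(a, b), mv21(c, d))

def true_majority(votes):
    """Reference: compare the total number of early and late votes."""
    early, late = votes.count("E"), votes.count("L")
    if early > late:
        return "E"
    if late > early:
        return "L"
    return "O"

if __name__ == "__main__":
    for pattern in ("EELO", "EEEL", "ELOO", "LLOE"):
        print(pattern, tree_vote(pattern), true_majority(pattern))
```

For “EELO” the tree returns no decision while a full count would return early. For “EEEL” both agree, which shows that only some mixed inputs are affected.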

3) *Loop Filter*: The loop filter is used to set the proportional path gain ( $G_{\text{prop}}$ ). It consists of a bidirectional shift register (BSR) and a simple finite-state machine (FSM) that checks for the trigger condition. The block diagram of the loop filter is given in Fig. 13 together with its state diagram. The simple implementation of the BSR allows it to be clocked at the quarter rate clock ( $\text{clk}_4$ ) minimizing CDR latency.

Initially, the BSR is reset to “00000000.” Then, depending on whether an early or late signal is received, it starts to fill with “1”s from one end, which corresponds to a clockwise or counter-clockwise step in the state diagram. When the BSR is full of “1”s, the FSM resets the BSR to its initial state and triggers an up- or down-rotation in the $\text{PR}_C$ block. It must be noted that the state can change in either direction at any point in the state diagram. For example, after reset, if the loop filter receives an early and a late signal consecutively, it will first go to the state “10000000” and then back to the “00000000” state.

Fig. 13. Block diagram and logic implementation of the loop filter.

Fig. 14. Linearity comparison of octagonal and diamond constellation PRs.

The maximum output rate of the loop filter in this implementation is 1 step per 10 $\text{clk}_4$ cycles, which limits the frequency offset tracking range of the RX to $\pm 780$ ppm with a PR resolution of 32 steps per UI. In practice, the error-free tracking range is $> \pm 500$ ppm, as will be shown in Section IV. For frequency offsets with a magnitude between 500 and 780 ppm, the RX clock is still locked to the input clock, but due to the lack of an integral path, the phase error becomes large and the received data are no longer error free. This frequency tracking limitation can be easily relaxed by adding an integral path to the CDR loop.
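The $\pm 780$ ppm figure follows from the step size and the update rate: one PR step of 1/32 UI every 10 quarter-rate clock cycles (40 UI) corresponds to a phase slew of 1/1280 UI per UI. A short check, with an illustrative function name:

```python
def tracking_range_ppm(steps_per_ui, clk_cycles_per_step, ui_per_clk=4):
    """Largest frequency offset (in ppm) that a PR stepping at this rate can follow."""
    ui_per_step = clk_cycles_per_step * ui_per_clk   # UI elapsed between PR steps
    return 1e6 / (steps_per_ui * ui_per_step)        # phase slew in UI per UI, as ppm

if __name__ == "__main__":
    print(tracking_range_ppm(32, 10))  # loop filter limit: 1 step per 10 clk4 cycles
    print(tracking_range_ppm(32, 1))   # PR control limit: 1 step per clk4 cycle
```

The same expression with one step per $\text{clk}_4$ cycle gives about 7.8 kppm, which is the limit of the PR control block alone.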

#### C. Phase Rotator

The implemented PR has an octagonal constellation, which provides good linearity and small amplitude variation without significant circuit complexity. The octagonal constellation PR architecture was introduced by Gangasani *et al.* [8], and later, an improved version with multiple stages was published in [9]. The calculated integral non-linearity (INL) and differential non-linearity (DNL) of an octagonal PR are compared with those of a conventional diamond constellation PR in Fig. 14. It must be noted that the diamond PR has a much higher amplitude variation in addition to its inherent INL and DNL.

Fig. 15. Schematic and constellation of PR.
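The linearity difference between the two constellations can be estimated from geometry alone. The sketch below is an idealized model and not the calculation behind Fig. 14: it assumes that the output phase is the angle of a point moved in equal steps along the sides of a regular polygon, with 128 steps per revolution (32 per side for the diamond, 16 per side for the octagon). Function names are illustrative.

```python
import math

def polygon_constellation(n_sides, steps_per_side, start_deg):
    """Points stepped linearly along the sides of a regular polygon (unit circumradius)."""
    angles = [math.radians(start_deg + 360.0 * i / n_sides) for i in range(n_sides + 1)]
    verts = [(math.cos(a), math.sin(a)) for a in angles]
    points = []
    for (x0, y0), (x1, y1) in zip(verts, verts[1:]):
        for s in range(steps_per_side):
            t = s / steps_per_side
            points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return points

def inl_dnl_lsb(points):
    """INL and DNL of the output phase, in units of the ideal step (LSB)."""
    n = len(points)
    lsb = 2.0 * math.pi / n
    first = math.atan2(points[0][1], points[0][0])
    phase = [(math.atan2(y, x) - first) % (2.0 * math.pi) for x, y in points]
    inl = [(p - k * lsb) / lsb for k, p in enumerate(phase)]
    steps = [b - a for a, b in zip(phase, phase[1:] + [2.0 * math.pi])]
    dnl = [s / lsb - 1.0 for s in steps]
    return inl, dnl

def amplitude_ratio(points):
    """Minimum over maximum output amplitude along the constellation."""
    radii = [math.hypot(x, y) for x, y in points]
    return min(radii) / max(radii)

diamond = polygon_constellation(4, 32, 0.0)    # vertices on the I and Q axes
octagon = polygon_constellation(8, 16, 22.5)   # horizontal, vertical, diagonal sides

if __name__ == "__main__":
    for name, pts in (("diamond", diamond), ("octagon", octagon)):
        inl, dnl = inl_dnl_lsb(pts)
        print(name, f"max|INL|={max(map(abs, inl)):.2f} LSB",
              f"max|DNL|={max(map(abs, dnl)):.2f} LSB",
              f"amplitude min/max={amplitude_ratio(pts):.3f}")
```

In this idealized model the amplitude ratio is $\cos 45^\circ \approx 0.71$ for the diamond and $\cos 22.5^\circ \approx 0.92$ for the octagon, and the octagon also shows a markedly smaller peak INL.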

Although our implementation provides the same octagonal constellation, it has superior characteristics in terms of transient response, circuit complexity, and simplicity of control logic.

The schematic and constellation diagram of the implemented PR are given in Fig. 15 together with its control logic. It comprises four identical DACs, and each DAC consists of 16 identical segments, which matches the 64 control bits generated by the $PR_C$ block.

The control signals are color mapped to the portion of the constellation diagram that they control. For example, the pink ($I_{ctrl}[0 : 15]$) and green ($Q_{ctrl}[0 : 15]$) signals control the horizontal (pink) and vertical (green) sections on the diagram, respectively. It must be noted that the polarity switches used in the previous publications [8], [9] are removed in this design. This change introduces several advantages. First of all, the transient phase glitch that is observed in quadrature crossings is avoided. In the octagonal PR published in [9], all the polarity switches are switched together during the quadrature crossings. This injects charge into the output nodes, resulting in glitches. In the proposed architecture, quadrature crossings do not require any special treatment and are exactly the same as other steps in the constellation.

The second advantage is the reduced circuit complexity. The whole PR consists of 64 (almost) identical current switching stages. The only differences between the vertical/horizontal and diagonal segments are the tail current value and the way the control signals are connected. A tail current ratio of $\sqrt{2}$ between vertical/horizontal and diagonal segments is optimal for minimized INL and DNL in the PR transfer function. In the practical implementation, a ratio of 7/5 was used as a close approximation to $\sqrt{2}$.

Another significant advantage comes in the control circuitry. In the conventional architecture, a phase accumulator register is used to track the phase position and a binary-to-thermometer decoder is used to produce the piecewise thermometer-encoded control signals. The proposed PR architecture allows us to implement a thermometer-encoded phase accumulator whose outputs can be used directly as control signals of the PR, eliminating the need for a binary-to-thermometer decoder and the latency it produces. Synchronization of all the outputs of the binary-to-thermometer converter is another challenge in the conventional implementation. In the proposed architecture, there is no such problem, since only one control bit changes at a time, and multiple bit changes are not possible.

Fig. 16. Quadrature oscillator.

This control logic is implemented as a BSR with an inversion in the middle as given in Fig. 15. It is directly driven by the up-down ( $U_D$ ) and gated clock ( $c4_G$ ) signals generated by the loop filter. It must be noted that the shift register needs to be reset to a valid initial state such as shown in Fig. 15.
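The behavior of such a register can be sketched as a bidirectional twisted-ring (Johnson) counter. This is an interpretation of “a BSR with an inversion,” not the circuit of Fig. 15: the position of the inversion and the bit ordering are assumptions, and `step` is a hypothetical name. The model illustrates why 64 control bits yield 128 phase steps with exactly one bit toggling per step.

```python
def step(state, up):
    """One shift of a twisted-ring register; the bit that wraps around is inverted."""
    if up:
        return [1 - state[-1]] + state[:-1]
    return state[1:] + [1 - state[0]]

def walk(n_bits, n_steps, up=True):
    """States visited when stepping n_steps times from the all-zero state."""
    state = [0] * n_bits   # all zeros is one valid initial state of this model
    visited = [state]
    for _ in range(n_steps):
        state = step(state, up)
        visited.append(state)
    return visited

if __name__ == "__main__":
    states = walk(64, 128)
    distinct = {tuple(s) for s in states[:-1]}
    print(len(distinct), states[-1] == states[0])
```

A register that powers up in a pattern outside this 128-state sequence would never enter it by shifting alone, which is consistent with the need for a reset to a valid initial state noted above.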

Integrating the phase accumulator and PR control logic in such a simple circuit allows it to be run at the quarter rate clock of 15 GHz (at 60 Gb/s). Thus, the maximum update rate of the PR control block is 1-step per clk4, which results in a potential frequency tracking range of  $\pm 7.8$  kppm.

#### D. IQ Generation for Data and Edge

Quadrature oscillators are widely used to generate  $90^\circ$  phase signals. In [10] and [11], the presented structures are named “tetrahedral” oscillators, and the functional dynamics are explained in a rather complicated way. In [12], the same circuit is drawn as a two-stage pseudo-differential ring oscillator (2xPDRO). The latter version is easier to analyze, and we will be using this notation in our analysis.

The evolution of the “tetrahedral oscillator” into a 2xPDRO is given in Fig. 16 together with its conceptual block diagram. The cross-coupled loading inverters (sized $K$) in the 2xPDRO have two main functions. First, they couple the two inverter (sized 1) outputs to build a one-stage pseudo-differential amplifier rather than two single-ended inverters. It should be noted that when $K = 0$, the oscillator becomes a positive feedback loop consisting of four inverters of size 1 and does not oscillate. In that case, $I = IB$ and $Q = QB$. Thus, $K$ must be increased to the point that the positive feedback loop is broken. Our simulations show that $K > 0.5$ should be satisfied for oscillation in the 14-nm technology with nominal supply voltage. Second, they create a hysteresis in the transfer function of the one-stage pseudo-differential amplifier. This hysteresis prevents a direct small-signal analysis of the system. On the other hand, one can model this hysteresis as a delay element whose value is defined by the time it takes for the differential input to reach the hysteresis threshold level starting from zero. The hysteresis thresholds and how they are mapped to the delay of the one-stage pseudo-differential amplifier are shown in Fig. 17. It must be noted that this additional delay due to hysteresis allows a two-stage amplifier with negative feedback to oscillate. A standard two-stage amplifier with negative feedback does not oscillate, since it has only two poles, resulting in a phase change of only $180^\circ$ at infinite frequency. However, for oscillation to occur, there must be more than $180^\circ$ of phase change with a gain higher than unity. That is why at least three gain stages are normally required. The introduction of additional delay in the loop leads to a phase change larger than $180^\circ$ even in a two-stage amplifier, and thus to oscillation if the gain criterion is also met.

Fig. 17. Modeling of 1-stage pseudo-differential amplifier.

Fig. 18. Multistage ILO.

In [10], a quadrature oscillator-based injection locked oscillator (ILO) is presented. The “tetrahedral” figure is used in the publication with one additional inverter to increase the number of control parameters, and a model is derived to analyze how cascaded stages improve the phase error performance over a wide input frequency range.

A simplified ILO with multiple stages in 2xPDRO notation is given in Fig. 18. The size of the injection inverter ( $J$ ) determines the tradeoff between phase correction factor per stage and frequency tracking bandwidth. As  $J$  increases, the frequency tracking bandwidth increases and error correction factor decreases.

At quarter rate, an edge-based CDR technique requires $45^\circ$ phases rather than the $90^\circ$ phases generated by the presented 2xPDRO. The conventional solution to this problem is to use two separate clock paths for data and edge sampling, requiring two PRs with different control signals (to satisfy the $45^\circ$ phase offset between edge and data clocks) and two quadrature generators. However, this solution has several drawbacks. First of all, the INL of the two PRs will be uncorrelated, leading to a factor of $\sqrt{2}$ increase in the systematic data-to-edge jitter caused by INL. Moreover, the random jitter generated on the two separate clock paths will also be uncorrelated, increasing the effective random data-to-edge jitter created on the clock paths by $\sqrt{2}$ as well. Instead, we propose generating $45^\circ$ phases directly from a single four-stage ILO.

Fig. 19. Schematic and functional block diagram of a 4xPDRO.

Fig. 20. Schematic of the IQ generator.

A 4xPDRO with injection inverters ( $J$ ) whose schematic and functional block diagram is given in Fig. 19 could be a good candidate for this purpose, as it creates nominal  $45^\circ$  phases. Nevertheless, the delay introduced by the cross-coupled inverters of this structure limits the maximum oscillation frequency significantly. It is also not possible to remove the hysteresis (by setting  $K = 0$ ) in order to eliminate the delay it produces, since the feedback gain of the loop becomes positive as mentioned earlier. On the other hand, a CML stage is inherently differential, and there is no need to introduce hysteresis that results in additional delay in the loop. As a result, a four-stage CML-based RO (CML-RO) can be run at a much higher frequency compared with the CMOS 4xPDRO alternative. Therefore, we decided to use a four-stage CML-RO with frequency injection. An additional benefit of using a CML-based ILO is its higher power supply ripple rejection ratio.

The implemented architecture for the IQ clock generation for the edge and data is given in Fig. 20. At its input, there are four CML stages used to generate a coarse estimate of the $45^\circ$ phases to drive the first ILO stage. In this way, the required number of cascaded ILO stages is reduced. Following the CML delay stages, there are three cascaded ILOs to generate fine $45^\circ$ phases to be used for data and edge sampling. Each ILO has a four-stage CML-RO with frequency injecting CML pairs. The injection pair is sized half the main CML stage, which results in a large frequency tracking range while keeping the phase error small enough. The natural oscillation frequency of the RO can be adjusted by changing the tail current and load resistance via two digital control bits. Finally, CML buffers are placed to drive the following stage.

Fig. 21. Optimal sampling point with the existence of $h(-1)$ on DFE eye.

#### E. IQ Correction

The DFE technique enables the cancellation of post-cursors introduced by the limited bandwidth of the data path. However, most data paths have at least a second-order transfer function, leading to non-zero pre-cursor(s), which cannot be cancelled by the DFE. Even so, the effect of the pre-cursor can be reduced by shifting the data sampling position back in time. In Fig. 21, the improvement in the equalized signal with shifted data sampling time is illustrated on the measured eye diagram of our RX at 60 Gb/s. As the sampling point is shifted to the left from the midpoint between the two edges, the pre-cursor ($h(-1)$) drops much faster than the main cursor ($h(0)$), increasing the eye opening.

We implement this shift in time with the IQ correction block. The block diagram of the IQ correction block is given in Fig. 22 together with the data-to-edge spacing timing diagram. It comprises four differential phase interpolators, and each one has five digital control bits to adjust its output within a range of $\pm 0.5$ UI with a resolution of 0.5 PR step (260 fs at 60 Gb/s). Thus, a 1-UI data-to-edge spacing is supported. During measurements, the optimal time shift is found experimentally with a sweep.

Since the control signals of the phase interpolators are independent, any residual phase error (for example, due to mismatch in the IQ generator transistors) can also be cancelled by the IQ correction block.

Fig. 22. IQ correction block.

Fig. 23. RX layout and chip micrograph.

## IV. MEASUREMENT RESULTS

The RX was implemented in a 14-nm bulk finFET technology and was wire bonded to a GaAs p-i-n diode with 50-fF capacitance, 25-GHz bandwidth, and 0.52-A/W responsivity [13]. The layout and chip micrograph of the RX are given in Fig. 23. The active area of the RX is around 150  $\mu\text{m}$  by 190  $\mu\text{m}$ .

A SiGe vertical cavity surface emitting laser driver with 2-tap feed-forward equalization (cursor and pre-cursor with a ratio of around 0.45) [14] provided the optical signal over 7 m of an OM2 multimode fiber (Fig. 24). An optical attenuator is connected in the signal path to adjust the optical modulation amplitude (OMA), which is calculated as follows:

$$\text{OMA} = 2 \frac{Av_{\text{cur}}}{\text{Res}} \frac{\text{ER} - 1}{\text{ER} + 1} \quad (6)$$

where ER is the extinction ratio (measured: ER = 1.8), Res is the responsivity of the PD (Res = 0.52 A/W), and  $Av_{\text{cur}}$  is the average current of the PD. All the measurements were run with a  $-5 \text{ dBm}$  OMA signal amplitude and a 7-bit PRBS (PRBS7).

Fig. 24. RX test setup.

Fig. 25. Bathtub and BER contour plots at 60 Gb/s and  $-5 \text{ dBm}$  OMA.

Fig. 26. JTOL and frequency tracking range of the RX.

TABLE I  
RX POWER BREAKDOWN AT 60 Gb/s

| Block                | Power (mW) |
|----------------------|------------|
| AFE                  | 25 @ 1 V   |
| CML Clock            | 23 @ 1 V   |
| CMOS Clocking, Comps | 41 @ 0.9 V |
| DFE, CDR logic, DMUX | 28 @ 0.9 V |
| **TOTAL**            | 117        |
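As a worked example of (6), the sketch below inverts the equation to estimate the average PD current that corresponds to the $-5$ dBm OMA used in the measurements. It assumes that ER = 1.8 is a linear power ratio (not a value in dB) and that the current is expressed in mA so that the OMA comes out in mW; the function names are illustrative only.

```python
import math

def oma_mw(av_cur_ma, res_a_per_w, er):
    """OMA in mW from Eq. (6); current in mA, responsivity in A/W."""
    return 2.0 * av_cur_ma / res_a_per_w * (er - 1.0) / (er + 1.0)

def mw_to_dbm(p_mw):
    return 10.0 * math.log10(p_mw)

def dbm_to_mw(p_dbm):
    return 10.0 ** (p_dbm / 10.0)

RES = 0.52   # A/W, from the text
ER = 1.8     # from the text, assumed here to be a linear ratio

# Invert Eq. (6): average PD current that corresponds to -5 dBm OMA.
target_mw = dbm_to_mw(-5.0)
av_cur_ma = target_mw * RES * (ER + 1.0) / (2.0 * (ER - 1.0))
print(f"average PD current ~ {av_cur_ma * 1000:.0f} uA")
print(f"check: OMA = {mw_to_dbm(oma_mw(av_cur_ma, RES, ER)):.2f} dBm")
```

Under these assumptions, the average photocurrent comes out at roughly 0.29 mA.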

The RX is driven by a half rate clock whose phase can be modulated for JTOL measurements. A digital PRBS checker and a correlator engine running at 2 GHz (1/32 of the baud rate) are integrated on chip to assist with measurements. Measurement results stored on on-chip registers are then transmitted off-chip via a three-wire serial interface.

The bathtub and BER contour plots obtained at 60 Gb/s are given in Fig. 25. The eye is 28% open, and the maximum eye opening is achieved with an  $h(1)$  coefficient of 175 mV at the output of the AFE.

The JTOL (for  $\text{BER} < 10^{-12}$ ) at 60 and 30 Gb/s data rates and the frequency tracking range of the RX are given in Fig. 26. The JTOL corner frequency is around 80 MHz at 60 Gb/s, which is quite close to the theoretical value calculated in Section II.

The power breakdown of the presented RX is given in Table I.
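The energy efficiency follows directly from Table I: total power divided by data rate, where mW per Gb/s is numerically equal to pJ/bit. A minimal sketch of that arithmetic, using the Table I values (the variable names are illustrative):

```python
# Power per block in mW, taken from Table I.
POWER_MW = {
    "AFE": 25.0,
    "CML Clock": 23.0,
    "CMOS Clocking, Comps": 41.0,
    "DFE, CDR logic, DMUX": 28.0,
}
DATA_RATE_GBPS = 60.0

total_mw = sum(POWER_MW.values())
energy_pj_per_bit = total_mw / DATA_RATE_GBPS  # mW / (Gb/s) = pJ/bit

# Print each block's share of the total, then the overall efficiency.
for block, p in POWER_MW.items():
    print(f"{block:22s} {p:5.1f} mW  {100 * p / total_mw:4.1f} %")
print(f"total {total_mw:.0f} mW -> {energy_pj_per_bit:.2f} pJ/bit")
```

The block powers sum to the 117 mW listed in the table, which gives 1.95 pJ/bit at 60 Gb/s.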

## V. CONCLUSION

This paper analyzes the relation between the CDR loop latency and the JTOL performance. The conclusion is that the proportional path delay directly determines an upper limit to the JTOL performance of the CDR loop, whereas the effect of the integral path delay is comparatively marginal. Thus, the main strategy to maximize the JTOL performance of a CDR loop should be to minimize the CDR proportional path latency.

A power-efficient non-return to zero optical RX with a first-order high jitter tracking bandwidth CDR has been designed in the 14-nm FinFET bulk technology and characterized up to 60 Gb/s. The design techniques to minimize the loop latency and the circuits used to implement it have been explained in detail. The measured JTOL corner frequency was 80 and 50 MHz at 60 and 30 Gb/s, respectively. The energy efficiency of the RX is less than 2 pJ/bit.

## REFERENCES

- [1] A. Cevrero et al., "29.1 A 64 Gb/s 1.4 pJ/b NRZ optical-receiver data-path in 14 nm CMOS FinFET," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2017, pp. 482–483.
- [2] I. Ozkaya et al., "A 64-Gb/s 1.4-pJ/b NRZ optical receiver data-path in 14-nm CMOS FinFET," *IEEE J. Solid-State Circuits*, vol. 52, no. 12, pp. 3458–3473, Dec. 2017, doi: [10.1109/JSSC.2017.2734913](https://doi.org/10.1109/JSSC.2017.2734913).
- [3] J. L. Sonntag and J. Stonick, "A digital clock and data recovery architecture for multi-gigabit/s binary links," *IEEE J. Solid-State Circuits*, vol. 41, no. 8, pp. 1867–1875, Aug. 2006.
- [4] E. M. Cherry and D. E. Hooper, "The design of wide-band transistor feedback amplifiers," *Proc. Inst. Elect. Eng.*, vol. 110, no. 2, pp. 375–389, Feb. 1963.
- [5] T. B. Cho and P. R. Gray, "A 10 b, 20 Msample/s, 35 mW pipeline A/D converter," *IEEE J. Solid-State Circuits*, vol. 30, no. 3, pp. 166–172, Mar. 1995.
- [6] K. K. Parhi, "Design of multigigabit multiplexer-loop-based decision feedback equalizers," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 13, no. 4, pp. 489–493, Apr. 2005.
- [7] T. Shibasaki et al., "A 56-Gb/s receiver front-end with a CTLE and 1-tap DFE in 20-nm CMOS," in *Symp. VLSI Circuits Dig. Tech. Papers*, Jun. 2014, pp. 1–2.
- [8] G. R. Gangasani et al., "A 16-Gb/s backplane transceiver with 12-tap current integrating DFE and dynamic adaptation of voltage offset and timing drifts in 45-nm SOI CMOS technology," *IEEE J. Solid-State Circuits*, vol. 47, no. 8, pp. 1828–1841, Aug. 2012.
- [9] P. A. Francese et al., "A 16 Gb/s 3.7 mW/Gb/s 8-tap DFE receiver and baud-rate CDR with 31 kppm tracking bandwidth," *IEEE J. Solid-State Circuits*, vol. 49, no. 11, pp. 2490–2502, Nov. 2014.
- [10] K.-H. Kim et al., "A 20-Gb/s 256-MB DRAM with an inductorless quadrature PLL and a cascaded pre-emphasis transmitter," *IEEE J. Solid-State Circuits*, vol. 41, no. 1, pp. 127–134, Jan. 2006.
- [11] T. Kusaga and T. Shima, "Four-stage ring oscillator for quadrature signal generation," in *Proc. Int. Conf. Signals Electron. Syst.*, Sep. 2008, pp. 89–92.
- [12] B. Casper and F. O'Mahony, "Clocking analysis, implementation and measurement techniques for high-speed data links—A tutorial," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 56, no. 1, pp. 17–39, Jan. 2009.
- [13] N. Dupuis et al., "Exploring the limits of high-speed receivers for multimode VCSEL-based optical links," in *Proc. OFC*, Mar. 2014, pp. 1–3.
- [14] D. M. Kuchta et al., "A 71-Gb/s NRZ modulated 850-nm VCSEL-based optical link," *IEEE Photon. Technol. Lett.*, vol. 27, no. 6, pp. 577–580, Mar. 15, 2015.



**Ilter Ozkaya** (S'16) received the B.Sc. degree in electronics engineering from Middle East Technical University, Ankara, Turkey, in 2007, and the M.Sc. degree in electronics engineering from Istanbul Technical University, Istanbul, Turkey, in 2010. He is currently pursuing the Ph.D. degree with the Swiss Federal Institute of Technology, Lausanne, Switzerland.

In 2014, he joined IBM Research—Zurich, Rüschlikon, Switzerland, where he has been conducting research on analog circuit design for high-speed IO links. His current research interests include high-speed optical and electrical communications and mixed signal circuit design.



**Alessandro Cevrero** (M'16) received the M.Sc. degree in nanotechnology and the Ph.D. degree in electrical engineering from the Swiss Federal Institute of Technology, Lausanne, Switzerland, in 2007 and 2014, respectively.

In 2012, he joined IBM Research—Zurich, Rüschlikon, Switzerland, where he has been involved in analog circuit design and silicon validation of high-speed energy-efficient I/O links in advanced CMOS technologies. His current research interests include high-speed analog circuit design, 3-D integration, and semiconductor manufacturing. In these areas, he has authored or co-authored over 35 technical publications.



**Pier Andrea Francese** (M'01–SM'17) received the Degree (*cum laude*) in electrical engineering from the Politecnico di Milano, Milan, Italy, in 1993, and the Ph.D. degree from ETH Zürich, Zürich, Switzerland, in 2005.

He was involved in the field of IC product development with Teradyne, Milan, Philips Semiconductors, Zürich, and National Semiconductor, Munich, Germany. In 2010, he joined the IBM Research Laboratory—Zurich, Rüschlikon, Switzerland, where he develops circuits for energy-efficient high-speed I/O links in advanced CMOS technologies.



**Christian Menolfi** (S'97–M'99) received the Dipl.-Ing. and Ph.D. degrees in electrical engineering from ETH Zürich, Zürich, Switzerland, in 1993 and 2000, respectively.

From 1993 to 2000, he was with the Integrated Systems Laboratory, ETH Zürich, as a Research Assistant, where he was involved in highly sensitive CMOS VLSI data-acquisition circuits for silicon-based microsensors. Since 2000, he has been with IBM Research—Zurich, Rüschlikon, Switzerland, where he is involved in the design of multi-gigabit low-power communication circuits in advanced CMOS technologies.



**Thomas Morf** (S'89–M'90–SM'09) received the B.S. degree from the Zürich University of Applied Science, Zürich, Switzerland, in 1987, the M.S. degree in electrical and computer engineering from the University of California at Santa Barbara, Santa Barbara, CA, USA, in 1991, and the Ph.D. degree from ETH Zürich, Zürich, in 1996.

From 1996 to 1999, he led a research group in the area of InP-HBT circuit design and technology with ETH Zürich. In 1999, he joined IBM Research—Zurich, Rüschlikon, Switzerland. He has co-authored over 150 papers and is a co-inventor of over 30 issued patents. His current research interests include ESD circuit protection, electrical and optical high-speed high-density interconnects, and terahertz antennas and detectors.



**Matthias Brändli** received the Dipl.-Ing. (M.Sc.) degree in electrical engineering from ETH Zürich, Zürich, Switzerland, in 1997.

From 1998 to 2001, he was with the Integrated Systems Laboratory, ETH Zürich, where he was involved in deep-submicron technology VLSI design challenges, digital video image processing for biomedical applications, and testability of CMOS circuits. In 2001, he joined the Microelectronics Design Center, ETH Zürich, where he was involved in numerous digital and mixed-signal ASIC design projects and EDA design automation, and contributed to teaching. In 2008, he joined the IBM Research Laboratory—Zurich, Rüschlikon, Switzerland, where he has been involved in multi-gigabit/s, low-power communication circuits in advanced CMOS technologies.



**Daniel M. Kuchta** (SM'97) received the B.S., M.S., and Ph.D. degrees in electrical engineering and computer science from the University of California at Berkeley, Berkeley, CA, USA, in 1986, 1988, and 1992, respectively.

He subsequently joined the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA, where he was involved in high-speed VCSEL characterization, multimode fiber links, and parallel fiber optic link research. He is currently a Research Staff Member with the Communications and Computation Subsystems Department, IBM Thomas J. Watson Research Center. He has authored or co-authored over 135 technical papers and is an inventor/co-inventor of at least 20 patents.



**Lukas Kull** (S'10–M'14–SM'17) received the M.Sc. degree in electrical engineering from ETH Zürich, Zürich, Switzerland, in 2007, and the Ph.D. degree from the Swiss Federal Institute of Technology, Lausanne, Switzerland, in 2014.

He joined IBM Research—Zurich, Rüschlikon, Switzerland, in 2010, where he has been involved in analog circuit design for high-speed low-power ADCs. His current research interests include analog circuit design, hardware for cognitive workloads, and IR and terahertz imaging. In these areas, he has authored or co-authored over 20 patents and 40 technical publications.



**Christian W. Baks** received the B.S. degree in applied physics from the Fontys College of Technology, Eindhoven, The Netherlands, in 2000, and the M.S. degree in physics from the State University of New York, Albany, NY, USA, in 2001.

In 2001, he joined the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA, as an Engineer, where he is involved in high-speed optoelectronic package and backplane interconnect design specializing in signal integrity issues.



**Jonathan E. Proesel** (M'10–SM'16) received the B.S. degree in computer engineering from the University of Illinois at Urbana–Champaign, Urbana, IL, USA, in 2004, and the M.S. and Ph.D. degrees in electrical and computer engineering from Carnegie Mellon University, Pittsburgh, PA, USA, in 2008 and 2010, respectively.

He joined the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA, in 2010, where he is currently a Research Staff Member involved in analog and mixed-signal circuit design for optical transmitters and receivers. He held internships with IBM Microelectronics, Essex Junction, VT, USA, in 2004, and the IBM Thomas J. Watson Research Center in 2009. His current research interests include high-speed optical and electrical communications, silicon photonics, data converters, and bioelectronics.

Dr. Proesel is a member of the IEEE Solid-State Circuits Society. He was a recipient of the Analog Devices Outstanding Student Designer Award in 2008 and the SRC Techcon Best in Session Award for Analog Circuits in 2009, and a co-recipient of the Best Student Paper Award at the 2010 IEEE Custom Integrated Circuits Conference. He has also received multiple technical awards at IBM. He serves on the Technical Program Committee for the Symposium on VLSI Circuits.



**Fuad E. Doany** received the Ph.D. degree in chemical physics from the University of Pennsylvania, Philadelphia, PA, USA, in 1984.

Following a Post-Doctoral Fellowship at the California Institute of Technology, he joined the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA, in 1985. As a Research Staff Member at IBM, he was involved in laser spectroscopy, applied optics, projection displays, and laser material processing. Since 2000, he has been involved in high-speed optical interconnects and optoelectronic packaging. He has authored or co-authored over 120 technical papers and holds over 70 U.S. patents.



**Mounir Meghelli** (M'07) received the M.S. degree in electronics and automatics from the University of Paris Orsay, Paris, France, in 1992, the Engineering degree in telecommunication from Télécom ParisTech, Paris, in 1994, and the Ph.D. degree from the University of Paris VI, Paris, after a four-year research program with the CNET France Telecom Research Center, Paris, with a focus on the design of high-speed ICs for optical communications in GaAs and InP HBT technologies.

From 1998 to 2005, he was with the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA, as a Research Staff Member, where he was involved in the design of high-frequency ICs in SiGe BiCMOS and CMOS technologies for wireline and wireless applications. From 2006 to 2011, he was with the IBM Server and Technology Group, Yorktown Heights, NY, USA, where he was a Senior Technical Staff Member, leading the design of advanced serial links for storage, networking, and server applications. He is currently managing the Mixed Signal Communication IC Design group, IBM Thomas J. Watson Research Center.



**Marcel Kossel** (S'99–M'02–SM'09) received the Dipl.-Ing. and Ph.D. degrees in electrical engineering from ETH Zürich, Zürich, Switzerland, in 1997 and 2000, respectively.

He was also involved in the field of microwave tagging systems and radio-frequency identification systems. In 2001, he joined IBM Research—Zurich, Rüschlikon, Switzerland, where he is involved in analog circuit design for high-speed serial links. His current research interests include analog circuit design and RF measurement techniques.



**Danny Luu** (S'17) received the B.Sc. and M.Sc. degrees in electrical engineering and information technology from ETH Zürich, Zürich, Switzerland, in 2013, where he is currently pursuing the Ph.D. degree.

He joined IBM Research—Zurich, Rüschlikon, Switzerland, in 2013, where he has been conducting research into analog circuit design for high-speed, high-resolution, and low-power ADCs in collaboration with ETH Zürich toward his doctoral degree.



**Yusuf Leblebici** (M'90–SM'98–F'10) received the B.Sc. and M.Sc. degrees in electrical engineering from Istanbul Technical University, Istanbul, Turkey, in 1984 and 1986, respectively, and the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana–Champaign, Champaign, IL, USA, in 1990.

Since 2002, he has been a Chair Professor with the Swiss Federal Institute of Technology, Lausanne, Switzerland, where he has also been the Director of the Microelectronic Systems Laboratory. He has authored six textbooks and over 300 articles published in various journals and conferences. His current research interests include the design of high-speed CMOS digital and mixed-signal integrated circuits, computer-aided design of VLSI systems, intelligent sensor interfaces, modeling and simulation of semiconductor devices, and VLSI reliability analysis.

Dr. Leblebici has served as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II and the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATED SYSTEMS. He has been elected as a Distinguished Lecturer of the IEEE Circuits and Systems Society for 2010/2011.



**Thomas Toifl** (S'97–M'99–SM'09) received the Dipl.-Ing. (M.S.) degree and the Ph.D. degree (Hons.) in electrical engineering from the Vienna University of Technology, Vienna, Austria, in 1995 and 1999, respectively.

In 1996, he joined the Microelectronics Group, European Research Center for Particle Physics (CERN), Geneva, Switzerland, where he developed radiation-hard circuits for detector synchronization and data transmission, which were integrated in the four particle detector systems of the new Large Hadron Collider. In 2001, he joined the IBM Research Laboratory—Zurich, Rüschlikon, Switzerland, where he has been involved in multi-gigabit per second, low-power communication circuits in advanced CMOS technologies. In these areas, he has authored or co-authored 19 patents and over 50 technical publications. Since 2008, he has managed the I/O Link Technology Group, IBM Zurich Research Laboratory.

Dr. Toifl received the Beatrice Winner Award for Editorial Excellence at the 2005 IEEE International Solid-State Circuits Conference.



**Benjamin G. Lee** (M'04–SM'14) received the B.S. degree from Oklahoma State University, Stillwater, OK, USA, in 2004, and the M.S. and Ph.D. degrees from Columbia University, New York City, NY, USA, in 2006 and 2009, respectively, all in electrical engineering.

In 2009, he became a Post-Doctoral Researcher with the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA, where he is currently a Research Staff Member. He is also an Assistant Adjunct Professor of electrical engineering with Columbia University. His current research interests include silicon photonic devices, integrated optical switches and networks for high-performance computing systems and datacenters, and highly parallel multimode transceivers.

Dr. Lee is a member of the Optical Society and the IEEE Photonics Society. He currently serves on the Board of Governors for the Photonics Society.