

# Channel-Adaptive ADC and TDC for 28 Gb/s PAM-4 Digital Receiver

Aurangozeb<sup>✉</sup>, Student Member, IEEE, AKM Delwar Hossain, Student Member, IEEE,  
Maruf Mohammad, and Masum Hossain, Member, IEEE

**Abstract**—A low-power channel-adaptive 28 Gb/s PAM-4 receiver is presented utilizing a predictive analog-to-digital converter (ADC), a successive-approximation-register (SAR) time-to-digital converter (TDC), and a feed-forward equalizer (FFE) in the digital domain. The variable-resolution flash ADC takes advantage of the channel inter-symbol interference (ISI) and can achieve 5.5 bits resolution utilizing only 16 comparators. By reusing the comparators, the ADC can provide a programmable resolution from 2 to 5.5 bits consuming 40 to 90 mW, respectively. The SAR-TDC generates 5 bits of timing information that includes 2 bits of ISI and 3 bits of timing error to achieve a low-latency and low-jitter timing recovery. Subsequently, a three-to-eight programmable tap FFE is used to equalize up to 30-dB loss, achieving a bit error rate lower than $10^{-8}$. The FFE is implemented in a field-programmable gate array, and the first three taps are realized in a look-up table (LUT). An offline higher resolution ADC is used to generate the pre-computed values for the LUT. Measured power consumption is 130 mW (excluding digital signal processing) from a 1.2-V power supply with an active chip area of $0.2025 \text{ mm}^2$ in 65-nm technology. Because both the ADC resolution and the number of FFE taps are programmable according to the channel loss, the receiver's energy consumption scales with the loss it must compensate.

**Index Terms**—Channel-adaptive ADC, digital equalization, PAM-4 digital receiver, successive-approximation-register (SAR) time-to-digital converter (TDC).

## I. INTRODUCTION

IN HIGH-SPEED wireline transceivers, the frequency-dependent channel loss is the main source of inter-symbol interference (ISI). In simple terms, ISI is the residue of the current symbol that affects the following symbols (post-cursors) as well as the previous symbols (pre-cursors). For high-loss channels, conventional receiver designs usually feature linear equalization techniques in the analog domain, such as a transmit-side finite impulse response (FIR) filter and receiver-side continuous-time linear equalization (CTLE). In addition, decision feedback equalization (DFE) is used for further ISI cancellation and bit detection. However, direct-feedback DFE

Manuscript received July 14, 2017; revised September 22, 2017 and November 10, 2017; accepted November 11, 2017. Date of publication January 1, 2018; date of current version February 21, 2018. This paper was approved by Guest Editor Sam Palermo. (*Corresponding author: Aurangozeb*)

Aurangozeb and M. Hossain are with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 1H9, Canada (e-mail: aurangoz@ualberta.ca; masum@ualberta.ca).

A. D. Hossain is with MACOM, Burlington, ON L7L 5Y7, Canada (e-mail: akmdelwa@ualberta.ca).

M. Mohammad is with Qualcomm Atheros, San Jose, CA 95110 USA (e-mail: maruf@vt.edu).

Color versions of one or more of the figures in this paper are available online at <http://ieeexplore.ieee.org>.

Digital Object Identifier 10.1109/JSSC.2017.2777099

structure is very challenging to implement for the first tap that requires closing the feedback loop within 1 UI [1]. Therefore, at high speed, loop-unrolled structures are adopted where possible outcomes are pre-calculated and then one of them is selected using previously decoded bit(s) [2], [3]. The loop-unrolled approach doubles the number of comparators with each tap, and also requires additional muxes that translate into a significant increase in power [3].

Analog mixed-signal solutions, in general, can equalize channel loss with excellent energy efficiency (around $3 \text{ pJ/bit}$) [4], [5]. However, these solutions come with their own limitations, for instance:

- 1) There is an SNR degradation resulting from CTLE that not only inverts the channel response but also amplifies noise, including crosstalk.
- 2) The linearity requirement becomes harder to achieve due to supply voltage scaling that reduces maximum achievable linear swing.
- 3) Due to process variation, it becomes difficult to achieve reliable control over zero and pole frequencies to have the desired frequency response.

All these limitations degrade the performance of symbol-by-symbol detection. The equalization becomes more complicated when we move to higher order modulation, such as PAM-4, for several reasons as follows.

- 1) Compared to binary/PAM-2 signaling, the eye height is reduced by $3\times$ for the same transmit power; this translates to a 9.5-dB SNR penalty ($20 \log_{10} 3$) and a corresponding bit error rate (BER) degradation.
- 2) Linearity requirement for PAM-4 signaling is much more stringent, which makes analog processing much more challenging than that of nonreturn to zero (NRZ) signaling, and the supply voltage scaling with technology makes it more difficult.
- 3) Residue ISI has a much bigger impact on PAM-4 compared to NRZ.

This is because residue of the largest transition impacts the smallest bit as both ISI and crosstalk, which is  $3\times$  larger compared to NRZ [6]–[8]. All these challenges motivated us to rethink the equalization strategy as well as receiver architecture. Recently, ADC-based architectures are gathering interest in order to enhance performance through digital processing.
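The SNR penalty quoted in item 1) of the PAM-4 list above follows directly from the $3\times$ eye-height reduction. The short sketch below reproduces that arithmetic; it assumes equal peak-to-peak swing for both modulations and equally spaced levels, and the function name `eye_height_penalty_db` is introduced here purely for illustration.

```python
import math

def eye_height_penalty_db(levels: int) -> float:
    """Eye-height penalty of PAM-`levels` relative to PAM-2.

    Assumes the same peak-to-peak swing and equally spaced levels, so the
    swing is split into (levels - 1) stacked eyes of equal height.
    """
    if levels < 2:
        raise ValueError("need at least two signal levels")
    return 20.0 * math.log10(levels - 1)

penalty_pam2 = eye_height_penalty_db(2)
penalty_pam4 = eye_height_penalty_db(4)
print(f"PAM-2 penalty: {penalty_pam2:.2f} dB")
print(f"PAM-4 penalty: {penalty_pam4:.2f} dB")
```

Under these assumptions the PAM-4 penalty evaluates to roughly 9.5 dB, which is the figure usually rounded to 9 dB in link-budget discussions.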

ADC-based receivers are becoming more popular in the high-speed SerDes space as a way to move equalization into the digital domain [6], [9]–[16]. In digital equalization, we have precise control over the placement of zeros and poles, and can thereby accurately cancel the ISI. All-digital implementation makes the



Fig. 1. (a) ASICs with transceivers supporting LR, MR, and SR channels and (b) their frequency responses.

design easily portable to deep submicrometer CMOS nodes, and promises more flexibility in equalizer design. However, the front-end ADC remains a major challenge since it consumes a significant amount of area and power. Although ADC-based links promise better portability, their area and power consumption remain prohibitive. This limits their adoption in large ASICs where hundreds of transceivers are used within the strict thermal budget for the entire device including the digital core(s). Note that an ASIC needs to support many links where channel loss varies over a wide range depending on the channel length [Fig. 1(a) and (b)]. This creates potential for opportunistic power savings in digital equalization that analog mixed-signal solutions do not allow. The focus of this paper is to enable such power savings both in the ADC and in digital equalization. With that goal in mind, we first discuss the overall architecture, followed by the implementation details of the different components.

## II. ADC-TDC-BASED RECEIVER ARCHITECTURE

The proposed ADC-based receiver is designed with the motivation to optimize the SNR at the ADC output [17]. Different noise sources associated with the ADC-based receiver are shown in Fig. 2. Primarily, it is the quantization noise ($N_{QZ}$) that gets amplified by the feed-forward equalizer (FFE) while equalizing the signal. Note that timing noise also degrades the ADC's resolution, and the resulting error is further amplified by the FFE. The FFE's noise amplification is proportional to the channel loss; therefore, to compensate a higher loss channel, the ADC should also provide higher resolution to keep



Fig. 2. ADC-based digital receiver with noise sources.

the quantization noise low. Unfortunately, for flash ADCs, power consumption increases exponentially with resolution.

Fig. 3(a) shows the overall architecture of the proposed ADC-based digital receiver. It consists of a linear equalizer (LEQ), a channel-adaptive analog-to-digital converter (ADC), multi-bit time-to-digital converter (TDC)-based timing recovery, and a look-up table (LUT)-based digital FFE. The ADC is made channel-adaptive by dynamically adjusting its references. As a result, a higher resolution can be achieved with fewer comparators. To assist the timing recovery, the raw coarse comparator outputs are shared. For timing recovery, a higher resolution TDC is used instead of a single-bit bang-bang phase detector (PD). Built-in ISI and data-dependent jitter filters enable faster timing recovery without waiting for the digital equalization and its long-latency path. The digital FFE is programmable from three to eight taps, and the first three taps are realized in an LUT. As a result, the quantization noise penalty becomes less significant. The LEQ, channel-adaptive ADC, successive-approximation-register (SAR) TDC-based timing recovery, and the parallel interfacing are implemented in Taiwan Semiconductor Manufacturing Company (TSMC) 65-nm CMOS. The digital FFE is implemented in a field-programmable gate array (FPGA).

In the proposed ADC, the resolution is programmable according to the channel loss. Rather than designing a general-purpose ADC, we use an ADC that exploits the ISI in the channel. In simple terms, ISI creates correlations between consecutive samples, and by exploiting these correlations, ADC power can be significantly reduced. The idea is better understood when we consider the step responses for two different loss cases: 1) short reach (SR) [Fig. 3(b)] and 2) long reach (LR) [Fig. 3(c)]. Based on the IEEE OIF forum, SR, medium-reach (MR), and LR channels have maximum lengths of 300, 500, and 700 mm, respectively [18]. However, depending on board material and interconnect, the loss can vary. Usually, SR channels have less than 15-dB loss, MR channels can have up to 25 dB of loss, and higher loss is considered LR.

In an ADC-based receiver, bulk of the equalization is done in the digital domain. The front-end LEQ complements the function of digital FFE. In this implementation, we are using the LEQ boost to eliminate the long tail of the channel step response which comes from post-cursor ISI. As a result, we can reduce the number of FFE taps and save power. However, even with moderate boost from the LEQ, there



Fig. 3. (a) Overall architecture of the proposed ADC-based digital receiver. Step response at ADC input for (b) SR channel with 15-dB loss and (c) LR channel with 30-dB loss. In both cases, LEQ boost is set to 8 dB.

is a significant amount of uncompensated ISI that limits the sample-to-sample variation. One example case is shown with a step response at the ADC input after the LEQ in Fig. 3(c). Although the LEQ provides 8-dB boost, the residue ISI from the 30-dB loss channel limits the sample-to-sample variation to only 30% of the full scale. The full scale (FS) of the ADC is defined by the minimum and maximum values of the step response. However, the sample-to-sample signal variation is the dynamic FS (DFS) that the ADC must cover in order to appropriately digitize the signal without loss of information.

For the LR channel, from one symbol to the next, the variation is only 20%–30% of FS. Therefore, by placing the available comparators to cover the DFS only, we improve the ADC resolution. Note that comparators still need to cover the FS as the signal eventually reaches those boundaries. However, this takes several consecutive samples, which allows sufficient time to dynamically adjust the comparator locations to cover the entire range, as shown in Fig. 3(c). Although the sample-to-sample variation is only 30%, we still cover 50% to capture other uncorrelated sample-to-sample variations such as reflection and noise. Detailed implementation will be discussed in Section III. For the low-loss case, sample-to-sample variation is large; especially when the LEQ is set to provide 8 to 10 dB of boost, residue ISI becomes minimal. Therefore, the available comparators are placed to cover the ADC FS, which limits the resolution of the ADC [Fig. 3(b)]. Fortunately, higher quantization noise is acceptable since less FFE boost is required to equalize such a low-loss channel.
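The resolution gain from covering only the DFS can be illustrated with a simple idealized model: if a fixed set of uniformly spaced thresholds spans only a fraction of the FS, the step size shrinks by that fraction, which is equivalent to a finer full-scale quantizer. The sketch below is a generic calculation under that assumption; the threshold count of 15 is an illustrative number and not the comparator allocation of the implemented ADC, and `effective_bits` is a hypothetical helper.

```python
import math

def effective_bits(num_thresholds: int, covered_fraction: float) -> float:
    """Equivalent full-scale resolution of uniformly spaced thresholds
    that span only `covered_fraction` of the ADC full scale (idealized)."""
    if num_thresholds < 1 or not 0.0 < covered_fraction <= 1.0:
        raise ValueError("invalid threshold count or coverage fraction")
    # The thresholds divide the covered window into this many steps.
    steps_in_window = num_thresholds + 1
    # Tiling the same step size over the whole full scale gives this many steps.
    equivalent_steps = steps_in_window / covered_fraction
    return math.log2(equivalent_steps)

bits_full_scale = effective_bits(15, 1.0)   # thresholds spread over the FS
bits_half_scale = effective_bits(15, 0.5)   # same thresholds over 50% of the FS
print(bits_full_scale, bits_half_scale)
```

With these illustrative numbers, restricting the window to 50% of the FS raises the equivalent resolution from 4 to 5 bits without adding thresholds, provided the window is moved in time to track the signal.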

To reduce quantization noise amplification in FFE, we adopt a hybrid approach—part of the FFE is implemented in an LUT that uses non-linear mapping to reduce quantization noise amplification. Remaining taps that have less impact on

quantization noise are implemented using an FIR filter. Although the digital signal processing (DSP) is implemented in an FPGA for this paper, we provide power and area estimates based on simulation in 65-nm CMOS in Section V. To summarize, we propose two techniques to reduce the ADC-based receiver's power: first, relaxing the resolution requirement by reducing quantization noise amplification in the digital equalizer, and second, taking advantage of the channel ISI to reduce the number of required comparators for the ADC. System-level simulation of the proposed architecture including different voltage and time domain noise sources is shown in Fig. 4(a) and (b). The system simulation includes voltage and timing noises, as listed in Fig. 2. For simpler comparison to non-ADC-based receivers, the equalized eye opening is referred back to the ADC input node. Note that in the digital domain, the eye is only valid at the ADC's sampling points (-1, 0, and 1); the oversampled eye is shown here for easier visualization. As expected, the required ADC resolution and the number of required FFE taps increase with channel loss. One of the goals of this paper is to exploit digital capabilities to enable an equalizer solution that can take advantage of different channel losses. Consequently, the ADC and the following equalization are built with scalable resolution to save power. These techniques together have the potential to improve the energy efficiency of the ADC-based receiver to less than 4 pJ/bit.

In addition to SNR, we need to consider timing recovery. To address the critical timing margin of PAM-4 signaling, we use multi-bit TDC instead of a single bit PD. As it turns out, ADC and TDC work in a collaborative way to generate multi-bit data and edge information that enables low-power operation. The organization of this paper is as follows. The proposed ISI aware variable-resolution ADC is presented



Fig. 4. Equalized ADC input referred eye opening as a function of ADC resolution and number of FFE taps in DSP for (a) LR channel and (b) MR channel. Circled point shows the needed resolution and taps to achieve 40+ mV opening for  $\text{BER} < 10^{-6}$ .



Fig. 5. Different architectures of two-step flash ADC. (a) Subtraction of coarse reference from input and residue amplifications [18]. (b) Rectification, re-sampling, and sharing the resampled value between coarse and fine ADC [19]. (c) Sharing sampled value between coarse and fine ADC [22]. (d) Proposed ISI aware variable-resolution ADC where edge S/H pulse appears 0.5 UI earlier than data S/H pulse. (e) Reference passing with and without edge trigger.

in Section III. Section IV describes the SAR TDC-based low-latency and low-jitter timing recovery. Section V describes the potential benefit of the LUT-based digital equalization. Section VI discusses the implementation and measured results. Finally, Section VII summarizes the work with key aspects.

## III. ISI AWARE VARIABLE-RESOLUTION ADC

Based on the system level simulation presented in Section II, PAM-4 receiver requires 2- to 5.5-bit variable-resolution ADC. Two potential candidates are Flash and SAR ADCs. A conventional flash ADC requires  $2^N - 1$  number of comparisons to resolve  $N$  bits. Therefore, the number of comparators grows exponentially with resolution and so do the area and power.

On the other hand, a SAR ADC resolves the bits sequentially following a binary search. Since the same comparator can be reused, power scales more gracefully, but $N$ cycles are required to resolve $N$ bits. Therefore, SAR ADCs for high-speed applications use time-interleaved (TI) architectures that require front-end time interleavers, which makes them less energy efficient at lower resolution compared to flash.

A sub-ranging ADC is a compromise between conversion rate and comparator count. In the simplest form, $N$ bits total can be resolved in two steps: the $M$ most significant bits (MSBs) are resolved in the first cycle, and the $L$ least significant bits (LSBs) are resolved in the second cycle. As a result, the number of comparators can be reduced to $(2^M - 1) + (2^L - 1)$ [Fig. 5(a)] [19], to resolve $M + L = N$ bits.
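To make the comparator-count tradeoff concrete, the sketch below evaluates the two expressions given above for a few integer resolutions. It is only an arithmetic illustration of $2^N - 1$ versus $(2^M - 1) + (2^L - 1)$; the function names and the chosen $(N, M)$ splits are hypothetical and do not describe the proposed ADC.

```python
def flash_comparators(n_bits: int) -> int:
    """Comparators needed by a conventional N-bit flash ADC."""
    return 2 ** n_bits - 1

def two_step_comparators(m_bits: int, l_bits: int) -> int:
    """Comparators needed by a two-step (sub-ranging) ADC with M MSBs and L LSBs."""
    return (2 ** m_bits - 1) + (2 ** l_bits - 1)

# Illustrative splits of N total bits into M coarse and L fine bits.
for n_bits, m_bits in [(4, 2), (5, 2), (6, 3)]:
    l_bits = n_bits - m_bits
    print(
        f"N={n_bits}: flash={flash_comparators(n_bits)}, "
        f"two-step (M={m_bits}, L={l_bits})={two_step_comparators(m_bits, l_bits)}"
    )
```

The gap widens quickly with resolution, which is why the two-step approach trades one extra conversion cycle for a much smaller comparator array.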



Fig. 6. S/H pulse generation. (a) Block diagram, (b) simulation, and (c) 10-k run Monte Carlo simulation of sampling timing variation between quad coarse and octal fine pulses.

Since the dynamic range after the MSB detection is scaled by $1/2^M$, these ADCs are known as sub-ranging ADCs. For a sub-ranging ADC, we need to subtract the analog equivalent of the resolved $M$ bits from the input with high linearity. In fact, the $M$-bit DAC needs to provide $(M+L)$-bit linearity, which is a stringent requirement. Rectification in the signal path can partially address this problem [14]. In this approach, a single comparator determines the polarity of the sampled input (MSB), and then, based on the MSB, the sampled value is folded and resampled on the input capacitor $C_{in}$. Due to rectification, both the dynamic range and the number of comparators are reduced by a factor of 2 [Fig. 5(b)]. Note that the additional processing in the signal path adds complexity; therefore, in existing solutions, folding is limited to 1 bit only [20].

Reference selection can provide a similar benefit without requiring a highly linear DAC [21]. Here, the coarse ADC still resolves $M$ bits, but these $M$ bits are not used to drive a DAC; instead, they are used as selection bits to select $2^L$ reference levels from $2^N$ total using analog muxes. The main challenge in this architecture is the reference settling: before triggering the fine ADC, the $M$ bits of the coarse ADC should be resolved and the reference levels must be settled. In particular, the maximum reference swing for the fine comparators is approximately the entire dynamic range, which limits the speed of the operation [22]. Pre-charging the reference nodes with the input information partially addresses the reference settling [23] but adds loading in the signal path [Fig. 5(c)].

The proposed two-step ADC takes advantage of the channel loss [Fig. 5(d)]. In a moderately lossy channel, the signal variation is only 40% of the dynamic range—therefore, reference selection can be initiated ahead of time based on the edge samples—providing more time for the references to settle. In a 5.5-bit ADC, there are 45 reference levels. The edge sample outputs are used to select 15 out of those 45 reference levels, and then finally, coarse ADC comparator outputs are used to select five reference levels from 15. The pipelined operation in reference selection ensures proper settling of the fine references [Fig. 5(e)].

Mismatch in the sampled values of coarse and fine ADC generally results from gain mismatch and re-sampling error at the front-end that requires sophisticated calibration to correct. Instead, coarse and fine ADC's samplers can be used in

parallel after a common gain stage. Note that critical timing is set by fine ADC reference settling time followed by comparator triggering. During this time, coarse ADCs are idle. Therefore, a more efficient approach is to have a quarter rate coarse ADC followed by two octal rate fine ADCs (even–odd architecture). Since this approach doubles the number of comparators in the fine ADC, clocking requires careful consideration.

Fig. 6(a) shows the block diagram of the clocking for the quad-channel ADC. The voltage-controlled oscillator (VCO) provides eight phases (C4) of the clocks; four of them are for data sampling and four are for edge sampling and timing recovery. Similar to any other TI ADC, this quarter rate ADC is also sensitive to quadrature mismatch. To compensate this skew between the TI paths, binary-weighted capacitors are used as phase shifters in the data path, which provides higher resolution and linearity at the same time [24]. Edge clock paths are also loaded with similar binary capacitors; however, they are used for the SAR-TDC, as will be discussed in Section IV. For octal rate sampling, quarter rate clocks are divided by two to generate octal clock phases. However, both octal and quarter rate clocks need to have precise phase alignment to avoid mismatch. This is done at the phase aligner, where the quarter rate clock is used to re-time both the divided clock and itself. This eliminates any additional skew that is accumulated between the quarter rate and octal rate clocks. After retiming, a pulse generation circuit is used to generate the quad and octal sampling pulses. Therefore, the only source of skew is the single logic gate mismatch that appears at the pulse generation circuit. A two-input NAND gate is used to generate the sampling pulse. A 10-k run Monte Carlo simulation of the complete clocking block shows that the skew between the quad coarse and octal fine sampling pulses is around 0.2 ps [Fig. 6(c)].

Post-layout simulation of the S/H is illustrated in Fig. 6(b). It shows good alignment between quarter and octal rate sampling phases. As a result, the S/H mismatch between them is reduced to less than 1/2 LSB. This sub-ranging ADC enables scalable power versus resolution characteristics, as shown in Fig. 7. It also shows the comparison of power scalability with conventional SAR and flash ADC. Several published works [10], [12], [13], [25]–[29] are plotted to verify the theoretical analysis of power scalability for SAR and flash



Fig. 7. ADC resolution versus power at 14 GS/s in 65-nm CMOS.



Fig. 8. Simulation results of reference settling for coarse and fine comparators.

ADC. Due to its simplicity, at lower resolution, it is nearly as efficient as flash, and above 4-bit resolution, it is more efficient than SAR. Note that at higher resolution, the proposed ADC takes advantage of channel loss to reduce the number of comparators, resulting in improved power efficiency.

### A. Timing Budget for Reference Selection

Fig. 8 shows the reference selection procedure for the 5-bit mode. In this case, coarse comparators cover 50% of the ADC's FS at a time [Fig. 3(c)]. Therefore, the references of the coarse comparators need to be updated to cover the remaining 50% of the FS. To monitor the signal change, two edge comparators are used that are available 0.5-UI earlier than data S/H. Coarse comparators are triggered 2 UI after the edge comparators. Within this time period (2 UI), edge comparators need to resolve the sample ( $t_{EG}$ ) and coarse references need to be settled ( $t_{set,coar}$ ) at the output of analog mux. Worst case post-layout simulation shows that  $t_{EG}$  and  $t_{set,coar}$  are 0.56 UI (40 ps) and 0.7 UI (50 ps), respectively. Therefore, the margin for the coarse reference selection is

$$t_{margin,coar} = 2 \text{ UI} - t_{EG} - t_{set,coar} = 0.74 \text{ UI}(53 \text{ ps}). \quad (1)$$

TABLE I  
COARSE COMPARATOR REFERENCE SELECTION BUDGET

| Parameter                                       | Value           |
|-------------------------------------------------|-----------------|
| Edge comparator decision time, $t_{EG}$         | 0.56 UI (40 ps) |
| Coarse reference settling time, $t_{set,coar}$  | 0.7 UI (50 ps)  |
| Margin                                          | 0.74 UI (53 ps) |

TABLE II  
FINE COMPARATOR REFERENCE SELECTION BUDGET

| Parameter                                            | Value           |
|------------------------------------------------------|-----------------|
| Edge comparator output retime, $t_{EG,retime}$       | 0.42 UI (30 ps) |
| Preselected reference settling time, $t_{set,pre}$   | 0.84 UI (60 ps) |
| Coarse comparator decision time, $t_{COAR}$          | 0.56 UI (40 ps) |
| Coarse comparator output retime, $t_{COAR,retime}$   | 0.42 UI (30 ps) |
| Fine reference settling time, $t_{set,fine}$         | 0.98 UI (70 ps) |
| Margin                                               | 0.60 UI (43 ps) |

The fine comparators are triggered 3 UI after the coarse comparators are triggered (Fig. 8). Therefore, the available time for fine reference selection is 5 UI. Since the edge and coarse comparators run at quad rate and the fine comparators at octal rate, both edge and coarse decisions are retimed (retiming is performed 1 UI after the comparator decision). This introduces two extra D-FF delays in the fine reference selection procedure along with the comparator decision time and reference settling time. To meet the timing, we preselect the references based on the edge comparators' output. Therefore, the margin for the reference pre-selection is

$$\begin{aligned} t_{margin,pre} &= 3 \text{ UI} - t_{EG} - t_{EG,retime} - t_{set,pre} \\ &= 1.18 \text{ UI}(84 \text{ ps}) \end{aligned} \quad (2)$$

where $t_{EG,retime}$ is the time for retiming the edge comparator output. The allocated time for fine reference pre-selection is 3 UI, which allows sufficient time to settle the references. Similarly, the margin for the final fine reference selection is

$$t_{margin,fine} = 2 \text{ UI} - t_{COAR,retime} - t_{set,fine} = 0.6 \text{ UI}(43 \text{ ps}) \quad (3)$$

where  $t_{COAR,retime}$  is the time for retiming coarse comparator output and  $t_{set,fine}$  is the final fine reference settling time. Tables I and II summarize the coarse and fine reference selection budget, respectively.
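As a quick arithmetic check of (1) through (3), the margins can be recomputed from the delays listed in Tables I and II. The sketch below assumes a unit interval of 1/(14 GBaud), about 71.4 ps, which follows from 28 Gb/s PAM-4 carrying 2 bits per symbol and is consistent with the 40 ps = 0.56 UI conversion used in the text. The constant and function names are illustrative only.

```python
BAUD_RATE = 14e9                 # symbols/s: 28 Gb/s PAM-4, 2 bits per symbol
UI_PS = 1e12 / BAUD_RATE         # unit interval in picoseconds

# Delays in UI, taken from Tables I and II.
T_EG = 0.56            # edge comparator decision time
T_SET_COAR = 0.70      # coarse reference settling time
T_EG_RETIME = 0.42     # edge comparator output retime
T_SET_PRE = 0.84       # preselected reference settling time
T_COAR_RETIME = 0.42   # coarse comparator output retime
T_SET_FINE = 0.98      # fine reference settling time

def margin_ui(window_ui: float, *delays_ui: float) -> float:
    """Timing margin left after subtracting all delays from the window."""
    return window_ui - sum(delays_ui)

margin_coarse = margin_ui(2, T_EG, T_SET_COAR)             # eq. (1)
margin_pre = margin_ui(3, T_EG, T_EG_RETIME, T_SET_PRE)    # eq. (2)
margin_fine = margin_ui(2, T_COAR_RETIME, T_SET_FINE)      # eq. (3)

for name, value in [("coarse", margin_coarse),
                    ("pre-selection", margin_pre),
                    ("fine", margin_fine)]:
    print(f"{name}: {value:.2f} UI ({value * UI_PS:.0f} ps)")
```

Under the stated UI assumption this reproduces the 0.74 UI (53 ps), 1.18 UI (84 ps), and 0.60 UI (43 ps) margins quoted in the text.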

Although a quarter rate fine ADC doubles the hardware compared to a half-rate fine ADC, the choice is motivated by the reference update and settling time. Since the T/H requires 1 UI of track time, we are left with 3 UI for coarse comparator operation and reference update. From the reference selection budget (Fig. 8), it takes more than 1 UI (1.26 UI) for the



Fig. 9. ADC block diagram showing S/H, comparators, reference passing mux, and T-to-B decoder only.

TABLE III  
RESOURCES ALLOCATION BASED ON RESOLUTION

| Mode    | Edge Comp. | Edge T-to-B | Coarse Comp. | Coarse T-to-B | Fine Comp.                      | Fine T-to-B | Comparator Recycling |
|---------|------------|-------------|--------------|---------------|---------------------------------|-------------|----------------------|
| 2-bit   | OFF        | OFF         | Coar<3:0>    | ON            | OFF                             | OFF         | OFF                  |
| 3-bit   | OFF        | OFF         | Coar<3:0>    | ON            | Fine<2> Even<br>Fine<2> Odd     | OFF         | OFF                  |
| 4-bit   | OFF        | OFF         | Coar<3:0>    | ON            | Fine<3:1> Even<br>Fine<3:1> Odd | ON          | OFF                  |
| 5-bit   | E<1:0>     | ON          | Coar<3:0>    | ON            | Fine<3:1> Even<br>Fine<3:1> Odd | ON          | ON                   |
| 5.5-bit | E<1:0>     | ON          | Coar<3:0>    | ON            | Fine<4:0> Even<br>Fine<4:0> Odd | ON          | ON                   |

comparator decision and reference update in TSMC 65 nm that eliminates the feasibility of reliable half-rate architecture.

### B. Modes of Operation

Fig. 9 explains the configurability of the ADC; it shows only the S/H, coarse and fine comparators, and thermometer-to-binary (T-to-B) decoder. Based on the discussion in Section II, for a 30-dB-loss channel, we need an effective number of bits (ENOB) of 5 or higher. However, for SR channels with less than 15-dB loss, the Tx FIR and analog front-end (AFE) can equalize the eye. The ADC in this case is used for symbol detection with 2-bit resolution. For intermediate channel losses, coarse, fine, and edge comparators are partially enabled, as summarized in Table III. Segmenting the back-end T-to-B decoder allows significant power savings at lower resolution. Only the 5- and 5.5-bit modes rely on channel ISI to improve their resolution. Unlike ISI, channel reflections are not correlated from sample to sample. Fortunately, reflections are mainly a concern for SR channels, and for those channels the comparators are “fixed” and cover the FS of the ADC. For lossy channels, reflections are also attenuated, and as a result, the additional uncertainty from the reflections is well captured within the margin of the DFS.

## IV. TIMING RECOVERY

Similar to the voltage noise, any timing noise-induced voltage error also translates to a loss of resolution or increase in quantization noise. Assuming that the sampling clock timing

uncertainty is  $\Delta t$ , then the resultant voltage noise can be estimated simply based on the slew rate of the signal,  $\Delta v_{p-p} = \Delta t_{p-p} \cdot (\text{Slew Rate})$ . For a sinusoidal input,  $A_{\text{in}} \sin(2\pi f_{\text{in}}t)$ , slew rate can be simply calculated based on its maximum slope

$$\begin{aligned} \text{Slew Rate} &= \text{Max} \left( \frac{\partial}{\partial t} \{A_{\text{in}} \sin(2\pi f_{\text{in}}t)\} \right) \\ &= A_{\text{in}} 2\pi f_{\text{in}}. \end{aligned} \quad (4)$$

Therefore, the rms noise voltage can be expressed as $\Delta v_{\text{rms}} = A_{\text{in}} 2\pi f_{\text{in}} \sigma / \sqrt{2}$, where $\sigma$ is the standard deviation of the timing noise. So, the jitter-limited SNR can be expressed as [30]

$$\begin{aligned} \text{SNR}_{\text{jitter}} &= 20 \log_{10} \left( \frac{A_{\text{in}}/\sqrt{2}}{\sigma A_{\text{in}} 2\pi f_{\text{in}}/\sqrt{2}} \right) \\ &= -20 \log_{10}(\sigma 2\pi f_{\text{in}}). \end{aligned} \quad (5)$$

If the $N$-bit ADC's SNR is limited only by quantization noise, the SNR of the ADC becomes $20 \log_{10}(\sqrt{3/2} \cdot 2^N)$. Including the jitter impact, a more accurate SNR can be expressed as [30]

$$\text{SNR}_{\text{w/o loss}} = -20 \log_{10} \left( \sqrt{(\sigma 2\pi f_{\text{in}})^2 + \left( \frac{1}{\sqrt{\frac{3}{2}} \cdot 2^N} \right)^2} \right). \quad (6)$$

Therefore, the SNR degrades at higher input frequencies due to higher slew rate, which enforces a stringent jitter requirement to achieve the targeted ENOB. Although this tradeoff between SNR and jitter limits the performance of the ADC at the Nyquist frequency, it is pessimistic in the case of wireline equalization. Since the primary purpose of the ADC is ISI compensation, the channel frequency response should be considered. Given that the channel has significant loss around the Nyquist frequency, the input sinusoid amplitude should be scaled accordingly; the slew rate and the corresponding rms noise voltage are then reduced as well.
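The single-tone expressions (5) and (6) are easy to evaluate numerically. The sketch below does so for the RJ value and resolution used later in this section ($\sigma = 513$ fs, $N = 5.5$); the input frequency of 7 GHz is an assumption made here, taken as half of the 14-GS/s sampling rate, and the function names are illustrative.

```python
import math

def snr_jitter_db(sigma_s: float, f_in_hz: float) -> float:
    """Jitter-limited SNR of a full-scale sinusoid, eq. (5)."""
    return -20.0 * math.log10(sigma_s * 2.0 * math.pi * f_in_hz)

def snr_quantization_db(n_bits: float) -> float:
    """Quantization-limited SNR of an ideal N-bit ADC."""
    return 20.0 * math.log10(math.sqrt(1.5) * 2.0 ** n_bits)

def snr_total_db(sigma_s: float, f_in_hz: float, n_bits: float) -> float:
    """Combined jitter and quantization SNR, eq. (6)."""
    jitter_term = sigma_s * 2.0 * math.pi * f_in_hz
    quant_term = 1.0 / (math.sqrt(1.5) * 2.0 ** n_bits)
    return -20.0 * math.log10(math.hypot(jitter_term, quant_term))

SIGMA_S = 513e-15   # rms random jitter from the text
F_IN_HZ = 7e9       # assumed Nyquist frequency for 14 GS/s
N_BITS = 5.5        # ADC resolution from the text

snr_j = snr_jitter_db(SIGMA_S, F_IN_HZ)
snr_q = snr_quantization_db(N_BITS)
snr_tot = snr_total_db(SIGMA_S, F_IN_HZ, N_BITS)
print(f"jitter: {snr_j:.1f} dB, quantization: {snr_q:.1f} dB, total: {snr_tot:.1f} dB")
```

Because the two noise terms add in power, the combined SNR is always below the smaller of the two individual limits, which is what makes the lossless Nyquist-tone estimate so demanding on jitter.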

To capture the channel frequency response, we consider a two-tone input $A_{\text{low}} \sin(2\pi f_{\text{low}}t) + A_{\text{high}} \sin(2\pi f_{\text{high}}t)$. Here, $f_{\text{low}}$ is the lower frequency where the channel is nearly lossless and $f_{\text{high}}$ is near the Nyquist frequency, where the equalizer needs to compensate maximum loss. For a given channel loss, $A_{\text{low}}$ and $A_{\text{high}}$ can be related as $\xi = A_{\text{high}}/A_{\text{low}}$. In this case, the maximum slew rate ($\text{SR}_{\text{max}}$) can be expressed as

$$\begin{aligned} \text{SR}_{\text{max}} &= \text{Max} \frac{\partial}{\partial t} \{A_{\text{low}} \sin(2\pi f_{\text{low}}t) + A_{\text{high}} \sin(2\pi f_{\text{high}}t)\} \\ &= A_{\text{low}} \cdot 2\pi f_{\text{low}} + A_{\text{high}} \cdot 2\pi f_{\text{high}}. \end{aligned} \quad (7)$$

Similarly, the SNR in this case can be written as

$$\begin{aligned} \text{SNR}_{\text{jitter, w/ loss}} &= 20 \log_{10} \left( \frac{\sqrt{\left( \frac{A_{\text{low}}}{\sqrt{2}} \right)^2 + \left( \frac{A_{\text{high}}}{\sqrt{2}} \right)^2}}{\sigma \cdot \sqrt{\left( \frac{A_{\text{low}}}{\sqrt{2}} \cdot 2\pi f_{\text{low}} \right)^2 + \left( \frac{A_{\text{high}}}{\sqrt{2}} \cdot 2\pi f_{\text{high}} \right)^2}} \right) \\ &= 20 \log_{10} \left( \frac{\sqrt{1 + \xi^2}}{\sigma \cdot \sqrt{(2\pi f_{\text{low}})^2 + (\xi 2\pi f_{\text{high}})^2}} \right). \end{aligned} \quad (8)$$



Fig. 10. (a) Effect of timing noise in W/O and W/loss channels. (b) SNR penalty as a function of input frequency in the presence of jitter.

For a lossy channel,  $\xi \ll 1$ ; therefore, the jitter-limited SNR can be approximated as

$$\text{SNR}_{\text{jitter, w/ loss}} \approx -20 \log_{10} (\sigma \cdot 2\pi \cdot \sqrt{(f_{\text{low}})^2 + (\xi \cdot f_{\text{high}})^2}). \quad (9)$$

Similar to the previous case, overall SNR of the ADC for lossy channel can be approximated as

$$\begin{aligned} \text{SNR}_{\text{w/ loss}} \\ = -20 \log_{10} \left( \sqrt{\left( \sigma \cdot 2\pi \cdot \sqrt{(f_{\text{low}})^2 + (\xi \cdot f_{\text{high}})^2} \right)^2 + \left( \frac{1}{\sqrt{\frac{3}{2}} \cdot 2^N} \right)^2} \right). \end{aligned} \quad (10)$$

Interestingly, channel loss reduces the SNR penalty in this case by slowing the edge rate of the signal [Fig. 10(a)]. This updated SNR expression matches well with the SNR obtained from direct transient simulation [Fig. 10(b)] for both the lossy and lossless cases. The solid red and blue lines represent (6) and (10), respectively, for a specific random jitter (RJ)  $\sigma$  ( $=513$  fs), ADC resolution  $N$  ( $=5.5$  bit), and  $\xi \ll 1$ . To verify the theoretical plot of the timing noise effect in lossy and lossless channels, two transient sinusoids (single tone and two tone) are generated with the above RJ ( $\pm 3 \sigma$ ). These sinusoids are digitized using an ideal 5.5-bit



Fig. 11. (a) TDC-based timing recovery loop. (b) PAM-4 in the presence of ISI with edge distribution with and without ISI filter.

ADC. Finally, the red circles and blue squares are obtained by calculating the SNR from the FFT of the digitized two-tone and single-tone sinusoids, respectively. Periodic jitter and phase mismatch can be included in the simulation in the same way. However, a 30-dB SNR target still translates to a challenging jitter target: RJ should be less than 600 fs, periodic jitter less than 500 fs, and timing skew between interleaved channels less than 350 fs.
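As a numerical illustration of (10), the following sketch evaluates the combined jitter and quantization SNR for the RJ and resolution quoted above. The low-frequency tone `F_LOW_HZ`, the 7-GHz high-frequency tone, and the two loss values are assumptions chosen for illustration; they are not taken from the measurements.

```python
import math

def snr_with_loss_db(sigma_s, n_bits, f_low_hz, f_high_hz, xi):
    """Evaluate (10): jitter term and quantization term added in power."""
    jitter_term = sigma_s * 2 * math.pi * math.sqrt(f_low_hz**2 + (xi * f_high_hz)**2)
    quant_term = 1.0 / (math.sqrt(1.5) * 2**n_bits)
    return -20 * math.log10(math.hypot(jitter_term, quant_term))

SIGMA_S = 513e-15   # RJ quoted in the text
N_BITS = 5.5        # ADC resolution quoted in the text
F_HIGH_HZ = 7e9     # assumed: Nyquist frequency of a 14-GBaud PAM-4 link
F_LOW_HZ = 0.5e9    # assumed: low-frequency tone where the channel is nearly lossless

def xi_from_loss_db(loss_db):
    """Amplitude ratio A_high/A_low for a given loss at f_high."""
    return 10 ** (-loss_db / 20)

snr_20db = snr_with_loss_db(SIGMA_S, N_BITS, F_LOW_HZ, F_HIGH_HZ, xi_from_loss_db(20))
snr_30db = snr_with_loss_db(SIGMA_S, N_BITS, F_LOW_HZ, F_HIGH_HZ, xi_from_loss_db(30))
# Quantization-only limit, i.e. (10) with zero jitter
quant_limit_db = 20 * math.log10(math.sqrt(1.5) * 2**N_BITS)

print(f"20 dB loss: {snr_20db:.2f} dB, 30 dB loss: {snr_30db:.2f} dB, "
      f"quantization limit: {quant_limit_db:.2f} dB")
```

Under these assumptions, the higher-loss channel sits closer to the quantization limit, which is the trend described for Fig. 10.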

Given the jitter target, the timing recovery loop requires careful consideration. In a conventional digital clock and data recovery (CDR) loop, there are three main sources of timing noise. The primary source of jitter is the limited resolution of the bang-bang phase detector (BBPD). Since a BBPD is essentially a 1-bit TDC, phase quantization sets the in-band noise floor. Therefore, the loop bandwidth should be lowered to reduce its impact. The second source is the VCO's self-generated noise, which requires a wider loop bandwidth to filter the VCO's RJ effectively. With a BBPD, it is therefore challenging to meet the jitter specification of an ADC-based receiver. In a digital receiver, however, the main challenge is the third source, latency-induced jitter. The latency here is set by the conversion rate and by the DSP that performs digital equalization. Usually, the DSP clock rate is relatively slow (for example,  $f_{\text{baud}}/32$ ). Therefore, even if digital equalization and phase detection are limited to 6 to 8 cycles, the total latency can reach hundreds of UI (256 UI for 8 cycles). The digital nature of the loop causes the VCO in the CDR to dither between two frequencies  $f_{\text{DATA}} \pm \Delta f$ . This steady-state frequency dithering results in a limit-cycle jitter that increases in proportion to the loop latency,  $J_{\text{PP}} \propto K_P L$  [31]. Here,  $K_P$  is the proportional gain of the CDR loop and  $L$  is the loop latency in UI. Note that the bandwidth of the CDR is also set by the proportional gain,  $\omega_P = K_{\text{BB}} K_P K_{\text{VCO}}$ . As a result, to keep the limit-cycle jitter within 500 fs for a 256-UI loop latency, the CDR loop bandwidth should be less than 1 MHz.
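The latency arithmetic and the resulting bandwidth limit can be sketched as follows. The scaling rule uses only the two proportionalities stated above ( $J_{\text{PP}} \propto K_P L$  and  $\omega_P \propto K_P$ ), so it is a first-order estimate. Treating the 20-UI TDC update interval described later in this section as the loop latency is an assumption made for illustration.

```python
DSP_CLOCK_DIVIDER = 32  # DSP clock is f_baud/32 in the example from the text

def loop_latency_ui(dsp_cycles, divider=DSP_CLOCK_DIVIDER):
    """Latency in UI when the loop spends `dsp_cycles` DSP clock cycles."""
    return dsp_cycles * divider

def bandwidth_for_same_jitter_hz(ref_bw_hz, ref_latency_ui, new_latency_ui):
    """For a fixed limit-cycle jitter, J_PP ~ K_P * L and BW ~ K_P give BW ~ 1/L."""
    return ref_bw_hz * ref_latency_ui / new_latency_ui

latencies_ui = [loop_latency_ui(cycles) for cycles in (6, 8)]

# Reference point from the text: below 1 MHz at 256 UI for 500-fs limit-cycle jitter.
# Assumed new latency: 20 UI (the TDC update interval), bypassing the DSP.
bw_at_20_ui_hz = bandwidth_for_same_jitter_hz(1e6, 256, 20)

print(latencies_ui, f"{bw_at_20_ui_hz / 1e6:.1f} MHz")
```

With these numbers the permitted bandwidth rises by roughly an order of magnitude, which is the motivation for keeping the equalizer out of the timing loop.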

Fig. 11(a) shows the architecture of the TDC-based timing recovery. In this paper, we use a TDC for two reasons. First, a multi-bit TDC reduces quantization noise when compared to

TABLE IV  
SUMMARY OF PD LOGIC AND ISI FILTERING

| $D_X$                | $D_Y$                | $E_X$                | ISI | Early/Late |
|----------------------|----------------------|----------------------|-----|------------|
| $D_X > D_{+1}$       | $D_Y < D_{-1}$       | $E_X > E_{-1}$       | No  | Earlier    |
|                      |                      | $E_{-1} > E_X > E_0$ | No  | Early      |
|                      |                      | $E_0 > E_X > E_{+1}$ | No  | Late       |
|                      |                      | $E_{+1} > E_X$       | No  | Later      |
| $D_X > D_{+1}$       | $D_0 > D_Y > D_{-1}$ | $E_X > E_{-1}$       | YES | Early      |
|                      |                      | $E_X < E_{-1}$       | YES | Late       |
| $D_0 < D_X < D_{+1}$ | $D_0 > D_Y > D_{-1}$ | $E_X > E_{-1}$       | YES | Early      |
|                      |                      | $E_X < E_{-1}$       | YES | Late       |
| $D_X > D_0$          | $D_Y > D_0$          | ----                 | N/A | No change  |






Fig. 12. (a) SAR TDC operation in three consecutive cycles. (b) Block diagram of SAR TDC.

BBPD (i.e., 1-bit TDC). Second, the TDC described in this paper not only measures the amount of phase error but also detects the edges that are impacted by ISI, so those edges can be excluded from the timing recovery loop. This inherent ISI jitter filtering capability allows us to recover the clock even from an unequalized eye in the presence of ISI. Since the bulk of the equalization is done in the digital domain, the eye at the input of the ADC is completely closed. However, even with a moderate boost from the AFE, the edge distribution shows sufficient statistics for timing recovery [Fig. 11(b)]. For clarity, the histogram from all transitions is plotted in red. The tri-modal jitter distribution translates to higher jitter in the recovered clock and may even cause a sub-optimal lock. Traditionally, PAM-4 CDRs consider only symmetric transitions [2]. Unfortunately, in the presence of ISI, even symmetric transitions can produce a tri-modal jitter distribution. Assuming the PAM-4 constellation is  $\{-3, -1, +1, +3\}$ , transitions from  $+3$  to  $-3$  and vice versa are symmetric transitions [Fig. 11(b)]. However, due to ISI, the pattern  $+3, -3, +3, -3$  causes the left peak of the distribution and the pattern  $+3, +3, -3, +3$  causes the right peak, and both patterns bias the CDR toward a sub-optimal lock point. For an accurate lock point and lower recovered clock jitter, these ISI-affected transitions must be rejected. One possible solution is to observe the timing information after the equalizer, but that would add significant latency, as discussed before. An alternative is to accumulate a larger number of samples before making a decision, which avoids any pattern-dependent bias. Unfortunately, this also adds significant latency to the loop.
Therefore, to reduce latency, we build an ISI filter into the PD: three data comparator outputs ( $D_{-1}$ ,  $D_0$ , and  $D_{+1}$ ) are used to identify the patterns that are less affected by ISI. In this example,  $+3, +3, -3, -3$  is the pattern with the transition at the optimal location, and it can be detected by comparing consecutive data samples  $D_X$  and  $D_Y$  to  $D_{-1}$  and  $D_{+1}$ , as described in Table IV. These ISI-free transitions are then compared with three edge references,  $E_{-1}$ ,  $E_0$ , and  $E_{+1}$ , to generate 2-bit timing information. For ISI-affected transitions from the patterns  $+3, -3, +3, -3$  and  $+3, +3, -3, +3$ , we use  $E_{-1}$  and  $E_{+1}$  as the reference levels instead of  $E_0$ , which generates 1-bit timing information. The benefit of such ISI filtering is visible in the transition distribution (shown in green) when superimposed on the transition distribution without the ISI filter. A bigger benefit is the latency improvement: since this

ISI filter is part of the PD and “ISI-free” edge information is available within 10 UI, we can achieve a much wider loop bandwidth.
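The decision logic of Table IV can be sketched as a small function. This is an illustrative reading of the table for a falling transition only, not the implemented circuit. It assumes the ordering  $E_{-1} > E_0 > E_{+1}$  for the edge references, assumes the “Late” row of the ISI-free case reads  $E_0 > E_X > E_{+1}$ , follows the table in using  $E_{-1}$  for both ISI-affected cases, and treats any combination not listed in the table as “no change.” The threshold values are hypothetical.

```python
def phase_detect(d_x, d_y, e_x, d_ref, e_ref):
    """Return (isi_flag, decision) for consecutive data samples d_x, d_y and edge sample e_x.

    d_ref = (D_-1, D_0, D_+1) data thresholds, ascending.
    e_ref = (E_-1, E_0, E_+1) edge references, assumed descending.
    """
    d_m1, d_0, d_p1 = d_ref
    e_m1, e_0, e_p1 = e_ref

    if d_x > d_0 and d_y > d_0:
        return None, "no change"            # last row of Table IV: no transition

    if d_x > d_p1 and d_y < d_m1:           # ISI-free transition: 2-bit timing information
        if e_x > e_m1:
            return False, "earlier"
        if e_x > e_0:
            return False, "early"
        if e_x > e_p1:                      # assumed reading of the "Late" row
            return False, "late"
        return False, "later"

    if d_x > d_0 and d_m1 < d_y < d_0:      # ISI-affected transition: 1-bit timing information
        decision = "early" if e_x > e_m1 else "late"
        return True, decision

    return None, "no change"                # assumption: unlisted combinations are ignored


# Hypothetical thresholds in arbitrary units
D_REF = (-2.0, 0.0, 2.0)
E_REF = (1.0, 0.0, -1.0)

print(phase_detect(3.0, -3.0, 0.5, D_REF, E_REF))
print(phase_detect(3.0, -1.0, 1.5, D_REF, E_REF))
```

Because the classification needs only comparator outputs that are already available, it adds no DSP latency to the timing loop.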

Although lower latency allows a wider loop bandwidth, the PD's quantization noise is still a concern. As with the data samplers, additional edge samplers can improve the resolution at the cost of power. Instead, we improve the resolution by successively measuring the timing error from consecutive edge samples. After each early/late decision, the SAR logic directly updates the edge clock phase by trimming the binary-weighted capacitive loading [Fig. 11(a)]. However, the DAC-to-digitally controlled oscillator (DCO) path remains unchanged, so the data sampling phase also remains unchanged [Fig. 12(a)]. After three consecutive decisions, the SAR provides a 3-bit phase code. These 3 bits, along with the 2-bit ISI code, are applied directly to the 5-bit DAC that controls the DCO, which updates the data sampling phase. For alternating data, the SAR output should be available after 12 UI; in the case of “no transition,” it may take longer. Therefore, to avoid excessive latency, a counter forces the DAC to be updated after 20 UI [Fig. 12(b)].
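The successive-approximation search described above can be modeled behaviorally as a three-step binary search on the edge clock phase. The sketch below is a minimal model, assuming an ideal early/late decision, a timing error normalized to the TDC range, and a particular bit ordering for the 5-bit DAC word; none of these details are specified in the text.

```python
def sar_tdc(timing_error, full_scale=1.0, n_bits=3):
    """Resolve a timing error in [0, full_scale) with one early/late decision per bit."""
    code = 0
    for bit in reversed(range(n_bits)):
        trial = code | (1 << bit)
        # Edge clock phase selected by the trial code (binary-weighted trimming)
        trial_phase = trial * full_scale / (1 << n_bits)
        # Assumed polarity: keep the bit when the data edge is at or after the trial phase
        if timing_error >= trial_phase:
            code = trial
    return code


def dac_word(isi_code, phase_code, phase_bits=3):
    """Combine the 2-bit ISI code and 3-bit phase code (assumed: ISI code in the MSBs)."""
    return (isi_code << phase_bits) | phase_code


phase_code = sar_tdc(0.40)          # three decisions, one per bit
word = dac_word(0b10, phase_code)   # 5-bit word applied to the DAC
print(phase_code, bin(word))
```

Only one edge comparator is exercised per decision, so the resolution gain comes from time rather than from extra samplers.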

Note that the TDC generates multi-bit timing error information with a resolution adjustable from 3 to 5 bits. Similar to the ADC,



Fig. 13. (a) Shaping of ADC quantization noise by FFE (theoretical and simulation). (b) ADC followed by FFE. (c) ADC followed by DFE architecture.

the TDC resolution is also programmable. Depending on the channel loss, it enables either one or three edge comparators. For SR channels, a single edge comparator is used; we can still extract 3-bit edge information using the SAR algorithm. To summarize, the 5-bit TDC lowers the PD quantization noise, which relaxes the tradeoff between DCO phase noise filtering and the CDR's self-generated noise. Similarly, the ISI filtering capability allows us to bypass the digital equalizer and its associated latency, enabling a wider loop bandwidth. Together, these two techniques provide a 4× improvement in dithering jitter compared to a conventional BBPD while achieving the same tracking bandwidth.

## V. SCALABLE DIGITAL EQUALIZATION

In an ADC-based receiver, the equalization is done digitally, which takes advantage of technology scaling and provides considerable freedom in the choice of equalization technique to compensate higher loss. The front end of the receiver in this case requires an ADC that converts the analog input signal into the digital domain. The challenge in this architecture is the quantization noise and the resulting SNR degradation. As shown in Fig. 2, the quantization noise of the ADC,  $N_{QZ}$  (i.e.,  $\sigma_{QZ}^2$ ), appears at the FFE's input and is therefore shaped by the FFE. Since the FFE response is high pass, it amplifies the quantization noise along with the high-frequency content of the signal. Therefore, the quantization noise at the FFE output can be written as

$$N_{QZ,out} = \sqrt{\sigma_{QZ}^2 |H_{FFE}|^2 + \frac{1}{N} \sum_{i=1}^N \sigma_i^2 W_i^2} \quad (11)$$

where  $\sigma_i^2$  and  $W_i$  are the quantization noise and tap weight, respectively. This quantization noise expression correlates well with the transient simulation, as shown in Fig. 13(a). As expected, the output quantization noise is a strong function of the FFE transfer function; as a result, higher loss channels see larger amplification of quantization noise. Note that although the FFE output can have more than  $N$  bits, the effective resolution is still limited by the ADC output resolution. In fact, the resolution of the ADC sets the performance of the receiver, since its quantization noise is amplified by the digital FFE. In a DFE, by comparison, there is no amplification of quantization noise: the ADC output is simply compared to predefined thresholds [Fig. 13(b) and (c)]. In addition, an FFE is more affected by non-linearity than a DFE. However, when implementation is considered, a digital DFE has the same timing constraint as an analog mixed-signal implementation: the symbol decision must subtract the post-cursor ISI within 1 UI. Although loop unrolling can partially relax this requirement, the hardware complexity increases exponentially. A  $P$ -tap,  $N$ -bit loop-unrolled DFE requires  $3 \times 4^P$   $N$ -bit subtractions [32]. Therefore, in the FFE-based approach, ADC power dominates because the quantization noise must be kept low. In the DFE-based approach, although a lower resolution ADC is acceptable, DSP power becomes prohibitive.
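The amplification of quantization noise by a high-pass FFE can be illustrated with a short simulation. The sketch below does not evaluate (11) itself; it uses the standard result that white noise of rms value  $\sigma$  passed through an FIR filter has rms value  $\sigma \sqrt{\sum_i W_i^2}$  and compares it with a simulated filter output. The tap weights, the ADC full-scale range, and the uniform white-noise model are assumptions. The helper for the loop-unrolled DFE simply evaluates the  $3 \times 4^P$  count quoted above.

```python
import math
import random

def fir(x, taps):
    """Direct-form FIR: y[n] = sum_k taps[k] * x[n-k], zero initial state."""
    return [sum(w * x[n - k] for k, w in enumerate(taps) if n - k >= 0)
            for n in range(len(x))]

def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

def loop_unrolled_dfe_subtractions(p_taps):
    """Number of N-bit subtractions for a P-tap loop-unrolled PAM-4 DFE (from the text)."""
    return 3 * 4 ** p_taps

random.seed(0)
LSB = 2.0 / 2 ** 5                      # assumed: 5-bit ADC over a +/-1 range
FFE_TAPS = [-0.2, 1.0, -0.35]           # hypothetical high-pass three-tap FFE

# Model quantization error as white noise, uniform over one LSB
q_noise = [random.uniform(-LSB / 2, LSB / 2) for _ in range(20000)]

predicted_rms = (LSB / math.sqrt(12)) * math.sqrt(sum(w * w for w in FFE_TAPS))
simulated_rms = rms(fir(q_noise, FFE_TAPS))
noise_gain_db = 20 * math.log10(math.sqrt(sum(w * w for w in FFE_TAPS)))

print(f"noise gain {noise_gain_db:.2f} dB, predicted {predicted_rms:.5f}, "
      f"simulated {simulated_rms:.5f}")
```

Larger pre- and post-cursor weights, as needed for higher loss, increase the sum of squared taps and therefore the output noise, while a DFE avoids this at the cost of the subtraction count growing as  $4^P$ .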

In this paper, we propose an LUT-based non-linear mapping for equalization that results in less amplification of quantization noise (Fig. 14). LUTs have been widely used for digital filter implementation, where filter outputs are pre-computed and stored [33]–[35]. The input digital values are used as selection bits. Since such a selection process can be implemented with muxes only, power consumption can be significantly reduced. Here, we adopt the LUT-based approach to save power and, at the same time, reduce the impact of quantization noise. The equalization approach is inspired by the non-uniform quantization used in NRZ signaling [10]. Unfortunately, the PAM-4 signal space is much more congested with ISI, and therefore the direct use of a non-uniform ADC is challenging, as shown in Fig. 15(a). Fortunately, after equalizing one pre- and one post-cursor ISI, the partially equalized output statistics are sufficiently well separated into four bins [Fig. 15(b)]. The LUT is populated with these pre-computed, higher resolution values. Assuming the channel is known, 8-bit digital outputs are used to pre-calculate these values using a 10-bit FIR, and they are stored with 10-bit resolution. These  $2^{10}$  values are picked based on their statistics: they are the values that occur most often. To select a particular value from these  $2^{10}$  combinations, we need a 10-bit selection word. During runtime, the ADC has only 5-bit resolution, so to generate the 10-bit selection code we combine ADC outputs from three consecutive samples. For example, we combine 5 bits from the current, 3 bits from the previous, and 2 bits from the next data



Fig. 14. Proposed ADC-based digital receiver where first three taps are implemented using LUT and rest taps in conventional way. An offline higher resolution ADC and higher resolution FFE are used to prepare the LUT.



Fig. 15. (a) ADC output eye and distribution. (b) Three-tap digital FFE output eye and distribution.

sample for a 30-dB loss channel. This combination and the relative location depend on the channel response and need to be adaptive. Although the ADC output is only 5 bits, the selected output can still have higher than 5-bit resolution, depending on the resolution used during link training and LUT generation. This is because, unlike in a conventional FFE, the ADC outputs are used for selection rather than for computation. The quantization noise benefit of this approach becomes clearer when the output quantization noise is compared with that of the FFE in both the time domain and the frequency domain. Depending on the channel response, the proposed approach can provide a 5 to 9 dB improvement, relaxing the ADC resolution by 1 bit (Fig. 16). One drawback of this approach is the memory requirement: since all the values are pre-computed, the memory requirement grows exponentially with the number of taps. Therefore, we limit the LUT-based approach to three taps. Note that most of the equalization boost is applied to compensate the first pre- and first post-cursor ISI. Therefore,



Fig. 16. Performance comparison in terms of quantization noise between three-tap conventional digital FFE and LUT-based FFE. (a) Quantization noise transient and (b) FFT of quantization noise at the output of the FFE with theoretical quantization noise floor. Here, ADC resolution is 5 bit.

the quantization noise amplification due to the remaining taps is minimal. In this proof-of-concept implementation, a sampling scope is used as an 8-bit ADC by directly digitizing the AFE output through a 50- $\Omega$  buffer. These 8-bit outputs are post-processed in MATLAB to generate the 9-bit FIR output used in the LUT. A practical implementation of the technique will



Fig. 17. Synthesized digital equalizer core for (a) conventional eight-tap FIR and (b) three-tap LUT based followed by five-tap conventional FIR.

TABLE V  
DSP COMPARISON

|                          | 3-tap LUT + 5-tap Conventional | 8-tap LUT         | 8-tap Conventional |
|--------------------------|--------------------------------|-------------------|--------------------|
| Memory (Bits)            | $9 \times 2^{10} + 20$         | $9 \times 2^{23}$ | 40                 |
| Cells/Gates              | 317                            | 2724              | 466                |
| Area ( $\mu\text{m}^2$ ) | 1,216,950                      | 53,165,824        | 230,400            |
| Power (mW)               | 67                             | 187               | 104                |

\* The ADC resolution is 5 bits and the digital tap resolution is 4 bits.

require higher resolution ADC modes during the calibration phase. In this particular proof-of-concept implementation, we generated and populated the LUT at power-up based on the known channel and AFE response, while other practical considerations such as mission-mode adaptation are left for future work. For a fair comparison between the conventional and LUT-based approaches, we synthesized (but did not fabricate) both solutions, as shown in Fig. 17, and summarized power, area, and gate count in Table V. There are 64 unit cores, each running at 437.5 MHz. Owing to the reduced number of gates, the LUT-based approach consumes less power, but its area is  $4.5 \times$  larger due to the exponential increase in memory requirements; as a result, we were not able to integrate it on-chip. However, as SRAM and digital cells continue to scale, we expect a significant reduction in area. Since the FFE power is not measured, it is not considered in the comparison with the state of the art (Table VI).
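The formation of the 10-bit LUT selection word from three consecutive 5-bit ADC samples (5 bits from the current sample, 3 from the previous, and 2 from the next, as in the 30-dB example) can be sketched as follows. Taking the most significant bits of the neighboring samples and placing the current sample in the upper bits of the word are assumptions; the text does not specify either. The memory helper evaluates the expressions listed in Table V.

```python
ADC_BITS = 5

def lut_address(prev_code, cur_code, next_code, bits=(3, 5, 2)):
    """Build the LUT selection word from previous, current, and next ADC codes.

    bits = (bits from previous, bits from current, bits from next).
    Assumed: MSBs are kept, and the word is ordered current | previous | next.
    """
    b_prev, b_cur, b_next = bits
    addr = 0
    for code, n_bits in ((cur_code, b_cur), (prev_code, b_prev), (next_code, b_next)):
        addr = (addr << n_bits) | (code >> (ADC_BITS - n_bits))
    return addr

def lut_memory_bits(word_bits, address_bits):
    """Storage for a fully populated LUT."""
    return word_bits * 2 ** address_bits

address = lut_address(prev_code=0b11111, cur_code=0b10101, next_code=0b01000)

# Table V entries: 9-bit words, 10-bit address, plus 20 bits for the conventional taps
memory_3tap_lut_plus_5tap = lut_memory_bits(9, 10) + 20
memory_8tap_lut = lut_memory_bits(9, 23)

print(bin(address), memory_3tap_lut_plus_5tap, memory_8tap_lut)
```

Changing the `bits` tuple models the adaptive split between the three samples, and the memory figures show why extending the LUT beyond three taps is impractical.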

For equalization, the FFE taps are generated from the digital output by post-processing in MATLAB using sign-sign least mean squares (LMS) to reduce the “fuzz” around the constellation. The initial tap values, however, are assigned based on the measured single-bit response at the AFE output. FFE-only equalization is adopted, which does not have the critical timing constraint of a DFE. As shown in Fig. 4(a) and (b), the required ADC resolution and the number of FFE taps are proportional to channel loss. The key limitation of the LUT-based approach is the area penalty to store the pre-computed values. Since this area is not recoverable, there is no significant change in the LUT even at lower resolution. However, the resolution and length of the five-tap FIR filter are varied manually with the ADC resolution to save power at lower loss.
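A minimal sketch of sign-sign LMS tap adaptation is given below, written in Python rather than MATLAB. It is a generic data-aided example, not the post-processing script used for the measurements: the channel pulse response, step size, initial taps, and training length are hypothetical, and the transmitted symbols are assumed to be known during training.

```python
import random

def sign(v):
    return (v > 0) - (v < 0)

def ffe_output(rx, taps, n):
    """y[n] = sum_k taps[k] * rx[n-k]."""
    return sum(w * rx[n - k] for k, w in enumerate(taps))

def mean_squared_error(rx, symbols, taps, delay, start):
    errors = [(ffe_output(rx, taps, n) - symbols[n - delay]) ** 2
              for n in range(start, len(rx))]
    return sum(errors) / len(errors)

random.seed(1)
symbols = [random.choice((-3, -1, 1, 3)) for _ in range(6000)]   # PAM-4 levels
CHANNEL = [0.1, 1.0, 0.3]   # hypothetical pre-cursor, main cursor, post-cursor
rx = [sum(h * symbols[n - j] for j, h in enumerate(CHANNEL) if n - j >= 0)
      for n in range(len(symbols))]

INITIAL_TAPS = [0.0, 1.0, 0.0]   # hypothetical starting point
MU = 1e-3                        # hypothetical step size
DELAY = 2                        # channel main-cursor delay + FFE center-tap delay
TRAIN_END = 5000

taps = list(INITIAL_TAPS)
for n in range(DELAY, TRAIN_END):
    error = ffe_output(rx, taps, n) - symbols[n - DELAY]
    # Sign-sign update: only the signs of the error and of the tap input are used
    for k in range(len(taps)):
        taps[k] -= MU * sign(error) * sign(rx[n - k])

mse_before = mean_squared_error(rx, symbols, INITIAL_TAPS, DELAY, TRAIN_END)
mse_after = mean_squared_error(rx, symbols, taps, DELAY, TRAIN_END)
print([round(w, 3) for w in taps], round(mse_before, 3), round(mse_after, 3))
```

Because each update needs only two sign bits, the same loop maps to very little hardware if on-chip adaptation is added later.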



Fig. 18. Implemented prototype in 65-nm CMOS.

## VI. IMPLEMENTATION AND MEASUREMENT

The ADC and TDC prototype is implemented in 65-nm CMOS, as shown in Fig. 18. The analog front-end is minimized to three amplifier stages that provide a 6- to 12-dB programmable boost at 7 GHz. The ADC, including the AFE and TDC, fits within a  $450 \times 450 \mu\text{m}$  area; the details of each channel are shown in Fig. 19. We use a passive boost (Fig. 19) that not only relaxes the linearity and gain peaking of the subsequent active stages but also saves power [36]. When  $R_4$  and  $R_5$  are open, the pole and zero locations are given by  $|\omega_p| = 1/[(R_2||R_3)C]$  and  $|\omega_z| = 1/[R_2C]$ , respectively, and the boost factor by  $1 + R_2/R_3$ . The amount of boost is adjustable from 4 to 9 dB by trimming  $R_4$  and  $R_5$  in parallel with  $R_3$ . Following the passive equalizer, a high-bandwidth amplifier is used as the AGC, with adjustable current and a tunable load resistor. For visibility at the AFE output, we have an additional sampler with adjustable reference levels, similar to the one used for an on-chip eye-opening monitor. The output of this comparator is used to detect signal saturation and control the gain of the high-BW amplifier. Note that the AFE BW is around 9 GHz. Therefore, it does not add significant ISI, but its gain may vary over voltage and temperature; this variation was corrected using the roving sampler.
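The pole, zero, and boost expressions for the passive equalizer can be evaluated numerically. The component values below are hypothetical and were chosen only so that the boost spans roughly the 4-to-9-dB range quoted above; the actual values of  $R_2$  to  $R_5$  and  $C$  are not given in the text.

```python
import math

def parallel(*resistors):
    """Equivalent resistance of resistors in parallel."""
    return 1.0 / sum(1.0 / r for r in resistors)

def passive_boost(r2, r3, c, trim_resistors=()):
    """Return (zero rad/s, pole rad/s, boost dB) with optional resistors in parallel with R3."""
    r3_eff = parallel(r3, *trim_resistors)
    w_zero = 1.0 / (r2 * c)
    w_pole = 1.0 / (parallel(r2, r3_eff) * c)
    boost_db = 20 * math.log10(1 + r2 / r3_eff)
    return w_zero, w_pole, boost_db

# Hypothetical component values
R2, R3, R4, R5, C = 600.0, 1000.0, 1000.0, 1000.0, 50e-15

wz_min, wp_min, boost_min_db = passive_boost(R2, R3, C)                 # R4, R5 open
wz_max, wp_max, boost_max_db = passive_boost(R2, R3, C, (R4, R5))       # R4, R5 connected

print(f"boost {boost_min_db:.2f} dB to {boost_max_db:.2f} dB, "
      f"zero at {wz_min / (2 * math.pi) / 1e9:.2f} GHz")
```

Connecting  $R_4$  and  $R_5$  lowers the effective  $R_3$ , which moves the pole up in frequency and raises the boost while leaving the zero unchanged.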

We distribute the C4 and C8 clock across the clocking block. Pulse gen and S/H associated with the  $0^\circ$  and  $90^\circ$  are placed at the top middle of the clocking block, and for  $180^\circ$  and  $270^\circ$  at the bottom middle (Fig. 18).

Comparator offset in advanced CMOS nodes can reach several LSBs. Therefore, the fine comparators need periodic offset correction to track voltage and temperature variation and achieve better than 5-mV resolution. Since the even and odd fine comparators are enabled in a TI manner, offset can be corrected during the downtime after the LSBs are resolved. Each comparator has four NMOS inputs and a StrongARM latch with capacitive load arrays attached to the inputs for offset correction [Fig. 20(a)]. During offset correction, all inputs are tied together to  $V_{cm}$ , and the capacitive loads are sequentially trimmed to create an imbalance that slows down the faster input path, similar to [37] and [38]. To reduce the hardware, a state machine for offset correction is shared among the fine ADC comparators, and offset correction is performed sequentially. The bit decisions for the unbalanced load array are stored in conventional 6T SRAM instead of flip-flops to further reduce the area. To measure the performance of the ADC, an external clock is applied that bypasses the TDC.
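The sequential trimming procedure can be modeled behaviorally. The sketch below is a simplified model, assuming that each trim step shifts the input-referred offset by a fixed amount and that trimming stops when the comparator decision flips with the inputs tied to  $V_{cm}$ . The step size and trim range are hypothetical, and noise in the comparator decision is ignored.

```python
def calibrate_offset(offset_mv, step_mv=1.0, max_code=31):
    """Step the load-imbalance code until the comparator decision flips.

    Returns (signed trim code, residual offset in mV).
    Assumed model: residual = offset - code * step.
    """
    code = 0
    residual = offset_mv
    initial_decision = residual > 0          # decision with inputs tied together
    while (abs(code) < max_code
           and residual != 0
           and (residual > 0) == initial_decision):
        # Load the faster side by one more unit capacitor
        code += 1 if initial_decision else -1
        residual = offset_mv - code * step_mv
    return code, residual

code_pos, residual_pos = calibrate_offset(12.3)
code_neg, residual_neg = calibrate_offset(-7.5)
print(code_pos, round(residual_pos, 2), code_neg, round(residual_neg, 2))
```

In this model the residual offset is bounded by one trim step, so the step size sets the achievable resolution.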



Fig. 19. Block diagram of implemented digital receiver.



Fig. 20. (a) Comparator with offset correction. (b) Measured INL/DNL of the ADC for 5-bit resolution.

To ensure linearity, the AFE supply voltage is kept at 1.2 V, the same as the ADC core. Differential non-linearity (DNL) and integral non-linearity (INL) of the ADC are measured using code density testing. Fig. 20(b) shows the measured DNL and INL before and after comparator offset calibration. Offset correction improves the DNL and INL from  $+0.50/-1$  and  $+1/-0$  LSB to  $+0.38/-0.31$  and  $+0.19/-0.41$  LSB, respectively.
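Code density testing derives DNL and INL from a histogram of output codes. The sketch below shows the basic calculation, assuming a uniformly distributed (ramp) input so that every code should be hit equally often; whether the measurement used a ramp or a sinusoid is not stated here, and a sinusoidal input would need an additional correction for its amplitude distribution. The histogram is synthetic.

```python
def dnl_inl_from_histogram(hist):
    """Return (dnl, inl) in LSB for the inner codes of a code-density histogram.

    Assumes a uniformly distributed input. The first and last codes are
    excluded because they also collect out-of-range samples.
    """
    inner = hist[1:-1]
    ideal_hits = sum(inner) / len(inner)
    dnl = [hits / ideal_hits - 1.0 for hits in inner]
    inl = []
    running = 0.0
    for d in dnl:
        running += d        # INL is the running sum of DNL
        inl.append(running)
    return dnl, inl

# Synthetic 3-bit example: one wide code followed by one narrow code
histogram = [50, 100, 100, 150, 50, 100, 100, 50]
dnl, inl = dnl_inl_from_histogram(histogram)
print(dnl, inl)
```

A code that is hit more often than average is wider than one LSB (positive DNL), and the INL returns to zero once the following narrow code compensates for it.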

Signal-to-noise and distortion ratio (SNDR) and spurious-free dynamic range (SFDR) plots are shown

Fig. 21. ADC performance. (a) FFT at the ADC output w/ and w/o offset correction at  $F_s = 12.5$  GS/s. (b) Signal-to-noise and distortion ratio and spurious-free dynamic range versus input frequency at  $F_s = 14$  GS/s.

in Fig. 21(a) from the measured FFT of the ADC output at a 12.5-GS/s quad-channel sampling rate. As with the INL/DNL, offset correction improves the SNDR and SFDR from 23.9 and 30.35 dB to 29 and 33.97 dB, respectively, which improves the ENOB from 3.67 to 4.52. Fig. 21(b) shows the measured SFDR



Fig. 22. Input eye and digitally reconstructed eye generated from ADC output for (a) open eye and (b) semi-open eye.



Fig. 23. (a) Phase noise plot of recovered clock. (b) Jitter tolerance at 28 Gb/s with  $\text{BER} < 10^{-9}$ .

and SNDR versus input frequency with and without calibration at a 14-GS/s sampling rate. Offset correction improves the SNDR by 4.7 dB on average.

For the 30-dB channel loss, the core ADC is configured in 5.5-bit mode. In this highest resolution mode, the ADC consumes 83 mW including the offset calibration circuit, which translates to  $\sim 300 \text{ fJ/conversion step}$  at Nyquist input. Here,



Fig. 24. Test setup for receiver characterization.

the figure of merit is defined by

$$\text{FoM} = \frac{\text{Power}}{2^{\text{ENOB}} \cdot \min\{F_s, 2 \cdot \text{ERBW}\}} \quad (12)$$

where ENOB is the effective number of bits and ERBW is the effective resolution bandwidth. In addition, the eye at the channel output is compared with the eye reconstructed from the ADC output in Fig. 22, which reconfirms the functionality of the ADC.
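The ENOB values quoted from the measured SNDR and the figure of merit in (12) can be reproduced with the standard conversion  $\text{ENOB} = (\text{SNDR} - 1.76)/6.02$ . For the figure of merit, the 83-mW power is taken from the text, while the ENOB at Nyquist (4.1, from Table VI), the 14-GS/s sampling rate, and an ERBW equal to the Nyquist frequency are assumed operating points. The result is therefore only expected to fall in the same range as the quoted value.

```python
def enob_from_sndr(sndr_db):
    """Standard conversion from SNDR in dB to effective number of bits."""
    return (sndr_db - 1.76) / 6.02

def figure_of_merit(power_w, enob, fs_hz, erbw_hz):
    """Evaluate (12): energy per conversion step in joules."""
    return power_w / (2 ** enob * min(fs_hz, 2 * erbw_hz))

enob_before = enob_from_sndr(23.9)   # measured SNDR without offset correction
enob_after = enob_from_sndr(29.0)    # measured SNDR with offset correction

# Assumed operating point: ENOB 4.1 at Nyquist, Fs = 14 GS/s, ERBW = Fs/2
fom_j = figure_of_merit(power_w=83e-3, enob=4.1, fs_hz=14e9, erbw_hz=7e9)

print(f"ENOB {enob_before:.2f} -> {enob_after:.2f}, FoM {fom_j * 1e15:.0f} fJ/conversion step")
```

Each additional effective bit halves the figure of merit at constant power, which is why the offset correction matters for the efficiency figure as well as for linearity.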

The measured recovered clock phase noise and jitter profile are shown in Fig. 23(a) and (b), respectively. The advantage of the TDC is visible in these measured results. Bypassing the ADC path allows us to achieve much lower loop latency. As a result, a peaking-free jitter transfer profile is achieved, which improves jitter tolerance [Fig. 23(b)]. The wider loop bandwidth allows us to filter the ring VCO's phase noise more effectively: the integrated jitter from 1 kHz to 1 GHz is only 0.5134 ps rms.

Fig. 24 shows the link test setup. An arbitrary waveform generator is used as the PAM-4 transmitter, equivalent to a three-tap Tx FIR providing 6-to-8-dB boost. The skew between the differential channels is adjusted. At the receiver side, we have visibility at the LEQ output. The digital interface between the ADC and the FPGA runs at 437.5 MHz and consumes 7 mW. As mentioned before, the FFE is implemented in the FPGA; its power and area, however, are estimated from transistor-level simulation in the 65-nm process. The digital equalizer is built as a programmable three-to-eight-tap FFE, with 1 pre-tap and 7 post-taps. For the link performance measurement and BER estimation, the FPGA output is used to plot the distribution of the bins (Fig. 25). Here, the measured distribution is plotted on a linear scale. To extract the BER, this distribution is converted to log scale and extrapolated. The achievable BER is the level at which the extrapolated lines overlap. In our case (three-tap LUT + five-tap conventional), the achieved BER is as low as  $10^{-9}$ , whereas in the eight-tap conventional case it is  $10^{-6}$ .

Fig. 26 shows the power breakdown for LR and MR channels. There is no critical latency concern in an FFE, as opposed to a DFE. As a result, the FFE implementation is simpler and can take full advantage of supply scaling. Its area also scales gracefully with technology. More importantly, the number of post-cursor taps can be adjusted according to the channel loss. Therefore, under low-loss conditions, the unused post-cursor taps can be gated to reduce the FFE power from 30 to 16 mW. For the 30-dB loss case, we needed all eight taps with the ADC set to its highest resolution of 5.5 bits to achieve a BER lower than  $10^{-8}$  [Fig. 27(b)]. However, for lower loss,



Fig. 25. Link margin test at 28 Gb/s for a 30-dB channel where FFE is realized as (a) three-tap LUT + 5-tap conventional and (b) eight-tap conventional.

TABLE VI  
COMPARISON WITH STATE OF THE ART

|                           | Shafik<br>ISSCC<br>2015[16] | Frans<br>VLSI<br>2016[8] | Cui<br>ISSCC<br>2016[6] | Rylov<br>ISSCC<br>2016 [15] | Jung<br>JSSC<br>2015 [39] | Manian<br>JSSC<br>2017 [40] | This Work                  |                            |                            |
|---------------------------|-----------------------------|--------------------------|-------------------------|-----------------------------|---------------------------|-----------------------------|----------------------------|----------------------------|----------------------------|
| Technology                | 65 nm CMOS                  | 16 nm FinFET             | 28 nm CMOS              | 32 nm CMOS                  | 45 nm CMOS                | 45 nm CMOS                  | 65 nm CMOS                 |                            |                            |
| Data Rate (Gb/s)          | 10 NRZ                      | 56 PAM-4                 | 32 PAM-4                | 25 NRZ                      | 25 NRZ                    | 40 NRZ                      | 28 PAM-4                   | 24 PAM-4                   | 20 PAM-4                   |
| Architecture              | 32x TI SAR ADC              | 32x TI SAR ADC           | 32x TI SAR ADC          | 4x Flash ADC                | CTLE + DTLE + 2-tap DFE   | CTLE + 2-tap DFE            | 4x Flash ADC               |                            |                            |
| ENOB@ Nyquist             | 4.74                        | 4.9                      | 5.85                    | 4                           | --                        | --                          | 4.1                        |                            |                            |
| Timing Recovery           | N/A                         | Baud-rate                | Baud-rate               | Baud-rate                   | Half-rate                 | Half-rate                   | Edge & Data Sampled        |                            |                            |
| Tracking BW               | ---                         | ---                      | ---                     | ---                         | ---                       | ---                         | 10+ MHz                    |                            |                            |
| Jitter Tolerance          | ---                         | ----                     | ---                     | ---                         | ---                       | 0.45 UIpp @ 5 MHz           | 0.2 UIpp @ 50 MHz          |                            |                            |
| Channel Loss Equalization | 36.4 dB @ 5 GHz             | 25 dB @ 14 GHz           | 32 dB @ 8 GHz           | 40 dB @ 12 GHz              | 24 dB @ 12.5 GHz          | 18.6 dB @ 20 GHz            | 30 dB @ 7 GHz              |                            |                            |
| Power (mW) (w/o DSP)      | 79                          | 410                      | 320                     | 453                         | 8.1                       | 14                          | 130 @ 30 dB<br>45 @ 15 dB  | 109 @ 27 dB<br>38 @ 15 dB  | 89 @ 24 dB<br>33 @ 15 dB   |
| FOM (pJ/bit)              | 7.9                         | 7.32                     | 10                      | 18.12                       | 0.32                      | 0.35                        | 4.6 @ 30 dB<br>1.6 @ 15 dB | 4.5 @ 27 dB<br>1.6 @ 15 dB | 4.5 @ 24 dB<br>1.7 @ 15 dB |



Fig. 26. Power breakdown for (a) LR channel and (b) MR channel.

at 20 dB we needed only 4.5-bit resolution with four FFE taps. For 15 dB or lower, the front-end LEQ with the Tx FIR alone is sufficient. Therefore, this solution allows the power consumption to scale linearly with loss [Fig. 27(a)].

Table VI compares the proposed ADC-TDC-based digital receiver with the state-of-the-art ADC-based solutions [6], [8],

[15], [16] and mixed-signal solutions [39], [40]. Here, the ADC-based receivers use baud-rate timing recovery; this paper shows that 2× timing recovery is possible without a significant power/area penalty. Also, due to the 9-dB SNR penalty of PAM-4 modulation, published ADC-based receivers employ SAR ADCs with higher resolution. This paper demonstrates that 5.5 bits is sufficient if the back-end digital equalization does not amplify quantization noise. Finally, by taking advantage of channel ISI, a 5.5-bit ADC can be designed that achieves excellent power efficiency not just for the ADC but for the entire link.
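The energy-per-bit figures of merit listed for this work in Table VI follow directly from the power and data rate, since 1 mW divided by 1 Gb/s equals 1 pJ/bit. The short check below uses only the power and data-rate entries from the table.

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Energy per bit in pJ/bit: mW divided by Gb/s."""
    return power_mw / data_rate_gbps

# (data rate in Gb/s, power in mW, FoM in pJ/bit as listed) for this work in Table VI
TABLE_VI_THIS_WORK = [
    (28, 130, 4.6), (28, 45, 1.6),
    (24, 109, 4.5), (24, 38, 1.6),
    (20, 89, 4.5), (20, 33, 1.7),
]

computed = [energy_per_bit_pj(power, rate) for rate, power, _ in TABLE_VI_THIS_WORK]
for (rate, power, listed), value in zip(TABLE_VI_THIS_WORK, computed):
    print(f"{rate} Gb/s, {power} mW: {value:.2f} pJ/bit (listed {listed})")
```

The nearly constant energy per bit across the three data rates at high loss reflects the scaling of ADC resolution and tap count with the channel.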

## VII. CONCLUSION

ADC-DSP-based receivers are the future of multilevel signaling in advanced CMOS. However, their power has to be reduced. In this paper, a low-power PAM-4 receiver is



Fig. 27. (a) Power consumption of the receiver for different channel losses, and (b) measured receiver BER for 25- and 30-dB loss channels.

presented by utilizing a variable-resolution ISI-aware ADC instead of a general-purpose ADC, a higher resolution SAR TDC rather than a single-bit bang-bang PD, and an LUT-based FFE in which quantization noise amplification is minimal.

#### ACKNOWLEDGMENT

The authors would like to thank CMC Microsystems for the provision of products and services that facilitated this paper, including CAD tools and design methodology, fabrication services using the 65-nanometer CMOS technology from TSMC, and test equipment support. They would also like to thank J. J. Quinn from CMC for the support regarding fabrication access and CAD tools, and MACOM and NSERC for funding this project.

#### REFERENCES

- [1] R. Payne *et al.*, “A 6.25 Gb/s binary adaptive DFE with first post-cursor tap cancellation for serial backplane communications,” in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, vol. 1, Feb. 2005, pp. 585–568.
- [2] V. Stojanovic *et al.*, “Autonomous dual-mode (PAM2/4) serial link transceiver with adaptive equalization and data recovery,” *IEEE J. Solid-State Circuits*, vol. 40, no. 4, pp. 1012–1026, Apr. 2005.
- [3] S. Kasturia and J. H. Winters, “Techniques for high-speed implementation of nonlinear cancellation,” *IEEE J. Sel. Areas Commun.*, vol. 9, no. 5, pp. 711–717, Jun. 1991.
- [4] T. Toifl *et al.*, “A 2.6 mW/Gbps 12.5 Gbps RX with 8-tap switched-capacitor DFE in 32 nm CMOS,” *IEEE J. Solid-State Circuits*, vol. 47, no. 4, pp. 897–910, Apr. 2012.
- [5] N. Kocaman *et al.*, “A 3.8 mW/Gbps quad-channel 8.5–13 Gbps serial link with a 5 tap DFE and a 4 tap transmit FFE in 28 nm CMOS,” *IEEE J. Solid-State Circuits*, vol. 51, no. 4, pp. 881–892, Apr. 2016.
- [6] D. Cui *et al.*, “A 320 mW 32 Gb/s 8 b ADC-based PAM-4 analog front-end with programmable gain control and analog peaking in 28 nm CMOS,” in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2016, pp. 58–59.
- [7] T. Toifl *et al.*, “A 22-Gb/s PAM-4 receiver in 90-nm CMOS SOI technology,” *IEEE J. Solid-State Circuits*, vol. 41, no. 4, pp. 954–965, Apr. 2006.
- [8] Y. Frans *et al.*, “A 56 Gb/s PAM4 wireline transceiver using a 32-way time-interleaved SAR ADC in 16 nm FinFET,” in *Proc. IEEE Symp. VLSI Circuits*, Jun. 2016, pp. 1–2.
- [9] B. Abiri, A. Sheikholeslami, H. Tamura, and M. Kibune, “A 5 Gb/s adaptive DFE for 2x blind ADC-based CDR in 65 nm CMOS,” in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2011, pp. 436–438.
- [10] E.-H. Chen, R. Yousry, and C.-K. K. Yang, “Power optimized ADC-based serial link receiver,” *IEEE J. Solid-State Circuits*, vol. 47, no. 4, pp. 938–951, Apr. 2012.
- [11] C. Ting, J. Liang, A. Sheikholeslami, M. Kibune, and H. Tamura, “A blind baud-rate ADC-based CDR,” *IEEE J. Solid-State Circuits*, vol. 48, no. 12, pp. 3285–3295, Dec. 2013.
- [12] A. Varzaghanian *et al.*, “A 10.3-GS/s, 6-bit flash ADC for 10 G Ethernet applications,” *IEEE J. Solid-State Circuits*, vol. 48, no. 12, pp. 3038–3048, Dec. 2013.
- [13] E. Z. Tabasy, A. Shafik, K. Lee, S. Hoyos, and S. Palermo, “A 6 bit 10 GS/s TI-SAR ADC with low-overhead embedded FFE/DFE equalization for wireline receiver applications,” *IEEE J. Solid-State Circuits*, vol. 49, no. 11, pp. 2560–2574, Nov. 2014.
- [14] B. Zhang *et al.*, “A 28 Gb/s multi-standard serial-link transceiver for backplane applications in 28 nm CMOS,” in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2015, pp. 1–3.
- [15] S. Rylov *et al.*, “A 25 Gb/s ADC-based serial line receiver in 32 nm CMOS SOI,” in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Jan. 2016, pp. 56–57.
- [16] A. Shafik, E. Z. Tabasy, S. Cai, K. Lee, S. Hoyos, and S. Palermo, “A 10 Gb/s hybrid ADC-based receiver with embedded 3-tap analog FFE and dynamically-enabled digital equalization in 65 nm CMOS,” in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2015, pp. 1–3.
- [17] D. Aurangozeb, A. K. M. D. Hossain, and M. Hossain, “Channel adaptive ADC and TDC for 28 Gb/s PAM-4 digital receiver,” in *Proc. IEEE Custom Integr. Circuits Conf. (CICC)*, May 2017, pp. 1–4.
- [18] N. Tracy. (2015). System architectures using OIF CEI-56G interfaces. *Optical Internetworking Forum*. [Online]. Available: <http://www.oiforum.com/wp-content/uploads/50317-FOE-Architecture-Presentation.pdf>
- [19] A. G. F. Dingwall and V. Zazzu, “An 8-MHz CMOS subranging 8-bit A/D converter,” *IEEE J. Solid-State Circuits*, vol. SSC-20, no. 6, pp. 1138–1143, Dec. 1985.
- [20] K. Yoshioka, R. Saito, T. Danjo, S. Tsukamoto, and H. Ishikuro, “Dynamic architecture and frequency scaling in 0.8–1.2 GS/s 7 b subranging ADC,” *IEEE J. Solid-State Circuits*, vol. 50, no. 4, pp. 932–945, Apr. 2015.
- [21] R. C. Taft and M. R. Tursi, “A 100-MS/s 8-b CMOS subranging ADC with sustained parametric performance from 3.8 V down to 2.2 V,” *IEEE J. Solid-State Circuits*, vol. 36, no. 3, pp. 331–338, Mar. 2001.
- [22] Y. Shimizu, S. Murayama, K. Kudoh, and H. Yatsuda, “A split-load interpolation-amplifier-array 300 MS/s 8b subranging ADC in 90 nm CMOS,” in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2008, pp. 552–635.
- [23] K. Ohhata, K. Uchino, Y. Shimizu, K. Oyama, and K. Yamashita, “Design of a 770-MHz, 70-mW, 8-bit subranging ADC using reference voltage precharging architecture,” *IEEE J. Solid-State Circuits*, vol. 44, no. 11, pp. 2881–2890, Nov. 2009.
- [24] M. El-Chammas and B. Murmann, “A 12-GS/s 81-mW 5-bit time-interleaved flash ADC with background timing skew calibration,” *IEEE J. Solid-State Circuits*, vol. 46, no. 4, pp. 838–847, Apr. 2011.
- [25] J. Matsuno, M. Hosoya, M. Furuta, and T. Itakura, “A 3-GS/s 5-bit flash ADC with wideband input buffer amplifier,” in *Proc. Int. Symp. VLSI Design, Autom., Test (VLSI-DAT)*, Apr. 2013, pp. 1–4.
- [26] B. P. Ginsburg and A. P. Chandrakasan, “500-MS/s 5-bit ADC in 65-nm CMOS with split capacitor array DAC,” *IEEE J. Solid-State Circuits*, vol. 42, no. 4, pp. 739–747, Apr. 2007.
- [27] T. Ito and T. Itakura, “A 3-GS/s 5-bit 36-mW flash ADC in 65-nm CMOS,” in *Proc. IEEE Asian Solid-State Circuits Conf.*, Nov. 2010, pp. 1–4.
- [28] J. Yao and J. Liu, “A 5-GS/s 4-bit flash ADC with triode-load bias voltage trimming offset calibration in 65-nm CMOS,” in *Proc. IEEE Custom Integr. Circuits Conf. (CICC)*, San Jose, CA, USA, 2011, pp. 1–4.
- [29] H. Chung, A. Rylyakov, Z. T. Deniz, J. Bulzacchelli, G. Y. Wei, and D. Friedman, “A 7.5-GS/s 3.8-ENOB 52-mW flash ADC with clock duty cycle control in 65 nm CMOS,” in *Proc. Symp. VLSI Circuits*, Jun. 2009, pp. 268–269.
- [30] T. Neu, “Clock jitter analyzed in the time domain, part 1,” *Analog Appl. J.*, vol. 3Q, pp. 5–9, Aug. 2010.
- [31] R. C. Walker *et al.*, “A two-chip 1.5-GBd serial link interface,” *IEEE J. Solid-State Circuits*, vol. 27, no. 12, pp. 1805–1811, Dec. 1992.
- [32] H. Yuksel *et al.*, “A 3.6 pJ/b 56 Gb/s 4-PAM receiver with 6-bit TI-SAR ADC and quarter-rate speculative 2-tap DFE in 32 nm CMOS,” in *Proc. 41st Eur. Solid-State Circuits Conf. (ESSCIRC)*, Sep. 2015, pp. 148–151.
- [33] J. Yoo, L. Yan, D. El-Damak, M. A. B. Altaf, A. H. Shoeb, and A. P. Chandrakasan, “An 8-channel scalable EEG acquisition SoC with patient-specific seizure classification and recording processor,” *IEEE J. Solid-State Circuits*, vol. 48, no. 1, pp. 214–228, Jan. 2013.
- [34] U. Meyer-Baese, *Digital Signal Processing with Field Programmable Gate Arrays*, 4th ed. New York, NY, USA: Springer, 2014.
- [35] P. K. Meher, “New approach to look-up-table design and memory-based realization of FIR digital filter,” *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 57, no. 3, pp. 592–603, Mar. 2010.
- [36] S. Gondi and B. Razavi, “Equalization and clock and data recovery techniques for 10-Gb/s CMOS serial-link receivers,” *IEEE J. Solid-State Circuits*, vol. 42, no. 9, pp. 1999–2011, Sep. 2007.
- [37] P. Nuzzo, C. Nani, C. Armiento, A. Sangiovanni-Vincentelli, J. Craninckx, and G. Van der Plas, “A 6-bit 50-MS/s threshold configuring SAR ADC in 90-nm digital CMOS,” *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 59, no. 1, pp. 80–92, Jan. 2012.
- [38] M. J. E. Lee, W. J. Dally, and P. Chiang, “Low-power area-efficient high-speed I/O circuit techniques,” *IEEE J. Solid-State Circuits*, vol. 35, no. 11, pp. 1591–1599, Nov. 2000.
- [39] J. W. Jung and B. Razavi, “A 25 Gb/s 5.8 mW CMOS equalizer,” *IEEE J. Solid-State Circuits*, vol. 50, no. 2, pp. 515–526, Feb. 2015.
- [40] A. Manian and B. Razavi, “A 40-Gb/s 14-mW CMOS wireline receiver,” *IEEE J. Solid-State Circuits*, vol. 52, no. 9, pp. 2407–2421, Sep. 2017.



**Aurangozeb** (S’09) received the B.Sc. degree in electrical and electronic engineering from the Bangladesh University of Engineering and Technology, Dhaka, Bangladesh, in 2009. He is currently pursuing the Ph.D. degree in electrical and computer engineering with the University of Alberta, Edmonton, AB, Canada.

His current research interests include low-power ADC-based wireline receivers.



**AKM Delwar Hossain** (S’13) received the B.Sc. degree in electrical and electronic engineering from the Bangladesh University of Engineering and Technology, Dhaka, Bangladesh, in 2013, and the M.Sc. degree in electrical and computer engineering from the University of Alberta, Edmonton, AB, Canada.

He is currently a Design Engineer with MACOM Technology Solutions Inc., Burlington, ON, Canada. His current research interests include low-power digital receivers for wireline and optical applications.



**Maruf Mohammad** was born in Dhaka, Bangladesh. He received the B.Sc. degree in electrical engineering from the Bangladesh University of Engineering and Technology, Dhaka, in 1999, and the M.Sc. and Ph.D. degrees in electrical engineering from Virginia Tech, Blacksburg, VA, USA, in 2002 and 2006, respectively.

He joined the Mobile and Portable Radio Research Group, Virginia Tech, in 2000. He was involved in the wireless industry. His current research interests include joint data and channel estimation, wireless/wireline equalization, forward error correction coding, space-time processing, MIMO-OFDM systems, statistical (Markov and its variants) models, modeling RF and analog impairments, and calibration algorithms for wireless systems.

Dr. Maruf was awarded the University Partnership in Research Fellowship by Motorola, which supported his doctoral research.



**Masum Hossain** (M’11) received the B.Sc. degree from the Bangladesh University of Engineering and Technology, Dhaka, Bangladesh, in 2002, the M.Sc. degree from Queen’s University, Kingston, ON, Canada, in 2005, and the Ph.D. degree from the University of Toronto, Toronto, ON, in 2010.

From 2008 to 2010, he was with the Analog and Mixed Signal Division, Gennum Corporation, Burlington, ON, where he was involved in the development of the world’s highest capacity and most power-efficient crosspoint router solution. He was a Senior Member of Technical Staff with the Rambus Laboratory, Sunnyvale, CA, USA, where he was involved in advanced equalization and clock recovery techniques for high-speed interfaces. He has spent several years in industrial research. In 2013, he joined the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada.

Dr. Hossain was a recipient of the Best Student Paper Award at the 2008 IEEE Custom Integrated Circuits Conference and the Analog Devices Outstanding Student Designer Award in 2010.