



Figure 14.2.1: Comparison of loop-filter architectures in CT- $\Delta\Sigma$ s (top); proposed low-power loop-filter and linearization technique (bottom).



Figure 14.2.2: Reducing distortion: Implementation of reference-buffer assistance and dead-band switches to reduce glitches at the input of  $g_m$ .



Figure 14.2.3: Simplified single-ended implementation of CT- $\Delta\Sigma$  (top), and the complete neural recording front-end (bottom).



Figure 14.2.4: ADC measurements: PSD for peak SNDR (top), SNDR vs Input (bottom).



Figure 14.2.5: Front-end (CCIA+ADC) measurements: gain and input-referred noise (top), input impedance (center), linearity (bottom).

| Spec                                       | ADC comparison |              |                           | Front-end (CCIA+ADC) comparison |               |                           |
|--------------------------------------------|----------------|--------------|---------------------------|---------------------------------|---------------|---------------------------|
|                                            | [6] JSSC'17    | [5] JSSC'17  | This work $\Delta\Sigma$-ADC | [2] VLSI'17                     | [1] JSSC'17   | This work CCIA+ADC        |
| Power( $\mu$ W) / Sup(V)                   | 1120 / 1.8     | 280 / 1.8    | 4.5 / 1.2                 | 8 / 1                           | 7 / 1.2, 0.45 | 7.3 / 1.2                 |
| BW(Hz)                                     | 20k            | 24k          | 5k                        | DC to 500                       | 0.1 to 200    | 0.1 to 5k                 |
| Peak SNDR (dB)                             | 103            | 98.5         | 93.5                      | 70.6 *                          | LFP: 75       | Full BW: 78<br>LFP: 86    |
| SNR / SFDR (dB)                            | 106 / 107.4    | 99.3 / 107.6 | 94.3 / 102.5              | N/A                             | 77.6 / 79     | 81 / 82 <sup>b</sup>      |
| DR (dB) <sup>a</sup>                       | 109            | 103.6        | 96.5                      | 73.6                            | LFP: 77.4     | Full BW: 81<br>LFP: 90    |
| FOM <sub>S,DR</sub> (dB)                   | 181.5          | 182.9        | 187                       | 151.5                           | 152           | 169.4                     |
| FOM <sub>S,SNDR</sub> (dB)                 | 175.5          | 177.8        | 184                       | 148.5                           | 149.6         | 166.4                     |
| Area (mm <sup>2</sup> ) / Tech (nm)        | 0.16 / 160     | 1 / 180      | 0.053 / 40                | N/A / 180HV                     | 0.135 / 40    | 0.113 / 40                |
| Peak Inp (V <sub>pp</sub> ) <sup>a</sup>   |                |              |                           | 23.2m *                         | 100m          | 200m                      |
| THD @ peak inp (dB)                        |                |              |                           | -73.6 *                         | -79           | -81                       |
| Inp. ref. noise ( $\mu$ V <sub>rms</sub> ) |                |              |                           | 1.7                             | LFP: 5.2      | Full BW: 6.35<br>LFP: 1.8 |
| $Z_{in,DC}$ (MΩ)                           |                |              |                           | 30 *                            | $\infty$      | 1500                      |
| Large signal CM tolerance                  |                |              |                           | No                              | No            | Yes                       |

LFP: Local Field Potentials (1Hz – 200Hz); \* Estimated, exact values not reported

<sup>a</sup> For front-end, measured for distortion power = noise power; <sup>b</sup> Measured for peak input

Figure 14.2.6: Comparison with current state-of-the-art.



Figure 14.2.7: Chip micrograph of prototype.

## 14.3 A 13-ENOB 2<sup>nd</sup>-Order Noise-Shaping SAR ADC Realizing Optimized NTF Zeros Using an Error-Feedback Structure

Shaolan Li, Bo Qiao, Miguel Gandara, Nan Sun

University of Texas, Austin, TX

The noise-shaping (NS) SAR ADC is an emerging hybrid architecture that achieves high resolution and power-efficiency simultaneously by combining the merits of the SAR ADC and the  $\Delta\Sigma$ ADC, making it attractive to sensor readout and healthcare applications. To implement NS, most prior works adopted the classic cascaded integrator feed-forward (CIFF) structure for noise filtering [1-5]. Opamp-based integrators were used in [1-3] to achieve a relatively sharp noise transfer function (NTF), but at the cost of power and scaling friendliness. Reference [4] demonstrated 2<sup>nd</sup>-order NS using fully passive switched-capacitor (SC) integrators, but has limited NS and thermal noise performance. Reference [5] combines passive SC filters with a dynamic amplifier (D-amp) to achieve good noise and power, but is vulnerable to process-voltage-temperature (PVT) variations. In addition, no prior NS SAR has realized complex NTF zeros for optimum NS. This presents a need for an NS SAR ADC that can combine optimized NTF, power efficiency and PVT robustness. To overcome the limitations of existing work, instead of adopting the CIFF structure, this paper presents a 2<sup>nd</sup>-order NS SAR ADC using an error-feedback (EF) structure. It highlights the capability of realizing optimized complex NTF zeros by simply using charge sharing summation, a passive SC FIR and a comparator-reused dynamic amplifier with PVT tracking. This work achieves sharp NS performance while maintaining both hardware and power efficiency merits of the NS SAR with improved robustness. The prototype chip, fabricated in 40nm CMOS, achieves a 79dB SNDR at an OSR of 8 using a 9b SAR, resulting in a peak Schreier FoM of 178dB.

In classic  $\Delta\Sigma$  ADCs, the EF structure is less preferred due to its sensitivity to error in the quantization error extraction. This limitation is naturally resolved in an NS SAR as the conversion and extraction DAC is identical, which makes the extraction inherently accurate. Therefore, the EF structure can demonstrate strong architectural advantage in the context of the NS SAR, especially in realizing complex NTF zeros. Figure 14.3.1 depicts the architectural block diagram of the proposed design. The signal flow resembles a standard 2<sup>nd</sup>-order EF structure, where the quantization error is extracted after each conversion, then filtered by an FIR and subtracted from the next input signal. To minimize circuit complexity and power, a fully passive SC FIR is used. This subsequently enables the feedback summing to be implemented in the form of charge sharing by merging the SAR CDAC and the FIR caps. Since the CDAC already contains the subtracted result, the quantization error will be ready in the CDAC by the end of each conversion. Hence, unlike prior works that rely on a multi-path comparator to perform active summation for the signal and error paths [1-5], this work only needs a standard two-input comparator, reducing the number of noise and mismatch sources. The passive FIR and charge sharing summation will introduce extra noise and signal attenuation. This can be alleviated by using a dynamic amplifier (D-amp) to gain up the residue before filtering.
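The signal flow described above can be illustrated with a short behavioral model. This is only a sketch under stated assumptions, not the authors' circuit: the quantizer is an ideal, unclipped 9b uniform quantizer, the FIR taps are taken as a one-cycle delay and a two-cycle delay with 0.5 attenuation (as described for the passive FIR in this paper), and the function names, the $\pm 1$ full-scale range and the test tone are hypothetical choices made for illustration.

```python
import numpy as np

def ef_ns_quantize(u, k_ef=1.875, lsb=2.0 / 2**9):
    """Behavioral error-feedback loop (assumed model).

    Implements d[n] = u[n] + q[n] - k_ef*q[n-1] + 0.5*k_ef*q[n-2],
    where q[n] is the quantization error of conversion n.
    """
    d = np.empty_like(u)
    q1 = q2 = 0.0  # quantization errors of the previous two conversions
    for n, un in enumerate(u):
        v = un - k_ef * (q1 - 0.5 * q2)   # subtract the FIR-filtered past errors
        d[n] = lsb * np.round(v / lsb)    # ideal, unclipped uniform quantizer
        q2, q1 = q1, d[n] - v             # error extracted after this conversion
    return d

def inband_error_power(u, d, osr):
    """Error power (arbitrary units) from DC up to fs/(2*OSR), Hann-windowed."""
    err = (d - u) * np.hanning(len(u))
    spec = np.abs(np.fft.rfft(err)) ** 2
    return float(spec[: len(spec) // osr].sum())

if __name__ == "__main__":
    n = np.arange(2**14)
    u = 0.5 * np.sin(2 * np.pi * 0.0123 * n)      # hypothetical in-band test tone
    shaped = inband_error_power(u, ef_ns_quantize(u, 1.875), osr=8)
    plain = inband_error_power(u, ef_ns_quantize(u, 0.0), osr=8)
    print("in-band error reduction (dB):", 10 * np.log10(plain / shaped))
```

Setting `k_ef=0` disables the feedback and reduces the model to a plain quantizer, which gives a reference for the in-band error power. The printed reduction comes from this idealized model only and should not be read as a prediction of the measured SNDR.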

In this work, we realize complex NTF zeros by controlling the loop gain of the EF path ( $K_{EF}$ ). An advantage of modifying the NTF by changing  $K_{EF}$  is that it can simplify the FIR design to use integer ratios. Since the gain of the charge sharing and FIR is well defined by cap ratios, the tuning knob of  $K_{EF}$  is only related to the gain of the D-amp, allowing precise control and programmability. The change of the NTF (magnitude, zero location) vs.  $K_{EF}$  is illustrated in Fig. 14.3.1. As  $K_{EF}$  reduces to less than 2, the two zeros become complex and deviate from  $z = 1$ . Although the complex zeros do not fall exactly on the unit circle, they are sufficiently close to it, and are thus able to create a good notch in the band of interest. Calculation shows that, compared to real-zero NTF designs [4-5], more than 7dB SQNR improvement can be obtained with optimized zeros at an OSR = 8.
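The dependence of the zero locations on $K_{EF}$ can be checked numerically. The sketch below assumes the NTF has the form $NTF(z) = 1 - K_{EF}(z^{-1} - 0.5z^{-2})$, which follows from a one-cycle FIR tap and a two-cycle tap with 0.5 attenuation; this closed form is an assumption made here for illustration and is not quoted from the paper. The function names are hypothetical.

```python
import numpy as np

def ntf_zeros(k_ef):
    """Zeros of the assumed NTF, i.e. roots of z^2 - k_ef*z + k_ef/2."""
    return np.roots([1.0, -k_ef, 0.5 * k_ef])

def mean_inband_ntf_power(k_ef, osr, points=4096):
    """Mean of |NTF|^2 over the signal band [0, pi/OSR] (white-noise model)."""
    w = np.linspace(0.0, np.pi / osr, points)
    zinv = np.exp(-1j * w)
    ntf = 1.0 - k_ef * zinv + 0.5 * k_ef * zinv**2
    return float(np.mean(np.abs(ntf) ** 2))

if __name__ == "__main__":
    for k_ef in (2.0, 1.875):
        z = ntf_zeros(k_ef)
        gain_db = 10 * np.log10(mean_inband_ntf_power(k_ef, osr=8))
        print(f"K_EF={k_ef}: |z|={np.abs(z)}, angle={np.angle(z)}, "
              f"in-band noise gain={gain_db:.1f} dB")
```

Under this assumed form, the zero magnitude is $\sqrt{K_{EF}/2}$ for $K_{EF} < 2$, which is consistent with the statement that the complex zeros lie close to, but not exactly on, the unit circle. The in-band figures printed by this simple white-noise model compare $K_{EF}=2$ against $K_{EF}=1.875$ only; they are not a reproduction of the SQNR comparison against the designs in [4-5].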

The schematic and timing diagram of the proposed EF NS SAR ADC are shown in Fig. 14.3.2. The SAR core adopts a 9b CDAC with bottom-plate sampling. The SAR conversion takes up about 3/4 of a clock period, with the EF operation using the rest. The passive FIR consists of  $C_{res1}$ ,  $C_{res2}$  and  $C_{delay}$ , where  $C_{res1}$  serves as the one-cycle delay, and  $C_{res2}$  and  $C_{delay}$  serve as the two-cycle delay with 0.5 attenuation. To achieve a high D-amp gain for suppressing the FIR noise and minimizing signal attenuation with low power, this work reuses the comparator as a regenerative D-amp [6]. By leveraging the combined integrating and positive-feedback behavior of a Strong-Arm latch, the D-amp provides a high gain of 30 within 2ns with low power. During the amplification mode,  $C_{res1}$  and  $C_{res2}$  are connected directly to the comparator output. A gain-control timer will shut off the connection before the comparator fully regenerates. In this region, the output voltage of the comparator is proportional to its input, and it thus functions as an amplifier. As the D-amp only processes the residue of a 9b SAR, the gain is sufficiently linear. With a D-amp gain of 30, we can use as small as 140fF for the FIR with negligible area and noise penalty. The loading from  $C_{res1}$  and  $C_{res2}$  naturally dampens the bandwidth of the comparator, lowering the D-amp noise during amplification. By reusing the comparator as D-amp, their offsets are intrinsically matched, which obviates the need for offset mismatch calibration.

As with most D-amps, the comparator-reused D-amp is also subject to gain variation from PVT changes. This design proposes a coarse-fine automatic gain-control timer to regulate the gain over PVT variation, as described in Fig. 14.3.3. The coarse tracking is implemented by a dynamic OR-structure, where the PMOS pair serves as a replica of the comparator's PMOS pair. The dynamic OR gate tracks the integrating period of the D-amp over PVT and will trigger a flip when it enters regeneration. The fine-tuning of the gain is done by applying a back-gate bias  $V_{TIMER}$  to the PMOS pair of the timer, which modulates the gate delay and thus controls the regeneration time seen by the load.  $V_{TIMER}$  is generated by a dither-based LMS background calibration loop as shown in Fig. 14.3.3. To speed up the calibration convergence, we only apply the EF summation during the last 5b of the SAR conversion. This restricts the PRM to be correlated with the last 5 bits only, hence reducing the interference from the signal components during gain error extraction. With 8-LSB redundancy, the conversion result is unaffected. Although a DAC is required to generate  $V_{TIMER}$ , it only needs 7b resolution and updates every 2048 samples, making it easy to implement and power efficient. Post-layout simulation shows that the background calibration unit consumes only 4uW. The background calibration (except PRNG) is implemented off-chip during measurement.

The prototype NS SAR ADC is fabricated in a 40nm LP-CMOS process, occupying an active area of 0.024mm<sup>2</sup>. Under 1.1V supply, it consumes 84uW when sampling at 10MHz, where 21uW, 29uW and 34uW are contributed by analog, reference and digital, respectively. CDAC mismatches are calibrated in the foreground. Figure 14.3.4 shows the measured output PSD with NS disabled/enabled. The peak SNDR improved from 64dB to 79dB over 625kHz bandwidth when NS is on, with a clear 2<sup>nd</sup>-order high-pass shaping. The measured SNR/SNDR vs input amplitude is also presented in Fig. 14.3.4, showing a DR of 80.5dB. Figure 14.3.5 demonstrates the measured functionality of the background gain calibration. The proposed ADC was first background calibrated to achieve the desired performance with nominal  $V_{DD}$ . We then applied a +5% step on the  $V_{DD}$ . With background calibration, the degraded SNDR and D-amp gain gradually converged back to the desired value. The time-constant of the calibration loop is about 30ms, which is sufficient to track most real-time changes. The measured SNDR and PSD under various D-amp gain settings are also shown in Fig. 14.3.5. The peak SNDR is achieved with the D-amp gain at 30, which corresponds to a  $K_{EF}$  of 1.875. Its corresponding measured PSD also suggests a lowest in-band noise floor. The measured optimum gain setting is in good agreement with the analysis. Figure 14.3.6 provides a performance summary and comparison table. This design achieves a peak Schreier and Walden FoM of 178dB and 9fJ/step, respectively. Figure 14.3.7 shows the die photo of the prototype. In summary, this work has demonstrated an effective solution to realize optimized NTF zeros for NS SARs with simple low-power scaling-friendly circuits. It also proves the efficiency and usefulness of the EF structure in implementing low-power high-resolution NS SARs.

#### Acknowledgement:

The authors would like to thank TSMC University Shuttle Program for chip fabrication.

#### References:

- [1] J. Fredenburg, M. Flynn, "A 90MS/s 11MHz Bandwidth 62dB SNDR Noise-Shaping SAR ADC," ISSCC, pp. 468-469, Feb. 2012.
- [2] K. Obata, et al., "A 97.99 dB SNDR, 2 kHz BW, 37.1  $\mu$ W Noise-Shaping SAR ADC with Dynamic Element Matching and Modulation Dither Effect," IEEE Symp. VLSI Circuits, pp. 22-23, June 2016.
- [3] Y.-S. Shu, et al., "An Oversampling SAR ADC with DAC Mismatch Error Shaping Achieving 105dB SFDR and 101dB SNDR over 1kHz BW in 55nm CMOS," ISSCC, pp. 458-459, Feb. 2016.
- [4] W. Guo, et al., "A 13b-ENOB 173dB-FoM 2nd-Order NS SAR ADC with Passive Integrators," IEEE Symp. VLSI Circuits, pp. C236-C237, June 2017.
- [5] C.-C. Liu, et al., "A 0.46mW 5MHz-BW 79.7dB-SNDR Noise-Shaping SAR ADC with Dynamic-Amplifier-Based FIR-IIR Filter," ISSCC, pp. 466-467, Feb. 2017.
- [6] M. Gandara, et al., "A Pipelined SAR ADC Reusing the Comparator as Residue Amplifier," IEEE CICC, pp. 1-4, May 2017.



Figure 14.3.1: Architectural diagram of the proposed EF NS SAR, and the NTF performance as a function of the EF loop gain  $K_{EF}$



Figure 14.3.2: Circuit schematic (single-ended) and timing diagram of the proposed 9b EF NS SAR, reusing the comparator as a residue D-amp.



Figure 14.3.3: The automatic gain-control timer and background calibration model for the residue D-amp.



Figure 14.3.4: Measured output PSD with NS disabled/enabled (Top). Measured SNR/SNDR versus input amplitude and power breakdown (Bottom).



Figure 14.3.5: Measured SNDR and D-amp gain over voltage changes with background calibration running (Top). Measured SNDR and PSD near band-edge versus D-amp gain (Bottom).

| Specifications                 | ISSCC 2012 Fredenburg | ISSCC 2016 Shu | VLSI 2016 Obata | ISSCC 2017 Liu | VLSI 2017 Guo | This work    |
|--------------------------------|-----------------------|--------------|---------------|--------------|---------------|--------------|
| Architecture                   | CIFF                  | CIFF         | CIFF          | CIFF         | CIFF          | EF           |
| Opamp-free                     | ✗                     | ✗            | ✗             | ✓            | ✓             | ✓            |
| Optimized NTF Zeros            | ✗                     | ✗            | ✗             | ✗            | ✗             | ✓            |
| PVT Robust                     | ✓                     | ✓            | ✓             | ✗            | ✓             | ✓            |
| Technology [nm]                | 65                    | 55           | 28            | 28           | 40            | 40           |
| Active Area [mm <sup>2</sup> ] | 0.03                  | 0.072        | 0.116         | 0.0049       | 0.04          | <b>0.024</b> |
| Supply [V]                     | 1.2                   | 1.2          | 1.8/1.1       | 1            | 1.1           | 1.1          |
| NS Order                       | 1                     | 1            | 3             | 1            | 2             | 2            |
| Fs [MS/s]                      | 90                    | 1            | 1             | 132          | 8.4           | 10           |
| Bandwidth [kHz]                | 11000                 | 1            | 20            | 5000         | 262           | <b>625</b>   |
| OSR                            | 4                     | 500          | 25            | 13.2         | 16            | <b>8</b>     |
| SNDR [dB]                      | 62                    | 101          | 94            | 80           | 80            | <b>79</b>    |
| Power [uW]                     | 806                   | 15.7         | 493           | 460          | 143           | <b>84</b>    |
| FoMs* [dB]                     | 163                   | 179          | 170           | 180          | 173           | <b>178</b>   |
| FoMw** [fJ/step]               | 36                    | 85           | 305           | 5.8          | 33            | <b>9</b>     |

\*FoMs = SNDR + 10 \* log<sub>10</sub>(BW/Power)

\*\*FoMw = Power/(2<sup>ENOB</sup> \* 2 \* BW)

Figure 14.3.6: Performance summary and comparison with state-of-the-art NS SAR ADCs.
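The two figures of merit in the table footnotes can be evaluated directly from the reported SNDR, bandwidth and power. The following is a minimal sketch; the function names are hypothetical, and the ENOB conversion $(SNDR - 1.76)/6.02$ is the standard definition, which is assumed here because the footnotes do not spell it out.

```python
import math

def schreier_fom_db(sndr_db, bw_hz, power_w):
    """FoMs = SNDR + 10*log10(BW / Power), in dB."""
    return sndr_db + 10.0 * math.log10(bw_hz / power_w)

def walden_fom_j_per_step(sndr_db, bw_hz, power_w):
    """FoMw = Power / (2^ENOB * 2 * BW), in joules per conversion step."""
    enob = (sndr_db - 1.76) / 6.02  # standard ENOB definition (assumed)
    return power_w / (2.0**enob * 2.0 * bw_hz)

if __name__ == "__main__":
    # Values reported for this work: 79dB SNDR, 625kHz bandwidth, 84uW.
    foms = schreier_fom_db(79.0, 625e3, 84e-6)
    fomw = walden_fom_j_per_step(79.0, 625e3, 84e-6)
    print(f"FoMs = {foms:.1f} dB, FoMw = {fomw * 1e15:.1f} fJ/step")
```

With the values reported for this work, the results round to the 178dB and 9fJ/step listed in the table.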



Figure 14.3.7: Die photograph.

## 14.4 A 1.1mW 200kS/s Incremental $\Delta\Sigma$ ADC with a DR of 91.5dB Using Integrator Slicing for Dynamic Power Reduction

Patrick Vogelmann, Michael Haas, Maurits Ortmanns

University of Ulm, Ulm, Germany

Nyquist-rate ADCs with high resolution are needed in many applications where, for example, multiplexed operation is needed as for multichannel sensor readout. For various tasks such as averaging-based analysis or lock-in detection high linearity at the presence of very low noise is required. While SAR ADCs are known for high power efficiency, they often show limited effective resolutions and linearity unless oversampling, mismatch error shaping or calibration techniques are employed [1,2]. The state of the art in high-resolution ADC design is thus dominated by oversampled converters, and for Nyquist-rate operation a significant power penalty must be paid.

Another candidate is the incremental Delta-Sigma ( $I-\Delta\Sigma$ ) ADC, which still exploits oversampling and noise-shaping, but which is a true Nyquist-rate converter by regularly resetting the internal integrators. In all implementations, the most dominant power consumer of the  $I-\Delta\Sigma$  modulator is the first integrator [3,4], which must fulfill the overall noise and linearity requirements. Later stages benefit from the loopfilter suppression of previous stages and are therefore less power hungry. Consequently, all measures to enhance this first integrator are highly demanded as it determines the performance and efficiency of the entire ADC. In this paper, a dynamic reconfiguration of the loopfilter input stage is proposed, which allows a decrease in power consumption of the first integrator at the cost of a negligible increase in the noise floor.

Figure 14.4.1 depicts a block diagram of a generic discrete-time  $I-\Delta\Sigma$  ADC operating for OSR clock cycles with frequency  $f_s$  on a continuously changing input  $u(t)$ , which is sampled by the first SC integrator of the modulator yielding  $u[k]$ . For OSR cycles, the digital output stream of the quantizer decisions  $d[k]$  is fed to a reconstruction filter to generate a digital output  $D[n]$  at Nyquist rate  $f_N=f_s/OSR$ . Subsequently, modulator and filter are both reset and the next conversion step starts. As seen at the bottom of Fig. 14.4.1, a 3<sup>rd</sup>-order CIFF  $I-\Delta\Sigma$  modulator architecture with a single-bit quantizer is chosen for reduced internal integrator swing and inherent DAC linearity. A dedicated frontend S&H operating at  $f_N$  can be omitted in a Nyquist-rate  $I-\Delta\Sigma$  ADC [4]. Assuming, without loss of generality, that the signal transfer function  $stf[k]$  of the modulator is approximately unity for in-band signals, only the transfer function of the reconstruction filter  $h[k]$  contributes to the analog-input to digital-output transfer function. This filter can be efficiently realized by a chain of integrators (CoI) filter. In Fig. 14.4.1, the weighting factors  $h[k]$  of a 3<sup>rd</sup>-order CoI filter as a function of the internal clock cycle,  $k$ , are shown, normalized to the first weight  $h[1]$  for better readability. The filter causes a non-uniform weighting of the input signal  $u[k]$  as well as any input-referred noise  $t[k]$ . The noise  $v_{n,out,rms}^2[n]$  contained in the  $n^{th}$  Nyquist-rate output  $D[n]$  can be calculated as the sum of all squared filter weights multiplied by the respective noise spectral density in each cycle  $k$ . Thus, a dynamically increased input-referred noise floor towards the end of each conversion, where the filter weights  $h[k]$  become smaller, will not significantly degrade the noise performance of the respective sample. Consequently, the input-referred noise of the  $I-\Delta\Sigma$  modulator, and especially of the first, power-determining OTA, can be dynamically increased towards the end of each conversion, which allows significant power saving at the cost of a negligible increase of the output noise floor.
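To make the non-uniform weighting concrete, the sketch below evaluates how much of the output noise power is contributed by the late cycles of a conversion. It assumes that the weight of cycle $k$ in a 3<sup>rd</sup>-order CoI filter is proportional to $m(m+1)/2$ with $m = OSR - k + 1$; this closed form, like the function names, is an assumption made for illustration, and the exact weights of the implemented filter may differ by a cycle of delay.

```python
import numpy as np

def coi3_weights(osr):
    """Assumed 3rd-order CoI weights h[1..OSR], normalized to h[1]."""
    m = np.arange(osr, 0, -1, dtype=float)  # remaining cycles: OSR, ..., 1
    h = m * (m + 1) / 2.0
    return h / h[0]

def noise_share(h, first_cycle):
    """Share of output noise power from cycles first_cycle..OSR,
    assuming equal (white) input-referred noise in every cycle."""
    h2 = h**2
    return float(h2[first_cycle - 1:].sum() / h2.sum())

if __name__ == "__main__":
    h = coi3_weights(150)
    print("share of noise from cycles 101..150:", noise_share(h, 101))
```

Under these assumptions, the last third of an OSR=150 conversion contributes less than 1% of the weighted noise power, which is why raising the noise floor in those cycles has little effect on the output sample.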

Figure 14.4.2 depicts a simplified schematic of the presented  $I-\Delta\Sigma$  ADC, including the proposed dynamic integrator slicing scheme. The first integrator is split into four identical slices, which can be independently activated. Each slice is realized as a standalone SC integrator employing bottom-plate sampling and using bootstrapped switches. All four slices in parallel are designed to form an integrator that fulfills the noise, GBW, SR and linearity requirements needed for nominal operation and an effective resolution of 15b. The slicing makes it possible to sequentially disable parts of the integrator during a single Nyquist conversion cycle, where, according to the scheme in Fig. 14.4.1, 4 slices are operated for  $k_0$  clock cycles, 3 slices for  $k_1$ , 2 slices for  $k_2$  and 1 slice is operated for  $k_3$  cycles. The dynamic reconfiguration leaves the scaling coefficient of the 1<sup>st</sup> integrator  $c_1=C_s/C_{fb}$  constant, since feedback  $C_{fb}$  and sampling capacitor  $C_s$  are sliced together with the OTA. Also, the gain bandwidth  $GBW=g_m/C_{load}$  of the OTA remains unaltered as the effective reduction of  $g_m$  is approximately the same as the reduction of the capacitive load  $C_{load}$  during slicing.

Figure 14.4.3 illustrates the degradation of the SNR and the achieved power saving in the first integrator due to the slicing for 4 slices and an OSR=150. A noise penalty factor  $p[k]$  is introduced to calculate the expected loss in SNR due to slicing. If all 4 slices are permanently active (black dashed line), no noise penalty is paid, but no power is saved either. By disabling two slices for  $k_2$  of the total OSR cycles, the noise power is doubled for this period, but the power consumption is halved. The dashed lines in Fig. 14.4.3 mark the calculated SNR degradation as well as the power saving with respect to the nominal resolution when 0...3 integrator slices are disabled for the complete conversion cycle. The solid curves depict the calculated SNR degradation and power saving by dynamically disabling 1...3 amplifiers after a certain number of cycles  $k \leq OSR$ . For example, after  $k=100$  out of the OSR=150 cycles, the SNR degradation is not visible anymore, even if 3 out of 4 integrators are switched off, which would result in 25% power saving in the first integrator. A similar negligible performance drop is achieved by switching off fewer slices, but even earlier. The overall power efficiency, for a given degradation in SNR, can be maximized by carefully choosing the number of slices as well as the switching instances  $k_{0..3}$  at which slices are disabled. Extensive measurements using integrator slicing, marked as crosses in Fig. 14.4.3, accurately confirm the theory.
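The trade-off described above can be estimated with a few lines of code. This is a sketch under stated assumptions rather than the authors' calculation: the noise penalty is taken as $p[k] = 4/(\text{active slices})$, matching the statement that disabling two of four slices doubles the noise power; the filter weights are assumed proportional to $m(m+1)/2$ with $m = OSR - k + 1$; and only the thermal noise of the first integrator is modeled. The function name is hypothetical.

```python
import numpy as np

def slicing_tradeoff(k_segments, n_slices=4):
    """Estimate SNR loss (dB) and first-integrator power saving.

    k_segments[i] is the number of cycles run with (n_slices - i) active
    slices, so it must contain n_slices entries (zeros are allowed).
    """
    osr = sum(k_segments)
    active = np.repeat(np.arange(n_slices, 0, -1), k_segments).astype(float)
    m = np.arange(osr, 0, -1, dtype=float)
    h2 = (m * (m + 1) / 2.0) ** 2          # squared filter weights (assumed form)
    penalty = n_slices / active            # noise penalty factor p[k]
    snr_loss_db = 10 * np.log10((h2 * penalty).sum() / h2.sum())
    power_saving = 1.0 - active.mean() / n_slices
    return float(snr_loss_db), float(power_saving)

if __name__ == "__main__":
    for k in ((150, 0, 0, 0), (100, 0, 0, 50), (40, 30, 10, 70)):
        loss, saving = slicing_tradeoff(k)
        print(k, f"SNR loss = {loss:.2f} dB, power saving = {saving:.1%}")
```

For the example in the text (three slices disabled after 100 of 150 cycles), the model returns the stated 25% power saving with an SNR loss well below 0.1dB. The power saving refers to the first integrator only, not to the complete modulator.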

The proposed  $I-\Delta\Sigma$  modulator is fabricated in a 0.18 $\mu m$  CMOS technology (Fig. 14.4.7). The active area is 0.363mm<sup>2</sup>. The sampling rate is  $f_s=30MHz$ ; with an OSR of 150, the Nyquist rate becomes 200kS/s. The modulator consumes 1.65mW from a 3V supply without making use of the slicing technique. Readout of the modulator and digital filtering as well as controlling the amplifier slices is performed by an FPGA. With the optimal slicing parameters for the prototype ( $k_{0,1,2,3} = 40, 30, 10, 70$ ), the power consumption can be reduced to 1.098mW, which results in only 0.7dB/0.8dB loss in SNR/SNDR as depicted in the spectrum in Fig. 14.4.4. The noise floor is dominated by the 1/f noise of the modulator because neither auto-zeroing nor chopping has been implemented in this prototype.

Figure 14.4.5 shows the measured SNR/SNDR vs. input amplitude for a 13kHz sinusoidal input with the same slicing configuration. The achieved SNR/SNDR/DR are 88.2/86.5/91.5dB. As depicted in Fig. 14.4.5, the signal source limits the maximum measurable SFDR of the ADC, which is confirmed by also showing the measured linearity of a commercially available 24b ADC. Nevertheless, the presented ADC achieves an SFDR of above 97dB over the complete inband frequency range. Figure 14.4.6 summarizes this work and compares it to state-of-the-art Nyquist-rate ADC designs. The proposed design achieves a Schreier FOM<sub>s</sub> of 171.1dB. The proposed slicing can be employed in every state-of-the-art  $I-\Delta\Sigma$  ADC, yielding power savings with negligible performance drop.

### Acknowledgement:

Gregor Neusser and the Focused Ion Beam Center UUlm, supported by Thermo Fisher (formerly FEI), the German Science Foundation (INST40/385-F1UG), and the Struktur- und Innovationsfonds Baden-Württemberg, are thanked.

### References:

- [1] A. AlMarashli, et al., "A 107 dB SFDR, 80 kS/s Nyquist-Rate SAR ADC Using a Hybrid Capacitive and Incremental  $\Sigma\Delta$  DAC," *IEEE Symp. VLSI Circuits*, pp. C240-C241, June 2017.
- [2] D. Hummerston, P. Hurrell, "An 18-bit 2MS/s pipelined SAR ADC utilizing a sampling distortion cancellation circuit with -107dB THD at 100kHz," *IEEE Symp. VLSI Circuits*, pp. C280-C281, June 2017.
- [3] Y. Chae, et al., "A 6.3uW 20b Incremental Zoom-ADC with 6ppm INL and 1uV Offset," *ISSCC*, pp. 276-277, Feb. 2013.
- [4] S. Tao and A. Rusu, "A Power-Efficient Continuous-Time Incremental Sigma-Delta ADC for Neural Recording Systems," *IEEE TCAS-I*, vol. 62, no. 6, pp. 1489-1498, June 2015.



Figure 14.4.1: Weighting of the rms noise as a function of clock cycle  $k$  in a DT I- $\Delta\Sigma$  ADC with 3rd-order CoI filter.



Figure 14.4.2: Simplified schematic of the modulator including the slicing technique with clocking/controlling.



Figure 14.4.3: Calculated and measured SNR degradation and power saving due to the dynamic integrator slicing technique.



Figure 14.4.4: Measured spectrum with an input signal of -4dBFS at 13kHz with (top) and without (bottom) slicing.



Figure 14.4.5: Measured SNR/SNDR vs. amplitude (top). Measured linearity compared to a commercial 24b ADC (bottom).

|                                 | AlMarashli[1]<br>VLSI'17       | Hummerston[2]<br>VLSI'17       | Chae[3]<br>ISSCC'13 | Chen<br>ISSCC'13           | This work                              |
|---------------------------------|--------------------------------|--------------------------------|---------------------|----------------------------|----------------------------------------|
| Architecture                    | SAR <sup>TTT</sup>             | SAR <sup>TTT</sup>             | Incremental         | Incremental $\Delta\Sigma$ | Incremental $\Delta\Sigma$             |
| Technology                      | 40nm                           | 180nm                          | 160nm               | 160nm                      | 180nm                                  |
| Resolution(bit)                 | 16                             | 18                             | 20                  | 14                         | 15                                     |
| Total sampling capacitance (pF) | 2x16                           | -                              | 2x10.24             | 1x0.5                      | 2x0.35                                 |
| Area (mm <sup>2</sup> )         | 0.074                          | 3.87                           | 0.375               | 0.45                       | 0.363                                  |
| Power Supply (V)                | 2.5/1.1                        | 5/1.8                          | 1.8                 | 1                          | 3                                      |
| Power ( $\mu$ W)                | 101                            | 30520                          | 6.3                 | 20                         | 1098                                   |
| $f_{s,nyq}$ (kS/s)              | 80                             | 5000                           | 0.0025              | 1.334                      | 200                                    |
| SNDR (dB)                       | 84.8                           | 100                            | 119.8               | 81.9                       | 86.6                                   |
| SFDR (dB)                       | 107                            | -                              | -                   | -                          | 101.3                                  |
| FOM <sub>s</sub> (dB)           | 170.8 <sup>T</sup>             | 179.9 <sup>T</sup>             | 182.7 <sup>T</sup>  | 157.1 <sup>T</sup>         | 171.1 <sup>TT</sup> /166.2 <sup>T</sup> |

<sup>T</sup>  $FOM_s = SN(D)R + 10 \cdot \log_{10}(BW/P)$

<sup>TT</sup>  $FOM_s = DR + 10 \cdot \log_{10}(BW/P)$

<sup>TTT</sup> Calibration is necessary

Figure 14.4.6: Performance summary and comparison with state-of-the-art Nyquist-rate converters.



Figure 14.4.7: Chip photo including power breakdown for the individual parts including power savings due to slicing.

## 14.5 A 280 $\mu$ W Dynamic-Zoom ADC with 120dB DR and 118dB SNDR in 1kHz BW

Shoubhik Karmakar<sup>1</sup>, Burak Gönen<sup>1</sup>, Fabio Sebastiano<sup>1</sup>, Robert Van Veldhoven<sup>2</sup>, Kofi A. A. Makinwa<sup>1</sup>

<sup>1</sup>Delft University of Technology, Delft, The Netherlands

<sup>2</sup>NXP Semiconductors, Eindhoven, The Netherlands

Micro-power ADCs with high linearity and dynamic range (DR) are required in several applications, such as smart sensors, biomedical imaging, and portable instrumentation. Since the signals of interest are then often small (tens of  $\mu$ V) and slow (<1kHz BW), such ADCs should also exhibit low offset and flicker noise. Noise-shaping SAR [1] and incremental ADCs [2] have been proposed for such applications, but their DR is limited to about 100dB. Although the  $\Delta\Sigma$  modulator ( $\Delta\Sigma M$ ) proposed in [3] achieves 136dB DR, it is at the expense of high power consumption (12.7mW). The incremental zoom ADC proposed in [4] combines a coarse SAR ADC and a fine  $\Delta\Sigma$  ADC to efficiently achieve 119.8dB DR, but is limited to DC signals. The dynamic zoom ADC in [5] solves this problem, but requires external filtering to cope with out-of-band interference. This paper describes an interferer-robust dynamic zoom ADC that consumes 280 $\mu$ W while achieving 120.3dB DR and 118.1dB SNDR in 1kHz BW, resulting in a Schreier FOM of 185.8dB. It also achieves a maximum offset of 30 $\mu$ V and a 1/f corner of 7Hz. These advances are achieved by the combination of dynamic error-correction techniques, an asynchronous SAR ADC and a fully differential inverter-based  $\Delta\Sigma$  ADC.

As shown in Fig. 14.5.1, the proposed ADC consists of a coarse 5b asynchronous SAR ADC and a fine 2<sup>nd</sup>-order 1b  $\Delta\Sigma M$ . In contrast to previous zoom ADCs [4,5], the coarse and fine conversions employ the same sampling frequency ( $f_s = 2\text{MHz}$ ). During  $\phi_1$ , the SAR ADC ( $V_{LSB,SAR} = V_{REF}/31$ ) outputs a conversion result  $K$ , which is then used to update the  $\Delta\Sigma M$  references during the next half clock cycle ( $\phi_2$ ). The fine and coarse converters use separate DACs, and so there will be some mismatch between their LSBs. Furthermore, coarse conversion errors may occur. To ensure that the input swing of the modulator remains within its stable range under all of these conditions, over-ranging is applied. This is accomplished by setting the positive and negative references to  $V_{REF+} = (K+2)\cdot V_{LSB,SAR}$  and  $V_{REF-} = (K-1)\cdot V_{LSB,SAR}$ , respectively. The 5b digital output of the zoom ADC ( $Dout$ ) can then be obtained by combining the 5b output of the SAR with the 1b output of the  $\Delta\Sigma M$ .
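The over-ranging rule above can be illustrated with a small behavioral sketch. It assumes references normalized to $V_{REF} = 1$ and uses hypothetical helper names (`zoom_references`, `within_window`); clamping at the ends of the DAC range is not modeled, since the text does not describe it.

```python
def zoom_references(k, v_ref=1.0, n_units=31):
    """Return (V_REF-, V_REF+) for SAR code k, following the over-ranging rule.

    v_ref=1.0 is a normalization chosen for illustration (assumption).
    """
    v_lsb = v_ref / n_units            # V_LSB,SAR = V_REF / 31
    return (k - 1) * v_lsb, (k + 2) * v_lsb


def within_window(v_in, k, v_ref=1.0, n_units=31):
    """True if v_in lies between the two fine references selected by code k."""
    v_neg, v_pos = zoom_references(k, v_ref, n_units)
    return v_neg <= v_in <= v_pos


if __name__ == "__main__":
    v_in = 10.5 / 31                   # an input that ideally yields K = 10
    for k in (9, 10, 11, 12):
        print(k, within_window(v_in, k))
```

In this idealized model the reference window spans three SAR LSBs, so a coarse code that is off by one in either direction still brackets the input.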

Using a high-speed coarse ADC in a zoom ADC confers significant advantages. In this work, compared to [5], the use of an asynchronous SAR ADC ensures that the  $\Delta\Sigma M$  references are updated 10x faster – after half a clock cycle instead of after 5 clock cycles. This significantly reduces the input swing of the  $\Delta\Sigma M$ , thus requiring 2x less over-ranging, and making better use of its dynamic range. This, in turn, means that the target resolution could be obtained with a 2<sup>nd</sup>-order  $\Delta\Sigma M$ , rather than the 3<sup>rd</sup>-order  $\Delta\Sigma M$  used in [5]. Another advantage of the faster update rate is that it improves the robustness of the zoom ADC to out-of-band interference. Compared to [5], which could only handle full-scale signals up to 1.5x its BW before its SNDR degraded, the proposed design can handle full-scale signals up to 48x its BW. In many cases, e.g. sensor readout, this level of robustness obviates the need for external filtering.

The asynchronous SAR ADC consists of a 5b binary-weighted capacitive DAC, asynchronous SAR logic, and a comparator, as shown in Fig. 14.5.2. Since the unit capacitor is small ( $C_0 = 5\text{fF}$ ), a preamplifier is used before the comparator to mitigate kick-back. The input is tracked until the rising edge of CLK, when it is sampled. Each conversion cycle starts by setting the DAC inputs and then resetting the comparator with  $compCLK=0$ . After a delay ( $t_{settle}$ ) to allow the DAC to settle, the comparator is clocked ( $compCLK=1$ ) to make a comparison. An XOR gate monitors the comparator output and generates the  $outputRDY=1$  signal once a decision is made. The decision is saved in the SAR register and a new cycle is started. After 5 asynchronous cycles, the SAR ADC returns to its input-tracking mode and the preamplifier is turned off to save power.
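The conversion cycle described above can be sketched behaviorally as follows. This is a single-ended, timing-free model under stated assumptions: the function names are hypothetical, the comparator is ideal, the DAC switching scheme is reduced to a plain binary search, and  $t_{settle}$  is only marked by a comment.

```python
def comparator(v_in, v_dac, comp_clk):
    """Return complementary outputs (outp, outn); both are 0 while reset."""
    if not comp_clk:
        return 0, 0
    return (1, 0) if v_in >= v_dac else (0, 1)


def async_sar(v_in, v_ref=1.0, n_bits=5):
    """Behavioral 5b conversion; returns (code, number of cycles used)."""
    full_scale = (1 << n_bits) - 1          # 31 steps, so V_LSB,SAR = V_REF / 31
    code, cycles = 0, 0
    for bit in reversed(range(n_bits)):
        trial = code | (1 << bit)           # set the DAC inputs for this cycle
        v_dac = trial * v_ref / full_scale
        outp, outn = comparator(v_in, v_dac, comp_clk=0)   # reset phase
        assert (outp ^ outn) == 0           # outputRDY stays low during reset
        # t_settle would elapse here; timing is not modeled in this sketch
        outp, outn = comparator(v_in, v_dac, comp_clk=1)   # comparison phase
        output_rdy = outp ^ outn            # XOR flags that a decision was made
        if output_rdy and outp:
            code = trial                    # keep the bit in the SAR register
        cycles += 1
    return code, cycles


if __name__ == "__main__":
    print(async_sar(0.5))
```

Because each cycle is triggered by the previous decision rather than by a clock edge, the loop count (five) is fixed while the real conversion time depends on comparator and DAC settling.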

A simplified schematic of the 2<sup>nd</sup>-order feed-forward 1b  $\Delta\Sigma M$  is shown in Fig. 14.5.3. It consists of two switched-capacitor integrators, a 5b unary capacitive DAC and a comparator. Correlated double sampling (CDS) is used to mitigate the effects of the 1<sup>st</sup> integrator's offset and 1/f noise. At the end of  $\phi_1$ , the DAC capacitors  $C_{DAC[1..31]}$  (437fF each, equivalent to  $C_S = 13.5\text{pF}$ ) sample  $V_{IN}$  with respect to the input offset and 1/f noise of OTA1, which is configured as a unity-gain buffer. During  $\phi_2$ , the DAC is set to  $m = (K+2)$  or  $m = (K-1)$  depending on the output bitstream (bs), so that a charge  $C_S(V_{IN} - m\cdot V_{REF}/31)$  is transferred into the integration capacitor  $C_{INT,1}$  (9pF). In order to minimize the coupling between the SAR ADC and the  $\Delta\Sigma M$  through the ADC input terminal, their sampling instants are kept a half clock cycle apart (Fig. 14.5.1). Also, data-weighted averaging (DWA) is applied to the DAC to improve its linearity. Each SAR conversion only takes about 10% of  $\phi_1$ , which gives the DWA logic sufficient time before the start of  $\phi_2$ .

As noted above, rapidly updating the references of the fine  $\Delta\Sigma M$  reduces its input swing. As a result, the loop filter can be implemented with power-efficient fully differential current-reuse OTAs (Fig. 14.5.3). For robustness to PVT, OTA1 is biased with 40 $\mu$ A mirrored from a constant- $g_m$  reference. OTA2 is an 8x scaled down version of OTA1. Both OTAs use cascodes to achieve a DC gain of 60dB. While 40dB would have been sufficient to keep the quantization noise of the modulator well below the thermal noise floor, more gain improves the ability of the CDS scheme to suppress the offset and flicker noise of OTA1. Compared to the dynamically biased pseudo-differential inverter-based OTAs used in [4,5], the result is a more robust and area-efficient loop filter design.

The prototype chip (Fig. 14.5.7) is realized in a standard 0.16 $\mu$ m CMOS process and occupies an active area of 0.25mm<sup>2</sup>. It draws 154.5 $\mu$ A (88 $\mu$ A analog, 42 $\mu$ A digital, and 24.5 $\mu$ A references) from a 1.8V supply. On-chip LVDS drivers were used to output the 5b output code in order to minimize on-board coupling between analog and digital. Figure 14.5.4 shows the output spectrum of the zoom ADC with a full-scale signal, and also with its inputs shorted to measure its offset and 1/f corner. The “fuzz” visible above 2kHz is due to the fact that the output spectrum is the result of adding the SAR ADC output, which contains wide-band quantization noise, to the fine  $\Delta\Sigma M$  output, which is processed by the low-pass  $\Delta\Sigma M$  signal transfer function. Being a signal-processing artifact, it does not cause intermodulation issues and is suppressed by the decimation filter of the ADC. The maximum offset is 30 $\mu$ V (10 samples) and the 1/f corner is at 7Hz. From DC to 3kHz, the PSRR is greater than 96dB, demonstrating the benefits of using fully differential OTAs. The SNDR remains the same even with full-scale out-of-band input signals with frequencies up to 48kHz.

Figure 14.5.5 shows the peak SNR, peak SNDR and DR of the ADC, which are 119.1dB, 118.1dB, and 120.3dB, respectively, for a 152Hz input signal in a 1kHz bandwidth. The measured -125.9dB THD (6 harmonics included) and SNDR were limited by a 1<sup>st</sup>-order filter used in the measurement setup. This filter (-3dB BW of 2.3MHz) was inserted between a pair of off-chip buffers and the ADC to limit the fold-back of the wideband noise from the buffer. However, this filter also causes incomplete settling, and hence introduces some distortion.

Figure 14.5.6 summarizes the ADC performance and compares it to ADCs with similar resolution and bandwidth (>95dB SNDR, <2kHz BW). It outperforms all other designs in terms of SNDR while achieving a state-of-the-art Schreier FOM of 185.8dB, thus demonstrating that zoom ADCs can offer state-of-the-art performance and robustness in low-bandwidth high-precision applications.
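As a quick check of the quoted figure of merit, using the DR-based definition given with Fig. 14.5.6 and the reported DR, bandwidth, and power:

$$\text{FoM}_S = DR + 10\log_{10}\left(\frac{BW}{P}\right) = 120.3 + 10\log_{10}\left(\frac{1\,\text{kHz}}{280\,\mu\text{W}}\right) \approx 120.3 + 65.5 = 185.8\,\text{dB}$$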

### References:

- [1] Y.-S. Shu, et al., “An oversampling SAR ADC with DAC mismatch error shaping achieving 105 dB SFDR and 101 dB SNDR over 1 kHz BW in 55 nm CMOS,” *IEEE JSSC*, vol. 51, no. 12, pp. 2928–2940, Dec. 2016.
- [2] Y. Zhang, et al., “A Two-Capacitor SAR-Assisted Multi-Step Incremental ADC with a Single Amplifier Achieving 96.6 dB SNDR over 1.2 kHz BW,” *IEEE CICC*, April 2017.
- [3] M. Steiner and N. Greer, “A 22.3 b 1kHz 12.7 mW switched-capacitor  $\Delta\Sigma$  modulator with stacked split-steering amplifiers,” *ISSCC*, pp. 284–286, Feb. 2016.
- [4] Y. Chae, et al., “A 6.3  $\mu$ W 20-bit Incremental Zoom-ADC with 6 ppm INL and 1  $\mu$ V Offset,” *IEEE JSSC*, vol. 48, no. 12, pp. 3019–3027, Dec. 2013.
- [5] B. Gönen, et al., “A Dynamic Zoom ADC with 109-dB DR for Audio Applications,” *IEEE JSSC*, vol. 52, no. 6, pp. 1542–1550, June 2017.



Figure 14.5.1: Block diagram and timing diagram of the proposed zoom ADC.



Figure 14.5.2: Simplified single ended schematic and timing diagram of the SAR ADC.



Figure 14.5.3: Simplified single-ended schematic of the proposed zoom ADC and the OTAs.



Figure 14.5.4: Measured output spectrum of the zoom-ADC at peak SNDR and with zero input (inputs shorted).

Figure 14.5.5: SNR and SNDR versus input signal amplitude ( $f_{in} = 152\text{Hz}$ , DWA on).

|                          | This Work | [1]   | [2]   | [3]   | [4]   |
|--------------------------|-----------|-------|-------|-------|-------|
| Year                     | 2018      | 2016  | 2017  | 2016  | 2013  |
| Tech (nm)                | 160       | 55    | 180   | 350   | 160   |
| Area (mm <sup>2</sup> )  | 0.25      | 0.072 | 0.27  | 11.5  | 0.375 |
| Supply (V)               | 1.8       | 1.2   | 1.5   | 5.4   | 1.8   |
| Power ( $\mu\text{W}$ )  | 280       | 15.7  | 33.2  | 12700 | 6.3   |
| $F_{sample}$ (MHz)       | 2         | 1     | 0.64  | 0.64  | 0.05  |
| Bandwidth (kHz)          | 1         | 1     | 1.2   | 1     | 0.013 |
| Offset ( $\mu\text{V}$ ) | 30        | -     | -     | -     | 1     |
| SNR <sub>max</sub> (dB)  | 119.1     | 104   | 97.1  | -     | 119.8 |
| SNDR <sub>max</sub> (dB) | 118.1     | 101   | 96.6  | -     | -     |
| THD (dB)                 | -125.9    | -     | -     | -116  | -     |
| DR (dB)                  | 120.3     | 101.7 | 100.2 | 136.3 | 119.8 |
| FoM <sub>s</sub> ** (dB) | 185.8     | 179.7 | 175.8 | 185.3 | 182.7 |

\*\*FoM<sub>s</sub> = DR + 10log(BW/Power)

Figure 14.5.6: Performance summary and comparison with state of the art.



Figure 14.5.7: Chip micrograph.

## 14.6 A 0.4V 13b 270kS/s SAR-ISDM ADC with an Opamp-Less Time-Domain Integrator

Sung-En Hsieh, Chih-Cheng Hsieh

National Tsing Hua University, Hsinchu, Taiwan

With advanced DAC switching [1-3] and low-power comparator [4] techniques, the successive-approximation register (SAR) ADC demonstrates convincing performance with technology development for internet-of-everything (IoE) applications. However, the power efficiency and accuracy of SAR ADCs over 12b are limited by DAC mismatch and comparator noise requirements, which increase by 4 $\times$  for each additional 1b of resolution. Hybrid-SAR ADCs using sigma-delta modulators (SDMs) for fine conversion have been reported to reduce noise using oversampling and noise shaping operations, with a power penalty from the operational amplifier (opamp) required for integrator realization. An integrator using a passive summing technique was reported [5] without using an opamp; however, the resulting gain loss degraded the effective resolution. This work presents a SAR-ISDM ADC with an opamp-less time-domain integrator without gain loss to effectively achieve a 13b resolution at 0.4V supply. An INL splitting (INLS) DAC switching scheme is also developed to achieve the lowest reported switching energy and improve the DNL/INL performance by 4 $\times$ .

Figure 14.6.1 shows the architecture of the 13b ADC consisting of two sub-ADCs (AD1 and AD0), an on-chip  $V_{cm}$  reference generator, and a global control unit. For each sub-ADC, a 9b DAC with unit capacitance of 5.4fF and an incremental SDM (ISDM) with time-domain integrator are implemented for 10b coarse SAR conversion and 4b fine conversion, respectively. The voltage-controlled delay line (VCDL) technique [6] is applied to realize V-to-T conversion and time-domain signal accumulation. The semi-resting DAC operation [2] is adopted to achieve a double rail-to-rail input range (AD1 for  $2V_{dd} > V_{ip} - V_{in} > 0$ , AD0 for  $0 > V_{ip} - V_{in} > -2V_{dd}$ ) and 2 $\times$   $V_{LSB}$  for SNR improvement. After top-plate sampling of the input signal, one of the sub-ADCs is disabled for power reduction depending on signal polarity. In coarse conversion, the polarity of the top-plate voltage is detected through the voltage-controlled delay line (VCDL) and quantizer. In fine conversion, the feedback loop of the VCDL is enabled to realize a voltage-controlled oscillator (VCO) as a time-domain integrator for noise shaping of the ISDM operation. By using the same VCDL and quantizer for input level comparison in coarse and fine conversions, the inter-stage comparator offset mismatch is eliminated without the need for redundancy and calibration. The input range of 4b fine ISDM is 16LSB for 2b effective output data (4LSB) with a redundancy of  $\pm 6$ LSB, which provides a  $\pm 5.35$ -sigma tolerance of residue error induced from thermal noise (1 sigma = 1.12LSB). A redundant 16C in the DAC is implemented to extend a searching range of  $\pm 64$ LSB, which provides a  $\pm 3$ -sigma tolerance of mismatch between sub-ADCs (1 sigma = 21LSB) with foreground digital calibration. The global control unit and bottom-plate switching are realized by using dynamic logic and a local-boosting technique for power reduction and leakage control, respectively.

Figure 14.6.2 shows a 5b example of the proposed INL splitting (INLS) switching procedure. During the sampling phase, the bottom plates of DACP1/DACN0 and DACP0/DACN1 are reset to  $V_{dd}$  and  $V_{ss}$ , respectively. Since the conversion process is symmetrical, only the case of  $V_{ip} - V_{in} > 0$  (MSB=1) is illustrated for simplicity. For MSB=1 case, the bottom plates of DACP1/DACN1 are merged together to generate the required top-plate level shift ( $-V_{dd}$ ) for the MSB-1 decision (i.e. is  $V_{ip} - V_{in} > V_{dd}$ ?) without any switching energy consumption. For the MSB-2 decision, the bottom plates of DACP1/DACN1 are switched to  $V_{cm}/V_{dd}$  (MSB-1=1) or  $V_{dd}/V_{cm}$  (MSB-1=0) depending on the MSB-1 result to generate the required top-plate level shift ( $\pm 0.5V_{dd}$ ), also without consuming any switching energy. For MSB-3 to LSB decisions, the bottom plates of DACP1/DACN1 are monotonically switched down in sequence from  $V_{cm}/V_{dd}$  to  $V_{ss}/V_{cm}$  depending on the previous-bit result. With a half and single-ended bottom-plate voltage toggling swing, the resultant energy consumption is 1/8 $\times$  smaller compared to the  $V_{cm}$ -based scheme. Due to the set-and-down switching characteristic, the linearity performance of the INLS scheme is immune to  $V_{cm}$  reference accuracy as long as it is within the implemented redundancy range of  $\pm 12$ mV. Therefore, a simple on-chip diode-connected voltage divider is implemented as the  $V_{cm}$  generator, and the simulated DNL/INL levels show a consistent performance with a  $V_{cm}$  level deviation of  $\pm 12$ mV.

Figure 14.6.3 shows the switching energy and DNL/INL performance versus output code of 10b conversion examples using identical MSB capacitances with a mismatch sigma of 0.4% (from process datasheet) and the same input range ( $\pm V_{ref}$ ) for comparison purposes. With bottom-plate merging and level-shifting techniques, the INLS scheme omits the largest energy-consuming operations for the MSB-1 and MSB-2 decisions. The resulting average switching energy of the INLS is 15.87CV<sub>ref</sub><sup>2</sup>, which is only 37% of that of the semi-resting switching [2] and around 7-to-9% of that of other works. In the conventional DAC switching operation, the worst DNL level occurs at the MSB transition (512) due to the capacitor mismatch. In this work, the DNL peak at the MSB transition is from the mismatch of sub-ADCs instead of capacitors, which can be minimized to 0 by the mentioned redundancy implementation and digital foreground calibration. Moreover, since the top-plate levels for MSB-1 and MSB-2 decisions are also independent of the capacitance ratio, the worst DNL with INLS is split into 4 peaks occurring at the MSB-3 transitions (1/8, 3/8, 5/8, and 7/8 of full range). The results show that the DNL-INL peaks with the INLS scheme are only 62%-45% (~1.6 $\times$ -2.2 $\times$  better) of those with RSW [3] and 28%-25% (~3.5 $\times$ -4 $\times$  better) of those with the conventional operation. With the double  $V_{LSB}$  and improved DNL/INL performance, the sampling capacitance of the sub-ADC can be reduced to around 1/4 (kTC dominated) to 1/16 (mismatch dominated) of that of the conventional approach, which reduces the power consumption of signal and reference drivers.

Figure 14.6.4 shows the proposed opamp-less incremental-SDM (ISDM) with time-domain integrator. After SAR-based coarse conversion, the feedback loop of the VCDL is enabled (ien=1) as a VCO to implement the time-domain accumulation as an integrator and realize the 1<sup>st</sup>-order noise shaping function NTF(z)=1-z<sup>-1</sup>. First, the time difference ( $\Delta T=T_{outp}-T_{outn}$ ) corresponding to the residue input ( $V_{SAR\_residue}$ ) is detected by a one-bit quantizer. In the  $T_{outp}-T_{outn} > 0$  case, the quantizer output ( $Q_{out}=1$ ) is fed back to control the bottom plates of  $C_{isdm}$  for -1 $V_{ref}$  operation. At the same time, the  $\Delta T$  of the 1<sup>st</sup> cycle is fed back to the input of the VCO and accumulated with that of the next one. After 16 cycles of oscillation (OSR=16) with the corresponding  $\pm 1V_{ref}$  operations, a simple digital counter is implemented as a decimation filter to output 4b of data and realize the ISDM fine conversion. Using the opamp-less time-domain integrator without static power consumption and finite-gain error, the SAR-ISDM ADC effectively reduces the thermal and quantization noise by 4 (2b) and 16 (4b), respectively, to achieve a 13b resolution at a 0.4V supply.
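The loop behavior can be illustrated with a minimal first-order incremental modulator model. This is a sketch under assumptions: the VCO-based time-domain accumulation is abstracted as an ideal discrete-time accumulator, the residue is normalized so that the feedback step is  $\pm 1$ , and the function names are hypothetical. How the raw count is mapped onto the 4b output word is not modeled.

```python
def isdm_convert(residue, osr=16):
    """Run `osr` cycles of a 1st-order, 1-bit incremental modulator.

    residue is normalized to the feedback step, so it should lie in [-1, 1].
    Returns the number of cycles in which the quantizer output was 1.
    """
    acc = 0.0                      # stands in for the accumulated time difference
    count = 0                      # simple counter used as the decimation filter
    for _ in range(osr):
        acc += residue             # integrate the residue for one cycle
        q = 1 if acc > 0 else 0    # one-bit quantizer decision (Q_out)
        acc -= 1.0 if q else -1.0  # feedback: -1 step when Q_out=1, +1 otherwise
        count += q
    return count


def estimate_residue(count, osr=16):
    """Map the counter value back to a normalized residue estimate."""
    return (2 * count - osr) / osr


if __name__ == "__main__":
    for r in (-0.75, -0.2, 0.0, 0.3, 0.5):
        c = isdm_convert(r)
        print(r, c, estimate_residue(c))
```

In this model the accumulator stays bounded within one feedback step, so the estimate is within 1/OSR of the true residue after OSR cycles, which is the sense in which the counter acts as a decimation filter.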

A prototype chip is fabricated in 90nm CMOS with an active area of 0.0594mm<sup>2</sup> (0.54 $\times$ 0.11mm). Figure 14.6.5 shows the static and dynamic performances. The measured DNL and INL are +0.23/-0.45 and +0.75/-0.75 LSB, respectively. With a Nyquist-rate input, the measured SNDR, SFDR, and ENOB are 73.57dB, 89.9dB, and 11.93b, respectively. The measured total power consumption is 638nW with a distribution of 2% for sample and hold, 23% for DAC, 52% for ISDM, and 23% for digital control. Figure 14.6.6 shows the comparison table. With SAR-ISDM and INLS designs, the prototyped 13b ADC achieves a Schreier FoM (FoM<sub>S</sub>) of 186.82dB and a Walden FoM (FoM<sub>W</sub>) of 0.606fJ/c.-s. at 0.4V and 270kS/s.
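The reported figures of merit can be cross-checked from the measured numbers with the standard definitions. The sketch below assumes a Nyquist bandwidth of half the sample rate and the usual ENOB relation; the function names are illustrative only.

```python
import math


def enob(sndr_db):
    """Effective number of bits from SNDR (standard sine-wave relation)."""
    return (sndr_db - 1.76) / 6.02


def walden_fom(power_w, fs_hz, enob_bits):
    """Walden FoM in joules per conversion step."""
    return power_w / (2 ** enob_bits * fs_hz)


def schreier_fom(sndr_db, bw_hz, power_w):
    """Schreier FoM in dB, here based on SNDR."""
    return sndr_db + 10 * math.log10(bw_hz / power_w)


if __name__ == "__main__":
    power, fs, sndr = 638e-9, 270e3, 73.57      # values reported in the text
    bits = enob(sndr)
    print(f"ENOB  = {bits:.2f} b")
    print(f"FoM_W = {walden_fom(power, fs, bits) * 1e15:.3f} fJ/c.-s.")
    print(f"FoM_S = {schreier_fom(sndr, fs / 2, power):.2f} dB")
```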

### Acknowledgements:

The authors acknowledge the support of National Chip Implementation Center (CIC) Taiwan, and Signal Sensing and Application Lab. (SiSAL), EE, NTHU.

### References:

- [1] J.-Y. Tai, et al., "A 0.85fJ/conversion-step 10b 200kS/s subranging SAR ADC in 40nm CMOS," *ISSCC*, pp. 196–197, Feb. 2014.
- [2] S. E. Hsieh, et al., "A 0.44fJ/conversion-step 11b 600KS/s SAR ADC with semi-resting DAC," *IEEE Symp. VLSI Circuits*, pp. 1-2, June 2016.
- [3] Y. S. Hu, et al., "A 510nW 12-bit 200kS/s SAR-assisted SAR ADC using a re-switching technique," *IEEE Symp. VLSI Circuits*, pp. 238-239, June 2017.
- [4] P. Harpe, et al., "An oversampled 12/14b SAR ADC with noise reduction and linearity enhancements achieving up to 79.1dB SNDR," *ISSCC*, pp. 194-195, Feb. 2014.
- [5] Z. Chen, et al., "A 2nd order fully-passive noise-shaping SAR ADC with embedded passive gain," *IEEE A-SSCC*, pp. 309-312, June 2016.
- [6] C.-C. Kao, et al., "A 0.5V 12-bit SAR ADC using Adaptive Time-Domain Comparator with Noise Optimization," *IEEE A-SSCC*, in press.



Figure 14.6.1: Architecture of the proposed SAR-ISDM ADC.



Figure 14.6.2: A 5b conversion example of the INL splitting switching procedure.



Figure 14.6.3: Comparison of switching energy and DNL/INL performance versus output codes.



Figure 14.6.4: Proposed incremental sigma-delta modulator with time-domain integrator.



Figure 14.6.5: Measured static and dynamic performances.

|                               | ISSCC-14[1]   | VLSI-16[2]     | ISSCC-14[4]   | VLSI-17[3]    | This work      |
|-------------------------------|---------------|----------------|---------------|---------------|----------------|
| Technology                    | 40nm          | 90nm           | 65nm          | 40nm          | 90nm           |
| Supply Voltage(V)             | 0.45          | 0.3            | 0.8           | 0.7           | 0.4            |
| Ideal input swing             | $\pm V_{ref}$ | $\pm 2V_{ref}$ | $\pm V_{ref}$ | $\pm V_{ref}$ | $\pm 2V_{ref}$ |
| Sample rate (kS/s)            | 200           | 600            | 32            | 200           | 270            |
| Resolution (bit)              | 10            | 11             | 14            | 12            | 13             |
| DNL (LSB)                     | 0.44          | 0.63           | 1.75          | 0.45          | 0.45           |
| INL (LSB)                     | 0.45          | 0.72           | 3.50          | 0.79          | 0.75           |
| Power (nW)                    | 84            | 187            | 352           | 510           | 638            |
| ENOB (bit)                    | 8.95          | 9.46           | 11.29         | 11.19         | 11.93          |
| SFDR (dB)                     | 76.25         | 73.35          | 78.5          | 81.72         | 89.9           |
| FoM<sub>W</sub> (fJ/c.-s.)    | 0.85          | 0.44           | 4.4           | 1.1           | 0.606          |
| FoM<sub>S</sub> (dB)          | 176.4         | 180.8          | 176.3         | 182           | 186.82         |
| Active Area ( $\text{mm}^2$ ) | 0.0065        | 0.035          | 0.18          | 0.014         | 0.059          |

Figure 14.6.6: Performance summary and comparison table.



Figure 14.6.7: Chip micrograph.

## 14.7 A Signal-Independent Background-Calibrating 20b 1MS/s SAR ADC with 0.3ppm INL

Hongxing Li<sup>1</sup>, Mark Maddox<sup>1</sup>, Michael C. W. Coln<sup>1</sup>, William Buckley<sup>2</sup>, Derek Hummerston<sup>3</sup>, Naveed Naeem<sup>1</sup>

<sup>1</sup>Analog Devices, Wilmington, MA; <sup>2</sup>Analog Devices, Cork, Ireland

<sup>3</sup>Analog Devices, Newbury, United Kingdom

The SAR ADC is the architecture of choice for high-precision Nyquist ADCs ( $\geq 16b$ ) with MS/s speed. To achieve the required linearity performance, precision SAR ADCs require calibration to correct mismatch errors in the capacitive digital-to-analog converter (CDAC). One-time factory calibration suffers from aging, temperature sensitivity, and package and PCB stress while foreground calibration precludes continuous ADC operation. This paper presents a 20b 1MS/s SAR ADC with signal-independent background-calibration to address this challenge.

Figure 14.7.1 shows a double-conversion background-calibration technique in a SAR ADC. A single SAR ADC converts each analog sample twice, and an error  $E_{\text{r}}$  is modulated by  $m_i$  (1/0/-1) between the two conversions. The difference ( $\Delta = d_2 - d_1$ ) between the two results contains a modulated error, which is used by an error estimator to derive the error magnitude and store it as a coefficient for correcting the error. Similar to [1], because  $\Delta$  contains no signal component, the calibration convergence is very fast. This approach can be extended to multiple errors.

Shuffling a segmented MSB capacitor array [1] is one way to modulate MSB capacitor mismatch error; however, the number of unit capacitors increases exponentially with the number of binary bits and the calibration convergence speed is signal dependent. The perturbation-injection technique [2] can be used to calibrate deeper, but it has issues for a static input signal: 1) the applied perturbation and the conversion result can correlate and confuse the calibration engine; 2) the perturbation has to be large to exercise the capacitors sufficiently, thus adding significant extra conversion time.

Figure 14.7.2 details a different technique to solve the above issues. Instead of one big segmented capacitor array, it has multiple clusters of segmented capacitor arrays with a shuffler in each cluster. An additional shuffling [3] across the boundary of two adjacent clusters is included by a redundant complementary capacitor pair. For example, the  $\text{swap}_1$  signal controls one capacitor with a size of 16C in cluster 1, and its complementary signal controls 4 capacitors with the same total size of 16C in cluster 2. There is another similar swapping pair between cluster 2 and cluster 3. Independent of the B1~B7 (MSBs) bit trial results,  $\text{swap}_1$  and  $\text{swap}_2$  are randomly set, then flip to their complementary value after the first conversion to expose capacitor mismatch error. The selections of the actual “swapping capacitors” are randomized by shuffling.

To extend the calibration depth, an arbitrary number of clusters can be cascaded with a cost of 4 coefficients per binary bit (like cluster 2 in Fig. 14.7.2). The re-configuration of the CDAC after the first conversion only causes a small voltage change at  $V_{\text{top}}$  due to capacitor mismatch, thus the second conversion processing is greatly relaxed.

Equation (1) shows a matrix representation of the results from one conversion. In this equation,  $\Delta = d_2 - d_1$ ; the modulation vector is  $R = [m_1 \ m_2 \ \dots \ m_L]$ , where  $m_i$  is the random variable (1/0/-1) that modulates the  $i^{\text{th}}$  capacitor; and  $e = [E_1 \ E_2 \ \dots \ E_L]^T$  is the error vector, which represents the coefficient errors.

$$\Delta = R \cdot e \quad (1)$$

A digital iteration is then applied to estimate the new weight vector of the capacitors:

$$W_{\text{new}} = W_{\text{old}} - h \mu_e R^T \Delta \quad (2)$$

Matrix  $h$  represents weight normalization [1].  $\mu_e$  is a constant number  $< 1$ . Because  $W = W_{\text{exact}} + e$ , it can be derived:

$$e_{\text{new}} = (I - h \mu_e R^T R) \cdot e_{\text{old}} \quad (3)$$

Let the matrix  $C = R^T R$ . Because each element in the  $C$  matrix has an expected value determined by the correlation of two random variables or by auto-correlation, the  $C$  matrix has a statistical expected value of  $\hat{C}$ . Assuming the initial error vector is  $e_0$ , after  $k$  iterations, the expected value of  $e_k$  is:

$$\hat{e}_k = (I - h \mu_e \hat{C})^k \cdot e_0 \quad (4)$$

The criterion for the error vector to converge to 0 is  $(I - h \mu_e \hat{C})^k \rightarrow 0$  as  $k$  grows, which is true in this case. Since the  $R$  sequence and  $\hat{C}$  are determined by the shuffling sequence and the swapping signals, the error vector will follow a convergence curve regardless of the input signal. A proper  $\mu_e$  is chosen to balance calibration speed and calibration noise (the larger  $\mu_e$ , the faster the calibration but the higher the calibration noise).
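The iteration in Eqs. (1) to (3) can be sketched numerically as follows. This is an illustration under simplifying assumptions, not the calibration engine itself: the normalization matrix  $h$  is taken as the identity, the modulation values  $m_i$  are drawn independently and uniformly from {-1, 0, 1} rather than from the shuffling and swapping sequence, measurement noise is omitted, and the error vector is tracked directly (equivalent to tracking  $W$ , since  $W = W_{\text{exact}} + e$ ).

```python
import math
import random


def rms(values):
    """Root-mean-square of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def calibrate(e0, mu=0.1, iterations=1000, seed=0):
    """Iterate e_new = (I - mu * R^T R) e_old with random modulation vectors.

    Returns the final error vector and the rms error after each iteration.
    """
    rng = random.Random(seed)                  # fixed seed for repeatability
    e = list(e0)
    history = []
    for _ in range(iterations):
        r = [rng.choice((-1, 0, 1)) for _ in e]             # modulation vector R
        delta = sum(m * err for m, err in zip(r, e))         # Eq. (1): delta = R . e
        e = [err - mu * m * delta for m, err in zip(r, e)]   # Eqs. (2)/(3), h = I
        history.append(rms(e))
    return e, history


if __name__ == "__main__":
    e0 = [1e-3, -2e-3, 5e-4, 1.5e-3]           # arbitrary initial coefficient errors
    _, hist = calibrate(e0)
    print(f"rms error: start {rms(e0):.3e}, end {hist[-1]:.3e}")
```

Note that the input signal never appears in the loop, which mirrors the point that  $\Delta$  carries no signal component. Raising `mu` shortens the settling time in this model; the accompanying increase in calibration noise would only become visible if noise were added to `delta`.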

The ADC core, which implements this technique, is shown in Fig. 14.7.3. The framework is a pipelined SAR ADC architecture [3] to extend the acquisition time. The first stage has 11b resolution and the CDAC1 structure is similar to Fig. 14.7.2. The value of  $C_{fb}$  is chosen to provide a residue gain of 32. A mini-ADC is included to resolve the first 4 MSBs. Once the first 4 MSBs are decided, the SAR logic loads them to CDAC1, and continues with the remaining bit trials in the first stage while switch S1 is opened and the residue amplifier (RA) is in auto-zero mode. After all CDAC1 bits are decided, switch S1 is closed to connect CDAC1 with RA, and generate the first residue voltage, which is sampled by ADC2a. Following the ADC2a sampling, CDAC1 is re-configured and ADC2b is connected to RA sampling the second residue voltage. Shortly after this, ADC2a and ADC2b start their conversions. ADC2a/ADC2b have 11b effective resolution, and their first 2 MSBs overlap with CDAC1 to provide sufficient redundancy. These stages only need to be calibrated once to compensate for initial manufacturing tolerances. The second residue voltage generation (ADC2b sampling) costs about 11% of the total bit trial process time, or about 7% of one clock period, which is a relatively minor speed penalty.

The background calibration is all digital and has been implemented in an FPGA. The evaluation analysis is performed in Matlab. Polynomial correction is used to correct the static low-order error due to various non-mismatch effects, such as capacitor voltage coefficients, signal-dependent sampling error, etc. The polynomial coefficients are extracted by measuring the harmonics in the frequency domain, and then used to correct distortion on the calibrated ADC output: correction =  $\alpha_1 D + \alpha_2 D^2 + \alpha_3 D^3 + \alpha_5 D^5 + \alpha_7 D^7$ , where  $D$  is normalized to [-1, 1].
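The correction term can be evaluated as in the short sketch below. The code-to-$D$ mapping (an unsigned 20b code scaled linearly to [-1, 1]) and the coefficient values are assumptions made for illustration; the text does not state how the term is applied to the output, so only the term itself is computed.

```python
def normalize_code(code, n_bits=20):
    """Map an unsigned n-bit code linearly onto [-1, 1] (assumed mapping)."""
    return 2 * code / (2 ** n_bits - 1) - 1


def polynomial_correction(d, alphas):
    """Evaluate sum(alpha_p * d**p) for the powers present in `alphas`.

    alphas maps each power (1, 2, 3, 5, 7 in the text) to its coefficient.
    """
    return sum(a * d ** p for p, a in alphas.items())


if __name__ == "__main__":
    # Placeholder coefficients, not measured values
    alphas = {1: 0.0, 2: 0.0, 3: 1e-6, 5: 0.0, 7: 0.0}
    d = normalize_code(786432)
    print(d, polynomial_correction(d, alphas))
```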

The ADC core is implemented in a  $0.18\mu\text{m}/0.5\mu\text{m}$  CMOS technology. It uses a 1.8V and a 5V power supply, along with a 2.5–5V reference voltage. The total area of the ADC core is  $4\text{mm}^2$ , excluding the digital calibration engine. At 1MS/s and VREF=5V, DR is 102.7dB, SNDR is 101.5dB including signal and reference noise. The ADC core consumes 12.9mW total and the synthesized power for the calibration engine is 6.8mW, which predicts 175.5dB FOM (SNDR +  $10 \log(BW/\text{power})$ ). For comparison, [4] is a 13b SAR ADC with FOM of 173.4dB and signal-dependent background calibration; [5] is a 14b split SAR ADC with FOM of 169.6dB.

The background calibration convergence performance is shown in Fig. 14.7.4. The coefficients are initialized to 0 (uncalibrated) at time 0. The rms value of the coefficient errors quickly settles to 0.25ppm within 100k samples. The convergence time constant is 16k samples regardless of the input signal.

Figure 14.7.5 shows the INL performance using a 2.5V reference to demonstrate the effectiveness of background calibration. In the upper plots, calibration is first performed at  $27^\circ\text{C}$ , then the temperature is changed to  $85^\circ\text{C}$  with the calibration disabled (frozen) to exhibit the mismatch shifts. Linearity is restored when the calibration is re-enabled. The INL S-shape temperature drift is mainly due to capacitor voltage coefficient temperature drift. Figure 14.7.6 shows a typical FFT spectrum and SFDR/SNDR at 5V reference.

### Acknowledgements:

The authors would like to thank Kam Mistry, Mick Mueck, Geng Zheng, Jianping Gong, and the ADI precision converter team.

### References:

- [1] J. McNeill, et al., “All-Digital Background Calibration of a Successive Approximation ADC Using the ‘Split ADC’ Architecture,” *IEEE TCAS-I*, vol. 58, no. 10, pp. 2355–2365, Oct. 2011.
- [2] W. Liu, et al., “A 12-bit, 45-MS/s, 3-mW Redundant Successive-Approximation-Register Analog-to-Digital Converter With Digital Calibration,” *IEEE JSSC*, vol. 46, no. 11, pp. 2661–2672, Nov. 2011.
- [3] C. P. Hurrell, et al., “An 18b 12.5MHz ADC with 93dB SNR,” *ISSCC*, pp. 378–379, Feb. 2010.
- [4] M. Ding, et al., “A 5.5fJ/conv-step 6.4MS/s 13b SAR ADC Utilizing a Redundancy-Facilitated Background Error-Detection-and-Correction Scheme,” *ISSCC*, pp. 460–461, Feb. 2015.
- [5] H. Xu, et al., “A 78.5dB-SNDR Radiation- and Metastability-Tolerant Two-Step Split SAR ADC Operating up to 75MS/s with 24.9mW Power Consumption in 65nm CMOS,” *ISSCC*, pp. 476–477, Feb. 2017.



Figure 14.7.1: Double conversion background calibration.



Figure 14.7.2: Simplified CDAC diagram (differential half-circuit shown, sampling switch not shown).



Figure 14.7.3: Simplified ADC top-level diagram (differential half-circuit shown).



Figure 14.7.4: Calibration convergence performance.



Figure 14.7.5: INL at 1MS/s (VREF=2.5V, fin=23.5Hz). Fixed polynomial correction coefficient.



Figure 14.7.6: ADC AC performance at 1MS/s (VREF=5V, Signal=-1dBFS). Fixed polynomial correction coefficient.



Figure 14.7.7: Die micrograph of ADC core.

# Session 15 Overview: *RF PLLs*

## RF SUBCOMMITTEE



**Session Chair:**  
**Jiayoon Ru**  
*Broadcom, Irvine, CA*



**Associate Chair:**  
**Jaehyouk Choi**  
*Ulsan National Institute of Science and Technology,  
 Ulsan, Korea*

**Subcommittee Chair:** *Piet Wambacq, imec, Belgium*

This session presents the latest advances in digital and analog PLLs generating frequencies from 1 to 100GHz and covering diverse topics, such as ultra-low-power ADPLLs, FMCW synthesizers, fast-settling bang-bang PLLs, interference-coupling mitigation, type-I sampling PLLs, and mm-wave ADPLLs.



1:30 PM

### 15.1 A 0.98mW Fractional-N ADPLL Using 10b Isolated Constant-Slope DTC with FOM of -246dB for IoT Applications in 65nm CMOS

*H. Liu, Tokyo Institute of Technology, Tokyo, Japan*

In Paper 15.1, Tokyo Institute of Technology presents an ultra-low-power all-digital fractional-N PLL in 65nm CMOS. An rms jitter of 0.53ps is achieved at 2.44GHz with a power consumption of 0.98mW. The corresponding FOM is -246dB.



2:00 PM

### 15.2 A 23GHz Low-Phase-Noise Digital Bang-Bang PLL for Fast Triangular and Saw-Tooth Chirp Modulation

*D. Cherniak, Infineon Technologies, Villach, Austria and Politecnico di Milano, Milan, Italy*

In Paper 15.2, Infineon Technologies describes a 20-to-24GHz digital bang-bang PLL in 65nm CMOS for mm-wave FMCW radars. The 19.7mW fractional-N PLL, having 213fs jitter and a -58dBc fractional spur, synthesizes fast chirps with a 173MHz/μs slope and an idle time of 200ns.



2:30 PM

### 15.3 A 36.3-to-38.2GHz -216dBc/Hz<sup>2</sup> 40nm CMOS Fractional-N FMCW Chirp Synthesizer PLL with a Continuous-Time Bandpass Delta-Sigma Time-to-Digital Converter

*D. Weyer, University of Michigan, Ann Arbor, MI*

In Paper 15.3, the University of Michigan describes a 38GHz digital fractional-N FMCW chirp synthesizer using a bandpass delta-sigma TDC to enable >1MHz PLL bandwidths with an in-band PN of -85dBc/Hz at a 100kHz offset. The chip fabricated in 40nm CMOS generates a 9MHz/μs-slope and 824kHz<sub>rms</sub>-error triangular-chirp signal with a 500MHz bandwidth, while consuming 68mW.



3:15 PM

### 15.4 A Low-Phase-Noise Digital Bang-Bang PLL with Fast Lock Over a Wide Lock Range

*L. Bertulessi, Politecnico di Milano, Milan, Italy*

In Paper 15.4, Politecnico di Milano presents a 3.7-to-4.1GHz digital bang-bang PLL adopting a background frequency-aid technique. Over a 364MHz frequency hop, the 65nm-CMOS fractional-N PLL has coarse (within 10MHz) settling time of 5.6 $\mu$ s and fine (within 364kHz) settling time of 115 $\mu$ s. The spot phase noise is -150.7dBc/Hz at 20MHz offset. The rms jitter is 183fs with a current consumption of 4.4mA, leading to an FOM of -247.5dB.



3:45 PM

### 15.5 A Digital Frequency Synthesizer with Dither-Assisted Pulling Mitigation for Simultaneous DCO and Reference Path Coupling

*C-R. Ho, University of Southern California, Los Angeles, CA*

In Paper 15.5, the University of Southern California presents a 3-to-5GHz digital frequency synthesizer to mitigate simultaneous interference coupling to DCO and reference paths. Implemented in 65nm CMOS and consuming 21.3mW, the proposed PLL mitigates the increase in the noise floor and spectral spurs due to PA and oscillator mutual pulling by 12 and 22.5dB, respectively.



4:15 PM

### 15.6 A 0.01mm<sup>2</sup> 4.6-to-5.6GHz Sub-Sampling Type-I Frequency Synthesizer with -254dB FOM

*A. Sharkia, University of British Columbia, Vancouver, Canada*

In Paper 15.6, the University of British Columbia describes a frequency synthesizer, which comprises an all-digital frequency-locked loop and a type-I integer-N PLL with an LC VCO. The synthesizer occupies a 0.01mm<sup>2</sup> area in 65nm CMOS and achieves 185fs<sub>rms</sub> jitter while consuming 1.1mW, corresponding to an FOM of -254dB across 4.6 to 5.6GHz.




4:30 PM

### 15.7 A Dividerless Reference-Sampling RF PLL with -253.5dB Jitter FOM and <-67dBc Reference Spur

*J. Sharma, Columbia University, New York, NY*

In Paper 15.7, Columbia University describes a dividerless PLL architecture, i.e. a reference-sampling PLL. The 65nm-CMOS integer-N PLL achieves an rms jitter of 110fs at 2.55GHz, while consuming 3.5mW. The corresponding FOM is -253.5dB, and the reference spur is lower than -67dBc.



4:45 PM

### 15.8 An 82-to-108GHz -181dB-FOM<sub>T</sub> ADPLL Employing a DCO with Split-Transformer and Dual-Path Switched-Capacitor Ladder and a Clock-Skew-Sampling Delta-Sigma TDC

*Z. Huang, HKUST, Hong Kong, China*

In Paper 15.8, Hong Kong University of Science and Technology presents a W-band ADPLL employing a wide-band DCO and a delta-sigma TDC. The 65nm-CMOS integer-N PLL achieves a frequency-tuning range from 82 to 108GHz and 10MHz phase noise between -106 and -110dBc/Hz, while consuming 35.5mW.

## 15.1 A 0.98mW Fractional-N ADPLL Using 10b Isolated Constant-Slope DTC with FOM of -246dB for IoT Applications in 65nm CMOS

Hanli Liu, Dexian Tang, Zheng Sun, Wei Deng, Huy Cu Ngo,  
Kenichi Okada, Akira Matsuzawa

Tokyo Institute of Technology, Tokyo, Japan

In a world that has become increasingly connected by the Internet, ultra-low-power (ULP) transceivers (TRX) will be key elements in a variety of short-range network applications. The RF PLL in a TRX consumes a significant amount of power because of its phase-noise and spur requirements. Compared with analog PLLs, an ADPLL is more advantageous in nm-CMOS technologies [1-6]. This paper presents a 2.0-to-2.8GHz 653 $\mu$ W fractional-N ADPLL that achieves a -242dB FOM in 65nm CMOS for 2.4GHz ISM-band applications. The best power-jitter trade-off is achieved at 981 $\mu$ W with the reference doubler enabled, giving 535fs jitter and a -56dBc in-band fractional spur, which corresponds to an FOM of -246dB. Thanks to the proposed 10b isolated constant-slope DTC, this ADPLL breaks the -240dB FOM barrier of sub-mW fractional-N ADPLLs.

Figure 15.1.1 shows the ADPLL architecture. For low-power operation, a 4b narrow-range TDC covers only a  $\pm 0.08\pi$  rad phase error instead of the  $2\pi$  rad covered by a full-range TDC. The remaining phase errors from the  $\Delta\Sigma$ -modulator (DSM) and the multi-modulus divider (MMDIV) are covered by the 10b digital-to-time converter (DTC). The DTC consumes much less power while providing better time resolution than a full-range TDC [4]. When the DTC time resolution is smaller than the TDC time resolution, a 1<sup>st</sup>-order DSM is applied to suppress the in-band phase-noise contribution. The important features of adopting the 1<sup>st</sup>-order DSM are: (1) both the delay range and the power consumption of the DTC become half that of the 2<sup>nd</sup>-order case; and (2) the in-band noise level remains the same as in the integer case. In a DTC-based ADPLL architecture, a low-jitter and highly linear DTC is crucial for jitter performance [1-5]. In [1], a sub-mW ADPLL is realized, but with poor jitter and spur performance due to the strong nonlinearity of a ULP DTC. In [2], phase dithering is used to mitigate the spur, but this sacrifices integrated jitter and FOM performance. In this work, we propose an isolated constant-slope DTC architecture realizing 10b 580fs-resolution ULP operation with high linearity, which can break the trade-off between power and jitter/linearity. A reference doubler is used to boost the effective TDC resolution and lower the divide ratio. The gain of a time amplifier (TA) [4] is calibrated by a gain-and-offset calibration technique to boost the delay-line TDC resolution to 2ps, which ideally corresponds to -110dBc/Hz in-band phase noise for a 52MHz reference. A loop-latency compensation technique achieves a 0.25T<sub>REF</sub> latency, which eliminates jitter peaking. A coarse PLL loop is employed with a  $\pm 64\mu$ s dead-zone to enhance the PLL pull-in range. This loop is always on while consuming almost zero power after locking.

Figure 15.1.2 shows the isolated constant-slope DTC. The constant-slope charging method [7] offers a fundamental linearity advantage over the variable-slope method [1-5]. In the original constant-slope DTC, the start voltage V<sub>ST</sub> is pre-charged on a large C<sub>load</sub> by a digital-to-analog converter (DAC). However, this C<sub>load</sub> must be large to suppress the current-source noise, which results in large power consumption. In addition, this large C<sub>load</sub> causes linearity degradation. As shown for the conventional DTC in Fig. 15.1.2, DAC code1 and code2 lead to different steady-state voltages, and an insufficient conversion time causes a voltage error from the ideal steady-state voltage, which results in delay errors, i.e. DNL1 and DNL2. Due to the nonlinear RC response, DNL1 is larger than DNL2, which cannot be compensated by a gain calibration. The DTC is a main jitter contributor of the ADPLL, with 93% of the DTC jitter coming from the current source. Unfortunately, in a ULP design, increasing C<sub>load</sub> to minimize the jitter leads to a potential degradation of linearity due to the DAC settling behavior. The DAC contributes only 3% of the DTC jitter, yet it consumes 82% of the total DTC power. This fundamental trade-off between jitter and power cannot be avoided by the conventional DTC. We propose a way to utilize the constant-slope method while completely isolating the large C<sub>load</sub> during the pre-charge step by means of timing-controlled switches. The DAC only needs to charge a capacitance 10 times smaller than C<sub>load</sub>, which results in a roughly 10 times smaller DAC power consumption. This corresponds to a 74% total power saving compared with the conventional constant-slope DTC and also realizes better linearity than the conventional DTC. The jitter penalty from the switching operation is negligible. Hence, by simply increasing the current and C<sub>load</sub>, the jitter of the proposed DTC and the conventional DTC is reduced by the same amount.
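The quoted 74% saving can be cross-checked with a short calculation. The sketch below assumes that only the DAC power scales with the capacitance it charges and that the rest of the DTC power is unchanged; the function name is illustrative and not taken from the paper.

```python
# Figures quoted in the text for the conventional constant-slope DTC.
dac_power_share = 0.82   # DAC share of the total DTC power
cap_reduction = 10.0     # the DAC charges a ~10x smaller capacitance


def dtc_power_saving(dac_share: float, reduction: float) -> float:
    """Fraction of total DTC power saved if only the DAC power shrinks by 1/reduction.

    Assumption: the non-DAC part of the DTC power stays the same.
    """
    return dac_share * (1.0 - 1.0 / reduction)


saving = dtc_power_saving(dac_power_share, cap_reduction)
print(f"estimated DTC power saving: {saving:.1%}")
```

Under these assumptions the estimate comes out just under 74%, in line with the figure stated above.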

Figure 15.1.3 shows the three conversion steps of the DTC. At the pre-charge step, SW1 is opened to isolate the DAC from C<sub>load</sub> and minimize the settling time. SW2 is closed to short the inverter input and output to set node B to the inverter V<sub>TH</sub>, which is 500mV in this design. SW3 is also closed to let the DAC charge node A to the code-controlled voltage, e.g., 600mV. At the set step, SW1 and SW4 are closed to short node A to ground. When node A becomes 0V, node B becomes 500mV-600mV=-100mV. At the compare step, SW5 is triggered by the input from MMDIV, which generates a ramp signal through a current source charging C<sub>load</sub>. Node A is charged normally from 0V while node B is charged from -100mV. When the node B voltage reaches V<sub>TH</sub> of the inverter, a falling edge is generated and becomes a positive edge after the next-stage inverter. The delay time ( $\Delta T$ ) is generated from the ramp start of node B to V<sub>TH</sub> of the inverter. This delay time is varied by the DAC output voltage and the resolution is decided by the DAC resolution. The only potential penalty of the DTC is the negative node-B voltage, which causes a charge-leakage issue for SW2. This is mitigated by shortening the hold time in  $\phi_2$ . A charge-injection issue is also carefully considered to minimize linearity degradation. In this design, a 10b resistor DAC is used, which consumes only 30 $\mu$ W in simulation without settling issues. Since the current source for generating the ramp signal is also isolated from the code-dependent voltage node, it can maintain higher linearity. A higher supply voltage is not required to maintain the current-source linearity, and the DTC, as well as other digital blocks, can operate with only a 1V supply. The simulated INL of the DTC is +0.2/-0.14ps over a 610ps range. The measured INL is +0.87/-0.86ps in 10b with 580fs/LSB resolution at 52MS/s, which corresponds to an effective resolution of 9.4b (0.15%). The measured worst rms jitter at the maximum delay is under 630fs and is well below the TDC quantization noise.
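The compare step can be summarized with an idealized model: node B starts one DAC voltage below V<sub>TH</sub> and ramps at a constant slope, so the delay is proportional to the DAC voltage. The sketch below assumes an ideal current source and capacitor; `C_LOAD` and `I_RAMP` are hypothetical values, not figures from the paper. The second half shows one reading that reproduces the quoted effective resolution, taking the ratio of the full-scale delay range to the peak INL.

```python
import math


def dtc_delay(v_dac: float, c_load: float, i_ramp: float) -> float:
    """Ideal compare-step delay in seconds.

    Node B ramps from (V_TH - v_dac) up to V_TH, so it must cover v_dac volts
    at a constant slope of i_ramp / c_load.
    """
    return c_load * v_dac / i_ramp


# Hypothetical component values (assumed), chosen so that 600 mV maps to 600 ps.
C_LOAD = 1e-12   # farads
I_RAMP = 1e-3    # amperes
example_delay = dtc_delay(0.6, C_LOAD, I_RAMP)

# Figures quoted in the text: 10b, 580 fs/LSB, +0.87 ps peak INL.
n_bits, lsb, inl_peak = 10, 580e-15, 0.87e-12
full_scale = (2 ** n_bits) * lsb                   # full-scale delay range
inl_fraction = inl_peak / full_scale               # peak INL relative to full scale
effective_bits = math.log2(full_scale / inl_peak)  # assumed definition

print(f"delay for 600 mV: {example_delay * 1e12:.0f} ps")
print(f"full scale: {full_scale * 1e12:.1f} ps, INL: {inl_fraction:.2%}, "
      f"effective resolution: {effective_bits:.1f} b")
```

With this definition the result rounds to the 9.4b and 0.15% stated above, although the paper does not spell out how those figures were derived.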

Figure 15.1.4 shows the conventional and the proposed gain calibration of the TA. In the conventional gain calibration, delay blocks  $\Delta\tau$  and G<sub>Desire</sub> $\times\Delta\tau$  are inserted at the input and output nodes, respectively. If the TA gain is as desired, the BBPD produces 0 on average. Otherwise, the gain G<sub>TA</sub> increases or decreases according to the error sign. However, the TA also suffers from a time offset mainly caused by process mismatch. The time offset  $\epsilon_{TA}$  appears at the output node and is added to the G<sub>Desire</sub> $\times\Delta\tau$ , which produces G<sub>err</sub> after the gain calibration. This error becomes  $\pm 64.25\%$  according to a Monte-Carlo simulation. In the gain-and-offset calibration technique, the time offset is first calibrated. A calibration signal is distributed to the two TA inputs at the same time. The time difference is detected by the BBPD, and the loading capacitor bank is controlled depending on the detected error to cancel the time offset. After this one-time time-offset calibration, the remaining gain error is calibrated by G<sub>TA</sub>, and the total gain error can be controlled within  $\pm 6.25\%$ . The remaining gain error is compensated by the TDC gain normalization (K<sub>TDC</sub>) running in the background.
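The effect of an uncalibrated time offset on a bang-bang gain loop can be illustrated with a toy model. This is a minimal sketch, not the authors' circuit: it assumes the offset simply adds to the TA output delay, and all numeric values and the function name `calibrate_gain` are hypothetical.

```python
def calibrate_gain(g_desired: float, d_tau: float, offset: float,
                   g_init: float = 1.0, step: float = 0.01,
                   n_iter: int = 5000) -> float:
    """Bang-bang gain loop.

    Each cycle compares the TA output delay (gain * d_tau + offset) with the
    reference delay g_desired * d_tau and moves the gain by one step according
    to the sign of the difference, as a BBPD would.
    """
    gain = g_init
    for _ in range(n_iter):
        error = (gain * d_tau + offset) - g_desired * d_tau
        if error > 0:
            gain -= step
        else:
            gain += step
    return gain


G_DESIRED = 8.0     # assumed target gain
D_TAU = 10e-12      # assumed test delay, seconds
OFFSET = 8e-12      # assumed TA time offset, seconds

g_with_offset = calibrate_gain(G_DESIRED, D_TAU, OFFSET)
g_offset_cancelled = calibrate_gain(G_DESIRED, D_TAU, 0.0)

print(f"gain error with offset:      {(g_with_offset - G_DESIRED) / G_DESIRED:+.1%}")
print(f"gain error after offset cal: {(g_offset_cancelled - G_DESIRED) / G_DESIRED:+.1%}")
```

In this model the loop settles where the gain error equals the offset divided by G<sub>Desire</sub>×Δτ, which is why cancelling the offset first leaves only the dither of the loop itself.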

This ADPLL occupies an active area of 0.23mm<sup>2</sup> in 65nm CMOS. The die micrograph is shown in Fig. 15.1.7. Figure 15.1.5 shows the measurement results of the ADPLL. With the reference doubler bypassed, the ADPLL achieves 1.00ps integrated jitter while consuming 653 $\mu$ W, a 6dB improvement in integrated jitter compared with the previous 673 $\mu$ W fractional-N ADPLL [2]. Figure 15.1.6 shows a performance comparison of the proposed ADPLL with prior fractional-N ADPLLs. The low-power operation and good jitter performance of the proposed ADPLL are favorable for BLE, Zigbee, and WPAN/WBAN applications. With the reference doubler enabled, a lower jitter of 535fs is achieved while consuming 981 $\mu$ W. Thanks to the 10b linear DTC, a -108dBc/Hz in-band phase noise and a -56dBc in-band fractional spur are realized. A -246dB FOM is achieved, which breaks the -240dB barrier for prior sub-mW ADPLLs in Fig. 15.1.6 by 6dB. The proposed ADPLL can be applied to the IEEE 802.11b/g standard.
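The two FOM values follow from the measured jitter and power if the commonly used jitter-power figure of merit is assumed, FOM = 10·log10((σ/1s)²·(P/1mW)). The digest does not state the definition explicitly, so the sketch below treats it as an assumption.

```python
import math


def jitter_fom_db(jitter_rms_s: float, power_w: float) -> float:
    """Jitter-power FOM in dB: 10*log10((sigma / 1 s)^2 * (P / 1 mW))."""
    return 10.0 * math.log10(jitter_rms_s ** 2 * (power_w / 1e-3))


fom_doubler_on = jitter_fom_db(535e-15, 981e-6)    # reference doubler enabled
fom_doubler_off = jitter_fom_db(1.00e-12, 653e-6)  # reference doubler bypassed

print(f"doubler enabled:  {fom_doubler_on:.1f} dB")
print(f"doubler bypassed: {fom_doubler_off:.1f} dB")
```

Rounded to the nearest dB, these match the -246dB and -242dB quoted in the text.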

### Acknowledgments:

This paper is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).

### References:

- [1] V. Chillara, et al., "An 860 $\mu$ W 2.1-to-2.7GHz All-Digital PLL-Based Frequency Modulator with a DTC-Assisted Snapshot TDC for WPAN (Bluetooth Smart and ZigBee) Applications," *ISSCC*, pp. 172-173, Feb. 2014.
- [2] Y.-M. He, et al., "A 673 $\mu$ W 1.8-to-2.5GHz Dividerless Fractional-N Digital PLL with an Inherent Frequency-Capture Capability and a Phase-Dithering Spur Mitigation for IoT Applications," *ISSCC*, pp. 420-421, Feb. 2017.
- [3] X. Gao, et al., "A 2.7-to-4.3GHz, 0.16ps RMS-Jitter, -246.8dB-FOM, Digital Fractional-N Sampling PLL in 28nm CMOS," *ISSCC*, pp. 174-175, Feb. 2016.
- [4] A. Elkholy, et al., "A 3.7mW Low-Noise Wide-Bandwidth 4.5 GHz Digital Fractional-N PLL Using Time Amplifier-Based TDC," *IEEE JSSC*, vol. 50, no. 4, pp. 867-881, Apr. 2015.
- [5] D. Tasca, et al., "A 2.9-4.0GHz Fractional-N Digital PLL With Bang-Bang Phase Detector and 560fs<sub>rms</sub> Jitter at 4.5mW Power," *IEEE JSSC*, vol. 46, no. 12, pp. 2745-2758, Dec. 2011.
- [6] F.-W. Kuo, et al., "A 0.5V 1.6mW 2.4GHz Fractional-N All-Digital PLL for Bluetooth LE with PVT-Insensitive TDC Using Switched-Capacitor Doubler in 28nm CMOS," *IEEE Symp. VLSI Circuits*, pp. 178-179, June 2017.
- [7] J. Ru, et al., "A High-Linearity Digital-to-Time Converter Technique: Constant-Slope Charging," *IEEE JSSC*, vol. 50, no. 6, pp. 1412-1423, June 2015.





Figure 15.1.7: Die micrograph.

## 15.2 A 23GHz Low-Phase-Noise Digital Bang-Bang PLL for Fast Triangular and Saw-Tooth Chirp Modulation

Dmytro Cherniak<sup>1,2</sup>, Luigi Grimaldi<sup>2</sup>, Luca Bertulessi<sup>2</sup>, Carlo Samori<sup>2</sup>, Roberto Nonis<sup>1</sup>, Salvatore Levantino<sup>2</sup>

<sup>1</sup>Infineon Technologies, Villach, Austria, <sup>2</sup>Politecnico di Milano, Milan, Italy

Frequency-modulated continuous-wave (FMCW) radars with high resolution require the generation of low-phase-noise, low-spur, and highly linear chirp signals with a large peak-to-peak value (chirp bandwidth) and a short modulation period [1]. In radar systems, the spot phase noise of the chirp generator is converted to the intermediate frequency of the receiver, making it difficult to detect two close targets, while spurs cause the detection of false targets. For those reasons, medium-range radar applications in the 77-to-81GHz band typically specify spot phase noise lower than -90dBc/Hz at 1MHz offset and a spur level below -50dBc. Unlike triangular chirps, saw-tooth chirps allow for a reduced dead time for range detection. However, any practical modulator needs a finite time (idle time) to make a large frequency jump at the end of the saw-tooth, and this limits the duty cycle of the saw-tooth. For instance, a fast saw-tooth chirp with a 200kHz rate and 95% duty cycle leaves an idle time of only 250ns. Fractional-N PLLs can be used as chirp modulators. Unfortunately, low phase noise and spur levels require a narrow PLL bandwidth, while a short idle time demands a wide one. The two-point injection of the modulation signal, both from the modulus control of the divider and the tuning input of the voltage-controlled oscillator (VCO), is a known method to simultaneously achieve a narrow PLL bandwidth and fast modulation. However, even in that scheme, the frequency-modulation error is mainly limited by the gain mismatch between the two injection paths and by the linearity of the VCO [2]. In this work, a 20-to-24GHz digital bang-bang PLL, which uses the two-point modulation scheme to generate triangular and saw-tooth chirp signals, is presented. Unlike previous works [1-4], this architecture is able to generate fast saw-tooth chirps with a slope of up to 173MHz/μs, an idle time below 200ns, and an rms frequency error better than 0.06%. The gain mismatch between the two modulation paths is automatically calibrated by a digital algorithm [5], and the input of the digitally controlled oscillator (DCO) is pre-distorted via an automatic background correction scheme, which compensates for the DCO nonlinearity.
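The idle-time example in the paragraph above is a one-line calculation: the idle time is the fraction of the chirp period not used by the ramp. The helper name below is illustrative.

```python
def sawtooth_idle_time(chirp_rate_hz: float, duty_cycle: float) -> float:
    """Idle time per saw-tooth period, in seconds."""
    period = 1.0 / chirp_rate_hz
    return period * (1.0 - duty_cycle)


idle = sawtooth_idle_time(200e3, 0.95)
print(f"idle time: {idle * 1e9:.0f} ns")
```

A 200kHz chirp rate gives a 5μs period, and 5% of that is the 250ns stated in the text.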

In the conventional two-point injection scheme, such as that shown in Fig. 15.2.1, the high-frequency components of the modulation signal go through the VCO path. Thus, its nonlinear tuning characteristic directly degrades chirp linearity. The operating principle of the proposed idea can be illustrated by referring to the digital bang-bang PLL shown in the bottom part of Fig. 15.2.1. In an ideal two-point injection scheme  $\text{mod}[k]$  is added to the DCO input and subtracted at the divider modulus input. Thus, in principle, the error  $e[k]$  detected by the binary phase detector should not be correlated with  $\text{mod}[k]$ . In the presence of gain or delay mismatch between the injection paths, a residual component of  $\text{mod}[k]$  leaks into  $e[k]$ . A digital pre-distortion (DPD) block is, therefore, introduced to alter the modulation signal so that the correlation between  $e[k]$  and  $\text{mod}[k]$  is minimized.
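The idea of driving the correlation between e[k] and mod[k] to zero can be illustrated with a toy sign-error LMS loop. This is a heavily simplified sketch and not the algorithm of the paper: the PLL dynamics are ignored, the residual reaching the binary phase detector is modeled as (1 − g·K)·mod[k] with K an unknown normalized DCO-path gain, and all names and values are hypothetical.

```python
import math


def calibrate_injection_gain(dco_gain: float, mu: float = 1e-3,
                             n_samples: int = 20000) -> float:
    """Adapt a digital gain g so that e[k] becomes uncorrelated with mod[k]."""
    g = 1.0
    for k in range(n_samples):
        mod = math.sin(2.0 * math.pi * k / 200.0)  # arbitrary test modulation
        residual = (1.0 - g * dco_gain) * mod      # leakage of mod[k] into the loop
        e = (residual > 0) - (residual < 0)        # binary detector output: -1, 0 or +1
        g += mu * e * mod                          # correlate e[k] with mod[k]
    return g


K_DCO = 0.8  # assumed DCO-path gain error (normalized)
g_final = calibrate_injection_gain(K_DCO)
print(f"converged gain: {g_final:.3f} (ideal {1.0 / K_DCO:.3f})")
```

In this model the gain settles around 1/K, at which point the product e[k]·mod[k] averages to zero and the update stops drifting.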

Because of the high resolution required of the DCO for noise reasons, i.e. in the order of 150kHz for a 20GHz oscillator, a 200MHz chirp signal requires a segmentation of the DCO tuning characteristic. In practice, several thermometric-coded capacitor banks with different unit capacitances are connected. However, accurate matching among the capacitor banks cannot be guaranteed over PVT variations. Therefore, an additional calibration of the gain of each tuning bank is required. In this work, the gain mismatch between the two injection paths, the gain mismatches among the DCO banks, and the nonlinearity of the coarsest DCO bank are automatically calibrated via the scheme in Fig. 15.2.2. Pre-distortion is only applied to the coarsest bank as it is the most significant. Unlike the scheme proposed in [6], the modulation signal  $\text{mod}[k]$  is first quantized and the result of this quantization is used to select the proper  $c_{2,i}$  coefficient, which replaces the coarse bits of  $\text{mod}[k]$ . The finer bits of  $\text{mod}[k]$  are instead calibrated by means of the  $g_{2,i}$  gains. This scheme realizes an automatic piece-wise-linear pre-distortion of the DCO coarse tuning bank. For each of the finer DCO banks only a single gain ( $g_1$  and  $g_0$ ) is estimated in background. A dynamic element-matching algorithm (DEM) is used at the fine DCO bank to mitigate mismatch between capacitors.
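The piece-wise-linear mapping described above can be sketched as a lookup plus a per-segment gain. The sketch assumes a non-negative integer modulation code and a fixed split between coarse and fine bits; the coefficient values, the bit widths, and the function name are hypothetical, and the background estimation of the coefficients is not modeled.

```python
def predistort(mod_code: int, c2: list, g2: list, fine_bits: int) -> float:
    """Piece-wise-linear pre-distortion of one modulation sample.

    The coarse bits select segment i; c2[i] replaces the coarse part of the
    code and g2[i] scales the remaining fine bits.
    """
    i = mod_code >> fine_bits                  # quantized (coarse) part of mod[k]
    fine = mod_code & ((1 << fine_bits) - 1)   # finer bits of mod[k]
    return c2[i] + g2[i] * fine


FINE_BITS = 4
# Ideal coefficients: segment starts on a uniform grid and unity segment gains.
c2_ideal = [0, 16, 32, 48]
g2_ideal = [1.0, 1.0, 1.0, 1.0]

# Example coefficients for a compressive segment 1 (assumed values).
c2_cal = [0.0, 16.0, 30.4, 46.4]
g2_cal = [1.0, 0.9, 1.0, 1.0]

print([predistort(c, c2_cal, g2_cal, FINE_BITS) for c in (15, 16, 31, 32)])
```

With the ideal coefficients the mapping is the identity; with calibrated coefficients each segment gets its own offset and slope, which is what makes the correction piece-wise linear.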

Figure 15.2.3 shows the designed FMCW modulator. It is a digital PLL based on a binary phase detector. The DCO is designed to operate at 20GHz, and it is intended to drive a frequency multiplier-by-four (not implemented in this work) to reach the 77-to-81GHz radar band. A CML-based frequency divider-by-four acts as a prescaler and reduces this frequency to about 5.8GHz. The frequency-divided signal is further divided by a multi-modulus divider implemented in TSPC logic and delayed by means of a digital-to-time converter (DTC). The latter block allows canceling most of the resulting quantization error [7]. The resolution of the DTC is about 310fs, and its residual quantization error is below random noise. In this condition, the binary phase detector behaves on average as a linear detector.

The DCO is a typical Class-B LC oscillator with an NMOS differential pair and an LC tail filter. Four different tuning elements cover about 4.2GHz of the tuning range. Coarsest tuning is achieved using a bank of switched MoM capacitors (not shown in figure). Finer tuning is instead realized via two different banks of MOS varactors switched from the inversion to the depletion region. The finest frequency resolution of 150kHz is achieved by means of a MOS varactor biased by a resistor-string DAC. Both the DTC and the BPD are implemented in current-mode logic for its better immunity to supply disturbances. The DTC circuit is an inverter driving two banks of MOS varactors and a subsequent inverter to recover the waveform slope. The binary phase detector is made of two latches with a buffer interposed between them to reduce hysteresis.

The frequency synthesizer in Fig. 15.2.3 has been integrated in a 65nm CMOS process. Figure 15.2.7 shows the die micrograph of the test chip. The core area of the overall synthesizer is about 0.48mm<sup>2</sup>. The power consumption of the synthesizer is 19.7mW (excluding the pad driver). The PLL synthesizes frequencies from 20.4 to 24.6GHz with fractional-N resolution from a 52MHz reference, showing a slight offset of the DCO center frequency with respect to simulations. The reference clock is derived from an on-chip oscillator with an off-chip quartz crystal. For testing purposes, only the output of the divide-by-four prescaler is connected to the pad through a Class-A power amplifier. To assess the performance of the proposed FMCW synthesizer, we generated fast triangular and saw-tooth chirp signals with different periods and slopes (see Fig. 15.2.4). The peak and rms frequency errors of the saw-tooth chirp with the fastest slope of 173MHz/μs and the shortest idle time of 200ns are 0.21% and 0.06%, respectively. Those results demonstrate that, thanks to the implemented DPD, the circuit fully exploits the bandwidth enhancement of the two-point injection scheme.

Figure 15.2.5 shows the phase noise in integer-N and fractional-N modes measured at the divide-by-four output. The proposed frequency synthesizer achieves a phase noise level of -112.5dBc/Hz at 1MHz offset (measured at a 5.8GHz carrier frequency), which results in an equivalent phase noise of -90dBc/Hz at 1MHz if referred to the 79GHz carrier. Integrated jitter is below 213fs and 242fs rms for the integer and fractional modes, respectively. The same figure (see inset) also shows the measured fractional spurs. A near-integer-N channel leading to in-band fractional spurs has been chosen. This is the worst-case scenario for fractional spurs because they are not filtered by the loop. Thanks to the DTC and its calibration algorithms, the spur level at 173kHz is about -58dBc. The performance summary and the comparison with prior designs are reported in the table in Fig. 15.2.6.
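Referring the measured phase noise to the 79GHz carrier uses the usual 20·log10 scaling with the frequency ratio, assuming ideal (noiseless) frequency multiplication. A minimal check with an illustrative helper name:

```python
import math


def scale_phase_noise(pn_dbc_hz: float, f_from_hz: float, f_to_hz: float) -> float:
    """Phase noise after ideal frequency multiplication from f_from to f_to."""
    return pn_dbc_hz + 20.0 * math.log10(f_to_hz / f_from_hz)


pn_at_79ghz = scale_phase_noise(-112.5, 5.8e9, 79e9)
print(f"equivalent phase noise at 79 GHz: {pn_at_79ghz:.1f} dBc/Hz")
```

The result is within a fraction of a dB of the -90dBc/Hz quoted above.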

### References:

- [1] H. Sakurai, et al., "A 1.5GHz-Modulation-Range 10ms-Modulation Period 180kHz<sub>rms</sub>-Frequency-Error 26MHz-Reference Mixed-Mode FMCW Synthesizer for mm-Wave Radar Applications," *ISSCC*, pp. 292-293, Feb. 2011.
- [2] W. Wu, et al., "A 56.4-to-63.4GHz Multi-Rate All-Digital Fractional-N PLL for FMCW Radar Applications in 65nm CMOS," *IEEE JSSC*, vol. 49, no. 5, pp. 1081-1096, May 2014.
- [3] J. Vovnoboy, et al., "A Fully Integrated 75-83 GHz FMCW Synthesizer for Automotive Radar Applications with -97 dBc/Hz Phase Noise at 1 MHz Offset and 100 GHz/mSec Maximal Chirp Rate," *IEEE RFIC*, pp. 96-99, May 2017.
- [4] H. Yeo, et al., "A 940MHz-Bandwidth 28.8ms-Period 8.9GHz Chirp Frequency Synthesizer PLL in 65nm CMOS for X-Band FMCW Radar Applications," *ISSCC*, pp. 238-239, Feb. 2016.
- [5] G. Marzin, et al., "A 20 Mb/s Phase Modulator Based on a 3.6GHz Digital PLL with -36dB EVM at 5mW Power," *IEEE JSSC*, vol. 47, no. 12, pp. 2974-2988, Dec. 2012.
- [6] S. Levantino, et al., "An Adaptive Pre-Distortion Technique to Mitigate the DTC Nonlinearity in Digital PLLs," *IEEE JSSC*, vol. 49, no. 8, pp. 1762-1772, Aug. 2014.
- [7] D. Tasca, et al., "A 2.9-to-4.0GHz Fractional-N Digital PLL with Bang-Bang Phase Detector and 560fs<sub>rms</sub> Integrated Jitter at 4.5mW Power," *ISSCC*, pp. 88-90, Feb. 2011.



Figure 15.2.1: Conventional two-point injection in (a) versus proposed approach with digital pre-distortion in (b).



Figure 15.2.2: Adaptive digital pre-distortion and gain calibration of DCO banks.



Figure 15.2.3: Implemented digital PLL.



Figure 15.2.5: Measured spectrum: (a) phase noise, (b) in-band fractional spurs.



Figure 15.2.6: Comparison table.



Figure 15.2.7: Die micrograph.

## 15.3 A 36.3-to-38.2GHz -216dBc/Hz<sup>2</sup> 40nm CMOS Fractional-N FMCW Chirp Synthesizer PLL with a Continuous-Time Bandpass Delta-Sigma Time-to-Digital Converter

Daniel Weyer<sup>1</sup>, M. Batuhan Dayanik<sup>2</sup>, Sunmin Jang<sup>1</sup>, Michael P. Flynn<sup>1</sup>

<sup>1</sup>University of Michigan, Ann Arbor, MI

<sup>2</sup>Broadcom, Irvine, CA

Automotive radar and other mm-wave applications require high-quality frequency synthesizers that offer fast settling and low phase noise. Analog PLLs still dominate in the mm-wave range, but all-digital PLLs (ADPLLs) promise greater flexibility and area efficiency. However, existing mm-wave ADPLLs are large, fail to offer low in-band phase noise [1] or must rely on extensive calibration [2]. Performance limitations of conventional TDCs still remain a major roadblock for the adoption of high-frequency ADPLLs. To address this problem, this work introduces a noise-shaping TDC based on a 4<sup>th</sup>-order bandpass  $\Delta\Sigma$  modulator (BPDSTM) to achieve low integrated noise (183fs<sub>rms</sub>) and high linearity. Our approach enables low in-band phase noise (-85dBc/Hz @ 100kHz) for wide loop bandwidths (>1MHz) in a calibration-free single-loop digital 36.3-to-38.2GHz PLL. The prototype PLL effectively generates fast (500MHz/55μs) and precise (824kHz<sub>rms</sub> frequency error) triangular chirps for FMCW radar applications.

We sample the reference clock with a continuous-time bandpass  $\Delta\Sigma$  modulator to measure phase, which gives the following advantages: 1) The proposed bandpass  $\Delta\Sigma$  TDC (BPDSTDC) avoids low-frequency TDC noise contamination by treating the reference as an analog signal and sampling it at 4 times the reference frequency before down-converting it in the digital domain. Phase detection in conventional analog and digital PLLs is contaminated by low-frequency noise because the reference clock samples the feedback signal at the same frequency. Analog down-conversion to DC is implicit in conventional PLLs, causing noise sources close to DC, primarily flicker noise, to interfere with the phase-difference measurement. 2) It solves the bandwidth-related challenges of conventional reference-sampling ADCs. Such Nyquist-type ADCs require a large bandwidth to track the input edges (Fig. 15.3.1) or, like [3], resort to a preceding phase-frequency detector (PFD) with a charge pump (CP). This is possible because the BPDSTM oversamples a narrow bandwidth around a center frequency with a large oversampling ratio and shapes the quantization noise out of band, achieving high resolution in the center bandwidth. 3) In contrast to delay-line-based TDCs, there is no requirement for calibration. 4) It offers an extended phase detection range of  $\pm 2\pi$  compared to  $\pm \pi/2$  with conventional reference-sampling ADCs, resulting in robust PLL locking behavior without the need for an additional loop for frequency acquisition.

The BPDSTDC consists of a BPDSTM followed by digital down-conversion (DDC) (Fig. 15.3.1). In our PLL, the feedback signal serves as the TDC clock, and the BPDSTM outputs a digitized version of a sinusoidal reference, sampled at 4 times the reference frequency  $F_{REF}$ . The output is a measure of the phase alignment between the feedback and the reference and is then digitally down-converted by multiplying by a (+1,0,-1,0) digital sequence, which provides a DC value proportional to the captured phase information. The low-pass characteristic of the PLL filters out the shaped TDC noise. The BPDSTM provides a 5MHz bandwidth around the reference frequency of 120MHz. This translates to a 2.5MHz TDC bandwidth, offering sufficient margin for PLL bandwidths >1MHz. Since the loop locks with  $F_{FB} = 4 \times F_{REF}$ , the BPDSTDC shows an extended phase detection range of  $\pm 2\pi$  of the feedback phase (Fig. 15.3.1), resulting in a wider PLL locking range.
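The digital down-conversion step can be illustrated with an idealized model that leaves out the ΔΣ modulator and its quantization noise: a unit-amplitude sinusoidal reference is sampled at 4×F<sub>REF</sub>, multiplied by the (+1,0,-1,0) sequence, and averaged. The function name and the averaging length are assumptions made for this sketch.

```python
import math


def ddc_output(phase_rad: float, n_periods: int = 64) -> float:
    """Average of sin(2*pi*F_REF*t + phase), sampled at 4*F_REF, times (+1, 0, -1, 0).

    Sample k sits at a reference phase of k*pi/2, so only the samples at 0 and
    pi contribute, and the average equals sin(phase) / 2.
    """
    lo_sequence = (1, 0, -1, 0)
    n_samples = 4 * n_periods
    acc = 0.0
    for k in range(n_samples):
        sample = math.sin(math.pi * k / 2.0 + phase_rad)
        acc += sample * lo_sequence[k % 4]
    return acc / n_samples


for phase in (-0.2, 0.0, 0.2):
    print(f"reference phase {phase:+.1f} rad -> DC output {ddc_output(phase):+.4f}")
```

The DC value is sin(φ)/2, which is proportional to the phase for small errors and monotonic over ±π/2 of the reference phase. Because the feedback runs at four times the reference frequency, that span corresponds to the ±2π of feedback phase mentioned above.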

The prototype IC is implemented as a single-loop type-II fractional-N PLL (Fig. 15.3.2). Since the feedback signal serves as the BPDSTM clock, shaped noise from the  $\Delta\Sigma$  fractional-N divider might increase the in-band phase noise of the PLL. We reduce the fractional-N quantization noise with a divider chain that uses a phase interpolator to implement divider ratios with small incremental steps of 1/4. Hardware measurements show that this improves in-band phase noise by 18dB in comparison to a divider-ratio step size of 4, and that there is no difference in the measured in-band phase noise for integer-N and fractional-N modes. We generate 8 phases as the VCO output is divided by 8, and use a pipelined phase interpolator [4] with two phase-interpolator (PI) units to quadruple the number of selectable phases to 32 (Fig. 15.3.2). With the preceding division by 8, phase rotation allows us to emulate the divider-ratio step size of 8/32 = 1/4. Since the phase rotation occurs ahead of the final divider stage, we perform only one phase shift per divider output cycle to retain the step size of 1/4 for the entire divider chain. A 2<sup>nd</sup>-order  $\Delta\Sigma$  modulator controls the phase rotation for fractional-N division. Finally, on-chip chirp control logic modulates the PLL division ratio to generate the desired FMCW waveform profile.

The BPDSTM is implemented as a 4<sup>th</sup>-order continuous-time modulator (Fig. 15.3.3) to obtain 2<sup>nd</sup>-order noise-shaping in the BPDSTDC. The inherent anti-aliasing of the continuous-time BPDSTM suppresses reference distortion. The modulator employs single-opamp resonators [5] to reduce power and area. A feedforward path around the second resonator allows the omission of the return-to-zero (RZ) DAC at the modulator input for the benefit of reduced input-referred noise. A transconductance amplifier sums the output currents of the resonators, and its output voltage is digitized by a 5-level flash quantizer. As also shown in Fig. 15.3.3, the PI units forward both input phases and generate an intermediate phase [4]. Each PI unit consists of three PI core circuits, two of which operate in phase-forwarding mode, while the third interpolates the two input phases. The logic gates at the PI core inputs avoid short-circuit currents to improve the linearity of the phase interpolation [4].

The prototype IC is fabricated in 40nm CMOS (Fig. 15.3.7). The BPDSTM occupies 0.02mm<sup>2</sup> and consumes 7.6mW from a 1.1V supply. Stand-alone measurements of the BPDSTM with a 120MHz input and an external 480MHz clock show 63dB SNDR (10.2b ENOB) for a 5MHz bandwidth (Fig. 15.3.4). The output power spectrum of the BPDSTDC is obtained by applying a phase-modulated clock around a 480MHz carrier, revealing ideal 2<sup>nd</sup>-order noise-shaping (Fig. 15.3.4). The BPDSTDC achieves a low measured integrated rms noise of 183fs in a 1MHz signal bandwidth, positioning our TDC among the best of noise-shaping TDCs in Fig. 15.3.4. Thanks to the low TDC noise, the PLL achieves a low measured in-band phase noise of -85dBc/Hz at 100kHz offset for a 38.1GHz output and a loop bandwidth >1MHz (Fig. 15.3.5), corresponding to a normalized phase noise of only -216dBc/Hz<sup>2</sup>. Phase noise peaking around 1MHz offset derives from peaking in the loop transfer function as the loop dynamics are chosen to obtain an agile PLL having a wide bandwidth with low phase margin. A comparison with prior art in Fig. 15.3.6 ranks these phase noise results among the best achieved in a mm-wave CMOS fractional-N PLL. The prototype PLL consumes a total power of 68mW and occupies 0.18mm<sup>2</sup>.
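The quoted ENOB follows from the measured SNDR through the usual full-scale-sine relation, ENOB = (SNDR - 1.76dB)/6.02dB. The helper below is only a sanity check of that conversion, with a hypothetical function name.

```python
def enob_from_sndr(sndr_db: float) -> float:
    """Effective number of bits for a full-scale sine input."""
    return (sndr_db - 1.76) / 6.02

# 63dB SNDR measured for the stand-alone BPDSTM in a 5MHz bandwidth
print(round(enob_from_sndr(63.0), 1))  # 10.2
```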

To verify the quality of the chirp synthesis, our PLL generates a triangular FMCW chirp profile with a 500MHz bandwidth from 37.65 to 38.15GHz and chirp slopes up to 10MHz/μs (Fig. 15.3.5). Chirp measurements reveal the chirp precision with a frequency error of only 824kHz<sub>rms</sub> for a 9.1MHz/μs slope, showing the effectiveness of the PLL for fast and precise chirp generation. Our BPDSTDC successfully leverages the high resolution of a BPDSTM gained from oversampling and noise-shaping to demonstrate low-phase-noise performance in a wide-bandwidth digital PLL, without need for extensive calibration. The prototype PLL achieves an excellent combination of normalized phase noise and area for a >10GHz digital PLL when compared to prior art in Fig. 15.3.6. The normalized phase noise at 100kHz reaches that of the analog fractional-N PLL, and of the digital PLLs only the 9GHz digital PLL achieves similar normalized phase noise.

#### Acknowledgements:

We thank Dr. Mark Ferriss of IBM Research for his valuable technical feedback. This work was supported in part by the Center for Microwave Sensor Technology.

#### References:

- [1] W. Wu, et al., "A 56.4-to-63.4 GHz Multi-Rate All-Digital Fractional-N PLL for FMCW Radar Applications in 65 nm CMOS," *IEEE JSSC*, vol. 49, no. 5, pp. 1081-1096, May 2014.
- [2] A. Hussein, et al., "A 50-to-66GHz 65nm CMOS All-Digital Fractional-N PLL with 220fs<sub>rms</sub> Jitter," *ISSCC*, pp. 326-327, Feb. 2017.
- [3] Z. Xu, et al., "A 3.6 GHz Low-Noise Fractional-N Digital PLL Using SAR-ADC-Based TDC," *IEEE JSSC*, vol. 51, no. 10, pp. 2345-2356, Oct. 2016.
- [4] A. Narayanan, et al., "A Fractional-N Sub-Sampling PLL using a Pipelined Phase-Interpolator With an FoM of -250 dB," *IEEE JSSC*, vol. 51, no. 7, pp. 1630-1640, July 2016.
- [5] H. Chae, et al., "A 12 mW Low Power Continuous-Time Bandpass  $\Delta\Sigma$  Modulator With 58 dB SNDR and 24 MHz Bandwidth at 200 MHz IF," *IEEE JSSC*, vol. 49, no. 2, pp. 405-415, Feb. 2014.



Figure 15.3.1: Nyquist-type ADC-TDC vs. Bandpass  $\Delta\Sigma$  TDC (top and center) and phase transfer curves (bottom).



Figure 15.3.2: Fractional-N chirp synthesizer PLL (top) and pipelined phase interpolator (bottom right).



Figure 15.3.3: Bandpass  $\Delta\Sigma$  modulator with resonator circuit (top) and PI unit cell with PI core circuit (bottom).



Figure 15.3.4: Measured BPDSM and BPDTDC output power spectra (top) and comparison of noise-shaping TDCs (bottom).

|                           | This work         | RFIC'15 Wu            | ESSCIRC'15 Dayanik | CICC'13 Yu | ISSCC'12 Elshazly |
|---------------------------|-------------------|-----------------------|-------------------|------------|-------------------|
| TDC Type                  | CT $\Delta\Sigma$ | Flash- $\Delta\Sigma$ | CT $\Delta\Sigma$ | GSRO       | SRO               |
| Technology [nm]           | 40                | 40                    | 65                | 65         | 90                |
| Integrated RMS noise [fs] | 183*              | 103                   | 176               | 148        | 315               |
| Bandwidth [MHz]           | 2.5               | 1.25                  | 1                 | 4          | 1                 |
| Sampling frequency [MHz]  | 240**             | 50                    | 270               | 400        | 500               |
| Power [mW]                | 7.6               | 1.32                  | 8.4               | 6.55       | 2                 |
| Area [mm <sup>2</sup> ]   | 0.02              | 0.08                  | 0.055             | 0.05       | 0.02              |

\* In a 1MHz bandwidth

\*\* Effective sampling rate



Figure 15.3.5: PLL output spectrum with measured phase noise (top) and FMCW chirp spectrum and profile (bottom).

Figure 15.3.6: Performance comparison with prior-art CMOS FMCW chirp synthesizer PLLs.



Figure 15.3.7: Die micrograph.

## 15.4 A Low-Phase-Noise Digital Bang-Bang PLL with Fast Lock Over a Wide Lock Range

Luca Bertulessi, Luigi Grimaldi, Dmytro Cherniak, Carlo Samori, Salvatore Levantino

Politecnico di Milano, Milan, Italy

Digital phase-locked loops (DPLLs) based on binary phase detectors (BPDs) avoid power-hungry high-resolution time/digital converters (TDCs) while demonstrating advantages in area, power consumption, and design complexity. The introduction of digital/time converters (DTCs) enables fractional-N resolution at high spectral purity [1]. The design of a bang-bang digital PLL for wireless standards has two main challenges: quantization noise must be kept below the tolerable spot phase noise, and fast lock must be guaranteed even for wide frequency steps. However, overload of the BPD causes bang-bang PLLs to fail to lock or to exhibit extremely long transients. A similar issue appears in the design of sub-sampling PLLs. This problem is exacerbated when the bang-bang PLL is designed for low phase noise, because of the fine resolution required of the digitally controlled oscillator (DCO). Fast-locking techniques are usually based on lookup tables [2], finite-state machines [3], or gear shifting, mostly in the field of clock-and-data recovery (CDR) circuits, where spot noise performance is less of a concern. High-performance bang-bang PLLs (or sub-sampling PLLs) also include a frequency-aid circuit running in the background [4], but its settling performance is seldom discussed.

Figure 15.4.1 depicts the simplified block diagram of the implemented DPLL. The key techniques adopted to achieve both low phase noise and low power are the following: (i) the DCO features a complementary topology with two tail filters to maximize both its current and voltage efficiency and improve its figure of merit; (ii) to limit the DCO quantization noise for a given minimum MOS varactor, the DCO frequency resolution is improved by maximizing the value of the LC tank capacitance; (iii) for the same reason, the DCO fine tuning is obtained by interpolating the output of the digital loop filter by a delta-sigma modulator whose output is filtered by a second-order passive RC filter; (iv) an additional IIR filter is placed in front of the proportional-integral (PI) loop filter to further reduce the out-of-band phase noise; and (v) the time resolution of the DTC is kept below the BPD input jitter by increasing its current consumption. Unfortunately, the additional poles introduced in the loop for noise filtering prevent widening the PLL bandwidth, and this is in contrast with the requirement of fast frequency switching. Moreover, they increase the number of state variables of the loop. Therefore, they complicate the adoption of lookup tables, gear shifting, or state machines.

The frequency-aid technique presented in this work operates in the background of the PLL, and, as soon as lock is acquired, the circuit automatically enters an idle mode adding negligible power and noise. The reference and the divider signals are fed to a coarse and a fine digital ternary phase detector (TPD), and to a digital counter. The input-output characteristics of both TPDs, sketched inside the dashed box in Fig. 15.4.1, are midtread with dead zones of different sizes,  $\Delta t_{dz,1} < \Delta t_{dz,2}$ . When the PLL is locked, the rms input time jitter  $\Delta t$  due to PLL noise is  $\Delta t < \Delta t_{dz,1} < \Delta t_{dz,2}$ . Thus  $e_1 = e_2 = 0$ , and the frequency-aid circuit is idle. When a large time error  $\Delta t$  between the reference and the frequency-divided signal occurs, the frequency-aid block comes into play. If  $\Delta t_{dz,1} < \Delta t$ ,  $e_1 = 1$ . To speed up locking, this signal could, in principle, be fed to a second PI filter driving the first tuning bank of the DCO. If  $\Delta t$  is even larger,  $\Delta t_{dz,2} < \Delta t$ , then also  $e_2 = 1$ , and another PI filter could control bank #2. This use of standard PI filters on each path is, however, unfeasible if a fast lock is required. This issue, and the proposed solution, can be intuitively explained with the linear model in Fig. 15.4.2, where the controlled variable is the zero-crossing time. The DCO gains (the sensitivity of the DCO period) are indicated with  $K_T$  [s/b] and, of course, the four tuning banks have scaled sensitivities ( $K_{T3} > K_{T2} > K_{T1} > K_{T0}$ ). For each tuning word  $t_{wi}$ , the DCO operates as a zero-order hold (ZOH) block followed by an integration with gain  $NK_{Ti}$ , where  $N$  is the division factor. For output frequencies of a few GHz,  $K_{T0}$  is usually on the order of 1fs/b to limit DCO quantization noise, and the finer banks feature 32 to 64 thermometric capacitors to restrain the number of connections.
Thus,  $K_{T1}$  cannot exceed  $K_{T0}$  by more than one order of magnitude to guarantee the overlap between the tuning characteristics. Such a low value of  $K_{T1}$  demands a large integral gain  $\alpha_1$  to speed up the locking transient, since after a large frequency error,  $e_1$  saturates to +1 (or -1) and the output varies as a linear ramp with a slope  $N\alpha_1 K_{T1}$  [s] each  $T_{ref}$ , where  $T_{ref}$  is the reference period in seconds. A PI filter in this path would require a large proportional gain  $\beta_1 \gg \alpha_1$  for stability, which, in turn, would require that a large tuning range is covered by the first tuning bank. This is impractical because of the finite number of capacitors in the bank, and it is represented by the saturation blocks in Fig. 15.4.2. Instead,  $e_1$  is weighted by  $\lambda_1$  and added to the divider modulus, which produces a time shift  $\lambda_1 T_{dco}$  at the divider output, where  $T_{dco}$  is the DCO period. The signal  $e_1 \lambda_1 T_{dco}$  is therefore equivalently integrated without any limitation of the output range. The value of  $\lambda_1$  is derived by imposing that  $\lambda_1 T_{dco} = \beta_1 N K_{T1}$ . To properly design the system, the gain  $\lambda_1 T_{dco}$  must be bounded to prevent the onset of limit cycles. It is possible to demonstrate that, in the presence of a limit cycle, the delay of the “div” signal would toggle with peak-to-peak amplitude  $2N\beta_1 K_{T1} = 2\lambda_1 T_{dco}$ . If this delay is within the dead zone of the TPD (see the inset in Fig. 15.4.2), that is,  $\Delta t_{dz,1} > 2\lambda_1 T_{dco}$ , limit cycles are avoided.

In the circuit in Fig. 15.4.1, two other coarser DCO banks are present. For a large frequency step, all three loops of the frequency-aid circuit operate concurrently, although the coarser loop practically sets the transient behavior because its gain is much larger than that of the others. When the time error is within the coarser dead zone,  $\Delta t < \Delta t_{dz,2}$ , the coarser loop controlling the second tuning bank becomes idle, and the first tuning bank takes control of the transient, until  $\Delta t < \Delta t_{dz,1}$ .

Figure 15.4.1 also shows the schematic of the two ternary phase detectors together with the BPD. The delay cell, implemented with a chain of non-symmetrically ratioed CMOS inverters, shifts the input by a nominal value  $\tau = 200ps$ . Thus,  $\Delta t_{dz,1} = 2\tau = 400ps$  and  $\Delta t_{dz,2} = 10\tau = 2ns$ . As the “div” and “ref” signals reciprocally sample their delayed versions with a zero-offset flipflop, the residual time offset affecting the four thresholds of the ternary phase detectors is mainly due to the mismatch between the delay stages, and it is negligible with respect to the size of the dead zones.
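The ternary characteristic and the limit-cycle bound  $\Delta t_{dz,1} > 2\lambda_1 T_{dco}$  can be illustrated with a short behavioral sketch. It assumes the decision rule stated in the text ( $e = 1$  when  $\Delta t$  exceeds the dead-zone value) applied symmetrically for negative errors, and it uses a 4.0GHz DCO frequency purely as an illustrative operating point inside the synthesizer range. All names are hypothetical.

```python
TAU = 200e-12      # nominal delay-cell shift from the text
DZ1 = 2 * TAU      # fine dead zone, 400ps
DZ2 = 10 * TAU     # coarse dead zone, 2ns

def ternary_pd(dt: float, dead_zone: float) -> int:
    """Midtread ternary detector: 0 inside the dead zone, else the sign of dt.

    Assumption: the threshold is applied symmetrically for negative errors.
    """
    if dt > dead_zone:
        return 1
    if dt < -dead_zone:
        return -1
    return 0

def max_lambda(dead_zone: float, t_dco: float) -> float:
    """Upper bound on lambda from dead_zone > 2 * lambda * t_dco."""
    return dead_zone / (2.0 * t_dco)

# Small error: both detectors idle. Larger errors engage e1, then e2 as well.
for dt in (50e-12, 1e-9, 3e-9, -3e-9):
    print(dt, ternary_pd(dt, DZ1), ternary_pd(dt, DZ2))

T_DCO = 1.0 / 4.0e9   # illustrative DCO period (assumed 4.0GHz output)
print(max_lambda(DZ1, T_DCO))
```

With these illustrative numbers the bound on  $\lambda_1$  is close to 0.8; the value actually used in the design is not stated in the text.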

The proposed synthesizer generates a signal in the 3.7-to-4.1GHz range from a reference of 52MHz. It has been fully integrated in a 65nm CMOS process (see the die micrograph in Fig. 15.4.7) and occupies a core area of  $0.61mm^2$ . The total power dissipation is 5.28mW. Figure 15.4.3 shows the measured phase noise. The rms jitter (integrated from 1kHz to 30MHz) is 182.5fs, while the spot phase noise at 20MHz offset is  $-150.7dBc/Hz$ . This noise satisfies the tight GSM specification of  $-162dBc/Hz$  referred to a 900MHz carrier, and it is on par with prior digital PLLs for cellular applications [5,6], while having lower power consumption thanks to the adoption of a single-bit TDC. The measured in-band fractional spur level is  $-50dBc$ .
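The comparison against the GSM mask and the figure of merit can be reproduced with two standard conversions. The sketch assumes the usual 20·log10 scaling of phase noise with carrier frequency and the common jitter-power figure of merit, FoM = 10·log10[(σ/1s)²·(P/1mW)]; neither formula is spelled out in the text, and the function names are hypothetical.

```python
import math

def scale_phase_noise(pn_dbc_hz: float, f_carrier_hz: float, f_target_hz: float) -> float:
    """Refer a spot phase noise to another carrier (ideal frequency division)."""
    return pn_dbc_hz + 20.0 * math.log10(f_target_hz / f_carrier_hz)

def jitter_fom_db(rms_jitter_s: float, power_w: float) -> float:
    """Jitter-power FoM, normalized to 1s and 1mW (assumed definition)."""
    return 10.0 * math.log10(rms_jitter_s ** 2 * (power_w / 1e-3))

# The exact carrier of the measurement is not given, so both band edges are checked.
for f_out in (3.7e9, 4.1e9):
    print(f_out, round(scale_phase_noise(-150.7, f_out, 900e6), 1))

print(round(jitter_fom_db(182.5e-15, 5.28e-3), 1))
```

For any carrier in the 3.7-to-4.1GHz range the scaled noise stays below -162dBc/Hz, and the computed FoM agrees with the value listed in Fig. 15.4.6 to within rounding.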

Figure 15.4.4 displays the measured transient response for a coarse frequency acquisition. After a step of 364MHz, the output settles to within 10MHz of the final frequency in 5.6 $\mu s$ . Figure 15.4.5 compares the frequency transients with and without the frequency-aid circuitry (while keeping the counter in both cases). For a fine frequency step of 1MHz, the circuit reaches lock after a nonlinear behavior in about 7ms (without frequency aid) and in only 110 $\mu s$  (with frequency aid). For the coarse step of 364MHz, the circuit is unable to reach lock without frequency aid, while it settles in 115 $\mu s$  (within an error band of 364kHz) with frequency aid. The performance summary and the comparison with prior designs are reported in Fig. 15.4.6.

### References:

- [1] D. Tasca, et al., "A 2.9-to-4.0GHz Fractional-N Digital PLL with Bang-Bang Phase Detector and 560fs<sub>rms</sub> Integrated Jitter at 4.5mW Power," ISSCC, pp. 88-90, Feb. 2011.
- [2] A. Samarah and A. Carusone, "Multi-Phase Bang-Bang Digital Phase Lock Loop with Accelerated Frequency Acquisition," IEEE ISCAS, pp. 545-548, May 2015.
- [3] M. Hekmat, et al., "A 25 GHz Fast-Lock Digital LC PLL With Multiphase Output Using a Magnetically-Coupled Loop of Oscillators," IEEE JSSC, vol. 50, no. 2, pp. 490-502, Feb. 2015.
- [4] X. Gao, et al., "A 2.7-to-4.3GHz, 0.16ps<sub>rms</sub>-jitter, -246.8dB-FOM, Digital Fractional-N Sampling PLL in 28nm CMOS," ISSCC, pp. 174-175, Feb. 2016.
- [5] C. Weltin-Wu, et al., "A 3.5 GHz Digital Fractional-N PLL Frequency Synthesizer Based on Ring Oscillator Frequency-to-Digital Conversion," IEEE JSSC, vol. 50, no. 12, pp. 2988-3002, Dec. 2015.
- [6] C.-W. Yao, et al., "A 14nm Fractional-N Digital PLL with 0.14ps<sub>rms</sub> Jitter and -78dBc Fractional Spur for Cellular RFICs," ISSCC, pp. 422-423, Feb. 2017.



Figure 15.4.1: Implemented digital PLL.



Figure 15.4.2: Equivalent time model illustrating the operating principle of the frequency-aid technique and the problem of limit cycle.



Figure 15.4.3: Measured phase noise.



Figure 15.4.4: Measured settling time for a coarse frequency hop.



Figure 15.4.5: Frequency transient without (left) and with frequency aid (right) for a 1MHz step (top) and a 364MHz step (bottom).

|                            | This work         | Chiu<br>JSSC10 | Weltin-Wu<br>JSSC15 | Markulic<br>ISSCC16 | Gao<br>ISSCC16 | Narayanan<br>JSSC16 | Yao<br>ISSCC17 |
|----------------------------|-------------------|----------------|---------------------|---------------------|----------------|---------------------|----------------|
| Architecture               | Bang-bang<br>DPLL | CPPLL          | DPLL                | SSPLL               | DPLL           | SSPLL               | DPLL           |
| Reference (MHz)            | 52                | 10             | 26                  | 40                  | 40             | 40                  | 26             |
| Output (GHz)               | 3.7-4.1           | 5.3-5.6        | 2.8-3.5             | 10.1-12.4           | 2.7-4.33       | 4.34-4.94           | 2.69           |
| Tuning Range (%)           | 10                | 6              | 22.2                | 20.4                | 46             | 12.9                | n.a.           |
| BW (MHz)                   | 0.150             | 0.04           | 0.140               | 0.7                 | n.a.           | 1.0                 | n.a.           |
| Settling time (μs)         | 5.6               | 20             | n.a.                | n.a.                | n.a.           | n.a.                | n.a.           |
| Frequency hop (MHz)        | 364               | 100            | n.a.                | n.a.                | n.a.           | n.a.                | n.a.           |
| Equivalent PN*<br>(dBc/Hz) | -163.3            | -135.7         | -162.5              | -154.3              | -154.0         | -152.5              | -163.5         |
| RMS jitter (fs)            | 183               | n.a.           | 665                 | 176                 | 159            | 133                 | 137            |
| Power (mW)                 | 5.28              | 19.8           | 15.6                | 5.6                 | 8.2            | 6.2                 | 13.4           |
| FoM (dB)                   | -247.5            | n.a.           | -231.6              | -247.6              | -246.8         | -249.5              | -246.0         |
| Fractional Spur (dBc)      | -50               | n.a.           | -60                 | -56.6               | -54            | -59                 | -78            |
| Spur Frequency<br>(MHz)    | 0.5               | n.a.           | 0.001               | 0.01                | 0.1            | 0.03                | 0.05           |
| Area (mm²)                 | 0.61              | 1.61           | 0.34                | 0.77                | n.a.           | 0.2                 | 0.257          |
| Process (nm)               | 65                | 180            | 65                  | 28                  | 28             | 65                  | 14             |

\* At 20MHz offset, scaled to 900MHz carrier frequency

Figure 15.4.6: Performance summary and comparison with prior designs.



Figure 15.4.7: Die micrograph.

## 15.5 A Digital Frequency Synthesizer with Dither-Assisted Pulling Mitigation for Simultaneous DCO and Reference Path Coupling

Cheng-Ru Ho, Mike Shuo-Wei Chen

University of Southern California, Los Angeles, CA

Injection pulling on frequency synthesizers has become a critical design challenge for high-performance wireless transceivers, especially in highly integrated multi-radio platforms, imposing stringent constraints on system-level frequency planning. For instance, the oscillator of the victim phase-locked loop (PLL) can be disturbed by the power-amplifier (PA) output or other nearby PLLs when their fundamental or harmonic frequency components are close to the victim PLL carrier frequency. Although the aggressor waveform in most cases is known a priori, the coupling path and mechanism (electric or magnetic) are complex and vary over time and operating conditions. Hence, the transfer function of the coupling path cannot be predetermined or reliably compensated with foreground calibration. Therefore, recent works [1-4] have proposed adaptive techniques that track the transfer function in the background. However, they only focus on pulling mitigation of the oscillator, and some incur high overhead, e.g., an additional ADC in the case of an analog PLL. In real operation, the interfering signal can affect not only the oscillator but also the input reference clock [1]. The compensation mechanism is different for the two coupling paths; for example, the oscillator pulling mitigation can worsen the perturbation caused by reference path coupling, as noted in [5]. To alleviate those issues, this work proposes a dither-assisted pulling mitigation scheme that orthogonalizes the coupling from both DCO and reference paths and allows simultaneous rejection in the background. Moreover, the technique also aims to mitigate different aggressor waveforms while minimizing the hardware overhead. The prototype is implemented in 65nm CMOS and uses a digital PLL architecture for better technology scalability.
Various types of interference are injected into the DCO and reference paths simultaneously, and the measurements show 12dB and 22.5dB spectral improvement for PA pulling and oscillator mutual pulling, respectively, which validates the effectiveness of the technique.

Figure 15.5.1 illustrates how the aggressor affects the victim PLL via different paths. This prototype focuses on two types of pulling interference: PA pulling (modulated waveform) and oscillator mutual pulling (sinusoidal waveform). Since most of a digital PLL is implemented in the digital domain, the vulnerable coupling paths are the LC-tank DCO (LC-DCO) path and the input reference path, via an electrical and/or magnetic medium. The distinct coupling transfer functions and cancellation mechanisms require two least-mean-square (LMS) loops to mitigate the perturbation from the two paths simultaneously. However, one fundamental issue is that the phase disturbances from the two paths are indistinguishable at the time-to-digital converter (TDC) output, which can destabilize the LMS loops as they interfere with each other. Therefore, we dither the delay of the reference path with a PN code to decouple the effects (Fig. 15.5.1). Since the phase information is sampled at rising edges of the reference clock via the TDC, the dithering essentially perturbs the sampling instant within a certain dithering range. According to an analytical derivation, the reference-coupled phase disturbance is randomized via this PN-modulated sampling instant, while the DCO-coupled phase disturbance remains mostly intact, except for the elevated dithering-induced noise. This behavior has been confirmed via measurements. More importantly, it allows the two disturbances to be extracted separately via PN correlators due to the orthogonality between the PN sequence and the interference.
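The orthogonality argument can be illustrated with a deliberately simplified numerical sketch. It models the TDC output as the sum of a term proportional to the PN code (standing in for the reference-path disturbance) and a PN-independent sinusoid (standing in for the DCO-path disturbance); in the actual system the reference-path term is a trigonometric function of the PN code, so this linear model is an assumption made only for illustration. A seeded pseudo-random ±1 sequence stands in for the on-chip PN generator, and all amplitudes are hypothetical.

```python
import math
import random

def pn_sequence(n: int, seed: int = 1) -> list:
    """Stand-in for the PN generator: a reproducible +/-1 sequence."""
    rng = random.Random(seed)
    return [1 if rng.random() < 0.5 else -1 for _ in range(n)]

def correlate(x: list, pn: list) -> float:
    """Average of x[k] * pn[k]: keeps PN-modulated content, averages out the rest."""
    return sum(xk * pk for xk, pk in zip(x, pn)) / len(x)

N = 20000
pn = pn_sequence(N)
a_ref = 0.3   # hypothetical amplitude of the PN-modulated (reference-path) term
b_dco = 0.5   # hypothetical amplitude of the PN-independent (DCO-path) term

x = [a_ref * pn[k] + b_dco * math.sin(2.0 * math.pi * 0.0123 * k) for k in range(N)]
estimate = correlate(x, pn)
print(round(estimate, 2))
```

The correlator output settles near `a_ref` even though the PN-independent term is larger, which is the property that lets each LMS loop observe only its own coupling path.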

The DPLL architecture is shown in Fig. 15.5.2. There are three key components related to the dither-assisted pulling-mitigation scheme. The first block is a pseudo-random sequence generator that controls the delay of the reference clock buffer via a digital-to-time converter (DTC), i.e., a bank of unitary capacitors. The PN code is further used by a dither-noise cancellation block to remove the dithering effect at the TDC output. The second block is the reference-pulling mitigation loop inserted prior to the digital loop filter (DLF), which is an LMS loop minimizing the mean-square value of the PN correlator output, i.e., node  $\phi_e$ . The third block is the DCO-pulling mitigation loop that utilizes another LMS loop to minimize the mean-square value of node  $\phi_e$ . Both LMS loops take either an external digital I/Q input for PA pulling or the on-chip direct-digital-synthesis (DDS) output for oscillator mutual pulling, depending on the pulling signal type being demonstrated. Below, a PA pulling scenario is used to explain the configuration of the two LMS loops, but the same principle is applicable to the oscillator mutual-pulling case. After the mitigation loops, the type-II DLF with programmable bandwidth up to ~1MHz controls the delta-sigma modulator, toggling a bank of fine-resolution varactors inside the LC-DCO. The calibration-free TDC is injection-locked to the LC-DCO and provides 28 levels of fine phase quantization within a DCO period [5]. Finally, the accumulated frequency control word (FCW) is subtracted at the TDC output to allow integer-N operation.

Figure 15.5.3 shows the DCO-pulling-mitigation scheme. We set up the experiment as if the 2<sup>nd</sup> harmonic of the PA output couples to the DCO, but it can be extended to other harmonics. An arbitrary waveform generator (AWG) synthesizes the interfering waveform  $[X_i^2(t)+X_0^2(t)] \cdot \sin(\omega_c t+2 \cdot \tan^{-1}[X_0(t)/X_i(t)])$ , where  $\omega_c$  is the DPLL carrier frequency and  $X_i(t)$  and  $X_0(t)$  are the baseband digital I/Q signals. Additionally, the gain and phase of the waveform are changed to mimic various transfer functions of a certain coupling path. By rearranging the terms of the interfering waveform, two input variables,  $\{X_i^2(t)-X_0^2(t)\}$  and  $\{2 \cdot X_i(t) \cdot X_0(t)\}$ , are found, each convolved with a distinct transfer function. Thus, two four-tap LMS loop filters are applied for each input variable. To find the update function of the LMS loop filter, we first derive the mean-square error of the DCO-coupled phase disturbance at node  $\phi_e$ , where the reference-coupled phase disturbance is randomized due to dithering. Then the derivative of the mean-square error is taken with respect to the filter coefficient, leading to the iterative equation shown in Fig. 15.5.3. The LMS loop tracks the variation of the coupling path and closed-loop DPLL bandpass response, where the DCO and TDC gains may vary over PVT.
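The rearrangement that yields the two input variables follows from the double-angle identities: with  $\theta = \tan^{-1}(X_0/X_i)$ , the waveform equals  $(X_i^2 - X_0^2)\sin(\omega_c t) + 2 X_i X_0 \cos(\omega_c t)$ . The sketch below checks this numerically at a few arbitrary points; the function names are hypothetical and the sample values carry no physical meaning.

```python
import math

def interferer(xi: float, x0: float, wt: float) -> float:
    """Waveform as written in the text (requires xi != 0)."""
    return (xi ** 2 + x0 ** 2) * math.sin(wt + 2.0 * math.atan(x0 / xi))

def interferer_rearranged(xi: float, x0: float, wt: float) -> float:
    """Same waveform expressed with the two LMS input variables."""
    u1 = xi ** 2 - x0 ** 2      # first input variable
    u2 = 2.0 * xi * x0          # second input variable
    return u1 * math.sin(wt) + u2 * math.cos(wt)

# Arbitrary (xi, x0, omega_c * t) test points, including a negative xi
points = [(0.7, -0.4, 1.3), (-0.5, 0.9, 2.0), (1.0, 1.0, 0.25)]
max_diff = max(abs(interferer(*p) - interferer_rearranged(*p)) for p in points)
print(max_diff < 1e-12)
```

Because the carrier-rate terms  $\sin(\omega_c t)$  and  $\cos(\omega_c t)$  are common to every sample, only the two baseband products need to be filtered adaptively.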

When the pulling signal is coupled to the PN-dithered reference path, the phase disturbance at the TDC output can be expressed as  $(X_i^2[k]+X_0^2[k]) \cdot \sin(\omega_c \cdot PN[k] \cdot G_{DTC} + 2 \cdot \tan^{-1}[X_0[k]/X_i[k]])$ , where  $PN[k]$  is the PN code at the k<sup>th</sup> sample and  $G_{DTC}$  is the DTC gain. Compared to the DCO interference, it contains an extra PN-modulated term, and hence the phase error at node  $\phi_e$  is first correlated with the trigonometric function of the PN code using an on-chip lookup table, as shown in Fig. 15.5.4. The PN correlator extracts only the phase disturbance due to the reference path, while the DCO-induced phase disturbance is randomized by the PN correlator and does not affect the reference-pulling mitigation loop. The remaining part of the LMS-loop design follows a procedure similar to that of the DCO-pulling mitigation loop. The iterative equations of the LMS loop filter are shown in Fig. 15.5.4 and contain a closed-loop DPLL high-pass transfer function.
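Both mitigation loops rely on LMS adaptation of a short filter. The sketch below shows the textbook LMS update for a four-tap filter identifying an unknown path from a known input; it is not the update equation of Fig. 15.5.3 or Fig. 15.5.4, which additionally include the closed-loop DPLL response. The path taps, step size, and input signal are hypothetical.

```python
import random

def fir_output(taps: list, x: list) -> list:
    """Output of an FIR path with the given taps, zero initial state."""
    buf = [0.0] * len(taps)
    out = []
    for xk in x:
        buf = [xk] + buf[:-1]
        out.append(sum(h * b for h, b in zip(taps, buf)))
    return out

def lms_identify(x: list, d: list, n_taps: int = 4, mu: float = 0.1) -> list:
    """Textbook LMS: w <- w + mu * e * x_vec, with e = d - w . x_vec."""
    w = [0.0] * n_taps
    buf = [0.0] * n_taps
    for xk, dk in zip(x, d):
        buf = [xk] + buf[:-1]                       # newest sample first
        y = sum(wi * bi for wi, bi in zip(w, buf))  # filter output
        e = dk - y                                  # residual to be minimized
        w = [wi + mu * e * bi for wi, bi in zip(w, buf)]
    return w

rng = random.Random(0)
true_path = [0.8, -0.3, 0.1, 0.05]                  # hypothetical coupling path
x = [rng.uniform(-1.0, 1.0) for _ in range(3000)]   # known aggressor samples
d = fir_output(true_path, x)                        # observed disturbance
w = lms_identify(x, d)
print([round(wi, 3) for wi in w])
```

In this noise-free setting the adapted taps converge to the path taps; since the update runs continuously, it can also follow slow changes in the path, which is the property the text relies on for tracking PVT variation.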

In this work, a second DPLL is implemented that serves as a sinusoidal aggressor, and external injection points are used for interfering with the modulated signal. The prototype occupies an active area of 0.48mm<sup>2</sup> in 65nm CMOS, consuming 21.3mW from a 1V supply. Before injecting interference, the DPLL operates at a 3.2GHz carrier frequency with a 32MHz reference clock, and the measured integrated rms jitter is 394fs. The M9502A AWG generates two 64-QAM modulated pulling signals that are simultaneously coupled to the DCO and reference paths and that differ only in amplitude and phase to mimic different coupling channel responses. After coupling the modulated signal, the integrated rms jitter degrades to 3.39ps, but it improves to 1.04ps after activating the mitigation technique. In the spectrum, a wideband 12dB reduction of the interference-related artifacts is observed after mitigation (Fig. 15.5.5). For the oscillator mutual-pulling case, the second DPLL is locked 3.2MHz away from the victim DPLL carrier frequency. After mitigation, the coupled spur level is improved by 22.5dB. Figure 15.5.6 summarizes the performance of prior PLLs and the PLL described in this work, which demonstrates simultaneous pulling mitigation capability. Figure 15.5.7 shows the die micrograph.

### Acknowledgement:

The authors would like to thank the DARPA RF-FPGA program for funding support.

### References:

- [1] R. Staszewski, et al., "Spur-Free Multirate All-Digital PLL for Mobile Phones in 65 nm CMOS," *IEEE JSSC*, vol. 46, no. 12, pp. 2904-2919, Dec. 2011.
- [2] A. Mirzaei and H. Darabi, "Pulling Mitigation in Wireless Transmitters," *IEEE JSSC*, vol. 49, no. 9, pp. 1958-1970, Sep. 2014.
- [3] G. Puma and C. Carbone, "Mitigation of Oscillator Pulling in SoCs," *IEEE JSSC*, vol. 51, no. 2, pp. 348-356, Feb. 2016.
- [4] R. Winoto, et al., "A 2x2 WLAN and Bluetooth Combo SoC in 28nm CMOS with On-Chip WLAN Digital Power Amplifier, Integrated 2G/BT SP3T Switch and BT Pulling Cancellation," *ISSCC*, pp. 170-171, Feb. 2016.
- [5] C.-R. Ho and M. Chen, "Interference-Induced DCO Spur Mitigation for Digital Phase Locked Loop in 65-nm CMOS," *ESSCIRC*, pp. 213-216, Sep. 2016.



Figure 15.5.1: Dither-assisted technique differentiates the signals from simultaneous interference coupling.



Figure 15.5.2: Digital PLL architecture with the DCO-pulling and reference-pulling mitigation scheme.



Figure 15.5.3: DCO-pulling mitigation scheme and the update function of the LMS loop.



Figure 15.5.4: Reference-mitigation scheme and the update function of the LMS loop.



Figure 15.5.5: Measured DPLL spectrum under PA pulling (top) and oscillator mutual pulling (bottom) before/after enabling the proposed pulling-mitigation scheme.

|                                                        | ISSCC 2011<br>R.B. Staszewski | JSSC 2012<br>I. Bashir | JSSC 2014<br>A. Mirzaei             | JSSC 2015<br>G. Puma | ISSCC 2016<br>R. Winoto | This Work                         |
|--------------------------------------------------------|-------------------------------|------------------------|-------------------------------------|----------------------|-------------------------|-----------------------------------|
| Technology                                             | 65nm                          | 65nm                   | 40nm                                | 65nm                 | 28nm                    | 65nm                              |
| PLL Type                                               | Digital                       | Digital                | Analog                              | Digital              | Digital                 | Digital                           |
| PA Pulling Mitigation                                  | Yes                           | Yes                    | Yes                                 | Yes                  | Yes                     | Yes                               |
| Oscillator Mutual Pulling Mitigation                   | No                            | No                     | Yes                                 | No                   | No                      | Yes                               |
| Pulling Coupling Path                                  | DCO                           | DCO                    | DCO                                 | DCO                  | DCO                     | DCO / REF / DCO+REF               |
| Mitigation Method                                      | Dithering                     | Delay/Phase Adjustment | Auxiliary ADC & Digital Calibration | Adaptive filter      | Adaptive Filter         | Dither-assisted & Adaptive Filter |
| Experimental PA Pulling Signal                         | GFSK                          | EDGE                   | 16 QAM                              | 8 DPSK               | N/A                     | 64 QAM                            |
| PA Pulling Signal Roll-off Bandwidth [MHz]             | N/A                           | N/A                    | <1                                  | <1                   | N/A                     | 3.2                               |
| Oscillator Mutual Pulling Improvement in Spectrum [dB] | -                             | -                      | 20                                  | -                    | -                       | 22.5                              |
| PA Pulling Improvement in Spectrum [dB]                | 5                             | 2                      | 15                                  | 10                   | N/A                     | 12                                |
| Core Power [mW]                                        | 38.4                          | N/A                    | N/A                                 | N/A                  | N/A                     | 21.3                              |
| Active Area [mm²]                                      | 0.35                          | N/A                    | N/A                                 | N/A                  | N/A                     | 0.48                              |

Figure 15.5.6: Comparison with prior PLLs with various pulling-mitigation techniques.



Figure 15.5.7: Die micrograph.

## 15.6 A 0.01mm<sup>2</sup> 4.6-to-5.6GHz Sub-Sampling Type-I Frequency Synthesizer with -254dB FOM

Ahmad Sharkia, Shahriar Mirabbasi, Sudip Shekhar

University of British Columbia, Vancouver, Canada

Power consumption, Performance in terms of phase noise and integrated jitter, and Area (PPA) are three design metrics that have driven countless research efforts in CMOS frequency-synthesizer design. Design limitations and system-level tradeoffs have made simultaneous optimization of the PPA metrics in PLLs challenging. In traditional Type-II charge-pump (CP) based PLLs, power is consumed in the VCO, divider ($N$), and CP to improve noise performance, and area is consumed in large loop-filter (LF) capacitors. ADPLLs are attractive due to their compact LF, but are limited in noise performance. Sub-sampling (SS) PLLs eliminate divider noise, and remove the $N^2$ amplification of the phase detector (PD), CP, and LF noise, thereby improving the overall phase-noise performance [1]. However, their area is large due to the LF capacitors [1]. The performance of CPs and ring-VCOs in traditional or SS Type-II PLLs is also encumbered by reduced $V_{DD}$ in scaled CMOS processes. Overall, PLLs with ring-VCOs have higher noise [2,3], and PLLs with LC-VCOs have larger area [1]. Figure 15.6.1 highlights the PPA tradeoffs in prior art.

In this work, we present a frequency synthesizer in an active area of  $100 \times 100 \mu\text{m}^2$ , with an integrated jitter of less than  $200\text{fs}_{\text{rms}}$  and total power consumption of  $1.1\text{mW}$ . The power-performance FOM [1] is  $-254\text{dB}$ . The circuit area is reduced by (1) a proposed SS-PD/LF that simultaneously acts as a PD and a compact LF in a Type-I architecture, and (2) placing the entire synthesizer under the VCO inductor. Noise and spurs are reduced by (1) the proposed switched-capacitor SS-PD/LF, and (2) careful floorplanning of circuits under the inductor. Power is reduced by (1) using a low supply voltage of  $0.8\text{V}$  for the entire synthesizer, (2) careful floorplanning to minimize inductor- $Q$  degradation, and (3) optimizing the VCO power/noise performance permitted by the larger loop BW ( $f_{\text{BW}}$ ) due to the Type-I architecture [3].

Figure 15.6.2 shows the proposed frequency synthesizer, composed of an all-digital frequency-locked loop (ADPLL), and a Type-I SS-PLL with all-passive loop components. The ADPLL controls the 7b capacitor bank of the complementary cross-coupled VCO for coarse tuning, and the SS-PLL controls the differential varactors for fine tuning. The ADPLL uses a frequency counter to measure the VCO output frequency and linearly adjusts the VCO capacitor bank based on the frequency error. After the ADPLL acquires lock close to the desired output frequency ($f_{\text{OUT}} \approx N \times f_{\text{REF}}$), the SS-PLL is enabled, with the differential control nodes of the VCO varactors already precharged to the VCO output common-mode voltage to allow for a seamless phase acquisition.
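The counter-based coarse tuning described above can be sketched as follows. This is a minimal behavioral model, not the authors' implementation: the linear VCO model `vco_freq_hz`, the 64-cycle counting window, and the proportional update rule are assumptions made for illustration; only the 7b capacitor bank, the 4.6-to-5.6GHz range, and the 100MHz reference come from the text.

```python
def vco_freq_hz(code, f_min=4.6e9, f_max=5.6e9, bits=7):
    """Hypothetical VCO model: frequency falls linearly as the capacitor code rises."""
    steps = (1 << bits) - 1
    return f_max - (f_max - f_min) * code / steps


def count_cycles(f_vco, f_ref, ref_cycles):
    """Frequency counter: whole VCO cycles seen during `ref_cycles` reference periods."""
    return int(f_vco * ref_cycles / f_ref)


def coarse_lock(n_target, f_ref=100e6, ref_cycles=64, bits=7, max_iter=32):
    """Adjust the capacitor code in proportion to the counted frequency error."""
    steps = (1 << bits) - 1
    hz_per_code = (5.6e9 - 4.6e9) / steps  # assumed tuning slope of the model above
    code = steps // 2                      # start mid-range
    for _ in range(max_iter):
        counted = count_cycles(vco_freq_hz(code, bits=bits), f_ref, ref_cycles)
        err_hz = (counted - n_target * ref_cycles) * f_ref / ref_cycles
        step = round(err_hz / hz_per_code)  # linear correction, in capacitor codes
        if step == 0:
            break                           # within one code of the target
        code = min(steps, max(0, code + step))  # more capacitance lowers the frequency
    return code, vco_freq_hz(code, bits=bits)


code, f_out = coarse_lock(n_target=50)
print(f"code={code}, f_out={f_out / 1e9:.4f} GHz")
```

In this model the loop settles within one capacitor step of $N \times f_{\text{REF}}$, which is the point at which the fine-tuning loop would take over.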

Prior works have mitigated the limited lock range of traditional Type-I PLLs by boosting the forward gain [3,4] and attenuated the large reference-spurs by sampling the LF output [4] and by adding harmonic traps [3]. The proposed dual-loop design (ADPLL and PLL), combined with the large gain in SS-PDs [1], relaxes the lock-range limitation. Furthermore, the proposed SS-PD (shown in Fig. 15.6.3) acts as an all-passive, voltage-mode, switched-capacitor LF. This SS-PD/LF naturally alleviates the reference-spur limitation, as it inherently outputs a DC voltage, whose amplitude is proportional to the phase error. This is in contrast to the phase-frequency detectors (PFDs) in traditional Type-II or Type-I PLLs [4], which output a PWM signal, whose duty-cycle is proportional to the phase-error. Figure 15.6.3 shows the SS-PD/LF schematic. Complementarily clocked switches and capacitors  $C_2$  and  $C_3$  act as an SS-PD, while simultaneously acting as resistors in the equivalent LF model. The SS-PD/LF capacitors ( $C_1-C_4$ ) are orders of magnitude smaller than the LF capacitors used in traditional or SS Type-II PLLs.
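The statement that the clocked capacitors act as resistors follows from the standard switched-capacitor equivalence $R_{eq} = 1/(f_{clk} C)$. The sketch below evaluates it at the 100MHz reference rate; the capacitor values are hypothetical (the text gives none), and the single-pole RC corner is only a first-order illustration, not the equivalent LF model of Fig. 15.6.3.

```python
import math


def switched_cap_resistance(f_clk_hz, c_farads):
    """Average resistance of a capacitor toggled between two nodes at f_clk."""
    return 1.0 / (f_clk_hz * c_farads)


def rc_corner_hz(r_ohms, c_farads):
    """-3dB corner of a first-order RC low-pass."""
    return 1.0 / (2 * math.pi * r_ohms * c_farads)


F_REF = 100e6        # reference frequency used in the measurements
C_SWITCHED = 50e-15  # hypothetical switched capacitor
C_HOLD = 500e-15     # hypothetical hold capacitor

r_eq = switched_cap_resistance(F_REF, C_SWITCHED)
f_c = rc_corner_hz(r_eq, C_HOLD)
print(f"R_eq = {r_eq / 1e3:.0f} kOhm, corner = {f_c / 1e6:.2f} MHz")
```

With these assumed values, femtofarad-scale capacitors already give a resistance of hundreds of kilohms, which suggests why the filter can stay compact.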

As Type-I PLLs do not require an integrator in the forward path, a CP becomes unnecessary. The removal of the CP relaxes the voltage headroom constraints on the PLL, and allows for better scalability with process and voltage. Furthermore, with only one integrator, Type-I PLLs offer inherent stability (unlike Type-II PLLs). This inherent stability relaxes the design constraints of the LF, and allows for a much more compact design in comparison to the large LF capacitors needed in traditional or SS Type-II PLLs.

Furthermore, the relaxed stability in the proposed Type-I PLL allows for a larger  $f_{\text{BW}}$  for a given  $f_{\text{REF}}$ . This, in turn, reduces the noise requirements on the VCO. A larger  $f_{\text{BW}}$  is also used in a prior-art Type-I PLL [3]. However, the PD in [3] requires ADC-based calibration for maintaining the loop gain and  $f_{\text{BW}}$ .

For compact PLLs, ring VCOs are often used instead of LC-VCOs, but with an overhead to the overall noise and power. In this work, apart from a compact SS-PD/LF, the entire design (ADPLL, SS-PLL, VCO, divider, reference-buffer) is placed underneath the LC-VCO inductor to further reduce the overall synthesizer area, as shown in Fig. 15.6.4. Prior works have placed the VCO [5], portions of the LF [6], or portions of the synthesizer [7] underneath the VCO inductor with negligible degradation to the inductor $Q$. Placing a complete synthesizer underneath the LC-VCO inductor raises the risk of electromagnetic (EM) coupling between the VCO inductor and the rest of the analog/digital blocks, potentially degrading the inductor $Q$ and increasing noise coupling. Several techniques are used in this work to circumvent such adverse effects. To minimize electric coupling, a patterned ground shield (PGS) is placed between the inductor and the circuit blocks. Unlike typical inductor PGSs, which are placed on the lowest metal layers and are aimed at electrically shielding the inductor from the substrate, the PGS implemented herein is aimed at shielding the inductor from the rest of the circuits, and is placed directly underneath the inductor and above the circuits, on metal layer 7 (out of 9 layers). This PGS also helps in shifting the peak inductor-$Q$ frequency closer to $f_{\text{OUT}}$, with self-resonance still $>20\text{GHz}$. Furthermore, to minimize magnetic coupling (eddy currents) and inductor-$Q$ degradation, careful layout and routing techniques have been used: (1) closed current loops are avoided at all costs, (2) routing is done perpendicular to the inductor wires as much as possible, and (3) custom-designed metal-oxide-metal (MOM) capacitors with intertwined-comb fingers of minimum metal width and spacing, which minimize eddy currents (unlike MIM capacitors), are used in the VCO capacitor-bank and SS-PD/LF.

The prototype 4.6-to-5.6GHz frequency synthesizer was fabricated in a 65nm CMOS process (Fig. 15.6.7), and tested in a CQFP-44 package. The synthesizer occupies a core area of  $100 \times 100 \mu\text{m}^2$ , including the reference buffer, ADPLL, SS-PLL, VCO, and divider. The measured phase noise and output spectrum are shown in Fig. 15.6.5. All synthesizer measurements are performed on the divided-by-2 signal of the VCO output, driven by a low-noise reference,  $f_{\text{REF}} = 100\text{MHz}$ . At  $f_{\text{OUT}} = 5\text{GHz}$ , the measured in-band phase noise at 1MHz offset is  $-120.4\text{dBc/Hz}$ . The integrated jitter (10kHz to 50MHz) is  $185.3\text{fs}_{\text{rms}}$ . The reference spur of the SS-PLL is  $-64.1\text{dBc}$ .

Figure 15.6.6 summarizes the measured performance of the proposed synthesizer. All measurements are performed with  $V_{DD} = 0.8\text{V}$ . The synthesizer consumes no more than  $1.1\text{mW}$  across its entire tuning range, including the power consumption of the ADPLL, unlike [1,2]. Only the power consumed in the  $50\Omega$  output driver is excluded. The proposed synthesizer achieves an FOM of  $-254\text{dB}$  across its entire tuning range, and compares favorably to prior designs in Fig. 15.6.6 while optimizing all PPA metrics. The proposed LC-VCO-based synthesizer achieves the best-reported power-performance FOM among designs in Fig. 15.6.6, in an area of a ring-based PLL. Lastly, low  $V_{DD}$  operation, a standard-cell-based ADPLL, and an all-passive, switched-capacitor SS-PD/LF make this design promising for implementation in more advanced CMOS processes.
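The quoted FOM can be checked against the measured jitter and power with the jitter-power figure of merit of [1], $\text{FOM} = 10\log_{10}\left[(\sigma_t/1\text{s})^2 \cdot (P/1\text{mW})\right]$. The sketch below applies it to three columns of Fig. 15.6.6; the function name and dictionary layout are illustrative only.

```python
import math


def jitter_fom_db(jitter_rms_s, power_w):
    """Jitter-power FOM: 10*log10((jitter / 1 s)^2 * (power / 1 mW))."""
    return 10 * math.log10((jitter_rms_s / 1.0) ** 2 * (power_w / 1e-3))


# (integrated jitter in seconds, power in watts), taken from Fig. 15.6.6
designs = {
    "This work": (185.3e-15, 1.1e-3),
    "J. Chuang ISSCC'17": (720e-15, 4.59e-3),
    "X. Gao VLSI'10": (160e-15, 2.5e-3),
}

fom = {name: jitter_fom_db(j, p) for name, (j, p) in designs.items()}
for name, value in fom.items():
    print(f"{name}: {value:.1f} dB")
```

The computed values round to the -254dB, -236dB, and -252dB entries listed in the table.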

### Acknowledgment:

The authors acknowledge NSERC and Intel for financial support, CMC for access to CAD tools, Analog FastSPICE from Mentor Graphics, and F. P. O'Mahony, J. Rogers, T. Karnik and V. De of Intel for support.

### References:

- [1] X. Gao, et al., "A Low Noise Sub-Sampling PLL in Which Divider Noise is Eliminated and PD/CP Noise is not Multiplied by  $N^2$ ," *IEEE JSSC*, vol. 44, no. 12, pp. 3253–3263, Dec. 2009.
- [2] J. Chuang and H. Krishnaswamy, "A 0.0049mm<sup>2</sup> 2.3GHz Sub-Sampling Ring-Oscillator PLL with Time-Based Loop Filter Achieving  $-236.2\text{dB}$  jitter-FOM," *ISSCC*, pp. 328–329, Feb. 2017.
- [3] L. Kong and B. Razavi, "A 2.4GHz 4mW Inductorless RF Synthesizer," *ISSCC*, pp. 450–451, Feb. 2015.
- [4] A. Sharkia, et al., "A High-Performance, Yet Simple to Design, Digital-Friendly Type-I PLL," *IEEE CICC*, pp. 1-4, Sept. 2015.
- [5] F. Zhang and P. Kinget, "Design of Components and Circuits Underneath Integrated Inductors," *IEEE JSSC*, vol. 41, no. 10, pp. 2265–2271, Oct. 2006.
- [6] Y.-L. Hsueh, et al., "A 0.29mm<sup>2</sup> Frequency Synthesizer in 40nm CMOS with $0.19\text{ps}_{\text{rms}}$ jitter and $<-100\text{dBc}$ Reference Spur for 802.11ac," *ISSCC*, pp. 472–473, Feb. 2014.
- [7] A. Shirazi, et al., "A 980μW 5.2dB-NF Current-Reused Direct-Conversion Bluetooth Low-Energy Receiver in 40nm CMOS," *IEEE CICC*, pp. 1-4, May 2017.

| LC-PLL Architecture               | Type-II                                                                            | Sub-Sampling Type-II                             | ADPLL                                                                     | Proposed                                                        |
|-----------------------------------|------------------------------------------------------------------------------------|--------------------------------------------------|---------------------------------------------------------------------------|-----------------------------------------------------------------|
| Power                             | CP<br>Divider                                                                      | CP<br>No Divider                                 | TDC<br>Divider                                                            | No CP<br>No Divider                                             |
| Jitter / Phase-Noise              | $(PFD \cdot CP \cdot LF \text{ noise}) \times N^2$<br>(Divider noise) $\times N^2$ | No $N^2$ noise amplification<br>No divider noise | $(TDC \cdot LF \text{ noise}) \times N^2$<br>(Divider noise) $\times N^2$ | No $N^2$ noise amplification<br>No divider noise<br>No CP noise |
| Area                              | Large LF cap<br>VCO inductor                                                       | Large LF cap<br>VCO inductor                     | No LF cap<br>VCO inductor                                                 | Compact LF<br>VCO inductor<br>PLL underneath                    |
| Process / V <sub>DD</sub> Scaling | Analog-intensive CP                                                                | Analog-intensive CP                              | Mostly digital                                                            | Mostly digital                                                  |

  

| VCO                | LC     | Ring    |
|--------------------|--------|---------|
| Power              | Lower  | Higher  |
| Jitter/Phase-Noise | Lower  | Higher  |
| Area               | Larger | Smaller |

Figure 15.6.1: Sub-sampling Type-I synthesizer compared to prior art in terms of PPA metrics.



Figure 15.6.3: SS-PD/LF (single-ended half circuit shown), equivalent LF diagram, and SS-PD transfer function.



Figure 15.6.5: Spectrum and phase noise of the divide-by-2 output of the frequency synthesizer operating at 5GHz.



Figure 15.6.2: Synthesizer with an All-Digital FLL and a sub-sampling Type-I PLL with all-passive loop components.



Figure 15.6.4: Frequency-synthesizer-under-inductor layout with patterned ground shield (PGS) in between.

|                         | J. Chuang<br>ISSCC'17 | A. Eshazy<br>ISSCC'17 | X. Gao<br>VLSI'10 | A. Elkholy<br>ISSCC'15 | I-T. Lee<br>TCAS-II'13 | This work |
|-------------------------|-----------------------|-----------------------|-------------------|------------------------|------------------------|-----------|
| CMOS Tech.              | 65nm                  | 0.13μm                | 0.18μm            | 65nm                   | 40nm                   | 65nm      |
| Supply [V]              | 1.2                   | 1.1                   | 1.8               | 0.9                    | 0.9                    | 0.8       |
| VCO                     | Ring                  | Ring                  | LC                | LC                     | LC                     | LC        |
| Out. Freq. [GHz]        | 2.3                   | 1.5                   | 2.21              | 6.8                    | 4.8                    | 5         |
| Ref. Freq [MHz]         | 192                   | 375                   | 55.25             | 106.25                 | 400                    | 100       |
| Fout/Fref (N)           | 12                    | 4                     | 40                | 64                     | 12                     | 50        |
| Ref. Spur [dBc]         | -37                   | -55.6                 | -56               | -40                    | -52.7                  | -64.1     |
| Integ. Jitter [fs rms]  | 720                   | 400                   | 160               | 190                    | 123                    | 185.3     |
| Power [mW]              | 4.59                  | 0.89                  | 2.5               | 2.25                   | 3.7                    | 1.1       |
| Area [mm <sup>2</sup> ] | 0.0049                | 0.25                  | 0.2               | 0.19                   | 0.151                  | 0.01      |
| FoM [dB]                | -236                  | -248                  | -252              | -251                   | -252                   | -254      |



Figure 15.6.6: Performance summary and comparison with prior-art integer-N synthesizers.



Figure 15.6.7: Die micrograph.

## 15.7 A Dividerless Reference-Sampling RF PLL with -253.5dB Jitter FOM and <-67dBc Reference Spurs

Jahnnavi Sharma, Harish Krishnaswamy

Columbia University, New York, NY

In the recent past, there have been exciting advances in dividerless PLLs, such as sub-sampling PLLs (SSPLLs) [1,2] and injection-locked clock multipliers (ILCMs) [3], that substantially reduce loop noise to cross the -250dB jitter-power figure-of-merit (FOM<sub>j</sub>) barrier. However, there exists a fundamental trade-off between FOM<sub>j</sub> and reference spurs in PLLs, although the mechanisms vary across architectures. Narrow PLL bandwidths are necessary for reducing spurs through filtering, but this can conflict with the optimal bandwidth for jitter. In SSPLLs, buffers isolating the VCO from the sub-sampling phase detector (SSPD) (Fig. 15.7.1) reduce spurs at the expense of noise and power consumption. Smaller sample capacitances in the SSPD reduce spurs generated by mismatch-induced charge sharing, charge injection, and tank frequency modulation at the expense of increased kT/C noise. Consequently, the SSPLL of [2] achieves a spur <-80dBc by using isolation buffers, a small sample capacitance (and another DLL-based technique) but exhibits an FOM<sub>j</sub> of -244.6dB. In the SSPLL of [1], the elimination of this isolation buffer and the use of a larger capacitance result in a better FOM<sub>j</sub> of -252dB but a spur of -56dBc. The ILCM in [3] operates with large injection to enable locking to a high multiple of the reference, but this degrades spurs. The absence of noisy loop components yields a very low FOM<sub>j</sub>, but the large injection leads to a spur of -43dBc. Also, ILCMs do not feature explicit phase detectors, limiting the optimization of loop dynamics.

In this paper, we propose a new dividerless PLL architecture - the reference-sampling PLL (RSPLL) - that combines the best aspects of the SSPLL and the ILCM by (i) merging the sampler clock buffer with the VCO isolation buffer and (ii) eliminating all other noisy loop components to simultaneously achieve low noise and low spur. A 2.05-to-2.55GHz explicit reference-sampling PLL achieves an FOM<sub>j</sub> of -253.5dB and reference spur <-67dBc.

The basic premise is outlined in Fig. 15.7.1. The loop uses a buffered and gated VCO waveform to directly sample the low-noise reference-crystal sinewave near the reference zero-crossing, as opposed to having a buffered square-wave reference sample the VCO sinewave as in SSPLLs. This eliminates the large, noisy, and power-hungry reference buffer necessary in SSPLLs, as the VCO feedback buffer essentially combines the functionalities of clock buffering for the sampler and the isolation of the VCO from the sampler. The noise of this slewing inverter-buffer in the VCO feedback path has an N<sup>2</sup> contribution to the PLL output, like the reference buffer in an SSPLL or ILCM (Fig. 15.7.3). However, it should be emphasized that it does not have higher power consumption due to its N× higher frequency of operation, because the power of an inverter driven by a sinewave is dominated by largely-frequency-independent short-circuit or crowbar current. This improves the spur significantly compared to the VCO-bufferless SSPLL in [1], and eliminates the duplicate noise penalty of the spur-suppressing VCO buffer and the reference buffer in the SSPLL in [2]. Additionally, of multiple samples potentially produced by the VCO edge, only the sample near the reference zero-crossing contains phase error information (Fig. 15.7.1). Therefore, a sample edge selection circuit (SESCI) terminates sampling after the relevant edge, and in doing so, it reduces switching activity in the feedback path to further lower loop power consumption. The penalty of sampling the reference is a virtual division-by-N. The sampler generates a sample V<sub>PD</sub> proportional to the phase error ΔΦ<sub>e</sub> between the VCO and reference, V<sub>PD</sub> ∝ A<sub>REF</sub>ΔΦ<sub>e</sub>/N, where A<sub>REF</sub> is the reference sinewave amplitude. Consequently, the noise of loop components including and after the phase detector is multiplied by N<sup>2</sup>, unlike SSPLLs.
However, using Type-I loop dynamics to eliminate the charge pump, realizing essentially a loop with no additional noisy loop components similar to ILCMs, keeps the in-band loop noise, and consequently the FOM<sub>j</sub>, very low despite the virtual division by N.

The complete circuit diagram is in Fig. 15.7.2. As mentioned earlier, a buffered VCO square-wave clock is used to sample the reference sinewave, and the SESCI terminates sampling after the VCO- sampling edge closest to the sinewave zero-crossing. This is done by gating the sampling clock with a selection edge (SampleEDGE) generated by AND-ing a buffered reference square-wave with the antiphase VCO+ square-wave to generate the effective sampling clock TRACK. As SampleEDGE is derived from VCO+, gating always happens after sampling by VCO- is complete. Hence, the on-time of the VCO- clock, and therefore noise on the voltage sample, is affected only by the noise on the first VCO-/VCO+ buffers. In other words, the reference buffer in the SESCI can be very low power (170μW) as its noise does not affect the tracking period. It has a tunable delay, which is calibrated one time to match its rising edge with the reference sinewave zero-crossing. Owing to the virtual division by N, the loop noise must be kept low. Therefore, the voltage sample is used directly to control the VCO without a charge pump, realizing a Type-I loop. Instead of using a S/H in the phase detector, we use a half-reference multiplexing scheme (Fig. 15.7.2) to keep the control voltage constant over a reference period. This results in a half-harmonic spur at the output, but is not fundamental to the RSPLL. The initial VCO frequency is set within the acquisition range using a digital capacitor bank.

Unlike the SSPD and the ILCM, the proposed phase detector shown in Fig. 15.7.2 is very linear for 2π VCO phase error due to the virtual division by N. K<sub>PD</sub> is almost constant, and the noise performance is independent of the phase error at lock, obviating the need for frequency-locking or frequency-tracking circuitry that ensures that the loop locks to the center of the lock range (verified in Fig. 15.7.5).
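The linearity claim can be illustrated numerically. A VCO phase error ΔΦ<sub>e</sub> shifts the sampling instant by ΔΦ<sub>e</sub>/(2πf<sub>VCO</sub>), which corresponds to only ΔΦ<sub>e</sub>/N radians of the reference sinewave, so the sampled value stays on the nearly straight part of the sine around its zero-crossing. The sketch below is an idealized model: the unit amplitude and the ideal sampler are assumptions, and N=51 is taken as 2.55GHz/50MHz.

```python
import math


def sampled_voltage(dphi_vco_rad, n, a_ref=1.0):
    """Reference sinewave value at a VCO edge offset by dphi_vco_rad from the zero-crossing."""
    return a_ref * math.sin(dphi_vco_rad / n)


def linear_model(dphi_vco_rad, n, a_ref=1.0):
    """Small-angle model V_PD = A_REF * dPhi / N."""
    return a_ref * dphi_vco_rad / n


N = 51  # 2.55 GHz output with a 50 MHz reference
# Sweep the VCO phase error over (-pi, pi], skipping the zero point.
phases = [math.pi * k / 100 for k in range(-99, 101) if k != 0]
worst_rel_error = max(
    abs(sampled_voltage(p, N) - linear_model(p, N)) / abs(linear_model(p, N))
    for p in phases
)
print(f"worst-case deviation from a straight line: {worst_rel_error * 100:.3f} %")
```

In this idealized model the deviation from a straight line stays below 0.1% over the full 2π range of VCO phase error.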

We now discuss the loop noise. A comparison to prior SSPLLs [1,2] is shown in Fig. 15.7.3. The reference buffer (and isolation buffer if present) are the dominant sources of in-band phase noise in SSPLLs. In the RSPLL, as mentioned earlier, their functionality is combined in the clock-and-isolation VCO buffer. The VCO swing is large enough to cause slewing in this buffer, making its output slope independent of frequency and causing its noise to be multiplied by N<sup>2</sup> at the PLL output, similar to the reference-buffer noise in PLLs [1-3]. However, this buffer can also be sized smaller, since it buffers a noisier (VCO vs. reference) sinewave, and as a result, it contributes 3-to-6dB lower in-band noise than prior SSPLLs [1,2] while consuming similar power. In-band phase noise from the sampler in the RSPLL, including the effect of noise-folding, is N<sup>2</sup> higher than the kT/C noise in an SSPD due to the virtual division by N. Here, we use a larger (10pF) C<sub>SAMP</sub> designed to generate comparable kT/C noise to the SSPD at the expense of a much larger capacitor than the C<sub>SAMP,SSPD</sub> of 10fF in [1]. However, in using a Type-I loop, the area of loop components remains comparable to the SSPLL, as there is no large integrating loop-filter capacitance. Despite the large C<sub>SAMP</sub> and fast sampling clock, the sampler R<sub>SWITCH</sub>C<sub>SAMP</sub> is only 4.5× larger than the clock on-time as the input waveform is much (N=52×) slower than the clock. Even so, the switch is quite large (here 64μm/60nm), and the power of the inverters driving it goes up as N× but is reduced by gating the inverters using the SESCI as discussed earlier. Here we have gated these by only 50%, but more gating (by further expanding SampleEDGE duty cycle through a delay in the DFF reset of Fig. 15.7.2) improves the RSPLL FOM<sub>j</sub> even further.
With these innovations, the in-band noise of the loop components is very low (-127dBc/Hz at 200kHz for 2mW), and the output noise of the Type-I loop is limited by the 20dB/dec suppression of the VCO noise. Here, a cross coupled LC-VCO with a measured FOM<sub>VCO[1MHz]</sub> of 184.1dB limits the PLL in-band noise to -121dBc/Hz. Leveraging recent LC-VCO innovations with ~10dB better FOM<sub>VCO</sub> would improve the in-band noise and FOM<sub>j</sub> even further.
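The capacitor-sizing argument can be made concrete with a rough calculation. The sketch below computes the kT/C noise voltage for the 10pF and 10fF sampling capacitors and compares the 1000× reduction in noise power with the N<sup>2</sup> penalty of the virtual division. The 300K temperature and N=51 (2.55GHz/50MHz) are assumptions, and the comparison ignores any difference in sampled-waveform amplitude and in noise folding between the two detectors, so it is only an order-of-magnitude check.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K


def ktc_noise_vrms(c_farads, temp_k=300.0):
    """RMS voltage noise sampled onto a capacitor: sqrt(kT/C)."""
    return math.sqrt(K_B * temp_k / c_farads)


C_SAMP_RSPLL = 10e-12  # 10 pF sampling capacitor of the RSPLL
C_SAMP_SSPD = 10e-15   # 10 fF sampling capacitor quoted for the SSPD of [1]
N = 51                 # assumed multiplication factor

v_rspll = ktc_noise_vrms(C_SAMP_RSPLL)
v_sspd = ktc_noise_vrms(C_SAMP_SSPD)

cap_benefit_db = 20 * math.log10(v_sspd / v_rspll)  # noise-power reduction from the larger C
n2_penalty_db = 20 * math.log10(N)                  # N^2 multiplication expressed in dB

print(f"kT/C noise: {v_rspll * 1e6:.1f} uV (10 pF) vs {v_sspd * 1e6:.1f} uV (10 fF)")
print(f"capacitor benefit {cap_benefit_db:.1f} dB vs N^2 penalty {n2_penalty_db:.1f} dB")
```

Under these assumptions the two terms differ by only a few dB, in line with the statement that the larger capacitor brings the sampler noise to a level comparable with the SSPD.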

The prototype implemented in 65nm CMOS (Fig. 15.7.7) has an integrated jitter of 109.6fs at 2.55GHz (f<sub>REF</sub>=50MHz) with 3.5mW of power consumption, resulting in an FOM<sub>j</sub> of -253.5dB with a 25/50MHz spur of -63/-67dBc (Fig. 15.7.4). The measured phase noise and spur across the center frequency and supply voltage are shown in Fig. 15.7.5. The RSPLL locks and operates robustly across a large variation in the VCO supply, verifying the benefit of the monotonic phase detector. Compared to existing integer-N synthesizers in Fig. 15.7.6, this work achieves the record FOM<sub>j</sub> with simultaneously low spurs.

### Acknowledgement:

We acknowledge the IBM PhD Fellowship program and in particular Dr. Mark Ferriss and Dr. Alberto Valdes Garcia of IBM Research, Yorktown Heights.

### References:

- [1] X. Gao, et al., "A 2.2GHz Sub-Sampling PLL with 0.16ps<sub>rms</sub> Jitter and -125dBc/Hz In-band Phase Noise at 700μW Loop-Components Power," *IEEE Symp. VLSI Circuits*, pp. 139-140, June 2010.
- [2] X. Gao, et al., "Spur-Reduction Techniques for PLLs Using Sub-Sampling Phase Detection," *ISSCC*, pp. 474-475, Feb. 2010.
- [3] A. Elkholy, et al., "A 6.75-to-8.25GHz, 250fs<sub>rms</sub>-Integrated-Jitter 3.25mW Rapid On/Off PVT-Insensitive Fractional-N Injection-Locked Clock Multiplier in 65nm CMOS," *ISSCC*, pp. 192-193, Feb. 2016.
- [4] Z. Ru, et al., "A 12GHz 210fs 6mW Digital PLL with Sub-sampling Binary Phase Detector and Voltage-Time Modulated DCO," *IEEE Symp. VLSI Circuits*, pp. 194-195, June 2013.



Figure 15.7.1: RSPLL uses a gated VCO waveform to sample a sinewave reference to realize ultra-low FOM<sub>j</sub> and spurs simultaneously.

| Block            | Power consumption                                                                                                        | In-band PN from block at PLL output                                                                                                                |
|------------------|--------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|
| <b>RSPLL</b>     |                                                                                                                          |                                                                                                                                                    |
| Ck+Iso. buff.    | (V<sub>DD</sub>-V<sub>TP</sub>-V<sub>TN</sub>)I<sub>SC</sub>(W<sub>BUFF</sub>)V<sub>DD</sub>f<sub>VCO</sub>              | $\frac{(4\pi^2 f_{VCO}^2 V_{DD}^2)}{(I_{SC}/C_{BUFF})^2} \cdot \frac{1}{f_{VCO}} \cdot \frac{1}{f_{REF}} \cdot \text{NOISE FOLD}$                  |
| Phase Det.       | 0                                                                                                                        | KT · V<sub>VCO</sub> · N<sup>2</sup>                                                                                                               |
| Gated Driver     | α<sub>0→1</sub>f<sub>VCO</sub>C<sub>SW</sub>V<sub>DD</sub><sup>2</sup>                                                   | Neglect for hard switching                                                                                                                         |
| VCO              | I<sub>VCO</sub>V<sub>DD</sub>                                                                                            | α(f<sub>L</sub>/f<sub>BW</sub>)<sup>2</sup>                                                                                                        |
| <b>SSPLL</b>     |                                                                                                                          |                                                                                                                                                    |
| Ref. clock buff. | (V<sub>DD</sub>-V<sub>TP</sub>-V<sub>TN</sub>)I<sub>SC</sub>(W<sub>BUFF</sub>)V<sub>DD</sub>f<sub>REF</sub>              | $\frac{(4\pi^2 f_{REF}^2 V_{DD}^2)}{(I_{SC}/C_{BUFF})^2} \cdot \frac{1}{f_{REF}} \cdot \text{SLEW SLOPE}$                                          |
| Isolation buff.  | I<sub>BIAS,ISO</sub>V<sub>DD</sub>                                                                                       | $\frac{(4\pi^2 f_{VCO}^2 V_{DD}^2)}{(I_{BIAS,ISO}A_{VCO}V_{DD})^2} \cdot \frac{1}{f_{VCO}} \cdot \frac{1}{f_{REF}} \cdot \text{NOT SLEWING SLOPE}$ |
| Charge pump      | I<sub>BIAS,CP</sub>V<sub>DD</sub>                                                                                        | 8KT<sub>V</sub>                                                                                                                                    |
| VCO              | I<sub>VCO</sub>V<sub>DD</sub>                                                                                            | α(f<sub>L</sub>/f<sub>BW</sub>)<sup>2</sup>                                                                                                        |

The power-breakdown and simulated phase-noise (at 200kHz offset) charts of the original figure are not reproduced here. Legible values: RSPLL gated driver 0.93mW, Ck+Iso. buffer 500μW, VCO 1.6mW; total P<sub>DC</sub> = 3.5mW (RSPLL), 2.5mW (SSPLL [1]), 3.8mW (SSPLL [2]).

Figure 15.7.3: Noise and power of blocks in RSPLL and SSPLLs in [1,2].



Figure 15.7.5: Measured jitter, FOM<sub>j</sub>, and spur across frequency for multiple samples, and across supply voltage (for 2.55GHz).



Figure 15.7.2: Circuit schematic and timing diagram of the RSPLL. Differential reference is generated from an XO and an off-chip balun.



Figure 15.7.4: Measurements at 2.55GHz with initial 6MHz error. The f<sub>REF</sub>/2 spur due to half-rate multiplexing in sampler (Fig. 15.7.2) is not intrinsic to the RSPLL.

| Arch.                          | Gao VLSI '10 [1]                   | Gao ISSCC '10 [2]                  | Gao ISSCC '09                      | Ru VLSI '13 [4]      | Helal JSSC '09            | Elkholy JSSC '15 [3]                | This work                                                                            |
|--------------------------------|------------------------------------|------------------------------------|------------------------------------|----------------------|---------------------------|-------------------------------------|--------------------------------------------------------------------------------------|
| Freq. [GHz]                    | 2.21                               | 2.21                               | 2.21                               | > 200                | 3.2                       | 6.75-8.25 (20%)                     | 2.05-2.55 (21.7%)                                                                    |
| Mul. Factor N                  | 40                                 | 40                                 | 40                                 |                      | 64                        | 64                                  | 41-51                                                                                |
| PN @ 200kHz <sup>(1)</sup>     | -125                               | -121                               | -126                               | -114.7               | -119.2 <sup>(2)</sup>     | -121.7 <sup>(2)</sup>               | -122.8                                                                               |
| PN@1MHz <sup>(1)</sup>         | -124 <sup>(2)</sup>                | -120.1 <sup>(2)</sup>              | -123.7 <sup>(2)</sup>              | -119 <sup>(2)</sup>  | -130.2                    | -126.2                              |                                                                                      |
| Int. Jitter (fs)<br>(Int. BW)  | 160<br>(10k-100M)                  | 300<br>(10k-100M)                  | 150<br>(100-40M)                   | 210<br>(10k-20M)     | 130<br>(100-40M)          | 104<br>(10k-30M)                    | 110 @ 2.55 GHz<br>(10k-100M)                                                         |
| Spur (-dBc)                    | -56 at f <sub>ref</sub>            | -80 at f <sub>ref</sub>            | -46 at f <sub>ref</sub>            | Not reported         | -63.9 at f <sub>ref</sub> | -43 at f <sub>ref</sub>             | -67 at f <sub>ref</sub><br>-63 at f <sub>ref</sub> /2                                |
| P <sub>DC</sub> (mW)           | 2.5                                | 3.8                                | 7.6                                | 6                    | 28.6                      | 0.25                                | 3.7                                                                                  |
| (VCO + Loop)                   | (1.8+0.7)                          | (1.8+2)                            | (1.8+5.8)                          | (2.5+3.5)            | -243                      | -251                                | -253.5                                                                               |
| CMOS Tech.                     | 180 nm                             | 180 nm                             | 180 nm                             | 65 nm                | 130 nm                    | 65 nm                               | 65 nm                                                                                |
| Supply (V)                     | 1.8                                | 1.8                                | 1.8                                | Not reported         | 1.2                       | 0.9                                 | 1.2/0.5                                                                              |
| Total PLL Area<br>(VCO + Loop) | 0.2 mm <sup>2</sup> <sup>(3)</sup> | 0.2 mm <sup>2</sup> <sup>(3)</sup> | 0.2 mm <sup>2</sup> <sup>(3)</sup> | 0.15 mm <sup>2</sup> | 0.4 mm <sup>2</sup>       | 0.25 mm <sup>2</sup> <sup>(4)</sup> | 0.36 mm <sup>2</sup><br>(0.3 mm <sup>2</sup> <sup>(5)</sup> + 0.06 mm <sup>2</sup> ) |

(1) In-band noise normalized to 2.21 GHz. (2) From measurement in the paper. (3) Limited tuning range and Area VCO~Area<sub>BB</sub>. (4) VCO tuning implemented with MOScap. (5) 21.7% VCO tuning range implemented with MoM cap. Area<sub>BB</sub> = Area<sub>BB,N</sub> = 0.15 mm<sup>2</sup>.



Figure 15.7.6: Comparison with low-noise and low-spur integer-N synthesizers.



Figure 15.7.7: Die micrograph of the RSPLL implemented in 65nm CMOS.

## 15.8 An 82-to-108GHz -181dB-FOM<sub>T</sub> ADPLL Employing a DCO with Split-Transformer and Dual-Path Switched-Capacitor Ladder and a Clock-Skew-Sampling Delta-Sigma TDC

Zhiqiang Huang, Howard Cam Luong, HKUST, Hong Kong, China

Existing ADPLLs such as [1] are limited to 60GHz and cannot operate in the W-band. At 100GHz, DCOs become more sensitive to parasitics, resulting in low frequency resolution. A high-resolution delta-sigma TDC is used in [2] to reduce quantization noise by noise shaping, but it suffers from large noise from its ring oscillator. Moreover, the frequency tuning range of existing W-band VCOs is limited to only ~11% because switched capacitors and varactors have low Q (<3) at these frequencies. To achieve fine tuning without varactors for high Q (>8), a variable inductor, implemented with a transformer and a variable resistor, is used in [3,4]. However, the variable resistor significantly degrades the Q, whose minimum value $Q_{min}$ is limited to $Q_{min} \approx 2/k^2 \approx 2/TR$, where k is the transformer coupling coefficient and TR is the tuning range. Employing a split transformer with variable resistors for wide tuning range, dual-path exponentially scaled switched-capacitor (SC) ladders for high frequency resolution, and a clock-skew-sampling delta-sigma TDC for low close-in phase noise, an 82-to-108GHz ADPLL is demonstrated in this work.

Figure 15.8.1 shows the block diagram of the proposed W-band ADPLL. The DCO consists of a six-port split-transformer as a variable inductor controlled by digital signals $D_0 \sim D_5$ (7 bits for $D_0 \sim D_3$ and 8 bits for $D_4 \sim D_5$) to coarsely tune the output frequency from 82 to 107.6GHz and a dual-path exponentially scaled SC ladder for phase locking. A 3-stage frequency pre-scaler, consisting of 100GHz and 50GHz LC-based injection-locked dividers-by-2 and a CML divider-by-4, divides the DCO output frequency down to ~6GHz, which is further divided down to the reference frequency of ~125MHz by a programmable divider. The phase error of the programmable-divider output is digitized by a clock-skew-sampling delta-sigma TDC and then fed to a digital loop filter (DLF), whose output is split into two paths to control the SC ladders.
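As a rough check of the frequency plan described above, the following Python sketch computes the pre-scaler output frequency and the nominal programmable-divider ratio. The function name `divider_plan` and the use of an exact 125MHz reference are illustrative assumptions; the text only quotes approximate values (~6GHz and ~125MHz).

```python
# Fixed pre-scaler from the text: two injection-locked divide-by-2 stages
# (100GHz and 50GHz) followed by a CML divide-by-4.
PRESCALER_RATIO = 2 * 2 * 4  # = 16


def divider_plan(f_dco_hz, f_ref_hz=125e6):
    """Return (pre-scaler output frequency, nominal programmable-divider ratio).

    f_ref_hz = 125 MHz is an assumed exact value; the paper quotes ~125 MHz.
    """
    f_prescaled = f_dco_hz / PRESCALER_RATIO
    n_prog = f_prescaled / f_ref_hz
    return f_prescaled, n_prog


if __name__ == "__main__":
    for f_dco in (82e9, 100e9, 107.6e9):
        f_pre, n = divider_plan(f_dco)
        print(f"DCO {f_dco / 1e9:6.1f} GHz -> pre-scaler {f_pre / 1e9:.3f} GHz, "
              f"nominal divide ratio {n:.1f}")
```

At 100GHz this gives 6.25GHz after the pre-scaler and a nominal ratio of 50, in line with the "~6GHz" quoted above. The ratio printed for other frequencies is only nominal, since the exact reference frequency at each operating point is not given in the text.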

As shown in Fig. 15.8.2, instead of a single transformer with a high coupling factor k but low $Q_{min}$, the proposed split-transformer consists of 2 parallel transformers $L_a$ and $L_b$, each of which is designed with 3 parallel secondary coils with lower coupling factor $k_i$ to maximize $Q_{min,i}$. Because $TR_i$ is proportional to $k_i^2$, the overall k becomes $\sqrt{\sum k_i^2}$, and the overall $Q_{min}$ approximates $Q_{min,i}$ and is maximized accordingly. Binary-weighted transistors ($R_{v0} \sim R_{v5}$) are added at each secondary coil as variable resistors for fine frequency tuning. Theoretically, $Q_{min}$ is improved to $2N/TR$, where N is the total number of secondary coils. However, at higher frequencies, more variable resistors may need to be turned on for smaller inductance, causing lower Q. To alleviate the problem, the transformer is further split non-uniformly with progressively scaled $k_i$ and $TR_i$, and the secondary coil that enables oscillation at the target frequency in its high-Q region with the smallest tuning range, and thus the lowest Q degradation, is selected. Due to the mutual coupling between the secondary coils, the coupling coefficients $k_i$ are effectively reduced. As a result, $k_i$ needs to be increased accordingly to maintain the same optimal $k_i$ and $TR_i$. As shown in Fig. 15.8.2, each of the two transformers consists of a primary coil $L_a/L_b$ and 3 secondary coils $L_{a1-3}/L_{b1-3}$. The coils are implemented on the top metal layers with only one turn for maximum $Q_i$. To minimize the mutual coupling among the secondary coils, the 2 secondary coils $L_{a1-2}/L_{b1-2}$ are placed inside but coupled to different parts of the primary coil $L_a/L_b$, while the third secondary coil $L_{a3}/L_{b3}$ is placed outside but surrounded by a shorted coil to reduce its effective inductance.
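The first-order relations quoted above (overall $k = \sqrt{\sum k_i^2}$ and $Q_{min} \approx 2N/TR$) can be evaluated with a short Python sketch. The function names are hypothetical, TR is assumed to be expressed as a fraction (0.27 for 27%), and the per-coil coupling values in the example are invented for illustration. The results are theoretical bounds on the variable-inductor Q only, not predictions of the measured tank Q.

```python
import math


def overall_coupling(k_list):
    """Overall coupling factor of N secondary coils: sqrt(sum(k_i^2))."""
    return math.sqrt(sum(k * k for k in k_list))


def q_min_single(tr):
    """Single transformer: Q_min ~ 2 / TR (TR as a fraction, assumed)."""
    return 2.0 / tr


def q_min_split(tr, n_coils):
    """Split transformer with N secondary coils: Q_min ~ 2 * N / TR."""
    return 2.0 * n_coils / tr


if __name__ == "__main__":
    tr = 0.27  # 27% tuning range, taken from the comparison table
    for n in (1, 3, 6):
        print(f"N = {n}: theoretical Q_min ~ {q_min_split(tr, n):.1f}")
    # Illustrative (made-up) per-coil coupling factors
    print(f"overall k = {overall_coupling([0.3, 0.4]):.2f}")
```

Under these assumptions the bound scales linearly with N, which is the motivation for splitting the transformer into more secondary coils.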

As a simple illustration with N=2, Fig. 15.8.2 also compares the conventional frequency-tuning scheme (using a switch D and a variable resistor $R_v$ [4]) with the proposed split-transformer scheme (employing two variable resistors $R_{v1}$ and $R_{v2}$ together with a swapping scheme). For the conventional tuning, after the frequency is tuned to $f(\infty, 0)$, the control signals ($R_v, D$) need to be switched from $(\infty, 0)$ to $(0, \infty)$ for further tuning. As a result, with PVT variations, $f(\infty, 0)$ differs from $f(0, \infty)$ and a frequency gap is created. On the other hand, for the proposed split-transformer scheme, $R_{v2}$ can be continuously used for further tuning from $f(R_{v1}, R_{v2}) = f(\infty, 0)$ without changing any control signal, which ensures no frequency gap. Furthermore, by swapping the control signals of $R_{v1}$ and $R_{v2}$, different tuning cases with tuning signal $(R_{v1}, R_{v2})$ from $(0, 0)$ to $(0, \infty)$ and to $(\infty, \infty)$ are created, allowing selection of the variable resistor operating in the high-Q region. As illustrated, because the target frequency $f_0$ is closer to $f(0, \infty)$ in Case 2 than to $f(\infty, 0)$ in Case 1, Case 2 is selected, achieving a $Q_2$ higher than the $Q_1$ of Case 1.

To maintain a stable gain for phase locking and to achieve a fine frequency resolution of 1.5kHz to minimize quantization noise, the proposed DCO employs an exponentially scaled SC ladder by exponentially scaling down the SC with more capacitive-divider stages [5], as shown in Fig. 15.8.1. However, with a larger number of bits for the SC ladders, the scaling factor variation is increased [5]. In order to reduce bandwidth variation with a smaller number of bits, a dual-path SC ladder with integral-path and proportional-path SC ladders is proposed. The integral-path SC ladder utilizes a 24b ladder with a ~1GHz tuning range and a scaling factor of ~4 in each stage, consisting of an 8-stage C-2C SC ladder connected in parallel with 2b SC units and in series with 3-stage scaling capacitors. The ladder is controlled by a three-level quantizer and an integrator. The proportional-path SC ladder, with a tuning range of ~4MHz, utilizes a 16b SC ladder with a higher scaling factor of ~8 and 3b thermometer-coded SC units for wider linear range. The ladder is controlled by the LF output for phase locking.
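To put the quoted numbers in perspective, the sketch below estimates how many bits a plain binary-weighted capacitor bank would need to span the ~1GHz integral-path range in 1.5kHz steps. Pairing the 1.5kHz resolution with the ~1GHz range is an assumption made here for illustration; the text does not state which path sets the resolution, and `equivalent_binary_bits` is a hypothetical helper.

```python
import math


def equivalent_binary_bits(tuning_range_hz, resolution_hz):
    """Bits a uniform binary-weighted bank would need to cover
    tuning_range_hz in steps of resolution_hz."""
    return math.log2(tuning_range_hz / resolution_hz)


if __name__ == "__main__":
    bits = equivalent_binary_bits(1e9, 1.5e3)
    print(f"~1 GHz range at 1.5 kHz steps needs about {bits:.1f} binary bits")
```

The result is a little over 19 bits of dynamic range. A range that wide is hard to realize with directly scaled unit capacitors at W-band, which is consistent with the text's use of capacitive-divider stages to scale the steps exponentially.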

As shown in Fig. 15.8.3, the conventional delta-sigma TDC [2] uses a phase-frequency detector (PFD) to convert phase error into pulses. To minimize the overall in-band noise, the PFD is replaced by a clock-skew-sampling phase detector with high-gain phase detection [6]. With the rising slope of 1V/160ps for the clock-skew sampler at a 125MHz reference input, the input-referred noise is reduced by 34dB as compared with using a traditional PFD. The implemented phase-to-digital converter (PDC) consists of an 8b counter as an integral quantizer and a state-to-phase decoder as a fractional quantizer, which detects the differential outputs of the ring oscillator $P_{0-6}$ and $N_{0-6}$ as shown in Fig. 15.8.3. Ideally, the timing relationship between the ring-oscillator output $P_0$ and REF is the same in both the integral and fractional quantizers. However, due to timing mismatch, when $P_0$ leads REF in the fractional quantizer with an output of 0, $P_0$ may lag REF in the integral quantizer and incorrectly reduce the output by 1. As a solution, an early/late detector with output S is used in the integral quantizer to detect the timing relationship and correct the digital output accordingly.
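The 34dB figure is roughly consistent with a simple gain-ratio estimate. The sketch below assumes (this is not stated in the paper) that a conventional PFD spreads a 1V output swing over one full reference period, whereas the clock-skew sampler covers the same 1V in 160ps. The function name is hypothetical.

```python
import math


def phase_detector_gain_ratio_db(f_ref_hz, sampler_time_per_volt_s):
    """Gain of a clock-skew sampler relative to a PFD, in dB.

    Assumption: the PFD maps one reference period onto 1 V, so its gain is
    1 V / T_ref, while the sampler gain is 1 V / sampler_time_per_volt_s.
    """
    t_ref = 1.0 / f_ref_hz
    return 20.0 * math.log10(t_ref / sampler_time_per_volt_s)


if __name__ == "__main__":
    gain_db = phase_detector_gain_ratio_db(125e6, 160e-12)
    print(f"Estimated gain advantage: {gain_db:.1f} dB")
```

With a 125MHz reference (8ns period) and a 1V/160ps slope, the ratio is 50, or about 34dB. Under the stated assumption, the same factor would apply to the input-referred noise of the stages that follow the detector.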

Fabricated in a 65nm CMOS process, the W-band ADPLL occupies a 0.36mm<sup>2</sup> core area and consumes 35.5mW. Figure 15.8.4 shows the measured DCO startup current and 10MHz phase noise with different configurations of the split-transformer. With N=1, because of severe quality-factor degradation, the DCO fails to oscillate at most frequencies. As N increases, the overall tank Q is improved, and the startup current becomes smaller. Specifically, for N=6, as compared to N=3, the DCO not only oscillates over a much wider tuning range but also requires a lower startup current (7mA instead of 11mA), indicating better Q (~2x), and achieves better phase noise (-106 instead of -102dBc/Hz at 84GHz). Applying the swapping scheme described earlier, the 10MHz phase noise is further improved from -106 to -109dBc/Hz at 84GHz. Figure 15.8.5 shows the measured phase-noise spectrum of the ADPLL at 82 and 108GHz with different gains of the TDC (1V/80ps and 1V/160ps clock-skew sampler) and BW. With the high-gain TDC and the reduction of the DLF gain for a ~1MHz bandwidth, the 10kHz phase noise is improved from -79.2 to -81.7dBc/Hz at 82GHz and from -74.1 to -78.6dBc/Hz at 107.6GHz. Figure 15.8.6 summarizes and compares the ADPLL performance to prior art. The proposed ADPLL features the widest tuning range among W-band PLLs in Fig. 15.8.6 and the highest operation frequency among ADPLLs in Fig. 15.8.6 with high frequency resolution of 1.5kHz. The proposed ADPLL achieves phase noise and power consumption comparable to those of the 60GHz PLLs in Fig. 15.8.6. The die micrograph is shown in Fig. 15.8.7.

### Acknowledgment:

This project was supported by the Hong Kong General Research Funding (16243116).

### References:

- [1] A. Hussein, et al., "A 50-to-66GHz 65nm CMOS All-Digital Fractional-N PLL with 220fs<sub>rms</sub> Jitter," *ISSCC*, pp. 326-327, Feb. 2017.
- [2] M. Straayer and M. Perrott, "A Multi-Path Gated Ring Oscillator TDC with First-Order Noise Shaping," *IEEE JSSC*, vol. 44, no. 4, pp. 1089-1098, Apr. 2009.
- [3] T.-Y. Lu, et al., "Wide Tuning Range 60GHz VCO and 40GHz DCO Using Single Variable Inductor," *IEEE TCAS-I*, vol. 60, no. 2, pp. 257-267, Feb. 2013.
- [4] X. Liu, et al., "Transformer-Based Varactor-Less 96GHz-110GHz VCO and 89GHz-101GHz QVCO in 65nm CMOS," *IEEE ASSCC*, pp. 357-360, Nov. 2016.
- [5] Z. Huang and H. C. Luong, "Design and Analysis of Millimeter-Wave Digitally-Controlled Oscillators with C-2C Exponentially-Scaling Switched-Capacitor Ladder," *IEEE TCAS-I*, vol. 64, no. 6, pp. 1299-1308, June 2017.
- [6] Z. Huang, et al., "A 4.2μs-Settling-Time 3rd-Order 2.1GHz Phase-Noise-Rejection PLL Using a Cascaded Time-Amplified Clock-Skew Sub-Sampling DLL," *ISSCC*, pp. 40-41, Feb. 2016.



Figure 15.8.1: Block diagram of the proposed W-band ADPLL and the schematic of a W-band DCO with a dual-path SC ladder (only single-ended versions shown for simplicity).



Figure 15.8.2: Schematic and layout of the DCO split-transformer as a variable inductor and the illustration of its tuning and swapping scheme in comparison with the conventional tuning using a switch and a variable resistor [4].



Figure 15.8.3: Block diagrams of the proposed clock-skew-sampling delta-sigma TDC (as compared to the conventional delta-sigma TDC [2]) and of the proposed Phase-to-Digital converter.



Figure 15.8.4: Measured startup current of the DCO versus the DCO frequency and measured 10MHz phase noise of the W-band DCO versus the ADPLL output frequency with bandwidth <0.2MHz for different configurations of the split-transformer.



Figure 15.8.5: Measured phase-noise spectrum of the W-band ADPLL with different gains of the TDC and with narrow bandwidth <0.2MHz to demonstrate the DCO phase noise at 82 and 107.6GHz.

|                                           | Szortyka, JSSC'15               | Wu, JSSC'14                 | Hussein, ISSCC'17 [1]               | Kang, TMTT'14     | Chao, RFIC'14               | This work                                                                 |
|-------------------------------------------|---------------------------------|-----------------------------|-------------------------------------|-------------------|-----------------------------|---------------------------------------------------------------------------|
| PLL Architecture                          | Sub-sampling PLL                |                             |                                     |                   |                             |                                                                           |
| Freq. (GHz)                               | 53.8-63.3 (16.2%)               | 56.4-63.4 (11.7%)           | 50.2-66.5 (27.9%)                   | 92.7-100.2 (7.8%) | 96.8-108.5 (11.4%)          | <b>82.0-107.6 (27%)</b>                                                   |
| f<sub>ref</sub> (MHz)                     | 40                              | 100                         | 100                                 | 1500              | 200                         | 125                                                                       |
| Close-in PN (dBc/Hz)                      | -80 @ 10kHz<br>-90 -92 @ 0.2MHz | -75 @ 10kHz<br>-72 @ 0.1MHz | -79 -83 @ 10kHz<br>-88 -90 @ 0.1MHz | -92.5 @ 0.1MHz    | -77 @ 10kHz<br>-84 @ 0.1MHz | <b>-78.5 -81.8 @ 10kHz<br/>-84.2 -87.1 @ 0.1MHz</b> ( $BW \approx 1MHz$ ) |
| PN (dBc/Hz)<br>$\Delta f = 1MHz$          | -108                            | -110                        | -116~128                            | -105.5            | -104                        | <b>-106 -118</b> ( $BW < 0.2MHz$ ) <b>-104 -108</b> ( $BW \approx 1MHz$ ) |
| Power(mW)                                 | 42                              | 48                          | 46                                  | 469.3             | 14                          | 35.5                                                                      |
| Ref. Spur (dBc)                           | -40                             | -74                         | -59.1~68                            | -60               | -35                         | -34~52 ( $BW \approx 1MHz$ )                                              |
| VDD (V)                                   | 1                               | 1.2                         | 1                                   | 3.3,2.5,1.2       | 1.2,0.6                     | 1.2,0.8                                                                   |
| Fully-Integrated Loop Filter              | N/A                             | Yes                         | Yes                                 | Yes               | No                          | Yes                                                                       |
| Jitter(ps)                                | 0.2~0.35 (10K~100M)             | 0.59 (10K~10M)              | 0.22~0.26 (Unknown)                 | 0.0785 (10K~100M) | N/A                         | 0.276~0.328 (10K~10M, BW = 1MHz)                                          |
| FOM\* (dB) $\Delta f = 10MHz$             | -167.3                          | -168.8                      | -175.7~183.5                        | -158.3            | -172.5                      | <b>-171~173</b> ( $BW < 0.2MHz$ ) <b>-169~171</b> ( $BW \approx 1MHz$ )   |
| FOM<sub>T</sub>\*\* (dB) $\Delta f = 1MHz$ | -171.4                          | -170.1                      | -184.6~192.4                        | -156.2            | -173.6                      | <b>-178~181</b> ( $BW < 0.2MHz$ ) <b>-175~179</b> ( $BW \approx 1MHz$ )   |
| Area (mm <sup>2</sup> )                   | 0.16                            | 0.48                        | 0.45                                | 0.93              | 0.39                        | 0.36                                                                      |
| Process                                   | CMOS 40nm                       | CMOS 65nm                   | CMOS 65nm                           | SiGe 130nm        | CMOS 65nm                   | CMOS 65nm                                                                 |

\*FOM = PN - 20log(f<sub>0</sub>/Δf) + 10log(Power/1mW)  
\*\*FOM<sub>T</sub> = PN - 20log(f<sub>0</sub>/Δf · FTR/10%) + 10log(Power/1mW)

Figure 15.8.6: Performance summary and comparison of the proposed ADPLL with prior 60GHz and W-band PLLs.
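The two figures of merit used in the comparison can be evaluated numerically. The sketch below assumes the usual oscillator definitions, FOM = PN - 20log(f0/Δf) + 10log(P/1mW) and FOM_T = PN - 20log((f0/Δf)·(FTR/10%)) + 10log(P/1mW). The choice of inputs (the -109dBc/Hz 10MHz phase noise at 84GHz, the 35.5mW total power, and a 27% tuning range) is an illustrative selection from the text, not a reproduction of the authors' calculation.

```python
import math


def fom_db(pn_dbc_hz, f0_hz, offset_hz, power_mw):
    """FOM = PN - 20*log10(f0/df) + 10*log10(P / 1 mW)."""
    return (pn_dbc_hz
            - 20.0 * math.log10(f0_hz / offset_hz)
            + 10.0 * math.log10(power_mw))


def fom_t_db(pn_dbc_hz, f0_hz, offset_hz, power_mw, ftr_percent):
    """FOM_T = PN - 20*log10((f0/df) * (FTR/10%)) + 10*log10(P / 1 mW)."""
    return (pn_dbc_hz
            - 20.0 * math.log10(f0_hz / offset_hz * ftr_percent / 10.0)
            + 10.0 * math.log10(power_mw))


if __name__ == "__main__":
    pn, f0, df, p_mw, ftr = -109.0, 84e9, 10e6, 35.5, 27.0
    print(f"FOM   = {fom_db(pn, f0, df, p_mw):.1f} dB")
    print(f"FOM_T = {fom_t_db(pn, f0, df, p_mw, ftr):.1f} dB")
```

With these inputs the sketch gives about -172dB for FOM and about -180.6dB for FOM_T. These values fall within the ranges listed for this work in the table and are close to the -181dB quoted in the title.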



Figure 15.8.7: Die micrograph of the proposed W-band ADPLL.

# Session 16 Overview:

## *Advanced Optical and Wireline Techniques*

### WIRELINE SUBCOMMITTEE

**Session Chair: Azita Emami**

California Institute of Technology, Pasadena, CA

**Associate Chair: Andrew Joy**

Cavium, Northampton, United Kingdom

**Subcommittee Chair: Frank O'Mahony, Intel, Hillsboro, OR**

Electrical and optical links continue to provide challenges that cannot be overcome by process technology advances alone. The ingenuity of the designer is a major contribution, and this is clearly demonstrated by the papers in this session. The first is an invited paper that reviews state-of-the-art optical interconnects in scalable, high-density standards-based switching fabrics for networking systems and cloud computing. The second paper describes the first electrical domain solution to the non-linear dispersion for both transient and adiabatic chirp of directly modulated lasers. This is followed by a paper that describes the first optical burst-mode receiver with rapid power on/off functionality for a data-rate of 56Gb/s and a power efficiency of 2.2pJ/b in always-on mode.

The next three papers demonstrate the benefits of time-domain modulation and equalization techniques as an alternative to conventional voltage-domain modulation. The first utilizes integrated pulse-width modulation to equalize consecutive-bit ISI from 3 to 16Gb/s, achieving 1.6-3.1pJ/b. The second encodes data based on pulse width and location within a fixed-length frame in order to signal at 20Gb/s with 4.53pJ/b. The third uses phase-difference modulation to overcome reflections in a 7.8Gb/s/pin, 1.96pJ/b link for multi-drop memory applications. The next paper addresses short-reach interface requirements at 56Gb/s and achieves a power efficiency of 2.25pJ/b. This is followed by a 1.17pJ/b/pin ground-referenced single-ended link at 25Gb/s in 16nm CMOS for multi-die package communication spanning up to 80mm. The final paper describes a digitally pre-distorted PAM-4 modulation technique for contactless communication at 20Gb/s over a 127GHz carrier consuming 3.98pJ/b.

**INVITED PAPER**

**1:30 PM**

**16.1 Optical Interconnects in Computing and Switching Systems: the Anatomy of a 20Tb/s Switch Card**

A. Krishnamoorthy, Axalume, San Diego, CA



This paper reviews the use of optical interconnects in scalable, high-density standards-based switching fabrics for networking systems and cloud computing. The objective is to demonstrate, with examples, that significant system-level benefits accrue from a judicious use of optics within the system. We present a progression of 3-stage Clos-architecture datacenter switches of approximately 25Tb/s capacity with approximately 50Tb/s IO bandwidth with progressive use of optical interconnects to improve density, efficiency, and latency. First-generation systems eliminated internal electrical backplanes and used active optical cables for all IO to the switch chassis. Next-generation datacenter switches exclusively used optical interconnect technologies to enable switches for two-tier, and three-tier leaf-and-spine network topologies with up to a 16 $\times$  improvement in density  $\times$  efficiency [1]. For future systems, we predict that the integration of photonics and electronics at the semiconductor package level will enable further scaling of bandwidth and energy efficiency. We present a roadmap for co-packaged and co-integrated photonic 2.5D or 3D technologies and discuss candidates for density and energy-efficiency improvements resulting from co-integration and reduced SerDes power. We review specific hybrid flip-chip packaging configurations that will enable this technology to be manufactured and provide the necessary scale and integration density. The inclusion of photonics represents an evolution in interconnect design methodology at the board and package level and a breakthrough in system design by providing an interface that can span multiple levels of the interconnect hierarchy.

[1] A. Krishnamoorthy, et al., "From Chip-to-cloud: optical interconnects in engineered systems," IEEE Jour. Lightwave Tech., vol. 35, no. 15, pp. 3103-3115, Aug. 2017.

**2:00 PM**

**16.2 A 28Gb/s Transceiver with Chirp-Managed EDC for DML Systems**

K. Kwon, KAIST, Daejeon, Korea



In Paper 16.2, Korea Advanced Institute of Science & Technology presents a 28Gb/s electronic dispersion compensation technique for directly modulated lasers (DMLs). The dual-lane transceiver is fabricated in a 40nm CMOS process and consumes 236mW at 28Gb/s.



2:30 PM

**16.3 A 56Gb/s Burst-Mode NRZ Optical Receiver with 6.8ns Power-On and CDR-Lock Time for Adaptive Optical Links in 14nm FinFET CMOS**

*I. Ozkaya*, IBM Research, Ruschlikon, Switzerland and EPFL, Lausanne, Switzerland

In Paper 16.3, IBM Research describes a 56Gb/s optical receiver in 14nm FinFET CMOS technology with rapid power on/off functionality. It achieves a 6.8ns wake-up time while consuming always-on and off powers of 126mW and 8mW, respectively.



2:45 PM

**16.4 A 0.5-to-0.9V, 3-to-16Gb/s, 1.6-to-3.1pJ/b Wireline Transceiver Equalizing 27dB Loss at 10Gb/s with Clock-Domain Encoding Using Integrated Pulse-Width Modulation (iPWM) in 65nm CMOS**

*A. Ramachandran*, Oregon State University, Corvallis, OR

In Paper 16.4, Oregon State University presents a 3-16Gb/s wireline transceiver in 65nm CMOS with clock domain equalization using integrated pulse width modulation (iPWM) encoding. The proposed transceiver can compensate 27dB loss at 10Gb/s with an efficiency of 1.8pJ/b.



3:15 PM

**16.5 A 20Gb/s Transceiver with Framed-Pulsewidth Modulation in 40nm CMOS**

*S. Jeon*, KAIST, Daejeon, Korea

In Paper 16.5, Korea Advanced Institute of Science & Technology (KAIST) presents a 20Gb/s serial link transceiver in 40nm CMOS, which employs a framed-pulsewidth modulation (FPWM) scheme. It achieves a total throughput of 20Gb/s while keeping the minimum pulse width to the equivalent of 15Gb/s.



3:45 PM

**16.6 A 7.8Gb/s/pin 1.96pJ/b Compact Single-Ended TRX and CDR with Phase-Difference Modulation for Highly Reflective Memory Interfaces**

*S. Lee*, Pohang University of Science and Technology, Pohang, Korea

In Paper 16.6, Pohang University of Science and Technology proposes a 7.8Gb/s/pin 1.96pJ/b phase-difference-modulation transmitter for highly-reflective memory interfaces. It demonstrates ISI suppression capability of a 14-tap DFE by overcoming 10 in-band notches.



4:15 PM

**16.7 A 126mW 56Gb/s NRZ Wireline Transceiver for Synchronous Short-Reach Applications in 16nm FinFET**

*D. Carey*, Xilinx, Cork, Ireland

In Paper 16.7, Xilinx presents a 56Gb/s NRZ short reach wireline transceiver in 16nm FinFET technology. The transceiver achieves better than 1e-15 BER over a channel with 8dB loss at 28GHz while consuming 126mW at 56Gb/s.



4:45 PM

**16.8 A 1.17pJ/b 25Gb/s/pin Ground-Referenced Single-Ended Serial Link for Off- and On-Package Communication in 16nm CMOS Using a Process- and Temperature-Adaptive Voltage Regulator**

*J. M. Wilson*, Nvidia, Durham, NC

In Paper 16.8, Nvidia describes a 25Gb/s/pin ground-referenced single-ended serial link in 16nm CMOS that communicates off-package over 80mm of low-cost PCB and package interconnect. It achieves a link energy efficiency of 1.17pJ/b, and uses an adaptive digital regulator to compensate for process and temperature variations.



5:00 PM

**16.9 A 20Gb/s 79.5mW 127GHz CMOS Transceiver with Digitally Pre-Distorted PAM-4 Modulation for Contactless Communications**

*Y. Kim*, Jet Propulsion Laboratory, Pasadena, CA and University of California, Los Angeles, CA

In Paper 16.9, Jet Propulsion Laboratory (JPL) presents a digitally pre-distorted PAM-4 transceiver for contactless communications. It achieves 20Gb/s of data with a 127GHz carrier over a 1mm air gap while consuming 79.5mW.

## 16.2 A 28Gb/s Transceiver with Chirp-Managed EDC for DML Systems

Kyeongha Kwon, Jonghyeok Yoon, Hanho Choi, Younho Jeon, Jaehyeok Yang, Bongjin Kim, Soonwon Kwon, Minsik Kim, Sejun Jeon, Hyosup Won, Hyeon-Min Bae

KAIST, Daejeon, Korea

Directly modulated lasers (DMLs) are widely employed in medium-reach optical links owing to their simplicity and cost effectiveness. However, the chirp phenomenon under direct modulation limits the reach (2-10km) in a standard single-mode fiber (SMF). Although diverse optical-domain chirp-management techniques have been studied [1], excessive cost and installation difficulties have limited their widespread use. Therefore, external modulation schemes are predominant in applications requiring extended-reach, despite their high cost and power dissipation. Recently, an electronic dispersion compensation (EDC) IC has been reported to compensate for the chirp-induced dispersion (adiabatic chirp) of a 6Gb/s DML [2,3]. However, such a technique cannot be applied to high-speed (>10Gb/s) DML applications since spectral broadening caused by transient chirp dominates in high-speed links. In this paper, an adaptive EDC-based CDR IC compensating for both adiabatic and transient chirp in DMLs is proposed to help extend the reach of both 10Gb/s and 28Gb/s optical links.

In DMLs, the frequency spectrum is correlated with the laser output power due to frequency chirp, wherein the adiabatic chirp is proportional to the power, $P(t)$, and the transient chirp is proportional to the derivative of the logarithmic output power, $\frac{d[\ln P(t)]}{dt} = \frac{1}{P(t)} \frac{dP(t)}{dt}$ [4]. When the dispersion coefficient is positive ($>1310\text{nm}$ in SMF-28), the wave velocity increases with the frequency. Therefore, adiabatic- or transient-chirped optical signals are dispersed through the channel (Fig. 16.2.1). The adiabatic-chirp-induced dispersion can be modeled as a combination of linear and nonlinear dispersions referred to as rabbit-ear and tilting, respectively. The rabbit-ear typically appears at every rising edge of the received signal. This is because the difference in power levels between the "0" and "1" pulses induces an emission frequency deviation, causing "1" to propagate faster than "0," eventually resulting in overlap between two pulses. The rabbit-ear effect reduces the sampling-point SNR and impedes proper clock recovery by creating multiple pattern-dependent zero-crossing points. The tilting effect, with asymmetrical rise and fall times, is caused by the power-dependent propagation velocity within a pulse. The rise time decreases because the propagation speed of the subsequent portions of a rising edge increases gradually, and the fall time increases vice versa. The tilting effect degrades the sampling-point SNR because the maximum eye openings for "1"s and "0"s are not aligned in time. The transient chirp causes a frequency shift ($\Delta f$) during the transient of each pulse, which is apparent especially at the rising edges as compared to the falling edges due to laser overshoot ($\because \Delta f \propto dP/dt$). Such a frequency shift increases the wave velocity in SMF-28 at a wavelength above 1310nm, where the positive $\Delta f$ induces a substantial pre-cursor ISI at every rising edge and the negative $\Delta f$ induces a post-cursor ISI at every falling edge. The amplitude of pre-cursor ISI is affected by the preceding data patterns because the amount of undershoot varies depending on the number of consecutive 'zeros', which in turn affects the amount of $\Delta f$ at the subsequent rising edges. Thus, the rising edge in the case of a 1001 pattern propagates faster than that of a 0001 pattern. For x101 patterns, the bandwidth limitation causes reduced undershoot as compared to a 1001 pattern, resulting in intermediate delay as compared to the 0001 and 1001 cases. The aforementioned pattern-dependent pre-cursor ISI causes substantial eye closure and pattern-dependent jitter.

Figure 16.2.2 shows the architecture of a 28Gb/s EDC transceiver compensating for both adiabatic and transient chirp. The EDC RX consists of a linear equalizer for the elimination of the rabbit-ear and a double-sampling DFE for appropriate pattern-dependent sampling-point adjustment for the adiabatic chirp. An eye-opening monitor (EOM) is employed for EDC to provide higher adaptability and robustness over channel variations. A phase-rotator-based all-digital quadrature-phase CDR is incorporated to retime the dispersed data while achieving robustness and power efficiency. The EDC TX overcomes pattern-dependent pre-cursor ISI due to transient chirp. The EDC block implements a rising-edge slope control to decrease  $\Delta f$  at every rising edge for the suppression of pre-cursor ISI, and a pattern-dependent phase control to realign the rising-edge spread during the transmission. The pattern-dependent phase shift,  $\Delta\phi$ , should be equal to the pattern-dependent differential delay through the fiber,  $\Delta t$ . The  $\Delta t$  can be estimated using the transient chirp of a DML and the material dispersion of a fiber as given by  $\Delta t = (Dz\lambda_c^2 / c_{fiber}) \cdot \Delta f$  where  $D$  is the dispersion coefficient,  $z$  is the length of fiber,  $\lambda_c$  is the center emission wavelength and  $c_{fiber}$  is the speed of light.
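As a worked illustration of the delay estimate $\Delta t = (Dz\lambda_c^2 / c_{fiber}) \cdot \Delta f$, the sketch below evaluates it for a 3km, 1550nm link. The dispersion coefficient of 17ps/(nm·km) is a typical value for standard single-mode fiber near 1550nm, the vacuum speed of light is assumed for $c_{fiber}$, and the 5GHz frequency shift is a hypothetical input rather than a measured value from this work.

```python
C_LIGHT = 299_792_458.0  # m/s, vacuum value assumed for c_fiber

def differential_delay(d_ps_nm_km, z_km, lambda_nm, delta_f_hz):
    """Delta t = (D * z * lambda_c^2 / c) * delta_f, returned in seconds."""
    d_si = d_ps_nm_km * 1e-12 / (1e-9 * 1e3)  # ps/(nm*km) -> s/m^2
    z_m = z_km * 1e3
    lam_m = lambda_nm * 1e-9
    return d_si * z_m * lam_m ** 2 / C_LIGHT * delta_f_hz

# Hypothetical example: 5 GHz pattern-dependent frequency shift.
dt = differential_delay(d_ps_nm_km=17.0, z_km=3.0, lambda_nm=1550.0, delta_f_hz=5e9)
ui = 1.0 / 28e9
print(f"delta_t = {dt * 1e12:.2f} ps = {dt / ui:.3f} UI at 28Gb/s")
```

Under these assumptions the delay is roughly 2ps, which indicates the scale of the phase shift $\Delta\phi$ the transmitter would have to apply per pattern.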

The approximated transfer function of the linear equalizer is

$$H_{lin} = \left(1 - \frac{2}{ER}\right) \cdot \frac{s + \dfrac{2(ER + 1)}{(ER - 1)\,\Delta t_{0 \rightarrow 1}}}{s + \dfrac{2}{\Delta t_{0 \rightarrow 1}}}$$

where  $\Delta t_{0 \rightarrow 1}$  denotes the differential delay between the pulses for "0" and "1" and ER is the extinction ratio of the optical source [3]. The location of the pole and zero can be adjusted using a digitally controlled capacitor and resistor network. Instead of using an analog-domain solution for the tilted waveform [3], a double-sampling 1-tap loop-unrolled DFE is employed, which samples the input signal at two different phases with different decision thresholds depending on the preceding bits. The optimal sampling phases and reference voltages are adjusted using pattern-filtered stochastic eye diagrams [5].
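The pole, zero and gain limits of the linear-equalizer transfer function can be checked with a few lines of code. This is a minimal sketch that assumes ER is expressed as a linear power ratio (not in dB) and uses hypothetical values for ER and $\Delta t_{0 \rightarrow 1}$; it only evaluates the formula as written and makes no claim about the implemented circuit.

```python
import math

def h_lin(freq_hz, er, dt01):
    """Evaluate H_lin at s = j*2*pi*f for extinction ratio `er` (linear) and delay `dt01` (s)."""
    s = 1j * 2.0 * math.pi * freq_hz
    zero = 2.0 * (er + 1.0) / ((er - 1.0) * dt01)  # zero frequency in rad/s
    pole = 2.0 / dt01                              # pole frequency in rad/s
    return (1.0 - 2.0 / er) * (s + zero) / (s + pole)

ER, DT01 = 4.0, 20e-12  # hypothetical: ER of about 6dB, 20ps differential delay
dc_gain = abs(h_lin(0.0, ER, DT01))
hf_gain = abs(h_lin(1e15, ER, DT01))
print(f"zero/pole ratio  = {(ER + 1) / (ER - 1):.3f}")
print(f"gain at DC       = {dc_gain:.4f}")
print(f"gain at high f   = {hf_gain:.4f}")
```

For any ER greater than 1 the zero sits above the pole by the factor (ER+1)/(ER-1), so the response moves from a DC gain of (1-2/ER)(ER+1)/(ER-1) to a high-frequency gain of (1-2/ER).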

The pattern-dependent phase control and the rising-edge slope control of the transmitted pulses are achieved by shaping the clock signals to the 4:1 MUXs. The 4:1 MUX operates with quadrature clock signals, wherein an appropriate input path is selected through the clock overlaps (Fig. 16.2.3). The EDC blocks dynamically adjust the phase and the slope of the overlapping clock signals. For pattern-matched phase control, digitally controlled 7b phase interpolators generate a different clock phase for each rising pattern, and the proper clock signals are selected for multiplexing according to 4-tap data patterns. Once a rising pattern (e.g., 1001) is detected, a phase-shifted clock signal (e.g.,  $CLK_{1001,\phi}$  in Fig. 16.2.3), instead of the quadrature clock signal ( $CLK_\phi$ ), is chosen for the multiplexing to generate a rising edge that is delayed by  $\Delta\phi$  to counterbalance the pattern-dependent delay of  $\Delta t$ . The simplified circuit schematic for the slope control is also shown in Fig. 16.2.3. In order to adjust the rising-edge slope, the charging current through the PMOS transistors is digitally controlled. Once rising edges are detected in the transmitted data, the slope control is enabled and the number of PMOS transistors to be turned on is reduced, depending on the control bits, to reduce the rising slope.
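The pattern-matched clock selection can be summarized as a small lookup. The sketch below is a behavioral illustration only: the function name and the phase-interpolator codes are hypothetical, chosen so that the 1001 pattern (fastest through the fiber) receives the largest delay, x101 an intermediate one, and 0001 the smallest, following the ordering described earlier.

```python
# Hypothetical 7b phase-interpolator codes per rising pattern (illustrative only).
PHASE_CODE = {"0001": 0, "0101": 5, "1101": 5, "1001": 9}

def select_clock(last4):
    """Pick the MUX clock for the newest bit given the last four bits (oldest first)."""
    if len(last4) != 4 or set(last4) - {"0", "1"}:
        raise ValueError("expected a 4-character string of 0/1")
    if last4.endswith("01"):      # rising edge on the newest bit
        return ("shifted", PHASE_CODE[last4])
    return ("nominal", None)      # plain quadrature clock

def clock_plan(bits):
    """Return the clock choice for every bit of a stream once 4 bits of history exist."""
    return [select_clock(bits[i - 3:i + 1]) for i in range(3, len(bits))]

print(clock_plan("00010011101"))
```

Only the four patterns ending in 01 select a shifted clock; every other pattern keeps the nominal quadrature phase.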

Figures 16.2.4 and 16.2.5 show the measurement results with and without the EDC functions using a DML at a modulation rate of 10Gb/s, where adiabatic chirp is dominant, and >20Gb/s, where transient chirp is dominant, respectively. To measure the performance gain of the EDC RX, the BER with the EDC RX is compared to that with a CTLE and a 1-tap loop-unrolled DFE. The rabbit-ear and tilting are completely compensated with the EDC RX and the EDC achieves a 6.3dB OSNR gain at a BER of  $10^{-9}$  after 75km transmission. The pattern-dependent pre-cursor ISI is considerably suppressed with the EDC TX above 20Gb/s and the EDC reduces the BER from  $10^{-6}$  to  $10^{-8}$  when the received power is -7.5dBm after 3km transmission (Fig. 16.2.5). For the measurements, a commercial 1550nm 10Gb/s DML with increased bias current is used to expand the bandwidth for >20Gb/s transmission. The BER performance of the EDC TX is compared while controlling the received power since the reach is short. Figure 16.2.6 shows the performance summary and comparison. The EDC transceiver consumes 105.5mW per lane and the global clock generator consumes 25.1mW. A microphotograph of the chip, fabricated in a 40nm CMOS process and packaged in a flip-chip BGA, is shown in Fig. 16.2.7.

### References:

- [1] A. Abbasi, et al., "10-/28-Gb chirp managed 20-km links based on silicon photonics transceivers," *IEEE Photonics Tech. Letters*, vol. 29, no. 16, pp. 1324-1327, Aug. 2017.
- [2] K. Kwon, et al., "A 6Gb/s transceiver with a nonlinear electronic dispersion compensator for directly modulated distributed-feedback lasers," *ISSCC*, pp.138-139, Feb. 2014.
- [3] K. Kwon, et al., "A 6 Gb/s Transceiver With a Nonlinear Electronic Dispersion Compensator for Directly Modulated Distributed-Feedback Lasers," *IEEE JSSC*, vol. 50, no. 2, pp. 503-514, Feb. 2015.
- [4] P. Krehlik, "Characterization of semiconductor laser frequency chirp based on signal distortion in dispersive optical fiber," *Opto-Electronic Review*, vol. 14, no. 2, pp. 123-128, June 2006.
- [5] H. Won, et al., "An on-chip stochastic sigma-tracking eye-opening monitor for BER-optimal adaptive equalization," *IEEE CICC*, pp. 1-4, Sept. 2015.



Figure 16.2.1: Adiabatic- and transient-chirp induced dispersion.



Figure 16.2.2: Overall architecture of a 28Gb/s transceiver.



Figure 16.2.3: Timing diagrams of pattern-dependent pulse shaping and rising-edge slope control schematic.



Figure 16.2.4: Measured eyes before and after the EDC-based CDR IC and BER-OSNR curves w/ and w/o RX-EDC using an adiabatic-chirp dominant DML.



Figure 16.2.5: Measured eyes and BER-received power curves w/ and w/o TX-EDC using a transient-chirp dominant DML.

|                        | This work                      | [2]                            | [1]                                             |
|------------------------|--------------------------------|--------------------------------|-------------------------------------------------|
| System Type            | CMOS Transceiver (Dual lane)   | CMOS Transceiver (Single lane) | Silicon Photonics Transceiver                   |
| CMOS Process (nm)      | 40                             | 90                             | -                                               |
| Technique              | DML+EDC+EOM                    | DML+EDC                        | Chirp Managed Laser with Optical Tunable Filter |
| Datarate (Gb/s)        | 28, 10                         | 6                              | 28, 10                                          |
| Adiabatic-Chirp Compensation | Yes                      | Yes                            | Yes                                             |
| Transient-Chirp Compensation | Yes                      | No                             | No                                              |
| Automatic Adaptation   | Yes (EOM)                      | No                             | No                                              |
| Supply Voltage (V)     | 0.9                            | 1                              | -                                               |
| Power Consumption (mW) | TRX : 105.53/lane<br>CG : 25.1 | TRX : 124<br>CG : 102          | -                                               |
| FoM (mW/Gb/s)          | 4.22                           | 37.67                          | -                                               |
| Active Area (mm²)      | 2.836<br>(Dual lane TRX+CG)    | 1.479<br>(Single lane TRX+CG)  | -                                               |

Figure 16.2.6: Comparison table.
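The FoM row of the comparison table appears consistent with total power (all lanes plus the clock generator) divided by aggregate data rate. The sketch below reproduces both tabulated values under that assumption; the function name is hypothetical and the definition is inferred from the numbers rather than stated in the source.

```python
def fom_mw_per_gbps(trx_mw_per_lane, lanes, cg_mw, rate_gbps_per_lane):
    """Total power (TRX lanes + clock generator) divided by aggregate data rate."""
    total_mw = trx_mw_per_lane * lanes + cg_mw
    return total_mw / (rate_gbps_per_lane * lanes)

this_work = fom_mw_per_gbps(105.53, lanes=2, cg_mw=25.1, rate_gbps_per_lane=28)
ref_2 = fom_mw_per_gbps(124.0, lanes=1, cg_mw=102.0, rate_gbps_per_lane=6)
print(f"This work: {this_work:.2f} mW/Gb/s, [2]: {ref_2:.2f} mW/Gb/s")
```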



Figure 16.2.7: Microphotographs of the test chip.

## 16.3 A 56Gb/s Burst-Mode NRZ Optical Receiver with 6.8ns Power-On and CDR-Lock Time for Adaptive Optical Links in 14nm FinFET CMOS

Ilter Ozkaya<sup>1,2</sup>, Alessandro Cevrero<sup>1</sup>, Pier Andrea Francese<sup>1</sup>, Christian Menolfi<sup>1</sup>, Matthias Braendli<sup>1</sup>, Thomas Morf<sup>1</sup>, Dan Kuchta<sup>3</sup>, Lukas Kull<sup>1</sup>, Marcel Kossel<sup>1</sup>, Danny Luu<sup>1</sup>, Mounir Meghelli<sup>3</sup>, Yusuf Leblebici<sup>2</sup>, Thomas Toifl<sup>1</sup>

<sup>1</sup>IBM Research, Ruschlikon, Switzerland

<sup>2</sup>EPFL, Lausanne, Switzerland

<sup>3</sup>IBM Research, Yorktown Heights, NY

The increasing bandwidth demand in data-centers requires wireline transceivers supporting >50Gb/s/lane data-rates with low power consumption. Because link utilization in data-centers is <10% for 99% of the links [1], a promising way to reduce power consumption is fine-grained power gating, where the link is powered off during idle time. For rapid on/off functionality to be efficient with short data bursts, the link needs to wake up within a few ns, which is challenging at high speeds. Burst-mode operation was previously demonstrated at 25Gb/s with 18.5ns lock-time [2] without power cycling.

This work demonstrates a 56Gb/s optical RX with rapid power on/off functionality completing wake up and CDR lock in 384UI with the assistance of a link protocol. It automatically powers on/off based on a specific input-sequence. The CDR, based on digitally controlled phase rotation, uses a preamble to speed up locking, followed by a start sequence marking the beginning of valid data. The RX detects the end of the preamble and successfully locks the internal PRBS register with error-free data (1<sup>st</sup> PRBS bit-error free after start sequence) across 10<sup>10</sup> power cycles for a 2048b data burst, hence operating at BER < 2×10<sup>-13</sup>.
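The source does not state how the BER bound is derived from the error-free bursts. One reading that is consistent with the quoted figure is the standard zero-error confidence bound, BER < -ln(1 - CL)/N, evaluated at a 95% confidence level; the sketch below assumes exactly that and should be read as an interpretation, not as the authors' stated method.

```python
import math

def ber_upper_bound(bits_tested, confidence=0.95):
    """Upper bound on BER after observing zero errors in `bits_tested` bits."""
    return -math.log(1.0 - confidence) / bits_tested

bits = 10**10 * 2048            # power cycles x bits per burst (from the text)
bound = ber_upper_bound(bits)   # 95% confidence level is an assumption
print(f"bits tested = {bits:.3e}, BER bound = {bound:.2e}")
```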

Figure 16.3.1 illustrates the RX block diagram and the proposed link protocol. The RX uses a low-bandwidth front-end combined with a 1-tap DFE for high sensitivity [3]. The photocurrent is converted to a voltage by the self-referenced TIA. The signal is then amplified by the VGA and sampled by 4-way time-interleaved comparator arrays. Each array consists of 2 data slicers for DFE speculation and 1 edge slicer for phase detection. Quarter-rate speculative data are first resolved by the look-ahead DFE logic [3] and then demultiplexed to 1/32 of the data-rate before being processed by the RX digital, which detects the power-down sequence and performs a protocol-aware PRBS check. During normal operation, phase lock is maintained by a bang-bang CDR (BB-CDR) loop. The CDR logic, running on the C4 clock (1/4 of the data rate) to minimize loop latency, consists of a phase-detector (PD), a majority voter and a 1<sup>st</sup>-order loop filter that produces a gated clock C4<sub>G</sub> and early/late signal EL<sub>BB</sub> driving the phase rotator (PR) control register [4]. The clock path, implemented in CML, starts with a divider generating I/Q inputs for a 128-step PR that produces a quarter-rate clock [4]. The PR is designed such that its control logic consists of a shift register, enabling a phase update every C4 cycle. The PR output goes to the I/Q generator, which creates 8 clock phases φ<sub>IQ</sub> with 45° spacing. The output is then fed to the I/Q calibration block [4], which adjusts the phase between data and edge clocks to maximize voltage and timing margin.

Rapid power on/off operates as follows. After receiving a sequence of 64 “0”s at the beginning of the IDLE phase (Fig. 16.3.1 top), the RX digital asserts PWROFF, which is processed by the PON sense. The EN signal then goes low, powering down the VGA and the clock-path. The TIA is not power-gated in the off state so that it can detect the beginning of the next data burst, at the cost of a 4.5mW increase in stand-by power. During idle time, the optical signal is kept at low optical power (logic zero or below) to minimize VCSEL power. The beginning of a data burst is marked by an initialization sequence (INIT) consisting of 32 “1”s. The PON sense detects a 0->1 transition and asserts the EN signal, which restores the VGA and clock-path bias with the assistance of the PON bias boost and releases the PON finite-state-machine (FSM) reset. The clock-path starts generating a stable clock ~1ns after EN goes high. INIT is followed by a preamble of a repeated “1100” pattern. Such a pattern maximizes the transition density while maintaining enough signal swing with a low-bandwidth front-end. The burst-mode (BM) PD consists of a single XOR gate driven by 2 consecutive edge comparators. The BM PD output goes to the PON FSM, which directly drives the PR register. At the end of rapid phase lock, the PON FSM asserts BMDONE, handing over control to the BB-CDR. After the preamble, the beginning of valid data is marked by a “0000” start sequence (STR), which is automatically detected by the RX digital.
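The link protocol can be summarized by assembling one burst bit by bit. The following sketch is a behavioral illustration under stated assumptions: the function names are hypothetical, the 352UI preamble length is taken from the measurement results reported for this receiver, and the trailing 64 zeros are appended here simply to mark the start of the next IDLE phase.

```python
INIT_UI = 32                 # "1"s marking the start of a burst (INIT)
IDLE_ZEROS = 64              # "0"s that trigger power-down
PREAMBLE_WORD = [1, 1, 0, 0]
START_SEQ = [0, 0, 0, 0]     # STR, marks the start of valid data

def build_burst(payload, preamble_ui=352):
    """Return INIT + preamble + STR + payload + power-down zeros as a bit list."""
    if preamble_ui % len(PREAMBLE_WORD):
        raise ValueError("preamble length must be a multiple of 4 UI")
    preamble = PREAMBLE_WORD * (preamble_ui // len(PREAMBLE_WORD))
    return [1] * INIT_UI + preamble + START_SEQ + list(payload) + [0] * IDLE_ZEROS

def power_on_lock_time_ns(preamble_ui=352, rate_gbps=56.0):
    """Time from the first INIT bit to the end of the preamble."""
    return (INIT_UI + preamble_ui) / rate_gbps

burst = build_burst([1, 0] * 1024)   # hypothetical 2048b payload
print(len(burst), "UI per burst")
print(f"power-on-lock time = {power_on_lock_time_ns():.2f} ns")
```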

Figure 16.3.2 shows an RC-extracted simulation describing the power-on phase-lock transient and BM-CDR loop locking. After the EN signal is asserted, the RX is powered up, generating a valid clock signal. Then, the PON FSM waits a fixed time (which can be set digitally) for bias settling before asserting BMSTART, starting the BM-CDR, which updates the PR position every C4 cycle to achieve fast lock. It is well known that BB phase detection schemes may experience long lock time due to oscillations around a metastable initial condition. To solve this problem, the PON FSM applies a ramp up/down of 16 steps (marked as P1 in Fig. 16.3.2) based on the first received phase detection. Then, the BM PD drives the PR directly until a change in the PD output is sensed, detecting the zero crossing of the signal (marked as P2 in Fig. 16.3.2). However, due to CDR loop latency, the PR travels 8 extra steps by the time this condition is sensed. To compensate for this, the PR is pulled back by 8 steps (marked as P3 in Fig. 16.3.2). Then, the PON FSM asserts BMDONE, handing over control to the BB-CDR. Figure 16.3.3 shows the circuit details of the PON sense and bias boost. The PON sense consists of an amplifier stage with programmable offset, amplifying the TIA output to rail-to-rail levels, followed by an SR latch generating EN. The PON bias boost is used to speed up bias settling of the CML stages. After EN is asserted, the boost node BST is pulled to ground via C<sub>bst</sub>, turning M8 on, which rapidly charges the bias node. When the bias voltage goes above threshold, M8 is turned off via M5-M7.
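The three rapid-lock phases P1 to P3 can be illustrated with a small behavioral model. This is a sketch under simplifying assumptions: the phase detector is an ideal, noise-free sign detector on a 128-step rotator, the loop latency is modeled as a fixed 8-cycle decision pipeline, and the function names are hypothetical. It is not a model of the actual circuit.

```python
PR_STEPS = 128  # phase-rotator resolution (from the text)

def pd(pos, lock):
    """Ideal bang-bang decision: +1 to advance toward `lock`, -1 to retreat."""
    err = (lock - pos) % PR_STEPS
    return 1 if 0 < err <= PR_STEPS // 2 else -1

def rapid_lock(start, lock, ramp=16, latency=8):
    """Return the rotator position after phases P1, P2 and P3."""
    pos = start % PR_STEPS
    # P1: fixed ramp in the direction of the first decision, which moves the
    # rotator away from a possible metastable starting point.
    pos = (pos + pd(pos, lock) * ramp) % PR_STEPS
    # P2: step every cycle; decisions are seen `latency` cycles late.
    direction = pd(pos, lock)
    in_flight = [direction] * latency
    while True:
        in_flight.append(pd(pos, lock))
        if in_flight.pop(0) != direction:   # sign change finally observed
            break
        pos = (pos + direction) % PR_STEPS
    # P3: undo the steps taken while the sign change was still in the pipeline.
    return (pos - direction * latency) % PR_STEPS

def circular_error(a, b):
    d = (a - b) % PR_STEPS
    return min(d, PR_STEPS - d)

for start, lock in [(0, 40), (0, 64), (100, 10)]:
    final = rapid_lock(start, lock)
    print(start, lock, final, circular_error(final, lock))
```

In this idealized model the pull-back in P3 returns the rotator to within one step of the lock point regardless of the starting phase, after which a conventional bang-bang loop would take over.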

The RX was wire-bonded to a GaAs photodiode (Fig. 16.3.7) with 25GHz bandwidth and 50fF capacitance. A SiGe driver provided the optical signal over 7m of OM2 MM fiber. Measurements were done with a PRBS-7 pattern. Figure 16.3.4 shows the received eye diagram at 56Gb/s measured with the on-die scope. Horizontal and vertical margins in the always-on condition with the CDR locked were evaluated by measuring the BER contour plot of the DFE h1 coefficient vs. the data-to-edge clock spacing at -4dBm OMA optical power (Fig. 16.3.4 right). The RX operates error-free (BER < 10<sup>-11</sup>) with wide voltage and timing margin. Correct power cycling was observed by monitoring the C8 clock (7GHz) retransmitted off-chip on a real-time scope over 10<sup>7</sup> power cycles with infinite persistence (Fig. 16.3.5 bottom). As expected, the clock goes low during stand-by. To measure the power-on-lock time, data bursts with different preamble lengths were applied and the BER was measured by the RX digital. It detects STR and aligns the data with a barrel-shifter such that the PRBS register is seeded with the first 32 PRBS bits after STR. This ensures that even the 1<sup>st</sup> PRBS bit is included in the BER check. Then, it counts errors across that data burst based on the initial seed. The error count is accumulated across multiple power cycles. Figure 16.3.5 shows BER versus preamble length for 10<sup>10</sup> power cycles at 56Gb/s. The RX operates error free with a preamble length of 352UI and 100ppm frequency offset. Together with INIT this corresponds to a power-on-lock time of 384UI (or 6.8ns). Any error in the first bits leads to wrong PRBS seeding, accumulating a large error count in that burst, which explains the high BER right before the error-free region. The RX features a snapshot register (Fig. 16.3.1), activated automatically at wake-up, storing the PR register early/late (EL) input history for the first 70 clock cycles after BMSTART goes high. The recorded PR transient during rapid-lock at 56Gb/s is shown in Fig. 16.3.5, demonstrating correct BB-CDR operation. The 3 phases of rapid lock P1-3 are clearly visible.

Measured on and stand-by power are 127mW (or 2.2pJ/b) including RX digital and 8mW (TIA and bias generator on), respectively, as reported in the power breakdown table. Figure 16.3.6 (bottom left) reports predicted power consumption versus link utilization for 3 payload sizes based on measured on- and off-power with a conservative 512UI preamble. The plot shows good agreement with measured data. Thanks to supply and bias scaling, the always-on power of the RX scales almost linearly with data-rate, leading to nearly constant energy efficiency from 10Gb/s to 56Gb/s (Fig. 16.3.6 right).
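The predicted power versus link utilization can be approximated with a simple duty-cycle model. The sketch below is one plausible form, assuming the receiver draws the measured on-power during payload plus a fixed per-burst overhead and the measured stand-by power otherwise; the function name, the choice of overhead (the 512UI preamble only) and the payload sizes are illustrative assumptions, not the authors' exact model.

```python
def average_power_mw(utilization, payload_ui, overhead_ui=512,
                     p_on_mw=127.0, p_off_mw=8.0):
    """Average RX power for a given link utilization (fraction of time carrying payload)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be within [0, 1]")
    # Each payload burst keeps the RX on for payload + overhead UIs.
    on_fraction = min(1.0, utilization * (payload_ui + overhead_ui) / payload_ui)
    return p_off_mw + (p_on_mw - p_off_mw) * on_fraction

for payload in (2048, 8192, 32768):          # hypothetical payload sizes in UI
    p = average_power_mw(0.10, payload)
    print(f"payload {payload:>6} UI, 10% utilization: {p:.1f} mW")
```

Longer payloads amortize the preamble over more bits, so the average power at a given utilization approaches the ideal value of off-power plus utilization times the on/off difference.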

#### Acknowledgments:

The authors thank the foundry team at GlobalFoundries Fab-8, Malta NY for fabricating the circuits. The research leading to these results has been supported in part by the European Union’s Seventh Framework Programme (FP7/2007-2013) in the ADDAPT project under grant agreement 619197.

#### References:

- [1] A. Roy, et al., “Inside the Social Network’s (Datacenter) Network”, *Sigcomm Comput. Commun. Rev.*, pp. 123-137, Aug. 2015.
- [2] A. Rylyakov, et al., “A 25 Gb/s burst mode receiver for rapidly reconfigurable optical networks”, *ISSCC*, pp. 400-401, Feb. 2015.
- [3] A. Cevrero, et al., “A 64Gb/s 1.4pJ/b NRZ optical-receiver data-path in 14nm CMOS FinFET”, *ISSCC*, pp. 482-483, Feb. 2017.
- [4] A. Cevrero, et al., “A 60Gb/s 1.9pJ/bit NRZ optical receiver with low latency digital CDR in 14nm CMOS FinFET”, *IEEE Symp. VLSI Circuits*, pp. C320-C321, June 2017.



Figure 16.3.1: Link protocol (top) and optical RX block diagram (bottom).



Figure 16.3.2: Power-on transient simulation at 56Gb/s (left), burst-mode CDR locking points (top-right) and metastable conditions (bottom-right).



Figure 16.3.3: Power on sense schematic (left), bias boost and its effect on the bias voltage (right).



Figure 16.3.4: 56Gb/s eye diagram at front-end output (left), BER contour plot of DFE h1 vs data-to-edge clock spacing (right).



Figure 16.3.5: Measured BER (top-left) and PR position during rapid phase lock (top-right), 56Gb/s TX out (inverted) and C8 RX output clock (7GHz) on real-time scope (bottom).



Figure 16.3.6: Receiver power breakdown @56Gb/s (top), power vs link-utilization (bottom-left), always-on power and supply voltages vs data-rate (bottom-right).



Figure 16.3.7: Packaged RX micrograph (top) and layout details (bottom) in 14nm CMOS FinFET technology.

## 16.4 A 0.5-to-0.9V, 3-to-16Gb/s, 1.6-to-3.1pJ/b Wireline Transceiver Equalizing 27dB Loss at 10Gb/s with Clock-Domain Encoding Using Integrated Pulse-Width Modulation (iPWM) in 65nm CMOS

Ashwin Ramachandran, Tejasvi Anand

Oregon State University, Corvallis, OR

Improving the energy efficiency of wireline interconnects has become a necessity to sustain growing data rates. Low-voltage wireline link operation is a promising approach to achieve close to pico-joules/bit energy efficiency [1-3]. However, conventional low-voltage wireline links suffer from two limitations: (a) limited equalization (12dB) or no channel equalization, and (b) low data rates (8Gb/s @0.75V) even with fine process technology nodes [2]. Limited equalization in low-voltage links is due to the fact that conventional equalization (FFE, DFE) is performed on the high-speed data path. As a result, when the supply voltage is reduced (below 0.6V), the bandwidth of the data path is severely diminished, which compromises FFE and DFE operation. Moreover, the requirement to generate and transport narrow clock pulses in conventional low-voltage output drivers limits the maximum achievable data rates [1,3]. In view of these limitations, we propose a new approach to equalize heavy channel loss at low supply voltages by moving the equalizing operation out of the data path and into the clock path. Integrated pulse width modulation (iPWM) has been demonstrated to efficiently equalize heavy channel loss by encoding data transition edges [4]. In this work, we present a modified iPWM encoding, an energy-efficient wireline transceiver with a clock-domain low-voltage iPWM encoder, and a low-voltage multiplexer + output driver architecture, which can operate from 0.5-to-0.9V and 3-to-16Gb/s compensating 27dB channel loss at 10Gb/s with an energy efficiency of 1.8pJ/b. Compared to prior low-voltage links [2], the proposed transceiver can compensate 15dB higher loss at 25% higher data rate (10Gb/s) operating at 13% lower supply voltage (0.65V) while using an older technology node (65nm). 
Compared to the state-of-the-art iPWM-based wireline transceiver operating at nominal supply voltage [4], the proposed transceiver achieves 2.5x lower energy/bit while compensating similar channel loss, and compared to a conventional equalization-based transceiver [5], it compensates 8dB higher loss with similar energy efficiency.

Figure 16.4.1 shows the conventional FFE implementation on the high-bandwidth data path and the proposed clock-domain iPWM encoding on the low-bandwidth clock path. At low supply voltages, the rise/fall time of the CMOS logic in pre-drivers and serializers increases and, as a result, it fails to pass narrow data pulses, which compromises the FFE. In the proposed approach, equalization is performed by encoding the sub-rate clock edges operating at a lower supply voltage. The proposed iPWM encoding is a highly digital operation, and a low-supply-voltage implementation helps to quadratically reduce the switching power. The encoded sub-rate clock phases multiplex the data using an N:1 multiplexer, thus generating iPWM-encoded data at the output. The proposed iPWM encoding is based on the observation that both pre-cursor and post-cursor ISI can be reduced by modulating the pulse width at the leading and trailing edges of the consecutive identical digits (CID), respectively. The positions of the CID leading and trailing edges are controlled using encoding coefficients  $\beta_m$  and  $\alpha_m$ , respectively. These encoding coefficients are estimated based on the channel loss profile.
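The quadratic dependence of switching power on supply voltage can be made concrete with the usual first-order model P = C·V²·f. The sketch below assumes a fixed switched capacitance and ignores leakage and short-circuit current, so it only indicates the trend across the 0.5V to 0.9V supply range of this design.

```python
def dynamic_power_ratio(v_new, v_ref, f_new=1.0, f_ref=1.0):
    """Ratio of CMOS switching power, assuming P = C * V^2 * f with fixed C."""
    return (v_new / v_ref) ** 2 * (f_new / f_ref)

ratio_same_clock = dynamic_power_ratio(0.5, 0.9)
ratio_scaled = dynamic_power_ratio(0.5, 0.9, f_new=3.0, f_ref=16.0)
print(f"0.5V vs 0.9V at equal frequency: {ratio_same_clock:.3f}x")
print(f"0.5V/3Gb/s vs 0.9V/16Gb/s:       {ratio_scaled:.3f}x")
```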

Figure 16.4.2 shows the block diagram of the proposed low-voltage transceiver. The transmitter consists of a parallel PRBS generator, 4-tap quarter-rate iPWM control signal generator and 4:1 multiplexed source-series terminated output driver. An off-chip PLL is used to provide the clock, which is divided down on-chip to quarter-rate four-phase clocks and provided to the transmitter through the iPWM encoder. The receiver consists of a passive equalizer, amplifiers and quarter-rate samplers with offset correction. At lower supply voltage, clock distribution suffers from duty cycle errors and phase offsets. In view of this, phase and duty cycle correction blocks are multiplexed between the transmitter and receiver, which help to clean up the clock phases. The quarter-rate clocks to the transmitter are encoded with the proposed iPWM.

Figure 16.4.3 shows the proposed iPWM encoder, timing diagram, and block diagram of the proposed low voltage 4:1 multiplexed 50Ω output driver. The four clock phases of the output driver are independently iPWM encoded. The tunable coefficients  $\alpha_1$ ,  $\alpha_2$ ,  $\beta_1$ , and  $\beta_2$ , are implemented in 4 independent iPWM logic stages and are enabled with control signals generated at quarter-rate. These encoding coefficients equivalently provide two pre-cursor ( $\beta_1$ ,  $\beta_2$ ) and two post-cursor correction taps ( $\alpha_1$ ,  $\alpha_2$ ). In the proposed clock domain iPWM encoding, pulse width modulation of CIDs is achieved by modulating the duty-cycle of the quarter-rate clocks relative to each other, which in turn modulates the CID pulse width at the serializer. Duty-cycle modulation is achieved by varying the rise and fall time using current starved inverters. Each iPWM encoder tap consists of two sets of transition-time modulators and the strength of the modulation is controlled by the tap coefficient with a precision of 0.1ps. Multiple taps can be implemented by using a cascaded architecture making the proposed design highly scalable.

In the proposed multiplexing output driver, Stg1 latches the incoming data at quarter-rate on node 'X'. Stg2 keeps the Final stage tri-stated for 3/4<sup>th</sup> of a clock cycle and propagates the latched data for 1/4<sup>th</sup> of the clock cycle. The Final stage implements a 50Ω pull-up and pull-down impedance using tuned PMOS and NMOS transistors (shaded), respectively, which are shared by all four multiplexers. Unlike conventional architectures, no narrow pulses propagate through the pre-drivers. As a result, the proposed multiplexing output driver architecture operates at high data rates at low-supply voltages.
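Functionally, the 4:1 multiplexing driver interleaves four quarter-rate streams, with each path driving the output for one quarter of the clock cycle and remaining tri-stated for the rest. The following behavioral sketch (function name hypothetical) captures only that ordering, not the electrical behavior of Stg1, Stg2 or the Final stage.

```python
def mux_4to1(lanes):
    """Interleave four equal-length quarter-rate bit lists into one full-rate list."""
    if len(lanes) != 4 or len({len(lane) for lane in lanes}) != 1:
        raise ValueError("expected four lanes of equal length")
    out = []
    for quad in zip(*lanes):   # one quarter-rate clock cycle
        out.extend(quad)       # lane k drives the pad during the k-th quarter
    return out

lanes = [[1, 0], [0, 0], [1, 1], [0, 1]]
print(mux_4to1(lanes))
```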

Figure 16.4.4 shows the measured insertion loss profiles of the channels and far-end output eye openings of the transmitter with clock-domain iPWM equalization for 3Gb/s, 10Gb/s and 16Gb/s PRBS7 data with 21dB, 20dB, and 24dB loss, respectively. Operating at 3Gb/s (0.5V), 10Gb/s (0.65V), and 16Gb/s (0.9V), the proposed transmitter achieves vertical eye openings of 24mV<sub>pp</sub>, 20mV<sub>pp</sub>, and 40mV<sub>pp</sub>, respectively. Figure 16.4.5 shows the measured BER bathtub curves of the complete transceiver for three different data rates and FR4 channels. For BER < 10<sup>-12</sup>, the proposed transceiver achieves a horizontal eye opening of 78.8ps (0.24UI), 9.8ps (0.1UI), and 19.6ps (0.31UI) at 3Gb/s, 10Gb/s, and 16Gb/s compensating 21dB, 27dB, and 22dB of loss, respectively. The measured power spectral density of iPWM-encoded and NRZ-encoded transmitted data at 16Gb/s shows the high-frequency amplification, thus demonstrating the equalization capability of the proposed iPWM encoding. The low-frequency amplification is achieved through passive equalizers at the receiver. The effect of added noise on the clock due to the iPWM encoder is also measured. In the presence of the proposed clock-domain iPWM encoding at 5GHz, the RMS jitter, integrated from 1kHz to 100MHz, degrades marginally from 1.483ps to 1.487ps. The performance of the proposed transceiver is compared with the state-of-the-art in Fig. 16.4.6. Operating at 10Gb/s, the complete transceiver consumes 18mW of power from a 0.65V supply, can equalize 27dB loss, and occupies an active area of 0.13mm<sup>2</sup>. The die micrograph is shown in Fig. 16.4.7.
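The horizontal eye openings quoted in picoseconds and in UI can be cross-checked with a one-line conversion, since one UI equals 1000ps divided by the data rate in Gb/s. The helper name below is hypothetical.

```python
def eye_opening_ui(opening_ps, rate_gbps):
    """Convert a horizontal eye opening from ps to UI (1 UI = 1000 / rate_gbps ps)."""
    return opening_ps * rate_gbps / 1000.0

for opening_ps, rate in [(78.8, 3), (9.8, 10), (19.6, 16)]:
    print(f"{rate:>2} Gb/s: {opening_ps} ps = {eye_opening_ui(opening_ps, rate):.2f} UI")
```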

### Acknowledgment:

We thank Tektronix Inc. for equipment donation and Yutao Liu for his help in design and layout.

### References:

- [1] W. Choi, et al., "A 0.45-to-0.7V 1-to-6Gb/s 0.29-to-0.58pJ/b source-synchronous transceiver using automatic phase calibration in 65nm CMOS," ISSCC, pp. 66-67, Feb. 2015.
- [2] R. Inti, et al., "A 0.5-to-0.75V, 3-to-8 Gbps/lane, 385-to-790 fJ/b, bi-directional, quad-lane forwarded-clock transceiver in 22nm CMOS," IEEE Symp. VLSI Circuits, pp.346-347, Feb. 2015.
- [3] Y.-H. Song, et al., "A 0.47-0.66 pJ/bit, 4.8-8 Gb/s I/O transceiver in 65nm CMOS," IEEE JSSC, pp. 1276-1289, vol. 48, no.5, May 2013.
- [4] A. Ramachandran, et al., "A 16Gb/s 3.6pJ/b wireline transceiver with phase domain equalization scheme: integrated pulse width modulation (iPWM) in 65nm CMOS," ISSCC, pp. 488-489, Feb. 2017.
- [5] Y. Liu, et al., "A 10Gb/s compact low-power serial I/O with DFE-IIR equalization in 65nm CMOS," ISSCC, pp. 182-183, Feb. 2009.



Figure 16.4.1: Conventional FFE implementation on the high-speed data path and proposed iPWM implementation on  $N^{th}$  rate clock path. Timing diagram and time domain response of iPWM. Mathematical representation of the proposed iPWM with pre-cursor and post-cursor correction.



Figure 16.4.2: Block diagram of the proposed low-voltage transceiver with clock domain iPWM encoder.



Figure 16.4.3: Block diagram of the proposed clock domain integrated pulse width modulation (iPWM) encoder and timing diagram in the presence of consecutive identical digits. Block diagram of the proposed low voltage 4:1 multiplexed  $50\Omega$  output driver.



Figure 16.4.4: Measured insertion loss profiles of the channels. Equalized eye diagrams at the far-end channel outputs for 3Gb/s, 10Gb/s and 16Gb/s for 21dB, 20dB and 24dB loss, respectively.



Figure 16.4.5: Measured insertion loss profiles of the channels. BER bathtub curves for the complete transceiver with equalized 3Gb/s, 10Gb/s, and 16Gb/s PRBS7 data. Measured power spectral density of NRZ and iPWM encoded data. Measured phase noise with and without clock-domain iPWM at 5GHz.

|                     | This Work    |              |              | Low Voltage Transceivers |             |                  | Nominal Voltage Transceivers |             |                   |
|---------------------|--------------|--------------|--------------|--------------------------|-------------|------------------|------------------------------|-------------|-------------------|
|                     | ISSCC'15 [1] | VLSI'15 [2]  | JSSC'13 [3]  | Shekhar                  | ISSCC'17 [4] | JSSC'11 Zhong [5] | ISSCC'09                     |             |                   |
| Technology [nm]     | 65           | 65           | 65           |                          | 65          | 22               | 65                           | 65          | 40                |
| Supply [V]          | 0.5          | 0.65         | 0.9          | 0.7                      | 0.75        | 0.65             | 0.7                          | 0.9/1.0/1.1 | 1.1               |
| Data rate [Gb/s]    | 3            | 10           | 16           | 6                        | 8           | 6.4              | 5                            | 16          | 16                |
| Tx Equalization     | iPWM (Clock) | iPWM (Clock) | iPWM (Clock) | None                     | 2-tap FFE   | None             | 2-tap FFE                    | iPWM (Data) | 4-tap FFE         |
| Rx Equalization     | Passive Eq.  | Passive Eq.  | Passive Eq.  | None                     | None        | CTLE             | None                         | CTLE        | CTLE + 10-tap DFE |
| Loss [dB]           | 21           | 27           | 22           | N/A                      | 12          | 8.4              | 8                            | 19          | 27                |
| Tx Power [mW]       | 1.74         | 6.17         | 18.94        | 1.86                     | 0.32        | 1.92             | 3                            | 45.1        | 45.5              |
| Rx Power [mW]       | 1.82         | 4.93         | 11.44        | 1.02                     | 0.25        | 1.07             | 3                            | 12.2        | 24.4              |
| Ck Power [mW]       | 1.37         | 6.87         | 19.84        | 0.66                     | 0.22        | 0.038            | 7                            | -           | -                 |
| Efficiency [pJ/bit] | 1.65         | 1.8          | 3.14         | 0.59                     | 0.79        | 0.47             | 2                            | 3.58        | 4.37              |
| Area [mm²]          | 0.13         | 0.13         | 0.13         | 0.15                     | -           | 0.057            | 0.041                        | 0.21        | 0.21              |



Figure 16.4.6: Performance summary and comparison with state-of-the-art designs.
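The energy-efficiency row for this work can be recomputed from the Tx, Rx and clock power rows as total power divided by data rate. The sketch below does so for the three operating points; the small differences from the tabulated values are within rounding of the reported powers.

```python
# Data rate in Gb/s -> (Tx, Rx, Ck) power in mW for this work, from the table.
POWER_MW = {
    3: (1.74, 1.82, 1.37),
    10: (6.17, 4.93, 6.87),
    16: (18.94, 11.44, 19.84),
}

def efficiency_pj_per_bit(rate_gbps, powers_mw):
    """mW divided by Gb/s gives pJ/b."""
    return sum(powers_mw) / rate_gbps

eff = {rate: efficiency_pj_per_bit(rate, p) for rate, p in POWER_MW.items()}
for rate, value in eff.items():
    print(f"{rate:>2} Gb/s: {sum(POWER_MW[rate]):.2f} mW -> {value:.2f} pJ/b")
```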



Figure 16.4.7: Die micrograph of the proposed transceiver.

## 16.5 A 20Gb/s Transceiver with Framed-Pulsewidth Modulation in 40nm CMOS

Sejun Jeon<sup>1</sup>, Woohyun Kwon<sup>1</sup>, Taehun Yoon<sup>2</sup>, Jong-Hyeok Yoon<sup>1</sup>,  
Kyeongha Kwon<sup>1</sup>, Jaehyeok Yang<sup>1</sup>, Hyeon-Min Bae<sup>1</sup>

<sup>1</sup>KAIST, Daejeon, Korea

<sup>2</sup>Samsung Electronics, Hwaseong, Korea

Expanding signal bandwidths in high-speed links is increasing intersymbol interference (ISI), which necessitates the enhancement of spectral efficiency. Recently, various modulation schemes including pulse amplitude modulation (PAM) [1], pulsewidth modulation (PWM) [2], permutation modulation (PM) [3] and duo-binary signaling [4] have been investigated in high-speed wireline links to increase spectral efficiency. However, multi-level signaling schemes suffer from SNR reduction and tighter linearity requirements when compared to conventional NRZ signaling. In this work, a 20Gb/s serial link transceiver employing a framed-pulsewidth modulation (FPWM) scheme that overcomes the SNR degradation without linearity requirement is presented. The FPWM scheme encodes data at the location and the width of pulses in a frame spanning multiple UIs while maintaining a minimum pulsewidth equal to 1UI. The test-chip achieves a coding gain of 33%, which allows the total throughput of 20Gb/s while keeping the baud rate of 15Gb/s. The equalization core incorporates programmable 3-tap pre-emphasis at the transmitter and a continuous-time linear equalizer (CTLE) at the receiver, to compensate for channel insertion loss of up to 12dB at the baud frequency. The transceiver IC is implemented in 40nm CMOS and consumes 90.6mW from a 0.9V supply.

Figure 16.5.1 illustrates the encoding scheme of the FPWM. An FPWM frame has three elements: VOID (V) designates '0', DATA (D) designates '1', and FLAG designates '0'. The FLAG is inserted to denote the end of a frame and serves as a reference time point. The pulsewidths of '1' and '0' in the FPWM are integer multiples of T, where T is equal to 0.25UI. The minimum width of the pulses is set to 4T (1UI) to prevent narrow pulses that are susceptible to ISI. In this test-chip, the length of one FPWM frame is set to 6UI. A 6UI-FPWM frame contains a maximum of 2 data pulses, which means that at most 4 transitions can occur. To maintain DC balance, all bits in even frames are inverted. The length of the first VOID ($V_1$) is set to either 0 or at least 4T to prevent the generation of narrow pulses, because the polarities of $V_1$ in even frames and of the FLAG in odd frames differ. A 6UI-FPWM encoding scheme can carry a maximum of 8.34b per frame (a 39% coding gain), but only 256 code words (8b) are used, giving a coding gain of 33%, for implementation simplicity.
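The coding-gain figures quoted above follow from simple arithmetic on the frame length and the number of bits carried per frame. The short sketch below reproduces them; the constant and function names are illustrative and do not come from the design itself.

```python
# Values taken from the text; names are illustrative only.
FRAME_UI = 6        # frame length in unit intervals
BITS_USED = 8       # 256 code words are used per frame
BITS_MAX = 8.34     # upper bound quoted for a 6UI frame
BAUD_GBD = 15.0     # line symbol rate in Gbaud


def coding_gain(bits_per_frame, frame_ui):
    """Extra bits per UI relative to NRZ, which carries 1 bit per UI."""
    return bits_per_frame / frame_ui - 1.0


gain_used = coding_gain(BITS_USED, FRAME_UI)          # about 0.33
gain_max = coding_gain(BITS_MAX, FRAME_UI)            # about 0.39
throughput_gbps = BAUD_GBD * BITS_USED / FRAME_UI     # 20 Gb/s

print(f"gain used: {gain_used:.1%}, max: {gain_max:.1%}, "
      f"throughput: {throughput_gbps:.1f} Gb/s")
```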

Figure 16.5.2 shows the overall architecture of the proposed transceiver IC. The CDR lane consists of logic core, TRX and global clock signal generator. The TX incorporates an encoder, a tap generator, a three-tap pre-emphasis, multiplexers (MUX), and driver blocks. The encoder incorporates four 8b look-up tables (LUTs) with VOID and DATA blocks providing consecutive '0s' and '1s' spanning from 4T to 20T. The encoder assembles frames by using VOID and DATA blocks without storing entire frame sets. The logic depth of the encoder is 3. Note that FLAGs are generated by using 4T VOID and DATA blocks. Each LUT outputs 24b for a frame, which is then divided into 2 sets of 12-phase quad-rate data in the 96:12 MUX. The tap generator located prior to the 12:4 MUX prepares three sets of T-spaced quad-rate data for pre-emphasis. Then, 12:4 and 4:1 MUXs serialize the data-stream.
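As a consistency check on the numbers in this paragraph, the sketch below relates the LUT widths to the line rate. It assumes the encoder runs at the 625MHz logic clock quoted for the synthesized encoder and decoder, and that each T-spaced sample corresponds to one LUT output bit; the variable names are illustrative.

```python
# Values from the text; names and the per-sample interpretation are assumptions.
NUM_LUTS = 4
PAYLOAD_BITS_PER_LUT = 8      # 8b data word per frame
SAMPLES_PER_FRAME = 24        # 6UI frame / 0.25UI resolution
LOGIC_CLOCK_GHZ = 0.625       # synthesized logic clock
T_PER_UI = 4                  # T = 0.25UI
BAUD_GBD = 15.0

payload_gbps = NUM_LUTS * PAYLOAD_BITS_PER_LUT * LOGIC_CLOCK_GHZ
mux_input_width = NUM_LUTS * SAMPLES_PER_FRAME
encoder_sample_rate = mux_input_width * LOGIC_CLOCK_GHZ   # T-spaced samples, GS/s
line_sample_rate = BAUD_GBD * T_PER_UI                    # T-spaced samples, GS/s

print(payload_gbps, mux_input_width, encoder_sample_rate, line_sample_rate)
```

Under these assumptions the four LUTs deliver 96 T-spaced samples per logic cycle, which matches the input width of the 96:12 MUX, and the payload rate equals the 20Gb/s throughput.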

The RX consists of a termination, a CTLE, a limiting amplifier (LA), 12-phase CMOS quad-rate samplers, a retimer, 1:8 demultiplexers (DEMUX), a synthesized CDR logic block including a decoder, and a phase-rotator-based multi-phase clock signal generator. The CTLE provides controllable boosting gain of up to 16dB at 7.5GHz, and the LA provides a DC gain of 15dB. The track-and-hold-based quad-rate CMOS sampler requires only a single-phase clock signal, which reduces capacitance in the data-path and provides immunity to duty-cycle variation of the clock signal. A quad-rate RX and TX architecture is employed so that CMOS logic gates can be used instead of current-mode logic (CML) gates for power reduction. The demultiplexed data and edge information are provided to the CDR logic for phase lock. A global clock signal generator distributes a 10GHz half-rate clock signal to the phase-rotator in the RX and TX via an on-chip T-line. The decoder consists of a transition detector and four 8b LUTs. The transition detector determines the length and the location of pulses using bitwise combinational logic gates. The location of pulses is identified via an XOR operation among adjacent bits, and the length of a pulse is determined from the distance between transition points. Then, the LUT restores the original data. The maximum logic depth of both the transition detector and the LUT is 3. The synthesized encoder and decoder operating at 625MHz have a latency of 1 and 3 clock cycles, respectively.
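The transition-detection step can be illustrated in software. The following is a minimal behavioral sketch, not the synthesized logic: it XORs adjacent T-spaced samples to locate transitions and derives pulse lengths from the distance between them. The function names and the example frame are hypothetical, and the example frame is not claimed to be a valid FPWM code word.

```python
def detect_transitions(samples):
    """Return indices where a sample differs from its predecessor (adjacent XOR)."""
    return [i + 1 for i in range(len(samples) - 1)
            if samples[i] ^ samples[i + 1]]


def pulse_lengths(samples):
    """Return (level, length in T) for every run between transition points."""
    edges = [0] + detect_transitions(samples) + [len(samples)]
    return [(samples[start], end - start)
            for start, end in zip(edges, edges[1:])]


# Hypothetical 24-sample (6UI) frame: 4T low, 8T high, 8T low, 4T high.
frame = [0] * 4 + [1] * 8 + [0] * 8 + [1] * 4
transitions = detect_transitions(frame)
runs = pulse_lengths(frame)
print(transitions)
print(runs)
```

In hardware, the pulse locations and lengths obtained this way would index the LUT that restores the original data word.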

Figure 16.5.3 depicts the schematics of the 12:4 and 4:1 MUXs. Both the four parallel 12:4 MUXs and the 4:1 MUXs are implemented using static logic gates instead of power-hungry CML-type MUXs, and are driven by quad-rate clock signals. The quad-rate data at the input of the 12:4 MUX is selected and multiplexed when multiphase clock signals with 120° phase differences overlap. The 4:1 MUXs realign the output data of the 12:4 MUXs, and the output is evaluated when two neighboring-phase clock signals overlap. Inductive loads are employed in the 4:1 MUX for bandwidth enhancement and power reduction. The current-summing CML driver following the 4:1 MUXs implements a three-tap pre-emphasis filter.

The FPWM transceiver is fabricated in 40nm CMOS and flip-chip packaged. PRBS generators and checkers are implemented together with the logic core for a BER self-test under loopback operation. Figure 16.5.4 shows the waveforms of consecutive frames, a single frame, an eye-diagram of consecutive minimum-UIs (T-spaced), and the recovered clock signal. Serialized normal and inverted frames are clearly distinguished by the FLAGs, and the measured peak-to-peak jitter is 8.8ps with PRBS31. The standard deviation of the recovered clock jitter is 1.14ps. The test setup, including MXP40 cable/connectors, SMA connectors and board traces, is shown in Fig. 16.5.5. The bathtub measurement is taken over a 30cm FR4 channel with 12dB loss at 7.5GHz, and BER<4×10<sup>-9</sup> is achieved at 20Gb/s with PRBS31. In addition, the 6UI-FPWM scheme demonstrates better tolerance to dispersion caused by channel loss than an NRZ scheme, thanks to the reduction of the baud rate. Note that the equalization applied for the NRZ signaling includes a CTLE and a 1-tap DFE. The total measured power consumption of the transceiver is 90.6mW (TX: 40.9mW, RX: 34.2mW, Logic core: 15.5mW) from a 0.9V supply, which corresponds to a power efficiency of 4.53mW/Gb/s. Performance comparisons between this design and prior work are summarized in Fig. 16.5.6. The microphotograph of the FPWM transceiver, occupying 2.2×0.48mm<sup>2</sup>, is shown in Fig. 16.5.7.
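The power breakdown and the efficiency figure can be cross-checked with a few lines of arithmetic. The values below are the measured numbers quoted above; the dictionary layout and names are illustrative.

```python
# Measured power breakdown from the text, in mW.
power_mw = {"TX": 40.9, "RX": 34.2, "Logic core": 15.5}
DATA_RATE_GBPS = 20.0

total_mw = sum(power_mw.values())
efficiency_mw_per_gbps = total_mw / DATA_RATE_GBPS   # numerically equal to pJ/b

print(f"total: {total_mw:.1f} mW, "
      f"efficiency: {efficiency_mw_per_gbps:.2f} mW/Gb/s")
```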

### References:

- [1] Y. Frans, et al., "A 56Gb/s PAM4 Wireline Transceiver using a 32-way Time-Interleaved SAR ADC in 16nm FinFET," *IEEE Symp. VLSI Circuits*, pp. 1-2, June 2016.
- [2] W. Wang, et al., "A 10-Gb/s, 107-mW Double-Edge Pulsewidth Modulation Transceiver," *IEEE TCAS-I*, vol. 61, no. 4, pp. 1068-1080, April 2014.
- [3] A. Singh, et al., "A Pin- and Power-Efficient Low-Latency 8-to-12Gb/s/wire 8b8w-Coded SerDes Link for High-Loss Channels in 40nm Technology," *ISSCC*, pp. 442-443, Feb. 2014.
- [4] J. Lee, et al., "Design and Comparison of Three 20-Gb/s Backplane Transceivers for Duobinary, PAM4, and NRZ Data," *IEEE JSSC*, vol. 43, no. 9, pp. 2120-2133, Sept. 2008.
- [5] M. Harwood, et al., "A 225mW 28Gb/s SerDes in 40nm CMOS with 13dB of analog equalization for 100GBASE-LR4 and optical transport lane 4.4 applications," *ISSCC*, pp. 326-327, Feb. 2012.



Figure 16.5.1: FPWM encoding scheme.



Figure 16.5.2: The overall architecture of the proposed FPWM transceiver.



Figure 16.5.3: Schematics and timing diagrams of 12:4 and 4:1 MUXes.



Figure 16.5.4: Measured waveforms of consecutive frames, a single frame, an eye-diagram and a recovered clock signal.



Figure 16.5.5: Experimental setting and measured bathtub curve for a 30cm FR4 channel. The BER performance comparison between NRZ and 6UI-FPWM for various FR4 channels.

| Design                       | This Work                           | VLSI 2016 [1]                       | TCAS1 2014 [2]                         | ISSCC 2014 [3]                      | JSSC 2008 [4]                       | ISSCC 2012 [5]                      |
|------------------------------|-------------------------------------|-------------------------------------|----------------------------------------|-------------------------------------|-------------------------------------|-------------------------------------|
| Process                      | 40nm CMOS                           | 16nm FinFET                         | 45nm CMOS SOI                          | 40nm CMOS                           | 90nm CMOS                           | 40nm CMOS                           |
| Data-rate [Gb/s]             | 20 (2 Wires)                        | 56 (2 Wires)                        | 10 (2 Wires)                           | 96 (8 Wires)                        | 20 (2 Wires)                        | 28 (2 Wires)                        |
| Modulation                   | FPWM                                | PAM4                                | Double-Edge PWM                        | 8b8w                                | Duo-binary                          | NRZ                                 |
| Linearity Requirement        | -                                   | O                                   | -                                      | O                                   | O                                   | -                                   |
| Equalizer                    | TX : 3tap pre-emphasis<br>RX : CTLE | TX : 3tap pre-emphasis<br>RX : CTLE | -                                      | TX : 1tap Pre-emphasis<br>RX : CTLE | TX : 3tap pre-emphasis<br>RX : CTLE | TX : 3tap pre-emphasis<br>RX : CTLE |
| Channel Loss                 | -12dB@7.5GHz                        | -25dB@14GHz                         | -15dB@5GHz                             | -15dB@6GHz                          | -5dB@5GHz                           | -13dB@14GHz                         |
| BER                          | $< 4 \times 10^{-9}$                | $> 10^{-9}$                         | $< 10^{-12}$                           | $< 8 \times 10^{-15}$               | $< 10^{-12}$                        | $< 10^{-15}$                        |
| Chip Area [mm <sup>2</sup> ] | 2.2 X 0.48                          | 2.8                                 | TX: 0.093 X 0.094<br>RX: 0.218 X 0.160 | 1                                   | TX: 0.65 X 0.35<br>RX: 0.43 X 0.24  | 2.4 X 1.5 (4 Lane)                  |
| Supply Voltage [V]           | 0.9                                 | 0.9, 1.2, 1.8                       | 1.2                                    | 0.9                                 | 1.5                                 | -                                   |
| Power [mW] (Tx+Rx+Logic)     | 90.6                                | 550                                 | 107                                    | 412                                 | 195                                 | 225                                 |
| Efficiency [mW/Gb/s]         | 4.53                                | 9.82                                | 10.7                                   | 4.29                                | 9.75                                | 8.04                                |

Figure 16.5.6: Performance summary and comparison to prior works.



Figure 16.5.7: Microphotographs of the chip.

## 16.6 A 7.8Gb/s/pin 1.96pJ/b Compact Single-Ended TRX and CDR with Phase-Difference Modulation for Highly Reflective Memory Interfaces

Sooeon Lee, Jaeyoung Seo, Kyunghyun Lim, Jaehyun Ko, Jae-Yoon Sim, Hong-June Park, Byungsub Kim

Pohang University of Science and Technology, Pohang, Korea

Compact transceivers (TRXs) for highly reflective (HR) interconnects are strongly demanded by the memory industry. Although discontinuous reflective channels like multi-drop DRAM interfaces are less suitable for high data rates than continuous point-to-point channels, their great advantages in high capacity, high throughput, and low latency attract the market [1-3]. However, compact TRXs for low-loss HR channels are more challenging than for high-loss low-reflection (LR) channels. Although the long-tail ISI of a high-loss LR channel can be cost-efficiently canceled by an FFE with a few taps or a DFE with IIR feedback (DFE-IIR) [4], the irregular ISI of a low-loss HR channel requires many DFE taps [2], demanding unacceptably large hardware cost and power dissipation. As an alternative solution, a multi-tone (MT) TRX was proposed to avoid a notch of the frequency response [3], but it is also very costly for HR channels with many notches. This paper proposes a 7.8Gb/s/pin compact single-ended (SE) TRX with simple clock data recovery (CDR) using phase-difference modulation (PDM) for HR memory interfaces. For reliable operation of the TRX/CDR, a phase-difference amplifier (PDA) is also proposed to satisfy its stringent timing requirement.

The measured S-parameters of the channel are shown in Fig. 16.6.1. If an MT scheme with a spectral efficiency of 1b/s/Hz is used, then at least 10 tones are needed to achieve 7.8Gb/s because of the 10 notches below 7.8GHz. Since utilizing 10 tones costs too much hardware and power, MT is not suitable for this channel. In the time domain, the pulse response of pulse amplitude modulation (PAM) has large reflective ISI occurring at around 10UI and 19UI. A compact TX FFE or DFE-IIR with few taps cannot compensate for this reflective ISI. To achieve eye opening, at least 10 DFE taps are required as simulated in Fig. 16.6.1, which is very costly.

For efficient data communication via such an HR channel, a PDM TRX is proposed (Fig. 16.6.2). In PDM, a clock frequency avoiding notches of the channel response is used, and the position of every clock edge is modulated by data. The RX recovers the data by the phase-difference between the received signals at every clock edge. Figure 16.6.2 also depicts the received signal modulated by a single bit ‘1’. The modulated edge positions with respect to the reference clock CK are simulated, plotted and regarded as the PDM pulse response (Fig. 16.6.1). The large reflective ISI observed in PAM is greatly suppressed in PDM because the PDM signal is mostly dominated by the clock frequency component. As a result, PDM can achieve decent eye opening without any DFE whereas PAM requires at least 10 DFE taps to achieve eye opening, or 14 taps to achieve the same eye size as PDM (Fig. 16.6.1). The PDM TRX can be configured for both SE and differential (DFF) signaling. In SE mode, many TXs transmit data while one TX forwards CK<sub>TX</sub>. In DFF mode, a pair of TXs per lane modulates signals in opposite directions. The path-delay mismatch can be adjusted at each TX. PDM also offers a simple CDR by interpolating and skewing the received signals.

Figure 16.6.3 shows the PDM TX. Two similar PDM units (PDM1, 2) are cascaded for optimal trade-off between jitter and PDM range. For the input clock (CK<sub>skew</sub>), the duty cycle is corrected, and the skew is adjusted to compensate delay mismatch with other lanes. PDM1 modulates CK<sub>skew</sub> by controlling pull-up/down slopes depending on the full-rate serialized data D<sub>SER</sub>. The slopes are modulated by D<sub>SER</sub> with two weak foot-transistor banks (F1 and F2) and two strong foot-transistors (M1 and M2). For instance, if D<sub>SER</sub>=0, then the enabled M1 maximizes the pull-up speed while the pull-down is much slower and its delay can be digitally tuned by F2. This modulation can be disabled for clock forwarding in SE mode. After PDM, the modulated signal is eventually transmitted by a source-series-termination (SST) driver. Eye diagrams (@7.7Gb/s, PRBS31) of PDM signals are measured at the RX side of the channel (Fig. 16.6.1), and show that PDM data communication via an HR channel is feasible without any FFE or DFE. Although the eye looks closed when independent RX and CK eyes are overlaid in SE mode, a decent differential eye (70mV, 36ps) is achieved when the common jitter of RX and CK is eliminated in the differential eye (RX-CK). In DFF mode, a more reliable eye (110mV, 37ps) is measured.
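The benefit of detecting the phase difference between the data lane and the forwarded clock can be illustrated with a toy timing model. This is a sketch under strong assumptions: edges are ideal time stamps, a '1' is mapped to a late edge and a '0' to an early edge (the mapping, the shift `delta` and all function names are hypothetical), and the channel is reduced to a jitter term common to both lanes.

```python
import random


def pdm_modulate(bits, period=1.0, delta=0.05):
    """Shift edge k late for a 1 and early for a 0 (assumed mapping)."""
    return [k * period + (delta if bit else -delta)
            for k, bit in enumerate(bits)]


def reference_edges(count, period=1.0):
    """Unmodulated forwarded-clock edges."""
    return [k * period for k in range(count)]


def add_common_jitter(data_edges, ref_edges, amplitude, seed=0):
    """Apply the same random timing error to both lanes."""
    rng = random.Random(seed)
    jitter = [rng.uniform(-amplitude, amplitude) for _ in data_edges]
    return ([d + j for d, j in zip(data_edges, jitter)],
            [r + j for r, j in zip(ref_edges, jitter)])


def pdm_demodulate(data_edges, ref_edges):
    """Decide each bit from the sign of the edge-time difference."""
    return [1 if d > r else 0 for d, r in zip(data_edges, ref_edges)]


bits = [1, 0, 0, 1, 1, 0, 1, 0]
rx, ck = add_common_jitter(pdm_modulate(bits),
                           reference_edges(len(bits)), amplitude=0.3)
recovered = pdm_demodulate(rx, ck)
print(recovered)
```

Even though the common jitter amplitude in this example is larger than the modulation depth, the bits are recovered because only the difference between the two lanes is evaluated, which mirrors the improvement seen in the differential (RX-CK) eye.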

Detection of the small phase difference between the received signals (RX and CK/RXB) with large common-mode swing (Fig. 16.6.2) is the most challenging requirement for RX design. To overcome such a stringent requirement, a complementary half-rate PDM RX employing a PDA is proposed. In this architecture, the even data path detects the phase difference of the inputs only at falling edges while the odd path (the complementary conjugate counterpart) works only at rising edges. For simplicity, only the even path is depicted with the CDR in Fig. 16.6.4. Because detection by the even path is done when the input common-mode voltage is high, N-type peaking amplifiers (N-amps) are exploited at input stages for good sensitivity (Fig. 16.6.4). Gain and peaking of N-amps can be digitally controlled. To further relieve the stringent requirement, a P-type of the PDA (P-PDA) is inserted between N-amps and the slicer to enlarge the small phase difference (Fig. 16.6.4). The P-PDA amplifies the phase difference between two differential inputs by the hysteresis of cross-coupled NMOSs (X-NMOSs). With the minimum strength configuration of X-NMOSs, the differential output is proportional to the input difference just like a regular amplifier. With this small output eye, the slicer clock CK<sub>RX</sub> requires too stringent timing control. On the contrary, if the strength of X-NMOS is properly configured, the dominance between the input PMOS pairs and X-NMOSs is varied by the input common-mode voltage level (V<sub>CMPDA</sub>). For example, when V<sub>CMPDA</sub> is low, the PMOS pairs dominate, and the outputs split depending on the phase difference at the rising edge of the inputs. When V<sub>CMPDA</sub> becomes high, then the X-NMOSs dominate and hold the output like a latch. 
Consequently, PDAs stretch the horizontal eye and increase the acceptable timing range for CK<sub>RX</sub>; if the input phase difference is large enough, the stretched horizontal eye can be even larger than 1UI with the half-rate design because the PDA evaluates in less than a half clock period (1UI) and holds the result during another clock phase (1UI).

Owing to the PDAs, the clock can be recovered by simply interpolating and delaying input signals. A CML interpolator interpolates RX and RXB in DFF mode, and bypasses CK in SE mode. The interpolator output is level-converted, skew-controlled, and then used for data recovery. CMOS interpolators with strength-controllable inverters and buffers are used for fine skew control.

The PDM TRX is fabricated in 65nm CMOS technology and occupies 0.0078mm<sup>2</sup>. The chip was tested with the channel of Fig. 16.6.1 while measuring the *in-situ* eye diagram at the PDA by sweeping the slicer offsets and the clock phases. Reliable eye opening (267mV, 1UI) was measured in DFF mode at 6Gb/s with PRBS31. A 1UI horizontal eye is not surprising for the half-rate PDA design, as discussed. Due to the greatly stretched horizontal eye from the PDAs, the simple CDR successfully recovers the clock. When the vertical bathtub is measured with the CDR clock, a 200mV eye was measured at a BER of 10<sup>-12</sup> (Fig. 16.6.5). In SE mode, eye sizes of 98mV and 1UI were achieved at 6Gb/s with PRBS15. With PRBS31, the eye sizes were reduced to 35mV and 58ps. The non-linear phase-gain of the PDAs causes this large eye size reduction for PRBS31: a slight reduction in the input phase difference with PRBS31 can significantly reduce the phase-gain of the PDA, greatly reducing eye sizes at the PDA output. With CDR, a 140mV eye is measured with PRBS7. In SE mode, the maximum data rate of 7.8Gb/s/pin was achieved with a 49mV eye at the cost of only 1.96pJ/b while overcoming 10 in-band notches. Compared to the best prior art utilizing multi-tones [3], the TRX achieves 2.1× the data rate per pin while overcoming 10× more in-band notches at the cost of only 2× energy per bit and 1/5× normalized area (1/1.8× without normalization). Compared to other prior art [1] ([2]), the TRX improves data rate, energy efficiency, and area by 3.25× (2.7×), 28× (3×), and 53.9× (72.6×), respectively.
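The data-rate improvement factors quoted above can be reproduced from the per-pin data rates of the prior art listed in the comparison table (Fig. 16.6.6). The sketch below covers only the data-rate ratios; the names are illustrative.

```python
THIS_WORK_GBPS_PER_PIN = 7.8

# Per-pin data rates of the prior art, as listed in the comparison table.
prior_gbps_per_pin = {"[1]": 2.4, "[2]": 2.9, "[3]": 3.75}

rate_ratio = {ref: THIS_WORK_GBPS_PER_PIN / rate
              for ref, rate in prior_gbps_per_pin.items()}

for ref, ratio in rate_ratio.items():
    print(f"vs {ref}: {ratio:.2f}x data rate per pin")
```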

### Acknowledgement:

This work was supported by the National Research Foundation (NRF) of the MSIP, Korea, under Grant 2015R1A2A2A09001553, and by Samsung POSTECH Research Center (SPRC) funded by Samsung Electronics. Authors appreciate IDEC for tool support.

### References:

- [1] W.-Y. Shin, et al., “A 4.8Gb/s Impedance-Matched Bidirectional Multi-Drop Transceiver for High-Capacity Memory Interface,” *ISSCC*, pp. 494-495, Feb. 2011.
- [2] H.-W. Lim, et al., “A 5.8Gb/s Adaptive Integrating Duobinary-Based DFE Receiver for Multi-Drop Memory Interface,” *ISSCC*, pp. 182-183, Feb. 2015.
- [3] K. Gharibdoust, et al., “Hybrid NRZ/Multi-Tone Serial Data Transceiver for Multi-Drop Memory Interfaces,” *ISSCC*, pp. 180-181, Feb. 2015.
- [4] B. Kim, et al., “A 10-Gb/s Compact Low-Power Serial I/O With DFE-IIR Equalization in 65-nm CMOS,” *IEEE JSSC*, vol. 44, no. 12, pp. 3526-3538, Dec. 2009.



Figure 16.6.1: Channel characteristics: frequency response, pulse response, and eye size versus DFE tap count.



Figure 16.6.2: The concept and the architecture of the proposed PDM TRX.



Figure 16.6.3: A schematic diagram of the PDM TX producing eye diagrams measured at the RX side of the channel.



Figure 16.6.4: A schematic diagram of the PDM RX and timing diagrams of the PDA.



Figure 16.6.5: In-situ eye diagrams and a vertical bathtub curve measured with the recovered clock.

|                                        | ISSCC' 11 [1]                       | ISSCC' 15 [2]                                 | ISSCC' 15 [3]                         | This work                           |
|----------------------------------------|-------------------------------------|-----------------------------------------------|---------------------------------------|-------------------------------------|
| Tech (nm)                              | 130                                 | 45                                            | 40                                    | 65                                  |
| Supply voltage(V)                      | 1.2                                 | 1.1                                           | 0.9                                   | 0.9/0.9                             |
| Data rate/pin (Gb/s/pin)               | 2.4                                 | 2.9                                           | 3.75                                  | <b>6</b>                            |
| Channel type                           | 8-drop                              | 4-drop (stubless)                             | 3-drop                                | 2-drop                              |
| # of Notch                             | N.A.                                | 2                                             | 1                                     | <b>7</b>                            |
| Required DFE tap                       | N.A.                                | 7-tap DFE                                     | 18-tap DFE                            | 8-tap DFE                           |
| Energy efficiency (pJ/b)               | TX: 14.24<br>RX: 13.69<br>CDR: N.A. | TX: N.A.<br>RX: 1.285<br>CDR: 1.165 (x2.86)** | TX: N.A.<br>RX: 0.27<br>CDR: 0.21     | TX: 0.52<br>RX: 0.739<br>CDR: 0.117 |
| Total                                  | 27.93 (x14.2)*                      | 0.99 (x0.48)**                                | 2.062*                                | 1.961**                             |
| Data pattern                           | PRBS7                               | PRBS7                                         | PRBS15                                | PRBS31                              |
| BER                                    | 10 <sup>-9</sup>                    | 10 <sup>-10</sup>                             | 10 <sup>-12</sup>                     | 10 <sup>-12</sup>                   |
| Area(mm <sup>2</sup> ) (Norm. w/ tech) | TX: N.A.<br>RX: N.A.                | TX: 0.009 (x4.4)<br>RX: 0.087 (x72.6)         | TX: 0.0051 (x5.62)<br>RX: 0.0024 (x1) | TX: 0.0141 (x5)<br>RX: 0.0078 (x1)  |
| Total                                  | 1.68 (x53.9)                        | 0.099 (x0.48)**                               | 2.062*                                | 1.961**                             |

\*with clock recovery    \*\*without clock recovery

Figure 16.6.6: Performance summary and comparison table.



Figure 16.6.7: A chip microphotograph.

## 16.7 A 126mW 56Gb/s NRZ Wireline Transceiver for Synchronous Short-Reach Applications in 16nm FinFET

Marc Erett<sup>1</sup>, Declan Carey<sup>1</sup>, James Hudner<sup>1</sup>, Ronan Casey<sup>1</sup>, Kevin Geary<sup>1</sup>, Pedro Neto<sup>1</sup>, Mayank Raj<sup>2</sup>, Scott McLeod<sup>3</sup>, Hongtao Zhang<sup>2</sup>, Arianne Roldan<sup>2</sup>, Hongyuan Zhao<sup>4</sup>, Ping-Chuan Chiang<sup>4</sup>, Haibing Zhao<sup>4</sup>, KeeHian Tan<sup>4</sup>, Yohan Frans<sup>2</sup>, Ken Chang<sup>2</sup>

<sup>1</sup>Xilinx, Cork, Ireland

<sup>2</sup>Xilinx, San Jose, CA

<sup>3</sup>Acacia Communications, San Jose, CA

<sup>4</sup>Xilinx, Singapore, Singapore

The industry has recently proposed standards for synchronous high-speed interfaces targeting chip-to-chip communication across a very short PCB trace [1]. Figure 16.7.1 shows an example of such an interface. Eight 56Gb/s NRZ lanes provide a total of 448Gb/s aggregate bandwidth in each direction. The channel insertion loss and propagation delay vary from lane to lane, with a maximum insertion loss of 8dB at 28GHz from BGA to BGA. The routing inside the two packages adds a further 3dB of insertion loss at 28GHz. Taking advantage of the relatively low channel loss, the interface is expected to adopt simple transmitter/receiver circuits with low power consumption. However, a per-lane deskewing scheme is still required due to the propagation delay variations between lanes.

The transmitter in Fig. 16.7.2 serializes 128b data into 56Gb/s NRZ signals without TX FIR equalization. It uses a regulated CMOS quarter-rate clocking architecture with a 4-to-1 multiplexer at the final serialization stage. The phase errors of the four-phase 14GHz clocks are sensed at the 4-to-1 multiplexer clock inputs and corrected using inverter-based digitally controlled delay adjustment at the input clock buffer. To save power, only inverters and transmission gates are used in the MUX design. Figure 16.7.2 also depicts the 4-phase pulse generator, which uses NAND and NOR gates to generate four 1UI (17.8ps) pulses from the input clocks. To minimize distortion due to device mismatch, a cross-coupled pair is inserted between the NAND and NOR outputs for de-skew purposes. The rise/fall times of the generated pulses are optimized to give the best power/performance trade-off. The low-swing voltage-mode driver is implemented using a regulated N-over-N structure [2] capable of delivering up to 400mV diff-pp swing.
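To make the pulse-generation idea concrete, the sketch below models one 14GHz clock period as four 1UI time slots and combines adjacent quadrature phases with NAND and NOR gates. The pairing of clock phases is an assumption made for illustration and is not taken from the schematic; the point is only that NAND and NOR of 50%-duty quadrature clocks each yield a pulse that is active for exactly 1UI.

```python
DATA_RATE_GBPS = 56.0
UI_PS = 1000.0 / DATA_RATE_GBPS   # about 17.86 ps
SLOTS = 4                         # one 14GHz period spans four UIs


def clock_phase(k):
    """Phase k of a 50%-duty quarter-rate clock: high for slots k and k+1."""
    return [1 if (slot - k) % SLOTS < 2 else 0 for slot in range(SLOTS)]


def one_ui_pulses():
    """Map each gate output to the single slot in which it is active."""
    phases = [clock_phase(k) for k in range(SLOTS)]
    active = {}
    for a, b in ((0, 1), (1, 2)):          # assumed phase pairing
        nand = [1 - (x & y) for x, y in zip(phases[a], phases[b])]
        nor = [1 - (x | y) for x, y in zip(phases[a], phases[b])]
        assert nand.count(0) == 1 and nor.count(1) == 1   # exactly 1UI wide
        active[f"nand_{a}{b}"] = nand.index(0)   # active-low pulse
        active[f"nor_{a}{b}"] = nor.index(1)     # active-high pulse
    return active


pulses = one_ui_pulses()
print(f"UI = {UI_PS:.2f} ps", pulses)
```

In this model the two NAND outputs and the two NOR outputs together cover all four UI slots of the clock period, one slot per pulse.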

Sub-UI de-emphasis is used at the pre-driver to allow for larger fanout of the pre-driver stages (hence lower power consumption) without significantly increasing ISI-induced jitter. As shown in Fig. 16.7.2, the pre-driver sub-UI de-emphasis is achieved by adding a secondary path in parallel with the tapered main pre-driver buffer chain whose outputs are shorted at the input of the last pre-driver stage (EQ/EQb). The secondary path has a longer delay than, and the opposite polarity to, the main path. The accumulated data path ISI is significantly reduced at the output of the pre-driver. Figure 16.7.2 shows the simulated impact of this equalization.

Figure 16.7.3 shows the receiver sub-system architecture. A level-shifting input buffer [3] shifts the input signal common mode from a near-ground reference level to a level compatible with the NMOS-based continuous-time linear equalizer (CTLE). The single-stage CTLE comprises a source-degenerated CML stage providing up to 12dB of high-frequency peaking at the 28GHz Nyquist frequency. The CTLE input common mode is dynamically adjusted to ensure that the tail device is adequately saturated for reliable operation but still provides maximum headroom for the differential pair. The CTLE output drives 5 StrongArm-based slicers, which are clocked by 4 phases of a 14GHz clock. Four of the slicers are data slicers that recover the 56Gb/s data. The 5<sup>th</sup> slicer is an error slicer, which samples the peak of the input signal and is used in conjunction with the data samples to perform a baud-rate clock and data recovery (CDR) control loop. The CDR loop directs the sampling clocks to their optimum positions by digitally controlling a delay adjustment block implemented using a deskewing injection-locked oscillator (ILO) [4]. In order to maximize the delay adjustment range, the deskewing ILO is designed to operate at 3.5GHz.
The deskewing ILO output is frequency-multiplied by a factor of four by a multiplying ILO to generate the final 4-phase 14GHz sampling clocks. The 4-phase sampling clocks at the slicers are sensed and processed by a quadrature phase detection (QPD) circuit. The QPD circuit controls the supply level of the multiplying ILO to ensure a free running oscillation frequency that is matched to the frequency of the injected clocks, to minimize IQ error [5].

Figure 16.7.4 shows the architecture of the proposed deskewing ILO circuit. It consists of a 4-stage pseudo-differential ring oscillator with transmission gate switches that are used to short the output of the first stage when clk is high and that of the third stage when clk\_b is high. This symmetric injection sequence ensures division by two and maintains IQ accuracy [5]. Additionally, division by two doubles the time deskew range for a given amount of phase deskew range, which is limited by the injection strength [4]. The injection strength is controlled by digitally setting the effective resistance of the injecting switch (Fig. 16.7.4). The outputs of the second and fourth delay cells are used as the IQ outputs of the ILO. To perform time deskewing, the free running frequency of the ILO is detuned by using two digitally controlled resistor DACs which control the supply and the effective ground of the ILO. Each DAC has 6b for deskew control and 4b for PVT calibration. A one-time PVT calibration is performed by observing the frequency of the free running oscillator and altering the calibration bits to bring it close to the injected frequency.
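The statement that division by two doubles the time deskew range can be checked numerically: a fixed phase range corresponds to a time range that scales inversely with the oscillation frequency. In the sketch below, the 60° phase range is a hypothetical value chosen for illustration, and the 7GHz injected frequency is inferred from the 3.5GHz ILO output and the division by two rather than stated in the text.

```python
def time_deskew_ps(phase_range_deg, freq_ghz):
    """Time range (ps) covered by a given phase range at a given frequency."""
    period_ps = 1000.0 / freq_ghz
    return phase_range_deg / 360.0 * period_ps


PHASE_RANGE_DEG = 60.0     # hypothetical lock range set by injection strength
INJECTED_GHZ = 7.0         # inferred: 2 x the 3.5GHz ILO output
DIVIDED_GHZ = 3.5          # ILO output frequency from the text

range_without_division = time_deskew_ps(PHASE_RANGE_DEG, INJECTED_GHZ)
range_with_division = time_deskew_ps(PHASE_RANGE_DEG, DIVIDED_GHZ)

DESKEW_CODES = 2 ** 6      # 6b deskew control per DAC
PVT_CODES = 2 ** 4         # 4b PVT calibration per DAC

print(f"{range_without_division:.1f} ps -> {range_with_division:.1f} ps, "
      f"{DESKEW_CODES} deskew codes, {PVT_CODES} calibration codes")
```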

The design implements an LC-based PLL (Fig. 16.7.5) with a VCO oscillating at 28GHz. A ring-based injection-locked frequency divider (ILFD) divides the VCO clock by 2, generating 4 phases of 14GHz clocks. The ILFD design provides a low-power solution to high-frequency clock division in the PLL loop. It drives the CMOS-based feedback divider in the PLL loop and the 4-phase 14GHz output clocks. Pseudo-differential inverters with regulated supply are used to distribute the 4-phase 14GHz output clocks to the TX/RX channels. After initial CDR lock, the incoming data may drift relative to the RX sampling clocks as the ambient temperature changes. Since the deskewing ILO has a limited delay adjustment range, the drift must be sufficiently small over a wide temperature range. First, to desensitise the leaf nodes of the RX and TX clock trees to temperature change, the insertion delays on the RX and TX clock paths are matched as closely as possible to first order. Second, an inverter delay chain is added to the PLL feedback such that the total PLL feedback delay is approximately the same as the average TX/RX clock tree delay across all 8 lanes.

The PRBS-31 transmitter output eye diagram in Fig. 16.7.6 is captured using a Keysight DSA-X 90000 series 60GHz scope with de-embedded PCB trace. The measured RJ is 200fs<sub>rms</sub>, with peak-to-peak ISI of 3.98ps. To evaluate link performance, TX-to-RX loopback PRBS-31 traffic tests are performed over a PCB channel with up to 8dB BGA-to-BGA loss. The resulting bit-error-rate (BER) bathtub plots for four of the worst-case channels are shown in Fig. 16.7.6. The link BER is <1e-15, with a worst-case eye opening of 0.24UI at 1e-12 BER. To test the effectiveness of the temperature drift tracking scheme, the CDR is initially locked at 0°C and subsequently the temperature is increased to 50°C and 100°C. BER bathtub plots for the 3 temperatures are shown in Fig. 16.7.6, showing 0.47UI of shift in the CDR locking point, well within the deskewing ILO delay adjustment range. The design comprises a centrally located PLL and bias supplying 4 channels of RX and TX to either side as shown in the die photo (Fig. 16.7.7). The performance summary is given in Fig. 16.7.7. The power consumption at 56Gb/s is 126mW per TX/RX channel, including amortized PLL and clock distribution power. The area per TX/RX channel is 0.33mm<sup>2</sup>. This design consumes ~50% less power and ~50% less area compared to a 56Gb/s NRZ transceiver designed to handle larger channel loss [6].
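The headline link numbers can be cross-checked with simple arithmetic. The inputs below are taken from the text and from the title of [6]; the energy-per-bit value is derived here and is not a figure quoted by the authors.

```python
LANES = 8
LANE_RATE_GBPS = 56.0
POWER_MW_PER_LANE = 126.0        # this work, including amortized PLL and clocking
POWER_MW_PER_LANE_REF6 = 247.0   # from the title of [6]

aggregate_gbps = LANES * LANE_RATE_GBPS                  # 448 Gb/s per direction
ui_ps = 1000.0 / LANE_RATE_GBPS                          # about 17.86 ps
energy_pj_per_bit = POWER_MW_PER_LANE / LANE_RATE_GBPS   # derived: mW/(Gb/s) = pJ/b
power_saving = 1.0 - POWER_MW_PER_LANE / POWER_MW_PER_LANE_REF6

print(f"{aggregate_gbps:.0f} Gb/s, UI {ui_ps:.2f} ps, "
      f"{energy_pj_per_bit:.2f} pJ/b, {power_saving:.0%} less power than [6]")
```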

### Acknowledgements:

The authors wish to acknowledge the contributions of the Xilinx layout teams in Cork, Singapore and San Jose, as well as the characterisation and laboratory support provided by Sai Lalith Chaitanya Ambatipudi and David Mahashin.

### References:

- [1] Optical Internetworking Forum (OIF), "CEI-56G-XSR-NRZ Implementation Agreement Draft Text", Jan. 2017.
- [2] R. Palmer, et al., "A 14mW 6.25Gb/s Transceiver in 90nm CMOS for Serial Chip-to-Chip Communications," *ISSCC*, pp. 440-441, Feb. 2007.
- [3] P. A. Francese, et al., "16 Gb/s Receiver with DC Wander Compensated Rail-to-Rail AC Coupling and Passive Linear-Equalizer in 22 nm CMOS," *ESSCIRC*, pp. 435-438, Sept. 2014.
- [4] F. O'Mahony, et al., "A 27Gb/s Forwarded-Clock I/O Receiver Using an Injection-Locked LC-DCO in 45nm CMOS," *ISSCC*, pp. 452-453, Feb. 2008.
- [5] M. Raj, et al., "A 4-to-11GHz injection-locked quarter-rate clocking for an adaptive 153fJ/b optical receiver in 28nm FDSOI CMOS," *ISSCC*, pp. 404-405, Feb. 2015.
- [6] T. Shibusaki, et al., "A 56Gb/s NRZ-electrical 247mW/lane serial-link transceiver in 28nm CMOS," *ISSCC*, pp. 64-65, Feb. 2016.



Figure 16.7.1: Example of a parallel chip-to-chip link defined in CEI-56G-XSR-NRZ.



Figure 16.7.2: Transmitter block diagram.



Figure 16.7.3: Receiver block diagram.



Figure 16.7.4: Deskewing ILO circuit.



Figure 16.7.5: PLL / clocking architecture.



Figure 16.7.6: TX eye diagram; BER bath-tub plots for worst-case channels; deskewing ILO delay adjustment range, and BER bath-tub shifts over 100°C temperature change.



|                                     | This Work                                               | [6]                                   |
|-------------------------------------|---------------------------------------------------------|---------------------------------------|
| Technology                          | CMOS, 16nm FinFET                                       | 28nm CMOS                             |
| Power Supply                        | 0.9V, 1.2V, 1.8V                                        | 0.96V                                 |
| Data Rate [Gb/s]                    | 56                                                      | 56.2                                  |
| Per Channel Area [mm <sup>2</sup> ] | 0.33                                                    | 0.7                                   |
| Power/Channel [mW/Gb/s]             | Tx : 0.68<br>Rx : 1.07<br>Clocking: 0.5<br>Total : 2.25 | Tx : 1.87<br>Rx : 2.53<br>Total : 4.4 |
| Tx Function                         | 128:1 Mux                                               | 32:1 MUX<br>2-tap FFE                 |
| Rx Function                         | CTLE<br>CDR                                             | CTLE<br>1-tap DFE<br>CDR              |
| Tx RJ (fs)                          | 200                                                     | 288                                   |
| BER at 56Gb/s                       | <1e-15 (8 dB loss)                                      | <1e-12 (18.4dB loss)                  |

Figure 16.7.7: Die photo and performance summary.

## 16.8 A 1.17pJ/b 25Gb/s/pin Ground-Referenced Single-Ended Serial Link for Off- and On-Package Communication in 16nm CMOS Using a Process- and Temperature-Adaptive Voltage Regulator

John M. Wilson<sup>1</sup>, Walker J. Turner<sup>1</sup>, John W. Poulton<sup>1</sup>, Brian Zimmer<sup>2</sup>, Xi Chen<sup>2</sup>, Sudhir S. Kudva<sup>2</sup>, Sanquan Song<sup>2</sup>, Stephen G. Tell<sup>1</sup>, Nikola Nedovic<sup>2</sup>, Wenxu Zhao<sup>1a</sup>, Sunil R. Sudhakaran<sup>2</sup>, C. Thomas Gray<sup>1</sup>, William J. Dally<sup>2</sup>

<sup>1</sup>Nvidia, Durham, NC; <sup>2</sup>Nvidia, Santa Clara, CA

<sup>a</sup>now with Broadcom, Irvine, CA

Toward the end of the Moore's-law era, increases in system complexity will rely more heavily on packaging technology. Systems will increasingly comprise multiple chips that must be linked by high-speed data channels carrying a substantial fraction of on-chip bandwidth. To take advantage of inexpensive organic packages and conventional printed circuit (PC) boards, data links that are both energy and pin efficient are needed. A link between neighboring packages is by far the more challenging application due to increased cross-talk, signal attenuation, and reflections from impedance discontinuities. The combination of signal integrity challenges and production margining requires increased amplitude, equalization, ESD protection, and PVT-tolerant circuit design techniques.

We describe a short-reach link that connects chips on the same package or on neighboring packages over conventional PC board channels. These links use single-ended signaling to conserve pins, operate at 25Gb/s/pin and 1.17pJ/b, and use a simple but robust clock-forwarding scheme to cancel jitter. The overall link design is shown in Fig. 16.8.1. Eight data lanes share a common forwarded clock. Data is transmitted on an in-phase clock (Iclk) while the forwarded clock is transmitted using a quadrature clock (Qclk). Data and clock lanes are designed to have closely matched channel delays. At the receiver, the forwarded clock is directly buffered and driven out to data samplers; a programmable delay in each data lane delays the received data by an amount matched to the insertion delay of the clock buffer chain.
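As a rough timing sketch of the forwarded-clock arrangement, the arithmetic below assumes half-rate clocking (the 25GHz ring-oscillator output divided by 2, as described for the regulator PLL) with one bit launched per clock edge. Under that assumption, a quadrature offset places the forwarded-clock edges half a unit interval away from the data transitions. The variable names are illustrative only.

```python
# Timing arithmetic for a 25Gb/s/pin lane with half-rate I/Q clocks.
# Assumption: one data bit is launched on every edge of Iclk.
bit_rate_hz = 25e9
ui_ps = 1e12 / bit_rate_hz               # unit interval in ps

clock_freq_hz = bit_rate_hz / 2          # half-rate Iclk/Qclk frequency
clock_period_ps = 1e12 / clock_freq_hz   # clock period in ps

quadrature_offset_ps = clock_period_ps / 4    # 90 degrees of the clock period
offset_in_ui = quadrature_offset_ps / ui_ps   # Qclk edge position within the data eye

print(f"UI = {ui_ps:.0f} ps, clock period = {clock_period_ps:.0f} ps")
print(f"Qclk offset = {quadrature_offset_ps:.0f} ps = {offset_in_ui:.2f} UI")
```

With matched data and clock channel delays, this offset is what lets the buffered forwarded clock sample near the center of each data eye.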

The link uses ground-referenced signaling (GRS), described in [1]. GRS avoids many of the usual problems of single-ended signaling. The ground network, typically the lowest impedance network in a system, is used as the signaling voltage reference, eliminating the need for a precise matched reference at the receiver. Return currents flow only on ground, providing a clean line termination. Signals must be driven symmetrically positive and negative with respect to ground; GRS uses an output-multiplexed pair of charge pumps, shown in Fig. 16.8.2 (left), to produce these voltage levels. Since the charge-pump transmitter draws the same current from the supply regardless of data polarity, simultaneous switching noise is largely eliminated. A capacitively coupled auxiliary transmitter injects extra charge into the line on data transitions, providing tunable transmitter equalization across 5 strength settings. The receiver, shown in Fig. 16.8.2 (right), employs a pseudo-differential amplifier that performs four functions: 1) signal gain; 2) level-shift from the about-ground line signal up to CMOS levels; 3) linear EQ using gm-C active inductors in the amp load; and 4) single-ended to differential conversion. First-stage gain is about 6dB, with RX EQ boost set at maximum. A pair of CMOS inverters form the second amp stage and provide further gain of about 10dB. A common-mode feedback system keeps the first stage output centered on the inverter threshold. Termination consists of the adjustable Rterm resistor in parallel with $1/g_m$ of the input stage. Input offset is coarsely trimmed by differentially varying the current in the two arms of the amplifier, while Rtrim provides fine offset trim. The input amp consumes about 2.6mA.

Clocks are driven from a central PLL to the individual data and clock lanes using conventional CMOS buffers, each of which has digitally controlled insertion-delay and duty-factor trim. Clocks are distributed via top-level metal with skew < 1ps.

The link employs a novel power supply regulation scheme, shown in Fig. 16.8.3. A phase-locked loop locks a CMOS ring oscillator (RO) at 25GHz to a 1.56GHz reference clock; neither frequency is critical. The full-rate RO output is divided by 2 to provide in-phase and quadrature clocks Iclk/Qclk. Vreg\_PLL is the VCO control input, set by a PMOS control element that regulates down from the external Vdd\_IO supply. The I/O circuitry operates from a second supply, Vreg, whose digital regulator uses Vreg\_PLL as a reference voltage. This voltage is set in the PLL so that an exemplar CMOS circuit (the RO) operates at a fixed rate independent of PVT, thus the I/O circuitry, which operates on a supply voltage that is nominally equal to Vreg\_PLL, also operates at a fixed rate independent of PVT. At the expense of regulator losses, this arrangement varies the internal supply voltage to flatten current consumption and circuit speed across PVT (see Fig. 16.8.3), thereby saving the considerable power that would otherwise be needed to provide margin. The flattening of current across PVT also aids in satisfying electro-migration requirements.
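The idea of letting a frequency-locked ring oscillator choose the supply voltage can be illustrated with a toy model. The sketch below is not the circuit of Fig. 16.8.3: the alpha-power-law frequency model, the threshold voltage, the corner speed factors, and the bisection search that stands in for the PLL loop are all hypothetical assumptions chosen only to show the behavior.

```python
# Toy model: find the supply voltage at which a ring oscillator runs at a
# fixed target frequency, for several process corners.
# All model parameters below are hypothetical, not taken from the paper.

CORNERS = {"slow": 50.0, "typical": 56.0, "fast": 64.0}  # arbitrary speed factors

def ro_freq_ghz(v, speed, vth=0.35, alpha=1.3):
    """Hypothetical alpha-power-law ring-oscillator frequency (GHz) at supply v."""
    return speed * (v - vth) ** alpha / v

def lock_voltage(speed, target_ghz=25.0, lo=0.40, hi=0.95):
    """Bisection stand-in for the PLL: supply voltage giving the target frequency."""
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if ro_freq_ghz(mid, speed) < target_ghz:
            lo = mid   # oscillator too slow: raise the supply
        else:
            hi = mid   # oscillator too fast: lower the supply
    return 0.5 * (lo + hi)

volts = {name: lock_voltage(speed) for name, speed in CORNERS.items()}
for name, v in volts.items():
    print(f"{name:8s} Vreg_PLL ~ {v:.3f} V -> {ro_freq_ghz(v, CORNERS[name]):.2f} GHz")
```

In this model the slow corner settles at a higher voltage and the fast corner at a lower one, while the oscillation frequency stays at the target. That is the qualitative behavior the regulation scheme relies on when the I/O supply tracks Vreg\_PLL.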

An important aspect of power efficiency is energy consumed when link traffic is variable. This link features a "pause" mode that reduces power dissipation by 75%, while providing fast entry and exit times (<5ns). When "pause" is needed, the last data bit is transmitted on a low-to-high transition of Iclk, and the clock is then held high for an extended period. Since Iclk is not toggling, data transmitters are effectively powered down. The clock lane continues to run on Qclk, but sends a stream of 1s. At the receiver, a simple analog detector observes that the clock has not transitioned for several cycles and turns off current consumers (mainly the data input amps). The PLLs continue to run both at the transmitter, to keep the clock forwarding charge pumps running, and at the receiver, to maintain Vreg. At the end of "pause", the clock begins toggling again, 2-5 (programmable) invalid words are transmitted over the data links while the receiver powers up, and then live data can begin transmission.
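To see what a 75% power reduction during "pause" means for variable traffic, the sketch below estimates the effective energy per delivered bit as a function of link utilization. It assumes that the 1.17pJ/b figure applies whenever the link is active, that paused power is 25% of active power, and that the short entry/exit overhead (a few invalid words) is negligible. The function name is illustrative.

```python
# Effective energy per delivered bit with and without the "pause" mode.
# Assumptions: active energy of 1.17 pJ/b, paused power = 25% of active power,
# entry/exit overhead ignored.
ACTIVE_PJ_PER_BIT = 1.17
PAUSE_POWER_FRACTION = 0.25   # a 75% reduction leaves 25% of the active power

def effective_pj_per_bit(utilization, pause_fraction=PAUSE_POWER_FRACTION):
    """Average energy per delivered bit when the link is busy `utilization` of the time."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    idle = 1.0 - utilization
    average_power = utilization + pause_fraction * idle  # relative to active power
    return ACTIVE_PJ_PER_BIT * average_power / utilization

for u in (1.0, 0.5, 0.1):
    with_pause = effective_pj_per_bit(u)
    without_pause = effective_pj_per_bit(u, pause_fraction=1.0)
    print(f"utilization {u:4.0%}: {with_pause:.2f} pJ/b with pause, "
          f"{without_pause:.2f} pJ/b without")
```

At 50% utilization, for example, this model gives about 1.46pJ/b with "pause" versus 2.34pJ/b if the link idled at full power.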

The link is fabricated in a TSMC 16nm FinFET 12-metal process, and operates from an external power supply voltage of 0.95V. It comprises a transceiver "brick", centered in a 4×5 bump array on a 150μm pitch (Fig. 16.8.7). Overall area is 686μm×565μm, where the I/O brick occupies 81,406μm<sup>2</sup>. Figure 16.8.6 contains the circuit area breakdown of the system sub-blocks.

The link is intended to operate over short channels in multi-chip and multi-package environments. A model of the experimental package-to-package channel is shown in Fig. 16.8.4. It is 80mm long and has attenuation of -8.5dB at Nyquist. Individual lanes are matched in length to the clock lane within ±4ps. Crosstalk for the package-to-package system is held below -31.6dB for the sum of all eight aggressors. This is achieved by: 1) using a checkerboard pattern of signal bumps with ground/power bumps; 2) stripline routing in the package and PC board; 3) placing grounded shields between traces in the PC board only; and 4) placing PC board routes in the next-to-bottom PC board layer to minimize any via stub effects. T-coils are used on each lane to improve back-match, with a robust ESD protection device included. Link attenuation is overcome using a combination of transmitter and receiver linear EQ. The data and clock eyes shown in Fig. 16.8.2 were probed at the output of a transmitter. The effects of crosstalk, due to transitioning of data lanes, can be seen as voltage noise at the minimum and maximum values of the clock waveform.

Figure 16.8.5 shows the bathtub curves for the package-to-package link and a 10mm on-package link. The on-chip phase interpolator provides 1.33ps steps, and the 8 data lanes are matched to within a 6-step window, then subsequently trimmed to 3 steps using the on-chip data delay elements. At BER = 10<sup>-15</sup>, the aggregate eye opening is 0.42UI for the off-package channel and 0.77UI on-package.
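The step counts and eye openings above can be converted to picoseconds with a few lines of arithmetic. The sketch below assumes a 40ps unit interval (25Gb/s) and simply restates the quoted numbers; the names are illustrative.

```python
# Convert phase-interpolator steps and eye openings into picoseconds.
STEP_PS = 1.33               # phase-interpolator step size
UI_PS = 1e12 / 25e9          # unit interval at 25Gb/s

window_before_ps = 6 * STEP_PS   # lane-to-lane spread before trimming
window_after_ps = 3 * STEP_PS    # spread after trimming with the data delay elements

eye_off_package_ps = 0.42 * UI_PS   # aggregate eye opening at BER = 1e-15, off-package
eye_on_package_ps = 0.77 * UI_PS    # aggregate eye opening at BER = 1e-15, on-package

print(f"skew window: {window_before_ps:.2f} ps -> {window_after_ps:.2f} ps")
print(f"eye opening: {eye_off_package_ps:.1f} ps off-package, "
      f"{eye_on_package_ps:.1f} ps on-package")
```

The untrimmed 6-step window of about 8ps is in line with the ±4ps lane-length matching quoted for the channel.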

Figure 16.8.6 shows the energy per-bit breakdown for the link and a comparison with other work in the general area of short-reach links. The presented link normalizes operation across PVT and passes 110°C 10-year EM requirements to achieve 25Gb/s, 1.17pJ/b, single-ended signaling suitable for both on-package and high-loss off-package channels.

### Acknowledgements:

This research was, in part, funded by the U.S. Government under the DARPA CRAFT program. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

DISTRIBUTION A. Approved for public release: distribution unlimited

### References:

- [1] J. Poulton, et al., "A 0.54pJ/bit 20Gb/s Ground-Referenced Single-Ended Short-Haul Serial Link in 28nm CMOS for Advanced Packaging Applications", *IEEE JSSC*, vol. 48, no. 12, pp. 3206-3218, Dec. 2013.
- [2] A. Shokrollahi, et al., "A Pin-Efficient 20.83Gb/s/wire 0.94pJ/bit Forwarded Clock CNRZ-5-Coded SerDes up to 12mm for MCM Packages in 28nm CMOS," *ISSCC*, pp. 182-183, Feb. 2016.
- [3] J. Song, et al., "A 1-V 10-Gb/s/pin Single-Ended Transceiver with Controllable Active-Inductor-Based Driver and Adaptively Calibrated Cascaded-Equalizer for Post-LPDDR4 Interfaces," to appear in *IEEE TCAS-I*, accepted June 2017.



Figure 16.8.1: GRS link overview.



Figure 16.8.2: Transmitter and receiver schematics.



Figure 16.8.3: PVT tolerant design: PLL & digital regulator schematic and ring-oscillator voltage and current across corner cases.



Figure 16.8.4: PCB channel model, bump pattern, and complete channel response.



Figure 16.8.5: 25Gb/s PRBS-31 bathtub curves for 10mm package channel & PCB channel.



|                        | [3]          | [2]    | [1]          | This Work    |
|------------------------|--------------|--------|--------------|--------------|
| Signaling              | Single-Ended | CNRZ-5 | Single-Ended | Single-Ended |
| Data Rate / Pin [Gb/s] | 10.0         | 20.83  | 20.0         | 25.0         |
| Reach                  | 100mm        | 12mm   | 4.5mm        | 80mm         |
| Type of Packaging      | MCM or PCB   | MCM    | MCM          | MCM or PCB   |
| CMOS Technology        | 65nm         | 28nm   | 28nm         | 16nm         |
| Energy / Bit [pJ/bit]  | 4.18         | 0.94   | 0.54         | 1.17         |

Figure 16.8.6: Energy and area breakdown for this work & comparison to prior work.



Figure 16.8.7: Die photo.

## 16.9 A 20Gb/s 79.5mW 127GHz CMOS Transceiver with Digitally Pre-Distorted PAM-4 Modulation for Contactless Communications

Yanghyo Kim<sup>1,2</sup>, Boyu Hu<sup>2</sup>, Yuan Du<sup>2</sup>, Adrian Tang<sup>1,2</sup>, Huan-Neng Chen<sup>3</sup>, Chewnpu Jou<sup>3</sup>, Jason Cong<sup>2</sup>, Tatsuo Itoh<sup>2</sup>, Mau-Chung Frank Chang<sup>2,4</sup>

<sup>1</sup>Jet Propulsion Laboratory, Pasadena, CA

<sup>2</sup>University of California, Los Angeles, CA

<sup>3</sup>TSMC, Hsinchu, Taiwan

<sup>4</sup>National Chiao Tung University, Hsinchu, Taiwan

Contactless chip-to-chip or board-to-board proximity (~1mm) communications have been realized by using either wireless transmission [1-3], or inductive/capacitive coupling schemes [4-6]. However, their practical deployments in consumer electronics are currently limited because of bandwidth-density-inefficient OOK/ASK-only modulations [1-3], process/voltage/temperature sensitive carrier-recovery-less coherent detection [1], and large coupler footprint (6-10mm<sup>2</sup>) for meeting bandwidth requirements [1], [4-6]. To overcome such challenges, the demonstrated contactless communication system employs a 127GHz CMOS transceiver to concurrently increase the coupling bandwidth and reduce coupling antenna size. Furthermore, a PAM-4 modulator with digital pre-distortion (DPD) in the transmitter (TX) and an envelope detecting non-coherent receiver (RX) are designed to further scale the bandwidth and eliminate carrier recovery circuitry, respectively.

The block diagram of the CMOS transceiver with a 127GHz carrier and digitally pre-distorted PAM-4 modulation is shown in Fig. 16.9.1. Unlike a conventional wireless TX that demands a linear mixer and power amplifier, the implemented TX uses only an oscillator to deliver the necessary output power and mixes signals via a capacitive-digital-to-analog-converter (CDAC) based rail-to-rail PAM-4 modulator. The PAM-4 modulator consists of three 5b CDACs, each of which is responsible for the height of one distinct data-eye. The RX contains a one-stage low-noise amplifier (LNA), a down-conversion self-mixer, and extra output drivers to interface with an oscilloscope for testing purposes. As illustrated in Fig. 16.9.1, the TX mixer encounters non-linearity through a rail-to-rail PAM-4 input swing, and the RX self-mixer also conducts a non-linear square-law operation. Consequently, equally spaced PAM-4 input signals end up with unequal spacing at the RX output. Thus, to invert the system non-linearity, three parallel 5b CDACs pre-distort the input signals through symbol re-mapping in a memoryless way.

As simulated in Fig. 16.9.1, an efficient air-coupling between TX and RX is achieved by designing a folded-dipole antenna on FR4HR substrate similar to that of [3]. To retain sufficient bandwidth while maintaining a differential input impedance of 100Ω, asymmetrical thicknesses of the upper (0.076mm) and lower (0.103mm) conductor are chosen. A via-wall (3 mil spacing between each via) is also placed around the antenna to confine electromagnetic energy in case of implementing multi-channel systems. The resulting footprint of the antenna is 0.7mm<sup>2</sup>, and the radiation pattern of E- and H-plane shows 5.9dBi gain. According to HFSS simulations, 1mm air-coupling causes 7dB loss at 127GHz, and 1mm offset in x- and y-axis contributes 10dB extra loss.

The CDAC-based PAM-4 modulator of TX is shown in Fig. 16.9.2. A cascade of a 32:4 MUX and a 4:2 MUX up-streams baseband parallel signals to the full symbol rate. Next, a 2:3 decoder splits the 2b PAM-4 data stream into a 3b thermo-code signal D[2:0] with each bit mapping to one of the three PAM-4 eyes. The CDAC is composed of three identical sub-sections, each of which contains binary-weighted 5b capacitors and their driver/data-select-MUX. Through a 3:1 MUX, each weighted capacitor can be assigned to any of the 3b thermo-codes. In such a way, the weight percentage of an individual data-eye over the total amount can be independently adjusted while keeping the PAM-4 modulator swing constant. As an example, when the 'sel1\_16[2:0]' of 'DPD code 1' is set [100] instead of [010], an extra 16C portion is assigned to the upper-eye CDAC by borrowing from the center-eye CDAC, to widen the upper-eye and shrink the center-eye, while leaving the lower-eye unchanged. In addition, the PAM-4 modulator adjusts its output common-mode (CM) via an R-2R DAC to optimize the biasing point of the mixer. This allows the PAM-4 waveform at the mixer input to be digitally pre-distorted within the rail-to-rail range.
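The capacitor re-assignment described above can be sketched as a small behavioral model. It assumes that by default every capacitor of sub-section k is driven by thermo-code bit D[k], that the output is the fraction of total capacitance whose driving bit is high, and that the select codes can be represented as a plain index (0 = lower, 1 = center, 2 = upper). These conventions and all function names are illustrative assumptions, not the circuit's actual control encoding.

```python
from fractions import Fraction

WEIGHTS = (16, 8, 4, 2, 1)        # binary-weighted 5b unit capacitors per sub-section
LOWER, CENTER, UPPER = 0, 1, 2    # thermo-code bit / eye index (assumed encoding)

def default_select():
    """Assumed default: every capacitor of sub-section k is driven by D[k]."""
    return {(section, w): section for section in range(3) for w in WEIGHTS}

def thermo(symbol):
    """2:3 decoder: PAM-4 symbol 0..3 -> thermometer bits (D[0], D[1], D[2])."""
    return tuple(int(symbol > k) for k in range(3))

def levels(select):
    """Normalized modulator output for symbols 0..3 (charge sharing idealized)."""
    total = sum(w for _, w in select)
    out = []
    for symbol in range(4):
        d = thermo(symbol)
        switched = sum(w for (_, w), bit in select.items() if d[bit])
        out.append(Fraction(switched, total))
    return out

def eye_heights(select):
    """Heights of the lower, center, and upper eyes as fractions of the full swing."""
    lv = levels(select)
    return [lv[i + 1] - lv[i] for i in range(3)]

balanced = default_select()
remapped = default_select()
remapped[(CENTER, 16)] = UPPER    # hand the center section's 16C to the upper eye

print("default :", [str(h) for h in eye_heights(balanced)])
print("remapped:", [str(h) for h in eye_heights(remapped)])
```

In this model the re-mapping widens the upper eye from 31/93 to 47/93 of the swing, shrinks the center eye to 15/93, leaves the lower eye at 31/93, and keeps the full-scale level at 1, matching the behavior described for the 16C example.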

In the TX circuitry depicted in Fig. 16.9.3, a free-running oscillator generates 127GHz carrier and drives the switching pair of a single-balanced mixer via a transformer. The mixer's input/output characteristic can be comprehended by sweeping the DC bias of the tail device and measuring the peak-to-peak swing of the 127GHz carrier at the TX output. As shown in the simulated plot, the gain of the mixer is higher when the input is in between upper and lower rail. Correspondingly, the center-eye is expected to grow wider than the rest. Comparing simulated waveforms with and without DPD, the carrier's modulation depth becomes more evenly spaced at the TX output by reducing center-eye amplitude at the mixer input. Note that the carrier's modulation depth at the TX output gradually decreases from the digital state 00 to 11. This is to compensate for the square-law non-linearity of the self-mixer in the RX, where gain increases from lower to larger input swings. Finally, the TX delivers -0.2dBm saturated output power under 1.2V supply.
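The need for unequal carrier steps ahead of a square-law detector can be shown with an idealized calculation. The sketch below assumes a perfect square-law self-mixer (output proportional to the squared carrier amplitude) and arbitrary normalized amplitude limits. It ignores the TX mixer non-linearity, and the mapping of digital states to amplitudes is not taken from the paper.

```python
import math

def predistorted_amplitudes(a_min, a_max, n_levels=4):
    """Carrier amplitudes whose squares are equally spaced between a_min**2 and a_max**2."""
    y_min, y_max = a_min ** 2, a_max ** 2
    targets = [y_min + (y_max - y_min) * k / (n_levels - 1) for k in range(n_levels)]
    return [math.sqrt(y) for y in targets]

amps = predistorted_amplitudes(0.2, 1.0)          # normalized, hypothetical limits
steps = [b - a for a, b in zip(amps, amps[1:])]   # amplitude step between adjacent levels
detected = [a * a for a in amps]                  # ideal square-law detector output

print("amplitudes:", [round(a, 3) for a in amps])
print("steps     :", [round(s, 3) for s in steps])
print("detected  :", [round(y, 3) for y in detected])
```

The amplitude steps shrink toward the large-amplitude end while the detected levels come out equally spaced, which is qualitatively the trend described for the pre-distorted TX output.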

In the RX path shown in Fig. 16.9.4, an LNA amplifies incoming signals with 10dB gain while offering 9dB noise figure (NF). After integrating over 10GHz (100dB) LNA bandwidth, the RX sets the noise floor at -65dBm [-174dBm/Hz + 100dB + 9dB]. Assuming the 17dB worst-case air-coupling loss from Fig. 16.9.1, the self-mixer receives a -7.2dBm [-0.2 dBm (TX) - 17dB + 10dB (LNA)] carrier-modulated signal and yields 20dB conversion loss based on its simulated input/output characteristic. To increase the demodulation bandwidth of the self-mixer, a feedback trans-impedance amplifier is designed as a load to convert the rectified current into voltage. With a -27.2dBm [-7.2dBm-20dB] de-modulated signal and considering 27dB to be the minimum signal-to-noise ratio (SNR) required to achieve a bit-error-rate (BER) better than 10<sup>-12</sup> for intended PAM-4 signaling, the system exhibits an SNR margin of 10.8dB [-27.2dBm - (-65dBm + 27dB)]. At the output of the self-mixer, an RC filter extracts the average DC content of the demodulated signal. Because of the data-pattern-dependent average DC level of PAM-4 signaling, an R-2R DAC feeds a constant CM to the single-ended-to-differential converter (SDC) based on feedback from the RC filter. Additionally, to prevent a mismatch-oriented DC offset from propagating through the SDC and following output drivers, an offset cancellation is embedded in the SDC using R-2R DACs and digitally controlled current sources. Comparing simulated waveforms, de-modulated PAM-4 signals display four balanced levels with DPD whereas PAM-4 signals without DPD produce three unequal levels instead.
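The link-budget figures quoted in brackets above can be reproduced step by step. The sketch below uses only the numbers stated in the text (the 17dB worst-case loss is taken as the 7dB air-coupling loss plus the 10dB misalignment loss); the variable names are illustrative.

```python
import math

# Noise floor: thermal noise density + bandwidth + noise figure.
thermal_dbm_per_hz = -174.0
bandwidth_db = 10 * math.log10(10e9)     # 10GHz LNA bandwidth expressed in dB-Hz
noise_figure_db = 9.0
noise_floor_dbm = thermal_dbm_per_hz + bandwidth_db + noise_figure_db

# Signal path: TX power, worst-case coupling loss, LNA gain, self-mixer loss.
tx_power_dbm = -0.2
coupling_loss_db = 7.0 + 10.0            # 1mm air gap plus 1mm x/y offset
lna_gain_db = 10.0
mixer_loss_db = 20.0
mixer_input_dbm = tx_power_dbm - coupling_loss_db + lna_gain_db
demodulated_dbm = mixer_input_dbm - mixer_loss_db

# Margin over the 27dB SNR assumed necessary for BER < 1e-12 with PAM-4.
required_snr_db = 27.0
snr_margin_db = demodulated_dbm - (noise_floor_dbm + required_snr_db)

print(f"noise floor  : {noise_floor_dbm:.1f} dBm")
print(f"mixer input  : {mixer_input_dbm:.1f} dBm")
print(f"demodulated  : {demodulated_dbm:.1f} dBm")
print(f"SNR margin   : {snr_margin_db:.1f} dB")
```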

The prototype TX and RX are implemented in a 65nm CMOS process and are tested by placing them 1mm apart from each other as shown in Fig. 16.9.7. As captured in Fig. 16.9.5, without DPD, the upper-eye is completely closed, the center-eye is widely open, and the lower-eye is almost closed. Once DPD is applied, all three eyes open equally, supporting 20Gb/s data communication. The amplitude of each eye opening shown by the oscilloscope is relatively small (~10mV), because the RX output driver transports the de-modulated PAM-4 signal linearly. The BER is estimated to be better than 10<sup>-12</sup> according to measured eye openings and calculated SNR. The fabricated TX and RX are shown in Fig. 16.9.7. The entire contactless communication system consumes 79.5mW (TX: 50.8mW, RX: 28.7mW) without counting input/output drivers, resulting in 3.98pJ/b energy efficiency. Figure 16.9.6 summarizes and compares its performance with that of prior art. This work demonstrates a 127GHz CMOS transceiver with digitally pre-distorted PAM-4 modulation that reaches state-of-the-art data rate and energy efficiency.

### References:

- [1] Y. Tanaka, et al., "A versatile multi-modality serial link," *ISSCC*, pp. 332-333, Feb. 2012.
- [2] Y. Kim, et al., "A 60 GHz CMOS transceiver with on-chip antenna and periodic near field directors for multi-Gb/s contactless connector," *IEEE Microwave and Wireless Components Letters*, vol. 27, no. 4, pp. 404-406, Mar. 2017.
- [3] Y. Kim, et al., "A 125 GHz transceiver in 65 nm CMOS assembled with FR4 PCB antenna for contactless wave-connectors," *IEEE MTT-S Int. Microwave Symp.*, June 2017.
- [4] A. Kosuge, et al., "A 6 Gb/s 6 pJ/b 5 mm-distance non-contact interface for modular smartphones using two-fold transmission-line coupler and EMC-qualified pulse transceiver," *ISSCC*, pp. 176-177, Feb. 2015.
- [5] K. Hijioka, et al., "A 5.5 Gb/s 5mm contactless interface containing a 50 Mb/s bidirectional sub-channel employing common-mode OOK signaling," *ISSCC*, pp. 406-407, Feb. 2013.
- [6] C. Thakkar, et al., "A 32 Gb/s bidirectional 4-Channel 4 pJ/b capacitively coupled link in 14 nm CMOS for proximity communication," *IEEE JSSC*, vol. 51, no. 12, pp. 3231-3245, Dec. 2016.



Figure 16.9.1: System block diagram, digital pre-distortion concept, and air-coupling design.



Figure 16.9.2: Schematic of capacitive-DAC-based PAM-4 modulator with memoryless DPD examples.



Figure 16.9.3: Schematic of CMOS TX, and simulated waveforms with and without DPD.



Figure 16.9.4: Schematic of CMOS RX, and simulated waveforms with and without DPD.



Figure 16.9.5: Measured eye-diagrams with and without DPD.

|                                      | This Work           | ISSCC 2012 [1] | MWCL 2017 [2]      | IMS 2017 [3]       | ISSCC 2015 [4] | ISSCC 2013 [5] | JSSC 2016 [6] |
|--------------------------------------|---------------------|----------------|--------------------|--------------------|----------------|----------------|---------------|
| Coupling Method                      | Antenna             | Antenna        | Antenna            | Antenna            | T-Line         | Inductive      | Capacitive    |
| Carrier (GHz)                        | 127                 | 57 & 80        | 60                 | 125                | Baseband       | Baseband       | Baseband      |
| Modulation                           | PAM4 (Non-Coherent) | ASK (Coherent) | OOK (Non-Coherent) | OOK (Non-Coherent) | N/A            | N/A            | N/A           |
| Technology                           | 65nm CMOS           | 40nm CMOS      | 65nm CMOS          | 65nm CMOS          | 0.18μm CMOS    |                | 14nm CMOS     |
| Data Rate (Gb/s)                     | 20                  | 20             | 5                  | 14                 | 6              | 5.5            | 8             |
| Power (mW)                           | 79.5                | 137            | 40                 | 60                 | 36             | 198            | 32            |
| FoM (pJ/bit)                         | 3.98                | 6.85           | 8                  | 4.3                | 6              | 36             | 4             |
| Coupler Footprint (mm <sup>2</sup> ) | 0.7                 | 9              | 0.2                | 0.9                | 6              | N/A            | 9.6           |
| Distance (mm)                        | 1                   | 5              | 0.5                | 2                  | 1.11           | 5              | 0.8           |

Figure 16.9.6: Performance comparison table.
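The FoM row of the comparison table is power divided by data rate (mW per Gb/s equals pJ/bit). The short check below recomputes it from the table's own power and data-rate entries; the dictionary layout is illustrative.

```python
# (power in mW, data rate in Gb/s, FoM in pJ/bit as listed in the table)
rows = {
    "This Work":      (79.5, 20.0, 3.98),
    "ISSCC 2012 [1]": (137.0, 20.0, 6.85),
    "MWCL 2017 [2]":  (40.0, 5.0, 8.0),
    "IMS 2017 [3]":   (60.0, 14.0, 4.3),
    "ISSCC 2015 [4]": (36.0, 6.0, 6.0),
    "ISSCC 2013 [5]": (198.0, 5.5, 36.0),
    "JSSC 2016 [6]":  (32.0, 8.0, 4.0),
}

computed = {name: power / rate for name, (power, rate, _) in rows.items()}
for name, (_, _, listed) in rows.items():
    print(f"{name:15s} computed {computed[name]:6.2f} pJ/bit, listed {listed:5.2f}")
```

All listed values agree with power divided by data rate to within rounding.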



Figure 16.9.7: TX and RX die photo, measurement setup, and flip-chip assembled module.

# Session 17 Overview:

## *Technologies for Health and Society*

### TECHNOLOGY DIRECTIONS SUBCOMMITTEE



**Session Chair: *Patrick P. Mercier***  
*University of California San Diego, La Jolla, CA*



**Associate Chair: *Long Yan***  
*Samsung Electronics, Hwaseongsi, Korea*

**Subcommittee Chair: *Makoto Nagata*, Kobe University, Kobe, Japan**

Advances in sensors, low-power circuits, and integration technologies are helping to revolutionize industries ranging from agriculture to healthcare. This session highlights innovations in connected sensors for improved food production, diagnostic imaging, physiochemical sensing, and neurophysiology. The first paper is an invited paper, and describes how advances in sensors, circuits, and algorithms can help improve the efficiency of food production. The second paper demonstrates a multi-camera capsule endoscope with integrated high-throughput communications. The next three papers describe sensing systems that are powered from and/or measure chemical parameters in gas for industrial applications, or in bodily fluids for healthcare applications. Subsequent papers demonstrate advances in transcranial communications, optoelectronic neural recorders, multi-modal wearable brain imagers, and closed-loop neural implants with integrated support vector machine classifiers.

**INVITED PAPER**

**1:30 PM**

**17.1 Food and Agriculture Cloud Services with Sensor Networks**

Katsuyoshi Watanabe<sup>1</sup>, Ryoichi Sakuma<sup>2</sup>, <sup>1</sup>Fujitsu Kyushu System Services, Fukuoka, Japan; <sup>2</sup>Fujitsu America, Sunnyvale, CA

Agriculture serves multiple functions, such as supplying food, providing employment, and maintaining the natural environment. Stable, low-cost food production will become increasingly important for humanity. Agricultural production involves many elements that are difficult to predict, such as weather fluctuations. Agriculture in Japan in particular, where the climate varies sharply, has so far been sustained by veteran producers' "experience and intuition". However, as the population ages, the decline in the number of agricultural workers with know-how and the reduction of cultivated area have become major social problems in Japan.

Even in the field of agriculture where the utilization of ICT lags behind other areas, an environment where these systems can be used more quickly, easily, and at lower-cost has been realized by recent progress in cloud computing. This field is evolving from agriculture based on "experience and intuition" to precision agriculture based on "data".

Food and Agriculture Cloud Services that Fujitsu offers are providing various services that fully support outdoor cultivation, greenhouse cultivation, fruit cultivation, and livestock farming from the point of view of management, production, and sales. Various data sets such as human behavior, movement of physical and financial assets, cultivation management data and environmental control data are aggregated at the data center. The information stored at the data center comprises a huge and complex data set. By analyzing this data, optimizations that provide improved value can be realized.

Advanced semiconductor technologies such as sensors, wireless networks, processors, storage, and renewable energy systems, among others, enable new data analysis technologies through the use of AI. Combining these technologies with labor-saving and automation technologies such as robots creates important new methods and optimized systems, enabling transformative advancements in agriculture.

**2:00 PM**

**17.2 4-Camera VGA-Resolution Capsule Endoscope with 80Mb/s Body-Channel Communication Transceiver and Sub-cm Range Capsule Localization**

J. Jang, KAIST, Daejeon, Korea

In Paper 17.2, KAIST presents a capsule endoscope that supports a 360° visual angle using 4 cameras and an 80Mb/s body channel communication (BCC) transceiver for 4fps, VGA-resolution image transmission. With the help of reconfigurable BCC receivers distributed on the human body, real-time localization of the capsule at sub-cm accuracy is demonstrated.



**2:30 PM**

**17.3 A 0.3V Biofuel-Cell-Powered Glucose/Lactate Biosensing System Employing a 180nW 64dB SNR Passive  $\Delta\Sigma$  ADC and a 920MHz Wireless Transmitter**

*A. Fazli Yeknami*, University of California, San Diego, La Jolla, CA

In Paper 17.3, the University of California, San Diego demonstrates a wireless biosensing system powered directly by a Glucose/Lactate biofuel cell without a DC-DC converter. Operating at 0.3V, the presented chip includes a duty-cycled maximum power point tracker, a 180nW 64dB SNR passive  $\Delta\Sigma$  ADC, and a 30pJ/b 920MHz transmitter.



**3:15 PM**

**17.4 A 0.28m $\Omega$ -Sensitivity 105dB-Dynamic-Range Electrochemical Impedance Spectroscopy SoC for Electrochemical Gas Detection**

*G. Qu*, Analog Devices, Beijing, China

In Paper 17.4, Analog Devices describes an electrochemical impedance spectroscopy (EIS) SoC for electrochemical gas detection. It features 0.28m $\Omega$  sensitivity and 105dB dynamic range, which represent 5x better sensitivity and 10dB higher dynamic range than the state of the art.



**3:45 PM**

**17.5 50nW 5kHz-BW Opamp-Less  $\Delta\Sigma$  Impedance Analyzer for Brain Neurochemistry Monitoring**

*M. El Ansary*, University of Toronto, Toronto, Canada

In Paper 17.5, the University of Toronto and the Toronto Western Hospital present a 12-channel potentiostat IC that senses K+ ion concentration in vivo by computing the impedance on the surface of biofouling-resistant, label-free, potassium-sensitive, chemically functionalized microelectrodes. The 50nW, 5kHz-BW opamp-less  $\Delta\Sigma$  ADC achieves 19.5fJ/conv at an ENOB of 8 bits, and is validated in a mouse model.



**4:00 PM**

**17.6 A 200Mb/s Inductively Coupled Wireless Transcranial Transceiver Achieving 5e-11 BER and 1.5pJ/b Transmit Energy Efficiency**

*W. Li*, University of California, Berkeley, CA

In Paper 17.6, the University of California, Berkeley, demonstrates a wireless transceiver designed for transcranial applications. Utilizing off-chip 10x10mm<sup>2</sup> coupled inductors, the 65nm transceiver achieves a data rate of up to 200Mb/s with a bit error rate as low as 5x10<sup>-11</sup> across scalp and skull; at a transmit power of 300uW, an energy efficiency of 1.5pJ/bit is achieved.



4:15 PM

**17.7 A 330 $\mu$ m<sup>2</sup> Opto-Electronically Integrated Wireless System-on-Chip for Recording of Neural Activities**

*S. Lee*, Cornell University, Ithaca, NY

In Paper 17.7, Cornell University presents a 330x90um<sup>2</sup> opto-electronic wireless system for neural recording applications. A 0.18 $\mu$ m CMOS chip is powered by an AlGaAs diode that doubles as a PV cell and an LED, the latter of which is used for wireless telemetry of recorded neural data.



4:30 PM

**17.8 A 665 $\mu$ W Silicon Photomultiplier-Based NIRS/EEG/EIT Monitoring ASIC for Wearable Functional Brain Imaging**

*J. Xu*, imec - Holst Centre, Eindhoven, The Netherlands

In Paper 17.8, imec demonstrates an active sensing ASIC that supports simultaneous measurement of near-infrared spectroscopy (NIRS), electroencephalography (EEG), and electrical impedance tomography (EIT). Use of a silicon photomultiplier reduces the power needed for NIRS sensing, resulting in total ASIC power of 665 $\mu$ W.



4:45 PM

**17.9 A Recursive-Memory Brain-State Classifier with 32-Channel Track-and-Zoom  $\Delta^2\Sigma$  ADCs and Charge-Balanced Programmable Waveform Neurostimulators**

*G. O'Leary*, University of Toronto, Toronto, Canada

In Paper 17.9, the University of Toronto, the Krembil Neuroscience Center, the Toronto Western Hospital, and Princeton University present a neural interface SoC used for closed-loop control of neurological disorders. A 32-channel analog front-end captures data using an artifact-tolerant architecture, and classifies neural data with an integrated decaying-memory support vector machine before deciding when to activate a digitally charge-balanced neurostimulator.

## 17.2 4-Camera VGA-Resolution Capsule Endoscope with 80Mb/s Body-Channel Communication Transceiver and Sub-cm Range Capsule Localization

Jaejun Jang<sup>1</sup>, Jihee Lee<sup>1</sup>, Kyoung-Rog Lee<sup>1</sup>, Jiwon Lee<sup>1</sup>, Minseo Kim<sup>1</sup>, Yongsu Lee<sup>1</sup>, Joonsung Bae<sup>2</sup>, Hoi-Jun Yoo<sup>1</sup>

<sup>1</sup>KAIST, Daejeon, Korea

<sup>2</sup>Kangwon National University, Chuncheon, Korea

Capsule endoscopes have recently emerged as an alternative to cable-attached endoscopes: they not only mitigate patients' pain and fear but also acquire additional information about unexplained lesions for accurate diagnosis. Nevertheless, their applicability has been limited mainly by insufficient viewing angle and image quality [1]. In particular, a single end-facing camera with VGA-resolution images suffers from blurring and blind zones along the digestive canal, raising the overall miss rate to as much as 20%, which is fatal in diagnosis [2]. A 4-camera capsule was proposed to support a 360° visual angle [1]; however, the full 360° images were only stored on a Flash EPROM, without wireless image transmission, real-time image viewing, or capsule tracking. Full 360° high-resolution imaging with multiple cameras inherently requires high-data-rate wireless telemetry inside the capsule, but previous wireless capsule endoscopes [3] did not support wireless transmission of panoramic-view images because of the limited bandwidth achievable with a coin battery.

In this paper, we present a 4-camera VGA-resolution capsule endoscope system with an 80Mb/s body channel communication (BCC) transceiver and sub-cm-range capsule tracking for transmitting full 360° high-resolution images with localization information. It has three key features. First, a reconfigurable external receiver and an adaptive node selection scheme are proposed for both efficient image transmission and capsule localization. Second, a dual-band pulse-shaping BCC transmitter (Tx) is proposed to support high-quality panoramic image transmission while consuming less than 1mW. Third, a contact-attenuation-compensated RSSI is proposed for accurate capsule localization in a dynamically varying measurement environment.

Figure 17.2.1 shows the proposed capsule endoscope system. While images are transmitted to the external receiver (Rx) over the high-speed up-link (HS-BCC), control codes are configured through the low-speed down-link (LS-BCC). The capsule is compact (12mm (D) / 32mm (L), <4g), and two symmetrical gold-plated signal electrodes (410mm<sup>2</sup>) at both ends of the capsule are used for communication. Under the transparent middle body, 4 cameras with lens modules (120° field of view (FOV) each) are cross-placed to scan every direction along the digestive tract. Powered by two batteries (1.55V, 55mAh), 4 LEDs per camera provide illumination for VGA-resolution imaging at 4fps. The external receiver consists of 8 nodes and a hub integrated on a tight vest for convenience and reliable skin contact.

Figure 17.2.2 shows the overall block diagram of the two proposed SoCs: the capsule chip and the receiver chip. The capsule chip is composed of an HS-BCC Tx, an image encoder, an LED driver and an LS-BCC Rx. The 4 image sensors are enabled sequentially with a 2% duty cycle and their images are compressed by the DPCM encoder. The HS-BCC adopts dual-band modulation (40MHz QPSK, 160MHz BPSK). In the LS-BCC Rx, the capsule status code is received from the external hub via a super-regenerative Rx. The receiver chip, on the other hand, consists of an HS-BCC Rx, a contact-attenuation-compensated received signal strength indicator (CAC-RSSI), and an LS-BCC Tx. The HS-BCC Rx adopts a dual-band receiver architecture with a Costas loop. To compensate for attenuation caused by contact impedance variation, both contact impedance and signal power are measured in the CAC-RSSI.

Figure 17.2.3 describes the operation of the reconfigurable receiver with the localization scheme. To utilize the electrodes efficiently, the image transmission (mode1) and localization (mode2) modes are switched sequentially with a 0.5% duty ratio (mode2/mode1). In mode1, image data is transmitted together with RSSI data, and the signal power is measured. The measured signal power, together with the contact impedance measured in mode2, is used to calculate the distance between the capsule and the receiver. From the distances calculated for the 8 nodes, the nearest node is determined to minimize BCC channel attenuation, and the nearest 4 nodes are adaptively selected to improve localization accuracy. From the 4 distance values, a cost function is generated, and the Levenberg-Marquardt algorithm (LMA) drives the initial point (x',y') to converge to (x,y) by minimizing the cost function [4].
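The localization step can be illustrated with a minimal sketch, assuming a plain sum-of-squared-range-errors cost over the 4 selected nodes; the actual cost function of [4] (which handles an unknown path-loss model) may differ. The node coordinates, the test position, and the function name `localize_lm` are hypothetical and chosen only for illustration.

```python
import math

def localize_lm(nodes, dists, start, iters=50, lam=1e-2):
    """Estimate (x, y) from node positions and measured distances with a
    hand-written Levenberg-Marquardt loop (2 unknowns, so 2x2 algebra)."""
    x, y = start

    def residuals(px, py):
        # Range error at each node for a candidate position
        return [math.hypot(px - nx, py - ny) - d
                for (nx, ny), d in zip(nodes, dists)]

    def cost(px, py):
        return sum(r * r for r in residuals(px, py))

    for _ in range(iters):
        r = residuals(x, y)
        # Jacobian rows: d(residual)/dx and d(residual)/dy
        J = []
        for nx, ny in nodes:
            rng = math.hypot(x - nx, y - ny) or 1e-12
            J.append(((x - nx) / rng, (y - ny) / rng))
        # Damped normal equations: (J^T J + lam*I) * step = -J^T r
        a = sum(jx * jx for jx, _ in J) + lam
        b = sum(jx * jy for jx, jy in J)
        c = sum(jy * jy for _, jy in J) + lam
        gx = sum(jx * ri for (jx, _), ri in zip(J, r))
        gy = sum(jy * ri for (_, jy), ri in zip(J, r))
        det = a * c - b * b
        dx = -(c * gx - b * gy) / det
        dy = -(a * gy - b * gx) / det
        if cost(x + dx, y + dy) < cost(x, y):
            x, y = x + dx, y + dy   # accept the step, trust the model more
            lam *= 0.5
        else:
            lam *= 2.0              # reject the step, increase damping
    return x, y

# Hypothetical 4 nearest nodes (cm) and noise-free distances to a test point
NODES = [(0.0, 0.0), (30.0, 0.0), (0.0, 30.0), (30.0, 30.0)]
TRUE_POS = (12.0, 7.0)
DISTS = [math.hypot(TRUE_POS[0] - nx, TRUE_POS[1] - ny) for nx, ny in NODES]
EST = localize_lm(NODES, DISTS, start=(15.0, 15.0))
```

With noisy distances the estimate settles at the least-squares position rather than the true one, which is why the accuracy of each distance (and hence the contact compensation) matters for the final error.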

Figure 17.2.4 shows the dual-band pulse-shaping transmitter scheme. For low power consumption, a 2-stage injection-locked ring oscillator (ILRO) generates 8-phase ( $\phi_1$ - $\phi_8$ ) clocks with 45° spacing for PSK signal generation. To suppress ILRO phase noise and enhance phase accuracy, frequency calibration is executed in the initial state. The ILRO frequency calibration unit generates a control code to adjust  $f_{ILRO}$  to  $f_{clk}$ . The 40MHz  $f_{clk}$  from the crystal oscillator is divided by  $2^{10}$  and used as the counter clock for  $f_{ILRO}$ . If the counted value is higher than  $2^{10}$ , the code increases, and vice versa. To cover the frequency variation (20MHz-to-60MHz), coarse and fine current controls are adopted. The unary-weighted coarse code controls the current in 3.1µA steps and the binary-weighted fine code controls it in 98nA steps. With frequency calibration,  $f_{ILRO}$  is matched to 40MHz with 0.8% accuracy, and accurate phase alignment is obtained. To increase the data rate at low power consumption, a dual-band QPSK and BPSK transmitter with a pulse-shaping technique is proposed. Using the generated phase information, the DAC-based QPSK modulator generates a 4-phase pulse-shaped signal, and the 3<sup>rd</sup> and 5<sup>th</sup> harmonics of the QPSK signal are reduced by more than 30dB, which improves the SNR.
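The counter-based calibration can be sketched as follows, assuming the counter accumulates ILRO cycles over one period of the divided clock ( $f_{clk}/2^{10}$ , i.e. 25.6µs), so that a count of exactly  $2^{10}$  corresponds to  $f_{ILRO} = f_{clk}$ . The linear tuning model `example_tuning` and the single-code bang-bang search are hypothetical simplifications; the coarse/fine split described above is not modeled.

```python
import math

F_CLK = 40e6    # crystal reference frequency (from the text)
DIV = 2 ** 10   # divider ratio (from the text)

def ilro_count(f_ilro):
    """ILRO cycles counted during one period of f_clk / 2**10 (assumed gating)."""
    return math.floor(f_ilro * DIV / F_CLK)

def calibrate(f_of_code, code=0, max_steps=4096):
    """Step the control code until the count equals 2**10.
    Sign convention follows the text (count too high -> code increases),
    so f_of_code must decrease with code in this sketch."""
    for _ in range(max_steps):
        n = ilro_count(f_of_code(code))
        if n > DIV:
            code += 1
        elif n < DIV:
            code -= 1
        else:
            break
    return code

def example_tuning(code):
    """Hypothetical tuning curve: 60MHz at code 0, 25kHz lower per code step."""
    return 60e6 - 25e3 * code

FINAL_CODE = calibrate(example_tuning)
FINAL_FREQ = example_tuning(FINAL_CODE)
```

One count out of  $2^{10}$  corresponds to roughly 0.1% of  $f_{clk}$ , so in this idealized model the counter resolution alone does not limit the reported 0.8% matching.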

Figure 17.2.5 describes the CAC-RSSI circuit for localization, together with measurement results. The CAC-RSSI circuit consists of an RSSI and a contact impedance sensor. The sensitivity of the RSSI is switched according to the input power level. For a 30cm Rx distance detection range, the RSSI dynamic range covers more than 60dB. When the Rx distance is >15cm, a higher-sensitivity RSSI is required because a given change in input power corresponds to a larger change in distance. In high-sensitivity mode, the CAC-RSSI maps a 24dB input dynamic range onto the full output range, which increases the RSSI sensitivity. In the contact impedance sensor, a 1.25MHz current of 16-to-512µA is injected with 5b step control, which covers up to 10kΩ of contact impedance variation. With contact impedance compensation, the average localization error decreases from 4.89cm to 0.98cm.

Figure 17.2.6 shows the eye diagrams and in-vitro system measurement results. The eye diagrams of the received QPSK and BPSK signals are measured at 20Mb/s and 40Mb/s with a -45dBm input. The proposed capsule system is applied to a pig small intestine in a human-mimicking phantom. The localization measurement results track the capsule location with sub-cm accuracy. The 360° images of the pig intestine are captured and sent to the external receiver successfully.

Figure 17.2.7 shows the chip micrograph, a performance summary of the proposed transceiver, and a transmitter comparison table. The proposed BCC transceiver is fabricated in a 65nm mixed CMOS process. The die area is 4.0mm×4.0mm for both chips. The measured data rate is 40Mb/s each for the 40MHz band and the 160MHz band. In HS-mode, the transmitter consumes 0.8mW from a 1.0V supply and the receiver consumes 8mW. Compared to previous BCC and RF transmitters, the proposed transmitter achieves the lowest power (0.8-to-1.7mW) and the lowest energy per bit (0.022nJ/b) among the transmitters listed. The system power breakdown shows that the proposed capsule system can operate for longer than 12 hours. As a result, we achieve a 12-hour, 360° image-transmission capsule endoscope system with sub-cm localization accuracy.

### References:

- [1] K. Friedrich, et al., "First clinical trial of a newly developed capsule endoscope with panoramic side view for small bowel: a pilot study," *J. Gastroenterol Hepatol*, 28(9), pp. 1496-1501, Sept. 2013.
- [2] M.K. Goenka, S. Majumder, U. Goenka, "Capsule endoscopy: Present status and future expectation," *World Journal of Gastroenterology*, 20(29), pp. 10024-10037, Aug. 2014.
- [3] M.R. Yuce and T. Dissanayake, "Easy-to-Swallow Wireless Telemetry," *IEEE Microwave Magazine*, vol. 13, no. 6, pp. 90-101, Sept.-Oct. 2012.
- [4] X. Li, "RSS-Based Location Estimation with Unknown Pathloss Model," *IEEE Trans. on Wireless Communications*, vol. 5, no. 12, pp. 3626-3633, Dec. 2006.
- [5] J. Bae, et al., "A 0.24-nJ/b Wireless Body-Area-Network Transceiver With Scalable Double-FSK Modulation," *IEEE Journal of Solid-State Circuits*, vol. 47, no. 1, pp. 310-322, Jan. 2012.
- [6] H. Cho, et al., "A 79 pJ/b 80 Mb/s Full-Duplex Transceiver and a 42.5µW 100 kb/s Super-Regenerative Transceiver for Body Channel Communication," *IEEE Journal of Solid-State Circuits*, vol. 51, no. 1, pp. 310-317, Jan. 2016.



Figure 17.2.1: Proposed capsule endoscope system.



Figure 17.2.2: Overall architecture of proposed capsule/receiver SoC chip.



Figure 17.2.3: Reconfigurable external receiver with localization scheme.



Figure 17.2.4: Dual-band pulse shaping transmitter.



Figure 17.2.5: Contact attenuation compensated RSSI.



Figure 17.2.6: Transceiver and in-vitro system measurement results.



## 17.3 A 0.3V Biofuel-Cell-Powered Glucose/Lactate Biosensing System Employing a 180nW 64dB SNR Passive $\Delta\Sigma$ ADC and a 920MHz Wireless Transmitter

Ali Fazli Yeknami, Xiaoyang Wang, Somayeh Imani, Ali Nikoofard, Itthipon Jeerapan, Joseph Wang, Patrick P. Mercier

University of California, San Diego, La Jolla, CA

Wearable physiochemical biosensors offer an exciting opportunity to monitor the concentration of ions and metabolites in bodily fluids such as sweat, saliva, and interstitial fluids for emerging applications in health and fitness monitoring [1]. However, current physiochemical sensing prototypes rely on batteries and DC-DC converters to provide power for instrumentation, which may result in a large, obtrusive form factor with limited lifetime [1]. This paper presents a wireless physiochemical sensing system capable of monitoring glucose or lactate when powered via an enzymatic biofuel cell (BFC) based on energy naturally present in the underlying analytes to be sensed. Unlike prior-art BFC harvesters, which utilize bulky boost converters to increase the 0.3-to-0.4V BFC voltage to a higher level suitable for conventional CMOS circuits [2], this work forgoes any DC-DC converter, and instead the entire system, including a  $\Delta\Sigma$  ADC and 920MHz RF transmitter, is designed to operate directly from the dynamic 0.3-to-0.4V BFC output.

A block diagram of the proposed BFC-powered sensing system is shown in Fig. 17.3.1. The BFC, modeled as a voltage source,  $V_{BFC}$ , with resistor,  $R_{BFC}$ , passes current through switch  $S_1$  during phase  $\theta$ , to establish  $V_{DD}$ . The average system power consumption is designed to be much less than the maximum power point (MPP) of the BFC in order to minimize the rate of fuel consumption, and thus  $V_{DD}$  is typically near the open-circuit voltage of the BFC (point A in Fig. 17.3.1). The MPP of the BFC (point B in Fig. 17.3.1), correlates linearly with underlying fuel concentration, and thus can be used to infer analyte concentration. Since continuously presenting a matched load at the MPP depletes fuel at the maximum possible rate, limiting longevity, the system instead only presents a matched load at a 1% duty ratio via periodic activation of switch  $S_2$  during phase  $\bar{\theta}$ . When  $S_1$  opens,  $V_{DD}$  is buffered by  $C_{DD}$ , and an on-chip  $\Delta\Sigma$  ADC samples the voltage across  $R_{MPP}$  (with 3b binary-weighting to match  $R_{BFC}$ ) to compute the MPP, and therefore infer analyte concentration. Digitized data is then serialized, buffered, and delivered to an integrated wireless transmitter (TX).
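The matched-load measurement can be illustrated with a short sketch based on the Thevenin model described above: the power delivered to the load peaks when the load resistance equals  $R_{BFC}$ , where the load voltage is  $V_{BFC}/2$  and the power is  $V_{BFC}^2/(4R_{BFC})$ . The mapping of the 3b code to resistance (`code * r_unit`) and all numeric values are hypothetical, since the text does not give them.

```python
def load_power(v_bfc, r_bfc, r_load):
    """Power delivered to r_load by a Thevenin source (v_bfc, r_bfc)."""
    v_load = v_bfc * r_load / (r_bfc + r_load)
    return v_load ** 2 / r_load

def best_rmpp_code(v_bfc, r_bfc, r_unit):
    """Pick the 3b code whose load resistance extracts the most power.
    Assumes R_MPP = code * r_unit (hypothetical binary-weighted mapping)."""
    codes = range(1, 8)  # code 0 would be a short circuit in this sketch
    return max(codes, key=lambda c: load_power(v_bfc, r_bfc, c * r_unit))

# Hypothetical cell: 0.4V open-circuit voltage, 2kOhm internal resistance
V_BFC, R_BFC, R_UNIT = 0.4, 2000.0, 500.0
BEST_CODE = best_rmpp_code(V_BFC, R_BFC, R_UNIT)
P_MPP = load_power(V_BFC, R_BFC, BEST_CODE * R_UNIT)
```

In this model the selected code is the one for which `code * r_unit` is closest to matching  $R_{BFC}$ , and the sampled voltage across  $R_{MPP}$  then directly gives the MPP.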

Since designing amplifiers to attain low noise and large gain at low supply voltages is difficult without consuming significant power, a passive DT  $\Delta\Sigma$  ADC is employed. The constituent  $\Delta\Sigma$  modulator ( $\Delta\Sigma$ M), shown in Fig. 17.3.2, utilizes a conventional passive integrator in the 1<sup>st</sup> stage instead of a switched-capacitor (SC) passive voltage boosting approach in order to avoid signal-dependent nonlinearities. The second stage, which has relaxed linearity requirements, does utilize a passive SC gain-boosting integrator to reduce in-band noise with minimal power penalty. Specifically, a charge redistribution scheme is employed, where the 1<sup>st</sup> integrator's output is sampled onto capacitors  $C_{S2}$  in phase  $C_1$  (all in parallel), and then the pre-charged  $C_{S2}$ s are positioned in series to charge share with the integrating capacitor  $C_{I2}$  in phase  $C_2$ . Simulation results in Fig. 17.3.2 show that the in-band noise reduces with increasing passive gain,  $N$ ; since  $N > 5$  yields diminishing gain returns due to sampling capacitor parasitics [3],  $N=5$  was chosen in this design. A 1b quantizer is realized via a dynamic comparator (Fig. 17.3.3, bottom). To mitigate comparator non-idealities (e.g., kick-back noise, offset) and relax comparator sensitivity, a pre-amp is employed. For robust operation at 0.3V in subthreshold, only two devices are stacked in the pre-amp, and a cross-coupled negative resistance load was employed to boost gain to 20dB, while diode-connected PMOSs establish the output common-mode voltage.
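The charge-redistribution gain boosting of the second stage can be sketched with an ideal model, assuming no parasitic capacitance: the  $N$  sampling capacitors charged in parallel and then stacked in series behave as a single capacitor  $C_{S2}/N$  pre-charged to  $N$  times the sampled voltage. The function name and capacitor values are hypothetical; the diminishing returns beyond  $N=5$  reported above come from parasitics that this sketch deliberately omits.

```python
def boosted_integrator_step(v_in, v_int, c_s, c_i, n):
    """One ideal clock cycle of an n-times passive gain-boosting integrator.
    Phase 1: n capacitors of value c_s sample v_in in parallel.
    Phase 2: they are stacked in series (n*v_in behind c_s/n) and share
             charge with the integrating capacitor c_i, which holds v_int."""
    c_stack = c_s / n
    return (c_stack * n * v_in + c_i * v_int) / (c_stack + c_i)

def settle(v_in, c_s, c_i, n, cycles=2000):
    """Apply a constant input for many cycles and return the final output."""
    v = 0.0
    for _ in range(cycles):
        v = boosted_integrator_step(v_in, v, c_s, c_i, n)
    return v

# Hypothetical values: unit sampling capacitors of 1 (a.u.), integrating cap of 10
V_DC_N1 = settle(0.01, 1.0, 10.0, n=1)
V_DC_N5 = settle(0.01, 1.0, 10.0, n=5)
```

For a constant input, the ideal output settles to  $N$  times the input, whereas a conventional passive integrator ( $N=1$ ) settles to the input itself; this is the passive gain that reduces the input-referred contribution of later noise sources.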

In 65nm CMOS, it is not possible to turn on the switches in the  $\Delta\Sigma$ M sufficiently strongly when using a 0.3V supply, even when using low- $V_t$  transistors. To compensate, the sampling switches, shown in detail in Fig. 17.3.3 (top left), utilize a 4T cascaded transmission gate, along with the off-current-limiting feedback technique in [4], all combined with dynamic threshold MOS (DTMOS) techniques and 3x clock boosting for the NMOS transistors (booster circuit shown in Fig. 17.3.3, top right) to improve  $I_{ON}/I_{OFF}$  by 92% and reduce signal-dependent leakage. Additionally, the proposed negative clock level shifter circuit (Fig. 17.3.3, top middle) brings the PMOS gate voltages down to -200mV, resulting in an 8dB improvement in SNDR. All other switches in the  $\Delta\Sigma$ M are designed using NMOS or PMOS transistors, and are activated by a 3x clock booster (NMOS), or a -200mV charge pump (PMOS), as shown in Fig. 17.3.3 (top middle).

Output bits from the  $\Delta\Sigma$ M are passed through a  $sinc^2$  decimation filter, and stored in a FIFO until the TX is activated. The TX is designed as a single-stage direct-RF OOK-modulated power oscillator (RFPO) [6] that provides inherent impedance matching with a 1cm 920MHz on-board loop antenna (Fig. 17.3.4). Unfortunately, the 0.3-to-0.4V supply would limit the efficiency, the  $I_{ON}/I_{OFF}$  ratio, and the start-up time of the circuit in [6], and boosting the main power supply to compensate would require unnecessarily large inductors or capacitors. Instead, in this design the tail transistor,  $M_0$ , is designed to operate in triode by boosting its gate voltage by 3x (easily achieved via small on-chip capacitors), which increases ON conductance by 93.8%. This helps increase the headroom and overdrive of  $M_{1,2}$ . When combined with a body-bias adjustment available via the clamped clock-boosting circuit in Fig. 17.3.4 (top right) that sets  $V_{bias}$  close to the threshold of clamping transistor  $M_b$ , the ON conductance of  $M_{1,2}$  is increased by 29.6%, and the RFPO start-up time is improved by 48%. The extra drive strength of  $M_{1,2}$  also reduces their size and parasitic capacitance by 75.5% iso-current, enabling a 0.7mm larger antenna diameter while maintaining resonance at 920MHz, increasing radiation efficiency by 1.2x. Given the rapid start-up time, the TX is deeply duty-cycled, and activated once every 14.3ms.
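The  $sinc^2$  decimation mentioned at the start of the paragraph above can be sketched as two cascaded moving averages, which is equivalent to a single triangular FIR filter followed by downsampling. The decimation ratio `r` is a free parameter here because the text does not state the one used on chip; the function name is hypothetical.

```python
def sinc2_decimate(bits, r):
    """Decimate a 1b stream (values 0/1) by r using a sinc^2 filter,
    i.e. a length-r boxcar convolved with itself (triangular FIR)."""
    x = [2 * b - 1 for b in bits]  # map {0, 1} to {-1, +1}
    # Triangular taps 1, 2, ..., r, ..., 2, 1 (2r - 1 taps in total)
    taps = [min(k + 1, 2 * r - 1 - k) for k in range(2 * r - 1)]
    gain = r * r                    # sum of the taps, i.e. the DC gain
    out = []
    for start in range(0, len(x) - len(taps) + 1, r):
        window = x[start:start + len(taps)]
        out.append(sum(t * s for t, s in zip(taps, window)) / gain)
    return out

# A constant stream decodes to full scale; a 1010... stream decodes to zero
ALL_ONES = sinc2_decimate([1] * 64, 4)
ALTERNATING = sinc2_decimate([1, 0] * 32, 4)
```

The alternating pattern sits at half the sampling rate, which coincides with a null of the filter for even `r`, so its decimated output is exactly zero.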

The self-powered wireless biosensing chip is implemented in 65nm CMOS. The  $\Delta\Sigma$ M is sampled at 256kHz during active mode, and achieves 64dB SNR, 60dB SNDR, and 65dB DR for a 3kHz signal bandwidth (Fig. 17.3.5) while consuming 180nW at 0.3V (including the decimation filter). The ADC is robust to supply voltage variation, achieving up to 71dB SNR and 65.5dB SNDR at 0.5V (Fig. 17.3.5). Implemented in 230 $\mu$ m $\times$ 650 $\mu$ m of core area, the  $\Delta\Sigma$  ADC achieves an FoM of 37fJ/step at 0.3V, which is 8x better than [3] (and at a 60% lower supply), and 12.9x better than [5] at 0.3V. During active mode, the TX operates at 1Mbps (4Mbps), outputs >-53dBm (>-50dBm) when measured by a  $\lambda/4$  whip antenna placed 10cm away, and consumes 30pJ/bit (14.4pJ/bit) at 0.3V (0.35V), resulting in the most energy-efficient and lowest-voltage narrowband TX at GHz frequencies. The TX start-up time is <44.6ns (Fig. 17.3.4). During sleep mode, the RFPO consumes 100pW, and its digital controller consumes 5nW. At a 1% duty cycle, the ADC and TX consume an average of 1.15 $\mu$ W.
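The quoted figure of merit can be cross-checked with a short calculation. The text does not say which FoM definition is used; the sketch below assumes the common Walden form,  $P/(2^{ENOB}\cdot 2\cdot BW)$  with ENOB derived from SNDR, which reproduces the reported value from the measured power, SNDR and bandwidth.

```python
def walden_fom(power_w, sndr_db, bw_hz):
    """Walden figure of merit in joules per conversion step."""
    enob = (sndr_db - 1.76) / 6.02
    return power_w / (2 ** enob * 2 * bw_hz)

# Measured values from the text: 180nW, 60dB SNDR, 3kHz bandwidth
FOM_FJ = walden_fom(180e-9, 60.0, 3e3) * 1e15
```

Under this assumed definition the result rounds to the 37fJ/step reported above.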

The chip was tested with custom-fabricated lactate and glucose BFCs. The BFC anodes were fabricated by using a carbon nanotube (CNT)-based mediator nanocomposite onto a thin layer of carbon, followed by an enzymatic cocktail consisting of lactate or glucose oxidase (LOx or GOx) and BSA (all coated with Nafion); the BFC cathode comprised a carboxylated-CNT/Ag<sub>2</sub>O nanocomposite coated with Nafion (composition shown in Fig. 17.3.1). As shown in Fig. 17.3.6, the system, when powered exclusively from the BFCs, can successfully detect changes in lactate (glucose) concentration between 2.5-to-15mM (5-to-15mM), representative of physiologic ranges in human perspiration, for the first demonstration of an integrated self-powered chemical biosensing system with digital wireless readout. A die micrograph is shown in Fig. 17.3.7.

### Acknowledgements:

This work was supported in part by the National Institute of Biomedical Imaging and Bioengineering of NIH (R21EB019698), and in part by the Arnold and Mabel Beckman Foundation.

### References:

- [1] A. J. Bandodkar, et al., "Wearable Chemical Sensors: Present Challenges and Future Prospects," *ACS Sensors*, 1(5), pp. 464-482, May 2016.
- [2] A. J. Bandodkar, et al., "Soft, stretchable, high power density electronic skin-based biofuel cells for scavenging energy from human sweat," *Energy Environ. Sci.*, 10, pp. 1581-1589, 2017.
- [3] A. F. Yeknami, et al., "Low-Power DT  $\Delta\Sigma$  Modulator Using SC Passive Filters in 65nm CMOS," *IEEE TCAS-I*, vol. 61, no. 2, pp. 358-370, Feb. 2014.
- [4] D. C. Daly, et al., "A 6-bit, 0.2V to 0.9V Highly Digital Flash ADC with Comparator Redundancy," *ISSCC Dig. Tech. Papers*, pp. 554-555, Feb. 2008.
- [5] F. Michel, et al., "A 250mV 7.5 $\mu$ W 61dB SNDR SC  $\Delta\Sigma$  Modulator Using Near-Threshold-Voltage-Biased Inverter Amplifiers in 130nm CMOS," *IEEE JSSC*, vol. 47, no. 3, pp. 709-721, Mar. 2012.
- [6] P. P. Mercier, et al., "A Sub-nW 2.4 GHz Transmitter for Low Data Rate Sensing Applications," *IEEE JSSC*, vol. 49, no. 7, pp. 1463-1474, July 2014.



Figure 17.3.1: DC-DC-converter-free biofuel-cell-powered wireless glucose/lactate biosensor system architecture including functional timing diagram (top); biofuel cell photograph, composition, and representative polarization curves (bottom).



Figure 17.3.2: Simplified block diagram of the 2<sup>nd</sup>-order passive 1b DT  $\Delta\Sigma M$  (top left); NTF magnitude for various values of passive gain,  $N$  (top right); circuit schematic of the DT  $\Delta\Sigma M$  (bottom).



Figure 17.3.3: Proposed sampling switch with improved ON conductance using p-switch negative gate level shifting (top left); 3x clock booster (top right) and p-type voltage boosting for driving gates of p-switch in a 1b DAC (bottom left); pre-amplifier circuit using low- $V_t$  transistor with cross-coupled PMOS load and inherent output common-mode feedback (bottom middle); dynamic comparator (bottom right).



Figure 17.3.4: Direct-RF Power Oscillator TX with clamped body bias booster (top); measured power spectrum (bottom left); measured start-up time (bottom right).



Figure 17.3.5: Measured  $\Delta\Sigma M$  power spectral density (PSD) versus normalized frequency with shorted inputs (top left); SNR and SNDR versus normalized input amplitude for a 500Hz sinusoidal input (top right); Measured PSD versus normalized frequency for a 100Hz sinusoidal input with 250mVpp at  $V_{DD} = 0.3V$  (bottom left); SNR versus power supply (bottom right).



Figure 17.3.6: Measured dynamic  $V_{DD}$  variation and  $V_{IN}$  waveforms during in-vitro experiments for lactate (left) and glucose (right); calibration curves from these measurements (bottom).



Figure 17.3.7: Die micrograph of the BFC-powered wireless biosensing system.

## 17.4 A 0.28mΩ-Sensitivity 105dB-Dynamic-Range Electrochemical Impedance Spectroscopy SoC for Electrochemical Gas Detection

Guangyang Qu<sup>1</sup>, Hanqing Wang<sup>1</sup>, Yimiao Zhao<sup>1</sup>, John O'Donnell<sup>2</sup>, Colin Lyden<sup>3</sup>, Yincui Liu<sup>1</sup>, Junbiao Ding<sup>4</sup>, Dennis Dempsey<sup>2</sup>, Leicheng Chen<sup>1</sup>, Donal Bourke<sup>3</sup>, Shurong Gu<sup>1</sup>, Jun Gao<sup>1</sup>, Lizhu Lu<sup>1</sup>, Li Wang<sup>1</sup>, Xuemin Li<sup>1</sup>, Hongxing Li<sup>5</sup>, Chao Chu<sup>1</sup>, Ling Yang<sup>1</sup>

<sup>1</sup>Analog Devices, Beijing, China; <sup>2</sup>Analog Devices, Limerick, Ireland

<sup>3</sup>Analog Devices, Cork, Ireland; <sup>4</sup>Analog Devices, Shanghai, China

<sup>5</sup>Analog Devices, Wilmington, MA

An electrochemical (EC) gas sensor is a 3+ terminal electrochemical device which, when biased correctly with a potentiostat circuit, behaves as a current source whose output is proportional to the target gas concentration. The short lifetime of these sensors has become a key functional safety issue for hazardous gas detection modules. In order to predict the failure of EC gas sensors, a high-precision, wide-bandwidth EC impedance measurement is required [1]. For example, an electrochemical impedance spectroscopy (EIS) instrument is often used in the laboratory for failure rate analysis, and there is a trend towards integration of the EIS at the chip level. However, it is extremely challenging to integrate the precision high-speed/high-power EIS with the precision low-power EC potentiostat system. For instance, [2] demonstrated a very low-power integration of the EIS and bio-potentiostat; however, the EIS bandwidth is only 80Hz and the EIS dynamic range (DR) is limited to 77dB. [3] and [4] presented good EIS performance because there were no trade-off limitations between the low-power module and the high-speed EIS. Furthermore, two additional challenges, not met by [2, 3, 4], are to apply an AC stimulus without disturbing the large DC potentiostat and to improve the DR for stimulus-sensitive sensors.

This paper demonstrates a highly integrated SoC to support EIS applications for diverse sensors. Figure 17.4.1 shows the implemented architecture. The analog front end (AFE) and digital back end (DBE) are implemented on a 0.18μm die and a 90nm die respectively and stacked in a 5×6×1mm<sup>3</sup> LGA package. A Cortex-M3, featuring UART/SPI/I<sup>2</sup>C interfaces, is integrated in the top DBE die. The bottom AFE die has two EC potentiostat channels, with sensor readout, as well as a chip-scale 0.1Hz-to-200kHz EIS solution, which is realized by a transmitter (Tx), receiver (Rx) and DSP. A sequencer can flexibly control all these modules through a programmable 6KB SRAM, which is shared by command memory and data FIFO.

Figure 17.4.2 shows the block diagram of the AFE die. Each potentiostat channel consists of a dual-output, 12b, 1μA ultra-low-power (ULP) DAC, two 10Hz low-pass filters (LPF) and two 1.8μVpp (0.1Hz-to-10Hz) OpAmps. The LPF can be bypassed for fast-settling and wide-bandwidth applications. A 1μA 15ppm/°C reference is shared by the two channels. The continuous current per potentiostat is 8μA, compared with 10μA in [5], which does not include an internal reference.

In order to achieve high-DR EIS measurements, a high-accuracy stimulator is required. One way to implement this is to inject a current stimulus and then to measure the response voltage using a readout circuit [2, 3, 4]. In this work, a voltage stimulator is used for better linearity, together with high-gain amplifiers that amplify or attenuate the stimulus, thus enabling a large impedance range. The Tx, which applies the sinusoidal voltage to the impedance under test (IUT), consists of a waveform generator, a 2.3MSPS AC-coupled DAC, a 3<sup>rd</sup>-order reconstruction filter (RCF), a programmable gain amplifier (PGA), a stimulus buffer and a high-speed trans-impedance amplifier (TIA). Through the D/P/N/T 4-wire feedback loop, the stimulus ( $V_{PN}$ ) is accurately controlled by the ULP-DAC and AC-DAC. The TIA converts the IUT response current into a voltage for readout. The Rx is shared by both the sensor and EIS readouts, so the PGA\_Rx supports a programmable gain from 1 to 9 for large-scale signals. A 3<sup>rd</sup>-order anti-alias filter (AAF) attenuates the aliased images modulated by the clock jitter. The quantizer is a 16b 800kSPS SAR ADC with an internal 30μA, 10ppm/°C reference. The DR for large-DC-bias sensors is improved by canceling the DC signal before amplifying. Thus, the ULP-DAC from the 2<sup>nd</sup> channel can be reused as the DC offset cancellation DAC to remove the large DC signal in the PGA\_Rx. In the DSP, a single-frequency discrete Fourier transform (DFT), with the number of samples programmable from 16 to 16K, is designed to extract the stimulus frequency directly. Even with these features, this complex signal chain has many gain-error and phase-error sources in both the Tx and Rx. Consequently, an external precision reference resistor ( $R_{REF}$ ) is introduced to calibrate this systematic error and its drift.
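The combination of a single-frequency DFT and  $R_{REF}$  calibration can be illustrated with a minimal sketch. It assumes a ratiometric scheme in which the same stimulus is applied first to  $R_{REF}$  and then to the IUT, so that any unknown complex gain of the Tx/Rx chain cancels in the ratio of the two readings; the text does not spell out the calibration arithmetic, so this is one plausible reading. The helper `tia_samples`, the chain gain, and the test impedance are hypothetical.

```python
import cmath
import math

def single_bin_dft(samples, f_stim, f_s):
    """Complex amplitude of `samples` at f_stim (single-frequency DFT)."""
    n = len(samples)
    w = -2j * math.pi * f_stim / f_s
    acc = sum(s * cmath.exp(w * k) for k, s in enumerate(samples))
    return 2 * acc / n

def impedance_from_ratio(resp_ref, resp_iut, r_ref, f_stim, f_s):
    """Ratiometric estimate: Z_IUT = R_REF * (reference reading / IUT reading)."""
    i_ref = single_bin_dft(resp_ref, f_stim, f_s)
    i_iut = single_bin_dft(resp_iut, f_stim, f_s)
    return r_ref * i_ref / i_iut

def tia_samples(current_phasor, chain_gain, f_stim, f_s, n):
    """Synthetic Rx samples: a current phasor seen through an unknown chain gain."""
    return [(chain_gain * current_phasor
             * cmath.exp(2j * math.pi * f_stim * k / f_s)).real
            for k in range(n)]

F_S, N = 800e3, 1024               # ADC rate from the text; N is within 16..16K
F_STIM = F_S * 16 / N              # 12.5kHz, exactly 16 cycles per record
V_STIM, R_REF = 0.02, 1000.0       # hypothetical stimulus amplitude and R_REF
Z_TRUE = 500 - 200j                # hypothetical impedance under test
CHAIN = 0.7 * cmath.exp(0.3j)      # unknown gain and phase error of the chain

REF = tia_samples(V_STIM / R_REF, CHAIN, F_STIM, F_S, N)
IUT = tia_samples(V_STIM / Z_TRUE, CHAIN, F_STIM, F_S, N)
Z_EST = impedance_from_ratio(REF, IUT, R_REF, F_STIM, F_S)
```

Because the chain gain multiplies both readings, the estimate recovers the test impedance regardless of the gain and phase error, which is the reason a stable external  $R_{REF}$  can absorb drift in the signal chain.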

One challenge is to maintain the DC potentiostat for the EC sensor during the EIS. Although the AC signal can be applied on the DC potentiostat using a coupling capacitor, it is hard to support 0.1Hz-to-200kHz excitation that way. In this work, the stimulus loop (Fig. 17.4.3) uses two separate DACs to provide the DC and AC signals respectively. The ULP DAC from the potentiostat can be reused without any disturbance while switching between potentiostat mode and EIS mode. In the AC path, different gains can be selected, depending on IUT tolerance, without reducing the DR of the AC DAC. The gain is set not only by the PGA\_Tx, but also by the summing-node resistor divider. For large impedances, the ±800mVpp DAC can be used for high DR because the noise is not amplified too much in the TIA and Rx. Alternatively, attenuated stimulator mode can be used for small or stimulus-sensitive impedances. To switch between these two modes, 4 gain combinations are provided by means of PGA\_Tx and resistor divider gain settings of  $(1, \frac{1}{2})$  and  $(2, \frac{1}{4})$  respectively. For example, in order to apply 40mVpp onto a 600mV potentiostat, the 2-DAC architecture allows attenuation in the AC DAC with smaller resolution, thus improving the whole system DR by 21dB. By means of the gain-split design, in attenuation mode, the PGA\_Tx noise transfer function is changed and the whole system noise performance is improved by 9dB compared to a single-gain architecture (2 and  $\frac{1}{2}$  in PGA\_Tx).

Figure 17.4.4 shows the DR test results. For a 1kΩ IUT, the RMS noise, as tested on silicon, is 5.85mΩ and the test is stable across 40 ICs. The DR is around 105dB normally and drops to 95dB and 91dB at 0.07Hz (flicker noise) and 236kHz (gain-bandwidth limitations) respectively. A large impedance range can be accommodated by adjusting the  $R_{REF}$  value. Figure 17.4.5 shows that the DR is almost constant with different  $R_{REF}$  values (up to 1MΩ). When the  $R_{REF}$  is small (10Ω) for small impedance measurement, the DR drops to 91dB. This is because the attenuation in the stimulator side changes the PGA\_Tx noise transfer function. In this mode, the lowest impedance noise is 0.28mΩ with a 12.21Hz bandpass filter from DFT, resulting in an impedance referred noise density of 0.086mΩ/√Hz.
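The relation between the measured RMS noise and the quoted DR can be checked with a one-line calculation, assuming DR is defined here as the ratio of the measured impedance to its RMS noise, expressed in dB. This definition is an assumption, but it reproduces the reported figure for the 1kΩ test.

```python
import math

def dynamic_range_db(z_ohm, noise_rms_ohm):
    """Dynamic range as 20*log10(impedance / RMS impedance noise)."""
    return 20 * math.log10(z_ohm / noise_rms_ohm)

# Values from the text: 1kOhm IUT with 5.85mOhm RMS noise
DR_1K = dynamic_range_db(1000.0, 5.85e-3)
```

Under this definition the 1kΩ measurement gives about 104.7dB, consistent with the stated DR of around 105dB.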

Figure 17.4.6 shows the comparison with recently published state-of-the-art research. In this work, the stimulus frequency range has been extended to 200kHz with 24 bits of tuning range. The DR is improved by 10dB compared to [3] and, furthermore, the sensitivity is improved to 0.28mΩ, which is 5 times better than [3]. The 0.086mΩ/√Hz resolution performance is 113 times better than [2] and 56 times better than [4].

Figure 17.4.7 shows a micrograph of the AFE die and the stacked LGA package bonding prototype. The power is mostly dissipated by the always-on EC potentiostat and sensor readout module. For EC health diagnostics, the EIS module is turned on for a short period every day and draws less than 1μA of average current. By integrating the DBE, potentiostat, readout, and EIS into a small package, the conventional analog EC sensor can be leveraged as a self-diagnostic, digitally interfaced smart sensor.

### Acknowledgements:

The authors acknowledge the great support by the Analog Devices Integrated Precision (IPN) and Linear Product (LPT) groups, including Management, Design, Layout, Verification, Applications, Evaluation, Test, Marketing and Assembly teams involved in the development of this SoC and of the technology upon which it is founded.

### References:

- [1] H. Li, et al., "Low Power Multimode Electrochemical Gas Sensor Array System for Wearable Health and Safety Monitoring," *IEEE Sensors Journal*, vol. 14, no. 10, pp. 3391-3399, Oct. 2014.
- [2] N. Van Helleputte, et al., "A 345 μW Multi-Sensor Biomedical SoC With Bio-Impedance, 3-Channel ECG, Motion Artifact Reduction, and Integrated DSP," *IEEE JSSC*, vol. 50, no. 1, pp. 230-244, Jan. 2015.
- [3] M. Kim, et al., "A 1.4mΩ-sensitivity 94dB-dynamic-range electrical impedance tomography SoC and 48-channel Hub SoC for 3D lung ventilation monitoring system," *ISSCC Dig. Tech. Papers*, pp. 354-355, Feb. 2017.
- [4] S. Hong, et al., "A 4.9mΩ-sensitivity mobile electrical impedance tomography IC for early breast-cancer detection system," *ISSCC Dig. Tech. Papers*, pp. 316-317, Feb. 2014.
- [5] Texas Instruments, LMP91000, "Configurable AFE Potentiostat for Low-Power Chemical Sensing Applications," Datasheet, Dec. 2014. Available at: <http://www.ti.com>.



Figure 17.4.1: Electrochemical gas detection system.



Figure 17.4.2: Overall block diagram of the analog front-end architecture.



Figure 17.4.3: Transmitter schematic and test results for RC impedance. The comparison among the different Tx gains shows that the two-DAC architecture enables applying low-noise stimulus to achieve 21.7dB higher accuracy without disturbing the DC potentiostat.



Figure 17.4.4: Test results for a 1kΩ resistor over a sample size of forty chips.



Figure 17.4.5: Silicon results for sensitivity based on different reference resistors.

|                       | Features                  | This Work                   | JSSC'15 [2]                 | ISSCC'17 [3] | ISSCC'14 [4]  |
|-----------------------|---------------------------|-----------------------------|-----------------------------|-------------|---------------|
| Impedance measurement | Dynamic range             | 105dB                       | 77dB                        | 94dB        | 80dB          |
|                       | Power                     | 20.6mW                      | 58μW                        | 6.96mW      | 53.4mW        |
|                       | Operating frequency range | 200kHz                      | 80Hz                        | 128kHz      | 100kHz        |
|                       | # of frequencies          | Tunable (2 <sup>24</sup> )  | N.A.                        | 16          | 4             |
|                       | Sensitivity               | 0.28mΩ                      | 87mΩ                        | 1.5mΩ       | 4.9mΩ         |
|                       | Resolution                | 0.086mΩ/√Hz                 | 9.8mΩ/√Hz                   | N.A.        | 4.9mΩ/√Hz     |
| Potentiostat          | Sinusoid Generator clock  | 2.23MHz                     | N.A.                        | 256kHz      | N.A.          |
|                       | Power supply current      | 8μA                         | X <sup>1</sup>              | X           | X             |
|                       | Noise (0.1Hz~10Hz)        | 1.8μVpp                     | X <sup>1</sup>              | X           | X             |
|                       | Temperature sensor        | ±1 °C                       | X <sup>1</sup>              | X           | X             |
| Receiver              | Int. ULP REF              | 15ppm/°C                    | X <sup>1</sup>              | X           | X             |
|                       | ADC                       | 16bit SAR                   | 13.5bit SD <sup>2</sup>     | 12bit SAR   | 14bit SD      |
|                       | Hibernate power           | 3μW                         | 10μW                        | N.A.        | N.A.          |
|                       | CPU                       | Cortex-M0                   | RISC                        | N.A.        | N.A.          |
| Digital               | On-chip memory            | 128KB eFlash, 6KB SRAM      | 128KB                       | 16kB        | 12kB          |
|                       | Accelerator               | Sequencer                   | Accelerator                 | Controller  | Pre-processor |
|                       | Interface                 | I <sup>2</sup> C, UART, SPI | I <sup>2</sup> C, UART, SPI | N.A.        | UART          |

<sup>1</sup> Includes a bio-potential readout, which is hard to compare with an EC potentiostat.

<sup>2</sup> 13.5bit-SNR sigma-delta ADC with OSR=64.

Figure 17.4.6: Measured performance comparison.



Figure 17.4.7: Chip micrograph and performance summary.