

# Session 10 Overview: *Sensor Systems*

## IMMD SUBCOMMITTEE



**Session Chair:**  
**Michael Kraft**  
*KU Leuven, Leuven, Belgium*



**Associate Chair:**  
**Masayuki Miyamoto**  
*Wacom, Tokyo, Japan*

### Subcommittee Chair: **Makoto Ikeda**, University of Tokyo, Japan

This session describes advances in sensor systems, covering sensors for inertial navigation, capacitive touch and stylus systems, ultrasound imaging, and bolometers. The first paper presents a frequency-modulated gyroscope using rate chopping to reject drift. The second paper describes a personal inertial navigation system for GPS-denied environments. The third paper presents a capacitive touch system with palm rejection functionality. The fourth paper presents a highly noise-immune stylus analog front-end that supports pen pressure for both passive electrically coupled resonance (ECR) and active styluses. The fifth paper demonstrates a pitch-matched front-end ASIC that realizes subarray beamforming and digitization for ultrasound imaging. The sixth paper demonstrates a 64-channel front-end ASIC for intracardiac echocardiography (ICE) catheters. The seventh paper presents a 2×2 transformer-based magnetic sensor array, which can detect magnetic nanoparticles with better than 0.3ppm accuracy. Finally, the eighth paper presents a low-cost 80×60-pixel thermal infrared imager featuring 100mK temperature resolution.



8:30 AM

#### 10.1 Chopped Rate-to-Digital FM Gyroscope with 40ppm Scale Factor Accuracy and 1.2dph Bias

*B. Eminoglu, University of California, Berkeley, CA*

In Paper 10.1, UC Berkeley presents a frequency-modulated gyroscope that measures the rate signal directly as a frequency variation and employs a rate-chopping technique to reject drift. The sensor can also be operated in modes that favor either long-term or short-term stability.



9:00 AM

#### 10.2 Personal Inertial Navigation System Employing MEMS Wearable Ground Reaction Sensor Array and Interface ASIC Achieving a Position Accuracy of 5.5m Over 3km Walking Distance Without GPS

*Q. Guo, University of Utah, Salt Lake City, UT*

In Paper 10.2, the University of Utah, UC Berkeley, Ozyegin University and Case Western Reserve University describe a personal navigation system for operation in GPS-denied environments. The system demonstrates a position accuracy of 5.5m over a 3km walking distance without GPS.



9:15 AM

#### 10.3 Multi-Way Interactive Capacitive Touch System with Palm Rejection of Active Stylus for 86" Touch Screen Panels

*J-S. An, Hanyang University, Seoul, Korea*

In Paper 10.3, Hanyang University, Leading UI, Chung-Ang University, and MiraeTNS present a multiple-way interactive capacitive touch system (MI-CTS) with palm rejection functionality. The proposed system is based on a multiple-frequency driving method and is successfully demonstrated with one 86" 198×112 and two 32" 104×64 touch screen panels.



9:30 AM

#### 10.4 A Noise-Immune Stylus Analog Front-End Using Adjustable Frequency Modulation and Linear-Interpolating Data Reconstruction for Both Electrically Coupled Resonance and Active Styluses

*K-H. Lee, Samsung Electronics, Hwaseong, Korea*

In Paper 10.4, Samsung Electronics presents a highly noise-immune stylus analog front-end (AFE) that supports pen pressure for both passive electrically coupled resonance (ECR) and active styluses. The measured SNR with a 1mm ECR stylus is 56dB under a charger and display noise environment.



10:15 AM

#### 10.5 A 0.91mW/Element Pitch-Matched Front-End ASIC with Integrated Subarray Beamforming ADC for Miniature 3D Ultrasound Probes

*C. Chen, Delft University of Technology, Delft, The Netherlands*

In Paper 10.5, Delft University presents a pitch-matched front-end ASIC that realizes subarray beamforming and digitization at 10× lower power and 3.3× smaller area per element than prior work. This is achieved by employing subarray beamforming ADCs that merge the delay-and-sum and digitization functions in the charge domain.



10:45 AM

#### 10.6 Single-Chip Reduced-Wire Active Catheter System with Programmable Transmit Beamforming and Receive Time-Division Multiplexing for Intracardiac Echocardiography

*G. Jung, Georgia Institute of Technology, Atlanta, GA*

In Paper 10.6, the Georgia Institute of Technology and the University of Leeds demonstrate a 64-channel front-end ASIC for intracardiac echocardiography (ICE) catheters. The 2.6×11 mm<sup>2</sup> ASIC is implemented in 60V 0.18μm HV-BCD technology, effectively reducing the number of wires in the catheter from more than 64 to only 22.



11:15 AM

#### 10.7 A 0.3ppm Dual-Resonance Transformer-Based Drift-Cancelling Reference-Free Magnetic Sensor for Biosensing Applications

*C. Sideris, California Institute of Technology, Pasadena, CA*

In Paper 10.7, the California Institute of Technology presents a 2×2 transformer-based magnetic sensor array, implemented in a CMOS process, which can detect magnetic nanoparticles with better than 0.3ppm accuracy and as little as 3mW power consumption. The capability of the sensor is demonstrated by performing an in-vitro DNA detection experiment.



11:45 AM

#### 10.8 A 100mK-NETD 100ms-Startup-Time 80×60 Micro-Bolometer CMOS Thermal Imager Integrated with a 0.234mm<sup>2</sup> 1.89μV<sub>rms</sub> Noise 12b Biasing DAC

*K-D. Kim, KAIST, Daejeon, Korea*

In Paper 10.8, KAIST presents a low-cost 80×60-pixel thermal infrared imager featuring 100mK temperature resolution with an integrated biasing DAC. The extremely low-noise DAC architecture avoids the use of a slow low-pass filter, achieving a sensor startup time as low as 100ms.

## 10.1 Chopped Rate-to-Digital FM Gyroscope with 40ppm Scale Factor Accuracy and 1.2dph Bias

Burak Eminoglu, Bernhard E. Boser

University of California, Berkeley, Berkeley, CA

Present implementations of MEMS gyroscopes measure rate indirectly by first converting it to a displacement [1,2]. In this case, the scale factor is a complex function of the transducer and readout circuits. Changes in any of the underlying parameters result in measurement errors.

The solution presented here measures rate directly as frequency and converts it to a digital output by comparing it to a precision clock reference [3]. Figure 10.1.1 illustrates the principle. The transducer proof mass consists of two orthogonal resonators excited at their resonant frequencies  $f_o$  by two sustaining circuits. For a 90° phase shift in the displacements of the x- and y-channels, the motion of the proof mass follows a circular pattern. An observer in the rotating frame perceives a rate input as a shift of the observed oscillation frequency of the proof mass. The scale factor equals  $\alpha_z$ , where  $\alpha_z$  is the unit-less transducer gain. It can be measured accurately with a frequency-to-digital converter with an explicit reference input  $f_{ref}$ .

The transducer resonance  $f_o$  appears as a huge offset in the output. Environmental variations preclude straightforward subtraction of this offset from the rate output. Instead, the direction of the circular path is reversed periodically to modulate the sign of the rate sensitivity, which shifts the rate signal to the modulation frequency. This is accomplished by deliberately mismatching the resonances  $f_{ox}$  and  $f_{oy}$  of the two axes by a small amount  $\Delta f$  (typically <100Hz). Now the relative phase  $\phi_{xy}$  of the x- and y-channels changes continuously, passing through 90° and 270°, corresponding to FM gains +1 and -1. This is equivalent to chopper stabilization and rejects drift at frequencies below the modulation rate. At 0° and 180° the rate modulates the amplitude rather than the frequency of the x- and y-displacements. The force equations illustrate that the rate signal  $\Omega_z$  and the quadrature error  $\Omega_q$  modulate the effective stiffness ( $k$ ) and damping ( $b$ ) terms with time-varying  $\phi_{xy}$  ( $=2\pi\Delta f t$ ). Consequently, rate appears in the output both as a frequency shift (FM channel) and a change in the oscillation amplitude (AM channel). FM and AM signals are modulated at  $\Delta f$  with  $\sin(\phi_{xy})$  and  $\cos(\phi_{xy})$ , respectively.
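The rate-chopping idea can be illustrated with a small numerical sketch. It is not the authors' signal chain: the function name `chopped_fm_demod`, the value `alpha_z=0.8`, the slow sinusoidal drift model, and the sample rate are all assumptions chosen for illustration. The measured frequency is modeled as the resonance $f_o$, plus a slow drift, plus a rate term modulated by $\sin(\phi_{xy})$; multiplying by $\sin(\phi_{xy})$ and averaging over whole chop periods recovers the rate while the offset and the slow drift average out.

```python
import math

def chopped_fm_demod(rate_dps, alpha_z=0.8, f_o=24_000.0, delta_f=10.0,
                     drift_amp_hz=5.0, drift_freq_hz=0.5,
                     fs=2_000.0, duration=2.0):
    """Estimate rate (deg/s) from a chopped FM output by synchronous demodulation.

    All parameter values are illustrative assumptions, not measured data.
    """
    n = int(fs * duration)  # whole number of chop periods for the defaults
    acc = 0.0
    for i in range(n):
        t = i / fs
        phi_xy = 2.0 * math.pi * delta_f * t  # relative phase of x and y channels
        # Slow environmental drift of the resonance (assumed sinusoidal here).
        drift = drift_amp_hz * math.sin(2.0 * math.pi * drift_freq_hz * t)
        # Rate-induced frequency shift in Hz; its sign follows sin(phi_xy).
        rate_shift_hz = alpha_z * rate_dps / 360.0 * math.sin(phi_xy)
        f_meas = f_o + drift + rate_shift_hz
        # Synchronous demodulation with the same sin(phi_xy) reference.
        acc += f_meas * math.sin(phi_xy)
    mean_product = acc / n  # equals half the rate-shift amplitude
    return 2.0 * mean_product * 360.0 / alpha_z

estimate = chopped_fm_demod(100.0)  # applied rate of 100 deg/s
```

In this idealized setting the 24kHz offset and the 0.5Hz drift are orthogonal to the 10Hz chopping reference over the averaging window, so they do not appear in the estimate.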

As in conventional AM gyroscopes, reducing the split  $\Delta f$  between the modes improves the ARW of the sensor [1,4]. Since both modes are continuously driven, the split is observable and electrostatically tuned to 10Hz in the prototype. The ability to accurately set the split frequency is an important advantage of FM over conventional AM implementations and a consequence of both axes being driven.

Figure 10.1.2 shows the readout circuits consisting of differential oscillators with amplitude control. Each oscillator consists of a trans-capacitance amplifier followed by a phase shifter, amplitude detector, and VGA. An active biasing circuit with long-channel transistors having less than 5fA/ $\sqrt{\text{Hz}}$  current noise provides the DC feedback in the trans-capacitance amplifier, which has, in total, 17fA/ $\sqrt{\text{Hz}}$  input-referred current noise. An SC peak detector is clocked at the zero crossings of the differentiator output to sample the oscillation amplitude. Unlike other options, this solution does not require a low-pass filter, which would limit the measurement bandwidth (Figure 10.1.5). The VGA ensures a stable oscillation amplitude and rejects the amplitude variations from the AM channel.

For testing, the circuit was connected to the symmetric quad-mass gyroscope (QMG) shown in Figure 10.1.7 with nominal  $f_o=24\text{kHz}$  and  $Q=100\text{k}$ . The frequency-to-digital conversion (FDC) is performed off-chip by digitizing two oscillator outputs and extracting the frequency with a software PLL. The output from the AM channel is obtained from the amplitude detector and also digitized off-chip.

Figure 10.1.3 shows the measured linearity and scale factor stability. The sensitivity of conventional AM gyroscopes is determined by transducer bias, electrode gaps, oscillation amplitude, and VGA gain, which are difficult to control accurately. In contrast, FM sensitivity is set by an external reference clock and is proportional to the slip factor  $\alpha_z$ , set by transducer geometry, and to the sum of the reciprocal velocity ratios ( $v_x/v_y+v_y/v_x$ ). For best stability, the velocities are chosen to be equal, contributing only a 1ppm error to the scale factor for as much as 1400ppm velocity mismatch [5]. For this prototype, the AM and FM linearity over  $\pm 300\text{dps}$  are 1830ppm and 110ppm, respectively. Measured over a 24-hour period in an uncontrolled environment, the individual FM channels exhibit considerable fluctuations dominated by temperature variations. Summing the two outputs reduces this variation to  $\pm 150\text{ppm}$ , a more than order-of-magnitude improvement over the AM performance. Also shown is a first-order compensated result, which reduces the magnitude of the error to less than 40ppm. The temperature of the sensor is obtained without extra circuitry from the FDC based on the transducer TCF of  $-30\text{ppm}/^\circ\text{C}$ .
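The weak dependence of the scale factor on velocity mismatch can be checked with a few lines of arithmetic. The sketch below assumes the sensitivity is proportional to $v_x/v_y+v_y/v_x$ and normalizes it by its matched value of 2. The normalization and both function names are illustrative assumptions. The second helper shows how a fractional resonance shift could be mapped to a temperature change through the stated TCF of $-30\text{ppm}/^\circ\text{C}$, as a first-order model only.

```python
def scale_factor_error_ppm(velocity_mismatch_ppm):
    """Scale-factor error (ppm) caused by a velocity mismatch between the axes."""
    r = 1.0 + velocity_mismatch_ppm * 1e-6  # velocity ratio v_x / v_y
    normalized = 0.5 * (r + 1.0 / r)        # equals 1 when the velocities match
    return (normalized - 1.0) * 1e6

def temperature_change_c(freq_shift_ppm, tcf_ppm_per_c=-30.0):
    """First-order temperature change inferred from a resonance frequency shift."""
    return freq_shift_ppm / tcf_ppm_per_c

error_ppm = scale_factor_error_ppm(1400.0)  # slightly below 1 ppm
delta_t = temperature_change_c(-60.0)       # about +2 degrees C
```

Because the error grows only with the square of the mismatch, a 1400ppm mismatch contributes roughly 1ppm, in line with the figure quoted in the text.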

Figure 10.1.4 shows the measured Allan variance for sensors operated at equal oscillation amplitudes, and hence nearly identical velocities in both channels, and with a deliberate amplitude mismatch. The mismatch increases the scale factor of the sensor, thereby reducing the noise contribution of the FDC. Note that the only change between the two measurements is a different setting in the amplitude controller. The ability to dynamically trade off long-term against short-term stability without increased power dissipation is a unique feature of the FM gyroscope.

These results were achieved with a transducer having parallel-plate transduction, which, due to its inherent nonlinearity, results in noise folding that impairs the ARW. The 1mdps/ $\sqrt{\text{Hz}}$  obtained with a transducer having comb-drive actuation confirms this hypothesis. Furthermore, comb-drive actuation allows the transducer to be operated with a larger oscillation amplitude, which reduces the Brownian noise to less than 0.2mdps/ $\sqrt{\text{Hz}}$ . Unfortunately, because this design has not been optimized, it exhibits poor long-term stability.

The noise is a function of  $\Delta f$  and decreases from 10mdps/ $\sqrt{\text{Hz}}$  at 100Hz to 1mdps/ $\sqrt{\text{Hz}}$  at 10Hz in the asymmetric mode, where total noise is dominated by the electronics. Below 5Hz, close-to-carrier phase noise dominates the ARW. Consequently, the mode-split of the gyroscope is tuned to 10Hz with a servo loop with 20ms settling time. Tuning accuracy is not critical since the FDC extracts the instantaneous phase difference between the x- and y-axis motion for demodulation.

While reducing the mode-split is advantageous for noise, this also lowers the useful bandwidth since the input is chopped at this rate. Since the outputs from the FM and AM channels are in quadrature, the bandwidth of the sum of these outputs is limited only by the bandwidth of the amplitude controller. Figure 10.1.5 illustrates the summing process and the spectrum for a 25Hz rate input. To show the effectiveness of the technique, these measurements have been performed with  $\Delta f$  tuned to 5Hz. The tone at 20Hz is due to transducer nonlinearity and can be reduced with an improved mechanical design. The image is the result of imperfect gain matching of the AM and FM channels. As expected, it disappears after trimming the AM scale factor.
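The summing argument can be sketched numerically under a simplified channel model, which is an assumption and not taken from the paper's equations: after demodulation, the FM channel carries the rate weighted by $\sin^2(\phi_{xy})$ and the AM channel carries it weighted by $\cos^2(\phi_{xy})$. With matched gains the two weights add to one, so the sum reproduces the input regardless of the chop rate. A gain error leaves a residual term proportional to $\cos(2\phi_{xy})$, which is one way to picture the image described above. The function name and the parameter `am_gain` are hypothetical.

```python
import math

def combined_output(rate_fn, am_gain=1.0, delta_f=5.0, fs=1000.0, duration=1.0):
    """Sum of demodulated FM and AM channels for a time-varying rate input."""
    samples = []
    for i in range(int(fs * duration)):
        t = i / fs
        phi_xy = 2.0 * math.pi * delta_f * t
        rate = rate_fn(t)
        fm = rate * math.sin(phi_xy) ** 2            # FM channel (assumed model)
        am = am_gain * rate * math.cos(phi_xy) ** 2  # AM channel with gain error
        samples.append(fm + am)
    return samples

def tone_25hz(t):
    """25 Hz rate input, above the 5 Hz chop rate."""
    return math.sin(2.0 * math.pi * 25.0 * t)

matched = combined_output(tone_25hz)                       # follows the input
mismatched = combined_output(lambda t: 1.0, am_gain=1.02)  # shows residual ripple
ripple = max(mismatched) - min(mismatched)                 # equals the gain error
```

For a constant input, the peak-to-peak ripple equals the relative gain error, so trimming the AM scale factor removes it.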

Figure 10.1.6 compares this result to solutions reported earlier. The FM gyro achieves competitive or better performance in all categories. Note that these results have been achieved without calibration. Scale factor accuracy is not usually reported, but it is a significant error source for applications such as navigation. By relying on an explicit reference, supplied in the prototype by an external (precision) clock, the FM gyro achieves a scale factor stability more than two orders of magnitude better than typical AM gyro accuracy [6]. Further significant advantages include the continuously tuned mode-split and the asymmetric mode of operation used to trade off long- and short-term stability without circuit changes.

### Acknowledgment:

The authors would like to thank Yu-Ching Yeh, Mitchell Kline, and Parsa Taheri-Tehrani for the transducer design and Yunhan Chen, Ian B. Flader, Dongsuk D. Shin, and Professor Thomas W. Kenny at Stanford University for the MEMS fabrication. The authors acknowledge the support of this project by DARPA under the PASCAL program and thank the TSMC University Shuttle Program for CMOS chip fabrication.

### References:

- [1] M. Marx, et al., "A 27 $\mu\text{W}$  0.06mm<sup>2</sup> Background Resonance Frequency Tuning Circuit Based on Noise Observation for a 1.71mW CT- $\Delta\Sigma$  MEMS Gyroscope Readout System with 0.9°/h Bias Instability," *ISSCC Dig. Tech. Papers*, pp. 164-165, Feb. 2017.
- [2] C. Ezekwe, et al., "A 3-Axis Open-Loop Gyroscope with Demodulation Phase Error Correction," *ISSCC Dig. Tech. Papers*, pp. 478-479, Feb. 2015.
- [3] I. Izyumin, et al., "A 7ppm, 6°/hr Frequency-Output MEMS Gyroscope," *Dig. MEMS*, pp. 33-36, Jan. 2015.
- [4] C. Ezekwe and B. Boser, "A Mode-Matching  $\Delta\Sigma$  Closed-Loop Vibratory Gyroscope Readout Interface with a 0.004°/s/ $\sqrt{\text{Hz}}$  Noise Floor over a 50Hz Band," *ISSCC Dig. Tech. Papers*, pp. 580-581, Feb. 2008.
- [5] B. Eminoglu, et al., "Comparison of Long-Term Stability of AM vs. FM Gyroscopes," *Dig. MEMS*, pp. 954-957, Jan. 2016.
- [6] ST Microelectronics, "iNEMO Inertial Module: 3D Accelerometer and 3D Gyroscope with Digital Output for Industrial Applications," ISM330DLC Datasheet, June 2017.



Figure 10.1.1: Simplified block diagram and rate chopping for FM and AM channels.



Figure 10.1.2: Circuit schematic of a single oscillator with active biasing and SC amplitude controller.



Figure 10.1.3: Scale factor tests of FM and AM channels.



Figure 10.1.4: Allan deviation of the FM channel, ARW versus mode split, and automatic mode split tuning.



Figure 10.1.5: Image rejection in combined FM and AM readout.

|                                                    | This Work                               | ISSCC'17 [1] Marx                          | ISSCC'15 [2] Ezekwe                                | ISSCC'08 [4] Ezekwe |
|----------------------------------------------------|-----------------------------------------|--------------------------------------------|----------------------------------------------------|---------------------|
| ARW [dps/√Hz]                                      | 0.001 <sup>1</sup>                      | 0.0014                                     | 0.0049 <sup>7</sup>                                | 0.0028 <sup>7</sup> |
| Bias Stability [deg/hr]                            | 1.2 <sup>1</sup>                        | 0.9                                        | n.a.                                               | n.a.                |
| RRW [deg/hr<sup>1.5</sup>]                         | 1.5 <sup>2</sup>                        | 3.8                                        | n.a.                                               | n.a.                |
| FS [dps]                                           | 1000 <sup>2,3</sup>                     | 800                                        | 2000                                               | n.a.                |
| Bandwidth [Hz]                                     | 1900 <sup>4</sup>                       | 50                                         | 520                                                | 50                  |
| Number of Axes                                     | 1                                       | 1                                          | 3                                                  | 1                   |
| Supply [V]                                         | 1.8                                     | 3.3                                        | 1.71-3.6                                           | 3.3                 |
| Power [mW]                                         | 0.45 <sup>5</sup>                       | 1.71                                       | 0.37/axis                                          | 1 <sup>8</sup>      |
| FoM <sup>6</sup> for ARW [dps<sup>2</sup>/Hz × W]  | 0.45n <sup>1</sup>                      | 3.4n                                       | 8.9n                                               | 7.8n                |
| Read-out Features                                  | Simultaneous FM and AM                  | $\Delta\Sigma$ with tuned cont. time BPF   | Open loop<br>With HV                               | Closed loop         |
| Bias Stability Methods                             | Symmetric transducer and readout        | Manual quadrature tuning                   | Background phase error correction over temperature | n.a.                |
| Mode Split Sensing                                 | Direct readout of resonance frequencies | n.a.<br>(initial tuning)                   | n.a.                                               | Tone injection      |

<sup>1</sup> Asymmetric FM  
<sup>2</sup> Symmetric FM  
<sup>3</sup> Circuit full-scale. Tested up to  $\pm 300\text{dps}$  (rate table limitation).  
<sup>4</sup> Tested up to 25Hz.  
<sup>5</sup> Power does not include off-chip ADCs and DSP.  
<sup>6</sup> FOM = Power  $\times$  ARW<sup>2</sup> (per axis)  
<sup>7</sup> Rate noise density reported. ARW = Rate Noise Density/ $\sqrt{2}$   
<sup>8</sup> Drive electronics power not included.

Figure 10.1.6: Performance summary and comparison table.
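The figure of merit in footnote 6 can be reproduced from the Power and ARW rows of the table. The short sketch below does this, taking the ISSCC'08 power as 1mW (drive electronics excluded, per footnote 8). The dictionary and function names are illustrative.

```python
# (power in W, ARW in dps/sqrt(Hz)) taken from the comparison table
ENTRIES = {
    "This work": (0.45e-3, 0.001),
    "ISSCC'17 [1]": (1.71e-3, 0.0014),
    "ISSCC'15 [2]": (0.37e-3, 0.0049),   # power per axis
    "ISSCC'08 [4]": (1.0e-3, 0.0028),
}

def fom(power_w, arw):
    """FoM = Power x ARW^2, in dps^2/Hz x W (lower is better)."""
    return power_w * arw ** 2

# Express each FoM in units of 1e-9 to compare with the "n" values in the table.
FOM_NANO = {name: fom(p, a) * 1e9 for name, (p, a) in ENTRIES.items()}
```

The computed values agree with the tabulated 0.45n, 3.4n, 8.9n, and 7.8n to within rounding.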



Figure 10.1.7: Chip micrograph and SEM image of the MEMS gyroscope.

## 10.2 Personal Inertial Navigation System Employing MEMS Wearable Ground Reaction Sensor Array and Interface ASIC Achieving a Position Accuracy of 5.5m Over 3km Walking Distance Without GPS

Qingbo Guo<sup>1</sup>, William Deng<sup>2</sup>, Ozkan Bebek<sup>3</sup>, Cenk Cavusoglu<sup>4</sup>, Carlos Mastrangelo<sup>1</sup>, Darrin Young<sup>1</sup>

<sup>1</sup>University of Utah, Salt Lake City, UT

<sup>2</sup>University of California, Berkeley, CA

<sup>3</sup>Ozyegin University, Istanbul, Turkey

<sup>4</sup>Case Western Reserve University, Cleveland, OH

An accurate personal inertial navigation system for GPS-denied environments is critical for demanding applications such as firefighting, rescue missions, and military operations. Location-aware computation for large-area mixed reality also calls for accurate personal position tracking. Position calculation can be accomplished by using an inertial measurement unit (IMU) composed of a 3-axis accelerometer, 3-axis gyroscope, and 3-axis magnetometer. A gyroscope and magnetometer together can provide the orientation information, while the displacement can be obtained by integrating the acceleration data over time. A MEMS-based IMU is attractive for its small size, low power and low cost. However, such devices exhibit limited accuracy, large offset, and time drift, which can result in an excessive position error over time. To achieve high-performance navigation, it is critical to accurately reset the IMU time integration during each step when the foot contacts the ground. Furthermore, correcting the IMU's inherent inaccuracy, bias, and time drift is important for improving system performance.

In this paper, we present the design, implementation, and field-testing results of a personal inertial navigation system achieving a position accuracy of 5.5m over a 3,100m walking distance across various ground surfaces without GPS. This performance is achieved by merging MEMS technology and low-power CMOS integrated circuit design through high-level system integration, together with an effective system calibration technique and a sensor data fusion and processing algorithm. Figure 10.2.1 presents the system design architecture, where a ground reaction sensing system (GRSS) and an IMU are placed in close proximity inside the heel region of a boot. The GRSS is composed of a MEMS-based ground reaction sensor array and an interface ASIC, and can accurately detect the foot-on-ground timing, which in turn assists the IMU in achieving accurate navigation performance without GPS. Figure 10.2.2 illustrates the operating principle of the MEMS-based ground reaction sensor array (GRSA) and the GRSA-assisted inertial navigation system. The GRSA is designed with an area of  $54 \times 57 \text{ mm}^2$ , matched to a typical boot heel dimension. A sensor pitch of approximately  $4 \times 2 \text{ mm}^2$  is selected as a tradeoff between the foot-ground reaction sensing accuracy and array wiring complexity, resulting in an array size of  $13 \times 26$ . Each sensor cell within the array consists of a 1mm-thick inverted PDMS pyramid membrane over a flexible substrate encapsulating a number of interdigitated stimulation and sense electrodes. With an applied vertical pressure, the PDMS membrane deflects towards the substrate, causing an enlarged contact area and hence an increased sensing capacitance,  $C_s$ , between the stimulation and sense electrodes, which is coupled through the floating couple electrode deposited on the PDMS membrane surface. 
The floating couple electrode enhances the device sensitivity by an order of magnitude compared to a design without the electrode [1]. The fabricated GRSA achieves a nominal sensor cell capacitance of  $0.8\text{pF}$  with a sensitivity of  $3.7\text{fF/kPa}$  for a maximum input range of  $400\text{kPa}$ . It should also be noted that the long interconnect traces on the GRSA are prone to electrical interference. Therefore, matching traces and reference electrodes are co-designed in the GRSA to ensure a differential configuration for suppressing interference. The left side of Figure 10.2.2 depicts human bipedal locomotion with the corresponding heel pressure profiles detected by the GRSA during the stance phase, when the foot is in contact with the ground. The foot-on-ground timing can be accurately determined from the GRSA output profile. Our experimental results also reveal that the capacitive GRSA is insensitive to inertial and vibration signals, and is thus attractive for accurate timing detection compared to other inertial sensor-based approaches [2,3]. When the foot makes initial contact with the ground (the heel-striking moment at  $T_1$ ), the IMU position calculation is halted. When the foot detaches from the ground (the heel-detaching moment at  $T_3$ ), the IMU position calculation is resumed. This precise resetting of the IMU at each step not only suppresses the IMU's inherent error accumulation over time, but also ensures an accurate position calculation, hence achieving a high navigation accuracy. During the mid-stance at  $T_2$ , the heel velocity should be equal to zero. Therefore, any residual velocity measured during the mid-stance is caused by the IMU's inherent inaccuracy and offset, which can be corrected by the subsequent data processing algorithm to further enhance navigation accuracy.
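The stance-gated integration described above can be sketched in one dimension. This is a minimal illustration and not the authors' algorithm: the function names are hypothetical, the stance flags stand in for the GRSA foot-on-ground detection, and the residual velocity found at touchdown is removed as a linear ramp over the preceding swing, which is an assumed correction chosen because it exactly cancels a constant accelerometer bias.

```python
def _integrate_swing(velocities, dt, residual):
    """Integrate swing-phase velocities after removing a linear velocity ramp."""
    n = len(velocities)
    return sum((v - residual * k / n) * dt
               for k, v in enumerate(velocities, start=1))

def track_position(accel, stance, dt):
    """1-D dead reckoning gated by foot-on-ground flags.

    accel:  acceleration samples in m/s^2 (may include IMU bias)
    stance: booleans, True while the foot is reported on the ground
    dt:     sample period in seconds
    """
    position = 0.0
    velocity = 0.0
    swing = []  # velocities accumulated during the current swing phase
    for a, on_ground in zip(accel, stance):
        if on_ground:
            if swing:
                # Heel velocity should be zero in stance, so whatever remains
                # at touchdown is treated as accumulated IMU error.
                position += _integrate_swing(swing, dt, residual=swing[-1])
                swing = []
            velocity = 0.0  # integration is halted and reset during stance
        else:
            velocity += a * dt
            swing.append(velocity)
    if swing:  # data ended mid-swing: no touchdown available for correction
        position += _integrate_swing(swing, dt, residual=0.0)
    return position
```

With a symmetric accelerate-then-decelerate step and a constant bias added to every sample, the corrected position matches the bias-free result, whereas integrating without the reset would accumulate error from step to step.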

Figure 10.2.3 presents the electrical system design architecture for the MEMS ground reaction sensor array. The  $13 \times 26$  array can be sequentially scanned by an interface ASIC within 10ms, which is chosen as a trade-off between the system power dissipation and the detected foot-on-ground timing accuracy. Each sensor cell can be selected by the stimulation and sensing MUX to interface with a correlated double sampling (CDS) and programmable capacitance-to-voltage (C/V) converter. An on-chip programmable reference capacitor array,  $C_{\text{ref-prog}}$ , and programmable parasitic reference capacitor array,  $C_{\text{pref-prog}}$ , are designed to provide a close match to the sensor capacitance,  $C_s$ , and sensor parasitic capacitance,  $C_p$ , from the GRSA. A close match is crucial for minimizing C/V converter output offset as well as suppressing potential amplification of interference from the ground. The C/V converter is also designed with programmable stimulation voltage,  $V_s$ , and programmable integrating capacitor,  $C_i$ , ranging from 1.5 to 2.5V and 0.35 to  $2.45\text{pF}$ , respectively, thus resulting in a conversion gain ranging from 1.5 to  $17.9\text{V/pF}$ . The programmable feature of the C/V converter is highly desirable for interfacing with a wide range of capacitive sensors and sensor arrays. An input common-mode feedback (ICMFB) circuit is incorporated with the converter to minimize residual offset due to the mismatch of input parasitic capacitance and drift over time. Characterization of the GRSA reveals an electrical interference on the order of mV, corresponding to a dynamic range of  $66\text{dB}$  (11b) for the prototype design. Therefore, the C/V converter is designed with a fully differential two-stage Class A/AB amplifier exhibiting a DC gain of  $126\text{dB}$  to achieve an 11b settling accuracy and a unity gain frequency of  $12.6\text{MHz}$  for a proper settling. 
An amplifier input-referred noise of  $7\text{nV}/\sqrt{\text{Hz}}$  is designed to achieve a total output noise of  $0.5\text{mV}_{\text{RMS}}$ , corresponding to the 11b resolution. A fully differential 12b cyclic ADC [4] with a sampling rate of  $66.7\text{kS/s}$  is designed to digitize the GRSA output pressure information. A switch parasitic capacitance compensation scheme using dummy switches is employed to achieve the required gain accuracy for 12b performance. A symmetric, common-centroid layout as well as proper shielding are implemented to minimize distortion and noise. The electronics are fabricated in the XFAB  $0.35\mu\text{m}$  CMOS process. Figure 10.2.7 shows the chip micrograph, which occupies an area of  $10\text{mm}^2$ . Figure 10.2.4 presents the measurement results for the GRSA, C/V converter and ADC. The entire ASIC dissipates  $3\text{mW}$  from a  $3\text{V}$  supply.

Figure 10.2.5 shows the assembled prototype system, where a commercial 9-axis IMU and a GRSA are placed inside a boot heel. The GRSA is connected to a packaged ASIC via a flexible cable, which is further connected to a PC for data acquisition. The prototype system was worn by a volunteer for field testing. An L-shaped calibration loop was initially walked to capture the IMU's inherent directional drift and gain errors, which were then used to compensate for the random walk trajectory. Figure 10.2.6 summarizes the measured system performance. Our work demonstrates the most accurate personal inertial navigation system tested over the longest walking distance reported to date without GPS, based on a comparison with recent systems [2,3,5].
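A few of the interface numbers quoted above can be cross-checked with simple arithmetic. The sketch below uses only values stated in the text (array size, scan time, ADC sampling rate, and 11b resolution); the function names are illustrative, and the comparison between the ADC period and the per-cell time budget is an observation, not a statement about how the ASIC schedules its conversions.

```python
import math

ROWS, COLS = 13, 26       # GRSA array size
SCAN_TIME_S = 10e-3       # full-array scan time
ADC_RATE_SPS = 66.7e3     # cyclic ADC sampling rate
RESOLUTION_BITS = 11      # resolution targeted by the C/V converter

def cell_count():
    """Number of sensor cells scanned per frame."""
    return ROWS * COLS

def time_budget_per_cell_s():
    """Average time available per cell if the scan is spread evenly."""
    return SCAN_TIME_S / cell_count()

def adc_sample_period_s():
    """Time taken by one ADC conversion at the stated sampling rate."""
    return 1.0 / ADC_RATE_SPS

def dynamic_range_db(bits=RESOLUTION_BITS):
    """Dynamic range corresponding to a given number of bits."""
    return 20.0 * math.log10(2 ** bits)
```

The 338 cells leave about 29.6μs per cell within a 10ms scan, one ADC conversion takes about 15μs, and 11 bits correspond to about 66dB, consistent with the dynamic range quoted for the prototype.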

### References:

- [1] Q. Guo, et al., "High Performance MEMS Tactile Sensor Array with Robustness and Fabrication Simplicity," *MEMS 2016*, pp. 877-880.
- [2] J. O. Nilsson, et al., "Foot-Mounted Inertial Navigation Made Easy," *Int. Conf. on Indoor Positioning and Indoor Navigation*, pp. 24-29, 2014.
- [3] E. Foxlin, "Pedestrian Tracking with Shoe-Mounted Inertial Sensors," *IEEE Computer Graphics and Applications*, pp. 38-46, 2005.
- [4] P. Cong, et al., "A Wireless and Batteryless 10-Bit Implantable Blood Pressure Sensing Microsystem With Adaptive RF Powering for Real-Time Laboratory Mice Monitoring," *IEEE JSSC*, pp. 3631-3644, 2009.
- [5] Ö. Bebek, et al., "Personal Navigation via High-Resolution Gait-Corrected Inertial Measurement Units," *IEEE T. Inst. Meas.*, pp. 3018-3027, 2010.



Figure 10.2.1: System architecture for personal inertial navigation system under GPS-denied environment.



Figure 10.2.2: Operating principles of MEMS GRSA and GRSA-assisted personal inertial navigation system.



Figure 10.2.3: Electrical system design architecture for MEMS GRSA.



Figure 10.2.4: Measurement results for GRSA, C/V converter and ADC.



Figure 10.2.5: Prototype personal inertial navigation system and field-test evaluation.

| <b>Ground Reaction Sensor Array</b>                                          |                                   | <b>Programmable C/V Converter</b>        |                     |
|------------------------------------------------------------------------------|-----------------------------------|------------------------------------------|---------------------|
| Nominal Cap.                                                                 | 0.8 pF                            | Programmable C <sub>ref</sub>            | 0~4 pF              |
| Sensitivity                                                                  | 3.7 fF/kPa                        | Programmable C <sub>p-ref</sub>          | 0~20 pF             |
| Sensing Range                                                                | 0~400 kPa                         | Stimulation Voltage                      | 1.5~2.5V            |
| Array Size / Area                                                            | 13 x 26 / 54 x 57 mm <sup>2</sup> | Conversion Gain                          | 1.5 ~ 17.9 V/pF     |
| <b>Cyclic ADC</b>                                                            |                                   | <b>Total Output Noise + Interference</b> |                     |
| Sampling Rate                                                                | 66.7 kS/s                         | Min. Detectable ΔC                       | 0.11 fF             |
| ENOB                                                                         | 11.5 bit                          | Conversion Time/Cell                     | 15 μsec             |
| DNL                                                                          | <+/- 0.2 LSB                      | Array Scanning Time                      | 10 msec             |
| INL                                                                          | <+/- 0.5 LSB                      | Power                                    | 460 μA @ 3V         |

| <b>Navigation Performance Comparison</b> | Surface Condition | Walking Distance (m) | End Point Error (m) |
|------------------------------------------|-------------------|----------------------|---------------------|
| This work                                | Mixed*            | 3100                 | 4                   |
| J. Nilsson [2]                           | Not Reported      | 50                   | 0.5                 |
| E. Foxlin [3]                            | Road              | 742                  | 2                   |
| O. Bebek [5]                             | Grass             | 1215                 | 4.3                 |

*Mixed surface condition: a combination of concrete, grass, mulch, and rocks.

Figure 10.2.6: Table summarizing measured building blocks and navigation system performance.



Figure 10.2.7: Fabricated ASIC micrograph.

### 10.3 Multi-Way Interactive Capacitive Touch System with Palm Rejection of Active Stylus for 86" Touch Screen Panels

Jae-Sung An<sup>1</sup>, Sang-Hyun Han<sup>2</sup>, Kyeong-Bin Park<sup>1</sup>, Ju Eon Kim<sup>3</sup>, Jae-Hun Ye<sup>4</sup>, Seung-Hwan Lee<sup>1</sup>, Ji-Yong Jeong<sup>1</sup>, Jung Soo Kim<sup>1</sup>, Kwang-Hyun Baek<sup>3</sup>, Ki-Seok Chung<sup>1</sup>, Seong-Kwan Hong<sup>1</sup>, Oh-Kyong Kwon<sup>1</sup>

<sup>1</sup>Hanyang University, Seoul, Korea; <sup>2</sup>Leading UI, Anyang, Korea

<sup>3</sup>Chung-Ang University, Seoul, Korea; <sup>4</sup>MiraETNS, Chungju, Korea

There have been many recent advances in capacitive touch systems (CTSs) [1-5]. Multiple-way interactive CTSs (MI-CTSs) that can simultaneously communicate with each other on a real-time basis have been demanded; however, such MI-CTSs have not yet been reported. In addition, palm rejection in an active stylus would be a very useful feature for when the palm inevitably touches a touch screen panel (TSP) [5]. In this paper, an MI-CTS with the palm rejection for an active stylus is reported. This allows simultaneous interaction between CTSs on a real-time basis, while lessening the computational load.

Figure 10.3.1 shows a block diagram of the MI-CTS with multiple CTSs, each of which consists of a TSP, display, analog front-end (AFE) IC, and MCU. The MCU is implemented with the touch coordinate extraction block and multiple-way control block including the touch coordinate transform and display blocks, the touch coordinate TX and RX, and a Wi-Fi block. When the AFE IC sends FFT data to the MCU in CTS<1>, the touch coordinate extraction block extracts the touch coordinates of CTS<1> (TC<sub>1<1></sub>), where CTS<i> is the *i*<sup>th</sup> CTS in the MI-CTS, and TC<sub>i<j></sub> denotes the coordinates touched at CTS<i> and displayed in CTS<j>. TC<sub>1<1></sub> are then sent to the multiple-way control block. The touch coordinate transform block, which has all the display resolutions of CTS<1> to CTS<N>, converts TC<sub>1<1></sub> to TC<sub>1<2></sub>~TC<sub>1<N></sub> to be fitted to the corresponding display resolutions of the remaining CTSs. TC<sub>1<2></sub>~TC<sub>1<N></sub> are then sequentially transferred to the corresponding CTSs through the touch coordinate TX and Wi-Fi block. In addition, the touch coordinate RX serially receives TC<sub>2<1></sub>~TC<sub>N<1></sub> through the Wi-Fi block from the remaining CTSs. The touch coordinate display block then combines TC<sub>1<1></sub>~TC<sub>N<1></sub> and sends them to the display of CTS<1> so that the touch coordinates of all the CTSs can be simultaneously displayed in CTS<1>. The wireless communication between the CTSs is carried out by transferring only the coordinates of the finger, and the coordinates and colors of the stylus to identify the corresponding CTS, thus lessening the data traffic between the CTSs. The MI-CTS, in which 3,596 CTSs are configured to be interactively communicable with each other when 1 finger and 4 styluses are simultaneously used, adopts an 802.11ac Wi-Fi protocol.
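The coordinate transform step can be illustrated with a minimal sketch. It assumes simple proportional scaling between display resolutions; the resolution values, the `RESOLUTIONS` table, and the function names are hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple

# Hypothetical display resolutions (width, height) in pixels, keyed by CTS index.
RESOLUTIONS: Dict[int, Tuple[int, int]] = {
    1: (3840, 2160),
    2: (1920, 1080),
    3: (1920, 1080),
}


def transform_coordinate(tc, src, dst, resolutions=RESOLUTIONS):
    """Scale a touch coordinate from the display of CTS<src> to that of CTS<dst>."""
    src_w, src_h = resolutions[src]
    dst_w, dst_h = resolutions[dst]
    x, y = tc
    return (round(x * dst_w / src_w), round(y * dst_h / src_h))


def fan_out(tc, src, resolutions=RESOLUTIONS):
    """Return {j: TC_src<j>} for every remaining CTS<j> (assumed proportional scaling)."""
    return {j: transform_coordinate(tc, src, j, resolutions)
            for j in resolutions if j != src}
```

For example, `fan_out((1920, 1080), 1)` produces one scaled coordinate per remaining CTS, which is the only payload that would then need to be sent over the wireless link.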

Figure 10.3.2 shows the block diagram and operational principle of the CTS using the modified multiple frequency driving method (MFDM) [4], which allows the palm rejection and erase features. The FFT processor acquires a spectrum of external noises, and then the MCU locates the frequencies of excitation ( $f_{EXT1-112}$ ), stylus, palm, and eraser signals ( $f_{S1-3}$ ,  $f_p$ , and  $f_e$ ) in the low-noise region. The excitation circuits of the AFE IC and active stylus concurrently send the excitation signals ( $V_{EXTs}$ ), and the stylus, palm, and eraser signals ( $V_{STY}$ ,  $V_{PALM}$ , and  $V_{ERASE}$ ), respectively, to the TSP. The AFE IC then acquires a spectrum of charge signals ( $Q_s$ ) of  $V_{EXT}$ ,  $V_{STY}$ ,  $V_{PALM}$ , and  $V_{ERASE}$  to extract the coordinates of finger and active stylus as well as the pressure and tilt of the active stylus, and to identify the presence of the palm and eraser, respectively. As the MCU detects a capacitance profile of  $f_p$  corresponding to  $V_{PALM}$  and removes it, the palm rejection is carried out without a heavy computational load. In the same way, the MCU detects a capacitance profile of  $f_e$  corresponding to  $V_{ERASE}$  and erases the displayed touch coordinates. In the MI-CTS, the CTS requires an additional frame time to extract the coordinates of the active stylus, palm, and eraser from the TX and RX electrodes.
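One plausible way to place signal frequencies in the low-noise region is a greedy pick of the quietest FFT bins. The sketch below is an assumption for illustration only: the paper does not describe the selection algorithm, and `pick_quiet_bins` and its `guard` spacing parameter are hypothetical.

```python
def pick_quiet_bins(noise_mag, count, guard=1):
    """Greedily pick `count` FFT bin indices with the lowest noise magnitude.

    Chosen bins are kept more than `guard` bins apart so that neighbouring
    tones do not overlap. Returns the chosen indices in ascending order.
    """
    order = sorted(range(len(noise_mag)), key=lambda k: noise_mag[k])
    chosen = []
    for k in order:
        if all(abs(k - c) > guard for c in chosen):
            chosen.append(k)
            if len(chosen) == count:
                break
    return sorted(chosen)
```

In the system described above, the chosen bins would be shared among the excitation, stylus, palm, and eraser tones.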

Figure 10.3.3 (top) shows a block diagram of the active stylus. When the active stylus is used, the MCU in the stylus sends  $f_{S1-3}$ ,  $f_p$ , and  $f_e$  to the excitation circuits<1:5>. The excitation circuits<1:3> generate respective sinusoidal waves having  $f_{S1-3}$ , which represent the coordinates, pressure, and tilt angle of the active stylus [4]. The mixer then combines these sinusoidal waves and emits them to the TSP via the buffer. The excitation circuit<4> sends  $V_{PALM}$  having  $f_p$  to the TSP through the human body. In addition, the excitation circuit<5> sends  $V_{ERASE}$  to the TSP via the conductive eraser to perform the erase operation. To reduce the power consumption, the active stylus emits  $V_{STY}$ ,  $V_{PALM}$ , and  $V_{ERASE}$  only when the inertial measurement unit (IMU) sensor detects the movements of the active stylus. The excitation circuits<1:4> generate  $V_{STY}$  and  $V_{PALM}$  when the active stylus moves in the forward direction. On the other hand, the excitation circuits<4:5> send  $V_{PALM}$  and  $V_{ERASE}$  to the side electrode and conductive eraser, respectively, when the active stylus is rotated in the backward direction. To verify the operation of the active stylus with palm rejection, the coordinates of the active stylus and palm of the hand were measured and depicted as combined FFT data at the RX and TX electrodes, as shown in Fig. 10.3.3 (bottom). When a  $V_{PALM}$  signal is injected into the human body, the amplitude of  $V_{PALM}$  can vary with the impedance of the human body. Nonetheless, the MCU simply detects the existence of  $f_p$  and removes the corresponding palm coordinates from the raw FFT data, thus achieving the palm rejection without a heavy computational load.
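The presence test on  $f_p$  can be sketched as a simple threshold on per-electrode FFT magnitudes. This is an illustrative assumption, not the paper's implementation: the data layout, the `threshold` value, and the `reject_palm` name are hypothetical.

```python
def reject_palm(touches, palm_profile, threshold):
    """Drop touch nodes that fall inside the region where the palm tone is present.

    touches:      list of (tx_index, rx_index) candidate touch nodes.
    palm_profile: (tx_mag, rx_mag), FFT magnitudes at f_p per TX and RX electrode.
    threshold:    magnitude above which f_p counts as present (assumed value).
    """
    tx_mag, rx_mag = palm_profile
    palm_tx = {i for i, m in enumerate(tx_mag) if m > threshold}
    palm_rx = {j for j, m in enumerate(rx_mag) if m > threshold}
    return [(i, j) for (i, j) in touches
            if not (i in palm_tx and j in palm_rx)]
```

Because only the presence of the tone is tested, the result does not depend on the exact amplitude of  $V_{PALM}$ , as long as it stays above the threshold.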

Figure 10.3.4 shows a block diagram of the AFE IC. The 64-channel excitation circuits generate the sinusoidal waves using the direct digital synthesizer (DDS), low-pass filter (LPF), and the programmable gain amplifier (PGA). The readout circuit consists of 104-channel CCIIs, high-pass filter (HPF), and PGA, and 26-channel 8:2 MUXs with SAR-ADCs. To achieve a high frame rate while preventing a charge overflow, each excitation circuit sends  $V_{EXT1-56}$  and  $V_{EXT57-112}$  for two frame times to all the TX electrodes, and the current conveyor II (CCII) adjusts its input impedance at port X according to the sizes of MP and MN. Since the AFE IC is heavily affected by the display noise in the large-sized TSP, a differential sensing method [2] using two CCIIs is adopted in the AFE IC to filter out the display noise, and is realized simply by connecting the nodes ZP and ZN of the two adjacent CCIIs. In addition, the combination of the TSP and CCII, which operates as a band-pass filter [4], is combined with the HPF to filter out the external noise, which is mainly distributed up to tens of kHz, and thus the cut-off frequency of the HPF is designed to be 100kHz. The signal gain is adjusted by the output resistors of the CCII ( $R_{out}s$ ) and the PGA. Because the fabricated 86" TSP has a cut-off frequency of 1MHz, the output of the PGA should be sampled at 2MHz per channel considering the Nyquist theorem, and thus the SAR-ADC is designed to have an 8MHz sampling frequency when an 8:2 MUX is used. The 1024-point FFT processor receives 1024 ADC samples and converts them to 1024 FFT data, which are distributed from -1MHz to +1MHz, and thus the frequency resolution is determined to be 1.95kHz.
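The sampling numbers in this paragraph follow from short arithmetic, sketched below. The frame-rate function additionally assumes that one frame consists of two back-to-back 1024-sample acquisitions (one per excitation group); that assumption is not stated explicitly in the text, but it reproduces the 977Hz frame rate listed in Fig. 10.3.6. Function names are illustrative.

```python
F_SAMPLE_PER_CHANNEL = 2e6  # Hz, twice the 1 MHz TSP cut-off frequency
FFT_POINTS = 1024


def adc_rate(fs_channel=F_SAMPLE_PER_CHANNEL, mux_inputs=8, mux_outputs=2):
    """ADC sampling rate when mux_inputs channels share mux_outputs ADCs."""
    return fs_channel * mux_inputs / mux_outputs


def frequency_resolution(fs_channel=F_SAMPLE_PER_CHANNEL, points=FFT_POINTS):
    """FFT bin spacing in Hz."""
    return fs_channel / points


def frame_rate(fs_channel=F_SAMPLE_PER_CHANNEL, points=FFT_POINTS, acquisitions=2):
    """Frame rate in Hz, assuming `acquisitions` sequential FFT windows per frame."""
    return fs_channel / (points * acquisitions)


if __name__ == "__main__":
    print(adc_rate())              # 8 MHz per SAR-ADC
    print(frequency_resolution())  # about 1.95 kHz
    print(frame_rate())            # about 977 Hz
```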

As shown in Fig. 10.3.5 (top-left), the raw data of the CTS with the 86" TSP shown in Fig. 10.3.5 (top-right) was measured in a real environment when 1mm and 10mm metal pillars and the active stylus touched the TSP. Figure 10.3.5 (bottom-left) shows the demonstration of the MI-CTS using an 86" and two 32" TSPs. When the finger and active stylus touched a CTS, the remaining CTSs simultaneously displayed the coordinates of the finger and active stylus. Figure 10.3.5 (bottom-right) shows the test results of the palm rejection: the touched area of the palm was not displayed on the TSP, confirming that the palm rejection worked properly.

Figure 10.3.6 shows the performance summary of the AFE IC in comparison with previous works. Although a large-sized 86" TSP with 5.0mm-thick glass was used, the measured SNRs were comparable to or higher than those in previous works owing to the use of the differential sensing method. Figure 10.3.7 shows the die micrograph of the AFE IC.

#### Acknowledgements:

This research was financially supported by the Ministry of SMEs and Startups (MSS), Korea, under the "Regional Specialized Industry Development Program (R&D, R0004373)" supervised by the Korea Institute for Advancement of Technology (KIAT).

#### References:

- [1] C. Park, et al., "A Pen-Pressure-Sensitive Capacitive Touch System Using Electrically Coupled Resonance Pen," *ISSCC Dig. Tech. Papers*, pp. 124-125, Feb. 2015.
- [2] M. Hamaguchi, et al., "A 240Hz-Reporting-Rate Mutual-Capacitance Touch-Sensing Analog Front-End Enabling Multiple Active/Passive Styluses with 41dB/32dB SNR for 0.5mm Diameter," *ISSCC Dig. Tech. Papers*, pp. 120-121, Feb. 2015.
- [3] H. Hwang, et al., "A 6.9 mW 120 fps 28×50 Capacitive Touch Sensor with 41.7 dB SNR for 1 mm Stylus Using Current-Driven ΔΣ ADCs," *ISSCC Dig. Tech. Papers*, pp. 170-171, Feb. 2017.
- [4] J. S. An, et al., "A 3.9kHz-Frame-Rate Capacitive Touch System with Pressure/Tilt Angle Expressions of Active Stylus Using Multiple-Frequency Driving Method for 65" 104×64 Touch Screen Panel," *ISSCC Dig. Tech. Papers*, pp. 168-169, Feb. 2017.
- [5] S. Yoshida, et al., "An 87×49 mutual capacitance touch sensing IC enabling 0.5 mm-diameter stylus signal detection at 240 Hz-reporting-rate with palm rejection," *ASSCC*, pp. 217-220, Nov. 2014.



Figure 10.3.1: Block diagram of the multiple-way interactive capacitive touch system (MI-CTS) having multiple CTSs.



Figure 10.3.2: Block diagram and operational principle of the CTS using the modified multiple-frequency-driving method.



Figure 10.3.3: Block diagram of the active stylus (top) and the measured and depicted coordinates of the active stylus and palm of the hand (bottom).



Figure 10.3.4: Block diagram of the AFE IC.



Figure 10.3.5: Measured raw touch data (top-left), demonstration of the proposed CTS with an 86" 198x112 TSP (top-right), MI-CTS using an 86" 198x112 TSP and two 32" 104x64 TSPs (bottom-left), and test results of the palm rejection (bottom-right).

|                                    | This work                                                               | ISSCC 2015 [1]                                                                                                   | ISSCC 2015 [2]                                                       | ISSCC 2017 [3]                                                            | ISSCC 2017 [4]                                                            |
|------------------------------------|-------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------|---------------------------------------------------------------------------|---------------------------------------------------------------------------|
| Process                            | 0.13-μm CMOS                                                            | 0.18-μm BCD                                                                                                      | 85-nm CMOS                                                           | 0.18-μm CMOS                                                              | 0.13-μm CMOS                                                              |
| TSP                                | 86-inch <sup>2</sup>                                                    | 10.1-inch                                                                                                        | 13-inch                                                              | 10.1-inch                                                                 | 65-inch                                                                   |
| Cover glass                        | 5.0 mm                                                                  | N/A                                                                                                              | 1.1 mm                                                               | N/A                                                                       | 4.0 mm                                                                    |
| Air gap                            | 3.0 mm                                                                  | N/A                                                                                                              | 1.5 mm                                                               | N/A                                                                       | 3.0 mm                                                                    |
| # of electrodes                    | TX: 112<br>RX: 198                                                      | TX: 48<br>RX: 32                                                                                                 | TX: 35<br>RX: 57                                                     | TX: 28<br>RX: 50                                                          | TX: 64<br>RX: 104                                                         |
| Electrode                          | Metal Mesh                                                              | ITO                                                                                                              | Metal Mesh                                                           | AgNW                                                                      | Metal Mesh                                                                |
| Frame rate                         | 977 Hz                                                                  | 240 Hz                                                                                                           | 120 Hz                                                               | 120 Hz                                                                    | 3.9 kHz                                                                   |
| SNR (dB)                           | Stylus (1mm)<br>Finger<br>Pressure<br>Tilt<br>Palm rejection<br>Eraser  | Active: 45.5<br>Passive: 39.0<br>60.1 (@10mm)<br>Yes (7.4-bit) <sup>2</sup><br>Yes (7.4-bit) <sup>2</sup><br>Yes | Active: N/A<br>Passive: 49.0<br>62.0 (@6mm)<br>Yes (6.5-bit)<br>No   | Active: 41.0 (0.5mm)<br>Passive: 38.0<br>N/A<br>53.3<br>Yes (6-bit)<br>No | Active: N/A<br>Passive: 41.7<br>61.0 (@10mm)<br>53.3<br>Yes (6-bit)<br>No |
| Stylus Function <sup>1</sup>       |                                                                         |                                                                                                                  |                                                                      |                                                                           |                                                                           |
| Multiple-way interactive operation | Supply<br>Power consumption<br>Chip area<br>Chip area / # of electrodes | 1.5/3.3 V<br>797.4 mW<br>74.17 mm <sup>2</sup><br>6689.0 μm <sup>2</sup>                                         | 1.2/3.3 V<br>30 mW<br>14.7 mm <sup>2</sup><br>9570.3 μm <sup>2</sup> | 1.8/3.3 V<br>6.9 mW<br>1.96 mm <sup>2</sup><br>6265.7 μm <sup>2</sup>     | 1.5/3.3 V<br>246.3 mW<br>42.25 mm <sup>2</sup><br>37.0 μm <sup>2</sup>    |

<sup>1</sup>The CTS was implemented with an 86" 198x112 TSP using a 12.0μm thickness Cu metal mesh structure, where the TSP was covered with 5.0mm thickness glass and mounted on the ultra-HD LCD with 3.0mm air gap.

<sup>2</sup>Resolutions of pressure and tilt of the proposed active stylus are derived when one active stylus is used in the CTS.

Figure 10.3.6: Performance summary for the AFE IC in comparison with previous works.



Figure 10.3.7: Die micrograph of the AFE IC.

## 10.4 A Noise-Immune Stylus Analog Front-End Using Adjustable Frequency Modulation and Linear-Interpolating Data Reconstruction for Both Electrically Coupled Resonance and Active Styluses

Kyung-Hoon Lee<sup>1</sup>, Sang-Pil Nam<sup>1</sup>, Jung-Ho Lee<sup>1</sup>, Michael Choi<sup>1</sup>, Hyung-Jong Ko<sup>1</sup>, San-Ho Byun<sup>1</sup>, Jin-Chul Lee<sup>1</sup>, Yong-Hoon Lee<sup>1</sup>, Yeong-Cheol Rhee<sup>1</sup>, Yoon-Kyung Choi<sup>1</sup>, Byung-Hoon Kang<sup>2</sup>, Chang-Byung Park<sup>2</sup>, Sungsoo Park<sup>2</sup>, Taesung Kim<sup>2</sup>

<sup>1</sup>Samsung Electronics, Hwaseong, Korea

<sup>2</sup>Samsung Electronics, Suwon, Korea

As the demand for intuitive and interactive displays is increasing in mobile devices such as smartphones and tablets, a pressure-sensitive stylus pen solution is needed for advanced user experiences [1-5]. Figure 10.4.1 compares electromagnetic resonance (EMR), active, and electrically coupled resonance (ECR) stylus systems, and shows cross-sectional views of sensor panels. EMR stylus systems have been successfully commercialized for high-end devices due to their battery-less, light, and pressure-sensing features [1-2]. However, EMR stylus systems require an extra sensing panel for electro-magnetic coupling. The active stylus system does not require the additional sensing panel but needs a battery located inside the stylus. On the other hand, ECR stylus systems are cost-effective without needing either the additional sensing panel or a built-in battery. Furthermore, ECR stylus systems measure pen pressure by sensing a stylus' resonant frequency change upon pressure without additional circuitry [3]. Due to the external noise injection to the stylus system being more than ten times larger than the pen signal as shown in Fig. 10.4.1, noise immunity is a key performance factor for commercialization. This paper proposes an analog front-end (AFE) for both ECR and active stylus systems with high noise immunity that results in improved signal-to-noise ratio (SNR) by applying a fully differential architecture, adjustable frequency modulation (AFM), and linear-interpolating data reconstruction (LIDR).

Figure 10.4.2 shows the AFE architecture, comprising 2 groups of a 7-channel receiver, two groups of a reference-channel receiver, and channel-selecting multiplexer. A single receiver channel consists of a charge-to-voltage converting amplifier (CA), a down-conversion mixer, a 4<sup>th</sup>-order low-pass filter (LPF), a programmable gain amplifier (PGA), and an analog-to-digital converter (ADC), while a reference-channel receiver amplifies the external noise with CA only for common-mode noise rejection. In global scan mode, the overall 14-channel receiver scans all the column and row panel electrodes sequentially to sense the pen's approximate position. In local scan mode, the 2 groups of the 7-channel receiver are dedicated to the 7 column and the 7 row electrodes respectively, positioned near the pen's coarse location for sensing the pen's accurate coordinate.

The AFE supports 3 signal-sensing modes (single-ended, pseudo-differential, and fully differential modes) to support various application and system requirements. Figure 10.4.3 shows the single-channel receiver with its 3 sensing mode configurations. In the pseudo-differential mode, touch screen panel's common noise is suppressed by partially canceling the panel noise, using the reference channel. The physically remote location between the signal and reference-sensing electrodes prevents extensive suppression of the panel noise. In the fully differential mode, the receiver utilizes the 2 adjacent panel electrodes as the differential input signals and cancels the panel's common noise for the AFE's SNR improvement.

Figure 10.4.4 shows the frequency-domain analysis of the noise-suppressing adjustable frequency modulation (AFM) scheme. The AFE senses the pen pressure by detecting the varying ECR pen's LC resonant frequency upon applied pressure. A noise-efficient filter adjusts its cut-off frequency to track the ECR pen's frequency change for maximum out-of-band noise suppression and better SNR. Most external noise sources such as fluorescent lamps, chargers, and hum are located in the low-frequency band (20 to 250kHz), and their harmonic components are positioned in the higher-frequency band (>700kHz) than the pen frequency ( $f_{PEN}$ ). A simple way to suppress the external noises, located in both sides of the pen signal band, is to add a band-pass filter (BPF) in the signal path. However, BPF with narrow (~30kHz) transition band is disadvantageous in circuit implementation to adjust the filter's cut-off frequency. To eliminate the need for the filter's cut-off frequency change, an AFM scheme is proposed. In this stylus receiver, a frequency down-conversion mixer modulates the pen signal to a low-frequency band while external noise is modulated to a high-frequency band. Regardless of the pen signal's frequency change upon pressure, the modulated pen signal is always fixed at 30kHz ( $\Delta kHz = f_{PEN} - f_{MOD}$  in Fig. 10.4.4) by adjusting the modulation frequency according to the pen signal frequency, detected in the global scan mode. A 4<sup>th</sup>-order switched-capacitor Butterworth low-pass filter with fixed 50kHz 3-dB cut-off frequency sufficiently suppresses out-of-band noise regardless of the signal frequency variation, and a switched-capacitor PGA amplifies the low-pass filtered pen signal for further SNR enhancement. The AFE architecture can be easily employed to various kinds of ECR and active stylus pen systems without hardware modifications.
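The effect of AFM can be checked numerically with an idealized model. The sketch below assumes an ideal mixer (only the difference frequency is tracked, image and sum terms are ignored) and the continuous-time Butterworth magnitude response  $|H(f)|^2 = 1/(1+(f/f_c)^{2n})$  in place of the switched-capacitor filter. The 500kHz pen frequency is a hypothetical value chosen between the two noise bands; it is not given in the paper.

```python
import math

F_IF = 30e3      # Hz, fixed frequency of the modulated pen signal
F_CUTOFF = 50e3  # Hz, 3-dB cut-off of the low-pass filter
ORDER = 4        # Butterworth order


def modulation_frequency(f_pen, f_if=F_IF):
    """Modulation frequency that keeps the down-converted pen tone at f_if."""
    return f_pen - f_if


def mixed_down(f, f_mod):
    """Difference frequency at the mixer output for an input tone at f."""
    return abs(f - f_mod)


def butterworth_gain_db(f, fc=F_CUTOFF, order=ORDER):
    """Magnitude response in dB of an ideal n-th order Butterworth low-pass filter."""
    return -10.0 * math.log10(1.0 + (f / fc) ** (2 * order))


if __name__ == "__main__":
    f_pen = 500e3  # hypothetical pen resonant frequency
    f_mod = modulation_frequency(f_pen)
    for name, f in [("pen", f_pen), ("noise 250 kHz", 250e3), ("noise 700 kHz", 700e3)]:
        f_out = mixed_down(f, f_mod)
        print(f"{name}: {f_out / 1e3:.0f} kHz after mixing, "
              f"{butterworth_gain_db(f_out):.1f} dB after the LPF")
```

Under these assumptions the pen tone passes almost unattenuated, while both noise-band edges land more than 200kHz away from DC and are suppressed by over 50dB.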

A display panel's periodic pixel updates generate noise spikes that are transferred to the touch screen panel via coupling capacitance and degrade the stylus AFE's performance. Although the fully differential sensing suppresses the display noise, some noise inevitably remains. The linear-interpolating data reconstruction (LIDR) scheme suppresses this remaining display noise further. LIDR comprises two operations: 1) resetting the charge-to-voltage converting amplifier (CA), and 2) the linear interpolation of the AFE output. The horizontal synchronized clock signal ( $H_{SYNC}$ ) from the display panel contains timing information for the periodically coupled display noise. While  $H_{SYNC}$  is active, the CA is reset to avoid AFE saturation, and all AFE channels go to a null state. However, the information nulled during  $H_{SYNC}$  timing can be restored by linear interpolation in digital processing. Figure 10.4.4 illustrates the time-domain analysis of the LIDR with AFM. Without LIDR and AFM, the output of the PGA is saturated during the  $H_{SYNC}$  timing due to the injected display noise that contaminates the signal frequency information. When the frequency of the pen signal is higher than that of  $H_{SYNC}$ , the PGA output is not restored properly by the interpolation with LIDR only, as shown in the middle of Fig. 10.4.4. However, the pen signal can be most effectively restored with both LIDR and AFM because the frequency of the modulated pen signal (30kHz) is sufficiently lower than that of  $H_{SYNC}$  (250kHz), as shown on the right side of Fig. 10.4.4.
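The interpolation step, and the reason it needs a low signal frequency, can be sketched as follows. This is a simplified model under stated assumptions: the 4MHz sample rate, the 25% blanking fraction per  $H_{SYNC}$  period, and the 500kHz unmodulated pen frequency are hypothetical values, and the function names are illustrative.

```python
import math


def lidr_reconstruct(samples, blanked):
    """Linearly interpolate across every run of blanked (nulled) samples."""
    out = list(samples)
    n = len(out)
    i = 0
    while i < n:
        if not blanked[i]:
            i += 1
            continue
        start = i
        while i < n and blanked[i]:
            i += 1
        end = i            # first valid index after the run, or n
        left = start - 1   # last valid index before the run
        if left < 0 or end >= n:
            continue       # no valid neighbour on one side; leave the run as is
        span = end - left
        for k in range(start, end):
            t = (k - left) / span
            out[k] = out[left] + t * (out[end] - out[left])
    return out


def max_error(f_signal, fs=4e6, f_hsync=250e3, blank_fraction=0.25, n=641):
    """Worst-case reconstruction error for a unit sine nulled once per HSYNC period."""
    period = round(fs / f_hsync)
    blank = round(period * blank_fraction)
    clean = [math.sin(2 * math.pi * f_signal * k / fs) for k in range(n)]
    # Null the last `blank` samples of each HSYNC period.
    blanked = [(k % period) >= period - blank for k in range(n)]
    nulled = [0.0 if b else s for s, b in zip(clean, blanked)]
    restored = lidr_reconstruct(nulled, blanked)
    return max(abs(r - c) for r, c in zip(restored, clean))


if __name__ == "__main__":
    print(max_error(30e3))   # modulated pen signal: small error
    print(max_error(500e3))  # hypothetical unmodulated pen signal: large error
```

With these assumed values, the 30kHz tone changes almost linearly across each gap and is restored closely, whereas a tone above the  $H_{SYNC}$  frequency completes a large part of a cycle inside the gap and cannot be recovered by a straight line.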

The AFE is fabricated in a 100nm CMOS process, and its performance is measured on a 10.1-inch commercial tablet as shown in Fig. 10.4.5. The touch screen panel for AFE evaluation uses a metal-mesh structure with a 4mm channel pitch and 54x35 channels. The AFE achieves up to 56dB SNR with the ECR stylus, applying the noise-suppressing AFM and LIDR schemes in the fully differential mode. An active stylus is also implemented and evaluated with the same AFE, which achieves an SNR of 58dB, 2dB higher than with the ECR stylus. The extracted column- and row-axis coordinates of the ECR stylus are shown in Fig. 10.4.5 (bottom). The performances of the AFE and previous works for stylus are summarized in a table in Fig. 10.4.6. Based on the table, the AFE achieves the highest SNR (56dB), compared to the previously reported best performance (49dB), even with injected display and charger noises. The AFE supports pressure sensing with an active stylus as well as a passive stylus, and the 10b resolution is superior to that of recent previous works (6b). Figure 10.4.7 shows a die photograph of the stylus AFE, embedded in a one-chip stylus and finger-touch controller. The total size of the IC is 39.2mm<sup>2</sup>, while the stylus AFE occupies 5.49mm<sup>2</sup>.

### References:

- [1] Samsung. Galaxy Note8 [Online]. Accessed on Nov. 9, 2017. Available: <http://www.samsung.com/global/galaxy/galaxy-note8/>
- [2] Wacom. What is the EMR [Online]. Accessed on Nov. 9, 2017. Available: <http://www.wacom.com/en-in/support/faqs/scope-of-business/electromagnetic-resonance-information>
- [3] C. Park, et al., "A Pen-Pressure-Sensitive Capacitive Touch System Using Electrically Coupled Resonance Pen," *ISSCC Dig. Tech. Papers*, pp. 124-125, Feb. 2015.
- [4] M. Hamaguchi, et al., "A 240Hz-Reporting-Rate Mutual-Capacitance Touch-Sensing Analog Front-End Enabling Multiple Active/Passive styluses with 41dB/32dB SNR for 0.5mm Diameter," *ISSCC Dig. Tech. Papers*, pp. 120-121, Feb. 2015.
- [5] J. An, et al., "A 3.9kHz-Frame-Rate Capacitive Touch System with Pressure/Tilt Angle Expressions of Active Stylus Using Multiple-Frequency Driving Method for 65" 104×64 Touch Screen Panel," *ISSCC Dig. Tech. Papers*, pp. 168-169, Feb. 2017.




**Figure 10.4.1: Comparison of stylus systems, and circuit model for a pen signal and external noise coupling mechanism.**



**Figure 10.4.3: Block diagram of single-channel RX and configuration of AFE for three sensing modes.**



**Figure 10.4.5:** Measurement setup for experimental verification and measured results, and demonstration of the proposed AFE with a commercial 10-inch tablet.



**Figure 10.4.2: Block diagram of the stylus pen AFE.**



**Figure 10.4.4: Frequency-domain and time-domain analysis of proposed AFM and LIDR.**

| List            |                   | [3] ISSCC2015                 | [4] ISSCC2015                   | [5] ISSCC2017                 | This work            |
|-----------------|-------------------|-------------------------------|---------------------------------|-------------------------------|----------------------|
| TSP Size        |                   | 10.1 inch                     | 13 inch                         | 65 inch                       | 10.1 inch            |
| # of Channels   |                   | 48 x 32                       | 57 x 35                         | 104 x 64                      | 54 x 35              |
| Scan rate       |                   | 120 Hz                        | 240 Hz                          | 3906 Hz                       | 133 Hz               |
| SNR             | Display / Charger | OFF / -                       | ON / -                          | ON / ON                       | ON / ON              |
|                 | Passive           | 49.0 dB ( $\Phi=1\text{mm}$ ) | 38.0 dB ( $\Phi=1\text{mm}$ )   | 41.0 dB ( $\Phi=1\text{mm}$ ) | 56 dB                |
|                 | Active            | -                             | 41.0 dB ( $\Phi=0.5\text{mm}$ ) | 50.1 dB ( $\Phi=1\text{mm}$ ) | 58 dB                |
| Pressure        |                   | Yes (6 bit)                   | Yes (N/A)                       | Yes (6 bit)                   | Yes (10 bit)         |
| Process         |                   | 180 nm BCD                    | 85 nm CMOS                      | 130 nm CMOS                   | 100 nm CMOS          |
| Chip area       |                   | 14.7 mm <sup>2</sup>          | 12.5 mm <sup>2</sup>            | 42.3 mm <sup>2</sup>          | 39.2 mm <sup>2</sup> |
| Power (@ touch) |                   | 30.0 mW                       | 62.0 mW                         | 96.3 mW                       | 24.0 mW              |

**Figure 10.4.6: Performance summary and comparison.**



Figure 10.4.7: The micrograph of the proposed IC.

## 10.5 A 0.91mW/Element Pitch-Matched Front-End ASIC with Integrated Subarray Beamforming ADC for Miniature 3D Ultrasound Probes

Chao Chen<sup>1</sup>, Zhao Chen<sup>1</sup>, Deep Bera<sup>2</sup>, Emile Noothout<sup>1</sup>, Zu-Yao Chang<sup>1</sup>, Mingliang Tan<sup>1</sup>, Hendrik J. Vos<sup>1,2</sup>, Johan G. Bosch<sup>2</sup>, Martin D. Verweij<sup>1,2</sup>, Nico de Jong<sup>1,2</sup>, Michiel A. P. Pertijs<sup>1</sup>

<sup>1</sup>Delft University of Technology, Delft, The Netherlands

<sup>2</sup>Erasmus MC, Rotterdam, The Netherlands

Data acquisition from 2D transducer arrays is one of the main challenges for the development of emerging miniature 3D ultrasound imaging devices, such as 3D trans-esophageal (TEE) and intra-cardiac echocardiography (ICE) probes (Fig. 10.5.1). The main obstacle lies in the mismatch between the large number of transducer elements ( $10^3$  to  $10^4$ ) and the limited cable count (<200). Recent advances in transducer-on-CMOS integration have enabled the use of in-probe subarray beamforming based on delay-and-sum (DAS) circuits [1] to reduce the channel count by an order of magnitude. Further reduction calls for in-probe digitization to enable more advanced data processing and compression in the digital domain. However, prior designs [2-4] compromise on transducer pitch (> half wavelength) to accommodate the ADC and consume >9mW/element, which translates into unacceptable self-heating in miniature 3D probes.

This paper presents an element-pitch-matched front-end ASIC that realizes in-probe digitization while consuming 10x less power per element than the prior art [2-4], thus paving the way towards sub-Watt digital probes with 1000+ elements. This is achieved by using subarray beamforming ADCs that combine the DAS and digitization functions in the charge domain, thus reducing the overhead associated with digitization.

Figure 10.5.1 shows an overview of the system. A 5MHz 150μm-pitch PZT transducer array is integrated directly on top of the ASIC using an approach similar to [5]. As a proof-of-concept, we use a split 2D array consisting of 6×24 receive (RX) elements and 3×24 transmit (TX) elements, while the ASIC layout is scalable to a larger aperture (e.g. 32×32 [1]). Since our focus is on RX digitization, the TX elements are directly wired to external pulsers for simplicity. To realize a 9-fold RX channel reduction, the RX elements are divided into 16 subarrays of 3×3 elements each, on which DAS beamforming and digitization is applied within the subarray. An additional 4-fold channel reduction is obtained by sharing a high-speed datalink at the ASIC periphery between 4 subarrays, leading to a total 36-fold channel reduction.
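The channel-count bookkeeping in the paragraph above can be checked with a few lines of arithmetic. The following is only an illustrative sketch; the constant and variable names are hypothetical, while the numbers are taken from the text.

```python
# Channel-count bookkeeping for the RX array described in the text.
# All names are illustrative; only the numeric values come from the paper.
RX_ROWS, RX_COLS = 6, 24        # receive elements of the split 2D array
SUB_ROWS, SUB_COLS = 3, 3       # elements per subarray
SUBARRAYS_PER_LINK = 4          # subarrays sharing one high-speed datalink

rx_elements = RX_ROWS * RX_COLS                    # 144 RX elements
elements_per_subarray = SUB_ROWS * SUB_COLS        # 9-fold reduction per subarray
subarrays = rx_elements // elements_per_subarray   # 16 subarrays
datalinks = subarrays // SUBARRAYS_PER_LINK        # 4 shared datalinks
total_reduction = rx_elements // datalinks         # 36-fold overall

print(f"{rx_elements} elements -> {subarrays} subarrays -> {datalinks} datalinks "
      f"({total_reduction}-fold reduction)")
```

The 36-fold figure is simply the product of the 9-fold subarray reduction and the 4-fold datalink sharing.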

Each subarray receiver consists of a 9-channel analog front-end (AFE) and a 10b 30MS/s beamforming ADC (Fig. 10.5.2). The AFE comprises a low-noise amplifier (LNA), implemented as a single-ended capacitive-feedback inverter-based amplifier [1], and a programmable gain amplifier (PGA), implemented with a differential telescopic amplifier with capacitive feedback. The AFE provides 12 gain steps with a total range of 48dB to compensate the time-dependent propagation attenuation, thus reducing the input dynamic range of the succeeding beamformer to about 45dB and facilitating the implementation of the ADC.

Figure 10.5.3 shows the beamforming ADC. The delay lines are implemented with differential switched-capacitor memory cells [1]. In contrast to [2], which uses an amplifier for active charge-summation and a voltage buffer to drive a SAR ADC, the delay line output is quantized immediately after passive charge-summation, thus eliminating the need for intermediate stages. This is achieved by performing the digitization in the charge domain based on the charge-sharing concept [6]. Upon the rising edges of channel-specific readout clocks,  $R_k<1:8>$ , the corresponding memory cells in 9 channels are joined at the summing nodes,  $V_{XP}$  and  $V_{XN}$ . After a short time interval for passive charge redistribution, binary-scaled charge references, which have been pre-charged by a charge-reference generator, are successively merged to the summing nodes to neutralize the delayed-and-summed charge according to the decision of a self-timed comparator, forming a digital representation of the charge. At the end of each conversion, the summing nodes are reset (CPRST) to clear the residue charge.
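As a rough behavioral illustration of the sequence described above (delay, passive summation, then successive neutralization of the summed charge with binary-scaled references), consider the sketch below. It is an idealized model under several assumptions: charge is treated as a plain number, attenuation from charge sharing, comparator noise and mismatch are ignored, and the function names, delay values and full-scale charge are hypothetical rather than taken from the ASIC.

```python
def delay_and_sum(channels, delays, n):
    """Sum one delayed sample per channel (ideal passive charge summation).

    channels: per-channel lists of sampled charge, arbitrary units
    delays:   per-channel delays in sample periods (hypothetical values)
    n:        output sample index, assumed to be >= max(delays)
    """
    return sum(ch[n - d] for ch, d in zip(channels, delays))


def charge_domain_sar(q_in, q_full_scale, bits=10):
    """Neutralize a signed charge with binary-scaled references, MSB first.

    Each step looks at the sign of the remaining charge (the comparator
    decision) and merges a reference of the opposite polarity.
    Returns the decision bits and the residue left on the summing node.
    """
    residue = q_in
    code = []
    for i in range(bits):
        q_ref = q_full_scale / 2 ** (i + 1)   # binary-scaled charge reference
        bit = 1 if residue >= 0 else 0        # comparator decision
        residue += -q_ref if bit else q_ref   # push the residue towards zero
        code.append(bit)
    return code, residue


def reconstruct(code, q_full_scale):
    """Charge represented by the decision bits (sum of signed references)."""
    return sum((2 * b - 1) * q_full_scale / 2 ** (i + 1)
               for i, b in enumerate(code))


# Nine channels see the same echo at different times; the delays realign it.
delays = [0, 1, 2, 0, 1, 2, 0, 1, 2]
channels = [[0.0] * 8 for _ in delays]
for ch, d in zip(channels, delays):
    ch[5 - d] = 0.5                            # echo charge per channel

q_sum = delay_and_sum(channels, delays, 5)     # coherent sum of all 9 channels
code, residue = charge_domain_sar(q_sum, q_full_scale=9.0)
print(q_sum, code, residue)
```

In this ideal model the residue magnitude after 10 decisions is bounded by the full-scale charge divided by 2^10, which is the sense in which the bit sequence forms a digital representation of the summed charge.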

The differential comparator output (CPout+/-), a return-to-zero signal that represents the ADC output bits, is routed to clock and data recovery circuitry at the periphery of the ASIC (Fig. 10.5.2). There, it is synchronized by a dual-clock FIFO to a 300MHz system clock. The FIFO outputs of every 4 subarrays are combined and encoded into a 10b DC-balanced data-stream to facilitate clock and data recovery on the system side. This is then serialized to 1.5Gb/s with the aid of an on-chip DLL and transmitted off-chip by an LVDS driver.

The charge-reference generator is shown in Fig. 10.5.4. To avoid bulky on-chip decoupling capacitors and power-hungry reference buffers, we use a gated current source ( $I_p$ ) to precharge the CDAC within a predefined duration ( $T_{int}$ ) (Fig. 10.5.4). Unlike [7],  $I_p$  is locally generated in each subarray and periodically self-calibrated during TX when digitization is not required, in reference to a global external voltage  $V_{REF}$ . This simplifies the system-level layout, while maintaining reference uniformity across the chip. During TX, a charge-pump and a comparator are enabled in each clock cycle to adjust the bias voltage of  $I_p$ , until the voltage on the CDAC approaches  $V_{REF}$ . The noise of the charge pump does not affect the RX noise, as it is sampled on  $C_{MOS}$  together with the calibrated bias voltage and held constant during RX. Instead, the dominant reference noise source is the jitter of  $T_{int}$ . To relax the requirement on the system clock jitter, a ping-pong charge reference consisting of two identical CDACs is employed to maximize  $T_{int}$ . During calibration only one of them is activated, while during RX the two are alternately used for precharging and conversions. By sharing  $I_p$  and the generator for  $T_{int}$ , the ping-pong charge reference is free from interleaving spurs caused by the CDAC mismatch.

The ASIC is fabricated in a 0.18μm 1P6M CMOS process and measures 4.8×2.5 mm<sup>2</sup> (Fig. 10.5.7). The subarray circuits consume 0.46mW/element, 38% of which is consumed by the beamforming ADC along with the delay control logic. The total power consumption including the datalink and LVDS drivers is 0.91mW/element.

The ASIC has been characterized electrically by wire-bonding test inputs to the transducer bondpads and recording the digitized outputs with an FPGA. The measured transfer function at 12 gain settings shows a -3dB bandwidth of 11.9MHz (Fig. 10.5.5). An output spectrum recorded at the highest gain shows a peak SNDR of 51.8dB with a 4.95MHz input. To demonstrate the high-speed datalink, Fig. 10.5.5 shows the output waveforms of 4 subarrays recovered from one LVDS output, with the subarrays programmed at different uniform delays while sharing the same input (a 3-cycle 2MHz burst to clearly show the delay).

A fabricated prototype with an integrated transducer array is shown in Fig. 10.5.7. To demonstrate its imaging capability, a waterbag was mounted on top of the prototype and a 3-needle phantom was placed about 20mm in front of the transducer array (Fig. 10.5.6). The phantom was imaged by driving 6 TX elements with 20V pulses. The recorded digital output of the RX subarrays shows a clear increase in the echo amplitude when the beamformer is steered towards the needles. A B-mode image in the lateral direction, generated from four digital subarray outputs, clearly shows the positions of the needles with a resolution in line with the relatively small aperture. A comparison with prior digitization solutions for 3D ultrasound imaging is included in the table in Fig. 10.5.7. Based on the table, our work achieves a 10x improvement in power efficiency and a 3.3x improvement in integration density, making the presented receive architecture a promising candidate for next-generation miniature 3D ultrasound probes.

### References:

- [1] C. Chen, et al., "A Front-End ASIC with Receive Sub-Array Beamforming Integrated with a 32 × 32 PZT Matrix Transducer for 3-D Transesophageal Echocardiography," *IEEE JSSC*, vol. 52, pp. 994-1006, 2017.
- [2] J.-Y. Um, et al., "An Analog-Digital-Hybrid Single-Chip RX Beamformer with Non-Uniform Sampling for 2D-CMUT Ultrasound Imaging to Achieve Wide Dynamic Range of Delay and Small Chip Area," *ISSCC Dig. Tech. Papers*, pp. 426-427, Feb. 2014.
- [3] M.-C. Chen, et al., "A Pixel-Pitch-Matched Ultrasound Receiver for 3D Photoacoustic Imaging with Integrated Delta-Sigma Beamformer in 28nm UTBB FD-SOI," *ISSCC Dig. Tech. Papers*, pp. 456-457, Feb. 2017.
- [4] Y.-J. Kim, et al., "A Single-Chip 64-Channel Ultrasound RX-Beamformer Including Analog Front-End and an LUT for Non-Uniform ADC-Sample-Clock Generation," *IEEE TBioCAS*, vol. 11, pp. 87-97, 2017.
- [5] C. Chen, et al., "A Prototype PZT Matrix Transducer with Low-Power Integrated Receive ASIC for 3-D Transesophageal Echocardiography," *IEEE T-UFFC*, vol. 63, no. 1, pp. 47-59, Jan. 2016.
- [6] J. Craninckx and G. van der Plas, "A 65fJ/conversion-step 0-to-50MS/s 0-to-0.7mW 9b Charge-Sharing SAR ADC in 90nm Digital CMOS," *ISSCC Dig. Tech. Papers*, pp. 246-247, Feb. 2007.
- [7] J. Kuppambatti and P. R. Kinget, "Current Reference Pre-Charging Techniques for Low-Power Zero-Crossing Pipeline-SAR ADCs," *IEEE JSSC*, vol. 49, pp. 683-694, 2014.



**Figure 10.5.1:** ASIC overview with insets showing the application of miniature probes in echocardiography, the transducer-on-CMOS integration and the subarray receiver architecture.

**Figure 10.5.2:** Block diagram of the RX circuits.



**Figure 10.5.3:** Circuit diagram of the subarray beamforming ADC and its timing details.



**Figure 10.5.4:** Circuit diagram of the self-calibrated charge-reference generator and its timing details.



**Figure 10.5.5:** Electrical characterization results.



**Figure 10.5.6:** Acoustic experimental setup and results.



Figure 10.5.7: Micrograph of the ASIC and the prototype with fabricated transducer array, along with a comparison with the state of the art.

## 10.6 Single-Chip Reduced-Wire Active Catheter System with Programmable Transmit Beamforming and Receive Time-Division Multiplexing for Intracardiac Echocardiography

Gwangrok Jung<sup>1</sup>, M. Wasequr Rashid<sup>1</sup>, Thomas M. Carpenter<sup>2</sup>, Coskun Tekes<sup>1</sup>, David M. J. Cowell<sup>2</sup>, Steven Freear<sup>2</sup>, F. Levent Degertekin<sup>1</sup>, Maysam Ghovanloo<sup>1</sup>

<sup>1</sup>Georgia Institute of Technology, Atlanta, GA  
<sup>2</sup>University of Leeds, Leeds, United Kingdom

Intracardiac echocardiography (ICE) provides real-time ultrasound imaging of the heart anatomy from inside, guiding interventions such as valve repair, closure of atrial septal defects (ASD), and catheter-based ablation to treat atrial fibrillation. With its better image quality and ease of use, ICE is becoming the preferred imaging modality over transesophageal echocardiography (TEE) for structural heart interventions. Existing commercial ICE catheters, however, offer a limited 2-D or 3-D field of view despite using a large number of wires. In these catheters, each element in the ICE array is connected to a backend data-acquisition channel with a separate wire, which is a critical barrier to improving image quality and widening the field of view. To use ICE catheters under MRI instead of angiography based on ionizing X-ray radiation, the number of interconnect wires in the catheter should be minimized to reduce RF-induced heating. Furthermore, reducing the number of wires improves the flexibility and lowers the cost of single-use ICE catheters.

To reduce the number of interconnects in ultrasound systems, recent literature shows subarray beamforming with switched-capacitor delay [1,2], digital subarray beamforming with a $\Delta\Sigma$ modulator [3], and time-division multiplexing (TDM) with direct digital demultiplexing (DDD) [4]. The switched-capacitor approach is preferred in 2-D arrays; however, each channel needs a large number of capacitors and switches to achieve a reasonable delay, which introduces mismatch and makes it difficult to fit in an ICE catheter. It is also not compatible with applications that need access to the raw echo data of every channel for improved image processing. The $\Delta\Sigma$ modulator approach has shown compact integration. However, it requires a high-frequency clock (960MHz), which is difficult to feed into long catheters, and it limits the integration of high-voltage (HV) transmit (TX) circuits with thick gate oxide on a single chip.

This paper presents a single-chip reduced-wire active catheter system for driving a 64-channel piezo-transducer array. It adopts an 8:1 TDM analog receiver (RX) with DDD and a TX beamformer (TX-BF) that can be programmed through a single low-voltage differential signaling (LVDS) data line for wire reduction. Figure 10.6.1 shows the ICE system block diagram focusing on the front-end ASIC, which reduces the number of wires from more than 64 to 22. The 64 received raw echo signals are reduced to 8 through TDM, and sent to ADCs in the backend system, where DDD is performed in an FPGA for real-time image processing in the digital domain.

Figure 10.6.2 shows a detailed block diagram of the TX-BF and the 64-channel pulser array of the ASIC. It can create a maximum delay of 10.235μs with a resolution of 5ns, using an 11b global counter (GC). The 6b coarse counter (CC) part of the GC counts down from 63 to find the coarse delay of each channel, while the 5b mod counter (MC) part of the GC is programmed to find the exact start time and the width of a pulse for each channel. Each channel of the 16b serial-in parallel-out (SIPO) shift register (SR) stores the TX delay and pulse-width values for each pulser, which can enable TX pulse-width apodization for side-lobe suppression. A 6b SR is used to check proper data loading, a 2b SR is used for RX configuration, and a 3b SR stores the number of firing pulses for Doppler operation. To program firing delays for all 64 channels, each programming cycle requires a 1040b data packet (64$\times$16+2+3+5+6). Before programming the TX-BF, all registers are reset, following which the data packet is sent from an FPGA according to the timing diagram in Fig. 10.6.3. The 60V pulser can drive 15pF of capacitive loading at 7MHz, which is suitable for a comparably sized 1-D ICE piezo transducer array. The pulser is designed to limit the peak level-shifter current to 5mA from the 60V supply by adding a 55V supply and a 5V buffer chain, avoiding a voltage drop across the 6.7mm on-chip 60V powerline. Figure 10.6.3 also shows measured results indicating functionality of the TX-BF and 60V pulser.
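The delay range and packet length quoted above follow directly from the counter widths. The sketch below reproduces that arithmetic; the way a delay code is split into a coarse and a fine field is an assumption made for illustration (the text does not spell out the exact mapping, and the CC is described as counting down), and all function names are hypothetical.

```python
RESOLUTION_NS = 5                  # delay resolution from the text
COARSE_BITS, FINE_BITS = 6, 5      # CC and MC widths of the 11b global counter
MAX_CODE = 2 ** (COARSE_BITS + FINE_BITS) - 1   # largest 11b delay code


def max_delay_us():
    """Largest programmable delay in microseconds."""
    return MAX_CODE * RESOLUTION_NS / 1000


def split_delay(delay_ns):
    """Split a delay into (coarse, fine) counter fields.

    Assumed mapping: the upper 6 bits go to the coarse counter and the
    lower 5 bits to the mod counter. The real register format may differ.
    """
    code = round(delay_ns / RESOLUTION_NS)
    if not 0 <= code <= MAX_CODE:
        raise ValueError("delay outside the 11b range")
    return code >> FINE_BITS, code & (2 ** FINE_BITS - 1)


def packet_bits(channels=64, bits_per_channel=16, extra_fields=(2, 3, 5, 6)):
    """Programming packet length: per-channel words plus the short fields."""
    return channels * bits_per_channel + sum(extra_fields)


print(max_delay_us())       # maximum delay in microseconds
print(split_delay(160))     # 160ns is 32 steps of 5ns
print(packet_bits())        # 64*16 + 2 + 3 + 5 + 6
```

With 2047 steps of 5ns the maximum delay is 10.235μs, and the packet length evaluates to 1040 bits, matching the figures in the text.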

The RX block in Fig. 10.6.4 operates at 1.8V and consists of TX/RX switches, variable-gain (VG) LNAs, buffer, and TDM circuitry with a symmetric layout, which reduces mismatch. TX/RX switches protect the RX blocks from 60V pulses. Time-gain compensation (TGC) can be applied via the data line to set the 2-stage LNA gain to one of 4 fixed levels (15, 21, 27, and 32dB); the LNA shows an average input-referred noise of 4.6nV/$\sqrt{\text{Hz}}$ at 7MHz. The buffer delivers the amplified echo signal into a TDM circuit, which consists of an analog multiplexer and sample-and-hold (S/H) capacitors. TDM is controlled by a block that generates sample clocks for each of the 8 channels by gating the corresponding clock. The ADC timing, generated by the backend system, is accurately synchronized with the TDM clock to make sure each ADC sample corresponds to a channel in the multiplexed data, following transients. Since properties of the catheter change depending on its surrounding environment, link training is performed (Fig. 10.6.3), as described in [4]. In this work, TDM operates at 200MHz, yielding 25MS/s for each channel, suitable for the echo signal that is centered at 7MHz with 80% bandwidth. The sampled TDM signals are sent out from the catheter handle to backend ADCs through 3m-long Ethernet cables.
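The 8:1 time-division multiplexing and its digital demultiplexing can be pictured as interleaving and de-interleaving of sample streams. The sketch below is a simplified illustration, assuming ideal sampling and a known slot offset (in the real system the alignment comes from link training); the function names and the `offset` parameter are hypothetical.

```python
TDM_RATE_HZ = 200e6          # TDM sample rate from the text
CHANNELS_PER_LINE = 8        # 8:1 multiplexing
PER_CHANNEL_RATE_HZ = TDM_RATE_HZ / CHANNELS_PER_LINE   # 25MS/s per channel


def tdm_multiplex(channels):
    """Interleave equal-length channel sample lists into one TDM stream."""
    return [sample for frame in zip(*channels) for sample in frame]


def ddd_demultiplex(stream, n_channels=CHANNELS_PER_LINE, offset=0):
    """De-interleave a TDM stream back into per-channel sample lists.

    offset: index of the first channel-0 slot in the stream (assumed known
    here; in practice it would be found by link training).
    """
    return [stream[offset + ch::n_channels] for ch in range(n_channels)]


# Round trip with dummy data: channel ch carries the values 10*ch + n.
channels = [[10 * ch + n for n in range(4)] for ch in range(CHANNELS_PER_LINE)]
stream = tdm_multiplex(channels)
recovered = ddd_demultiplex(stream)

# Upper band edge of a 7MHz echo with 80% fractional bandwidth.
upper_edge_hz = 7e6 * (1 + 0.8 / 2)
print(PER_CHANNEL_RATE_HZ, 2 * upper_edge_hz)
```

The per-channel rate of 25MS/s exceeds twice the upper band edge of roughly 9.8MHz, which is consistent with the statement that the rate suits the echo signal.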

The ASIC is fabricated in a 60V 0.18μm HV-BCD process. Figure 10.6.7 shows a micrograph of the entire ASIC, which consists of the 64-channel analog front-end (pulser, TX/RX switch, LNA, and buffer), TX-BF, and TDM; it occupies 2.6$\times$11mm<sup>2</sup> and consumes 401mW average power during B-mode imaging. Each of the TX/RX AFE channels occupies 0.26mm<sup>2</sup>, which is matched to the size of each ICE array element. Successful proof-of-concept imaging experiments are performed by connecting the ASIC, wire-bonded on a PCB, to a 64-channel piezo transducer array at the tip of an ICE catheter using flex cables. Figure 10.6.5 shows the early B-mode images obtained on a standard imaging phantom (N-365, Kyoto Kagaku) of 10 nylon wires with 40dB of dynamic range. Figure 10.6.6 benchmarks the state-of-the-art ultrasound ASIC designs. This work integrates both TX beamforming and RX cable reduction in a single chip, reducing the number of wires from more than 64 down to 22, with 5ns of delay resolution within a span of 10.235μs, while providing the backend image processing engine with access to the entire raw echo data from every channel. This is equivalent to a ~65% reduction in the diameter of the catheter, which significantly improves its flexibility and reach. The complete backend system is designed with the capacity to handle up to 12 TDM signals from a 2-D transducer array, which occupies the same footprint on the ASIC, as shown in the layout in Fig. 10.6.4, while supporting a 96-channel system. Since the size of the pulser often limits the minimum size of the TX/RX elements on the ASIC, the reduced capacitive loading helps with matching the transducer unit area. This architecture is also compatible with subarray beamforming with switched-capacitor delay, which will pave the way to further reduction in the number of wires in ICE catheters, while supporting higher resolution 3D images.

### Acknowledgement:

This work was supported in part by Siemens Medical Solutions and the National Institutes of Health under grants EB015607 and HL121838.

### References:

- [1] Y. Katsube, et al., "Single-Chip 3072ch 2D Array IC with RX Analog and All-Digital TX Beamformer for 3D Ultrasound Imaging," *ISSCC Dig. Tech. Papers*, pp. 458-459, Feb. 2017.
- [2] C. Chen, et al., "A Front-End ASIC with Receive Sub-Array Beamforming Integrated with a 32 x 32 PZT Matrix Transducer for 3-D Transesophageal Echocardiography," *IEEE JSSC*, vol. 52, no. 4, pp. 994-1006, Apr. 2017.
- [3] M.C. Chen, et al., "A Pixel-Pitch-Matched Ultrasound Receiver for 3D Photoacoustic Imaging with Integrated Delta-Sigma Beamformer in 28nm UTBB FDSoI," *ISSCC Dig. Tech. Papers*, pp. 456-457, Feb. 2017.
- [4] T.M. Carpenter, et al., "Direct Digital Demultiplexing of Analog TDM Signals for Cable Reduction in Ultrasound Imaging Catheters," *IEEE T. Ultrason., Ferroelect., Freq. Control*, vol. 63, no. 8, pp. 1078-1085, Aug. 2016.
- [5] G. Gurun, et al., "An Analog Integrated Circuit Beamformer for High-Frequency Medical Ultrasound Imaging," *IEEE TBioCAS*, vol. 6, no. 5, pp. 454-467, Oct. 2012.
- [6] J.Y. Um, et al., "An Analog-Digital-Hybrid Single-Chip RX Beamformer with Non-Uniform Sampling for 2D-CMUT Ultrasound Imaging to Achieve Wide Dynamic Range of Delay and Small Chip Area," *ISSCC Dig. Tech. Papers*, pp. 426-427, Feb. 2014.



Figure 10.6.1: Top-level block diagram of the intracardiac echocardiography system, including the backend, and reduced wires down to 22 in the catheter in comparison with >64 interconnects in a generic catheter.



Figure 10.6.2: Block diagram of the transmit beamformer and pulser sections of the ASIC.



Figure 10.6.3: Timing diagram of the system, TDM sampling clock diagram through the link training process, and measured output pulses from adjacent channels, with multiple consecutive pulse generation for Doppler imaging.



Figure 10.6.4: Block diagram of the receiver section of the ASIC, focusing on schematic of LNA, and the symmetrical layout of the TDM block to improve signal quality.



Figure 10.6.5: Imaging setup diagram with phantom, and B-mode image of 10 nylon wires in a standard phantom.

|                                 | This Work     | [1] ISSCC'17     | [2] JSSC'17 | [3] ISSCC'17 | [5] TBCAS'12  | [6] ISSCC'14  |
|---------------------------------|---------------|------------------|-------------|--------------|---------------|---------------|
| Integrated Tx-BF                | Yes           | Yes              | No          | No           | No            | No            |
| Rx wire reduction               | TDM           | S/H analog       | S/H analog  | ADC + FIFO   | Analog filter | S/H + Digital |
| Rx raw data accessibility       | Yes           | No               | No          | Yes          | No            | No            |
| Delay min (ns)                  | 5             | 25               | 30.3        | 8.33         | 1.75 ~ 2.5    | 6.25          |
| Delay max (μs)                  | 10.235        | 0.750            | 0.272       | 1.067        | 0.035         | 8             |
| Die area (mm<sup>2</sup>)       | 28.6          | 416.64           | 37.21       | 9.37         | 0.36          | 19.35         |
| Die dimensions (mm × mm)        | 2.6×11        | 22.4×18.6        | 6.1×6.1     | 2.93×3.2     | 1.2×0.3       | 4.5×4.3       |
| Power consumption / channel     | 6.26mW        | 0.7mW            | 0.27mW      | 17.5mW       | 4.62mW        | 17.81mW       |
| # of channels                   | 64 Tx / 64 Rx | 128 Tx / 3072 Rx | 1024 Rx     | 16 Rx        | 8 Rx          | 64 Rx         |
| # of wires                      | 22            | > 128            | > 160       | -            | -             | -             |
| Tx amplitude                    | 60 V          | 136 V            | -           | -            | -             | -             |
| Transducer                      | PZT / CMUT    | 2D PZT           | 2D PZT      | 2D CMUT      | Annular CMUT  | 2D CMUT       |
| Process                         | 0.18μm HV     | 0.18μm HV SOI    | 0.18μm      | 28nm         | 0.35μm        | 0.13μm        |

Figure 10.6.6: Benchmarking table of state-of-the-art ultrasound array ASICs.



Figure 10.6.7: Micrograph of the 64-ch 1-D transducer interface ASIC implemented in 60V 0.18μm HV-BCD technology.

## 10.7 A 0.3ppm Dual-Resonance Transformer-Based Drift-Cancelling Reference-Free Magnetic Sensor for Biosensing Applications

Constantine Sideris, Parham Porsandeh Khial, Bill Ling, Ali Hajimiri

California Institute of Technology, Pasadena, CA

Cost-efficient, point-of-use diagnostics are critical for early disease detection. Traditionally, the majority of lab-based analysis equipment utilizes fluorescent markers for biodetection assays. However, magnetic-based labels have recently been shown to be promising alternatives to fluorescent tags for DNA, protein, and cell assays. Magnetic assays offer several key advantages over their fluorescent counterparts, namely that magnetic beads do not suffer from signal decay due to bleaching and that they can be detected with cheap CMOS-based sensors, eliminating the need for expensive lasers, photo-diodes, filters, and complicated post-processing steps. Significant progress has recently been made in the design of magnetic imager ICs, such as [1] which utilizes a GMR approach for detection and [2-4] which measure the resonance shift in an LC tank.

While LC-based approaches arguably offer the simplest and cheapest approach for detecting magnetic labels, they suffer major drift issues due to the fact that the resonance shift in the LC tank is typically sensed by measuring the frequency-shift of an on-chip, free-running oscillator. Utilizing a replica reference can compensate a few sources of drift; however, this consumes twice the amount of power and chip area. Further, thermal gradients on the chip and noise sources of the active devices lead to major degradation in SNR for long-term measurements and cannot be adequately cancelled with a replica reference. In-vitro experiments must often be carried out over multiple hour-long timescales, making the original LC detection approach impractical for realistic assays. Other approaches for compensating drift in LC-based sensors require external permanent magnets and moving parts and may still require replica reference sensors.

The work in [4] achieves magnetic multiplexing by taking advantage of the rich frequency dependence of the magnetic susceptibility of magnetic nanoparticles. Notably, the nanoparticles contribute negatively beyond a certain excitation frequency and decrease the total sensor inductance. At high enough frequencies, the particles become transparent to the polarizing field and do not affect the effective sensing inductance. Therefore, designing a dual-frequency oscillator enables measurement of the contribution of the beads at the lower frequency and tracking of the sensor drift at the higher frequency, where the beads are transparent and do not affect the oscillation frequency. A switched-capacitor-based topology, as in [4], is one potential approach for achieving two oscillation frequencies with the same sensing inductor cell. Unfortunately, the higher frequency of oscillation would not utilize the majority of the tank capacitance used by the lower frequency making it a poor approach for accurately tracking the drift of the lower frequency. Moreover, the switches required to switch the extra tank capacitors would have to be very large to achieve sufficiently low on-resistance, adding a significant amount of extra parasitic capacitance and contributing more noise to the system.

Instead, we consider the 4<sup>th</sup>-order transformer-based system shown in Fig. 10.7.1. This 4<sup>th</sup>-order tank has two resonance frequencies, whose ratio can be shown to be $f_1/f_2 = \sqrt{(1 - k) / (1 + k)}$ when $L_1C_1=L_2C_2$, where $k$ is the transformer coupling factor. In this work, we exploit the dependence of this ratio only on $k$, which desensitizes the system to variations in other component values (e.g., tank capacitance and device parasitics). The low and high resonances of the tank represent symmetric and anti-symmetric modes of the transformer and result in the primary and secondary oscillation voltages being in phase and 180° out of phase, respectively. In order for the system to oscillate at the desired frequency, switches can be used as shown in Fig. 10.7.1 to enforce the required boundary conditions. Unlike the switches required for switched-capacitor-based topologies, the node voltages on both sides of these switches are always identical, indicating that they do not pass any current and thus do not contribute to the overall system noise or tank capacitance.
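The frequency-ratio expression can be checked numerically against the values reported for the prototype ($k$ of 0.73, 4.2nH per coil, oscillation at 1.44GHz and 3.65GHz). The sketch below assumes an ideal lossless tank with two identical coupled LC halves, for which the two modes sit at $f_0/\sqrt{1+k}$ and $f_0/\sqrt{1-k}$; the tank capacitance of 1.68pF is a hypothetical value chosen only so that the low mode lands near 1.44GHz.

```python
import math


def resonance_ratio(k):
    """Ratio of low to high resonance frequency for L1*C1 == L2*C2."""
    return math.sqrt((1 - k) / (1 + k))


def mode_frequencies(l_henry, c_farad, k):
    """Low and high mode frequencies of two identical coupled LC tanks.

    Ideal lossless model: f0 / sqrt(1 + k) and f0 / sqrt(1 - k).
    """
    f0 = 1 / (2 * math.pi * math.sqrt(l_henry * c_farad))
    return f0 / math.sqrt(1 + k), f0 / math.sqrt(1 - k)


K = 0.73            # coupling factor reported for the prototype
L_COIL = 4.2e-9     # inductance per coil reported for the prototype
C_TANK = 1.68e-12   # hypothetical tank capacitance (not from the paper)

f_low, f_high = mode_frequencies(L_COIL, C_TANK, K)
# A 5% capacitance change moves both modes but leaves their ratio unchanged.
f_low_d, f_high_d = mode_frequencies(L_COIL, 1.05 * C_TANK, K)

print(resonance_ratio(K), 1.44 / 3.65)
print(f_low / f_high, f_low_d / f_high_d)
```

Under these assumptions the predicted ratio for $k$ = 0.73 agrees with the reported 1.44GHz/3.65GHz pair to within a fraction of a percent, and the ratio is unaffected by the capacitance change, which is the property the drift-tracking scheme relies on.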

The transformer-based system can thus oscillate at both the low and high frequencies depending on switch setting, and in the absence of magnetic particles over the transformer surface, the two frequencies will track each other with a ratio depending only on the transformer $k$, as noted earlier. The presence of magnetic particles only affects the inductance at the low frequency of operation since they are transparent at the high frequency, and therefore the effective lower frequency in the absence of beads can be reconstructed from the higher frequency. The difference of the actual and reconstructed lower frequencies cancels the system drift and corresponds to shift only due to the magnetic content. In addition, this approach obviates the need of a replica reference, since the sensor acts as its own self-reference, halving the required chip area and power consumption per sensing unit. No external magnets or moving parts are needed.

We implement a prototype 2×2 array of the transformer-based drift-cancelling magnetic sensor in a standard CMOS 65nm process. The sensor operates at 1.44GHz and 3.65GHz and uses a 1:1 transformer with $k$ of 0.73 and 4.2nH inductance per coil. Each sensor runs off a 1V supply and its bias current can be adjusted from 3 to 30mA, enabling each cell to consume as little as 3mW of power. On-chip digital dividers divide the oscillation frequency by 32, allowing for basic frequency counting circuitry to perform sensor readout. Both sides of the transformer are interleaved and laid out as common-centroid to maximize matching between inductance, capacitance, and active devices. The phase-noise was measured to be -130dBc/Hz at a 1MHz offset for the 1.44GHz frequency. Figure 10.7.2 shows a block diagram of the sensor array design.

We alternate switching the sensor between low and high frequencies, counting each for one second at a time, over an 11hr time period to test the frequency tracking and reconstruction capabilities of the system. Figure 10.7.3 shows the low frequency reconstructed from the high frequency data overlaid on top of the measured low frequency, demonstrating the excellent tracking ability of the sensor. Figure 10.7.4 shows time-domain waveforms of the measured low, reconstructed low, and difference frequencies with magnetic beads over the sensor and after they are removed. As can be seen, the reconstructed signal captures the drift content and noise of the lower frequency. Note that the high frequency shifts slightly upwards when beads are introduced. This is due to the fact that the frequency is not high enough for the beads used to be completely transparent, and they instead operate in the regime where the effective inductance is decreased [4]. This leads to an improvement in the effective SNR, since the two frequencies shift in opposing directions due to magnetic content, but in the same manner in response to sensor noise. The frequency noise-floor of the drift-compensated signal was measured to be 500Hz, corresponding to a 0.35ppm detection capability. We measure differing amounts of 4.5μm Dynabeads to characterize sensor response, linearity, and dynamic range, also shown in Fig. 10.7.4. A single bead can be easily resolved by the sensor owing to the drift-cancellation, and excellent linearity is achieved with a dynamic range of at least 62dB, limited only by the maximum number of beads that we measured.
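The reconstruction step can be illustrated with a toy calculation. The sketch below assumes that drift scales both resonance frequencies by the same factor and that the beads are fully transparent at the high frequency (the measurements above show this is only approximately true for the beads used). The drift and bead-shift values are made up for illustration, and the function names are hypothetical.

```python
F_LOW_HZ = 1.44e9     # low oscillation frequency from the text
F_HIGH_HZ = 3.65e9    # high oscillation frequency from the text


def calibrate_ratio(f_low_no_beads, f_high_no_beads):
    """Frequency ratio measured once without magnetic content."""
    return f_low_no_beads / f_high_no_beads


def drift_cancelled_shift(f_low_meas, f_high_meas, ratio):
    """Measured low frequency minus the one reconstructed from the high one."""
    return f_low_meas - ratio * f_high_meas


def to_ppm(delta_f_hz, f_carrier_hz=F_LOW_HZ):
    """Express a frequency shift relative to the carrier in ppm."""
    return delta_f_hz / f_carrier_hz * 1e6


ratio = calibrate_ratio(F_LOW_HZ, F_HIGH_HZ)

drift = 20e-6          # hypothetical common fractional drift (20ppm)
bead_shift = -2000.0   # hypothetical bead-induced shift at the low mode, in Hz

f_low_meas = (F_LOW_HZ + bead_shift) * (1 + drift)
f_high_meas = F_HIGH_HZ * (1 + drift)

raw_shift = f_low_meas - F_LOW_HZ                       # dominated by drift
compensated = drift_cancelled_shift(f_low_meas, f_high_meas, ratio)

print(raw_shift, compensated)
print(to_ppm(500.0))   # the reported 500Hz noise floor expressed in ppm
```

In this toy case the raw shift is dominated by drift, while the compensated value recovers the bead-induced shift to within a fraction of a hertz. The last line reproduces the conversion of the 500Hz noise floor to roughly 0.35ppm of the 1.44GHz carrier.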

Each sensing site is 250×250μm<sup>2</sup>, offering adequate area for DNA, protein, and cell experiments. We perform a DNA detection assay to demonstrate the viability of our sensor to be used in in-vitro experiments: a capture DNA strand is attached to the sensing surface and the presence of a complementary target DNA strand binds the capture strand to a probe strand labeled with a 1μm magnetic marker. The magnetic markers are thus anchored to the sensing surface by the target strand. Figure 10.7.5 shows a diagram of the experiment, a photo of the sensor surface with bound beads, and the measured sensor response indicating that the sensor was able to successfully detect and measure the presence of the target DNA.

### Acknowledgments:

The authors thank K. Mauser and N. Scianmarello for help with sample preparation.

### References:

- [1] G. Li, et al., "Detection of Single Micron-Sized Magnetic Bead and Magnetic Nanoparticles Using Spin Valve Sensors for Biological Applications," *Journal of Appl. Phys.*, vol. 93, no. 10, pp. 7557-7559, 2003.
- [2] H. Wang, et al., "A Frequency-Shift CMOS Magnetic Biosensor Array with Single-Bead Sensitivity and No External Magnet," *ISSCC Dig. Tech. Papers*, pp. 438-439, 2009.
- [3] J.-C. Chien, and A.M. Niknejad, "Oscillator-Based Reactance Sensors with Injection Locking for High-Throughput Flow Cytometry Using Microwave Dielectric Spectroscopy," *IEEE JSSC*, vol. 51, no. 2, pp. 457-472, 2016.
- [4] C. Sideris and A. Hajimiri, "An Integrated Magnetic Spectrometer for Multiplexed Biosensing," *ISSCC Dig. Tech. Papers*, pp. 300-301, 2013.
- [5] T. Mitsunaka, et al., "CMOS Biosensor IC Focusing on Dielectric Relaxations of Biological Water With 120 and 60 GHz Oscillator Arrays," *IEEE JSSC*, vol. 51, no. 11, pp. 2534-2544, 2016.
- [6] K.-H. Lee, et al., "CMOS Capacitive Biosensor with Enhanced Sensitivity for Label-Free DNA Detection," *ISSCC Dig. Tech. Papers*, pp. 120-121, 2012.



Figure 10.7.1: Dual-resonance transformer-based magnetic sensor schematic.



Figure 10.7.2: 2x2 magnetic sensor array system block diagram.



Figure 10.7.3: Demonstration of long-term drift cancellation over 11-hour measurement period.



Figure 10.7.4: Time-domain sensor response to magnetic bead and sensor dynamic range characterization.



Figure 10.7.5: In-vitro DNA detection experiment and sensor response.

|                              | [2]            | [3]                   | [5]               | [6]        | This work        |
|------------------------------|----------------|-----------------------|-------------------|------------|------------------|
| Sensor Type                  | LC (Inductive) | LC (Injection Locked) | LC (Inductive)    | Capacitive | LC (Transformer) |
| Technology                   | 0.13 μm        | 65 nm                 | 65 nm             | 0.35 μm    | 65 nm            |
| Power                        | 165 mW         | 65 mW                 | 12.2 mW (@60 GHz) | 2.34 mW    | 5 mW             |
| Sensitivity (ppm)            | 0.23 ppm       | 1.25 ppm              | 2.67 ppm          | N/A        | 0.3 ppm          |
| Reference Required           | Yes            | Yes                   | N/A               | Yes        | No               |
| Sensor Active Area*          | 0.6 mm²        | 0.212 mm²             | 0.014 mm²         | 0.99 mm²   | 0.17 mm²         |
| Frequency                    | 1.01 GHz       | 6.5/11/17/30 GHz      | 60GHz and 120GHz  | N/A        | 1.4/3.6 GHz      |
| Temperature Control Required | Yes            | Yes                   | Yes               | N/A        | No               |
| Dynamic range                | N/A            | 52 dB                 | N/A               | N/A        | >62 dB           |

\* Sensor and reference cell area

Figure 10.7.6: Performance comparison summary.



Figure 10.7.7: 2x2 chip sensor array die photo.

## 10.8 A 100mK-NETD 100ms-Startup-Time 80×60 Micro-Bolometer CMOS Thermal Imager Integrated with a 0.234mm<sup>2</sup> 1.89μV<sub>rms</sub> Noise 12b Biasing DAC

Ki-Duk Kim<sup>1</sup>, Seunghyun Park<sup>1</sup>, Kye-Seok Yoon<sup>1</sup>, Gyeong-Gu Kang<sup>1</sup>, Hyun-Ki Han<sup>1</sup>, Ji-Su Choi<sup>1</sup>, Min-Woo Ko<sup>1</sup>, Jeong-hyun Cho<sup>1</sup>, Sangjin Lim<sup>1</sup>, Hyung-Min Lee<sup>2</sup>, Hyun-Sik Kim<sup>3</sup>, Kwyro Lee<sup>1</sup>, Gyu-Hyeong Cho<sup>1</sup>

<sup>1</sup>KAIST, Daejeon, Korea; <sup>2</sup>Korea University, Seoul, Korea

<sup>3</sup>Dankook University, Cheonan, Korea

A micro-bolometer focal plane array (MBFPA) detector is one of the best candidates for thermal imaging cameras due to its excellent uncooled imaging performance with low manufacturing cost [1-4]. In Fig. 10.8.1, remote infra-red signals from thermal objects are maximized and absorbed at the MEMS micro-bolometer pixels having a  $\lambda/4$  cavity structure, and they are then converted into a resistance change of the thermistor layer in each cell. A CMOS analog front-end (AFE) then reads out the cell resistance in current mode by applying a voltage bias to the micro-bolometer pixel. In the readout process, a skimming cell that does not respond to the infra-red signal is used to remove the offset components by generating an opposite-phase current, which in turn relaxes the resolution required of the system. Nevertheless, there is still very significant fixed-pattern noise (FPN) resulting from process, voltage, and temperature (PVT) variations, and this severely limits the responsivity/dynamic range trade-off. To address this problem, the bias voltages ( $V_{FID}$  and  $V_{GSK}$ ) applied to the sensing and skimming cells, respectively, should be precisely adjusted so as to avoid any saturation while maintaining sufficient responsivity, and their noise levels must be low enough considering the noise amplification in the signal chain.

Consequently,  $\geq 10$ b-resolution biasing DACs providing  $V_{FID}$  and  $V_{GSK}$  with high linearity and guaranteed monotonicity are required for fine calibration. However, it is technically difficult to integrate such DACs within a fully monolithic CMOS imager chip. The conventional resistor-string DAC has good monotonicity, but it is typically implemented as an external chip due to its intrinsically large size. Furthermore, low temporal noise at the DAC output is required to reduce the noise-equivalent temperature difference (NETD) of a thermal camera. In general, the noise of the biasing DAC is amplified through the AFE by 10 to 20 $\times$ . Considering that 1 LSB of the imager is 0.2mV, the biasing DAC noise should be constrained to  $< 10\mu V_{p-p}$ . In conventional approaches, the NETD induced by the biasing DAC noise is reduced by incorporating a low-pass filter (LPF) with a very low cut-off frequency. However, this significantly limits the bandwidth, which degrades the DAC settling time and increases the time to obtain the first image data after power-on (imaging startup time). Thus, the DAC noise at low frequencies, including 1/f noise, should be suppressed so that an LPF with a higher cut-off frequency can be used for fast startup. In this paper, we present an MBFPA thermal imager integrated with internal biasing DACs. The bit-inverted current-mode design of the biasing DAC is almost mismatch-free. Thus, it significantly reduces the area while keeping high linearity, enabling fully monolithic integration. Furthermore, a chopping technique is applied to the biasing DAC, improving the NETD as well as the startup speed of the thermal image sensor.
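The noise bound quoted above can be reproduced with simple arithmetic. The following is a minimal sketch, assuming the bound is obtained by requiring that the DAC noise, after AFE amplification, stays within one imager LSB; that derivation and the function name are illustrative assumptions, not taken from the paper.

```python
def dac_noise_budget_v(imager_lsb_v: float, afe_gain: float) -> float:
    """Largest DAC noise (volts) that stays within one imager LSB after AFE gain.

    Assumption: the budget is simply LSB / gain; the paper does not state
    how its bound was derived.
    """
    return imager_lsb_v / afe_gain


IMAGER_LSB_V = 0.2e-3  # 1 LSB of the imager = 0.2 mV (from the text)

for gain in (10, 20):  # AFE amplification range quoted in the text
    budget_uv = dac_noise_budget_v(IMAGER_LSB_V, gain) * 1e6
    print(f"AFE gain {gain}x: DAC noise budget = {budget_uv:.0f} uV")
```

Under this assumption, the worst-case gain of 20 gives the 10μV figure quoted in the text.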

Figure 10.8.2 shows a 12b DAC core for the current-mode design, including a full-scale current ( $I_{REF}$ ) as an input, 12 stacked divider-cells for generating binary-weighted currents, data current-summing MUXes at an output ( $I_{DAC}$ ), and a bias circuit; the basic concept is described in [5]. The DAC design in [5] has an easy stacking framework for any resolution, but for a high-resolution 12b design, it can suffer from excessively low bias voltages due to the body effect and from accumulated current-division error. In the proposed DAC, self-body-biasing of each transistor with a separated N-well is utilized to decrease the threshold  $V_{TH}$  of each divider-cell, reducing the body effect. To alleviate the well-to-well spacing overhead of each transistor, the body terminals are clustered for multiple transistors with similar voltage levels, as shown in Fig. 10.8.2. Considering the accuracy of the DAC output ( $I_{DAC}$ ), each divider cell has its own current-division error, and the error accumulated through the stacked structure may result in large INL/DNL. To mitigate this mismatch without increasing the divider cell size, the current divider cell has a swapping function that exchanges two current paths, controlled by its corresponding input bit ( $D_x$ ). This path-swapping operation greatly improves DNL performance because the error polarity of one current path in each divider cell always follows its own data bit ( $D_x$ ). Moreover, an additional control bit INV inverts the swapping bit ( $D_x$ ) in each divider cell. As a result, with INV=1, an inverted INL that has the same magnitude as the INL at INV=0 but opposite polarity can be obtained. Consequently, this bit-inversion method with INV control cancels the symmetric INL and inverted INL through the chopping operation, resulting in very small INL/DNL while maintaining a compact DAC size.
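The cancellation described above can be illustrated numerically. The sketch below is only a behavioral model: the INL profile is a made-up sinusoid, and the only property taken from the text is that INV=1 yields an INL of equal magnitude and opposite polarity. It does not model the divider cells themselves.

```python
import math

N_BITS = 12
CODES = range(2 ** N_BITS)


def inl_inv0(code: int) -> float:
    """Hypothetical INL profile in LSB for INV=0 (shape chosen for illustration)."""
    return 0.8 * math.sin(2 * math.pi * code / 2 ** N_BITS)


def inl_inv1(code: int) -> float:
    """INL for INV=1: same magnitude, opposite polarity, as stated in the text."""
    return -inl_inv0(code)


def chopped_inl(code: int, duty: float = 0.5) -> float:
    """Time-averaged INL when INV toggles with the given duty cycle."""
    return duty * inl_inv0(code) + (1.0 - duty) * inl_inv1(code)


worst_raw = max(abs(inl_inv0(c)) for c in CODES)
worst_chopped = max(abs(chopped_inl(c)) for c in CODES)
print(f"worst INL without chopping: {worst_raw:.2f} LSB")
print(f"worst INL with 50% chopping: {worst_chopped:.2f} LSB")
```

In this model the residual INL scales with the duty-cycle error of the INV signal, which suggests why a symmetric chopping clock matters; this is an observation about the sketch, not a measured result.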

Figure 10.8.3 shows the entire DAC architecture. The reference current  $I_{REF}$  is generated by a low-noise reference voltage  $V_{REF}$  and resistor  $R_{REF}$  with a negative-feedback op-amp.  $I_{REF}$  is mirrored and sourced to the 12b DAC core through a cascode mirror stage (MP3 and MP4). The D/A-converted  $I_{DAC}$  is mirrored again to output the DAC voltage by a wide-swing current mirror stage composed of MN3 and MN4 with a high-gain op-amp for high output impedance. For loop stability of the mirror stage, the positive loop gain is always kept smaller than the negative one by the low source impedance of MN5. In each stage, low-frequency noise, including 1/f noise, offset, and mismatch, is modulated to a high-frequency regime through a chopper with a chopping frequency of  $f_{CHOP}$ . It should be noted that the bit inversion (INV) of the 12b DAC core is also periodically switched at the chopping frequency  $f_{CHOP}$ ; therefore, its INL error (including 1/f noise of the divider-cells) is also modulated to the high frequency  $f_{CHOP}$ . Finally,  $I_{DAC}$  is converted to the DAC voltage output ( $V_{OUT}$ ) after passing through an LPF that removes the chopped 1/f noise, offset, and mismatch of all stages, while using a  $>10\times$  higher cut-off frequency than conventional designs thanks to the much smaller DAC noise.  $V_{OUT}$  is then provided as the bias voltage  $V_{GSK}$  or  $V_{FID}$ . The noise contribution of each stage is simulated and summarized in the table of Fig. 10.8.3, which shows 0.98μV<sub>rms</sub> total noise, meeting the DAC noise requirement.

The thermal imager IC is fabricated in a 0.18μm CMOS process. The total chip occupies 5.1×5.1 mm<sup>2</sup>, and the sensing-cell array consists of 80×60 pixels with a pitch of 35μm. Two of the proposed biasing DACs are integrated in the IC. Figure 10.8.4 shows the measured INL and DNL results of the 12b biasing DAC. By virtue of the mismatch-free design, the maximum INL and DNL are measured to be 0.14 LSB and 0.09 LSB, respectively, while achieving a compact DAC size of 180×1300μm<sup>2</sup>. These results demonstrate that the biasing DAC can be fully integrated in the imager IC and provide a wide calibration range, accurate calibration, and high inter-IC uniformity for NETD. Figure 10.8.4 also shows the measured noise spectrum of the 12b biasing DAC output. The DAC noise, including 1/f noise in the low-frequency range, is highly suppressed owing to the chopping technique, which is applicable only to our DAC structure. Thus, the noise-band-limiting LPF following the DAC does not need an excessively low cut-off frequency, which significantly improves the DAC settling time and temporal noise. With  $R_{LPF}=10k\Omega$  and  $C_{LPF}=400nF$ , the DAC settling time and temporal noise were measured as 100ms, which is much shorter than the 30-to-60s in [4], and 1.89μV<sub>rms</sub>, respectively.
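To relate the quoted filter values to a cut-off frequency and settling time, the sketch below treats the output LPF as an ideal single-pole RC filter. That is an assumption: the paper gives only $R_{LPF}$, $C_{LPF}$, and the measured 100ms, and the function names are illustrative.

```python
import math


def rc_cutoff_hz(r_ohm: float, c_farad: float) -> float:
    """-3 dB cut-off frequency of an ideal single-pole RC low-pass filter."""
    return 1.0 / (2.0 * math.pi * r_ohm * c_farad)


def rc_settling_time_s(r_ohm: float, c_farad: float, n_bits: int) -> float:
    """Time for a single-pole step response to settle within 1/2 LSB at n_bits."""
    tau = r_ohm * c_farad
    return tau * math.log(2 ** (n_bits + 1))


R_LPF = 10e3    # ohms, from the text
C_LPF = 400e-9  # farads, from the text

print(f"time constant: {R_LPF * C_LPF * 1e3:.1f} ms")
print(f"cut-off frequency: {rc_cutoff_hz(R_LPF, C_LPF):.1f} Hz")
print(f"ideal 12b settling: {rc_settling_time_s(R_LPF, C_LPF, 12) * 1e3:.1f} ms")
```

The ideal single-pole estimate comes out below the measured 100ms, so it should be read as a lower bound from the filter alone; other contributors to the measured settling time are not modeled here.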

Figure 10.8.5 shows the temporal noise of the final imager output, measuring a series of 128 image frames with 60Hz frame rate at room temperature. The standard deviation per pixel was measured to be 1.6-LSB<sub>rms</sub>. Based on the temporal noise and 1-kelvin response measurement (1K = 18 LSB), the 80×60 NETD map is acquired with the maximum NETD of 100mK. This excellent NETD performance results from the low-noise biasing DACs reducing the final temporal noise. Figure 10.8.5 also shows the real demonstration of the thermal imager. Figure 10.8.6 summarizes the IC performance and compares it with the state of the art. Since the biasing DAC has low noise and short settling time, the imager IC shows not only the fastest imaging startup speed ( $\leq 0.1s$ ) in the table, but also comparable NETD performance with high dynamic range and resolution. Incorporating an almost mismatch-free architecture, the area-efficient biasing DACs with excellent monotonicity can be fully integrated in the imager IC, resulting in significant reduction in product cost. Figure 10.8.7 shows a die micrograph.
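The relation between temporal noise, responsivity, and NETD used above can be sketched as follows. The function names are illustrative, and the use of a population standard deviation is an assumption, since the paper does not say which estimator was applied to the 128 frames.

```python
import statistics


def temporal_noise_lsb(pixel_samples: list[float]) -> float:
    """Temporal noise of one pixel: standard deviation of its value across frames."""
    return statistics.pstdev(pixel_samples)


def netd_mk(noise_lsb_rms: float, response_lsb_per_k: float) -> float:
    """NETD in millikelvin: temporal noise divided by the response per kelvin."""
    return 1000.0 * noise_lsb_rms / response_lsb_per_k


# Values quoted in the text: 1.6 LSB rms noise and 18 LSB per kelvin.
print(f"NETD = {netd_mk(1.6, 18.0):.1f} mK")
```

With the quoted per-pixel figure this gives an NETD just under the 100mK maximum reported for the NETD map.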

### References:

- [1] C. Posch, et al., "A Microbolometer Asynchronous Dynamic Vision Sensor for LWIR," *IEEE Sensors J.*, vol. 9, no. 6, pp. 654-664, June 2009.
- [2] B. Dupont, et al., "A [10°C ; 70°C] 640×480 17μm Pixel Pitch TEC-Less IR Bolometer Imager with Below 50mK and Below 4V Power Supply," *ISSCC Dig. Tech. Papers*, pp. 394-395, 2013.
- [3] S. Park, et al., "A Shutter-less Micro-Bolometer Thermal Imaging System using Multiple Digital Correlated Double Sampling for Mobile Applications," *IEEE Symp. VLSI Circuits*, pp. C154-C155, 2017.
- [4] ULIS, "Micro80P," Jan. 2015. Accessed on Aug. 18, 2017, <https://www.ulisir.com/media/catalog/datasheet/micro80p.pdf>
- [5] K. D. Kim, et al., "A 10-bit Compact Current DAC Architecture for Large-Size AMOLED Displays," *Proc. SID*, pp. 334-337, 2011.



Figure 10.8.1: MBFPA system and read-out architecture with CMOS sensing circuitry.



Figure 10.8.2: 12 divider-stacking 12b DAC core with multiple bias generator.



Figure 10.8.3: Entire architecture for 12b DAC and simulated noise contribution.



Figure 10.8.4: Measured INL/DNL, noise PSD and settling time of 12b biasing DAC.



Figure 10.8.5: Measured temporal noise of pixel, 80x60 NETD map, and real imaging demonstrations.

|                      | This Work                                                  | ISSCC 2013 [2]   | Micro80P [4]           | SOVC 2017 [3] |
|----------------------|------------------------------------------------------------|------------------|------------------------|---------------|
| Technology           | 0.18μm CMOS                                                | 0.18μm CMOS      | -                      | 0.35μm CMOS   |
| Bolometer Tech.      | Uncooled a-Si                                              | Uncooled a-Si    | Uncooled a-Si          | Uncooled a-Si |
| Power Supply         | 2.6 – 3.6 V                                                | 4 V              | 3.6 V                  | 2.6 – 3.6 V   |
| Power Consumption    | 47 mW                                                      | 170 mW           | 50 mW                  | 45 mW         |
| Pixel Resolution     | 80 x 60                                                    | 640 x 480        | 80 x 80                | 80 x 60       |
| Temporal NETD        | 100 mK                                                     | 40 mK            | 100 mK                 | 100 mK        |
| Frame rate           | 60 Hz                                                      | -                | 50 Hz                  | 60 Hz         |
| Imaging Startup Time | 0.1 s                                                      | -                | ≥30 s \*               | -             |
| Temperature Range    | -20 – +65 °C                                               | +10 – +70 °C     | -40 – +85 °C           | -20 – +65 °C  |
| Type                 | Fully Integrated                                           | Fully Integrated | Fully Integrated       | External Chip |
| Resolution           | 12b                                                        | -                | 8b/10b                 | 12b           |
| D/A type             | Current-mode divider-stacking                              | -                | -                      | -             |
| Techniques           | Body clustering, Chopping with bit-inversion               | -                | -                      | -             |
| Area                 | 180μm x 1300μm                                             | -                | -                      | -             |
| INL/DNL              | 0.14 / 0.09 LSB                                            | -                | -                      | -             |
| Noise PSD            | 187 nV/√Hz @ 10 Hz<br>19 nV/√Hz @ 100 Hz<br>6 nV/√Hz @ 1 kHz | -                | -                      | -             |
| Noise (rms)          | 1.89μV<sub>rms</sub><br>@ R=10kΩ, C=400nF                  | -                | -                      | -             |
| Settling Time        | 0.1 s<br>@ R=10kΩ, C=400nF                                 | -                | 30 – 60 s<br>@ C=47μF  | -             |

\* with internal DAC

Figure 10.8.6: Performance summary and comparison.



Figure 10.8.7: Die micrograph.

# Session 11 Overview: **SRAM**

## MEMORY SUBCOMMITTEE



**Session Chair:**  
***Jonathan Chang***  
*TSMC, Hsinchu, Taiwan*



**Associate Chair:**  
***Chun Shiah***  
*Etron, Hsinchu, Taiwan*

### **Subcommittee Chair: *Leland Chang*, IBM, Yorktown Heights, NY**

SRAM continues to be the critical technology enabler for a wide range of applications from low-power to high-performance computing. This session showcases the leading-edge SRAM developments from the semiconductor industry. Intel presents the smallest SRAM bitcell for 10nm technology, with design assist techniques to enable low  $V_{MIN}$  operation. Samsung presents the smallest bitcell for 7nm technology and shows a double-write driver technique to further improve  $V_{MIN}$ . TSMC demonstrates a 7nm 5GHz L1 cache for high-performance computing.



8:30 AM

**11.1 A 23.6Mb/mm<sup>2</sup> SRAM in 10nm FinFET Technology with Pulsed PMOS TVC and Stepped-WL for Low-Voltage Applications**

*Z. Guo, Intel, Hillsboro, OR*

In Paper 11.1, Intel presents a 23.6Mb/mm<sup>2</sup> SRAM in 10nm FinFET with the smallest 10nm SRAM bitcell. It adopts column-based transient voltage collapse and a stepped wordline to lower the minimum operation voltage ( $V_{MIN}$ ).



9:00 AM

**11.2 A 7nm FinFET SRAM Using EUV Lithography with Dual Write-Driver-Assist Circuitry for Low-Voltage Applications**

*T. Song, Samsung Electronics, Hwaseong, Korea*

In Paper 11.2, Samsung Electronics presents a 7nm FinFET SRAM using EUV lithography. It adopts a 0.026 $\mu\text{m}^2$  bitcell and  $V_{MIN}$  is improved with a proposed dual write-driver (DWD) scheme in combination with a negative bitline scheme.



9:30 AM

**11.3 A 5GHz 7nm L1 Cache Memory Compiler for High-Speed Computing and Mobile Applications**

*M. Clinton, TSMC, Austin, TX*

In Paper 11.3, TSMC presents a 7nm L1 cache memory compiler, which operates at a 5GHz clock frequency. It implements a self-timing scheme with small-signal sensing and a folded architecture to increase the performance.

## 11.1 A 23.6Mb/mm<sup>2</sup> SRAM in 10nm FinFET Technology with Pulsed PMOS TVC and Stepped-WL for Low-Voltage Applications

Zheng Guo, Daeyeon Kim, Satyanand Nalam, Jami Wiedemer, Xiaofei Wang, Eric Karl

Intel, Hillsboro, OR

The emergence of cloud computing and big data analytics, accompanied by a sustained growth of battery-powered mobile devices, continues to drive the importance of energy and area efficient CPU and SoC designs. Low-voltage operation remains one of the primary approaches for active power reduction, but SRAM  $V_{MIN}$  can limit the minimum operating voltage. Device size quantization continues to be a challenge for compact 6T SRAM design in FinFET technologies, where careful co-optimization of the technology and assist circuit design is required for high-density low-voltage array implementations. This paper presents two SRAM array designs in a 10nm low-power CMOS technology featuring 3<sup>rd</sup> generation FinFET transistors: a high-density 23.6Mb/mm<sup>2</sup> array and a low-voltage 20.4Mb/mm<sup>2</sup> array.

Figure 11.1.1 shows the layout diagrams of a 0.0312μm<sup>2</sup> high-density 6T SRAM cell (HDC) and a 0.0367μm<sup>2</sup> low-voltage 6T SRAM cell (LVC) in a 10nm FinFET technology. The HDC utilizes minimum-sized devices, with a fin ratio of 1:1:1 (PU:PG:PD), to minimize cell area, while the LVC features a larger PD device (1:1:2) for improved read stability at low voltage. Self-aligned quad patterning (SAQP) is introduced on critical layers to achieve fin pitches down to 34nm and metal pitches down to 36nm with 193nm immersion lithography [1], enabling a 0.62x area scaling of the 6T SRAM cell relative to a 14nm technology [2]. To further maximize density scaling in the 10nm technology, several key architectural features have been added to scale the array area of the 128kb HDC and LVC macros, achieving 0.58x and 0.57x area scaling relative to their 14nm equivalents, respectively. Figure 11.1.1 highlights the cell area (μm<sup>2</sup>) and array area (mm<sup>2</sup>/Mb) of recently reported 6T SRAM designs from 14nm, 10nm and 7nm technologies [2-4].

Figure 11.1.2 details two architectural features of the 10nm technology for improved density [1]. The first eliminates the need for isolation dummy gates by introducing a minimum isolation step at the source/drain boundary to isolate neighboring transistors by the width of a single gate. The second enables the placement of gate contacts over active transistors, thus eliminating the need for gate extension over isolation to land contacts. The tables in Fig. 11.1.2 summarize the area scaling of critical SRAM periphery circuits, and the array efficiency and density of 128kb SRAM macros in 14nm and 10nm technologies. The combination of single-gate isolation, enabling contacts over active gates, along with the improved pitch scaling of critical interconnect layers has enabled aggressive area scaling of critical SRAM peripheral logic from 14nm to 10nm with minimum fin depopulation. As a result, a 77.1% array efficiency and a 23.6Mb/mm<sup>2</sup> density are achieved for a 128kb HDC macro: a 5.4% area efficiency improvement over a comparable 14nm design. A 78.4% array efficiency and a 20.4Mb/mm<sup>2</sup> density are achieved for a 128kb LVC macro: a 6.8% area efficiency improvement over a comparable 14nm design.
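The quoted macro densities are consistent with the bitcell areas and array efficiencies given in the text. The following minimal sketch shows the arithmetic, assuming 1Mb = 2<sup>20</sup> bits and that array efficiency is the fraction of macro area occupied by bitcells; both are assumptions, and the function name is illustrative.

```python
BITS_PER_MB = 2 ** 20  # assumption: 1 Mb = 1,048,576 bits
UM2_PER_MM2 = 1e6


def macro_density_mb_per_mm2(cell_area_um2: float, array_efficiency: float) -> float:
    """Macro bit density from bitcell area and array efficiency.

    Raw density is what the bitcells alone would give; multiplying by the
    array efficiency accounts for the area taken by periphery circuits.
    """
    raw_density = UM2_PER_MM2 / (cell_area_um2 * BITS_PER_MB)
    return raw_density * array_efficiency


hdc = macro_density_mb_per_mm2(0.0312, 0.771)  # HDC values from the text
lvc = macro_density_mb_per_mm2(0.0367, 0.784)  # LVC values from the text
print(f"HDC macro: {hdc:.1f} Mb/mm^2")
print(f"LVC macro: {lvc:.1f} Mb/mm^2")
```

Under these assumptions the results round to the 23.6Mb/mm<sup>2</sup> and 20.4Mb/mm<sup>2</sup> figures reported for the 128kb macros.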

Wordline underdrive (WLUD) is used to improve the low-voltage read and half-select stability of an SRAM cell: trading off performance for  $V_{MIN}$  [2]. To minimize the impact of interconnect resistance on WL voltage uniformity between different rows of the decoder, WLUD PMOS devices are implemented locally in the WL driver using a matched layout and routing across neighboring rows. To improve the low-voltage write margin, a column-based transient voltage collapse (TVC) scheme is employed to weaken the PU transistor during a write [2]. In this work, a PMOS device (PWR) is used to discharge the memory cell supply ( $V_{CS}$ ). Compared to an NMOS device, a PMOS device improves  $V_{CS}$  control, but at the cost of discharge speed.  $V_{CS}$  can be regulated by PMOS bias devices (PB[1:0]), as is illustrated in Fig. 11.1.3. To minimize write energy overhead, a pulsed  $V_{CS}$  collapse can be applied with no bias current [2]. To avoid half-select instability along the column, due to a low  $V_{CS}$ , careful control of the TVC pulse must be implemented across a range of array configurations when using pulsed TVC. Since the PMOS transistor drive strength degrades super-linearly with a falling  $V_{CS}$  in this configuration, wider TVC pulses can be applied without requiring a bias current to avoid half-select instability.  $V_{CS}$  sensitivity to the TVC pulse-width and/or array configuration is also reduced, and can be further adjusted by tuning the PMOS transistor  $V_t$ . If PMOS bias is required,  $V_{CS}$  can be determined by the voltage division across PWR and PB[1:0]. The resulting voltage level correlates well to the write margin, compared to an NMOS TVC that can produce a higher  $V_{CS}$  under process skew with a lower NMOS:PMOS drive ratio, where the SRAM write margin is degraded.

While WLUD is effective for enhancing read stability, it degrades the write margin. To independently improve the 6T cell's read and write  $V_{MIN}$ , a stepped-wordline (S-WL) scheme [5,6] is implemented to complement the pulsed PMOS TVC write assist. Figure 11.1.3 details the design of the S-WL and PMOS TVC in a 128kb SRAM macro. WLUD pulses (WLUDPULSE[2:0]) are generated from static WLUD bias controls (WLBIA[2:0]) and the read/write clock. WL suppression is first enabled to create a sufficient BL separation that reinforces cell stability, before WLUDPULSE[2:0] are adjusted to restore the WL to a higher voltage level. Since the required BL differential needed to improve read stability is higher than the voltage sensing margin, the read performance is not impacted by S-WL operation. To maximize the effectiveness of the TVC write assist, the TVC pulse is delayed to align it with WL restoration. To minimize interconnect delay along the WLCLK# and WLUDPULSE paths, local buffers are implemented to drive across the 32b sections of the 256b decoder. This reduces the distributed WLCLK# and WLUDPULSE gate loading by 8x, while maintaining the same logic depth for WL generation. Control logic for S-WL is implemented in the timer/control region with a negligible area overhead.

Figure 11.1.4 shows the simulation waveforms during a write cycle using static WLUD, no WLUD and S-WL. With static WLUD, WL is suppressed for the duration of the WL pulse to maintain cell stability while degrading the write margin. Turning off WLUD improves write margin, but compromises read stability. When S-WL is enabled, the WL is suppressed during the first phase to maintain cell stability. After a sufficient BL separation is achieved, WLBIA[2:0] are adjusted to raise the WL voltage, aligned to the TVC pulse, to improve the write margin.

Figure 11.1.5 shows the measured voltage-frequency shmoos of the HDC SRAM using pulsed PMOS TVC write assist, complemented by static WLUD, no WLUD and S-WL. The WL voltage level during the first phase of S-WL matches the static WLUD level. S-WL enables an 80mV and a 100mV improvement to  $V_{MIN}$ , compared to a static WLUD and no WLUD. Compared to S-WL operation, array performance is improved with no WLUD due to the higher WL voltage over the entire WL pulse, but  $V_{MIN}$  increases due to degraded read stability. In contrast, array performance for static WLUD is limited by the suppressed WL write operation. Figure 11.1.6 summarizes the write and read  $V_{MIN}$  measurements for HDC and LVC using a static WLUD and S-WL, with pulsed PMOS TVC. Similar to the voltage-frequency measurements, the WL voltage level during the first phase of S-WL matches the static WLUD level, as determined by the memory cell read stability requirement. S-WL operation enables a 150mV write  $V_{MIN}$  improvement for HDC at the 90<sup>th</sup> percentile and a 60mV write  $V_{MIN}$  improvement for LVC at the 90<sup>th</sup> percentile, without degrading the read  $V_{MIN}$ . Decreased write  $V_{MIN}$  improvement is observed for LVC due to the reduced WLUD applied. A die micrograph of the 10nm test vehicle with 72Mb of LVC SRAM and 54Mb of HDC SRAM is shown in Fig. 11.1.7.

### References:

- [1] C. Auth, et al., "A 10nm High Performance and Low-Power CMOS Technology Featuring 3rd Generation FinFET Transistors, Self-Aligned Quad Patterning, Contact over Active Gate and Cobalt Local Interconnects", *IEDM*, 2017.
- [2] E. Karl, et al., "A 0.6V, 1.5GHz 84Mb SRAM Design in 14nm FinFET CMOS Technology", *ISSCC*, pp. 310-311, 2015.
- [3] K.-I. Seo, et al., "A 10nm Platform Technology for Low Power and High Performance Application Featuring FINFET Devices with Multi Workfunction Gate Stack on Bulk and SOI", *Symp. VLSI Tech.*, pp. 12-13, 2014.
- [4] J. Chang, et al., "A 7nm 256Mb SRAM in High-K Metal-Gate FinFET Technology with Write-Assist Circuitry for Low- $V_{MIN}$  Applications", *ISSCC*, pp. 206-207, 2017.
- [5] K. Takeda, et al., "Multi-step Word-line Control Technology in Hierarchical Cell Architecture for Scaled-down High-density SRAMs", *IEEE Symp. VLSI Circuits*, pp. 101-102, 2010.
- [6] J. Chang, et al., "A 20nm 112Mb SRAM in High- $\kappa$  Metal-Gate with Assist Circuitry for Low-Leakage and Low- $V_{MIN}$  Applications", *ISSCC*, pp. 316-317, 2013.



Figure 11.1.1: (top) 10nm HDC and LVC 6T SRAM cells. (bottom) Bitcell and array area scaling trends.



Figure 11.1.2: (top) 10nm technology features. (middle and bottom) Impact on array efficiency and array density.



Figure 11.1.3: PMOS TVC circuit, and WLUD circuit with S-WL feature.



Figure 11.1.4: Simulation waveforms with (left) PMOS TVC and static WLUD, (middle) TVC and no WLUD, and (right) TVC and S-WL.



Figure 11.1.5: Measured voltage-frequency shmoo for HDC with (top) PMOS TVC and static WLUD compared to TVC and S-WL, and (bottom) TVC and no WLUD compared to TVC and S-WL.



Figure 11.1.6: Measured write and read V<sub>MIN</sub> distribution for HDC and LVC with PMOS TVC plus static WLUD and S-WL.



Figure 11.1.7: Micrograph of 10nm 72Mb LVC and 54Mb HDC test chip.

## 11.2 A 7nm FinFET SRAM Using EUV Lithography with Dual Write-Driver-Assist Circuitry for Low-Voltage Applications

Taejoong Song, Jonghoon Jung, Woojin Rim, Hoonki Kim, Yongho Kim, Changnam Park, Jeongho Do, Sunghyun Park, Sungwee Cho, Hyuntaek Jung, Bongjae Kwon, Hyun-Su Choi, JaeSeung Choi, Jong Shik Yoon

Samsung Electronics, Hwaseong, Korea

SRAM plays an integral role in the power, performance, and area of a mobile system-on-a-chip. To achieve low power and high density, extreme ultraviolet (EUV) lithography is adopted for the 7nm FinFET technology [3-4]. Conventional ArF immersion lithography with a single exposure cannot resolve such extremely fine patterns. Multi-patterning is therefore applied to support high-resolution lithography, but the use of multiple masks introduces additional process variation. Alternatively, EUV offers competitive scaling with a single mask thanks to its shorter wavelength, which yields smaller process variation with fewer additional patterning steps. Figure 11.2.1 shows a 7nm EUV FinFET 6T high-density (HD) SRAM bitcell with an area of  $0.026\mu\text{m}^2$ . The pull-up, pass-gate, and pull-down ratios are 1:1:1 for high-density and low-power applications. Another benefit of EUV technology is a bi-directional metal layer with a scaled pitch, which provides an extra degree of freedom for signal and power routing. Figure 11.2.2 highlights the EUV benefits of bi-directional metals. A uni-directional metal layer requires a different metal layer to connect two nets, and the number of vias between two perpendicular metal lines is limited by the metal width. A wider metal allows placement of more vias between the metal lines, but the redundant parasitic capacitance prevents optimum Power, Performance, and Area (PPA). In contrast, EUV provides bi-directional metal lines, where different metal layers run in the same direction. Therefore, more vias can be placed to reduce the IR-drop with smaller parasitic capacitance and resistance. Figure 11.2.2 also illustrates the delay impact versus stacked-via distance in a standard cell array: the timing penalty is directly proportional to the stacked-via distance in uni-directional metal routing.

SRAM assist is a common technique for achieving low power in recent technologies [2-5]. Since the 6T-HD bitcell does not cover low-voltage ranges in write and read operation, especially in FinFET technology, SRAM assist techniques are selectively applied to write and/or read operations. Figure 11.2.3 illustrates conventional SRAM assist schemes that control the WL, BL, and bit-cell voltage ( $V_{DDC}$ ) independently or together to affect bitcell characteristics favorably. The WL is controlled to bolster the Access-Disturbance Margin (ADM) by temporarily trading off against the Write-Margin (WRM).  $V_{DDC}$  is lowered to skew the WRM within the safe range of bitcell retention. Meanwhile, a negative BL (NBL) scheme is used as a write-assist technique to improve WRM without affecting ADM. However, the NBL technique is limited in application due to BL resistance; Fig. 11.2.3 illustrates the WRM degradation as the number of rows per BL (RPB) increases. The NBL effect diminishes for a large RPB, and is worst for the bitcell farthest from the write-driver. Alternatively, the WL is used as an assist knob to avoid the resistance impact, since the WL is connected to the gate, not the source, of the pass transistor. The WL voltage is ramped slowly from low to high to serve both ADM and WRM [5], at the cost of a timing penalty for a safe ADM. Meanwhile, the BL can be designed with a wider metal through part of the bitcell array to mitigate the BL resistance [2]. However, the BL resistance decreases by at most 50% of the original value, even if half of the BL is made infinitely wide, and the larger BL capacitance degrades performance instead.

Conventionally, the write driver is located at the bottom of the SRAM macro to drive the whole bitcell array. Therefore, the top bitcell, which is located farthest from the write driver, suffers from the worst WRM because it sees the largest BL resistance in the bitcell array for the same bitcell variation. To minimize BL resistance effectively, the Dual Write-Driver (DWD) is proposed as a write-assist, as shown in Fig. 11.2.4. The DWD uses two write drivers, at the top and bottom, which operate together within a short time. Since each of the two write drivers is half the size of the conventional single write driver, the DWD has a similar area to the conventional one. Moreover, the bitcell farthest from a write driver is now located in the middle of the bitcell array, rather than at the top or bottom.

The effective resistance of the DWD is estimated using simple arguments: (1) Since the BL length from the write driver to the farthest bitcell is cut in half, each BL resistance is reduced by 2x. (2) The two write-drivers also drive the middle bitcell in parallel at the same time, thus reducing the BL resistance by 2x again. (3) Therefore, the effective BL resistance for the bitcell farthest from the write-drivers is 0.25x that of the conventional single write-driver. The top write-driver features a Global Write BL (GWBL) that is designed to be enabled together with the bottom write-driver within a short time. There are other approaches to decrease the BL resistance in conventional SRAM design: (1) The BL is designed using a 4x width, which proportionately decreases the ADM. Therefore, the BL width can only be increased within the limits of the optimum bitcell margin. Moreover, there is a PPA trade-off, such as performance degradation due to a large BL width. (2) Alternatively, a multi-bank architecture is adopted to provide a smaller BL resistance in each chunk of BL. However, a multi-bank SRAM macro requires white-space at the boundary between the bitcell array and the periphery, which tends to increase area, even more so in recent cutting-edge technologies [3]. In contrast, the DWD reduces the BL resistance effectively while maintaining the BL capacitance, using only an additional write-driver path. It mitigates the technology challenges that incur design overhead in conventional approaches. The DWD can handle a 4x larger RPB effectively without trading off bitcell stability and scaling, which is not easily accomplished in conventional SRAM design.
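The 0.25x estimate can be reproduced with a very simple wire model. The Python sketch below is an illustration only, not the authors' analysis: it assumes a uniform resistance per row, ignores the write-driver output resistance (including the effect of halving each driver's size), and all function and parameter names (`r_per_row`, `rows_per_bl`) are hypothetical.

```python
def single_driver_resistance(row, r_per_row=1.0):
    """BL wire resistance from a bottom write driver to `row` (1-indexed)."""
    return row * r_per_row


def dual_driver_resistance(row, rows_per_bl, r_per_row=1.0):
    """BL wire resistance at `row` when drivers at both ends act in parallel."""
    r_bottom = row * r_per_row
    r_top = (rows_per_bl - row) * r_per_row
    if r_bottom == 0 or r_top == 0:
        return 0.0  # the cell sits right next to one of the drivers
    return (r_bottom * r_top) / (r_bottom + r_top)


def worst_case(rows_per_bl=256):
    """Return (worst single-driver R, worst dual-driver R, worst dual-driver row)."""
    rows = range(1, rows_per_bl + 1)
    worst_single = max(single_driver_resistance(r) for r in rows)
    worst_row = max(rows, key=lambda r: dual_driver_resistance(r, rows_per_bl))
    worst_dual = dual_driver_resistance(worst_row, rows_per_bl)
    return worst_single, worst_dual, worst_row


if __name__ == "__main__":
    single, dual, row = worst_case(256)
    print(f"worst single-driver R: {single}, worst dual-driver R: {dual} at row {row}")
    print(f"ratio: {dual / single}")
```

Under these assumptions the worst-case cell moves from the top row to the middle row, and its wire resistance drops to one quarter of the single-driver value, matching steps (1) to (3) above.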

Figure 11.2.5 illustrates the 7nm EUV FinFET 256Mb SRAM array's  $V_{MIN}$  operation with NBL and/or DWD schemes. In order to exclude the impact of ADM from the  $V_{MIN}$  distribution, WLUD is applied using a 10% lower  $V_{DD}$  as a read-assist. Silicon shows that DWD alone improves  $V_{MIN}$  by 120mV, and NBL alone by 200mV, compared to no-assist. When both DWD and NBL are applied,  $V_{MIN}$  improves by 300mV. Moreover, when  $V_{MIN}$  is measured at different positions in the bitcell array, DWD shows a smaller  $V_{MIN}$  variation. Conventionally,  $V_{MIN}$  is worst at the position farthest from the write-driver, as explained above. As shown in Fig. 11.2.5, the top-most bitcell (256<sup>th</sup> row among 256 RPB) shows the worst WRM, and a lower bitcell (64<sup>th</sup> row of 256 RPB) has better WRM with either no-assist or NBL. However, DWD shows smaller  $V_{MIN}$  variation over the bitcell array, which provides better controllability of process margin for mass production. Silicon shows that DWD reduces  $V_{MIN}$  variation by up to 8x compared to operation without DWD.

The SRAM macro area overhead is assessed for the different write-assist schemes in Fig. 11.2.5. NBL requires about 5% area overhead, due to the charge pump and additional buffer. In contrast, DWD shows no more than a 0.5% area overhead due to the additional driver. Alternatively, a multi-bank architecture can be applied to decrease the resistance per BL. For example, a 4-bank architecture is adopted to implement 64 RPB with a similar  $V_{MIN}$ , as shown in the silicon result. However, a 4-bank architecture requires four times the white-space at the bitcell array boundary, compared to a 1-bank architecture. The SRAM macro area also increases by up to 30% for a 4-bank architecture with 64 RPB, versus a 1-bank architecture with 256 RPB.

Figure 11.2.6 shows the  $V_{MIN}$  distribution of the 7nm EUV FinFET 6T-HD SRAM with write-assist. The 64 RPB  $V_{MIN}$  distribution shows that DWD is expected to improve  $V_{MIN}$  additionally over the operating voltage-range. Figure 11.2.7 shows the die-photo of the 7nm EUV FinFET SRAM test-chips. Chip-A is designed using a 256Mb SRAM macro that explores NBL and DWD write-assist schemes. Chip-B is configured using 512Kb SRAM macros using the  $0.026\mu\text{m}^2$  6T-HD bitcell, which shows a  $V_{MIN}$  distribution with NBL assist and DWD impact.

### References:

- [1] S. Y. Wu, et al., "Demonstration of a sub-0.03 um<sup>2</sup> high density 6-T SRAM with scaled bulk FinFETs for mobile SOC applications beyond 10nm node," *IEEE Symp. VLSI Tech.*, 2016.
- [2] J. Chang, et al., "A 7nm 256Mb SRAM in high-k metal-gate FinFET technology with write-assist circuitry for low-VMIN applications," *ISSCC*, pp. 206-207, 2017.
- [3] T. Song, et al., "A 7nm FinFET SRAM macro using EUV lithography for peripheral repair analysis," *ISSCC*, pp. 208-209, 2017.
- [4] D. Ha, et al., "Highly manufacturable 7nm FinFET technology featuring EUV lithography for low power and high performance applications," *IEEE Symp. VLSI Tech.*, 2017.
- [5] T. Song, et al., "A 10nm FinFET 128Mb SRAM with assist adjustment system for power, performance, and area optimization," *ISSCC*, pp. 306-307, 2016.



**Figure 11.2.1: 7nm EUV FinFET 6T HD 0.026  $\mu\text{m}^2$  SRAM bitcell.**



**Figure 11.2.2: EUV design flexibility with smaller IR-drop impact.**



**Figure 11.2.3:** Conventional SRAM assist, and WRM versus rows per BL.



**Figure 11.2.4: The proposed Dual Write Driver (DWD) SRAM write-assist and effective BL resistance.**



**Figure 11.2.5: 256Mb SRAM silicon result with DWD or/and NBL, and SRAM macro area comparison for write-assist schemes and bank-architectures.**



**Figure 11.2.6:** V<sub>MIN</sub> distribution of 6T-HD SRAM bitcell.



Figure 11.2.7: 7nm EUV FinFET 6T SRAM test-chips.

### 11.3 A 5GHz 7nm L1 Cache Memory Compiler for High-Speed Computing and Mobile Applications

Michael Clinton<sup>1</sup>, Rajinder Singh<sup>1</sup>, Marty Tsai<sup>1</sup>, Shayan Zhang<sup>1</sup>, Bryan Sheffield<sup>1</sup>, Jonathan Chang<sup>2</sup>

<sup>1</sup>TSMC, Austin, TX

<sup>2</sup>TSMC, Hsinchu, Taiwan

In high performance computing (HPC) applications, the speed of the L1 cache will typically determine the maximum frequency ( $f_{MAX}$ ) of the processor core. Companies that mass produce high-performance microprocessors commonly build the L1 cache from fully-custom macros to ensure that its performance does not limit the  $f_{MAX}$  or throughput of the processor. In addition, it is also common for custom L1 cache designs to use a two-port 8T or a large 6T bitcell, along with domino read logic and very short BLs [2,3]. These designs trade off density and area for high performance. This paper presents a different approach, one which can satisfy a range of different applications: a memory compiler that can generate more than 10,000 different high-speed L1 cache macro configurations. The 7nm L1-cache compiler described in this paper uses a high-current (HC) 6T bitcell, which is more area efficient than an 8T bitcell. The HC bitcell, along with small-signal sensing, allows for a long BL (256b), leading to further area efficiency improvements. Since these L1 macros are just as likely to be used in mobile applications as in HPC applications, they were implemented using the array dual-rail (ADR) architecture [4]. The ADR architecture (Fig. 11.3.1) allows the periphery circuits of the L1 macro to operate at the same voltage as the processor core: a lower  $V_{DD}$  results in dynamic power savings. ADR performance is also improved over an interface dual-rail design when the SRAM and logic supplies are equivalent, as the ADR design does not suffer from level-shifter delays on the inputs or outputs.

Quickly activating the WL is critical for a high-speed L1 cache. The L1 macro is built using a standard SRAM butterfly architecture and places the row-decoder and WL drivers in the center of the macro, which reduces the WL RC delay by 4x. Due to the increased wiring and via resistance in advanced nodes (e.g., 7nm), careful layout construction is required to guarantee that the upper and lower WLs are activated at exactly the same time. Within our power, performance and area constraints, we found that a four-WL clock-drive scheme resulted in the best address setup, access time and wiring/circuit area optimization (Fig. 11.3.2). Using wider-than-minimum WL clock (wl\_clk<3:0>) wires and reducing the gate load by a factor of four helped speed up WL activation. In addition, by controlling the WL pulse width independently for read and write cycles, we are able to shorten the WL pulse during a write and reduce the dynamic power associated with dummy reads.
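The 4x figure is consistent with a distributed-RC view of the WL: wire delay grows roughly with the square of the driven length, so driving from the center (halving the length on each side) cuts the delay by about four. The Python sketch below uses an Elmore-delay approximation of a uniform RC ladder; the segment count and the unit R and C values are illustrative assumptions, not values from the design.

```python
def elmore_delay(n_segments, r_seg=1.0, c_seg=1.0):
    """Elmore delay to the far end of a uniform RC ladder.

    Each segment's capacitance is charged through the total resistance
    between the driver and that segment (k segments of r_seg).
    """
    return sum(k * r_seg * c_seg for k in range(1, n_segments + 1))


def center_drive_speedup(n_columns):
    """Delay ratio of an edge-driven WL to a center-driven WL (each half is n/2)."""
    return elmore_delay(n_columns) / elmore_delay(n_columns // 2)


if __name__ == "__main__":
    for n in (64, 256, 512):
        print(n, round(center_drive_speedup(n), 3))
```

The ratio approaches 4 as the number of segments grows, which is the quadratic length dependence at work; driver and gate-load effects are deliberately left out of this model.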

In an ADR design, the WL driver must use the bitcell voltage ( $V_{DDM}$ ) for proper bitcell operations. The L1 cache performs voltage level translation from the periphery's supply to  $V_{DDM}$  using the NAND gate in the WL decode path. The self-timing scheme used in the L1 cache, described in more detail in the following paragraphs, depends on matching all of the various delays in the normal and self-timed paths. In this design we copy the entire 4-WL decode/driver block and use it to activate the single tracking WL. This allows us to replicate the layout context and layout dependent effects (LDE) for this critical portion of the access path.

The rising edge of CLK generates the internal clock (iclkz) and starts an access to the L1 cache. The self-timing scheme controls the setting time of the sense amplifier and the timing of the restore sequence. The internal clock has a very high fan-out, but we are able to generate iclkz and drive it with only one gate delay by using a dynamic clock generator circuit (Fig. 11.3.3).

The self-timing scheme consists of tracking bitcells, which are base-layer identical to normal bitcells, and therefore can track the normal bitcell read current ( $I_{CELL}$ ) closely. The tracking BL has the same wire and diffusion loading as a normal BL, thus tracking the rate of voltage change very closely and proportionately to the rate of differential development on a normal BL. This scheme uses a tracking WL which is tuned to match the rise time of a normal WL across the full range of columns of the L1 compiler. The differential on the BL's at sense time is flat as a function of columns, which allows us to drive the global IO signals with fast edges. The restore operation start is timed from the sense enable trigger signal, which helps to minimize cycle time.

The HC bitcell can meet the performance targets with a 256b long BL, but there is a significant performance improvement when the BL length is cut in half. This is exactly what is done in what we refer to as the folded option. For this option, we fold the L1 macro over its right edge and cut the BL length in half (Fig. 11.3.4). The capacity of the macro remains the same, but the BL length is halved, leading to a 15-20% reduction in access and cycle time. In our current implementation of the folded macro, the area penalty is approximately 15%. The folded option can offer a sufficient performance boost, for example by pushing the maximum operating frequency of the largest macro (72kb) to over 5GHz.

We recognize that the minimum differential, even with a 6x weak bitcell, increases as the SRAM bitcell voltage is increased. We take advantage of this fact by offering a turbo mode at higher voltages, where the sense enable timing is advanced. Putting the largest L1 macros into turbo mode at high voltage, can result in an additional 5% performance boost.

Compared to a 16nm L1 cache [5] that uses the same architecture, the presented 7nm cache is over 60% smaller (Fig. 11.3.5). The high-speed 7nm L1-cache compiler described in this paper has been verified in silicon. Cycle time measurements made at room temperature and -40°C are presented for a 512x36 and a 1024x72 macro. The measurements were performed on a slow-corner lot. The -40°C results show that the 18kb macro is able to run at 5.36GHz at 1.115V, while the largest 72kb macro is able to achieve 4.4GHz operation at this voltage (Fig. 11.3.6).

The authors would like to thank Van Sisourath for physical design support, and Rao Kodali for logic verification.

#### References:

- [1] J. Chang, et al., "A 7nm 256Mb SRAM in High-K Metal-Gate FinFET Technology with Write-Assist Circuitry for Low- $V_{MIN}$  Applications," *ISSCC*, pp. 206-207, 2017.
- [2] J. Davis, et al., "7GHz L1 Cache SRAMs for the 32nm zEnterprise EC12 Processor," *ISSCC*, pp. 324-325, 2013.
- [3] J. Kulkarni, et al., "Dual-Vcc 8T-bitcell SRAM Array in 22nm Tri-Gate CMOS for Energy-Efficient Operation across Wide Dynamic Voltage Range," *IEEE Symp. VLSI Tech.*, pp. 126-127, 2013.
- [4] M. Clinton, et al., "A Low-Power and High-Performance 10nm Architecture for Mobile Applications," *ISSCC*, pp. 210-211, 2017.
- [5] J. Chang, et al., "Embedded Memories for Mobile, IoT, Automotive and High Performance Computing," *IEEE Symp. VLSI Tech.*, pp. 26-27, 2017.




**Figure 11.3.1: SRAM *butterfly floorplan* with array dual-rail power distribution.**



**Figure 11.3.2: Row decoder, level shifter and WL driver.**



**Figure 11.3.3: CLK generator, WL activation and self-timing scheme.**



**Figure 11.3.4: 7nm 1024x72m4 “folded” macro.**



**Figure 11.3.5:** Layout view of 16nm and 7nm L1-cache macros.



**Figure 11.3.6: Silicon cycle time shmoo for 512x36m4 and 1024x72m4.**



Figure 11.3.7: 1024×72 L1 cache macro die photograph.

# Session 12 Overview: *DRAM*

## MEMORY SUBCOMMITTEE



**Session Chair:**  
**Seung-Jun Bae**  
*Samsung, Hwasung, Korea*



**Associate Chair:**  
**Wolfgang Spirkl**  
*Micron Semiconductor, Munich, Germany*

### **Subcommittee Chair: Leland Chang, IBM, Yorktown Heights, NY**

Demand for high-performance and high-capacity DRAMs is increasing more dramatically than in the past, due to the emergence of new areas such as machine learning, VR and AR. In line with this trend, new innovations with capacities of 16Gb and data-rate speeds of 18Gb/s/pin are introduced this year. These changes are common to high-performance computing, gaming graphics, mobile, and server fields, including artificial intelligence. Two graphics DRAM papers of the next generation GDDR6 standard show a maximum data rate of 16 to 18Gb/s/pin with single-ended signaling, and 16Gb high-density DRAMs in a 10nm process node are introduced in LPDDR4X and DDR4. HBM2 is extended to an 8H stack for 64Gb density while keeping a BW of 341GB/s.



10:15 AM

**12.1 A 16Gb 18Gb/s/pin GDDR6 DRAM with Per-Bit Trainable Single-Ended DFE and PLL-Less Clocking**

*Y-J. Kim, Samsung Electronics, Hwaseong, Korea*

In Paper 12.1, Samsung presents a graphics-DRAM GDDR6 at a data-rate of 18Gb/s/pin with 16Gb density. The chip implements a per-bit trainable single-ended DFE, a ZQ-coded transmitter, and PLL-less clocking to overcome I/O speed limitations due to the DRAM process. Furthermore, this work optimizes clock and power domain crossing and adopts a split-die architecture to improve signal integrity.



10:45 AM

**12.2 A 16Gb LPDDR4X SDRAM with an NBTI-Tolerant Circuit Solution, an SWD PMOS GIDL Reduction Technique, an Adaptive Gear-Down Scheme and a Metastable-Free DQS Aligner in a 10nm Class DRAM Process**

*K. C. Chun, Samsung Electronics, Hwaseong, Korea*

In Paper 12.2, Samsung presents a 16Gb 5Gb/s/pin LPDDR4X with in-DRAM ECC with a self-refresh power of 0.1mW/Gb in a 10nm class DRAM process. Circuit techniques for achieving high speed and low power are presented and key features include an NBTI-tolerant solution, a PMOS SWD GIDL reduction technique, an adaptive IO buffer gear-down scheme, a hybrid I/O buffer and a metastable-free DQS aligner.




11:15 AM

**12.3 A 1.2V 64Gb 341GB/s HBM2 Stacked DRAM with Spiral Point-to-Point TSV Structure and Improved Bank Group Data Control**

*J. Kim, SK hynix, Gyeonggi, Korea*

In Paper 12.3, SK hynix presents a 64Gb 341GB/s HBM2 DRAM with 8H stack. To increase data rate of the multi-TSV stack, a spiral point-to-point TSV structure and improved bank group control are proposed. TSV self-repair techniques and serial temperature code read-out schemes are introduced for reliable 3D operation.



11:45 AM

**12.4 A 16Gb/s/pin 8Gb GDDR6 DRAM with Bandwidth Extension Techniques for High-Speed Applications**

*K-D. Hwang, SK hynix, Gyeonggi, Korea*

In Paper 12.4, SK hynix presents a 16Gb/s/pin GDDR6 with an 8Gb density. Bandwidth extension techniques are introduced: WCK divider with an analog duty correction circuit, 4-to-1 multiplexer with on-chip feedback EQ filter, and loop-unrolled one-tap DFE with a two-stage pre-amplifier.



12:00 PM

**12.5 A 16Gb 1.2V 3.2Gb/s/pin DDR4 SDRAM with Improved Power Distribution and Repair Strategy**

*S. Shim, SK hynix, Icheon, Korea*

In Paper 12.5, SK hynix presents a 16Gb 3.2Gb/s/pin DDR4 DRAM in an 18nm process. To overcome internal power drops in the large die, power pads are placed in the middle of the array and a staggered power up scheme for the memory stack is used to reduce in-rush current. ECC for reliable operation of redundancy fuse latches and module self-repair schemes are introduced.

## 12.1 A 16Gb 18Gb/s/pin GDDR6 DRAM with Per-Bit Trainable Single-Ended DFE and PLL-Less Clocking

Young-Ju Kim, Hye-Jung Kwon, Su-Yeon Doo, Yoon-Joo Eom, Young-Sik Kim, Min-Su Ahn, Yong-Hun Kim, Sang-Hoon Jung, Sung-Geun Do, Chang-Yong Lee, Jae-Sung Kim, Dong-Seok Kang, Kyung-Bae Park, Jung-Bum Shin, Jong-Ho Lee, Seung-Hoon Oh, Sang-Yong Lee, Ji-Hak Yu, Ji-Suk Kwon, Ki-Hun Yu, Chul-Hee Jeon, Sang-Sun Kim, Min-Woo Won, Gun-hee Cho, Hyun-Soo Park, Hyung-Kyu Kim, Jeong-Woo Lee, Seung-Hyun Cho, Keon-Woo Park, Jae-Koo Park, Yong-Jae Lee, YongJun Kim, Young-Hun Seo, Beob-Rae Cho, Chang-Ho Shin, Chan-Yong Lee, YoungSeok Lee, Yoon-Gue Song, Sam-Young Bang, YounSik Park, Seouk-Kyu Choi, Byeong-Cheol Kim, Gong-Heum Han, Seung-Jun Bae, Hyuk-Jun Kwon, Jung-Hwan Choi, Young-Soo Sohn, Kwang-Il Park, Seong-Jin Jang

Samsung Electronics, Hwaseong, Korea

Starting at 512Mb 6Gb/s/pin [1], GDDR5 speed and density have developed steadily for about 10 years, recently reaching 8Gb 9Gb/s/pin [2] with per-pin timing training. Although 8Gb GDDR5X can operate at 12Gb/s [3] by increasing the burst length (BL) from 8 to 16, system performance degrades at the resulting data granularity of 64B. The I/O specification, which uses PLL clocking and therefore adds PLL jitter, has not changed much compared with GDDR5. To overcome these issues, GDDR6 introduced a dual channel for a data granularity of 32B with a BL16, per-bit training of  $V_{REF}$ , and an equalizer with PLL-less clocking. This paper presents a 16Gb 18Gb/s/pin GDDR6 DRAM with a die architecture and high-speed circuit techniques in a 1.35V DRAM process.

The proposed 16Gb GDDR6 DRAM adopts a 1-channel/1-die (split-die) architecture to achieve a high data rate, while LPDDR4 uses it mainly for yield [4], as shown in Fig. 12.1.1(a). In this architecture, the amount of inter-symbol interference (ISI) and crosstalk at high-speed over 12Gb/s is significantly reduced with the optimized placement of each die for package routing between the package ball and the chip's DQ/WCK pads, shielding, and return path. In addition, array IR-drops and the internal data-path delays are smaller than those of a 2-channel/1-die architecture that is shown in Fig. 12.1.1(b), due to the half-length of array. Furthermore, the split-die architecture is suitable for high density and yield.

The proposed GDDR6 optimizes power and clock-domain crossings to provide noise insensitivity and to obtain sufficient internal data sampling margin by avoiding the low jitter correlation between CK and WCK at 18Gb/s, as shown in Fig. 12.1.2. Since WCK uses a 9GHz clock without a PLL, WCK is divided by 4 to match the CK frequency. Additionally, the divided WCK (WCK4) and CK are each divided by 2 to generate in-pointer and out-pointer signals, which have the same period as the command cycle for the FIFO in a controller. As a result, the clock-domain crossing between WCK and CK occurs in the FIFO with a  $2t_{CK}$  margin, which is 889ps at 18Gb/s. In addition, the power-domain crossings are placed along the same boundaries as the clock-domain crossings to maximize the data sampling margin, which depends on the difference between  $V_{DDQ}$  and  $V_{DD}$ .
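The timing numbers quoted in this section follow directly from the data rate. The short Python sketch below recomputes them; it assumes two bits per WCK cycle (a 9GHz WCK for 18Gb/s, as stated above), and the function and key names are chosen here for illustration only.

```python
def gddr6_clocking(data_rate_gbps=18.0):
    """Derive clock quantities from the per-pin data rate (in Gb/s)."""
    ui_ps = 1e3 / data_rate_gbps      # one unit interval, in ps
    wck_ghz = data_rate_gbps / 2      # two bits per WCK cycle (9GHz at 18Gb/s)
    ck_ghz = wck_ghz / 4              # WCK divided by 4 matches the CK frequency
    tck_ps = 1e3 / ck_ghz             # CK period, in ps
    fifo_margin_ps = 2 * tck_ps       # pointers run at CK/2, giving a 2*tCK window
    return {
        "ui_ps": ui_ps,
        "wck_ghz": wck_ghz,
        "ck_ghz": ck_ghz,
        "tck_ps": tck_ps,
        "fifo_margin_ps": fifo_margin_ps,
    }


if __name__ == "__main__":
    for name, value in gddr6_clocking(18.0).items():
        print(f"{name}: {value:.2f}")
```

At 18Gb/s this gives a CK of 2.25GHz, a UI of about 55.56ps, and a  $2t_{CK}$  window of about 889ps, in line with the values given in the text.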

To achieve 18Gb/s with a pseudo-open-drain single-ended transmitter, the ISI amplification and power-supply-induced jitter (PSIJ) must be minimized. Figure 12.1.3(a) shows a conventional transmitter with a pre-driver consisting of a NAND/NOR-gate. It encodes the full-rate data with the reference impedance (ZQ) codes, which represent ZQ calibration results, at the pre-driver. Therefore, the transmitter's PSIJ becomes large due to the long delay of CMOS circuits from WCK to DQ, and the ISI increases seriously due to the high-frequency operation of the NAND/NOR gates. The proposed ZQ-coded transmitter, shown in Fig. 12.1.3(b), resolves these issues; it uses two 4:1 multiplexers (MUXes) in each driver segment, and encodes the quarter-rate data with ZQ codes before the 4:1 MUX. Therefore, the delay from WCK to DQ is minimized, and ISI amplification is reduced by eliminating the NAND/NOR-gate in the pre-driver. Although the power consumption increases with the number of 4:1 MUXes, the ZQ-coded transmitter can achieve 18Gb/s with low jitter because the ISI and PSIJ are substantially improved by a 33% delay reduction.

As single-ended IO speeds increase beyond 12Gb/s, the DRAM receiver's DFE is constrained to make a decision in less than a UI (55.56ps at 18Gb/s) and is further impacted by the large bit-to-bit variation caused by crosstalk and mismatch. A conventional DFE that uses the output of the flip-flop as feedback data cannot be used at 18Gb/s because the flip-flop delay in a DRAM process is larger than 55.56ps. Loop-unrolling [3] is possible, but incurs unavoidable power and area overhead. Therefore, the proposed GDDR6 adopts the fast-feedback scheme [1] that directly feeds back the output of the latch, not the flip-flop, as shown in Fig. 12.1.4. In this work, a stacked-feedback sense amplifier (SA) with a dedicated  $V_{REF}$  is proposed for faster speed, immunity to PVT variation, and per-bit training. The dedicated  $V_{REF}$  reflects the reference level and the DFE coefficient, which are generated by a reference voltage generator in each DQ. Since the SA in [1] is affected by the common-mode (CM) and differential levels of the feedback data, the DFE tap coefficients of that SA can be changed by the voltage swing variation of the feedback data according to the input data pattern and PVT. Thus, the output CM of the SA must settle within a UI for DFE operation. The proposed DFE SA operates faster than the DFE SA in [1] because it relaxes the time required for the output CM of the SA to settle. Constant tap coefficients are guaranteed regardless of the SA's CM output signal level, because the stacked-feedback SA outputs are only used to select the polarity, and the DFE tap coefficient is decided by the level of the reference voltages ( $V_{REF\_P}$  and  $V_{REF\_N}$ ). Simulation results in the worst process corner show that a DFE with the proposed SA operates at over 18Gb/s with a 1mV offset, whereas a DFE based on the SA in [1] can operate up to 13Gb/s with a 20mV offset. The reference voltage generator is implemented with minimum area overhead via a shared resistor ladder and a decoder-embedded adder and subtractor.
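The behavior described above, where the previous decision only selects the polarity and the tap weight is set by two reference levels, can be sketched at the bit level. The Python model below is a behavioral illustration, not the stacked-feedback SA circuit: the channel model, the signal levels, and the deliberately exaggerated post-cursor ISI are assumptions chosen so that a single fixed reference fails while the 1-tap direct-feedback DFE recovers the data.

```python
def channel(bits, main=0.2, post=0.25, v_mid=0.5):
    """Single-ended levels with one post-cursor ISI tap (illustrative values)."""
    levels = []
    prev = 0
    for b in bits:
        levels.append(v_mid + main * (2 * b - 1) + post * (2 * prev - 1))
        prev = b
    return levels


def slicer_fixed(levels, v_ref=0.5):
    """Conventional slicer with a single reference voltage."""
    return [1 if v > v_ref else 0 for v in levels]


def slicer_dfe(levels, v_ref_p=0.75, v_ref_n=0.25):
    """1-tap direct-feedback DFE.

    The previous decision only selects which reference (v_ref_p or v_ref_n)
    the current sample is compared against; the tap weight itself is set by
    the spacing of the two references.
    """
    decisions = []
    prev = 0
    for v in levels:
        v_ref = v_ref_p if prev == 1 else v_ref_n
        prev = 1 if v > v_ref else 0
        decisions.append(prev)
    return decisions


if __name__ == "__main__":
    tx = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0]
    rx = channel(tx)
    print("fixed reference errors:", sum(a != b for a, b in zip(slicer_fixed(rx), tx)))
    print("DFE errors:            ", sum(a != b for a, b in zip(slicer_dfe(rx), tx)))
```

Because the feedback acts only as a selector, the effective tap coefficient here depends on `v_ref_p` and `v_ref_n` alone, which mirrors the argument for constant tap coefficients and makes per-bit training a matter of adjusting the two references.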

It is hard to generate a low-jitter clock using a PLL in a DRAM process, thus the proposed GDDR6 uses PLL-less clocking, as shown in Fig. 12.1.5. A PLL-less clock scheme directly receives a 9GHz clock for 18Gb/s operation. A WCK receiver uses a current-mode logic (CML) buffer, with a DC suppression capacitor, to compensate for the duty-cycle distortion of the received WCK [2]. In addition, the fan-out of the WCK receiver and the frequency divider is limited to 1 to maximize bandwidth; as the WCK path is sensitive to the swing level and the duty-cycle of the input/output signals, and the bandwidth of the circuits. The frequency divider following the WCK receiver uses low-threshold-voltage transistors for MNL0 to MNL7. MNL0 and MN1 operate in the saturation region, and maintain a differential gain via the low-threshold-voltage transistors even if the output swing level of the WCK receiver is decreased by the limited bandwidth. Additionally, a PLL-on mode is also implemented so that it can receive a 4.5GHz WCK, and operate with a PLL for 18Gb/s. The PLL is implemented based on CML to improve the power-noise characteristics. Furthermore, the voltage controlled oscillator (VCO) and the charge pump (CP) use regulated power supplies to make them insensitive to power-line noise.

Figures 12.1.6(a) and (b) show the frequency-voltage shmoos when the PLL is off and on. In default mode (PLL off), speeds of 14 and 16Gb/s are achieved at 1.1 and 1.15V, even with a 1.35V supply voltage specification. As shown in Fig. 12.1.6(b), this GDDR6 with PLL clocking can achieve 16Gb/s at 1.3V, but the low-voltage margin and the pass window are reduced compared to PLL-off mode. Figure 12.1.6(c) shows an eye shmoo during read operations at 16Gb/s with a 1.35V supply voltage, using 2ps and 25mV steps. The horizontal and vertical eye openings are 44ps (0.7UI) and 325mV when the GDDR6 operates in gap-less read with bank interleaving. Additionally, Fig. 12.1.6(d) shows the measured 18Gb/s output waveform using a 0101 pattern at 1.35V. The chip micrograph and summary are shown in Fig. 12.1.7.

### References:

- [1] S.-J. Bae, et al., "A 60nm 6Gb/s/pin GDDR5 Graphics DRAM with Multifaceted Clocking and ISI/SSN-Reduction Techniques," ISSCC, pp. 278-279, 2008.
- [2] H.-Y. Joo, et al., "A 20nm 9Gb/s/pin 8Gb GDDR5 DRAM with an NBTI monitor, jitter reduction techniques and improved power distribution," ISSCC, pp. 314-315, 2016.
- [3] M. Brox, et al., "An 8Gb 12Gb/s/pin GDDR5X DRAM for cost-effective high-performance applications," ISSCC, pp. 388-389, 2017.
- [4] C.-K. Lee, et al., "A 5Gb/s/pin 8Gb LPDDR4X SDRAM with Power-Isolated LVSTL and Split-Die Architecture with 2-Die ZQ Calibration Scheme," ISSCC, pp. 390-391, 2017.



Figure 12.1.1: Dual channel GDDR6: (a) the 1-channel/1-die (split-die) architecture used in this work, (b) a conventional 2-channel/1-die architecture.



Figure 12.1.2: Clock and power-domain crossing architecture of GDDR6.



Figure 12.1.3: (a) Conventional transmitter, (b) proposed ZQ-coded transmitter.



Figure 12.1.4: Direct feedback 1-tap DFE receiver with per-bit dual-reference-voltage generator.



Figure 12.1.5: PLL-less clocking path with optionally selectable PLL.



Figure 12.1.6: Frequency-voltage shmoo during (a) PLL-off and (b) PLL-on modes. (c) ATE measured read data eye at 1.35V and 16Gb/s. (d) Measured 18Gb/s output waveform for a 0101 pattern at 1.35V.



| Parameter | Value |
|----------------|------------------------------|
| Data rate | 18Gb/s/pin |
| Supply voltage | 1.35V |
| Banks | 16 per channel |
| Density | 8Gb per channel |
| Burst length | 16 |
| I/O width | x16 per channel |
| Area | 47.9mm<sup>2</sup> per channel |

Figure 12.1.7: Chip micrograph (1-channel) and summary.

## 12.2 A 16Gb LPDDR4X SDRAM with an NBTI-Tolerant Circuit Solution, an SWD PMOS GIDL Reduction Technique, an Adaptive Gear-Down Scheme and a Metastable-Free DQS Aligner in a 10nm Class DRAM Process

Ki Chul Chun, Yong-Gyu Chu, Jin-Seok Heo, Tae-Sung Kim, Soohwan Kim, Hui-Kap Yang, Mi-Jo Kim, Chang-Kyo Lee, Juhwan Kim, Hyunchul Yoon, Chang-Ho Shin, Sangju Cha, Hyung-Jin Kim, Young-Sik Kim, Kyungryun Kim, Young-Ju Kim, Wonjun Choi, Dae-Sik Yim, Inkyu Moon, Young-Ju Kim, Junha Lee, Young Choi, Yongmin Kwon, Sung-Won Choi, Jung-Wook Kim, Yoon-Suk Park, Woongdae Kang, Jinil Chung, Seunghyun Kim, Yesin Ryu, Seong-Jin Cho, Hoon Shin, Hangyun Jung, Sanghyuk Kwon, Kyuchang Kang, Jongmyung Lee, Yujung Song, Young-Jae Kim, Eun-Ah Kim, Kyung-Soo Ha, Kyoung-Ho Kim, Seok-Hun Hyun, Seungbum Ko, Jung-Hwan Choi, Young-Soo Sohn, Kwang-Il Park, Seong-Jin Jang

Samsung Electronics, Hwaseong, Korea

Demand for high-density and high-speed DRAM has been ever-increasing to deliver a better user experience on mobile systems that adopt QHD (2560×1440) and higher display resolutions, dual cameras, augmented reality, and advanced driver-assistance systems. LPDDR4X has been the hand-held and mobile memory of choice due to its high speed (5.0Gb/s/pin [1]) and low-power data retention (<0.1mW/Gb [2-3]), as well as its reliability due to in-DRAM ECC. The DRAM process continues to scale down to the 10nm era to meet the ever-increasing density requirements (LPDDR4X density doubles every two years for flagship smart-phones). However, poor data retention characteristics due to smaller storage capacitances, and device issues such as reliability (NBTI) and leakage (especially in core transistors) with the traditional poly-gate and planar-bulk technology, become a primary concern for mobile DRAM. In-DRAM ECC is fully supported by the JEDEC LPDDR4 specification through the introduction of the new masked-write command (MWR;  $t_{CCDMW}=32t_{CK}$ ); however, the area overhead (6.25%) due to the additional parity arrays for a (136, 128) single-error-correction code [4] is currently limiting for mass production in terms of chip cost. This overhead can be mitigated by adopting a scaled technology node that enables a smaller chip size as well as better retention time due to ECC. This paper presents several circuit techniques to maintain LPDDR4X's high speed and low power in a 10nm class process, thereby enabling a cost-effective DRAM design with in-DRAM ECC: (1) an NBTI-tolerant circuit solution that covers all high-speed circuit regions, (2) a sub-WL driver (SWD) PMOS GIDL-reduction technique that ensures stable power recovery, (3) an adaptive IO buffer current gear-down scheme based on user scenarios, and (4) a metastable-free DQS aligner. Figure 12.2.1 shows the top-level block diagram of the 8Gb/1channel macro, with an in-DRAM ECC using a (136, 128) single-error-correction code, similar to that of previous 20nm designs [2-4].
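The 6.25% overhead quoted for the (136, 128) code can be checked from the standard single-error-correction bound. The Python sketch below computes the minimum number of parity bits for a given data width; the function names are chosen here for illustration, and the sketch covers only the code-size arithmetic, not the ECC circuit itself.

```python
def sec_parity_bits(data_bits):
    """Smallest r such that 2**r >= data_bits + r + 1.

    The r parity bits must distinguish 'no error' from a single-bit error
    in any of the data_bits + r codeword positions.
    """
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def parity_overhead(data_bits):
    """Return (parity bits, codeword length, parity-to-data ratio)."""
    r = sec_parity_bits(data_bits)
    return r, data_bits + r, r / data_bits


if __name__ == "__main__":
    r, n, overhead = parity_overhead(128)
    print(f"({n}, 128) code: {r} parity bits, {overhead:.2%} overhead")
```

For 128 data bits this yields 8 parity bits, a 136-bit codeword, and 8/128 = 6.25% additional array bits, consistent with the figure given in the text.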

As the NBTI-induced PMOS performance degradation, and the resulting chip speed degradation phenomenon, becomes worse with technology scaling [5], an NBTI-tolerant solution that covers the whole high-speed data path and IO circuit regions (highlighted in Fig. 12.2.1) is proposed in Fig. 12.2.2. Header-only power-gating is applied to data drivers and repeaters to suppress the amount of device degradation, as the virtual power supply ( $V_{PWR}$ ) level drops below the nominal supply voltage during standby. Critical signals controlling major peripheral blocks are generated only on rising edges of CLK; specifically, control signal's rising/falling edges are generated via consecutive rising edges of CLK. The rising-CLK-only signal generation enables NBTI-free control, as PMOS performance degradation only happens during the falling edge of CLK. As for chip internal clocks, a low-frequency oscillator-based toggle technique is proposed. During standby, internal clock paths are connected to an NBTI-oscillator (100kHz in this design) and toggled for NBTI recovery [6]. The measured post stress  $t_{CK}$  variation is within 2.0% using the proposed NBTI solution.

The self-refresh current consists of AC (refresh control, WL and BL charging) and DC (bias generators and transistor leakage) components. The storage capacitor becomes smaller with each new DRAM process, resulting in an increasing AC current owing to the requirement for more frequent refresh operations. In-DRAM ECC can relieve this issue by extending the refresh interval, as ECC corrects errors in the tail cells, which are caused by random telegraph noise or variable retention time [3]. Transistor leakage can be suppressed with well-known power-gating techniques, thereby decreasing DC current. However, the core DRAM transistors, especially the sub-WL driver (SWD) PMOS, suffer from GIDL, and this situation is getting worse due to decreasing device dimensions. The best way to control GIDL is to lower the gate voltage of the SWD PMOS, namely the standby level of the normal wordline enable signal (NWEiB). A simple and compact way of achieving this is to utilize an NMOS  $V_t$  drop, where the NWEiB level is dropped by one  $V_t$  from  $V_{PP}$  because an NMOS power switch is selected during self-refresh, as shown in Fig. 12.2.3. Self-refresh power can be reduced by 18% (i.e., from 0.98 to 0.8mW per 8Gb). For a stable self-refresh exit and active operations, a block-by-block power recovery scheme is implemented such that the block NWEiB levels are restored to  $V_{PP}$  only when they are selected for activation. This proposed method reduces the  $V_{PP}$  peak current from 250 to 0.5mA, which can prevent system failures due to the large surge currents when exiting self-refresh mode.

Figure 12.2.4 shows the adaptive gear-down scheme (DQS, CLK, and CA buffers) based on user scenario. As real systems operate below 1.6Gb/s most of the time, adjusting the buffer current based on the operating speed can be a power-efficient solution. A CA buffer is divided into two signal paths based on the operating speed, with 1.6Gb/s used as the boundary. The high-speed (HF) path is composed of a two-stage amplifier and the low-speed (LF) path is composed of a low-power main amplifier featuring single-ended operation. A DQS buffer consists of an equalizer and a pre-amplifier with current-steering control. The high-speed path, which is enabled beyond 1.6Gb/s, is controlled in four steps: each step is controlled by an additional signal path and current control with a pre-defined scaling ratio to maintain biasing. The low-speed path, in contrast, is implemented with a simple single-ended buffer. The CLK buffer uses the same gear-down approach as the DQS buffer. The on-chip buffer currents of CLK, DQS, and CA are reduced by 73%, 62%, and 39%, respectively, at 1.6Gb/s compared to 4.266Gb/s.

As the operating speed increases beyond 3.2Gb/s, the stability of the DQS divider should be carefully considered. A conventional D-FF-based divider, used to relieve the timing margin for data alignment, can get stuck in an unknown state due to the bandwidth limit of the poly-gate and planar-bulk technology in a DRAM process. To alleviate this issue, a metastable-free DQS aligner in the write path is proposed, as shown in Fig. 12.2.5. In previous designs, specifications require that two preambles are allocated before a write operation and that the data-align edge (DQSB/2) is referenced to a divided DQS signal. However, for the case where both DQS and DQSB are low, including an unknown state before normal operation, it is possible for the DQS divider to become stuck in an unknown state due to ISI. In the proposed scheme, the data-align reference edge is shifted backwards by one clock cycle, without increasing the write latency, and then the DQ data are aligned by DQSB/2 (pseudo 3-preamble), where the polarity is determined by an even/odd phase detector (E/O detector). Consequently, the bandwidth of the DQS divider is improved by 50% and reaches 5Gb/s.

A 16Gb LPDDR4X SDRAM with in-DRAM ECC is fabricated in a 10nm DRAM process. Figure 12.2.6 (top) shows the measured  $t_{CK}$  shmoo after NBTI stressing that emulates 10 years of usage. The NBTI solution achieves a 5Gb/s data rate at 1.1V and improves the low  $V_{DD}$  margin down to 0.98V at 4.266Gb/s. The adaptive gear-down scheme achieves an IO data window of 0.77UI in high-frequency (4.266Gb/s) differential mode and 0.79UI in low-frequency (2.133Gb/s) single-ended mode for CLK and DQS, as shown in Fig. 12.2.6 (bottom). The chip micrograph of the fabricated 8Gb/channel DRAM is shown in Fig. 12.2.7. The chip area is 42.77mm<sup>2</sup>/channel.

### References:

- [1] C.-K. Lee, et al., "A 5Gb/s/pin 8Gb LPDDR4X SDRAM with Power-Isolated LVSTL and Split-Die Architecture with 2-Die ZQ Calibration Scheme," ISSCC, pp. 390-391, 2017.
- [2] N. Kwak, et al., "A 4.8Gb/s/pin 2Gb LPDDR4 SDRAM with Sub-100µA Self-Refresh Current for IoT Applications," ISSCC, pp. 392-393, 2017.
- [3] H.-J. Kwon, et al., "An Extremely Low-Standby-Power 3.733Gb/s/pin 2Gb LPDDR4 SDRAM for Wearable Devices," ISSCC, pp. 394-395, 2017.
- [4] T.-Y. Oh, et al., "A 3.2Gb/s/pin 8Gb 1.0V LPDDR4 SDRAM with integrated ECC engine for sub-1V DRAM Core operation," ISSCC, pp. 430-431, 2014.
- [5] H.-Y. Joo, et al., "A 20nm 9Gb/s/pin 8Gb GDDR5 DRAM with an NBTI Monitor, Jitter Reduction Techniques and Improved Power Distribution," ISSCC, pp. 314-315, 2016.
- [6] G. Chen, et al., "Dynamic NBTI of p-MOS Transistors and its Impact on MOSFET Scaling," IEEE Electron Dev. Letters, vol. 23, no. 12, pp. 734-736, Dec. 2002.



Figure 12.2.1: Top-level block diagram of an 8Gb/1channel LPDDR4X macro.



Figure 12.2.2: NBTI-tolerant circuit solutions utilizing a header-only power-gating, a rising-CLK-only signal generation, and a toggle technique.

Figure 12.2.3: SWD PMOS GIDL reduction scheme featuring an NWEiB 1-V<sub>t</sub> down during self-refresh and block-by-block power recovery.

Figure 12.2.4: Adaptive gear-down scheme featuring a low-power main amplifier for the CA buffer and single-ended circuits for CLK and DQS at low frequency.



Figure 12.2.5: Metastable-free pseudo 3-preamble input data aligner and its timing diagram.

Figure 12.2.6: Measured t<sub>CK</sub> shmoos after NBTI stress (top) and IO shmoos of the adaptive gear-down scheme (bottom).



Figure 12.2.7: Chip micrograph of 8Gb/channel LPDDR4X.

## 12.3 A 1.2V 64Gb 341GB/s HBM2 Stacked DRAM with Spiral Point-to-Point TSV Structure and Improved Bank Group Data Control

Jin Hee Cho, Jihwan Kim, Woo Young Lee, Dong Uk Lee, Tae Kyun Kim, Heat Bit Park, Chunseok Jeong, Myeong-Jae Park, Seung Geun Baek, Seokwoo Choi, Byung Kuk Yoon, Young Jae Choi, Kyo Yun Lee, Daeyong Shim, Jonghoon Oh, Jinkook Kim, Seok-Hee Lee

SK hynix, Gyeonggi, Korea

With the recent increasing interest in big data and artificial intelligence, there is an emerging demand for high-performance memory systems with large density and high data bandwidth. However, conventional DIMM-type memory has difficulty achieving more than 50GB/s due to its limited pin count and signal integrity issues. High-bandwidth memory (HBM) DRAM, with TSV technology and wide IOs, is a prominent solution to this problem, but it still has many limitations, including power consumption and reliability. This paper presents a power-efficient and reliable TSV structure and a cost-effective HBM DRAM core architecture.

HBM's major obstacle for achieving high bandwidth and low power is the heavy capacitive load due to the thousands of TSVs in 8Hi stacks. The previous HBM (with multi-drop TSV) had limited capabilities for managing the 8Hi heavy TSV loading, which is one of the major obstacles for achieving both bandwidth and density. To reduce the heavy loading of a conventional multi-drop TSV structure, a spiral point-to-point (P2P) TSV structure is proposed. Figure 12.3.1 shows a conventional multi-drop TSV structure [1] and the proposed spiral P2P TSV structure. Compared to a multi-drop TSV structure, where each core die has its own TX, RX, and a 4-to-1 MUX, the proposed spiral P2P TSV structure has only three sets of TX and RX for an 8Hi stack. Furthermore, it eliminates the need for a channel selection MUX and its complicated routing. The current consumption to drive TSVs was reduced by 30%, due to the reduced capacitance, and its slew rate was increased from 3.4 to 4.9V/ns as shown in Fig. 12.3.1.

Another challenge for HBM development is achieving high yield for TSVs and micro-bumps ( $\mu$ -bump): the total yield of a stacked chip is obtained by raising the yield of one TSV to the power of the number of TSVs. Furthermore, in the case of an 8Hi-stack HBM, one failed TSV connection causes 9 dies to be discarded. Therefore, a TSV repair technique is essential to compensate for TSV yield. Unlike conventional TSV repair techniques [2] that need test equipment to test for the open/short state of a TSV connection, the proposed automatic TSV self-repair technique, shown in Fig. 12.3.2, performs open/short tests during the boot-up sequence, without the need for test equipment or fuses. To detect a weak TSV connection, a core-side strong PMOS and a base-side NMOS, for which the leakage current is controlled by a bias voltage, are turned on at the same time. The quantized voltage level of the TSV is stored in a latch. A base-side strong PMOS and a core-side NMOS are also turned on to confirm the TSV connectivity. By sequentially reading out the latched results, the locations of TSV failures can be determined. The TSV self-repair operation can then be performed by changing the core die to find the exact positions of failed TSVs. The slice ID signal sent from the base die to the core die is changed using the test mode to make each core die behave like the top slice. The proposed architecture also supports a conventional current scan, without additional circuitry, by jointly using the PMOS as a current source and the DFF as a switch-enable signal shifter. The TSVs used to control the repair operations cannot make use of self-repair; instead, they exist in pairs to ensure robust operation. The repair procedure is performed during the boot-up sequence, so that users do not need to execute post-package repair.
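
The yield relationship stated above can be made concrete with a short calculation. This is a sketch under the assumption that TSV failures are independent; the per-TSV yield and TSV count used below are hypothetical values chosen for illustration and are not figures from the paper.

```python
def stack_yield(single_tsv_yield: float, num_tsvs: int) -> float:
    """Yield of a stack in which every one of num_tsvs TSVs must work."""
    return single_tsv_yield ** num_tsvs


def required_tsv_yield(target_stack_yield: float, num_tsvs: int) -> float:
    """Per-TSV yield needed to reach a target stack yield without repair."""
    return target_stack_yield ** (1.0 / num_tsvs)


# Hypothetical example values (not from the paper).
per_tsv = 0.9999
count = 5000
unrepaired = stack_yield(per_tsv, count)
needed = required_tsv_yield(0.99, count)
print(f"stack yield without repair: {unrepaired:.3f}")
print(f"per-TSV yield needed for a 99% stack yield: {needed:.7f}")
```

Even a very high per-TSV yield compounds into a substantial stack-level loss when thousands of TSVs are involved, which is why a repair mechanism is attractive.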

HBM2 has a pseudo-channel function [1], which cuts the page size in half while doubling the effective number of banks to improve DRAM core timing such as  $t_{FAW}$  (four-active-window) and  $t_{CCDL}$  (column-to-column access timing). Prior work [1] used four channels in a slice, whereas the proposed architecture includes only two channels per core die via a spiral TSV structure. Each channel is divided into two pseudo channels: a pseudo channel has 16 independent banks and 64 IOs. Because of the large number of IO lines, the area overhead of HBM is significant; thus, leveraging increased pre-fetch, which is a common method in conventional DRAM, is restricted. The improved bank group control, shown in Fig. 12.3.3, is proposed to mitigate speed and area penalties. For a 4b pre-fetch operation, each IO has 4 internal data lines, and there should be  $64 \times 4b \times 4$  bank group IOs (BG\_IO) and 256 global IOs (GIO) per pseudo channel. However, the proposed architecture has BG\_IOs with a 2b pre-fetch (feasible due to a relaxed  $t_{CCDL}$ ), and the GIO has a 4b pre-fetch that is divided into left and right to keep the effective line count similar to that of 128 GIOs. Figure 12.3.3 shows the timing diagram of the proposed architecture, where column commands and Y-pulses are divided into even and odd groups by the order of the command input (BL0, 1 and BL2, 3). Therefore, the core die effectively has twice the number of data lines and column addresses, and it has twice the core timing margin without significant area penalty.

The power distribution in the HBM core die is also a challenge because significant power is consumed in a small area. The IR drop in a 3D stack structure causes  $t_{CCDL}$  degradation in the DRAM core operation. To mitigate this issue, additional bank-power TSVs are placed between row decoders: these directly supply power from the base die to power-hungry core areas. Another merit of bank-power TSVs is the ability to share the power distribution network among the core dies, which greatly reduces the peak IR drop for an 8Hi stack, as shown in Fig. 12.3.4. Based on PDN simulation results, the IR drop is reduced by more than 50% in IDD4W (gapless write; the worst pattern for core IR drop) compared to a previous version without bank-power TSVs.

The base die and core dies in HBM2 all have temperature sensors. Memory controllers can read out the 8b of highest temperature code among the core dies, 8b for the base die temperature code, and 1b of catastrophic trip threshold (CATTRIP). Therefore, 4Hi and 8Hi stacks require 36 and 72 TSVs. The proposed serial temperature read-out scheme uses only 2 TSVs, one for the temperature code and the other for the CATTRIP. Figure 12.3.5 shows the core die temperature read-out scheme. The core die temperature codes are shifted, in descending order of core dies, by the shift clock that is generated in base die. The base die stores 8b of code in the CTEMP register, and compares it with the code stored in the CTEMP\_MAX register, which stores the maximum temperature code observed. The CATTRIP scheme is depicted in the right side of Fig. 12.3.5. Since CATTRIP must reflect all information from every core die and base die, all dies share one TSV using wired-OR logic. The base die always turns on a pull-down transistor to drive the CATTRIP  $\mu$ -bump LOW. When any die reaches the limit temperature (e.g. 125°C), it generates a CATTRIP flag to make the TSV HIGH, which is driven onto the CATTRIP  $\mu$ -bump.
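
The read-out behavior described above can be sketched as a small behavioral model. It assumes that the parallel baseline uses 9 TSVs per die (8b code plus 1b CATTRIP), which matches the quoted 36 and 72 TSV counts, and that codes are shifted MSB first; the bit order, the function names, and the example codes are illustrative assumptions rather than details from the paper.

```python
def parallel_tsv_count(num_dies: int, code_bits: int = 8, cattrip_bits: int = 1) -> int:
    """TSVs needed if every die sends its code and CATTRIP flag in parallel."""
    return num_dies * (code_bits + cattrip_bits)


def serial_max_temperature(core_codes: list[int]) -> int:
    """Shift each core-die code over one TSV and track the maximum (CTEMP_MAX)."""
    ctemp_max = 0
    for code in reversed(core_codes):  # descending order of core dies
        ctemp = 0
        for bit_index in range(7, -1, -1):  # MSB first (assumed bit order)
            bit = (code >> bit_index) & 1  # one bit per shift clock
            ctemp = (ctemp << 1) | bit
        ctemp_max = max(ctemp_max, ctemp)  # compare CTEMP with CTEMP_MAX
    return ctemp_max


def cattrip(flags: list[bool]) -> bool:
    """Wired-OR on the shared TSV: any die can raise the flag."""
    return any(flags)


codes = [0x40, 0x55, 0x3A, 0x51]  # hypothetical 8b codes for a 4Hi stack
print(parallel_tsv_count(4), parallel_tsv_count(8))
print(hex(serial_max_temperature(codes)), cattrip([False, False, True, False]))
```

The serial scheme trades read-out latency (one shift clock per bit per die) for a fixed count of two TSVs regardless of stack height.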

Since the PHY  $\mu$ -bump cannot be probed, all tests of the HBM were performed using a direct access ball (DA). However, because of the operational characteristics of the PHY IO and the necessity for system implementation verification, an active interposer package (AIP), depicted in Fig. 12.3.6, is proposed. HBM DA and PHY operation can be verified, and 2-channel interleaving technology can be applied to determine the influence of the independent operations between channels. The AC characteristics of the HBM PHY can be measured: such as the input setup/hold, the 1-pin input setup/hold, and  $t_{DV}$  (data valid window). Since the signal integrity between the controller and the HBM in the SiP is reflected on the AIP, the LFSR/MISR can be operated between the active interposer and the HBM under similar conditions. Test flexibility is improved by applying a serial-test-mode input technique similar to IEEE1500 [3].

Several key technologies have been introduced to address impediments to increasing bandwidth for HBM memories. The spiral-P2P scheme and the TSV self-repair scheme both provide good solutions for managing the heavy 8Hi TSV loading. The improved bank-group data control optimizes area overhead and DRAM core speed. Additional bank-power TSVs reduce the IR drop at the bank side by 50%. The HBM shmoo results, shown in Fig. 12.3.6, show a 341GB/s 8Hi known-good-stack dies (KGSD) gapless-read operation at 1.2V and 105°C, and 320GB/s at 1.15V and 105°C. The chip micrographs for the 8Hi-stacked 8Gb DRAM dies and the base die are shown in Fig. 12.3.7.

### References:

- [1] J. C. Lee, et al., "A 1.2V 64Gb 8-channel 256GB/s HBM DRAM with peripheral-base-die architecture and small-swing technique on heavy load interface," ISSCC, pp. 318-319, 2016.
- [2] D. U. Lee, et al., "An Exact Measurement and Repair Circuit of TSV Connections for 128GB/s High-Bandwidth Memory(HBM) Stacked DRAM," IEEE Symp. VLSI Circuits, 2014.
- [3] JEDEC Standard High Bandwidth Memory (HBM) DRAM Specification, 2015.



Figure 12.3.1: Structure and performance comparison between multi-drop and the spiral P2P TSV structure.



Figure 12.3.2: TSV self-repair scheme and test results.



Figure 12.3.3: Core architecture and improved bank-group data control.



Figure 12.3.4: Power TSVs in the middle of banks and power distribution network simulation results.



Figure 12.3.5: Serial temperature read-out and the catastrophic trip threshold scheme.



Figure 12.3.6: Active interposer package and test shmoo result.



Figure 12.3.7: Chip micrograph and summary table.

## 12.4 A 16Gb/s/pin 8Gb GDDR6 DRAM with Bandwidth Extension Techniques for High-Speed Applications

Kyu-Dong Hwang, Boram Kim, Sang-Yeon Byeon, Kyu-Young Kim, Dae-Han Kwon, Hyun-Bae Lee, Geun-Il Lee, Sang-Sic Yoon, Jin-Youp Cha, Soo-Young Jang, Seung-Hun Lee, Yong-Suk Joo, Gang-Sik Lee, Sung-Soo Xi, Soo-Bin Lim, Kyung-Ho Chu, Joo-Hwan Cho, Junhyun Chun, Jonghoon Oh, Jinkook Kim, Seok-Hee Lee

SK hynix, Gyeonggi, Korea

Recently, the demand for high-bandwidth graphics DRAM for game consoles and graphics cards has dramatically increased due to the development of virtual reality, artificial intelligence, deep learning, autonomous driving cars, etc. These applications require greater data transfer speeds than previous devices, GDDR5 [1] and GDDR5X [2], which are limited to 12Gb/s/pin. This paper introduces an 8Gb GDDR6 operating at up to 16Gb/s/pin. To exceed the prior speed limit, various bandwidth extension techniques are proposed. WCK is driven with a dividing scheme to overcome speed limitations and to reduce power consumption. In addition, a dual-band architecture with different types of nibble drivers is proposed in order to ensure CML-to-CMOS stability in all frequency regions: a CML nibble is used for high speed, while a CMOS nibble is used for low speed. A DC-split scheme is implemented for duty-cycle correction and skew compensation. The bandwidth of the high-frequency divider is extended by using a proposed mode-changed flip-flop. The receiver uses a loop-unrolled one-tap decision-feedback equalizer (DFE) designed to eliminate channel inter-symbol interference (ISI). A two-stage pre-amplifier is also used for bandwidth extension. The transmitter uses a 4:1 multiplexer with a half-rate sampler, which does not require a 1UI pulse and thereby minimizes full-rate operation. To secure on-chip signal transmission characteristics, the bandwidth limitation of transistors in a DRAM process is overcome by adopting an on-chip feedback EQ filter.

A 16Gb/s/pin 8Gb GDDR6 DRAM is implemented using a DRAM process; the interface network is shown in Fig. 12.4.1. In GDDR5's word mode, one pair of WCK/WCKB drives 16 DQs, whereas in GDDR6's byte mode, one pair of WCK/WCKB drives 8 DQs. The maximum WCK frequency of GDDR6 is 8GHz to achieve a 16Gb/s/pin data rate. To increase the WCK distribution power efficiency, its frequency is divided to a lower operating frequency and some of the divided phases are transmitted to each local network; each local network drives DQ arrays. A DQ block consists of a receiver (RX) and a transmitter (TX).

Figure 12.4.2 presents the WCK distribution network in detail. Although the maximum operating frequency is 8GHz, the network should also be able to operate at lower frequencies. To effectively cover this wide range of frequencies, WCK supports a dual mode, which consists of a high-frequency path and a low-frequency path: this is accomplished by using CML-type and CMOS-type nibble drivers for the WCK global lines for high- and low-frequency operation, respectively. The advantage of the CMOS-type buffer is that it can dramatically decrease the WCK distribution power at lower frequencies. Two types of CML-to-CMOS converters are used, one in each mode: (1) an AC-coupled CML-to-CMOS converter is suitable for high-frequency operation, but not for low-frequency operation, as the required coupling capacitor increases rapidly and the circuit may oscillate due to resistive feedback [1]; (2) an amplifier-type CML-to-CMOS converter is used for low-frequency operation. SEL, which is generated by the DRAM control logic, selects between the two modes. CML requires an analog bias voltage, which is controlled by a digital code. As the bias increases, the clock swing improves, but power consumption increases. For efficient power management, the CML nibble and other CML buffers use different bias voltages (multi-bias): if a CML nibble increases its bias to improve clock swing on a global line, the other CML buffers with a relatively small load maintain their bias.

Figure 12.4.3 shows the bandwidth-extension techniques for the WCK receiver and divider. For high-speed operation, the WCK receiver should compensate for duty-cycle distortion, which is generated in the SoC or by the channel. Although a cross-connected capacitor (CC) operates as a duty-cycle corrector (DCC) [1], its excessive DC suppression decreases the DC swing of the divider input, and the divider oscillates. A mode selection scheme between equalizing and duty correction [1] has been used to prevent oscillation, but the high-speed timing margin is degraded due to the different buffer delays before and after WCK2CK training.

In this work, small-sized cross-connected transistors are used to mitigate oscillations by splitting the DC level between the differential signals. The proposed DC-split scheme also corrects skew distortion between WCK and WCKB, like a phase mixer. Since the WCK divider consumes a large current for high-speed operation, large-sized reset transistors are also required. However, a large output load limits the divider's speed. The proposed mode-changed CML flip-flop uses a small-current mode during reset. In this mode, the divider's current is reduced by switching the load and current source. As a result, the size of the reset transistor is reduced and high-speed operation is improved.

Figure 12.4.4 shows the GDDR6 receiver architecture, which is composed of a pre-amplifier and eight samplers. Each of the eight samplers consists of a loop-unrolled, 1-tap DFE. Loop-unrolling is used to relax constraints on the feedback loop delay. The sum of the clock-to-output delay ( $t_{CO}$ ) of the sampler and the setup time of the next FF is less than 1UI (62.5ps). The GDDR6 RX pre-amplifier consists of a pseudo-differential first stage, a modified Cherry-Hooper (CH) second stage, and an active feedback (AF) stage for bandwidth extension [3,4]. Because there is a WCK speed constraint for high-speed testing, a doubler is used to cut the required WCK input frequency in half. In normal operation, an 8GHz WCK is applied. For testing, however, the doubler is used: a 4GHz WCK is applied to WCK0 (BYTE #0) and a 90° phase-shifted WCK is applied to WCK1 (BYTE #1).
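
The idea behind loop unrolling can be illustrated with a small behavioral model: two slicers decide each sample speculatively, one for each possible previous bit, and a multiplexer selects the valid result once the previous decision is known. This is a sketch under simplifying assumptions (a noise-free toy channel with a single post-cursor tap `alpha` that is known exactly); the function names and the channel model are illustrative and do not describe the circuit in the paper.

```python
def loop_unrolled_dfe(samples, alpha, initial_bit=0):
    """1-tap loop-unrolled DFE for a +/-1 signal with post-cursor weight alpha."""
    decisions = []
    prev = initial_bit
    for x in samples:
        # Two speculative slicers, one per possible previous bit.
        d_if_prev_one = 1 if x > +alpha else 0
        d_if_prev_zero = 1 if x > -alpha else 0
        # The mux selects the valid result once the previous bit is resolved.
        prev = d_if_prev_one if prev == 1 else d_if_prev_zero
        decisions.append(prev)
    return decisions


def toy_channel(bits, alpha, initial_bit=0):
    """Toy ISI channel: each symbol is disturbed by alpha times the previous one."""
    out = []
    prev_symbol = 2 * initial_bit - 1
    for b in bits:
        symbol = 2 * b - 1
        out.append(symbol + alpha * prev_symbol)
        prev_symbol = symbol
    return out


bits = [1, 0, 0, 1, 1, 0, 1, 0]
alpha = 1.2  # ISI large enough to defeat a plain zero-threshold slicer
received = toy_channel(bits, alpha)
plain = [1 if x > 0 else 0 for x in received]
recovered = loop_unrolled_dfe(received, alpha)
print("plain slicer correct:", plain == bits)
print("loop-unrolled DFE correct:", recovered == bits)
```

Because both candidate decisions are computed in parallel, only the multiplexer select path remains in the feedback loop, which is what relaxes the loop-delay constraint.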

The TX stage is designed to compensate for on/off-chip ISI and to reduce line loading, as shown in Fig. 12.4.5. To achieve 16Gb/s/pin full-rate operation at the final TX stage, both off-chip and on-chip signaling are important. A half-rate sampler and an on-chip feedback equalizer (EQ) are used to optimize the serialization and high-speed signaling at the GDDR6 target speed. In an existing 4:1 multiplexer [5], a 1UI pulse is generated for data sampling, but it is advantageous to reduce full-rate operation due to the extremely short GDDR6 1UI period (~62.5ps). The implemented half-rate sampler does not need to generate and drive a 1UI pulse, because the half-rate sampled data and the 90° phase-shifted half-rate clock are directly connected to a NAND gate. For on-chip high-speed signaling, on-chip de-emphasis has been used to compensate for ISI on critical nodes [1,5]. This approach is not suitable in this work, because line loading is concentrated on one node. In this work, the existing on-chip de-emphasis is improved by using feedback, where line loading is distributed between the input and output nodes of the signaling path. Moreover, the line loading of critical nodes can be minimized by using a tristate inverter in which the transistors are sized according to their purpose: the transistors for enable and strength control are large, and a small transistor is connected to the critical node to reduce line loading.

Figure 12.4.6 shows the measured shmoo of tDV versus WCK period, with measurements from 32 DQ pins overlapped, together with the TX eye diagram at 16Gb/s/pin. The maximum test data rate is 16Gb/s/pin, due to tester limitations. The maximum tDV is 30ps (0.48UI) at 16Gb/s/pin. A wide tDV is achieved by using a suitable WCK distribution network and several bandwidth extension techniques for WCK, RX and TX.
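
The timing figures quoted in this paper are related by simple arithmetic, sketched below. The only assumption is that data is transferred on both edges of WCK, which is consistent with an 8GHz WCK producing 16Gb/s/pin; the function name is illustrative.

```python
def unit_interval_ps(data_rate_gbps: float) -> float:
    """Unit interval in picoseconds for a given per-pin data rate in Gb/s."""
    return 1000.0 / data_rate_gbps


data_rate_gbps = 16.0
wck_ghz = data_rate_gbps / 2  # assumes data on both WCK edges
ui_ps = unit_interval_ps(data_rate_gbps)
tdv_ps = 30.0  # maximum measured tDV from the text
tdv_ui = tdv_ps / ui_ps
print(f"WCK = {wck_ghz} GHz, 1UI = {ui_ps} ps, tDV = {tdv_ui:.2f} UI")
```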

### References:

- [1] H. Y. Joo, et al., "A 20nm 9Gb/s/pin 8Gb GDDR5 DRAM with an NBTI Monitor, Jitter Reduction Techniques and Improved Power Distribution," *ISSCC*, pp. 314-315, 2016.
- [2] M. Brox, et al., "An 8Gb 12Gb/s/pin GDDR5X DRAM for Cost-Effective High-Performance Applications," *ISSCC*, pp. 388-389, 2017.
- [3] E. M. Cherry and D. E. Hooper, "The design of wide-band transistor feedback amplifier," *Proc. Inst. Elec. Eng.*, vol. 110, no. 2, pp. 375-389, Feb. 1963.
- [4] S. Galal, et al., "10-Gb/s Limiting Amplifier and Laser/Modulator Driver in 0.18-um CMOS Technology," *JSSC*, vol. 38, no. 12, pp. 2138-2146, Dec. 2003.
- [5] S. J. Bae, et al., "A 40nm 7Gb/s/pin Single-ended Transceiver with Jitter and ISI Reduction Techniques for High Speed DRAM Interface," *IEEE Symp. VLSI Circuits*, pp.193-194, 2010.



Figure 12.4.1: Block diagram of the GDDR6 Interface with WCK generator, distribution, RX and TX.



Figure 12.4.2: WCK distribution network with dual-mode and multi-bias schemes.



Figure 12.4.3: WCK receiver with DC-split. WCK divider with mode-changed CML flip-flop.



Figure 12.4.4: One-tap Loop-unrolled DFE and WCK doubler for high-speed test.



Figure 12.4.5: High-speed low-voltage single-ended transmitter.



Figure 12.4.6: tDV-WCK period shmoo and TX eye-diagram at 16Gb/s/pin.



|                      |
|----------------------|
| 22nm CMOS 3 metal    |
| 180 Ball Flip-Chip   |
| 16Gbps/pin           |
| 128M X 64            |
| X32 I/O              |
| 2 Channels           |
| 69.90mm <sup>2</sup> |
| 1.35V                |

Figure 12.4.7: Chip micrograph and summary.

## 12.5 A 16Gb 1.2V 3.2Gb/s/pin DDR4 SDRAM with Improved Power Distribution and Repair Strategy

Seokbo Shim, Sungho Kim, Jooyoung Bae, Keunsik Ko, Eunryeong Lee, Kwidong Kim, Kyeongtae Kim, Sangho Lee, Jinhoon Hyun, Insung Koh, Joonhong Park, Minjeong Kim, Sunhye Shin, Dongha Lee, Yunyoung Lee, Sangah Hyun, Wonjohn Choi, Dain Im, Dongheon Lee, Jieun Jang, Sangho Lee, Junhyun Chun, Jonghoon Oh, Jinkook Kim, Seok-Hee Lee

SK hynix, Icheon, Korea

Advances in silicon technology bring high-performance mobile devices and networks that connect people all over the world. In the meantime, data centers with high computational capabilities boost the prosperity of the social world. Emerging data centers keep requiring higher density memory, with higher data rates for processing large amounts of data. However, the implementation of high density DRAM is hindered by large chip area, causing degradation of the power distribution network (PDN) and higher yield losses due to the higher probability of die defects. This paper presents a 16Gb 3.2Gb/s/pin DDR4 SDRAM that features an improved PDN and a repair strategy. The PDN is reinforced by power pads with regulators in the middle of the bank area and a staggered power-up scheme for 3D stacked (3DS) DRAM. Yield is enhanced by introducing ECC for redundant cell operation and by developing an advanced built-in self-repair scheme that automatically corrects bit-errors at the application level.

Figure 12.5.1 shows the chip architecture with the additional power pads and voltage generator blocks located in the middle of the bank area. The larger cell area of high-capacity DRAM results in longer signal and power lines: i.e. higher resistances and more severe IR drops. Therefore, the far-bank areas suffer from internal power drops, which worsens DRAM timing:  $t_{RCD}$ ,  $t_{AA}$ ,  $t_{WR}$ , etc. This can lead to DRAM malfunction or performance degradation. Extra power pads with regulators are placed in bank area, assuming a flip-chip bonding technology, to prevent the edge of the bank from unacceptable IR drops. In Fig. 12.5.2 the PDN simulation results show a 25% improvement in IR drop, compared to a conventional power pad placement.

A staggered power-up scheme is used to mitigate the in-rush current at power-up of multiple high-density DRAMs in a 3DS package: each DRAM die inside the package is controlled to turn on sequentially. A staggered power-up circuit for a 4-high stack 3DS DRAM is depicted in Fig. 12.5.3. A power-up signal from the base slice (slice-0) goes to the higher slices through a TSV with a counter-based delay. Using the timing control circuit and the fused stack information of each die, each DRAM communicates through the TSV and distributes the 3DS DRAM current consumption over time to reduce the peak power-up current. Figure 12.5.3 also shows the measured inrush current: the staggered power-up scheme reduces the peak power consumption by 17%.
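
The effect of staggering can be illustrated with a toy model in which every die draws an identical decaying inrush pulse. The pulse shape, time constant, and delay below are hypothetical values chosen only to show the mechanism; the model does not attempt to reproduce the measured 17% reduction, which depends on the real current waveforms.

```python
import math


def inrush_current(t: float, start: float, peak: float = 1.0, tau: float = 5.0) -> float:
    """Hypothetical inrush pulse: zero before start, then exponential decay."""
    if t < start:
        return 0.0
    return peak * math.exp(-(t - start) / tau)


def peak_stack_current(num_dies: int, delay: float, steps: int = 200) -> float:
    """Peak of the summed current when die d powers up at time d * delay."""
    return max(
        sum(inrush_current(t, die * delay) for die in range(num_dies))
        for t in range(steps)
    )


simultaneous = peak_stack_current(num_dies=4, delay=0.0)
staggered = peak_stack_current(num_dies=4, delay=10.0)
print(f"simultaneous peak: {simultaneous:.2f}, staggered peak: {staggered:.2f}")
```

In this model the staggered peak approaches the single-die peak as the delay grows relative to the pulse decay time, at the cost of a longer total power-up sequence.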

ECC is employed for the reliable operation of repaired fuse latches [1]. To access the redundant cells, the address information of the repaired cell is stored in registers, called fuse latches. To avoid yield loss, a high-density DRAM must maintain a sufficient number of redundant cells, since the number of defective cells is proportional to the DRAM capacity. Thus, a 16Gb DDR4 requires twice as many redundant cells as an 8Gb DDR4. Figure 12.5.4 illustrates the number of column fuse latches and single-event functional interrupts (SEFI) [2] relative to the increase in DRAM capacity. Stemming from cosmic rays, SEFI increases proportionally to the number of column fuse latches. To solve this problem, additional parity latches and parity and syndrome generators are implemented based on a Hamming code, thereby enhancing the reliability of the column repair circuits shown in Fig. 12.5.4. Because the extra power pads located next to the column decoder area allow a smaller reservoir capacitor, the ECC circuits can be implemented with minimal area overhead. Only column redundancy utilizes ECC, since column repair is performed more frequently than row repair. ECC is performed without any timing overhead since the parity is generated during DRAM initialization after power-up, and the syndrome is generated each time an activate command is issued.
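
A Hamming-code parity and syndrome computation of the kind described above can be sketched as follows. This is a textbook single-error-correcting Hamming code applied to a hypothetical 12-bit repair address; the word width, bit placement, and function names are assumptions for illustration and do not reflect the actual fuse-latch organization in the paper.

```python
def hamming_encode(data_bits):
    """Return a codeword with parity bits at power-of-two positions (1-indexed)."""
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:
        r += 1
    n = m + r
    code = [0] * (n + 1)  # index 0 is unused
    data = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # not a power of two, so it holds a data bit
            code[pos] = next(data)
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity  # generated once, e.g. at initialization
    return code[1:]


def syndrome(codeword):
    """XOR of the positions of all set bits; zero means no single-bit error."""
    s = 0
    for pos, bit in enumerate(codeword, start=1):
        if bit:
            s ^= pos
    return s


def correct(codeword):
    """Flip the bit indicated by the syndrome, if any."""
    fixed = list(codeword)
    s = syndrome(fixed)
    if 0 < s <= len(fixed):
        fixed[s - 1] ^= 1
    return fixed


address = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]  # hypothetical 12-bit repair address
stored = hamming_encode(address)
upset = list(stored)
upset[4] ^= 1  # emulate a single soft error in one latch (position 5)
print(syndrome(stored), syndrome(upset), correct(upset) == stored)
```

Computing the parity once and checking only the syndrome on each access mirrors the split described in the text, where the parity is generated at initialization and the syndrome at every activate command.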

In addition, advanced built-in self-repair (ABISR) is developed to prevent multiple failures during module test, as shown in Fig. 12.5.5. In conventional module-level cell repair, bit failures can only be repaired when the failure addresses and DQs are detected during module-level test [3-5]. Since ECC corrects erroneous bits, the failed DQ information cannot be detected; thus, 25% of modules with failures are de-soldered. The developed ABISR analyzes the failing addresses, classifies the failure by type (bit, full row or column error), selects the appropriate redundant cells and repairs the cells by itself without any failure information. Figure 12.5.5 shows the ABISR block diagram. Test address registers store bank, row and column addresses in real time. Row and column registers store and compare failed bit addresses during test. After failure classification based on the number of failed bits recorded in the fail-address recorder, the failure mode analyzer and failure region searcher repair the failed cells using the appropriate redundant cells. ABISR reduces the number of de-soldered modules at module test by 10%.

The 16Gb DDR4 is fabricated in an 18nm 3-metal layer DRAM process. Figure 12.5.6 shows a chip summary and the measured results. A micrograph of the DRAM die is shown in Fig. 12.5.7. The improved PDN leads to a speed performance over 3.2Gb/s/pin even with an expanded chip size. In addition, the proposed redundant cell control schemes result in over a 1% increase in module yield and enhanced reliability with respect to cosmic-ray induced soft errors.

### References:

- [1] T. J. Dell, "A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory," *IBM Microelectronics Division Whitepaper*, Nov. 1997, <http://www.ece.umd.edu/courses/enee759h.S2003/references/ibm_chipkill.pdf>.
- [2] JEDEC, "Measurement and Reporting of Alpha Particle and Terrestrial Cosmic Ray-Induced Soft Errors in Semiconductor Devices, JESD89A," *JEDEC Solid State Tech. Assoc.*, Aug. 2001, <http://www.jedec.org/download/search/JESD89A.pdf>.
- [3] J.-K. Wee, et al., "A Post-Package Bit-Repair Scheme Using Static Latches With Bipolar-Voltage Programmable Antifuse Circuit for High-Density DRAMs," *JSSC*, vol. 37, no. 2, pp. 251-254, Feb. 2002.
- [4] W. Jeong, et al., "A fast built-in redundancy analysis for memories with optimal repair rate using a line-based search tree," *IEEE TVLSI*, vol. 17, no. 12, pp. 1665-1678, Dec. 2009.
- [5] W. Osamu, et al., "Post-packaging auto repair techniques for fast row cycle embedded DRAM," *Int. Test Conf.*, pp. 1010-1023, 2004.



Figure 12.5.1: 16Gb core architecture with an additional voltage generator featuring a flip-chip power pad.



Figure 12.5.2: PDN simulation results for a voltage generator with flip-chip power pads vs. one without flip-chip power pads.



Figure 12.5.3: Staggered-power-up circuit for reducing 3DS DRAM inrush current.



Figure 12.5.4: SEFI fail rate vs. DRAM column fuse latch and ECC for column repair fuse latches.



Figure 12.5.5: Comparison of module repair process: conventional vs advanced built-in self repair (ABISR). Block diagram for ABISR.



| Parameter                       | Value                 |
|---------------------------------|-----------------------|
| Technology                      | 18 nm CMOS 3 metal    |
| Chip Size                       | 81.28 mm <sup>2</sup> |
| Memory Capacity                 | 16Gb (16 Bank)        |
| PKG Type                        | Flip-chip PKG         |
| High Density PKG                | 3DS-TSV 4-hi (Stack)  |
| Power Supply (VDD / VDDQ / VPP) | 1.2 V / 1.2 V / 2.5 V |
| I/O                             | x4 / x8 / x16         |
| Speed                           | ~ 3.2 Gb/s/pin (DDR)  |

Figure 12.5.6: Measurement results and chip summary table.



Figure 12.5.7: Die micrograph of the fabricated 16Gb DDR chip.

# Session 13 Overview: *Machine Learning and Signal Processing*

## DIGITAL ARCHITECTURES AND SYSTEMS SUBCOMMITTEE



**Session Chair:**  
**Dejan Marković**

*University of California, Los Angeles, Los Angeles, CA*

**Associate Chair:**  
**Masato Motomura**

*Hokkaido University, Sapporo, Japan*

### Subcommittee Chair: **Byeong-Gyu Nam**, Chungnam National University, Daejeon, Korea

Architectures supporting machine learning for embedded perception and cognition are continuing their rapid evolution, inspired by modern data analytics and enabled by the low energy cost of CMOS processing. This makes it feasible to migrate data analytics toward edge and wearable devices. To further support increased requirements for multiuser connectivity and sparse data, multi-user MIMO and compressive reconstruction are also required.

This session covers trends in machine learning and signal processing for improved accuracy of speech, image, and video processing for next-generation mobile/edge and data center devices. The session features programmable accelerators for Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Multi-Layer Perceptron (MLP) algorithms, with digital and mixed-signal processing kernels. The session concludes with a link-adaptive massive MIMO detector and robust compressive-sensing reconstruction processors.

#### INVITED PAPER

1:30 PM

#### 13.1 A Shift Towards Edge Machine-Learning Processing

*O. Temam, Google, Paris, France*

The field of machine learning, especially Deep Neural Networks, is advancing at a breathtaking pace, with new functionalities achieved on a monthly basis. In the span of a few years, close to human-level accuracy has been achieved for simple voice commands, then full speech recognition, speech synthesis, translation, and increasing progress has been achieved in language understanding.



Machine-learning researchers largely acknowledge that the current successes of deep neural networks have been fueled by two evolutions: the availability of a large quantity of data for training, and the availability of high-performance computing at low cost, initially enabled by GPUs. Both advances combined to make training times tractable for large neural-network models.

Beyond GPUs, both the broad application span and high computational cost of deep neural networks have made custom machine-learning hardware economically sensible, and such architectures are currently being developed by many hardware, cloud or startup companies. The number of competing companies and the broad dissemination of knowledge on how to design such hardware should help reduce the cost of neural network computing, and make sophisticated machine learning more accessible in the coming years.

As Moore's Law plateaus, one of the main paths forward is increasing customization for increased efficiency. This trend will paradoxically first arise at the edge (vs. in the data center), where hardware efficiency is most critical. Unfortunately, the need for customization/efficiency runs contrary to the very fast evolution of machine-learning algorithms. Traditional architectural approaches for achieving generality, while great for general-purpose computing, may not be best suited for resolving this tension between efficiency, velocity and generality.

Beyond hardware efficiency challenges, the other key challenge remains access to data. Consumer data privacy, corporate data confidentiality, or even regulatory compliance force a shift towards processing data closer to where it exists, i.e., at the edge. For many applications, doing so also provides useful, if not indispensable, latency, bandwidth and connectivity benefits. It even provides an out-of-the-box way to tackle the economic consequences of a plateauing Moore's Law for fast-growing data-center machine-learning applications.

The talk will go over these different trends and their consequences.



2:00 PM

#### 13.2 QUEST: A 7.49TOPS Multi-Purpose Log-Quantized DNN Inference Engine Stacked on 96MB 3D SRAM Using Inductive-Coupling Technology in 40nm CMOS

*K. Ueyoshi, Hokkaido University, Sapporo, Japan*

In Paper 13.2, Hokkaido University presents a 14.3×8.5mm<sup>2</sup> multi-purpose log-quantized deep neural network (DNN) inference engine stacked on a 96MB 3D SRAM using inductive-coupling technology in 40nm CMOS. The system features 3-cycle 28.8GB/s memory communication and 7.49TOPS peak performance in binary precision at 1.1V, 300MHz, for cutting-edge DNN workloads.



2:30 PM

#### 13.3 UNPU: A 50.6TOPS/W Unified Deep Neural Network Accelerator with 1b-to-16b Fully-Variable Weight Bit-Precision

*J. Lee, KAIST, Daejeon, Korea*

In Paper 13.3, KAIST describes a DNN accelerator with variable bit precision from 1b to 16b. Using a flexible DNN core architecture, look-up-table-based bit-serial processing, and off-chip memory management, the 16mm<sup>2</sup> 65nm chip achieves 50.6TOPS/W energy efficiency for 1b data at 10MHz, 0.66V.



3:15 PM

#### 13.4 A 9.02mW CNN-Stereo-Based Real-Time 3D Hand-Gesture Recognition Processor for Smart Mobile Devices

*S. Choi, KAIST, Daejeon, Korea*

In Paper 13.4, KAIST presents a 3D hand-gesture recognition processor for real-time user interaction in smart mobile devices. With a CNN stereo engine, triple ping-pong buffers, and processor-in-memory techniques, the 16mm<sup>2</sup> 65nm processor achieves real-time 3D hand-gesture recognition with 9.02mW and 4.3mm error at 0.85V, 50MHz.




3:45 PM

#### 13.5 An Always-On 3.8μJ/86% CIFAR-10 Mixed-Signal Binary CNN Processor with All Memory on Chip in 28nm CMOS

*D. Bankman, Stanford University, Stanford, CA*

In Paper 13.5, Stanford University and KU Leuven introduce a mixed-signal binary CNN processor based on near-memory computing. The 2.4×2.4mm<sup>2</sup> 28nm 0.6V processor features 328KB of on-chip SRAM for a 9-layer CNN, data parallelism and parameter re-use, achieving 3.8μJ/classification at 86.05% accuracy on the CIFAR-10 dataset.



4:15 PM

#### 13.6 A 1.8Gb/s 70.6pJ/b 128×16 Link-Adaptive Near-Optimal Massive MIMO Detector in 28nm UTBB-FDSOI

*W. Tang, University of Michigan, Ann Arbor, MI*

In Paper 13.6, the University of Michigan describes a 128×16 massive MIMO detector with link adaptation to meet practical channel conditions with scalable energy. Implemented as a condensed systolic array, the 2mm<sup>2</sup> 28nm FDSOI chip achieves 1.8Gb/s at 70pJ/b, 569MHz and 4.3dB processing gain with channel data obtained from real-life measurements.



4:45 PM

#### 13.7 A 232-to-1996KS/s Robust Compressive-Sensing Reconstruction Engine for Real-Time Physiological Signals Monitoring

*T.-S. Chen, National Taiwan University, Taipei, Taiwan*

In Paper 13.7, National Taiwan University presents a compressive-sensing reconstruction engine with a parallel atom-searching approach to reduce signal distortion due to measurement noise. The 2.93×2.93mm<sup>2</sup> 40nm processor achieves up to 1996KS/s with 93mW power consumption at 0.9V, 67.5MHz.

### 13.2 QUEST: A 7.49TOPS Multi-Purpose Log-Quantized DNN Inference Engine Stacked on 96MB 3D SRAM Using Inductive-Coupling Technology in 40nm CMOS

Kodai Ueyoshi<sup>1</sup>, Kota Ando<sup>1</sup>, Kazutoshi Hirose<sup>1</sup>, Shinya Takamaeda-Yamazaki<sup>1</sup>, Junichiro Kadomoto<sup>2</sup>, Tomoki Miyata<sup>2</sup>, Mototsugu Hamada<sup>2</sup>, Tadahiro Kuroda<sup>2</sup>, Masato Motomura<sup>1</sup>

<sup>1</sup>Hokkaido University, Sapporo, Japan

<sup>2</sup>Keio University, Yokohama, Japan

A key consideration for deep neural network (DNN) inference accelerators is the need for large and high-bandwidth external memories. Although an architectural concept for stacking a DNN accelerator with DRAMs has been proposed previously, long DRAM latency remains problematic and limits the performance [1]. Recent algorithm-level optimizations, such as network pruning and compression, have shown success in reducing the DNN memory size [2]; however, since networks become irregular and sparse, they induce an additional need for agile random accesses to the memory systems.

Figure 13.2.1 illustrates our proposal: stacking a DNN inference engine, QUEST, with multi-vault SRAMs using inductive-coupling die-to-die wireless communication technology, known as a ThruChip Interface (TCI) [3]. Parallel TCI channels placed in a planar manner provide QUEST with multiple independent high-bandwidth access points to the stacked SRAMs. SRAMs can provide random access capability with extremely low latency (an order of magnitude lower than DRAMs), whereas 3D stacking helps SRAMs achieve reasonably large memory capacity. In Fig. 13.2.1, QUEST and 8 SRAMs are TCI-stacked as a single 14.3×8.5 mm<sup>2</sup> 3D module. Power/ground are supplied through TSVs. QUEST has 24 processing cores running at 300MHz, where each core is associated with one 32b-width 4MB SRAM vault. Running at 3.6GHz, a TCI channel (7-Tx/5-Rx coils) realizes 9.6Gb/s/vault, combined 28.8GB/s/module, R/W data bandwidth in a source-synchronous manner. The R/W latency including TCI trip time is 3 cycles, which is uniform over the 8 SRAMs. TSV technology, used commonly for die stacking, is known to experience open-contact failure. In our design, however, since all signal transmissions are conducted by wireless TCI channels, the presented 3D module can limit the usage of TSVs to power/ground grids, where numerous parallel connections negate this concern.
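The capacity and bandwidth figures quoted above can be cross-checked with a few lines of Python. The sketch assumes that each vault transfers one 32b word per 300MHz core cycle, which is an inference from the stated numbers and not an explicit statement in the text; the variable names are illustrative.

```python
# Figures taken from the text.
CORES = 24               # one SRAM vault per core
VAULT_WIDTH_BITS = 32    # vault data width
CORE_CLOCK_MHZ = 300     # core clock frequency
VAULT_SIZE_MB = 4        # capacity per vault

# Assumption: one 32b word moves per core cycle per vault.
per_vault_mbps = VAULT_WIDTH_BITS * CORE_CLOCK_MHZ      # megabits per second
module_mbps = per_vault_mbps * CORES                    # all vaults combined
module_mbytes_per_s = module_mbps // 8                  # megabytes per second
total_capacity_mb = CORES * VAULT_SIZE_MB               # stacked SRAM capacity

if __name__ == "__main__":
    print(per_vault_mbps / 1000, "Gb/s per vault")
    print(module_mbytes_per_s / 1000, "GB/s per module")
    print(total_capacity_mb, "MB of 3D SRAM")
```

Under this assumption the per-vault rate is 9.6Gb/s, the aggregate is 230.4Gb/s, i.e. 28.8GB/s, and the total capacity is 96MB, consistent with the figures in the title and the session overview.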

Figure 13.2.2 shows the overall block diagram of the fabricated QUEST prototype. The 24 cores run in a MIMD-parallel manner, where inter-core communication is handled either with a mesh-structured local link or tree-structured global network. Each core has a micro-programmed sequencer for setting and controlling the PE array. Synchronization among the cores is managed through a synchronization table when needed. Each core also has a DMAC, which issues memory accesses to intra-core memories (shaded) and the stacked SRAM vault in response to intra/inter-core memory requests. The 32×16 PE array features a bit-serial architecture: the PE conducts binary computation in a single cycle, and N-bit log-quantized ones in N cycles (N<5). Weights double-buffered in W\_MEMs are distributed to the PE array in a fully parallel manner, whereas incoming activations also double-buffered in A\_MEMs are broadcast in a row-parallel manner. The ACT unit at the tail of each column applies the bias (shifted-in from B\_MEMs), scaling, and activation function; and then writes the output activations into O\_MEMs.

Unlike other array-structured DNN accelerators, all PEs receive unique weight bits, whereas the PEs in a row receive an identical activation bit, as detailed in Fig. 13.2.3. In a PE column, partial dot products are first generated in parallel in PEs and then simply shifted towards ACT where they are accumulated. The pipelined shifts hide behind the PE-parallel, bit-serial, dot-product computations as shown in the time chart. These mechanisms are key enablers for handling various DNNs on a single homogeneous PE array (whereas [5] and [6] use hybrid cores and hybrid PEs, respectively, for different DNN types): e.g., for a fully connected (FC) layer in a CNN, MLP, or RNN, up to 32 fan-ins for a neuron are mapped onto a PE column at a time and then time multiplexed on the same column. For a convolutional (CONV) layer in CNN, on the other hand, up to 32 input channels are mapped onto a PE column at a time and then time multiplexed. The filter kernel is stored vertically in the W\_MEMs and processed in an element-by-element, kernel-parallel manner.

Figure 13.2.4 presents a log-quantized neural datapath. The log-quantization method [4] is superior to linear quantization in two ways: 1) its “denser the finer” approach allows it to represent weight/activation distributions better, and 2) resource-consuming multiply operations are reduced to additions. Dot-products are computed in PEs by “log” bit-serial addition and “linear” accumulation. ACT accumulates the dot products and adds a bias in “linear”, then applies a scaling/activation function such as ReLU in “log”. The lightweight PE architecture has enabled the dense PE array to be tightly coupled (bit-by-bit) with W\_MEMs and A\_MEMs (Fig. 13.2.2), achieving versatile parallel NN computation (Fig. 13.2.3). Log-quantization inference accuracy is evaluated on AlexNet (for ImageNet) and on LeNet-5 (for MNIST). Log-4 (4b log-quantized) AlexNet shows only marginal degradation compared with FP-32, whereas Log-3 was destructive. Even binary can attain reasonable accuracy for LeNet-5: the performance-accuracy trade-off is also indicated in the figure.
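To make the "multiplications become additions" point concrete, the following Python sketch quantizes values to a sign and a power-of-two exponent, multiplies by adding exponents, and accumulates in the linear domain. The encoding (exponent range, rounding, a separate sign value of 0 for zero) and the function names are assumptions for illustration and are not the bit-level format used in QUEST.

```python
import math


def log_quantize(x, bits=4, max_exp=0):
    """Return (sign, exponent) such that x is approximated by sign * 2**exponent.

    One bit is assumed for the sign and bits-1 bits for the exponent, so
    exponents span [max_exp - (2**(bits-1) - 1), max_exp]. Zero is carried
    as sign 0 for simplicity; a real encoding would fold it into the codes.
    """
    if x == 0:
        return 0, 0
    sign = 1 if x > 0 else -1
    min_exp = max_exp - (2 ** (bits - 1) - 1)
    exp = round(math.log2(abs(x)))
    exp = max(min_exp, min(max_exp, exp))   # clip to the representable range
    return sign, exp


def log_dot(weights, activations, bits=4):
    """Dot product with log-domain multiplies and linear accumulation."""
    acc = 0.0
    for w, a in zip(weights, activations):
        sw, ew = log_quantize(w, bits)
        sa, ea = log_quantize(a, bits)
        # Multiplying two powers of two only needs an exponent addition;
        # the result is converted to linear (a shift in hardware) and summed.
        acc += (sw * sa) * 2.0 ** (ew + ea)
    return acc


if __name__ == "__main__":
    print(log_quantize(0.3))   # nearest power of two to 0.3 is 2**-2
    print(log_dot([0.5, -0.25, 1.0], [1.0, 0.5, 0.25]))
```

Because the representable levels are powers of two, small magnitudes are resolved more finely than large ones, which is the "denser the finer" property mentioned in the text.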

Figure 13.2.5 depicts AlexNet mapped on the QUEST 3D module, where 24 parallel cores process the inference in a layer-by-layer manner, producing/reading intermediate results to/from the SRAMs, respectively. For a CONV layer, an output channel is mapped spatially among the cores so that a “cluster” of cores can share the same input channels. For an FC layer, output neurons are mapped onto all the cores evenly, requiring all-to-all shuffling data-distribution patterns. In both cases, computation in a core must read activations from another core’s SRAM vaults, which are delivered through the TCI channels and the on-chip networks. The accesses are scattered across individual memory spaces, and burst lengths are very short (1 to 4 for this mapping). Fig. 13.2.5 summarizes the performance of the Log-4 AlexNet, which occupies 39% of the 3D SRAM, as well as Log-4 and binary VGG11 (for CIFAR-10). It is shown that for AlexNet, having more than 2.9MB of on-chip memory is crucial for sustaining above 90% effective/peak performance. The 3-cycle short random-access latency of the 3D SRAM, on the other hand, is also indispensable for effective performance: assuming burst memory access with a 30-cycle initial latency, which mimics modern DRAM latency, effective VGG11 performance degrades drastically for the Log-4 and binary cases. Larger DNNs such as ResNet, moreover, require aggressive pruning to fit the limited memory space, where the presented random-access capability of the 3D module will become even more indispensable.

Figure 13.2.6 compares recently reported multi-purpose (CNN/FC/RNN, etc.) DNN accelerators [5] and [6], using LUT-based and linear quantization, respectively, with this work. Those works integrated a limited amount of on-chip SRAM (around 300KB), and did not include external memory for power estimation. QUEST, on the other hand, integrates a large 7.68MB on-chip SRAM (sufficient for AlexNet on-chip buffering) in addition to the 96MB 3D SRAM. It achieves 5× better effective performance on the AlexNet benchmark at 4b precision. Since external memory accesses are responsible for the majority of the power dissipation, and since the 3D SRAM can substantially reduce external memory power in comparison to DRAMs, system-level energy efficiency favors the proposed solution.

Figure 13.2.7 shows a QUEST prototype microphotograph with a specification table. To summarize, QUEST is aimed toward rapidly revolutionizing highly compressed (bit-reduced, pruned, etc.) DNNs with three main architectural features: 1) 3D integration with large capacity/bandwidth yet low-latency random access SRAM, 2) flexible dataflow support in the PE array for CONV/FC and other types of DNN layers, 3) a bit-serial PE architecture for binarized and log-quantized DNN representations.

#### Acknowledgements:

This work was funded by JST ACCEL Grant Number JPMJAC1502, Japan. The authors thank Profs. T. Asai, M. Ikebe, E. Sano, M. Arita from Hokkaido University and the colleagues at UltraMemory Inc. for their invaluable support.

#### References:

- [1] M. Gao, et al., “TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory,” *ACM ASPLOS*, pp. 751-764, 2017.
- [2] A. Parashar, et al., “SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks,” *ACM ISCA*, pp. 27-40, 2017.
- [3] D. Ditzel, et al., “Low-Cost 3D Chip Stacking with ThruChip Wireless Connections,” *IEEE Hot Chips*, pp. 1-37, 2014.
- [4] D. Miyashita, et al., “Convolutional Neural Networks using Logarithmic Data Representation,” *arXiv: 1603.01025 [cs.NE]*, 2016.
- [5] D. Shin, et al., “DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks,” *ISSCC*, pp. 240-241, 2017.
- [6] S. Yin, et al., “A 1.06-to-5.09 TOPS/W Reconfigurable Hybrid-Neural-Network Processor for Deep Learning Applications,” *IEEE Symp. on VLSI Circuits*, 2017.



Figure 13.2.1: QUEST module overview and 3-cycle R/W latency on 3D SRAM.



Figure 13.2.2: Overall architecture of the proposed DNN inference engine (QUEST).



Figure 13.2.3: Detailed computation in a PE column (top). FC/CONV dataflow on a homogeneous PE array (bottom).



Figure 13.2.4: A log-quantized neural datapath in PE and ACT, and accuracy evaluation on realistic networks.



Figure 13.2.5: AlexNet mapped on the QUEST (left). Throughput across different memory capacity/latency (right).

|                                         | ISSCC2017 [5]                                        | VLSIC2017 [6]           | This Work          |
|-----------------------------------------|------------------------------------------------------|-------------------------|--------------------|
| Technology                              | CMOS 65nm                                            | CMOS 65nm LP            | CMOS 40nm LP       |
| Die Area [mm <sup>2</sup> ]             | 4 x 4                                                | 4.4 x 4.4               | 14.3 x 8.5         |
| Target DNN                              | CNN/MLP/RNN                                          | CNN/MLP/RNN             | CNN/MLP/RNN        |
| Precision [bit]                         | 4 - 16                                               | 2 - 16                  | 1 - 4              |
| Quantization                            | LUT-based                                            | Linear                  | Logarithmic        |
| Clock Freq. [MHz]                       | 50 - 200                                             | 100 - 400               | 75 - 330           |
| Supply Voltage [V]                      | 0.77 - 1.2                                           | 0.67 - 1.2              | 0.77 - 1.1         |
| External Memory                         | Not Discussed<br>(Not Included in Power Dissipation) | Not Discussed<br>(Not Included in Power Dissipation) | 3D-Stacked SRAM    |
| Power Dissipation [W]                   | 0.03@0.77V<br>0.28@1.1V                              | 0.04@0.67V<br>0.45@1.2V | 3.3@1.1V           |
| On-chip SRAM [KB]                       | 290                                                  | 349                     | 7,680              |
| Peak Performance [TOPS]                 | 0.3@16b<br>1.2@4b                                    | CONV                    | 0.410@4b           |
| AlexNet@4b Effective Performance [TOPS] | 0.26                                                 | CONV                    | 1.96@4b            |
| AlexNet@4b Top-5 Accuracy               | 0.02                                                 | FC                      | 7.49@1b            |
| VGG11 Effective Perf. [TOPS]            | -                                                    | -                       | 1.78@4b<br>6.52@1b |

Figure 13.2.6: Comparison with state-of-the-art multi-purpose DNN accelerators.



Figure 13.2.7: A microphotograph of the QUEST prototype, along with the chip specification summary.

### 13.3 UNPU: A 50.6TOPS/W Unified Deep Neural Network Accelerator with 1b-to-16b Fully-Variable Weight Bit-Precision

Jinmook Lee, Changhyeon Kim, Sanghoon Kang, Dongjoo Shin, Sangyeob Kim, Hoi-Jun Yoo

KAIST, Daejeon, Korea

Deep neural network (DNN) accelerators [1-3] have been proposed to accelerate deep learning algorithms from face recognition to emotion recognition in mobile or embedded environments [3]. However, most works accelerate only the convolutional layers (CLs) or fully-connected layers (FCLs), and different DNNs, such as those containing recurrent layers (RLs) (useful for emotion recognition), have not been supported in hardware. A combined CNN-RNN accelerator [1], separately optimizing the computation-dominant CLs and the memory-dominant RLs or FCLs, was reported to increase overall performance; however, the number of processing elements (PEs) for CLs and RLs was limited by their area, and consequently performance was suboptimal in scenarios requiring only CLs or only RLs. Although the PEs for RLs can be reconfigured into PEs for CLs or vice versa, only a partial reconfiguration was possible, resulting in marginal performance improvement. Moreover, previous works [1-2] supported a limited set of weight bit-precisions, such as 4b, 8b or 16b. However, lower weight bit-precisions can achieve better throughput and higher energy efficiency, and the optimal bit-precision varies according to different accuracy/performance requirements. Therefore, a unified DNN accelerator with fully-variable weight bit-precision is required for the energy-optimal operation of DNNs within a mobile environment.

In this paper, we present a unified neural processing unit (UNPU) supporting CLs, RLs, and FCLs with fully-variable weight bit-precision from 1b to 16b. As shown in Fig. 13.3.1, the reuse of input features (IFs) is more efficient than the reuse of weights under low weight bit-precision. In addition, the operations of CLs become identical to those of RLs and FCLs when the IFs of the CLs are vectorized into a 1-dimensional vector, so the hardware can be fully shared in the UNPU by IF reuse. Moreover, the lookup-table-based bit-serial PE (LBPE) is implemented for energy-optimal DNN operations with variable weight bit-precisions from 1b to 16b through iterations of 1b weight operations. Furthermore, an aligned feature loader (AFL) minimizes the amount of off-chip memory accesses required to fetch IFs by exploiting the data locality among convolution operations.

Figure 13.3.2 shows the overall architecture of the UNPU. It consists of 4 DNN cores, an aggregation core, a 1D SIMD core, and a RISC controller. All of these components are connected to an on-chip network for communication. Each DNN core has 6 LBPEs, 6 AFLs (64x6), a weight memory (48KB), an instruction decoder and a controller. The LBPE receives aligned IF as an input operand through AFLs and calculates 576 (4×12×12) multiplications in parallel in a bit-serial manner. The partial-sums (Psums) calculated by each DNN core are aggregated to an output feature (OF) in the aggregation core. The 1D SIMD core performs the remaining operations, such as non-linear activation or pooling, and the results are stored in off-chip memory through the external gateways.

Figure 13.3.3 elaborates on workload allocation. For RLs and FCLs, the 1D IF is mapped to the AFLs one-to-one (48x1) and sent to a PE. The weights are loaded from 12 channels of OF (48x12b) to calculate multiple channels of Psums with the same IF. For a CL, IFs distributed over multiple input channels are concatenated into a 1D row vector and loaded into the AFLs, as is done with RLs and FCLs. The weights of CLs are converted into 1D column vectors and then the Psums are calculated by multiplying with the 1D IF row vector. 4 LBPEs in a DNN core calculate the product between 48 pairs of IFs and weights, and each LBPE corresponds to 12 IF-weight pairs. The IF is reused for multiple column vectors from other channels. The Psums from each PE are accumulated by 12 adder trees. The weights are reused among the 6 LBPEs for better energy efficiency. For example, in RLs and FCLs, the 6 different IFs are assigned to 6 LBPEs in parallel with the same weights if batch-wise parallelism is possible. For a CL, the 6 consecutive IFs in the same channel are multiplied with the same weights in 6 different LBPEs in parallel. Peak performance for CLs and RLs (or FCLs) is increased by 1.15x and 13.8x, respectively, compared to [1] owing to the higher compute density of the unified DNN core.
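The vectorization that lets a CL share hardware with RLs and FCLs can be sketched as follows: the IF patch across all input channels is flattened into one row vector, each kernel into one column vector, and every output channel becomes a plain dot product that reuses the same IF vector. The nested-list layout `ifmap[channel][row][col]` and all function names are assumptions for illustration; the sketch ignores the 48-entry and 12-channel tiling of the actual hardware.

```python
def flatten_patch(ifmap, y, x, k):
    """Concatenate the k x k patch at (y, x) over all input channels into a 1D row vector."""
    return [ifmap[c][y + dy][x + dx]
            for c in range(len(ifmap))
            for dy in range(k)
            for dx in range(k)]


def flatten_kernel(kernel):
    """Convert one kernel[channel][row][col] into a 1D column vector in the same order."""
    return [w for channel in kernel for row in channel for w in row]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def conv_outputs_at(ifmap, kernels, y, x):
    """One output pixel for every output channel, computed like an FC layer.

    The IF row vector is built once and reused for every kernel,
    mirroring the IF reuse described in the text.
    """
    k = len(kernels[0][0])
    if_row = flatten_patch(ifmap, y, x, k)
    return [dot(if_row, flatten_kernel(kern)) for kern in kernels]


if __name__ == "__main__":
    ifmap = [
        [[1, 2, 3], [4, 5, 6], [7, 8, 9]],   # input channel 0
        [[1, 0, 1], [0, 1, 0], [1, 0, 1]],   # input channel 1
    ]
    kernels = [
        [[[1, 0], [0, 1]], [[1, 1], [1, 1]]],  # one 2-channel 2x2 kernel
    ]
    print(conv_outputs_at(ifmap, kernels, 0, 0))
```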

Figure 13.3.4 shows the architecture of the LBPE. The key idea of the LBPE is that partial-sums are repeatedly calculated during the weight bit-serial MAC operation. An LBPE consists of 4 PE clusters, adder trees to accumulate the results of each PE cluster, and shift-and-add logic for bit-serial multiplications. Each PE cluster contains 4 look-up-table (LUT) modules and a controller that determines whether the value from LUTs is added or subtracted. In the LUT module, a table with 8 entries is used, supporting 3-way MAC for multi-bit multiplication and 4-way MAC for 1b multiplication. The LUT is updated after IFs load into the AFLs, and IF values are reused for all output channels of the layer currently being processed. The 1b weight Psums are fetched from the LUT prepared in advance and accumulated for MAC operation. The LUT can fetch 12 Psums in parallel so that a total of 48x12 Psums (64x12 for 1b case) can be calculated simultaneously on an LBPE in 1 cycle. With the help of table-based operations, the LBPE improves energy efficiency more than conventional bit-serial PEs [4]. When IFs are reused 1024 times, the energy-consumption of LBPEs, including the LUT update, is reduced by 23.1%, 27.2%, 41.0%, and 53.6%, for the case of 16b, 8b, 4b, and 1b weight operations, respectively, compared with fixed-point MAC units under the same throughput conditions.
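A minimal Python sketch of the table-based bit-serial idea is given below for the 3-way multi-bit case. It assumes that each of the 8 table entries holds the precomputed sum of one subset of 3 IFs and that weights are two's-complement integers, so the most significant bit-plane is subtracted; these details and the function names are assumptions for illustration, and the 4-way 1b mode is not modeled.

```python
def build_lut(ifs):
    """8-entry table: entry i holds the sum of the IFs selected by the bits of i."""
    assert len(ifs) == 3
    return [sum(f for k, f in enumerate(ifs) if (i >> k) & 1) for i in range(8)]


def lut_bit_serial_mac(ifs, weights, bits):
    """Compute sum(ifs[k] * weights[k]) one weight bit-plane at a time.

    weights are signed integers representable in `bits`-wide two's
    complement (bits >= 2 assumed). Each iteration looks up the partial
    sum for one bit-plane instead of multiplying.
    """
    lut = build_lut(ifs)            # built once per IF load, then reused
    mask = (1 << bits) - 1
    encoded = [w & mask for w in weights]
    acc = 0
    for j in range(bits):
        # The j-th bits of the three weights form the table index.
        idx = sum(((encoded[k] >> j) & 1) << k for k in range(3))
        partial = lut[idx] << j     # shift-and-add step
        if j == bits - 1:
            acc -= partial          # sign bit-plane has negative weight
        else:
            acc += partial
    return acc


if __name__ == "__main__":
    ifs = [3, -5, 7]
    weights = [2, -3, 1]
    print(lut_bit_serial_mac(ifs, weights, bits=4))   # same as 3*2 + (-5)*(-3) + 7*1
```

Since the table depends only on the IFs, it can be reused across all output channels and all bit-planes, which is where the energy saving over per-bit multiplication would come from in this simplified view.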

Figure 13.3.5 explains the AFL. 6 AFL-LBPE pairs are integrated in a DNN core and each AFL has 64 entries. The data in the AFL can be shifted diagonally across AFL boundaries, as well as shifted inside the AFL itself. In the case of CLs with 3x3 kernels and stride 1, 8 entries of IF from Ch. 1 are loaded on AFL 0 at first. At the next cycle, the 7 top entries of AFL 0, except the top-most entry, are shifted diagonally to AFL 1, while the 8 entries from Ch. 2 are concatenated below the remaining 3 entries of Ch. 1 on AFL 0. And then, 8 entries from Ch. 3 are concatenated below the remaining 3 entries of Ch. 2 on AFL 0, while 6 entries on AFL 1 from Ch. 1 are shifted diagonally to AFL 2, and 7 entries from Ch. 2 are shifted diagonally to concatenate below the remaining 3 entries from Ch. 1 on AFL 1. Iterations of diagonal shifts allocate a 3x3 kernel to each AFL or LBPE so that parallel multiplication is possible to accelerate convolution. Varied stride sizes are supported via the application of multiple shifts. The AFL keeps the PE utilization high, unlike an architecture that moves data only between PEs. In addition, it can skip zeros by an upward-shift within the buffer. When the AFL is applied to AlexNet and VGG-16, external memory access operations for IF load are reduced by 57.2% and 55%, respectively.

Figure 13.3.6 shows measurement results for the fabricated UNPU. The UNPU can operate at 0.63-to-1.1V supply voltage with a maximum 200MHz clock frequency. The power consumption at 0.63V and 1.1V is 3.2mW and 297mW, respectively. The power efficiency, as measured on CLs (5x5 kernels) with consideration of PE utilization, is 3.08, 11.6, and 50.6TOPS/W for the case of 16b, 4b, and 1b weights, respectively. The architecture supports any weight bit-precision from 1b to 16b for optimal DNN operation and shows 1.43x higher power efficiency for CLs at 4b weight compared to [1]. When operating on a 1b weight network, it achieves 8.43x higher efficiency and 7.4x higher peak performance as compared to [6].

The UNPU is fabricated using 65nm CMOS technology and occupies 16mm<sup>2</sup> die area, as shown in Fig. 13.3.7. The UNPU has been demonstrated successfully on facial expression recognition and dialogue generation tasks with the FER2013 and the Twitter dialogue database for human-computer interaction, respectively.

#### References:

- [1] D. Shin, et al., "DNPU: An 8.1 TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks," *ISSCC*, pp. 240-241, 2017.
- [2] B. Moons, et al., "Envision: A 0.26-to-10TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable Convolutional Neural Network Processor in 28nm FDSOI," *ISSCC*, pp. 246-247, 2017.
- [3] K. Bong, et al., "A 0.62mW Ultra-Low-Power Convolutional-Neural-Network Face-Recognition Processor and a CIS Integrated with Always-On Haar-Like Face Detector," *ISSCC*, pp. 248-249, 2017.
- [4] P. Judd, et al., "Stripes: Bit-serial Deep Neural Network Computing," *IEEE Computer Architecture Letters*, vol. 16, no. 1, pp. 80-83, Jan.-June 2017.
- [5] S. Yin, et al., "A 1.06-to-5.09 TOPS/W Reconfigurable Hybrid-Neural-Network Processor for Deep Learning Applications," *IEEE Symp. VLSI Circuits*, 2017.
- [6] K. Ando, et al., "BRein memory: A 13-Layer 4.2 K Neuron/0.8 M Synapse Binary/Ternary Reconfigurable In-Memory Deep Neural Network Accelerator in 65 nm CMOS," *IEEE Symp. VLSI Circuits*, 2017.



Figure 13.3.1: Fully reconfigurable unified DNN accelerator with bit-serial PEs.



Figure 13.3.2: Overall architecture.



Figure 13.3.3: Workload allocation on the unified DNN core.



Figure 13.3.4: LUT-based bit-serial processing elements.



Figure 13.3.5: Aligned feature loader for reduction of off-chip memory accesses.



Figure 13.3.6: Measurement results and performance comparison table.



Figure 13.3.7: Chip micrograph and performance summary.

### 13.4 A 9.02mW CNN-Stereo-Based Real-Time 3D Hand-Gesture Recognition Processor for Smart Mobile Devices

Sungpill Choi, Jinsu Lee, Kyuho Lee, Hoi-Jun Yoo

KAIST, Daejeon, Korea

Recently, 3D hand-gesture recognition (HGR) has become an important feature in smart mobile devices, such as head-mounted displays (HMDs) or smartphones for AR/VR applications. A 3D HGR system in Fig. 13.4.1 enables users to interact with virtual 3D objects using depth sensing and hand tracking. However, a previous 3D HGR system, such as HoloLens [1], utilized a power-consuming time-of-flight (ToF) depth sensor (>2W), limiting 3D HGR operation to less than 3 hours. Even though stereo matching was used instead of ToF for depth sensing with low power consumption [2], it could not provide interaction with virtual 3D objects because depth information was used only for hand segmentation. The HGR-based UI system in smart mobile devices, such as HMDs, must consume low power (<10mW) while maintaining real-time operation (<33.3ms). A convolutional neural network (CNN) can be adopted to enhance the accuracy of the low-power stereo matching. The CNN-based HGR system comprises two 6-layer CNNs (stereo) without any pooling layers to preserve geometrical information and an iterative-closest-point/particle-swarm optimization-based (ICP-PSO) hand tracking to acquire 3D coordinates of a user's fingertips and palm from the hand depth. The CNN learns the skin color and texture to detect the hand accurately, comparable to ToF, in the low-power stereo matching system irrespective of variations in external conditions [3]. However, it requires >1000 more MAC operations than previous feature-based stereo depth sensing, which is difficult in real-time with a mobile CPU, and therefore, a dedicated low-power CNN-based stereo matching SoC is required.

In this paper, we describe an accurate, low power (<10mW), and real-time (<33.3ms) 3D HGR processor for smart mobile devices with 3 key features: 1) a pipelined CNN processing element (PE) with a shift MAC operation for high throughput by maximizing core utilization; 2) triple ping-pong buffers with workload balancing for fast line streaming by reducing external accesses; and 3) nearest-neighbor searching (NNS) processing-in-memory (PIM) for high energy efficiency by reducing the number of bitlines requiring pre-charge in SRAM.

Figure 13.4.2 shows the overall architecture of the HGR processor that consists of a CNN-stereo engine (CSE) and an ICP-PSO engine (IPE). The CSE contains two line-streaming CNN cores with 4 locally distributed memories and one matching core. The CNN core has one pipelined CNN PE and a local DMA with a forwarding/backwarding (FWD/BWD) unit to balance workloads between the CNN cores. The IPE consists of a NNS unit with 16-way parallel NNS PIMs and a hand-tracking unit.

Figure 13.4.3 shows the pipelined CNN PE architecture with shift MAC operation, which performs 1D convolution, for a line-streaming CNN. The entire 2-D convolution is performed by repetition of shift MAC. The shift MAC operation with a 3x3 filter in Fig. 13.4.3 consists of three stages: 1) shifting feature maps and filters, 2) element-wise multiplication, and 3) partial-sum accumulation. First, input feature maps and filters are loaded into shift registers and both are shifted by 1-index in every clock cycle. Then, the active weights of each channel are multiplied with active features element-by-element. Finally, multiplication results are accumulated to obtain 1D convolution results in 3 cycles. The line-streaming CNN operation is accelerated by the 7-stage pipelined CNN PE that processes 48 MACs per cycle with 96% core utilization. Moreover, the pipelined architecture enables line-streaming processing, as well as memory access latency hiding to achieve 1.8TOPS/W, 60MHz at 0.9V.
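The three shift MAC stages described above can be sketched in software as follows. This is a minimal functional model, not the PE hardware: the function names are hypothetical, one loop iteration stands in for one clock cycle, and the 48-MAC parallelism and 7-stage pipelining are not modeled.

```python
import numpy as np

def shift_mac_1d(features, weights):
    """1D convolution (correlation form) of one feature row with one filter row.

    Each loop iteration models one cycle: the feature row is shifted by one
    index, multiplied element-wise by the active weight, and accumulated.
    """
    taps = len(weights)
    out_len = len(features) - taps + 1
    acc = np.zeros(out_len)
    for cycle in range(taps):                       # 3 cycles for a 3-tap row
        shifted = features[cycle:cycle + out_len]   # stage 1: shift by one index
        product = shifted * weights[cycle]          # stage 2: element-wise multiply
        acc += product                              # stage 3: partial-sum accumulation
    return acc

def conv2d_by_shift_mac(image, kernel):
    """2D convolution obtained by repeating the 1D shift MAC over kernel rows."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for k in range(kh):
            # Output line r needs input lines r .. r+kh-1 (line streaming).
            out[r] += shift_mac_1d(image[r + k], kernel[k])
    return out
```

Because each output line depends only on `kh` consecutive input lines, the same loop structure illustrates why only a few feature-map lines need to be resident at a time.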

Memory management of the CNN core is shown in Fig. 13.4.4. The hardware utilizes triple ping-pong memories to store feature maps, where the three memories are accessed simultaneously to feed pipeline inputs, write back pipeline outputs, and serve the external interface, respectively. Instead of storing the entire feature maps on the chip, the line-streaming processing with only 3-to-5 lines of feature maps reduces the data that must be transferred to and from off-chip memory by 90.1%. As a result, the triple ping-pong operation hides the external access time behind CNN computation, and the hand tracking system does not need external accesses to fetch intermediate feature maps. In addition, the FWD/BWD unit balances workloads between two CNN cores automatically. As shown in the top-right of Fig. 13.4.4, data in each core becomes unbalanced due to the reduced size of feature maps after convolution, especially in a distributed memory architecture. The data transaction time between CNN cores must be well defined to balance their workloads [4]. The FWD/BWD units keep CNN core workloads identical throughout CNN processing and, as shown in Fig. 13.4.4, exchange feature-map boundary data with one another when local feature maps are fetched. Moreover, the internal data transaction time for workload balancing can be hidden behind the CNN pipeline. As a result, the triple ping-pong buffers with the FWD/BWD unit reduce overall CNN processing time by 23.9%.

Figure 13.4.5 shows the PIM architecture specialized for NNS to track a user's hands in the IPE. Hand tracking requires >360K-node k-d tree NNS between the 46-sphere model in the memory and the depth input from the CSE. In the proposed PIM, an NNS operation is composed of 2 half cycles: it performs NNS on a parent in the first half cycle, together with 1 read operation for fetching the next parent address, and on a selected child in the remaining half cycle. The proposed PIM is composed of 4 cell arrays (CAs) for a parent node, two child nodes, and a next searching address. Each CA has 36x16 8T-SRAM cells and the bitlines are separated into read bitlines (RBLs) and write bitlines (WBLs). The 3 CAs for a parent and 2 child nodes contain ripple-carry comparators that output $C_{OUT} = \text{Sign}(WBL-RBL) \text{ OR } C_{IN}$ to the comparison bitline (CBL). A parent CBL is connected to an address decoder to activate the selected child CA by changing the LSB of the address decoder in the latter half cycle of NNS. The proposed PIM achieves a 6x speed-up compared with the conventional SRAM design, which requires 6 operations to complete NNS. Moreover, it can skip read operations by 1/3 so that redundant pre-charging power consumption on global bitlines can be reduced by 63.9% (2.8x energy-efficiency enhancement).
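For readers unfamiliar with k-d tree NNS, the following is a plain software sketch of the search that the PIM accelerates: compare against a parent node, then descend into the selected child. It is not a model of the cell arrays or the half-cycle timing; the dictionary-based tree layout and the function names are assumptions made for illustration.

```python
import numpy as np

def build_kdtree(points, depth=0):
    """Recursively build a k-d tree as nested dicts (median split per axis)."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    points = points[np.argsort(points[:, axis])]
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Exact nearest-neighbor search returning (squared distance, point)."""
    if node is None:
        return best
    dist = np.sum((node["point"] - query) ** 2)
    if best is None or dist < best[0]:
        best = (dist, node["point"])
    # Parent comparison selects which child to search first.
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Visit the other child only if the splitting plane is closer than the best match.
    if diff ** 2 < best[0]:
        best = nearest(far, query, best)
    return best
```

The comparison `diff < 0` plays the role of the sign comparison that selects a child; the pruning test on the far child is the software counterpart of skipping unnecessary reads.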

Measurement results of hand-depth sensing and hand tracking in the 3D HGR processor are shown in Fig. 13.4.6. Thanks to the CNN stereo, the processor can acquire accurate hand depth showing distinguishable depth of fingertips and low disparity error. In 20cm-to-40cm active range, the average hand tracking accuracy is 4.3mm with 5cm separated VGA stereo cameras, and it achieves mm-scale accuracy of 3D HGR. In addition, the proposed processor consumes 9.02mW @ 50MHz, 0.85V for real-time 1-hand 3D HGR, which is 14x less than the state-of-the-art UI processor [2]. As a result, the 3D HGR processor can satisfy the required power budget (<10mW) with 30ms latency. Moreover, the CNN-stereo engine achieves 1.8TOPS/W which is 1.45x more energy efficient than a state-of-the-art distributed memory architecture [4]. The line-streaming CNN architecture uses 781.5KB on-chip memory, while [4-6] needed ~MB memory, which is impossible to realize as on-chip memory.
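As a quick arithmetic check of the figures quoted above, assuming that the 14x comparison uses the 126.1mW stated in the title of [2] and that real-time operation means 30 frames/s (both are assumptions, not stated explicitly in the text):

$$\frac{126.1\,\text{mW}}{9.02\,\text{mW}} \approx 14.0, \qquad \frac{1}{30\,\text{frames/s}} \approx 33.3\,\text{ms} > 30\,\text{ms}$$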

The 3D HGR processor for smart mobile devices is fabricated in 65nm CMOS technology, and it occupies 4x4mm<sup>2</sup> integrating 781.5KB of SRAM. Its maximum and average hand tracking errors are 10.6mm and 4.3mm, respectively, whereas that of a ToF system is ~5mm on average. The highly accurate 3D HGR processor consumes only 9.02mW with 30ms system latency.

#### References:

- [1] Hololens Hardware detail. Available: https://developer.microsoft.com/en-us/windows/mixed-reality/hololens_hardware_details
- [2] S. Park, et al., "A 126.1mW Real-Time Natural UI/UX Processor with Embedded Deep-Learning Core for Low-Power Smart Glasses," *ISSCC*, pp. 254-255, 2016.
- [3] W. Luo, et al., "Efficient Deep Learning for Stereo Matching," *CVPR*, pp. 5695-5703, 2016.
- [4] K. Bong, et al., "A 0.62mW Ultra-Low-Power Convolutional-Neural-Network Face-Recognition Processor and a CIS Integrated with Always-On Haar-Like Face Detector," *ISSCC*, pp. 248-249, 2017.
- [5] Y. H. Chen, et al., "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," *ISSCC*, pp. 262-263, 2016.
- [6] D. Shin, et al., "DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks," *ISSCC*, pp. 240-241, 2017.



Figure 13.4.1: 3D hand-gesture recognition in mobile smart devices.



Figure 13.4.2: Overall architecture.



Figure 13.4.3: Pipelined CNN PE architecture with shift MAC.



Figure 13.4.4: Triple ping-pong buffer architecture with workload balancing.



Figure 13.4.5: Nearest-neighbor searching processing-in-memory (PIM).



Figure 13.4.6: Measurement results and comparison table.



Figure 13.4.7: Chip photograph and performance summary.

### 13.5 An Always-On 3.8 $\mu$ J/86% CIFAR-10 Mixed-Signal Binary CNN Processor with All Memory on Chip in 28nm CMOS

Daniel Bankman<sup>1</sup>, Lita Yang<sup>1</sup>, Bert Moons<sup>2</sup>, Marian Verhelst<sup>2</sup>, Boris Murmann<sup>1</sup>

<sup>1</sup>Stanford University, Stanford, CA; <sup>2</sup>KU Leuven, Leuven, Belgium

The trend of pushing deep learning from cloud to edge due to concerns of latency, bandwidth, and privacy has created demand for low-energy deep convolutional neural networks (CNNs). The single-layer classifier in [1] achieves sub-nJ operation, but is limited to moderate accuracy on low-complexity tasks (90% on MNIST). Larger CNN chips provide dataflow computing for high-complexity tasks (AlexNet) at mJ energy [2], but edge deployment remains a challenge due to off-chip DRAM access energy. This paper describes a mixed-signal binary CNN processor that performs image classification of moderate complexity (86% on CIFAR-10) and employs near-memory computing to achieve a classification energy of 3.8 $\mu$ J, a 40x improvement over TrueNorth [3]. We accomplish this using (1) the BinaryNet algorithm for CNNs with weights and activations constrained to +1/-1 [4], which drastically simplifies multiplications (XNOR) and allows integrating all memory on-chip; (2) an energy-efficient switched-capacitor (SC) neuron that addresses BinaryNet's challenge of wide vector summation; (3) architectural parallelism, parameter reuse, and locality.

Figure 13.5.1 illustrates the function and network topology of our design. By enforcing structural regularity, we allow the physical architecture to maximally exploit the locality of the CNN algorithm. Each CNN layer carries out a multi-channel, multi-filter convolution. The number of filters in each convolutional layer is restricted to 256, the filter size is 2x2, and the number of channels is 256. The circuit benefits of this regularity are short wires and arrayed, low fan-out demultiplexers, which minimize path loading between memory and logic.

Figure 13.5.2 shows the top-level architecture, which supports up to 9 layers with a customized instruction set for input-output actions, CNN and fully-connected (FC) layers. The processor reads an RGB image, converts the channels to 85-level thermometer codes, and stacks them into a 256-channel image as the CNN input. At the output, an FC layer digitally computes the 4b class label. For a CNN layer, east and west SRAM banks alternate roles between input and output in a ping-pong fashion. These SRAM banks are 256b wide, each word representing a 256-channel pixel. Computation of a filter is completed inside a neuron, eliminating partial sums. The weights are transferred from SRAM to local neuron memory (latches) and reused, while the filter traverses the image. A data-parallel array of 64 neurons processes a patch of the input image, amortizing the input image SRAM read energy per filter computation by 64x. The input DEMUX block interfaces between SRAM (which loads a pixel) and the neuron array (which receives a patch). For the FC layer, weights are loaded from a separate SRAM bank 64 channels at a time, and the multiply-accumulate operation is performed sequentially in the digital domain.
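The input conversion described above (three color channels, each expanded to an 85-level thermometer code and stacked into 256 channels) can be sketched as follows. The text does not give the thresholds or say how the 256th channel is filled, so the uniform threshold spacing and the constant padding channel below are assumptions, as is the function name.

```python
import numpy as np

LEVELS = 85  # thermometer bits per color channel (from the text)

def thermometer_encode(rgb_image):
    """Map an (H, W, 3) uint8 image to an (H, W, 256) array of +1/-1 values.

    Assumptions (not from the paper): thresholds are spaced every 3 codes
    (255 / 85), and the unused 256th channel is padded with -1.
    """
    h, w, _ = rgb_image.shape
    thresholds = 3 * np.arange(LEVELS) + 1             # 1, 4, ..., 253
    bits = rgb_image[..., None] >= thresholds          # (H, W, 3, 85) booleans
    bits = bits.reshape(h, w, 3 * LEVELS)              # 255 channels: R, then G, then B
    pad = np.zeros((h, w, 1), dtype=bool)              # 256th channel
    stacked = np.concatenate([bits, pad], axis=-1)
    return np.where(stacked, 1, -1).astype(np.int8)
```

A thermometer code keeps intensity ordering visible to a binary network: a brighter pixel sets a superset of the bits set by a darker one.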

Figure 13.5.3 shows how locality translates to reduced loading. The input DEMUX is an array of 1-to-4 de-multiplexers with output registers. Each pixel of the input image can be reused in the processing of two overlapping patches, amortizing the input image SRAM read energy per filter computation by 2x. A 2-by-2 crossbar interchanges pixel pairs at the neuron array input. Filter weights are transferred over a 4b per neuron bus, split into north and south halves to reduce the loading of weight transfers by 2x. To minimize neuron array to memory wiring, each neuron writes to the same 4 output channels (1 per filter group) in each CNN layer, allowing implementation of the output DEMUX block as an array of 1-to-4 de-multiplexers. Max pooling occurs incrementally during convolution by first reading a bit in the output image SRAM, and then writing back its logical OR with a neuron output.
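The incremental max pooling described above relies on the fact that, for binary activations, the maximum over a window equals the logical OR of its bits. A minimal sketch, assuming +1 is stored as bit 1 and -1 as bit 0, and assuming a 2x2 pooling window (the text does not state the window size):

```python
import numpy as np

def incremental_max_pool(neuron_bits, pool=2):
    """Max-pool an (H, W) array of 0/1 activations with a read-OR-write loop."""
    h, w = neuron_bits.shape
    out = np.zeros((h // pool, w // pool), dtype=np.uint8)   # output memory cleared
    for y in range(h - h % pool):
        for x in range(w - w % pool):
            stored = out[y // pool, x // pool]                        # read stored bit
            out[y // pool, x // pool] = stored | neuron_bits[y, x]    # write back the OR
    return out
```

Each neuron output is folded into the stored bit as soon as it is produced, so no separate pooling pass over the output image is needed.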

Figure 13.5.4 shows the neuron schematic, which computes the weighted sum of a filter with a patch of the input image. With memory energy amortized by parallelism and reuse, and multiplication reduced to XNOR, high-fan-in addition becomes the main bottleneck. However, in the employed SC neuron, the energy cost of addition is reduced by the small voltage swing at the charge conservation node. In contrast, a digital adder tree would involve rail-to-rail swings along its stages and exhibits a larger amount of switched capacitance. The neuron's dominant noise source is the comparator, but its energy cost is amortized over 1024 weights and the CNN can tolerate some noise. As a result, the SC neuron is amenable to low-voltage operation, and uses a combined 0.6V digital supply/analog reference and a 0.8V comparator supply. Because the SC neuron performs a weighted sum with data-dependent switching (apart from the comparator), its energy scales with activity, like static CMOS. The SC neuron uses a capacitive DAC (CDAC) with four sections: a 1024b thermometer section for applying a filter, a binary-weighted section for the neuron's bias, a threshold section (comparator), and a common-mode (CM) setting section to compensate for parasitics at the charge conservation node. Comparator offset is digitized using calibration at startup, stored in a local register, and subtracted from the bias loaded from SRAM during weight transfer. In environments where large temperature changes may induce significant offset drift, calibration can be performed periodically (e.g. once per second) at negligible cost in average energy per classification and throughput. Behavioral Monte Carlo simulations were run to determine the amount of comparator noise, offset, and unit-capacitor mismatch that the CNN can tolerate without degradation in classification accuracy, resulting in a comparator designed for 4.6mV offset and a 1fF unit capacitor. Because the voltage representing the weighted sum is developed at the charge conservation node, top and bottom plate parasitics do not affect linearity. During convolution, the CDAC is periodically cleared (sampling 0V) as required by leakage at the top plates. To prevent drawing excessive charge from the supply, the unit-capacitor bottom plate nodes are discharged by asserting CLR before the top plate is discharged via CLR<sub>e</sub>. To prevent asymmetric charge injection, the top plate switches are opened before the bottom plate voltages resume their values set by filter weights, image inputs, and biases.
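Functionally, the neuron evaluates the sign of a 1024-term weighted sum plus a bias, with each product reduced to an XNOR. The following is an idealized digital reference for that function only; it does not model charge redistribution, noise, offset, or mismatch, and the tie-breaking at zero is an assumption.

```python
import numpy as np

def binary_neuron(weights, inputs, bias=0):
    """Return sign(sum(w_i * x_i) + bias) for +1/-1 vectors, via XNOR and popcount."""
    w_bits = weights > 0
    x_bits = inputs > 0
    matches = np.count_nonzero(~(w_bits ^ x_bits))   # XNOR, then popcount
    weighted_sum = 2 * matches - len(weights)        # matches minus mismatches
    # Assumption: a sum of exactly zero maps to +1.
    return 1 if weighted_sum + bias >= 0 else -1
```

With a 2x2 filter over 256 channels, `weights` and `inputs` each have 2 x 2 x 256 = 1024 entries, which matches the width of the thermometer section of the CDAC.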

Figure 13.5.5 shows the measured results at room temperature. Ten different chips were measured to evaluate the accuracy spread due to thermal noise and mismatch in the SC neuron. At nominal supply voltages ( $V_{DD}=V_{MEM}=1.0V$ ,  $V_{NEU}=0.6V$ ,  $V_{COMP}=0.8V$ ), the chips operate up to 380frames/s (FPS) and achieve 5.4 $\mu$ J/classification. Lowering  $V_{DD}$  and  $V_{MEM}$  to 0.8V leads to 3.8 $\mu$ J/classification (1.43x reduction) at 237 FPS. The mean classification accuracy is 86.05% (see histogram), the same as observed in a perfect digital model. The histogram spread is solely caused by the noise and mismatch in the SC neuron (which can notably lead to a higher classification accuracy than in the perfect digital model). The 95% confidence interval in mean classification accuracy is 86.01% to 86.10%, measured over 10 chips, 30 runs each through the 10,000 image CIFAR-10 test set. Not included in these energy figures is the 1.8V chip I/O energy, which amounts to 0.43 $\mu$ J (a small fraction of the core energy).

To explore further energy savings, we reduced  $V_{DD}$  to 0.6V and set  $V_{MEM}$  to 0.53V, 0.52V, 0.51V and 0.50V to show the impact of bit errors. The mean accuracy degrades to 85.7%, 85.2%, 84.2% and 82.5%, respectively. The large error bars for the lower voltages (see Fig. 13.5.5) are due to SRAM  $V_{MIN}$  variations across the 10 chips. At  $V_{DD}=0.6V$  and  $V_{MEM}=0.53V$  (borderline practical), the chip consumes 2.61 $\mu$ J/classification, a 2.1x reduction versus nominal supplies. From the breakdown in Fig. 13.5.5, we see that neuron energy increases due to leakage at the lower FPS imposed by voltage scaling. However, this increase is small compared to the logic and memory savings.

Figure 13.5.6 compares this work with prior art and Fig. 13.5.7 shows a photo of the 2.44mm×2.44mm die. On the same benchmark dataset (CIFAR-10), we achieve 40-60x improvement in energy per classification over [3], which does not exploit locality and thus suffers from high interconnect activity. The binarized DNN accelerator in [5] has all memory on-chip, but cannot exploit weight reuse, attaching the energy cost of an SRAM bit load with each XNOR operation. The spiking LCA network in [6] exhibits low energy, but has a relatively low accuracy for a lower-complexity task (MNIST).
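The 40-60x range and the power figures can be cross-checked from the values listed in Fig. 13.5.6, taking the 164μJ entry as the energy per classification of [3] (an assumption based on the comparison in the text) and using power = energy per classification × frame rate:

$$\frac{164}{3.79} \approx 43, \quad \frac{164}{2.61} \approx 63, \quad 3.79\,\mu\text{J} \times 237\,\text{s}^{-1} \approx 0.90\,\text{mW}, \quad 2.61\,\mu\text{J} \times 36\,\text{s}^{-1} \approx 0.094\,\text{mW}$$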

#### Acknowledgements:

Silicon fabrication was provided by the TSMC university shuttle program. We thank Scott Liao for design support. This work was funded in part by Systems on Nanoscale Information fabriCs (SONIC), one of the six STARnet Centers, sponsored by MARCO and DARPA.

#### References:

- [1] J. Zhang, et al., "A Machine-learning Classifier Implemented in a Standard 6T SRAM Array," *IEEE Symp. VLSI Circuits*, 2016.
- [2] Y. H. Chen, et al., "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," *ISSCC*, pp. 262-263, 2016.
- [3] S. Esser, et al., "Convolutional networks for fast, energy-efficient neuromorphic computing," *Proc. Natl. Acad. Sci. USA*, vol. 113, no. 41, pp. 11441-11446, 2016.
- [4] M. Courbariaux, et al., "Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1," *arXiv preprint*: 1602.02830v3, 2016.
- [5] K. Ando, et al., "BRein Memory: A 13-Layer 4.2 K Neuron/0.8 M Synapse Binary/Ternary Reconfigurable in-Memory Deep Neural Network Accelerator in 65 nm CMOS," *IEEE Symp. VLSI Circuits*, 2017.
- [6] F. Buhler, et al., "A 3.43TOPS/W 48.9pJ/Pixel 50.1nJ/Classification 512 Analog Neuron Sparse Coding Neural Network with On-Chip Learning and Classification in 40nm CMOS," *IEEE Symp. VLSI Circuits*, 2017.



Figure 13.5.1: System design and binary CNN topology.



Figure 13.5.2: Top-level architecture with 64 neurons.



Figure 13.5.3: Locality in logic design and physical architecture.



Figure 13.5.4: Switched-capacitor neuron using charge redistribution for wide vector summation.



Figure 13.5.5: Measured energy and accuracy on CIFAR-10 image classification.

|                                       | This work ($V_{DD} = V_{MEM} = 0.8$V)                                    | This work ($V_{DD} = 0.6$V, $V_{MEM} = 0.53$V)                            | [3] IBM TrueNorth | [5] VLSI '17     | [6] VLSI '17 |
|---------------------------------------|--------------------------------------------------------------------------|---------------------------------------------------------------------------|-------------------|------------------|--------------|
| Technology                            | 28nm                                                                     | 28nm                                                                      | 28nm              | 65nm             | 40nm         |
| Algorithm                             | CNN                                                                      | CNN                                                                       | CNN               | DNN              | LCA          |
| Dataset                               | CIFAR-10                                                                 | CIFAR-10                                                                  | CIFAR-10          | MNIST            | MNIST        |
| (Weight, Activation) Precision [bits] | (1, 1)                                                                   | (1, 1)                                                                    | (1.6, 1)          | (1.6, 1)         | (4, 1)       |
| Supply [V]                            | $V_{NEU} = 0.6$<br>$V_{COMP} = 0.8$<br>$V_{DD} = 0.8$<br>$V_{MEM} = 0.8$ | $V_{NEU} = 0.6$<br>$V_{COMP} = 0.8$<br>$V_{DD} = 0.6$<br>$V_{MEM} = 0.53$ | 1.0               | 0.55 – 1.0       |              |
| Classification Accuracy [%]           | 86.05                                                                    | 85.69                                                                     | 83.41             | 90.1             |              |
| Energy per Classification [μJ]        | 3.79                                                                     | 2.61                                                                      | 164               | 0.28 – 0.73      |              |
| Power [mW]                            | 0.899                                                                    | 0.094                                                                     | 204.4             | 50 – 600         |              |
| Frame Rate [FPS]                      | 237                                                                      | 36                                                                        | 1249              | 820K – 3280K     |              |
| Arithmetic Energy Efficiency          | 532 1b-TOPS/W                                                            | 772 1b-TOPS/W                                                             | –                 | 6.0 – 2.3 TOPS/W |              |

Figure 13.5.6: Comparison to state of the art.



Figure 13.5.7: Die photo.

### 13.6 A 1.8Gb/s 70.6pJ/b 128×16 Link-Adaptive Near-Optimal Massive MIMO Detector in 28nm UTBB-FDSOI

Wei Tang<sup>1</sup>, Hemanth Prabhu<sup>2</sup>, Liang Liu<sup>2</sup>, Viktor Öwall<sup>2</sup>, Zhengya Zhang<sup>1</sup>

<sup>1</sup>University of Michigan, Ann Arbor, MI

<sup>2</sup>Lund University, Lund, Sweden

This work presents a 2.0mm<sup>2</sup> 128×16 massive MIMO detector IC that provides 21dB array gain and 16× multiplexing gain at the system level. The detector implements iterative expectation-propagation detection (EPD) for up to 256-QAM modulation. Tested with measured channel data [1], the detector achieves 4.3dB processing gain over state-of-the-art massive MIMO detectors [2, 3], enabling 2.7× reduction in transmit power for battery-powered mobile terminals. The IC uses link-adaptive processing to meet a variety of practical channel conditions with scalable energy consumption. The design is realized in a condensed systolic array architecture and an approximate moment-matching circuitry to reach 1.8Gb/s at 70.6pJ/b. The performance and energy efficiency can be tuned over a wide range by UTBB-FDSOI body bias.
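The decibel figures above convert to linear ratios as follows, assuming the array gain is taken as $10\log_{10}M$ for $M = 128$ base-station antennas (an assumption about how the 21dB figure is defined):

$$10\log_{10}(128) \approx 21.1\,\text{dB}, \qquad 10^{4.3/10} \approx 2.7$$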

Real-time detection for massive MIMO is compute-intensive and power-hungry due to massive matrix dimensions and fast-varying channels. Previous works [2, 3] demonstrated low-complexity massive MIMO detectors based on the assumption of independent and identically distributed (i.i.d.) channels in massive MIMO. The i.i.d. channel assumption is impractical and these simplified detectors suffer from significant performance losses when tested in measured massive MIMO channels, especially in cases of high user load. In designing a practical massive MIMO detector, we select EPD that leverages iterative interference cancellation [4] to offer near-optimal performance even in unfavorable channel conditions, while its complexity is limited to  $O(K^2)$  per iteration, where  $K$  is the number of users. We exploit EPD's iterative processing to adapt the processing effort to the channel so as to achieve the required BER at the lowest energy. The EPD design incorporates explicit matrix inversion, so it could be reused for both uplink and downlink processing. Evaluated using measured massive MIMO channels, the EPD outperforms a linear MMSE detector by 0.7dB, 4.3dB, and 3.5dB in i.i.d., non-line-of-sight (NLOS), and line-of-sight (LOS) conditions, respectively, as shown in Fig. 13.6.1.

The EPD architecture is shown in Fig. 13.6.2. The Gram and  $y^M$  memory buffers incoming channel and match-filtered uplink streams, and the memory supports flexible access patterns required for reconfiguration. The MMSE parallel interference cancellation (MMSE-PIC) filter cancels the inter-user interference from the uplink user data. The moment-matching unit refines the symbol estimates by incorporating constellation information. The detection-control unit dynamically adjusts the per-iteration processing effort and detects early convergence. Updated symbol estimates from the moment-matching unit are buffered in the symbol estimate memory and fed back to the MMSE-PIC filter for iterative refinement. The architecture is configurable to support vector processing of different lengths to facilitate dynamic dimension reduction, i.e., when a batch of estimates are determined to be reliable, they are frozen and the corresponding users are removed from future iterations. In Fig. 13.6.2, dynamic dimension reduction enables 40–90% of complexity reduction due to the reduced number of users and iteration count. With appropriate threshold choices, the possibility of premature freezing is minimized and the SNR loss is negligible. In designing the silicon prototype, we combine this adaptive architecture with coarse-grained clock gating to save 49.3% power.

One of the most compute-intensive and accuracy-critical parts of the EPD is the matrix inversion block in the MMSE-PIC filter. A systolic array is often used to implement the LDL decomposition to realize highly accurate matrix inversion. The systolic array architecture features a regular architecture, efficient routing and simple control. However, the hardware utilization of a systolic array architecture is only 33.3% [5] due to the need for zero-padding inputs. In this work, we implement a condensed LDL systolic array, which merges under-utilized PE circuitry to improve the hardware utilization to 90% for a 16×16 array, while reducing the interconnect overhead by more than 70%. As shown in Fig. 13.6.3, a PE in a regular systolic array performs division (PE0), multiplication (PE1) or MAC (PE2 and 3) operations and passes its output to the neighboring PEs. In our condensed systolic array, every three PEs in a row are merged. The merging shortens data movements in the systolic array. Rather than passing data along many stages of unused operations in a systolic array, our condensed array limits data movements using holding buffers to maximize data reuse. The data reuse is especially advantageous in our design, as it requires a relatively long 28b data bit width to support a wide range of channel conditions. The condensed array architecture reduces silicon area by 62% compared to the regular systolic array. Moreover, the condensed array shortens data movement delay and dedicates a larger fraction of a clock period to data processing.
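For reference, the arithmetic that an LDL systolic array carries out (divisions, multiplications, and MACs) can be sketched in floating point as below. This is the textbook square-root-free factorization, not the condensed array or its 28b fixed-point datapath; the function names are hypothetical.

```python
import numpy as np

def ldl_decompose(a):
    """Factor a Hermitian positive-definite matrix as A = L D L^H.

    L is unit lower-triangular and D is returned as a real vector; no square
    roots are needed, unlike a Cholesky factorization.
    """
    n = a.shape[0]
    l = np.eye(n, dtype=complex)
    d = np.zeros(n)
    for j in range(n):
        # Diagonal entry: remove contributions of already-computed columns.
        d[j] = (a[j, j] - np.sum(np.abs(l[j, :j]) ** 2 * d[:j])).real
        for i in range(j + 1, n):
            # Off-diagonal entry: a chain of MACs followed by one division.
            acc = a[i, j] - np.sum(l[i, :j] * np.conj(l[j, :j]) * d[:j])
            l[i, j] = acc / d[j]
    return l, d

def invert_via_ldl(a):
    """Invert A using A^-1 = L^-H D^-1 L^-1."""
    l, d = ldl_decompose(a)
    l_inv = np.linalg.inv(l)  # a hardware design would use substitution here
    return l_inv.conj().T @ np.diag(1.0 / d) @ l_inv
```

For a 128×16 system, the matrix to invert is the 16×16 regularized Gram matrix, so `a` would be built from $H^H H$ plus a noise-dependent diagonal term.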

The moment-matching unit computes the likelihood of each constellation point to refine the mean and variance of current symbol estimates. The computational complexity is proportional to the product of the modulation size and the number of simultaneously served users. For a 256-QAM, 128×16 massive MIMO system, the complexity is prohibitive. We implement an approximate moment-matching (AMM) circuitry to cut 90% of the computation at the cost of a limited 0.5dB SNR loss. AMM makes the complexity independent of modulation size by exploiting the symmetry of the QAM constellation in computing the mean and variance estimates; the approach is therefore favorable for designing a flexible detector that supports a wide range of modulation schemes. Complexity is further reduced with a piecewise linear approximation to compute mean and variance updates: the mean update is reduced to a hard decision of the input soft symbol, and the variance update is fitted to a first-order polynomial function of the input mean and variance. As shown in Fig. 13.6.4, compared to a brute-force moment-matching implementation using 2 dividers, 65 MACs, and 16 exponential evaluations, the AMM circuitry uses only 2 MACs. AMM also eliminates costly exponentiation and division, and reduces intermediate bit width requirements. The technique cuts the silicon area of the moment-matching unit by more than 90%.
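To make the contrast concrete, the sketch below implements a brute-force moment-matching step next to the hard-decision mean update. It assumes a circular complex Gaussian likelihood $\exp(-|s-\mu|^2/\sigma^2)$ and an unnormalized square QAM grid; the first-order polynomial used for the variance update is not reproduced because its coefficients are not given in the text. Function names are hypothetical.

```python
import numpy as np

def qam_constellation(order):
    """Square QAM points (unnormalized), e.g. order=16 gives a 4x4 grid at ±1, ±3."""
    side = int(round(np.sqrt(order)))
    levels = np.arange(-(side - 1), side, 2)
    return (levels[:, None] + 1j * levels[None, :]).ravel()

def moment_match(mean_in, var_in, points):
    """Brute-force moment matching: weight every constellation point by its
    Gaussian likelihood, then take the mean and variance of that discrete posterior."""
    log_w = -np.abs(points - mean_in) ** 2 / var_in
    w = np.exp(log_w - log_w.max())          # subtract max for numerical stability
    w /= w.sum()
    mean_out = np.sum(w * points)
    var_out = np.sum(w * np.abs(points - mean_out) ** 2)
    return mean_out, var_out

def hard_decision(mean_in, points):
    """Approximate mean update described in the text: nearest constellation point."""
    return points[np.argmin(np.abs(points - mean_in))]
```

When the input variance is small, the brute-force mean collapses onto the nearest constellation point, which is the regime where replacing it with a hard decision costs little.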

An EPD test chip is fabricated in ST 28nm UTBB-FDSOI technology, occupying 2.0mm<sup>2</sup> core area, as shown in Fig. 13.6.7. The measurement results at different core voltages and body biases at room temperature are shown in Fig. 13.6.5. At a nominal voltage of 1.0V, the EPD chip runs at 512MHz, delivering a system throughput of 1.6Gb/s. By applying forward body biasing of 0.4V, a maximum working frequency of 569MHz is achieved, corresponding to an 11% boost in detection throughput to 1.8Gb/s. The corresponding core power consumption is 127mW, translating to an energy efficiency of 70.6pJ/b. For a low-power application, reverse body biasing of 0.2V and voltage scaling to 0.7V can be applied to reduce the power consumption to 23.4mW at a throughput of 754Mb/s. Compared to the prior MIMO detector designs shown in Fig. 13.6.6, our EPD chip provides flexibility in terms of modulation and channel adaptation, supports both uplink and downlink processing, and achieves a high processing gain, while maintaining competitive energy and area efficiency. Note that the MPD chip in [2] takes advantage of the assumption of diagonal dominance in i.i.d. channels using a low-complexity, 13b implementation without explicit matrix inversion. However, the MPD encounters an early error floor and fails to provide sufficient processing gain in practical but unfavorable channels such as LOS. In comparison, our EPD chip obtains 4.3dB processing gain in highly correlated channels, equivalent to a 2.7× boost in link margin that can be utilized to significantly lower the TX power and relax the frontend requirements.
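The quoted efficiency and speed-up figures follow directly from the measured power, throughput, and frequency values; the low-power operating point works out to roughly 31pJ/b:

$$\frac{127\,\text{mW}}{1.8\,\text{Gb/s}} \approx 70.6\,\text{pJ/b}, \quad \frac{569\,\text{MHz}}{512\,\text{MHz}} \approx 1.11, \quad \frac{23.4\,\text{mW}}{754\,\text{Mb/s}} \approx 31.0\,\text{pJ/b}$$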

#### Acknowledgements:

The work was supported in part by NSF CCF-1054270, Intel, Silicon Labs, and System Design on Silicon (SoS) Center. Chip fabrication donation was provided by STMicroelectronics. We thank Ove Edfors, Rakesh Gangarajaiah, Babak Mohammadi, Shiming Song and Teyuh Chou for advice.

#### References:

- [1] S. Malkowsky, et al., "The World's First Real-Time Testbed for Massive MIMO: Design, Implementation, and Validation," *IEEE Access*, vol. 5, pp. 9073-9088, 2017.
- [2] W. Tang, et al., "A 0.58mm<sup>2</sup> 2.76Gb/s 79.8pJ/b 256-QAM Massive MIMO Message-Passing Detector," *IEEE Symp. VLSI Circuits*, pp. 1-2, 2016.
- [3] H. Prabhu, et al., "A 60pJ/b 300Mb/s 128×8 Massive MIMO Precoder-Detector in 28nm FD-SOI," *ISSCC*, pp. 60-61, 2017.
- [4] J. Céspedes, et al., "Expectation Propagation Detection for High-Order High-Dimensional MIMO Systems," *IEEE Trans. Commun.*, vol. 62, no. 8, pp. 2840-2849, 2014.
- [5] S. J. Bellis, et al., "Alternative Systolic Array for Non-Square-Root Cholesky Decomposition," *IEE Proc. Comput. Digit. Technol.*, vol. 144, no. 2, pp. 57-64, 1997.
- [6] C.-H. Chen, et al., "A 2.4mm<sup>2</sup> 130mW MMSE-Nonbinary LDPC Iterative Detector-Decoder for 4x4 256-QAM MIMO in 65nm CMOS," *ISSCC*, pp. 338-339, 2015.



Figure 13.6.1: A multi-user massive MIMO system and BER of different channels. Insets are the Gram matrices  $|H^H H|$ .



Figure 13.6.2: Link-adaptive EPD architecture and efficiency gains from dimension reduction and early termination.



Figure 13.6.3: Condensed LDL systolic array with enhanced utilization and merged PE designs.



Figure 13.6.4: Circuitry implementations and complexities of the original and approximate moment-matching.



Figure 13.6.5: Measured frequency and power consumption for different core voltages and body biases.

|                                           | Chen [6] | Tang [2]                   | Prabhu [3] | This Work          |
|-------------------------------------------|----------|----------------------------|------------|--------------------|
| Algorithm                                 | MMSE     | MPD <sup>(a)</sup>         | MMSE       | EPD <sup>(b)</sup> |
| MIMO [ $M \times K$ ]                     | 4x4      | 128x32                     | 128x8      | 128x16             |
| Modulation                                | 256      | 256                        | 256        | QPSK to 256        |
| Channel Adaptiveness                      | no       | no                         | no         | yes                |
| Support Precoding                         | no       | no                         | yes        | yes                |
| Array Gain [dB]                           | 6        | 21                         | 21         | 21                 |
| Multiplexing Gain                         | 4        | 32                         | 8          | 16                 |
| Link Margin Improvement <sup>(c)</sup>    |          |                            |            |                    |
| i.i.d. Channel [dB]                       | 0        | 1.0                        | 0          | 0.6                |
| NLOS Channel [dB]                         | 0        | Error floor <sup>(d)</sup> | 0          | 4.3                |
| LOS Channel [dB]                          | 0        | Error floor <sup>(d)</sup> | 0          | 3.5                |
| Technology [nm]                           | 65       | 40                         | 28         | 28                 |
| Core Area [mm <sup>2</sup> ]              | 0.7      | 0.58                       | -          | 2.0                |
| Gate Count [kGE]                          | 347      | 1,022                      | 288        | 3,607              |
| Power [mW]                                | 26.5     | 221                        | 18         | 127                |
| Frequency [MHz]                           | 517      | 425                        | 300        | 569                |
| System Throughput <sup>(e)</sup> [Gb/s]   | 1.38     | 2.76                       | 0.30       | 1.80               |
| Energy Efficiency <sup>(f)</sup> [pJ/b]   | 307      | 20                         | 240        | 70                 |
| Area Efficiency <sup>(g)</sup> [Mb/s/kGE] | 0.24     | 10                         | 0.26       | 0.5                |

(a) message-passing detection. (b) expectation-propagation detection. (c) link margin improvement refers to the SNR gain over MMSE at BER=10<sup>-3</sup>. (d) error floor occurs before BER=10<sup>-3</sup> in NLOS and LOS channels. (e) system throughput assumes channel coherence among 7 OFDM symbols and  $K$  subcarriers. (f) energy efficiency is  $(Power/Throughput) / (K/16)^2$ . (g) area efficiency is  $(Throughput/Gate Count) / (K/16)^2$ .

Figure 13.6.6: Comparison with state-of-the-art MIMO detector implementations.
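As a quick check of footnote (f), the normalized energy efficiency can be recomputed from the power, throughput, and user count listed in the table. The sketch below is illustrative only: the function name and the dictionary layout are assumptions, and the inputs are simply copied from the table.

```python
# Recompute footnote (f): energy efficiency = (Power / Throughput) / (K/16)^2.
# Inputs are copied from the comparison table; all names are illustrative.

def normalized_energy_pj_per_bit(power_mw, throughput_gbps, users_k):
    """Energy per bit in pJ/b, normalized to a K = 16 user system."""
    raw_pj_per_bit = power_mw / throughput_gbps  # mW / (Gb/s) equals pJ/b
    return raw_pj_per_bit / (users_k / 16) ** 2

designs = {
    "Chen [6]":   (26.5, 1.38, 4),
    "Tang [2]":   (221.0, 2.76, 32),
    "Prabhu [3]": (18.0, 0.30, 8),
    "This work":  (127.0, 1.80, 16),
}

results = {name: normalized_energy_pj_per_bit(*v) for name, v in designs.items()}
for name, value in results.items():
    print(f"{name}: {value:.1f} pJ/b")
```

The recomputed values land within rounding of the Energy Efficiency row of the table.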

|                           |                   |
|---------------------------|-------------------|
| Technology                | 28nm              |
| Core Area                 | 2mm <sup>2</sup>  |
| Number of BS Antennas (M) | 128               |
| Number of Users (K)       | ≤ 16              |
| QAM size                  | 4, 16, 64, 256    |
| Channel Adaptiveness      | i.i.d., NLOS, LOS |
| System Throughput         | 1.8Gb/s           |
| Power Consumption         | 127mW             |



Figure 13.6.7: Chip features and microphotograph.

### 13.7 A 232-to-1996KS/s Robust Compressive-Sensing Reconstruction Engine for Real-Time Physiological Signals Monitoring

Ting-Sheng Chen, Hung-Chi Kuo, An-Yeu Wu

National Taiwan University, Taipei, Taiwan

Compressive sensing (CS) techniques enable new reduced-complexity designs for sensor nodes and help reduce overall transmission power in wireless sensor networks [1-2]. Prior CS reconstruction chip designs have been described in [3-4]. However, for real-time monitoring of physiological signals, the orthogonal matching pursuit (OMP) algorithms they incorporate are sensitive to measurement noise interference and suffer from a slow convergence rate. This paper presents a new CS reconstruction engine fabricated in 40nm CMOS with the following features: 1) A sparsity-estimation framework to suppress measurement noise interference at sensing nodes, achieving at least 8dB signal-to-noise ratio (SNR) gain under the same success rate for robust reconstruction. 2) A new flexible indices-updating VLSI architecture, inspired by the gradient descent method [5], that can support arbitrary signal dimension,  $(L_{\text{new}}, M)$ , of CS reconstruction with high sparsity level ( $K_{\max}$ ). 3) Parallel-searching, indices-bypassing, and functional blocks that automatically group processing elements (PEs) are designed to reduce the total CS reconstruction cycle latency by 84%. Compared with prior state-of-the-art designs, this CS reconstruction engine can achieve 10 $\times$  higher throughput rate and 4.2 $\times$  better energy efficiency at the minimum-energy point (MEP).

In blind reconstruction algorithms that can be operated without a priori knowledge of the sparsity level, such as OMP [3-4] and stochastic gradient pursuit (SGP) [5], measurement noise destroys the sparsity ( $K$ ) of received signals, degrading the reconstruction quality and speed [5]. The subspace pursuit (SP) algorithm offers excellent recovery quality and convergence under noisy scenarios. However, it is classified as *non-blind* reconstruction as it needs an *a priori* explicit sparsity level. We propose a two-phase sparsity-estimation subspace pursuit (SE-SP) CS reconstruction algorithm as shown in Figure 13.7.1. The new SE-SP can cope with measurement noise through two phases. Phase-I (P1) performs blind reconstruction similar to OMP. It iterates until the number of chosen indices reaches the maximum,  $K_{\max}$ , in order to obtain all potential indices. Then, it estimates the effective sparsity level,  $\hat{K}$ , according to the number of elements whose amplitudes are larger than a certain noise distortion level derived from the residual norm. Phase-II (P2) applies SP with the output of P1 and the estimated  $\hat{K}$  to obtain the  $\hat{K}$ -best sparse solution. The SE-SP still performs blind reconstruction like OMP, but it possesses all of the advantages of the non-blind SP, such as robustness to measurement noise. Hence, it can achieve 8dB gain (time-sparse signals) in terms of *success rate* with only 10% iteration-count overhead and without any *a priori* sparsity information.
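The sparsity-estimation step that closes Phase-I can be illustrated with a short sketch. The text states only that the noise level is derived from the residual norm, so the specific threshold used here (residual norm divided by $\sqrt{M}$, times a scale factor) is an assumption made for illustration, as are all function and variable names.

```python
import math

def estimate_sparsity(amplitudes, residual, scale=1.0):
    """Count Phase-I amplitudes that exceed a residual-derived noise level.

    amplitudes: estimated amplitudes of the K_max indices chosen in Phase-I
    residual:   residual vector (length M) left after Phase-I
    scale:      hypothetical tuning factor, not taken from the paper
    """
    m = len(residual)
    residual_norm = math.sqrt(sum(r * r for r in residual))
    noise_level = scale * residual_norm / math.sqrt(m)  # assumed threshold rule
    k_hat = sum(1 for a in amplitudes if abs(a) > noise_level)
    return k_hat, noise_level

# Toy example: three strong components plus five noise-level candidates.
phase1_amplitudes = [1.2, -0.9, 0.7, 0.03, -0.02, 0.04, 0.01, -0.03]
phase1_residual = [0.05, -0.04, 0.06, -0.05, 0.03, -0.06, 0.04, 0.05]
k_hat, level = estimate_sparsity(phase1_amplitudes, phase1_residual)
print(k_hat, level)
```

In this toy case only the three strong components clear the threshold, so Phase-II would be run with $\hat{K}=3$ rather than with all eight candidates.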

Figure 13.7.2 shows a least-mean-squares (LMS)-based architecture to implement the SP algorithm in our design. It benefits from local data updating, thus is free of global communication overhead and wiring costs. To implement the SP algorithm, we need to add/estimate  $\hat{K}$  supports through a least squares (LS) computation in each iteration. However, a direct implementation of LS is unable to simultaneously realize updating of multiple indices and amplitudes, and the configurability features in [4] for handling variable measurement dimensions ( $M$ ). Furthermore, it requires backward and forward substitution (BS, FS) operations, which results in additional global communication cost and high iteration counts. Although SGP uses LMS to enhance OMP, cache overhead and fixed step-size limit its scalability and convergence. This chip uses a global buffer to transpose the columns of chosen indices into an on-chip cache for the LMS updating process, resulting in 5 $\times$  area reduction of the cache. It approximates the target sparse solution ( $\hat{x}$ ) with the following advantages: 1) Speed up of the SP algorithm: it can add arbitrary support,  $L_{\text{new}} \leq 128$ , at each iteration; then, the sorting engine finds  $K$ -best solution for enhanced reconstruction quality. 2) Support of reconfigurable design: the line buffer-based feature of LMS is adjustable to arbitrary size (e.g., signal dimension of  $M$  and  $K$ ), reaching 100% configurability with only 0.5% area overhead. 3) Local LMS updating enabling scalable designs: it relieves the limitation of global BS/FS operations. Hence, a larger signal sparsity level ( $K=256$ ) and 3 $\times$  higher clock rates can be achieved in this chip.
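To make the local-update idea concrete, the following sketch solves a small least-squares problem on the chosen columns with row-by-row LMS updates instead of a direct solve that needs forward and backward substitution. It is a floating-point toy under assumed values (matrix, step size, sweep count), not a model of the chip's fixed-point datapath or its line-buffer organization.

```python
def lms_least_squares(rows, y, step=0.3, sweeps=200):
    """Approximate argmin_x ||y - A x||^2 with per-row LMS updates.

    rows:         the M rows of A, one entry per chosen index
    y:            measurement vector of length M
    step, sweeps: illustrative values; the step must be small enough to converge
    """
    n = len(rows[0])
    x = [0.0] * n
    for _ in range(sweeps):
        for row, target in zip(rows, y):
            error = target - sum(a * xi for a, xi in zip(row, x))
            # Each update uses only one row of A and the current estimate.
            x = [xi + step * error * a for a, xi in zip(row, x)]
    return x

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x_true = [2.0, -1.0]
y = [sum(a * xi for a, xi in zip(row, x_true)) for row in A]
x_hat = lms_least_squares(A, y)
print(x_hat)
```

For this noise-free example the estimate converges to `x_true`. With noisy measurements a fixed step size leaves a small residual misadjustment, which is one reason step-size selection matters for convergence.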

Figure 13.7.3 shows the block diagram of the overall chip architecture. To achieve higher area efficiency, the tasks of the SE-SP are mapped into a folded architecture. 768KB of on-chip memory stores the sensing matrices for flexible reconstruction, holding either a single matrix for large signal dimensions or 4 matrices for small signal dimensions. The 256 configurable PEs, 192KB cache and multi-task buffers can reconstruct a sparse signal with 100% flexibility. A sorting engine enables the task of finding the  $K$ -best indices/support in Phase-II. The data representation is 32b fixed point. It supports any integer of  $(N, M, K)$  up to (2048, 512, 256), representing a larger sparsity level than prior art.

Figure 13.7.4 shows three architectural optimization techniques to reduce operating cycle count. 1) Dynamic PE grouping: because the task of index searching (IS) features high complexity but low data-dependency, the architecture can be either unfolded or folded according to the measurement size. With larger  $M$ , the PEs use multiple cycles to complete a correlation operation in IS. When  $M$  is small, the PEs are grouped dynamically to reduce the total cycle count of IS by 25% to 50%. 2) Chosen-indices bypassing: the chosen indices from past iterations can be bypassed during IS. The sorting engine checks the chosen indices before loading the column from the sensing matrix, eliminating unnecessary correlation operations and reducing cycle count by 10% when performing IS. 3) Parallel-sparsity estimation: P1 provides a sparsity-order estimation for SP to screen support, rather than reconstructing signals directly. Therefore, this chip accelerates the P1 operation by choosing multiple (2-8) indices in each iteration, which helps to reduce the total iterations by 79%. The multi-task buffer is also designed for transposing up to 8 chosen columns directly. The above optimization can effectively reduce 85% of total cycles, thus enhancing throughput rate and energy efficiency by 6.3 $\times$  for CS reconstruction.
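The chosen-indices bypassing and multi-index selection described above can be sketched in software as follows. This is a behavioral illustration only: the function name, the data layout, and the toy sensing matrix are assumptions, and the sketch does not model PE grouping or cycle counts.

```python
def index_search(columns, residual, chosen, num_new):
    """Return the num_new unchosen column indices most correlated with residual.

    columns:  sensing-matrix columns, each a list of length M
    residual: current residual vector of length M
    chosen:   set of indices selected in earlier iterations (bypassed here)
    """
    scores = []
    skipped = 0
    for idx, col in enumerate(columns):
        if idx in chosen:
            skipped += 1  # bypass: no column load and no correlation
            continue
        corr = abs(sum(c * r for c, r in zip(col, residual)))
        scores.append((corr, idx))
    scores.sort(reverse=True)
    new_indices = [idx for _, idx in scores[:num_new]]
    return new_indices, skipped

columns = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
]
residual = [0.1, 2.0, -3.0, 0.2]
new_indices, skipped = index_search(columns, residual, chosen={2}, num_new=2)
print(new_indices, skipped)
```

Index 2 would have had the largest correlation, but it was chosen earlier and is skipped before any multiply-accumulate work is done; two new indices are then picked in a single pass.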

Figure 13.7.5 shows chip measurement results. The measured MEP is at 40MHz under a 0.65V supply. Inspired by [4], we found that the achieved energy efficiency stays close to the MEP. Therefore, this chip reduces area costs by using a global  $V_{DD}$  for both logic and memory. Since the SE-SP possesses a noise-tolerance feature, when reconstructing physiological signals (ECG, EMG, EEG and PPG in our measurements) under noisy conditions, it realizes at least 8dB higher SNR gain than OMP-based designs, under the same reconstruction SNR (RSNR).

Figure 13.7.6 shows a comparison with state-of-the-art designs. This CS reconstruction engine can provide 232-1996KS/s for reconstructing physiological signals of multiple patients, while offering full reconfigurability with 100% flexibility to support arbitrary signal dimensions ( $M, N$ ), and robustness to measurement noise interference. By operating at a higher clock rate, but with fewer cycles, this chip achieves 7-19 $\times$  throughput enhancement and 3-7 $\times$  higher energy efficiency compared with prior work. The power consumption is larger than prior art due to the 3 $\times$  higher operating frequency. The radar chart shows that this chip supports a larger sparsity level, better energy efficiency and a higher throughput rate. Figure 13.7.7 shows the micrograph and chip summary. In conclusion, the 3.06mm<sup>2</sup> CS reconstruction engine can provide timely physiological signal reconstruction for data collected from CS-based wireless biosensors under noisy conditions, making intelligent patient monitoring a reality.

#### Acknowledgements:

Thanks to Prof. Chia-Hsiang Yang and Prof. Tsung-Te Liu for useful discussions. The authors would like to thank the National Chip Implementation Center, Taiwan for support on chip fabrication and measurements. This work is supported by the Ministry of Science and Technology of Taiwan under Grant MOST 106-2221-E-002-204-MY3. The first two authors contributed equally.

#### References:

- [1] A. Dixon, et al., "Compressed Sensing System Considerations for ECG and EMG Wireless Biosensors," *IEEE Trans. Biomed. Circuits Syst.*, vol. 6, no. 2, pp. 156-166, 2012.
- [2] D. Gangopadhyay, et al., "Compressed sensing analog front-end for bio-sensor applications," *IEEE JSSC*, vol. 49, no. 2, pp. 426-438, 2014.
- [3] Y.-C. Cheng, et al., "Matrix-Inversion-Free Compressed Sensing with Variable Orthogonal Multi-Matching Pursuit Based on Prior Information for ECG Signals," *IEEE Trans. Biomed. Circuits Syst.*, vol. 10, no. 4, pp. 864-873, 2016.
- [4] F. Ren, et al., "A configurable 12-to-237KS/s 12.8mW sparse-approximation engine for mobile ExG data aggregation," *ISSCC*, pp. 334-335, 2015.
- [5] Y. M. Lin, et al., "Low-Complexity Stochastic Gradient Pursuit Algorithm and Architecture for Robust Compressive Sensing Reconstruction," *IEEE Trans. on Signal Process.*, vol. 65, no. 3, pp. 638-650, 2017.



Figure 13.7.1: Proposed two-phase sparsity-aware reconstruction.

Figure 13.7.2: LMS architecture for implementing  $\ell^2$  minimization.

Figure 13.7.3: Block diagram of chip architecture.



Figure 13.7.5: Measured results: Shmoo measurement, power consumptions, and reconstruction quality vs. SNR.



Figure 13.7.6: Comparison with prior chip implementations.



Figure 13.7.7: Chip micrograph and summary.

# Session 14 Overview: *High-Resolution ADCs*

## DATA CONVERTER SUBCOMMITTEE



**Session Chair:**  
**Matt Straayer**  
*Maxim Integrated Products, Chelmsford, MA*



**Associate Chair:**  
**Seung-Tak Ryu**  
*KAIST, Daejeon, Korea*

**Subcommittee Chair: *Un-Ku Moon, Oregon State University, Corvallis, OR***

This session's high-resolution analog-to-digital converters (ADCs) with 12 to 19b ENOB introduce a number of advanced circuit design techniques to achieve very high performance with low power consumption. While many of the proposed designs use an efficient SAR architecture where possible for moderate resolution, higher performance is consistently enabled by delta-sigma and pipeline architectures. Precision is further enabled by techniques such as hardware re-use, calibration, dynamic element matching, chopping, and correlated double-sampling.



1:30 PM

### 14.1 A 50MHz-BW Continuous-Time $\Delta\Sigma$ ADC with Dynamic Error Correction Achieving 79.8dB SNDR and 95.2dB SFDR

T. He, Oregon State University, Corvallis, OR

In Paper 14.1, Oregon State University and MediaTek present an efficient dynamic error correction technique for an NRZ feedback DAC in a continuous-time delta-sigma ADC. The 28nm ADC achieves 80dB SNDR in a 50MHz bandwidth while consuming 64.3mW.



2:00 PM

### 14.2 A 15.2-ENOB Continuous-Time $\Delta\Sigma$ ADC for a 7.3 $\mu$ W 200mV<sub>pp</sub>-Linear-Input-Range Neural Recording Front-End

H. Chandrakumar, University of California, Los Angeles, CA

In Paper 14.2, the University of California, Los Angeles describes a capacitively coupled continuous-time delta-sigma ADC for a neural recording front-end in 40nm CMOS. Chopping and linearity enhancement techniques are key to demonstrating 15.2b-ENOB with only 4.5μW in 5kHz bandwidth.



2:30 PM

### 14.3 A 13-ENOB 2<sup>nd</sup>-Order Noise-Shaping SAR ADC Realizing Optimized NTF Zeros Using an Error-Feedback Structure

S. Li, University of Texas, Austin, TX

In Paper 14.3, the University of Texas at Austin proposes an error feedback structure with a passive FIR and re-used comparator to realize complex noise transfer function zeros in a noise-shaping SAR ADC. Clocked at 10MHz, the 40nm ADC realizes 79dB SNDR in 625kHz bandwidth and consumes 84μW.



3:15 PM

### 14.4 A 1.1mW 200kS/s Incremental ΔΣ ADC with a DR of 91.5dB Using Integrator Slicing for Dynamic Power Reduction

P. Vogelmann, University of Ulm, Ulm, Germany

In Paper 14.4, the University of Ulm introduces an integrator slicing technique in an incremental delta-sigma ADC that allows for improved noise and power tradeoffs in the front-end integrator stages. The 200kS/s prototype in 0.18μm CMOS achieves 91.5dB dynamic range and 1.1mW.




3:45 PM

### 14.5 A 280μW Dynamic-Zoom ADC with 120dB DR and 118dB SNDR in 1kHz BW

S. Karmakar, Delft University of Technology, Delft, The Netherlands

In Paper 14.5, Delft University of Technology presents a dynamic zoom ADC with a high-speed asynchronous SAR that works in tandem with a delta-sigma. Fabricated in 0.16μm CMOS, the ADC reaches 118dB SNDR in a 1kHz bandwidth while consuming 280μW.



4:15 PM

### 14.6 A 0.4V 13b 270kS/s SAR-ISDM ADC with an Opamp-Less Time-Domain Integrator

S-E. Hsieh, National Tsing Hua University, Hsinchu, Taiwan

In Paper 14.6, National Tsing Hua University proposes a single-path time-domain voltage-controlled delay line for use in a coarse SAR comparator and a fine incremental delta-sigma integrator to realize a 13b ADC. Power consumption is only 638nW at 270kS/s with 11.9b ENOB.



4:45 PM

### 14.7 A Signal-Independent Background-Calibrating 20b 1MS/s SAR ADC with 0.3ppm INL

H. Li, Analog Devices, Wilmington, MA

In Paper 14.7, Analog Devices introduces a signal-independent background calibration technique to achieve 0.3ppm INL in a 20b 1MS/s pipelined SAR ADC in 0.18μm CMOS. The double conversion calibration settles to 0.25ppm within 100k samples, resulting in 101.5dB SNDR with a 5V reference.

## 14.1 A 50MHz-BW Continuous-Time $\Delta\Sigma$ ADC with Dynamic Error Correction Achieving 79.8dB SNDR and 95.2dB SFDR

Tao He<sup>1</sup>, Michael Ashburn<sup>2</sup>, Stacy Ho<sup>2</sup>, Yi Zhang<sup>1</sup>, Gabor Temes<sup>1</sup>

<sup>1</sup>Oregon State University, Corvallis, OR

<sup>2</sup>MediaTek, Woburn, MA

Continuous-time  $\Delta\Sigma$  modulators (CTDSMs) are widely used in cellular handsets due to their power efficiency and inherent anti-aliasing characteristics. To achieve demanding cellular bandwidth requirements while maintaining good power efficiency, multi-bit feedback is typically used. This approach provides benefits such as lower OSR, relaxed loop filter requirements, and reduced jitter sensitivity. However, at multi-GHz clock rates, dynamic errors introduced by inter-symbol interference (ISI) in a multi-bit feedback DAC become pronounced [1-2], thereby degrading SFDR and reducing blocker tolerance. Several methods of minimizing ISI have been previously reported; however, they require complex circuitry and introduce significant excess loop delay (ELD) [1-2], or require a single-bit RZ DAC [3]. In this CTDSM, an energy-efficient analog-based ISI mitigation scheme is implemented. It introduces only negligible additional ELD, requires minimal extra circuitry, significantly improves SFDR, and can be used with a multi-bit NRZ DAC. Fabricated in 28nm CMOS, the prototype dissipates 64.3mW. It achieves 95.2dB SFDR and 79.8dB SNDR over 50MHz BW sampled at 2GHz.

Figure 14.1.1 shows a block diagram of the implemented CTDSM. The modulator can be configured for either normal operation or ISI calibration mode. In normal operation, the ADC functions as a 4th-order CTDSM with  $H_{inf}=1.8$ . A resonator generates an NTF notch, and two feed-forward paths are used to stabilize the modulator, while using only a single, 4b current-steering feedback DAC. The multi-bit NRZ DAC facilitates a lower OSR and reduces jitter sensitivity. A 4b quantizer includes embedded excess loop delay compensation to stabilize the loop for a half-cycle delay in the feedback path [4]. Along with using an ISI calibration for dynamic errors, a current copier technique is used to address static errors in the DAC [5]. For ISI calibration, a first-order loop is formed by bypassing all but the first stage in the loop filter. The output of the first integrator is sent to a single-bit quantizer, followed by a programmable delay element. Calibration mode is selected using a multiplexer placed at the input to the DAC.

The implemented foreground ISI calibration procedure is simple and requires few additional components, each with modest performance requirements. The basic idea for the calibration is that since each individual DAC element contributes a net dynamic error on every transition, the total dynamic error is proportional to the number of transitions that occur (Fig. 14.1.2). Thus, by taking two measurements where the only difference between them is the number of transitions from a single DAC element, the dynamic error for that element can be determined. In the calibration scheme presented here, the same DC voltage is applied for the two measurements. In order to vary the number of transitions, the programmable delay,  $T_d$ , in Fig. 14.1.1 is changed, which in turn changes the noise shaping. By counting the change in the number of transitions ( $\Delta Trans$ ) and the change in the modulator output ( $\Delta DOUT$ ), the ISI error can be calculated. Note that during calibration only the DAC element under test is active. All other DAC elements are set high or low based on the applied DC input voltage in order to achieve 50% duty cycle at the output. This approach maximizes the number of transitions, and helps to avoid low-frequency idle tones. The width of the measurement window is selected such that quantization and circuit noise are averaged to negligible values. Regarding the performance requirements, the first amplifier targets 55dB of DC gain and 7GHz UGBW. This is adequate to minimize the movement of the integrator's summing junction, thereby mitigating the effects of quantizer offset and any DC shift at the output that could be caused by tonal behavior.
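The two-measurement idea reduces to a single division once $\Delta Trans$ and $\Delta DOUT$ are known. The sketch below uses a deliberately simple toy model in which the accumulated modulator output is a DC term plus a fixed error per transition; the model, the numeric values, and all names are assumptions for illustration and do not represent the measured behavior of the chip.

```python
def isi_error_per_transition(dout_a, trans_a, dout_b, trans_b):
    """Estimate the dynamic error of one DAC element from two measurements.

    dout_*:  modulator output accumulated over the measurement window
    trans_*: transitions of the element under test within that window
    Both measurements use the same DC input; only the delay T_d differs.
    """
    delta_trans = trans_b - trans_a
    if delta_trans == 0:
        raise ValueError("the two delay settings must give different transition counts")
    return (dout_b - dout_a) / delta_trans

def simulate_window(dc_level, window, transitions, error_per_transition):
    """Toy model: accumulated output = DC term + error added per transition."""
    return dc_level * window + error_per_transition * transitions

TRUE_ERROR = 0.002   # hypothetical error per transition, in LSB
WINDOW = 1 << 20     # hypothetical window length in samples
dout_a = simulate_window(0.25, WINDOW, 300_000, TRUE_ERROR)
dout_b = simulate_window(0.25, WINDOW, 450_000, TRUE_ERROR)
estimate = isi_error_per_transition(dout_a, 300_000, dout_b, 450_000)
print(estimate)
```

Because the DC term is identical in both windows, it cancels in the difference, which is why the same DC voltage is applied for the two measurements.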

An analog approach is used to compensate the dynamic errors determined using the ISI calibration loop. This avoids the drastic increase in word length required for the digital approach, which would otherwise complicate the decimation filter and increase the area and power consumption. The signal path and a single DAC element are shown in Fig. 14.1.3. The output signal from the quantizer is fed to a DFF, where it is re-timed and latched using a delayed clock. The DFF output is then double-buffered in order to level-shift the input signal to the drive voltages for the 2-level DAC element. 200 $\mu$ A current sources are cascoded (not shown) in order to maintain high impedance at the switching node, and to minimize capacitance. Two additional switches redirect the NMOS and PMOS current sources to the current-copier loop for periodic refreshing. The analog ISI correction is embedded within the first buffer feeding the DAC. Edge rates are modulated by adjusting the degeneration of the buffer, which in turn changes the effective pulse width to reduce the dynamic error. The leading and trailing edges of the drive signals are shifted by the same amount, but in opposite directions, to avoid adding phase error during the trim. The ISI correction utilizes a 4b trim with a range of +/-0.4% of the clock period (+/-2ps) and 0.05% (0.25ps) resolution. The DAC current sources are supplied from 1.5V, while the DAC drivers use 0.2V and 1.1V drive voltages for DVSS\_SW and DVDD\_SW, respectively, to ensure that all switches are biased in saturation.
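The trim range and resolution quoted above follow directly from the 2GHz clock, and a measured ISI error can be mapped to the nearest trim step. In the sketch below, the signed code range of -8 to +7 (16 codes for a 4b trim) and the clamping behavior are assumptions; the text does not describe how the 4b code maps onto the range.

```python
F_S_HZ = 2e9                      # sampling clock from the text
T_CLK_PS = 1e12 / F_S_HZ          # 500 ps clock period
TRIM_RANGE_PS = 0.004 * T_CLK_PS  # +/-0.4% of the period
TRIM_STEP_PS = 0.0005 * T_CLK_PS  # 0.05% of the period

def trim_steps_for(error_ps):
    """Signed number of trim steps that best cancels a measured ISI error.

    Assumption: the 4b code is interpreted as a signed step count in [-8, 7].
    """
    steps = round(-error_ps / TRIM_STEP_PS)
    return max(-8, min(7, steps))

def residual_after_trim(error_ps):
    """ISI error remaining after the chosen trim is applied."""
    return error_ps + trim_steps_for(error_ps) * TRIM_STEP_PS

print(T_CLK_PS, TRIM_RANGE_PS, TRIM_STEP_PS)
print(trim_steps_for(1.1), residual_after_trim(1.1))
```

With nearest-step rounding, any error inside the trim range is left with a residual no larger than half a step, which sits comfortably within one trim resolution.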

The 4b quantizer and all digital blocks operate from 1.16V. Each comparator includes a 10b threshold adjustment, which is calibrated using an external DC input. The ELD compensation path embedded within the quantizer realizes a feedback coefficient of 1. This feedback is achieved by shifting the threshold of each comparator based on the digital output of the previous cycle [4]. In [4], half of all possible threshold options are included for each comparator, resulting in significant routing complexity and large parasitics for each comparator. In this work, only two threshold options are used for each comparator. This greatly reduces routing complexity, enabling significant power savings and a more compact layout. This simplification is achieved using a rotational pattern for the comparator thresholds, as shown in the 2b example in Fig. 14.1.1.

Figure 14.1.4 shows the measured ISI errors for all 15 DAC units, before and after calibration. After correction, all residual ISI errors are within +/-0.25ps, the resolution of the trim. A 131,072-point FFT with and without compensation is also shown. With a -0.6dBFS input signal (2.4Vpp) at 13.6MHz, ISI calibration improves the SNDR from 78.3dB to 79.8dB, and improves the HD2 and HD3 from -87.7 and -87.3dBFS to -101.3 and -95.8dBFS, respectively. Figure 14.1.5 shows the SNR/SNDR vs. input amplitude. DR is 82.8dB and peak SNR/SNDR are 80.0dB and 79.8dB respectively, nearly identical. This is in contrast to previous state-of-the-art modulators of similar BW [4-6] (Fig. 14.1.6) whose peak SNDRs are significantly lower than peak SNRs. Also listed for comparison is the state-of-the-art THD for a wide-BW CTDSM (-101dBc) [3]. Note, however, that it realizes a lower BW (25MHz) and employs single-bit feedback, resulting in a significant reduction in power efficiency. The total power consumption is 64.3mW, resulting in a FOM<sub>S</sub> of 168.7dB. A breakdown of the power consumption is shown in Fig. 14.1.5. The die micrograph is shown in Fig. 14.1.7, occupying an active area of 0.25mm<sup>2</sup>.

### Acknowledgements:

This work was supported by MediaTek Inc. The authors would like to thank the analog circuit design teams in MediaTek at Woburn and Hsinchu for technical support.

### References:

- [1] A. Sanyal, et al., "Dynamic element matching with signal-independent element transition rates for multibit  $\Delta\Sigma$  modulators," *IEEE TCAS-I*, vol. 62, no. 5, pp. 1325-1334, May 2015.
- [2] K. L. Chan, et al., "Dynamic element matching to prevent nonlinear distortion from pulse-shape mismatches in high-resolution DACs," *IEEE JSSC*, vol. 43, no. 9, pp. 2067-2078, Sept. 2008.
- [3] L. Breems, et al., "A 2.2GHz continuous-time  $\Delta\Sigma$  ADC with -102dBc THD and 25MHz BW," *ISSCC*, pp. 272-273, Feb. 2016.
- [4] S. Ho, et al., "A 23mW 73dB Dynamic Range 80MHz BW Continuous-Time Delta-Sigma Modulator in 20nm CMOS," *IEEE Symp. VLSI Circuits*, pp. 102-103, June 2014.
- [5] D.-Y. Yoon, et al., "A 85dB DR 74.6dB SNDR 50MHz BW CT MASH  $\Delta\Sigma$  Modulator in 28nm CMOS," *ISSCC*, pp. 272-273, Feb. 2015.
- [6] Y. Dong, et al., "A 235mW CT 0-3 MASH ADC achieving -167dBFS/Hz NSD with 53MHz BW," *ISSCC*, pp. 480-481, Feb. 2014.



Figure 14.1.1: Block diagram for CTDSM with dynamic error calibration and DELD compensation.

Figure 14.1.2: Two-measurement-based ISI determination.



Figure 14.1.3: Signal chain and DAC element with ISI correction.



|                         | This work | [5]     | [6]     | ISSCC 16 Wu | [4]       | ISSCC 17 Huang | [3]   |
|-------------------------|-----------|---------|---------|-------------|-----------|----------------|-------|
| Process (nm)            | 28        | 28      | 28      | 65          | 20        | 16             | 65    |
| Fs (GHz)                | 2         | 1.8     | 3.2     | 0.9         | 2.81      | 2.15           | 2.2   |
| BW (MHz)                | 50        | 50      | 45.7    | 45          | 80        | 125            | 25    |
| SNDR (dB)               | 79.8      | 74.6    | 72.6    | 75.3        | 67.5      | 71.9           | 77    |
| SNR (dB)                | 80.0      | 76.1    | 84.6    | 78.5        | 70        | 72.6           | 77    |
| DR (dB)                 | 82.8      | 85      | 90      | 82.5        | 73        | 74.8           | 77    |
| Supply Volt. (V)        | 1.16/1.5  | 1.2/1.5 | -1/1.8  | 1.2/1.8     | 1/1.2/1.8 | 1/1.35/1.5     | 1.2   |
| THD (dBc)               | -94.1     | -79.9** | -72.9** | -78.1**     | -71.1**   | -80            | -101  |
| Power (mW)              | 64.3      | 78      | 235     | 24.7        | 23        | 54             | 41.4  |
| Area (mm <sup>2</sup> ) | 0.25      | 0.34    | 0.9     | 0.16        | 0.1       | 0.217          | 0.25  |
| FOM1 (fJ/step)          | 80.5      | 177.7   | 737.5   | 57.7        | 74.2      | 67.2           | 143.1 |
| FOM2 (dB)               | 168.7     | 162.7   | 155.5   | 167.9       | 162.9     | 165.5          | 164.8 |

$$FOM_1 = P/(2 \cdot BW \cdot 2^{(SNDR-1.76)/6.02})$$

$$FOM_2 = SNDR + 10 \log_{10}(BW/P)$$

\*\* THD calculated based on differences between SNRs and SNDRs reported

Figure 14.1.6: Performance summary and comparison to state-of-the-art ADCs.
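The figures of merit in the table can be reproduced with a few lines of code, reading $FOM_1$ as $P/(2 \cdot BW \cdot 2^{ENOB})$ with $ENOB = (SNDR-1.76)/6.02$. The sketch below also shows one plausible way to obtain the "\*\*" THD entries from the gap between SNR and SNDR; that interpretation of the footnote, and all function names, are assumptions.

```python
import math

def walden_fom_fj(power_w, bw_hz, sndr_db):
    """FOM1 = P / (2 * BW * 2^ENOB), returned in fJ per conversion step."""
    enob = (sndr_db - 1.76) / 6.02
    return power_w / (2 * bw_hz * 2 ** enob) * 1e15

def schreier_fom_db(power_w, bw_hz, sndr_db):
    """FOM2 = SNDR + 10*log10(BW / P), in dB."""
    return sndr_db + 10 * math.log10(bw_hz / power_w)

def thd_from_snr_sndr(snr_db, sndr_db):
    """THD in dBc implied by the SNR/SNDR gap (assumed reading of '**')."""
    distortion = 10 ** (-sndr_db / 10) - 10 ** (-snr_db / 10)
    return 10 * math.log10(distortion)

# "This work" column: 64.3 mW, 50 MHz BW, 79.8 dB SNDR.
fom1 = walden_fom_fj(64.3e-3, 50e6, 79.8)
fom2 = schreier_fom_db(64.3e-3, 50e6, 79.8)
# Column [5]: 76.1 dB SNR and 74.6 dB SNDR.
thd_ref5 = thd_from_snr_sndr(76.1, 74.6)
print(f"{fom1:.1f} fJ/step, {fom2:.1f} dB, {thd_ref5:.1f} dBc")
```

The results agree with the 80.5fJ/step and 168.7dB entries for this work and, to within rounding, with the -79.9dBc entry listed for [5].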



Figure 14.1.7: Chip micrograph.

## 14.2 A 15.2-ENOB Continuous-Time $\Delta\Sigma$ ADC for a 7.3 $\mu$ W 200mV<sub>pp</sub>-Linear-Input-Range Neural Recording Front-End

Hariprasad Chandrakumar, Dejan Marković

University of California, Los Angeles, CA

Closed-loop neuromodulation with simultaneous stimulation and sensing is desired to advance deep brain stimulation (DBS) therapies. However, stimulation generates large artifacts ( $\sim$ 100mV) at the recording sites that saturate traditional front-ends. We present a 15.2b-ENOB CT  $\Delta\Sigma$ M with 187dB FOM, which along with an 8 $\times$ -gain capacitively coupled chopper instrumentation amplifier (CCIA), realizes a front-end that can digitize neural signals ( $<2$ mV<sub>pp</sub>) from 1Hz to 5kHz in the presence of 200mV<sub>pp</sub> artifacts. Neural recording front-ends need to function within a power budget of 10 $\mu$ W/ch, input-referred noise of 4-8 $\mu$ V<sub>rms</sub> in 1Hz-5kHz, DC input impedance  $Z_{in,DC} > 1$ G $\Omega$  and high-pass (HP) cutoff  $< 1$ Hz [1]. Prior work has addressed power and noise [1]-[2], but has limited dynamic-range and bandwidth (BW), making them incapable of performing true closed-loop operation.

Prior art has digitized neural signals without amplification, with a VCO-ADC in [1], and a CT  $\Delta\Sigma$ M in [2]. However, the BW was limited to 200Hz and 500Hz, respectively (required BW=5kHz), and the ENOB in [2] was limited to 10.2b (required  $>13$ b). Recent advances in artifact-tolerant CCIA [3] enable efficient amplification of neural signals while satisfying electrode-interface requirements. However, the front-end requires an ADC with ENOB $>15$ b, BW of 5kHz and power $<5$  $\mu$ W. For ENOB $>15$ b,  $\Delta\Sigma$ Ms are the most power efficient ADCs [4]-[6]. However, power consumption needs to be reduced by 4 $\times$  below the state-of-the-art to meet our requirements.

We chose a CT- $\Delta\Sigma$  topology for its inherent anti-aliasing and relaxed requirements for loop-filter opamps. Power consumption in CT- $\Delta\Sigma$ Ms is dominated by the 1<sup>st</sup> integrator and feedback-DAC, since they limit noise and linearity. Figure 14.2.1 shows typical implementations of CT- $\Delta\Sigma$ Ms. Gm-C-based integrators are power efficient, but suffer from nonlinearity. Source-degeneration improves linearity [4], but significantly increases noise. Active-RC integrators have good linearity [5], but need higher power to drive the RC feedback components. To meet noise requirements, the input resistance  $R_{in}$  (Fig. 14.2.1) must be small ( $\sim$ 200k $\Omega$ ). Hence,  $C_{int}$  is large for a given corner frequency, leading to higher power in the integrator opamp. A smaller  $R_{in}$  also increases the power consumption of the previous stage and the feedback-DAC due to increased loading. We add a 4 $\times$ -gain stage before the 1<sup>st</sup> integrator to relax its noise requirements (Fig. 14.2.1). Noise and linearity are now limited by the gain stage. The inverting amplifier with cap feedback (IACF) is used as a power-efficient gain stage. During DAC transitions, a decaying error signal ( $V_{er}$ ) appears at the virtual ground of the IACF. Large  $V_{er}$  causes distortion in the IACF. However, since  $V_{er}$  decays quickly, we employ time-varying source-degeneration in the input diff-pair of the IACF. This linearizes the IACF for large  $V_{er}$  while maintaining low-noise performance, since degeneration is used only for a short duration ( $T_s/16$ ).  $C_{in}$  and  $C_f$  are kept small in the IACF (Fig. 14.2.1) to reduce loading on the previous stage (CCIA) and the feedback-DAC, lowering power consumption.

With limited CCIA gain of 8, chopping is employed in the IACF to mitigate flicker noise (Fig. 14.2.2). To prevent aliasing of out-of-band quantization noise, the chopping frequency  $f_{ch}$  must be a multiple of (0.5) $\cdot f_s$ , where  $f_s$  is the oversampling frequency of the  $\Delta\Sigma$  [5]. For OSR = 40 and signal BW = 5kHz,  $f_s = 2f_{ch} = 400$ kHz. Hence, unlike the CCIA (where  $f_{ch} = 25$ kHz is sufficient [3]), the ADC requires a significantly higher  $f_{ch}$ . This causes distortion in the feedback-DAC, since the reference buffer cannot accurately drive a switching capacitive load. The buffer output will have large glitches at chopping-clock transitions, which couple to the virtual ground of the IACF leading to more distortion. Thus, we use a storage cap  $C_S = 14$ pF to assist the reference buffer in driving its switching load. At every chopping-clock or DAC transition, the buffer output is disconnected from the CDAC, and the cap  $C_S$  (pre-charged to  $V_{ref}$ ) charge-shares with the CDAC. This phase ( $\phi_1$  in Fig. 14.2.2) provides most of the charge to the load of the buffer. At the end of  $\phi_1$ , the buffer reconnects to the CDAC, and  $C_S$  connects to  $V_{ref}$  in preparation for the next chopping-clock transition. This reduces the load of the buffer, significantly reducing distortion. We also place dead-band switches at the input of  $g_m$  (Fig. 14.2.2) to prevent residual glitches from coupling to the  $g_m$  input. The duration of the dead-band phase  $\phi_{db}$  is only 12ns (generated by gate delays) as compared to the clock period (2.5 $\mu$ s), ensuring negligible noise contribution from the sampled noise on the gate cap of  $g_m$ .
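The clock relationships in this paragraph can be verified with simple arithmetic. The sketch below recomputes $f_s$, $f_{ch}$, the clock period, and the fraction of each period taken by the dead-band phase; the helper function and constant names are illustrative, not taken from the paper.

```python
OSR = 40
SIGNAL_BW_HZ = 5e3
DEAD_BAND_S = 12e-9

f_s = 2 * OSR * SIGNAL_BW_HZ   # oversampling clock: 400 kHz
f_ch = f_s / 2                 # lowest chopping frequency that is a multiple of f_s/2
t_s = 1 / f_s                  # clock period: 2.5 us
dead_band_fraction = DEAD_BAND_S / t_s

def chop_frequency_ok(f_chop, f_sample):
    """True if f_chop is a positive integer multiple of f_sample / 2."""
    ratio = f_chop / (f_sample / 2)
    return round(ratio) >= 1 and abs(ratio - round(ratio)) < 1e-9

print(f_s, f_ch, t_s, dead_band_fraction)
print(chop_frequency_ok(f_ch, f_s), chop_frequency_ok(25e3, f_s))
```

The dead-band phase occupies under 0.5% of the clock period, and the 25kHz chopping frequency that suffices for the CCIA fails the multiple-of-$f_s/2$ condition for the ADC.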

Figure 14.2.3 shows the proposed neural recording front-end and the CT-$\Delta\Sigma$M, implemented in a 40nm CMOS technology. Figure 14.2.7 shows the chip micrograph. The $\Delta\Sigma$M and CCIA consume 4.5$\mu$W and 2.8$\mu$W, respectively, from a 1.2V supply. The $\Delta\Sigma$M uses a 3<sup>rd</sup>-order CIFF loop filter, and the quantizer is a 6b SAR-ADC. The gain of the IACF is set by $C_{in}/C_f = 4$, with $C_{in} = 4C_f = 128$fF. The CDAC unit cap $C_u$ was 2fF, and data-weighted averaging (DWA) provided 1<sup>st</sup>-order shaping of CDAC errors. A half-cycle delay helps the convergence of SAR and DWA logic, and excess loop-delay (ELD) compensation was implemented using $C_{11}$ (Fig. 14.2.3). Chopping is also used in the 1<sup>st</sup> integrator to mitigate flicker noise, since the IACF only provides a small gain of 4. The IACF, reference buffer, 1<sup>st</sup>, 2<sup>nd</sup> and 3<sup>rd</sup> integrators consume 1.56, 0.6, 0.6, 0.3 and 0.3$\mu$W, respectively. The SAR ADC and DWA logic consume 1.1$\mu$W. In the loop-filter, $R_1 = 3$M$\Omega$, $C_1 = 1$pF, $R_2 = 25$M$\Omega$, $C_2 = 300$fF, $R_3 = 60$M$\Omega$ and $C_3 = 200$fF. Duty-cycled resistors $R_{2,3}$ were used to reduce area [3], with 1M$\Omega$ passive resistors needed to achieve the required resistance. The clock frequency for $R_{2,3}$ is a multiple of $f_s$ to prevent noise aliasing. Loop-filter opamps were implemented as 2-stage Miller-compensated opamps. The reference buffer was a flipped voltage follower providing $>35$dB noise suppression from the power supply. The CCIA (Fig. 14.2.3), with 8$\times$-gain, used concepts from [3] to achieve a large linear input-range, low noise, low power, high $Z_{in,DC}$ and tolerance to large common-mode interferences.
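As a sanity check on the numbers above, the following sketch adds up the per-block power figures and computes the RC time constants of the three integrators. The duty-cycle estimate assumes the usual first-order model in which a resistor switched with duty cycle $D$ behaves like $R/D$; that model and the resulting duty cycles are an inference for illustration, not values stated in the text.

```python
import math

# Per-block power of the modulator in microwatts, as listed in the text.
POWER_UW = {
    "IACF": 1.56,
    "reference buffer": 0.6,
    "integrator 1": 0.6,
    "integrator 2": 0.3,
    "integrator 3": 0.3,
    "SAR ADC + DWA": 1.1,
}
total_uw = sum(POWER_UW.values())

# Loop-filter components (ohms, farads) for integrators 1 to 3, from the text.
LOOP_FILTER = {1: (3e6, 1e-12), 2: (25e6, 300e-15), 3: (60e6, 200e-15)}

time_constants = {k: r * c for k, (r, c) in LOOP_FILTER.items()}
unity_gain_hz = {k: 1.0 / (2 * math.pi * tau) for k, tau in time_constants.items()}

# ASSUMPTION: effective resistance of a duty-cycled resistor is R_passive / D.
R_PASSIVE = 1e6
implied_duty = {k: R_PASSIVE / LOOP_FILTER[k][0] for k in (2, 3)}

print(f"total modulator power = {total_uw:.2f} uW")
for k in sorted(LOOP_FILTER):
    print(f"integrator {k}: tau = {time_constants[k] * 1e6:.1f} us, "
          f"unity-gain frequency = {unity_gain_hz[k] / 1e3:.1f} kHz")
for k, d in implied_duty.items():
    print(f"R_{k}: implied duty cycle = {d * 100:.2f}% (assumed R/D model)")
```

The block powers sum to 4.46$\mu$W, which matches the quoted 4.5$\mu$W for the $\Delta\Sigma$M after rounding.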

Figure 14.2.4 shows the output PSD of the ADC (without CCIA) for an input amplitude of $-1.9$dBFS at 1kHz, with full scale (FS) = $V_{ref} = 1.1$V. The peak SNDR is 93.5dB (15.2b ENOB) in a signal BW of 5kHz, and the dynamic range is 96.5dB. When the proposed techniques (time-varying degeneration, reference-buffer assist and dead-band switches) are disabled, the SNDR drops to 60.4dB due to increased distortion, demonstrating the efficacy of our techniques.
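The ENOB figure follows from the standard relation $\mathrm{ENOB} = (\mathrm{SNDR} - 1.76)/6.02$. A minimal sketch (the function name `enob` is ours) that applies it to both measured SNDR values:

```python
def enob(sndr_db):
    """Effective number of bits from SNDR in dB, using the standard sine-wave relation."""
    return (sndr_db - 1.76) / 6.02


enob_enabled = enob(93.5)    # proposed techniques enabled
enob_disabled = enob(60.4)   # proposed techniques disabled

print(f"ENOB with techniques enabled:  {enob_enabled:.1f} b")
print(f"ENOB with techniques disabled: {enob_disabled:.1f} b")
print(f"SNDR improvement: {93.5 - 60.4:.1f} dB")
```

By this relation, the 33.1dB SNDR difference between the two cases corresponds to roughly 5.5 bits of effective resolution.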

Figure 14.2.5 shows measurements for the complete front-end (CCIA+ADC). The HP corner is set to 0.1Hz by the CCIA to filter $\pm 100$mV electrode offsets, and the input-referred noise of the front-end is 1.8$\mu$V<sub>rms</sub> (1Hz – 200Hz) and 6.35$\mu$V<sub>rms</sub> (1Hz – 5kHz). $Z_{in,DC}$ is 1.5G$\Omega$, and the THD for a 200mV<sub>pp</sub> input at 1kHz is $-81$dB.
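For a rough sense of scale, the ratio between the largest linear input and the input-referred noise can be estimated from the numbers above. This is a back-of-the-envelope sketch, not a reported result: it assumes a full-scale sine of 200mV<sub>pp</sub> and simply divides its rms value by the integrated noise in each band.

```python
import math


def sine_rms(v_pp):
    """RMS value of a sine wave with peak-to-peak amplitude v_pp."""
    return v_pp / (2 * math.sqrt(2))


def signal_to_noise_db(v_pp, noise_rms):
    """Ratio of full-scale sine rms to integrated noise rms, in dB."""
    return 20 * math.log10(sine_rms(v_pp) / noise_rms)


# Values from the text: 200 mVpp linear input, noise integrated over two bands.
ratio_1hz_5khz = signal_to_noise_db(200e-3, 6.35e-6)
ratio_1hz_200hz = signal_to_noise_db(200e-3, 1.8e-6)

print(f"1 Hz to 5 kHz:  {ratio_1hz_5khz:.1f} dB")
print(f"1 Hz to 200 Hz: {ratio_1hz_200hz:.1f} dB")
```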

We compare our work to the state-of-the-art (Fig. 14.2.6) using $FOM_{S,DR} = DR + 10\log(BW/Power)$ and $FOM_{S,SNDR} = SNDR + 10\log(BW/Power)$. Our ADC achieves the highest $FOM_{S,DR}$ (187dB) to date, and $>6$dB higher $FOM_{S,SNDR}$ (184dB) than prior ADCs with ENOB $>12$b, which corresponds to $>4\times$ lower power consumption. We also compare our front-end with state-of-the-art front-ends intended for closed-loop recording. Our front-end achieves 2$\times$ higher linear input range, 10$\times$ higher BW and $>12.6$dB higher dynamic range for comparable power, noise and area. Although $FOM_S$ is normally used to compare ADCs, we also use it to compare front-ends, as it captures the tradeoffs between power, bandwidth, noise and linearity, all essential parameters for a closed-loop neural recording system. Our front-end achieves $>16.8$dB higher FOM compared to the state-of-the-art, which corresponds to 48$\times$ lower power consumption.
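The figures of merit quoted above can be reproduced directly from the measured values (DR = 96.5dB, SNDR = 93.5dB, BW = 5kHz, ADC power = 4.5$\mu$W). The sketch below also converts a FOM difference in dB into the equivalent power ratio at equal bandwidth and resolution; the function names are ours.

```python
import math


def schreier_fom(metric_db, bw_hz, power_w):
    """FOM_S = metric + 10*log10(BW / Power), with metric = DR or SNDR in dB."""
    return metric_db + 10 * math.log10(bw_hz / power_w)


def power_ratio(fom_delta_db):
    """Power ratio equivalent to a FOM difference, at equal BW and DR or SNDR."""
    return 10 ** (fom_delta_db / 10)


BW_HZ = 5e3
ADC_POWER_W = 4.5e-6

fom_dr = schreier_fom(96.5, BW_HZ, ADC_POWER_W)
fom_sndr = schreier_fom(93.5, BW_HZ, ADC_POWER_W)

print(f"FOM_S,DR   = {fom_dr:.1f} dB")
print(f"FOM_S,SNDR = {fom_sndr:.1f} dB")
print(f"6 dB FOM gap    -> {power_ratio(6.0):.1f}x power")
print(f"16.8 dB FOM gap -> {power_ratio(16.8):.0f}x power")
```

Because the FOM is logarithmic in power, every 10dB of improvement corresponds to a 10$\times$ power reduction at the same bandwidth and resolution, which is how the 6dB and 16.8dB gaps translate to roughly 4$\times$ and 48$\times$.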

### Acknowledgments:

The authors thank Prof. S. Pamarti, Dr. V. Karkare and W. Jiang for technical discussions, Dr. V. Hokhikyan and Sida Li for testing support, Dr. I. Fried, Prof. N. Suthana, and Prof. R. Staba for human LFP and spike data, and Lawrence Livermore National Lab for electrodes. This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions, and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

### References:

- [1] W. Jiang, et al., "A $\pm 50$-mV Linear-Input-Range VCO-Based Neural-Recording Front-End With Digital Nonlinearity Correction," *IEEE JSSC*, vol. 52, pp. 173-184, Jan. 2017.
- [2] B. C. Johnson, et al., "An implantable 700$\mu$W 64-channel neuromodulation IC for simultaneous recording and stimulation with rapid artifact recovery," *IEEE Symp. VLSI Circuits*, pp. C48-C49, June 2017.
- [3] H. Chandrakumar and D. Markovic, "A 2.8$\mu$W 80mV<sub>pp</sub>-linear-input-range 1.6G$\Omega$-input impedance bio-signal chopper amplifier tolerant to common-mode interference up to 650mV<sub>pp</sub>," *ISSCC*, pp. 448-449, Feb. 2017.
- [4] I. Ahmed, et al., "A low-power Gm-C-based CT-$\Delta\Sigma$ audio-band ADC in 1.1V 65nm CMOS," *IEEE Symp. VLSI Circuits*, pp. C294-C295, June 2015.
- [5] S. Billa, et al., "Analysis and Design of Continuous-Time Delta-Sigma Converters Incorporating Chopping," *IEEE JSSC*, vol. 52, no. 9, pp. 2350-2361, Sept. 2017.
- [6] B. Gönen, et al., "A Dynamic Zoom ADC With 109-dB DR for Audio Applications," *IEEE JSSC*, vol. 52, no. 6, pp. 1542-1550, June 2017.