

# A 3T1R Nonvolatile TCAM Using MLC ReRAM for Frequent-Off Instant-On Filters in IoT and Big-Data Processing

Meng-Fan Chang, *Senior Member, IEEE*, Chien-Chen Lin, Albert Lee, Yen-Ning Chiang, Chia-Chen Kuo, Geng-Hau Yang, Hsiang-Jen Tsai, Tien-Fu Chen, *Member, IEEE*, and Shyh-Shyuan Sheu

**Abstract**—Existing nonvolatile ternary content-addressable-memory (nvTCAM) suffers from limited word-length (WDL), large write-energy ( $E_W$ ) and search-energy ( $E_S$ ), and large cell area (A). This paper develops a 3T1R nvTCAM cell using a single multiple-level cell (MLC)-resistive RAM (ReRAM) device to achieve long WDL, lower  $E_W$  and  $E_S$ , and reduced cell area. Two peripheral control schemes were developed, dual-replica-row self-timed and invalid-entry power consumption suppression (IEPCS), for the suppression of dc current in 3T1R nvTCAM cells in order to reduce  $E_S$ . Two versions of the IEPCS scheme were developed (basic and charge-recycle-controlled) to alter the tradeoff between area overhead and power consumption in the updating of invalid-bits. A 128 b  $\times$  64 b 3T1R nvTCAM macro was fabricated using back-end-of-line ReRAM under 90-nm CMOS process. The fabricated MLC-based 3T1R nvTCAM macro achieved sub-1-ns search-delay and sub-6-ns wake-up time with supply voltage of 1 V and WDL = 64 b.

**Index Terms**—Nonvolatile memory (NVM), nonvolatile ternary content-addressable-memory (nvTCAM), resistive RAM (ReRAM), TCAM.

## I. INTRODUCTION

The Internet-of-things (IoT) and big-data (BD) processing are driving demand for reductions in the energy consumption associated with data transmission between cloud servers and local devices, as well as for solutions to issues related to a discontinuous power supply. The transmission of invalid data between local devices and cloud servers accounts for a major proportion of the overall power consumption in IoT and BD applications. Filters capable of identifying valid data make it possible to reduce the amount of data that must be sent to the cloud server by wireless or wireline transmission, thereby reducing overall power consumption. Most filter-based IoT and BD devices employ ternary content-addressable-memory (TCAM) [1]–[3] with pre-stored signature patterns as filters [4], [5]. However, reductions in transmission power can be fully realized only by taking advantage of the high filtering rates afforded by high-capacity filters.

Manuscript received August 26, 2016; revised November 5, 2016 and January 24, 2017; accepted March 1, 2017. Date of publication April 7, 2017; date of current version May 23, 2017. This paper was approved by Associate Editor Hideto Hidaka. This work was supported by the Ministry of Science and Technology, Taiwan. (Corresponding author: Meng-Fan Chang.)

M.-F. Chang and Y.-N. Chiang are with the Department of Electrical Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan (e-mail: mfchang@ee.nthu.edu.tw).

C.-C. Lin is with Taiwan Semiconductor Manufacturing Company, Hsinchu 300-78, Taiwan.

A. Lee is with the University of California, Los Angeles, CA 90095 USA.

C.-C. Kuo and S.-S. Sheu are with the Electronics and Optoelectronics System Research Laboratories, Industrial Technology Research Institute, Hsinchu 31040, Taiwan.

G.-H. Yang, H.-J. Tsai, and T.-F. Chen are with National Chiao Tung University, Hsinchu 30010, Taiwan.

Color versions of one or more of the figures in this paper are available online at <http://ieeexplore.ieee.org>.

Digital Object Identifier 10.1109/JSSC.2017.2681458

For IoT applications, the ratio of standby time to active time for a chip or sub-block varies with the application. Numerous applications involve a long standby time but a short chip (or sub-block) activation period. For example, the gas sensor chips in [29] and [30] have standby periods 100–1000× longer than their computing periods for various applications. The excessive consumption of standby current by nanometer chips has led to the adoption of energy-efficient devices based on intelligent power ON–OFF or normally OFF operations. However, these features are usually implemented in two-macro schemes aimed at preventing the loss of critical data in power-OFF mode. At the beginning of power-OFF mode, data is moved from volatile memory macros to nonvolatile memory (NVM) using devices such as eFlash [6]–[10]. In power-ON mode, data is restored from the NVM macro to the volatile memory macro to facilitate high-speed computing operations through accessing volatile memory. Unfortunately, limited I/O bandwidth can result in long delays and high power consumption associated with the word-by-word serial movement of data between NVM and volatile-memory macros [11]–[16].

Single-macro nonvolatile TCAMs (nvTCAM) use NVM devices for power-OFF pattern-storage as well as power-ON pattern-search comparisons. These devices were first developed as a means to eliminate the need for the movement of data between SRAM-based TCAM (sTCAM) [1]–[3] and NVM macros during power ON–OFF operations. nvTCAM devices enable faster operations and lower energy consumption in power ON/OFF operations than do conventional two-macro schemes based on sTCAM and NVM macros. Furthermore, the use of fewer components means that nvTCAM devices are far smaller than typical sTCAM, most of which require 16 transistors (16Ts) per cell.

Most of the signature patterns in TCAM for IoT/BD applications are updated (written) very rarely, which makes it possible to use endurance-limited NVM devices in the implementation of nvTCAM cells. Previous nvTCAMs were designed using diode-connected 6T2R (D6T2R)



Fig. 1. (a) HfO<sub>x</sub>-based bipolar-type ReRAM device used in this paper. (b) Cross-sectional view of the nvTCAM cell using two BEOL-based NVM devices. (c) Normalized cell area of nvTCAM cells versus write current.

[17] and 4T2R (D4T2R) [18] with STT-MTJ, 2T2R [19] with phase change memory (PCM), or RC-filtered stress-decoupled (RCSD)-4T2R [4] with resistive RAM (ReRAM). These nvTCAMs suffer from a number of issues.

- 1) Large cell area ( $A$ ) and high-write energy ( $E_W$ ) associated with the use of two NVM (2R) devices.
- 2) Limited WDL ( $k$ -bits) induced by small match-line (ML) current-ratio ( $I_{ML\text{-ratio}} = I_{ML\text{-MIS}} / (k \times I_{ML\text{-M}})$ ) between the ML mismatch current ( $I_{ML\text{-MIS}}$ ) and ML leakage current of  $k$  matched cells ( $k \times I_{ML\text{-M}}$ ).
- 3) Long search delays ( $T_{SD}$ ) and excessive search energy ( $E_S$ ) resulting from large ML parasitic load ( $C_{ML}$ ) and small  $I_{ML\text{-ratio}}$ .

Embedded ReRAM [20]–[25] shows considerable promise for use in nvTCAM, thanks to its low  $E_W$ , high single-level resistance-ratio ( $R_{SLC}$ -ratio), and multiple-level cell (MLC) capability.  $R_{SLC}$ -ratio refers to the ratio between the resistance values of an NVM device in the high-resistance state (HRS) and in the low-resistance state (LRS); i.e.,  $R_{HRS} / R_{LRS}$ .

To overcome problems 1)–3), this study develops a 3T1R nvTCAM cell [5] using an MLC NVM device. We also developed a dual-replica-row self-timed (DR<sup>2</sup>ST) scheme and two versions of an invalid-entry power consumption suppression (IEPCS) scheme for our 3T1R nvTCAM cell. A 128 b × 64 b 3T1R nvTCAM macro was then fabricated to evaluate the developed concepts.

The remainder of this paper is organized as follows. Section II discusses the background of nvTCAMs. Section III outlines the 3T1R nvTCAM cell. Section IV describes the macro structure and control circuits of the 3T1R nvTCAM. Section V presents experiment results and a comparison with existing devices. Conclusions are drawn in Section VI.

| Write Operation | TE        | BE          | R Value |
|-----------------|-----------|-------------|---------|
| SET to LRS      | $V_{SET}$ | 0V          | Low     |
| SET to MRS      | $V_{SET}$ | 0V          | Middle  |
| RESET           | 0V        | $V_{RESET}$ | High    |

(a)

## II. CHALLENGES OF NONVOLATILE TCAM

### A. Write Energy

1) *Write Energy of nvTCAM*: Several emerging NVM devices (PCM, STT-MRAM, and ReRAM) have been shown to reduce write energy to levels below those of conventional floating-gate based embedded flash (eFlash) devices, thanks to lower write voltages and faster write speeds. These devices are placed above transistors using back-end-of-line (BEOL) processes.

These emerging NVM devices use SET and RESET operations to perform direct over-write operations. The SET operation changes an NVM cell from HRS to LRS, whereas the RESET operation changes the device from LRS to HRS. PCM and ReRAM devices are capable of existing in MLC states. Fig. 1(a) illustrates the cell structure and write conditions of the HfO<sub>x</sub>-based ReRAM device used in this paper, where  $V_{SET}$  and  $V_{RESET}$  indicate the voltage requirements for SET and RESET operations, respectively. These values generally vary according to the specifics of the NVM device.

Table I lists a number of previous silicon-verified nvTCAM cells. Emerging memory devices provide write performance superior to that of conventional eFlash devices; however, the energy consumed by write operations is still significant and exceeds that of an SRAM cell. The use of two NVM devices in 2-NVM nvTCAM greatly increases the energy consumed in write operations ( $E_W$ ); however, MLC single-NVM nvTCAM devices show considerable promise in reducing write energy.

TABLE I  
PREVIOUS SILICON-VERIFIED nvTCAM CELLS

| Cell Structure                       | Conv. 16T SRAM | 2T2R                        | D4T2R                                     | RCSD-4T2R                   | 3T1R (This work)                 |
|--------------------------------------|----------------|-----------------------------|-------------------------------------------|-----------------------------|----------------------------------|
| Search Scheme ( $V_{ML}$ generation) | Digital Driver | $V_{ML}/R$ + 2-bit encoding | $V_R$ + Diode ( $= I_{P1} \cdot R + V_{D1}$ ) | RC filtered + Analog Driver | Resistor divider + Analog Driver |
| NVM used in original design          | -              | PCM                         | STT-MRAM                                  | ReRAM                       | ReRAM                            |
| Features                             | High-speed     | Ultra-high density          | Small R-ratio, long WDL                   | Small stress, long WDL      | High-density, long WDL,          |

MLC NVM macros require four logic states for data storage; however, nvTCAMs require only three (1, 0, don't-care). Thus, an MLC nvTCAM requires only one middle resistance state (MRS) for don't-care ( $X$ ) between the HRS (logic-0) and LRS (logic-1) states. This relaxes the requirements for R-ratios between neighboring MLC states ( $R_{MLC}$ -ratio =  $R_{HRS}/R_{MRS}$  or  $R_{MRS}/R_{LRS}$ ).
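The three-state mapping described above can be sketched in a few lines of Python. The function name `encode_pattern` and the string representation of a ternary pattern are assumptions made here for illustration; they are not part of the fabricated design.

```python
from typing import List

# Ternary-to-resistance mapping stated in the text:
# logic-0 -> HRS, logic-1 -> LRS, don't-care (X) -> MRS.
STATE_OF_SYMBOL = {"0": "HRS", "1": "LRS", "X": "MRS"}


def encode_pattern(pattern: str) -> List[str]:
    """Map a ternary pattern such as '10X1' to one resistance state per cell.

    Hypothetical helper: one 3T1R cell (one ReRAM device) stores one symbol.
    """
    states = []
    for symbol in pattern.upper():
        if symbol not in STATE_OF_SYMBOL:
            raise ValueError(f"unsupported ternary symbol: {symbol!r}")
        states.append(STATE_OF_SYMBOL[symbol])
    return states


if __name__ == "__main__":
    print(encode_pattern("10X1"))
```

Only one intermediate state is needed per device, which is why the cell needs a single MRS rather than the two intermediate states of a four-level MLC memory.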

### B. Cell Area

Fig. 1(b) presents a partial cross section of an nvTCAM cell using two BEOL-based NVM devices (2NVM structure). The implementation of two NVM devices in a cell greatly increases the cell area, due to the need to accommodate spacing and wire routing. Many ReRAM devices require significant write current ( $I_{SET}$ ) to achieve long data retention times and large R-ratio, both of which are critical to the reliability, life span, and search operations of nvTCAM devices. Thus, large transistors are generally used to provide the write current required for NVM devices in nvTCAM cells. Fig. 1(c) presents the normalized cell area of nvTCAM cells with various write currents. To enable a fair comparison, all nvTCAM cells in Fig. 1(c) employ the same NVM device and transistor sizing policy: 1) the sizes of the transistors on the write path in each nvTCAM cell were selected to satisfy the required write current and 2) the same ML-driver transistor was used in each nvTCAM cell. Reducing the area consumed by transistors, reducing the number of plates, and simplifying the wire routing between NVM devices and transistors make the cell area of 3T1R significantly smaller than that of the RCSD-4T2R cell. Clearly, the use of a single NVM device could greatly reduce the area overhead of nvTCAM cells.

### C. Word Length and Search Delay ( $T_{SD}$ )

Fig. 2(a) presents the structure of a typical TCAM cell array, comprising differential search-data lines (DL and DLB) in the column direction and MLs in the row direction. Prior to search operations, each ML is biased at a pre-charge voltage ( $V_{PRE}$ ).

During search operations, ML voltage ( $V_{ML}$ ) is determined by comparing the input search data (SDIN) and data stored in TCAM cells. In the case of a mismatch operation, where the data stored in a TCAM cell differs from that on DL/DLB, ML discharge current ( $I_{ML\_MIS}$ ) can cause a significant voltage swing/drop ( $V_{MLSW}$ ), which develops on the ML after the period indicated by the ML developing time ( $T_{ML}$ ). In the case of a match operation, where the data stored in a TCAM cell matches the data on DL/DLB, the ML does not undergo a voltage drop because the generation of  $I_{ML\_MIS}$  is disabled, as long as the ML leakage current ( $I_{ML\_M}$ ) of a match cell is small.

Ensuring robust search operations requires that the difference in ML voltage ( $\Delta V_{MLSW}$ ) between cases of match ( $V_{MLSW\_M}$ ) and worst-case mismatch ( $V_{MLSW\_MIS\_1b}$ ) exceeds the input offset voltage of an ML sense amplifier (ML-SA). The data-pattern that generates the smallest (worst-case)  $\Delta V_{MLSW}$  in mismatch operations is the “1-bit mismatch,” which means that there is only one discharge path (one TCAM cell) generating ML voltage swing.

Most 16T sTCAM have high  $I_{ML\_MIS}$  and ultra-low  $I_{ML\_M}$ , thanks to the digital control (ON/OFF) of the comparison transistor, as Fig. 2(b) shows. This makes it possible for sTCAM to maintain a high ML current-ratio ( $I_{ML}$ -ratio =  $I_{ML\_MIS}/(k \cdot I_{ML\_M})$ ) between mismatch current ( $I_{ML\_MIS}$ ) and the ML leakage current of  $k$  matched cells ( $k \cdot I_{ML\_M}$ ). This leads to a long maximum WDL ( $k$ -bits) and shorter search times ( $T_{SD}$ ), where  $I_{ML\_MIS} \gg I_{ML\_M} \approx 0$ .

Most resistive-type NVM devices have a limited R-ratio and a worst-case  $R_{HRS}$  far below the channel-impedance of a transistor in the cutoff region. Thus, nvTCAM performs analog-based comparisons, rather than the digital ON-OFF comparison of sTCAM. The analog control of nvTCAM involves a small  $I_{ML}$ -ratio resulting from large  $I_{ML\_M}$  and small  $I_{ML\_MIS}$ , particularly in cases where the R-ratio is small and the worst-case (tail-bit)  $R_{HRS}$  is low. This small  $I_{ML}$ -ratio prevents nvTCAM cells from achieving short  $T_{SD}$  and long WDL (large  $k$ ), as shown in Fig. 2(c).
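To make the link between the ML current ratio and the achievable word length concrete, the following Python sketch evaluates $I_{ML}$-ratio $= I_{ML\_MIS}/(k \cdot I_{ML\_M})$ and the largest $k$ that still meets a required ratio. The function names, the required-ratio parameter `min_ratio`, and the numbers in the example are hypothetical; the source defines the ratio but does not state a specific threshold.

```python
def ml_current_ratio(i_mis: float, i_match: float, k: int) -> float:
    """I_ML-ratio = I_ML_MIS / (k * I_ML_M) for a word of k cells."""
    if k <= 0 or i_match <= 0:
        raise ValueError("k and i_match must be positive")
    return i_mis / (k * i_match)


def max_word_length(i_mis: float, i_match: float, min_ratio: float) -> int:
    """Largest k whose I_ML-ratio is still at least min_ratio.

    min_ratio is an assumed design target (set by the sense amplifier and
    sensing margin); it is not a value taken from the paper.
    """
    if i_match <= 0 or min_ratio <= 0:
        raise ValueError("i_match and min_ratio must be positive")
    return int(i_mis / (min_ratio * i_match))


if __name__ == "__main__":
    # Currents in arbitrary units, chosen only to illustrate the trend.
    print(max_word_length(i_mis=64.0, i_match=0.5, min_ratio=2.0))   # low leakage
    print(max_word_length(i_mis=64.0, i_match=4.0, min_ratio=2.0))   # high leakage
```

In this simple model, an eightfold increase in match-cell leakage shrinks the supported word length by the same factor, which is the mechanism by which a low tail-bit $R_{HRS}$ limits WDL.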

The previous 2T2R cell uses an ON-transistor with a series-connected NVM device to determine  $I_{ML}$ , which depends on the resistance of the NVM along a selected path. A small  $I_{ML}$ -ratio means that compact-area 2T2R cells require 2-bit encoding to extend the WDL and a sense amplifier to provide tolerance for small  $\Delta V_{ML}$  to make the sensing behavior more robust.

Fig. 2. (a) Structure of typical TCAM and nvTCAM. Search waveforms of (b) sTCAM with small  $I_{ML-M}$  and (c) nvTCAM with significant  $I_{ML-M}$ .

Fig. 3. (a) Schematic and (b) layout of 3T1R nvTCAM cell.

The previous D4T2R cell uses a unidirectional voltage-divider to generate dynamic voltage ( $V_X$ ). This process depends on the resistance of NVM along a selected path to turn ON/OFF the diode-connected NMOS ML driver (NML) and then generate ML voltage ( $V_{ML} \approx V_{D1} + V_X$ ). The fact that NVM devices are placed along the  $I_{ML-MIS}$  path limits the  $I_{ML}$ -ratio and WDL of D4T2R because of read-disturb considerations.

The previous RCSD-4T2R cell uses the difference in  $RC$  delay between HRS and LRS cells to generate different gate voltages ( $V_{Nx}$ ) in the ML-driver (NML) for search operations. Although this approach achieves a larger  $I_{ML}$ -ratio than that of a 2T2R cell, it greatly increases the cell area and energy required for write operations.

Imposing a large parasitic load ( $C_{ML}$ ) on an ML causes long search delays ( $T_{SD}$ ) and excessive search energy ( $E_S$ ). Previous 2T2R and D4T2R cells include two terminals connected to the ML within a single cell, which results in large  $C_{ML}$ . The previous RCSD-4T2R cell uses a transistor with only one terminal connected to the ML in order to reduce  $C_{ML}$ ; however, this approach requires considerable energy for toggling between multiple control signals in search operations.

Fig. 4. Write operation of 3T1R cell. (a) Activated path and bias conditions. Waveforms of (b) single-step and (c) two-step write operations.

Thus, the development of a compact-area and energy-efficient nvTCAM should be based on a single NVM device with the aim of achieving high  $I_{ML}$ -ratio with small  $C_{ML}$ .

## III. 3T1R nvTCAM CELL

### A. Structure of 3T1R nvTCAM Cell

Fig. 3 presents a schematic and the layout of the 3T1R nvTCAM cell, comprising an MLC NVM device (R1), nMOS transistor (N1), pMOS transistor (P1), and nMOS ML-driver (NML). The gate and source of N1 are connected to nMOS-data-line (DLn) and nMOS-source-line (SLn), respectively. The gate and source of P1 are connected to pMOS-data-line (DLp) and pMOS-source-line (SLp), respectively. The drain terminal (NX) of P1 and N1 is connected to the gate of NML. R1 is placed between the NX and the column-based resistor-source-line (SLr). The NVM device (R1) relies on three resistance states; i.e., LRS, MRS, and HRS.

The use of a single NVM device makes it possible to fabricate the 3T1R cell using far less area than that required for 2-NVM-based nvTCAM cells, particularly when significant write current is required for the NVM. The 3-D structure created by stacking a single NVM device on the transistors enables a 2.27× reduction in cell area, compared to conventional 16T sTCAM cells, using the same logic layout rules. The use of a single NVM device reduces the NVM-write energy consumption of the 3T1R cell to below that of 2-NVM nvTCAM cells. The use of a single NML for each cell also makes it possible to achieve small ML loading and a large ON/OFF current ratio.

Fig. 5. Search operation of 3T1R cell. (a) Voltage bias conditions. (b) Waveforms of search-0 (S0), and search-1 (S1) operations.

Fig. 6. Macro structure of 3T1R nvTCAM.

### B. Write Operations

Fig. 4 illustrates the conditions and waveforms associated with the write operation of the 3T1R nvTCAM cell.



Fig. 7.  $T_{ML}$  required for search-1 and search-0 operations across various global process corners at  $WDL = 64$ .

The R1 that is used here is capable of three write operations: write-0 (W0), write-1 (W1) and write-don't-care (WX).

In the W1 operation, R1 is SET to LRS, in which  $DL_n = V_{G-SET-L}$ ,  $SL_r = V_{SET}$  and  $SL_n = 0$  V, with a sufficient amount of  $I_{SET}$  flowing from  $SL_r$  through R1 and N1 into  $SL_n$ .

In the W0 operation, R1 is RESET to HRS, in which  $DL_n = DL_p = SL_r = 0$  V and  $SL_p = V_{RESET}$  with RESET current ( $I_{RESET}$ ) flowing from  $SL_p$  through P1 and R1 into  $SL_r$ .

The WX operation has two device-dependent operating modes: single-step and two-step.

If the NVM device is able to over-write directly to MRS from either HRS or LRS, then the WX operation is implemented using the single-step approach. As shown in Fig. 4(b), R1 is SET to an MRS by applying  $DLn = V_{G-SET-M}$ ,  $SLr = V_{SET}$ , and  $SLn = 0$  V.

Fig. 8. (a) Structure and waveforms of DR<sup>2</sup>ST with (b)  $p > 1$  and (c)  $p = 1$ .

If the NVM device is only able to overwrite directly to MRS from HRS, then the WX operation requires the two-step approach. As shown in Fig. 4(c), in the first phase, R1 is RESET to HRS under the same conditions as a W0 operation. In the second phase, R1 is SET (SET-M) to an MRS by applying  $DLn = V_{G-SET-M}$ ,  $SLr = V_{SET}$ , and  $SLn = 0$  V.
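The write sequences above can be summarized as a small lookup. The following Python sketch returns the ordered bias steps for W0, W1, and WX; the function name `write_steps`, the dictionary layout, and the flag `direct_overwrite_to_mrs` are illustrative assumptions, and only the signals whose levels are stated in the text are listed (voltages are kept symbolic).

```python
from typing import Dict, List


def write_steps(data: str, direct_overwrite_to_mrs: bool = True) -> List[Dict[str, str]]:
    """Ordered bias steps for writing '0', '1', or 'X' into a 3T1R cell.

    direct_overwrite_to_mrs=True models a device that can reach MRS from
    either HRS or LRS (single-step WX); False models a device that can
    reach MRS only from HRS (two-step WX: RESET, then SET-M).
    """
    # RESET to HRS: current flows SLp -> P1 -> R1 -> SLr.
    reset_to_hrs = {"op": "RESET", "DLn": "0", "DLp": "0", "SLr": "0", "SLp": "V_RESET"}
    # SET to LRS: current flows SLr -> R1 -> N1 -> SLn.
    set_to_lrs = {"op": "SET-L", "DLn": "V_G-SET-L", "SLr": "V_SET", "SLn": "0"}
    # SET to MRS: same path, different gate voltage on N1.
    set_to_mrs = {"op": "SET-M", "DLn": "V_G-SET-M", "SLr": "V_SET", "SLn": "0"}

    if data == "0":
        return [reset_to_hrs]
    if data == "1":
        return [set_to_lrs]
    if data.upper() == "X":
        if direct_overwrite_to_mrs:
            return [set_to_mrs]
        return [reset_to_hrs, set_to_mrs]
    raise ValueError(f"unsupported write data: {data!r}")


if __name__ == "__main__":
    for step in write_steps("X", direct_overwrite_to_mrs=False):
        print(step)
```

The only difference between SET-L and SET-M in this description is the gate voltage applied to N1 through DLn, which is how a single SET polarity yields two resistance levels.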

Conventional MLC ReRAM macros have four resistance states, including two MRS (MRS1 and MRS2). This makes it necessary to employ a program-verify scheme in order to achieve sufficient sensing window. The 3T1R nvTCAM cell in this paper requires only one MRS, which means that it is better able to tolerate large variations in resistance in MRS than is MLC-memory. By using ReRAM devices with a high R-ratio, the proposed 3T1R nvTCAM does not require program verification for macros with medium capacity. For applications with large memory-capacity and long WDL, the use of ReRAM devices with a medium or small R-ratio necessitates a program-verify scheme for the proposed 3T1R cell.

Note that a higher write current increases the write power. An ultra-high write current would significantly degrade the endurance of ReRAM. Moreover, a multiple program-verify scheme could further degrade cell endurance. Generally, ReRAM devices have a specific range of write currents in which endurance degradation is small. ReRAM devices vary with regard to the tradeoff between R-ratio and endurance. Therefore, the write current must be carefully selected in accordance with the characteristics of the ReRAM device.

The fact that only one NVM device is used in a single cell means that the 3T1R cell consumes less  $E_W$  than previous nvTCAM cells that employ two NVM devices requiring the same R-ratio.

### C. Search Operation

Fig. 5 illustrates the search-0 (S0) and search-1 (S1) operations of 3T1R. In standby mode, the ML is precharged to the ML precharge voltage ( $V_{PRE}$ ) with  $SLr = SLn = DLn = 0$  V,  $DLp = V_{DD}$ , and  $SLp = VSLp$ .  $VSLp$  can be  $V_{DD}$  or a voltage lower than  $V_{DD}$ , depending on the CMOS technology. The voltage level on the NX node ( $V_{NX}$ ) is then maintained at 0 V, because  $SLr = 0$  V. During search operations, the  $DLp$ ,  $DLn$ , and  $SLr$  of each column are set to voltages corresponding to the SDIN. S1 and S0 then use the characteristics of a voltage-divider (in either the P1-R1 or R1-N1 path) to generate different  $V_{NX}$  values for match or mismatch operations.

As Fig. 5(b) shows, in an S1 operation,  $SLr = DLp = DLn = 0$  V, such that current flows from  $SLp$  through P1 and R1 to  $SLr$ . P1-R1 forms a forward voltage-divider to determine  $V_{NX}$ . If  $R1 = 0$  (HRS), then  $V_{NX}$  rises above the threshold voltage of NML ( $V_{TH-NML}$ ), thereby enabling NML to generate a large discharge current ( $I_{ML-MIS}$ ).  $I_{ML-MIS}$  causes the voltage on ML ( $V_{ML}$ ) to drop, which produces a voltage swing on ML ( $V_{MLSW}$ ), resulting in a mismatch case. If  $R1 = 1$  or X (LRS or MRS), then  $V_{NX}$  is below  $V_{TH-NML}$ . This means that NML is in the cutoff region with ultra-small  $I_{ML-M}$  ( $\sim 0, \ll I_{ML-MIS}$ ), which maintains  $V_{ML}$  at a high level, resulting in a match case.



Fig. 9. (a) Schematic. (b) Truth table. (c) Operational waveform of IEPCS-basic.

In an S0 operation,  $SL_r = VSL_r$  and  $DL_p = DL_n = V_{DD}$ , such that current flows from  $SL_r$  through  $R_1$  and  $N_1$  to  $SL_n$ .  $N_1-R_1$  forms a reverse voltage-divider to determine  $V_{NX}$ . If  $R_1 = 1$  (LRS), then  $V_{NX}$  rises above  $V_{TH-NML}$ , such that the discharge current generated by NML causes a drop in ML voltage, resulting in a mismatch case. If  $R_1 = 0/X$  (HRS/MRS), then  $V_{NX}$  is below  $V_{TH-NML}$  and NML falls within the cutoff region with small leakage current ( $I_{ML-M}$ ). Accordingly,  $V_{ML}$  is maintained close to  $V_{PRE}$ , resulting in a match case.
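At the behavioral level, the S1 and S0 rules reduce to a simple comparison: a cell pulls the ML down only when it stores HRS during S1 or LRS during S0, and an MRS cell never does. The Python sketch below models this logic only; it does not model the analog voltage divider, and the names `cell_mismatch` and `word_matches` are assumptions introduced here.

```python
from typing import Sequence

# Resistance state that turns NML on for each search bit:
# S1 mismatches on HRS (stored 0); S0 mismatches on LRS (stored 1).
MISMATCH_STATE = {"1": "HRS", "0": "LRS"}


def cell_mismatch(state: str, search_bit: str) -> bool:
    """True if this cell would discharge the ML for the given search bit."""
    if search_bit not in MISMATCH_STATE:
        raise ValueError(f"unsupported search bit: {search_bit!r}")
    if state not in ("HRS", "MRS", "LRS"):
        raise ValueError(f"unsupported resistance state: {state!r}")
    return state == MISMATCH_STATE[search_bit]


def word_matches(states: Sequence[str], search_word: str) -> bool:
    """True if no cell on the ML mismatches, i.e., the ML stays near V_PRE."""
    if len(states) != len(search_word):
        raise ValueError("stored word and search word differ in length")
    return not any(cell_mismatch(s, b) for s, b in zip(states, search_word))


if __name__ == "__main__":
    stored = ["LRS", "HRS", "MRS"]        # pattern 1, 0, X
    print(word_matches(stored, "100"))    # X matches a searched 0
    print(word_matches(stored, "101"))    # X matches a searched 1
    print(word_matches(stored, "110"))    # second bit mismatches
```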

Following a given ML voltage developing period ( $T_{ML}$ ),  $\Delta V_{MLSW}$  exceeds the designed sensing margin ( $V_{SM}$ ). In accordance with the self-timing scheme, the ML-SA is turned on ( $SAEN = 1$ ) to compare  $V_{ML}$  with a given reference voltage ( $V_{REF}$ ) and generate a digital output.

Note that, in the proposed 3T1R nvTCAM, the issue of read disturb was addressed from two perspectives: 1) a short stress time and 2) the option of using a lower  $SL_r$  voltage ( $VSL_r$ ) with corresponding transistor sizing for  $N_1$  and  $P_1$ .

## IV. MACRO STRUCTURE OF 3T1R nvTCAM CELL

Fig. 6 illustrates the macro structure of the 3T1R nvTCAM, including the DR<sup>2</sup>ST, IEPCS, write-drivers (WD), ML-SA, and other peripheral circuits. Two versions of the IEPCS scheme were developed (basic and charge-recycle-controlled) to alter the tradeoff between area overhead and power consumption in the updating of invalid-bits.

### A. Dual-Replica-Row Self-Timed Scheme

The voltage-dividing operation results in dc current between  $SL_r$  and  $SL_p/SL_n$  in the 3T1R cells during search operations. Furthermore, different voltage-division paths and variations in the global corners of transistors mean that the ML developing time ( $T_{ML}$ ) could vary considerably for a given  $V_{SM}$ .

Fig. 7 presents the ML-developing time ( $T_{ML}$ ) required for S1 and S0 operations across various global process corners at  $WDL = 64$ . The S0 operation produces small fluctuations across corners because the voltage-divider path and NML employ the same type of transistor (nMOS). In contrast, the S1 operation produces large fluctuations across corners, due to the fact that the P1 (pMOS) in the voltage-divider path ( $P_1 + R_1$ ) and the NML (nMOS) employ different types of transistor. The difference in voltage-divider path leads to differences in the  $T_{ML}$  required for the target  $V_{SM}$  in S1 and S0 operations.

Fig. 10. (a) Schematic, (b) truth table, and (c) operation waveform of IEPCS with CRC for VB updating operations.

In this paper, we sought to enhance tolerance for variations in  $T_{ML}$  between S1 and S0 operations under a variety of process, voltage, and temperature (PVT) conditions. This was achieved by developing a DR<sup>2</sup>ST scheme to track the slower  $T_{ML}$  of S1 and S0 operations with the worst-case mismatch data-pattern (1-bit mismatch).

As Fig. 8 shows, the DR<sup>2</sup>ST comprises a voltage-detector (VD) logic and two replica MLs (RMLs), RML0 and RML1, which are placed on top of the macro. Consistent with the search conditions in Section III-C, RML0 includes  $p$  cells storing LRS as mismatch cells and  $(k-p)$  cells storing HRS as match cells for S0 operations. RML1 includes  $p$  cells storing HRS as mismatch cells and  $(k-p)$  cells storing LRS as match cells for S1 operations.

Each replica-ML (RML1/RML0) includes  $p$  mismatch-cells and  $(k-p)$  match-cells to track the average  $V_{MLSW-MIS-1b}$  of  $p$  mismatch-cells. The ML voltage swing of  $p$ -bits mismatch ( $V_{MLSW-MIS-pb}$ ) is approximately  $p$  times larger than that of 1-bit mismatch.

As shown in Fig. 8(b) ( $p > 1$  case), when the voltage on RML1/RML0 ( $V_{RML} = V_{PRE} - p \cdot I_{ML-MIS} \cdot T_{ML}$ ) is below the trip-point ( $V_{TRIP-VD}$ ) of the voltage detector (VD1/VD0), VD issues an output ( $V_{D0\_out}/V_{D1\_out} = 1$ ) signal indicating that the  $V_{MLSW-MIS-1b}$  of the typical MLs exceeds  $(V_{PRE} - V_{TRIP})/p$ . Selecting appropriate values for  $p$  and  $V_{TRIP}$  makes it possible for a typical  $\Delta V_{MLSW}$  to exceed the  $V_{SM}$  required for ML-SAs. In this case, multiple replica-cells are used to suppress fluctuations in  $T_{ML}$  across dies, due to process variation in the  $I_{ML-MIS}$  of replica-cells.

As shown in Fig. 8(c), DR<sup>2</sup>ST can use  $p = 1$ , as long as the trip-point ( $V_{TRIP-VD}$ ) of VD0/VD1 is high. The case of  $p = 1$  enables good tracking of 1-bit mismatch behavior; however,  $T_{ML}$  is then susceptible to process variation in the  $I_{ML-MIS}$  of replica cells.

When  $V_{D0\_out}$  and  $V_{D1\_out}$  are both switched to high, DR<sup>2</sup>ST generates an SA enable signal ( $SAEN = 1$ ) to turn on all ML-SAs, and then sends DLOFF and SLrOFF signals to disable all DL-drivers and SLr-drivers, terminating the search operation in order to reduce power consumption.
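The timing behavior of the two replica MLs can be sketched with a first-order discharge model. The Python code below assumes a constant mismatch current and a lumped ML capacitance `c_ml`, so that the replica voltage falls linearly as $V_{PRE} - p \cdot I_{ML-MIS} \cdot t / C_{ML}$; the capacitance term, the function names, and the numeric values are assumptions for illustration and are not taken from the measured design.

```python
def time_to_trip(v_pre: float, v_trip: float, p: int, i_mis: float, c_ml: float) -> float:
    """Time for a replica ML with p mismatch cells to fall from v_pre to v_trip.

    Assumes linear discharge: V_RML(t) = v_pre - p * i_mis * t / c_ml.
    """
    if v_trip >= v_pre:
        raise ValueError("v_trip must be below v_pre")
    if p <= 0 or i_mis <= 0 or c_ml <= 0:
        raise ValueError("p, i_mis, and c_ml must be positive")
    return c_ml * (v_pre - v_trip) / (p * i_mis)


def saen_time(t_ml0: float, t_ml1: float) -> float:
    """SAEN is asserted only after both detectors trip, i.e., at the slower time."""
    return max(t_ml0, t_ml1)


def one_bit_swing_at_trip(v_pre: float, v_trip: float, p: int) -> float:
    """Swing of a regular ML with a 1-bit mismatch when the replica trips.

    Assumes the regular ML has the same capacitance and per-cell current
    as the replica ML.
    """
    return (v_pre - v_trip) / p


if __name__ == "__main__":
    # Arbitrary units; the S1 path is given a weaker current to mimic a slow corner.
    t_ml0 = time_to_trip(v_pre=1.0, v_trip=0.5, p=2, i_mis=1.0, c_ml=4.0)
    t_ml1 = time_to_trip(v_pre=1.0, v_trip=0.5, p=2, i_mis=0.5, c_ml=4.0)
    print(t_ml0, t_ml1, saen_time(t_ml0, t_ml1))
    print(one_bit_swing_at_trip(v_pre=1.0, v_trip=0.5, p=2))
```

Under these assumptions, raising $p$ shortens the replica trip time but also reduces the 1-bit swing available at the moment of sensing, so $p$ and the trip point have to be chosen together against the required sensing margin.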

Thus, by tracking S1 and S0 separately and generating  $T_{ML}$  based on the slower of  $T_{ML0}$  and  $T_{ML1}$ , the DR<sup>2</sup>ST provides pulse-controlled timing in order to suppress search energy while tolerating PVT variations during search operations. Note that DR<sup>2</sup>ST was designed to deal with global variations rather than local variations. In this paper, the tolerance of local variation is included in the chosen target sensing margin for ML-SA. The sensing margin for ML-SA is determined by the value of  $p$  and the value of  $V_{TRIP-VD}$ .

Fig. 11. Simulated distributions of  $V_{NX}$  in a 3T1R cell with variations in transistors and resistance of employed ReRAM device at (a)  $V_{DD} = 1\text{V}$  and (b)  $V_{DD} = 0.48\text{V}$ .

Fig. 12. Simulated maximum WDL of nvTCAM cells. (a) Fixed minimum-R-ratio. (b) Fixed-ratio = 100 and statistic with variation in ReRAM resistance.

### B. Invalid-Entry Power Consumption Suppressor Without CRC (IEPCS-Basic)

In a typical TCAM macro, the search result of an entry is considered valid if the valid-bit (VB) of the entry is high. If the VB of the entry is low, then the ML of the entry is kept low and treated as a mismatch case during search operations. In many nvTCAM cells, including the 3T1R cell, voltage-dividing operations consume considerable dc current ( $I_{DC-CELL}$ ) during search operations, regardless of whether the entries are valid or invalid. This study developed an IEPCS scheme to reduce the amount of power wasted by  $I_{DC-CELL}$  when dealing with invalid entries.

Fig. 9(a) illustrates the structure of IEPCS-basic, comprising a pair of power switches (a PMOS switch, PSW, and an NMOS switch, NSW) for SLp and SLn, an ML disable transistor (MLD), and a VB combiner (VBC).



Fig. 13. Search delay versus WDL using minimum-R-ratio approach.

It should be noted that, in order to reduce area overhead, every two entries (odd and even rows) share the same SLn and SLp lines in the layout. When entries sharing the same SLp/SLn lines are both invalid, the VBs of the odd row (VBO) and even row (VBE) are both low. Thus, the VBC output (VBx) is kept low to indicate that the IEPCS function of this two-entry set should be enabled. Fortunately, the fact that invalid entries are generally placed in consecutive rows makes IEPCS applicable to most TCAM applications.

In valid cases ( $\text{VB}_x = 1$ ), as Fig. 9(b) shows, the power-switches are in valid-mode ( $\text{SL}_p = \text{V}_{\text{DD}}$  and  $\text{SL}_n = 0 \text{ V}$ ), ready for search operations. In invalid cases ( $\text{VB}_x = 0$ , i.e., both VBO and VBE are low), the power-switches (PSW and NSW) are in invalid-mode, resulting in  $\text{SL}_p = 0 \text{ V}$ ,  $\text{SL}_n = \text{V}_{\text{DD}}$ , and  $\text{MLD} = \text{ON}$ . When MLD is ON, MLG is turned on and the ML-precharge transistor is disabled to ensure that the ML remains at 0 V. Thus, in a search-1 operation ( $\text{SL}_r = 0 \text{ V}$  and  $\text{P}_1 = \text{ON}$ ), there is no  $I_{\text{DC-CELL}}$  in any of the 3T1R cells of the invalid entry because there is no difference in voltage between  $\text{SL}_r$  and  $\text{SL}_p$ . The situation is the same for a search-0 operation ( $\text{SL}_r = \text{V}_{\text{DD}}$  and  $\text{N}_1$  is ON), in which there is no  $I_{\text{DC-CELL}}$  in any of the 3T1R cells of invalid entries, because  $\text{SL}_r = \text{SL}_n = \text{V}_{\text{DD}}$ . Fig. 9(c) presents the waveform of invalid and valid entries during a search operation using the IEPCS scheme.
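The rail settings described for IEPCS-basic can be captured in a short behavioral sketch. The Python code below is written in terms of the two valid bits directly, so it does not depend on the polarity of the combiner output; the function names, the symbolic rail values, and the `MLD = OFF` entry for valid-mode (which the text implies but does not state) are assumptions for illustration.

```python
from typing import Dict


def iepcs_basic(vb_odd: bool, vb_even: bool) -> Dict[str, str]:
    """Rail settings for a two-entry set that shares SLp and SLn.

    Invalid-mode is entered only when both entries are invalid (VB low).
    """
    if not vb_odd and not vb_even:
        return {"mode": "invalid", "SLp": "0", "SLn": "VDD", "MLD": "ON"}
    # MLD = OFF here is inferred; the text only states MLD = ON for invalid-mode.
    return {"mode": "valid", "SLp": "VDD", "SLn": "0", "MLD": "OFF"}


def has_dc_path(search_bit: str, rails: Dict[str, str]) -> bool:
    """True if the divider path used by this search sees a voltage difference."""
    if search_bit == "1":
        # S1: SLr = 0 V; divider path is SLp -> P1 -> R1 -> SLr.
        return rails["SLp"] != "0"
    if search_bit == "0":
        # S0: SLr = VDD (as in the text); divider path is SLr -> R1 -> N1 -> SLn.
        return rails["SLn"] != "VDD"
    raise ValueError(f"unsupported search bit: {search_bit!r}")


if __name__ == "__main__":
    for vbo, vbe in [(True, True), (True, False), (False, False)]:
        rails = iepcs_basic(vbo, vbe)
        print(vbo, vbe, rails["mode"], has_dc_path("1", rails), has_dc_path("0", rails))
```

Because the rails are shared, a pair with one valid and one invalid entry stays in valid-mode in this sketch, which is why the scheme benefits from invalid entries being placed in consecutive rows.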

### C. Invalid-Entry Power Consumption Suppressor With CRC

During VB updating operations, the power consumed through SLp/SLn toggling is significant, if the number of entries changing validity status is large. To suppress power consumption due to SLp/SLn toggling, we developed a charge-recycled-control (CRC) scheme for IEPCS (IEPCS-CRC).

The CRC comprises a VB-status transition detector (VBTD), a CMOS equalizer (an NMOS equalizer, NEQ, and a PMOS equalizer, PEQ) between SLn and SLp, and two pairs of CMOS power-isolation switches (PISO) for disconnecting SLn and SLp from the power source.

Fig. 10 illustrates the operational waveform of CRC for VB (VBO/VBE) updating operations. In cases where there is a change in VBx, the VBTD generates an enable pulse PEQ\_EN for a period of $T_{\text{EQ}}$ to turn on the NEQ and PEQ and turn off the PISO in order to float the SLn and SLp. This makes it possible for the SLp and SLn to perform equalization operations through charge sharing. At the end of the PEQ\_EN pulse, the PISO is turned on to pull-up/pull-down the SLp/SLn to VDD/VSS from the equalized voltage $V_{\text{EQ}}$, in accordance with its VBx status. Without full voltage swing ($\text{VDD}-\text{V}_{\text{EQ}}$ or $\text{V}_{\text{EQ}}$), the toggling of SLp/SLn consumes far less power than would be the case in the scheme without CRC.
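The energy benefit of the equalization step can be estimated with a first-order model. The following is a sketch under idealized assumptions rather than a result from the fabricated design: it assumes that SLp and SLn present the same lumped capacitance $C_{\text{SL}}$ (a symbol introduced here for illustration), that charge sharing completes within $T_{\text{EQ}}$, and that the VBTD and equalizer consume no energy of their own. Before a validity change, one line sits at VDD and the other at 0 V, so charge sharing gives

$$V_{\text{EQ}} = \frac{C_{\text{SL}} V_{\text{DD}} + C_{\text{SL}} \cdot 0}{2 C_{\text{SL}}} = \frac{V_{\text{DD}}}{2}.$$

Only the line that is pulled up draws energy from the supply, and that energy is $C_{\text{SL}} V_{\text{DD}} \Delta V$ for a swing of $\Delta V$:

$$E_{\text{no-CRC}} = C_{\text{SL}} V_{\text{DD}}^{2}, \qquad E_{\text{CRC}} = C_{\text{SL}} V_{\text{DD}} \left(V_{\text{DD}} - V_{\text{EQ}}\right) = \tfrac{1}{2} C_{\text{SL}} V_{\text{DD}}^{2}.$$

Under these assumptions the supply energy per SLp/SLn toggle is halved. Unequal line capacitances, incomplete equalization, or the energy of the control circuits would reduce the saving below this ideal bound.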

## V. COMPARISON AND EXPERIMENT RESULTS

### A. Comparison

Satisfying the write current requirements for long data retention and fast search speeds requires that the transistor sizes of N1/P1 and NML exceed the minimum size that can be produced using a given CMOS process. This prevents the performance of the 3T1R cell from being sensitive to local process variations in transistors. The following analysis focuses on the performance of 3T1R under global process corners with the worst-case R-ratio. To ensure a fair comparison, all the nvTCAMs used the same ReRAM device, the same R-ratio, the same size transistors in the ML-driver, the same sense amplifier, and the same worst-case ML sensing margin (125 mV) for the following analysis.

Fig. 11 illustrates the distribution of  $V_{\text{NX}}$ , while taking into account variations in ReRAM resistance and transistors. The results were obtained through Monte-Carlo simulation using 10 000 samples. The distribution of ReRAM resistance used in Fig. 11 is based on resistance values measured from an Mb-level ReRAM macro on the same wafer.

Fig. 12 presents the maximum WDL of the proposed 3T1R nvTCAM across various R-ratios. When R-ratio = 100, the 3T1R cell achieved  $\text{WDL} > 230$  across all global process corners. When R-ratio = 15, the 3T1R cell achieved  $\text{WDL} > 120$ . A decrease in R-ratio led to a decrease in the maximum WDL due to a lower ML current ratio.

Fig. 13 presents search delay versus ML length across various R-ratios. It should be noted that $V_{\text{NX}}$ is lower than $\text{VDD}$, which means that the 3T1R nvTCAM has smaller $I_{\text{ML-MIS}}$ than does sTCAM. On the other hand, the more than $2\times$ smaller $C_{\text{ML}}$ and compact cell area of 3T1R nvTCAM mean that the $RC$ delay time on ML is shorter than that of sTCAM. When combining the smaller $I_{\text{ML-MIS}}$ and shorter $RC$ delay, the difference in search delay between 3T1R nvTCAM and sTCAM depends on the R-ratio. At a smaller R-ratio, the search delays of 3T1R are longer due to a larger $C_{\text{ML}}$ and smaller $I_{\text{ML}}$-ratio.

Fig. 14(a) presents a breakdown of search energy for a 3T1R macro with $\text{WDL} = 64 \text{ b}$ at R-ratio = 100 under high temperature (125 °C). The dc current in a 3T1R cell consumes considerable energy, which depends on the DL pulse length. Although sTCAM has no dc current, it still suffers from large ML switching power (due to large ML capacitance) and the active-mode standby current of SRAM cells. Two branches of serial NMOS transistors are connected to the ML in each sTCAM cell, which means that sTCAMs have more than $2\times$ the parasitic load on ML ($C_{\text{ML}}$) compared to the proposed 3T1R nvTCAM, which has only one transistor connected to the ML in each nvTCAM cell. Thanks to smaller $C_{\text{ML}}$ and no cell leakage current, the nvTCAM consumes slightly less search energy than does sTCAM, despite the effects of dc current during the short ML developing period at R-ratio = 100. Fig. 14(b) presents the search energy versus ML lengths across various R-ratios. When the WDL increases, the search energy increases almost proportionally. This can be attributed to the fact that search energy is dominated by ML and DL toggling. When the R-ratio is small, an increase in search delay and dc current results in 3T1R consuming more search energy than does sTCAM.

Fig. 14. (a) Breakdown of search energy. (b) Search energy versus WDL using minimum-R-ratio approach.

Fig. 15. (a) Normalized search power versus invalid entries with and without IEPCS scheme. (b) Breakdown of search energy at NIVE = 25% without and with DR<sup>2</sup>ST and IEPCS schemes.

Fig. 15 presents the normalized power versus the number of invalid entries in a 3T1R macro. Note that the percentage of invalid entries depends on the capacity of the nvTCAM and the specifics of the application. The IEPCS scheme was shown to eliminate dc current in the case of invalid entries, thereby achieving a 21% reduction in search power when the number of invalid entries (NIVE) accounted for 25% of the total number of entries in a macro. A higher NIVE resulted in more pronounced power savings. Fig. 15(b) breaks down the search energy consumption for the DR<sup>2</sup>ST and IEPCS schemes.

Fig. 16. Power consumption versus WDL for valid-entry updating operation.

Fig. 17. Area overhead of IEPCS versus WDL.



Fig. 18. Simulated write energy of 3T1R nvTCAM using different numbers of program-verify cycles.



| Parameter | Value |
|-----------|-------|
| Process | 90 nm bulk CMOS + HfO ReRAM |
| Capacity | 2 blocks × 64 rows × 64 bits |
| Cell size | Using logic rule: 1.568 μm<sup>2</sup><br>Using SRAM rule: 0.8704 μm<sup>2</sup> |
| Search speed | 0.96 ns (@ VDD = 1 V, 25 °C) |
| Measured energy/bit/search<br>(@ VDD = 1 V, 25 °C) | w/o IEPCS: 0.51J<br>with IEPCS: 0.41J (with NIVE = 25%) |
| VDDmin | 0.48 V (@ 25 °C) |
| Supply voltage | 0.5 V to 1 V (search)<br>1.8 V (write) |
| ReRAM write speed | SET/RESET < 5 ns |
| Write energy<br>(1/3 "1", 1/3 "0", 1/3 "X") | 256 pJ per word (64 b)<br>(at I<sub>SET</sub> = 100 μA and I<sub>RESET</sub> = 200 μA) |

(a)



(b)

Fig. 19. (a) Die photograph and (b) structure of fabricated 3T1R testchip.

Using both DR<sup>2</sup>ST and IEPCS schemes, the 3T1R achieves a 32% reduction in search energy compared to that without either scheme.

Fig. 16 illustrates the power consumption required to update the validity status of an entry using IEPCS, with and without the CRC scheme. The CRC scheme resulted in a 44% reduction in the power consumption required to switch an entry between valid and invalid status when WDL = 32. The benefits of the CRC scheme were even more pronounced for entries with a long ML (large WDL).

Fig. 17 illustrates the area overhead for IEPCS versus WDL. At WDL = 64, IEPCS-basic and IEPCS-CRC incur area overheads of 4.2% and 8.2%, respectively, for a 3T1R nvTCAM array. At WDL = 256, the area overheads of IEPCS-basic and IEPCS-CRC are only 0.6% and 1.2%, respectively.

The use of a single NVM device was shown to provide a 54% reduction in $E_W$ over the previous 2-NVM-based cell (RCSD-4T2R) using a 1-cycle write scheme, as shown in Fig. 18. Even with the three-cycle program-verify scheme, the 3T1R cell still provides a 30% reduction in $E_W$ over the previous RCSD-4T2R cell.

### B. Experiment Results

Fig. 19(a) presents a die photo of the 3T1R nvTCAM testchip fabricated using a 90 nm CMOS process and fast-write BEOL HfO-based ReRAM devices [15], [19]. This nvTCAM macro includes two blocks of 64 rows of 64-bit entries. As shown in Fig. 19(b), we used a D-flip-flop (DFF)-based timing extraction scheme [23], [24] to derive $T_{SD}$ from the testchip search-time ($T_{SD-CHIP} = T_{SD} + T_{PD}$) by excluding the path delay time ($T_{PD}$) of the testchip and loadboard. Fig. 19 also includes the measured search energy for the case with NIVE = 25%. In Fig. 19(a), the write energy was measured at $I_{SET} = 100 \mu A$ and $I_{RESET} = 200 \mu A$ without using a program-verify scheme.

Fig. 20. Measured shmoo plot for worst-case $T_{SD}$.

Fig. 20 presents the measured shmoo plot for the worst-case search pattern. The measured worst-case $T_{SD}$ of a 3T1R nvTCAM macro at nominal VDD (1 V) was 0.96 ns. A larger $I_{ML}$-ratio and fewer devices stacked between the power-rails make it possible to operate the 3T1R macro at low VDD. The measured worst-case $T_{SD}$ at $V_{DD} = 0.48$ V was 11.28 ns.

Fig. 21. Waveform captured from power-OFF and power-ON operations. (a) Insufficient power-recover time ($T_{PR} = 4.4$ ns). (b) Sufficient power-recover time ($T_{PR} = 5.4$ ns).

Fig. 21 presents the waveform captured from the power-OFF and power-ON operations of the fabricated 3T1R nvTCAM macro. During power-OFF, VDD was pulled down to 0 V, while the clocking inputs (CLK and DFF\_CLK) were maintained to simplify the flow of testing. As shown in Fig. 21(a), insufficient power-recover time ($T_{PR} = 4.4$ ns) resulted in search failure because the ML was not fully precharged to $V_{PRE}$. As shown in Fig. 21(b), sufficient $T_{PR}$ ($= 5.4$ ns) resulted in a successful search operation because the ML was fully precharged to $V_{PRE}$.

## VI. CONCLUSION

This paper introduces a 3T1R nvTCAM cell using a single NVM device with a bi-directional voltage-divider control scheme. The 3T1R cell requires less area and enables longer WDL, faster search speeds, and lower search and write energy than previous nvTCAM cells. We also developed two macro-level control schemes (DR<sup>2</sup>ST and IEPCS) to enhance the robustness of the 3T1R nvTCAM and reduce the power consumption for search operations. A fabricated 90-nm MLC-ReRAM-based 3T1R nvTCAM macro confirmed the functionality of the developed schemes and achieved sub-1-ns search speeds at $V_{DD} = 1$ V.

## ACKNOWLEDGMENT

The authors would like to thank MOST-Taiwan, CIC, and EOL of ITRI for their support in testing and manufacturing.

## REFERENCES

- [1] K. Nii *et al.*, "A 28 nm 400 MHz 4-parallel 1.6 Gsearch/s 80 Mb ternary CAM," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2014, pp. 240–241.
- [2] C.-C. Wang, C. J. Cheng, T. F. Chen, and J. S. Wang, "An adaptively dividable dual-port bitcam for virus-detection processors in mobile devices," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2008, pp. 390–391.
- [3] I. Hayashi *et al.*, "A 250-MHz 18-Mb full ternary CAM with low-voltage matchline sensing scheme in 65-nm CMOS," *IEEE J. Solid-State Circuits*, vol. 48, no. 11, pp. 2671–2680, Nov. 2013.
- [4] L.-Y. Huang *et al.*, "ReRAM-based 4T2R nonvolatile TCAM with 7x NVM-stress reduction, and 4x improvement in speed-wordlength-capacity for normally-off instant-on filter-based search engines used in big-data processing," in *Symp. VLSI Circuits Dig. Tech. Papers*, Jun. 2014, pp. 1–2.
- [5] M.-F. Chang *et al.*, "A 3T1R nonvolatile TCAM using MLC ReRAM with sub-1 ns search time," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2015, pp. 1–3.
- [6] M.-F. Chang and S.-J. Shen, "A process variation tolerant embedded split-gate flash memory using pre-stable current sensing scheme," *IEEE J. Solid-State Circuits*, vol. 44, no. 3, pp. 987–994, Mar. 2009.
- [7] M. Jefremow *et al.*, "Bitline-capacitance-cancelation sensing scheme with 11 ns read latency and maximum read throughput of 2.9 GB/s in 65 nm embedded flash for automotive," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2012, pp. 428–430.
- [8] Y. Yano, "Take the expressway to go greener," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2012, pp. 24–30.
- [9] T. Kono *et al.*, "40 nm embedded SG-MONOS flash macros for automotive with 160 MHz random access for code and endurance over 10M cycles for data," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2013, pp. 212–213.
- [10] M.-F. Chang *et al.*, "An asymmetric-voltage-biased current-mode sensing scheme for fast-read embedded flash macros," *IEEE J. Solid-State Circuits*, vol. 50, no. 9, pp. 2188–2198, Sep. 2015.
- [11] P. F. Chiu *et al.*, "Low store energy, low  $V_{DDmin}$ , 8T2R nonvolatile latch and SRAM with vertical-stacked resistive memory (memristor) devices for low power mobile applications," *IEEE J. Solid-State Circuits*, vol. 47, no. 6, pp. 1483–1496, Jun. 2012.

- [12] S. C. Bartling, S. Khanna, M. P. Clinton, S. R. Summerfelt, J. A. Rodriguez, and H. P. McAdams, "An 8 MHz 75 $\mu$A/MHz zero-leakage non-volatile logic-based cortex-M0 MCU SoC exhibiting 100% digital state retention at VDD=0 V with <400 ns wakeup and sleep transitions," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2013, pp. 432–433.
- [13] N. Sakimura *et al.*, "A 90 nm 20 MHz fully nonvolatile microcontroller for standby-power-critical applications," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2014, pp. 184–185.
- [14] V. K. Singhal, V. Menezes, S. Chakravarthy, and M. Mehendale, "A 10.5  $\mu$ A/MHz at 16 MHz single-cycle non-volatile memory access microcontroller with full state retention at 108 nA in a 90 nm process," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2015, pp. 148–149.
- [15] A. Lee *et al.*, "RRAM-based 7T1R nonvolatile SRAM with 2x reduction in store energy and 94x reduction in restore energy for frequent-off instant-on applications," in *Proc. Symp. VLSI Circuits*, Jun. 2015, pp. C76–C77.
- [16] Y. Liu *et al.*, "A 65 nm ReRAM-enabled nonvolatile processor with 6X reduction in restore time and 4X higher clock frequency using adaptive data retention and self-write-termination nonvolatile logic," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2016, pp. 84–85.
- [17] S. Matsunaga *et al.*, "Fabrication of a 99%-energy-less nonvolatile multi-functional CAM chip using hierarchical power gating for a massively parallel full-text-search engine," in *Proc. Symp. VLSI Circuits*, Jun. 2013, pp. C106–C107.
- [18] S. Matsunaga *et al.*, "A 3.14  $\mu$ m<sup>2</sup> 4T-2MTJ-cell fully parallel TCAM based on nonvolatile logic-in-memory architecture," in *Proc. Symp. VLSI Circuits (VLSIC)*, Jun. 2012, pp. 44–45.
- [19] J. Li *et al.*, "1 Mb 0.41  $\mu$ m<sup>2</sup> 2T-2R cell nonvolatile TCAM with two-bit encoding and clocked self-referenced sensing," in *Proc. Symp. VLSI Circuits*, Jun. 2013, pp. C104–C105.
- [20] S.-S. Sheu *et al.*, "A 4 Mb embedded SLC resistive-RAM macro with 7.2 ns read-write random-access time and 160 ns MLC-access capability," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2011, pp. 200–201.
- [21] O. Wataru *et al.*, "A 4 Mb conductive-bridge resistive memory with 2.3 GB/s read-throughput and 216 MB/s program-throughput," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2011, pp. 200–202.
- [22] M.-F. Chang *et al.*, "A 0.5 V 4 Mb logic-process compatible embedded resistive RAM (ReRAM) in 65 nm CMOS using low-voltage current-mode sensing scheme with 45 ns random read time," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2012, pp. 434–436.
- [23] A. Kawahara *et al.*, "An 8 Mb multi-layered cross-point ReRAM macro with 443 MB/s write throughput," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2012, pp. 432–434.
- [24] M.-F. Chang *et al.*, "Area-efficient embedded resistive RAM (ReRAM) macros using logic-process vertical-parasitic-BJT (VPBJT) switches and read-disturb-free temperature-aware current-mode read scheme," *IEEE J. Solid-State Circuits*, vol. 49, no. 4, pp. 908–916, Apr. 2014.
- [25] M.-F. Chang *et al.*, "Embedded 1 Mb ReRAM in 28 nm CMOS with 0.27-to-1 V read using swing-sample-and-couple sense amplifier and self-boost-write-termination scheme," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2014, pp. 332–333.
- [26] H.-Y. Lee *et al.*, "Low power and high speed bipolar switching with a thin reactive Ti buffer layer in robust HfO<sub>2</sub> based RRAM," in *IEDM Tech. Dig.*, Dec. 2008, pp. 1–4.
- [27] Y.-S. Chen *et al.*, "Challenges and opportunities for HfO<sub>X</sub> based resistive random access memory," in *IEDM Tech. Dig.*, Dec. 2011, pp. 31.3.1–31.3.4.
- [28] C.-H. Lien, Y. S. Chen, H. Y. Lee, P. S. Chen, F. T. Chen, and M.-J. Tsai, "The highly scalable and reliable hafnium oxide ReRAM and its future challenges," in *Proc. IEEE Int. Conf. Solid-State Integr. Circuit Technol. (ICSICT)*, Nov. 2014, pp. 1084–1087.
- [29] K.-T. Tang, S. W. Chiu, M. F. Chang, C. C. Hsieh, and J. M. Shyu, "A wearable electronic nose SoC for healthier living," in *Proc. IEEE Biomed. Circuits Syst. Conf. (BioCAS)*, Nov. 2011, pp. 293–296.
- [30] K. T. Tang *et al.*, "A 0.5 V 1.27 mW nose-on-a-chip for rapid diagnosis of ventilator-associated pneumonia," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2014, pp. 420–421.



**Meng-Fan Chang** (M'05–SM'14) received the M.S. degree from The Pennsylvania State University, State College, PA, USA, and the Ph.D. degree from National Chiao Tung University, Hsinchu, Taiwan.

From 1996 to 1997, he was with Mentor Graphics, Warren, NJ, USA, where he was involved in the design of memory compilers. From 1997 to 2001, he was with the Design Service Division, TSMC, Hsinchu, where he was involved in the design of embedded SRAMs and Flash. From 2001 to 2006, he was the Co-Founder and Director with IPLib Company, Taiwan, where he developed embedded SRAM and ROM compilers, Flash macros, and flat-cell ROM products. From 2011 to 2016, he was the Associate Executive Director for Taiwan's National Program of Intelligent Electronics. He is currently a Full Professor with National Tsing Hua University, Hsinchu. His research interests include circuit designs for volatile and nonvolatile memory, ultra-low-voltage systems, 3-D-memory, circuit-device interactions, memristor logics for neuromorphic computing, and computing-in-memory.

Dr. Chang was a recipient of the Academia Sinica Junior Research Investigators Award in Taiwan in 2012 and the Ta-You Wu Memorial Award of the National Science Council in Taiwan in 2011. He received numerous awards from Taiwan's National Chip Implementation Center, NTU, MXIC Golden Silicon Awards, and ITRI. He is the corresponding author for numerous ISSCC and Very Large Scale Integration (VLSI) Symposia papers. He has been serving on technical program committees for ISSCC, IEDM, A-SSCC, ISCAS, VLSI-DAT, and numerous international conferences. He is an Associate Editor of the *IEEE TRANSACTIONS ON VLSI SYSTEMS*, the *IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS*, and the *IEICE Electronics*. He has been a Distinguished Lecturer for the IEEE Circuits and Systems Society.



**Chien-Chen Lin** received the B.S. and M.S. degrees in electrical engineering from National Tsing-Hua University, Hsinchu, Taiwan, in 2013 and 2015, respectively.

Since 2015, he has been with Taiwan Semiconductor Manufacturing Company, Hsinchu. His current research interests include the circuit design of emerging nonvolatile memory and embedded memory.



**Albert Lee** received the B.S. and M.S. degrees from National Tsing-Hua University, Hsinchu, Taiwan, in 2013 and 2015, respectively. He is currently pursuing the Ph.D. degree with the University of California, Los Angeles, CA, USA.

His current research interests include the application of emerging memory devices with logic, spin-based phenomenon, magnetic devices, memory circuits, silicon neurons, and neuro-electrodynamical systems.



**Yen-Ning Chiang** received the B.S. degree in engineering and system science from National Tsing Hua University, Hsinchu, Taiwan, in 2015, where she is currently pursuing the M.S. degree with the Institute of Electronics Engineering.

Her current research interests include the circuit design of emerging non-volatile memory.



**Chia-Chen Kuo** received the B.S. and M.S. degrees in electrical engineering from National Tsing Hua University, Hsinchu, Taiwan, in 2010 and 2013, respectively.

He is currently a Memory Circuit Design Engineer with the Electronics and Optoelectronics System Research Laboratories, Industrial Technology Research Institute, Hsinchu. His current research interests include emerging memory IC design.



**Geng-Hau Yang** received the B.S. degree in computer science from National Chung Cheng University, Chiayi, Taiwan, in 2011. He is currently pursuing the Ph.D. degree in computer science with National Chiao Tung University, Hsinchu, Taiwan.

His current research interests include computer architectures, memory system, nonvolatile memories, and embedded systems.



**Hsiang-Jen Tsai** received the B.S. degree in computer science and information management from Providence University, Taichung, Taiwan, in 2007, and the M.S. degree in computer science and engineering from National Chung Hsing University, Taichung, in 2009. He is currently pursuing the Ph.D. degree in computer science with National Chiao Tung University, Hsinchu, Taiwan.

His current research interests include multicore system-on-a-chip design, embedded system design, low-power methodology, non-volatile memory design, and computer architecture.



**Tien-Fu Chen** (S'90–M'93) received the B.S. degree in computer science from National Taiwan University, Taipei, Taiwan, in 1983, and the M.S. and Ph.D. degrees in computer science and engineering from the University of Washington, Seattle, WA, USA, in 1991 and 1993, respectively.

He was a System Software Engineer with Wang Computer Ltd., Taiwan, for three years. He is currently a Professor with the Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan. He has authored several widely-cited papers on dynamic hardware prefetching algorithms and designs. In recent years, he has made contributions to processor design and SOC design methodology. His current research interests include multithreading/multicore media processors, on-chip networks, low-power architecture techniques as well as related software support tools and SoC design environment, computer architectures, system-on-chip design, design automation, and embedded software systems.



**Shyh-Shyuan Sheu** received the Ph.D. degree from the Department of Electrical Engineering, National Central University, Taoyuan, Taiwan.

He is currently the Deputy Research and Development Director with the Electronics and Optoelectronics System Research Laboratories, Industrial Technology Research Institute, Hsinchu, Taiwan. His current research interests include memory, display, and CMOS image sensor circuit design and technology.