

# A Sub-cm<sup>3</sup> Energy-Harvesting Stacked Wireless Sensor Node Featuring a Near-Threshold Voltage IA-32 Microcontroller in 14-nm Tri-Gate CMOS for Always-ON Always-Sensing Applications

Somnath Paul, *Member, IEEE*, Vinayak Honkote, *Member, IEEE*, Ryan Gary Kim, *Member, IEEE*, Turbo Majumder, *Member, IEEE*, Paolo A. Aseron, Vaughn Grossnickle, Robert Sankman, Debendra Mallik, *Fellow, IEEE*, Tao Wang, Sriram Vangal, *Senior Member, IEEE*, James W. Tschanz, *Member, IEEE*, and Vivek De, *Fellow, IEEE*

**Abstract**—An energy-harvesting wireless sensor node (WSN) integrates a 14-nm, 0.79-mm<sup>2</sup>, 32-b Intel Architecture core-based near-threshold voltage (NTV) microcontroller (MCU) that provides 17- $\mu$ W/MHz always-ON, always-sensing (AOAS) capability. The MCU implements four independent voltage-frequency islands, managed by an integrated power management unit and features a subthreshold voltage capable on-die oscillator and 42-nm fin-pitch, 8.3-pA leakage-per-bit SRAM. The MCU operates across a wide frequency (voltage) range from 297 MHz (1 V) to 0.5 MHz (308 mV), dissipating 23.5 mW to 21  $\mu$ W, and achieves 4.8 $\times$  better energy efficiency at an optimum supply voltage ( $V_{OPT}$ ) of 370 mV, 3.5 MHz, and 17 pJ/cycle. A functional AOAS WSN incorporating the NTV MCU shows promise for sustained  $\mu$ W operation.

**Index Terms**—Always-ON always-sensing (AOAS), energy harvesting, Intel Architecture (IA) Internet of things (IoT), microcontroller (MCU), near-threshold voltage (NTV), wireless sensor node (WSN).

## I. INTRODUCTION

UBIQUITOUS sensing enabled by wireless sensor nodes (WSNs) is a key ingredient of modern-day life. The ongoing Internet-of-things (IoT) revolution is expected to network billions of always-ON always-sensing (AOAS) devices deployed by the year 2020 [1]. These devices can communicate among themselves or with a gateway that

Manuscript received August 9, 2016; revised October 14, 2016; accepted November 26, 2016. Date of publication March 3, 2017; date of current version March 23, 2017. This paper was approved by Guest Editor Makoto Ikeda.

S. Paul, V. Honkote, R. G. Kim, T. Majumder, S. Vangal, J. W. Tschanz, and V. De are with the Circuit Research Lab, Intel Corporation, Hillsboro, OR 97124 USA (e-mail: somnath.paul@intel.com).

P. A. Aseron is with the Silicon Technology Prototyping Lab, Intel Corporation, Hillsboro, OR 97124 USA.

V. Grossnickle is with the Platform Engineering Group, Intel Corporation. R. Sankman, D. Mallik, and T. Wang are with the Assembly and Test Technology Development, Intel Corporation, Chandler, AZ 85226 USA.

Color versions of one or more of the figures in this paper are available online at <http://ieeexplore.ieee.org>.

Digital Object Identifier 10.1109/JSSC.2016.2638465



Fig. 1. WSN featuring NTV MCU.

eventually connects them to the server cloud. The devices are ideally WSNs that enable computing at the origin of the data, otherwise referred to as edge computing. This is made possible by tightly integrating sensors, compute, and communication units onto a single platform. Such nodes should ideally have a small (mm-scale) form factor and operate without batteries, running perpetually on energy harvested from ambient sources.

For energy-neutral operation, WSNs require ultralow power (ULP) microcontrollers (MCUs) [2]. These MCUs enable always-ON (AON) context-aware applications to run continuously on the WSN while making the most of its limited available energy. In this paper, we demonstrate a functional AOAS WSN that integrates a 32-b Intel Architecture (IA) core-based MCU that operates in the NTV regime and consumes  $\mu$ W AON power (Fig. 1). In addition, the WSN platform has nonvolatile flash memory for storing any code or data that would otherwise be lost when the MCU is powered OFF. The WSN integrates low-power sensors for sensing ambient conditions (temperature, humidity, etc.),



Fig. 2. 32-b IA NTV MCU packaged in a 4.08-mm<sup>2</sup> (2.46 mm × 1.66 mm) 24-pin BGA substrate.

inertial sensors and a Bluetooth low-energy (BLE) radio to wirelessly transmit sensor data to a remote BLE device. Finally, it integrates a 1-cm<sup>2</sup> solar cell, a harvester, and a power management IC (PMIC) to enable energy harvesting and battery-free operation.

MCU power and performance have been shown to scale with supply voltage [3], with the minimum energy of operation (3–5× improvement compared to super-threshold operation) observed in the NTV regime [3], [4]. The key challenge is to lock in this energy efficiency benefit at NTV while mitigating performance loss. This paper highlights a sub-mm<sup>2</sup> (0.79 mm<sup>2</sup>), energy-optimized, 32-b IA NTV MCU using all four transistor families in Intel's 14-nm second-generation tri-gate SoC platform technology (Fig. 2). The NTV MCU has four independent voltage–frequency islands (VFIs) and is operational across a wide frequency (voltage) range. The MCU provides 17 μW/MHz AOAS capability and enables sustained μW operation of the entire WSN at MHz clock frequencies.

In the following sections, we present the background and motivation behind this paper (Section II), the MCU architecture (Section III), NTV technology implementation details (Section IV), chip-level measurement results (Section V), and the measured WSN power results (Section VI). We conclude in Section VII and outline open areas for further investigation.

## II. BACKGROUND AND MOTIVATION

There has been significant research interest in designing ULP WSNs that sense from the immediate environment, analyze the sensed data, and wirelessly communicate pertinent information to the server cloud under a tight energy envelope [5].

### A. Prior Work

The focus of prior research toward the realization of ULP platforms can be categorized into low-power compute architectures and implementations, low-power sensing and ancillary circuits, and low-power communication radios. In the context of WSNs, there has always been a tradeoff with respect to the share of computation that is carried out on the edge device vis-à-vis that is carried out in the central data center or the cloud. Traditionally, edge devices have been designed to perform simple computations due to their tight energy budgets. However, with more data being sensed

from the physical environment, communication of raw or minimally processed data to the cloud poses a major energy cost. As such, there has been increasing interest in achieving nontrivial data analyses on the edge device within the energy budget [6].

Lowering supply voltage, which is one of the most widely used techniques for achieving energy efficiency, also poses new challenges that need novel approaches in design implementation methodology. The on-chip clock network, for example, is highly sensitive to process variation at low supply voltages, which in turn leads to high variation in skew. Careful clock network design and buffering are necessary to preclude setup and hold violations [7]. Other techniques that have been used include novel power gating techniques [8], high-density low-leakage logic and memories [9], [10], and extremely low-leakage CMOS processes [11]. Low voltage operation also makes robust operation harder because of process variation as well as transient faults; this has been addressed through various resiliency schemes [3], [12].

Low-power sensing of data from the physical environment has been a major challenge in realization of WSN nodes owing to the analog components and data conversion circuitry involved, which typically have higher quiescent currents than their digital MCU counterparts. Consequently, there has been significant work devoted to building sensors and transducers that can operate on energy budgets so low that they can be used as “smart dust” [13]–[15]. The third challenge in realization of IoT nodes is that of ULP communication radios. With aggressive scaling in size and power of the compute and sensor components, communication has become the bottleneck in terms of area and energy footprints, and several physical and media access control layer techniques are needed to restrict energy consumption [16]. Several such techniques have been used in conjunction with high-efficiency energy harvesters and converters in holistic low-power platforms [17], [18].

### B. Motivation

A multitude of techniques and technologies need to converge to achieve energy-efficient operation. This provides the motivation for building an MCU that can operate at low voltages with the best energy efficiency and a minimal die area, so that it can sustain active AOAS operation off harvested ambient energy in a WSN. The work described in this paper brings together and builds upon many of the ideas presented in earlier work, and leverages novel technical contributions including near-threshold voltage (NTV) technology, multiple transistor families, extremely low-leakage memory, and innovations in clock network design. By combining these technologies, we demonstrate an IA-32-based MCU whose energy efficiency of 17 μW/MHz in an AOAS mode is ahead of the current state of the art, and which has been successfully demonstrated as part of a functional batteryless WSN that runs off harvested ambient energy.

## III. MCU ARCHITECTURE

The MCU (Fig. 3) consists of three major subsystems: an IA subsystem, an Advanced High-performance Bus (AHB) subsystem, and an AON subsystem.



Fig. 3. NTV MCU block diagram showing voltage domains.

### A. IA Subsystem

The IA subsystem consists of a 32-b IA-core, which is a Quark CPU derivative [19]. It also has an 8 KB Instruction cache (I\$) and 8 KB data tightly coupled memory (DTCM). I\$ stores frequently used instructions and is hardware managed. DTCM can be considered as a local scratch-pad memory, which offers low latency (single cycle) and deterministic access, particularly useful for data-heavy workloads. The read/write to the DTCM is managed only through software. Both I\$ and DTCM can be enabled/disabled by user program.

### B. AHB Subsystem

The AHB subsystem consists of a 32-b AHB interconnect that supports multiple masters and slaves. The masters on the AHB interconnect are: 32-b IA core, a direct memory access (DMA) module, and Test Access Port (TAP2AHB) module. The DMA can be programmed to manage data movement between any two slave modules on the AHB bus, without requiring the IA core to explicitly manage this movement. The TAP2AHB is a debug module that translates standard scan messages from the board to AHB transactions on the die. All memory-mapped registers in the design can be accessed via TAP2AHB. There are three slave interfaces on the AHB bus corresponding to the: 1) memory module in AHB subsystem; 2) advanced peripheral bus (APB) interconnect in AHB subsystem; and 3) APB interconnect in AON subsystem. The memory module consists of 16 KB of BootROM and 64 KB of shared memory (SMEM), used for both code and data. The BootROM is a read-only memory that is already preprogrammed with the IA bootstrap code. The boot code guides the core through the boot process that involves copying the user-defined program from a nonvolatile flash memory on the board over the serial peripheral interface (SPI) into SMEM. The program control is transferred to the SMEM once the transfer is finished. SMEM is a volatile memory and can be power gated at 2 KB granularity to minimize leakage from unused sections. The APB interconnect in the AHB subsystem supports standard serial interfaces such as SPI and universal asynchronous receiver/transmitter (UART). There are two SPI interfaces, with one dedicated to on-board flash memory. The



Fig. 4. MCU clocking diagram with CRO (inset).

other SPI and the UART port can be used for: 1) debugging; 2) communication with sensors; or 3) interfacing with BLE.
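The 2 KB power-gating granularity of the 64 KB SMEM implies 32 independently gateable banks. As a rough illustration, the Python sketch below computes which banks a program of a given size would keep powered. The contiguous-from-bank-0 layout, the active-high bit mask, and the function name `smem_bank_mask` are assumptions made for this example; the actual register interface is not described in the text.

```python
SMEM_BYTES = 64 * 1024                 # shared memory size stated in the text
BANK_BYTES = 2 * 1024                  # power-gating granularity stated in the text
NUM_BANKS = SMEM_BYTES // BANK_BYTES   # number of independently gateable banks


def smem_bank_mask(footprint_bytes: int) -> int:
    """Return a hypothetical enable mask with one bit per 2 KB bank.

    Assumes code and data are placed contiguously starting at bank 0.
    """
    if not 0 <= footprint_bytes <= SMEM_BYTES:
        raise ValueError("footprint must fit in SMEM")
    banks_needed = -(-footprint_bytes // BANK_BYTES)  # ceiling division
    return (1 << banks_needed) - 1


mask = smem_bank_mask(16 * 1024)       # a 16 KB code footprint
powered = bin(mask).count("1")
print(f"powered banks: {powered}, gated banks: {NUM_BANKS - powered}")
```

Under these assumptions, a 16 KB footprint keeps a quarter of the banks powered and leaves the rest available for gating.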

### C. AON Subsystem

The AON subsystem must be kept powered-ON while the IA subsystem and the AHB subsystem can be powered-OFF. This allows the MCU to be AON, consuming less than 50  $\mu$ W of total power for workloads with small duty cycle at MHz operational rates. The AON subsystem comprises two 32-b timers for tracking real-time intervals. There are 15 general-purpose IO (GPIO) and one inter-integrated circuit ( $I^2C$ ) interface for communication with external sensors and PMIC. The chip has a total of 17 IO signals, two of which are dedicated to reset and real-time clock (RTC), respectively. In addition, the chip has five voltage rails, one for each major subsystem and for level conversion across these voltage domains. The IOMUX module is responsible for multiplexing/demultiplexing all the standard interfaces (SPI/UART/GPIO/ $I^2C$ ) onto these 17 IO signals, controlled via user program. The power-management and clock-control unit (PMUCCU) is responsible for on-die as well as platform power management by communicating with an external PMIC via the  $I^2C$  interface. On the die, it supports management of multiple VFIs and wakeup/sleep states.

### D. Clocking Architecture

A calibrated ring oscillator (CRO) serves as a low-power on-chip high-frequency (MHz) clock source (Fig. 4) for the MCU. The CRO is a frequency-locked loop that uses an RTC as a reference to generate a MHz clock output. Internally, it tracks the frequency of oscillation from a ring oscillator and generates a delay code that adjusts the oscillation frequency to closely match the target frequency based on the reference clock. The CRO can operate in: 1) closed-loop mode, in which it accurately tracks the target frequency and 2) open-loop mode at ultralow voltages, in which it produces a clock in the tens of kHz, enough for AON sensing operation on the MCU. In addition to the CRO clock, the MCU can also operate on an external MHz clock or the RTC clock. The primary clock can be selected via software, which allows the MCU to either reside in a high-performance (HP)/high-frequency state for data processing or in a low-performance/low-frequency state for AON sensing. As illustrated in Fig. 4, the clock frequency for each subsystem can be individually controlled (via software)



Fig. 5. (a) WSN validation platform. (b) Packaged die. (c) MCU layout photograph. (d) MCU die photograph with key blocks.

by changing the division ratio in the frequency dividers. While the core and AHB subsystems are operated at the same frequency, the APB subsystems can be run at up to half the frequency of the core+AHB subsystems.
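The closed-loop behavior of the CRO can be pictured with a small behavioral model: count ring-oscillator cycles within one reference period, compare against the target count, and step a delay code until they match. The Python sketch below is only an illustration of that idea. The oscillator model `ring_osc_freq_hz`, its parameters, the 32.768 kHz reference, the 10-bit code width, and the one-step-per-iteration update are all assumptions, not details of the actual circuit.

```python
def ring_osc_freq_hz(code: int, f_max_hz: float = 64e6, step: float = 0.02) -> float:
    """Toy ring-oscillator model: a larger delay code gives a lower frequency."""
    return f_max_hz / (1.0 + step * code)


def calibrate(target_hz: float, ref_hz: float = 32768.0,
              code_bits: int = 10, max_iters: int = 2048):
    """Step the delay code until the cycle count per reference period matches."""
    target_count = round(target_hz / ref_hz)
    max_code = (1 << code_bits) - 1
    code = 0
    for _ in range(max_iters):
        # Cycles of the ring oscillator counted during one reference period.
        count = int(ring_osc_freq_hz(code) // ref_hz)
        if count == target_count:
            break
        if count > target_count:
            code = min(code + 1, max_code)   # running fast: add delay
        else:
            code = max(code - 1, 0)          # running slow: remove delay
    return code, ring_osc_freq_hz(code)


code, f_out = calibrate(16e6)
print(f"delay code {code}, output {f_out / 1e6:.2f} MHz")
```

Because the loop only compares integer counts against the reference, its frequency resolution is bounded by the reference frequency and the delay step, which is one reason a real design would choose those carefully.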

## IV. NTV MCU DESIGN

The MCU is fabricated using 14-nm tri-gate CMOS technology with nine metal interconnect layers (Fig. 5). The MCU cell count is approximately 213 K and the die area is 0.79 mm<sup>2</sup> (0.56 mm × 1.42 mm). The packaged area is measured to be 4.08 mm<sup>2</sup> (2.46 mm × 1.66 mm). The following sections elaborate the NTV design methodology.

### A. Device Type Selection for Best Energy Efficiency

The NTV design uses HP, standard-performance (SP), ULP, and thick-gate (TG)—all four transistor families in 14-nm second-generation tri-gate SoC platform technology [20]. To minimize variation induced skews, the clock distribution is completely designed using HP devices. The lower threshold voltage ( $V_T$ ) of the HP devices allows improved delay predictability on the clock paths at NTV. SP devices are used for 100% of logic cells to achieve sufficient speeds during active mode of operation. To achieve low standby power, all on-die memories and caches use a custom eight transistor [8T, m1-m8 in Fig. 6(a)], 0.155- $\mu\text{m}^2$  bitcell, built using 84-nm



Fig. 6. (a) 8T SRAM topology used for all on-die memories with 100% ULP devices. (b) Measured ULP memory bit-cell leakage current versus supply voltage at 25 °C.

gate pitch ULP transistors [20]. With discrete read and write ports in the 8T cell, significantly improved read noise margins can be realized over the traditional 6T-SRAM cell, at an additional area expense. The noise margin improvement is due to the elimination of the read-disturb condition of the internal memory node by the introduction of a separate read port in the SRAM cell. As a result, variability tolerance is greatly enhanced, making it a desirable design choice for ULP SRAM memory operating at lower supply voltages down to NTV and energy-optimum points. The bidirectional CMOS IO circuits are designed using high voltage (1.8 V) TG transistors.

### B. NTV Methodology

As supply voltage approaches the threshold voltage of transistors, circuit behavior changes drastically due to an exponential increase in device delay. The presence of within-die variations results in further delay degradation. This problem becomes more prominent when the device sizes are smaller, near the process-allowed minimum width ( $Z_{\min}$  or one fin width in our case), causing excessive timing push-outs and even functional failures in the case of sequential and Register File (RF) cells. In this paper, on-die logic and memory circuits are optimized for reliable NTV operation. The fully synthesized NTV design employs variation-aware pruning methods on the standard cell library to mitigate delay sensitivity to process variations [3]. Routing the clock distribution on higher metal layers improved skew and reduced min-delay violations. The pruning eliminates circuits that exhibit dc failures or extreme delay degradation due to reduced transistor ON/OFF current ratios and increased sensitivity to process variations [3]. Specifically, a variation-aware statistical static timing analysis (SSTA) study is performed on the entire standard cell logic library. Gates with multiple stacked devices have significantly reduced drive currents in the NTV regime. Complex logic gates with four or more stacked devices and wide transmission-gate multiplexers with four or more inputs are therefore pruned from the library, and not used in the design, because they exhibit more than 106% and 133% delay degradation compared to three-stack gates or three-wide multiplexers, respectively, at NTV. To enable reliable operation at low voltages, low  $V_T$  (HP) and higher  $V_T$  (SP) devices are used selectively. All the clock paths are designed using low  $V_T$  HP devices because high  $V_T$  devices exhibit 81% higher delay

TABLE I  
COMPARISON BETWEEN 8T SRAMS WITH ULP AND SP BITCELLS

| 8T SRAM device type             | Gate pitch | Normalized frequency (0.5V) | Normalized Leakage (0.5V, 25C) | 14nm bit-cell area ( $\mu\text{m}^2$ ) |
|---------------------------------|------------|-----------------------------|--------------------------------|----------------------------------------|
| Standard performance (SP)       | 70nm       | 5X                          | 26X                            | 0.100 $\mu\text{m}^2$                  |
| Ultralow power (ULP, this work) | 84nm       | 1X                          | 1X                             | 0.155 $\mu\text{m}^2$ (1.55X)          |

penalty, in the presence of variations at NTV. Similarly, all minimum sized (single fin device) gates are filtered from the library due to a 138% higher variation impact, when analyzed at 500-mV power supply. As a result, the standard cell library was conservatively constrained and characterized at 0.5, 0.75, and 1.05 V corners for synthesis and timing convergence.
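The pruning rules described above (no gates with four or more stacked devices, no transmission-gate multiplexers with four or more inputs, no single-fin minimum-sized gates) can be expressed as a simple library filter. The sketch below is illustrative only: the `Cell` record, its fields, and the cell names are hypothetical, and the actual flow relies on SSTA results rather than a static rule check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    name: str        # hypothetical cell name
    max_stack: int   # tallest series transistor stack in the gate
    mux_inputs: int  # inputs of a transmission-gate mux, 0 if not a mux
    fins: int        # fins per device; 1 means minimum size


def keep_for_ntv(cell: Cell) -> bool:
    """Apply the three pruning rules described in the text."""
    if cell.max_stack >= 4:    # four or more stacked devices
        return False
    if cell.mux_inputs >= 4:   # transmission-gate muxes with four or more inputs
        return False
    if cell.fins <= 1:         # minimum-sized (single-fin) gates
        return False
    return True


library = [
    Cell("nand2_x2", 2, 0, 2),
    Cell("nand4_x2", 4, 0, 2),
    Cell("mux4_x2", 2, 4, 2),
    Cell("inv_x1", 1, 0, 1),
    Cell("nor3_x2", 3, 0, 2),
]
pruned_library = [c.name for c in library if keep_for_ntv(c)]
print(pruned_library)
```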

This multicorner methodology co-optimizes timing slack across all three corners and ensures that performance targets are met across the wide operating voltage range. The approach accounts for nonlinear scaling of device delays in the critical path versus interconnect delay scaling across the same voltage range. The ULP transistor optimized memory arrays are designed to provide lowest standby leakage. As summarized in Table I, the ULP array is estimated to be 5× slower than an SP transistor 8T memory, but it is still fast enough for edge compute applications. Context-aware power-gating of each 2 KB array is supported for further leakage reduction with no state retention. The ULP array also enables 26× lower leakage (at 500 mV supply) and has a 55% area cost over an SP-based 8T memory array, drawn on a 70 nm gate pitch [20]. Fig. 6(b) shows the ULP memory leakage over a wide supply voltage range scaling down to 8.28 pA per bit at the retention limit of 308 mV, as measured at room temperature (25 °C). In the NTV regime, severe effects of process variations result in uncertainty in path delays and may cause setup (max) or hold (min) violations. Setup violations can be corrected by frequency binning. However, hold violations can cause critical functional failures. In this design, the chip timing convergence methodology is enhanced to consider the effect of random variations and provide variation-aware hold margin guard-bands for robust NTV operation [3].
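To put the per-bit leakage figure in perspective, the following back-of-the-envelope Python sketch scales the measured 8.28 pA per bit at 308 mV up to the on-die SRAM arrays listed in Section III. It assumes that every array is held at the retention voltage and counts bitcell leakage only; peripheral circuits, the BootROM, and any power-gated banks are ignored, so the result is an estimate rather than a measured number.

```python
LEAK_PER_BIT_A = 8.28e-12   # measured leakage per bit at 308 mV, 25 °C (from the text)
V_RETENTION = 0.308         # retention-limit supply voltage in volts (from the text)

# On-die SRAM sizes from Section III, in KB.
MEMORIES_KB = {"I$": 8, "DTCM": 8, "SMEM": 64}


def array_leakage(kilobytes: int):
    """Return (current in A, power in W) for the bitcells of one array."""
    bits = kilobytes * 1024 * 8
    current = bits * LEAK_PER_BIT_A
    return current, current * V_RETENTION


for name, kb in MEMORIES_KB.items():
    i_leak, p_leak = array_leakage(kb)
    print(f"{name:5s} {kb:3d} KB: {i_leak * 1e6:.2f} uA, {p_leak * 1e6:.2f} uW")

total_current, total_power = array_leakage(sum(MEMORIES_KB.values()))
print(f"total {total_current * 1e6:.2f} uA, {total_power * 1e6:.2f} uW")
```

Under these assumptions, the bitcell leakage of all 80 KB of SRAM stays in the low single-digit microwatt range at the retention voltage.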

## V. MCU MEASUREMENT RESULTS

The IA MCU is fully functional. Detailed characterization is performed at room temperature  $T = 25^\circ\text{C}$ . The following section presents MCU performance and power data.

### A. MCU Characterization

The IA MCU is functional over a wide operating range (Fig. 7) from 297 MHz (1 V) scaling down to 0.5 MHz (308 mV). While the entire MCU is functional down to 308 mV, we validated SMEM functionality down to 300 mV by independently writing and reading to it via the TAP interface. The ROM and the AHB logic are found to be functional down to 297 mV. With the MCU continuously executing a data encryption workload (AES-128), the minimum energy point is observed at 370 mV ( $V_{\text{OPT}}$ ) at  $T = 25^\circ\text{C}$ . At  $V_{\text{OPT}}$ ,



Fig. 7. Measured MCU power, performance, and energy characteristics across wide voltage range.

MCU runs at 3.5 MHz and dissipates 58  $\mu\text{W}$ , which translates to an energy-efficiency metric of 17.18 pJ/cycle. Compared to superthreshold operation at 1 V, NTV operation at  $V_{\text{OPT}}$  achieves 4.8× improvement in energy efficiency. Fig. 8 shows the breakdown of power consumption inside the MCU at a superthreshold ( $V_{\text{CORE}} = V_{\text{AON}} = 0.75\text{ V}$ ) and near-threshold ( $V_{\text{CORE}} = V_{\text{AON}} = 0.37\text{ V}$ ) voltage point. At superthreshold voltage, the core (IA+AHB) active power dominates MCU power. However, at NTV, IO and leakage power match the core active power. Below  $V_{\text{OPT}}$ , the leakage energy increase offsets the active energy reduction due to voltage scaling, thereby resulting in the optimal energy of operation at  $V_{\text{OPT}}$ . Fig. 9 presents the characterization results for the CRO. The on-die CRO locks to a wide range of target frequencies from 1 V down to 0.4 V. The CRO dissipates 60  $\mu\text{W}$  (450 mV) while generating a 16 MHz output to clock the MCU at  $V_{\text{OPT}}$ . In open-loop condition, CRO is functional down to a deep subthreshold voltage of 128 mV, dissipating 3.8  $\mu\text{W}$ , while generating a 7-kHz clock output. The CRO achieves a measured clock period jitter of 4.6 ps at 400-MHz operation.
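The energy-per-cycle figures follow from dividing power by clock frequency. The short Python sketch below repeats that arithmetic with the rounded operating points quoted in the abstract and in this section. With these rounded inputs, the $V_{\text{OPT}}$ point works out to about 16.6 pJ/cycle, slightly below the reported 17.18 pJ/cycle; the small gap may come from rounding of the quoted power and frequency. The function name `energy_per_cycle_pj` is introduced only for this example.

```python
def energy_per_cycle_pj(power_w: float, freq_hz: float) -> float:
    """Energy per clock cycle in picojoules: power divided by frequency."""
    return power_w / freq_hz * 1e12


e_1v = energy_per_cycle_pj(23.5e-3, 297e6)   # 1 V: 23.5 mW at 297 MHz
e_opt = energy_per_cycle_pj(58e-6, 3.5e6)    # 370 mV: 58 uW at 3.5 MHz
e_min_v = energy_per_cycle_pj(21e-6, 0.5e6)  # 308 mV: 21 uW at 0.5 MHz

print(f"1 V:    {e_1v:.1f} pJ/cycle")
print(f"V_OPT:  {e_opt:.1f} pJ/cycle")
print(f"308 mV: {e_min_v:.1f} pJ/cycle")
print(f"improvement at V_OPT over 1 V: {e_1v / e_opt:.1f}x")
```

The energy at 308 mV is higher than at 370 mV even though the power is lower, which is consistent with leakage dominating once the clock frequency collapses.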

### B. MCU Active Power Reduction

The PMU provides several software-controlled options for power reduction on the MCU. Fig. 10 shows the percentage active power reduction as each unit on the MCU is progressively clock gated. At 0.75 V, up to 80% of the active power may be reduced through halting the IA core or clock gating idle units. The improvement is less pronounced at NTV due to higher contribution from IO and leakage. For typical WSN workloads with a code footprint of ~16 KB, MCU energy can be further improved by enabling I\$ and DTCM. Enabling I\$ and DTCM helps to exploit any code and data locality present in the application, thereby reducing the active power consumed in AHB interconnect and large SMEM (64 KB) access. Our experiments show that almost 40% energy improvement is achieved by enabling both I\$ and DTCM (Fig. 11).

### C. MCU Power States

The PMUCCU enables a wide range of dynamic voltage frequency scaling by: 1) changing the frequency of the CRO



Fig. 8. Breakdown of MCU power at (a) 750 and (b) 370 mV ( $V_{\text{OPT}}$ ), measured at  $T = 25^\circ\text{C}$ .



Fig. 9. CRO operating range and power.

clock and 2) communicating voltage changes to an external PMIC via the I<sup>2</sup>C bus. The PMU state machine defines four power states for the MCU (S3–S0), S3 being the most active state, with the highest power consumption while S0 is the least active state with the smallest power consumption (Fig. 12). In no-sleep state (S3), clock/power gating is not asserted.



Fig. 10. Active power reduction using clock gating, measured at  $T = 25^\circ\text{C}$ .



Fig. 11. Energy improvement with I\$ and DTCM.

The MCU is clocked by the CRO output. In state S2, also referred to as the *short sleep*, the IA core is halted and the idle units are clock gated, with CRO being functional. In state S1, also referred to as the *long sleep*, IA core is halted, idle units are clock gated, and the CRO output is gated. MCU is only driven by RTC. In *deep sleep* state (S0), core (IA+AHB) and CRO domains are power gated. The AON logic is still powered-ON and driven by RTC. In state S0, the PMU can be programmed to detect any logic level change on any of the GPIOs. It can subsequently instruct the external PMIC to power-ON the core (IA+AHB). This feature allows the MCU to be AON, monitoring key sensor events, with very little power consumption. Note that the time to transition from any sleep state to the active state (S3), also referred to as wake-up latency, progressively increases from state S3 to S0 [Fig. 12(b)].
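The four power states can be summarized as a small table and queried for the deepest sleep state that still meets a workload's needs. The Python sketch below encodes the state descriptions from this section, together with the relative energy reductions reported for Fig. 12(b). The `PowerState` record, the field names, and the selection helper `deepest_state` are hypothetical, and treating "core domain must stay powered" as a selection criterion is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerState:
    name: str
    core_running: bool   # IA core executing (not halted)
    cro_clock: bool      # CRO output available to the MCU
    core_powered: bool   # core (IA+AHB) domain powered
    energy_div: int      # energy reduction relative to S3, per the measured results


STATES = [
    PowerState("S3", True, True, True, 1),      # no sleep
    PowerState("S2", False, True, True, 2),     # short sleep
    PowerState("S1", False, False, True, 8),    # long sleep, RTC only
    PowerState("S0", False, False, False, 16),  # deep sleep, AON logic on RTC
]


def deepest_state(need_cro_clock: bool, need_core_powered: bool) -> PowerState:
    """Return the lowest-energy sleep state that meets the constraints."""
    candidates = [
        s for s in STATES[1:]
        if (s.cro_clock or not need_cro_clock)
        and (s.core_powered or not need_core_powered)
    ]
    return max(candidates, key=lambda s: s.energy_div)


print(deepest_state(need_cro_clock=True, need_core_powered=True).name)
print(deepest_state(need_cro_clock=False, need_core_powered=True).name)
print(deepest_state(need_cro_clock=False, need_core_powered=False).name)
```

A real policy would also weigh the wake-up latency of each state, which the text reports only graphically and is therefore left out here.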

Enabling four different power states allows the MCU to cater to workloads with varying duty cycles. For workloads with high activity (e.g., image acquisition and processing), the MCU can be AON. Workloads with lower duty cycles can select an appropriate power state (Fig. 12) depending on the wakeup requirements. MCU energy consumption (Core+AON) is measured for all the power states S3–S0 [Fig. 12(b)]. The highest energy consumption is measured in state S3, with the core (IA+AHB) contributing most of it. The total energy consumption reduces by 2× in state S2 and by 8× in state S1. In deep



Fig. 12. (a) Platform power states controlled by MCU. (b) MCU energy at  $V_{\text{OPT}}$  for different power states (S0–S3).



Fig. 13.  $V_{\text{OPT}}$  shift with workload activity.

sleep (S0), energy consumption is  $16\times$  lower compared to state S3. The MCU therefore enables up to  $16\times$  energy savings depending on the duty cycle and wakeup requirements for the workloads. Experiments with workloads having varying levels of activity further suggest that an appropriate operating voltage must be selected for energy-optimal operation of the MCU. Fig. 13 shows the shift in  $V_{\text{OPT}}$  as workloads with different duty cycles operate in power states S3–S1. Such a shift is caused by the fact that for workloads with low levels of activity, leakage contribution dominates the total energy. As the workload activity increases, minimum energy point ( $V_{\text{OPT}}$ ) shifts to lower values of  $V_{\text{DD}}$ . Note that no leakage recovery

TABLE II  
COMPARISON WITH STATE-OF-THE-ART MCUs

|                                                | VLSI 2015 [5]                        | VLSI 2015 [6]                        | ISSCC 2015 [7]                         | ISSCC 2015 [8]                                    | This work                                                                              |
|------------------------------------------------|--------------------------------------|--------------------------------------|----------------------------------------|---------------------------------------------------|----------------------------------------------------------------------------------------|
| Technology                                     | 28nm UTBB FD-SOI                     | 65nm SOTB                            | 65nm CMOS                              | 180nm CMOS                                        | 14nm Tri-gate CMOS                                                                     |
| Processor                                      | 32-b LatticeMico RISC                | 32-b <sup>a</sup>                    | 32-b ARM Cortex M0+                    | 32-b ARM Cortex M0+                               | 32-b x86 IA                                                                            |
| Area (mm <sup>2</sup> )                        | 1.32                                 | 16.9                                 | 3.76                                   | 2.04                                              | <b>0.79</b>                                                                            |
| $V_{\text{DD}}$ range and $V_{\text{OPT}}$ (V) | 0.3–0.5 ( $V_{\text{OPT}} = 0.375$ ) | 0.35–0.6 ( $V_{\text{OPT}} = 0.41$ ) | 0.25–1.2 ( $V_{\text{OPT}} = 0.35^b$ ) | 0.16–1.15 ( $V_{\text{OPT}} = 0.35\text{–}0.55$ ) | <b>0.308 – 1.0 (</b> $V_{\text{OPT}} = 0.370$ <b>)</b><br>$V_{\text{CRO,MIN}} = 0.128$ |
| Frequency range                                | 1–77MHz                              | 6–27MHz                              | 27kHz–66MHz                            | 2–15Hz                                            | <b>0.5–297MHz</b>                                                                      |
| Energy                                         | 4.9pJ/cycle <sup>c</sup>             | 33uW/MHz <sup>d</sup>                | 11.7pJ/cycle <sup>b</sup>              | 147.5uW/MHz;<br>44pJ/instruction <sup>e</sup>     | <b>17.18pJ/cycle<sup>f</sup></b>                                                       |
| Total on-chip memory                           | 64KB Inst + 8KB Data                 | 64KB SRAM + 16KB SRAM + 2KB ROM      | 8KB ULV SRAM + 16KB SRAM + 2KB BootROM | 128B                                              | <b>8KB I\$ + 8KB DTCM + 64KB SMEM + 16KB BootROM</b>                                   |

<sup>a</sup>ISA not reported; <sup>b</sup>AES encryption workload; <sup>c</sup>Workload not reported; <sup>d</sup>CRC32 workload; <sup>e</sup>Toggle program; <sup>f</sup>MCU always-active, running AES encryption workload

mechanism, such as power gating, was considered for this analysis (Fig. 13). In the presence of aggressive fine-grained power gating, it is possible to recover leakage energy when the system is idle, resulting in a lower  $V_{\text{OPT}}$ .
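The dependence of $V_{\text{OPT}}$ on workload activity can be reproduced qualitatively with a textbook-style energy model: a switching term proportional to activity and $V_{\text{DD}}^2$, plus a leakage term equal to leakage power integrated over a cycle whose duration grows steeply near threshold. The Python sketch below is a toy model under stated assumptions. The threshold voltage, slope factor, leakage ratio, and the smooth drive-current expression are illustrative choices and are not fitted to the measured data.

```python
import math

N_VT = 1.3 * 0.0259   # slope factor times thermal voltage in volts (assumed)
V_TH = 0.35           # threshold voltage in volts (assumed)
LEAK_RATIO = 0.01     # leakage strength relative to drive strength (assumed)


def drive(v: float) -> float:
    """Smooth on-current model spanning sub- and super-threshold operation."""
    return math.log1p(math.exp((v - V_TH) / (2 * N_VT))) ** 2


def energy_per_cycle(v: float, activity: float) -> float:
    """Normalized energy: switching term plus leakage over one cycle."""
    cycle_time = v / drive(v)               # delay grows steeply near V_TH
    dynamic = activity * v ** 2             # scales with switched capacitance
    leakage = LEAK_RATIO * v * cycle_time   # leakage power times cycle time
    return dynamic + leakage


def v_opt(activity: float) -> float:
    """Grid search for the minimum-energy supply voltage (5 mV steps)."""
    grid = [mv / 1000 for mv in range(200, 1001, 5)]
    return min(grid, key=lambda v: energy_per_cycle(v, activity))


for a in (0.05, 0.3, 1.0):
    print(f"activity {a:4.2f}: V_OPT about {v_opt(a) * 1000:.0f} mV")
```

In this model, lowering the activity shrinks only the switching term, so the leakage term dominates earlier and the minimum moves to a higher supply voltage, which matches the trend described for Fig. 13.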

### D. Comparison With Prior Work

Table II compares the NTV MCU presented in this paper with state-of-the-art MCUs [21]–[24]. The MCU presented in this paper has the smallest die area compared to prior MCUs. This is the first 32-b IA-based MCU that is functional over a large voltage (frequency) range, allowing the user to tune the performance as per application requirements. The MCU on-chip memory (I\$ + DTCM + SMEM) is far larger than that of most of the prior MCUs. This permits complex applications with large code and data memory footprints to be run on the MCU. Finally, the best energy efficiency ( $17 \mu\text{W}/\text{MHz}$ ) for the MCU as observed in the NTV regime is on par with or better than that of state-of-the-art MCUs. It is to be noted that most of the previous MCUs either run simple workloads with a small memory footprint or do not report the workload used for measuring the energy efficiency. We measured the energy efficiency with the MCU continuously running an AES-128 encryption workload on tens of KB of data in SMEM. In sum, its small form factor, high energy efficiency, and adaptability to workload characteristics make the IA-based NTV MCU an attractive compute device in the IoT space.

## VI. WSN MEASUREMENT RESULTS

Fig. 14 shows a cross section of an integrated WSN module with the IA-NTV MCU and commercially available components: energy harvesting components with solar cells, power regulating ICs, a BLE chip, a 2.4-GHz chip antenna, flash memory, temperature and humidity sensors, a pressure sensor, several passives, and an (optional) rechargeable battery. The module architecture uses a multilayer routed 1.2-mm-thick substrate with optimized lateral component placement, packed into a 10 mm × 12 mm area as shown in Fig. 14(a), and allows multilayer component stacking for a highly reduced lateral area. Fig. 14(b) shows an optional external electrical access port,



Fig. 14. Stacked WSN module architecture. (a)  $10\text{ mm} \times 12\text{ mm}$  module without external access socket. (b) Larger ( $12.8\text{ mm} \times 12\text{ mm}$ ) mote module with external access socket.



Fig. 15. Integrated wireless sensor module with components highlighted.



Fig. 16. Packaged solar cell. (a) Top view: two series-connected solar cell dies with wire bonding connections. (b) Top and bottom of the solar BGA package.

enabled by a 24-pin surface-mounted socket, but at the cost of an increased module area of  $12.8\text{ mm} \times 12\text{ mm}$ . A picture of a WSN module after dual-side component assembly, with the various components identified, is shown in Fig. 15.

A custom solar cell ball grid array (BGA) package was designed to fit the WSN module. Monocrystalline solar wafers were diced into smaller  $2.5\text{ mm} \times 2.5\text{ mm}$  dies and connected to a 4-pin BGA package through conductive die attach and gold wire bonding. A two-die series-connected configuration, as shown in Fig. 16(a), was chosen to achieve a maximum power point tracking (MPPT) output voltage of  $\sim 800\text{ mV}$ , which improves the boost charging efficiency of the energy harvesting IC. Fig. 16(b) shows the top and bottom views of the assembled solar cell package. The functional



Fig. 17. Measured WSN AOAS power profile over a 4-min interval.



Fig. 18. Measured WSN power with on-platform VRs and with external power supplies. VR efficiency is observed in the 30–60% range over the entire workload.

WSN operates perpetually, harvesting energy from indoor light (1000 lux) with the solar cell. Power for the WSN is measured for AOAS operation with the sensor polling data at the rate of 30 Hz and communicating the same to the MCU over the I<sup>2</sup>C interface. For this measurement at  $T = 25^\circ\text{C}$ ,  $V_{\text{CORE}}$  is set to 0.45 V, while the IO voltage ( $V_{\text{IO}}$ ) and the platform voltage are both set to 1.5 V. Fig. 17 shows the measured WSN power profile over a 4-min interval. This profile captures the power consumption as the platform moves through major events. Post-boot, the platform enters an AOAS state in which the sensor is polling and BLE is advertising continuously, with an advertisement interval of 4 s. The MCU operates in a short sleep state, waking up to service any interrupt from the sensor or BLE. On detecting BLE advertisement packets from the WSN, a remote BLE master can initiate a connection with the BLE module on the WSN. In the connected state, the WSN consumes almost 5× higher power compared to the AOAS state (Fig. 17). The power consumption is higher still (10×) when BLE actively transmits BLE packets with 14 B of raw sensor data at the rate of 56 kbps. WSN power consumption falls back to the AOAS profile when the connection to the remote BLE master is terminated. With BLE communication being infrequent for typical WSN workloads, average WSN power is dominated by power consumption in the AOAS state. For our experiments, in the AOAS state, the MCU consumes  $290\mu\text{W}$ , operating at 13 MHz, while the rest of the platform contributes  $70\mu\text{W}$ . For an AOAS workload with deep sleep (S0) periods, MCU power consumption drops further to just  $120\mu\text{W}$ . The WSN power profile with on-platform VRs and with external power supplies is given in Fig. 18. Owing to the lower efficiency of commercial voltage regulators (VRs) at  $\mu\text{A}$  load conditions, average power consumption for the WSN in the AOAS state was found to increase to 1 mW.
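The AOAS power budget quoted above can be checked with a short sketch. The constants are the measured figures from the text. The function names, the reading of "almost 5×" as exactly 5, the assumption that the 5× and 10× ratios apply to total WSN power measured with external supplies, and the duty-cycle fractions in the usage note are illustrative assumptions and are not taken from the measurements.

```python
# Figures quoted in the text, in microwatts.
MCU_AOAS_UW = 290.0       # MCU in AOAS state at 13 MHz, V_CORE = 0.45 V
PLATFORM_AOAS_UW = 70.0   # rest of the platform in AOAS state
WSN_WITH_VR_UW = 1000.0   # AOAS power with on-platform VRs (about 1 mW)


def aoas_power_uw():
    """Total AOAS power with external supplies: MCU plus rest of platform."""
    return MCU_AOAS_UW + PLATFORM_AOAS_UW


def implied_vr_efficiency():
    """Load power divided by power drawn when on-platform VRs are used."""
    return aoas_power_uw() / WSN_WITH_VR_UW


def average_power_uw(frac_connected, frac_transmit,
                     connected_ratio=5.0, transmit_ratio=10.0):
    """Time-weighted average WSN power.

    frac_connected and frac_transmit are hypothetical fractions of time
    spent in the BLE connected and BLE transmitting states. The default
    ratios assume the 5x and 10x figures scale the total AOAS power.
    """
    frac_aoas = 1.0 - frac_connected - frac_transmit
    if min(frac_aoas, frac_connected, frac_transmit) < 0.0:
        raise ValueError("time fractions must be non-negative and sum to <= 1")
    base = aoas_power_uw()
    return base * (frac_aoas
                   + connected_ratio * frac_connected
                   + transmit_ratio * frac_transmit)


if __name__ == "__main__":
    print(f"AOAS power: {aoas_power_uw():.0f} uW")
    print(f"Implied VR efficiency: {implied_vr_efficiency():.0%}")
    print(f"Average power: {average_power_uw(0.01, 0.001):.1f} uW")
```

Under these assumptions the AOAS total is 360 µW, and the implied regulator efficiency of 36% falls inside the 30–60% range noted for Fig. 18. With a hypothetical 1% of time connected and 0.1% transmitting, the average stays within about 5% of the AOAS figure, which is consistent with the statement that AOAS power dominates.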

## VII. CONCLUSION

In this paper, we presented a 0.79-mm<sup>2</sup>  $\mu$ W NTV MCU in 14-nm tri-gate CMOS. The NTV MCU is functional over a wide dynamic range from 297 MHz (1 V) to 0.5 MHz (308 mV), with the maximum energy efficiency achieved at 370 mV ( $V_{OPT}$ ). At  $V_{OPT}$ , the MCU operates with an energy efficiency of 17  $\mu$ W/MHz, a 4.8 $\times$  better energy efficiency compared to operation at 1 V. The MCU exposes several power states to the programmer and allows up to 16 $\times$  energy savings for workloads with varying levels of activity. With the NTV IA-based MCU, we also demonstrated an autonomous energy-harvesting WSN with  $P_{AVG}$  of 360  $\mu$ W for AOAS workloads.
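As a rough cross-check of the efficiency figures, note that µW/MHz and pJ/cycle are the same unit, so energy per cycle follows directly from power divided by frequency. The sketch below applies this to the endpoint operating points quoted in the abstract (23.5 mW at 297 MHz and 21 µW at 0.5 MHz). The function and variable names are illustrative, and the endpoint numbers are rounded values, so the result is only an approximation of the measured energy curve.

```python
def energy_per_cycle_pj(power_uw, freq_mhz):
    """Energy per cycle in pJ.

    1 uW / 1 MHz = 1e-6 J/s / 1e6 cycles/s = 1e-12 J/cycle = 1 pJ/cycle.
    """
    return power_uw / freq_mhz


# Operating points quoted in the abstract and conclusion.
E_1V_PJ = energy_per_cycle_pj(23_500.0, 297.0)  # 1 V: 23.5 mW at 297 MHz
E_308MV_PJ = energy_per_cycle_pj(21.0, 0.5)     # 308 mV: 21 uW at 0.5 MHz
E_VOPT_PJ = 17.0                                # 370 mV: reported 17 pJ/cycle
F_VOPT_MHZ = 3.5                                # reported frequency at V_OPT

# Power at V_OPT implied by the reported energy and frequency.
P_VOPT_UW = E_VOPT_PJ * F_VOPT_MHZ

# Energy-efficiency gain of V_OPT over 1 V, from the rounded endpoints.
GAIN_VS_1V = E_1V_PJ / E_VOPT_PJ

if __name__ == "__main__":
    print(f"1 V:    {E_1V_PJ:.1f} pJ/cycle")
    print(f"V_OPT:  {E_VOPT_PJ:.1f} pJ/cycle ({P_VOPT_UW:.1f} uW)")
    print(f"308 mV: {E_308MV_PJ:.1f} pJ/cycle")
    print(f"Gain at V_OPT vs 1 V: {GAIN_VS_1V:.2f}x")
```

The rounded endpoints give a gain of roughly 4.7×, close to the reported 4.8×; the small gap is plausibly due to rounding in the quoted power and frequency values. The 308-mV point works out to 42 pJ/cycle, higher than at $V_{OPT}$, which is consistent with energy per cycle rising again below $V_{OPT}$.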

WSNs are fast approaching the concept of “smart dust,” which can sense, compute, and wirelessly relay real-time information about the ambient environment. However, for the concept to become a reality, future WSNs need low-power always-ON long-range radios for infrequent communication with distant peers. More focus is also required on high-efficiency on-die VRs for  $\mu$ A load conditions and on efficient circuits for harvesting from multiple sources such as solar and RF. Lastly, high-density packaging is required to improve the form factor.

## ACKNOWLEDGMENT

The authors would like to thank S. Park, T. Nguyen, D. Kurian, S. Liff, M. Kumar, S. Jayaraman, S. Darshana, R. Jumade, S. Karpenko, J. Kulkarni, S. Jain, C. Roberts, A. Srinivasan, Y. Liao, Y. Hoskote, E. Petryk, K. Caviaasca, J. Held, M. Haycock, and M. Mayberry at Intel for their help, encouragement, and support, and Intel’s assembly and test technology development team for their efforts with the chip package design and assembly.

## REFERENCES

- [1] Intel Corporation. *A Guide to the Internet of Things*. Accessed on Oct. 10, 2016. [Online]. Available: <http://www.intel.com/content/www/us/en/internet-of-things/infographics/guide-to-iot.html>
- [2] B. A. Warneke and K. S. J. Pister, “An ultra-low energy microcontroller for smart dust wireless sensor networks,” in *IEEE Int. Solid State Circuits Conf. Dig. Tech. Papers*, Feb. 2004, pp. 316–317.
- [3] S. Jain *et al.*, “A 280mV-to-1.2V wide-operating-range IA-32 processor in 32nm CMOS,” in *IEEE Int. Solid State Circuits Conf. Dig. Tech. Papers*, Feb. 2012, pp. 66–68.
- [4] S. Paul *et al.*, “A 3.6GB/s 1.3mW 400mV 0.051mm<sup>2</sup> near-threshold voltage resilient router in 22nm tri-gate CMOS,” in *VLSI Circuits Symp. Dig. Tech. Papers*, Jun. 2013, pp. C30–C31.
- [5] H. Jayakumar, K. Lee, W. S. Lee, A. Raha, Y. Kim, and V. Raghunathan, “Powering the Internet of Things,” in *Int. Symp. Low-Power Electron. Design (ISLPED) Dig. Tech. Papers*, Aug. 2014, pp. 375–380.
- [6] L. Nachman, J. Huang, J. Shahabdeen, R. Adler, and R. Kling, “IMOTE2: Serious computation at the edge,” in *Int. Wireless Commun. Mobile Comput. Conf. Dig. Tech. Papers*, Aug. 2008, pp. 1118–1123.
- [7] M. Seok, D. Blaauw, and D. Sylvester, “Clock network design for ultra-low power applications,” in *Int. Symp. Low-Power Electron. Design (ISLPED) Dig. Tech. Papers*, Aug. 2010, pp. 271–276.
- [8] M. Seok, S. Hanson, D. Blaauw, and D. Sylvester, “Sleep mode analysis and optimization with minimal-sized power gating switch for ultra-low  $V_{dd}$  operation,” *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 20, no. 4, pp. 605–615, Apr. 2012.
- [9] D. Bol *et al.*, “A 25MHz 7 $\mu$ W/MHz ultra-low-voltage microcontroller SoC in 65nm LP/GP CMOS for low-carbon wireless sensor nodes,” in *IEEE Int. Solid State Circuits Conf. Dig. Tech. Papers*, Feb. 2012, pp. 490–492.
- [10] D. Bol *et al.*, “Sleepwalker: A 25-MHz 0.4-V sub-mm<sup>2</sup> 7- $\mu$ W/MHz microcontroller in 65-nm LP/GP CMOS for low-carbon wireless sensor nodes,” *IEEE J. Solid-State Circuits*, vol. 48, no. 1, pp. 20–32, Jan. 2013.
- [11] H. Okuhara, K. Kitamori, Y. Fujita, K. Usami, and H. Amano, “An optimal power supply and body bias voltage for a ultra low power micro-controller with silicon on thin box MOSFET,” in *Int. Symp. Low-Power Electron. Design (ISLPED) Dig. Tech. Papers*, Jul. 2015, pp. 207–212.
- [12] S. Kim and M. Seok, “Variation-tolerant ultra-low-voltage microprocessor with a low-overhead, within-a-cycle *in-situ* timing-error detection and correction technique,” *IEEE J. Solid-State Circuits*, vol. 50, no. 6, pp. 1478–1490, Jun. 2015.
- [13] M. D. Scott, B. E. Boser, and K. S. J. Pister, “An ultralow-energy ADC for smart dust,” *IEEE J. Solid-State Circuits*, vol. 38, no. 7, pp. 1123–1129, Jul. 2003.
- [14] D. Bol, G. de Strel, F. Botman, A. K. Lusala, and N. Couniot, “A 65-nm 0.5-V 17-pJ/frame/pixel DPS CMOS image sensor for ultra-low-power SoCs achieving 40-dB dynamic range,” in *VLSI Circuits Symp. Dig. Tech. Papers*, Jun. 2014, pp. 1–2.
- [15] S. C. Folea and G. Mois, “A low-power wireless sensor for online ambient monitoring,” *IEEE Sensors J.*, vol. 15, no. 2, pp. 742–749, Feb. 2015.
- [16] J. K. Brown, K. K. Huang, E. Ansari, R. R. Rogel, Y. Lee, and D. D. Wentzloff, “An ultra-low-power 9.8GHz crystal-less UWB transceiver with digital baseband integrated in 0.18 $\mu$ m BiCMOS,” in *IEEE Int. Solid State Circuits Conf. Dig. Tech. Papers*, Feb. 2013, pp. 442–443.
- [17] A. Klinefelter *et al.*, “A 6.45 $\mu$ W self-powered IoT SoC with integrated energy-harvesting power management and ULP asymmetric radios,” in *IEEE Int. Solid State Circuits Conf. Dig. Tech. Papers*, Feb. 2015, pp. 1–3.
- [18] Y. Lee *et al.*, “A modular 1 mm<sup>3</sup> die-stacked sensing platform with low power I<sup>2</sup>C inter-die communication and multi-modal energy harvesting,” *IEEE J. Solid-State Circuits*, vol. 48, no. 1, pp. 229–243, Jan. 2013.
- [19] Intel Corporation. *Intel Quark Processors*. Accessed on Oct. 10, 2016. [Online]. Available: <http://www.intel.com/content/www/us/en/embedded/products/quark/overview.html>
- [20] C.-H. Jan *et al.*, “A 14 nm SoC platform technology featuring 2<sup>nd</sup> generation tri-gate transistors, 70 nm gate pitch, 52 nm metal pitch, and 0.0499  $\mu$ m<sup>2</sup> SRAM cells, optimized for low power, high performance and high density SoC products,” in *VLSI Technol. Symp. Dig. Tech. Papers*, Jun. 2015, pp. T12–T13.
- [21] M. Turnquist, M. Hiienkarri, J. Mäkipää, R. Jevtic, E. Pohjalainen, and T. Kallio, “Fully integrated DC-DC converter and a 0.4V 32-bit CPU with timing-error prevention supplied from a prototype 1.55V Li-ion battery,” in *VLSI Circuits Symp. Dig. Tech. Papers*, Jun. 2015, pp. C320–C321.
- [22] Y. Tsuji *et al.*, “Sub- $\mu$ W standby power, <18  $\mu$ W/DMIPS@25MHz MCU with embedded atom-switch programmable logic and ROM,” in *VLSI Technol. Symp. Dig. Tech. Papers*, Jun. 2015, pp. T86–T87.
- [23] J. Myers, A. Savanth, D. Howard, R. Gaddh, P. Prabhat, and D. Flynn, “An 80nW retention 11.7pJ/cycle active subthreshold ARM Cortex-M0+ subsystem in 65nm CMOS for WSN applications,” in *IEEE Int. Solid State Circuits Conf. Dig. Tech. Papers*, Feb. 2015, pp. 144–145.
- [24] W. Lim, I. Lee, D. Sylvester, and D. Blaauw, “Batteryless sub-nW Cortex-M0+ processor with dynamic leakage-suppression logic,” in *IEEE Int. Solid State Circuits Conf. Dig. Tech. Papers*, Feb. 2015, pp. 146–147.



**Somnath Paul** (S’07–M’11) received the B.E. degree in electronics and telecommunication engineering from Jadavpur University, Kolkata, India, and the Ph.D. degree in computer engineering from Case Western Reserve University, Cleveland, OH, USA.

He is currently a Research Scientist with the Circuit Research Labs, Intel Corporation, Hillsboro, OR, USA. His research interests include hardware-software co-design for improving energy efficiency, yield, and reliability in nanoscale technologies.

Dr. Paul was a recipient of the 2011 Outstanding Dissertation Award from the European Design and Automation Association and the 2012 Best Paper Award in the International Conference on Very Large Scale Integration Design.



**Vinayak Honkote** (S'07–M'11) received the Bachelor's degree from the Bangalore Institute of Technology, Bengaluru, India, in 2003, and the M.S. and Ph.D. degrees in electrical engineering from Drexel University, Philadelphia, PA, USA, in 2006 and 2010, respectively. His Ph.D. research focused on design and automation of ultralow-power resonant clocking technologies.

He joined Intel in 2011, where he is currently a Research Scientist with the Architecture and Design Research group, and is involved in many topics including low-power clocking, voltage regulators, system-on-a-chip prototyping, human computer interaction, vision-based interactive display, energy harvesting systems, and near threshold voltage computing. His current research interests include energy-efficient computing, autonomous systems, and neuromorphic computing.



**Ryan Gary Kim** (S'14–M'16) received the B.S. and Ph.D. degrees in electrical engineering and computer sciences from Washington State University, Pullman, WA, USA, in 2011 and 2016, respectively. He was a research intern at Circuit Research Labs, Intel Labs, Intel Corporation, in 2014 and 2015. During his internship he contributed towards the validation and software development for ultra-low power IA-based sensor nodes. He is currently a postdoctoral researcher at Carnegie Mellon University under Prof. Radu Marculescu. His research interests are on the energy efficiency and scalability of manycore systems. Specifically, he focuses on wireless NoC design and protocols, power management techniques, and machine learning for heterogeneous/homogeneous manycore optimization.



**Turbo Majumder** (S'11–M'13) received the B.Tech. (Hons.) degree in electronics and electrical communication engineering and the M.Tech. degree in automation and computer vision from IIT Kharagpur, Kharagpur, India, in 2005, and the Ph.D. degree in electrical engineering from Washington State University, Pullman, WA, USA, in 2013.

He was an ASIC Design Engineer at NVidia from 2005 to 2006, a Senior Design Engineer at Freescale Semiconductor, from 2006 to 2009, and an Assistant Professor with the Department of Electrical Engineering, IIT Delhi, New Delhi, India, from 2013 to 2015. He is currently a Research Scientist with the Circuit Research Labs at Intel, Hillsboro, OR, USA. His current research interests include energy-efficient computing, manycore network-on-chip platforms, systems-on-a-chip, and hardware acceleration for high-performance computing.



**Paolo A. Aseron** received the B.S. degree in computer engineering from the University of the Philippines, Quezon City, Philippines, in 2001.

He was with Canon, Quezon City, Philippines, and Shimomaruko Tokyo, Japan, where he focused on systems-on-a-chip platform development from 2001 to 2003. He has been with Intel Labs, Intel Corporation, Hillsboro, OR, USA, since 2006. His current research interests include high-performance low-power architecture, circuits, memory, wireless communications, and power delivery.



**Vaughn Grossnickle** received the B.Sc. and M.Sc. degrees from Brigham Young University, Provo, UT, USA, in 2000.

He joined Intel Corporation, Hillsboro, OR, USA, in 2000, and has 15 years of experience in the area of clock generation and distribution for advanced system-on-a-chip (SoC) processors. He is currently a Clocking Domain Lead with the SoC development team responsible for generations of Intel Core, Pentium, and Celeron products. He holds six patents.



**Robert Sankman** received the Bachelor's degree in chemical engineering from the University of Illinois, Champaign, IL, USA, in 1980.

In 1980, he joined Intel, Chandler, AZ, USA, as a Process Engineer, during the startup of Intel's Fab 6 facility at Chandler, AZ, USA, where he is an Intel Fellow and the Director of package pathfinding with the Assembly Test Technology Development group. He is responsible for directing the definition of packaging and assembly activities for Intel's advanced logic products. He has also contributed his expertise to numerous papers. He holds over 30 patents in the field of electronic packaging.



**Debendra Mallik** (M'98–SM'03–F'15) received the B.Tech. degree in mechanical engineering from IIT Kharagpur, Kharagpur, India, in 1980, and the M.S. degree in engineering science and mechanics from Iowa State University, USA, in 1983.

He has been with the Assembly and Test Technology Development group at Intel Corporation, Chandler, AZ, USA, since 1983. He has contributed to the development of numerous advanced semiconductor package technologies, including Intel's first organic flip chip package. He has authored over 15 technical papers and holds over 35 U.S. patents.

Mr. Mallik has served in various professional organizations such as the IEEE CPMT Board of Governors, the IEEE Phoenix Section Executive Committee, the International Technology Roadmap for Semiconductors, and the IEEE Electronic Components and Technology Conference.



**Tao Wang** received the B.S. degree in materials science and engineering, and the M.S. degree in electronic science and technology from Tsinghua University, Beijing, China, in 2010 and 2012, respectively, and the M.S. degree in electrical computer engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2014.

Since 2014, he has been with the Assembly and Test Technology Development group, Intel Corporation, Chandler, AZ, USA, where he has been involved in package architecture and design for processors, application-specific integrated circuits (ASICs), RF/microwave modules, and Internet of Things systems.



**Sriram Vangal** (M'00–SM'13) received the B.E. degree from Bangalore University, India, in 1993, and the M.S. degree from the University of Nebraska, Lincoln, in 1995, and the Ph.D. degree in electrical engineering from Linköping University, Sweden in 2007, all in computer engineering.

He joined Intel Corporation in 1995 and has played a lead role in multi-core CPU development and ultra-low power silicon research. He is a Principal Engineer and was the R&D lead for the advanced prototype team that designed the industry's first single-chip 80-core, sub-100W "Polaris" TeraFLOPS processor (2006), the 48-iA core "Rock Creek" single-chip cloud computer (SCC - 2009), and "Claremont" near-threshold voltage (NTV) iA processor (2011).

Dr. Vangal has received two Intel Achievement Awards for his work. He has published over 35 conference and journal papers in this field, has authored three book chapters, and has over 30 issued patents. His more recent research focus is in the areas of energy-efficient and sustainable net-zero energy computing.



**Vivek De** (M'89–SM'07–F'11) received the Ph.D. degree in electrical engineering from the Rensselaer Polytechnic Institute, Troy, NY, USA.

He is currently an Intel Fellow and the Director of Circuit Technology Research with Intel Labs at Intel Corporation, Hillsboro, OR, USA. He is responsible for providing strategic technical directions for long-term research in future circuit technologies and leading energy efficiency research across the hardware stack. He has authored or co-authored over 249 publications in refereed international conferences and journals, and holds 208 patents issued, with 27 more patents filed (pending).

Dr. De received the Intel Achievement Award for his contributions to an integrated voltage regulator technology. He received the Best Paper Award at the 1996 IEEE International Application-Specific Integrated Circuit Conference, and nominations for the Best Paper Awards at the 2007 IEEE/ACM Design Automation Conference and the 2008 IEEE/ACM International Conference on Computer-Aided Design. One of his publications was recognized in the 2013 IEEE/ACM Design Automation Conference as one of the "Top 10 Cited Papers in 50 Years of Design Automation Conference."



**James W. Tschanz** (S'94–M'99) received the B.S. degree in computer engineering and the M.S. degree in electrical engineering from the University of Illinois at Urbana–Champaign, Champaign, IL, USA.

He is a Circuits Researcher at Intel Corporation, Hillsboro, OR, USA, and is currently the Director of the Intel Circuit Research Lab. Since 1999, he has been involved in low-power circuit research at Intel. He also taught very large scale integrated design for seven years as an Adjunct Faculty Member at the Oregon Graduate Institute. He has authored over 53 conference and journal papers and three book chapters, and holds 51 issued patents.