

# Postsilicon Voltage Guard-Band Reduction in a 22 nm Graphics Execution Core Using Adaptive Voltage Scaling and Dynamic Power Gating

Minki Cho, *Member, IEEE*, Stephen T. Kim, *Member, IEEE*, Carlos Tokunaga, *Member, IEEE*, Charles Augustine, *Member, IEEE*, Jaydeep P. Kulkarni, *Senior Member, IEEE*, Krishnan Ravichandran, *Senior Member, IEEE*, James W. Tschanz, *Member, IEEE*, Muhammad M. Khellah, *Senior Member, IEEE*, and Vivek De, *Fellow, IEEE*

**Abstract**—In high volume manufacturing, the conventional approach to dealing with inverse-temperature dependence (ITD) and aging is to add a postsilicon flat voltage guard band to all dies based on testing a small random sample of dies. Although this scheme guarantees error-free operation, it significantly degrades energy efficiency, as it penalizes all dies for the maximum delay degradation due to ITD and aging as seen by the worst case die, while also assuming the maximum aging condition. In this paper, a graphics execution core implemented in a 22 nm trigate process uses a per-die tunable replica circuit (TRC) to monitor delay degradation due to ITD and actual aging conditions. The TRC triggers adaptive voltage scaling to dynamically adjust $V_{CC}$ as needed during run time to maintain correct operation at minimum additional voltage. Measured data show up to 33% (14%) energy savings at 0.4 V (0.8 V) compared with the baseline scheme. The TRC is also utilized in a dynamic power gating (DPG) scheme to lower the energy overhead due to the fast droop guard band. DPG introduces a load line effect during normal operation, thus saving energy, while deactivating this load line upon droop detection by the TRC to maintain ISO performance with the baseline. Silicon data show that DPG can improve energy efficiency by 14.5% (7%) at 0.8 V (0.6 V).

**Index Terms**—Adaptive voltage scaling (AVS), aging, dynamic power gating (DPG), execution core, graphics, guard band, high volume manufacturing (HVM), inverse temperature dependence (ITD), margin, voltage droop.

## I. INTRODUCTION

**M**OORE'S law continued scaling of process technology enables performance improvement and low power consumption in various applications ranging from small mobile SoCs to multicore server chips. However, limited resolution of scaled technologies causes considerable die-to-die and within-die process variations, which hurt yield in high volume

Manuscript received May 11, 2016; revised July 7, 2016 and August 10, 2016; accepted August 10, 2016. Date of publication September 12, 2016; date of current version January 4, 2017. This paper was approved by Guest Editor Dennis Sylvester. This work was supported by the U.S. Government (DARPA).

The authors are with the Circuit Research Laboratory, Intel Corporation, Hillsboro, OR 97124 USA (e-mail: minki.cho@intel.com).

Color versions of one or more of the figures in this paper are available online at <http://ieeexplore.ieee.org>.

Digital Object Identifier 10.1109/JSSC.2016.2601319

manufacturing (HVM), where the majority of chips must satisfy a performance target. Voltage binning is typically used to improve energy efficiency in the presence of process variations by finding the per-die minimum intrinsic $V_{CC}$ that allows each die to run at the target frequency. In addition to process variation, dynamic variations in temperature, aging, and supply voltage noise need to be addressed as well. In conventional HVM, a flat voltage guard band based on the worst case possible dynamic variations is added on top of the intrinsic $V_{CC}$ of each die. While using a conservative flat guard-band approach is much faster and more cost effective, it unnecessarily increases the energy consumption of an average chip, as it applies the same flat voltage guard band to the whole yield. Alternatively, per-die postsilicon characterization can be used as the most energy efficient means to achieve error-free operation with the least amount of guard band. However, it is a prohibitively expensive procedure due to additional per-die testing time, impacting time-to-market and final product cost.

Adaptive voltage scaling (AVS) and/or frequency scaling techniques have been developed as possible effective solutions to improve performance and energy efficiency in the presence of dynamic variations. Grenat *et al.* [1] present power management techniques to boost CPU/GPU frequencies as long as the temperature remains below the thermal limit. Clerc *et al.* [2] propose a closed-loop temperature-process compensation system that uses a critical path monitor to control the amount of back-gate forward body bias in order to maintain the target frequency across a wide dynamic range. Kim *et al.* [3] used an on-die odometer that measures the impact of bias temperature instability (BTI), hot carrier injection (HCI), and time-dependent dielectric breakdown (TDDB) on the delay of digital circuits by comparing the beat frequencies of two ring oscillators. Cho *et al.* [5] assessed the benefit of using a tunable replica circuit (TRC) monitor [4] to accurately track the gradual degradation of the voltage–frequency characteristics of a digital block due to voltage–temperature stress cycles at runtime under different operational conditions. The TRC was also used to detect an oncoming droop and trigger instruction replay [6], or perform clock division or gating [7].



Fig. 1. ITD guard band. (a) Conceptual ITD graph of an example die. (b) Conventional ITD guard-band determination in HVM. (c) Accounting for ITD in conventional design.



Fig. 2. Conventional postsilicon voltage guard-band determination for (a) aging and (b) voltage droop.



Fig. 3. AVS with TRC. (a) Overall schematic. (b) ITD application. (c) Aging application.

In addition, the TRC was used in a digital low-dropout (DLDO) design to kick off a nonlinear response mode to reduce output rail degradation during a droop event [8]. Other droop mitigation schemes employ multiple PLLs and switch during a droop to another PLL already locked at a lower speed [9], or employ an adaptive frequency system where a single PLL's voltage-controlled oscillator self-adapts to supply noise [10].

In this paper, we present a 22 nm graphics execution core that uses an *in situ* TRC to monitor critical timing margin and trigger AVS as needed to dynamically adjust $V_{CC}$ during run time. The TRC monitors slow variations in temperature and aging and provides a time-to-digital converter (TDC) code, representing the timing margin measurement, to an AVS controller. Based on the TDC code, the AVS controller communicates a new voltage ID (VID) to the external voltage regulator module (VRM) to maintain the minimum $V_{CC}$ necessary to meet a given performance level. In addition, the TRC is also utilized in a dynamic power gating (DPG) scheme to lower



Fig. 4. Schematic of the TRC.



Fig. 5. (a) Baseline power gating. (b) Proposed DPG scheme.

energy overhead due to fast droop guard band. DPG introduces a load line effect during normal operation, thus saving energy for the average load current case, while deactivating this load line upon droop detection by the TRC to maintain ISO performance as baseline.

The rest of this paper is organized as follows. Section II explains the conventional guard-band approach, which motivates AVS and DPG. Section III describes the implementation details of the proposed AVS and DPG. Section IV gives the details of our test-chip prototype. Section V presents the silicon measurements, and Section VI concludes this paper by summarizing the key results and insights.

## II. CONVENTIONAL POSTSILICON VOLTAGE GUARD BAND

### A. Inverse Temperature Dependence

High temperature has traditionally been one of the most important issues hampering performance and reliability in silicon. Maximum achievable frequency degrades with temperature rise at nominal voltage; thus, voltage needs to increase to meet the target frequency. The increased power consumed at high temperature to maintain performance generates more heat, which in turn increases power in a form of positive feedback. To maintain temperature and performance within the limit of the existing cooling systems, voltage/frequency throttling has been



Fig. 6. Two-point TRC calibration example where two TRC programings are used across  $F-V_{CC}$ .

widely used. At low operating voltage, inverse temperature dependence (ITD) causes delay degradation as temperature decreases [11]. ITD depends on the tradeoff between mobility improvement at low temperature on one hand, and over-drive voltage reduction due to threshold voltage increase, on the other hand. With technology scaling, the crossover point between normal temperature dependence (NTD), where performance degrades as temperature rises, and ITD, where performance degrades as temperature decreases, moves higher toward nominal  $V_{CC}$  operating point.

Fig. 1(a) shows a conceptual ITD graph of an example die with an ITD metric of $V_{CC\text{-HOT}}/V_{CC\text{-COLD}}$, where $V_{CC\text{-COLD}}$ is the minimum $V_{CC}$ at COLD, and $V_{CC\text{-HOT}}$ is the minimum $V_{CC}$ at HOT to meet the same given delay target. When $V_{CC\text{-HOT}}/V_{CC\text{-COLD}}$ is less than 1, ITD has occurred. Thus, to maintain the same frequency at cold temperature, we need to adapt $V_{CC}$ based on the temperature slope. In HVM, a small random sample of dies is tested to determine the worst case ITD compensation that should be used for the rest of the population. Die-to-die process variation results in large scattering of the ITD band, which increases the worst case ITD compensation, as shown in Fig. 1(b). During postsilicon characterization, and for each frequency target, we estimate the worst case slope across the temperature range and store it in a lookup table (LUT). During actual usage, and based on the current temperature reading (sampled on a very slow clock) from one or more distributed temperature sensors, the power management unit (PMU) uses the LUT to determine how much voltage increase is needed to maintain ISO performance and accordingly sends a new VID to the external voltage regulator, as shown in Fig. 1(c).
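The baseline LUT flow of Fig. 1(c) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the PMU implementation: the LUT layout (one worst case slope in volts per degree Celsius for each frequency target), the 90 °C reference temperature, the VID step size, and every numeric value are hypothetical placeholders.

```python
import math

# Hypothetical LUT: frequency target (MHz) -> worst case ITD slope
# (volts per degC of cooling below T_HOT), taken from the die sample.
# All numbers are illustrative placeholders, not measured data.
WORST_CASE_SLOPE_V_PER_C = {200: 4.0e-4, 600: 2.0e-4}

T_HOT_C = 90.0       # assumed temperature at which V_CC-HOT was found
VID_STEP_V = 0.005   # assumed VRM voltage resolution


def itd_guard_band(freq_mhz: int, temp_c: float) -> float:
    """Voltage to add on top of V_CC-HOT at the current temperature."""
    slope = WORST_CASE_SLOPE_V_PER_C[freq_mhz]
    delta_t = max(0.0, T_HOT_C - temp_c)  # compensate only when colder than HOT
    return slope * delta_t


def vid_for(v_cc_hot: float, freq_mhz: int, temp_c: float) -> float:
    """Compensated voltage, rounded up to the next VID step."""
    target = v_cc_hot + itd_guard_band(freq_mhz, temp_c)
    # The small epsilon keeps exact multiples from rounding up a full step.
    return math.ceil(target / VID_STEP_V - 1e-9) * VID_STEP_V
```

Because the slope is the worst case over the sample, every die receives the same compensation at a given temperature, which is the energy penalty that per-die monitoring aims to remove.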

### B. Aging

Along with ITD guard band, aging becomes a critical issue in deep nanometer nodes for reliable operation during lifetime. Negative BTI and positive BTI degrade the performance of PMOS/NMOS devices due to threshold voltage increases over time [12]. Increased threshold voltage results in higher minimum operating  $V_{CC}$  to run at the same target frequency. Degradation in voltage-frequency characteristics that is induced by BTI aging varies across dies. Moreover, aging



Fig. 7. Postsilicon TRC calibration flow.



Fig. 8. Postsilicon TDC bit selection for DPG.

also varies from one core to another, depending on actual voltage/temperature stress cycles as impacted by the core's DVFS usage, power-gating, and clock-gating events. In conventional HVM, the aging guard band needs to be chosen as the worst case voltage shift across process and usage. To determine this guard band, a random sample of dies is measured at different frequency targets under worst case usage, as shown in Fig. 2(a). The shift from beginning-of-life (BOL) voltage, $V_{CC\text{-BOL}}$, to end-of-life (EOL) voltage, $V_{CC\text{-EOL}}$, is measured



Fig. 9. Comparison of operating  $V_{CC}$  using conventional guard band to that with AVS and DPG.



Fig. 10. Overall chip block diagram.

for each die with the worst case shift taken as the aging  $V_{GB}$  for the rest of the population.
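As a minimal sketch of this worst case selection (the function names and the sample values in the usage note are hypothetical, and the shift is expressed relative to the BOL voltage as an assumption), the flat aging guard band and the margin each die carries unnecessarily could be computed as:

```python
def flat_aging_guard_band(sample):
    """Worst case relative BOL-to-EOL voltage shift across the sample.

    sample: list of (v_cc_bol, v_cc_eol) pairs in volts, one per tested die,
    all measured at the same frequency target.
    """
    return max((eol - bol) / bol for bol, eol in sample)


def excess_margin(sample):
    """Guard band each die receives beyond its own measured shift."""
    flat = flat_aging_guard_band(sample)
    return [flat - (eol - bol) / bol for bol, eol in sample]
```

For a hypothetical sample `[(0.40, 0.42), (0.40, 0.44), (0.41, 0.43)]`, the second die sets the guard band, and the other two dies carry margin they never need.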

### C. Voltage Droop

In an execution core, simultaneous operations in memory and/or logic circuits demand high current flow, which creates fast transient voltage droops. The worst case voltage droop degrades maximum frequency, requiring the addition of a voltage droop guard band to enable the core to run at the target frequency, as shown in Fig. 2(b). In HVM, the worst case voltage droop is determined by testing the core using a power-virus-like workload, which rarely occurs in real life. Thus, the conventional guard-band approach tends to highly overcompensate for the impact of voltage droop for an average workload, reducing energy efficiency as a result.

## III. ADAPTIVE VOLTAGE SCALING AND DYNAMIC POWER GATING

In this section, we detail the implementation of our TRC-based AVS for reducing ITD and aging guard bands, and also explain how TRC can be used along with DPG to mitigate the impact of voltage droop.

### A. Adaptive Voltage Scaling for ITD and Aging

The TRC is programmed during HVM test to track the behavior of a critical path delay as temperature and aging



Fig. 11. Die micrograph and overall test-chip details.

conditions change. During actual operation, a small shift in the TRC delay margin from a reference point that is determined at test time indicates a temperature change and/or an aging event. This shift is communicated to the PMU, which adjusts $V_{CC}$ as needed through the VRM, as shown in Fig. 3. Unlike the conventional guard-band approach, TRC-based AVS allows per-die minimum voltage adaptation to maintain error-free operation as temperature gets colder. Similarly, aging compensation is tailored per die based on its own voltage/temperature stress cycles from BOL to EOL.

### B. Tunable Replica Circuit

The TRC is designed to mimic the critical path in the target system. As shown in Fig. 4, it can be configured to closely replicate the critical path using a postsilicon selectable fixed-length delay portion made of INV, NAND, NOR, or interconnect delay elements, plus a variable-length INV-based delay portion for further TRC fine-tuning to better track the voltage–frequency curve of each die. Furthermore, the TRC allows programming either a rising or a falling signal transition to better match the critical path polarity. A 16 b TDC translates the TRC timing margin to a digital code. This code is used by the PMU to adapt $V_{CC}$ up or down based on current ITD and aging conditions, as described earlier.
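A minimal sketch of how a controller might act on the TDC code is given below. It is not the implemented PMU logic. It assumes a thermometer-style 16 b code in which more set bits mean more timing margin, a VID where a larger value means a higher voltage, and a one-step-per-sample update with a small hysteresis band. The names `margin_from_tdc` and `avs_step` are hypothetical.

```python
def margin_from_tdc(code: int) -> int:
    """Timing margin as the number of set bits in the 16 b TDC code.

    Assumes a thermometer-style code: 0xFFFF is large margin, 0x0000 is none.
    """
    return bin(code & 0xFFFF).count("1")


def avs_step(tdc_code: int, ref_code: int, vid: int, hysteresis: int = 1,
             vid_min: int = 0, vid_max: int = 255) -> int:
    """Return the next VID for one slow-clock sample of the TDC code.

    Margin below the reference (colder die at low V_CC, or aging) raises
    V_CC by one step. Margin above the reference lowers it by one step.
    """
    error = margin_from_tdc(tdc_code) - margin_from_tdc(ref_code)
    if error < -hysteresis:
        vid += 1          # margin lost: request more voltage
    elif error > hysteresis:
        vid -= 1          # surplus margin: recover energy
    return max(vid_min, min(vid_max, vid))
```

The hysteresis band is a design choice in this sketch to avoid toggling the VID on single-bit TDC jitter; the actual step size and update policy are not specified in the text.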

### C. Dynamic Power Gating

We propose utilizing the TRC in a new DPG approach to lower energy during active mode, despite using the same droop $V_{GB}$ as the baseline. As shown in Fig. 5, a PMOS power gate (PG) is traditionally used to shut OFF leakage current during idle periods. The PG is sized up for minimum IR drop when the maximum load current flows through it while the core is active. This means that the PG impedance is lower than what is actually needed for other workloads, leaving an opportunity for energy savings. The DPG scheme exploits this by dividing the baseline PG into a primary PG (PPG) and a secondary PG (SPG). The PPG is always turned on during active operation, and since its impedance is designed to be higher than that of the original larger PG, it creates a higher IR drop (or a load-line effect) for the average load current case, reducing the virtual rail voltage as a result, thus lowering average dynamic and



Fig. 12. TRC path synthesizability for a given die at ISO frequency.

leakage energy. The SPG is dynamically turned on in the case of a higher current demand, which in the worst case arrives suddenly and requires a quick response. The DPG controller distinguishes a fast droop signature from a slow ITD or aging signature by observing a large shift in the TDC code in a given clock cycle. This can be done by monitoring one of the TDC bits, as will be explained in Section III-D. Upon droop detection, the DPG controller triggers a 3 b shift register that turns on the SPG legs to maintain the same frequency as the baseline case.

For maximum power benefit, the percentage of the PG that is allocated as PPG should be minimized. However, to maintain the same voltage droop as the baseline case during the worst case current transient, and considering the latency to turn on the SPG, the PPG size needs to be carefully selected, trading off reliable operation against power reduction. The size of the PPG can be determined *a priori* based, for example, on a one-time postsilicon calibration step running a worst case workload. In this paper, the PPG was set equal to the SPG.
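The droop-triggered behavior described above can be modeled cycle by cycle as in the sketch below. It is illustrative only: the class name, the choice of watched bit, the assumption that a cleared bit signals lost margin (a thermometer-style code), and the policy of releasing one SPG leg per cycle once the droop clears are all assumptions, since the text specifies only that a 3 b shift register turns on the SPG legs.

```python
class DpgController:
    """Behavioral model of a droop-triggered secondary power gate (SPG)."""

    def __init__(self, watch_bit: int, spg_legs: int = 3):
        self.watch_bit = watch_bit      # TDC bit monitored for droop
        self.spg_legs = spg_legs        # width of the shift register
        self.shift_reg = 0              # one bit per SPG leg, 1 = leg on

    def clock(self, tdc_code: int) -> int:
        """Advance one clock cycle and return the SPG leg enables."""
        # Assumed encoding: the watched bit reads 0 once margin drops
        # below the threshold that bit represents.
        droop = ((tdc_code >> self.watch_bit) & 1) == 0
        mask = (1 << self.spg_legs) - 1
        if droop:
            # Shift in a 1: enable one more SPG leg per cycle.
            self.shift_reg = ((self.shift_reg << 1) | 1) & mask
        else:
            # Assumed release policy: drop one leg per cycle.
            self.shift_reg >>= 1
        return self.shift_reg
```

With `watch_bit=10`, a code of `0xFF00` leaves the SPG off, while a sustained code of `0xF000` turns on all three legs within three cycles.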

In the DLDO work in [8], a controller constantly monitors workload activity via a “closed-loop” configuration and accordingly adjusts the number of “turned-on” PGs to maintain



Fig. 13. TRC programming. (a) Single-point programming. (b) Multipoint programming.

$V_{out}$ (virtual rail) close to a target $V_{ref}$, where $V_{ref}$ can be any value smaller than the input rail ($V_{cc}$). On the other hand, DPG is a simple scheme that does not have the controller complexity (and associated area/power overhead) of a full-fledged closed-loop DLDO design. DPG thus mainly operates in an “open-loop” configuration, where the number of normally “turned-on” PGs (called PPGs) can be determined *a priori* based, for example, on a one-time postsilicon calibration step running a worst case workload.

### D. TRC Calibration Procedure in HVM

One of the challenges for AVS and DPG is how accurately the TRC tracks the behavior of the critical path delay under temperature, aging, and droop events. Maximizing TRC-to-critical path tracking minimizes the additional voltage margin that needs to be added for the TRC itself, thus maximizing the potential energy savings of AVS and DPG. Fig. 6 shows a conceptual graph of an execution core $F$-$V_{CC}$ curve along with two possible TRC programmings. Since the execution core consists of various logic gate and device types, the critical path may actually change from one path to another across the $F$-$V_{CC}$ range. Compared with a single-point tracking approach, having the flexibility of multipoint programming, where possibly different TRC gate types and lengths are used for a given portion of the $F$-$V_{CC}$ curve, can potentially achieve better tracking not only across the die's $F$-$V_{CC}$ range, but also across temperature, aging, and droop. This point is demonstrated in Section V.

A TRC calibration procedure that captures the critical path sensitivity across temperature and voltage-frequency points is proposed in Fig. 7. This flow leverages the existing ITD characterization step (that finds the worst case ITD compensation factor at BOL) in a conventional HVM test. Therefore, our TRC calibration proposal requires minimal additional test time. The calibration starts by picking a random sample



Fig. 14. Measured temperature dependence for an example die.

of dies. For each die in the sample, and for each $F_i$ target, the test content is run and the minimum $V_{CC-HOT}$ that allows the die to function correctly at $F_i$ is found. The TRC is then programmed at $V_{CC-HOT}$ to produce a reference TDC code for this die. As the 16 b TDC code indicates the timing margin, a code value of 0xFFFF means large margin, while a code of 0x0000 represents zero margin. We program the TRC to give a reference TDC code at midpoint, or 0xFF00. The temperature is then reduced to COLD, and the same test content is run again to find the minimum $V_{CC-COLD}$ at $F_i$. Under the same programming from the HOT step, the TRC delay margin is then checked at the $F_i$-$V_{CC-COLD}$ point. If the TRC tracks closely from HOT to COLD, the TDC code will remain at around midrange. If a large mismatch exists, we may choose a different path type and then repeat the calibration procedure until the HOT to COLD delay signature is close enough. At the end of this calibration step, and for each $F_i$ point tested, the TRC programming type that produced the least tracking error across the random sample



Fig. 15. Measured baseline ITD GB for six dies at (a) 0.4 and (b) 0.8 V.



Fig. 16. Measured ITD GB with TRC-based AVS for six dies at (a) 0.4 and (b) 0.8 V.

is selected thereafter as the programming type for the rest of the population during HVM class test at HOT.
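The path-type selection at the end of this flow can be sketched as follows for a single $F_i$ point. This is an illustration, not the production test flow. It assumes a thermometer-style TDC code, measures tracking error as the change in the number of set bits between the HOT reference and the COLD reading, and interprets "least tracking error across the random sample" as the smallest worst case error. The function names and data layout are hypothetical.

```python
def popcount16(code: int) -> int:
    """Number of set bits in a 16 b TDC code (assumed thermometer-coded)."""
    return bin(code & 0xFFFF).count("1")


def worst_case_shift(readings) -> int:
    """Largest HOT-to-COLD margin shift across the sampled dies.

    readings: list of (tdc_code_hot, tdc_code_cold) pairs, one per die.
    """
    return max(abs(popcount16(hot) - popcount16(cold)) for hot, cold in readings)


def select_path_type(tracking) -> str:
    """Pick the TRC path type with the smallest worst case shift.

    tracking: dict mapping a path type name (e.g. "INV", "NAND", "NOR")
    to that type's list of per-die (hot, cold) TDC readings.
    """
    return min(tracking, key=lambda path: worst_case_shift(tracking[path]))
```

Using the worst case rather than the mean shift is a conservative choice in this sketch, since the selected type is later applied to every die in the population.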

While the path type for a given frequency range is determined globally for all dies based on the above one-time calibration step, different dies in HVM test can have a different TRC path *length* depending on their process corner and thus their $F-V_{CC}$ characteristic curves. Thus, the length of the TRC in a given die is tuned independently using the variable-length INV-based delay portion of the TRC (see Fig. 4). To reduce programming time, a built-in auto calibration engine can be used to program the length of the TRC across the $F-V$ points. From a storage perspective, nonvolatile memory is needed to store the type and length of the TRC at each $F/V$ point. Since the TRC in this paper has 16 configuration bits, and assuming 5 $F/V$ operating points, $16 \times 5 = 80$ b of storage is required per TRC.

For droop impact reduction using DPG, besides the need for good TRC-to-critical path tracking across the $F-V_{CC}$ range, carefully selecting the TDC bit that enables DPG is also critical. While a low droop detection sensitivity results in a higher droop magnitude and can thus cause functional failure at the target frequency, a high droop detection sensitivity will overcompensate for even small droops, toggling DPG too often and reducing the net energy savings of the DPG scheme as a result. Thus, an optimal TDC bit selection exists. To determine which TDC bit should be selected for enabling DPG, the virus test content is rerun on the random sample of dies at the end of the above calibration flow, as shown in Fig. 8. The most aggressive TDC bit that allows us to meet target $F_i$ while maximizing energy savings is determined in this step. This TDC bit is then used for all dies during HVM testing. In Section V, we present measured data that illustrate the advantage of multipoint programming and careful TDC bit selection on the proposed schemes.
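The bit selection step can be expressed as a small search over the candidate bits. The sketch below assumes that, for each candidate, the virus content has already been run and the achievable frequency and energy savings relative to the baseline have been recorded. The function name, record layout, and the example numbers in the usage note are hypothetical.

```python
def select_dpg_bit(candidates, f_target_mhz):
    """Choose the TDC bit that triggers DPG.

    candidates: list of dicts with keys
        'bit'                - TDC bit index used to trigger DPG
        'fmax_mhz'           - maximum passing frequency under virus content
        'energy_savings_pct' - measured savings versus baseline (may be < 0)
    Returns the bit index that meets the target frequency with the largest
    energy savings, or None if no candidate meets the target.
    """
    passing = [c for c in candidates if c["fmax_mhz"] >= f_target_mhz]
    if not passing:
        return None
    return max(passing, key=lambda c: c["energy_savings_pct"])["bit"]
```

For example, with hypothetical candidates where bit 12 passes at 620 MHz but loses 1% energy, bit 10 passes at 605 MHz with 7% savings, and bit 8 saves 9% but only reaches 580 MHz, a 600 MHz target selects bit 10.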

### E. Comparison

Fig. 9 compares the operating $V_{CC}$ under the conventional flat voltage guard-band approach and under the proposed TRC-based AVS and DPG schemes. In the baseline case, the final operating $V_{CC}$ accounts for worst case die-to-die variation in ITD and aging, worst case usage conditions for aging, and a virus-like workload for droop. The proposed AVS with a per-die TRC removes the entire die-to-die ITD guard band as well as the entire die-to-die aging guard band at BOL. It adapts $V_{CC}$ over time depending on the actual aging condition from BOL to EOL. DPG does not actually reduce the droop voltage guard band, but reduces



Fig. 17. Measured energy savings with ITD GB reduction.

its effect on average energy. Since we include the TRC as a delay margin sensor, with tracking accuracy depending on the calibration quality, any TRC-to-critical path tracking error needs to be margined for. Nevertheless, the overall operating $V_{CC}$ and energy are reduced, as the measurements in Section V show.

The benefit of the TRC in cutting down the conventional (worst case-based) voltage guard band is reduced by the mismatch error between the TRC and the actual critical path across process, voltage, temperature, and aging. We reduce this error through per-die (process) multipoint (voltage) programming, where we select the path type to maximize TRC tracking to the critical path across a wide HOT to COLD range (temperature). The iRazor paper [13] describes a canary (ring oscillator) scheme, where a separate linear correlation between the canary and processor frequency is determined for each temperature and voltage condition. This approach is expected to have a higher tracking error than the TRC, since a ring oscillator canary tracks neither the real critical path type nor its length as the TRC does. Furthermore, the canary requires programming per temperature as well as per $F/V$ range, which is expensive, while in the TRC case, only programming per $F/V$ range is needed, as the TRC itself is expected to track temperature and aging changes.

## IV. TEST PROTOTYPE

To evaluate the proposed TRC-based AVS and DPG schemes, a 3.38 mm<sup>2</sup> graphics execution core that performs key floating-point operations for a 3-D graphics pipeline is fabricated in 22 nm, as shown in Fig. 10. The test chip includes a PLL for on-die clock generation, a 270 kB SRAM array and a test controller for issuing at-speed test vectors, and a signature register (MISR) for validating correct test results [14]. The test chip is measured using test sequences ported from presilicon validation. Two equal PG halves with associated DPG controllers are placed on the top and bottom sides of the core. Two TRCs, one for capturing delay change due to slow variations in temperature and aging and the other for tracking fast voltage droop, are used and both are powered



Fig. 18. Measured baseline aging GB at different  $V_{CC}/F$  operating points.



Fig. 19. Measured TRC aging tracking.

from the core rail. For the first TRC, the entire TDC code feeds into an AVS controller, which issues a new VID as needed to an external VRM to maintain correct operation in the presence of temperature and aging variations. For the second TRC, one bit of the TDC code is selected to kick in DPG to react to a voltage droop. A programmable on-die current inducer is included to create droop events of variable lengths and magnitudes. Die micrograph and overall chip details are given in Fig. 11.

## V. MEASUREMENT RESULTS

In this section, we present silicon measurement results for: 1) TRC calibration; 2) ITD guard band; 3) aging guard band; and 4) droop guard band. An example of TRC postsilicon fine-tuning using a combination of fixed-length path types and variable-length INV delays is shown in Fig. 12. Fig. 13 compares the TRC-to-critical path tracking error using single- and multipoint TRC calibration approaches. For the single-point calibration, the TRC is programmed to just pass at 0.7 V and the associated frequency during class test at HOT (90 °C) temperature. The same programming is then used at any other $V_{CC}/F$ operating point during real life usage at all temperature conditions. On the other hand, for multipoint programming, a unique TRC programming can be selected as needed for each $V_{CC}/F$ point tested at HOT. Using single-point programming produces a maximum TRC-to-critical path tracking error of 5% when the core runs at 25 °C. This error is completely eliminated (for this die) when allowing a unique programming



Fig. 20. Measured (a) voltage reduction and (b) energy savings using per-die aging-aware AVS.

for each of the four  $V_{CC}/F$  points tested across the  $V_{CC}/F$  curve.

Fig. 14 shows the temperature dependence results. Note that the ITD impact is obvious when operating at lower $V_{CC}$ (speed), needing a higher voltage compensation as temperature is reduced, while NTD is observed at higher $V_{CC}$. To assess the benefit of TRC-based AVS versus the baseline approach, we randomly selected a number of typical/slow/fast dies and characterized them across wide voltage (from $F_{SLOW}$ to $F_{FAST}$) and temperature (25 °C–90 °C) ranges, as given in Fig. 15. Each die is first tested at HOT (90 °C), and the minimum $V_{CC-HOT}$ that meets a given $F_i$ target is found. Then, a lower temperature is selected and the minimum $V_{CC-COLD}$ at the same frequency is found. The ITD $V_{GB}$ needed at a given $F_i$ is then upper bounded by the $V_{CC-COLD}$ that satisfies the *worst case* die. A thermal sensor in the baseline scheme provides a temperature reading at run time, which along with the current $F_i$ is used to look up this (*worst case*) ITD $V_{GB}$. At $F_{SLOW}$ (0.4 V), the ITD effect ($V_{CC-HOT}/V_{CC-COLD} < 1$) is more pronounced, with a large spread due to ITD variation across dies. At $F_{FAST}$ (0.8 V), on the other hand, there is less ITD effect but, nevertheless, the spread is still large, as dies can vary in their NTD ($V_{CC-HOT}/V_{CC-COLD} > 1$). At 25 °C, this results in a worst case ITD $V_{GB}$ equal to 6.4% and 3.1% of $V_{CC}$ for the two ends, respectively. With TRC-based AVS, the spread in ITD voltage compensation at 25 °C is significantly tightened from 8.5% and 6.1% at $F_{SLOW}$ and $F_{FAST}$ using the worst case baseline approach (Fig. 15) to 2.5% and 1.5%, respectively, using a per-die TRC monitor, as shown in Fig. 16. For the die that needed the least compensation, this translates to 8% (3%) energy savings at 0.4 V (0.8 V) at 25 °C, compared with the baseline scheme, as given in Fig. 17.

To determine the worst case aging $V_{GB}$ for the baseline approach, a number of dies are measured at different frequency targets, and the BOL voltage ($V_{CC-BOL}$) is determined for each, as illustrated for an example die in Fig. 18. Aging is then performed in an accelerated manner using higher voltage and temperature while mimicking a wide DVFS range in an actual usage scenario. A $V_{CC}$ search is performed under normal conditions after each accelerated stress period (~2 h) to find the new minimum $V_{CC}$ needed to maintain the time-0 frequency target in the presence of accumulated aging. The measurements are repeated for a total of 12 h until worst case EOL usage is reached. In the baseline scheme, the $V_{CC-EOL}-V_{CC-BOL}$ of the worst case die is then added at BOL as a flat aging $V_{GB}$ to all dies in HVM. This $V_{GB}$ is ~4% when operating at 0.8 V and increases to ~10% when operating at 0.4 V, illustrating more sensitivity to aging at low-performance modes.

When using the TRC as an aging sensor, it needs to go through the same voltage/temperature/usage cycles as the core to be able to accurately track the shift in delay margin. By operating the TRC on the virtual rail (Fig. 10), any core power gating event is tracked by the TRC, and the same is true for any core-level clock gating event. On the other hand, the TRC cannot track the data activity of the most-aged critical path. Based on the BTI aging mechanism, a dc stress gives worse delay degradation than an ac stress, since devices can slightly recover during the ac stress [5]. In order to avoid functional failure, the TRC in this paper remains clock gated while the core is active, and is only ungated to capture any shift in its delay margin due to slow variations in temperature and aging. This mimics a conservative dc-stressed critical path scenario, thus slightly overestimating actual path aging in some cases. Fig. 19 shows how the TRC accurately tracks aging characteristics from BOL to EOL, and does this consistently over a wide voltage/frequency range.

Fig. 20 shows the benefit of aging-aware AVS in terms of voltage reduction and corresponding energy savings for an example die from its BOL to its EOL. Compared with baseline scheme that employs the EOL voltage shift of the worst case die as the aging  $V_{GB}$  for all dies starting at their BOL,



Fig. 21. Measured baseline versus DPG during a droop at (a)  $V_{CC} = 0.82$  V and (b)  $V_{CC} = 0.66$  V.



Fig. 22. Measured (a) average virtual rail voltage for baseline and DPG and (b) energy comparison.

aging-aware AVS reduces $V_{CC\text{-BOL}}$ by 4% (10%) at $F_{FAST}$ ($F_{SLOW}$), which results in an energy reduction at BOL of 9% (20%), as shown in Fig. 20. Depending on a given die's actual usage, the energy benefit of aging-aware AVS reduces over time to become 1% (6%) at $F_{FAST}$ ($F_{SLOW}$) at the maximum possible EOL. With the TRC-based AVS scheme, combining voltage guard-band reduction for both ITD and aging gives a 33% (14%) energy savings at 0.4 V (0.8 V).
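As a rough consistency check on the BOL aging numbers (a first-order sketch, not an analysis from the measurements), assume that energy per operation at a fixed frequency scales as $CV_{CC}^2$ and ignore leakage. A fractional voltage reduction $x$ then gives a fractional energy saving of $1-(1-x)^2$:

```latex
\begin{align}
x = 4\%:\quad  & 1 - (1 - 0.04)^2 = 1 - 0.9216 \approx 7.8\% \\
x = 10\%:\quad & 1 - (1 - 0.10)^2 = 1 - 0.81 = 19\%
\end{align}
```

These estimates are close to, and slightly below, the measured 9% and 20%. The remaining difference may reflect leakage energy, which also falls with voltage and is not captured by this simple model.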



Fig. 23. Impact of TDC bit location selected for DPG on (a) performance and (b) energy.



Fig. 24. Transient captures of DPG with different TDC bit location configurations.

To test the DPG scheme, we used the on-die current inducer to create a virus-like transient event, while the core was running a small snippet of a typical floating-point test, as shown by the scope captures in Fig. 21. With DPG kicked in, the virtual rail operates at an average voltage of 0.69 V, which is  $\sim 11\%$  lower than the 0.78 V of the baseline case for a  $V_{CC}$  of 0.82 V. During the virus transient event, the virtual rail of a DPG-enabled system droops to about the same value (0.62 V) as that of the baseline. This is assured by the TRC, as it quickly triggers the SPG to maintain the same frequency as the baseline case. Similar behavior can be observed for the  $V_{CC}$  of 0.66 V case. As summarized in Fig. 22, DPG reduces average virtual rail voltage by 11% (8%) at 0.8 V (0.6 V), which gives an energy savings of 14.5% (7%) compared with the baseline case.

The location of the TDC bit selected to trigger DPG action has an impact on the net energy savings of this scheme. Choosing a conservative bit location that forces DPG to kick in even for small droops allows us to safely meet the target frequency, but also results in smaller, possibly even negative, energy savings compared with the baseline case. This is due to: 1) the effective loss of the load-line effect of DPG, as the average virtual rail voltage rises closer to that of the baseline case and 2) the energy overhead of frequent switching between PPG and SPG, which becomes more significant. On the other hand, a less aggressive bit location selection degrades the response speed of DPG, thus allowing a larger droop and reducing the maximum frequency achievable at a given $V_{CC}$ compared with the baseline case. Fig. 23 shows measured data that illustrate this tradeoff. At the optimal TDC bit location, the target frequency of 600 MHz at $V_{CC} = 0.66$ V is just met with DPG enabled, and this bit selection achieves the maximum energy savings. Moving from this point toward more aggressive selections, smaller savings are observed for the above reasons. Fig. 24 shows the corresponding transient captures for different TDC bit locations used for enabling DPG. At an aggressive selection, DPG is triggered more frequently, while at the optimal selection, DPG kicks in only during the worst case droop.
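The tradeoff described above can be illustrated with a toy threshold sweep. Everything in this sketch is hypothetical: the droop amplitudes, the cost parameters, and the use of a millivolt trigger threshold as a stand-in for the TDC bit location are assumptions chosen only to reproduce the qualitative shape of the tradeoff, not measured data or the actual DPG control logic.

```python
from dataclasses import dataclass


@dataclass
class SweepPoint:
    threshold_mv: float    # stand-in for the selected TDC bit location
    trigger_count: int     # droop events that switch PPG -> SPG
    net_saving: float      # toy net energy saving versus baseline
    meets_frequency: bool  # toy check that the target frequency is still met


def sweep_trigger_threshold(droops_mv, thresholds_mv, max_saving,
                            switch_cost, max_tolerable_mv):
    """Evaluate each candidate threshold under a toy DPG model.

    Toy assumptions: every droop at or above the threshold triggers the SPG;
    the load-line saving shrinks in proportion to the fraction of events
    spent in SPG mode; each trigger costs a fixed switching overhead; and the
    target frequency is met only if the threshold does not exceed the largest
    droop the core can tolerate before the SPG responds.
    """
    points = []
    for th in thresholds_mv:
        triggers = sum(1 for d in droops_mv if d >= th)
        spg_fraction = triggers / len(droops_mv)
        net = max_saving * (1.0 - spg_fraction) - switch_cost * triggers
        points.append(SweepPoint(th, triggers, net, th <= max_tolerable_mv))
    return points


def best_point(points):
    """Highest-saving point among those that still meet the frequency."""
    feasible = [p for p in points if p.meets_frequency]
    return max(feasible, key=lambda p: p.net_saving) if feasible else None


# Hypothetical droop amplitudes (mV) and candidate thresholds (mV).
droops = [5, 8, 10, 12, 15, 20, 25, 40, 60, 90]
points = sweep_trigger_threshold(
    droops, thresholds_mv=[5, 15, 30, 50, 80, 120],
    max_saving=0.10, switch_cost=0.01, max_tolerable_mv=80,
)
for p in points:
    print(p)
print("selected:", best_point(points))
```

With these made-up numbers, the lowest threshold fires on every droop and yields a negative net saving, the highest threshold never fires but fails the frequency check, and the selected point is the largest threshold that still meets frequency, where only the deepest droop triggers the SPG.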

## VI. CONCLUSION

A graphics execution core uses an *in situ* TRC to monitor delay degradation due to slow variations in temperature and aging, and to trigger AVS to dynamically adjust $V_{CC}$ as needed during run time. Measured data show a 33% (14%) energy reduction under worst case temperature and aging at 0.4 V (0.8 V). The TRC is also used in a DPG scheme to lower the energy overhead due to the fast droop voltage guard band by 14.5% (7%) at 0.8 V (0.6 V). We also introduced a testing-friendly TRC calibration flow that is executed once at BOL on a small random sample of dies to learn critical path sensitivity across voltage and temperature, and is later used for the rest of the population during HVM testing. Although this paper specifically addressed ITD and aging guard-band reduction with AVS, the same technique can benefit other variations such as HCI or TDDB.

#### ACKNOWLEDGMENT

The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

#### REFERENCES

- [1] A. Grenat *et al.*, "Increasing the performance of a 28 nm x86-64 microprocessor through system power management," in *IEEE ISSCC Dig. Tech. Papers*, Jan./Feb. 2016, pp. 74–75.
- [2] S. Clerc *et al.*, "A 0.33 V/–40 °C process/temperature closed-loop compensation SoC embedding all-digital clock multiplier and DC-DC converter exploiting FDSOI 28 nm back-gate biasing," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2016, pp. 1–3.
- [3] T.-H. Kim, R. Persaud, and C. H. Kim, "Silicon odometer: An on-chip reliability monitor for measuring frequency degradation of digital circuits," *IEEE J. Solid-State Circuits*, vol. 43, no. 4, pp. 874–880, Apr. 2008.
- [4] A. Drake *et al.*, "A distributed critical-path timing monitor for a 65 nm high-performance microprocessor," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2007, pp. 398–399.
- [5] M. Cho, C. Tokunaga, M. M. Khellah, J. W. Tschanz, and V. De, "Aging-aware adaptive voltage scaling in 22 nm high-K/metal-gate tri-gate CMOS," in *Proc. IEEE Custom Integr. Circuits Conf.*, Sep. 2015, pp. 1–4.
- [6] J. Tschanz *et al.*, "A 45 nm resilient and adaptive microprocessor core for dynamic variation tolerance," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2010, pp. 282–283.
- [7] C. Tokunaga *et al.*, "A graphics execution core in 22 nm CMOS featuring adaptive clocking, selective boosting and state-retentive sleep," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2014, pp. 108–109.
- [8] S. T. Kim *et al.*, "Enabling wide autonomous DVFS in a 22 nm graphics execution core using a digitally controlled hybrid LDO/switched-capacitor VR with fast droop mitigation," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2015, pp. 1–3.
- [9] J. Tschanz *et al.*, "Adaptive frequency and biasing techniques for tolerance to dynamic temperature-voltage variations and aging," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2009, pp. 292–293.
- [10] N. Kurd, P. Mosalikanti, M. Neidengard, J. Douglas, and R. Kumar, "Next generation Intel Core micro-architecture (Nehalem) clocking," *IEEE J. Solid-State Circuits*, vol. 44, no. 4, pp. 1121–1129, Apr. 2009.
- [11] M. Cho, M. Khellah, K. Chae, K. Ahmed, J. Tschanz, and S. Mukhopadhyay, "Characterization of inverse temperature dependence in logic circuits," in *Proc. IEEE Custom Integr. Circuits Conf.*, Sep. 2012, pp. 1–4.
- [12] E. Karl, P. Singh, D. Blaauw, and D. Sylvester, "Compact *in-situ* sensors for monitoring negative-bias-temperature-instability effect and oxide degradation," in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2008, pp. 410–411.
- [13] Y. Zhang *et al.*, "iRazor: 3-transistor current-based error detection and correction in an ARM Cortex-R4 processor," in *IEEE ISSCC Dig. Tech. Papers*, Jan./Feb. 2016, pp. 160–162.
- [14] M. Cho *et al.*, "Post-silicon voltage-guard-band reduction in a 22 nm graphics execution core using adaptive voltage scaling and dynamic power gating," in *IEEE ISSCC Dig. Tech. Papers*, Jan./Feb. 2016, pp. 152–153.



**Minki Cho** (M'13) received the B.E. degree in electronics engineering from Sogang University, Seoul, South Korea, in 2006, and the M.S. and Ph.D. degrees in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2009 and 2012, respectively.

He is currently with Intel Circuit Research Laboratory, Hillsboro, OR, USA, as a Research Scientist. His current research interests include low-power digital circuit design and reliability of digital circuits in nanometer nodes.

Dr. Cho received the 2013 IEEE TRANSACTIONS ON COMPONENTS, PACKAGING AND MANUFACTURING TECHNOLOGY Best Paper Award.



**Stephen T. Kim** (M'13) received the B.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2007, and the M.S. and Ph.D. degrees in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2009 and 2012, respectively.

He is currently a Research Scientist with the Circuit Research Laboratory, Intel Corporation, Hillsboro, OR, USA. His current research interests include energy-efficient power delivery design.



**Carlos Tokunaga** (S'98–M'08) received the B.S. degree in electronics engineering from the University of Los Andes, Bogotá, Colombia, in 2001, and the M.S. and Ph.D. degrees in electrical engineering from the University of Michigan, Ann Arbor, MI, USA, in 2005 and 2008, respectively.

He is currently a Research Scientist with the Circuit Research Laboratory, Intel Corporation, Hillsboro, OR, USA. His current research interests include VLSI design with particular emphasis on energy-efficient resilient



**Charles Augustine** (S'08–M'11) received the bachelor's degree in electronics from BITS Pilani, Pilani, India, in 2004, and the Ph.D. degree in electrical and computer engineering from Purdue University, West Lafayette, IN, USA, in 2011.

He is currently a Staff Research Scientist with the Circuit Research Laboratory, Intel Corporation, Hillsboro, OR, USA. His current research interests include ultra-low-power memory and logic circuits for neuromorphic systems.

Dr. Augustine received the 2015 Mahboob Khan Outstanding Industry Liaison Award from SRC, the Best Paper Award in the International Symposium on Low Power Electronics and Design in 2012, the Best Paper in Session Award at SRC Techcon in 2009, the AMD Design Excellence Award from Purdue in 2008, and the Bronze medal for academic excellence from BITS Pilani in 2004. He has held positions with Texas Instruments, ST Microelectronics, Philips Semiconductors, and Freescale Semiconductor, where he was involved in CMOS digital integrated circuits and memories, including spin-torque based memories. He has authored over 55 papers in refereed journals and conferences and has filed six patents (issued) and 11 patents (pending).



**Jaydeep P. Kulkarni** (M'09–SM'15) received the B.E. degree from the University of Pune, India, in 2002, the M.Tech. degree from the Indian Institute of Science (IISc), Bangalore, India, in 2004, and the Ph.D. degree from Purdue University, West Lafayette, IN, USA, in 2009, all in electrical engineering.

From 2004 to 2005, he was with Cypress Semiconductors, Bangalore, where he was involved in low power SRAM design. He is currently a Staff Research Scientist with the Circuit Research Laboratory, Intel Corporation, Hillsboro, OR, USA. His current research interests include energy efficient integrated circuit design and circuits/applications of emerging non-silicon technologies. He has filed 20 patents and published 50 papers.

Dr. Kulkarni received the 2004 Best M.Tech. Student Award from IISc Bangalore, the 2008 SRC Inventor Recognition Awards, the 2008 ISLPED Design Contest Award, the 2008 Intel Foundation Ph.D. Fellowship Award, the Best Paper in Session Award at 2008 SRC TECHCON, the 2010 Outstanding Doctoral Dissertation Award from Purdue School of Electronics and Communication Engineering, the 2012 Intel Patent Recognition Award, five Intel Divisional Recognition Awards, the 2015 IEEE Circuits and Systems Society's Transactions on VLSI Systems Best Paper Award, and the 2015 Semiconductor Research Corporation's (SRC) Outstanding Industrial Liaison Award. He has participated in the Technical Program Committees of the A-SSCC, ISLPED, and ASQED conferences and is involved in the IEEE Circuits and Systems Society's VLSI Technical Committee. He serves as an Associate Editor of the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION SYSTEMS, and as Industrial Liaison at the SRC and the SONIC STARnet Research Program.



**Krishnan Ravichandran** (M'14) received the B.S.C.E. degree from IIT Madras and an M.S.C.E. degree from Carnegie Mellon University, Pittsburgh, PA, USA.

He leads the Power Delivery Circuits and Systems Research Group in Intel Labs, Intel Corporation, Hillsboro, OR, USA. He has been with Intel for 25 years as a Circuit Design Engineer with the Mobile Platforms Group. He was a member of the Core Architecture Team that launched several generations of mobile processors including the Centrino Family of processors. He has been leading research on power management circuits for small form-factor mobile systems.




**James W. Tschanz** (M'99) received the B.S. degree in computer engineering and the M.S. degree in electrical engineering from the University of Illinois at Urbana-Champaign, Urbana, IL, USA, in 1997 and 1999, respectively.

Since 1999, he has been a Circuit Researcher with the Circuit Research Laboratory, Intel Corporation, Hillsboro, OR, USA. His research interests include low-power digital circuits, design techniques, and methods for tolerating parameter variations. He also taught VLSI design for 7 years as an Adjunct Faculty Member with the Oregon Graduate Institute, Beaverton, OR, USA. He has authored 53 conference and journal papers in this field and three book chapters, and has over 41 issued patents.



**Muhammad M. Khellah** (SM'15) is a principal research scientist at Intel Labs, where he leads research on low-power circuits and architectures with particular focus on power management, resiliency, and embedded memories. After obtaining his Ph.D. from the University of Waterloo, Waterloo, ON, Canada, in 1999, he joined Intel and was first involved in the design of SRAM caches for the Pentium microprocessor products. He has published over 70 technical papers in refereed international conferences and journals, and has 74 patents granted, and a few pending, all in the area of VLSI design.

Dr. Khellah served as an associate editor for the IEEE TCAS-I, technical program co-chair for the 2014 IEEE/ACM ISLPED, and general co-chair for the 2016 IEEE/ACM ISLPED. He currently serves on the TPC of the IEEE ISSCC and IEEE CICC.



**Vivek De** (F'11) received the Ph.D. degree in electrical engineering from the Rensselaer Polytechnic Institute, Troy, NY, USA.

He is currently an Intel Fellow and the Director of Circuit Technology Research, Intel Corporation, Hillsboro, OR, USA. He is responsible for providing strategic technical directions for long term research in future circuit technologies and leading energy efficiency research across the hardware stack. He has authored 245 publications in refereed international conferences and journals and 207 patents, with 28 more patents filed (pending).

Dr. De received the Intel Achievement Award for his contributions to an integrated voltage regulator technology. He received a Best Paper Award at the 1996 IEEE International ASIC Conference, and nominations for best paper awards at the 2007 IEEE/ACM Design Automation Conference (DAC) and the 2008 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). His publications were recognized in the 2013 IEEE/ACM DAC as one of the Top 10 Cited Papers in 50 Years of DAC.