

# Power Integrity Analysis of a 28 nm Dual-Core ARM Cortex-A57 Cluster Using an All-Digital Power Delivery Monitor

Paul N. Whatmough, *Member, IEEE*, Shidhartha Das, *Member, IEEE*, Zacharias Hadjilambrou, and David M. Bull, *Member, IEEE*

**Abstract**—This paper presents a power delivery monitor (PDM) peripheral integrated in a flip-chip packaged 28 nm system-on-chip (SoC) for mobile computing. The PDM is composed entirely of digital standard cells and consists of: 1) a fully integrated VCO-based digital sampling oscilloscope; 2) a synthetic current load; and 3) an event engine for triggering, analysis, and debug. Incorporated inside an SoC, it enables rapid, automated analysis of supply impedance, as well as monitoring of supply voltage droop of multi-core CPUs running full software workloads and during scan-test operations. To demonstrate these capabilities, we describe a power integrity case study of a dual-core ARM Cortex-A57 cluster in a commercial 28 nm mobile SoC. Measurements are presented of power delivery network (PDN) electrical parameters, along with waveforms of the CPU cluster running test cases and benchmarks on bare metal and Linux OS. The effect of aggressive power management techniques, such as power gating, on the dominant resonant frequency and peak impedance is highlighted. Finally, we present measurements of supply voltage noise during various scan-test operations, an often-neglected aspect of SoC power integrity.

**Index Terms**—All-digital ADC, digital sampling oscilloscope, margins, on-chip oscilloscope, power delivery, supply voltage noise, variation.

## I. INTRODUCTION

MODERN computing platforms from mobile through to servers are increasingly constrained to operate within a fixed, or shrinking, power budget. The design of system-on-chips (SoCs) that offer increased compute performance while operating at a commensurate power budget to previous generations places huge demands on the efficiency of the silicon implementation. Continuing process technology scaling has ensured sustained area and power efficiency improvements, in turn enabling more and more complex SoCs, composed of multiple clusters of CPUs, GPUs, and additional specialized compute engines. However, increased integration comes at the cost of increasing peak current and current density, to the extent that these systems are ultimately constrained by

Manuscript received November 12, 2016; revised January 9, 2017; accepted January 29, 2017. Date of publication March 13, 2017; date of current version May 23, 2017. This paper was approved by Associate Editor Vivek De. This work was supported by the European Community's Horizon 2020 Program for research and technical development, through the UniServer Project under Grant 688540.

P. N. Whatmough is with Harvard University, Cambridge, MA 02138 USA (e-mail: pwhatmough@eecs.harvard.edu).

S. Das and D. M. Bull are with ARM Ltd., Cambridge CB1 9NJ, U.K. (e-mail: sdas@arm.com; dbull@arm.com).

Z. Hadjilambrou is with the Computer Science Department, University of Cyprus, Nicosia 678, Cyprus (e-mail: sharkis99@gmail.com).

Color versions of one or more of the figures in this paper are available online at <http://ieeexplore.ieee.org>.

Digital Object Identifier 10.1109/JSSC.2017.2669025

power delivery. Of particular concern are pathological AC supply noise conditions, which can occur due to an infrequent combination of system and micro-architectural events at the resonant frequency of the power delivery system [1], [2]. These events effectively limit the energy efficiency of the system, as sufficient voltage margin must be employed to guarantee that these conditions do not result in system failure. Accurately minimizing voltage margins is critical to balance energy efficiency and robustness.

On-chip AC supply voltage noise is challenging to characterize convincingly during post-silicon measurement, because the most severe voltage droops tend to be triggered by complex interactions between application software, operating system, CPU clusters, and the memory system. These interactions are often not exposed by simple bare metal<sup>1</sup> workloads. In order to expose the true worst-case voltage noise conditions, it is necessary to employ a combination of targeted test cases, along with a broad range of real world compiled code. These requirements motivate integrating circuitry into the SoC to enable on-chip characterization of voltage noise.

Previous work on on-chip voltage noise monitoring includes continuous time approaches that rely on analog buffered IO to off-chip test equipment [3], [4], accurate on-chip analog comparators and voltage references [5], [8], and calibrated digital CMOS delay-lines [7], [9]. Equivalent time sampling (ETS) is often employed, whereby a sampled waveform is built up from multiple triggers on a repetitive waveform [6], [8], [9]. ETS relaxes bandwidth requirements, but typically increases test time, limiting the approach to relatively small code sequences that can be looped indefinitely. For longer tests, such as booting an OS, it is necessary to avoid ETS in favor of real-time sampling. It is also desirable to be able to easily and autonomously characterize supply voltage impedance from on-chip circuitry.

In response to the challenges of characterizing power integrity post-silicon, this paper presents an all-digital fully integrated power delivery monitor (PDM) [10]. The all-digital design allows the IP to be easily integrated into first silicon without custom layout. The proposed IP enables: 1) measurement of supply impedance; 2) measurement of pathological voltage droops that occur during the execution of long and complex software workloads; and 3) measurement of supply voltage noise during scan-test scenarios. A representative flip-chip packaged 28 nm mobile SoC is used as a

<sup>1</sup>Refers to software running on a CPU without an OS.



Fig. 1. Simplified power delivery network composed of PCB, package, and silicon die (top), with typical system impedance with frequency (left) and step response showing multiple resonances (right).

vehicle for power integrity analysis, demonstrating the PDM as an effective tool during post-silicon bring-up and debugging. The SoC incorporates two heterogeneous 64-bit CPU clusters (dual-core ARM Cortex-A57 and quad-core ARM Cortex-A53), a GPU cluster, and high-speed cache-coherent interconnect. A system control processor (SCP) runs power management firmware for controlling power-gates, power management IC (PMIC) for controlling supply voltages, and PLLs for controlling clock frequencies. The power integrity case study focuses on the A57 cluster.

The remainder of this paper is organized as follows. Section II briefly introduces power integrity challenges for mobile SoCs. Section III describes the proposed PDM peripheral and measured performance. In Section IV, we present a case study, which uses the PDM to perform an extensive characterization and analysis of a representative mobile SoC product, from software down to scan test. Section V concludes this paper.

## II. POWER INTEGRITY CHALLENGES

It has long been observed that the combined trends of decreasing supply voltage and increasing integration density lead to rapidly increasing supply current. In turn, this places very stringent requirements on power supply impedance [6], which must be kept small enough to ensure that fast supply currents do not cause the voltage to collapse. Due to stagnant package inductance scaling, this trend is increasingly problematic. Fig. 1 shows a simplified lumped-element representation of a generic power delivery network consisting of PCB, package, and silicon die. The impedance as a function of frequency, $Z(f)$, is small ($\sim 1\text{ m}\Omega$) at low frequencies close to DC, which is necessary to minimize $IR$ voltage drop and $I^2R$ power loss. However, this low impedance environment is prone to resonances. There are typically three significant resonances, which correspond to various interactions of PCB, package, and die. The largest of these, the first resonance, is often in the range of 50–200 MHz, where peak impedance can be as high as two orders of magnitude larger than the DC resistance. The first resonance arises due to the die capacitance and the package inductance.

A significant impulse or step in the current demand of the die results in a supply voltage droop waveform, such as the one shown in Fig. 1, which shows two visible frequency components (the third component can be observed only with a longer time base). Such supply droop waveforms directly modulate the propagation delay of logic circuits and can result in timing violations with respect to the target clock period. It is necessary to include supply voltage margin to prevent timing violations, effectively accounting for the difference between the nominal supply voltage ( $V_{NOM}$ ) and the minimum observed during supply transients ( $V_{MIN}$ ). However, the additional voltage margin quadratically degrades power efficiency. In practice, voltage margins can be minimized only after gaining a detailed understanding of supply voltage noise conditions during both functional test and scan-test scenarios.
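The quadratic cost of voltage margin can be made concrete with a short worked example. This is a sketch under the usual first-order assumption that dynamic power scales as $C V^2 f$ at a fixed clock frequency; the 10% margin is an illustrative figure, not a measurement from this work.

$$P_{dyn} \propto C\,V_{DD}^{2}\,f \quad\Rightarrow\quad \frac{P_{dyn}(V_{MIN} + \Delta V)}{P_{dyn}(V_{MIN})} = \left(1 + \frac{\Delta V}{V_{MIN}}\right)^{2}$$

$$\frac{\Delta V}{V_{MIN}} = 0.10 \quad\Rightarrow\quad (1.10)^{2} = 1.21$$

Under these assumptions, a margin of 10% above the minimum safe voltage costs roughly 21% in dynamic power, which is why reducing the gap between $V_{NOM}$ and $V_{MIN}$ is worthwhile.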

Unfortunately, AC supply noise is both difficult to simulate pre-silicon and difficult to measure post-silicon [11]. Simulation requires accurate parasitic extraction of the PCB, package, and die, and accurate but fast proxy models for CPU activity. Measurement is also challenging, since the voltage noise signal exists on chip, and hence is difficult to observe accurately. Finally, there is also a software element to the problem, because the real-world CPU activity transients are often due to complex interactions between SW, OS, CPUs,



Fig. 2. PDM block diagram indicating the main functional blocks and two power domains.

peripherals, and memory system. Simple bare metal workloads are often insufficient to elucidate the worst case.

## III. POWER DELIVERY MONITOR

In this section, we describe the design of the proposed PDM peripheral. A number of design decisions were taken to emphasize ease of SoC integration at design time, and ease of use during silicon bring up.

#### A. Peripheral Organization

To ease integration, we use a conventional APB on-chip peripheral bus interface, and implement the entire design using only digital logic library cells and SRAM to simplify process porting.<sup>2</sup> Fig. 2 shows the organization of the PDM peripheral, which consists of an on-chip digital sampling oscilloscope (OC-DSO) and synthetic current load (SCL) in the front end (FE), and an event engine and bus interface in the back end (BE). The supply voltage for the FE circuits ( $V_{FE}$ ) is connected to the power domain to be analyzed (e.g.,  $V_{A57}$  in the example in Section IV). The compact FE component measures real-time supply voltage waveforms (i.e.,  $V_{FE} - V_{SS}$ ) and does not involve any analog circuits, analog I/O, or off-chip measurement equipment. The BE is in the SoC power domain ( $V_{SYS}$ ) to minimize loading and self-induced noise on  $V_{FE}$ .

The OC-DSO runs continuously in real-time, logging data, and capturing waveforms on trigger events. Event counter and tide-mark registers track the size and frequency of voltage transients. These data can be easily correlated with CPU hardware event counters. For transients of interest, threshold and gradient triggers can initiate waveform capture of up to 2k points into the internal SRAM trace buffer. A decimation block allows flexible adaptation of the bandwidth/sample rate to allow measurement of low-frequency transients. Autonomous operation can be easily implemented from an on-chip core, or externally via the CoreSight debug port [12]. Multiple FE modules can be supported from a single BE peripheral in order to allow characterization of a number of power domains, which is an important consideration due to the trend for an increasing number of SoC power domains.

#### B. All-Digital On-Chip Digital Sampling Oscilloscope

For on-chip supply noise sensing applications, simple CMOS delay-line circuits [7], [8] are very attractive for reasons of process portability and layout automation, as they can be implemented using only digital standard cells. The basic concept is to exploit the transfer function of supply voltage to the inverse of CMOS gate delay, which is linear to first order. However, despite the simplicity, the performance of these circuits is far from ideal as they exhibit distortion-limited INL for the following reasons: 1) non-linearity in the open-loop transfer function; 2) asymmetry in the gate rise/fall delays; 3) poor matching of small digital transistors; and 4) poor layout matching using automated place and route. However, by exploiting calibration, it is possible to achieve sufficient performance for the given application.

Rather than a linear delay-line arrangement, we use a ring oscillator (RO) (i.e., VCO), which reduces circuit size for equal dynamic range and introduces first-order noise shaping [13]. Fig. 3 shows the VCO, which consists of a 31-stage NAND2 ring. The VCO is totally free running and is not reset with respect to the sample clock. The oscillation frequency of the VCO is approximately 1.77 GHz at nominal process,

<sup>2</sup>Due to reduced layout effort and generally simplified physical design.



Fig. 3. Process portable all-digital supply voltage sampling using coarse and fine sampling of VCO. Two coarse counters at opposite ends of the VCO guarantee at least one of them will have settled when the counter outputs are sampled. The fine phase output is subsequently inspected to find the position of the VCO transition and the coarse counter furthest away from the transition is used to generate the output code. All registers are clocked by the sample clock unless otherwise annotated.

voltage and temperature (PVT) corner. The essential operation is to measure how many times the propagating transition traverses the VCO in a single cycle of the sample clock. To achieve this, the VCO phase is sampled in two parts: 1) coarse, $C[n]$, measured in whole rotations around the VCO ($C[n] \cdot 2\pi$) and 2) fine, $F[n]$, measured in increments of single NAND2 stages ($F[n] \cdot (2\pi/31)$). Both phase paths are differentiated separately, before multiplying the integer part by the number of NAND2 stages in the VCO (31), and finally combining the two to produce a binary output code ($D_{OUT}[n]$), which is linear with voltage to the first order.

The integer part is accumulated using a free-running Gray counter, which is clocked on both edges of a node from the VCO. Conventional library flip-flops are then used to sample the integer counter and fractional phase taps using a sample clock ($F_S$), which is asynchronous with respect to the VCO. Since this represents an asynchronous clock crossing, sampling the coarse Gray counter when logic is toggling can lead to bit errors.<sup>3</sup> The Gray code ensures that the maximum Hamming error in the decoded binary word is 1 LSB, but due to the multiplication, this still equates to an error of $\pm 31$ LSBs in the final output $D_{OUT}[n]$. To reduce this, a second (replica) coarse counter is introduced at a node on the opposite side of the VCO (half way around). The VCO has the property that the transition in the ring cannot be close to two equidistant points on the ring at the same time instant, assuming a sufficient number of stages.<sup>4</sup> Hence, the decoder can subsequently inspect the phase of the VCO (from the fine part) and select the coarse counter ($C0[n]$ or $C1[n]$) furthest from the edge in the VCO at the sampling instant. In this fashion, we ensure we use the coarse output that has safely settled before the sampler is clocked.
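The coarse/fine decode and the dual-counter selection described above can be sketched behaviourally as follows. This is a minimal model, not the authors' RTL: the tap positions (`TAP0`, `TAP1`), the counter width, the convention that the replica counter increments as the edge passes stage 15, and the `model_sample` helper that injects off-by-one errors near each tap are all assumptions made for illustration. The counters are taken to be already Gray-decoded to binary.

```python
N_STAGES = 31        # NAND2 stages in the ring (from the text)
TAP0, TAP1 = 0, 15   # assumed ring positions clocking the two coarse counters
COUNTER_BITS = 8     # assumed coarse counter width
MOD = 1 << COUNTER_BITS


def ring_distance(a, b, n=N_STAGES):
    """Shortest distance between two positions on the ring, in stages."""
    d = abs(a - b) % n
    return min(d, n - d)


def select_coarse(c0, c1, fine):
    """Return the rotation count using the counter furthest from the edge."""
    if ring_distance(fine, TAP0) >= ring_distance(fine, TAP1):
        return c0  # edge is far from TAP0, so C0 has settled
    # Assumed convention: C1 leads C0 by one once the edge has passed TAP1.
    return (c1 - 1 if fine >= TAP1 else c1) % MOD


def decode(samples):
    """Turn (c0, c1, fine) samples into output codes (stages per sample period)."""
    out, prev = [], None
    for c0, c1, fine in samples:
        cur = (select_coarse(c0, c1, fine), fine)
        if prev is not None:
            d_coarse = (cur[0] - prev[0]) % MOD   # differentiate, allowing wrap
            d_fine = cur[1] - prev[1]             # differentiate fine phase
            out.append(N_STAGES * d_coarse + d_fine)
        prev = cur
    return out


def model_sample(phase, glitch=False):
    """Hypothetical sampled VCO state at an absolute phase given in stages.

    With glitch=True, a counter whose tap is within two stages of the edge
    is corrupted by one count, mimicking an unsettled asynchronous sample.
    """
    fine = phase % N_STAGES
    c0 = (phase // N_STAGES) % MOD
    c1 = ((phase + N_STAGES - TAP1) // N_STAGES) % MOD
    if glitch and ring_distance(fine, TAP0) <= 2:
        c0 = (c0 + 1) % MOD
    if glitch and ring_distance(fine, TAP1) <= 2:
        c1 = (c1 - 1) % MOD
    return c0, c1, fine


# Example: phase advances by roughly 137 stages per sample period.
phases = [0, 137, 277, 408, 533]
print(decode(model_sample(p, glitch=True) for p in phases))
```

In this model the decoder never consumes a counter whose tap is within two stages of the sampled edge, so the injected errors do not reach the output code.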

#### C. All-Digital Synthetic Current Load

In order to characterize on-chip supply impedance, we use an on-chip SCL. The SCL (Fig. 2) is used to draw current at a programmable frequency while the OC-DSO simultaneously measures the supply voltage response. By repeating this

<sup>3</sup>Especially for a library flip-flop cell, which may have un-optimized setup and hold times.

<sup>4</sup>Given the stage delay,  $t_{stage}$ , flip-flop setup time,  $t_{su}$ , and hold time,  $t_{hold}$ , the number of stages in the VCO,  $N$ , should be chosen such that  $N > 2(t_{su} + t_{hold})/t_{stage}$ .



Fig. 4. Die photo of 28 nm SoC layout (right) and PDM layout detail and dimensions (left).

process for a range of frequencies, the $Z(f)$ response is constructed, quickly identifying the frequency/impedance of resonances. Further practical details of this process are given in the case study in Section IV.

The most common approach to implementing an on-chip current load is to use one or more transistors to short  $V_{DD}$  and  $V_{SS}$ . However, there are a number of concerns with this approach in a commercial SoC, not least of which are the electrical and functional risks associated with the shorting device, which must never turn on inadvertently. There are also secondary considerations for the leakage of this device. To allay these concerns, the proposed SCL avoids DC paths between supplies, instead using up to 512 parallel nine-stage NAND2 ROs. When enabled, the short ROs oscillate drawing dynamic current from the supply in the process. The number of ROs is a tradeoff between maximum load current and circuit area. The load current can be modulated at a given frequency, using the enable input to the RO. The RO enables are controlled by a flexible function generator, which produces square wave, step, impulse, and pseudo-random linear feedback shift register (LFSR) patterns. The circuit is composed entirely of digital standard cells to avoid any custom layout.
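The enable patterns named above (square wave, step, impulse, and LFSR) can be sketched as simple generators that emit one enable bit per function-generator clock tick. This is an illustrative model only: the function names, the tick-based parameters, and the 16-bit LFSR polynomial ($x^{16} + x^{14} + x^{13} + x^{11} + 1$, a commonly used maximal-length choice) are assumptions, since the text does not specify the generator's internals.

```python
from itertools import islice


def square(half_period):
    """Enable is high for half_period ticks, then low for half_period ticks."""
    tick = 0
    while True:
        yield 1 if (tick // half_period) % 2 == 0 else 0
        tick += 1


def step(at):
    """Enable goes high at tick `at` and stays high."""
    tick = 0
    while True:
        yield 1 if tick >= at else 0
        tick += 1


def impulse(at, width=1):
    """Enable is high only for `width` ticks starting at tick `at`."""
    tick = 0
    while True:
        yield 1 if at <= tick < at + width else 0
        tick += 1


def lfsr16(seed=0xACE1):
    """Pseudo-random enable bits from a 16-bit Fibonacci LFSR (assumed taps)."""
    state = seed
    while True:
        bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        yield state & 1


def active_ros(pattern, n_ros):
    """Number of ring oscillators running on each tick (sets load amplitude)."""
    for enable in pattern:
        yield n_ros * enable


def take(gen, n):
    """First n items of a generator, as a list."""
    return list(islice(gen, n))


print(take(active_ros(square(2), 512), 8))
```

With a generator clock of frequency $F_{clk}$, `square(h)` in this sketch corresponds to a load frequency of $F_{clk}/(2h)$, and the `n_ros` argument trades load amplitude against the number of oscillators enabled.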

#### D. Event Engine

The event engine detects, counts, and responds to events of interest in the real-time sample stream generated by the OC-DSO. The event engine is based around two flexible trigger blocks, which can flag various $V_{FE}$ droop or overshoot events using level or slew-rate thresholds. The trigger blocks drive event counters used to monitor the number of occurrences of a configured trigger for a given test case. The triggers can also be used to initiate storage of a waveform into the SRAM trace buffer, in a similar fashion to a conventional oscilloscope. A set of internal timers provide a simple means to determine when a given trigger event occurred, which is necessary for diagnosing the source of a droop event through alignment with CPU activity. Finally, the triggers can be used to assert an interrupt to the SCP, which can then take further action within the SoC, for example to halt a CPU cluster.

A tidemark block constantly monitors the output of the OC-DSO, tracking the highest and lowest voltages observed during a test. The data are available at any time to be read from memory-mapped registers. With these features, a typical workflow to analyze a given workload or droop event can use the event counters and tidemarks to observe the frequency and magnitude of droop events, respectively. Following this, it is a simple process to then capture waveforms for, say, the largest droop in the workload to diagnose the architectural event that caused it. An example of this is given in Section IV.
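The workflow above (event counters, tidemarks, then triggered capture) can be illustrated with a small behavioural model. This is a sketch under stated assumptions rather than the PDM register interface: the class name, attribute names, and the specific trigger definitions (falling crossing of a level threshold, and a sample-to-sample drop for slew) are hypothetical, and only the 2k-sample trace depth comes from the text.

```python
class EventEngineModel:
    """Behavioural model of droop triggers, event counters and tidemarks."""

    def __init__(self, level_threshold, slew_threshold, depth=2048):
        self.level_threshold = level_threshold  # code below which a droop is flagged
        self.slew_threshold = slew_threshold    # drop in codes per sample to flag
        self.depth = depth                      # trace buffer depth (2k samples)
        self.level_events = 0
        self.slew_events = 0
        self.code_min = None                    # low tidemark
        self.code_max = None                    # high tidemark
        self.trigger_time = None                # sample index of the first trigger
        self.trace = []
        self._prev = None
        self._t = 0

    def push(self, code):
        """Process one OC-DSO output code."""
        self.code_min = code if self.code_min is None else min(self.code_min, code)
        self.code_max = code if self.code_max is None else max(self.code_max, code)
        if self._prev is not None:
            # Level trigger: count each falling crossing of the threshold.
            if self._prev >= self.level_threshold > code:
                self.level_events += 1
                if self.trigger_time is None:
                    self.trigger_time = self._t
            # Slew trigger: count large sample-to-sample drops.
            if self._prev - code >= self.slew_threshold:
                self.slew_events += 1
        # Once triggered, store samples until the trace buffer is full.
        if self.trigger_time is not None and len(self.trace) < self.depth:
            self.trace.append(code)
        self._prev = code
        self._t += 1


engine = EventEngineModel(level_threshold=125, slew_threshold=15)
for sample in [137, 138, 136, 120, 110, 125, 137, 139, 118, 137]:
    engine.push(sample)
print(engine.level_events, engine.slew_events, engine.code_min, engine.code_max)
```

A first pass over a workload would read only the counters and tidemarks; a second pass with the thresholds tightened around the largest droop would then use `trigger_time` and `trace` to align the waveform with CPU activity.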

#### E. Measurement Results

Before looking at a system-level case study, we first report measured performance of the OC-DSO in isolation. The entire PDM macro is approximately  $350 \times 310 \mu\text{m}^2$  (Table I), as shown in Fig. 4, where the main functional blocks of the peripheral are highlighted. The total power consumption for the whole macro (with SCL off) is  $416 \mu\text{W}$  during waveform capture,  $25 \mu\text{W}$  of which is from the  $V_{A57}$  voltage rail, with the remainder supplied by  $V_{SYS}$ .

The OC-DSO output,  $D_{OUT}[n]$ , is a measurement of frequency with units of  $t_{NAND2}/t_{CLK}$ , and requires a conversion from frequency to voltage (F-to-V). The F-to-V conversion function is determined from calibration, and also corrects for non-linearity and process/temperature variation. The SCP performs the F-to-V conversion in software after reading out the measurement data from the PDM. Performing the conversion in software is not as power efficient as dedicated hardware [14], but since the PDM is used only during chip bring up, this is not a significant concern.



Fig. 5. OC-DSO resolution and linearity parameters without decimation, for a 400-mV input voltage range. Calibration can be performed by the SCP and PMIC, without external test equipment.

The calibration itself can be performed autonomously by the SCP firmware, by using the (off-chip) PMIC to change the supply voltage. Using software running on the SCP, a slow DC sweep is performed by repeating two steps: 1) set  $V_{\text{FE}}$  using the PMIC and 2) take multiple samples of  $D_{\text{OUT}}[n]$  from the OC-DSO and store the average along with the voltage. The resulting list of voltage/code pairs can then be inspected to determine the voltages of the code centers. This can either be used directly as a look-up table, or alternatively a polynomial function can be fit to the data.<sup>5</sup> The calibration process is performed in software by the SCP using the PMIC, without any external test and measurement equipment.
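The two-step DC sweep and the look-up-table option can be sketched as below. This is an illustration, not the SCP firmware: `calibrate`, `code_to_mv`, and the `FakeRail` class (a synthetic stand-in for the PMIC and OC-DSO with an invented, mildly non-linear transfer curve) are all hypothetical names, and linear interpolation between table entries is an added assumption.

```python
from bisect import bisect_left
from statistics import mean


def calibrate(set_voltage, read_code, voltages_mv, n_avg=64):
    """Sweep the rail and return a sorted list of (average code, voltage in mV)."""
    table = []
    for v in voltages_mv:
        set_voltage(v)                                        # step 1: set V_FE
        avg = mean(read_code() for _ in range(n_avg))         # step 2: average codes
        table.append((avg, v))
    table.sort()
    return table


def code_to_mv(table, code):
    """F-to-V conversion by look-up with linear interpolation, clamped at the ends."""
    codes = [c for c, _ in table]
    i = bisect_left(codes, code)
    if i == 0:
        return table[0][1]
    if i == len(table):
        return table[-1][1]
    (c_lo, v_lo), (c_hi, v_hi) = table[i - 1], table[i]
    return v_lo + (v_hi - v_lo) * (code - c_lo) / (c_hi - c_lo)


class FakeRail:
    """Hypothetical PMIC + OC-DSO pair, used only to exercise the sketch."""

    def __init__(self):
        self.mv = 900

    def set_voltage(self, mv):
        self.mv = mv

    def read_code(self):
        d = self.mv - 900
        return 137 + 0.3 * d - 0.0002 * d * d   # invented transfer curve


rail = FakeRail()
lut = calibrate(rail.set_voltage, rail.read_code, range(700, 1101, 10), n_avg=4)
rail.set_voltage(905)
print(code_to_mv(lut, rail.read_code()))
```

The table here holds 41 entries for a 10 mV step; a polynomial fit to the same pairs would replace `code_to_mv` with a handful of coefficients at the cost of more arithmetic per sample.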

Fig. 5 gives measured digital output code as a function of $V_{\text{FE}}$ (input voltage), along with corresponding INL. The usable input voltage range is $\sim 400$ mV, centered on the nominal supply voltage of 900 mV. The voltage resolution achieved ($V_{\text{LSB}}$), defined as the linear input voltage range divided by the linear output code range, depends on the VCO integration time ($1/F_S$): at 800 MS/s, $V_{\text{LSB}}$ is $\sim 3.3$ mV, and at 400 MS/s, $V_{\text{LSB}}$ is $\sim 1.6$ mV. Sample rate is flexible up to a maximum of 2.24 GS/s for a TT die at 25 °C, which can be useful for timing very fast transients. Measured INL<sub>MAX</sub> is $-0.9/+0.7$ LSB after correction with a ninth-order polynomial, over the entire 400-mV input voltage range.

In measuring supply voltage noise, our main interest is in capturing global resonant voltage droop waveforms, which are typically in the frequency range of 50–200 MHz (Section II). Although higher frequency noise is certainly present, it tends to be very localized and of such small magnitude it is unlikely to affect the performance of the CPU cores. A sample rate of at least twice the highest signal frequency is required in order to prevent significant aliasing of in-band signal power.

<sup>5</sup>A look-up table is faster and is simpler to generate, but requires a larger memory footprint.

TABLE I

SUMMARY OF PDM MACRO OPERATING AT 800 MS/s. QUOTED POWER CONSUMPTION DOES NOT INCLUDE F-TO-V CONVERSION (SECTION III-E), WHICH IS PERFORMED IN SOFTWARE ON THE SCP

| Parameter                 | Value                           |
|---------------------------|---------------------------------|
| <b>Process Technology</b> | TSMC 28 nm CMOS                 |
| <b>Packaging</b>          | C4 Flip-Chip                    |
| <b>Nominal Vdd</b>        | 0.8 – 1.0 V                     |
| <b>Noise Source</b>       | Dual-Core 64-bit ARM Cortex-A57 |
| <b>Max. Sample Rate</b>   | 2.24 GS/s                       |
| <b>Waveform Memory</b>    | 2k samples                      |
| <b>Supply Noise Range</b> | 0.7 – 1.1 V                     |
| <b>DNL<sub>max</sub></b>  | -0.4 / +0.6 LSB                 |
| <b>INL<sub>max</sub></b>  | -0.9 / +0.7 LSB                 |
| <b>Average Resolution</b> | 3.3 mV                          |
| <b>Power Consumption</b>  | 416 µW                          |
| <b>Macro Area</b>         | 350 × 310 $\mu\text{m}^2$       |

However, a much higher sample rate is desirable, since: 1) there is no explicit anti-alias filter, so out-of-band noise aliases into the band of interest and 2) when looking in the time domain, it is desirable to have much more than two samples per cycle to help distinguish resonant waveforms. In the following case study, we mainly used a sample rate of 800 MS/s, which was found to be sufficient to identify various noise waveforms without sacrificing resolution.
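As a quick check of this choice, the number of samples per cycle at the chosen sample rate follows directly from the figures already quoted (an 800 MS/s sample rate and a 50–200 MHz band of interest):

$$\frac{F_S}{f} = \frac{800\ \text{MS/s}}{200\ \text{MHz}} = 4, \qquad \frac{F_S}{f} = \frac{800\ \text{MS/s}}{50\ \text{MHz}} = 16, \qquad \frac{F_S}{2} = 400\ \text{MHz}$$

So a resonant droop anywhere in the band is captured with between 4 and 16 points per cycle, and the Nyquist frequency sits a factor of two above the upper band edge.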

Table II gives a brief comparison of previously published supply voltage noise monitors and supply voltage noise characterization studies. One of the key goals in this paper was to provide an SoC peripheral, which is easy to integrate pre-silicon and easy to use post-silicon. The proposed PDM macro does not use any analog circuits, does not require off-chip measurement equipment (such as an oscilloscope), is self-calibrating (in software), and has extensive on-chip triggering and event counting features.

## IV. POWER INTEGRITY CASE STUDY

In this section, we present a power integrity analysis of a 28 nm SoC using the PDM peripheral integrated into the SoC. The PDM is integrated above the dual-core Cortex-A57 cluster (Fig. 4), with $V_{\text{FE}}$ connected to the A57 cluster supply ($V_{\text{A57}}$). Also highlighted in Fig. 4 is a microcontroller referred to as the SCP, which is used for power management duties and, in this case, to drive tests using the PDM.

#### A. Supply Impedance

The first consideration in the case study is the frequency and peak impedance of resonances. This information can be

TABLE II  
COMPARISON OF PREVIOUSLY PRESENTED SUPPLY VOLTAGE NOISE MONITORS AND CHARACTERIZATION TEST CHIPS

|                                    | VLSI'04 [3]                     | JSSC'04 [5]                               | JSSC'05 [6]                            | TCAS-I'11 [7]                         | JSSC'15 [4]                     | ISSCC'16 [8]              | This Work                       |
|------------------------------------|---------------------------------|-------------------------------------------|----------------------------------------|---------------------------------------|---------------------------------|---------------------------|---------------------------------|
| <b>Process Technology</b>          | 90nm                            | 90nm                                      | 130nm                                  | 45nm                                  | 40nm                            | 20nm                      | 28nm                            |
| <b>Supply Voltage Noise Source</b> | 32-bit RISC CPU                 |                                           | High-speed transceiver                 | 32-bit RISC CPU                       | 8-Core 64-bit ARM CPU           | 10-Core 64-bit ARM CPU    | Dual-Core 64-bit ARM Cortex-A57 |
| <b>Supply Voltage Measurement</b>  | Analog buffer to off-chip scope | Analog comparators and voltage references | Analog S/H and slow VCO voltage sensor | All-digital delay-line voltage sensor | Analog buffer to off-chip scope | Analog DAC and comparator | All-digital VCO voltage sensor  |
| <b>Analog I/O</b>                  | Yes                             | No                                        | No                                     | No                                    | Yes                             | No                        | No                              |
| <b>Waveform Capture</b>            | Real-time <sup>1</sup>          | No                                        | ETS <sup>2</sup>                       | Real-time                             | Real-time <sup>1</sup>          | ETS <sup>2</sup>          | Real-time                       |
| <b>Triggering</b>                  | Yes <sup>1</sup>                | No                                        | No                                     | No                                    | Yes <sup>1</sup>                | Yes                       | Yes                             |
| <b>On-chip Load</b>                | No                              | Yes                                       | Yes                                    | Yes                                   | No                              | No                        | Yes                             |
| <b>Max Fs</b>                      | 20GS/s <sup>1</sup>             | N/A                                       | 20GS/s <sup>2</sup>                    | >1.5GS/s                              |                                 | 2GHz <sup>2</sup>         | 2.24GS/s                        |
| <b>Vin Range</b>                   | 1V                              | 0.7-1.3V                                  | 0.6-1.2V                               |                                       | 200mV                           | 0.6-1.2V                  | 0.7-1.1V                        |
| <b>Resolution</b>                  |                                 |                                           | 385uV                                  |                                       |                                 | ~10mV                     | 3.3mV @800MS/s                  |

<sup>1</sup> Using off-chip oscilloscope

<sup>2</sup> Only periodic waveforms by equivalent-time sampling (ETS)



Fig. 6. Simplified lumped model of the PDN with PDM. In order to measure AC electrical parameters, the SCL generates a square wave current load of frequency  $F_{SCL}$ , which can be swept while measuring the resulting voltage swing and average current reported by the PMIC (scaled to account for current of square wave harmonics). No external test equipment is required.



Fig. 7. AC electrical parameters measured using PDM, for a range of power management modes.

correlated with pre-silicon simulations to check for simulation/modeling mismatch errors, and/or layout issues with PCB, package, or silicon implementation. The analysis of AC supply impedance can be automated on the SoC using the PDM peripheral, under the control of the SCP. Fig. 6 shows a lumped model of the power delivery network, including the A57 cluster and the PDM. The supply impedance at a single frequency, $f$, can be characterized as follows: 1) use the SCL to draw current from the rail with a square wave pattern at a frequency of $F_{SCL} = f$; 2) use the OC-DSO to measure the resulting average voltage droop, $V(F_{SCL})$; and 3) use the PMIC (controlled by an $I^2C$ interface to the SCP) to measure the average current, $I_{PMIC}$. With these measurements, the impedance can be calculated as

$$|Z(f)| = \frac{|V(F_{SCL})|}{(4/\pi)\, I_{PMIC}} \quad (1)$$

where the $4/\pi$ scaling accounts for the amplitude of the fundamental frequency component of the square wave load current, since it is much more convenient to generate a square wave pattern in digital



Fig. 8. Two significant supply noise events measured on the A57 cluster while entering WFE (top) and while turning on power gates to a core (bottom).



Fig. 9. Measured waveform from OC-DSO and corresponding CPU logic activity for a supply noise event arising from a pipeline flush following a branch misprediction. CPU clock frequency is 1.1 GHz and supply voltage is 1 V.

logic than a sine wave. By sweeping  $f$ , and repeating the process at each frequency point,  $Z(f)$  can be automatically characterized, as shown for the A57 cluster in Fig. 7. The compensation for the square wave harmonics in (1) is reasonable as long as the third harmonic falls beyond the highest resonance peak, otherwise there will be a contribution from the harmonics, which cannot be distinguished from the fundamental tone. No external test equipment is required and the whole test can be orchestrated from the SCP.
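The sweep can be sketched in a few lines around (1). This is a minimal illustration, assuming a `measure(f)` callback that returns the droop voltage and PMIC current at each frequency point; `fake_measure` is a hypothetical stand-in built from an invented series-RL/shunt-C model, and its component values are not measurements from this work.

```python
import math


def impedance_ohms(v_droop, i_pmic):
    """Equation (1): scale the average current to the square wave fundamental."""
    return abs(v_droop) / ((4 / math.pi) * i_pmic)


def sweep_impedance(measure, freqs_hz):
    """Return [(f, |Z(f)|)] given measure(f) -> (droop in V, current in A)."""
    return [(f, impedance_ohms(*measure(f))) for f in freqs_hz]


def dominant_resonance(curve):
    """Frequency point with the highest impedance magnitude."""
    return max(curve, key=lambda point: point[1])


def fake_measure(f_hz, r=1e-3, l=20e-12, c=100e-9, i_avg=0.5):
    """Hypothetical PDN: series R-L (package) feeding a shunt C (die).

    Returns the droop that such a network would produce for a load whose
    fundamental amplitude is (4/pi) * i_avg, plus the average current.
    """
    w = 2 * math.pi * f_hz
    z = 1 / (1 / complex(r, w * l) + complex(0, w * c))
    return (4 / math.pi) * i_avg * abs(z), i_avg


curve = sweep_impedance(fake_measure, range(10_000_000, 300_000_001, 5_000_000))
f_peak, z_peak = dominant_resonance(curve)
print(f_peak, z_peak)
```

Because the sketch uses only the fundamental, it carries the same caveat as (1): if the third harmonic of $F_{SCL}$ lands on a resonance, its contribution would be attributed to the fundamental.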

Since impedance characterization using the PDM is automated and fast, we were able to characterize the impact of the low-power modes on the SoC. Individual CPU cores can be power gated when idle to reduce leakage power and the



Fig. 10. Process to automatically generate worst-case assembly code tests using a genetic algorithm (top). Two examples of this approach show code optimized for max  $V_{DD}$  droop (left) and max  $I_{DD}$  (right).



Fig. 11. Measured maximum voltage droop and minimum supply voltage ( $V_{MIN}$ ) for 1.1-GHz operation for a range of binaries running on bare metal (top) and Linux OS (bottom). Benchmarks “gaDIDT,” “manDIDT,” and “powVirus” are assembly code test cases, while the remainder are compiled from C code. CPU clock frequency is 1.1 GHz and supply voltage is 1 V.

whole cluster can also be put in retention or turned off entirely. However, the low-power modes complicate power delivery, because the dominant resonant peak tends to shift in frequency and magnitude depending on the number of cores (power domains) active. This is shown in Fig. 7, where the resonant peak shifts significantly for different low-power modes.
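The direction of this shift can be illustrated with a simple two-branch lumped model. The sketch below is not the model of Fig. 6: it assumes a package inductance in parallel with the on-die capacitance as seen from the die, and all component values are hypothetical. It also assumes that power gating a core removes part of the on-die capacitance from the rail.

```python
import math


def pdn_impedance(f_hz, l_pkg, c_die, r_series, r_cap):
    """|Z| of (r_series + jwL) in parallel with (r_cap + 1/(jwC))."""
    w = 2.0 * math.pi * f_hz
    z_l = complex(r_series, w * l_pkg)
    z_c = complex(r_cap, -1.0 / (w * c_die))
    return abs(z_l * z_c / (z_l + z_c))


def resonant_peak(l_pkg, c_die, r_series=0.005, r_cap=0.005,
                  f_lo=1e6, f_hi=1e9, points=2000):
    """Scan a logarithmic frequency grid and return (f_peak, z_peak)."""
    best_f, best_z = f_lo, 0.0
    for k in range(points):
        f = f_lo * (f_hi / f_lo) ** (k / (points - 1))
        z = pdn_impedance(f, l_pkg, c_die, r_series, r_cap)
        if z > best_z:
            best_f, best_z = f, z
    return best_f, best_z


# Hypothetical values: 100 pH package inductance, 40 nF with both cores
# powered, 20 nF with one core power gated.
f_two, z_two = resonant_peak(100e-12, 40e-9)
f_one, z_one = resonant_peak(100e-12, 20e-9)
```

With these assumed values, halving the capacitance moves the peak up in frequency by roughly a factor of the square root of two and raises its magnitude, which is qualitatively the kind of shift the low-power modes produce.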

### B. Supply Voltage Noise

Supply impedance resonances give rise to voltage noise when excited by a transient  $I_{DD}$  waveform. Many common CPU events cause a characteristic “step” in  $I_{DD}$ . These events are typically straightforward to orchestrate, and can easily be measured using the PDM. Some examples are given in Fig. 8, which shows captured voltage waveforms for two common CPU events: wait-for-event (WFE) and power-gate turn on. Although the voltage waveforms in Fig. 8 show supply noise of 129–164 mVpp, they are not in general a significant concern



Fig. 12. Complete data set for measured maximum voltage droop and minimum supply voltage ( $V_{MIN}$ ) for 1 V/1.1-GHz operation for SPEC2006 benchmark suite running on Linux.



Fig. 13. Distributions of all test cases in terms of max droop (left) and  $V_{MIN}$  (right), for bare metal/Linux and single/dual core.

to logic timing because they can be readily “softened” to avoid generating severe transients. This is largely possible because they occur at the beginning and end of a period of execution.

On the other hand, CPU activity transients that occur somewhere in the middle of the execution of a block of code can be much more difficult to tolerate. For example, Fig. 9 shows a significant negative activity impulse due to a stall event (a branch misprediction in this case), resulting in an inductive overshoot in supply voltage. In contrast to the events in Fig. 8, this is more difficult both to orchestrate and to prevent. Since the branch mispredict, and other related activity transients due to stalls and flushes, can occur spuriously within a complex

sequence of instructions, it is not possible to mitigate them without adversely impacting IPC. Fig. 9 also shows an overlay of related logic signals from the CPU pipeline trace, which was aligned using the triggered counters in the PDM to determine the CPU cycle in which the event occurred.

### C. Automatic Test Case Generation

In order to minimize voltage margins, it is necessary to understand the CPU activity patterns that give rise to worst-case voltage noise. Writing assembly test cases by hand to try to orchestrate specific pipeline activity is time-consuming, especially for out-of-order CPUs. Instead, we generate test cases automatically, using a genetic algorithm (GA) to produce assembly code loops [15]–[17] targeting specific noise droop conditions [15], as measured by the PDM.

Fig. 10 describes the process, which begins with a seed population of random assembly code loops. The population of assembly code is measured by running each on the A57 cluster, while using the PDM to measure voltage droop. A set of parents is then chosen from the population based on the measurements, before crossover and mutation of the code generates the next population from the parents. Fig. 10 shows two examples, one optimized for maximum voltage droop (“Max Droop”) and the other for maximum current consumption (“Max DC”). The plots demonstrate the progression of the GA over successive generations, showing both the best example and the average for the entire population. Good results can be achieved in tens of generations.
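The loop in Fig. 10 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the tool used here: the instruction pool, the single-point crossover, the mutation rate, and the choice to carry parents over unchanged are all hypothetical, and `mock_droop` is a stand-in for running a candidate on the A57 cluster and reading the droop from the PDM.

```python
import random

# Hypothetical instruction pool; a real generator would emit valid A57 assembly.
INSTRUCTIONS = ["add", "mul", "fmul", "ldr", "str", "nop"]


def random_loop(length, rng):
    return [rng.choice(INSTRUCTIONS) for _ in range(length)]


def crossover(a, b, rng):
    """Single-point crossover of two equal-length loops."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(loop, rate, rng):
    """Replace each instruction with a random one with probability `rate`."""
    return [rng.choice(INSTRUCTIONS) if rng.random() < rate else op
            for op in loop]


def evolve(measure, generations=20, pop_size=16, loop_len=12,
           n_parents=4, rate=0.1, seed=0):
    """Seed population -> measure -> select parents -> crossover and mutation."""
    rng = random.Random(seed)
    population = [random_loop(loop_len, rng) for _ in range(pop_size)]
    history = []  # best measured value in each generation
    for _ in range(generations):
        ranked = sorted(population, key=measure, reverse=True)
        history.append(measure(ranked[0]))
        parents = ranked[:n_parents]
        # Assumption: parents survive unchanged into the next generation.
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng),
                   rate, rng)
            for _ in range(pop_size - n_parents)
        ]
        population = parents + children
    return max(population, key=measure), history


def mock_droop(loop):
    """Stand-in fitness: counts idle-to-busy transitions around the loop."""
    wrapped = loop[1:] + loop[:1]
    return sum(1 for a, b in zip(loop, wrapped) if a == "nop" and b == "fmul")


best_loop, best_per_generation = evolve(mock_droop)
```

On hardware, each call to `measure` is a full run of the candidate loop, so a practical version would cache measurements rather than re-measure during sorting.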

### D. Measurement Summary

Fig. 11 shows measurements for three assembly test cases: “gaDIDT,” automatically optimized for large supply noise; “manDIDT,” a hand-written droop stress case; and “powVirus,” an automatically generated test that draws maximum $I_{DD}$. A range of compiled C code benchmarks is also included, to establish more typical supply noise conditions. The results include measurements of max $V_{DD}$ droop, along with $V_{MIN}$ observed for each benchmark. $V_{MIN}$ was measured by repeatedly executing the workload at 1.1 GHz, with successively lower supply voltage until the CPUs are seen to crash by one of a number of observable failure mechanisms. The results are given for single-core and dual-core scenarios, as well as under bare metal and Linux OS at 1.1 GHz.
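The $V_{MIN}$ procedure amounts to a downward linear search. The sketch below assumes a hypothetical `passes_at(v)` callback that sets the supply, runs the workload at 1.1 GHz, and reports whether it completed without any observable failure; the step size and lower bound are illustrative values, not the ones used in these measurements.

```python
def find_vmin(passes_at, v_start=1.0, v_stop=0.6, step=0.01):
    """Lower the supply in fixed steps and return the lowest passing voltage.

    Returns None if the workload already fails at v_start.
    """
    n_steps = int(round((v_start - v_stop) / step))
    last_pass = None
    for k in range(n_steps + 1):
        # Recompute from v_start each time to avoid accumulating float error.
        v = round(v_start - k * step, 6)
        if not passes_at(v):
            break
        last_pass = v
    return last_pass
```

Because failures near $V_{MIN}$ can be intermittent, a practical `passes_at` would likely repeat the workload several times at each voltage before declaring a pass.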

The “gaDIDT” test case (Section IV-C) is a synthetic pathological worst case for AC supply droop and results in by far the largest measured voltage droops: 170 and 100 mV for dual-core and single-core configurations, respectively, on bare metal. Compiled C code benchmarks exhibit significantly smaller maximum droop. Execution on two cores slightly exacerbates both max $V_{DD}$ droop and $V_{MIN}$ observed in all cases. Running the same programs under Linux OS reveals very similar maximum droop to bare metal. However, the $V_{MIN}$ sensitivity is worse, leading to noticeably higher measured $V_{MIN}$ under Linux. We attribute this increased $V_{MIN}$ sensitivity to the OS scheduler, more cache churn, and generally richer system peripheral interaction. This illustrates the importance of analysis of a representative system, as opposed to a CPU in isolation.

Fig. 12 shows a complete data set for the much larger SPEC2006 benchmark suite. It exhibits considerably more variation than the small set of compiled benchmarks (Fig. 11), which suggests that some of these benchmarks either generate large activity transients or exhibit some periodicity at the resonant frequency.

Finally, Fig. 13 summarizes these results by showing distributions for each data set on bare metal and Linux for single-core and dual-core configurations. The maximum droop plot demonstrates the long tail associated with the supply voltage noise problem. The automatically generated test



Fig. 14. Measured scan-shift-induced power supply noise is heavily impacted by the scan pattern activity. (a) Worst-case pattern exhibits substantial oscillation swing, while (b) relaxed pattern which suppresses half of the flip flop toggles results in reduced swing.

cases (Section IV-C) bring out the pathological worst-case droop events, which were not quite matched by the compiled benchmarks that represent typical code. However, the more compiled code we ran, the closer we got, with some of the SPEC2006 examples coming close to the assembly test cases. The $V_{MIN}$ plot again shows the importance of characterizing voltage margins while running code under an OS, as failure sensitivities were found to be significantly higher in this scenario, regardless of the test case.

### E. Scan-Test Supply Voltage Noise

High toggle rate in flip flops during scan shift increases the activity rate in combinational logic. This significantly increases processor  $I_{DD}$  compared with functional tests, leading to large  $V_{DD}$  swings, potentially causing test-pattern mismatches during vector replay on the automatic test equipment. Low toggle-rate test-vectors reduce  $I_{DD}$  during scan shift, at the expense of increased test time.

Fig. 14 shows $V_{DD}$ oscillations, as measured using the PDM, during scan shift. The shift pattern 0xAAAA results in an extreme power supply noise condition, since all flip flops inside the scan chain toggle in every cycle. We compare this against the power-supply noise observed using a relaxed pattern (0x8888) that suppresses half of the transitions during the shift operation. The increase in step-current magnitude at the rising edge of the scan clock results in larger magnitude of $V_{DD}$ undershoots and overshoots for the worst-case toggle pattern. In both cases, the magnitude of the peak-to-peak swing is significant (650 mVpp for the worst-case pattern and 513 mVpp for the relaxed pattern) and is in excess of $V_{DD}$ noise observed during functional test.
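The difference in toggle activity between the two patterns can be checked with a short calculation. During shift, each flip flop takes the value previously held by its neighbor, so it toggles whenever adjacent bits of the repeating pattern differ. The helper name `toggle_fraction` and the 16-bit repeat width are choices made for this sketch.

```python
def toggle_fraction(pattern, width=16):
    """Fraction of flip flops that toggle per shift cycle when `pattern`
    repeats along the scan chain (adjacent bits that differ, cyclically)."""
    bits = [(pattern >> i) & 1 for i in range(width)]
    changes = sum(bits[i] != bits[(i + 1) % width] for i in range(width))
    return changes / width


worst_case = toggle_fraction(0xAAAA)  # alternating 1010... pattern
relaxed = toggle_fraction(0x8888)     # repeating 1000... pattern
```

The alternating pattern toggles every flip flop in every cycle, while the relaxed pattern toggles half of them, which matches the description of the two cases in Fig. 14.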



Fig. 15. Dependence of power-supply oscillations on scan-shift frequency. (a) Slower scan-shift frequency allows oscillations to attenuate before new oscillations are initiated at the next clock edge. (b) Faster scan-shift frequency causes superposition of two oscillations, resulting in continuous supply noise for the whole clock cycle.

Fig. 15 shows the frequency dependence of the scan clock on the shift-induced power-supply oscillations. The rising edge of the clock represents the initiation of the current step, where all flip flops toggle as a new bit is shifted in. The resultant switching activity in the combinational logic generates high peak currents.

The scan-clock frequency in Fig. 15(a) is 10 MHz (cycle time of 100 ns) during which all oscillations eventually attenuate. Another step-current excitation is generated at the rising edge of the next shift cycle. The falling clock edge does not cause a sufficiently large excitation, since no combinational logic toggles (only the clock network). The frequency of power-supply oscillations matches the previously measured first resonance (Fig. 7).

Fig. 15(b) shows the supply oscillations at a shift frequency of 20 MHz (50-ns cycle time). Now, the power-supply

oscillations initiated at the rising clock edge do not have sufficient time to attenuate before the next rising edge. Consequently, the supply network experiences the effects of two current steps at the subsequent rising clock edge: the attenuated, time-shifted oscillations initiated at the first rising edge superimpose on the new oscillations initiated at the second rising edge.
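This superposition can be illustrated by modeling each rising edge as the start of a damped sinusoid and summing the contributions. The sketch is a simplified illustration, not a fit to the measured waveforms: the resonant frequency `F_RES` and decay time constant `TAU` are hypothetical values, and each excitation is given unit amplitude.

```python
import math

F_RES = 80e6   # hypothetical first-resonance frequency (Hz)
TAU = 25e-9    # hypothetical decay time constant (s)


def step_response(t, f_res=F_RES, tau=TAU):
    """Damped oscillation triggered by one current step at t = 0."""
    if t < 0:
        return 0.0
    return -math.exp(-t / tau) * math.sin(2.0 * math.pi * f_res * t)


def supply_noise(t, period, n_edges, f_res=F_RES, tau=TAU):
    """Sum of the oscillations started at rising edges 0, T, 2T, ..."""
    return sum(step_response(t - k * period, f_res, tau)
               for k in range(n_edges))


def residual_at_next_edge(period, tau=TAU):
    """Fraction of the oscillation envelope left when the next edge arrives."""
    return math.exp(-period / tau)


slow = residual_at_next_edge(100e-9)  # 10 MHz shift clock
fast = residual_at_next_edge(50e-9)   # 20 MHz shift clock
```

With these assumed values, under 2% of the envelope remains after a 100 ns cycle, whereas more than 13% remains after a 50 ns cycle and adds to the oscillation started at the next edge.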

This level of visibility into supply voltage noise during scan test is expected to allow rapid optimization of the conflicting goals of maximizing scan-test speed and correlation with functional tests.

## V. CONCLUSION

Optimization of supply voltage margins is critical to safely balancing energy efficiency and robustness of large digital circuits, such as SoCs. The proposed PDM peripheral is implemented entirely in digital logic cells and SRAMs and is, therefore, easy to port to new process technologies where custom layout is time-consuming. Using the PDM integrated in a 28 nm mobile SoC, a case study of $V_{DD}$ noise in a dual-core ARM Cortex-A57 cluster is presented. Power supply impedance is automatically measured. $V_{DD}$ droop is analyzed using hand-written assembly, automatically generated assembly, and compiled benchmarks running on bare metal and Linux. Finally, the important aspect of scan-test-induced $V_{DD}$ noise is considered. The measurement case study emphasizes the importance of running a wide variety of code under an OS, as it significantly increases the $V_{MIN}$ sensitivity for a given workload. Analysis of $V_{DD}$ noise measured during scan test emphasizes the challenge of achieving short test time without inducing excessive $V_{DD}$ noise that can jeopardize correlation with functional tests. For commercial SoC product bring-up and analysis, it is expected that a considerable time-to-market advantage could be realized using this approach.

## REFERENCES

- [1] N. Kurd, P. Mosalikanti, M. Neidengard, J. Douglas, and R. Kumar, “Next generation Intel Core micro-architecture (Nehalem) clocking,” *IEEE J. Solid-State Circuits*, vol. 44, no. 4, pp. 1121–1129, Apr. 2009.
- [2] S. Pant, “Design and analysis of power distribution networks in VLSI circuits,” Ph.D. dissertation, EECS Dept., Univ. Michigan, Ann Arbor, MI, USA, 2007.
- [3] M. Fukazawa *et al.*, “Fine-grained in-circuit continuous-time probing technique of dynamic supply variations in SoCs,” in *IEEE ISSCC Dig. Tech. Papers*, Feb. 2007, pp. 288–289.
- [4] L. Ravezzi and H. Partovi, “Clock and synchronization networks for a 3 GHz 64 Bit ARMv8 8-core SoC,” *IEEE J. Solid-State Circuits*, vol. 50, no. 7, pp. 1702–1710, Jul. 2015.
- [5] A. Muhtaroglu, G. Taylor, and T. Rahal-Arabi, “On-die droop detector for analog sensing of power supply noise,” *IEEE J. Solid-State Circuits*, vol. 39, no. 4, pp. 651–660, Apr. 2004.
- [6] E. Alon, V. Stojanovic, and M. A. Horowitz, “Circuits and techniques for high-resolution measurement of on-chip power supply noise,” *IEEE J. Solid-State Circuits*, vol. 40, no. 4, pp. 820–828, Apr. 2005.
- [7] K. A. Bowman *et al.*, “All-digital circuit-level dynamic variation monitor for silicon debug and adaptive clock control,” *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 58, no. 9, pp. 2017–2025, Sep. 2011.
- [8] H. T. Mair *et al.*, “4.3 A 20 nm 2.5 GHz ultra-low-power tri-cluster CPU subsystem with adaptive power allocation for optimal mobile SoC performance,” in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Jan. 2016, pp. 76–77.
- [9] P. J. Restle *et al.*, “Timing uncertainty measurements on the Power5 microprocessor,” in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, vol. 1. Feb. 2004, pp. 354–355.

- [10] P. N. Whatmough, S. Das, Z. Hadjilambrou, and D. M. Bull, “An all-digital power-delivery monitor for analysis of a 28 nm dual-core ARM Cortex-A57 cluster,” in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2015, pp. 1–3.
- [11] S. Das, P. Whatmough, and D. Bull, “Modeling and characterization of the system-level power delivery network for a dual-core ARM Cortex-A57 cluster in 28 nm CMOS,” in *Proc. IEEE/ACM Int. Symp. Low Power Electron. Design (ISLPED)*, Rome, Italy, Jul. 2015, pp. 146–151.
- [12] ARM. (Aug. 2013). *CoreSight Technical Introduction*. [Online]. Available: http://infocenter.arm.com/help/topic/com.arm.doc.epm039795/coresight_technical_introduction_EPM_039795.pdf
- [13] M. Hovin, A. Olsen, T. S. Lande, and C. Toumazou, “Delta-sigma modulators using frequency-modulated intermediate values,” *IEEE J. Solid-State Circuits*, vol. 32, no. 1, pp. 13–22, Jan. 1997.
- [14] S. Rao, K. Reddy, B. Young, and P. K. Hanumolu, “A deterministic digital background calibration technique for VCO-based ADCs,” *IEEE J. Solid-State Circuits*, vol. 49, no. 4, pp. 950–960, Apr. 2014.
- [15] Y. Kim *et al.*, “Automating stressmark generation for testing processor voltage fluctuations,” *IEEE Micro*, vol. 33, no. 4, pp. 66–75, Jul./Aug. 2013.
- [16] Y. Kim *et al.*, “AUDIT: Stress testing the automatic way,” in *Proc. 45th Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO)*, Dec. 2012, pp. 212–223.
- [17] R. Bertran *et al.*, “Voltage noise in multi-core processors: Empirical characterization and optimization opportunities,” in *Proc. IEEE Int. Symp. Microarchitecture*, Dec. 2014, pp. 368–380.



**Shidhartha Das** (M'07) received the B.Tech. degree in electrical engineering from IIT Bombay, Mumbai, India, in 2002, and the M.S. and Ph.D. degrees in computer science and engineering from the University of Michigan, Ann Arbor, MI, USA, in 2005 and 2009, respectively.

He is currently a Principal Engineer with ARM Ltd., U.K., in the Research and Development Group, where he has been involved in several aspects of circuits and systems for emerging technologies, low-power and variation-tolerant circuits, and micro-architectural design. His current research interests include emerging non-volatile memory technologies, micro-architectural and circuit design for variation measurement and mitigation, on-chip power delivery, and VLSI architectures for digital signal processing accelerators.

Dr. Das was a recipient of multiple best paper awards, including the IEEE/ACM International Symposium on Low-Power Electronic Design Best Paper Award in 2015, the Sophia Antipolis Micro-electronics Conference 2010, the IEEE/ACM International Symposium on Microarchitecture 2003, and the Microprocessor Review Analysts' Choice Award in Innovation 2007. He was a recipient of the ARM Inventor of the Year Award in 2016 for his contributions to emerging non-volatile memory technologies. His research has been featured in IEEE Spectrum and has been invited to several top-tier journals and conferences. He serves on the Technical Program Committee at the European Solid-State Circuits Conference, the International Symposium on Low-Power Engineering and Design, and the International On-Line Testing Symposium.



**Paul N. Whatmough** (M'09) received the B.Eng. degree (Hons.) in electronic communications engineering from Lancaster University, Lancaster, U.K., in 2003, the M.Sc. degree (Hons.) in communications systems and signal processing from the University of Bristol, Bristol, U.K., in 2004, and the Ph.D. degree in electronic engineering from University College London, London, U.K., in 2012.

From 2005 to 2008, he was a Research Scientist with Philips/NXP Research Laboratories, Surrey, U.K., where he was involved in circuits and systems for highly-digital multi-standard software defined radio. From 2008 to 2015, he was with the Silicon Research and Development Department, ARM Ltd., Cambridge, U.K. He is currently a Research Associate with the Electrical Engineering and Computer Science Department, Harvard University, Cambridge, MA, USA, leading inter-disciplinary research on hardware support for machine learning. His current research interests include hardware accelerators, digital signal processing, variation tolerance, supply voltage noise, and circuits and systems for emerging Internet of Things applications.

Dr. Whatmough is a member of the IET. He was a recipient of the IET Student Project Award in 2003, the IEEE Communications Chapter Award in 2004, the European Wireless Technology Conference Young Engineering Prize in 2008, and the IEEE/ACM International Symposium on Low-Power Electronic Design Best Paper Award in 2015.



**Zacharias Hadjilambrou** received the M.Sc. degree in computer science from the University of Cyprus, Nicosia, Cyprus, where he is currently pursuing the Ph.D. degree.

His current research interests include computer architecture, data centers, and workload characterization.



**David M. Bull** (M'12) received the B.Sc. degree in computer science from Royal Holloway College, University of London, England, U.K., in 1991.

He joined ARM Ltd., Cambridge, U.K., in 1995, where he spent nine years working on various aspects of processor development, including micro-architecture and circuits. He focused on the ARM9 and ARM11 processor families, and was the Design Lead of the ARM926EJ-S. Since 2004, he has focused on research into advanced circuit and micro-architectural techniques, and has led the ARM RAZOR Research Project. He is a Senior Principal Engineer with ARM Ltd.