

---

# Electronics Systems

## Computer Engineering

### Designing for Low Power

Luca Fanucci

[Adapted from Rabaey's *Digital Integrated Circuits*, Second Edition, ©2003  
J. Rabaey, A. Chandrakasan, B. Nikolic]

# Why Power Matters

---

- ❑ Packaging costs
- ❑ Power supply rail design
- ❑ Chip and system cooling costs
- ❑ Noise immunity and system reliability
- ❑ Battery life (in portable systems)
- ❑ Environmental concerns
  - Office equipment (professional, government and banks) accounted for 14% of total US commercial energy usage in 2012\*
  - *Energy Star* compliant systems. The *Energy Star* program is incorporating standby energy into its ratings. Standby energy in office equipment represents a significant hidden energy cost.

\* U.S. Energy Information Administration, 2012 *Commercial Building Energy Consumption Survey: Energy Usage Summary*, Table 1 (March 2016)

# Why worry about power? -- Power Dissipation

Lead microprocessor power continues to increase



Power delivery and dissipation will be prohibitive

Source: Borkar, De Intel®

# Why worry about power? -- Chip Power Density



Source: Borkar, De Intel®

# Chip Power Density Distribution

Power Map



On-Die Temperature



- ❑ Power density is not uniformly distributed across the chip
- ❑ Silicon is not a good heat conductor
- ❑ Max junction temperature is determined by hot-spots
  - Impact on packaging, w.r.t. cooling

## Problem Illustration (1/2)



## Problem Illustration (2/2)



Overheating  
Overheated CPU  
Bending

## Intel's Tejas Project



Craig R. Barrett, the chief executive of Intel, told analysts that the company would move down a "**parallel track.**"

Intel Corporation's newest microprocessor (Tejas) was running slower and hotter than its predecessor.

Obtaining more computing power by stamping multiple processors on a single chip rather than straining to increase the speed of a single processor

**New York Times, May 17, 2004**

# Pentium4 processor

- Dual-Core/Multi-Threaded Pentium®4 Processor on 90nm process
  - 2-1M caches, speeds to 3.2 GHz, support for overclocking, up to 4 threads.
- Shared 800 MHz quad-pumped FSB.
  - Independent bus tuning per agent
- Enhanced auto-halt and 2-state speed step power management
  - Independent events supported per core.



# Highlights (3.2 GHz)

- 241M transistors
- 235mm<sup>2</sup>
- 9 cores, 10 threads
- >200 GFlops (SP)
- >20 GFlops (DP)
- Up to 25 GB/s memory B/W
- Up to 75 GB/s I/O B/W
- >300 GB/s EIB
- Top frequency >4GHz  
(observed in lab)



# The Performance vs. Power Dilemma

Maintain Battery Life



Wireless/  
Handheld  
Lowest leakage  
and/or dynamic power

Increase Performance



3D Graphics/  
Multimedia  
Thermal management  
Packaging, cooling,  
cost



130nm,  
90nm, 65nm  
Technology

Lower Cost  
Increased leakage  
IR-drop  
Electromigration

# Power Management Is Challenging

- Leakage Power
- Dynamic Power

*Leakage power grows from <5% of power budget at .25 micron to 20-25% at 130nm to 40-50% at 90nm and continuing to increase at 65nm and beyond.*



*Leakage power begins to dominate at advanced process geometries*

# Why worry about power? -- Battery Size/Weight



Expected battery lifetime increase  
over the next 5 years: **30 to 40%**

From Rabaey, 1995

# Why worry about power? -- Standby Power

| Year                      | 2002 | 2005 | 2008 | 2011 | 2014 |
|---------------------------|------|------|------|------|------|
| Power supply $V_{dd}$ (V) | 1.5  | 1.2  | 0.9  | 0.7  | 0.6  |
| Threshold $V_T$ (V)       | 0.4  | 0.4  | 0.35 | 0.3  | 0.25 |

- Drain leakage will increase as $V_T$ decreases to maintain noise margins and meet frequency demands, leading to excessive, battery-draining **standby** power consumption.



# Low power design challenge

## The challenge

“To design an embedded system (HW *and* SW) that provides the target functionality with minimum power consumption”

## The solution

From the **system concept** down to the **implementation phase**, adopt a **design style** that includes **power consumption as a figure of merit**, and exploit all the opportunities and techniques available at each design level to reduce it

# Power Saving Opportunities



# CMOS Energy & Power Equations

$$E = C_L V_{DD}^2 P_{0 \rightarrow 1} + t_{sc} V_{DD} I_{peak} P_{0 \rightarrow 1} + V_{DD} I_{leakage}$$

$$f_{0 \rightarrow 1} = P_{0 \rightarrow 1} * f_{clock}$$

$$P = \underbrace{C_L V_{DD}^2 f_{0 \rightarrow 1}}_{\text{Dynamic power}} + \underbrace{t_{sc} V_{DD} I_{peak} f_{0 \rightarrow 1}}_{\text{Short-circuit power}} + \underbrace{V_{DD} I_{leakage}}_{\text{Leakage power}}$$
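The three terms of the power equation can be evaluated numerically. The sketch below is only an illustration: the function name `cmos_power` and every number in the operating point are assumptions chosen for the example, not values from the slides.

```python
def cmos_power(c_load, v_dd, p_01, f_clock, t_sc, i_peak, i_leakage):
    """Return (dynamic, short-circuit, leakage, total) power in watts."""
    f_01 = p_01 * f_clock               # effective 0->1 transition frequency
    p_dyn = c_load * v_dd**2 * f_01     # charging and discharging C_L
    p_sc = t_sc * v_dd * i_peak * f_01  # direct-path current while switching
    p_leak = v_dd * i_leakage           # static leakage, independent of activity
    return p_dyn, p_sc, p_leak, p_dyn + p_sc + p_leak


# Hypothetical operating point, for illustration only.
p_dyn, p_sc, p_leak, p_total = cmos_power(
    c_load=10e-15,    # 10 fF load
    v_dd=1.2,         # volts
    p_01=0.25,        # probability of a 0->1 transition per clock cycle
    f_clock=500e6,    # 500 MHz
    t_sc=20e-12,      # 20 ps of simultaneous conduction
    i_peak=50e-6,     # 50 uA peak short-circuit current
    i_leakage=10e-9,  # 10 nA leakage current
)
print(f"dynamic={p_dyn:.3e} W, short-circuit={p_sc:.3e} W, leakage={p_leak:.3e} W")
```

Only the first two terms scale with the switching activity; the leakage term is paid even when the gate is idle.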


# Dynamic Power Consumption



$$\text{Energy/transition} = C_L * V_{DD}^2 * P_{0 \rightarrow 1}$$

$$P_{dyn} = \text{Energy/transition} * f = C_L * V_{DD}^2 * P_{0 \rightarrow 1} * f$$

$$P_{dyn} = C_{EFF} * V_{DD}^2 * f \quad \text{where } C_{EFF} = P_{0 \rightarrow 1} C_L$$

∅ Switching activity → ∅ Dynamic power

Not a function of transistor sizes!

Data dependent - a function of **switching activity**!

Power = $C_{EFF} V_{DD}^2 f$  
Switching activity = $P_{0 \rightarrow 1}$

# Lowering Dynamic Power

Capacitance:

Function of fan-out,  
wire length, transistor  
sizes

Reduce the capacitance

Activity factor:

How often, on average,  
do wires switch?

$$P_{\text{dyn}} = C_L V_{DD}^2 P_{0 \rightarrow 1} f$$

Reducing the supply voltage hurts performance: delay $\propto \frac{1}{V_{DD} - V_T}$

Supply Voltage:  
Has been dropping  
with successive  
generations

Clock frequency:  
Increasing...

# Short Circuit Power Consumption



Finite slope of the input signal causes a direct current path between  $V_{DD}$  and GND for a short period of time during switching when both the NMOS and PMOS transistors are conducting.

# Short Circuit Currents Determinants

$$E_{sc} = t_{sc} V_{DD} I_{peak} P_{0 \rightarrow 1}$$

$$P_{sc} = t_{sc} V_{DD} I_{peak} f_{0 \rightarrow 1}$$

- Duration and slope of the input signal,  $t_{sc}$
- $I_{peak}$  determined by
  - the saturation current of the P and N transistors which depend on their **sizes**, process technology, temperature, etc.
  - strong function of the ratio between input and output slopes
    - a function of  $C_L$


# Leakage (Static) Power Consumption



Sub-threshold current is the dominant factor.

All increase **exponentially** with temperature!

Leakage increases with temperature, and it is also tied to $V_T$: lowering $V_T$ raises leakage.

## Leakage as a Function of $V_T$

- Continued scaling of supply voltage and the subsequent scaling of threshold voltage will make subthreshold conduction a dominant component of power dissipation.



- A $90\text{ mV/decade}$ $V_T$ roll-off, so each $270\text{ mV}$ increase in $V_T$ gives 3 orders of magnitude reduction in leakage (but adversely affects performance)
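A slope of 90 mV/decade means that sub-threshold leakage scales as $10^{-\Delta V_T / 90\text{ mV}}$. The helper below is a minimal sketch of that relation; the name `leakage_ratio` is an assumption, and the model ignores every other dependence of leakage (temperature, drain voltage, gate leakage).

```python
def leakage_ratio(delta_vt_mv, slope_mv_per_decade=90.0):
    """Factor by which sub-threshold leakage changes for a V_T shift.

    A positive delta_vt_mv (higher V_T) gives a ratio below 1;
    a negative one (lower V_T) gives a ratio above 1.
    """
    return 10.0 ** (-delta_vt_mv / slope_mv_per_decade)


for delta in (90, 180, 270, -90):
    print(f"delta V_T = {delta:+4d} mV -> leakage x {leakage_ratio(delta):g}")
```

Each 90 mV step is one decade, so three decades of leakage reduction cost 3 × 90 mV of threshold voltage.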

# TSMC Processes Leakage and $V_T$

---

|                                | CL018 G | CL018 LP | CL018 ULP | CL018 HS | CL015 HS | CL013 HS |
|--------------------------------|--------------------|---------------------|----------------------|---------------------|---------------------|---------------------|
| $V_{dd}$                       | 1.8 V              | 1.8 V               | 1.8 V                | 2 V                 | 1.5 V               | 1.2 V               |
| $T_{ox}$ (effective)           | 42 Å               | 42 Å                | 42 Å                 | 42 Å                | 29 Å                | 24 Å                |
| $L_{gate}$                     | 0.16 μm            | 0.16 μm             | 0.18 μm              | 0.13 μm             | 0.11 μm             | 0.08 μm             |
| $I_{DSat}$ (n/p)<br>(μA/μm)    | 600/260            | 500/180             | 320/130              | 780/360             | 860/370             | 920/400             |
| $I_{off}$ (leakage)<br>(pA/μm) | 20                 | 1.60                | 0.15                 | 300                 | 1,800               | 13,000              |
| $V_{Tn}$                       | 0.42 V             | 0.63 V              | 0.73 V               | 0.40 V              | 0.29 V              | 0.25 V              |
| FET Perf.<br>(GHz)             | 30                 | 22                  | 14                   | 43                  | 52                  | 80                  |

From MPR, 2000

# Exponential Increase in Leakage Currents



From De, 1999

# Review: Energy & Power Equations

$$E = C_L V_{DD}^2 P_{0 \rightarrow 1} + t_{sc} V_{DD} I_{peak} P_{0 \rightarrow 1} + V_{DD} I_{leakage}$$

$$f_{0 \rightarrow 1} = P_{0 \rightarrow 1} * f_{clock}$$

$$P = C_L V_{DD}^2 f_{0 \rightarrow 1} + t_{sc} V_{DD} I_{peak} f_{0 \rightarrow 1} + V_{DD} I_{leakage}$$

- Dynamic power (share decreasing in relative terms with deep submicron)
- Short-circuit power
- Leakage power (share increasing with deep submicron)

*Leakage power grows from <5% of power budget at .25 micron to 20-25% at 130nm to 40-50% at 90nm and continuing to increase at 65nm and beyond.*

# Power and Energy Design Space

|         | Constant Throughput/Latency                                                                                                                     | Variable Throughput/Latency                            |                                             |
|---------|-------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------|---------------------------------------------|
| Energy  | Design Time                                                                                                                                     | Non-active Modules                                     | Run Time                                    |
| Active  | Logic Design<br>Reduced $V_{dd}$<br> Sizing<br>Multi- $V_{dd}$ | Clock Gating                                           | DFS, DVS<br>(Dynamic Freq, Voltage Scaling) |
| Leakage | + Multi- $V_T$                                                                                                                                  | Sleep Transistors<br>Multi- $V_{dd}$<br>Variable $V_T$ | + Variable $V_T$                            |

# Dynamic Power as a Function of Device Size

- ❑ Device sizing affects dynamic energy consumption
  - gain is largest for networks with large overall effective fan-outs ( $F = C_L/C_{g,1}$ ) *Decreases as F increases*
- ❑ The optimal gate sizing factor ( $f$ ) for dynamic energy is smaller than the one for performance, especially for large  $F$ 's
  - e.g., for $F=20$, $f_{\text{opt}}(\text{energy}) = 3.53$ while $f_{\text{opt}}(\text{performance}) = 4.47$
- ❑ If energy is a concern avoid oversizing beyond the optimal



From Nikolic, UCB

# Standard-Cell Technology Library

## Austriamicrosystems, 0.35 µm CMOS



**0.35µm CMOS**

Digital Standard Cell Databook

Conditions for characterization library c35\_CORELIBD\_BC, corner c35\_CORELIBD\_BC\_best: Vdd = 3.63 V, Tj = -50.0 °C.  
Output transition is defined from 20% to 80% (rising) and from 80% to 20% (falling) of the output voltage.  
Propagation delay is measured from 50% (input rise) or 50% (input fall) to 50% (output rise) or 50% (output fall).

|           |                        |
|-----------|------------------------|
| Strength  | 1                      |
| Cell Area | 29.120 $\mu\text{m}^2$ |
| Equation  | $Q = \text{!}A$        |
| Type      | Combinational          |
| Input     | A                      |
| Output    | Q                      |



| State Table |   |
|-------------|---|
| A           | Q |
| L           | H |
| H           | L |

| Capacitance [fF] |        |
|------------------|--------|
| A                | 2.8210 |

#### Propagation Delay [ns]

| Input Transition [ns] |      | 0.01 |        | 4.00  |        |
|-----------------------|------|------|--------|-------|--------|
| Load Capacitance [fF] |      | 5.00 | 100.00 | 5.00  | 100.00 |
| A to Q                | fall | 0.04 | 0.41   | -0.36 | 0.63   |
|                       | rise | 0.06 | 0.67   | 0.83  | 1.88   |

#### Output Transition [ns]

| Input Transition [ns] |      | 0.01 |        | 4.00 |        |
|-----------------------|------|------|--------|------|--------|
| Load Capacitance [fF] |      | 5.00 | 100.00 | 5.00 | 100.00 |
| A to Q                | fall | 0.04 | 0.54   | 0.70 | 1.43   |
|                       | rise | 0.08 | 1.03   | 0.62 | 1.58   |

#### Dynamic Power Consumption [nW/MHz]

| Input Transition [ns] |      | 0.01  |        | 4.00   |        |
|-----------------------|------|-------|--------|--------|--------|
| Load Capacitance [fF] |      | 5.00  | 100.00 | 5.00   | 100.00 |
| A to Q                | fall | 1.93  | 2.35   | 516.63 | 510.09 |
|                       | rise | 38.82 | 40.07  | 813.89 | 712.95 |

| Leakage [pW] |  |
|--------------|--|
| 0.26         |  |

|           |                        |
|-----------|------------------------|
| Strength  | 1                      |
| Cell Area | 43.680 $\mu\text{m}^2$ |
| Equation  | $Q = \text{!}(A \& B)$ |
| Type      | Combinational          |
| Input     | A, B                   |
| Output    | Q                      |



| State Table |   |   |
|-------------|---|---|
| A           | B | Q |
| L           | - | H |
| H           | H | L |
| -           | L | H |

#### Propagation Delay [ns]

| Input Transition [ns] |      | 0.01 |        | 4.00  |        |
|-----------------------|------|------|--------|-------|--------|
| Load Capacitance [fF] |      | 5.00 | 100.00 | 5.00  | 100.00 |
| A to Q                | fall | 0.06 | 0.58   | -0.13 |        |
|                       | rise | 0.07 | 0.68   | 0.70  | 1.77   |
| B to Q                | fall | 0.06 | 0.58   | -0.37 | 0.60   |
|                       | rise | 0.08 | 0.69   | 0.91  | 1.87   |

#### Output Transition [ns]

| Input Transition [ns] |      | 0.01 |        | 4.00 |        |
|-----------------------|------|------|--------|------|--------|
| Load Capacitance [fF] |      | 5.00 | 100.00 | 5.00 | 100.00 |
| A to Q                | fall | 0.07 | 0.83   | 0.77 | 1.64   |
|                       | rise | 0.10 | 1.05   | 0.75 | 1.71   |
| B to Q                | fall | 0.07 | 0.83   | 0.76 | 1.53   |
|                       | rise | 0.11 | 1.06   | 0.82 | 1.72   |

| Capacitance [fF] |        |
|------------------|--------|
| A                | 2.7240 |
| B                | 3.0190 |

| Leakage [pW] |      |
|--------------|------|
|              | 0.28 |

#### Dynamic Power Consumption [nW/MHz]

| Input Transition [ns] |      | 0.01  |        | 4.00   |        |
|-----------------------|------|-------|--------|--------|--------|
| Load Capacitance [fF] |      | 5.00  | 100.00 | 5.00   | 100.00 |
| A to Q                | fall | 8.00  | 8.49   | 555.73 | 471.83 |
|                       | rise | 54.46 | 55.63  | 732.49 | 629.72 |
| B to Q                | fall | 7.37  | 7.81   | 616.29 | 518.84 |
|                       | rise | 64.63 | 64.73  | 837.28 | 703.47 |

NAND2X1

am  
www.ams.com

|           |                        |
|-----------|------------------------|
| Strength  | 2                      |
| Cell Area | 43.680 $\mu\text{m}^2$ |
| Equation  | $Q = \neg(A \& B)$     |
| Type      | Combinational          |
| Input     | A, B                   |
| Output    | Q                      |



| State Table |   |   |
|-------------|---|---|
| A           | B | Q |
| L           | - | H |
| H           | H | L |
| -           | L | H |

#### Propagation Delay [ns]

| Input Transition [ns] |      | 0.01  |        | 4.00  |        |
|-----------------------|------|-------|--------|-------|--------|
| Load Capacitance [fF] |      | 10.00 | 200.00 | 10.00 | 200.00 |
| A to Q                | fall | 0.06  | 0.66   | -0.09 | 1.13   |
|                       | rise | 0.06  | 0.75   | 0.61  | 1.81   |
| B to Q                | fall | 0.06  | 0.66   | -0.27 | 0.79   |
|                       | rise | 0.06  | 0.69   | 0.75  | 1.78   |

#### Output Transition [ns]

| Input Transition [ns] |      | 0.01  |        | 4.00  |        |
|-----------------------|------|-------|--------|-------|--------|
| Load Capacitance [fF] |      | 10.00 | 200.00 | 10.00 | 200.00 |
| A to Q                | fall | 0.07  | 0.92   | 0.75  | 1.73   |
|                       | rise | 0.08  | 1.16   | 0.71  | 1.80   |
| B to Q                | fall | 0.07  | 0.92   | 0.74  | 1.61   |
|                       | rise | 0.08  | 1.07   | 0.80  | 1.75   |

| Capacitance [fF] |        |
|------------------|--------|
| A                | 3.8420 |
| B                | 4.4020 |

| Leakage [pW] |      |
|--------------|------|
| A            | 0.31 |

#### Dynamic Power Consumption [nW/MHz]

| Input Transition [ns] |      | 0.01  |        | 4.00    |         |
|-----------------------|------|-------|--------|---------|---------|
| Load Capacitance [fF] |      | 10.00 | 200.00 | 10.00   | 200.00  |
| A to Q                | fall | 5.99  | 6.95   | 974.79  | 808.95  |
|                       | rise | 64.40 | 64.67  | 1272.31 | 1068.90 |
| B to Q                | fall | 4.28  | 5.14   | 1162.42 | 945.27  |
|                       | rise | 77.18 | 78.81  | 1504.40 | 1252.61 |

NAND2X2
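The databook characterizes each arc only at the four corners of an (input transition, load capacitance) grid. A common way to estimate a value between those corners is bilinear interpolation; the sketch below applies it to the NAND2X2 "A to Q, fall" propagation delays. Treating the data as bilinear between corners is an assumption of this example, not something the databook states, and the function name `bilinear` is hypothetical.

```python
def bilinear(x, y, xs, ys, table):
    """Interpolate table[i][j], defined at (xs[i], ys[j]), at the point (x, y)."""
    (x0, x1), (y0, y1) = xs, ys
    tx = (x - x0) / (x1 - x0)  # 0 at the first input transition, 1 at the second
    ty = (y - y0) / (y1 - y0)  # 0 at the first load, 1 at the second
    at_x0 = table[0][0] * (1 - ty) + table[0][1] * ty
    at_x1 = table[1][0] * (1 - ty) + table[1][1] * ty
    return at_x0 * (1 - tx) + at_x1 * tx


# NAND2X2, A to Q, fall: rows = input transition, columns = load capacitance.
INPUT_TRANSITION_NS = (0.01, 4.00)
LOAD_FF = (10.0, 200.0)
A_TO_Q_FALL_NS = ((0.06, 0.66),
                  (-0.09, 1.13))

# Fast input edge, load halfway between the two characterized points.
delay = bilinear(0.01, 105.0, INPUT_TRANSITION_NS, LOAD_FF, A_TO_Q_FALL_NS)
print(f"estimated delay: {delay:.2f} ns")
```

At the grid corners the function returns the tabulated values unchanged, which is a quick sanity check on the indexing.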


|           |                        |
|-----------|------------------------|
| Strength  | 6                      |
| Cell Area | 72.800 $\mu\text{m}^2$ |
| Equation  | $Q = \neg(A \& B)$     |
| Type      | Combinational          |
| Input     | A, B                   |
| Output    | Q                      |



| State Table |   |   |
|-------------|---|---|
| A           | B | Q |
| L           | - | H |
| H           | H | L |
| -           | L | H |

#### Propagation Delay [ns]

| Input Transition [ns] |      | 0.01  |        | 4.00  |        |
|-----------------------|------|-------|--------|-------|--------|
| Load Capacitance [fF] |      | 30.00 | 600.00 | 30.00 | 600.00 |
| A to Q                | fall | 0.05  | 0.65   | -0.04 | 1.16   |
|                       | rise | 0.05  | 0.68   | 0.51  | 1.68   |
| B to Q                | fall | 0.05  | 0.65   | -0.28 | 0.80   |
|                       | rise | 0.06  | 0.69   | 0.73  | 1.77   |

#### Output Transition [ns]

| Input Transition [ns] |      | 0.01  |        | 4.00  |        |
|-----------------------|------|-------|--------|-------|--------|
| Load Capacitance [fF] |      | 30.00 | 600.00 | 30.00 | 600.00 |
| A to Q                | fall | 0.06  | 0.92   | 0.72  | 1.70   |
|                       | rise | 0.07  | 1.06   | 0.70  | 1.73   |
| B to Q                | fall | 0.06  | 0.92   | 0.72  | 1.59   |
|                       | rise | 0.08  | 1.07   | 0.78  | 1.74   |

#### Capacitance [fF]

|   |         |
|---|---------|
| A | 10.1390 |
| B | 11.0760 |

#### Leakage [pW]

#### Dynamic Power Consumption [nW/MHz]

| Input Transition [ns] |      | 0.01   |        | 4.00    |         |
|-----------------------|------|--------|--------|---------|---------|
| Load Capacitance [fF] |      | 30.00  | 600.00 | 30.00   | 600.00  |
| A to Q                | fall | 5.73   | 8.27   | 3025.58 | 2459.24 |
|                       | rise | 155.47 | 156.48 | 3806.99 | 3246.79 |
| B to Q                | fall | 5.87   | 8.76   | 3407.23 | 2766.61 |
|                       | rise | 196.60 | 199.91 | 4408.72 | 3647.73 |

NAND2X6


|           |                          |
|-----------|--------------------------|
| Strength  | 1                        |
| Cell Area | 43.680 $\mu\text{m}^2$   |
| Equation  | $Q = \text{!}(A \mid B)$ |
| Type      | Combinational            |
| Input     | A, B                     |
| Output    | Q                        |



| State Table |   |   |
|-------------|---|---|
| A           | B | Q |
| L           | L | H |
| H           | - | L |
| -           | H | L |

#### Propagation Delay [ns]

| Input Transition [ns] |      | 0.01 |        | 4.00  |        |
|-----------------------|------|------|--------|-------|--------|
| Load Capacitance [fF] |      | 5.00 | 100.00 | 5.00  | 100.00 |
| A to Q                | fall | 0.04 | 0.41   | -0.64 | 0.48   |
|                       | rise | 0.11 | 1.17   | 1.28  | 2.61   |
| B to Q                | fall | 0.05 | 0.42   | -0.45 | 0.56   |
|                       | rise | 0.11 | 1.18   | 1.00  | 2.25   |

#### Output Transition [ns]

| Input Transition [ns] |      | 0.01 |        | 4.00 |        |
|-----------------------|------|------|--------|------|--------|
| Load Capacitance [fF] |      | 5.00 | 100.00 | 5.00 | 100.00 |
| A to Q                | fall | 0.04 | 0.55   | 0.73 | 1.46   |
|                       | rise | 0.15 | 1.81   | 0.62 | 2.12   |
| B to Q                | fall | 0.05 | 0.55   | 0.94 | 1.55   |
|                       | rise | 0.15 | 1.81   | 0.80 | 2.12   |

| Capacitance [fF] |        |
|------------------|--------|
| A                | 2.6610 |
| B                | 2.9400 |

| Leakage [pW] |  |
|--------------|--|
| 0.26         |  |

#### Dynamic Power Consumption [nW/MHz]

| Input Transition [ns] |      | 0.01  |        | 4.00   |        |
|-----------------------|------|-------|--------|--------|--------|
| Load Capacitance [fF] |      | 5.00  | 100.00 | 5.00   | 100.00 |
| A to Q                | fall | 4.63  | 5.12   | 326.27 | 363.32 |
|                       | rise | 45.80 | 47.19  | 661.91 | 550.88 |
| B to Q                | fall | 7.48  | 7.98   | 453.09 | 450.46 |
|                       | rise | 55.57 | 56.53  | 796.35 | 658.80 |

NOR2X1


# Dynamic Power Consumption is Data Dependent

- Switching activity,  $P_{0 \rightarrow 1}$ , has two components
  - A static component – function of the logic topology
  - A dynamic component – function of the timing behavior (glitching)

2-input NOR Gate

| A | B | Out |
|---|---|-----|
| 0 | 0 | 1   |
| 0 | 1 | 0   |
| 1 | 0 | 0   |
| 1 | 1 | 0   |

Static transition probability

$$P_{0 \rightarrow 1} = P_{\text{out}=0} \times P_{\text{out}=1} = P_0 \times (1-P_0) \quad \text{(probability of switching)}$$

With input signal probabilities

$$P_{A=1} = 1/2$$
$$P_{B=1} = 1/2$$

NOR static transition probability

$$= 3/4 \times 1/4 = \underline{\underline{3/16}}$$
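The same result can be obtained by enumerating the truth table, which generalizes to any gate with independent inputs. This is a minimal sketch; the function name `static_transition_probability` and the way a gate is passed in as a Python function are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product


def static_transition_probability(gate, input_probs):
    """P(0->1) = P(out=0) * P(out=1), assuming independent inputs.

    gate: function mapping a tuple of input bits to an output bit.
    input_probs: probability that each input is 1.
    """
    p_one = Fraction(0)
    for bits in product((0, 1), repeat=len(input_probs)):
        weight = Fraction(1)
        for bit, p in zip(bits, input_probs):
            weight *= p if bit else 1 - p  # probability of this input pattern
        if gate(bits):
            p_one += weight                # pattern drives the output to 1
    return (1 - p_one) * p_one


def nor2(bits):
    return int(not (bits[0] or bits[1]))


half = Fraction(1, 2)
p_nor = static_transition_probability(nor2, [half, half])
print(p_nor)
```

Exact fractions are used instead of floats so that the result can be compared directly with the hand calculation.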

# Transition Probabilities for Some Basic Gates

|      | $P_{0 \rightarrow 1} = P_{\text{out}=0} \times P_{\text{out}=1}$ |
|------|------------------------------------------------------------------|
| NOR  | $(1 - (1 - P_A)(1 - P_B)) \times (1 - P_A)(1 - P_B)$             |
| OR   | $(1 - P_A)(1 - P_B) \times (1 - (1 - P_A)(1 - P_B))$             |
| NAND | $P_A P_B \times (1 - P_A P_B)$                                   |
| AND  | $(1 - P_A P_B) \times P_A P_B$                                   |
| XOR  | $(1 - (P_A + P_B - 2P_A P_B)) \times (P_A + P_B - 2P_A P_B)$     |



$$\begin{aligned}\text{For } X: P_{0 \rightarrow 1} &= P_0 \times P_1 = (1 - P_A) P_A \\ &= 0.5 \times 0.5 = 0.25\end{aligned}$$

$$\begin{aligned}\text{For } Z: P_{0 \rightarrow 1} &= P_0 \times P_1 = (1 - P_X P_B) P_X P_B \\ &= (1 - (0.5 \times 0.5)) \times (0.5 \times 0.5) = 3/16\end{aligned}$$


# Inter-signal Correlations

- Determining switching activity is complicated by the fact that signals exhibit correlation in space and time
  - reconvergent fan-out

$$(1-0.5)(1-0.5) \times (1-(1-0.5)(1-0.5)) = 3/16$$



Reconvergent

$$(1 - 3/16 \times 0.5) \times (3/16 \times 0.5) = 0.085$$

$$P(Z=1) = P(B=1) \& P(A=1 | B=1)$$



- Have to use conditional probabilities
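The effect of reconvergent fan-out can be checked by brute force. The sketch below uses a hypothetical circuit, $X = A + B$ followed by $Z = X \cdot B$, which is an assumption for illustration and not necessarily the circuit in the figure. Because $B$ feeds both gates, $X$ and $B$ are not independent, and multiplying their signal probabilities gives the wrong answer.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)  # P(A=1) = P(B=1) = 1/2, A and B independent

# Naive estimate: treat X and B as if they were independent.
p_x = 1 - (1 - half) * (1 - half)  # P(X=1) for X = A OR B
p_z_naive = p_x * half             # P(X=1) * P(B=1)

# Exact value: enumerate the primary inputs, so correlation is handled.
p_z_exact = Fraction(0)
for a, b in product((0, 1), repeat=2):
    x = a | b
    z = x & b
    if z:
        p_z_exact += half * half   # each (a, b) pattern has probability 1/4

naive_transition = (1 - p_z_naive) * p_z_naive
exact_transition = (1 - p_z_exact) * p_z_exact
print(p_z_naive, p_z_exact, naive_transition, exact_transition)
```

In this circuit $Z$ reduces to $B$, so $P(Z=1) = P(B=1) \cdot P(X=1 \mid B=1) = 1/2 \cdot 1$, which is what the enumeration reports.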

# Logic Restructuring

- Logic restructuring: changing the topology of a logic network to reduce transitions

$$\text{AND: } P_{0 \rightarrow 1} = P_0 \times P_1 = (1 - P_A P_B) \times P_A P_B$$



Chain implementation has a lower overall switching activity than the tree implementation for random inputs

Ignores glitching effects
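The chain-versus-tree claim can be verified with the AND formula above. The sketch assumes a 4-input AND built from 2-input gates, with independent inputs that are each 1 with probability 0.5; the input count and the helper name `and_activity` are assumptions of this example. Glitching is ignored here as well.

```python
from fractions import Fraction


def and_activity(p_a, p_b):
    """Return (P(out=1), P(0->1)) for a 2-input AND with independent inputs."""
    p_one = p_a * p_b
    return p_one, (1 - p_one) * p_one


half = Fraction(1, 2)

# Chain: ((A.B).C).D
p1, t1 = and_activity(half, half)
p2, t2 = and_activity(p1, half)
p3, t3 = and_activity(p2, half)
chain_total = t1 + t2 + t3

# Tree: (A.B).(C.D)
q1, u1 = and_activity(half, half)
q2, u2 = and_activity(half, half)
q3, u3 = and_activity(q1, q2)
tree_total = u1 + u2 + u3

print(f"chain: {chain_total} = {float(chain_total):.3f}")
print(f"tree:  {tree_total} = {float(tree_total):.3f}")
```

The chain wins because its intermediate nodes have signal probabilities further from 0.5 (1/8 instead of 1/4), and a node switches most when its probability is near 0.5.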

# Input Ordering

Can be done automatically by tools.




Beneficial to postpone the introduction of signals with a **high** transition rate (signals with signal probability close to 0.5)

# Glitching in Static CMOS Networks

- Gates have a nonzero propagation delay resulting in spurious transitions or **glitches** (dynamic hazards)
  - glitch: node exhibits multiple transitions in a single cycle before settling to the correct logic value



# Glitching in an RCA



# Balanced Delay Paths to Reduce Glitching

- Glitching is due to a mismatch in the path lengths in the logic network; if all input signals of a gate change simultaneously, no glitching occurs



So equalize the lengths of timing paths through logic

# Power and Energy Design Space

|         | Constant Throughput/Latency                                                    | Variable Throughput/Latency                            |                                             |
|---------|--------------------------------------------------------------------------------|--------------------------------------------------------|---------------------------------------------|
| Energy  | Design Time                                                                    | Non-active Modules                                     | Run Time                                    |
| Active  | Logic Design<br>Reduced $V_{dd}$<br>Sizing<br>**Multi-$V_{dd}$** | Clock Gating                                           | DFS, DVS<br>(Dynamic Freq, Voltage Scaling) |
| Leakage | + Multi- $V_T$                                                                 | Sleep Transistors<br>Multi- $V_{dd}$<br>Variable $V_T$ | + Variable $V_T$                            |

# Dynamic Power as a Function of $V_{DD}$

- Decreasing the  $V_{DD}$  **decreases** dynamic energy consumption (quadratically)
- But, **increases** gate delay (decreases performance)

$$t_{p_{HL}} \propto \frac{K \cdot C}{\beta_n \cdot (V_{DD} - V_{Tn})}$$

$$t_{p_{LH}} \propto \frac{K \cdot C}{\beta_p \cdot (V_{DD} + V_{Tp})}$$



- Determine the critical path(s) at **design time** and use high  $V_{DD}$  for the transistors on those paths for speed. Use a lower  $V_{DD}$  on the other gates, especially those that drive large capacitances (as this yields the largest energy benefits).
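The trade-off can be put in numbers with the first-order relations on this slide: dynamic energy proportional to $V_{DD}^2$ and gate delay proportional to $1/(V_{DD} - V_T)$. The sketch below is illustrative only; the function name `scale_supply` is an assumption, and the voltages are the 2002 and 2005 entries of the standby-power table ($V_{dd}$ from 1.5 V to 1.2 V at $V_T$ = 0.4 V).

```python
def scale_supply(v_old, v_new, v_t):
    """Relative dynamic energy and gate delay after changing V_DD.

    Uses the first-order model: energy ~ V_DD^2, delay ~ 1 / (V_DD - V_T).
    Returns (energy_ratio, delay_ratio), both relative to the old supply.
    """
    energy_ratio = (v_new / v_old) ** 2
    delay_ratio = (v_old - v_t) / (v_new - v_t)
    return energy_ratio, delay_ratio


energy_ratio, delay_ratio = scale_supply(v_old=1.5, v_new=1.2, v_t=0.4)
print(f"energy x {energy_ratio:.2f}, delay x {delay_ratio:.3f}")
```

Under this model a 20% lower supply saves about a third of the dynamic energy but makes the gate noticeably slower, which is why the lower supply is reserved for gates off the critical path.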

# Multiple $V_{DD}$ Considerations

- How many  $V_{DD}$ ? – Two is becoming common
  - Many chips already have two supplies (one for core and one for I/O)
- When combining multiple supplies, **level converters** are required whenever a module at the lower supply drives a gate at the higher supply (step-up)
  - If a gate supplied with  $V_{DDL}$  drives a gate at  $V_{DDH}$ , the PMOS never turns off
    - The cross-coupled PMOS transistors do the level conversion
    - The NMOS transistors operate on a reduced supply
  - Level converters are not needed for a step-down change in voltage
  - Overhead of level converters can be mitigated by doing conversions at register boundaries and embedding the level conversion inside the flip-flop (see next slide)



# Dual-Supply Inside a Logic Block

- Minimum energy consumption is achieved if **all** logic paths are critical (have the same delay)
- Clustered voltage-scaling
  - Each path starts with  $V_{DDH}$  and switches to  $V_{DDL}$  (gray logic gates) when delay **slack** is available
  - Level conversion is done in the flip-flops at the end of the paths



# Power and Energy Design Space

---

|         | Constant Throughput/Latency                                   | Variable Throughput/Latency                            |                                             |
|---------|---------------------------------------------------------------|--------------------------------------------------------|---------------------------------------------|
| Energy  | Design Time                                                   | Non-active Modules                                     | Run Time                                    |
| Active  | Logic Design<br>Reduced $V_{dd}$<br>Sizing<br>Multi- $V_{dd}$ | Clock Gating                                           | DFS, DVS<br>(Dynamic Freq, Voltage Scaling) |
| Leakage | + Multi- $V_T$                                                | Sleep Transistors<br>Multi- $V_{dd}$<br>Variable $V_T$ | + Variable $V_T$                            |

# Leakage as a Function of Design Time $V_T$

- ❑ Reducing the  $V_T$  increases the sub-threshold leakage current (exponentially)
  - 90mV reduction in  $V_T$  increases leakage by an order of magnitude
- ❑ But, reducing  $V_T$  decreases gate delay (increases performance)
- ❑ Determine the critical path(s) at design time and use low  $V_T$  devices on the transistors on those paths for speed. Use a high  $V_T$  on the other logic for leakage control.
  - A careful assignment of  $V_T$ 's can reduce the leakage by as much as 80%



## Dual-Thresholds Inside a Logic Block

- ❑ Minimum energy consumption is achieved if **all** logic paths are critical (have the same delay)
- ❑ Use lower threshold on timing-critical paths
  - Assignment can be done on a per gate or transistor basis; no clustering of the logic is needed
  - No level converters are needed



$$T \geq t_{c-q} + t_{p\text{logic}} + t_{su}$$

## Example for evaluating minimum Clock Period



$T$  (clock period)  
↓  
[Clock Pulse Diagram]

$$T \geq t_{c-q} + t_{p\text{logic}} + t_{su}$$

# Example for evaluating minimum Clock Period



| Path |      |      |      |      | Total [ns] |
|------|------|------|------|------|------------|
| A    | 0.9  | 0.35 | 0.13 | 0.17 | 1.55       |
| B    | 0.9  | 0.16 |      | 0.16 | 1.22       |
| C    | 0.75 | 0.17 |      | 0.16 | 1.08       |
| D    | 0.74 | 0.14 |      | 0.16 | 1.04       |
| E    | 1.2  | 0.11 |      | 0.17 | 1.48       |
| F    | 1.2  | 0.33 | 0.13 | 0.17 | 1.83       |

$\Delta = 1.83 - 1.22 = 0.61$ ns (slack of path B with respect to the slowest path, F)  
$f_{max} = 1 / 1.83\ \text{ns} \approx 546.45$ MHz
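The minimum clock period is set by the slowest path, and every other path has slack with respect to it. The sketch below recomputes these numbers from the path totals in the last column of the table, assuming each total is already the sum $t_{c-q} + t_{p\text{logic}} + t_{su}$ for that path and is expressed in ns. The variable names are assumptions for this example.

```python
# Total delay of each path in ns (last column of the table above).
path_delay_ns = {"A": 1.55, "B": 1.22, "C": 1.08, "D": 1.04, "E": 1.48, "F": 1.83}

# The clock period must cover the slowest (critical) path.
critical_path = max(path_delay_ns, key=path_delay_ns.get)
t_min_ns = path_delay_ns[critical_path]
f_max_mhz = 1e3 / t_min_ns  # 1 / ns = GHz, so multiply by 1000 for MHz

# Slack: time each path could be slowed down without stretching the clock.
slack_ns = {p: round(t_min_ns - d, 2) for p, d in path_delay_ns.items()}

print(f"critical path {critical_path}: T >= {t_min_ns} ns, f_max = {f_max_mhz:.2f} MHz")
for path, slack in slack_ns.items():
    print(f"path {path}: slack {slack:.2f} ns")
```

Paths with positive slack are the candidates for the slower, lower-energy options discussed earlier (lower $V_{DD}$ or higher $V_T$).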


---

# **Low Power Techniques in Microarchitectures and Memories**

# Review: Energy & Power Equations

$$E = C_L V_{DD}^2 P_{0 \rightarrow 1} + t_{sc} V_{DD} I_{peak} P_{0 \rightarrow 1} + V_{DD} I_{leakage}$$

$$f_{0 \rightarrow 1} = P_{0 \rightarrow 1} * f_{clock}$$

$$P = C_L V_{DD}^2 f_{0 \rightarrow 1} + t_{sc} V_{DD} I_{peak} f_{0 \rightarrow 1} + V_{DD} I_{leakage}$$

- Dynamic power (share decreasing in relative terms with deep submicron)
- Short-circuit power
- Leakage power (share increasing with deep submicron)

*Leakage power grows from <5% of power budget at .25 micron to 20-25% at 130nm to 40-50% at 90nm and continuing to increase at 65nm and beyond.*

# Power and Energy Design Space

|         | Constant Throughput/Latency                                  |                                                       | Variable Throughput/Latency                 |
|---------|--------------------------------------------------------------|-------------------------------------------------------|---------------------------------------------|
| Energy  | Design Time                                                  | Non-active Modules                                    | Run Time                                    |
| Active  | Logic Design<br>Reduced $V_{dd}$<br>Sizing<br>Multi-$V_{dd}$ | Clock Gating                                          | DFS, DVS<br>(Dynamic Freq, Voltage Scaling) |
| Leakage | + Multi-$V_T$                                                | Sleep Transistors<br>Multi-$V_{dd}$<br>Variable $V_T$ | + Variable $V_T$                            |

# Bus Multiplexing

- Buses are a significant source of power dissipation due to high switching activities and large capacitive loading
  - 15% of total power in Alpha 21064
  - 30% of total power in Intel 80386
- Share long data buses with time multiplexing ($S_1$ uses even cycles, $S_2$ odd cycles)



- But what if data samples are correlated (e.g., sign bits)?

# Correlated Data Streams



- ❑ On a shared (multiplexed) bus, the advantages of data correlation are lost (the bus carries interleaved samples from two uncorrelated data streams)
  - Bus sharing should **not be used** for **positively** correlated data streams
  - Bus sharing **may prove advantageous** for a **negatively** correlated data stream (where successive samples switch sign bits, giving more random switching)
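The loss of correlation on a shared bus can be checked by counting bit toggles. The sketch below assumes 8-bit two's complement words and two made-up, slowly varying streams (one positive, one negative). It compares two dedicated buses against one bus that interleaves $S_1$ and $S_2$. The helper `toggles` and the sample data are illustrative only.

```python
def toggles(words, width=8):
    """Count bit flips between consecutive words on a bus of the given width."""
    mask = (1 << width) - 1
    return sum(bin((a ^ b) & mask).count("1") for a, b in zip(words, words[1:]))

# Two positively correlated streams: each sample is close to the previous one,
# so the sign bits (and most upper bits) stay constant within a stream.
s1 = [3, 4, 5, 4, 3, 4]
s2 = [-3, -4, -5, -4, -3, -4]

dedicated = toggles(s1) + toggles(s2)  # one bus per stream
shared = toggles([w for pair in zip(s1, s2) for w in pair])  # S1 even, S2 odd cycles

print(f"dedicated buses: {dedicated} toggles, shared bus: {shared} toggles")
```

With this data the shared bus toggles its upper bits on almost every cycle, because each transfer alternates between a positive and a negative word.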

# Glitch Reduction by Pipelining

- ❑ Glitches depend on the **logic depth** of the circuit - gates deeper in the logic network are more prone to glitching
  - arrival times of the gate inputs are more spread due to delay imbalances
  - usually affected more by primary input switching
- ❑ Reduce logic depth by adding pipeline registers
  - additional energy used by the clock and pipeline registers




# Clock Gating

- ❑ Most popular method for power reduction of clock signals and functional units
- ❑ Gate off clock to idle functional units
  - e.g., floating point units
  - need logic to generate **disable** signal
    - increases complexity of control logic
    - consumes power
    - timing critical to avoid clock glitches at OR gate output
  - additional gate delay on clock signal
    - gating OR gate can replace a buffer in the clock distribution tree
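The OR-gate scheme above can be modelled at the waveform level. This is a simplified sketch, assuming ideal sampled signals (two samples per clock cycle) and a **disable** signal that only changes while the clock is high, so that the OR output cannot glitch. The names `gated_clock` and `rising_edges` are hypothetical.

```python
def gated_clock(clk, disable):
    """OR-gate clock gating: the output is held high while disable is 1."""
    return [c | d for c, d in zip(clk, disable)]

def rising_edges(wave):
    """Count 0 -> 1 transitions, i.e. the edges that clock the registers."""
    return sum(1 for a, b in zip(wave, wave[1:]) if a == 0 and b == 1)

clk = [0, 1] * 8                        # 8 clock cycles, low phase then high phase
disable = [0] * 5 + [1] * 8 + [0] * 3   # changes at odd indices, where clk is 1

gclk = gated_clock(clk, disable)
print("ungated edges:", rising_edges(clk))
print("gated edges:  ", rising_edges(gclk))
```

The idle unit sees four clock edges instead of eight. If `disable` changed while the clock was low, the OR output would rise early, which is the glitch hazard the slide warns about.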



# Clock Gating in a Pipelined Datapath

- For idle units (e.g., floating point units in Exec stage, WB stage for instructions with no write back operation)



# Power and Energy Design Space

|         | Constant Throughput/Latency                                      |                                                       | Variable Throughput/Latency                     |
|---------|------------------------------------------------------------------|-------------------------------------------------------|-------------------------------------------------|
| Energy  | Design Time                                                      | Non-active Modules                                    | Run Time                                        |
| Active  | Logic Design<br>Reduced $V_{dd}$<br>Sizing<br>**Multi-$V_{dd}$** | Clock Gating                                          | **DFS, DVS**<br>(Dynamic Freq, Voltage Scaling) |
| Leakage | + Multi-$V_T$                                                    | Sleep Transistors<br>Multi-$V_{dd}$<br>Variable $V_T$ | + Variable $V_T$                                |

## Review: Dynamic Power as a Function of $V_{DD}$

- Decreasing $V_{DD}$ **decreases** dynamic energy consumption (quadratically)
- But it also **increases** gate delay (decreases performance)



- Determine the critical path(s) at **design time** and use high $V_{DD}$ for the transistors on those paths for speed. Use a lower $V_{DD}$ on the other gates, especially those that drive large capacitances (as this yields the largest energy benefits).
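A rough feel for the benefit of this assignment comes from summing $C V_{DD}^2$ over the gates. The sketch below is illustrative only: the supply values, the four-gate netlist and the function `switching_energy_fj` are assumptions, and it ignores level-converter and activity-factor effects.

```python
V_HIGH, V_LOW = 1.2, 0.8  # hypothetical supply voltages in volts

# (gate name, load capacitance in fF, on a critical path?), an invented netlist
gates = [("g1", 10, True), ("g2", 40, False), ("g3", 5, True), ("g4", 100, False)]

def switching_energy_fj(gates, v_critical, v_other):
    """Sum C * V^2 per gate (fF * V^2 = fJ), choosing V by criticality."""
    return sum(c * (v_critical if crit else v_other) ** 2 for _, c, crit in gates)

single = switching_energy_fj(gates, V_HIGH, V_HIGH)  # every gate at high V_DD
dual = switching_energy_fj(gates, V_HIGH, V_LOW)     # low V_DD off the critical path

print(f"single supply: {single:.1f} fJ, dual supply: {dual:.1f} fJ")
```

Most of the saving comes from `g4`, the non-critical gate with the largest load, matching the remark that gates driving large capacitances benefit most.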

# Dynamic Frequency and Voltage Scaling

- ❑ Intel's SpeedStep
  - Hardware that steps down the clock frequency (dynamic frequency scaling, DFS) when the user unplugs from AC power
    - PLL from 650 MHz → 500 MHz
  - CPU stalls during SpeedStep adjustment
- ❑ Transmeta LongRun
  - Hardware that applies **both** DFS and DVS (dynamic supply voltage scaling)
    - 32 levels of $V_{DD}$ from 1.1 V to 1.6 V
    - PLL from 200 MHz → 700 MHz in increments of 33 MHz
  - Triggered when a CPU load change is detected by software
    - heavier load → ramp up $V_{DD}$; when it is stable, speed up the clock
    - lighter load → slow down the clock; when the PLL locks onto the new rate, ramp down $V_{DD}$
  - CPU stalls only during PLL relock (< 20 microseconds)
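The ordering rule for LongRun-style transitions (voltage first when speeding up, frequency first when slowing down) can be written down directly. This is a sketch of the sequencing only, not Transmeta's implementation. `transition_steps`, `energy_ratio` and the action names are hypothetical.

```python
def transition_steps(old, new):
    """Order the actions for moving between (freq_mhz, vdd) operating points."""
    (f_old, _), (f_new, v_new) = old, new
    if f_new > f_old:   # heavier load: raise V_DD first, then speed up the clock
        return [("set_vdd", v_new), ("set_freq", f_new)]
    if f_new < f_old:   # lighter load: slow the clock first, then lower V_DD
        return [("set_freq", f_new), ("set_vdd", v_new)]
    return []

def energy_ratio(v_new, v_old):
    """Dynamic energy per operation scales with V_DD squared."""
    return (v_new / v_old) ** 2

slow, fast = (200, 1.1), (700, 1.6)   # the end points quoted on the slide
print("ramp up:  ", transition_steps(slow, fast))
print("ramp down:", transition_steps(fast, slow))
print(f"energy per operation at 1.1 V vs 1.6 V: {energy_ratio(1.1, 1.6):.2f}x")
```

Either order keeps the supply high enough for the clock rate in force at every intermediate step.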



# Dynamic Thermal Management (DTM)



- ❑ Trigger mechanism: when do we enable DTM techniques?
- ❑ Initiation mechanism: how do we enable the technique?
- ❑ Response mechanism: what technique do we enable?

# DTM Trigger Mechanisms



- ❑ Mechanism: How to deduce temperature?
- ❑ Direct approach: on-chip temperature sensors
  - Based on differential voltage change across 2 diodes of different sizes
  - May require >1 sensor
  - Hysteresis and delay are problems
- ❑ Policy: When to begin responding?
  - Trigger level set too high means higher packaging costs
  - Trigger level set too low means frequent triggering and loss in performance
- ❑ Choose trigger level to exploit difference between average and worst case power

# DTM Initiation and Response Mechanisms



- ❑ Operating system or microarchitectural control?
  - Hardware support can reduce performance penalty by 20-30%
- ❑ Initiation of policy incurs some delay
  - When using DVS and/or DFS, much of the performance penalty can be attributed to enabling/disabling overhead
  - Increasing policy delay reduces overhead; smarter initiation techniques would help as well
- ❑ Thermal window (100K+ cycles)
  - Larger thermal windows “smooth” short thermal spikes
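The smoothing effect of a thermal window can be illustrated with a moving average over temperature readings. The sketch below is an assumption-laden model: the class `ThermalTrigger`, the 85 °C threshold and the sample values are invented, and the window is counted in samples rather than in processor cycles.

```python
from collections import deque

class ThermalTrigger:
    """Fire when the average of the last `window` readings exceeds a threshold."""

    def __init__(self, threshold_c, window):
        self.threshold_c = threshold_c
        self.samples = deque(maxlen=window)  # old readings drop out automatically

    def update(self, temp_c):
        self.samples.append(temp_c)
        return sum(self.samples) / len(self.samples) > self.threshold_c

spike = [80, 80, 80, 95, 80, 80]  # one short thermal spike
narrow = ThermalTrigger(threshold_c=85, window=1)
wide = ThermalTrigger(threshold_c=85, window=4)

narrow_fired = [narrow.update(t) for t in spike]
wide_fired = [wide.update(t) for t in spike]
print("window=1:", narrow_fired)
print("window=4:", wide_fired)
```

The one-sample window reacts to the spike, while the four-sample window averages it away and never engages the response.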

# DTM Activation and Deactivation Cycle



- Initiation Delay – OS interrupt/handler
- Response Delay – Invocation time (e.g., adjust clock)
- Policy Delay – Number of cycles engaged
- Shutoff Delay – Disabling time (e.g., re-adjust clock)
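One way to see why a longer policy delay reduces overhead is to treat the initiation, response and shutoff delays as fixed costs per activation. The model below is an assumption made for illustration. The function `overhead_fraction` and all cycle counts are hypothetical.

```python
def overhead_fraction(initiation, response, policy, shutoff):
    """Share of one DTM activation spent enabling and disabling the technique."""
    overhead = initiation + response + shutoff
    return overhead / (overhead + policy)

# Hypothetical cycle counts for the fixed costs of one activation.
fixed = dict(initiation=250, response=500, shutoff=250)

short_policy = overhead_fraction(policy=5_000, **fixed)
long_policy = overhead_fraction(policy=50_000, **fixed)
print(f"policy delay  5K cycles: {short_policy:.1%} overhead")
print(f"policy delay 50K cycles: {long_policy:.1%} overhead")
```

In this model the fixed costs are amortized over more engaged cycles as the policy delay grows, so the overhead share falls.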

# DTM Savings Benefits


