

# **Electronics Systems**

## **Computer Engineering**

### **Designing for Low Power**

Luca Fanucci

[Adapted from Rabaey's *Digital Integrated Circuits*, Second Edition, ©2003  
J. Rabaey, A. Chandrakasan, B. Nikolic]

# Why Power Matters

---

- Packaging costs
- Power supply rail design
- Chip and system cooling costs
- Noise immunity and system reliability
- Battery life (in portable systems)

## Environmental concerns

- Office equipment (professional, government and banks) accounted for 14% of total US commercial energy usage in 2012\*
- *Energy Star* compliant systems. The *Energy Star* program is incorporating standby energy into its ratings. Standby energy in office equipment represents a significant hidden energy cost.

\* U.S. Energy Information Administration, 2012 Commercial Building Energy Consumption Survey: Energy Usage Summary, Table 1 (March 2016)

# Why worry about power? -- Power Dissipation

Lead microprocessor power continues to increase



Power delivery and dissipation will be prohibitive

Source: Borkar, De Intel®

# Why worry about power? -- Chip Power Density



Source: Borkar, De Intel®

# Chip Power Density Distribution

Power Map



On-Die Temperature



- ❑ Power density is not uniformly distributed across the chip
- ❑ Silicon is not a good heat conductor
- ❑ Max junction temperature is determined by hot-spots
  - Impact on packaging, w.r.t. cooling

## Problem Illustration (1/2)



## Problem Illustration (2/2)



# Intel's Tejas Project



Craig R. Barrett, the chief executive of Intel, told analysts that the company would move down a "parallel track."

Intel Corporation's newest microprocessor (Tejas) was running slower and hotter than its predecessor.

Obtaining more computing power by stamping multiple processors on a single chip rather than straining to increase the speed of a single processor

New York Times, May 17, 2004

# Pentium®4 processor

- Dual-Core/Multi-Threaded Pentium®4 Processor on 90 nm process
  - 2-1M caches, speeds to 3.2 GHz, support for overclocking, up to 4 threads.
  - Shared 800 MHz quad-pumped FSB.
    - Independent bus tuning per agent
  - Enhanced auto-halt and 2-state speed step power management
    - Independent events supported per core.



# Highlights (3.2 GHz)

- 241 M transistors
- 235 mm<sup>2</sup>
- 9 cores, 10 threads
- >200 GFlops (SP)
- >20 GFlops (DP)
- Up to 25 GB/s memory B/W
- Up to 75 GB/s I/O B/W
- >300 GB/s EIB
- Top frequency >4GHz  
(observed in lab)



# The Performance vs. Power Dilemma



# Power Management Is Challenging



*Leakage power begins to dominate at advanced process geometries*

# Why worry about power? -- Battery Size/Weight



Expected battery lifetime increase  
over the next 5 years: **30 to 40%**

From Rabaey, 1995

# Why worry about power? -- Standby Power

| Year                      | 2002 | 2005 | 2008 | 2011 | 2014 |
|---------------------------|------|------|------|------|------|
| Power supply $V_{dd}$ (V) | 1.5  | 1.2  | 0.9  | 0.7  | 0.6  |
| Threshold $V_T$ (V)       | 0.4  | 0.4  | 0.35 | 0.3  | 0.25 |

- Drain leakage will increase as  $V_T$  decreases to maintain noise margins and meet frequency demands, leading to excessive **battery draining standby power consumption.**



Source: Borkar, De Intel®

# Low power design challenge

## The challenge

“To design an embedded system (HW *and* SW) that provides the target functionality with minimum power consumption”

## The solution

From the system concept down to the implementation phase, adopt a design style that includes power consumption as a figure of merit, and exploit all the opportunities and techniques available at each design level to reduce it

# Power Saving Opportunities



# CMOS Energy & Power Equations

$$E = C_L V_{DD}^2 P_{0 \rightarrow 1} + t_{sc} V_{DD} I_{peak} P_{0 \rightarrow 1} + V_{DD} I_{leakage}$$

$$f_{0 \rightarrow 1} = P_{0 \rightarrow 1} * f_{clock}$$

$$P = C_L V_{DD}^2 f_{0 \rightarrow 1} + t_{sc} V_{DD} I_{peak} f_{0 \rightarrow 1} + V_{DD} I_{leakage}$$

Dynamic power

Short-circuit power

Leakage power

# Dynamic Power Consumption



$$\text{Energy/transition} = C_L * V_{DD}^2 * P_{0 \rightarrow 1}$$
$$P_{dyn} = \text{Energy/transition} * f = C_L * V_{DD}^2 * (P_{0 \rightarrow 1} * f)$$

$$P_{dyn} = C_{EFF} * V_{DD}^2 * f \quad \text{where } C_{EFF} = P_{0 \rightarrow 1} C_L$$

Not a function of transistor sizes!

Data dependent - a function of **switching activity**!
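As a minimal sketch, the dynamic power formula above can be evaluated directly. The operating point used here (100 fF load, 1.2 V supply, activity 0.25, 500 MHz clock) is a hypothetical example and not taken from the slides.

```python
def dynamic_power(c_load, v_dd, p_01, f_clock):
    """Return P_dyn = C_EFF * V_DD^2 * f in watts, with C_EFF = P_0->1 * C_L."""
    c_eff = p_01 * c_load  # effective switched capacitance
    return c_eff * v_dd ** 2 * f_clock

# Hypothetical operating point (assumed values, not from the slides)
p = dynamic_power(c_load=100e-15, v_dd=1.2, p_01=0.25, f_clock=500e6)
print(f"P_dyn = {p * 1e6:.2f} uW")

# Halving the supply voltage cuts dynamic power by a factor of four
p_half = dynamic_power(c_load=100e-15, v_dd=0.6, p_01=0.25, f_clock=500e6)
print(f"ratio = {p / p_half:.1f}")
```

Note that transistor sizes enter only indirectly, through their contribution to $C_L$.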

# Lowering Dynamic Power

Capacitance:  
Function of fan-out,  
wire length, transistor  
sizes

Supply Voltage:  
Has been dropping  
with successive  
generations

$$P_{\text{dyn}} = C_L V_{DD}^2 P_{0 \rightarrow 1} f$$

Activity factor:  
How often, on average,  
do wires switch?

Clock frequency:  
Increasing...

# Short Circuit Power Consumption



Finite slope of the input signal causes a direct current path between  $V_{DD}$  and GND for a short period of time during switching when both the NMOS and PMOS transistors are conducting.

# Short Circuit Currents Determinants

---

$$E_{sc} = t_{sc} V_{DD} I_{peak} P_{0 \rightarrow 1}$$

$$P_{sc} = t_{sc} V_{DD} I_{peak} f_{0 \rightarrow 1}$$

- Duration and slope of the input signal,  $t_{sc}$
- $I_{peak}$  determined by
  - the saturation current of the P and N transistors which depend on their **sizes**, process technology, temperature, etc.
  - strong function of the ratio between input and output slopes
    - a function of  $C_L$

# Leakage (Static) Power Consumption



Sub-threshold current is the dominant factor.

All increase **exponentially** with temperature!

## Leakage as a Function of $V_T$

- ❑ Continued scaling of supply voltage and the subsequent scaling of threshold voltage will make subthreshold conduction a dominant component of power dissipation.

- ❑ A  $90\text{ mV}/\text{decade}$   $V_T$  roll-off - so each  $270\text{ mV}$  increase in  $V_T$  (three decades) gives 3 orders of magnitude reduction in leakage (but adversely affects performance)
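The roll-off can be turned into a quick estimate. The sketch below assumes a purely exponential dependence of sub-threshold leakage on $V_T$ with a slope of 90 mV/decade; the function name and the chosen $V_T$ steps are illustrative only.

```python
def leakage_reduction(delta_vt_mv, slope_mv_per_decade=90.0):
    """Factor by which sub-threshold leakage drops when V_T rises by delta_vt_mv.

    Assumes an ideal exponential roll-off (one decade per slope_mv_per_decade).
    """
    return 10.0 ** (delta_vt_mv / slope_mv_per_decade)

for delta in (90, 180, 270):
    print(f"+{delta} mV in V_T -> leakage divided by {leakage_reduction(delta):.0f}")
```

Real devices deviate from this ideal slope with temperature and process, so the numbers are only order-of-magnitude guidance.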



# TSMC Processes Leakage and $V_T$

|                               | CL018 G | CL018 LP | CL018 ULP | CL018 HS | CL015 HS | CL013 HS |
|-------------------------------|---------|----------|-----------|----------|----------|----------|
| $V_{dd}$                      | 1.8 V   | 1.8 V    | 1.8 V     | 2 V      | 1.5 V    | 1.2 V    |
| $T_{ox}$ (effective)          | 42 Å    | 42 Å     | 42 Å      | 42 Å     | 29 Å     | 24 Å     |
| $L_{gate}$                    | 0.16 μm | 0.16 μm  | 0.18 μm   | 0.13 μm  | 0.11 μm  | 0.08 μm  |
| $I_{DSat}^{(n/p)}$ (μA/μm)    | 600/260 | 500/180  | 320/130   | 780/360  | 860/370  | 920/400  |
| $I_{off}^{(leakage)}$ (pA/μm) | 20      | 1.60     | 0.15      | 300      | 1,800    | 13,000   |
| $V_{Tn}$                      | 0.42 V  | 0.63 V   | 0.73 V    | 0.40 V   | 0.29 V   | 0.25 V   |
| FET Perf. (GHz)               | 30      | 22       | 14        | 43       | 52       | 80       |

From MPR, 2000

# Exponential Increase in Leakage Currents



From De, 1999

# Review: Energy & Power Equations

$$E = C_L V_{DD}^2 P_{0 \rightarrow 1} + t_{sc} V_{DD} I_{\text{peak}} P_{0 \rightarrow 1} + V_{DD} I_{\text{leakage}}$$

$$f_{0 \rightarrow 1} = P_{0 \rightarrow 1} * f_{\text{clock}}$$

$$P = C_L V_{DD}^2 f_{0 \rightarrow 1} + t_{sc} V_{DD} I_{\text{peak}} f_{0 \rightarrow 1} + V_{DD} I_{\text{leakage}}$$

**Dynamic power**  
(% decreasing  
relatively with  
deep submicron)

**Short-circuit  
power**

**Leakage power**  
(% increasing  
with deep  
submicron)

*Leakage power grows from <5% of power budget at .25 micron to 20-25% at 130nm to 40-50% at 90nm and continuing to increase at 65nm and beyond.*

# Power and Energy Design Space

|         | Constant Throughput/Latency                                    | Constant Throughput/Latency                            | Variable Throughput/Latency                  |
|---------|----------------------------------------------------------------|--------------------------------------------------------|----------------------------------------------|
| Energy  | Design Time                                                    | Non-active Modules                                     | Run Time                                     |
| Active  | Logic Design<br>Reduced $V_{dd}$<br>Sizing<br>Multi- $V_{dd}$  | Clock Gating                                           | DFS, DVFS<br>(Dynamic Freq, Voltage Scaling) |
| Leakage | + Multi- $V_T$                                                 | Sleep Transistors<br>Multi- $V_{dd}$<br>Variable $V_T$ | + Variable $V_T$                             |

# Dynamic Power as a Function of Device Size

- Device sizing affects dynamic energy consumption
  - gain is largest for networks with large overall effective fan-outs ( $F = C_L/C_{g,1}$ )
  - The optimal gate sizing factor ( $f$ ) for dynamic energy is smaller than the one for performance, especially for large  $F$ 's
    - e.g., for  $F=20$ ,  
 $f_{\text{opt}}(\text{energy}) = 3.53$  while  
 $f_{\text{opt}}(\text{performance}) = 4.47$
- If energy is a concern avoid oversizing beyond the optimal



From Nikolic, UCB

# **Standard-Cell Technology Library: Austriamicrosystems, 0.35 μm CMOS**



**0.35 $\mu$ m CMOS**

Digital Standard Cell Databook

# INVX1



[www.ams.com](http://www.ams.com)

Conditions for characterization library c35\_CORELIBD\_BC, corner c35\_CORELIBD\_BC best: Vdd= 3.63V, TJ= -50.0 deg. C.  
 Output transition is defined from 20% to 80% (rising) and from 80% to 20% (falling) output voltage.  
 Propagation delay is measured from 50% (input rise) or 50% (input fall) to 50% (output rise) or 50% (output fall).

|           |                        |
|-----------|------------------------|
| Strength  | 1                      |
| Cell Area | 29.120 $\mu\text{m}^2$ |
| Equation  | $Q = \text{!}A$        |
| Type      | Combinational          |
| Input     | A                      |
| Output    | Q                      |



## Capacitance [fF]

A 2.8210

## State Table

| A | Q |
|---|---|
| L | H |
| H | L |

## Propagation Delay [ns]

|                       |      |        |
|-----------------------|------|--------|
| Input Transition [ns] | 0.01 | 4.00   |
| Load Capacitance [fF] | 5.00 | 100.00 |
| A to Q                | fall | 0.41   |
|                       | rise | 0.67   |

## Output Transition [ns]

|                       |      |        |
|-----------------------|------|--------|
| Input Transition [ns] | 0.01 | 4.00   |
| Load Capacitance [fF] | 5.00 | 100.00 |
| A to Q                | fall | 0.04   |
|                       | rise | 0.08   |

## Dynamic Power Consumption [nW/MHz]

|                       |      |        |
|-----------------------|------|--------|
| Input Transition [ns] | 0.01 | 4.00   |
| Load Capacitance [fF] | 5.00 | 100.00 |
| A to Q                | fall | 1.93   |
|                       | rise | 38.82  |

## Leakage [pW]

0.26

**Strength 1**

|           |                                       |
|-----------|---------------------------------------|
| Cell Area | 43.680 $\mu\text{m}^2$                |
| Equation  | $Q = \overline{A \cdot B}$            |
| Type      | Combinational                         |
| Input     | A, B                                  |
| Output    | Q                                     |

**State Table**

|  | A | B | Q |
|--|---|---|---|
|  | L | - | H |
|  | H | H | L |
|  | - | L | H |

**Propagation Delay [ns]**

| Input Transition [ns] | 0.01 | 4.00   |
|-----------------------|------|--------|
| Load Capacitance [fF] | 5.00 | 100.00 |
| A to Q                | fall | 0.06   |
|                       | rise | 0.07   |
| B to Q                | fall | 0.06   |
|                       | rise | 0.08   |

**Output Transition [ns]**

| Input Transition [ns] | 0.01 | 4.00   |
|-----------------------|------|--------|
| Load Capacitance [fF] | 5.00 | 100.00 |
| A to Q                | fall | 0.13   |
|                       | rise | 0.70   |
| B to Q                | fall | 0.37   |
|                       | rise | 0.91   |

**Leakage [pW]**

|                  |        |
|------------------|--------|
| Capacitance [fF] | 0.01   |
| A                | 2.7240 |
| B                | 3.0190 |

**Dynamic Power Consumption [nW/MHz]**

| Input Transition [ns] | 0.01 | 4.00   |
|-----------------------|------|--------|
| Load Capacitance [fF] | 5.00 | 100.00 |
| A to Q                | fall | 8.00   |
|                       | rise | 54.46  |
| B to Q                | fall | 7.37   |
|                       | rise | 64.63  |

**NAND2X1**

| Strength  | 2                      |
|-----------|------------------------|
| Cell Area | 43.680 $\mu\text{m}^2$ |
| Equation  | Q = "!(A & B)"         |
| Type      | Combinational          |
| Input     | A, B                   |
| Output    | Q                      |



| State Table |   |   |   |
|-------------|---|---|---|
|             | A | B | Q |
|             | L | - | H |
|             | H | H | L |
|             | - | L | H |

| Propagation Delay [ns] |       |        |       |
|------------------------|-------|--------|-------|
| Input Transition [ns]  | 0.01  | 4.00   |       |
| Load Capacitance [fF]  | 10.00 | 200.00 | 10.00 |
| A to Q                 | Fall  | 0.06   | 0.66  |
|                        | Rise  | 0.06   | 0.75  |
| B to Q                 | Fall  | 0.06   | 0.66  |
|                        | Rise  | 0.06   | 0.69  |

| Output Transition [ns] |       |        |       |
|------------------------|-------|--------|-------|
| Input Transition [ns]  | 0.01  | 4.00   |       |
| Load Capacitance [fF]  | 10.00 | 200.00 | 10.00 |
| A to Q                 | Fall  | 0.07   | 0.92  |
|                        | Rise  | 0.08   | 1.16  |
| B to Q                 | Fall  | 0.07   | 0.92  |
|                        | Rise  | 0.08   | 1.07  |

| Leakage [pW] |      |
|--------------|------|
| A            | 0.31 |
| B            | 0.31 |

| Dynamic Power Consumption [nW/MHz] |       |        |       |
|------------------------------------|-------|--------|-------|
| Input Transition [ns]              | 0.01  | 4.00   |       |
| Load Capacitance [fF]              | 10.00 | 200.00 | 10.00 |
| A to Q                             | Fall  | 5.99   | 6.95  |
|                                    | Rise  | 64.40  | 64.67 |
| B to Q                             | Fall  | 4.28   | 5.14  |
|                                    | Rise  | 77.18  | 78.81 |

| Dynamic Power Consumption [nW/MHz] |       |         |         |
|------------------------------------|-------|---------|---------|
| Input Transition [ns]              | 0.01  | 4.00    |         |
| Load Capacitance [fF]              | 10.00 | 200.00  | 10.00   |
| A to Q                             | Fall  | 974.79  | 808.95  |
|                                    | Rise  | 1272.31 | 1068.90 |
| B to Q                             | Fall  | 1162.42 | 945.27  |
|                                    | Rise  | 1504.40 | 1252.61 |

NAND2X2


| Strength  | 6                                     |
|-----------|---------------------------------------|
| Cell Area | 72.800 $\mu\text{m}^2$                |
| Equation  | $Q = \overline{A \cdot B}$            |
| Type      | Combinational                         |
| Input     | A, B                                  |
| Output    | Q                                     |



| State Table |   |   |
|-------------|---|---|
| A           | B | Q |
| L           | - | H |
| H           | H | L |
| -           | L | H |

| Propagation Delay [ns] |       |        |        |
|------------------------|-------|--------|--------|
| Input Transition [ns]  | 0.01  | 4.00   |        |
| Load Capacitance [fF]  | 30.00 | 600.00 | 600.00 |
| A to Q                 | fall  | 0.05   | 0.65   |
|                        | rise  | 0.05   | 0.68   |
| B to Q                 | fall  | 0.05   | 0.65   |
|                        | rise  | 0.06   | 0.69   |

| Output Transition [ns] |       |        |        |
|------------------------|-------|--------|--------|
| Input Transition [ns]  | 0.01  | 4.00   |        |
| Load Capacitance [fF]  | 30.00 | 600.00 | 600.00 |
| A to Q                 | fall  | 0.06   | 0.92   |
|                        | rise  | 0.07   | 1.06   |
| B to Q                 | fall  | 0.06   | 0.92   |
|                        | rise  | 0.08   | 1.07   |

| Capacitance [fF] |         |  |  |
|------------------|---------|--|--|
| A                | 10.1390 |  |  |
| B                | 11.0760 |  |  |

| Leakage [pW] |      |  |  |
|--------------|------|--|--|
| A            | 0.35 |  |  |

| Dynamic Power Consumption [nW/MHz] |       |        |        |
|------------------------------------|-------|--------|--------|
| Input Transition [ns]              | 0.01  | 4.00   |        |
| Load Capacitance [fF]              | 30.00 | 600.00 | 600.00 |
| A to Q                             | fall  | 5.73   | 8.27   |
|                                    | rise  | 155.47 | 156.48 |
| B to Q                             | fall  | 5.87   | 8.76   |
|                                    | rise  | 196.60 | 199.91 |

**NAND2X6**


| Strength 1 |                                                   |
|------------|---------------------------------------------------|
| Cell Area  | 43.680 $\mu\text{m}^2$                            |
| Equation   | $Q = \overline{(A \mid B)}$                       |
| Type       | Combinational                                     |
| Input      | A, B                                              |
| Output     | Q                                                 |



| State Table |   |   |
|-------------|---|---|
| A           | B | Q |
| L           | L | H |
| H           | - | L |
| -           | H | L |

| Propagation Delay [ns] |      |        |       |        |
|------------------------|------|--------|-------|--------|
| Input Transition [ns]  | 0.01 | 0.05   | 0.10  | 4.00   |
| Load Capacitance [fF]  | 5.00 | 100.00 | 5.00  | 100.00 |
|                        | fall | 0.04   | -0.64 | 0.48   |
| A to Q                 | rise | 0.11   | 1.17  | 1.28   |
|                        | fall | 0.05   | 0.42  | -0.45  |
| B to Q                 | rise | 0.11   | 1.18  | 1.00   |
|                        | fall | 0.05   | 0.45  | 0.56   |

| Output Transition [ns] |      |        |      |        |
|------------------------|------|--------|------|--------|
| Input Transition [ns]  | 0.01 | 0.05   | 0.10 | 4.00   |
| Load Capacitance [fF]  | 5.00 | 100.00 | 5.00 | 100.00 |
|                        | fall | 0.04   | 0.55 | 0.73   |
| A to Q                 | rise | 0.15   | 1.81 | 0.62   |
|                        | fall | 0.05   | 0.55 | 0.94   |
| B to Q                 | rise | 0.15   | 1.81 | 0.80   |
|                        | fall | 0.05   | 0.55 | 1.55   |

| Capacitance [fF]      |      |       |       |        |
|-----------------------|------|-------|-------|--------|
| Input Transition [ns] | 0.01 | 0.05  | 4.00  |        |
| A to Q                | fall | 4.63  | 5.12  | 326.27 |
|                       | rise | 45.80 | 47.19 | 661.91 |
| B to Q                | fall | 7.48  | 7.98  | 453.09 |
|                       | rise | 55.57 | 56.53 | 796.35 |

| Leakage [pW]          |      |        |      |  |
|-----------------------|------|--------|------|--|
| Input Transition [ns] | 0.01 | 0.05   | 4.00 |  |
| A to Q                | fall | 2.6610 | 0.26 |  |
|                       | rise | 2.9400 |      |  |
| B to Q                | fall |        |      |  |
|                       | rise |        |      |  |

| Dynamic Power Consumption [nW/MHz] |      |        |       |        |
|------------------------------------|------|--------|-------|--------|
| Input Transition [ns]              | 0.01 | 0.05   | 0.10  | 4.00   |
| Load Capacitance [fF]              | 5.00 | 100.00 | 5.00  | 100.00 |
|                                    | fall | 463    | 512   | 363.32 |
| A to Q                             | rise | 45.80  | 47.19 | 550.88 |
|                                    | fall | 7.48   | 7.98  | 450.46 |
| B to Q                             | rise | 55.57  | 56.53 | 658.80 |
|                                    | fall |        |       |        |

NOR2X1


# **Dynamic Power Consumption is Data Dependent**

---

- ❑ Switching activity,  $P_{0 \rightarrow 1}$ , has two components
  - A static component – function of the logic topology
  - A dynamic component – function of the timing behavior (glitching)

**Static transition probability**

$$P_{0 \rightarrow 1} = P_{\text{out}=0} \times P_{\text{out}=1}$$
$$= P_0 \times (1 - P_0)$$

2-input NOR Gate

| A | B | Out |
|---|---|-----|
| 0 | 0 | 1   |
| 0 | 1 | 0   |
| 1 | 0 | 0   |
| 1 | 1 | 0   |

With input **signal probabilities**

$$P_{A=1} = 1/2$$
$$P_{B=1} = 1/2$$

NOR static transition probability

$$= 3/4 \times 1/4 = 3/16$$
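The 3/16 result can be checked by enumerating the truth table. The following is a small sketch that assumes statistically independent inputs; the helper names are illustrative.

```python
from itertools import product

def static_transition_probability(gate, input_probs):
    """Return P_0->1 = P(out=0) * P(out=1) for a gate with independent inputs.

    gate: function taking the input bits and returning 0 or 1
    input_probs: probability that each input equals 1
    """
    p_out_1 = 0.0
    for bits in product((0, 1), repeat=len(input_probs)):
        weight = 1.0
        for bit, p1 in zip(bits, input_probs):
            weight *= p1 if bit else (1.0 - p1)
        if gate(*bits):
            p_out_1 += weight
    return (1.0 - p_out_1) * p_out_1

def nor2(a, b):
    return int(not (a or b))

p_nor = static_transition_probability(nor2, [0.5, 0.5])
print(p_nor)  # 0.1875, i.e. 3/16
```

Passing different input probabilities shows how strongly the switching activity depends on the input statistics.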

# Transition Probabilities for Some Basic Gates

|      | $P_{0 \rightarrow 1} = P_{\text{out}=0} \times P_{\text{out}=1}$ |
|------|------------------------------------------------------------------|
| NOR  | $(1 - (1 - P_A)(1 - P_B)) \times (1 - P_A)(1 - P_B)$             |
| OR   | $(1 - P_A)(1 - P_B) \times (1 - (1 - P_A)(1 - P_B))$             |
| NAND | $P_A P_B \times (1 - P_A P_B)$                                   |
| AND  | $(1 - P_A P_B) \times P_A P_B$                                   |
| XOR  | $(1 - (P_A + P_B - 2P_A P_B)) \times (P_A + P_B - 2P_A P_B)$     |



$$\text{For } X: P_{0 \rightarrow 1} = P_0 \times P_1 = (1 - P_A) P_A$$

$$= 0.5 \times 0.5 = 0.25$$

$$\text{For } Z: P_{0 \rightarrow 1} = P_0 \times P_1 = (1 - P_X P_B) P_X P_B$$

$$= (1 - (0.5 \times 0.5)) \times (0.5 \times 0.5) = 3/16$$
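The same numbers follow from propagating signal probabilities gate by gate. This sketch assumes, since the figure is not reproduced here, that X is the inverse of A and that Z is the AND of X and B, with independent inputs.

```python
def p_not(p_a):
    """P(out = 1) for an inverter."""
    return 1.0 - p_a

def p_and(p_a, p_b):
    """P(out = 1) for a 2-input AND with independent inputs."""
    return p_a * p_b

def transition_probability(p_one):
    """P_0->1 = P(out=0) * P(out=1)."""
    return (1.0 - p_one) * p_one

p_a, p_b = 0.5, 0.5
p_x = p_not(p_a)       # assumed topology: X = NOT A
p_z = p_and(p_x, p_b)  # assumed topology: Z = X AND B

print(transition_probability(p_x))  # 0.25
print(transition_probability(p_z))  # 0.1875
```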

# Inter-signal Correlations

- ❑ Determining switching activity is complicated by the fact that signals exhibit correlation in space and time
  - reconvergent fan-out

$$(1-0.5)(1-0.5) \times (1-(1-0.5)(1-0.5)) = 3/16$$



$$(1 - 3/16 \times 0.5) \times (3/16 \times 0.5) = 0.085$$

Reconvergent

$$P(Z=1) = P(B=1) \cdot P(A=1 \mid B=1)$$

- ❑ Have to use **conditional probabilities**
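To see why independence cannot be assumed at a reconvergent node, the sketch below enumerates the primary inputs directly. The topology (X = A OR B, Z = X AND B) is one plausible reading of the missing figure and should be treated as an assumption.

```python
from itertools import product

def exact_output_probability(p_a=0.5, p_b=0.5):
    """P(Z=1) by enumerating A and B, so the shared input B is handled exactly."""
    p_z = 0.0
    for a, b in product((0, 1), repeat=2):
        weight = (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
        x = a | b  # assumed: X = A OR B
        z = x & b  # assumed: Z = X AND B (B reconverges here)
        p_z += weight * z
    return p_z

p_x = 1 - (1 - 0.5) * (1 - 0.5)  # P(X=1) for the OR gate
naive = p_x * 0.5                # wrongly treats X and B as independent
exact = exact_output_probability()
print(naive, exact)  # 0.375 0.5
```

Under this assumed topology Z simply equals B, which the enumeration captures and the product of marginal probabilities does not.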

# Logic Restructuring

- Logic restructuring: changing the topology of a logic network to reduce transitions

$$\text{AND: } P_{0 \rightarrow 1} = P_0 \times P_1 = (1 - P_A P_B) \times P_A P_B$$



Chain implementation has a lower overall switching activity than the tree implementation for random inputs

Ignores glitching effects
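A short sketch of the comparison, assuming a 4-input AND built from 2-input AND gates (the figure is not reproduced here) and independent inputs that are 1 with probability 0.5. Glitching is ignored, as stated above.

```python
def and_activity(p_inputs_one):
    """Return (P(out=1), P_0->1) for an AND gate with independent inputs."""
    p_one = 1.0
    for p in p_inputs_one:
        p_one *= p
    return p_one, (1.0 - p_one) * p_one

def chain_activity(p):
    """Total activity of ((A.B).C).D, summed over all three gate outputs."""
    p_o1, a1 = and_activity([p, p])
    p_o2, a2 = and_activity([p_o1, p])
    _, a3 = and_activity([p_o2, p])
    return a1 + a2 + a3

def tree_activity(p):
    """Total activity of (A.B).(C.D), summed over all three gate outputs."""
    p_o1, a1 = and_activity([p, p])
    p_o2, a2 = and_activity([p, p])
    _, a3 = and_activity([p_o1, p_o2])
    return a1 + a2 + a3

print(chain_activity(0.5), tree_activity(0.5))  # 0.35546875 0.43359375
```

The chain wins here because its intermediate nodes are 1 less often; with other input statistics, or once glitching is included, the ranking can change.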

# Input Ordering

$$(1-0.5 \times 0.2) \times (0.5 \times 0.2) = 0.09$$



$$(1-0.2 \times 0.1) \times (0.2 \times 0.1) = 0.0196$$



Beneficial to postpone the introduction of signals with a  
**high transition rate** (signals with signal probability  
close to 0.5)
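The two figures above can be reproduced with a few lines. The sketch assumes a 3-input AND built from two 2-input gates, with input signal probabilities 0.5, 0.2 and 0.1; the names `p_a`, `p_b`, `p_c` are illustrative labels for those inputs.

```python
def transition_probability(p_one):
    """P_0->1 = P(out=0) * P(out=1)."""
    return (1.0 - p_one) * p_one

def intermediate_activity(p_first, p_second):
    """Activity of the internal node X = first AND second (independent inputs)."""
    return transition_probability(p_first * p_second)

p_a, p_b, p_c = 0.5, 0.2, 0.1

# Ordering 1: the 0.5 signal enters at the first gate
print(round(intermediate_activity(p_a, p_b), 4))  # 0.09
# Ordering 2: the 0.5 signal is postponed to the second gate
print(round(intermediate_activity(p_b, p_c), 4))  # 0.0196
```

The activity of the final output is the same in both orderings, so the saving comes entirely from the internal node.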

# Glitching in Static CMOS Networks

- Gates have a nonzero propagation delay resulting in spurious transitions or **glitches** (dynamic hazards)
  - glitch: node exhibits multiple transitions in a single cycle before settling to the correct logic value



# Glitching in an RCA



## Balanced Delay Paths to Reduce Glitching

- ❑ Glitching is due to a mismatch in the path lengths in the logic network; if all input signals of a gate change simultaneously, no glitching occurs



So equalize the lengths of timing paths through logic

# Power and Energy Design Space

|         | Constant Throughput/Latency                                        | Constant Throughput/Latency                            | Variable Throughput/Latency                  |
|---------|--------------------------------------------------------------------|--------------------------------------------------------|----------------------------------------------|
| Energy  | Design Time                                                        | Non-active Modules                                     | Run Time                                     |
| Active  | Logic Design<br>Reduced $V_{dd}$<br>Sizing<br>**Multi- $V_{dd}$**  | Clock Gating                                           | DFS, DVFS<br>(Dynamic Freq, Voltage Scaling) |
| Leakage | + Multi- $V_T$                                                     | Sleep Transistors<br>Multi- $V_{dd}$<br>Variable $V_T$ | + Variable $V_T$                             |

# Dynamic Power as a Function of $V_{DD}$

- Decreasing the  $V_{DD}$  **decreases** dynamic energy consumption (quadratically)
- But, **increases** gate delay (decreases performance)



- Determine the critical path(s) at **design time** and use high  $V_{DD}$  for the transistors on those paths for speed. Use a lower  $V_{DD}$  on the other gates, especially those that drive large capacitances (as this yields the largest energy benefits).
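As a rough worked example of the quadratic dependence, assume two hypothetical supplies, $V_{DDH} = 1.8\text{ V}$ and $V_{DDL} = 1.2\text{ V}$ (values chosen for illustration, not taken from the slides). For a gate with unchanged $C_L$ and switching activity that is moved to the lower supply:

$$\frac{E_{V_{DDL}}}{E_{V_{DDH}}} = \frac{C_L V_{DDL}^2 P_{0 \rightarrow 1}}{C_L V_{DDH}^2 P_{0 \rightarrow 1}} = \left(\frac{1.2}{1.8}\right)^2 \approx 0.44$$

Under these assumptions the gate uses about 56% less dynamic energy per cycle, at the price of a longer delay, which is why the lower supply is reserved for paths with slack.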

# Multiple $V_{DD}$ Considerations

- ❑ How many  $V_{DD}$ ? – Two is becoming common
  - Many chips already have two supplies (one for core and one for I/O)
- ❑ When combining multiple supplies, **level converters** are required whenever a module at the lower supply drives a gate at the higher supply (step-up)
  - If a gate supplied with  $V_{DDL}$  drives a gate at  $V_{DDH}$ , the PMOS never turns off
    - The cross-coupled PMOS transistors do the level conversion
    - The NMOS transistors operate on a reduced supply
  - Level converters are not needed for a step-down change in voltage
  - Overhead of level converters can be mitigated by doing conversions at register boundaries and embedding the level conversion inside the flipflop (see next slide)



# Dual-Supply Inside a Logic Block

- ❑ Minimum energy consumption is achieved if **all** logic paths are critical (have the same delay)
- ❑ Clustered voltage-scaling
  - Each path starts with  $V_{DDH}$  and switches to  $V_{DDL}$  (gray logic gates) when delay **slack** is available
  - Level conversion is done in the flipflops at the end of the paths



$$T \geq t_{C-q} + t_{p\text{logic}} + t_{SU}$$

# Power and Energy Design Space

|         | Constant Throughput/Latency                                    | Constant Throughput/Latency                            | Variable Throughput/Latency                  |
|---------|----------------------------------------------------------------|--------------------------------------------------------|----------------------------------------------|
| Energy  | Design Time                                                    | Non-active Modules                                     | Run Time                                     |
| Active  | Logic Design<br>Reduced $V_{dd}$<br>Sizing<br>Multi- $V_{dd}$  | Clock Gating                                           | DFS, DVFS<br>(Dynamic Freq, Voltage Scaling) |
| Leakage | + Multi- $V_T$                                                 | Sleep Transistors<br>Multi- $V_{dd}$<br>Variable $V_T$ | + Variable $V_T$                             |

# Leakage as a Function of Design Time $V_T$

- ❑ Reducing the  $V_T$  **increases** the sub-threshold leakage current (exponentially)
  - 90mV reduction in  $V_T$  increases leakage by an order of magnitude
- ❑ But, reducing  $V_T$  **decreases** gate delay (increases performance)



- ❑ Determine the critical path(s) at **design time** and use low  $V_T$  devices on the transistors on those paths for speed. Use a high  $V_T$  on the other logic for leakage control.
  - A careful assignment of  $V_T$ 's can reduce the leakage by as much as 80%

## Dual-Thresholds Inside a Logic Block

- ❑ Minimum energy consumption is achieved if **all** logic paths are critical (have the same delay)
- ❑ Use lower threshold on timing-critical paths
  - Assignment can be done on a per gate or transistor basis; no clustering of the logic is needed
  - No level converters are needed



## Example for evaluating minimum Clock Period



Figure: waveform of a single clock cycle, with the period $T$ marked on the time axis.

$$T \geq t_{c-q} + t_{p\text{logic}} + t_{su}$$

# Example for evaluating minimum Clock Period



|   | Delay contributions [ns]  | $T$ (clock period) [ns] | $1/T$      |
|---|---------------------------|-------------------------|------------|
| A | 0.9 + 0.35 + 0.13 + 0.17  | 1.55                    |            |
| B | 0.9 + 0.16 + 0.16         | 1.22                    |            |
| C | 0.75 + 0.17 + 0.16        | 1.08                    |            |
| D | 0.74 + 0.14 + 0.16        | 1.04                    |            |
| E | 1.2 + 0.11 + 0.17         | 1.48                    |            |
| F | 1.2 + 0.33 + 0.13 + 0.17  | 1.83                    | 546.45 MHz |
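The minimum clock period is set by the slowest path. The sketch below recomputes the table under the assumption that each row lists the delay contributions (in ns) of one register-to-register path; the per-column meaning is not recoverable from the slide, so only the sums are used.

```python
# Per-path delay contributions in ns, copied from the table above.
paths = {
    "A": [0.9, 0.35, 0.13, 0.17],
    "B": [0.9, 0.16, 0.16],
    "C": [0.75, 0.17, 0.16],
    "D": [0.74, 0.14, 0.16],
    "E": [1.2, 0.11, 0.17],
    "F": [1.2, 0.33, 0.13, 0.17],
}

totals = {name: sum(delays) for name, delays in paths.items()}
critical = max(totals, key=totals.get)  # path with the largest total delay
t_min_ns = totals[critical]             # minimum clock period T
f_max_mhz = 1e3 / t_min_ns              # 1/ns is GHz, so 1000/ns gives MHz

print(critical, round(t_min_ns, 2), round(f_max_mhz, 2))  # F 1.83 546.45
```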

---

## **Low Power Techniques in Microarchitectures and Memories**

[Adapted from Irwin ©2002]

# Review: Energy & Power Equations

$$E = C_L V_{DD}^2 P_{0 \rightarrow 1} + t_{sc} V_{DD} I_{\text{peak}} P_{0 \rightarrow 1} + V_{DD} I_{\text{leakage}}$$

$$f_{0 \rightarrow 1} = P_{0 \rightarrow 1} * f_{\text{clock}}$$

$$P = C_L V_{DD}^2 f_{0 \rightarrow 1} + t_{sc} V_{DD} I_{\text{peak}} f_{0 \rightarrow 1} + V_{DD} I_{\text{leakage}}$$

**Dynamic power**  
(% decreasing  
relatively with  
deep submicron)

**Short-circuit  
power**

**Leakage power**  
(% increasing  
with deep  
submicron)

*Leakage power grows from <5% of power budget at .25 micron to 20-25% at 130nm to 40-50% at 90nm and continuing to increase at 65nm and beyond.*

# Power and Energy Design Space

|         | Constant Throughput/Latency                                    | Constant Throughput/Latency                            | Variable Throughput/Latency                  |
|---------|----------------------------------------------------------------|--------------------------------------------------------|----------------------------------------------|
| Energy  | Design Time                                                    | Non-active Modules                                     | Run Time                                     |
| Active  | Logic Design<br>Reduced $V_{dd}$<br>Sizing<br>Multi- $V_{dd}$  | Clock Gating                                           | DFS, DVFS<br>(Dynamic Freq, Voltage Scaling) |
| Leakage | + Multi- $V_T$                                                 | Sleep Transistors<br>Multi- $V_{dd}$<br>Variable $V_T$ | + Variable $V_T$                             |

# Bus Multiplexing

- ❑ Buses are a significant source of power dissipation due to high switching activities and large capacitive loading
  - 15% of total power in Alpha 21064
  - 30% of total power in Intel 80386
- ❑ Share long data buses with time multiplexing ($S_1$ uses even cycles, $S_2$ odd)



- ❑ But what if data samples are correlated (e.g., sign bits)?
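The concern about correlated samples can be made concrete with a small sketch that counts bit transitions on the bus wires. It assumes 16-bit two's-complement words and two hypothetical streams, one of small positive values and one of small negative values, so that consecutive samples within each stream share their sign bits. The helper names and the data are illustrative only.

```python
WIDTH = 16  # assumed bus width in bits


def to_word(value, width=WIDTH):
    """Encode a signed integer as a two's-complement bus word."""
    return value & ((1 << width) - 1)


def transitions(words):
    """Count wire toggles between consecutive words on one bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


# Hypothetical correlated streams: the sign-extension bits never change
# inside a stream, but they differ between the two streams.
s1 = [to_word(v) for v in (3, 5, 4, 6)]
s2 = [to_word(v) for v in (-3, -5, -4, -6)]

# Two dedicated buses: each stream keeps its own wires.
dedicated = transitions(s1) + transitions(s2)

# One shared bus: S1 on even cycles, S2 on odd cycles.
shared = [w for pair in zip(s1, s2) for w in pair]
multiplexed = transitions(shared)

print("dedicated buses  :", dedicated, "toggles")
print("multiplexed bus  :", multiplexed, "toggles")
```

With these inputs the multiplexed bus toggles far more often than the two dedicated buses combined, because the sign bits flip on almost every cycle. Sharing the wires saves area but destroys the temporal correlation that kept the switching activity low.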

# Correlated Data Streams



# Glitch Reduction by Pipelining

- ❑ Glitches depend on the **logic depth** of the circuit - gates deeper in the logic network are more prone to glitching
  - arrival times of the gate inputs are more spread due to delay imbalances
  - usually affected more by primary input switching
- ❑ Reduce logic depth by adding pipeline registers
  - additional energy used by the clock and pipeline registers



# Power and Energy Design Space

|         | Constant Throughput/Latency                                  | Constant Throughput/Latency                           | Variable Throughput/Latency                  |
|---------|--------------------------------------------------------------|-------------------------------------------------------|----------------------------------------------|
| Energy  | Design Time                                                  | Non-active Modules                                    | Run Time                                     |
| Active  | Logic Design<br>Reduced $V_{dd}$<br>Sizing<br>Multi-$V_{dd}$ | Clock Gating                                          | DFS, DVFS<br>(Dynamic Freq, Voltage Scaling) |
| Leakage | + Multi-$V_T$                                                | Sleep Transistors<br>Multi-$V_{dd}$<br>Variable $V_T$ | + Variable $V_T$                             |

# Clock Gating

- ❑ Most popular method for power reduction of clock signals and functional units
- ❑ Gate off clock to idle functional units
  - e.g., floating point units
  - need logic to generate **disable** signal
    - increases complexity of control logic
    - consumes power
    - timing critical to avoid clock glitches at OR gate output
  - additional gate delay on clock signal
    - gating OR gate can replace a buffer in the clock distribution tree
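A minimal behavioral sketch of OR-based clock gating follows. It assumes positive-edge-triggered registers, samples the clock twice per cycle (low phase, then high phase), and changes the hypothetical `disable` signal only while the clock is high, which is the timing constraint that keeps glitches off the OR gate output. It is a counting model, not a circuit simulation.

```python
def gated_clock(clk, disable):
    """OR gate: the gated clock is held high while disable is asserted."""
    return [c | d for c, d in zip(clk, disable)]


def rising_edges(signal):
    """Count 0 -> 1 transitions, i.e. the edges that clock the registers."""
    return sum(1 for a, b in zip(signal, signal[1:]) if a == 0 and b == 1)


# Eight clock cycles, two samples per cycle (low phase, high phase).
clk = [0, 1] * 8

# Hypothetical idle window: disable rises at sample 5 and falls at
# sample 13, both of which are high phases of the clock.
disable = [0] * 5 + [1] * 8 + [0] * 3

gclk = gated_clock(clk, disable)

print("edges seen without gating:", rising_edges(clk))
print("edges seen with gating   :", rising_edges(gclk))
```

In this example the gated unit sees half of the clock edges, so its registers and the clock wiring behind the gate switch half as often during the simulated interval.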



# Clock Gating in a Pipelined Datapath

- ❑ For idle units (e.g., floating point units in Exec stage, WB stage for instructions with no write back operation)



# Power and Energy Design Space

|         | Constant Throughput/Latency                                      | Constant Throughput/Latency                           | Variable Throughput/Latency                      |
|---------|------------------------------------------------------------------|-------------------------------------------------------|--------------------------------------------------|
| Energy  | Design Time                                                      | Non-active Modules                                    | Run Time                                         |
| Active  | Logic Design<br>Reduced $V_{dd}$<br>Sizing<br>**Multi-$V_{dd}$** | Clock Gating                                          | **DFS, DVFS**<br>(Dynamic Freq, Voltage Scaling) |
| Leakage | + Multi-$V_T$                                                    | Sleep Transistors<br>Multi-$V_{dd}$<br>Variable $V_T$ | + Variable $V_T$                                 |

## Review: Dynamic Power as a Function of $V_{DD}$

- ☐ Decreasing the  $V_{DD}$  **decreases** dynamic energy consumption (quadratically)
- ☐ But, **increases** gate delay (decreases performance)



- ☐ Determine the critical path(s) at **design time** and use high  $V_{DD}$  for the transistors on those paths for speed. Use a lower  $V_{DD}$  on the other gates, especially those that drive large capacitances (as this yields the largest energy benefits).
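The energy/delay trade-off can be sketched numerically. Dynamic energy scales with the square of the supply, as stated above. For the delay, the sketch assumes the alpha-power-law model, delay ∝ V_DD / (V_DD − V_T)^α, which is not given on the slide. The threshold voltage `vt`, the exponent `alpha` and the two supply values are hypothetical.

```python
def relative_energy(vdd, vdd_ref):
    """Dynamic energy relative to the reference supply (quadratic in V_DD)."""
    return (vdd / vdd_ref) ** 2


def relative_delay(vdd, vdd_ref, vt=0.4, alpha=1.3):
    """Gate delay relative to the reference supply.

    Assumes the alpha-power-law model: delay ~ V_DD / (V_DD - V_T)^alpha.
    """
    if vdd <= vt or vdd_ref <= vt:
        raise ValueError("supply must exceed the threshold voltage")

    def delay(v):
        return v / (v - vt) ** alpha

    return delay(vdd) / delay(vdd_ref)


# Hypothetical dual-supply design: 2.5 V on critical paths, 1.25 V elsewhere.
V_HIGH, V_LOW = 2.5, 1.25

energy_ratio = relative_energy(V_LOW, V_HIGH)
delay_ratio = relative_delay(V_LOW, V_HIGH)

print(f"gates on the low supply use {energy_ratio:.2f}x the dynamic energy")
print(f"and are about {delay_ratio:.2f}x slower under the assumed delay model")
```

Gates moved to the lower supply save energy quadratically but slow down, which is why the lower supply is reserved for gates that are off the critical path.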

# Dynamic Frequency and Voltage Scaling

## ❑ Intel's SpeedStep

- Hardware that steps down the clock frequency (dynamic frequency scaling – DFS) when the user unplugs from AC power
  - PLL from 650MHz → 500MHz
- CPU stalls during SpeedStep adjustment

## ❑ Transmeta LongRun

- Hardware that applies **both** DFS **and** DVFS (dynamic supply voltage scaling)
  - 32 levels of  $V_{DD}$  from 1.1V to 1.6V
  - PLL from 200MHz → 700MHz in increments of 33MHz
- Triggered when CPU load change is detected by software
  - heavier load → ramp up  $V_{DD}$ , when stable speed up clock
  - lighter load → slow down clock, when PLL locks onto new rate, ramp down  $V_{DD}$
- CPU stalls only during PLL relock (< 20 microsec)
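The ordering rule described for LongRun (raise the voltage before speeding up, slow down before lowering the voltage) can be sketched as a small planning function. The function name, the action labels and the pairing of frequencies with voltages are hypothetical. Only the ordering itself comes from the bullets above.

```python
def longrun_steps(f_old_mhz, f_new_mhz, v_new):
    """Return the ordered actions for one operating-point change."""
    if f_new_mhz > f_old_mhz:
        # Heavier load: ramp the supply up first so the logic meets timing
        # at the higher frequency, then speed up the clock.
        return [("ramp_vdd", v_new),
                ("wait_vdd_stable", None),
                ("set_clock_mhz", f_new_mhz)]
    if f_new_mhz < f_old_mhz:
        # Lighter load: slow the clock first, wait for the PLL to lock,
        # and only then ramp the supply down.
        return [("set_clock_mhz", f_new_mhz),
                ("wait_pll_lock", None),
                ("ramp_vdd", v_new)]
    return []  # same frequency: nothing to do


# Assumed end points: 200 MHz at 1.1 V and 700 MHz at 1.6 V.
speed_up = longrun_steps(200, 700, 1.6)
slow_down = longrun_steps(700, 200, 1.1)

for name, plan in (("speed up", speed_up), ("slow down", slow_down)):
    print(name, "->", [action for action, _ in plan])
```

In both directions the circuit never runs at a frequency that its current supply voltage cannot sustain.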

**Power Options** (screenshot of the operating system's power options dialog with the VAIO Power Management tab; the "Balanced" plan is active)

Advanced settings listed in the dialog: hard disk, Internet Explorer, desktop background, wireless adapter, sleep, USB, Intel(R) Graphics Settings, power buttons and lid, processor power management, display.

| Setting          | On battery | Plugged in |
|------------------|------------|------------|
| Dim display      | 2 minutes  | 5 minutes  |
| Turn off display | 5 minutes  | 10 minutes |
| Sleep            | 15 minutes | 30 minutes |

# Dynamic Thermal Management (DTM)



- **Trigger mechanism:** When do we enable DTM techniques?
- **Initiation mechanism:** How do we enable the technique?
- **Response mechanism:** What technique do we enable?

# DTM Trigger Mechanisms



- Mechanism: How to deduce temperature?
  - Direct approach: on-chip temperature sensors
    - Based on differential voltage change across 2 diodes of different sizes
    - May require >1 sensor
    - Hysteresis and delay are problems
- Policy: When to begin responding?
  - Trigger level set too high means higher packaging costs
  - Trigger level set too low means frequent triggering and loss in performance
  - Choose trigger level to exploit difference between average and worst case power
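The trigger-level trade-off can be illustrated with a toy sketch that counts how often a DTM response would be started for a given threshold. The temperature trace and both trigger levels are hypothetical, and sensor hysteresis and delay are ignored.

```python
def count_triggers(temps, trigger_level):
    """Count upward crossings of the trigger level in a temperature trace."""
    triggers = 0
    above = False
    for t in temps:
        now_above = t >= trigger_level
        if now_above and not above:
            triggers += 1  # a new DTM response would start here
        above = now_above
    return triggers


# Hypothetical sensor readings in degrees Celsius.
trace = [70, 78, 83, 79, 84, 90, 82, 76, 81, 74]

low_level = count_triggers(trace, 80)
high_level = count_triggers(trace, 88)

print("trigger level 80 C:", low_level, "activations")
print("trigger level 88 C:", high_level, "activations")
```

The lower threshold fires several times on the same trace, which costs performance. The higher threshold fires rarely but requires a package that tolerates the higher temperature.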

# DTM Initiation and Response Mechanisms



- ❑ Operating system or microarchitectural control?
  - Hardware support can reduce performance penalty by 20-30%
- ❑ Initiation of policy incurs some delay
  - When using DVFS and/or DFS, much of the performance penalty can be attributed to enabling/disabling overhead
  - Increasing policy delay reduces overhead; smarter initiation techniques would help as well
- ❑ Thermal window (100Kcycles+)
  - Larger thermal windows “smooth” short thermal spikes

# DTM Activation and Deactivation Cycle



- ❑ Initiation Delay – OS interrupt/handler
- ❑ Response Delay – Invocation time (e.g., adjust clock)
- ❑ Policy Delay – Number of cycles engaged
- ❑ Shutoff Delay – Disabling time (e.g., re-adjust clock)
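One way to see why a longer policy delay reduces the relative overhead is to treat the four delays above as a simple cycle budget. The sketch below is only an accounting model with hypothetical cycle counts. It does not model temperature or the performance loss while the response is engaged.

```python
def overhead_fraction(initiation, response, policy, shutoff):
    """Fraction of one DTM activation spent enabling/disabling the response."""
    overhead = initiation + response + shutoff
    return overhead / (overhead + policy)


# Hypothetical delays in clock cycles.
FIXED = dict(initiation=250, response=2000, shutoff=2000)

short_policy = overhead_fraction(policy=5_000, **FIXED)
long_policy = overhead_fraction(policy=100_000, **FIXED)

print(f"policy delay   5k cycles: {short_policy:.1%} of the activation is overhead")
print(f"policy delay 100k cycles: {long_policy:.1%} of the activation is overhead")
```

The fixed enabling and disabling cost is amortized over more cycles as the policy delay grows, at the price of staying in the reduced-performance mode for longer.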

## **DTM Savings Benefits**



# Power and Energy Design Space

|         | Constant Throughput/Latency                                  | Constant Throughput/Latency                           | Variable Throughput/Latency                  |
|---------|--------------------------------------------------------------|-------------------------------------------------------|----------------------------------------------|
| Energy  | Design Time                                                  | Non-active Modules                                    | Run Time                                     |
| Active  | Logic Design<br>Reduced $V_{dd}$<br>Sizing<br>Multi-$V_{dd}$ | Clock Gating                                          | DFS, DVFS<br>(Dynamic Freq, Voltage Scaling) |
| Leakage | + Multi-$V_T$                                                | Sleep Transistors<br>Multi-$V_{dd}$<br>Variable $V_T$ | + Variable $V_T$                             |