

# EECS251B: Advanced Digital Circuits and Systems

## Lecture 20 – Low Power Design

Borivoje Nikolić, Vladimir Stojanović, Sophia Shao



IEEE MICRO, Nov/Dec 2021  
Microprocessor at 50: Looking Back and Looking Forward  
Special issue on 50 years of a microprocessor



EECS251B L20 LOW-POWER DESIGN

Advertisement in the Electronics News Weekly in November 1971  
announcing the Intel 4004.


### Recap

- Power is a primary design constraint
  - In both cloud and edge systems
- Excess performance traded off for power savings



## Architectural Optimizations

### From System View: What is the Optimum?

- How do sensitivities relate to more traditional metrics:
  - Power per operation (MIPS/W, GOPS/W, TOPS/W)
  - Energy per operation (Joules per op)
  - Energy-delay product
- Can be reformatted as a goal of optimizing power  $\times$  delay<sup>n</sup>
  - $n = 0$  – minimize power per operation
  - $n = 1$  – minimize energy per operation
  - $n = 2$  – minimize energy-delay product
  - $n = 3$  – minimize energy-(delay)<sup>2</sup> product
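As a rough illustration of how the exponent $n$ changes which design wins, the sketch below ranks three design points by power × delay<sup>n</sup>. The design names and their power and delay values are hypothetical, chosen only to make the shift visible.

```python
# Hypothetical design points: name -> (power in W, delay in ns).
# These numbers are illustrative assumptions, not measured data.
DESIGNS = {
    "A": (1.00, 1.0),  # fast, power hungry
    "B": (0.40, 1.5),  # balanced
    "C": (0.15, 2.5),  # slow, frugal
}


def metric(power, delay, n):
    """Composite cost: power * delay**n."""
    return power * delay ** n


def best_design(n):
    """Name of the design that minimizes power * delay**n."""
    return min(DESIGNS, key=lambda name: metric(*DESIGNS[name], n))


for n in range(4):
    print(n, best_design(n))
```

With these made-up numbers, small $n$ favors the frugal design and large $n$ favors the fast one, which is why the choice of exponent is itself a design decision.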


### Optimal Processors

- Processors used to be optimized for performance
  - Optimal logic depth was found to be 8-11 FO4 delays in superscalar processors
  - 1.8-3 FO4 in sequentials, rest in combinatorial
    - Kunkel, Smith, ISCA'86
    - Hrishikesh, Jouppi, Farkas, Burger, Keckler, Shivakumar, ISCA'02
    - Hartstein, Puzak, ISCA'02
    - Sprangle, Carmean, ISCA'02
- But those designs have very high power dissipation
  - Need to optimize for both performance and power/energy

### Optimization Problem

- Set up optimization problem:
  - Maximize performance under energy constraints
  - Minimize energy under performance constraints
- Or minimize a composite function of  $E^n D^m$ 
  - What are the right n and m?
- $n = 1, m = 1$  is EDP – improves at lower  $V_{DD}$
- $n = 1, m = 2$  is invariant to  $V_{DD}$ 
  - $E \sim CV_{DD}^2$
  - $D \sim 1/V_{DD}$
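Both statements follow from the two first-order proportionalities listed above, which ignore the threshold voltage:

$$E D \propto C V_{DD}^2 \cdot \frac{1}{V_{DD}} = C V_{DD}, \qquad E D^2 \propto C V_{DD}^2 \cdot \frac{1}{V_{DD}^2} = C$$

Under this approximation EDP falls linearly with  $V_{DD}$ , while  $ED^2$  does not depend on  $V_{DD}$  at all.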


### Hardware Intensity

- Introduced by Zyuban and Strenski in 2002.
- Measures where the design sits on the energy-delay curve
- Parameter in cost function optimization

$$F_c = (E/E_0)(D/D_0)^\eta \quad 0 \leq \eta < +\infty,$$

$$\eta = -\frac{D \partial E}{E \partial D} \Big|_V$$

Slope of the optimal E-D curve at the chosen design point
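Since  $\eta = -\frac{D}{E}\frac{\partial E}{\partial D}$  is the negative slope of the E-D curve in log-log space, it can be estimated with a finite difference. The sketch below assumes a hypothetical curve `toy_energy` whose energy diverges as delay approaches a minimum `d_min`. The curve shape and all parameter values are illustrative assumptions, and the fixed-voltage condition in the definition is not modeled.

```python
import math


def toy_energy(delay, d_min=1.0, e_floor=1.0):
    """Hypothetical E-D curve: energy diverges as delay approaches d_min."""
    return e_floor * delay / (delay - d_min)


def hardware_intensity(energy_fn, delay, rel_step=1e-4):
    """Estimate eta = -(D/E) dE/dD as the negative slope in log-log space."""
    d_lo = delay * (1.0 - rel_step)
    d_hi = delay * (1.0 + rel_step)
    d_ln_e = math.log(energy_fn(d_hi)) - math.log(energy_fn(d_lo))
    d_ln_d = math.log(d_hi) - math.log(d_lo)
    return -d_ln_e / d_ln_d


for d in (1.1, 1.5, 2.0, 4.0):
    print(d, round(hardware_intensity(toy_energy, d), 3))
```

For this toy curve  $\eta$  grows without bound near the minimum-delay point and tends to zero for relaxed delays, matching the reading of  $\eta$  as how aggressively a design trades energy for speed.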



### Optimum Across Hierarchy Layers



Zyuban et al, TComp'04

Optimal logic depth in pipelined processors is ~18FO4  
Relatively flat in the 16-22FO4 range


## Architectural Tradeoffs

- H. Mair, ISSCC'20



## Architectural Tradeoffs: Tri-Gear

- HP: High performance (ARM Cortex A78, optimized for speed, 3.0GHz)
- BP: Balanced performance (ARM Cortex A78, optimized for power, 2.6GHz)
- HE: High efficiency (ARM A55, 2.0GHz)



## Announcements

- Quiz 2 today
- Homework 3 due next week

## Alpha-Power Based Delay Model



$$t_{pi} = \frac{K_d V_{DD}}{(V_{DD} - V_{Th})^\alpha} \left( 1 + \frac{C_{L,i}}{C_{in,i}} \right)$$

$$D = \sum t_{pi} = \sum \frac{K_d V_{DD}}{(V_{DD} - V_{Th})^\alpha} \left( 1 + \frac{W_{L,i}}{W_{in,i}} \right)$$
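A minimal sketch of the delay model above, assuming the fanout of each stage is the ratio of consecutive widths. The function names and the example parameter values are illustrative assumptions, and `k_d` is left in arbitrary units.

```python
def stage_delay(vdd, vth, alpha, k_d, fanout):
    """t_p = K_d * V_DD / (V_DD - V_Th)**alpha * (1 + fanout)."""
    if vdd <= vth:
        raise ValueError("model is only valid for V_DD > V_Th")
    return k_d * vdd / (vdd - vth) ** alpha * (1.0 + fanout)


def path_delay(vdd, vth, alpha, k_d, widths):
    """Sum stage delays along a chain; widths[-1] stands for the final load."""
    return sum(
        stage_delay(vdd, vth, alpha, k_d, widths[i + 1] / widths[i])
        for i in range(len(widths) - 1)
    )


# Assumed example: fanout-of-4 chain, alpha = 1.3, V_Th = 0.3 V.
chain = [1.0, 4.0, 16.0, 64.0]
for vdd in (1.0, 0.8, 0.6):
    print(vdd, round(path_delay(vdd, 0.3, 1.3, 1.0, chain), 2))
```

Delay rises faster than linearly as  $V_{DD}$  approaches  $V_{Th}$ , which is the behavior the supply-sensitivity analysis later relies on.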


## Circuit-Level Tradeoffs



## Sizing, Supply, Threshold Optimization

- Transistor sizing can yield large power savings with small delay penalties
  - Gate sizing
  - Beta-ratio adjustments
  - (Stack resizing)
- Supply voltage affects both active and leakage energy
- Threshold voltage affects primarily the leakage

## Energy Models

### Switching

$$E_{Sw} = \alpha_{0 \rightarrow 1} (C_{L,i} + C_{int,i}) V_{DD}^2$$



### Leakage

$$E_{Lk} = W_{in} I_0 e^{-\frac{(V_{Th} - \gamma V_{DD})}{nV_t}} V_{DD} D$$
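The two energy models can be written directly as functions. In this sketch  $\gamma$  is read as a DIBL coefficient,  $n$  as the subthreshold slope factor and  $V_t$  as the thermal voltage. That reading, the function names and every numeric value are assumptions made for illustration.

```python
import math


def switching_energy(activity, c_load, c_int, vdd):
    """E_Sw = alpha_{0->1} * (C_L + C_int) * V_DD**2."""
    return activity * (c_load + c_int) * vdd ** 2


def leakage_energy(w_in, i0, vth, gamma, vdd, n, v_thermal, delay):
    """E_Lk = W_in * I_0 * exp(-(V_Th - gamma*V_DD) / (n*V_t)) * V_DD * D."""
    i_leak = w_in * i0 * math.exp(-(vth - gamma * vdd) / (n * v_thermal))
    return i_leak * vdd * delay


# Assumed values: 10% activity, 1 fF load, 0.5 fF internal capacitance, 1 V.
print(switching_energy(0.1, 1e-15, 0.5e-15, 1.0))

# Raising V_Th by n * V_t * ln(10) cuts leakage energy by a factor of ten.
n, v_t = 1.5, 0.026
base = leakage_energy(1.0, 1e-6, 0.30, 0.1, 1.0, n, v_t, 1e-9)
raised = leakage_energy(1.0, 1e-6, 0.30 + n * v_t * math.log(10), 0.1, 1.0, n, v_t, 1e-9)
print(base / raised)
```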

## Apply to Sizing of an Inverter Chain



Unconstrained energy: find min  $D = \sum t_{pi}$

$$C_{gin,j} = \sqrt{C_{gin,j-1} C_{gin,j+1}}$$

$$W_j = \sqrt{W_{j-1} W_{j+1}}$$

Constrained energy: find min  $D$ , under  $E < E_{max}$   
Where  $E = \sum e_i$
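The unconstrained result can be checked numerically by applying the geometric-mean rule repeatedly until the widths stop changing. This is a sketch under a normalized delay model in which each stage contributes 1 + fanout. The function names and the example sizes are assumptions.

```python
import math


def size_chain(w_first, w_load, n_interior, sweeps=200):
    """Iterate W_j = sqrt(W_{j-1} * W_{j+1}) with the first width and load fixed."""
    w = [w_first] * (n_interior + 1) + [w_load]
    for _ in range(sweeps):
        for j in range(1, n_interior + 1):
            w[j] = math.sqrt(w[j - 1] * w[j + 1])
    return w


def chain_delay(w):
    """Normalized delay: each stage contributes 1 (self-loading) + fanout."""
    return sum(1.0 + w[j + 1] / w[j] for j in range(len(w) - 1))


sizes = size_chain(1.0, 64.0, 2)
print([round(s, 3) for s in sizes], chain_delay(sizes))
```

For a first width of 1 and a load of 64 the iteration should settle on the geometric taper 1, 4, 16, 64, i.e. an equal fanout of 4 per stage.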


### Constrained Optimization

- Find  $\min(D)$  subject to  $E = E_{\max}$ 
  - Constrained function minimization
- E.g. Lagrange multipliers

$$\Lambda(x) = D(x) + \lambda(E(x) - E_{\max})$$

$$\frac{\partial \Lambda}{\partial x} = 0$$

- Can solve analytically for  $x = W_i, V_{DD}, V_{Th}$


Or dual:

$$K(x) = E(x) + \lambda(D(x) - D_{\max})$$
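As a worked instance of the primal form  $\Lambda(x)$  for gate sizing, assume a normalized chain delay  $D = \sum_j (1 + W_{j+1}/W_j)$  and an energy proportional to total width,  $E \propto \sum_j W_j$ . Both are simplifying assumptions, with all constants absorbed into  $\lambda$ . Only two delay terms and one energy term depend on  $W_j$ , so the stationarity condition gives:

$$\frac{\partial \Lambda}{\partial W_j} = \frac{1}{W_{j-1}} - \frac{W_{j+1}}{W_j^2} + \lambda = 0 \quad \Rightarrow \quad W_j = \sqrt{\frac{W_{j-1} W_{j+1}}{1 + \lambda W_{j-1}}}$$

Setting  $\lambda = 0$  recovers the geometric-mean rule  $W_j = \sqrt{W_{j-1} W_{j+1}}$  of the unconstrained case.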

### Inverter Chain: Sizing Optimization






- Variable taper achieves minimum energy
- Reduce number of stages at large  $d_{inc}$


$$W_j = \sqrt{\frac{W_{j-1}W_{j+1}}{1 + \lambda W_{j-1}}}$$

[Ma, Franzon, IEEE JSSC, 9/94]

$$\lambda = -\frac{2KV_{DD}^2}{\tau_{nom}S_W}$$

- $e_j$: energy per stage
- $f_j$: fanout per stage

$$S_W \propto \frac{e_j}{f_j - f_{j-1}}$$

Stojanovic, ICCAD'02
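The effect of the multiplier on the taper can be explored numerically. The sketch below iterates the  $W_j$  update with a given  $\lambda$  and then bisects on  $\lambda$  until an energy budget is met. It assumes a normalized model: delay of 1 + fanout per stage, energy measured as the total width of the sized stages, and  $\lambda$  treated as a positive dimensionless multiplier rather than the dimensional expression above. All names and numbers are illustrative.

```python
import math


def taper(w_first, w_load, n_interior, lam, sweeps=200):
    """Iterate W_j = sqrt(W_{j-1} W_{j+1} / (1 + lam * W_{j-1})) to a fixed point."""
    w = [w_first] * (n_interior + 1) + [w_load]
    for _ in range(sweeps):
        for j in range(1, n_interior + 1):
            w[j] = math.sqrt(w[j - 1] * w[j + 1] / (1.0 + lam * w[j - 1]))
    return w


def delay(w):
    """Normalized delay: 1 (self-loading) + fanout per stage."""
    return sum(1.0 + w[j + 1] / w[j] for j in range(len(w) - 1))


def energy(w):
    """Energy proxy: total width of the sized (interior) stages."""
    return sum(w[1:-1])


def size_for_energy(w_first, w_load, n_interior, e_max, lam_hi=10.0):
    """Bisect on lam so the energy proxy meets e_max (assumes lam_hi is enough)."""
    lam_lo = 0.0
    for _ in range(60):
        lam = 0.5 * (lam_lo + lam_hi)
        if energy(taper(w_first, w_load, n_interior, lam)) > e_max:
            lam_lo = lam  # still over budget: penalize width more
        else:
            lam_hi = lam
    return taper(w_first, w_load, n_interior, lam_hi)


w = size_for_energy(1.0, 64.0, 2, e_max=10.0)
fanouts = [w[j + 1] / w[j] for j in range(len(w) - 1)]
print([round(x, 2) for x in w], [round(f, 2) for f in fanouts], round(delay(w), 2))
```

In this model the per-stage fanout grows toward the load once the energy budget binds, which is the variable taper noted in the bullets above.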

### Sensitivity to Sizing and Supply

#### Gate sizing ( $W_i$ )

$$-\frac{\partial E_{SW} / \partial W_j}{\partial D / \partial W_j} = \frac{e_j}{\tau_{nom}(f_j - f_{j-1})}$$

The sensitivity is  $\infty$  for equal  $f_{eff}$  across stages, i.e. at the minimum-delay point ( $D_{min}$ ).

#### Supply voltage ( $V_{dd}$ )

$$-\frac{\partial E_{SW} / \partial V_{DD}}{\partial D / \partial V_{DD}} = \frac{E_{SW}}{D} \cdot 2 \, \frac{1 - x_v}{\alpha - 1 + x_v}$$

$$x_v = (V_{th} + \Delta V_{th})/V_{dd}$$
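The closed-form supply sensitivity can be cross-checked against a finite-difference estimate. This sketch assumes  $E_{SW} \propto V_{DD}^2$  and the alpha-power delay  $D \propto V_{DD}/(V_{DD} - V_{th})^\alpha$ , with all constants set to 1 and  $\Delta V_{th}$  folded into `vth`. The function names and operating point are assumptions.

```python
def energy(vdd):
    """Switching energy up to a constant: E ~ V_DD**2."""
    return vdd ** 2


def delay(vdd, vth, alpha):
    """Alpha-power delay up to a constant: D ~ V_DD / (V_DD - V_th)**alpha."""
    return vdd / (vdd - vth) ** alpha


def supply_sensitivity_closed_form(vdd, vth, alpha):
    """(E/D) * 2 * (1 - x_v) / (alpha - 1 + x_v), with x_v = V_th / V_DD."""
    x_v = vth / vdd
    return energy(vdd) / delay(vdd, vth, alpha) * 2.0 * (1.0 - x_v) / (alpha - 1.0 + x_v)


def supply_sensitivity_numeric(vdd, vth, alpha, h=1e-6):
    """-(dE/dV_DD) / (dD/dV_DD) from central differences."""
    d_e = energy(vdd + h) - energy(vdd - h)
    d_d = delay(vdd + h, vth, alpha) - delay(vdd - h, vth, alpha)
    return -d_e / d_d


closed = supply_sensitivity_closed_form(1.0, 0.3, 1.3)
numeric = supply_sensitivity_numeric(1.0, 0.3, 1.3)
print(closed, numeric)
```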



### Sensitivity to $V_{th}$

#### Threshold voltage ( $V_{th}$ )

$$-\frac{\partial E / \partial (\Delta V_{th})}{\partial D / \partial (\Delta V_{th})} = P_{lk} \left( \frac{V_{DD} - V_{th} - \Delta V_{th}}{\alpha n V_t} - 1 \right)$$

Low initial leakage  
⇒ speedup comes for “free”
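The threshold sensitivity scales with the leakage power itself, which is why lowering  $V_{th}$  is cheap when leakage starts out small. The sketch below checks the closed form numerically, assuming  $E_{lk} = P_{lk} D$  with  $P_{lk} \propto e^{-V_{th}/(n V_t)} V_{DD}$  and the alpha-power delay. Constants are set to 1,  $\Delta V_{th}$  is folded into `vth`, and the names and values are illustrative assumptions.

```python
import math


def delay(vdd, vth, alpha):
    """Alpha-power delay up to a constant."""
    return vdd / (vdd - vth) ** alpha


def leakage_power(vdd, vth, n, v_thermal, i0=1.0):
    """P_lk ~ I_0 * exp(-V_th / (n * V_t)) * V_DD."""
    return i0 * math.exp(-vth / (n * v_thermal)) * vdd


def vth_sensitivity_closed_form(vdd, vth, alpha, n, v_thermal):
    """P_lk * ((V_DD - V_th) / (alpha * n * V_t) - 1)."""
    p_lk = leakage_power(vdd, vth, n, v_thermal)
    return p_lk * ((vdd - vth) / (alpha * n * v_thermal) - 1.0)


def vth_sensitivity_numeric(vdd, vth, alpha, n, v_thermal, h=1e-6):
    """-(dE/dV_th) / (dD/dV_th) from central differences, E = P_lk * D."""
    def e_lk(v):
        return leakage_power(vdd, v, n, v_thermal) * delay(vdd, v, alpha)

    d_e = e_lk(vth + h) - e_lk(vth - h)
    d_d = delay(vdd, vth + h, alpha) - delay(vdd, vth - h, alpha)
    return -d_e / d_d


args = dict(vdd=1.0, alpha=1.3, n=1.5, v_thermal=0.026)
for vth in (0.3, 0.4):
    print(vth, vth_sensitivity_closed_form(vth=vth, **args),
          vth_sensitivity_numeric(vth=vth, **args))
```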




### Reducing $V_{dd}$



- The power-delay product (energy per operation) is a strong function of supply voltage ( $V^2$  dependence).
- It is relatively independent of logic function and style.
- It improves as  $V_{dd}$  is lowered.

Chandrakasan, JSSC'92


### Lower $V_{dd}$ Increases Delay



- The delay increase is relatively independent of logic function and style.

$$T_d = \frac{C_L V_{dd}}{I}$$

$$I \sim (V_{dd} - V_t)^2$$

$$\frac{T_d(V_{dd} = 2\text{V})}{T_d(V_{dd} = 5\text{V})} = \frac{2 \cdot (5 - 0.7)^2}{5 \cdot (2 - 0.7)^2} \approx 4.4$$
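The ratio is easy to reproduce for other supply pairs. A minimal sketch, assuming the square-law current above with the proportionality constant dropped and a threshold of 0.7 V as in the worked ratio. The function names are illustrative.

```python
def delay(vdd, vt, c_load=1.0):
    """T_d = C_L * V_dd / I with I ~ (V_dd - V_t)**2, constant dropped."""
    return c_load * vdd / (vdd - vt) ** 2


def delay_ratio(vdd_low, vdd_high, vt=0.7):
    """How much slower the circuit runs at vdd_low than at vdd_high."""
    return delay(vdd_low, vt) / delay(vdd_high, vt)


print(round(delay_ratio(2.0, 5.0), 2))  # prints 4.38
```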

## Trade-off Between Power and Delay



## Architecture Trade-off for Fixed-rate Processing: Reference Datapath



- Critical path delay  $\Rightarrow T_{adder} + T_{comparator} (= 25\text{ns}) \Rightarrow f_{ref} = 40\text{MHz}$
  - Total capacitance being switched =  $C_{ref}$
  - $V_{dd} = V_{ref} = 5\text{V}$
  - Power for reference datapath =  $P_{ref} = C_{ref} V_{ref}^2 f_{ref}$  (Chandrakasan, JSSC'92)

## Parallel Datapath



## Pipelined Datapath



- Critical path delay is less  $\Rightarrow \max [T_{adder}, T_{comparator}]$
  - Keeping clock rate constant:  $f_{pipe} = f_{ref}$
  - Voltage can be dropped  $\Rightarrow V_{pipe} = V_{ref} / 1.7$
  - Capacitance slightly higher:  $C_{pipe} = 1.15 C_{ref}$
  - $P_{pipe} = (1.15 C_{ref}) (V_{ref}/1.7)^2 f_{ref} \approx 0.39 P_{ref}$

## A Simple Datapath: Summary

| Architecture type                              | Voltage | Area (normalized) | Power (normalized) |
|------------------------------------------------|---------|-------------------|--------------------|
| Simple datapath (no pipelining or parallelism) | 5V      | 1    | 1     |
| Pipelined datapath                             | 2.9V    | 1.3  | 0.39  |
| Parallel datapath                              | 2.9V    | 3.4  | 0.36  |
| Pipeline-Parallel                              | 2.0V    | 3.7  | 0.2   |
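The power column follows from  $P = C V^2 f$  normalized to the reference datapath. The sketch below reproduces it using the voltages from the table. The capacitance factors for the parallel (2.15) and pipeline-parallel (2.5) cases and their halved clock rate are not stated in this section; they are assumptions recalled from the cited Chandrakasan example and should be checked against the paper.

```python
V_REF = 5.0  # reference supply in volts

# name -> (capacitance / C_ref, supply in V, clock rate / f_ref)
# The 2.15 and 2.5 capacitance factors and the 0.5 clock factors are
# assumptions taken from memory of the cited example, not from the table.
ARCHITECTURES = {
    "simple": (1.00, 5.0, 1.0),
    "pipelined": (1.15, 2.9, 1.0),
    "parallel": (2.15, 2.9, 0.5),
    "pipeline-parallel": (2.50, 2.0, 0.5),
}


def normalized_power(c_factor, vdd, f_factor):
    """P / P_ref = (C / C_ref) * (V / V_ref)**2 * (f / f_ref)."""
    return c_factor * (vdd / V_REF) ** 2 * f_factor


POWER = {name: normalized_power(*p) for name, p in ARCHITECTURES.items()}
for name, value in POWER.items():
    print(f"{name:18s} {value:.2f}")
```

The parallel variants save power because each copy runs at half the clock rate, which is what allows the lower supply despite the larger switched capacitance.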




## Multiple Supplies

### Multiple Supply Voltages

- Block-level supply assignment ("power domains" or "voltage islands")
  - Higher throughput/lower latency functions are implemented in higher  $V_{DD}$
  - Slower functions are implemented with lower  $V_{DD}$
  - Separate supply grids, level conversion performed at block boundaries
- Multiple supplies inside a block
  - Non-critical paths moved to lower supply voltage
  - Level conversion within the block
  - Physical design challenging
  - (Not used in practice)

### Power Domains



## Practical Examples

- Intel Skylake (ISSCC'16)
  - Four power planes indicated by colors



## Leakage Issue

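- Driving from  $V_{DDL}$  to  $V_{DDH}$ : a logic high of only  $V_{DDL}$  does not fully turn off the PMOS in the  $V_{DDH}$  gate, which causes static leakage
  - → A level converter is needed at the boundary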



## Multiple Supplies in a Block

### Conventional Design



### Clustered Voltage Scaling (CVS) Structure



## Level-Converting Flip-Flop



## Practical Examples

- Intel 28-core Skylake-SP (ISSCC'18)




## Multiple Supplies Within A Block

- Downsizing or lowering the supply on the critical path lowers the operating frequency
- Downsize (or lower the supply of) non-critical paths instead
  - Narrows down the path delay distribution
  - Increases impact of variations



## Multiple Supplies in a Block

### CVS



### Layout



## Summary

- Power-performance tradeoffs
  - Sizing
  - Supplies
  - Thresholds
- Lowering supplies
- Multiple supply voltages




## Next Lecture

- Low-power design
  - Dynamic voltage-frequency scaling
  - Clock gating


