

inst.eecs.berkeley.edu/~eecs251b

# EECS251B : Advanced Digital Circuits and Systems



## Lecture 20 – Low Power Design

### Borivoje Nikolić, Vladimir Stojanović, Sophia Shao




**IEEE MICRO, Nov/Dec 2021**  
Microprocessor at 50: Looking Back and Looking Forward  
Special issue on 50 years of a microprocessor



Announcing a new era of integrated electronics  
A micro-programmable computer on a chip  
intel delivers.

EECS251B L20 LOW-POWER DESIGN

Advertisement in the Electronics News Weekly in November 1971 announcing the Intel 4004.


## Recap

- Power is a primary design constraint
  - In both cloud and edge systems
- Excess performance traded off for power savings


## From System View: What is the Optimum?

- How do sensitivities relate to more traditional metrics:
  - Power per operation (MIPS/W, GOPS/W, TOPS/W)
  - Energy per operation (Joules per op)
  - Energy-delay product
- Can be recast as minimizing power  $\times$  delay<sup>n</sup>
  - n = 0 – minimize power per operation
  - n = 1 – minimize energy per operation
  - n = 2 – minimize energy-delay product
  - n = 3 – minimize energy-(delay)<sup>2</sup> product
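
Since power is energy per unit delay, $P = E/D$, the four cases can be restated in terms of energy and delay. This is a short restatement that assumes one operation completes per delay $D$:

$$P \times D^n = \frac{E}{D} \, D^n = E \, D^{\,n-1}$$

so $n = 0$ gives $P$, $n = 1$ gives $E$, $n = 2$ gives $E \cdot D$ (EDP), and $n = 3$ gives $E \cdot D^2$.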


## Optimization Problem

- Set up optimization problem:
  - Maximize performance under energy constraints
  - Minimize energy under performance constraints
- Or minimize a composite function of E<sup>n</sup>D<sup>m</sup>
  - What are the right n and m?
- n = 1, m = 1 is EDP – improves at lower V<sub>DD</sub>
- n = 1, m = 2 is invariant to V<sub>DD</sub>
  - E  $\sim CV_{DD}^2$
  - D  $\sim 1/V_{DD}$
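
The supply-voltage claims above can be checked numerically. The following is a minimal sketch that uses only the first-order scaling from the slide ($E \sim CV_{DD}^2$, $D \sim 1/V_{DD}$, threshold voltage ignored); the function names and the normalized voltages are illustrative assumptions.

```python
# Toy scaling model from the slide: E ~ C * VDD^2, D ~ 1 / VDD.
# Threshold voltage and leakage are ignored, so this is only a first-order check.

def energy(vdd, c=1.0):
    """Switching energy, normalized so that C = 1."""
    return c * vdd ** 2


def delay(vdd):
    """Gate delay under the simplified D ~ 1/VDD model."""
    return 1.0 / vdd


def metric(vdd, n, m):
    """Composite cost E^n * D^m at supply voltage vdd."""
    return energy(vdd) ** n * delay(vdd) ** m


for vdd in (1.0, 0.8, 0.6):
    edp = metric(vdd, 1, 1)   # energy-delay product
    ed2 = metric(vdd, 1, 2)   # energy-delay^2 product
    print(f"VDD={vdd:.1f}  EDP={edp:.2f}  ED^2={ed2:.2f}")
```

Under this model EDP falls linearly with $V_{DD}$ while $ED^2$ stays constant, which is why $ED^2$ is useful for comparing designs independently of their supply voltage.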


## Hardware Intensity

- Introduced by Zyuban and Strenski in 2002.
- Measures where the design sits on the energy-delay curve
- Parameter in cost function optimization

$$F_c = (E/E_0)(D/D_0)^\eta \quad 0 \leq \eta < +\infty,$$

$$\eta = -\left. \frac{D \partial E}{E \partial D} \right|_V$$

**Slope of the optimal E-D curve at the chosen design point**
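
To make the definition concrete, the sketch below estimates $\eta$ numerically as the relative slope $-(D/E)\,\partial E/\partial D$ of an energy-delay curve. The curve $E = 1/D^2$ and the helper name `hardware_intensity` are hypothetical, chosen only because the expected answer ($\eta = 2$) is easy to verify by hand.

```python
def hardware_intensity(energy_fn, d, rel_step=1e-6):
    """Estimate eta = -(D/E) * dE/dD at delay d using a central difference."""
    h = d * rel_step
    de_dd = (energy_fn(d + h) - energy_fn(d - h)) / (2 * h)
    return -d * de_dd / energy_fn(d)


def toy_curve(d):
    """Hypothetical optimal E-D curve, E = 1 / D^2 (not taken from the lecture)."""
    return 1.0 / d ** 2


eta = hardware_intensity(toy_curve, 2.0)
print(f"eta ~ {eta:.3f}")
```

With $\eta \approx 2$ at the chosen point, a 1% reduction in delay costs roughly 2% more energy.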




## Optimum Across Hierarchy Layers



Zyuban et al, TComp'04

**Optimal logic depth in pipelined processors is ~18FO4**  
Relatively flat in the 16-22FO4 range


## Announcements

- Quiz 2 today
- Homework 3 due next week


## Circuit-Level Tradeoffs




## Sizing, Supply, Threshold Optimization

- Transistor sizing can yield large power savings with small delay penalties
  - Gate sizing
  - Beta-ratio adjustments
  - (Stack resizing)
- Supply voltage affects both active and leakage energy
- Threshold voltage affects primarily the leakage

$$\beta = W_p/W_n$$


## Apply to Sizing of an Inverter Chain



*Unconstrained energy: find min D =  $\sum t_{pi}$*

$$C_{gin,j} = \sqrt{C_{gin,j-1} C_{gin,j+1}} \quad W_j = \sqrt{W_{j-1} W_{j+1}}$$

*Constrained energy: find min D, under E < E<sub>max</sub>*  
Where  $E = \sum e_i$
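
The unconstrained result (each stage's input capacitance is the geometric mean of its neighbors) implies a geometric progression of stage sizes. The sketch below assumes a fixed number of stages and a known input and load capacitance; the function name and the example values (unit input capacitance, load of 64, three stages) are illustrative.

```python
import math


def chain_input_caps(c_in, c_load, n_stages):
    """Input capacitance of each stage for minimum delay with a fixed stage count.

    The geometric-mean condition means every stage sees the same
    fanout f = (c_load / c_in) ** (1 / n_stages).
    """
    f = (c_load / c_in) ** (1.0 / n_stages)
    return [c_in * f ** j for j in range(n_stages)]


caps = chain_input_caps(1.0, 64.0, 3)   # per-stage fanout is close to 4
nodes = caps + [64.0]                   # append the load as the final "stage"

# Every interior node should be the geometric mean of its two neighbors.
for j in range(1, len(nodes) - 1):
    assert math.isclose(nodes[j], math.sqrt(nodes[j - 1] * nodes[j + 1]))

print([round(c, 2) for c in caps])
```

Under an energy constraint the optimal sizes deviate from this progression, which is what the constrained formulation addresses.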


## Constrained Optimization

- Find  $\min(D)$  subject to  $E = E_{\max}$ 
  - *Constrained function minimization*
- E.g. Lagrange multipliers

$$\Lambda(x) = D(x) + \lambda(E(x) - E_{\max})$$

Or dual:

$$K(x) = E(x) + \lambda(D(x) - D_{\max})$$

$$\frac{\partial \Lambda}{\partial x} = 0$$

- Can solve analytically for  $x = W_p, V_{DD}, V_{Th}$
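
As a sketch of how the stationarity condition is used, assume the tuning variables $x_i$ (for example sizing, supply, and threshold) can be varied independently. Applying $\partial \Lambda / \partial x_i = 0$ to each one gives

$$\frac{\partial D}{\partial x_i} + \lambda \frac{\partial E}{\partial x_i} = 0 \quad \Rightarrow \quad \frac{\partial E / \partial x_i}{\partial D / \partial x_i} = -\frac{1}{\lambda} \quad \text{for every } i$$

so at the constrained optimum every variable has the same energy-to-delay sensitivity ratio. If one variable offers a cheaper energy-for-delay trade than another, the design is not yet optimal.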


## Inverter Chain: Sizing Optimization


## Sensitivity to $V_{th}$

- Threshold voltage ( $V_{th}$ )

$$-\frac{\partial E}{\partial \Delta V_{th}} = P_{Lk} \left( \frac{V_{DD} - V_{Th} - \Delta V_{Th}}{\alpha n V_t} - 1 \right)$$

**Low initial leakage**  
**⇒ speedup comes for “free”**




## Scaling Supplies




## Reducing $V_{dd}$



$$P \times t_d = E_t = C_L \times V_{dd}^2$$

$$\frac{E(V_{dd}=2)}{E(V_{dd}=5)} = \frac{(C_L) \times (2)^2}{(C_L) \times (5)^2}$$

$$E(V_{dd}=2) \approx 0.16 E(V_{dd}=5)$$

- Strong function of voltage ( $V^2$  dependence).
- Relatively independent of logic function and style.
- Power-delay product improves as  $V_{DD}$  is lowered.

Chandrakasan, JSSC'92


## Lower $V_{DD}$ Increases Delay



$$T_d = \frac{C_L \times V_{dd}}{I}$$

$$I \sim (V_{dd} - V_t)^2$$

$$\frac{T_d(V_{dd}=2)}{T_d(V_{dd}=5)} = \frac{(2) \times (5 - 0.7)^2}{(5) \times (2 - 0.7)^2} \approx 4$$

- Relatively independent of logic function and style.
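
The energy and delay ratios for scaling from 5 V to 2 V can be reproduced with a few lines. This is a minimal sketch using the same first-order models as the slides ($E \sim C_L V_{dd}^2$, $T_d \sim C_L V_{dd}/I$ with $I \sim (V_{dd} - V_t)^2$ and $V_t = 0.7$ V); the function names are illustrative.

```python
def energy_ratio(v_low, v_high):
    """E(v_low) / E(v_high) for E ~ C_L * Vdd^2 (C_L cancels)."""
    return (v_low / v_high) ** 2


def delay_ratio(v_low, v_high, v_t=0.7):
    """T_d(v_low) / T_d(v_high) for T_d ~ Vdd / (Vdd - Vt)^2 (C_L cancels)."""
    def t_d(v):
        return v / (v - v_t) ** 2
    return t_d(v_low) / t_d(v_high)


e_ratio = energy_ratio(2.0, 5.0)
d_ratio = delay_ratio(2.0, 5.0)
print(f"energy ratio = {e_ratio:.2f}, delay ratio = {d_ratio:.2f}")
```

The delay ratio evaluates to roughly 4.4, which the slide rounds to 4: a 6x energy saving is paid for with a slowdown of more than 4x.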


## Parallel Datapath



- The clock rate can be reduced by half with the same throughput  $\Rightarrow f_{\text{par}} = f_{\text{ref}} / 2$
- $V_{\text{par}} = V_{\text{ref}} / 1.7$ ,  $C_{\text{par}} = 2.15C_{\text{ref}}$
- $P_{\text{par}} = (2.15C_{\text{ref}}) (V_{\text{ref}}/1.7)^2 (f_{\text{ref}}/2) \approx 0.36 P_{\text{ref}}$


## Pipelined Datapath



- Critical path delay shrinks to  $\max [T_{\text{adder}}, T_{\text{comparator}}]$
- Keeping clock rate constant:  $f_{\text{pipe}} = f_{\text{ref}}$   
Voltage can be dropped  $\Rightarrow V_{\text{pipe}} = V_{\text{ref}} / 1.7$
- Capacitance slightly higher:  $C_{\text{pipe}} = 1.15C_{\text{ref}}$
- $P_{\text{pipe}} = (1.15C_{\text{ref}}) (V_{\text{ref}}/1.7)^2 f_{\text{ref}} \approx 0.39 P_{\text{ref}}$
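
The power figures for the parallel and pipelined datapaths follow from $P = C V^2 f$ with every quantity normalized to the reference design. The sketch below is a direct evaluation of that expression; the function name is an assumption. Note that the rounded voltage factor 1.7 gives about 0.37 and 0.40, while using 2.9 V / 5 V (the voltages listed in the summary table) gives the quoted 0.36 and 0.39.

```python
def relative_power(c_ratio, v_ratio, f_ratio):
    """P / P_ref for P = C * V^2 * f, all arguments normalized to the reference."""
    return c_ratio * v_ratio ** 2 * f_ratio


# Using the rounded voltage factor 1.7 from the bullets above.
p_par_rounded = relative_power(2.15, 1 / 1.7, 0.5)
p_pipe_rounded = relative_power(1.15, 1 / 1.7, 1.0)

# Using the unrounded ratio 2.9 V / 5 V.
p_par = relative_power(2.15, 2.9 / 5.0, 0.5)
p_pipe = relative_power(1.15, 2.9 / 5.0, 1.0)

print(f"parallel:  {p_par_rounded:.3f} (V/1.7), {p_par:.3f} (2.9 V / 5 V)")
print(f"pipelined: {p_pipe_rounded:.3f} (V/1.7), {p_pipe:.3f} (2.9 V / 5 V)")
```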


## A Simple Datapath: Summary

| Architecture type                                 | Voltage | Area (normalized) | Power (normalized) |
|---------------------------------------------------|---------|-------------------|--------------------|
| Simple datapath<br>(no pipelining or parallelism) | 5V      | 1                 | 1                  |
| Pipelined datapath                                | 2.9V    | 1.3               | 0.39               |
| Parallel datapath                                 | 2.9V    | 3.4               | 0.36               |
| Pipeline-Parallel                                 | 2.0V    | 3.7               | 0.2                |




## Multiple Supplies


## Multiple Supply Voltages

- Block-level supply assignment (“power domains” or “voltage islands”)
  - Higher-throughput/lower-latency functions are implemented in a higher- $V_{DD}$  domain
  - Slower functions are implemented in a lower- $V_{DD}$  domain
  - Separate supply grids, level conversion performed at block boundaries
- Multiple supplies inside a block
  - Non-critical paths moved to lower supply voltage
  - Level conversion within the block
  - Physical design challenging
  - (Not used in practice)


## Power Domains


## Practical Examples

- Intel Skylake (ISSCC'16)
  - Four power planes indicated by colors



The diagram shows a cross-section of a chip area labeled "SA Power planes". It features several power planes colored red, cyan, magenta, and yellow, which are interconnected with various metal layers and vias. Labels like "fuse" and "dip" are visible. The background is black.


## Practical Examples

- Intel 28-core Skylake-SP (ISSCC'18)
  - 9 primary VCC domains are partitioned into 35 VCC planes



The diagram illustrates the power architecture of the Intel 28-core Skylake-SP. It shows a grid of cores (yellow blocks) and various supply planes (green, blue, red, purple) that connect them. A legend on the right side identifies the different supply types:

- Vcc: core supply (per core)
- Vccclm: Un-core supply
- Vccsa: System Agent supply
- Vccio: Infrastructure supply
- Vccsfr: PLL supply
- Vccddrd: DDR logic supply
- Vccddra: DDR I/O supply


## Leakage Issue

- Driving from  $V_{DDL}$  to  $V_{DDH}$
  - Requires a level converter

The diagram shows a circuit with two NMOS transistors. The top NMOS has its gate at  $V_{DDH}$  and source at  $V_{DDL}$ . The bottom NMOS has its gate at  $V_{DDH}$  and source at ground. A voltage  $V_{GS}$  is applied to the common drain node. A condition  $|V_{GS}| > V_{Th}$  is shown above the top NMOS. A condition  $V_{OH} < V_{DDH}$  is shown below the bottom NMOS. A leakage current  $I_{Leak}$  is indicated flowing from the drain of the top NMOS to the drain of the bottom NMOS.

The level converter circuit consists of four NMOS transistors and one PMOS transistor. The PMOS is connected between  $V_{DDH}$  and the output  $V_{out}$ . The other three NMOS transistors are connected between  $V_{in}$ , ground, and  $V_{out}$ . The circuit is powered by  $V_{DDH}$  and  $V_{DDL}$ .


## Multiple Supplies Within A Block

- Downsizing or lowering the supply on the critical path lowers the operating frequency
- Instead, downsize (or lower the supply of) non-critical paths
  - Narrows down the path delay distribution
  - Increases impact of variations
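
A toy illustration of how moving non-critical paths to a lower supply narrows the delay distribution is sketched below. It is a simplification under stated assumptions: paths are treated as independent (real gates are shared between paths), level-converter overhead is ignored, and the path delays and the 1.5x slowdown at the low supply are hypothetical values.

```python
def assign_supplies(path_delays, target, slowdown):
    """Move a path to the low supply if it still meets `target` after slowing down.

    Returns a list of (supply_name, resulting_delay) tuples, one per path.
    """
    assignment = []
    for d in path_delays:
        if d * slowdown <= target:
            assignment.append(("VDDL", d * slowdown))
        else:
            assignment.append(("VDDH", d))
    return assignment


# Hypothetical path delays, normalized so the target delay is 1.0.
paths = [0.4, 0.55, 0.7, 0.9, 1.0]
result = assign_supplies(paths, target=1.0, slowdown=1.5)
new_delays = [d for _, d in result]

print("supplies:", [s for s, _ in result])
print("spread before:", max(paths) - min(paths))
print("spread after: ", max(new_delays) - min(new_delays))
```

The two fastest paths move to the low supply and slow down, so more paths end up close to the target delay. That is the sense in which the optimization leaves less slack to absorb variations.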

The figure contains two graphs. Both graphs have 'Path count' on the vertical axis and 'Delay' on the horizontal axis. The left graph shows the 'Original delay distribution' (shaded area under a bell-shaped curve) and the 'Scaled supply delay distribution' (shaded area under a narrower bell-shaped curve shifted to the right). A vertical dashed line indicates the 'Target delay'. The right graph shows the 'Original delay distribution' and the 'Optimized delay distribution' (shaded area under a very narrow bell-shaped curve shifted to the right), also targeting the same 'Target delay'.
