

# Why save power?

- Power/Energy is important, especially for embedded systems
  - Mobile and portable devices: Smartphones, Laptops
  - Other devices: Wearables, Medical devices, etc.
  - 3D architectures
- Lower power usage = Better battery life
- More power = More heat



# Samsung Galaxy Note 7



<https://www.theverge.com/2016/10/9/13218730/samsung-galaxy-note-7-fire-replacement-fourth-virginia>

<https://www.theverge.com/2016/10/9/13215728/samsung-galaxy-note-7-third-fire-smoke-inhalation>

<https://mashable.com/article/samsung-galaxy-a21-airplane-fire>

# Some power basics

- Processors have gates
  - Short circuit power + Leakage power + Dynamic power
  - $P_{dynamic} = \alpha CV^2f$  (P: power,  $\alpha$ : switching factor, C: capacitance, V: voltage, f: frequency)
  - $E = P * \text{Execution time}$



(a)

# Short circuit power



SHORT CIRCUIT CURRENT PATH IN CMOS

- Happens during transitions
- Both transistors momentarily ON



when both transistor are ON momentarily the short circuit comes in the form of spike

# Leakage power

- Happens during normal operations
- OFF transistors are not ideal open circuit
- More serious for smaller process nodes



# Dynamic power

- Happens due to switching
- Requires two toggles
  - Charge Load capacitor
  - Discharge load capacitor



During charging



During discharging

# How to reduce dynamic power?

$$P = \alpha C V^2 f$$

$$\alpha \downarrow, C \downarrow, V \downarrow, f \downarrow \Rightarrow P \downarrow$$

- Reducing  $\alpha$ : design circuit with suitable gates, reduce glitch, power gating
- Reducing  $C$ : use smaller gates
- Reducing  $V$ : use lower supply voltage
- Reducing  $f$ : use lower clock frequency
- Reducing  $V$  has the highest benefit



# Avoid glitches



- **Unbalanced paths causes unnecessary toggles due to glitch**
- **Extra dynamic power consumed**
- **Creating balanced paths reduce α**
- **Low power dissipation**

# Power and Clock gating



- **Power gating saves leakage power**
- **Clock gating reduces dynamic power**

# Static vs Dynamic logic design



# Power-Performance trade-offs



- **Lower power:**
  - Pros: Lower power bill, lower heat, lower fan noise, longer-lasting battery
  - Cons: Poor performance
- How about application-aware power management?

# Saving power in Manycore

- Lots of cores
- Not all of them are used all the time
- Different levels of usage in each core
- Opportunity to save power



# Workload characteristics

- Applications do not require peak computing capability all the time
- Reduce V, f when the demand is low
- Meet deadlines exactly instead of being too fast/slow



# Workload characteristics: k-mer counting



(a)



(b)

Fig. 2: Gerbil: (a) Instruction types, and (b) CPU utilization and runtime (normalized) for varying number of intermediate files

# Voltage Frequency Scaling (VFS)



- If you're not using something, turn it off

# Voltage Frequency Scaling



$\langle V, F \rangle$  pairs for each core



- Our problem:
  - Circuit is given, no changes allowed
  - Only allowed to change V and F values

# Example of VFS

- **Depending on task, vary voltage/frequency**
- **Entire setup implemented using hardware with help of OS**



# Why scale V and F together(1)?



- Reducing V decreases pull-up strength
- Delay of circuit increases automatically
- Clock frequency must be adjusted



# Why scale V and F together(2)?

- Oscillators are not stable at all voltages for a given frequency

Increasing f



# Nexus Processor VFS

- Running Android 4.1.2 Jelly Bean OS.
- ARM Cortex A8 Hummingbird Processor
  - Supports dynamic frequency scaling from 100Mhz to 1Ghz
  - Supports voltage scaling from 800mV to 1500mV
- Supported frequencies along with their predefined voltage of operation are as follows
  - 100Mhz -> 950mV
  - 200Mhz -> 950mV
  - 400Mhz -> 1050mV
  - 800Mhz -> 1200mV
  - 1000Mhz -> 1250mV



# Intel Pentium processor

| Frequency     | Voltage |
|---------------|---------|
| 1.6 GHz (HFM) | 1.484 V |
| 1.4 GHz       | 1.420 V |
| 1.2 GHz       | 1.276 V |
| 1.0 GHz       | 1.164 V |
| 800 MHz       | 1.036 V |
| 600 MHz (LFM) | 0.956 V |



# DVFS implementation (software)

- Implemented using Linux kernels
  - Supports multiple CPU governors and schedulers.
- CPU governor
  - Decides how to scale the frequency of the processor based on the workload.
  - Policies: Power save, Conservative, Performance, on demand
- Scheduler
  - Gives tasks access to resources at the time of execution

# Static policies for VFS



- How to identify changes
- Static policies are easy to implement in hardware
- Sub-optimal power-performance



# Dynamic Voltage Frequency Scaling (DVFS)



- Dynamically adapt as per application requirement
- Difficult to implement in hardware



# Challenges with Dynamic Voltage Frequency Scaling



- How to identify changes?
- Overhead in switching
- ML prediction?

# DVFS example



- IPC for FFT with respect to time
- Voltage closely matches the behavior
- ML policy used for prediction

# Communication by cores

- Similar behavior in communication as well
- Not all cores communicate all the time
- Chance to save power



# DVFS implementations (hardware)

- Many ways to implement DVFS
- Off-Chip
  - Slower to react
  - Easy to design
- On-Chip
  - Faster to react
  - Difficult to design



# DVFS hardware



- DVFS is enabled by a “DVFS Controller”
- DVFS applied on a per-core basis
- Often not scalable with number of cores

# The controller



**Figure 1. (a) Power stage of hybrid SISC regulator;  $VDD = 2 \text{ V}$ ,  $L = 500 \text{ pH}$ ,  $C_0 = 2 \text{ nF}$**   
**(b) Switching patterns of the switches when  $V_{OUT}$  is between  $VDD$  and  $VDD/2$ , and P4/N4 pair is disabled.**

# DVFS in manycore



# DVFS challenges

- One controller per core
- Multiple V/F configurations
- Communication is challenge



# Communication challenge in DVFS



- Also known as **Globally asynchronous locally synchronous (GALS)** system
- Communication is based on acknowledgements

# Voltage Frequency Islands (VFI)

- Birds of a feather flock together
- Have one controller per cluster
- Cores belonging to same cluster have same voltage and frequency
- ARM board (Big-Little clusters)



# ARM board (Odriod xu4)

- ARM board
- Two computing clusters
- Big-Little clusters
- Performance → Big cluster
- Power → Little cluster
- Map tasks according to requirement



# Apple M1 chip

- Two types of CPUs
  - Performance cores
  - Efficiency cores



# Manycore application characteristics

- Per-core DVFS is costly
- Many cores have similar IPC
- Create groups of similar cores,  
a.k.a., islands
- One controller per cluster



# Voltage Frequency Island



- A group of cores have similar behavior
- $\langle V, F \rangle$  pairs for each group instead of each core
- Fewer DVFS controllers
- Easier communication

# Overall VFI methodology



Fig. 1. Illustration of the proposed VFI-partitioned multicore codesign methodology. During the VFI clustering phase, each application is profiled in order to obtain key time-varying core and network statistics. These statistics are used to find the optimal clustering, within particular size constraints, for each application. During system design, these clusters are then mapped to physical cores. The NoC and VFI controller (VFI CTRL), the dedicated hardware block that dynamically tunes the V/F of the VFI, are then placed to tailor to the traffic and cluster characteristics of each application. During runtime, the VFI CTRL obtains the core and traffic data in order to determine optimal V/F levels for the VFI.

# Effectiveness of VFI



- Average energy improvement for the Naïve M3D systems is 21.68%
- Energy improvement is 21.87% for the Hetero M3D systems

# VFI in 3D



- VFI can be done with 3D architectures as well
- Higher V/F cores near the sink

# VFI: TSV vs M3D



- VFI for TSV and M3D are different
- For M3D, higher V/F cores not near sink

# VFI outcome: TSV vs M3D



- M3D architectures have better power-performance trade-offs
- Pareto front is closer to origin

# Experimental setup



Fig. 6. Experimental set up for establishing the performance and thermal tradeoff for TSV- and M3D-based NoC architectures.

- Use open-source simulators to get the data