

Campus

| B5  
7mm×30行 40页

Digital System Synthesis - ELEC 6233

lkl1n16@stan.ac.uk

Campus®

B5 点线本 | 7 mm×30行 | 40页

KOKUYO

## # Summary of topics

High-level synthesis for system-on-chip design

- Data structure, modelling, algorithms, mathematical background
- Synthesis and design space, optimisation
- Scheduling and binding (映射)
- Testing in SoC design

Synthesis for low power

- Clock gating techniques
- Dynamic power, leakage power, power estimation
- Power-gating

SystemC - a high-level hardware modelling language

Performance and energy efficiency

- Many-core processor architectures
- Power dissipation and performance monitoring

## High-level synthesis - Overview

### # Mathematical Background

Certain concepts from Boolean algebra { Binary decision diagrams (BDDs)  
Boolean satisfiability and cover

#### [BDDs]

A boolean function can be represented as a rooted, directed, acyclic graph, which consists of several decision nodes and terminal nodes. There are two types of terminal nodes called 0-terminal and 1-terminal. Each decision node  $N$  is labelled by Boolean variable  $X_N$  and has two child nodes called low child and high child.

| $X_1$ | $X_2$ | $X_3$ | $f$ |
|-------|-------|-------|-----|
| 0     | 0     | 0     | 0   |
| 0     | 0     | 1     | 0   |
| 0     | 1     | 0     | 0   |
| 0     | 1     | 1     | 1   |
| 1     | 0     | 0     | 0   |
| 1     | 0     | 1     | 0   |
| 1     | 1     | 0     | 1   |
| 1     | 1     | 1     | 1   |



[Boolean satisfiability] {  $a \text{ and not } b, a=1, b=0 \Rightarrow \text{TRUE} \mid \text{satisfiable}$   
 $a \text{ and not } a \times \mid \text{unsatisfiable}$

Boolean satisfiability problem is the problem of determining if there exists an interpretation that satisfies a given Boolean formula. In other words, it asks whether the variables of a given Boolean formula can be consistently replaced by the values TRUE or FALSE in such a way that the formula evaluates to TRUE. If this is the case, the formula is called satisfiable.

### # Boolean Satisfiability (SAT)

Many high-level synthesis task can be solved by determining boolean satisfiability of logic functions

SAT: Find a assignment to the variables  $x_1, \dots, x_n$  such that the boolean function  $f(x_1, \dots, x_n) = 1$ , or prove that no such assignment exists.

### # Ordered Binary Decision Diagrams (OBDDs)

Different problems that can be modeled by OBDDs { Digital circuit design and verification

{ Sensitivity and probabilistic analysis on digital circuits  
Finite-state system analysis

### # Many Uses of DBDDs and SAT

SAT is a core computational engine for major applications

EDA

{ Testing  
Logic synthesis  
FPGA routing  
Path delay analysis  
... } Knowledge based deduction  
AI { Automatic theorem proving

Also core engine for many other problems:

Integer Linear Programming

Theorem provers (Coq, Isabelle...)

{ HOL - Higher-Order Logic Proof Assistant

Isabelle - successor of HOL

combinational circuit can be simplified by SAT solver

[Example] SAT can be used to solve the logic equivalence problem

{ Combinational circuit  $C_A$ , with  $n$  inputs and  $m$  outputs  
 { Combinational circuit  $C_B$ , with  $n$  inputs and  $m$  outputs



Are the two circuits equivalent?

{ A complicated problem formulation:

Are the output equivalent for all input values?

{ A simpler alternative:

Are there (at least one) input values that distinguish outputs of the two circuits?

Digital integrated circuits can exhibit defects

Physical defects are modelled as logic faults

Most often used fault model: single stuck-at fault model

{  $s_{i=0}$ : fixed at 0

{  $s_{i=1}$ : fixed at 1

Circuit A vs Circuit B

Good Circuit vs Faulty circuit

for  $i=1$  to  $n$   
 for  $j=i$  to  $n$

for  $k=j$  to  $n$

$$n^2(n-1)^2 + \dots + 2^2 \cdot 1^2 = \frac{n(n+1)(n+2)}{6} \quad \text{时间复杂度为 } O(n^3)$$

# Is Boolean satisfiability (SAT) tractable?

An algorithmic problem with  $n$  variables is said to be tractable if the algorithm necessary to solve it is of polynomial time.  
 An algorithm is said to be of polynomial time (of P complexity) if its run time is upper bounded by a polynomial expression of the order of the input for the algorithm, i.e.,  $T(n) = Dn^k$ ,  $k$  is a constant.

SAT is the first established NP problem (Non-deterministic polynomial complexity)

△ Cook's theorem: A problem is in NP if and only if it can be reduced to SAT.

# P, NP, NP-complete complexity

If there is a polynomial algorithm for an NP problem, then  $P=NP$

If there is an NP problem cannot be solved in polynomial time ( $P \neq NP$ ), it is NP-hard

An NP-complete problem is a problem that is both NP and NP-hard



Sizes of inputs increase, the complexity increase as exponential

$$(\bar{a}\bar{b} + \bar{b}\bar{c})(a+c) = a\bar{a}\bar{b} + \bar{a}\bar{b}c + a\bar{b}\bar{c} + \bar{b}cc \\ = \bar{a}\bar{b}c + a\bar{b}\bar{c}$$

## # Conjunctive normal form (CNF) 合取范式

For SAT, the Boolean formula must be presented in CNF

A boolean formula is in CNF if it is a conjunction of clauses, where a clause is a disjunction of literals

$$F = \bar{a}\bar{b}c + a\bar{b}\bar{c} = \bar{b}(\bar{a} + \bar{c})(a + c)$$

↑      ↑  
literal clause

CNF is an AND of ORS

$$\left. \begin{array}{l} (x + \bar{y})(\bar{x} + y) \\ (\bar{x}y) + (\bar{x}y) \end{array} \right\} \checkmark$$

[Algorithm to convert any Boolean formula to CNF (Tseitin transformation)]

Recursive algorithm: convert(F):

① If F is a literal, return F

② If F has the form  $F = p \oplus q$ ,

Convert ( $p$ ), which must have the form  $p = P_1 P_2 \dots P_n$

Convert ( $q$ ), which must have the form  $q = Q_1 Q_2 \dots Q_m$

return  $P_1 P_2 \dots P_n \oplus Q_1 Q_2 \dots Q_m$

③ If F has the form  $F = p + q$

Convert ( $p$ ), which must have the form  $p = p_1 p_2 \dots p_n$ ,

convert ( $q$ ), which must have the form  $q = q_1 q_2 \dots q_m$

return  $(P_1 + Q_1)(P_1 + Q_2) \dots (P_1 + Q_m)(P_2 + Q_1) \dots (P_2 + Q_m) \dots (P_n + Q_1) \dots (P_n + Q_m)$

$\frac{P}{P}$

$\frac{Q}{Q}$

e.g.:  $\frac{PV\bar{B}}{PV\bar{V}V\bar{A}}$

## # Literal and clauses

④ If F has the form  $\bar{a}b$

return convert( $\bar{a}b$ )

⑤ If F has the form  $\bar{ab}$

return convert( $\bar{a} + \bar{b}$ )

$$P_1 P_2 \dots P_n + Q_1 Q_2 \dots Q_m = (P_1 + Q_1)(P_2 + Q_2) \dots (P_n + Q_n)(P_1 + Q_2) \dots (P_1 + Q_m) \dots (P_n + Q_m)$$

↑  
 $\left\{ \begin{array}{l} \text{if } P_i = 0 \Rightarrow Q_1 = \dots = Q_m = 1 \in \text{The above expression is satisfiable} \\ \text{if } P_i = 0 \Rightarrow Q_1 = \dots = Q_m = 1 \in \text{if and only if} \end{array} \right.$

## # Conjunctive Normal Form (CNF)

A **literal** is either a propositional atom  $p$  or its negation  $\neg p$ .  
A **clause** is a disjunction  $\vee l_1 \vee l_2 \vee \dots \vee l_n$  of one or more literals  $l_i$ .

A formula is in **conjunctive normal form** if it is a conjunction of one or more clauses

$$\neg p, p \vee q, (\neg p \vee q) \wedge (r \vee \neg t \vee \neg p)$$

## # Testing validity of a formula in CNF

\* Theorem

① A clause  $\vee l_1 \vee l_2 \vee \dots \vee l_n$  is valid iff there exists  $i, j \Rightarrow l_i = \neg l_j$

② A CNF formula  $C_1 \wedge C_2 \wedge \dots \wedge C_n$  is valid if each of its clauses  $C_i$  is valid.

$\neg p \vee q \vee r \vee r$  is valid

$(\neg p \vee q \vee r) \wedge (r \vee \neg r)$  is valid

$(\neg p \vee q \vee r) \wedge (r \vee s)$  is NOT valid

$$\text{Convert } F = \bar{a}\bar{b}c + a\bar{b}\bar{c}$$

$$F = \bar{a}\bar{b}\bar{c} + \bar{a}\bar{b}c + a\bar{b}\bar{c} + a\bar{b}c + ab\bar{c}$$

$$= (a+b+c)(a+\bar{b}+\bar{c})(\bar{a}+b+\bar{c})(\bar{a}+\bar{b}+c)(\bar{a}+\bar{b}+c)$$

$$(\bar{a}+\bar{b})(a+\bar{b}) = a\bar{a} + \bar{a}\bar{b} + a\bar{b} + \bar{b}\bar{b} = \bar{b}(a+\bar{a}) + \bar{b} = \bar{b}$$

$$= \bar{b}(\bar{a}+\bar{b})(\bar{a}+\bar{c})(a+\bar{b})(\bar{b}+\bar{c})(a+\bar{c})(\bar{b}+\bar{c})$$

$$= \bar{b}(\bar{a}+\bar{c})(a+\bar{c})$$

## Davis - Putnam SAT algorithm

Given a logic network, is there a sequence of inputs such that the network returns true as output?

Let the logical network be represented by the equation:

$$\text{Davis - Putnam: Recursive Algorithm that creates a } F = (A+B)(B+\bar{C})(A+\bar{B})$$

Performing search the tree by making assignments to the remaining variables at each stage of the algorithm

Performs better than on average than an exhaustive search

- Effectively prunes failed branches from the search

(DP)

### Davis - Putnam Algorithm

```
procedure split(F) {
    if F has an empty clause, then return
    if F has no clause, then exit with current partial assignment
    select next unassigned variable,  $x_i$  in F
    split( $F(x_i=0)$ ) // 分支
    split( $F(x_i=1)$ )
}
```

$$F = (A+B)(B+\bar{C})(A+\bar{B})$$



## # DA Algorithm complexity

Unfortunately in the worst case the Davis - Putnam Algorithm is still non-polynomial wrt number of variables (exponential), i.e. NP-complete

Currently there is no known solution that provides a worst case performance better than exponential.

## # DLL Algorithm - an improvement over DP (Davis, Logemann and Loveland)

Also known as DPLL (Davis - Putnam - Logemann - Loveland) for historical reasons.

Basic framework for many modern SAT solvers

### # Basic DPLL Procedure - Depth First Search (DFS)

{ FD: Forward decision  
D: Decision

Example: Set of clauses in a CNF logic function



Forced Decision

$\xrightarrow{\text{Backtrack}} \text{B} \xrightarrow{\text{Backtrack}} \text{C}$

$\xrightarrow{\text{Conflict}}$

$a \rightarrow c \wedge d$   
 $b \rightarrow c \wedge d$   
 $c \rightarrow d$

(FD)  
 $a \rightarrow c \wedge d$

$\xrightarrow{\text{Backtrack}} \text{C}$

$a \rightarrow c \wedge d$   
 $b \rightarrow c \wedge d$   
 $c \rightarrow d$



## # Advantages off DLL over DP

Eliminates the exponential memory requirements

Exponential time is still a problem

Limited practical applicability - largest use seen in automatic theorem proving

Limited size of problems are allowed

- Problem size limited by total size of clauses (1300 clauses reported)

## # Binary Decision Diagrams - efficient Boolean space search

Store the Boolean function in a Directed Acyclic Graph (DAG) representation. Compacted form of the binary decision tree. Reduction rules to manipulate the graph.

## [Conflict Driven Learning and Non-Chronological Backtracking]

$$\begin{aligned} & a+d \\ & \bar{a} + \bar{c} + \bar{h} \\ & a + h + l \\ & b + k \\ & \bar{g} + \bar{c} + i \\ & \bar{g} + h + \bar{i} \\ & g + h + j \\ & g + j + \bar{k} \end{aligned}$$

} Conflict



## # Conflict-Driven Learning - advantages

Learned clause is useful forever

Useful in generating future conflict clauses

Can restart, i.e. abandon the current search tree and reconstruct a new one

- Adds to robustness in the solver

- The clauses learned before the restart are still in the Boolean function after the restart and can help pruning the search space

## # Boolean Constraint Propagation

Boolean Constraint Propagation (BCP) == Unit Propagation (UP) == One literal Rule (OLR)

BCP is based on unit clauses, i.e. clauses that are composed of a single literal.

If a set of clauses contains the unit clause  $U$ , the other clauses are simplified by the iterative application of the following 2 rules:

- Every clause (other than clause  $U$  itself) containing  $U$  is removed.

- In every clause that contains the negation of  $U$ :  $\bar{U}$ , the literal  $\bar{U}$  is removed.

## # BCP example

$$F = a(a+b)(\bar{a}+c)(\bar{c}+d)$$

① Since  $a(b)$  contains  $a$ , this clause can be removed

② Since  $(\bar{a}+c)$  contains the negation of  $a$ ,  $\bar{a}$  can be removed from the clause. Hence:  $F = a(\bar{c}+d)$

③ Since  $(\bar{c}+d)$  contains the negation of  $c$ ,  $\bar{c}$  can be removed from the clause. Hence:  $F = ad$

## Dynamic Power Reduction

### # Revision

Power of Digital Circuit :  $P = P_{\text{dynamic}} + P_{\text{leakage}}$

Dynamic caused by charging and discharging of capacitors  $P_d = V_{dd}^2 * F_{CK} * C_L * E_{SW}$

Switching activity depends on input patterns



OR gates consume power if input 2 of OR gate changes from 0 to 1



AND gates consume power if input 1 of AND gate changes from 0 to 1



Get switching activity (0-1 toggles)

Dynamic power reduction happens at various design levels

- Physical layout level = shorter interconnect  $\rightarrow$  smaller  $C_L$
- Transistor level = transistor with smaller  $W$   $\rightarrow$  smaller  $C_L$
- Architectural level { Switching activity reduction : Clock gating  
Vdd reduction : Multiple Vdd

length  
reason why we want 3-dimension chip  $\rightarrow$  reduce the connection length

The longer the length, the bigger the capacitance

### # Clock gating : architectural level dynamic power reduction

When clocked modules (memory, register files, ...) are inactive (stored values not changing), use clock gating to disable the clock feeding module.



D Flip-Flop



FF consumes power due to switching activity even when input is not changing (clock is changing)

Clock gating reduces switching activity/dynamic power, up to 50% reduction

### # Example 1 : Clock gating two, 3-bit counters Occur only on synchronous circuit

Each counter increments on every clock cycle when INCR1 and INCR2 high

INCR1 always high, COUNTER 1 increment every clock ; COUNTER 2 increment only once every 8 clocks, but consume power during each clock cycle.



- Clock gating logic (2AND) cause counters to be clocked by only when their increment signals high
- COUNTER 2 not clocked when INCR2 is low

### # Example 2 : Clocking gating processor register file



Hold the instruction that is being decoded

Only clock register file when instruction decoding require the use of register file

美国新思科技公司

## # RTL Clock Gating Using Synopsys Design Compiler

Example: Clock 16-bit "data\_in" into flip-flops "Flop7" when flop7 enable is "1", "enable" generated by "en\_gen" block, where enable "1" when input "count" is "111" produced by "counter8" block



- Invoke "RTL clocking gating" by the command  
-Set\_clock\_gating\_style -sequential\_cell\_latch -minimum\_bitwidth 3
- RTL clocking gating works by identifying groups of flip-flops that share common "enable" indicating new values need to be clocked.

flop7.vhd

```

entity flop7 is
    port (clk: in std_logic;
          nreset: in std_logic;
          enable: in std_logic;
          data_in: in std_logic_vector (15 downto 0);
          data_out: out std_logic_vector (15 downto 0));
end entity flop7;

```

```

architecture flop7_arch of flop7 is
begin

```

```

process (nreset, clk) is
begin
    if (nreset = '1') then
        data_out <= (others => '0');
    elsif (rising-edge (clk)) then
        if (enable = '1') then
            data_out <= data_in;
        end if;
    end if;
end process;
end flop7_arch;

```



Inserted Automatically by Power Compiler into the clock network

## # Clock gating circuits



Latch captures the state of the enable signal and hold it until next clock cycle.

$$\text{Delay} = Kd * Vdd / (Vdd - Vt)^2$$

## # Multiple Vdd

Architectural level dynamic power reduction.

- Modules on critical paths use highest Vdd (to meet required timing constraints)
- Modules on non-critical paths use lower Vdd (reducing power)



Assume  $Vdd = 2.2V$ ,  $Vt = 0.5V$ , FUs delay = 10ns

Critical path operation: 02, 05, 06, 07, 08, 09, 010, 011

Operation: 01, 03, 04 are candidates for multi-cycled FUs

Critical path length =  $10 + 10 + 10 + 10 + 10 + 10 = 60\text{ns}$

Delay of multi-cycled 01 = 50ns, 03 = 20ns, 04 = 40ns (delay冗余)

Delay of digital circuit,  $D = Kd * Vdd / (Vdd - Vt)^2$

Where D functional unit delay, Vdd supply voltage, Vt threshold voltage of the functional unit, and Kd a constant

$$\text{Consider multi-cycled FU 01} \quad \left\{ \begin{array}{l} 50 = (Kd * Vdd) / (Vdd - 0.5)^2 \dots (1) \\ 10 = (Kd * 2.2) / (2.2 - 0.5)^2 \dots (2) \end{array} \right.$$

From Eq.(2),  $Kd = 13.1$ . Substitute Kd in Eq.(1) and solve for Vdd gives  $Vdd = 1V$ .

The multi-cycled FU 01 will operate with  $Vdd = 1V$ .

Similarly, multi-cycled FU 03,  $Vdd = 1.48V$

将 01 的 Vdd 调低，1 阶运算至与 level 4 相同。

Similarly, multi-cycled FU 03,  $Vdd = 1.48V$



## Multi-Vdd Design Requirements

- Use of multiple supply voltages
- Level shifts needed after reduced voltage module
- Multi-Vdd design flow (page 4 . part 3 notes)

$$\text{For } 03 \quad \left\{ \begin{array}{l} 20 = Kd * Vdd / (Vdd - 0.5)^2 \\ 10 = Kd * 2.2 / (2.2 - 0.5)^2 \end{array} \right.$$

$$\begin{aligned} \frac{20}{10} &= \frac{Vdd}{(Vdd - 0.5)^2} \times \frac{(2.2 - 0.5)^2}{2.2} = 4.4 = \frac{2.89Vdd}{(Vdd - 0.5)^2} \\ 2.89Vdd &= 4.4(Vdd^2 - Vdd + 0.25) \\ 2.89Vdd &= 4.4Vdd^2 - 4.4Vdd + 1.1 \\ 4.4Vdd^2 - 7.29Vdd + 1.1 &= 0 \Rightarrow Vdd^2 \approx 1.48 \end{aligned}$$

In order to use Multiple Vdd at RTL level

UBF was invented.

$$\text{For } 04 \quad \left\{ \begin{array}{l} 40 = Kd * Vdd / (Vdd - 0.5)^2 \\ 10 = Kd * 2.2 / (2.2 - 0.5)^2 \end{array} \right.$$

$$4 = \frac{Vdd}{(Vdd - 0.5)^2} \cdot \frac{(2.2 - 0.5)^2}{2.2}$$

$$4(Vdd^2 - Vdd + 0.25) = 1.314Vdd$$

$$4Vdd^2 - 5.314Vdd + 1 = 0$$

$$-0.314 \pm \sqrt{0.314^2 - 4 \cdot 1 \cdot 1} \quad Vdd = 1.1V$$

Different leakage current becomes dominant depends on different scaling

## Leakage Power Reduction



- Leakage Power consumption occurs as long as the circuit is powered on
- Sub-threshold current between source and drain in MOS transistor occurs when gate voltage ( $V_{GS}$ ) is below transistor threshold voltage  $V_T$
- Sub-threshold current increases by 3x with each technology generation due to scaling of the transistor threshold voltage  $V_T$
- At 45nm, total power (60% dynamic and 40% leakage)
- $V_{DD}$  reduces with technology scaling (15% lower operating voltage, 30% smaller transistor),  $V_T$  must scale to deliver transistor performance

$$V_T \neq V_T \text{ (因漏电)} \quad V_T = V_T - \Delta V_T$$

Peakage  $\propto V_{DD} \cdot I_{leakage}$

$$I_{leakage} \propto I_0 e^{(\frac{V_{GS}-V_T}{nV_T})}, I_0 = \mu_0 C_{ox} \frac{W}{L} (n-1) \cdot V_T^2$$

technology dependent

Low  $V_T \rightarrow$  More power



consume more power but perform faster

## ⑤ Sleep Transistor (Power Gating)

- # Leakage Power Reduction
- ① Technology Level
- ② Stacking Effect
- ③ Body Bias
- ④ Adaptive Body Bias

## [Technology Level]

Many libraries are required, hard to use multiple libraries

Multiple  $V_T$  technology where fabrication process provides both high and low threshold transistor  $V_T$  (low: fast and leaky and high: slow and less leaky) for N and P transistors. Default design uses high  $V_T$  and careful replacing high  $V_T$  with low  $V_T$ , one can get low  $V_T$  performance and yet significantly lower leakage power.

Synopsys Galaxy Design Platform provides an automated way of embedding \*multi- $V_T$ \* to reduce leakage power.

△ Example: critical path delay determines performance of the design, all other paths have lower delay. Use low  $V_T$  transistors in modules on the critical path, and high  $V_T$  transistors in modules on noncritical paths.

## Multiple $V_T$ and Multiple $V_{DD}$ Design Flow



↳ Vocabulary in the slide



叠加效应

## [Stacking Effect]

Exam

Sub-threshold leakage current flowing through stacks of series connected transistors reduces when more than one transistor in the stack is turned off. This effect is known as Stacking Effect.



Principle:

When M1 and M2 are off, voltage at node M is positive due to small drain current.

Due to  $V_m > 0$ , Body-to-Source Potential ( $V_{SB}$ ) of M1 becomes negative resulting in increase in threshold voltage (larger body effect) of M1 and hence reducing the sub-threshold leakage

2-input NAND Gate

$\{V_{to} = \text{Zero Substrate Bias}$

$\{\gamma = \text{Body Effect Parameter}$

$$V_t = V_{to} + \gamma (\sqrt{(V_{SB} + 2\phi_F)} - \sqrt{2\phi_F}) \quad \leftarrow \text{Net Stacking Effect}$$

$$V_t = V_{to} \quad (\text{when } V_{SB}=0) \quad \leftarrow \text{For body bias}$$

Example: 2-input NAND gate, leakage current ( $I_S$ )  $A=B=0$ , PMOS conducting, NMOS non-conducting, leakage current through 2 stacked transistors  $200 \text{ pA}$

$A=1, B=0$  or  $A=0, B=1$ , one of the nmos in weak region, leakage current through 1 transistor  $0.46 \text{ pA}$   
 $A=1, B=1$ , both PMOS in weak region, leakage current  $= 215 = 0.90 \text{ pA}$ . For how much values were obtained

## Key observations

Exam

1. If number of stacked transistors  $> 3$  leakage current very small

2. Leakage current for transistors in parallel is the sum of the individual currents. For 2-input NAND highest leakage current occurs for input 11, similarly for NOR, highest leakage current occurs for input 00.

## Note:

It is possible to find an input vector that puts the logic in a minimal leakage state. This input vector can hard-wired into the logic and activated when the logic is not in use (standby mode)

Cost too much  
[Body Bias]



Exam

Applying reverse body bias to transistor reduces leakage and can be used to reduce standby leakage power. Alternatively apply forward body bias, which reduces  $V_t$  and improves circuit performance but increases leakage.

## [Adaptive Body Bias]

Combination of reverse and forward body bias can be applied to reduce leakage power with reduced performance or improve performance with higher leakage power.

## Case study

- Test chip is divided into 23 smaller designs, each design has circuitry to apply forward and reverse body. P transistors in each design can be biased individually, but all N transistors of the test chip have the same body bias.
- Slower dies are applied with forward bias to improve performance and faster leaky dies are applied with reverse body bias to reduce leakage
- 100% yield achieved, significant boost to high frequency binning.



Two transistors in series  
Stack effect, less power consumption

No. \_\_\_\_\_

Date \_\_\_\_\_



More transitions  $\rightarrow$  Stacking Effect  $\rightarrow$  High resistance  $\rightarrow$  Low current

## [Power Gating]

### (Exam) Principle

#### • Sleep transistors

- Insert high  $V_t$  MOSFET between power rail and logic blocks
- Header transistor off lead to logic block floating to near zero
- Footer transistor off lead to logic block floating to near 1

#### • Header switches preferred if multiple power rails since gated block logics float to zero irrespective of supply voltages

## # Isolation Cell



Follow a protocol in order to do the power gating.

Store states (avoid losing data)  
(Input isolation)

## Exam

UPF ✓

Power Gating design synthesis Flow. (Asking draw a flow)

Code

{ P2FO  
Power gating

notly about this

Gate leakage X  $\leftarrow$  Not a big problem; high grain material has been invented

isolation cell  $\rightarrow$  tracing the problem

## # State Retention Register

Retention registers are used to hold data/register states, before a power down. These states can be restored from these registers on power up. Hence retention registers are always on. Which means that even during power down mode, there will be on and will be consuming power.

A retention register consists of

- A latch or regular Flip-Flop - LVT (Low Threshold Voltage) cell for high performance. This will be given the normal power supply.
- State saving latch - High Threshold Voltage cells with different power supply. This power supply should be active during power down mode too.



In normal mode the state-saving latch will not be active. Just before power down, the value in the regular flip flop, will be transferred to the state-saving register, and the power to the regular register will be shut down. The power to the state-saving latch, should be maintained and thus it keeps the register state.

## PPT

Sleep mode: state stored in balloon (latch) which is powered by Vdd

Active mode: balloon inactive; do not interfere with normal operation

B1 and B2 pins control the saving and restoring of state. In practice, retention registers may have only 1 control pin.

## # Output Isolation

When  $V_{in}$  is floating, there will be shortcircuit current



Without output isolation, floating output causes shortcircuit current in active logics



Isolation logic clamping the floating output prevents the shortcircuit current in active logics

## # Power Gating



Require a controller to:

Clamp the power gated block output and save states before power off

Restore the states and unclamp the output after power on.



## # UPF (Unified Power Format)

① Define power domain and create supply net for each power domain

create-power-domain soc-power-domain

create-power-domain cpu-power-domain -elements tcpu

set-domain-supply-net soc-power-domain  
-primary-power-net vdd  
-primary-ground-net vss

set-domain-supply-net cpu-power-domain  
-primary-power-net sw-vdd  
-primary-ground-net vss

② Create power switches

```
create-power-switch pg-switch -domain cpu_power_domain \
    -input-supply-port !vdd vdd \
    -output-supply-port tsw-vdd sw-vdd \
    -control-port !power pg_ctrl/power \
    -on-state !on-state vdd !power \
    -off-state !off-state !power
```

### ③ Create output isolation

```
set_isolation iso1 -domain cpu_power_domain \
-isolation_power_net Vdd \
-isolation_groud_net Vss \
-clamp_value 0 \
-applies_to_outputs
```

```
set_isolation_control iso1 -domain cpu_power_domain \
-isolation_signal pg_ctrl/nclamp \
-isolation_sense_low \
-location parent
```

### ④ Create state retention

```
set_retention ret1 -domain cpu_power_domain \
-retention_power_net Vdd \
-retention_groud_net Vss
```

```
set_retention_control ret1 -domain cpu_power_domain \
-save_signal pg_ctrl / retain high \
-restore_signal pg_ctrl / retain low
```

## # Power Gating Design Synthesis Flow



### Power gating design synthesis requires:

- ① Power description (upf file)
- ② Power gating controller model (RTL)
- ③ Cell library with power gating Support
- ④ Synthesis Tool with power gating capability

```

module fifo cpower, retain, nclamp, clk, n_rst, wr_en,
rd_en, d_in, d_out, full, empty;
parameter DATA_WIDTH=8;
parameter ADDR_WIDTH=3;
parameter DEPTH=8;
input power, retain, nclamp, clk, n_rst, wr_en, rd_en;
input [DATA_WIDTH-1:0] d_in;
output full, empty;
output [DATA_WIDTH-1:0] d_out;
reg [ADDR_WIDTH-1:0] distance;
reg [DATA_WIDTH-1:0] buffer [DEPTH-1:0];
reg [ADDR_WIDTH-1:0] wr_p, rd_p;
integer i;

always@(posedge clk) begin
    if (!n_rst) wr_p <= {ADDR_WIDTH}{1'b0};
    else if (wr_en && !full) wr_p <= wr_p + 1;
end

always@(posedge clk) begin
    if (!n_rst) rd_p <= {ADDR_WIDTH}{1'b0};
    else if (rd_en && !empty) rd_p <= rd_p + 1;
end

```

FIFO (A design to be power gated)

```

always@(posedge clk) begin
    if (!n_rst) distance <= {ADDR_WIDTH+1}{1'b0};
    // if read only, decrease distance
    else if (rd_en && !empty && (!wr_en || full)) distance <= distance - 1;
    // if write only, increase distance
    else if (wr_en && !full && (!rd_en || empty)) distance <= distance + 1;
    // otherwise keep distance the same
end

always@(posedge clk) begin
    if (!n_rst) begin
        for (i=0; i<DEPTH; i=i+1) buffer[i] <= {DATA_WIDTH}{1'b0};
    end
    else if (wr_en && !full) begin
        buffer[wr_p] <= d_in;
    end
end

assign full = (distance == DEPTH) ? 1'b1 : 1'b0;
assign empty = (distance == 0) ? 1'b1 : 1'b0;
assign d_out = buffer[rd_p];
endmodule.

```

No.

Date

## # Gate Leakage

This is ~~not~~ another type of leakage current (beside subthreshold current) and happens because decrease in transistor oxide thickness which takes place with every technology node. Gate tunneling current increases by 25x for every  $1\text{A}^\circ$  decrease in oxide thickness, resulting in nearly 30x increase in gate tunneling current per technology generation. Gate leakage power needs to be considered, in particular in 45 nm and beyond.

## Energy Management

### # Energy vs Power

Power is energy consumption per unit time

Heat generation depends on power consumption. Fast chips run hot, and controlling power consumption is important for increasing reliability. By reducing the speed at which the chip operates, we can reduce its power consumption (although not the total energy required for the operation)

dynamic power

$$\downarrow P_{dyn} = V^2 \cdot dI \cdot f \cdot C_g \cdot E_g (\text{SW})$$

Energy  $\nabla P \star t \uparrow$

降低运行速度，功率损耗降低，但并不意味着总能量降低

因为速度下降，需要更多工作时间。

Battery life, depends on energy consumption which can be lowered by power consumption.

For example reducing power supply, 1.2V to 0.9V causes the power consumption to drop by  $1.2^3 / 0.9^3 = 1.78$

### # Energy Efficient Embedded Systems



#### Embedded OS

- Active power management OS provides the appropriate V/F values to run the application at the desired performance
- Idle power management OS puts part of the system into sleep and wake up based on requirement

Force on Software  
remove the leakage power  $\rightarrow$  dis: Shut down, require time to recover

### # Energy Management : system level

DPM

DVFS

Dynamic power management (DPM), and dynamic voltage frequency scaling (DVFS)

DPM: shut down processing elements (PE) when inactive.

Different algorithms

- Greedy: Go to sleep as soon as processing is finished

- Timeout: Stay on expecting a new request. After time  $t$  of idleness go to sleep

Adaptive Timeout: Vary timeout  $t$  depending on activity.

Shallow sleep states are entered and exited quickly (few clock cycles), deep sleep states cost more in time and energy (dozens of clock cycles)



$\Leftarrow$  DPM



Take a lot of time and energy from if it is in deep sleep state.



If the system wakes up every second, it should not go to a deep sleep state.

# DVFS

## # Energy Management : System level

DVFS : adapts PE performance to requirement by dynamically changing PE clock frequency and Vdd.  
 PE has non-uniform workload, executing MP3 requires 1 order of magnitude > processing performance than MP3  
 非均匀负载

Block-diagram of DVFS-enable processor



Turning off or reducing Vdd and hence power optimisation is possible due to the presence of idle time and slack time (空闲时间) in systems schedule.

In DVFS, power changes cubically with frequency-voltage scaling, and overall performance is roughly linear with clock frequency.

| Supply voltage | Frequency | Power saving | Perf - Loss |
|----------------|-----------|--------------|-------------|
| 100%           | 100%      | 0            | 0           |
| 95%            | 95%       | 14%          | 5%          |
| 90%            | 90%       | 27%          | 10%         |
| 85%            | 85%       | 39%          | 15%         |

[Wiki] Dynamic voltage scaling is a power management technique in computer architecture, where the voltage used in a component is increased or decreased, depending upon circumstance.

[Techtarget] DVFS is the adjustment of power and speed settings on a computing device's various processors, controller chips and peripheral devices to optimize resources allocation (资源分配) for tasks and maximize power saving when those resources are not needed.

DVFS allows devices to perform needed tasks with the minimum amount of required power. The technology is used in almost all modern computer hardware to maximize power savings, battery life and longevity (寿命) of devices while still maintaining ready compute performance availability.

Idle time = periods where PEs do not experience any workload

Slack time: amount of time that a PE is idle during the execution of a function. Result from tasks finishing execution before their deadline  
 slack = deadline - finish time

Why Idle and slack times exist in systems?

- not all PEs utilised constantly during run-time
- performance of PEs can not be perfectly matched to application needs: i.e. over-design commonly employed, with benefit of re-use of hardware over several product generations.

## # Energy management examples

PE performs a task  $T_x$  in 20ms and dissipates power of 500mW when running at 33MHz at  $V_{dd} = 2.5V$ , and  $V_t = 0.8V$ . To meet the application performance requirements, the task need to be repeated every 30ms.



← Figure shows a shutdown during idle time of 10ms.

$$\text{Energy consumption (DPM) of executing task } T_x : 500 \text{ mW} \times 20 \text{ ms} = 10 \text{ mJ}$$

← Figure shows DVFS exploiting of slack time, instead of shutting down the PE during the 10ms of idleness,  $V_{dd}$  can be reduced from 2.5V to 2.06V. To ensure correct operation frequency is reduced from 33MHz to 22MHz =  $(20 \times 33 / 30)$ .  $T_x$  in 20ms  $\rightarrow T_x$  in 30ms  $\frac{20}{30} \times \frac{200^2}{2.5^2} \rightarrow 226.3 \text{ mW}$

With this voltage and frequency values, the PE dissipates power of Therefore, energy consumption of executing task :

$$T_x = 226.3 \text{ mW} \times 30 \text{ ms} = 6.79 \text{ mJ.}$$

## # Energy Management Application

**Approach 1 (design-time)**: explicitly hand-code the shut down or changing clock frequency as part of the processor application software. This requires understanding of the application and processor power states.

**Approach 2 (run-time)**: automatically by the OS once the processor has identified to be inactive or the required performance of the processor to execute a particular task has been established. This requires energy-aware OS/middleware, capable of identifying processor inactive states, and determining the working frequency without violating deadline.

- characterise the application running tasks at runtime and accordingly make the V-f setting selection making use of runtime statistics (CPU utilisation, cache hit/miss ratio, ...) obtained from the hardware performance counters.

Approach 1 is mature, approach 2 still developing

$$V_{dd} = V_t + \frac{V_0}{2d^2} + \sqrt{\left(V_t + \frac{V_0}{2d^2}\right)^2 - V_t^2} \quad (1)$$

$$V_0 = \frac{(V_{max} - V_t)^2}{V_{max}} = \frac{(2.5 - 0.8)^2}{2.5} = 1.156 \quad (2)$$

$$d^r = \frac{d(V_{dd})}{d(V_{max})} = \frac{30}{20} = 1.5$$

$$V_{dd} = 0.8 + \frac{1.156}{2 \times 1.5} + \sqrt{\left(0.8 + \frac{1.156}{2 \times 1.5}\right)^2 - 0.8^2} = 2.06 \quad (3)$$

$$\frac{P_{V_{max}}}{P_{V_{min}}} = \frac{1}{d^r} \cdot \frac{V_{dd}^2}{V_{max}^2}$$

$$P_{V_{dd}} = \frac{1}{1.5} \cdot \frac{2.06^2}{2.5^2} \times 500 \text{ mW} = 226.3 \text{ mW}$$

## # Linux Power Governors 功率调节器

Powersave - Set to the lowest frequency

Performance - Set to the highest frequency

Ondemand - { Set the freq. based on the current CPU usage  
Max freq. when CPU usage is high ; decrease gradually

Conservative - Similar to ondemand ; increase / decrease frequency in steps

Userspace - User defined frequency setting

[Shell command] `sudo cpufreq-set --cpu core_no --governor name`  
cpufreq is a Linux power management tool ; "sudo apt-get install cpufrequtils" to install  
example: `sudo cpufreq-set --cpu0 --governor ondemand`



## # Interplay of DPM and DVFS

Processors support DPM and DVFS, important to understand the interplay between DPM and DVFS

Energy saved with DVFS come at the cost of increased execution time, implying shortened idle period for applying DPM, and greater leakage energy consumption.

Ideal DVFS policy must understand this interplay and hence perform v-f setting selection with the objective of reducing the overall system energy consumption.

This depends on the processor workloads characterisation

must identify idle period

- DPM, the characterisation is in terms of the distributions of idle period durations. ↪ Save power (leakage power)

- DVFS, it is in terms of CPU / memory intensiveness of the executing tasks.

DPM and DVFS outperform each other under different workloads, no single PM policy fits perfectly all operating conditions, hence hybrid power management is needed.

## # Energy Efficient Software

Individual instructions of software is the fundamental unit of software, similar to gates in hardware.

Software organization (coding) affects the energy cost of a program since this affects how processors / DSP consumes energy.

Energy cost (static) =  $B_i \cdot n_i$  (where  $B_i$  = base cost of instruction i,  $n_i$  = no. of occurrences of instruction i)

Energy cost (dynamic) =  $\sum B_{ij} \cdot n_{ij}$  (where  $B_{ij}$  = dynamic cost of sequence j,  $n_{ij}$  = no. of occurrences of sequence j)

$I = \text{current drawn by processor during execution of program}$ ,  $V = \text{Vdd}$ ,  $n = \text{no. of cycles needed to execute program}$ .  $T = \text{clk period}$ .

| Instruction | Current (mA) | Cycles |
|-------------|--------------|--------|
| NOP         | 276          | 1      |
| ADD         | 314          | 1      |
| JUMP        | 373          | 3      |
| MUL         | 428          | 1      |

Energy cost (dynamic) accounts for the previous processor state and depends on sequence of events (e.g. cache misses)

Energy cost (program) =  $\sum B_{ij} \cdot n_{ij} + \sum D_{ij} \cdot n_{ij}$

$B_{ij}$  = base cost of instruction i,  $n_{ij}$  = no. of occurrences of instruction i

$D_{ij}$  = dynamic cost of sequence j,  $n_{ij}$  = no. of occurrences of sequence j

## Low-power system-level designs : General approaches

### # System Definition

An interconnection of possibly heterogeneous resources (electronic, electro-mechanical, optical, etc) which are often separately designed.  
An interconnection of multiple cores within a chip or across several chips / boards

### # Power / Energy Optimization Opportunities

Hardware components { Optimise communications  
Programmability

Software and algorithmic considerations { Operating system  
Applications, compilation techniques and algorithms

### # Optimising communication

Observation has already been made that energy in real-life systems is to a large extend dissipated in communication channels, sometimes even more than in the computational elements  
Experiments have demonstrated that in designs, about 10%-40% of the total power may be dissipated in buses, multiplexers  
This amount can increase dramatically for systems with multiple chips.  
Power consumption of the communication channels is highly dependent on algorithm and architecture-level design decisions  
Properties of algorithms and architectures are important for reducing the energy consumption due to the communication channels:  
- locality 地点位置, regularity 正则性

- [Locality] • relates to the degree to which a system or a algorithm has natural isolated clusters of operation or storage with few interconnections between them
- Partitioning the system or algorithm into spatially local clusters ensures that the majority of the data transfers take place within the clusters (■) and relatively few between clusters



- [Regularity] • In an algorithm, it refers to the repeated occurrence of computational patterns.

- Common patterns enable the design of less complex architecture and therefore simpler interconnect structure (buses, multiplexers, buffers) and less control hardware.



### # Programmability

It allows one system to be used for many applications

More important for mobile systems because they operate in a dynamically changing environment and must be able to adapt to the new environment

- A mobile computer should be able to switch to a different network, without changing the application
- It should have the flexibility to handle a variety of multimedia services and standards
- It should have adaptability to accommodate the nomadic (nomadic) environment, required level of security, and available resources.

### [Spectrum of applications and hardware implementations]



## Software and algorithmic considerations

### # Operating system based power/energy management

### # Dynamic power management

Power management exploits periods of idleness caused by system under-utilisation

- Especially in mobile systems, the utilization is not constant

Designers naturally focus on worst-case conditions, peak performance requirements and peak utilization.

As a result, systems are also designed to operate under high utilisation, but they are actually fully exploited during a relatively small fraction of their lifetime

Dynamic power management { Binary power management

{ Dynamic power management of adaptive modules.

#### [Binary power management]

Deactivate functional units when no computation is done required

The main cost problem is to the cost of restarting a powered down module or component

Restarting induces an increase in task latency (e.g. time to restore a saved CPU state, spin-up of a disk), and possibly also an increase in energy consumption (e.g. due to higher start-up current in disks)

The two main questions involved are then: 1) When to shut-down, and 2) when to wake-up

↳ a predictive approach, where the system initiates a wakeup in advance of the predicted end of an idle interval, works better

#### [Dynamic power management of adaptive modules] this equivalent to DVFS

System performance measure

Computer subsystems may be designed for multiple levels of reduced energy consumption at the expense of some other

Key to the approach is a high degree of adaptability of the modules.

It is needed to adapt to the changing operating conditions dynamically in the most (energy) efficient way

Using this model, the inherent trade-offs between e.g. performance and energy consumption can be evaluated and a proper adaptation of the whole system can be made.

## High-level Synthesis Steps - Scheduling

### # Typical high-level synthesis steps

High-level spec in C++/  
SystemC or other



Intermediate representation

e.g. CDFG (Control Data Flow Graph)



Optimise CDFG



Perform Scheduling and  
determine resources  
(order in which operations execute)



Perform binding (map operations to hardware)



Generate RTL model

### # Example - high-level synthesis from C code by hand

// Shift and add multiplication

$P = M * Q$ , n-bits

$P=0$ ;

for ( $i=0$  ;  $i < n$  ;  $i++$ )

{

    if ( $Q[i] == 1$ )

$P = P + M$ ;

$P = P \ll 1$ ;

① Step 1: Build CFG for controller (an FSM)



② Step 2 : Determine datapath resources



④ Step 4 : Add control signals

⑤ Step 5 : Combine control and datapath.



## # Automated high-level synthesis

Determine when to perform each operation

- Scheduling
- Consider alternatives and optimise

Allocate resources for each operation

- Resource allocation
- Consider alternatives and optimise, for performance

Map operations onto resources

- Binding

## # Optimisations

After creating CDFG, optimise the graph

Data flow optimisations

Control flow optimisations

Cost functions

- Area (CH/S size)
- Performance (speed)
- Latency
- Power / energy

## # Scheduling

Scheduling assigns a start time to each operation in DFG

- Start times must not violate dependencies in DFG
- Start times must meet performance constraints (Also, resource constraints)

Performed on the DFG at each CFG node (each state)

- Difficult to process multiple CFG nodes in parallel.

Data flow graph

Example:



## # Typical problems in scheduling

Several types of scheduling issues

- Usually some trade-off of performance and resource constraints

Issues:

- Unconstrained scheduling
  - Not very useful, every schedule is valid
- Minimum latency scheduling - can be found through optimisation
- Latency constrained scheduling
- Resource constrained scheduling
- Minimum-latency, resource constrained scheduling
  - Find the schedule with the shortest latency, that uses less than a specified # of resources
  - NP - Complete problem (因线性规划而使问题无法解决)
- Minimum - resource, latency constrained scheduling
  - Find the schedule that meets a given latency constraint, and uses the minimum # of resources
  - NP - Complete !

## # Minimum Latency Scheduling

ASAP (as soon as possible) algorithm

- ① Find a candidate node in DFG
  - Candidate is a node whose predecessors have been scheduled (or no)
- ② Schedule node one cycle later than max cycle of predecessor
- ③ Repeat until all nodes scheduled.



ALAP (as late as possible) algorithm

- Run ASAP, get minimum latency L
- Find a candidate node in DFG
  - Candidate is node whose successors are scheduled (or there are none)
- Schedule node one cycle before start cycle of successor
  - Nodes with no successors scheduled to last cycle L
- Repeat until all nodes scheduled.

## # Latency-Constrained Scheduling

Instead of finding the minimum latency, find latency at most that of a given constraint L

- Solutions:

- ① Use ASAP, verify that minimum latency is at most L
- ② Use ALAP starting with cycle L instead of minimum latency (no need to run ASAP first, as L is given)



## # Scheduling with Resource Constraints

Scheduling must use less than specified number of resources.



## # Minimum - Latency , Resource - Constrained Scheduling



## # Hu's algorithm - example

Basic idea

- (number of resources)
- Input : Data Flow Graph  $G$ , resource constraint  $r$
  - 1. Label each node in  $G$  by the longest path passing through it.
  - 2. Initialise cycle count:  $L=1$
  - 3. Repeat
    - Determine the subset  $V$  of unscheduled nodes in  $G$  whose predecessors have been scheduled or there are none
    - Select subset  $S$  of  $V$  such that  $|S| \leq r$  and labels in  $S$  are largest
    - Schedule the  $S$  nodes at cycle  $L$
    - $L \leftarrow L + 1$

Until all nodes in  $G$  are scheduled.



Assume  
 $r=3$

Longest path, node  
scheduled  
at cycle 1 : A, B, C



## # Summary

Scheduling assigns each operation in a DFG a start time  
- Done for each DFG in the CDFG

## Types of Scheduling algorithms

- Minimum latency
- ASAP, ALAP
- Latency constrained
- ASAP, ALAP
- Minimum - latency , resource - constrained
- Hu's algorithm
- List scheduling (extension of Hu's algorithm)
- Minimum - resource , latency - constrained .
- List scheduling

Next {

## # Extension to Hu's Scheduling algorithm

## Common Extensions:

- Multiple resource types
- Multi-cycle operation, e.g., multiplication may take 2 cycles



## # Minimum latency, resource-constrained scheduling

## List-scheduling

- Extension for multiple resource types
  - Basic idea - run Hu's algorithm for each resource type
1. Input: graph, set of constraints  $R_t$  for each resource type  $t$
  2. Label nodes based on max distance to output
  3. For each resource type  $t$ 
    - 3.1. Determine candidate nodes,  $C_t$  (no predecessors or scheduled predecessors)
    - 3.2. Schedule up to  $R_t$  operations from  $C_t$  based on priority, to current cycle.  
- Where  $R_t$  is the constraint on resource type  $t$ .
  4. Increment cycle, repeat from 3, until all nodes scheduled.

Step ①:



Candidates Candidate, but not scheduled due to Low distance to output

| Mult   | ALU | Cycle |
|--------|-----|-------|
| Step 3 | 1   | 1     |
| Step 4 | C1  | 2     |
| Step 3 | 5,6 | 3     |
|        | 4   | 3     |
|        | 7   | 3     |
|        | 8   | 4     |

EX { 2 ALU } { 2 MUL }



| MUL | ALU | Cycle |
|-----|-----|-------|
| A,D | B,C | 1     |
| E   | -   | 2     |
| F   | -   | 3     |
| D   | -   | 4     |
| -   | G   | 5     |
| -   | H   | 6     |



EX 1



|            |      |     |
|------------|------|-----|
| ALU        | MUL  | Cyc |
| A, D, H, I | B, C | 1   |
| E, D, I    | C    | 2   |
| F, D       | J    | 3   |
| G, K       | =    | 4   |

EX 2



|      |      |     |
|------|------|-----|
| ALU  | MUL  | Cyc |
| A, B | C, D | 1   |
| -    | -    | 2   |
| -    | E    | 3   |
| -    | D    | 4   |
| F    | -    | 5   |
| -    | -    | 6   |



## # Scheduling for minimum resources with latency constraints

If no resource constraints are given, schedule should determine numbers of required resources.  
- Max # of each resource type used in a single cycle.

## Minimum - Resource Latency - Constrained Scheduling

- Optimise : for all schedules that have latency less than the constraint,  
find the one that uses the fewest resources

List scheduling for minimum resources

Basic ideas :

- 1. Compute latest start times for each node using ALAP with specific latency constraint
  - Latest start times must take into account multiple multicycle operations
- 2. For each cycle and for each resource type
  - 2.1 Determine candidate nodes
  - 2.2 Compute slack for each candidate      Slack = current cycle - latest possible cycle
  - 2.3 Schedule the nodes with 0 slack  
Update required number of resources (start with 1 of each at the beginning)
  - 2.4 Schedule the nodes that require no extra resources
- 3. Repeat loop 2. until all nodes scheduled.



Last Possible Cycle

| Node | LPC |
|------|-----|
| A    | 1   |
| B    | 1   |
| C    | 1   |
| D    | 3   |
| E    | 2   |
| F    | 2   |
| G    | 3   |



# Example: minimum resources with latency constraints

① Find ALAP schedule

Latency Constraint = 3 cycles

② For cycle and for each resource type

- 2.1 Determine candidate nodes C
  - 2.2 Compute slack for each candidate
- Slack = current cycle - last possible cycle

Candidate {A, B, C, D}  
Initial Resource = 1 MUL, 1 ALU

2.3 Schedule candidate nodes with 0 slack

Update required number of resources

2.4 Schedule candidate nodes that require no extra resources.

Candidates = {A, B, C, D}



Original

| Node | LPC | Slack | Cycle |
|------|-----|-------|-------|
|------|-----|-------|-------|

| Node | LPC | Slack | Cycle |
|------|-----|-------|-------|
| A    | 1   | 1-0   | 1     |
| B    | 1   | 1-0   | 1     |
| C    | 1   | 1-0   | 1     |
| D    | 3   | 3-1   | 2     |
| E    | 2   |       |       |
| F    | 2   |       |       |
| G    | 3   |       |       |

Note: Node D requires 1 more ALU  
Not scheduled.

② For cycle and for each resource type ...

2.3 ...

2.4 ...

Candidate = {D, E, F}

Resources = {2 MUL}



Node LPC Slack Cycle

| Node | LPC | Slack | Cycle |
|------|-----|-------|-------|
| A    | 1   |       | 1     |
| B    | 1   |       | 1     |
| C    | 1   |       | 1     |
| D    | 3   | 3-2   | 1     |
| E    | 2   | 2-2   | 2     |
| F    | 2   | 2-2   | 2     |
| G    | 3   |       |       |

Notes: require no extra resource thus being scheduled

② Candidate = {G}

Resource = {2 ALU}

# Other extensions of Hu's algorithm

Chaining - Multiple operations in a single cycle



Pipeline

Input: DFG, required data delivery rate

For fully pipelined circuit, one resource needed per operation

# Power constrained scheduling = based on scheduling for minimum resources with latency constraints

Each resource has an additional attribute: power consumption

Modification of Hu's algorithm for minimum power consumption

- 1. Start with only one type of each resource
- 2. Complete latest start times for each node using ALAP with specified latency constraint
  - Latest start times must take into account multi-cycle operations
- 3. For each cycle and for each resource type

3.1 Determine candidate nodes

3.2 Compute slack for each candidate Slack = current cycle - last possible value

3.3 If there are nodes with 0 slack, schedule all of them - ask for extra resources if necessary

3.4 If there are no nodes with 0 slack, schedule only one node from those with the lowest slack arbitrarily, but first try a node with the highest power consumption.

- 4. Repeat loop 3. until all nodes scheduled.

1. Power : ALU - 1mW MUL - 5mW

2. Find ALAP (Latency Constraint = 3 cycles)

3. For cycle and for each resource type (3.1 + 3.2)

Node LPC Slack Cycle Candidate = {A B C D}

A 1  $\frac{1}{0}$  1 } Initial Resources = { 1 ALU  
MUL

B 1  $\frac{1}{0}$  1 } 3.3 Resources { 1 ALU  
Power = 7mW

C 1  $\frac{1}{0}$  1 }

D 3  $\frac{3-1}{3-2}$  2 | 0 3 }

E 2  $\frac{2-2}{0}$  2 } { 4, 5, 6 }

F 2  $\frac{2-2}{D}$  2 } Resources { 2 ALU  
Power = 10mW BUT do not schedule !!!

G 3  $\frac{3-3}{0}$  3 } { 4, 7 }

There are spare ALUs

Resource { 2 ALU Power = 2mW

## High-level Synthesis Steps - Blinding and resource sharing

### # Binding 捆绑, 装订

Scheduling determines

- When ops will execute
- How many resources are needed

Binding determines which ops execute on which resources

- If multiple ops use the same resource
- Resource sharing

Basic idea: given a schedule, map operations onto resources such that operations in same cycle do not use same resource



Many binding possibilities and many side effects

- Band binding may increase resources, require huge steering logic, reduce clock rate

Can't bind same resource to more than op in a cycle

- 1 resource can't perform multiple ops simultaneously



How to automate binding?

- More graph theory

Binding uses compatibility graphs

Compatibility graph, definition:

- Each node is an operation
- Edges represent compatible operations between a pair of nodes
- Compatible means that two ops can share a resource
  - Compatible ops use same type of resource (ALU, etc.) and are scheduled to different cycles.



Note: Fully Connected subgraphs can share one resource (all involved nodes are compatible).  
e.g. nodes {2,4,7,8} in ALU subgraph can share a single ALU



## # Compatibility Graph

Binding goal: Find minimum number of fully connected subgraphs that cover entire graph.

- Well-known problem in graph theory: Clique partitioning (NP-complete) 因子圖



Cliques =  $\{\{2, 8, 7, 4\}, \{3\}, \{1, 5\}, \{6\}\}$

$\left\{ \begin{array}{l} \text{ALU1 executes } 2, 8, 7, 4 \\ \text{ALU2 executes } 3 \\ \text{MUL1 executes } 1, 5 \\ \text{MUL2 executes } 6 \end{array} \right.$ 
 另一种回答



- ① Add resources and registers
- ② Add mux for each input
- ③ Add input to left mux for each left input in DFG
- ④ Do same for right mux
- ⑤ If only 1 input, remove mux.

## # Left Edge Algorithm

Alternative to clique partitioning

- Name 'left edge' adopted from graph theory

- Each node in scheduled DFG has a 'left edge' - start point & 'right edge' - end point



- 1) Initialize right\\_edge to 0
- 2) Find a node  $N$  whose left edge is  $\geq$  right\\_edge
- 3) Bind  $N$  to a particular resource
- 4) Update right\\_edge to the right edge of  $N$
- 5) Repeat from 2) for nodes using the same resource type until right\\_edge processes all nodes
- 6) Repeat from 1) until all nodes bound.

## # Extensions

Algorithms presented here will find a valid binding

- But they do not consider amount of steering logic required
- Different bindings can require significantly different # of muxes

Possible solution

- Extend compatibility graph
- Use weighted edges/nodes to introduce cost function representing steering logic
- Perform clique partitioning, finding the best set of cliques that minimize weight

## # Binding Summary

Binding maps operations onto physical resources - Determine sharing among resources

Binding may greatly affect steering logic (控制逻辑)

Binding becomes trivial for fully pipelined circuits - 1 resource per operation

Translation from bound DFG to data path hardware is straightforward.

## # Summary of main HLS steps

HLS tool front-end 前端 parsing / compiling converts user code into intermediate representation 解釈 - We considered CDFG

Scheduling assigns a start time for each operation in DFG

- CFG nodes start times are defined by control dependencies
- Resource allocation determined by schedule

- Determines how resources are shared

Binding maps scheduled operations onto physical resources

Final result:

- Scheduling / Bound DFG can be translated into a data path
- CFG can be translated to a controller FSM
- This way High-level Synthesis can create a custom circuit for any CDFG.

$$a^2 + a^*b + 2^*c \quad \{ \begin{array}{l} 1 \text{ ALU} \\ 2 \text{ MUL} \end{array}$$

ASAP



| Cycle | ALU | MUL   |
|-------|-----|-------|
| 1     | -   | A B C |
| 2     | D   | C     |
| 3     | E   |       |

ALAP



|                            |                   |                                 |
|----------------------------|-------------------|---------------------------------|
| Cycle 1: $C = \{A, B, C\}$ | A 1<br>B 1<br>C 2 | LPC arbitrary schedule A 5mW    |
| Cycle 2: $C = \{B, C\}$    | B 0<br>C 1        | $\Rightarrow$ Schedule B 5mW    |
| Cycle 3: $C = \{C, D\}$    | C 0<br>D 0        | $\Rightarrow$ Schedule C, D 6mW |
| Cycle 4: $C = \{E\}$       | E 0               | $\Rightarrow$ Schedule E        |

$$\begin{cases} x = a^2 - 0.5^*b - c \\ y = a^2 - 2^*(b+c) \end{cases} \quad \{ \begin{array}{l} 1 \text{ ALU} \\ 1 \text{ MUL} \end{array}$$



| Cycle | MUL | ALU |
|-------|-----|-----|
| 1     | B C | A   |
| 2     | -   | -   |
| 3     | C D | -   |
| 4     | -   | -   |
| 5     | D E | -   |
| 6     | -   | G   |
| 7     | -   | F   |

$$\textcircled{1} \quad C = \{A, B^*, C^*\} \quad R = \{1 \text{ MUL}, 1 \text{ ALU}\} \Rightarrow S = \{A, B^*\}$$

$$\text{LPS} \quad \text{Slack time}$$

|    |   |     |
|----|---|-----|
| ✓A | 3 | 3-2 |
| ✓B | 3 | 3-2 |

$$\textcircled{2} \quad C = \{D^*, C^*\} \quad R = \{1 \text{ MUL}, 1 \text{ ALU}\} \Rightarrow S = \{C^*\}$$

|    |   |     |
|----|---|-----|
| ✓C | 3 | 3-2 |
|    | 2 | 1-0 |

$$\textcircled{3} \quad C = \{D^*\} \quad R^* = \{2 \text{ MUL}, 1 \text{ ALU}\} \Rightarrow S = \{D^*\}$$

|    |   |     |
|----|---|-----|
| ✓C | 3 | 3-2 |
| D  | 4 | 4-3 |
|    | 2 | 1-0 |

$$\textcircled{4} \quad C = \{E\} \quad R^* = \{2 \text{ MUL}, 1 \text{ ALU}\} \Rightarrow S = \{E\}$$

|   |   |     |
|---|---|-----|
| E | 5 | 5-0 |
| F | 6 | 6-0 |
| G | 6 | 6-0 |

$$\textcircled{5} \quad C = \{F, G\} \quad R^* = \{2 \text{ MUL}, 2 \text{ ALU}\} \Rightarrow S = \{F, G\}$$

|   |   |     |
|---|---|-----|
| E | 5 | 5-0 |
| F | 6 | 6-0 |
| G | 6 | 6-0 |



Step 1: ALAP

| R = {IALU, IMUL} | C = {A, B, C} | S = {A, C} | 1 | LPS               | slack | Cycle |
|------------------|---------------|------------|---|-------------------|-------|-------|
| R = {IALU, IMUL} | C = {B, E}    | S = {X}    | 2 | A                 | 3     | 1     |
| R = {IALU, IMUL} | C = {B, E}    | S = {B}    | 3 | B                 | 3     | 1     |
| R = {IALU, IMUL} | C = {E}       | S = {E}    | 4 | R* = {IALU, 2MUL} | D     | 5     |
| R = {IALU, 2MUL} | C = {D}       | S = {D}    | 5 | E                 | 4     |       |
| R = {IALU, 1MUL} |               | S = {F, G} | 6 | F                 | 6     |       |
|                  |               |            |   | G                 | 6     |       |

find an assignment to variables  $x_0, x_1, \dots, x_n$  such that the boolean function  $f(x_0, x_1, \dots, x_n) = 1$  or prove that there are none. BDDs provide an representation of boolean function in binary tree. In contrast to the truth table and algebraic function, BDDs provide an efficient means of recursive analysis that tackle the state of more complex SAT problems.

At temperature  $T$  the current design  $x^*$ . If the simulated energy, e.g. the cost function  $f(x)$  has decreased, move to this new design pair. Generate a random perturbation  $x'$  of  $x^*$ . If the energy  $f(x')$  has increased, the new point is accepted due according to the probability of the boltzmann distribution,  $p = e^{-\frac{f(x') - f(x)}{kT}}$ .  $kT \rightarrow K$  constant. Then the system reaches the thermal equilibrium, reduce the temperature and repeat. However, it is hard to find a global optimum b/c the presence of many local optima provided it has long enough. Slow performance, highly dependent on the heuristic concerning the cooling schedule and probability of accepting a worse move.



- 1. GA is easily used in multi-objective optimization
- 2. Is capable of finding a global minimum
- 3. Blend itself easily implementation on parallel platform
- 4. Easy to be synthesized both for simulation & physical design
- GA has slow performance compared to other stochastic methods
- GA is very slow w.r.t. the first two + has a bug

V<sub>d</sub>

Stop - clock

locking output

retiree states

power off



```
struct adder: SC_MODULE {
    // ports, processes, internal data
}
```

```
SC_MODULE(adder) {
    // ports, processes, internal data
}
```

## # SystemC module

Module is the basic building block in SystemC

SystemC has a macro for convenient module definition

- Macro name : SC\_MODULE
- Macro implements C++ class inheritance

## # Module constructor

Module constructor is macro

- Macro name : SC\_CTOR
- SC\_CTOR must have module name as an argument

```
SC_MODULE(adder) {
```

```
// ports, processes , internal data
```

```
SC_CTOR(adder) {
```

```
// body of constructor
```

```
// process declaration , sensitivities
```

```
}
```

## SystemC

### # What is SystemC?

Superset of C++ - SystemC is a library of classes and macros designed to facilitate hardware modelling.  
 SystemC supports hardware and software co-design

SystemC has a rich set of data types to model hardware and software systems.  
 It allows multiple abstraction levels, from high level down to cycle-accurate RTL level.

`SC_MODULE()` ← is a macro that defines a new class for a SystemC module

### # Why SystemC?

Arbitrary levels of abstraction { From RTL level to system-level transaction modelling  
 Hardware/Software Co-design

Simulation performance is high { High-performance C++ compilers  
 Illusion of concurrency

High productivity in system development

### # SystemC design flow



### # SystemC module hierarchy



### # SystemC jargon

#### [Module]

- Basic building block for structural partitioning
- Contains ports, processes, data
- Other Modules

#### [Process]

- Basic specification mechanism for functional description
- Three types of processes
  - `sc_method` = sensitive to some ports/signals, no wait statements
  - `sc_thread` = sensitive to some ports/signals, with wait statements
  - `sc_cthread` = sensitive to only clock

#### [Interface]

- A set of operations for communication
- No specification about the actual implementation
- For example, a signal supports read and write operations

#### [Channel]

- An implementation of the interface operations
- Performs the actual data transfer
- Vary from basic signals to complex protocols

#### [Port]

- Objects for inter-module communication
- Each port is a class and specifies communication interface.

# Space !!!!

<SC\_Unit<8>>

我录!!

No.

Date

背景颜色 Tools → Edit Preferences → more windows

# design.cpp

```
#include "systemc.h"
nodele
SC_MODULE(first_counter) {
    SC_in<sc_clk> clock; //Clock input
    SC_in<sc_bv<4>> reset; //Syn Reset
    SC_in<sc_bv<4>> enable; //Enable
    SC_out<sc_uint<4>> counter_out; //4-bit vector output

    SC_unit<4> count; //Local variable

    functionality void incr_count() {
        if(reset.read()==0) {
            count = 0;
            count_out.write(count);
        } else if(enable.read()==1) {
            count = count + 1;
            count_out.write(count);
            cout << "@" << sc_time_stamp() << ":: Incremented Counter"
                << count_out.read() << endl;
        }
    }
}
```

Constructor Block  
Method Sensitive List

}

SC\_METHOD  
SC\_THREAD

SC\_method will run more than once by design and  
methods cannot be suspended by a wait statement.

# testbench.cpp

```
# include "systemc.h"
# include "design.cpp"

int sc_main (int argc, char* argv[]) {
    sc_signal<sc_bv<4>> clock;
    sc_signal<sc_bv<4>> reset;
    sc_signal<sc_bv<4>> enable;
    sc_signal<sc_unit<4>> counter_out;

    int t=0;
    first_counter counter ("COUNTER");
    counter.clock(clock);
    counter.reset(reset);
    counter.enable(enable);
    counter.counter_out(counter_out);

    SC_start (1, SC_NS);

    //Open VCD file
    SC_trace_file *wf = sc_create_vcd_trace_file ("counter");
    //Dump the desired signals
    SC_trace(wf, clock, "clock");
    SC_trace(wf, reset, "reset");
    SC_trace(wf, enable, "enable");
    SC_trace(wf, counter_out, "count");

    //Initialize all variables
    reset=0;
    enable=0;
    for (t=0; t<5; t++) {
        clock=0;
        SC_start (1, SC_NS);
        clock=1;
        SC_start (1, SC_NS);
    }
    ...
}
```

## SystemC

## [Work Example]

# add1.cpp

```
#include "systemc.h"

SC_MODULE(BIT_ADDER) {
    SC_IN<SC_LOGIC> a, b, cin; } Input
    SC_OUT<SC_LOGIC> sum, cout; } Output

    SC_CTOR(BIT_ADDER) { Constructor Block
        SC_METHOD(process); // Declare method
        SENSITIVE << a << b << cin; } Sensitivity list
    }

    void process() { Define the functionality
        SC_LOGIC aANDb, aXORb, cinANDaXORb;
        // Declare variables of type "logic"
        aANDb = a.read() & b.read();
        aXORb = a.read() ^ b.read();
        cinANDaXORb = cin.read() & aXORb;

        sum = aXORb ^ cin.read();
        cout = aANDb | cinANDaXORb;
    }
}; // semicolon
```

# add1\_tst.cpp

#include "systemc.h" Write to the inputs and read the outputs

```
SC_MODULE(testbench) {
    SC_OUT<SC_LOGIC> A_p, B_p, CIN_p;
    SC_IN<SC_LOGIC> SUM_p, COUT_p;
    SC_CTOR(testbench) { CTOR: Construct. Block
        SC_METHOD();
        SC_THREAD(process);
    }

    void process() {
```

```
A_p = SC_LOGIC_0;
B_p = SC_LOGIC_0;
CIN_p = SC_LOGIC_0;
wait(10, SC_NS);
print();
SC_stop(); // End simulation
```

```
void print() {
    cout << "At time " << sc_time_stamp() << ":" ;
    cout << "(a, b, carry-in): ";
    cout << A_p.read() << B_p.read() << CIN_p.read();
    cout << " (sum, carry-out): " << SUM_p.read() << COUT_p.read() << endl;
}
```

# add1\_main.cpp

```
#include "add1.cpp"
#include "add1_tst.cpp"

int sc_main(int argc, char* argv[]) {
    // Declare signals to be tied to the modules
    sc_signal<SC_LOGIC> A_s, B_s, CIN_s, SUM_s, COUT_s;

    // Instantiate
    BIT_ADDER adder1("BitAdder1");
    adder1 << A_s << B_s << CIN_s << SUM_s << COUT_s;

    testbench test1("TestBench1");
    test1 << A_s << B_s << CIN_s << SUM_s << COUT_s;

    SC_start(2000, SC_NS);

    return 0;
}
```

将 Testbench 模块输出

插入 Adder 模块输入

PS: 主体分为三层

① Module 层, 加法器的代码

② Testbench 层, 生成输入逻辑, 读取仿真结果

③ Top-level 层, 实体化模块, 模块间连接

Campus 无线装订本  
B5 40页  
WCN-CNB1430



6 957748 309932  
MADE IN CHINA

国誉商业(上海)有限公司  
<http://www.kokuyo.cn/st/>

上海市奉贤区人杰路128号

TEL : 400-820-0798 FAX : 021-3255-8508

产地 : 上海市 QB/T1438-2007合格

B5 252×179mm

● 采用日本进口纸张，纸质细滑，牢固不掉页。