

- 30/7
1. High level synthesis (architectural syn)
  2. logic synthesis
  3. Physical Design Automation

- VLSI design flow.



- optimization

- a) time (performance) (2)
- b) power / energy (3)
- c) area (1)

→ area

area/performance (AP)

area/performance/testability (APT)

chronological order  
in which features  
were optimized

order of optimizat<sup>n</sup>

- Performance / area / testability (PAT)
- Perf / power / area / test
- Power / perf / area / test / area (PPAT / PPTA)
  - nowadays not many care abt it
  - ⇒ more area & more static power

{ Security : i) 0 & 1 dissipate diff amt of power  
                     ⇒ if we analyze the power traits we can extract the key, encrypting a msg  
                     ii) 0 & 1 take diff amt of time  
                     ⇒ analyze time taken }

- power / perf / test / area / security (PPTAS)

{ to solve i) equalize power dissipat<sup>n</sup> by doing dummy operat<sup>n</sup>

→ however, perf would lower  
 (conflicting req.) }

- power / perf / test / area / security / intelligence (PPTAS)

## • WAYS OF SPECIFYING A CIRCUIT

(1)



⇒ Top-down approach

- based on experience we can skip some steps

(II)

Behaviour



Structure

Bottom up approach  
(however we have some idea how much area the modules cover)

- makes structure more precise
- but not optimized

• ALGORITHMS  $\xrightarrow{(HLS)}$  RTL

(High level synthesis)  
→ gives macroscopic view of design

- components (data path) → functional units
- controller (FSM) (depends on req. spec.)

Data path

- i) Functional units
- ii) memory (temporary storage)
- iii) steering logic → for communication we can have
  - broadcast (BUS)
  - one-to-one / point-to-point (multiplexers)

~~an~~ we want to perform

$$a = b + c$$

$$d = a + e$$



(M1) Binding



(one cycle)

Performance should be optimized only, as much as long as it makes a difference

- assigning which ALU will do what operation
- this approach helps when we want do it in small amt of time
- however area not optimised



5/8

HLS

Algorithm  $\longrightarrow$  RTL / Data Path

- $\rightarrow$  FU
- $\rightarrow$  mem (reg, steering logic)
- $\rightarrow$  steering logic

Broadcast P-to-p  
(Bus)  
(MUX)

Controller  $\rightarrow$  FSM

### Area + Performance

FU  
SL

time to execute program

it can be optimized by changing parameters and constraint

(AP)  
constraint  $\Rightarrow$  compromise



die

we want to produce more chips (on one die)  
 $\rightarrow$  increase the yield (no. of working chips)

- defect density has almost stayed const. over the years  
 $\rightarrow$  determines how many dies to have on a chip

$\Rightarrow$  Area:

- i) yield
- ii) power

- performance: optimize only as much as req.  
 $\Rightarrow$  ~~extra~~ budget can be used on optimizing area

- need to have a fair idea about the area and performance of FUs
- Latency: delay between launching op and getting the output
  - $\rightarrow$  but by pipelining we can improve the throughput
  - $\rightarrow$  whenever there are dependencies, latency matters; otherwise throughput matters

$\Rightarrow$  Performance



$$\text{eqn} \quad \frac{d^2y}{dx^2} + 3u \frac{dy}{dx} + 3uy = 0$$

$$\left. \begin{array}{l} y(0) = y \\ \frac{dy}{dx}(0) = u \\ u=0 \end{array} \right\} \text{initial cond'n}$$

iteration {

$$x_l = x + dx$$

$$y_l = y + u dx$$

$$u_l = u - 3u dx - 3y dx$$

check if  $|c| < \alpha \Rightarrow \text{terminate} \}$

• Dataflow graph



- if we use one FU for operation then  
clock cycle = time taken by 2 multiplier & 2 ALU
  - however if we recycle the FU then the minimum  
time taken = 4 cycles (to get uL)
  - we need only 5 multiplier and 2 ALU
  - SCHEDULING: Placing operation in time
  - BINDING: Placing operation in space
  - Since Op of M1 fed back to M1, it helps to  
develop the steering logic
    - ⇒ our interest is in data dependency rather  
than the actual data
  - Upon removing the data from data flow  
graph we get a SEQUENCING GRAPH
- attach all data nodes to NOP

→ Scheduling

| time  | operation        |
|-------|------------------|
| $t=1$ | 1, 2, 3, 4, 5, 6 |
| $t=2$ | 7, 8, 9, 10      |
| $t=3$ | 11               |
| $t=4$ | 12               |

→ Binding

| Multiplexer | operat <sup>n</sup> |
|-------------|---------------------|
| 1           | 1, 7                |
| 2           | 2, 8                |
| 3           | 3, 9                |
| 4           | 4                   |
| 5           | 5                   |

| ALU | operat <sup>n</sup> |
|-----|---------------------|
| 1   | 6, 9, 11, 12        |
| 2   | 10                  |

→ moving y<sub>l</sub> cycle down 1 cycle helps us reduce 1 multiplier

6/8



⇒ Sequencing Graph

- $\lambda = 4$  (latency)
- Ad-hoc method



- Scheduling: Placement of operation in time (define start time)

Binding: Placement of operations in space (define a physical resource)

- $t_i^s$ : start time of FU

$d_i$ : delay of FU

$$\Rightarrow t_3^s = \max \{ t_1^s + d_1, t_2^s + d_2 \} \rightarrow \text{puts a constraint of scheduling order}$$

$$\Rightarrow t_i^s = \max \{ t_j^s + d_j \}$$

If we have a constraint of resources as well

$\rightarrow$  1 Mul, 1 ALU



$\Rightarrow$  SCHEDULING UNDER RESOURCE CONSTRAINT

$\lambda \leftarrow$  latency

Here,  $\lambda = 7$  + Ad-hoc method

Need to develop a systematic methodology

- ASAP: As soon as possible
  - ⇒ schedule any node ASAP if predecessors have been scheduled

$$t_i^s = \max_j (t_j^s + d_{ij}) \quad j \rightarrow \text{predecessors}$$

→ execution time of NOP = 0 (since it's a function we added)



$$\lambda = 5 - 1 = 4$$

time taken by both NOP

⇒ This gives the optimal performance  
however area not optimized

- ALAP: As late as possible

⇒ schedule operations from the bottom

$$t_i^s = \min_j (t_j^s - d_{ij}) \quad j \rightarrow \text{predecessor}$$



- ⇒ time remains same
- ⇒ Resources needed now:
  - 2 multiplier, 3 ALU (if multiplier conflict than ALU, then we choose ALAP)
- for scheduling
- Now, ASAP & ALAP tell us the range of time for each operation and hence puts a bound on performance
  - eg: operation 1 & 2 has to be done in  $t_1$ , operation 3 can be done in  $t_1$  or  $t_2$

2/8 Having ASAP & ALAP we know the performance constraint. We can optimize area to reduce power consumption now.

- Operation 1,  $t_1^S = 1$  (ASAP)
- $t_1^E = 3$  (ALAP, if  $\lambda = 6$ ) → provides "SLACK"
- { Hence it can start at  $t = 1, 2, 3$  (for  $\lambda = 6$ ) } → flexibility (can optimize area)

→ Variants: LP, ILP, MILP, ZOLP

↓  
mixed <sup>two or one</sup>  
→ Need to be able to map scheduling program  
to ILP

Q1) 400 wooden planks, 450 person hours  
To construct a

chair: 5 wood + 10 person hrs = Rs 100 profit

table: 10 wood + 15 person hrs = Rs 150 profit

How many chairs and tables to  
maximize profit?

$$\rightarrow \text{Profit} = 100C + 150T$$

$$C = \frac{5W+10P}{5} \min\left\{\frac{400}{5}, \frac{450}{10}\right\} \quad \begin{array}{l} \text{if we want to} \\ \text{make only} \end{array}$$

$$T = \frac{10W+15P}{10} \min\left\{\frac{400}{10}, \frac{450}{15}\right\} \quad \begin{array}{l} \text{chair/table} \\ \text{constraint} \end{array}$$

$$W \leq 400, P \leq 450$$

$$5C + 10T \leq 400 \quad (\text{constraint on wood})$$

$$10C + 15T \leq 450$$

$$C \geq 0$$

$$T \geq 0$$



- But we might get more profit by simply selling woods and human hours  
we need to determine the selling rate/price

$$w \geq 0 \quad \left\{ \begin{array}{l} \text{price of wood} \\ \text{or human hours} \end{array} \right.$$

$$p \geq 0$$

$$\Rightarrow 5w + 10p \geq 100$$

amt greater than made  
by selling a chair

$$10w + 15p \geq 150$$

$\Rightarrow$  "DUAL" of the problem

53/8

LP - real variable

ILP - only integer

MLP - both real and integer variables

ZOLP - only zero & one allowed

everything is a subset of this; any no. can be expressed as 0, 1

## SCHEDULING

- Placing operation in time

- define start time of every operation

① unique start time

$$u_{ie} = \begin{cases} 1 & \text{when operation } e \text{ starts} \\ & \text{at time step } i \\ 0 & \text{otherwise} \end{cases}$$

variables of type  $\Rightarrow ZOLP$

If total no. operation (individual) =  $k$   
latency of the system =  $\lambda$

$\Rightarrow$  total no. of op variables =  $k\lambda$

since any operation can have a unique time step only.

$$\sum_i x_{ie} = 1$$

eg:  $x_{11} + x_{21} + x_{31} + \underbrace{x_{41} + x_{51}}_{\text{but we know that } x_{41} \text{ cannot be scheduled at } t=3,4,5} = 1$  (for  $\lambda = 5$ , in earlier problem)

$$\Rightarrow x_{11} + x_{21} = 1$$

$$x_{16} + x_{26} + x_{36} + \underbrace{x_{46} + x_{56}}_{\substack{\text{(due to ASAP)} \\ \text{(due to ALAP)}}} = 1$$

$$\Rightarrow x_{26} + x_{36} = 1$$

$$x_{18} + x_{28} + x_{38} + x_{48} + x_{58} = 1$$

$$\Rightarrow x_{28} + x_{38} + x_{48} + x_{58} = 1$$

we simplified the equation, however we can solve for solution even with the original set of equations

## (2) Dependency relationship must be respected

If  $i$ th operation is dependent on  $j$ th operation

$$\sum i \cdot x_{il} \geq \sum j \cdot x_{jk} + d_j \quad \begin{matrix} \rightarrow \text{delay of gates} \\ \rightarrow \text{fixed} \end{matrix}$$

$$\sum i \cdot x_{ie} - \sum j \cdot x_{je} - d_j \geq 0$$

eg:  $2x_{26} + 3x_{36} - x_{11} - 2x_{21} - 1 \geq 0$

if  $x_{26}=1$  then  $x_{21}$  has to be 0

if  $x_{26}=1$  then either  $x_{11}$  or  $x_{21} = 1$

$$2x_{28} + 3x_{35} + 4x_{48} + 5x_{58} - x_{14} - 2x_{24} \\ - 3x_{34} - 4x_{44} - 1 \quad \left. \begin{array}{l} \} \\ \} \end{array} \right\} \geq 0$$

delay = 1

### ③ Resource constraint

If we have the constraint of only 2 multipliers present by 2 ALU

then  $x_{11} + x_{12} + x_{13} + x_{14} \leq 2$   $\left. \begin{array}{l} \} \\ \} \end{array} \right\} t=1$   
 $x_{15} \leq 2$

$t=2$  :  $x_{21} + x_{22} + x_{23} + x_{24} + x_{26} + x_{27} \leq 2$

$$x_{28} + x_{29} + x_{25} \leq 2$$

NOTE : all the equations in ①, ②, ③ have to be simultaneously satisfied

19/8

- ASAP scheduling gives the best timing
- Max/upper bound ( $\lambda$ ) given by ALAP
- Their difference gives the 'margin / SLACK'
- Optimization

For,  $a_1$  multipliers  $\Leftarrow$  4 unit area

$a_2$  multiplier  $\overset{\text{ALUs}}{\Leftarrow}$  1 unit area

#### a) Area optimization

$$C = 4a_1 + a_2 \rightarrow \underset{\text{cost}}{\downarrow} \text{minimize}$$

## b) Performance optimization

$a_1, a_2, \dots$  = fixed

$\lambda \leftarrow$  need to be minimized

$\Rightarrow \min_{x_i} \sum i \cdot x_{i\ell} \rightarrow$  minimize the time when an operation is started

i.e. minimize  $\{ 1 \cdot x_{11} + 2 \cdot x_{21} + \dots + x_{12} + 2 \cdot x_{22} + \dots \}$

## c) co-optimization

$\rightarrow$  need to bring all parameters to same scale and assign some weight to each parameter

$$C = S \times \{\text{normalised area}\} + (1-S) \times \{\text{normalised perf} \frac{\text{time}}{\text{perf}}\} \quad (\text{perf} \propto \frac{1}{\text{time}})$$

## \* SIMPLEX

## • LIST SCHEDULING

make a list of eligible candidates and pick a few which satisfy the constraint  
(shortlisting done by assigning priority)

### Under resource constraint

$\rightarrow$  assuming area contributed by FU; letting logic area negligible

- need to assign weights/priority number
  - start from ALAP and assign each level a number starting from 0 (for NOP)  
but this way operations 1, 2, 3 are allotted 4
  - we some operations have same weight then
    - we can choose randomly
    - choose from a list

eg:  $U_1 = \{1, 2, 3, 4, 6, 7\}$  (multiplier)  $\rightarrow$  unscheduled  
 $t=1$   $U_2 = \{5, 8, 9, 10, 11\}$  (ALU)

$$\Rightarrow S_1^1 = \{1, 2\}$$

$$\underline{S_1^2 = \{5\}}$$

then  $U_1 = \{3, 4, 6, 7\}$

$t=2$   $U_2 = \{8, 9, 10, 11\}$

$\Rightarrow S_2^1 = \{3, 6\} \rightarrow$  based on value assigned to operation

$$\underline{S_2^2 = \{9\}}$$

$t=3$   $U_1 = \{4, 7\}$

$$U_2 = \{8, 10, 11\}$$

$$\Rightarrow S_3^1 = \{4, 7\}$$

$$\underline{S_3^2 = \emptyset}$$

$t=4$   $U_1 = \emptyset$

$$U_2 = \{8, 10, 11\}$$

$$S_4^1 = \emptyset$$

$$\underline{S_4^2 = \{8, 10\}}$$

$t=5$

$$U_1 = \emptyset$$

$$U_2 = \{11\}$$

$$S_5^1 = \emptyset$$

$$\underline{S_5^2 = \{11\}}$$

20/8 • Number of cycles for diff. FUs is diff.

eg: multiplier = 2 cycles

ALU = 1 cycle

$$t=1: U_1 = \{1, 2, 3, 4, 6, 7\}$$

$$U_2 = \{5, 8, 9, 10, 11\}$$

$$S_1^1 = \{1, 2, 3\}$$

$$S_1^2 = \{5\}$$

$$t=2: U_1 = \{3, 4, 6, 7\}$$

$$U_2 = \{8, 9, 10, 11\}$$

$$P_1 = \{1, 2, 3\} \text{ (operations in progress)}$$

$$P_2 = \emptyset$$

$$S_2^1 =$$

$$\text{Now, } |S_2^1| + |P_1| \leq 2$$

$$\Rightarrow S_2^1 = \emptyset$$

$$S_2^2 = \{9\}$$

$$t=3: U_1 = \{3, 4, 6, 7\}$$

$$U_2 = \{8, 10, 11\}$$

$$P_1 = \emptyset, P_2 = \emptyset$$

$$S_3^1 = \{3, 6\}; S_3^2 = \emptyset$$

$$t=4: U_1 = \{4, 7\}; U_2 = \{8, 10, 11\}$$

$$P_1 = \{3, 6\}; P_2 = \emptyset$$

$$S_4^1 = \emptyset; S_4^2 = \emptyset$$

$$t=5: U_1 = \{4, 7\}; U_2 = \{8, 10, 11\}$$

$$P_1 = \emptyset; P_2 = \emptyset$$

$$S_5^1 = \{4, 7\}; S_5^2 = \{10\}$$

$$t=6: U_1 = \emptyset; U_2 = \{8, 11\}$$

$$P_1 = \{4, 7\}; P_2 = \emptyset$$

$$S_6^1 = \emptyset; S_6^2 = \emptyset$$

$$t=7: U_1 = \emptyset; U_2 = \{8, 11\}$$

$$P_1 = \emptyset; P_2 = \emptyset$$

$$S_7^1 = \emptyset; S_7^2 = \{8, 11\}$$

- 2 multiplier =  $\rightarrow M_1$  (2 cycle, more expensive)  
 $\rightarrow M_2$  (1 cycle)

2 ALU  $\rightarrow A_1$  (2 cycle)  
 $\rightarrow A_2$  (1 cycle)

22/10

$$a = b + c \\ d = a * e \quad \Rightarrow \text{a crossing clk boundary}$$



### Binding of operation to physical resource

share resource

(if no conflicting reg.  
i.e. not in concurrent use)

### Binding of variables to register



After

• BIN

1)  
2)

F

## After scheduled sequencing (2-M, 2-ALU)



Bind 2 Multi to 6 multiplication operat<sup>n</sup>  
such that operations which are concurrent don't  
get same resource

- BINDING OPERATION TO PHYSICAL RESOURCE

Based on above mapping we can devise a graph

- 1) conflict graph
- 2) compatibility graph

- Build them based on operations' type

For Multiplication operat<sup>n</sup>:



draw a line if conflict b/w them



⇒ 'conflict graph'

undirected coz if conflict  
occurs both ways



conflict graph  
for ALU

compatibility graph for multiplication:



### \* graph colouring problem

(adjacent nodes shouldn't get same colour,  
colour graph with min<sup>m</sup> colours)  
chromatic number

→ assigning resources based on graph is  
a similar problem



→ we choose b/w conflict & compatibility graph  
based on which is more sparse

complete graph: every node connected to every other node

e.g.

clique: subgraph inside a graph which is complete



→ we want to find the minimum number of cliques and assign 1 resource to one clique



Solving with ILP

Input: scheduled sequencing graph

$u_{il} \rightarrow$  operat<sup>n</sup> l starts at time step i

given i/p we know their values

let  $y_{lk} \rightarrow$  operat<sup>n</sup> l binds with resource K

$1 \leq l \leq n$  (total operat<sup>n</sup>)

$1 \leq k \leq m$  (total resources)

## ① uniqueness

$$\sum_k y_{ik} = 1$$

26/08



Need one register per liveness range  
 ⇒ no overlapping liveness range variables can share register



$$R_1 = \{z_1, z_3, z_4\}$$

$$R_2 = \{z_2\}$$

$$R_3 = \{z_5, z_6\}$$

ER SHARING

regulates  
concurrency

binding

- (1) lifetime of variable
- (2)  $\Rightarrow$  LIVENESS RANGE

To systematize the problem transform it into a graph



[conflict graph]

$\rightarrow$  Graph colouring



[compatibility graph]

$\rightarrow$  Minimum clique

$\rightarrow$  Storage elements also have to store external variables





- If we do binding first, we allot operations to resources first  
 Helps when steering logic has significant area

- We can do concurrent scheduling & binding as well.

$\rightarrow$  ILP formulation

Need to place operation in time and space together

- 1)  $n_{op} = 1$ ,  $l^{th}$  operation starts at  $i^{th}$  time step

$$b_{er} = \begin{cases} 1 & \text{if } e^{\text{th}} \text{ operation carried out} \\ & \text{by } r^{\text{th}} \text{ resource} \\ 0 & \text{else} \end{cases}$$

Total no. of operation =  $n_{op}$

$$\text{latency} = \lambda$$

total resources =  $n_{res}$

$$\text{total variable} = n_{op} \times \lambda \\ (x_{ie})$$

$$\text{total variable} = n_{op} \times n_{res} \\ (b_{er})$$

## optimization

Minimization of latency

$$\sum_{i \in e} c_i \cdot x_{il}$$

Minimization of resources

$$\sum_{i \in e} w_i \cdot x_{ir}$$


  
 area  
 weight assigned to area

Total cost

$$\delta (\text{norm. latency}) + (1-\delta) (\text{norm. latency})$$

## constraints

1) unique resource binding

$$\sum_r b_{er} = 1$$

$r \in T(V_e)$  → should belong to relevant set (All can't do multiplication)

eg:  $b_{11} + b_{12} = 1$

$$b_{21} + b_{22} = 1$$

$$b_{31} + b_{32} = 1$$

$$b_{83} + b_{84} = 1$$

$$b_{83} + b_{84} = 1$$

$$b_{93} + b_{94} = 1$$

② Binding

$$\sum_i b_{ij}$$

③ unique start time

$$\sum_{i=1}^n x_{ie} = 1$$

$$\text{eg: } x_{11} + x_{21} + x_{31} + x_{41} + x_{51} + x_{61} = 1$$

$$\text{eg: } b_{11} x_1$$

$$+ b_{21} x_2$$

$$+ b_{31} x_3$$

$$+ b_{41} x_4$$

$$+ b_{51} x_5$$

$$+ b_{61} x_6$$

$$+ b_{71} x_7$$

$$+ b_{81} x_8$$

$$+ b_{91} x_9$$

④ Dependency

$i^{th}$  operation dependant on  $k^{th}$  operation  
then it cannot be started before  $k^{th}$  operation

$$\sum_i i \cdot x_{ie} \geq \sum_i i \cdot x_{ik} + d_k$$

$$\text{eg: } 2 \cdot x_{26} + 3 \cdot x_{36} - 1 \cdot x_{11} - 2 \cdot x_{21} - 1 \geq 0$$

$$(b) d_K = 2$$

$$\text{then } x_{26} = 1 \Rightarrow x_{11} \text{ & } x_{21} = 0$$

→ not possible

⑤ Limited resource

$$t=1: x_{11} + x_{12} + x_{13} + x_{14} + x \leq 2$$

$$t=2: x_{21} + x_{22} + x_{23} + x_{24} + x_{26} + x_{27} \leq 2$$

$$+ \underbrace{x_{11} + x_{12} + x_{13} + x_{14}}_{\text{if } d_K=2 \text{ and they are still running}} + \underbrace{x_{21} + x_{22} + x_{23} + x_{24}}_{\text{cannot be scheduled in time}} + x_{26} + x_{27} \leq 2$$

∴ if  $d_K=2$  and they are still running cannot be scheduled in time

## 2) unique resource

$$\sum_{i=1}^n x_{ie} = 1$$

$$\text{eg: } x_{11} + x_{21} + x_{31} + x_{41} + x_{51} + x_{61} = 1$$

## 3) Dependency

$i^{\text{th}}$  operation dependant on  $k^{\text{th}}$  operation  
then it cannot be started before  $k^{\text{th}}$  operation

$$\sum_i i \cdot x_{ie} \geq \sum_i i \cdot x_{ik} + d_k$$

$$\text{eg: } 2 \cdot x_{26} + 3 \cdot x_{36} - 1 \cdot x_{11} - 2 \cdot x_{21} - 1 \geq 0$$

$$\text{if } d_k = 2$$

$$\text{then } x_{26} = 1 \Rightarrow x_{11} \text{ & } x_{21} = 0$$

→ not possible

## 4) Limited resource

$$t=1: x_{11} + x_{12} + x_{13} + x_{14} + x \leq 2$$

$$t=2: x_{21} + x_{22} + x_{23} + x_{24} + x_{26} + x_{27} \leq 2$$

$$+ \underbrace{x_{11} + x_{12} + x_{13} + x_{14}}_{\text{if } d_k = 2 \text{ and they are still running}} \leq 2$$

cannot be scheduled if  $d_k = 2$

$t=3:$

$$x_{21} + x_{22} + x_{23} + x_{24} + x_{26} + x_{27} \\ + x_{33} + x_{34}$$

### 6) Binding

$$\sum_{l} b_{l,p} \sum_{i=d_l+1}^j x_{i,p} \leq 1 \quad \rightarrow \text{non-linear due to multiplication}$$

operations binded with a resource  
shouldn't be concurrent

e.g.:  $b_{11} x_{11} + b_{21} x_{12} + b_{31} x_{13} + b_{41} x_{14} \leq 1$

(for resource = 1, time step = 1)

$$b_{12} x_{11} + b_{22} x_{12} + b_{32} x_{13} + b_{42} x_{14} \leq 1$$

(for resource = 2, time step = 1)

~~20/08~~

### Quiz 1 Solution

- 1) a) ASAP and ALAP both give max. performance (upper bound)
- b) cannot have min<sup>m</sup> resources by ASAP, gives min<sup>m</sup> time
- c) Binding only assigns physical resource, not time step

2)

NW write ILP for chain

- scheduling and binding

→ 4 cycles

→ cycle time = max (delay multi, delay ALU)  
= 10ns

→ total iteration = 40ns

### Pipelining

#### 1) Structure

- res
- no

eg: mu

to reduce time  
and increase though  
we need to do Pipelinin

#### 2) Function

eg:



Now can input new value every 10ns

Initially cycle time = 20ns

Final cycle time = 10ns

→ Easier to split directed acyclic graph  
(no loops)

- Data Introduction Interval: how often we can enter new data



If cycle time > 2.5ns

$$\Rightarrow \text{DII} = 5\text{ns}$$

## PIPELINING

### 1) Structural Pipelining

- resources are pipelined
- non-hierarchical sequencing graph used

e.g.: multiplier

$$\text{Latency} = 2 \text{ cycle}$$

$$\text{DII} = 1 \text{ cycle}$$

### 2) Functional Pipelining



List scheduling:

$$t=1: U_1 = \{1, 2, 3, 4\} \quad U_2 = \{5\}$$

$$P_1 = \emptyset$$

$$S_1^1 = \{1, 2\}$$

$$P_2 = \emptyset$$

$$S_2^1 = \{5\}$$

$$t=2 \quad U_1 = \{3, 4\} \quad U_2 = \{9\}$$

$$P_1 = \{1, 2\}$$

$$P_2 = \emptyset$$

$$S_1^2 = \{3, 4\}$$

$$S_2^2 = \{9\}$$

due to pipelining

$$t=3 \quad U_1 = \{6\} \quad U_2 = \{3\}$$

$$P_1 = \{3, 4\}$$

$$P_2 = \emptyset$$

$$S_1^3 = \{6\}$$

$$S_2^3 = \emptyset$$

$$P_1 = \{6\}$$

$$P_2 \rightarrow \emptyset$$

$$t = 5 \quad U_1 = \emptyset \\ P_1 = \emptyset \\ S_1^5 = \emptyset$$

$$U_2 = \{10\}$$

$$\begin{array}{ll} t = 6 & U_1 = \emptyset \\ & P_1 = \emptyset \\ & S_1 G = \emptyset \end{array}$$

$$U_2 = \{11\}$$

- number of cycles is higher but cycle time is less

• 1LP

1) uniqueness of start time

$$\sum_i x_{il} = 1 \quad (\text{every operation should be started only once})$$

9:

## 2) Dependency

if  $k^{th}$  operation depends on  $l^{th}$  operation

$$\sum_i x_{ie} \leq \sum_j x_{jk} + \alpha$$

### 3) Resource

$$x_{11} + x_{12} + x_{13} + x_{14} \leq 2$$

$$x_{15} \leq 1$$

$$+ x_{26} + x_{27} \Rightarrow + \underbrace{\dots}_{\text{...}} \quad \left. \begin{array}{l} x_{21} + x_{22} + \\ x_{23} + x_{24} \end{array} \right\} \leq 2$$

no need to  
add due to  
Pipeline

If  $S = \text{data}$

- it's introduced

! intended

$$t, t+s, t+2s, \dots$$

time step operator

should be.

(concurrent)

200



- ZLP resource constraint contd.

$$\left( \begin{array}{l} \text{potential operations} \\ \text{in cycle 1} \end{array} \right) + \left( \begin{array}{l} \text{potential operations} \\ \text{in cycle 3} \end{array} \right) \leq \text{Resources}$$

$$\left( \begin{array}{l} \text{pot. op. in} \\ \text{cycle 2} \end{array} \right) + \left( \begin{array}{l} \text{pot. op. in} \\ \text{cycle 3} \end{array} \right) \leq \text{Resources}$$

- Let's say for binding we have scheduled sequencing graph.

- earlier throughput =  $\frac{11}{4}$

- now throughput =  $\frac{11}{2}$

- conflict graph



- pipelining improves throughput but resource requirement also increases

- PPTAS1

power testability

Power usage is static & dynamic

leakage switching activity

depend on the inputs

Dynamic power - depends on how many outputs are switching upon switching the inputs

Static power - which MOS is on depends on the input; hence those would constantly consume power

- at idle state what set of input to apply so that leakage is minimum.
- minimize the switching activity (however  $\eta_p$  comes from environ.)
- power = energy / time

$$= \Delta V^2 f \quad (\text{average})$$

↳ number of times it toggles

$$P(t) = v(t) I(t) \quad (\text{instantaneous})$$



- Average power determines the heat dissipated and hence defines all cooling requirement
  - avg. power relates to energy consumpt<sup>n</sup> and hence determines our battery life
  - heating determines burnout of chip
- 
- we mostly define peak as energy consumed over a clock period
    - at peaks we don't know what the device output would be
    - if  $V_{DD}$  goes down, then charging time  $T \Rightarrow$  delay  $\uparrow \rightarrow$  lose synchronisat<sup>n</sup>  
 $\Rightarrow$  wrong result
- correct result is crucial

09/09

- average of power is important in the cooling system
- radio, display and processor are the power dissipation sources in phones
- radio & display are mostly const. over the generation



- Peak is important to decide the powerline
- if the power supply isn't strong enough and voltage is higher there is drop in voltage called voltage drop
- Today chips are outputting at 0.8-1V &  $V_T = 0.4V$ , so we don't have enough margin  
 $V_{DD} - V_T \uparrow \Rightarrow \text{delay} \uparrow$ , so it may not be able to propagate the signal



- real peak is instantaneous

- Dynamic power is coming from switching.  
Switching/Toggle count will tell us the dynamic part.



Need to find power dissipation of ALU from switches.

$$\text{Toggle} = F(\text{input})$$

- Static power = no. of gates  
Static power has become low coz of finfets, high-K etc.
- Our main aim is to reduce dynamic power



→ synchronize w/p switching



Register Binding

→ for reg. binding we saw 2 graphs, whichever gives less power that'll have good synchronization

- T → testability

Modelling faults:

- single stuck-at fault (SSAF)

only one fault can occur at a time.  
It doesn't mean that if it is stuck at 0 then it is grounded.

- of course it's difficult. Practically there are multiple stuck-at faults
- no. of faults linearly proportional to no. of gates.
- suppose clk freq is 1 GHz, there are 4 billion gates so total delay will be 4 sec, which is very small and good



10/09

### Toggle problem

- rebinding of variables to registers
- may need additional registers (so that switching occurs at same)



- if switching occurs at a different time then excess power spent unnecessarily

- static power already considered during minimizat'n of area

- Reduce useless power dissipat'n caused by Share the resources

- if we want disable of area so then
- delay of (hence clock)
  - thus we can reduce power
- In CAD tools (not the tools occur) (combinational mode)
- Scalability

### TESTABILITY

Def

Fa

- Ideally clock should be same to all flipflops



if this clock comes 1 ns earlier than effectively we are giving comb. logic 1 ns time

- if we want to suppress clock, we don't disable for one flip flop but for an entire area so that skew (delay) doesn't occur
- delay of gate varies upto 30%.  
(hence clock gating leads to skew)
  - thus we mostly directly use the clock than gating it
- In CAD try to separate out domain  
(not the most optimized but less mistakes occur)  
(combinational feedback not generally used)
- Scalability requires uniformity

## TESTABILITY

Defect

↓  
(surface as)

Fault

SSAF model : logic fault  
timing fault

- while testing gate, we toggle each  $\oplus$  &  $\ominus$



Sequential

Diagnosis: Analysing and uncovering which exact fault is there (we usually care if defect is there or not)

### - Defect

→ systematic (occurs in all chips)  
random

- i) we first identify which defect occurs
- ii) then we do a fault diagnosis
- iii) Failure Mode Analysis (FMA): peel off chip layer by layer

- Timing Defect: chip fails as we increase the frequency of the clock
  - can be used for lower speed applications (at a lower price)
  - need to find the frequency at which it works
  - since each testing costs something, we need to test lower the freq at optimum intervals
  - can't lower too much each revenue would be lower than production cost

Time fr

→ we  
⇒ a  
⇒ w  
m

sequ  
Maxi  
any p  
want

SA  
X

- want  
take

ce

→



### Sequential CKT



### Time frame Expansion

- we pass the output from FF again to the CL
  - ⇒ another set of test vector required.
  - ⇒ while CL has 1 set of Test vector, SI may have many sets before we reach primary output

Sequential depth

Maximum no. of clk's it passed through for any pair of  $Y_p$  &  $\delta_p$

→ want to minimize this



- want longest path to be small so that testing takes lesser time
  - (i.e. minimize Sequential depth)

→

32/09

- For testing fault coverage 99% stuck-at fault check needs to be there 1% are redundant faults

$$\text{Fault Efficiency} = \frac{\# \text{ detectable faults}}{\# \text{ faults}}$$

(coverage)

- every net can have 2 faults



$$\text{No. of stuck-at faults possible} = 2 \times 4 = 8$$

$$\left. \begin{array}{l} \text{No. of non-testable faults} \\ (\text{Redundant faults}) \end{array} \right\} = 2 (SAI_A, SAI_B)$$

↳ don't cause any error

$$\Rightarrow \text{Faults coverage} = \frac{6}{8} \times 100\%$$

$$\text{Fault Efficiency} = \frac{(\# \text{ detectable faults})}{\# (\# \text{ faults}) - (\# \text{ redundant faults})}$$
$$= \frac{6}{8 - 2} \times 100$$
$$= 100\%$$

- Difference between FE and FC tells us the amount of redundancy we have in our circuit
- Sometimes, redundancy are intentional (like in the above case we could've used NOT directly - but we wanted uniformity)

- Redundancy = no test vector available to test the fault
- generating test vector for redundant faults may take a lot of time so somewhere one needs to draw the line and simply acknowledge that the fault is redundant (although they may not be)

→ but how to decide this limit

→ problem becomes severe for sequential logic

(because if we have  $n$  inputs, we applied  $2^n$  combinations, however as we unroll, no. of inputs becomes  $2n \dots \Rightarrow 2^{2n}$  combinations)

i.e. need to control unrolling

- Sequential Depth

max. no. of flip-flops between a pair of input/output

- Controllability [Ease of assigning value]  
How easily we can control an input



- improvement of controllability is also important



## Observability

How easily an output can be observed

- improve observability

- To make test generation easier

- improve controllability
- improve observability
- reduce seq. depth



bad  
good controllability  
but good observability  
⇒ need to change register binding

→ Re-binding variables to registers

- sequential depth can be optimised during re-scheduling and re-binding (improve slack)

23/09

## • Chaining (ILP in chaining)



→ uniqueness remains

→ dependency

(only successive operation can be chained)

⇒  $C_{i,j} = 0$  if they don't have edge in between

⇒ multiply earlier expression for dependency with  $C_{ij}$ . However makes it non-linear

## • Testability - Security

1) controllable (register should have easy access from primary input)

2) observable (register should have easy path to primary output)

↳ how many register it is passing through before coming at PO

3) Sequential Depth (reduce SD as far as possible)  
→ defines the no. of registers to pass through

⇒ all the 3 can be determined by

## • Register Allocation

↳ need to allocate it relatively

give different control, observe, SD

Ex :



$\Rightarrow$  4 registers

⇒ Left Edge Assignment algorithm  
or Right Edge Algorithm



$$d : s_D = 3$$

$$e : sp = 3$$

$$b \cdot SD = 2$$

$$\alpha : SD = 1$$

LEFT EDGE  
ALGORITHM

## • Right Edge Algorithm

$$P_1 = \{a, f, g\}$$

$$R_2 = \{b, c\}$$

$$R_3 = 1d3$$

$P_q > \{c\}$



$$R_1 = h \{ a \} g \}$$

$$R_2 = h \{ b \}$$

$$R_3 = h \{ d, c \}$$

$$R_4 = h \{ e, f \}$$



Observability:

$$d = 2 \quad \} \text{improved} \quad (\text{SD also improved } 11\%)$$

$$e = 2$$

$$b = 2$$