

\* Books : ① synthesis & optimization of digital circuits  
(Reto)

- Area + performance + Testability + power + security + Intelligence
- Testability - Design should be as open as possible
- Security - " " " closed "

### \* Computer Aided Design:

1. High-level synthesis → high-level algo to RTL
2. Logic synthesis → RTL to gate-level Design
3. Physical Design → gate-level to PCB
4. Design verification → correctness
5. Testing



we have to go from Algo to transistor layout in a systematic order

- We can go from Algo to transistor eqn by following a spiral-like path
  - We need a rough estimate at every level
  - Delay incurred in wire is also crucial
-  Routing has huge no. of possibilities
- Top-down approach is systematic.  
Gajski's Y-chart (discussed on the last page)
  - Bottom-up is difficult because we don't focus on the specifications

### Hybrid approach:



Previous experience can help us skip some steps

Behaviour & Structural - Top-down

Physical Design - Bottom-up

- high-level synthesis output: datapath & controller
  - input to Logic synthesis & output: schematic
- Routing algorithms <sup>are</sup> approximately best sol<sup>n</sup> since iterating over all possibilities is ~~impossible~~ not practical.
- FPGA is better for low-volume products since fixed costs are very high
- FPGAs are much faster than simulation (slower than ASIC) & hence they are used for prototyping

14/8/19 Algos can also be represented graphically:

$$a = b + c, \quad d = a + e, \quad p = q + r$$



we can implement this algo by simply replacing each node with the corresponding ALU block

Resources required:

- 1> Functional units
- 2> Steering data

If asynchronous arrival of inputs: 3> Memory

Optimized? Parameters are PPT ASI

For now, let the parameters be Area, Performance



$$\text{Delay} = \underline{2 \text{ ns}}$$

: 3 units of area  
2 ns delay

Using only one adder :



Time taken =  $3 \text{ ns}$  : not optimal  
Area = 1 unit : optimal

We also need 5 control signals  
( $2+2$  for mux, 1 for register)

Intermediate:

Time :  $2 \text{ ns}$

Area : 2 units



(Assuming storage logic take up negligible space)



Solving diff. equations: eg

$$\frac{d^2y}{dx^2} + 3xy \frac{dy}{dx} + 3y = 0$$

$$\frac{dy}{dx} = \frac{-3xy - 3y}{d^2y}$$

$$y(0)$$

$$\frac{dy}{dx}(\star) = u$$

$$x' = x + dx$$

$$y(0) = y$$

$$y' = y + u dx$$

$$x(0) = 0$$

$$u' = u - 3xy dx - 3y dx$$

~~x < a~~: Termination condition



We can reduce from 6 multipliers, 5 ALUs to

4 multipliers, 2 ALUs (since at one stage, we need only the components at that stage)

| e.g. stage | 1 | 2 | 3 | 4 |
|------------|---|---|---|---|
| MUL        | 4 | 2 | - | - |
| ALU        | 1 | 2 | 1 | 1 |

$\rightarrow \max = 4 \quad \} \quad 4 \text{ MUL}$   
 $\rightarrow \max = 2 \quad \} \quad 2 \text{ ALUs}$

However, we can further optimize by shifting the nodes downwards/upwards



But this was done intuitively. Some systematic method has to be developed

- We need to also map the hardware to their job. placing of operation in space  $\rightarrow$  binding
- placing of operations in time  $\rightarrow$  scheduling

## Optimal / Scheduling / Binding

Area | Performance

constraints: e.g. 30 fps for video processing

optimization: speed of general purpose computer

- upper bound on performance can be obtained by longest path & ample infinite resources to implement
- we can have a constraint on Area  
e.g.  $\leq 9$  units  $\Rightarrow$  only one multiplier allowed, few ALUs

∴ place in time - scheduling - defines the state machine

Place in space - Binding - Defines the steering logic

19/8/19 # Policy 1 : ASAP

As soon as a job finished, assign next time unit slice to all its successor. Don't care about required resources. This is precisely what we did in the first case.



$$\text{Latency} = \lambda = t_{\text{last node}}^{\text{start}} - t_0^s = 5 - 1 = 4$$

ASAP will give an upper bound on performance

## \* Policy 2: ALAP (As late as possible)

We start from the bottom



ASAP & ALAP will give same performance but different resource allocation

- Apart from the maximum path, we can move around the remaining nodes to optimize resources
- Now suppose performance constraint is not 4 but 5
- Nodes 1, 2 can be moved up by maximum of 1 node
- Slack for 1, 2 = 1
- " " 3 = 2 (can be done at t=1, 2 also)
- $\text{Slack} = t_i^L - t_i^S \rightarrow \text{ASAP time}$

\* We can use already-established optimization techniques to solve for optimal resources

For ILP: e.g. 400 wooden boards, 450 person hours

Chairs: 40 profit, 5 boards + 10 person hours

Table: 50 " , 10 " + 15 "

$\therefore$  If  $X$  chairs &  $Y$ -tables, profit =  $40X + 50Y$

$\rightarrow$  Goal: Maximize of  $40X + 50Y$

Constraint:  $5X + 10Y \leq 400$  — Wood

$10X + 15Y \leq 450$  — hours

$X \geq 0, Y \geq 0$



- LP  
 ILP : only integers  
 MLP : mixed - some integers & some fractions  
 ZLP : only 0,1 allowed (subset of ILP)



### Formulating the ILP:

Let  $x_{il}$ : operation  $i$  is scheduled at time  $l$

$$x_{il} = \begin{cases} 1 & \text{if operation } i \text{ is scheduled at step } l \\ 0 & \text{otherwise} \end{cases}$$

Total :  $i : 1 \text{ to } 5$ ,  $l : 1 \text{ to } 11 \Rightarrow \text{Total} = 55$

such variables

## Generating the constraints:

1)  $\sum_l x_{il} = 1$  : Every operation is scheduled in ONLY one cycle

$$\text{eg. } x_{11} + x_{12} + x_{13} + x_{14} + x_{15} = 1$$

But ALAP & ASAP tells us that  $l=1 \text{ or } 2 \text{ only}$

$$\therefore x_{13} = x_{14} = x_{15} = 0$$

\*  $x_{il} = 0$  for  $l > \text{ALAP}(i)$  and  $l < \text{ASAP}(i)$

2) But we haven't considered dependencies i.e.

~~operation  $i$  should be scheduled before operation  $j$~~

Dependent operation should be scheduled after completion of its predecessor

i.e.  $i^{\text{th}}$  operation is dependent on  $j^{\text{th}}$  operation  
and hence  $l_i > l_j$

$$\therefore * \sum_l (x_{il}) \geq \sum_l (x_{jl}) + d_{ij}$$

start time of operation  $i$       start time of operation  $j$       time of execution of operation  $j$

\*  $1: x_{g11} + 2x_{g12}$

For our given graph,  $d_j = 1 \forall j$

$$2x_{62} + 3x_{63} - 1 \cdot x_{11} - 2x_{12} - 1 \geq 0$$

- For node 1

This tells us that either  $x_{11} = 1$  or  $x_{12} = 1$  if  $x_{63}$

$$2x_{82} + 3x_{83} + 4x_{84} + 5x_{85}$$

$$- 1 \cdot x_{41} - 2x_{42} - 3x_{43} - 4x_{44} - 1 \geq 0$$

For node 8 ↗

- Resource constraints:

At any point in time, suppose we cannot have more than ' $a_m$ ' multipliers & ' $a_k$ ' ALUs

$$\text{Cycle 1: } x_{11} + x_{21} + x_{31} + x_{41} \leq a_m, \quad x_{51} \leq a_k$$

$$\text{Cycle 2: } x_{12} + x_{22} + x_{32} + x_{42} + x_{62} + x_{72} \leq a_m$$

$$x_{52} + x_{82} + x_{92} \leq a_k$$

$+ x_{71} + x_{81}$        $\downarrow d_j$

### ~~OPTIMIZATION FUNCTION~~

$$\text{Area: } \min(1 \cdot a_k + 4 a_m)$$

$\downarrow$   
if multiplier occupies  $l$  units of area  
ALU    "    1 unit "  $\uparrow$

To maximize performance : we can minimize :

$$\sum_i \sum_l (x_{il} l) \xrightarrow{\text{path}} \text{will automatically minimize } l \text{ for longest path}$$

Note: The above will give ASAP if area constraint is relaxed.

We can also co-optimize area & performance :

$$\text{Minimize} (f \cdot \text{Area} + (1-f) \text{Performance})$$

Both should be normalized to the same scale

## \* Heuristic-based Scheduling:

1. List-scheduling under resource constraint ( $2M + 2ALU$ )

$$U_1 = \{1, 2, 3, 4\} \quad U_2 = \{5\}$$

$U$  - unscheduled list - contains list of operations

when set contains lesser operations than constraints,  
no problem ✓

when set contains more nodes to be scheduled :  
we can give more priority to nodes which lie in  
the longest path.



Calculate Min dist. of each node from the bottom  
Higher the distance  $\Rightarrow$  Higher the priority

$$\therefore \begin{array}{l} S_1^1 \xrightarrow{\text{mult.}} \\ S_1^2 \xrightarrow{\text{ALU}} \end{array}$$

$$U_1 = \{3, 4, 6\} \quad U_2 = \{9\} \quad (\text{8 not yet eligible})$$

$$S_2^1 = \{3, 6\} \quad S_2^2 = \{9\}$$

$$U_1 = \{4, 7\} \quad U_2 = \{10\}$$

$$S_3^1 = \{4, 7\} \quad S_3^2 = \{10\}$$

$$U_1 = \cancel{\{7\}} \quad U_2 = \{8, 11\}$$

$$\therefore S_4^1 = \{3\} \quad S_4^2 = \{8, 11\} \quad \checkmark$$

$\times - \times - \times$

This may not give the best but gives optimum sol<sup>n</sup>

- We can also have a variant where operations take different no. of time steps

Say multiplication takes 2 time units.

$$U_1 = \{1, 2, 3, 4\}$$

$$U_2 = \{5\}$$

$$S_1^1 = \{1, 2\}$$

$$S_1^2 = \{5\}$$

$$U_1 = \{3, 4\}$$

$$U_2 = \{9\}$$

$$\leftarrow P_1 = \{1, 2\}$$

$$P_2 = \emptyset$$

in progress

$$|S_1| + |P_1| \leq a_m \Rightarrow S_2 = \emptyset$$

↓  
resources

$$U_1 = \{3, 4, 6\}$$

$$U_2 = \emptyset$$

$$P_1 = \emptyset$$

$$P_2 = \emptyset$$

$$S_3^1 = \{3, 6\}$$

$$S_2^2 = \emptyset$$

$$U_1 = \{4\}$$

$$U_2 = \emptyset$$

$$P_1 = \{3, 6\}$$

$$P_2 = \emptyset$$

$$|S_1| + |P_1| \leq 2 \Rightarrow S_3^1 = \emptyset$$

$$S_4^2 = \emptyset$$

$$U_1 = \{4, 7\}$$

$$U_2 = \{10\}$$

$$P_1 = \emptyset$$

$$P_2 = \emptyset$$

$$\therefore S_5^1 = \{4, 7\}$$

$$\therefore S_5^2 = \{10\}$$

$$U_1 = \emptyset$$

$$P_1 = \{4, 7\}$$

$$S_6^1 = \emptyset$$

$$U_2 = \emptyset$$

$$P_2 = \emptyset$$

$$S_6^2 = \emptyset$$

- - - - -

$$U_1 = \emptyset$$

$$U_2 = \{11, 8\}$$

$$P_1 = \emptyset$$

$$P_2 = \emptyset$$

$$S_7^1 = \emptyset$$

$$S_7^2 = \{8, 11\}$$

⇒ Generating on ILP if different latencies:

1. Uniqueness will remain the same
2. Dependency: we had a  $d_j$  term which took care of the execution time  
∴ If  $op_i$  is dependent on  $op_j$ :

$$\sum l_i x_{il} \geq \sum l_{ij} x_{jl} + d_j \rightarrow = \begin{cases} 1 & \text{for addition} \\ 2 & \text{for multiplication} \end{cases}$$

e.g.  $2x_{62} + 3x_{63} + 4x_{64} + 5x_{65}$

$$- 1 \cdot x_{11} - 2x_{12} - 3x_{13} - 4x_{14} - 2 \geq 0$$

3. Resource constraints:

max. 2 operations

$$0) \quad x_{11} + x_{21} + x_{31} + x_{41} \leq 2$$

$$0) \quad \underbrace{x_{11} + x_{21} + x_{31} + x_{41}}_{\text{already running}} + \underbrace{x_{12} + x_{22} + x_{32} + x_{42} + x_{52} + x_{62} + x_{72}}_{\text{new possibilities}} \leq 2$$

$\Rightarrow$  For different kind of multipliers:

2 ALUs: 1 latency, 1 unit area

2 Multipliers:  $\begin{cases} 2 \text{ latency, 3 unit area} \\ 1 \text{ latency, 5 unit area} \end{cases}$

Use heuristic based scheduling ✓ "an  
S-List"  $\rightarrow$  decide which multiplier

$x_{ijk} = 1$  if operation  $i$  is scheduled at time  $j$   
on multiplier  $k$

$$\text{Uniqueness: } (x_{111} + x_{112}) + (x_{121} + x_{122}) + \dots + (x_{151} + x_{152}) = 1$$

$$\text{Dependency: } \sum (x_{111} + x_{112}) \geq \dots$$

~~Binding~~: Mapping logic units/multipliers to the task.

- Sharing of operations is possible only when operations are not concurrent.
  - We can map the problem to a graph-based problem
- ① Make a conflict graph for each operation, where edge  $(i, j)$  exists between node  $i$  & node  $j$  if  $i^{\text{th}}$  &  $j^{\text{th}}$  operation are scheduled at the same time. for multiplication:



- ② We can also solve the complement of the above method:

Compatibility Graph:  
i, j are connected if  
operations i & j are NOT  
concurrent



• Either of the two will be sparse & hence easier to process.

② continued : we can solve compatibility graph using the MIN CLIQUE method :

clique : COMPLETE subgraph within a graph  
all nodes are connected to each other



Now that we have minimum no. of cliques, just assign each unit to one clique



\* BINDING is done post solving the ILP

If there are multiple solutions to the clique problem we can optimize for steering logic, reliability, etc.

e.g.  $V_{TP}$  changes if -ve bias is applied & hence if operated for a long time, delay increases (NBIT)

$\therefore$  workload among all the units must be distributed i.e. clique sizes should be uniform

\* Co-optimization of area & performance:

$\int \text{Area} + (1-f) \text{Performance}$

$\underbrace{\quad\quad\quad}_{\text{we need to normalize both so that they can be optimized together}}$

(Like feature scaling)

e.g. we can divide by max area & max latency to get them on the scale 0-1 & now use

Co-optimization.

if we once operations are assigned to resources, we develop the steering logic

e.g.



select pins  
controlled acc.  
steering logic

e.g.



Any signal which is  
crossing the time  
boundary needs to  
be stored

- we can assign one register for each intermediate signal ( $\therefore 6$  registers in this case)

- we can minimize # by sharing the resources  
i.e. binding operations to physical registers



i.e. Variables having overlapping  
liveless range need different  
Registers/resources

- We can use graph coloring again:  
make conflict graph, e.g. edge  $(i, j)$  exists if node  $i$  & node  $j$  have overlapping liveness range.



$\therefore$  Red:  $z_1, z_3, z_6$  will get the same register

We can also use compatibility graph:  
edge  $(i, j)$  exists if nodes  $i \& j$  DONT have overlapping liveness range.

Solve for min-clique



We also have external variables which need to be stored in registers. 2-registers for  $u, v, y$  additional

\* If steering logic area is small, we do scheduling before binding.

But if it is significant, we need to do concurrent scheduling & binding

\* We need a state machine which specifies all control signals, outputs, next state logic, etc.

e.g.  $a = b + c$ ,  $d = a + e$



Result

|     | X | Y | Z |
|-----|---|---|---|
| t=1 | 1 | 0 | 1 |
| t=2 | 0 | 1 | 1 |

redundant  
always key,

\* Co-optimization of scheduling & binding:

$$x_{il} = \begin{cases} 1 & \text{if } i^{\text{th}} \text{ operation is scheduled in time step } l \\ 0 & \text{otherwise} \end{cases}$$

$$y_{ir} = \begin{cases} 1 & \text{if } i^{\text{th}} \text{ operation is bound with } r^{\text{th}} \text{ resource} \\ 0 & \text{otherwise} \end{cases}$$

### Constraints:

> Uniqueness:

$$\sum_{t=1 \text{ to } T} x_{itl} = 1 \quad \forall i \quad (\text{Each op scheduled exactly once})$$

$$\sum_l y_{itl} = 1 \quad (\text{spatial uniqueness})$$

i.e. if op 1 is assigned to 1<sup>st</sup> multiplier, it cannot be assigned to any other multiplier

for the graph we have already analysed:



total time = 5

$$x_{11} + x_{12} + x_{13} + x_{14} + x_{15} = 1$$

$$y_{11} + y_{12} = 1$$

$$y_{21} + y_{22} = 1$$

$$y_{33} + y_{34} = 1$$

if resources 1, 2 are multipliers  
3, 4 are adders

2) Dependency:  $\sum l_{xil} - \sum l_{xjl} - d_j > 0$   
 if  $x_i$  is dependent on  $j^{th}$

e.g.  $x_6$  is dependent on 1 or 2:

$$2x_{62} + 3x_{63} - 1 \cdot x_{11} - 2x_{12} - 1 > 0$$

3) Resource  $\sum_{i=1}^n x_{ii} + x_{21} + x_{31} + x_{41} \leq 2$  — multiplicity  
 $x_{12} + x_{22} + x_{32} + x_{42} + x_{62} + x_{72} \leq 2$  — of time

4) Sharing of resources: (Non-linearity)

$$x_{11}y_{11} + x_{21}y_{21} + x_{31}y_{31} + x_{41}y_{41} \leq 1$$

$\uparrow$  first is scheduled first resource is assigned

Similarly for the  $2^{nd}$  resource:

$$x_{12}y_{12} + x_{22}y_{22} + x_{32}y_{32} + x_{42}y_{42} \leq 1$$

Minimization objective:

$$\delta \sum l_{xil} + (1-\delta) \sum w_{qr}$$

$\sim$   
performance

$\sim$   
weighted areas of FU

- Control-dominated circuits will be optimized by considering steering logic

## Optimization for Power :



Unfortunately, inverter it has both static & dynamic leakage  
Ideally, 0 &  $\infty$  resistance are expected



- Dynamic leakage  $\propto$  No. of Transitions / Toggles
- Static leakage  $\propto$  # transistors  
But # transistors is already taken care of in area  
 $\sim 10^{-1}$  of total power
- Dynamic leakage is difficult to estimate since ~~it is~~  
~~the~~ toggling depends on input & input is not fixed  
Do simulations & plot toggles for some inputs  
we only care about relative power dissipation

- Power is instantaneous :
- Avg. - 
- Average power will influence the heat sinks required  
e.g. cooling wing liquids or fan enough?
  - But we need to also consider the peaks
  - Since the wires supplying  $V_{DD}$  to the transistors (supply rails to inverter) can carry constant power, we will see a voltage drop.
  - But current inverters run at around  $0.8\text{ V}$  and we might wander close to  $0.4\text{ V}$  (Threshold voltage)
  - Also, delay  $\propto \frac{1}{(V_{DD} - V_{Th})}$ 
    - ↓      ↑
    - decrease      fixed $\Rightarrow$  increase of delay  
may cause synchronization problems
  -  voltage drop
  - Peak power determines the power delivery network



Even though we are not storing value of addition in  $R_3$ , there is still power loss from at  $t=1$  because since inputs have changed, combinational logic will still calculate the result.

So even though we want to store in  $R_3$  the result at  $t=4$ , power loss at  $t=1$  also!

- Called spurious switching / useless power
- To reduce this, we need to ensure that switching occurs simultaneously.
- Register rebinding & scheduling will be required now.

- ⇒ Testability: 1) hardware defect
  - ⇒ Algo defects e.g. OR instead of XOR in the code
- 
- Systematic defect - Poor understanding of new technology
  - Random defect - crack in the wafer  
(prevalent in mature devices)

Any defect will give rise to a logical implication.  
So, to test random defects:

(  
 $\overline{D_o}$  inverter does not give diff. output for  $\overline{D_i}$ )  
 ∵ Even though it is a logical error, don't affect manufacturing

⇒ stuck at Fault model:

→ either 0 or 1. NEED NOT mean that they  
 $\overline{\overline{D}}$  stuck are directly connected to +5/gnd

Only covers logical defects

if we assume  
 $2^B$   $\approx 2 \text{ billion}$  nos

- Total configurations of stuck at fault: 3
  - either  $S_1 = 0$   
 $S_2 = 1$
  - no fault
- But multiple SAF vectors are mainly the same as that of single SAF  
 $\therefore$  single stuck at fault is good enough assumption



we may not be able to always propagate the fault to a primary output

So, we need to keep passing it through the logic till we are able to propagate it to the output



Unrolling increases the complexity

※ Sequential depth: Max. no. of FF between any pair of input & output.

Since more the no. of FF (higher sequential depth), increases the Unrolls required, we should keep this in my mind during design

e.g. binding of registers should be done in such a way that sequential depth is minimized



Total 8 possible SSAF.

only 2 are redundant ( $\because \bar{X}F = \bar{X}$ )  
All others are testable

\* Fault coverage =  $\frac{\# \text{ detectable faults}}{\# \text{ Total FF}} \text{ (Test vectoring error)}$

Fault Efficiency =  $\frac{\# \text{ detectable}}{\# \text{ Total} - \# \text{ redundant}}$

In this case, FC =  $\frac{6}{8} \times 100 = 75\%$ .

FE =  $\frac{6}{8-2} \times 100 = 100\%$ .

Redundant faults are those which don't cause any logical malfunctioning.

Some faults can be malfunctioning, yet a test vector won't exist in some cases.



$$\bar{f} \cdot \bar{r} = \frac{1}{f_r}$$

• Calculating fault vector for redundant ~~fault~~ faults take a very long time. So, if we stop midway, we don't know if it's redundant or not.

Put own FE falls since # redundant faults ↓

⇒ More problem in sequential circuits:



$$\begin{aligned} PI &\rightarrow FF \\ FF &\rightarrow PO \\ PI &\rightarrow PO \\ FF &\rightarrow FF \end{aligned}$$

Although M<sub>1</sub> & M<sub>2</sub> can separately operate at 10 GHz, if we directly connect them, total delay = 16 ns, hence we can't use 100 GHz clock.

Standard practice: Use FF as intermediate storage  
We will need more 2 clock cycles to pass data from one module to the other but clock freq. will remain high.



- To test for fault  $g$ , we cannot directly apply primary input ~~s<sub>pos</sub>~~. It has to be passed through an adder first. ∴ Not easy to control
- But the fault on the right can be easily tested since inputs can be directly driven
- 1. Reduce sequential depth
  - ✓ scheduling & binding  
will affect this
  - optimization done during
- 2. Controllability enhancement } Register-binding
- 3. Observability "
- In this case, we can observe the output & hence the faults can be atleast observed
- ∴ Good observability



choice between registers  
at  $t=1$  &  $t=2$



We put  $R_1$  after multiplexer because all inputs must be registered (discussed on last page)



paths from register to register

sequential depth = 3 since minimum registers a bit will pass through in longest path is 3 ( $R_4 \rightarrow R_2 \rightarrow R_1 \rightarrow \text{output}$ )

(max. no. of registers between a pair of i/p / o/p)

If we simply swap R1 & R2 at t=2:



All are controllable since each register directly connected to primary input.

Now, sequential depth = 2 not 3 but at the cost of a 4-to-1 mux instead of 2-to-1 MUX

\* Recap of observability & controllability



∴ No. of registers between primary input & current register  $\propto \frac{1}{\text{controllability}}$

# registers between output & current register  $\propto \frac{1}{\text{observability}}$

Heuristic for assigning registers: (Left Edx Alg.)

Start from  $t=1$ , scan from left to right & pick one which has emptied recently

e.g.



Sequential depth = 2

(write again)

Now: Modify LEA to increase feasibility (register binding)

- More priority is given to reduction of sequential depth
- Testability is dependent on scheduling, register binding & resource binding

☞ Security: hardware attacks are now popular  
RSA, AES is mathematically impossible to break!  
But side-channel attacks such as power probing, etc.  
are used

Clearly, testability vs security is a trade-off

↓                          ↗

want the circuit to      closed down as much  
be open for probing      as possible

: Assuming  $x = 1$ .

$$x_{11} + x_{21} + x_{31} + x_{41} + \underbrace{x_{73} + x_{43}}_{\text{can be executed at } t=3} \leq a_{\text{mult}}$$

$$x_{62} + x_{72} + x_{32} + x_{42} \leq a_m$$

⇒ Designing the functional units:

① ON set: input values which evaluate to 1

② OFF set: complement

→ Instead of storing entire truth table, store whichever set is smaller

. It is a canonical representation

③ Truth table - not scalable

④ BDD: based on Shannon's expansion:

$$f(x_1, x_2, \dots, x_n) = x_i f_{x_i} + \bar{x}_i f_{\bar{x}_i} \Rightarrow$$



$$f_{x_i} = f(x_1, x_2, \dots, x_{i-1}, 1, x_{i+1}, \dots, x_n)$$

Similarly for  $f_{\bar{x}_i}$

Continue with  $x_j$  as the next decomposition variable

e.g.

|   | a | b | c | f |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 |   |
| 0 | 1 | 0 | 0 |   |
| 0 | 1 | 1 | 1 |   |
| 1 | 0 | 0 | 0 |   |
| 1 | 0 | 1 | 1 |   |
| 1 | 1 | 0 | 1 |   |
| 1 | 1 | 1 | 1 |   |



$$\text{Total nodes} = 2^{\frac{n}{2}} - 1 + 2^{\frac{n}{2}}$$

internal nodes      leaves

We can store only 2 leaves but still, total of  $2^{\frac{n}{2}} + 1$  no.

Removing redundancy:



- Clearly, it is ordered since it depends on the order of variables chosen

for many functions, the RABOD grows linearly  
(for some like multiplier, it is still bad)  
but still better than truth table, which grows  
exponentially.

we can compare 2 functions  $f_1 \stackrel{?}{=} f_2$  by comparing  
the trees & hence test a new implementation

- But how can we start constructing/reducing  
it if it starts with  $2^{n+1} - 1$  nodes?
- use decomposition property to build it incrementally
- ① base cases ✓      ② combination properties ✓  
use some sort of DP algo.

$$\text{Let } \Rightarrow f = x_i f_{x_i} + \bar{x}_i f_{\bar{x}_i}$$

$$g = x_i g_{x_i} + \bar{x}_i g_{\bar{x}_i}$$

Then,  $f$  or  $g$ :



when  $g$  itself is independent of  $x_i$ ,  $g_{\bar{x}_i} \equiv g_{x_i}$

e.g.  $x_1, x_2$ :



- Taking complement is as simple as swapping the nodes
- $x_1 x_2 + \bar{x}_1 \bar{x}_2 = x_1 \oplus x_2$ :



Total nodes = 5

$x_1 \oplus x_2 \oplus x_3$  :

Total nodes = 7



$x_1 \oplus x_2$  : 5 nodes

$x_1 \oplus x_2 \oplus x_3$  : 7 nodes

$x_1 \oplus x_2 \oplus x_3 \oplus x_4$  : 9 nodes

↳ linear increase in # nodes

Many common functions such as adder have XOR & XNOR operations !!!

∴ Growth of ROBDD is linear

• (Doesn't scale for multiplier)

• Can be implemented in circuit by replacing all nodes with muxes.

• Applications of ROBDD:

1) Equivalence checking:  $f_1 \stackrel{?}{=} f_2$

2) Implement directly

We can store the netlist in the form of a graph

e.g.



Each gate will serve as a node in a graph

Each node will have the following properties:

|           |
|-----------|
| node no.  |
| Gate type |
| # input   |
| # output  |
| Inputs    |
| Outputs   |
| value     |

eg. 2

eg. 1



calculate & store during simulation

we need to proceed in a proper manner i.e. assign a level to each gate

Level (any primary input) = 0

Level (any gate) = max(level of all inputs) + 1

(called) compiled code simulation

- . when testing multiple inputs, we can check if the current output changed.  
if at a particular gate, both inputs to it did not change despite a change in the primary input, we need not evaluate it again & hence save time  
→ called as event-driven simulation
- however, we can pipeline the process by using multiple processors (one at each level)
- . Simple compiled code simulation may be better in the practical world



- . If  $d$  is delay of 1 MUX: total delay =  $nd$   
Since a max of  $n$  MUX can come in our way  
Area  $\propto n$

\* Order of variables has a huge impact on the size & shape of RODD

e.g.  $a_1 b_1 + a_2 b_2$       ↘  $a_1 a_2 b_1 b_2$   
                                  ↗  $a_1 b_1 a_2 b_2$

NP-hard : Only heuristics exist to choose the right order