

\* Ripple carry Adder (or) Parallel Adder.

4-bit Ripple carry Adder.

$A_{n-1}, B_{n-1}$



- > N-bit Adder can be constructed by cascading N full adders circuits in series, such that first full adder 'C0' is connected to next full adder carry-in & so on.
- > This arrangement is known as Ripple carry Adder ; Since carry bit ripples from one stage to another.
- > The delay through the circuit depends upon the number of stages that must be traversed & is a function of the applied pip signals.
- > for some input combinations no ripples occur at all, while for others, the carry has to ripple from LSB to the MSB. posn. The propagation delay of such structure is defined as -the worst case delay over all possible patterns.
- > The delay is then proportional to the no. of bits in the pipwords  $N'$  & is approximated by

$$t_{\text{adder}} \approx (N-1)t_{\text{carry}} + t_{\text{sum}}$$

$$\begin{aligned} t_{\text{carry}} &= 12 \text{ ns} \\ t_{\text{sum}} &= 15 \text{ ns} \\ t_{\text{adder}} &= \frac{12(N-1) + 15}{12} \end{aligned}$$

- > The propagation delay of the ripple carry adder is linearly proportional to  $N$ . This property becomes very important when designing adder for wide datapaths.

\* A 16-bit RCA is realized using 16 identical full adders (FA) as shown in fig. The carry-propagation delay of each FA is 12ns & the sum-propagation delay of each FA is 15ns. The worst case delay of this 16-bit adder?



$$\begin{aligned}
 t_{\text{add}} &= (N-1)t_{\text{pc}} + t_{\text{sum}} \\
 &= (16-1)t_{\text{pc}} + t_{\text{sum}} \\
 &= 15 \times 12 \text{ ns} + 15 \text{ ns}, \\
 &= \underline{\underline{195 \text{ ns}}}
 \end{aligned}$$

bit address.

Hence, to increase the speed of the adder, faster carry generation techniques are made use of.

These techniques improve the speed of operation of the adder, but on another hand increase the floor area as well.

### (i) Carry Look Ahead Adder.

and in em em.

In RCA, the final carry-out cannot be obtained without the propagation of carry-in through the previous adder cells. Using the general propagate function, this carry propagation can be avoided & the carry out can be computed directly.

$$C_i^o = G_i + P_i C_{i-1}^o$$

$i=0$

$$C_0 = G_0 + P_0 C_{-1} \quad \dots \textcircled{1}$$

$i=1$

$$C_1 = G_1 + P_1 C_0 \quad \dots \textcircled{2}$$

$$i=2 \quad G_2 + P_2 (G_1 + P_1 (G_0 + P_0 C_{-1}))$$

$$C_2 = G_2 + P_2 G_1$$

$$= G_2 + P_2 (G_1 + P_1 (G_0 + P_0 C_{-1}))$$

$i=3$

$$C_3 = G_3 + P_3 C_2$$

$$= G_3 + P_3 \{ G_2 + P_2 G_1 + P_1 P_2 G_0 + P_0 P_1 P_2 C_{-1} \}$$

Inputs  
A, B,  $C_{in}$

| A | B | $C_{in}$ | S | $C_{out}$ |
|---|---|----------|---|-----------|
| 0 | 0 | 0        | 0 | 0         |
| 0 | 0 | 1        | 1 | 0         |
| 0 | 1 | 0        | 1 | 0         |
| 0 | 1 | 1        | 0 | 1         |
| 1 | 0 | 0        | 1 | 0         |
| 1 | 0 | 1        | 0 | 1         |
| 1 | 1 | 0        | 0 | 1         |
| 1 | 1 | 1        | 1 | 1         |

No carry generation  
 $\rightarrow C_{out} = 0$

Carry propagate  
 $C_{out} = C_{in}$

$C_{out} = 1$

$$C_0 = \underbrace{G_0}_{\text{Carry generate}} + \underbrace{P_0}_{\text{Carry propagate}} \underbrace{(A \oplus B) C_{in}}_{(A+B) C_{in}}$$

Carry generate  
Carry propagate

$$C_0 = G_0 + P_0 C_{in}$$

- > These equations indicate that carry outputs can be computed directly, without the need for propagating through the adder cells.
- > Hence, as there is no ripple effect, the speed of the adder is increased.
- > But the limitation is that with increased no. of carry outputs, the expression becomes much larger & logic becomes quite complex.

> CLA:



4-bit CLA.

7-bit full Adder Using CLA.



This unit produces the four carry outputs simultaneously. However,  $C_0, C_1, C_2$  are required only for producing the sum bits. These carry ops will not ripple through.

\* 16-bit full Adder by 4-bit Full Adder.



- for  $\bar{S}_3$  expression - CMOS implementation

$$\bar{S}_3 = \bar{G}_3 + P_3 \bar{G}_2 + P_2 P_3 \bar{G}_1 + P_1 P_2 P_3 G_0 + P_0 P_1 P_2 P_3 C_{-1}$$



Carry select adder :-



4-bit CSA.

# 2-bit CSA:



$$\begin{array}{r}
 & A_1 & A_0 \\
 \textcircled{1} & | & | \\
 & B_1 & B_0 \\
 \hline
 & 1 & 0
 \end{array}$$

### \* Carry Selected Adder.

- ① → A standard logic design technique to accelerate the critical path is to pre-compute the ops for both possible ilps, & then use a multiplexes to select b/w the two op choices.
- ② CSA consists of two ripple carry adders & a multiplexer.
- ③ The calculation is done twice with an assumption that the carry ilp is '0' & another adder with an assumption that the carry ilp is 1.
- ④ After the two results are calculated, the correct sum, as well as the correct carry-out, is then selected with the multiplexer once the carry-in is known.

### \* Improvement in Speed.

Let us consider the RCA; for an n-bit adder, the computation time is given by  $T = nk_1$ , where  $k_1$  is the delay through one adder cell.

→ for, carry selected adder, if the adder is divided into blocks, with each block containing two adders only, then  $T = 2k_1 + (n-2)k_2$

where  $k_2$  is the delay through the mux.

→ Hence it is observed that, if not planned properly then 'k<sub>2</sub>' may ~~not~~ <sup>drastically</sup> there may not be any great advantage.

$\rightarrow T_1 - k_2$

Let the  $n$ -bit adder be divided into ' $M$ ' blocks & each block contain ' $p$ ' adder cells in series i.e.  $n = MP$ .  
 (Ex: 16-bit  $\rightarrow$  2 blocks  $\rightarrow$  each block has 8 adder cells). The completion.

time ( $T$ ) is the sum of  $\frac{2 \times 8 = 16}{\text{propagation delay}}$  through the first block + propagation delay through the muxes.

$$\text{i.e. } T = Pk_1 + (M-1)k_2 \quad (1)$$

To obtain minimum  $T$  eq(1) has to be differentiated w.r.t. ' $M$ ' & the result to be equated to zero.

$$n = MP \\ P = n/M$$

$$\text{i.e. } \frac{dT}{dM} = \frac{d}{dM} \left\{ \frac{n}{M} k_1 + (M-1)k_2 \right\} = 0$$

$$nk_1 \left( -\frac{1}{M^2} \right) + k_2 = 0 = 0$$

$$(08) \frac{nk_1}{M^2} = k_2 \Rightarrow M = \sqrt{\frac{nk_1}{k_2}}$$

Ex:- for a 64-bit CSA, given  $k_1 = 4ns$  &  $k_2 = 1ns$ . find the no. of blocks & no. of adder cells in each block for achieving min.  $T$ .

$$\text{Given } n = 64, k_1 = 4ns, k_2 = 1ns$$

$$M = \sqrt{\frac{64 \times 4}{1}} = \sqrt{256} = 16$$

$$P = \frac{64}{16} = 4, T = \frac{16 + 15}{2} = \underline{\underline{31ns}}$$

\* Carry skip adders / carry-by-pass adder.

4-bit carry skip adder.



Carry skip logic:- Using carry propagate logic + 2:1 mux

Carry is skipped if carry is propagated to next RCA block.

$$\rightarrow \text{Carry propagate} - BP = (A_0 \oplus B_0)(A_1 \oplus B_1)(A_2 \oplus B_2)(A_3 \oplus B_3)$$

block  $\rightarrow$  if  $P = 1$ , then no need to wait for

'C' generation directly passed cin to next block.

which eliminates carry generation time i.e. C'

Eg:-

$$A = 1010$$

$$B = 0110$$

$$\begin{array}{r} 1010 \\ 0110 \\ \hline 10000 \\ \hline c_3 \quad s \quad g \quad c_0 \end{array}$$

$$P = A \oplus B \Rightarrow P_0 = 0$$

$$P_1 = 0$$

$$P_2 = 1$$

$$P_3 = 1$$

$$BP = P_0 \cdot P_1 \cdot P_2 \cdot P_3$$

$$= 0$$

$\therefore$  Wait for 'C' generation

$\rightarrow g_m$

> Thus, to save this propagation time a special signal called as "block propagation" is used.

> This is defined as ( $BP = \pi P_0$ ) & if this signal is '1', then carry-in need not be propagated through the block.

> Instead it can be directly transmitted through a mux, to the next block.

Thus, the propagation delay gets minimized. Speed of the adder gets increased.

\* Case if  $BP=0$ , then it indicates that there can be generation of carry & hence the 1st carry needs to be propagated through the block. Now to reduce this delay, the choice of the block size becomes necessary, for which the computation is -

Let ' $k'$ ' be the time needed by the carry signal to propagate thro' the adder cell &  $k_2$  - propagation delay of mux. Then computation time is given by,

$$T = 2(P-1)k_1 + (M-2)k_2 \quad \text{--- (1)} \quad [n = MP]$$

$$\therefore T_{\min} - \frac{dT}{dM} = \frac{d}{dM} \left[ 2 \left\{ \frac{n}{M} \right\} k_1 + (M-2)k_2 \right] = 0$$

$$\Rightarrow 2n \left\{ -\frac{1}{M^2} \right\} k_1 + k_2 = 0 = 0$$

$$\frac{2nk_1}{M^2} = k_2$$

$$M = \sqrt{\frac{2nk_1}{k_2}}$$



Read

$V_{DD}$  of my

$$Q = 0$$

Scanned by CamScanner

$2x-5$

~~$SX$~~

$\begin{array}{l} 00 \rightarrow \text{No op} \\ 01 \rightarrow \text{Add} \\ 10 \rightarrow \text{Sub} \\ 11 \rightarrow \text{No op} \end{array} \} \text{ARS}$

$5 \rightarrow 00101 \rightarrow 5 \rightarrow 11011$

$2 \rightarrow 00010$

End of process

00000 110110

Parity: ~~01110~~ 100

|   |   |   |   |   |
|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 0 |

1'1110 110110

1 1111 011011

Pass 2 :-

1 1 1 1 1 0 1 1 0 1

1 1 1 1 1 1 0 1 1 0 1

Pass 3 :-

1 1 1 1 1

$$\textcircled{1} \quad \begin{array}{r} 0 0 0 1 0 \\ \hline 1 1 1 \\ \hline 0 0 0 0 1 \end{array}$$

0 0 0 0 1 1 0 1 1 0 1

0 0 0 0 0 1 1 0 1 1 0

Pass 4 :-

0 0 0 0 0

0 0 0 1 0

1 1 1 1 0

↓ 1 1 1 0 1 1 0 1 1 0

↓ 1 1 1 1 0 1 1 0 1 1

Pass 5/7

11111 011011

11111 101101

---

↓

+ neglect

2) Complement.

00000 010010  
+ 1

---

↓

10.

Carry generated in  
Pass 3

So it is → 10 //

\* Shift Registers:



| Shift unit | literal Result  |
|------------|-----------------|
| 0.         | literal $<3:0>$ |
| 1.         | literal $<4:1>$ |
| 2.         | literal $<5:2>$ |
| 3          | literal $<6:3>$ |

$\checkmark$  zero / one detector



a) one detector.



b) zero detector.





\* Magnitude Comparator:



\* Equality Comparator:



all ones.

\* Parity generation



28 + full adder ✓

# LAMDA BASED DESIGN RULES

## ① N-well :-



## ② Active area :-



## ③ Poly I :-



④ Contact to poly



⑤ Contact to active



⑥ metal 1



Width =  $2\lambda$

Spacing =  $2\lambda$



$2\lambda = \text{min overlap of}$   
Poly Contact

③ metal 3D



④ metal 3D



⑤ via 3D



mos



2x2

2x2



1x1



2x1



\* Data path logic systems \*

\* Adder,

→ Ripple Carry adder.

→ Carry look ahead adder.

→ Carry skip adder. / Carry bypass adder.

→ Carry select adder.

| a | b | $C_{in}$ | s | $C_o$ |
|---|---|----------|---|-------|
| 0 | 0 | 0        | 0 | 0     |
| 0 | 0 | 1        | 1 | 0     |
| 0 | 1 | 0        | 1 | 0     |
| 0 | 1 | 1        | 0 | 1     |
| 1 | 0 | 0        | 1 | 0     |
| 1 | 0 | 1        | 0 | 1     |
| 1 | 1 | 0        | 0 | 1     |
| 1 | 1 | 1        | 1 | 1     |

$$S = a \oplus b \oplus c$$

$$C_o = ab + bc + ca$$

$$\underline{S = abc + \bar{C}_o(a+b+c)}$$

$$\underline{C_o = \underline{ab} + \underline{c}(\bar{a}+b)}$$



Add. type  
Ans  
seeing

28T CMOS \*Full adder using CMOS\*



$$\underline{\text{NAND}} = \overline{AB}$$

A N S  
And n-type  
leaving





~~PFA~~  
 Adhoc  
 need }  
 BSSf.  
 n.m

HA  
 with  
 bonding

sum -  $\Sigma$   
 "beginning"

$$Y = \bar{A}\bar{B} + AB$$



PPS  
PMOS - { + S . P  
NMOS - { + P . S





~~AND~~ & ~~OR~~

$$Y = AB + BA - 1P$$

of will be ~~Y~~  $Y = \overline{AB} + BA$

$$= (\overline{A}\overline{B}) + (\overline{B}\overline{A})$$

$$= (\overline{A} + B) (\overline{B} + A)$$

$$= \overline{AB} + \overline{AA} + \overline{BB} + AB$$

$$= \overline{AB} + \overline{A} \cdot \overline{B}$$

$$\overline{AB} = A + B$$

$$\overline{AB} = \overline{A} + \overline{B} = \overline{A} \cdot \overline{B}$$

OR  $\overline{\overline{A} + \overline{B}} = \overline{A} + \overline{B} = A + B$



Ans other logic  
Ans



Q. Why testing is needed in VLSI design. Explain the principle of testing.

Ans: Need for testing:

- 1) During fabrication process several types of defects may exists such as catastrophic, crystalline. Catastrophic defect is due to contamination, resulting in destruction of all the transistors on the chip and crystalline defect is because of destruction of single transistor on chip.
- 2) It is necessary to test the chip from the flaws. Hence it is mandatory to check the chip regarding its performance and functionality. Identifying the fault chips is a complex job and also time consuming. The faulty chip creates huge difficulty in system debugging. It also increases the debugging cost. Therefore the Design for Testability (DFT) is necessary.

Testing principle:

- When the chip is digital, the stimuli are called test patterns or test vectors.
- Automatic test equipment (ATE) carries out this process. A powerful computer operating under the control of a Test program, a program written in a high level language. Digital Signal Processor (DSP) used for analog testing.
- chips are automatically fed to the tester through the wafer handler system. A probe card or membrane probe contacts pads of bare or packaged chip.



# Stuck-At Faults

- How does a chip fail?
  - Usually failures are shorts between two conductors or opens in a conductor
  - This can cause very complicated behavior
- A simpler model: *Stuck-At*
  - Assume all failures cause nodes to be “stuck-at” 0 or 1, i.e. shorted to GND or  $V_{DD}$
  - Not quite true, but works well in practice

## Single Stuck-Line (SSL) Model

- A single node in the circuit is stuck-at 1 (s-a-1) or 0 (s-a-0).



Fault-free function  $z = AB + CD$

Faulty function  $z^f = AB$



Fault-free function  $z = AB + CD$

Faulty function  $z^f = AB + D$

Number of possible stuck-at faults in a circuit with  $n$  lines?

Number of faults reduced by finding equivalent classes

# Bridging Faults

- Models short circuits, pairs of nodes considered
- Number of bridging faults?
- Feedback vs non-feedback bridging faults



| bridge | A | B | z | $z^f$ | Wired-AND | Wired-OR |
|--------|---|---|---|-------|-----------|----------|
|        | 0 | 0 | 0 | 0     | 0         | 0        |
|        | 0 | 1 | 0 | ?     | 0         | 1        |
|        | 1 | 0 | 1 | ?     | 0         | 1        |
|        | 1 | 1 | 1 | 1     | 1         | 1        |

$$z^f = ?$$

What are the test patterns in this example?

# Stuck-Open Faults



Fault-free circuit:  $z = \overline{a+b}$   
Faulty circuit:  $z^f = \overline{a+b} + \overline{ab}\tilde{z}$

$\tilde{z}$ : Previous value of  $z$

Case 1:  $a = b = 1$ ,  $z$  pulled down to 0

Case 2:  $a = 1, b = 0$ ,  $z$  retains previous state

A test for a stuck-open fault requires two patterns  
 $\{ab = 00, ab = 10\}$

## 15.6 Design for Testability

The keys to designing circuits that are testable are controllability and observability. Restated, controllability is the ability to set (to 1) and reset (to 0) every node internal to the circuit. Observability is the ability to observe, either directly or indirectly, the state of any node in the circuit. Good observability and controllability reduce the cost of manufacturing testing because they allow high fault coverage with relatively few test vectors. Moreover, they can be essential to silicon debug because physically probing internal signals has become so difficult.

We will first cover three main approaches to what is commonly called *Design for Testability* (DFT). These may be categorized as follows:

- *Ad hoc* testing
- Scan-based approaches
- Built-in self-test (BIST)

### 15.6.1 Ad Hoc Testing

*Ad hoc* test techniques, as their name suggests, are collections of ideas aimed at reducing the combinational explosion of testing. They are summarized here for historical reasons. They are only useful for small designs where scan, ATPG, and BIST are not available. A complete scan-based testing methodology is recommended for all digital circuits. Having said that, the following are common techniques for ad hoc testing:

- Partitioning large sequential circuits
- Adding test points
- Adding multiplexers
- Providing for easy state reset

A technique classified in this category is the use of the bus in a bus-oriented system for test purposes. Each register has been made loadable from the bus and capable of being driven onto the bus. Here, the internal logic values that exist on a data bus are enabled onto the bus for testing purposes.

Frequently, multiplexers can be used to provide alternative signal paths during testing. In CMOS, transmission gate multiplexers provide low area and delay overhead.

Any design should always have a method of resetting the internal state of the chip within a single cycle or at most a few cycles. Apart from making testing easier, this also makes simulation faster as a few cycles are required to initialize the chip.

### 15.5.6 Fault Coverage

A measure of goodness of a set of test vectors is the amount of *fault coverage* it achieves. That is, for the vectors applied, what percentage of the chip's internal nodes were checked? Conceptually, the way in which the fault coverage is calculated is as follows. Each circuit node is taken in sequence and held to 0 (S-A-0), and the circuit is simulated with the test vectors comparing the chip outputs with a *known good machine*—a circuit with no nodes artificially set to 0 (or 1). When a discrepancy is detected between the *faulty machine* and the good machine, the fault is marked as detected and the simulation is stopped. This is repeated for setting the node to 1 (S-A-1). In turn, every node is stuck (artificially) at 1 and 0 sequentially. The fault coverage of a set of test vectors is the percentage of the total nodes that can be detected as faulty when the vectors are applied. To achieve world-class quality levels, circuits are required to have in excess of 98.5% fault coverage. The *Verification Methodology Manual* [Bergeron05] is the bible for fault coverage techniques.

### 15.5.7 Automatic Test Pattern Generation (ATPG)

Historically, in the IC industry, logic and circuit designers implemented the functions at the RTL or schematic level, mask designers completed the layout, and test engineers wrote the tests. In many ways, the test engineers were the Sherlock Holmes of the industry, reverse engineering circuits and devising tests that would test the circuits in an adequate manner. For the longest time, test engineers implored circuit designers to include extra circuitry to ease the burden of test generation. Happily, as processes have increased in density and chips have increased in complexity, the inclusion of test circuitry has become less of an overhead for both the designer and the manager worried about the cost of the die. In addition, as tools have improved, more of the burden for generating tests has fallen on the designer. To deal with this burden, *Automatic Test Pattern Generation* (ATPG) methods have been invented. The use of some form of ATPG is standard for most digital designs.



FIGURE 15.15 An example of a delay fault

Commercial ATPG tools can achieve excellent fault coverage. However, they are computation-intensive and often must be run on servers or compute farms with many parallel processors. Some tools use statistical algorithms to predict the fault coverage of a set of vectors without performing as much simulation. Adding scan and built-in self-test, as described in Section 15.6, improves the observability of a system and can reduce the number of test vectors required to achieve a desired fault coverage.

### 15.5.8 Delay Fault Testing

The fault models dealt with until this point have neglected timing. Failures that occur in CMOS could leave the functionality of the circuit untouched, but affect the timing. For

# How to Test Chips?



## Built-in Self-test

---

- Built-in self-test lets blocks test themselves
  - Generate pseudo-random inputs to comb. logic
  - Combine outputs into a *syndrome*
  - With high probability, block is fault-free if it produces the expected syndrome

## □ Linear Feedback Shift Register

- Shift register with input taken from XOR of state
- Pseudo-Random Sequence Generator



| Step | Q             |
|------|---------------|
| 0    | 111           |
| 1    | 110           |
| 2    | 101           |
| 3    | 010           |
| 4    | 100           |
| 5    | 001           |
| 6    | 011           |
| 7    | 111 (repeats) |

#### 4.4.1 Architecture of Braun multiplier

An  $n \times n$ -bit Braun multiplier requires  $n(n - 1)$  adders and  $n^2$  AND gates [1, 5]. One efficient implementation of the Braun multiplier is the regular layout of the adder array as shown in Fig. 4.2. The internal structure of the full adder used in the Braun multiplier is depicted in Fig. 4.3 [6]. This makes Braun multipliers ideal for Very Large Scale Integration (VLSI) and Application Specific Integrated Circuit (ASIC) realization.

Each of the  $X, Y$ , product bits is generated in parallel with the AND gates [1]. Each partial product can be added to the previous sum of partial products by using a row of adders. The carry-out signals are shifted one bit to the left and are then added to the sums of the first adder and the new partial product. The shifting of the carry-out bits to the left is done by a Carry Save Adder (CSA). As the carry bits are passed diagonally downward to the next adder stage, there is no horizontal carry propagation for the first four rows. Instead, the respective carry bit is "saved" for the subsequent adder stage. Ripple Carry Adders (RCA) are used at the final stage of the array to output the final result.

#### 4.4.2 Performance of Braun multiplier

The Braun multiplier performs well for unsigned operands that are less than 16 bits, in terms of speed, power and area. Besides, it has a simple and regular structure as compared to other multiplier schemes. However, the number of components required in building the Braun multiplier increases quadratically with the number of bits. This makes the Braun multiplier inefficient and so it is rarely employed while handling large operands. Another pitfall of the Braun multiplier is its potential susceptibility to glitching problems at the last stage of the full adders due to the exploitation of the Ripple Carry Adders (RCA).

#### 4.4.3 Speed consideration

The delay of the Braun multiplier is dependent on the delay of the full adder cell and also on the final adder in the last row. In the multiplier array, a full adder with balanced carry and sum delays is desirable because the sum and carry signals are both in the critical path. The speed and power of the full adder are very important for large arrays.

The worst-case multiplication time of a Braun multiplier can be expressed as [5]

$$t_{\text{Braun}} = (n - 1)t_{\text{carry-save}} + t_{\text{AND}} + (n - 1)t_{\text{Ripple-Carry}} \quad (4.12)$$

where  $t_{\text{carry-save}}$  = time required to generate Carry-out ( $C_{\text{out}}$ ) or Sum ( $S_{\text{out}}$ ) at the output after the inputs are supplied to a CSA

$t_{\text{Ripple-Carry}}$  = time taken for the Carry-out ( $C_{\text{out}}$ ) or Sum ( $S_{\text{out}}$ ) to be generated at the output after the inputs are supplied to a RCA

$t_{\text{AND}}$  = delay of an AND gate



for braun multiplier - &lt; Previous Next &gt; - View all

Copyrighted material

126 Chapter Four



X: 4-bit multiplicand

Y: 4-bit multiplier

P: 8-bit product of X and Y

P<sub>n</sub> = X<sub>i</sub>Y<sub>j</sub> is a product bit

Figure 4.2 Schematic diagram of a 4 × 4-bit Braun multiplier.

Copyrighted material

FIGURE 8.34 181 ALU

### 27.1 Array Multiplication

Parallel multiplier is based on the observation that partial products in the multiplication process may be independently computed in parallel. For example, consider the unsigned binary integers  $X$  and  $Y$ .

$$X = \sum_{i=0}^{m-1} X_i 2^i$$

$$Y = \sum_{j=0}^{n-1} Y_j 2^j$$

The product is found by

$$\begin{aligned}
 P = X \times Y &= \sum_{i=0}^{m-1} X_i 2^i \cdot \sum_{j=0}^{n-1} Y_j 2^j \\
 &= \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} (X_i Y_j) 2^{i+j} \\
 &= \sum_{k=0}^{m+n-1} P_k 2^k.
 \end{aligned}$$

Thus  $P_k$  are the partial product terms called summands. There are  $mn$  summands, which are produced in parallel by a set of  $mn$  AND gates. For 4-bit numbers, the expression above may be expanded as in Table 8.2.

An  $n \times n$  multiplier requires  $n(n - 2)$  full adders,  $n$  half adders, and  $n^2$  AND gates. The worst-case delay associated with such a multiplier is  $(2n + 1)\tau_p$ , where  $\tau_p$  is the worst-case adder delay. Figure 8.35 shows a cell that may be used to construct a parallel multiplier. The  $X_i$  term is propagated diagonally from top right to bottom left, while the  $Y_j$  term is propagated horizontally. Incoming partial products enter at the top. Incoming CARRY IN values enter at the top right of the cell. The bit-wise AND is performed in the cell, and the SUM is passed to the next cell below. The CARRY OUT is passed to the bottom left of the cell. Figure 8.36 shows the multiplier array with the partial products enumerated. This arrangement may be drawn as a square array, as shown in Fig. 8.37, which is the most convenient for implementation. In this version the degeneracy of the first two rows of the multiplier are shown. The first row of the multiplier adders has been replaced with AND gates while the second row employs half-adders rather than full adders. This optimization might not be done if a completely regular multiplier were required (i.e., one



FIGURE 8.35 Array multiplier cell

TABLE 8.2 4-bit Multiplier Partial Products

|    |    | X3   | X2   | X1   | X0   | Multiplicand |
|----|----|------|------|------|------|--------------|
|    |    | Y3   | Y2   | Y1   | Y0   | Multiplier   |
|    |    | X3Y0 | X2Y0 | X1Y0 | X0Y0 |              |
|    |    | X3Y1 | X2Y1 | X1Y1 | X0Y1 |              |
|    |    | X3Y2 | X2Y2 | X1Y2 | X0Y2 |              |
|    |    | X3Y3 | X2Y3 | X1Y3 | X0Y3 |              |
| P7 | P6 | P5   | P4   | P3   | P2   | P1           |
| P0 |    |      |      |      |      | Product      |

# ARRAY Multiplier



4 X 4

n

$n(n-2)$   
FA



**Figure 8.13** Optimized-area TG adder

mate goal. Two transistors may be eliminated by using an inverter on the output of the XOR gate. In addition with some optimization, the output buffers may be eliminated, as shown in Fig. 8.13.<sup>4</sup>

### 8.2.1.5 Carry-Lookahead Adders

The linear growth of adder carry-delay with the size of the input word for an  $n$ -bit adder may be improved by calculating the carries to each stage in parallel. The carry of the  $i$ th stage,  $C_i$ , may be expressed as

$$C_i = G_i + P_i \cdot C_{i-1}, \quad (8.6)$$

where

$$G_i = A_i \cdot B_i \quad \text{generate signal}$$

$$P_i = A_i + B_i \quad \text{propagate signal}.$$

Expanding this yields

$$C_i = G_i + P_i G_{i-1} + P_i P_{i-1} G_{i-2} + \dots + P_i \dots P_1 C_0. \quad (8.7)$$

The sum  $S_i$  is generated by

$$S_i = C_{i-1} \oplus A_i \oplus B_i \quad (8.8)$$

or  $C_{i-1} \oplus P_i$  (if  $P_i = A_i \oplus B_i$ ).

The size and fan-in of the gates needed to implement this carry-lookahead scheme can clearly get out of hand. As a result, the number of stages of look-

and is usually limited to about four. For four stages of lookahead, the appropriate terms are

$$\begin{aligned}C_0 &= G_0 + P_0 CI \\C_1 &= G_1 + P_1 G_0 + P_1 P_0 CI \\C_2 &= G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 CI \\C_3 &= G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 CI.\end{aligned}\quad (8.9)$$

Figure 8.14 shows a generic carry-lookahead adder. The PG generation and SUM generation circuits surround a carry-generate block. A possible implementation of the carry gate for this kind of carry-lookahead adder for 4 bits is shown in Fig. 8.15. Note that the gates have been partitioned to keep the number of inputs less than or equal to four. This is typical of the type of carry lookahead that would be used in a gate-array or standard-cell design. The circuit and layout are quite irregular. Taking the term of  $C_3$ , we note that it may be expressed as

$$C_3 = G_3 + P_3 \cdot (G_2 + P_2 \cdot (G_1 + P_1 \cdot (G_0 + P_0 \cdot CI))). \quad (8.10)$$

This function may be implemented as a domino CMOS (nMOS) gate, as shown in Fig. 8.16(d). Carry  $C_0 - C_2$  are generated similarly. Note that the worst-case delay path in this circuit has six n-transistors in series. A high-speed static version of the carry-lookahead gate for  $C_3$  is shown in Fig. 8.17.<sup>5</sup>



FIGURE 8.14 G...



- What is the next step in the evolution of programmable logic?
  - **More gates!**
- How do we get more gates?
- We could put several PALs on one chip and put an interconnection matrix between them!!
  - This is called a **Complex PLD (CPLD)**.

## Complex PLDs

- To reach more complexity, the logical consequence of the SPLD-architecture was the CPLD-architecture.
- A "complex programmable logic device" (CPLD) contains many SPLD-like (PAL-like) devices interconnected via a programmable switch matrix.
- The SPLD-like devices were called logic-blocks, which contain many SPLD-like macrocells.



## Architecture of CPLD

# Architecture of CPLD



# Complex PLDs

- Significant characteristics for the CPLD-architecture :
  - » product terms generated in programmable macrocells
  - » typically one dedicated flip-flop per macrocell
  - » many macrocells per logic-block
  - » typically all logic-blocks identical
  - » minimum two logic-blocks per device
  - » routing between logic-blocks via global switch matrix

# **Complex PLDs**

- **Main-advantages :**

- » predictable timing
- » fast pin-to-pin delay
- » efficient resource utilization by switch-matrix
- » medium design complexities possible

- **Main-disadvantages :**

- » higher complexities need a very complex (expensive) switch-matrix

# What is an FPGA?

- Field Programmable Gate Array
- Fully programmable alternative to a customized chip
- Used to implement functions in hardware
- Also called a Reconfigurable Processing Unit (RPU)

The FPGA approach to arrange primitive logic elements (logic cells) in rows/columns with programmable routing between them.

# FPGA Architectures



## FPGA

- The FPGA-architecture consists of many **logic-modules**, which are placed in an array-structure. The **channels** between the logic-modules are used for routing.
- The array of logic-modules is surrounded by programmable **I/O-modules** and connected via programmable interconnects.
- This freedom of routing allows every logic-module to reach every other logic-module or I/O-module.
- The worldwide first PLD with FPGA-architecture was developed by **Xilinx** in 1984.



- Applications of FPGAs include digital signal processing, software-defined radio, aerospace and defense systems, ASIC prototyping, medical imaging, computer vision, speech recognition, cryptography, bioinformatics, computer hardware emulation, radio astronomy, metal detection and a growing range of other areas.
-

# FPGA Types

- **2 types of FPGAs**

- Reprogrammable (SRAM-based)

- Xilinx, Altera, Lattice, Atmel



SRAM logic cell

- One-time Programmable (OTP)(Anti-fuse technology)

- Actel, Quicklogic



OTP logic cell

# Xilinx FPGA



| FPGA                                                                                                                                                                                                                                                                                                                                                                                                                                              | CPLD                                                                                                                                                                                                                                                                                                                                                                                          |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <p>1) FPGA architecture contains gate array like structures.</p> <p>2) Its density range is from medium to high.</p> <p>3) FPGA speed depends on the application.</p> <p>4) For Interconnection purpose routing channels are used.</p> <p>5) In FPGA power consumption is medium.</p> <p>6) Once supply is removed, data is losted in FPGA, i.e. they are volatile.</p> <p>7) <del>FPGAs are internally based on look-up tables (LUTs).</del></p> | <p>1) CPLD architecture contains PAL like structures.</p> <p>2) Its density range is from low to medium.</p> <p>3) CPLD speed is very high.</p> <p>4) For Interconnection purpose crossbar connections are used.</p> <p>5) In CPLD power consumption is high.</p> <p>6) CPLD contains on chip non-volatile memory.</p> <p>7) <del>CPLDs form the logic functions with sea-of-gates.</del></p> |

D      B      D ✓  
 S      S ✓  
 I      P ✓  
 P      I