



## *EE141-Spring 2010 Digital Integrated Circuits*

Lecture 24  
Multipliers

EECS141

Lecture #24

1

## *Administrativia*

- ❑ New homework to be posted this weekend
  - Last one to be graded
- ❑ Project Phase 3 Launched Next We
  - Some insights today

EECS141

Lecture #24

2

## ***Class Material***

- ❑ Last lecture
  - Adders
- ❑ Today's lecture
  - Multipliers
  - Introduction to Memory
- ❑ Reading (Ch 11)



## ***Multipliers***

## Binary Multiplication

$$\begin{array}{r}
 \begin{array}{r}
 1 & 0 & 1 & 0 & 1 & 0 \\
 \times & & 1 & 0 & 1 & 1 \\
 \hline
 1 & 0 & 1 & 0 & 1 & 0 \\
 1 & 0 & 1 & 0 & 1 & 0 \\
 0 & 0 & 0 & 0 & 0 & 0 \\
 + & 1 & 0 & 1 & 0 & 1 & 0 \\
 \hline
 1 & 1 & 1 & 0 & 0 & 1 & 1 & 0
 \end{array}
 \end{array}$$

Multiplicand  
 Multiplier  
 Partial products  
 Result

## Binary Multiplication

$$\begin{aligned}
 Z = X \times Y &= \sum_{k=0}^{M+N-1} Z_k 2^k \\
 &= \left( \sum_{i=0}^{M-1} X_i 2^i \right) \left( \sum_{j=0}^{N-1} Y_j 2^j \right) \\
 &= \sum_{i=0}^{M-1} \left( \sum_{j=0}^{N-1} X_i Y_j 2^{i+j} \right)
 \end{aligned}$$

with

$$\begin{aligned}
 X &= \sum_{i=0}^{M-1} X_i 2^i \\
 Y &= \sum_{j=0}^{N-1} Y_j 2^j
 \end{aligned}$$

## The Array Multiplier



## The M-by-N Array Multiplier: Critical Path



$$t_{multi} \approx [(M-1) + (N-2)] \cdot t_{carry} + (N-1) \cdot t_{sum} + t_{and}$$

## Transmission-Gate Full Adder

Balanced  $t_{\text{sum}}$  and  $t_{\text{carry}}$



## Carry-Save Multiplier



$$t_{\text{mult}} = t_{\text{and}} + (N - 1) \cdot t_{\text{carry}} + t_{\text{merge}}$$

## Multiplier Floorplan



EECS141

Lecture #24

11

## Wallace-Tree Multiplier



EECS141

Lecture #24

12

## Wallace-Tree Multiplier



## Wallace-Tree Multiplier



## ***Multipliers – Summary***

❑ Optimization constraints different than in binary adder

- Once again:

- Need to identify critical path
- And find ways to use parallelism to reduce it

❑ Other possible techniques

- Logarithmic versus linear (Wallace Tree Mult)
- Data encoding (Booth)
- Pipelining

## ***First glimpse at system level optimization***

## ***The Binary Shifter***



## The Barrel Shifter



Area Dominated by Wiring

EECS141

Lecture #24

17

## 4x4 Barrel Shifter



$$\text{Width}_{\text{barrel}} \sim 2 p_m M$$

EECS141

Lecture #24

18

## Logarithmic Shifter



EECS141

Lecture #24

19

## 0-7 bit Logarithmic Shifter



$$\text{width}_{\log} \approx p_m \cdot \left[ 2K + (1 + 2 + \dots + 2^{K-1}) \right] = p_m \cdot (2^K + 2K - 1)$$

EECS141

Lecture #24

20

## *Arithmetic - Summary*

EECS141

Lecture #24

21



## *Semiconductor Memory*

EECS141

Lecture #24

22<sup>22</sup>

## Why Memory?

Intel 45nm Core 2



EECS141

Lecture #24

23

## Semiconductor Memory Classification

| Read-Write Memory |                      | Non-Volatile<br>Read-Write<br>Memory | Read-Only Memory                       |
|-------------------|----------------------|--------------------------------------|----------------------------------------|
| Random<br>Access  | Non-Random<br>Access | EPROM                                | Mask-Programmed<br>Programmable (PROM) |
| SRAM              | FIFO                 | FLASH                                |                                        |
| DRAM              | LIFO                 |                                      |                                        |
|                   | Shift Register       |                                      |                                        |
|                   | CAM                  |                                      |                                        |

EECS141

Lecture #24

24

## Random Access Memories (RAM)

### □ STATIC (SRAM)

**Data stored as long as supply is applied**  
**Larger (6 transistors/cell)**  
**Fast**  
**Differential (usually)**

### □ DYNAMIC (DRAM)

**Periodic refresh required**  
**Smaller (1-3 transistors/cell)**  
**Slower**  
**Single Ended**

## Random Access Chip Architecture

- Conceptual: linear array
  - Each box holds some data
  - But this does not lead to a nice layout shape
  - Too long and skinny



- Create a 2-D array
  - Decode Row and Column address to get data





EECS141

Lecture #24

27



EECS141

Lecture #24

28

## Row Decoders

**Collection of  $2^M$  complex logic gates  
Organized in regular and dense fashion**

### (N)AND Decoder

$$WL_0 = \overline{A_0} \overline{A_1} \overline{A_2} \overline{A_3} \overline{A_4} \overline{A_5} \overline{A_6} \overline{A_7} \overline{A_8} \overline{A_9}$$

$$WL_{511} = A_0 A_1 A_2 A_3 A_4 A_5 A_6 A_7 A_8 \overline{A_9}$$

### NOR Decoder

$$WL_0 = \overline{A_0 + A_1 + A_2 + A_3 + A_4 + A_5 + A_6 + A_7 + A_8 + A_9}$$

$$WL_{511} = \overline{\overline{A_0} + \overline{A_1} + \overline{A_2} + \overline{A_3} + \overline{A_4} + \overline{A_5} + \overline{A_6} + \overline{A_7} + \overline{A_8} + \overline{A_9}}$$

## Decoder Design Example



- ❑ Look at decoder for 256x256 memory block (8KBytes)

## Problem Setup

- ❑ Goal: Build fastest possible decoder with static CMOS logic

- ❑ What we know

- Basically need 256 AND gates, each one of them drives one word line



## Problem Setup (1)

- ❑ Each word line has 256 cells connected to it
- ❑ Total output load is  $256 \cdot C_{cell} + C_{wire}$
- ❑ Assume that decoder input capacitance is  $C_{address} = 4 \cdot C_{cell}$
- ❑ Each address drives  $2^8 / 2$  AND gates
  - A0 drives  $\frac{1}{2}$  of the gates, A0\_b the other  $\frac{1}{2}$  of the gates
- ❑ Neglecting  $C_{wire}$ , the fan-out on each one of the 16 address wires is:
 
$$BF = \frac{C_{load}}{C_{in}} = \frac{(2^8 / 2)(2^8 C_{cell})}{4C_{cell}} = 2^{13}$$

## Decoder Fan-Out

- ❑ FB of at least  $2^{13}$  means that we will want to use more than  $\log_4(2^{13}) = 6.5$  stages to implement the AND8
- ❑ Need many stages anyways
  - So what is the best way to implement the AND gate?
  - Will see next that it's the one with the most stages and least complicated gates

## 8-Input AND



$$\begin{array}{ll}
 LE = 10/3 & 1 \\
 \Pi LE = 10/3 & \\
 P = 8 + 1 &
 \end{array}
 \quad
 \begin{array}{ll}
 LE = 2 & 5/3 \\
 \Pi LE = 10/3 & \\
 P = 4 + 2 &
 \end{array}
 \quad
 \begin{array}{ll}
 LE = 4/3 & 5/3 \\
 \Pi LE = 80/27 & \\
 P = 2 + 2 + 2 + 1 &
 \end{array}
 \quad
 \begin{array}{ll}
 4/3 & 1 \\
 & \\
 & 
 \end{array}$$

## 8-Input AND



- ❑ Using 2-input NAND gates
  - 8-input gate takes 6 stages
- ❑ Total LE is  $(4/3)^3 \approx 2.4$
- ❑ So PE is  $2.4 * 2^{13}$  – optimal N of ~7.1

## Decoder So Far

- ❑ 256 8-input AND gates
  - Each built out of tree of NAND gates and inverters
- ❑ Issue:
  - Every address line has to drive 128 gates (and wire) right away
  - Can't build gates small enough - Forces us to add buffers just to drive address inputs



## Look Inside Each AND8 Gate



EECS141

Lecture #24

37

## Predecoders

- ❑ Use a single gate for each of the shared terms
  - E.g., from  $A_0, \bar{A}_0, A_1$ , and  $\bar{A}_1$ , generate four signals:  $A_0A_1, \bar{A}_0A_1, A_0\bar{A}_1, \bar{A}_0\bar{A}_1$
- ❑ In other words, we are decoding smaller groups of address bits first
  - And using the “predecoded” outputs to do the rest of the decoding

EECS141

Lecture #24

38

## Predecoder and Decoder

