

## TWO-OPERAND ADDITION

Chapter two...

### What's the Deal?

- All we want to do is add up a couple numbers...

- Chapter one tells us that we can add signed numbers just by adding the mapped positive bit vectors  
 $(Z_R = (X_R + Y_R) \bmod C)$

$$x + y + c_{in} = 2^n c_{out} + s$$



$$s = (x + y + c_{in}) \bmod 2^n$$

$$\begin{aligned} c_{out} &= \begin{cases} 1 & \text{if } (x + y + c_{in}) \geq 2^n \\ 0 & \text{otherwise} \end{cases} \\ &= \lfloor (x + y + c_{in}) / 2^n \rfloor \end{aligned}$$

### Adder Bits

#### Half Adder



#### Full Adder



### Adder Bits

#### Half Adder



#### Full Adder



### Big Problem (for speed)

#### Carry generation!

- Carry at  $i$  depends on  $j \leq i$
- Non-trivial to do fast – lots of inputs...



### Fast Adders

- The primary objective is to speed up the generation of the carries!

- Carry Propagate Adders –
  - Produce an answer in conventional fixed-radix NRS
- Carry Save and Signed Digit adders –
  - Avoid carry propagation by producing sums in redundant notations
- Hybrid Adders
  - Combine as many schemes as make sense...





## Build a faster circuit?

- Complementary Pass Transistor Logic (CPL)
  - Slightly faster, but more area

Figures from David Harris...

## Build a faster circuit?

- Dual-rail domino
  - Very fast, but large and power hungry
  - Used in very fast multipliers

Figures from David Harris...

## Build a faster carry chain?

- Manchester Carry Chain
  - Use transmission gates to make carry “wire”

## Manchester Carry control

## MCC

Switch-Logic:

- Implement propagate with pass gate
- Implement kill with a pull down transistor
- Implement generate with a pull up transistor

To reduce the logic needed, and the capacitance on the carry chain use precharge switch logic. Precharge the output high, and pull it low if needed. The inputs to the gate can be outputs from other domino gates (Carry is a monotonic function of P, C, K<sup>1</sup>)

<sup>1</sup> Need to be careful, since we will use a inverter to buffer the output before we use it. That is the reason that the switch logic is generating C\_b and not C. Switching G and K will generate C directly.

## Manchester Carry Chain

## MCC

The carry chain is only part of the adder. You need to generate the P, G signals that the adder needs and to generate the sum at the end. In addition to the carry chain, each bit cell needs the following gates:

• The gates that generate P, G, K can be precharge gates, since the inputs are usually stable signals. This means that P, G, K can be domino \_v signals, and can drive the domino carry chain.

• The final EXOR must be a static gate since it is not a monotonic function of its inputs, and its inputs will be \_v signals.

## Manchester Carry Delay

$$T_{SRd} = t_{sw} + (n - 1)t_p + (n/m)t_{buf} + t_s$$

- $t_{sw}$  is time to set all switches
- $t_p$  is time to propagate through a switch
- $t_{buf}$  is a buffer – need restoring buffer every m bits
- $t_s$  computes the sum based on the carries

▫ This works well if  $t_p$  is small...

## Timing of MCC

The good news is there is not a gate between stages.

The bad news is that the number of series transistors increases with the number of stages, so the delay will grow like  $n^2$

• Capacitance per stage (assuming all 4:2 devices, no diff sharing)  
 $3 \text{ ndiff} + \text{pdiff} + Cg + \text{inv} + \text{bit-width of wire} = 12fF + 4fF + 4fF + 8fF + 8fF (30) = 36fF$

• Resistance per-stage is 6.5K, so the delay is approximately  $.12ns * n^2$ ,  $(RCn^2/2)$  where n is the number of stages directly tied together.

Slide from Mark Horowitz, Stanford

## Sizing MCC

Critical path is through the pass chain. Try to reduce this delay:

- Make P and G transistors 4x larger, and share diffusion<sup>1</sup>
- Capacitance per stage:  
 $2\text{ndiff}(16fA) + \text{pdiff} + Cg + \text{inv} + \text{bit-width of wire} = 32fF + 4fF + 16fF + 8fF + 8fF (30) = 68fF$
- Resistance per-stage is 1.6K, delay is  $0.054ns * n^2$ .

<sup>1</sup>: Make G larger since it does not hurt (diffusion is shared), and since it will be important in faster adders)

Slide from Mark Horowitz, Stanford

## Sizing MCC

To limit the effect of the  $n^2$  term, break carry chain into sections.

- Each section is about 4 stages long (3 stages might be better)
- Between sections the carry is buffered.

• The buffering makes the delay linear with the number of bits

• But the carry stills needs to flow through all the carry chains.

Slide from Mark Horowitz, Stanford

## Buffered Carry Chains

What is the 'right' number of stages?



Assume first transistor is 8x min, and final inverter is minimum  
Delay is the inverter delay (Cout rising) plus the delay of the chain including the resistance of the initial 8x transistor.

Slide from Mark Horowitz, Stanford

## Timing MCC

| Stages | Total Delay | Delay per bit |
|--------|-------------|---------------|
| 1      | 0.61        | 0.61          |
| 2      | 0.82        | 0.42          |
| 3      | 1.15        | 0.38          |
| 4      | 1.58        | 0.39          |
| 5      | 2.12        | 0.42          |
| 6      | 2.77        | 0.46          |

So for these sizing, the optimal number in a stage is around 4, and the average delay per bit is around 0.4 ns. This is not optimally sized (pMOS in final inverter should be larger) but it is probably close.

Slide from Mark Horowitz, Stanford

## Layout of MCC

Layout of a Manchester adder is not too bad, even with groups:

|           |      |       |
|-----------|------|-------|
| P_G gen ↑ | C ↑  | XOR ↑ |
| P_G gen ↓ | C ↑  | XOR ↓ |
| P_G gen ↑ | C ↑  | XOR ↑ |
| P_G gen ↓ | C* ↑ | XOR ↓ |
| P_G gen ↑ | C ↑  | XOR ↑ |
| P_G gen ↓ | C ↑  | XOR ↓ |
| P_G gen ↑ | C ↑  | XOR ↑ |
| P_G gen ↓ | C* ↑ | XOR ↓ |

Slide from Mark Horowitz, Stanford

## Back to Adder Bits

- Revisit the full adder:

| $X_i$ | $Y_i$ | $C_i$ | $C_{i+1}$ | $S_i$ | Comment   |
|-------|-------|-------|-----------|-------|-----------|
| 0     | 0     | 0     | 0         | 0     | Kill      |
| 0     | 0     | 1     | 0         | 1     | Kill      |
| 0     | 1     | 0     | 0         | 1     | Propagate |
| 0     | 1     | 1     | 1         | 0     | Propagate |
| 1     | 0     | 0     | 0         | 1     | Propagate |
| 1     | 0     | 1     | 1         | 0     | Propagate |
| 1     | 1     | 0     | 1         | 0     | Generate  |
| 1     | 1     | 1     | 1         | 1     | Generate  |

## Back to Adder Bits

- Revisit the full adder:

Case 1 (Kill):  $k_i = x'_i y'_i = (x_i + y_i)'$   
Case 2 (Propagate):  $p_i = x_i \oplus y_i$   
Case 3 (Generate):  $g_i = x_i y_i$

$$C_{i+1} = g_i + p_i c_i = x_i y_i + (x_i \oplus y_i) c_i$$

Consider  $a_i = p_i + g_i$   
[note  $g_i + p_i c_i = g_i + (g_i + p_i) c_i$  ]  
then  $c_{i+1} = g_i + a_i c_i$

| $X_i$ | $Y_i$ | $C_i$ | $C_{i+1}$ | $S_i$ | Comment   |
|-------|-------|-------|-----------|-------|-----------|
| 0     | 0     | 0     | 0         | 0     | Kill      |
| 0     | 0     | 1     | 0         | 1     | Kill      |
| 0     | 1     | 0     | 0         | 1     | Propagate |
| 0     | 1     | 1     | 1         | 0     | Propagate |
| 1     | 0     | 0     | 0         | 1     | Propagate |
| 1     | 0     | 1     | 1         | 0     | Propagate |
| 1     | 1     | 0     | 1         | 0     | Generate  |
| 1     | 1     | 1     | 1         | 1     | Generate  |

## Carry Chains

- Two types
  - 1-carry chain and 0-carry chain
  - 1-carry always starts at  $g_i=1$  (or  $c_{in}=1$ ), and propagates over consecutive positions  $p_i=1$
  - 0-carries start at  $k_i=1$  position (or  $c_{in}=0$ )...

| $i$       | 9   | 8   | 7   | 6   | 5   | 4   | 3   | 2   | 1   | 0   |
|-----------|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| $x_i$     | 1   | 0   | 1   | 0   | 1   | 1   | 1   | 1   | 0   | 0   |
| $y_i$     | 0   | 0   | 0   | 1   | 0   | 1   | 0   | 0   | 1   | 0   |
| $p$       | $p$ | $k$ | $p$ | $p$ | $p$ | $g$ | $p$ | $p$ | $p$ | $k$ |
| $a$       | $a$ | $a$ | $a$ | $a$ | $a$ | $a$ | $a$ | $a$ | $a$ | $a$ |
| $c_{i+1}$ | 0   | 0   | 1   | 0   | 1   | 1   | 0   | 0   | 0   | 0   |

## Group Carries

- Carry equation can be generalized to groups of bits
 
$$c_{j+1} = g_{(j,i)} + p_{(j,i)} c_i = g_{(j,i)} + a_{(j,i)} c_i$$

$$c_{j+1} = g_{(j,0)} + p_{(j,0)} c_0 = g_{(j,0)} + a_{(j,0)} c_0$$
- Combine subranges recursively

$$g_{(f,d)} = g_{(f,e)} + p_{(f,e)} g_{(e-1,d)} = g_{(f,e)} + a_{(f,e)} g_{(e-1,d)}$$

$$a_{(f,d)} = a_{(f,e)} a_{(e-1,d)}$$

$$p_{(f,d)} = p_{(f,e)} p_{(e-1,d)}$$



### Example (2.1)

- Now you can compute  $c_{13}$

$x = 0110|0010|1100|0011$   
 $y = 1011|1101|0001|1110$   
 $p|pppp|ppkp|ppgp$

$$k_{(12-0)} = k_{(12-8)} + p_{(12-8)}k_{(7-0)} = 1$$

$$c_{13} = g_{(12-0)} + p_{(12-0)}c_{in} = 0$$

### Example (2.1)

- With  $c_{13}$  you can compute  $s_{13}$

$x = 0110|0010|1100|0011$   
 $y = 1011|1101|0001|1110$   
 $p|pppp|ppkp|ppgp$

$$k_{(12-0)} = k_{(12-8)} + p_{(12-8)}k_{(7-0)} = 1$$

$$c_{13} = g_{(12-0)} + p_{(12-0)}c_{in} = 0$$

$$s_{13} = x_{13} \oplus y_{13} \oplus c_{13} = 1 \oplus 1 \oplus 0 = 0$$

### Carry Skip Adder

- The idea is to reduce the number of cells the worst-case carry must propagate through
  - Divide n-bit adder into groups of m-bits
  - Determine group propagate for each m-bits
  - If the entire group p is true, skip around it



## Carry Skip delay

$$\begin{aligned} T_{CSK} &= mt_c + t_{mux} + \left(\frac{n}{m} - 2\right)t_{mux} + (m-1)t_c + t_s \\ &= (2m-1)t_c + \left(\frac{n}{m} - 1\right)t_{mux} + t_s \end{aligned}$$

- Worst case is when a carry is generated in the first bit of the adder
  - Then propagated through all bits up to but not including the high order bit
  - That is, skip all groups but the first and last

## Problem with clearing carries

- Watch out – some books show an AND/OR version that doesn't really work!

- Problem is that carries might be left over from previous addition and have to dribble out...



Figure 2.10: Carry-skip adder using AND-OR for bypass

## Group Size

- Previous delay analysis assumes all groups are the same size
  - This isn't the best for speed...
  - Carries generated in the first group have to skip more groups!
  - For fixed size:

$$\begin{aligned} m_{opt} &= \left(\frac{t_{mux}}{2t_c}n\right)^{1/2} \text{ (minimum delay)} \\ T_{opt} &\approx (8t_{mux}t_c n)^{1/2} \end{aligned}$$

## Carry Skip with different m

- If you vary the group size with the groups at the ends shorter than the groups in the middle, you can speed things up

▪ N=60, t\_c = t\_{mux} = 5

▪ M=6, T\_{CS}=21.5

▪ M=4,5,6,7,8,9,7,6,5,4, T\_{CS}=17.5



## Carry Skip – Another View

Since we have divided the bits in the word into a number of groups.

- For each group check to see if all the Pg are true
- If so, then bypass the Cin to Cout of that group
- Otherwise, do the normal thing.



Slide from Mark Horowitz, Stanford

## Carry Skip – Another View

All groups can calculate Pg at the same time (in parallel)

Worse-case is when carry needs to propagate through all bits

- Since we precomputed Pg, that path is now much shorter
  - Hop around groups, rather than through them
- Critical path is now through one local carry chain, then through a number of bypass gates, then back into a final local carry chain.



- This improvement did not cost much hardware.

Slide from Mark Horowitz, Stanford

## Carry Skip - Layout

Layout of a bypass adder is almost the same, C\* gets a more stuff:

Also have a few more wires to route. You need to generate Pg (a 4 input NAND gate in the PG gen section, and you need to route Cin\_b to C\*).

Slide from Mark Horowitz, Stanford

## Carry Lookahead

- General idea – find a way to compute all carries at the same time
  - Generate logic for all carries in terms of just the X, Y and Cin bits

$$x^{(i)} = \sum_{v=0}^i x_v 2^v \quad c_i = 1 \text{ if } (x^{(i-1)} + y^{(i-1)} + c_0) \geq 2^i$$

- This is a switching function of  $2i+1$  variables

## Carry Lookahead

$$T_{1-CLA} = \frac{n}{m}t_{group} + t_s$$

## Carry Lookahead equations

- Remember:  $C_i = G_i + P_i C_i$
- $C_1 = G_0 + P_0 C_0$
- $C_2 = G_1 + P_1 C_1$   
 $= G_1 + P_1(G_0 + P_0 C_0)$   
 $= G_1 + P_1 G_0 + P_1 P_0 C_0$
- $C_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0$
- $C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0$
- Or  $C_4 = G_3 + P_3(G_2 + P_2(G_1 + P_1(G_0 + P_0 C_0)))$

## Carry Lookahead equations

- Remember:  $C_i = G_i + A_i C_i$
- $C_1 = G_0 + A_0 C_0$
- $C_2 = G_1 + A_1 C_1$   
 $= G_1 + A_1(G_0 + A_0 C_0)$   
 $= G_1 + A_1 G_0 + A_1 A_0 C_0$
- $C_3 = G_2 + A_2 G_1 + A_2 A_1 G_0 + A_2 A_1 A_0 C_0$
- $C_4 = G_3 + A_3 G_2 + A_3 A_2 G_1 + A_3 A_2 A_1 G_0 + A_3 A_2 A_1 A_0 C_0$
- Or  $C_4 = G_3 + A_3(G_2 + A_2(G_1 + A_1(G_0 + A_0 C_0)))$

## CLA-4 Module





### Three-level CLA

- Extend to three or more levels by having lookahead between sections
  - First compute  $a_i, p_i, g_i$
  - L-1 level of CLA to compute As and Gs
  - $n/m^L$  CLGs connected in ripple to compute carries of bits
  - One level of XOR to compute the sum

$$T_{L-CLA} = t_{a,g} + (L-1)t_{A,G} + \frac{n}{m^L}t_{clg} + (L-1)t_{clg} + t_s$$

### Three-level CLA ( $n=8, m=2$ )

\* Carry-out is not used

### CLA Critical Path

### Prefix Adders (Tree Adders)

- More general form of carry lookahead tree
  - Built using different organizations of the same set of basic PG cells (PA cells)
  - All based on the fact that  $c_i$  corresponds to the generate signal spanning bit positions (-1) to  $i-1$
  - Prefix adder is an interconnection of cells that produce  $g_{(i-1,r)}$  for all  $i$
  - Cells connected to produce  $g$  signals that span an increasing number of bits

### PG (PA) cell

### Overlapping Ranges

- Starting with  $g, a$  of each bit, first level generates  $g, a$  for two bits, then four, etc.
  - If right input spans bits [right2, right1], and left spans [left2, left1], with  $right2+1 \geq left1$
  - Then output spans bits [left2, right1]
  - For example right[5,2] and left[8,4] means output spans bits [8,2]

Figure 2.17: Composition of spans in computing  $(g, a)$  signals.











## Aside – ALU Design

Once you have an adder, making an ALU is very simple

Two approaches:

Build a separate logic unit and mux together the outputs. This is probably the fastest solution, since you don't slow down the add critical path, but it will take more area.

Merge the two designs together by changing the definition of P and G. Since the output (Sum) is  $P \oplus Cin$ , if  $G = 0$ , and  $Cin(\text{to adder}) = 0$  then Sum will equal P. Can do logical operations by using a general function box for the P function.

The first is probably the preferable solution, but I will show the second, because it is a little more clever (and the programmable P function unit is a perfect LU for the first solution)

## Aside – ALU +P function block

The block that generates the signal called P must be able to generate any Boolean function of two variables. This is easy -- just use a mux. To reduce control lines, I will use a precharge mux.



## Aside – ALU +G function block

This is similar to the P function block, but it does not need to be as complex. If we only wanted to do addition and logic functions, then it would only need to generate the functions (AND, 0). But we want to be able to do subtraction too.

- $A - B = A + \bar{B} + 1$ , where  $\bar{B}$  is the ones complement of B, which is just the complement of each bit.
- Since after the P, and G function block, no other part of the adder uses A,B, we can get subtract by redefining P and G, and setting Cin to be 1
- If we didn't do this, we would need to add an explicit mux to invert one of the inputs to adder in the case of subtraction.
- For addition:  
 $P = A \bar{B} + B \bar{A}; \quad G = A B$
- For subtraction:  
 $P = A B + \bar{B} \bar{A}; \quad G = A \bar{B}$

## Rest of the ALU

Is basically the same as an adder:

- Need a fast carry chain
- Final static XOR gate
- Latch to hold the value (since the output of the ALU is  $_v1$ )
- Bus driver to drive the output of the latch on bus when the ALU result is needed

## Redundant Digit Adders

### Use a redundant digit set

- Operands might be in conventional or in redundant form
- Main idea is to reduce the carry propagation
- But, increases number of bits in the result
- Useful for things like accumulation, multi-operand addition, multiplication, etc.

## Carry Save Adder

### Add three binary vectors

- Using an array of one-bit adders (i.e. full adders)

### But, don't propagate the carries !

- Output is two vectors: carry and pseudo-sum (or sum)

$$x + y + z = vc + vz = v$$

- Several combinations of vc and vs represent the same result

## Carry Save Adder

- If you want to convert back to conventional numbers, add vs and vc
  - Because there two bits for every conventional sum bit, you can think of the answer in Carry Save form to be digits in the set {0,1,2}
  - Carry Save produces a reduction from three binary vectors to two, so it's also called a 3-2 reduction
  - Adder is a [3:2] adder

## Carry Save Adder



## Carry Save Example

$$\begin{array}{r}
 X \quad 0 \ 1 \ 1 \ 1 \ 0 \ 1 \ 0 \ 0 \\
 Y \quad 0 \ 0 \ 1 \ 1 \ 1 \ 0 \ 1 \ 1 \\
 Z \quad 1 \ 0 \ 1 \ 0 \ 1 \ 0 \ 1 \ 0 \\
 \hline
 VS \quad 1 \ 1 \ 1 \ 0 \ 0 \ 1 \ 0 \ 1 \\
 (c_{out}, VC) \ 0 \ 0 \ 1 \ 1 \ 1 \ 0 \ 1 \ 0 \ 1 \\
 \text{digit value} \ 0 \ 1 \ 2 \ 2 \ 1 \ 0 \ 2 \ 0 \ 2
 \end{array}$$

## Carry Save Example

$$\begin{array}{r}
 X \quad 0 \ 1 \ 1 \ 1 \ 0 \ 1 \ 0 \ 0 \ 116 \\
 Y \quad 0 \ 0 \ 1 \ 1 \ 1 \ 0 \ 1 \ 1 \ 59 \\
 Z \quad 1 \ 0 \ 1 \ 0 \ 1 \ 0 \ 1 \ 0 \ 170 \\
 \hline
 229 \qquad VS \quad 1 \ 1 \ 1 \ 0 \ 0 \ 1 \ 0 \ 1 \ 345 \\
 117 \ (c_{out}, VC) \ 0 \ 1 \ 1 \ 1 \ 0 \ 1 \ 0 \ 1 \ \text{Cin} \\
 \text{digit value} \ 0 \ 1 \ 2 \ 2 \ 1 \ 0 \ 2 \ 0 \ 2 \\
 \hline
 256 \ 128 \ 64 \ 32 \ 16 \ 8 \ 4 \ 2 \ 1 \\
 | \quad | \\
 128 + 128 + 64 + 32 + 16 + 8 + 4 + 2 = 346
 \end{array}$$

## Carry Save

- What if two operands are both carry-save?
  - Then each operand is in Xs Xc form
  - So, you need a [4:2] adder instead of a [3:2]
  - Combine four vectors into two...
  - Still no carries!
  - Answer is still in redundant carry-save form

## Carry Save [4:2]





### 4:2 compressor cell

| Inputs |   |   |   | Cin=0 |   | Cin=1 |   | Cout |
|--------|---|---|---|-------|---|-------|---|------|
| A      | B | C | D | C     | S | C     | S |      |
| 0      | 0 | 0 | 0 | 0     | 0 | 0     | 1 | 0    |
| 0      | 0 | 0 | 1 | 0     | 1 | 1     | 0 | 0    |
| 0      | 0 | 1 | 0 | 1     | 0 | 0     | 0 | 0    |
| 0      | 1 | 0 | 0 | 0     | 0 | 1     | 0 | 0    |
| 1      | 0 | 0 | 0 | 0     | 0 | 0     | 0 | 0    |
| 0      | 0 | 1 | 1 | 0     | 1 | 1     | 1 | 1    |
| 0      | 1 | 1 | 0 | 0     | 1 | 0     | 1 | 0    |
| 1      | 1 | 0 | 0 | 0     | 0 | 1     | 0 | 0    |
| 0      | 1 | 0 | 1 | 0     | 1 | 1     | 0 | 1    |
| 1      | 1 | 1 | 0 | 0     | 1 | 0     | 1 | 0    |
| 1      | 1 | 1 | 1 | 1     | 0 | 1     | 1 | 1    |



- ### High Radix Carry Save
- Regular carry-save doubles the number of bits
  - You can reduce the number of bits with high-radix carry-save
  - If r is the radix
    - Vs is represented in radix r
    - Vc has one bit per radix-r digit

### Radix-8 Carry Save

|                   |       |       |       |   |
|-------------------|-------|-------|-------|---|
| XS                | 1 0 1 | 1 0 1 | 1 0 0 |   |
| XC                | 1     | 1     | 0     |   |
| Y                 | 0 1 0 | 0 0 1 | 1 1 1 |   |
| VS                | 0 0 0 | 1 1 1 | 0 1 1 |   |
| ( $c_{out}$ , VC) | 1     | 0     | 1     | 0 |

### Radix-8 Carry Save

|                   |       |       |       |        |
|-------------------|-------|-------|-------|--------|
| 512               | 64    | 8     | 1     |        |
| XS                | 1 0 1 | 1 0 1 | 1 0 0 | $4_8$  |
| XC                | 1     | 1     | 0     | $0_2$  |
| Y                 | 0 1 0 | 0 0 1 | 1 1 1 | $7_8$  |
| VS                | 0 0 0 | 1 1 1 | 0 1 1 | $13_8$ |
| ( $c_{out}$ , VC) | 1     | 0     | 1     | 0      |

$1*512 \quad 0 \quad (7+1)*8 \quad (3+0)*1 = 579$



### Signed Digit Adders

- Another form of redundant digit representation
  - Uses signed-digit representation (redundant)
  - With digit set  $D = \{-a, \dots, -1, 0, 1, \dots, a\}$
  - Limits carry propagation to next position
  - Addition algorithm:
    - Step 1:  $x + y = w + t$   
 $x_i + y_i = w_i + r t_{i+1}$
    - Step 2:  $s = w + t$   
 $s_i = w_i + t_i$
  - No carry produced in Step 2



### Signed Digit Adder

Case A : two SD operands; result SD

Step 1:

$$(t_{i+1}, w_i) = \begin{cases} (0, x_i + y_i) & \text{if } -a+1 \leq x_i + y_i \leq a-1 \\ (1, x_i + y_i - r) & \text{if } x_i + y_i \geq a \\ (-1, x_i + y_i + r) & \text{if } x_i + y_i \leq -a \end{cases}$$

- algorithm modified for  $r = 2$

Case B : two conventional operands; result SD

Case C : one conventional, one SD; result SD

## Signed Digit Adder

- I'm not going to spend more time on this one...
  - My sense is that it's not as important in terms of actual implementations as Carry Save
  - Reasonably complex stuff – multiple recodings

## Summary

| Scheme                         | Delay proportional to | Area proportional to   |
|--------------------------------|-----------------------|------------------------|
| Linear structures:             |                       |                        |
| Carry ripple                   | $n$                   | $n$                    |
| Carry lookahead (one level)    | $n/m$                 | $(k_m m)(n/m) = k_m n$ |
| Carry select (one level)       | $n/m$                 | $(k_m m)(n/m) = k_m n$ |
| Carry skip (one level)         | $\sqrt{n}$            | $n$                    |
| Logarithmic structures:        |                       |                        |
| Carry lookahead (max. levels)  | $2 \log_m n$          | $(k_m m)(n/m) = k_m n$ |
| Prefix                         | $\log_m n$            | $((k_m m) \log_m n)n$  |
| Conditional sum                | $\log_2(n/m)$         | $(k_m + \log_2(n/m))n$ |
| Completion signal (avg. delay) | $(\log_2 n)/m$        | $k_m m(n/m) = k_m n$   |
| Redundant                      | <i>const.</i>         | $n$                    |

## Case Study

- Dec Alpha 21064 64-bit adder
  - 5ns cycle time in a 0.75μ CMOS process
  - Very high performance for the day!
  - A mix of multiple techniques!

## Alpha 21064

- In 8-bit chunks – Manchester carry chain
  - Chain was also tapered to reduce the load caused by the remainder of the chain
  - Chain was pre-discharged at start of cycle
  - Three signals used: P, G, and K
  - Two Manchester chains:
    - One assuming Cin=0
    - One assuming Cin=1

## Alpha 21064

- Carry Lookahead used on least significant 32 bits
  - Implemented as distributed differential circuits
  - Provide carry that controls most significant 32
- Conditional Sum used for most significant 32
  - Six 8-bit select switches used to implement conditional sum on the 8-bit level

## Alpha 21064

- Finally, Carry Select used to produce the most significant 32 bits.
  - Final selection done using NMOS carry-select byte-wide muxes
- Also apparently pipelined with a row of latches after the lookahead...



### Local PGK Logic (Imagine)

- Pre-computation necessary to do fast carry computation
  - $P = a \oplus b$
  - $G = ab$
  - $K = \sim(a+b)$
- Size gates to fan-out to four carry chains
- Note: To do A-B, use  $\sim B$  here

Slide from Mark Horowitz, Stanford

### Group PKG (Imagine)

- Manchester Carry Chains.
  - Usually dynamic, but still works with static logic
  - Group PGK's:
    - $GP = P[7].P[6].P[5].P[4].P[3].P[2].P[1].P[0]$
    - $GG = G[7] + G[6].P[7] + G[5].P[6].P[7] + \dots$
    - $GK = \sim(GG+GP)$
  - Use Carry chains
  - Example for GG (group generate):

Slide from Mark Horowitz, Stanford

### Carry Chain Sizing

- Minimize size of transistors not on critical path
- Taper sizes along carry chain
  - Reduces diffusion capacitance

Slide from Mark Horowitz, Stanford

### Static Carry Chains

- Sizing is to reduce the parasitic delay
  - This delay dominates in large fanin structures, since it grows proportional to  $n^2$
  - Using geometric sizing (reducing each transistor along the chain by  $\alpha$ ) makes the parasitic delay linear
  - But still does not make them fast
- Even though this chain is static logic
  - Drive the carry chain both up (K) and down (G)
  - Output is degraded, since it uses nMOS only pass devices
    - Using CMOS transmission gates is usually slower because of the added parasitic capacitors

Slide from Mark Horowitz, Stanford

### Global Carry Chain

- Must fan out GCIN[3:1] to 8 muxes
  - Added load capacitance slows down the chain

Slide from Mark Horowitz, Stanford

### Conditional Sums

- Use same carry chain
- Do two of these (one for  $cin=0$ , one for  $cin=1$ ):
  - Mux SUM0[7:0] and SUM1[7:0] with output of global carry chain

Slide from Mark Horowitz, Stanford

## Arithmetic for Media Processing

- Used in Media processing
  - DSP's, multimedia extensions to instruction set architectures (MMX, VIS)
- Consider three variations of conventional arithmetic:
  - Segmented Arithmetic
    - Break carry chain
  - Arithmetic operations similar to add/subtract
    - Example: 4 parallel 8-bit unsigned absolute differences
  - Saturation
    - Don't wraparound on overflow

## Segmented Add Operation

- Support 32-bit, dual half-word, or quad byte ops
- Example: 4 parallel byte additions
- Treat each byte as a separate 2's complement number
- Don't propagate the carries across byte boundaries

$$\begin{array}{r}
 \begin{array}{|c|c|c|c|} \hline
 1 & 4 & 0 & 2 \\ \hline
 & F & F & F \\ \hline
 \end{array}
 + \begin{array}{|c|c|c|c|} \hline
 F & E & 0 & 2 \\ \hline
 & 0 & 1 & 0 \\ \hline
 \end{array}
 \hline
 \begin{array}{|c|c|c|c|} \hline
 1 & 2 & 0 & 4 \\ \hline
 & 0 & 0 & 0 \\ \hline
 \end{array}
 \begin{array}{|c|c|c|c|} \hline
 0 & 1 & 0 & 1 \\ \hline
 \end{array}
 \end{array}$$

Slide from Mark Horowitz, Stanford

## Modify for Segmentation



Slide from Mark Horowitz, Stanford

## Global Carry w/ Segment

- This method adds mux to critical path
- We can improve on this
  - Possible to move off critical path
  - By moving to start of adder
- If the op type is known early
  - For cells at start of segment
    - Change P, G definition
    - Change Cin to local carry chains



Slide from Mark Horowitz, Stanford

## Absolute Difference

- Example: 4 parallel byte absolute differences
  - Important in MPEG encoding algorithm
- Algorithm:
  - Take two unsigned 8-bit numbers (between 0 and 255)
  - Compute  $|a-b|$
  - Result is unsigned (between 0 and 255)

Slide from Mark Horowitz, Stanford

## Absolute Difference

- How do we compute  $|a-b|$ ?
  - We need to compute  $a-b$  and  $b-a$  and take the positive one
  - Remember that in 2's complement,  $-x = \sim x + 1$
  - The carry-select adder will compute  $a+\sim b+1$  and  $a+\sim b$
  - $a+\sim b+1 = a-b$
  - $a+\sim b = a - b - 1$
  - Note that
    - $\sim(b - a) + 1 = a - b$
    - so
    - $\sim(b - a) = a - b - 1$
    - or
    - $b - a = \sim(a - b - 1)$
  - So, to compute  $|a-b|$ , just choose between SUM1 or  $\sim$ SUM0 depending on the sign bit

Slide from Mark Horowitz, Stanford



## Saturation

- Often in media and signal processors, saturating arithmetic is supported:
  - Don't wraparound on overflow
  - Result should be largest (or smallest) value possible
- Examples:
  - 32-bit saturating integer add:
    - $\text{IADD\$32}(0xFFFFFFF, 0x0000001) = 0x7FFFFFFF$
  - 8-bit saturating unsigned subtract:
    - $\text{USUB\$8}(0x02FE02FE, 0x03FE01FF) = 0x00000100$

Slide from Mark Horowitz, Stanford



## Simulated Performance

- Two implementations:
  - Custom circuits (using circuits from these slides)
    - ~15.1 FO4 delays through integer adder
    - ~9.3 FO4 delays through overflow detection and saturation
    - ~ $3000\lambda \times 500\lambda$  for adder only (excluding ovf det and saturation)
  - Standard cell implementation
    - ~23.5 FO4 delays through integer adder
    - ~10 FO4 delays through overflow detection and saturation
    - ~ $8000\lambda \times 800\lambda$  for adder only (excluding ovf det and saturation)
  - Significant room for speed improvement through any of the following techniques:
    - Domino circuits
    - Faster carry-chain structures
      - e.g. carry-select on upper half of carry chains within each group pgk

Slide from Mark Horowitz, Stanford



## Related Results

- 8 bit Manchester carry chains are slow, no matter how you size them. If you are going to use 8 bit groups, you probably need look-ahead in that group
- Using dynamic gates is much faster than static gates but
  - Need to worry a lot more about clock skew and noise margin issues. You also need to think about power
- It is possible to move segment overhead off the critical path
- Saturation is a pain since you need to know overflow condition before you can select the correct sum
  - But you can calculate overflow early if you spend hardware

Slide from Mark Horowitz, Stanford

## Summary (from Harris/Weste)

- If they're fast enough, use ripple-carry
  - Compact, simple
- Carry skip and carry select work well for small bit sizes (8-16)
  - Hybrids combining techniques are popular
- At 32, 64, and beyond, tree adders are much faster
  - Again, hybrids are common

## Adder Summary

Table 10.3 Comparison of adder architectures

| Architecture                     | Classification  | Logic Levels    | Max Fanout  | Tracks | Cells             |
|----------------------------------|-----------------|-----------------|-------------|--------|-------------------|
| Carry-Ripple                     |                 | $N = 1$         | 1           | 1      | $N$               |
| Carry-Skip ( $n = 4$ )           |                 | $N/4 + 5$       | 2           | 1      | $1.25N$           |
| Carry-Increment ( $n = 4$ )      |                 | $N/4 + 2$       | 4           | 1      | $2N$              |
| Carry-Increment (variable group) |                 | $\sqrt{2N}$     | $\sqrt{2N}$ | 1      | $2N$              |
| Brent-Kung                       | ( $L-1, 0, 0$ ) | $2\log_2 N - 1$ | 2           | 1      | $2N$              |
| Sklansky                         | ( $0, L-1, 0$ ) | $\log_2 N$      | $N/2 + 1$   | 1      | $0.5 N \log_2 N$  |
| Kogge-Stone                      | ( $0, 0, L-1$ ) | $\log_2 N$      | 2           | $N/2$  | $N \log_2 N$      |
| Han-Carlson                      | ( $1, 0, L-2$ ) | $\log_2 N + 1$  | 2           | $N/4$  | $0.5 N \log_2 N$  |
| Ladner Fischer ( $j = 1$ )       | ( $1, L-2, 0$ ) | $\log_2 N + 1$  | $N/4 + 1$   | 1      | $0.25 N \log_2 N$ |
| Knowles [2,1,...,1]              | ( $0, 1, L-2$ ) | $\log_2 N$      | 3           | $N/4$  | $N \log_2 N$      |

## Synthesized Adders (Harris/Weste)

- Similar to my experiment
  - But with 0.18u library, Synopsys DesignWare
  - Synopsys can map "+" to carry-ripple, carry-select, carry-lookahead, and some prefix adders
  - Fastest are tree adders with (prelayout) speeds of 7.0 and 8.5 FO4 delays for 32 and 64 bit adders

## Area vs. Delay, Synthesized Adders



FIG 10.47 Area vs. delay of synthesized adders