

## Pipelining:-

Pipelining is a particularly effective way of organizing concurrent activity in a computer system.

clock

cycle



I<sub>j+1</sub>



I<sub>j+2</sub>



## Pipeline organization:-

In the first stage, PC is used to fetch a new instruction. As other instructions are fetched, execution proceeds through successive stages. Information such as registers addresses, immediate data, etc must be carried through the pipeline from one stage to next. This information is held in interstage buffers. These include RA, RB, RM, Ry, R<sub>2</sub>, IR, PC-temp and additional storage.



## Pipelining issues:-

- There are times when it is not possible to have a new instruction enter the pipeline in every cycle.
- Any condition that causes pipeline to stall is called a hazard.

## Data dependencies:-

Consider

Add R<sub>2</sub>, R<sub>3</sub>, #100

Subtract R<sub>9</sub>, R<sub>2</sub>, #30

The destination R<sub>2</sub> for add is a source for subtract.

There is a data dependency between these two instructions. The subtract instruction is stalled for three cycles to delay reading register R<sub>2</sub> until cycle 6 when the new value becomes available.

Clock

Cycle

Add  
R<sub>2</sub>, R<sub>3</sub>, #100

|   |   |   |   |   |
|---|---|---|---|---|
| F | D | C | M | W |
|---|---|---|---|---|

Subtract

R<sub>9</sub>, R<sub>2</sub>, #30

|   |   |  |  |   |
|---|---|--|--|---|
| F | D |  |  | C |
|---|---|--|--|---|

## Operand forwarding:-

- Pipeline stalls due to data dependency can be alleviated through the use of operand forwarding.
- The desired value is actually available at end of cycle 5, when the ALU completes the operation.
- The value is stored in R<sub>2</sub>, part of buffer B<sub>3</sub>, the hardware can forward the value.

clock  
cycle

Add

R<sub>2</sub>, R<sub>3</sub>, #100

|   |   |   |   |   |
|---|---|---|---|---|
| F | D | C | M | W |
|---|---|---|---|---|

Subtract

R<sub>9</sub>, R<sub>2</sub>, #30

|   |   |   |   |   |
|---|---|---|---|---|
| F | D | C | M | W |
|---|---|---|---|---|

Handling data dependencies in software:-

- An alternate approach is to leave the task of detecting data dependencies and dealing with them to compiler.
- When the compiler identifies a data dependency between two successive instructions, it can insert explicit NOP (No-operation) instructions between them.
- However the code size increases and the execution time is not reduced.
- The compiler attempt to optimize code to improve performance.

Add R<sub>2</sub>, R<sub>3</sub>, #100

NOP

NOP

NOP

Subtract R<sub>9</sub>, R<sub>2</sub>, #30

clock  
cycle

|   |   |   |   |   |
|---|---|---|---|---|
| F | D | C | M | W |
|---|---|---|---|---|

NOP

|   |   |   |   |   |
|---|---|---|---|---|
| F | D | C | M | W |
|---|---|---|---|---|

NOP

|   |   |   |   |   |
|---|---|---|---|---|
| F | D | C | M | W |
|---|---|---|---|---|

NOP

|   |   |   |   |   |
|---|---|---|---|---|
| F | D | C | M | W |
|---|---|---|---|---|

subtract

R<sub>9</sub>, R<sub>2</sub>, #30

|   |   |   |   |   |
|---|---|---|---|---|
| F | D | C | M | W |
|---|---|---|---|---|

## Memory delay:

- Delays arising from memory access, are either cause of pipeline stalls.
- A bad instruction may require more than one clock cycle to obtain its operand from memory.
- Because, the requested instruction or data are not found in the cache, resulting in cache miss.

Load R<sub>2</sub>, (R<sub>3</sub>)

Subtract R<sub>1</sub>, R<sub>2</sub>, #30



- The compiler can eliminate the one-cycle stall for this type of data dependency by reordering instructions to insert a useful instruction between the load and instruction that depends on data ready from memory.
- If a useful instruction cannot be found by the compiler, then hardware introduces stall automatically.
- If hardware does not deal with dependencies, then the compiler must insert an explicit NOP instruction.

## Branch delays:-

### Unconditional branches:-

- The branch instruction is fetched in cycle 1, decoded in cycle 2 and target address is computed in cycle 3. Hence, target is fetched in cycle 4 after the PC has been updated with target address.
- In pipelined execution,  $I_{j+1} + I_{j+2}$  are fetched in 4, before the branch instruction is decoded.
- The resulting delay is branch penalty.
- Reducing branch penalty requires branch target to be computed earlier.
- Rather than wait until compute, update PC in decode.
- The hardware must be modified to implement this.



- When the instruction decoder determines that the instruction is indeed a branch instruction, the computed target address will be available before the end of the cycle.
- It can then be used to fetch target instruction in next cycle.

## conditional branches:-

Add

R<sub>7</sub>, R<sub>8</sub>, R<sub>9</sub>

Branch-if-[R<sub>3</sub>]=0 TARGET

I<sub>j+1</sub>

:

TARGET: I<sub>k</sub>

- The location that follows branch instruction is called branch delay slot
- The compiler attempts to find a suitable instruction to occupy the delay slot, one that needs to be executed even when the branch is taken.
- If a useful instruction is found, then there will be no branch penalty.
- If no useful instruction, a NOP must be placed instead.

Branch-if-[R<sub>3</sub>]=0

TARGET

Add

R<sub>7</sub>, R<sub>8</sub>, R<sub>9</sub>

I<sub>j+1</sub>

:

TARGET: I<sub>k</sub>

- In the above example, Add can be safely moved into branch delay slot
- Add is always fetched and executed even if the branch is taken

## Branch prediction:-

- To reduce the branch penalty further, the processor needs to anticipate that the instruction being fetched is a branch instruction and predict its outcome to determine which instruction should be fetched in cycle 2.

### Static branch prediction:-

- The simple approach is a form of static prediction.
- The same choice is used every time a conditional branch is encountered.
- The processor can determine the static prediction taken or not-taken by checking sign of the branch offset.

### Dynamic branch prediction:-

- The processor hardware assesses the likelihood of a given branch being taken by keeping track of branch decisions every time that a branch instruction is executed.



LT - branch is likely to be taken

LNT - branch is likely not to be taken



ST - strongly likely to be taken

LT - likely to be taken

LNT - likely not to be taken

SNT - strongly likely not to be taken

30/03/2021

#### Resource limitations:-

- The pipeline stalls when there are insufficient hardware resources to permit all actions to proceed concurrently.
- If two instructions need to access the same resource in the same clock cycle, one instruction must be stalled to allow other instruction to use the resource.
- This can be prevented by using additional hardware.
- Single cache - Fetch and memory access.
- If both the fetch and memory stages of the pipeline are connected to the cache, then it is not possible for activity in both stages to proceed simultaneously.
- One access at a time / cycle, not simultaneously.
- Use separate caches for data and instructions.

#### Performance evaluation:-

##### Non-pipelined system:-

In non-pipelined system, the execution time  $T$  of a program with  $N$  instructions is given by

$$T = \frac{N \times S}{R}$$

where  $S$  = avg. no. of clock cycles to execute one instruction

$R$  = clock rate in cycles per second

### Instruction throughput:-

It is defined as no. of instructions executed per second.

$$P_{np} = \frac{R}{S}$$

### Pipelined system:-

- If there is no cache miss,  $S$  is 5
- Thus, in absence of stalls,  $S$  is 1, and throughput with pipelining is

$$P_p = R \quad (S=1)$$

- An  $n$ -stage pipeline has the potential to increase the throughput by  $n$  times
- Real measure of performance is real execution time of a program

### Effects and stalls of penalties:-

- Consider a processor with operand forwarding in hardware, so there are no penalties due to data dependencies, except in case of load instructions.
- Ideal pipelined execution has  $S=1$ , stalls due to such load instructions have the effect of increasing  $S$  by amount stall
- Assume that load instructions constitute 25% of the dynamic instruction and 40% of these are dependent.

. A one-cycle stall is needed in such cases,

$$S_{\text{stall}} = 0.25 \times 0.90 \times 1 = 0.10$$

- . Execution time  $T$  is increased by 10%, throughput is reduced

$$P_p = \frac{R}{1 + S_{\text{stall}}} = \frac{R}{1.1} = 0.91R$$

- . Consider penalties due to mispredicting branches during program execution.

Assume that branches constitute 20% of dynamic count of a program, and average prediction accuracy is 90%, so 10% are executed incur a one-cycle penalty.

$$S_{\text{branch-penalty}} = 0.20 \times 0.10 \times 1 = 0.02$$

- . The time to access the slower main memory is a penalty that stalls the pipeline for  $P_m$  cycles every time there is a cache miss. A fraction  $m_i$  of all instructions that are fetched incur a cache miss. A fraction  $d$  of all instructions are load or store, and  $m_d$  are instructions that incur a cache miss.

$$S_{\text{miss}} = (m_i + d \times m_d) \times P_m$$

When all factors are combined,  $S$  is increased from the ideal value of 1 to  $1 + S_{\text{stall}} + S_{\text{branch-penalty}} + S_{\text{miss}}$ .

## Arithmetic:-

Addition / subtraction of signed numbers:-

| $x_i$ | $y_i$ | carry-in | sum $s_i$ | carry-out $c_{i+1}$ |
|-------|-------|----------|-----------|---------------------|
| 0     | 0     | 0        | 0         | 0                   |
| 0     | 0     | 1        | 1         | 0                   |
| 0     | 1     | 0        | 1         | 0                   |
| 0     | 1     | 1        | 0         | 1                   |
| 0     | 0     | 0        | 1         | 0                   |
| 1     | 0     | 1        | 0         | 1                   |
| 1     | 1     | 0        | 0         | 1                   |
| 1     | 1     | 1        | 1         | 1                   |

$$S_i = \overline{x_i} \overline{y_i} c_i + \overline{x_i} y_i \overline{c_i} + x_i \overline{y_i} \overline{c_i} + x_i y_i c_i = x_i \oplus y_i \oplus c_i$$

$$c_{i+1} = \overline{x_i} y_i c_i + x_i \overline{y_i} c_i + x_i y_i \overline{c_i} + x_i y_i c_i$$

$$= y_i c_i (\overline{x_i} + x_i) + x_i (\overline{y_i} c_i + y_i \overline{c_i})$$

$$= y_i c_i + x_i c_i + x_i y_i$$

Sum :-



Carry :-



$n$ -bit adder:-

- cascade  $n$  full adder (FA) blocks to form a  $n$ -bit adder.
- carries propagate or ripple through this cascade,  $n$ -bit ripple adder.



$k$   $n$ -bit adder:-

$k$   $n$ -bit numbers can be added by cascading  $k$   $n$ -bit adders



$n$ -bit subtractor:-

- $x-y$  is equivalent to adding 2's complement of  $y$  to  $x$ .
- 2's c is equivalent to  $is\ c + 1$
- $x-y = x+\bar{y}+1$



$1$ -bit adder/subtractor :-



Add/Sub control:-

- = 0, addition
- = 1, subtraction

## Detecting overflows:-

- Overflows can only occur when the sign of two operands is the same
  - Overflow occurs if the sign of result is different from the sign of operands
  - Recall that the MSB represent the sign.
  - $x_{n-1}, y_{n-1}, s_{n-1}$  represent the sign of x, y and s
  - Circuit to detect overflow can be implemented by the following logic expression:

$$\text{Overflow} = x_{n-1} y_{n-1} \bar{s}_{n-1} + \bar{x}_{n-1} \bar{y}_{n-1} s_{n-1}$$

$$\text{Overflow} = c_n \oplus c_{n-1}$$

computing the add time:-

consider 0<sup>th</sup> stage:

- $C_1$  is available after 2 gate delays
  - $S_0$  is available after 1 gate delay
  - $G$  is available after 4 gate delays
  - $S_1$  is available after 3 gate delays

So - For an  $n$ -bit adder,  $s_{n-1}$  is available after  $2n-1$  gate delays,  $c_n$  is available after  $2n$  gate delays.

\* Two approaches to reduce delay in addition

- Fastest possible electronic technology in ripple carry adders
  - Augmented logic gate network structure



Fast addition:-

$$S_i = x_i \oplus y_i \oplus c_i$$

$$c_{i+1} = x_i y_i + x_i c_i + y_i c_i$$

$$c_{i+1} = x_i y_i + c_i(x_i + y_i)$$

$$c_{i+1} = G_i + P_i c_i$$

$$\text{where } G_i = x_i y_i$$

$$P_i = x_i + y_i$$

•  $G_i$  is called generate function and  $P_i$  is called propagate function.

•  $G_i, P_i$  are computed only from  $x_i, y_i$  and not from  $c_i$ , thus they can be computed in one gate delay after  $x$  and  $y$  are applied to the inputs of an n-bit adder

Carry-look ahead adder:-

$$c_{i+1} = G_i + P_i c_i \quad [$$

$$c_i = G_{i-1} + P_{i-1} c_{i-1}$$

$$c_{i+1} = G_{i+1} + P_{i+1} (G_{i-1} + P_{i-1} c_{i-1})$$

$$c_{i+1} = G_i + P_i (G_{i-1} + P_{i-1} (G_{i-2} + P_{i-2} c_{i-2}))$$

:

$$c_{i+1} = G_i + P_i G_{i-1} + P_i P_{i-1} G_{i-2} + \dots + P_i P_{i-1} \dots P_0 C_0$$

• All carries can be obtained 3 gate delays after  $X, Y$  &  $C_0$  are applied.

→ One gate delay for  $P_i$  and  $G_i$

→ Two gate delays in AND-OR circuit for  $c_{i+1}$

• All sums can be obtained 1 gate delay after the carries are computed

• Independent of  $n$ , n-bit addition requires only 4 gate delay,

This is called carry-look ahead adder

4-bit carry lookahead adder:-

The carries can be implemented as

$$C_1 = G_0 + P_0 C_0$$

$$C_2 = G_1 + P_1 C_1 + P_1 P_0 C_0$$

$$C_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 C_0$$

$$C_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 C_0$$



## Multiplication:-

Multiplication of unsigned numbers:-

- We add the partial products at end
- alternative would be to add the partial products at each stage
- Rules to implement multiplication are:-  
 → If the  $i^{\text{th}}$  bit of the multiplier is 1, shift the multiplicand and add the shifted multiplicand to the current value of the partial product.
- Hand over the partial product to the next stage
- Value of the partial product at the start stage is 0

$$\begin{array}{r}
 1011 \text{ (Multiplicand } M) \\
 1101 \text{ (Multiplier } Q) \\
 \hline
 1011 \\
 0000 \\
 1011 \\
 \hline
 1000111
 \end{array}
 \quad
 \begin{array}{r}
 110110 \\
 111 \\
 \hline
 10001110
 \end{array}$$

combinatorial array

multiplier:- Bit of incoming partial product ( $PP_i$ )



0011100  
2524352524

M

19

0, no add }  $f_{st}$

shift) J  
-1, 200) 2 1

shift  $\rightarrow$   $\exists^{\infty}$

1, add 33rd  
lift

$$5 \frac{3}{8} \quad 10$$

Sequential multiplication method:-



Multiplicand = +6, multiplier = -14, size = 4



# sequential circuit multiplier



## Signed Multiplication:-

case - (i) :-

Both are non negative numbers (+ve) multiplication  
of unsigned numbers

case - (ii) :-

Both are -ve numbers

Take 2's complement of both multiplier and multiplicand

case - (iii) :-

-ve multiplicand

+ve multiplier

• When we add a negative multiplicand to a partial product, we must extend the sign-bit value of the multiplicand to the left as far as the product will extend

- If multiplicand is +ve , '0' to the left side
- If multiplicand is -ve , add '1' to the left side
- If the final result is -ve take 2's complement of that to get correct result

consider  $(-13) \times (+11)$

$$\begin{array}{r} & 1 & 0 & 0 & 1 & 1 & & (-13 - 2\text{'s c}) \\ & 0 & 1 & 0 & 1 & 1 & & (+11) \\ \hline & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 1 & 1 \\ & 1 & 1 & 1 & 1 & 1 & 0 & 0 & 1 & 1 \\ & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ & 1 & 1 & 1 & 0 & 0 & 1 & 1 \\ & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \hline & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 1 & (-143) \end{array}$$

case-iv:-

For a negative multiplier, a straightforward solution is to form the 2's complement of both multiplier and the multiplicand and proceed as in the case of a positive multiplier.

This is possible because complementation of both operands does not change the value or the sign of the product.

A technique that works equally well for both negative and positive multipliers - Booth algorithm.

The Booth's algorithm:-

- It handles both the +ve and -ve numbers equally.
- This algorithm generates a 2n bit product.
- It treats both the +ve and -ve 2's complement operands uniformly.

| Multiplicand | version of multiplicand selected by bit i |
|--------------|-------------------------------------------|
| 0 0          | $0 \times M$                              |
| 0 1          | $+1 \times M$                             |
| 1 0          | $-1 \times M$                             |
| 1 1          | $0 \times M$                              |

$(+13) \times (-6) \Rightarrow$

$\begin{array}{r} 00110 \\ \times 11010 \\ \hline 0000000000 \\ 111110011 \\ 00001101 \\ 1110011 \\ \hline 00000001 \\ \hline 1110110010 \end{array}$

$\begin{array}{r} 00110 \\ \times 11010 \\ \hline 0000000000 \\ 111110011 \\ 00001101 \\ 1110011 \\ \hline 00000001 \\ \hline 1110110010 \end{array}$

$(-78)$

05-07

- Best case - a long string of 1s
- Worst case - 0's and 1's are alternating

Worst: 0 1 01 01 01 01 01 01 01

01 10 10 10 10 10 10 10 10

Ordinary: 11 00 01 01 10 11 11 00  
01 00 11 11 01 11 000 100

Good: 0 0 00 1 1 11 10 00 0.11  
0 0 0 1 0 0 0 00 1 00 1 00

Fast multiplication:-

1. Bit-pair recoding of multipliers
2. Carry save addition of summands
3. Summand addition tree using 3-2 reduces.

Bit-pair recoding of multipliers:-

| Multiplier bit-pair | Multiplier bit on right | Multiplicand |
|---------------------|-------------------------|--------------|
| i+1 i               | i-1                     | -i           |
| 0 0                 | 0                       | 0xM          |
| 0 0                 | 1                       | +1xM         |
| 0 1                 | 0                       | +1xM         |
| 0 1                 | 1                       | +2xM         |
| 1 0                 | 0                       | -2xM         |
| 1 0                 | 1                       | -1xM         |
| 1 1                 | 0                       | -1xM         |
| 1 1                 | 1                       | 0xM          |

Ex:-

$$\begin{array}{r} 1 \ 0 \ 1 \\ \swarrow \searrow \\ -1 \ 1 \\ 2^1 \ 2^0 \end{array}$$

$$= -1 \times 2^1 + 1 \times 2^0 = -2 + 1$$

$$= -1$$

$$\begin{array}{r} 0 \ 0 \ 1 \\ \swarrow \searrow \\ 0 \ 1 \\ 2^1 \ 2^0 \end{array}$$

$$= 2^1 \times 0 + 1 \times 2^0$$

$$= 0 + 1 = 1$$

$$\text{Ex: } (+13) \times (-6)$$

$$\begin{array}{r}
 & 0 & 0 & 1 & 1 & 0 & 1 & (+13) \\
 & & | & & | & & | & \\
 & 1 & 1 & 0 & 1 & 0 & (-6) \\
 \hline
 & 0 & 1 & 1 & 0 & 1
 \end{array}$$

$$\begin{array}{r}
 & 0 & 1 & 1 & 0 & 1 \\
 & 0 & & -1 & & -2 \\
 \hline
 1 & 1 & 1 & 1 & 1 & 0 & 0 & 1 & 1 & 0 \\
 1 & 1 & 1 & 1 & 1 & 0 & 0 & 1 & 1 \\
 \hline
 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
 \hline
 1 & 1 & 1 & 1 & 0 & 1 & 1 & 0 & 0 & 1 & 0
 \end{array}$$

$$\mu = -11$$

$$\phi = +27$$

110101

$$0 \ 1 \ 1 \ 0 \ 1 \ 1 \boxed{0}$$

1 1 1 0 1 0 1  
 +2 → 2's c  
 -1  
 xtra 0

---

0 0 0 0 0 0 0 0 1 0 1 0 1 0  
 0 0 0 0 0 0 0 0 1 0 1 0 1 0 ← 2 pos shift

ext 1 1 1 0 1 0 1 0 1 0 1 0 1 0 → 2 pos shift

---

1 1 1 0 1 0 1 0 1 0 1 0 1 0

Carry-Save addition of summands :-

$$\begin{array}{cccc} m_3 & m_2 & m_1 & m_0 \\ \alpha_3 & \alpha_2 & \alpha_1 & \alpha_0 \\ \hline m_2\alpha_0 & m_1\alpha_0 & m_0\alpha_0 & m_0\alpha_0 \end{array}$$

## Ripple carry array for $4 \times 4$ multiplier:-



07/09/2021

## Carry save array by $4 \times 4$ Multiplier:-



- Multiplication requires the addition of several summands.
- A technique called carry save addition (CSA) can be used to speed up the process.
- Instead of letting the carries ripple along the row, they can be 'saved' and introduced into the next row at the correct weighted positions.
- The delay through the carry-save array is somewhat less than the delay through the ripple carry array.
- Multiplication requires the addition of several summands.

- consider the addition of many summands, we can group the summands in threes and perform carry-save addition on each of these groups. This is parallel to generate a set of S and C vectors in one full adder delay.
- group all of the S and C vectors into threes, and perform carry-save addition on them, generating a further set of S and C vectors in one more full-adder delay.
- continue with this process until there are only two vectors remaining.
- they can be added in a RCA or CLA to produce the desired product.

10/10/2023

Summand addition tree using 3-2 reducers:-

• Gate delay

→ By Array multiplier:

$$6(7-1) - 1 = 29$$

→ By CSA = 15 (146+8)



$$\begin{array}{r}
 101101 \quad (45) \\
 111111 \quad (63) \\
 \hline
 101101
 \end{array}$$

$$\begin{array}{r}
 101101 \\
 101101 \\
 \hline
 0101000011
 \end{array}$$

$$\begin{array}{r}
 00111100
 \end{array}$$



$$\begin{array}{r}
 101101 \\
 101101 \\
 101101 \\
 \hline
 11000011
 \end{array}$$

$$\begin{array}{r}
 00111100
 \end{array}$$



$$\begin{array}{r}
 & & & & & & & & & \\
 & & 1 & 1 & 0 & 0 & 0 & 0 & 1 & 1 \\
 & & 0 & 0 & 1 & 1 & 1 & 0 & 0 & \\
 & & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
 \hline
 & 1 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & \\
 & 1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 & 1 \\
 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 \\
 \hline
 & 0 & 1 & 0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 1 & 1 \\
 & 0 & 0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
 \hline
 & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 1 & 1 \\
 \end{array}$$

## Division:-

## Manual division:-

$$13) \overline{)274} \quad ( \quad 21$$

$$\begin{array}{r} 26 \\ \hline 14 \\ \hline 13 \\ \hline 1 \end{array}$$

$$\begin{array}{r}
 \text{dividend} \\
 \uparrow \\
 1101) \quad 1000 \overset{\downarrow}{1} 0010(10101 \\
 \overset{\uparrow}{01101} \\
 \hline
 0010000 \\
 \overset{\downarrow}{1101} \\
 \hline
 1110 \\
 \overset{\downarrow}{1101} \\
 \hline
 (1)
 \end{array}$$

longhand division :-

- Position the divisor appropriately with respect to dividend and perform a subtraction.
  - If the remainder is zero or positive, a quotient bit of 1 is determined, the remainder is extended by another bit



of the dividend, the divisor is repositioned, and another subtraction is performed

- if the remainder is negative, the quotient bit is 0 is determined, the dividend is restored by adding back the divisor, and the divisor is repositioned for another subtraction

Restoring division:-

- This method uses three n-bit registers A, M, Q for dividing two n-bit numbers.

• The register M is used to hold the divisor

• Initially A contains zero and M holds n-bit divisor

At each iteration

→ shift A and Q left one binary position

→ subtract M from A, and place the answer back in A

→ If the sign of A is 1, set  $q_0$  to 0 and add M back to A (restore A); otherwise, set  $q_0$  to 1

→ Repeat these steps n times



Example:

$$\begin{array}{r} \text{B=Q} \\ 11) \frac{1000}{10} \end{array}$$

Initial

$$\begin{array}{r} A \\ 0 0 0 0 0 \end{array}$$

$$\begin{array}{r} Q \\ 1 0 0 0 \end{array}$$

$$M: \quad 0 0 0 1 1$$

$$\text{shift : } \begin{array}{r} 0 0 0 0 1 \end{array} \quad \boxed{0 0 0 0}$$

$$\text{subtract } M - A: \begin{array}{r} 1 1 1 0 1 \\ - 1 \\ \hline 1 1 1 0 \end{array}$$

$$\text{set } q_0: \begin{array}{r} 1 1 1 0 \\ \text{---} \\ \text{set } q_0 = 0 \end{array}$$

$$\text{restore : } \begin{array}{r} 0 0 0 1 1 \\ \text{---} \\ \text{size} = 3 \end{array} \quad \begin{array}{r} 0 0 0 0 0 \\ 0 0 0 1 1 \end{array}$$

$$\text{shift : } \begin{array}{r} 0 0 0 1 0 \end{array} \quad \begin{array}{r} 0 0 0 0 0 \\ 0 0 0 0 0 \end{array}$$

$$\text{subtract : } \begin{array}{r} 1 1 1 0 1 \\ - 1 \\ \hline 1 1 1 0 \end{array}$$

$$\text{set } q_0: \begin{array}{r} 1 1 1 0 \\ \text{---} \\ \text{set } q_0 = 1 \end{array} \quad Q=5$$

$$\text{restore : } \begin{array}{r} 0 0 0 1 1 \\ \text{---} \\ \text{size} = 4 \end{array} \quad \begin{array}{r} 0 0 0 0 0 \\ 0 0 0 1 1 \end{array}$$

$$\text{shift : } \begin{array}{r} 0 0 1 0 0 \\ 0 0 0 0 0 \end{array}$$

$$\text{subtract : } \begin{array}{r} 0 1 1 1 0 \\ - 1 \\ \hline 0 1 1 0 \end{array}$$

$$\text{set } q_0: \begin{array}{r} 0 1 1 0 \\ \text{---} \\ \text{set } q_0 = 1 \end{array}$$

~~-----~~  $\Rightarrow$  No restoration

$$\text{shift : } \begin{array}{r} 0 0 0 1 0 \\ 0 0 1 0 \end{array}$$

$$\text{subtract : } \begin{array}{r} 1 1 1 0 1 \\ - 1 \\ \hline 1 1 1 0 \end{array}$$

$$\text{set } q_0: \begin{array}{r} 1 1 1 0 \\ \text{---} \\ \text{set } q_0 = 1 \end{array}$$

$$\text{restore : } \begin{array}{r} 0 0 0 1 1 \\ \text{---} \\ \text{size} = 0 \end{array} \quad \begin{array}{r} 0 0 1 0 \\ 0 0 1 0 \end{array}$$

Quotient  
remainder

10/12/2021

## Non-restoring division:-

Avoid the need for restoring A after an unsuccessful subtraction.

Step-1: (Repeat n times)

If the sign of A is 0, shift A and Q left one bit position and subtract M from A, otherwise shift A & Q and add M to A, If the sign of A is 0, set  $q_0 = 1$ , or 0

Step-2: If sign of A is 1, add M to A

Initially       $\begin{array}{r} A \\ \hline 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \end{array}$

M:       $\begin{array}{r} 0 & 0 & 0 & 1 & 1 \end{array}$

shift?       $\begin{array}{r} 0 & 0 & 0 & 1 & 0 \end{array}$

subtract?       $\rightarrow 2'sc: \begin{array}{r} 1 & 1 & 1 & 0 & 1 \end{array}$

shift?       $\begin{array}{r} 1 & 1 & 1 & 0 & 0 \end{array}$

set  $q_0$ :       $\begin{array}{r} 1 \\ \hline 1 & 1 & 1 & 0 & 0 & 0 & 0 \end{array}$

shift :       $\begin{array}{r} 1 & 1 & 1 & 0 & 0 & 0 \end{array}$

Add :       $\begin{array}{r} 0 & 0 & 0 & 1 & 1 \\ \hline \end{array}$

set  $q_0$ :       $\begin{array}{r} 1 \\ \hline 1 & 1 & 1 & 1 & 0 & 0 & 0 \end{array}$

shift :       $\begin{array}{r} 1 & 1 & 1 & 1 & 0 \end{array}$

Add :       $\begin{array}{r} 0 & 0 & 0 & 1 & 1 \\ \hline \end{array}$

set  $q_0$ :       $\begin{array}{r} 0 \\ \hline 0 & 0 & 0 & 1 & 1 & 0 & 0 \end{array}$

shift :       $\begin{array}{r} 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \end{array}$

subtract :       $\begin{array}{r} 1 & 1 & 1 & 0 & 1 \\ \hline \end{array}$

set  $q_0$ :       $\begin{array}{r} 1 \\ \hline 1 & 1 & 1 & 1 & 1 & 0 & 0 & 1 \end{array}$

quotient

$\begin{array}{r} 1 & 1 & 1 & 1 & 1 \\ \hline 0 & 0 & 0 & 1 & 1 \end{array}$  } restore  
remainder

Add       $\begin{array}{r} 0 & 0 & 0 & 1 & 0 \\ \hline \end{array}$

remainder

- Represent the binary positive number 1101011 in the IEEE single precision format

$$1101011 = +1 \cdot 101011 \times 2^6$$

23-bit mantissa is

10101100000000000000000

$$\text{Exponent} = 6 = 0110$$

- Represent the decimal -0.75 into IEEE single precision

$$-0.75 \times 2 = 1.5 \rightarrow 1$$

$$0.5 \times 2 = 1.0 \rightarrow 1$$

$$0.0 \times 2 = 0.0 \rightarrow 0$$

$$0.0 \times 2 = 0.0 \rightarrow 0$$

$$(-0.75)_{10} = (-0.11)_2$$

$$\text{Normalize} = -1 \cdot 1 \times 2^{-1}$$

$$\text{mantissa} = 10000 \underbrace{\dots}_{\text{23 bits}}$$

$$\text{exponent} = -1$$

$$e^1 = E + 127 = -1 + 127 = 126 = 0111110$$

|   |         |      |   |
|---|---------|------|---|
| 1 | 0111110 | 1000 | . |
|---|---------|------|---|

- Represent -9.5 in 64-bit IEEE floating point representation

$$\text{Decimal Number} = -9.50$$

$$(9)_{10} = (1001)_2$$

$$(0.5) \times 2 = 1.0 \rightarrow 1$$

$$0.0 \times 2 = 0.0 \rightarrow 0$$

$$(-9.5)_{10} = (-1001.1)_2$$

$$M = -1.0011 \times 2^3$$

$$M = -0011000$$

$$E = E + 1023 = 3 + 1023 = 1026, \text{ sign} = 1$$

|   |              |        |   |
|---|--------------|--------|---|
| 1 | 100000000010 | 001000 | . |
|---|--------------|--------|---|

$$1460.125 = (10110110100.001)_2$$

Normalization:-

$$1.\underbrace{0110110100001}_{\text{Mantissa}} \times 2^{10}$$

$$\text{sign} = 0, E = 10 = 101 + 10 = 137 = 10001001$$

|   |          |               |
|---|----------|---------------|
| 0 | 10001001 | 0110110100001 |
|---|----------|---------------|

1/01/2021

Floating point arithmetic : ADD / SUB rule :-

- choose the number with the smaller exponent
- shift its mantissa right until the exponents of both the numbers are equal
- Add or subtract the mantissas
- Determine the sign of the result
- Normalize the result if necessary and truncate round to the number of mantissa bits.

MUL rule:-

- Add the exponents
- Subtract the bias
- Multiply the mantissas and determine the sign of the result
- Normalize the result
- Truncate round the mantissa of the result

DIV rule:-

- Subtract the exponents
- Add the bias
- Divide the mantissas and determine the sign of result

- Normalize the result (if necessary)
- Truncate / round the mantissa of the result.

Special values:-

- The eng values 0 and 255 of excess exponent  $\epsilon'$  are used to represent special values
- when  $\epsilon'$  is 0 and  $M=0$ , the exact 0 value is represented
- when  $\epsilon'=235$  and  $M=0$ , the value  $\infty$  is represented
- when  $\epsilon'=0$  &  $M \neq 0$  denormal numbers are represented  
Their value is  $\pm 0.M \times 2^{-126}$

01/2011

## Input / output organization:-

### Bus structure:-

- The interconnection network is used to transfer the data among the processor, memory and I/O devices.
- The commonly used interconnection network is called bus.



- Only one source / destination pair of units can use this bus to transfer data at any one time.
- Bus consists of three sets of lines used to carry address, data, control signals.



- Processor, main memory, I/O devices are interconnected by means of a bus
- Each I/O device is assigned a unique set of addresses for the registers in its interface
- When a processor places a particular address on the address lines, it is examined by the address decoders of all devices on the bus
- Address is observed by all device decoders
- The device recognizes the address, responds to the commands issued on the control lines
- The processor uses the control lines to request a Read/Write operation, requested data is transferred over data lines

### Memory mapped I/O:-

- It is a way to exchange data and instructions between CPU & I/O attached to it
- It is the one where processor and I/O device share the same address space.
- Any instruction that can access memory can be used to transfer data to and from I/O device
- Load R2, DATAIN

DATAIN is address of a register in an input device

- Store R2, DATAOUT

DATAOUT is a register in an output device  
bus operation:-

bus protocol

Synchronous bus

Asynchronous bus

## Bus protocol

- The set of rules that govern the behaviour of various devices connected to the bus, as to when to place information on the bus, when to assert control signals etc.
- These rules are implemented by control signals that indicate what and when actions are to be taken
- One control line, usually labelled R/W specifies whether a Read or write operation is to be performed
- Read - 1, Write - 0

• Control signals specify:

- whether it is a read or a write operation
- when several operand sizes (byte, word, long) required, size is indicated by other control lines
- Timing information to indicate when the processor and I/O devices may place data or receive data from the bus.
- Two types of data transfer depending on the mechanism of timing data
  - Synchronous
  - Asynchronous
- In any data transfer one device plays the role of a master
- This is a device that indicates initiates data transfer by issuing read/write command on the bus
- Normally CPU behaves like a master

Synchronous bus:-

- All devices derive timing information from a common clock line - Bus clock
- The signal on this line has two phases
  - high level followed by low level

- The clockline has equally spaced pulses which define equal time intervals (bus cycles)
- One data transfer can take place during one bus cycle



- Once the master places the device address and command on the bus, it takes time for this information to propagate to the devices

• Also all the devices have to be given enough time to decode the address and control signals, so that addressed slave can place data on bus

• Width of the pulse  $t_1$  to depend on

→ Maximum propagation delay between two devices connected to bus

→ Time taken by all the devices to decode the address and control signals, so that the addressed slave can respond at  $t_1$

• At the end of the clock cycle, at time  $t_2$ , the master stores the data on data lines into its buffer if it's a read operation (strobe means capture the value and store in buffer)

• When data are to be loaded into a storage buffer register, the data should be available for a period longer than the setup time of the device.

than the setup time of the device.

• Width of pulse  $t_2 - t_1$  should be longer than:

→ Max propagation time of bus

→ set up time of input buffer register of the master.

→ set up time of output buffer register of the master.

Limitations:-

• Data transfer has to be completed within one clock cycle

→ clock period  $t_2 - t_1$  must be such that the longest propagation delay on the bus and the slowest device interface must be accommodated

→ forces all the devices to operate at the speed of slowest device.

• Processor just assumes that the data are available at  $t_2$  in case of a read operation, or are read by the device in case of a write operation.



Protocol

### Asynchronous bus:-

- Data transfers on the bus is controlled by a handshake between the master and the slave
- common clock in the synchronous bus case is replaced by two timing control lines.  
Master ready, slave ready.
- Master-ready signal is asserted by the master to indicate to the slave that it is ready to participate a data transfer
- slave-ready signal is asserted by the slave in response to the master-ready from the master, and it indicated to the master that the slave is ready to transfer the data.
- \* Data transfer using the handshake protocol:-
  - Master places the address & command information on bus
  - asserts the Master-ready signal to indicate to the slave that

- The address and command information has been placed on bus
- All devices on the bus decode the address
  - Addressed slave performs the required operation and informs the processor it has done so by asserting the slave-ready signal.
  - Master removes all the signal from the bus, once slave-ready is asserted
  - If the operation is read, master also stores the data into its input buffer



- At  $t_0$  - master places the address and command info on bus
- $t_1$  - master asserts the M-R signal. M-R is asserted at  $t_1$  instead of  $t_0$
- $t_2$  - Addressed slaves place the data on bus and asserts S-R signal
- $t_3$  - S-R signal arrives at the master
- $t_4$  = Master removes the address + command info
- $t_5$  - Slave receives the transaction

## Advantages:-

- Eliminates the need for synchronization between the sender and the receiver
- can accommodate varying delays automatically, using slave ready signal

## Disadvantages:-

- Data transfer rate with full handshake is limited by two round trip delays
- Data transfers using a synchronous bus involves only one round trip delay; and hence a synchronous bus can achieve higher rate.

## Arbitration:-

- There are occasions when two or more entities contend for the use of a single resource in a computer system.
- This decision is usually made in an arbitration process performed by an arbitration circuit.
- The arbitration process starts when each device sending a request to use the shared resource.
- A device that wishes to use the bus sends a request to the arbiter.
- When multiple requests arrive at the same time, the arbiter selects one request and grants the bus to the corresponding device.





Granting is at the level of address

~~slotbus~~

Interface circuits:-

- I/O interface circuits consist of the circuitry required to connect an I/O device to a computer bus.
- side of the interface which connects to the computer has bus signals for:

Address, Data, Control.

- side of the interface which connects to the I/O device has:

→ Datapath and associated controls to transfer data between the interface and the I/O device.

→ This side is called as a "port"

• Ports can be classified into two:

→ Parallel port

→ Serial port

• Parallel port transfers data in the form of number of bits, normally 8 or 16 to or from the device

• Serial port transfers and receives data one bit at a time

- Processor communicates with the bus in the same way, whether it is parallel port or a serial port
- conversion from the parallel to serial and vice versa takes place inside the interface circuit



functions

### Functions of the I/O interface:-

- Provides a register for temporary storage of data.
- Includes a status register containing status information that can be accessed by the processor
- Includes a control register that holds the information governing the behaviour of the interface
- Contains address-decoding circuitry to determine when it is being addressed by the processor
- Generates the required timing signals.
- Performs any format conversion that may be necessary to transfer data between the processor and the I/O device, such as parallel-to-serial conversion in the case of serial port.

- Keyboard is connected to a processor using a parallel port.
- Processor is 32-bits and uses memory mapped I/O and the asynchronous bus protocol
- On the processor side of the interface:
  - Datalines
  - Address lines
  - control or R/W line
  - Master-ready signal
  - Slave ready signal
- On the keyboard side of the interface:
  - encoder circuit which generates a code for key pressed
  - Datalines contain the code for the key
  - Valid line changes from 0 to 1, when the key is pressed
  - This causes the code to be loaded into DATAIN and SIN to be set to 1
  - circuit can be used to connect a keyboard to a processor
- There are only two registers:
  - A data register (KBD-DATA)
  - status register (KBD-STATUS)
- The latter contains the keyboard status flag, KIN
- When key pressed (KIN=1, KBD-DATA = data)
- Output of encoder: one byte of data representing encoded character and one valid bit



- On the printer side:

- idle signal line which the printer asserts when it is ready to accept the character
- This causes the sout flag to be set to 1
- Processor places a new character into a dataout register
- valid signal asserted by the interface circuit when it places a new character on data lines

*solution*

- On the processor side

- Data lines



- Address lines



- control or R/W lines



- When the I/O routine checks DOUT and finds out it equal to 1, it sends a character to DISP-DATA

- This clears the sout flag to 0 and sets New-data signal to 1

- In response, the display returns ready to 0 accepts and displays the character in DISP-DATA

- When it is ready to receive another character it asserts ready again and the cycle repeats

Serial port:-

- Serial port is used to connect the processor to I/O devices that require transmission of data one bit at a time

- Serial port communicates in a bit-serial fashion on the device side and bit-parallel on the bus side

→ Transformation between parallel and serial formats is

achieved with shift registers that have parallel access capability.



- Input shift register accepts input one at a time from the I/O device
- Once all the 8 bits are received, the contents of the input shift register are loaded in parallel into DATAIN registers
- Output data in the DATAOUT register are loaded into the output shift register
- Bits are shifted out of the output shift register and sent out to the I/O device bit at a time.
- As soon as data from the input shift are loaded into DATAIN, it can start accepting another 8 bits of data.
- Input shift register and DATAIN registers are both used at input so that the input shift register can start receiving

another set of 8 bits from the input device after loading the contents to DATAIN, before the processor reads the contents of DATAIN. This is called as double buffering.

23/04/2021

- Serial interface require fewer wires and hence serial transmission is convenient for connecting devices that are physically distant from the computer.
- Speed of transmission of the data over a serial interface is known as the "bit rate".
  - Bitrate depends on the nature of the devices connected
- In order to accommodate devices with a range of speeds, a serial interface must be able to use a range of clock speeds.
- Several standard serial interfaces have been developed
  - Universal Asynchronous Receiver Transmitter (UART) for low speed serial devices
  - RS-232-C for connection to communication links

### PCI Bus:-

- A no. of standards have been developed for the expansion bus
- Some have evolved by default
- IBM's Industry Standard Architecture
- Three widely used bus standards:-
  - PCI (Peripheral component Interconnect)
  - SCSI (small computer system interface)
  - USB (Universal serial Bus)

### PCI :-

- Peripheral component interconnect
- Introduced in 1992

low-cost and processor independent bus

it is housed on the motherboard of a computer and used to connect I/O interfaces for a wide variety of devices  
Plug and Play capability



## Bus structures:-

- The PCI bus is connected to the processor bus via a controller called a bridge
  - The bridge has a serial port for connecting the computer's memory
  - Another special high speed port for connecting graphic devices
  - The bridge translates and commands and responses from one bus to other and transfers data between them
  - The processor sends a request to an I/O device, the bridge forwards the commands and address lines on bus
  - I/O devices are connected to the PCI bus through port that use standards such as Ethernet, USB, SATA, SCSI
  - PCI bus has supports 3 interfaces
  - RW operation that involves a single word is treated as burst of length one.
  - PCI uses same bus to transfer both address and data
- 28/01/20*
- We assumed that the master maintains the address info on the bus until data transfer is completed. But, the address is needed only long enough for the slave to be selected
  - Thus, the address is needed on the bus for one clock cycle only, forcing the address lines to be used for sending data in subsequent clock cycles
  - The result is a significant cost reduction

## Data transfer:-

- Bus master, which is the device that initiates data transfer by issuing read and write commands, is called initiator in PCI terminology

- . the addressed device that responds to these commands is called a target
- . the target ready, TRDY# = slave-ready
- . initiator-ready signal, IRDY#, to support burst transfers signals on PCI bus:-

| Name         | Function                                                                                                              |
|--------------|-----------------------------------------------------------------------------------------------------------------------|
| CLK          | A 33-MHz or 66-MHz clock                                                                                              |
| FRAME#       | sent by initiator to indicate duration of a transaction                                                               |
| AD           | 32 address/data lines, which may be optionally increased to 64                                                        |
| CBE #        | command/byte-enable lines                                                                                             |
| IRDY#, TRDY# | Initiator-ready, target-ready signals                                                                                 |
| DEVSEL#      | A response from the device indicating that it has recognized its address and is ready for a data transfer transaction |
| ID SEL#      | Initialization Device select                                                                                          |

- . A complete transfer operation on the PCI Bus involving an address and a burst of data is called transaction.



- clock cycle #1: Bus master asserts frame# to indicate the beginning of a transaction
- It sends address on address lines and read command on C1BE# lines
- clock cycle #2: The initiator removes the address and disconnects its drivers from AD lines and asserts IRDY#
- IRDY# - it is ready to receive data
- Target asserts DEVSEL# signal
- clock cycle #3:
  - Target asserts TRDY# and begins to send the data
  - It maintains DEVSEL# is asserted state until the end of the transaction.
  - Initiator uses frame# to indicate duration of the bus
  - This deactivates at cycle #5, during this the initiator receives 3rd word
  - In response to that it receives next word and stops
  - After sending 4<sup>th</sup> word deactivates TRDY# and DEVSEL# and disconnects its drivers on the AD lines

### SCSI Bus

| Category | Name             | Function                                                                                                                            |
|----------|------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| Data     | - DB(0) to DB(7) | Datalines carry one byte of info during the info transfer phase & identify device during arbitration selection & reselection phases |
| Phase    | - BSY<br>- SEL   | Parity bit for the data bus<br>busy asserted when bus is not free<br>selection asserted during selection & reselection              |

Info type

- CID control / data asserted during transfer or control info (command status or message)

- MSG may indicates the info being transferred is a message

Handshake

- REQ request asserted by a target or request data transfer cycle

- ACK Acknowledge: asserted by the initiator when it has completed ad transfer operation

Direction of transfer

- I/O Input / Output asserted to indicate input operation

Other

- ATN Attention: asserted by an initiator when it wishes to send a message

- RST Reset: causes a device controls to disconnect from the bus and assume the start - update