

## Pipelining

Necessarily, there is an overlapping in between 2 stages of Instruction Execution.

Diagram :



For Non-Pipelined System (Time) =  $k n \gamma$

For Pipelined System (Time) =  $(k + (n - 1)) \gamma$

## Classification of Pipelined Processor

According to the level of processing Höndler (1971) has proposed the following :

- 1) Arithmetic Pipelines - The Arithmetic Logic Unit of a computer can be segmentized for pipeline operation in various data format.  
Eg- The 4 stage pipeline used in Star-100, 8 stage pipeline used in TIASC (Texas Instrument Advanced Science Computer)
- 2) Instruction Pipeline - The execution of a stream of instructions can be pipelined by overlapping the execution of the current execution with the Fetch, Decode, Operand Fetch of subsequent instruction
- 3) Processor Pipelining - Processor Pipelining refers to the pipelined processing of the same data stream by a cascade of processors each of which process a specific task

According to Pipelined Configuration & Control Strategies Ramamoorthy & Li (1977) had proposed the following pipeline classification scheme.

①

1) Unifunctional & Multifunctional Pipeline - A pipelined unit with a fixed and dedicated function such as floating point adder is called unifunctional pipeline. Eg- The Cray-I has 12 unifunctional pipeline units.

A multifunctional pipeline may perform different function either at same time or different time. Eg- By interconnecting different subsets of stages. TIASC has 4 multifunctional pipeline units.

2) Static & Dynamic Pipeline - A static pipeline may assume only one functional configuration at a time. Pipelining is made possible in static pipelines only if instruction of the same type are to be executed continuously. The function performed by a static pipeline should not change frequently.

A dynamic pipeline processor permits several functional configurations to exist simultaneously.

3) Scalar & Vector Pipelined Processor - A scalar pipeline process a sequence of scalar operation under the control of a loop.

Vector pipeline are specially designed to handle vector instruction over the vector operands.

H/W → Read → Throughput ( $w, n$ )  
Speed Up ( $S_k$ )

## Architectural Classification (Also Known as Flynn's Classification)

Based on the multiplicity of instruction streams and data streams in a computer system: (1965)

- \* **Stream** - is used to denote a sequence of items as executed or operated by a single processor
- \* **Instruction Stream** - is a sequence of instructions as executed by the processor.
- \* **Data Stream** - is a sequence of data including input, partial ~~and~~ or temporal result called for the instruction stream ?.

### 1) Single Instruction Single Data Stream (SISD) - A single processor executes

a single instruction to operate on data stored in a single memory. Instructions are executed sequentially but may be overlapped in their execution stages

Eg- IBM Uniprocessor - 701, IBM - 1620



### 2) Single Instruction Multiple Data Stream (SIMD) - A single machine instruction controls the simultaneous execution of processing element (PE)

Eg- Vector Processor - ILLIAC - IV



A [30]

B [30]

$$C[30] = A + B$$

- 3) Multiple Instruction Single Data Stream (MISD) - A sequence of data is transferred to a set of processor each of which execute a different instruction sequence. Their n processing unit each receiving distinct instruction operating on same data and output of one processor becomes input of next processor.



- 4) Multiple Instruction Multiple Data Stream (MIMD) - A set of processor simultaneously execute different instruction sequence on different data set.



Example - IBM - 370, Cray - II, IBM - 3081, UNIVAC - 1100/80

Data, Information

- Data - The data space is the largest including the numeric numbers in various formats, characters, symbols and multi-dimensional measures.  
Data objects are considered mutually unrelated in space.
- Information - An information item is a collection of data objects that are related by some syntactic structure or relation. This derives into three architectural configurations.
- Knowledge - Knowledge consists of information contents plus some semantic meanings.
- Intelligence - Intelligence is derived from a collection of knowledge items.
- Parallel Computer Structures

Parallel computers for computing - Parallel computer are those system that emphasize parallel processing.

- 1) Pipeline Computer
- 2) Array Processor
- 3) Multiprocessor System

- ① A pipeline computer performs overlapped computation to explore temporal parallelism.
- ② An array processor uses multiple synchronized ALUs to achieve parallelism.

- ③ A multiprocessor system achieves asynchronous parallelism through set of interactive processor

Computer Architecture Prac. 1

ASIC - Application Specific IC

FPGA - Field Programmable gate array.

- Language - Verilog. File Extension - .vl

```
Module name (input a; input b; output c);  
begin  
assign c = a & b;  
end  
end module
```

For testing verilog code we use test bench

Eg - module testabc

a = .x ;

b = .y ;

c = .z ;

initial

x = 1 ;

y = 0 ;

# 100 ;

module name a1(a(x); b(y); c(z));

\* At the end we have to call the module

\* end & end module.

## AND Gate

Theory - The AND gate is a basic digital gate that implements logic conjunction ( $\wedge$ ) from mathematical logic - AND. A High output (1) results only if all the inputs to AND gate are HIGH (1). If none or not all inputs to the AND gate are HIGH, Low output results.

## Truth Table :

| A | B | A & B |
|---|---|-------|
| 0 | 0 | 0     |
| 0 | 1 | 0     |
| 1 | 0 | 0     |
| 1 | 1 | 1     |

## Code :

## Non-Linear Pipeline

- ## Linear Pipelining



In case of Linear Pipeline there is no feed forward or feed back connection available

- ## Non - Linear Pipeline



## Non-Linear Pipeline

After getting executed in S2 whether the data will move to S3 or S4? There is an ambiguity as a result reservation table is used.

Collision - When more than one instruction is placed in the reservation table in the same stage then there is a collision.

Forbidden Latency - Number of clock - cycle difference b/w two colliding instructions.

|    | Forbidden Latency |   |   |   |   |   |        |
|----|-------------------|---|---|---|---|---|--------|
|    | 1                 | 2 | 3 | 4 | 5 | 6 | Output |
| S1 | X                 |   |   |   |   |   |        |
| S2 |                   | X | X |   | X |   |        |
| S3 |                   |   | X |   | X |   |        |
| S4 |                   |   |   |   |   |   |        |

$$FL = 2, 4$$

PL = 1, 3, 5, 6 (Permissible Latency).

Collision Vector

$$C = C_6 \ C_5 \ C_4 \ C_3 \ C_2 \ C_1$$

We have to look at the values of the forbidden latency and place 1 at place where the latency is forbidden.

$$C = 001010 = 1010$$

16/2/22.

Reservation Table.

|                | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|----------------|---|---|---|---|---|---|---|---|
| S <sub>3</sub> | X |   |   |   | X | X |   |   |
| S <sub>2</sub> |   | X | X |   |   |   |   |   |
| S <sub>1</sub> |   |   | X | X | X |   |   |   |

Forbidden Latencies = 2, 4, 5, 7.

$$\text{Collision Vector} = C_8 \ C_7 \ C_6 \ C_5 \ C_4 \ C_3 \ C_2 \ C_1 \\ = 1011010$$

If we send a task in machine after 1 latency, new collision vector will be (For T<sub>1</sub>).

$$C'_1 = 0101101 \ (\text{After shift operation on initial CV}).$$

$$1011010 \ (\text{For T}_2)$$

$$01111111 \ (\text{New CV}).$$

As  $C_8$  is 0 so we have to send the next instruction  $T_2$  at 8 latency gap.

0000000  
1011010  
1011010 → New cv.



After 3 latency gap if we insert another instruction

|                      |    |         |
|----------------------|----|---------|
| 0001011 (for $T_1$ ) | →  | 0001011 |
| OR 1011010           | OR | 1011010 |
| 1011011 (for $T_2$ ) |    |         |

Permissible latency cycle for this reservation table - (1, 8)

X X (3, 6) (3) (6)  
X (3, 8), (6, 8), (8).

Throughput - Output per unit time.

Average Latencies - 4.5, 4.5, 3.6, 5.5, 7.8.

Minimum Average Latency = 3.

Greedy cycle - If each latency contained in the cycle is minimal latency and form a state in the cycle then this state is known as greedy cycle.



$b[3:0]$

$a[3:0]$

$c[3:0]$

$a[0]$

|                | 1 | 2 | 3 | 4 | 5 | 6 |
|----------------|---|---|---|---|---|---|
| S <sub>3</sub> | X | X |   |   | X |   |
| S <sub>2</sub> |   | X | X |   |   |   |
| S <sub>1</sub> |   |   |   | X |   |   |

Q) Draw the state diagram and find the greedy cycle.

Ans:-

Forbidden Latencies = 2, 3, 5

Permissible Latencies = 1, 4, 6

$$\begin{aligned} \text{Collision Vector} &= c_6 \ c_5 \ c_4 \ c_3 \ c_2 \ c_1 \\ &= 0 \ 0 \ 1 \ 0 \ 1 \ 1 \ 0 \\ &= 10110 \end{aligned}$$

(I) If we send next instruction after 1 latency, new collision vector is

10110 (For 1<sup>st</sup> Instruction),

or 01011

11111 (For 2<sup>nd</sup> Instruction).

If we send next instruction after 6 latency, new collision vector is -

000000

010110

010110 → For (I<sub>3</sub>)

(II) If we send I<sub>2</sub> after 4 latency. State Diagram  
then, new collision vector will be

10110

or 00001

10111



Permissible latency cycle: (1, 6), (4)

Avg. latency cycle: 3.5, 4.

min Greedy cycle = (1, 6).

Q)

|                |   |   |   |   |   |   |
|----------------|---|---|---|---|---|---|
|                | X |   | X |   | X |   |
| S <sub>3</sub> |   | X |   | X |   | X |
| S <sub>2</sub> |   | X | X | X |   | X |

state Diagram &amp; Greedy Cycle

A) Forbidden Latencies = 2, 3, 4, 5  
Permissible Latencies = 1, 6, 7.

Initial.

Collision Vector = C<sub>7</sub> C<sub>6</sub> C<sub>5</sub> C<sub>4</sub> C<sub>3</sub> C<sub>2</sub> C<sub>1</sub>  
= 00 11110 .  
= 11110 .

Case I.

T<sub>1</sub> after 1 latency of T<sub>1</sub>.11110  
OR 01111  
11111 → New Collision Vector.T<sub>2</sub> after 6 latency of T<sub>2</sub>.11110  
OR 00000  
11110 → New Collision Vector.State Diagram

Permissible Latency Cycles - (1, 6), (6)

Avg. Latency Cycle = (3.5) 6 min.

Greedy Cycle = (1, 6)



Q)

|       |   |   |   |   |   |   |   |   |   |
|-------|---|---|---|---|---|---|---|---|---|
| $S_3$ | - | X | X | - | - | - | - | - | - |
| $S_2$ | X | - | - | - | X | - | X | - | - |
| $S_1$ | - | - | X | X | - | X | - | - | - |

1 2 3 4 5 6 7 8

$$FD = \{2, 4, 5, 7\}$$

$$PL = \{1, 3, 6, 8\}$$

$$\begin{aligned} CV &= C_8 C_7 C_6 C_5 C_4 C_3 C_2 C_1 \\ &= \underline{\underline{0}} 1 0 1 1 0 1 0 \\ &= 1011010 \end{aligned}$$

Case I

$T_2$  after 1 Latency of  $T_1$ .

1011010

OR 0101101

01111111 → (New Collision Vector).

$T_3$  after 8 Latency of  $T_2$ .

1011010

OR 0000000

1011010 (New CV)



## Hazard

Any condition that causes the pipeline to stalled is called a hazard.

Data Hazard - Data Hazard is a condition in which either the source or destination of operands of an instruction are not available at the expected time as a result some operation has to be delayed and causes the pipeline stall.

## Technique to Handle Hazard

The domain  $D(I)$  of an instruction  $I$  is a set of resource object whose data object may affect the execution of an instruction. The range  $R(I)$  of an instruction  $I$  is a set of resource object whose data object may be modified by the execution of instruction  $I$ . Obviously, the operands to be used in an instruction execution are retrieved from its domain and the result will be stored in its range.

The one three class of data hazard according to various data update patterns:

- i) Read after Write - Hazard between two instructions  $I$  and  $J$  may occur when  $J$  attempts to read some data object that has been modified by  $I$ .
- ii) Write after Read - Hazard between two instructions  $I$  and  $J$  may occur when  $J$  attempts to modify some data object that is read by  $I$ .
- iii) Write after Write - Hazard between two instructions  $I$  and  $J$  may occur if both  $I$  and  $J$  attempt to modify the same data object.

\* In case of Read after Write  $R(I) \cap D(J) \neq \emptyset$ .

\* Eg- I A:  $B+C$  } , Read after write.  
J D:  $A+M$

- \* In the example Domain of I is  $D(B, C)$  and range of I is  $R(A)$ . Similarly, Domain of J is  $D(A, M)$  and range of J is  $R(D)$ . So, if we find  $R(I) \cap D(J) = \{A\} \neq \emptyset$  So, the Hazard occurs.
- \* In case of write after read  $R(J) \cap D(I) \neq \emptyset$
- \* In case of Write after write  $R(I) \cap R(J) \neq \emptyset$ .

### Methods for Removing / Dealing with Data Hazard

(1) Internal Forwarding - Internal Forwarding refers to a "short-circuit" technique for replacing unnecessary memory access for register to register transfer in a sequence of Fetch, Arithmetic and Store operation.

(2) Store Fetch Forwarding - Eg -

$$\left. \begin{array}{c} Mi \leftarrow R_1 \\ R_2 \leftarrow Mi \end{array} \right\}_1 \quad \left. \begin{array}{c} Mi \leftarrow R_1 \\ R_2 \leftarrow R_1 \end{array} \right\}_2$$

The operation 1 requires comparatively more time for execution compared to operation 2. So 2<sup>nd</sup> one is better option and is known as store fetch forwarding.

(3) Fetch Fetch Forwarding -

$$\left. \begin{array}{c} R_1 \leftarrow Mi \\ R_2 \leftarrow Mi \end{array} \right\}_1 \quad \left. \begin{array}{c} R_1 \leftarrow Mi \\ R_2 \leftarrow R_1 \end{array} \right\}_2$$

Since, 1<sup>st</sup> operation requires to access memory more number of times compared to 2<sup>nd</sup> operation, 2<sup>nd</sup> operation requires less time to execute and this is known as fetch-fetch forwarding.

(4) Store Store Forwarding -

$$\left. \begin{array}{c} Mi \leftarrow R_1 \\ Mi \leftarrow R_2 \end{array} \right\}_1 \quad \left. \begin{array}{c} Mi \leftarrow R_2 \end{array} \right\}_2$$

Since  $R_2$  is overwriting the data on  $Mi$  created by  $R_1$  so we can overlook the 1<sup>st</sup> step of 1<sup>st</sup> operation to

reduce the time required making the execution faster. This method is known as **stone-stone forwarding**.

② Register Tagging - Register Tagging refers to the use of tagged registers, buffers and reservation station for concurrent activity among multiple arithmetic unit.



Diagram - Graph  $M_1$  Register Tagging



STEP - 1



STEP 2

## Computer Architecture (Assignment - 3)

### Expt - 3 - Minimization Techniques.

Objective - Implementation of Boolean function and logic Equations.

Problem statement - Majority circuit. - A majority circuit is a combinational circuit whose output is 1 if input has more 1's than 0's.  
 Design the majority circuit.

#### (2) 2's complement

Design a 4-bit combinational circuit for implementing 2's complement so that the circuit could be completely constructed using XOR Gate.

#### (3) Function Minimization.

##### (a) Implement the Boolean Function. $F(A, B, C, D)$

$$= (0, 1, 3, 4, 8, 9, 15).$$

Use a Verilog Program to test the circuit.

##### (b) Write a Verilog Program to design, simulate and test a combinational circuit with 3-bit input $a_2, a_1, a_0$ and 3-bit output $b_2, b_1, b_0$ . When binary input is 0, 1, 2, 3 the binary output is 1 greater than the input. When input is 4, 5, 6, 7 in binary output is 1 less than input.

### Questionnaires

- (1) Explain the difference b/w \$monitor and \$display
- (2) Given the following Verilog code, what value of a is

displayed -

always @ clk begin.

a = 0

a <= 1.

\$display(a);

end

- (Q3) For the following Verilog code segment, if the IR is ABCD 3456 in Hexadecimal, the value of "data" in decimal will be initial value of (Note that data is a 4-bit variable)

wire [

AB CD  
AB 00 00  
AB 01 01  
AB 10 10

Ans(3b)

|            |                  |    |   |   |   |  |
|------------|------------------|----|---|---|---|--|
| $\bar{C}D$ | $\bar{A}\bar{B}$ | 00 | X | X | X |  |
| $\bar{C}D$ | $\bar{A}B$       | 01 | X |   |   |  |
| $CD$       | $A\bar{B}$       | 11 |   |   | X |  |
| $CD$       | $A\bar{B}$       | 10 | X | X |   |  |

$$\bar{B} + \bar{A}\bar{B} + \bar{A}\bar{B}$$

$$= \bar{A}\bar{D} + B\bar{C}\bar{D} + \bar{A}\bar{B}\bar{C} + ABCD.$$

$$= (\bar{A} \oplus \bar{B} \oplus \bar{C} \oplus \bar{D}) / (B \oplus C \oplus D)$$

$$= (\bar{A} \oplus \bar{B} \oplus \bar{C}) / (A \oplus B \oplus C \oplus D)$$

$$Y = \bar{A}\bar{D} + \bar{C} + B$$

$$Q1 = \bar{A}\bar{D}$$

$$P2 = B\bar{C}\bar{D}$$

$$P1 = \bar{A}\bar{B}\bar{C}$$

$$S1 = ABCD$$

F =

$$1111 + 1 = 10000$$

↑

$$\begin{array}{r}
 0\ 0\ 0\ 0 \quad (+) \ 1 \rightarrow \\
 0\ 0\ 0\ 1 \\
 0\ 0\ 1\ 0 \\
 0\ 0\ 1\ 1 \\
 0\ 1\ 0\ 0
 \end{array}$$

## HAZARD REMOVAL

- (Q)  $I_1 : R_0 \leftarrow M_1$       }  $\rightarrow I_1 : R_0 \leftarrow M_1$   
 $I_2 : R_1 \leftarrow M_1$       }  $\rightarrow I_2 : R_1 \leftarrow R_0$   
 $I_3 : R_1 \leftarrow R_0 + R_1$       }  $\rightarrow I_3 : R_1 \leftarrow R_0 + R_0$   
 $I_4 : M_2 \leftarrow R_1$       }  $\rightarrow I_4 : M_2 \leftarrow R_1$   
 $I_5 : R_2 \leftarrow M_2$       }  $\rightarrow I_5 : R_2 \leftarrow R_1$   
 $I_6 : R_3 \leftarrow R_2 + R_1$       }  $\rightarrow I_6 : R_3 \leftarrow R_1 + R_1$   
 $I_7 : M_3 \leftarrow R_4$       }  $\rightarrow I_7 : M_3 \leftarrow R_3 + R_3$



Final Diagram



- Q)  $I_1 : R_0 \leftarrow M_1$   
 $I_2 : R_1 \leftarrow M_1$   
 $I_3 : R_1 \leftarrow R_0 + R_1$   
 $I_4 : M_2 \leftarrow R_1$   
 $I_5 : R_3 \leftarrow M_2$   
 $I_6 : R_4 \leftarrow R_3 + R_1$   
 $I_7 : M_3 \leftarrow R_4.$

Step 1



Step 2

Revised Instructions



- $I_1 : R_0 \leftarrow M_1$   
 $I_2 : R_1 \leftarrow R_0$   
 $I_3 : R_1 \leftarrow R_0 + R_1$   
 $I_4 : M_2 \leftarrow R_1$   
 $I_5 : R_3 \leftarrow M_2$   
 $I_6 : R_4 \leftarrow R_3 + R_1$   
 $I_7 : M_3 \leftarrow R_4.$

Step 3

Revised Instructions



- $I_1 : R_0 \leftarrow M_1$   
 $I_2 : R_1 \leftarrow R_0$   
 $I_3 : R_1 \leftarrow R_0 + R_0$   
 $I_4 : M_2 \leftarrow R_1$   
 $I_5 : R_3 \leftarrow M_2$   
 $I_6 : R_4 \leftarrow R_3 + R_1$   
 $I_7 : M_3 \leftarrow R_4$

Step 4



Revised Instructions

- I1 :  $R0 \leftarrow M1$
- I2 :  $R1 \leftarrow R0$
- I3 :  $R1 \leftarrow R0 + R0$
- I4 :  $M2 \leftarrow R0 + R0$
- I5 :  $R3 \leftarrow R1$
- I6 :  $R4 \leftarrow R3 + R1$
- I7 :  $M3 \leftarrow R4$

Step 5



Revised Instructions

- I1 :  $R0 \leftarrow M1$
- I2 :  $R1 \leftarrow R0$
- I3 :  $R1 \leftarrow R0 + R0$
- I4 :  $M2 \leftarrow R0 + R0$
- I5 :  $R3 \leftarrow R1$
- I6 :  $R4 \leftarrow R3 + R1$
- I7 :  $M3 \leftarrow R3 + R1$

I<sub>1</sub> :  $R_0 \leftarrow M_1$ .

I<sub>2</sub> :  $R_1 \leftarrow M_2$ .

I<sub>3</sub> :  $R_2 \leftarrow M_3$

I<sub>4</sub> :  $R_3 \leftarrow M_4$

I<sub>5</sub> :  $R_4 \leftarrow R_0 + R_1$ .

I<sub>6</sub> :  $R_5 \leftarrow R_4 * M_5$ .

I<sub>7</sub> :  $M_5 \leftarrow R_4$ .

I<sub>8</sub> :  $R_4 \leftarrow R_2 + R_3$ .

I<sub>9</sub> :  $M_1 \leftarrow R_4$

I<sub>10</sub> :  $M_2 \leftarrow R_5$ .

I<sub>11</sub> :  $M_3 \leftarrow R_4$ .

Initial Step



Step 2

New Instructions



I<sub>1</sub> :  $R_0 \leftarrow M_1$

I<sub>2</sub> :  $R_1 \leftarrow M_2$

I<sub>3</sub> :  $R_2 \leftarrow M_3$

I<sub>4</sub> :  $R_3 \leftarrow M_4$

I<sub>5</sub> :  $R_4 \leftarrow M_1 + M_2$

I<sub>6</sub> :  $R_5 \leftarrow$

I<sub>7</sub> :

I<sub>8</sub> :

To I<sub>11</sub>

## combinational circuit

Objective - Implementation of combinational circuit.

- Q1) Write a Verilog Program that implements multiplexer module 4x1.
- Q2) Write a Verilog Program that includes above module (4x1 MUX) to describe the functionality of 8x1 MUX.
- Q3) Write a Verilog code to design, simulate and test the circuit of 3:8 decoder.

following

- 1) What is difference b/w two lines of verilog code.

# 5,  $a = b;$

$a = #5b;$

- 2) What do you mean by continuous assignment.

- 3) For the following Verilog code segment:

```
wire [7:0] A;
```

```
wire B;
```

```
assign B = ^A;
```

if the value of A is 16'b 10110011, what will be the value of {A[4:3], {B}}?

A1) Multiplexer



$$Y = \bar{S}_1 \cdot \bar{S}_0 \cdot a + \bar{S}_1 S_0 b + S_1 \bar{S}_0 c + S_1 S_0 \cdot d$$

| $S_2$ | $S_1$ | $S_0$ | $Y$ |                             |
|-------|-------|-------|-----|-----------------------------|
| 0     | 0     | 0     | a   | $\rightarrow S_2 S_1 S_0 a$ |
| 0     | 0     | 1     | b   | $\rightarrow S_2 S_1 S_0 b$ |
| 0     | 1     | 0     | c   | $\rightarrow S_2 S_1 S_0 c$ |
| 0     | 1     | 1     | d   | $\rightarrow S_2 S_1 S_0 d$ |
| 1     | 0     | 0     | e   | $\rightarrow S_2 S_1 S_0 e$ |
| 1     | 0     | 1     | f   | $\rightarrow S_2 S_1 S_0 f$ |
| 1     | 1     | 0     | g   | $\rightarrow S_2 S_1 S_0 g$ |
| 1     | 1     | 1     | h   | $\rightarrow S_2 S_1 S_0 h$ |

8 data.

assign  $y = \{\bar{S}_2 \bar{S}_1 \bar{S}_0 a + \bar{S}_2 \bar{S}_1 S_0 b + \bar{S}_2 S_1 \bar{S}_0 c + \bar{S}_2 S_1 S_0 d + S_2 S_1 S_0 e + S_2 S_1 S_0 f + S_2 S_1 S_0 g + S_2 S_1 S_0 h\}$

input [2:0] A;      Y0 Y1 Y2 Y3      A2 A1 A0 Y1 Y6 Y5 Y4 Y4

101  
±      output [7:0] y;      1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

$y[0] = \bar{A}_2 \bar{A}_1 \bar{A}_0$       0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0

$y[1] = \bar{A}_2 \bar{A}_1 A_0$       0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0

$y[2] = \bar{A}_2 A_1 \bar{A}_0$       0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1

$y[3] = \bar{A}_2 A_1 A_0$       0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0

$y[4] = A_2 \bar{A}_1 \bar{A}_0$       0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0

$y[5] = A_2 \bar{A}_1 A_0$       0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0

$y[6] = A_2 A_1 \bar{A}_0$

$y[7] = A_2 A_1 A_0$

MUX (4x1),

(3) 02 01 00  
8x1 MUX.



(4) 01 . modul. - (—).  
a = 3.

000 →



S2=0? ~~autP~~: util 2

mux a1 ( $\{S[1:0]\}$ ,  $\{I[3:0]\}$ , 01);

mux a2 ( $\{S[1:0]\}$ ,  $\{I[3:0]\}$ , 02);

Cache Miss - When instruction is not available at the time of execution, then it is called as the 'Instruction Hazard' / Cache Miss.

### Instruction Hazard Handling

① Perfect Target Instruction → Prefetch

② Prefetch a Bunch of Instructions

③ High Speed Buffer

④ Control Hazard

100 JUMP 200

102 ADD A, B, C

106 SUB A, D, E

190 : HLT :

200 ADD M<sub>1</sub>, A, M<sub>2</sub>

204 ADD A, B, D

208 JUMP 102

2 Types of Control Hazard

conditional Control Hazard

unconditional Control Hazard

### How to Deal with Control Hazard

- Perfect Target Instruction - a) Fetch both instruction stream branch not taken and taken.
- b) Both are saved until branch end.
- Branch Target Buffer - Store the address of previously executed instruction as branch target address and address of next few instruction. When decoding branch

instruction search the branch target buffer.

- Loop Buffer - Storage of entire loop that allow to execute loop without memory access.
  - Branch Prediction - Predict the branch condition and fetch an instruction stream based on the prediction.
  - Delayed Branch - Compiler detects the branch and rearrange the instruction sequence by inserting useful instruction that keep the pipeline busy in presence of a branch instruction.
- ④ Structural Hazard - When any resource is not available during the time of instruction execution, this is called structural hazard.

## Cache Memory

- Locality of Reference -

↳ Localities of reference are spatial and temporal.

|         |
|---------|
| PID-100 |
| PID-101 |
| PID-102 |
| PID-103 |
| PID-104 |

↳ When a program runs with a page fault of 10H, it will thrash between RAM and Cache due to Page Fault. - When a certain page is not available on the RAM during requirement.



| C010 | C010 | C01A | C11A | C01A | C01A | C01A |
|------|------|------|------|------|------|------|
| 0    | 0    | 1    | 0    | 0    | 0    | 0    |
| 1    | 0    | X    | 1    | 0    | 0    | 0    |
| 0    | 1    | X    | X    | 1    | 0    | 0    |
| 1    | 1    | X    | X    | X    | 1    | 0    |
| X    | X    | 0    | 0    | 0    | 0    | 0    |

## Combinational Circuits

Objection : Implementation of combinational

Q1) Function Implementation using decoder

$$F_1 = \overline{(2, 4, 7)} = 2$$

$$F_2 = (0, 3)$$

$$F_3 = (0, 2, 3, 4, 7)$$

1. Write a HDL to implement the circuit with a decoder constructed with NAND or AND gates connected to the decoder outputs.

Use a block diagram for the decoder. Minimise the number of inputs in the external gates.

Priority Encoder.

Write a Verilog code to design simulate and test a circuit of a 4:2 Priority encoder.

| A[3] | A[2] | A[1] | A[0] | O[1] | O[0] |
|------|------|------|------|------|------|
| 0    | 0    | 0    | 1    | 0    | 0    |
| 0    | 0    | 1    | X    | 0    | 1    |
| 0    | 1    | X    | X    | 1    | 0    |
| 1    | X    | X    | X    | 1    | 1    |
| 0    | 0    | 0    | 0    | X    | X    |

| Y1 |    | Y2 |    | Y3 |    |
|----|----|----|----|----|----|
| I1 | I2 | I1 | I2 | I1 | I2 |
| 00 | X  | 1  | 1  | 1  |    |
| 01 | 0  | 1  | 1  | 1  |    |
| 11 | 0  | 1  | 1  | 1  |    |
| 10 | 0  | 1  | 1  | 1  |    |

Y<sub>0</sub>

$I_3 \quad I_2$

|    | 11 | 10 | 00 | 01 | 11 | 10 |
|----|----|----|----|----|----|----|
| X  | 0  | 0  | 1  | 1  |    |    |
| 01 | 0  | 0  | 1  | 1  |    |    |
| "  | 1  | 0  | 1  | 1  |    |    |
| 10 | 1  | 0  | 1  | 1  |    |    |

$$Y_0 = I_3 + I_2 \bar{I}_2$$

$\cdot A(A), \therefore Y(Y)$

### Magnitude Comparator.

Write a Verilog code to design, simulate and test a circuit of 4 bit magnitude comparator.

int arr[10];

0 to (n-1)



HDL → Verilog . . .



arr[10] = - ;

arr[9] + . . .

arr[8] =



000 → 0000 0001

001 → 0000

f1 = 0

f2 = 0 ⇒ 0 .

f3 = 0 .



$$Y_0 = Y_3 + Y_1 \bar{Y}_2 \quad \cdot A(A) \quad \cdot Y(Y)$$

### ③ Magnitude Comparator.

Write a Verilog code to design, simulate and test a circuit of 4 bit magnitude comparator.

int arr [0];      0 to (n-1)



$$\lceil arr[10] \rceil = - ; \quad 10110 \quad 0000^*$$

arr [9] + . . .

$$arr[9] = 0$$

arr [8] =



$$000 \rightarrow 000000001 \quad 001 \rightarrow 00000$$

$$f_1 = 0$$

$$f_2 = 0 \Rightarrow 0$$

$$f_3 = 0$$

|              |              |            |            |                   |                   |
|--------------|--------------|------------|------------|-------------------|-------------------|
| $A \wedge B$ | $B \wedge A$ | $A \vee B$ | $B \vee A$ | $A \rightarrow B$ | $B \rightarrow A$ |
| 0 0          | 0 0          | 0          | 0          | 1                 |                   |
| 0 0          | 0 1          | 0          | 1          | 0                 |                   |
| 0 0          | 1 0          | 0          | 1          | 0                 |                   |
| 0 0          | 1 1          | 1          | 1          | 0                 |                   |

$$100 - 011 \rightarrow 100 - 100 \\ \Rightarrow 100 + 101. \quad 100 + 1 \rightarrow 101.$$

$$100 - 101 \rightarrow 100 - 100$$

$$\begin{array}{r} 100 \\ + 101 \\ \hline 1001 \end{array}$$

$$\begin{array}{r} 100 \\ - 011 \\ \hline 011 \end{array}$$

$$= 011 - \overbrace{1000}^{\downarrow}$$

$$\begin{array}{r} 1001 \\ 1100 \\ - 1000 \\ \hline 10100 \end{array}$$

### Multi-Processor Memory Management

In case of Uni-Processor System, data is first transferred to RAM (MM) from Hard Disk Drive (HDD) and then it is used or accessed by CPU from RAM.



In case of Multi-processor system, many CPU's access a particular memory module (MM) at the same time / many modules at the same time.

### Uniform Memory Access (UMA)



### Non-Uniform Memory Access (NUMA)

\* Memory access bandwidth

is reduced in case of NUMA as compared to UMA.



- \* Any memory module directly attached to the CPU is called L1 cache and the global memory module is referred to as the L2 cache.

- AMD (Advanced Microdevices) used shared memory module for both graphical and CPU processor.

(\*)

Symbols for Internal

5 MCs (5 marks)

5 SAs (5 marks)

- ① Pipeline ( )  
② Cache Memory ( )

COMA (Cache only Memory Access)



## Revision

$$I_1 : R_o \leftarrow M_1$$

$$I_2 : R_o \leftarrow R_o + M_2$$

$$I_3 : R_o \leftarrow R_o * M_3$$

$$R_o := M_4 \leftarrow R_o$$

Step 1 -  $M_1$   $M_2$   $M_3$

$$M_1 \leftarrow //$$

$$M_2 \leftarrow //$$

$$M_3 \leftarrow //$$

$$R_o \leftarrow //$$

$$+$$

$$I_{3b}$$

$$V \leftarrow V + I_{3b}$$

$$V$$

$$I_{3c}$$

$$V$$



$$I_{3c}$$

$$R_o$$

$$V$$

Step 2 -

$$\{I_1, I_2, I_3\} = A \Rightarrow SCD$$

$$R_o \leftarrow R_o + M_2$$

$$(A \cdot A) \rightarrow TV$$

$$\{I_1, I_2, I_3\} = A \Rightarrow SCD$$

$$I_{3c}$$

$$R_o$$

$$V$$

$$(I_1 \otimes (I_2 \cdot I_3)) \rightarrow (I_1 \otimes (I_2 \cdot I_3)) \rightarrow (I_1 \otimes (I_2 \cdot I_3)) \rightarrow (I_1 \otimes (I_2 \cdot I_3))$$

$$\{I_1, I_2, I_3\} = A \Rightarrow SCD$$

$$I_{3c}$$

$$R_o$$

$$V$$

### Vector Processor



i)  $f_1 : V \rightarrow V$       VSQR      iii)  $f_3 : V \times V \rightarrow V$ .

ii)  $f_2 : V \rightarrow S$       VSIN(A)      iv)  $f_4 : V \cdot S \rightarrow V$ .

Eg 1 -  $A = \{2, 5, 8, 9, 11\}$ .

VSQR(A) =  $\{\sqrt{2}, \sqrt{5}, \sqrt{8}, \sqrt{9}, \sqrt{11}\}$ .

Eg 2 -  $A = \{2, 3, 6, 7\}$ .

VSUM(CA) = 15.

gives a masking vector.

Eg 3 -  $A = \{2, 8, 9, 15\}$ .      VTtest (A, B).

$B = \{7, 10, 11, 12\}$

$C(I) = 10 \text{ if } A(I) < B(I)$

$C = \{10, 1, 1, 0\}$ ,       $C(I) = 0 \text{ if } A(I) \geq B(I)$ .

b) Masking vector.

Type - 3 Eg -  $A = \{2, 8, 9, 15\}$ .

$B = \{7, 10, 11, 12\}$ .

VSUM(A, B) =  $\{9, 18, 20, 27\}$ .

Type - 4 Eg -  $A = \{2, 8, 9, 15\}$ .

VSUM(A, 5) =  $\{7, 13, 14, 20\}$ .

### Special Instruction

i) Boolean Instruction      ii) Compress Instruction

## (1) Con Merge Instruction

Eg of Con Merge Instruction - A = {2, 8, 9, 15}.

$$C = \{1, 0, 0, 13\}$$

eliminate comments with

$$B = \{2, 15\}.$$

$$C = \{1, 0, 0, 1\}.$$

$$M = \{2, 7, 9, 15\}.$$

## Architectures of Vector Processor

(1) Memory to Memory Architecture - Instruction is fetched from memory and stores the advanced

data into the memory. Eg - TIASC (Texas Instruments Advanced

Science Computer), TIC, Cyber- 200.

Gray-I.

(2) Register to Register Architecture - FUSITSU- VP - 200. Instruction is fetched from register and data is stored into the register.

## Vector Processing Methods

- (1) Horizontal Processing.
- (2) Vertical Processing.
- (3) Vector Looping

(4) In case of Horizontal processing, elements are processed from left to right.

(5) In case of vertical processing, elements are processed from top to bottom.

(6) In case of vector looping, elements are processed from both left to right and top to bottom.

```

output-reg [1:0] y;
always @ (clk begin
    casex (x)
        4'b0000 : y <= 2'b00;
        4'b001x : y <= 2'b11;
        4'b01xx : y <= 2'b10;
        4'b1xxx : y <= 2'b11;
        default : $display ('Error');
    endcase
end
endmodule

```

| x    | y  |
|------|----|
| 0001 | 00 |
| 001x | 01 |
| 01xx | 10 |
| 1xxx | 11 |

By  
Tamer

QUESTION - Q1) - URGENT  
 Design and test a C  
 logic soft module

Expt - 6

Title

Code converter

Objective - Implementation of code converter and Familiarization with.

PS 1 - Binary to BCD converter.

PS 2 - BCD to Gray converter.

PS 3 - Binary to Gray Code converter.

PS 4 - Gray Code to Binary converter.

PS 5 - Binary to Excess - 3 converter.

PS 6 - Excess - 3 to Binary Code converter.

$$\begin{array}{r} \text{Bin} \\ \hline 0000 \end{array} \xrightarrow{\quad \text{BCD} \quad} \begin{array}{r} \text{BCD} \\ \hline 0000 \end{array}$$

4 0

$$\begin{array}{r} 1001 \\ 1001 \\ 1010 \\ 1001 \end{array} \xrightarrow{\quad \text{Octal} \quad} \begin{array}{r} 1001 \\ 0000 \\ 0001 \\ 0000 \end{array}$$

↓

344

21

11

0010 00010010 0001  
0010 0001  
0010 0001

Q)

| S4 | X |   |   |   |   | X | X |   |
|----|---|---|---|---|---|---|---|---|
| S3 |   | X |   | X |   |   |   |   |
| S2 |   |   | X |   | X |   |   |   |
| S1 |   |   |   | X |   |   |   |   |
|    | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |

A) From the given Reservation Table -

Forbidden Latency (FL) = 2, 4, 5, 7.

Permissible Latency (PL) = 1, 3, 6, 8.

Collision Vector (CV) = C<sub>8</sub> C<sub>7</sub> C<sub>6</sub> C<sub>5</sub> C<sub>4</sub> C<sub>3</sub> C<sub>2</sub> C<sub>1</sub>

= 0 1 0 0 1 0 1 1 0 1 0

= 0 1 0 0 1 0 1 1 0 1 0

∴ Initial Collision Vector is - 01011010.

Case I

If we give an instruction in the pipeline after 1 latency - new collision vector (C') will be

00101101 → (Right Shift Initial by 1).

OR 01011010

01111111 = C'

Next instruction can only be only after a gap of 8 latency or more - so new collision vector (C'') will be - (8), (3, 1)

000000000 → (Right Shift Initial by 8).

OR 01011010

(8) = Current 01011010 + ("Initial" 00000000)

Case II

If we give an instruction in the pipeline after 3 latency - new

Collision Vector ( $C'$ ) will be -

$$00001011 \rightarrow (\text{Right Shift Initial by } 3) \\ \text{OR } 01011010 \\ 010110101 = C'$$

### Case III

If we give one instruction in the pipeline after 6 latency gap - new collision vector ( $C'$ ) will be -

$$00000001 \rightarrow (\text{Right shift Initial by } 6) \\ \text{OR } 01011010 \\ 01011011 = C'.$$

### State Diagram



Permissible Latency Cycles for the Reservation Table -  
 $(1, 8), (8), (3, 6), (3), (6), (3, 8), (6, 8)$

Average Latency of Cycles =  $4.5, 8, 4.5, 3, 6, 5.5, 7$

Greedy cycle (cycle with lowest average latency) = (3).

Q)

|    |   |   |   |   |   |   |
|----|---|---|---|---|---|---|
| S3 | X |   |   |   | X |   |
| S2 |   |   | X |   |   | X |
| S1 |   | X |   | X |   | X |

From the given Reservation Table -

Forbidden Latencies (FL) = 2, 3, 4, 5.

Permissible Latencies (PL) = 1, 6, 7.

$$\therefore \text{Initial collision vector (CV)} = C_7 C_6 C_5 C_4 C_3 C_2 C_1$$

$$= 0011110$$

#### Cases :

If we give an instruction to the pipeline after 1 latency gap - new collision vector (C') is -

$$0001111 \rightarrow (\text{Right shift Initial by 1 bit}).$$

OR  $0011110$

$$0011111 = C'$$

Again if we give an instruction to the pipeline after 6 latency gap - new collision vector (C'') is -

$$0000000 \rightarrow (\text{Right shift Initial by 6 bits}).$$

OR  $0011110$

$$0011110 = C''.$$

Again if we give an instruction to the pipeline after 7 latency gap - new collision vector (C''') is -

$$0000000 \rightarrow (\text{Right shift Initial by 7 bits})$$

OR  $0011110$

$$0011110 = C'''$$

State Diagram for the Reservation Table -



Permissible Latency Cycles for the Reservation Table -  
(1, 6), (6).

Average Latency of cycles =  $3 \cdot 5; 6$ .

Greedy Cycle (cycle with lowest average - (1, 6)).

## Pipeline Chaining

Q

Pipeline chaining is expanded from the concept of internal forwarding. Basically, chaining is a linking process that occurs when results obtained from one pipelined unit are directly feed into the operand registers of another functional pipelines. In other words intermediate results do not have to be restored into the memory and can be used even before the vector operation is completed. Chaining permits successive operations to be issued as soon as the first result becomes available as an operand. Of course the desired functional pipeline and operand registers must be properly reserved otherwise chaining operation have to suspended until the demanded resource becomes unavailable.

### ~~Design of a vector compiler.~~

Major optimizing function are given below according to the level of sophistication in generating efficient scalar and vector code modules.

- i) General Optimization -
  - a) Common Expression Elimination
  - b) Invariant Expression Movement
  - c) Strength Reduction
  - d) Register Optimization
  - e) Constant Folding.
  
- ii) Extended Optimization -
  - a) Intrinsic Function Integration  
(Eg - sqrt, sin)
  - b) Reduction of Iteration number in nested loop.
  - c) Recorder of execution sequence to reduce pipeline overhead.
  - d) Temporal Storage Management
  - e) Code Avoidance.

- iii) vector Extended optimization -
  - a) Full Vectorization
  - b) Pipeline Chaining.
  - c) Pipeline Antichaining.
- iv)
  - ④ Vector Register optimization.
  - ⑤ Patta Parallelization.
- MIPS (Million Instructions per Seconds).

What is MIPS?

The speed of a scalar processor is usually measured by the number of instructions executed per unit time, such as the use of million instructions per second).

- MegaFlops

What is MegaFlops?

For a vector processor, it is universally accepted to measure the number of arithmetic operations performed per unit time such as the use of mega floating point operations (MegaFlops) per second.

- iii) Vector Extended Optimization -
  - a) Full Vectorization
  - b) Pipeline Chaining,
  - c) Pipeline Antichaining.
- iv)
  - ① Vector Register optimization
  - ② Patta Parallelization.

- MIPS (Million Instructions per seconds):

What is MIPS?

The speed of a scalar processor is usually measured by the number of instructions executed per unit time, such as the use of million instructions per second).

- MegaFlops

What is MegaFlops?

For a vector processor, it is universally accepted to measure the number of arithmetic operations performed per unit time such as the use of mega floating point operations (MegaFlops) per second.