

10/1/2023

Components of Processor → ALU  
 CU, Registers, I/O port



→ Heart of Computer System ↔ Heart of Embedded System

- ↳ Just as processor and memory and I/O components has to be connected externally
  - ↳ Large circuit size
  - ↳ Cost is high
  - ↳ Not suitable to be used with devices running on stored power like batteries.
  - ↳ Mostly don't have power features
- It has external processor along with internal memory and I/O components
- Smaller circuit size.
- Cost is low.
- Since the power consumption is less it can be used with devices running on stored power like batteries.

(AEI) substitution for hardware

↳ Generally in Von-Neuman architecture

Eg - Intel, Pentium, AMD

↳ Generally in Harvard architecture.

Eg - ARM, Cortex-M4, Raspberry.

## RISC

- 1/2/2024
- ↳ Reduced instruction Set Computers
  - ↳ Emphasises on software
  - ↳ It has hardwired control unit
  - ↳ Simple instructions
  - ↳ Single clock
  - ↳ More no. of registers
  - ↳ Supports less native data types
  - ↳ Supports less instruction formats
  - ↳ Supports less addressing modes.
  - ↳ Supports pipelining

## CISC

- ↳ Complex Instruction Set computers.
- ↳ Emphasises on hardware.
- ↳ It has microprogrammed control unit
- ↳ Complex instructions
- ↳ Multi- clock
- ↳ Requires Less no. of registers.
- ↳ Supports less no. of more native data types.
- ↳ Supports more instruction formats.
- ↳ Supports more addressing modes.
- ↳ Pipelining is not supported.

Eg- ARM, MIPS,  
SPARC

Eg- Intel, AMD, IBM 360/3

## Instruction Set Architecture (ISA)



# ISA is the part of computer architecture related to programming, including the native data types, instruction, registers, addressing modes, memory architecture, interrupt and exception handling and external I/O.

It serves as a boundary between software and hardware.

Seven Dimension of ISA :-

> class of ISA : Nearly all ISA are classified as general purpose register architecture where the operands are stored either in registers or memory locations.

Eg - 80x86 [ 16 GPRS (Registers) and 16 that can hold floating point data ] Register mem arch.

MIPS [ 32 GPRS (Registers) and 32 that can hold FP data ] Load-store architecture

> Memory Addressing :

Almost all computers use byte addressing to access the memory operands.

Eg - MIPS [ Requires object must be aligned An access to an object of size  $S$ 's byte at byte address  $A$  is aligned if  $A \bmod S = 0$  ]

80x86 — It doesn't require alignment but access are generally faster if operands are aligned.

Output from this is — : trigger A2T NA



3) Addressing Modes :- It specifies the address of a memory object.

Eg - MIPS :- Register, Immediate & Displacement  
& 80x86

4) Type and Size of Operand :-

Eg - MIPS & 80x86 :- Operand size :-

8 bits (ASCII char)

16 bits (Unicode char / half word)

32 bits (Integer / word)

64 bits (Long integer / double word)

32 bits (IEEE 754 FP single)

64 bits (IEEE 754 FP double)

\* 80x86 can support 80 bits (Extended double precision)

5) Operations :-

- Data Transfer

- ALU

- Control

- FP

Eg - MIPS :- Simple and easy to pipeline ISA  
Represented by RISC.

O = 2 from A

80x86 :- It has much richer larger set of operations

two floating point units FPU - 28x08

6) Control Flow Instructions :-

All ISA supports :-

- Conditional Branches
- Unconditional Jumps

- procedure calls

- return from procedure

Eg- MIPS

conditional branches checks the content of registers.

Procedure calls places the return address of registers.

Conditional branches test condition code bits set on side effect of ALU operation.

Procedure calls places the return address on a stack in memory.

7) Encoding ISA :-

There are two basic choices of encoding:

- Fixed length encoding

- Variable length encoding

Eg- MIPS — Instructions are 32-bit long fixed.

80x86 — Instructions Variable length, ranges from 1 to 18 bytes

### Measuring and Reporting Performance

Performance can be measured in two ways:-

(i) Elapsed / Execution time :  $P = \frac{t}{T}$  where  $t$  = finish time - start time

(ii) Throughput : No. of task executed per unit time.

$$P = \frac{t}{T}$$

where  $P \rightarrow$  performance

$T \rightarrow$  execution time

\* Note More throughput and less execution time.   
 throughput more greater.

### Basic Performance Equation

$$T = \frac{N \times S}{R}$$

where  $T$  → execution time of program on any HLL (High level Language)

$N$  → Actual no. of instruction to be executed

$S$  → Avg. no. of basic steps

$R$  → Clockrate =  $\frac{1}{\text{clock cycle}}$

CPU Time :   
 (Amount of time the processor takes to execute a program)   
 User CPU Time : Amount of time spent by the processor to execute user instruction.   
 Systems CPU time : Amount of time spent by the processor to execute system instruction required by the program.

User CPU Time : Amount of time spent by the processor to execute user instruction.

System CPU Time : Amount of time spent by the processor to execute system instruction required by the program.

CPI : Cycles Per Instruction

# clock cycles required : No. of CPI

# Instruction to be executed

IPC : Instructions Per Cycle =  $\frac{\# \text{ Instruction to be executed}}{\# \text{ clock cycles used}}$

- \* CPI should be less
- \* IPC should be more

### Moore's Law :-

# It says that no. of transistors in a chip doubles every 2 years.

### CPU Performance Equation :-

(Derivation)

$$\text{CPU Time} = \frac{\text{CPU clock cycles per program}}{\text{clock rate}} \times \text{Clock cycle Time}$$

OR

$$\text{CPU Time} = \frac{\text{CPU clock cycles per program}}{\text{clock rate}}$$

$$CPI = \frac{\text{CPU clock cycles per program}}{\text{Instruction Count}}$$

$$\text{clock cycles} = \frac{1}{\text{clock rate}}$$

$$\text{CPU clock cycles per program} = CPI \times \text{Instruction Count}$$

Substituting eqn (iv) in eqn (ii) :-

$$\text{CPU time} = CPI \times \text{Instruction Count} \times \text{Clock cycle time}$$

↳ CPI depends on Organisation and Instruction set arch.

↳ Instruction count depends upon ISA and Computer Technology

↳ Clock cycle time depends on H/w technology and Organization

$$\text{Total no. of clock cycles} = \sum_{i=1}^n IC_i \times CPI_i$$

where  $IC_i \rightarrow$  No. of times instruction  $i$  is executed on the program.

(continued)  $CPI_i \rightarrow$  The avg. no. of instruction per clock cycle for instruction  $i$ .

$$\text{CPU Time} = \left( \sum_{i=1}^n IC_i \times CPI_i \right) \times \text{Clock Cycle Time}$$

$$\text{Overall CPI} = \frac{\left( \sum_{i=1}^n IC_i \times CPI_i \right)}{\text{Instruction Count}}$$

Q) What is the average CPI under the following condition?

| Instruction Type | Cycles | Relative Frequency |
|------------------|--------|--------------------|
| A                | 2      | 40 %               |
| B                | 3      | 30 %               |
| D                | 10     | 25 %               |

$$\text{Ans - Total clock cycles} = \sum_{i=1}^n IC_i \times CPI_i$$

$$= 2 \times 0.4 + 3 \times 0.3 + 5 \times 0.25 + 10 \times 0.5$$

$$= 0.8 + 0.9 + 1.25 + 0.5$$

$$= 3.45$$



$$\text{Average CPI} = \frac{3.45}{4} = 0.86$$

Q) A program runs on 20 secs on Machine - A with a clock speed of 200 MHz. A computer architect wants to build machine - B, which will run this program in 6 secs. The architect has determined that a substantial increase in the clock rate is possible. But this may effect the design of the rest of the CPU, causing machine - B to require 1.2 times as many clock cycles as machine - A. What clock rate should be target for a best design?

$$\text{Ans - CPU Time}_A = \frac{\text{Clock Cycles}_A}{\text{Clock Rate}_A}$$

$$\Rightarrow 20 \text{ sec} = \frac{200 \times 10^6}{\text{Clock Rate}_A}$$

$$\Rightarrow \frac{20 \text{ sec}}{1.85 + 0.5 \times 1.85} = \frac{\text{Clock Cycles}_A}{200 \times 10^6 \text{ Hz}}$$

$$\Rightarrow \text{Clock Cycles}_A = \frac{20 \times 200 \times 10^6}{1.85 + 0.5 \times 1.85} = 10.2 \times 10^6$$

$$\text{CPU Time}_B = \frac{\text{Clock Cycles}_B}{\text{Clock Rate}_B}$$

$$\Rightarrow 6 \text{ sec} = \frac{1.2 \times 20 \times 200 \times 10^6}{\text{Clock Rate}_B}$$

$$\Rightarrow \text{Clock Rate}_B = \frac{1.2 \times 20 \times 200 \times 10^6}{6} = 800 \text{ MHz}$$

$\therefore$  Targeted clock rate should be 800 MHz.



- 12/02/2013 Q) Suppose we have made the following arrangement.
- Frequency of FP operations (other than FPSQR) = 25%.
  - Average CPI of FP operation = 4.0
  - Average CPI of other construction = 1.33
  - CPI of FPSQR = 2.0
- Assume two design alternatives and alternatives :-
- Decrease the CPI of FPSQR to 2.5
  - Decrease the CPI of FP to 2.5
- Compose the 2 design alternatives using CPU performance equation.

| Ans - | Type  | Frequency | CPI  | Original Test   |
|-------|-------|-----------|------|-----------------|
|       | FP    | 25%       | 4.0  |                 |
|       | FPSQR | 2%        | 2.0  | = 0.01093 - min |
|       | other | 73%       | 1.33 |                 |

(100 - 27)% = 73%  $\times$  0.01093 = 0.00925

Now, CPU Time original =  $\frac{25\% \times 4 + 2\% \times 2 + 73\% \times 1.33}{2.37 \times 0.01}$

i) CPU Time  $\frac{25\% \times 4 + 2\% \times 2 + 73\% \times 1.33}{FPSQR} = 2.01$

ii) CPU Time  $\frac{25\% \times 2.5 + 2\% \times 2 + 73\% \times 1.33}{FP} = 0.123005 \times 0.01 \times 2.01 = 0.0025$

SPEC Rating

SHM  $\frac{0.0025}{0.01093} = 0.227$

# System Performance Evaluation Corporation.

↳ A non-profit organisation called SPEC selects and publishes representative application programs from different application domains.

- Programs selected are :-
- Games
  - Astro physics
  - Quantum Chemistry
  - Database
  - Compilers

↳ The same program is also compiled and run on one computer selected as reference.

$$\text{SPEC rating} = \frac{\text{Running on Reference Computer}}{\text{Running time on Computer Under Test}}$$

$$\text{Overall SPEC Rating} = \left( \prod_{i=1}^n \text{SPEC}_i \right)^{1/n}$$

where  $n \rightarrow$  no. of programs in the suit

Q) Show that the ratio of the geometric mean is equal to the geometric mean of performance ratio and that the reference computer of SPEC ratio matters not.

Ans - Let there are 2 computers A and B and SPEC ratio of each two is  $\frac{\text{geometric mean of A}}{\text{geometric mean of B}}$

Then ratio of geometric mean of B

$$= \sqrt[n]{\prod_{i=1}^n \text{SPEC}_{A_i}}$$

$$= \sqrt[n]{\prod_{i=1}^n \text{SPEC}_{B_i}}$$

$$= \sqrt[n]{\prod_{i=1}^n \frac{\text{SPEC}_{A_i}}{\text{SPEC}_{B_i}}}$$



$$= \sqrt[n]{\frac{\text{Execution time of Reference}}{\frac{\text{Execution time of } A_i}{\frac{\text{Execution time of Reference}}{\text{Execution time } B_i}}}}$$

$$= \sqrt[n]{\frac{\text{Execution time } B_i}{\text{Execution time } A_i}} = \text{Performance} = 3.92$$

$$= \sqrt[n]{\frac{\text{Performance of } A_i}{\text{Performance of } B_i}} = \text{Performance} = 3.92 \text{ times}$$

$A_i \quad \text{Performance} = \frac{1}{\text{Execution time}}$

Amdahl's Law :- 15/02/2023  
 It depends on 2 factors :-  
 (i) Fraction enhanced :— The fraction of the computation time in the original computer that can be computed converted to take advantage of the enhancement.

$$\boxed{\text{Fraction enhanced} \leq 1}$$

(ii) Speedup enhanced :— The improvement gained by the enhanced operation mode.

$$\boxed{\text{Speedup enhanced} \geq 1}$$

$$\boxed{\text{Speedup overall} = \frac{1}{(1 - \text{Fraction enhanced}) + \frac{\text{Fraction enhanced}}{\text{Speedup enhanced}}}}$$

- a) The processor has to be enhanced for web serving. The new processor is 10 times faster on computation than the original one. The original processor was busy with computation 40% of time and waiting for I/O 60%. What is the overall speedup gained by incorporating of the enhancement?



→ New CPU is 10 times faster

Fraction enhanced = 40%

Speedup enhanced = 10

$$\boxed{\text{Speedup overall} = \frac{1}{(1 - \text{Fraction enhanced}) + \frac{\text{Fraction enhanced}}{\text{Speedup enhanced}}}}$$

$$= \frac{1}{(1 - 0.4) + \frac{0.4}{10}} = 1.56$$

- a) Suppose FFSQR is responsible for 20% of execution time of a critical graphics benchmark. 1<sup>st</sup> proposal is to enhance the FFSQR hardware and speedup the operation by a factor of 10. Second proposal is to make all FP instructions run faster by a factor of 1.6. FP instruc



are responsible for total of 50% of the execution time for the application. Compare both design alternatives.

Ans - Speedup

<1> FASTER

$$\text{Fraction enhanced} = 20\%$$

$$\text{Speedup enhanced} = \frac{1}{0.2} = 5$$

<2> SPEED

FP

$$\text{Fraction enhanced} = 50\%$$

$$\text{Speedup enhanced} = \frac{1}{0.5} = 2$$

So, Speedup overall

for 1

$$= \frac{1}{(1-0.2) + \frac{0.2}{5}} = 1.2 \text{ days}$$

$$\text{Speedup overall for 2} = \frac{1}{1-0.5 + \frac{0.5}{2}} = 1.23 \text{ days}$$

So, the 2nd alternative is slightly better than the 1st one

$$\text{Speedup overall} = \frac{1}{(1 - \sum \text{Fraction enhanced}) + \sum \text{Fraction enhanced}} = \frac{1}{1 - \sum \text{Speedup enhanced}}$$

16/2/2013

### PIPELINING

# It is effective way of organising concurrent activity in a computer system.

→ It is a technique for overlapping the execution of several instruction to reduce the execution time for a set of instruction.

Speedup all pipeline has been described in 820471. The reason of improving branch is due to overlapping between IF and ID stage. If we do not have overlapping between IF and ID stage, then it is necessary to wait for the result of branch until the next instruction is fetched.

### DLX Pipeline

→ It has 5 stages.

> Instruction Fetch (IF)

Fetch the instruction

> Incrementing the PC

> Instruction Decode (ID)

Interpretation of op-code

Decode and Search where the source system are present

Read the register content

Check for branch

> Execution (EX)

ALU operations

EA calculation

Effective Address

> Memory (MEM)

Load

Store

> Write Back (WB)

Writing the final result into the register file.



Scanned with OKEN Scanner

An instruction can be forwarded to next stage if the buffer of the next stage is free.

Time-Space Diagram :-

|                | 1              | 2              | 3              | 4              | 5              | 6              | 7              | 8              | 9 | ... |
|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|---|-----|
| I <sub>1</sub> | F <sub>1</sub> | D <sub>1</sub> | E <sub>1</sub> | M <sub>1</sub> | W <sub>1</sub> |                |                |                |   |     |
| I <sub>2</sub> | F <sub>2</sub> | D <sub>2</sub> | E <sub>2</sub> | M <sub>2</sub> | W <sub>2</sub> |                |                |                |   |     |
| I <sub>3</sub> |                | F <sub>3</sub> | D <sub>3</sub> | E <sub>3</sub> | M <sub>3</sub> | W <sub>3</sub> |                |                |   |     |
| I <sub>4</sub> |                |                | F <sub>4</sub> | D <sub>4</sub> | E <sub>4</sub> | M <sub>4</sub> | W <sub>4</sub> |                |   |     |
| I <sub>5</sub> |                |                |                | F <sub>5</sub> | D <sub>5</sub> | E <sub>5</sub> | M <sub>5</sub> | W <sub>5</sub> |   |     |
|                |                |                |                |                |                |                |                |                |   |     |



# The pipeline takes the time of its slowest stage.



| CC →           | 1              | 2              | 3              | 4              | 5              | 6              | 7              | 8              | 9              | 10             | 11             | 12             | 13             | ...            |
|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| I <sub>1</sub> | F <sub>1</sub> | D <sub>1</sub> | D <sub>1</sub> | E <sub>1</sub> | E <sub>1</sub> | E <sub>1</sub> | M <sub>1</sub> | M <sub>1</sub> | W <sub>1</sub> |                |                |                |                |                |
| I <sub>2</sub> |                | F <sub>2</sub> |                | D <sub>2</sub> | D <sub>2</sub> | D <sub>2</sub> | E <sub>2</sub> | E <sub>2</sub> | E <sub>2</sub> | M <sub>2</sub> | M <sub>2</sub> | W <sub>1</sub> |                |                |
| I <sub>3</sub> |                |                | F <sub>3</sub> | D <sub>3</sub> | D <sub>3</sub> | E <sub>3</sub> | M <sub>3</sub> | M <sub>3</sub> | W <sub>3</sub> |                | D <sub>3</sub> | D <sub>3</sub> | E <sub>3</sub> | M <sub>3</sub> |

→ In a pipeline, a single instruction takes same or more time for execution while multiple instruction takes less time.

## Taxonomy of Parallel Computer Architecture

### Flynn's Classification :-

→ Based on data stream and instruction stream.

→ 4 categories of computer :-

- 1) SISD → Single Instruction Stream - Single Data Stream
- 2) SIMD → Single Instruction Stream - Multiple Data Stream
- 3) MISD → Multiple Instruction Stream - Single Data Stream
- 4) MIMD → Multiple Instruction Stream - Multiple Data Stream.

### SISD Organisation



CU → Control Unit

PU → Processing Unit

IS → Instruction Stream

DS → Data Stream

MM → Memory Module



↳ Instructions are executed sequentially but may overlap in their execution stage.

↳ Most SISD uniprocessors are pipelined.

↳ It can have functional unit but no ALU.

Eg - IBM 701 (1), IBM 1620 (1), IBM 360/91, CDC 6600

## (2) SIMD Organisation :-



↳ This class corresponds to array processor.

↳ There are multiple PUs supervised by the same Control Unit.

↳ All PUs receive the same instruction broadcast from the CU but operate on different datasets from distinct data streams.

↳ The Shared Memory subsystem may contain multiple modules.

Eg - ILLIAC-IV, BSP-(C), MPP, DAP.

## (3) MIMD Organisation



↳ Here m-processor unit acts each receiving distinct instruction operating over the same data stream & its derivatives.

↳ The result of one processor become the input of next processor in the macropipe.

\* There is no practical implementation of this organisation.

## (4) MIMD Organisation :-



↳ Most multiprocessors and multi-computers are classified into this category.

↳ An MIMD Computer implies Interaction among n-processors because all memory streams are derived from same data space shared by all processors. [Tightly Coupled]

↳ If the n-data streams were derived from disjoint subspaces of the shared memory then we would have so called Multiple SISD operation. [Loosely Coupled].

Eg- CMMP, CRAY-2, CRAY-XMP, HEP

22/2/2023

### Parallelism in Software

1) Instruction Level : Multiple instruction from the same instruction stream can be executed concurrently.

↳ Generated and managed by hardware (super scalar) or by compiler (VLIW).

↳ Limited in practice by data or controlled dependencies.

2) Task Level / Thread Level : Multiple threads or instruction or instruction sequences from the same application can be executed concurrently.

↳ Generated by compiler / user and managed by compiler or hardware.

↳ Limited in practice by communication / synchronisation overhead and by

3) Data Level : Instructions from a single stream operate concurrently on several data.

↳ Limited by non-regular data manipulation patterns and by memory bandwidth.

4) Transaction Level : Multiple threads / processes from different transaction can be executed concurrently.

↳ Limited by concurrency overhead.

Multiprocessor

# Multiple processes are

3 types :-

1) UMA (Uniform Memory Access)

2) NUMA (Non-uniform Memory Access)

3) COMA (Cache Only Memory Access)

#### 1) Uniform Memory Access (UMA)



Scanned with OKEN Scanner

→ All processors have equal access time to all memory

(i) Symmetric UMA : All processor have equal access to all peripheral devices.

(ii) Asymmetric UMA : Only one or a subset of processor can execute OS and handle I/O. The remaining processor have no I/O capability and are called attached processors.

## 26/12/2023 2) Non-Uniform Memory Access



→ It is a shared memory system in which, the access time varies with the location of the Memory Word.

→ The processors are divided into several clusters and each cluster is a UMA or NUMA.

→ The clusters are connected to Global Shared Memory through an interface.

## 3) Cache Only Memory Access (COMA)



→ It is a special case of NUMA in which the distributed Main Memories are converted into cache.

→ All the caches form a Global Address Space.

→ Remote cache access is achieved by the distributed cache directories.

## Distributed Memory (Multi Computer)



→ The system consist of multiple computers called nodes interconnected by a message passing network. It without using bus sharing information.



↳ Each node is an autonomous computer consisting of a processor, local memory and sometimes attached disc or I/O peripherals.

### Multi-Processors

- ↳ They have GSM.
- ↳ Tightly coupled system.
- ↳ Shared Memory architecture.
- ↳ They share the clock and I/O.

### Multi-Computers

- ↳ Do not have GSM.
- ↳ Loosely coupled system.
- ↳ Message Passing architecture.
- ↳ They do not share clock or I/O.

### Super Scalar Processors :-

29/2/2028



$$\text{Performance} = \text{IPC} \quad (\text{Instructions Per Cycle})$$

- ↳ The processor decides how many instructions to issue.
- ↳ Both static and dynamic techniques are used.

### H/w Requirements :-

- Instruction alignment
- Arithmetic units and pipeline

- Interaction among functional integer and FP units

### Advantages :-

- ↳ Smaller code size.
- ↳ Binary compatibility across generation of hardware.
- ↳ Faster computation.
- ↳ More throughput.

### Limitations :-

- ↳ It limits of available ILP (Instruction Level Parallelism) in the program.

- ↳ Hardware complexity.



### VLIW

#### Very Long Instruction Word.

- ↳ It issues a fix number of instructions per cycle.
- ↳ This is done by formatting a fixed no. of instructions as one Long Instruction Word.
- ↳ This is statically scheduled by the compiler.

### Advantages :-

- Simplified hardware for decoding and issuing instruction.
- No interlocked hardware. (The compiler has already checked it.)
- More registers but simplified hardware for register ports.

### Limitations :-

- Technical Problems.



- A cache miss in any FU causes the entire processor to stall because of block stop operation of VLIW.
- Primary compatibility problems since machines with different no. of issues and FU latencies require diff. versions of machine code.
- Cannot be programmed in assembly language because of high compatibility complexity issue.

## Vector Processing

11/03/2023

# It is a co-processor specially designed to perform vector computation.

- A vector instruction involves a large array of operands
- The same computation will be performed over a array of data
- They are often used in multi-pipelined super computer.

Format of Vector Instruction :-

| Op code | Base Addr. | Base Addr. | Base Addr.  | Vector Length |
|---------|------------|------------|-------------|---------------|
|         | Source 1   | Source 2   | Destination |               |

→ The vector processing can be successfully implemented by using the following approaches :-

- Enrich vector instruction set
- Combine scalar instruction
- Select a suitable algorithm.
- Use of vectorizing compiler.



IPU : Fetches and decodes both scalar and vector instruction

- The scalar instructions are dispatched to scalar processors through scalar registers
- The vector instructions are dispatched to VIC.

VIC : - Decoding of vector instruction.

- Calculating effective address of vector operands
- Setting of vector access controller and vector processor
- Monitoring the execution of vector instructions.

VAC : Responsible for fetching the vector operands by a series of access from the main memory.

- More vector registers are provided to bridge the speed gap between the main memory and vector processor to avoid a stall



### Advantages :-

- ↪ Flynn's Bottleneck → It can be reduced by using vector instructions.  
Fewer instructions are required to execute a program. This reduces the bandwidth required for instruction fetch.
- ↪ Data Hazard → Can be eliminated due to the structured nature of the data used by the vector machine.
- ↪ Memory Latency → Can be reduced by using pipeline 'load and store' operation
- ↪ Control Hazard → Reduced as a result of specifying a large number of iterations in a single vector instruction.
- ↪ Pipeline → Can be exploited to the maximum extent. This is facilitated by the absence of data and control hazard.

### Array Processor

- ↪ It can be considered as a vector or a set of vectors.
- ↪ A processor capable of processing arrays is called array processing.
- ↪ It performs a single instruction on multiple execution unit in the same clock cycle.
- ↪ The different execution units process instruction using same set of vectors in their array.
- ↪ The operations are divided into several streams of operands and run for the strength in the parallel processing unit.

↪ It is special case of pipelining where the parallel pipelines operate on different stream of operands.

↪ The control unit of an array processor fetches and decodes instruction and broadcast the appropriate control signals to the division data points within all the processors.



### Vector Processing

↪ Parallelism is achieved by a multiplicity of concurrently executable distinct and specialized functional units that collectively constitute the overall ALU of the process.

### Array Processing

↪ Parallelism is achieved by a multiplicity of processing elements (PES) each of the each of which constructs a distinct ALU on its own right.



→ Parallelism is also achieved by pipelining the individual functional unit.

→ Sequential current stream data traffic are at the previous memory interface parallelism is achieved.

→ Data communication is achieved through memory functional units.

→ Data communication is through shared memory or through direct transmission between PFs

## Super Scalar

- The code size is small.
- Complex hardware for decoding and issuing.
- They are compatible across generation.
- No change in between is required

## VLIW

- The code size is large
- Simple hardware for decoding and issuing instr.
- They are not compatible across generation.
- Requires more registers for simplified hardware / ~~required~~ register points.

## Processor types

Processor is classified according to instruction set architecture into two (2) namely  
a) RISC (Reduced Instruction Set Computer)  
b) CISC (Complex Instruction Set Computer)

## Processor types

Processor is classified according to instruction set architecture into two (2) namely  
a) RISC (Reduced Instruction Set Computer)  
b) CISC (Complex Instruction Set Computer)

CLASSIFICATIONOF PIPELINE

# Handler's Classification - 1977

Based on Level of Processing.

### 1) Instruction Pipeline

# Instruction look-ahead pipeline

↳ Execution of stream of instructions is pipelined by overlapping the execution of current instruction with -  
fetch, decode, operand fetch of other instruction

↳ It is Inter Instruction Pipeline



### 2) Arithmetic Pipeline

↳ The ALU of a computer are segmented into various data formats for pipeline operations.

↳ Intra-instruction pipeline.



### 3) Processor Pipeline

→ Same data stream is processed by cascade of processor, each processor performs a specific task.



Speedup vs No. of Stages :-

$$* \text{ Speedup ratio} = \frac{\text{Execution time in non-pipelined processor}}{\text{Execution time in pipelined processor}}$$

Let  $T_n \rightarrow$  Execution time of single instruction in a non-pipelined architecture

$T_p \rightarrow$  Execution time of single instruction in a pipelined architecture

$k \rightarrow$  No. of stages

$N \rightarrow$  No. of instructions to be executed

$$\text{So, Speedup} = \frac{N \cdot T_n}{kT_p + (N-1)T_p}$$

$$\boxed{\text{Speedup} = \frac{N \cdot T_n}{(N+k-1)T_p}}$$

when  $N \gg k$ , ' $k$ ' can be ignored,

$$\text{Speedup} = \frac{N \cdot T_n}{N \cdot T_p}$$

$$\Rightarrow \boxed{\text{Speedup} = \frac{T_n}{T_p}}$$

In an ideal case,  $T_n = K \cdot T_p$

$$\text{So, Speedup ratio} = \frac{K \cdot T_p}{T_p}$$

$$\Rightarrow \boxed{\text{Speedup ratio} \propto K}$$

More the no. of stages, more is the speed.

Pipeline Performance

→ For a variety of regions, one of the pipeline stages may not be able to complete its task for a given instruction in time allocated.

|                | 1              | 2              | 3              | 4              | 5              | 6              | 7              | 8              |
|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| I <sub>1</sub> | F <sub>1</sub> | D <sub>1</sub> | E <sub>1</sub> | E <sub>1</sub> | E <sub>1</sub> | M <sub>1</sub> | W <sub>1</sub> |                |
| I <sub>2</sub> |                | F <sub>2</sub> | D <sub>2</sub> | -              | -              | E <sub>2</sub> | M <sub>2</sub> | W <sub>2</sub> |
| I <sub>3</sub> |                |                | F <sub>3</sub> | -              | -              | D <sub>3</sub> |                |                |
| I <sub>4</sub> |                |                |                | -              | -              | E <sub>4</sub> |                |                |
| I <sub>5</sub> |                |                |                |                | -              | -              | E <sub>5</sub> |                |

2 clock cycle penalty

2 stalls or  
2 bubbles

→ Any reason for which a stall occurs are known as hazards.

3 types of Hazards :-

(i) Structural Hazard

The situation when 2 instructions require the use of a



given hardware resource at the same time.

- Eg -
- Common bus
  - Multiple instruction fetch with one ALU.
  - Register file not having enough read & write ports.
  - Data and instructions are in the same cache.

### (ii) Data Hazard

# It is any condition in which either the source or the destination operands of an instruction are not available at the time expected in the pipeline.

Eg - I<sub>1</sub>: Add R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub>      R<sub>1</sub> ← R<sub>2</sub> + R<sub>3</sub>      R<sub>2</sub> and R<sub>3</sub> not available at the time of addition.  
I<sub>2</sub>: MUL R<sub>5</sub>, R<sub>1</sub>, R<sub>6</sub>      R<sub>5</sub> ← R<sub>1</sub> \* R<sub>6</sub>      R<sub>1</sub> and R<sub>6</sub> not available at the time of multiplication.

### (iii) Control Hazard / Instruction Hazard

# The pipeline may also be stalled because of a delay in the availability of instructions. It arises from the pipelining of branches and other instructions that change the PC.

Eg - Cache miss

### Performance of Pipeline with Stalls

$$\text{Speedup from pipeline} = \frac{\text{Average Instruction Time unpipelined}}{\text{Average Instruction Time pipelined}}$$
$$= \frac{\text{CPI unpipelined} \times \text{Clock Cycle Time unpipelined}}{\text{CPI pipelined} \times \text{Clock Cycle Time pipelined}}$$

$$= \frac{\text{CPI unpipelined}}{\text{CPI pipelined}} \times \frac{\text{Clock Cycle Time unpipelined}}{\text{Clock Cycle Time pipelined}}$$

wkt, CPI pipelined = Ideal CPI + Pipeline Stall Clock cycles per instruction.  
Assuming Ideal CPI = 1,

$$\text{CPI pipelined} = 1 + \text{Pipeline Stall Clock Cycles per instruction}$$

$$\text{Speedup from pipelining} = \frac{\text{CPI unpipelined}}{1 + \text{Pipeline stall Clock cycles per instruction}}$$

$$= \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall Clock cycles per instruction}}$$

$$\Rightarrow \text{Speedup from pipelining} = \frac{\text{No. of stages in the pipeline}}{1 + \text{Pipeline stall Clock cycles per instruction}}$$

### Mimimizing Structural Hazard

Structural hazards can be minimised by introducing additional resources.

- Eg -
- Introducing multi bus in place of common single bus.
  - Adding multiple number of ALUs for multiple instruction fetch.
  - Introducing multiple no. of R/w ports on register file.
  - Keeping data and instruction in separate cache.

## Data Hazard Minimisation Techniques

1) If Register write occurs in the 1<sup>st</sup> half of clock cycle and register read occurs in the 2<sup>nd</sup> half of clock cycle.

I<sub>1</sub>: Add R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub>      R<sub>1</sub>  $\leftarrow$  R<sub>2</sub> + R<sub>3</sub>

I<sub>2</sub>: Mul R<sub>5</sub>, R<sub>1</sub>, R<sub>6</sub>      R<sub>5</sub>  $\leftarrow$  R<sub>1</sub> \* R<sub>6</sub>

2) Data Forwarding / Result Forwarding / Operand Forwarding:

→ Data are available at the output of the ALU once the execute stage completes.

→ Hence the delay can be reduced or possibly eliminated if we arrange the result of instruction I<sub>1</sub> to be forwarded directly for use for Execution stage of I<sub>2</sub>.

### Data Hazards Requiring Stall :-

Unfortunately all potential Data Hazards can be handled by bypassing.

Eg - I<sub>1</sub> : LD R<sub>1</sub>, O(R<sub>2</sub>) + R<sub>1</sub>  $\leftarrow$  O(R<sub>2</sub>)

I<sub>2</sub> : ID, R<sub>1</sub>, R<sub>2</sub>, R<sub>5</sub>      R<sub>4</sub>  $\leftarrow$  R<sub>1</sub> - R<sub>5</sub>

I<sub>3</sub> : AND R<sub>6</sub>, R<sub>1</sub>, R<sub>7</sub>      R<sub>6</sub>  $\leftarrow$  R<sub>1</sub> ^ R<sub>2</sub>

I<sub>4</sub> : OR R<sub>5</sub>, R<sub>1</sub>, R<sub>7</sub>      R<sub>5</sub>  $\leftarrow$  R<sub>1</sub> v R<sub>6</sub>

∴ So, forwarding path cannot operate backward in time.

### 3) Handling Data Hazard through Software :-

→ We assume that data dependency is discovered by the hardware while the instruction is being decoded.

→ The compiler can introduce some delay cycles needed between the dependent instruction by inserting NOP instruction.

→ Being aware of the need for a delay, the compiler can attempt to reorder the instruction to perform useful task in NOP stalls and thus achieve better performance.

Eg - I<sub>1</sub> : Add R<sub>1</sub>, R<sub>2</sub>, R<sub>2</sub>

I<sub>2</sub> : Stall NOP

I<sub>3</sub> : Stall NOP

I<sub>4</sub> : Stall NOP

I<sub>5</sub> : Mul R<sub>5</sub>, R<sub>1</sub>, R<sub>6</sub>

Limitation :- Code size increases

→ The NOP instructions inserted to satisfy the requirement of one implementation may not be needed and hence lead to reduce performance on different implementation.

15/03/2023

## Branch Hazard / Control Hazard / Instruction Hazard

→ Branch instruction may cause pipeline to stall.

→ The time lost as a result of branch instruction is often referred to as Branch Penalty.

→ For a larger pipeline, the branch penalty may be higher.

### Basic Mechanism for Branch Penalty

I) The simplest scheme to handle branches is to freeze or flush the pipeline holding or deleting any instruction after the branch until the branch destination is known.

II) A higher performance but slightly more complex scheme is to treat every branch as not taken simply allowing the hardware to continue as if the branch were not executed.

III) An alternative approach, to treat every branch as taken. Not so effective.



## Control Hazard Minimization Techniques :-

### 1. Delayed Branch :-

The location following a branch instruction is called a branch delay slot.

- There might be more than one branch delay slot depending on the time it takes to execute a branch instruction.
- The instructions in the delay slots are always executed whether the branch is taken or not taken.
- The objective is to place useful instruction in this slot.

→ If no useful instructions are found, these slots may be filled with NOP-instructions.

Eg - If  $(a == 5)$  {



Is else {

I5 } Branch delayed so it won't be ready at the same time as I6

I6 } Branch delayed so it won't be ready at the same time as I7

I7 }

→ 4 stage pipeline [F/D | Ex | Mem | WB]

Eg - Loop : Shift Left R1  
Decrement R2  
Branch ≠ 0 Loop

ADD: R5, R6

Branch delay slot  
(Instruction following the 'Branch' instruction)

→ Branching takes place one instruction later than where the branch instruction appears in the instruction sequence in the memory. Hence the name delayed branch.

→ Effectiveness depends on how often it is possible to re-order the instructions.

→ More the depth of pipeline, more the branch delay slot.

→ The compiler has to search for instruction to fill the branch delay slot. It's all at runtime.

25/03/2024

### 2. Instruction Queuing and Prefetching :-



→ A sophisticated fetch unit can fetch instructions before they are needed and put them in a queue.

→ A separate unit takes instruction from the front of the queue and sends them to execution unit. It also does decoding.

→ When hazard occurs, the fetch unit busy in fetching instructions and stores them in a queue and similarly when there is delay in fetching instruction, it continues to issue instruction from the queue.

→ The instruction fetch unit executes the branch instruction concurrently with the execution with the execution of other instruction.



→ The technique is referred to as Branch Folding.

→ Branch folding occurs if the time at which branch instruction is encountered at least one instruction must be available in the queue other than the branch instruction.

→ Adequate supply of instruction for processing is required to make the queue full.

Q) Find the effective addition to the CPI arising from branching for this pipeline assuming the following f

|                                 |      |
|---------------------------------|------|
| Unconditional Branches          | 4 %  |
| Conditional Branches, Not Taken | 6 %  |
| Conditional Branches, Taken     | 10 % |

| Branch Scheme  | Penalty uncondition | Penalty NT | T |
|----------------|---------------------|------------|---|
| Flush Pipeline | 2                   | 3          | 3 |
| Predicted, T   | 2                   | 3          | 2 |
| Predicted, NT  | 2                   | 0          | 3 |

| Ans - Branch Scheme | Addition to the CPI from Branch Cost |                        |                        |              |
|---------------------|--------------------------------------|------------------------|------------------------|--------------|
|                     | Conditional Branch                   | Conditional Branch NT  | Conditional Branches T | All Branches |
| Frequency of event  | 4 %.                                 | 6 %.                   | 10 %.                  | 20 %.        |
| Stall Pipeline      | $0.04 \times 2 = 0.08$               | $0.06 \times 3 = 0.18$ | $0.1 \times 3 = 0.3$   | 0.56 %.      |
| Pipeline T          | $0.04 \times 2 = 0.08$               | $0.06 \times 3 = 0.16$ | $0.1 \times 2 = 0.2$   | 0.46         |
| Pipeline NT         | $0.04 \times 2 = 0.08$               | $0.06 \times 0 = 0$    | $0.1 \times 3 = 0.3$   | 0.38         |

The ideal pipeline would be 1.56 times faster than a pipeline that uses the stall pipeline.

The predicted untaken scheme would be 1.18 times faster than the stall pipeline scheme under the same assumption.

### Instruction Level Parallelism (ILP)

# It is the measure of how many of the operations in a computer program can be performed simultaneously.

→ It is a potential overlap among instructions.

ILP has 2 components →

- Static (Itanium processor)
- Dynamic (Pentium processor)

Eg:-



Eg -  $I_1 : e = a + b$   
 $I_2 : f = c - d$   
 $I_3 : m = e/f$

ILP =  $\frac{3}{2}$

→ More the ILP, the better.

To obtain better ILP :-

### Out of Order Execution :-

# Instructions execute in any order without violating the data dependency.

↳ It can be used to exploit the ILP.

↳ There are 3 different types of dependency that may arise because of out-of-order execution :-

- (i) Data dependency
- (ii) Name dependency
- (iii) Control dependency

#### Data Dependence :-

An instruction- $j$  is data dependent on instruction- $i$  if either of the following holds :-

↳ Instruction- $i$  produces a result that is used by instruction- $j$ .

↳ Instruction- $j$  is data dependent on instruction- $k$  and instruction- $i$  is data dependent on instruction- $k$ .

Eg -  $I_i : R_1 \leftarrow R_2 + R_3$   
 $I_j : R_5 \leftarrow R_1 + R_5$



### Possible cases for out of Order Execution :-

4 possible cases :-

I) Read after Read (CRAR) : Instruction- $j$  tries to read a data item before  $I_i$  reads it.

Eg -  $I_i : Add R_1, R_2, R_3 \quad R_1 \leftarrow R_2 + R_3$   
 $I_j : Mul R_5, R_2, R_8 \quad R_5 \leftarrow R_2 * R_8$

Allows out-of-order execution

II) Read After Write (RAW) :  $I_j$  tries to read a source before it is written by  $I_i$ .

Eg -  $I_i : Add R_1, R_2, R_3 \quad R_1 \leftarrow R_2 + R_3$   
 $I_j : Mul R_5, R_1, R_6 \quad R_5 \leftarrow R_1 * R_6$

Can't be out-of-order execution

III) Write After Read (WAR) :  $I_j$  tries to write a destination before it is read by  $I_i$ .

Eg -  $I_i : Add R_1, R_2, R_3 \quad R_1 \leftarrow R_2 + R_3$   
 $I_j : Mul R_2, R_5, R_6 \quad R_2 \leftarrow R_5 * R_6$

Doesn't allow out-of-order execution

IV) Write After Write (WAH) :  $I_j$  tries to write an operand before it is written by  $I_i$ .

Eg -  $I_i : Add R_1, R_2, R_3 \quad R_1 \leftarrow R_2 + R_3$   
 $I_j : Mul R_1, R_5, R_6 \quad R_1 \leftarrow R_5 * R_6$

Doesn't allow out-of-order execution

WAH hazards are present only in pipelines that writes in more than one pipeline stages or allow an instruction to proceed even if the previous instruction is stalled.



### (ii) Name Dependencies :-

# Occurs when two instructions use the same register or memory location called 'name', but there is no flow of data between the instructions associated with that name.



↳ The name dependency can be resolved by using Register Renaming Techniques.

↳ This can be done statically by the compiler or dynamically through hardware.

RAR

RAW

WAR

WAW

True dependency as it allows out-of-order dependency while other can be resolved.

### (iii) Control Dependency :-

It determines the ordering of an instruction - i wrt a branch instruction so that the instruction - i is executed in correct program order and only when it should be.

Eg - if ( $P_1$ ) {  
    s<sub>1</sub> } Here s<sub>1</sub> is control dependent on P<sub>1</sub> AW  
    { s<sub>1</sub> } s<sub>2</sub> is control dependent on P<sub>2</sub>  
        { s<sub>2</sub> } s<sub>2</sub> is not control dependent on P<sub>1</sub>  
    if ( $P_2$ ) {  
        s<sub>2</sub> }

↳ The control dependencies must be preserved to preserve the program order.

↳ There are 2 constraints imposed on control dependencies :-

↳ An instruction that is control dependent on a branch can't be moved before the branch so that its execution is no longer controlled by the branch.

↳ An instruction that is not control dependent on a branch can't be moved after the branch so that its execution is controlled after by the branch.

### Branch prediction



## MODULE - III

### Branch Prediction

It is a technique to reduce branch penalty for conditional branches.

It predicts whether a particular branch will be taken or not.

Speculative Execution : It means that instructions are executed before the processor is certain that they are in the correct execution sequence.

Care must be taken that no processor registers or memory locations are updated until it is confirmed that these instructions should indeed be executed.

Static Branch Prediction : The prediction decision is always same everytime a given instruction is executed.

Dynamic Branch Prediction : The approach in which the prediction decision may change depending on execution history.

### Single bit Prediction



Code snippet

```

≡
if (d == 0)
  d = 1;
if (d == 1)
≡
  
```

Inputs are given sequentially 0, 1, 2,

Ans - Let initial assumption of  $b_1$  &  $b_2 = NT$

| $d == ?$ | $b_1$ prediction | $b_1$ action | New $b_1$ prediction | $b_2$ prediction | $b_2$ action | New $b_2$ prediction |
|----------|------------------|--------------|----------------------|------------------|--------------|----------------------|
| 0        | NT               | NT           | NT                   | NT               | NT           | NT                   |
| 1        | NT               | T*           | T                    | NT               | NT           | NT                   |
| 2        | T                | T*           | T                    | NT               | T*           | T                    |

So, No. of correct predictions = 4

No. of miss prediction = 2

Code Snippet

```

≡
if (d == 0)
  d = 1;
if (d == 1)
≡
  
```

Inputs are given sequentially 2, 2, 0, 0

Ans - Assume  $b_1$  and  $b_2$  as NT

for making all predictions as true predicted answers. The true false for them will be predicted answer will be reflected in the output as well.

| $d == ?$ | b <sub>1</sub> prediction | b <sub>1</sub> action | New b <sub>1</sub> prediction | b <sub>2</sub> prediction | b <sub>2</sub> action | New b <sub>2</sub> prediction |
|----------|---------------------------|-----------------------|-------------------------------|---------------------------|-----------------------|-------------------------------|
| 2        | NT                        | T                     | T                             | NT                        | T                     | T                             |
| 2        | T                         | T                     | T                             | T                         | T                     | T                             |
| 0        | T                         | NT                    | NT                            | T                         | NT                    | NT                            |
| 0        | NT                        | NT                    | NT                            | NT                        | NT                    | NT                            |

No. of correct prediction = 4

No. of incorrect prediction = 4

Q) Inputs given sequentially 2, 0, 2, 0.

Ans - Assume b<sub>1</sub> and b<sub>2</sub> as NT

| $d == ?$ | b <sub>1</sub> prediction | b <sub>1</sub> action | New b <sub>1</sub> prediction | b <sub>2</sub> prediction | b <sub>2</sub> action | New b <sub>2</sub> prediction |
|----------|---------------------------|-----------------------|-------------------------------|---------------------------|-----------------------|-------------------------------|
| 2        | NT                        | T                     | T                             | NT                        | T                     | T                             |
| 0        | T                         | NT                    | NT                            | T                         | NT                    | NT                            |
| 2        | NT                        | T                     | T                             | NT                        | T                     | T                             |
| 0        | T                         | NT                    | NT                            | T                         | NT                    | NT                            |

No. of correct prediction = 0

The execution history used in predicting the outcome of a given branch instruction is the result of most recent execution of that instruction.

→ The processor assumes that the next time the instruction is to be executed, the result is likely to be same.

→ Better performance can be achieved by keeping more information about execution history.

Two-bit Prediction :-

How the bits are called correlated bits.



State transition diagram for 2-bit predictions.

Q) code snippet :-

Following inputs are given sequentially :-

2, 0, 2, 0

Find the no. of correct prediction and no. of miss prediction using 2-bit branch prediction techniques.

```

≡
if (d == 0)
    d = 1;
if (d == 1)
    ≡
  
```



| $d = ?$ | $b_1$ prediction | $b_2$ Action | New $b_1$ prediction | $b_2$ prediction | $b_2$ Action | New $b_2$ prediction |
|---------|------------------|--------------|----------------------|------------------|--------------|----------------------|
| 2       | NT / NT          | T            | I / NT               | NT / NT          | T            | NT / I               |
| 0       | T / NT           | NT           | T / NT               | NT / T           | NT           | NT / T               |
| 2       | T / NT           | T            | I / NT               | NT / T           | T            | NT / T               |
| 0       | T / NT           | NT           | T / NT               | NT / T           | NT           | NT / T               |

Note :-

- 1) If the previous prediction is NT, then the 1<sup>st</sup> bit of the correct prediction is considered.
- 2) If the previous prediction is I, then the 2<sup>nd</sup> bit of the correct prediction is considered.

No. of correct predictions = 6

No. of miss predictions = 2

a) Input 0, 1, 2, 0, 1, 2 → target address

Ans - No. of correct prediction = 8 as stages parallel  
No. of miss prediction = 9 → all stages

initiating time for one branch starting from 1st stage till completion of next branch. Total time = 9 stages

| $d = ?$ | $b_1$ prediction | $b_2$ action | New $b_1$ prediction | $b_2$ prediction | $b_2$ action | New $b_2$ action |
|---------|------------------|--------------|----------------------|------------------|--------------|------------------|
| 0       | NT / NT          | NT           | NT / NT              | NT / NT          | NT           | NT / NT          |
| 1       | NT / NT          | T            | T / NT               | NT / NT          | NT           | NT / NT          |
| 2       | T / NT           | T            | T / NT               | NT / NT          | T            | NT / T           |
| 0       | T / NT           | NT           | T / NT               | NT / T           | NT           | NT / T           |
| 1       | T / NT           | T            | T / NT               | NT / T           | NT           | NT / T           |
| 2       | T / NT           | T            | T / T                | NT / T           | T            | NT / T           |

### Branch Target Buffer (BTB)

5/4/2024

# A branch prediction cache that stores the predicted address for the next instruction after a branch is called Branch Target Buffer or Branch target cache.

- It predicts the next instruction address and send it out before decoding the instruction.
- If the PC of the fetched instruction = PC in the predicted buffer then this corresponding predicted PC is used on next PC.
- We only need to store predicted taken branch instruction.





| Instructions in Buffer | Prediction | Actual Branch | Penalty cycle |
|------------------------|------------|---------------|---------------|
| Yes                    | Taken      | Taken         | 0             |
| Yes                    | Taken      | Not Taken     | 2             |
| No                     |            | Taken         | 2             |
| No                     |            | Not Taken     | 0             |

### Loop Unrolling

It is a technique to reduce the branch penalty.

Eg -  $x : i = 1$   
 $\text{if } (i \leq 100) \{$   
 $A[i] = B[i] + C[i]$   
 $i++;$   
 $\text{goto } x\}$

continuous branch inst<sup>n</sup> occurs

```
for (i=1; i<=100; i++)
  A[i] = B[i] + C[i];
```

```
for (i=1; i<=100; i++)
  {
```

 $A[i] = B[i] + C[i];$ 
 $A[i+1] = B[i+1] + C[i+1];$ 
 $A[i+2] = B[i+2] + C[i+2];$ 
 $A[i+3] = B[i+3] + C[i+3];$ 

}

→ It transforms the loop with  $N$ -iterations into a loop with  $N/M$  iteration where each of the iteration in the new loop does the work of  $M$  iteration of the old loop.

- It increases the number of ~~different~~ instruction between branches giving the compiler and hardware more opportunity to find the instruction level parallelism.
- There will be reduction in the loop overhead.

### Super-Scalar Operation



| cl             | 1              | 2              | 3              | 4              | 5              | 6              | 7              | 8 | 9 |
|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|---|---|
| I <sub>1</sub> | F <sub>1</sub> | D <sub>1</sub> | E <sub>1</sub> | M <sub>1</sub> | W <sub>1</sub> |                |                |   |   |
| I <sub>2</sub> | F <sub>1</sub> | D <sub>1</sub> | E <sub>1</sub> | E <sub>1</sub> | M <sub>2</sub> | W <sub>1</sub> |                |   |   |
| I <sub>3</sub> | F <sub>2</sub> | D <sub>2</sub> | F <sub>2</sub> | M <sub>2</sub> | W <sub>2</sub> |                |                |   |   |
| I <sub>4</sub> | F <sub>2</sub> | D <sub>2</sub> | E <sub>2</sub> | E <sub>2</sub> | E <sub>2</sub> | M <sub>2</sub> | W <sub>2</sub> |   |   |

| cl             | 1              | 2              | 3              | 4              | 5              | 6 | 7 | 8              | 9 |
|----------------|----------------|----------------|----------------|----------------|----------------|---|---|----------------|---|
| I <sub>1</sub> | F <sub>1</sub> | D <sub>1</sub> | E <sub>1</sub> | M <sub>1</sub> | W <sub>1</sub> | - | - | W <sub>1</sub> |   |
| I <sub>2</sub> |                |                |                |                |                |   |   |                |   |
| I <sub>3</sub> |                |                |                |                |                |   |   |                |   |
| I <sub>4</sub> |                |                |                |                |                |   |   |                |   |

- An instruction may cause exception.
- Let I<sub>2</sub> depends on I<sub>1</sub>, if an exception occurs in I<sub>2</sub>, then I<sub>1</sub> will be inconsistent state.
- However one or more succeeding instruction have been expected. executed to completion.

If such a situation is permitted, the processor is set to have imprecise exception.

If an exception occurs during an instruction, all subsequent instruction that have been partially executed are discarded. This is called precise exception.

The instruction may complete execution out of order but they must in the Program order.

### Memory Systems

Design Objectives :-

- Speed ( $\uparrow$ )  $\rightarrow$  Cache memory
- Size ( $\uparrow$ )  $\rightarrow$  Virtual memory tech.
- Cost ( $\downarrow$ )

### Memory Hierarchy

Cache Memory  $\rightarrow$  Cache Memory  $\rightarrow$  Static RAM  
Main Memory  $\rightarrow$  Dynamic RAM





If the data is not found in  $L_k$  — it is not present on  $L_{k-1}, L_{k-2}, \dots, L_1$

If the data is found in  $L_k$  — it is also found in  $L_{k+1}, L_{k+2}, \dots, L_2$

### Cache Memory



Processor issues a Read request, a block of words is transferred from the main memory to the cache, one word at a time.

Subsequent references to the data in this block of words are found in the cache.

At any given time, only some blocks of the main memory are held in the cache which blocks in the main memory are in the cache are determined by mapping function.

When the cache is full, and a block of words needs to be transferred from the main memory, some block of words in the cache must be replaced by. This is determined by a replacement algorithm.

The existence of a cache is transparent to the processor. The processor issues Read and Write request on the same manner.



Read hit : The data is obtained from the cache.

Write hit : The data is stored in the cache.

Cache has a replica of contents of main memory.

Contents of the cache and the main memory may be updated simultaneously called Write through Protocol

- Update the contents of the cache and mark it as updated by setting a bit known as dirty bit or modified bit.
- The contents of the main memory are updated when the block is replaced. This is write back or copy back protocol.

### Read Miss :-

- Block of words containing the requested word is transferred from the memory.
- After the block is transferred, the desired word is forwarded to the processor.
- The desired word may also be forwarded to the processor as soon as it is transferred without waiting for the entire block to be transferred. This is called early restart or load through.

### Write Miss :-

- Write through protocol is used, then the contents of the main memory are updated directly.
- If Write-back protocol is used, the block containing the addressed word is transferred to the main memory.

### Cache Coherence

- # If the data is present in main memory and the cache memory, then the data is coherent.
- If the data is present in both main memory and cache, and both processor and cache update the data, then the data is inconsistent.

### Mapping Functions :-

Determine how many memory blocks are placed in the cache.

Three mapping functions :-

(i) Direct mapping

(ii) Associative mapping

(iii) Set - associative mapping

$$\text{Block size of main memory} = \text{Block size of cache}$$

|              | C/M                       | M/M                   |
|--------------|---------------------------|-----------------------|
| Size         | $2^7 \times 2^4 = 2^{11}$ | $2^{16}$              |
| No. of Block | $2^7$                     | $2^{16}/2^4 = 2^{12}$ |
| Block size   | $2^4$                     | $2^4$                 |

(i) Direct Mapping :  $j^{\text{th}}$  block of main memory is mapped into  $j \bmod (\text{no. of blocks in cache memory})$ .

| Eg - | M/M | C/M | C/M | M/M                                     |
|------|-----|-----|-----|-----------------------------------------|
|      | 0   | 0   | 0   | $0 \rightarrow 0, 128, 256, 512, \dots$ |
|      | 1   | 1   | 1   | $1 \rightarrow 1, 129, 257, \dots$      |
|      | 2   | 2   | 2   | $2 \rightarrow 2, 130, 258, \dots$      |
|      | :   | :   | :   |                                         |
|      | 127 | 127 | 127 |                                         |
|      | 128 | 0   | 0   |                                         |
|      | 129 | 1   | 1   |                                         |
|      | :   | :   | :   |                                         |
|      | 256 | 0   | 0   |                                         |
|      | 257 | 1   | 1   |                                         |
|      | :   | :   | :   |                                         |





$$\text{Tag} = \frac{\text{No. of blocks in main memory}}{\text{No. of blocks in cache memory}}$$

Drawbacks :— ~~Sharing~~ One block of cache memory is multiple blocks of the main memory.

Contention Problem

(ii) Associative Mapping : Any block of the main memory is mapped to any block of the cache memory.



Tag Word → No. of words in a block : ~~no. of blocks~~  
→ No. of blocks in Main Memory

Drawback :— Searching Problem.

(iii) 2-way set associative mapping :



$$\text{Set} = \frac{\text{No. of blocks in cache memory}}{\text{set size}}$$

Hit rate :-

$$\text{Hit rate} = \frac{\text{No. of hits}}{\text{Total no. of attempts}}$$

Either read hit or write hit.

$$\text{Miss rate} = \frac{\text{No. of miss}}{\text{Total no. of attempts}}$$

$$\text{Hit rate} + \text{Miss rate} = 1$$

Miss penalty : Extra time required to bring the required block from the main memory.

For single level cache :-

$$\text{Avg Access time} = hC + (1-h)M$$

where  $h \rightarrow$  hit rate

$C \rightarrow$  cache memory access time

$M \rightarrow$  Miss penalty

↳ Hit rate can be improved by increasing block size, while keeping cache size constant.

↳ Block sizes that are neither very small nor very large give best results.

↳ Miss penalty can be reduced if load-through approach is used to when loading new blocks into cache.

For two-level cache :-

Avg. access Time,  $T_{av}$  :-

$$T_{av} = h_1 c_1 + (1-h_1) h_2 c_2 + (1-h_1) (1-h_2) M$$

where,  $h_1 \rightarrow$  hit rate in L1 cache  
 $h_2 \rightarrow$  hit rate in L2 cache

$c_1 \rightarrow$  L1 cache memory access time

$c_2 \rightarrow$  L2 cache memory access time

$M \rightarrow$  Miss penalty.

How to reduce the Miss Penalty :-

- 1) Using Multi-level Cache
- 2) Selecting correct block size
- 3) Critical word forest
- 4) Early restart
- 5) Giving priority to Read Miss over write miss
- 6) Victim cache.

### Virtual Memory

- The physical main memory is not as large as the address space spanned by an issued by the processor.
- When a program doesn't fit into the main memory, the parts of it not currently being executed are stored in secondary storage device.

- The technique that automatically move the data blocks into physical main memory when they are required for execution is called virtual memory techniques.
- The processor references an instruction and data space that is independent of available physical memory.
- The address generated by the processor is known as virtual address which is converted to physical address through memory management unit.



### Address Translation :-

- The program is divided into fixed sized blocks called pages and the main memory is divided into fixed size blocks called frames.

The page size = Frame size



Scanned with OKEN Scanner

- ↳ No. of pages may not be equal to no. of frames.
- ↳ The address generated by the processor has 2 parts
  - ↳ page number
  - ↳ offset

- ↳ There is a table called page map table which keeps track of the corresponding frame number to a corresponding page.
- ↳ There are 2 entries in PMT:
  - ↳ Frame no.
  - ↳ Control bits

| Frame No. | Control Bits |
|-----------|--------------|
| :         |              |
| :         |              |
| :         |              |
| :         |              |



- ↳ The starting address of the PMT is stored in a register called PTBR (Page Table Base Register).
- ↳ The page no. is added with the content of PTBR to get the corresponding entry in the PMT.



Control Bits: These are used to give restriction or permission bits to different pages like Read, Write, Execute, Modify, etc.

Eg - Read, Write, Execute, Modify, etc.

### TLB (Translation Look-Ahead Buffer)

- ↳ Ideally the PMT should be kept in MMU which is a part of CPU.
- ↳ However CPU has limited no. of registers for which it is stored in main memory.

However a part of the PMT which is very frequently being used can be stored in a buffer called TLB.



Page Fault: If the requested page is not found in the main memory, a page fault is said to occur.

Demand Paging : A page will not be loaded to the main memory unless and until it is being demanded.

Pref-Paging : The pages will be loaded to the main memory even if it is not being demanded.

Thrashing : A high page fault may result in thrashing. The processor will be busy in swap-in and swap-out operation rather than actual execution.



# MODULE - V

I/O



## I/O Interface

It provides a method for transforming operation between internal storage and ~~external~~ external I/O devices



Communication of I/O process to I/P-O/P devices

The two most frequently used interfacing techniques are  
- using of all data of device to be transferred through bus

## I/O interface for an Input device :-



### Need of interface :-

- ↳ Peripherals are electromechanical or electromagnetic devices and their manner of operation is different from the operation of CPU and memory, which are electronic devices. Therefore conversion of signal values are important.
- ↳ The data transfer rate of peripherals is usually slower than the data transfer rate of CPU and consequently a synchronization mechanism may be needed.
- ↳ Data codes and formats in peripherals differ from the word format from CPU and memory.
- ↳ The operating methods of peripherals are different from each other and each must be controlled so as not to disturb the operation of other peripherals connected to the CPU.

of other peripherals connected to the CPU.

Address Decoder : Enables the device to recognise its address when this address appears in the address bus.

Data & Status Register : The data register holds the data being transferred to or from the processor.

↳ Status register contains the information relevant to the operation of I/O devices.

↳ Both data and status register are connected to the data bus and assigned with unique addresses.

Control Circuits : The control signals from the processor can be communicated to the I/O device through the control circuits.

Interface Circuits : Address decoder, data & status register, control circuits are required to coordinate that constitutes the I/O interface.

### Interface Modes :-

(i) Memory Mapped I/O

(ii) I/O Mapped I/O

(iii) Memory Mapped I/O

When the I/O devices and memory share the same address space, the arrangement memory mapped I/O.

Here any machine instruction that can access memory can be used to transfer data from I/O devices.

### (iii) I/O mapped I/O

- ↳ Same processes have special input and output instructions to perform I/O transfer. (Eg - Intel).
- ↳ Special I/O address space.
- ↳ Deals with fewer address lines.

### Data Transfer

- 3 steps :-
- Program controlled I/O
  - Interrupt Driven I/O
  - DMA (Direct Memory Access)

### (i) Program Controlled I/O



- ↳ In this scheme, the I/O operations are completely controlled by the CPU.
- ↳ CPU executes the programs that directly initiates and terminates I/O operation.
- ↳ It requires little special I/O hardware but it is quite time consuming for CPU since CPU has to wait for slower I/O operation to complete.

↳ The processor repeatedly checks the status of the flag to achieve the required synchronization between processor and any I/O devices.

↳ The program I/O mode is called "Busy Waiting Mode".



### (ii) Interrupt Driven I/O

Interrupt : Hardware signal which is when raised, the processor stops its normal execution to address the interrupt.



→ The CPU issues I/O commands to the I/O module and starts doing other work which may be execution of some other programs.

→ When the I/O operation completes, I/O module interrupts the CPU by informing to CPU that I/O has finished, the CPU then may proceed execution of the program.

### DMA (Direct Memory Access)

→ A special control unit may be provided to allow transfer of a block of data directly between external device and main memory without continuous intervention of the processor. This approach is called DMA.

→ DMA transfers are performed by a control circuit that is a part of I/O device interface called DMA controller.

→ DMA controller performs the function that would be normally carried out by the processor when accessing the main memory.

→ DMA operation must be under the control of the program executed by the processor.

→ Processor sends the starting address, no. of words in the block and the direction of transfer at the time of initiation.

→ When the entire block is transferred, the controller informs the processor by raising an interrupt signal.



CPU issues a command to DMA for work

CPU keeps on doing other work

Operation of DMA controller

CPU reads the status of DMA module

Interrupt

CPU performs the next instruction

After completion of transfer, DMA controller sends interrupt to CPU.

Processor reads interrupt and sends interrupt acknowledgement to DMA controller.

Processor sends interrupt acknowledgement to Main Memory.

System ports of memory controller DMA controller.

DMA controller → Processor → Main Memory

Disk → DMA controller → Processor → Main Memory

Disk → DMA controller → Processor → Main Memory

Keyboard → DMA controller → Processor → Main Memory

Printer → DMA controller → Processor → Main Memory

H/W interface → DMA controller → Processor → Main Memory

Among different devices of DMA, highest priority given to high speed peripherals.

Since the processor originates most memory cycles, the DMA controller can be set to steal memory cycles from the processor.

This interleaved interleaving technique is called Cycle Stealing.

DMA controller may be given exclusive access to the memory to transfer a block of data without interruption known as block / burst mode.

Two registers are used for storing the starting address and the word count.

Third register is used for storing the status and control part.

Eg- R/W used for direction for transfer  
I/E bit interrupt enabled or not

IRQ → used for interrupt request  
→ used whether the transfer completes or not



## Interconnection Networks

It is a backbone network which facilitates efficient data transmission between processing element or between processing elements and memory modules.

Complex collection of switches and links which is used for efficient data transmission.



There are several different physical form available for the ICN :-

### (i) Time Shared Common Bus :

- The simplex ICN for multiple processor is a common communication path connecting all of the FUs.
- The common path is called time shared or common bus.
- Only one processor can communicate with memory or another processor at a given time.
- A command is issued to inform the destination unit about what operation is to be performed.



store we have a no. of local buses each connected to its own local memory and to one or more processors.

A system bus controller links each local bus to a common system bus.

The memory connected to system bus is shared by all processors.

Drawbacks :-

- Only one communication at a time
- Problem in system bus may collapse entire communication

Advantages -

- Low cost
- Attaching and detaching system bus is easy

### (ii) Multi-port Memory

- It employs separate buses for each memory module and each CPUs.



↳ Each processor bus is connected to each memory module

↳ The processor bus consist of



required to communicate with memory.

↳ Memory Access conflicts are resolved by assigning fixed priorities to each memory port.

Advantages — High transfer rate that can be achieved due to multiple paths b/w processor and memory

Disadvantages — Requires expensive memory control logic and large no. of cables and connectors.

### (iii) Crossbar Switch

- Consists of a no. of crosspoints that are placed at intersections between processor buses and memory module paths.



↳ Each crosspoint is a switch that determines a path from processor to MM.

↳ Each switch has a control logic to set up the transfer path between processor and memory.

↳ Priority levels are established by the arbitration logic to select one CPU when two or more CPUs attempt to access the same memory.

↳ The hardware required to implement the switch can become large and complex.

### Multistep Switching Network

↳ Basic component is 2 input 2 output interchange switch.



→ 2 processors  $P_1$  and  $P_2$  are connected through switches to 8 memory modules marked

→ To connect  $P_1$  to memory 101 it is necessary to form path from  $P_1$  to output 1 in the 1<sup>st</sup> level switch, 0 in 2<sup>nd</sup> level and 1 in 3<sup>rd</sup> level switch

→ So, either  $P_1$  or  $P_2$  can be connected to any of the 8 memory modules.

Based on Arrangement of Link :-

(i) Static ICN : If the links between the nodes are fixed, it is called static ICN.

Eg - 1-D → Linear arrangement



(ii) Dynamic ICN : If the links between the nodes are established dynamically, it is known as dynamic ICN.

a) single stage

b) multi stage

a) Single stage ICN :-



IS (Input Selector) :  $1 \rightarrow D$  Multiplexers

OS (Output selector) :  $D \rightarrow 1$  Multiplexers

b) Multi stage ICN :-



If no. of Nodes =  $N$  usually perfect square

No. of Nodes =  $\log_2 N$  in a stage

Eg - No. of nodes = 16  
No. of nodes in =

$$= \log_2 16$$

Eg - No. of nodes,  $N = 8$

$$\log_2 N = 3 \text{ --- No. of stages}$$

$$\frac{N}{2} = 4 \text{ --- No. of nodes in each stage}$$



### 2x2 switches



### Inter PE Communication

→ The network design for inter PE communication is based on :-

- (i) Operation Mode
- (ii) Control Strategies
- (iii) Switching Methodology
- (iv) N/W Topology

### Operation Mode :-

Synchronous : Communication is needed for establishing communication path synchronously either for data instruction broadcast or for establishing connection.

Asynchronous : Communication is needed for multiplexing where the connection request are issued dynamically.

Combined : A system may also be designed to facilitate synchronously or asynchronously.

### (ii) Control Strategies :-

Centralised Control

Distributed Control — Used by most ICN

### Switching Methodologies :-

Circuit switching : Physical path is established between source and destination and used for bulk transfer.

Packet switching : Data is put in a packet and routed through the ICN without establishing a physical path and used for short data transfer.

Integrated : Includes the capability of circuit and packet switching

### Network Topology :-

\* Network can be depicted in a graph where :-

Nodes → switching points  
Edges → communication links

→ Static  
→ Dynamic

The Space = {operation} × {switching} × {N/w Topologies}

for ICN = {Mode} × {Methodologies} × {Topologies}