

# Computer Org. & Arch.

## Basics

Architecture :

- CPU

- I/O

- Memory

CPU design

→ Addressing modes

→ Inst<sup>n</sup>

→ Data format



Organization : How the components are actually connected / implemented.

von Neumann Arch : - Stored Prog Arch.

Disadv {  
 - len through input  
 - single mem for inst<sup>n</sup>/data



Harvard Arch :

- stored prog Arch.
- diff mem for inst and data.

Disadv {  
 - At a time inst<sup>n</sup> and data  
 doesn't req. simultaneously  
 costly



Components of computer

- ▷ CPU
- ▷ I/O
- ▷ Mem



Mem Cycle time :



$$MCT = \text{mem. Access Time} + \text{Extra latency}$$

→ This is the time elapsed b/w the CPU got the request and tell it ready to accept the new signal.

CPU registers :

i) General purpose registers → used to store the general contents

ii) Special purpose registers :

- |      |                  |
|------|------------------|
| ▷ AC | ▷ PSW            |
| ▷ PC | ▷ AR / MAR       |
| ▷ IR | ▷ DR / MDR / MBR |
| ▷ SP |                  |

\* Accumulator (AC) / ACC :- Used to store result of ALU and sometimes one of the operand for ALU also.



\* Types of CPU arch., Based on ALU inputs :-

① AC based Arch.



$$\text{Inst}^n = \text{operand} + \text{operation}$$

② Reg.-based Arch.



$$\text{Inst}^n = \text{Operation} + 2 \text{ operands}$$

③ Reg.-Mem based Arch.



$$\text{Inst}^n = \text{operation} + 1 \text{ operand} + (\text{other operand is fixed})$$

④ Complex system Arch.



$$\text{Inst}^n = \text{operation} + \text{min 2 operand.}$$

⑤ Stack Based Arch.



$$\text{Inst}^n = \text{only operation} \\ (\text{operands taken from bit \& content of stack})$$

\* Program Counter (PC) :- Stores address of next inst<sup>n</sup> to be executed.

\* Instruction Register (IR) :- Stores the current instruction to be executed.

\* Stack Pointer (SP) :- Stores the address of TOP (top of the stack)

\* Flag registers or Status Reg. :- (PSW - Program Status Words)



Some CPU → AC + flag reg  
Some CPU → flag reg

\* Address Reg or MAR :- Used to send address to memory

\* Data Reg or MDR or MBR :- Used to send data to memory and receive data from memory



## \* Memory Read :

- 1) CPU sends add. to mem via add bus
- 2) CPU will enable read mem control signal.
- 3) Memory goes to specified address and reads the content and send it to CPU via data bus.

## \* Memory Write :

- 1) CPU sends add. using add. bus to memory.
- 2) CPU sends ~~data~~ using data bus to memory.
- 3) CPU sends write control signal as enable to memory
- 4) Memory goes to specified address & enters the received data/content on that cell.

Type of Arch (Based on size of F/P in ALU)



Memory type

Byte Addressable (By default)

$$W = 1B$$



1 Nibble = 4 bits

Word Addressable

$$W = 2B$$

$$= 4B$$



\* Micro-operations: The operations executed on values stored in registers.

Symbolic notation to describe the micro-ops: Register Transfer Lang. (RTL)

\* Min time needed to perform 1 micro-op  $\rightarrow$  1 CPU cycle time.

1. Register Transfer :  $R_2 \leftarrow R_1$  or  $R_1 \leftarrow R_2$

2. Comma :  $R_2 \leftarrow R_1, PC \leftarrow PC+1$

3. Memory Transfer      Memory Read

4. Logical op:

- AND
- OR
- X-OR
- X-NOR

CPU  $\leftarrow$  Memory  
 $DR \leftarrow M[\text{address}]$   
 $M[1000]$   
 $M[\text{AR}]$

Memory write

$M[\text{address}] \leftarrow DR$   
 or  
 $M[1000] \leftarrow DR$   
 or  
 $M[\text{AR}] \leftarrow DR$

5. Shift op:

\* Logical Shift      Left      1011  $\rightarrow$  0110  
Right      1011  $\rightarrow$  0101

\* Circular Shift      Left      1011  $\rightarrow$  0111  
Right      1011  $\rightarrow$  1101

\* Arithmetic Shift after operation sign should remain same

Left - Same as logical shift but CPU performs it when the sign doesn't change.  
Right - 11010  $\rightarrow$  11101

\* Register Spill : If no of registers are not enough to keep intermediate operands, then one or some values can be moved to memory for temporary vars.



Instruction : Group of bits which instructs computer to perform same operation.

[opcode | operand info]

\* Instruction set Arch : Defines the collection of distinct insts., ISA = 2<sup>10<sup>possible bits</sup></sup>

Type of Inst based on Operand Info

1) 4-address Inst

[opcode | Add1 | Add2 | Add3 | Add4]  
| { } { } { } { } |  
| operands      Add of next    |  
|                inst            |

- used in comp. w/o pc

- disadvantage

- inst size larger

- inst fetch takes more time

- Relation of prog is costly

2) 3-address Inst : max<sup>n</sup> 3 add. can be specified in an inst

[opcode | add1 | add2 | add3]  
| { } { } { } |  
| operands      |

- used in modern comp.

3) 2-address Inst : max<sup>n</sup> 2 add. can be specified in an inst

[opcode | add1 | add2]

- one of the operand is used as both src and dest  
- more than inst for a prog. compared to 3-add inst

4) 1-address Inst

[opcode | add]

- AC is used as 2nd operand implicitly.

5) 0-address Inst

[opcode]

No any add. is specified within an inst

- both operand taken from stack

- implemented on stack-based Arch.

Instruction

fixed length inst

opcode variable length

variable length inst

fixed opcode length

fixed length instruction

- start with that inst<sup>n</sup> which has min<sup>n</sup> opcode bits



CPU check if opcode 10 has been used for 2-add inst<sup>n</sup> or not

$$\text{total} = 2^2 = 4$$

$$\text{used op code} = 3$$

1

used

not used

Take it as 2 add inst<sup>n</sup>

\* Only those inst<sup>n</sup> which are having starting 2 bit 11.

\* max =  $(1) \times 2^2 = 4$  instr possible

$$\text{used op code} \times 2^{n-m}$$

$n = \text{opcode size in L-add}$

$m = \text{opcode size in 2-add.}$

- min will always 1. Because if 0 then that inst<sup>n</sup> not supported.

variable length instr

Consider 3 types of inst<sup>n</sup> in system.

1. Register operand inst<sup>n</sup>: One opcode and 2 registers.

2. Memory operand inst<sup>n</sup>: One opcode, 1 register and 1 memory address

3. Immediate Operand inst<sup>n</sup>: One opcode, 1 register and 1 immediate operand.

Number of registers =  $6 \rightarrow 6$

Number of bits in immediate operand = 10 bits

Memory size = 512 Mbytes (by 2 addressable)  $\rightarrow 2^9 \times 2^{20} = 2^{29}$

Total inst<sup>n</sup>: ① Reg Operand type: 10

② Memory Operand type: 12

③ Immediate Operand type: 4

$$\left\{ 26 \right\} = 2^5$$

opcode = 5 bits

Maximum and Minimum inst<sup>n</sup> length are? Ans: 40, 17



Effective Address - Address of operand in computation-type inst<sup>n</sup>  
(ADD, SUB, ...)

(or)

The target address in a branch-type inst<sup>n</sup>.

Branch type inst<sup>n</sup>

Unconditional

ex: goto

always jumping

always change PC value

by target address.

Conditional

ex: JNZ, JZ, BNZ

Jump only when cond<sup>n</sup> is true

Condition is true  
(Branch is taken)

PC updated by  
target address

Cond<sup>n</sup> false.

(Branch is not  
taken)

No change in PC  
value

\* Instruction Cycle - 6 phases to execute one inst<sup>n</sup>.



→ PC increments during IF and it incremented by the size of inst<sup>n</sup> which is fetched.

Inst<sup>n</sup> Cycle

Fetch cycle  
(IF)

Execution Cycle  
(ID, EAC, OF, EX, WB)

- Execution Cycle - 5 phases
- Execution phase - 1 phase
- Inst<sup>n</sup> Execution - all 6 phases

## Addressing Modes

It specifies how to use the address field of  $\text{inst}^*$  to get the operand.



\* we use addressing modes to calculate eff. add.

- 1) Implied mode  $\Rightarrow$  Opcode definition itself specifies the operand.  
 - No any explicit address specified for operand.



- 2) Immediate Mode  $\Rightarrow$  Add field of  $\text{inst}^*$  specifies the operand value.  
 - Used to initialize reg with const values.



- 3) Direct mode  $\Rightarrow$  / Absolute Mode  $\Rightarrow$  Add fields of  $\text{inst}^*$  specifies effective address.  
 - Easy to implement for mem operands.



- 4) Indirect mode  $\Rightarrow$  The add field contains the add of eff. add.



- 2 mem access to get the operand
- Used to implement pointers

- 5) Register Mode  $\Rightarrow$  Add field of the  $\text{inst}^*$  specifies a reg which contain the operand.



- 6) Register Indirect Mode  $\Rightarrow$



- we do shorten the inst length.

## → Auto inc/dec mode & (Auto indirect)

- Variant of reg indirect mode, but content of reg (off · add) is automatically incremented/decremented to access a table of content (array) sequentially.

## → Index mode / Index Register Mode &

### 2 implementation

↓  
CPU has special purpose index reg.



$$\boxed{\text{index reg}} \quad EA = m[\text{add}] + \text{content of index reg}$$

↓  
CPU doesn't have specific index reg.



$$EA = m[\text{add}] + \text{content of the index of reg field.}$$

- If prog. data is relocated, the new base add has to be updated in inst

## ⑥ Base, Index Reg mode & - Keep both index, base in reg only

$$- EA = \text{base reg value} + \text{index reg value.}$$

## ⑦ Scaled Mode &



- Advance version of index mode

$$- EA = \text{base} + \text{index} * w$$

## ⑧ PC relative Mode & Particularly used for branch inst



Relative location will be maintained or

$$EA = \text{PC value} + \text{add field value}$$

- intrasegment jumping

new value if forward jumping

offset

↓

new value if backward jumping

## ⑨ Base - Reg mode &



Barn add of target segment

⊕

EA

- used to jumping but base index is used  
to access ~~array~~ elements

- If relation happens ~~no~~ need to change the inst

13) Base, Index Register Displacement mode

- It is used to access array of structures



To reach specific member of the structure

$$EA = \underbrace{BA + index}_{\text{To reach specific array elements}} + \text{displacement}$$

NOTE: ⑪ & ⑫ are used for jumping / branching  
others are used for data.

14) Absolute Addressing mode

- The address of the operand is inside the inst.

NOTE:

The effective address of the following inst<sup>n</sup> in MUL 5(R1, R2)

$$\hookrightarrow 5 + [R1] + [R2]$$

The addressing mode used is base with offset and index.

## CPU organization

### CPU - Basics

- \* CPU cycle time  $\leftarrow$  Amount of time req. by CPU to perform a small operation (micro-operation).
- \* CPI  $\leftarrow$  # CPU cycles req. to execute an instruction.  $= \frac{\text{# CPU cycles to execute one Inst.}}{\text{one Inst.}}$
- \* CPU clock rate  $\leftarrow \frac{1}{\text{CPU cycle time}}$
- \* Execution time  $\leftarrow$ 
  - 1 Inst. execution time =  $\text{CPI}_{\text{avg}} * \text{cycle time}$
  - n Inst. execution time =  $n * \text{CPI}_{\text{avg}} * \text{cycle time}$
  - n Inst. execution time =  $\frac{n * \text{CPI}_{\text{avg}}}{\text{clock rate}}$
$$\text{CPI}_{\text{avg}} = \frac{\sum_{i=1}^K \text{CPI}_i * n_i}{\sum_{i=1}^K n_i}$$
- \* MIPS  $\leftarrow$  (Million Inst. per sec)
  - if  $n$  inst. executed in  $t$  time,  $\text{MIPS} = \frac{n}{t * 10^6}$

$$\text{MIPS} = \frac{\text{clock rate}}{\text{CPI}_{\text{avg}} * 10^6}$$

- ① CPI may not be same for all inst. hence consider CPI<sub>avg</sub>.
- ② Disadv: MIPS doesn't consider the complexity or capability of the inst.  
 ↳ Because of this reason FLOPS is used.  
 ↳ floating point operations per sec.

### ALU $\leftarrow$



- ① 32 bit CPU, ALU, that takes 2, 32 bit inputs, operation can be done in single cycle
- ② 64 bit architecture is faster than 32 bit architecture.



\* there are in/out signal for each register in CPU.

⇒ Instruction Fetch → Bring an inst<sup>n</sup> from mem to IR in CPU.

1.  $AR \leftarrow PC$        $AR_{in}, PC_{out}$
2.  $DR \leftarrow M[AR]$        $AR_{out}, mem_{rd}, DR_{in}$
3.  $\underbrace{IR \leftarrow DR}_{DR_{out}, IR_{in}}, \underbrace{PC \leftarrow PC + 1}_{PC_{inc}, Incrementor_{inc}}$

\* Control Unit →

It generates control signals and sends those signals to respective unit.  
Then the unit performs its respective operation.

Control variable → The name of specific control signals

Control word → Collection of all control signals generated by Control unit

Control unit Org.



Control word

| eg. | IR     | AR     | D1     | D1     | MUX1       | MUX2       | PC         | ALU    | mem |
|-----|--------|--------|--------|--------|------------|------------|------------|--------|-----|
|     | In out | In out | In out | In out | Select 1in | Select 2in | In out Inc | opcode | R W |

1.  $AR \leftarrow PC$

↳ This is actual word that is responsible for micro op  $AR \leftarrow PC$

## Hardwired Control Unit

The control logic is implemented using the hardwares.

Adv: Faster mode of operation

Disadv: Control logic updation is diff.

CPU can execute 'n' types of instruction ( $I_1, I_2, \dots, I_n$ )

$I_1 \Rightarrow \mu\text{-operation } (1, 2, 3, \dots, 10)$   $\star$  for each  $\mu\text{-operation}$  a control word is generated (unique).

$I_2 \Rightarrow \dots$  (but diff seq. of  $\mu\text{-op.}$ ) ...



$$S_i = (\text{in terms of } I_1, I_2, I_3, \dots, I_n \text{ Regd})$$

$$\text{e.g. } S_1 = I_1 I_1 + I_2 I_3 + \dots$$

$$S_2 = I_1 (I_1 + I_2) + I_3.$$

## Microprogrammed Control Unit

$\star$  Control memory stores every possible control word.

Based on requirements, the corresponding control word is fetched and the signals are sent to respective unit.

$\star$  Control words will stored inside ctrl. mem. which is attached with ctrl. unit and it is inside CPU.

Adv: Easy updation

Disadv: Slower (+ more acc.)

## Control word Sequencing



$\star$  all  $\mu\text{-operation}$  for a instruction is stored in seq.  $\Rightarrow$  No need of addr. calculation  $\rightarrow$  [Faster]

$\star$  at each addr. in ctrl mem.

This is for the last control word of an instruction. That needs external input to know what will be the next add.



## Micro-instr<sup>n</sup> (Basic Format)



Q) A system supports 256 distinct "inst". Each "inst" requires a sequence of 16 micro-operations. Each micro-op<sup>n</sup> in control memory has format:



control mem size = ? (bit)



"list" inst  $\times$  micro op<sup>n</sup>

$$= 256 \times 16 \quad \text{Add size} = 12 \text{ bits}$$

$$\begin{aligned} \text{Total micro-} \\ \text{op}^n &= 2^8 \times 2^4 = 2^{12} \end{aligned}$$

$$\text{Control mem size} = 2^8 \times 152 \text{ bits}$$

$$= 608 \text{ Kbits}$$

## \* Types of Microprogrammed Control Unit

{Horizontal}

In control word, 1 bit stored for each control signals

{Vertical}

all signals are divided into multiple groups and each group info is stored in encoded form.

Each group such signals among such signals among those only one can be enabled at a time.

Speed:

Hardwire  $>$  horizontal  $>$  vertical

4 signals, 16 signals



Store in encoded form in just 2 or 4... bits

(default case)

Every time atleast 1 signal is active from each group  
e.g. 4 signals in a group



It might be possible that no any signal is active from a group (atmost 1) at a time

e.g. 4 signals in a group



no one is active

## Horizontal

- 1) One bit for each signal
- 2) No decoder required.
- 3) Control word in large
- 4) faster
- 5) Max no of signals enabled at a time  
= total no of signals.  
(Degree of parallelism is high)

## Vertical

- 1) signal stored in encoded form in groups.
- 2) decoder req.
- 3) Small - CW.
- 4) slower
- 5) Max no of signals enabled at a time  
= No of groups.

NOTE: If any signal cannot be kept in any group then such signals are stored in horizontal manner.

## RISC

(Reduced Instruction set computer)

- 1) Less Number of Inst<sup>n</sup> supported.
- 2) Fixed length inst<sup>n</sup>
- 3) Simple inst<sup>n</sup>
- 4) Simple and less number of addressing modes.
- 5) Easy to implement using hardware control unit.
- 6) One cycle per inst<sup>n</sup>
- 7) Register-to-register arithmetic operation only
- 8) More number of registers.

## CISC

(Complex Instruction set computer)

- 1) More number of Inst<sup>n</sup>.
- 2) Variable length inst<sup>n</sup>
- 3) Complex inst<sup>n</sup>.
- 4) Complex and more number of addressing modes.
- 5) Difficult to implement using hardware control unit.
- 6) More than one cycle per inst<sup>n</sup>
- 7) Registers to Memory and Memory to Reg.
- 8) Less number of registers.

→ In macro programmed ctrl unit the repetitive ctrl signals are not repeated, unique ctrl. signals are stored in another mem and in control mem we store the pointer that will take less time.

## I/O Organization

\* Peripherals : devices connected to processor internally.

- Mainly 3 types:

- Input
- Output
- Storage (persistent)

\* Need for interface:

- conversion of signal reqd.
- synchronization reqd.
- conversion of format reqd.
- peripheral shouldn't disturb operation of other.



\* I/O processor Interfacing + DMA controller  
+  
I/O inst<sup>n</sup> execution.

### Memory Mapped I/O

- 1> Memory warpage.
- 2> All memory access inst<sup>n</sup> used for I/O access also.
- 3> No separate add. space for I/O.
- 4> More inst<sup>n</sup> for I/O access.
- 5> More addressing modes for I/O access.
- 6> More I/O devices connected.



### I/O Mapped I/O.

- 1> No memory warpage.
- 2> I/O access and memory access inst<sup>n</sup> are different.
- 3> I/O have their own separate address space
- 4> Less inst<sup>n</sup> for I/O access.
- 5> <sup>less</sup> addressing modes for I/O Access.
- 6> Less I/O devices connected.



## Data Transfer

### Synchronous vs Asynchronous Transfer

- **Synchronous:** All components/parts working w/ same clock pulse.
  - \* i.e. it will work with the speed of the slowest.
- **Asynchronous:** Each device will have its own independent clock signals.
  - is synchronization has to be provided externally.
  - \* In synchronous : Bottleneck = SLOWEST device
- \* **Serial vs Parallel :**
  - Serial : Single common link b/w 2 devices & 1 bit at a time.
  - Parallel : B/w 2 devices, there are multiple lines.  
If there are K lines b/w them, then we can send K-bit at a time.

### \* Serial Asynchronous Data Transfer

\* Since method is asynchronous,  
we need to provide external  
synchronization.



### ⇒ Modes of Data Transfer

1. Programmed I/O.
2. Interrupt driven I/O.
3. Direct Memory Access.

Programmed I/O :-

- More importance given to I/O
- Wastage of CPU time for asking each and every device after certain time interval.
- If any device has its status set then CPU performs data transmission for it.

$$\text{Total time} = \text{Time to req to check the status of I/O} + \text{Data Tx time}$$

(Flag req sending time from I/O) + (Flag checking time in CPU) Depends on I/O speed.

negligible

2) Interrupt driven I/O: \* Starting Addr. of ISR is called Vector Address.

- ISR():
1. CPU receives interrupt.
  2. CPU completes execution of current inst.
  3. Return addr. (+ state) stored in CPU stack.
  4. Interrupt. Serviced.
  5. Using POP stack - Original Program restored.



Send interrupt along with addr. of ISR.

\* CPU only receives intr. but no ISR addr.

\* CPU executes default ISR, which helps to find the correct ISR.

### \* Types of Interrupts

(i) Markable : (Low Priority) { CPU can accept or reject } <sup>on situation</sup>

(ii) Non-Markable : (High Priority) { CPU always accepts }

(Based on source)

(iii) External : (H/w) devices

(iv) Internal : (S/w) during instruction execution.

### \* Interrupt overhead \*

- 1. Push the value into stack
  - 2. Resolve priority
  - 3. Branching to ISR.
- Everything includes before serving the interrupt.

### \* Priority Based interrupt handling



- Starvation may occur (Practically very rare)

(\*) give highest priority to the device which is electrically closer to CPU.

\* Here we can't change the priority of the devices.

$$\text{Total time} = \text{Interrupt overhead} + \text{data transfer}$$

Interrupt Cycle: If interrupt of a device is not accepted, then after how much time again the interrupt is generated.

- \* In the question, the device transfer speed represents the internal speed of the device not the tx speed before mem I/O or CPU.
  - Fraction of processor time consumed by I/O device =  $\frac{I/O \text{ time}}{\text{interrupt processing time}}$

### 3) Direct Memory Access (DMA):

- Tech enables data transfer b/w I/O and mem w/o CPU intervention.
- Need a H/W - DMAC.



④ the CPU can grant HLDA at any point of time (at any phase of instruction execution)

⇒ During DMA transfer, CPU can only perform those which do not require system Bus.

⑤ DMA is also special kind of processor.

$$\boxed{\text{Time} = \text{DMA initiation} + \text{Data Transfer.}}$$

\* Mode of Data Transfer: CPU can perform only those operations which don't require system bus, during DMA tx, which means

i) Entire amount of data at a time (BURST Mode)  $\rightarrow$  2 implementation possible

- ① Full data
- ② 16K at a time

ii) One block at a time (BLOCK Transfer mode)

iii) Cyclic Stealing.

- How much data is transferred in one time through DMA, before CPU takes back the control of the bus.

\* Cycle Stealing: Slow I/O device takes sometime to prepare and to transfer. During that time CPU keeps the control of the buses. When the data is ready CPU gives the control of the buses to DMA Controller for 1 word cycle. In that cycle, prepared word is transferred to memory, and then CPU takes back the control of the buses.

Assume,  
I/O takes time to prepare data =  $t_x$ .

Data transfer time to memory =  $t_y$



$$\% \text{ of time CPU is blocked} = \frac{t_y}{t_x + t_y} \times 100\% \quad [\text{Burst mode}]$$



$$\% \text{ of time CPU is blocked} = \frac{t_y}{t_x} \times 100\% \quad [\text{Cycle stealing mode}]$$

∴  $t_y$  time will overlap w/  $t_x$  time.  
i.e. when  $t_x$  is larger than  $t_y$ , then:  $\{ t_x > t_y \}$

\* Masterless DMA: CPU gives the control of the buses to DMA, only when it does not need the buses (it operates internally)  
(here 0% CPU blocked due to DMA)

## Memory Organization

### \* Memory & Memory hierarchy :



- \* Maximize memory access speed
- \* Minimize per bit storage cost
- \*  $S_i \rightarrow \text{size of memory } M_i$
- \*  $C_i \rightarrow \text{per bit cost of } M_i$

\* Average per bit Storage cost =  $\frac{S_1C_1 + S_2C_2 + S_3C_3}{S_1 + S_2 + S_3}$

\* Memory Access rate =  $\frac{t}{\text{memory access(cycle) time}}$

\* Memory Representation : = no of cells  $\times$  1 cell size(capacity)  
= No. of mem addr.(location)  $\times$  no of bits in 1 location

Address size =  $\log_2 n$

- If 1 cell capacity = 1B  $\Rightarrow$  Byte Addressable mem (By default)  
(Address assigned per Byte)



\* Main Memory  $\rightarrow$  Mem to store current running progs. and their data.

\* Memory Decoding  $\rightarrow$  address size  $\times$  Total no of addresses is the decoder size

### Random access memory (RAM)

- Non volatile

- 2 job it is doing while the comp. gets on

$\rightarrow$  H/w check (Power-on self test)

$\rightarrow$  Booting: Bootstrap loader is in ROM.

- B prog. is in 2nd mem,

- After Booting over, CPU doesn't use ROM till the comp turned off.

### Read Only Memory (ROM)

### Random access Memory (RAM)

$\hookrightarrow$  Used to store running prog. and the data along with its running prog.

\* Initially CPU executes instruction from ROM (POST)

After POST  $\rightarrow$  [Booting Prog.]  $\sim$  Load OS into RAM. (Bootstrap)

RAM: Bootstrap

ROM: Bootstrap Loader

### Types of ROMs

- I) EPROM (Erasable)
- II) EEPROM (Programmable)
- III) EPROM (Erasable - Programmable)
- IV) EEPROM (Electrically Erasable Program)
- V) EAPROM (Electrically Alterable ... )

### Types of RAM

- i) Static
- ii) Dynamic.

### Static

- 1) Implemented using flip-flops.
- 2) No refresh required.
- 3) Faster Read/Write.
- 4) Used for cache.
- 5) Low idle power consumption.
- 6) High operational power consumption.

### Dynamic

- 1) Implemented using capacitors.
- 2) Periodic refresh is required.
- 3) Slow Read/Write.
- 4) used for main memory.
- 5) High Idle power consumption.
- 6) Low operational power consumption.

### \* DRAM & SRAM Refresh Logic & cell Arrangements :-

- In 1 refresh 1 row of cells can be refreshed.
  - Time to refresh complete chip = no of rows of cells  $\times$  1 refresh time
  - Time to refresh n no of chip = 1 chip refresh time, All chips can be refreshed parallelly.
  - Whenever refresh is going on, no R/W can happen in the chip.
- One entire chip is refreshed then only R/W can be done.



- Main memory is 16 bit wider  $\Rightarrow$  cell capacity = 16 bits or at each address data req by CPU in 16 bit (Arrangements includes here)

|  | A     | B     | $A \times B$<br>result | $A \times B$<br>mul. table | $A + B$<br>Add" table  | $A + B$<br>result |
|--|-------|-------|------------------------|----------------------------|------------------------|-------------------|
|  | n bit | n bit | 2n bit                 | $2n \times 2^n$ bit        | $(n+1) \times 2^n$ bit | $(n+1)$ bit       |

- \* No of pins required in RAM, ROM (chip select, Read, write, data, add line in RAM), (CS, Data, add in ROM)

### \* Using multiple chips in a mem system :-

Case 1: Total no of address increases (Vertical + Decoder)

Case 2: Total no of data increases (Horizontal + <sup>word size</sup> Decoder)

= At same time addr. line provided to both RAM

Case 3: If required Data & Addr more? (Hybrid Arrangements)

$$128 \times 8 \text{ bit} \Rightarrow 7 \text{ bit Add}$$

$$128 \times 8 \text{ bit} \Rightarrow 7 \text{ bit Add}$$

$$\frac{256 \times 8 \text{ bit}}{\Rightarrow 8 \text{ bit Add line}} \rightarrow \text{Among this } 8 \text{ bit 1st bit is used as chip select.}$$

- \* In multiple chip memory to decode output select one entire horizontal arrangement.
- \* CPU can initiate the memory request only when memory is ready.

Associative Memory : (Content Addressable mem.)



→ costly, fast, used in Cache & TLB.

- If data matched in more than one place then the content will send sequentially.

- parallel comparisons req.

\* CPU always generates mem address (even to access cache too)

\* Locality of Reference :

If CPU has requested one addr. for mem. access then that particular add. or nearby add. will be accessed in further.

Spatial Locality (space)

Temporal (Time)

\* Cache Memory : Based on Locality of reference concept current demanded locality are kept into the smaller & faster mem known as cache memory.

\* When miss in a cache occurs a block containing demanded content is brought to the cache. (Birthday, taking help of a small boy eg.)



\* This is done parallelly.



$$d. \text{ miss ratio} = \frac{\# \text{ hits}}{\# \text{ References}}$$

$$d. \text{ miss ratio} = (1 - h).$$

\* Average Memory Access Time

$$T_{avg} = H * \text{Time when hit in cache} + (1 - H) * \text{time when miss in cache} \quad (1)$$

Simultaneous Access (By Default)

\* Req for mm and cache are generated simultaneously



$$* T_{avg} = H * T_c + (1 - H) T_{mm}$$

(2)

Hierarchical Access ["level", "hierarchy"]

- Only cache is accessed first



$$* T_{avg} = H * T_c + (1 - H)(T_c + T_{mm}) \quad (3)$$

(3)

$$* T_{avg} = T_{cm} \text{ if } H = 100\%.$$

\* Every time there is a read miss, a block is brought from mm to cm.

\* Every time there is a write miss, a block does not come from mm to cm (no write allocate)

When do we use which formula



NOTE! Cache lookup time for hit/miss is ignored.

- \* T<sub>avg</sub> when locality of ref. is w.r.t [of a block accessed in com of cache miss]

\* Simultaneous

$$T_{avg} = h(t_c) + (1-h) \left( t_{block} \underset{\substack{\text{access time} \\ \text{from main}}}{\cancel{+ t_{main}}} \right)$$

$$* t_{block} = \text{Block size} * t_{main}$$

- \* T<sub>avg</sub> when transfer time included?

\* Simultaneous

$$T_{avg} = h(t_{cm}) + (1-h) \left( t_{block} + \underset{\substack{\text{to write in cache}}}{t_{cm}} \right)$$

\* Hierarchical

$$T_{avg} = h(t_c) + (1-h) \left( t_c + t_{block} + t_c \right)$$

- \* In one  $t_{cm}$  whole block in cache is accessed but in  $t_{main}$  only one word in  $mm$  can be accessed not whole block.

- \* If in ques transfer time is given directly, then that includes  $T_{cm}$  (writetime) +  $T_{main} \times \text{Block size}$  (Block access time) (Reading time).

- Cache writers or Write Propagate [Original content always must be there in  $mm$ ]

CPU updates the duplicate copy in cache mem but the  $mm$  copy is not updated this is called cache write issue.

(1) Write Through :-

The updation of  $mm$  content is done immediately/simultaneously along with cache.

Adv:  $mm$  always contains the correct updated value.  
( $mm$  contains consistent data)

Disadv: Wastage of time (Time consuming process)

\* For write operation  $mm$  is accessed irrespective of hit or miss

(2) Write Back :-

immediately we don't write. Dirty/muf. bit maintained to denote change.  
Do write operation during block replacement.

Adv: Time saving (faster)

(In case of cache hits,  $mm$  is not accessed)

Disadv: inconsistency in data.

Memory Coherence Problem: 2 devices [CPU/H/D] accessing same content from 2 diff mem.

\* Write Allocate/No Write Allocate :



- Write Allocate : Bring block from mm to cm.
- For better performance write back on this.
- No Write Allocate : the block is modified in the mm and not loaded into the cache
- For better performance write through on this.

\* Parag in Write through Cache :-

$$\text{> } T_{\text{avg(near)}} = H * t_{\text{cm}} + (1-H) * t_{\text{mm}} \quad [\text{Simultaneous}]$$

or

$$H * t_{\text{cm}} + (1-H)(t_{\text{cm}} + t_{\text{mm}}) \quad [\text{Hierarchical}]$$

$$2) T_{avg}(\text{writ}) = \max(t_{cm}, t_{mm}) = T_{mm}$$

$$\Rightarrow \text{Tang(overall)} = \text{fraction of read} \times \text{Tang(read)} + \text{frac" of write} \times \text{Tang(write)}$$

↳ Effective hit ratio = read hit ratio  $\times$  % of read.  
 ↳ Because instead of write we a

- In write through the whole block is not yet transferred, the amount of data CPU wants to update is sent to memory.

\* Tang in with Back Cache :- [Simplically we talk about locality of reference]

$$\triangleright T_{avg}(\text{read/write}) = H * t_{cm} + (1-H) \left( t_{block} + t_{write\ back} \right) [\text{Simultaneous}]$$

$$2) T_{avg} (\text{read/write}) = H * t_{cm} + (1-H) (t_{blkR} + t_{cm} + t_{blkW})$$

$$\text{Work bank} = d * \text{block}.$$

↓  
% of dirty blocks

[Hierarchical]

\* Bus usage in Work ~~not~~ /miss  $\rightarrow$  Bus usage is come into picture when man is accessed not cm.

- Only 1 data sent to mm for write in write-through cache.
  - In write-through cache the block is replaced from cache directly
  - In write-back cache, the dirty blocks are only written back to mm.

Write through with no write allocate.



Write through with write allocate.



Write back with no-write allocate.



Write back with write allocate.



## Cache Mapping

- CPU always generates mm address (even to access cache too)
- Transformation of mm addr into cache memory adder is known as Cache mapping.
- Mapping is applied on blocks.

(i) Direct Mapping

(ii) Set-associative mapping.

(iii) Fully associative mapping.



▷ Direct Mapping is

\* CM block number = (mm block number) % number of blocks in cache.

\* No of blocks in cache =  $\frac{\text{Cache size (CS)}}{\text{Block size (BS)}}$

\* MM Block No.

|     |              |
|-----|--------------|
| Tag | CM block No. |
|-----|--------------|

MM address

|     |              |             |
|-----|--------------|-------------|
| Tag | CM block no. | Byte offset |
|-----|--------------|-------------|

$\log_2(\text{cache size})$



\* If Blocksize =  $2^K B$   
no of bits in byte offset = K bits.

Byte offset

cacheline



\* Index in direct Mapping = CM block number

\* Tag in direct Mapping = mm address -  $\log_2(\text{cache size})$

\* Tag identifies among all mm blocks which maps to one index, which one is present in cache.

\* Byte offset is not used to check hit/miss in any of the mappings.

\* Meta-data size or Tag directory size = no of blocks in cache \* (Tag-bit + extra bit)

(all mappings)

\* For a given cache size, block size and mm size = Tag is same (for byte addressable and word addressable memory both)



### 3) Fully-Associative Mapping:



- \* Index in fully associative mapping = 0 bits
- \* Tag in fully associative mapping = mm block number

\* For a given cache size, mm size & block size:

▷ Size of tag & (tag-directory size)  
 Full ass.  $>$  Set ass.  $>$  Direct

▷ No. of bits in index  
 Direct  $>$  Set ass.  $>$  Fully ass.

| Index    |              |
|----------|--------------|
| Mapping  | Index        |
| Direct   | CM block No. |
| Set-Asso | CM set No.   |
| Fully    | 0 bit        |

\* Hardware Implementation:

\* Direct Mapping hardware:

- ▷ Number of MUX for tag selection = Tag bits
- ▷ size of MUX for tag selection = Number of blocks  $\oplus 1$
- ▷ Number of Comparators = 1
- ▷ size of comparator = Tag bits

### \* K-way set associative mapping :-

- 1) Number of MUX for tag selection =  $K \times \text{tag bit}$
- 2) size of MUX for tag selection = Number of sets  $\times 1$
- 3) Number of comparators =  $K$
- 4) size of comparator = Tag-bit.
- 5) OR-gate =  $1$  ( $K$ -input OR gate)

### \* Fully Associative mapping :-

- 1) Number of comparators = Number of blocks in cache
- 2) size of comparator = Tag-bit
- 3) OR-gate =  $1$  (number of blocks - input OR gate)

### \* Hit latency time :-

- 1) Direct Mapping = MUX delay + Comparator delay
- 2) Set associative Mapping = MUX delay + comparator delay + OR-gate delay
- 3) Fully associative Mapping = Comparator delay + OR-gate delay

## Types of Cache Misses

- 1) Cold or Compulsory miss: First time access of a block will always cause a miss.
- To reduce Cold misses: Increase block size.
- \* Total cold miss = number of blocks in mm
- 2) Capacity Miss: If cache is full and hence miss occurs.  
↳ and it is not the first time access of the block.
- To reduce Capacity misses: Increase the capacity
- 3) Conflict Miss: If cache set is full and hence miss occurs due to tag mismatch.
- To reduce conflict miss: Increase associativity
- \* No Conflict miss in fully associative cache.

Ex: No. of blocks in cache = 4  
2-way set associative cache

LRU replacement policy

CPU requests for main memory blocks: 0, 4, 8, 0, 4, 1, 3, 1, 5, 1, 3



|    |   |   |   |   |
|----|---|---|---|---|
| 00 | 0 | 4 | 8 | 4 |
| 01 | 1 | 2 | 8 | 3 |

$$\begin{aligned}
 \# \text{ cold miss} &= 6 \\
 \# \text{ conflict miss} &= 1 \\
 \# \text{ capacity miss} &= 1 \\
 \hline
 & 8
 \end{aligned}$$

## Goals of using Cache:

1. Minimize Access Time → Smaller Cache?
2. Maximize Hit Rate → Larger Cache → Multi-level Cache
3. Minimize Miss penalty → Non-blocking cache  
multi-bank memory

$$\left\{
 \begin{array}{l}
 \text{Hit ratio} = \frac{4}{12} = 0.33 \\
 \text{Miss ratio} = \frac{8}{12} = 0.67
 \end{array}
 \right.$$

## Cache Block Replacement Policy

- \* In direct mapping  $\rightarrow$  no any replacement policy required.
- \* In set-associative mapping  $\rightarrow$  replacement policy req.
- In fully-associative mapping  $\rightarrow$  "

### Block Replacement Algo's

\* replacement valid when :  $N \geq 2$

(1) FIFO :

(2) Optimal : Replace block which is not going to referred soon in future

(3) LRU : Replace block which has not been referred since long.  
(Least Recently used)

Cache Miss Penalty : Time required to bring a missed block from main memory to cache

Assume, Cycles required to send address to mem  $\rightarrow$  1 cycle

includes  $\rightarrow$  Cycles required to access 1 mm cell  $\rightarrow$  10 cycles

Cycles required to transfer 1 cell data to cache  $\rightarrow$  1 cycle

| <u>Cache Block size</u> | <u>Main mem cell size</u> | <u>Miss penalty</u>                                                           |
|-------------------------|---------------------------|-------------------------------------------------------------------------------|
| 4B                      | 1B                        | $1 + 10 * 4 + 1 * 4 \Rightarrow 45 \text{ cycles} \Rightarrow 225 \text{ ns}$ |
| 4B                      | 2B                        | $1 + 10 * 2 + 2 * 1 \Rightarrow 23 \text{ cycles} \Rightarrow 115 \text{ ns}$ |
| 4B                      | 4B                        | $1 + 10 + 1 \Rightarrow 12 \text{ cycles} \Rightarrow 60 \text{ ns}$          |

Assume, CPU operate on 200 MHz, then miss penalty time?

$$\text{Cycle time} = \frac{1}{200 \text{ MHz}} = 5 \text{ ns}$$

\* consider miss penalty cycle  $\rightarrow$  45 cycles  $\Rightarrow 225 \text{ ns}$

In 225 ns, transfer = 4B

$$1 \text{ ns} \rightarrow \frac{4B}{225}$$

$$1 \text{ sec} \rightarrow \frac{4B}{225 \times 10^9 \text{ ns}} = \frac{4 \times 10^3 \times 10^6 \text{ B}}{225} = 17.77 \text{ MB/sec}$$

\* Miss penalty increases when the block size increases.

## Multilevel Cache :

Simultaneous  $\rightarrow H_1 \times t_1 + (1 - H_1) [H_2 \times t_2 + (1 - H_2) t_{mm}]$

Hierarchical  $\rightarrow H_1 \times t_1 + (1 - H_1) (H_2 \times (t_1 + t_2) + (1 - H_2) (t_1 + t_2 + t_{mm}))$

## \* Dual Cache Organization :-



T<sub>avg</sub> = ?

- T<sub>avg</sub> for inst (Assume d-cache is not present at all)
- T<sub>avg</sub> for data (Assume l-cache is not present at all)
- T<sub>avg</sub> = fraction of inst \* (T<sub>avg</sub> inst) + fraction of data \* (T<sub>avg</sub> data)

## \* Cache Inclusion Policy



- All content of L1 is present in L2 also.
- It is not necessary to have same content in both the block present at level 1 and level 2 at any time
- We will use write back in most of the time, not need to update the block content every time the value changes in L1 cache

## Read Operation Only

1. Hit in L1 : CPU gets direct content from L1
  2. Miss in L1 & Hit in L2 : Bring a block from L2 to L1, if a block evicted from L1, then no any role of L2.
  3. Miss in L1 & Miss in L2 : Bring the block into L2 and then L1
- \* If any block evicted from L2, then send a block-invalidation to L1  
(make the valid bit in L1 of that block as 0)

## \* Exclusion:



Whatever present in L1,  
not necessarily present in L2.

## Read:

- 1) Hit in L1, CPU takes the content
  - 2) Miss in L1, Hit in L2  $\Rightarrow$  Move a block from L2 to L1 and if the block is evicted from L1 then place it into L2.
  - 3) Miss in L1, Miss in L2  $\Rightarrow$  Bring a block from mm to L1 and if the block is evicted from L1 then place it into L2.
- \* L2 is known as victim cache [All the replace cache of L1 is placed here]

# Pipeline

- Parallel Processing : Simultaneous data processing
- Type:
  - Vector processing
  - Array processing
  - Pipeline Processing (Pipelining)

\* Inst<sup>n</sup> fetch related to control unit, and  
Inst<sup>n</sup> execution related to ALU.

- Pipeline Processing :

- Useful when same processing is applied over multiple i/p.
- Tech. to decompose a seq. opes into sub opes.
- Task: One Operation performed in all segment.
- Sub operations are performed in segments
- Each segment can perform its suboperation over different inputs

- Pipeline Cycle time :

- That min time in which all segments can perform their respective sub-operation.
- Time required to perform some operations in pipeline.

$$\Rightarrow \text{No of cycles required} * \text{cycle time}$$

- Performance in pipeline is given by speedup ratio.

$$\text{Speedup} = \frac{\text{non-pipeline time}}{\text{Pipeline time}}$$

$$= \frac{n * t_n}{(K+n-1) * t_p}$$

[n value given in ques]

$t_p$  = pipeline cycle time      K = No of stages in pipeline = # segment  
 $t_n$  = non-pipeline cycle time      n = No of inst

- \* When no of tasks increased ie  $n \gg K$  [ignore K-1]

$$\text{Speedup}_{\text{ideal}} = \frac{t_n}{t_p}$$

{ max speedup, 1 cycle for each input }  
 If n is not given in ques

- \* Special Case : (Not always true)

→ Assume one task takes equal time in non-pipeline & pipeline system.

$$\Rightarrow t_n = K * t_p$$

$$\Rightarrow S_{\text{ideal}} = K$$

only in special case.

$$S \leq K$$

Always.

- Flynn's Classification of Computers
  - 1> SISD
  - 2> SIMD - pipeline processor
  - 3> MISD - Only theoretical
  - 4> MIMD - Multiple pipelines (Superscalar Computers)

$$\begin{cases} S = \text{Single} ; M = \text{Multiple} \\ I = \text{Inst stream} ; D = \text{Data stream} \end{cases}$$

## Synchronous Pipeline



\* Pipeline cycle time = max(all segment delay) + Intermediate buffers or reg delay  

$$(t_p) = \max(6, 8, 7, 5) + 1\text{ns} = 8 + 1 = 9\text{ns}$$

\* One task time in non-pipeline = (Sum of all reg delays) =  $6+8+7+5 = 26\text{ns}$



\* Try to divide operation into segments in such a way, that each segment should take equal delay: to have speed up near maximum.

Observations :- In general case, no of cycles =  $(K+n-1)$

$$\text{no of cycles per task} = \frac{(K+n-1)}{n} = \text{CPI}_{\text{pipeline}}$$

Ideal case,

$$\text{no of cycles} = n$$

$$\text{no of cycles per task} = 1$$

Ignoring starting  $(K-1)$  cycles to fill the pipe

$$\text{CPI}_{\text{ideal}} = 1$$

In every single cycle, one op can be obtained from pipeline

i. Throughput

$$\text{increase} = \left[ \frac{\text{New throughput} - \text{Old throughput}}{\text{Old throughput}} \right] \times 100\%$$

$$\text{MIPS} = \frac{\text{clockrate}}{\text{CPI}_{\text{ideal}} \times 10^6}$$

## Instruction Pipeline &

- pipeline processing applied on the instruction cycle.

Assume 5 seg. inst<sup>n</sup> pipeline

- IF - Instruction Fetch
- ID - Instruction Decode & Add. Ctl
- OF - Operand Fetch
- EX - Execution
- WB - Write Back

$\rightarrow$  (fetch cycle) + (execution cycle)

- \* Branch Detection: ID (decod.)
- \* Branch Evaluation: EX (4<sup>th</sup>)

$$\begin{cases} (4^{\text{th}}) \rightarrow 3 \text{ stall cycles} \\ (n^{\text{th}}) \rightarrow (n-1) \text{ stall cycles} \end{cases}$$

|                   | 1              | 2  | 3  | 4         | 5                       | 6  | 7  | 8  | 9 | 10 | 11 | 12 | 13 |
|-------------------|----------------|----|----|-----------|-------------------------|----|----|----|---|----|----|----|----|
| I <sub>1</sub>    | IF             | ID | OF | EX        | WB                      |    |    |    |   |    |    |    |    |
| I <sub>2</sub>    |                | IF | ID | OF        | EX                      | WB |    |    |   |    |    |    |    |
| I <sub>3</sub>    |                |    | IF | ID        | OF                      | EX | WB |    |   |    |    |    |    |
| Branch Conditions | I <sub>4</sub> |    | IF | <u>IF</u> | <u>ID<sup>BRA</sup></u> | OF | EX | WB |   |    |    |    |    |
|                   | I <sub>5</sub> |    |    |           | <u>IF</u>               | -  | -  |    |   |    |    |    |    |
|                   | I <sub>6</sub> |    |    |           |                         |    |    |    |   |    |    |    |    |
|                   | I <sub>7</sub> |    |    |           |                         |    |    |    |   |    |    |    |    |
|                   | I <sub>8</sub> |    |    |           |                         |    |    |    |   |    |    |    |    |
|                   | I <sub>9</sub> |    |    |           |                         |    |    |    |   |    |    |    |    |

Target

\* Key points :-

- \* Stall cycles = Extra cycles needed in pipeline execution due to some hazard.
- \* Stall cycles = Extra cycles needed in pipeline execution due to some hazard.
- \* If branch condition is evaluated after i<sup>th</sup> stage of branch inst<sup>n</sup> then the number of stalls because of branch inst<sup>n</sup> = (i-1).
- \* In any case, whether the condition (Branch) is true or false, the branch inst<sup>n</sup> will incur the stall cycles.

$$\begin{aligned} * \text{Smooth pipeline} &= (K+n-1) = 5+6-1 = 10 \text{ cyc} \\ * \text{Extra stall} &= 3 \text{ cyc.} \\ \hline \text{Total} &= 13 \text{ cyc.} \end{aligned}$$

Q) Consider a program which contains 500 inst<sup>n</sup> I<sub>1</sub>, I<sub>2</sub>, I<sub>3</sub>, ..., I<sub>500</sub>. Further consider a 5-stage pipeline with stages as: Inst<sup>n</sup> fetch, Decode, Operand fetch, Execution and Write back. The program contains 3 branch instructions, information of those given in a table below. The number of cycles required to execute this program in the given pipeline is \_\_\_\_\_?

| <u>Branch Out<sup>n</sup></u> | <u>Target Inst<sup>n</sup></u> | <u>Branch taken or not</u> |
|-------------------------------|--------------------------------|----------------------------|
| I <sub>9</sub>                | I <sub>29</sub>                | Taken                      |
| I <sub>234</sub>              | I <sub>322</sub>               | Not Taken                  |
| I <sub>443</sub>              | I <sub>491</sub>               | Taken                      |

In above ques consider pipeline stage delays are 12ns, 15ns, 13ns, 17ns and 14ns then,

- (i) Total time req. to execute the program?
- (ii) Calculate avg. CPI?
- (iii) Calculate MIPS count?

Soln:

$$\begin{aligned}
 \text{No of inst}^n &= (I_1 - I_9) + (I_{29} - I_{443}) + (I_{491} - I_{500}) \\
 &= 9 + \underbrace{415}_{(443-29+1)} + 10 \\
 &= 434
 \end{aligned}$$

w/o hazards no of cycles =  $(n+k-1) = (5+434-1) = 438$  cycles

$$\text{No of stalls} = 3 * 3 = 9$$

Taken or not taken  
both should be counted.

$$\begin{aligned}
 \text{Total no of cycles} &= 438 + 9 \\
 &= 447 \text{ cycles}
 \end{aligned}$$

$$(i) 447 * t_p = 447 * \max(12, 15, 13, 17, 14) = 447 * 17 = 7599 \text{ ns}$$

$$(ii) 434 \text{ inst}^n \rightarrow 434 \text{ cycles}$$

$$1 \text{ inst}^n \rightarrow \frac{447}{434} = 1.029 \text{ (Average CPI)}$$

$$(iii) \text{MIPS} = \frac{\text{clock rate}}{\text{CPI}_{\text{avg}} \times 10^6} = 57.16 \text{ MIPS}$$

• Pipeline Hazards  $\Leftrightarrow$  Situations that prevent the next inst<sup>n</sup> from being executions during its designated clock cycle.

1. Structural Hazard / Resource conflicts

2. Data Hazard / Data Dependency.

3. Control Hazard / Branch difficulty.

\* Structural Hazard / Resource conflicts:

$\rightarrow$  In different segments try to use same resource at same time.



Sol<sup>n</sup>: Use 2 cache   
 (i) L-cache   
 (ii) D-cache

\* Data Dependency or Data Hazards: Result of one inst<sup>n</sup> is used as input in next.

$$i: R_1 \leftarrow R_2 + R_3$$

$$i+1: R_5 \leftarrow R_1 * R_4$$

$$i: \text{IF } \text{ID } \text{OF } \text{EX } \text{WB}$$

$$i+1: \text{IF } \text{ID } - - \text{OF } \text{EX } \text{WB}$$

Stalls/Bubbles

A General pipeline <sup>W/W</sup> cannot understand the data dep.  
 or detect  $\hookrightarrow$  [Wrong result will be generated]

Sol<sup>n</sup> of Data dependency



④ Compiler can also insert some

other independent operation in b/w those inst<sup>n</sup>.

$$i: R_1 \leftarrow R_2 + R_3$$

NOP } Added by  
NOP } compiler

$$i+1: R_5 \leftarrow R_1 * R_4$$

## \* Operand Forwarding or Bypassing :

How pipeline is intelligent enough to detect dependency

i:  $R_1 \leftarrow R_2 + R_3$ : IF [ ] ID [ ] OF [ ] EX [ ] WB [ ]

ii:  $R_5 \leftarrow R_1 * R_4$ : IF [ ] ID [ ] OF [ ] EX [ ] WB [ ]

① No stall cycle [But costly sol<sup>n</sup>] Loop



NOTE:

→ decision to forward operand or not

\* For ALU to ALU data dependency, no stall cycle required if operand forwarding is used.

I<sub>1</sub>:  $R_1 \leftarrow M[\text{add}]$

I<sub>2</sub>:  $R_4 \leftarrow R_1 * R_3$  (or)

I<sub>1</sub>:  $R_1 \leftarrow R_2 * R_3$

I<sub>2</sub>:  $M[\text{add}] \leftarrow R_1$

← stalls are required

\* Control Hazards or Branch Difficulty : Hazards bcz of branch inst<sup>n</sup>

A general pipeline cannot have any solution for branch hazards



⇒ Delayed Branch :

Hardware cannot detect branch, so compiler has to provide sol<sup>n</sup> [By adding NOPs.]

i: Branch

NOP  
NOP  
NOP

IF ID OF EX WB

→ 4th stage (3 stalls)

i+1: Inst<sup>n</sup>

IF ID OF EX WB

⇒ Branch Prediction :

Pipeline can detect branch & predict the stall result.

If prediction not correct, then roll back [CPU state needed to be saved]

→ costly h/w

① Static Prediction

② Dynamic Prediction

Loop Buffer: Special Buffer which stores the small loop in it and that is executed a no of times (as per req.)

⇒ Prefetch:



- 2 pipelines (separate)
- one pipeline is discarded when condition is known.
- very costly (more h/w)
- nested branch makes system even more complex.

\* Hazard Classification (Data):

Assume that, there are 2 instructions i & j. and i executes before j

(i) RAW (True dependency)

j tries to read a source before i with it → inconsistent

$$\begin{array}{l} i: R_1 \leftarrow R_2 + R_3 \\ i+1: R_5 \leftarrow R_1 * R_4 \end{array}$$

Sol<sup>n</sup> delayed branch, operand forwarding

(ii) WAW (Write dependency) / false dep.

j tries to write a destination before i with it (final result differ)

$$\begin{array}{l} i: R_1 \leftarrow R_1 + R_3 \\ i+1: R_1 \leftarrow R_4 * R_5 \end{array}$$

\* (Race condition)

Sol<sup>n</sup> Register Renaming (H/w sol)  
(static single assignment)

(iii) WAR (Anti dep/ false dep)

j writes a destination before i reads

$$\begin{array}{l} i: R_1 \leftarrow R_2 + R_3 \\ j: R_2 \leftarrow R_4 * R_5 \end{array}$$

i gets incorrect value  
(i gets incorrect. before it reads)

Sol<sup>n</sup> register Renaming.

NOTE:

- In case if we are not using operand forwarding, we are going to have the stall cycles due to RAW dep/ true dep not due to WAW or WAR dep/ false dep.  
→ If u r not using op. forwarding try to draw diagram.

Consider a pipelined processor with the following four stages:

IF: Instruction Fetch ; ID: Instruction decode & Operand fetch ; EX: Execution and WB: Write Back.

The IF, ID, and WB stages take one clock cycle each to complete the operation. The number of clock cycles for EX stage depend on the inst<sup>n</sup>. The ADD and SUB inst need 1 clock cycle and the MUL inst needs 3 clock cycle in the EX stage. Operand forwarding is used in the pipelined processor. What is the number of clock cycles taken to complete the following sequence of inst<sup>n</sup>?

|     |            |                              |
|-----|------------|------------------------------|
| ADD | R2, R1, R0 | $R2 \leftarrow R1 + R0$      |
| MUL | R4, R3, R2 | $R4 \leftarrow R3 \times R2$ |
| SUB | R6, R5, R4 | $R6 \leftarrow R5 - R4$      |

Try to use method 2  
(w/o diagram)  
else draw diagram

In general, each stage takes one cycle for each inst<sup>n</sup>.

Given: IF = 1

|    | ADD | SUB              | MUL |
|----|-----|------------------|-----|
| EX | 1   | 1                | 3   |
| WB | 1   | <u>Extra - 0</u> | 2   |

$$\text{In general, w/o hazard} = K+n-1 = 4+3-1 = 6 \\ \text{stall} = \frac{2}{2} = 1 \text{ cycle}$$

NOTE :-

All inst<sup>n</sup> are ALU inst<sup>n</sup>, hence no stall cycles due to data dependency, because operand flow is used.

GATE 2009

|                | S <sub>1</sub> | S <sub>2</sub> | S <sub>3</sub> | S <sub>4</sub> |
|----------------|----------------|----------------|----------------|----------------|
| I <sub>1</sub> | 2              | 1              | 1              | 1              |
| I <sub>2</sub> | 1              | 3              | 2              | 2              |
| I <sub>3</sub> | 2              | 1              | 1              | 3              |
| I <sub>4</sub> | 1              | 2              | 2              | 2              |

for ( $i=1$  to 2) { I<sub>1</sub>; I<sub>2</sub>; I<sub>3</sub>; I<sub>4</sub>; }



Ans  
in LCS

max + ()  
using given table

like LCS  
problem

23

Effects of Hazards: (CPI calculation in pipeline  $\leftarrow$  w/o hazard)

$\Rightarrow$  w/o any hazard: In general,  $CPI = \frac{K+n+1}{n}$

In ideal,  $CPI_{ideal} = 1$  ( $i.e. n \gg K$ )  
so ignore  $(K+1)$

$\Rightarrow$  w/ hazard: assuming 'n' no. of cycles are needed  $\rightarrow$  extra because of hazard.

In general,  $CPI = \frac{K+n+1+n}{n}$

In ideal,  $CPI = \frac{n+n}{n} = 1 + \frac{n}{n} \rightarrow IDEAL$

Ex.

Assume,  $n=100$  inst<sup>n</sup>,  $K=4$

15 inst<sup>n</sup> causing 3 stall cycles each.

i. Total stall cycles ( $\pi$ ) =  $15 * 3 = 45$  cycles.

$$CPI = \frac{K+n+1+\pi}{n} = \frac{4+100-1+45}{100} = \frac{148}{100} = 1.48$$

$$\begin{aligned} CPI_{ideal} &= \frac{n+\pi}{n} = \frac{100+45}{100} = 1.45 = 1 + \frac{45}{100} \\ &= 1 + \frac{15*3}{100} \\ &= 1 + 0.15 * 3 \end{aligned}$$

$\star$   $CPI_{ideal} = 1 + \frac{\text{Stall freq.} * \text{Stall cycles}}{\text{Inst}^n}$

$$\begin{aligned} CPI &= 0.85 * 1 + 0.15 (1+3) \\ &= 1.45 \end{aligned}$$

$$(1-f) * 1 + f (1+c)$$

$$(1-f) + f + fc$$

$$1+fc$$

\* Efficiency and throughput of pipeline

In ideal cond<sup>n</sup>, max speed up =  $S_{max}$

Achieved speed up =  $S$

i.  $\text{efficiency} = \frac{S}{S_{max}} = \eta$

Throughput  $\rightarrow$  work done/unit time

No. of operation (ideal) = 1 in 1 cycle

Time (1 operation) = 1 cycletime ( $t_p$ )

$\therefore t_p \rightarrow 1$

$$1 \text{ unit} \rightarrow \frac{1}{t_p} = \text{Thr.}$$

$$\left\{ \begin{array}{l} S = \frac{t_n}{t_p} \quad (\text{w/o hazard}) \\ S = \frac{t_n}{CPI * t_p} \quad (\text{w/ hazard}) \end{array} \right.$$

$$\left\{ \begin{array}{l} S = \frac{t_n}{t_p} \quad (\text{w/o hazard}) \\ S = \frac{t_n}{CPI * t_p} \quad (\text{w/ hazard}) \end{array} \right.$$

- Q) An instruction pipeline has five stages where each stage takes 2 ns <sup>an inst</sup> <sub>in all five stages.</sub> Branch inst<sup>n</sup> are not overlapped, ie. the instruction after branch is not fetched till the branch instruction is completed. Under ideal conditions calculate the avg. inst<sup>n</sup> execution time assuming that 10% of all inst<sup>n</sup> executed are branch inst<sup>n</sup>. Ignore the fact that some branch inst<sup>n</sup> may be conditional.

$$K = 15$$

$$t_p = 3 \text{ ns}$$

$$\text{CPI}_{\text{ideal}} = 1 + P_C = 1 + 0.1 \times 4 \\ = 1.4$$

$\left\{ \begin{array}{l} \text{Stall b/c of} \\ \text{branch inst}^n = 4 \\ z(l-1) \end{array} \right.$

$$\text{extra } S = \frac{15}{1.4 \times 3} = 3.57$$

$$\text{avg. inst}^n \text{ time} = 1.4 \times 8 \text{ ns} = 11.2 \text{ ns} \text{ (Ans)}$$

- b) If a branch inst<sup>n</sup> is a conditional inst<sup>n</sup>, the branch need not be taken when the condition is false. If the branch is not taken, the following instructions can be overlapped. When 70% of all branch inst<sup>n</sup> are conditional branch inst<sup>n</sup>, and 40% of the conditional branch are such that the branch is taken, calculate the average inst<sup>n</sup> execution time?



$$\text{CPI}_{\text{avg}} = 0.9 \times 1 + 0.1 \times 0.3 \times (1+4) \\ + 0.1 \times 0.7 \times 0.4 \times (1+4) \\ + 0.1 \times 0.7 \times 0.6 \times 1$$

$$= 1.232$$

avg inst<sup>n</sup> exec time

$$= 1.232 \times 3 \text{ ns}$$

$$= 3.696 \text{ ns} \text{ (Ans)}$$

$$\text{extra } \left\{ S = \frac{t_n}{\text{CPI} \times t_p}^2 \cdot \frac{15}{1.232 \times 3} \right. \\ \left. = 4.058 \right.$$



\* Addressable unit of disk  
→ Sectors.

$$* \text{No of Sectors in Disk} = \\ 2 * \text{no of platters} * \frac{\text{No of tracks}}{\text{per surface}} * \frac{\text{No of sectors}}{\text{per track}}$$

Top view of the disk:



\* Sector :-  
Sector is the smallest unit of disk which can be read/written at once.

Sector Capacity

\* Disk Capacity:

- 1) Surface capacity: sum of capacity of all sectors.
- 2) Disk capacity: Surface capacity \* no. of sectors.

↓ (Default disk)

Sectors having constant capacity  
(or)

variable storage density

→ \* Constant Angular velocity.

} Sectors having variable capacity  
or

Constant storage density

\* constant linear velocity

Not  
in  
syllabus.

$$* \text{Disk Capacity} = 2 * \frac{\text{No of platters}}{\text{no of tracks per surface}} * \frac{\text{No of sectors per track}}{\text{One sector capacity}}$$

→ access time for 1 sector of disk

$$* 1 \text{ disk access time} = \frac{\text{Seek time}}{\text{time}} + \frac{\text{Avg rotational delay or latency}}{\text{or}} + \frac{1 \text{ sector transfer time}}{\text{time}} + \frac{\text{Extra delay.}}{\text{time}}$$

\* Seek time = Time required to position the arm over the desired track.

\* Rotational latency = Time required to rotate desired sector under R/W head.

\* Transfer time = Time required to read or write 1 sector.

Rotational Delay

If current and target sector given.

$$\Rightarrow \frac{1}{2} \text{ disk rotation time} * \frac{\text{No of Sectors}}{\text{No of sectors per track}} \text{ to rotate.}$$

If current & target sector is not given.

$\Rightarrow$  avg. rotational latency

$$\Rightarrow \frac{1}{2} \text{ disk Rotation time}$$

Note: In one disk rotation,  
one track can be transferred.

Hence,

$$\frac{1 \text{ Sector transfer time}}{\text{ }} = \frac{\frac{1}{2} \text{ disk rotation time}}{\text{No of sectors per track.}}$$

## Multiple Sector Access Time

All sectors stored on same track,  
consecutively or Sequentially.

All sectors are stored Randomly.

Consider  $n$  sectors to be transferred:

$$= \text{Seek time} + \text{Rotational latency} + n * \frac{\text{Sector transfer time}}{n}$$

$$= n * [\text{Seek time} + \text{Rotational latency} + \frac{\text{Sector transfer time}}{n}]$$

\* Cylinder: Collection of tracks of same radius from all the surfaces, form a cylinder.

→ Data stored and accessed cylinder-wise: To save seek time.

\* No of cylinders in disk = No of tracks per surface.

\* Disk Addressing:

{ 3 platters,  
2 surfaces per platter  
4 tracks per surface.  
10 sectors per track.

↳ 60 sectors in a cylinder

Address →  $\langle C, h, S \rangle$   
 ↓      ↓      ↗  
 Cylinder no.   Surface no.   Sector no.

$n_c$  = no. of surfaces per cylinder

$n_t$  = no. of sectors per track

$$\text{Sector} = C * n_c + h * n_t + S.$$

$$\left\{ \begin{array}{l} C = \text{Sector seq. no.} / n_c \\ h = (\text{Sector seq. no. \% } n_c) / n_t \\ S = (\text{Sector seq. no. \% } n_c) \% n_t \end{array} \right.$$

➤ A hard disk has 63 sectors per track, 10 platters each with 2 recording surfaces and 1000 cylinders. The address of a sector is given as a triple  $\langle c, h, s \rangle$  where  $c$  is cylinder number,  $h$  is surface number and  $s$  is sector number. Thus, the 0<sup>th</sup> sector is address as  $\langle 0, 0, 0 \rangle$ , the 1<sup>st</sup> sector as  $\langle 0, 0, 1 \rangle$  and so on.

② The address  $\langle 400, 16, 29 \rangle$  corresponds to sector number?

$$n_t = \text{no of sectors per track} = 63,$$

$$n_c = \text{no of sectors per cylinder} = 63 * 20 = 1260,$$

$$\text{no of sectors covered in 400 cylinders (0-399)} = 400 * 1260 = 504000$$

$$\text{no of sectors covered in 16 surfaces (0-15)} = 16 * 63 = 1008$$

$$\begin{array}{r}
 & & 29 \\
 & + & \\
 504000 & + & 1008 \\
 \hline
 505037
 \end{array}$$

same data and same data continues.

③ The address of the 1039<sup>th</sup> sector is  $\rightarrow$

$$c = \left\lfloor \frac{1039}{1260} \right\rfloor = 0$$

$$h = \left\lfloor \left( \frac{1039}{1260} \right) / 63 \right\rfloor = 16$$

$$s = \left\lfloor \left( \frac{1039}{1260} \right) \% 63 \right\rfloor = 31$$