

# Computer Organization

11A

## CA: Conceptual Design & Operational Structure

- CPU Design
- Inst<sup>n</sup>s
- Add. Mod Ps
- Data format

## CO: Physical Device Interconnection

- IO organization
- Mem organization
- Pipelining



## → Acc Based Arch & Reg Based

- |                                        |                          |                           |
|----------------------------------------|--------------------------|---------------------------|
| • One of the ifps is acc               | • Both ifps are from Reg | • IIP may form reg or MIP |
| → Stack based arch has IIP from stack. |                          |                           |
| → O/p must be stored in accumulator    |                          |                           |

## Complex

$$\text{length of PC} = \text{length of MAR} = \text{PA}$$

$$\text{length of MBR} = \text{length of Acc} = \text{word size}$$

If virtual memory is used

$$\text{length of PC} \geq \text{length of VA}$$

$$\text{length of MAR} \geq \text{length of PA}$$

$$\text{Size of address bus of CPU} = \text{size of virtual address.}$$

32-bit arch: word size: 32 bits

32-bit arch: IIP size to ALU: 32 bits

64-bit arch: word size: 8 64 bits

IIP size to ALU: 64 bits

## Micro Operations: Smallest indivisible operation.

Can be performed in 1 Cycle

3 types:

• Arithmetic Logic Shift

Logical: Circular: Arithmetic:

(0 added) (shifted bit added)

Circular: (shifted bit added)

Arithmetic: (signed bit must be same after shift)

• So in Arithmetic right shift added bit is sign bit.

• In Arith. left shift, it is permitted only if sign remains same. Added bit is 0.

Note:

$$+7 \gg 3 = 3 \quad -7 \gg 1 = 4$$

in 2's comp form

## Instructions:

### Types:

#### 4-add inst<sup>n</sup>

- 4th add points to next inst<sup>n</sup>.
- inst<sup>n</sup> need not be contiguous.

### Disadv:

- Relocation is difficult.
- Large inst<sup>n</sup>s.

#### 3-add inst<sup>n</sup>

- add to next inst<sup>n</sup>. Field is removed and PC is added.
- Disadv of 4-add inst<sup>n</sup> are removed.

#### 2-add inst<sup>n</sup>

- One of the ifps acts as destination.
- Smaller inst<sup>n</sup>.

### Disadv:

- large no of inst<sup>n</sup>s will be needed for given program.
- Comp to 3 add inst<sup>n</sup>.

#### 1-add inst<sup>n</sup>

- 1 op → add field
- 2nd op - implied (accumulator)
- or implied with inst<sup>n</sup>

### Disadv:

- More no of inst<sup>n</sup>s for given program.

#### 0-add inst<sup>n</sup>

- No add is mentioned.
- Generally used in stack based architectures.

## Multiple inst<sup>n</sup> support:

If m-add & n-add inst<sup>n</sup>s are supported ( $m \geq n$ ) then find no of unused m-add inst<sup>n</sup>s and multiply to get req. n-add inst<sup>n</sup>s.

→ Whenever it is said a system supports k-add inst<sup>n</sup>s then min no of such inst<sup>n</sup> must be 1 (not 0).

→ Fixed length inst<sup>n</sup>  $\Rightarrow$  variable length opcode

variable length inst<sup>n</sup>  $\Rightarrow$  fixed length opcode

Memory latency: It is time b/w initiating a req for byte or word in memory until it is retrieved by processor.

Memory cycle time: It time b/w start of one mem access and time at which next access can be started.

Memory Access time: amount of time to transfer

a byte/word, from CPU to RAM or RAM to CPU.

Inst<sup>n</sup> Cycle: All the phases in inst<sup>n</sup> together form inst<sup>n</sup> cycle.

It is divided into 2 cycles:

Fetch cycle: Inst<sup>n</sup> fetch

Execution cycle: remaining phases

Addressing Modes

→ Effective add is add of operand.

| Mode                                     | Description                                                                                                                                                                                      | points (adv & disadv)                                                                                                                                            |
|------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Implied                                  | Location of operand is implied within the opcode.                                                                                                                                                | → <del>operand is limited</del>                                                                                                                                  |
| Immediate                                | Content of add field itself is the operand                                                                                                                                                       | → used to initialize registers<br>→ with mode operand range will be small                                                                                        |
| Direct<br>or<br>Absolute [ ]             | add field is MM add of operand                                                                                                                                                                   | → suitable for accessing static variables.                                                                                                                       |
| Indirect @                               | add field contains MM add which contains add of location where operand is present.                                                                                                               | → used for implementing pointers, parameter passing<br>→ Req. 2 MM access ∴ slow<br>∴ not suitable for pipelining                                                |
| Register Direct<br>(or)<br>Register mode | add field has register ref which has the operand                                                                                                                                                 | → It is fast<br>→ Relocation is not possible.                                                                                                                    |
| Register Indirect                        | add field specifies reg which contains effective add                                                                                                                                             | → Here indirection is obtained with shorter instn and also instn is faster.<br>→ It is bit slower (almost same) compared to direct mode but instn size is small. |
| Auto Inc/Auto Dec<br>(R)+<br>(R)-        | It is variant of reg indirect<br>Content of reg (effective add) is automatically inc/dec<br>Inc is post inc<br>Dec is pre dec                                                                    | → Used for accessing table of content (array)<br>→ The val to be inc is implicitly determined by size of operand accessed                                        |
| Indexed mode<br>(or)<br>Indexed reg mode | Effect add = add field + (index * w)<br>Content<br><br>Implementing this requires 2 instns<br>i) ind reg $\leftarrow w \times i \rightarrow$ array index<br>↳ element size.<br>ii) fetch operand | → Used to access arrays<br>→ Relocation is not possible<br>→ 2 instns req                                                                                        |
| Scaled mode                              | It is modified version of indexed mode<br>EA = add field + ind. reg * scaling factor<br><br>Here we just change the way EA is computed                                                           | → Only one instn req<br>∴ faster than indexed mode<br>→ Used to access arrays<br>→ Relocation is not possible.                                                   |

| Mode                                       | Description                                                                                                                                                                                | points.                                                                                                                                                                                                                      |
|--------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Index, base reg mode (or)<br>Based Indexed | Here we have 2 seg (int & base)                                                                                                                                                            | <ul style="list-style-type: none"> <li>→ It permits relocation</li> <li>→ Supports position independent code (PIC)</li> </ul>                                                                                                |
| Index, base + displacement mode            | used to access member within records<br>$EA = (\text{index} * w) + \text{base} + d$<br>$d \rightarrow \text{displacement}$                                                                 | <ul style="list-style-type: none"> <li>→ permits relocation</li> <li>→ Supports PIC</li> </ul>                                                                                                                               |
| Based reg mode                             | $EA = \text{base reg} + \underbrace{\text{add field}}_{\text{offset}}$                                                                                                                     | <ul style="list-style-type: none"> <li>→ used for inter segment jumping</li> <li>→ permits relocation.</li> <li>→ supports PIC</li> </ul>                                                                                    |
| PC Relative                                | <p>add field contains offset of branch target instn from PC (i.e., next add instn) (can also be -ve)</p> <p>In other words add field has (no of instn to be skipped) * (size of instn)</p> | <ul style="list-style-type: none"> <li>→ Reduces instn size (<math>\because</math> only offset is given)</li> <li>→ Supports only intra segment branching.</li> <li>→ Supports PIC.</li> <li>→ permits relocation</li> </ul> |

### Classification of addressing modes:

- Non-computable: Finding EA doesn't req any computations (1st six modes)
- Computable: Finding EA req computations (remaining modes)

→ while solving problems, remember to used value in PC as next instn add.

|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <p>CPU cycle: Amount of time in which one M-op can be performed</p> <p>CPI: no of cycles req per instn</p> <p>Execution time of an instn = CPI <math>\times</math> cycle time.</p> <p><u>Control Unit Organization</u></p> <p>CU generates control signal to all components</p> <p><u>Control Variable</u>: It is name given to a control signal</p> <p><u>Control Word</u>: Collection of all control signals</p> <pre>     CU           +-- Hardwired     +-- Microprogrammed                   +-- Horizontal         +-- Vertical   </pre> | <p><u>Hardwired CU</u>:</p> <ul style="list-style-type: none"> <li>Control logic is implemented with gates, ffs etc.</li> <li>A table is created (cycles <math>\times</math> instn) and logic is designed accordingly. For each control signal a boolean exp in terms of (T &amp; I) is developed.</li> <li>Timing generator is reset after every <math>\phi</math> instn.</li> </ul> <p><u>Microprogrammed CU</u>:</p> <ul style="list-style-type: none"> <li>Adv: Fast</li> <li>Disadv: not flexible to modify</li> </ul> <p><u>Microprogrammed</u>:</p> <ul style="list-style-type: none"> <li>Here control memory is used for storing control words.</li> <li>CAR: stores address of loc in control memory</li> <li>CDR: stores a word of control memory.</li> <li>Adv: flexible to modify</li> <li>Disadv: slower than hardwired.</li> </ul> |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|

Each location of control memory is of form

|                 |            |          |
|-----------------|------------|----------|
| Control signals | Mux select | next add |
|-----------------|------------|----------|

Based on how control signals are stored we have 2 types of M-programmed CU:

(i) Horizontal M-prog. CU:

Every control signal has one bit

(ii) Vertical M-prog CU:

$$\begin{aligned} \text{no of loc in} \\ \text{control mem} = \\ (\text{no of instns}) \times \\ (\text{no of M-instns} \\ \text{per instn}) \end{aligned}$$

Control signals are divided into groups such

that only one signal will be active in any

group. So we ~~fn~~ decode this and store in control word.

And decoding is required for execution.

n-control signals - exactly  $1 - \lceil \log_2 n \rceil$  bits

n-control signal - atmost  $1 - \lceil \log_2(n+1) \rceil$  bits

Note:

Horizontal

More control mem

faster

no need of decoder

Nano-Programmed CU:

vertical

less control mem

slower

decoder is needed

→ Instns may have similar M-instns.

Here a memory is dedicated for storing all unique control signals.

→ In control signal field of control mem

we store address which points to a location in nano memory where instns are stored.

|                     | RISC                                                     | CISC                    |
|---------------------|----------------------------------------------------------|-------------------------|
| instns              | less, fixed length                                       | more, variable length   |
| Add mdy             | less (only Direct Address)                               | more                    |
| CU                  | Hardwired                                                | M-prog                  |
| CPI                 | 1                                                        | >1                      |
| operation           | Reg to Reg                                               | Reg to Mem / Mem to Reg |
| reg availability    | more<br>provides reg window<br>∴ efficient param passing | less<br>no reg windows  |
| Pipelining          | more compatible                                          | less compatible         |
| Compiler design     | easy                                                     | difficult               |
| Programmer overhead | more                                                     | less                    |

## IO Organization

→ Need for IO interface

- (i) Signal conversion
- (ii) Data format conversion
- (iii) Synchronization
- (iv) Management & Control

→ ~~Req~~ The Every IO device has an address.

IO bus also has add, data, control buses

→ 3 types of bus organizations:

(i) Separate bus for IO & Memory:

→ Costly organization

→ Generally used when separate IOP is available.

(ii) IO Mapped IO (sep common add/data bus, ab)

- (a) Port Mapped IO separates ~~system~~ control bus
- (b) Isolated IO

→ Since add, data buses are same, control lines specify whether memory is being accessed or IO devices.

(iii) Mem mapped IO (all buses are same)

→ Interface registers are mapped onto memory.

→ So IO devices are accessed just using load & store.

| IO Mapped IO                                    | Mem mapped IO                                                                     |
|-------------------------------------------------|-----------------------------------------------------------------------------------|
| (i) No separate add space for IO                | (i) No separate add space for IO                                                  |
| (ii) All mem access insns less IO access insns. | (ii) All mem access insns are used for IO<br>∴ more insns & more addressing modes |
| (iii) less IO devices                           | (iii) can support more IO devices                                                 |

## Data Transfer

### Synchronous

→ Same clock for both

the devices

→ Data transfer rate depends on slower device

→ So clock rate depends on slower device

### Asynchronous

→ diff clock.

→ Types:

- ↓ strobe signal
- ↓ handshake
- ↓ Async. serial

### Data transfer



→ Single data line

→ Multiple data lines

## Strobe Signal:

### Source initiated:



### Destination initiated:



→ Strobe signal disabled  $\Rightarrow$  data on the bus is invalid.

## Handshaking (Refer Notes)

### Asynchronous serial:

→ Start bit & stop bit are used

sent at the beginning of a byte transfer.

sent at the end of a byte transfer.

∴ start bit = 0 & stop bit = 1

when 0 is seen, dest understand data on the line is valid.

Then after the byte is transferred, 1 will remain on the line.

→ Due to start & stop bit, effective data rate is reduced.

## Modes of Transfer:

### i) Programmed IO:

→ A program with loop stmt is used.

→ This loop continuously checks status register and transfers byte/word as soon as it is ready.

∴ time for which CPU is blocked = preparation time of data by IO device.

### ii) Interrupt Initiated IO:

→ Here IO devices interrupt CPU as soon as its byte/word is ready for transfer.

∴ time for which CPU is blocked = Interrupt + service + Transfer overhead routine time

↓  
execution of current inst  
pushing data onto stack  
default service routine etc.

## Direct Memory Access:

### Steps:

i) DMA Req (IO device to DMAC)

ii) HOLD Req (DMAC to CPU)

iii) CPU writes starting add, Data count, add of mem location of disk.

iv) Hold Ack (CPU to DMAC)

v) DMA Ack (DMAC to IO Device)

vi) Data transfer

vii) DMAC interrupts CPU saying that transfer is finished.

### Registers in DMAC:

i) starting add: Add of mem loc that has to be accessed.

incremented after every access.

ii) Data count: no of bytes/words to be accessed. Decrement after every access. Transfer finishes once Data count hits 0.

iii) add in IO device: Address of mem of IO from which data transfer has to be done.

## Modes of DMA transfer:

### i) Burst mode:

DMAC transfer one block or a burst of data and returns the bus.

Steps : i) Bus req  
ii) Data transfer  
iii) Bus release } Total time

∴ time CPU blocked =  $\frac{\text{block transfer time}}{\text{block TT + block prep time}} \times 100$ .

### ii) Cycle Stealing mode:

Here bus is taken and released for transfer of each byte/word. So ~~bus~~ is no of bytes to transfer.

Steps : i) Bus req  
ii) transfer byte/word  
iii) Bus release } Total time =  $N \times t_p$  3 steps  
∴ ~~bus~~ has

∴ CPU blocked =  $\frac{t_p}{t_p + t_{bp}} \times 100$   $t_p \rightarrow$  transfer time of 1 word  
 $t_{bp} \rightarrow$  prep time of 1 word

Burst mode has high throughput than cycle-stealing.

### (iii) Interleaving DMA:

- Here DMA gets system bus whenever the CPU is not using it.
- ∴ CPU will not be blocked in this case.

Note:

- DMA is used for large data transfers
- interrupt is used for small data transfer.
- throughput of Burst mode > throughput of cyclestealing mode > throughput of interleaved DMA
- ~~Normal~~ Normal interrupts are serviced after the execution of current instruction. But DMA interrupt is serviced as soon as interrupt is generated.

## Interrupts

- steps by CPU when interrupt occurs
  - (i) Finish current instr. execution
  - (ii) Save the current status
  - (iii) Branch to ISR
  - (iv) Resume previous process.
- ISR is part of device driver code.
- vectorized Interrupt      Non-vectorized Interrupt
  - vector add is given
  - vector add not given  
So CPU runs default service routine.
- Maskable & Non-Maskable interrupt.
- External Intr    Internal Intr    Slow Intr
  - generated by some device
  - execution of divide by 0, Pg-fault etc.
  - System calls.

### Handling Simultaneous interrupts:

- when CPU gets interrupts simultaneously, they are handled based on priority.



#### Polling:

- A slow runs checking each device if it has generated the interrupt. The order of checking decides priority.

If no of devices are more then polling takes lot of time after which service may not be useful.

### Daisy Chaining:



- when INTREQ is sent, CPU understands interrupt occurs
- Now INTACK is sent (1st to highest priority device)
- INTACK is passed from high priority device to low priority device.
- Finally after ack (0 or 1) is passed through all the devices ~~INTACK~~, INTREQ will be 0 and service is started.
- In daisy chaining the device which is electrically closer gets high priority.

Note:

- high priority is given to interrupts which if not serviced would lead to serious problems
- Also higher priority is given to faster devices like disk.

## Memory Organization

| SRAM                                                             | DRAM            |
|------------------------------------------------------------------|-----------------|
| RAM                                                              | Capacitors      |
| ROM                                                              | refresh         |
| calculating no of chip pins                                      | fast & cost     |
| RAM → 3 + Add + data                                             | slow, cheap     |
| ROM → 1 + Add + data                                             | less idle       |
| In RAM, read has high priority.                                  | power con.      |
| If volt pin is present, then we need ground pin too.             | high idle       |
| Multiple Chip Support (Refer notes if any doubt)                 | more operations |
| DRAM Chip:                                                       | power con.      |
| Refresh time of one chip = (no of rows) * (refresh time per row) | low operations  |

### DRAM Chip:

$$\text{Refresh time of one chip} = (\text{no of rows}) * (\text{refresh time per row})$$

- All chips can be refreshed in parallel.



## Associative Memory

- Search by content; parallel search (∴ fast) & costly
- Matching vector is maintained.
- Applications: TLB & Cache (Fully associative)

Locality of Ref ← Temporal  
Spatial

### Cache Memory:

Hit latency: time to find whether hit or miss  
Miss latency: time to get something from MM.

→ If there is miss, the req word is transferred to CPU directly from MM (also to cache)

$$\rightarrow \text{EHT} = h \cdot (\text{hit time}) + (1-h) \cdot (\text{miss time})$$

Sim:  $T_{avg} = h \cdot c + (1-h) \cdot m$

tie:  $T_{avg} = h \cdot c + (1-h) \cdot (c+m)$

### Cache Write:

| Write through                  | Write Back                                                      |
|--------------------------------|-----------------------------------------------------------------|
| → $T_{avg} = \text{Max}(C, M)$ | → $T_{avg} = h \cdot C + (1-h)(T_{new} + T_{dirty} + T_{back})$ |
| → No cache coherence           | Moving block from MM to cache<br>dirty block write back time    |
| → good for less freq writes.   | → Cache coherence present                                       |

Write allocate: Missed block is brought to cache.  
→ used with write back.

No write allocate: used with write through

### Mapping:

#### Direct Mapping:

$$\rightarrow \text{Cache block} = (\text{MM block}) \% (\text{no of cache blocks})$$



$$\rightarrow \text{tag bits} = \log_2 \left( \frac{\text{MM size}}{\text{cache size}} \right)$$

#### Hardware:

→ size of Mux:  $2^x$ ; no of Muxes = no of blocks  
→ no of blocks

→ One n-bit comp  
n → no of tag bits

$$\text{bit latency} = \text{Mux delay} + \text{Comp delay.}$$

#### Set Associative Mapping:

$$\rightarrow \text{cache block} = (\text{MM block}) \% (\text{no of sets})$$

$$\rightarrow \text{tag bits} = \log_2 \left( \frac{\text{MM size}}{\text{cache size}} \right) + \log_2 k$$

#### Hardware (Max, Comp, OR gate)

→ size of Mux:  $2^x \times 1$ ; no of Muxes = no of sets

→ no of comp = k; n-bit comp

n → no of tag bits

→ k-ip OR gate

$$\text{bit latency} = \frac{\text{Mux delay}}{\text{delay}} + \frac{\text{Comp delay}}{\text{delay}} + \frac{\text{OR gate delay}}{\text{delay}}$$

#### Associative Mapping:



→ tag bits =  $\log_2 (\text{MM block})$

#### Hardware: (compt+OR)

→ no of comp = no of blocks

→ size of comp = no of tag bits

→ size of OR gate = no of comp = no of cache blocks

$$\text{bit latency} = \frac{\text{COMP delay}}{\text{delay}} + \frac{\text{OR gate delay}}{\text{delay}}$$

#### Block Replacement (FIFO, LRU, Optimal)

→ Direct Mapping doesn't need any replacement technique.

#### To Cache Misses:

- Cold / Compulsory Miss } For any Miss
- Capacity Miss } check in this
- Conflict Miss } order

→ Conflict Miss is never possible in fully associative.

#### Note:

$$\text{Tag dir. size} = (\text{no of tag bits}) * (\text{no of cache blocks})$$

→ V/I bit, Modified bit (Dirty bit)

#### Multilevel Cache:

→ Smaller cache → less access time } Multilevel

→ Large cache → More hit rate } Cache

→ we may even use inst<sup>n</sup> cache, data cache which helps in pipelining.

$$\text{Sim: } h_1 E_1 + (1-h_1) h_2 (t_2) + (1-h_1)(1-h_2) (t_3)$$

$$\text{tie: } t_1 h_1 (t_1) + (1-h_1) h_2 (t_1 + t_2) + (1-h_1)(1-h_2) (t_1 + t_3)$$

#### Cache Inclusion Policies

| Hit location   | Inclusion policy                                                        | Exclusion policy                                                  |
|----------------|-------------------------------------------------------------------------|-------------------------------------------------------------------|
| L <sub>1</sub> | Access L <sub>1</sub>                                                   | Access L <sub>1</sub>                                             |
| L <sub>2</sub> | Move L <sub>2</sub> to L <sub>1</sub><br>L <sub>1</sub> to MM           | Swap(L <sub>1</sub> , L <sub>2</sub> )                            |
| MM             | MM to L <sub>2</sub> ; MM to L <sub>1</sub> ; Invalid to L <sub>1</sub> | MM to L <sub>1</sub><br>L <sub>1</sub> to L <sub>2</sub> (victim) |

→ In Exclusion policy L<sub>2</sub> is called victim cache.

value inclusion: value of block in L<sub>1</sub> is same as that of the blocks in L<sub>2</sub>.

Note:

→ Accessing cache doesn't req. System bus (or databases). CPU uses internal bus for cache access.

## Magnetic Disk

- platter; surface; track; sector
- consecutive sectors within a track form a cluster.
- Sector is smallest addressable unit.
- 2 ways of organizing:
  - (i) Constant Sector Capacity (Const. Angular velocity)
  - All sectors have same amount of data.
  - Density varies (less in outer tracks)

$$\text{Disk Capacity} = 2 * (\text{no of platters}) * (\text{tracks/surface}) * (\text{sectors/track}) * (\text{sector capacity})$$

(ii) Variable Sector Capacity (Const. linear velocity)

→ Storage density is constant

$$\text{Disk capacity} = 2 * (\text{no of platters}) * (\text{surface capacity})$$

Disk Access time: time to access 1 sector

$$\text{Disk access time} = \text{seek time} + \text{Rotational latency} + \text{transfer time}$$

(for 1 sector)

→ In one rotation, an entire track can be transferred.

Cylinder: Logical collection of tracks of same radius.

- Using cylinders reduces seek time.
- Data is stored cylinder wise.
- No of cylinders = No of tracks.
- Obtain sector num from  $\langle C, h, S \rangle$



Note:

Number of sectors is done from outermost cylinder to innermost cylinder.

Disk Blo: Amount of data transferred divided by time blo first request for service and completion of last transfer.

## Parallel Processing:

SISD: Eg: von neuman computer

SIMD: Eg: pipelining

MISD: not practically used

MIMD: Eg: super scalar computer.

→ contains multiple pipelines.

## Pipelining

→ Sequential process is divided into stages or segments.

→ k-stage pipeline:

$$\text{time for } n \text{ tasks} = (k+n-1)t_p$$

$t_p \rightarrow$  pipeline cycle time.

$$* \text{ Speedup} = \frac{\text{non-pipeline time}}{\text{Pipeline time}} = \frac{n*t_p}{(k+n-1)t_p}$$

$$* \text{ Ideal speedup} = k$$

$$\rightarrow \text{Pipeline cycle time} = \max(\text{segment delay}) + \text{Reg. Delay}$$

$$\rightarrow \text{Execution time in non-pipeline} = \text{sum of segment delays}$$

## Inst<sup>n</sup> Pipeline:

→ This is pipeline implemented on inst<sup>n</sup> cycle.

→ For a branch inst<sup>n</sup>, if target is evaluated in i<sup>th</sup> stage, then no of stalls = i - 1

$$\text{CPI}_{\text{ideal}} = 1 + \left( \frac{\text{stall}}{\text{freq}} \right) * \left( \frac{\text{no of stall}}{\text{cycle}} \right)$$

Pipeline Latency: It is amount of time after which machine takes next input.

for seq. it is sum of segment delays

for pipeline it is ~~max~~ max(segment delay) + seq. delay

Throughput: no of inputs processed per unit time.

$$\text{throughput} = \frac{n}{(k+n-1)t_p}; \text{ ideal throughput} = \frac{1}{t_p}$$

$$\text{Pipeline efficiency} = \frac{s^l}{s} \times 100$$

$s^l \rightarrow$  speedup with Hazards     $s \rightarrow$  speedup without hazards

## Pipeline Hazards

### (i) Structural Hazard / Resource Conflict

→ Occurs when 2 ips try to use same stage or 2 stages try to use same resource

Soln: replication of resources.  
• using i-cache, d-cache.

### (ii) Data Hazard:

#### Read after Write (Data Dependency)

$$R_1 \leftarrow R_2 * R_3 \quad \text{IF ID OF EX WB} \\ R_4 \leftarrow R_1 * R_5 \quad \text{IF ID - - OF EX WB}$$

Soln: Instn rescheduling / Delayed load (provided by compiler)

- Insert independent instns or NOP operations.
- no of insertions = no of stalls

#### H/W Interlock:

Insert stalls by locking operand fetch

#### Operand Forwarding / Bypassing:

$$\text{IF ID OF EX WB} \quad \text{Decision unit} \\ \text{IF ID OF EX WB} \quad \text{is used}$$

→ operand forwarding is used only with ALU to ALU dependency.

$$\text{Ex: } R_1 \in M[\text{add}] \quad R_4 \leftarrow R_1 * R_5 \\ R_4 \leftarrow R_1 * R_5 \quad M[\text{add}] \leftarrow R_4$$

In above 2 cases, operand forwarding cannot be used due to ALU to ALU dependency.

#### Write After Read (WAR) (Anti-Dependency)

$$R_1 \leftarrow R_2 + R_3$$

$$R_2 \leftarrow R_4 * R_5$$

#### Write After Write (WAW) (Write Dependency)

$$R_1 \leftarrow R_2 * R_3$$

$$R_1 \leftarrow R_4 + R_5$$

→ Soln for WAR & WAW is register renaming

#### Register Renaming:

Using another reg. instead of common register that is causing problem.

So we can write above 2 codes as

$$R_1 \leftarrow R_2 + R_3 \quad | \quad R_1 \leftarrow R_2 * R_3 \\ R_6 \leftarrow R_4 * R_5 \quad | \quad R_6 \leftarrow R_4 + R_5$$

→ It is H/w soln. So H/w has to detect dependencies and rename.

## (iii) Control Hazard / Branch Difficulty

Sols:

Delayed Branch (if provided by compiler).

→ some independent instns are put in delayed slot. If no independent instn then Nop.

#### Prefetch Target Instn:

→ Here when branch instn is detected, 2 pipeline will be initiated. After finding correct target other pipeline will be discarded.

(Gr)

Here we fetch the target instn in addition to the instn following. As soon as the target is found we have already prefetched.

#### Loop Buffer:

→ contains most recently fetched instns.

→ If loop is small enough to fit within the buffer, then instn can be taken from buffer without going to MM.

#### Branch Prediction:

##### Static

- Branch always taken
- Branch never taken

##### Dynamic Prediction

- Bit implementation
- Branch Target Buffer

#### Bit Implementation:

##### single bit:



B - Branch

NB - No Branch

00 - strong No branch

11 - strong branch

01 - weak no branch

10 - weak branch

→ Every branch instn has its own bit.

#### Branch Target Buffer / History Table:

→ Prediction is made from entries in the buffer.

If it is wrong then we update the buffer.

| Branch instn | Target instn |
|--------------|--------------|
| :            | :            |

single bit is maintained to see branch status.

→ Tags are maintained to match with an address.

→ It is a small cache that caches recent information.

Miss → normal cache penalty

Hit + wrong prediction → penalty

Hit + correct prediction → No penalty.

## Non-linear Pipeline:

→ Here a task may go back to previous stages.

→ Non-linear pipeline is represented using reservation table.

|    | 1 | 2 | 3 | 4 | 5 |
|----|---|---|---|---|---|
| S1 | X |   | X |   |   |
| S2 |   | X |   |   | X |
| S3 |   |   |   | X |   |

Collision: When 2 or more inputs need same stage at same time.

Permissible latency: Latency which doesn't cause cycle.

Forbidden latency: Latency that causes collision.

$$S_1: 3-1=2$$

$$S_2: 5-2=3$$

∴ 2, 3 are forbidden latencies

1, 4, 5 are permissible latencies

Collision Vector: It is an n-bit vector such that

$c_i = 0$ , permissible latency

$c_i = 1$ , forbidden latency

Col

Here we have  $\begin{matrix} 5 & 4 & 3 & 2 & 1 \\ 0 & 0 & 1 & 1 & 0 \end{matrix}$

State Diagram:

→ Take a state S

→ For every 0 at position i, right shift by i positions.

Now perform bitwise OR with original collision vector.



Simple Cycle: It is a cycle such that any state appears exactly once.

(1,4) (1,5) (4,5) (4,1) (5,1)

Greedy Cycle: This is cycle obtained by choosing minimum latency at every state

(1,4) is the only greedy cycle.

$$\text{avg. latency of } (1,4) = \frac{1+4}{2} = 2.5$$

Minimum Average Latency: It is minimum avg. among all greedy cycles.

## Memory Interleaving

→ Memory is divided into modules and module internal operations can go in parallel.

But addressing is not overlapped.

→ k-way interleaving means k modules.

2 types

### Lower Order Interleaving

• LSB bits are used for modules selection.

• Consecutive words are stored in wrap-around way.

• Suffers from spatial locality.

### Higher Order Interleaving

• MCB bits for module selection.

• Consecutive words stored in same module.

• Spatial locality is present.

In lower order interleaving accessing n words would required  $n(t_d) + m$

$t_d \rightarrow$  decoding time for module selection

$m \rightarrow$  memory access time.

This formula works only if  $k \cdot t_d \geq m$ . Where 'k' is interleaving.

## Floating Point Representation

→ Floating point can represent large range compared to fixed point.

| S      | Exponent | Mantissa |
|--------|----------|----------|
| biased |          |          |

biased

→ If exponent is k bits  $\Rightarrow$  bias =  $2^{k-1}$

Biased exponent,  $E = e + 2^{k-1}$

Biased makes comparison easy

original exponent

### Explicit Normalization of Mantissa

### Implicit normalization of Mantissa (Default)

value =  $(-1)^S \times 0.M \times 2^{E-\text{bias}}$

value =  $(-1)^S k \cdot M \times 2^{E-\text{bias}}$

Mantissa can never be all zeroes

Mantissa can be zeroes.

Note:

More bits in Mantissa  $\Rightarrow$  more precision (accuracy)

more bits in exponent  $\Rightarrow$  range is more.

### Disadvantages:

→ No representation for 0, ∞.

→ cannot store a number which cannot be normalized

### IEEE-754 floating point representation:

#### Single Precision (32)

| S | 8 | 23 |
|---|---|----|
| S | E | M  |

#### Double Precision (64)

| S | 11 | 52 |
|---|----|----|
| S | E  | M  |

$S \Rightarrow E = \text{all } 0\text{'s}/\text{all } 1\text{'s} \Rightarrow$  special number.

→ Mantissa is implicit normalized.

→ bias =  $127/1023$

| S   | E      | M    | number                |
|-----|--------|------|-----------------------|
| 0/0 | 00...0 | 0..0 | ±0                    |
| 0/1 | 11...1 | 1..1 | ±∞                    |
| 0/1 | 0...0  | ±0   | Denormalized/Fraction |
| 0/1 | 1...1  | ±0   | NAN (Not a Number)    |
| 1/1 | rest   | rest | implicit normalized   |

value =  $(-1)^S \times 1 \cdot M \times 2^{E-\text{bias}}$  bias  $\rightarrow 127/1023$ .

Denormalized: used to represent smaller fractions.

write num as  $0.0...0 \times 2^{-126}$  or  $0.0...0 \times 2^{-1022}$

Now we store this as  $E=00...0$  &  $M \neq 0$ .

$\therefore$  value =  $(-1)^S \times 0.M \times 2^E$   $E \rightarrow -126/-1022$

NAN: In single precision max value of E is 127.

$\therefore E = e+127 = 254 \Rightarrow 255$  is never possible so not a number.

But in double precision 2047 is never possible.

Note: Numbers close to 0 are densely distributed and numbers away from 0 are sparsely distributed.