



Lecture - 1 & 2 || Combinational Logic - 1

ORAND:XOR:

$$\begin{array}{c} A \\ \text{---} \\ B \end{array} \rightarrow \text{XOR gate} \quad Y = A \oplus B \\ = AB + \bar{A}\bar{B}$$

X-NOR:

$$\begin{array}{c} A \\ \text{---} \\ B \end{array} \rightarrow \text{X-NOR gate} \quad Y = \overline{A \oplus B} \\ = \bar{A}\bar{B} + A\bar{B}$$

ROM : (Read Only Memory)

| Input<br>(Mem Add) |                | Output<br>(Contents of Mem) |                |                |
|--------------------|----------------|-----------------------------|----------------|----------------|
| A <sub>1</sub>     | A <sub>0</sub> | Y <sub>2</sub>              | Y <sub>1</sub> | Y <sub>0</sub> |
| 0                  | 0              | 0                           | 1              | 1              |
| 0                  | 1              | 0                           | 0              | 1              |
| 1                  | 0              | 0                           | 0              | 0              |
| 1                  | 1              | 1                           | 0              | 0              |



(?)

input → 2bit address of mem location

output → 2bit contact of that mem. ~~location~~.

# Programmable Logic Array (PLA)

- A circuit.
- SOP
- Hardware



|        |                  |
|--------|------------------|
| 1 byte | = 8 bit          |
| 1 KB   | = $2^{10}$ bytes |
| 1 MB   | = $2^{20}$ bytes |
| 1 GB   | = $2^{30}$ bytes |
| 1 TB   | = $2^{40}$ bytes |

## Ripple Carry Adder:



## Fast Adder:



## ALU:



$$A - B = A + \bar{B} + 1$$

$$A + B = A + B$$

Adder  
Full  
Sum  
Inv

|    |           |
|----|-----------|
| 00 | : Add/Sub |
| 01 | : AND     |
| 10 | : OR      |

## Encoder

Input =  $2^n$   
Output =  $n$   
 $\frac{2^2}{2} \cdot 4$

- Combinational logical circuit
- Output specifies which set of inputs has occurred.

| $n$ | $B$ | $C$ | $D$ | $X$ | $Y$ |
|-----|-----|-----|-----|-----|-----|
| 1   | 0   | 0   | 0   | 0   | 0   |
| 0   | 1   | 0   | 0   | 0   | 1   |
| 0   | 0   | 1   | 0   | 1   | 0   |
| 0   | 0   | 0   | 1   | 1   | 1   |

for input to  
out put we,

$$Y = B + D$$

$$X = C + D$$



## Decoder

Input =  $n$   
Output =  $2^n$

- takes a code
- turns on one of several output wires.

| $X$ | $Y$ | $A$ | $B$ | $C$ | $D$ |
|-----|-----|-----|-----|-----|-----|
| 0   | 0   | 1   | 0   | 0   | 0   |
| 0   | 1   | 0   | 1   | 0   | 0   |
| 1   | 0   | 0   | 0   | 1   | 0   |
| 1   | 1   | 0   | 0   | 0   | 1   |

for input  
to out put  
we,

$$A = \bar{X} \bar{Y}$$

$$B = \bar{X} Y$$

$$C = X \bar{Y}$$

$$D = X Y$$



Multiplexor

- used to select one of possible input var.
- takes n input & choose among them.

One bit multiplexor:



Two bit multiplexor:



# Sequential 1

20/05/19

## RS flip-flop :

Nand Latch:



## NAND Gate:

| I <sub>1</sub> | I <sub>2</sub> | Output |
|----------------|----------------|--------|
| 0              | 0              | 1      |
| 0              | 1              | 1      |
| 1              | 0              | 1      |
| 1              | 1              | 0      |



| S | R | Q | Q-bar         |
|---|---|---|---------------|
| 0 | 0 | 1 | 1 (forbidden) |
| 0 | 1 | 1 | 0 (set)       |
| 1 | 0 | 0 | 1 (reset)     |
| 1 | 1 | 0 | 1 (uncertain) |

काम नहीं करने का '0' इनपुट है।  
योद्धा देता है।

मूलतः output : रखना चाहिए।  
Input जो नहीं करना चाहिए।  
किसी रूप से।

## SR flip flop:



| S | R | Clk | Q          |
|---|---|-----|------------|
| 0 | 0 | ↑   | uncertain  |
| 0 | 1 | ↑   | 0          |
| 1 | 0 | ↑   | 1          |
| 1 | 1 | ↑   | forbidden. |

↑  
उचित बिंदु तरीके

↑  
उचित बिंदु तरीके

## D-flip flop:

Data flip flop, यांचे तांत्रिक उद्देश संग्रह करण्याचा वापर.



| J | K | CLK | Q         |
|---|---|-----|-----------|
| 0 | 0 | ↑   | unchanged |
| 0 | 1 | ↑   | 0         |
| 1 | 0 | ↑   | 1         |
| 1 | 1 | ↑   | Toggle    |

| D | CLK | Q |
|---|-----|---|
| 0 | ↑   | 0 |
| 1 | ↑   | 1 |

→ या input फॉरम, जे अंतीम output →  
Save 2 bits,

D=0 म्हण॑ने J=0, K=1

D=1 म्हण॑ने, J=1, K=0

| T | CLK | Q         |
|---|-----|-----------|
| 0 | ↑   | unchanged |
| 1 | ↑   | Toggle    |

## Registers

- A set of flip flop.
- Holds the values of some var.

## 8 bit Reg



q7 q6 q5 q4 q3 q2 q1 q0

## Shift Reg: 4 bit right shift



1000 0000 1000  
Q0 EM

4 bit right shift

DRAW 8 bit left shift ??



0000 0000 0000  
Q0 EM



## Laptop & Desktop:

- one machine
- one user
- at a time.

## Server:

- one machine
- multiple user
- at a time.

## Embedded Computer:

- built for specific reason
- particular (उत्तम कार्य वाला)  
का उपयोग.

## Multi Core:

- Many CPU



- One core ~~single~~ chip

## Single Core:

- One CPU



- One ~~core~~ chip

## Many Core:

- Many CPU



- Many ~~physical~~ chip

Example: Cloud server,

## Embedded Core:

- Design a chip to do a particular work.

## Cloud:

- server ~~fast~~ जटिल
- use जटिल ग्रेड मशीन.
- need internet.

## Computers are pervasive:

- जटिल com carry नहीं easy,
- land phone वाले use जटिल.

## Understanding Performance:

### Algorithm:

- Determine num of operations executed.

### Programming language, compiler,

### Architecture:

- Determine num of machine instructions executed per operation

### Processor & mem system:

- Determine how fast instructions are executed.

### I/O system:

- Determine how fast I/O operations are executed.

# Below your Programs

(acronym)

## Application Software

- Written in high level lan.

## System Software

Compiler: translate HLL code to machine code.

OS: managing mem & storage scheduling tasks.

## Hardware

Processor, mem, I/O controllers



## Levels of programming codes

- i) High level lan
- ii) Assembly lan: close to machine lan.
- iii) Hardware representation:
  - Binary digits.
  - Encoded instructions & data.

PMD: Personal mobile device: small wireless device to connect to internet

Cloud Computing: refers to large collection of servers that provide service over the internet

SaaS: Software as a Service: delivers s/w & data as a service over the internet.

Web Search, Social Networking

## Components of Computer

- Same components for all kind of computer.
- Desktop, server, embedded.
- Input / output includes
  - User Interface devices  
display, keyboard, mouse
- Storage device.  
Hard disk, CD/DVD, flash.
- Network adapters  
for communication with other computers



## Inside the Processor (CPU)

Data Path: performs operation on data.

Control: seq of data path, memory.

Cache Mem: Small fast SRAM for immediate access to data.

## Abstractions

- Abstraction helps us deal with complexity.
- Instruction Set architecture (ISA)

- Application binary interface.
- Implementation.

## What is Computer Architecture?

- The science and art of
- designing: ~~not~~ hardware for the design ~~the~~ ~~to~~,
- Selecting component select ~~the~~ ~~to~~,
- Connect ~~the~~ ~~not~~ hardware ~~to~~ ~~in~~.

} Designing the hardware / software interface to create a computing system

- that meets functional goals
- performance
- cost

- Hardware programming  
Lang is very important.

- less compilation time/  
Same problem, execution time  
in ~~not~~ ~~not~~ compiler.
- ~~less~~ - less instruction ~~not~~  
Compiler.

- In C programming ~~is~~ ~~not~~  
compiler ~~not~~ ~~not~~ hardware programming language.  
So ~~not~~ ~~not~~ assembly compiler than ~~not~~.

## Interface



## ISA

- An **Abstract interface**
- between **hardware** & **Software** (lowest level) of a machine
- that encompasses all the information necessary.
- to write a **machine language program**
- that will run correctly, including instructions reg, mem access, I/O, so on.

## Data Path

- A data path is a set of **functional units**
- that carry out **data processing**.
- Data paths along with Control unit, make up the CPU of a computer system.
- A larger datapath can only be ~~configured~~ created by **joining** more than one together

Clock Cycle = Instruction Count  $\times$  Clock per Instruction

Instruction Count  $\geq$  10 with min.

useful instruction  $\geq$  10 with Clock min.

CSE-340 U95 Name: 31/05/19

## Lecture 1 (Act 1)

### Computer Mem Hierarchy



RAM

Virtual Mem

Hard Disk



## Multi Processor



= it is a tightly coupled computer system.

- Multi Core

- A computer system



- two or more CPU

- each <sup>one</sup> shares the common main mem & I/O device

- It helps in simultaneous processing of programs.

- It improves computing speed, performance and cost effectiveness.

## Role of computer Architec:

① Look Backward: Analyze and evaluate the past.

② Look forward: Be a dreamer. Be creative.

③ Look up: Understand important problems and their nature.

④ Look down: Understand the capabilities of the underlying technology. Want design + need 2nd gen, technology support etc etc.

## Response Time:

\$7 \$5 \$3 ~~A~~

Response time = CPU  
start time - end time  
= time taken after  
end time

- Total time taken to complete a task/instruction
- start time to ~~end~~ time.
- begin time to ~~end~~ time
- If you have 2 processor one taken 3 s another one taken 3 s. Then 3 s is good one
- It is performance indicator.
- ~~but~~ 2nd one.

## Throughput:

- How many bits it can process.
- Per unit time is throughput of a transmisor.
- How many bits it can receive per unit time is the throughput of a receiver.
- Throughput can be the throughput of a channel/transmisor/receiver.
- Throughput of a processor is how many task/instruction you can process per unit of time.
- ~~for~~ ~~not~~ ~~not~~,

① Replace a faster version?

- Response time ~~not~~ ~~100~~ 98 thousand  
bytes, ms,

② Adding additional processor to a system?

- Throughput ~~will~~ ~~not~~ ~~be~~ response time same

means, ~~the~~ ~~same~~ ~~time~~ ~~for~~ ~~all~~ ~~processors~~.

Time ~~is~~ ~~not~~ ~~20~~ t<sub>1</sub>, ~~but~~ ~~20~~ t<sub>1</sub> ~~per~~ ~~ms~~ ~~of~~,

so 2 ~~in~~ ~~ms~~ ~~not~~ ~~20~~ t<sub>1</sub> ~~ms~~, That  
means, <sup>per</sup> unit of time ~~is~~ ~~not~~ ~~20~~ t<sub>1</sub>, ~~but~~ ~~20~~ t<sub>1</sub>.

response time separately count ~~20~~ t<sub>1</sub>, ~~20~~ t<sub>1</sub>

ms ~~is~~ ~~not~~ ~~20~~ t<sub>1</sub> ~~ms~~ ~~but~~ ~~same~~ ~~time~~ ~~20~~ t<sub>1</sub>.

$$\text{Performance} = \frac{1}{\text{Execution Time}}$$

It's execution  
time ~~to~~ ~~the~~  
~~unit~~ ~~performance~~

② X is faster than Y. Now calculate & how  
much slow/fast are the execution time &  
Performance between each other?

- X is faster means,

$$\text{performance}_X > \text{performance}_Y$$

$$\Rightarrow \frac{1}{\text{execution time}_X} > \frac{1}{\text{execution time}_Y}$$

$$\Rightarrow \text{execution time}_Y > \text{execution time}_X$$

② X is n times faster than Y?

$$\text{performance}_X = n \times \text{performance}_Y$$

$$\Rightarrow \frac{\text{performance}_X}{\text{performance}_Y} = n.$$

$$\Rightarrow \frac{\text{execution time}_Y}{\text{execution time}_X} = n.$$

③ A take 10 sec & B take 15 sec which how

much faster A is then B?

$$\frac{\text{performance}_A}{\text{performance}_B} = \frac{\text{execution time}_B}{\text{execution time}_A} = n$$

$$\Rightarrow \frac{15}{10} = n$$

$$\Rightarrow n = 1.5$$

$$\left. \begin{array}{l} \text{ratio} = \frac{\frac{1}{10}}{\frac{1}{15}} \\ = f_0 \times \frac{15}{1} = 1.5. \end{array} \right\}$$

Execution Time - The time taken into the CPU.

- The time taken by ALU.

- The main time for data bus transfer, memory access, I/O, interrupt handling, etc.

Clock Cycle - Unit of task.

- Job is measured by job cycle.

- Time taken for one period of pulse.

CPU time / CPU Execution time for a program

=  $\frac{\text{Num of cycle} \times \text{Per cycle time}}{\text{for a program CPU takes}}$

CPU Execution time for a program =  $\frac{\text{CPU clock cycle for a program}}{\text{Clock rate}}$

2 sec  
कोरा  
clock cycle  
नियंत्रण

$$T = \frac{1}{f}$$

① Program runs 10 sec on computer which has 2 GHz clock. Program B run this program in 6 sec. B needs 12 time clock cycle then A. What is the clock rate of B?

$$\text{CPU time}_A = \frac{\text{num of clock cycle}}{\text{clock rate}}$$

$$\Rightarrow \text{num of clock cycle} = \text{CPU time} \times \text{clock rate}$$

$$= 10 \times 2 \times 10^9$$

Again

$$\text{CPU time}_B = \frac{\text{num of clock cycle for a program}}{\text{clock rate}}$$

$$\text{Clock rate} = \frac{\text{num of clock cycle}}{\text{CPU time}_B}$$

$$= \frac{1.2 \times 10 \times 10^9}{6}$$

$$= 2 \times 10^9$$

Ans

How many clock cycle  
CPU takes to execute a program.

$$\text{CPU}^{\text{num of}}_{\text{clock Cycle}} = \text{Instructions for a program} \times$$



① Computer A has a clock cycle time of 250 ps

& CPI of 2.0 for some program, Computer B has a clock cycle time of 500 ps and a CPI of 1.2 for the same program. faster which one?

$$\text{CPU Clock Cycle time} = \text{num of Instruction} \times \text{CPI}$$

Let

num of Instructions for both = I.

$$\therefore \text{CPU Clock cycle time}_A = I \times 250 \times 2.0$$

$$\therefore \text{CPU Clock cycle time}_B = I \times 500 \times 1.2$$

$$\text{CPU execution time}_B = \frac{\text{num of clock cycle} \times \text{per clock cycle time}}{\text{clock cycle time}}$$

*B type*

$$= I \times 2.0 \times 250$$

$$\approx 500 \text{ ips.}$$

$$\text{CPU execution time}_B = \frac{\text{num of clock cycle} \times \text{per clock cycle time}}{\text{clock cycle time}}$$

$$= I \times 2.2 \times 500$$

$$= 600 \times I \text{ ps.}$$

$$\frac{\text{CPU Performance}_A}{\text{CPU Performance}_B} = \frac{\text{Execution time}_B}{\text{Execution time}_A}$$

$$= \frac{600 \times I}{500 \times I}$$

$$\approx 22 \text{ 1.2}$$




Lecture no 01 (Ach)

time of instruction = 1. per  $\rightarrow$  195

CPU time = CPU clock cycle  $\times$  clock cycle time  
 for a program

CPU clock cycle  
~~number of steps = total num of instruction~~  
 for a program for program  $\times$  CPI (19)  
~~for a program~~  
~~time of instruction~~

- ① Sequence 1 has  $(2+1+2) \div 3 = 5$  instruction count.
- Sequence 2 has  $(4+1+2) \div 2 = 6$  instruction count.

CPU clock cycle of A =  $2 \times 1 = 2$

" " " of B =  $1 \times 2 \div 2$   
~~195.666 ms~~

" " " of C =  $(2+2 \times 1) \div 6 = 6$ .

∴ CPU clock cycle

~~length~~

CPU clock cycle

of encoder

Sequence, 1 =  $2+2+6 = 10$



length (bytes)



length (bytes)

$= 10$

$+ (1 \times 2)$

$+ (1 \times 3)$

$= 9$

So, seq 2 is faster.

$$CPI \text{ of seq. 1} = \frac{\text{CPU clock cycle for instruction}}{\text{Instruction count}}$$

Time taken by CPU  $\times$   $\frac{\text{clock cycle for instruction}}{\text{Instruction count}}$

~~Time taken by CPU  $\times$   $\frac{10}{5}$~~  =  $2 \times 10 = 20$

$$CPI_{seq. 2} = \frac{\text{CPU clock cycle for instruction}}{\text{Instruction count}}$$

Time taken by CPU  $\times$   $\frac{9}{6}$  =  $2 \times 1.5 = 3$

Time taken by CPU  $\times$   $\frac{5}{5} = 1$

## The Von Neumann Model



# The Harvard Architecture



Difference Between the Von Neumann & Harvard Model

- | Von Neumann                                                                                             | MASI                                                                      | Harvard Model |
|---------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------|---------------|
| i) Mem add is not separated.                                                                            | i) Mem has separated.                                                     |               |
| ii) Computer slower. coz bus we use.                                                                    | ii) Step faster coz instruction fetch and data access do not in same way. |               |
| iii) get data memory & write memory in one bus.                                                         | iii) fast mem & slow mem.                                                 |               |
| iv) Both mem are shared by both.                                                                        | iv) Read mem & write mem.                                                 |               |
| v) Either read or write.                                                                                | v) Read & write.                                                          |               |
| vi) Instructions and data use same bus system.                                                          |                                                                           |               |
| vii) Mem is same.                                                                                       |                                                                           |               |
| viii) In some system instruction can be stored in read only mem while data mem requires read-write mem. |                                                                           |               |

# Difference Between RISC & CISC:

CISC (Complex Instruction Set Computing)

① Size is variable

0018163008  
10110  
121314



② Large num of instruction

- Many addressing Model

③ Use more RAM and less

Reg. between 2048 MB

④ Need better hardware

powerful processing

MOV AX, 5

MOV AX, BX, CX, DX

MUL CX

MOV AL, BH, CL, DH

Harvard Model

Addressing is done by base and offset

RISC (Reduced Instruction set computing)

① size is fixed

0010  
1011  
110010



② few instruction

③ Use more reg and less RAM

④ Less costly hardware would

also work well

⑤ 16301180 → 1010 id length

DATA,

⑥ phone num for some length.

⑦ add id no. of base multi

⑧ Von Neumann Model.

## Chapter-02 (ACH)

14/06/16



### The MIPS Instruction Set:

32 bit MIPS instruction

→ all instruction length is 32 bit.

→ all reg also 32 bit.

$$\textcircled{1} \quad f = (g + h) - (i + j)$$

add \$t0, g, h  
 add \$t1, i, j  
 sub f, \$t0, \$t1

add \$t0, \$1, \$2  
 add \$t1, \$3, \$4  
 sub \$t0, \$t1, \$t1

lw R1, 16(R2)  
 R1 = mem[R2]

lw means:

load data to reg from mem

sw means:

store data to mem from reg

②  $g = h + A[8]$

$\$1 \quad \$2$

Data Mem.  
so comp mem (276 reg mem with address,

offset base add of A



lw \$t0, 32(\$3)

$$\Rightarrow t0 = \text{mem}[s_3 + 32]$$

base offset

add \$t1, \$t2, \$t0

③  $A[12] = h + A[8]$

base reg \$3      ↓      offset  
\$2                  \$3

$\Rightarrow \text{lw } \$t0, 32(\$3)$

add    \$t0, \$2, \$t0

sw      \$t0, 48(\$3)

We need four bytes for each word, so an offset of 10 words is an offset of 40 bytes.

Mem[\$3 + 40] = \$t0.

\$zero neg means = 0  
\$t value at,  
0 to (for first) 1025  
add to check 1020

add \$t2, \$s1, \$t2

\$s = saved reg.

\$t = temp.

\$gp = global pointer.

\$fp = frame pointer

\$sp = stack pointer  
\$a = argument reg  
\$v = value reg

RAM



④  $a = b + 5$

addi \$0, \$t, 5



lw \$t2, 300(\$s3)

base \$s3 - \$s2, offset



if ( $r_3 = r_1$ )

↳ level 1

at the level 2

→ base, rs, rt, level 2 (tan)

base rs, rt, level 2

~~base + rt, rt~~

Level 1:

Tan:

1110

$A[8]$  } offset  
base  
 $32(\frac{83}{2})$  } base  
} offset } main  
location add  
300.0.0.  
from where  
you want to  
read and  
write

$i(i^{(new)})$   
else  
 $a[i] = h + g - K(g)$

R  
E

4 bit sign

equivalent  
decimal  
num  
to binary

if (i < j) {  
 f = g[i];  
 add \$s1, \$s2, \$s3.  
 sub \$s4  
 bear rs, r1, if  
 if: add \$s0, \$s1, \$s2  
 \* sub \$s0, \$s1, \$s2.  
 Jump Exit.

bne \$s0, \$s1, el18  
 el18: sub \$s0, \$s1, wed  
 a = b + c - d  
 bto  
 add \$s0, \$s1, \$s3

mail \$s0, \$s1, \$s2  
 P  
 ixb (\$0)

while (sare r1) == k) ; i + 1  
 ↓      ↓  
 \$s6    \$s7  
 (r = i+1)    (1010)  
 \$s11    \$t1, \$s3, 2

f, add \$st1, \$t1, \$s6  
 lw \$t0, 0(\$t1)  
 lw \$t0, 0(\$t1)  
 ixb  
 lw \$t0, \$t1 (\$s6)

offset wrt base to  
 PTR. off base to  
 LWR.

bear \$t0, \$s5; ~~if~~ not wrt  
 Not Exit:  
 addi \$s0, \$s0, 1  
 J Exit.

i x (4)10  
 i x (100)2

MIPS Register File:

- two read ports
- one write port

(32 32 bit)

Reg are faster than Mem. but reg files with more locations are slower.

Sign Extension

ADDI \$7, \$3, 8

→ constant 16 bit  
But we need 32 bit.

- it is a hardware
- if you put 16 bit then it will be 32 bit always
- Sign num ১০ টানা 2's complement ১৮
- Unsigned num ১০ টানা ০ টানে সঁজ,

$$(2)_{10} = (0000 \ 0000 \ 0000 \ 0010)_2$$

$$\begin{array}{r} 1111 \ 1111 \ 1111 \ 1101 \\ + 1 \\ \hline \end{array}$$

$$= (-2)_{10}$$

## Reg num:

\$t0 - \$t7 are reg. 8 - 15.

\$t8 - \$t9 are reg 24 - 25

\$s0 - \$s7 are reg 16 - 23.

S11 \$t0, \$t0, 2  
Sore(3=2)

## R-format Instructions:



| Op    | rs | rt | rd | Shamt | funct |
|-------|----|----|----|-------|-------|
| 6 bit | 5  | 5  | 5  | 5     | 6     |

add \$t0, \$s1, \$s2

|         |      |      |      |   |     |
|---------|------|------|------|---|-----|
| special | \$t1 | \$s2 | \$t0 | 0 | add |
|---------|------|------|------|---|-----|

|   |    |    |   |   |    |
|---|----|----|---|---|----|
| 0 | 17 | 18 | 8 | 0 | 32 |
|---|----|----|---|---|----|

|        |       |       |      |       |        |
|--------|-------|-------|------|-------|--------|
| 000000 | 10001 | 10010 | 0100 | 00000 | 100000 |
|--------|-------|-------|------|-------|--------|

(00000010001100100100000100000) = (02324020)  
 ↓      ↓      ↓      ↓      ↓  
 21 22 23 24 25  
 26 27 28 29 30 31

# I-format Instruction

| OP    | rs | rt | constant or address |
|-------|----|----|---------------------|
| 6 bit | 5  | 5  | 16                  |

lw \$1, 100(\$2)

| OP     | rs    | rt    | immediate           |
|--------|-------|-------|---------------------|
| 010011 | 00010 | 00001 | 0000 0000 0110 0100 |

n(8)

$$\$1 = \text{Mem}[100 + \$52]$$

offset + base = Mem Location

Base

offset



sw \$5, 3000(\$2)      Mem[\$2 + 3000] = \$5

| OP     | rs    | rt    | immediate value |
|--------|-------|-------|-----------------|
| 101011 | 00010 | 00101 | 000010111011100 |

addi \$0, \$1, 5

| OP     | rs    | rt    | immediate     |
|--------|-------|-------|---------------|
| 110011 | 00001 | 00000 | 0000000000101 |

lw \$51, 100(\$2)

$$\$1 = \text{Mem}[100 + \$52]$$

load data to reg from mem.

sw \$55, 3000(\$2)

$$\text{Mem}[3000 + \$52] = \$55$$

read

write

store data to mem from reg.

## J-type



→ jump ~~bit~~ ~~bit~~ ~~bit~~ ~~bit~~ ~~bit~~  
bit ~~bit~~ ~~bit~~ ~~bit~~ ~~bit~~ ~~bit~~

⑤ and \$t0, \$t1, \$t2

$$\begin{aligned} \text{lui } & s31, 20 \\ s1 &= 20 * 2^{16} \end{aligned}$$

$$t0 = t1 \& t2$$

⑥ or \$t0, \$t1, \$t2 → or gate ~~or logic~~ ~~table~~ ~~or~~ ~~and~~.

$$t0 = t1 | t2$$

|   |   |   |
|---|---|---|
| 1 | 0 | 1 |
| 1 | 1 | 1 |
| 0 | 0 | 0 |
| 0 | 1 | 1 |

⑦ NOR \$t0, \$t1, \$zero

$$t0 = ! (t1 | zero)$$

⑧ and immediate : andi \$s1, \$s2, ~~20~~ 20

$$s1 = s2 \& 20$$

⑨ Or immediate : ori \$s1, \$s2, 20

$$s1 = s2 | 20.$$

⑩ SLL \$t1 \$s0, \$s1, ~~20~~ 10

$$s0 = s1 << 10$$

(11) if ( $i == j$ )       $f = g + h$   
 $\begin{matrix} s_1 \\ s_2 \\ s_3 \\ s_4 \\ s_5 \end{matrix}$        $\begin{matrix} f \\ g \\ h \end{matrix}$   
 else       $f = g - h$ .  
 $\begin{matrix} s_3 \\ s_4 \\ s_5 \end{matrix}$

bear  $s_1, s_2$ , level 1  
 level 1 °

add  $s_3, s_4, s_5$ .

J Exit;

Sub  $s_3, s_4, s_5$ .

Exit.

(12) while ( $\text{Save}[i] = \pm k$ )       $i++ = 1$ .  
 $\begin{matrix} s_1 \\ s_2 \\ s_3 \\ s_4 \end{matrix}$

SLL  $s_{to}, s_{t0}, 2 \rightarrow 2s_{t0}$  left shift  
 কর্তৃত পদ রাখ মি.

Iw  $s_{t1}, s_{t0}[s_{t2}] \rightarrow t1 = \text{Mem}[to+s2]$

bear  $\{s_{t1}, s_{t2}, L1\}$

L1 ° add  $s_{t3} = s_{t4} + 1$

addi  $s_{t4}, s_{t4} + 1$

exit;

A(8)

$s_2[s_3]$

$s_{t0}[s_{t3}]$

Lecture - D3Overflow Flag:

unsigned num : carry and overflow same

$$\begin{array}{r}
 & 1 & 1 & 1 \\
 & | & | & | \\
 & 1 & 1 & 1 \\
 \hline
 & 1 & 1 & 1 & 0
 \end{array}$$

(Carry)      যোগ করে পাওয়া গুরুত্ব সহ সিঙ্গেন্ড বিট।

$CF = 1$   
 $OF = 1$

sign num : if  $C_{in} \neq C_{out}$

$$\begin{array}{r}
 C_{in} = C_{out} \quad \text{if } \boxed{C_{in} \neq C_{out}}
 \end{array}$$

$$\begin{array}{r}
 \textcircled{1} \quad \textcircled{1} \\
 1 \quad 1 \quad 1 \quad 1 \\
 1 \quad 1 \quad 1 \quad 1 \\
 \hline
 \textcircled{1} \quad 1 \quad 1 \quad 1 \quad 0
 \end{array}$$

Carry

$CF = 1$   
 $OF = 0$

$$\begin{array}{r}
 \boxed{C_{in} \neq C_{out}} \quad \text{if } \boxed{\textcircled{0} \quad \textcircled{1}}
 \end{array}$$

$$\begin{array}{r}
 0 \quad 1 \quad 1 \quad 1 \\
 0 \quad 1 \quad 1 \quad 1 \\
 \hline
 0 \quad 1 \quad 1 \quad 1 \quad 0
 \end{array}$$

$CF = 0$   
 $OF = 1$

## Multiplication

$$\begin{array}{r}
 & \xleftarrow{\quad} \\
 \begin{array}{r} 1000 \\ 1001 \end{array} & \xrightarrow{\quad} \text{multiplicand} \\
 \hline
 & \xrightarrow{\quad} \text{Multiplier} \\
 & \xrightarrow{\quad} 1000 \\
 & 0000 \times \\
 & 0000 \times \times \\
 & 1000 \times \times \times \\
 \hline
 & 1001000 \xrightarrow{\quad} \text{Product}
 \end{array}$$



## Floating Point

$$(11.1111)_2$$

$$\Rightarrow 11.1111 \times 2^0$$

$$\Rightarrow 11\underline{11} \cdot 11 \times 2^{-2}$$

$$\Rightarrow 1 \cdot \underline{1111} \times 2^{+1}$$

मान इन फ्रॉन्ट  
1011 2100 50 फ्रॉन्ट

फ्रॉन्ट

1011 2100 50 फ्रॉन्ट

## IEEE-floating Point format

| S | Exponent | fraction |
|---|----------|----------|
|---|----------|----------|

single: 8 bit

double: 11 bit

single: 23 bit  
double: 52 bit

(Exponent - Bias)

$$n = (-1)^S \times (1 + \text{fraction}) \times 2^{\text{Exponent} - \text{Bias}}$$

$$\text{① } -0.75$$

$$S = 1 \quad [(-1) \text{ अंक}, \text{ अंक } (-0.75) \text{ से } 0 \text{ होता है, } \text{ power } \geq (1) \text{ फ्रॉन्ट}]$$

$$\text{fraction} = \cancel{0.75}_{(0.75)_{10}} = (0.11)_2$$

$$\Rightarrow 1 \cdot 1 \times 2^{-1} \rightarrow \text{Exponent}$$

$$= 1000 \rightarrow \text{fraction}$$

परिवर्तन करना चाहिए  
संख्या का अंक बदलना चाहिए

$$\text{Exponent} = -1 + \text{Bias}$$

$$= -1 + 127$$

$$= 126$$

$$= 101111(1111110)_2$$

final format = 

|   |          |                      |
|---|----------|----------------------|
| 1 | 01111110 | 00000000000000000000 |
|---|----------|----------------------|

 23 bits

② What num is represented by single precision float

Precision float

1 10000001 001000... 04.

$$\begin{array}{r} +10 \\ 0000 \\ 0000 \\ 1 \cdot 10 \times 2^0 \\ 0 \cdot 11 \times 2^0 \end{array}$$

S = 1

$$\text{fraction} = (01000000)_2$$

$$f_{ex} = (10000001)_2 = 129$$

$$n = (-1)^s \times f_{ex} \times 2^{(129-127)}$$

$$= (-1) \times 0.125 \times 2^2$$

$$= -5.0.$$

$$\begin{aligned} & 0.125 \\ & = 0.1 \times 2^3 \\ & = 0.1 \times 8 \\ & = 0.1 \times 10 \\ & = 1.0 \times 2^0 \end{aligned}$$

$$\begin{aligned} & 0.125 \\ & = 0 \times 2^7 + 10 \times 2^2 \\ & = 0.125 \end{aligned}$$

(1) (8) (23)  
(11) (32)

Procedure Call Instructions

4: addi R1, R0, 12

8: jal my\_proc

12: add R1, R2, R2

my\_proc

go: addi R2, R1, 8

84: jr

Reg File

R0 : 0

R1 : 0'1240

R2 : 0'20



main () {

Jump to ~~my\_func~~ 12th line.  
 Store the return add. if the  
 current add is PC. then if add  
 $(PC + 4)$  it is the add ~~back~~ here  
 the value will return.  $(PC + 4)$  is a  
 address & it will save in ~~for a neg~~

jal my function() → ~~for a neg~~ (if address save ~~for a neg~~ ~~value~~ ~~for a neg~~)

my function() {

jr r1 → ~~for a neg~~ (if address save ~~for a neg~~ ~~value~~ ~~for a neg~~)

# Specifying Address Branch Destinations

bne \$s1, \$2, L1



j label



Lecture -04

- i) Path on PC+G or PC+G+G it is decided by that MUX.  
on the ~~base~~ base.
- ii) whose output going to the reg. is it  
Reg file ALU result or Data Mem?
- iii) ~~ALU~~ ALU (to what input IN)? Reg file is  
2nd reslt or S.E output?

## Combinational Elements

AND gate, ALU.

State Elements: Reg file, Mem.



- The fetch is done here.
- PC holds the address of current instruction.
  - Instruction Mem store the instruction & supply instruction given an add.
  - ALU used for increment.



- Decode part done here.
- in reg there can be written/read.

## Control Signals

|                 |                                            |                                                      |
|-----------------|--------------------------------------------|------------------------------------------------------|
| i) Reg Dst      | select rd as destination                   | select rd as destination                             |
| ii) Reg Write   |                                            | Reg file $\rightarrow$ write data<br>input select RD |
| iii) ALU Src    | reg file $\rightarrow$ 2nd output selected | select output from S.E.                              |
| iv) PC Src      | select PC+4                                | select PC+4 on Bus                                   |
| v) Mem Read     |                                            | Wr $\rightarrow$ Mem D.M (to read QD)                |
| vi) Mem Write   |                                            | Sw $\rightarrow$ Mem D.M (to write QD)               |
| vii) Mem To Reg | Add to this ALU Mem Reg D.M                | Wr $\rightarrow$ Mem D.M (to write RD)               |

## R-type

Add \$s10, \$t2, \$s13

or, sub, and,  
sll, srl, -



## I-type

lw \$s2, \$s3(\$s0)

lw \$s2 + 500 (\$s3)

[op | rs | rt | Imm ]  
imm 21:26 15:21 20:26 25:0



SW \$S2, 500(\$3)



Addi \$S2, \$S1, 100

rs      N      Imm



## J-Type

J 1030.



JAL L1



Jr 831



Bear  $s_{10}, s_{12}$ , Else



## Performance Issue



એક બિટ નું રજિસ્ટર હોય તો એક કલી વિના હોય  
અને એક વિના હોય

1 component  
1 instruction  
1 cycle  
at a time

single cycle means ~~one~~ one  
time એક વિના હોય  
2 clock cycle જે વિના હોય એટાં

એ તો જાણ્યુ કે એક વિના હોય

Iw  $\Rightarrow$  PC  $\rightarrow$  IM  $\rightarrow$  R.F  $\rightarrow$  ALU  $\rightarrow$  DM

Sw  $\Rightarrow$  PC  $\rightarrow$  IM  $\rightarrow$  R.F  $\rightarrow$  ALU  $\rightarrow$  DM

we do not use the single cycle implementation

because if the performance is too poor

If we implement the floating point unit or  
an instruction set with more complex instruction

this single cycle design would not work at all.

## Pipelining

- Pipelining is faster
- everything is working in parallel.



Five steps of MIPS instruction.

## Pipeline Performance



# Pipelining Vs Non Pipelining

8 instructions.

$\times T(n-1)$

| Stages | F | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> | I <sub>6</sub> | I <sub>7</sub> | I <sub>8</sub> |                |                |                |
|--------|---|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|
| ↑ D    |   | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> | I <sub>6</sub> | I <sub>7</sub> | I <sub>8</sub> |                |                |                |
| E      |   |                | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> | I <sub>6</sub> | I <sub>7</sub> | I <sub>8</sub> |                |                |
| DM     |   |                |                | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> | I <sub>6</sub> | I <sub>7</sub> | I <sub>8</sub> |                |
| NR     |   |                |                |                | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> | I <sub>6</sub> | I <sub>7</sub> | I <sub>8</sub> |
|        | 1 | 2              | 3              | 4              | 5              | 6              | 7              | 8              | 9              | 10             | 11             | 12             |
|        |   |                |                |                |                |                |                |                |                |                |                | Time           |

for non pipelining we required  $8 \times 5$  stage clock cycle  
 $= 400$  s time.

Here we need only 12 s.

1 clock cycle  $\Rightarrow$  1 stage complete  $\therefore$  non pipelining  
 each instruction  $\Rightarrow$  5 stages  $\therefore$  total 40 stages  
 each instruction  $\Rightarrow$  5 stages  $\therefore$  total 40 stages  
 each instruction  $\Rightarrow$  5 stages  $\therefore$  total 40 stages  
 $\therefore$  12 s time.

$$\therefore \text{Speed up} = \frac{\text{Time of non pipeline}}{\text{Time of pipeline}}$$

$$\text{Time between instructions} = \frac{\text{Time between instructions}}{\text{Num of pipeline}}$$

$$= \frac{40}{5} = 8 \text{ CC}$$

$$\therefore \frac{40}{12} = 3.33$$

$$\therefore \text{Efficiency/Utilization} = \frac{\text{We use how much}}{\text{Total number of blocks}} = \frac{40}{60}$$

Pipeline Hazard: unwanted event occur in pipeline.

- ① Structural Hazards:
  - same clock cycle
  - same resource (Reg)
  - two or more instruction trying to access

- I1 : ADD R1, R2, R3 //  $R_1 \leftarrow R_2 + R_3$
- I2 : ADD R4, R5, R6 //  $R_4 \leftarrow R_5 + R_6$
- I3 : SUB R7, R1, R8 //  $R_7 \leftarrow R_1 - R_8$

|    | 1              | 2              | 3                              | 4                          | 5                          | 6 |
|----|----------------|----------------|--------------------------------|----------------------------|----------------------------|---|
| F  | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub>                 |                            |                            |   |
| D  | ADD<br>R2, R3  | ADD<br>R5, R6  | SUB<br>R1, R8                  |                            |                            |   |
| E  |                | $R_2 + R_3$    | $R_5 + R_6$                    | $R_1 - R_8$                |                            |   |
| WB |                |                | $R_1 \leftarrow \text{result}$ | $R_4 \leftarrow R_5 + R_6$ | $R_7 \leftarrow R_1 - R_8$ |   |

$$R_7 \leftarrow R_1 - R_8$$

$$R_4 \leftarrow R_5 + R_6$$

$$R_7 \leftarrow R_1 - R_8$$

|    | 1              | 2              | 3              | 4                              | 5                              | 6                          |
|----|----------------|----------------|----------------|--------------------------------|--------------------------------|----------------------------|
| F  | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> |                                |                                |                            |
| D  |                | SUB<br>R1, R8  | ADD<br>R5, R6  | ADD<br>R2, R3                  |                                |                            |
| E  |                |                | $R_1 - R_8$    | $R_5 + R_6$                    | $R_2 + R_3$                    |                            |
| WB |                |                |                | $R_7 \leftarrow \text{result}$ | $R_4 \leftarrow \text{result}$ | $R_1 \leftarrow R_1 + P_1$ |

Code schedule possible

|    | 1              | 2                                      | 3                                      | 4                                                                             | 5                              | 6 |
|----|----------------|----------------------------------------|----------------------------------------|-------------------------------------------------------------------------------|--------------------------------|---|
| F  | I <sub>1</sub> | I <sub>2</sub>                         | I <sub>3</sub>                         |                                                                               |                                |   |
| D  |                | ADD<br>R <sub>2</sub> , R <sub>3</sub> | ADD<br>R <sub>5</sub> , R <sub>6</sub> | SUB<br>R <sub>1</sub> , R <sub>8</sub>                                        |                                |   |
| E  |                |                                        | R <sub>2</sub> +R <sub>3</sub>         | R <sub>5</sub> +R <sub>6</sub>                                                | R <sub>1</sub> -R <sub>8</sub> |   |
| WB |                |                                        |                                        | R <sub>1</sub> ← result<br>R <sub>4</sub> ← result<br>R <sub>2</sub> ← result |                                |   |

Not the dependency coz,  
 R<sub>1</sub> is value ready at PC.  
 So, now SUB has to wait.  
 point is Reg 1 uses R<sub>1</sub>  
 at that time, it can't be used  
 also there's WB at the same time  
 same CC as. That is structural  
 hazard

- Delay with Bubble
- until I<sub>1</sub> is totally finished.

I<sub>4</sub> SW \$5, \$500(\$10)

I<sub>3</sub> LW \$5, 500(\$80)

I<sub>2</sub> Add \$5, \$7, \$9

I<sub>1</sub> Sub \$5, \$3, \$7

|    | 1                                      | 2                                      | 3                                | 4                                  | 5                            | 6              | 7 | 8 |
|----|----------------------------------------|----------------------------------------|----------------------------------|------------------------------------|------------------------------|----------------|---|---|
| F  | I <sub>1</sub>                         | I <sub>2</sub>                         | I <sub>3</sub>                   | I <sub>4</sub>                     | I <sub>5</sub>               | I <sub>6</sub> |   |   |
| D  | Sub<br>R <sub>2</sub> , R <sub>3</sub> | Add<br>R <sub>2</sub> , R <sub>3</sub> | LW<br>R <sub>5</sub> , 500(\$80) | SW<br>R <sub>5</sub> , \$500(\$10) |                              |                |   |   |
| E  |                                        |                                        |                                  |                                    | Addition<br>R <sub>5</sub> ← |                |   |   |
| DM |                                        |                                        |                                  |                                    |                              |                |   |   |
| WB |                                        |                                        |                                  |                                    | \$5 ←                        | \$5 ←          | X |   |

Decide by Reg → address,  
 W.B by Reg 1.

Fetch by I-M (using RAM)  
 & Address save by DM (using  
 RAM), same CC to RAM  
 use single bus instruction by (bus),

- Data Hazards
- Dependency will be there.
  - Assume Input Error also (W.A.N.T.)
  - overlapped memory access for all.

3 types of Data Hazard's

i) Read after write Hazard (R.W)

ii) Write after Read Hazard (W.R)

iii) Write after write Hazard (W.W)

ii) Read after write Hazard's

I<sub>1</sub> ADD R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub> //  $R_1 = R_2 + R_3$

I<sub>2</sub> SUB R<sub>4</sub>, R<sub>1</sub>, R<sub>5</sub> //  $R_4 = R_1 - R_5$

|    | 1                                      | 2                                                 | 3                               | 4                               | 5                |
|----|----------------------------------------|---------------------------------------------------|---------------------------------|---------------------------------|------------------|
| F  | I <sub>1</sub>                         | I <sub>2</sub>                                    |                                 |                                 |                  |
| D  | ADD<br>R <sub>1</sub> , R <sub>3</sub> | <sup>REG</sup><br>R <sub>1</sub> , R <sub>2</sub> | R <sub>2</sub> + R <sub>3</sub> | R <sub>1</sub> - R <sub>5</sub> |                  |
| E  |                                        |                                                   | R <sub>1</sub> ←                | R <sub>1</sub> ←                | R <sub>4</sub> ← |
| WB |                                        |                                                   |                                 |                                 |                  |



Solution:

i) Delay the Pipe

[delay 2 CC]

|    | 1              | 2                                      | 3                                       | 4                                        | 5                                        | 6                                      | 7 |
|----|----------------|----------------------------------------|-----------------------------------------|------------------------------------------|------------------------------------------|----------------------------------------|---|
| F  | I <sub>1</sub> | I <sub>2</sub>                         |                                         |                                          |                                          |                                        |   |
| D  |                | ADD<br>R <sub>2</sub> , R <sub>3</sub> | <del>R<sub>1</sub>, R<sub>2</sub></del> | <del>R<sub>1</sub> + R<sub>3</sub></del> | <del>R<sub>1</sub> - R<sub>5</sub></del> | SUB<br>R <sub>1</sub> , R <sub>5</sub> |   |
| E  |                |                                        |                                         | R <sub>1</sub> + R <sub>3</sub>          | R <sub>1</sub> - R <sub>5</sub>          |                                        |   |
| WB |                |                                        |                                         | R <sub>1</sub> ←                         | R <sub>4</sub> ←                         |                                        |   |

## (ii) Operand forwarding Techniques

|    | 1              | 2                                      | 3 | 4                               | 5                | 6 | 7 |
|----|----------------|----------------------------------------|---|---------------------------------|------------------|---|---|
| F  | I <sub>1</sub> | I <sub>2</sub>                         |   |                                 |                  |   |   |
| D  |                | ADD<br>R <sub>2</sub> + R <sub>3</sub> |   | R <sub>2</sub> + R <sub>3</sub> |                  |   |   |
| E  |                |                                        |   | R <sub>1</sub> ←                | R <sub>2</sub> ← |   |   |
| WB |                |                                        |   |                                 |                  |   |   |



1 cc delay  $\Rightarrow$  direct execution  
 $\Rightarrow$   $R_1 \leftarrow (R_2 + R_3)$   
 $\Rightarrow$   $R_2 \leftarrow$

## (2) Write after Read Hazard (RW Hazard / Anti-Dependency Hazard)

i : ADD R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub>; // R<sub>1</sub> ← R<sub>2</sub> + R<sub>3</sub>

j : OR R<sub>2</sub>, R<sub>3</sub>, R<sub>4</sub>; // R<sub>2</sub> ← R<sub>3</sub> or R<sub>4</sub>

Here ADD goes after OR execute  $\Rightarrow$  faster, but dependency hazard will occur.

|    | 1                                     | 2              | 3                                | 4 | 5                                                    | 6 |
|----|---------------------------------------|----------------|----------------------------------|---|------------------------------------------------------|---|
| F  | I <sub>2</sub>                        | I <sub>1</sub> |                                  |   |                                                      |   |
| D  | OR<br>R <sub>3</sub> , R <sub>4</sub> |                | R <sub>3</sub> OR R <sub>4</sub> |   | R <sub>3</sub> OR R <sub>4</sub><br>+ R <sub>3</sub> |   |
| E  |                                       |                | R <sub>2</sub> ←                 |   | R <sub>1</sub> ←                                     |   |
| WB |                                       |                |                                  |   |                                                      |   |

③ Write after write hazard (WW Thread/aut<sup>4+</sup> Dependency)

$$i \quad R_2 \leftarrow R_4 + R_7$$

$$j \quad R_2 \leftarrow R_1 + R_3$$

if as usual i will happen first and j will be last. At the end we get updated value of  $R_2$  from j. But if j happens first then there will be a problem. So, we set delay so that j will execute just after the i finish.

### Code Scheduling

1  $lw \quad 8+t1, 0(8+0)$   
 6  $lw \quad 8+t2, 4(8+0)$   
 8  $\cancel{add} \quad 8+t3, 8+t2(8+0)$   
 9  $sw \quad 8+t3, 12(8+0)$   
 10  $lw \quad 8+t4, 8(8+0)$   
 11  $\cancel{add} \quad 8+t5, 8+t4, 8+t5$   
 13  $sw \quad 8+t5, 16(8+0)$

↗ independent 3[1] 3[2] 3[3]

Stall or wait? cycle 8[0]

1  $lw \quad 8+t1, 0(8+0)$   
 6  $lw \quad 8+t2, 4(8+0)$   
 7  $lw \quad 8+t4, 8(8+0)$   
 8  $add \quad 8+t3, 8+t1, 8+t2$   
 9  $sw \quad 8+t3, 12(8+0)$   
 10  $add \quad 8+t5, 8+t1, 8+t4$   
 11  $sw \quad 8+t5, 16(8+0)$

↗ independent 3[1] 3[2] 3[3]

for dependence:

- ① stall a 2<sup>nd</sup> stall
- ② stall & forwarding a 2<sup>nd</sup> stall

## Control Hazard

Add 83, 82, 85

Sub 86, 87, 88

bea 85, 86, L1

IMUL 85, 87, 88

Div 85, 89, 83

L1:

Add 812, 810, 821.

[flush]

X {

IMUL

Div

85, 87, 88

85, 89, 83

Program finds branch from one location to another. If jump goes to new mem location → takes two cycles. Thus the control instruction will be removed. Thus it is called time lost. As a result of this is called

Control Hazard. Branch delay is called. Ex state 9 after ID instruction F, D, (in 2nd stage) due to WB.

S<sub>1</sub>

S<sub>2</sub> - Bea

S<sub>3</sub>

S<sub>4</sub>

S<sub>5</sub>

S<sub>6</sub>

|    | 1              | 2              | 3                                      | 4                                    | 5              | 6 | 7 | 8 |
|----|----------------|----------------|----------------------------------------|--------------------------------------|----------------|---|---|---|
| F  | S <sub>1</sub> | S <sub>2</sub> | S <sub>3</sub>                         | S <sub>4</sub>                       | S <sub>6</sub> |   |   |   |
| D  |                | S <sub>1</sub> | Bea<br>S <sub>2</sub> , S <sub>3</sub> | S <sub>3</sub>                       | S <sub>6</sub> |   |   |   |
| E  |                |                | S <sub>2</sub>                         | S <sub>2</sub> -S <sub>3</sub><br>=0 | S <sub>6</sub> |   |   |   |
| WB |                |                |                                        | S <sub>1</sub>                       | S <sub>6</sub> |   |   |   |

too losing 3 cc then here

solution: flushing.

control signal send signal (0)

just set (0) keep all same.

If will happen in 3 no CC of execution & after those instruction at beg.

I<sub>1</sub> BEQ R1, R2, S0

I<sub>2</sub> AND R2, R3, R4

I<sub>3</sub> OR R1, R4, R5

I<sub>4</sub> ADD R2, R3, R4



3 No  
CC  
সমীল হওয়া পদ্ধতি যা কেবল মডেল এর ফল।

② I Can do prediction.

add R4, R5, R6

beq \$1, \$2, L0

add R2, R3, R6

L0: OR R7, R8, R9



Add one

bubble when you jump to the next address when jump is right after beq.

# Pipeline Data Path

lw \$8



sw \$8 4th cycle

w



# Multi Cycle Pipeline

lw \$10, 20(\$1)



sub \$11, \$12, \$13



add \$12, \$13, \$14



Stack Pointer & Frame Pointer: SP points the top of the stack and the frame pointer points to the current active frame. Some time you do not need frame pointer. You can use SP instead. Instead of that if the FP becomes free and can be used in other works.



Stack pointer can be changed but frame pointer will not. This is like your mother (fp) and your girl friend (sp). During the push the sp is changed but fp holds the base address. So fp is used to access the local vars.

Block: - The minimum unit of information that can be either present or not present in a cache.

Hit Rate: The fraction of memory access found in a level of mem hierarchy.

Miss Rate: The fraction of mem access not found in a level of mem hierarchy.

Hit Time: The time required  
- to access a level of mem hierarchy  
- the time needed to determine whether the access is a hit or miss.

Miss Penalty: - The time required to fetch a block  
- The time to access the block.



Chapter-05Mem Hierarchy:

How to do cache Mem Read and Cache Write?



Temporal Data locality: If data ~~cache mem~~ is in DRAM, use DRAM or cache mem to read. (Cache)

Spatial locality: Where to, but which instruction? we put outside of cache mem but not in DRAM (RAM)

Read Advantage

| Word Add | Binary addr  | Hit/Miss | Cache |
|----------|--------------|----------|-------|
| 22       | 10110        | Miss     | 110   |
| 26       | 11010        | Miss     | 010   |
| 16       | 10000        | Miss     | 000   |
| 3        | 00011        | Miss     | 011   |
| 16       | 10000        | Hit      | 000   |
| 18       | <u>10010</u> | Miss     | 010   |

can not go

bcz var 10210

WT,

so it is called

Hit,

Now it will

throw prev value

to the dirty bit.

then install the

latest data.

| Index | V  | Tag       | Data       |
|-------|----|-----------|------------|
| 000   | NY | 10        | Mem[10000] |
| 001   | N  |           |            |
| 010   | NY | <u>11</u> | Mem[11010] |
| 011   | NY | 00        | Mem[00011] |
| 100   | N  |           |            |
| 101   | NY | 10        | Mem[10110] |
| 110   | NY | 10        | Mem[10110] |
| 111   | N  |           |            |

22 = ~~00010100~~

offset

26 = ~~000110100~~

index

Tag.

00000 00000

(2) → 8  
2 → 64  
2 → 64  
2 → 64

## Virtual Mem.

- Software controls mem.

- Physically lays on H.D.



Virtual mem can't access ~~the~~ data.

It has <sup>logical</sup> add. which is converted to physical add.

Physical mem: Cache, RAM, H.D.



2 bit is used for byte adjustment in Physical Add.

## Size of Mem.

$$2^{10} \text{ bytes} = 1 \text{ KB}$$

$8 \text{ bit} = 1 \text{ byte}$

$$2^{10} \text{ bytes} = 1 \text{ KB}$$

$$2^{20} \text{ bytes} = 1 \text{ MB}$$

$$2^{30} \text{ bytes} = 1 \text{ GB}$$

$$2^{40} \text{ bytes} = 1 \text{ TB}$$

$$2^{50} \text{ bytes} = 1 \text{ PB}$$

$$2^{60} \text{ bytes} = 1 \text{ EB}$$

Floppy disk.

CD, DVD.

HDD.

RAM:

SRAM:

- fast & expensive

- Access in one clock cycle.

DRAM: - slower but less expensive

- Access in 10 C.C.



$$1 \text{ page} = 2^{12} \text{ bytes}$$

## Virtual Mem

$$2^{20} \text{ byte page} \times (2^{20} \cdot 2^{12})^{2^2} \text{ bytes}$$

## Physical Mem

$$2^{18} \text{ page} = (2^8 \cdot 2^{12}) = 2^{30} \text{ bytes}$$

$$2^{28} \text{ page} = (2^{28} \cdot 2^{12}) = 2^{40} \text{ bytes}$$

$$2^{32} \text{ byte} \rightarrow 1 \text{ page}$$

$$2^{12} \text{ byte} = 1 \text{ page}$$

$$= \frac{1}{2^{12}} \times 2^{32}$$

$$2^{32} \text{ byte} = \frac{1}{2^{12}} \times 2^{32}$$

$$= 2^{20} \text{ page}$$

VPN

Virtual  
page  
Num



Page fault: When a program tries to access a add that is belongs to a page on HDD this page ~~is called~~ must be brought into main mem.

Bits in a cache:

1 word = 4 byte



① 16KiB data & 4 word blocks, assuming 32 bit reg!!

$$16\text{KiB} = 2^{10} \times 16$$

$$= 2^{10} \times 2^4$$

- 4 word  $= 2^{\log_2 4}$   $\therefore m = 2$ .

$$1024 \text{ block} = 2^{10} \Rightarrow h = 10$$

$$2^{10} \times [2^2 \times 32 + (32 - 10 - 2) + 1]$$

## Speed Up Challenge:

$$\text{Speed up} = \frac{\text{Ex. time before}}{(\text{Ex. time before} - \text{Ex. time affected}) + \frac{\text{Ex. time after}}{\text{Amount of improvement}}}$$

$$\text{Speed up} = \frac{1}{(1 - \text{fraction time affected}) + \frac{\text{fraction time affected}}{\text{Amount of improvement}}}$$

$$\Rightarrow 70 = \frac{1}{\frac{1}{80}(1 - \text{fraction time affected}) + \frac{\text{fraction time affected}}{80}}$$

$$\Rightarrow 70 = \frac{80}{80 - 80 \text{ fraction} + \text{fraction}}$$

$$\Rightarrow 70 = \frac{80}{80 - 79 \text{ fraction}}$$

$$\Rightarrow 5600 - 5530 \text{ fraction} = 80$$

$$\Rightarrow -5530 \text{ fraction} = -5520$$

$$\Rightarrow \text{fraction time affected} = 0.999 \\ (1 - 0.99) = 0.1\%$$

Ans: 0.1%



## Software Arch/GPU Mem Structure



## Mess Passing



## A Basic CUDA Program Outline

- Host
- Device
- 
- Diagram showing the connection between CPU and GPU memory. A CPU box is connected to a 'CPU Mem' box, which is then connected to a 'GPU Mem' box on the device side. Arrows indicate the flow of data between them.
- ① Allocate the mem for array on host.
  - ② Allocate the mem for array on device
  - ③ CPU Mem  $\xrightarrow{\text{Mem}}$  CPU  $\xrightarrow{\text{Mem}}$  array  $\xrightarrow{\text{Mem}}$  fill array
  - ④ CPU Mem  $\xrightarrow{\text{Mem}}$   $\xrightarrow{\text{Mem}}$  GPU Mem  $\xrightarrow{\text{Mem}}$  GPU Mem
  - ⑤ GPU Mem  $\xrightarrow{\text{Mem}}$  GPU  $\xrightarrow{\text{Mem}}$  GPU
  - ⑥ Then copy GPU Mem to return  $\xrightarrow{\text{Mem}}$  or  $\xrightarrow{\text{Mem}}$  before the
  - ⑦ CPU Mem  $\xrightarrow{\text{Mem}}$  CPU  $\xrightarrow{\text{Mem}}$  info for connection.
  - ⑧ free the host
  - ⑨ Free the Mem

Add 80, 81, 82



### Double Data Hazard

add 81, 81, 82  
add 81, 81, 83  
add 81, 81, 84

प्र० रोटे एक रोटे डॉ रोटे डॉ रोटे

② Add 810, 811, 812.

lw 82, 40(810).



sub 86, 88, 89



Add 89, 87 100.

Q10

OR. 823, 89 815.

lw 82, 40(810).



3 flaps



~~E~~ X

X

X

X

X

X



X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

8 cell U



X

X

X

X

X

X

X

X

X

X

X

X

X

X

D

D

D

D

D

D

X

X

X

X

X

X

D

D

D

D

D

D

X

X

X

X

X

D

D

D

D

D

contd. 8 cell U

→ save CC & zero value  
depends → value zero

\* remember:

1 dependencies  
2 in stall  
still fast

3 no wait

4 code length.

5

6 stall go

## Fwd & stuff



with rest rechedule  
cfg 0352 num of same  
8128,