

For complete execution to make a program w.r.t.

Logical Operands

Memory communication (Read & Write)

Change flow control [:

Instruction should define



32 bits needed to specify memory location.  
 $\lceil \log_2 (\text{Total No. of operations}) \rceil \rightarrow$  to uniquely specify the operator

Constraints

↳ 1. Limited Memory → minimize no. of bits per instruction  
Program stored in magnetic strips.  
when running program ( $>$  RAM), we bring parts of RAM & overwrite  
it again & again to run the program.

② Slow Memory. → complex instructions (Read/Write operation slower than previous operation)  
Machine language used at that time

③ Ease of programming (No high-level lang.)

Implicit operands → 0-address format → only specify operand  
[No. of bits per instruction]



take top 1/2 element  
from compute, write result on top



$$\text{E.g. } A = B + C \times D$$

- ① PUSH C 39 |  $\boxed{D}$
- ② PUSH D 39 |  $\boxed{C}$
- ③ MULTPLY |  $\boxed{CD}$
- ④ PUSH B 39 |  $\boxed{B+CD}$
- ⑤ ADD - |  $\boxed{B+CD}$
- POP A 35

(R)

$\frac{39}{14}$   
 $\frac{15}{14}$   
67

14

105

If ~~stack~~ stack is not sufficient, then pop to some  
empty temp. location T1.

compute the rest  
and then bring the T1 data back to temp. memory stack  
and compute.

Eg if only 2 instack |  $\boxed{\quad}$

the  $A = B + C * D + E / F$   
1, 2, 3, 4, 5. POP T1, PUSH E, PUSH F, DIV, PUSH ADD

\* Write in postfix notation

Scan  $\rightarrow$  ① encounter variable  $\rightarrow$  push operand  
② Encount. operator  $\rightarrow$  perform operat.

$$A = B + C \times D$$

$$= B C D \times +$$

PUSH B  
" C  
" D  
MULTIP

ADD

POP A

using 1 address format everywhere.  $\rightarrow$   
one operand from temp,  
" " memory



LOAD/STORE location

(Instruction to transfer  
from memory)

$A[1]$

| OPERATION   | OPERAND              |
|-------------|----------------------|
| Accumulator | base<br>architecture |

$$A = B + C$$

$$\text{LOAD } B$$

$$\text{ADD }$$

~~A~~ →  $D \leftarrow + D \times E$   
~~LOAD B~~  
~~MUL C~~  
STORE temp ← from a <sup>main</sup> memory location

LOAD D  
MUL E  
ADD T  
STORE A

## FACeBOOK ACCUMULATOR BOTTLENECK

instead of 40 who

## ADDRESSING MODE

## Registers or memory

TOP [Add] - open

$$7 \overset{+1}{\cancel{1}} 32 = 40$$

$$A = B + C$$

$$D = A \cup C$$

- 158

$\backslash P = A \& R$  [ storing in register R0 ]

Able to forget limited memory well as slow memory cost of registers  $\uparrow$

|       |    |   |    |
|-------|----|---|----|
| LOAD  | B  | → | 40 |
| ADD   | A  | → | 40 |
| STORE | RO | → | 10 |
| LOAD  | E  | → | 40 |
| ADD   | RO | → | 10 |
| STORE | D. |   |    |

Program  
specifies  
and  
decides  
which  
variable  
used  
multiple  
times &  
stores it  
in the  
registers.

Using only registers!

only registers.  
↳ Need main memory to load/unload  
registers.



## ~~ADD~~ Instructions (Arithmetic Logic).

| Operations | OP1 | OP2 |
|------------|-----|-----|
| 7          | 2   | 2   |

main memory  
can be accessed  
by load & store

I go for 3 address format - bay cost only 2

| Operation | OP1 | OP2 | Destination |
|-----------|-----|-----|-------------|
|-----------|-----|-----|-------------|

### LOAD & STORE FORMAT

| Operations | OP1 | OP2 |
|------------|-----|-----|
| 9          | 2   | 32  |

Register      Memory

ex.  $A^{R_0} = B^{R_1} + C^{R_2}$   
 $D = A + E^{R^2}$   
 $P = A + R$

Registers only for  
SCALAR NO VECTORS like array.

|       |            |
|-------|------------|
| LOAD  | R1, B      |
| LOAD  | R2, C      |
| ADD   | R1, R2, R0 |
| LOAD  | R3, E      |
| ADD   | R0, R3, R4 |
| STORE |            |

only 4 registers!



CLEAR SPACE

LIVENESS RANGE: → the time when the var. is being used. Use registers for them until till that time

32 register can store 80 bits of live variables.  
Once it overflows, then add 1 to the main memory & bcoz it will be used for variable f.

Not before execution, everything in memory  
while " , something in memory, something in effect

$$A = B \times C + D$$

for the process on speed to be dominate, we need to bring entire thy in a lot not in chunks.

$$\begin{array}{l}
 \text{MUL } R_0, R_1, R_2, R_4 \\
 \text{ADD } R_2, R_4, R_3 \\
 \hline
 \text{MAC } R_1, R_0, R_2, R_3
 \end{array}$$

$\overbrace{\hspace{10em}}^{2020}$

$1000 + 10 + 10 = 1020$

$$C = \sum_{i=0}^{\infty} A_i B_i$$

$\vdash \text{CONV } A, B, C$

$$= 1000 + (10+10) * 1000$$

$$= 1000 + \underbrace{20\,000}_t +$$

$= 1000 + \frac{20,000}{\text{heavy computation}} \rightarrow$  fast processor  
 $\downarrow$   
 significant decrease in computational time.

= 21,000

Array addition dealt by using pointers (dynamic addressing)  
See dynamic addressing mode

Specify addressing mode by spec.   
 → MEMORY INDIRECT → Direct memory Addressing  
 → AUTO INCREMENT → use directly content of register (content is operable)  
 → base address given

```

graph TD
    A[Immediate (Value)] --> B[Addressing Mode]
    B --> C[Register Indirect]
    B --> D[Base + Offset]
    B --> E[Base + Index]
    C --- D
    C --- E

```

The diagram illustrates the Addressing Modes for memory access. It starts with four categories on the left: 'IMMEDIATE (VALUE)', 'AUTO INCREMENT', 'AUTO DECREMENT', and 'REGISTER INDIRECT'. The 'IMMEDIATE (VALUE)' category has an arrow pointing to 'Addressing Mode.', which then branches into three modes: 'Register INDIRECT', 'BASE + OFFSET', and 'BASE + INDEX'. The 'Register INDIRECT' mode is shown with a bracket above 'REGISTER INDIRECT' and 'Addressing Mode.'.

|           |     |           |      |      |
|-----------|-----|-----------|------|------|
| operation | A/M | operand 1 | op 2 | Dest |
|-----------|-----|-----------|------|------|

Base + offset:  $A[i] = \text{Base} + \text{offset}$  to reach



size x index

Memory Indirect:  $\rightarrow$  Double pointers.



Register stores the address of memory which contains the address where operand n' stored.

Auto Increment/Decrement (kind of stack).

PUSH R1, R2

automatically  
done in  
auto increm mode

M( $R_2$ ) =  $R_1$   
 $R_2 = R_2 + 1$

IMMEDIATE (VALUE): value given as instruction (constant)

Aim for all operands & destination  
using simple instructions:



Very simple  
so I know before  
how many operat



Read 2 out of m registers & write to a third one.

Architecture called

RISC  $\rightarrow$  Reduced  
ISCS  
(Simple Instructions)

Reduces memory communication faster as it ...

x86 used by Intel use CISC.

CISC FIXED ENCODING ← can process in parallel.  
 Operator A/M Opr1 A/M Opr2 A/M PESTMA  
 7 3 32 3 22  
 or  
 Register → 5  
 Addressing 4  
 mode Use Max = 32  
 Using max 6 wasted memory.

VARIABLE LENGTH ENCODING ← Don't know len of instruction.  
 so Not in parallel?  
 difficult to parallelise  
 CISC → variable in encoding

RISC → fixed

High level lang  
Work done by computer CISC  
 Work done by hardware (thus complex hardware,  
 simple compile)

MIC control signal

load, store access many  
 Rest - ALU op. access from registers



↓  
 J (Jump)

KA  
 DLX / MI  
 32

LO  
 AT

3  
 R  
 format  
 (All  
 opnands  
 register)

Add/Sub  
 SRL 1  
 Shift Right  
 Divide by 2

Add

I  
 format

# CONTROL FLOW CHANGE

unconditional

Conditional  
Data Dependant.

FLAG Register = Condition code / Status register.



updated after  
every arithmetic  
logic operation

Signed  
of e  
carry generated  
by operand in  
out not

if control flow  
is (jump)?

DLX / MIPS - Crisc - V  
32-bit ISA

REGISTER

R0 - R31 | uses small no. of  
addressing mode.

LOAD / STORE ARCHITECTURE

ALU

Arithmetic Logic Instruction



all  
registers  
64 operations  
encoded.

deposits another

SRL | SLL

shift right | shift left operator

divide  
by 2  
multiply by 2

SLT } less than  
SLE } ? - equality  
SGT } greater than

SLT R1 R2 R3, R1 < R2 R3  
R3 = 0.....0  
else R3 = 0.....1

Addressing Mode →

Immediate

when the freq is high of  
1 operand



ADD ADD I → treated differently  
this one for immediate operations.

64 bits have 2<sup>64</sup>  
7 = 64 op.

will use other 63 combinations  
 $\therefore 64 + 63 = 127 \text{ op}$

Base + Displacement - I-type format

LH SH  
LB SB  
Half word  
byte

② Base + Index ← R-type format

(provide 2 address & 2 reg's)



R1

Register  
you want  
you to look  
R2

$R2 = M(R1 + IMM16)$  LOAD

$M(R1 + IMM16) \rightarrow R2$  STORE

BEQ REG LOC ← I-type.

How  
far you  
need to  
go



If content of R1 = 0 then then  
it should IMM16 in instruction  
away from here.

[4 \* IMM16 from the  
current location]

Mostly programmed contd  
to remember  
the current  
location

Instruction  
pointer

$$IP = IP + IMM16 \times 4$$

bcoz every  
instruction is  
4 byte

BVEQ REG, LOC [ transfer → condition like if "call"

for unconditional change of location,



J LOC  
jmp

$$IP = IP + IMM26 \times 4$$

Example:  
to

JAL LOC R31 JUMP

↑  
loads the curr loc + R31  
if then move to desired  
locati.

RET LOC  
original  
RET BTR

JR REG → Jump and Return  
(to the address  
in Register)

R<sup>s</sup>, usually stored  
but > R not  
2nd for reg part

R [OP] OP1 | OP2 | DEST [C] FUN

Set value of dest based  
on comparator

I [OP] OP1 | DEST | IMM16

J [OP | IMM26]

BEGZ RX, LOC  
/BNES in terms of no. of instructions [every inst. = 4 byte]

return list

next R31

leave the  
return address

free base  
address

JAL. BAFF

location of B in R31 → Here in R31 (B+4) ← After function call, next  
instruction at start

C

↑ R31 overwritten by C+4 so, we lose B+4. so when  
JR R31 if will again execute C.

Save in free register ADD RS, RA, 0  
No of free register make a stack in memory.  
use a stack pointer.

← JR R31

LHI Rx, Imm16



Read only Register R0 which stores only 0  
Rest of register = Read + Write

ADD R5 R0, 0000H location of stack

Adds as a stack pointer

to store

SWI R31 (R5)0 { offset 0  
↓ Now increment because pointer must always point to top

ADD R5, R5 4

to return

SUBTRACT R5, R5 4

SWI R31 (R5)0

X calc.  
Sum of 10 elements  
Base = 01000H

6A[100]

R1 → Counter = 100

R2 → ~~Elements~~ SUM

R3 → 0001000

|     |    |    |       |
|-----|----|----|-------|
| ADD | R1 | R0 | 100   |
| ADD | R2 | R0 | R0    |
| ADD | R3 | R0 | 1000H |

LOOP : LOAD

R4 (R3)0

ADD R2 R2 R4

(R3)0 R3 4

SUB R1 R1 1

DNZD LOOP = (-5) in terms of instructions

STOR E R2 (R3) 0

HLT

For function calls with parameters, store parameters in registers & push them into stack.  
 Push right to left so that on popping, same order is rule.  
 Once pushed, stack pointer - appropriate value at access dictated.



Memory Read 1 Word  
 > 2 ms  
 RF read/write = 1 ms  
 AR = 2 ns

|     | a      | b | c | d | e | f | g | h | i | j | k | l |
|-----|--------|---|---|---|---|---|---|---|---|---|---|---|
| R   | 000000 | 1 | 1 | 1 | X | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| T   | 000100 |   |   |   |   |   |   |   |   |   |   |   |
| S   | 000010 |   |   |   |   |   |   |   |   |   |   |   |
| BQS | 001000 |   |   |   |   |   |   |   |   |   |   |   |
| J   | 010000 |   |   |   |   |   |   |   |   |   |   |   |

2 memory - 1 for program  
 1 for data

$$\text{Time} = \frac{1}{f_{\text{CPU}}} \times C_{\text{P.I}} \times T$$

Always fine per cycle

depends on the circuit.

Who decides the time,

$$f_{\text{CPU}} = \frac{1}{8 \times 10^9} \text{ sec} = 125 \text{ nsec}$$

All operations

In sequence  
 -> 1 write or reg

the longest path

$s \rightarrow s'$   
 $s \rightarrow p \rightarrow p' \rightarrow p'' \rightarrow s'$  make a transition system.

### SUB-TASKS

1. Fetching of instruction. (from the memory & update the program counter)
2. Update Program counter
3. OPERAND READ
4. Computation of Instruction or Computation of address of memory in case of load or store
5. Reading from & writing to memory
6. Update The state  
(write to register RF).

### A. QL inst

1. Inst fetch.  $\rightarrow$  can be done in parallel.
2. update PC
3. Operand read
4. Instruction computation
5. Write Back to RF



BEQ  $R_{S1}, R_{S2}, 2\text{mm}$

| OPERATION | OPI/DEST | IMM16 |
|-----------|----------|-------|
|-----------|----------|-------|

If  $R_{S1} = R_{S2}$  then

$$PC = PC + 4 + IMM16 \times 4$$

ELSE

$$PC = PC + 4$$



$T1 \rightarrow alu-A$   
 $T2 \rightarrow alu-B$   
 if (zero == 1) then S15  
 $T3 \rightarrow PC$

1. fetch instruction.
2. update  $PC \leftarrow PC + 4$
3. Read operands ( $R_{S1}$  &  $R_{S2}$ )
4. Compare (Subtracting)
5. If zero then  
 $PC = PC + 4 + IMM16 \times 4$ .



S14



J



1. fetch instruction

2. update  $PC \leftarrow PC + 4$

3. Compute  $PC + 4 + Imm26 \times 4$

4. Update PC.



clock

AR  
Mem  
RF

{

B



J

clock



Operation was  $S0 = S4 = S9 = S13 = S18$ .  $\leftarrow$  different in the sense  
of next statement evalt

$S1 = S5 = S10 = S14 = S17$



I will just say show overload.

→ Advantage

Share resources. (No need of multiple alus & memory)

clock-period

ALU - 2ns

Mem - 2ns

RF - 1ns

Max Delay = Max delay at  
Operation at  
2ns

only of the stds

R → 4 cycles

LW → 5 cycles

SW → 4 cycles

BGQ → 3 cycles

J → 3 cycles

| R   | S2  |
|-----|-----|
| LW  | S6  |
| SW  | S6  |
| BGQ | S15 |
| J   | S17 |

$$\text{Clock Period} = \frac{IC \times CPI \times T}{\uparrow}$$

Avg. (Arithmetic Mean) over all possible  
instructions

only if instructions are uniformly  
distributed.

Now performance  
of instruction,

Weighted mean

dependant on distribution



## LOAD INSTRUCTION



Required 5 bits  
32 bits required input  
16 bits only  
therefore sign extension  
needed for 32 bits D1.  
padding to be added with  
imm/16 need to be padded to be added with  
sign extension at end.

STORE INSTRUCTION



Control Flow.

BF & RI, R2, 100.

|           |      |      |       |   |
|-----------|------|------|-------|---|
| s1        | 2125 | 2120 | 1615  | 0 |
| operation | OPR1 | OPR2 | IMM16 |   |

RI      R2      100 .

$$\text{if } RI = R2 \quad PC = PC + 4 + 100 \times 4 \\ PC = PC + 400$$

$$\text{else} \quad PC = PC + 4$$

PC changes only in multiple of 4.  
IMM



with

$$J \rightarrow PC = PC + 4 + IMM_{26} \times 4$$



1st et jump & BEQ implement  $\Rightarrow$



$$\text{Mem read / write} = 2 \text{ n/s}$$

$$\text{RF write / read} = 1 \text{ n/s}$$

$$AV = 2 \text{ n/s}$$

$$CPI = 1 \text{ dynamic}$$

$$\frac{\text{Time}}{\text{Prog}} = \text{JC} \times CPI \times T$$

$$= 2 + 1 + 2 + 2 + 1 \leftarrow \text{Max.}$$

$$= 8 \text{ n/s}$$

JC  $\Rightarrow$  Inst. count

CPI  $\Rightarrow$  Cycle per Inst  
T  $\Rightarrow$  Clock Cyl

longest-path when we take load instruction.  
 for load  $\rightarrow$   $2 \text{ ns} + 1 \text{ ns} + 2 \text{ ns} + 2 + 1 \text{ ns} = 8 \text{ ns}$ .  
 MIPS 125

$2 \text{ ns}$   $1 \text{ ns}$   $2 \text{ ns}$   $2$   $1 \text{ ns}$   
 ↑ memory read R. file ALU data memory write into RF.

frequency  $\frac{1}{8} \text{ ns} = 12.5 \text{ MHz}$

Since  $2 \oplus \text{ CPI}=1$  is forced, time depends on IC.

$$\frac{\text{time}}{\text{Prog}} = \boxed{\text{IC}} \times \text{C}$$

↑ dynamic

dynamic/static instruction

To reduce IC,  
 $s \rightarrow s'$  replaced by

$$s \rightarrow p \rightarrow p' \rightarrow p'' \rightarrow s'$$

Subtasks.

- 1 Fetching the instruction
  - 2 Update PC
  - 3 Operand read
  - 4 Instruction computation or computation of address of memory (in case of load/store).
  - 5 Read/write memory
  - 6 Update the state (write to RF)
- �

to L1 and

1. Inst. fetch  
 2. Update PC  
 3. Operand

4. Instruction  
 5. Write back to RF

Computation (execute)



task ① & ②  
can be done  
in cycle

CPI changes but extra regis needed

Instead of using the ALU, ALU (right wala). can be used using MVX as that ALU is free in the clock cycle.

state 3<sub>(PC)</sub> - 3<sub>(ALU)</sub>  
clock - cycle of ALU  
of PC is not same as  
PC wala state  
जबकि अंतर से  
operation  
not.

AL  $\rightarrow$  6 ns  
ST  $\rightarrow$  7 ns  
Branch  $\rightarrow$  5 ns  
J  $\rightarrow$  4 ns.

Resource L ALU  
RF  
, PC

AW  
Stores.

S.

| PC $\rightarrow$ memory address, | alu A ADD |
|----------------------------------|-----------|
| Memory D $\rightarrow$ IR        | S0        |
| +4 $\rightarrow$ alu B           | S1        |
| alu C $\rightarrow$ PC           | S2        |





F) -) 52

same A&L which was



$$S_0 \rightarrow S_1 \rightarrow S_2 \rightarrow S_3$$



## Load instructions

|           |        |      |        |    |    |   |
|-----------|--------|------|--------|----|----|---|
| LW        | R1     | (R2) | 100    |    |    |   |
| operation | operv1 | Rest | Imm 16 |    |    |   |
| 31        | 28     | 25   | 21/20  | 16 | 15 | 0 |

- Subtasks :
- ① Fetch Instruction
  - ② update PC }
  - ③ Read operand →
  - ④ Compute address of memory .
  - ⑤ Read memory
  - ⑥ Write the result in register



## STORE INSTRUCTIONS

SW RI (R2)100

|           |      |       |       |
|-----------|------|-------|-------|
| OPERATION | DPR1 | DEST. | mmm16 |
|-----------|------|-------|-------|

Subtask

- ① Fetch
- ② Update
- ③ Read Op
- ④ Compute Add
- ⑤ Write to memory



S9 - S10 - S11 - S12

BEQ \$ R<sub>S1</sub>, R<sub>S2</sub>, Imm

| OPERATION | OP1            | DEST  | Imm16 |
|-----------|----------------|-------|-------|
|           | 26-25<br>21-20 | 16-15 |       |

If ( $R_{S1} == R_{S2}$ ) then

$$PC = PC + 4 + Imm16 \times 4.$$

else

$$PC = PC + 4$$

Subtask

- ① Fetch Info
- ② update PC ( $PC + 4$ )  $\downarrow$  housekeeping
- ③ read operands ( $R_{S1}$  &  $R_{S2}$ )
- ④ Compute (Subtask)
- ⑤ If zero,  $PC = PC + Imm16 \times 4$

S13



S14



S15



$CPI = 3$  but only  
ALU is needed to  
shift  $\rightarrow$  first  
& instruction to S14  
use same ALU



$r_{to}$   
y