

Date 01/05/2022

ISA: Instruction set architecture

[Initially assembly language]

Blueprint

→ essential properties about the product (which does not exist) needed for the success of the product, or it will fail.

Financial



Architecture: bunch of properties decided in early phase of product development that tells whether the product has a chance at success.

| COMPUTER<br>level of Abstractions | Problem                   | ISA: language of processor, helps to convert HLL to machine language. |
|-----------------------------------|---------------------------|-----------------------------------------------------------------------|
| Software                          | Algorithms                | (unique for a machine)                                                |
|                                   | High level lang. Programs |                                                                       |
|                                   | OS/VM/Compiler            |                                                                       |
| Hardware                          | ISA/Assembly lang.        |                                                                       |
|                                   | microarchitecture         | processor                                                             |
|                                   |                           | memory                                                                |
|                                   | Gates/Circuits            | buses (topmost layer of H/W)                                          |
|                                   | Electrons(voltage)        | everything in comp: based on electricity                              |

Shoulders of Giants: (Brief History)

Isaac Newton: If I have seen further it is because by standing on the shoulders of giants.

Gordon Moore: co-founder of Intel

Kathleen Booth: came up with assembly lang.

Federich Faggin: Intel 4004: first commercial microprocessor architecture of 4-bit arithmetic. Chip used in calculator.

Quantitative approach to design & evaluation of computer architecture: Henry & Peterson

RISC  $\rightarrow$  ISA

$\downarrow$   
made it open source

## ① Number System

$$(10011101)_2 \rightarrow (x)_{10}$$

$$2^7 + 2^4 + 2^3 + 2^1 + 1$$

$$\begin{array}{r} 32 \\ 16 \\ 8 \\ 4 \\ 2 \\ \hline 101 \end{array} \quad \begin{array}{r} 128 \\ 64 \\ 32 \\ 16 \\ 8 \\ 4 \\ 2 \\ \hline 11101 \end{array}$$

$$\begin{array}{r} 101 \\ \swarrow \uparrow \\ (101)_4 \end{array}$$

100 104 108 112 116 120 124

|   |   |   |   |    |   |     |
|---|---|---|---|----|---|-----|
| 2 | 3 | 5 | 1 | 11 | 0 | ... |
|---|---|---|---|----|---|-----|

Array: can directly jump  
to any location &  
do an operation  
(freedom)

## ② Stack

Stack: no freedom

so why use stack?

Organisation of data in stack simplifies the problem in  
(algorithm's complexity might change)

variables : Top



Push(2)

Pop(2) : decrease top by 1

Dal Polish Notations:

- ③ Postfix :  $(2+4) \rightarrow (24+)$   
 very easy for a machine that uses mem as storage to solve problem.

Prefix :  $(+ 24)$  operator present b4 operand

④ Organisation & Architecture

⑤ Syntax & Semantics:

⑥ Language : set of words { words }

If not present in this set, then not a part of language.

$$\begin{aligned}
 &= \{ \text{alphabets} \} \quad (26) \text{ english} \\
 &= \{ 0, 1 \} \quad \text{binary} \quad (\text{base} = 2) \\
 &= \{ 0, 1, 2 \} \quad (\text{base} = 3)
 \end{aligned}$$

Syntax error: missed something essential in program  
 (structure)  
 (e.g.;)

$4 = a + 2;$  no wrong ( $\because$  LHS should have a variable)

Semantics: whether it is meaningful or not  
 (whether it has meaning or not)

Date \_\_\_\_\_ / \_\_\_\_\_ / \_\_\_\_\_

**MUX :**

Decoder

Flip flop

Registers



$$A = 0 \Rightarrow X = 0$$

(no need to know B)



$$X = 1 \text{ if } A = 1$$



$$X = 1 \text{ if } A = 0$$



$$A = 1 \Rightarrow X = 0$$

$$\overline{A+B} = \bar{A}\bar{B}$$

Input



Output line

multiple  
channels

Remote  
channels

Circuits such as these use MUX: multiplexer.

I/p  
channel



I/p

If o/p = 01  $\Rightarrow$  I1 displayed

Control input  
(can generate a

n. of 2 bits  $\Rightarrow$  max. 4 channels)

(If n bits, then we can generate)

$2^n$  diff. inputs

$\therefore 4 \text{ i/p}$

Saathi

Date \_\_\_\_\_ / \_\_\_\_\_ / \_\_\_\_\_

MUX



In general :  $\begin{array}{|c|c|} \hline n & 2^n \times 1 \\ \hline \text{MUX} & \\ \hline \end{array}$

## DECODER



Required:

| A | B | D <sub>0</sub> | D <sub>1</sub> | D <sub>2</sub> | D <sub>3</sub> |
|---|---|----------------|----------------|----------------|----------------|
| 0 | 0 | 1              | 0              | 0              | 0              |
| 0 | 1 | 0              | 1              | 0              | 0              |
| 1 | 0 | 0              | 0              | 1              | 0              |
| 1 | 1 | 0              | 0              | 0              | 1              |

Given,  $D_0 = 1$  if  $A=0$  &  $B=0$

Register

circuit : 1 bit

Saathi



When  $S = 1$  or  $O$  is always  $i$

$S = 0$

~~$D = \text{zero}$~~   $O = \text{zero}$

So o/p will remain  
 $i$  as long as  $S = 1$   
& power is on.



bus : take data from one place  
& transport it to other.

when  $E = 0 \rightarrow \text{zero}$

$E = 1 \rightarrow \text{output value finds its way to the bus}$

prevents value  
to go to the bus.

For 1 byte  $\Rightarrow$  8 registers.

Bus :



Every <sup>MP</sup> has a manual that explains its ISA -



ISA: High level lang.  $\rightarrow$  i/p for compiler

Assembly

Machine lang.

0 1  $\rightarrow$  O/p of compiler

Alan Turing: British scientist in WWII.

Enigma: German machine: transmitted encrypted code. So decoding required.

Stored Program Computer:

- Model was proposed by Von Neumann
- Computer's memory would have program
- Instructions would be executed sequentially.
- All modern computers are based on this model  
(slight variant)

Program  $\rightarrow$  Memory  $\rightarrow$  Processor  $\rightarrow$  Instructions

reads stored reads

Data flow architecture  
parallel structures

# 5 components of a computer

Saathi

Date \_\_\_\_\_



Any of the 4 parts can't do anything, unless control unit sends a signal to that. Task is performed in a synchronised way.

## Memory Component

Size & address:



Every byte has an address. Such a memory is called byte-addressable.

If memory has 16 bytes

How many bits required to uniquely identify each byte?

Total addresses = 16 (if byte-addressable memory)

$$= 0 \dots 15$$

$$15 = 1111$$

= min. 4 bits required to generate 16 addresses

Memory of  $2^n$  memory locations & each loc. has m bits

- It has address spaces of  $2^n$  uniquely identifiable locations
- Addressability of m-bits

MAR - Memory address register  
MDR - " data "

Date \_\_\_\_\_ / \_\_\_\_\_ / \_\_\_\_\_

Saathie

1-byte = 8 bits

4 bit → nibble

If each address has 4 bits, addressability = 4

" " " " , 8 bits 8 bits = 8 bits

= 1 byte

= byte-addressable memory

Read-Write:

1 read → 2 steps →  $\xrightarrow{\text{MAR}} \xrightarrow{\text{MDR}}$  {at atomic level}



array[i]  $\Rightarrow$   $\underline{\text{base address}} + (\underline{i} * \underline{n})$

size of data type

RISC V architecture

LC 3 architecture (includes 16 instructions)

$$x_4 = x_5 + x_6$$

add  $x_4, x_5, x_6$   
operation code (opcode)  
operands

$$x = a + b$$

+ : operation operators  
 $x, a, b$  : operands

addition using 3 registers

(first register =  $x_0$ )

$$\text{sub } x_4, x_5, x_6 \Rightarrow x_4 = x_5 - x_6$$

addi  $x_4, x_5, 20$  // add immediate with constant

$$\Rightarrow x_4 = x_5 + 20$$

Date \_\_\_\_\_

$$\text{subi } x_4, x_5, 10 \Rightarrow x_4 = x_5 - 10$$

$i \Rightarrow$  immediate

C-code:  $a = (b+c)-(d+e)$

$a = (b+c) - (d+e)$  add a;

↓

sub a, add b, c, add d, e

RISC-V Assembly:

add x<sub>1</sub>, x<sub>2</sub>, x<sub>3</sub><sup>d e</sup>

add x<sub>4</sub>, x<sub>2</sub>, x<sub>3</sub><sup>b c</sup> ← reuse them as registers are free.

sub x<sub>5</sub>, x<sub>4</sub>, x<sub>1</sub>

st x<sub>5</sub>, ... // instruction to store whatever is in x<sub>5</sub>

(present in memory)

→ (to be transferred to register)

Data Transfer:

ld x<sub>5</sub>, 40(x<sub>6</sub>)

load

Base address in reg. x<sub>6</sub>

& 40 is offset

A[40]; ← array

Wherever x<sub>6</sub> is pointing, move 40 units ahead & get the data present there into x<sub>5</sub>.



1 store

$a = b + c$

$\Rightarrow$  4 instructions in RISC-V  
= 1 instruction in C

ldd ldd

ladd

sd x<sub>5</sub>, 40(x<sub>6</sub>)

= store data

whatever is in x<sub>5</sub>, put it in

the location to the 40 units

ahead of x<sub>6</sub>.

Memory[x<sub>6</sub>+40] = x<sub>5</sub>

Store double word

beg  $x_5, x_6, 100$ if ( $x_5 == x_6$ ) go to PC + 100branch equal to  $\text{label}$ ↓  
Program counter

has address of instruction which should

bne  $x_5, x_6, 100$ if ( $x_5 \neq x_6$ ) go to

PC + 100

branch not equal to.

be executed next.

bge  $x_5, x_6, 100$ if ( $x_5 \geq x_6$ ) go to PC + 100

branch greater than equal to.

add  $x_3, x_4, x_5$ // instruction represented in binary  
format (executable form)add  $x_3, x_4, x_5$ 

↑ has 32 bits, max. 3 registers,

1 opcode

⇒ 8 bits assigned to each.

If 16 bits, then opcode + 4 bits.

Each

 $x = AB + C$  use an opcode & it will be computed in 1 step. [separate steps for add & product not required]

Ans

0001 | 0001 | 0010 | 0011

: Instruction set has 16 bits.

4bit 4bit 4bit 4bit

[add  $x_1, x_2, x_3$ ] ⇒ 4 entries

0001 → add

representation  
in machine  
language

using 4 bit for opcode &amp; each register

⇒ 16 different opcode possible

add 0-001

addi 0-010

sub 0-011

mult 1

div 1

bne 1

...

we  
can have  
16  
registers:x<sub>1</sub> ← 0-000  
x<sub>2</sub> ← 0-001

15

Date \_\_\_\_\_



Case 1 → 16 registers

Case 2 → 8 registers

$\therefore$  1 reg. can't be represented by 4 bits  
req. then 3 reg  $\Rightarrow$  12 bits  
 $5 + 12 = 17 > 16$  bits

X instruction

for a large expression with multiple variables, so case 1 is faster as it has registers (so load & store not required)  
Costly operation.

Depending on complexity of code, either of the 2 cases could be faster.

Case 2 introduces 2 opcodes at the cost of 8 registers.

No. of Registers

32-bit register

 $\Rightarrow$  it can hold 32-bit data.

Case 1: used all the bits. If a new kind of data generated,  
no space left to add that.

So there is no scope for extension here

Case 2: There is scope for extension here.

## C2 Lecture 2

Memory of size : 16 MB

Addressability is 4-bits

data unit = 1B  
(each 1 byte has 1 address)

# bits required for an address for this memory .

int list[4];



$$x = \text{list}[3] = *(\text{list} + 3)$$

$$= *(\text{list} + 100 + 3)$$

$$= *(\text{list} + 4 \cdot 3)$$

$$= *(112)$$

$$1 \text{ kilo} = 1024 = 2^{10}$$

now called kilo'

$$2^{20} = \text{Mega}$$

$$2^{30} = \text{Giga}$$

$$16 \text{ MB} = 2^4 \times 2^{20} \text{ B} = 2^{24} \text{ B}$$

$$\text{unique addresses required} = 2^{24}$$

$$\therefore \text{no. of bits required} = 2^4$$

(to represent all address)

If data unit = 4 bits  $\Rightarrow$  1/2 byte

$$\text{size} \Rightarrow 1 \text{ byte} = 2 \text{ addresses}$$

$$\text{unique address required} = 2 \times 2^{24}$$

$$2^{24} \text{ B} = 2^{24} \times 2 \times 1 \text{ B}$$

$$\boxed{2} \\ \text{unit} \\ = 2^{25}$$

$$\text{no. of bits required} = 25$$

$$\text{addressability} = 2 \text{ bytes}$$

$$\Rightarrow 1 \text{ address} = 2 \text{ bytes}$$

$$\text{Unique address} = 2^{24} \text{ B}$$

$$= 2^{23} (2B)$$

$$\frac{1}{1 \text{ unit}}$$

$$\therefore 23 \text{ bits reqd}$$

ISA: interface b/w programmer & microarchitect

(+/-) Process  
(+/-) Cache  
(+/-) memory  
(+/-) Register  
control



ISA: manual that includes:

- ① Instruction set
- ② General Purpose Register
- ③ Memory

PC → to hold an address

IR ← Instruction register

Von Neumann Model :

→ Instructions stored in memory &

Dataflow architecture (1) execute sequentially

[no scope of parallelism]

Byte addressability = one

$I_1 = 100$  PC

$I_2 = 102$

executed

one after another

$I_3 = 104$

one after another

$I_4 = 106$

one after another

Memory → MAR, MDR



Date / /

Saathi

add  $x_0, x_1, x_2$   
 ↓  
 opcode      ↓ source  
 destination

has a value  
 which is in  
 memory

$x = a + 10; \quad t \leftarrow c$

addi  $x, x_1, 10 \quad t \leftarrow$  in assembly lang.

load data  $a$  to register  $b4$ , executing.

st.  $x_1 \quad x_2 \quad x_3 \quad a, x_5$

25 | 15 | 50 | 200



ld  $x_2, (x_5)$

addi  $x, x_1, 10$   
 sd  $x_1, (x_0)$



Neural networks  
 (non-linear) Deep Learning

GPU  
 games  
 + graphics  
 ↓  
 more pixels  $\Rightarrow$  sharper image

r, g, b  
 (6 → 255)

→ PC points to next instruction.

Hardware  $\rightarrow$  has a vision based on the customer targeted.

Software  $\downarrow$  Hardware  $\downarrow$  processor pipeline of 16 bits  $\rightarrow$

ISA

16 bits  $\rightarrow$

16 bits  $\rightarrow$  16 instructions  
 $\Rightarrow$  4 bits for opcode

8 GPRs.  $\Rightarrow$  3 bits required for each

2's comp :  $x$   
 Negative      Positive  
 $(x+1) \quad 0 \quad x$

Date

HLL

If ISA is close to control signals, hardware has to simplify it for processor to understand.

Semantic Gap



control signals

(translation)

smaller semantic gap  $\Rightarrow$  easy for processor to process  
large s.g  $\Rightarrow$  compiler has to break it to simpler parts

### STORE RESULT: How many ways?

- Initialise a variable name
- Pointers to integer
- Pointer to a pointer to an integer

↳ writing a word to memory

→ conventions and rule

spani respects & using down

- Array

- Option to specify memory address

ld  $x_1, (100)$

ld  $x_1, (x_2) \Rightarrow x_1 \leftarrow \text{mem}[\text{reg}[x_2]]$

ld  $x_1, 10(x_2)$  [x<sub>2</sub>+10]

ISA closer to HLL has more addressing modes

first thought : general +  
+ (base + offset)

MEMORY

Why memory? To store instructions, data.

Primary func. of comp → to compute/process data. So imp. to store data.

ENIAC: first electronic computer (vacuum tubes)

↳ based on decimal " (rather than binary)

using 10 vacuum tubes representing each decimal digit.

disadvantage: ① laborious & tedious work.

② Programmer would have to repeat the same sequence of operations to get the output again.



If the device is used for same type of computation, then we don't need memory to store the instruction. Hardwire the logic into the hardware itself. ( $\because$  Code not required.)

If the device is used for any type of computation, then code is required with data to give output.



Every time we have to give the code as i/p to repeat the operation. So we need to store the code in memory.

Then came von Neumann Architecture - (around 1940s)

# Von Neumann Architecture

Saathil

Date \_\_\_\_\_



→ Classical comp.  
today are based  
on it.

Coll

- $I_1$  - Load address to the program counter.
- $I_2$  - Program start memory address.
- $I_3$  - Stores the address of the next instruction to be performed.
- $I_4$  - for  $i = 1$  to 100
- $I_5$  -  $a[i] + 3;$

Memory has not evolved at the same pace as processor.

Memory latency:

→  $10^9$  clock cycles per second.

Consider a processor operating at 1 GHz connected to a DRAM with a latency of 100 ns (no cache). Assume the processor is capable of executing 4 instructions in each cycle of 1 ns.

Peak processor rating is? 4 GLOPS (gigalops)

$4 \times 10^9$  instructions per second.

flop. floating point  
operations per second.

$$\begin{aligned}
 1 \text{ sec} &\leftrightarrow 10^9 \text{ cycles} \\
 1 \text{ cycle} &\rightarrow 10^{-9} \text{ sec.} \\
 &= 1 \text{ ns.} \\
 &\rightarrow \text{nosecond.}
 \end{aligned}$$

Date \_\_\_\_\_  
Program



100ms to fetch data from DRAM

$$100\text{ms} = 100 \text{ cycles}$$

Processor cycles are being wasted  
as it has to wait

In 100, it could've completed  
400 cycles.

Key Characteristics of Comp. memory system.

Internal : directly accessible by processor

External : not " " " "

Main memory → Volatile

Non-accessible → ROM

Memory hierarchy → how much? how fast? how expensive?

Registers: fastest

Registers Cache is better than main memory

(faster than main memory)

made up of gates (very large ~~cost~~ area)

Memory hierarchy

cost ↑

Size ↓

Speed ↑

SSD: \$12/GB

Main memory → \$100 GB

> SSD.

∴ SSD lies below main memory.

Principle of locality

Date \_\_\_\_\_ / \_\_\_\_\_ / \_\_\_\_\_

Principle of Locality: program access a relatively small portion of the address space at any instant of time.

~~After~~ Temporal Locality: keep most recently accessed data items closer to the processor.

~~After~~ Spatial Locality: Move blocks consisting of contiguous words closer to the processor.

Access from cache is faster (approx. 5 times) than main memory.



3 level cache organisation.

