

## Von-Neuman model

stored program concept

The five classic components of computer system are:-

Control Unit, Datapath, Memory, I/O, CPU  
Registers, ALU & buses



### Architecture

Data structures

Data types

instruction set

instruction format

addressing modes

### Organization

MIPS → RISC ← { ARM  
 CISC } 8086

Microprocessor without Interlocking Pipeline System

IB

Computer Organization and Design: The hardware / software interface, Patterson & Hennessy

Lab

Vhdl HDL: A Guide to Digital Design  
and Synthesis by S. Pabnitzkar

## MIPS Register Set

→ 32 bit Processor

→ 32 bit address/date

→ GPRs

- 32 regs

- R<sub>0</sub>-R<sub>31</sub>

- Each reg is 32-bit wide

→ FPR (Floating Point Registers)

- 32 32-bit regs

- f<sub>0</sub>-f<sub>31</sub>

- Paired DP

→ System Registers

- PC, hi, lo, etc.

## GP Register File

32 registers (0 to 31)

Register names begin with a \$

- \$0, \$1, \$2, ..., \$31

- \$00-\$03, \$80-\$87

Not all registers for GPs

R<sub>0</sub>: ready return zero but cannot write

R<sub>31</sub>: used in function calls

## Software convention for Register

- 0 zero constant 0
- 1 at reserved for assembly
- 2 v0 expression evaluation
- 3 v1 function results
- 4 a0 arguments
- 5 a1
- 6 a2
- 7 a3 temporary: caller saves
- 8 . to .
- 9 s0 .
- 10 s1 .
- 11 s2 .
- 12 s3 .
- 13 s4 .
- 14 s5 .
- 15 t7 .
- 16 s0 callee saves  
(callee must save)
- 17 s1
- 18 s2
- 19 s3
- 20 s4
- 21 s5
- 22 s6
- 23 s7 temporary (cont'd)
- 24 t8 .
- 25 t9 .
- 26 k0 reserved for OS kernel
- 27 k1 reserved for OS kernel
- 28 gp .
- 29 sp stack pointer
- 30 fp frame pointer
- 31 ra return address (H/W)

## fixed instruction format (MIPS)

instruction format types

R-Type

I-type

J-type

Page 1: Simplicity favours

Suppose we want to add 4 variables

$$\textcircled{1} \quad a + b + c + d + e$$

add a, b, c

add a, a, d

add a, a, e

$$\textcircled{2} \quad d = a - e$$

sub d, a, e

$$\textcircled{3} \quad a + b + c$$

add a, b, c

$$\textcircled{4} \quad f = (g + h) - (i + j)$$

add to, g, h

add t1, i, j

sub f, to, t1

r-type



For all r-type opcode is 000000

I-type





Translate the MIPS assembly instruction

① add \$t0,\$\$1,\$\$2

To binary instruction format



func → for sub is 34  
what MIPS instruction is represented by the following bit sequence.



② sub \$t2,\$t0,\$t2

③ sll → shift left logical. rs will be zero as there are only two operands.  
④ srl → shift right logical. rt & rd.



⑤ and and \$s1,\$s2,\$s3

⑥ or or \$s1,\$s2,\$s3

⑦ nor nor \$s1,\$s2,\$s3

⑧ xor \$s1,\$s2,\$s3

⑨slt comparison

How will you get the content from the registers of memory in MIPS?

add \$r1, \$r2, \$r0

or \$r1, \$r2, \$r0

and \$r1, \$r2, \$r0



① addi \$s1, \$s2, #100

↓  
rt source

rs

↓  
rt

lw \$s1, byteoffset(\$s2)

↓  
rt

rs

sw byteoffset(\$s2), \$s1

↓  
rt

rs

lw \$s1, 100(\$s2)

↓  
rt

rs

|     |    |    |     |
|-----|----|----|-----|
| imm | rs | rt |     |
| 35  | 18 | 17 | 100 |

sw \$s1, 100(\$s2)

|     |    |    |     |
|-----|----|----|-----|
| imm | rs | rt |     |
| 43  | 18 | 17 | 100 |

lw \$3, \$00(\$1)  
add \$53, \$52, \$300(\$6)

add \$54, \$52, \$53

sw \$00(\$6), \$54

beq \$51, \$52, L1

bne \$51, \$52, L2

|    |    |    |                      |
|----|----|----|----------------------|
| OP | rs | rt | address displacement |
|----|----|----|----------------------|

beq \$51, \$52, 100

|    |    |    |                   |
|----|----|----|-------------------|
| 4  | r  | rt | add. displacement |
| 17 | 18 | 25 |                   |

bne \$51, \$52, 100

|   |    |    |    |
|---|----|----|----|
| 5 | 17 | 18 | 25 |
|---|----|----|----|

f, g, h, i, j, k are constants  
so what is compiled MIPS

if(i==j) f=g+h else f=g-h

beq \$53, \$54, IF

sub add \$50, \$51, \$52

beq \$80, \$zero, END

IF:

add \$50, \$51, \$52

END:

|      |       |
|------|-------|
| code | label |
|------|-------|

Compile the following while using MIPS assembly.

while (arr[i] == k)

i = i + 1;

Assume that i, k corresponding to \$31, \$55 and  
of array save is in \$6.

loop: sll \$s1, \$s3 #2  
add \$t1, \$t1, \$s6.  
lw \$s2, \$s0(\$t1)

bne \$s2, \$s5, exit

addi \$s3, \$s3, #1

jr loop

exit:

J → Jump unconditional Jump

Jr → Jump register

Jal → Jump and Link

J imm # Jump absolute

opcode: 2

6 bit opcode, 26-bit address

new PC = (PC & 0x00000000) | (imm < 2<sup>22</sup>)

J 10000

|   |      |
|---|------|
| 2 | 2500 |
|---|------|

② Jal imm # Jump & Link

opcode = 3

$r31 \leftarrow PC$

|   |      |
|---|------|
| 3 | 2500 |
|---|------|

↓  
MIPS

③ Jr \$rs

|    |    |    |    |    |      |
|----|----|----|----|----|------|
| OP | rs | rt | rd | sh | finc |
|----|----|----|----|----|------|

|   |    |   |    |   |   |
|---|----|---|----|---|---|
| 0 | rs | 0 | 10 | 0 | 8 |
|---|----|---|----|---|---|

④ Jalr \$rs, \$t # Jump reg and link

$r_t \leftarrow PC$

$PC \leftarrow rs$

mult \$s2, \$s3 # Hi, Lo ← \$s2 \* \$s3

multu \$s2, \$s3 # Hi, Low ← \$s2 \* \$s3

mfhi

\$s1

mflo

\$s2

div \$r2, \$r3       $l_0 \leftarrow b52/f13$        $l_1 \leftarrow b52/f13$       *bezier remain*

div \$r2, \$r3

Compare & Branch Instructions :-

Compare to zero and branch :-

blez rs, offset       $R[n] < 0$  then b

bgtz rs, offset

bltz rs, offset

bgez rs, offset

bltzal rs, offset

bgezal rs, offset

how do you branch if  $r2 = 0$

beq \$r2, \$zero, offset

sub \$t0, \$r2, \$r3

blt \$t0, offset

lui - load upper immediate

lui-type instruction.

lui \$t1,255

$t1 \leftarrow \boxed{255 \mid 0}$

What is the MIPS assembly code to load this 32-bit constant into reg \$s0?

lui \$s0, 61

addi \$s0, 2304

1+ 418  
+ 16+32

Addressing modes:-

1. Immediate addressing:- where the operand is constant within the instruction itself. The operand is 16 bit of itself.

$\boxed{\text{op} \mid \text{rs} \mid \text{rt} \mid \text{immediat}}$

addi

ori

xori

andi

2. Register Addressing Mode:-

$\boxed{\text{01} \mid \text{r} \mid \text{s} \mid \text{rt} \mid \text{shamt} \mid \text{function}}$

register

Base or displacement addressing mode

lw  
sw

$\boxed{\text{op} \mid \text{rs} \mid \text{rt} \mid \text{offs}}$



## PC Relative addressing mode

beq  
bne



## Pseudodirect addressing

where the jump address is 26 bit of the ~~isla~~  
which is left shifted by 2 times and then  
concatenated with 4 bits of PC.



loop: sll \$t1, \$s3, 2

0 | 0 | 19 | 9 | 2 | 0

add \$t1, \$t1, \$s6

1 | 0 | 9 | 22 | 9 | 0 | 2

lw \$t0, 0(\$t1)

35 | 9 | 8 | 0

bne \$t0, \$s5, \$t1

5 | 8 | 21 | 2

addi \$s3, \$s3, 1

8 | 19 | 19 | 1

jr \$t1, \$s6

12 | 200 | 0

Exit

## Performance :-

Do the following changes to a CS increases throughput?  
① Decrease response time? ② both?

Replacing processor in a computer with the faster version

Adding additional processor to a system that uses multiple processors for separate tasks?  
For example: searching www

Response time or Execution time is the time between the start and completion of a task.

To maximize performance we want to minimize response time or execution time for same task

$$\text{Performance}_x = \frac{1}{\text{Executiontime}_x}$$

$$\text{Performance}_x > \text{Performance}_y \\ \text{Executiontime}_x < \text{Execution}_y$$

If  $x$  is  $n$  times faster than  $y$

If computer A runs a program in  $10s$  and computer B runs it in  $15s$ , how much faster is A to B?  
A is 1.5 times faster than B.

How do you calculate the CPU time if only no. of clock cycles used for the program is given?

$$n \times \text{clock cycle time}$$

$$\begin{aligned} \text{CPU time for a program} &= (\text{CPU clock cycles for program}) * \frac{\text{clock rate}}{\text{Ghz}} \\ &= \frac{\text{CPU clock cycles}}{\text{clock rate}} \end{aligned}$$

$$\text{Clock rate} = \frac{1}{\text{clock cycle time}}$$

Our fax program runs in 10 s on machine A which has 2 GHz clock. We are trying to help a computer designer build a computer B which will run in 6 sec. The designer has determined a substantial increase in clock rate is possible but this increase will effect the rest of CPU design causing computer B to require 1.2 clock cycles as A. What clock rate should we designer to target.

$$(4 \text{ GHz})$$

$$\frac{10}{6} = \frac{x}{2}$$

$$x = \frac{20}{6} \times 1.2 = 4 \text{ GHz}$$

$$\frac{10}{6} = \frac{x}{2}$$

$$6x = 20$$

$$x = \frac{20}{6}$$

$$= 4$$

Machine A has 250MHz frequency and machine B has 500 MHz frequency.

Company code segments :-

Suppose we have two implementations of the same ISA. Computer A has 250 ps. CPI of 2 for some program. Computer B has 500 ps and CPI of 1.2

$$\frac{A}{B} = \frac{250 \times 2}{500 \times 1.2} = \frac{1}{1.2}$$

~~805~~ 1.2 500 B-10

B is faster than A by 1.2 times

CPI for each instruction class

| CPI | A | B | C |
|-----|---|---|---|
| CPI | 1 | 2 | 3 |

Instruction counts for each instruction class

| <u>Code seq.</u> | A | B | C |                     |
|------------------|---|---|---|---------------------|
| 1                | 2 | 1 | 2 | $\frac{10}{5} = 2$  |
| 2                | 4 | 1 | 1 | $\frac{9}{6} = 1.5$ |

2 | 4  
A compiler designer is trying to design between two codes.  
The hardware designer has supplied following facts.  
For a particular high level language the compiler  
must require

two code sequences

that require

Two code sequences @ which code sequence most instructions.

(b) which will be faster.

(b) what is CPI for each sequence

$$12_2 \quad 22_{1.5}$$

$$\begin{aligned}
 \text{Time - seconds} &= \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles} \times \text{seconds}}{\text{Instructions} \times \text{clock cycles}} \\
 &= \text{no. of instructions} \times \text{CPI} \times \text{clock cycles}
 \end{aligned}$$

Components of performance :-

There are four basic concepts of performance

- i) CPU Execution time for a program (seconds)
- ii) Instruction count
- iii) clock cycles per instruction (CPI)
- iv) clock cycle time ( $\frac{1}{\text{frequency}}$ )

Instruction mix indicates how much percentage of instructions belong to each kind of classes.

It is a major of dynamic frequency of instructions across one or many programs.

An 800 MHz AMD processor, CPI = 1.2 for a MP3 compression

A 1 GHz pentium has CPI = 1.5. Both have identical

ISA. which processor is better.

$$\frac{P_A}{P_B} = \frac{\frac{1.5}{10^9}}{\frac{1.2}{800 \times 10^6}}$$

$$\begin{aligned}
 &= \frac{1.5}{10^9} \times \frac{800 \times 10^6}{1.2} = \frac{1.40}{10^6} = 1
 \end{aligned}$$

What is clock cycle time for a machine with clock rate

200 MHz

$$T = \frac{1}{f} = \frac{1}{200 \times 10^6} = 5 \text{ ns}$$

1. Amdahl's Law:-

The performance enhancement of an improvement is limited by how much the improved feature is used.

For example:-

$$\text{Execution time after improvement} = \frac{\text{Execution time effected by improvement}}{\text{Amount of improvement}}$$

+ Execution time unaffected.

Suppose you could speed up float point FP by 2 times but the program has 20% FP operations

$$E_{\text{after float}} = \frac{0.2}{2} + 0.8$$

$$= 0.9$$

- 1. Commoncase Fat  $\rightarrow n/w$
- 1. Amdahl's law  $\rightarrow s/w$
- 2. MFTS

3. MFLOPS

4. Benchmarks

Instruct types & CPI

What is a range CPI for following machine

| OP     | freq | cycles |
|--------|------|--------|
| DLV    | 50%  | 1      |
| Load   | 20%  | 5      |
| Store  | 10%  | 3      |
| Branch | 20%  | 2      |

$$0.5 \times 1 + 0.2 \times 5 + 0.1 \times 3 + 0.2 \times 2 \\ = 0.5 + 1 + 0.3 + 0.4 \\ = \underline{0.0} \quad 2.2$$

MIPS - million instructions per second

$$\text{MIPS} = \frac{\text{Instruction count}}{10^6 \times \text{Execution time}}$$

$$= \frac{\text{Instruction out}}{10^6 \times \text{Instruction count} \times \text{CPI} \times \text{Clock time} \times 10^6}$$

$$= \frac{\text{clock rate}}{\text{CPI} \times 10^6}$$

MFLOPs → million floating point operations per second

$$\text{MFLOPs} = \frac{\text{No of Fl. operations}}{\text{execution time} \times 10^6}$$

→ A benchmark is a standard of measurement or evaluation.

→ A computer benchmark typically a computer program that performs a strictly defined set of operations which is called as workload and

return some form of result for example.  
one is a benchmark that is used  
BIPS, BOBS, GIIPS, GOIPS

MIPS failure

It doesn't take into account the instruction set  
MIPS rate depends on the programs and its  
instruction mix

MIPS are not constant on a single machine  
MIPS can vary inversely with performance.

Instruction Execution

IF + IP + execute + memory pbe + write back  
+ FD

Elements of datapath:-



Instruction



Instruction memory

ALU operation



Fetching and Incrementing the PC :-



Registers



Whenever writeback is present we need to enable RegWrite.

Data Memory Unit :-



Data memory unit



By Dest → decide whether it or rd is the destination.  
AUOP1 AUOP2

Branch

|            |            |
|------------|------------|
| Branch     | → add      |
| Member     | → subtract |
| Member Reg | → P-Tyr.   |
| ALUOP      |            |
| Monowire   |            |

## Values of various signals

| Instruction | RegDst | AUFC | Memory | Reg write | Mem Read | Mem write | Mem Branch |
|-------------|--------|------|--------|-----------|----------|-----------|------------|
| R-Type      | 1      | 0    | 0      | 1         | 0        | 0         | 0          |
| loadword    | 0      | 1    | 1      | 1         | 1        | 0         | 1          |
| storeword   | 0      | 1    | 1      | 0         | 0        | 1         | 0          |
| beq         | x      | 0    | x      | 0         | 0        | 0         | 1          |

## 4-bit AUU control

0000 - AND

0001 - OR

0010 - add

0110 - subtract

0111 - set on less than

1100 - NOR

We wish to add instruction GT to MIPS single cycle data path.

AUU has additional control signal GT which is asserted  $r_s \rightarrow r_t$ .  
Also find type of instruction. Also write the necessary control signals. (R-type)



Find out the cycletime for single datapath design  
when memory units 200ps, ALU adder, 100ps reg file, 50ps multiplexer, control, places, sign extension and other wires have no delays.  
 $200 + 100 + 50 = 350$  ps.



|      |     |    |     |              |              |         |
|------|-----|----|-----|--------------|--------------|---------|
| addi | 200 | 50 | 100 | <del>x</del> | 50           | 400 ps  |
| lw   | 400 | 50 | 100 | 200          | 50           | 600 ps  |
| sw   | 200 | 50 | 100 | 200          | <del>x</del> | 150 ps  |
| B    | 200 | 50 | 100 | <del>x</del> | <del>x</del> | 3.6 Dps |
| J    | 200 | 50 | 100 |              |              | 200 ps  |

clockcycle time = 600 ps

which of the following implementations will be faster & by how much.

① SCL

② An implementation where every instruction execution in one cycle using variable length clock

To compare the performance we use following instruction times

|     |            |
|-----|------------|
| add | 600 ps     |
| lw  | load       |
| sw  | store      |
| add | 447.5 ps   |
| lw  | HSL ALU    |
| sw  | ISL branch |
| lw  | 5.1. Jump  |

SCL takes = 600 ps  
and one takes = 447.5 ps

~~SCJ~~  
~~one~~

$$\frac{\text{one}}{\text{SCL}} = \frac{600}{447.5} \approx 1.34 \text{ times}$$

|          | ori | lw | sw           | beq          | Jump         |              |
|----------|-----|----|--------------|--------------|--------------|--------------|
| R-type   | 0   | 0  | <del>x</del> | <del>x</del> | <del>x</del> |              |
| R-type   | 1   | 0  | 1            | 1            | 0            | <del>x</del> |
| ALDSR    | 0   | 1  | 1            | <del>x</del> | <del>x</del> | <del>x</del> |
| Memory   | 0   | 0  | 1            | 0            | 0            | 0            |
| RgWrite  | 1   | 1  | 0            | 1            | 0            | 0            |
| NemWrite | 0   | 0  | 0            | 1            | 0            | 1            |
| Jump     | 0   | 0  | 0            | 0            | 0            | 1            |

Assume the following memory access time is 2 ns, register read/write is 1 ns, ALU function takes 2 ns.

How much time it will take for the following instruction in single cycle data path?

$$\text{Add/Sub} \quad 2+1+2+1 = 6 \text{ ns}$$

$$\text{ori} \quad 2+1+2+1 = 6 \text{ ns}$$

$$\text{load} \quad 2+1+2+1+2 = 8 \text{ ns}$$

$$\text{store} \quad 2+1+2+2 = 7 \text{ ns}$$

$$\text{Branch} \quad 2+1+2 = 5 \text{ ns}$$

What will be the time for critical path?

8 ns

Arithmetic part - 48%

load - 22%

store - 11%

Branch - 19%

If program has million instruction?

Execution time - ?

$$\text{no. of instructions} \times \text{CPI} \times \text{data cycle}$$

$$= 10^6 \times 8 \times 10^{-9}$$

$$= 8 \times 10^{-3} \text{ ms}$$

$$= 8 \text{ ms}$$

## Disadvantages of SCI:

- (i) Long cycle time - all instructions take as much as first or slowest instruction
- (ii) It only implements simple instructions as it does not support floating point operations.
- (iii) It does not have fancy addressing like 8086. memory (data/instruction), separate Address
- (iv) It requires extra hardware. memory (data/instruction), separate Address

Is MIPS a perfect major of performance?  
Consider the following measurements for a program.

| Measurement | Computer A | Computer B |
|-------------|------------|------------|
| IC          | 10 billion | 8 billion  |
| Clock rate  | 4 GHz      | 4 GHz      |
| CPI         | 1.0        | 1.1        |

- (a) which computer has higher MIPS rating
- (b) which computer is faster.

$$\text{MIPS} = \frac{\text{IC}}{\text{Execution} \times 10^6}$$

$$= \frac{10 \times 10^9 \times f}{n \times CPI \times 10^6}$$

$$= \frac{4 \times 10^9}{10^6}$$

$$= 4000$$

$$\frac{4000}{1.1} = \frac{40000}{11} = 3636$$

$$\frac{10 \times 10^9 + 1}{4 \times 10^9} = 2.5 \text{ s}$$

$$\frac{8 \times 10^9 + 1.1}{4 \times 10^9} = 2.2 \text{ s}$$

Computer B is faster.

|     | A | B | C |
|-----|---|---|---|
| CPI | 1 | 2 | 3 |

|            | IC count in billion |   |   |
|------------|---------------------|---|---|
|            | A                   | B | C |
| Compiler 1 | 5                   | 1 | 1 |
| Compiler 2 | 10                  | 1 | 1 |

- (a) which code sequence will execute fast according to MIPS  
 (b) which is faster according to execution time

$$\text{Compiler 1} \rightarrow \frac{5}{7} \times 1 + \frac{1}{7} \times 2 + \frac{1}{7} \times 3 = \frac{10}{7}$$

$$\text{Compiler 2} \rightarrow \frac{10}{12} \times 1 + \frac{1}{12} \times 2 + \frac{1}{12} \times 3 = \frac{15}{12}$$

$$\frac{4 \times 10^9}{10^6 \times \frac{10}{7}} = \frac{4}{10} \times 2800 = 1120$$

$$\frac{4 \times 10^9}{15 \times 10^6} = \frac{4}{15} \times 10^3 = 266.67$$

$$(b) \text{ Compiler 1} \rightarrow \frac{10}{7} = 2.5 \text{ s}$$

$$\text{Compiler 2} \rightarrow \frac{15}{4} = 3.75 \text{ s}$$



~~WORKING~~

### ① Instruction Fetch

$$\begin{aligned} IR &\leftarrow \text{Mem}[PC] \\ PC &\leftarrow PC + 4 \end{aligned} \quad \} \text{ clockcycle 1}$$

### ② Instruction Decode & Register Fetch:

$$A \leftarrow \text{Reg}[IR[25:21]]$$

$$B \leftarrow \text{Reg}[IR[20:16]]$$

$$ALUout \leftarrow PC + (\text{sign extend } IR[15:0] \ll 2)$$

### ③ Execution, Memory address computation, or branch completion

$$\downarrow \quad \text{if } (A == B) \quad PC \leftarrow ALUout$$

$$ALUout \leftarrow A \text{ op } B$$

$$ALUout \leftarrow A + \text{sig extend } (IR[15:0])$$

### ④ Memory access or D-type instruction completion step

$$\text{Reg}[IR[15:11]] \leftarrow ALUout$$

$$\text{mem}[ALUout] \leftarrow B$$

$$MDR \leftarrow \text{Mem}[ALUout]$$

$$③ \text{ Reg[IR[20:16]]} \leftarrow \text{MDR}$$

Fig 5.30 is a summary of the steps taken to execute only instruction class.

For Jump instruction in step 3

$$\text{PC} \leftarrow \{P[3]:28\}, \text{IR}[25:0], 00\}$$

Suppose an instruction mix is 25% load, 10% stores, 11% branches, 2% jumps and 52% ALU. Calculate avg CPI for a multicycle datapath.

$$(0.25 \times 5 + 0.1 \times 4 + 0.11 \times 3 + 0.02 \times 3 + 0.52 \times 4) \times 4 = 2.08$$

$$= 1.25 + 0.4 + 0.33 + 0.06 + 0.52 \times 4$$

$$\begin{array}{r} \text{Avg CPI} \\ \hline 1.25 \\ \text{for multicycle} \\ \hline 0.4 \\ 0.33 \\ 0.06 \\ \hline 2.04 \end{array}$$

$$\begin{array}{r} 0.52 \\ \times 4 \\ \hline 2.08 \end{array}$$

Suppose you have thousand instructions to be executed compared the performance of single cycle implementation vs multicycle vs pipelining.

Pipelining



1. Throughput
2. Latency

for single cycle implementation clockcycle time is  
8 ns.

$$1000 \times 8 \times 10^{-9} = \underline{\underline{8 \mu s}} \\ = \underline{\underline{8000 \text{ ns}}}$$

$$1000 \times 4.12 = 4120 \text{ cycles} \\ = 4120 \times 2 \\ = 8240 \text{ ns}$$

$$1000 \times 2 = 2000 + 8 \\ = 2008 \text{ ns}$$

Pipeline latency for n instructions  
 n → instruction in k stage pipeline and suppose  
 we have t as cycle time. Time taken  
 $= kt + (n-1)t$

$$\text{speedup} = \frac{\text{time taken by unpipelined}}{\text{time taken by pipelined}} = \frac{nk}{kt + (n-1)t}$$

speedup calculated as hence number of stages  $\frac{n}{k}$   
 $n = k$

① Maximum speedup from Pipelining

② Pipeline Advantages

③ Pipeline Pathways

Memory

Write Back

Structural

Hazard

④ 3 Principles of Pipelining → Adder

⑤ What max is needed in datapath



### Pipeline principles

1. All instructions that share a pipeline must have the same stages in the same order.
2. All intermediate values must be latched each cycle.
3. There is no functional block reuse.

data dependency → clock hazard  
 add \$3,\$10,\$11 → IF ID EX MEM WB  
 lw \$8,50(\$53) → IF ID EX MEM WB  
 sub \$11,\$8,\$7 → IF ID EX MEM WB



### transparent register file:-

In transparent register file in the first half of the clockcycle the register is written & in the next half register is read.

We can reduce one stall by using transparent register file.

To remove data dependency or data hazards we have two solutions one is software and other is hardware.

# Data Hazard (Data Dependency)



Assume no transparent ref file

```

sub $2,$1,$3
and $12,$2,$5
or $13,$6,$2
add $14,$2,$2
sw $15,10($2)
    
```

1. Identify the data dependency of ~~written~~ code
2. how many cycles it will take without considering transparent refile
- 3.

Where are no OPS required

```

sub $2,$1,$3
and $4,$2,$5
or $8,$2,$6
add $9,$4,$2
slt $1,$6,$7
    
```

sub IF ID EX MEM WB  
 and IF ID EX MEM WB  
 2 } IF ID EX MEM WB  
 add IF ID EX MEM WB  
 and IF ID EX MEM WB  
 or IF ID EX MEM WB  
 IF ID EX MEM WB

Forwarding (Register bypassing or short circuiting): -

add \$2,\$3,\$4  
or \$5,\$3,\$2



$EX \rightarrow EX$  ALU output is input in next

Forward A  $\Rightarrow$  H/w needs to be changed.

Forward B

- ① sub \$2, \$1, \$3
- ② and \$12, \$2, \$5
- ③ or \$13, \$6, \$2
- ④ add \$14, \$2, \$2
- ⑤ sw \$15, 100(\$2)

Forwarding is in place as well as forwarding to



end state of pipeline before write back  
register value  $\rightarrow$  MEM  $\rightarrow$  EX  
stall if the result is used  
in next instruction

1. lw \$2, \$2(\$1)
2. and \$4, \$2, \$5
3. or \$8, \$2, \$6
4. add \$9, \$4, \$2
5. sif \$1, \$6, \$7

$EX \rightarrow EX$   
 $MEM \rightarrow MEM$   
 $NB \rightarrow ID$



After lw, we may have to put one stall.

lw \$15, 0(\$2)  $\uparrow$  swap  
 lw \$16, 4(\$2)  
 add \$14, \$5, \$16  
 sw \$16, 4(\$2)

Data dependency / hazards:-

Data dependency  $\rightarrow$  RAW - Read After Write  $\rightarrow$  True Data  
 Data dependency  $\rightarrow$  WAR - write after read  
 Data dependency  $\rightarrow$  WAW - write after write  
 Data dependency  $\rightarrow$  Output after write



## Branch Prediction

### Static Prediction

Forward  $\rightarrow$  NOT taken  
Backward - taken

`for(i=0; i<10; i++)`

{     }

|            | $i=0$ | $i=1$ | $i=2$ | $i=3$ | $i=4$ | $i=5$ | $i=6$ | $i=7$ | $i=8$ | $i=9$ | $i=10$ |
|------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|--------|
| prediction | NT    | T     | T     | T     | T     | T     | T     | T     | T     | T     | NT     |
| Actual     | T     | T     | T     | T     | T     | T     | T     | T     | T     | T     | NT     |

$$\text{Prediction Accuracy} = \frac{\text{No. of times predicted value is right}}{\text{Total no. of times predicted}}$$

$$= \frac{9+100}{11} = \frac{109}{11} = 81.81\%$$

### The bit branch predictor

#### Two bit branch predictor

#### Branch Buffers

#### Correlating & hybrid Predictors:

Global Prediction bits

Local Prediction bit

Actual prediction

Number of bits

Used in modern

processors.

## Arithmetic Operation

Perform multiplication when multiplicand is 1000  
and multiplier is 1001

$$\begin{array}{r}
 (1000) \leftarrow \text{multiplicand} \\
 \times 1001 \rightarrow \text{multiplier} \\
 \hline
 1000 \\
 0000 \times \\
 0000 \times \\
 1000 \times \times \\
 \hline
 1001000
 \end{array}$$

$$\begin{array}{r}
 0010 \\
 \times 0011 \\
 \hline
 0010 \\
 0010 \times \\
 0000 \times \times \\
 0000 \times \times \times \\
 \hline
 00001100
 \end{array}$$

multiplication algorithm sequential version :-

$m+n$  bits are required to represent all possible products.

| Iteration | Step                                                                         | Multiplexer | Multiplicand | Product |
|-----------|------------------------------------------------------------------------------|-------------|--------------|---------|
| 0         | Initial value                                                                | 0011        | 0000 0010    | 000000  |
| 1         | a. Product = Prod + multiplicand<br>b. SHL Multiplier<br>c. SHR Multiplicand | 0000        | 0000 0100    | 000000  |
| 2.        | a. Product = Prod<br>b. SHL Multiplier<br>c. SHR Multiplicand                | 0000        | 0000 1000    | 000000  |

|    |                  |           |          |
|----|------------------|-----------|----------|
|    | a. No operation  | 0001 0000 | 00000010 |
| 3. | b. SHL Multiples | 0000      |          |
|    | c. SHR Multiples |           |          |

  

|    |                   |           |           |
|----|-------------------|-----------|-----------|
| 4. | @ No operation    | 0000      | 0000 0110 |
|    | (b) SHL Multiples | 0010 0000 |           |
|    | (c) SHR Multiples |           |           |





The product register should really be of 65 bit to hold the carryout of the ~~product~~ adder.

| Iteration | Step                                                    | Multiplicand | Product   |
|-----------|---------------------------------------------------------|--------------|-----------|
| 0         | initialization                                          | 0010         | 0000 0011 |
| 1         | 1. Product = Product + multiplicand<br>2. S/H R Product | 0010         | 0010 0011 |
| 2         | 1. Product = Product + multiplicand<br>2. S/H R Product |              | 0011 0001 |
| 3         | 1. NOP<br>2. S/H R Product                              |              | 0000 1101 |

4. 1. NOP  
2. SHL Product

0000 0110

Signed multiplication:-

multiplicand = +4      multiplier = -5

| Iteration | Step                                                  | Multiplicand | Product  |
|-----------|-------------------------------------------------------|--------------|----------|
| 0         | Initialization                                        | 0100         | 0000 101 |
| 1         | 1. Product = Product + multiplicand<br>2. SHR Product |              | 100 101  |
| 2.        | 1. NOP<br>2. SHR Product                              |              | 010 010  |
| 3.        | 1. Product += multiplicand<br>2. SHR Product          | 100          | 001 001  |

$$010100 \rightarrow +2^0$$

$$\begin{array}{r} 1010 \\ +1 \\ \hline 101100 \end{array}$$

Suppose the following is actual value of whether branch taken or not calculation branch prediction for one bit NT

T NT T T T T T NT T T T NT T T

(NT) T NT T T T T T T T T NT T T

$$\frac{5}{10} \times 100 = 50\%$$

T NT T T T T T NT T T T T T T T

NT NT NT T T T T T T T T T T T T



Iteration steps

Quotient

Divisor

Reminder

0 Initialization 0000

xxxx xxxx  
0011 0000  
0000 1101

1 Remainder =  
Remainder - divisor

0011 0000 1101 1101  
0000 1101  
000

2a. Rem = Rem + divisor

slf quotient

0000

3. shr divisor

0001 1000

0000 1101  
110111  
110111  
110111

2. Remainder =  
Remainder - divisor

1111 0101

2a. Rem =

Remainder - divisor

0000 1101

slf quotient

0000 1100

3. shr divisor

0000 1100

3. Remainder = Rem -  
divisor

0000 1001

2a. slf quotient

0001 0000

3. shr divisor

0000 0110

4. Remainder = Rem -  
divisor

1111 0111

2a. remainder = remainder +  
divisor

0010

0000 0001

slf quotient

0000 0011

3. shr divisor

0000 0011

5.

Remainder = Rem -  
divisor

1111 1110

2a. Rem = Rem + divisor

0100

0000 0001

3. shr divisor

0000 0001

| Division           |                                        |                           |
|--------------------|----------------------------------------|---------------------------|
| Divide 7/2 (Basic) | Divisor<br>xxxx<br>0010                | Quotient<br>xxxx<br>0000  |
| Iteration steps    | xxxx<br>0010                           | Remainder<br>xxxx<br>0000 |
| 0 Initialization   | 0000                                   | 0000                      |
| 1.                 | 1. Remainder = Remainder<br>divisor    | 1100 011                  |
|                    | 2a. Remainder =<br>Remainder/divisor   | 0000 011                  |
|                    | sll quotient                           | 0001 0000                 |
|                    | 3 shr divisor                          | 0000 0000                 |
| 2.                 | 1. Remainder = Remainder<br>divisor    | 1111 011                  |
|                    | 2a. Remainder = Remainder<br>divisor   | 0000 011                  |
|                    | sll quotient                           | 0000 1000                 |
|                    | 3. shr divisor                         | 0000 0100                 |
| 3.                 | 1. Remainder = Remainder - divisor     | 1111 011                  |
|                    | 2a. Remainder = Remainder +<br>divisor | 0000 011                  |
|                    | sll quotient                           | 0000 0100                 |
|                    | 3. shr divisor                         | 0000 0010                 |
| 4.                 | 1. Remainder = Remainder<br>divisor    | 0000 0010                 |
|                    | 2. sll quotient                        | 0001                      |
|                    | 3. shr divisor                         | 0000 0010                 |
| 5.                 | 1. Remainder = Remainder<br>divisor    | (0000 0010)               |
|                    | 2. sll quotient                        | 0011                      |
|                    | 3. shr divisor                         | 0000 0001                 |

# Advanced Division Algorithm



7/2

## Iteration steps

0 Initialization

Divisor  
~~xxxx~~

0010

Reminder  
~~xxxx~~

0000

0111

1

sll rem

1. Rem = Rem - divisor

1110 1110

3b Get back remainder

0000 01110

sll rem

0001 1100

2. Rem = Rem - divisor

0010

11  
0011 1100

3b Get back remainder

00001 1100

sll rem

0011 1000

3. Rem = Rem - divisor

0010

0001 1000

3a. sll rem

0011 0001

4. Rem = Rem - divisor

0001 0001

sll rem

0000  
0010 0011

Final step

sll rem

0010

0001 0011

1  
1

Rem

Quot

The advanced division algorithm also works for signed numbers where in while doing the computation we ignore signs.

Perform 14 / 5 using advanced division algorithm

| Iteration     | Step                                           | Divisor                                                               | Remainder                                                             |
|---------------|------------------------------------------------|-----------------------------------------------------------------------|-----------------------------------------------------------------------|
| 0             | Initialization                                 | $\begin{array}{r} \times \times \times \\ \times \end{array}$<br>0101 | $\begin{array}{r} \times \times \times \\ \times \end{array}$<br>0000 |
| 1.            | Rem = Rem - divisor<br>get back rem<br>sll rem | 0101                                                                  | $\begin{array}{r} \times \times \times \\ \times \end{array}$<br>0000 |
| 2.            | Rem = Rem - divisor<br>get back rem<br>sll rem | 0101                                                                  | $\begin{array}{r} \times \times \times \\ \times \end{array}$<br>0001 |
| 3.            | Rem = Rem - divisor<br>get back rem<br>sll rem | 0101                                                                  | $\begin{array}{r} \times \times \times \\ \times \end{array}$<br>0011 |
| 4.            | Rem = Rem - divisor<br>get back rem<br>sll rem | 0101                                                                  | $\begin{array}{r} \times \times \times \\ \times \end{array}$<br>0100 |
| 5. Final step | sll rem                                        | 0101                                                                  | $\begin{array}{r} \times \times \times \\ \times \end{array}$<br>0100 |

Floating point :-

I EEE - 754 representation for floating points

1. Single precision floating point  $\rightarrow 32$
2. Double precision floating point  $\rightarrow 64$
3. Extended double precision  $\rightarrow 10 \text{ bytes}$
4. Quadruple Precision Floating  $\rightarrow 16 \text{ bytes}$



$$1101.001 \times 2^3$$

↳ ① normalized no

$$1.101001 \times 2^6$$

$$\begin{array}{r} \text{xxxx xxxx} \\ 0000 00010000101 \\ \Rightarrow 42D20000 \end{array} \quad \begin{array}{r} \text{xxxx xxxx} \\ 1010 0100000000000 \\ \text{can be from } 31 \text{ to } 2^{24} \end{array}$$

Biased

$$AE + 127$$

as 0255 is reserved.

$$\begin{array}{r} 1000 \\ 1111 \\ = 127 \end{array}$$

$$= 127 + 6$$

$$= 133$$

$$2^{127-1} \\ 2^{127}$$

$$16(133) \\ 8-5$$

$$10000101$$

$$\begin{array}{r} 0100 \\ 0100 \\ \hline 0100 \end{array}$$

AE range

$$[-126 \text{ to } 127]$$

$$\begin{array}{l} 10-A \\ 11-B \\ C-12 \\ D-13 \end{array}$$

64 bit      11 bit      52 bit  
 1 bit      xxxx xxxx | xxxx ... | xxxx = 9  
 + bias = 1023

$$BF = AE + 1023$$

$$(100.25)_{10}$$

represent in single double precision

$$\begin{array}{r}
 2(100) \\
 2(50-0) \\
 2(25-0) \\
 2(12-1) \\
 2(6-0) \\
 2(3-0) \\
 1 - 1
 \end{array}$$

$$\begin{array}{r}
 1100100.01 \\
 1.1001000 \times 2^6
 \end{array}$$

$$\begin{array}{r}
 0.25 \times 2 = 0.5 \\
 0.5 \times 2 = 1.0
 \end{array}$$

$$\begin{array}{r}
 0 \underline{1000} \underline{0101} \underline{100} \underline{1000} \underline{0000} \underline{0000} \underline{0000} \\
 4288000 \times 10^{-1}
 \end{array}$$

$$\begin{array}{r}
 0 \underline{0000} \underline{0101} \underline{1001} \underline{0001} \underline{0000} \underline{0000} \underline{0000} \\
 2059100000000000
 \end{array}$$

$$(-0.75)_{10} = 0$$

$$0.75 \times 2 = 1.5$$
$$0.5 \times 2 = 1.0$$

$$0.11$$

126

-7-14

0111 1110 100 0000 0000 0000 0111 1110  
0000 0000

BF400000

1000

$$-1.01 \times 10^2$$

$$z = 5$$

49D100000

$$\textcircled{2} \quad -0.11010001 \times 2^{10101}$$

(C9D10000)

$$\textcircled{3} \quad 0 \cdot 1101\ 001 \times 2^{-10011} \\ 3.5 \times 10^{-10000}$$

$-0.11010001 \times 2^{-10011}$   
 -0.11010001 after shifting left by 11  
 BSD10000

| Single precision |          | Double precision |          | Object Representation |                           |
|------------------|----------|------------------|----------|-----------------------|---------------------------|
| Biased Exponent  | Fraction | Biased Exponent  | Fraction |                       |                           |
| 0                | 0        | 0                | 0        | 0                     | 0.0000000000000000 (zero) |
| 0                | Nonzero  | 0                | nonzero  |                       | $\pm$ denormalized number |
| 1-254            | Anything | $1-2047$         | Anything |                       | Flatly point              |
| 255              | 0        | $2047$           | 0        |                       | $\pm$ infinity            |
| 255              | Non zero | $2047$           | nonzero  |                       | Nan (Not a number)        |

Floating point addition :-



Perform the addition  $(0.5)_{10}$  with  $(-0.4375)_{10}$  with single precision

single precision

$$0.5 = 0.1 \times 2^{-1}$$

$$-0.4375 = -0.11 \times 2^{-2}$$

$$0.11 \times 2^{-2} = 0.001 \times 2^{-1}$$

$$1.000 \times 2^{-1}$$

$$0.111 \times 2^{-1}$$

$$\underline{0.001 \times 2^{-1}}$$

$$1.0 \times 2^{-4}$$

$$-4 + 127 = 123$$

$$\frac{1}{16} = 0.0625$$

Multiplication & division

i) Add or subtract exponents

ii) Multiply or divide the mantissa

iv) Normalize the result

$$4 \times 10^9$$

$$8 \times 10^9$$

rounding methods: -  
There are four kind of roundings supported by IEEE-754  
format:

Add 3 bits at the end

g → guard bit

r → round bit

s → sticky bit (can be many)

but it is 1 if any of them

is 1.

(i) Round towards zero (also called truncation)

0.7783

↓  
0.778 (3 places)

↓  
0.77 (2 places)

(Always round towards zero)

1.01101      1.001      -1.1101  
↓                ↓            ↓  
0.11111      1.00      -1.11

↓  
0.11111

↓  
-1.001111

↓  
-1.001111

↓  
-1.001111

If u have a number with g, r, s bits

just leave them off

just leave them off

### (ii) Round towards $+\infty$ (positive infinity):-

i) It is applied to only positive numbers and no change if applied to negative numbers.

ii) If any of g, r, s bit is 1 add 1 to LSB

1. 1100 0000 0000 0000 0000 1000  
add 1 to LSB

add 1 to  
LSB

### (iii) Round towards $-\infty$ (negative infinity)

i) Round down to infinity

ii) If any of g, r, s bit is 1 add 1 to LSB

### (iv) Round towards nearest even:- (default)

- ① 0 x x → does not matter, r, s no change  
② 1 1 x → Add 1 to LSB  
③ 1 0 1 → Add 1 to LSB  
④ 1 0 0 → Add 1 to LSB  
if LSB = 1

1. 1100 0000 0000 0000 0000 111  
Add 1 to LSB

1. 1100 0000 0000 0000 0000 011  
No change

1. 1100 0000 0000 0000 0000 001 111

$$[1.0 \times 2^{-1}] \times [-1.110 \times 2^{-2}]$$

| S | E        | F                        |
|---|----------|--------------------------|
| 1 | 1000000  | 1100 0000 00000000011111 |
| 1 | 10000010 | 1110 0000 0000000001001  |

Perform addition and all four roundings

① round towards zero  
C1940008

② round towards  $+\infty$  (no change)

③ round towards  $-\infty$  (~~C1940009~~)

④ round to nearest even

C1940008

# Memory

## Memory Organization characteristics

- Location
- Capacity
- Units of transfer - Rate of transfer  $\rightarrow$  property of communication
- Access method
- Performance
- Organisation

### Access methods

Sequential

Random

Direct

Associative



### Performance

Access time ( $t_a$ )

Memory cycle time ( $t_c$ )

Cycle time =  $t_a + \text{recovery}$

Transfer rate ( $b$ )  $b = \frac{\omega}{t_c}$

### Direct Mapping

(Block Address)  $\rightarrow$  No. of blocks

set associative cache:  
No. of sets

Set - at least two blocks

block - two way set associative mapping

4 blocks → direct  
↓  
fully associative  
no. of misses

0, 8, 0, 6, 8  
0000 1000 0000 0110000

| Tag1 | data | Tag2 | data | Tag3 | data | Tag4 | data |
|------|------|------|------|------|------|------|------|
| 0    | 8    |      |      | 6    |      |      |      |

3 misses  
2 hit

|   | Tag1 | data | Tag2 | data |
|---|------|------|------|------|
| 0 | 011  |      | 100  |      |
| 1 |      |      |      |      |

3 misses  
replace b

110

direct map

direct mapping :- 0 → miss

second 1st data 8 → miss

0 → miss

b → miss

8 → miss

5 misses

4C's model

Compulsory miss:- The misses that happen when the block is accessed for the first time.

Capacity miss:- The misses that happen when the capacity of cache is reached.

Suppose the cache has 1024 words and block of 1 word. How many bits are used to identify the cache. Suppose there are  $2^{32}$  words in memory. How many bits are there for tag?

$$\text{No. of cache blocks} = \frac{1024}{1} = 1024$$

$$\text{No. of bits for cache} = 10$$

$$\text{No. of bits for tag} = 32 - 10 = 22$$

Block offset :- It indicates the no. of bits those are required to address each byte within a block.

$$\text{Cache bits} = 10$$

$$\text{tag bits} = 32 - 10 = 20$$

(32-2)  
2 for block offset

How many total bits are required for a direct mapped cache with 16KB of data and 4 word blocks. Assume a 32 bit address.

No. of bytes per block = 16 bytes

$$\text{Size of cache} = 16 \times 2^{10}$$

$$\text{No. of cache lines} = 10$$

$$\text{Tag bits} = 32 - 10 - 4 = 18 \text{ bits}$$



$$\text{b for one entry} = 18 + 1 + 16 \times 8 \\ = 147 \text{ bits}$$

$$\text{total bits} = 2^{10} \times 147$$

$$= 18.4 \text{ KB}$$

### Performance

$$T_{\text{access}} = t_{\text{hit}} + P_{\text{miss}} \times \text{Penalty}_{\text{miss}}$$

$$\text{Avg memory access} = \text{Hit time} + \text{Miss Rate} \times \text{Miss penalty } L_1$$

$$\text{Miss penalty } L_1 = \text{Hit time } L_2 + \text{Miss Rate } L_2 \times \text{Miss penalty } L_2$$

Suppose

$$L_1 \text{ hit time} = 1 \text{ cycle}$$

$$L_1 \text{ miss rate} = 5\%$$

$$L_2 \text{ hit time} = 5 \text{ cycles}$$

$$L_2 \text{ miss rate} = 15\%$$

$$\text{main memory penalty} = 100 \text{ cycles}$$

$$\text{penalty } L_2 = 5 + 0.15 \times 100 \\ = 20 \text{ cycles}$$

$$A$$

$$L_1 \text{ hit time} = 1$$

$$L_1 \text{ miss rate} = 5\%$$

$$L_1 \text{ miss penalty} = 100 \text{ cycles}$$

$$\text{Access} = 1 + 0.05 \times 100 \\ = 6 \text{ cycles}$$

probable

$$\text{Avg. } 1 + 0.08 \times 20$$

= 1 + 1

= 2 cycles

4. C's multi

1. Complexity

2. Conflict

3. Capacity

4. Cohesion

5. Shared resources and global reach

Advantages but introduce new challenges

Risks

1. Integration risk

2. Resource allocation risk

3. Organizational culture risk

4. Stakeholder management risk

5. Legal and regulatory risk