

programmer

user

## Assembly

### Instructions

Source program

$a = b + c;$  ← one statement

multiple instructions

Compiler

object program

load r<sub>1</sub>, b  
load r<sub>2</sub>, c  
add r<sub>1</sub>, r<sub>2</sub> # r<sub>1</sub> = r<sub>2</sub> + r<sub>2</sub>  
store r<sub>1</sub>, a

Assembler

generates instructions  
in terms of 0's and 1's



Linker

Executable → Loader → Memory



32 registers  
in Mips

- \$0
- \$1

- \$31



SML = symbolic machine language

add r<sub>1</sub>, r<sub>2</sub>

Assembler

machine instruction

011001011



\$0 → first register = 00000  
\$31 → last register = 11111

Assume that

result stored

at r<sub>2</sub> ( $r_1 = r_1 + r_2$ )



In MIPS all instructions have same bits.

However, for ex. in intel, all instructions have different bits.

### Machine Instructions

MIPS: all instructions have the same size. = 32 bits = 4 bytes

#### Types:

- R: all operands are registers
- I: register and the memory operand
- Jump Instructions: Memory address that we want to jump.

## Hypothetical two address machine.

|            | opcode | op1 | op2 |
|------------|--------|-----|-----|
| no of bits | 4      | 2   | 2   |
|            | /      | /   | /   |
| 0000       | 00     | 00  |     |
| :          | 01     | 01  |     |
| 1111       | 11     | 11  |     |

at most  
16 operations

$$r_1 = r_2 + r_3$$

add  $r_2, r_3 \# r_2 = r_2 + r_3$   
 move  $r_1, r_2 \# r_1 \leftarrow r_2$

} we lost  $r_2$ !

move  $r_1, r_2$   
 add  $r_1, r_3$



opcode op1 op2  
 1101 01 10  
 0001 01 11

Hexa

D6

17

machine operation code

| <u>Mnemonic</u> | <u>opcode</u> |
|-----------------|---------------|
| add             | 0001          |
| move            | 1101          |

Disassembler

move  $r_1, r_2$   
 add  $r_1, r_3$



Memory size  
 16 bits

$2^4 \times 2^{12}$

$16 \times 1024 \rightarrow 1 \text{ Kilo}$

64

→ 64 KB

Consider a hypothetical 3-address machine

| opcode | op1 | op2 | op3 |
|--------|-----|-----|-----|
| 8 bits | 16  | 16  | 16  |

8 bits 16 16 16 ⇒ 7 bytes

They refer  
to memory  
locations

add a, b, c

| add | a    | b   | c   |
|-----|------|-----|-----|
| 7F  | 0020 | --- | --- |

Don't Forget, it's  
in hexadecimal

→ Assembler Passes

• 1<sup>st</sup> pass: Generates the symbol table.

• 2<sup>nd</sup> pass : Generates the object program

| <u>symbol</u> | <u>Mem location</u> |
|---------------|---------------------|
| a             | 0020                |
| b             |                     |
| c             |                     |

## Complete assembly process

• 0 address machine ← we have two types of instructions

• 1 address machine



0 address

opcode

8 bits

• 2 address machine



1 address



8 bits      16 bits

## Machine operation Table (MOT)

| (Hexa)<br>opcode | Mnemonic | operand | Explanation         |
|------------------|----------|---------|---------------------|
| 01               | load     | A       | push A              |
| 02               | store    | A       | pop → top goes to A |
| 03               | add      | -       | SL ← SL + TL, pop   |
| 04               | sub      | -       | SL ← SL - TL, pop   |
| 05               | mult     | -       | SL ← SL * TL, pop   |
| 06               | div      | -       | SL ← SL / TL, pop   |
| 07               | stop     | -       |                     |

SL → second Level

TL → Top level

It's like a stack.

$$q = b + c * d + e * f$$

## (Hexa)

| location | machine inst | Symbolic machine instruction |
|----------|--------------|------------------------------|
| 0 (0)    | 01 00 17     | load b                       |
| 3 (1)    | 01 00 1B     | load c                       |
| 6 (6)    | 01 00 1F     | load d                       |
| 9 (8)    | 05           | multiply                     |
| 10 (A)   | 01 00 23     | load e                       |
| 13 (D)   | 01 00 27     | load f                       |
| 16 (10)  | 05           | multiply                     |
| 17 (11)  | 03           | add                          |
| 18 (12)  | 03           | add                          |
| 19 (13)  | 02 00 2B     | store a                      |
| 22 (16)  | 00           | stop                         |

without  
initializing

NOTE: block word → blk w → int a;  
(for next page) word 5 → int b = 5;

↳ with initializing

M/C : machine

| <u>Location</u>       | <u>machine inst</u> | <u>Symbolic m/c instructions</u> |
|-----------------------|---------------------|----------------------------------|
| 00 00 00 05 ← 23 (17) | 00 00 00 05         | b: .word 5                       |
| 00 00 00 0A ← 27 (1B) | 00 00 00 0A         | c: .word 10                      |
| 00 00 00 0F ← 31 (1F) | 00 00 00 0F         | d: .word 15                      |
| 00 00 00 10 ← 35 (23) | 00 00 00 10         | e: .word 20                      |
| 00 00 00 19 ← 35 (23) | 00 00 00 19         | f: .word 25                      |
| 00 00 00 00 ← 43 (2B) | 00 00 00 00         | g: .blkw 1                       |
|                       |                     | end                              |
|                       |                     | physical end for the program     |

### Symbol Table

| <u>Symbol / Variables</u> | <u>Value / Address (Hex)</u> |
|---------------------------|------------------------------|
| b                         | 17                           |
| c                         | 16                           |
| d                         | 1F                           |
| e                         | 23                           |
| f                         | 27                           |
| a                         | 2B                           |

Load process into main memory



• **Absolute Program:** Must be loaded into the same location.

• **Relocatable Program:** Can be loaded anywhere

Example: add \$t<sub>1</sub>, \$t<sub>2</sub>, \$t<sub>3</sub>

add \$t<sub>9</sub>, \$t<sub>10</sub>, \$t<sub>11</sub>

rd      rs      rt  
destination    second    third

R-type →

| opcode | rs   | rt   | rd  | shamt | funct  |
|--------|------|------|-----|-------|--------|
| 6      | 5    | 5    | 5   | 5     | 6      |
| 0      | \$10 | \$11 | \$9 | 0     | 20 hex |

00 0000 01010 01011 00000 01001

Example: addi \$t<sub>1</sub>, \$t<sub>2</sub>, 5 # \$t<sub>1</sub> = \$t<sub>2</sub> + 5

addi \$3, \$10, 5

rt rs

R[rt] = R[rs] + Size Extended Immediate Data

I

| operand | rs | rt | Imm. |
|---------|----|----|------|
|---------|----|----|------|

addi 10 9 5 16

00 1000 01010 01001 00000 00000 00000 0101    ⇒ 21 49 00 05<sub>16</sub>

2    1    4    9    0    0    0    5

Example: ori \$t<sub>1</sub>, \$t<sub>2</sub>, -8

\$t<sub>1</sub> = \$t<sub>2</sub> | -8

-8 ⇒ 01000 → Reverse and add 1  
10111 + 1 ⇒ 11000  
-8 in two's complement

| opcode | rs | rt | imm |
|--------|----|----|-----|
|--------|----|----|-----|

d<sub>16</sub> 10 9 hex → FF F8

00 1101 01010 01001    ⇒ 35 49 FF F8 ?

3    5    4    9

Example: Write a program which will do  $a = b + c;$

```
.text  
.globl __start
```

-- start :

```
lw $t1, b  
lw $t2, a  
add $t2, $t1, $t2  
sw $t2, c  
li $v0, 0  
syscall
```

```
.data  
a: .word 10  
b: .word 20  
c: .word 0  
.end
```

Example:

This part needs to be at last

```
.data  
array: .word 12, 20, 30, 40
```

```
.text  
.globl __start
```

-- start :

```
li $t2, 0 # $t2 = sum;  
la $t1, array  
lw $t3, 0($t1)  
add $t2, $t2, $t3 # $t2 = $t2 + $t3  
  
lw $t3, 4($t1)  
add $t2, $t2, $t3  
  
lw $t3, 8($t1)  
add $t2, $t2, $t3  
  
lw $t3, 12($t1)  
add $t2, $t2, $t3  
  
# stop
```

li \$v0, 0 ) This 2 lines stops the program.  
syscall

```
.data  
.end
```

memory

|             |             |
|-------------|-------------|
| 00 00 00 0C | 10 01 00 02 |
| 00 00 00 14 | 10 01 00 04 |
| 00 00 00 1E | 1001 00 06  |
| 00 00 00 23 | 1001 00 0C  |
| :           | :           |
| \$t1        | 10 01 00 02 |

## Exchange Array Elements

1 2 3 4

4 2 3 1

4 3 2 1

next:

```
la $t1, array  
lw $t2, arraySize  
subi $t2, $t2, 1  
mul $t2, $t2, 4  
add $t2, $t2, 1
```

bge \$t1 \$t2 done

```
lw $t3, 0($t1)  
lw $t4, 0($t2)
```

```
sw $t4, 0($t1)  
sw $t3, 0($t2)
```

```
addi $t1, $t1, 4  
subi $t2, $t2, 4
```

b next

## Functions:

l \$s0, 5

l \$t1, 5

\* \$s0 is guaranteed to be equal to 5  
\$t0 can have a different value.



E\_x: re max(x,y) + max(v,w)

.data

x: .word 10  
y: .word 20  
v: .word 30  
w: .word 40  
result: .word 0

main:

```
lw $t0, x  
lw $t1, y  
lw $t2, v  
lw $t3, w
```

jal sum # Sum returns results in \$t0

sw \$t0, result

# Print Sum

# Stop

Sum:

```
addi $sp, $sp, -16  
sw $ra, 12($sp)  
sw $s0, 8($sp)  
sw $s1, 4($sp)  
sw $s2, 0($sp)
```

} Here is save  
ra, s0, s1 & s2

jal max

```
move $s0, $v0 # $s0 = max(x, y)  
move $a0, $a2  
move $a1, $a3
```

jal max

```
move $s1, $v0  
add $s2, $s0, $s1  
move $v0, $s2 # $v0 holds the result
```

# Restore initial values of \$ra, \$s0, \$s1, \$s2

```
lw $s2, 0($sp)  
lw $s1, 4($sp)  
lw $s0, 8($sp)  
lw $ra, 12($sp)  
addi $sp, $sp, 16  
jr $ra
```

max:

```
move $t0, $a0  
move $t1, $a1
```

bgt \$t0, \$t1, next

```
move $v0, $t1  
b done
```

next:

```
move $v0, $t0
```

done:

```
jr $ra
```

L0: J L3 0040 0000  
add ... 0040 0004  
add ... 0040 0008

6 26  
2hex |

=> 00 0010 | 0000 0100 ... 11  
↓ ↓ ↓  
0 0 0

We cannot take the last 2 bits too ✓

L1: J L0 0040 000C  
add ... 0040 0010

The location we want to jump  $\rightarrow$  0040 000C

0040 000C  
↓

jump does not store  
the first 4 bits  
of PC (Program Counter)

NOTE: Instructions are always word boundary. In other words, their address is divisible by 4  $\rightarrow$  last two bits are 00

00 0010 0000 0100 0000 0000 0000 0000 |

$\rightarrow$  Machine Code: 08 10 00 00 03

Example: Write a sub program to count no. of occurrences of an ASCII characters.



\$t0 return result \$v0

How to set characters from the Array

lbu \$t0, 0(\$t0)

\$t0 | 32

Example: Factorial:  $n! = n \times (n-1)!$

(Slide 106)

$$(n-1)! \times (n-2)! \rightarrow \dots \rightarrow 2 \times 1$$

Factorial:

```
addi $sp, $sp, -8
sw $a0, 4($sp)
sw $t0, 0($sp)
```

# base case: (next page)

set less than # check if it is in the base case

```
add $t0, $0, 2
slt $t0, $a0, $t0
beq $t0, $0, else
```

↓

```
addi    $v0, $0, 1  
addi    $sp, $sp, -8  
jr    $ra
```

else:

```
addi    $a0, $a0, -1 # n = n-1  
jal    factorial # recursive call  
  
lw    $a0, 4($sp)  
mv   $v0, $a0, $v0  
lw    $ra, 0($sp)  
addi    $sp, $sp, 8  
jr    $ra
```

if

• nop → sll \$0, \$0, \$0

• move \$t1, \$t2 → add \$t1, \$0, \$t1

• blt \$t1, \$t2, L1

• bgt \$t1, \$t2, L2



slt \$at, \$t1, \$t2  
bne \$at, \$0, L1



slt \$at, \$t2, \$t1  
bne \$at, \$0, L2

## Floating Point Numbers

Representation → single precision 4 bytes  
→ double precision 8 bytes

Accuracy problems

They have their own register:

\$f0  
:  
\$f31

} 32 registers

Floating Point Instructions → have a fp processor.



## Representation of FP Numbers:

Single Precision: First represent the number in binary.

→ We will use IEEE 754 standard for FP representation



Double Precision:



Examples:  $9.0_{10} \Rightarrow 1001.0_2 \Rightarrow$  write in scientific notation

$$1.001_2 \times 2^3$$

$$\text{Exponent} = 127 + 3 = 130 \quad \wedge \quad 7F_{16}$$

$$\overbrace{\phantom{00}}^{3\text{ H}}$$

$$82_{16}$$

Example:



convert directly

NOTE:  $80.5_{10} \Rightarrow X_{16} = 50.8_{16}$

$$\begin{array}{r}
 .5 \\
 \times 16 \\
 \hline
 8
 \end{array}$$

When we see 0, we need to stop

NOTE:  $80.55_{10} \Rightarrow X_{16} = 50.8CCCC$

$$\begin{array}{r}
 .55 \\
 \times 16 \\
 \hline
 8.0
 \end{array}$$

$$\begin{array}{r}
 \times 16 \\
 \hline
 12.8
 \end{array}$$

13/02/2017, Monday

E.X:  $178 \Rightarrow X_{16}$

$$\begin{array}{r} 178 = 11 \quad \text{Remainder } 2 \\ \hline 16 \end{array} \quad \left. \begin{array}{r} 11 = 0 \quad \text{Remainder } 11 \\ \hline 16 \end{array} \right\} \quad 178 = 82_{16}$$

$$11 \times 16^1 + 11 \times 16^0 = 178$$

Fractional Port:  $0.140625 \Rightarrow X_{16}$

$$\begin{array}{r} 0.140625 \\ \times \quad 16 \\ \hline 2) 25 \\ \times \quad 16 \\ \hline 4) 0 \end{array} \quad \text{So, } 0.140625 = 0.24_{16}$$

$$2 \times 16^{-1} + 4 \times 16^{-2}$$

$178.140625 = 82.24_{16} \Rightarrow 1011 \ 0010 . \ 0010 \ 0100$

$$\underbrace{1011 \ 0010 \ 0010}_{2} \ 01 \times 2^7$$

Normalized Form.

NOTE: Floating point no. representation is IEEE 754 Standard

| SP               | S           | Exp | Fraction             |
|------------------|-------------|-----|----------------------|
| Single Precision | 31 30 23 22 | 0   | $\frac{7F}{86_{16}}$ |

excess  $127 = 7F_{16}$

$S=0 \Rightarrow +$

$S=1 \Rightarrow -$

|   |     |                                |
|---|-----|--------------------------------|
| S | exp | 0 1000 0110 0110010001001.0000 |
|   |     | 4 3 3 2 2 4                    |

$\Rightarrow 433224_{16}$

## Double Precision

| S           | Exp | Fraction                                                           |
|-------------|-----|--------------------------------------------------------------------|
| 63 62 52 51 | 0   | $\leftarrow 1 \rightarrow 1 \rightarrow \leftarrow 52 \rightarrow$ |

$$\begin{array}{r} 3FF \\ + 7 \\ \hline 406_{16} \end{array}$$

$$\text{Bias} = 1023 = 3FF_{16}$$

|   |          |  |  |  |      |  |      |  |      |  |          |  |        |  |  |  |  |  |  |  |
|---|----------|--|--|--|------|--|------|--|------|--|----------|--|--------|--|--|--|--|--|--|--|
| S | Exp      |  |  |  |      |  |      |  |      |  | Fraction |  |        |  |  |  |  |  |  |  |
| 0 | 100 0000 |  |  |  | 0110 |  | 0111 |  | 0010 |  | 00101    |  | .... 0 |  |  |  |  |  |  |  |
|   | 4        |  |  |  | 0    |  | 6    |  | 6    |  | 4        |  | 5      |  |  |  |  |  |  |  |

$$40\ 66\ 45\ 00\ 00\ 00\ 00\ 00_{16}$$

## Accuracy Problem in Floating Numbers:

$$\text{Add } 9.999 \times 10^1 \quad 1.610 \times 10^{-1}$$

- 1- Align the decimal point of numbers with the larger exponent.

$$9.999 \times 10^1 \quad 1.610 \times 10^{-1} \Rightarrow 0.1610 \times 10^0 = 0.0161 \times 10^1$$

- 2- Add the new form of the numbers

$$\begin{array}{r} 9.999 \times 10^1 \\ + 0.016 \\ \hline 10.015 \times 10^1 \end{array}$$

- 3- Normalize the result

$$10.015 \times 10^1 = 1.0015 \times 10^2$$

- 4- Round result if required

$$1.002 \times 10^2$$

4 :

## Binary Addition Using scientific notation.

$$(1.000_2 \times 2^{-1}) \quad (-1.110_2 \times 2^{-2})$$

1- Align the binary point of the numbers with the larger exponent

$$1.110 \times 2^{-2} \Rightarrow 0.111 \times 2^{-1}$$

2- Add the numbers

$$\begin{array}{r} 1.000 \times 2^{-1} \\ - 0.111 \times 2^{-1} \\ \hline 0.001 \times 2^{-1} \end{array}$$

16/03/2017 Persembe

3- Normalize the result

$$0.001 \times 2^{-1} \Rightarrow 1.000 \times 2^{-4}$$

## Binary Floating Point Multiplication.

$$1.000 \times 2^{-1} * -1.110 \times 2^{-2}$$

1- Add Exponents

$$(-1) + (-2) = (-3)$$

2- Multiply Numbers

$$\begin{array}{r} 1.000 \\ \times 1.110 \\ \hline 0000 \\ 1000 \\ 1000 \\ \hline 1.11000 \end{array}$$

3- Normalize  $\Rightarrow 1.110$

4- No need to change exponent

$$1.110 \times 2^{-3}$$

5- Sign bits are different we have - result

$$-1.110 \times 2^{-3}$$

## CHAPTER 7

- Micro architecture

parts of how they  
are connected.

### Data Path Components

Instruction memory

Register file: contains 32 register \$0 - \$31  
Data memory

State elements.

Control lines

Reg. Destination  
Mem Write  
Reg Write



add rd, rs, rt  
lw rt, imm(rs)

immediate value  $\rightarrow$  16 bits.

There is another component that extending  
bits to 32.

Multiplexers

→ We have 4 multiplexers.

i) add rd, rs, rt  
lw rt, imm(rs)

RF

which register to write  
for this, we need multiplexers.



iii) add rd, rs, rt  
lw rt, imm(rs)



iv) PC = PC + 4

PC  $\leftarrow$  BTA coming from beq inst.

## Data Path Implementation.

1- Single cycle



2- Multicycle : for different instructions we have different number of clock cycles.

3- Pipe timing → stages during inst. execution.

IF = Instruction Fetch

ID = Instruction Decode ; get values of registers.  
understand the purpose of instruction.

E.x: Execution.

add rd, rt, rt  
lw rt, imm(rs)

M: Access memory in the lw, sw

WB: Write Back

In Pipelining we have instruction level parallelism.

Instruction No



Data Path for single cycle

lw rt, disp(rs)  
add rd, rs, rt

| opcode | rs | rt | rd  | shamt | funct |
|--------|----|----|-----|-------|-------|
| 31     | 25 | 10 | 15. |       | 0     |

| op code | rs | rt | disp |
|---------|----|----|------|
| 25      | 20 | 15 | -    |



Reg Dat



20/03/2017, Monday

RTL: Register Transfer Language

DM: Data Memory

IM: Instruction Memory

- add rd, rs, rt  
InstMem[PC] // fetch Instruction  
RF[rd] ← RF[rs] + RF[rt]  
PC ← PC + 4

- ori rt, imm  
IM[PC] // fetch Instruction  
RF[rt] ← RF[rs] or zeroExt(imm)  
PC ← PC + 4

- lw rt, imm(rs)  
IM[PC]  
RF[rt] ← DM[RF[rs] + signExt(imm)]  
PC ← PC + 4

- sw rt, imm(rs)  
IM[PC]  
DM[RF(rt) + signExt(imm)] ← RF(rt)  
PC ← PC + 4

- Assume that we have swap
- ```
swap rs, rt // hypothetical instruction
RF[rs] ← RF[rt]
RF[rt] ← RF[rs]
```

- beq rs, rt, label
 

$RF \equiv R \rightarrow$  in some books we write  
 $R$  to show RF(Register File)

$IM[PC]$

if ( $R[rs]$  -  $R[rt]$ ) eq 0  
 $PC \leftarrow PC + 4 + \text{signExt(imm)} * 4$

else  
 $PC \leftarrow PC + 4$

- j label
 

|        |     |
|--------|-----|
| opcode | imm |
|--------|-----|

$IM[PC]$

$PC \leftarrow PC[31:23] \parallel imm[26] \parallel 00$

### E.x: Hypothetical Instruction

swt \$t1, 0(\$t2) // swt rt, imm(rs)

$IM[PC]$

$DM[RF[rs]] + \text{signExt}(imm[16]) \leftarrow RF[rt]$

$RF[rs] \leftarrow RF[rs] + 4$

$PC \leftarrow PC + 4$

→ We moved the slides

→ After the slides

CPI: Clock cycle / inst. → Single cycle  $\Rightarrow CPI = 1$

for execution time  
we use this formula

Program execution time = (No. of instructions)  $\times$  (clock cycle / inst)  $\times$  clock cycle time

CPI

, Clock rate  $2\text{GHz} = 2 \times 10^9$  cycle / sec

. Cycle period =  $\frac{1}{2 \times 10^9 \text{ cycles/sec}} = \frac{1}{2} \times 10^{-9} \text{ sec/cycle}$

= 0.5 nanosec / cycle

24/03/2017, Thursday

add  
sub  
or  
and  
slt  
sw } lw  
beq

### Single Cycle



27/03/2017, Monday

### Interrupt / Exceptions



### actions to be performed:

- 1- Address of the instruction causing the exception / interrupt saved to EPC (exception program counter).
- 2- Cause of the interrupt / exception  $\Rightarrow$  Cause register
- 3- Exception Level  
Saved to the status register

### RTL

IM[PC]

{ There is a problem.  
Cause  $\leftarrow 40$  // undefined inst.  
EPC  $\leftarrow PC$  // instructions that follows the problem instruction  
PC  $\leftarrow 0x80\ 00\ 01\ 80$

### Multiple Data Path

→ In single cycle, clock period is long enough to execute `LW`. Simple instruction require shorter amount of time.

`LW`: IF, ID, EX, MEM, WB  
`SW`: IF, ID, EX, MEM, -  
`ADD`: IF, ID, EX, -, WB

time.  
 $T_{LW} > T_{SW} > T_{ADD}$

- For different no. of instructions we have different no. of clock cycles.
- Clock cycle should long enough to perform the longest Task. MEM.
- Manufacturers use benchmark problems.

How to optimize the performance of a microprocessor for a certain problem domain

Program Profilers

Type a → arithmetic

Type b → logic

Type c → decision, branching.

|              | Type A | Type B | Type C |
|--------------|--------|--------|--------|
| Source Prog  | 50     | 30     | 20     |
| Execute Time | 15     | 30     | 50     |

Example for exec. Time calculation in multicycle environments:

No. of instructions executed:  $10^9$

clock rate: 2 GHz

| Type | CPI | → clock cycles/instruction |
|------|-----|----------------------------|
| A    | 3   |                            |
| B    | 1   |                            |
| C    | 5   |                            |

Distribution of executed instructions among instruction types

50% of type A  
30% of type B  
20% of type C

Final execution time

$$\text{Ex Time} = (\text{Total number of clock cycles for the entire program}) \times (\text{Clock period})$$

$$T_c = 10^9 \times (0.5 \times 3 + 0.3 \times 1 + 0.2 \times 5) = 10^9 \times 3.7$$

$$\text{Exec Time} = 10^9 \times 3.7 \times \frac{1}{2 \times 10^9 \text{ cycle/sec}} = 1.85 \text{ sec}$$



Pipelining

← Aims to execute multiple ins. at the same time.

Parallel execution.

Instruction level  
Program level.

Instruction come

I ns 3

2ns 2

Ins 1

IF ID EX MEM WB

IF ID EX MEM WB

IF ID EX MEM WB



## Hazards

### 1- Data Hazard:

add \$t1, \$t2, \$t3  
and \$t3, \$t1, \$t0

### 2- Control Hazard:

add \$t1, \$t2, \$t3  
beq \$t5, \$t6, Label  
-  
-  
=

Label 1:

3- Structural Hazard: → The same unit may be needed by more than one instruction.

ALU

$$PC \leftarrow PC + 4$$

→ Consider the following timing requirements for different components

$$IM \rightarrow t_{max} = 250 \text{ ns}$$

One ins takes  $\leq 950 \text{ ns}$

$$IF \\ t_{read} = 150 \text{ ns}$$

$$EX \\ t_{ALU} = 200 \text{ ns}$$

$$DM \rightarrow t_{mem} = 250 \text{ ns}$$

$$WB \\ t_{rewrite} = 100 \text{ ns}$$

$$950 \text{ ns}$$



Consider executing 3 instructions

$$t_{seq} = 3 \times 950 = 2850 \text{ ns}$$

$$t_{pipeline} = 1750 \text{ ns}$$

Time needed to execute  $10^9$  inst.

$$t_{seq} = (950 \times 10^9) \times 10^9 = 950 \text{ sec}$$

Speedup ratio =

$$\frac{\text{Time needed during sequential execution}}{\text{Time needed during pipeline execution}}$$

$$t_{pipeline} = (250 \times 10^9) \times 10^9 = 250 \text{ sec}$$

$$= \frac{950}{250} = 4 \text{ times faster.}$$

03 / 04 / 2017, Monday

### Types of Dependencies

RAW: Read after write

add \$1, \$2, \$3  
add \$5, \$1, \$1

WAW: Write after read

add \$1, \$2, \$3  
add \$2, \$0, \$0

Make sure that 1<sup>st</sup> instruction reads \$2 before its written by 2<sup>nd</sup> instruction

Can be a problem during concurrent execution of instructions.

WAW: Write after write

add \$1, \$2, \$3  
add \$1, \$4, \$5

Can be a problem during concurrent execution  
Make sure that instructions are executed in the order given/listed above

### Data Dependencies:

- A. add \$1, \$2, \$3
- B. add X, \$1, X
- C. add X, X, \$1



EX/MEM.rd  $\Rightarrow$  ID/EX.rs

if (EX/MEM.rd = ID/EX.rs)

EX/MEM.rd  $\Rightarrow$  ID/EX.rs

06 / 04 / 2017, Thursday

### Data Hazard Example:

Solution with nops

Assume no for wording  
Find a sus solution  
by inserting nops.

lw r4, 8(r12)  
add r4, r4, r4  
sw r4, 8(r10)  
stages: f/d/e/m/w

| clock c | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---------|---|---|---|---|---|---|---|
| lw      | f | d | x | m | w |   |   |
| add     | f | d | x | m | w |   |   |

|     | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|-----|---|---|---|---|---|---|---|---|---|----|----|
| lw  | f | d | x | m | w |   |   |   |   |    |    |
| nop | f | d | x | m | w |   |   |   |   |    |    |
| nop | f | d | x | m | w |   |   |   |   |    |    |
| add |   |   | f | d | x | m | w |   |   |    |    |
| nop |   |   | f | d | x | m | w |   |   |    |    |
| nop |   |   | f | d | x | m | w |   |   |    |    |
| sw  |   |   | f | d | x | m | w |   |   |    |    |

Assume forwarding.

lw r4, 8(r12)

add r4, r4, r4

sw r4, 4(r10)

|     | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|-----|---|---|---|---|---|---|---|---|
| lw  | f | d | x | m | w |   |   |   |
| nop | f | d | x | m | w |   |   |   |
| add | f | d | x | m | w |   |   |   |
| sw  |   | f | d | x | m | w |   |   |

10/04/2017, monday.

## CHAPTER 8

Locality:

- Space/Spatial Locality
- Temporal Locality

E-X:



| M. Memory | 1   | 2   | 3   | 4   | 1   | 3   |
|-----------|-----|-----|-----|-----|-----|-----|
| 123       | 123 | 123 | 123 | 123 | 123 | 123 |
| 456       | 456 | 456 | 456 | 456 | 456 | 456 |
| Cache     | 1   | 12  | 123 | 234 | 341 |     |

Miss Miss Miss Miss Miss Hit

→ Assume that cache size = 3 data items

→ Assume, cache brings the items one by one

Hit Rate = 1/6

Miss Rate = 5/6

E.v:

|             | 1  | 2  | 3  | 4  | 5  | 6  |
|-------------|----|----|----|----|----|----|
| Cache       | 12 | 12 | 12 | 12 | 12 | 12 |
| Main Memory | 34 | 34 | 34 | 34 | 34 | 34 |
|             | 42 | 42 | 42 | 42 | 42 | 42 |
|             | 56 | 56 | 56 | 56 | 56 | 56 |
| M H         | M  | H  | H  | H  | H  | H  |

Cache size =  $2 \times 2$  items  $\Rightarrow$  Taking 2 blocks that consists two items.

→ If it is available in the cache, we do not bring from main memory. That's why we missed in the third step.

### Direct Mapped Cache

$$\text{Memory size} = 16 \text{ words} = 64 \text{ bytes}$$

$$\text{Cache mem size} = 8 \text{ words}$$

$$\text{Block size} = 1 \text{ word}$$

| Dec | Hex | Bin     |
|-----|-----|---------|
| 0   | 0   | 00 0000 |
| 4   | 4   | 00 0100 |
| 8   | 8   | ..      |
| 12  | C   | ..      |
| 16  | 10  | ..      |
| 20  | 14  | ..      |
| 24  | 18  | ..      |
| 28  | 1C  | ..      |
| 32  | 20  | 10 0000 |
| 36  | 24  | ..      |
| 40  | 28  | 10 1000 |
| 44  | 2C  | ..      |
| 48  | 30  | 1       |
| 52  | 34  | 1       |
| 56  | 38  | 1       |
| 60  | 2C  | 11 1100 |

DRAM

|    |                |     |    |
|----|----------------|-----|----|
| 0  | W <sub>0</sub> | 000 | 00 |
| 4  | W <sub>1</sub> | 001 | 00 |
| 8  | W <sub>2</sub> | 010 | 00 |
| 12 |                | 011 | 00 |
| 16 |                | 100 | 00 |
| 20 |                | 101 | 00 |
| 24 |                | 110 | 00 |
| 28 |                | 111 | 00 |
| 32 |                | 000 | 00 |
| 36 |                | 001 | 00 |
| 40 |                | 010 | 00 |
| 44 |                | 011 | 00 |
| 48 |                | 100 | 00 |
| 52 |                | 101 | 00 |
| 56 |                | 110 | 00 |
| 60 |                | 111 | 00 |

SRAM

|    |    |    |    |     |    |   |
|----|----|----|----|-----|----|---|
| 0  | 1  | 2  | 3  | 000 | 00 | 0 |
| 4  | 5  | 6  | 7  | 001 | 00 |   |
| 8  | 9  | 10 | 11 | 010 | 00 |   |
| 12 | 13 | 14 | 15 | 011 | 00 |   |
| 16 | 17 | 18 | 19 | 100 | 00 |   |
| 20 | 21 | 22 | 23 | 101 | 00 | 0 |
| 24 | 25 | 26 | 27 | 110 | 00 |   |
| 28 | 29 | 30 | 31 | 111 | 00 |   |

Cache address

Tag Set No.

X XXX XX

| Memory Access | Tag | Set No. | Byte No            | Miss / Hit |
|---------------|-----|---------|--------------------|------------|
| 0 = 0 000 00  | 0   | 000     | Byte 0 of the word | Miss       |
| 20 = 0 101 00 | 0   | 101     | Byte 0 of the word | Miss       |
| 21 = 0 101 01 | 0   | 101     | Byte 1             | Hit        |
| 32 = 1 000 00 | 1   | 000     | Byte 0             | Miss       |
| 33 = 1 000 01 | 1   | 000     | Byte 1             | Hit        |
| 60 = 1 111 00 | 1   | 111     | Byte 0             | Miss       |
| 61 = 1 111 01 | 1   | 111     | Byte 1             | Hit        |
| 62 = 1 111 10 | 1   | 111     | Byte 2             | Hit        |

The tag of set no. 0 of cache memory is 0.

→ Consider two words per block

X XX XXX  
Tag Set No.

|    |    |    |     |
|----|----|----|-----|
| 0  | 7  | 00 | 000 |
| 8  | 15 | 01 | 000 |
| 16 | 23 | 10 | 000 |
| 24 | 31 | 11 | 000 |

|   |    |     |                 |                 |    |
|---|----|-----|-----------------|-----------------|----|
| 0 | 00 | 000 | w <sub>0</sub>  | w <sub>1</sub>  | 0  |
| 0 | 01 | 000 | w <sub>2</sub>  | w <sub>3</sub>  | 8  |
| 0 | 10 | 000 |                 |                 | 16 |
| 0 | 11 | 000 |                 |                 | 24 |
| 1 | 00 | 000 |                 |                 | 32 |
| 1 | 01 | 000 |                 |                 | 40 |
| 1 | 10 | 000 |                 |                 | 48 |
| 1 | 11 | 000 | w <sub>14</sub> | w <sub>15</sub> | 56 |
|   |    |     |                 |                 | 64 |

Direct Mapped Cache:

Physical address

Assume that each one is one word.



Block = 2 word



x xx  
Word No. Byte No.

N-way associative memory

N=2 (N = level of associativity)



N=4



→ Assume 8 sets



Example: No. of sets  $4K = 2^2 \times 2^{10} = 2^{12}$   
Block size = 4 words.

Physical Memory size = 4 GB  
 $2^2 \times 2^{23}$ 

## Direct Mapped

Total No. of Tag bits

Physical Address

Word No.

Total number of Tag bits =



$$\begin{aligned} &\text{No. of sets} \times \text{Tag size} = \\ &2^2 \times 2^4 = 2^{16} \text{ bits} \\ &\text{No. of sets} \quad \downarrow \\ &16 \text{ tag bits} \end{aligned}$$

## Cache Mem.



Cache Mem size =  $4K \times 4$  words

→  $n=2$



$$\text{No. of Sets} = 2^k = 2 \times 2^{10} = 2^6$$



Block Size = 4 words

$$\text{Total Number of Tag bits} = 2^6 \times 2 \times 17 = \\ \underline{\underline{n=2}}$$

Full Association:



No. of words =  $4K \times 4$  words

$$= 2^6 \times 2^2 = 2^{10} \text{ no. of words.}$$

We have 4K blocks



Total number of Tag bits

$$= (\text{No. of blocks}) \times (28)$$

$$= \frac{2^2 \times 2^{10} \times 28}{4K} \text{ bits}$$

17/04/2017 , Monday

First check cache

it's empty

cold start , miss

cache is full

item we want to access is not in cache  
conflict.

Example:

Cache mem size = 128 B

Phys. mem size = 4 GB = 2<sup>32</sup> bit

| N              | Block size (words) |
|----------------|--------------------|
| C <sub>1</sub> | 1                  |
| C <sub>2</sub> | 2                  |
| C <sub>3</sub> | 4                  |



$$C1: \text{number of sets} = \frac{\text{Total cache mem size}}{\text{Set size}} = \frac{128}{4 \times 4} = 2^3 \rightarrow 2^3$$



$$C2: \text{number of sets} = \frac{\text{Cache size}}{\text{Set size}} = \frac{128}{2 \times 2 \times 4} = 2^3$$

Depend on the tag, we will choose one of them.



$$C3: \text{number of sets} = \frac{\text{Cache size}}{N \times \text{Block size} \times \text{Word size in bytes}} = \frac{128}{4 \times 1 \times 4} = 2^3$$



C1 How to track byte position

| Line No. | V | T | Data 3 | Data 2 | Data 1 | Data 0 |
|----------|---|---|--------|--------|--------|--------|
| 000      |   |   |        |        |        |        |
| 001      |   |   |        |        |        |        |
| ~        |   |   |        |        |        |        |
| 111      |   |   |        |        |        |        |

Consider accessing physical location 27

$$27_{10} = 1B_{16} = 000 \underbrace{0001}_{\text{Set No.}} \underbrace{1011}_{\text{Block offset}}$$

Access byte number 4

| Line No. | V | T    | Data 1      | Data 0 |
|----------|---|------|-------------|--------|
| 000      |   |      |             |        |
| 001      |   |      |             |        |
| ~        |   |      |             |        |
| 011      | 1 | 0011 | (223 - 220) |        |
| 111      |   |      |             |        |

| V | Tag | Data 1 | Data 2 |
|---|-----|--------|--------|
|   |     |        |        |
|   |     |        |        |
|   |     |        |        |
|   |     |        |        |

$$220_{10} = DC_{16} = \underbrace{00}_{\text{Tag}} \underbrace{1101}_{\text{Set No.}} \underbrace{1100}_{\text{Block offset}}$$



Access Mem loc 220

$$220_{10} = DC_{16} = 0 \dots 000 \ 1101 \ 1100$$

20/04/2017, Thursday

Quiz 5 of Sec 6&6:

8 bit processor  
byte oriented  
word = byte  
64 MB memory  
64 KB cache  
Block size = 64 words  
direct-mapped cache

Give address structure



64 words, word = 1 byte

$$\text{No. of sets} = \frac{\text{Cache Size}}{\text{Set size}}$$

$$= \frac{64 \times 2^{10}}{64} = 2^{10}$$

In each set  
we have one  
block

Memory address structure



How many equality detectors

AND gates & OR gates



for 16 bit processor  
 7  
 each word  
 2 byte

64 MB memory  
 64 KB cache  
 Block Size = 16 words  
 8-way



$$\begin{aligned} \text{No. of bits} &= \frac{\text{Cache size}}{\text{Set size}} \\ &= \frac{2^6 \times 2^10}{2^3 \times 2^4 \times 2^1} \\ &= \text{No of blocks/ set} \quad \text{each word is 2 bytes.} \\ &\quad \text{No of words/block} \end{aligned}$$

Memory address format



### Midterm Topics

- Single cycle
- . RTL
- Implement

Instruction  $\rightarrow$  F O X M W

- Pipelining

$\rightarrow$  Hazards  $\rightarrow$  stall, forwarding, flush, nop

27/04/2017, Thursday

### Virtual Memory

Each program has its own VM

Programs are separated they have their own memory  $\rightarrow$  better protection



Cache

VM

Block

Page

Program is executed in terms of virtual addresses

Block size

Page size

Program uses virtual memory addresses

Block offset

Page offset

During execution, we have map to the physical memory addresses

Miss

Tag

Virtual page no.

Virtual address

Physical address

/ we use

memory page  
table

Mapping is done during execution time

Dynamic address translation



Example:  $VM = 8KB = 2^3 \times 2^{10} = 2^{13}$

Page Size = 1KB =  $2^{10} = 1024$

PM (Physical Memory) = 4KB =  $2^2 \times 2^{10} = 2^{12}$

No of pages in VM =  $\frac{2^{13}}{2^{10}} = 2^3 = 8$

No of pages (block) =  $\frac{2^{12}}{2^{10}} = 2^2 = 4$

VM

| Page 0 | 000  | 00 | 0000   | 0000      |
|--------|------|----|--------|-----------|
| page 1 | 1023 | →  | 000 11 | 1111 1111 |
| page 2 | 1026 | →  | 001 00 | 0000 0000 |
| page 3 | 2048 | →  | 010 11 | 1111 1111 |
| page 4 | 3072 | →  | 011 -  | - - -     |
| page 5 | 4096 | →  | 100 -  | - - -     |
| page 6 | 5120 | →  | 101 -  | - - -     |
| page 7 | 6144 | →  | 110 -  | - - -     |
|        | 7168 | →  | 111 -  | - - -     |

PM

| Block 0 | 00   | 00 | 0000  | 0000      |
|---------|------|----|-------|-----------|
| Block 1 | 1024 | →  | 00 11 | 1111      |
| Block 2 | 2048 | →  | 01 00 | 0000 0000 |
| Block 3 | 3072 | →  | 01 11 | 1111      |



## Virtual Memory to Physical Memory Mapping



- Memory Page table is kept in main memory.

- To make dynamic address translation some parts of a memory page table is kept in fast (cache-like) memory.

- TLB: Translation Lookaside Buffer.

- VM: Virtual Mem.
- PM: Physical mem.

• Virtual Memory Address refers to a virtual address whose page is not in main memory.



- VPN: Virtual Page Number
- PPN: Physical Page Number

## Memory Page Table

No. of rows = No. of pages in VM

| Page No. | Physical Page No | V | D |
|----------|------------------|---|---|
| 000      | 10               | 1 | 1 |
| 001      | 00               | 0 | - |
| 010      | 00               | 0 | - |
| 011      | 00               | 0 | - |
| 100      | 11               | 1 | 0 |
| 101      | 00               | 0 | - |
| 110      | 00               | 0 | - |
| 111      | 00               | 0 | - |

- V = Valid bit  $\Rightarrow$  1 means that virtual page is in main memory
- D = Dirty bit  $\Rightarrow$  1 means that physical has been changed while it was in main memory

Table size for the memory page table

$$8 \times (2 + 1 + 1) = 80 \text{ bits}$$



### DAT (Dynamic Address Translation)



8 / 05 / 2017

E.x: (from the textbook) C to MIPS conversion in PIC32 . Quiz #6 (Spring 2015)  
 $\hookrightarrow$  8.18

```
#include <P32xxxx.h>
int main(void)
{
    int switches;
    TRISD = 0xFF00;
    while(1)
    {
        switches = (PORTD >> 8) & 0xF;
        PORTD = switches;
    }
}
```

$$\begin{aligned} t_1 &= \text{TRISD} = 0xBF\ 88\ 60\ CO \\ t_2 &= \text{PORTD} = 0xBF\ 88\ 60\ 00 \end{aligned}$$

Solution:

|                                                                                                                                                                              |                      |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|
| <pre>main: # \$t1 = TRISD , \$t2 = PORTD lui \$t1, 0xBF88 ori \$t1, \$t1, 0x60CO ori \$t2, \$t2, 0x6000 addi \$t2, \$t2, 0xFF00 sw \$t2, 0(\$t1) # TRISD = 0xFF00</pre>      | <pre>{ } } } }</pre> |
| <pre>while: bne \$t2,\$t2, end → We do not need to write it lw \$t4, 0(\$t2) srl \$t4, \$t4, 8 andi \$t4, \$t4, 0xF # \$t4 = switches sw \$t4, 0(\$t2) jump while end:</pre> | <pre>}</pre>         |

### Example: # Q1 (Spring 2014 Final)

```
void routine2 (void)
{
    int temp;
    TRISB = 0x000F;
    while ((PORTB & 0xF) != 0x0)
    {
        temp = PORTB;
        PORTB = temp << 4;
    }
}
```

TRISB = 0xFF CC 00 00 ← \$t0  
 PORTB = 0xFF CC 00 04 ← 4(\$t0)

```
routine 2:
lui $t0, 0xFFCC
ori $t0, $t0, 00 00
addi $t1, $t0, 0x000E
sw $t1, 0($t0)

while:
lw $t1, 4($t0)      # Read PortB
andi $t2, $t1, 0xF
beq $t2 $0, exit
# Read from PortB } we don't need this part.
lw $t1, 4($t1)      } we did it above.
sll $t2, $t1 4
sw $t2, 4($t0)
jmp while
exit:
jr $ra
```

White loop

} Before while

### Example: # Q1 (Spring 2014 Retake)

```
int problem_1b( int a ) ← comes in $a0
{
    TRISO = 0xFF00;
    PORTD = a;
    while ((PORTD & 0xFF00) == 0)
    {
        return PORTD; → returned in $v0
    }
}
```

PORTD = 0xFF FF 00 00 ← \$t0  
 TRISO = 0xFF FF 00 04 ← 4(\$t0)  
 \$t0 = PORTD

problem\_1b: my solution

```
lui $t0, 0xFFFF
ori $t0, 0x00 00
addi 4($t0), $t0, 0xFF00
addi $t0, $t0, $a0
while:
andi $t1, 0($t0), 0xFF00
bne $t1,$t0, end
li $v0, 0($t0)
end:
jr $ra
```

### Answer:

problem\_1b:

```
lui $t0, 0xFFFF
addi $t1, $t0, 0xFF 00
sw $t1, 4($t0)
sw $a0, 0($t0)
```

### Spin:

```
lw $t2, 0($t0)
andi $t3, $t2, $t1
beq $t3, 0, spin
lw $v0, 0($t0)
jr $ra
```

11/05/2017, Thursday

Question #1 (Spring 2015 Final)

Example: AD1CON1bits.SAMP = 0;

```
lw $t0, 0xBF80;  
ori $t0, $t0, 0x0000  
lw $t1, 0($t0) //Loading the content of the register.  
andi $t1, $t1, 0xFFFFD  
sw $t1, 0($t0)
```

Question #1 (Spring 2015 Final):

```
int readadc(void)  
{  
    AD1CON1bits.SAMP = 0;  
    while (!AD1CON1bits.DONE);  
    AD1CON1bits.SAMP = 1;  
    AD1CON1bits.DONE = 0;  
    return (AD1BUFO);  
}
```

```
#While (! AD1CON1bits.DONE)  
while: lw $t1, 0($t0)  
       andi $t2, $t1, 0x0001  
       beq $t2, $0, while  
  
#AD1CON1bits.SAMP = 1:  
       ori $t1, $t1, 0x0002  
#AD1CON1bits.DONE = 0:  
       andi $t1, $t1, 0xFFFFE  
       sw $t1, 0($t0)  
  
#return (AD1BUFO);  
lw $t0, 0x9070($t0)  
jr $ra
```