

# Computer Architecture Unit 2

---



22/02/2022

BYTES

# (INTRODUCTION)

bit



byte (8 bits)



kilobyte :  $10^3$  bytes

Kilobyte 1024 bytes

Megabyte :  $10^6$  bytes

Megabyte 1024 kilobytes

Gigabyte :  $10^9$  bytes

Gigabyte 1024 Megabytes

Terabyte :  $10^{12}$  bytes

Terabyte 1024 Gigabytes



STORING INTEGERS IN RAM



- 2-complement is used
- 4-bytes  $\rightarrow 2^{32}$  numbers  $\rightarrow [-2^{32}, +2^{32}-1]$

STORING STRINGS IN RAM



## STORING ARRAYS IN RAM

$$a = [7, 2, -3]$$



- we have to store also the lenght of the array

## STORING PROGRAMS IN RAM

- In RISC-V every instruction is stored in one word



- Fetch and execute

25/02/2022

## REGISTERS IN CPU



## INSTRUCTIONS IN RISC-V ARCHITECTURE

| word |            |
|------|------------|
| 1    | Instr. 1   |
| 2    | 0x00400000 |
| 3    | In Str. 2  |
| 4    | .          |
| 5    | Instr. 3   |
| 6    | .          |
| 7    | Instr. 4   |
|      | .          |

① Add  $5, 8, 8 \Rightarrow R[5] \leftarrow R[8] + R[8]$



② Sub  $7, 3, 2 \Rightarrow R[7] \leftarrow R[3] - R[2]$



Ex. 1

$$R[5] \leftarrow R[28] + R[29] + R[30]$$

Add 5, 28, 29

Add 5, 5, 30

Ex. 2

$$R[5] \leftarrow R[7]$$

Add 5, 7, 0

③ Addi 7, 7, 1  $\Rightarrow R[7] \leftarrow R[7] + 1$



Ex.

$$R[5] \leftarrow 3$$

Addi 5, 0, 3

→ loads a word from the memory

④ lw 5, offset(7)  $\Rightarrow R[5] \leftarrow \text{MEM}[R[7] + \text{offset}]$



⑤ lui 6, 0x1000



⑥  $sw\ 5, \text{ offset}(8) \Rightarrow \text{MEM}[R[8] + \text{offset}] \leftarrow R[5]$



### EXERCISE

$$R[5] \leftarrow x + y$$

|     |            |
|-----|------------|
| X   | 0x10010000 |
| Y   | 0x10010004 |
| x+y | 0x10010008 |

lui 1, 0x10010

lw 5, 0 (1)

lw 6, 4 (1)

Add 5, 5, 6

sw 5, 8 (1)

addi 17, 0, 10  
ecall } to close the program

⑦  $beq\ 5, 6, \text{ offset} \Rightarrow \text{if } R[5] = R[6] : \text{PC} \leftarrow \text{PC} + \text{offset}$

EX.

|              |            |   |
|--------------|------------|---|
| add 5, 0, 0  | 0x00400000 | ↑ |
| add 6, 0, 0  | 0x00400004 | ↑ |
| beq 5, 6, 0  | 0x00400008 | ↑ |
| addi 5, 5, 1 | 0x0040000C | } |
| addi 6, 6, 1 | 0x00400010 |   |

⑧  $bne\ 5, 6, \text{ offset} \Rightarrow \text{if } R[5] \neq R[6] : \text{PC} \leftarrow \text{PC} + \text{offset}$

addi 5, 0, 0      bne 6, 0, -8 ~~loop~~

addi 6, 0, 10

loop: add 5, 5, 6

addi 6, 6, -1

# 09/03/2022

## REGISTERS IN THE CPU

0 :  $x_0$  zero registers

5 → 7 :  $x_5 \rightarrow x_7$   $t_0 \rightarrow t_2$  they usually store temporary values

10-11 :  $x_{10} - x_{11}$   $a_0 - a_1$

17 :  $x_{17}$ ,  $a_7$

## SYSTEM CALLS

ecall

$a_7 = 10 \Rightarrow \text{exit}()$

$a_7 = 1 \Rightarrow \text{print\_int}(a_0)$

$a_7 = 4 \Rightarrow \text{print\_string}(a_0)$   
containing an address in memory where the string starts

## LOGICAL INSTRUCTIONS

and  $x_7, x_5, x_6 \Rightarrow x_7 \leftarrow x_5 \text{ AND } x_6$

andi  $x_7, x_5, 0x01 \Rightarrow x_7 \leftarrow x_5 \text{ AND } 0x01$

or ..

ori ..

xor ..

xori

# 11/03/2022

ble / blt  $t_0, t_1, \text{label} \Rightarrow \text{PC goes to label if } t_0 < \leq t_1$

bge / bgt  $t_0, t_1, \text{label} \Rightarrow \text{PC goes to label if } t_0 > \geq t_1$

## FUNCTIONS

jal  $r_2, \text{offset} \Rightarrow r_2 \leftarrow \text{PC} + 4$   
 $\text{PC} \leftarrow \text{PC} + \text{offset}$

$\text{jalr } x_1, r_2, \text{ offset} \Rightarrow x_1 \leftarrow \text{PC} + 4$   
 $\text{PC} \leftarrow r_2 + \text{offset}$

Ex. 1



Ex. 2



### IMPLEMENTATION OF A STACK

Ex. 1       $\leftarrow$  stack pointer

addi SP, SP, -4

sw r2, 0(SP)

lw r2, 0(SP)

addi SP, SP, +4

Ex. 2

addi SP, SP, -8

sw r2, 0(SP)

sw r0, 4(SP)

lw r2, 0(SP)

lw r0, 4(SP)

addi SP, SP, 8

## FACTORIAL

f: bne a0, zero, rec

li 20, 1

jztr zero, r2, 0

rec: addi sp, sp, -8

sw r2, 0(sp)

sw a0, 4(sp)

addi a0, a0, -1

jz1 r2, f

20

lw t0, 4(sp)

24

mul a0, a0, t0

28

lw r2, 0(sp)

32

addi sp, sp, 8

jztr zero, r2, 0

t0



16/03/2022

|      |                           |
|------|---------------------------|
| add  | $t_0, t_1, t_2$           |
| sub  | $t_0, t_1, t_2$           |
| addi | $t_0, t_1, \text{imm}$    |
| lw   | $t_0, \text{offset}(t_1)$ |
| sw   | $t_0, \text{offset}(t_1)$ |

## SINGLE CYCLE ARCHITECTURE



23/03/2021

add  $t_0, t_1, t_2$   
 sub  $t_0, t_1, t_2$   
 addi  $t_0, t_1, \text{imm}$   
 lw  $t_0, \text{offset}(t_1)$   
 sw  $t_0, \text{offset}(t_1)$   
 beq  $t_1, t_2, \text{addr}$

beq  $x_7, x_8, \text{offset}$

## SINGLE CYCLE ARCHITECTURE



Ex 1

$t_0 : 0x10010000$

$t_1 : 0xF F F F F F F F$

$t_2 : 0x0 0 0 0 0 0 0 0 5$

$t_3 : 0xF F F F F F F F$

Ex. 2

lw  $t_0, x = lui t_0, 0x10010$

addi  $t_0, t_0, 24$

data

· ascii "Alonsodro"

· align 2

· word 5, 7, 8

x: · word 6

lw  $t_0, x = lui t_0, 0x10010$

lw  $t_0, 24(t_0)$

13/04/2022

## PARALLELISM

### PIPELINING



## DATA HAZARDS

EX. 1

PROBLEM

add  $x_2, x_3, x_5$

the second instruction needs the first one to complete first!

1st solution (less efficient)

Put some "bubbles" between the two instruction in order to let the first instruction complete.

2nd solution (Forwarding)

Enable the architecture to detect when a situation of that type happens and put the "future" value of  $x_2$  in the place of the "old one"



EX. 2

PROBLEM

beq  $x_1, x_2, \text{label}$

The architecture will continue to execute the instructions after the branch even if it should not

## 1st solution (loop unrolling)

reduce the number of branches

Ex.

for i in range (1, 100) → (compiler)  
do-stuff()  
for i in range (1, 100) :  
do-stuff  
do-stuff  
:  
do-stuff

## 2nd solution

Execute both the next instruction and the instruction at label address and discard the wrong one.

## 3rd solution

Predict before whether the branch will happen or not.

20 / 04 / 2022

## FORWARDING

Ex. 1

add  $x_1, x_1, x_2$   
add  $x_2, x_2, x_1$

Ex. 2

add  $x_1, x_1, x_2$   
addi  $x_7, x_7, 1$   
sub  $x_2, x_1, x_4$

Ex. 3

add  $x_1, x_1, x_2$   
add  $x_1, x_1, x_3$   
add  $x_1, x_1, x_4$



## Forwarding Unit logic

if (EX/MEM . Reg write) and ( EX.MEM . rd ≠ 0 ) and ( ID/EX.rs1 = EX.MEM . rd ) :

Forward A = 10

if (EX/MEM . Reg write) and ( EX.MEM . rd ≠ 0 ) and ( ID/EX.rs2 = EX.MEM . rd ) :

Forward B = 10

if (MEM/WB . Reg write) and ( MEM/WB . rd ≠ 0 ) and ( ID/EX.rs1 = MEM/WB . rd ) and

not( (EX/MEM . Reg write) and ( EX.MEM . rd ≠ 0 ) and ( ID/EX.rs1 = EX.MEM . rd ) ) :

Forward A = 01

if (MEM/WB . Reg write) and ( MEM/WB . rd ≠ 0 ) and ( ID/EX.rs2 = MEM/WB . rd ) and

not( (EX/MEM . Reg write) and ( EX.MEM . rd ≠ 0 ) and ( ID/EX.rs2 = EX.MEM . rd ) ) .

Forward B = 01

22/04/2022

## STALLING

Ex.

lw x6, 0(x7) bubble

**IF**

add x8, x8, x6

**ID**

**EX**

**MEM**

**WB**



## HAZARD UNIT LOGIC

if  $ID/EX.\text{Memread} \text{ and } (ID/EX.\text{Md} = IF/ID.\text{RS1} \text{ or } ID/EX.\text{Md} = IF/ID.\text{RS2})$ :  
stall the pipeline

## BRANCH PREDICTION

EX.

for i=1 to 1000 000



## Branch prediction logic



# MULTIPLE ISSUE ARCHITECTURE

pocket

## static architecture



## EXERCISE

.data

x: (array 1)  
y: (array 2)  
z: (array 3)  
h: een

.text

le t<sub>0</sub>, x  
le t<sub>1</sub>, y  
le t<sub>2</sub>, z  
lw a<sub>1</sub>, h

loop : lw s<sub>0</sub>, 0(t<sub>0</sub>)  
lw s<sub>1</sub>, 0(t<sub>1</sub>)

add s<sub>1</sub>, s<sub>0</sub>, s<sub>1</sub>  
sw s<sub>1</sub>, 0(t<sub>2</sub>)  
add t<sub>0</sub>, t<sub>0</sub>, 4  
add t<sub>1</sub>, t<sub>1</sub>, 4  
add t<sub>2</sub>, t<sub>2</sub>, 4  
add a<sub>1</sub>, a<sub>1</sub>, -1  
bne a<sub>1</sub>, zero, loop

## Multiple issue version (loop only)

addi t<sub>0</sub>, t<sub>0</sub>, 4 | lw s<sub>0</sub>, 0(t<sub>0</sub>)  
addi t<sub>1</sub>, t<sub>1</sub>, 4 | lw s<sub>1</sub>, 0(t<sub>1</sub>)  
addi a<sub>1</sub>, a<sub>1</sub>, -1 | /  
add s<sub>1</sub>, s<sub>0</sub>, s<sub>1</sub> | /  
addi t<sub>2</sub>, t<sub>2</sub>, 4 | sw s<sub>1</sub>, 0(t<sub>2</sub>)  
bne a<sub>1</sub>, zero, loop | /

## Multiple issue version loop unrolling (loop only)

addi a<sub>1</sub>, a<sub>1</sub>, -2 | lw s<sub>0</sub>, 0(t<sub>0</sub>)  
addi t<sub>2</sub>, t<sub>2</sub>, 8 | lw s<sub>1</sub>, 0(t<sub>1</sub>)  
addi t<sub>0</sub>, t<sub>0</sub>, 8 | lw s<sub>2</sub>, 4(t<sub>0</sub>)  
add t<sub>3</sub>, s<sub>1</sub>, s<sub>2</sub> | lw s<sub>3</sub>, 4(t<sub>1</sub>)  
add t<sub>1</sub>, t<sub>1</sub>, 8 | sw t<sub>3</sub>, -8(t<sub>2</sub>)  
add t<sub>4</sub>, s<sub>2</sub>, s<sub>3</sub>  
bne a<sub>1</sub>, zero, loop | sw t<sub>4</sub>, -4(t<sub>2</sub>)

## Performance for array of $10^6$ items

① Performance of single clock cycle 200 MHz

$$\frac{\text{clocks per loop}}{200 \cdot 10^6} = \frac{g \cdot 10^6}{200 \cdot 10^9 \frac{1}{s}} = 0,045 \text{ s} = 45 \text{ ms}$$

② Pipeline 1 GHz  
no branch prediction

$$\frac{11 \cdot 10^6}{1 \cdot 10^9 \frac{1}{s}} = 11 \cdot 10^{-3} \text{ s} = 11 \text{ ms}$$

③ Pipeline 1 GHz  
branch prediction  
code reordering

$$\frac{g \cdot 10^6}{1 \cdot 10^9 \frac{1}{s}} = 5 \text{ ms}$$

④ Multiple issue no branch prediction 1 GHz

$$6 \cdot 10^6 \text{ ms} = 6 \text{ ms}$$

⑤ Multiple issue no branch prediction loop unrolling 1 GHz

$$6 \cdot \frac{10^6}{2} \text{ ms} = 3,5 \text{ ms}$$

27/04/2022

## MEMORY

### LOCALITY

- spatial locality
- temporal locality

EX.

$m := a[0]$

for  $i := 1$  to 1000 000 :



### CACHE MEMORY

#### Read



- (i) The value asked by the CPU is already in the cache  $\rightarrow$  hit  $\rightarrow$  faster access
- (ii) Else  $\rightarrow$  miss  $\rightarrow$  the value is saved in the cache (together with his block) for future access

#### Write



- (i) The value is saved in the cache
- (ii) If the cache needs to be overwritten, the value is then saved in the memory

1 block is 64 bytes

There are  $2^k$  sets in the cache



$x = \text{address}$

$$\text{block} = [x / 64] = x[6:]$$

$$\text{set} = \text{block \% msets} = \text{block}[:4]$$

$$\text{tag} = [\text{block} / \text{msets}]$$



## CACHE



## EXERCISE 4

- cache with 4 cache
- block size of 32 bytes

access chronology : 36, 48, 0, 100, 164, 172, 96, 92  
1    2    3    4    5    6    7    8  
M    H    M    M    M    H    H

- 1) Miss  $\rightarrow$  block 1 goes in set 1 of the cache
- 2) Hit (block 1 already in the cache)
- 3) Miss  $\rightarrow$  block 0 goes into set 0
- 4) Miss  $\rightarrow$  block 3 goes into set 3
- 5) Miss  $\rightarrow$  block 5 goes into set 1
- 6) Hit (block 5 already in the cache)
- 7) Hit (block 3 already in the cache)
- 8) Miss  $\rightarrow$  block 2 goes into set 2

29/04/2022

## CACHE



2-WAY ASSOCIATIVE CACHE



Ex.

.text

{

add : lw t<sub>2</sub>, 0(t<sub>0</sub>)

add t<sub>3</sub>, t<sub>3</sub>, t<sub>2</sub>

addi t<sub>0</sub>, t<sub>0</sub>, 4

addi t<sub>1</sub>, t<sub>1</sub>, -1

bne t<sub>1</sub>, zero, l1c

- block size = 64 bytes

- two-way associative cache with 16 sets

- LRU (least recently used) : the block which was used the least recently is discarded

data 16

Analysis (two-way cache)

ACCESES MISS

iteration 0 6 1

iteration 1 6 0

:

:

iteration 16 6 1

iteration 17 6 0

:

iteration 31 6 0

iteration 32 6 1

| cycle  | data 0 |
|--------|--------|
| data 1 |        |
| data 2 |        |
| ...    |        |
| ...    |        |
| .      |        |
| .      |        |
| ...    |        |

Miss rate in the first 16 iterations =  $\frac{1}{16 \cdot 6} \approx 1\%$   
(it is the same for the others)

Analysis (one-way cache)

ACCESES MISS

iteration 0 6 2

iteration 1 6 2

:

iteration 15 6 2

iteration 16 6 1

|              |   |   |
|--------------|---|---|
| iteration 17 | 6 | 0 |
| :            | : | : |
| iteration 31 | 6 | 0 |
| iteration 32 | : |   |

$$\text{miss rate} = \frac{32 + 31(1)}{16 \cdot 6 + 31(16 \cdot 6)} = 2,4\%$$

04/05/2022

data

|           |       |            |       |
|-----------|-------|------------|-------|
| x: . word | _____ | 0x10010000 | set 0 |
| y: . word | _____ |            | set 0 |
| z: . word | _____ |            | set 0 |
| m: . word | 4096  |            |       |

text

|   |    |                  |   |            |            |                                   |
|---|----|------------------|---|------------|------------|-----------------------------------|
| M | lw | t <sub>0</sub> , | x | 0x00400000 | 0x00010000 | block : 64                        |
| H | lw | t <sub>1</sub> , | y |            |            | set : 32                          |
| H | lw | t <sub>2</sub> , | z |            |            | 0                                 |
| H | lw | t <sub>3</sub> , | m |            |            | one-way association<br>write-back |

cico : M lw t<sub>4</sub>, o(t<sub>0</sub>) M  
M lw t<sub>5</sub>, o(t<sub>1</sub>) M  
M add t<sub>6</sub>, t<sub>4</sub>, t<sub>5</sub>  
H sw t<sub>6</sub>, o(t<sub>2</sub>) M  
M addi t<sub>0</sub>, t<sub>0</sub>, 4  
H / t<sub>1</sub> t<sub>1</sub> /  
H / t<sub>2</sub> t<sub>2</sub> /  
H / t<sub>3</sub> t<sub>3</sub> -1  
M bne t<sub>3</sub>, zero, cico } → set 1

|      |   |    |
|------|---|----|
| i 0  | 8 | 12 |
| i 1  | 6 | 12 |
| :    |   |    |
| i 15 | 6 | 12 |
| i 16 | 4 | 12 |
| :    |   |    |
| i 31 | 4 | 12 |
| i 32 | 3 | 12 |
| i 33 | 3 | 12 |
| :    |   |    |

i2) H M M M M N M M L H H H H

i16) H M H M H H H M H H H H

i32) H M H M N H M H H H H

$$\text{miss rate} = \frac{8 + 15 \cdot 6 + 4 \cdot 16 + 3 \cdot 16 \cdot 30}{12 \cdot 16 + 12 \cdot 16 + 12 \cdot 16 \cdot 30} = 2,6\%$$



11/05/2022

## VIRTUAL MEMORY



13/05/2022

## EXCEPTIONS / INTERRUPTS

### EXAMPLES

- signal to the process
- division by zero
- overflow
- call
- undefined instruction
- memory violation
- hardware fault
- end of time slice
- I/O operations



### TIME SLICE



# vector interrupt



## I/O

DATA FROM I/O DEVICES TO CPU

Two ways:

- 1) I/O instructions (not very used)
- 2) Memory mapped I/O

Ex.

$\begin{matrix} 0x00000000 \\ 0x00010000 \end{matrix}$ } addresses reserved to the devices

INTERACTION BETWEEN I/O DEVICES AND

Polling : CPU asks periodically to the device if it has completed

interrupt driven I/O : I/O device sends a signal to the CPU after it has completed

## PARALLELISM

SISD (single instruction single data)

SIMD ( - / multiple / )

MISD (multiple / single / )

MIMD (multiple / multiple / ) → multicore

## MIMD

Multiple CPUs  
Multiple cores



Multithreading



## SIMD

Theoretical PRAM



20/05/2022

**Exercise 1.** How many clock cycles are needed to complete the following code on the Risc-V pipelined architecture? (Figure in the next page.) Remember that:

- Execution completes when the last instruction exits the pipeline.
- If you read and write the same register of the register file in the same clock cycle, the value that is read is the value that is being written.

Show a figure of the pipeline with bubbles and forwardings.

```
add $1, $0, $0
sw $0, 4($1)
lw $1, 8($0)
add $1, $1, $0
sw $1, 12($2)
sub $3, $1, $0
add $1, $3, $3
lw $7, 0($3)
lw $5, 4($7)
```



60

**Exercise 2.** Consider a two-way associative cache with four sets and block of 8 words. The cache is initially empty and replacement of blocks is done by using LRU. Show which of the following memory accesses are hit and which are misses: 128, 16, 4, 184, 68, 204, 104, 28, 252, 240, 244, 40, 136, 164, 56, 256, 152, 180, 156, 64, 48.

block = 32 bytes

| address | block | set | H/M |
|---------|-------|-----|-----|
| 128     | 4     | 0   | M   |

|   |   |   |
|---|---|---|
| 0 | 4 | 8 |
| 1 | 5 | 1 |
| 2 | 2 | 6 |
| 3 | 3 | 7 |

|    |   |   |   |
|----|---|---|---|
| 16 | 0 | 0 | M |
|----|---|---|---|

0 6

|    |   |   |   |
|----|---|---|---|
| 48 | 0 | 0 | H |
|----|---|---|---|

1 32

|    |   |   |   |
|----|---|---|---|
| 68 | 2 | 2 | M |
|----|---|---|---|

2 64

|     |   |   |   |
|-----|---|---|---|
| 204 | 6 | 2 | M |
|-----|---|---|---|

3 96

|     |   |   |   |
|-----|---|---|---|
| 104 | 3 | 3 | M |
|-----|---|---|---|

4 128

|    |   |   |   |
|----|---|---|---|
| 28 | 0 | 0 | H |
|----|---|---|---|

5 160

|     |   |   |   |
|-----|---|---|---|
| 252 | 7 | 3 | M |
|-----|---|---|---|

6 192

|     |   |   |   |
|-----|---|---|---|
| 240 | 7 | 3 | H |
|-----|---|---|---|

|     |   |   |   |
|-----|---|---|---|
| 244 | 7 | 3 | H |
|-----|---|---|---|

|    |   |   |   |
|----|---|---|---|
| 40 | 1 | 1 | M |
|----|---|---|---|

|     |   |   |   |
|-----|---|---|---|
| 136 | 4 | 0 | H |
|-----|---|---|---|

|     |   |   |   |
|-----|---|---|---|
| 164 | 5 | 1 | H |
|-----|---|---|---|

|    |   |   |   |
|----|---|---|---|
| 56 | 1 | 1 | H |
|----|---|---|---|

|     |   |   |   |
|-----|---|---|---|
| 256 | 8 | 0 | M |
|-----|---|---|---|

|     |   |   |   |
|-----|---|---|---|
| 152 | 4 | 0 | H |
|-----|---|---|---|

|     |   |   |   |
|-----|---|---|---|
| 180 | 5 | 1 | H |
|-----|---|---|---|

|     |   |   |   |
|-----|---|---|---|
| 156 | 4 | 0 | H |
|-----|---|---|---|

|    |   |   |   |
|----|---|---|---|
| 64 | 2 | 2 | H |
|----|---|---|---|

|    |   |   |   |
|----|---|---|---|
| 48 | 1 | 1 | H |
|----|---|---|---|

**Exercise 3.** Consider the following code:

```
.data
v:    .word 0,1,7,4,2,...      # array of 1024 integers  data 0
w:    .word 0,0,0,0,0,...      # array of 1024 integers  data 64
n:    .word 1024

.text
    lui s0, 0x10012          code 0
    lw s2, 0(s0)
    lui s0, 0x10010
    lui s1, 0x10011
loop:   lw t0, 0(s0)
        sw t0, 0(s1)
        addi s2, s2, -1
        bne s2, zero, loop
        addi s0, s0, 4
        addi s1, s1, 4
        ecall
        li a7, 10
        ecall
```

This program copy one array into another.

- Question A:** what is the approximate total miss rate if you have an instruction cache and a data cache, both one-way associative with 8 blocks of 64 bytes?
- Question B:** What is the speed-up (how faster is it) if we use the Risc-V multiple issue architecture with loop unrolling of 4 loops instead of the standard pipelined architecture? Show the code.

(1)

$n_0 : H M H M H H H H$

; Always the same

$n_{1023} /$

$$\text{miss rate} : \frac{4}{8} = 25\%$$



(2)

standard pipeline (branch prediction)

clock cycles (for the loop) : 6 apart first and last iteration

total clock cycles  $\approx 6 \cdot 1024$

## multiple issue architecture:

loop:

|    |                  |          |      |                                      |
|----|------------------|----------|------|--------------------------------------|
| lw | t <sub>0</sub> , | 0(\$0)   | addi | s <sub>0</sub> , s <sub>0</sub> , 16 |
| lw | t <sub>1</sub> , | -12(\$0) |      | /                                    |
| lw | t <sub>2</sub> , | -8(\$0)  |      | /                                    |
| lw | t <sub>3</sub> , | -4(\$0)  |      | /                                    |
| sw | t <sub>0</sub> , | 0(\$1)   | addi | s <sub>1</sub> , s <sub>1</sub> , 16 |
| sw | t <sub>1</sub> , | -12(\$1) |      | /                                    |
| sw | t <sub>2</sub> , | -8(\$1)  | addi | s <sub>2</sub> , s <sub>2</sub> , -4 |
| sw | t <sub>3</sub> , | -4(\$1)  | bne  | s <sub>2</sub> , zero, loop          |

$$\text{total clock cycles} \approx 8 \cdot \frac{1024}{4} = 2 \cdot 1024$$

$$\frac{6 \cdot 1024}{2 \cdot 1024} = 3 \Rightarrow \text{M1 is 3x faster}$$

.data

|           |                         |            |          |                      |
|-----------|-------------------------|------------|----------|----------------------|
| x: . word | —                       | 0x10010000 | data 0   | block = $[x / 64]$   |
| y: . word | —                       | 0x10014000 | data 256 |                      |
| z: . word | —                       | 0x10018000 | data 512 | set = block % m sets |
| m: . word | $4096 \cdot 4 = 2^{14}$ |            |          |                      |

.text

l0: t<sub>0</sub>, x 0x00400000  
 l0: t<sub>1</sub>, y  
 l0: t<sub>2</sub>, z  
 lw t<sub>3</sub>, m

cicle:  
 lw t<sub>4</sub>, o(t<sub>0</sub>)  
 lw t<sub>5</sub>, o(t<sub>1</sub>)  
 add t<sub>6</sub>, t<sub>4</sub>, t<sub>5</sub>  
 sw t<sub>6</sub>, o(t<sub>2</sub>)  
 addi t<sub>0</sub>, t<sub>0</sub>, 4  
 / t<sub>1</sub> t<sub>1</sub> /  
 / t<sub>2</sub> t<sub>2</sub> /  
 / t<sub>3</sub> t<sub>3</sub> -1  
 bne t<sub>3</sub>, zero, cicle

(16 words)  
 block size : 64 bytes  
 number of sets : 32  
 one-way associative  
 write-back



memory access per cycle: 12

MISSES

i0: M M M M M H M M M H H H M 8

i9: H M M M M H M M H H H H 6

:

i15

x8 i16 H M H M H H M H H H H 4

:

,

i32 H M H M H H M H H H H 3

:

i511

i512

$$\frac{8 + 6 \cdot 15 + 4 \cdot 16 + 3 \cdot 480}{512 \cdot 12} = 26\%$$

# SAMPLE EXAM 2

**Exercise 1.** How many clock cycles are needed to complete the following code on the Risc-V pipelined architecture? (Figure in the next page.) Remember that:

- Execution completes when the last instruction exits the pipeline.
- If you read and write the same register of the register file in the same clock cycle, the value that is read is the value that is being written.

Show a figure of the pipeline with bubbles and forwardings.

```
sub s1, s0, s0
lw s0, 4(s1)
lw s1, 8(s0)
add s1, s3, s3
sw s1, 12(s2)
sub s3, s1, s0
lw s7, 0(s3)
add s1, s3, s3
lw s5, 4(s7)
```



**Exercise 2.** Consider a two-way associative cache with two sets and block of 8 words. The cache is initially empty and replacement of blocks is done by using LRU. Show which of the following memory accesses are hit and which are misses: 16, 136, 184, 252, 160, 80, 252, 80, 48, 136, 60, 204, 100, 88, 196, 60, 32, 176, 80, 224.

block size: 32 bytes

| BLOCK | SET | H/M |
|-------|-----|-----|
| 16    | 0   | M   |
| 136   | 4   | M   |
| 184   | 5   | M   |
| 252   | 7   | M   |
| 160   | 5   | H   |
| 80    | 2   | M   |
| 252   | 7   | H   |
| 80    | 2   | H   |
| 48    | 1   | M   |
| 136   | 4   | H   |
| 60    | 1   | H   |
| 204   | 6   | M   |
| 100   | 3   | M   |
| 88    | 2   | M   |
| 196   | 6   | H   |
| 60    | 1   | H   |
| 32    | 1   | H   |
| 176   | 5   | M   |
| 80    | 2   | H   |
| 224   | 7   | M   |

|   |   |   |
|---|---|---|
| 0 | 6 | 2 |
| 1 | 1 | 5 |

|     |   |
|-----|---|
| 0   | 0 |
| 32  | 1 |
| 64  | 2 |
| 96  | 3 |
| 128 | 4 |
| 160 | 5 |
| 192 | 6 |
| 224 | 7 |
| 256 | 8 |

**Exercise 3.** Consider the following code:

```
.data
n000: .word 8, n001      # list of 1024 integers      data  0
n001: .word -3, n002
n002: .word 5, n003
[...]

lista: .word n000

.text
    lui s0, 0x10012
    lw s1, 0(s0)
    li s2, 0
loop:   beq s1, zero, fine
        lw s1, 4(s1)
        addi s2, s2, 1
        bne s1, zero, loop
fine:   ecall
        li a7, 10
        ecall
```

This program counts the number of nodes in the list.

- Question A:** what is the approximate total miss rate if you have an instruction cache and a data cache, both one-way associative with 8 blocks of 64 bytes?
- Question B:** What is the speed-up (how faster is it) if we use the Risc-V multiple issue architecture with loop unrolling of 4 loops instead of the standard pipelined architecture? Show the code.

①  
i 0 : H H M H H  
i 1 : H H H H H  
:  
i 8 : H H H H H



$$\text{miss rate} = \frac{1}{40} = 2,5\%$$

②

```
beq s1, zero, fine | lw s1, 4(s1)
addi s2, s2, 1 |
beq s1, zero, fine | lw s1, 4(s1)
addi s2, s2, 1 |
beq s1, zero, fine | lw s1, 4(s1)
```

addi \$2, \$2, 1 | -

bne \$4, zero, fine | lw \$4, 4(\$4)

addi \$2, \$2, 1 | -

bne \$4, zero, loop | -

$$\frac{4 \cdot 1024}{9 \cdot 256} = \frac{16 \cdot 256}{9 \cdot 256} = 1, \overline{7}$$

M1 is 1,7x faster