

①

- loop: LD R4, 0(R2) // Load in R4 the content of R2, with an offset of 0  
 , LD R5, 0(R3)  
 , ADD R6, R4, R5  
 , SD R6, 0(R3) // Store contents of R6 to the memory address in R3 with offset 0  
 , ADD R6, R6, R4  
 , ADDI R3, R3, #8  
 , SUB R20, R4, R3  
 , BNZ R20, Loop

Dependencies

R A W

W A R

W A W

a) R4 depends on R2

|        | R5  | R3     |             |
|--------|-----|--------|-------------|
|        | R6  | R5, R4 | 2           |
|        | R3  | R6, R3 | 3           |
|        | R6  | R6, R4 | 4           |
|        | R20 | R4, R3 | 6           |
| writes |     | reads  | instruction |
| R3     |     | R3     | 5           |
| -      |     | R20    | 7           |

|                                       |                                           |
|---------------------------------------|-------------------------------------------|
| R4: I <sub>1</sub> → I <sub>3</sub>   | RAW                                       |
| R5: I <sub>2</sub> → I <sub>3</sub>   | RAW                                       |
| R6: I <sub>3</sub> → I <sub>4</sub>   | WAW                                       |
| X R3: I <sub>1</sub> → I <sub>3</sub> | WAW X (it only writes in the main memory) |
| R6: I <sub>4</sub> → I <sub>5</sub>   | RAW, WAW                                  |
| R3: I <sub>5</sub> → I <sub>6</sub>   | RAW                                       |
| R20: I <sub>7</sub> → I <sub>8</sub>  | RAW                                       |
|                                       | Source → Target                           |

b) No forwarding

Branch → flush

All memory access need two cycles

} Timing sequence  
sequence machine

IF ← 2 cycles  
 ID  
 EX  
 M ← {2c. LD SD  
 1c. rest  
 WB

|   | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |  |
|---|----|----|----|----|----|----|----|----|---|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|--|
| 0 | IF | IF | ID | EX | M  | M  | WB |    |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |  |
| 1 |    | IF | IF | ID | EX | M  | M  | WB |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |  |
| 2 |    |    | IF | IF | ID | -  | -  | EX | M | WB |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |  |
| 3 |    |    |    | IF | IF | -  | ID | -  | - | EX | M  | M  | WB |    |    |    |    |    |    |    |    |    |    |    |    |  |
| 4 |    |    |    |    | IF | IF | -  | ID | - | -  | -  | -  | -  | -  | EX | M  | WB |    |    |    |    |    |    |    |    |  |
| 5 |    |    |    |    |    |    |    |    |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |  |
| 6 |    |    |    |    |    |    |    |    |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |  |
| 7 |    |    |    |    |    |    |    |    |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |    |  |

(2)

- i0 lw \$r3, 0(\$r0)  
 i1 lw \$r2, 0(\$r3)  
 i2 addi \$r1, \$r1, 1  
 i3 sub \$r4, \$r3, \$r2  
 i4 sw \$r1, 0(\$r3)  
 i5 bnez \$r4, i0

5 stages pipelined architecture

|    |
|----|
| F  |
| D  |
| Ex |
| M  |
| WB |

load & store need 3 cycles (M) WB  
branch need 2 cycles (Ex)

↳ evaluation of the condition will be known after decoding.

| instruction | Read       | Written |
|-------------|------------|---------|
| 0           | \$r0       | \$r3    |
| 1           | \$r3       | \$r1    |
| 2           | \$r2       | \$r1    |
| 3           | \$r3, \$r2 | \$r4    |
| 4           | \$r2, \$r3 | -       |
| 5           | \$r4       | -       |

a)

| Source | Target | Register | Type |
|--------|--------|----------|------|
| 10     | 1      | \$r3     | RAW  |
| 1      | 2      | \$r1     | RAW  |
| 0      | 3      | \$r3     | RAW  |
| 2      | 4      | \$r1     | RAW  |
| 0      | 4      | \$r3     | RAW  |
| 3      | 5      | \$r4     | RAW  |

b)

|   | 1 | 2 | 3 | 4 | 5 | 6 | 7  | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
|---|---|---|---|---|---|---|----|---|---|----|----|----|----|----|----|----|----|----|----|----|----|
| 0 | F | D | E | M | M | M | WB |   |   |    |    |    |    |    |    |    |    |    |    |    |    |
| 1 | F | D | - | - | - | - | E  | M | M | M  | WB |    |    |    |    |    |    |    |    |    |    |
| 2 | F | - | - | - | D | - | -  | - | - | E  | M  | WB |    |    |    |    |    |    |    |    |    |
| 3 |   |   |   |   | F | - | -  | - | O | E  | M  | WB |    |    |    |    |    |    |    |    |    |
| 4 |   |   |   |   |   |   |    |   | F | D  | -  | E  | M  | M  | M  | WB |    |    |    |    |    |
| 5 |   |   |   |   |   |   |    |   | F | -  | O  | E  | E  | -  | M  | WB |    |    |    |    |    |

c)

1 iteration takes 16 cycles as the fetch instruction will be on the 17<sup>th</sup> cycle (and not on the 18<sup>th</sup> as the statement says that the result of the branch is known after decoding).

d)

(3)

- 0 LD R4 D(R1)
- 1 DIV R2 R2 R4
- 2 ADD A1 R1 #1
- 3 SUB A5 R5 #1
- 4 SD R4 D(R5)
- 5 SUB R6 R1 R5
- 6 BNEZ R6 I0

| Source | Target | Register | Type |
|--------|--------|----------|------|
| 0      | 1      | R4       | RAW  |
| 2      | 5      | R1       | RAW  |
| 3      | 4      | R5       | RAW  |
| 5      | 6      | R5       | RAW  |

a) Not forwarding, Pipelined 5 steps

\* Additional M for LD

\* Pipeline freezing for branching

\* Branch evaluation is done during decode

| Instruction | Read   | Write |
|-------------|--------|-------|
| 0           | R1     | R4    |
| 1           | R2, R4 | R2    |
| 2           | R1     | R1    |
| 3           | R5     | R5    |
| 4           | R4, R5 | -     |
| 5           | R1, R5 | R6    |
| 6           | R6     | -     |

|   | 1 | 2 | 3 | 4 | 5  | 6 | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|----|---|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 0 | F | D | E | M | WB |   |    |    |    |    |    |    |    |    |    |    |    |    |    |    |
| 1 | F | D | - | - | E  | M | WB |    |    |    |    |    |    |    |    |    |    |    |    |    |
| 2 | F | - | - | D | E  | M | WB |    |    |    |    |    |    |    |    |    |    |    |    |    |
| 3 |   |   |   | F | D  | E | M  | WB |    |    |    |    |    |    |    |    |    |    |    |    |
| 4 |   |   |   | F | D  | - | E  | M  | WB |    |    |    |    |    |    |    |    |    |    |    |
| 5 |   |   |   | F | -  | D | E  | M  | WB |    |    |    |    |    |    |    |    |    |    |    |
| 6 |   |   |   |   | F  | D | -  | -  | E  | M  | WB |    |    |    |    |    |    |    |    |    |
|   |   |   |   |   | F  | P | U  | -  | D  | E  | M  | WB |    |    |    |    |    |    |    |    |

(4)

|                   | 0 | 1 | 2 | 3 | 4  | 5 | 6  | 7 | 8 | 9 | 10 | 11 | 12 |
|-------------------|---|---|---|---|----|---|----|---|---|---|----|----|----|
| Loop: LD R4 D(R2) | F | D | E | M | WB |   |    |   |   |   |    |    |    |
| LD R5 D(R3)       | F | D | E | M | WB |   |    |   |   |   |    |    |    |
| ADD R6, R4, R5    |   | F | D | - | E  | M | WB |   |   |   |    |    |    |
| SD R6, D(R3)      |   | F | X | D | E  | M | WB | X | B |   |    |    |    |
| B.NZ R6, Loop     |   | F | - | D | X  | M | WB | X | B |   |    |    |    |

a) Memory access instruction 3 cycles  
Branch instruction 1 cycle

In addition to the fetch/issue cycle

$$C_{\text{Loop}} = C_{I_1} + C_{I_2} + C_{I_3} + C_{I_4} + C_{I_5} = (1+3) + (1+3) + (1) + (7) + (1+1) = 15$$

| Target | Source | Register | Type |
|--------|--------|----------|------|
| 3      | 1      | R4       | RAW  |
| 3      | 2      | R5       | RAW  |
| 4      | 3      | R6       | RAW  |
| 5      | 3      | R6       | RAW  |

c) Forwarding technique

Branch handling technique is freezing

6

1 loop: lw \$f0, 0(\$r1)  
2 lw \$f2, 0(\$r2)  
3 mul.f \$f4, \$f0, \$f2  
4 add.d \$f6, \$f6, \$f4  
5 add.i \$r2, \$r1, 4  
6 add.i \$r2, \$r2, 4  
7 sub \$r3, \$r3, 1  
8 bneq \$r3, loop

| Instruction | Read       | Write |
|-------------|------------|-------|
| 1           | \$r1       | \$f0  |
| 2           | \$r2       | \$f2  |
| 3           | \$f0, \$s2 | \$f4  |
| 4           | \$f6, \$f4 | \$f6  |
| 5           | \$r1       | \$r1  |
| 6           | \$r2       | \$r2  |
| 7           | \$r3       | \$r3  |
| 8           | \$r3       | -     |

| Source | Target | Register | Type |
|--------|--------|----------|------|
| 1      | 3      | \$f0     | RAW  |
| 2      | 3      | \$f2     | RAW  |
| 3      | 4      | \$f4     | RAW  |
| 7      | 8      | \$f3     | RAW  |

- b) Read, write on the same register can be done without conflicts. (write back & decode in the same cycle)

  - Branches are flushed the pipeline
  - Memory requires a cycle
  - Scalability is known in execution

c)  $(b \cdot N)$

c) If  $N$  branches are predicted as taken  
d) There is forwarding if branches are predicted as taken (if  $\text{mark} = 1$ )

L<sub>s</sub> (from memory before write back)

|   | 1 | 2 | 3 | 4 | 5  | 6  | 7 | 8  | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|----|----|---|----|---|----|----|----|----|----|----|----|----|
| 1 | F | D | E | M | WB |    |   |    |   |    |    |    |    |    |    |    |    |
| 2 |   | F | O | E | M  | wB |   |    |   |    |    |    |    |    |    |    |    |
| 3 |   |   | F | D | -  | E  | M | WB |   |    |    |    |    |    |    |    |    |
| 4 |   |   |   | F | -  | D  | - | E  | M | WB |    |    |    |    |    |    |    |
| 5 |   |   |   |   | F  | T  | D | E  | M | WB |    |    |    |    |    |    |    |
| 6 |   |   |   |   |    | F  | O | E  | M | WB |    |    |    |    |    |    |    |
| 7 |   |   |   |   |    |    | F | D  | E | M  | wB |    |    |    |    |    |    |
| 8 |   |   |   |   |    |    |   | F  | D | -  | E  | M  | WB |    |    |    |    |

(5)

|                       | 0 | 1 | 2 | 3 | 4 | 5  | 6  | 7  | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|-----------------------|---|---|---|---|---|----|----|----|---|---|----|----|----|----|----|----|----|----|
| ADDI R3, R3, 1        | F | D | E | E | M | WB |    |    |   |   |    |    |    |    |    |    |    |    |
| Loop: LD \$4, 0(\$R1) | F | D | - | E | M | M  | WB |    |   |   |    |    |    |    |    |    |    |    |
| MULT R5, R4, R4       | F | - | D | - | E | E  | M  | WB |   |   |    |    |    |    |    |    |    |    |
| ADD R5, R3, R5        | F | - | - | D | E | E  | M  | WB |   |   |    |    |    |    |    |    |    |    |
| SD R5, 0(\$R1)        | F | D | E | - | M | M  | WB |    |   |   |    |    |    |    |    |    |    |    |
| NADDI R1, R1, 1       | F | D | - | E | E | M  | WB |    |   |   |    |    |    |    |    |    |    |    |
| DADDI R1, R1, 1       | F | - | D | - | E | M  | WB |    |   |   |    |    |    |    |    |    |    |    |
| BNE R1, R2, Loop      | F | - | - | D | E | M  | WB |    |   |   |    |    |    |    |    |    |    |    |

- \* Forwarding ~~\* ADD & MULT take an extra cycle in memory~~ \* ADD & Mult take an extra execution cycle
- \* No branch predictions.

a) In none, as we are using a pipelined 5 step processor.

R5 ADD → MULT WAW  
R1 DADDI → DADDI WAW

| b) | target | source | Register | Type |
|----|--------|--------|----------|------|
| 3  | 2      | R4     | RAW      |      |
| 4  | 3      | R5     | RAW      |      |
| 5  | 4      | R5     | RAW      |      |
| 7  | 6      | R1     | RAW      |      |

|   | 1 | 2 | 3 | 4 | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 1 | F | O | E | M | WB |    |    |    |    |    |    |    |    |    |    |    |    |    |
| 2 | F | D | E | M | WB | WB |    |    |    |    |    |    |    |    |    |    |    |    |
| 3 | F | O | - | E | E  | M  | WB |    |    |    |    |    |    |    |    |    |    |    |
| 4 | F | - | D | - | E  | E  | M  | WB | X  |    |    |    |    |    |    |    |    |    |
| 5 | F | - | D | - | E  | E  | M  | WB | WB | X  | X  |    |    |    |    |    |    |    |
| 6 | F | F | D | E | M  | E  | WB | X  | X  | X  | X  |    |    |    |    |    |    |    |
| 7 | F | F | D | E | -  | M  | WB | X  | X  | X  | X  |    |    |    |    |    |    |    |
| 8 | F | F | D | E | -  | M  | WB | X  | X  | X  | X  |    |    |    |    |    |    |    |

(7)

|   | 0                     | 1                | 2 | 3 | 4 | 5  | 6  | 7  | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
|---|-----------------------|------------------|---|---|---|----|----|----|---|---|----|----|----|----|----|----|----|----|----|----|----|----|
| 1 | loop                  | lw \$f0, 0(\$r1) | F | D | E | M  | WB |    |   |   |    |    |    |    |    |    |    |    |    |    |    |    |
| 2 | lw \$f2, 0(\$r2)      | F                | D | E | M | WB |    |    |   |   |    |    |    |    |    |    |    |    |    |    |    |    |
| 3 | sub \$f4, \$f0, \$f2  | F                | D | - | E | M  | WB |    |   |   |    |    |    |    |    |    |    |    |    |    |    |    |
| 4 | mult \$f4, \$f4, \$f4 | F                | - | D | - | E  | M  | WB |   |   |    |    |    |    |    |    |    |    |    |    |    |    |
| 5 | addi \$f6, \$f6, \$f4 | F                | - | D | - | E  | M  | WB |   |   |    |    |    |    |    |    |    |    |    |    |    |    |
| 6 | addi \$r1, \$r1, 4    | F                | O | E | M | WB |    |    |   |   |    |    |    |    |    |    |    |    |    |    |    |    |
| 7 | addi \$r2, \$r2, 4    | F                | O | E | M | WB |    |    |   |   |    |    |    |    |    |    |    |    |    |    |    |    |
| 8 | sub \$r3, \$r3, 1     | F                | F | D | E | M  | WB |    |   |   |    |    |    |    |    |    |    |    |    |    |    |    |
| 9 | beq \$r3, loop        | F                | - | - | D | E  | M  | WB |   |   |    |    |    |    |    |    |    |    |    |    |    |    |

| a)         | Reads | writes | Instruction |
|------------|-------|--------|-------------|
| \$r1       | \$f0  | 1      |             |
| \$r2       | \$f2  | 2      |             |
| \$f0, \$f2 | \$f4  | 3      |             |
| \$f4       | \$f4  | 4      |             |
| \$f4, \$f6 | \$f6  | 5      |             |
| \$r1       | \$r1  | 6      |             |
| \$r2       | \$r2  | 7      |             |
| \$r3       | \$r3  | 8      |             |
| \$r3       | -     | 9      |             |

| Source | Target | Register | Type |
|--------|--------|----------|------|
| 1      | 3      | \$f0     | RAW  |
| 2      | 3      | \$f2     | RAW  |
| 3      | 4      | \$f4     | RAW  |
| 3      | 4      | \$f4     | WAW  |
| 3      | 5      | \$f4     | RAW  |
| 7      | 6      | \$r1     | WAR  |
| 2      | 7      | \$r2     | WAR  |
| 8      | 9      | \$r3     | RAW  |

- b) \* No forwarding  
 \* Read and write in the same cycle (WB is affected)  
 \* Branches flush the pipeline  
 \* Additional cycle for memory refresh  
 \* Branch evaluation is done after execution

\* 5 step processor w/ pipeline

c)

N-20

19

d)

|   |   |   |   |   |    |   |         |    |    |    |    |    |    |    |    |    |    |    |
|---|---|---|---|---|----|---|---------|----|----|----|----|----|----|----|----|----|----|----|
|   | 1 | 2 | 3 | 4 | 5  | 6 | 7       | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
| 1 | F | D | E | M | WB |   |         |    |    |    |    |    |    |    |    |    |    |    |
| 2 | F | D | E | M | WB |   |         |    |    |    |    |    |    |    |    |    |    |    |
| 3 |   | F | D | - | E  | M | WB      |    |    |    |    |    |    |    |    |    |    |    |
| 4 |   | F | - | D | E  | A | WB      |    |    |    |    |    |    |    |    |    |    |    |
| 5 |   |   | F | O | E  | M | WB      |    |    |    |    |    |    |    |    |    |    |    |
| 6 |   |   | F | O | E  | M | WB      |    |    |    |    |    |    |    |    |    |    |    |
| 7 |   |   | F | O | E  | M | WB      |    |    |    |    |    |    |    |    |    |    |    |
| 8 |   |   | F | O | -  | E | M       | WB |    |    |    |    |    |    |    |    |    |    |
| 9 |   |   |   | - | A  | F | D       | E  | WB | WB |    |    |    |    |    |    |    |    |
|   |   |   |   |   | F  | - | fullish |    |    |    |    |    |    |    |    |    |    |    |
|   |   |   |   |   |    |   |         |    |    |    |    |    |    |    |    |    |    |    |
|   |   |   |   |   |    |   |         |    |    |    |    |    |    |    |    |    |    |    |

\* Full forwarding  
\* Branches are taken.

b) N-13

Taken  
Not taken

(5)

1) DADDUI R3 R1 #40

Loop: L.D. F0, O(R1)

L.D. F2, O(R2)

ADD.D F4, F0, F2

S.D F4 O(R1)

DADDUI R1, R1, #8

DADDUI R2, R2, #8

BLE R1, R3, Loop

| Producers    | Consumers         | Latency |
|--------------|-------------------|---------|
| FP ALU       | FP ALU            | 5       |
| FP ALU       | Load/store double | 4       |
| FP ALU       | Branch inst.      | 4       |
| Load double  | FP ALU            | 2       |
| Load double  | Load double       | 1       |
| Store double | FP ALU            | 2       |

\* One instruction per cycle (except stalls)

\* Pipeline with a single data path

1) F0: I2 → I4, RAW

F2: I3 → I4, RAW

F4: I4 → I5, RAW

R1: I5 → I6, WAR

R1: I6 → I6, WAR

R1: I5 → I6, WAR

R2: I3 → I7, WAR

R3: I1 → I8, RAW

R1: I6 → I8, RAW

2) # number of iterations is 5.

1 + 5 \* 13 = 66 cycles

↳ cycles + stalls

1 + 5 \* 11 = 56 cycles, 17%.

3)

DADDUI R3, R1, #40

Loop: LD F0, O(R1)

LD F2, O(R2)

&lt;stalls&gt; DADDUI R1, R1, #8 ?

&lt;stalls&gt; DADDUI R2, R2, #8

ADD.D F4, F0, F2

&lt;stall&gt;

&lt;stall&gt;

&lt;stall&gt;

&lt;stall&gt;

SD F4, O(R1) SD F4, -8(R1)

DADDUI R1, R1, #8

DADDUI R2, R2, #8

BLE R1, R3, Loop

4) Only unrolling (no scheduling)

DADDUI R3, R1, #32

Loop: LD F0, @R1)

LD F2, @R2)

<stall>

<stall>

ADD.D F4, F0, F2

<stall>

<stall>

<stall>

SD F4, @R1)

LD F6, @R1) {

LD F8, @R2) }

<stall>

<stall>

ADD.D F10, F6, F8

<stall>

<stall>

<stall>

SD F10, @R1) }

ADDUI R1, R1, #16 }

ADDUI R2, R2, #16 }

BLE R1, R3, Loop.

LD F0, @R1)

LD F2, @R2)

<stall>

<stall>

ADD.D F4, F0, F2

<stall>

<stall>

<stall>

SD F4, @R1)

1 + 2 \* 23 + 10 = 57 iterations.

DADDUI R3, R1, #32

Loop: LD F0, @R1)

LD F2, @R2)

{ LD F6, @R1)  
LD F8, @R2)

ADD.D F4, F0, F2

<stall>

ADD.D F10, F6, F8

{ DADDUI R1, R1, #16  
DADDUI R2, R2, #16  
SD F4 - 16(R1)

{ <stall>

SD F10 - 8(R1)

BLE R1, R3 Loop.

LD F0, @R1)

LD F2, @R2)

<stall>

<stall>

ADD.D F4, F0, F2

<stall>

<stall>

<stall>

SD F4, @R1 )

$$1 + 2 \cdot 23 + 10 = 57$$

① \* Each instruction 1 cycle \* Processor is pipelined \*  $R_1 = 0, R_2 = 24, R_3 = 16$

Loop 1: LD F2,  $\phi(R1)$  1+3  
ADDD F2, F0, F2 1+2

Loop 2: LD F4,  $\phi(R3)$  1+3  
MULTO F4, F0, F4 1+4  
DIVD F10, F4, F0 1+10  
ADDD F12, F10, F4 1+4  
ADDI R1, R1, #8 1+0  
SUB R18, R2, R1 1+0  
BNZ R18, loop 2 1+1  
SD F2,  $\phi(R3)$  1+1  
ADDI R3, R3, #8 1+0  
SUB R20, R2, R3 1+0  
BNZ R20, loop 1 1+1

| Instruction | Additional cycle cost |
|-------------|-----------------------|
| LD          | 3                     |
| SD          | 1                     |
| ADD         | 2                     |
| MULTO       | 4                     |
| DIVD        | 10                    |
| ADDD        | 0                     |
| SUB         | 0                     |
| BNZ         | 1                     |

a) 1 iteration of loop 1  
3 iterations of loop 2  
Loop 2 cycles:  $4 + 5 + 11 + 5 + 1 + 1 + 2 = 29$   
Loop 1 - loop 2 cycles:  $4 + 3 + 2 + 1 + 7 + 2 = 13 \Rightarrow$   
 $13 + 29 \cdot 3 = 13 + 87 = 100$

b)

Loop 1: LD F2,  $\phi(R1)$  1+3  
ADDD F2, F0, F2 1+2

Loop 2: LD F4,  $\phi(R3)$  1+3  
MULTO F4, F0, F4 1+4  
DIVD F10, F4, F0 1+10  
ADDD F12, F10, F4 1+2

LD F4,  $\phi(R3)$  1+3  
MULTO F4, F0, F4 1+4  
DIVD F10, F4, F0 1+10  
ADDD F12, F10, F4 1+2

LD F4,  $\phi(R3)$  1+3  
MULTO F4, F0, F4 1+4  
DIVD F10, F4, F0 1+10  
ADDD F12, F10, F4 1+2

ADDI R1, R1, #24 1+0  
SUB R18, R2, R1 1+0  
BNZ R18, loop 2 1+1  
SD F2,  $\phi(R3)$  1+1  
ADDI R3, R3, #8 1+0  
BNZ R20, loop 1 1+1

b) Iterations with loop 2 unrolled

85 iterations

$$c) \frac{100}{85} = 1.17\% \text{ improvement}$$

| ② Source    | Target       | Latency | Loop: LD F0 $\phi(R1)$ |
|-------------|--------------|---------|------------------------|
| FPAU        | FPAU         | 6       | LD F2 $\phi(R2)$       |
| FPAU        | STORE double | 3       | ADDD F4, F2, F0        |
| Load double | FPAU         | 2       | SD F4 $\phi(R3)$       |
| Load double | STORE double | 0       | stall x3               |

X stall x2  
stall x3  
DADDI R1, R1, -8  
DADDI R2, R2, -8  
DADDI R3, R3, -8  
BNE R1, R4, loop

c) 13 · n iterations

Always of 40000 bytes

$$\times 39 \text{ iterations} \cdot 13 \text{ cycles} = 519987 \text{ iterations}$$

520000

$$10 \cdot 40000 = 400000 \text{ cycles}$$

|    |                                                                                                                                                                    |                                                                                                                                                                   |
|----|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| b) | Loop: LD F0 $\phi(R1)$<br>LD F2 $\phi(R2)$ (stall x2)<br>ADDD F4 F2 F0<br>DADDI R1 R1 #8<br>DADDI R2 R2 #8<br>DADDI R3 R3 #8<br>SD F4 $\phi(R3)$<br>BNE R1 R4 loop | c) Loop: LD F0 $\phi(R1)$<br>LD F2 $\phi(R2)$ (stall x2)<br>ADDD F4 F2 F0<br>LD F4 # -8 R1<br>LD F2 # -8 R2<br>DADDI R1 R1 -8<br>DADDI R2 R2 -8<br>DADDI R3 R3 -8 |
|    | 12 · 10k = 240000 cycles                                                                                                                                           |                                                                                                                                                                   |

cycles = 250 000  
iterations = 25  
10000 iterations = 250 000 cycles

Loop: LD F<sub>0</sub> φ(R1)  
 LD F<sub>2</sub> φ(R2)  
 <stall>  
 <stall>  
 ADDD F<sub>4</sub>, F<sub>2</sub>, F<sub>0</sub>  
 LD F<sub>0</sub> -8(R1)  
 LD F<sub>2</sub> φ(R2)  
 <stall>  
 SD F<sub>4</sub>, 0(R3)  
 ADDD F<sub>4</sub>, F<sub>2</sub>, F<sub>0</sub>  
 LD F<sub>0</sub> -6(R1)  
 LD F<sub>2</sub> -16(R2)  
 <stall>  
 SD F<sub>4</sub> -8(R3)  
 ADDD F<sub>4</sub>, F<sub>0</sub>, F<sub>2</sub>  
 LD F<sub>0</sub> -24(R1)  
 LD F<sub>2</sub> -24(R2)  
 <stall>  
 SD F<sub>4</sub> -16(R3)  
 ADDD F<sub>4</sub>, F<sub>0</sub>, F<sub>2</sub>  
 LD F<sub>0</sub> -24(R1)  
 LD F<sub>2</sub> -24(R2)  
 <stall>  
 SD F<sub>4</sub> -16(R3)  
 ADDD F<sub>4</sub>, F<sub>0</sub>, F<sub>2</sub>  
 SD F<sub>4</sub> -16(R3)  
 ADDD F<sub>4</sub>, F<sub>0</sub>, F<sub>2</sub>  
 SD F<sub>4</sub> -16(R3)  
 ADDD F<sub>4</sub>, F<sub>0</sub>, F<sub>2</sub>  
 SD F<sub>4</sub> 8(R3)  
 BNE R1, R4, loop

LD F<sub>0</sub> φ(R1)  
 LD F<sub>2</sub> φ(R2)  
 LD F<sub>6</sub> -8(R1)  
 LD F<sub>8</sub> -8(R2)  
 LD F<sub>12</sub> -16(R2)  
 LD F<sub>14</sub> -16(R2)  
 LD F<sub>18</sub> -24(R1)  
 LD F<sub>20</sub> -24(R2)  
 ADDD F<sub>0</sub>, F<sub>2</sub>, F<sub>2</sub>  
 ADDD F<sub>0</sub>, F<sub>6</sub>, F<sub>8</sub>  
 ADDD F<sub>12</sub>, F<sub>14</sub>, F<sub>12</sub>  
 ADDD F<sub>22</sub>, F<sub>20</sub>, F<sub>18</sub>  
 SD F<sub>4</sub> 0(R3)  
 SD F<sub>10</sub> -8(R3)  
 SD F<sub>6</sub> -16(R3)  
 SD F<sub>22</sub> -24(R3)  
 DADDIU R1 R1 #32  
 DADDIU R2 R2 #32  
 DADDIU R3 R3 #32  
 BNE R1, R4, loop

200 000 cycles

(3)

| Instruction | Latency (addition) |
|-------------|--------------------|
| ldc1        | 2                  |
| sdc1        | 2                  |
| add.j       | 4                  |
| mult.j      | 6                  |
| addi        | 0                  |
| subi        | 0                  |
| bnez        | 1                  |

b)

loop ldc1 \$f0 φ(\$t0)  
 ldc1 \$f2 φ(\$t1)  
 add.j \$f4, \$f0, \$f2 stall x2  
 mult.j \$f4, \$f4, \$f6 stall x4  
 sdc1 \$f4, φ(\$t2) stall x6  
 addi \$t0, \$t0, 8  
 addi \$t1, \$t1, 8  
 subi \$t3, \$t3, 1  
 bnez \$t3 loop 8  
 addi \$t2 \$t2 8

a)

|    | Read      | Written |
|----|-----------|---------|
| 1  | \$t0      | \$f0    |
| 2  | \$t1      | \$f2    |
| 3  | \$f0 \$f2 | \$f4    |
| 4  | \$f4 \$f6 | \$f4    |
| 5  | \$f4 \$t2 | -       |
| 6  | \$t0      | \$t0    |
| 7  | \$t1      | \$t1    |
| 8  | \$t3      | \$t3    |
| 9  | \$t3      | -       |
| 10 | \$t2      | \$t2    |

| Source | target | Instruction | type |
|--------|--------|-------------|------|
| 1      | 3      | \$f0        | RAW  |
| 2      | 3      | \$f0        | RAW  |
| 4      | 5      | \$f4        | RAW  |
| 8      | 9      | \$t3        | RAW  |
| 3      | 4      | \$f4        | RAW  |

b) 27 + 1 || 27 cycles per iteration

c) Loop: \$f0 φ(\$t0) ldc1  
 ldc1 \$f2 φ(\$t1)  
 addi \$t0 \$t0 P  
 addi \$t1 \$t1 8  
 add.j mult.j \$f4 \$f0 \$f2  
 subi \$t3 \$t3 1

<stall> x 3  
 mult.j \$f4 \$f4 \$f4 \$f6  
 sdc1 \$f4 φ(\$t2)  
 bnez \$t3 loop  
 addi \$t2 \$t2 8

18 cycles per iteration

4

d) loop:  $\text{ldc1 } \$f_0 (\$t_0)$   $25n + 1$

$\text{ldc1 } \$f_2 (\$t_1)$

$\text{ldc1 } \$f_6 8(\$t_0)$

$\text{ldc1 } \$f_{10} 8(\$t_1)$

$\text{ldc1 } \$f_{18} 16(\$t_0)$

$\text{ldc1 } \$f_{18} 16(\$t_1)$

$\text{ldc1 } \$f_{20} 24(\$t_0)$

$\text{ldc1 } \$f_{22} 24(\$t_1)$

$\text{add } \$f_4 \$f_0 \$f_2$

$\text{add } \$f_{10} \$f_4 \$f_{26}$

$\text{add } \$f_{18} \$f_{10} \$f_{12}$

$\text{add } \$f_{22} \$f_{20} \$f_{18}$

$\text{sub } \$t_3 \$t_3 4$

$\text{mult } \$f_4 \$f_4 \$f_{26}$

$\text{mult } \$f_{10} \$f_{10} \$f_6$

$\text{mult } \$f_{18} \$f_{18} \$f_6$

$\text{mult } \$f_{22} \$f_{22} \$f_6$

$\text{addi } \$t_0 \$t_0 32$

$\text{addi } \$t_1 \$t_1 32$

<stall>

$\text{sdc1 } \$f_4 -32(\$t_2)$

$\text{sdc1 } \$f_{10} -24(\$t_2)$

$\text{sdc1 } \$f_{18} -16(\$t_2)$

$\text{sdc1 } \$f_{20} -8(\$t_2)$

$\text{bne } \$t_3 \text{ loop}$

$\text{addi } \$t_2 \$t_2 8$