

# Essentials of Computer Systems - Exercises #4

## 1 Performance Equations

$$\text{Execution time} = T = IC \times t_{\text{CLK}} \times CPI = \frac{\# \text{insn}}{\text{program}} \cdot \frac{\# \text{seconds}}{\text{cycle}} \cdot \frac{\# \text{cycles}}{\text{insn}}$$

$$\text{Performance} = \frac{1}{T}$$

$$\text{Speedup} = \frac{T_{\text{old}}}{T_{\text{new}}}$$

Amdahl's law: the total speedup depends on

- fraction of X being used in original system  $\Rightarrow f$
- enhancement/speedup of X  $\Rightarrow S$

$$S_{\text{total}} = \frac{1}{\frac{f}{S} + (1-f)}$$

### Exercise 1.1 Average CPI

When running a given benchmark we found the frequency and CPIs of instructions given in table 1. Afterwards, two separate modifications were made and benchmarking repeated:

- hardware modifications resulting in changed CPIs of the instructions (see table 2)
- modifications to the compiler resulting in changed frequencies if the instructions (see table 3)

Compare all three options and identify the best one!  $\rightarrow$  average CPI of the different versions

| Table 1 |       |         |
|---------|-------|---------|
| insn    | $f_i$ | $CPI_i$ |
| ALU     | 50%   | 1       |
| load    | 20%   | 5       |
| store   | 10%   | 3       |
| other   | 20%   | 2       |

| Table 2 |       |         |
|---------|-------|---------|
| insn    | $f_i$ | $CPI_i$ |
| ALU     | 50%   | 1       |
| load    | 20%   | 4       |
| store   | 10%   | 4       |
| other   | 20%   | 2       |

| Table 3 |       |         |
|---------|-------|---------|
| insn    | $f_i$ | $CPI_i$ |
| ALU     | 55%   | 1       |
| load    | 10%   | 5       |
| store   | 10%   | 3       |
| other   | 25%   | 2       |

$$\frac{1 \cdot 50 + 5 \cdot 20 + 3 \cdot 10 + 2 \cdot 20}{100} = 2.2 \quad \frac{1 \cdot 50 + 4 \cdot 20 + 4 \cdot 10 + 2 \cdot 20}{100} = 2.10 \quad \frac{1 \cdot 55 + 5 \cdot 10 + 3 \cdot 10 + 2 \cdot 25}{100} = 1.85$$

$$\cancel{\text{Speedup} = \frac{IC \cdot t_{\text{clock}} \cdot CPI_{\text{old}}}{IC \cdot t_{\text{clock}} \cdot CPI_{\text{new}}}} = \frac{CPI_{\text{old}}}{CPI_{\text{new}}} \quad \begin{array}{l} \text{same benchmark} \rightarrow \text{same IC} \\ \text{same hardware} \rightarrow \text{same } t_{\text{clock}} \end{array} \quad .$$

$$S = \frac{2.2}{2.1} = 1.05$$

$$S = \frac{2.2}{1.85} = 1.12 \quad \rightarrow \text{higher speed-up}$$

### Exercise 1.2 Amdahl's law

The old CPU takes 5 seconds to execute a given program. A new CPU with four times faster adder<sup>1</sup> is available. The adder is used in 25% of the instructions in the given program. Compute the new execution time for the given program when using the new CPU.

$$S = \frac{1}{\frac{0.25}{4} + (1 - 0.25)} = 1.23 \rightarrow \text{speed-up of 23\%}$$

$$\begin{array}{l} \text{old time} \rightarrow \frac{5 \text{ s}}{1.23} = 4.0625 \text{ s} \rightarrow \text{new time} \\ \text{speed-up} \rightarrow 1.23 \end{array}$$

<sup>1</sup> adder is a part of the ALU

## 2 Execution on the Pipelined CPU

Exercise 2.1 For the following code snippet show:

- analysis of data dependencies. Identify the data hazards.
- show the timing diagram of execution on pipelined CPU without bypassing logic (see lecture slides)
- show the timing diagram of execution on pipelined CPU with bypassing logic from EX/MEM and MEM/WB interstage registers
- show the timing diagram of execution on pipelined CPU with bypassing logic from EX/MEM and MEM/WB interstage registers and with RegFile bypassing. RegFile bypassing allows the “overlap” of ID and WB stage by adding MUXes on the inputs to the interstage registers ID/EX.A and ID/EX.B, allowing to select either the RegFile outputs rs and rt or the data value from WB stage - please see schematic below.



| recap:            |                                                |
|-------------------|------------------------------------------------|
| insn syntax       | operation                                      |
| and Rd, Rs, Rt    | $Rd \leftarrow Rs \text{ AND } Rt$             |
| or Rd, Rs, Rt     | $Rd \leftarrow Rs \text{ OR } Rt$              |
| lw Rt, offset(Rs) | $Rt \leftarrow \text{MEM}[Rs + \text{offset}]$ |
|                   | <b>W      R</b>                                |

code snippet:

$lw R3, 0(R1)$   
 $and R4, R3, R2$   
 $or R5, R3, R2$   
 $and R3, R4, R5$

### a) Analysis

code snippet:

$lw R3, 0(R1)$   
 $and R4, R3, R2$   
 $or R5, R3, R2$   
 $and R3, R4, R5$

$insn 2 \text{ reads } R3 \rightarrow insn 3 \text{ reads } R3$   
 $insn 2 \text{ reads } R2 \rightarrow insn 3 \text{ reads } R2$ 
} RAR      not a data dependency

$insn 2 \text{ reads } R3 \rightarrow insn 4 \text{ writes } R3$   
 $insn 3 \text{ reads } R3 \rightarrow insn 4 \text{ writes } R3$ 
} WAR      not a problem, as long as the insn are completed in the original order

$insn 1 \text{ writes } R3 \rightarrow insn 4 \text{ writes } R3$ 
} WAW

$insn 1 \text{ writes } R3 \rightarrow insn 2 \text{ reads } R3$   
 $insn 1 \text{ writes } R3 \rightarrow insn 3 \text{ reads } R3$   
 $insn 2 \text{ writes } R4 \rightarrow insn 4 \text{ reads } R4$   
 $insn 3 \text{ writes } R5 \rightarrow insn 4 \text{ reads } R5$ 
} RAW      Hazards!

### b) Stalling the CPU waits (stall) until the needed value is available in RegFile (after WB)

|                  | 1  | 2  | 3  | 4   | 5  | 6  | 7   | 8   | 9  | 10 | 11  | 12 | 13 | 14 |
|------------------|----|----|----|-----|----|----|-----|-----|----|----|-----|----|----|----|
| $lw R3, 0(R1)$   | IF | ID | EX | MEM | WB |    |     |     |    |    |     |    |    |    |
| $and R4, R3, R2$ | IF | o  | o  | o   | ID | EX | MEM | WB  |    |    |     |    |    |    |
| $or R5, R3, R2$  |    |    |    |     | IF | ID | EX  | MEM | WB |    |     |    |    |    |
| $and R3, R4, R5$ |    |    |    |     | IF | o  | o   | o   | ID | EX | MEM | WB |    |    |

r3 value is retrievable in RegFile from cycle 6

r4 value is retrievable in RegFile from cycle 10

r5 value is retrievable in RegFile from cycle 11

### c) Interstage registers bypassing

request the values from the interstage registers as soon as they are available (bypass):

- load (lw): value available in MEM/WB (after MEM) → only temporarily available !
- arithmetic/logic operations: value available in EX/MEM, then in MEM/WB (after EX and MEM)

|                                                                           | 1 | 2 | 3 | 4               | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---------------------------------------------------------------------------|---|---|---|-----------------|---|---|---|---|---|----|----|----|----|----|
| $lw R3, 0(R1)$<br>$and R4, R3, R2$<br>$or R5, R3, R2$<br>$and R3, R4, R5$ |   |   |   | IF ID EX MEM WB |   |   |   |   |   |    |    |    |    |    |

IF ID EX MEM WB

IF 0 ID EX MEM WB

IF 0 ID EX MEM WB

IF 0 ID EX MEM WB

r3 value is directly available in cycle 5, and retrievable in RegFile from cycle 6

r4 value is directly available in cycles 6 and 7 and retrievable in RegFile from cycle 8

r5 value is directly available in cycles 8 and 9 and retrievable in RegFile from cycle 10

DECODE accepts addresses, EXECUTE accepts data. Only DECODE (ID) can access the RegFile, only EXECUTE can take data directly from the interstage registers

### d) RegFile bypassing

data that are being saved in the RegFile are available to DECODE in the same cycle.

|                                                                           | 1 | 2 | 3 | 4               | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---------------------------------------------------------------------------|---|---|---|-----------------|---|---|---|---|---|----|----|----|----|----|
| $lw R3, 0(R1)$<br>$and R4, R3, R2$<br>$or R5, R3, R2$<br>$and R3, R4, R5$ |   |   |   | IF ID EX MEM WB |   |   |   |   |   |    |    |    |    |    |

IF ID EX MEM WB

IF 0 ID EX MEM WB

IF ID EX MEM WB

IF ID EX MEM WB

r3 is directly available in cycle 5 and retrievable in RegFile from cycle 6  
 both r4 and r5 use bypassing