

CMSC-611 : HW 1

Q1) Given:

| Class | CPI <sub>P1</sub> | CPI <sub>P2</sub> |
|-------|-------------------|-------------------|
| A     | 1                 | 2                 |
| B     | 2                 | 2                 |
| C     | 3                 | 2                 |
| D     | 3                 | 2                 |

The instructions are divided among the classes as follows:

class A = 10%, class B = 20%

class C = 50%, class D = 20%

To find:

a) Global CPI for each implementation

b) Total clock cycles required in both cases

Ans:

a)

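for processor P1, where $N_i$ is the fraction of instructions in class $i$:

$$\begin{aligned} \text{CPI}_{\text{global}} &= \text{CPI}_A \cdot N_A + \text{CPI}_B \cdot N_B + \text{CPI}_C \cdot N_C + \text{CPI}_D \cdot N_D \\ &= 1 \times 0.1 + 2 \times 0.2 + 3 \times 0.5 + 3 \times 0.2 \\ &= 2.6 \end{aligned}$$

∴ Global CPI for processor P1 is 2.6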

for processor P2:

$$\begin{aligned} \text{CPI}_{\text{global}} &= \text{CPI}_A \cdot N_A + \text{CPI}_B \cdot N_B + \text{CPI}_C \cdot N_C + \text{CPI}_D \cdot N_D \\ &= 2 \times 0.1 + 2 \times 0.2 + 2 \times 0.5 + 2 \times 0.2 \\ &= 2 \end{aligned}$$

∴ Global CPI for processor P2 is 2
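The weighted sum in part a) can be checked with a short script. This is a minimal sketch: the names `MIX`, `CPI_P1`, `CPI_P2` and `global_cpi` are illustrative, and the numbers are copied from the table and the instruction mix given in the question.

```python
# Fraction of executed instructions in each class (from the problem statement).
MIX = {"A": 0.10, "B": 0.20, "C": 0.50, "D": 0.20}

# Cycles per instruction of each class on each implementation (from the table).
CPI_P1 = {"A": 1, "B": 2, "C": 3, "D": 3}
CPI_P2 = {"A": 2, "B": 2, "C": 2, "D": 2}


def global_cpi(cpi_by_class, mix):
    """Return the weighted-average CPI: sum over classes of CPI times fraction."""
    return sum(cpi_by_class[name] * mix[name] for name in mix)


print("P1 global CPI:", round(global_cpi(CPI_P1, MIX), 3))
print("P2 global CPI:", round(global_cpi(CPI_P2, MIX), 3))
```

Because P2 has the same CPI for every class, its global CPI does not depend on the instruction mix at all.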

b)

We now know the global CPI for both implementations, so

$$\text{clock cycles} = \text{CPI} \times (\text{No. of Inst})$$

The number of instructions is  $1 \times 10^6$

∴ clock cycles required for P1 =  $\text{CPI}_{P1} \times (\text{No. of Inst})$

$$\text{CPI}_{P1} = 2.6$$

$$\begin{aligned} (\text{clock cycles})_{P1} &= 2.6 \times 10^6 \\ &= 26 \times 10^5 \end{aligned}$$

$$\text{CPI}_{P2} = 2.0$$

$$(\text{clock cycles})_{P2} = 2.0 \times 10^6$$
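The same multiplication can be written as a small sketch. The instruction count of $1 \times 10^6$ and the two global CPI values are taken from the answer above; the names `GLOBAL_CPI`, `total_cycles` and `CYCLES` are illustrative.

```python
INSTRUCTION_COUNT = 1_000_000          # 1 x 10^6 instructions, as used in the answer
GLOBAL_CPI = {"P1": 2.6, "P2": 2.0}    # results of part a)


def total_cycles(cpi, instruction_count):
    """Clock cycles = CPI x number of instructions."""
    return cpi * instruction_count


CYCLES = {name: total_cycles(cpi, INSTRUCTION_COUNT)
          for name, cpi in GLOBAL_CPI.items()}

for name, cycles in CYCLES.items():
    print(f"{name}: {cycles:.2e} clock cycles")
```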

Q2) Given:

| Instruction | Instruction count | CPI |
|-------------|-------------------|-----|
| FP          | $50 \times 10^6$  | 1   |
| INT         | $110 \times 10^6$ | 1   |
| L/S         | $80 \times 10^6$  | 4   |
| Branch      | $16 \times 10^6$  | 2   |

To find:

- How much must we improve the CPI of FP if we want the program to run two times faster
- How much must we improve the CPI of L/S if we want the program to run two times faster
- How much is the execution time improved if the CPI of INT & FP is reduced by 40 % & the CPI of L/S & Branch is reduced by 30 %

Ans a) Suppose we improve the CPI of FP by a factor of $x$ so that the system runs two times faster.

First, calculate the clock cycles for the old system:

$$\begin{aligned} \text{clock}_{\text{old}} &= 50 \times 10^6 \times 1 + 110 \times 10^6 \times 1 + 80 \times 10^6 \times 4 \\ &\quad + 16 \times 10^6 \times 2 \\ &= (50 + 110 + 320 + 32) \times 10^6 \\ &= 512 \times 10^6 \end{aligned}$$

The system must run 2x faster, so the clock cycle count must be halved:

$$\text{clock}_{\text{new}} = \frac{\text{clock}_{\text{old}}}{2}$$

$$\therefore \frac{50 \times 10^6 \times 1}{x} + 110 \times 10^6 \times 1 + 80 \times 10^6 \times 4 + 16 \times 10^6 \times 2 = 256 \times 10^6$$

$$50 \times 10^6 \times \frac{1}{x} = (256 - 462) \times 10^6$$

$$50 \times 10^6 \times \frac{1}{x} = -206 \times 10^6$$

$$x \approx -0.243$$

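Here, $x$ is negative, so we infer that it is not possible to double the performance of the system just by improving the CPI of FP instructions. The INT, L/S and Branch instructions alone already need 462 million cycles, which is more than the 256 million cycle budget.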
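The infeasibility can also be seen by solving directly for the CPI that FP would need. This is a hedged sketch: it reads the instruction counts as multiples of $10^6$ and takes the per-class CPIs (1, 1, 4, 2) from the worked equation for the old system; `required_cpi` is an illustrative helper name, not part of the assignment.

```python
# Instruction counts (read as multiples of 10^6) and per-class CPIs
# as they appear in the old-system clock cycle calculation.
COUNTS = {"FP": 50e6, "INT": 110e6, "L/S": 80e6, "Branch": 16e6}
CPI = {"FP": 1, "INT": 1, "L/S": 4, "Branch": 2}


def required_cpi(target, speedup=2.0):
    """CPI the target class would need so that total cycles shrink by `speedup`."""
    total = sum(COUNTS[k] * CPI[k] for k in COUNTS)
    others = total - COUNTS[target] * CPI[target]   # cycles we cannot change
    budget = total / speedup - others               # cycles left for the target class
    return budget / COUNTS[target]


fp_cpi = required_cpi("FP")
print("required FP CPI:", fp_cpi)
print("feasible:", fp_cpi > 0)
```

A negative required CPI means the remaining classes already exceed the halved cycle budget, which is the same conclusion as the negative $x$ above.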

b) Given :

Suppose we improve the CPI of L/S instructions by a factor of  $x$  so that the overall system runs two times faster.

$$50 \times 10^6 \times 1 + 110 \times 10^6 \times 1 + \frac{80 \times 10^6 \times 4}{x} + 16 \times 10^6 \times 2 = 512 \times 10^6 \times \frac{1}{2}$$

$$\therefore 160 + \frac{80 \times 4}{x} + 32 = 256$$

$$\frac{320}{x} = 64$$

$$x = \frac{320}{64} = 5$$

∴ If we improve the CPI of L/S instructions by a factor of 5 (from 4 to 0.8), the overall system performance will double
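As a sanity check, the result can be verified by recomputing the total cycle count with the improved L/S CPI. This is a sketch under the same reading as the answer: counts are multiples of $10^6$, the CPIs are 1, 1, 4 and 2, and the names used here are illustrative.

```python
COUNTS = {"FP": 50e6, "INT": 110e6, "L/S": 80e6, "Branch": 16e6}
CPI = {"FP": 1, "INT": 1, "L/S": 4, "Branch": 2}


def total_cycles(cpi_by_class):
    """Total clock cycles = sum over classes of count x CPI."""
    return sum(COUNTS[k] * cpi_by_class[k] for k in COUNTS)


# Improve only the L/S CPI by the factor found above.
improved = dict(CPI)
improved["L/S"] = CPI["L/S"] / 5

old_cycles = total_cycles(CPI)
new_cycles = total_cycles(improved)
speedup = old_cycles / new_cycles

print("old cycles:", old_cycles)
print("new cycles:", new_cycles)
print("speedup:", round(speedup, 3))
```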

c) Given :

CPI of INT & FP instruction are reduced by 40 %

CPI of L/S & Branch instruction are reduced by 30 %

$$\begin{aligned} \text{clock}_{\text{new}} &= 50 \times 10^6 \times 1 (1 - 0.4) + 110 \times 10^6 \times 1 (1 - 0.4) \\ &\quad + 80 \times 10^6 \times 4 (1 - 0.3) + 16 \times 10^6 \times 2 (1 - 0.3) \\ &= (30 + 66 + 224 + 22.4) \times 10^6 \\ &= 342.4 \times 10^6 \qquad (1) \end{aligned}$$

$$\text{clock}_{\text{old}} = 512 \times 10^6 \qquad (2)$$

With a clock rate of  $2 \times 10^9$  Hz:

From eq (1), execution time for the new system =  $\frac{342.4 \times 10^6}{2 \times 10^9}$  sec  
 $= 0.1712$  sec

From eq (2), execution time for the old system =  $\frac{512 \times 10^6}{2 \times 10^9}$  sec  
 $= 0.256$  sec

$\therefore$  Execution time of the overall system will be improved by  $0.0848$  sec  
or

33 %
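The percentage can be reproduced with a few lines of code. This sketch assumes the instruction counts are multiples of $10^6$, the per-class CPIs are 1, 1, 4 and 2, and the clock rate is $2 \times 10^9$ Hz (the divisor used in the answer); all identifiers are illustrative. The relative improvement does not depend on the scale of the counts or on the clock rate, only the absolute times do.

```python
COUNTS = {"FP": 50e6, "INT": 110e6, "L/S": 80e6, "Branch": 16e6}
CPI = {"FP": 1, "INT": 1, "L/S": 4, "Branch": 2}
REDUCTION = {"FP": 0.40, "INT": 0.40, "L/S": 0.30, "Branch": 0.30}
CLOCK_RATE_HZ = 2e9  # assumed from the divisor in the answer


def execution_time(cpi_by_class):
    """Execution time in seconds = total clock cycles / clock rate."""
    cycles = sum(COUNTS[k] * cpi_by_class[k] for k in COUNTS)
    return cycles / CLOCK_RATE_HZ


# Apply the stated percentage reduction to each class CPI.
new_cpi = {k: CPI[k] * (1 - REDUCTION[k]) for k in CPI}

old_time = execution_time(CPI)
new_time = execution_time(new_cpi)
improvement = (old_time - new_time) / old_time

print(f"old time: {old_time:.4f} s")
print(f"new time: {new_time:.4f} s")
print(f"improvement: {improvement:.1%}")
```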

Q3) Given:

1 MB L2 cache with a block size of 64 bytes.  
The first 16-byte block can be received in 120 cycles.

Each of the three additional 16-byte blocks can be received in 16 cycles.

To find:

- How many cycles it takes to service a miss with critical word first
- How many cycles it takes to service a miss without critical word first

Ans)

i) With critical word first, the block containing the requested word is sent to the processor as soon as it arrives, and the processor continues execution

The first block takes 120 cycles.

The next three blocks take 16 cycles each, but their transfer is overlapped with execution

∴ With critical word first, it takes 120 cycles to service the L2 cache miss

ii) Without critical word first

In this mode, the processor does not receive the missed word as soon as it arrives.  
It waits for the transfer of the whole 64-byte block to complete.

∴ It will take a total of  $120 + 3 \times 16 = 168$  cycles
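Both cases can be expressed in one small function. This is a minimal sketch that assumes, as the answer does, that the requested word sits in the first 16-byte block delivered; the constant and function names are illustrative.

```python
BLOCK_BYTES = 64           # L2 block size
CHUNK_BYTES = 16           # bytes delivered per transfer
FIRST_CHUNK_CYCLES = 120   # latency of the first 16-byte block
EXTRA_CHUNK_CYCLES = 16    # latency of each additional 16-byte block


def miss_service_cycles(critical_word_first):
    """Cycles until the processor can resume after an L2 miss."""
    chunks = BLOCK_BYTES // CHUNK_BYTES
    if critical_word_first:
        # Assumption: the needed word is in the first chunk, so the
        # remaining transfers overlap with execution.
        return FIRST_CHUNK_CYCLES
    # Otherwise the processor waits for the whole block.
    return FIRST_CHUNK_CYCLES + (chunks - 1) * EXTRA_CHUNK_CYCLES


print("with critical word first:   ", miss_service_cycles(True))
print("without critical word first:", miss_service_cycles(False))
```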

Q4)

a) The number of instructions & the number of clock cycles required for each instruction are totally independent  
→ TRUE : clock cycles per instruction (CPI) are totally independent of the instruction set used;  
CPI depends only on the computer hardware organization

b) Performance of computer architecture is usually improved by decreasing clock rate  
→ FALSE : Decreasing the clock rate means there are fewer ticks per second, i.e. more time per tick

The performance equation depends on the inverse of the cycle time

∴ Performance will degrade after decreasing the clock rate

In the performance equation, CPU time  $\propto \frac{1}{\text{clock rate}}$
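The proportionality can be illustrated numerically with the usual CPU time relation, CPU time = instruction count × CPI / clock rate. The instruction count, CPI and clock rates below are hypothetical values chosen only for illustration; they are not from the assignment.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds = instruction count x CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


# Hypothetical workload: same program and CPI, two different clock rates.
INSTRUCTIONS = 1_000_000
CPI = 2.0

fast = cpu_time(INSTRUCTIONS, CPI, 2e9)   # 2 GHz
slow = cpu_time(INSTRUCTIONS, CPI, 1e9)   # clock rate halved to 1 GHz

print(f"2 GHz: {fast:.6f} s")
print(f"1 GHz: {slow:.6f} s")
print("slowdown factor:", slow / fast)
```

Halving the clock rate doubles the CPU time when the instruction count and CPI stay the same.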

Q5)

The given code calculates the average of an array, which requires accessing adjacent memory locations.

The average is calculated from every value in the array, and these values have memory addresses next to each other.

Since the memory addresses of the array elements are next to each other, spatial locality must be considered while caching data.

When designing the cache, with constant capacity & associativity, I would choose a cache with a larger block size.

As more adjacent data are kept in each cache block, there is less chance of a cache miss.

The program uses consecutive addresses and never reuses them. Using the property of spatial locality, a larger block size should be used.
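The effect of block size on a sequential scan can be sketched with a tiny cache model. Everything here is an assumption made for illustration: a direct-mapped cache (associativity fixed at 1), a 1 KB capacity, 4-byte array elements and a 1024-element array. The function `count_misses` is a hypothetical helper, not the code from the assignment.

```python
def count_misses(num_elements, element_bytes, block_bytes, capacity_bytes):
    """Count misses for one sequential pass over an array
    in a direct-mapped cache of fixed capacity."""
    num_lines = capacity_bytes // block_bytes
    tags = [None] * num_lines            # block currently held by each line
    misses = 0
    for i in range(num_elements):
        block = (i * element_bytes) // block_bytes
        line = block % num_lines
        if tags[line] != block:          # block not present: miss and fill
            tags[line] = block
            misses += 1
    return misses


# Same capacity and associativity, different block sizes.
for block_bytes in (16, 32, 64, 128):
    misses = count_misses(1024, 4, block_bytes, 1024)
    print(f"block size {block_bytes:3d} B -> {misses} misses")
```

In this model each block is missed exactly once and never reused, so doubling the block size halves the number of misses for the scan.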