

**NOTES**

**COMPUTER  
SYSTEM  
ARCHITECTURE**

# Computer System Architecture

↳ deals with functional behaviour of computer system.

↳ design implementation for the various part of computer.



## Architecture

Instruction Set  
Architecture  
(ISA)

Hardware System  
Architecture  
(HSA).

e.g. MOV R1 , 02H.

MOV R2 , 03H

ADD R1,R2

STORE X,R1.

R1 , R2 → + Addr

## Computer Architecture

Von Neumann Architecture  
(A.K.A. Princeton Architecture).

Non - von Neumann Architecture



Harvard  
Architecture

Modified  
Harvard  
Architecture



(Containing, instruction  
and data both)

## Computer Architecture

→ SIMD → [ Single instruction stream, Single data stream ]



→ SIMD

→ SIMD (Not used yet)

→ MIMD

## Memory

Primary memory → Basically → DRAM. ( Little storage than Cache, volatile )

Cache → SRAM. ( costly, fastest, volatile )

Secondary Memory → Hard disk.



## Memory Hierarchy

( Paging, demand paging )



→ We can't use large amount of fast memory

↳ Expensive → cost, power, space.

↳ Even fast memory chips make slow big memory system.

→ Low level are closer to CPU.

## Memory Interfacing



$$H_1 T_1 + ((1-H_1) H_2)(T_1+T_2) + ((1-H_1)(1-H_2))(T_1+T_2+T_3)$$



$$H_1 T_1 + ((1-H_1) H_2) T_2 + ((1-H_1)(1-H_2)) T_3$$

## Cache memory

→ CPU looks first in Cache (L1, L2 and L3) then main memory.



→ General Cache organisation is  $(S, \Sigma, B)$

$$S = 2^s = \text{no. of sets in cache}$$

$$\Sigma = 2^e = \text{no. of lines in set} \quad \Sigma = 2^s$$

$$B = 2^b = \text{no. of bytes in line}$$



$$\rightarrow \text{Cache size } C = S \times \Sigma \times B$$



Cache Performance → • miss rate = fraction of memory references not found in cache  $= 1 - \text{hit rate}$

Typical miss rate =  $3-10\%$  for L1.

< 1% for L2, depends on size of and locality

hit time :- time to deliver a line in the cache to the processor  
 includes time to determine whether the line is in the ~~empty~~ cache.

Typical time  $\rightarrow$  L1  $\rightarrow$  1-2 cycle.  
 L2  $\rightarrow$  3-20 cycle.

Miss penalty  $\rightarrow$  Additional time required for data access because of Cache miss.  
 typically  $\rightarrow$  50-200 cycles for main memory

Calculating Avg. Access time (AMAT)

L1  $\rightarrow$  1 cycle and miss rate 10%.

L2  $\rightarrow$  10 cycle and miss rate 2%.

DRAM  $\rightarrow$  80 cycle and miss rate 0% (main memory)

$$\begin{aligned} \text{AMAT} &= 1 + 0.10 * 10 + 0.1 * 0.02 * 80 \rightarrow \begin{array}{l} \text{Always L1 cache +} \\ \text{miss L1 * time L2} \\ + \text{miss L1 * miss L2} \\ * \text{time DRAM} \end{array} \\ &= 1 + 1 + 0.16 = 2.16 \text{ cycles} \end{aligned}$$

Alternative  $\rightarrow$

$$0.9 * 1 + 0.1 * 0.98 (10+1) + 0.1 * 0.2 * (1+10+80)$$

$$H_1 T_1 + H_1 (1-H_1) H_2 (T_1+T_2) + (1-H_1)(1-H_2) (T_1+T_2+T_3) \\ = 2.16 \text{ cycles}$$

~~So~~ would you believe 59% hit is twice as good as 97%?

Consider L1 cache hits  $\rightarrow$  1 cycle

L1 miss penalty  $\rightarrow$  100 cycle

97% L1 hits :-  $1 + 0.03 \times 100 \text{ cycle} = 4 \text{ cycles}$

$1 + 0.03 \times 100 \text{ cycle} = 2 \text{ cycles}$

CPU Time  $\rightarrow$  Program execution cycle  
 Memory stall cycle.  $\hookrightarrow$  Mainly from cache misses  
 $\hookrightarrow$  Include cache hit time.

$$\text{Memory stall cycle} = \frac{\text{Memory access}}{\text{Program}} \times \text{Miss rate} \times \text{Miss penalty}$$

$$= \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Miss}}{\text{Instruction}} \times \text{Miss penalty}$$

Eg - Instruction cache miss rate = 2%.

- data % % % = 4%.

- Miss penalty = 100 cycles

- Basic CPI (with ideal cache performance) = 2  
↓  
Cycles per instruction

- Load and store account are 36% instructions.

Miss cycles per instruction.

- Instruction cache  $\Rightarrow 0.02 \times 100 = 2$

- Data cache:  $0.36 \times 0.04 = 1.44$ .

Actual CPI =  $2 + 2 + 1.44 = 5.44$ .

→ Ideal CPU is  $5.44/2 = 2.72$  times faster

→ we spend  $2.44 / 5.44 = 44\%$  of our execution time on memory stall.

Let's Reduce ideal CPI

→ what if we improve the datapath so that ideal CPI was reduced → assume base CPI = 1.5

Actual CPI =  $1.5 + 2 + 1.44 = 4.94$ .

→ Ideal CPU is  $4.94 / 1.5 = 3.29$  times faster.

→ we spend  $3.44 / 4.94 = 70\%$  of our execution time on memory stall.

Proportion

→ So, Decrease base CPI → greater proportion of time spent on the memory stall.

→ Increase clock rate → memory stalls accounted for more CPU cycles.

Multilevel Cache Consideration

L1 → focus on minimal hit time

L2 → focus on less miss rate

Result → L1 cache → usually smaller than single cache

→ L1 cache block size smaller than L2 block size.

## Intel Core i7 cache Hierarchy



### Some other points on Cache design.

Ideally → Each word is stored separately

→ Any word can be stored anywhere

→ Requires too many comparators.

Reality → Group of words into blocks.

Classic Cache → After  $S = 2$ ,  $B = 32B$ ,  $C = 256$  bytes,  
 $A = 4$ .



$$\text{no. of lines (blocks)} = \frac{C}{B}$$

MIPS → million instruction per second

## Cache mapping

↓  
Direct      Fully      k-way set  
mapping    association associative

Direct Mapping. →  $k \leq n$

Cache

|                |                |
|----------------|----------------|
| L <sub>0</sub> | B <sub>0</sub> |
| L <sub>1</sub> | B <sub>1</sub> |
| L <sub>2</sub> | B <sub>2</sub> |
| L <sub>3</sub> | B <sub>3</sub> |

16 words.

Main memory

|                  |                  |                  |                  |                 |
|------------------|------------------|------------------|------------------|-----------------|
| w <sub>0</sub>   | w <sub>1</sub>   | w <sub>2</sub>   | w <sub>3</sub>   | B <sub>0</sub>  |
| w <sub>4</sub>   | w <sub>5</sub>   | w <sub>6</sub>   | w <sub>7</sub>   | B <sub>1</sub>  |
|                  |                  |                  |                  | B <sub>2</sub>  |
|                  |                  |                  |                  | B <sub>3</sub>  |
|                  |                  |                  |                  |                 |
|                  |                  |                  |                  |                 |
|                  |                  |                  |                  |                 |
|                  |                  |                  |                  |                 |
| w <sub>124</sub> | w <sub>125</sub> | w <sub>126</sub> | w <sub>127</sub> | B <sub>31</sub> |

line size = Block size

$$n = \frac{16}{4} = 4$$

(4 → Block size  
(4 words))

$$k \leq n \rightarrow 0 \bmod 4 = 0 \rightarrow B_0$$

128 words

$$\frac{128}{4} = 32$$

$$1 \bmod 4 = 1 \rightarrow B_1$$

!

;

$$0 - 127 \rightsquigarrow 2^7 \rightarrow 7 \text{ bits}$$

\* Physical Add.



Block num

Block

Offset



5

2

7

2<sup>5</sup>

→ disadvantageous block assignment  
placed, if others are  
vacant, fit cont.  
assigned to them

|     |          |              |
|-----|----------|--------------|
| 3   | 2        | 2            |
| Tag | Line no. | Block offset |

it will  
tell which  
state of K Mod 4.

is present → e.g. → B<sub>0</sub>, B<sub>4</sub>, B<sub>8</sub>, B<sub>12</sub> ... B<sub>24</sub>  
which one ✓

## Fully associative Mapping

as INR

any block can assign any line of cache.



here we are not finding line; so it is not present

## Set ~~Associative~~ Mapping

main memory.



1 set contains k-way lines  $\rightarrow$  k-way set associative cache

e.g. By k addressable MM size = 128 bytes (B)  $\rightarrow$   $2^7 \text{ bytes} = 2^7 \text{ bit}$   
 Cache Size = 32 B  
 $\text{Block Size} = 4 \text{ B} \rightarrow 2^2 = 2^2 \text{ bits}$ , no. of memory blocks =  $\frac{2^7}{2^2} = 2^5$



No. of lines in cache.

$$\rightarrow \frac{2^5}{2^2} = 2^3 = 8$$

$$\text{No. of sets} = \frac{2^5}{2} = 2^2 = 4$$



Categorization based  
on the instruction

~~CISC~~

CISC

RISC

- Common instruction set computer.
- Large no. of instructions
- Variable length instruction format
- Program counter will face some difficulties.
- Cost is High.
- Large no. of addressing modes
- More power full
- Several cycle instruction
- Manipulation directly in memory
- Microprogrammed control unit.
- Eg ~ Motorola 6800, Intel 8080
- Reduced instruction set computer.
- less no. of instructions
- Fixed length instruction format
- Cost is low.
- few no. of addressing modes
- Hardwired control unit.
- Eg → MIPS, ARM, AVR, SPARC

MULT 2:2, 3:3

MOV RA, 3:3

PWV RB, 2:2

PWVL RA, RB

Stow RA,

}

Load

}

 }

Store

## Pipeline Pipelining

This is the process of arrangement of hardware elements of CPU such that its overall performance is increased.

- Simultaneous execution of more than one instruction take place in pipelined process.



Pipeline → 5 stage → IF  
 FD  
 EX  
 MEM  
 WB

Generally RISC architecture use this pipeline.

Total Instruction → 8 → I<sub>1</sub> - I<sub>8</sub>

Stage diagram



$$n \times K \xrightarrow{n-1} T = K + (n-1)$$

~~Time~~  
No. of Instructions

int

From Pipelining we.

want CPI ≈ 1

$$5 + (8-1) = 5+7=12$$

e.g. n = 1000, (Instructions),

$$T = S + (1000-1) = 1000$$

$$\text{CPI} = \frac{1000}{1000} \approx 1$$

$$\text{Speedup} = \frac{NP(\text{Time})}{P(\text{Time})} \Rightarrow$$

For 5 instructions →  $\frac{40}{12} \approx 3$   
 ↓  
 Time.

$$\text{efficiency} = \frac{\text{use box}}{\text{Total Box}} = \frac{8 \times 5}{12 \times 5} = \frac{2}{3} \text{ efficiency}$$

Q. A 4-stage pipeline has stage delay as 150, 120, 160 and 140 ns. Registers are used b/w stages and have delay of 5 ns each. Assuming constant clock rate, the total time taken to process 1000 items on this pipeline will be \_\_\_\_\_?



First we have to find the clock freq. =  $\frac{1}{\text{clock period}}$

$$= \frac{1}{165 \text{ ns}}$$

$$\begin{aligned} \text{Time} &= 1 \times 4 \times 165 + 999 \times 1 \times 165 \\ &= 165 - 495 \text{ ns} \approx \underline{165.5 \text{ ns}} \end{aligned}$$

Q. Considered a non-pipeline processor with clock rate of 2.5 GHz and avg cycles of instructions of 4. The same processor is upgraded to ~~to~~ pipeline with 5 stages, but due to internal pipeline delay, the clock speed is reduced to 2 GHz. Assume that there is no stall in pipeline. The speedup achieved in pipelined processor is \_\_\_\_\_?

$$\text{Speedup} = \frac{T_{NP}}{T_p}$$

$$T_{NP} = 4 \times \frac{1}{2.5} \times 10^{-9}$$

$$T_p = 1 \times \frac{1}{2} \times 10^{-9}$$

CPI = 1

$$\text{Speedup} = \frac{4 \times \frac{1}{2.5}}{\frac{1}{2}} = 3.2$$

## Hazards in pipelining

It is very difficult to make  $CPI=1$ , and those difficulties are called Hazards in pipelining.



when we flush

the other processing  
instructions due to  
branching coming  
at first instruction.



But it will present after some time

so it is a Data Hazard with Read after write (RAW) problem



## Structural Hazard

↳ when the multiple instructions need same resource  
Solutions → Resource duplication or stall.

## Control Hazard

Solution → stall after decoding a jump instruction  
flushing → reduce the performance.

## Data hazard

RAW no solution. → stall.

$$I_1: R_3 \leftarrow R_1 + R_2$$

$$I_2: R_5 \leftarrow R_3 + R_4$$

|                | Detection                     |                |                |
|----------------|-------------------------------|----------------|----------------|
|                | Domain                        | Range          |                |
| I <sub>1</sub> | R <sub>1</sub> R <sub>2</sub> | R <sub>3</sub> | R <sub>3</sub> |
| I <sub>2</sub> | R <sub>3</sub> R <sub>4</sub> | R <sub>5</sub> | R <sub>5</sub> |

$I_1(\text{Range}) \cap I_2(\text{Domain}) \neq \emptyset$   
 $\hookrightarrow \text{RAW Data hazard.}$

### ② WAR F:

$$I_1: R_1 \leftarrow R_2 * R_3$$

$$I_2: R_2 \leftarrow R_4 + R_5$$

If this modify first, and then Instruction  $I_1$  will read  $R_2$ , it will occur inconsistency in the data.

This is very special case, which can be occurs in Auto-increment.

Auto-increment.

Detection  $\rightarrow \text{Domain}(I_1) \cap \text{Range}(I_2) \neq \emptyset$

### ③ WAN

$$I_1: R_3 \leftarrow R_1 * R_2$$

$$I_2: R_3 \leftarrow R_4 + R_5$$

It can happen when instructions are executing parallelly.

$\text{Range}(I_1) \cap \text{Range}(I_2) \neq \emptyset$

## ④ CPU performance evaluation

CPU clock rate depend on the specific CPU organization and hardware implementation.

Instruction operation  $\rightarrow$  elementary hardware operation performed in CPU clock cycle.

$$\text{CPI} \rightarrow (\text{cycles per instruction}) \approx \frac{1}{\text{IPC}}$$

instruction  
per cycle.

- Avg (or effective) CPI of a program  $\rightarrow$  Avg CPI of all instruction in the program of given CPU design.

### CPI instruction



Performance  $\rightarrow$  total execution time should shorter.

$$\text{Performance} = \frac{1}{\text{Execution time}}$$

$$\text{Speed up} = \frac{\text{Performance}_A}{\text{Performance}_B} = n = \frac{\text{Execution time}_B}{\text{Execution time}_A}$$

↓  
how to compare  
performance

→  $n \rightarrow A$  is  $n$  times faster than B.

### CPU Execution time

$$\text{clock cycle time} = C = \frac{1}{\text{Clock rate}}$$

CPI  $\rightarrow$  Clock/cycle/instruction.

$$\text{CPU Time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instruction}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

$$T = I \times CPI \times C$$

Instruction count

Avg on  
effective CPI  
for program.

CPU clock  
cycles

CPI  $\rightarrow$  Total program execution cycle/Instruction Count executed.

$$\left\{ \text{Total cycle} = CPI \times I \right\}$$

Q.  $I = 1,000,000$

CPI = 2.5

$$\text{Clock rate.} = 200 \text{ MHz} \rightarrow C = \frac{1}{200} \times 10^6$$

$$T = I \times CPI \times C = 1,000,000 \times 2.5 \times \frac{1}{200} \times 10^6$$

$$= \frac{5}{4} = 0.125 \approx 125 \text{ ms}$$

CPU performance

Affecting factors  $\rightarrow$  ISA  
(Instruction Set Archite.) Compiler, program, CPU design,  
technology (VLSI).

Ex Q. e.g. Program A  $\rightarrow I_A = 1,000,000$

$$\text{CPI} = 2.5 \text{ cycles/instruction}$$

$$\text{CPU clock rate} = 200 \text{ MHz}$$

Program B →  $T_B = 195000 \text{ ms}$   
 $CPI_B = 3.0$   
 $\text{Clock Rate}_B = 300 \text{ MHz}$

Speedup =  $\frac{\text{Performance}_B}{\text{Performance}_A} = \frac{A \text{ execution time}}{B \text{ execution time}}$   
 $= \frac{0.125}{0.095} = 1.32 \text{ or } 32\% \text{ faster.}$

Instruction Types → n types or classes of instructions generated on CPU (e.g. ALU, Branch).

$C_i$  = Count of instruction of type i executed.

$CPI_i$  = Cycles per instruction of type i

$\text{CPU clock cycles} = \sum_{i=1}^n CPI_i \times C_i$

[and Executed Instruction Count  $I = [C_i]$ ]

eg → Instruction class →

|  |   | CPI | Instruction count |
|--|---|-----|-------------------|
|  | A | 1   | 2                 |
|  | B | 2   | 1                 |
|  | C | 3   | 2                 |

clock cycles →  $2 \times 2 + 2 \times 1 + 3 \times 2 = 10$

Avg CPI → Clock cycles / I =  $10 / 5 = 2$ .

Instruction frequency & CPI.  
 $C_i$  → Count of instruction of type i

$CPI_i$  →

$F_i$  → Frequency or fraction of instruction of type i  
 $= \frac{C_i}{\text{total executed}} = \frac{C_i}{I} = \frac{C_i}{\sum C_i}$

$CPI = \sum (CPI_i \times F_i)$

avg effective CPI.

In action of ~~total~~ total time execution for 9<sup>th</sup> instruction =  $\frac{CPI_i \times T_i}{CPI}$

### eg Instruction Type freq. & CPI

| eg | Op     | Freq Fi | CPI <sub>i</sub> | $\rightarrow CPI_i \times F_i$ | Y. Time         |
|----|--------|---------|------------------|--------------------------------|-----------------|
|    | ALU    | 50%     | 1                | 0.5                            | 23% = 0.5 / 2.2 |
|    | Load   | 20%     | 5                | 1                              | 45% = 1 / 2.2   |
|    | Store  | 10%     | 3                | 0.3                            | 14% = 0.3 / 2.2 |
|    | Branch | 20%     | 2                | 0.4                            | 18% = 0.4 / 2.2 |
|    |        |         |                  | <u>2.2</u>                     |                 |
|    |        |         | <u>CPI =</u>     |                                |                 |

### Choosing Programs to evaluate Performance

→ Benchmarks / Level of program → to ~~evaluate~~ evaluate performance

① Actual target workload → Full applications that run on the target machine.

② Real Full program-based Benchmark → Programs that are typically targeted application

③ Small "kernel" benchmark

④ microbenchmark → integer, floating point, local memory

### MIPS (Million Instructions per second). Rating

→ measure of how many millions of instructions are executed per second.

$$\begin{aligned} \text{MIPS Rating} &= \frac{\text{Instruction count}}{(\text{Execution time} \times 10^6)} \\ &= \frac{\text{Instruction count}}{(\text{CPU clock} \times \text{Cycle Time} \times 10^6)} \\ &= \frac{\text{Instruction count} \times \text{clock rate}}{\text{Instr. time} \times \text{CPI} \times 10^6} \\ &= \frac{\text{clock rate}}{\text{CPI} \times 10^6} \end{aligned}$$

Problem with MIPS rating → As shown above MIPS rating does not account the count of instructions executed.

→ A higher MIPS rating in many cases may not mean higher performance or better execution time.

∴ Thus it cannot be used to compare computers with different instruction set.

Under what conditions can the MIPS rating be used to compare performance of diff. CPUs?

- Conditions:
- (1) The same program is used
  - (2) Same ISA is used
  - (3) The same compiler is used.

e.g. → For a machine (CPU) with instruction set.

| Instruction class | CPI |
|-------------------|-----|
| A                 | 1   |
| B                 | 2   |
| C                 | 3   |

Compiler produces the following executed instruction count

|            | A  | B | C |
|------------|----|---|---|
| Compiler 1 | 5  | 1 | 1 |
| Compiler 2 | 10 | 1 | 1 |

Clock rate = 100 MHz.

$$\text{MIPS} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6} = \frac{100 \times 10^6}{\text{CPI} \times 10^6} = \frac{100}{\text{CPI}}$$

For Compiler 1

$$\hookrightarrow \text{CPI}_1 = (5 \times 1 + 1 \times 2 + 1 \times 3) / (5 + 1 + 1) = 10/7 = 1.43.$$

$$\text{MIPS rating} = \frac{100}{1.43} = 70.0 \text{ MIPS.}$$

$$\text{CPU time}_1 = ((5+1+1) \times 1.43) / (100 \times 10^6) = 0.1 \text{ us}$$

Compiler 2:

For Compiler 2:

$$\hookrightarrow \text{CPI}_2 = (10 \times 1 + 1 \times 2 + 1 \times 3) / (10 + 1 + 1) = 15/12 = 1.25$$

$$\text{MIPS}_2 = \frac{100}{1.25} = 80.0 \text{ MIPS.}$$

$$\text{CPU time}_2 = (10 + 1 + 1) \times 1.25 / (100 \times 10^6) = 0.15 \text{ us.}$$

here we can see that, Compiler 2 is better while in reality code produced by compiler 1 is faster.

MFLOPS  $\rightarrow$  (Million Floating point operation Per Second)

$\hookrightarrow$  floating point operations  $\rightarrow$  addition, subtraction, multiplication, division operation applied to floating point representation

$\rightarrow$  measure the millions of floating point operation per second.

$$\boxed{\text{MFLOPS} = \frac{\text{No. of floating point operation}}{\text{Execution time} \times 10^6}}$$

$\rightarrow$  It is better to compare measure b/w different machines.  
(Applied even if ISAs are ~~not~~ different) than MISP ratio.

$\rightarrow$  It depends on the type of floating point operation present in the program.

Peak MFLOPS  $\rightarrow$  Simplest floating point instructions (with the lowest CPI).

### Quantitative Principles of Computer Design

Amdahl's Law  $\rightarrow$  The performance gain from improving some portion of a computer is calculated by.

$$\left\{ \begin{array}{l} \text{Speedup} = \frac{\text{Performance of entire task using the enhancement}}{\text{Performance of entire task without using the enhancement.}} \\ \qquad \qquad \qquad \text{ie, using some enhancement} \end{array} \right\}$$

$$\text{or } = \frac{\text{Execution time without enhancement}}{\text{Execution time for entire task using enhancement.}}$$

$\rightarrow$  Performance improvement or speed up due to enhancement.

$$\left( \begin{array}{l} \text{Speedup}(\Sigma) = \frac{\text{Performance with } \Sigma}{\text{Performance without } \Sigma.} \end{array} \right)$$

Amdahl's law

Enhancement accelerates a fraction  $F$  of the execution time by fraction  $'S'$ . And remainder of time unaffected.

Execution time with  $\Sigma$  =  $((1-F) + F/s) \times \text{Execution time without } \Sigma$ .

$$\text{Speedup } (\Sigma) = \frac{\text{Execution time without } \Sigma}{((1-F) + F/s) \times \text{Execution time without } S}$$



| <u>Op</u> | <u>Freq.</u> | <u>Cycle</u> | <u>CPI(i)</u> | <u>Y. Time</u> |
|-----------|--------------|--------------|---------------|----------------|
| ALU       | 50%          | 1            | 0.5           | 23%            |
| Load      | 20%          | 5            | 1.0           | 45%            |
| Store     | 10%          | 3            | 0.3           | 14%            |
| Branch    | 20%          | 2            | 1.4           | 18%            |

$$|CPI| = 2.2$$

no previous que.

→ Here Load instruction enhanced. → 5 to 2. cycles..

$$So] F = 0.45 \text{ or } 45\%$$

$$1-F = 1 - 0.45 = 0.55 \text{ or } \frac{45}{(100-45)} \Rightarrow 55\%$$

$$S \text{ freq. } S = S/2 = 2.5$$

$$\text{Using Amdahl's law} \rightarrow \text{Speedup} = \frac{1}{(1-F) + F/S} = \frac{1}{0.55 + 0.45/2.5}$$

$$CPI_{\text{new}} = \frac{2.2}{1.37} = 1.6$$

$$= 1.37$$

### Alternative Method

$$\text{Old CPI} = 2.2$$

$$\text{New CPI} = 0.5 \times 1 + 0.2 \times 2 + 0.1 \times 3 + 0.2 \times 2 = 1.6$$

$$\text{Speedup} = \frac{\text{Original execution time}}{\text{New exec. time}} = \frac{\pi \times CPI_{\text{old}} \times S}{\pi \times CPI_{\text{new}} \times 4}$$

$$= \frac{2.2}{1.6} = \underline{\underline{1.37}}$$

### Extending Amdahl's law for multiple Enhancement

$$\text{Speedup} = \frac{1}{((1 - \sum F_i) + \sum \frac{F_i}{S_i})}$$

→ Note:- All  $F_i$  refer to original execution time before the enhancement are applied.

$$\begin{array}{ll} S_1 = 10 & F_1 = 20\% \\ S_2 = 15 & F_2 = 15\% \\ S_3 = 30 & F_3 = 10\% \end{array}$$

$$\text{Speedup} = \frac{1}{(1 - \sum F_i) + \sum \frac{F_i}{S_i}} = \frac{1}{(1 - 0.2 - 0.15 - 0.1) + \left(\frac{0.2}{10} + \frac{0.15}{15} + \frac{0.1}{30}\right)} = \left(\frac{1}{0.55 + 0.0333}\right) = \underline{\underline{1.71}} = \frac{t_{original}}{t_{enhanced}}$$

### "Reverse" Multiple Enhancement Amdahl's Law

Amdahl's law assumes that the fraction given refer to original execution time.

→ If for enhancement  $S_i$  the fraction  $F_i$  it affects is given as fraction of resulting execution time after the enhancement were applied

$$\begin{aligned} \text{Speedup} &= \frac{(1 - \sum F_i) + \sum F_i S_i \times \text{Resulting execution time}}{\text{Resulting execution time}} \\ &= \frac{(1 - \sum F_i) + \sum F_i S_i}{1} \end{aligned}$$

e.g. → Previous data → is used.  
then

$$\begin{aligned} \text{Speedup} &= (1 - 0.2 - 0.15 - 0.1) + 0.2 \times 10 + 0.15 \times 15 + 0.1 \times 30 \\ &= 0.55 + 2 + 2.25 + 3 \\ &= 7.8 \quad \checkmark \end{aligned}$$