

## System Buses (Connection lines b/w two devices with comp. system)

(\*) Address Bus

(\*) Data Bus

(\*) Control Bus



Address Bus: unidirectional,  
means only CPU can  
generate

Data bus :: Bi-directional  
(duplex)

Control Bus: bidirectional :: every  
individual line is unidirectional  
Interrupt signal by I/O → CPU

Mem can generate two

type of signal for CPU

→ wait and → Ready

(CPU is operating at very

fast speed but memory  
not operating at that  
speed)



Memory Cycle Time

Time from accepting a

request till memory gets  
ready again to perform  
next operation

CPU Register

\* within CPU are small storage  
units are register

\* while executing prog. inside  
CPU registers are used

→ General Purpose Registers (GPRs)

\*\* ↳ Storing general content  
    ↳ o/p from ALU  
    ↳ ip to ALU  
    ↳ store temp. storage

↗ Special content

## → Special Purpose Registers

1) Accumulator (AC)

2) ~~Program Counter~~

2) Program Counter

3) Instruction Register (IR)

4) Flag Register / program  
status word (PSW)

5) Address Register (AR)/  
Mem. Add. Register (MAR)

6) Data Register (DR) / MDR /  
M. Buffer Regis

⊗ not full list, there are  
are others as well.

Accumulator: (not specific to an  
architecture, generic  
point of view)

→ Used to store result of ALU  
and sometimes to store one of  
the input for ALU



First o/p from ALU to Accumulator  
(AC)

- ✳ AC - base architecture → when one input is taken from AC and another i/p taken from somewhere else

## Types of Architecture

- ✳ Based on input to ALU

### 1) AC - Based



→ instruction  
→ opcode operand  
ex.

$\text{ADD } R_3 \rightarrow AC \leftarrow AC + R_3$

$\text{MUL } R_1 \rightarrow AC \leftarrow AC * R_1$

no need to mention AC  
in this architecture as one  
input is taken from AC.

## 2) Register based architecture



All inputs taken from register

## 3) Register - Mem . Architecture



## 4) Complex System Architecture



Both i/p can  
be taken be  
from either  
Reg Or Mem.

## 5) Stack Based



### Note

① 32-bit architecture



② If result of two i/p's is more than 32 bit then extra bit is stored in carry flag register

③ Similarly for 64 bits architecture  
→ word size = 64 bits

## Program Counter

Store address of next instruction



Q) A CPU has 24-bit instruction. A program starts at address 300 (in decimal). Which of the following is a legal program counter value?

- A) 400                      B) 500
- C) 600                      D) 700

A) 24 bits  $\rightarrow$  3 Bytes

|     |         |
|-----|---------|
|     |         |
| 300 | Instr 1 |
| 303 | Instr 2 |
| 306 | Instr 3 |
| 309 | Instr 4 |

So answer is  
(C) 600  
as its multiple  
of 3

## Instruction Register

Used to store current instruction which CPU is executing.

## Stack pointer

In the memory for execution purpose there should be a stack

### LIFO



→ Used to store address of the top element of stack

## Flag Register (Status Register)





→ 8055 will have diff. Status Register

→ 8086 will have diff.  
and so on.

S = Sign

Z = Zero

\* if ALU result is '0' then

$Z = 1$

if  $ALU \neq 0$  then  $Z = 0$

$$S \begin{cases} 0 & - +ve \\ 1 & - -ve \end{cases}$$

why to set status register?

→ It stores the status of ALU results

→ to check conditions

↳ ex.  $(a > b) \rightarrow$  CPU will

$a - b$  in ALU and

status of Result is +ve

then  $a > b$  condition

is true



MAR : Send address to memory

MDR : send data to memory  
 (memory write) and  
 receive data from  
 memory (mem.read)



Mem.read → CPU sends add to  
 mem using add bus

→ CPU enables read  
 control signal for  
 memory

→ Mem. performs read  
 opr. & sends the data  
 to CPU using data bus

## Mem write

- 1) CPU send add to mem. using add bus
- 2) CPU sends data to mem. using data bus
- 3) CPU sends enabled write signal to mem.
- 4) Mem. performs write opr. on provided add.

## Software Architecture

## CPU Performance



For 1<sup>st</sup> clock

$$F = 1 \text{ GHz} = 10^9 \text{ Hz}$$

$$T = \frac{1}{f} = \frac{1}{10^9} \text{ sec} = 10^{-9} \text{ sec.}$$

For 2<sup>nd</sup> clock

$$f = 500 \text{ MHz} = 500 \times 10^6 \text{ Hz}$$

$$T = \frac{1}{f} = \frac{1}{500 \times 10^6} \text{ sec} \\ = 2 \text{ msec.}$$

So clock 1 is faster

Clock rate  $f$  depends on

→ clock is becoming faster

→ CPU organization is getting better

ex) ADD R1, R2,

→ Fetch PC ~~out~~ MAR.read . . . (1 clock, cycles)

✗

→ . . .

→ . . .

✗ May be on a diff. CPU with diff. organization less no. of micro instructions are required

✗ Speed with which a computer executes programs is affected by the design of its hardware and its machine language instruction.

✗ Since most programs are written in HLL so performance of compiler also affects the performance (that translates programs into machine language).

✗ Compiler + machine instruction set hardware in sync

✗ Processor time: sum of processor time needed to execute the program.

## Processor Clock

$$R = \frac{1}{P}$$

- (\*) Processor circuits are controlled by a timing signal called a 'clock'
- (\*) Clock defines regular time intervals called clock cycles.

(\*) Machine instruction

ADD R1 R2  
 ↳ PC<sub>out</sub> MAR<sub>in</sub> READ WAIT  
 each of → step is executed  
 in one clock cycle.

length P of one clock cycle

~~$R = \frac{1}{P}$~~

Cycles per second → Hz

Million → Mega (M)

Billion → Giga (G)

500 million/sec → 500 MHz

1250 → 1/sec → 1.25 GHz

→  $\frac{1}{500 \text{ MHz}} = 2 \text{ ns}$

→  $\frac{1}{1.25 \text{ GHz}} = 0.8 \text{ ns}$

## Basic Performance Equation

$$T = \frac{N \times S}{R}$$

T : processor time required  
to execute a program  
that has been prepared  
in some HLL

N : Machine language instr.

S : avg. no. of basic steps  
needed to execute  
one machine instruction  
is S, where each basic  
step is completed in  
one clock cycle.

R : clock Rate R / Sec.

$$T = \frac{N \times S}{R}$$

→ whatever process is running

is user mode

→ process running in kernel mode

### Hardwire design

Processor shld have faster

(chp 6 hamacher)

Add A, B

→ executed using faster adder circuit

→ Since memory access is there

so it shld be faster  
(mem access time)

(Fetch), Decode, OF, (Execute, write)

faster

Fetch, Decode, ~~execute~~ Write

→ can be saved by using cache



To execute one step is performed

in one clock cycle → 1 step

1 cycle / instruction

1 CPI

## Instruction Set

Add A R<sub>1</sub> = S = 5

Add A B = S = 6



If 2 GHz (0.5 ms)

for 4 GHz (0.25 ms)

→ Same hardware / mem magnt etc.

↳ F → 2 clock

E → " "

S = 12

⋮

W → 2 "

④ Reduced P but S becomes double  
so T remains same



How to reduce T → better

hardware so P increases

so that each step is also

completed in one cycle only

$$- T = NXS \times P \quad \text{or} \quad P = \frac{T}{N \times S}$$

$$P = \frac{1}{R}$$

$$T = \frac{N \times S}{R}$$

$$\cancel{C = A + B}$$

ISA

P<sub>1</sub> (CISC)  
ADD A B C

(Complex instr.)

P<sub>2</sub>

MOV A R0

MOV B RI

ADD R0 RI

mem(A) + mem(B), mem(C) MOV RI C

RISC  
(Reduced Instruction Set Computer)

(not complex instructions allowed)

$A + B \rightarrow C$

There are 1/2/3 address instruction

ADD A [Accumulator]

ADD A B

ADD A BC

P<sub>3</sub> [one mem. operand is used]

MOV A, R0

ADD B R0

MOV R0, C

P<sub>4</sub> (Single operand)

LOAD A

Add B [accumulator contains A+B]

STORE C

[accumulator  $\rightarrow$  C]  
[load/store used not MOVE]

④ All 4 have diff ISA

P1 - P4, i3, i5, i7  
AMD

④ QOS → CISC

④ Now mostly RISC

→ Which one is better

Hardware, memory, clock rate, cache same for CISC, RISC  
→ P is same

We are concerned for "N"

$$P_1 \rightarrow N = 1$$

$$P_2 \rightarrow N = 4$$

$$P_3 \rightarrow N = 3$$

$$P_4 \rightarrow N = 3$$

① Fetch:

DEC

OFA

OFB

EX

WC

6 steps

③

$$\begin{pmatrix} F \\ D \\ OF \end{pmatrix} S=3$$

NXS

$$= 3 \times 3$$

$$NXS = 6 \times 6$$

$$= 6$$

$$= 9$$

which one is better

$$\Rightarrow \frac{6}{P_1} \text{ vs } \frac{9}{P_2}$$

~~Suppose for P1~~

~~100 (assembly language)~~

$$\text{avg } S = CPI \\ = 6$$

$$100 \times 6$$

for P2

$$N = 300$$

$$S = CPI = 3$$

$$N \times S = 900$$

so CISC is better than RISC

But nowady RISC is preferred with the help of pipelining.

Pipelining

Ex

$P_2 \rightarrow$  Mov      ] Simultaneous  
                  Add  
                  Mov      execution

Sequence of steps

F D - OF - E - W

→ first instr. in first clock cycle

→ next decode



$$\hookrightarrow S = 5$$

For 100 instruction program  
after 5<sup>th</sup> clock cycle

effective value of 'S' = 1  
ex. F D E

F D E

F D E

F D E

non pipeline  $\Rightarrow 10 \times 3$   
pipeline  $\Rightarrow 12$

F D E  
F D E  
F D E

$$\text{Pipeline of CISC} = 100/S = 10\% \approx 100$$

$$\text{RISC} = 300/S = 300 \approx 300$$

→ but we have to see some other aspects as well in pipelining

→ talk about hazards  
 data hazard  
 (Stalls)

→ logical correctness of program is maintained

For CISC

Add A, B, C

Add E, R1

F → D → OF → OF → E → W

• F D - - - OF → E → W

3 Stalls

For RISC (P3)

|   |   |    |    |  |
|---|---|----|----|--|
| F | F | D  | OF |  |
| F | D | OF | E  |  |
| F | D | -  | W  |  |

1 stall

② Easy to handle hazard (stall) in RISC (simpler instruction)

Logical correctness  $\rightarrow$  done by compiler

$\rightarrow$  This is how ISA affects performance

## Superscalar Op



$\rightarrow$  multithreading (Coordinated)

S can be reduced to 1 by using pipelining on thread ~~is~~ and cores

we can reduce S < 1 also using multicore processor

it can be done in RISC  
than in CISC

$$T = \frac{N \times S}{R} = N \times S \times P$$

## Compiler

- Reduce N we need to have stable machine instruction set
- Compiler shld use it in good way.
- Optimizing compiler
  - ↳ reduces  $N \times S$
- No. of cycles of is dependent not only on the choice of instructions but also on the way they are ordered in the program. (Chapter 8)
- Compiler re-arranges program instructions to achieve better performance without affecting the result of computation

- Compiler shld be closely linked to the processor architecture.
- Usually compiler and processor are designed at the same time with both designers in proper communication with each other.
- Ultimate goal → reduce total no. of clock cycles.

## Performance Measurement

- Computer designers use performance estimates to evaluate the effectiveness of new features
- Not only "T" but other features also affect the performance
- For proper comparison standardized programs must be used.

Benchmark: → performance measure is the time it takes a computer to execute a given benchmark

- ⊗ Initially some artificial programs were used for standardizing benchmarks.
  - ↳ but they do not stand in sight standard when compared with real world applications.
- Non-profit organization called System Performance Evaluation Corporation (SPEC) publishes representative application programs for different application domains together with test results for many commercially available ~~softwar~~ computers.

- For general purpose computers a suite of benchmarks programs was selected in 1989
- ↳ modified in 1995 and 2000.

Programs:

- game playing
- compiler and database applications
- numerically intensive programs in astrophysics and quantum chemistry
- Program is run on testing PC and running time on a real PC is measured. (no simulations)
- Program also run of reference PC

SPEC95 → SUN SPARCstation  
10/40

SPEC 2000 → Ultra SPARC 10  
work station,

## SPEC Rating

$$= \frac{\text{Running time on ref. PC}}{\text{Running time on PC under test}}$$

SPEC rating of 50 means  
PC under test is 50 times  
as fast as Ultra SPARC 10  
for a particular test benchmark.

Test run for all programs in  
SPEC Suite

→ Geometric mean of the  
results is computed

SPEC Ratings for i prog

$$= \left( \prod_{i=1}^n \text{SPEC}_i \right)^{1/n}$$

SPEC measure is combination  
of all ~~measures~~ factors affecting  
performance such as compiler, OS, processor,  
memory of system.