

Each hardware implementation of a thread,  
is a logical CPU.

Eight great Ideas in Computer Architecture:-

- Design for Moore's Law.
- Use abstraction to simplify design.
- Make the common case fast.
- Performance via pipelining.
- Performance via parallelism.
- Hierarchy of memories.
- Dependability via redundancy.

Response Time: The time required to give response by CPU.

Throughput: Total work done per unit time.

Response time and throughput are affected by -  
changing processor of higher version.

$$\text{Performance} = \frac{1}{\text{Execution Time.}}$$

$$ET = N \times CPI \times T.$$

$$ET = IC \times CPI \times T$$

$$ET = \frac{N \times CPI}{F.}$$

### CPI Example:-

Q. Comp. A: Cycle Time = 250 ps, CPI = 2.0

Comp. B: Cycle Time = 500 ps, CPI = 1.2  
Same ISA; which computer is faster and by how much?

$$\rightarrow CPU\ Time_A = N \times CPI_A \times T_A$$

$$= N \times 2 \times 250 \times 10^{-12} = 500 \times 10^{-12} \times N.$$

$$CPU\ Time_B = N \times CPI_B \times T_B$$

$$= N \times 1.2 \times 500 \times 10^{-12} = 600 \times 10^{-12} \times N.$$

$$\therefore \frac{ET_B}{ET_A} = \frac{600 \times 10^{-12} \times N}{500 \times 10^{-12} \times N} = \frac{6}{5} = 1.2.$$

$\therefore$  Computer A is 1.2 times faster than Computer B.

□ If different instruction classes take different no. of cycles,

$$\therefore \underline{\underline{Clock\ Cycles}} = \sum_{i=1}^n (CPI_i \times InstructionCount_i)$$

□ Weighted average CPI:

$$CPI = \frac{Clock\ Cycles}{Total\ Instruction\ Count} = \frac{\sum_{i=1}^n (CPI_i \times \frac{InstructionCount_i}{Total\ Instruction\ Count})}{}$$

Relative frequency.

Determine the effective CPI, and execution time for the program.

$$\begin{aligned} \text{Total Clock Cycles} &= 45 \times 10^4 \times 1 + 32 \times 10^4 \times 2 + 15 \times 10^4 \times 2 + 3 \times 10^4 \times 2 \\ &= 10^4 (45 + 64 + 30 + 16) \\ &= 155 \times 10^4 \text{ Cycles.} \end{aligned}$$

Q. Alternative compiled code sequence using instruction classes A, B, C.

| Class         | A | B | C |
|---------------|---|---|---|
| CPI for Class | 1 | 2 | 3 |
| IC in seq. 1. | 2 | 1 | 2 |
| IC in seq. 2. | 4 | 1 | 1 |

Seq. 1.

$$\begin{aligned} \text{Clock Cycles} &= 1 \times 2 + 2 \times 1 + 3 \times 2 \\ &= \underline{\underline{10}}. \end{aligned}$$

Seq. 2.

$$\begin{aligned} \text{Clock Cycles} &= 1 \times 4 + 2 \times 1 + 3 \times 1 \\ &= \underline{\underline{9}}. \end{aligned}$$

$$\text{Avg. CPI} = \frac{10}{2+1+2} = \underline{\underline{2}}.$$

$$\text{Avg. CPI} = \frac{9}{4+1+1} = \underline{\underline{1.5}}.$$

Q. A 400 MHz processor was used to execute a benchmark program with the following instruction mix and clock cycle count.

| Instruction Type   | Instruction Count. | Clock cycle. |
|--------------------|--------------------|--------------|
| Integer Arithmetic | 450000             | 1            |
| Data Transfer.     | 320000             | 2            |
| Floating Point     | 150000             | 2            |
| Control Transfer   | 80000              | 2            |



$$\therefore \text{Average CPI} = \frac{155 \times 10^4}{(45+32+55)} \times 10^4 = \frac{155}{100} = \underline{\underline{1.55}}$$

$$ET = TC \times CPI \times T = \frac{TC \times CPI}{F}$$

$$= \frac{10^6 \times 1.55}{4 \times 10^8} = \frac{1.55}{4 \times 10^2} = \frac{15.5}{4} \times 10^{-3}$$

$$T. = 3.875 \text{ ms.}$$

| Instruction type     | CPI. | Instruction mix. |
|----------------------|------|------------------|
| Arithmetic and logic | 1    | 60 %.            |
| Load/Store in Cache  | 2    | 18 %.            |
| Branch               | 4    | 12 %.            |
| Cache Miss           | 8    | 10 %.            |
| Memory reference.    |      |                  |

$$\text{Average CPI} = 1 \times 0.6 + 2 \times 0.18 + 4 \times 0.12 + 8 \times 0.1$$

$$= 2.24$$

Q.  $f = 400 \text{ MHz. } CPI = 1.55. \text{ find MIPS rate.}$

$$\rightarrow \text{MIPS rate} = \frac{f}{CPI \times 10^6} = \frac{4 \times 10^9}{1.55 \times 10^6} = \frac{400}{1.55} = \underline{\underline{258.065}}$$

$$\text{Q. } f = 400 \text{ MHz. } CPI = 2.24. \text{ find MIPS rate.}$$

$$\rightarrow \text{MIPS rate} = \frac{f}{CPI \times 10^6} = \frac{4 \times 10^9}{2.24 \times 10^6} = \frac{200}{2.24} = \underline{\underline{178.57}}$$

$$\therefore \text{MIPS rate} = \frac{f}{CPI_{Av} \times 10^6}$$

CO & A Pg. 56  
William Stallings.

$$\therefore \text{MIPS rate} = \frac{f}{CPI_{Av} \times 10^6}$$

$$\therefore \text{MIPS} = \frac{10^9 \times 4}{10^6 \times (20+20+50+100)} = \frac{10^3 \times 4}{190} = \underline{\underline{21.052}}$$

Q. The execution time in seconds of 4 programs on 3 computers are given below. Assume  $TC = 10^9$ .

| Programs   | Comp. 1. | Comp. 2 | Comp 3 |
|------------|----------|---------|--------|
| Program A  | 1        | 10      | 20     |
| Program B  | 1000     | 100     | 20     |
| Program C  | 500      | 100     | 50     |
| Program D. | 100      | 800     | 100.   |

Consider each Program contributes to 25 %. Find MIPS of all 3 Computer.

$$\text{Comp. 1: } ET_{Av} = 0.25 \times 1 + 0.25 \times 1000 + 0.25 \times 500 + 0.25 \times 100$$

$$\rightarrow 400.25.$$

$$\therefore \text{MIPS} = \left( \frac{400.25}{ET_{\text{Million}}} \right) \times 10^6 = \frac{400.25}{10^3}$$

$$\therefore \text{MIPS} = \frac{10^3}{400.25} = \underline{\underline{2.498}}.$$

$$\text{Comp. 2: } ET_{10^9} = (10+100+1000+500) \times \frac{1}{4} = \frac{1910}{4} = 477.5$$

$$\therefore \text{MIPS} = \frac{1}{ET_{10^9}} = \frac{10^9}{10^6 \times 477.5} = \underline{\underline{2.09}}.$$

Comp. 3:

$$\therefore \text{MIPS} = \frac{10^9 \times 4}{10^6 \times (20+20+50+100)} = \frac{10^3 \times 4}{190} = \underline{\underline{21.052}}$$



▪ Benchmarks:-

▪ Amdahl's Law:-

$$\frac{\text{Execution Time}_{\text{new}}}{\text{Execution Time}_{\text{old}}} = \left[ \frac{1 - \text{Fraction Enhanced}}{\text{Enhanced}} + \frac{\text{Fraction Enhanced}}{\text{Speedup Enhanced}} \right]$$

$$\text{Speedup Overall} = \frac{\text{ET}_{\text{old}}}{\text{ET}_{\text{new}}} = \frac{1}{(1-f) + \left(\frac{f}{S}\right)}$$

$$\therefore \text{Overall Speedup} = \frac{1}{1-f + \frac{f}{S}}$$

Q.  $f=0.5$ ,  $S=1.6$ . find Overall speedup.

$$\rightarrow 0.5 = \frac{1}{1-f+f/S} = \frac{1}{0.5+0.5/1.6} = \underline{\underline{1.2307}}$$

Q.  $f=0.2$ ,  $S=10$ . find O.S.

$$\rightarrow 0.5 = \frac{1}{1-f+f/S} = \frac{1}{0.8+0.2/10} = \underline{\underline{1.2195}}$$

Patterson.  
Pg. 54.

Q. freq. of FP operations = 25 %.

Average CPI of FP operations = 4.0

Avg. CPI of other instructions = 1.33

freq. of FSQRT = 2 %.

(PI of FSQRT = 20).

Compare betn two design Alternatives:

(i) decrease CPI of FSQRT to 2.

(ii) decrease average CPI of all FP to 2.5.

For Use processor performance eqn:

Problem 6:

Q. When parallelizing an application, the ideal speedup is speeding up by the number of processors. This is limited by two things: percentage of appln that can be parallelized and the cost of communication. Amdahl's law takes into account the former but not latter.

1] No. of processors =  $N = 8$ .  $\therefore S = 8N$ .

80% process is parallelizable.  $f = 0.8$ .

$$\therefore 0.5 = \frac{1}{1-f + \frac{f}{S}} = \frac{1}{1-0.8 + \frac{0.8}{N}} = \frac{N}{0.2N+0.8}$$

Q. First process: 80 %.

Second process: 20 %.

1. First process's 40% process is speed up by 2 times.

Find overall speedup:

$$\rightarrow \frac{1}{(1-0.8 \times 0.4) + \frac{0.32}{2}} = \underline{\underline{1.19}}$$

2. Second Process's 99 % Process is speedup by 2 times.

Find overall speedup.

$$\rightarrow \frac{1}{(1-0.99 \times 0.2) + \frac{0.198}{2}} = \frac{1}{1-0.198+0.099} = \underline{\underline{1.1099}}$$

Q.  $S = 10$ . Enhanced mode is used 50% time.

1. Find total Speedup.

$$\rightarrow 0.5 = \frac{\text{ET}_{\text{old}}}{\text{ET}_{\text{new}}} = \frac{50\% + 50 \times 10}{100\%} = \frac{550\%}{100} = \underline{\underline{5.5}}$$

2. What % of original execution time converted to fast mode?

$$\rightarrow \text{let, } x = \text{req. answer.} \quad 1-x = \frac{x}{10} \quad \therefore x = 0.909$$

$x \cdot 1 = 90.91\%$

$\therefore 1-x = \text{remaining time} = 50\% \text{ time.}$

$\therefore \frac{x}{10} = \text{enhanced time} = 50\% \text{ time.}$



• Problem 7:

Unpipelined processor. 1 ns cycle. (clock cycle).  
 ALU Branches Memory operation.  
 4 4 5 cycles.  
 E 40% 20% 40% relative freqns.

0.2 ns overhead is added to clock by processor due to skew.  
 Ignoring latency impact, how much speed up in the instruction execution cycle will be gained in pipelining.

$$\rightarrow S = \frac{ET_{up}}{ET_p} = \frac{1 \text{ ns} \times \left[ \frac{(40+20)}{100} \times 4 + \frac{40}{100} \times 5 \right]}{(1+0.2) \text{ ns}} = \frac{4.4}{1.2} \approx 3.7 \text{ times.}$$

• Problem:

We have single thread machine. 7 ns.

5 break up.

IF : 1 ns. ID : 1.5 ns.

□ Amdahl's Law for Fixed load:-

Degree of Parallelism :- (DOP)

Kai-Hwang.  
 Pg. 134 (108)

Amdahl's Law:

$$S_n = \frac{n}{1 + (n-1)\alpha}$$

Here, system is used either in a pure sequential mode on one processor with a probability "α", or in a fully parallel mode using n processors with a probability 1-α.

Now,

$$w_1 + w_n = \alpha + (1-\alpha) = 1.$$

Workload.

|                |                |                |                |                |                |
|----------------|----------------|----------------|----------------|----------------|----------------|
| W <sub>1</sub> | W <sub>2</sub> | W <sub>3</sub> | W <sub>4</sub> | W <sub>5</sub> | W <sub>6</sub> |
| W <sub>n</sub> |

(a) Fixed Workload.



Execution time.



(b) Decreasing execution time.



As  $n \rightarrow \infty$ ,  $S_n \rightarrow \frac{1}{\alpha}$ .  
 $\therefore S_n$  is upper bounded by  $\frac{1}{\alpha}$ .  
 $\therefore \alpha$  is called as sequential bottleneck of program.

### Fixed Time Speedup:-

$$S_n' = \frac{W_1' + W_n'}{W_1 + W_n} = \frac{\alpha + n(1-\alpha)}{\alpha + (1-\alpha)} = n - \alpha(n-1).$$

Workload.

Execution time.



(a) Variable workload.

$S_n$

$$S_{1024} = 1024 - 1023\alpha.$$

Fixed Memory Speedup:-

$$S_n'' = \frac{W_1'' + W_n''}{W_1 + W_n''} = \frac{W_1 + G(n)W_n}{W_1 + G(n)W_n/n}.$$

Amdahl's Law and Gustafsson's Law are special cases of fixed-memory model.

### Performance Summary

The big picture:-

William Stallings  
Pg. 65

#### Central Tendencies of Data :-

- Mean (Average)
- Median. (Middle of sorted array)
- Mode. (Datapoint with highest frequency).

④ von Neuman Architecture.

- Memory
- MAR
- MBR

0.1 1.1 2.1 ...  $\alpha$

## Basic Instruction Cycle.

Fetch cycle      Execution cycle.



- Fetch Cycle :-

- ④ Control signals → Hardware

- Micro-program

## □ Instruction Cycle State Diagram :-



④ Intel and AMD, fixed length Instruction or Variable length Instruction?  
 RISC :- Reduced Instruction Set Computer.  
 CISC :- Complex Instruction set Computer.

## □ Machine Instruction Characteristics:-

- Types of Instructions, based on no. of addresses:-

1. 3 address instruction.

OP  $\overline{d}$ ,  $\overline{s}$ ,  $\overline{\text{Operands.}}$

2. 2 address instruction.

OP  $\overline{d}$ ,  $\overline{s}$

3. 1 Address instruction.

OP  $\overline{s}$

4. 0 Address instruction.

OP.

$$Z = K + B * C.$$

|                |            |                                 |
|----------------|------------|---------------------------------|
| eg. 3 Address. | 2 Address. | 1. Address.                     |
| MPY T, B, C.   | MOV R, B   | 0 Address.                      |
| ADD Z, T, K.   | LOAD C     | First convert it into Post fix. |
| MOV Z, R.      | MPY B,     | $K B C * +$ .                   |
| - I/O          | PUSH K     | PUSH B                          |
| - Timer        | ADD K.     | STOR Z.                         |
| - Program,etc. | MUL        | PUSH C                          |
|                | ADD        |                                 |
|                | POP Z.     |                                 |

- ④ Don't change the Values of the registers, pop Z.

eg.

$$Y = \frac{A-B}{C+(D \times E)}$$

3-address → 4 instructions.

2-address → 6 instructions.

1-address → 8 instructions.

0-address → 10 instructions.

Postfix:

$$AB-CDE^{++} \\ 10 \text{ instructions.} \quad \begin{array}{l} AB-CDE^{++} \\ (\text{no. of operators} + \text{no. of operands}) \\ + j \text{ (Pop the answer)} \end{array}$$

$$\text{eg. } Z = (((P - (Q+S)*T)/(U/(V-W)))$$

3-address:

ADD Z, Q, S.

MPY Z, Z, T.

SUB Z, P, Z.

SUB R, V, W.

DIV R<sub>2</sub>, U, R<sub>1</sub>

DIV Z, Z, R<sub>2</sub>

6-instructions.

2-address:

MOV R<sub>1</sub>, S

ADD R<sub>1</sub>, Q.

MUL R<sub>1</sub>, T.

MOV Z, P

SUB Z, R<sub>1</sub>

MOV R<sub>1</sub>, V

SUB R<sub>1</sub>, W.

MOV R<sub>3</sub>, U.

DIV R<sub>3</sub>, R<sub>1</sub>

DIV Z, R<sub>3</sub>

10-instructions.

1-address:

LOAD Q

ADD S

MPY T,

STOR Z.

LOAD P

SUB Z

STOR Z,

LOAD V

SUB W.

STOR R<sub>2</sub>

LOAD U

DIV R<sub>2</sub>

STOR R<sub>2</sub>

LOAD Z

DIV R<sub>2</sub>

75-instructions

## Expanding Opcodes :-

When fixed length instructions are used, then smaller length opcodes (without operands) may waste the memory. (bits).

The idea of expanding opcodes is to make some opcodes short, but have a means to provide longer ones when needed.

When opcode is short, a lot of bits are left to hold operands.

If there is no operands (such as Halt), all bits can be used for opcode.

eg. 16 Registers of 16 bits.

→ 14 of type 3 address instruction.

0000 — — — 3 → 14.

1101 — — — 3 → 14.

→ 30 of type 2 address instruction.

111 00000 — — 3 → 30.

111 11101 — — 3 → 30.

→ 28 of type 31 address instruction.

11111110 00000 — 3 → 28.

11111111 1011 — 3 → 28.



→ 64 of type 0 address instruction.

1111 1111 11 00 0000 }  
1111 1111 11 11 1111 }

64.

By solving this all, we can add to get total no. of address:-

$$\begin{array}{l} \text{Type 3: } 14 \times 2^{12} \\ \text{Type 2: } 30 \times 2^8 \\ \text{Type 1: } 28 \times 2^4 \\ \text{Type 0: } 64 \end{array} \left. \begin{array}{l} \text{Sum = } 65,536 \\ \text{= } 2^{16} \end{array} \right\} \text{(size of Register)}$$

• Machine with 16-bit instructions and 16 registers.

Instruction format can have several structures:-

- Opcode + Memory address (such as MARIE):

4 KB bytes addressable space

Opcode.      Address.

- Opcode + Register Address.

We need 4 bits to select one of 16 registers.

Opcode.      Address1      Address2      Address.

example 2:-

15 instructions with 3 address.

14 instructions with 2 address.

31 instructions with 1 address.

16 instructions with 0 address.

Can we encode this instruction set in 16 bits?

$$\text{Sum} = 15 \times 2^{12} + 14 \times 2^8 + 31 \times 2^4 + 16$$

$$\text{Sum} = 65,536. = 2^{16}$$

Yes, we can encode it in 16 bits.

0000 — — — } 15 3-address codes.  
1110 — — — }

1111 0000 — — } 14 2-address codes.  
1111 1101 — — }

1111 1110 0000 — } 31 1-address codes.  
1111 1111 1110 — }

1111 1111 1111 0000 } 16 0-address codes.  
1111 1111 1111 1111 }

Escape Code.

example 3:- Is it possible to design an expanding opcode to allow the following to encode with a 12-bit instruction? Assume a register operand requires 3 bits.

→ 4 instructions with 3 registers

→ 255 instructions with 1 register.

→ 16 instructions with 0 registers.

$$\rightarrow 4 \times 2^9 + 255 \times 2^3 + 16 \times 1 = 4,104. \quad 2^{12} = 4096.$$



- Q. Assume a CPU having 16 registers and instruction length is 16 bit. It should be having:
- 14 type 3 address instructions.
  - 31 type 2 address instructions.
  - 12 type 1 address instructions.

Calculate max. no. type 0 address instructions it should have? Show appropriate encoding.

$$14 \times 2^8 + 31 \times 2^4 + 12 \times 2^4 + x = 2^{16}$$

$$\therefore x = 64.$$

0000 — — — } 14  
1101 — — — } 3 Address instruction.

1110 0000 — — } 31 Address instruction.

1111 1110 — — }

1111 1111 0000 — } 12 1 Address instruction.

1111 1111 1011 — }

1111 1111 1100 0000 } 64 0 Address instruction.

1111 1111 1111 1111 }

Escape Code.

$$\text{Address size} = \log_2(16) = 4$$

- Q. Is it possible to design an expanding opcode to allow the following to be encoded with a 12-bit instruction? Assume a register operand requires 3 bits.
- 7 instructions: 2 15-bit addr and 1 3-bit reg.
  - 500 instructions: - 1 15-bit addr and 1 3-bit reg.
  - 50 instructions: with no addresses or regs.

$$7 \times 2^{15} \times 2^{15} \times 2^3 = 7 \times 2^{33}$$

$$500 \times 2^{15} \times 2^3 = 500 \times 2^{18}$$

50 patterns.

$$\text{Sum} = 6.026 \times 10^9 \text{ bit patterns.}$$

With 36 bit patterns,

$$2^{36} = 6.87 \times 10^9.$$

So we need 36 bit patterns to create opcodes.

- Q. A processor has 16-bit instruction length, 8-registers, and memory address of 8-bit. It is possible to have following types of instructions?

1. 3 no. of type that have 2 register fields & memory addr.

→ Yes! Possible. 8 registers → 3 bits.

$$3 \times 2 = 6 \text{ bits.}$$

∴ remaining 2 bits.  $2^2 = 4$  possible.  
8 bits for memory addr. ∴ 3 are also possible.

2. 6 no. of instructions of type 1 reg. field and 1 memory addr.

→ 11'000' — —

11'101' — —



3. 6 no. of instructions of type that have all 3 reg. fields.

→ 11 11 000 - - -

11 11 101 - - -

4. 3 no. of instructions with 1 memory address.

→ 11 11 11 00 - - -

11 11 11 10 - - -

5. 7 no. of instructions with type that have 2 reg. fields.

→ 11 11 11 11 10 - - -  
8 + Not possible +  $3 \times 2$

14 → Can't represent 7 instructions using  
Only 3 are possible.

6. 7 no. of instructions of type that have only 1 reg. field.

→ 11 11 11 11 11 00 - - -

7 can be represented.

7. 8 no. of instructions of no. addr. no. reg. type.

→ 11 11 11 11 11 11 000

11 11 11 11 11 11 111

8 can be represented.

Control :-

Hardware based

Control

Micro-program based  
Control (Firmware).

• Pipeline Implementation :-

|    |    |    |   |     |
|----|----|----|---|-----|
| IF | ID | IE | M | WB. |
|----|----|----|---|-----|

□ There are 4 particular subcycles :-

1. Instruction cycle.
2. Indirect cycle.
3. Execution cycle.
4. Interrupt cycle.

Control Unit:

Data path ← → Control Path.

• Control Signals trigger events in the processor.

• Control Unit generates control signals.

• Micro-operations :-

It will follow series of instructions.  
Micro → simple and will do a small task.



### Instruction Fetch (IF cycle) :-

It occurs at the beginning of each Instruction cycle, and causes execution of fetch micro-operations.

4 Registers are involved in Fetch:-

- MAR (Memory Address Register)
- MBR (Memory Buffer Register)
- PC (Program Counter)
- IR (Instruction Register).

### Grouping of Micro-operations:-

Can't refer to same memory/register in same cycle.

### Indirect Instruction:-

Fetch source operands.  
This cycle is used to fetch the values of operands, when using indirect addressing.  
If direct addressing is used, this is skipped.

### Interrupt Cycle:-

The PC is stored in MBR. Interrupt address is loaded in MAR. PC is changed.  
ISR is executed.

The original PC is restored. These tasks are also divided into micro-operations.



- Cores in CPU are called Latency Core. (Latency oriented) They deal with latency.
- Cores in GPU are called Throughput oriented.



CPU



### □ Control Unit Functional Requirements:-

Three step process to lead to a characterization of the Control Unit:-

- Define basic elements of processor.
- Describe micro-operations processor performs.
- Determine the functions that the control unit must perform to cause the micro-operations to be performed.

The Control unit performs two basic operations:-

- Sequencing Micro-operations sequencing.

- Execution Micro-operations execution.

- ④ Control Unit generates Control Signals which govern all these things.



### ◊ Control Signals Governs :-

- Processor (ALU)
- Memory / I/O / External things
- Datopath.

### □ Fetch Cycle (From previous) :-

To do this, different control signals are used.  
Then read from Memory to MBR.  
 $PC \rightarrow MAR$ .

- ⊗ Control Signals governs the execution of sequence of micro-operations. ⊗



### □ Building a Data Path :-

- Elements that process data and addresses in CPU Registers, ALU, Multiplexer, etc.

- We will be using RISC-V

$PC = PC + 4$  → Because each instruction is of 4 byte, same length RISC Archi. ( $L_1$  and  $L_2$  cache is split). ( $L_1$  and  $L_2$  is unified).

## □ R-Format Instructions:-

## Load-Store Instructions:-

## Branch Instructions:-

- Compare Operands.

ALU subtracts and check zero output.

PERF. (Linux utility).

what it allows to do/perform.

- cache misses are more in column major (in intel).  
are less in row major

For Ryzen, it is opposite.

## □ Full Data Path Diagram:-

## • Five-step sequence of actions to fetch and execute an instruction:-

1. Fetch an instruction and increment the program counter.
2. Decode the instruction and read registers from the register file.
3. Perform ALU operation.
4. Read and write memory data if the instruction involves a memory operand.
5. Write the result into the destination register, if needed.

## □ Basic Processing Unit:-

Stages:-

1

Instruction Fetch.

2

Source Registers

3

ALU

4

Memory Access.

5.

Destination Register.

IF

ID

IE

M.

WB.



Five Active temporary resistors, which are always Active.

e.g. RA, RB, RM, RY, RZ.

Each Physical Core Supports 2 Hardware threads by Implementation.

L1 Cache is Split into I and D, which is private to each physical core.

L2 Cache is private to each physical core, Unified (not split).

L3 Cache is Shared Cache. Unified.

What is block size, and who decides it?



If Cache Miss Occurs at L1, then it will check in TLB, and then will check in L2 Cache, then in L3 and then the request will go off-chip, to RAM.

Branch-if-[R5]=[R6] Loop.

1. Memory address  $\leftarrow [PC]$ , Read Memory, IR  $\leftarrow$  Memory data.  
 $PC \leftarrow [PC] + 4$ .

2. Decode Instruction. RA  $\leftarrow [R5]$ , RB  $\leftarrow [R6]$ .

3. Compare [RA] to [RB]. If [RA] = [RB], then  
 $PC \leftarrow [PC] + \text{Branch offset}$ .

4. No Action.

5. No action.

Call-Register RG.

1. Memory address  $\leftarrow [PC]$ , Read Memory, IR  $\leftarrow$  Memory data,  $PC \leftarrow [PC] + 4$ .

2. Decode instruction, RA  $\leftarrow [RG]$ .

3. PC-Temp  $\leftarrow [PC]$ ,  $PC \leftarrow [RA]$

4. RY  $\leftarrow [PC\text{-Temp}]$ .

5. Register Link  $\leftarrow [RY]$ .



# MFC - Memory Function Control.



- Processor-Memory Interface
- Control Signals for Instruction address generator.



## Processor-Memory Interface.



## Instruction Address Generator.



## Two Methods to Generate Control Signals:-

### Hardwired Control

- Circuit is printed/fixed.
- Can't be altered.

### Micro-program Control

- More Generalized.
  - Can be altered.
- Setting of Control Signals depends on:-
- Contents of the Step Counter.
  - Contents of the instruction register.
  - The result of a computation or a comparison operation.
  - External input signals, such as interrupt requests.



### CTSC-style Processor Organization



As per instruction in IR, there is a need of getting sequencing of appropriate signals. It can be done by two approaches:

- (i) Hardware Control.
- (ii) Micro-program Control.

## Two Methods to Generate Control Signals:-

### Hardwired Control

- Circuit is printed/fixed.
- Can't be altered.

### Micro-program Control

- More Generalized.
- Can be altered.



As per instruction in IR, there is a need of getting sequencing of appropriate signals. It can be done by two approaches:

- (i) Hardware Control.
- (ii) Micro-program Control.



Let us consider, 3 instruction steps:

INSTR1: Add  $R_1, R_0$ .

Steps to Execute: [I]  $R_1$  out,  $X$  in.

[II]  $R_0$  out, Add.

[III]  $Z$  out,  $R_1$  in

[IV] End.

INSTR2: XOR  $R_0, R_5$

[I]  $R_0$  out,  $X$  in.

[II]  $R_5$  out, XOR

[III]  $Z$  out,  $R_0$  in

[IV] End.

INSTR3: Add  $[R_0], R_1$ .

[I]  $R_0$  out, MAR in, MAR out, MBR in.

[II] MBR out,  $X$  in.

[III]  $R_0$  out, ADD

[IV]  $Z$  out, MBR in, MBR out, MAR out, RD.

[V] End.

$$R_0 \text{ out} = ((\text{INSTR1 AND } T_2)) \text{ OR}$$

When INSTR1 and  $T_2$  means  $R_0$  out should be made 1 if there is INSTR1 and second clock cycle.

$$\text{Add} = (\text{INSTR1 and } T_2) \text{ OR}(\text{INSTR3 and } T_3) \text{ OR} (\dots)$$

:

$$\text{end} = (\text{INSTR1 and } T_4) \text{ OR}(\text{INSTR2 and } T_4) \text{ OR} \dots$$

### Timing Marker (TM) :-

The AND combination is for activation of particular signal in context of signal instruction.

Here, for every instruction, there is one particular gate and giving to each AND gate one combination timing marker and instruction.

And all the AND gates are ORed. It gives  $R_0$  out.



We require two types of signal for such an activity:

(i) Time Marker.

(ii) Instruction Marker.



Control Unit with Decoded Inputs:



Scanned with OKEN Scanner

- Different instruction with different steps to execute means naturally different no. of clock cycles.

So if almost 8 bits instruction, we can choose 3 bit counter.

If 8 steps → 3 bit counter.

16 steps → 4 bit counter.

|    |                  |                                            |
|----|------------------|--------------------------------------------|
| 00 | T <sub>1</sub>   | Like wise, we can have similar pattern for |
| 01 | T <sub>2</sub>   | 3 bits, 4 bits, and so on.                 |
| 10 | T <sub>3</sub>   |                                            |
| 11 | T <sub>4</sub> . |                                            |

End signal connected to n bit counter that indicates if instr. is completed and set n bit counter to 0000.

After the prev. step, now IR is loaded with new instr. and counter starts counting the clock again from start.

The approach discussed in hardware based control unit design lasts till we throw the design or scrap the design.

No flexibility for changes as it is permanently implemented in hardware architecture.

### Diagram: Typical Microinstruction Formats.

### Horizontal micro-instruction:



### 3) Vertical Micro-instruction:



- Control Memory.
- Control Unit Micro-architecture.

# □ Arithmetic for Computers:-

- Integer Addition.
- Integer Subtraction.

## ② ALU (Arithmetic and Logical Unit) :-

### - Integer Representation.

- Sign-magnitude Representation. (two 0's).
- Two's Complement Representation. (range:  $-2^{n-1}$  to  $2^{n-1}$ ).  
To expand the bits, repeat the signed bit to the left.  
To subtract, A-B, Add two's complement of B to A.

### Alternative Representations:-

| Decimal | Sign-magnitude | Two's Complement | Biased |
|---------|----------------|------------------|--------|
| +8      | 01000          | 01000            | 11111  |
| +7      | 01001          | 01001            | 11110  |
| +6      | 01010          | 01010            | 11011  |
| +5      | 01011          | 01011            | 11000  |
| +4      | 01100          | 01100            | 10111  |
| +3      | 01101          | 01101            | 10110  |
| +2      | 01110          | 01110            | 10101  |
| +1      | 01111          | 01111            | 10100  |
| 0       | 00000          | 00000            | 00000  |
| -0      | 10000          | 10000            | -      |
| -1      | 10001          | 11111            | 01100  |
| -2      | 10010          | 11110            | 01011  |
| -3      | 10011          | 11101            | 01000  |
| -4      | 10100          | 11011            | 00111  |
| -5      | 10101          | 11010            | 00100  |
| -6      | 10110          | 10111            | 00010  |
| -7      | 10111          | 10110            | 00001  |
| -8      | -              | 10001            | 00000  |

eg.  $\begin{array}{|c|c|c|c|c|c|c|c|} \hline & -128 & 64 & 32 & 16 & 8 & 4 & 2 & 1 \\ \hline & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 \\ \hline \end{array}$

$$-128 + 8 = -120.$$

New trick.

### □ Range Extension:

- For sign-magnitude,  
→ shift the sign-bit leftmost, add 0's.
- For two's complement  
→ Copy the sign-bit to all the left digits.
- Negation:  
Negate all bits (1's complement). [Don't change the first bit from right, change all the leftmost bits].

### • Special Case :- (i) (-128 for 8 bits).

$$-(128) \approx -128.$$

(ii)  $00000000 \rightarrow 1's \text{ complement } 11111111 \approx 100000000$   
 $\therefore -0 \equiv 0.$

### □ Overflow rule:-

If two nos have same sign, and their result has opposite sign, then overflow occurs.





$SW = \text{Switch (addition or subtraction)}$

$OF = \text{Overflow bit}$

$$S_i = x_i \oplus y_i \oplus C_i$$

$$C_{i+1} = x_i y_i + x_i C_i + y_i C_i$$



Binary addition / subtraction logic.

Carry look ahead :-

$$S_i = (x_i \oplus y_i) \oplus C_i$$

$$C_{i+1} = x_i y_i + (x_i + y_i) C_i \approx x_i y_i + (x_i \oplus y_i) C_i$$

$$G_i = x_i y_i \quad P = x_i \oplus y_i$$

Multiplication :- [Add and shift]. { Unsigned int }

$$\begin{array}{r}
 & 1 & 0 & 1 & 1 \\
 \times & 1 & 1 & 0 & 1 \\
 \hline
 & 1 & 0 & 1 & 1 \\
 & 0 & 0 & 0 & 0 \\
 & 1 & 0 & 1 & 1 \\
 \hline
 & 1 & 0 & 0 & 1 & 1 & 1
 \end{array}$$



$M = \text{Multiplicand.}$   
 $Q = \text{Multiplier.}$



$(-7) \times (f_3)$ .

$$\begin{array}{r} 7 = 0111 \\ 3 = 0011 \\ \hline & X 1001 \end{array}$$

$\therefore$  Add and shift will not work for signed integer.

$$\begin{array}{r} 000110110110 \\ \hline 000110110110 \end{array}$$

□ Booth's Algorithm :-

Division.