

23/06/2017

## Computer Organisation:

# RISC Pipeline: 5 stage pipeline → only index & register addressing mode

- To analyse the implementation issues of a pipeline, let us consider RISC pipeline as a reference model.
- RISC processor supports 5 stage pipeline, used to execute the instructions.
- Different types of instruction possible in a RISC CPU is as follows:

① Data Transfer Instruction: In the RISC CPU, load and store instructions are used as a data transfer instructions to carry the data between the memory and registers respectively.

→ Memory references implemented using index addressing mode.

Syntax :

$$\boxed{\text{Load } r_0 \mid 3(r_1)} \Rightarrow r_0 \leftarrow m[3 + [r_1]]$$

$$\boxed{\text{Store } 3(r_1) \mid r_0} \Rightarrow m[3 + [r_1]] \leftarrow r_0$$

② Data Manipulation Instruction: In the RISC CPU, ALU operations are performed on a register data.

Syntax :

$$\boxed{\text{Add } r_0 \mid r_1 \mid r_2} \Rightarrow r_0 \leftarrow r_1 + r_2$$

③ Transfer of control instruction

Syntax: ④ Unconditional Jump

$$\boxed{\text{Jmp } 2000} \Rightarrow \text{PC: Sequential address replaced with target address}$$

⑤ Conditional Jumps

Syntax:



→ To execute the above set of instructions 5 stage instruction pipeline is used in the RISC CPU.  
→ Different stages present in the pipeline are as follows:

\* Stage 1: Instruction Fetch (IF):

In this stage, CPU reads the instruction from the memory based on the PC simultaneously, PC will be incremented to next instruction address in a sequence.

\* Stage 2: Instruction Decode (ID):

In this stage, two operations are performed:

- ① Decode the instruction
  - ② Access the register file to read the data (operand fetch)
- This stage also contains the comparator circuit to evaluate the branch condition.

\* Stage 3: Execute (EX):

In this stage, ALU operations are performed.

\* Stage 4: Memory Access (MA):

In this stage, memory will be accessed to read or write the data.

\* Stage 5: Write Back (WB):

In this stage, register write operation is performed to store the result.

→ In the RISC CPU, above stages are connected in a pipelined sequence, becomes 5-stage instruction pipeline.





### Structural Dependency:

→ Structural dependency will occur in the pipeline due to a resource conflict.

→ Resource may be a register or memory or function unit (ALU).

Diagram 1

|    | cc1 | cc2 | cc3 | cc4   |
|----|-----|-----|-----|-------|
| I1 | mem | ID  | ALU | (mem) |
| I2 |     | mem | ID  | ALU   |
| I3 |     |     | mem | ID    |
| I4 |     |     |     | (mem) |

Conflict

→ In the above diagram, I1 and I4, both of the instructions are trying to access the same resource (memory) in the same cycle (cc4). This situation is called as resource conflict.

→ Conflict is an unsuccessful operation.

→ To make it successful, we need to keep the I4 instruction into a waiting until the resource becomes available.

→ This waiting creates stalls in the pipeline, i.e., described below.

|    | cc1 | cc2 | cc3 | cc4 | cc5 | cc6 | (cc7) |
|----|-----|-----|-----|-----|-----|-----|-------|
| I1 | mem | ID  | ALU | mem | wB  |     |       |
| I2 |     | mem | ID  | ALU | mem | wB  |       |
| I3 |     |     | mem | ID  | ALU | mem | wB    |
| I4 |     |     |     | /// | /// | /// | mem   |

Diagram 2

7 cycles - 4 IF = 3 stalls

→ To minimize the structural stalls hardware technique is used, i.e., renaming.

→ In diagram 1, I1 instruction accessing the memory in a 4<sup>th</sup> stage of pipeline to read or write the data. Simultaneously, I4 instruction accessing the mem in the 1<sup>st</sup> stage of pipeline to read the instruction in the same cycle of cc4.

→ when the instruction and data, both are present in the same mem, then, the above situation creates conflict.

→ Re-naming mechanism states that divide the mem chip into two - independent modules used to store instruction and data separately called as code memory (CM) and data memory (DM) respectively.

→ Refer the code mem. in the first stage and refer the DM in the 4th stage of a pipeline. So, accessing of these two memories in a same cycle does not creates conflict, i.e., described below:

|    | cc1 | cc2 | cc3 | cc4 | cc5 | cc6 | (cc7) |
|----|-----|-----|-----|-----|-----|-----|-------|
| t1 | CM  | ID  | ALU | DM  | WB  |     |       |
| t2 | CM  | ID  | ALU | DM  | WB  |     |       |
| t3 | CM  | ID  | ALU | DM  | WB  |     |       |
| t4 |     |     | CD  | ALU | DM  |     |       |
| t5 |     |     | CD  | ALU | DM  |     |       |
| t6 |     |     | CD  | ALU | DM  |     |       |
|    |     |     | CD  | ALU | DM  |     |       |

Registers - 7 :  $r_1 = 0$  stalls

# Data dependency:

→ Consider the program segment, where instruction 'j' follows instruction 'i' in a program order.

Reg. I instruc.  
J instruc.



→ Data dependency condition will occur when the instruction 'j' tries to read the data before instruction 'i' writes it.

→ If  $i : ADD r_0, r_1, r_2 \rightarrow r_0 \leftarrow r_1 + r_2$

$j : MUL r_3, r_0, r_4 \rightarrow r_3 \leftarrow r_0 * r_4$

→ When the above instructions are executed in a non-pipeline processor, then data dependency condition does not occur, because,  $I_2$  is executing after  $I_1$ , so,  $I_2$  tries to read the register  $r_0$  data after  $I_1$  writes it. (No data loss)

→ When the above instructions are executed in a pipelined processor, then, data dependency condition will be occurred. Because  $I_2$  is executing alongwith  $I_1$ , so,  $I_2$  tries to read the register  $r_0$  data before  $I_1$  writes it. So,  $I_2$  is overwriting reads the old value from the  $r_0$  (data loss).

|       | cc1 | cc2 | cc3 | cc4 |
|-------|-----|-----|-----|-----|
| $I_1$ | IF  | ID  | EX  |     |
| $I_2$ |     | IF  | ID  |     |
| $I_3$ |     |     | IF  |     |

→ To correct the above problem, we need to stop the access of a dependent data until the previous instruction, (dependent) is completed.

→ For this purpose, some logic will be maintained in the ID stage.

Structure of the logic:

ID Stage, TOMASULO Algorithm:

| S/N. | Function Unit | Destination | Independent source 1 | Inde. src 2 | Dependent src 1 | Dep. src 2 | Status                                  |
|------|---------------|-------------|----------------------|-------------|-----------------|------------|-----------------------------------------|
| I1   | ADD           | r0          | r1                   | r2          | -               | -          | 0 r1-value 6X<br>1 r1-wl stage allocate |
| I2   | MUL           | r3          | -                    | r4          | r0 (21)         | -          | 0 r4-value<br>1 r4-wl stage allocate    |

→ By using the above logic, we can lock the dependent ins. (I2) in the ID stage until the data becomes available.

→ This process creates stalls in the pipeline, i.e., described below.

|    | cc1 | cc2 | cc3                 | cc4 | cc5                      | cc6 | cc7 |
|----|-----|-----|---------------------|-----|--------------------------|-----|-----|
| I1 | IF  | ID  | EX                  | MA  | WB                       |     |     |
| I2 | IF  | IF  | EX (I1)<br>r4-value | MA  | WB                       |     |     |
| I3 |     |     | IF (I1)(I2)         | EX  | MA                       |     |     |
| I4 |     |     |                     | IF  | WB (I3)<br>r0-value feed | IF  |     |

7 cycles - 4 IFs = 3 stalls (Data dependency stalls)

→ To minimize the data dependency stalls, hardware techniques used, i.e., Operand forwarding also known as bypassing/bypassing/clocking.

→ This technique states that we use the buffer between the stages to hold the intermediate outputs. So, the dependent instruction will be accessing the new value from the interface register (buffer) before update the register file.

→ Consider the following program segment.

|                                   |
|-----------------------------------|
| — True Data dependency (Adjacent) |
| I1: ADD r0, r1, r2                |
| I2: SUB r3, r0, r4                |
| I3: MUL r5, r3, r2                |
| I4: DIV r6, r5, r1                |

|                                    |
|------------------------------------|
| --- Data Dependency (Non-adjacent) |
| I1: ADD r0, r1, r2                 |
| I2: SUB r3, r0, r4                 |
| I3: MUL r5, r0, r2                 |
| I4: DIV r6, r5, r1                 |

→ Execution sequence of a program with operand forwarding is as follows:

|    | cc1 | cc2 | cc3                 | cc4 | cc5                      | cc6 |  |
|----|-----|-----|---------------------|-----|--------------------------|-----|--|
| I1 | IF  | ID  | EX                  | MA  | WB                       |     |  |
| I2 | IF  | IF  | EX (I1)<br>r4-value | MA  | WB                       |     |  |
| I3 |     |     | IF (I1)(I2)         | EX  | MA                       |     |  |
| I4 |     |     |                     | IF  | WB (I3)<br>r0-value feed | IF  |  |
| I5 |     |     |                     |     | IF                       |     |  |
| I6 |     |     |                     |     |                          | IF  |  |

Q. Consider the following program code, executed on a RISC pipeline. All the instructions are spending one cycle on all the stages, but load instruction spending 3 cycles on 4th stage of pipeline. How many cycles are required to complete the program code.

I1: Load r0, 3(r1)

I2: Add r2, r0, r2

I3: MUL r4, r3, r2

I4: DIV r5, r4, r2

Solution: RISC Pipeline → (225) [IF, ID, EX, MA, WB]  
→ Operand forwarding is default

|    | cc1 | cc2 | cc3                 | cc4 | cc5 | cc6 | cc7                 | cc8                 | cc9 | cc10 | cc11 |
|----|-----|-----|---------------------|-----|-----|-----|---------------------|---------------------|-----|------|------|
| I1 | IF  | ID  | EX (I1)<br>r4-value | MA  | MA  | MA  | WB                  |                     |     |      |      |
| I2 | IF  | ID  | "                   | "   | "   | "   | EX (I2)<br>r0-value | MA                  | WB  |      |      |
| I3 |     |     | IF                  | "   | "   | "   | ID (I3)<br>r1-value | EX (I3)<br>r2-value | MA  | WB   |      |
| I4 |     |     |                     |     |     |     | IF                  | CD                  | EX  | MA   | WB   |

Total Cycles = 11

- Q. Consider 4 stage pipeline (IF, ID, EX, WB) used to execute the following program where all the instructions are spending 1 cycle on all the stages but DIV spends 4 cycles on EX stage and MUL spends 3 cycles on EX stage. Operand forwarding is used in the pipeline.

- (a) How many cycles are required to complete the program  
 (b) How many cycles are saved using the operand forwarding over w/o operand forwarding.

Code:

|                      |    |    |    |    |
|----------------------|----|----|----|----|
| I1: DIV X0, Y1, Y2   | IF | ID | EX | WB |
| I2: MUL Y3, Y4, Y5   | IF | ID | EX | WB |
| I3: ADD Y6, Y7, Y8   | IF | ID | EX | WB |
| I4: SUB Y9, Y10, Y12 | IF | ID | EX | WB |



### (b) w/o operand forwarding

|    | cc1 | cc2 | cc3 | cc4 | cc5 | cc6 | cc7 | cc8 | cc9 | cc10 | cc11 | cc12 | cc13 | cc14 | cc15 | cc16 |
|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|------|------|------|------|------|------|------|
| I1 | IF  | ID  | FX  | EX  | EX  | EX  | WB  |     |     |      |      |      |      |      |      |      |
| I2 | IF  | ID  | "   | "   | "   | EX  | EX  | EX  | WB  |      |      |      |      |      |      |      |
| I3 | IF  | ID  | "   | "   | "   | ID  | "   | "   | WB  |      |      |      |      |      |      |      |
| I4 | IF  | "   | "   | "   | "   | ID  | EX  | EX  | WB  |      |      |      |      |      |      |      |

Cycles saved = 4

24/06/2017 :

### # Control Dependency:

→ Control dependency will be occurred in the pipeline due to a jump instructions.

→ Consider a program segment

- 1000 : I1 ↓ Falling/fall-through path
- 1001 : I2
- 1002 : I3 (Jump 2000)
- 1003 : I4
- ⋮
- 2000 : B11 ↓ Taken path
- 2001 : B12

Expected o/p sequence

I1 - I2 - I3 - B11 - B12

| PC:1000 | cc1 | cc2 | cc3 | cc4 | cc5 | cc6 | cc7 |
|---------|-----|-----|-----|-----|-----|-----|-----|
| I1      | IF  | ID  | EX  | MA  | WB  |     |     |
| I2      | IF  | ID  | "   | EX  | MA  | WB  |     |
| I3      | IF  | "   | "   | ID  | EX  | MA  |     |
| I4      | IF  | "   | "   | ID  | EX  | MA  |     |
| B11     |     |     |     | TF  | ID  | EX  |     |
| B12     |     |     |     |     | TF  |     |     |

| I1 - I2 - I3 - B11 - B12 |  
 ↓ unwanted

Actual o/p sequence.

- In the above execution sequence, unwanted instructions are overlapping with the jump instruction; so, the program functionality is missing.
- This kind of problem in the pipeline is called as control dependency.
- To correct the above problem, there is a need of inserting the delay slot (NOP) instruction after the jump instruction.
- NOP insertion in the program is called as flush/freeze.
- flush operation creates the stalls in the pipeline, i.e., described below:

#### Modified code

```

1000: I1
1001: I2
1002: I3 (JMP 2000)
1003: NOP
1004: I4
;
2000: BI1
2001: BI2
;

```

| PC:1000        | CC1 | CC2 | CC3 | CC4 | CC5 | CC6 | CC7 |
|----------------|-----|-----|-----|-----|-----|-----|-----|
| I1<br>PC:1001  | ID  | (X) | MA  | WB  |     |     |     |
| I2<br>PC:1002  | IF  | ID  | EX  | MA  | WB  |     |     |
| I3<br>PC:1003  | IF  | ID  | EX  | MA  |     |     |     |
| NOP            |     | IF  | ID  | EX  |     |     |     |
| BI1<br>PC:2001 |     | IF  | ID  | ED  |     |     |     |
| BI2<br>PC:2002 |     |     |     | IF  | ED  |     |     |

O/P → I1 - I2 - I3 - NOP - BI1 - BI2

- No. of stalls created in the pipeline due to a jump instruction is called as branch penalty.
- It depends on the availability of a target address in the pipeline, i.e.,

Expected O/P

I1 - I2 - I3 - BI1 & BI2

At what stage target address is available in the pipeline — 1

NOTE: → RISC pipeline branch penalty is always 1 because target address is available in the second stage of the pipeline.

Q. Consider 8-stage pipeline which overlaps all the instructions except the branch instruction. Means when the instruction is a branch instr. then processor stops the fetching of the following instructions until the target address is available. In this pipeline target address is available in the 7th stage. How many stalls are present in the pipeline. When program contains 30% branch instructions, then, how many stall cycles are created during the execution of a program?



→ To handle the control dependency stalls hardware technique is used i.e., branch prediction buffer. Also known as branch target buffer or loop buffer.

NOTE: → It is a high speed buffer maintained in the first stage of the pipeline, used to hold the predicted target addresses.

→ When the prediction is true, then stalls are not present, otherwise stalls are created.

→ It is not an optimised solution. So, not in the practical systems.

NOTE: → When the question contains pipeline with branch prediction, then, assume that target address is present in prediction, then, assume that target address becomes zero. Otherwise, the first stage, so, branch penalty becomes zero. Otherwise, (w/o branch prediction), then, assume that target address is not present in the first stage. So, branch penalty is calculated based on the target address availability.

NOTE: → To handle the control dependency stalls, one of a software technique is used, i.e., delayed branch.

# Delayed Branch:  
→ It is a compiler technique, so, compiler will rearrange the program if possible or substitute the NOP instruction after the Jump instruction, if not possible to rearrange, to handle the control dependency problem.

Ways:

I000 : I1  
I001 : I2  
I002 : I3 (JMP I000)  
I003 : I4  
I004 : I5  
I005 : I6  
I006 : I7  
I007 : I8

Expected O/P:

I1 - I2 - I3 - BI1 - BI2

Actual:

I1 - I2 - I3 - I4 - BI1 - BI2

COMPILER  
"Re-arrangement":

Code:

Machine generated : I1

address

I3 (JMP BI1)

I2

I4

⋮

BI1

BI2

| PC:I | cc1         | cc2         | cc3                            | cc4          | cc5          |
|------|-------------|-------------|--------------------------------|--------------|--------------|
| I1   | IF<br>PC:I3 | ID          | EX                             | MA           | WB           |
| I3   |             | IF<br>PC:I2 | ID<br>Untord<br>T0C<br>PC: BI1 | EX           | MA           |
| I2   |             |             | IF<br>PC:I4                    | ID           | EX           |
| BI1  |             |             |                                | IF<br>PC:BI2 | ID           |
| BI2  |             |             |                                |              | PC:BI3<br>IF |

O/P:

I1 - I3 - I2 - BI1 - BI2

"Nop-substitution":

Machine : I1  
Generated Address

I2  
I3 (JMP BI1)

NOP

I4

BI1

BI2

| PC:I | cc1         | cc2         | cc3                           | cc4          | cc5          | cc6          |
|------|-------------|-------------|-------------------------------|--------------|--------------|--------------|
| I1   | IF<br>PC:I2 | ID          | EX                            | MA           | WB           |              |
| I2   |             | IF<br>PC:I3 | ID                            | EX           | MA           | WB           |
| I3   |             |             | IF<br>PC:NOP<br>T0C<br>PC:BI1 | EX           | MA           |              |
| I4   |             |             |                               | IF<br>PC:BI2 | ID           | EX           |
| NOP  |             |             |                               |              | IF<br>PC:BI3 | ID           |
| BI1  |             |             |                               |              |              | IF<br>PC:BI2 |
| BI2  |             |             |                               |              |              | IF<br>PC:BI3 |

O/P:

I1 - I2 - I3 - NOP - BI1 - BI2  
Stall

### Instruction Scheduling:

- CPU always executes the program in a sequence called in order execution.
- In the in order execution sequence, if any instruction is dependent, then the remaining instructions are also sharing the stall cycles.
- F.g.  $I_1: ADD(r_0, r_1, r_2)$       In order exec<sup>n</sup> seq.  
 $I_2: SUB(r_3, r_0, r_4)$   
 $I_3: MUL(r_4, r_5, r_6)$   
 $I_4: DIV(r_3, r_7, r_8)$   
 $I_1 - I_2 - I_3 - I_4$   
 stalls

→  $I_2$  is data dependent on  $I_1$ , so,  $I_2$  will be waiting until the  $I_1$  execution is completed. This waiting creates stalls. These stalls are also shared by the  $I_3$  and  $I_4$  instructions, even they are independent.

→ Instruction scheduling concept is used to handle the above situation. It states that:

"Inserting the independent instructions first, to utilise the stalls."

→ It causes out of order (re-order) execution, i.e.,  $I_1 - I_3 - I_4 - I_2$

→ Out of order execution creates 2 more data dependencies in the pipe. line:

① ANTI DATA DEPENDENCY

② OUTPUT DATA DEPENDENCY

| Inst <sup>n</sup> I                               | Inst <sup>n</sup> J    |
|---------------------------------------------------|------------------------|
| Read before write                                 | Write [TRUE DATA LOSS] |
| Write before read                                 | Read [ANTI DATA LOSS]  |
| Write before write                                | Write [OIP DATA LOSS]  |
| before<br>↓<br>after<br>↓<br>delay<br>↓<br>Hazard |                        |

→ Anti dependency will be occurred in the pipeline, when the instruction J tries to write the data before instru<sup>n</sup>.

I writes it.

E.g.:  $I_3$  is executing before  $I_2$ , so,  $I_3$  updates the register  $r_4$  before  $I_2$  reads it. Therefore,  $I_2$  incorrectly gets the new value from the  $r_4$  (Data loss).

→ OIP data dependency, will be occurred when the instruction J tries to write the data before instruction I writes it. E.g.  $I_4$  is executing before  $I_2$ , so,  $I_4$  updates the register  $r_3$  before  $I_2$  writes it. Therefore, destination register  $r_3$  is incorrectly updated with a old value (DATA LOSS).

→ To handle the above dependency problems, h/w technique is used, i.e., Register Re-naming.

→ This technique states that:

\* use the re-order buffer to store the o/p of a set of out of order instructions. Later, update the register file with a re-order buffer contents after the execution of a dependent instruction.

Register Re-Naming

re-order buffer

i.e., Exec<sup>n</sup> seq.

$I_1 \rightarrow r_0$

$I_3 \rightarrow$  re-order buffer

$I_4 \rightarrow$

$I_2 \rightarrow r_3$  (status=1)

↓  
Re-order Buffer → Reg. file  
( $r_3 & r_4$ )

Hazards: Hazard is a delay.

→ Delay is created in the pipeline due to a dependency problem. So, hazard is divided into 3 types:

- ① Structural Hazard
- ② Data Hazard
- ③ Control Hazard

→ Data Hazard is further divided into 3 types, based on the order of the read and write operations, named as:

- ① RAW Hazard created because of RBR
- ② WAR Hazard " " WBR
- ③ WAW Hazard " " WBW

→ RAW Hazard is created when the instruction 'J' tries to read the data before instruction I writes it. (TRUE data dependency)

→ WAR Hazard is created when the instruction 'J' tries to write the data before instruction I reads it (ANTI data dependency)

→ WAW Hazard is created when the instruction 'J' tries to write the data before instruction I writes it. (O/P data dependency)

Instruction J                      Instruction I

|          |                |
|----------|----------------|
| In-reg.  | Out-Reg. [RAW] |
| Out-reg. | In-reg [WAR]   |
| Out-reg  | Out-reg [WAW]  |

Note:  
RAR(Read After Read) is not a hazard because RBR is not a data dependency.

Performance Evaluation with stalls:

$$S = \frac{\text{Average instruction execution time non-pipeline}}{\text{Average instruction execution time pipeline}}$$

$$\begin{aligned} S &= \frac{n}{n} \\ &= \frac{f_n}{f_p} \\ &= \frac{S}{K} \end{aligned}$$

$$S = \frac{CPI_{\text{non-pipe}} \times \text{cycle time}_{\text{non-pipe}}}{CPI_{\text{pipe}} \times \text{cycle time}_{\text{pipe}}}$$

→ Ideal CPI of the pipeline is always 1 (one). But, due to a dependency problem stall cycles are created in the pipeline.

$$\therefore S = \frac{CPI_{\text{non-pipe}} \times \text{cycle time}_{\text{non-pipe}}}{(1 + \text{No. of stalls/instr.}) \times \text{cycle time}_{\text{pipe}}}$$

→ When the pipeline stages are perfectly balanced, then, one task execution time in the non-pipeline is also equal to no. of stages in the pipeline, i.e.,

$$\left[ \begin{array}{l} \therefore t_n = K \text{ cycles} \\ = K \cdot t_p \end{array} \right] \therefore S = \frac{\text{No. of stages in pipeline} \times \text{cycle time}_{\text{pipe}}}{(1 + \text{no. of stage/instr.}) \times \text{cycle time}_{\text{pipe}}}$$

$$S = \frac{\text{No. of stages in pipeline} [\text{Pipeline depth}]}{(1 + \text{No. of stalls/instr.]})}$$

When the system is operating with 100% efficiency then no stalls present, so,

$$S = \text{Pipeline depth}$$

Q. Consider 5 stage pipeline which allows overlapping all the instructions except the branch instructions. Processor stops the fetching of a following instruction after the branch instruction until the instruction is completed. In the program, 40% of instructions are branch instructions. Among them, 60% of instructions are conditional branch. In which, 30% instructions do not satisfy the condition. When the condition is false, then the following instructions are overlapped. Pipeline is operated with a 4 ns clock.

(a) what is the average instruction execution time.

(b) what is the performance gain.

Target address availability:

① Stage number  $\Rightarrow$  Branch Penalty = [Stage no. - 1]

② Stage Name  $\Rightarrow$  BP = [Corresponding Stage no. - 1]

③ Until the first is completed  $\Rightarrow$  AP = [last stage no. - 1]

④ All the ins. are processed with all stage

$$\Rightarrow \text{BP} = \text{last stage number} - 1$$



No. of stalls / instructions (~~0.6 × 0.7~~)

$$+ (0.6 \times 0.4 \times 4)$$

$$+ (0.4 \times 0.6 \times 0.3 \times 0)$$

$$+ (0.4 \times 0.6 \times 0.7 \times 4)$$

$$= 1.312$$

(a) Average Ins. E.T. =  $[1 + \text{No. of stalls/ins.}] \times \text{cycle time}$

$$= (1 + 1.312) \times 4 \text{ ns}$$

$$= 9.248 \text{ ns}$$

(b) Speed up ( $S$ ) =  $\frac{\text{Pipeline depth}}{(1 + \text{No. of stalls/ins.})}$

$$S = \frac{5}{(1 + 1.31)}$$

$$S = 2.16$$

Q. Consider 6 stage pipeline which is operating with a 3 ns clock. Pipeline allows all the instructions except branch ins. Target address of the branch ins. is available in the 4<sup>th</sup> stage. Program contains 40% branch instructions. Among them, 70% of instructions are optimised. What is the avg. ins. exec. time?

$$\text{No. of stalls} = 3$$

stage  
with stalls

$$\text{No. of stalls/ins.} = 0.4 \times 0.3 + 3$$

$$\text{Avg. time} = 1 + 2.36$$

$$\text{Avg. time} = (1 + 0.36) \times 2 \text{ ns}$$

$$= \underline{\underline{2.72 \text{ ns}}}$$



Q. Consider a hypothetical pipeline which allows all the instructions except memory instructions. Penalty associated with a m/m instruction is 7 cycles. Total length of program contains 30% of memory instructions. Pipeline is operating with 3 ns clock. What is the average instruction execution time?

$$\text{No. of stalls/ins.} = 0.3 \times 7 = \cancel{2.1}$$

$$\text{Avg. exec. time} = (1 + 2.1) \times 3$$

$$= \underline{\underline{9.3 \text{ ns}}}$$



Q. A hypothetical pipeline uses the delayed branch technique to handle the control dependency. Compiler substitutes the 4 NOP instructions after the branch instructions to manage the control dependency. Program contains 50% branch ins. Among them, compiler optimises 80% instructions. Pipeline contains 8 stages and operating with 500 MHz clock. What is the average instruc. exec. time?



$$\text{No. of stalls/ins.} = 0.5 \times 0.8 \times 4$$

$$= \cancel{1.6}$$

$$\text{Avg. exec. time} = \frac{1}{5 \times 10^8} \text{ s}$$

$$\text{Avg. exec. time} = (1 + 1.6) \times 2 \text{ ns}$$

$$= \underline{\underline{5.2 \text{ ns}}}$$

Q. Consider the following program code executed on a hypothetical pipeline:

$$\begin{aligned}
 I_1: & r_0 \leftarrow r_1 + r_2 \\
 I_2: & r_1 \leftarrow r_0 + r_3 \\
 I_3: & r_0 \leftarrow r_1 + r_4 \\
 I_4: & r_2 \leftarrow r_3 + r_4 \\
 & r_5: r_1 \leftarrow r_2 - r_4 \\
 & I_6: M[x] \leftarrow r_1
 \end{aligned}$$

*concept: Only time data dependency is considered.*

*fact: Adjacent data dependency is also considered.*

How many RAW, WAR and WAW hazards are possible in the above code.

#### RAW

$$\begin{aligned}
 & I_1 - I_2 [r_0] \\
 & I_2 - I_3 [r_1] \\
 & I_3 - I_4 [r_2] \\
 & I_4 - I_5 [r_2] \\
 & I_5 - I_6 [r_1]
 \end{aligned}$$

#### WAR:

$$\begin{aligned}
 & I_2 - I_5 [r_1] \\
 & I_3 - I_2 [r_0] \\
 & I_4 - I_3 [r_2] \\
 & I_5 - I_4 [r_1] \\
 & I_5 - I_3 [r_1]
 \end{aligned}$$

#### WAW:

$$\begin{aligned}
 & I_2 - I_2 [r_0] \\
 & I_3 - I_1 [r_0] \\
 & I_4 - I_1 [r_2] \\
 & I_5 - I_1 [r_1]
 \end{aligned}$$

Ans. (4, 5, 2)

#### # NON-LINEAR PIPELINE :

→ This pipeline contains forward and backward connections. So, reservation table is used to process the entries.

→ Let us consider, a sample reservation table used to execute the IPs in a hypothetical non-linear pipeline.

|    | 1 | 2 | 3 | 4 | 5 |
|----|---|---|---|---|---|
| S1 | X |   |   |   | X |
| S2 |   | X |   | X |   |
| S3 |   |   | X |   |   |

→ Multiple checkmarks in the row indicates repeated use of same stage in different cycles.

→ Continuous checkmarks in the row indicates extended use of same stage upto some number of cycles.

→ Multiple checkmarks in the column indicates parallel use of different stages in a particular cycle.

#### NOTE:

→ In the linear pipeline, latency is always 1.

→ In the non-linear pipeline, latency will be different (varies), calculated based on the reservation table.

# Latency Analysis: → Latency means time difference between the two successive initiations in the pipeline.

→ It is a positive integer.

→ In the non-linear pipeline, some of the latencies cause conflicts called as forbidden latency (Non-permissible) and some of the latencies do not cause the conflict called as non-forbidden latency (permissible).

→ To detect the forbidden latency, we need to take any the difference between any two checkmarks in the same row.

$$\text{E.g., row 1} \Rightarrow [5-1] = 4$$

$$\text{row 2} \Rightarrow [4-2] = 2$$

→ Based on the forbidden latency, collision vector will be generated.

### Collision vector:

i.e.,  $[C_1 : C_2 \dots C_n : C_1]$

$C_i = 0 \rightarrow$  permissible  
 $C_i = 1 \rightarrow$  non-permissible

Collision vector  
only gives the initial latency.

Collision Vector  $\rightarrow [C_5 \ C_4 \ C_3 \ C_2 \ C_1]$

$\begin{bmatrix} 0 & 1 & 0 & 1 & 0 \end{bmatrix}$

→ Latency sequence is identified based on the collision vector to specify the correct flow of initiation, i.e.,

### ② Latency sequence w.r.t. $C_1$ :

|    | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|----|---|---|---|---|---|---|---|---|---|----|----|----|----|----|
| S1 | 1 | 2 |   | 1 | 2 | 3 | 4 |   | 3 | 4  | 5  |    |    |    |
| S2 | 1 | 2 | 1 | 2 |   | 3 | 4 | 3 | 4 |    |    |    |    |    |
| S3 |   | 1 | 2 |   |   | 3 | 4 |   |   |    |    |    |    |    |

### Latency cycle:

$$(2-1) = 1$$

$$(1, 5), (1, 5), \dots$$

$$(7-2) = 5$$

$$\text{Average latency} = \frac{1+5}{2} = 3$$

$$(8-7) = 1$$

$$(13-8) = 5$$

### ③ Latency sequence w.r.t. $C_3$ :

|    | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|----|---|---|---|---|---|---|---|---|---|----|----|----|----|----|
| S1 | 1 | 2 | 1 | 3 | 2 |   | 4 | 3 | 5 | 4  |    |    |    |    |
| S2 | 1 | 1 | 2 | 2 | 3 | 3 | 4 |   |   |    |    |    |    |    |
| S3 | 1 |   | 2 |   |   | 3 |   | 4 |   |    |    |    |    |    |

### Latency cycle:

$$(4-1) = 3$$

$$3, 3, 3, \dots$$

$$(7-4) = 3$$

$$\text{Average latency} = 3$$

$$(10-7) = 3$$

### ④ Latency sequence w.r.t. $C_5$ :

|    | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|----|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|----|----|
| S1 | 1 |   |   | 1 | 2 | 3 |   |   | 2 | 3  | 4  | 5  |    | 4  | 5  |    |    |
| S2 | 1 | 1 | 1 |   | 2 | 3 | 2 | 3 |   |    |    | 4  | 5  | 4  | 5  |    |    |
| S3 |   | 1 |   |   | 2 |   | 3 |   | 2 | 3  |    |    | 4  | 5  |    |    |    |

### Latency cycle:

$$(6-1) = 5$$

$$(7-6) = 1$$

$$(12-7) = 5$$

$$(13-12) = 1$$

$$(5-1), (5, 1), \dots$$

$$\text{Avg latency} = \frac{5+1}{2} = 3$$

Q. What is MAL? Minimum Avg. latency.

$$\text{Soln} = 3.$$

## MEMORY ORGANISATION

### # Types of Memory Organisation:

→ Based on the style of accessing the system supported memories, memory organisation is divided into two types:

① Simultaneous Access memory organisation

② Hierarchical Access

### # Simultaneous Access:

→ In this organisation, CPU directly communicating with all the levels of memories to access the data in a sequence, i.e., CPU request is initially forwarded Level 1 memory.

→ When there is a miss in level 1 (L1) m/m, then, CPU directly accesses the data from the L2 memory, without copying into L1.

→ When there is a miss in L2 memory, then, CPU directly accesses the data from L3 memory, w/o copying it into L2 and L1 and so on.



Here,  $H_1, H_2, \dots, H_n \rightarrow$  Hit ratios of the respective m/m's

$$\text{Hit ratio} = \frac{\text{No. of hits}}{\text{Total no. of accesses}}$$

$T_1, T_2, \dots, T_n \rightarrow$  Access time of a respective m/m.

→ Amount of time required to access one word data from the memory is called as average access time ( $T_{avg}$ ).

$$T_{avg} = H_1 T_1 + (1-H_1) H_2 T_2 + (1-H_1)(1-H_2) H_3 T_3 + \dots + (1-H_1)(1-H_2)\dots(1-H_{n-1}) H_n T_n$$

→ 1 word  $\overbrace{\hspace{1cm}}$   $T_{avg}$   
No. of words  $\overbrace{\hspace{1cm}}$  1 sec

$$n_{mem} = \frac{1}{T_{avg}} \text{ words/sec} \quad (\text{throughput})$$

### # Hierarchical Access:

→ In this organisation, CPU always accessing the data only from the L1 memory. If data is not in L1 (miss), then, the respective data will be transferred from higher levels to L1 memory.

→ Later, CPU will be accessing the data only from the L1 memory.



$$T_{avg} = H_1 T_1 + (1-H_1) H_2 (T_2 + T_1) + (1-H_1)(1-H_2) H_3 (T_3 + T_2 + T_1) + \dots + (1-H_1)(1-H_2)\dots(1-H_{n-1}) H_n (T_n + T_{n-1} + \dots + T_1)$$

- Prog:
- I<sub>1</sub>: Load (Data in L2 m/m)
  - I<sub>2</sub>:
  - I<sub>3</sub>: Load [I<sub>1</sub>] → reuse the same data
  - I<sub>4</sub>: Load [I<sub>1</sub>]
  - I<sub>5</sub>: Load [I<sub>1</sub>]
  - I<sub>6</sub>:
  - I<sub>7</sub>: Load [I<sub>1</sub>]
  - I<sub>8</sub>: Load [I<sub>1</sub>]
  - I<sub>9</sub>: Load [I<sub>1</sub>]
  - I<sub>10</sub>:

$$P \leq \frac{1}{AT}$$

### Simultaneous

$$I_1 : M_{L_1}, T_2$$

$$I_3 : M_{L_1}, T_2$$

$$I_5 : M_{L_1}, T_2$$

$$I_7 : M_{L_1}, T_2$$

$$I_9 : M_{L_1}, T_2$$

5 times

L2 m/m is accessed

### Hierarchical

$$I_1 : M_{L_1}, (T_2 + T_1)$$

$$I_3 : T_1$$

temporal

$$I_5 : T_1$$

locality

$$I_7 : T_1$$

$$I_9 : T_1$$

5xT<sub>1</sub> + T<sub>2</sub>

- NOTE:
- In the hierarchical memory organisation, CPU will be accessing the higher levels of memory data from the L1 memory.
  - When the application is developed with a reusability concept, then, in this organisation, higher levels memory data is available in the first level itself. This principle is named as Locality of reference.

- It is of two types:
  - ① Temporal locality      ② Spatial locality
- Temporal means same word of data will be ~~refered~~ referred by the CPU again and again in the near future.
- Spatial means adjacent words in the block are referred by the CPU in a sequence.
- In the computer system design, hierarchical memory org. is used to implement the memory system (simultaneous access memory org. is not in practice).

- NOTE:
- When the question contains hierarchy word or hierarchy meaning or cache memory word, then use hierarchical m/m organisation.
  - Otherwise, use simultaneous m/m organisation.

- Q. In a two-level m/m design, Level 1 m/m is 8 times faster than L2 memory and its access time is 30ns less than the average access time. Let L1 access time is 20ns. What is the hit ratio?

$H_2 = 1$  since, data is definitely present in L2.

$$T_{avg} = 50\text{ ns}$$

$$\begin{aligned} T_{avg} &= H_1 T_1 + (1-H_1) H_2 T_2 \\ 50 &= H_1 \cdot 20 + (1-H_1) \cdot 160 \\ 50 &= H_1 \cdot 20 + (1-H_1) \cdot 160 \\ 50 &= 20H_1 + 160 - 160H_1 \\ -110 &= -140H_1 \Rightarrow H_1 = \frac{110}{140} = 0.785 \end{aligned}$$

Q. 3-level m/m org. has the following specifications.

| Level | Access time per word | Block size in words | Hit ratio | Access time per block |
|-------|----------------------|---------------------|-----------|-----------------------|
| 1     | 20ns                 | —                   | 0.7       | $T_1 = 20\text{ ns}$  |
| 2     | 100ns                | 2                   | 0.9       | $T_2 = 200\text{ ns}$ |
| 3     | 200ns                | 4                   | 1         | $T_3 = 800\text{ ns}$ |

If the referred block is not in L1, then transfer it from L3 to L2 or L2 to L1. If not in L2, then transfer it from L3 to L2 or L2 to L1. How long will it take to access the data block?

$$\begin{aligned} T_{avg} &= H_1 T_1 + (1-H_1) H_2 (T_2 + T_1) + (1-H_1)(1-H_2) H_3 (T_3 + T_2 + T_1) \\ &= 0.7 \cdot 20 + (0.3) \cdot 0.9 \cdot (200) + (0.3) \cdot (0.1) \cdot 1 \cdot (800) \\ &= 14 + \cancel{18} + 59.4 + 30.6 \\ &= 104 \text{ ns} \end{aligned}$$

Q. In a 2-level hierarchical m/m design, L1 m/m access time is 40ns and it is referenced by the CPU 80% of time. L2 m/m access time is 200ns. What is the avg. access time of m/m.

$$\begin{aligned} T_{avg} &= H_1 T_1 + (1-H_1) H_2 (T_2 + T_1) \\ &= 0.8 \cdot 40 + (0.2) 1 \cdot (200) \\ &= 32 + 48 \\ &= 80 \text{ ns} \end{aligned}$$

## # Memory Hierarchy Design:

- In the computer design memory system is organised using the hierarchical accessing principle. So, system supported memories are organised into 4 levels.
- According to a hierarchy design, system supported m/m standards is as follows:

| Level : 1                             | 2                 | 3                | 4                  |
|---------------------------------------|-------------------|------------------|--------------------|
| Name : Register                       | (CM) Cache memory | (MM) Main memory | (SM) Secondary m/m |
| Typical Size                          | < 1kB             | < 16MB           | < 16GB             |
| Implementation: Customised Multiparts | SRAM (flip-flop)  | DRAM (Capacitor) | Magnetic           |
| Access Time : (0.25-0.5)              | (0.5-25)          | (80-250)         | 50,00,000          |
| Band width : (20,000 - ) (MByte/sec)  | (5000-10000)      | (1000-5000)      | (20-150)           |
| Managed by Compiler                   | H/w               | O.S.             | O.S.               |
| Backed by CM                          | MM                | SM               | Compact Disc (CD)  |

## Memory Hierarchy Design:



### Bottom-Top Approach:

- ① Capacity is ↓
- ② Access time is ↓
- ③ Performance is ↑
- ④ Cost / bit is ↑
- ⑤ Expensive

25/06/2017  
→ Accessing sequence of system supported memories according to a hierarchy design is as follows.



→ CPU generated memory request is initially referred in the cache memory. When the data is present in cache, then, operation becomes hit. So, hit data will be transferred to CPU in the form of words. Otherwise, operation becomes miss.

→ So, the reference is forwarded to main memory.

→ When the operation is hit in main memory, then the data will be transferred to cache in a form of blocks. Later, cache to CPU in the form of words. Otherwise,

reference will be forwarded to secondary memory.

→ Secondary memory is a final memory in the computer, so, operation is always hit in secondary memory, therefore, hit data is transferred to main memory in the form of pages, main to cache in the form of blocks and cache to CPU in the form of words.

## # Cache Memory: (Imp.) \*\*\*

→ Cache memory is used as an intermediate memory between the CPU and main memory, to hold the image of a main memory data, therefore, CPU will be accessing the main memory data from the cache memory within a less amount of time, hence, speed gap is balanced between the CPU and main memory.



### Design Elements:

- ① Memory Organisation
- ② Mapping Techniques
- ③ Replacement Algorithms
- ④ Updating Techniques
- ⑤ Multi-level Cache Design

## # Memory Organisation:

→ In the memory hierarchy design, data will be transferred between the main memory and cache memory in a blockwise manner. So, both of the memories are organised into blocks based on the block size. So,

$$\text{No. of blocks in m.m.} = \frac{\text{MM size}}{\text{Block size}}$$

$$\text{No. of blocks (lines) in CM} = \frac{\text{CM size}}{\text{Block size}}$$

E.g. Consider 8B cache with 2B block



E.g. Consider 1G BMM with 2B Block.



### Physical Addr. Format:



Q. A hypothetical computer supports 64kB cache memory and 26bit physical address space. Memory is organised into a 64 byte blocks. How many no. of blocks are present in the cache m/m and m/m respectively. Also show addr. formats.

$$\text{No. of blocks in cache} = \frac{64}{2^8} = 2^{10} \text{ blocks}$$

$$\text{No. of blocks in MM} = \frac{2^{26}}{2^6} = 2^{20} \text{ blocks}$$

### Cache m/m address format:



### MM address format:



Q. A hypothetical computer supports 14 bit cache index and 24 bit physical address space. 4 bit word offset is used to access the data from the block. How many no. of blocks are present in the cache memory and main memory respectively.

Also show the addr. formats.

$$\text{No. of blocks in Cache} = \frac{2^{14}}{2^4} = 2^{10} \text{ blocks}$$

$$\text{No. of blocks in MM} = \frac{2^4}{2^4} = 2^{20} \text{ blocks}$$



#### # Mapping Techniques:

→ The process of transferring the data from main mem. to cache memory is called as mapping. During the mapping, data block is transferred to cache alongwith a physical address. So, ~~so, there~~ in the cache memory every line contains 3-fields of information to hold the main memory data, i.e.,



$$\text{Total cache} = \text{Tag mem. size} + \text{Data mem. size}$$

$$\text{Tag mem. size} = \text{No. of lines in CM} * \text{No. of tag bits in the line}$$

$$\text{Data mem. size} = \text{No. of lines in CM} * \text{Block size}$$

→ Cache mem. is designed with 3 kinds of mapping function:

• Direct Mapping

• Associative Mapping

• Set-associative Mapping

#### # Direct Mapping:

→ In this technique, mod function is used to map the data.



#### # Address Binding:



No. of blocks possible in the line, not at the same time  
i.e.  $2^{\text{tag}} = 2^1 = 2$





→ In the mapping process, some of the tag bits are connected to an address logic of cache memory and rest of the bits are stored in a cache controller as a tag. therefore,

$$\begin{aligned} \text{Tag mem. size} &= \text{No. of lines in CM} \times \text{No. of tag bits in the line} \\ &= 4 \times 1 \text{ bit} \\ &= 4 \text{ bits} \end{aligned}$$

→ Explicit replacement techniques are not required in the direct cache design to replace the data because every main memory block is having fixed location in the cache memory.

Q. Consider 32 kB direct mapped cache organised into 32 bytes blocks. Cache memory data is a subset of the 2<sup>32</sup> bytes memory space. In the cache controller each tag is comprising of valid bit and 1 modified bit.

- (a) No. of lines in cache.
- (b) No. of blocks in main memory.
- (c) Address binding
- (d) No. of tag bits in the line
- (e) Tag mem. size in the cache controller.

$$(a) \text{No. of lines} = \frac{2^{15}}{2^5} = 2^{10}$$

$$(b) \text{Blocks in MM} = \frac{2^{32}}{2^5} = 2^{27}$$

(c) Address binding:

| CM  | Tag | Lo | Wo |
|-----|-----|----|----|
| 111 | 10  | 15 |    |

(d) 17 bits

| mm | tag | wo |
|----|-----|----|
| 27 | 5   |    |

$$\text{(e)} \cancel{\text{tag}} \times 2^{10} \times (7 + 1 \text{ valid} + 1 \text{ modified}) \\ = 2^{10} \times 19 \text{ bits} \\ = 19 \text{ bits} \rightarrow \text{Tag directory size}$$

Machine ins. MOV r0, [100]

↓  
CPU generates the mem. request

1001 RD

↓ Direct CM

↓ Address format

Tag Lo Wo

1 2 1



### Limitations:

- In the direct cache design, each line is capable to hold only one tag (block) at a time, so, number of conflict misses will be increased.
- During the program execution, when the CPU frequently refers the multiple blocks, which are mapped into a same cache line, then, the cache blocks are continuously swapping. So, WT ratio will be falling down. This phenomena is called as thrashing, described below.



MM Block ref : B0, B4, B0, B4, B0, B4, ...

B0 - Miss (compulsory miss) :  $0 \bmod 4 = 0$

B4 - Miss (comp. miss) :  $4 \bmod 4 = 0$

B0 - Miss (conflict miss) :  $0 \bmod 4 = 0$

B4 - Miss (conflict miss) :  $4 \bmod 4 = 0$

→ To handle the above problem, alternative cache is required, i.e., associative cache.

→ Associative cache is designed w/o address called as content-addressable memory. In this mapping technique, sequence function is used to map the data, so that any main memory block can be mapped into any cache memory line in a sequence.

→ Address binding is :



### MM Block

B0  
[000]  
B7  
[111]  
B4  
[100]

### Associative MAP

Sequence → Any line  
Sequence → Any line  
Sequence → Any line

### CM Line

Any line  
Any line  
Any line

→ During the mapping process, complete tag is stored in the cache controller because cache memory does not contain the address. Therefore, tag memory size is equal to:

$$\text{Tag m/m size} = \text{No. of lines in CM} \times \text{No. of tag bits in the line}$$

$$\Rightarrow 4 \times 3 \text{ bits} = 12 \text{ bits}$$

→ Explicit replacement techniques are used in the associative cache design to replace the data when the cache is full.

Q. Consider a fully associative cache of 64kB, organised into a 32 bytes block. CPU generates 24 bit address to access the data. In the cache controller each tag is comprising of 1 valid bit, 1 modified bit and 1 replacement bit.

① No. of lines in cache.

② No. of blocks in MM.

③ No. of tag bits in the

④ Address binding

⑤ Tag mem. size

| ① MM. Tag | W0 |
|-----------|----|
| 19        | 5  |

| CM: | Tag | W0 |
|-----|-----|----|
|     | 19  | 5  |

⑥ Tag bits =  $19$

⑦ ~~19~~  $22 \times 2^11$

$$= 4416 \text{ bits}$$

Machine Inst. MOV  $r_0, [1001]$

CPU generates the mem. req. request.



Associative CM

Add. Format



3

1



### # Limitations:

- ① In the associative cache m/m, all the existing tags are compared with a CPU generated tag in parallel, to decide operation as hit or miss. Implementation of a parallel comparison logic becomes very complex and expensive, so, alternative is required, i.e., set associative cache.

### # Set-Associative Cache Design:

→ Set-associative cache is used to compromise the disadvantages in the direct and associative cache designs. In this design, lines are grouped into sets to accommodate more than one block in the set.

N  $\Rightarrow$  No. of lines in CM

P  $\Rightarrow$  No. of lines in set

$$\rightarrow \text{No. of sets (S)} = \frac{N}{P}$$

→ Consider 2-way set-associative cache of 8-byte width with a 2-byte block.

Before design

After organization  
into a block

8B  $\Rightarrow$  CM

$$\text{No. of lines (N)} : \frac{8}{2} = 4$$

$$4 \times 2B = 8B$$

$$\text{No. of sets} = \frac{4}{2} = 2$$

$$2 \times 2 \times 2B = 8B$$

Affine organization into sets.



→ In this cache design, mod function is used to map the data.

### 2-way set-associative:

$$0 \bmod 2 = 0$$

$$1 \bmod 2 = 1$$

$$2 \bmod 2 = 0$$

$$3 \bmod 2 = 1$$

$$4 \bmod 2 = 0$$

$$5 \bmod 2 = 1$$

$$6 \bmod 2 = 0$$

$$7 \bmod 2 = 1$$

$$8 \bmod 2 = 0$$

$$9 \bmod 2 = 1$$

$$10 \bmod 2 = 0$$

$$11 \bmod 2 = 1$$

$$12 \bmod 2 = 0$$

$$13 \bmod 2 = 1$$

$$14 \bmod 2 = 0$$

$$15 \bmod 2 = 1$$

$$16 \bmod 2 = 0$$

$$17 \bmod 2 = 1$$

$$18 \bmod 2 = 0$$

$$19 \bmod 2 = 1$$

$$20 \bmod 2 = 0$$

$$21 \bmod 2 = 1$$

$$22 \bmod 2 = 0$$

$$23 \bmod 2 = 1$$

$$24 \bmod 2 = 0$$

$$25 \bmod 2 = 1$$

$$26 \bmod 2 = 0$$

$$27 \bmod 2 = 1$$

$$28 \bmod 2 = 0$$

$$29 \bmod 2 = 1$$

$$30 \bmod 2 = 0$$

$$31 \bmod 2 = 1$$

$$32 \bmod 2 = 0$$

$$33 \bmod 2 = 1$$

$$34 \bmod 2 = 0$$

$$35 \bmod 2 = 1$$

$$36 \bmod 2 = 0$$

$$37 \bmod 2 = 1$$

$$38 \bmod 2 = 0$$

$$39 \bmod 2 = 1$$

$$40 \bmod 2 = 0$$

$$41 \bmod 2 = 1$$

$$42 \bmod 2 = 0$$

$$43 \bmod 2 = 1$$

$$44 \bmod 2 = 0$$

$$45 \bmod 2 = 1$$

$$46 \bmod 2 = 0$$

$$47 \bmod 2 = 1$$

$$48 \bmod 2 = 0$$

$$49 \bmod 2 = 1$$

$$50 \bmod 2 = 0$$

$$51 \bmod 2 = 1$$

$$52 \bmod 2 = 0$$

$$53 \bmod 2 = 1$$

$$54 \bmod 2 = 0$$

$$55 \bmod 2 = 1$$

$$56 \bmod 2 = 0$$

$$57 \bmod 2 = 1$$

$$58 \bmod 2 = 0$$

$$59 \bmod 2 = 1$$

$$60 \bmod 2 = 0$$

$$61 \bmod 2 = 1$$

$$62 \bmod 2 = 0$$

$$63 \bmod 2 = 1$$

$$64 \bmod 2 = 0$$

$$65 \bmod 2 = 1$$

$$66 \bmod 2 = 0$$

$$67 \bmod 2 = 1$$

$$68 \bmod 2 = 0$$

$$69 \bmod 2 = 1$$

$$70 \bmod 2 = 0$$

$$71 \bmod 2 = 1$$

$$72 \bmod 2 = 0$$

$$73 \bmod 2 = 1$$

$$74 \bmod 2 = 0$$

$$75 \bmod 2 = 1$$

$$76 \bmod 2 = 0$$

$$77 \bmod 2 = 1$$

$$78 \bmod 2 = 0$$

$$79 \bmod 2 = 1$$

$$80 \bmod 2 = 0$$

$$81 \bmod 2 = 1$$

$$82 \bmod 2 = 0$$

$$83 \bmod 2 = 1$$

$$84 \bmod 2 = 0$$

$$85 \bmod 2 = 1$$

$$86 \bmod 2 = 0$$

$$87 \bmod 2 = 1$$

$$88 \bmod 2 = 0$$

$$89 \bmod 2 = 1$$

$$90 \bmod 2 = 0$$

$$91 \bmod 2 = 1$$

$$92 \bmod 2 = 0$$

$$93 \bmod 2 = 1$$

$$94 \bmod 2 = 0$$

$$95 \bmod 2 = 1$$

$$96 \bmod 2 = 0$$

$$97 \bmod 2 = 1$$

$$98 \bmod 2 = 0$$

$$99 \bmod 2 = 1$$

$$100 \bmod 2 = 0$$

$$101 \bmod 2 = 1$$

$$102 \bmod 2 = 0$$

$$103 \bmod 2 = 1$$

$$104 \bmod 2 = 0$$

$$105 \bmod 2 = 1$$

$$106 \bmod 2 = 0$$

$$107 \bmod 2 = 1$$

$$108 \bmod 2 = 0$$

$$109 \bmod 2 = 1$$

$$110 \bmod 2 = 0$$

$$111 \bmod 2 = 1$$

$$112 \bmod 2 = 0$$

$$113 \bmod 2 = 1$$

$$114 \bmod 2 = 0$$

$$115 \bmod 2 = 1$$

$$116 \bmod 2 = 0$$

$$117 \bmod 2 = 1$$

$$118 \bmod 2 = 0$$

$$119 \bmod 2 = 1$$

$$120 \bmod 2 = 0$$

$$121 \bmod 2 = 1$$

$$122 \bmod 2 = 0$$

$$123 \bmod 2 = 1$$

$$124 \bmod 2 = 0$$

$$125 \bmod 2 = 1$$

$$126 \bmod 2 = 0$$

$$127 \bmod 2 = 1$$

$$128 \bmod 2 = 0$$

$$129 \bmod 2 = 1$$

$$130 \bmod 2 = 0$$

$$131 \bmod 2 = 1$$

$$132 \bmod 2 = 0$$

$$133 \bmod 2 = 1$$

$$134 \bmod 2 = 0$$

$$135 \bmod 2 = 1$$

$$136 \bmod 2 = 0$$

$$137 \bmod 2 = 1$$

$$138 \bmod 2 = 0$$

$$139 \bmod 2 = 1$$

$$140 \bmod 2 = 0$$

$$141 \bmod 2 = 1$$

$$142 \bmod 2 = 0$$

$$143 \bmod 2 = 1$$

$$144 \bmod 2 = 0$$

$$145 \bmod 2 = 1$$

$$146 \bmod 2 = 0$$

$$147 \bmod 2 = 1$$

$$148 \bmod 2 = 0$$

$$149 \bmod 2 = 1$$

$$150 \bmod 2 = 0$$

$$151 \bmod 2 = 1$$

$$152 \bmod 2 = 0$$

$$153 \bmod 2 = 1$$

$$154 \bmod 2 = 0$$

$$155 \bmod 2 = 1$$

$$156 \bmod 2 = 0$$

$$157 \bmod 2 = 1$$

$$158 \bmod 2 = 0$$

$$159 \bmod 2 = 1$$

$$160 \bmod 2 = 0$$

$$161 \bmod 2 = 1$$

$$162 \bmod 2 = 0$$

$$163 \bmod 2 = 1$$

$$164 \bmod 2 = 0$$

$$165 \bmod 2 = 1$$

$$166 \bmod 2 = 0$$

$$167 \bmod 2 = 1$$

$$168 \bmod 2 = 0$$

$$169 \bmod 2 = 1$$

$$170 \bmod 2 = 0$$

$$171 \bmod 2 = 1$$

$$172 \bmod 2 = 0$$

$$173 \bmod 2 = 1$$

$$174 \bmod 2 = 0$$

$$175 \bmod 2 = 1$$

$$176 \bmod 2 = 0$$

$$177 \bmod 2 = 1$$

$$178 \bmod 2 = 0$$

$$179 \bmod 2 = 1$$

$$180 \bmod 2 = 0$$

$$181 \bmod 2 = 1$$

$$182 \bmod 2 = 0$$

$$183 \bmod 2 = 1$$

$$184 \bmod 2 = 0$$

$$185 \bmod 2 = 1$$



→ During the mapping, some of the tag bits are connected to the address logic of the cache mem. and rest of the bits are stored in the cache controller as a tag. So,

$$\text{Tag mem. size} = \frac{\text{No. of sets}}{\text{in CM}} \times \frac{\text{No. of lines}}{\text{in the set}} \times \frac{\text{No. of tag bits}}{\text{in the line}}$$

↓                  ↓

S                  P

$$= S \times P \times \text{No. of tag bits}$$

$$= 2 \times 2 \times 2 \text{ bit}$$

$$= 8 \text{ bits}$$

→ This cache is designed with explicit replacement techniques used to replace the data, when the set is full.



### # Address format:



### NOTE:

→ When the no. of sets are approached to 1, then, the cache becomes fully associative.

$$\text{E.g. } 8B \text{ CM, 2B Block} \\ 4\text{-way set associative}$$

$$N = \frac{8}{2} = 4$$

$$P \leq N \\ S \leq 1$$



→ When the no. of sets are approached to N, then, the cache becomes direct.

$$\text{E.g. } 8B \text{ CM, 2B Block}$$

1-way set assoc.

$$N = \frac{8}{2} = 4$$

$$S = \frac{8}{1} = 8$$



### # Replacement Algorithms:

→ Replacement algorithms are used in the cache design to replace the data when the cache is full. Three types of replacement techniques implemented in the cache design:

① Random Replacement

② FIFO

③ LRU

→ In Random Replacement, replace the blocks in the cache randomly. So, performance is affected.

→ In FIFO, replace the block which is having the longest timestamp.

→ In LRU, replace the block which is present in the cache for the longest time without any reference.

④ Consider 4-block cache, initially empty with the following block references: 4, 5, 7, 12, 4, 5, 13, 4, 5, 7

Identify the hit ratio, using:

⑤ FIFO

⑥ 2-way set associative with LRU

⑦ LRU

| CM   | 4-M  | 5-M | 7-M | 12-M | 4-H | 5-H | 7-H | Hit ratio = $\frac{2}{10} = 0.2$ |
|------|------|-----|-----|------|-----|-----|-----|----------------------------------|
| FIFO | X 13 |     |     |      |     |     |     |                                  |
|      | 3 4  |     |     |      |     |     |     |                                  |
|      | X 5  |     |     |      |     |     |     |                                  |
|      | 4 7  |     |     |      |     |     |     |                                  |

LRU. Place the recent references in the top.

|   |   |   |    |   |   |    |   |   |
|---|---|---|----|---|---|----|---|---|
| 4 | 5 | 7 | 12 | 4 | 5 | 13 | 4 | 5 |
| 4 | 5 | 7 | 12 | 4 | 5 | 13 | 4 | 5 |
| 4 | 5 | 7 | 12 | 4 | 5 | 13 | 4 | 5 |
| 4 | 5 | 7 | 12 | 4 | 5 | 13 | 4 | 5 |
| 4 | 5 | 7 | 12 | 4 | 5 | 13 | 4 | 5 |

$$\text{Hit ratio} = \frac{4}{10} = 0.4$$

### ① Direct Map ( $k \bmod N = i$ )

|   |                 |          |
|---|-----------------|----------|
| 0 | $y \bmod 4 = 0$ | $4 - M$  |
| 1 | $y \bmod 5 = 1$ | $5 - M$  |
| 2 | $y \bmod 3 = 3$ | $2 - M$  |
| 3 | $y \bmod 4 = 0$ | $12 - M$ |

$4 - M$   
 $5 - M$   
 $2 - M$   
 $12 - M$   
 $4 - M$   
 $5 - H$   
 $12 - M$   
 $4 - H$   
 $5 - M$   
 $7 - H$

$$\text{Hit ratio} = \frac{3}{10} = 0.3$$

### ② 2-way set-associative with LRU :

$$\text{No. of sets} = \frac{4}{2} = 2$$

|   |                 |                   |
|---|-----------------|-------------------|
| 0 | $y \bmod 2 = 0$ | $(k \bmod 3 = i)$ |
| 1 | $y \bmod 2 = 1$ | LRU               |

$$\text{Hit ratio} = \frac{4}{10} = 0.4$$

$4 - M ; 4 \bmod 2 = 0$   
 $5 - M ; 5 \bmod 2 = 1$   
 $7 - M ; 7 \bmod 2 = 1$   
 $12 - M ; 12 \bmod 2 = 0$   
 $4 - H$   
 $5 - H$   
 $12 - M$   
 $4 - H$   
 $5 - H$   
 $7 - Miss$

### # Updating Techniques:

- In a memory hierarchy structure CPU, always performs the read and write operations only on a cache memory. In this regard, CPU checks the availability of a data block in the cache memory.
- When the data is present in cache, then the operation becomes read hit or write hit, otherwise operation becomes read miss or write miss. Due to a miss operation, the respective block will be transferred from main memory to cache mem. called as read allocate or write allocate.
- After the allocation, CPU performs read or write operation only on a cache memory.



$B_0 \Rightarrow k \bmod N = i$   
 $[000] \quad 0 \bmod 4 = 0 \Rightarrow \text{line '0'}$

Prog.  
 $I_1 : \text{MOV } R_0, [0001]; B_0 - 1^{\text{st}} \text{ byte} \rightarrow R_0$   
 $I_2 : \text{ADD } R_0, \#23; R_0 + 23 \rightarrow R_0$   
 $I_3 : \text{MOV } [0000], R_0; R_0 \rightarrow B_0 - 0^{\text{th}} \text{ byte}$

### Exception:

$I_1 : \text{Read Hit}; R_0 = 24H$   
 $I_2 : R_0 + 23 \rightarrow R_0 = 47H$   
 $I_3 : \text{Write Hit}; [M:0000] = 24H \rightarrow 47$



→ In the above code, during the execution of a I3 instruction, CPU performs the write operation only on a cache memory. The corresponding block in the main memory is not updated. So, the same address contains different values at different places. This kind of data inconsistency problem in the memory is called as cache coherence.

→ Coherence causes the data loss, i.e., described below:

I4:  $\text{MOV } r0, [0000]_{A4}$  : B4 - 0<sup>th</sup> Byte  $\rightarrow r0$

I5

I6

I7:  $\text{MOV } r0, [0000]_{B0}$  : B0 - 0<sup>th</sup> Byte  $\rightarrow r0$

Execution: Read - allocate

I4: Read miss :  $K \bmod N = i$

$$4 \bmod 4 = 0$$

↓

CM: B0 B4  
↳ [Updated data lost]

→ During the execution of I4 instruction, cache memory data block is replaced with new block. So, updated data is lost. Therefore, at the time of executing I8 instruction old data is transferred to r0 [data loss].

→ To handle the above problem, updating techniques are used in the cache design. These are of two types:

① Write-Through

② Write-Back

# Write-Through: In this technique, simultaneous write operations are performed in both cache memory and main/m/m. So, coherence is not present. Therefore, lower level m/m data, always becomes the subset of a higher level. This principle is called as Inclusion.

\* To hold the inclusion b/w two levels of the memories, we need to design the lower level memory as a write-through memory.





NOTE:  
 → When the question contains cache memory word, write-through and  $H_w=1$ , then use simultaneous memory organisation.  
 Otherwise, [ $H_w \neq 1$ ], use hierarchical organisation.

# Write-Back: In this technique, CPU performs the write operation only in the cache memory. So, coherence is present. This coherence does not cause the data loss when the CPU reads the status of a cache block before replacement.

→ In a write-back cache design, each line contains one extra bit space called as update bit. This bit is set when the cache block is updated otherwise reset.

→ CPU reads the status of a update bit before replacement. When it is 1, then, write-back the block into main m/m and replace the data in the cache memory, otherwise simply replaces the cache block.

Only applicable for Hierarchical Access



→ Read cycle time is  $T_{avgread}$ :

$$T_{avgread} = H_r T_c + (1-H_r) [ \% \text{ dirty} (T_m + T_m + T_c) + \% \text{ clean} (T_m + T_c) ]$$

↓      ↓      ↓      ↓      ↓      ↓      ↓      ↓      ↓      ↓      ↓      ↓      ↓

Read    Read    Read    Write    Read    Read    back allocate data    Read    Read

Hit data    Miss    bits    write    read    data    allocate data    read    allocate data

→ Write cycle time is:

$$T_{avgwrite} = H_w T_c + (1-H_w) [ \% \text{ dirty} (T_m + T_m + T_c) + \% \text{ clean} (T_m + T_c) ]$$

↓      ↓      ↓      ↓      ↓      ↓      ↓      ↓      ↓      ↓      ↓      ↓

Write    write

white    white    white    back allocate data    allocate data    allocate data    allocate data

→ Average access time of the memory, when considering both, read and write operation is:

$$T_{avgwb} = (f_r \times T_{avgread}) + (f_w \times T_{avgwrite})$$

# Multilevel Cache Design:

→ Multilevel cache organisation is used in the computer system to reduce the miss penalty.

→ Amount of the time required to access the data from the higher levels due to a miss operation is called as miss penalty.

→ In this kind of organisation, cache memory size is increased logically by replicating the multi levels of caches to hold the more blocks of information.



→ In the multi-level cache design, L2 cache size is greater than L1 cache, And L1 mem. associativity is greater than level 2 (L2) memory.

→ In this organisation, we can calculate two kinds of miss rates;

- ① Global Miss Rate (GMR)
- ② Local Miss Rate (LMR)

$$GMR = \frac{\text{No. of misses in cache}}{\text{Total no. of CPU generated references}}$$

$$LMR = \frac{\text{No. of misses in cache}}{\text{Total no. of accesses to the cache}}$$

$$GMR_{L1} = \frac{4}{10} = 0.4 \quad | \quad GMR_{L2} = \frac{2}{10} = 0.2$$

$$LMR_{L1} = \frac{4}{10} = 0.4 \quad | \quad LMR_{L2} = \frac{2}{4} = 0.5$$

GMR and LMR of L1 is always same.

→ Average access time of a memory system calculated in terms of hit time, miss rate and miss penalty is as follows:

$$T_{avg} = \text{Hit time}_{L1} + (\text{miss rate}_{L1} \times \text{miss penalty}_{L1})$$

$$\text{Miss penalty}_{L1} = \text{Hit time}_{L2} + (\frac{\text{Miss rate}_{L2} \times \text{Miss Penalty}_{L2}}{\text{Local}})$$

$$\text{Miss penalty}_{L2} = \text{MM access time}$$

→ Average memory stalls created per instruction during the execution of a program is calculated as:

$$\text{Avg. mem. stalls/ins.} = \left( \frac{\text{No. of misses/ins.} \times \text{hit time}_{L2}}{\text{Global}} \right) + \left( \frac{\text{No. of misses/ins.} \times \text{access time}}{\text{Global}} \right)$$

#### # Types of Cache Misses:

① Compulsory Miss (Cold start miss/first reference miss):  
→ This miss will occur when the very first reference block is a miss. Compulsory misses can be reduced by increasing the block size.

② Capacity Miss:  
→ This miss will occur when the capacity is full. These misses can be minimised by increasing the cache size.

③ Conflict Miss / Collision / Inference Miss:  
→ This miss will occur when too many blocks are mapped into a same line or same set. These misses can be reduced by doubling the associativity.

27/06/2017

#### # I/O Organisation:

→ CPU is a electronic component and I/O devices are the electro-magnetic components. So, there is a difference existing in terms of operating mode, word format and data transfer rate.

→ To synchronise the I/O speed with CPU, high-speed interface logic is used named as I/O Interface or IO module.

→ I/O interface is responsible to perform the I/O operations. So, I/O devices are connected to a system bus via I/O interface.



### # Accessing Sequence:

- CPU initializes the I/O interface along with a I/O command. Later, busy with other useful tasks.
- I/O Interface control logic interprets the I/O command and enables the operation.
- Based on the speed of a device, consume the time in completing the operation. Later, transfers the operation status into a interface buffer.
- When the buffer contains the status, then, I/O interface generates the interrupt signal to CPU and waiting for acknowledgement signal.
- After receiving the ACK signal, I/O interface transfers the data to CPU with high speed. Therefore, speed gap is synchronized between the CPU and I/O.
- Different interfaces used in the computer design are as follows:



### # I/O Modes:

- I/O modes are used to transfer the data between the I/O and other components of a computer (CPU, memory).
- 3-different I/O modes are used in the system design, named as:
  - ① Programmed I/O
  - ② Interrupt driven I/O
  - ③ Direct Memory Access (DMA)

### # Programmed I/O :

- In this mode, I/O devices are directly connected to system bus, w/o using a interface. So, CPU takes the responsibility of a I/O operation.
- In this mode, CPU will be blocked during the I/O operation.
- Therefore, CPU utilization is inefficient.
- In this mode, CPU time depends on the speed of the I/O device and the amount of data to be transferred.
- E.g. Consider a system, where 1KBPS I/O device is interfaced to CPU in a programmed I/O mode.



$$1KB - 1sec \\ 1B \rightarrow \frac{1}{10^3} \Rightarrow 10^{-3} s. \\ = 1ms.$$

### # Interrupt Driven I/O :

- In this mode, I/O operations are controlled based on the interrupt signals. So, all the I/O devices are interfaced to CPU via I/O interface chip.
- In this mode, CPU utilization is efficient because CPU does not undergoes idle state during the I/O operation.
- In this mode, CPU time depends on the latency of the interface, rather than the speed of the I/O device.



In the DMA operation, CPU performs the DMA module operations.

At the same time, DMA controller performs the DMA module operations.

As DMA controller has more bandwidth than CPU, it can transfer data faster.

DMA controller receives the DMA request from the DMA module.

When the DMA module receives this signal, DMA module enables the DMA channel of the system bus and mapping for DMA signal.

DMA, then enables DMA signal to CPU, to gain the DMA access right to the DMA module.

After receiving the DMA signal, DMA module enables the DMA memory via DMA buffer (count becomes zero). After DMA access right to the DMA.

After releasing this signal, DMA module transfers the data to memory.

The DMA operation, bus connection is established.

After releasing this signal, DMA module transfers the data to memory.

① Busy state  
② Blocked state (HOLD state)

In the DMA operation, CPU is present in 2 states:

NOTE:

① Busy state  
② Block state

In the DMA operation, CPU is busy state until transforming the data of depends on the DMA speed.

→ Let ' $n$ ' is the preparation time; &  $y$  is transfer time.

$\pi \rightarrow$  Preparation Time -

$y \rightarrow$  Transfer Time

$$\text{Percentage of time CPU is blocked} = \left( \frac{y}{x+y} \right) \times 100$$

$$\text{Percentage of time CPU is busy} = \left( \frac{n}{n+y} \right) \times 100$$

→ DMA module is programmed in 3 modes:

① Burst Mode    ② cycle stealing mode.    ③ Block mode

→ In burst mode of DMA, bulk amount of data will be transferred to main mem, based on the HLD and MDA signals.

→ In cycle stealing mode of DMA, small amount of data will be transferred to main memory by stealing the control of a system bus from the CPU hand, when it is in idle state. all to transferred to main m/m

→ In Block mode of DMA, data will be transferred to main m/m in the form of blocks based on the Hold and MLD signals.

Q. Consider a 32-bit hypothetical processor which is interfaced to a 1Mbps I/O device in a programmed I/O mode. Take a transmission is in the form of word wise. I/O interface interrupt overhead is 14s. How much performance will be gained by operating the device in a interrupt mode over the programmed I/O mode.

|                                                                                                        |                                                                                      |
|--------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
| <u>Programmed I/O</u> : (Depends on<br>1 MB/s $\Rightarrow$ 1s<br>$1B \Rightarrow 10^{-6}s = 1\mu s$ ) | <u>Interrupt driven</u> : (Depends on I/O<br>1us<br>$\Rightarrow$ Interface latency) |
|                                                                                                        | gains 4                                                                              |

For word  $\leq 4 \text{ ms}$   
(32-bit)

Q. Consider 1MBs I/O device interfaced to a CPU in a cycle stealing mode of DMA. Whenever, 8 bytes data is available in the buffer then, if it is transferred to CPU, Mem cycle time is 2μs. How much storage of CPU time is consumed in the DMA operation.

(x) Execution time depends on  $I(0) = 1 \text{ MIPS} \times 8 = 8 \text{ ms}$

(y) transfer time =  $\frac{1}{100} \times 8 - 8.25$  = Min cycle time = 24 s

$$\% \text{ time CPU blocked} = \frac{2}{10} \times 100 = 20\%$$

Q. Consider 10Mbps device interfaced to a 32 bit CPU in a cycle stealing mode whenever 4-word data is available in the buffer, then it is transferred to main mem. M/cycle time is 0.5μs. How much %age of time CPU blocked?

$$n = 1.645$$

$$y = 0.5 \times 4 = 2 \text{ ms}^{-1}$$

$$\frac{200}{3.6} \times 100 \rightarrow \frac{55.56}{1} = 55.56\%$$

MIC cycle time = time 0.5 ms  
 (No. of cycles  
 reqd for one  
 mem. op.)  
 1 word  $\rightarrow$  1 access/cycle  
 4 words  $\rightarrow$  4 m/c cycles.

Q. Consider 10MBPS I/O device which is interfaced to a 1MHz clock freq. CPU in a cycle stealing mode of DMA. Whenever, 32 byte data block is available in the buffer then, it is transferred to MM. DMA initialisation and transfer takes 30 and 60 cycles respectively. How much %age of CPU time is blocked in the DMA operation.

$$10 \text{ MBPS} \quad 1 \text{ MHz} \quad \frac{1}{10^6}$$

$$10 \rightarrow 0.1 \text{ ms} \quad = 10^{-6} \text{ s} \quad 1 \text{ ms microcycle}$$

$$\pi \approx 3.2 \times 0.1 = 3.2 \text{ ms} \quad | \quad y = 30 \text{ cycles} + 60 \text{ cycles}$$

$$= 90 \text{ cycles} = 90 \text{ ms}$$

Q. A 32-bit hypothetical processor is interfaced with a DMA module to transfer the data. DMA contains 8-bit count register and 32 bit data register. How many bytes of data is transmitted to main memory in a DMA cycle.

$$4 \text{ words} = 4 \text{ bytes}$$

DMA cycles is controlled by the byte count register.

Count register is 8 bit long, so it maintains 256 counts.

In each count 1 word data is transferred to MM. So, total data transmission  $\approx 256 \text{ words}$   
 $\approx 256 \times 4 = \underline{\underline{1 \text{ KB}}}$

### # Hard Disk Design:

- Hard disk is a direct access electromagnetic storage component. It contains a bunch of magnetic coded platters used to store the data.
- Each platter contains 2 surfaces.
- Every surface has its read-write head used to access the data.
- Surface contains set of concentric circles called as tracks.
- Track is divided into a set of sectors. Sector holds the data.
- In the hard disk design, surfaces, tracks and sectors are the addressable units.
- In the hard disk, cylinder is formed by connecting the same track number in all the surfaces.



### # SECONDARY STORAGE COMPONENTS.

→ Classification of secondary storage components is as follows



- ① No. of cylinders in the disk = No. of tracks in the surface
- ② Track capacity = No. of sectors/track  $\times$  Bytes/sector
- ③ Cylinder capacity = No. of surfaces in disk  $\times$  track capacity
- ④ Disk Capacity = No. of cylinders in the disk  $\times$  cylinder capacity

#### Average:

$$\text{Disk size} = \text{No. of surfaces} \times \text{No. of tracks/surface} \times \text{No. of sectors/track} \times \text{Bytes/sector}$$

- Different adjustments are required in the disk operation:
- ① Seek Time: Amount of the time required to move the head point from the initial position to desirable desired track in the surface.
  - ② Rotational Latency: Amount of time required to adjust the head point at the beginning of a desired sector in the track.
  - ③ Best case: In this view, rotational latency is zero because head point is adjusted during the seek time itself.
  - ④ Worst case: In this view, one complete rotation is required to adjust the head point.
  - ⑤ Average case: In this view, half of a revolution is required to adjust the head point.
  - ⑥ Transfer Time: Amount of time required to access the data from the disk.
  - ⑦ Overhead: Additional delays in the disk operation.
  - ⑧ Total Time → Total time required to access the data from the hard disk is calculated as:

$$T_{avg} = \text{seek time} + \frac{\text{average rotational latency}}{\text{track}} + \text{overhead}$$

→ In the hard disk design, storage distance is restricted by the inner and outer diameters.



$$\begin{aligned} \text{Storage distance} &= \text{Outer diameter} - \text{Inner diameter} \\ &= 12\text{cm} - 4\text{cm} \\ &= 8\text{cm} \\ \text{No. of tracks} &= \frac{\text{Storage dist.}}{\text{Inter track dist.}} \\ &= \frac{8\text{cm}}{0.2\text{cm}} \\ &= 40 \text{ tracks} \end{aligned}$$

→ When the sector address is expressed in a 3-dimensional format, then sector no. in the hard disk is calculated using the following formulae:

$$\text{Sector number} = (i, j, k)$$

$i$  = cylinder no.

$j$  = surface no.

$k$  = sector no.

$$\boxed{\text{Sector no.} = k + s(j + i)t}$$

$s$  = No. of sectors/track

$t$  = # track/cylinder

(No. of surfaces in the disk)

Q. Consider a hard disk with a rotational rate of 9800 rpm. Average seek time of the disk is 8.4ms. Disk contains 16 platters. Surface contains 120 tracks. Track contains 256 sectors. Sector holds 512 bytes of data. In the application program record size is 4KB. Calculate the following.

$$\text{④ Disk Capacity} = \frac{2^9 \text{ bytes}}{2^9 \text{ bytes}} \times 2^{29} \text{ bytes} = 512 \text{ MB}$$

⑤ Time required to access the record.

⑥ Amount of time required to access 100 random sectors data.

⑦ Calculate the disk transfer rate.

⑧ Seek time = 8.4ms  
Rotational latency: Depends on rotational speed

$$9800 \text{ revs} \rightarrow 60 \text{ s} \\ 1 \text{ rev} \rightarrow \frac{60}{9800} = 0.006125 \approx 6.12 \text{ ms}$$

$$\text{Avg rotational latency} = \frac{1}{2} \times \text{revolution time} = \frac{1}{2} \times 6.12 \\ = 3.06 \text{ ms}$$

Transfer time: Depends on rotational speed

1 revolution time = 1 track data (256 sectors)

? → 1 record (8 sectors)

$$= \frac{6.12 \times 8}{256} = 0.191 \text{ ms}$$

$$\begin{aligned} T_{avg} &= \text{seek time} + \text{avg rot. latency} + \text{transfer time} + \text{overhead} \\ &= 8.4 + 3.06 + 0.191 + 0 = 11.651 \text{ ms} \end{aligned}$$



## Unsigned Multiplication:

→ When the multiplication takes place between 2 n-bit numbers, then, result will be expressed in a  $2 \times n$ -bit size, i.e.,  $(n \text{-bit} \times n \text{-bit})$

= 2 n-bit.

→ Optimized multiplication process is described in the following flow chart.



E.g.  $1(1 \times 13c)1011 \neq 1101$   
 $M = 1011$ ; count = 4

$\begin{array}{c} C \\ \hline 1 & A & \end{array}$ 
 $\begin{array}{c} 1 \\ \hline 1101 \end{array}$ 
 ; initial

---

$\xrightarrow{0} \xrightarrow{0} 1011 \xrightarrow{1101 \text{ last}} \text{C, A+C+A}$   
 , shift  
 count = 3

---

$\xrightarrow{0} \xrightarrow{0} 0101 \xrightarrow{1110 \text{ last}} \text{count} = 3$

---

$\xrightarrow{0} 0010 \xrightarrow{1111 \text{ last}} \text{shift}$   
 count = 2

---

$\xrightarrow{0} \xrightarrow{0} 101 \xrightarrow{111 \text{ last}} \text{C, A+C+A}$   
 $\xrightarrow{0} 0110 \xrightarrow{111 \text{ last}} \text{shift}$   
 count = 1

Product is in A & register, i.e., 1000 1111 [1443]

| Multiplicand          | Multiplier | OP^n        |
|-----------------------|------------|-------------|
| 0110<br>1011<br>10001 | 0<br>1     | Shift       |
| 43]                   |            | Add & Shift |

# Standard Multiplication [Booth's / Bit Pair Multiplication]:

$\rightarrow$  Signed data multiplication is controlled by the bit pairs.  
Bit pair is formed based on the multipliers. In this process, one assumption is required to get the initial bit pair, i.e.,  $a_0 = 0$ .

E.g. Multiplier : 9<sub>3</sub> 9<sub>2</sub> 9<sub>1</sub> 9<sub>0</sub>

| 3-bit Pairs    |                 |
|----------------|-----------------|
| Multiplexer    | Pair with       |
| q <sub>0</sub> | q <sub>-1</sub> |
| q <sub>1</sub> | q <sub>0</sub>  |
| q <sub>2</sub> | q <sub>1</sub>  |
| q <sub>3</sub> | q <sub>2</sub>  |

→ Multiplication is controlled based on the bit pair status, i.e.,

| Bit Pair | Operation | Reading |
|----------|-----------|---------|
| 00       | ASR       | 0       |
| 01       | t & ASR   | +1      |
| 10       | - & ASR   | -1      |
| 11       | ASR       | 0       |

ASR → Arithmetic Shift Right

Consider the following Booth's Multiplication.

Multicast: 0110 0001 1000

Multifaktor : 1010 0101 1010

⑥ How many arithmetic operations are required in the multiplication.  
⑦ What is the recorded multiplier.

| Multiplexer | Pairs with  |      |
|-------------|-------------|------|
| (20) 0      | (q-1) 0 → 0 | LSB. |
| 1           | 0 → -1      |      |
| 0           | 1 → +1      |      |
| 1           | 0 → -1      |      |
| 1           | 1 → 0       |      |
| 0           | 1 → +1      |      |
| 1           | 0 → -1      |      |
| 0           | 1 → +1      |      |
| 0           | 0 → 0       |      |
| 0           | 1 → -1      |      |
| 1           | 0 → +1      |      |
| 0           | 0 → -1      |      |
| 1           | 0 → +1      |      |
| 0           | 0 → -1      |      |

(a) → 9

(b) →

$$\begin{aligned} & -1+1-10+1-1+10 \\ & -1+1-10 \end{aligned}$$

Q. Consider the following 8-bit multiplication: What is the recoded multiplier.

$$(+112) \times (-121)$$

$$01110000 \times 10000111$$

$$\begin{array}{r} 0111001 \\ 10000110 \\ +1 \\ \hline 10000111 \end{array}$$

| Multiplexer | Pairs with   |                     |
|-------------|--------------|---------------------|
| 0           | 0 → -1 (LSB) | Recoded Multiplier. |
| 1           | 0 → 0        |                     |
| 1           | 0 → 0        |                     |
| 1           | 1 → +1       |                     |
| 0           | 0 → 0        |                     |
| 0           | 0 → 0        |                     |
| 0           | 0 → 0        |                     |
| 0           | 0 → -1 (MSB) |                     |

Q. Consider the following bit pair combination used in the Booth's multiplication.

Bit Pair: Fill the data.

A x  
C A  
B C  
D D  
E E  
F F  
G G  
H H  
I I  
J J  
K K  
L L  
M M

Dj

Q. Optimized signed multiplication is described in the following flowchart.



Ex:  $(-7) \times (-3) = +21$

M = 1001, count = 4  
A = 0000, Q = 1101, q-1 = 0, initial

Bit Pairs:

| q <sub>0</sub> | q <sub>-1,0</sub> | q <sub>-1</sub> |
|----------------|-------------------|-----------------|
| 1              | 0 → -1            |                 |
| 0              | 1 → +1            |                 |
| 1              | 0 → -1            |                 |
| 1              | 1 → 0             |                 |

Recoding:  
Multiplier = 0-1+1-1

|                                      |                                      |
|--------------------------------------|--------------------------------------|
| 00111 → 1001 → last ; A ← A - M      | 00000 → 1101 → last ; A ← A + M      |
| 00111 → 1110 → last ; A ← A - M      | 00000 → 1110 → last ; A ← A + M      |
| 00111 → 1110 → last ; A ← A - M      | 00000 → 1110 → last ; A ← A + M      |
| 00111 → 0111 → last ; A ← A - M      | 00000 → 0111 → last ; A ← A + M      |
| 00111 → 0111 → last ; A ← A - M      | 00000 → 0111 → last ; A ← A + M      |
| 00011 → 0101 → last ; ASR, count = 0 | 00011 → 0101 → last ; ASR, count = 0 |

(Ans) → Product = 00010101 = +21

## Floating Point Data Format:

→ Representation of a very large and very small fraction of the data consumes more memory space.

E.g. +998000000000 (very large) ( $\approx \infty$ )  
+0.000000000978 (very small) ( $\approx 0$ )

→ To represent the large and small fraction of a data within a less amount of memory, floating point rep. is used.

→ General form of a floating point format is:

$$\boxed{\pm M \times B^e}$$

$\pm$  → sign

M → Mantissa / Significant (fraction)

B → Base/radix

e → exponent

→ In the floating point representation, normalisation format is used to represent the large and small fraction of data in a common format, i.e.,

$$\boxed{\text{valid fraction } \times B^{\pm e} \text{ digit}}$$

→ When the base of the data is greater than 2, then the valid digit varies, so, extra space is required to store the valid digit.

| Base  | Valid Digit  |
|-------|--------------|
| $B=3$ | {1, 2}       |
| $B=4$ | {1, 2, 3}    |
| $B=5$ | {1, 2, 3, 4} |

→ To handle the above overhead, base of the data is always restricted to 2, so, valid digit becomes 1 as implicit. Therefore no need to store this digit in the memory.

→ General form of normalization is:  
Implicit:  $\boxed{\pm(1)b_{bb...} \times B^e}$   $b=0/1$   $B=2$  (default)

→ Mantissa alignment process is used to adjust the decimal point in order to get the data in a normal form. In this process, right alignment increments the exponent and left alignment decrements the exponent.

E.g. +110110101

(1.bbb...)

Mantissa align right  
8 times

$\downarrow$   
 $1.10110101 \times 2^{+8}$

E.g. 0.00000010101101

(1.bbb...)

Mantissa align left  
7 times

$\downarrow$   
 $+1.0101101 \times 2^{-7}$

→ When the data is in a normal form, then it is stored into a memory using the floating point format. Let us consider a 20-bit hypothetical format used to represent the data, i.e.,

| Sign (S) | Biased Exponent (BE) | Mantissa (M) |
|----------|----------------------|--------------|
| 1-bit    | 6-bit                | 13-bit       |

→ Floating point data is stored in the memory with 3 fields of information:

① Sign Field: This field is used to represent the sign of a data, i.e., S  $\rightarrow$  0(+ve)  $\rightarrow$  1(-ve)

② Biased Exponent Field (BE): This field is used to store the exponent. In the memory, exponent is always stored in a biased exponent format, i.e., BE = Actual Exponent + Bias (A.E.)

→ Bias is a maximum possible exponent, depends on the format, BE field, i.e.,

| BE field size | Range of exponent                                                | Bias |
|---------------|------------------------------------------------------------------|------|
| 6-bit         | $(-2^{(6-1)})$ to $(2^{6-1} - 1)$<br>-32 to +31                  | +31  |
| 9-bit         | -256 to 255                                                      | +255 |
| 11-bit        | $-2^{(11-1)}$ to $2^{11-1} - 1$<br>$(-2^{10})$ to $(2^{10} - 1)$ |      |

→ This concept is used to preserve the sign of an exponent, i.e., when  $(BE > Bias)$  then AE is +ve.  
when  $(BE < Bias)$  then AE is -ve.

NOTE:  
→ To retrieve the exponent from the memory, we need to subtract the bias from the biased exponent, i.e.,  $[AE = BE - Bias]$ .

E.g. Format  $\boxed{S|BE|M}$

Bias =  $+(2^{6-1} - 1)$   
= +31

~~AE~~  
Data =  $-1.10101 \times 2^{+8}$

AE = +8

BE =  $8 + 31 = 40$   
001000  
011111  
BE =  $100111$

Retrieve (Read)

$$AE = BE - Bias$$

$$BE = 100111$$

$$Bias = 01111$$

$$AE = 001000 = 8$$

E.g. Format:  $\boxed{S|BE|M}$   
1-bit 6-bit 13-bit  
↳ Excess-Bias

Data:  $-1.10101 \times 2^{+8}$

Store:  
 $BE = AE + Bias$   
 $= \left(\frac{d^6}{2}\right) = \frac{d^6}{2}$   
 $= 32$

AE : 001000  
Bias : 100000  
BE : 101000

Retrieval:  
 $AE = BE - Bias$   
 $BE = 101000$   
Bias : 100000  
AE = 001000  
[+8]

Alternate:  
 $AE = \text{complement of MSB of BE}$   
 $BE = \underline{\downarrow} 101000$   
Complement of MSB  
↓  
AE = 001000

(3) Mantissa field: → This field is used to store the fraction part of a data.

→ Data is always stored in the memory in a normalised mantissa format, i.e.,  $[1.bbb\dots] b: 0/1]$ .  
Implicit fraction

→ In the above structure only the fraction part is stored in a mantissa field of the memory because valid digit becomes implicit.

NOTE:  
→ To retrieve the data from the memory, we need to align the mantissa to right direction upto 1 time to get the implicit digit as a valid fraction. This adjustment is called as Denormalisation (Subnormal).

Fetch the data from the memory  
 $(\pm .bbb\dots \times 2^{\pm e})$   
↑  
mantissa align to right upto 1 time to get the implicit digit  
↓  
 $\pm 0.1bbb\dots \times 2^{(\pm e) + 1}$   
 $\pm 0.1bbb\dots \times 2^{\pm e'}$   $e' \rightarrow \text{adjusted exponent}$

NOTE:  
→ When the format is designed with an Excess Bias form, then bias becomes the center of a exponent range. In this format, actual exponent is retrieved from the memory by taking the complement of a MSB of Biased exponent rather than the subtraction operation.

Q. Consider the following hypothetical format used to represent the data.

Format : 

|       |       |        |
|-------|-------|--------|
| S     | BE    | M      |
| 1 bit | 7 bit | 12 bit |

Data :  $-7.25 \times 2^{11}$

What is its hexadecimal equivalent when the data is stored in the memory?

(a) w/o normalisation

$$\begin{aligned} \text{(1) Sign} &= \text{negative} = 1 \\ \text{(2) BE} &= AE + \text{BIAS} \times 2^{(g-1)} \\ &= +63 \end{aligned}$$

$$AE = 0001011$$

$$\text{Bias} = 0111111$$

$$BE = \underline{\underline{1001010}}$$

$$\begin{aligned} \text{(3) Mantissa} \\ (7.25)_0 &= (x)_2 \end{aligned}$$

$$111.01$$

$$x = 111.01$$

(b) with normalisation,

① Data :  $111.01 \times 2^{1001010}$   
           ↓  
           w/o normal form  
           ↓  
           Store in mem.  
           ↓  
           format reqd.  
           ↓

|   |         |              |
|---|---------|--------------|
| S | BE      | M            |
| 1 | 1001010 | 010000000000 |
| C | A       | 4 0 0 0      |

(c) Data with normal form:  
 $-111.01 \times 2^{1001010}$   
           ↓  
 $(1.1101 \times 2^4) \times 2^{-4}$   
           ↓  
           Mantissa align to right & times

$1.1101 \times 2^{1001010+2}$   
 $1.1101 \times 2^{1001010}$   
           ↓  

|   |         |                  |
|---|---------|------------------|
| S | BE      | M                |
| 1 | 1001010 | 1101000000000000 |
| C | A       | 2 0 0 0          |

Q. Consider the following hypothetical format used to represent the data.

Format : 

|       |       |        |
|-------|-------|--------|
| S     | BE    | M      |
| 1 bit | 7 bit | 12 bit |

  
           ↳ Excess Bias

Data : 0X 3CE 00

What is the actual data associated with the above hexadecimal equivalent in the memory. Report the decimal value.

$$\begin{array}{r} 0111100 \text{ BE} \\ 1111100 \text{ AE} \\ \hline 0011100 \text{ M} \end{array}$$

$$\begin{array}{r} 1111100 \text{ AE} \\ \hline \text{L}2's complement \\ \hline 0000011 \end{array}$$

$$\begin{array}{r} 0000011 \\ +1 \\ \hline 0000010 \end{array}$$

$$\begin{array}{l} \text{(3) Mantissa:} \\ M = 110 \\ M = 11 \end{array}$$

Data :  $+0.111 \times 2^{-4}$

$$\begin{array}{l} \rightarrow +0.111 \times 2^{-4+1} = +0.111 \times 2^{-3} \\ \rightarrow 0.000111 \rightarrow \text{To report} \end{array}$$

$$\begin{array}{l} = 0.000111 \times 2^{-3} \\ = 1/16 + 1/32 + 1/64 + 1/128 \\ = 0.0111875 \end{array}$$

To report the decimal value, exponent will be making '0' by aligning the mantissa to right upto 3 times.

## # RANGE OF FLO

# RANGE OF FLOATING POINT DATA ::

→ Floating point data range is always dependent on the floating point numerical format to

Point framework. Let us consider the data, i.e.,

|      |       |       |
|------|-------|-------|
| s    | 18bit | m     |
| 1bit | 6bit  | 12bit |

① `sign()` → 0 (true)  
1 (-ve)

② Opponent Range: Det.

② Evaporent Range: Depends on  $\delta t$  head

$$B_1^{\circ} \text{ads} = + (2^{n-1} - 1)$$

$$\textcircled{6} \text{ Min BE} = \text{All 0's in BE}$$

$$\Delta E = B_L - B_{\text{bias}} = 0 - (-3) = +3$$

⑥ maybe - all is in BC

$$= 111111111111111111111111$$

$$AE = AE - Bias \approx 63 - (-3) = 66$$

Exponent Range = -31 to +32

**②** Mankissa Range: Depends on Normal Mankissa, like : 11  
 (a) Min M = All 0s in m → Mankissa align to left  
 (b) Open (3 times to convert all 0s to 1s)

$$\begin{aligned} \text{Normal } \mu &= 1, \sigma = 1.0 \\ &= 0 \end{aligned}$$

(b)  $M = \text{All } i \in m$

∴ Normal M = 1.00

④ Range of a 20-bit floating point data:

## ① Single Precision (32 bit)

## ② Double Precision (64bit)



$$\frac{1}{m} \left| \Delta S \right|$$

② Exponent range : bias =  $2^{-1}$   
 $\Rightarrow$   $2^{-1} = \frac{1}{2^1}$

$$\begin{aligned} \text{Min. } BE &= 0 \\ AE &= 0 - 127 = -127 \\ MOr + Bc &= 255 \\ AE &= 255 - 127 = 128 \end{aligned}$$

$$S_2 = \{x_1^2 + x_2^2 + \dots + x_n^2\}$$

(3) manissa longi  
Min. = 1.0000

$$\text{④ Mean } \bar{y} = \frac{1}{n} \sum y_i$$

$$\text{Normal} = 1.111\ldots$$







- ① Indicates the min exponent data with all possible M's.
- ② Indicates the max exponent M's.
- ③ " " data which is nested w/o the min max exponents with all possible M's.

### # Spatial locality in the array accessing from the memory.

→ Array is a ordered set of the homogeneous data elements.  
→ Array elements are always stored in the memory in a sequential cells row by row (row-wise).

$$\text{E.g. } \begin{bmatrix} a[0][0] & a[0][1] & a[0][2] & a[0][3] \\ a[1][0] & \vdots & \vdots & \vdots \\ a[2][0] & \vdots & \vdots & \vdots \\ a[3][0] & \vdots & \vdots & \vdots \end{bmatrix}_{4 \times 4}$$

### Mem storage (row-wise)

|      |         |    |
|------|---------|----|
| 0000 | a[0][0] | B0 |
| 0001 | a[0][1] | B1 |
| 0010 | a[0][2] | B2 |
| 0011 | a[0][3] | B3 |
| 0100 |         |    |
| 0101 |         |    |
| 0110 |         |    |
| 0111 |         |    |
| 1000 |         |    |
| 1001 |         |    |
| 1010 |         |    |
| 1011 |         |    |
| 1100 |         |    |
| 1101 |         |    |
| 1110 |         |    |
| 1111 |         |    |
| F000 |         |    |
| F001 |         |    |
| F010 |         |    |
| F011 |         |    |
| F100 |         |    |
| F101 |         |    |
| F110 |         |    |
| F111 |         |    |

Assume Block size = 2B

→ Array elements are accessed from the memory in 2 ways:

#### ① Row major indexing:

a[0][0] → miss ; a[0][0], a[0][1] → CM  
a[0][1] → hit  
a[0][2] → miss  
a[0][3] → hit

#### ② Column major indexing:

a[0][0] → miss  
a[1][0] → miss  
a[2][0] → miss

### NOTE :

| Storage Seq. | Access Seq. | Locality of reference |
|--------------|-------------|-----------------------|
| Row          | Row         | Yes                   |
| Row          | Col.        | No                    |
| Col.         | Col.        | Yes                   |
| Col.         | Row         | No                    |

→ When the main mem. block references are sequential, then arrangement in the cache is also sequential, so, mapping function is not required to map the data.

E-9

|   |       |       |
|---|-------|-------|
| 0 | B6B8  | B9B2  |
| 1 | B7B9  | B5B3  |
| 2 | B2B40 | B6B4  |
| 3 | B2B11 | B7B15 |

## # Instruction Cycle - Pipelining

- fetch → M<sub>11m</sub> to register
  - decode
  - compute Effective addresses
    - Direct → nothing
    - Indirect → memory
  - Execute



## # fetch Phase:

- Getting instruction from main memory.

To:  $AR \leftarrow PC$   
 $T_1: IR \leftarrow M[AR], PC \leftarrow PC + 1$

## # Decode Phase:

$\rightarrow T_2 : D_7 \dots D_2 D_0 \leftarrow \text{Decode } IR(14, 13, 12); AR \leftarrow IR(15)$

# Compute Effective address.

- Not required for direct addressing.
  - T8: for indirect addressing.
  - $\text{w}_{\text{register}} \rightarrow \text{D}_8, \text{I}, \text{T}_8 : \text{AR} \leftarrow \text{M}[R]$

WERTSCHÜTZER  $\rightarrow$  D<sub>2</sub>: I. T<sub>3</sub>: AR  $\leftarrow$  m[AR]



$\bar{D}_7 \cdot \bar{I} \cdot T_3$ : Nothing





### \* Data forwarding

- means if we write to a memory location and subsequently make from the same memory location, the second instruction can fetch the value directly from the register used to do the write rather than waiting for the memory.
- So, this increases the performance.

↳ Register renaming

↳ Memory buffering

↳ Register renaming

↳ Memory buffering