

Computer ArchitectureInstruction PipeliningMIPS:

- Microprocessor without Interlocked Pipelined Stages (MIPS)
- 32 registers, 32-bit each (1 byte)
- Uniform length instructions
- RISC load-store architecture
- MIPS architecture uses meta goal-
  - performance maximize
  - cost & time reduce



Every instruction in MIPS is represented in 32 bits.  
MIPS are basically three types of instruction:

① R-type    ② I-type    ③ J-type

|     | 6 bits | 5 bits         | 5 bits | 5 bits              | 5 bits | 6 bits |
|-----|--------|----------------|--------|---------------------|--------|--------|
| R : | OP     | rs             | rt     | rd                  | shamt  | funct  |
| I : | OP     | rs             | rt     | address / immediate |        |        |
| J : | OP     | target address |        |                     |        |        |

op: basic operation of the instruction (opcode)

rs: first source operand register

rt: second source operand register

rd: destination operand register

shamt: shift amount

funct: selects the specific variant of the opcode

address: offset for load/store instruction

immediate: constants for immediate instructions

## Q What is Pipelining?

- ⇒ Pipelining is an implementation technique whereby multiple instructions are overlapped in execution.
- ⇒ It takes advantages of parallelism that exists among the actions needed to execute an instruction.
- ⇒ Today, pipelining is the key implementation technique used to make fast CPUs.

## Characteristics:

- ⇒ Pipeline does not reduce the latency of a single task, it improves throughput (transactions/unit time) of on-line workload.
- ⇒ Pipeline rate is limited by the slowest pipeline stage.
- ⇒ Potential speedup = No. of pipe stages.
- ⇒ Unbalanced length of pipe stages reduces speed up.
- ⇒ Time to fill and time to drain pipeline reduces speedup

## Q) Pipelining in Circuits?

- Pipelining partitions the system into multiple independent stages with added buffers between the stages.
- Pipelining can increase the throughput of a system.



- ① pipelining અલો અનેકાર્યાળો stage એવી પ્રક્રિયા series, સેદ્ધાળે પ્રક્રિયાટો series એ કોણ ના કોણે વાંચ રહેશે માટે।
- ② એવી technique use કરે પ્રક્રિયા process કે અનેકાર્યાળો subprocess એ કરતા માસું।
- ③ stage ખૂલો અનુભાવે સાજાને આવે મેન જાઓસ્ટ્રોન ઑપરેટર્સ output પણેનું ટેક્સું input ટ્રિમવે કરતું।
- ④ પ્રાઇવેઝનું subprocess એ આવધાને register buffer માટે માન્યાનું sub task એ input ટ્રિમવે માસું।
- ⑤ એવે stage ખૂલો પ્રક્રિયાએ pipe એવી નજી સ્ટ્રક્ચરું create કરાયાનું।

## 4) Instruction Execution Cycle :

Each instruction can take at most 5 clock cycles.

### ① Instruction Fetch cycle (IF) :

- Based on PC, fetch the instruction from memory
- Update / increment PC.
- प्रैर्फ़े शेस्ट वग़ाज़ इला memory मेंके instruction से microprocessor ए आता।
- current instruction ए 32bit (4 byte) पर्टें next instruction भालू, मालू current instruction ए मालू 4 byte मोज़ चाहें प्रैर्फ़े instruction count बढ़ाएं IF stage.

### ② Instruction Decode / Register fetch cycle (ID) :

- Decode the instruction + register read operation
- Fixed field decoding
- Equality check of registers.
- Computation of branch target address if any.
- opcode विलोपन कर्ते हुए इसे instruction ए की वग़ाज़ चाहें वला इडेटे।
- Register file मेंके operand इला ID stage ए और ALU ए प्रैर्फ़े इवे।  
add R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub> ए Register इला मेंके value इला ALU ए प्रैर्फ़े इवे।
- मादि द्गाता branch instruction भालू, उसे उसे branch instruction ए तार्ज़ target address प्रैर्फ़े stage ए count इस्तु।

### ③ Execution / Effective address cycle (EX) :

- Memory reference instruction : calculate the effective address
- Register-register ALU instruction (add, sub, mul)
- Register-immediate ALU instruction
- Example: LW R<sub>1</sub>, 8(R<sub>2</sub>), Effective address = [R<sub>2</sub>] + 8  
Add R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub>
- এই stage এ normal ALU এক্সে কাজ (add, sub, mul) অন্তর্ভুক্ত নহ'।

### ④ Memory access cycle (MEM) :

- Load instruction : Read from memory using effective address.
- Store instruction : Write the data in the register to memory using effective address.
- Load / store instruction এর জন্যে MEM stage এ memory access কর্তৃত প্রয়োজন আছে।
- Load : memory থেকে microprocessor এ data আসে
- Store : register থেকে memory এ data যায়।

### ⑤ Write-back cycle (WB) :

- Register-register ALU instruction / Load instruction
- Write the result into the register file
- এই stage এ সে data শূলো আসলে register file এ write হবে।

Answer to the q. no - 4

Hence, the time delay of 4-segment pipeline.

$$t_1 = 35 \text{ ns}$$

$$t_2 = 30 \text{ ns}$$

$$t_3 = 40 \text{ ns}$$

$$t_4 = 45 \text{ ns}$$



But maximum delay time = 45 ns.

Interface register delay time = 5 ns

$$\begin{aligned}\therefore \text{Total maximum delay time} &= (45 + 5) \text{ ns} \\ &= 50 \text{ ns}.\end{aligned}$$

For 1st instruction, it will end after 4 stages.

But next 99 instruction will take 1 clock cycle to finish their execution.

So, time to complete 100 instruction is

$$= (1 \times 4 \times 50) + (99 \times 50) \text{ ns}$$

$$= 200 + 4950 \text{ ns}$$

$$= 5150 \text{ ns}.$$

## Cycles required to implement instructions:

- o Branch instruction — 4 cycles
- o Store instruction — 4 cycles
- o All others instruction — 5 cycles

## Visualization of Pipeline :

| instruction | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  |
|-------------|----|----|----|----|----|----|----|----|----|
| i           | IF | ID | EX | ME | WB |    |    |    |    |
| i+1         |    | IF | ID | EX | ME | WB |    |    |    |
| i+2         |    |    | IF | ID | EX | ME | WB |    |    |
| i+3         |    |    |    | IF | ID | EX | ME | WB |    |
| i+4         |    |    |    |    | IF | ID | EX | ME | WB |

## Q Instruction Encoding :

Answer to the q. no - 5

|     |   |                                   |                                         |                      |
|-----|---|-----------------------------------|-----------------------------------------|----------------------|
| 1   | 0 | XXXXX (5 bits)<br>32 combinations | 5 bit Reg name                          | 5 bit Reg name       |
| II  | 1 | 00/01                             | 5 bit Reg name                          | 8 bit memory address |
| III | 1 | 10/11                             | XXXXXX XXX (8 bits)<br>512 combinations | 5 bit Reg name       |

## Q) Pipeline Hazards :

# Hazards : circumstances that would cause incorrect execution if next instruction is fetched and executed.

There are 3 different types of hazards :

- ① Structural hazards : Different instruction, at different stages, in the pipeline want to use the same hardware resources.
- ② Data hazards : An instruction in the pipeline requires data to be computed by a previous instruction still in the pipeline.
- ③ Control hazards : Succeeding instruction, to put into pipeline, depends on the outcome of a previous branch instruction already in pipeline.

## Structural Hazard :

if one read port in memory.

|       | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 |
|-------|----|----|----|----|----|----|----|----|
| Load  | IF | ID | EX | ME | WB |    |    |    |
| ins 1 |    | IF | ID | EX | ME | WB |    |    |
| ins 2 |    |    | IF | ID | EX | ME | WB |    |
| ins 3 |    |    |    | IF | ID | EX | ME | WB |

This happens when we are using uniport memory. The problem happens in the 4th instruction. This 4th instruction also trying to access memory in the IF stage. But 1st Load instruction is also trying to access the memory.

## \* Resolving Structural Hazard :

Eliminate the use of same hardware for two different things at the same time.

### Solution 1: Wait

- must detect the hazard
- must have mechanism to stall

- But it has some problem also.
- Instruction - 2 এর পরে একটি clock cycle stall হওয়া হল. instruction - 4 (IF) stage ও সাথে clock cycle হ'ল. But এই time এ instruction - 1 (MEM) stage এ আছে, এখন, ins - 1 ALU থেকে কোনো সমস্যা নাই, কিন্তু যদি Load/Store হ'ল তবে, আজেকটে clock cycle stall হওয়া হবে,

|       | C1  | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 |
|-------|-----|----|----|----|----|----|----|----|----|-----|
| Load  | IF  | ID | EX | ME | WB |    |    |    |    |     |
| ins 1 | L/S |    | TF | ID | EX | ME | WB |    |    |     |
| ins 2 |     |    |    | TF | ID | EX | ME | WB |    |     |
| stall |     |    |    | O  | O  | O  | O  | O  |    |     |
| ins 3 |     |    |    |    | IF | ID | EX | ME | WB |     |

### Solution 2: Duplicate hardware

- Multiple such units will help both instruction to progress.
- If we increase the number of read port memory such that one read-port can be used always by IF unit and others read-port can be used always by the data access unit.
- We can separate memory, one for exclusively storing instruction and other for exclusively storing the data. We use data cache and instruction cache.



## Data Hazard :

|                                                        | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  |
|--------------------------------------------------------|----|----|----|----|----|----|----|----|----|
| add r <sub>1</sub> , r <sub>2</sub> , r <sub>3</sub>   | IF | ID | EX | ME | WB |    |    |    |    |
| sub r <sub>4</sub> , r <sub>5</sub> , r <sub>3</sub>   |    | IF | ID | EX | ME | WB |    |    |    |
| and r <sub>6</sub> , r <sub>1</sub> , r <sub>7</sub>   |    |    | IF | ID | EX | ME | WB |    |    |
| on r <sub>8</sub> , r <sub>1</sub> , r <sub>9</sub>    |    |    |    | IF | ID | EX | ME | WB |    |
| xor r <sub>10</sub> , r <sub>1</sub> , r <sub>11</sub> |    |    |    |    | IF | ID | EX | ME | WB |

এখানে add ins এর 5th cycle পর্যন্ত পর্যবেক্ষণ আজগা r<sub>1</sub> এর value টি পাব, But নিচের সময়সূলো ins অন্দের ID stage এ r<sub>1</sub> এর value access করতে চাই।  
 সাহে ফলে এখানে data hazard create থাই,  
 basically এইটি (RAW) hazard, Read after Write,  
 আজগা instruction Write করার আজগাই সদি পর্যবেক্ষণ  
 instruction Read করতে সাধ, গুরুতর RAW Hazard হয়।

এখানে add এর পরের 3টি instruction sub, and, on  
 এর ক্ষেত্রে data Hazard থাই। but xor এর ক্ষেত্রে  
 data hazard থাই না, যথেষ্ট xor এর ID stage  
 add এর WB পর্যবেক্ষণ আছে।

## # Types of data hazards:

### ① Read after Write (RAW):

Instruction  $i$  tries to read operand before Instruction  $j$  writes it.

$i:$  add  $r_1, r_2, r_3$

$j:$  sub  $r_4, r_1, r_3$

- caused by a data dependence
- This hazard results from an actual need for communication.
- যদি  $i$  ইনসিন্যুটেশন লিপ্ত করে আগেই  $j$  ইনসিন্যুটেশন লিপ্ত করে আগে, তাহলে Raw hazard হয়।
- মূল calculation এ কোন আসে, যদি  $i$  add এর ans টি sub এ input হিসেবে লাগিছে।

### ② Write after Read (WAR):

Instruction  $j$  writes operand before instruction  $i$  reads it.

$i:$  sub  $r_4, r_1, r_3$

$j:$  add  $r_1, r_2, r_3$

$k:$  mul  $r_6, r_1, r_7$

- called an antidependence by compiler writers.
- This results from the reuse of the name  $r_1$ .
- Can't happen in MIPS 5 stage pipeline because:
  - All instructions take 5 stages.

- Reads are always in stage 2, and
- writes are always in stage 5.

- **আসেন্ট ইন্স হেড কান্টার আগের মাদি পর্যন্ত**  
instruction write করতে পারে, এখন WAW hazard হয়।
- এটা হয় out of order execution এ, প্রচালিয়া  
ins-i রিক করে read করতে গিয়ে মাদি দুর্ঘত্ব  
ins-j already রিক এ write করতে পারে না,  
তাহলে ins-i রিক পারে না কারণ আ ঝুল হবে,

### ③ Write after Write (WAW):

Instruction জো writes operand before instruction  
রিক writes it.

i: sub r1, r2, r3

j: add r1, r2, r3

K: mul r6, r1, r7

- called an output dependence.
- This also results from the reuse of name r.
- Can't happen in MIPS 5 stage pipeline bcz:
  - All instructions take 5 stages.
  - Writes are always in stage 5.
- **আগেন্ট ইন্স রিক কান্টার আগের মাদি পর্যন্ত**  
ins write করতে পারে, এখন WAW hazard হয়।
- এটাও হয় out of order execution এ, ins-i,  
ins-i এর পরে r1 এ write কান্টার ক্ষমা, এখন মাদি  
ins-j আগে write করতে প্রথম পারে ins-i write করা  
তাহলে, এখন ins-K r1 read করবে, আ ঝুল হবে।

## # Resolving Data Hazard :

Solution 1 : Operand forwarding

- Forward data if possible.

|                                                          | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  |
|----------------------------------------------------------|----|----|----|----|----|----|----|----|----|
| add r <sub>01</sub> , r <sub>02</sub> , r <sub>03</sub>  | IF | ID | EX | ME | WB |    |    |    |    |
| sub r <sub>09</sub> , r <sub>01</sub> , r <sub>03</sub>  |    | IF | ID | EX | ME | WB |    |    |    |
| and r <sub>06</sub> , r <sub>01</sub> , r <sub>07</sub>  |    |    | IF | ID | EX | ME | WB |    |    |
| or r <sub>08</sub> , r <sub>01</sub> , r <sub>09</sub>   |    |    |    | IF | ID | EX | ME | WB |    |
| xor r <sub>00</sub> , r <sub>01</sub> , r <sub>011</sub> |    |    |    |    | IF | ID | EX | ME | WB |



- Data hazard even with operand forwarding.  
সদি একটি Load ~~stall~~ instruction এর পরে  
বিশেষ ALU instruction আসে, তখন operand  
forwarding এ কাজ হ্যালি, বরং Load ins  
এ value টি MEM stage এর পরে আওয়াজ আসে,  
তাই Load এর পরে ALU ভাবলে একটি ~~stall~~ cycle  
stall দিতেই হ্যালি।

## Solution - 2 : Software Re-scheduling

- instruction শুলো মন্দি Re-scheduling করা  
মাস্য - মাত্রে বেণোলো Load instruction এর পরে  
বেণোলো ALU instruction না থাকে, তাহলে  
stall ব্যবহার করা মাস্য।
- একটা cycle stall দিয়েও solve করা মাস্য।

\* Data hazard even after operand forwarding :

|                | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9 |
|----------------|----|----|----|----|----|----|----|----|---|
| LW r1, 0(r2)   | IF | ID | EX | ME | WB |    |    |    |   |
| sub r3, r1, r6 | IF | ID | EX | ME | WB |    |    |    |   |
| and r6, r1, r7 |    |    | IF | ID | EX | ME | WB |    |   |
| or. r8, r1, r9 |    |    |    | IF | ID | EX | ME | WB |   |

\* Resolving Load - ALU hazard with stall :

|                | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  |
|----------------|----|----|----|----|----|----|----|----|----|
| LW r1, 0(r2)   | IF | ID | EX | ME | WB |    |    |    |    |
| sub r3, r1, r6 | IF | ID | *  | EX | ME | WB |    |    |    |
| and r6, r1, r7 |    |    | IF | *  | ID | EX | ME | WB |    |
| or. r8, r1, r9 |    |    |    | *  | IF | ID | EX | ME | WB |

### Answers to the q. no - 6 :

Hence, a,b,c,d,e and f in memory.

$$a = b + c$$

$$d = e - f$$

|     |            |    |
|-----|------------|----|
| LW  | Rb, b      | 5  |
| LW  | Rc, c      | 6  |
| ADD | Ra, Rb, Rc | 8  |
| SW  | a, Ra      | 9  |
| LW  | Re, e      | 10 |
| LW  | Rf, f      | 11 |
| SUB | Rd, Re, Rf | 13 |
| SW  | d, Rd      | 14 |

After Rescheduling :

|     |            |    |
|-----|------------|----|
| LW  | Rb, b      | 5  |
| LW  | Rc, c      | 6  |
| LW  | Re, e      | 7  |
| ADD | Ra, Rb, Rc | 8  |
| LW  | Rf, f      | 9  |
| SW  | a, Ra      | 10 |
| SUB | Rd, Re, Rf | 11 |
| SW  | d, Rd      | 12 |

Answer to the q. no - 7

For unpipelined architecture;

$$CCT_u = 1/1.5 = 0.67 \text{ ns}$$

$$CPI_u = 5$$

$$\begin{aligned} \therefore \text{Ex-}T_{u\text{per}} \text{ instruction} &= CCT_u \times CPI_u \\ &= 0.67 \times 5 \\ &= 3.33 \text{ ns/instruction} \end{aligned}$$

For pipelined architecture;

$$CCT_p = 1/1 = 1 \text{ ns.}$$

$$CPI_p = \text{Base CPI} + \text{stall CPI}.$$

$$= 1 + (\text{Memory stall} + \text{Branch stall} + \text{Load-AUW})$$

$$= 1 + (0.3 \times 0.05 \times 50) + (0.2 \times 0.3 \times 2) + (0.1 \times 1)$$

$$= 1 + 0.75 + 0.12 + 0.1$$

$$= 1.97$$

$$\begin{aligned} \therefore \text{Ex-}T_p \text{ per instruction} &= CCT_p \times CPI_p \\ &= 1 \times 1.97 \\ &= 1.97 \text{ ns/instruction.} \end{aligned}$$

$$\therefore \text{Speedup} = \frac{\text{Ex-}T_u}{\text{Ex-}T_p} = \frac{3.33}{1.97} = 1.69.$$

## Answer to the q. no - 5

Without operand forwarding:

ID of nth instruction can be only after WB of (n-1)th instruction

|     | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|-----|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|----|
| L.D | F | D | X | M | W |   |   |   |   |    |    |    |    |    |    |    |
| ADD |   | F | * | * | * | D | X | M | W |    |    |    |    |    |    |    |
| L.D |   |   |   |   |   | F | * | * | * | D  | X  | M  | W  |    |    |    |
| ADD |   |   |   |   |   |   |   |   |   | F  | *  | *  | *  | D  | X  | M  |

Instruction reach WB at clock cycles 5, 9, 13, 17, 21 ...

Last ADD instruction reaches WB in

$$5 + (1999 \times 4) = 8001 \text{ cycles}$$

$$\text{CPI} = 8001 / 2000 = 4.0005$$

With operand forwarding:

|     | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|-----|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|----|
| L.D | F | D | X | M | W |   |   |   |   |    |    |    |    |    |    |    |
| ADD |   | F | * | D | X | M | W |   |   |    |    |    |    |    |    |    |
| L.D |   |   |   | F | D | X | M | W |   |    |    |    |    |    |    |    |
| ADD |   |   |   |   | F | * | D | X | M | W  |    |    |    |    |    |    |

Every ADD after LD has a stall,

BUT LD after ADD don't have a stall.

Instruction reach WB at clock cycles 5, 7, 8, 10, 11, 13

Last ADD instruction reaches WB in

$$7 + (999 \times 3) = 3004 \text{ cycles}$$

$$\therefore \text{CPI} = 3004 / 2000 = 1.502$$

## Control Hazard

control hazards on branches ; three stage stall

|                                  | 1  | 2  | 3  | 4  | 5  | 6 | 7 | 8 | 9 |
|----------------------------------|----|----|----|----|----|---|---|---|---|
| 10: beq \$t1, \$t2, -86          | IF | ID | EX | ME | WB |   |   |   |   |
| 14: and \$t2, \$t3, \$t5         | IF | ID |    |    |    |   |   |   |   |
| 15: or \$t3, \$t6, \$t7          |    | IF | ID |    |    |   |   |   |   |
| 22: add \$t8, \$t9, \$t9         |    |    | IF | ID |    |   |   |   |   |
| 36: xor \$t9, \$t10, \$t1, \$t11 |    |    |    |    | IF |   |   |   |   |

## Branch Optimized MIPS Pipeline



- એથાલ, ins\_10 એ branch instruction નું હોય એવું જીએ આજાન equal કરું એટે 36 no ins' એ jump પણ હશે, એથન 2, 3 ઓ 10 એવા જીએ equal કરું વિના હોય અને આજાન કોનઈ પાછાબ ins\_10 એવું MEM stage બન્ધ સહેત અર્થાં 4th cycle બન્ધ થાયે, આદિલ 5th cycle એ 36 no instruction JF stage એ હશે, વિષ્ણુ એવું માર્કે આજાન 3 ટો instruction pipe line નું દુલ્યું હોય હોય. આદિલ એથાલ 3 ટો stall રજા.
  - એથન branch pipeline optimize કરું. instruction જો jump વળ્યા વિના એટો આમના ID stage એવું check વળ્યું ફેન્ડ માટ્ટું.

## # Four Branch Hazard Alternatives

▷ stall until branch direction is clear:

- The moment you come to know that the instruction that is fetched is a branch, then you may have to stall until the branch direction is clear.

▷ Predict Branch not taken:

- Execute successive instruction in sequence.
  - “squash” instructions in pipeline if branch actually taken

|                    | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|--------------------|---|---|---|---|---|---|---|---|---|
| Untaken branch ins | F | D | X | M | W |   |   |   |   |
| ins i+1            |   | F | D | X | M | W |   |   |   |
| ins i+2            |   |   | F | D | X | M | W |   |   |
| ins i+3            |   |   |   | F | D | X | M | W |   |
| ins i+4            |   |   |   |   | F | D | X | M | W |

|                  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|------------------|---|---|---|---|---|---|---|---|---|
| Taken branch ins | F | D | X | M | W |   |   |   |   |
| ins i+1          |   | F | * | * | * | * |   |   |   |
| Branch target    |   |   | F | D | X | M | W |   |   |
| Branch target +1 |   |   |   | F | D | X | M | W |   |
| Branch target +2 |   |   |   |   | F | D | X | M | W |

#### ▷ Predict Branch Taken :

- But branch target address in is not known by IF stage.
- Target is known at same time as branch outcome (ID stage)
- MIPS still incurs 1 cycle branch penalty.

## ▷ Delayed Branch :

- Define branch to take place 'after' one instruction following the branch instruction.
- 1 slot delay allows proper decision and branch target address in 5 stage pipeline.
- Branch instruction প্রাপ্ত পদ্ধতি prediction করে দ্বিতীয় cycle ভাসা হলে, তারের পরের cycle এ একই দ্বিতীয় instruction উপর করণ হয়, যাতে branch ins এর outcome প্রাপ্ত হলে dependency না থাকে,
- Branch ins এর পরের পরের জায়গাটাকে delay slot বলে।
- Where to get instruction to fill branch delay slot?

|                       | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-----------------------|---|---|---|---|---|---|---|---|---|
| Untaken branch ins    | F | D | X | M | W |   |   |   |   |
| Branch delay ins(i+1) |   | F | D | X | M | W |   |   |   |
| ins (i+2)             |   |   | F | D | X | M | W |   |   |
| ins (i+3)             |   |   |   | F | D | X | M | W |   |
| ins (i+4)             |   |   |   |   | F | D | X | M | W |

|                       | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-----------------------|---|---|---|---|---|---|---|---|---|
| Taken branch ins      | F | D | X | M | W |   |   |   |   |
| Branch delay ins(i+1) |   | F | D | X | M | W |   |   |   |
| Branch target         |   |   | F | D | X | M | W |   |   |
| Branch target +1      |   |   |   | F | D | X | M | W |   |
| Branch target +2      |   |   |   |   | F | D | X | M | W |

## ~~\* Filling Branch Delay slot :~~

### (a) From before :



### (b) From target :



### (c) From fall through :



## Answers to the q. no - 9



| SI No | Last Outcome | BHT N/T   | Prediction | Outcome | Mis-Prediction Y/N? |
|-------|--------------|-----------|------------|---------|---------------------|
| 1     | N (initial)  | (00) / 11 | N          | T       | Yes                 |
| 2     | T            | 01 / (11) | T          | T       | No =                |
| 3     | T            | 01 / (11) | T          | N       | Yes                 |
| 4     | N            | (01) / 10 | N          | N       | No =                |
| 5     | N            | (00) / 10 | N          | T       | Yes                 |
| 6     | T            | 01 / (10) | T          | N       | Yes                 |
| 7     | N            | (01) / 00 | N          | T       | Yes                 |
| 8     | T            | 11 / (00) | N          | T       | Yes                 |
| 9     | T            | 11 / (01) | N          | T       | Yes                 |
| 10    | T            | 11 / (11) | T          | N       | Yes                 |
| 11    | N            | (11) / 10 | T          | T       | No =                |
| 12    | T            | 11 / (10) | T          | N       | Yes                 |
| 13    | N            | (11) / 00 | T          | T       | No =                |
| 14    | T            | 11 / (00) | N          | T       | Yes                 |
| 15    | T            | 11 / (01) | N          | N       | No =                |
| 16    | N            | (11) / 00 | T          | T       | No =                |

## Multi-cycle Operations

- Can EXE stage always complete the operation in 1 cycle?
- Some operations require more than 1 cycle to complete.
  - Floating Point / Integer Multiply
  - Floating Point / Integer Divide
  - Floating Point Add / Sub
- Dedicated hardware units are available on the processor for performing these operations.



DATA REGISTER (D)

BW 24 X3 66 77 660  
BW 32 X3 66 77 660

FPU DATA REGISTER (F)

FPU DATA REGISTER (F)

| Functional Unit              | Latency | Initiation interval |
|------------------------------|---------|---------------------|
| 1. Integer ALU               | 0       | 1                   |
| 2. Data memory (FP/int Load) | 1       | 1                   |
| 3. FP add/sub                | 3       | 1                   |
| 4. FP/int multiply           | 6       | 1                   |
| 5. FP/int divide             | 24      | 25                  |

\* Latency: The number of intervening cycles between an instruction that produces a result and an instruction that uses the result.

- Functional unit instruction এবং Ex ক্লক ইওয়ার্ক পর্যন্ত আরেকটি instruction ক্লক ইওয়ার্ক করার পর কাউন্টার ক্লক cycle wait করতে।

\* Initiation interval: The number of cycles that must elapse between issuing two operations of a given type.

- Functional unit instruction এবং Ex ক্লক ইওয়ার্ক করে cycle পর্যন্ত... same functional unit instruction এবং Ex ক্লক ইওয়ার্ক করবে,

① Integer ALU:

add : IF ID EX ME WB

ins : IF ID EX ME WB

Latency - 0 ; wait করতে দুর্বল।

Initiation interval - 1 ; 1 টি clock cycle পরেই ক্লক ইওয়ার্ক

### ③ Data memory (FP / int Load) :

Lw : IF ID EX ME WB

Ins : - IF ID - EX ME WB

Latency - 1 ; 1 টি clock cycle wait করতে হবে ,  
ক্ষমতা Load ins এর value টি ME stage  
এতে পরে পাওয়া যায় .

LW : IF ID EX ME WB

LW : IF ID EX ME WB

Initiation interval - 1 ; 1 টি clock cycle পরেই পরের  
Load execute হতে পারে ,

### ③ FP add / sub :

FP add : IF ID A<sub>1</sub> A<sub>2</sub> A<sub>3</sub> A<sub>4</sub> ME WB

Ins : IF ID - - - Ex ME WB

Latency - 3 ; 3 টি clock cycle wait করতে হবে ,  
ক্ষমতা FP add ins এর A<sub>4</sub> এর ঘটে value  
be ready হবে .

FP add : IF ID A<sub>1</sub> A<sub>2</sub> A<sub>3</sub> A<sub>4</sub> ME WB

FP add : IF ID A<sub>1</sub> A<sub>2</sub> A<sub>3</sub> A<sub>4</sub> ME WB

Initiation interval - 1 ; 1 টি clock cycle পরেই পরের  
FP add ins execute হতে পারে। FP Add  
and FP sub ins are divided into smaller box . they are  
internally pipelined . A<sub>1</sub> এর output A<sub>2</sub>  
use করে , A<sub>2</sub> এর output A<sub>3</sub> ... এভেন্য  
WB শিরী 1st FP add পর্যন্ত A<sub>2</sub> stage এ 2nd FP add  
ওপরে A<sub>1</sub> stage এ করতে পারে ,

#### ④ FP/Int Multiply :

FP Mul : IF ID M<sub>1</sub> M<sub>2</sub> M<sub>3</sub> M<sub>4</sub> M<sub>5</sub> M<sub>6</sub> M<sub>7</sub> ME WB  
 Ins : IF ID — — — — — EX ME WB

Latency - 6 ; 6 টি clock cycle wait করতে হবে; কারণ  
 Mul ins এর M<sub>7</sub> এর পর value টি ready হবে,

FP Mul : IF ID M<sub>1</sub> M<sub>2</sub> M<sub>3</sub> M<sub>4</sub> M<sub>5</sub> M<sub>6</sub> M<sub>7</sub> ME WB

FP Mul : IF ID M<sub>1</sub> M<sub>2</sub> M<sub>3</sub> M<sub>4</sub> M<sub>5</sub> M<sub>6</sub> M<sub>7</sub> ME WB

Initiation interval - 1 ; 1 টি clock cycle পরেই FP Mul instruction  
 execute হতে পারে, কারণ, FP Mul এর ALU  
 stage এর sub-stage শুধু আবাস internally  
 pipelined. তাই প্রথম 1st ins M<sub>1</sub> এ, 3rd ins  
 অর্থাৎ M<sub>1</sub> stage এ আবাস পারে।

#### ⑤ FP/Int Divide :

Div : IF ID D<sub>1</sub> D<sub>2</sub> D<sub>3</sub> — — D<sub>24</sub> D<sub>25</sub> ME WB

Ins : IF ID — — — — — EX ME WB.

Latency - 24 ; প্রথম ins এর Execution শুরু করতে 24 টি  
 clock cycle wait করতে হবে, কারণ Div ins  
 এর D<sub>25</sub> এর পর value টি ready হবে।

Div : IF ID D<sub>1</sub> D<sub>2</sub> D<sub>3</sub> — — D<sub>24</sub> D<sub>25</sub> { ME WB }

Div : IF ID : D<sub>1</sub> D<sub>2</sub> — — D<sub>25</sub>

Initiation interval - 25 ; 1st Div ins শুরু হওয়ার  
 25 টি clock cycle পরে 2nd Div ins  
 শুরু হতে পারে, কারণ, Division stage  
 টি internally pipelined না। পুরো 25  
 stage কেষ না। যাই এই div stage  
 দ্বিতীয় ins use করতে পারে না,

## ↳ Issues in longen latency Pipelines :

- Since divide unit is not-pipelined, structural hazard occurred.

Div 1 : IF ID D<sub>1</sub> D<sub>2</sub> ... D<sub>25</sub> ME WB

Div 2 : IF ID - - - - D<sub>1</sub> D<sub>2</sub> ... D<sub>25</sub>

→ To eliminate structural hazard more stall is required.

- Instruction have varying runtimes — more register writes in the same cycle.

সময় ক্ষেত্রে WB বিভিন্ন instruction এর WB same clock cycle এ আলো।

|                                                        | 1 | 2 | 3              | 4              | 5              | 6              | 7              | 8              | 9              | 10 | 11 |
|--------------------------------------------------------|---|---|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----|----|
| MULD F <sub>0</sub> , F <sub>4</sub> , F <sub>6</sub>  | F | D | M <sub>1</sub> | M <sub>2</sub> | M <sub>3</sub> | M <sub>4</sub> | M <sub>5</sub> | M <sub>6</sub> | M <sub>7</sub> | M  | W  |
| - - -                                                  |   | F | D              | X              | M              | W              |                |                |                |    |    |
| - - -                                                  |   |   | F              | D              | X              | M              | W              |                |                |    |    |
| ADD.D F <sub>2</sub> , F <sub>4</sub> , F <sub>6</sub> |   |   |                | F              | D              | A <sub>1</sub> | A <sub>2</sub> | A <sub>3</sub> | A <sub>4</sub> | M  | W  |
| - - -                                                  |   |   |                |                | F              | D              | X              | M              | W              |    |    |
| - - -                                                  |   |   |                |                |                | F              | D              | X              | M              | W  |    |
| L.D F <sub>2</sub> , O(R <sub>2</sub> )                |   |   |                |                |                |                | F              | D              | X              | M  | W  |

→ Single port write port (serialize completion) vs multiple write ports.

- Resolve write port conflicts in ID stage and stall issue by
  - stall either of the instruction (priority basis) at MEM/WB stage

► WAW hazards possible.

দ্বিতীয় ins মাদি আগের ins সত্ত্বেও আগের  
register এ write করে যাবে।

|                                                        | 1 | 2 | 3              | 4              | 5              | 6              | 7              | 8              | 9              | 10  | 11 |
|--------------------------------------------------------|---|---|----------------|----------------|----------------|----------------|----------------|----------------|----------------|-----|----|
| MULD F <sub>0</sub> , F <sub>4</sub> , F <sub>6</sub>  | F | D | M <sub>1</sub> | M <sub>2</sub> | M <sub>3</sub> | M <sub>4</sub> | M <sub>5</sub> | M <sub>6</sub> | M <sub>7</sub> | M   | W  |
| ...                                                    |   | F | D              | X              | M              | W              |                |                |                |     |    |
| ADD.D F <sub>2</sub> , F <sub>4</sub> , F <sub>6</sub> |   |   | F              | D              | X              | M              | W              |                |                |     |    |
| ...                                                    |   |   | F              | D              | A <sub>1</sub> | A <sub>2</sub> | A <sub>3</sub> | A <sub>4</sub> | M              | (W) |    |
| L.D F <sub>3</sub> , D(R <sub>2</sub> )                |   |   |                | F              | D              | X              | M              | W              |                |     |    |
|                                                        |   |   |                | F              | D              | X              | M              | (W)            |                |     |    |

⑤ → - WAW hazard at register F<sub>2</sub>.

- Delay issue (ID → EX) of L.D. until ADD.D enters MEM stage.

|       |  |  |  |   |   |                |                |                |                |   |   |
|-------|--|--|--|---|---|----------------|----------------|----------------|----------------|---|---|
| ADD.D |  |  |  | F | D | A <sub>1</sub> | A <sub>2</sub> | A <sub>3</sub> | A <sub>4</sub> | M | W |
| ...   |  |  |  | F | D | X              | M              | W              |                |   |   |
| L.D   |  |  |  | F | D | -              | -              | X              | M              | W |   |

- Keep the result of ADD.D and give it to needed instruction.

- Hence only L.D will write on F<sub>2</sub>.

- If any instruction in A<sub>1</sub>..A<sub>4</sub>, M<sub>1</sub>..M<sub>7</sub> has the same destination as the instruction in ID and the time at which they reach WB is same, delay issue by 1 cycle and repeat.

► ○ RAW hazards possible.

• Stall for RAW hazards will be more.

|                                                        | 1 | 2 | 3 | 4              | 5              | 6              | 7              | 8              | 9              | 10             | 11             | 12             | 13             | 14 | 15 | 16 | 17 |
|--------------------------------------------------------|---|---|---|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----|----|----|----|
| L.D F <sub>1</sub> , O(R <sub>2</sub> )                | F | D | X | M              | W              |                |                |                |                |                |                |                |                |    |    |    |    |
| MUL.D F <sub>0</sub> , F <sub>4</sub> , F <sub>6</sub> | F | D | * | M <sub>1</sub> | M <sub>2</sub> | M <sub>3</sub> | M <sub>4</sub> | M <sub>5</sub> | M <sub>6</sub> | M <sub>7</sub> | M              | W              |                |    |    |    |    |
| ADD.D F <sub>2</sub> , F <sub>0</sub> , F <sub>8</sub> | F | * | D | *              | *              | *              | *              | *              | *              | A <sub>1</sub> | A <sub>2</sub> | A <sub>3</sub> | A <sub>4</sub> | M  | W  |    |    |
| S.D F <sub>2</sub> , O(R <sub>2</sub> )                |   |   |   | F              | *              | *              | *              | *              | *              | D              | X              | *              | *              | *  | M  |    |    |

⇒ - If the source of an instruction in ID is F<sub>i</sub> then F<sub>i</sub> should be there as the name of destination of instruction in ID/A<sub>1</sub>, A<sub>1</sub>/A<sub>2</sub>, A<sub>2</sub>/A<sub>3</sub> and ID/M<sub>1</sub>, M<sub>1</sub>/M<sub>2</sub>, ... M<sub>6</sub>/M<sub>7</sub>.

- RAW hazard অট্টবাটে স্টল স্টল রয়ে থাক।

Answers to the q. no - 10

LOAD  $F_4, 8(R_2); \quad F_4 = X$   
 FMUL  $F_0, F_4, F_2 \quad F_0 = AX \quad [F_2 = A]$   
 FADD  $F_3, F_0, F_2; \quad F_3 = AX + A$   
 STOR  $F_3, 16(R_3);$

So,  $AX + A$  will be stored in  $16(R_3)$

|      | 1 | 2 | 3 | 4              | 5              | 6              | 7              | 8              | 9              | 10             | 11 | 13             | 13             | 14             | 15             | 16 | 17 | 18 |
|------|---|---|---|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----|----------------|----------------|----------------|----------------|----|----|----|
| LOAD | F | D | X | M              | W              |                |                |                |                |                |    |                |                |                |                |    |    |    |
| FMUL | F | D | * | M <sub>1</sub> | M <sub>2</sub> | M <sub>3</sub> | M <sub>4</sub> | M <sub>5</sub> | M <sub>6</sub> | M <sub>7</sub> | M  | W              |                |                |                |    |    |    |
| FADD |   | F | * | D              | *              | *              | *              | *              | *              | *              | *  | A <sub>1</sub> | A <sub>2</sub> | A <sub>3</sub> | A <sub>9</sub> | M  | W  |    |
| STOR |   |   |   | F              | *              | *              | *              | *              | *              | *              | *  | D              | *              | *              | *              | X  | M  |    |