

## Tutorial 9

### CSE 112 Computer Organisation

---

Q1. Consider a four-stage pipeline with the stages, Fetch (F), Decode (D), Execute (E), and Writeback (W). The description of the stages is given as follows:

1. Fetch - Fetches the instructions from instruction memory
2. Decode - Decodes the instructions and type of operations. It also reads the input operands. If the input operands are not present, the decode stage is stalled to read the operands after they are made available.
  - o The address of the unconditional branch is assigned to PC at the end of the decode stage.
3. Execute - Executes the instruction in the ALU
4. Writeback - Updates the register value at the end of the stage.

Considering the above stages, determine the following program's CPI (Cycles Per Instruction).

1. ADD x1, x2, x3
2. SUB x4, x1, x2
3. Jal x0, 5 // branch to instruction 5
4. SUB x5, x6, x7
5. ADD x4, x5, x6

Ans:

|                | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|----------------|---|---|---|---|---|---|---|---|---|----|
| ADD x1, x2, x3 | F | D | E | W |   |   |   |   |   |    |
| SUB x4, x1, x2 |   | F | D | D | D | E | W |   |   |    |
| Jal x0, 5      |   |   | F | F | F | D | E | W |   |    |
| SUB x5, x6, x7 |   |   |   |   |   | F | - | - | - | -  |
| ADD x4, x5, x6 |   |   |   |   |   |   | F | D | E | W  |

$$\text{CPI} = (10 \text{ Cycles}) / (5 \text{ Instructions}) = 2$$

**Note to TAs: Explain the concept of control hazard and how instructions are cleared from the pipeline in case of a branch taken.**

Control hazards arise from the pipelining of branches and other instructions that change the PC. If the branch instruction is determined to be taken, the instructions that have entered the pipeline after the branch need to be cleared or flushed from the pipeline. This is necessary to ensure that incorrect instructions, which were fetched and partially executed based on the wrong assumption, do not impact the correct control flow of the program.

Q2. Consider two different machines. The first has a single-stage, non-pipelined machine with a cycle time of 15 ns. The second is a pipelined machine with 5 pipeline stages and a cycle time of 3ns.

- What is the speedup of the pipelined machine versus the single-cycle machine, assuming there are no stalls?
- What is the speedup of the pipelined machine versus the single cycle machine if the pipeline stalls 1 cycle for 25% of the instructions?
- Now consider a 4-stage pipeline machine with a cycle time of 3.1 ns. Again, assuming no stalls, is this implementation faster or slower than the original 5-stage pipeline? Explain your answer.

Ans:

- The speedup is  $15 \text{ ns} / 3 \text{ ns} = 5$
- New CPI =  $1 + .25 \times 1 = 1.25$

Since the number of instructions is the same, the speedup is

$$\frac{\text{CPI(old)} \times \text{Cycletime (old)}}{\text{CPI(new)} \times \text{Cycletime (new)}} = \frac{1 \times 15}{1.25 \times 3} = 4$$

- The 5-stage machine is faster. This is because it has a shorter cycle time, which results in a faster overall execution time (since there are no stalls, they both have the same CPI).

Q3. Consider a processor A with a clock frequency of 1 MHz. The processor executes one instruction per cycle.

- Determine the clock period of processor A and the time needed to implement 197 instructions.
- Processor A is now upgraded to a four-stage pipelined processor, say fetch [F], Decode [D], Execute [E], and Writeback [W], which have delays of 180 ns, 190 ns, 140 ns and 170 ns, respectively. The registers between the pipeline stages have a delay of 10 ns (**Note:** This time includes sequential overheads). Determine the new clock period of the upgraded processor.
- Determine the time needed to implement 197 instructions after the upgradation of the processor. Assume no hazards are present.
- Determine the speedup obtained by upgrading the processor [Hint: Take the ratio of time of 197 instructions obtained in parts a and c].

Ans:

- Clock period =  $1/1 \text{ MHz} = 10^{-6} \text{ s} = 1 \text{ us} = 1000 \text{ ns}$   
As the processor executes one instruction in one clock cycle, the time needed to implement 197 instructions = 197 us
- Cycle time = max(stage delay) + register delay  
 $= \max(180, 190, 140, 170) + 10 = 190 + 10 = 200 \text{ ns}$
- Pipeline time to process 197 data items  
 $= \text{Time taken for 1st data item} + \text{Time taken for remaining 196 data items}$   
 $= 1 \times 4 \text{ clock cycles} + 196 \times 1 \text{ clock cycle}$   
 $= 4 \times \text{cycle time} + 196 \times \text{cycle time}$   
 $= 200 \times \text{cycle time} = 200 \times 200 \text{ ns} = 40 \times 10^3 \text{ ns} = 40 \text{ us}$

Or

At a steady state, the processor provides output every 1 clock cycle. Hence, time needed =  $197 * 200 \text{ ns} = 39.4 \text{ us}$

- 

**Ans 1:**

As speed up is inversely proportional to time, speed up obtained =  $197 \text{ us} / 40 \text{ us} = 4.925$

OR

**Ans 2:**

Ratio of time of processor B and A is  $40/197 = 0.203$

Q4. The 5 stages of the processor have the following latencies:

|   | Fetch | Decode | Execute | Memory | Writeback |
|---|-------|--------|---------|--------|-----------|
| 1 | 300ps | 400ps  | 350ps   | 550ps  | 100ps     |
| 2 | 200ps | 150ps  | 100ps   | 190ps  | 140ps     |

Assume that when pipelining, each pipeline stage costs 20ps extra for the registers between pipeline stages. If you could split one of the pipeline stages into 2 equal halves, which one would you choose? What is the new cycle time? What is the new latency? What is the new throughput?

Ans: We would want to choose the longest stage to split in half. The new cycle time becomes the originally 2nd longest stage length. But the number of stages will become 6 instead of 5.

1. New cycle time =  $400 + 20 = 420 \text{ ps}$   
Latency =  $6 * 420 = 2520 \text{ ps}$   
Throughput =  $1/420 \text{ inst/ps} = 2.38 \times 10^9 \text{ inst/s}$
2. New cycle time =  $190 + 20 = 210 \text{ ps}$   
Latency =  $6 * 210 = 1260 \text{ ps}$   
Throughput =  $1/210 \text{ inst/ps} = 4.76 \times 10^9 \text{ inst/s}$

N.B. Pipelining reduces the cycle time to the length of the longest stage plus the register delay. Latency becomes  $CT * N$  where  $N$  is the number of stages as one instruction will need to go through each of the stages and each stage takes one cycle. The throughput is defined as  $1/CT \text{ inst/s}$ .