

## Control Path Implementation

### ALU Control

The MIPS ALU defines the 6 combinations of 4 control units

|   | ALU control lines | Function         |
|---|-------------------|------------------|
| 0 | 0 0 0 0           | AND              |
| 1 | 0 0 0 1           | OR               |
| 2 | 0 0 1 0           | add              |
| 3 | 0 1 1 0           | subtract         |
| 4 | 0 1 1 1           | set on less than |
| 5 | 1 1 0 0           | NOR              |

→ Depending on the instruction class, the ALU will need to perform one of those first five functions

load word & store word : ALU computes memory address by addition

R-Type instructions : ALU needs to perform AND, OR, subtract, add or set less than depending on the value of the 6-bit funct field in the instruction

Beq Instructions: ALU must perform a subtraction

\* Generating the 4-bit ALU control input

→ use a control unit that inputs the function field of the instruction and a 2 bit control field called ALUOp.

→ ALUop indicates the operation performed by the ALU

add for add and stores = 00

subtract for beq = 01

determined by operation = 10  
encoded in funct field

→ The ALUOp along w/ the funct field  $\Rightarrow$  4 bit signal which can directly control the operation of the ALU.

| Instruction<br>opcode | ALU OP | Instruction<br>operation | Funct    | desired<br>ALU<br>op<br>action | 4-bit<br>ALU<br>control input |
|-----------------------|--------|--------------------------|----------|--------------------------------|-------------------------------|
| LW                    | 0 0    | load word                | xx·xxxx  | add                            | 0010                          |
| SW                    | 0 0    | store word               | xx xx xx | add                            | 0010                          |
| Beq                   | 0 1    | branch equal             | xx xx xx | subtract                       | 00110                         |
| R-type                | 1 0    | add                      | 10 0 000 | add                            | 0010                          |
| R-type                | 1 0    | sub                      | 10 0 010 | subtract                       | 0110                          |
| R-type                | 1 0    | AND                      | 100100   | AND                            | <u>0000</u>                   |
| R-type                | 1 0    | OR                       | 100101   | OR                             | 0001                          |
| R-type                | 1 0    | set on less than         | 101010   | slt                            | 0111                          |

$$0 \rightarrow 10 \rightarrow 100 \rightarrow 101 \rightarrow \text{alt 0s 2 1s}$$

→ When the ALUOP is 00 or 01, the desired ALU action does not depend on the function code  $\rightarrow$  use don't cares

→ If ALUOP = 10, then the function code is used to determine the ALU control input

## \* Deriving ALU control signals using a truth table

| ALU OP | Func Field |    |    |    |    |    | Operation | from<br>prev<br>table |
|--------|------------|----|----|----|----|----|-----------|-----------------------|
|        | F5         | F4 | F3 | F2 | F1 | F0 |           |                       |
| ALUOP1 | ALUOP2     |    |    |    |    |    |           |                       |
| 0      | 0          | x  | x  | x  | x  | x  | 0010      | line1                 |
| 0      | 1          | x  | x  | x  | x  | x  | 0110      | line3                 |
| 1      | 0          | x  | x  | 0  | 0  | 0  | 0010      |                       |
| 1      | x          | x  | x  | 0  | 0  | 1  | 0110      |                       |
| 1      | 0          | x  | x  | 0  | 1  | 0  | 0000      |                       |
| 1      | 0          | x  | x  | 0  | 1  | 0  | 0000      |                       |
| 1      | 0          | x  | x  | 0  | 1  | 0  | 0000      |                       |
| 1      | x          | x  | x  | 1  | 0  | 1  | 0111      |                       |

{ 1>10, then  
order as  
1x  
10  
10  
1x

(From the previous table  
copy last 4-bits  
MSB 2 bits = don't care)

## \* Mapping of ALU control functions to gates



only 3 operation op  
since MSB is  
always 0

Operation

## Unit - 3

### Building Data Paths

Datapath → refers to elements that process data and addresses in the CPU.

There are 5 main steps to execute any instruction

Step 1: Fetch instruction

Step 2: Instruction decode and read registers

Step 3: ALU operation, branch address computation

Step 4: Lw / store in data memory

Step 5: Register write

### Components

① Instruction memory:



② Program Counter:



## Stage 1: Instruction Fetch

- Transfer contents of PC to MDR
- Choose a particular memory location
- Read instructions present in memory
- Load instructions into instruction memory
- Also increment the PC by 4 to execute the next instruction

## Stage 2: Decode

- Instruction present in IR will be decoded
- If data available for the operation is available in the register, it performs the operation

## Stage 3: Performing the Operation

This stage depends on the nature of instruction being executed.

### Type 1: R-Type Operations (add, sub, AND, OR, SLT)

e.g. add \$t1, \$a0, \$s2

- pick ~~instruction~~ source operands from instructions, specifically the corresponding reg. no
- Read the corresponding data from the 2 source registers
- Take read values to ALU
- Take result of ALU to the <write data> portion of the register file. Also extract the write register no from the instructions, to know which register to write into



### Type 2: Immediate Instructions

e.g. addi \$t0, \$s0, 4

- Initially, the same as the R-type instruction
- This time, take the constant value from the instruction memory
- Include a mux between the 2nd port to read data & the ALU.
- This allows to choose between a value from the register and a constant directly from the IR
- While taking the constant from the IM, and before passing to the ALU, include a sign-extender, to extend length from 16-32 bits



### Step4: Loading / Storing in Data memory

eg. `lw $1 $0 , 8($t0)`

↓  
Offset

→ Initially, the base address (here  $\$t0$ ) and the offset 8 have to be added.

→ The base address will be present in the register file and the offset comes from the immediate portion, that is (along with the mux)

→ The ALU is used to add the offset to the value in the base register to find the load address. It is now present at the ALU result section

→ Pass the address from the result section to the data memory to load the corresponding value at the other end.

→ Add a MUX after the data memory. This is so that it chooses what comes from the ALU result in case of arithmetic operations, & and chooses the value after the offset in case of load instructions

→ Write back the final result into the register file



## Step4: Pt & Store word

(5)

eq. sw \$a0, \$f(sp)

- Use ALU to add the offset to the value in the base register to find the store address.
- Pass the value of the second register as the data to be written



## Type3: Branch Instructions

eq. beq \$t0, \$s0, offset

- Use the ALU to check if the 2 source registers are equal or not equal (subtract & check if result is 0)
- Find target address =  $PC + 4 + \text{offset} \times 4$
- Offset available ~~available~~, since it is constant (i-type)
- Multiply offset by 4 (aka shift left by 2)
- Pass this, and the shift left result to an adder
- The result has to be sent back to PC ⇒ if beq occurs, then send  $(PC+4) + (\text{offset} \times 4)$ , otherwise send  $PC + 4$

→ This is decided using a MUX.

### Final Diagram



### Type 4 : J Format

e.g. j f1

→ There are 6 bits in the opcode

→ There are 26 bits in the target address

→ There is not enough space in the instruction to specify a full target address

## Solution

7

- Take the 26 bit target address field, left shift by 2
- concatenate with the result of the upper 4 bits of PC + 4.



## 6 Implementing Control Paths for different Instruction Types



### A. R-Type Instruction (add, sub, AND, OR, SLT)

eq. add \$t1, \$t2, \$t3

#### Step 1: Fetch

- Transfer contents of PC to MAR
- Choose a particular memory location
- Read instruction

#### Step 2: Decode → Instruction present in IR is decoded

- Instruction Increment PC by 4

#### Step 3: Execute

- Pick source operands' reg. no from instruction
- Read corresponding data from source registers

→ The main control unit computes the setting of control lines

(5)

→ The ALU operates on the data read from the register file, using the funct field.

#### Step 4: Write back

→ The write register no. from the instructions is extracted to know which reg. to write into.

→ The result from the ALU is written into the register file (bits 15:11), after selecting the destination register.

#### B. Load Word Instruction



#### Step 1 & 2 - Write from R-type

→ A register value is read from the register file

→ The ALU computes the sum of the value read from the reg. file & the sign extended, lower 16 bits of the instruction (offset)

→ The sum from the ALU is used as the address for the data memory:

→ The data from the memory unit is written into the register file, the register destination is given by bits Q0:16.

### c. Branch Instructions

\* Why is a single-cycle implementation not used today?

Ans. → In single-cycle design, the clock cycle must have the same length for every instruction

- The clock cycle is determined by the longest possible path in the processor (Load instruction, using 5 function units in series - the instruction memory, the register file, the ALU, the data memory and the register file)
- The clock cycle becomes too long, performance is very poor.

Solution - use a multicycle implementation

each step in the execution will take 1 clock cycle  
allows a functional unit to be used more than once per instruction

### Pipelines

- Pipelining is a process of decomposing a sequential process into sub operations, with each sub operation being executed in a separate dedicated segment that operates concurrently with all other segments
- involves the overlapping of multiple instructions
- lowers total execution time by increasing instruction throughput
- It does not reduce the execution time of an individual instruction.

9

→ Each step in the pipeline completes a part of an instruction..

Each step is called a pipeline stage or pipeline segment.

→ In an ideal case, one instruction can finish executing on every clock cycle.

$$\boxed{\text{Potential speedup} = \text{no. of pipeline stages}}$$

\* Pipeline Throughput = no. of instructions completed per second

\* Pipeline latency = how long does it take to execute a single instruction in the pipeline

\* Pipeline Speedup

If all the stages are balanced (each stage takes the same amount of time)

Time taken to execute an instruction  
pipeline

$$= \frac{\text{Time taken to execute an instruction sequentially}}{\text{no. of stages}}$$

→ If not balanced, speedup is less.

\* Pipeline Performance

Let there be k segments in a pipeline.

There are n tasks to be performed

The cycle time is tp

No. of cycles to get the first output from the pipeline = k

Time needed for the first task = ktp

The remaining tasks can be performed in  $(k-1)$  cycles.

$$\therefore \text{No. of cycles needed for } n \text{ tasks} = k-1+n \\ = k+n-1$$

Time needed for performing  $n$ -tasks using the pipelined processor

$$= \underline{\underline{(k+n-1) * tp}}$$

Time taken by a non-pipelined processor

$$= tn \Rightarrow \text{for } n \text{ tasks} = \underline{\underline{ntn}}$$

$\therefore$  The speedup of the pipelined processor over non-pipelined processor is

$$S = \frac{ntn}{(k+n-1) * tp}$$

when  $n$  is large  $k+n-1 \rightarrow n$

$$\Rightarrow S = \frac{ntn}{ntp} \approx \frac{tn}{tp}$$

If we assume that the cycle time of a pipelined & non-pipelined Processor is  $tp$

$$\Rightarrow tn = k + p \quad (\text{1 task's time} = k \text{ cycles} * tp_{\text{pipeline}})$$

$$\Rightarrow \frac{tn}{tp} = S = k$$

S = k

$\Rightarrow$  In ideal case, the maximum speedup that can be provided by the pipeline = depth of the pipeline

## \* Pipelining in NIPS

(11)

→ There are 5 stages, one step per stage

1. IF : Instruction fetch from memory

2. ID : Instruction decode

3. EX: Execute operation

4. MEM: access memory operand

5. WB: write back to register



## \* Pipeline Hazards

→ Situations that prevent the next instruction in the instruction stream from being executed during its designated clock cycle.

→ reduce performance from the ideal speedup gained during Pipelining

## \* Types of Hazards

(i) Structural Hazards

(ii) Data Hazards

(iii) Control Hazards

Structural Hazards → arise from resource conflicts

→ H/W cannot support all possible combinations of instructions

Data Hazards → occur when instruction depends on data from an instruction ahead of it in pipeline

Control Hazards → results from branches / other instructions that change the flow of the program

## \* Dealing w/ Hazards and the Consequences

→ pipeline must be stalled

→ stalling pipeline usually temporarily halts all instructions / lets some other instructions in the pipeline proceed , while the stalled inst waits for data, resources etc.

### Consequences

→ stalls impede progress of pipeline & result in a deviation from 1 instruction executing / clock cycle .

CPI pipelined =

= ideal CPI + pipeline stall cycles per instruction

= 1 + pipeline stall cycles per instruction

Speedup = CPI unpipelined

(b)

$1 + \text{pipeline stall cycles per instruction}$

## ① Structural Hazards

→ when not all possible combinations of instructions can be executed, structural hazards occur

→ solution (i) duplicate resources

e.g. an ALU to perform an arithmetic exception  
and an adder to increment the PC

(ii) introduce stalls / bubbles

Example



Solution: introduce a stall



\* LOAD instruction steals an instruction fetch cycle, causing 1 stall.

## ② Data Hazards

→ occurs when an instruction depends on data from an instruction ahead of it in the pipeline

### Example

ADD R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub>  
SUB R<sub>4</sub>, R<sub>1</sub>, R<sub>5</sub>  
AND R<sub>6</sub>, R<sub>1</sub>, R<sub>7</sub>  
OR R<sub>8</sub>, R<sub>1</sub>, R<sub>9</sub>  
XOR R<sub>10</sub>, R<sub>1</sub>, R<sub>11</sub>

All instructions after ADD use the result of ADD

ADD writes into the register in the WB stage, but SUB needs it in the IP stage

⇒ data hazard

### Solution: Forwarding

→ Forwarding occurs when a result is passed directly to a functional unit that requires it

→ Result goes from the output of one unit to input of another w/ forwarding, the example above becomes:

ADD R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub>  
SUB R<sub>4</sub>, R<sub>1</sub>, R<sub>5</sub>  
AND R<sub>6</sub>, R<sub>1</sub>, R<sub>7</sub>  
OR R<sub>8</sub>, R<sub>1</sub>, R<sub>9</sub>  
XOR R<sub>10</sub>, R<sub>1</sub>, R<sub>11</sub>



## Example (where forwarding doesn't work)

$lw \quad R_1, 0(R_2)$   
 $sub \quad R_4, \underline{R_1}, R_5$   
 $and \quad R_6, \underline{R_1}, R_7$   
 $or \quad R_8, R_1, R_9$



Here, can't get data to the sub instruction, because the result is needed at the beginning of clock cycle 4, but it is not produced until the end of clock cycle 4.

→ use a stall instead

## \* Types of Data Hazards

### ① RAW (Read after Write)

Instruction J tries to read data before instruction I writes it

$$\begin{aligned} I: R_2 &\leftarrow R_1 + R_3 \\ R_4 &\leftarrow R_2 + R_3 \end{aligned}$$

### ② Write after Read (WAR)

Instruction J tries to write data before instruction I reads it.

$$I: R_2 \leftarrow R_1 + R_3$$

$$J: R_3 \leftarrow R_4 + R_5$$

### ③ Write after Write (WAW)

Instruction J tries to write output before instruction I writes it

$$I: R_2 \leftarrow R_1 + R_3$$

$$J: R_2 \leftarrow R_4 + R_5$$

## Examples

① LW R<sub>1</sub>, 45(R<sub>2</sub>)

ADD R<sub>5</sub>, R<sub>1</sub>, R<sub>7</sub>

SUB R<sub>8</sub>, R<sub>6</sub>, R<sub>7</sub>

OR R<sub>9</sub>, R<sub>6</sub>, R<sub>7</sub>

⇒ requires 1 stall because of Leo  
just before add

② LW R<sub>1</sub>, 45(R<sub>2</sub>)

ADD R<sub>5</sub>, R<sub>6</sub>, R<sub>7</sub>

SUB R<sub>8</sub>, R<sub>1</sub>, R<sub>7</sub>

OR R<sub>9</sub>, R<sub>6</sub>, R<sub>7</sub>

⇒ can use forwarding, result of LOAD  
can be forwarded to ALU before sub  
instruction



③ LW R<sub>1</sub>, 45(R<sub>2</sub>)

ADD R<sub>5</sub>, R<sub>6</sub>, R<sub>7</sub>

SUB R<sub>8</sub>, R<sub>6</sub>, R<sub>7</sub>

OR R<sub>9</sub>, R<sub>1</sub>, R<sub>7</sub>



no action required

read from Req occurs in the second half

write of R<sub>1</sub> happens in the first half.

Note: Register writes happen in the first half of the  
clock cycle, read happens in the second half  
(can access memory twice in 1 cc)

(4)

ADD R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub>SUB R<sub>4</sub>, R<sub>1</sub>, R<sub>5</sub>AND R<sub>6</sub>, R<sub>1</sub>, R<sub>7</sub>OR R<sub>8</sub>, R<sub>1</sub>, R<sub>9</sub>

(5)

ADD R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub>SUB R<sub>4</sub>, R<sub>1</sub>, R<sub>5</sub>AND R<sub>6</sub>, R<sub>1</sub>, R<sub>7</sub>OR R<sub>8</sub>, R<sub>1</sub>, R<sub>9</sub>

The req write of R<sub>1</sub> happens first in CE's and the req read of R<sub>1</sub> for the OR operation happens in the second half of the 5th CE.

### (3) Control Hazards

→ Control hazards arise from the need to make a decision based on the results of one instruction, while others are executing  
(occur less frequently than data hazards)

#### Schemes to Resolve Control Hazards

##### A. Assume Branch not Taken

- Stalling until the branch is complete is too slow
- An improvement over branch stalling is to predict that the branch will not be taken, and thus continue execution down the sequential instruction stream.
- If the branch is taken, the instructions that are being fetched and decoded must be discarded.

- If the branches are untaken half the time, and if it costs little to discard the instructions, this optimization halves the cost of control hazards.
- To discard instructions, change the original control values to 0.

### B. Reducing the delay of Branches

- Move branch execution earlier in the pipeline  $\Rightarrow$  fewer instructions need to be flushed.

#### \* Dynamic Branch Prediction

- Look up the address of the instruction to see if a branch was taken the last time this instruction was executed, and, if so, begin fetching instructions from the same place as last time. This technique is called dynamic branch prediction.

#### Implementation

##### 1-bit Branch prediction

- use a branch prediction buffer or branch history table.
- A branch prediction buffer is a small memory indexed by the lower portion of the address of the branch instruction.
- The memory contains a bit that says whether the branch was recently taken or not.

Shortcoming: Even if a branch is almost always taken, we predict incorrectly twice, rather than once.

Q9. For . . . .

For ( $i=0$ ;  $i < 9$ ;  $i++$ )

$a[i] = a[i] * 2;$

Mispredictions: First & last loop iterations

(i) last iteration - prediction bit will indicate taken, as the branch has been taken 9 times in a row.

(ii) first iteration: bit was flipped on prior execution of the last execution, since the branch was not taken on that exiting iteration.

accuracy: branch taken 90% of the time

accuracy = 80%. (2 incorrect predictions + 8 correct ones)



## \* Exceptions

→ exceptions: unexpected events requiring change in flow of control

exception vs. interrupt



from an external I/O controller

arises

within CPU

(undefined opcode,  
overflow, syscall)

## \* Handling Exceptions

- In MIPs, exceptions are managed by a System Control Coprocessor
  - Save the program counter of the interrupted instruction (In MIPs - Exception Program Counter (EPC))
  - Save indication of the problem
    - (i) the cause register
    - (ii) assume 1 bit:
      - 0 → undefined opcode
      - 1 → overflow
  - Jump to exception handler

## Method 2: use vectored interrupts

- the address to which control is transferred is determined by the cause of the exception
- instructions deal w/ the interrupt / jump to real handler

## ~~Answers~~

### What does the handler do?

- determines action to be taken

If restartable :

- (i) take corrective action
- (ii) use EPC to return to program

Otherwise

- (i) terminate program
- (ii) report error using EPC and cause

## \* Exceptions in a Pipeline

(ST)

→ another form of a control hazard

e.g. consider overflow on add in EX stage

add \$1, \$2, \$1

Solution:

complete all previous instructions

flush add & subsequent instructions

set cause and EPC register values

transfer control to handler

## \* Exception Properties

### ① Restorable exceptions

→ flush inst. from pipeline

→ execute handler, return to inst., refetched & executed from scratch

### ② PC saved in EPC register

→ identifier causing instruction

→ handler must adjust as  $PC + 4$  is what gets saved

## \* Precise vs. Imprecise Interrupt

↓  
Always  
associated  
w/ correct  
instruction

not associated w/ exact  
instruction that was the  
cause of the interrupt  
exception

If there are multiple  
exceptions in a pipeline,  
due to overcap, deal w/  
exception from the earliest  
instruction, flush subsequent  
instructions

1

## Unit 3

~~~~~

### Additional Numericals

~~~~~ ~~~~~

① Consider the following code:

```

DIV    R2    R5    R8
SUB    R9    R2    R7
ADD    R5    R14   R6
MUL    R11   R9    R5
BEQ    R10   #10   R12
OR     R8    R15   R2

```

Identify all the RAW, WAR, WAW & Control Hazards.

Ans RAW

(i) DIV & SUB ,  
 (ii) SUB & MUL

(iii) DIV & OR

(iv) SUB & MUL

WAR

(i) DIV & ADD  
 (ii) DIV & OR

WAW → NONE

CONTROL → BEQ, OR

② An unpipelined system takes 50ns to process a task. The same task can be processed in a 6 segment pipeline with a clock cycle of 10ns. Determine the speedup ratio of the pipeline for 100 tasks. What is the maximum speedup that can be achieved?

$$t_n = 50 \text{ ns}$$

$$t_p = 10 \text{ ns}$$

$$k = 6$$

$$n = 100$$

$$\begin{aligned} \frac{s = n t_n}{(k+n-1) * t_p} &= \frac{100 * 50}{(6+100-1) * 10} \\ &= \frac{500}{105} = 4.76 \end{aligned}$$

$$\boxed{S = 4.76}$$

$$\text{Max speedup} = \frac{t_n}{t_p} = \frac{50}{10} = 5$$

③ Consider an unpipelined processor that has a 1ns clock cycle and it uses 4 cycles for ALU and branch operations and 5 cycles for memory operations. Assume the relative frequency of these operations are 40%, 20% & 40% respectively. Suppose that due to clock skew & setup, pipelining the processor adds 0.2ns of overhead to the clock, ignoring any latency impact, how much speedup in the instruction execution rate will we gain from the pipeline?

$$T = CPI \times C \times I$$

$$= 1 \text{ ns} * ((0.40 + 0.2) * 4 + 5 * 0.4)$$

$$= 1 \text{ ns} * (4.4) = 4.4 \text{ ns}$$

(3)

### Pipelined Instructions:

$$1 + 0 \cdot 0 = 1.0 \text{ ns}$$

$$\text{Speedup} = \frac{4.4}{1.0} = 3.76 \text{ times}$$

- (4) The time delay for the 4 segments in a pipeline are as follows:  $t_1 = 50 \text{ ns}$ ,  $t_2 = 30 \text{ ns}$ ,  $t_3 = 95 \text{ ns}$ ,  $t_4 = 45 \text{ ns}$ . The interface register delay time  $t_r = 5 \text{ ns}$

- (a) How long would it take to add 100 pairs of numbers in the pipeline?
- (b) How can we reduce the time to about one half of the time calculated in (a)?

Answer time taken = 100 ns

$$\text{pipelined time} = (k+n-1) * t_p$$

$$t_p = 100$$

$$k = 4$$

$$n = 100$$

$$= (4+99) * 100 = 10,300 \text{ ns}$$

(b) Divide max segment into 2

$$50 + 5 = 55 \text{ ns}$$

$$45 + 5 = 50 \text{ ns}$$

$$\Rightarrow \text{new } t_p = 55$$

$$(k+n-1) * t_p = (5+99) * 55 = 5720 \text{ ns}$$

⑤ Assume that the individual stages of the datapath have the following latencies

| IF  | ID  | EX  | MEM | WB  |         |
|-----|-----|-----|-----|-----|---------|
| 250 | 400 | 150 | 350 | 200 | (in ps) |

What is the clock cycle time in a pipelined & non-pipelined processor?

Ans

$$\text{non-pipelined} = 1350 \text{ ps}$$

$$\text{pipelined} = 400 \text{ ps}$$

⑥ Assume that instructions executed by the processor are broken down as follows.

| ALU | Reg | Mem | SW  |
|-----|-----|-----|-----|
| 40% | 20% | 25% | 15% |

What is the utilization of the write-register port of the Registers unit?

Ans

ALU & Reg use the write register port  
 $\Rightarrow 65\%$

⑦ Assume that the individual stages of the datapath have the following latencies

| IF    | ID     | EX    | MEM   | WB     |
|-------|--------|-------|-------|--------|
| 250ps | 400 ps | 150ps | 350ps | 200 ps |

Split one stage into 2, and identify the new clock cycle time?

Ans.  $SPLIT\ TB = 400\text{ps}$

New cycle time =  $350\text{ps}$

(5)

⑧ Assume the following fragment of MIPS code:

lw \$t0, 20(\$t1)

sub \$t2, \$t0, \$t3

The individual stages of the datapath have the following latencies:

IF ID EX MEM WB

250ps 350ps 150ps 300ps 200ps.

Identify the total latency (assuming there is no forwarding unit)

⑨ What is the longest chain of dependent operations in the following program?

LD r7, r8

SUB r10, r11, r12

MUL r13, r7, r11

ST r9 r13

ADD r13 r2, r1

LD r5, r6

SUB r3, r4, r5

Ans. LD r7 r8

MUL r13 r7 r11

ST r9 r13

ADD r13 r2 r1

## Role of control signals

- (i) Branch → used with PsSrc <sup>to make</sup> = Branch + 0 from ALU
- (ii) ALUSrc → selects second source operand
- (iii) ALUOp → specifies what operation should be performed
- (iv) MemRead, MemWrite → enables memory read, write <sup>for</sup>  
Load instructions
- (v) ReaWrite → enables a write to one of the registers
- (vi) ReqPst → determines how the destination register <sup>is specified</sup>
- (vii) MemtoReg → determines where the value to be <sup>written</sup>  
comes from.