



CS M151B / EE M116C  
Midterm Exam #1

All work and answers should be written directly on these pages, use the backs of pages if needed.

This is an open book, open notes quiz – but you cannot share books or notes.

We will follow the departmental guidelines on reporting incidents of academic dishonesty – do not make us enforce the rules. Keep your eyes on your own exam!

NAME: \_\_\_\_\_

ID: \_\_\_\_\_

Do not write anything in the area below on this page:

Problem 1: \_\_\_\_\_ (18)

Problem 2: \_\_\_\_\_ (20)

Problem 3: \_\_\_\_\_ (30)

Problem 4: \_\_\_\_\_ (32)

Total: \_\_\_\_\_ (out of 100)

1. **Hardware Design ISA-about Tradeoffs (18 points):** Consider the three properties of execution time. For the following design decisions, indicate how these three properties could impact a MIPS processor running an application *A* with the following instruction breakdown and latencies:

| Instruction                             | % of Instructions in <i>A</i> | Instruction Latency (cycles) |
|-----------------------------------------|-------------------------------|------------------------------|
| Load                                    | 20%                           | 5                            |
| Store                                   | 10%                           | 4                            |
| Simple R-type (i.e. adds, ands, shifts) | 45%                           | 4                            |
| Multiply                                | 10%                           | 5                            |
| BEQ/BNE                                 | 10%                           | 3                            |
| Jump                                    | 5%                            | 3                            |

Note that this architecture is not the single cycle datapath and is not pipelined. Your answer for each should be that it either **could increase, could decrease, or will stay the same**. You may only circle one answer per property. The first one is done for you as an example.

**Example: Design Decision:** Use a carry lookahead adder instead of a ripple carry adder.

- |                            |                |                |                    |
|----------------------------|----------------|----------------|--------------------|
| • <i>Instruction Count</i> | could increase | could decrease | will stay the same |
| • <i>Cycle Time</i>        | could increase | could decrease | will stay the same |
| • <i>CPI</i>               | could increase | could decrease | will stay the same |

- a. **Design Decision:** Increase the size of the immediate field for I-type instructions by reducing the number of registers in the ISA.

- |                            |                |                |                    |
|----------------------------|----------------|----------------|--------------------|
| • <i>Instruction Count</i> | could increase | could decrease | will stay the same |
| • <i>Cycle Time</i>        | could increase | could decrease | will stay the same |
| • <i>CPI</i>               | could increase | could decrease | will stay the same |

- b. **Design Decision:** A new physical design technology is used which can reduce wire latency.

- |                            |                |                |                    |
|----------------------------|----------------|----------------|--------------------|
| • <i>Instruction Count</i> | could increase | could decrease | will stay the same |
| • <i>Cycle Time</i>        | could increase | could decrease | will stay the same |
| • <i>CPI</i>               | could increase | could decrease | will stay the same |

- c. **Design Decision:** A new compiler is used to compile application *A*. The compiler uses shifts in place of multiply instruction – it takes multiple shift instructions to replace a single multiply.

- |                            |                |                |                    |
|----------------------------|----------------|----------------|--------------------|
| • <i>Instruction Count</i> | could increase | could decrease | will stay the same |
| • <i>Cycle Time</i>        | could increase | could decrease | will stay the same |
| • <i>CPI</i>               | could increase | could decrease | will stay the same |

a. **Design Decision:** Increase the size of the immediate field for I-type instructions by reducing the number of registers in the ISA.

### Instruction Count: Could increase, could decrease

#### Justification for “could increase”:

The decrease in the amount of registers could mean more register spilling. This could increase the amount of lw and sw instructions, increasing the overall instruction count.

#### Justification for “could decrease”:

The increase in the size of the immediate field of I-Type instructions means more flexibility in some instructions:

- addi can now add larger immediates
- beq/bne can now branch farther
- lw/sw can now calculate a farther offset.

It is possible that the original code often needed to explicitly create large immediates (case 1):

```
lui $t0, UPPER  
ori $t0, $t0, LOWER  
add $t1, $t1, $t0
```

...or needed to branch to distant locations using j (case 2):

```
bne $t0, $t1, LABEL  
j ADDR  
LABEL:
```

...or both (case 3):

```
lui $t0, UPPER  
ori $t0, $t0, LOWER  
bne $s0, $s1, LABEL  
jr $t0  
LABEL:
```

...or in a pathological case, used multiple branches to get to a distant destination (case 4):

```
beq $t0, $t1, LABEL1  
...  
LABEL1:  
beq $t0, $t1, LABEL2
```

Case 1 can be reduced to a single addi and the remaining cases could be reduced to a single beq if the new size of the immediate field is large enough to accommodate the desired immediate. As a result, the IC could be reduced. Note, according to the

instruction type distribution, there are no I-Type instructions other than lw, sw, and beq. Thus, for this particular set of instructions, this justification is most valid when concerning case 2 (reduce the number of jumps) or case 4 (reduce the number of branches).

### Cycle Time: Could decrease

#### **Justification for “could decrease”:**

Due to the reduced number of registers, the hardware complexity is reduced which may decrease the latency of accessing the register file. If the stage in which register access occurs was previously the stage with the greatest latency, the faster register access could result in a reduction of cycle time.

### CPI: Could increase

#### **Justification for “could increase”:**

Due to the decrease in registers, there could be more register spilling, resulting in a higher proportion of loads (CPI = 5) and stores (CPI = 4). The current CPI of the instructions is 4.15. If the result of the increased spilling is the addition of new lw/sw pairs, then the CPI will increase (adding a set of instructions that have an average CPI of 4.5).

#### **Justification for “could increase”:**

Due to the increased immediate field, there may be a reduction in the number of the following instructions:

- Simple R-Type, CPI = 4 (case 1)
- Non-branch/lw/sw I-Type (addi, ori, lui)
- beq/bne, CPI = 3 (case 4)
- j, CPI = 3 (case 2)

Because the current CPI is 4.15, reducing the number of any one of these would increase the CPI.

#### **Justification for “could decrease”:**

Consider a case where the original datapath had multiple instruction decode stages. If the reduced number of registers means that multiple decode stages could be reasonably coalesced into one stage due to the decreased complexity, the CPI could decrease due to a reduction of stages in the datapath.

---

b. ***Design Decision:*** A new physical design technology is used which can reduce wire latency.

### Instruction Count: Will stay the same

#### **Justification for “will stay the same”:**

This is purely a change to the hardware. Since the instruction count exists in the software domain, a change to the hardware cannot affect the instruction count unless the

instructions/software explicitly specify that particular hardware component (ex. instructions specify registers).

### Cycle Time: Could decrease

#### **Justification for “could decrease”:**

Because the wire latency has decreased, the latency through each stage of the datapath would likely decrease. As a result, the cycle time could decrease to account for the reduced latency of each stage.

### CPI: Could decrease

#### **Justification for “could decrease”:**

If the reduced wire latency means that multiple stages in the datapath could be coalesced into a single stage as part of the hardware modification, the CPI could be reduced.

---

c. **Design Decision:** A new compiler is used to compile application A. The compiler uses shifts in place of multiply instruction – it takes multiple shift instructions to replace a single multiply.

### Instruction Count: Could increase

#### **Justification for “could increase”:**

Since a single multiply is replaced by several shifts, the number of instructions will increase if this multiply replacement is done.

### Cycle Time: Could decrease, must stay the same

#### **Justification for “could decrease”:**

If we make the assumption that not only are we using a different compiler that eliminates multiply instructions, but we are also using different hardware that doesn't include the multiply hardware, the complexity of the hardware is reduced. If the multiply hardware was part of the stage with the maximum latency, then eliminating the multiply hardware could result in a reduced cycle time.

#### **Justification for “must stay the same”:**

If we make the assumption that ONLY the compiler has changed, then the hardware is unchanged and therefore the cycle time is also unchanged.

### CPI: Could decrease

#### **Justification for “could decrease”:**

Replacing each multiply with multiple shifts has the following effects:

- Reduces the number of multiply instructions which have a CPI of 5
- Increases the number of shift instructions which have a CPI of 4

Since the current CPI is 4.15, both of these changes will bring the overall CPI closer to 4, which will reduce the CPI.

**Justification for “could decrease”:**

As with the possible justification for why CT may decrease, the reduced complexity of not requiring multiply hardware may mean if the EX stages can be reduced or coalesced, the CPI decreases.

2. **Performance Anxiety (20 points):** Consider the same application  $A$  and baseline MIPS processor from the previous problem. Application  $A$  executes three billion instructions. Answer the questions below – show your work.

- a. What is the CPI for application  $A$ ?

$$\text{CPI: } \frac{4.15}{.2 \times 5 + .1 \times 4 + .45 \times 4 + .1 \times 5 + .1 \times 3 + .05 \times 3} = 4.15$$

- b. The hardware running the application from part (a) has a cycle time of 300 ps (that is picoseconds). What is the execution time for that hardware to run the application?

$$ET = 3 \times 10^9 \times 4.15 \times 300 \times 10^{-12} \quad ET: \underline{3.735 \text{ s}}$$

- c. Now suppose we add an instruction that does a multiply/accumulate operation. The multiply/accumulate operation replaces one multiply instruction and one add instruction. This optimization allows the compiler to replace 50% of multiplies. What is the new CPI?

$$\text{Multiply } 10 \rightarrow 5 \quad \text{CPI: } \underline{4.27}$$

$$\begin{array}{ll} R\text{-type} & 45 \rightarrow 40 \\ \text{MAC} & 5 \end{array} \quad \frac{20}{95} \times 5 + \frac{10}{95} \times 4 + \frac{40}{95} \times 4 + \frac{5}{95} \times 5 + \frac{10}{95} \times 3 + \frac{50}{95} \times 3 + \frac{5}{95} \times 2 \\ 1.32 + 2.11 + 0.47 + 0.37 = 4.27$$

- d. What is the execution time for the same hardware from part (b) to run the new application from part (c)?

$$ET = (.95)(3)(10^9) \times 4.27 \times 300 \times 10^{-12} \quad ET: \underline{3.65 \text{ s}}$$

3. **Putting the CLA into UCLA (30 points):** Assume for the rest of this problem that all logic gates have the following delays:

| Fan In    | Delay       |
|-----------|-------------|
| 1         | 5T          |
| 2         | 6T          |
| 3         | 7T          |
| 4         | 8T          |
| 5         | 9T          |
| 6 or more | 2T x fan-in |

So a 2-input AND gate would have delay 6T and a 4-input OR gate would have delay 8T. For simplicity, assume that mux's have delay 6T regardless of fan-in.

We will create a 32-bit adder out of some building blocks we've covered in class. We will use the 4-bit carry lookahead (4-bit CLA) that we covered in class as one basic building block of this design. And we will use it (as we did in class) to make a 16-bit hierarchical CLA (16-bit HCLA). But instead of connecting these in series to make a 32-bit adder, we will use carry select to speed up the 32-bit adder. The design will look as follows (be sure to note where we are using the 4-bit CLA and where we are using the 16-bit HCLA):



The grey boxes show the outputs of the 32-bit adder, including the Carry Out ( $C_{32}$ ) and Sum bits  $S_0$  through  $S_{32}$ . The individual CLA and HCLA blocks take input from a carry in and the A and B operands. For example, the left most 4-bit CLA is computing the sum of inputs  $A_{0:3}$  and  $B_{0:3}$  (i.e. producing sums  $S_0$ - $S_3$ ). The carry out of this 4-bit CLA (marked  $C_4$ ) selects the sums and carry out of the two 4-bit CLAs in Ripple Carry. The carry out of the these Ripple Carry 4-bit CLAs (marked  $C_{12}$ ) selects the sums and carry out of the 4-bit CLA and 16-bit HCLA in Ripple Carry. The 0 or 1 at the top the adders in carry select are the hardwired  $C_{in}$  for these blocks. Multiplexers are shown as ovals.

Your task is to find the maximal delay of this design – i.e. determine the delays of  $S_{0-31}$  and  $C_{32}$  – the maximal delay of these outputs will be the maximal delay of the entire design. To do this (and to help with possible partial credit) please use the diagrams on the following pages and fill in the tables in every page.

**Single 4-bit CLA:**



| Output | Delay (in terms of T) |            |
|--------|-----------------------|------------|
| G0     |                       | (1 points) |
| P0     |                       | (1 points) |
| G3     |                       | (1 points) |
| P3     |                       | (1 points) |
| C3     |                       | (2 points) |
| C4     |                       | (2 points) |
| S3     |                       | (1 points) |

*Ripple Carry 4-bit CLAs:*



| Output                        | Delay (in terms of T) |
|-------------------------------|-----------------------|
| $C_8$                         |                       |
| $XC_{12}$ (before any muxing) |                       |
| $S_{11}$ (before any muxing)  |                       |

(1 points)

(2 points)

(2 points)

**16-bit HCLA:**



### **16-bit HCLA**

| Output                       | Delay (in terms of T) |            |
|------------------------------|-----------------------|------------|
| $G_\omega$                   |                       | (2 points) |
| $P_\omega$                   |                       | (2 points) |
| $C_{16}$                     |                       | (2 points) |
| $C_{28}$                     |                       | (2 points) |
| $C_{32}$ (before any muxing) |                       | (2 points) |
| $S_{31}$ (before any muxing) |                       | (2 points) |

Entire Design:



| Output              | Delay (in terms of T) |
|---------------------|-----------------------|
| C12 (after the mux) | (1 points)            |
| C32 (after the mux) | (1 points)            |
| S31 (after the mux) | (1 points)            |

Find the maximum delay **in terms of T** of the 32-bit adder – take the maximum of all output bits – including the sum bits ( $S_0-S_{31}$ ) and the final carry out ( $C_{32}$ ).

Maximal Delay: \_\_\_\_\_ (1 points)

\*: AND

+: OR

^: XOR

Using  $S_n = (A_n \wedge B_n \wedge C_n)$  (3-input XOR)

**1.**  $G_0 = A_0 * B_0$

- Delay( $G_0$ ) = 6T

**2.**  $P_0 = A_0 \wedge B_0$

- Delay( $P_0$ ) = 6T

**3.**  $G_3 = A_3 * B_3$

- Delay( $G_3$ ) = 6T

**4.**  $P_3 = A_3 \wedge B_3$

- Delay( $P_3$ ) = 6T

**5.**  $C_3 = G_2 + G_1 * P_2 + G_0 * P_1 * P_2 + C_0 * P_0 * P_1 * P_2$

- Delay( $C_3$ ) = 6T ( $P_0/P_1/P_2$ ) + 8T (4-input AND) + 8T (4-input OR) = 22T

**6.**  $C_4 = G_3 + G_2 * P_3 + G_1 * P_2 * P_3 + G_0 * P_1 * P_2 * P_3 + C_0 * P_0 * P_1 * P_2 * P_3$

- Delay( $C_4$ ) = 6T ( $P_0/P_1/P_2/P_3$ ) + 9T (5-input AND) + 9T (5-input OR) = 24T

**7.**  $S_3 = (A_3 \wedge B_3 \wedge C_3)$

- Delay( $S_3$ ) = 22T ( $C_3$ ) + 7T (3-input XOR) = 29T

**8.**  $C_8 = G_7 + G_6 * P_7 + G_5 * P_6 * P_7 + G_4 * P_5 * P_6 * P_7 + C_4 * P_4 * P_5 * P_6 * P_7$

- Delay( $C_8$ ) = 6T ( $P_4/P_5/P_6/P_7$ ) + 9T (5-input AND) + 9T (5-input OR) = 24T

**9.**  $C_{12} = G_{11} + G_{10} * P_{11} + G_9 * P_{10} * P_{11} + G_8 * P_9 * P_{10} * P_{11} + C_8 * P_8 * P_9 * P_{10} * P_{11}$

- Delay( $C_{12}$ ) = 24T ( $C_8$ ) + 9T (5-input AND) + 9T (5-input OR) = 42T

**10.**  $S_{11} = A_{11} \wedge B_{11} \wedge C_{11}$

$C_{11} = G_{10} + G_9 * P_{10} + G_8 * P_9 * P_{10} + C_8 * P_8 * P_9 * P_{10}$

- Delay( $C_{11}$ ) = 24T ( $C_8$ ) + 8T (4-input AND) + 8T (4-input OR) = 40T

- Delay ( $S_{11}$ ) = 40T ( $C_{11}$ ) + 7T (3-input XOR) = 47T

**11.**  $G_\omega = G_{31} + G_{30} * P_{31} + G_{29} * P_{30} * P_{31} + G_{28} * P_{29} * P_{30} * P_{31}$

- Delay ( $G_\omega$ ) = 6T ( $G_{28}$ ) + 8T (4-input AND) + 8T (4-input OR) = 22T

**12.**  $P_\omega = P_{31} * P_{30} * P_{29} * P_{28}$

- Delay ( $P_\omega$ ) = 6T ( $P_{28}/P_{29}/P_{30}/P_{31}$ ) + 8T (4-input AND) = 14T

**13.**  $C_{16} = G_{15} + G_{14} * P_{15} + G_{13} * P_{14} * P_{15} + G_{12} * P_{13} * P_{14} * P_{15} + C_{12} * P_{12} * P_{13} * P_{14} * P_{15}$

- Delay ( $C_{16}$ ) = 6T ( $P_{15}/P_{14}/P_{13}/P_{12}$ ) + 9T (5-input AND) + 9T (5-input OR) = 24T

$$14. C_{28} = G\psi + G\chi^*P\psi + G\Phi^*P\chi^*P\psi + C_{16}^*P\Phi^*P\chi^*P\psi$$

-Delay (C<sub>28</sub>) = 24 T(C<sub>16</sub>) + 8T (4-input AND) + 8T(4-input OR) = 40T

$$15. C_{32} = G\omega + G\psi^*P\omega + G\chi^*P\psi^*P\omega + G\Phi^*P\chi^*P\psi^*P\omega + C_{16}^*P\Phi^*P\chi^*P\psi^*P\omega$$

-Delay (C<sub>32</sub>) = 24T (C<sub>16</sub>) + 9T (5-input AND) + 9T(5-input OR)= 42T

$$16. S_{31} = A_{31} \wedge B_{31} \wedge C_{31}$$

$$C_{31} = G_{30} + G_{29}^*P_{30} + G_{28}^*P_{29}^*P_{30} + C_{28}^*P_{28}^*P_{29}^*P_{30}$$

-Delay(C<sub>31</sub>) = 40T (C<sub>28</sub>) + 8T (4-input AND) + 8T (4-input OR) = 56T

-Delay(S<sub>31</sub>) = 56T(C<sub>31</sub>) + 7T(3-input XOR)= 63T

$$17. C_{12} (\text{after MUX}) = 42T(C_{12} \text{ before MUX}) + 6T (\text{MUX})= 48T$$

$$18. C_{32} (\text{after MUX}) = 48T (C_{12} \text{ after MUX}) + 6T (\text{MUX})= 54T$$

$$19. S_{31} (\text{after MUX}) = 63T (S_{31} \text{ before MUX}) + 6T (\text{MUX})= 69T$$

Maximal delay = 69T

4. **FMOV (32 points):** Consider the single-cycle processor implementation. Your task will be to augment this datapath with a new instruction: the *fmov* instruction. This instruction will be an I-type instruction, and will have the following effect:

```
if (M[R[rs]]==SE(I))  
    R[rt]=R[$t0]
```

Note that the *fmov* always uses register \$t0 as the source value that is put into R[rt] – it is an implicit operand.

**Implement your solution on the following two pages.** All other instructions must still work correctly after your modifications. You should not add any new ALUs, register file ports, or ports to memory.



Main Controller

| Input or Output | Signal Name | R-format | lw | sw | Beq | Fmov |
|-----------------|-------------|----------|----|----|-----|------|
| Inputs          | Op5         | 0        | 1  | 1  | 0   | U    |
|                 | Op4         | 0        | 0  | 0  | 0   | N    |
|                 | Op3         | 0        | 0  | 1  | 0   | T    |
|                 | Op2         | 0        | 0  | 0  | 1   | Q    |
|                 | Op1         | 0        | 1  | 1  | 0   | E    |
|                 | Op0         | 0        | 1  | 1  | 0   | Z    |
| Outputs         | RegDst      | 1        | 0  | X  | X   | 0    |
|                 | ALUSrc      | 0        | 1  | 1  | 0   | 1    |
|                 | MemtoReg    | 0        | 1  | X  | X   | X    |
|                 | RegWrite    | 1        | 1  | 0  | 0   | X    |
|                 | MemRead     | 0        | 1  | 0  | 0   | 1    |
|                 | MemWrite    | 0        | 0  | 1  | 0   | 0    |
|                 | Branch      | 0        | 0  | 0  | 1   | 0    |
|                 | ALUOp1      | 1        | 0  | 0  | 0   | 0    |
|                 | ALUOp0      | 0        | 0  | 0  | 1   | 1    |
|                 | Fm          | 0        | 0  | 0  | 0   | 1    |

ALU Controller

| Opcode | ALUOp | instruction  | function | ALU Action | ALUCtrl |
|--------|-------|--------------|----------|------------|---------|
| Lw     | 00    | load word    | XXXXXX   | add        | 010     |
| Sw     | 00    | store word   | XXXXXX   | add        | 010     |
| Beq    | 01    | branch equal | XXXXXX   | subtract   | 110     |
| R-type | 10    | add          | 100000   | add        | 010     |
| R-type | 10    | subtract     | 100010   | subtract   | 110     |
| R-type | 10    | AND          | 100100   | AND        | 000     |
| R-type | 10    | OR           | 100101   | OR         | 001     |
| R-type | 10    | SLT          | 101010   | SLT        | 111     |