

1) Fill in the blanks (3 points each)

a) The power wall has forced computer designers to rely on ..... ~~sof. + super~~ ..... to improve response time of programs.

b) A single cycle system is not practical because ..... ~~hazard~~ .....

c) ~~RAM, ROM, L1, L2, L3~~ is a state element that consists of a set of registers that can be accessed by supplying a register number and a proper control signal.

d) Consider a 3 Giga Hertz system based on architecture M1 with the following specifications. The peak performance for M1 in instructions per second is ~~floating point arithmetic~~ .....  $\rightarrow 3$

$$\begin{aligned} \text{clock rate} &= \frac{\text{cycle time}}{\text{instructions}} \\ &= \frac{i-cycles}{\text{sec}} \end{aligned}$$

| Instruction types available in M1 | CPI | Percentage of operations in a typical program |
|-----------------------------------|-----|-----------------------------------------------|
| Load                              | 5   | 15%                                           |
| Store                             | 4   | 15%                                           |
| integer arithmetic                | 3   | 50%                                           |
| floating point arithmetic         | 11  | 20%                                           |

$$\begin{aligned} \text{clk rate} &= 3 \text{ GHz} \\ \frac{3 \times 10^9}{5} &= .6 \times 10^9 \quad \frac{3 \times 10^9}{3} = 1 \times 10^9 \\ \frac{3 \times 10^9}{4} &= .75 \times 10^9 \quad \frac{3 \times 10^9}{11} = .27 \times 10^9 \end{aligned}$$

e) Increasing ..... ~~clock rate~~ ..... or lowering ..... ~~CPI~~ ..... can improve CPU execution time.

f) Amdahl's law states that the performance enhancement possible with a given improvement is limited by the amount that the improved feature is used.

g) The ..... ISA ..... is an abstract interface between the hardware and the lowest level of software of a machine.

h) List three important features of a RISC architecture such as MIPS:

..... pipeline .....  
..... large number of registers .....  
..... limited address .....

i) Assume a program that consists of only addition and multiplication operations each taking 50% of the overall execution time. By how much should we improve the speed of multiplication hardware to improve the overall execution time by 30%?

..... 1.3 M .....

j) response time is defined as the total time to complete a task including I/O access, memory access, disk access, and operating system overhead.

$$5 \text{ CPI exec time} = \frac{\text{clk cycle}}{\text{clk rate}}$$

$$5 \text{ exec time} + \text{nr} = \text{exec time}$$

$$4 \text{ exec time} + \text{nr} = 1.3 \text{ exec time}$$

2

$$\text{CPU exec time} = \frac{\text{CPI} \times \text{instruction count}}{\text{clk rate}} \quad \frac{\text{cycle}}{\text{sec}}$$

- 2) a) Computer A has an overall CPI of 2.0 and can be run at a clock rate of 500MHz. Computer B has a CPI of 3.0 and can be run at a clock rate of 850 Mhz. We have a particular program we wish to run. When compiled for computer A, this program has exactly 200,000 instructions. How many instructions would the program need to have when compiled for Computer B, in order for the two computers to have exactly the same execution time for this program? (10 points)

10<sup>b</sup>

$$A: \text{CPI} = 2, \text{clk rate} = 500 \times 10^6, \text{Ai} = 200,000$$

$$B: \text{CPI} = 3, \text{clk rate} = 850 \times 10^6, i = ?$$

$$A \text{ cpu exec time} = \frac{2 \times 200,000}{500 \times 10^6} = \frac{1}{1250}$$

$$B \text{ cpu exec time: } \frac{1}{1250} = \frac{3 + i}{850 \times 10^6}$$

$$i = \frac{850 \times 10^6}{1250} - \frac{1}{3} = 226,667 \text{ instructions}$$

- b) If processor A has a higher clock rate than processor B, and processor A also has a higher MIPS rating than processor B, explain whether processor A will always execute faster than processor B. Suppose that there are two implementations of the same instruction set architecture. Machine A has a clock cycle time of 30ns and an effective CPI of 2 for some program, and machine B has a clock cycle time of 20ns and an effective CPI of 2.0 for the same program. Which machine is faster for this program, and by how much? For the problem you can use MIPS rating = (Clock Rate)/(CPI \* 10<sup>9</sup>) (5 points)

*whether processor A is faster than B will depends on the clk rates and MIPS rating because they are divided by each other.* -2

$$\frac{1 \text{ C}}{e \times 10^9} = \text{MIPS}$$

$$\frac{1 \text{ C}}{\text{MIPS}} = e$$

$$A: \text{clk cycle time} = 30 \text{ ns}, \text{CPI} = 2; \text{clk rate} = 30 \times 10^9$$

$$B: \text{clk cycle time} = 20 \text{ ns}, \text{CPI} = 2; \text{clk rate} = 20 \times 10^9$$

$$A \text{ MIPS} = \frac{30 \times 10^9}{2} = 15 \times 10^9$$

$$B \text{ MIPS} = \frac{20 \times 10^9}{2} = 10 \times 10^9$$

$$\frac{15 \times 10^9}{10 \times 10^9} = \frac{15}{10} = 1.5$$

*processor B is faster by 1.5 time*

$$a.) \frac{(2.0)(200000)}{500 \times 10^6} = 0.8 \text{ ms} : A$$

$$\frac{(3.0) \cdot *}{850 \times 10^6} = 0.8 \text{ ms} : B$$

$$x = \frac{(0.8 \text{ ms})(850 \times 10^6)}{3.0} = 2266666.667 \text{ [instructions]}$$

$$b.) \text{MIPS} = \frac{\text{Clockrate}}{\text{CPI} \times 10^6} = \frac{30 \times 10^{-9}}{2 \times 10^6} = 1.5 \times 10^{-14} \quad (A)$$

$$\frac{A}{B} = \frac{1.5 \times 10^{-14}}{1.0 \times 10^{-14}} = \underline{1.5}$$

$$\text{MIPS} = \frac{20 \times 10^{-9}}{2 \times 10^6} = 1.0 \times 10^{-14} \quad (B)$$

faster by 1.5

- 3) Consider a hypothetical 16-bit assembly language with only three register-type instructions. The format for these instructions is given below. Design a single cycle datapath which can fetch and execute any of these instructions. Show all necessary components. Label each component (including its ports). Assume that the memory is byte-addressable. You must use minimum number of components possible. (20 points)

|             | bits<br>(15 - 12) | bits<br>(11 - 8) | bits<br>(7 - 4) | bits<br>(3 - 0) |                                                                                                        |
|-------------|-------------------|------------------|-----------------|-----------------|--------------------------------------------------------------------------------------------------------|
| Instruction | opcode            | operand 1        | operand 2       | Operand 3       | operation                                                                                              |
| MOV         | 0000              | Reg 1            | Reg 2           | not used        | Moves the content of register 2 into register 1; No change to the content of register 2. Reg 1 = Reg 2 |
| ADD         | 0011              | Reg 1            | Reg 2           | Reg 3           | Reg 1 = Reg 3 + Reg 2                                                                                  |
| SWAP        | 1100              | Reg 1            | Reg 2           | not used        | Swaps the contents of registers 1 and 2<br>New Reg 1 = Current Reg 2<br>New Reg 2 = Current Reg 1      |



- 4) Write the Verilog code for a memory module that holds 2K bytes. The memory is byte addressable. Each word is 32-bit wide. There are separate read and write data ports. It has only one address port. This memory has only one enable, called rnw. When rnw is low, a word write takes place on the next positive edge of the clock. When rnw is high, the content of memory word at the given address is placed on the data port. On a reset, each memory location is initialized to 0. (20 points)

```

module memModule (word, rnw, reset, dataPort);
    input [31:0] word;
    input rnw, reset;
    output [31:0] dataPort;
    reg [31:0] dataPort; // port ready
    always @ (posedge clk or posedge reset)
        if (!reset)
            dataPort <= 0; // init
        else if (rnw)
            dataPort <= word;

```

*Read / write -n*

- 5) a) List all assembly instructions necessary for the execution of the following C code-segment on a 32-bit system designed using the 32-bit MIPS architecture. Assume that all array elements are 32-bit unsigned numbers stored in main memory. You can assume  $i$  and  $a$  are in some registers. Your answer should use proper MIPS assembly format. (8 points)

```
for (i=0; i<100; i++)
    z[i] = x[i] - (a * y[i])
```

- b) Compute the number of clock cycles required to execute the code on a multi-cycle datapath in which each instruction takes 4 cycles on the average. (7 points)

a)

```

andi $r1, $0, 0
sli $r1, 0
loop: beq $r1, $0, end // i < 10?
lw $t2, 0($r1) // x[i]
lw $t3, 0($r1) // y[i]
mul $t4, $t2, $t3 // t4 = a * y[i]
sub $t5, $t2, $t4 // t5 = x[i] - (a * y[i])
addi $r1, $r1, 4
j loop
end:
```