



## School of Electrical Engineering and Computer Science

CptS260: Introduction to Computer Architecture - Fall 2017

Final Exam, Duration: 120 Minutes, December 14, 2017

NAME: *Solutions*

ID:

Notes:

- You may bring a double-side letter-size cheat-sheet and a calculator to the test. No other resources are allowed! In particular, NO textbooks, lecture notes, internet access, smartphone usage, etc. are allowed!
- MIPS reference data sheet is provided.
- The test includes 5 implicit bonus points.
- Make sure to write your name and WSU ID down on all pages.
- Show your work for each question. Even if you don't know exact answer to a question, show your work to get partial credit.
- You are strongly encouraged to complete the course evaluation before the deadline of 12/15/2017. Your feedback is very much appreciated!

| Problem       | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |
|---------------|----|----|----|----|----|----|----|----|
| Total Points  | 10 | 10 | 10 | 15 | 10 | 20 | 15 | 15 |
| Points Earned |    |    |    |    |    |    |    |    |

Total Points Earned: .....

1. For each sentence below, indicate if it is a *True* or *False* statement.
  - a. Compared to an un-pipelined processor, a pipelined architecture usually has a shorter clock cycle.  
*True.*
  - b. Control hazards in a pipelined architecture are due to data dependency of the instructions.  
*False, control hazards are due to branch instruction.*
  - c. DRAM (Dynamic Random Access Memory) is as fast as SRAM (Static Random Access Memory)  
*False, DRAM is slower than SRAM.*
  - d. Non-negative numbers have the same unsigned and 2's-complement representation.  
*False True.*
  - e. In MIPS assembly, instruction 'beq \$1, \$2,1' will advance the program counter (PC) by one (i.e., PC = PC + 1) if [\$1]=[\$2]. (note: [] indicates content of the register)  
*False, PC<sub>new</sub> = PC<sub>old</sub> + 4.*
  - f. In cache memory, the 'Valid' bit keeps track of the cache blocks that are updated in the cache but have not been written to the main memory.  
*False, 'Dirty' bit keeps track of the cache blocks not written to the main memory, yet.*
  - g. In MIPS assembly, instruction 'jr' will update register '\$ra' by copying content of the program counter (PC) into '\$ra'.  
*False, PC = \$ra, PC is updated by \$ra.*
  - h. MIPS is a byte-addressable computer.  
*True.*
  - i. A fully associative cache memory has a higher hit time (i.e., it is slower) compare to a direct mapped cache memory of the same size.  
*True, FA is slower than direct mapped.*
  - j. For primary (L1) cache in multilevel cache systems, focus is on minimizing hit time.  
*True.*

2. Assume a processor has a clock cycle time of 500 ps (picoseconds), a D-Cache (Data Cache Memory) with 5% miss rate, and an I-Cache (Instruction Cache Memory) with 10% miss rate. Also assume that 30% of the instructions are load/store and access to the main memory takes 20 ns (nanoseconds). Calculate the effective CPI assuming an ideal/base CPI of 2.

$$CPI = CPI_{ideal} + \underset{\text{I-cache}}{\text{Stalls}} + \underset{\text{D-cache}}{\text{Stalls}}$$

$$\begin{aligned} CPI &= 2 + \frac{10}{100} \times \frac{20 \text{ ns}}{0.5 \text{ ps}} + \frac{30}{100} \times \frac{5}{100} \times \frac{20 \text{ ns}}{0.5 \text{ ps}} \\ &= 2 + 4 + 0.6 = \boxed{6.6} \end{aligned}$$

3. Assume the following 32-bit binary represents a floating point number using single precision IEEE 754 standard. Calculate the equivalent decimal number for this floating representation. Show your work



$\text{Sign} = 1 \rightarrow \text{negative number}$

$$\text{Exponent} = (10001001)_2 = 1 + 8 + 128 = 137$$

$$\text{Fraction} = 111100\cdots 0$$

$$\text{Decimal} = (-1)^{\text{Sign}} \times (1. \text{Fraction})_2 \times 2^{(\text{Exponent} - 127)}$$

$$\text{Decimal} = -1 \times (1.111100\cdots 0)_2 \times 2^{137-127}$$

$$= - (1.111)_2 \times 2^{10} = -(1111000000)_2$$

$$= -1984$$

4. Trace the following assembly program. Assume that the registers' initial values and memory contents are as shown in the following tables. The program starts from location 10000d in the instruction memory. Show the final state of the processor including registers' content/data, and memory contents.

| Address | Initial Value | Final Value |
|---------|---------------|-------------|
| 100d    | 1             | 1d          |
| 104d    | 1             | 1d          |
| 108d    | 1             | 2d          |
| 112d    | 1             | 3d          |
| 116d    | 1             | 5d          |
| 120d    | 1             | 8d          |
| 124d    | 1             | 13d         |

| Register | Initial Value | Final Value |
|----------|---------------|-------------|
| \$t0     | 0             | 120d        |
| \$t1     | 0             | 5d          |
| \$t2     | 0             | 5d          |
| \$s0     | 0             | 5d          |
| \$s1     | 0             | 13d         |

```

    li      $t2, 5d : loops for 5 times
    li      $t1, 0d : counter
    la      $t0, 100d : pointer to array
loop:
    lw      $s0, 0($t0)
    lw      $s1, 4($t0)
    add   $s1, $s0, $s1
    sw      $s1, 8($t0)
    addi  $t0, $t0, 4
    addi  $t1, $t1, 1
    bne   $t1, $t2, loop
exit:

```

*Function of code =*

$$x[i+2] = x[i] + x[i+1]$$

5. The following figure shows a sequential circuit with two D Flip Flops with a common input clock. Assume that the flip flops are initialized at '1'. That is,  $Q_0 = Q_1 = Q_2 = 1$  during Cycle 0. This means the output of this circuit is initially  $Q_2Q_1Q_0 = '1 1 1'$ .
- Compute the value of each output signal ( $Q_2$ ,  $Q_1$  and  $Q_0$ ) for 8 cycles using the table below.
  - Convert the 3-bit binary  $Q_2Q_1Q_0$  to its equivalent decimal in the table. How many unique output states (i.e.,  $Q_2Q_1Q_0$ ) does this circuit produce?



| Clock Cycle | D <sub>2</sub> | D <sub>1</sub> | D <sub>0</sub> | Q <sub>2</sub> | Q <sub>1</sub> | Q <sub>0</sub> | Decimal (Q= Q <sub>2</sub> Q <sub>1</sub> Q <sub>0</sub> ) |
|-------------|----------------|----------------|----------------|----------------|----------------|----------------|------------------------------------------------------------|
| 0           | 0              | 1              | 1              | 1              | 1              | 1              | 7                                                          |
| 1           | 0              | 0              | 1              | 0              | 1              | 1              | 3                                                          |
| 2           | 1              | 0              | 0              | 0              | 0              | 1              | 1                                                          |
| 3           | 0              | 1              | 0              | 1              | 0              | 0              | 4                                                          |
| 4           | 1              | 0              | 1              | 0              | 1              | 0              | 2                                                          |
| 5           | 1              | 1              | 0              | 1              | 0              | 1              | 5                                                          |
| 6           | 1              | 1              | 1              | 1              | 1              | 0              | 6                                                          |
| 7           | 0              | 1              | 1              | 1              | 1              | 1              | 7                                                          |
| 8           | 0              | 0              | 1              | 0              | 1              | 1              | 3                                                          |

b. The output sequence is  $\underbrace{7, 3, 1, 4, 2, 5, 6, 7, 3, \dots}_{7 \text{ unique outputs}}$

6. Consider the following 3 instructions are to be executed through the MIPS architecture shown below.  
 Assume that the branch instruction is taken.

|     |                  |
|-----|------------------|
| lw  | \$t1, 12(\$t0)   |
| and | \$s0, \$t2, \$t1 |
| beq | \$0, \$s0, 1048d |



- a. Fill out the table below to show which of the hardware blocks are used (i.e., are active) during execution of each instruction. Each hardware block is shown by a box in the next figure.

|     | PC | Instruction Memory | Register File | Sing Extend | ALU | Data Memory | ADD1 | ADD2 | Jump |
|-----|----|--------------------|---------------|-------------|-----|-------------|------|------|------|
| lw  | ✓  | ✓                  | ✓             | ✓           | ✓   | ✓           | ✓    | ✗    | ✗    |
| and | ✓  | ✓                  | ✓             | ✗           | ✓   | ✗           | ✓    | ✗    | ✗    |
| beq | ✓  | ✓                  | ✓             | ✓           | ✓   | ✗           | ✓    | ✓    | ✗    |

- b. Fill out the table below to show which of the control signals in the figure should be activated (set to 1) during execution of each instruction. You do not need to specify inactive or unknown (i.e. '0' and 'x' values) signals in this part.

|     | RegDst | Jump | Branch | MemRead | MemtoReg | MemWrite | RegWrite |
|-----|--------|------|--------|---------|----------|----------|----------|
| lw  | 0      | 0    | 0      | 1       | 1        | 0        | 1        |
| and | 1      | 0    | 0      | 0       | 0        | 0        | 1        |
| beq | ✗      | 0    | 1      | 0       | ✗        | 0        | 0        |

- c. Assume the program starts from address 1032d (i.e., address 1032 decimal) in the instruction memory and branch is taken. Compute the value of PC after each instruction is executed.

|     | Program Counter (PC) Content                |
|-----|---------------------------------------------|
| lw  | $PC = 1032 + 4 = 1036d$                     |
| and | $PC = 1036d + 4d = 1040d$                   |
| beq | $PC = (1040d + 4d) + 4 \times 1048d = 5236$ |

$$PC_{new} = (PC_{old} + 4) + \text{sign-extend(branch address)} \ll 2$$

7. Consider the following instruction sequence executing on the 5-stage MIPS pipeline of form IF, ID, EX, MEM, WB.

|     |                  |
|-----|------------------|
| li  | \$t1, 8          |
| lw  | \$t2, 4(\$t0)    |
| add | \$t3, \$t1, \$t2 |
| sw  | \$t3, 12(\$t0)   |

- a. Assume that the pipeline does not use any forwarding. Complete the following pipeline execution diagram for the above instruction sequence, showing the flow of instructions through the pipeline during each clock cycle. Indicate any necessary stalls on the pipeline diagram. How many cycles does it take to complete execution of the instruction sequence? What is the overall CPI?

| Instruction / Cycle  | 1  | 2  | 3  | 4   | 5   | 6  | 7  | 8   | 9  | 10 | 11  | 12 |
|----------------------|----|----|----|-----|-----|----|----|-----|----|----|-----|----|
| li \$t1, 8           | IF | ID | EX | MEM | WB  |    |    |     |    |    |     |    |
| lw \$t2, 4(\$t0)     |    | IF | ID | EX  | MEM | WB |    |     |    |    |     |    |
| add \$t3, \$t1, \$t2 |    |    | •  | •   | IF  | ID | EX | MEM | WB |    |     |    |
| sw \$t3, 12(\$t0)    |    |    |    |     |     | •  | •  | IF  | ID | EX | MEM | WB |

12 cycles,  $CPI = \frac{12}{4} = 3$  cycles per instruction

- b. Now, assume that all possible forwarding paths have been added to the pipeline. Redraw the pipeline execution diagram below. In the diagram show forwarding by arrows. How many cycles does it take to finish the instruction sequence? What is the overall CPI?

| Instruction / Cycle  | 1  | 2  | 3  | 4   | 5   | 6  | 7   | 8   | 9  | 10 | 11 | 12 |
|----------------------|----|----|----|-----|-----|----|-----|-----|----|----|----|----|
| li \$t1, 8           | IF | ID | EX | MEM | WB  |    |     |     |    |    |    |    |
| lw \$t2, 4(\$t0)     |    | IF | ID | EX  | MEM | WB |     |     |    |    |    |    |
| add \$t3, \$t1, \$t2 |    |    | •  | IF  | ID  | EX | MEM | WB  |    |    |    |    |
| sw \$t3, 12(\$t0)    |    |    |    |     | IF  | ID | EX  | MEM | WB |    |    |    |

9 cycles,  $CPI = \frac{9}{4} = 2.25$

8. Consider a 2-way associative cache memory with a total capacity of 32 words. The block size is 2 words and the processor has an 8-bits address bus. Assume that the processor is only word-addressable meaning that the offset field of the address needs to distinguish between the two words within a block. The following table is intended to show the cache content (in columns labeled as 'Way-1' and 'Way-2') as well as the line/set address (in column labeled as 'Index').

| Index | Way-1 |              |                                                          | Way-2 |      |                      |
|-------|-------|--------------|----------------------------------------------------------|-------|------|----------------------|
|       | Valid | Tag          | Data                                                     | Valid | Tag  | Data                 |
| 000   |       |              |                                                          |       |      |                      |
|       |       |              |                                                          |       |      |                      |
| 001   |       |              |                                                          |       |      |                      |
|       |       |              |                                                          |       |      |                      |
| 010   | 1     | 0000<br>0001 | <del>MEM [4]<br/>MEM [20]<br/>MEM [5]<br/>MEM [21]</del> | 1     | 0010 | MEM [36]<br>MEM [37] |
|       |       |              |                                                          |       |      |                      |
| 011   |       |              |                                                          |       |      |                      |
|       |       |              |                                                          |       |      |                      |
| 100   | 1     | 0000         | MEM [8]<br>MEM [9]                                       |       |      |                      |
|       |       |              |                                                          |       |      |                      |
| 101   |       |              |                                                          |       |      |                      |
|       |       |              |                                                          |       |      |                      |
| 110   | 1     | 0000         | MEM [12]<br>MEM [13]                                     | 1     | 0001 | MEM [28]<br>MEM [29] |
|       |       |              |                                                          |       |      |                      |
| 111   | 1     | 0000         | MEM [14]<br>MEM [15]                                     |       |      |                      |
|       |       |              |                                                          |       |      |                      |

- Assume that the cache contains data for two memory references with addresses 9 and 20. Update the previous table for these addresses. (*Hint:* when you reference a memory location, you will need to move a block that contains this memory address).
- Now assume that the CPU accesses the following memory references (i.e., numbers from left to right represent memory addresses)

8, 37, 14, 12, 9, 20, 13, 14, 29, 5

Update the table from part (a) with valid bit, tag, and data fields in appropriate rows/places to show the cache content. Also, fill out the following table and mention if a reference to each location/address is a cache hit or cache miss. Assume an LRU (Least Recently Used) replacement policy for this cache.

| Word Address | Binary Address | Tag  | Index | Hit/Miss |
|--------------|----------------|------|-------|----------|
| 8            | 00001000       | 0000 | 100   | Hit      |
| 37           | 00100101       | 0010 | 010   | Miss     |
| 14           | 00001110       | 0000 | 111   | Miss     |
| 12           | 00001100       | 0000 | 110   | Miss     |
| 9            | 00001001       | 0000 | 100   | Hit      |
| 20           | 00010100       | 0001 | 010   | Hit      |
| 13           | 00001101       | 0000 | 110   | Hit      |
| 14           | 00001110       | 0000 | 111   | Hit      |
| 29           | 00011101       | 0001 | 110   | Miss     |
| 5            | 00000101       | 0000 | 010   | Miss     |

remaining is tag.

a.  $9 = (\underline{\underline{0000}} \underline{\underline{1001}})_2$

↑                      ↑ offset

index (8 rows)      ↴

$20 = (\underline{\underline{0001}} \underline{\underline{0100}})_2$

tag                      offset  
index



