

Qurz +

Note: Cannot bypass a mem stage if low / swl

2020

### Question A Note the throughput part

For each of the questions below, create a valid N-stage pipeline of the given circuit. Each component in the circuit is annotated with its propagation delay. Give the latency and throughput of each design, assuming ideal registers ( $t_{PD} = 0$ ,  $t_{SETUP} = 0$ ). Remember that our convention is to place a pipeline register on each output.

- (1) Show the maximum-throughput 1-stage pipeline.



$$3+4+2+2 = 11$$

Latency (ns):   A01  

Throughput (ns<sup>-1</sup>):   A02     11  

- A01 = ?
- A02 = ?

- (2) Show the maximum-throughput 2-stage pipeline using a minimal number of registers.



$$(4+3) \cdot 2 = 14$$

Latency (ns):   A03  

Throughput (ns<sup>-1</sup>):   A04     1  

- A03 = ?
- A04 = ?

- (3) Show the maximum-throughput pipeline using a minimal number of registers.



(用 3 個 reg.)

$$4 \times 3 = 12$$

3stage

Latency (ns):   A05     12  

Throughput (ns<sup>-1</sup>):   A06     1  

- A05 = ?
- A06 = ?

(4) You manage to reimplement the slowest combinational component in the previous circuit (the one with a propagation delay of 4 ns) using two components with propagation delays of 2 ns, as shown below. Show the maximum-throughput pipeline using a minimal number of registers.



$$3 \times 4 = 12$$

Latency (ns): A07  
Throughput ( $\text{ns}^{-1}$ ): A08

- A07 = ?
- A08 = ?

## Question B

For each of the questions below, create a valid N-stage pipeline of the given circuit. Each component in the circuit is annotated with its propagation delay in nanoseconds. Give the latency and throughput of each design, assuming ideal registers ( $t_{PD} = 0$ ,  $t_{SETUP} = 0$ ). Remember that our convention is to place a pipeline register on each output.

(1) Show a maximum-throughput pipeline that uses the smallest possible number of pipeline stages.



3 stage

Latency (ns): B01  
Throughput ( $\text{ns}^{-1}$ ): B02

- B01 = ?
- B02 = ?

(2) You reimplement the 4 ns combinational component in the previous circuit using two faster components connected in series, shown in **grey** below. You can choose the propagation delays of these two components, as long as their delays add to 4 ns (e.g., they could be 3 ns + 1 ns, 2 ns + 2 ns, etc.) Choose the propagation delays of both components in a way that lets you pipeline the circuit for maximum throughput while minimizing the number of pipeline stages. Then, find the maximum-throughput pipeline. Your solution should use the minimum possible number of pipeline stages.



Latency (ns):   B03 9  
 Throughput (ns<sup>-1</sup>):   B04 1/3

- B03 = ?
- B04 = ?

### Question C

For each of the questions below, create a valid **N-stage pipeline** of the given circuit. Each component in the circuit is annotated with its **propagation delay** in nanoseconds. Give the latency and throughput of each design, assuming ideal registers ( $t_{PD} = 0$ ,  $t_{SETUP} = 0$ ). Remember that our convention is to place a pipeline register on each output.

(1) Show the maximum-throughput 2-stage pipeline using a minimal number of registers. What are the latency and throughput of the resulting circuit?



Latency (ns):   C01 20  
 Throughput (ns<sup>-1</sup>):   C02 1/10

- C01 = ?
- C02 = ?

(2) Show the maximum-throughput pipeline using a minimal number of registers. What are the latency and throughput of the resulting circuit?



## Question D

Consider the below 5-Stage Pipeline processor.



(1) Even a simple **in-order** pipelined processor would make use of **speculative execution**. For the 5-stage pipeline above, assume that there is no virtual memory, and that misaligned accesses are checked in the Execute stage. For the instruction sequence below, complete the execution diagram and specify the cycles in which the second add is being executed speculatively. For example, if one stage F is being executed speculatively, use the notation F (S).

推测执行

| Clock Cycle    | 0 | 1   | 2   | 3   | 4   | 5   | 6   | 7 | 8 | 9 |
|----------------|---|-----|-----|-----|-----|-----|-----|---|---|---|
| add x1, x2, x0 | F | D   | X   | M   | W   | -   | -   | - | - | - |
| lw x3, 0(x2)   | - | D01 | D02 | D03 | D04 | D05 | w   | - | - | - |
| add x3, x4, x5 | - | -   | D06 | D07 | D08 | D09 | D10 | - | - | - |

*F<sub>s</sub> D<sub>s</sub> X M W*

- D01 = ?
- D02 = ?
- D03 = ?
- D04 = ?
- D05 = ?
- D06 = ?
- D07 = ?
- D08 = ?
- D09 = ?
- D10 = ?

(2) Given the 5-stage pipeline above, how long is the load-use delay? Answer in terms of how many bubbles must be added between a load and a dependent register-register instruction that is fetched right after the load.

How many bubble(s) must be added?   D11  

## Question E

Consider the 5-Stage pipeline RISC-V processor.



(1) How many cycles does it take to run each iteration of the following loop on a standard 5-stage pipelined RISC-V processor P1?

loop: lw x10 0x100(x0)  
      beqz x10, loop  
      add x12, x10, x11  
      sub x13, x12, x1

lw F D E M W  
beqz F D D D E M W  
add F F F D X X  
sub X X X F  
lw F

Number of cycles per loop iteration:   E01

(2) Consider a modified processor, **P2**, which has extra hardware for the special case of checking if a register is equal to zero or not in the decode stage. What would be the number of cycles per loop iteration in this case?

Number of cycles per loop iteration on processor P2: E02 — 

- E02 = ?

(3) Consider a third processor, **P3**, whose instruction and data memories are pipelined and take 2 clock cycles to respond. Assume that P3 also has the extra hardware for checking if a register is equal to zero or not in the decode stage. What would be the number of cycles per loop iteration using P3?

Number of cycles per loop iteration on processor P3: E03 

- E03 = ?



## Question F

You have been given a 5-stage pipelined RISC-V processor. Unfortunately, the processor you have been given is defective: it has no bypass paths, annulment of instructions in branch delay slots, or pipeline stalls.

*"forwarding"*

```

nop
nop
nop
nop
Loop: lw <10> 0x0(x10)
AAA: sll x14, <10> x11
BBB: bneq x10, loop
CCC: add x13, x10, x13
      nop
      nop
      nop
      nop

```

*"lw F D E M W"*  
*"sll F D D D E M W"*  
*"bneq F F F D E M W"*  
*"add F F"*

You undertake to convert some existing code, designed to run on an unpipelined RISC-V, to run on your defective pipelined processor. The code snip on above is a sample of the program to be converted. It does not make much sense to you, but you are to add the minimum number of NOP instructions at the various tagged points in this code to make it give the same results on your defective pipelined RISC-V as it gives on a normal, unpipelined RISC-V.

Note that the code snip begins and ends with sequences of NOPs; thus, you do not need to worry about pipeline hazards involving interactions with instructions outside of the region shown.

(1) Specify the minimal number of NOP instructions (defined as `add x0, x0, x0`) to be added at each of the labeled points in the above program.

NOPs at Loop: \_\_ F01 0  
 NOPs at AAA: \_\_ F02 3 2  
 NOPs at BBB: \_\_ F03 0  
 NOPs at CCC: \_\_ F04 2 1 ?

- F01 = ?
- F02 = ?
- F03 = ?
- F04 = ?

(2) On a fully functional 5-stage pipeline (with working bypass, annul, and stall logic), the above code will run fine with no added NOPs. How many clock cycles of execution time are required by the fully functional 5-stage pipelined RISC-V for each iteration through the loop?

Clocks per loop iteration: \_\_ F05 —

- F05 = ?



## Question G

The following programs are being executed on the 5-stage pipelined RISC-V processor with full **bypassing**. For each code listing, the pipeline diagram shows the state of the pipeline for cycle `1000` of execution. Please fill in the diagram for cycle `1001`; use `UNKNOWN` if you cannot tell what opcode to write into a stage.

(1) Program below

```
...
sw x1, 0(x0)
lw x17 0xC(x1)
addi x2, x2, -4
slli x11, x17, 2
sw x11, 0(x2)
jal ra, factorial
...
```



| Cycle | 1000 | 1001            |
|-------|------|-----------------|
| IF    | sw   | G01 <u>sw</u>   |
| ID    | slli | G02 <u>slli</u> |
| IE    | addi | G03 <u>addi</u> |
| MEM   | lw   | G04 <u>addi</u> |
| WB    | sw   | G05 <u>lw</u>   |

③ M/M stage 1B Bypass 不能用  
 M/M forward  
 開題目

(2) Program below

```
***  
xor x11, x11, x12  
slli x12, x12, 3  
sub x13, x12, x11  
and x12, x13, x11  
add x13, x12, x13  
sw x13, 0x100(x0)  
***
```

| Cycle | 1000 | 1001 |
|-------|------|------|
| IF    | add  | G06  |
| ID    | and  | G07  |
| IE    | sub  | G08  |
| MEM   | slli | G09  |
| WB    | xor  | G10  |

sw  
add  
and  
sub  
slli

- G06 = ?
- G07 = ?
- G08 = ?
- G09 = ?
- G10 = ?

### Problem A

2021

Consider the G circuit, depicted on the following.



Since the G circuit is currently only combinational, we know exactly what to improve: Pipeline it.

1. We decide to start small with a **two-stage pipeline**. Using the diagram below, please show a two-stage pipeline with maximal throughput using a minimum number of registers. Calculate the overall throughput and latency of your circuit.



- Latency (ns): 24 A01
- Throughput ( $\frac{1}{ns}$ ): A02 1/16

- A01 = ?
- A02 = ?

2. Now, we decide to add an additional stage. Using the diagram below, please show a **three-stage pipeline** with maximal throughput using a minimum number of registers. Calculate the overall throughput and latency of your circuit.



- Latency (ns): \_\_ A03 \_\_
- Throughput ( $\frac{1}{ns}$ ): \_\_ A04 \_\_

- A03 = ?
- A04 = ?

3. Then, think of the maximal throughput pipeline. Using the diagram below, please show a pipeline that maximizes the throughput using a minimum number of registers. Calculate the latency and throughput of our circuit. You should use the smallest number of pipeline stages required to achieve maximum throughput.



- Latency (ns): \_\_ A05 \_\_
- Throughput ( $\frac{1}{ns}$ ): \_\_ A06 \_\_

- A05 = ?
- A06 = ?

2. Now, we decide to add an additional stage. Using the diagram below, please show a **three-stage pipeline** with maximal throughput using a minimum number of registers. Calculate the overall throughput and latency of your circuit.



- Latency (ns): A03 30
- Throughput ( $\frac{1}{ns}$ ): A04  $\frac{1}{10}$

- A03 = ?

- A04 = ?

3. Then, think of the **maximal throughput** pipeline. Using the diagram below, please show a pipeline that maximizes the throughput using a minimum number of registers. Calculate the latency and throughput of our circuit. You should use the smallest number of pipeline stages required to achieve maximum throughput.



- Latency (ns): A05 28
- Throughput ( $\frac{1}{ns}$ ): A06  $\frac{1}{7}$

- A05 = ?

- A06 = ?

## Problem B

Assume that we are trying to build a RISC-V processor. For the reference purpose, there are two 5-stage ( IF , DEC , EXE , MEM , WB ) bypassing processors. However, they are a little broken. The processors are as follows:

- Processor 1: A defective 5 stage bypassing processor which does not annul instructions following branches.
- Processor 2: A defective 5 stage bypassing processor that reads all values bypassed back as 0 but reads correctly from the register file.

Known properties:

- Both processors always predict that branches are not taken (they always fetch from PC + 4 ).
- Both processors determine the direction of the branch in the EXE stage.

We try this simple RISC-V looping code on both of these processors. Assume at the start of the code that  $x_2$  is set to some number greater than 0 and that all the other registers are set to 0 .

```
L1: addi x1, x2, -4  
bnez x1, L1  
slli x3, x1, 1  
lw x2, 0x100(x0)  
. = 0x100  
.word 0x4
```



- If we had a fully working 5 stage bypassing processor, which always fetches from PC + 4 and annuls instructions following taken branches, how many cycles would it take for one iteration of the loop? Assume that the bnez is taken. Provide the number of cycles per loop iteration.
  - You may use the pipeline diagrams below to help you answer the question, but you are not required to fill them out.

| cycle | i    | i+1 | i+2 | i+3 | i+4 | i+5 | i+6 | i+7 | i+8 |
|-------|------|-----|-----|-----|-----|-----|-----|-----|-----|
| IF    | addi |     |     |     |     |     |     |     |     |
| DEC   |      |     |     |     |     |     |     |     |     |
| EXE   |      |     |     |     |     |     |     |     |     |
| MEM   |      |     |     |     |     |     |     |     |     |
| WB    |      |     |     |     |     |     |     |     |     |

- number of cycles per loop iteration = \_\_ B01 \_\_

- B01 = ? 4

2. If we pass the number `0x10` into `x2`, by the time the second iteration's addi reaches EXE, what value will be passed to `x1` for each of the two defective processors? If the loop does not make it to a second iteration's addi, write N/A.
- You may use the pipeline diagrams below to help you answer the question, but you are not required to fill them out.

| cycle | i    | i+1 | i+2 | i+3 | i+4 | i+5 | i+6 | i+7 | i+8 |
|-------|------|-----|-----|-----|-----|-----|-----|-----|-----|
| IF    | addi |     |     |     |     |     |     |     |     |
| DEC   |      |     |     |     |     |     |     |     |     |
| EXE   |      |     |     |     |     |     |     |     |     |
| MEM   |      |     |     |     |     |     |     |     |     |
| WB    |      |     |     |     |     |     |     |     |     |

| cycle | i    | i+1 | i+2 | i+3 | i+4 | i+5 | i+6 | i+7 | i+8 |
|-------|------|-----|-----|-----|-----|-----|-----|-----|-----|
| IF    | addi |     |     |     |     |     |     |     |     |
| DEC   |      |     |     |     |     |     |     |     |     |
| EXE   |      |     |     |     |     |     |     |     |     |
| MEM   |      |     |     |     |     |     |     |     |     |
| WB    |      |     |     |     |     |     |     |     |     |

- Processor 1's `x1` : \_\_ B02 0
- Processor 2's `x1` : \_\_ B03 N/A

Processor 1: 不會停止 instruction even if branch

$$4 - 4 = 0$$

Processor 2: bypassed value 還是 0

add: F D E M W  
 bne: F D E M W  
 sht: F D E M W  
 lw: F D E M W

N/A  
 branch { N/A  
 N/A

C

Since having full bypassing can be very costly, we attempt to reorder the instructions in this loop so that we can minimize the total number of cycles per loop iteration while using a single bypass path.

so-called EXE → DEC path

```
loop:
    slli a4, a3, 5
    or a5, a0, a7
    xor a3, a4, a2
    add a1, a2, a3
    sub a6, a5, a2
    blt a6, a2, loop
```



We can change the order of the instructions in the program as long as it does not change the final result. In what order would you execute the 6 instructions in the loop so that you only need a single bypass path?

decode

1. Here, we select EXE → DEC bypass path. Please list the opcodes in order of execution: \_\_ C01 \_\_ (split in comma)

• C01 = ?

2. Now, suppose we are offered another processor. This processor is identical to the processor used above with full bypassing, except that it can make branch decisions in the decode stage rather than the execute stage. The disadvantage is that this increases the clock period from 400 ps (before) to 450 ps (with the branch calculated in decode).
  - o Which processor is faster (old: branch decision in EXE and  $t_{CLK} = 400$ ps, new: branch decision in DEC and  $t_{CLK} = 450$ ps)? \_\_ C02 new

• C02 = ?

FD E M W

$$8 \times 400 = 3200$$

$$7 \times 450 = 3150$$

## Problem D

Consider the following C code:

```
int price[6] = {7, 5, 8, 10, 15, 7};
int maximum = 10, c = 3, t = 5;
for (int i = 0; i < 6; i++) {
    if (price[i] > maximum)
        maximum = price[i];
}
int total_cost = maximum + c + t;
```

Then, the translated RISC-V assembly:

```
// x4 = 0x24 - length of price in bytes
// x5 = 0x3 - c
// x6 = 0x5 - t
// x7 = 0xA - maximum
// x1 = 0x400 - address of price[0]
start: add x2, x0, 0
       slli x2, x2, 2
loop:  add x8, x2, x1
       lw x3, 0(x8)
       bge x7, x3, skip
       mv x7, x3
       ori x7, x7, 0
skip: add x2, x2, 4
      blt x7, x4, loop
      add x8, x7, x5
      add x7, x7, x6
```

Assume the registers are initialized to the values specified in the assembly code comments.

In the following five-stage pipelined RISC-V processor ( IF , DEC , EXE , MEM , WB ):

- All branches are predicted not-taken. (Always fetch from PC + 4 ).
- Branch decisions are made in the EXE stage.
- The pipeline has full bypassing.
- The processor annuls instructions following taken branches.
- Assume that in the first iteration of the loop both branches are taken.

1. How many cycles did it take to execute the first loop iteration on this processor? Make sure not to include the first two instructions at label start in your cycle count.

◦ Cycles to execute first iteration of the loop on this processor: \_\_ D01 \_\_

◦ D01 = ?

11

2. If you could modify your fetch stage to always fetch the correct next instruction instead of predicting all branches not taken, how many cycles will it now take to execute the first iteration of the loop on this modified processor? \_\_ D02 \_\_

◦ D02 = ?

9

3. Now, let's change this processor without bypassing. That is, all data hazards are resolved by stalling. How many cycles did it take to execute the first loop on this processor? \_\_ D03 \_\_

◦ D03 = ?

11 + 0x3 + 1 = 24

24



loop

## Problem E

Given the standard RISC-V datapath, determine if the following is implementable or not without any additional functional units? Assume the instruction is not a pseudoinstruction encoding.



1. Consider the new instruction `isnull rd, rs1` which check if an input given through `rs1` is considered NULL or not by C standard. The result is returned through `rd` as a bit. Is it implementable? E01 (Answer with Yes or No)

E01 = ?

Yes

2. What changes would you need to make in order for the instruction to be able to execute correctly? Assume all modifications and additions are done on top of the existing single cycle datapath. Select all that apply. E02 (Answer in the following items and use comma to split)

ceh

???

- a . Modify Branch Comparator logic.
  - b . Modify the control signals to the `ALU`.
  - c . Modify the control logic for the Branch Comparator.
  - d . Modify `ALU` buses.
  - e . Modify the control logic for `WBSel`.
  - f . Add additional control signals for the writeback mux.
  - g . Modify control logic for `ALU / ALUSel` .
  - h . Modify the control logic for parsing `instr[31:0]` .
  - i . Add an additional comparator.
  - j . None of the above.
- E02 = ?

3. `isnull rd, rs1` is not in a standard RISC-V instruction format; as we are attempting to reduce the number of hardware changes in our datapath. We instead choose to implement our instruction as a pseudoinstruction in the following format. Which of the following statements is true? Assume earlier changes propagate. Select all that apply \_\_ E03 \_\_ (Anwser in the items and use comma to split)

- a . We need to provide a second argument `x0` when calling the instruction and modify the control signals.
- b . We need to provide a second argument `x0` as a comparator for all branch comparisons.
- c . It is impossible to represent as an R-Type instruction
- d . We need to wire `x0` as a comparator for all branch comparisons.
- e . We need to wire `x0` as `rs2` and modify the control signals.

o E03 = ?

e

?

4. We plan to add a new R-Type signed compare instruction called `comp`, into the RISC-V single-cycle datapath, to compare `R[rs1]` and `R[rs2]` and set `R[rd]` appropriately. The RTL for it is shown below.

```
// comp rd, rs1, rs2
if R[rs1] > R[rs2]: R[rd] = 1
elif R[rs1] == R[rs2]: R[rd] = 0
else: do nothing
```

We want to change the datapath to make this work. We start by adding two more inputs ( `0x00000000` and `0x00000001` ) to the rightmost `WBSel` MUX. What else is required to make this instruction work? Select all that apply \_\_ E04 \_\_ (Anwser in the items and use comma to split)

- a . Modify Branch Comp
- b . Modify Imm. Gen.
- c . Modify the ALU and `ALUSel` control signals
- d . Modify the control logic for `RegWEn`
- e . Modify the control logic for `MemWEn`

o E04 = ?

d

?

and 0 otherwise. Using the datapath below, fill in the following table with the rest of the control bits for this instruction. If the control bit can be set to \*, please draw an X in the table below.



| IArrN | PCSel | RegWEn | MemRW | WBSel | BrUn | ALUSel |
|-------|-------|--------|-------|-------|------|--------|
| 1     | F01   | F02    | F03   | F04   | F05  | ADD    |

- F01 = ?
- F02 = ?
- F03 = ?
- F04 = ?
- F05 = ?

## Problem A

Assume that we are building a [JIT compiler](#), which generates RISC-V instructions at runtime. Because a JIT must render and execute a native binary image at runtime, true machine-code JITs necessitate platforms that allow for data to be executed at runtime – code is data. Here is a simplified implementation based on C and RV32I: (filename: jit.c )

```

1 #include <stdint.h>
2 #include <stdio.h>
3
4 typedef int func_t(void);
5 int main(int argc, char *argv[])
6 {
7     uint32_t instructions[] = {
8         [0] = A01, /* addi a0, zero, $b */
9         [1] = 0x00008067, /* A02 */
10    };
11
12 /* Reinterpret the array address as a function */
13 func_t *jit = (func_t *) instructions;
14 printf("%d\n", jit());
15
16 return 0;
17 }
```

Compile via the command below:

```
$ riscv-none-elf-gcc -O0 -march=rv32i -mabi=ilp32 -o jit.elf jit.c
```

Run it with [rv32emu](#).

```
$ rv32emu jit.elf
```

We get the corresponding output.

```
(gdb) inferior exit code 0
```

Reference: [Porting a JIT compiler to RISC-V: Challenges and Opportunities](#)

- Given instruction `addi a0, zero, ?` as shown in Line 8, what is the [HEX value](#) of A01?  
 A01 = ?
- What is the disassembly of `0x00008067` (shown in Line 9) ? \_\_ A02 \_\_  
 A02 = ?
- Explain why the above program works. \_\_ A03 \_\_  
 A03 = ? *Reinterpret the array address as function*

## Problem B

The Python code for the recursive function `count8`, which counts the number of 8's (base 16) in the input, is shown below. `count8(0x988) = 2` and `count8(0x81) = 1`, for instance. A RISC-V assembly-based implementation of the function is shown below.

Python code

```
def count8(in):
    if in < 8:
        return 0
    end = in & 0xF
    current_8 = 0
    if end == 8:
        current_8 = 1
    return current_8 + count8(in >> 4)
```

The corresponding RISC-V assembly

```
count8:
    li a1, 8
    bgt a1, a0, base
    addi sp, sp, -12
    sw ra, 0(sp)
    sw s1, 4(sp)
    andi a2, a0, 0xF # a2 is end ; a0 is in
    li s1, 0
L1:
    bne __B01__ # if end==8
    addi s1, s1, 1
    ox9D8 continue:
        srli a0, a0, 4
        sw a0, 8(sp)
        call count8
    ox9EY L2:
        lw ra, 0(sp)
        lw s1, 4(sp)
        addi sp, sp, 12
        j done
    base:
        li a0, 0
    done:
        ret
```

1. What should be `B01` to make the assembly implementation match the Python code?

- `bne a2, a1, continue`

2. How many words will be written to the stack before the program makes each recursive call to the function `count8`? `__B02__` 3

The instruction call count8 makes the first call to the function count8 in the program outside of the function definition. The program is interrupted right before the execution of `lw ra`, `0(sp)` at label L2 during a recursive call to count8 . The layout of a memory area is shown in the diagram below. The values of all addresses and data are shown in hex. The sp register's current value, `0xEAC`, points to the location indicated in the diagram.

| -         | Address | Data                |
|-----------|---------|---------------------|
|           | 0xE94   | 0x9E4               |
|           | 0xE98   | 0x0                 |
|           | 0xE9C   | 0x0                 |
| <u>SP</u> | 0xEA0   | __ B03 <u>0x9E4</u> |
| <u>S1</u> | 0xEA4   | __ B04 <u>0x1</u>   |
| <u>a0</u> | 0xEA8   | __ B05 <u>0x8</u>   |
| sp →      | 0xEAC   | 0x9E4               |
| <u>S1</u> | 0xEB0   | 0x1                 |
| <u>a0</u> | 0xEB4   | 0x82                |
|           | 0xEB8   | 0xC8 <u>ra</u>      |
| <u>S1</u> | 0>EBC   | 0x73                |
| <u>a0</u> | 0xEC0   | 0x828               |
|           | 0xEC4   | 0xC14               |
|           | 0xEC8   | 0x82                |

$$\begin{array}{l} 0x82 \rightarrow 4 \\ = 0x8 \end{array}$$

3. Fill B03, B04, and B05 for the stack associated with `0xEA0`, `0xEA4`, and `0xEA8` (answer in HEX)

- B03 = ?
- B04 = ?
- B05 = ?

4. What is the initial input in at the initial call to count8? \_\_ B06 \_\_ (answer in HEX)  
 o B06 = ?

0x828 here is 1

5. What are the values in the following registers right when the execution of `count8` is interrupted? (answer in HEX)

- ra = B07
- a2 = B08
- s1 = B09
- B07 = ? 0x9E4
- B08 = ? 0x80x2?
- B09 = ? 0x1

6. What is the value in register `a0` right when the execution of `count8` is interrupted?  
 (answer in HEX) \_\_ B10 \_\_

- B10 = ? 0x~ 0x8?

7. What is the hex address of the `call f` instruction that made the initial call to `count8`? \_\_

B11 \_\_  
 o B11 = ? 0xC4

8. What is the hex address of the `continue` label? \_\_ B12 \_\_

- B12 = ? 0x9D8

### Problem C

Take a look at the logic diagram below, which computes two outputs  $\{x, y\}$  from four inputs  $\{a, b, c, d\}$ . Calculate the circuit's  $t_{PD}$  using the  $t_{PD}$  details for the gate components listed in the table below.



| Gate  | $t_{PD}$ |
|-------|----------|
| XNOR2 | 5.5ns    |
| AND2  | 3.5ns    |
| OR2   | 2.5ns    |
| INV   | 1.0ns    |

$$t_{PD} = \underline{\quad} C01 \underline{\quad} \text{ ns}$$

• C01 = ?

### Problem D

Assume that you are creating  $G$  circuit, as seen below. Everything has been running smoothly until you discover that the  $G$  circuit's throughput is too low one day. Your biggest client needs a throughput at least twice as high as what you can currently offer in order to include the  $G$  circuit into their new product. You know precisely what to do: pipeline it as the  $G$  circuit is now only combinational.



D.

1. You choose to start small with a two-stage pipeline for your initial iteration. Calculate your circuit's overall throughput and latency.



- o Latency: D01 ns
- o Throughput: D02  $\text{ns}^{-1}$
- o D01 = ?
- o D02 = ?

You decide to add an additional level after completing the first iteration because you are beginning to feel quite at ease with the G circuit. Let's implement the maximal throughput pipeline. Kindly explain a pipeline using the following design that maximizes throughput while utilizing the fewest possible registers. Calculate your circuit's throughput and latency.

NOTE: Use the fewest possible pipeline stages to maximize throughput in your pipelined circuit.



- o Latency: D03 ns
- o Throughput: D04  $\text{ns}^{-1}$

3. Assume that we want to integrate another pipelined circuit, Circuit H, shown below. However, this circuit has multiple different timing specifications! In the below table, you have timing specifications for two different implementations of H: Version A, and Version B. Using your maximal-throughput G circuit from part C, select the version of H that minimizes your latency of the combined G-H circuit. Additionally, calculate the latency and throughput of the combined G-H circuit.



| -            | Version A | Version B |
|--------------|-----------|-----------|
| clock period | 7 ns      | 8 ns      |
| stages       | 2         | 1         |

- o Which one can minimize your latency of the combined G-H circuit? (Answer in Version A or Version B) \_\_ D05 \_\_
- o Overall latency = \_\_ D06 \_\_ ns
- o Overall throughput = \_\_ D07 \_\_ ns<sup>-1</sup>
  - o D05 = ?
  - o D06 = ?
  - o D07 = ?

### Problem E

We are considering the following RISC-V processor design. For this processor P, the code and the starting state of the necessary registers in the register file are shown below. Keep in mind that while x6 is in hexadecimal, registers x1 through x5 are given their values in decimal.

```

start:
    lw x1, 0(x6)
    addi x2, x0, 5
    blt x1, x2, end
    addi x3, x2, 11
    sub x4, x3, x1
    xori x5, x6, 0x1
end:
    sub x4, x3, x2
    addi x3, x4, 7
    addi x3, x1, 3
    . = 0x400
    .word 0x1

```

| Register | Value |
|----------|-------|
| x1       | 1     |
| x2       | 5     |
| x3       | 32    |
| x4       | 10    |
| x5       | 20    |
| x6       | 0x400 |



Processor P : 5-stage pipelined RISC-V processor, which is fully bypassed and has branch annulment. Branch decisions are made in the EXE stage and branches are always predicted not taken. What are the values of registers x3 and x4 at the start of cycle 12 (in decimal)?

- x3: E01 32
- x4: E02 26
- |
- |   ◦ E01 = ?
- |   ◦ E02 = ?

## Problem F

The P<sub>1</sub> and the P<sub>2</sub> are two RISC-V based designs whose performance you are expected to evaluate. You have made the choice to compare the performance of the two CPUs using the next construction step.

```
L1:
    slli x11, x10, 2
    lw    0x80(x11)
    add x13, x13, x11
    addi x10, x10, 1
    blt x10, x14, L1
    sub x14, x14, x15
    xorri x15, x14, 0x1
```



The P<sub>1</sub> is a 4-stage pipelined processor with full bypassing. It attempts to improve performance by decreasing the number of cycles it takes to execute each instruction. The EXE and MEM stages are combined. Load requests are sent in the EXE stage and received in the WB stage one cycle after. The IF stage speculatively fetches the instruction at PC+4 and annuls incorrectly fetched instructions after a branch misprediction. Branches are resolved in the EXE stage.

The P<sub>2</sub> is a 6-stage pipelined processor with full bypassing. It attempts to improve performance by raising the clock speed. The EXE stage is split into two stages to break up the critical path through the ALU. The IF stage always fetches the instruction at PC+4 and annuls incorrectly fetched instructions after a taken branch. ALU and Branch ALU results are available only in EXE2 .

1. How many cycles does it take the P<sub>1</sub> to execute one iteration of the loop? F01 8  
|   ◦ F01 = ?

2. How many cycles does it take the P<sub>2</sub> to execute one iteration of the loop? F02 13  
|   ◦ F02 = ?

3. Assume that we can build the P<sub>1</sub> with a clock period of 6ns. What is the clock period that the P<sub>2</sub> must achieve to attain the same performance on this assembly sequence? F03 ns  
|   ◦ F03 = ?

$$6 \times 8 = \square \times 13, \quad \square = \frac{48}{13}$$



## Problem G

Assume that while our data memory utilizes a little more realistic model that includes a DRAM, our instruction memory continues to respond in one cycle. A memory address, a read/write signal, and optional write data are all inputs to the memory. The CPU first uses the cache to fulfill read or write requests. The data can be quickly sent back to the pipeline if the access succeeds in the cache. It will take significantly longer for the CPU to reach the DRAM, and then retrieve the needed data from it.

Assume that each of the 2 processors has extra hardware that enables them to foresee when branches will be taken and to get the branch's destination instruction immediately following the branch instruction (without requiring any stalls).

Consider the following 2 RISC-V processor designs.

- P<sub>1</sub>: Basic 5-stage pipeline with bypass paths from EXE to DEC and from WB to DEC



- P<sub>2</sub>: Since the memory access can take a long time in the worst case, we use pipelining to improve the processor performance. The MEM stage is further divided into three pipeline stages.



1. Assume the 2 processors have ideal registers ( $t_{\text{SETUP}} = t_{\text{HOLD}} = t_{\text{PD}} = t_{\text{CD}} = 0\text{ns}$ ) between different pipeline stages. Please derive the minimum clock cycle given the following propagation delays for the logic blocks.

| Logic Block | Propagation Delay |
|-------------|-------------------|
| IF          | 2 ns              |
| DEC         | 2 ns              |
| EXE         | 3 ns              |
| MEM         | 10 ns             |
| WB          | 2 ns              |
| bypass      | 1.5 ns            |
| MEM1        | 4 ns              |
| MEM2        | 3 ns              |
| MEM3        | 3 ns              |

- Clock period for P<sub>1</sub>: G01 4 ns
- Clock period for P<sub>2</sub>: G02 4.5 ns
- G01 = ?
- G02 = ?

4.5  
(3+1.5)  
BYPASS

2. The code segment below copies elements from array A to array B, where Array A starts at address  $0x100$ , and Array B starts at address  $0x500$ . Assume that this loop has been running for a long time, and that the processor always predicts that the branch will be taken.

```

add x1, x0, x0    x1=0
add x2, x0, x0    x2=0
Next:
lw x3, 0x100(x1)
addi x1, x1, 4    x1+=4
sw x3, 0x500(x2)
addi x2, x2, 4
blt x1, x4, Next // x4=1000

```

|      |   |   |   |   |   |
|------|---|---|---|---|---|
| lw   | F | D | E | M | W |
| addi | F | D | E | M | W |
| sw   | F | D | E | M | W |
| addi | F | D | E | M | W |
| blt  | F | D | E | M | W |
| lw   | F | D | E | M | W |

How many cycles does this loop take to execute in  $P_1$ ? G03

- o G03 = ?

3. Assume the memory can only serve one read or write operation at a time. In  $P_2$ , since MEM1, MEM2 and MEM3 stages all use the memory, we now have structural hazard. Structural hazard happens if multiple instructions try to use the same hardware resource. To resolve the structural hazard, we can only allow one memory access instruction exist in the MEM1, MEM2, and MEM3 stages. Note that it is ok to have multiple other types of instructions, e.g., arithmetic instructions, as long as they do not need to use the memory. Specifically, we need some mechanism to stall memory access instructions.

The detailed stall mechanism is explained below. We will inject NOPs to the EXE stage when the IF and DEC stages are stalled. When the DEC has a memory access instruction, we stall for the following 2 cases:

- o If the EXE stage also has a memory access instruction, stall IF and DEC for 2 cycles
- o If the MEM1 stage also has a memory access instruction, stall IF and DEC for 1 cycle

These stalls are in addition to any stalls required to deal with data hazards. How many cycles does this loop take to execute in  $P_2$ ? G04

- o G04 = ?

$$6+2=8$$

|       |   |   |   |                |                |                |                |   |
|-------|---|---|---|----------------|----------------|----------------|----------------|---|
| lw    | F | D | E | M <sub>1</sub> | M <sub>2</sub> | M <sub>3</sub> | W              |   |
| addi: | F | D | E | M <sub>1</sub> | M <sub>2</sub> | M <sub>3</sub> | W              |   |
| sw    | F | D | D | M <sub>1</sub> | M <sub>2</sub> | M <sub>3</sub> | W              |   |
| addi: | F | F | F | D              | M <sub>1</sub> | M <sub>2</sub> | M <sub>3</sub> | W |
| blt   |   |   |   |                |                |                |                |   |
| lw    |   |   |   |                |                |                |                |   |



## Quiz4 of Computer Architecture (2023 Fall)

HackMD

Combinational logic circuits.

## Quiz4 of Compute...

- Problem A
- Problem B
- Problem C
- Problem D
- Problem E
- Problem F
- Problem G

全部展開  
回到頂部  
移至底部



| -                     | $t_{PD}$ | $t_{CD}$ | $t_{SETUP}$ | $t_{HOLD}$ |
|-----------------------|----------|----------|-------------|------------|
| Register (R1, R2, R3) | 5ns      | 1ns      | 7ns         | 2ns        |
| Circuit A             | 3ns      | ?        | -           | -          |
| Circuit B             | 4ns      | 2ns      | -           | -          |
| Circuit C             | 6ns      | 2.5ns    | -           | -          |

1. What is  $t_{PD}$  of this sequential circuit? \_\_ B01 \_\_ ns

B01 = ?

上午 9:15 11月14日 週二

hackmd.io

91%

## Quiz4 of Computer Architecture (2023 Fall)

HackMD

2. What is  $t_{CD}$  of this sequential circuit? \_\_ B02 \_\_ ns

B02 = ?

3. What is the minimum value of  $t_{CD}$  for Circuit A that will ensure proper operation of this circuit? \_\_ B03 \_\_ ns

B03 = ?

4. What is the minimum clock period that can be employed to clock all the registers within the circuit? \_\_ B04 \_\_ ns

B04 = ?

5. Available from the supplier are alternative circuits for A, B, and C, each with the following specifications.

| -     | $t_{PD}$ | $t_{CD}$ | $t_{SETUP}$ | $t_{HOLD}$ |
|-------|----------|----------|-------------|------------|
| A-New | 1ns      | 0.5ns    | -           | -          |
| B-New | 2.5ns    | 2ns      | -           | -          |
| C-New | 2ns      | 1ns      | -           | -          |

Replacement of these new circuits is cost-sensitive, allowing for the substitution of only one combinational circuit to achieve a minimized clock period. Please specify which combinational circuit you intend to replace and the resulting minimum clock period.

Combinational circuit replaced: \_\_ B05 \_\_

$t_{CLK}$  of circuit: \_\_ B06 \_\_ ns

## Quiz4 of Computer Architecture (2023 Fall)

each output.

1. Consider a 2-stage pipeline optimized for maximum throughput while utilizing the fewest possible registers. Determine the latency and throughput of the resulting circuit, taking careful note of the direction of each arrow.

## Quiz4 of Compute...

Problem A

Problem B

**Problem C**

Problem D

Problem E

Problem F

Problem G

全部展開

回到頂部

移至底部



- Latency: C01 ns
- Throughput: C02 ns<sup>-1</sup>
- C01 = ?
- C02 = ?

上午 9:32 11月14日 週二

Quiz4 of Computer Archi X Quiz4 of Computer Archi X Quiz4 of Computer Archi X sequential circuit tcd me X +

← → ⌂

86%

## Quiz4 of Computer Architecture (2023 Fall)



- Latency: C01 ns
- Throughput: C02 ns<sup>-1</sup>
- C01 = ?
- C02 = ?

2. Consider a maximum-throughput 4-stage pipeline using a minimal number of registers.  
What are the latency and throughput of the resulting circuit?

- Latency: C03 ns
- Throughput: C04 ns<sup>-1</sup>
- C03 = ?
- C04 = ?

3. Consider a maximum-throughput pipeline that uses a minimum number of pipeline stages.  
What are the latency and throughput of the resulting circuit?

- Latency: C05 ns
- Throughput: C06 ns<sup>-1</sup>
- C05 = ?
- C06 = ?

## Quiz4 of Computer Architecture (2023 Fall)

HackMD



Below, you can find the timing specifications for each of the components:

| Component                         | Propagation delay ( $t_{prop}$ ) |
|-----------------------------------|----------------------------------|
| Register                          | 1ns                              |
| Decoder                           | 2ns                              |
| RegFile read                      | 3ns                              |
| MUX                               | 1ns                              |
| ALU                               | 4ns                              |
| Adder                             | 3ns                              |
| Memory read (instruction or data) | 4ns                              |

Setup/hold times for clocked inputs (registers and writes to RegFile and data memory)

- Setup time ( $t_{setup}$ ): 2 ns
  - Hold time ( $t_{hold}$ ): 0 ns
1. What is the shortest clock cycle duration achievable for this processor? Minimum  $t_{CLK}$ :        ns

## Quiz4 of Computer Architecture (2023 Fall)

- Quiz4 of Computer...
- Problem A
- Problem B
- Problem C
- Problem D**
- Problem E
- Problem F
- Problem G
- 全部展開
- 回到頂部
- 移至底部

2. The current processor's performance is suboptimal. We intend to introduce an alternative data path (while keeping the control logic unchanged) in which the data memory's Adr input is sourced differently. What is the minimum clock cycle time of the new processor?

Minimum  $t_{CLK}$ :        ns

3. Now, the new processor executes certain instructions incorrectly as per the RISC-V ISA.

## Problem E

### Quiz4 of Compute...

- Problem A
- Problem B
- Problem C
- Problem D**
- Problem E
- Problem F
- Problem G

全部展開  
回到頂部  
移至底部

```
loop:
    lw a1, 0(a0)
    lw a0, 4(a0)
    add a2, a2, a1
    bne a0, loop
    mv a0, a2
    addi sp, sp, 8
    ret
```



- Assume that at cycle 100 the `lw a1, 0(a0)` instruction is fetched. Fill in the following:

- Number of cycles per loop iteration:   E01
- Number of cycles per loop iteration wasted due to stalls:   E02
- Number of cycles per loop iteration wasted due to annulments:   E03
- E01 = ?
- E02 = ?
- E03 = ?

We aim to enhance the performance of this processor. Upon careful examination, we have identified an opportunity to save cycles by introducing additional bypass paths. Contrary to its label, "full bypassing" does not fully exploit the potential for bypassing. It primarily bypasses