

USE YOUR IMAGINATION™

# Blue Book

EXAMINATION BOOK

BOX NO. \_\_\_\_\_

NAME RAMAN KUMAR JHA (RJ2712)  
SUBJECT CSA  
CLASS \_\_\_\_\_  
SECTION A  
INSTRUCTOR AZEEZ  
DATE 10/02/2025

11" x 8.5" 8 LEAVES 16 PAGES

ROARING SPRING®  PAPER PRODUCTS  
ROARING SPRING, PA 16673

RAMAN KUMAR JNA  
(RJ2712)

① [a]

Assuming the 4 byte word is stored in  $x_5$ .

andi  $x_6, x_5, 0x00FF0000$  # extract byte 3  
slli  $x_6, x_6, 8$  # shift extracted byte 3 to bits 15:8  
andi  $x_7, x_5, 0x0000FF00$  # extract byte 2  
slli  $x_7, x_7, 8$  # shift extracted byte 2 to bits 23:16  
or  $x_8, x_6, x_7$  # combine swapped byte 2 and 3  
andi  $x_5, x_5, 0xFF0000FF$  # zero out bits 23:8 in original register  
or  $x_5, x_5, x_8$  # load swapped byte 2 and 3 in original register.

① [b] i)  $f = g - A[B[C[16]]]$

lw  $x_7, 64(x_7)$  #  $C[16]$   
slli  $x_7, x_7, 2$   
add  $x_6, x_6, x_7$  #  $A[B[C[16]]]$   
lw  $x_6, 0(x_6)$  #  $B[C[16]]$   
slli  $x_6, x_6, 2$   
add  $x_5, x_5, x_6$  #  $A[B[C[16]]]$   
lw  $x_5, 0(x_5)$  #  $A[B[C[16]]]$   
sub  $x_8, x_9, x_5$  #  $f = g - A[B[C[16]]]$

RAMAN KUMAR JHA  
(RJ2712)

① [b] ii)  $f = g - A[C[8]] + B[B[64]]$

|      |                 |                                |
|------|-----------------|--------------------------------|
| lw   | $x_6, 256(x_6)$ | # $B[B[64]]$                   |
| lw   | $x_7, 37(x_7)$  | # $C[8]$                       |
| add  | $x_6, x_6, x_7$ | # $B[B[64]] + C[8]$            |
| slli | $x_6, x_6, 2$   |                                |
| add  | $x_5, x_5, x_6$ | # $4A[B[B[64]]] + C[8]$        |
| lw   | $x_5, 0(x_5)$   | # $A[B[B[64]] + C[8]]$         |
| sub  | $x_8, x_9, x_5$ | # $f = g - A[B[B[64]] + C[8]]$ |

① [b] iii)  $A[i] = 8B[4i-4l] + 8C[64i+64]$

|      |                       |                                   |
|------|-----------------------|-----------------------------------|
| slli | $x_{11}, x_{10}, 6$   | # $64i$                           |
| addi | $x_{11}, x_{11}, 64$  | # $64i+64$                        |
| slli | $x_{11}, x_{11}, 2$   |                                   |
| add  | $x_7, x_7, x_{11}$    | # & $C[64i+64]$                   |
| lw   | $x_7, 0(x_7)$         | # $C[64i+64]$                     |
| slli | $x_7, x_7, 3$         | # $8C[64i+64]$                    |
| slli | $x_{12}, x_{10}, 2$   | # $4i$                            |
| addi | $x_{12}, x_{12}, -41$ | # $4i - 41$                       |
| slli | $x_{12}, x_{12}, 2$   |                                   |
| add  | $x_6, x_6, x_{12}$    | # $4B[4i-41]$                     |
| lw   | $x_6, 0(x_6)$         | # $B[4i-41]$                      |
| slli | $x_6, x_6, 3$         | # $8B[4i-41]$                     |
| add  | $x_6, x_6, x_7$       | # $8C[64i+64] + 8B[4i-41]$        |
| slli | $x_{13}, x_{10}, 2$   |                                   |
| add  | $x_5, x_5, x_{13}$    | # $A[i]$                          |
| sw   | $x_6, 0(x_5)$         | # $A[i] = 8B[4i-41] + 8C[64i+64]$ |

Q[6]

$$0 \times 2F_{16} = 0010\ 1100 = 46_{10}$$

$$0 \times 2D_{16} = 0011\ 1101 = 61_{10}$$

$$46 \times 61 = 2806_{10} = 0$$

$$0 \times 2F_{16} = 0010110_2 = 46_{10}, \text{ and}$$

$$46 = 32 + 8 + 4 + 2 = 2^5 + 2^3 + 2^2 + 2^1$$

So,

We can shift  $0 \times 3D$  left 5 places

$$= 0\ 0111\ 1010\ 0000 = 1952_{10} = 0x7A0_{16}$$

then add  $0 \times 3D$  left shifted 3 places

$$= 001\ 1100\ 1000 = 488_{10} = 0x1E8_{16}$$

then add,  $0 \times 3D$  left shifted 2 places,

$$= 00\ 1111\ 0100 = 244_{10} = 0xF4_{16}$$

then add  $0 \times 3D$  left shifted 1 place

$$= 0\ 0111\ 1010 = 122_{10} = 0x7A_{16}$$

$$\text{Adding } 1952 + 488 + 244 + 122 = 2806$$

$$= 0xAf6_{16}$$

So, we can use 4 shifts and 3 adds instead of actual multiplication.

⑤

- To reverse all bits in a register, the most efficient instruction is the ~~grevi~~ available in bit manipulation extension. It performs the operation most efficiently.
- grevi t1, t0, 31

In this it will reverse the bits of t0 register and it will store it in t1 register as the result.

This single instruction provides the whole bit reversal.

Instruction detail —

grevi rd, rs1, imm reverses the bits of rs1 and writes the result to rd where imm is the register bits - 1

If the bit manipulation extension is not available, a multi instruction management system is needed, but grevi performs it in a single instruction.

③

The best known algorithm for swapping two registers without a temporary register uses the XOR swap trick. This works because XOR is reversible and does not need any other register other than the one being swapped.

$\text{xor } x_5, x_5, x_6 \quad \# x_5 = x_5 \wedge x_6$

$\text{xor } x_6, x_5, x_6 \quad \# x_6 = x_5 \wedge x_6$

$\Rightarrow (x_5 \wedge x_6) \wedge x_6 = x_5$

$\text{xor } x_5, x_6, x_6 \quad \# x_5 = x_5 \wedge x_6$

$\Rightarrow (x_5 \wedge x_6) \wedge x_6 = x_6$

\* After these three instructions, contents of  $x_5$ , and  $x_6$  are swapped.

\* No additional registers are used.

\* This is optimal for the RISC V processor, and follows the constraints provided in the question.

④

### Given program

addi  $x8, x0, 0$   $\# i=0$

addi  $x9, x0, 0$   $\# \text{sum}=0$

addi  $x10, x0, 10$   $\# x10=10$

Loop:

beq  $x8, x10, L2$   $\# \text{if } i=10, \text{ goto L2}$

add  $x9, x9, x8$   $\# \text{sum}=\text{sum}+i$

addi  $x8, x8, 1$   $\# i=i+1$

j loop

L2:

### Step 1: Instruction count per loop.

The loop mainly consists of 5 instructions

→ beq

→ add

→ addi ( $x10$ )

→ addi ( $x11$ )

→ j

### Step 2: Number of iterations

$x10$  starts from 0, increment by 10

each loop, the loop runs until  $x10 = 2x8$

Given  $x8=100$ , (from eg), initial  $x10=0$

The number of iteration is:

$$\text{The number of loop} = \frac{100}{10} = 10$$

But let's check the exit condition:

- when  $x10 = x8(100)$ , beg triggers and exits.
- At that iteration,  $x10$  has reached 100 and the loop stops

(→ So, for  $x10 = 0, 10, \dots, 90, 100$  - when  $x10$  is 100, beg is true and code jumps to L2 (exit))

→ This means that the beg is executed 11 times, and once for each  $(0, 10, 20, \dots, 90, 100)$

### Step 3: Total instructions executed

→ For the first 10 iterations ( $x10=0$  to  $x10=90$ ) all 5 instructions execute.

→ At  $x10=100$ , beg executes, and the branch is taken (jumps to L2) other branch instructions not executed.

Total instructions =  $50 + 1 = 51$

- 16 full loops (Each with 5 instructions)
- on the 11<sup>th</sup> time begins, it successfully branches and L2 takes over.

#### Step 4: Cycle calculation

→ Each iteration = 1 cycle ( $CPI = 1$ )  
So, 51 cycles are required to complete the program.

#### Step 5: CPI for multicycle processor given ideal condition

$$CPI = 1$$

CPI is defined as cycles per instructions which will remain 1 as per the information given.

② (a) For overall SPEC performance, geometric mean of SPEC ratios, which summarizes the performances across all benchmarks.

\* opteron geometric mean

SPEC ratio: 20.86

\* Itanium 2 geometric mean

SPEC ratio: 27.12

The higher the SPEC ratio, the better the performance.

Itanium 2 has a high geometric mean SPEC ratio (27.12), so

based on overall SPEC performance, I should choose Itanium 2 because it is faster on average across all benchmarks of the table.

(2) (b) Weighted average of execution time ratios =  $(0.6 \times \text{mupwise ratio}) + (0.2 \times \text{itanium2 ratio}) + (0.2 \times \text{apsi})$

From the table —

opteron / itanium ratios:

$$\rightarrow \text{mupwise} = 0.92$$

$$\rightarrow \text{ammp} = 1.03$$

$$\rightarrow \text{apsi} = 0.65$$

Weighted average

$$= (0.6 \times 0.92) + (0.2 \times 1.03) + (0.2 \times 0.65)$$

$$= 0.888$$

The weighted average of execution

time ratios ~~is 0.89~~

$$0.888$$

(2) (c) Speedup is the inverse of the execution time ratio,

Hence, Speedup =  $\frac{1}{0.888} = 1.126$

The opteron is 1.126 times faster than the itanium2 for this workload.