

## Tut - 1

Combinatorial is another name for combinational

① Combinational circuit : outputs depends only on inputs.  
(no memory elements.)

Seq. circuit : Memory elements. (output depends on history of inputs.)

② GNU assembler. (as) , g++ compiler.



④ Floppy disks.

⑤ If parallel tasks then → octa core ; {depends on clockspeed, power consumption etc}

⑥ MIPS ⇒ microprocessor without interlocked pipeline stages.

⑦ Networking devices & automotive devices.

Q.7.

s3

char \*y; // 32 bits. } -

~~FUNC:~~

};



~~FUNC:~~

addi \$s0, \$zero, 100

addi \$s2, \$zero, 0

slt \$t0, \$s2, \$s0

beq \$t0, \$zero, EXIT

sll \$t0, \$s2, 4

align & M(i) → p[i

(16 bytes for each M)

add \$s3 \$t0, \$s1

lui \$t0, 1 ( $2^{16}$ )

addi \$t0, \$t0, 3 ( $2^{16} + 3$ ) ( $65536 + 3$ )

sw 0(\$s3), \$t0 → x

sw 4(\$s3), \$t0 } a ( $2^{32})(2^{16} + 3)$

sw 8(\$s3), \$t0

Note big endian - higher order 4 bytes is at lower address

addi \$t0, \$s2, 3

sw 12(\$s3), \$t0

addi \$s2, \$s2, 1

j FUNC

EXIT :



WHILE:

$s_0 \rightarrow \text{save}$   
 $s_1 \rightarrow i$   
 $s_2 \rightarrow k$

①. ⑧ ~~addi \$s\_2, \$zero, k~~

~~addi~~

muli \$t\_0, \$s\_1, 4 replace with sll \$t0, \$s1, 2

② addl \$s\_3, \$t\_0, \$s\_0

lw \$t\_0, 0(\$s\_3)

③

④

beq \$t\_0, \$s\_2, NEXT

j EXIT

NEXT:

addi \$s\_1, \$s\_1, 1

j WHILE

EXIT:

⑤ ② = ④ + ①

⑥ ②  $\rightarrow 20+6$       ④  $\rightarrow 0+5$

bne \$t\_0, \$s\_2, EXIT

addi \$s\_1, \$s\_1, 1

j WHILE

EXIT:

This is the expected solution for (a)

For (b)

you need to implement like a do-while loop  
where condition is checked after body  
However, since it is actually a while loop,  
you need to jump past the body straight  
to loop check (this jump will be executed  
only once)

## TUTORIAL 2

v0 does not have to be saved as caller, as v0 is a return value

Q1

Main  
Callee: \$ra, \$t0, \$s1  
Caller: \$a0

F  
Callee: \$ra, \$s0, \$t1  
Caller: \$a0, \$t1

G  
Callee: \$ra, \$t3  
Caller: -

(Registers Addressed)

Q2

Opcode will increase from 5 to 6 bits  
 $R_s, R_t, R_d$

Q3

$$\frac{2^{16}-1}{4} \text{ (space available)} \xrightarrow{\text{! temporary}} 2^3 - 1 \Rightarrow 8191$$

(space rep for int)

Q4 (a) Yes

(p1.0, p2.0)

(b) Since \$t1, \$s0 are both callee preserved, & swapping them shouldn't cause an issue (so, YES)  
(q1.0, q2.0)

2 a) Yes (p1.0, p2.0) b) No, since we are using an external library, then ~~caller temporary~~ and \$t1, \$s0 are ~~caller temporary~~ to is caller saved and s0 is callee saved and callee saved respectively.

Tutorial 3

1. a) # Assume \$a0 is  $a[0]$ , \$a1 is  $b[0]$   
 # \$t0 has i, \$t1 has k \$t2 has N, \$s0 has c.  
 # \$a2 is the max length of arrays 'a' and 'b'

bge \$a2, \$a3, a3-lower

a2-lower:

sll \$a4, \$zero, \$a2, 2  
 + a4-set

a3-lower:

sll \$a4, \$zero, \$a3, 2

a4-set:

or \$t0, \$zero, \$t1 # Set i = k  
 loop-begin: sll \$t0, \$t1, 2 #  $i = i + k$

sll \$a4, \$t0, \$t2, 2 #

bne bge \$t0, \$t2, end-loop

sll \$t0, \$a4

bge \$t4, \$zero, end-loop

addi \$a2, \$a2, 4

addi \$a3, \$a3, 4 ; addi \$t0, \$t0, 4 ; j loop-begin

lw \$t5, 0(\$a2)

add \$t5, \$t5, \$s0

sw \$t5, 0(\$a2)

sll \$t3, \$t2, 2

add \$a2, \$a0, \$t0

add \$a3, \$a1, \$t0

loop-begin

b) Optimization already performed in the code above.

- c) I can increment \$a2 and \$a3 by 4 after loading and storing to prepare for the next loop iteration
- d) Both can be used, getting rid of the addi instructions in the end

2. a) or, sll, add, add, srl  $\rightarrow$  5 R formats

loop  $\rightarrow$  erg

sll

erg

2 memory ops

add  $\times 4$

# jump

$\Rightarrow$  3 branch, 2 memory and 5 R format' instrs

- b)  $\Rightarrow (5 \text{ R format}) + (3 \text{ branch} + 2 \text{ memory} + 5 \text{ R}) \times (N - k)$

Using lw and sw - we, we get  $2R$  with the second set of brackets

c)

→ Consider a context where the result of the multiplication is 32 bits wide and the product is 64 bits wide.

3. a) No: of clock cycles taken =  $10^{10}$  clock cycles.

Finally, we have  $9 \times 10^{10}$  clock cycles.

⇒  $5 \times 10^8$  mult operations were replaced.

b) Let every other line in the code correspond to a single clock cycle.

⇒  $8 \times 10^9$  instructions. Now we have  $9 \times 10^9$  instead of  $8.5 \times 10^9$  instructions.

$$\Rightarrow \frac{90}{85} = \frac{18}{17} \Rightarrow \boxed{\frac{1}{17}}$$

4. a) != is not defined between floats with infinite precision. Instead, consider the absolute value of  $(\text{est} \times \text{est})^{-1}$  and bound it below, say,  $10^{-5}$ .

b) Newton-Raphson iteration

5.





1)

add3 \$t0, \$t1, \$t2

Jul-05

add3 \$t3, \$t4, \$t5

This solution is correct  
-Bhaskar

∴ add3 \$t0, \$t1, \$t2

add3 \$t3, \$t4, \$t5

Structural Hazard  
since same ALU

- 2) Yes, due to add3, considering only the structural stalls caused by it, we get almost 1 structural stall due to add3, which matches with 2 instructions of (add, add) but the structural stall needn't occur always, add3 improves performance

This answer  
is correct  
-Bhaskar

If we make the simplifying assumption that add3 is followed by an instruction using ALU, then perf with/without add3 will be the same - same amount of work completed in same number of cycles

3.

lw \$t0 4(\$sp)  
sw \$t1 8(\$t0)

3.

lw \$t0 4(\$sp)  
sw \$t1 8(\$t0)This answer is correct  
-Bhaskar5. a), b) ~~lw~~  
~~add3~~ \$sp 0(\$sp)  
lw \$t0 0(\$sp)This ans is correct, except that  
usually we won't lw from memory to \$sp  
-Bhaskarlw \$sp 0(\$sp)  
lw \$t0 \$sp

no stall

c)

This answer is correct  
-Bhaskar

## Forwarding to Exstage

1) EX/MEM to EX

add  $r_1 r_2 r_3$   
add  $r_5 r_1 r_4$

2) MEM/WB to EX

add  $r_1 r_2 r_3$   
nop  
add  $r_5 r_1 r_4$

## Forwarding to Mem Stage

3) MEM/WB to MEM

lw  $s_t 2 o(s_t 1)$   
sw  $s_t 2 o(s_t 3)$  (with a stall)

4) Post WB to Mem

add  $R_1 R_2 R_3$   
nop  
sw  $o(R_6), R_1$

SAT-6

Tut-6

1)

- a) 3-stage branch completion
- b) 2-stage branch completion
- c) Assume branch not taken
- d) Branch prediction
- e) Branch target buffer
- f) Delayed branches

These two can be considered compatible. Branch predictor can return 3 possible values: 0, 1, don't-know  
 "Assume branch not taken" can be used when branch predictor does not know

similar argument

Incompatible pairs : {fa, fb}, {c, dy}, {c, ey}, {c, fb}, {dy, ey}, {c, fb}

These 3 can be considered compatible in the case of
 

- 3-stage branch
- 1 delay slot only

2)

- (a) and \$t1 \$t0 \$t2 anything not producing s1 is ok
- (b) and \$s1 \$t0 \$t1 anything producing s1 is ok
- (c) ~~L1 - add \$t0 \$s3 \$t0~~ anything not producing s2 or s3
- (d) addi \$t0 \$t2 2 this is not the cleanest answer - may be correct under some assumptions  
 some instruction producing s2 or s3 is the expected answer  
 e.g. sub \$s2, \$s0, \$s1
- (e) addi \$s2 \$s1 1 anything which does not read s2 is ok
- (f) addi \$s1 \$s2 1 anything which reads s2 is ok
- (g) ~~L1 - addi \$s2 \$s2 \$s2 2~~ answer should create a data hazard stall  
 - which means involving lw followed by a dependent instruction  
 L4 reads s2, make L7 also read s2 and make L1 load s2  
 e.g. L1: lw \$s2, 4(\$sp)  
 L7: sub \$s2, \$s0, \$s1

# CS 230 Twt 7 Solutions



Start with same code as in slides (SLIDES) -  
fwd from reg-reg to reg-reg

Add this condition at the end:

if ((IF/ID.Rs == EX/MEM.Rt) && (EX/MEM.Mem Rd == 1)) } Similar for  
Fwd A = 01 } Fwd B - replace R1 with RT

- ② if ((IF/ID.Rs == ID/EX.Rt) && (ID/EX.Mem Rd == 1))
- can be combined with previous if conditions
  - the STALL if case will be satisfied during the first time the dependent reg-reg instruction is in ID
  - the forwarding if case will be satisfied during the 2nd time.



- ④ if ((IF/ID.Rs == EX/EX.Rd) && (EX/EX.RegWr == 1) && (EX/EX.Mem Rd == 0))
- Fwd I = 1
- else Fwd I = 0
- ⑤ if ((IF/ID.Rs == ID/EX.Rd) && (ID/EX.RegWr == 1) && (ID/EX.Mem Rd == 0))
- STALL → IF/ID latch = 0

- ⑥ Invalid op code triggers in CC3  
Similar for Fwd 2, replace Rs with RT  
In CC4, first instruction of invalid op code exception handler is fetched, CC4 also detects misaligned mem exception which takes precedence
- lw IF ID EX MEM WB  
invalid op code

- Q1. diff b/w registers and L1 cache as cache
- CPU can only directly access registers (as it is part of CPU memory)  
register memory is much smaller thus faster access
  - Data loaded in registers is dictated by program  
unlike L1 cache. This is the main/important/conceptual difference

Q2. Cache is SRAM, smaller, closer to CPU  
hence making it faster than main memory (DRAM)

Q3. Main memory =  $2^8$  words.

block size = 2 words

cache memory =  $2^3$  blocks.

last three bits of  
main mem block #



| word # | cache # | tag | hit / miss |
|--------|---------|-----|------------|
|--------|---------|-----|------------|

|   |   |   |      |
|---|---|---|------|
| 2 | 1 | 0 | miss |
|---|---|---|------|



|   |   |   |     |
|---|---|---|-----|
| 3 | 1 | 0 | hit |
|---|---|---|-----|

|    |    |   |      |
|----|----|---|------|
| 11 | 15 | 0 | miss |
|----|----|---|------|

|    |   |   |      |
|----|---|---|------|
| 16 | 0 | 1 | miss |
|----|---|---|------|

|    |    |   |      |
|----|----|---|------|
| 21 | 10 | 1 | miss |
|----|----|---|------|

|    |   |   |      |
|----|---|---|------|
| 13 | 6 | 0 | miss |
|----|---|---|------|

|    |    |   |      |
|----|----|---|------|
| 64 | 32 | 0 | miss |
|----|----|---|------|

|    |    |   |      |
|----|----|---|------|
| 48 | 24 | 0 | miss |
|----|----|---|------|

|    |   |   |      |
|----|---|---|------|
| 19 | 9 | 1 | miss |
|----|---|---|------|

|    |   |   |     |
|----|---|---|-----|
| 11 | 5 | 0 | hit |
|----|---|---|-----|

|   |   |   |      |
|---|---|---|------|
| 3 | 1 | 0 | miss |
|---|---|---|------|

|    |    |   |      |
|----|----|---|------|
| 22 | 11 | 1 | miss |
|----|----|---|------|

|   |   |   |      |
|---|---|---|------|
| 4 | 2 | 0 | miss |
|---|---|---|------|

|    |    |   |      |
|----|----|---|------|
| 27 | 13 | 1 | miss |
|----|----|---|------|

|   |   |   |      |
|---|---|---|------|
| 6 | 3 | 0 | miss |
|---|---|---|------|

|    |   |   |      |
|----|---|---|------|
| 11 | 5 | 0 | miss |
|----|---|---|------|

Q 4 main memory size = 1024 words. =  $2^{10}$  words

$$\# \text{ blocks in main memory} = 2^9$$

$$\# \text{ blocks in cache} = 2^3$$

would be same as Q3

since the size is same  
just tag size would be larger by 2 bits.



Q 5. main memory =  $82 \text{ MB} = 2^{10} \cdot 2^{10} \cdot 2^5 = 2^{25} \text{ bytes}$

$$\text{Cache size} = 512 \text{ KB} = 2^{10} \cdot 2^9 = 2^{19} \text{ bytes}$$

$$\text{block size} = 16\text{-word} = 2^4 \cdot 2^2 = 2^6 \text{ bytes}$$

$$\# \text{ blocks in cache} = 2^{19} / 2^6 = 2^{13}$$

Assuming we need ~~valid, dirty~~ <sup>6 bits</sup> per block.

$$25 - (13 + 6) = 6$$

DM :



0 bits comparator

$$\text{metadata} = (\text{tag} + 2) \text{ bits each block}$$

$$= 2^{13} \cdot (2^3) \text{ bits.}$$

$$= 2^{16} \text{ bytes} = 8 \text{ KB.}$$

(b) 2 way set associative.

i) each set has 2 blocks

$$\# \text{ of sets} = \frac{2^{13}}{2} = 2^{12}$$

$$= \frac{2^5}{(12+6)}$$



ii) ~~1~~ bits each comparator

$$\# \text{ of comparators} = \# \text{ of blocks in set} = 2$$

$$\text{iii) } \# \text{ of bits compared/comparator} = \text{tag size} = 7$$

$$\text{iv) metadata} = 9 \times 2^{13} \text{ bits} = 9 \times 2^{10} \text{ bytes} = 9 \text{ KB}$$

(c) 9 way set associative

each set has 4 blocks.

$$\# \text{ of sets} = \frac{2^{13}}{2^2} = 2^9$$



$$\text{ii) } \# \text{ of comparators} = 4$$

$$\text{iii) } \# \text{ of bits compared/comparator} = 8$$

$$\text{iv) metadata size} = (8+2) \times 2^{13} \text{ bits} = 10 \times 2^{10} \text{ bytes} = 10 \text{ KB.}$$

(d) fully associative  $\# \text{ of sets} = 1$  offset



$$\# \text{ of blocks in cache} = 2^{13} = \# \text{ of comparators}$$

$$\text{ii) } \# \text{ of comparators} = 2^{13} = 8192$$

$$\text{iii) } \# \text{ of bits compared/comparator} = 19$$

$$\text{iv) metadata size} = (19+2) \times 2^{13} \text{ bits} = 21 \text{ KB}$$

## (d) 8-way set associative

each set has 8 blocks

$$\# \text{ of sets} = \frac{2^{13}}{2^3} = 2^{10}$$



$$\text{ii) } \# \text{ of comparators} = 8$$

$$\text{iii) } \# \text{ of bits comp / comparator} = 9$$

$$\text{iv) metadata size} = (9+2) \times 2^{13} \text{ bits}$$

$$= 11 \text{ KB}$$

(Q1). block size = 16 words.

$$\text{miss rate} = 0.25 \cdot 1 = \frac{1}{400} = r$$

$$\text{CPI w/o miss} = 2.5$$

DRAM latency : a: command : 2

b: access : 20

c: comm data : 2

assuming: no early start policy

$$X\text{-wide mem} \Rightarrow \text{bus size} = X$$

Granularity of access  
is  $X$

w/o interleaving

$$\text{CPI} = 2.5 + r * \frac{\text{blocksize}}{X} * (a+b+c)$$

with interleaving

$$\text{CPI} = 2.5 + r * (a+b+c * \frac{\text{blocksize}}{X})$$

$$\textcircled{a} \quad X=1 \quad \text{w/o interleaving} \quad \text{CPI} = 2.5 + \frac{1}{400} \cdot \frac{16}{1} \cdot (24) \\ = 2.5 + 0.96 = \underline{\underline{3.46}}$$

$$\text{with interl. CPI} = 2.5 + \frac{1}{400} (22 + 16 \cdot 2) = 2.5 + 0.135 \\ = \underline{\underline{2.635}}$$

$$\textcircled{b} \quad X=2 \quad \text{CPI} = 2.5 + \frac{1}{400} \cdot 8 \cdot (24) = 2.5 + 0.48$$

$$\text{w/o interleaving: CPI} = 2.5 + \frac{1}{400} \cdot 8 \cdot (24) \\ = \underline{\underline{2.98}}$$

$$\text{with interl: CPI} = 2.5 + \frac{1}{400} (22 + 8 \cdot 2) = 2.5 + 0.095 \\ = \underline{\underline{2.595}}$$

$$\textcircled{c} \quad X=4 \quad \text{w/o interl: CPI} = 2.5 + \frac{1}{400} \cdot 4 \cdot (24) = 2.5 + 0.24 =$$

$$\text{w/interl: CPI} = 2.5 + \frac{1}{400} \cdot (22 + 4 \cdot 2) = 2.5 + 0.075 \\ = \underline{\underline{2.575}}$$

$$\textcircled{d} \quad X=8 \quad \text{w/o interl: CPI} = 2.5 + \frac{1}{400} \cdot 2 \cdot (24) = 2.5 + 0.12 \\ = \underline{\underline{2.62}}$$

$$\text{w/interl: CPI} = 2.5 + \frac{1}{400} (22 + 2 \cdot 2) = 2.5 + 0.065 \\ = \underline{\underline{2.565}}$$

Q2. half of instr. have data access

| processor      | cache type | block size (word) | inst <sub>m</sub> rate (imr) | data miss rate (dmr) |
|----------------|------------|-------------------|------------------------------|----------------------|
| P <sub>1</sub> | BM         | 2                 | 4 T.                         | 6 T.                 |
| P <sub>2</sub> | BM         | 4                 | 2 T.                         | 4 T.                 |
| P <sub>3</sub> | 2-way set  | 4                 | 2 T.                         | 3 T.                 |

Cache miss penalty :  $6 + \frac{\text{block size}}{\text{words}}$

measured  $CP2(P_2) = 2.0$ .

(a) measured  $CP2 = \frac{i_{\text{ideal}}}{CP2} + \left( \text{imr} + \frac{\text{dmr}}{2} \right) * \text{miss penalty}$ .

for P<sub>1</sub>

$$2 = i_{\text{ideal}} + \frac{4}{100} \times 7 = i_{\text{ideal}} + 0.49$$

$i_{\text{ideal}} CPI = 1.51$

(b)

$P_2$ : miss penalty  $= 6 + 4 = 10$

$$\text{imr} + \frac{\text{dmr}}{2} = 4 + 10 = 14$$

$$CPI = 1.51 + \frac{14}{100} = 1.91$$

- for P<sub>3</sub>
- (i) not enough data to compute effect of associativity on hit time  $\rightarrow CPI$  and overhead via block eviction policy
  - assuming these to be negligible for this prob

$P_3$ : miss penalty  $= 6 + 4 = 10$

$$\text{imr} + \frac{\text{dmr}}{2} = 3.5 + 10 = 13.5$$

$$CPI = 1.51 + \frac{13.5}{100} = 1.86$$

, P<sub>3</sub> is the fastest.

Note: number of blocks in set must be  $3 \times 2^k$  so that  
number of sets in cache is a power of 2, and also number  
 $\frac{9GB}{2^{10} \cdot 2^k \cdot 2}$

Q3. Cache size = 3 K

to calculate offset # off block size must be power of 2

- (a) for D.M. and 2-way set # sets in cache is not power of 2 hence both not possible with simple bit extractions.
- (b) fully associative max block size = ~~1K~~ 4K words
- (c) blocksize = 1K words =  $4 \times 2^{10} = 2^{12}$  bytes

mem addr field

