

B.TECH 5<sup>TH</sup> SEMESTER EXAMINATION, DECEMBER 2021

Computer Architecture and Organization - II [CS 3103]

Name: Abhirup Mukherjee

Examination Roll No.: 510519109

G-Suite ID : 510519109\_abhirup@students.iiests.ac.in

(Q1) i) ① Write - Invalidate

X, 2, P, +, 7, 4

→ In write-invalidate scheme, a write operation for some data (say X) in a block (say B) by a processing unit  $P_i$  does the following:

a) Invalidate all other cached copies of B [but not MM version]

→ in case cache is using write-through, update to X will be immediately reflected in MM also

b) Subsequent request for B, by ~~any~~ any other processor other than  $P_i$ , will be treated as cache miss.

(ii) Write update

- An update of block B in  $P_1$ 's cache, updates all cached copies of B & MM
- Any subsequent request for B from any processor which held B <sup>before</sup> will be treated as hit.

(Q2) ii)



- initial  $x = 20$  (let us assume  $x$  is in loc 0 of MM)
- caches initially empty & direct mapping implementation

(2)

## I write update.

→ before going into write update & write invalidate, some things are required to be addressed.

- ② If  $X$  is in memory, loc  $\alpha$  ( $\alpha$  can be 0,1,2,3) in both  $C_1$  and  $C_2$  using direct mapping,  $X$  will be stored in  $\alpha \vee 2$  loc in cache.

→ ~~so we can~~

→ here we have ~~also~~ assumed that  $X$  will be in memory location 0, and hence  $X$  will be present in loc 0 in  $C_1$  &  $C_2$  also.

- b) Assuming write ~~back~~<sup>through</sup> policy in cache

## II write update, invalidate

| Activity                                                            | $X$ at  |         | MM   |
|---------------------------------------------------------------------|---------|---------|------|
|                                                                     | cache 1 | cache 2 |      |
| initial                                                             |         |         | * 20 |
| ① P1 read $X$ , read miss $C_1$                                     | 20      | -       | 20   |
| ② P2 read $X$ , read miss $C_2$                                     | 20      | 20      | 20   |
| ③ P2 update $X=5$ , $X$ at<br>P1 invalidated, MM updated also       | 5       | -       | 5    |
| ④ due to write through policy                                       | -       | 5       | 5    |
| ⑤ P2 read $X$ , read hit                                            | 20      | -       | 20   |
| P1 update $X=20$ , $X$ at<br>C2 invalidated, write through<br>to MM | 20      | 20      | 20   |
| ⑥ P2 read $X$ , read miss                                           | 20      | 20      | 20   |

(II)

write update

| Activity                                    | X at |    | MM |
|---------------------------------------------|------|----|----|
|                                             | C1   | C2 |    |
| ① initial                                   | -    | -  | 20 |
| ② P1 read X , read miss                     | 20   | -  | 20 |
| ③ P2 read X , read miss                     | 20   | 20 | -  |
| ④ P2 update X=5 , broadcasted to everywhere | 5    | 5  | 5  |
| ⑤ P2 read X , read hit                      | 5    | 5  | 5  |
| ⑥ P1 update X=20 , broadcast everywhere     | 20   | 20 | 20 |
| ⑦ P2 read X , read hit                      | 20   | 20 | 20 |

Q3) (I) EREW SM SIMD computer

- Exclusive Read Exclusive write memory
- No two Processing element's are allowed to simultaneously read / write to same memory location.

Eg

|      |        |
|------|--------|
|      |        |
| loc2 | 100111 |
| loc2 | 111111 |

- simultaneous read from loc1 & loc2 ✓
- simultaneous write to loc1 & loc2 ✓
- simultaneous multiple read from loc1 & loc2 X
- simultaneous multiple write to loc1 & loc2 X

(II) CREW SM SIMD computer.

- Concurrent read Exclusive write

|      |      |
|------|------|
|      |      |
| loc1 | 1000 |
| loc2 | 1000 |

- simultaneous read from loc1 & loc2 ✓
- simultaneous ~~write~~ to loc1 & loc2 ✓
- simultaneous multiple read from loc1 & loc2 ✓
- simultaneous multiple write from loc1 & loc2 X

### III) ER CW SM SIMD computer

→ Exclusive read concurrent write



- simultaneous read from loc1 & loc2 ✓
- simultaneous write to loc1 & loc2 ✓
- simultaneous multiple read from loc1 & loc2 X
- simultaneous multiple write to loc1 & loc2 ✓

### IV) CR CW SM SIMD computer.

→ concurrent read concurrent write.



- simultaneous read from loc1 & loc2 ✓
- simultaneous write to loc1 & loc2 ✓
- simultaneous multiple read from loc1 & loc2 ✓
- simultaneous multiple write to loc1 & loc2 ✓

Time to replace elements  $m$  in table of  $n$  elements by  $100$   
 in an SIMD computer with  $N$  PE's

→ basic idea → search entire array → if there ~~is~~  $m$  in  
 ↓  
 if there is  $m$  in array  
 ↓  
~~then~~ replace that  $m$  by  $100$

Eg : if table / array = { 1, 2, 3, 4,  $m$ , 7, 8,  $m$ ,  $n$  }

end result = { 1, 2, 3, 4, 100, 7, 8, 100, 100 }

→ Basic steps we will have to do in SIMD computer to achieve this

- (a) Broadcast element to be searched ( $m$ ) to all PE's
- (b) all PE compare elements with  $m$
- (c) if  $m$  is found ~~by~~ by PE<sub>i</sub>, it will write  $100$  to it's position, i.e storing of result in memory

I EREW updating table in EREW

a) Broadcasting



procedure → Let  $m$  be in memory  $L_1$

→ time step 1 →  $m$  read by  $P_0$  from  $L_1$  & then written to  $L_2$

→ time step 2 →  ~~$m$  read by  $P_1, P_2$~~  read  $m$  from  $L_2$  &  $L_3$   
and write to  $L_3$  &  $L_4$

→ time step 3 → now  $m$  is in four loc

→ another four PE (3-6) read from  
 $L_1 - L_4$  and writes in  $L_5 - L_8$



→ using this method we can broadcast  $m$  to all PE in  $\log(N)$  time

## II Comparison of $n$ with elements of table.

$\rightarrow$  if  $N > n \rightarrow$  will happen in  $O(1)$

$\rightarrow$  in case  $N < n$ , i.e. no. of elem.  $>$  no. of processing unit,  
each PE will compare  $\frac{n}{N}$  elements with  $n$ .

$\rightarrow$  ~~comparisons~~  $\frac{n}{N}$  comparison inside a single PE happens sequentially

&  $\Rightarrow$  1 PE executes parallelly

$$\text{so Time Complexity} = O\left(\frac{n}{N}\right)$$

## III Storing result in memory

$\rightarrow$  if any element found in, ~~any~~ PE will write 100 in its location

$\rightarrow$  in the worst case, all ~~any~~ elements can be in, so

$$\text{Time complexity} = O(n)$$

$\rightarrow$  here every PE can write simultaneous (different) location  $\rightarrow TC = O\left(\frac{n}{N}\right)$

$$\therefore \text{Total } TC = O(\lg N) + O\left(\frac{n}{N}\right) + O\left(\frac{n}{N}\right)$$

$$\text{if } n \gg N \rightarrow TC = O(1)$$

## (II) CREW update.

### a) Broadcasting.

$\rightarrow$  concurrent read  $\rightarrow$  broadcasting in  $O(1)$

### b) comparison.

$\rightarrow$  same as before  $O\left(\frac{n}{N}\right)$

### c) write update.

$\rightarrow$  same as before  $O\left(\frac{n}{N}\right)$

$$\therefore TC = O(2) + O\left(\frac{n}{N}\right) + O\left(\frac{1}{N}\right)$$

$$\cancel{n} \gg N \rightarrow TC = O(n)$$

## (III) EREW

### a) Broadcasting

$\rightarrow$  same as EREW  $\rightarrow O(\lg n)$

### b) comparison

$\rightarrow$  same as before  $\rightarrow O\left(\frac{n}{N}\right)$

### c) write update

~~$\rightarrow$  concurrent write possible, every PE can write simultaneously to d~~

$\rightarrow$  same as before  $\rightarrow O\left(\frac{n}{N}\right)$

$$\therefore TC = O\left(\frac{n}{N}\right)$$

$$\cancel{n} \gg N \rightarrow TC = O(n)$$

## (IV) CRCW

(a) broadcast

 $\rightarrow$  same as CREW  $\rightarrow O(1)$ 

(b) comparison

 $\rightarrow$  same as before  $O(\frac{n}{m})$ 

(c) write update.

 $\rightarrow$  same as before  $O(\frac{n}{m})$ 

$$\therefore T() = O\left(\frac{n}{m}\right) \quad \cancel{\text{if } n \gg N} \Rightarrow O(n)$$

(Q5)

## ① compulsory miss

- The very first access to some block B will always be a miss.
- This miss is called compulsory miss.
- This miss cannot be avoided.



(II)

## Conflict miss

- when multiple memory blocks map to same cache frame, conflict happens and ~~exist~~ one frame is discarded
- when access to discarded frame is made, it will result in a miss.
- that miss is called conflict miss.

### capacity miss.

→ when cache capacity becomes full, ~~some blocks~~ some blocks will be ~~discarded~~ discarded back to MM & for some other cache.

→ after discard, accessed to elements inside the discarded cache will be a miss.

→ This miss is called capacity miss.

→ If we increase cache size [keeping block size constant]

→ no. of blocks inside cache increases, reducing chance of capacity miss.

→ no. of compulsory miss will be constant ~~&~~

→ conflict miss ~~depends~~ depends on associativity

~~Ex:~~

Pseudo-Associative cache.

→ works very similar to 2-way set associative mapping, with the only difference being as follows.



2 way set  
associative.



pseudo - associative.

→ in pseudo-associative cache ~~and A and B~~ loc A & loc B form a group  
~~if both~~ ~~form~~ → group loc B form a group

→ where A & B have ~~4~~ we equal, but MSB of A & B are different

→ this works faster than 2-way as bit addition is slower than inverting ~~the~~ only MSB

~~in 2 way → check if data present in cache A  
then check if data present in cache B~~

→ hit time of pseudo-associative almost same as direct mapped cache.

→ Pseudo-Associative cache reduces ~~hit time~~ the time required for checking presence of frame, thereby reducing hit time.

### Loop Fusion

Q) Given code.

for int i = 1 to 1000 with increment 1 :  
 $c[i] = B[i] + A[i]$

for int i = 1 to 1000 with increment 1 :  
 $D[i] = B[i] * 2 * A[i];$

let us assume two elements of each array A/B/C/D forms a block

let us see no. of misses for A → A has 1000 elements so  
 500 blocks.

let cache accomodate less than 500 blocks of A at a time.  
 & assume cache replacement policy is first in first out

→ ~~hit time~~

→ There will be 500 misses for A in first loop & then  
 there will be 500 more misses for A in second loop

→ So total 1000 misses.

using loop fusion.

II) Using loop fusion.

for ~~int~~ int i=1 to 1000 with increment 1 :

$$C[i] = B[i] + A[i];$$

$$D[i] = B[i] * 2 * A[i];$$

→ here only 500 misses will take place as ~~A[::]~~ is  
as ~~A[::]~~ in second line of loop A[i] will always  
be hit.

∴ Loop Fusion reduces miss rate by improving temporal locality

Q6) given sequence.

~~z z z b b c d c z z z b b c z z z~~

① 0<sup>th</sup> order.

~~z | b | c | d~~

Q6) a) Context-based value predictor.

Context  $\rightarrow$  sequence of  $n$  most recent values produced for n  
 $\rightarrow$  classified according to order.

$\Rightarrow$   $n^{\text{th}}$  order model takes last  $n$  values & uses this  
 for prediction.

② given sequence

~~z. z z b b c d c z z z b b c z z z~~

② 0<sup>th</sup> order.

|              |              |              |              |
|--------------|--------------|--------------|--------------|
| <del>z</del> | <del>b</del> | <del>c</del> | <del>d</del> |
| 9            | 4            | 3            | 1            |

$\text{prediction} = ?$  z.

(17)

② 1<sup>st</sup> order.

|   | a | b | c | d |
|---|---|---|---|---|
| a | 6 | 2 | 0 | 0 |
| b | 0 | 2 | 2 | 0 |
| c | 2 | 0 | 0 | 1 |
| d | 0 | 0 | 1 | 0 |

prediction  $\Rightarrow$  a

③ 2<sup>nd</sup> order.

|    | a  | b | c | d |
|----|----|---|---|---|
| aa | 3  | 2 | 0 | 0 |
| ab | 0  | 2 | 0 | 0 |
| ac | 0  | 0 | 0 | 0 |
| ad | 0  | 0 | 0 | 0 |
| ba | 0  | 0 | 0 | 0 |
| bb | 0  | 0 | 2 | 0 |
| bc | 1  | 0 | 0 | 1 |
| bd | 0  | 0 | 0 | 0 |
| ca | 12 | 0 | 0 | 0 |
| cb | 0  | 0 | 0 | 0 |
| cc | 0  | 0 | 0 | 0 |
| cd | 0  | 0 | 1 | 0 |
| da | 0  | 0 | 0 | 0 |
| db | 0  | 0 | 0 | 0 |
| dc | 1  | 0 | 0 | 0 |
| dd | 0  | 0 | 0 | 0 |

prediction = a

