



# Dipartimento di Elettronica e Informazione

## Politecnico di Milano

20133 Milano (Italia)  
Piazza Leonardo da Vinci, 32  
Tel. (+39) 02-2399.3400  
Fax (+39) 02-2399.3411

---

### Advanced Computer Architecture

July 21, 2022

Prof. Donatella Sciuto and Marco Santambrogio

|            |
|------------|
| Name       |
| Last Name  |
| Professor: |

You have 90' to complete the exam

|                   |  |
|-------------------|--|
| Problem 1 (6pts)  |  |
| Problem 2 (6pts)  |  |
| Problem 3 (6pts)  |  |
| Question 1 (5pts) |  |
| Question 2 (5pts) |  |
| Question 3 (5pts) |  |
| Total (32pts)     |  |

To pass the exam (the following conditions are in AND):

- The overall score has to be equal or higher than 18
- At least 1 problem and 1 Question has to be correct

## Problem 1

Consider two CPUs: CPU1 e CPU2. CPU1 has clock cycle of 2 ns while CPU2 has an operating frequency of 700MHz. Given the following frequencies of occurrence of the instructions for the two CPUs

| Operation type | Frequency | CPU1 | CPU2 |
|----------------|-----------|------|------|
| A              | 0.3       | 2    | 2    |
| B              | 0.1       | 3    | 3    |
| C              | 0.2       | 4    | 3    |
| D              | 0.3       | 2    | 2    |
| E              | 0.1       | 4    | 3    |

- a) Compute the average CPI for CPU1 and CPU2
- b) Which is the slowest CPU?
  
- c) Assume that CPU1 is modified in such a way that:
  - All instructions require 8 clock cycles;
  - Memory stalls are not considered;
  - The miss penalty is 7 clock cycles

Assuming a miss rate of 23% and assuming 3 memory references on average for each instruction determine the impact on performance of the cache with respect to an ideal cache?

## Problem 2

Consider the following access pattern on a two-processor system with a direct-mapped, write-back cache with one cache block and a two cache block memory. Assume the MESI protocol is used, with write-back caches, write-allocate, and invalidation of other caches on write (instead of updating the value in the other caches).

| Time | After Operation    | P1 cache block state | P2 cache block state | Memory at block 0 up to date? | Memory at block 1 up to date? |
|------|--------------------|----------------------|----------------------|-------------------------------|-------------------------------|
| 0    | P1: read block 0   |                      |                      |                               |                               |
| 1    | P2: write block 1  |                      |                      |                               |                               |
| 2    | P1: write block 0  |                      |                      |                               |                               |
| 3    | P2: read block 0   |                      |                      |                               |                               |
| 4    | P1: write block 0  |                      |                      |                               |                               |
| 5    | P2 : write block 1 |                      |                      |                               |                               |
| 6    | P1: read block 0   |                      |                      |                               |                               |
| 7    | P2: write block 0  |                      |                      |                               |                               |
| 8    | P1: read block 0   |                      |                      |                               |                               |
| 9    | P2: read block 0   |                      |                      |                               |                               |

### Problem 3

Assume that the following code has been executed on a CPU with SCOREBOARD.

|    | Instruction      | ISSUE | READ OPERAND | EXE COMPLETE | WB |
|----|------------------|-------|--------------|--------------|----|
| I1 | LD F1, 16 (R1)   | 1     | 2            | 6            | 7  |
| I2 | ADDD F12, F1, F2 | 2     | 8            | 13           | 14 |
| I3 | MULTD F2, F4, F3 | 3     | 4            | 15           | 16 |
| I4 | DIVD F3, F12, F5 | 4     | 15           | 26           | 27 |
| I5 | SD F3, 4 (R1)    | 5     | 28           | 32           | 33 |
| I6 | SUBD F2, F12, F4 | 17    | 18           | 23           | 24 |

- A. List all the possible conflicts in the code.
- B. Is there a “configuration” that can respect the shown execution?  
How many units? Which kind? What latency?
- C. If the previous table was not correct, please, write the right one and specify  
the number, kind and latency for each unit.

## **Question 1**

The following 5 questions are True/False (you do not have to motivate/support your answers).

Each of them will count 1 point

### **Question 1.1**

A scalar architecture can let instructions from different threads to be executed at the same clock cycle.

Given the previous statement, confirm if it is TRUE or FALSE and **effectively support** your answer.

Circle the **right** answer:    True              False

### **Question 1.2**

The ARM big.LITTLE architecture, considering the fact that is a multicore architecture where the cores are sharing the same ISA, cannot be considered a heterogenous architecture.

Circle the **right** answer:    True              False

### **Question 1.3**

A physically distributed memory can be used to implement a shared-memory architecture

Circle the **right** answer:    True              False

### **Question 1.4**

A VLIW Architecture has to have multiple Program Counters to load the necessary Multiple Data

Circle the **right** answer:    True              False

### **Question 1.5**

Increasing the number of stages in a pipeline, it is always improving the performance

Circle the **right** answer:    True              False

**Question 2**

Explain the difference among Dependencies and Hazards.

**Question 3**

Explain the main differences between loop unrolling and software pipelining?

1

$$CPI_1 = \sum_{i=1}^E F_i \cdot CPU_{1,i} = 2,7$$

$$CPI_2 = \sum_{i=1}^E F_i \cdot CPU_{2,i} = 2,4$$

$$MIPS_1 = \frac{F_1}{CPI_1 \cdot 10^3} = 1,85$$

$\Rightarrow CPU_1$  SLOWEST

$$MIPS_2 = \frac{F_2}{CPI_2 \cdot 10^3} = 2,92$$

$$CACHE = CLOCK CYCLES + (REFERRENCES \cdot PENALTY \cdot RATE) = 1283$$

$$IDEAL\ CACHE\ (RATE=0)=8$$

$$IZ = \frac{CACHE}{IDEAL\ CACHE} = 62\%$$

2

| Time | After Operation    | P1 cache<br>block state | P2 cache<br>block state | Memory at<br>block 0 up<br>to date? | Memory at<br>block 1 up<br>to date? |
|------|--------------------|-------------------------|-------------------------|-------------------------------------|-------------------------------------|
| 0    | P1: read block 0   | E(0)                    | I                       | YES                                 | YES                                 |
| 1    | P2: write block 1  | E(0)                    | M(1)                    | YES                                 | NO                                  |
| 2    | P1: write block 0  | M(0)                    | M(1)                    | NO                                  | NO                                  |
| 3    | P2: read block 0   | S(0)                    | S(0)                    | YES                                 | YES                                 |
| 4    | P1: write block 0  | M(0)                    | I                       | NO                                  | YES                                 |
| 5    | P2 : write block 1 | M(0)                    | M(1)                    | NO                                  | NO                                  |
| 6    | P1: read block 0   | M(0)                    | M(1)                    | NO                                  | NO                                  |
| 7    | P2: write block 0  | I                       | M(0)                    | NO                                  | YES                                 |
| 8    | P1: read block 0   | S(0)                    | S(0)                    | YES                                 | YES                                 |
| 9    | P2: read block 0   | S(0)                    | S(0)                    | YES                                 | YES                                 |

3

|                  |
|------------------|
| LD F1, 16 (R1)   |
| ADDD F12, F1, F2 |
| MULTD F2, F4, F3 |
| DIVD F3, F12, F5 |
| SD F3, 4 (R1)    |
| SUBD F2, F12, F4 |

RAW F1 I1-I2 ✓

RAW F12 I2-I4 ✓

RAW F12 I2-I6 ✓

WAR F2 I2-I3 ✓

WAR F3 I3-I4 ✓

WAR F2 I3-I6 ✓

RAW F3 I4-I5 ✓

WAR F2 I2-I6 ✓

|    | Instruction      | ISSUE | READ OPERAND | EXE COMPLETE | WB |
|----|------------------|-------|--------------|--------------|----|
| I1 | LD F1, 16 (R1)   | 1     | 2            | 6            | 7  |
| I2 | ADDD F12, F1, F2 | 2     | 8            | 13           | 14 |
| I3 | MULTD F2, F4, F3 | 3     | 4            | 15           | 16 |
| I4 | DIVD F3, F12, F5 | 4     | 15           | 26           | 27 |
| I5 | SD F3, 4 (R1)    | 5     | 28           | 32           | 33 |
| I6 | SUBD F2, F12, F4 | 17    | 18           | 23           | 24 |

✓ ✓ ✓ ✓ ✓ ✓ ✓

MEM1, MEM2 RS1, RS2 4cc

ADD1 RS3 5cc

MUL1, MUL2 RS4, RS5 11cc

# THEORY 1

## Question 1

The following 5 questions are True/False (you do not have to motivate/support your answers).

Each of them will count 1 point

### Question 1.1

A scalar architecture can let instructions from different threads to be executed at the same clock cycle.

Given the previous statement, confirm if it is TRUE or FALSE and **effectively support** your answer.

Circle the **right** answer:    True

False

✓ **SINGLE ISSUE ARCHITECTURE**

### Question 1.2

The ARM big.LITTLE architecture, considering the fact that is a multicore architecture where the cores are sharing the same ISA, cannot be considered a heterogenous architecture.

Circle the **right** answer:    True

False

✓ **Multicore  $\Rightarrow$  Different cores  
that work together**

### Question 1.3

A physically distributed memory can be used to implement a shared-memory architecture

Circle the **right** answer:    True

False

✓

### Question 1.4

A VLIW Architecture has to have multiple Program Counters to load the necessary Multiple Data

Circle the **right** answer:    True

False

✓

### Question 1.5

Increasing the number of stages in a pipeline, it is always improving the performance

Circle the **right** answer:    True

False

✓

## THEORY 2

- DEPENDENCIES  $\Rightarrow$  LIMITATIONS GIVEN BY THE PROGRAM ITSELF TO ENSURE CORRECTNESS
- HAZARDS  $\Rightarrow$  CONFLICTS THAT OCCUR WHEN MULTIPLE INSTRUCTIONS THAT SHARE SOME OPERANDS ARE EXECUTED IN PARALLEL

## THEORY 3

### LOOP UNROLLING



REPLICATE LOOP CODE MULTIPLE TIMES, RESULTING IN SLOWER EXECUTION, MORE REGISTERS TO BE NEEDED

### SOFTWARE PIPELINING



PARALLELIZE LOOP ITERATIONS

- PROLOGO  $\rightarrow$  INITIALIZATION PIPELINE
- LOOP ITERATIONS  $\rightarrow$  EXECUTION OF MM/ZDATA
- EPILOGO  $\rightarrow$  SWIPE PIPELINE

OVERHEAD TIMES "LOST" ONLY ONCE AT THE BEGINNING AND AT THE END, BUT WHAT IF LOOPS DEPEND FROM EACH OTHER?