



# INSTITUTO SUPERIOR TÉCNICO

DEPARTAMENTO DE ENGENHARIA INFORMÁTICA

## ORGANIZAÇÃO DE COMPUTADORES

LEIC

### **First Lab Assignment: System Modeling and Profiling**

Version 1.1.0

2023/2024

# 1 Introduction

The goal of this assignment is twofold: (i) to determine the characteristics of a computer's caches, and (ii) to leverage the obtained knowledge about the caches in order to optimize the performance of a given program. For this task, the students will make use of a performance analysis tool to have direct access to hardware performance counters available on most modern microprocessors. The tool that will be used is the standard Application Programming Interface (API): PAPI [1].

In the rest of this section, we make a brief introduction to PAPI, and describe the targeted computer platform and the development environment. In Section 3, we describe the procedure for modeling the L1 and L2 caches of the targeted platform (Subsection 3.1), and provide a guide for analyzing the performance of a matrix-multiply code segment and optimizing it based on the characteristics of the L2 cache of the target architecture (Subsection 3.2).

## 1.1 Targeted Platform and Development Environment

**IMPORTANT: This assignment must be performed on the computers of your lab classes room.** These computers have similar hardware characteristics, and any of them can be used as a target platform. Note that, since this work is hardware-dependent, conducting it on a computer with different hardware characteristics could produce unexpected results, and hence invalidating your work. This means you should always use the same lab. If you are an Alameda student, you can access the specific lab computer you want (see <https://welcome.rnl.tecnico.ulisboa.pt/#labs-access>).

To properly setup the development environment, it is necessary to obtain the PAPI library and a set of auxiliary program files. This material can be found in the package `lab1_kit.zip`, which can be downloaded from the course website. After downloading and uncompressed this package on any of the lab classes' computers, PAPI must be built. To this end, change directories to the location of the PAPI source code: folder `papi-X.X.X/src`. Compile the code by issuing the commands: `./configure`, and `make`. This operation will produce a set of helper tools located in directory `src/utils/` and create the PAPI library `papilib.a`. The tool `papi_avail`, in particular, is useful to determine the PAPI events supported on the target platform. The library will be linked to the auxiliary programs presented in the following sections.

## 2 Exercise

To help determining the characteristics of the labs computer's caches, the following exercises will help you estimate cache parameters from small C applications.

The first step to get acquainted with the procedure is to determine only the size of the cache using a small C application on a (known) machine, such as the code you have analyzed on lab exercise VI.3. This C code, is a simplified version of the following programs in this assignment. Basically, it iterates over an array to determine the cache size.

To guarantee that you measure the time accurately, please use the source code available in the lab kit (file spark.c).

In order to perform the evaluation you should go to your lab in order to access the cache size by running the application there. You may want to repeat the evaluation of the elapsed time a few times to achieve statistical significance. You should table the relevant results for different cache sizes on the response sheet and make a conclusion regarding the cache size. You can calculate more measures before the output, examine the final part of the source code file.

1. What is the cache capacity of the computer you tested? Please justify.

To discover the other cache parameters, you're going to modify the C application, so that it generates different data access patterns. Please spend a few minutes analyzing the modifications to the source code.

```
for(size_t cache_size = CACHE_MIN; cache_size < CACHE_MAX; cache_size = 2*cache_size) {  
    for(size_t stride = 1; stride <= cache_size/2; stride = 2*stride){  
        limit = cache_size - stride + 1;  
        for(ssize_t i = 10 * stride; i > 0; i--) {  
            for(index = 0; index < limit; index += stride) {  
                array[index] = array[index] + 1;  
            }  
        }  
    }  
}
```

The meaning of each variable is the following:

**array[]** an arbitrary large array that will be repeatedly accessed to measure the cache miss pattern;

**cache\_size** value of the cache size under test; all cache sizes given by integer powers of 2, between CACHE\_MIN = 8kB and CACHE\_MAX = 64kB should be considered;

**stride** states how many entries are being skipped at each access; for example, if the stride is 4, entries 0, 4, 8, 12, ... in the array are being accessed, while entries 1, 2, 3, 5, 6, 7, 9, 10, 11, ... are skipped;

**limit** the largest address that will be accessed for the cache size and access pattern under test;

**repeat** denotes the number of times that each access pattern will be repeated in array.

The execution time for this code segment on this machine yield the chart depicted in Figure 1, by varying the adopted value for the *stride* parameter and for different array sizes, defined between ARRAY\_MIN = 4kB and ARRAY\_MAX = 4MB.

2. What is the cache capacity of the computer?
3. What is the size of each cache block?
4. What is the L1 cache miss penalty time?



**Figure 1:** Variation of the cache access time with the adopted *stride* value for different array sizes.

### 3 Procedure

#### 3.1 Modeling Computer Caches

In the first part of this assignment, the goal is to model the characteristics of the L1 data cache and L2 cache of the targeted computer platform. Next, we provide instructions for performing this analysis.

Use the forms at the end to answer the questions below.

##### 3.1.1 Modeling the L1 Data Cache

The methodology to experimentally model the L1 data cache consists in considering the total amount of data cache misses during the execution of the following code sequence of program `cml.c`, similar to the program in Section 2. This program can be found in the package `lab1_kit.zip`.

```

for(array_size=ARRAY_MIN; array_size < ARRAY_MAX; array_size=array_size*2)
    for(stride=1; stride <= array_size/2; stride=stride*2) {
        limit = array_size - stride + 1;
        for(repeat=0; repeat<=200*stride; repeat++)
            for(index=0; index<limit; index+=stride)
                x[index] = x[index] + 1;
    }
}

```

- a) Change to directory `cml/`, in the package `lab1_kit.zip`, and analyze de code of the program `cml.c`. Identify its source code with the program described above.  
What are the processor events that will be analyzed during its execution? Explain their meaning.
- b) Compile the program `cml.c` using the provided `Makefile` and execute `cml`. Plot the variation of the average number of misses (*Avg Misses*) with the `stride` size, for each considered dimension of the L1 data cache (8kB, 16kB, 32kB and 64kB).

NOTE: A fast sketch of these plots can be drawn in your computer by running the following commands:

```

./cml > cml.out
./cml_proc.sh

```

NOTE 2: You can draw these tables and plots on your computer, print, and attach to the report. You do not have to

fill them by hand on the printed report.

NOTE 3: You may need to mark the script as executable before being able to run it.

c) By analyzing the obtained results:

- Determine the **size** of the L1 data cache. Justify your answer.
- Determine the **block size** adopted in this cache. Justify your answer.
- Characterize the **associativity set size** adopted in this cache. Justify your answer.

### 3.1.2 Modeling the L2 Cache

In this part of the assignment, the goal is to experimentally model the characteristics of the L2 cache of the targeted computer platform. To analyze the computer's L2 cache, we will use the same methodology that was introduced in the previous section to model the L1 data cache.

- a) Modify the program `cm1.c` in order to analyze the characteristics of the L2 cache. (Hint: use the event `PAPI_L2_DCM`.) Describe and justify the changes introduced in this program.
- b) Compile the program `cm1.c`, execute `cm1`, and plot the variation of the average number of misses (*Avg Misses*) with the `stride` size, for each considered dimension of the L2 cache.
- c) By analyzing the obtained results:
  - Determine the **size** of the L2 cache. Justify your answer.
  - Determine the **block size** adopted in this cache. Justify your answer.
  - Characterize the **associativity set size** adopted in this cache. Justify your answer.

## 3.2 Profiling and Optimizing Data Cache Accesses

Often, programmers wishing to improve their programs' performance focus their attention on how the programs affect the computer's caches. In the following, it will be analyzed how simple code changes can help to improve that performance for a matrix multiplication application.

Consider a simple matrix multiplication application, operating on two square matrices of  $N \times N$  16-bit integer elements, with  $N = 1024$ . From a mathematical point of view, given two matrices **A** and **B**, with elements  $a_{ij}$  and  $b_{ij}$  such that  $0 \leq i, j < N$ , the product matrix **C** is defined as:

$$c_{ij} = \sum_{k=0}^{N-1} a_{ik} b_{kj} = a_{i1} b_{1j} + a_{i2} b_{2j} + \dots + a_{i(N-1)} b_{(N-1)j} \quad (1)$$



**Figure 2:** Straightforward matrix multiplication.

### 3.2.1 Straightforward implementation

A straight-forward C implementation of Eq. 1 can look like this:

```
for (i = 0; i < N; ++i) {
    for (j = 0; j < N; ++j) {
        for (k = 0; k < N; ++k) {
            res[i][j] += mul1[i][k] * mul2[k][j];
        }
    }
}
```

The two input matrices are `mul1` and `mul2`. The result matrix `res` is assumed to be initialized to all zeroes.

The provided program `mm1.c` includes this code sequence and all the necessary initialization steps, as well as the set of statements that are required in order to profile its execution using the PAPI toolbox.

- a) Change to directory `mm1/` and analyze de code of the program `mm1.c`. Identify its source code with the program described above.  
What is the total amount of memory that is required to accommodate each of these matrices?
- b) Compile the source file `mm1.c` using the provided `Makefile` and execute it. Fill the table with the obtained data.
- c) Evaluate the resulting L1 data cache *Hit-Rate*.

### 3.2.2 First Optimization: Matrix transpose before multiplication [2]

By analyzing the obtained results, it can be observed that such a straightforward implementation suffers from a severe penalty in what concerns the amount of L2 cache misses resulting from its access pattern. In fact, while `mul1` matrix is accessed sequentially, the inner loop advances the row number of `mul2` (see Fig. 2), meaning successive accesses to far away memory positions.



**Figure 3:** Transposed matrix multiplication.

One possible remedy to attenuate such problem is based on matrix transposition. In fact, since each matrix element is accessed multiple times, it might be worthwhile to rearrange (“transpose,” in mathematical terms) the second matrix `mul2` before using it (see Fig. 3):

$$c_{ij} = \sum_{k=0}^{N-1} a_{ik} b_{jk}^T = a_{i1} b_{j1}^T + a_{i2} b_{j2}^T + \dots + a_{i(N-1)} b_{j(N-1)}^T \quad (2)$$

After the preliminary transposition step, both matrices may be iterated sequentially. As far as the C code is concerned, it now looks like this:

```
int16_t tmp[N][N];

// transposition
for (i = 0; i < N; ++i) {
    for (j = 0; j < N; ++j)
        tmp[i][j] = mul2[j][i];
}

// multiplication
for (i = 0; i < N; ++i) {
    for (j = 0; j < N; ++j) {
        for (k = 0; k < N; ++k)
            res[i][j] += mul1[i][k] * tmp[j][k];
    }
}
```

Variable `tmp` is a temporary array to store the transposed matrix.

One direct consequence of this optimization is that it now requires additional accesses to the data memory. Hopefully, this extra cost can be easily recovered, since the 1024 non-sequential accesses per column are usually much more expensive.

- a) Change to directory `mm2/` and analyze the code of the program `mm2.c`. Identify its source code with the program described above. Compile this program using the provided `Makefile` and execute it.

Fill the table with the obtained data.

- b) Evaluate the resulting L1 data cache *Hit-Rate*.

- c) Change the code in the program `mm2.c` in order to include the matrix transposition in the execution time. Compile this program using the provided `Makefile` and execute it.

Fill the table with the obtained data.

Comment on the obtained results when including the matrix transposition in the execution time.

- d) Compare the obtained results with those that were obtained for the straightforward implementation, by calculating the difference of the resulting hit-rates ( $\Delta$ HitRate) and the obtained speedups.

### 3.2.3 Second Optimization: Blocked (tiled) matrix multiply [2]

Despite the good results that may be obtained with the matrix transposition method, in many applications this approach can not be applied, either because the matrix is too large or the available memory is too small. Hence, other alternatives, which do not require the extra copy procedure, should be studied.

The search for an alternative processing scheme should start with a close examination of the involved math and the operations performed by the original implementation. Trivial math knowledge shows that the order of the several additions to obtain each element of the result matrix is irrelevant, as long as

each addend appears exactly once. This understanding will lead to solutions which reorder the additions performed in the inner loop of the original code.

According to the original algorithm, the adopted order to access the elements of matrix `mull` is: (0,0), (1,0), ... , (N -1,0), (0,1), (1,1), ... . Although the elements (0,0) and (0,1) are in the same cache line, by the time the inner loop completes one round, this cache line has long been evicted. For this example, each round of the inner loop requires, for each of the three matrices, 1024 cache lines, which is much more than what is available in most processors' caches.

One possible solution is to simultaneously handle more than one iteration of the middle loop, while executing the inner loop. In this case, several values which are guaranteed to be in cache will be used, thus contributing to a reduction of the L2 cache miss-rate. Hence, to maximize the speedup provided by this technique, it is necessary to adapt the dimension of the sub-matrix under processing to the cache block size, by taking into account the size of each matrix element. As a hypothetical example, considering that a `short` operand occupies 2-Bytes, this means that a 64-Byte cache block will accommodate 32 matrix elements, thus defining the optimal size for the sub-matrix line to be 32 (see Fig. 4).



**Figure 4:** Blocked matrix multiplication.

As far as the C code is concerned, it now looks like this:

```
#define SUB_MATRIX_SIZE (CACHE_LINE_SIZE / sizeof (short))

for (i = 0; i < N; i += SUB_MATRIX_SIZE) {
    for (j = 0; j < N; j += SUB_MATRIX_SIZE) {
        for (k = 0; k < N; k += SUB_MATRIX_SIZE) {
            for (i2 = 0, rres = &res[i][j], rmull1 = &mull[i][k];
                 i2 < SUB_MATRIX_SIZE;
                 ++i2, rres += N, rmull1 += N) {
                for (k2 = 0, rmull2 = &mull2[k][j]; k2 < SUB_MATRIX_SIZE; ++k2, rmull2 += N) {
                    for (j2 = 0; j2 < SUB_MATRIX_SIZE; ++j2) {
                        rres[j2] += rmull1[k2] * rmull2[j2];
                    }
                }
            }
        }
    }
}
```

The most visible change is that the code has six nested loops now. The outer loops iterate with intervals of `SUB_MATRIX_SIZE` (the cache line size `CACHE_LINE_SIZE` divided by `sizeof(short)`). This divides the matrix multiplication in several smaller problems which can be handled with more cache locality. The inner loops iterate over the missing indexes of the outer loops. There are, once again, three loops. The `k2` and `j2` loops are in a different order. This is done because, in the actual computation, only one expression depends on `k2` but two depend on `j2`.

- a) Change to directory `mm3/` and analyze the code of the program `mm3.c`. Identify its source code with the program described above.

Change the program source code in order to comply the algorithm parameterization (sub-matrix line size) with the block size (`CLS`) that was determined in Section 3.1.

How many matrix elements can be accommodated in each cache line?

- b) Compile this program using the provided `Makefile` and execute it. Fill the table with the obtained data.

- c) Evaluate the resulting L1 data cache *Hit-Rate*.
- d) Compare the obtained results with those that were obtained for the straightforward implementation, by calculating the difference of the resulting hit-rates ( $\Delta$ HitRate) and the obtained speedup.
- e) Compare the obtained results with those that were obtained for the matrix transpose implementation by calculating the difference of the resulting hit-rates ( $\Delta$ HitRate) and the obtained speedup. If the obtained speedup is positive, but the difference of the resulting hit-rates is negative, how do you explain the performance improvement? (Hint: study the hit-rates of the L2 cache for both implementations; You may use the following PAPI events PAPI\_L2\_DCH (or PAPI\_L2\_DCM) and PAPI\_L2\_DCA. Run papi\_avail to check for available events and understand their meaning.)

## References

- [1] Performance Application Programming Interface (PAPI). Webpage. "<http://icl.cs.utk.edu/papi>", December 2008.
- [2] Ulrich Drepper. What every programmer should know about memory. Technical report, Red Hat, Inc., November 2007.
- [3] *PAPI User's Guide*.
- [4] *PAPI Programmer's Reference*.

# First Lab Assignment: System Modeling and Profiling

STUDENTS IDENTIFICATION:

| Number: | Name:           |
|---------|-----------------|
| 102082  | Simão Sanguinho |
| 102663  | Pedro Ribeiro   |
| 103252  | José Pereira    |

## 2 Exercise

Please justify all your answers with values from the experiments.

1. What is the cache capacity of the computer you used (please write the workstation name)?

| Array Size         | 4KiB     | 8KiB     | 16KiB    | 32KiB    | 64KiB    | 128KiB   |
|--------------------|----------|----------|----------|----------|----------|----------|
| t2-t1              | 0.000385 | 0.001762 | 0.003552 | 0.007054 | 0.015966 | 0.037456 |
| # accesses a[i]    | 409600   | 819200   | 1638400  | 3276800  | 6553600  | 13107200 |
| # mean access time | 2.160092 | 2.151120 | 2.167819 | 2.152675 | 2.436157 | 2.857637 |

Computer Used: Lab 6 p6

By analysing our results, it became evident that the mean access time remained relatively constant up to an array size of 32KiB. Beyond this point, a noticeable increase was observed (at 64KiB). That implies that the cache capacity was no longer sufficient to effectively store larger array sizes, which translates into a higher miss rate and thus longer access times.

We conclude that the cache size must be 32KiB.

Consider the data presented in Figure 1. Answer the following questions (2, 3, 4) about the machine used to generate that data.

2. What is the cache capacity?

According to Figure 1, the data can be divided into two main sets: the lower set where the time oscillates between 300-500ns (array sizes 4k to 64k), and the higher one between 500-1000ns (array sizes 128k to 4M). The sudden time jump from array size 64k to 128k can only be explained by the exhaustion of the cache, meaning that all arrays up to a size of 64k can be stored in cache (meaning that those arrays are less or equal than the cache size), whilst the others, bigger than 64k, can not.

Because of this, we can conclude that the cache capacity is 64k.

3. What is the size of each cache block?

The cache block size corresponds to the stride beyond which there is no variation in access time for arrays larger than the cache capacity. At such a point every time you want to access an element of the array, it corresponds to a different block on the cache, making it mandatory to get data outside of the L1 cache (high miss rate), and the same for subsequently larger strides.

This occurs at a stride of 16 Bytes, meaning that the size of each cache block is 16 Bytes.

4. What is the L1 cache miss penalty time?

The L1 cache miss penalty time can be determined by the difference between the access time where there is no miss rate and 100% miss rate. According to Figure 1, there is no miss rate for arrays smaller than the cache size and with a stride smaller than the cache block size (16 Bytes), in which the access time is around 360ns and there's 100% miss rate for arrays larger than cache size and a stride larger than the cache block size, around 1000ms.

So, the L1 cache miss penalty time can be calculated:  $1000\text{ms} - 360\text{ns} = \underline{\underline{640\text{ns}}}$

### 3 Procedure

#### 3.1.1 Modeling the L1 Data Cache

- a) What are the processor events that will be analyzed during its execution? Explain their meaning.

During the execution of cm1.c, the processor will analyse the PAPI\_L1\_DCM, which corresponds to the L1 Data Cache Misses. During its execution, the program will count the numbers of attempts of retrieving data from the L1 cache that was not there.

- b) Plot the variation of the average number of misses (*Avg Misses*) with the stride size, for each considered dimension of the L1 data cache (8kB, 16kB, 32kB and 64kB).

**Note that, you may fill these tables and graphics (as well as the following ones in this report) on your computer and submit the printed version.**

| Array Size | Stride | Avg Misses | Avg Cycl Time |
|------------|--------|------------|---------------|
| 8kBytes    | 1      | 0.000179   | 0.002293      |
|            | 2      | 0.000107   | 0.002238      |
|            | 4      | 0.000034   | 0.002225      |
|            | 8      | 0.000033   | 0.002153      |
|            | 16     | 0.000034   | 0.002264      |
|            | 32     | 0.000032   | 0.002183      |
|            | 64     | 0.000037   | 0.002162      |
|            | 128    | 0.000023   | 0.002053      |
|            | 256    | 0.000018   | 0.001998      |
|            | 512    | 0.000009   | 0.001987      |
|            | 1024   | 0.000009   | 0.001948      |
|            | 2048   | 0.000008   | 0.002015      |
|            | 4096   | 0.000008   | 0.002078      |
|            | 8192   | 0.000186   | 0.002245      |
| 16kBytes   | 2      | 0.000141   | 0.002234      |
|            | 4      | 0.000154   | 0.002218      |
|            | 8      | 0.000142   | 0.002167      |
|            | 16     | 0.000144   | 0.002237      |
|            | 32     | 0.000143   | 0.002215      |
|            | 64     | 0.000155   | 0.002187      |
|            | 128    | 0.000073   | 0.002129      |
|            | 256    | 0.000024   | 0.002010      |
|            | 512    | 0.000017   | 0.001950      |
|            | 1024   | 0.000012   | 0.001948      |
|            | 2048   | 0.000005   | 0.002032      |
|            | 4096   | 0.000004   | 0.002113      |
|            | 8192   | 0.000003   | 0.002028      |
|            | 16384  | 0.00002    | 0.002036      |
| 32kBytes   | 1      | 0.002048   | 0.002171      |
|            | 2      | 0.002454   | 0.002153      |
|            | 4      | 0.004651   | 0.002133      |
|            | 8      | 0.008668   | 0.002097      |
|            | 16     | 0.015187   | 0.001952      |
|            | 32     | 0.023419   | 0.001934      |
|            | 64     | 0.028299   | 0.001840      |
|            | 128    | 0.40069    | 0.002013      |
|            | 256    | 0.660303   | 0.002132      |
|            | 512    | 0.006143   | 0.002010      |
|            | 1024   | 0.000056   | 0.001982      |
|            | 2048   | 0.000017   | 0.002024      |
|            | 4096   | 0.000007   | 0.002139      |
|            | 8192   | 0.000002   | 0.002124      |
| 64kBytes   | 16384  | 0.00002    | 0.002036      |
|            | 1      | 0.015643   | 0.001940      |
|            | 2      | 0.039239   | 0.001857      |
|            | 4      | 0.062660   | 0.002194      |
|            | 8      | 0.125353   | 0.002179      |
|            | 16     | 0.250743   | 0.002156      |
|            | 32     | 0.500896   | 0.002067      |
|            | 64     | 0.999560   | 0.001902      |
|            | 128    | 1.030536   | 0.001962      |
|            | 256    | 1.037527   | 0.001894      |
|            | 512    | 1.052606   | 0.001923      |
|            | 1024   | 1.009067   | 0.001851      |
|            | 2048   | 1.951666   | 0.005532      |
|            | 4096   | 1.042661   | 0.005060      |
|            | 8192   | 0.000007   | 0.002181      |
|            | 16384  | 0.000001   | 0.002166      |
|            | 32768  | 0.000001   | 0.002083      |



c) By analyzing the obtained results:

- Determine the **size** of the L1 data cache. Justify your answer.

By analysing the plot drawn we can see that for arrays of size smaller or equal to 32KiB have almost 0% miss-rate (average misses around 0). The array size 64 KiB already has a larger number of average cache misses meaning that the cache is exhausted and the disk is being used.  
Hence, the cache size is 32KiB.

- Determine the **block size** adopted in this cache. Justify your answer.

In order to determine the cache block size we need to analyse the plot and find a stride where the average cache misses becomes relatively constant. At such a point, we have exceeded the size of cache blocks, as sequential accesses will be to different blocks.

According to the plot, this happens when the stride is 64 ( $2^6$ ) Bytes, which leads us to conclude that the cache block size of this cache is 64Bytes.

- Characterize the **associativity set size** adopted in this cache. Justify your answer.

If we consider the array of 64KiB (twice as the cache capacity - 32KiB) with a stride of  $2^{13}$  Bytes, we will only access 2 different blocks in our cache. By analysing our plot, we can see that, at such a point, we have a 0% miss rate. This can be explained by the fact that the cache is able to store those two different blocks simultaneously (in the same set), which means it has to be at least a 2-way set associativity. Since this behaviour is still shown at strides of  $2^{14}$  and  $2^{15}$  Bytes, we must conclude that the cache has at least 8-way set associativity.

At a stride of  $2^{16}$  (16 different blocks are accessed), the average misses increase again meaning that the L1 cache is no longer capable of storing simultaneously the blocks.

We must conclude that the L1 cache associativity set size is 8.

$$\text{Associativity Set Size} = \frac{\text{Cache Size}}{\text{Stride}} = \frac{2^{16}}{2^{13}} = 8 \text{ way associativity set size}$$

### 3.1.2 Modeling the L2 Cache

- a) Describe and justify the changes introduced in this program.

In order to count the L2 cache misses, we changed the PAPI event to PAPI\_L2\_DCM. Furthermore we also increased the constant CACHE\_MAX, which serves as an upper limit of the cache size given that L2 caches are larger than L1 caches.

- b) Plot the variation of the average number of misses (*Avg Misses*) with the stride size, for each considered dimension of the L2 cache.



c) By analyzing the obtained results:

- Determine the **size** of the L2 cache. Justify your answer.

According to the plot, we can observe two jumps in the miss rate of the L2 cache: from 128 kB to 256 kB and from 256 kB to 512 kB. Given that the second jump is significantly larger than the first, we must conclude that the array has surpassed the L2 cache capacity. The first jump can be explained by the fact that the cache size is close to the L2 cache capacity, and so the average misses are slightly higher.  
Hence the L2 cache capacity is 256 kB.

- Determine the **block size** adopted in this cache. Justify your answer.

For strides lower than  $2^8$  Bytes (64) the slope of the plot is increasing, meaning that the miss rate is rising. At that stride onwards, there's a change in its behaviour - it has become less steep (miss rate slightly stabilises). Thus, for larger strides we are not missing much more than we were before. We have exceeded the cache block size and every time we access an element of the array, it corresponds to a different block on the cache and so the miss rates for bigger strides become similar.  
Hence, the L2 cache block size must be 64 Bytes.

- Characterize the **associativity set size** adopted in this cache. Justify your answer.

following the same logic as in the third question in 3.1.1 c), if we consider the cache size  $2^{17}$ , we can observe that at a stride of around  $2^{16}$  Bytes (8 blocks are accessed) the average misses is 0 whereas at a stride of  $2^{17}$  Bytes, the miss rate is no longer 0, meaning that the L2 cache can no longer hold these different blocks simultaneously. The same happens when considering other arrays larger than the cache capacity (256 KiB).

Hence, the associativity set size is 8.

$$\text{Associativity set size} = \frac{\text{Cache Size}}{\text{Stride}} = \frac{2^{17}}{2^{16}} = 8 \text{ way associativity set size,}$$

## 3.2 Profiling and Optimizing Data Cache Accesses

### 3.2.1 Straightforward implementation

- a) What is the total amount of memory that is required to accommodate each of these matrices?

Each matrix has  $N^2$  elements.  
 Each element is a 16 bit integer.  
 Hence each matrix will require  $N^2 \times 16 \text{ bits} = 524\,288 \text{ Bytes (512 KiB)}$ .  
 Because we have 3 matrices in our program (mul1, mul2 and res), the total to accommodate them is  $3 \times 512 \text{ KiB} \approx 1.57 \text{ MiB}$

- b) Fill the following table with the obtained data.

|                                                     |                          |
|-----------------------------------------------------|--------------------------|
| Total number of L1 data cache misses                | $135.115492 \times 10^6$ |
| Total number of load / store instructions completed | $536.871884 \times 10^6$ |
| Total number of clock cycles                        | $641.034062 \times 10^6$ |
| Elapsed time                                        | 0.213679 seconds         |

- c) Evaluate the resulting L1 data cache Hit-Rate:

$$\text{Miss-rate} = \frac{\text{Cache Misses}}{\text{LDST Instructions}} = \frac{135.115492}{536.871884} = 0.25167136011$$

$$\text{Hit Rate} = 1 - \text{Miss Rate} \approx 0.748328 \approx 74.83\%$$

### 3.2.2 First Optimization: Matrix transpose before multiplication [2]

- a) Fill the following table with the obtained data.

|                                                     |                          |
|-----------------------------------------------------|--------------------------|
| Total number of L1 data cache misses                | $4.219714 \times 10^6$   |
| Total number of load / store instructions completed | $536.871884 \times 10^6$ |
| Total number of clock cycles                        | $519.417992 \times 10^6$ |
| Elapsed time                                        | 0.213141 seconds         |

- b) Evaluate the resulting L1 data cache *Hit-Rate*:

$$\text{Miss-rate} = \frac{\text{Cache Misses}}{\text{LDST instructions}} = \frac{4.219714}{536.871884} = 0.00789981558$$

$$\text{Hit-Rate} = 1 - \text{Miss-Rate} \approx 0.9921402 \approx 99.21\%$$

- c) Fill the following table with the obtained data.

|                                                     |                           |
|-----------------------------------------------------|---------------------------|
| Total number of L1 data cache misses                | $4.481081 \times 10^6$    |
| Total number of load / store instructions completed | $537.396174 \times 10^6$  |
| Total number of clock cycles                        | $520.1746964 \times 10^6$ |
| Elapsed time                                        | 0.173393 seconds          |

Comment on the obtained results when including the matrix transposition in the execution time:

After comparing the results of the tables, we can conclude that the time taken to transpose the matrix can be neglected as it did not significantly affect the total execution time (0.000252 microseconds). This happens because matrix multiplication takes much longer than doing a matrix transposition.  
 $\text{Hit-rate} = 1 - \left( \frac{4.481081}{537.396174} \right) \approx 0.99165 \approx 99.17\%$ .

- d) Compare the obtained results with those that were obtained for the straightforward implementation, by calculating the difference of the resulting hit-rates ( $\Delta\text{HitRate}$ ) and the obtained speedups.

|                                                                                                                                                                                                                                                                                                                                                                                 |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| $\Delta\text{HitRate} = \text{HitRate}_{\text{mm2}} - \text{HitRate}_{\text{mm1}}: 0.9917 - 0.7983 = 0.2934$                                                                                                                                                                                                                                                                    |
| $\text{Speedup}(\# \text{Clocks}) = \#\text{Clocks}_{\text{mm1}} / \#\text{Clocks}_{\text{mm2}}: 641.034062 / 520.1746964 \approx 1.2323$                                                                                                                                                                                                                                       |
| $\text{Speedup}(\text{Time}) = \text{Time}_{\text{mm1}} / \text{Time}_{\text{mm2}}: 0.213679 / 0.173393 \approx 1.2323$                                                                                                                                                                                                                                                         |
| Comment:<br>Although the hit rate increased substantially (24.34%), the resultant Speed Up was quite smaller than expected.<br>This optimization may not be advantageous for computers with lower memory capacities as it requires more memory allocation, since an additional matrix has to be held in memory, and the time execution improvement may not justify such change. |

### 3.2.3 Second Optimization: Blocked (tiled) matrix multiply [2]

- a) How many matrix elements can be accommodated in each cache line?

|                                                                                                    |
|----------------------------------------------------------------------------------------------------|
| $L1 \text{ block size} \rightarrow 64 \text{ Bytes}$                                               |
| $\text{Cache line} \rightarrow 8 \text{ ways}$                                                     |
| $\text{Size of } (\text{int}16\text{-t}) \rightarrow 2$                                            |
| Hence, there can be accommodated $\frac{64 \times 8}{2} = 256$ matrix elements in each cache line. |

- b) Fill the following table with the obtained data.

|                                                     |                          |
|-----------------------------------------------------|--------------------------|
| Total number of L1 data cache misses                | $2.795436 \times 10^6$   |
| Total number of load / store instructions completed | $537.80223 \times 10^6$  |
| Total number of clock cycles                        | $214.603514 \times 10^6$ |
| Elapsed time                                        | 0.071534 seconds         |

- c) Evaluate the resulting L1 data cache *Hit-Rate*:

$$\text{Miss-rate} = \frac{\text{Cache Misses}}{\text{Load instructions}} = \frac{2.795436}{537.80223} = 0.065197888845$$

$$\text{Hit Rate} = 1 - \text{Miss Rate} \approx 0.994802 \approx 99.48\%$$

- d) Compare the obtained results with those that were obtained for the straightforward implementation, by calculating the difference of the resulting hit-rates ( $\Delta\text{HitRate}$ ) and the obtained speedup.

|                                                                                                                                      |
|--------------------------------------------------------------------------------------------------------------------------------------|
| $\Delta\text{HitRate} = \text{HitRate}_{\text{mm3}} - \text{HitRate}_{\text{mm1}}: 0.9948 - 0.7483 = 0.2465$                         |
| $\text{Speedup}(\# \text{Clocks}) = \# \text{Clocks}_{\text{mm1}} / \# \text{Clocks}_{\text{mm3}}: 641.034062 / 214.603514 = 2.9706$ |

Comment:

By implementing the division in sub-matrices during matrix multiplication, we have registered a speedup of 2.98, meaning that compared to the original implementation we have made the program run almost 3 times faster. The Hit Rate has shown an improvement as well, of around 25%. Taking all these factors into account, since no extra memory was needed to divide the matrix into submatrices (like in mm2.c), it's safe to say that this implementation is a worthwhile improvement.

- e) Compare the obtained results with those that were obtained for the matrix transpose implementation by calculating the difference of the resulting hit-rates ( $\Delta\text{HitRate}$ ) and the obtained speedup. If the obtained speedup is positive, but the difference of the resulting hit-rates is negative, how do you explain the performance improvement? (Hint: study the hit-rates of the L2 cache for both implementations;)

$$\Delta \text{HitRate} = \text{HitRate}_{\text{mm3}} - \text{HitRate}_{\text{mm2}}: 0.9948 - 0.9917 = 0.0031$$

$$\text{Speedup}(\# \text{Clocks}) = \# \text{Clocks}_{\text{mm2}} / \# \text{Clocks}_{\text{mm3}}: 2.42389$$

Comment:

After comparing the results, we can see that both changes are positive, i.e. the execution time improved as well as the hit rate.

L2 cache Hit Rates:

$$\text{Hit Rate}_{\text{mm2}} = 0.9821593 \approx 0.9822 ; \text{Hit Rate}_{\text{mm3}} = 0.994947 \approx 0.9950$$

There was an increase of 1.28% on the L2 cache hit rates. The program mm3.c registered better hit rates in both L1 and L2 caches, which justifies its improved elapsed time. If the hit rate for L1 cache had dropped from the second implementation to the third, perhaps the time improvement (speed up) would have been justified by a larger L2 cache hit rate on the mm3.c, but that does not occur.

### 3.2.3 Comparing results against the CPU specifications

Now that you have characterized the cache on your lab computer, you are going to compare it against the manufacturer's specification. For this you can check the device's datasheet, or make use of the command `lscpu`. Comment the results.

After running the command "lscpu -C" on the computer lab (lab6p6) we can conclude that our results regarding L1 and L2 caches capacities (32KiB and 256KiB, respectively) and block sizes (64 Bytes on both) are correct. Regarding the Associativity Set Size, for the L1 cache our result (8-way) was accurate, although our value for the L2 cache (8-way) was not the same as the real value (4-way). This could have been caused by some processor/system refinement that we are not aware of, that may be lowering the miss rate sooner than expected.

## A PAPI - Performance Application Programming Interface

The PAPI project [1] specifies a standard Application Programming Interface (API) for accessing hardware performance counters available in most modern microprocessors. These counters exist as a small set of registers that count *Events*, defined as occurrences of specific signals related to the processor's function (such as cache misses and floating point operations), while the program executes on the processor. Monitoring these events may have a variety of uses in the performance analysis and tuning of an application, since it facilitates the correlation between the source/object code structure and the efficiency of the actual mapping of such code to the underlying architecture. Besides performance analysis, and hand tuning, this information may also be used in compiler optimization, debugging, benchmarking, monitoring and performance modeling.

PAPI has been implemented on a number of different platforms, including: Alpha; MIPS R10K and R12K; AMD Athlon and Opteron; Intel Pentium II, Pentium III, Pentium M, Pentium IV, Itanium 1 and Itanium 2; IBM Power 3, 4 and 5; Cell; Sun UltraSparc I, II and II, etc.

Although each processor has a number of events that are native to that specific architecture, PAPI provides a software abstraction of these architecture-dependent *Native Events* into a collection of *Preset Events*, also known as *predefined events*, that define a common set of events deemed relevant and useful for application performance tuning. These events are typically found in many CPUs that provide performance counters. They give access to the memory hierarchy, cache coherence protocol events, cycle and instruction counts, functional unit, and pipeline status. Hence, preset events may be regarded as mappings from symbolic names (PAPI preset name) to machine specific definitions (native countable events) for a particular hardware resource. For example, Total Cycles (in user mode) is mapped into PAPI\_TOT\_CYC. Some presets are derived from the underlying hardware metrics. For example, Total L1 Cache Misses (PAPI\_L1\_TCM) is the sum of L1 Data Misses and L1 Instruction Misses on a given platform. The list of preset and native events that are available on a specific platform can be obtained by running the commands `papi_avail` and `papi_native_avail`, both provided by the papi source distribution.

Besides the standard set of events for application performance tuning, the PAPI specification also includes both a high-level and a low-level sets of routines for accessing the counters. The high level interface consists of eight functions that make it easy to get started with PAPI, by simply providing the ability to start, stop, and read sets of events. This interface is intended for the acquisition of simple but accurate measurement by application engineers [3, 4]:

- `PAPI_num_counters` – get the number of hardware counters available on the system;
- `PAPI_flops` – simplified call to get Mflops/s (floating point operation rate), real and processor time;
- `PAPI_ipc` – gets instructions per cycle, real and processor time;
- `PAPI_accum_counters` – add current counts to array and reset counters;
- `PAPI_read_counters` – copy current counts to array and reset counters;
- `PAPI_start_counters` – start counting hardware events;
- `PAPI_stop_counters` – stop counters and return current counts.

The following is a simple code example of using the high-level API [3, 4]:

```

#include <papi.h>

#define NUM_FLOPS 10000
#define NUM_EVENTS 1

int main() {
    int Events[NUM_EVENTS] = {PAPI_TOT_INS};
    long_long values[NUM_EVENTS];

    /* Start counting events */
    if (PAPI_start_counters(Events, NUM_EVENTS) != PAPI_OK)
        handle_error(1);

    do_some_work();

    /* Read the counters */
    if (PAPI_read_counters(values, NUM_EVENTS) != PAPI_OK)
        handle_error(1);

    printf("After reading the counters: %lld\n", values[0]);

    do_some_work();

    /* Add the counters */
    if (PAPI_accum_counters(values, NUM_EVENTS) != PAPI_OK)
        handle_error(1);

    printf("After adding the counters: %lld\n", values[0]);

    do_some_work();

    /* Stop counting events */
    if (PAPI_stop_counters(values, NUM_EVENTS) != PAPI_OK)
        handle_error(1);

    printf("After stopping the counters: %lld\n", values[0]);
}

```

#### Possible output:

```

After reading the counters: 441027
After adding the counters: 891959
After stopping the counters: 443994

```

The fully programmable low-level interface provides more sophisticated options for controlling the counters, such as setting thresholds for interrupt on overflow, as well as access to all native counting modes and events. Such interface is intended for third-party tool writers or users with more sophisticated needs.

The PAPI specification also provides access to the most accurate timers available on the platform in use. These timers can be used to obtain both real and virtual time on each supported platform: the real time clock runs all the time (e.g., a wall clock), while the virtual time clock runs only when the processor is running in user mode.

In the following code example, `PAPI_get_real_cyc()` and `PAPI_get_real_usec()` are used to obtain the real time it takes to create an event set in clock cycles and in microseconds [3, 4]:

```

#include <papi.h>

int main(){
    long long start_cycles, end_cycles, start_usec, end_usec;
    int EventSet = PAPI_NULL;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);

    /*Create an EventSet */
    if (PAPI_create_eventset(&EventSet) != PAPI_OK)
        exit(1);

    /* Gets the starting time in clock cycles */
    start_cycles = PAPI_get_real_cyc();

    /* Gets the starting time in microseconds */
    start_usec = PAPI_get_real_usec();

    do_some_work();

    /* Gets the ending time in clock cycles */
    end_cycles = PAPI_get_real_cyc();

    /* Gets the ending time in microseconds */
    end_usec = PAPI_get_real_usec();

    printf("Wall clock cycles: %lld\n", end_cycles - start_cycles);
    printf("Wall clock time in microseconds: %lld\n", end_usec - start_usec);
}

```

**Possible output:**

```

Wall clock cycles: 100173
Wall clock time in microseconds: 136

```