

UNIT 3

5/2/24

## Advanced Cache Optimization Techniques

- \* REDUCING HIT TIME → Small and Simple first-level caches and way-prediction (generally ↓ power consumption)
- \* INCREASING CACHE BANDWIDTH → Pipelined caches  
→ Multibanked caches  
→ Nonblocking caches } Varying impact on power consumption
- \* REDUCING THE MISS PENALTY → Critical word first and merging write buffers (little impact on power)
- \* REDUCING MISS RATE → Compiler optimizations (improves power consumption)
- \* REDUCING MISS PENALTY/MISS RATE VIA PARALLELISM  
→ hardware prefetching and compiler prefetching (↑ power consumption ∵ prefetched data unused)

Banking → There are temporary registers defined internally

| Bank 0           | Bank 1           |
|------------------|------------------|
| P <sub>0,0</sub> | P <sub>0,1</sub> |
| P <sub>1,0</sub> | P <sub>1,1</sub> |

You can switch between banks  
Each bank is independent  
4 banks ideal



6/2/24

## (1) SMALL AND SIMPLE FIRST-LEVEL CACHES

Cache-hit critical path, 3 steps :

- addressing the tag memory using the index portion of the address
- comparing the read tag value to the address
- setting the multiplexer to choose the correct data item if the cache is set associative.

- \* L2 cache small enough to fit on chip with processor

## Example Appendix B

Determine whether a 32KB 4-way set associative L1 cache has a faster memory access time than a 32KB 2-way set associative L1 cache. Assume the miss penalty to L2 is 15 times the access time for the faster L1 cache. Ignore misses beyond L2. Which has the faster average memory access time?

→ by default, unless otherwise specified

$$AMAT_{2\text{-way}} = 1 + 0.038 \times 15 = 1.38$$

→  $15/1.4 = 10.1 \approx 10$  ∵ 4-way cache, the access time is 1.4 times longer

$$AMAT_{4\text{-way}} = 1 \times 1.4 + 0.037 \times 10 = 1.77$$

## (2) FAST HIT TIMES VIA WAY PREDICTION

WAY PREDICTION : Keep extra bits in cache to predict "way" (block within set) of next access

- Multiplexer set early to select desired block; only 1 tag comparison done that cycle (in parallel with reading data)

→ Miss → check other blocks for matches in next cycle

\* Accuracy ≈ 85%



DRAWBACK CPU pipeline harder if hit time is variable-length

### (3) INCREASING CACHE BANDWIDTH BY PIPELINING<sup>IN</sup>

Simply to pipeline cache access

→ Multiple clock cycle for 1 cache hit

ADVANTAGE Fast cycle time and slow hit

DRAWBACK Increasing the number of pipeline stages leads to

→ greater penalty on mispredicted branches

→ more clock cycles between the issue of load and use of the data

### (4) NON-BLOCKING CACHES TO INCREASE CACHE BANDWIDTH

→ Lock-up FREE CACHE allows continued cache hits during miss

Hit under miss → reduces effective miss penalty by working during miss vs. ignoring CPU requests

Hit under multiple miss / miss under miss

→ further lowers effective miss penalty by overlapping multiple misses

Example

Which is more important for floating point programs: 2-way set associativity or hit under one miss for the primary data caches? What about integer programs? Assume the following average miss rates for 32KB data caches: 5.2% for floating-point programs with a direct mapped cache, 4.9% for those programs with a 2-way set associative cache, 3.5% for integer programs with a direct-mapped cache, and 3.2% for integer programs with a 2-way set associative cache. Assume miss penalty to L2 is 10 cycles & the L2 misses & penalties are the same.

#### FLOATING-POINT PROGRAMS

$$\text{AMAT Miss rate}_{\text{DM}} \times \text{Miss penalty} = 5.2\% \times 10 = 0.52 \quad \left| \frac{0.52}{0.52} = 0.9433 \right.$$

$$\text{AMAT Miss rate}_{\text{2-way}} \times \text{Miss penalty} = 4.9\% \times 10 = 0.49 \quad \left| \frac{0.49}{0.49} = 0.98 \right.$$

#### INTEGER PROGRAMS

$$\text{AMAT Miss rate}_{\text{DM}} \times \text{Miss penalty} = 3.5\% \times 10 = 0.35 \quad \left| \frac{0.35}{0.35} = 0.914 \right.$$

$$\text{AMAT Miss rate}_{\text{2-way}} \times \text{Miss penalty} = 3.2\% \times 10 = 0.32 \quad \left| \frac{0.32}{0.35} = 0.914 \right.$$

### (5) Sequential Interleaving

- Works best when accesses naturally spread across banks  
→ mapping of addresses to banks affects behaviour of memory system
- Simple mapping that works well is sequential interleaving  
→ Spread block addresses sequentially across banks

#### Slack Time

Switch b/w processes : pseudo parallelism



7/2/24

### (6) REDUCE MISS PENALTY: Early Restart & Critical Word First

- Don't wait for full block before restarting CPU
- CRITICAL WORD FIRST - Request missed word from memory first, send it to CPU right away; let CPU continue while filling rest of the block
- EARLY RESTART - As soon as requested word of block arrives, send to CPU and continue execution (Sequential access)

### (7) MERGING WRITE BUFFER TO REDUCE MISS PENALTY

- Write buffer lets processor continue while waiting for write to complete
- Merging write buffer
  - ⇒ If buffer contains modified blocks, addresses can be checked to see if new data matches that of some write buffer entry
  - ⇒ If so, new data combined with that entry



### (8) REDUCING MISSES BY COMPILER OPTIMIZATIONS SOFTWARE ONLY APPROACH

#### Instructions

#### → REORDERING PROCEDURES

in memory to reduce conflict misses



\* Conflicts arise when both row and column majors are used simultaneously

{special PC = PC + x has to be defined, more cycles consumed}

8/2/24

### HARDWARE PREFETCHING OF INSTRUCTIONS & DATA

- Hardware prefetcher item before the processor requests them

STREAM BUFFER is used to store the next likely address to be requested

Compiler Controlled Prefetching {to reduce miss penalty/rate}

Compiler to insert prefetch instructions to request data before the processor needs it

- Types of prefetch
- (i) Data prefetch
  - (ii) Cache prefetch (load into cache)
  - (iii) Speculative prefetch

Question For the code below, determine which accesses are likely to cause data cache misses. Next, insert prefetch instructions to reduce misses. Finally, calculate the number of prefetch instructions executed and the misses avoided by prefetching.

Assume an 8KB direct mapped cache with 16 byte blocks is present and it is write back cache that does not write allocate. The elements of *a* and *b* are 8 bytes long since they are double-precision floating-point arrays.

*a* has 3 rows and 100 columns

*b* has 101 rows and 3 columns

Assume *a* and *b* are not in the cache at the start of the program.

```
for (i=0; i<3; i=i+1)
    for (j=0; j<100; j=j+1)
        a[i][j] = b[j][0] * b[j+1][0] * b[j+2][0];
```

Answer

*a* ranges from rows 0-2

columns 0-300

|   |   |   |     |    |
|---|---|---|-----|----|
|   | 0 | 1 | ... | 99 |
| 0 |   |   |     |    |
| 1 |   |   |     |    |
| 2 |   |   |     |    |

*b* ranges from rows 0-100

columns 0-2

|     |   |   |   |
|-----|---|---|---|
|     | 0 | 1 | 2 |
| 0   |   |   |   |
| 1   |   |   |   |
| 2   |   |   |   |
| ... |   |   |   |
| 100 |   |   |   |

$$\text{Number of elements brought in} = \frac{16}{8} = 2 \text{ elements}$$

$$\text{Total access} = 3 \times 100 = 300$$

Since only 2 elements are brought  $\Rightarrow$

even numbers lead to miss

odd numbers lead to hit

$$\therefore \text{Number of misses} = \frac{300}{2} = 150 \text{ misses}$$

12/2/24

### Compiler-Controlled Prefetching Example

```
for (j=0; j<100; j=j+1) {
    prefetch (b[j+7][0]);
    /* b[j, 0] for 7 iterations later */
    prefetch (a[0][j+7]);
    /* a[0, j] for 7 iterations later */
    a[0][j] = b[j][0] * b[j+1][0];
}
for (i=1; i<3; i=i+1)
    for (j=0; j<100; j=j+1) {
        prefetch (a[i][j+7]);
        /* a(i, j) for +7 iterations */
        a[i][j] = b[j][0] * b[j+1][0];
}
```

+ initially compulsory miss

+ miss i=0

+ miss i=1

+ miss i=2

19 misses

Example 2 Calculate the time saved in Example 1.

- (TB page 97-95)  
 Key loop times ignoring cache misses
- Original loop takes 7 clock cycles per iteration
  - 1<sup>st</sup> prefetch loop takes 9 clock cycles per iteration
  - 2<sup>nd</sup> prefetch loop takes 8 clock cycles per iteration
  - A miss takes 100 clock cycles

$$\text{Original doubly nested loop} \rightarrow 3 \times 100 = 300 \text{ times execution}$$

$$300 \times 7 = 2100 \text{ clock cycles} + \text{cache misses}$$

↓  
cc per iteration      ↓  
for i      ↓  
for j      ↓  
100 + 150

$$\text{Cache misses} \rightarrow 251 \times 100 = 25100 \text{ clock cycles}$$

↓  
cc per miss

$$\text{Total} = 25100 + 2100 = 27200 \text{ clock cycles}$$

↑ 714  
11 \* 100 = 1100

1<sup>st</sup> prefetch loop  $\Rightarrow$  iterates 100 times  $\Rightarrow$  900 clock cycles + cache misses

2<sup>nd</sup> prefetch loop  $\Rightarrow$   $i = 1, 2$        $\left\{ \begin{array}{l} 2 \times 100 \\ j = 0, \dots, 99 \end{array} \right. \Rightarrow 8 \times 200 = 1600$        $= 2000 \text{ cc}$   
 $\text{iterations}$        $\text{clock cycles} + 8 \times 100 = 800$

$$= 2400 \text{ clock cycles.}$$

$$\text{Total} = 2000 + 2400 = 4400 \text{ clock cycles}$$

If prefetch is overlapped with rest of execution  
 $\Rightarrow 27200 = 68 = 6.1818 \text{ times faster}$

4400 11

## Virtual Memory Monitor (VMM)

- Controlled access
- Protection bits in resource sharing



### PAGING

Consider a page size of 1KB

- A process of size 4.5 KB is partitioned into 5 pages
- 0.5 KB is unused in the last page  $\rightarrow$  cannot be used by another process since processes are independent of each other
- \* Unused memory within a page  $\Rightarrow$  INTERNAL FRAGMENTATION

A page size of 4KB would result in an internal fragmentation of 3.5KB for a process of 4KB size

- \* Internal fragmentation can be reduced by taking smaller page sizes

### SEGMENTATION

HOLE / EXTERNAL FRAGMENTATION

Reduced by method of COMPACTION

Overhead



## PAGING

- Each process is divided into small fixed-size chunks of the same size known as PAGES & associated with virtual memory.
- Pages assigned to available chunks of memory known as FRAMES or PAGE FRAMES (associated with main memory)

\* Page size = Frame size

\* PAGE TABLE contains details about page number and frame number



initially made to reside in main memory

<sup>active</sup>

each process has a page table loaded onto main memory  
→ still occupies some amount of space in main memory

→ Use Inverted Page Table instead  
↳ maintained in virtual memory

## Logical address

[Page #] [Offset]

offset of 10 bits  $\Rightarrow 2^{10}$  combinations

$\hookrightarrow 1024 \Rightarrow 1\text{KB}$  page size  $\Rightarrow$  can address from 0 to 1023

Sample Assume 16-bit addressing and the page size as 1KB. Given the relative address as 2024 find the page number & offset (i.e. the logical address)

Page size 1KB  $\Rightarrow 2^{10} \Rightarrow 10$  bits for offset  
 $2024 - 1024 = 1000 \rightarrow$  offset value  
 $2024 \Rightarrow 0x7E8$

Logical address [0000 0111 1110 1000]  
 page # offset

page # = 1

If page # frame #

|    |    |
|----|----|
| 0  | 0  |
| 1  | 7  |
| .. | .. |

0111 0111 1110 1000

Page #  $\Rightarrow$  1 is mapped to frame # 7  
 $\Rightarrow 7\text{KB}$  with offset 1000 in main memory

Physical address  
if frame no. 7

$$7\text{KB} + 1000 = 0x1FE8$$

13/2/24

## VIRTUAL ADDRESS TO PHYSICAL ADDRESS

|                     |             |
|---------------------|-------------|
| Virtual page number | Page offset |
|---------------------|-------------|

- Page table is indexed with the virtual page numbers to obtain the corresponding portion of physical address
- Page table maps each page in virtual memory to either a page in main memory or a page stored on disk (next level in hierarchy).

## ASIDE TRANSLATION LOOK-AHEAD BUFFER (TLB)

- acts as a cache of the page table for the entries that map to physical pages only
- overall access time is reduced

\* nothing specified  
 $\rightarrow$  convert address given in decimal (default)

- Q1. Consider the single-level paging scheme. The virtual address space is 4MB and page size is 4KB. What is the maximum page table entries size possible such that the entire page table fits well in one page?

Virtual memory space = 4MB

Page size = 4KB

$$4 \times 2^{10} = 2^9 = 1024 \text{ entries}$$

- \* For page table to fit well in one page  
page table size  $\leq$  page size

Let page table entry size = B bytes

Page table size = Number of entries  $\times$  Page table entry size  
in page table

$$2^9 \times B \text{ bytes} \leq 4 \text{ KB}$$

$$B \leq \frac{4 \times 2^{10}}{2^9}$$

$B \leq 4$  max possible page table entry size = 4 bytes

- Q2. Consider a machine with 64 MB physical memory and a 32 bit virtual address space. If the page size is 4 KB, what is the approximate size of the page table?

Physical memory = 64 MB

Page size = 4KB

$$\text{Number of frame} = \frac{64 \text{ MB}}{4 \text{ KB}} = \frac{2^6 \times 2^{20}}{2^2 \times 2^{10}} = 2^4 \times 2^{10} = 16,384 \text{ frame}$$

$$\text{Size of virtual memory} = 2^{32} \rightarrow 32\text{-bit addressing}$$

$$\Rightarrow 14 \text{ bits required}$$

$$\text{Number of entries} = 2^{32} = 2^{20} \text{ entries}$$

$$2^3 \times 2^{10} \text{ number}$$

B bytes per page

$$2^{20} \times B \text{ bytes} \leq 4 \text{ KB}$$

Page table size = Number of entries  $\times$  Page table entry size  
in page table

$$= 2^{20} \times 14 \text{ bits}$$

$$\approx 2^{20} \times 16 \text{ bits} \quad (\text{byte-oriented})$$

$$= 2^{20} \times 2 \text{ bytes}$$

$$= 2 \text{ MB}$$

- Q3. Consider a machine Compare the access times of 64KB caches with 64 B blocks and a single bank. What are the relative access times of 2-way and 4-way set associative caches in comparison to a direct-mapped organisation? Assume the access time of

- direct-mapped cache is 0.86 ns
- 2-way cache is 1.12 ns
- 4-way cache is 1.37 ns

$$\frac{\text{2-way}}{\text{direct-mapped}} = \frac{1.12 \text{ ns}}{0.86 \text{ ns}} = 1.302$$

$$\frac{\text{4-way}}{\text{direct-mapped}} = \frac{1.37 \text{ ns}}{0.86 \text{ ns}} = 1.593$$

\* Access time for direct-mapped is the least

\* Access time usually increases as associativity is increased

- Q4. Compare the access times of 4-way set associative caches with 64 bytes and a single bank. What are the relative access times of 32KB and 64KB caches in comparison to a 16KB cache.

Given Access time of 16KB  $\rightarrow 1.27 \text{ ns}$

$$32 \text{ KB} \rightarrow 1.35 \text{ ns}$$

$$64 \text{ KB} \rightarrow 1.37 \text{ ns}$$

$$\frac{\text{32KB cache}}{\text{16KB cache}} = \frac{1.35 \text{ ns}}{1.27 \text{ ns}} = 1.062$$

$$\frac{\text{64 KB cache}}{\text{16 KB cache}} = \frac{1.37 \text{ ns}}{1.27 \text{ ns}} = 1.0787$$

- Q5. For a 64 KB cache, find the cache associativity between 1 and 8 with the lowest average memory access time, given that misses per instruction for a certain workload suite is DM  $\rightarrow$  0.00664

$$2 \rightarrow 0.00366$$

$$4 \rightarrow 0.000987$$

$$8 \rightarrow 0.000266$$

Overall, there are 0.3 references per instruction. Assume cache misses take 10ns for all models.

Cycle time - 0.5 ns (DM, 2-way)

- 0.83 ns (4-way)

- 0.79 ns (8-way)

$$\text{Miss rate} = \frac{\text{Misses per instruction}}{\text{References per instruction}} \times 100$$

$$\text{DM} \rightarrow \frac{0.00664}{0.3} \times 100 = 2.213\% \quad \text{4-way} \rightarrow \frac{0.000987}{0.3} \times 100 = 0.389\%$$

$$\text{2-way} \rightarrow \frac{0.00366}{0.3} \times 100 = 1.22\% \quad \text{8-way} \rightarrow \frac{0.000266}{0.3} \times 100 = 0.889\%$$

$$t_{DM} = 0.5 + 2.213 \times 10 = 0.7213 \text{ ns}$$

$$t_{2\text{-way}} = 0.5 + \frac{1.22}{100} \times 10 = 0.622 \text{ ns}$$

$$t_{4\text{-way}} = 0.83 + \frac{0.389}{100} \times 10 = 0.8629 \text{ ns}$$

$$t_{8\text{-way}} = 0.79 + \frac{0.889}{100} \times 10 = 0.798867 \text{ ns}$$

∴ 2-way cache associativity has the least/lowest average memory access time

15/2/24

ALIASING - A situation in which 2 addresses access the same object

- Can occur in virtual memory when there are 2 virtual addresses for the same physical page

### SEGMENTATION

- Dividing the program and its associated data into partitions of equal or unequal length
- A logical address using segmentation consists of 2 parts, a SEGMENT NUMBER and an OFFSET.
- Similar to dynamic partitioning

(different in terms of) a pgm may occupy more than one partition, and these partitions need not be contiguous

- \* offset < Length of Process Segment Table for the particular segment

↳ physical address obtained

- \* Offset > Length  $\Rightarrow$  Segmentation fault

- Q. Assume 4KB pages, a 4-entry fully associative TLB, and true LRU replacement. If the following references are made 4669, 2227, 13916, 34587, 48870, 12608, 49225

Identify the TLB Miss, page fault and the final page table entry at the end of all the accesses

Assume that the if the pages must be brought in from disk, increment the next largest page number.

| TLB | VALID | TAG | PHYSICAL PAGE NUMBER |
|-----|-------|-----|----------------------|
|     | 1     | 40  | 125                  |
|     | 1     | 78  | 14                   |
|     | 1     | 3   | 6                    |
|     | 0     | X11 | 1312                 |

> Some page (in fully associative (∴ only tag & offset fields))

TLB Miss  $\rightarrow$  no page fault  
 $\rightarrow$  there is mapping in page table

Page Table

|    | VALID | PHYSICAL PAGE / IN DISK |
|----|-------|-------------------------|
| 0  | 1     | 5                       |
| 1  | 0     | Disk 13                 |
| 2  | 0     | Disk                    |
| 3  | 1     | 6                       |
| 4  | 1     | 9                       |
| 5  | 1     | 11                      |
| 6  | 0     | Disk                    |
| 7  | 1     | 4                       |
| 8  | 0     | Disk 14                 |
| 9  | 0     | Disk                    |
| 10 | 0     | 3                       |
| 11 | 1     | 12                      |

Page size = 4KB  
 $= 2^{12}$   
 $\Rightarrow 12$  bits for offset

4669  $\Rightarrow$  0x123D

Page number = 1  
TLB Miss  
 Disk; V=1 now  
 page = 13

2227  $\Rightarrow$  0x08B3

Page number = 0

13916  $\Rightarrow$  0x365C

Page number = 3

20/2/24

- \* 4669  $\Rightarrow$  0x123D Page number = 1  $\Rightarrow$  TLB Miss  
 Go to Page Table  $\rightarrow$  at index 1, V=0  $\Rightarrow$  page present in disk  
 Replace Disk by 13 and set V=1 for index 1 as per the assumption given in the question  
 One of the entries of TLB has V=0  
 Set that V=1, Tag=1, Physical Page number = 13

Physical memory address generation  $13 \Rightarrow 0xD$   
 $\therefore$  Physical memory address  $\rightarrow$  0xD23D

- \* 2227  $\Rightarrow$  0x8B3 Page number = 0  $\Rightarrow$  TLB Miss  
 Page Table index 0  $\rightarrow$  V=1  $\Rightarrow$  page present at 5  
 Using LRU, replace Tag=11 with Tag=0 and physical page number to 5  
 Physical memory address  $\Rightarrow$  0x58B3

- \* 13916  $\Rightarrow$  0x365C Page number = 3  $\Rightarrow$  TLB Hit  
 Physical memory address  $\rightarrow$  0x665C
- \* 34587  $\Rightarrow$  0x871B Page number = 8  $\Rightarrow$  TLB Miss  
 Go to Page Table  $\rightarrow$  at index 8, V=0  $\rightarrow$  Set V=1 and physical page to 14 (bringing  $\uparrow$  Page fault)  
 In TLB, replace second entry to Tag=8, Physical page no.<sup>=</sup> 14  
 Physical memory address  $\rightarrow$  0xE71B
- \* 48870  $\Rightarrow$  0xBEE6 Page number = 11  $\Rightarrow$  TLB Miss  
 Page Table  $\rightarrow$  index 11, V=1  $\Rightarrow$  No page fault  
 Replace Tag = 1 to Tag = 11 Physical mem addr  $\rightarrow$  0xCEE6
- \* 12608  $\Rightarrow$  0x3140 Page number = 3  $\Rightarrow$  TLB Hit  
 Physical memory address  $\rightarrow$  0x8140
- \* 49225  $\Rightarrow$  0xC049 Page number = 12  $\Rightarrow$  TLB Miss  
 Page Table  $\rightarrow$  no entry 12  $\Rightarrow$  Disk miss  
 Add entry at index 12 to Page Table V=1, Physical page = 15

Decode  $\Rightarrow$  identifies the operation and operands involved

CLASSMATE  
Date \_\_\_\_\_  
Page 60

cycles per instruction

$$\text{Pipeline CPI} = \text{Ideal pipeline CPI} + \text{Structural stalls} + \text{Data hazard stalls} + \text{Control stalls}$$

21/2/24

CLASSMATE  
Date \_\_\_\_\_  
Page 61

|     | 1                                                   | 2  | 3  | 4     | 5     | 6  | 7   | 8  | 9                     | 10 | 11 |
|-----|-----------------------------------------------------|----|----|-------|-------|----|-----|----|-----------------------|----|----|
| Sub | s <sub>2</sub> , s <sub>1</sub> , s <sub>3</sub>    | IF | ID | EX    | MEM   | WB |     |    |                       |    |    |
| and | s <sub>12</sub> , s <sub>2</sub> , s <sub>5</sub>   | IF | ID | Stall | Stall | EX | MEM | WB | s <sub>2</sub> in use |    |    |
| or  | s <sub>13</sub> , s <sub>6</sub> , s <sub>2</sub>   | IF | ID | Stall | Stall | EX | MEM | WB |                       |    |    |
| add | s <sub>14</sub> , s <sub>2</sub> , s <sub>2</sub>   | IF | ID | Stall | Stall | EX | MEM | WB |                       |    |    |
| sw  | s <sub>15</sub> , i <sub>10</sub> (s <sub>2</sub> ) | IF | ID | Stall | Stall | EX | MEM | WB |                       |    |    |

Rearrangement

Removing redundancy



- Instruction j dependent on instruction i if
- i produces a result that may be used by j
  - j is data dependent on instruction k & instruction k is data dependent on instruction i



### NAME DEPENDENCE

- occurs when 2 instructions use the same register or memory location (NAME), but there is no flow of data between the instructions associated with that name



- does not exist due to data transfer b/w instructions; exists because of the memory locations being accessed.

\* Can be resolved by RENAMING

done  
statically by compiler  
dynamically by hardware

DATA HAZARD exists whenever there is a name or data dependence between instructions

Possible data hazards:

(i) RAW  $\Rightarrow$  Read After Write

$$i \quad R2 \leftarrow R1 + R3$$

$$j \quad R4 \leftarrow R2 + R3$$

- True data dependence

- program order must be preserved to ensure j receives correct value of R2 after execution of i

(ii) WAW  $\Rightarrow$  Write After Write

$$i \quad R2 \leftarrow R1 + R3$$

$$j \quad R2 \leftarrow R4 + R5$$

(iii) WAR (Write After Read)

$$\begin{array}{l} i \quad R2 \leftarrow R1 + R3 \\ j \quad R3 \leftarrow R4 + R5 \end{array}$$

arises from an antidependence

- \* There is no data hazard called RAR

### CONTROL DEPENDENCE

determines the ordering of an instruction  $i$  w.r.t a branch  
so that  $i$  is executed in correct pgm order and only  
when it should be.

if  $p_1 \{$        $s_1$  is control dependent on  $p_1$   
 $s_1;$        $s_2$  is control dependent on  $p_2$  but not on  $p_1$   
 $\};$   
if  $p_2 \{$        $s_2;$   
 $\};$

Constraints imposed by control dependences

→ an instr. that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch

→ an instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch

When processors preserve strict pgm order, they ensure that control dependences are also preserved

Speculation

→ SOFTWARE SPECULATION

→ HARDWARE SPECULATION

Control dependence is preserved by implementing control hazard detection that causes CONTROL STALLS.

Control stalls can be eliminated or reduced by a variety of hardware and software techniques

### LOOP UNROLLING

latencies of FP Operations Used

| Instr. producing result | Instr. using result | Latency (clock cycles) |
|-------------------------|---------------------|------------------------|
| FP ALU op               | Another FP ALU op   | 3                      |
| FP ALU op               | Store double        | 2                      |
| Load double             | FP ALU op           | 1                      |
| Load double             | Store double        | 0                      |

for ( $i = 999 ; i >= 0 ; i = i - 1$ )  
 $x[i] = x[i] + s;$

Loop : L.D F0, O(R1) ;  $F0 = \text{array element}$

ADD.D F4, F0, F2 ; add scalar in F2

S.D F4, D(R1) ; store result

DADDUI R1, R1, # -8 ; decrement ptr by 8 bytes

BNE R1, R2, Loop ; branch  $R1 = R2$

23/2/24

Since it is floating point type data, each occupies 8 bytes  
(The assembly code to write before scheduling)

$$\text{Delay} \rightarrow 1+2+0+1$$

Without scheduling  
Loop :



∴ Loop unrolling is done

Loop : LD F0, 0(R1)  
DADDUI R1, R1, # -8  
ADD.D F4, F0, F2  
Stall  
Stall  
SD F4, 8(R1)  
BNE R1, R2, Loop

7 cycles with scheduling (only 2 stalls)

\* Stalls after ADD.D are used by

Loop : LD F0, 0(R1)  
LD F6, -8(R1)  
LD F10, -16(R1)  
LD F4, -24(R1)  
  
ADD.D F4, F0, F2  
ADD.D F8, F6, F2  
ADD.D F12, F10, F2  
ADD.D F16, F14, F2  
  
SD F4, 0(R1)  
SD F8, -8(R1)  
DADDUI R1, R1, # -32  
  
SD F12, 16(R1)  
SD F16, 8(R1)  
BNE R1, R2, Loop

21 cycles for  
4 set of instructions