

# CSE211: Compiler Design

Nov. 22, 2022

- **Topic:** Decoupled Access Execute (DAE)
- **Discussion questions:**
  - What does it mean for an application to be memory bound?
  - What are some techniques for dealing with memory bottlenecks

*Traditional SMP System*



*Decoupled Access/Execute System*



# Announcements

- Homework 4 is released
  - pair up ASAP please
  - Exploring a DSL
  - pair assignment
  - due Dec. 5 (two weeks)
  - 6 page report
    - Discussion
- You can propose a DSL to explore, but you must do it within 1 week and there is no guarantee that I will approve it.

# Announcements

- Working on grading HW 2 still
  - Sorry for the delay but it will be out soon!

# Paper and project proposals

- Remember
  - Reports are due the day of the final: Dec 8.
  - I highly suggest not saving these until the last minute. They have a late deadline to give you flexibility, not to enable procrastination.
- Project presentation:
  - Everyone doing a project needs to prepare a 12 minute presentation
  - If your project isn't finished yet, then present background and as much as you have
  - I will randomly select 6 people to present in class on Dec. 1
  - Everyone else needs to submit a screen cast of the presentation.

# Paper and project proposals

- Final:
  - Dec. 8
  - Take home (1 day instead of 1 week)
  - Similar to the midterm in length and question style
  - Designed to take ~2.5 hours not including studying
  - If you want to set aside time, 12 - 3 pm
- Same rules as midterm
  - Open note, open internet
  - Do not talk to each other about the test while it is out. Do not google for specific questions
  - ask questions as private posts on Piazza
  - No guarantees on answering questions past business hours

# CSE211: Compiler Design

Nov. 22, 2022

- **Topic:** Decoupled Access Execute (DAE)
- **Discussion questions:**
  - What does it mean for an application to be memory bound?
  - What are some techniques for dealing with memory bottlenecks

*Traditional SMP System*



*Decoupled Access/Execute System*





# Specialization discussion

- CPUs:
  - Aim to be good at general tasks
  - poor area and energy utilization

# Specialization discussion

- CPUs:
  - Aim to be good at general tasks
  - poor area and energy utilization

How many floating point operations per second (FLOPS) on matrix multiplication

2 TFLOPS

# Specialization discussion

- CPUs:
  - Aim to be good at general tasks
  - poor area and energy utilization
- GPUs:
  - Good at regular, uniform parallelism
  - Bad at irregular parallelism and programs with control dependencies

How many floating point operations per second (FLOPS) on matrix multiplication

2 TFLOPS

# Specialization discussion

- CPUs:
  - Aim to be good at general tasks
  - poor area and energy utilization

How many floating point operations per second (FLOPS) on matrix multiplication

2 TFLOPS

- GPUs:
  - Good at regular, uniform parallelism
  - Bad at irregular parallelism and programs with control dependencies

125 TFLOPS  
(62x faster than CPU)

# Specialization discussion

- CPUs:
  - Aim to be good at general tasks
  - poor area and energy utilization
- GPUs:
  - Good at regular, uniform parallelism
  - Bad at irregular parallelism and programs with control dependencies
- TPUs:
  - Good at matrix multiplication
  - Not good at much else (12 instructions)

How many floating point operations per second (FLOPS) on matrix multiplication

2 TFLOPS

125 TFLOPS  
(62x faster than CPU)

# Specialization discussion

- CPUs:
    - Aim to be good at general tasks
    - poor area and energy utilization
  - GPUs:
    - Good at regular, uniform parallelism
    - Bad at irregular parallelism and programs with control dependencies
  - TPUs:
    - Good at matrix multiplication
    - Not good at much else (12 instructions)
- How many floating point operations per second (FLOPS) on matrix multiplication
- |            |                                                 |
|------------|-------------------------------------------------|
| 2 TFLOPS   |                                                 |
| 125 TFLOPS | (62x faster than CPU)                           |
| 180 TFLOPS | (much faster than CPU,<br>1.4x faster than GPU) |

# Specialization in modern SoCs

- From David Brooks lab at Harvard:  
<http://vlsiarch.eecs.harvard.edu/research/accelerators/die-photo-analysis/>
- CPUs, GPUs, Neural Engine, IP blocks (cryptography, DSP, etc.)



# How do programs take advantage of specialization?

- **Programmer-centric:**
  - Programmers write specialized code that targets specific specialized processors
- **API-centric**
  - e.g. Tensorflow targets CPU, GPU, TPU
- **Hardware-centric:**
  - Hardware optimizes programs
  - Pipelining, super scalar, caches, etc. (what our traditional systems already do)
- **Compiler-centric:**
  - Compiler performs non-trivial transformations to target specialized hardware

# Specialization is not new

- First GPU in 1951 (MIT flight simulator)
- Architecture academic work proposes many new designs
  - Evaluated on detailed simulators; rarely taped out
- Had a hard time breaking into the mainstream:
  - benefits had to outweigh eventual returns from Dennard's Scaling and Moore's Law
- But now...
  - Hennessy and Patterson's 2017 Turing award lecture: The New Golden Age of Computer Architecture

# Decoupled Access/Execute (DAE)

- 1982: James E. Smith
    - Lives in Montana now and gives interesting keynotes at architecture conferences

# *“Reverse-Engineering the Brain: A Computer Architecture Grand Challenge” ISCA 2018*

DECOPLED ACCESS/EXECUTE COMPUTER ARCHITECTURES

James E. Smith

Department of Electrical and Computer Engineering  
University of Wisconsin-Madison, Madison, Wisconsin 53706

Abstract

An architecture for improving computer performance is presented and discussed. The main feature of the architecture is a high degree of decoupling between operand access and execution. This results in an implementation which has two separate instruction streams that communicate via queues. A similar architecture has been previously proposed for array processors, but in that context the software is called on to do most of the coordination and synchronization between the instruction streams. This paper emphasizes the implementation features that remove this burden from the programmer. Performance comparisons with a conventional scalar architecture are given, and these show that considerable performance gains are possible.

Single instruction stream versions, both physical and conceptual, are discussed with the primary goal of minimizing the differences with known compilation and programming techniques. This would allow conventional architectures. Finally, the problem of deadlock in such a system is discussed, and one possible solution is

A second critical constraint on performance is time required for processor-memory communication. Current trends, both in hardware and software, tend to aggravate the memory communication problem. In hardware, the trend toward higher levels of integration has the effect of increasing the performance impact of all the effect of inter-chip communication, including processor-memory communication. At the architectural level the trend is toward elaborate virtual memory address protection methods. These tend to slow memory communication because they require checks. The use of multiprocessors often means that individual processors must contend for memory resources. In addition, interconnection structures and add both contention. Cache memory becomes a less effective solution in multiprocessor systems due to the problem of maintaining coherence. At the system level, facilities for defining cache memory structures and adding types and structures for defining. At the system level, causes an increase in the number of developed needed to check types, compute indices of data. All of which adds to increased delay when the processors that can diminish the effect of increased memory communication time.

This paper discussed a new type of architecture which separates its type into two parts: access to memory to fetch store results, and operand execution to provide execution. By architecturally separating these two parts, it is possible to provide a practical solution to the problem of deadlock in such a system.

# Decoupled Access/Execute (DAE)

- 1982: James E. Smith
  - Lives in Montana now and gives interesting keynotes at architecture conferences
- 2015: DeSC by Ham et al.
  - More optimizations and practicalities

## DeSC: Decoupled Supply-Compute Communication Management for Heterogeneous Architectures

Tae Jun Ham  
Princeton University  
[tae@princeton.edu](mailto:tae@princeton.edu)

Juan L. Aragón  
University of Murcia  
[jlaragon@um.es](mailto:jlaragon@um.es)

Margaret Martonosi  
Princeton University  
[mrm@princeton.edu](mailto:mrm@princeton.edu)

### ABSTRACT

Today's computers employ significant heterogeneity to meet performance targets at manageable power. In adapting increased compute targets on memory or communication, however, the relative amount of time spent on memory or communication latency has increased. System and software optimizations for memory and communication often come at the costs of increased complexity and reduced portability. We propose Decoupled Supply-Compute (DeSC) as a way to attack memory bottlenecks automatically, while maintaining good portability and low complexity. Drawing from Decoupled Access Execute (DAE) approaches, our work updates and expands on these techniques with increased specialization and automatic compiler support. Across the evaluated workloads, DeSC offers an average of 2.04x speedup over baseline (on homogeneous CMPs) and 1.56x speedup when a DeSC data supplier feeds a hardware accelerator. Achieving performance gains of what a perfect cache hierarchy would offer, while maintaining useful generality

greatly leverages improving computation performance at manageable power, its effective use raises additional challenges. First, the long-troubling "memory wall" becomes even more challenging in many accelerator-centric designs. From an Amdahl's Law point of view, as specialized accelerators speed up computations, the communication memory operations that feed them represent even more of the remaining performance slowdown [27, 50]. A second challenge in accelerator-oriented design is that the software-managed communication tailoring used to reduce communication cost often increases software complexity and reduces performance predictability [12, 13] with scratchpad memory, transfers in and it are typically tightly tailored to the scratchpad [28]. For example, for a loosely-coupled accelerator, programmers must also work to fit the scratchpad of computation and communication. Even small variations in accelerator design can require code to be reoptimized.

While using cache memories instead of DRAM can mitigate some concerns about power and software portability, many issues remain. For example, caches still require programming computation and communication. As caches expose variable communication to the accelerator, this can force a more complex design, either regarding computation or memory access. At the end of the day, the choice of at-accelerator design depends on the specific needs of the application.

# DAE - motivation

*simple example program*

```
for (int i = 0; i < SIZE; i++) {  
    a[i] = b[i] * 3.14;  
}
```

# DAE - motivation

*simple example program*

```
for (int i = 0; i < SIZE; i++) {  
    a[i] = b[i] * 3.14;  
}
```



*pseudo 3-address code*

```
for (int i = 0; i < SIZE; i++) {  
    float r0 = load(b + i);  
    float r1 = r0 * 3.14;  
    store(a + i, r1);  
}
```

# DAE - motivation

*simple example program*

```
for (int i = 0; i < SIZE; i++) {  
    a[i] = b[i] * 3.14;  
}
```

*pseudo 3-address code*

```
for (int i = 0; i < SIZE; i++) {  
    float r0 = load(b + i);  
    float r1 = r0 * 3.14;  
    store(a + i, r1);  
}
```



core 0

*time*

# DAE - motivation

*simple example program*

```
for (int i = 0; i < SIZE; i++) {  
    a[i] = b[i] * 3.14;  
}
```

*pseudo 3-address code*

```
for (int i = 0; i < SIZE; i++) {  
    float r0 = load(b + i);  
    float r1 = r0 * 3.14;  
    store(a + i, r1);  
}
```



# DAE - motivation

*simple example program*

```
for (int i = 0; i < SIZE; i++) {  
    a[i] = b[i] * 3.14;  
}
```

*pseudo 3-address code*

```
for (int i = 0; i < SIZE; i++) {  
    float r0 = load(b + i);  
    float r1 = r0 * 3.14;  
    store(a + i, r1);  
}
```



# DAE - motivation

*simple example program*

```
for (int i = 0; i < SIZE; i++) {  
    a[i] = b[i] * 3.14;  
}
```

*pseudo 3-address code*

```
for (int i = 0; i < SIZE; i++) {  
    float r0 = load(b + i);  
    float r1 = r0 * 3.14;  
    store(a + i, r1);  
}
```



# DAE - motivation

*simple example program*

```
for (int i = 0; i < SIZE; i++) {  
    a[i] = b[i] * 3.14;  
}
```

*pseudo 3-address code*

```
for (int i = 0; i < SIZE; i++) {  
    float r0 = load(b + i);  
    float r1 = r0 * 3.14;  
    store(a + i, r1);  
}
```



# DAE - motivation

*simple example program*

```
for (int i = 0; i < SIZE; i++) {  
    a[i] = b[i] * 3.14;  
}
```

*pseudo 3-address code*

```
for (int i = 0; i < SIZE; i++) {  
    float r0 = load(b + i);  
    float r1 = r0 * 3.14;  
    store(a + i, r1);  
}
```



# DAE - motivation

*simple example program*

```
for (int i = 0; i < SIZE; i++) {  
    a[i] = b[i] * 3.14;  
}
```

*pseudo 3-address code*

```
#pragma parallel  
for (int i = 0; i < SIZE; i++) {  
    float r0 = load(b + i);  
    float r1 = r0 * 3.14;  
    store(a + i, r1);  
}
```



# DAE - motivation



*pseudo 3-address code*

```
#pragma parallel
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    float r1 = r0 * 3.14;
    store(a + i, r1);
}
```



# DAE - motivation



# DAE - illustration



# DAE - illustration



*DAE: split into heterogeneous parallelism: one core does memory and one does computation*



# DAE - illustration



*DAE: split into heterogeneous parallelism: one core does memory and one does computation*



# DAE - illustration



*DAE: split into heterogeneous parallelism: one core does memory and one does computation*



# DAE - illustration



*DAE: split into heterogeneous parallelism: one core does memory and one does computation*



# DAE - illustration



*DAE: split into heterogeneous parallelism: one core does memory and one does computation*



*This is sequentialized 😞*  
How can we fix this?

# Store Buffers

core



# Store Buffers



# Store Buffers



# Store Buffers



# Store Buffers

<continue>

core



*to memory hierarchy*

# Store Buffers



# Store Buffers



# Store Buffers

core



*Eventually this fills up*

# Store Buffers



# DAE Parallelism



# DAE Parallelism



# DAE Parallelism



# DAE Parallelism



# DAE Parallelism



# DAE Parallelism



# DAE Parallelism



# DAE Parallelism



store buffer  
addr value

|  |  |
|--|--|
|  |  |
|  |  |
|  |  |
|  |  |

has 2 entries now

```
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    float r1 = r0 * 3.14;
    store(a + i, r1);
}
```



# Specializing a DAE architecture



# Specializing a DAE architecture



# Specializing a DAE architecture



# Specializing a DAE architecture

optimizations?  
FP unit  
Vector units



optimizations?  
Load/Store unit  
Storebuffer

*Less contention  
on memory hierarchy*

# DAE API



# Compiler

- Given a sequential program, how can we automatically target a DAE architecture?

# Program slicing

- Mark Weiser in 1981.
  - presented as a formalization of debugging



# Program slicing

Main idea:

# Program slicing

Main idea:

- **Forward Slicing:** given statements  $S$ , remove all statements except for those that depend (control or data) on  $s \in S$ 
  - Intuitively: get a minimal (heuristically) program where all actions depend on statements in  $S$

# Program slicing

Main idea:

- **Forward Slicing:** given statements  $S$ , remove all statements except for those that depend (control or data) on  $s \in S$ 
  - Intuitively: get a minimal (heuristically) program where all actions depend on statements in  $S$
- **Backward Slicing:** given statements  $S$ , remove all statements except for those that for which  $s \in S$  depend on.
  - Intuitively: get a minimal (heuristically) program where all actions influence statements in  $S$

# Program slicing

Main idea:

- **Forward Slicing:** given statements  $S$ , remove all statements except for those that depend (control or data) on  $s \in S$ 
  - Intuitively: get a minimal (heuristically) program where all actions depend on statements in  $S$

This is the one we will focus on
- **Backward Slicing:** given statements  $S$ , remove all statements except for those that for which  $s \in S$  depend on.
  - Intuitively: get a minimal (heuristically) program where all actions influence statements in  $S$

# Program slicing

```
1. r0 = a + b;  
2. r1 = b + c;  
3. r2 = r0 * r0;  
4. r4 = r1 + r0;  
5. r5 = r2 + r0;  
6. r6 = 128;  
7. assert(r5 == r6)
```

slicing criterion:  
[  
"7. assert(r5 == r6)"  
]

*start with the statement and work backwards until there are no more dependencies*

# Program slicing

```
1. r0 = a + b;  
2. r1 = b + c;  
3. r2 = r0 * r0;  
4. r4 = r1 + r0;  
5. r5 = r2 + r0;  
6. r6 = 128;  
7. assert(r5 == r6)
```

slicing criterion:  
[  
"7. assert(r5 == r6)"  
]

*start with the statement and work backwards until there are no more dependencies*

# Program slicing

```
1. r0 = a + b;  
2. r1 = b + c;  
3. r2 = r0 * r0;  
4. r4 = r1 + r0;  
5. r5 = r2 + r0;  
6. r6 = 128;  
7. assert(r5 == r6)
```

slicing criterion: [  
“7. assert(r5 == r6)”  
]

*start with the statement and work backwards until there are no more dependencies*

# Program slicing

```
1. r0 = a + b;  
2. r1 = b + c;  
3. r2 = r0 * r0;  
4. r4 = r1 + r0;  
5. r5 = r2 + r0;  
6. r6 = 128;  
7. assert(r5 == r6)
```

slicing criterion: [  
“7. assert(r5 == r6)”  
]

*start with the statement and work backwards until there are no more dependencies*

# Program slicing

```
1. r0 = a + b;  
2. r1 = b + c;  
3. r2 = r0 * r0;  
4. r4 = r1 + r0;  
5. r5 = r2 + r0;  
6. r6 = 128;  
7. assert(r5 == r6)
```

slicing criterion:  
[  
"7. assert(r5 == r6)"  
]

*start with the statement and work backwards until there are no more dependencies*

# Program slicing - Control dependence

```
1. r0 = a + b;  
2. r1 = b + c;  
3. r2 = r0 * r0;  
4. r4 = r1 + r0;  
5. bne r4, 64, END  
6. r5 = r2 + r0;  
7. r6 = 128;  
8. assert(r5 == r6)  
9.END:
```

slicing criterion:  
[  
"8. assert(r5 == r6)"  
]

*start with the statement and work backwards until there are no more dependencies*

# Program slicing - Control dependence

```
1. r0 = a + b;  
2. r1 = b + c;  
3. r2 = r0 * r0;  
4. r4 = r1 + r0;  
5. bne r4, 64, END
```

```
6. r5 = r2 + r0;  
7. r6 = 128;  
8. assert(r5 == r6)
```

```
9.END:
```

slicing criterion:  
[  
“8. assert(r5 == r6)”  
]

*start with the statement and work backwards until there are no more dependencies*

# Program slicing - Control dependence

```
1. r0 = a + b;  
2. r1 = b + c;  
3. r2 = r0 * r0;  
4. r4 = r1 + r0;  
5. bne r4, 64, END
```

```
6. r5 = r2 + r0;  
7. r6 = 128;  
8. assert(r5 == r6)
```

```
9.END:
```

slicing criterion:  
[  
“8. assert(r5 == r6)”  
]

*start with the statement and work backwards until there are no more dependencies*

# Program slicing - Control dependence

```
1. r0 = a + b;  
2. r1 = b + c;  
3. r2 = r0 * r0;  
4. r4 = r1 + r0;  
5. bne r4, 64, END
```

*branch statement*

slicing criterion: [  
“8. assert(r5 == r6)”  
]

*start with the statement and work backwards until there are no more dependencies*

```
6. r5 = r2 + r0;  
7. r6 = 128;  
8. assert(r5 == r6)
```

9.END:

# Program slicing - Control dependence

```
1. r0 = a + b;  
2. r1 = b + c;  
3. r2 = r0 * r0;  
4. r4 = r1 + r0;  
5. bne r4, 64, END
```

slicing criterion:  
[  
"8. assert(r5 == r6)"  
]

*start with the statement and work backwards until there are no more dependencies*

```
6. r5 = r2 + r0;  
7. r6 = 128;  
8. assert(r5 == r6)
```

```
9.END:
```

# Backwards slicing algorithm

```
Worklist = S; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}
```

# Backwards slicing algorithm

```
Worklist = S; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}
```



# Backwards slicing algorithm

```
Worklist = S; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}
```

marked:      worklist:  
                assert()



# Backwards slicing algorithm

```
Worklist = S; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}
marked:      worklist:
assert()
```



# Backwards slicing algorithm

```
Worklist = S; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}
```

marked:      worklist:  
assert()      x  
                y3



# Backwards slicing algorithm

```
Worklist = S; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}
```

marked:      worklist:  
assert()      x  
                y3  
                branch b3  
                branch b2



# Backwards slicing algorithm

```
Worklist = S; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}
```

```
marked:      worklist:
assert()      x
              y3
              branch b3
              branch b2
```



# Backwards slicing algorithm

```
Worklist = s; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}

marked:          worklist:
assert()        y3
x               branch b3
branch b2
```



# Backwards slicing algorithm

```
Worklist = s; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}

marked:      worklist:
assert()     y3
x            branch b3
branch b2
r0
```



# Backwards slicing algorithm

```
Worklist = S; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}
```

marked:      worklist:  
assert()      y3  
x              branch b3  
                branch b2  
r0



# Backwards slicing algorithm

```
Worklist = s; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}

marked:          worklist:
assert()        y3
x               branch b3
branch b2
r0
```



# Backwards slicing algorithm

```
Worklist = s; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}
marked:      worklist:
assert()     y3
x           branch b3
branch b2
r0
```



# Backwards slicing algorithm

```
Worklist = s; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}
```

marked: worklist:  
assert() branch b3  
x branch b2  
y3 r0



# Backwards slicing algorithm

```
Worklist = s; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}
marked:           worklist:
assert()         branch b3
branch b2
x
y3
r0
y1
y2
```



# Backwards slicing algorithm

```
Worklist = S; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}
marked:           worklist:
assert()         branch b3
branch b2
x
y3
r0
y1
y2
```



# Backwards slicing algorithm

```
Worklist = s; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}
marked:           worklist:
assert()         branch b3
branch b2
x
y3
r0
y1
y2
```



# Backwards slicing algorithm

```
Worklist = s; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}
marked:           worklist:
assert()         branch b2
r0
x
y3
branch b3      y1
                y2
```



# Backwards slicing algorithm

```
Worklist = s; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}
marked:           worklist:
assert()         branch b2
r0               r0
x                y1
y3               y2
branch b3        b3
```



# Backwards slicing algorithm

```
Worklist = s; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}
worklist:
branch b2
marked: r0
assert() y1
x y2
y3 b3
branch b3 branch b1
```



# Backwards slicing algorithm

```
Worklist = s; // slicing criteria
while (!Worklist.empty()) {
    stmt = Worklist.pop();
    if (is_marked(stmt)) {
        continue;
    }
    mark(stmt);
    for a in stmt.args() {
        worklist.append(a);
    }
    for p in cfg[stmt].predecessors() {
        worklist.append(p.branch_stmt());
    }
}
worklist:
branch b2
marked: r0
assert() y1
x y2
y3 b3
branch b3 branch b1
```

rest of example  
is an exercise



# Back to DAE



# Compiler

## Step 1: compile to SSA

```
for (int i = 0; i < SIZE; i++) {  
    a[i] = b[i] * 3.14;  
}
```



```
// SSA pseudo code  
for (int i = 0; i < SIZE; i++) {  
    float r0 = load(b + i);  
    float r1 = r0 * 3.14;  
    store(a + i, r1);  
}
```

# Compiler

**Step 2:** Create two copies, one for the access and one for the execute

Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    float r1 = r0 * 3.14;
    store(a + i, r1);
}
```

Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    float r1 = r0 * 3.14;
    store(a + i, r1);
}
```

# Compiler

**Step 3:** Replace loads in Execute with FIFO reads, stores with SB\_store\_values

Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    float r1 = r0 * 3.14;
    store(a + i, r1);
}
```

Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    float r1 = r0 * 3.14;
    store(a + i, r1);
}
```

# Compiler

## Step 3: Replace loads in Execute with FIFO reads

Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    float r1 = r0 * 3.14;
    store(a + i, r1);
}
```

Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

# Compiler

**Step 4:** Enqueue loaded values on the Access. Store addresses instead of values

**Access**

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    float r1 = r0 * 3.14;
    store(a + i, r1);
}
```

**Execute**

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

# Compiler

**Step 4:** Enqueue loaded values on the Access. Store addresses instead of values

Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    float r1 = r0 * 3.14;
    SB_store_addr(a + i);
}
```

Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

# Compiler

**Step 5:** Slice the Execute on all FIFO dequeue and SB store value calls

Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    float r1 = r0 * 3.14;
    SB_store_addr(a + i);
}
```

Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

# Compiler

**Step 6:** Slice the Access on all FIFO enqueue and SB store address calls

## Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    float r1 = r0 * 3.14;
    SB_store_addr(a + i);
}
```

## Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

# Compiler

**Step 6:** Slice the Access on all FIFO enqueue and SB store address calls

## Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    float r1 = r0 * 3.14;
    SB_store_addr(a + i);
}
```

## Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

## Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    SB_store_addr(a + i);
}
```

## Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```



## Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    SB_store_addr(a + i);
}
```

## Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

*blocks until queue has an item to dequeue*



## Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    SB_store_addr(a + i);
}
```

## Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

*blocks until queue has an item to dequeue*



## Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    SB_store_addr(a + i);
}
```

## Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

*blocks until queue has an item to dequeue*



## Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    SB_store_addr(a + i);
}
```

## Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

*blocks until queue has an item to dequeue*



## Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    SB_store_addr(a + i);
}
```

## Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

*blocks until queue has an item to dequeue*



## Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    SB_store_addr(a + i);
}
```

## Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

*blocks until queue has an item to dequeue*



## Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    SB_store_addr(a + i);
}
```

## Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

*blocks until queue has an item to dequeue*



## Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    SB_store_addr(a + i);
}
```

## Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

*blocks until queue has an item to dequeue*



## Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    SB_store_addr(a + i);
}
```

## Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

*blocks until queue has an item to dequeue*



## Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    SB_store_addr(a + i);
}
```

## Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

*blocks until queue has an item to dequeue*



## Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    SB_store_addr(a + i);
}
```

## Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

*blocks until queue has an item to dequeue*



## Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    SB_store_addr(a + i);
}
```

## Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

*blocks until queue has an item to dequeue*



## Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    SB_store_addr(a + i);
}
```

## Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

*blocks until queue has an item to dequeue*



## Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    SB_store_addr(a + i);
}
```

## Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

*blocks until queue has an item to dequeue*



## Access

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = load(b + i);
    FIFO_enqueue(r0);
    SB_store_addr(a + i);
}
```

## Execute

```
// SSA pseudo code
for (int i = 0; i < SIZE; i++) {
    float r0 = FIFO_dequeue()
    float r1 = r0 * 3.14;
    SB_store_value(r1);
}
```

*blocks until queue has an item to dequeue*



# Performance bounds

- A program  $p$  has execution time of  $E(p)$ . The time spent on compute (arithmetic) is  $C(p)$ . The time spent on memory latency is  $M(p)$ .
- For simple core models, we can approximate:  $E(p) = C(p) + M(p)$ 
  - Why might this not be completely accurate for more complex cores?
- In DAE, the Execute time ideally is  $C(p)$ , and the Access ideally is  $M(p)$ .
- Optimistic estimates of DAE performance is
  - $\max(C(p), M(p))$
  - best case is when  $C(p) \sim = M(p)$ , we get 2x performance increase
- Pros/cons?

# Performance results in DeSC (micro '15)

Access and Execute are OoO cores



# Final considerations

- Dependencies:
  - If Access depends on a value from Execute, performance can suffer
  - Also called LoD (loss of decoupling events)
- Coherence:
  - Access must read up-to-date values, even when waiting on Execute
- More optimizations:
  - Similar to asynchronous stores, some loads can be done asynchronously as well (if the value is not needed by the Access).

# GraphAttack (taco '21)

- Follow up work led by Aninda Manocha at Princeton
  - She made some great slides!
- Further specializing DAE
  - Graph applications
  - in order cores

# Graph Applications: Access Patterns are Irregular

- Iterative, frontier-based graph applications
  - Describes many graph processing workloads (e.g. BFS, SSSP, PR)
- **Indirect** accesses to neighbor data
  - Conditionally populate next frontier



|       |   | neighbors |   |   |   |   |
|-------|---|-----------|---|---|---|---|
|       |   | 0         | 1 | 2 | 3 | 4 |
| nodes | 0 | 0         | 1 | 0 | 1 | 0 |
|       | 1 | 0         | 0 | 1 | 0 | 0 |
| 2     | 0 | 0         | 0 | 0 | 1 |   |
| 3     | 0 | 0         | 1 | 0 | 1 |   |
| 4     | 1 | 0         | 0 | 0 | 0 | 0 |

```
for node in frontier:  
    val = process_node(node)  
    for neib in G.neighbors(node):  
        update = update_neib(node_vals, val, neib)  
        if(add_to_frontier(update)):  
            new_frontier.push(neib)
```

Iterative, frontier-based graph application



# Neighbor Memory Accesses: The Problem

- Irregular accesses experience cache misses
- **Neighbor Memory Accesses (NMAs):** irregular memory accesses in critical path



<sup>1</sup> Runtimes measured on a simulated in-order core.

# Making NMAs Asynchronous Through Dependence Graph Analysis

- Graph application kernels can be sliced along their NMAs to create **asynchronous accesses**
  - Producer handles all instructions necessary to determine address of NMA and issues access
  - Consumer handles all instructions that depend on data returned from NMA
- **Dependence graph analysis** can identify NMAs as pointer indirect accesses in innermost loop
  - Competitive parallel implementations use RMWs, so NMAs can be loads or RMWs
  - Search for accesses with addresses originating from earlier load

```
for node in frontier:  
    val = process_node(node)  
    for neib in G.neighbors(node):  
        update = update_neib(node_vals, val, neib)  
        if add_to_frontier(update):  
            new_frontier.push(neib)
```

*Iterative, frontier-based graph application*



*Dependence graph analysis for Producer/Consumer slicing*

# Compilation Procedure



# GraphAttack! Tolerates Latency in Graph Applications by Making NMAs Asynchronous



# GraphAttack! Tolerates Latency and Better Utilizes In-Order Cores



2 Parallel In-Order Cores



vs.



1 In-Order GraphAttack! Pair



# GraphAttack! is Scalable and Energy-Efficient

OoO  
1 OoO Core

vs.

InO | InO | InO | InO  
InO | InO | InO | InO  
8 InO cores

vs.

P | P | P | P  
C | C | C | C  
4 InO GraphAttack Pairs



# Performance results in GraphAttack (taco '21)

Access and Execute are in order cores



# Enjoy the holiday break!

- Get paired up for the assignment and start on it early
- Start working on paper review!