

# BetterTogether



## An Interference-Aware Framework for Fine-grained Software Pipelining on Heterogeneous SoCs

Yanwen Xu, Rithik Sharma, Zheyuan Chen,  
Shaan Mistry, Tyler Sorensen

UC SANTA CRUZ | BE Baskin  
Engineering

Microsoft

# Motivation: Accelerating Computation at Edge

- Lower latency
- Energy efficiency
- Privacy benefits



Google Pixel



NVIDIA Jetson Thor



NVIDIA Jetson Orin Nano



# Motivation: Accelerating Computation at Edge



Input: 3D point cloud

e.g., *Classic Algorithm from NVIDIA*

3D Octree  
Reconstruction

Decomposed  
into



Output: octree data structure

Stage 1  
Morton  
Encoding

Stage 2  
Sorting

Stage 3  
Remove  
Duplicates

Stage 4  
Build  
Radix Tree

Stage 5  
Edge  
Count

Stage 6  
Prefix Sum

Stage 7  
Build  
Octree

Stages depends on data from previous stages

# Motivation: Accelerating Computation at Edge



Developer

```
// -----
// Encode
// -----
[[nodiscard]] constexpr uint32_t morton3D_SplitBy3bits(const uint32_t a) {
    auto x = static_cast<uint32_t>(a) & 0x000003ff;
    x = (x | x << 16) & 0x30000ff;
    x = (x | x << 8) & 0x0300f00f;
    x = (x | x << 4) & 0x30c30c3;
    x = (x | x << 2) & 0x9249249;
    return x;
}

[[nodiscard]] constexpr uint32_t m3D_e_magicbits(const uint32_t x,
                                              const uint32_t y,
                                              const uint32_t z) {
    return morton3D_SplitBy3bits(x) | (morton3D_SplitBy3bits(y) << 1) |
           (morton3D_SplitBy3bits(z) << 2);
}

[[nodiscard]] constexpr uint32_t xyz_to_morton32(const glm::vec4 &xyz,
                                                const float min_coord,
                                                const float range) {
    constexpr auto bit_scale = 1024;
    const auto i = static_cast<uint32_t>((xyz.x - min_coord) / range * bit_scale);
    const auto j = static_cast<uint32_t>((xyz.y - min_coord) / range * bit_scale);
    const auto k = static_cast<uint32_t>((xyz.z - min_coord) / range * bit_scale);
    return m3D_e_magicbits(i, j, k);
},
```



CPU



Bottleneck



Google Tensor G2



## Google Tensor G2

Mali-G710 MP7  
3 TPU  
2 Cortex-X1 (P-cores)  
2 Cortex-A78 (medium)  
4 Cortex-A55 (E-cores)

Modern SoCs integrate **diverse Processing Units (PU)**

# PU Profiling

Mali-G710 MP7  
2 Cortex-X1  
2 Cortex-A78  
4 Cortex-A55



All Available PUs on Pixel

- Ran each stage on each available PUs



All Stages

# PU Profiling

Mali-G710 MP7  
2 Cortex-X1  
2 Cortex-A78  
4 Cortex-A55



All Available PUs on Pixel



- Ran each stage on each available PUs



All Stages

# PU Profiling

Mali-G710 MP7  
2 Cortex-X1  
2 Cortex-A78  
4 Cortex-A55



All Available PUs on Pixel



- Ran each stage on each available PUs
- Found optimal **Stage->PU** mapping



All Stages

# Efficient Pipelined Scheduling



# Efficient Pipelined Scheduling



# Efficient Pipelined Scheduling



# Efficient Pipelined Scheduling



# Efficient Pipelined Scheduling



# Efficient Pipelined Scheduling



# Efficient Pipelined Scheduling



# Efficient Pipelined Scheduling



# Efficient Pipelined Scheduling



# Efficient Pipelined Scheduling



- Mobile systems are prone to **intra-application interferences**

- E.g., we think 4.95 ms, but real measurement was 7.77 ms, **~57% slower**

Execution Timeline

(a) Expected



36.3% difference

(b) Measured



Processing Units (PUs)

Big Core CPUs

Medium Core CPUs

Little Core CPUs

GPU

0 2 4 6 8 10 12 14 16 18 20

Time



# Challenge I: Interference

- When PUs fully utilized
- **Slowdowns and Speedups\*** due to
  - *Resource contention*
  - *Dynamic voltage and frequency scaling (DVFS)*
  - *Thermal throttling*
  - *Power management*
  - ...



**Red = Slower down**  
**Green = Speedup**

\*We consulted with engineers from a major mobile vendor, whose insights were consistent with our observations<sup>21</sup>

# Finding the Optimal Schedule is Hard

- **Schedule** = mapping from program **stages** to appropriate **PU**



# Challenge II: Portability

- Large design exploration space
  - e.g., 9 stage AlexNet  $5^9 \approx 1.9 \text{ M}$  potential schedules
  - **~37 years** for Google Pixel 7a
- **Schedules** are not portable
  - Optimal schedule on Pixel does not work on NVIDIA Jetson



The scheduling framework need to be **portable** and **flexible**, and **suitable** for rapid development



|               |
|---------------|
| NVIDIA Jetson |
| 8 GPU-SMs     |
| 6 E-cores     |

|               |
|---------------|
| Apple A18 SoC |
| 5 GPU-cores   |
| 8 NPU-cores   |
| 2 P-cores     |
| 4 E-cores     |



|                 |
|-----------------|
| Google Pixel 7a |
| 7 GPU-cores     |
| 1 TPU           |
| 2 P-cores       |
| 2 M-cores       |
| 4 E-cores       |

# We present *BetterTogether*

- A performance modeling approach that accounts for intra-application interference



# We present *BetterTogether*

OpenMP

Vulkan

NVIDIA  
CUDA



- A performance modeling approach that accounts for intra-application interference
- An end-to-end static pipeline generator that works across a variety of devices



# We present *BetterTogether*

OpenMP

Vulkan

NVIDIA  
CUDA



- A performance modeling approach that accounts for intra-application interference
- An end-to-end static pipeline generator that works across a variety of devices
- Consists of 3 major components



# We present *BetterTogether*

- A performance modeling approach that accounts for intra-application interference
- An end-to-end static pipeline generator that works across a variety of devices
- Consists of 3 major components
  - *BetterTogether Profiling*
    - Attack **Challenge I (Interference)**



# We present *BetterTogether*



- A performance modeling approach that accounts for intra-application interference
- An end-to-end static pipeline generator that works across a variety of devices
- Consists of 3 major components
  - *BetterTogether Profiling*
    - Attack **Challenge I (Interference)**
  - *BetterTogether Optimizer*
    - Attack **Challenge II (Portability)**



# We present *BetterTogether*



- A performance modeling approach that accounts for intra-application interference
- An end-to-end static pipeline generator that works across a variety of devices
- Consists of 3 major components
  - *BetterTogether Profiling*
    - Attack **Challenge I (Interference)**
  - *BetterTogether Optimizer*
    - Attack **Challenge II (Portability)**
  - *BetterTogether Implementor*
    - Efficient static heterogenous pipeline execution



# This Work: *BetterTogether* Overview



# This Work: *BetterTogether* Overview



1

User decompose  
the workloads into  
Stages



# This Work: BetterTogether Overview



1 User decompose the workloads into Stages



2

```
1 void run_stage_1_cpu(in, out, N) {  
2     #pragma omp parallel for  
3     for (int i = 0; i < N; ++i)  
4         out[i] = morton32(in[i]);  
5 }
```

CPU Code (e.g., OpenMP)

```
1 __global__ void run_stage_1_gpu(in, out, N) {  
2     int idx = threadIdx.x + blockDim.x * blockIdx.x;  
3     int stride = blockDim.x * gridDim.x;  
4     for (int i = idx; i < N; i += stride)  
5         out[i] = morton32(in[i]);  
6 }
```

GPU Code (e.g., CUDA)



# This Work: BetterTogether Overview



1 User decompose the workloads into Stages



2

```
1 void run_stage_1_cpu(in, out, N) {  
2     #pragma omp parallel for  
3     for (int i = 0; i < N; ++i)  
4         out[i] = morton32(in[i]);  
5 }
```

CPU Code (e.g., OpenMP)

```
1 __global__ void run_stage_1_gpu(in, out, N) {  
2     int idx = threadIdx.x + blockDim.x * blockIdx.x;  
3     int stride = blockDim.x * gridDim.x;  
4     for (int i = idx; i < N; i += stride)  
5         out[i] = morton32(in[i]);  
6 }
```

GPU Code (e.g., CUDA)

2 User provide implementations for each PU

|     | S1  | S2  | .. | S7  |
|-----|-----|-----|----|-----|
| PU0 | 2.6 | 3.3 |    | 5.8 |
| PU1 | 0.8 | 1.5 |    | 1.9 |
| PU2 | 0.6 | 1.4 |    | 2.1 |
| PU3 | 0.8 | 9.0 |    | 1.5 |

Interference aware  
**BT-Proiling**

# BT-Proiling - Interference aware profiling

- While profiling each {PU × Stage} pair:
  - Concurrently execute similar stages on other PUs
  - Simulate **whole-application** execution to capture resource contention



Overcome

# BT-Optimizer

- We express our optimization problem as linear constraints

*d) Notation and Decision Variables:*

|               |                                                                                                       |
|---------------|-------------------------------------------------------------------------------------------------------|
| $N$           | Total number of pipeline stages                                                                       |
| $N_i$         | The pipeline stage $i$ , with $i \in \mathcal{N} = \{0, \dots, N-1\}$                                 |
| $\mathcal{C}$ | PU classes: $\mathcal{C} = \{c_1, \dots, c_M\}$                                                       |
| $t_{i,c}$     | Profiled latency of stage $i$ on PU $c$                                                               |
| $x_{i,c}$     | Decision variable: $x_{i,c} \in \{0, 1\}$ ;<br>$x_{i,c} = 1 \Leftrightarrow$ stage $i$ runs on PU $c$ |

- We propose a **three-step** optimization approach



# *Raw BT Profiling results*

| Interference-aware Profiling Table |     |     |    |     |
|------------------------------------|-----|-----|----|-----|
|                                    | S1  | S2  | .. | S7  |
| PU0                                | 2.6 | 3.3 |    | 5.8 |
| PU1                                | 0.8 | 1.5 |    | 1.9 |
| PU2                                | 0.6 | 1.4 |    | 2.1 |
| PU3                                | 0.8 | 9.0 |    | 1.5 |

## *BT Profiling Table*

# SMT Solver

# BT-Optimizer Step 1 - Minimizing Pipeline Bubbles



- Improve utilization by reducing idle gaps (*pipeline bubbles*) across PUs
- By **reducing bubbles**, we **keep all PUs busy** and improve pipeline throughput.



# BT-Optimizer Step 2: Optimizing Latency

- High utilization  $\neq$  **low latency**
- Generate K schedules (e.g.,  $k = 20$ ),
  - each with a different assignment of stages to PUs.
- Minimize latency



# BT-Optimizer Optional Step 3: Autotuning



# BT-Optimizer Optional Step 3: Autotuning



# Workloads Evaluated

- Three edge compute vision tasks
  1. **AlexNet-dense**
  2. **AlexNet-sparse**
  3. **Octree Construction**
- Common in resource-constrained environments



- |                                                                                                                          |                                                                                                                                                       |                                                                                                                                                            |
|--------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|
| • <i>AlexNet-dense</i> <ul style="list-style-type: none"><li>• Dense linear algebra</li><li>• CIFAR-10 dataset</li></ul> | • <i>AlexNet-sparse</i> <ul style="list-style-type: none"><li>• Sparse linear algebra</li><li>• Pruned w/ CONDENSA*</li><li>• Stored in CSR</li></ul> | • <i>Octree</i> <ul style="list-style-type: none"><li>• Tree traversals</li><li>• Irregular memory access</li><li>• Sorting</li><li>• Prefix Sum</li></ul> |
|--------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|

\* <https://github.com/NVlabs/condensa>

# Platforms Evaluated



Less Powerful GPUs

Mix CPUs



Powerful GPUs

Little ARM CPUs



| Platform                                         | Backend        | CPU                                                                  | CPU Frequency                            | Integrated GPU |
|--------------------------------------------------|----------------|----------------------------------------------------------------------|------------------------------------------|----------------|
| <b>Google Pixel 7A</b>                           | Vulkan         | 2x Cortex-X1<br>2x Cortex-A78<br>4x Cortex-A55                       | 2.85 GHz<br>2.35 GHz<br>1.80 GHz         | Mali-G710 MP7  |
| <b>OnePlus 11</b>                                | Vulkan         | 1x Coretex-X3<br>2x Coretex-A715<br>2x Cortex-A710<br>3x Cortex-A510 | 3.2 GHz<br>2.8 GHz<br>2.8 GHz<br>2.0 GHz | Adreno 740     |
| <b>NVIDIA Jetson Orin Nano</b>                   | CUDA<br>Vulkan | 6x Cortex-A78AE                                                      | 1.7 GHz                                  | Ampere GPU     |
| <b>*NVIDIA Jetson Orin Nano (low-power mode)</b> | CUDA<br>Vulkan | 4x Cortex-A78AE                                                      | ~0.85 GHz                                | Ampere GPU     |

\* In low-power mode, 2 cores are shutdown, and overall CPU frequency is reduced by half

# Results Overview



- Geomean speedup of **2.14x** across all workloads, w/ peak of **7.59x**
- In 1 case, **slowdown**

## BT-Proiling

*BetterTogether* produces predictions that closely match measured execution time



Without BT show discrepancies



*predicted and measured execution*

# We have additional insights in the paper

- Preliminary results on
  - Using Google's **EdgeTPU**
  - Implemented using *nnapi*
  - Showing a **1.25x** speedup for AlexNet-dense application on top of existing results
  - Showcasing the flexibility and extensibility of *BetterTogether*



# Conclusion

- We propose ***BetterTogether***
  - An **interference-aware profiling** method that produce accurate profiling tables by accounting for *intra-application interference*
  - an end-to-end static pipeline generator for edge SoCs
- Using *BetterTogether*, we implemented **3** class of applications
- Evaluated across **4** diverse devices, achieving up to **7.59x (geo. 2.14x)**



## Team

Yanwen Xu, Rithik Sharma, Zheyuan Chen  
Shaan Mistry, Tyler Sorensen  
{yxu83, riksharm, zchen406, sdmistry,  
tyler.sorensen}@ucsc.edu



**Open-Source Repo**  
[github.com/ucsc-redwood/better-together](https://github.com/ucsc-redwood/better-together)

# Backup slides



Apple A18 SoC  
90mm<sup>2</sup> @ TSMC N3E

- *big.LITTLE CPU*
- *Integrated GPU*
- *Domain Specific Accelerators (DSA)*
- ...

Modern SoCs integrate **diverse Processing Units (PU)**

Image by ChipWise: <https://chipwise.tech/our-portfolio/apple-a18-a18-pro-die-shot/>



Apple A18 SoC  
90mm<sup>2</sup> @ TSMC N3E

5 GPU-cores  
8 NPU-cores  
2 P-cores  
4 E-cores

- *big.LITTLE CPU*
- *Integrated GPU*
- *Domain Specific Accelerators (DSA)*
- ...

Modern SoCs integrate **diverse Processing Units (PU)**

Image by ChipWise: <https://chipwise.tech/our-portfolio/apple-a18-a18-pro-die-shot/>

We want to utilize all system resources (i.e., PUs)

# Profiling-guided approach (isolated benchmarking)



# Profiling-guided approach (isolated benchmarking)



# BT Implementor

- We define Task

- light weight, pointing to CPU/iGPU shared memory

- Using concurrent Queue to pass Tasks around stages.

Each chunk process the incoming tasks in respective type for Cores



- BetterTogether yields higher correlations



Correlation (0.0–1.0) between predicted and actual times across all applications and platforms. **Higher is better.**