



# Squeezing Operator Performance Potential for the Ascend Architecture

Yuhang Zhou<sup>1</sup>, Zhibin Wang<sup>1</sup>, Guyue Liu<sup>2</sup>, Shipeng Li<sup>1</sup>, Xi Lin<sup>1</sup>, Zibo Wang<sup>1</sup>, Yongzhong Wang<sup>3</sup>,  
Fuchun Wei<sup>3</sup>, Jingyi Zhang<sup>3</sup>, Zhiheng Hu<sup>3</sup>, Yanlin Liu<sup>3</sup>, Chunsheng Li<sup>3</sup>, Ziyang Zhang<sup>3</sup>,  
Yaoyuan Wang<sup>3</sup>, Bin Zhou<sup>4</sup>, Wanchun Dou<sup>1</sup>, Guihai Chen<sup>1</sup>, Chen Tian<sup>1</sup>

<sup>1</sup> Nanjing University <sup>2</sup> Peking University <sup>3</sup> Huawei Technologies Co., Ltd. <sup>4</sup> Shandong University



# Outline



- Introduction
- System Design
- Case Study
- Evaluation
- Conclusion


# AI Domain-Specific Architecture (DSA)



Deep learning models need better arithmetic support, which motivates domain-specific architectures:

- NVIDIA GPU
- Google TPU
- Huawei Ascend NPU
- Cambricon MLU





# Operator Optimization Needs Profiling



*What about Ascend?*





# Ascend Architecture



GPU (NVIDIA)



NPU (Ascend)

Compute Unit      Memory Unit



# Dedicated Compute Units

## Transformer computation

- ~10%: Pooling, ReLU
- ~90%: MatMul, Convolution, Fully connected



## Computing precision

Cube: INT8/FP16

Vector: INT8/FP16/FP32

Scalar: INT32/FP32/...



# Customized Memory Architecture

③ Asymmetric bandwidth



Compute Unit



Memory Unit

→ MTE-GM

→ MTE-L1

→ MTE-UB



# Efficient Transfer Control Units



# Instruction Pipeline



*Example of matrix multiplication $A \times B$*

*Instructions run in parallel across components and serially within each component.*





# Summary of Ascend Architecture

Pros

Ascend Architecture

Cons

Dedicated compute units

**Accurately identifying operator bottlenecks is a challenging but essential task!**

Operational flexibility

Efficient transfer control and instruction pipeline

Inputting

LTE

Pipeline

...



# Existing Operator Performance Analysis



# Limitations of Performance Analysis



(i) Massive combinations between precisions and transfers



# Limitations of Performance Analysis



Underutilization?



(ii) Incorrect analysis by ignoring the sequential execution





# Our Goals




# Overview





# Profiling and Component Abstraction



# Component-based Roofline Model



*Utilization of component can reflect the operator's bottleneck.*

## Operator-aware ideal performance

$$U_{cube} = \frac{A_{cube}}{I_{cube}}, \qquad A_{cube} = \frac{O_{cube}}{T_{total}}$$

The actual performance $A_{cube}$ comes from profiling. The ideal performance $I_{cube}$ must account for the mix of precisions the operator uses, so it is the operation-weighted harmonic mean of the per-precision arithmetic power $P_{\text{prec}}$:

$$I_{cube} = \frac{\sum_{\text{prec}} O_{\text{prec}}}{\sum_{\text{prec}} \frac{O_{\text{prec}}}{P_{\text{prec}}}}$$
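The harmonic-mean formula can be checked with a few lines of Python. This is a minimal sketch, not the authors' tooling: the function names, the precision labels, and every number below are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def ideal_performance(ops_by_precision, peak_by_precision):
    """Operation-weighted harmonic mean of the per-precision peak throughput."""
    total_ops = sum(ops_by_precision.values())
    # Time the component would need if every precision ran at its own peak.
    ideal_time = sum(
        ops / peak_by_precision[prec] for prec, ops in ops_by_precision.items()
    )
    return total_ops / ideal_time


def utilization(ops_by_precision, peak_by_precision, total_time):
    """U = A / I, with A = O / T_total taken from profiling."""
    actual = sum(ops_by_precision.values()) / total_time
    return actual / ideal_performance(ops_by_precision, peak_by_precision)


# Hypothetical operator: 6 GOP in FP16 and 4 GOP in INT8 (assumed numbers).
ops = {"fp16": 6.0e9, "int8": 4.0e9}
# Hypothetical peak throughput per precision, in operations per second.
peak = {"fp16": 2.0e12, "int8": 4.0e12}

i_cube = ideal_performance(ops, peak)            # 1e10 ops / 0.004 s
u_cube = utilization(ops, peak, total_time=0.008)
print(f"I_cube = {i_cube:.3e} ops/s, U_cube = {u_cube:.2%}")
```

With these assumed inputs the ideal time is 0.003 s + 0.001 s, so a measured time of 0.008 s corresponds to 50% utilization. A plain arithmetic mean of the two peaks would overstate the ideal performance because it ignores how many operations run at each precision.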

## Underutilization Analysis

$$\textcircled{1} \quad U_{cube} = \frac{A_{cube}}{I_{cube}} = \underbrace{\frac{O_{cube}}{T_{cube} \cdot I_{cube}}}_{E_{cube}} \cdot \underbrace{\frac{T_{cube}}{T_{total}}}_{R_{cube}}$$

$\textcircled{2}$ A component with low utilization falls into one of two cases:

- Inefficient component: $E_{\text{component}} \leq \frac{R_{\text{threshold}}}{U_{\text{threshold}}}$
- Insufficient parallelism: $R_{\text{component}} < R_{\text{threshold}}$
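The decomposition of utilization into efficiency and time ratio can be sketched as follows. The threshold values, the function names, and the order of the checks in `diagnose` are assumptions made for illustration; in particular, the final branch labels every remaining case as an inefficient component, which simplifies the efficiency test stated on the slide.

```python
def decompose(ops, busy_time, total_time, ideal_perf):
    """Split utilization U into efficiency E and time ratio R, with U = E * R."""
    efficiency = ops / (busy_time * ideal_perf)  # E: how well busy time is used
    time_ratio = busy_time / total_time          # R: share of time the component is busy
    return efficiency * time_ratio, efficiency, time_ratio


def diagnose(utilization, time_ratio, u_threshold, r_threshold):
    """Hypothetical decision rule; thresholds are assumed, not taken from the talk."""
    if utilization >= u_threshold:
        return "bound"
    if time_ratio < r_threshold:
        return "insufficient parallelism"
    return "inefficient component"


# Hypothetical profile: 4 GOP, component busy for 4 ms out of a 10 ms run.
u, e, r = decompose(ops=4.0e9, busy_time=0.004, total_time=0.010, ideal_perf=2.0e12)
verdict = diagnose(u, r, u_threshold=0.6, r_threshold=0.8)
print(f"U={u:.2f} E={e:.2f} R={r:.2f} -> {verdict}")
```

Here the component is busy only 40% of the time, so the sketch reports insufficient parallelism rather than blaming the component itself.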



# Pruning, Visualization and Analysis

## Pruning results



- ✓ Component abstraction
- ✓ Remove irrelevant components
- ✓ Remove impossible combinations

## Roofline Analysis of Add\_ReLU Operator



$U_{component}$  of Vector+MTE-UB (38.42%):  
**Underutilization**

$R_{component}$  of MTE-GM (58.68%):  
**Insufficient parallelism**






# Case Study: Optimization of Add\_ReLU Operator

$$\text{Add\_ReLU}(x) = \text{ReLU}(x + c)$$
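For reference, the operator's semantics can be written in plain Python. This sketch treats $c$ as a scalar constant, which is an assumption, and it says nothing about how the operator is tiled or scheduled on the NPU.

```python
def add_relu(xs, c):
    """Element-wise ReLU(x + c) over a list of floats."""
    return [max(x + c, 0.0) for x in xs]


result = add_relu([-2.0, -0.5, 1.5], 1.0)
print(result)
```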



Data flow



Instruction timeline



# Iteration 1: Reducing spatial dependency

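Original Code

```
...
ub_to_gm(gm_1, ub_1);  // store ub_1 to global memory
gm_to_ub(ub_1, gm_2);  // the next load reuses ub_1, so it must wait for the store
...
```

Optimized Code

```
...
ub_to_gm(gm_1, ub_2);  // store from a separate buffer, ub_2
gm_to_ub(ub_1, gm_2);  // the load into ub_1 can now overlap with the store
...
```

Before: insufficient parallelism (38.42%)

After: MTE-UB bound (66.24%)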
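The effect of the buffer change can be reproduced with a toy scheduling model. This is a sketch under stated assumptions: the two transfers are issued to different queues (named `store` and `load` here, both hypothetical), each queue runs its instructions in order, any two instructions that touch the same buffer are serialized, and the durations are made up.

```python
def makespan(instructions):
    """instructions: list of (queue, duration, buffers) in program order."""
    queue_free = {}   # queue -> time at which it becomes idle
    buffer_free = {}  # buffer -> finish time of the last instruction touching it
    end = 0.0
    for queue, duration, buffers in instructions:
        # Wait for the queue and for every buffer this instruction touches.
        start = max(
            [queue_free.get(queue, 0.0)]
            + [buffer_free.get(b, 0.0) for b in buffers]
        )
        finish = start + duration
        queue_free[queue] = finish
        for b in buffers:
            buffer_free[b] = finish
        end = max(end, finish)
    return end


# Both transfers touch ub_1, so the load waits for the store.
original = [("store", 4.0, ("gm_1", "ub_1")), ("load", 4.0, ("ub_1", "gm_2"))]
# The store uses ub_2, so the two transfers no longer share a buffer.
optimized = [("store", 4.0, ("gm_1", "ub_2")), ("load", 4.0, ("ub_1", "gm_2"))]

t_original = makespan(original)
t_optimized = makespan(optimized)
print(t_original, t_optimized)
```

In this model the shared buffer forces the two transfers to run back to back, while separate buffers let them overlap fully. Real hardware will show a smaller gain, since the transfers may contend for bandwidth.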





# Iteration 2: Minimizing redundant transfer

Original Code

```
for i = 1 to n do
  gm_to_ub(ub_1, c);  // the constant c is reloaded in every iteration
  ...
end for
```

Optimized Code

```
gm_to_ub(ub_1, c);  // load the constant c once, before the loop
for i = 1 to n do
  ...
end for
```

Before: MTE-UB bound (66.24%)

After: MTE-UB bound (70.52%)

The single-operator time is reduced by **1.73×**.

The *component\_utilization* rises by **32.1** percentage points (from 38.42% to 70.52%).

The total inference latency drops by **244.261 μs**.
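Hoisting the transfer out of the loop cuts the number of loads of `c` from `n` to one. The sketch below only counts issued instructions; the log strings and function names are hypothetical stand-ins for the real transfer and compute calls.

```python
def run_original(n):
    """Reload the constant inside the loop, as in the original code."""
    log = []
    for _ in range(n):
        log.append("gm_to_ub(ub_1, c)")  # redundant after the first iteration
        log.append("compute")
    return log


def run_optimized(n):
    """Load the constant once, then reuse the buffer in every iteration."""
    log = ["gm_to_ub(ub_1, c)"]
    for _ in range(n):
        log.append("compute")
    return log


n = 8
loads_before = run_original(n).count("gm_to_ub(ub_1, c)")
loads_after = run_optimized(n).count("gm_to_ub(ub_1, c)")
print(loads_before, loads_after)
```

The rewrite is valid only if nothing in the loop body overwrites `ub_1`, which is an assumption about the elided code.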





# Optimization Experience

We summarize the common bottleneck causes and optimization strategies.

Bottleneck cause and optimization strategy per operator:

| Operator        | Compute Bound | MTE Bound | Insufficient Parallelism | Inefficient MTE | Inefficient Compute | Speedup (×) |
|-----------------|---------------|-----------|--------------------------|-----------------|---------------------|-------------|
| Add_ReLU        |               | MRT       | RSD                      |                 |                     | 1.72        |
| Depthwise       |               | MRT       | AIS,RUS,PP               | ITG             |                     | 1.26        |
| AvgPool         |               |           |                          |                 | AIP                 | 4.31        |
| Mul             |               |           | RSD                      |                 |                     | 1.34        |
| Conv2D          |               | MRT       | RSD                      |                 |                     | 2.65        |
| FullyConnection |               |           |                          | ITG             |                     | 1.22        |
| MatMul          |               | OF        |                          |                 |                     | 1.10        |
| GeLU            | EA            |           |                          |                 |                     | 1.06        |

RSD and MRT are the strategies from Iterations 1 and 2: reducing spatial dependency and minimizing redundant transfer.

In MobileNetV3 inference, our operator optimizations achieve speedups of **1.06× to 4.31×**.

*More cases can be found in the paper.*




# Evaluation on End-to-End Optimization

Device: Ascend 910 (Training); Ascend 310 (Inference)

Workloads: 100B PanGu-$\alpha$ (Training); MobileNetV3 (Inference)



Training

The ratio of *insufficient parallelism* is reduced by **21.38%**.

The *iteration time* speedup is **2.04x**.

Inference

The ratio of *insufficient parallelism* is reduced by **11.61%**.

The *total time* speedup is **1.21x**.





# Overall Optimization Results

| Type           | Model              | Parameter | Dataset               | #NPUs |
|----------------|--------------------|-----------|-----------------------|-------|
| Vision         | MobileNetV3(M3)    | 5.4M      | ImageNet2012          | 8     |
|                | ResNet50           | 25.6M     |                       |       |
|                | ViT                | 86M       |                       |       |
|                | VGG16              | 138.4M    |                       |       |
| NLP            | BERT               | 110M      | WikiText2             | 8     |
|                | GPT2               | 355M      |                       |       |
| Recommendation | DeepFM             | 16.5M     | Criteo                | 8     |
|                | Wide and Deep(W&D) | 75.84M    |                       |       |
|                | DLRM               | 540M      |                       |       |
| LLM            | Llama 2            | 7B        | WikiText2             | 8     |
|                | PanGu-$\alpha$     | 100B      | 1.1TB Chinese Dataset | 128   |

Our optimizations cover 11 different models.



Computation time speedups range from 1.08× to 2.7×.

Iteration time speedups range from 1.07× to 2.15×.






## Conclusion

1. We propose a component-based roofline model and underutilization analysis to identify the operator bottlenecks on Ascend.
2. Through in-depth operator optimization case studies, we show users how to carry out optimizations.
3. Based on extensive practical optimization experiments, we share our practical insights and valuable experiences.

## Future Work

1. The component-based roofline model can extend to other DSAs like TPU.
2. In-depth studies of the hardware architecture, especially its interaction with software.





# Thanks

Q&A

[yuhangzhou@smail.nju.edu.cn](mailto:yuhangzhou@smail.nju.edu.cn)



Nanjing University

