

 SC19

# Arm's Scalable Vector Extension (SVE)

21 November 2019

- 9:00 – Presentations
  - [5] Welcome and Overview of Arm in HPC
  - [25] Introduction to Programming for Arm SVE
  - [25] Performance Investigation with ArmIE
  - [25] RIKEN Fugaku Processor Simulator
  - [30] A Closer Look at Arm SVE
  - [10] Training Cluster First-touch
- 11:00 – Hands-on
  - [45] Performance Analysis with ArmIE
  - [45] Arm C Language Extensions (ACLE)
  - [30] Hands-on with your codes
- 13:00 – ~Tutorial();

John Linford, Dani Ruiz-Munoz,  
Olly Perks, Francesco Petrogalli,  
Alex Rico, Roxana Rusitoru,  
Jelena Milanovic, et al.

[john.linford@arm.com](mailto:john.linford@arm.com)

# Tutorial Overview

<http://events.arm-hpc.org/sc19-hackathon>

## Goals and Objectives

- Introduce SVE as a tool for enhancing scientific application codes
- Equip prospective SVE programmers with performance engineering tools for SVE
- Found positive working relationships between application developers and SVE experts
- **Have fun!**

## Content Structure



| Topic                                        | Contents                                             |
|----------------------------------------------|------------------------------------------------------|
| Foundations: compilers, libraries, and tools | Open source; commercial                              |
| Intrinsics: ACLE for SVE                     | Types and accessors; code idioms                     |
| Tools: analysis and debugging                | Arm Instruction Emulator (ArmIE); Arm Allinea Studio |

# Cluster access and hands-on materials

<http://events.arm-hpc.org/sc19-hackathon>

# Accessing the Cluster

- ssh student@cluster.arm-hpc.org
  - Password: Tr@ining!
  - A private AWS Graviton instance with all the necessary software installed and configured will be allocated for you.
  - Hands-on materials are installed in \$HOME.
  - Instructions on how to connect to your AWS instance via Forge Remote Client are given when you log in.

```
Last login: Sat Nov 16 14:33:20 on ttys001
~$ ssh student@cluster.arm-hpc.org
student@cluster.arm-hpc.org's password:
Warning: No xauth data; using fake authentication data for X11 forwarding.  
Last login: Sat Nov 16 21:24:40 2019 from 38.109.203.254  
  
http://www.arm-hpc.org  
  
Download materials at https://gitlab.com/arm-hpc/training/arm-sve-tools/-/archive/master/arm-sve-tools-master.tar.gz  
  
sinfo: error: If using PrologFlags=Contain for pam_slurm_adopt, either proctrack/cgroup or proctrack/cray_aries is required. If not using pam_slurm_adopt, please ignore error.  
  
Welcome! Your student ID is 002.  
  
Please Install Arm Forge Remote Client:  
https://developer.arm.com/tools-and-software/server-and-hpc/arm-architecture-tools/downloads/download-arm-forge  
  
===== Arm Forge Remote Launch Settings ======  
Host Name: student002@cluster.arm-hpc.org  
Remote Installation Directory: /opt/arm/forge/19.1.3  
Password: Tr0ining!002  
======  
  
srun: error: If using PrologFlags=Contain for pam_slurm_adopt, either proctrack/cgroup or proctrack/cray_aries is required. If not using pam_slurm_adopt, please ignore error.  
[student002@ip-172-31-17-83 ~]$
```


# Arm and HPC



# CPU Engagement Models with Arm

Arm IP is the basic building block for extraordinary solutions.

## Core License

- Partner licenses a complete microarchitecture.
- CPU differentiation via:
  - Configuration options.
  - Wide implementation envelope with different process technologies.

## Arm IP



## Architecture License

- Partner designs a complete microarchitecture.
- Clean-room design, from scratch.
- Maximum design freedom:
  - Directly address the needs of the target market.
- Arm architecture validation preserves software compatibility.

# Pizza Engagement Models

The basic building blocks for an extraordinary pizza!

## Core License



## IP = Irresistible Pizza



## Architecture License



# High Performance Computing with Arm



# Vanguard Astra by HPE

- 2,592 HPE Apollo 70 compute nodes
  - 5,184 CPUs, 145,152 cores, 2.3 PFLOPS (peak)
- Marvell ThunderX2 Arm SoC, 28 cores, 2.0 GHz
- Memory per node: 128 GB (16 x 8 GB DR DIMMs)
  - Aggregate capacity: 332 TB, 885 TB/s (peak)
- Mellanox IB EDR, ConnectX-5
  - 112 36-port edge switches, 3 648-port spine switches
- Red Hat RHEL for Arm
- HPE Apollo 4520 All-flash Lustre storage
  - Storage Capacity: 403 TB (usable)
  - Storage Bandwidth: 244 GB/s
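
The aggregate figures above can be cross-checked from the per-node numbers. The sketch below is illustrative only: the core counts, clock, and memory per node come from the slide, while the FLOPs per cycle, DDR channel count, and DDR4-2666 speed are assumed ThunderX2 values that the slide does not state.

```python
# Figures quoted on the slide.
NODES = 2592
CPUS = 5184
CORES_PER_CPU = 28
CLOCK_HZ = 2.0e9
MEM_PER_NODE_GB = 128

# Assumptions, NOT on the slide (typical ThunderX2 figures).
DP_FLOPS_PER_CYCLE = 8         # assumed: 2 x 128-bit FMA pipes x 2 DP lanes x 2 FLOPs
DDR_CHANNELS_PER_CPU = 8       # assumed: 8 DDR4 channels per socket
DDR_TRANSFERS_PER_S = 2666e6   # assumed: DDR4-2666
BYTES_PER_TRANSFER = 8         # 64-bit wide memory channel


def astra_totals():
    """Derive system-wide totals from the per-node figures."""
    sockets_per_node = CPUS // NODES
    cores = CPUS * CORES_PER_CPU
    peak_pflops = cores * CLOCK_HZ * DP_FLOPS_PER_CYCLE / 1e15
    mem_tb = NODES * MEM_PER_NODE_GB / 1000
    node_bw_gbs = (sockets_per_node * DDR_CHANNELS_PER_CPU
                   * DDR_TRANSFERS_PER_S * BYTES_PER_TRANSFER / 1e9)
    bw_tbs = NODES * node_bw_gbs / 1000
    return {
        "sockets_per_node": sockets_per_node,
        "cores": cores,
        "peak_pflops": peak_pflops,
        "mem_tb": mem_tb,
        "bw_tbs": bw_tbs,
    }


for name, value in astra_totals().items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the derived totals round to the values quoted on the slide.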



# Isambard system specification

- 10,752 Armv8 cores (168 nodes x 2 sockets x 32 cores)
  - **Cavium ThunderX2, 32 cores, 2.1→2.5 GHz**
- Cray XC50 ‘Scout’ form factor
- High-speed **Aries** interconnect
- Cray HPC optimised software stack
  - CCE, Cray MPI, math libraries, CrayPAT, ...
- **Phase 2 (the Arm part):**
  - Delivered Oct 22nd, handed over Oct 29th
  - Accepted Nov 9th
  - Upgrade to final B2 TX2 silicon, firmware, CPE completed March 15th 2019



# CEA: Deployment by Atos



- 292 **Atos Sequana X1310 compute nodes**
- 584 CPUs, 18,688 cores
- Marvell ThunderX2 **Arm SoC, 32 cores, 2.2 GHz**
- Memory: 8 channels, DDR4-2666, 256 GB
- Mellanox InfiniBand EDR

- ✓ **Peak Performance 329 TFLOPS**
- ✓ **HPL efficiency = 84%**
- ✓ **HPCG = 3.47% of HPL**
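
As a rough illustration of what these ratios imply, the sketch below rebuilds the peak figure from the core count and clock, then applies the HPL and HPCG ratios. Two things are assumptions rather than slide content: 8 double-precision FLOPs per cycle per ThunderX2 core, and reading the HPCG figure as 3.47 percent of the HPL result.

```python
# Figures quoted on the slide.
CPUS = 584
CORES_PER_CPU = 32
CLOCK_HZ = 2.2e9
HPL_EFFICIENCY = 0.84      # HPL result as a fraction of peak

# Assumptions, NOT on the slide.
DP_FLOPS_PER_CYCLE = 8     # assumed: 2 x 128-bit FMA pipes per core
HPCG_OVER_HPL = 0.0347     # assumed: "3.47" is a percentage of HPL


def cea_estimates():
    """Estimate peak, HPL, and HPCG throughput in TFLOPS."""
    cores = CPUS * CORES_PER_CPU
    peak = cores * CLOCK_HZ * DP_FLOPS_PER_CYCLE / 1e12
    hpl = peak * HPL_EFFICIENCY
    hpcg = hpl * HPCG_OVER_HPL
    return cores, peak, hpl, hpcg


cores, peak, hpl, hpcg = cea_estimates()
print(f"cores={cores}, peak={peak:.1f}, HPL={hpl:.1f}, HPCG={hpcg:.2f} TFLOPS")
```

The HPL and HPCG values here are estimates derived from the quoted ratios, not measured results.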



# A64FX: Spec Summary

## Arm SVE, high performance and high efficiency

- DP performance: 2.7+ TFLOPS, >90% of peak on DGEMM
- Memory bandwidth: 1024 GB/s, >80% of peak on STREAM Triad



| A64FX specification     | Value                    |
|-------------------------|--------------------------|
| ISA (base, extension)   | Armv8.2-A, SVE           |
| Process technology      | 7 nm                     |
| Peak DP performance     | 2.7+ TFLOPS              |
| SIMD width              | 512-bit                  |
| Number of cores         | 48 compute + 4 assistant |
| Memory capacity         | 32 GiB (HBM2 x4)         |
| Memory peak bandwidth   | 1024 GB/s                |
| PCIe                    | Gen3, 16 lanes           |
| High-speed interconnect | TofuD integrated         |
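
The table values relate to each other through simple arithmetic, sketched below. The only input not taken from this slide is the assumption of two FMA pipelines per core. The clock frequency is not quoted here, so the sketch derives the minimum clock implied by 2.7 TFLOPS rather than assuming one.

```python
# Figures quoted on the slide.
COMPUTE_CORES = 48
SIMD_BITS = 512
PEAK_DP_TFLOPS = 2.7        # quoted as "2.7+"
PEAK_BW_GBS = 1024
DGEMM_EFFICIENCY = 0.90     # quoted as ">90%"
STREAM_EFFICIENCY = 0.80    # quoted as ">80%"

# Assumption, NOT on the slide.
FMA_PIPES_PER_CORE = 2      # assumed: two 512-bit FMA pipelines per core


def a64fx_derived():
    """Derive per-core throughput, implied clock, and machine balance."""
    dp_lanes = SIMD_BITS // 64
    # Each FMA counts as 2 FLOPs (multiply + add) per lane.
    flops_per_cycle = FMA_PIPES_PER_CORE * dp_lanes * 2
    implied_clock_ghz = PEAK_DP_TFLOPS * 1e12 / (COMPUTE_CORES * flops_per_cycle) / 1e9
    return {
        "flops_per_cycle_per_core": flops_per_cycle,
        "implied_clock_ghz": implied_clock_ghz,
        "dgemm_tflops_floor": PEAK_DP_TFLOPS * DGEMM_EFFICIENCY,
        "stream_gbs_floor": PEAK_BW_GBS * STREAM_EFFICIENCY,
        "bytes_per_flop": PEAK_BW_GBS / (PEAK_DP_TFLOPS * 1000),
    }


for name, value in a64fx_derived().items():
    print(f"{name}: {value:.3f}")
```

The bytes-per-FLOP ratio is the figure most relevant to memory-bound scientific codes, since it indicates how much bandwidth backs each unit of peak compute.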

# NAMD 3 Alpha, Tesla V100 Performance, ApoA1 92k Atoms

| Host CPU         | GPU           | ns/day | Speedup |
|------------------|---------------|--------|---------|
| Intel Xeon 8168  | 1x Tesla V100 | 102.1  | 1.0x    |
| IBM Power9       | 1x Tesla V100 | 99.7   | 0.97x   |
| Cavium ThunderX2 | 1x Tesla V100 | 98.2   | 0.96x   |
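
The speedup column is each platform's ns/day divided by the Intel Xeon baseline. A minimal sketch of that calculation follows; the dictionary layout and function name are illustrative, and the truncation to two decimals is an assumption chosen because it reproduces the values in the table.

```python
import math

# ns/day figures from the table, each with 1x Tesla V100.
NS_PER_DAY = {
    "Intel Xeon 8168": 102.1,
    "IBM Power9": 99.7,
    "Cavium ThunderX2": 98.2,
}
BASELINE = "Intel Xeon 8168"


def speedups(results, baseline):
    """Return throughput of each platform relative to the baseline."""
    ref = results[baseline]
    return {name: value / ref for name, value in results.items()}


def truncate2(x):
    """Truncate (not round) to two decimals; assumed to match the table."""
    return math.floor(x * 100) / 100


for name, s in speedups(NS_PER_DAY, BASELINE).items():
    print(f"{name}: {s:.4f} (truncated: {truncate2(s)}x)")
```

All three hosts land within about 4% of each other, which is consistent with the GPU doing most of the work in this configuration.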

## NAMD 3 Alpha Note:

These measurements ran the new NAMD 3 Alpha in single-node mode, which yields the greatest overall GPU acceleration for small to moderately sized atomic structure simulations.



Arm + NVIDIA: preliminary results


# Let's Get Started!

