

**Dr. James Price**

University of Bristol /  
GW4 Alliance



# **Isambard: tales from the world's first Arm-based production supercomputer**

# 'Isambard' is a new UK Tier 2 HPC service from GW4



EPSRC

**CRAY**  
THE SUPERCOMPUTER COMPANY

**ARM**






Isambard Kingdom Brunel  
1806-1859

**GW4**

# The tiered model of HPC provision

**Tier 0:** international



**Tier 1:** national



**Tier 2:** regional

Tier 2 HPC centres: Edinburgh, Cambridge, UCL, Loughborough, Bristol, Oxford

**Tier 3**



# Isambard system specification

- 10,752 Armv8 cores (168 x 2 x 32)
  - **Cavium ThunderX2 32 core 2.1GHz**
- Cray XC50 Scout form factor
- High-speed **Aries** interconnect
- Cray HPC optimised software stack
  - CCE, CrayPAT, Cray MPI, math libraries, ...
- **Technology comparison:**
  - **x86, Xeon Phi, Pascal GPUs**
- Phase 1 installed March 2017
- Phase 2 (the Arm part) currently in bring-up
- £4.7m total project cost over 3 years



# Cavium ThunderX2, a seriously beefy CPU

- 32 cores at up to 2.5GHz
- Each core is 4-way superscalar, Out-of-Order
- 32KB L1, 256KB L2 per core
- Shared 32MB L3
- Dual 128-bit wide NEON vectors
  - Compared to Skylake's 512-bit vectors, and Broadwell's 256-bit vectors
- 8 channels of 2666MHz DDR4
  - Compared to 6 channels on Skylake, 4 channels on Broadwell
  - AMD's EPYC also has 8 channels



# Benchmarking platforms

| Processor        | Cores  | Clock speed (GHz) | TDP (W) | FP64 (TFLOP/s) | Bandwidth (GB/s) |
|------------------|--------|-------------------|---------|----------------|------------------|
| Broadwell        | 2 × 22 | 2.2         | 145       | 1.55         | 154            |
| Skylake Gold     | 2 × 20 | 2.4         | 150       | 3.07         | 256            |
| Skylake Platinum | 2 × 28 | 2.1         | 165       | 3.76         | 256            |
| ThunderX2        | 2 × 32 | 2.2         | 175       | 1.13         | 320            |

- **BDW 22c** Intel Broadwell E5-2699 v4, \$4,115 each (near top-bin)
- **SKL 20c** Intel Skylake Gold 6148, \$3,078 each
- **SKL 28c** Intel Skylake Platinum 8176, \$8,719 each (near top-bin)
- **TX2 32c** Cavium ThunderX2, **\$1,795 each** (near top-bin)
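The FP64 column of the table can be reproduced from core counts and clock speeds. The sketch below is illustrative only: it assumes two FMA pipes per core at each architecture's vector width (256-bit for Broadwell, 512-bit for Skylake, 128-bit NEON for ThunderX2), which the slides do not state explicitly, and the helper names are hypothetical.

```python
# Peak FP64 throughput of a dual-socket node, in TFLOP/s.
# Assumption (not stated in the slides): every core has two FMA pipes
# operating at the architecture's full vector width.

def fp64_flops_per_cycle(vector_bits, fma_pipes=2):
    """FP64 lanes per vector * 2 FLOPs per FMA * number of FMA pipes."""
    return (vector_bits // 64) * 2 * fma_pipes


def peak_tflops(sockets, cores, ghz, flops_per_cycle):
    """sockets * cores * clock (GHz) * FLOPs/cycle, converted to TFLOP/s."""
    return sockets * cores * ghz * flops_per_cycle / 1000.0


NODES = {
    # name: (sockets, cores per socket, clock in GHz, vector width in bits)
    "Broadwell": (2, 22, 2.2, 256),
    "Skylake Gold": (2, 20, 2.4, 512),
    "Skylake Platinum": (2, 28, 2.1, 512),
    "ThunderX2": (2, 32, 2.2, 128),
}


def node_peaks():
    """Return the peak FP64 TFLOP/s for every node in NODES."""
    return {
        name: peak_tflops(s, c, ghz, fp64_flops_per_cycle(bits))
        for name, (s, c, ghz, bits) in NODES.items()
    }


if __name__ == "__main__":
    for name, tflops in node_peaks().items():
        print(f"{name:17s} {tflops:.2f} TFLOP/s")
```

Under these assumptions the results round to the table's values, which shows that ThunderX2's lower peak comes from its narrower vectors rather than its core count.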

# Key architectural comparisons (node-level, dual socket)



# Isambard's core mission: deploying Arm in production HPC

Starting by porting/benchmarking/optimizing codes from the top 10 most heavily used on Archer:

- **VASP, CASTEP, GROMACS, CP2K, UM, NAMD, Oasis, SBLI, NEMO**
- Most of these codes are written in FORTRAN

Additional important codes for project partners:

- **OpenFOAM, OpenIFS, WRF, CASINO, LAMMPS, ...**

# Performance on heavily used applications from Archer



# Performance summary

- Performance is competitive with contemporary Intel processors
  - ThunderX2 is **faster** when memory bandwidth is critical
  - ThunderX2 is **slower** when FLOP/s and L1 cache bandwidth matter
  - Even in the worst case, performance drops only ~30% versus Broadwell
- Next-gen Arm CPUs will increase FLOP/s + cache bandwidth
  - Introduction of SVE will allow vector widths of up to 2048 bits
  - Fujitsu A64FX chip unveiled recently with 512-bit SVE
  - Expecting 512 bits to be a common choice for server chips
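One way to build intuition for the bandwidth-versus-FLOP/s trade-off is a simple roofline estimate. This is not the method used for the results in this talk; it is a rough sketch that takes the node-level peak figures from the benchmarking platforms table and assumes attainable performance is the smaller of peak compute and arithmetic intensity times peak bandwidth. The function names are hypothetical.

```python
# Roofline-style estimate: attainable GFLOP/s = min(peak, intensity * bandwidth).
# Peak values are the node-level figures from the benchmarking platforms table.

NODES = {
    # name: (peak FP64 GFLOP/s, peak memory bandwidth GB/s)
    "Broadwell": (1550.0, 154.0),
    "Skylake Gold": (3070.0, 256.0),
    "Skylake Platinum": (3760.0, 256.0),
    "ThunderX2": (1130.0, 320.0),
}


def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Upper bound on GFLOP/s for a kernel with the given FLOPs per byte."""
    return min(peak_gflops, intensity * bandwidth_gbs)


def fastest_node(intensity):
    """Name of the node with the highest roofline bound at this intensity."""
    return max(NODES, key=lambda n: attainable_gflops(intensity, *NODES[n]))


if __name__ == "__main__":
    for intensity in (0.25, 1.0, 4.0, 16.0):
        print(f"{intensity:5.2f} FLOP/byte -> {fastest_node(intensity)}")
```

In this model ThunderX2 leads at low arithmetic intensity, where memory bandwidth is the limit, and the Skylake parts lead once kernels become compute-bound.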

Compiler used for each benchmark on each platform:

| Benchmark  | ThunderX2 | Broadwell | Skylake  |
|------------|-----------|-----------|----------|
| STREAM     | Arm 18.3  | Intel 18  | CCE 8.7  |
| CloverLeaf | CCE 8.7   | Intel 18  | Intel 18 |
| TeaLeaf    | CCE 8.7   | GCC 7     | Intel 18 |
| SNAP       | CCE 8.6   | Intel 18  | Intel 18 |
| Neutral    | GCC 8     | Intel 18  | GCC 7    |
| CP2K       | GCC 8     | GCC 7     | GCC 7    |
| GROMACS    | GCC 8     | GCC 7     | GCC 7    |
| NAMD       | Arm 18.2  | GCC 7     | GCC 7    |
| NEMO       | CCE 8.7   | CCE 8.7   | CCE 8.7  |
| OpenFOAM   | GCC 7     | GCC 7     | GCC 7    |
| OpenSBLI   | CCE 8.7   | Intel 18  | CCE 8.7  |
| UM         | CCE 8.6   | CCE 8.5   | CCE 8.7  |
| VASP       | GCC 7.2   | Intel 18  | Intel 18 |

## Comparison of compilers on Arm

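Performance of each compiler relative to the best one for that code; BUILD and CRASH mark build failures and runtime crashes. The same kinds of issues occur on x86.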

|               | GCC  | Arm   | CCE   |
|---------------|------|-------|-------|
| STREAM        | 97%  | 100%  | 99%   |
| CloverLeaf    | 92%  | 95%   | 100%  |
| TeaLeaf       | 99%  | 95%   | 100%  |
| SNAP          | 74%  | 87%   | 100%  |
| Neutral       | 100% | 94%   | 85%   |
| CP2K          | 100% | BUILD | CRASH |
| GROMACS       | 100% | 91%   | CRASH |
| NAMD          | 83%  | 100%  | BUILD |
| NEMO          | -    | -     | 100%  |
| OpenFOAM      | 100% | 97%   | BUILD |
| OpenSBLI      | -    | -     | 100%  |
| Unified Model | 84%  | 72%   | 100%  |
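A table like the one above can be produced by normalizing each compiler's result to the best one for that code. The sketch below assumes that "performance" means the best runtime divided by each compiler's runtime, which the slides do not spell out; the runtimes in the example are made up purely for illustration, and the function name is hypothetical.

```python
# Normalize per-compiler runtimes to the fastest compiler for one code.
# Assumption: relative performance = best runtime / this compiler's runtime.

def relative_performance(runtimes):
    """Map each compiler to a percentage of the best, or pass through a status.

    `runtimes` maps compiler name -> runtime in seconds, or a status string
    such as "BUILD" or "CRASH" when no timing is available.
    """
    timings = {c: t for c, t in runtimes.items() if isinstance(t, (int, float))}
    if not timings:
        return dict(runtimes)  # nothing ran, so keep the status strings
    best = min(timings.values())
    return {
        c: f"{round(100 * best / t)}%" if c in timings else t
        for c, t in runtimes.items()
    }


if __name__ == "__main__":
    # Hypothetical runtimes, not measurements from Isambard.
    example = {"GCC": 40.0, "Arm": 50.0, "CCE": "CRASH"}
    print(relative_performance(example))
```

With the made-up inputs shown, GCC is the baseline at 100%, the Arm compiler scores 80%, and the CCE entry keeps its failure status.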

# Enabling co-design of future architectures with cycle-accurate simulation



- We've developed a new configurable, cycle-accurate simulator in Bristol
- Accuracy within ~5-10% of TX2 hardware
- Highly configurable to almost any design of HPC CPU:
  - Planning a Post-K / A64FX version
  - Already supports SVE binaries
- Plan future support for x86, RISC-V, co-processors, ...

# Conclusions

- Results show **ThunderX2 performance is competitive with current high-end server CPUs**, while **performance per dollar is compelling**
- **The software tools ecosystem is already in good shape**
- The full Isambard XC50 Arm system is coming up now; we're aiming to have early results to share at SC18
- The signs are that **Arm-based systems are now real alternatives for HPC**, reintroducing much needed competition to the market

# For more information

## **Comparative Benchmarking of the First Generation of HPC-Optimised Arm Processors on Isambard**

S. McIntosh-Smith, J. Price, T. Deakin and A. Poenaru, CUG 2018, Stockholm

<http://uob-hpc.github.io/2018/05/23/CUG18.html>

**Bristol HPC group:**

<https://uob-hpc.github.io/>

**Isambard:**

<http://gw4.ac.uk/isambard/>

**Build and run scripts:**

<https://github.com/UoB-HPC/benchmarks>

# Backup

# Comparing performance per Dollar

- Hard to do this rigorously
  - RRP is not what anyone pays
  - Whole system cost has to be taken into account
  - Purchase price vs. TCO
- However, we *can* form some useful intuition
  - The following charts were generated by taking the performance results, dividing by the official published list prices of the CPUs only, then renormalizing to Broadwell
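The normalization described above can be sketched as follows. The list prices are the ones quoted on the benchmarking platforms slide. Because the application results themselves are not reproduced here, the example uses peak memory bandwidth from the same slide as a stand-in for performance; that substitution is an assumption made only to have concrete numbers, and the function name is hypothetical.

```python
# Performance per dollar, renormalized so that the baseline CPU scores 1.0.

LIST_PRICE_USD = {
    "Broadwell": 4115,
    "Skylake Gold": 3078,
    "Skylake Platinum": 8719,
    "ThunderX2": 1795,
}

# Stand-in "performance": node-level peak bandwidth in GB/s (an assumption;
# the charts in the talk use measured application results instead).
# Bandwidth is per node and price is per CPU, but every node has two
# sockets, so the factor of two cancels after normalization.
STAND_IN_PERF = {
    "Broadwell": 154.0,
    "Skylake Gold": 256.0,
    "Skylake Platinum": 256.0,
    "ThunderX2": 320.0,
}


def perf_per_dollar(perf, price, baseline="Broadwell"):
    """Divide performance by list price, then renormalize to the baseline."""
    raw = {cpu: perf[cpu] / price[cpu] for cpu in perf}
    return {cpu: value / raw[baseline] for cpu, value in raw.items()}


if __name__ == "__main__":
    for cpu, score in perf_per_dollar(STAND_IN_PERF, LIST_PRICE_USD).items():
        print(f"{cpu:17s} {score:.2f}x")
```

Swapping `STAND_IN_PERF` for measured results gives per-application figures; the caveats above about list prices and whole-system cost still apply.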

# Performance per Dollar for applications

