

COL380

Introduction to  
Parallel & Distributed Programming

# Agenda

- Memory bottlenecks
  - and some solutions
- Instruction latency and overlap
- Core organization
- Inter-communication

## Transistor count



Data source: Wikipedia ([wikipedia.org/wiki/Transistor_count](https://en.wikipedia.org/w/index.php?title=Transistor_count&oldid=1000000000))

OurWorldinData.org – Research and data to make progress against the world's largest problems

Licensed under CC-BY by the authors Hannah Ritchie and Max Roser.

- Can't clock faster  $\Rightarrow$  Do more per clock
  - ➔ Execute many simple instructions on many cores
  - ➔ Simpler operations are more general
  - ➔ Complex operations require hardware coordination across the chip
- Not just compute more things
  - ➔ Focus may be on parallelizing data access (Memory, IO)
  - ➔ Multiple processors can access memory in parallel, but may disrupt each other's caches

Software orchestration

## Application Areas

- Weather/Climate simulation: 3D-grid, Long duration simulation
- Energy modeling: Fusion energy
- Data science: Filter, Join, Cross, Sort
- Financial processing: Market prediction, Investing, Blockchain
- Computational biology: Drug design, Gene sequencing, Vaccines
- Distributed Service: DB and File systems, Traffic network, Archive

# Supercomputer



Today: “Fastest supercomputer in the world”  
[HPL Rmax > 1 EF: $1.102 \times 10^{18}$ FLOP/s]  
Nodes: 9,472  
Cores: 8,730,112 (64c CPU, 4x GPU/node)  
Memory: 4.6 PB of DRAM memory + Flash  
Interconnect: Multiple 100 GB/s NIC  
Racks: 74 cabinets  
Space: 7300 SqFt  
Power consumption: 21.1 MW

**FRONTIER**
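A few derived figures make the scale easier to read. The sketch below only divides the numbers quoted on the slide; it assumes Rmax is 1.102 EFLOP/s and treats PB and MW as decimal units. The variable names are illustrative.

```python
# Figures quoted on the slide (Rmax read as 1.102 EFLOP/s: an assumption).
RMAX_FLOPS = 1.102e18      # HPL Rmax in FLOP/s
POWER_WATTS = 21.1e6       # 21.1 MW
NODES = 9_472
MEMORY_BYTES = 4.6e15      # 4.6 PB of DRAM (decimal petabytes assumed)

# Energy efficiency: sustained FLOP/s delivered per watt.
gflops_per_watt = RMAX_FLOPS / POWER_WATTS / 1e9

# Average share of the machine that a single node contributes.
tflops_per_node = RMAX_FLOPS / NODES / 1e12
memory_per_node_gb = MEMORY_BYTES / NODES / 1e9

print(f"Efficiency:      {gflops_per_watt:.1f} GFLOP/s per watt")
print(f"Rmax per node:   {tflops_per_node:.1f} TFLOP/s")
print(f"Memory per node: {memory_per_node_gb:.0f} GB")
```

Under these assumptions the machine delivers roughly 52 GFLOP/s per watt and a little over 100 TFLOP/s per node.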

## Arm CPU

A64FX: 48 core CPU with 512-bit SIMD

Peak Flops: 3379G (DP) [70.4G/core]

Memory BW: 1 TB/s

Network: 6D Torus [68GB/s x2]

## Intel CPU

2.3 GHz x40 cores (~3000 GFlop)  
+ 2x AVX-512 FMA units

Maximum Memory Speed: 3200 MHz

Memory Channels: 8

Memory bandwidth: ~200 GB/s
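The peak figures on the CPU slides can be reproduced from core count, clock, and vector width. This is a minimal sketch, not a vendor formula: it assumes each 512-bit FMA unit retires 8 double-precision FMAs (16 FLOP) per cycle, that the A64FX runs at 2.2 GHz with two FMA pipes per core (chosen to be consistent with the quoted 70.4G/core), and that "3200 MHz" means 3200 mega-transfers per second on 8-byte-wide channels. Function names are illustrative.

```python
def fma_flop_per_cycle(fma_units: int, vector_bits: int = 512) -> int:
    """FLOP per cycle per core: each FMA lane counts as 2 FLOP (multiply + add)."""
    doubles_per_vector = vector_bits // 64
    return fma_units * doubles_per_vector * 2


def peak_gflops(cores: int, clock_mhz: int, flop_per_cycle: int) -> float:
    """Peak double-precision GFLOP/s for the whole chip."""
    return cores * clock_mhz * flop_per_cycle / 1000.0


def dram_bandwidth_gbs(channels: int, mega_transfers: int, bytes_per_transfer: int = 8) -> float:
    """Peak DRAM bandwidth in GB/s (8 bytes per transfer is an assumption)."""
    return channels * mega_transfers * bytes_per_transfer / 1000.0


# Intel CPU from the slide: 40 cores at 2.3 GHz with 2 AVX-512 FMA units.
intel_peak = peak_gflops(40, 2300, fma_flop_per_cycle(2))
intel_bw = dram_bandwidth_gbs(channels=8, mega_transfers=3200)

# A64FX: 48 cores; 2.2 GHz and 2 FMA pipes are assumptions (see lead-in).
a64fx_peak = peak_gflops(48, 2200, fma_flop_per_cycle(2))

print(f"Intel peak: {intel_peak:.0f} GFLOP/s, DRAM: {intel_bw:.1f} GB/s")
print(f"A64FX peak: {a64fx_peak:.1f} GFLOP/s ({a64fx_peak / 48:.1f} per core)")
```

With these inputs the Intel part comes out just under the slide's ~3000 GFlop and ~200 GB/s, and the A64FX matches the quoted 3379G.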

## Nvidia GPU

1000+ Cores: 9.7 TF (DP)

19.5 TF with Tensor Core

GPU Memory Bandwidth: 1.6 TB/s

Network: NVLink 600 GB/s
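One way to compare the three processors is their machine balance: bytes of memory bandwidth available per double-precision FLOP at peak. The sketch below uses only the figures quoted on these slides (rounded as given); the metric itself and the names used are illustrative additions, not part of the slides.

```python
# (peak DP GFLOP/s, memory bandwidth GB/s) as quoted on the slides.
systems = {
    "Arm A64FX": (3379.0, 1000.0),
    "Intel CPU": (3000.0, 200.0),
    "Nvidia GPU": (9700.0, 1600.0),   # 9.7 TF DP, without Tensor Cores
}


def bytes_per_flop(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Memory bytes the chip can move for every FLOP it can execute."""
    return bandwidth_gbs / peak_gflops


balance = {name: bytes_per_flop(*spec) for name, spec in systems.items()}

for name, value in balance.items():
    print(f"{name:11s} {value:.3f} bytes/FLOP")
```

All three land well below one byte per FLOP, so a kernel has to reuse each loaded operand many times to approach peak.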

# Toy Supercomputer



Cray XMP-1

The New York Times

## *India and U.S. Agree On Supercomputer Sale*

By Steven R. Weisman, Special To the New York Times  
Oct. 9, 1987

USD 20 MILLION

4x 64-bit Vector processor, ~117 MHz  
400 MFLOPS (peak)  
128 MB RAM

Arduino



Jetson Tx

472 GF
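The comparison on this slide can be put into numbers. The sketch below uses the quoted 400 MFLOPS and USD 20 million for the 1987 machine and 472 GF for the Jetson; note that the slide does not state the precision behind the 472 GF figure, so the ratio is indicative only.

```python
CRAY_PEAK_FLOPS = 400e6      # 400 MFLOPS (peak), 64-bit vector processors
CRAY_PRICE_USD = 20e6        # USD 20 million (1987)
JETSON_PEAK_FLOPS = 472e9    # 472 GF as quoted; precision not stated on the slide

# How many 1987 supercomputers one embedded board replaces at peak.
peak_ratio = JETSON_PEAK_FLOPS / CRAY_PEAK_FLOPS

# What one MFLOPS of peak performance cost in 1987.
usd_per_mflops_1987 = CRAY_PRICE_USD / (CRAY_PEAK_FLOPS / 1e6)

print(f"Jetson / Cray peak ratio: {peak_ratio:.0f}x")
print(f"1987 price per MFLOPS:    USD {usd_per_mflops_1987:,.0f}")
```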

Subodh Kumar

# Supercomputer Node



# Modern Multi-Processor



# Memory

Memory  
Controller



- Fused Multiply Add
  - Double precision: $A += B * C$
  - 2 FLOPs, 3+1 operands (3 reads + 1 write, 32 bytes)
- Example
  - 2x AVX512 on Intel core = 32 FLOP/cycle
  - 96 GF/core @3GHz
    - ▶ Needs 1536 GB/s memory bandwidth/core
  - Compare DDR4: Throughput ~25GB/s, Latency ~10ns
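The 1536 GB/s figure follows directly from the numbers above. The sketch below retraces the arithmetic under the slide's worst-case assumption that every operand of every FMA comes from memory (no register or cache reuse). Variable names are illustrative.

```python
BYTES_PER_DOUBLE = 8
FLOP_PER_FMA = 2                 # one multiply + one add
OPERANDS_PER_FMA = 3 + 1         # read A, B, C; write A

FLOP_PER_CYCLE = 32              # 2x AVX-512 FMA units per core
CLOCK_GHZ = 3
DDR4_GBS = 25                    # DDR4 throughput quoted on the slide

bytes_per_fma = OPERANDS_PER_FMA * BYTES_PER_DOUBLE      # 32 bytes
peak_gflops = FLOP_PER_CYCLE * CLOCK_GHZ                 # 96 GF per core
fma_per_ns = peak_gflops // FLOP_PER_FMA                 # 48 FMAs every nanosecond
required_gbs = fma_per_ns * bytes_per_fma                # bytes per ns == GB/s

shortfall = required_gbs / DDR4_GBS

print(f"Required: {required_gbs} GB/s per core")
print(f"DDR4 supplies {DDR4_GBS} GB/s, about {shortfall:.0f}x too little")
```

The gap of roughly 60x per core is what the rest of this part of the lecture tries to close.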

# Memory Bottleneck

Latency can be hidden

$$\begin{aligned}A[i] &+= B[i]*C[i] \\A[i+1] &+= B[i+1]*C[i+1] \\A[i+2] &+= B[i+2]*C[i+2] \\A[i+3] &+= B[i+3]*C[i+3]\end{aligned}$$
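The four updates above are independent, so their memory accesses can overlap. How many accesses must be in flight to hide latency can be estimated with Little's law (concurrency = bandwidth x latency). Applying Little's law here is an addition to the slide, and the 64-byte cache line is an assumption; the bandwidth and latency figures are the ones quoted for DDR4 and for the FMA example.

```python
import math

CACHE_LINE_BYTES = 64            # assumption: typical cache-line size


def bytes_in_flight(bandwidth_gbs: int, latency_ns: int) -> int:
    """Little's law: GB/s * ns = bytes that must be outstanding at any time."""
    return bandwidth_gbs * latency_ns


def lines_in_flight(bandwidth_gbs: int, latency_ns: int) -> int:
    """Outstanding cache-line requests needed to sustain the bandwidth."""
    return math.ceil(bytes_in_flight(bandwidth_gbs, latency_ns) / CACHE_LINE_BYTES)


# Saturating DDR4 (~25 GB/s, ~10 ns) needs only a few outstanding lines.
ddr4_lines = lines_in_flight(25, 10)

# Feeding one core at 1536 GB/s with the same latency needs far more.
core_lines = lines_in_flight(1536, 10)

print(f"DDR4 saturated with {ddr4_lines} cache lines in flight")
print(f"1536 GB/s would need {core_lines} cache lines in flight")
```

Overlap hides latency, but it does not create bandwidth: the second figure only helps if the memory system can actually deliver that rate.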

Caches can help with throughput  
but working set can be large

\* Must be used wisely

# Memory Gap Mitigation

## AMD AI Engine



## High Bandwidth Memory



~600 GB/site

# nVIDIA GA100 GPU



# Flynn's Classification



- A number of instruction threads
- A number of data threads
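The two counts above determine the class. Flynn's taxonomy usually calls them instruction streams and data streams and names the four combinations SISD, SIMD, MISD, and MIMD. The helper below is an illustrative sketch of that mapping; the function name is hypothetical.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Map stream counts to Flynn's class: S = single, M = multiple."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("need at least one instruction and one data stream")
    instr = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return f"{instr}I{data}D"


print(flynn_class(1, 1))    # scalar core
print(flynn_class(1, 8))    # one instruction applied to 8 data lanes
print(flynn_class(4, 4))    # four cores, each with its own instructions and data
```

A 512-bit vector unit working on 8 doubles corresponds to the second call, and a multi-core CPU to the third.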

- Parallel computers
- Memory bottlenecks
- Parallel computer Organization