



# **Understanding Supercomputing and HPC Architectures and Programming Models**

**by: Tassadaq Hussain**

**Director Centre for AI and BigData  
Professor Electrical Engineering Department  
Namal University Mianwali**

## **Collaborations:**

**Barcelona Supercomputing Center, Spain**

**European Network on High Performance and Embedded Architecture and Compilation**

**Pakistan Supercomputing Center**

- **Importance of HPC**
- Types of Processing System
- HPC System Architecture
- Supercomputing Classes and Infrastructure
- Cluster Software Stacks
- Programming Models
- Demonstration (Supercomputing System)

# 120 Years of Moore's Law





Life Science



Earth Science



Social Science

Science

175 ZByte @2025

80%  
Data-Sciences

Data

100 ExaFLOPS  
@2020

87.04 B\$  
234.6 B\$ @2025

AI

Top500 List  
8 PetaFLOPS  
@2022

uProcessor  
100 B\$ @2020  
30% Cell Phone  
20% Embedded  
App  
50 Servers, PCs etc.

Computing

Digital Industrial Age  
5.5 Trillion \$ Revenue@2021





# High-Performance Computing Milestones (1960–2019)





## EXPECTATIONS



As of July 2022

Plateau will be reached: ○ <2 yrs. ● 2–5 yrs. ■ 5–10 yrs. ▲ >10 yrs. ✖ Obsolete before plateau

- Importance of HPC
- **Types of Processing System**
- HPC System Architecture
- Supercomputing Classes and Infrastructure
- Cluster Software Stacks
- Programming Models
- Demonstration (Supercomputing System)

# Basic Processor Architecture

- A central processing unit (CPU) gets instructions and/or data from memory, decodes the instructions and then **sequentially** performs them.
- Memory is used to store both program and data instructions
  - Program instructions are coded data which tell the computer to do something
  - Data is simply information to be used by the program



# Information and Computer



# Generations of microprocessors:

## **First-generation –**

From 1971 to 1972 the era of the first generation came which brought microprocessors like INTEL 4004 Rockwell international PPS-4 INTEL 8008 etc.

## **Second generation –**

The second generation marked the development of 8-bit microprocessors from 1973 to 1978. Processors like INTEL 8085 Motorola 6800 and 6801 etc came into existence.

## **Third generation –**

The third generation brought forward the 16-bit processors like INTEL 8086/80186/80286 Motorola 68000 68010 etc. From 1979 to 1980 this generation used the HMOS technology.

## **Fourth generation –**

The fourth-generation came into existence from 1981 to 1995. The 32-bit processors using HMOS fabrication came into existence. INTEL 80386 and Motorola 68020 are some of the popular processors of this generation.

## **Fifth-generation –**

From 1995 till now we are in the fifth generation. 64-bit processors like PENTIUM, Celeron, dual, quad, and octa-core processors came into existence.

# Processors for Supercomputers

## Microprocessor development directions:

- Increasing of clock frequency and speed instruction stream processing
- Processing of large collection of data in single processor instruction - SIMD
- Control path multiplication – multi threading

## •RISC processors

- MIPS
- IBM Power4

## •Pipeline Processor

## •AlphaVector processors

- NEC SX-6
- Cray (Cray X1)

## •CISC processors

- IA32
- AMD x86-64

## •VLIW processors

- IA64

## •Multi-core Processor

## •GPU

## •FPGA

- SRAM Based

- Importance of HPC
- Types of Processing System
- **HPC System Architecture (Parallel Processors)**
- Supercomputing Classes and Infrastructure
- Cluster Software Stacks
- Programming Models
- Demonstration (Supercomputing System)

# HPC System

- Multi/Many Processor System Architecture
- Computer Architecture wrt Memory Arrangement
- Shared Storage Solutions
- Network Configuration

# Flynn's Classical Taxonomy

- There are different ways to classify parallel computers. One of the more widely used classifications, in use since 1966, is called Flynn's Taxonomy.
- Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be classified along the two independent dimensions of ***Instruction*** and ***Data***.
- Each of these dimensions can have only one of two possible states: ***Single*** or ***Multiple***.

# Processor Architectures

SISD



SIMD



MISD



MIMD



# Computer Architecture wrt Memory Arrangement

Shared-memory

Distributed-memory

Distributed Shared-memory



# Shared Memory

- Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as global address space.



- Multiple processors can operate independently but share the same memory resources.
- Changes in a memory location effected by one processor are visible to all other processors.
- Shared memory machines can be divided into two main classes based upon memory access times: **UMA** and **NUMA**.

# Distributed Memory

- Like shared memory systems, distributed memory systems vary widely but share a common characteristic. Distributed memory systems require a communication network to connect inter-processor memory.
- Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of global address space across all processors.
- Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply.
- When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility.
- The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet.



# Hybrid Distributed-Shared Memory

- The largest and fastest computers in the world today employ both shared and distributed memory architectures.



- The shared memory component is usually a cache coherent SMP machine. Processors on a given SMP can address that machine's memory as global.
- The distributed memory component is the networking of multiple SMPs. SMPs know only about their own memory - not the memory on another SMP. Therefore, network communications are required to move data from one SMP to another.
- Current trends seem to indicate that this type of memory architecture will continue to prevail and increase at the high end of computing for the foreseeable future.
- Advantages and Disadvantages: whatever is common to both shared and distributed memory architectures.

# Types: Memory Access

Unified Memory Access (UMA)

Non Unified Memory Access (NUMA)

- Cache-Coherent NUMA (ccNUMA)
- Distributed Memory (DM) / MPI
- In-Memory Computing
- NUMA Hybrid Architectures



*Cost of computer storage since the 1950s in dollars (unadjusted) per terabyte*

# Shared Disk Storage

- DAS (Direct Attached Storage)
- NAS (Network Attached Storage)
- **SAN (Storage Area Network)**
- **Parallel File System Storage**
- Object Storage:
- Tiered Storage: SSD and HDD
- Ceph : Object, Blocks, File Storage



# Comparison: SAN, NAS, PFS

| Criteria             | Storage Area Network (SAN)                  | Network Attached Storage (NAS)         | Parallel File System                                  |
|----------------------|---------------------------------------------|----------------------------------------|-------------------------------------------------------|
| Access Type          | Block-level access                          | File-level access                      | File-level access with parallel capabilities          |
| Performance          | High-performance, low-latency               | Varies based on NAS solution           | High-performance, scalable                            |
| Workload Flexibility | Well-suited for demanding workloads         | Suitable for general workloads         | Designed for high-performance computing               |
| Scalability          | Scalable, can handle growth                 | Varies based on NAS solution           | Highly scalable and optimized for large clusters      |
| Complexity           | Can be complex to set up and manage         | Relatively simple to set up and manage | Can be complex to optimize for specific workloads     |
| Isolation            | Can provide isolated storage access         | Shared access across clients           | Shared access, optimized for parallelism              |
| Hardware Requirement | Requires dedicated SAN hardware (FC)        | NAS devices with Ethernet connectivity | Requires specialized parallel file system software    |
| Cost                 | Higher cost due to dedicated infrastructure | Generally cost-effective               | Costs can vary based on solution and scale            |
| Management           | May require specialized expertise           | Generally easier to manage             | Requires expertise in parallel file system management |
| Protocols Supported  | Typically Fibre Channel, iSCSI, etc.        | NFS, SMB, other file protocols         | NFS, Lustre, GPFS, BeeGFS, etc.                       |
| Use Cases            | High-performance applications               | General-purpose file sharing           | High-performance computing, data-intensive tasks      |

# Network Communication Protocols

InfiniBand is a high-speed networking technology primarily designed for low-latency, high-bandwidth communication within data centers and high-performance computing (HPC) environments.

InfiniBand excels in short to medium-range communication within a data center or between closely located computing clusters.



# RDMA and HCA

Remote Direct Memory Access (RDMA) and Host Channel Adapter (HCA) allows computers of Cluster to access each other's memory directly without involving the host CPU.

It executes **data offloading** and **data transfer tasks** from the host to dedicated network adapters that support **RDMA**, such as **InfiniBand or RoCE (RDMA over Converged Ethernet)**.

This bypasses the need for the usual network stack processing, resulting in reduced latency and improved data throughput.





# Network Security

Soft Firewall: Zentyal, pfSense, IPFire, SmoothWall, ClearOS, and Iptables.



- Importance of HPC
- Types of Processing System
- HPC System Architecture
- **Supercomputing Classes and Infrastructure**
- Cluster Software Stacks
- Programming Models
- Demonstration (Supercomputing System)

# What is a Supercomputer?

*“A supercomputer is a device for turning compute-bound problems into I/O-bound problem”* - Seymour Cray

A supercomputer is a computer system that leads the world in terms of processing capacity, particularly speed of calculations, at the time of its introduction. (Wikipedia)

# Basic Arch of Supercomputer



# Basic Arch of Supercomputing



# Basic Introduction of Supercomputing



# Constituents

- Processing
- Software
- Programming Models
- Storage
- Network
- Accessibility
- Power



# Basic Introduction of Supercomputing

- Performance
- Programmability
- Portability
- Scalability
- Accessibility
- Usability
- Power
- Cost



FLOPS = Floating Point Operation Per Second  
KFLOPS =  $10^3$   
= One Thousand Computation Per Second  
=  $12.00 * 1212.222 * 21212 + 232323 \dots$

MFLOPS =  $10^6$  Million Computation Per Second  
GFLOPS =  $10^9$  Billion  
TFLOPS =  $10^{12}$  Trillion  
PFLOPS =  $10^{15}$  Quadrillion  
EFLOPS =  $10^{18}$  Quintillion  
ZFLOPS =  $10^{21}$  Sextillion

# History of Supercomputers

1945-50 - Manchester Mark I

1950-55 - MIT Whirlwind

1955-60 - IBM 7090 - **210 KFLOPS**

1960-65 - CDC 6600 - **10.24 MFLOPS**

1965-70 - CDC 7600 - **32.27 MFLOPS**

1970-75 - CDC Cyber **76 MFLOPS**

1975-80 - Cray-1 - **160 MFLOPS**

1980-85 - Cray X-MP - **500 MFLOPS**

1985-90 - Cray Y-MP - **1.3 GFLOPS**

1990-95 - Fujitsu Numerical Wind Tunnel - **236 GFLOPS**

1995-00 - Intel ASCI Red - **2.150 TFLOPS**

2000-02 - IBM ASCI White, SP Power3 375 MHz - **7.226 TFLOPS**

2002-03 - NEC Earth Simulator - **35 TFLOPS**

# Supercomputing Generations

## **Gen I: 1970s. SIMD Array of Processors.**

- USAs developed ILLIAC-IV processor array, consisting of an 8 x 8 array of 64 processors. 106 MFLOPS

## **Gen II: Cray-1 Pipelined Vector Machines 1976,**

- 12 pipelined arithmetic units, 160 MFLOPS

## **Gen III: Shared-Memory multi-processor systems (MIMD Arch) 40 TFLOPS**

## **Gen IV: Massive Parallel Processor having 10 to thousands of processors**

## **Gen V: Cluster based (CPU+GPU+MIC)**

# Classes of Supercomputers

## General Purpose

**vector processing machines** - the same operation carried out on a large amount of data simultaneously

**Tightly connected cluster computers (NUMA)** - communication oriented architectures engineered from ground up, based on high speed interconnects and large number of processors

**Commodity clusters** - collection of large number of commodity PCs (COTS) interconnected by high-bandwidth low-latency network

**Special Purpose:** high performance computing devices with a hardware architecture dedicated to solve a single problem (equipped with custom ASICs or FPGA chips)

# Low Cost Solution: Commodity Cluster

A cluster is a collection of connected, independent computers working in unison to solve a problem COTS technology nodes are interconnected by Ethernet LAN, Myrinet, QsNet ELAN etc. computation can be performed by using popular programming toolkits and frameworks: OpenMP, MPI clusters require dedicated management software.

# Components of a Supercomputing System

- Compute Nodes
- Login/Access Nodes
- I/O Nodes
- Network Nodes (Switches/Routers)
- Management Nodes
- Storage Nodes
- Service Nodes

# Performance Measurement

1 login node and 20 compute nodes, each of them:

2 x Xeon(R) CPU E5-2673 v4 @ 2.30GHz, 40 cores and 2 threads/core, total 80 threads per node)

$$- 80 \times 2.3 \text{ GFLOPS} = 184 \text{ GFLOPS}$$

128GB of main memory distributed in 16 dimms x 32GB @ 2666MHz

2 x SSD 2TB as local storage and 2 x 3.2TB NVME

2 x GPU NVIDIA 4070TI (Ada Lovelace) with 12GB HBM2.

$$- 2 \times 41.13 \text{ TFLOPS} = 82.22 \text{ TFLOPS}$$

GPFS via one Ethernet link 10 Gbit

Supercomputing System Performance = 1660 TFLOPS

**1.6 PFLOPS**

# Performance Bottleneck

Main Memory Delay

Network (I/O) Delay

Disk Storage Delay

- Importance of HPC
- Types of Processing System
- HPC System Architecture
- Supercomputing Classes and Infrastructure
- **Cluster Software Stacks**
- Programming Models
- Applications

# Software Stack Structures and Usage



# **Cluster System Software Stack Configuration Tools**

## **Linux Distributions**

- Red Hat Enterprise Linux (RHEL) and CentOS
- SUSE Linux Enterprise Server (SLES)
- Ubuntu Server

## **HPC-Optimized Linux Distributions**

- CentOS Stream for HPC
- OpenHPC

## **Specialized Linux Variants:**

- Rocky Linux
- Scientific Linux

## **Environment Modules System**

- Singularity containers in an HPC environment

## **Cluster Management Tools**

- Automate the deployment and management of nodes

# High-Performance Computing (HPC) Bare Metal

Slurm: A popular open-source workload manager and job scheduler for managing and optimizing HPC clusters.

Distributed computation environments are used for software development (OpenMP, MPI). OpenMPI: A widely used message-passing interface for developing parallel and distributed computing applications.

Intel oneAPI: A unified software stack from Intel for developing high-performance applications that can leverage CPU, GPU, and FPGA architectures.

GNU Compiler Collection (GCC): A set of compilers and libraries often used for building high-performance applications in various programming languages.

CUDA Toolkit: NVIDIA's software platform for developing applications that leverage NVIDIA GPUs for high-performance computing.

High-Performance Math Function Library: BLAS (Basic Linear Algebra Subprograms), LINPACK, LAPACK (Linear Algebra PACKage) Linux

# OpenStack Platform

Nova: The compute service in OpenStack, responsible for managing virtual machines and instances.

Neutron: The networking service that provides networking resources and connectivity for OpenStack deployments.

Cinder: The block storage service that enables the creation and management of block storage volumes.

Swift: The object storage service designed to store and retrieve large amounts of unstructured data.

Horizon: The web-based dashboard interface for managing and monitoring OpenStack resources.

Heat: The orchestration service that allows users to define and automate the deployment of resources using templates.

## HPC , AI and BigData Software Stack

|                                         |                                        |                                                                                                                                                                 |                             |                         |                                 |
|-----------------------------------------|----------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------|-------------------------|---------------------------------|
| Deep Learning and BigData Environment   | Frameworks                             | Caffe, Caffe2, Caffe-MPI, Chainer, Microsoft CNTK, Keras, MXNet, Tensorflow, Theano, PyTorch<br>Apache Hadoop, Apache Spark, Apache Flink, Apache Storm, Apache |                             |                         |                                 |
|                                         | Libraries                              | cnDNN, NCCL, cuBLAS,<br>Apache Hive, Apache Pig, Apache HBase, Spark SQL, MLlib, GraphX                                                                         |                             |                         |                                 |
|                                         | User Access                            | <b>NVIDIA DIGITS, Apache Zeppelin</b>                                                                                                                           |                             |                         |                                 |
| Programming Environment                 | Development & Performance Tools        | Intel Parallel Studio XS Cluster Edition                                                                                                                        | PGI Cluster Development Kit | GNU Toolchain           | NVIDIA CUDA                     |
|                                         | Scientific and Communication Libraries | Intel MPI                                                                                                                                                       | MVAPICH2, MVAPICH           | IBM Spectrum LSF        | Open MPI                        |
|                                         | Debuggers                              | Intel IDB                                                                                                                                                       | PGI PGDBG                   | GNU GDB                 |                                 |
| Schedulers, File Systems and Management | Resource Management/Job Scheduling     | Adaptive Computative<br>Moab, Maui TORQUE                                                                                                                       | SLURM                       | Altair PBS Professional | IBM Spectrum LSF<br>Grid Engine |
|                                         | File Systems                           | Lustre                                                                                                                                                          | NFS                         | GPFS                    | Local (ext3, ext4, XFS)         |
|                                         | Cluster Management                     | Beowulf, xCat, OpenHPC, Rocks, Bright Cluster Manager for HPC including support for NVIDIA Data Center GPU Manager                                              |                             |                         |                                 |
| Operating Systems and Drivers           | Drivers & Network Mgmt.                | Accelerator Software Stack and Drivers                                                                                                                          |                             | OFED, OPA               |                                 |
|                                         | Operating Systems                      | Linux (RHEL, CentOS, SUSE Enterprise, Ubuntu, etc.)                                                                                                             |                             |                         |                                 |

- Importance of HPC
- Types of Processing System
- HPC System Architecture
- Supercomputing Classes and Infrastructure
- Cluster Software Stacks
- **Parallel Programming Models**
- Demonstration (Supercomputing System)

# Frameworks for Parallel Programming

**Scientific Computing**

Big Data Processing

AI and Machine Learning

# Parallel Programming Models

Shared Memory: Multi-threading programming languages: such as POSIX Threads, Java Threads, OpenMP.

- OpenMP: Parallelize loops and sections of code using compiler directives.

MPI (Message Passing Interface): A widely used standard for writing parallel programs that execute on distributed memory systems. It's commonly used in high-performance computing for scientific simulations and calculations.

OpenACC: OpenACC is a directive-based approach to parallel programming that focuses on using high-level directives to guide the compiler in parallelizing code for accelerators like GPUs.

CUDA: NVIDIA's parallel computing platform and programming model that enables developers to use GPUs for accelerating scientific computations.

BLAS (Basic Linear Algebra Subprograms)

# Threads Model



- In the threads model of parallel programming, a single process can have multiple, concurrent execution paths.
- A single program that includes a number of subroutines.
- Threads are commonly associated with shared memory architectures and operating systems.

# Shared Memory Programming Model: OpenMP

- **OpenMP**
  - Compiler directive based; can use serial code
  - Portable / multi-platform, including Unix and Windows NT platforms

## OpenMP Constructs



# CUDA C++ Programming: NVCC



`dim3 threadsPerBlock(16, 16);`

`dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);`



Per thread registers and local memory



Per block Shared memory



Shared memory of all thread blocks in a cluster form Distributed Shared Memory

## Grid with Clusters

### Thread Block Cluster



### Thread Block Cluster



Global Memory

Global Memory shared between all GPU kernels

## C Program Sequential Execution

Serial code

Parallel kernel  
`Kernel0<<<>>()`

Serial code

Parallel kernel  
`Kernel1<<<>>()`

Host

Device

Grid 0

|              |              |              |
|--------------|--------------|--------------|
| Block (0, 0) | Block (1, 0) | Block (2, 0) |
| Block (0, 1) | Block (1, 1) | Block (2, 1) |

Host

Device

Grid 1

|              |              |
|--------------|--------------|
| Block (0, 0) | Block (1, 0) |
| Block (0, 1) | Block (1, 1) |
| Block (0, 2) | Block (1, 2) |

# Message Passing Model

The message passing model demonstrates the following characteristics:

- A set of tasks that use their own local memory during computation. Multiple tasks can reside on the same physical machine as well across an arbitrary number of machines.
- Tasks exchange data through communications by sending and receiving messages.
- Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.



# Point-to-Point Communication

MPI defines more than 35 point-to-point communication methods, including

`MPI_Send`, `MPI_Recv`, `MPI_Sendrecv`,  
`MPI_Isend`, `MPI_Irecv`, `MPI_Probe`,  
`MPI_Iprobe`, `MPI_Test`, `MPI_Testall`,  
`MPI_Wait`, and `MPI_Waitall`.

## Collective Communication

These include: `MPI_Allgather`,  
`MPI_Allgatherv`, `MPI_Allreduce`,  
`MPI_Alltoall`, `MPI_Alltoallv`,  
`MPI_Barrier`, `MPI_Bcast`, `MPI_Gather`,  
`MPI_Gatherv`, `MPI_Reduce`,  
`MPI_Scatter`, and `MPI_Scatterv`.



```
void main(int argc, char *argv[])
{
int myrank, size;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
printf("Processor %d of %d: Hello World!\n", myrank, size);
MPI_Finalize();
}
```



**Example of Parallel Communications Overhead and Complexity:** actual callgraph from the simple parallel "hello world" program shown. Most of the routines are from communications libraries.

## MPI include file

### *Declarations, prototypes, etc.*

Program Begins

### **Serial code**

## Initialize MPI environment

## *Parallel code begins*

Do work & make message passing calls

### Terminate MPI environment

### **Parallel code ends**

### **Serial code**

## Program Ends

# Hybryd

- In this model, any two or more parallel programming models are combined.
- Currently, a common example of a hybrid model is the combination of the message passing model (MPI) with either the threads model (POSIX threads) or the shared memory model (OpenMP). This hybrid model lends itself well to the increasingly common hardware environment of networked SMP machines.
- Another common example of a hybrid model is combining data parallel with message passing. As mentioned in the data parallel model section previously, data parallel implementations (F90, HPF) on distributed memory architectures actually use message passing to transmit data between tasks, transparently to the programmer.

# Big Data Processing

Apache Hadoop: An open-source framework for distributed storage and processing of large datasets across clusters of computers using the MapReduce programming model.

Apache Spark: A fast and general-purpose cluster computing system that provides in-memory data processing for big data analytics.

Apache Flink: A stream processing framework for real-time event-driven data processing with support for batch processing as well.

Apache Beam: A unified programming model for both batch and stream data processing, supporting multiple backends like Apache Spark, Google Cloud Dataflow, and more.

# AI and Machine Learning:

TensorFlow: An open-source deep learning framework developed by Google that supports both CPU and GPU acceleration for training neural networks.

PyTorch: An open-source machine learning framework developed by Facebook's AI Research lab, known for its dynamic computation graph and ease of use.

Apache MXNet: A flexible and efficient deep learning framework that supports both symbolic and imperative programming paradigms.

Horovod: A distributed deep learning framework designed to scale TensorFlow and PyTorch training across multiple GPUs and nodes.

Distributed TensorFlow: An extension of TensorFlow that enables distributed training of machine learning models across multiple devices and machines.

# Conclusion Parallel Programming Models

| Parallel Programming Model                        | Heterogeneous Computing Devices | Memory Model                             | Network Support                             | Distributed Computing Support             |
|---------------------------------------------------|---------------------------------|------------------------------------------|---------------------------------------------|-------------------------------------------|
| OpenMP                                            | X86, AMD                        | Shared memory                            | No network support                          | Limited support                           |
| CUDA/C++                                          | GPU                             | Dedicated memory                         | No built-in support                         | Limited support                           |
| MPI (Message Passing Interface)                   | X86, AMD, GPU, TPU              | Distributed memory                       | Full network support (Ethernet, InfiniBand) | Excellent support (MPI)                   |
| Heterogeneous Computing SDKs (e.g., Intel oneAPI) | X86, AMD, GPU, TPU              | Shared and dedicated                     | Full network support (Ethernet, InfiniBand) | Excellent support (MPI, oneAPI)           |
| OpenACC                                           | GPU                             | Shared memory                            | Limited support (Ethernet)                  | Limited support                           |
| TensorFlow                                        | GPU, TPU                        | Shared memory (GPU), TPU-specific memory | Limited support (Ethernet)                  | Excellent support (TensorFlow Strategies) |

- Importance of HPC
- Types of Processing System
- HPC System Architecture
- Supercomputing Classes and Infrastructure
- Cluster Software Stacks
- Programming Models
- **Demonstration (Supercomputing System)**



# Bottom-up Design Approach



# Applications Services

Data Sciences

Health Science

Social Sciences

Agriculture

High Performance Computing

Web  
(IoT, VLSI Design)

# Development Frameworks and Libraries

Interactive

GCC

Python

OpenMP

MPI

CUDA

OpenACC

OpenCL

TensorFlow

Horovod

Hadoop

PowerAI

DeepSpeed

Spark

# Distributed System & Software Stack

OpenHPC, ROCKS

OpenShift, xCAT  
Nutanix Acropolis

Open-Stack  
Kubernetes

Linux Kernel: OpenPBS, PBS-Pro, SLURM, Ganglia , Open vSwitch, warewolf, Lustre, BeeGFS, Ceph, Mellanox OFED, IPoIB, OpenEth, Network Information Service, ACPI

Rolls, Singularity Image, Docker, Contrainer

# Hardware System

Intelligent RACK infrastructure  
PDU, PMS

Accelerators  
GPU/TPU/FPGAs

Multi-core  
CISC/SuperScalar

SAN/NAS,  
SSDs/NVMe

High-Speed Ethernet,  
Infiniband

# Challenges

How to Write Highly Scalable and Portable Parallel Programs?

How to Enable Automated Use of Effective Parallel Programming Techniques When Writing Parallel Programs?

How to Enforce the Use of Parallel Programming Design Principles During the Programming Process?

How to Employ Suitable Optimization Techniques?

How to Promote Interdisciplinary Collaboration?

## Namal Supercomputing System

**Linux Operating system,**

**20 Servers Nodes**

- Two Xeon processors Intel(R) E5-2673 v4, 2 NVIDIA GPUs 24GDDR and 128 GB main memory

50 Tbyte SSD

**Interconnection Fast Ethernet**

**The software stack**

- Cloud and Baremetal
- OpenMP, OpenACC
- MPI

**Application Development**

- VLSI Chip Design
- AI and BigData





## Barcelona Supercomputing Center

Linux Operating system,  
54 Servers

Two Power 9 processors, 4 NVIDIA GPUs with 512 GB of  
main memory

Interconnection InfiniBand is an industry-standard to  
interconnect servers that allows the local memory of one  
server to be accessed from remote servers speedily.

The software stack

**Horovod**, from Uber. Horovod Plugs into TensorFlow,  
PyTorch, and MXNet.



# **Understanding Supercomputing and HPC Architectures and Programming Models**

**by: Tassadaq Hussain**

**Director Centre for AI and BigData**

**Professor Electrical Engineering Department  
Namal University Mianwali**

## **Collaborations:**

**Barcelona Supercomputing Center, Spain**

**European Network on High Performance and Embedded Architecture and Compilation**

**Pakistan Supercomputing Center**