

# Aamir Shafi's Profile



- **Background:**
  - BESE, MCS - NUST, '03
  - PhD, ICG, University of Portsmouth,
  - Post-doc, MIT, USA, '11
  - Co-founded xFlow Research, '11 – '12 (worked for Dell & Cavium Networks)
- Conducts R&D in the area of High Performance Computing (HPC)
- Architect/main developer of the MPJ Express software



# Moore's law

- The number of transistors on a chip approximately doubles every two years
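As a rough sketch of what a two-year doubling period implies, the projection can be written as $N(t) = N_0 \cdot 2^{(t - t_0)/2}$. The function name, its parameters, and the starting count used below are illustrative assumptions, not figures from the slides.

```python
def projected_transistors(base_count: float, base_year: int, target_year: int,
                          doubling_period_years: float = 2.0) -> float:
    """Project a transistor count, assuming one doubling per period."""
    elapsed_years = target_year - base_year
    return base_count * 2 ** (elapsed_years / doubling_period_years)


if __name__ == "__main__":
    # Hypothetical starting point: 1 million transistors in the base year.
    for offset in (2, 10, 20):
        count = projected_transistors(1e6, 2000, 2000 + offset)
        print(f"after {offset:2d} years: {count:,.0f} transistors")
```

Under this assumption, ten years means five doublings (a factor of 32) and twenty years means ten doublings (a factor of 1024).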



# Single Processor Performance



**2015:** Intel introduces 5<sup>th</sup> generation Intel® processor (**1.3 billion transistors**).



**2012:** Intel introduces the Intel® Core i5 processor (**1 billion transistors**).



**2001:** Intel introduces Intel® Pentium® 4 processor (**42 million transistors**).



## What can be done, can be outdone

Intel continues to deliver on the promise of Moore's Law through the introduction of powerful multi-core technologies, transistor architecture, advances in materials science and new innovation.

**2004:** Intel introduces Intel® Pentium® 4 processor with HT technology (**125 million transistors**).



# Measuring Performance

FLOPS: floating-point operations per second



| Name       | FLOPS     |
|------------|-----------|
| zettaFLOPS | $10^{21}$ |
| exaFLOPS   | $10^{18}$ |
| petaFLOPS  | $10^{15}$ |
| teraFLOPS  | $10^{12}$ |
| gigaFLOPS  | $10^9$    |
| megaFLOPS  | $10^6$    |
| kiloFLOPS  | $10^3$    |
| FLOPS      | 1         |
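The prefixes in the table can be applied mechanically. The helper below is a minimal sketch that picks the largest prefix not exceeding a raw FLOPS value; the name `format_flops` and the sample values are assumptions chosen for illustration.

```python
# Prefixes from the table above, largest first.
PREFIXES = [
    ("zetta", 1e21),
    ("exa", 1e18),
    ("peta", 1e15),
    ("tera", 1e12),
    ("giga", 1e9),
    ("mega", 1e6),
    ("kilo", 1e3),
]


def format_flops(flops: float) -> str:
    """Express a raw FLOPS figure using the largest fitting prefix."""
    for name, scale in PREFIXES:
        if flops >= scale:
            return f"{flops / scale:g} {name}FLOPS"
    return f"{flops:g} FLOPS"


if __name__ == "__main__":
    for value in (500, 2.5e9, 148.6e15):
        print(format_flops(value))
```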

# Performance Development †



† <https://www.top500.org/static/media/uploads/>

# Countries †

The rise of China!

Share of TOP500 systems (number of systems in parentheses):

- China: 0.4% (2) in 2000 → 45.4% (227) in 2018
- USA: 51.6% (258) in 2000 → 24.8% (124) in 2018



# TOP5 Supercomputers (June '19)

| # | Name              | Computer                                                                                              | Site                                        | Country       | Total Cores | Rmax [PFlop/s] | Power (MW) |
|---|-------------------|-------------------------------------------------------------------------------------------------------|---------------------------------------------|---------------|-------------|----------------|------------|
| 1 | Summit            | IBM Power System AC922, IBM POWER9 22C 3.07GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband | Oak Ridge National Laboratory               | United States | 2414592     | 148.6          | 10.1       |
| 2 | Sierra            | IBM Power System S922LC, IBM POWER9 22C 3.1GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband | LLNL                                        | United States | 1572480     | 94.6           | 7.4        |
| 3 | Sunway TaihuLight | Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway                                                       | National Supercomputing Center in Wuxi      | China         | 10649600    | 93.01          | 15.3       |
| 4 | Tianhe-2A         | TH-IVB-FEP Cluster, Intel Xeon E5-2692v2 12C 2.2GHz, TH Express-2, Matrix-2000                        | National Super Computer Center in Guangzhou | China         | 4981760     | 61.44          | 18.4       |
| 5 | Frontera          | Dell C6420, Xeon Platinum 8280 28C 2.7GHz, Mellanox InfiniBand HDR                                    | Texas Advanced Computing Center/Univ.       | United States | 448448      | 23.51          | -          |
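One way to read the Rmax and Power columns together is as energy efficiency. The sketch below divides Rmax by power for each system in the table; since 1 PFLOP/s per MW equals 1 GFLOP/s per W, the ratio needs no extra scaling. The function name and data layout are illustrative assumptions, and Frontera is skipped because the table lists no power figure for it.

```python
from typing import Optional

# (name, Rmax in PFlop/s, power in MW), copied from the table above.
SYSTEMS = [
    ("Summit", 148.6, 10.1),
    ("Sierra", 94.6, 7.4),
    ("Sunway TaihuLight", 93.01, 15.3),
    ("Tianhe-2A", 61.44, 18.4),
    ("Frontera", 23.51, None),  # no power figure in the table
]


def gflops_per_watt(rmax_pflops: float, power_mw: Optional[float]) -> Optional[float]:
    """Return GFlop/s per watt, or None when power is unknown."""
    if power_mw is None:
        return None
    # PFlop/s divided by MW is 1e15 / 1e6 = 1e9 Flop/s per W, i.e. GFlop/s per W.
    return rmax_pflops / power_mw


if __name__ == "__main__":
    for name, rmax, power in SYSTEMS:
        efficiency = gflops_per_watt(rmax, power)
        shown = "n/a" if efficiency is None else f"{efficiency:.2f}"
        print(f"{name:18s} {shown} GFlop/s per W")
```

By this measure the two GPU-accelerated IBM systems come out well ahead of the other entries that report power.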

# #1: IBM Summit



# #3: Sunway TaihuLight



# What helps Performance?

- **Clock Speed**
- **More work per clock cycle: transistors are faster, more energy-efficient, and more numerous**
- **Better architectures: finding more parallelism in one thread, better branch prediction, better cache policies, better memory organizations, more thread-level parallelism, etc.**
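The three factors above can be tied together with the standard CPU time relation: execution time = instruction count × cycles per instruction (CPI) ÷ clock rate. The sketch below assumes that textbook formula; the function name and the workload numbers are hypothetical.

```python
def execution_time_seconds(instruction_count: float,
                           cycles_per_instruction: float,
                           clock_hz: float) -> float:
    """CPU time = instructions x CPI / clock rate."""
    return instruction_count * cycles_per_instruction / clock_hz


if __name__ == "__main__":
    # Hypothetical workload: one billion instructions.
    baseline = execution_time_seconds(1e9, 1.0, 2.5e9)
    faster_clock = execution_time_seconds(1e9, 1.0, 3.7e9)  # higher clock speed
    lower_cpi = execution_time_seconds(1e9, 0.5, 2.5e9)     # more work per cycle
    print(f"baseline:     {baseline:.3f} s")
    print(f"faster clock: {faster_clock:.3f} s")
    print(f"lower CPI:    {lower_cpi:.3f} s")
```

Raising the clock rate shrinks the denominator, while doing more work per cycle or using a better architecture lowers the effective CPI.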

# Modern Trends

- **Clock speed improvements are slowing**
  - power constraints
- **Difficult to further optimize a single core for performance**
- **Multi-cores: each new processor generation will accommodate more cores**
- **Need better programming models and efficient execution for multi-threaded applications**
- **Need better memory hierarchies**
- **Need greater energy efficiency**

# Intel® Core™ i7-3770T Processor



| Specification              | Value           | Specification          | Value          |
|----------------------------|-----------------|------------------------|----------------|
| # of Cores                 | 4               | Lithography            | 22 nm          |
| # of Threads               | 8               | Max TDP                | 45 W           |
| Clock Speed                | 2.5 GHz         | Recommended Customer Price | TRAY: \$294.00 |
| Max Turbo Frequency        | 3.7 GHz         | Max Memory Size        | 32 GB          |
| Intel® Smart Cache         | 8 MB            | Memory Types           | DDR3-1333/1600 |
| Instruction Set            | 64-bit          | # of Memory Channels   | 2              |
| Instruction Set Extensions | SSE4.1/4.2, AVX | Max Memory Bandwidth   | 25.6 GB/s      |
| Embedded Options Available | No              |                        |                |
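Some of the figures in this table follow from the others. As a sketch, the maximum memory bandwidth can be recovered from the channel count and the DDR3-1600 transfer rate, assuming the usual 64-bit (8-byte) data bus per channel. The peak floating-point estimate additionally assumes 8 double-precision operations per core per cycle with AVX; that value is an assumption introduced here, not a number from the spec table. All function names are illustrative.

```python
def peak_memory_bandwidth_gbs(channels: int, mega_transfers_per_s: float,
                              bytes_per_transfer: int = 8) -> float:
    """Peak bandwidth in GB/s = channels x transfers/s x bytes per transfer."""
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in GFLOPS = cores x clock (GHz) x FLOPs per cycle per core."""
    return cores * clock_ghz * flops_per_cycle


if __name__ == "__main__":
    # 2 channels of DDR3-1600 (1600 mega-transfers per second), as listed in the table.
    print(f"memory bandwidth: {peak_memory_bandwidth_gbs(2, 1600):.1f} GB/s")
    # 4 cores at the 2.5 GHz base clock; 8 FLOPs/cycle/core is an assumed value.
    print(f"peak compute:     {peak_gflops(4, 2.5, 8):.1f} GFLOPS")
```

The bandwidth result matches the 25.6 GB/s listed in the table; the compute figure is a theoretical upper bound under the stated assumption, not a measured value.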

# What is Computer Architecture?

Computer Architecture = Instruction Set Architecture + Machine Organization (e.g., pipelining, memory hierarchy, storage systems, etc.) + Hardware Implementation