



# Intel® Many Integrated Core Architecture

December 2010  
Intel

# Legal Disclaimer

*INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.*

- *Intel may make changes to specifications and product descriptions at any time, without notice.*
- *All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.*
- *Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.*
- *Penryn, Nehalem, Westmere, Sandy Bridge, and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user.*
- *Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.*
- *Intel, Xeon, Netburst, Core, VTune, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.*

*\*Other names and brands may be claimed as the property of others.*



# Intel in High-Performance Computing



Dedicated,  
Renowned  
Applications  
Expertise

Tera/Exa-  
Scale  
Research



Large Scale  
Clusters  
for Test &  
Optimization



Broad Software  
Tools  
Portfolio

Defined  
HPC  
Application  
Platform



Platform  
Building  
Blocks



Manufacturing  
Process  
Technologies



Leading  
Performance,  
Energy Efficient



Many  
Integrated  
Core  
Architecture

A long-term commitment to the HPC market segment





# Tick-Tock Development Cycles

## Integrate. Innovate.

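*Figure: tick-tock roadmap. Labels recovered from the diagram:*

- Cadence: TICK, TOCK
- Process: 45nm, 32nm, 22nm
- Processors: Penryn, Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Future Proc.
- Microarchitectures: Intel® Core™ Microarchitecture, Sandy Bridge Microarchitecture
- Instructions: SSE4.1, SSE4.2, AESNI, AVX, Future Instructions
- Marker: Forecast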



# Increasing Performance and Energy Efficiency

## Process Technology 22NM



## Legacy TURBO-BOOST



## Core Architecture AVX



## Processors MULTI/MANY-CORE



Potential future options, subject to change without notice.



# Two Socket Platform Evolution



Front-side bus



Multiple  
front-side buses



Integrated  
Memory  
Controller,  
QPI



Integrated  
Memory  
Controller,  
QPI,  
Integrated  
I/O



# Intel and Parallelism



*Images not intended to reflect actual die sizes*

|            | Nocona           | Woodcrest        | Nehalem-EP       | Westmere-EP      | Sandy Bridge     | Aubrey Isle<br>(Knights Ferry) |
|------------|------------------|------------------|------------------|------------------|------------------|--------------------------------|
| Frequency  | 3.6GHz           | 3.0GHz           | 3.2GHz           | 3.33GHz          | TBD              | 1.2GHz                         |
| Core(s)    | 1                | 2                | 4                | 6                | 8                | 32                             |
| Thread(s)  | 2                | 2                | 8                | 12               | 16               | 128                            |
| SIMD Width | 128<br>(2 clock) | 128<br>(1 clock) | 128<br>(1 clock) | 128<br>(1 clock) | 256<br>(1 clock) | 512<br>(1 clock)               |

**Knights Ferry builds on established CPU architecture and programming concepts - providing the benefits of code re-use to developers of highly parallel applications**



# How to get High Performance and Energy Efficiency?

high F.P. performance



+



small, extremely  
energy-efficient  
core



many integrated,  
parallel, small,  
energy-efficient,  
high-performance cores

**The Newest Addition to the Intel Server Family.  
Industry's First General Purpose Many Core Architecture.**



# Intel® MIC Customer Value



Combine:

The many benefits of broad Intel CPU programming models, techniques, and familiar developer tools

+

The compute density and energy efficiency associated with specialty accelerators for parallel workloads

=

Intel® Many Integrated Core products



MIC = CO-PROCESSOR for highly-parallel workloads  
FULLY PROGRAMMABLE



# Intel® MIC Architecture – Knights Family



**Multiple IA cores**  
- In-order, short pipeline  
- Multi-thread support

**16-wide vector units (512b)**  
- Extended instruction set  

**Fully coherent caches**

**1024-bit ring bus**  
**GDDR5 memory**  
- Supports virtual memory

## Standard IA Shared Memory Programming

For illustration only.

Future options subject to change without notice.



# Aubrey Isle Core (in KNF)



## The Aubrey Isle co-processor core:

- Scalar pipeline derived from the dual-issue Pentium processor
- Short execution pipeline
- Fully coherent cache structure
- Significant modern enhancements such as multi-threading, 64-bit extensions, and sophisticated pre-fetching
- 4 execution threads per core
- Separate register sets per thread
- Supports IEEE standards for floating point arithmetic
- Fast access to its 256KB local subset of a coherent L2 cache
- 32KB instruction cache per core
- 32KB data cache per core

## Enhanced x86 instruction set with:

- Over 100 new instructions
- Wide vector processing operations
- Some specialized scalar instructions
- 3-operand, 16-wide vector processing unit (VPU)
- VPU executes integer, single-precision float, and double precision float instructions

## Interprocessor Network

1024 bits wide, bi-directional (512 bits in each direction)



# New VPU Instructions

Over 100 new instructions

512-bit SIMD

32 × 512-bit vector registers, 8 × 16-bit mask registers

16 FLOAT32, 8 FLOAT64, 16 INT32 or 512 LOGICAL1 elements /vreg

Ternary, Multiply-Add (FMA)

More flops in fewer ops (IEEE-compliant)

e.g. `vmadd231ps v0, v5, v6 ; v0 = v5 * v6 + v0`
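As a rough scalar model of the `vmadd231ps` example above, the sketch below applies a fused multiply-add to each of the 16 single-precision lanes of one 512-bit register. The `Vec16f` alias and the `vmadd231ps_model` function are hypothetical names chosen for illustration. They are not Knights Ferry intrinsics, and the real VPU would do this work in a single instruction rather than a loop.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for one 512-bit register holding 16 FLOAT32 lanes.
using Vec16f = std::array<float, 16>;

// Scalar model of "vmadd231ps v0, v5, v6": v0 = v5 * v6 + v0, lane by lane.
// std::fma rounds once per lane, which is what makes the operation "fused".
Vec16f vmadd231ps_model(Vec16f v0, const Vec16f& v5, const Vec16f& v6) {
    for (std::size_t lane = 0; lane < v0.size(); ++lane) {
        v0[lane] = std::fma(v5[lane], v6[lane], v0[lane]);
    }
    return v0;
}
```

Each lane counts as two floating-point operations (one multiply, one add), which is where "more flops in fewer ops" comes from.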

Load-op

Third operand can be taken direct from memory

Broadcast/Swizzle/Format Conversion (on Load/Store)

Float16, unorm8, etc. - allows more efficient use of caches
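To illustrate why converting on load helps cache efficiency, the sketch below models a load that expands 16 `unorm8` values into 16 FLOAT32 lanes. It assumes the usual definition of unorm8 (an unsigned byte `b` represents `b / 255`). The function name `load_unorm8_model` is hypothetical, and the loop only mimics what the hardware would do as part of the load itself.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for one 512-bit register holding 16 FLOAT32 lanes.
using Vec16f = std::array<float, 16>;

// Scalar model of a converting load: 16 unorm8 values (16 bytes in memory)
// expand to 16 FLOAT32 lanes in the range [0, 1].
Vec16f load_unorm8_model(const std::uint8_t* src) {
    Vec16f out{};
    for (std::size_t lane = 0; lane < out.size(); ++lane) {
        out[lane] = static_cast<float>(src[lane]) / 255.0f;
    }
    return out;
}
```

In this model the data occupies 16 bytes in memory instead of the 64 bytes that 16 FLOAT32 values would need, so four times as many elements fit in the same cache line.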

Predication/Masking on most Operations

Gather/Scatter support
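Predication and gather can be pictured together with a scalar loop. In the sketch below, one bit of a 16-bit mask controls each lane: lanes whose bit is set load from `base[index[lane]]`, and the others keep their previous value. The merge behavior for masked-off lanes, along with the names `Vec16i` and `masked_gather_model`, are assumptions made for illustration rather than a statement of the actual instruction semantics.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for 512-bit registers with 16 lanes each.
using Vec16f = std::array<float, 16>;
using Vec16i = std::array<std::int32_t, 16>;

// Scalar model of a masked gather: for each lane whose mask bit is set,
// load base[index[lane]]; all other lanes keep their previous value in dst.
Vec16f masked_gather_model(Vec16f dst, std::uint16_t mask,
                           const float* base, const Vec16i& index) {
    for (std::size_t lane = 0; lane < dst.size(); ++lane) {
        if ((mask >> lane) & 1u) {
            dst[lane] = base[index[lane]];
        }
    }
    return dst;
}
```

A scatter is the mirror image: the same mask test guards a store to `base[index[lane]]` instead of a load.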



# The “Knights” Family

## Knights Ferry

Software Development Platform



## Future Knights Products

### Knights Corner

1<sup>st</sup> Intel® MIC product

22nm process

More than 50 Intel Architecture cores

Within PCIe Power Envelope

Additional Enhancements

# “Knights Ferry” Development Platform



## Software Development Platform

Growing availability through 2011

Up to 32 cores, up to 1.2 GHz

Up to 128 threads at 4 threads / core

Up to 8MB shared coherent cache

TFLOPS Performance

Up to 2 GB GDDR5 shared memory

PCIe Card (within 300W envelope)

Bundled with Intel HPC SW tools

Software development platform for Intel® MIC architecture
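The "TFLOPS Performance" line is consistent with a back-of-the-envelope peak estimate built from the other figures on this slide. The sketch below assumes each core can issue one 16-wide single-precision multiply-add per clock (2 flops per lane). That issue rate is an assumption, not something the slide states, and the function names are hypothetical.

```cpp
// Peak single-precision GFLOPS under the stated assumption:
// cores * GHz * SIMD lanes * 2 flops (one multiply and one add per lane).
constexpr double peak_sp_gflops(int cores, double ghz, int lanes) {
    return cores * ghz * lanes * 2.0;
}

// Hardware threads exposed at a fixed number of threads per core.
constexpr int hardware_threads(int cores, int threads_per_core) {
    return cores * threads_per_core;
}
```

With the Knights Ferry figures (32 cores, 1.2 GHz, 16 lanes), `peak_sp_gflops(32, 1.2, 16)` comes to about 1228.8 GFLOPS, roughly 1.2 TFLOPS, and `hardware_threads(32, 4)` gives the 128 threads quoted above.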



# Intel® MIC Architecture Programming

## Single Source Code



## Common with Intel® Xeon®

- Programming Models
- C/C++, Fortran compilers
- Intel SW developer tools and libraries (MKL, IPP, TBB, ArBB, ...)
- Coding and optimization techniques and SW tools
- Ecosystem support

## Eliminates Need for Dual Programming Architecture

For illustration only, potential future options subject to change without notice.



# Example: Computing PI

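```
#include <cmath>
#include <cstdio>
#include <cstdlib>

#define NSET 1000000

int main(int argc, const char** argv)
{
  long int i;
  float num_inside, Pi;
  num_inside = 0.0f;
  // The offload pragma is the one line added to the CPU version
#pragma offload target (MIC)
#pragma omp parallel for reduction(+:num_inside)
  for (i = 0; i < NSET; i++)
  {
    float x, y, distance_from_zero;
    // Generate x, y random numbers in the unit interval.
    // RAND_MAX + 1 is formed in floating point to avoid integer overflow.
    // Note: rand() shares hidden state across threads; kept for brevity.
    x = float(rand()) / (float(RAND_MAX) + 1.0f);
    y = float(rand()) / (float(RAND_MAX) + 1.0f);
    distance_from_zero = std::sqrt(x*x + y*y);
    if (distance_from_zero <= 1.0f)
      num_inside += 1.0f;
  }
  Pi = 4.0f * (num_inside / NSET);
  printf("Value of Pi = %f \n", Pi);
  return 0;
}
```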

**One additional line from the CPU version**

(For illustration only)
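For readers without an offload-capable compiler, the same Monte Carlo estimate can be sketched as a serial, host-only function in standard C++11. This is not the CPU version the slide refers to: it swaps `rand()` for the `<random>` generators and drops both pragmas, and the name `estimate_pi` and its parameters are hypothetical.

```cpp
#include <cstdint>
#include <random>

// Serial Monte Carlo estimate of pi: the fraction of random points in the
// unit square that fall inside the quarter circle approaches pi / 4.
double estimate_pi(std::uint64_t samples, std::uint32_t seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::uint64_t inside = 0;
    for (std::uint64_t i = 0; i < samples; ++i) {
        const double x = unit(gen);
        const double y = unit(gen);
        // Comparing the squared distance avoids the square root.
        if (x * x + y * y <= 1.0) {
            ++inside;
        }
    }
    return 4.0 * static_cast<double>(inside) / static_cast<double>(samples);
}
```

Calling `estimate_pi(1000000, 42u)` uses the same sample count as `NSET` above. The result varies with the seed and converges slowly, with an error that shrinks roughly as one over the square root of the sample count.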



# Heterogeneous Programming with MIC

| Programming models and tools |
|------------------------------|
| MPI (C/C++, FTN)             |
| MKL, IPP (C/C++, FTN)        |
| Cilk (C++)                   |
| CnC (C++)                    |
| TBB (C++)                    |
| ArBB (C++)                   |
| CAF (FTN)                    |
| OpenMP (C/C++/FTN)           |
| Fortran90 Arrays (FTN)       |
| CEAN (C++)                   |
| OpenCL (C/C++)               |
| Intel Compilers (C/C++, FTN) |



Larger #Cores  
Wider Vectors/SIMD



Programming Intel® MIC is the same as programming a CPU



# Intel Development Tools for HPC

## Leading developer tools for performance on nodes and clusters



### Advanced Performance

C++ and Fortran compilers, MKL/IPP libraries, and analysis tools for Windows\* and Linux\* developers on IA-based multi-core nodes

### Distributed Performance

MPI cluster tools, with C++ and Fortran compilers, MKL libraries, and analysis tools for Windows\* and Linux\* developers on IA-based clusters



# Scaling Performance Forward

## Software Tools Vision



**Employ versatile and common development tools across all IA architectures**

**Single Portable Software Stack**

**Flexible Programmability**

**Scalable Performance**

***Data-Parallelism  
Thread-Parallelism  
Messaging***

...

Potential future options, no indication of actual product or development, subject to change without notice.



Access innovations ... *in the formative stages*

*Screenshot: whatif.intel.com home page ("What If Experimental Software"), listing the active projects below.*

## Active Projects

### Designing New Capabilities

- Intel® OpenCL SDK **New!**
- Intel Advisor Lite Now Part of Intel® Parallel Studio
- Intel® Web APIs **New!**
- Intel® Energy Checker SDK
- Intel® SOA Expressway XSLT 2.0 Processor
- Smoke - Game Technology Demo **Rev 1.2 Released**
- Isolated Execution
- Intel® Direct Ethernet Transport
- Intel® Software Development Emulator

### Creating Concurrent Code

- Intel® Cilk++ Software Development Kit
- Intel® Concurrent Collections for C++ **Rev 0.6 Released**
- Intel® C/C++ STM Compiler, Prototype Edition **Rev 4.0 Released**

### Math Libraries

- Intel® Cluster Poisson Solver Library
- Intel® Adaptive Spike-Based Solver
- Intel® Ordinary Differential Equations Solver Library

### Performance Tuning

- Intel® Software Tuning Agent
- Intel® Architecture Code Analyzer
- Intel® Performance Tuning Utility 4.0 **Update 3 Released**
- Intel® Platform Modeling with Machine Learning



# Intel TeraScale Research Areas

## MANY-CORE COMPUTING



**Teraflops** of computing power

## 3D STACKED MEMORY



**Terabytes** of memory bandwidth

## SILICON PHOTONICS



**Terabits** of I/O throughput

Future vision, does not represent real products.





# Moore's Law: Alive and Well at Intel



180 nm  
**1999**

130 nm  
**2001**

90 nm  
**2003**

65 nm  
**2005**

45 nm  
**2007**

32 nm  
**2009**

22 nm  
**2011**

Continuing Moore's Law  
each new process technology  
allows up to:

Transistor  
Performance  
**+20%**

Switching Power  
**-30%**

**Production**

**Development**

**Research**

**On Track**

15nm



# Industry Trend to Multi/Many-Core

**Energy Efficient (HPC) Computing  
with Multi/Many-Core Processors**



Multi Processor



Hyper-Threading



Dual-Core



Multi-Core (4+)



Many-Core

**But: not all cores are equal!**

(for illustration only)



# Intelligent Processor Performance Scaling Forward

## Processor Core Performance



Potential future options, subject to change without notice.

## Faster Time To Productivity

- Total Application Performance
- Increased Single Thread Performance
- Increased Floating Point Performance and Bandwidth
- Irregular Data-Access
- Balanced Processor and System Architecture
- Less Complex Software Development and Support



# Intel® Turbo Boost Technology 2.0



Intelligent and energy efficient performance on demand

*The number of Turbo bins shown is only for illustrative purposes and is not representative of the actual number of turbo bins available.*

