



NVIDIA®

GPU Teaching Kit  
Accelerated Computing



UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

# Lecture 1.1 – Course Introduction

Course Introduction and Overview

# Course Goals

- Learn how to program heterogeneous parallel computing systems and achieve
  - High performance and energy-efficiency
  - Functionality and maintainability
  - Scalability across future generations
  - Portability across vendor devices
- Technical subjects
  - Parallel programming APIs, tools, and techniques
  - Principles and patterns of parallel algorithms
  - Processor architecture features and constraints

# People

- Wen-mei Hwu (University of Illinois, NVIDIA)
- David Kirk (NVIDIA)
- Joe Bungo (NVIDIA)
- Mark Ebersole (formerly NVIDIA)
- Abdul Dakkak (Microsoft, formerly University of Illinois)
- Izzat El Hajj (American University of Beirut, formerly University of Illinois)
- Andy Schuh (University of Illinois)
- John Stratton (Whitman College)
- Isaac Gelado (NVIDIA)
- John Stone (NVIDIA, formerly University of Illinois)
- Javier Cabezas (AMD, formerly NVIDIA)
- Michael Garland (NVIDIA)

# Course Content

| Module | Lectures |
|---|---|
| Module 1:<br>Course Introduction | <ul style="list-style-type: none"><li>• 1.1 - Course Introduction and Overview</li><li>• 1.2 - Introduction to Heterogeneous Parallel Computing</li><li>• 1.3 - Portability and Scalability in Heterogeneous Parallel Computing</li></ul> |
| Module 2:<br>Introduction to CUDA C | <ul style="list-style-type: none"><li>• 2.1 - CUDA C vs. CUDA Libs vs. OpenACC</li><li>• 2.2 - Memory Allocation and Data Movement API Functions</li><li>• 2.3 - Threads and Kernel Functions</li><li>• 2.4 - Introduction to CUDA Toolkit</li><li>• 2.5 - Nsight Compute and Nsight Systems</li><li>• 2.6 - Unified Memory</li></ul> |
| Module 3:<br>CUDA Parallelism Model | <ul style="list-style-type: none"><li>• 3.1 - Kernel-Based SPMD Parallel Programming</li><li>• 3.2 - Multidimensional Kernel Configuration</li><li>• 3.3 - Color-to-Greyscale Image Processing Example</li><li>• 3.4 - Blur Image Processing Example</li><li>• 3.5 - Thread Scheduling</li></ul> |
| Module 4:<br>Memory Model and Locality | <ul style="list-style-type: none"><li>• 4.1 - CUDA Memories</li><li>• 4.2 - Tiled Parallel Algorithms</li><li>• 4.3 - Tiled Matrix Multiplication</li><li>• 4.4 - Tiled Matrix Multiplication Kernel</li><li>• 4.5 - Handling Arbitrary Matrix Sizes in Tiled Algorithms</li></ul> |
| Module 5:<br>Thread Execution Efficiency | <ul style="list-style-type: none"><li>• 5.1 - Warps and SIMD Hardware</li><li>• 5.2 - Performance Impact of Control Divergence</li></ul> |



# Course Content

| Module | Lectures |
|---|---|
| Module 6:<br>Memory Access Performance | <ul style="list-style-type: none"><li>• 6.1 - DRAM Bandwidth</li><li>• 6.2 - Memory Coalescing in CUDA</li></ul> |
| Module 7:<br>Parallel Computation Patterns<br>(Histogram) | <ul style="list-style-type: none"><li>• 7.1 - Histogramming</li><li>• 7.2 - Introduction to Data Races</li><li>• 7.3 - Atomic Operations in CUDA</li><li>• 7.4 - Atomic Operation Performance</li><li>• 7.5 - Privatization Technique for Improved Throughput</li></ul> |
| Module 8:<br>Parallel Computation Patterns<br>(Stencil) | <ul style="list-style-type: none"><li>• 8.1 - Convolution</li><li>• 8.2 - Tiled Convolution</li><li>• 8.3 - Tile Boundary Conditions</li><li>• 8.4 - Analyzing Data Reuse in Tiled Convolution</li></ul> |
| Module 9:<br>Parallel Computation Patterns<br>(Reduction) | <ul style="list-style-type: none"><li>• 9.1 - Parallel Reduction</li><li>• 9.2 - A Basic Reduction Kernel</li><li>• 9.3 - A Better Reduction Kernel</li></ul> |
| Module 10:<br>Parallel Computation Patterns<br>(Scan) | <ul style="list-style-type: none"><li>• 10.1 - Prefix Sum</li><li>• 10.2 - A Work-Inefficient Scan Kernel</li><li>• 10.3 - A Work-Efficient Parallel Scan Kernel</li><li>• 10.4 - More on Parallel Scan</li></ul> |



# Course Content

| Module | Lectures |
|---|---|
| Module 11:<br>Breadth-First (BFS) Queue | <ul style="list-style-type: none"><li>• 11.1 - Breadth-First (BFS) Queue</li></ul> |
| Module 12:<br>Floating Point Considerations | <ul style="list-style-type: none"><li>• 12.1 - Floating Point Precision Considerations</li><li>• 12.2 - Numerical Stability</li></ul> |
| Module 13:<br>GPU as part of the PC Architecture | <ul style="list-style-type: none"><li>• 13.1 - GPU as part of the PC Architecture</li></ul> |
| Module 14:<br>Efficient Host-Device Data Transfer | <ul style="list-style-type: none"><li>• 14.1 - Pinned Host Memory</li><li>• 14.2 - Task Parallelism in CUDA</li><li>• 14.3 - Overlapping Data Transfer with Computation</li><li>• 14.4 - CUDA Unified Memory</li></ul> |
| Module 15:<br>Application Case Study: Advanced MRI Reconstruction | <ul style="list-style-type: none"><li>• 15.1 - Advanced MRI Reconstruction</li><li>• 15.2 - Kernel Optimizations</li></ul> |
| Module 16:<br>Application Case Study:<br>Electrostatic Potential Calculation | <ul style="list-style-type: none"><li>• 16.1 - Electrostatic Potential Calculation (Part 1)</li><li>• 16.2 - Electrostatic Potential Calculation (Part 2)</li></ul> |



# Course Content

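| Module | Lectures |
|---|---|
| Module 17:<br>Computational Thinking for Parallel Programming | <ul style="list-style-type: none"><li>• 17.1 - Introduction to Computational Thinking</li></ul> |
| Module 18:<br>Related Programming Models: MPI | <ul style="list-style-type: none"><li>• 18.1 - Introduction to Heterogeneous Supercomputing and MPI</li></ul> |
| Module 19:<br>CUDA Python Using Numba | <ul style="list-style-type: none"><li>• 19.1 - CUDA Python Using Numba</li></ul> |
| Module 20:<br>Related Programming Models:<br>OpenCL | <ul style="list-style-type: none"><li>• 20.1 - OpenCL Data Parallelism Model</li><li>• 20.2 - OpenCL Device Architecture</li><li>• 20.3 - OpenCL Host Code (Part 1)</li></ul> |
| Module 21:<br>Related Programming Models:<br>OpenACC | <ul style="list-style-type: none"><li>• 21.1 - Introduction to OpenACC</li><li>• 21.2 - OpenACC Subtleties</li></ul> |
| Module 22:<br>Related Programming Models:<br>OpenGL | <ul style="list-style-type: none"><li>• <i>Module scheduled for a future release of the teaching kit</i></li></ul> |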



# Course Content

| Module | Lectures |
|---|---|
| Module 23:<br>Dynamic Parallelism  | <ul style="list-style-type: none"><li>• 23.1 - Dynamic Parallelism</li></ul>                                                                                                                                                                 |
| Module 24:<br>Multi-GPU            | <ul style="list-style-type: none"><li>• 24.1 - OpenMP</li><li>• 24.2 - Multi-GPU Introduction I</li><li>• 24.3 - Multi-GPU Introduction II</li><li>• 24.4 - OpenMP and Cooperative Groups</li><li>• 24.5 - Multi-GPU Heat Equation</li></ul> |
| Module 25:<br>Using CUDA Libraries | <ul style="list-style-type: none"><li>• 25.1 - cuBLAS</li><li>• 25.2 - cuSOLVER</li><li>• 25.3 - cuFFT</li><li>• 25.4 - Thrust</li></ul>                                                                                                     |
| Module 26:<br>Advanced Thrust      | <ul style="list-style-type: none"><li>• <i>Module scheduled for a future release of the teaching kit</i></li></ul>                                                                                                                           |








The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.