Supplementary Material

Lecture Slides

Other slides will be added soon.

Source Code

Header Files

The header files are compliant with both regular C++11/14 compilers such as current GCC distributions and the NVCC CUDA compiler.

Chapter 3: Modern Architectures

The following programs can be compiled with any C++11/14 compliant compiler (GCC 5.4 is sufficient). The AVX examples need a CPU with AVX support (any modern CPU). Note that fused multiply-add needs AVX2, which is supported by Haswell CPUs and above.

Matrix Matrix Multiplication

AVX maximum computation

Vector Normalization

Chapter 4: C++11 Multithreading

The following programs can be compiled with any C++11/14 compliant compiler (GCC 5.4 is sufficient).

Hello World

Returning Values from Threads

Static Thread Distributions for Matrix Vector Multiplication

False Sharing

Dynamic Thread Distributions for All-Pairs Distance Computation

  • mnist_exporter.py (execute this to download the MNIST data set)
  • all_pair.cpp (all-pairs distance computation using dynamic thread distributions)

Condition Variables

Basic Thread Pool

Chapter 5: Advanced C++11 Multithreading

The following programs can be compiled with any C++11/14 compliant compiler (GCC 5.4 is sufficient). Note that not all CPUs support 128-bit Compare-and-Swap (CAS) operations.

Atomics

Work-Sharing Thread Pool

  • threadpool.hpp (thread pool with dynamic work-sharing capability)
  • tree.cpp (concurrently traverse a tree using dynamic work-sharing)

Knapsack Problem

  • knapsack.cpp (demonstrating spurious superlinear speedup in branch-and-bound algorithms)

Chapter 6: OpenMP

The following programs can be compiled with a C++11/14 compliant compiler with OpenMP 4.5 support (GCC 6 is sufficient).

Hello World

Vector Addition

Matrix Vector Multiplication

One-Nearest Neighbor Classifier on MNIST data

Scheduling of Inner Products (Linear Kernels, Covariance Matrices)

Softmax Regression on MNIST (Inference and Training)

Custom Reductions

Chapter 7: Compute Unified Device Architecture (CUDA)

The following programs can be compiled with CUDA 8 (C++11 compliant) or CUDA 9 (C++14 compliant). The host compiler can be any C++11/C++14 compliant compiler (GCC 5.4.0 is sufficient).

Hello World

Eigenfaces (Principal Component Analysis)

Dynamic Time Warping (C++14 and thus CUDA 9 required)

Chapter 8: Advanced CUDA Programming

The following programs can be compiled with CUDA 8 (C++11 compliant) or CUDA 9 (C++14 compliant). The host compiler can be any C++11/C++14 compliant compiler (GCC 5.4.0 is sufficient).

Warp Intrinsics and Atomics

  • znorm.cu (segmented reductions using warp intrinsics)
  • atomics.cu (global reduction using atomics)
  • cas.cu (compare and swap on CUDA devices)

Multi GPU and Streaming

Unified Virtual Memory (UVM)

Chapter 9: Message Passing Interface

The following programs can be compiled with current Open MPI distributions in combination with a C++11/C++14 compliant compiler (GCC 5.4.0 is sufficient).

Hello World

Ping Pong via Point-to-Point Communication

Computing Primes (Point-to-Point vs. Global Collectives)

Jacobi Iteration (Solving the Laplace/Poisson Equation with Stencil Codes)

Matrix Matrix Multiplication (Complex Communication)

Chapter 10: Unified Parallel C++

The following programs can be compiled with current UPC++ compilers and executed with GASNet.

Hello World

Axpy (a * x + y)

Matrix Vector Multiplication

Mandelbrot Set

  • view.py (visualize results)
  • mandel1.cxx (basic Mandelbrot set computation)
  • mandel2.cxx (Mandelbrot set computation using a master-slave approach)

Letter Counting

Histograms