Lecture Slides
- chapter_01.pptx (Slides for Chapter 1 [online])
- chapter_02.pptx (Slides for Chapter 2 [online])
- chapter_03.pptx (Slides for Chapter 3 [online])
Other slides will be added soon.
Source Code
Header Files
The header files compile with both regular C++11/14 compilers (such as current GCC distributions) and the NVCC CUDA compiler.
- hpc_helper.hpp (C++/CUDA timers, CUDA error handler, no_init_t wrapper)
- binary_IO.hpp (load linear memory from or dump it to disk)
- bitmap_IO.hpp (automatically normalize and write linear arrays as bitmaps to disk)
- cbf_generator.hpp (sample from the Cylinder-Bell-Funnel data set)
- svd.hpp (a convenient wrapper for CUDA's CUSOLVER SVD call)
Chapter 3: Modern Architectures
The following programs can be compiled with any C++11/14 compliant compiler (GCC 5.4 is sufficient). The AVX examples need a CPU with AVX support (any modern x86 CPU). Note that fused multiply-add requires AVX2, which is supported by Haswell CPUs and later.
Matrix Matrix Multiplication
- matrix_mult.cpp (matrix multiplication with prior transposition)
- matrix_matrix_mult.cpp (AVX accelerated matrix multiplication)
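A minimal sketch of the prior-transposition idea (not the repository's implementation; the function name and flat row-major layout are illustrative assumptions): transposing B first turns the innermost loop into two unit-stride scans, which is far friendlier to caches than striding down B's columns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiply an m x k matrix A by a k x n matrix B (both row-major).
// Transposing B up front makes the dot-product loop read both
// operands with unit stride.
std::vector<float> matmul_transposed(const std::vector<float>& A,
                                     const std::vector<float>& B,
                                     std::size_t m, std::size_t k,
                                     std::size_t n) {
    std::vector<float> Bt(n * k);          // Bt(j, l) = B(l, j)
    for (std::size_t l = 0; l < k; ++l)
        for (std::size_t j = 0; j < n; ++j)
            Bt[j * k + l] = B[l * n + j];

    std::vector<float> C(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t l = 0; l < k; ++l)
                sum += A[i * k + l] * Bt[j * k + l];   // both unit stride
            C[i * n + j] = sum;
        }
    return C;
}
```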
AVX maximum computation
- vector_max.cpp (AVX accelerated horizontal maximum over an array)
- pointwise_vector_max.cpp (AVX accelerated pointwise maximum of two arrays)
Vector Normalization
- vector_norm_aos_plain.cpp (plain vector normalization in AOS format)
- vector_norm_soa_plain.cpp (plain vector normalization in SOA format)
- vector_norm_aos_avx.cpp (AVX accelerated vector normalization in AOS format with AOS2SOA shuffle)
- vector_norm_soa_avx.cpp (AVX accelerated vector normalization in SOA format)
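To make the AOS/SOA distinction concrete, here is a plain (non-AVX) sketch of both layouts for 2D points; the struct names are illustrative, not taken from the repository. In SOA form all x coordinates are contiguous, so a vectorizing compiler (or hand-written AVX) can load several lanes with a single move.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Array of structs: each point's coordinates are adjacent in memory.
struct PointAoS { float x, y; };

void normalize_aos(std::vector<PointAoS>& pts) {
    for (auto& p : pts) {
        float inv = 1.0f / std::sqrt(p.x * p.x + p.y * p.y);
        p.x *= inv;
        p.y *= inv;
    }
}

// Struct of arrays: all x's are contiguous, all y's are contiguous.
struct PointsSoA { std::vector<float> x, y; };

void normalize_soa(PointsSoA& pts) {
    for (std::size_t i = 0; i < pts.x.size(); ++i) {
        float inv = 1.0f / std::sqrt(pts.x[i] * pts.x[i] +
                                     pts.y[i] * pts.y[i]);
        pts.x[i] *= inv;
        pts.y[i] *= inv;
    }
}
```

The AVX AOS variant additionally needs an AOS-to-SOA shuffle in registers before it can divide eight lanes at once, which is why the SOA layout is the simpler starting point for vectorization.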
Chapter 4: C++11 Multithreading
The following programs can be compiled with any C++11/14 compliant compiler (GCC 5.4 is sufficient).
Hello World
- hello_world.cpp (saying "hello" using multiple threads)
Returning Values from Threads
- traditional.cpp (passing return values using pointers)
- promise_future.cpp (passing return value using promises and futures)
- packaged_task.cpp (wrapping functions with packaged tasks)
- async.cpp (asynchronously execute functions with return values)
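The promise/future mechanism can be sketched in a few lines (function name is illustrative): the worker thread fulfils the promise, and the spawning thread blocks on the matching future until the value is available, with no explicit mutex or out-pointer.

```cpp
#include <future>
#include <thread>

// The promise is the writing end of a one-shot channel; the future is
// the reading end. set_value unblocks the waiting get.
int square_via_future(int x) {
    std::promise<int> prom;
    std::future<int> fut = prom.get_future();
    std::thread worker([&prom, x] { prom.set_value(x * x); });
    int result = fut.get();   // blocks until the worker has set the value
    worker.join();
    return result;
}
```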
Static Thread Distributions for Matrix Vector Multiplication
- matrix_vector.cpp (pure block and cyclic as well as block-cyclic distributions)
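The two basic static distributions can be written down as index maps (helper names are illustrative): a block distribution gives each thread one contiguous chunk, while a cyclic distribution deals indices out round-robin with stride p.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Indices owned by thread `id` out of `p` threads over `n` items,
// block distribution: contiguous chunks of ceil(n / p) items.
std::vector<std::size_t> block_indices(std::size_t id, std::size_t p,
                                       std::size_t n) {
    std::size_t chunk = (n + p - 1) / p;   // ceiling division
    std::vector<std::size_t> own;
    for (std::size_t i = id * chunk; i < std::min(n, (id + 1) * chunk); ++i)
        own.push_back(i);
    return own;
}

// Cyclic distribution: thread `id` owns id, id + p, id + 2p, ...
std::vector<std::size_t> cyclic_indices(std::size_t id, std::size_t p,
                                        std::size_t n) {
    std::vector<std::size_t> own;
    for (std::size_t i = id; i < n; i += p)
        own.push_back(i);
    return own;
}
```

A block-cyclic distribution combines the two: deal out fixed-size blocks round-robin instead of single indices.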
False Sharing
- false_sharing.cpp (an extreme example for false sharing)
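The standard cure for false sharing can be sketched as follows (struct and function names are illustrative): when two threads increment neighboring counters that share a 64-byte cache line, the line ping-pongs between cores even though the threads never touch each other's data; padding each counter to its own line removes the contention.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// alignas(64) pads each counter to a full cache line, so no two
// counters can share a line (assuming the common 64-byte line size).
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

long count_with_padding(int num_threads, long iterations) {
    std::vector<PaddedCounter> counters(num_threads);
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back([&counters, t, iterations] {
            for (long i = 0; i < iterations; ++i)   // touches only its own line
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : threads) th.join();
    long total = 0;
    for (auto& c : counters) total += c.value.load();
    return total;
}
```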
Dynamic Thread Distributions for All-Pairs Distance Computation
- mnist_exporter.py (execute this to download the MNIST data set)
- all_pair.cpp (all-pairs distance computation using dynamic thread distributions)
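The core of a dynamic distribution can be sketched with a shared atomic counter (function name and the row-sum workload are illustrative stand-ins for the distance computation): instead of assigning rows up front, each thread repeatedly grabs the next unprocessed row, so fast threads automatically take on more work than slow ones.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Each thread fetches the next row index from a shared atomic counter
// until all rows are claimed. Writes go to distinct slots, so no
// further synchronization is needed.
std::vector<long> row_sums_dynamic(const std::vector<std::vector<long>>& rows,
                                   int num_threads) {
    std::vector<long> sums(rows.size(), 0);
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back([&] {
            for (std::size_t r; (r = next.fetch_add(1)) < rows.size(); ) {
                long s = 0;
                for (long v : rows[r]) s += v;
                sums[r] = s;
            }
        });
    for (auto& th : threads) th.join();
    return sums;
}
```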
Condition Variables
- alarm_clock.cpp (modelling a sleeping student using condition variables)
- one_shot_alarm_clock.cpp (modelling sleeping students using promises and shared futures)
- ping_pong.cpp (playing ping pong with condition variables)
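The ping-pong pattern can be sketched like this (function name is illustrative): two threads alternate strictly, each waiting on the condition variable until it is its turn, appending its token, flipping the turn flag, and notifying the other.

```cpp
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>

// The predicate passed to cv.wait guards against spurious wakeups:
// a thread only proceeds when the shared flag says it is its turn.
std::string play_ping_pong(int rounds) {
    std::mutex m;
    std::condition_variable cv;
    bool ping_turn = true;
    std::string log;

    auto player = [&](bool is_ping, const char* token) {
        for (int i = 0; i < rounds; ++i) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return ping_turn == is_ping; });
            log += token;
            ping_turn = !is_ping;   // hand the turn to the other player
            cv.notify_one();
        }
    };
    std::thread ping(player, true, "ping ");
    std::thread pong(player, false, "pong ");
    ping.join();
    pong.join();
    return log;
}
```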
Basic Thread Pool
- threadpool_basic.hpp (basic thread pool implementation)
- main_basic.cpp (exemplary usage of the thread pool)
- main_basic_tree.cpp (exemplary usage of thread pool with recursive submission of tasks)
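A minimal pool in the spirit of the basic implementation might look like this (class layout is a sketch, not the repository's threadpool_basic.hpp): workers block on a condition variable and pop tasks from a shared queue until the pool is shut down.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

public:
    explicit ThreadPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(m);
                        cv.wait(lock, [this] { return done || !tasks.empty(); });
                        if (done && tasks.empty()) return;   // drained, shut down
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    task();   // run outside the lock
                }
            });
    }

    void enqueue(std::function<void()> f) {
        {
            std::lock_guard<std::mutex> lock(m);
            tasks.push(std::move(f));
        }
        cv.notify_one();
    }

    ~ThreadPool() {   // drains remaining tasks, then joins all workers
        {
            std::lock_guard<std::mutex> lock(m);
            done = true;
        }
        cv.notify_all();
        for (auto& w : workers) w.join();
    }
};
```

Recursive submission (as in main_basic_tree.cpp) additionally requires care that a task waiting on its children does not deadlock the pool, which is where the more advanced pools of Chapter 5 come in.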
Chapter 5: Advanced C++11 Multithreading
The following programs can be compiled with any C++11/14 compliant compiler (GCC 5.4 is sufficient). Note that not all CPUs support 128-bit Compare-and-Swap (CAS) operations.
Atomics
- query_atomics.cpp (query properties of atomics)
- atomic_counting.cpp (benchmark atomic counting versus mutex based counting)
- atomic_max.cpp (lock-free computation of maximum)
- arbitrary_atomics.cpp (user-defined conditional atomics)
- universal_atomics.cpp (user-defined ternary atomics)
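The lock-free maximum can be sketched with a CAS loop (function names are illustrative): C++11 atomics have no fetch_max, but one can be built by retrying compare_exchange until either our candidate is installed or a larger value is already present.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Lock-free maximum update: a failed CAS refreshes `observed`, and the
// loop exits as soon as the stored value is at least `value`.
void atomic_update_max(std::atomic<int>& target, int value) {
    int observed = target.load();
    while (observed < value &&
           !target.compare_exchange_weak(observed, value)) {
        // retry: another thread changed target in the meantime
    }
}

int parallel_max(const std::vector<int>& data, int num_threads) {
    std::atomic<int> result{data.empty() ? 0 : data[0]};
    std::vector<std::thread> threads;
    std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back([&, t] {
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(data.size(), lo + chunk);
            for (std::size_t i = lo; i < hi; ++i)
                atomic_update_max(result, data[i]);
        });
    for (auto& th : threads) th.join();
    return result.load();
}
```

The same CAS-loop shape generalizes to arbitrary user-defined conditional and ternary updates, which is the theme of arbitrary_atomics.cpp and universal_atomics.cpp.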
Work-Sharing Thread Pool
- threadpool.hpp (thread pool with dynamic work-sharing capability)
- tree.cpp (concurrently traverse a tree using dynamic work-sharing)
Knapsack Problem
- knapsack.cpp (demonstrating spurious superlinear speedup in branch-and-bound algorithms)
Chapter 6: OpenMP
The following programs can be compiled with a C++11/14 compliant compiler with OpenMP 4.5 support (GCC 6 is sufficient).
Hello World
- hello_world.cpp (OpenMP Hello World)
Vector Addition
- vector_add.cpp (vector addition)
- vector_add_scoped.cpp (vector addition with better scoping)
Matrix Vector Multiplication
- matrix_vector.cpp (matrix vector multiplication)
One-Nearest Neighbor Classifier on MNIST data
- mnist_exporter.py (download MNIST data and labels)
- 1NN.cpp (one-nearest neighbor classifier on MNIST data)
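The sequential core of the classifier can be sketched as follows (function name is illustrative, toy data instead of MNIST): predict the label of the training sample closest in squared Euclidean distance; OpenMP then parallelizes the outer loop over test samples.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Linear scan over the training set; squared distances avoid the sqrt
// since only the argmin matters.
int one_nearest_neighbor(const std::vector<std::vector<float>>& train,
                         const std::vector<int>& labels,
                         const std::vector<float>& query) {
    std::size_t best = 0;
    float best_dist = std::numeric_limits<float>::max();
    for (std::size_t i = 0; i < train.size(); ++i) {
        float d = 0.0f;
        for (std::size_t j = 0; j < query.size(); ++j) {
            float diff = train[i][j] - query[j];
            d += diff * diff;
        }
        if (d < best_dist) {
            best_dist = d;
            best = i;
        }
    }
    return labels[best];
}
```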
Scheduling of Inner Products (Linear Kernels, Covariance Matrices)
- mnist_exporter.py (download MNIST data and labels)
- scheduling.cpp (scheduling for-loops in OpenMP)
Softmax Regression on MNIST (Inference and Training)
- mnist_softmax.py (Tensorflow softmax regression on MNIST)
- softmax.cpp (lovingly hand-crafted OpenMP version of the above)
Custom Reductions
- custom_reduction.cpp (user-defined reduction operations)
- avx_reduction.cpp (you can even combine AVX and OpenMP)
- string_reduction.cpp (counterexample for non-commutative monoids)
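A user-defined reduction can be sketched like this (struct and function names are illustrative): declare reduction names the combiner and the per-thread identity, and the reduction clause then merges the private partial results. Without -fopenmp the pragmas are ignored and the loop simply runs sequentially with the same result.

```cpp
#include <climits>
#include <vector>

// A wrapper type with a user-defined "maximum" reduction.
struct MaxInt { int value; };

#pragma omp declare reduction(maxi : MaxInt : \
        omp_out.value = omp_in.value > omp_out.value ? omp_in.value \
                                                     : omp_out.value) \
        initializer(omp_priv = MaxInt{INT_MIN})

int reduce_max(const std::vector<int>& data) {
    MaxInt m{INT_MIN};
    #pragma omp parallel for reduction(maxi : m)
    for (long i = 0; i < (long)data.size(); ++i)
        if (data[i] > m.value) m.value = data[i];
    return m.value;
}
```

Note that the combiner must be associative and commutative for the merge order to be irrelevant; string concatenation is the classic counterexample explored in string_reduction.cpp.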
Chapter 7: Compute Unified Device Architecture (CUDA)
The following programs can be compiled with CUDA 8 (C++11 compliant) or CUDA 9 (C++14 compliant). The host compiler can be any C++11/C++14 compliant compiler (GCC 5.4.0 is sufficient).
Hello World
- hello_world.cu (CUDA Hello World)
Eigenfaces (Principal Component Analysis)
- README.md (follow instructions to download CelebA data set)
- convert_images.py (convert ~200,000 JPEGs into a binary file)
- mean_computation.cu (compute mean celebrity image)
- mean_correction.cu (adjust data matrix with the mean image)
- covariance.cu (efficiently compute covariance matrix of adjusted data matrix)
- eigenfaces.cu (compute eigenvectors of covariance matrix)
Dynamic Time Warping (C++14 and thus CUDA 9 required)
- dtw_host.cu (host version of Dynamic Time Warping)
- dtw_device.cu (all device versions of Dynamic Time Warping)
Chapter 8: Advanced CUDA Programming
The following programs can be compiled with CUDA 8 (C++11 compliant) or CUDA 9 (C++14 compliant). The host compiler can be any C++11/C++14 compliant compiler (GCC 5.4.0 is sufficient).
Warp Intrinsics and Atomics
- znorm.cu (segmented reductions using warp intrinsics)
- atomics.cu (global reduction using atomics)
- cas.cu (compare and swap on CUDA devices)
Multi GPU and Streaming
- single_gpu.cu (baseline implementation)
- multi_gpu.cu (utilizing multiple GPUs)
- streamed_gpu.cu (utilizing multiple streams)
- multi_streamed_gpu.cu (utilizing multiple GPUs and multiple streams)
Unified Virtual Memory (UVM)
- uvm_minimal_example.cu (UVM minimal example)
Chapter 9: Message Passing Interface
The following programs can be compiled with current OpenMPI distributions in combination with a C++11/C++14 compliant compiler (GCC 5.4.0 is sufficient).
Hello World
- hello_world.cpp (MPI Hello World)
Ping Pong via Point-to-Point Communication
- ping_pong_ring.cpp (blocking ping pong in a ring)
- ping_pong_ring_nonblock.cpp (non-blocking ping pong in a ring)
Computing Primes (Point-to-Point vs. Global Collectives)
- primes_serialized_comm.cpp (serialized reduction using point-to-point communication)
- primes.cpp (parallel reduction using global collective communication primitives)
Jacobi Iteration (Solving the Laplace/Poisson Equation with Stencil Codes)
- jacobi_seq.cpp (sequential Jacobi iteration over 2D domain)
- jacobi_1D_block_simple.cpp (simple 1D partitioned blocking Jacobi iteration over 2D domain)
- jacobi_1D_block.cpp (advanced 1D partitioned blocking Jacobi iteration over 2D domain)
- jacobi_1D_nonblock.cpp (advanced 1D partitioned non-blocking Jacobi iteration over 2D domain)
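The sequential kernel underlying all of these variants can be sketched in a few lines (function name is illustrative): one Jacobi sweep replaces every interior cell by the average of its four neighbors while boundary values stay fixed; iterated to convergence this solves the Laplace equation, and the MPI versions merely partition the grid and exchange halo rows.

```cpp
#include <cstddef>
#include <vector>

// One Jacobi sweep on an n x n grid stored row-major: read from u,
// write to v, leave the boundary untouched.
void jacobi_sweep(const std::vector<double>& u, std::vector<double>& v,
                  std::size_t n) {
    for (std::size_t i = 1; i + 1 < n; ++i)
        for (std::size_t j = 1; j + 1 < n; ++j)
            v[i * n + j] = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j] +
                                   u[i * n + j - 1] + u[i * n + j + 1]);
}
```

The 1D-partitioned MPI versions split the rows across ranks; each sweep then needs the neighboring ranks' border rows, which the blocking and non-blocking variants communicate with Sendrecv and Isend/Irecv respectively.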
Matrix Matrix Multiplication (Complex Communication)
- matrix_mult_rows.cpp (row-oriented matrix matrix multiplication)
- matrix_mult_cols.cpp (column-oriented matrix matrix multiplication)
- matrix_mult_2D.cpp (2D tile-oriented matrix matrix multiplication)
- summa.cpp (Scalable Universal Matrix Multiplication Algorithm)
Chapter 10: Unified Parallel C++
The following programs can be compiled with current UPC++ compilers and executed with GASNet.
Hello World
- hello_world.cxx (UPC++ Hello World)
Axpy (a * x + y)
- axpy.cxx (axpy)
Matrix Vector Multiplication
- matrix_vector.cxx (matrix vector multiplication)
Mandelbrot Set
- view.py (visualize results)
- mandel1.cxx (basic mandelbrot set computation)
- mandel2.cxx (mandelbrot set computation using master-slave approach)
Letter Counting
- letter1.cxx (basic letter counting)
- letter2.cxx (yet more letter counting)
Histograms
- histo1.cxx (histograms using locks)
- histo2.cxx (histograms using multiple locks)
- histo3.cxx (histograms using atomics)