ECE 459: Programming for Performance

Jeff Zarnett

Estimated study time: 5 minutes

Sources and References

Equivalent UW courses — CS 343 (Concurrent and Parallel Programming), CS 350 (Operating Systems), CS 454/654 (Distributed Systems)

Primary textbook — McCool, M., Robison, A. D. and Reinders, J. Structured Parallel Programming: Patterns for Efficient Computation, Morgan Kaufmann, 2012.

Supplementary references — Mattson, T. G., Sanders, B. A. and Massingill, B. L. Patterns for Parallel Programming, Addison-Wesley, 2004; Fog, A. Optimizing Software in C++: An Optimization Guide for Windows, Linux, and Mac Platforms, Copenhagen University College of Engineering, continuously updated; Drepper, U. What Every Programmer Should Know About Memory, Red Hat, 2007 (available on arXiv and the author’s site); Williams, A. C++ Concurrency in Action, 2nd ed., Manning, 2019; Herlihy, M. and Shavit, N. The Art of Multiprocessor Programming, 2nd ed., Morgan Kaufmann, 2020.

Equivalent UW Courses

ECE 459 is a one-term upper-year course on making real programs run faster on real hardware. CS 343 is its closest counterpart on the CS side — both courses teach concurrency primitives (threads, locks, monitors, futures), race conditions, and lock-free techniques, though CS 343 anchors its material in the uC++ language and the theory of monitors and synchronization. CS 350 is the operating systems course, covering process management, memory, file systems, and scheduling at the kernel level, and ECE 459 assumes a working knowledge of that material. CS 454/654 is a single cross-listed course on distributed systems — replication, consensus, fault tolerance — which ECE 459 only brushes against in its coverage of large-scale parallelism.

What This Course Adds Beyond the Equivalents

Relative to CS 343, ECE 459 is more hardware-aware and more measurement-driven. It spends serious time on profiling, benchmarking methodology, cache behavior, false sharing, vectorization and SIMD intrinsics, GPGPU via OpenCL or CUDA, and on optimization at the level of compiler flags, memory allocators, and the linker. It omits the uC++ monitor and coroutine material and the full concurrency theory of CS 343. Compared with CS 350, it assumes rather than teaches the operating system abstractions and focuses on how a programmer exploits them. Compared with CS 454/654, it stays on a single machine or cluster and does not cover consensus protocols, replication, or the CAP theorem.

Topic Summary

Performance, Amdahl’s and Gustafson’s Laws

The course begins with why performance matters and how to measure it honestly. Amdahl’s law gives the speedup ceiling for a fixed workload with a serial fraction \( s \):

\[ S(n) = \frac{1}{s + (1-s)/n}. \]

Gustafson’s law reframes this for workloads that grow with the number of processors. Both are used to set realistic expectations before optimizing.

Profiling and Measurement

Students learn to profile with sampling and instrumentation tools (perf, gprof, valgrind/callgrind, flamegraphs), to distinguish wall-clock from CPU time, to identify hotspots, and to reason about statistical noise. Benchmarking pitfalls — warm-up, compiler elimination of unused results, CPU frequency scaling — are discussed directly.

Single-Threaded Optimization

Before going parallel, the course covers what a compiler can and cannot do: inlining, loop unrolling, auto-vectorization, strength reduction, and the effect of aliasing and restrict. Data-structure layout, cache-line alignment, and avoiding pointer chasing follow from Drepper’s memory hierarchy material.

Threads, Locks, and Races

POSIX threads and modern C++ threads are used to introduce mutexes, condition variables, read-write locks, and the memory model. Data races, deadlock, livelock, and priority inversion are defined, along with standard avoidance strategies such as lock ordering and hierarchical locking.

Lock-Free and Transactional Techniques

Atomic operations, compare-and-swap, the ABA problem, hazard pointers, and simple lock-free queues show how to avoid locks where they hurt throughput. Software and hardware transactional memory are introduced as alternative concurrency models.

Cache Coherence and False Sharing

MESI-style cache coherence is sketched to explain why writes to nearby addresses from different cores ruin performance. Padding, per-thread buffers, and NUMA-aware allocation are the practical remedies.

SIMD and Vectorization

SSE/AVX intrinsics, compiler auto-vectorization, and the pattern of structure-of-arrays versus array-of-structures are presented as the first level of data parallelism. Alignment, masking, and gather/scatter operations are covered through small examples.

GPU Programming

OpenCL or CUDA is introduced for massively parallel workloads: the host/device model, kernels, work-groups, shared memory, memory-coalescing rules, and when offloading actually pays off after the transfer cost.

Distributed and Message-Passing Parallelism

MPI-style message passing and high-level frameworks are covered briefly, mainly to contrast shared-memory and distributed-memory costs. The focus is on latency and bandwidth trade-offs rather than on consensus or fault tolerance.

High-Performance Languages and Ecosystems

The course touches on Rust’s ownership model as a safer concurrency substrate, on language-level parallel primitives (OpenMP, Intel TBB, C++17 parallel algorithms), and on the role of just-in-time and ahead-of-time compilation in performance-critical deployments.

Queueing and Scalability

Little’s law \( L = \lambda W \) and elementary queueing results are used to reason about throughput and latency in server workloads, tying performance programming to system-level capacity planning.
