CS 343: Concurrent and Parallel Programming

Peter Buhr

Estimated study time: 1 hr 39 min

Source credit: These notes are based on the official course notes by Peter Buhr, available at student.cs.uwaterloo.ca/~cs343, distributed under a personal/educational use permission. Content has been substantially paraphrased, restructured, and enriched with material from MIT 6.004, Stanford CS149, and CMU 15-418.


Part I: Control Flow Foundations

Chapter 1: Advanced Control Flow

Good control flow discipline is not merely stylistic — it directly affects whether your programs are correct, readable, and maintainable. This chapter revisits the fundamentals of structured control flow, with particular attention to patterns that eliminate the need for flag variables and expose the inherent structure of an algorithm.

Loop Forms and the Multi-Exit Loop

The while and for loops are interchangeable when only a predicate governs the loop. The for loop, however, has a structural advantage: it makes it straightforward to add or remove a loop index for debugging without restructuring the surrounding code. You should not use while to simulate for when a loop variable is involved.

Both while and for test their exit condition at the top. The do-while loop tests at the bottom, making it appropriate when you need to execute the body at least once. A third form — the multi-exit loop (sometimes called a mid-test loop) — exits at one or more points within the body:

for ( ;; ) {
    cin >> d;
    if ( cin.fail() ) break;   // middle exit
    // process d
}

This pattern eliminates priming code — the duplicated read required before a while loop and again inside it:

// BAD: priming duplication
cin >> d;
while ( !cin.fail() ) {
    // process d
    cin >> d;
}

A critical discipline: a loop exit never needs an else clause. If a condition causes a break, all subsequent statements are logically part of the loop body when the condition is false — do not nest them in an else.

Static Multi-Level Exit

Static multi-level exit allows a break or continue to jump out of multiple nested control structures simultaneously. In µC++ and Java, this is done with labelled break and continue. In C/C++, the same effect requires goto:

// µC++/Java style — clean, readable
FR: for ( int i = 0; i < 10; i += 1 ) {
    SW: switch ( x ) {
        BK: {
            for ( int j = 0; j < 10; j += 1 ) {
                if ( . . . ) break FR;    // exit outer loop
                if ( . . . ) break SW;    // exit switch
                if ( . . . ) break BK;    // exit block
            }
        }
    }
}
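
In standard C/C++, which lacks labelled break, the conventional equivalent is a forward goto to a label placed after the nested structures. A minimal sketch (find and the label name are illustrative):

```cpp
// Standard C++ has no labelled break; a forward goto to a label placed
// *after* the nested loops gives the same static multi-level exit.
int find( int target ) {
    for ( int i = 0; i < 10; i += 1 ) {
        for ( int j = 0; j < 10; j += 1 ) {
            if ( i * 10 + j == target ) goto found;   // exit both loops at once
        }
    }
    return -1;       // both loops exhausted: not found
  found:
    return target;   // reached only via the goto
}
```

Because the goto only branches forward to the end of the structure, it retains the discipline of labelled break, but the compiler cannot enforce that discipline.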

The central benefit is the elimination of flag variables — boolean variables whose only purpose is to affect control flow. Such variables are the variable equivalent of goto: they can be set, reset, and tested at arbitrary locations, obscuring the actual flow of the program. Compare:

// BAD: flag variables obscure structure
bool flag1 = false;
for ( int i = 0; i < 10 && !flag1; i += 1 ) {
    bool flag2 = false;
    for ( int j = 0; j < 10 && !flag1 && !flag2; j += 1 ) {
        if ( C1 ) flag2 = true;
        else if ( C2 ) flag1 = true;
    }
}
if ( flag1 ) E1; else E2;

// GOOD: structure is explicit
F1: for ( int i = 0; i < 10; i += 1 ) {
    F2: for ( int j = 0; j < 10; j += 1 ) {
        if ( C1 ) { E2; break F2; }
        if ( C2 ) { E1; break F1; }
    }
}

Labelled break and continue are a disciplined goto: they can only branch forward and cannot jump into a control structure, which prevents the spaghetti pathologies of unrestricted goto use.

There is one legitimate use for a flag variable: when state must be carried from one inner lexical scope to another. For example, if a command-line argument is optionally present and affects logic in a later loop, a flag variable retains that state across the two scopes. This is structurally necessary, not a code smell.

Dynamic Memory Allocation

Stack allocation is always preferable to heap allocation when it is possible. It is faster, requires no explicit deallocation, and cannot leak. The idiom “use the stack, Luke” captures this preference.

There are exactly four situations where heap allocation is necessary:

  1. Storage that must outlive its creating block — when a function constructs an object and returns it to the caller, the object must be heap-allocated because the function’s stack frame is gone on return.

  2. Input of unknown size — when reading a sequence of values with no bound known in advance, a dynamically growing container (which internally heap-allocates) is required.

  3. Array elements requiring different constructor arguments — if an array’s elements are of a type with no default constructor, or if each must be initialized with a distinct value, ordinary stack arrays fail. The µC++ macro uArray handles this efficiently:

cin >> size;
uArray( S, sa, size );   // stack allocation, O(1) time
for ( int id = 0; id < sa.size(); id += 1 )
    sa[id]( id );        // constructor call with argument

This is superior to std::unique_ptr<S[]> or std::vector<S>: both place the elements on the heap, and neither lets each element be constructed with a distinct argument (unique_ptr<S[]> default-constructs every element, and vector default- or copy-constructs them). uArray allocates a single uninitialized block and constructs each element on demand with placement new.

  4. Large arrays on small stacks — coroutine stacks are typically 256 KB by default. An array of 100,000 structs will overflow such a stack. Use uArrayPtr for heap-backed but RAII-managed arrays.
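
Standard C++ can approximate uArray's per-element construction with one raw allocation plus placement new. A sketch of the underlying idea (demo and S are illustrative names; uArray additionally puts the block on the stack rather than the heap):

```cpp
#include <new>   // placement new

struct S {
    int id;
    explicit S( int id ) : id(id) {}   // no default constructor
};

// One raw allocation, then construct each element with a distinct argument.
int demo( int size ) {
    char * raw = new char[ sizeof(S) * size ];   // no constructors run yet
    S * sa = reinterpret_cast<S *>( raw );
    for ( int i = 0; i < size; i += 1 )
        new ( &sa[i] ) S( i );                   // per-element constructor argument
    int check = sa[ size - 1 ].id;
    for ( int i = size - 1; i >= 0; i -= 1 )
        sa[i].~S();                              // explicit destruction, reverse order
    delete [] raw;
    return check;
}
```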

To diagnose memory usage, calling malloc_stats() at program exit prints heap allocation statistics. Alternatively, exporting the shell variable MALLOC_STATS (MALLOC_STATS=; export MALLOC_STATS) enables automatic reporting.


Chapter 2: Nonlocal Transfer

When a routine is called, the call/return mechanism dictates that control always returns to the statement immediately following the call. But many real algorithms have multiple outcomes: a normal completion and one or more ancillary outcomes that should transfer control elsewhere. This chapter examines the problem of nonlocal transfer — returning from a routine to somewhere other than the call site.

The Multi-Outcome Pattern

Subroutines in Fortran recognized this early with alternate return parameters: a subroutine could return to a labelled statement in the caller rather than to the next instruction. While crude, this acknowledged the fundamental truth that algorithms have multiple outcomes and that separating those outcomes makes programs more readable.

The problem deepens with modularization. If you refactor the inner body of a nested loop into a helper routine, static exit labels in the original become unavailable to the helper. The helper must return one level at a time, and intermediate code must propagate the exceptional outcome upward — a tedious, error-prone process.

Traditional Approaches

Languages have historically used four mechanisms to handle multiple outcomes:

Return codes — the routine returns a value encoding success or failure (e.g., printf returning a negative value on error). The problem: checking is optional. A careless caller ignores the return code entirely, and the exceptional case propagates silently.

Status flags — a shared (often global) variable like errno is set when an error occurs. The value persists until overwritten. This is fragile in concurrent programs because another thread may overwrite errno before the first thread examines it. Worse, the flag can be checked at any time, so there is no structural guarantee it relates to the most recent operation.

Fix-up routines — a routine pointer is passed to the called function; if an exceptional event occurs, the fix-up routine is called. This improves composability but pollutes every function signature with fix-up parameters, even when they are irrelevant to the function’s logic.

Return unions (modern) — the C++17 std::optional and std::variant types package a result together with a status. The caller must explicitly check the status before accessing the value:

optional<int> rtn() {
    optional<int*> p = Malloc( sizeof(int) );   // Malloc: checked wrapper, nullopt on failure
    if ( !p ) return nullopt;                   // forward the failure manually
    **p = 7;                                    // *p is the int*, **p the pointed-to int
    if ( random() % 2 ) return **p;
    return nullopt;
}

Return unions are strictly better than raw return codes, but they still share a fundamental weakness: exceptional propagation is passive. The programmer must remember to check; the type system does not enforce it. More critically, each function in a call chain must explicitly forward the exceptional case upward, compounding the bookkeeping.
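
The compounding bookkeeping is visible in even a short call chain: every intermediate level must test and forward the empty case by hand. A standard C++ sketch (parseDigit, doubled, and plusOne are hypothetical helpers):

```cpp
#include <optional>

// Lowest level: fails on non-digit input.
std::optional<int> parseDigit( char c ) {
    if ( c < '0' || c > '9' ) return std::nullopt;
    return c - '0';
}

// Each intermediate level must check and manually forward the failure.
std::optional<int> doubled( char c ) {
    std::optional<int> d = parseDigit( c );
    if ( !d ) return std::nullopt;       // explicit forwarding
    return *d * 2;
}

std::optional<int> plusOne( char c ) {
    std::optional<int> d = doubled( c );
    if ( !d ) return std::nullopt;       // forwarded again, one level at a time
    return *d + 1;
}
```

With exceptions, only parseDigit would raise and only the top-level caller would handle; the two middle checks disappear.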

Dynamic Multi-Level Exit

Dynamic multi-level exit (DME) breaks out of this one-level-at-a-time straitjacket. Rather than requiring each intermediate routine to pass an exceptional outcome upward, DME allows a deeply nested routine to transfer directly to a handler at any level on the call stack.

In C, this is implemented with setjmp/longjmp. A label variable (jmp_buf) holds two pieces of information: a pointer to a stack frame, and a transfer point within that frame’s routine. longjmp is a two-step operation: first it restores execution to the saved stack frame, then it jumps to the saved program counter within that frame.

The key distinction from static multi-level exit is that the destination is determined at runtime, not at compile time. This allows DME to work across separately compiled modules and recursive calls — but it also means the destination cannot be statically verified, introducing the classic goto pathologies in a more powerful form.
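
A minimal standard-C sketch of DME with setjmp/longjmp, in which a deeply nested call transfers directly past all its intermediate callers (runDme and the constant 42 are illustrative):

```cpp
#include <csetjmp>

static std::jmp_buf handler;   // label variable: stack-frame pointer + transfer point

void innermost( int depth ) {
    if ( depth == 0 )
        std::longjmp( handler, 42 );   // transfer directly to the setjmp point
    innermost( depth - 1 );            // intermediate frames are discarded, not returned through
}

int runDme() {
    int rc = setjmp( handler );        // 0 on the direct call, longjmp's value on transfer
    if ( rc == 0 ) {
        innermost( 5 );
        return -1;                     // never reached: longjmp always fires
    }
    return rc;                         // 42 arrives here from deep in the recursion
}
```

Note the danger the text describes: nothing stops longjmp from targeting a jmp_buf whose frame has already returned, which is undefined behaviour.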

Exception Handling

Exception handling is a structured form of DME. It replaces the raw jmp_buf/longjmp mechanism with a discipline that includes:

  • Exception types: named categories of exceptional events
  • Raise (throw/_Resume): creates an exception and initiates propagation
  • Propagation: the mechanism for locating an appropriate handler
  • Handlers (catch/_CatchResume): code that responds to a raised exception

The crucial advantage over traditional techniques is that exception handling is active: when an exceptional event occurs, control transfers immediately to a handler. The programmer cannot accidentally ignore the exception. The code at the raise site does not need to forward the exception manually — propagation is automatic.

An exceptional event is not necessarily an error. It is any situation that is known to exist but is ancillary to the primary algorithm — infrequent, but possible. End-of-file, division by zero, popping an empty stack: these are exceptional events. The word “exception” does not imply surprise or catastrophe; it implies a secondary control path.

Execution Environment and Implementation

The execution environment significantly affects what an EHM must do. In an object-oriented language, objects may have destructors that must run regardless of how a block is exited. If an exception propagates through a frame containing local objects, those objects must be properly destroyed — a process called stack unwinding. Similarly, finally clauses (Java) and _Finally clauses (µC++) must always execute:

// µC++: _Finally always executes even on exceptional exit
L: try {
    infile = new ifstream( "abc" );
    if ( . . . ) break L;   // normal exit
} _Finally {
    infile->close();         // always executes
    delete infile;
}

Stack unwinding means that termination exceptions are expensive to raise but essentially free when no exception occurs — guarded block entry (the try keyword) has near-zero cost. The exception type table is stored statically; a stack walk occurs only when a throw is executed. For code that raises exceptions rarely, this asymmetry is ideal.

The Static/Dynamic Call/Return Taxonomy

All control flow can be characterized along two axes: whether the name is looked up statically (at compile time) or dynamically (at runtime), and whether control returns to the static context (where the handler was defined) or the dynamic context (where the raise occurred).

                     Static call/raise    Dynamic call/raise
  Static return      sequel               termination exception
  Dynamic return     routine              resumption exception / routine pointer

A sequel is a routine whose name is resolved statically but which returns to the end of the block in which it is declared — not to after the call. This allows multi-level exit to be modularized within a monolithic program:

A: for ( ;; ) {
    sequel S1( . . . ) { . . . }   // returns to end of block A
    void M1( . . . ) { . . . if ( . . . ) S1( . . . ); . . . }
    // ...
}   // S1 returns here

Sequels are efficient (the destination is statically known) but fail for library code, where the sequel’s definition and the call site are in separate compilation units. Termination and resumption exceptions overcome this limitation at the cost of dynamic (runtime) handler lookup.

Termination (the standard throw/catch) unwinds the stack from raise to handler — the raise point is gone. Resumption returns control to the raise point after the handler executes — no stack unwinding occurs. Resumption is analogous to calling a fix-up routine, but without polluting function signatures:

void f( . . . ) {
    if ( . . . ) _Resume E();   // raises resumption exception
    // control returns here after handler executes
}
int main() {
    try {
        f( . . . );
    } _CatchResume( E ) {
        // handler executes, then control returns to raise point in f
    }
}

The termination/resumption distinction appears in the µC++ _Throw versus _Resume keywords and their corresponding catch versus _CatchResume handlers.
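
Standard C++ offers termination only; resumption can be emulated by passing a fix-up routine, which exhibits exactly the signature pollution that language-level resumption avoids. A sketch (process, fixupDemo, and the fixup parameter are illustrative):

```cpp
#include <functional>

// Emulated resumption: on a bad value the "handler" runs and control
// returns to the raise point with a corrected value -- but the handler
// must be threaded through the signature of every routine that may raise.
int process( int x, const std::function<int(int)> & fixup ) {
    if ( x < 0 ) x = fixup( x );   // the "raise"; execution continues here afterwards
    return x * 2;
}

int fixupDemo() {
    // the caller supplies the handler, negating bad values
    return process( -3, []( int bad ) { return -bad; } );
}
```

µC++'s _CatchResume gives the same return-to-raise-point behaviour while the handler is located dynamically on the stack, so process would need no extra parameter.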


Chapter 3: The µC++ Exception Handling Model

µC++ extends C++ with a carefully designed exception handling mechanism (EHM) that addresses weaknesses in the standard C++ model. Understanding it here prepares us for the concurrent exception propagation we will need later.

Exception Types and Raising

In µC++, exceptions must be instances of types defined with _Exception:

_Exception FileError { . . . };   // exception type

Every exception type inherits from uBaseException, which provides:

  • message() — a string describing the exception
  • source() / sourceName() — the coroutine or task that raised it
  • defaultTerminate() / defaultResume() — called if the exception goes unhandled

There are two raising operations: _Throw creates a termination exception (stack unwinds); _Resume creates a resumption exception (stack is preserved). Either can include an _At clause to deliver the exception to another execution context — the mechanism underlying nonlocal exceptions in concurrent programs.

A subtle but important difference from standard C++: when µC++ raises an exception via _Throw t where t is of derived type D but is referenced through a base-type pointer B*, the actual type D is preserved. Standard C++ slices the exception to type B, losing derived-class information. µC++ maintains the true dynamic type, so handlers can catch the most specific applicable type.
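
The C++ slicing behaviour is easy to demonstrate: throwing through a base reference copies the exception object with the operand's static type, so the derived part is lost (whatCaught, B, and D are illustrative names):

```cpp
#include <string>

struct B { virtual ~B() {} };
struct D : B {};

// In standard C++, "throw b" copy-initializes the exception object from
// the *static* type of the operand (B), slicing away the derived part,
// so a catch ( D & ) clause can never match.
std::string whatCaught() {
    D d;
    B & b = d;
    try {
        throw b;                  // sliced to B
    } catch ( D & ) {
        return "derived";         // µC++ would land here; standard C++ cannot
    } catch ( B & ) {
        return "base";            // standard C++ lands here
    }
    return "unreached";
}
```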

Resumption Handlers

A _CatchResume handler acts like a fix-up routine whose body executes and then returns control to the raise point. All _CatchResume clauses must appear before any catch clauses in a try block:

try {
    f( . . . );
} _CatchResume( E1 & e ) {
    // fix up the problem; control returns to raise point
} catch( E2 & e ) {
    // terminate: stack unwinds to here
}

The handler body executes in the lexical scope of the try block through a lexical link — similar to the this pointer for objects but pointing to the enclosing block’s stack frame. This means the handler can access local variables of the block just as a lambda with [&] capture would. No break, continue, goto, or return is permitted from a resumption handler — these would violate the return-to-raise-point semantics. If correction is impossible, the handler should _Throw a new exception to initiate termination propagation.

Nonlocal Exceptions

When execution consists of multiple coroutines each with their own stacks, exceptions can be raised in one execution context and delivered to another. µC++ calls these nonlocal exceptions:

_Resume E() _At coroutine_id;   // raise E at another coroutine

Delivery is deferred: the target coroutine receives the exception only when it becomes active at a designated detection point (inside _Enable, resume(), or suspend()). This prevents races where an exception arrives while the target is in an inconsistent state.

Nonlocal delivery is disabled by default; a coroutine must explicitly open a window for delivery with _Enable:

try {
    _Enable {        // delivery enabled in this block
        suspend();   // exception may be delivered here
    }
} _CatchResume( E ) { . . . }
  catch( E ) { . . . }

Nested _Enable and _Disable blocks compose additively — entering a more specific _Disable within an _Enable temporarily suppresses specific exception types. This fine-grained control is essential for building correct concurrent programs where some phases of a coroutine’s logic must be protected from interruption.

A concrete inter-task exception example: a timeout task that cancels a worker if it runs too long:

_Exception Timeout {};

_Task Worker {
    void main() {
        try {
            _Enable<Timeout> {       // open delivery window
                for ( int i = 0; i < 1000000; i++ ) {
                    heavyWork( i );
                    yield();         // detection point: exception may arrive here
                }
            }
        } catch ( Timeout & ) {
            cout << "Worker cancelled by timeout" << endl;
        }
    }
};

_Task Timer {
    Worker & w;
    void main() {
        uThisTask().sleep( uDuration( 0, 500000000 ) );   // wait 500 ms
        _Resume Timeout() _At w;    // deliver nonlocal exception to worker
    }
public:
    Timer( Worker & w ) : w(w) {}
};

int main() {
    Worker w;
    Timer  t(w);
}   // both tasks run concurrently; Timer cancels Worker after 500 ms

The _Resume Timeout() _At w line raises a nonlocal exception at the worker. The exception is not delivered immediately — it waits until the worker, with delivery enabled, reaches a detection point inside its _Enable block. Because the worker installs no _CatchResume handler, the default resumption action rethrows Timeout as a termination exception, which the catch clause handles. Deferred delivery prevents the exception from arriving while the worker is in a critical section.


Part II: Coroutines

Chapter 4: Coroutines and Cooperative Multitasking

A coroutine is a generalization of a routine. A routine always begins at its first statement when called and its local variables vanish when it returns. A coroutine, by contrast, can suspend at any point and resume from exactly that point on the next call. Its local variables persist across suspensions. This seemingly small change in semantics opens a wide class of programming problems that are extremely awkward to solve with ordinary routines.

The canonical example is a generator: a computation that produces a sequence of values on demand. The Fibonacci sequence provides a clear illustration. Using ordinary routines, you must store the entire execution state (which of three cases applies, and the previous two values) in explicit global or member variables. The code no longer resembles the mathematical definition of the sequence. Using a coroutine, the structure mirrors the definition directly:

_Coroutine Fibonacci {
    int fn;           // communication variable (result)
    void main() {
        int fn1, fn2;
        fn = 0; fn1 = fn;
        suspend();              // yield first value
        fn = 1; fn2 = fn1; fn1 = fn;
        suspend();              // yield second value
        for ( ;; ) {
            fn = fn1 + fn2; fn2 = fn1; fn1 = fn;
            suspend();          // yield subsequent values
        }
    }
public:
    int operator()() {
        resume();               // transfer to coroutine
        return fn;
    }
};

There is no explicit state variable, no switch statement. The coroutine’s execution location — where it suspended last — is its implicit state. This is what Buhr calls coroutine “Zen”: let the coroutine’s own flow of control manage the state rather than encoding state explicitly in variables.
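
For contrast, the non-coroutine version must encode the suspension point as data: a state counter plus the two previous values. A standard C++ sketch (this Fibonacci is a plain class, not a coroutine):

```cpp
// Explicit-state Fibonacci: the coroutine's "execution location" becomes a
// state variable, and its locals become members that outlive each call.
class Fibonacci {
    int state = 0;           // which of the three cases applies
    int fn1 = 0, fn2 = 0;    // previous two values, kept alive between calls
  public:
    int operator()() {
        int fn;
        switch ( state ) {
          case 0:  fn = 0; fn1 = fn; state = 1; break;
          case 1:  fn = 1; fn2 = fn1; fn1 = fn; state = 2; break;
          default: fn = fn1 + fn2; fn2 = fn1; fn1 = fn; break;
        }
        return fn;
    }
};
```

The switch on state is precisely the machinery the coroutine's implicit execution location makes unnecessary.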

What a Coroutine Is

A coroutine’s execution state has three components:

  • Execution location: the program counter at the last suspend point (or the beginning, if not yet started)
  • Execution state: the stack holding all local variables and the frames of routines called from the coroutine
  • Execution status: one of inactive (suspended), active (currently running), or terminated (main has returned)

Each coroutine has its own private stack. In µC++, the default stack size is 256 KB; it can be specified in the constructor. Unlike threads, coroutines execute synchronously — only one coroutine is active at a time, and context switches happen only at explicit resume() and suspend() calls.

Coroutines are the conceptual predecessor to concurrent tasks. They introduce the idea of suspension and resumption on separate stacks without the complexity of true parallelism. Once you understand coroutines, the step to concurrent tasks becomes one of adding scheduling and preemption rather than a fundamentally different execution model.

Semi-Coroutines

A semi-coroutine operates asymmetrically: when it suspends, control always returns to the coroutine that last resumed it. This is analogous to the caller/callee relationship in ordinary routines. The formatter example illustrates the contrast between approaches:

The direct solution has a clean structure with nested loops, but it produces output imperatively — it cannot be turned into a component that accepts one character at a time. The routine solution flattens the loop structure into a single function with explicit state variables g and b. The class solution encapsulates the state but still uses explicit counters. The coroutine solution:

_Coroutine Format {
    char ch;          // input from caller
    int g, b;
    void main() {
        for ( ;; ) {                           // for as many characters
            for ( g = 0; g < 5; g += 1 ) {    // groups of 5 blocks
                for ( b = 0; b < 4; b += 1 ) {// blocks of 4 characters
                    for ( ;; ) {
                        suspend();
                        if ( ch != '\n' ) break;
                    }
                    cout << ch;
                }
                cout << "  ";
            }
            cout << endl;
        }
    }
public:
    Format() { resume(); }   // prime coroutine to first suspend
    void prt( char ch ) { Format::ch = ch; resume(); }
};

The nested loop structure of the direct solution is completely preserved. The coroutine converts the reads of the direct solution into suspends — each suspend() marks a point where the coroutine expects the next character to be delivered via the prt member.

The constructor calls resume() to advance the coroutine main to its first suspend() — a technique called priming. Without priming, the first call to prt would start the coroutine main from the beginning rather than delivering the character to the right place.

Correct Coroutine Construction

The recommended approach to building a coroutine is to first write a direct solution — a standalone program that does the computation. Then convert it to a coroutine by:

  1. Moving the processing code into void main().
  2. Converting writes (if the coroutine produces) or reads (if it consumes) to suspend() calls.
  3. Using public member functions and shared member variables to transfer data in and out.

This approach works because the coroutine’s flow control manages all execution state implicitly. The direct solution for the formatter has four nested loops; the coroutine preserves all four. A common mistake is to rewrite the coroutine main as a single loop with a switch statement that explicitly tracks state — this defeats the entire purpose of a coroutine.

Full Coroutines and Cycles

A full coroutine breaks the asymmetry of semi-coroutines. Rather than always returning to the last resumer, a full coroutine can explicitly resume() any other coroutine, forming resume cycles:

void mem() { resume(); }   // resume this coroutine (not its caller)

The semantics of resume() and suspend() in µC++:

  • suspend() deactivates the currently active coroutine and activates its last resumer.
  • resume() deactivates the currently active coroutine and activates this (the coroutine object the member was called on).

The mutual reference problem in full coroutine cycles — Fc x(y), y(x) does not compile because y is not yet declared when x is initialized — is resolved by declaring the objects first and closing the cycle via a separate partner() call.

A full coroutine program has three phases: starting the cycle (creating objects and establishing references), executing the cycle (each coroutine resumes the next), and stopping the cycle (returning control to the program’s main). The stopping phase is subtle: when a coroutine’s main returns, control goes to the starter — the coroutine that performed the first resume() of this coroutine. In a cycle, the starter is often not the last resumer, so explicit resume() calls may be needed to unwind the cycle cleanly.

The classic illustration is a Ping-Pong pair, two full coroutines that take turns resuming each other a fixed number of times:

_Coroutine Pong;   // forward declaration

_Coroutine Ping {
    Pong * pong = nullptr;   // set after both objects exist
    int N;
    void main();             // defined after Pong
public:
    Ping( int N ) : N(N) {}
    void partner( Pong & p ) { pong = &p; }   // close the cycle
    void cycle() { resume(); }   // public member: resume() itself is protected
    void start() { resume(); }   // starter kicks off the cycle
};

_Coroutine Pong {
    Ping & ping;
    int N;
    void main() {
        for ( int i = 0; i < N; i += 1 ) {
            cout << "Pong " << i << endl;
            ping.cycle();      // hand control to Ping
        }
        // if main ever returns, control goes to Pong's starter (Ping)
    }
public:
    Pong( Ping & ping, int N ) : ping(ping), N(N) {}
    void cycle() { resume(); }
};

void Ping::main() {
    for ( int i = 0; i < N; i += 1 ) {
        cout << "Ping " << i << endl;
        pong->cycle();         // hand control to Pong
    }
}

int main() {
    Ping ping( 5 );
    Pong pong( ping, 5 );
    ping.partner( pong );      // close the Ping→Pong reference
    ping.start();              // program main is Ping's starter
}

Execution trace for N=2:

program main → ping.start() → Ping::main begins → "Ping 0" → resume Pong
→ Pong::main begins → "Pong 0" → resume Ping
→ Ping::main continues → "Ping 1" → resume Pong
→ Pong::main continues → "Pong 1" → resume Ping
→ Ping::main finishes (returns) → control to program main (Ping's starter)
→ Pong, still suspended inside its loop, is unwound later by its destructor

Each resume() deactivates the caller and activates the target. When Ping::main returns, control transfers to Ping's starter — the coroutine that performed the first resume() of Ping, here the program main — not to Pong, Ping's last resumer. Pong never terminates normally: it remains suspended inside its loop, and its destructor unwinds its stack when pong goes out of scope. Arranging this chain of starters correctly is the mechanism that allows cycles to terminate cleanly.

Coroutines in Modern Languages

The concept of coroutines appears across many modern languages, though the syntax and semantics vary:

Python (3.5+) uses async/await for coroutines. An async def function is a coroutine that yields control with await, allowing an event loop to drive execution.

JavaScript uses generator functions (function*) that produce values with yield and can be driven by for...of or manually via .next().

C++20 introduces stackless coroutines as a language feature. Unlike µC++ coroutines (which maintain a full private stack), a C++20 coroutine keeps only its own activation frame, typically heap-allocated; it suspends through its promise object and is resumed via a coroutine_handle. This uses far less memory, but the coroutine cannot suspend in the middle of a called function — suspension must occur in the coroutine's own frame. This “stackless” property limits expressiveness compared to µC++’s stackful coroutines.


Part III: Concurrency Foundations

Chapter 5: Concurrency

Concurrency is the ability to have multiple active computations — called processes or threads — at the same point in real time. It is distinct from sequential execution, where only one computation is active at a time, and from parallelism, which implies simultaneously executing on multiple processors. A concurrent program can exhibit concurrency even on a single processor through time-slicing.

Why Write Concurrent Programs?

Three motivations drive concurrent programming:

Natural problem structure: many real-world systems have inherently concurrent components — a web server handling multiple simultaneous clients, a GUI responding to user events while performing background computation. Modeling these systems sequentially requires artificial sequencing of naturally concurrent activities.

Performance: Moore’s Law has shifted from increasing clock frequency to increasing core count. Exploiting modern hardware requires parallelism. Amdahl’s Law quantifies the limit: if a fraction s of a program is sequential, the maximum speedup with N processors is \( \frac{1}{s + \frac{1-s}{N}} \). As \(N \to \infty\), speedup is bounded by \(\frac{1}{s}\). A program that is 5% sequential cannot exceed 20× speedup regardless of processor count.

I/O overlap: while one thread waits for a disk read or network response, other threads can continue useful work. This improves throughput even on a single processor.
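
Amdahl's bound from the performance paragraph above is easy to check numerically (speedup is a hypothetical helper implementing the formula directly):

```cpp
// Amdahl's Law: maximum speedup on N processors when fraction s is sequential.
double speedup( double s, double N ) {
    return 1.0 / ( s + ( 1.0 - s ) / N );
}

// With s = 0.05: speedup(0.05, 8) ≈ 5.93, and even enormous N stays below
// the asymptotic bound 1/s = 20.
```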

Why Concurrency is Difficult

Concurrency introduces non-determinism. The relative speed of execution among threads depends on scheduling decisions that are outside the programmer’s control. This means a program may behave differently on different runs even with the same inputs — a property that makes bugs difficult to reproduce and reason about. A bug that appears once in a million runs (a Heisenbug) may be catastrophic in a safety-critical system.

The fundamental challenge is that concurrent threads share state, and operations on shared state that appear atomic at the source level are typically not atomic at the hardware level. An increment x += 1 compiles to three operations: load, add, store. Two concurrent threads both incrementing x may interfere, producing x+1 instead of x+2 — a race condition.
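
The lost-update race is why the increment must be made a single indivisible operation. A standard C++ sketch using std::atomic (raceFree and N are illustrative; with a plain int the same program may total less than 2*N):

```cpp
#include <thread>
#include <atomic>

// Two threads each add N increments to a shared counter. With a plain int
// the load/add/store triples can interleave and lose updates; std::atomic
// makes each increment one indivisible read-modify-write.
long raceFree( long N ) {
    std::atomic<long> x{ 0 };
    auto work = [&x, N]() {
        for ( long i = 0; i < N; i += 1 )
            x.fetch_add( 1 );                  // atomic increment
    };
    std::thread t1( work ), t2( work );
    t1.join();
    t2.join();
    return x.load();                           // always exactly 2 * N
}
```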

Concurrent Hardware

A uniprocessor runs one thread at a time; the illusion of concurrency comes from rapid context switching driven by a timer interrupt. A multiprocessor runs multiple threads simultaneously on separate cores, each with its own register file and cache. Distributed systems connect separate machines with no shared memory.

The threading model describes how user-level threads map to kernel-level threads and to hardware processors:

  • 1:1 (kernel threads): each user thread has its own kernel thread. System calls block only the calling thread. High overhead — kernel thread creation and context switching are expensive.
  • M:1 (user threads): many user threads multiplex a single kernel thread. Cheap context switching (no kernel involvement), but a blocking system call blocks all user threads.
  • M:N (hybrid): M user threads on N kernel threads. The M:N scheduler dispatches user threads onto kernel threads, providing cheap context switches and parallelism. This is what µC++ uses.

Execution States

A thread transitions through several states during its lifetime:

  • New: the thread has been created but not yet started
  • Ready: eligible to run, waiting for a processor
  • Running: actively executing on a processor
  • Blocked: waiting for an event (I/O, lock, synchronization)
  • Halted: execution has completed

Scheduling is the policy for choosing which ready thread runs next. Preemptive scheduling forcibly context-switches a running thread (via a timer interrupt) to give others a turn. Non-preemptive scheduling relies on threads voluntarily yielding — simpler but vulnerable to a thread monopolizing the processor.

Thread Creation

Several mechanisms exist for creating concurrent threads:

COBEGIN/COEND (Dijkstra, 1965): a structured parallel block where all named routines execute concurrently, and the block completes when all routines finish. This is sometimes called fork-join parallelism and maps naturally to recursive divide-and-conquer:

// µC++: create and join threads in one structured block
COBEGIN
    T1();   // thread 1
    T2();   // thread 2
    T3();   // thread 3
COEND       // wait for all three

START/WAIT: explicit fork and join. START creates a new thread; WAIT blocks until a specific thread finishes. Less structured but more flexible than COBEGIN/COEND.

Actors: concurrent objects that process messages from a mailbox, one message at a time. Each actor is created with new, sent a message with |, and self-destructs by returning Delete from its receive member. The µC++ actor system starts with uActor::start() and waits for termination with uActor::stop().

Thread objects (_Task in µC++): the preferred mechanism in this course. A _Task is a class with a void main() member that executes concurrently. The task starts when the object is created and terminates when main() returns. The destructor implicitly waits for main() to finish — making delete task_ptr the join operation:

_Task Adder {
    int * row; int cols; int & subtotal;
    void main() {
        subtotal = 0;
        for ( int c = 0; c < cols; c += 1 ) subtotal += row[c];
    }
public:
    Adder( int row[], int cols, int & subtotal )
        : row(row), cols(cols), subtotal(subtotal) {}
};
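A standard C++ analogue of the Adder task (a sketch using std::thread rather than µC++ _Task; sumRows and the lambda are illustrative names): the thread starts in the std::thread constructor, and the explicit join() plays the role of µC++'s implicit destructor join.

```cpp
#include <thread>
#include <vector>
#include <numeric>

// Each thread sums one row into its own subtotal slot -- no sharing, no locks.
long sumRows( const std::vector<std::vector<int>> & m ) {
    std::vector<long> subtotals( m.size() );
    std::vector<std::thread> adders;
    for ( std::size_t r = 0; r < m.size(); r += 1 )
        adders.emplace_back( [&m, &subtotals, r]() {
            subtotals[r] = std::accumulate( m[r].begin(), m[r].end(), 0L );
        } );
    for ( auto & t : adders ) t.join();  // explicit join (µC++: delete / scope exit)
    return std::accumulate( subtotals.begin(), subtotals.end(), 0L );
}
```

Each thread writes only its own subtotals slot, so no synchronization is needed until the final accumulation after all joins.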

Speedup and Amdahl’s Law

Linear speedup means doubling processors halves execution time. In practice, speedup is sub-linear due to serial bottlenecks (Amdahl’s Law), synchronization overhead, and memory contention. Super-linear speedup is occasionally observed when parallelism allows the working set to fit in cache in ways that the sequential version could not.
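Amdahl's Law quantifies the serial-bottleneck limit mentioned above. If a fraction \(f\) of the work can be parallelized (leaving \(1-f\) inherently serial), the speedup on \(P\) processors is bounded by:

```latex
S(P) \;=\; \frac{T_1}{T_P} \;\le\; \frac{1}{(1 - f) + f/P}
\qquad\Longrightarrow\qquad
\lim_{P \to \infty} S(P) \;=\; \frac{1}{1 - f}
```

With f = 0.9, no number of processors yields more than 10× speedup: the serial 10% dominates once P is large.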

The critical path of a task graph — the longest chain of dependent operations — sets an absolute lower bound on execution time regardless of how many processors are used. Maximizing throughput requires minimizing the critical path, not just maximizing parallelism. Greedy scheduling (never idling a processor while some task is ready) is guaranteed by Graham's bound to finish within the total work divided by the processor count plus the critical-path length — at most twice optimal, since the optimum can beat neither term.

Designing Parallel Programs: The Four-Step Methodology

Stanford CS149 and CMU 15-418 teach a four-phase framework for decomposing any problem into a parallel program. Working through these phases systematically avoids the common trap of parallelizing code before understanding its structure:

1. Decomposition — identify tasks that can execute concurrently. A task here is any chunk of work that could, in principle, be done independently. The goal is to expose enough parallelism to keep all processors busy; exposing far more tasks than processors is fine. The key constraint is data dependencies: task B depends on task A if B needs A’s output. Dependent tasks cannot be parallelized with each other.

2. Assignment — distribute tasks to workers (threads, cores). Two strategies exist. Static assignment maps tasks to workers at compile time; it has zero runtime overhead but fails when task durations are unpredictable (load imbalance leaves some workers idle). Dynamic assignment uses a shared work queue: idle workers pull tasks as needed, achieving natural load balance at the cost of queue synchronization overhead. Task granularity matters — tasks that are too small spend more time in overhead than in computation.

3. Orchestration — structure communication and synchronization. This includes choosing data layouts (do threads share a struct or have private copies?), deciding when to synchronize, and minimizing the cost of data movement. Locality is paramount: accessing data in a thread’s own cache is orders of magnitude faster than fetching from another core’s cache or from DRAM. Poor orchestration can make a parallel program slower than its sequential equivalent due to excessive cache-line invalidations.

4. Mapping — bind workers to hardware execution units. On a multicore CPU, the OS scheduler handles this, but affinity hints (pthread_setaffinity_np) can pin a thread to a specific core to improve cache locality. On a GPU, the programmer specifies a grid of thread blocks, and the hardware maps blocks to streaming multiprocessors. The right mapping exploits the specific cache topology of the target machine.

Most parallel correctness bugs arise in step 3 (forgotten synchronization, wrong data sharing). Most parallel performance bugs arise in steps 2 and 4 (load imbalance, excessive inter-core communication).

Worked example — parallel merge sort walking through all four steps:

Decomposition: the array is recursively split into halves. Each half is independent (no shared writes), so both recursive sorts can run concurrently. The merge step depends on both halves completing — it sits on the critical path.

Assignment: use task parallelism. Spawn a task for the left half; the current thread handles the right half; then merge. This continues recursively. Because task granularity shrinks exponentially, add a threshold below which the sort runs sequentially (avoids spawning threads for 4-element arrays).

Orchestration: the only synchronization is joining the left-half task before merging. No shared data is written between the two halves during sorting — each task works on its own subarray slice.

Mapping: the work-stealing runtime handles this automatically. Deep recursive tasks naturally distribute across available cores.

// µC++ parallel merge sort
const int SEQ_THRESHOLD = 1024;

_Task Sorter {
    int * arr; int lo, hi;
    void main() { parallelSort( arr, lo, hi ); }
public:
    Sorter( int * arr, int lo, int hi ) : arr(arr), lo(lo), hi(hi) {}
};

void parallelSort( int * arr, int lo, int hi ) {
    if ( hi - lo <= SEQ_THRESHOLD ) {
        std::sort( arr + lo, arr + hi );   // sequential base case
        return;
    }
    int mid = lo + (hi - lo) / 2;
    Sorter * left = new Sorter( arr, lo, mid ); // spawn left half
    parallelSort( arr, mid, hi );               // right half in current task
    delete left;                                // join: wait for left to finish
    std::inplace_merge( arr+lo, arr+mid, arr+hi );
}

The sequential threshold prevents thread explosion: without it, sorting 10M elements creates on the order of 10M tasks (a binary recursion tree of depth log₂(10M) ≈ 23, with one task per subarray down to single elements). With the threshold, only ~10M/1024 ≈ 9770 tasks are spawned — enough to fill all cores, cheap enough that task overhead does not dominate the computation.

Work Stealing: The Scheduler Behind Cilk and Fork/Join

Work stealing is a dynamic task-scheduling algorithm that achieves near-optimal load balancing with low overhead. It is the runtime scheduler used in Cilk (MIT), Java’s ForkJoinPool, .NET’s Task Parallel Library, and Rust’s Tokio.

Each worker thread maintains a double-ended queue (deque) of tasks. A thread pushes and pops its own tasks from the bottom of its deque (like a stack — LIFO order, which preserves cache locality for recently created tasks). An idle thread that has exhausted its own deque steals from the top of another thread’s deque, chosen randomly.

The asymmetry is crucial: the owning thread uses the bottom (hot end) without synchronization most of the time; stealing only touches the top (cold end) when a deque is non-empty, which is infrequent. This means synchronization overhead is proportional to the number of steals, not the number of tasks created.
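The deque discipline can be sketched in portable C++. Real schedulers use a lock-free deque (e.g. the Chase-Lev deque) so the owner's hot path needs no lock; this simplified, mutex-protected version (WorkDeque and its members are invented names) shows only the LIFO-owner / FIFO-thief asymmetry:

```cpp
#include <deque>
#include <mutex>
#include <optional>
#include <functional>

using Task = std::function<void()>;

class WorkDeque {                       // one per worker thread
    std::deque<Task> tasks;
    std::mutex m;                       // real deques avoid this lock on the hot path
public:
    void pushBottom( Task t ) {         // owner: push at the hot end
        std::lock_guard<std::mutex> g( m );
        tasks.push_back( std::move( t ) );
    }
    std::optional<Task> popBottom() {   // owner: pop at the hot end (LIFO, cache-warm)
        std::lock_guard<std::mutex> g( m );
        if ( tasks.empty() ) return std::nullopt;
        Task t = std::move( tasks.back() ); tasks.pop_back();
        return t;
    }
    std::optional<Task> stealTop() {    // thief: take from the cold end (oldest task)
        std::lock_guard<std::mutex> g( m );
        if ( tasks.empty() ) return std::nullopt;
        Task t = std::move( tasks.front() ); tasks.pop_front();
        return t;
    }
};
```

Stealing from the top tends to grab the oldest — and therefore largest — subtask, which amortizes the cost of the steal over more work.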

Continuation stealing vs. child stealing: In continuation stealing (used by Cilk Plus), when a thread spawns a child task, it immediately executes the child while the continuation (the rest of the spawning function) is placed on the deque for potential stealing. In child stealing (used by most library implementations), the spawned child is placed on the deque and the parent continues. Child stealing is simpler to implement (no compiler support needed) but has worse cache behavior for the common case.

The Blumofe-Leiserson analysis proves that work stealing executes a computation with total work \(T_1\) and span (critical path) \(T_\infty\) in expected time:

\[ T_P \leq \frac{T_1}{P} + O(T_\infty) \]

This bound is optimal: the first term is unavoidable (work must be done), and the second is unavoidable (the critical path must be traversed serially). Work stealing achieves this bound with high probability using only random victim selection — no centralized scheduler needed.

The same parallel merge sort expressed in Java’s ForkJoin framework (the standard work-stealing library since Java 7) illustrates child stealing semantics directly:

class MergeSort extends RecursiveAction {
    int[] arr; int lo, hi;
    static final int THRESHOLD = 1024;

    MergeSort( int[] arr, int lo, int hi ) {
        this.arr = arr; this.lo = lo; this.hi = hi;
    }

    @Override
    protected void compute() {
        if ( hi - lo <= THRESHOLD ) {
            Arrays.sort( arr, lo, hi );  // sequential base case
            return;
        }
        int mid = lo + (hi - lo) / 2;
        MergeSort left  = new MergeSort( arr, lo, mid );
        MergeSort right = new MergeSort( arr, mid, hi );
        // child stealing: both subtasks go on the deque
        invokeAll( left, right );        // fork both, join both
        merge( arr, lo, mid, hi );
    }
}

ForkJoinPool pool = new ForkJoinPool();
pool.invoke( new MergeSort( arr, 0, arr.length ) );

invokeAll forks both subtasks (adds them to the current thread’s deque) and then joins them. Idle worker threads steal from the deque and execute subtasks concurrently. The framework automatically adjusts parallelism to the available processors — no explicit thread count needed.

Critical Sections and Mutual Exclusion

A critical section is a sequence of statements that access a shared resource and that must execute atomically with respect to other threads accessing the same resource. The simplest shared resource is a variable:

// Two threads, both executing x += 1:
// Thread A:  load r1, x    (r1 = 5)
// Thread B:  load r2, x    (r2 = 5)
// Thread A:  add r1, 1     (r1 = 6)
// Thread B:  add r2, 1     (r2 = 6)
// Thread A:  store x, r1   (x = 6)
// Thread B:  store x, r2   (x = 6)  -- LOST UPDATE

The correct final value is 7; the concurrent execution yields 6. This is the lost update problem, a classic race condition.
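The fix for this particular race is to make the read-modify-write indivisible. A sketch in standard C++ using std::atomic::fetch_add (atomicCount is an illustrative helper):

```cpp
#include <atomic>
#include <thread>

// Unsynchronized increments may lose updates; atomic increments cannot.
int atomicCount( int perThread ) {
    std::atomic<int> x{ 0 };
    auto inc = [&]{ for ( int i = 0; i < perThread; i += 1 ) x.fetch_add( 1 ); };
    std::thread a( inc ), b( inc );
    a.join(); b.join();
    return x.load();                    // always exactly 2 * perThread
}
```

fetch_add performs the load, add, and store as a single atomic hardware operation, so the interleaving shown above cannot occur.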

Mutual exclusion is the property that at most one thread executes its critical section at a time. A correct mutual exclusion protocol must satisfy:

  1. Safety (mutual exclusion): at most one thread is in its critical section at any time.
  2. Liveness (no deadlock/livelock): if threads want to enter, some thread eventually does.
  3. Fairness (no starvation): every thread that wants to enter eventually does.

Additionally, a lock that blocks rather than busy-waits should wake a waiting thread only when it can actually acquire the lock; otherwise a woken thread must repeatedly re-check and re-block, degenerating into a form of busy-waiting.

Software Solutions to Mutual Exclusion

Before examining hardware-assisted locks, it is instructive to understand why purely software solutions to mutual exclusion are so difficult. Each of the following attempts fails in some way:

Lock variable: a single shared boolean lock. Thread A reads lock == false, then Thread B reads lock == false, then both set lock = true and enter — mutual exclusion violated. The read-modify-write sequence is not atomic.

Alternation: two threads alternate access using a shared turn variable. This satisfies mutual exclusion and prevents starvation, but requires strict alternation — if Thread A wants to enter twice before Thread B wants to enter once, Thread A must wait for B even though B has no interest in the critical section. Violates the independent progress property.

Declare intent (flag array): each thread raises a flag before entering. A thread enters only if the other’s flag is down. This prevents deadlock but allows livelock: both raise their flags simultaneously and neither enters.

Retract intent: threads lower their flag when they see the other’s flag is raised, wait, and try again. This can cause indefinite postponement (starvation) if the random waits happen to synchronize repeatedly.

Prioritized retract intent: assign priorities to break ties. Prevents starvation for two threads but does not generalize to N threads.

Dekker’s algorithm (1965, first correct software mutex): combines alternation (as a tie-breaker) with intent declaration:

bool flag[2] = {false, false};
int turn = 0;

// Thread i (i=0 or i=1):
flag[i] = true;                   // declare intent
while ( flag[1-i] ) {             // other wants in?
    if ( turn != i ) {
        flag[i] = false;          // retract intent
        while ( turn != i ) {}    // wait for turn
        flag[i] = true;           // re-declare intent
    }
}
// CRITICAL SECTION
turn = 1 - i;
flag[i] = false;                  // retract intent

Peterson’s algorithm (1981): a simpler two-thread solution. Both threads declare intent and then each “gives way” to the other:

flag[i] = true;
turn = 1 - i;                    // offer turn to other
while ( flag[1-i] && turn == 1-i ) {}  // wait if other wants in and it's their turn
// CRITICAL SECTION
flag[i] = false;

Peterson’s algorithm is elegant and correct for two threads. Extensions to N threads exist (the Bakery algorithm and the N-thread tournament), but they all become complex and require careful reasoning about memory ordering.
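Peterson's algorithm can be run directly in standard C++, with one caveat: the flag and turn accesses must be sequentially consistent (the default for std::atomic) — with plain ints, the compiler and CPU may reorder the stores and loads and break mutual exclusion. A sketch (petersonDemo is an illustrative harness):

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> flag[2] = { { false }, { false } };
std::atomic<int>  turn{ 0 };
int counter = 0;                        // protected by the Peterson lock

void petersonInc( int i, int n ) {      // thread i = 0 or 1
    for ( int k = 0; k < n; k += 1 ) {
        flag[i] = true;
        turn = 1 - i;                   // offer turn to the other thread
        while ( flag[1-i] && turn == 1-i ) {}   // busy-wait
        counter += 1;                   // CRITICAL SECTION (plain int, safely guarded)
        flag[i] = false;
    }
}

int petersonDemo( int n ) {
    counter = 0;
    std::thread a( petersonInc, 0, n ), b( petersonInc, 1, n );
    a.join(); b.join();
    return counter;
}
```

If counter survives 2n concurrent increments intact, mutual exclusion held for every entry.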

Hardware Solutions

Software solutions require multiple memory accesses that are individually observable by other threads. Hardware provides atomic read-modify-write instructions that make the critical combination uninterruptible:

Test-and-Set: atomically reads a memory location and sets it to 1 (or some locked value), returning the old value. A thread that reads 0 (unlocked) has acquired the lock; a thread that reads 1 must retry.

Compare-and-Assign (CAS): atomically reads a location, compares it to an expected value, and writes a new value only if the comparison succeeds. CAS is more general than test-and-set and is the foundation for lock-free data structures.

Fetch-and-Increment: atomically increments a counter and returns the previous value. Used in the ticket lock to draw a unique ticket number (the software-only Bakery algorithm achieves a similar numbering without atomic instructions).

Swap: atomically exchanges a register and a memory location. Equivalent in power to test-and-set.

All hardware solutions with busy-waiting (spin locks) waste processor cycles: a thread spinning on a lock occupies a CPU while doing no useful work. On a uniprocessor, spinning is particularly wasteful because the thread holding the lock cannot run to release it while the spinning thread occupies the processor. Spin locks are appropriate when critical sections are very short and the expected wait is less than the overhead of a context switch.

A CAS-based spin lock demonstrates how test-and-set semantics compose with std::atomic:

class SpinLock {
    std::atomic<bool> locked{false};
public:
    void acquire() {
        bool expected = false;
        // spin until CAS succeeds: false→true
        while ( !locked.compare_exchange_weak(
                    expected, true,
                    std::memory_order_acquire,
                    std::memory_order_relaxed ) ) {
            expected = false;           // reset: CAS overwrites expected on failure
            while ( locked.load(std::memory_order_relaxed) ) {}  // local spin
        }
    }
    void release() {
        locked.store( false, std::memory_order_release );
    }
};

The inner while ( locked.load(relaxed) ) loop is a test-and-test-and-set (TTAS) optimization. The naive approach re-executes CAS on every iteration, and each failed CAS still acquires the cache line exclusively — on a multiprocessor, this floods the interconnect with invalidation traffic. TTAS spins on a plain read (the line stays in the shared state across all spinning cores, no invalidations) and only attempts the expensive CAS when the lock looks free. Under contention, TTAS can be dramatically faster than a raw CAS loop.

The memory_order_acquire on the successful CAS and memory_order_release on the store form the acquire-release fence pair that prevents the critical-section code from being reordered outside the lock.


Part IV: Synchronization Mechanisms

Chapter 6: Locks

The lock taxonomy organizes locking mechanisms along two axes: whether a waiting thread spins (busy-waits) or blocks (suspends and yields the processor), and whether the lock conveys only exclusion or also carries state (a condition).

Spin Locks

A spin lock is the simplest possible lock: a thread repeatedly tests a condition until it becomes true. µC++ provides uSpinLock with acquire(), release(), and tryacquire(). Spin locks are appropriate for very short critical sections (a few instructions) where the expected contention is low and the cost of blocking exceeds the cost of spinning.

The implementation must handle the case where a thread yields without scheduling (to allow another thread on the same kernel thread to run) while holding the spin lock — the yieldNoSchedule( lock ) idiom passes the spin lock to the runtime system, which releases it after the yield.

An adaptive spin lock spins for a bounded number of iterations before blocking. This hybridizes the benefits of both approaches: low latency when the critical section is short, no wasted cycles when contention is high.
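An adaptive lock can be sketched as spin-then-yield. A real implementation blocks the thread on a kernel wait queue after the spin phase; std::this_thread::yield is a portable stand-in here, and SPIN_LIMIT is an invented tuning parameter:

```cpp
#include <atomic>
#include <thread>

class AdaptiveLock {
    std::atomic<bool> locked{ false };
    static const int SPIN_LIMIT = 1000;   // tuning parameter (illustrative value)
public:
    void acquire() {
        for ( ;; ) {
            for ( int i = 0; i < SPIN_LIMIT; i += 1 ) {   // bounded spin phase
                bool expected = false;
                if ( locked.compare_exchange_weak( expected, true,
                         std::memory_order_acquire ) ) return;
            }
            std::this_thread::yield();    // give up the processor (stand-in for blocking)
        }
    }
    void release() { locked.store( false, std::memory_order_release ); }
};
```

Tuning SPIN_LIMIT trades latency (short waits resolved without a context switch) against wasted cycles (long waits burn the whole spin budget before yielding).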

Mutex Locks

A mutex lock (mutual exclusion lock) is a blocking lock for protecting critical sections. When a thread cannot acquire the lock, it suspends and is placed on a waiting queue. When the lock is released, a waiting thread is woken up.

The implementation must address barging: a newly arriving thread may acquire a freshly released lock before the thread that was unblocked to receive it. This violates the expectation that released threads make progress. The solution is baton passing — the releasing thread does not mark the lock as available; instead, it directly hands the “baton” (the right to proceed) to the next waiting thread:

void release() {
    lock.acquire();
    owner = nullptr;
    if ( !blocked.empty() ) {
        // wake next blocked thread, baton is passed
    } else {
        avail = true;         // no one waiting — lock becomes available
    }
    lock.release();
}

The baton-passing pattern appears throughout concurrent programming: it is a general technique for preventing one class of scheduling unfairness.

µC++ provides uOwnerLock as the standard reentrant mutex lock. A thread that already holds a uOwnerLock can acquire it again without deadlocking; the lock tracks the nesting depth. The times() member returns the nesting depth. Non-reentrant locks that a thread attempts to reacquire will deadlock.

The stream lock (osacquire/isacquire) protects I/O streams from interleaving output from multiple threads. Any output written within an osacquire scope appears atomically:

{ osacquire acq( cout );   // acquire cout's lock
    cout << "Thread " << id << ": result = " << val << endl;
}   // release on scope exit

Synchronization Locks

A synchronization lock (condition variable in POSIX terminology) provides a mechanism for a thread to block until a condition becomes true. Unlike a mutex lock (which is binary), a synchronization lock has no persistent state — a signal() that occurs when no thread is waiting is silently lost:

uCondLock cond;
// Waiter:
cond.wait( mutex );      // atomically release mutex and block
// Signaller:
cond.signal();           // wake one waiter (lost if none)
// or
cond.broadcast();        // wake all waiters

The wait() operation atomically releases the associated mutex and suspends the thread. On wakeup, the thread reacquires the mutex before returning from wait(). This atomic release-and-block is crucial: without it, a signal sent between the decision to wait and the actual blocking is lost, causing the waiter to sleep forever (a classic race condition).

Spurious wakeups — a thread waking from wait() when the condition is not actually satisfied — are permitted by POSIX and must be handled. The standard pattern wraps wait() in a while loop that re-checks the condition:

while ( !condition ) cond.wait( mutex );
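The same discipline with standard C++ primitives: std::condition_variable::wait atomically releases the std::mutex and blocks, and the while loop re-checks the predicate on every wakeup (waitDemo and ready are illustrative names):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cond;
bool ready = false;                 // the condition the waiter needs

int waitDemo() {
    int observed = -1;
    std::thread waiter( [&]{
        std::unique_lock<std::mutex> lk( m );
        while ( !ready ) cond.wait( lk );   // re-check guards against spurious wakeups
        observed = 1;                       // condition now definitely holds
    } );
    {
        std::lock_guard<std::mutex> lk( m );
        ready = true;                       // make the condition true under the mutex
    }
    cond.notify_one();                      // then signal
    waiter.join();
    return observed;
}
```

Setting ready under the mutex before signalling is what closes the lost-wakeup race: the waiter either sees ready == true and skips the wait, or blocks before the signal can be sent.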

Barriers

A barrier is a synchronization point at which all participants must arrive before any may proceed. It is the natural synchronization primitive for scatter-gather parallelism: scatter work across threads, gather when all are done.

µC++ provides uBarrier. Each participant calls block() when it reaches the barrier. The last participant to arrive calls the virtual last() member (which may perform a reduction or other aggregation) and then releases all waiting threads.

The uBarrier can be reset and reused, supporting iterative parallel algorithms. The reinitialization problem — a thread re-entering the barrier before all threads have left from the previous phase — is handled by using two alternating barrier counts.

A barrier-based parallel reduction demonstrates the pattern clearly. Each thread computes a partial sum; the barrier synchronizes before the final accumulation:

const int N = 1'000'000, P = 4;
int data[N];
double partials[P];

_Cormonitor SumBarrier : public uBarrier {
    double & total;
public:
    SumBarrier( int participants, double & total )
        : uBarrier( participants ), total( total ) {}
    void last() override {           // called by the last thread to arrive
        total = 0;
        for ( int i = 0; i < P; i += 1 ) total += partials[i];
    }
};

_Task Worker {
    int id; SumBarrier & bar;
    void main() {
        int chunk = N / P;
        int lo = id * chunk, hi = lo + chunk;
        partials[id] = 0;
        for ( int i = lo; i < hi; i += 1 ) partials[id] += data[i];
        bar.block();   // wait for all workers; last one runs SumBarrier::last()
    }
public:
    Worker( int id, SumBarrier & bar ) : id(id), bar(bar) {}
};

double total;
SumBarrier bar( P, total );
{
    uArray( Worker, workers, P );
    for ( int i = 0; i < P; i += 1 ) workers[i]( i, bar );
} // tasks join here; total is fully computed

The virtual last() override removes the need for a separate reduction step after the barrier — the reduction is the barrier’s final action.

Priority Inversion

Priority inversion is a subtle and dangerous concurrency bug that occurs when a high-priority thread is indirectly blocked by a low-priority thread. It was infamously observed in the Mars Pathfinder mission in 1997, where a real-time operating system’s watchdog timer repeatedly reset the lander because a high-priority information-bus task was blocked behind a low-priority meteorological data task that held a mutex, while a medium-priority communications task kept the processor busy. The reset occurred every time the high-priority task missed its deadline — a bug that was dormant in testing and only appeared under the precise timing conditions of actual operation.

The scenario requires three threads at three distinct priority levels:

  1. High-priority thread H needs a shared resource (a mutex).
  2. Low-priority thread L holds the mutex and is preempted before releasing it.
  3. Medium-priority thread M is runnable and has no interest in the mutex.

Because M has higher priority than L, the scheduler runs M instead of L. But H is blocked waiting for the mutex that L holds. L cannot run (M is running instead). H is therefore indirectly blocked by M — a thread that does not even touch the shared resource. In a real-time system where H has a hard deadline, this can cause a deadline miss.

Priority inversion is a symptom of the interaction between priority scheduling and mutual exclusion. It does not occur in systems without priority scheduling (e.g., FIFO scheduling) and does not occur when high-priority threads never need resources held by low-priority threads.

Solutions:

Priority inheritance (POSIX real-time extension): when a low-priority thread holds a resource needed by a higher-priority thread, the low-priority thread temporarily inherits the higher thread’s priority until it releases the resource. This prevents medium-priority threads from preempting L while H is waiting for the mutex L holds. Priority inheritance is implemented by most real-time OS kernels and is available in Pthreads via pthread_mutexattr_setprotocol(PTHREAD_PRIO_INHERIT).

Priority ceiling: each mutex is assigned a priority ceiling equal to the maximum priority of any thread that might acquire it. A thread can only acquire a mutex if its priority is strictly higher than the ceiling of all currently held mutexes. This prevents a thread from acquiring a mutex that could later block a higher-priority thread. Priority ceiling eliminates priority inversion entirely but requires knowing all mutex users in advance.

Avoiding blocking: real-time systems often avoid mutexes entirely for critical shared data, using lock-free atomic operations or message-passing to communicate between threads of different priorities.

The following minimal example shows all three threads and the inversion scenario with Pthreads:

#include <pthread.h>
pthread_mutex_t res;
pthread_mutexattr_t attr;

void setup() {
    pthread_mutexattr_init( &attr );
    // Enable priority inheritance — LOW inherits HIGH's priority while holding res
    pthread_mutexattr_setprotocol( &attr, PTHREAD_PRIO_INHERIT );
    pthread_mutex_init( &res, &attr );
}

void * low_task( void * ) {       // priority 10
    pthread_mutex_lock( &res );
    // ... long computation holding res ...
    pthread_mutex_unlock( &res );
    return nullptr;
}

void * medium_task( void * ) {    // priority 20 — no interest in res
    // CPU-bound work that preempts low_task WITHOUT priority inheritance
    // WITH inheritance: low inherits priority 30 → medium cannot preempt it
    return nullptr;
}

void * high_task( void * ) {      // priority 30
    pthread_mutex_lock( &res );   // blocks until low releases
    // ... critical work ...
    pthread_mutex_unlock( &res );
    return nullptr;
}

Without PTHREAD_PRIO_INHERIT: medium_task preempts low_task, high_task misses its deadline. With it: as soon as high_task blocks on res, low_task inherits priority 30, completes, and high_task can proceed before medium_task ever gets CPU time.

Binary and Counting Semaphores

A binary semaphore is a generalization of a mutex lock that allows the release to occur from a different thread than the acquisition. This makes it suitable for signaling (one thread signals another that an event has occurred) rather than just mutual exclusion.

The operations are classically called P (from Dutch prolaag, Dijkstra’s coinage for probeer te verlagen, “try to decrease”) and V (from verhogen, to increase):

  • P(): if the semaphore’s counter is greater than 0, decrement and proceed; otherwise block.
  • V(): if a thread is waiting, wake it; otherwise increment the counter.

The counter encodes state: V before P means P will not block (the V is “remembered”). This is the key difference from a condition variable, where a signal before a wait is lost.

A counting semaphore allows the counter to be any non-negative integer. This enables elegant solutions to the bounded-buffer problem:

uSemaphore full(0), empty(MaxItems), mutex(1);

// Producer:
empty.P();      // wait for space
mutex.P();
// add item
mutex.V();
full.V();       // signal item available

// Consumer:
full.P();       // wait for item
mutex.P();
// remove item
mutex.V();
empty.V();      // signal space available

The full semaphore counts items; empty counts spaces. The asymmetry in initial values (full=0, empty=MaxItems) means the buffer starts empty, and the invariant full + empty = MaxItems holds whenever no thread is mid-operation. Note that full and empty are counting semaphores, while mutex is a binary semaphore guarding the buffer manipulation itself.

Lock Programming Patterns

Precedence graphs encode dependencies among concurrent tasks: task B must complete before task C may begin. Semaphores implement these dependencies directly — a V() on a semaphore signals completion; the dependent task calls P() before starting.

The Readers-Writer problem illustrates the full power and difficulty of lock programming. Multiple readers may access shared data concurrently (reads do not conflict), but a writer needs exclusive access (writers conflict with both readers and other writers). A naive mutex lock prevents all concurrency even among readers.

A correct implementation must manage:

  1. Readers count (to track whether any readers are active)
  2. Writer exclusion (only one writer at a time)
  3. Reader/writer priority (to prevent starvation of one group by the other)

The baton-passing approach using split binary semaphores achieves this. The key insight is maintaining an entry semaphore as the baton: at most one thread holds the baton at a time and passes it to the appropriate next thread based on the current counts and waiting queues.

The simplest correct solution (readers-preference, may starve writers under heavy read load) uses just two semaphores and a reader count:

uSemaphore mutex(1), wrt(1);
int rcnt = 0;

// Reader:
mutex.P();
rcnt += 1;
if ( rcnt == 1 ) wrt.P();   // first reader locks out writers
mutex.V();

// ... READ ...

mutex.P();
rcnt -= 1;
if ( rcnt == 0 ) wrt.V();   // last reader unlocks writers
mutex.V();

// Writer:
wrt.P();
// ... WRITE ...
wrt.V();

The rcnt counter is protected by mutex (a binary semaphore used as a mutex). The wrt semaphore is the actual reader-writer gate: the first reader acquires it (locking out all writers) and the last reader releases it. Writers acquire wrt exclusively. This solution allows concurrent readers but can starve writers if readers continuously arrive.

A writer-preference solution (preventing starvation of writers) adds a third semaphore rdrs that blocks new readers from entering once a writer is waiting, but it requires careful accounting of who is currently inside. Seven progressively refined solutions are developed in Buhr’s notes; each addresses a weakness in the previous. The final solution achieves fairness for both readers and writers with bounded waiting.


Chapter 7: Concurrent Errors

Understanding what can go wrong in concurrent programs is as important as knowing the correct mechanisms. The errors fall into two categories: safety violations (something bad happens) and liveness violations (something good never happens).

Race Conditions

A race condition occurs when the correctness of a computation depends on the relative timing of events in multiple threads. The classic example is two threads concurrently modifying a shared variable without synchronization. The result is non-deterministic and typically wrong.

Race conditions are insidious because they are timing-dependent. A program may run correctly in 999 out of 1000 executions and fail catastrophically in the 1000th. Inserting debug output or attaching a debugger changes thread timing and may make the bug disappear — hence the term Heisenbug.

Static variables in C++ are a hidden source of races. A static local variable is shared across all threads; its initialization in one thread races with reads from another. µC++ static task members are also shared; accessing them requires explicit synchronization.

Livelock

Livelock is a state where threads are active but making no progress. Unlike deadlock (where threads are blocked), livelocked threads are running — they are executing the retry logic of a failed protocol. The symmetric failure of the retract-intent attempt (a precursor of Dekker’s algorithm) is an example: both threads raise their flags simultaneously, each sees the other’s flag, so both lower their flags, wait, and retry — forever, in the worst case.

Livelock is prevented by breaking symmetry: either by assigning static priorities, or by using a random exponential backoff (as in Ethernet’s CSMA/CD protocol).

Starvation

Starvation (indefinite postponement) occurs when a thread is perpetually denied access to a resource. A lock is unfair if it can repeatedly choose other threads while one thread waits indefinitely. A fair lock guarantees that every waiting thread eventually acquires the lock — typically by serving requests in FIFO order.

Starvation is distinct from deadlock: a starved thread is still runnable and will acquire the resource eventually if competing threads stop. Whether starvation is acceptable depends on the application — real-time systems often require bounded waiting times.

Deadlock

Deadlock is a state where a set of threads are each waiting for a resource held by another thread in the set, forming a cycle in the resource-request graph. No thread in the cycle can proceed, and without external intervention, none ever will.

Deadlock has two flavors:

Synchronization deadlock: Thread A waits for Thread B to signal a condition; Thread B waits for Thread A to signal first. Neither ever signals. This is prevented by careful ordering of signal and wait operations.

Mutual exclusion deadlock: Thread A holds lock X and waits for lock Y; Thread B holds lock Y and waits for lock X. Neither can release what it holds while waiting for what it needs.

A minimal deadlock example — the classic “dining philosophers” reduced to two threads and two locks:

uOwnerLock X, Y;

_Task ThreadA {
    void main() {
        X.acquire();          // (1) A acquires X
        uThisTask().sleep( uDuration(0, 1000) ); // let B acquire Y
        Y.acquire();          // (3) A waits for Y — DEADLOCK: B holds Y, waits for X
        // ... use both X and Y ...
        Y.release();
        X.release();
    }
};

_Task ThreadB {
    void main() {
        Y.acquire();          // (2) B acquires Y
        X.acquire();          // (4) B waits for X — DEADLOCK: A holds X, waits for Y
        // ... use both X and Y ...
        X.release();
        Y.release();
    }
};

The fix is consistent lock ordering: both threads must acquire X before Y. Thread B changes to X.acquire(); Y.acquire(); — now B blocks at X (which A holds) rather than holding Y while waiting for X, breaking the cycle:

_Task ThreadB {       // FIXED
    void main() {
        X.acquire();   // acquire in the same order as A
        Y.acquire();
        // ... use both X and Y ...
        Y.release();
        X.release();
    }
};

The Four Coffman Conditions

Deadlock requires all four of the following conditions to hold simultaneously:

  1. Mutual exclusion: resources cannot be shared — only one thread uses a resource at a time.
  2. Hold and wait: a thread holds at least one resource while waiting for another.
  3. No preemption: resources cannot be forcibly taken from a thread; they must be released voluntarily.
  4. Circular wait: there exists a cycle in the resource-request graph.

Breaking any one condition prevents deadlock.

Prevention

Ordered resource allocation: assign a global ordering to all resources and require threads to acquire resources in increasing order. This breaks the circular-wait condition. If Thread A must always acquire X before Y, and Thread B does the same, Thread B will block at X (which A holds) rather than holding Y and waiting for X.
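In standard C++ the ordering can be imposed mechanically, for example by lock address (a sketch; std::scoped_lock offers built-in deadlock avoidance for the same purpose):

```cpp
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

std::mutex X, Y;
int balance = 0;

// Acquire two locks in a fixed global order (here: by address via std::less,
// which defines a total order over pointers), breaking circular wait.
void withBoth( std::mutex &a, std::mutex &b, int delta ) {
    std::mutex *first = &a, *second = &b;
    if ( std::less<std::mutex *>{}( second, first ) ) std::swap( first, second );
    std::lock_guard<std::mutex> l1( *first );
    std::lock_guard<std::mutex> l2( *second );
    balance += delta;                  // both locks held here
}
```

Two threads may now request the locks in opposite textual orders, withBoth(X, Y, 1) and withBoth(Y, X, 1), without deadlock, because both acquire in the same underlying order.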

Hold and wait prevention: a thread must acquire all resources it will ever need at once (resource hoarding). This is impractical in general — threads often do not know future needs.

Preemption: force a thread to release held resources when it cannot acquire the next. The thread then re-acquires all resources from scratch. Risk: starvation if the thread repeatedly fails.

Avoidance: The Banker’s Algorithm

Dijkstra’s Banker’s Algorithm (1965) prevents deadlock by only granting resource requests that leave the system in a safe state — one from which all threads can eventually complete, even if they each request their maximum declared resource needs.

The algorithm maintains:

  • The maximum resource need of each thread (declared in advance)
  • The currently allocated resources per thread
  • The available resources in the system

Before granting a request, the system simulates the allocation and checks whether the resulting state is safe: whether there exists an ordering of threads such that each can complete using only available resources plus resources eventually freed by threads that finish earlier. If safe, the request is granted; otherwise the requesting thread blocks until a safe grant is possible.

The Banker’s Algorithm is rarely used in practice because (a) threads must declare maximum needs in advance, which is often impossible, and (b) resources are rarely homogeneous. Its value is conceptual: it shows that deadlock avoidance is possible in principle.

Detection and Recovery

Rather than preventing deadlock, systems can detect it after the fact using a resource allocation graph (or its generalization to multiple instances, a graph reduction algorithm) and recover by terminating or rolling back one or more threads. This is the approach taken by database systems, which can abort and restart transactions.


Part V: Communication Patterns

Chapter 8: Indirect Communication — Monitors

Direct use of locks and semaphores is error-prone. The monitor abstraction (Hoare, 1974; Brinch Hansen, 1973) encapsulates shared data with the synchronization needed to protect it. A monitor is a class in which at most one thread is executing a member function at any time. The programmer writes ordinary-looking methods; the language or runtime enforces mutual exclusion automatically.

From Critical Regions to Monitors

Critical regions were an early attempt at structured synchronization: the compiler would automatically generate locking code around a region block. Conditional critical regions added a guarded condition:

region V when (B) { S }

The thread blocks until boolean expression B is true, then executes statement S. The weakness is that B is re-evaluated whenever another thread exits the region — potentially causing all waiting threads to re-check their conditions even if only one is relevant. This is the “thundering herd” problem.

Monitors refine this with condition variables that allow precise signaling.

In µC++, a monitor is declared with _Monitor:

_Monitor BoundedBuffer {
    int buf[10]; int front = 0, back = 0, count = 0;
    uCondition notempty, notfull;
public:
    void insert( int val ) {
        while ( count == 10 ) notfull.wait();   // loop: re-check on wakeup
        buf[back] = val; back = (back + 1) % 10; count++;
        notempty.signal();
    }
    int remove() {
        while ( count == 0 ) notempty.wait();   // loop: re-check on wakeup
        int val = buf[front]; front = (front + 1) % 10; count--;
        notfull.signal();
        return val;
    }
};

All member functions of _Monitor are implicitly _Mutex — they cannot execute concurrently. The _Nomutex keyword allows a specific member to be called without acquiring the lock (useful for read-only accessors that need no mutual exclusion).
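For comparison, a sketch of the same bounded buffer in standard C++, where the mutual exclusion that _Monitor provides implicitly must be written out as a mutex, and each uCondition becomes a std::condition_variable:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Standard-C++ rendering of the µC++ BoundedBuffer monitor: the implicit
// monitor lock becomes an explicit mutex held for each member's duration.
class BoundedBuffer {
    int buf[10]; int front = 0, back = 0, count = 0;
    std::mutex m;
    std::condition_variable notempty, notfull;
public:
    void insert( int val ) {
        std::unique_lock<std::mutex> lk( m );
        while ( count == 10 ) notfull.wait( lk );    // re-check on wakeup
        buf[back] = val; back = (back + 1) % 10; count += 1;
        notempty.notify_one();
    }
    int remove() {
        std::unique_lock<std::mutex> lk( m );
        while ( count == 0 ) notempty.wait( lk );    // re-check on wakeup
        int val = buf[front]; front = (front + 1) % 10; count -= 1;
        notfull.notify_one();
        return val;
    }
};
```

The while loops are required by the signal-and-continue semantics of std::condition_variable: the woken thread must re-verify its condition once it reacquires the lock.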

Scheduling Within a Monitor

The deeper design question is: after a signal() wakes a waiting thread, who runs next — the signaller or the signalled thread? This question has several answers, each with tradeoffs:

External scheduling (_Accept): the monitor controls which member functions can be called at any given time. A thread inside the monitor uses _Accept to block until a specific member is called from outside:

_Monitor BoundedBuffer {
    . . .
public:
    void insert( int val ) {
        if ( count == 10 ) _Accept( remove );   // block until a remove call occurs
        // insert val
    }
    int remove() {
        if ( count == 0 ) _Accept( insert );    // block until an insert call occurs
        // remove and return value
    }
};

External scheduling is powerful but requires the programmer to explicitly manage which calls are accepted. It is analogous to a server that chooses which request to service.

Internal scheduling: the monitor uses condition variables to signal waiting threads based on conditions computed inside the monitor. There are several variants:

  • No-priority nonblocking (Signal-and-Continue, µC++ default): the signaller continues executing; the signalled thread moves from the condition queue to the entry queue. The signalled thread may find the condition false when it eventually runs (because other threads may have changed state). This requires while ( !condition ) wait() loops.

  • Priority blocking (Signal-and-Urgent-Wait, Hoare monitors): the signaller blocks and the signalled thread runs immediately. The signalled thread is guaranteed to find the condition true on waking. This requires re-entry queues and is more complex but allows single-check if ( !condition ) wait().

  • Immediate-return signal (Signal-and-Exit): the signaller signals and exits the monitor entirely. The signalled thread runs. No re-check needed, but the signaller cannot do further work in the monitor after signaling.

The choice among these affects both correctness reasoning and performance. Hoare monitors (priority blocking) allow the simplest correctness reasoning but impose overhead from signaller suspension. Signal-and-continue monitors are simpler to implement but require more careful condition management.

Mesa vs. Hoare: Why Signal-and-Continue Won

The choice between Hoare and Mesa semantics is one of the most historically debated design decisions in concurrent programming. Understanding why Mesa semantics prevailed illuminates fundamental tradeoffs.

Hoare semantics (C.A.R. Hoare, 1974): when a thread signals, the signalled thread runs immediately inside the monitor, with the signaller blocked in a special re-entry queue. This gives the signalled thread an implicit guarantee: the condition it was waiting for is still true when it resumes, because no other thread has had a chance to run inside the monitor in the interim. The consequence is that wait() can be paired with a simple if:

// Hoare monitor (hypothetical):
if ( count == 0 ) notempty.wait();    // safe: condition guaranteed on wakeup
int val = buf[front]; . . .

Mesa semantics (Lampson and Redell, 1980, from the Mesa programming language at Xerox PARC): when a thread signals, the signalled thread is moved from the condition queue to the entry queue and will run when the signaller exits the monitor or blocks again. This is signal-and-continue — the signaller retains the monitor lock and continues executing. The signalled thread must re-check its condition when it eventually acquires the lock, because other threads may have run in the meantime:

// Mesa monitor (µC++ default, Java, POSIX):
while ( count == 0 ) notempty.wait();   // must loop: condition may be false again
int val = buf[front]; . . .

Why Mesa won in practice: Hoare semantics requires immediate context-switching from signaller to signalled thread — expensive. Mesa semantics allows the signaller to finish its critical section before yielding. This is generally more efficient because the signaller usually has more work to do after signaling. Additionally, Mesa semantics handles spurious wakeups (wakeups not caused by a signal, permitted by POSIX and common in implementations using OS condition variables) without modification — the while loop naturally re-checks the condition.

Mesa semantics also compose better with complex monitors. In Hoare monitors, a signalled thread that returns to the entry queue upon waking (because it couldn’t proceed) creates a cascading series of context switches. Mesa’s model is simpler for the runtime to implement correctly across interrupt handlers, multi-core hardware, and OS-level context switching.

The practical rule flowing from Berkeley CS 162, POSIX, and Java: always use while, never if, around a wait() call. This is mandatory in Mesa/Java monitors and harmless in Hoare monitors — it is the universally safe pattern regardless of the underlying semantics.

The difference is most visible in a shared counter monitor. Suppose multiple producers add to a buffer and multiple consumers wait for items. Under Mesa semantics, if leads to a bug:

// WRONG under Mesa semantics (µC++, Java, POSIX):
_Monitor Buffer {
    int count = 0;
    uCondition hasItem;
public:
    void produce() { count++; hasItem.signal(); }
    int consume() {
        if ( count == 0 ) hasItem.wait();  // BUG: signal woke us, but another
        count--;                           // consumer ran first and took the item!
        return count;                      // count is now -1
    }
};

// CORRECT under Mesa semantics:
_Monitor Buffer {
    int count = 0;
    uCondition hasItem;
public:
    void produce() { count++; hasItem.signal(); }
    int consume() {
        while ( count == 0 ) hasItem.wait(); // re-check after wakeup
        count--;
        return count;
    }
};

The scenario that breaks if: thread C1 is sleeping on hasItem. Thread P1 produces an item and signals. Before C1 acquires the monitor lock and runs, thread C2 sneaks in, sees count == 1, consumes the item (count back to 0), and exits. C1 now wakes up, checks nothing (it already passed the if), and decrements to -1. The while loop forces C1 to re-check: it sees count == 0 and waits again.

Signal vs. broadcast: a signal() wakes exactly one thread; a broadcast() (notifyAll() in Java) wakes all waiting threads. When a condition may be satisfied for multiple waiters (e.g., multiple readers can proceed once a writer exits), broadcast avoids missing threads. The cost is the thundering herd: all woken threads compete for the monitor lock, and all but one immediately re-wait. Use broadcast when correctness requires it; prefer signal otherwise.
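A case where broadcast is genuinely required is a one-shot gate: once it opens, every waiter may proceed, so waking only one would strand the rest (a standard-C++ sketch):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// One-shot gate: all waiters may proceed once the gate opens, so the opener
// must broadcast; a single signal would release only one waiter.
class Gate {
    std::mutex m;
    std::condition_variable cv;
    bool open = false;
public:
    void openGate() {
        std::lock_guard<std::mutex> lk( m );
        open = true;
        cv.notify_all();                 // broadcast: wake every waiter
    }
    void pass() {
        std::unique_lock<std::mutex> lk( m );
        while ( ! open ) cv.wait( lk );  // each waiter re-checks on wakeup
    }
};
```

Here the thundering herd is the intended behaviour: every woken thread finds its condition satisfied, so no one immediately re-waits.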

Readers and Writers in a Monitor

The readers-writer problem in a monitor uses shadow queues to achieve precise scheduling. Rather than a simple condition queue, a shadow queue holds the actual waiting threads and allows the monitor to service them selectively. The key invariant: at most one writer active, and writers cannot be active when readers are:

_Monitor ReadersWriter {
    int rcnt = 0, wcnt = 0;
    uCondition readers, writers;
public:
    void startRead() {
        if ( wcnt > 0 || !writers.empty() ) readers.wait();
        rcnt++;
        readers.signal();   // wake another waiting reader (cascading)
    }
    void endRead() {
        rcnt--;
        if ( rcnt == 0 ) writers.signal();
    }
    void startWrite() {
        if ( rcnt > 0 || wcnt > 0 ) writers.wait();
        wcnt++;
    }
    void endWrite() {
        wcnt--;
        if ( !readers.empty() ) readers.signal();
        else writers.signal();
    }
};

The cascading readers.signal() in startRead() wakes waiting readers one by one, each waking the next — allowing all waiting readers to proceed together while excluding the writer. Note that the single-check if pattern here assumes the signalled thread resumes before the monitor state can change; under signal-and-continue semantics the waits would need enclosing while loops.

Nested Monitor Calls and the Rendezvous Problem

A thread calling a monitor member that in turn calls another monitor member creates a nested monitor call. If the inner monitor’s member blocks, the outer monitor’s lock is held by the blocked thread, preventing other threads from entering the outer monitor. This is the nested monitor problem — effectively a form of deadlock.

Solutions include: restructuring code to avoid nesting (preferred), using a global monitor that encompasses the nested call, or using lock release-and-reacquire protocols (complex and error-prone).

A related problem is failed cooperation: when one thread in a monitor attempts a rendezvous (waiting for a partner) and the expected partner never arrives. µC++ raises RendezvousFailure in this situation.

Comparing Semaphores and Monitors

Semaphores and condition variables differ in crucial ways:

Property          Semaphore (P/V)                        Condition Variable (wait/signal)
V before P        P does not block                       signal before wait is lost
Multiple Vs       may start multiple threads at once     each signal starts one thread serially
Block condition   P blocks only if counter is 0          wait always blocks

These differences mean that semaphore-based and monitor-based solutions to the same problem can look quite different. It is possible to simulate each with the other, but the simulations require care.
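One direction of the simulation, a counting semaphore built from monitor primitives in standard C++ (the permits() accessor is my addition, for observability only):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// The explicit counter is what distinguishes a semaphore from a bare
// condition variable: it remembers Vs that arrive before any P.
class Semaphore {
    std::mutex m;
    std::condition_variable cv;
    int counter;
public:
    explicit Semaphore( int n ) : counter( n ) {}
    void P() {
        std::unique_lock<std::mutex> lk( m );
        while ( counter == 0 ) cv.wait( lk );   // P blocks only if counter is 0
        counter -= 1;
    }
    void V() {
        std::lock_guard<std::mutex> lk( m );
        counter += 1;                           // a V before any P is not lost
        cv.notify_one();
    }
    int permits() {                             // observability only
        std::lock_guard<std::mutex> lk( m );
        return counter;
    }
};
```

The counter absorbs the ordering differences from the table: a V with no waiter simply increments, and the next P consumes it without blocking.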

Java Monitors

Java’s synchronized keyword implements a basic monitor, but with a weaker signaling model. Every object has an associated lock; a synchronized method or block acquires it. The built-in wait(), notify(), and notifyAll() are the condition operations.

Java uses no-priority nonblocking (signal-and-continue) semantics with a single implicit condition variable per object. The inability to have multiple named conditions makes it impossible to implement complex scheduling policies (like readers-writer with priority) without auxiliary data structures. Moreover, notify() wakes an arbitrary thread from the wait set — not necessarily the one that should be woken. This forces the use of notifyAll() (waking all waiters) more often than ideal, leading to the thundering herd problem. The programming pattern while ( !condition ) wait() is not just recommended — it is mandatory in Java due to both spurious wakeups and the arbitrary-wakeup semantics of notify().


Chapter 9: Direct Communication — Tasks

A task in µC++ is the concurrent analogue of a coroutine: a class with a void main() member that executes in its own thread, with its own stack, concurrently with other tasks. Tasks communicate via public member functions, but unlike a monitor — where the caller executes the function body — in a task, the function call is a message: the caller blocks, the task’s thread processes the request, and the caller resumes with the result.

Task Structure

_Task Server {
    int count = 0;
    void main() {
        for ( ;; ) {
            _Accept( request );   // external scheduling: wait for a call
        }
    }
public:
    int request( int val ) {
        count += val;
        return count;
    }
};

When a client calls server.request( 5 ), the client blocks. The server task, currently inside _Accept( request ), wakes up and executes the body of request. When request returns, the client unblocks and receives the return value. This is synchronous message passing with automatic serialization.

Task Scheduling: External vs. Internal

External scheduling uses _Accept to select which member calls are serviced. The task can accept different members at different points in its logic:

_Accept( insert ) { . . . }
or _Accept( remove ) { . . . }

The or allows accepting any of several members (non-deterministic choice). A _When( condition ) _Accept( . . . ) guards the accept with a condition — the member is only accepted if the condition is true. This enables rich scheduling policies without additional condition variables.

A concrete example: a bounded buffer task that gates inserts and removes based on the current count:

_Task BoundedBuffer {
    int buf[MaxItems];
    int in = 0, out = 0, count = 0;
public:
    void insert( int val ) { buf[in] = val; in = (in + 1) % MaxItems; count += 1; }
    int  remove()          { int v = buf[out]; out = (out + 1) % MaxItems; count -= 1; return v; }

    void main() {
        for ( ;; ) {
            _Accept( ~BoundedBuffer ) { break; }
            or _When( count < MaxItems ) _Accept( insert ) { }   // only when not full
            or _When( count > 0 )       _Accept( remove ) { }    // only when not empty
        }
    }
};

Without _When, a client calling insert on a full buffer would be accepted and the buffer would overflow. The guard makes the accept conditional: a blocked inserter is held in the accept queue until count < MaxItems becomes true (after a remove). Note that the logic lives entirely in main() — no mutex, no condition variable, no spurious wakeups.

Internal scheduling uses condition variables inside member functions. The task acquires its mutex at each member call and can wait internally:

void insert( int val ) {
    while ( count == MaxItems ) full.wait();   // re-check on wakeup
    // insert val
    empty.signal();
}

External scheduling is often cleaner for task-based communication because the control logic lives in main() — the task’s “brain” — rather than scattered across member functions.

Accepting the Destructor

When a client deletes a task (delete task_ptr), the destructor is called. If the task is still running, the client must wait. Using _Accept( ~Server ) in the task’s main allows graceful shutdown: the task completes its current work, then accepts the destructor call to terminate:

void main() {
    bool done = false;
    while ( !done ) {
        _Accept( request ) { . . . }
        or _Accept( ~Server ) { done = true; }
    }
}

This pattern ensures that in-flight requests are fully processed before the task terminates.
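Outside µC++, the same drain-then-terminate behaviour is commonly coded with a stop flag that the worker honours only once its request queue is empty (a standard-C++ sketch with hypothetical names):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Drain-then-terminate: the worker honours the shutdown request only after
// all in-flight requests are processed, mirroring _Accept( ~Server ).
class Server {
    std::mutex m;
    std::condition_variable cv;
    std::queue<int> requests;
    bool stopping = false;
    long processed = 0;
public:
    void request( int v ) { std::lock_guard<std::mutex> lk( m ); requests.push( v ); cv.notify_one(); }
    void shutdown()       { std::lock_guard<std::mutex> lk( m ); stopping = true;  cv.notify_one(); }
    long done() const     { return processed; }   // read after the worker exits
    void run() {                                  // the task's "main"
        for ( ;; ) {
            std::unique_lock<std::mutex> lk( m );
            while ( requests.empty() && ! stopping ) cv.wait( lk );
            if ( requests.empty() ) break;        // stopping and fully drained
            int v = requests.front(); requests.pop();
            lk.unlock();
            processed += v;                       // process outside the lock
        }
    }
};
```

The shutdown flag is checked only when the queue is empty, so no accepted request is dropped.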

Increasing Concurrency

Two structural patterns increase the amount of concurrency in a task-based system:

Server-side buffering: instead of processing each client request synchronously (one at a time), the server immediately acknowledges the request, stores it in a buffer, and processes requests concurrently with accepting new ones. An administrator task manages a pool of worker tasks:

// Administrator pattern
void main() {
    for ( ;; ) {
        _Accept( request ) {
            // dispatch to worker
        }
    }
}

Workers may be created on demand or maintained in a pool. Common worker roles include the simple worker (compute and report), the notifier (wait for an event and notify), the courier (transfer data between servers), and the timer (delay and notify).

Client-side concurrency — avoiding blocking while waiting for results:

  • Tickets: the client receives a unique ticket on submitting a request and polls (or blocks later) for the result.
  • Callbacks: the client provides a function to call when the result is ready.
  • Futures: the client receives a placeholder object. Accessing the value blocks until it is computed. _Select in µC++ allows waiting for the first of several futures:
Future_ISM<int> f1 = server1.request( . . . );
Future_ISM<int> f2 = server2.request( . . . );
_Select( f1 ) { . . . }   // wait for whichever arrives first
or _Select( f2 ) { . . . }

Futures are the foundation of modern async/await patterns in Python, JavaScript, and C++20. The µC++ Future_ISM (implicit storage management) tracks whether the value is available, cancelled, or computed, and provides an implicit conversion operator that blocks if the value is not yet ready.
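The standard-C++ analogue uses std::future: std::async starts the computation concurrently, and get() blocks only at the point of use. (Standard C++ has no direct _Select equivalent; waiting for whichever of several futures completes first requires wait_for polling or library support.)

```cpp
#include <future>

int slowSquare( int x ) { return x * x; }   // stand-in for a server request

int demo() {
    // Both requests run concurrently; the client keeps working until get().
    std::future<int> f1 = std::async( std::launch::async, slowSquare, 6 );
    std::future<int> f2 = std::async( std::launch::async, slowSquare, 7 );
    return f1.get() + f2.get();             // block only when values are needed
}
```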


Part VI: Performance and Modern Approaches

Chapter 10: Memory, Optimization, and the Memory Model

Correct concurrent programs still need to perform well. This chapter examines how compilers, processors, and memory hierarchies interact with concurrent code — and why these interactions can silently violate correctness assumptions.

Sequential Optimizations

Compilers and processors apply numerous optimizations to sequential code that are correct in isolation but problematic in concurrent contexts:

Instruction reordering: the compiler may reorder instructions if they appear to the sequential observer to have the same effect. In a single-threaded context, this is always safe. In a multi-threaded context, another thread may observe the intermediate state — a state that should not be visible.

Register allocation: the compiler may keep a variable in a register rather than writing it to memory. Another thread reading the variable from memory sees the stale value. The C volatile keyword prevents this optimization for a specific variable, but volatile alone is insufficient for correct concurrent code — it prevents register caching but does not prevent reordering.

Elision (dead store elimination): if a write to a variable appears to be overwritten before any read, the compiler may eliminate it. In a concurrent context, another thread may read the intermediate value that was “dead” from the single-thread perspective.

Replication: the compiler may read a variable once and reuse the cached value for subsequent reads in a loop. Another thread modifying the variable makes this cache stale.

The Memory Hierarchy and Cache Coherence

Modern processors use multi-level cache hierarchies (L1, L2, L3) to bridge the speed gap between processor registers and main memory. A cache line (typically 64–256 bytes) is the unit of data transfer. When a variable is loaded, the entire cache line containing it is brought up the hierarchy.

In a multiprocessor, each core has its own L1 and L2 caches, with L3 typically shared. When two cores hold copies of the same cache line and one modifies it, the other’s copy becomes stale. Cache coherence protocols (MESI, MOESI) maintain the invariant that all cores see a consistent view of memory, but the implementation introduces communication overhead.

False sharing occurs when two threads access different variables that happen to reside on the same cache line. Whenever one thread writes its variable, the entire cache line is invalidated in the other thread’s cache, forcing a reload even though the other thread’s variable was not actually modified. This causes significant performance degradation (cache thrashing) even when threads have no logical data sharing.

To remedy false sharing: if two variables are accessed frequently by different threads and the sum of their sizes is less than a cache line, pad or align them so each occupies its own cache line.

The MESI Protocol: How Cache Coherence Works

MESI is the dominant cache coherence protocol for shared-memory multiprocessors. The acronym names the four states a cache line can be in, from any given core’s perspective:

  • Modified (M): this core has the only valid copy, and it has been modified (dirty). Memory is stale. The core must write back to memory before any other core can read this line.
  • Exclusive (E): this core has the only copy, and it matches memory (clean). Can be silently written to (transition to M) without notifying other cores.
  • Shared (S): multiple cores may hold valid read-only copies. A write requires invalidating all other copies first.
  • Invalid (I): this core’s copy is stale or absent. A read causes a cache miss and a fetch from memory or another core’s cache.

The protocol operates through snooping: every cache monitors (snoops) the shared bus. When a core writes to a Shared line, it broadcasts an invalidation; all other caches set that line to Invalid. When an Invalid line is read, the reading core broadcasts a read request; if another core holds it in Modified state, that core writes back the dirty data, both cores end up in Shared state, and memory is updated.

Snooping vs. directory-based coherence: snooping requires broadcasting every coherence message to all caches — this scales to perhaps 16–64 cores. For larger systems, directory-based coherence replaces the broadcast bus with a directory that tracks, per cache line, which cores have copies. Coherence messages are sent only to the relevant cores. The directory eliminates the broadcast bottleneck at the cost of added latency for the directory lookup.

Performance implications: the MESI protocol makes a write to a shared line expensive. The writing core must first acquire exclusive ownership (sending an invalidation that each other core must acknowledge) before the write can proceed. If multiple cores frequently write the same cache line (even different bytes within it — false sharing), the line bounces between cores in M state, generating massive coherence traffic. This is cache line ping-ponging, and it can reduce parallel speedup to below 1.0.

A concrete state-transition walkthrough for two cores (C0 and C1) sharing variable x:

Event            C0 state   C1 state           Memory
Start            I          I                  x = 0
C0 reads x       E          I                  x = 0
C1 reads x       S          S                  x = 0
C0 writes x=1    M          I (invalidated)    x = 0 (stale)
C1 reads x       S          S                  x = 1 (C0 writes back)

At the “C0 writes x=1” step, C0 broadcasts an invalidation on the bus. C1 snoops it and marks its copy Invalid. The write-back to memory happens when C1 next reads x — C0 intercepts the read request and supplies the dirty data directly (cache-to-cache transfer), updating memory in the process.

False sharing demonstration — two threads increment separate counters that share a cache line:

constexpr int P = 8;                 // number of threads
struct alignas(64) PaddedCounter {   // force each counter to its own cache line
    long count = 0;
    // char pad[64 - sizeof(long)];  // explicit padding if no alignas
};

PaddedCounter counters[P];           // one per thread, on separate lines

// Each thread only writes counters[myId].count — no actual sharing
// Without alignas(64): both counters[0] and counters[1] may share a line
// → every write by thread 0 invalidates thread 1's copy → serial throughput
// With alignas(64): each counter is on its own line → true independence

Arithmetic Intensity and the Roofline Model

Two quantities determine the performance ceiling of any parallel program:

Arithmetic intensity is the ratio of floating-point operations to bytes transferred from memory:

\[ I = \frac{\text{FLOPs}}{\text{bytes}} \]

A matrix-vector multiply has low arithmetic intensity (roughly 1 FLOP per 8 bytes for double-precision). A matrix-matrix multiply has high arithmetic intensity (O(N) FLOPs per element fetched — the same data is reused many times).

The Roofline model (Williams, Waterman, Patterson, 2009) plots attainable GFLOPs/s as a function of arithmetic intensity. The ceiling has two regimes:

  • Memory-bandwidth-bound (left, low intensity): performance is limited by how fast data arrives from memory. Doubling compute units does not help if memory cannot feed them. The bound is \( I \times B \) where \(B\) is peak memory bandwidth.
  • Compute-bound (right, high intensity): performance is limited by the number of floating-point units. The bound is the peak FLOP/s of the processor.

The transition between regimes is the arithmetic intensity ridge point: \( I^* = \text{peak FLOPs} / \text{peak bandwidth} \). On a modern server CPU, this is typically 5–15 FLOPs/byte; on a GPU it is 50–200 FLOPs/byte (GPUs have very high compute density but relatively limited bandwidth per FLOP).

Practical consequences: sparse matrix operations, graph traversals, and most random-access workloads are strongly memory-bandwidth-bound. Dense linear algebra (BLAS level 3 routines) is compute-bound when implemented with tiling. The Roofline model tells you which optimization matters: for a memory-bound kernel, better algorithms that reduce data movement matter more than faster arithmetic units.

A concrete example — computing arithmetic intensity for a dot product vs. matrix multiply:

Dot product: a[i] * b[i] + ... (N iterations)
  FLOPs:  2N   (1 mul + 1 add per iteration)
  Bytes:  2N * 8  (two double arrays, each read once)
  Intensity: 2N / 16N = 0.125 FLOPs/byte  ← deeply memory-bound

Matrix multiply: C = A * B  (N × N matrices)
  FLOPs:  2N³   (N² dot products, each N FLOPs)
  Bytes:  3N² * 8  (read A, B; write C — each element transferred once, assuming ideal reuse)
  Intensity: 2N³ / 24N² = N/12 FLOPs/byte  ← grows with N, becomes compute-bound

For N = 1024:  intensity ≈ 85 FLOPs/byte — well above the ridge point on any CPU

This is why BLAS DGEMM achieves near-peak FLOP/s while a naive dot product saturates memory bandwidth: the same bytes fund far more arithmetic when data is reused across many multiplications. Cache tiling in matrix multiply is simply an implementation trick to make the measured intensity match the theoretical value — keeping the working set in L2/L3 so the same elements are reused many times before being evicted.

The Memory Model

The memory model defines what values a thread may observe when reading a shared variable. Without a memory model contract, compiler and processor optimizations make behavior unpredictable.

Sequential consistency (Lamport, 1979): the most intuitive model. The result of any concurrent execution is as if all operations were executed in some total order that respects each thread’s program order. Under SC, there is no concept of “stale” or “out-of-order” reads.

Real hardware does not implement sequential consistency by default:

  • Total Store Order (TSO) — used by x86: stores are buffered in a per-core write queue before reaching memory. A load may return a value from the queue (seeing the thread’s own recent stores) while other threads still see the old value. TSO preserves the FIFO order of stores but allows a later load to be reordered ahead of an earlier store (store–load reordering).

  • Weak memory ordering — used by ARM, POWER: loads and stores may be reordered arbitrarily, subject only to data dependencies. This allows more aggressive optimizations but requires explicit memory fences (barriers) to enforce ordering.

C++11 atomics provide a portable memory model. Operations on std::atomic<T> are sequentially consistent by default; relaxed memory ordering can be requested for performance:

std::atomic<int> x{0};
x.store( 1, std::memory_order_release );   // paired with acquire
int val = x.load( std::memory_order_acquire );

The release/acquire pair creates a happens-before relationship: everything before the release is visible to the thread performing the acquire. This is the minimum ordering needed for correct lock implementation.
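A minimal publication sketch shows the pairing: the writer's release store hands off everything written before it, and the reader's acquire load receives it:

```cpp
#include <atomic>
#include <thread>

// Release/acquire pairing: once the reader's acquire load observes
// ready == true, the earlier write to 'payload' is guaranteed visible.
int payload = 0;                      // plain data, protected by the flag
std::atomic<bool> ready{false};

void writer() {
    payload = 42;                                  // (1) write data
    ready.store( true, std::memory_order_release ); // (2) publish
}

int reader() {
    while ( ! ready.load( std::memory_order_acquire ) ) {}  // (3) spin until published
    return payload;                                          // (4) sees 42
}
```

With relaxed ordering on both operations instead, step (4) could legally observe 0 on weakly ordered hardware; the release/acquire pair is exactly what forbids that.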

volatile in C and C++ prevents register caching and dead-store elimination for a single variable, but does not prevent instruction reordering and is insufficient for thread synchronization. Java's volatile is different: it provides sequential consistency for accesses to that variable, which is why it can legitimately be used for synchronization in Java.

Double-checked locking without atomics is broken: the initialization and the check-for-initialized are not ordered with respect to other threads, so one thread may see a partially constructed object. The C++11 fix uses std::call_once or static local variable initialization (which is guaranteed thread-safe since C++11).

The double-checked locking bug and fix are a classic illustration of why memory ordering matters:

// BROKEN: without atomic, compiler/CPU may reorder store to 'instance'
// before the constructor body completes
class Singleton {
    static Singleton * instance;
    static std::mutex m;
public:
    static Singleton * get() {
        if ( instance == nullptr ) {         // check 1 (unsynchronized)
            std::lock_guard<std::mutex> lk(m);
            if ( instance == nullptr ) {     // check 2 (synchronized)
                instance = new Singleton();  // PROBLEM: may be partially visible
            }
        }
        return instance;
    }
};

// CORRECT: std::atomic ensures visibility across threads
class Singleton {
    static std::atomic<Singleton*> instance;
    static std::mutex m;
public:
    static Singleton * get() {
        Singleton * p = instance.load( std::memory_order_acquire );
        if ( p == nullptr ) {
            std::lock_guard<std::mutex> lk(m);
            p = instance.load( std::memory_order_relaxed );
            if ( p == nullptr ) {
                p = new Singleton();
                instance.store( p, std::memory_order_release );
            }
        }
        return p;
    }
};

// SIMPLEST CORRECT: C++11 static local — thread-safe initialization guaranteed
Singleton * get() {
    static Singleton instance;   // initialized exactly once, thread-safe
    return &instance;
}

The release store ensures all writes from the constructor are visible before any thread reads instance via an acquire load. The magic of the C++11 static local is that the compiler inserts equivalent fencing — you get safety without writing it yourself.


Chapter 11: Lock-Free Data Structures and Alternative Approaches

Transactional Memory: Optimistic Concurrency for Shared State

Transactional memory (TM) applies the database concept of transactions to shared memory: a programmer wraps a block of code in an atomic{} construct, and the runtime guarantees that the block executes atomically and in isolation with respect to other transactions. If two transactions conflict, one is rolled back and retried. From the programmer’s perspective, the complexity of lock acquisition, release, and deadlock avoidance disappears.

// Transactional memory (conceptual):
atomic {
    x++;
    y--;          // atomic, isolated — no explicit lock needed
}

The appeal is enormous: no lock ordering to remember, no deadlocks possible, composable (two transactional operations can be composed into one transaction without redesigning either). The challenge is implementation.

Data versioning manages the old and new values during a transaction. Two strategies:

Eager versioning (undo logging): writes go directly to memory, but each modified location’s original value is saved in a per-transaction undo log. If the transaction aborts, the undo log is applied in reverse to restore the original values. The advantage is that committed data is immediately visible (no delayed write-back). The disadvantage is that an abort is expensive — every logged write must be undone.

Lazy versioning (write buffer): writes go to a private per-transaction buffer. Memory is only updated when the transaction commits. If the transaction aborts, the buffer is simply discarded. The advantage is cheap aborts. The disadvantage is that reads within the transaction must check the private buffer before reading from memory (increasing read overhead), and committing requires atomically writing all buffered values.

A step-by-step trace illustrates the difference:

Shared memory:  x = 3,  y = 7
Transaction T:  x = 5;  y = x + 1;   (intends to set x=5, y=6)

── Eager versioning ──────────────────────────────────────────────────
Step 1:  undo_log ← {x: 3}         // save old value
         memory[x] ← 5             // write directly to memory
Step 2:  undo_log ← {x:3, y:7}     // save old y
         memory[y] ← 6             // write directly to memory
  Commit: validate read set — no conflict → undo_log discarded (done)
  Abort:  apply undo_log in reverse: memory[y]=7, memory[x]=3 (restored)

── Lazy versioning ───────────────────────────────────────────────────
Step 1:  write_buf[x] ← 5          // stays in private buffer
Step 2:  read x → check buf → 5    // reads from own buffer, not memory
         write_buf[y] ← 6
  Commit: validate read set → no conflict
          atomically flush: memory[x]=5, memory[y]=6
  Abort:  discard write_buf (zero cost — memory is untouched)

The key tradeoff: eager versioning makes reads cheap (always from real memory) but aborts expensive (must undo); lazy versioning makes aborts cheap (discard buffer) but reads slightly expensive (must check buffer first) and commits potentially costly (atomic flush of all buffered writes).

Conflict detection determines when two transactions interfere. A conflict occurs when one transaction reads a location another writes (or both write the same location):

Pessimistic detection: conflicts are detected immediately when a load or store is requested. The detecting transaction either waits (blocking) or aborts immediately (non-blocking). This minimizes wasted work but may cause high contention under heavy conflict.

Optimistic detection: conflicts are checked only at commit time. The transaction runs freely and only validates its read and write sets before committing. If a conflict is found, the transaction aborts and retries. This performs well under low contention (the common case in well-designed concurrent programs) but wastes work when conflicts are frequent.

Hardware transactional memory (HTM): Intel processors from Haswell onward exposed HTM (TSX) through XBEGIN/XEND/XABORT instructions, though Intel has since disabled TSX on many parts after a series of errata. The cache coherence hardware tracks each transaction’s read and write sets at cache-line granularity — conflict detection comes for free as a side effect of MESI snooping. A conflicting invalidation causes the transaction to abort. HTM is fast but has limited capacity (the transaction’s working set must fit in L1 cache) and can abort for many reasons beyond conflicts (interrupts, capacity overflow, certain instructions). All HTM implementations require a software fallback path for when the hardware aborts persistently.

When to use TM: transactional memory is most effective when conflicts are rare and transactions are short. It excels at protecting irregular data structures (trees, graphs, linked lists) where fine-grained locking would require complex locking protocols and is prone to deadlock. It is a poor fit for I/O-heavy code (I/O operations cannot be rolled back) or for workloads with high contention (frequent aborts waste work and energy).

Compare-and-Assign and Lock-Free Stacks

The compare-and-swap (CAS) instruction — also called compare-and-assign — atomically performs: if *ptr == expected then *ptr = desired and return true; else return false. It is the building block for lock-free (also called non-blocking) data structures, in which progress is guaranteed without any thread ever holding a lock.

A lock-free stack using CAS:

struct Node { int val; Node * next; };
std::atomic<Node*> head{nullptr};

void push( int val ) {
    Node * n = new Node{ val };
    do {
        n->next = head.load();
    } while ( !head.compare_exchange_weak( n->next, n ) );
}

Node * pop() {
    Node * old_head;
    do {
        old_head = head.load();
        if ( old_head == nullptr ) return nullptr;
    } while ( !head.compare_exchange_weak( old_head, old_head->next ) );
    return old_head;
}

The retry loop ensures that if another thread modifies head between the load and the CAS, the operation restarts rather than corrupting the structure.

The ABA Problem

CAS compares values, not identities. If a thread reads head = A, then A is popped and B is pushed, and then A is pushed again (perhaps after being freed and reallocated), another thread’s CAS sees head == A and succeeds — even though the state is entirely different. This is the ABA problem and can cause corrupted data structures or use-after-free bugs.

Here is a step-by-step trace on the lock-free stack:

Initial state:  head → A → B → null   (Thread 1 will pop)

Thread 1 reads: old_head = A, A->next = B   (about to CAS head: A → B)
  — Thread 1 is preempted here —

Thread 2 pops A:  head → B → null
Thread 2 pops B:  head → null
Thread 2 pushes A again (same address reused):  head → A → null

Thread 1 resumes:
  CAS( head, A, B ) — succeeds! (head is still A)
  head now → B → null   ← B was already freed — use-after-free!

The fix with tagged pointers (version counter packed alongside the pointer):

struct TaggedPtr {
    Node * ptr;
    uintptr_t tag;   // incremented on every push
};
std::atomic<TaggedPtr> head{ {nullptr, 0} };

void push( int val ) {
    Node * n = new Node{ val };
    TaggedPtr old_head = head.load();
    TaggedPtr new_head;
    do {
        n->next = old_head.ptr;
        new_head = { n, old_head.tag + 1 };   // bump version
    } while ( !head.compare_exchange_weak( old_head, new_head ) );
}

Now Thread 2’s re-push increments the tag. When Thread 1 attempts its CAS expecting {A, 0}, the actual head is {A, 1} — the tag mismatch causes the CAS to fail, and Thread 1 retries with the correct current state.

Solutions include:

  • Tagged pointers: pack a version counter into the unused bits of a pointer (or a second word, as above). The CAS now checks both the address and the version, so the intervening pops and re-push of A change the version and the stale CAS fails.
  • Hazard pointers: before dereferencing a pointer, a thread registers it as a hazard. Memory reclamation checks hazard registrations before freeing — a pointer visible to any thread cannot be freed.
  • Epoch-based reclamation: threads declare their “epoch” (logical time). Memory freed in epoch N is not reclaimed until all threads have advanced past epoch N.

Lock-free data structures are not a panacea. They are correct but complex to implement, and their performance advantage over lock-based structures is often marginal except under very high contention. Prefer well-tested lock-based structures from standard libraries; use lock-free structures only when profiling demonstrates lock contention is a bottleneck.

GPGPU Computing

Graphics Processing Units (GPUs) are massively parallel processors with thousands of cores designed for data-parallel computation. The programming model, exemplified by CUDA, differs fundamentally from shared-memory threading:

  • Threads execute in groups called warps (NVIDIA, 32 threads) or wavefronts (AMD, 64 threads). All threads in a warp execute the same instruction simultaneously (SIMT — Single Instruction Multiple Threads).
  • Divergence: if threads within a warp take different branches, the warp serializes the branches. Code should be written to minimize branch divergence for maximum throughput.
  • Memory hierarchy: GPU has registers (per thread), shared memory (per thread block, ~48 KB), and global memory (~GBs, high latency). Maximizing performance requires carefully staging data through shared memory.
  • Synchronization: __syncthreads() provides a barrier within a thread block. Cross-block synchronization requires global memory and atomics.

GPUs excel at embarrassingly parallel problems: matrix multiplication, image processing, neural network inference. Problems with irregular data access patterns, frequent synchronization, or complex control flow are poor fits for GPU execution.

Concurrency Across Languages

The design of concurrency support in a programming language reflects fundamental decisions about the level of abstraction provided:

Ada 95 takes a high-level approach with protected objects (a form of monitor) and tasks (threads) as first-class language concepts. The select statement provides external scheduling. Rendezvous-based synchronization — the caller blocks while the callee executes the accept body on its behalf — is a distinctive Ada feature.

Go uses goroutines (lightweight threads, similar to user threads) and channels for communication. The philosophy — “Don’t communicate by sharing memory; share memory by communicating” — encourages designing concurrent systems as networks of goroutines passing data rather than as shared-memory objects protected by locks. Channels are typed first-class values; the select statement waits for the first of multiple channels to become ready.

A Go producer-consumer pipeline with a buffered channel illustrates the idiom:

func producer(ch chan<- int, n int) {
    for i := 0; i < n; i++ {
        ch <- i * i    // send squares into the channel
    }
    close(ch)          // signal that no more values are coming
}

func main() {
    ch := make(chan int, 10)   // buffered: producer runs ahead by up to 10 items
    go producer(ch, 20)
    for val := range ch {      // range loop receives until channel is closed
        fmt.Println(val)
    }
}

Contrast with a µC++ _Task bounded buffer: in Go, the channel is the buffer — there is no separate data structure to protect. The select statement adds the ability to service whichever of several channels is ready first, directly analogous to µC++’s _Accept ... or _Accept:

select {
case v := <-ch1:  // receive from ch1
    process(v)
case ch2 <- x:    // send to ch2 (only when ch2 has room)
    x = nextValue()
case <-done:      // shutdown signal
    return
}

Java provides the synchronized keyword, wait()/notify()/notifyAll(), and the comprehensive java.util.concurrent package (introduced in Java 5). The latter includes ConcurrentHashMap, BlockingQueue, Semaphore, CountDownLatch, CyclicBarrier, and an executor framework. Java 21 introduced virtual threads (Project Loom), which are lightweight user threads managed by the JVM — similar in spirit to goroutines.

C++11 standardized threading with std::thread, std::mutex, std::condition_variable, and std::atomic. The <future> header provides std::future, std::promise, and std::async. C++17 added parallel execution policies for standard algorithms (std::for_each(..., std::execution::par, ...)). C++20 added coroutines, std::latch, and std::barrier.

Pthreads

POSIX Threads (Pthreads) is the low-level C threading API underlying most Unix/Linux threading. Its primitives — pthread_create, pthread_join, pthread_mutex_lock, pthread_cond_wait — map directly to the OS kernel’s thread management. Pthreads is verbose but provides fine-grained control.

A standard pattern packages the thread’s argument and result in a struct passed through pthread’s void * protocol:

struct WorkerArg { int id; double result; };
void * worker( void * arg ) {
    WorkerArg * wa = (WorkerArg *)arg;
    wa->result = compute( wa->id );
    return nullptr;
}
pthread_t tid;
WorkerArg arg{ 7, 0.0 };
pthread_create( &tid, nullptr, worker, &arg );
pthread_join( tid, nullptr );

This pattern is essentially what std::thread encapsulates.

OpenMP

OpenMP is a pragma-based parallel programming extension for C, C++, and Fortran, targeting shared-memory parallelism. Parallel regions and data-sharing clauses are expressed as annotations on sequential code:

#pragma omp parallel for schedule(dynamic) num_threads(4)
for ( int i = 0; i < N; i++ ) {
    result[i] = compute( data[i] );
}

OpenMP is ideal for incrementally parallelizing existing sequential code. Its reduction clause automatically handles common patterns:

double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for ( int i = 0; i < N; i++ ) sum += f(i);

OpenMP’s sections pragma supports task parallelism (different code paths running concurrently); task and taskwait support recursive divide-and-conquer. The atomic and critical pragmas provide fine-grained synchronization within parallel regions.

A recursive Fibonacci with OpenMP tasks shows the divide-and-conquer pattern:

long fib( int n ) {
    if ( n < 2 ) return n;
    long x, y;
    #pragma omp task shared(x)
        x = fib( n - 1 );           // spawn a task for the left branch
    #pragma omp task shared(y)
        y = fib( n - 2 );           // spawn a task for the right branch
    #pragma omp taskwait            // wait until both children finish
    return x + y;
}

int main() {
    #pragma omp parallel
    #pragma omp single              // only one thread generates the root task
        printf( "%ld\n", fib(30) );
}

The task pragma creates a unit of work that any idle thread in the team can execute — this is exactly the work-stealing model described earlier, but implemented with directives rather than explicit task objects. The taskwait corresponds to a join or delete of a µC++ task.

The limitation of OpenMP is that it extends sequential code rather than restructuring it. Complex inter-task dependencies, dynamic task graphs, and fine-grained producer-consumer patterns are better expressed with explicit thread management.


These notes cover the full arc of CS 343 — from the foundations of structured control flow, through the coroutine abstraction, to the full machinery of concurrent programming with locks, semaphores, monitors, and tasks. The deepest lesson of this course is that concurrency is first and foremost a language design problem: the primitives your language provides shape not just how you write concurrent programs but how you reason about their correctness. Peter Buhr’s µC++ is designed to make that reasoning as explicit and disciplined as possible.

Back to top