CS 136E: Algorithm Design and Data Abstraction — The Machine Behind C

Estimated study time: 3 hr 5 min

Table of contents

Why make it up
CS 136 introduces C, pointers, and recursion as the bridge from Racket’s functional purity to imperative reasoning. It necessarily stops short of where the language actually lives — the machine that runs it, the optimizations that exploit its undefined corners, and the work that goes into building a small language of one’s own. CS 136E fills that gap with three movements: how to reason about state once it is no longer pure; how state actually lives in registers, stacks, and heaps on real ARM64 and x86-64 hardware; and how to construct a tiny imperative language in C, from lexer through mark-and-sweep collector. Ragde’s flaneries (TYR, IYMLC, FICS, FICS2) reach for adjacent ideas through Racket and a functional lens; CS 136E is unapologetically C-flavored and machine-flavored — the working systems-programmer’s enrichment of CS 136.

Sources and References

Ragde, Prabhakar. Teach Yourself Racket (TYR); If You Must Learn C (IYMLC); A Functional Introduction to Computer Science, Part I (FICS) and Part II (FICS2). Publicly available at cs.uwaterloo.ca/~plragde/flaneries/. The impurity-in-Racket discussion in Chapter 1 draws on TYR §3 and FICS2 §2 (Effects in Racket); the program-verification sketch in Chapter 4 is informed by FICS2 §3 (SIMP); the interpreter structure in Part III builds on ideas from FICS §10 and FICS2 §6–7. Ideas are attributed inline where drawn.

Kernighan, Brian & Ritchie, Dennis. The C Programming Language, 2nd ed. Prentice Hall, 1988. The canonical reference for C semantics and idiom.

Bryant, Randal & O’Hallaron, David. Computer Systems: A Programmer’s Perspective, 3rd ed. Pearson, 2016 (CSAPP). Primary source for memory layout, calling conventions, and machine-level programming in Part II.

Nystrom, Robert. Crafting Interpreters. Available at craftinginterpreters.com. Primary structural source for the interpreter in Part III.

Cooper, Keith & Torczon, Linda. Engineering a Compiler, 3rd ed. Morgan Kaufmann, 2022. Reference for lexer and parser design in Chapter 10.

Hoare, C.A.R. “An Axiomatic Basis for Computer Programming.” Communications of the ACM 12(10), 1969. The original paper behind Hoare-triple notation used in Chapter 4.

ARM. ARM Architecture Reference Manual, ARMv8-A. developer.arm.com. All ARM64 instruction semantics verified here.

Intel. Intel® 64 and IA-32 Architectures Software Developer’s Manual. All x86-64 instruction semantics verified here.

cppreference.com. C language reference. C standard library and language rules verified here.

Drepper, Ulrich. “What Every Programmer Should Know About Memory.” Red Hat, 2007. Freely available at akkadia.org. Primary source for cache effects and memory hierarchy in Chapter 8.

Levine, John. Linkers and Loaders. Morgan Kaufmann, 1999. Free web edition available. Reference for the linker/loader discussion in Chapter 7.

Patterson, David & Hennessy, John. Computer Organization and Design: ARM Edition (ARMv8). Morgan Kaufmann, 2017. Reference for ARMv8-A instruction encoding in Chapter 5.

Pierce, Benjamin. Types and Programming Languages (TAPL). MIT Press, 2002. Reference for environment semantics and evaluation rules in Chapter 11.

Abelson, Harold & Sussman, Gerald. Structure and Interpretation of Computer Programs (SICP), 2nd ed. MIT Press, 1996. Freely available at mitpress.mit.edu. The environment-frame model of Chapter 11 is directly descended from SICP §3.2.

MIT OpenCourseWare 6.004 (Computation Structures). ocw.mit.edu. Reference for the abstract machine and beta reduction correspondence in Part II.

Memarian, Kayvan et al. “Into the Depths of C: Elaborating the de facto Standards.” PLDI 2016. Basis for the pointer-provenance discussion in Chapter 2 — 323-person practitioner survey, Cerberus formal model, three-way ISO C11 / compiler / real-world divergence.

Regehr, John. Embedded in Academia blog. blog.regehr.org. The case-analysis mental model (Type 1/2/3 functions), the stupid() example that compiles down to return 1, and the optimizer-as-theorem-prover framing in Chapter 3 are drawn from Regehr’s public writing.

Lee, Juneyoung; Hur, Chung-Kil; Lopes, Nuno; et al. “Reconciling High-Level Optimizations and Low-Level Code with Twin Memory.” PLDI 2019 (Alive2). The count of 95 LLVM miscompilation bugs, the freeze instruction, and the poison-value analysis in Chapter 3 draw on this work and the associated Alive2 verifier (alive2.llvm.org).

Evans, Jason. “A Scalable Concurrent malloc(3) Implementation for FreeBSD.” BSDCan 2006. Basis for the jemalloc discussion in Chapter 8: size classes, arenas, per-thread caches.

Wilson, Paul. “Uniprocessor Garbage Collection Techniques.” IWMM 1992. The GC taxonomy (reference counting vs. tracing; mark-sweep vs. mark-compact vs. copying) in Chapter 12 follows Wilson’s survey.

Appel, Andrew. “Simple Generational Garbage Collection and Fast Allocation.” Software: Practice and Experience 19(2), 1989. Source for the 80–98% early-death empirical figure and the generational-hypothesis framing in Chapter 12.

Jones, Richard; Hosking, Antony; Moss, Eliot. The Garbage Collection Handbook, 2nd ed. CRC Press, 2023 (~3,400 publications surveyed). The tri-color marking, barrier taxonomy, and GC algorithm comparisons in Chapter 12 are informed by this survey.

ISO/IEC 9899:2024 (C23). Draft N3096. cppreference.com tracks C23 additions referenced in Appendix K (attributes, typeof, _BitInt, nullptr).

Assumed background: CS 136 in full — C types, control flow, functions, pointers, dynamic memory with malloc/free, recursion, and basic ADTs (linked lists, stacks, queues).


Part I — Reasoning about State

Chapter 1: From Substitution to Sequencing

In CS 135 you had a precise model of computation: to evaluate (f x), replace each occurrence of the formal parameter in the body of f with the value of x, then evaluate the result. This is the substitution model, and it comes with a beautiful guarantee — the value of any subexpression depends only on that subexpression, never on what happened before or after. Functions are mathematical functions: the same inputs always produce the same outputs, and you can reason about a program by reading it the way you read algebra.

CS 136 introduced assignment, and the substitution model quietly broke. Not loudly; it just stopped applying. Consider:

int x = 0;
int y = x + 1;
x = 10;
int z = x + 1;

After these four lines, y is 1 and z is 11. The expression x + 1 appears twice, but it produces two different values, because x changed between the two uses. No algebraic rewriting explains this. The meaning of x + 1 depends not just on the expression itself but on when it is evaluated.

This temporal dimension is what imperative programming introduces. A program is no longer a description of what to compute; it is a recipe for what to do, and in what order. Understanding why a variable holds a particular value at a particular moment requires tracing the history of all assignments to it — which is why debugging an imperative program is so different from debugging a functional one.

Mutation in Racket: set! and Boxes

Racket’s teaching dialects exclude mutation precisely to keep the substitution model intact. Full Racket offers it back. The set! form (pronounced “set bang”) mutates an existing binding:

(define x 0)
(define y (+ x 1))  ; y is 1
(set! x 10)
(define z (+ x 1))  ; z is 11, not 1

The ! suffix is a Racket convention meaning this has side effects. Ragde’s Teach Yourself Racket (TYR §3) introduces this as the point where the language’s behavior “can no longer be modelled without tracking the current state of every mutable binding” — the substitution model needs an amendment, specifically a store.

What should that amendment look like? A clean formulation due to Ragde (FICS2 §2) augments the substitution model with a finite map σ — the store — from locations to values. Variables are compiled to locations; reading a variable looks up σ; set! updates σ. Every step of evaluation now carries σ as both an input (state before this step) and an output (state after):

\[\langle e, \sigma \rangle \;\Downarrow\; \langle v, \sigma' \rangle\]

A pure expression leaves σ unchanged (σ' = σ). A set! changes exactly one entry. Sequential composition threads σ' from one step as the σ into the next. This notation is called big-step operational semantics, and it is the formal language in which the semantics of imperative programs are actually stated — though we will work informally with it throughout this course.
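For instance, the (set! x 10) step from the earlier example can be written in this notation (a sketch, showing only x's entry in the store; void stands for the unit value that set! returns):

\[\langle (\texttt{set!}\ x\ 10),\ \{x \mapsto 0\} \rangle \;\Downarrow\; \langle \texttt{void},\ \{x \mapsto 10\} \rangle\]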

Racket also provides boxes — first-class mutable heap cells:

(define b (box 0))        ; create a box holding 0
(unbox b)                 ; => 0
(set-box! b 42)           ; mutate the box
(unbox b)                 ; => 42

A box is Racket’s explicit analogue of a C pointer-to-value. The correspondence is direct:

Racket             | C
(box 0)            | int *p = malloc(sizeof(int)); *p = 0;
(unbox b)          | *p
(set-box! b 42)    | *p = 42
b itself (the box) | p (the pointer)

In C the indirection is implicit in the type system (int * tells you this is a pointer to an int), while in Racket it is explicit in the syntax (box/unbox). Semantically they are the same: a name bound to a location, and the location holding a value.
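Written out as a complete C program, the box example above becomes the following (a minimal sketch; the malloc failure check is shown here once and omitted in later fragments for brevity):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int *b = malloc(sizeof(int));   // (box 0): allocate one mutable cell
    if (b == NULL) return 1;        // malloc can fail
    *b = 0;
    printf("%d\n", *b);             // (unbox b)       => 0
    *b = 42;                        // (set-box! b 42)
    printf("%d\n", *b);             // (unbox b)       => 42
    free(b);                        // Racket's GC does this step for you
    return 0;
}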

Referential transparency. An expression is referentially transparent if it can always be replaced by its value without changing the program's meaning. Pure functions and immutable bindings are referentially transparent. Any expression involving set!, I/O, or mutable state is not — its value depends on when it is evaluated, not just on its syntactic form.

Once mutation enters, referential transparency is gone and the substitution model no longer applies. This is not a bug — it is the feature. The whole point of mutation is that the world changes, and a program ought to be able to observe that change.

The State Model: State as Extra Parameter

Let us pin down the informal model. A C program has three ingredients:

  1. Expressions that are evaluated and yield values. Evaluating an expression may read the state.
  2. Statements that are executed and may update the state. The canonical statement is the assignment x = e;.
  3. State σ, mapping variable names (more precisely, memory locations) to their current values.

When we write int x = 5; int y = x + 1;, the state is σ₀ = {x → 5} after the first line and σ₁ = {x → 5, y → 6} after the second. The expression x + 1 reads σ(x) = 5 and yields 6. A later x = 10; gives σ₂ = {x → 10, y → 6}, and the next evaluation of x + 1 yields 11.

[Figure: State as an extra input/output. A pure function maps f(x) → value; an impure statement or expression maps f(x, σ) → (value, σ′), state in, state out. Sequential composition threads the state: stmt₁ takes σ₀ to σ₁, stmt₂ takes σ₁ to σ₂, stmt₃ takes σ₂ to σ₃.]

This model is abstract enough to reason with, but concrete enough to connect to the actual machine. The state σ will eventually correspond to the contents of registers and memory cells. The “sequential threading” will correspond to the instruction pointer advancing through machine code.

L-values and R-values

C makes a distinction that Racket blurs: whether an expression denotes a location or a value.

x = x + 1;

The x on the left denotes the location where the result will be stored — the l-value. The x on the right denotes the current value stored at that location — the r-value. Same syntax, different meaning depending on which side of = you are on.

L-value. An expression that denotes a memory location. It can appear on the left side of an assignment. Variables, pointer dereferences (*p), and array subscripts (a[i]) are l-values.

R-value. An expression whose current value is read but which does not designate a persistent location. x + 1 and 42 are r-values. Converting an l-value to an r-value is called an lvalue conversion (or load).

In assembly language, the distinction is explicit: a load instruction reads a value from an address, while a store instruction writes a value to an address. The C compiler decides whether each use of x needs a load (r-value context) or just the address (l-value context), and emits the appropriate instruction. We will see this in Chapter 5.
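A short illustration (the names are ours; the commented-out line would be rejected by the compiler because a literal designates no location):

void lvalue_demo(void) {
    int x = 0;
    int a[3] = {1, 2, 3};
    int *p = &x;

    x = 5;          // x as an l-value: a store to x's location
    *p = x + 1;     // *p is an l-value; x + 1 is an r-value (a load plus an add)
    a[2] = *p;      // a[2] is an l-value; *p on the right is read as an r-value
    // 42 = x;      // error: 42 is not an l-value
}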

Mutable Variables vs. Mutable Heap Cells

C has two independent sources of mutation: local variables (which can be reassigned) and heap cells (which can be written through pointers). They are conceptually distinct.

A local variable like int x = 5; lives on the stack. You can reassign x any number of times. When the function returns, the storage evaporates. The name x is a location, and you access it by name.

A heap cell allocated with malloc has no name — only a pointer to it:

int *p = malloc(sizeof(int));
*p = 5;     // write through pointer
*p = 10;    // reassign through pointer
free(p);    // storage reclaimed

Here *p is the l-value (the location), and p is how you navigate to it. Two different pointers can point to the same cell — aliasing — and then two different l-values refer to the same location. That is the subject of Chapter 2.

The practical upshot of this chapter: when you read a C program, you are not reading algebraic definitions. You are reading a script for state transformations. Every expression might be reading state, and every assignment is writing it. The program’s meaning is the trajectory through state-space it traces during execution — and that trajectory depends on the order in which statements execute.


Chapter 2: Aliasing, Lifetimes, and the C Abstract Machine

Two Names for One Cell

Consider:

int x = 42;
int *p = &x;
int *q = &x;

Now p and q both point to x. They are aliases of each other. Any write through p is immediately visible through q, and through x itself:

*p = 99;
printf("%d %d %d\n", x, *p, *q);   // 99 99 99

This is not a curiosity — it is fundamental to how C programs are structured. A function that receives a pointer can modify the caller’s data. A struct that contains a pointer can share memory with another struct. An array element can be accessed both by name and by pointer arithmetic. The state σ from Chapter 1 now has the property that a single location in σ may be reachable by multiple names.

[Figure: Aliasing — p, q, and x share one location. x lives at address 0x100 and holds 42; p and q both hold 0x100. The write *p = 99 changes x and *q simultaneously.]

Aliasing makes reasoning harder in a specific way: you cannot analyze a piece of code in isolation. The function:

void increment(int *a, int *b) {
    *a += 1;
    *b += 1;
}

seems to add 1 to two independent integers. But if you call increment(&x, &x), it adds 2 to x (if x starts at 42, the first statement makes it 43 and the second makes it 44). The code is correct for non-aliased inputs but surprising when aliased. A compiler trying to optimize this function cannot assume a and b point to different locations unless the programmer says so (with the restrict qualifier, a C99 feature that promises no aliasing between two pointers for the duration of a function; a sketch follows below).

Aliasing permeates C programs. Function arguments are pointers to caller data. scanf’s &n parameter is a pointer the library writes through. A linked list node’s next field points into the same heap arena as all other nodes. Once you think in terms of pointers, you are always reasoning about a graph of aliased locations.
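Here is how the restrict promise is written (a sketch; the function name is ours). Calling it as increment_restrict(&x, &x) breaks the promise and is undefined behavior, and in exchange the compiler may generate code that treats the two writes as independent:

void increment_restrict(int *restrict a, int *restrict b) {
    *a += 1;    // with restrict, the compiler may assume this write
    *b += 1;    // cannot be observed through b, and vice versa
}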

The Three Lifetimes of C Objects

Every object in C has a lifetime — a period during which the object exists and its storage is valid. C has three kinds of lifetime:

Automatic (stack) lifetime. Storage is allocated when the enclosing block is entered and reclaimed when the block is exited. Local variables have automatic lifetime. The storage is on the stack; when the function returns, the stack frame is popped and all its locals cease to exist.
Static lifetime. Storage exists for the entire duration of the program. Global variables and variables declared with static have static lifetime. They are initialized once (before main runs) and live until the program exits.
Allocated (heap) lifetime. Storage is explicitly allocated with malloc (or related functions) and explicitly released with free. The programmer is responsible for matching every allocation with exactly one deallocation. Objects created this way have allocated lifetime.

The three lifetimes create three classes of pointer hazard:

Dangling pointers to stack objects. A classic error:

int *dangle(void) {
    int local = 42;
    return &local;   // returns pointer to a stack variable
}

After dangle returns, local no longer exists. The pointer is dangling — it points to memory that may now belong to a different stack frame. Reading through it is undefined behavior. The compiler may warn about this, but it is legal C syntax.

Use-after-free. After free(p), the pointer p is dangling. The storage may be reallocated by a subsequent malloc, so writes through p corrupt unrelated data:

int *p = malloc(sizeof(int));
*p = 42;
free(p);
*p = 99;   // undefined behavior: use after free

Memory leaks. The converse of use-after-free: never calling free. The storage remains allocated but unreachable, wasting memory. In a long-running program, a leak eventually exhausts the heap.
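A minimal sketch of the leak pattern (the function name is ours): the pointer goes out of scope while the storage it names remains allocated.

#include <stdlib.h>

void leaky(int iterations) {
    for (int i = 0; i < iterations; i++) {
        char *buf = malloc(4096);
        if (buf == NULL) return;
        /* ... use buf ... */
    }   // buf goes out of scope each iteration; the 4 KB block is never freed
}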

These three hazards are the main reason garbage collection was invented: a GC-equipped language manages lifetimes automatically, eliminating all three. C delegates this to the programmer. In Part III we will build a mark-and-sweep collector for our tiny language precisely to see what that automation entails.

The C Abstract Machine

The C standard does not specify a concrete machine — it specifies an abstract machine: an imaginary computer whose behavior defines what every C program means. The abstract machine has a store (mapping locations to values), a program counter, a call stack, and a set of rules for evaluating each C construct.

Real hardware is free to implement the abstract machine however it likes, provided the observable behavior — the sequence of I/O operations and the final return value — is the same. This is the as-if rule: the compiler may transform your program arbitrarily as long as it produces the same observable results.

The consequences are profound. The compiler may:

  • Reorder statements that do not affect observable output.
  • Eliminate variables that are written but never read.
  • Replace a function call with an inlined copy of its body.
  • Prove that a loop terminates and delete the termination check.
  • Assume that undefined behavior never occurs — and optimize accordingly.

That last point is the dangerous one, and we will return to it at length in Chapter 3. For now, note that the abstract machine is the specification, not the implementation. When you write int x = 5; x = x + 1;, the abstract machine says x has value 6 after those two steps. Whether the real hardware ever stores 5 in a register, or whether it computes 6 directly, is the compiler’s business.

The as-if rule. A C compiler may perform any transformation to a program as long as the observable behavior — I/O and the final exit status — is identical to what the C abstract machine would have produced. Transformations permitted under this rule include dead-code elimination, constant folding, loop unrolling, and instruction reordering.

One important corollary: a variable may not exist in any register or memory location at a given point in the program’s execution if the compiler has determined its value is not needed. You cannot assume that &x will always produce a useful address — the compiler is only obliged to make &x work if you actually dereference it or pass it somewhere.
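As a small illustration (a sketch; actual output depends on the compiler and flags), the as-if rule lets a compiler emit identical code for both of these functions, because the temporary t is never observable:

int twice_with_temp(int n) {
    int t = n;      // the abstract machine stores n into t ...
    t = t + n;      // ... but no observable behavior depends on that store
    return t;
}

int twice_direct(int n) {
    return n + n;   // a compiler may produce this body for both functions
}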

volatile and the Torn Abstraction

There are situations where the abstract machine’s freedom to reorder and eliminate becomes a problem. Hardware device registers, for example, are memory-mapped locations that change independently of the program’s control flow. An I/O register might change its value between two consecutive reads because an external event fired. If the compiler caches the first read in a register and reuses that value, it misses the update.

C provides the volatile qualifier to address this:

volatile int *device_status = (volatile int *)0xFFC00100;
while (*device_status == 0) { /* wait */ }

volatile tells the compiler that every read and write to this location must actually access memory, in the order written. The compiler cannot cache, reorder, or eliminate accesses to volatile objects. volatile punches through the abstract machine and says: this is real hardware, not a mathematical variable.

Importantly, volatile does not provide any guarantee about atomicity or memory ordering for concurrent threads — that requires a separate mechanism (_Atomic types in C11, or compiler-provided intrinsics). volatile is specifically about the single-threaded interaction between a program and hardware registers, or between a debugger and program state.
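For contrast, here is the same wait loop without the qualifier (a sketch of the hazard, not a guaranteed outcome): the compiler may hoist the single load out of the loop, and the wait becomes either a no-op or an infinite loop depending on the first value read.

int *device_status = (int *)0xFFC00100;   // note: no volatile
while (*device_status == 0) {
    /* wait */
}
// The compiler may load *device_status once, keep the value in a register,
// and never re-read memory, so the external update is never seen.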

Aliasing and the Abstract Machine Together

Aliasing and the abstract machine interact in a particularly treacherous way. Consider:

int f(int *p, long *q) {
    *p = 1;
    *q = 2;
    return *p;   // must this be 1?
}

The compiler can reason: p is an int * and q is a long *. They cannot alias the same location (a location cannot simultaneously be an int and a long in the abstract machine’s type system). Therefore the write to *q cannot affect *p, and the return value must be 1. The compiler will optimize this to simply return the constant 1, eliminating the load of *p entirely.

But if the caller passes p and q as pointers to the same memory with different type interpretations, the compiler’s reasoning is invalidated — and the code produces a value the programmer did not intend. This is the strict aliasing rule, one of C’s most dangerous pieces of undefined behavior, and we discuss it in detail in Chapter 3.

The take-away from this chapter: C is a language of explicit locations with three distinct lifetimes, and any two pointers might or might not alias the same location. The compiler’s abstract machine model uses type information to resolve ambiguity — but it does so with sharp consequences when the programmer violates the rules.

The Cerberus Problem: Three-Way Misalignment

In 2016, Memarian, Matthiesen, Sewell, and collaborators at the University of Cambridge published “Into the Depths of C: Elaborating the De Facto Standards” (PLDI 2016), a systematic study of what C actually means in practice. Their central finding: there is a three-way misalignment between three different things that are all called “C semantics”:

  1. What systems code in production relies on. Real operating systems, kernels, runtime libraries, and allocators use pointer patterns that the ISO C11 standard classifies as undefined behavior. The Linux kernel, glibc, the CPython runtime, and many others contain these patterns.

  2. What compiler implementors aim to provide. GCC and Clang implement a subset of the ISO standard and add their own extensions. Their actual behavior on common UB patterns often differs from both the standard and from each other.

  3. What ISO C11 specifies. The standard is written by a committee and contains gaps, ambiguities, and deliberate underspecification.

The Cerberus team conducted a web survey in 2015 that received responses from 323 practitioners — including compiler developers and OS developers — asking specific questions about what they believed C programs were permitted to do. The answers revealed that even expert C programmers hold conflicting beliefs about the semantics of their own code.

The project’s proposed resolution is a provenance model for C pointers, which addresses one of the most common conflict points.

Pointer Provenance

Chapter 2 described aliasing as two pointers reaching the same memory location. Cerberus introduces a finer distinction: every pointer has a provenance — a record of which allocation it came from — and the abstract machine tracks this provenance separately from the numeric address.

Consider:

int x = 1, y = 2;
int *p = &x + 1;   // one past the end of x
int *q = &y;       // points to y
// On many implementations, &x + 1 may equal &y numerically

On a typical x86-64 or ARM64 stack, &x + 1 and &y may hold the same numeric address — x and y are adjacent on the stack. But they have different provenance: p was derived from &x, and q was derived from &y. Under the provenance model:

  • Pointer arithmetic on p is valid only within x (up to its one-past-the-end position). Dereferencing p, which points one past x, is out of bounds even if p’s numeric address equals &y.
  • Comparing p == q for equality: under the provenance model the result is not guaranteed, even though the numeric addresses match, because the compiler may fold the comparison using provenance information. (ISO C additionally makes relational comparisons such as p < q between pointers into unrelated objects undefined.)
  • Dereferencing q is valid and accesses y.

Pointer provenance. Each pointer carries a provenance: a token identifying the allocation from which the pointer was derived. Pointer arithmetic is only valid within the object (or one-past-end) of the provenance. Accessing memory through a pointer with the wrong provenance — even if the numeric address is valid — is undefined behavior under the Cerberus model. Provenance is not stored at runtime; it is a semantic concept used by the abstract machine and exploited by compilers in alias analysis.

The practical consequence: you cannot derive a pointer to object B by doing arithmetic on a pointer to object A, even if they are adjacent in memory. The C standard technically forbids this for unrelated objects; the provenance model makes explicit why — the pointer’s provenance restricts which addresses it may validly access.

Provenance and the Allocator

The provenance model connects directly to malloc. Every call to malloc creates a fresh provenance. Two allocations at the same address (possible after a free and a subsequent malloc) have different provenances even if the address is numerically the same. A pointer held to the old allocation, pointing at the same address, cannot validly access the new allocation.

This is why use-after-free is undefined behavior even if malloc immediately returns the same address: the new allocation has a different provenance, and the old pointer’s provenance is dead.
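A sketch of that scenario (the function name is ours, and whether malloc actually reuses the address is entirely up to the allocator):

#include <stdint.h>
#include <stdlib.h>

void provenance_demo(void) {
    int *old = malloc(sizeof(int));
    free(old);                           // old's provenance is now dead
    int *fresh = malloc(sizeof(int));    // may reuse the same address
    if ((uintptr_t)old == (uintptr_t)fresh) {
        // (strictly, even inspecting old's value after free is shaky;
        //  this comparison is purely for illustration)
        *fresh = 1;      // valid: fresh provenance
        // *old = 1;     // would be UB despite the equal numeric address
    }
    free(fresh);
}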

It is also why GC root tracking in Chapter 12 is subtle: a pointer’s value (numeric address) and its provenance together constitute the full identity of a pointer. A copying GC that moves objects must update all pointers — not just their numeric values, but their provenance tokens. Our mark-and-sweep GC in Chapter 12 avoids this because it does not move objects; a copying GC must handle it explicitly.

What Cerberus Found in Real Code

The survey revealed that experienced C programmers routinely rely on:

  • Integer-to-pointer casts for memory-mapped hardware access (assigning a hardware address like (int *)0xFFC00100). ISO C11 makes this implementation-defined; the provenance model accommodates it by treating integer-to-pointer casts as creating a pointer with a “wildcard” provenance that can access any live allocation at that address.
  • Comparison of unrelated pointers (e.g., pointer-range checks in allocators: if (ptr >= heap_start && ptr < heap_end)). Relational comparison of pointers into different objects is technically undefined in ISO C11; in practice, compilers compile this to a pair of unsigned comparisons.
  • memcpy and memmove across object boundaries — provenance-punning in the sense of treating a struct’s byte representation as a raw buffer.

The Cerberus team’s formal model (cerberus-c) is available as open-source software and can execute C programs under multiple memory models, showing which behaviors are consistent under each. It remains the most rigorous formal account of what C actually means.


Chapter 3: A Field Guide to Undefined Behavior

What the Standard Actually Says

Most programming languages specify what a program means in all circumstances, even bad ones. Java throws a NullPointerException; Python raises TypeError; Haskell’s type system makes large classes of errors impossible. C takes a different position. The C standard classifies problematic situations into three tiers: implementation-defined behavior (the behavior is unspecified but the implementation must document what it does), unspecified behavior (there are multiple valid interpretations, the implementation need not document which), and undefined behavior (the standard imposes no requirement whatsoever).

That last category is the dangerous one. When a C program invokes undefined behavior, anything can happen. Not “the program crashes,” not “the value is unpredictable,” but truly anything — including appearing to work correctly for years before failing catastrophically under a new compiler version, a new optimization flag, or a different hardware architecture.

Undefined behavior (UB). Behavior that the C standard explicitly does not constrain. A program that invokes UB is not a valid C program — the standard makes no prediction about what it will do. From the compiler's perspective, UB is a promise the programmer makes that certain situations will not arise, freeing the compiler to optimize under that assumption.

This framing — UB as a promise from programmer to compiler — is the key to understanding what goes wrong. When you write *p where p is null, you are not saying “crash here.” You are saying “I promise this will never happen, and if it does, I accept any consequence.” The compiler takes you at your word.

Case Study 1: Signed Integer Overflow

In C, overflow of signed integers is undefined behavior. Overflow of unsigned integers wraps around modulo \(2^{32}\) (or \(2^{64}\)), which is defined. These are different, and the distinction matters.

Consider this loop:

int i;
for (i = 0; i < i + 1; i++) {
    do_work(i);
}

The condition i < i + 1 is always true for any non-wrapping integer — mathematically, \(n < n + 1\) for all \(n\). But when i reaches INT_MAX, i + 1 would overflow. For unsigned i, that wrap is defined and the condition becomes UINT_MAX < 0, which is false, stopping the loop. For signed i, overflow is UB, so the compiler is entitled to assume it never happens — and therefore that i < i + 1 is always true, making the loop infinite.

GCC at -O2 will compile this into an infinite loop for int i, and a finite loop that stops at UINT_MAX for unsigned int i. Both are correct according to the standard.

This is not a compiler bug. It is a compiler doing exactly what the standard permits: using the UB contract to produce faster, simpler code. The lesson is that integer arithmetic in C is only defined for values that stay within the range [INT_MIN, INT_MAX], and reasoning about signed integers requires ensuring that range is never exceeded.

// safe addition: check before, not after
int safe_add(int a, int b) {
    if (b > 0 && a > INT_MAX - b) {
        /* overflow would occur */
        return -1;
    }
    if (b < 0 && a < INT_MIN - b) {
        /* underflow would occur */
        return -1;
    }
    return a + b;
}

Note that a + b > INT_MAX is itself undefined behavior if the addition overflows — you must check before performing the operation.
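GCC and Clang also provide checked-arithmetic intrinsics that report overflow instead of invoking it (not ISO C, but widely available); a sketch:

int safe_add_builtin(int a, int b, int *out) {
    // __builtin_add_overflow returns true if a + b does not fit in *out.
    if (__builtin_add_overflow(a, b, out)) {
        return -1;          // overflow detected; *out holds the wrapped value
    }
    return 0;               // *out holds the mathematically correct sum
}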

Case Study 2: Strict Aliasing

The C standard says that you may not access an object through a pointer of a different type, except through char *. More precisely: if two expressions refer to objects of different types, the compiler may assume they do not alias, unless one of the types is char.

int x = 1;
float *fp = (float *)&x;    // type-punning
*fp = 3.14f;                // undefined behavior: the int object x is accessed through a float lvalue

The compiler is entitled to assume that fp (a float *) and &x (an int *) point to different objects. It may cache the value of x in a register and never reload it after the store through fp, because “obviously” *fp = 3.14f cannot affect x — they’re different types.

Strict aliasing violations are among the hardest bugs to track down because they only appear when optimization is enabled. At -O0, the compiler does not cache aggressively, so the violation goes unnoticed. At -O2, optimizations that exploit the strict aliasing rule are enabled, and the program silently produces wrong results.

The standard-blessed way to type-pun (inspect the bit pattern of one type as another) is memcpy:

float f = 3.14f;
int bits;
memcpy(&bits, &f, sizeof(int));   // well-defined bit copy

Because memcpy works through void * and char * pointers internally, it does not violate the aliasing rules. Modern compilers recognize this pattern and optimize it to a register copy.

Case Study 3: Sequence Points and Evaluation Order

C specifies that between certain sequence points — after each full expression, after each function argument list is evaluated, at &&, ||, and ?: — all side effects of previous evaluations are complete. But within a full expression, the order of evaluation of subexpressions is largely unspecified.

The infamous example:

int i = 0;
i = i++ + ++i;   // undefined behavior

This modifies i twice without an intervening sequence point. The compiler can evaluate the subexpressions in any order and there is no defined result. Clang produces 0, GCC produces 2, and a different optimization level may produce something else entirely.

A subtler case:

printf("%d %d\n", ++i, ++i);   // undefined: both ++i modify i

Function arguments are evaluated before the call (there is a sequence point at the call), but the order of argument evaluation is unspecified. Both ++i subexpressions modify i without an ordering constraint between them.

The rule of thumb: modify each variable at most once in a full expression, and do not read a variable that you are also modifying elsewhere in the same expression.

Case Study 4: Uninitialized Reads

Reading an uninitialized variable is undefined behavior:

int x;
int y = x + 1;   // UB: x was never set

The naive expectation is that x holds whatever garbage was previously in that stack slot. That is sometimes what happens, but the compiler is permitted to go further. Since x is uninitialized and using it is UB, the compiler may assume that this code path is never reached. If the only path to y = x + 1 passes through code that was predicated on a condition, the compiler may decide the condition is always false and eliminate the entire branch — including code that does not depend on x at all.

This manifests in real bugs when a conditional branch writes to x in one arm but not the other, and the programmer mistakenly believes the “other arm never happens in practice”:

int x;
if (condition) { x = compute(); }
use(x);   // UB if condition is ever false

Compilers with -O2 have been observed to delete the condition check entirely here, because “if condition is ever false, we have UB, so we may assume it’s always true.”

Tools to find UB: AddressSanitizer (-fsanitize=address) catches out-of-bounds accesses and use-after-free; UndefinedBehaviorSanitizer (-fsanitize=undefined) catches signed overflow, null pointer dereferences, misaligned accesses, and more; Valgrind’s memcheck catches uninitialized reads. None of these is exhaustive, but together they catch the common cases. Adding -fsanitize=address,undefined to your build for testing is not optional; it is table stakes.
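A typical test build might look like this (assuming GCC or Clang; the file name is a placeholder):

cc -O1 -g -Wall -Wextra -fsanitize=address,undefined prog.c -o prog
./prog    # sanitizer reports go to stderr when a check fires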

Case Study 5: Out-of-Bounds Access

Reading or writing outside the bounds of an array is undefined behavior:

int a[5];
a[5] = 42;   // UB: index 5 is out of bounds for a[0..4]

The hardware will often not crash — the write succeeds because there happens to be writable memory adjacent to the array (the next local variable, or a saved frame pointer). The data at that location is silently corrupted. This is the source of a large class of security vulnerabilities: buffer overflows. By writing past the end of a stack buffer, an attacker can overwrite the function’s return address and redirect execution to injected code.

Compiler optimizations can also produce surprising effects. Once a loop body contains the access a[i], the compiler may infer that i must be a valid index into a (an out-of-bounds access would be UB, and UB cannot happen), and it may use that inference to simplify surrounding code: a later defensive test such as if (i < n) can be deleted as always true, because the access already implied it. The UB contract lets the optimizer reason backwards from an access to facts about the index, and any check that contradicts those facts is treated as dead code.

The Optimizer as Adversary

It is tempting to view UB as a list of obscure edge cases to memorize. A better mental model: the optimizer is a theorem prover that assumes your program has no UB, then derives the simplest program consistent with that assumption. If your program does have UB, the optimizer’s derivations are unsound, and the output is arbitrary.

Every undefined behavior in this chapter has been exploited by real compiler optimizations in production code. The history of Linux kernel bugs includes several cases where the compiler removed a null-pointer check (because “we dereference the pointer three lines earlier, so obviously it’s not null, so the check is dead code”) and thereby introduced a privilege-escalation vulnerability: on those systems a user process could map the page at address 0, so the “impossible” null dereference handed the kernel attacker-controlled data. A case study later in this chapter walks through the details.

The antidote is not avoiding C — it is understanding the contract. Use -Wall -Wextra. Use sanitizers during testing. Read the standard when in doubt. And never write code that depends on undefined behavior “happening to work” on your platform.

Regehr’s Case-Analysis Mental Model

Chapter 3 described undefined behavior as a promise from programmer to compiler. John Regehr (University of Utah), who has written extensively on UB and maintains one of the most-read systems programming blogs (blog.regehr.org), offers a complementary framing: the case-analysis mental model.

When a compiler sees a function body, it performs case analysis over all possible inputs. For each case, it asks: “does this input lead to defined behavior?” If no, the case is discarded. The compiler only produces code for defined cases — and since the discarded cases cannot arise in a valid program, the programmer has no standing to complain about what happens for them.

Regehr’s canonical example, quoted from his blog:

int stupid(int a) {
    if ((a + 1) > a) return 1;
    return 0;
}

The compiler performs case analysis:

  • If a < INT_MAX: a + 1 does not overflow. a + 1 > a is true. Return 1.
  • If a == INT_MAX: a + 1 overflows. This is signed integer overflow — undefined behavior. This case is discarded.

Since the only case where the function might return 0 is the discarded UB case, the compiler concludes: the return value is always 1. Output:

// Compiled output (GCC -O2):
int stupid(int a) { return 1; }

The if test and both branches are gone. This is not a compiler bug. It is a correct transformation: the programmer promised there would be no UB, and the only way there can be no UB is if the function always returns 1.

Regehr’s Three Function Types

Regehr classifies functions by how they interact with UB:

  • Type 1. Behavior is defined for all inputs. These functions are safe to call with any argument. Example: unsigned int add_unsigned(unsigned int a, unsigned int b) { return a + b; } — unsigned overflow is defined (wraps modulo \(2^{32}\)).
  • Type 2. Behavior is defined for some inputs and undefined for others. These have preconditions. Example: int add_signed(int a, int b) { return a + b; } — defined unless the result overflows. The function has an implicit precondition that the sum fits in int.
  • Type 3. Behavior is undefined for all inputs. These are unreachable code, misidentified as functions. Any caller of a Type 3 function is already in undefined territory.

The classification is useful for API design: public APIs should be Type 1 wherever possible. Type 2 is acceptable when the preconditions are clearly documented and easily checkable by the caller. Type 3 indicates a design error.
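Turning a Type 2 function into a Type 1 function usually means making the precondition explicit in the interface; a sketch (the names are ours):

#include <limits.h>
#include <stdbool.h>

// Type 2: implicit precondition that a + b fits in int.
int add_signed(int a, int b) { return a + b; }

// Type 1: defined for every input; the caller is told when the
// precondition would have been violated, instead of getting UB.
bool add_checked(int a, int b, int *out) {
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return false;       // result would not fit in int
    }
    *out = a + b;
    return true;
}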

The Linux Kernel Null-Pointer Elimination

One of the most famous real-world UB exploitation cases involves the Linux kernel, circa 2009. A simplified version:

struct task_struct {
    int uid;
    // ...
};

void vulnerable_function(struct task_struct *task) {
    int uid = task->uid;    // (1) dereference task
    // ...
    if (task == NULL) {     // (2) null check after dereference
        return;
    }
    // ...
}

The compiler reasons: at point (1), we dereference task. If task were null, that would be undefined behavior. Since we promised no UB, task cannot be null at point (1). Therefore the check at point (2) is always false (we just established task != NULL). The compiler eliminates the null check.

The vulnerability: kernels of that era allowed user processes to map the page at address 0 into their own address space. With the check gone, a process that triggered vulnerable_function with task = NULL did not take the early return; the kernel continued past it and used task, reading fields (and, in the real bug, function pointers) from memory near address 0 that the attacker controlled. The UB-elision optimization converted a defensive null check into a privilege escalation.

This is not a theoretical concern. CVE-2009-1897, CVE-2009-2692, and several related kernel vulnerabilities share this root cause: defensive null checks after pointer dereferences, eliminated by the compiler because “you already proved it’s not null.”

The fix: move the null check before the dereference. And never dereference a pointer you have not yet verified is non-null.
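The repaired version of the sketch above:

void fixed_function(struct task_struct *task) {
    if (task == NULL) {     // (1) check before any dereference
        return;
    }
    int uid = task->uid;    // (2) safe: task is known non-null here
    // ...
}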

Alive2: Finding UB Bugs in Compilers Themselves

If UB is hard to reason about for programmers, it is also hard to reason about for compiler writers. In 2019, Lee, Kim, Hur, and Lopes introduced Alive2, a formal verification framework that checks whether LLVM’s optimization passes are semantically correct — that is, whether the transformed program has the same observable behavior as the original under all defined inputs.

Running Alive2 on LLVM’s own test suite found 95 miscompilation bugs, of which 25 stemmed directly from undefined behavior semantics: cases where an LLVM optimization was correct under one interpretation of pointer semantics or poison values but incorrect under another. The most recurring issues:

Pointer provenance conflicts. Different LLVM optimization passes assumed contradictory semantics for pointer comparisons: some treated pointer equality as deterministic (two pointers are equal iff they hold the same address), others treated it as non-deterministic (two pointers to different objects are never equal, even if numerically equal). Combining these passes on the same function produced miscompilations.

Poison value propagation. LLVM uses a poison value (analogous to undef but stricter) to represent the result of an operation that invoked UB. An optimization that converts select(cond, x, y) into (cond * x) + (!cond * y) is incorrect if x is poison and cond is false: in the original, x is never used and the result is the well-defined y; in the transformed version the arithmetic touches x, and arithmetic on poison yields poison, so the whole expression becomes poison.

The lesson Alive2 demonstrates: UB semantics are sufficiently complex that even the world’s most-used optimizing compiler contains bugs arising from conflicting UB assumptions. Getting it right requires formal specification. The proposed fix — a freeze instruction that stabilizes a poison value into an arbitrary but non-poison result — was adopted into LLVM in 2020.

The volatile / UB Interaction

One subtle interaction: volatile (Chapter 2) does not suppress UB. Writing to a null volatile int * is still UB. Accessing an int through a volatile float * is still a strict-aliasing violation. volatile tells the compiler that accesses must happen and in order — it does not grant permission to violate any other rule of the abstract machine.

The common misconception is that volatile makes a variable “safe from compiler optimization.” It does not. It makes accesses to that variable non-optimizable in terms of elision and reordering, but it does not protect the program from UB that the compiler might use to eliminate surrounding code. A program with a volatile variable and also a signed integer overflow is still fully undefined.


Chapter 4: Loop Invariants and Small Proofs

Why Proofs?

In CS 135, reasoning about programs was almost effortless. To show that (length (list 1 2 3)) is 3, you stepped through the substitution model: unfold length, substitute the argument, reduce — and you were done. The substitution model was the proof.

With mutation, that approach fails. You cannot substitute x with 5 if x might change. What replaces it?

The answer is Hoare logic, introduced by C.A.R. Hoare in 1969. Hoare logic provides a formal system for reasoning about programs that change state. It is not used in most day-to-day programming, but it underlies every claim that an imperative algorithm is correct, whether or not the programmer writes it down. Every experienced programmer who says “this loop is correct because at each step we maintain the property that …” is using Hoare logic informally.

This chapter introduces the notation and applies it to two small C programs. The goal is not to make you prove every program you write — that would be impractical. The goal is to give you a mental framework that, when a loop looks suspicious, lets you articulate why it is suspicious and what invariant it should maintain.

Hoare Triples

A Hoare triple is written:

\[\{P\} \; C \; \{Q\}\]

It means: if property \(P\) holds before executing statement (or sequence of statements) \(C\), and \(C\) terminates, then property \(Q\) holds afterwards. \(P\) is the precondition and \(Q\) is the postcondition.

For a simple assignment:

\[\{Q[e/x]\} \;\; x = e; \;\; \{Q\}\]

where \(Q[e/x]\) means “\(Q\) with every free occurrence of \(x\) replaced by \(e\).” This is the assignment axiom: the precondition you need before an assignment is whatever the postcondition requires of x, with x replaced by the right-hand-side expression. It reads backwards from the goal.

Example: to prove {?} x = x + 1; {x > 0}, substitute in the assignment axiom: precondition is x + 1 > 0, which simplifies to x > -1, or x >= 0. So:

\[\{x \geq 0\} \;\; x = x + 1; \;\; \{x > 0\}\]

For sequential composition:

\[\frac{\{P\}\; C_1 \;\{R\} \quad \{R\}\; C_2 \;\{Q\}}{\{P\}\; C_1;\, C_2 \;\{Q\}}\]

This just says: if \(C_1\) takes us from \(P\) to \(R\), and \(C_2\) takes us from \(R\) to \(Q\), then the sequence takes us from \(P\) to \(Q\). The intermediate assertion \(R\) is called the glue.

For conditionals:

\[\frac{\{P \land b\}\; C_1 \;\{Q\} \quad \{P \land \neg b\}\; C_2 \;\{Q\}}{\{P\}\; \texttt{if}\; (b)\; C_1\; \texttt{else}\; C_2 \;\{Q\}}\]

Both branches of the conditional must establish \(Q\), starting from \(P\) augmented by the branch condition.

The Loop Rule and Invariants

The most important rule is for loops:

\[\frac{\{I \land b\}\; C \;\{I\}}{\{I\}\;\texttt{while}\; (b)\; C \;\{I \land \neg b\}}\]

Here \(I\) is the loop invariant — a property that holds before the loop starts, is preserved by each iteration (if the guard \(b\) holds), and is therefore true when the loop exits (combined with the fact that \(b\) is now false).

The three-part loop invariant argument is often expressed as:

  1. Initialization: \(I\) holds before the first iteration.
  2. Maintenance: if \(I\) and \(b\) hold at the top of an iteration, then \(I\) holds at the end of the iteration.
  3. Termination: the loop eventually terminates (typically by showing a decreasing measure — a non-negative integer quantity that strictly decreases each iteration).

The strength of the conclusion comes from combining \(I \land \neg b\) after the loop exits. If \(I\) says “the first \(k\) elements of the array are sorted” and \(b\) says “\(k < n\)”, then when the loop exits, \(\neg b\) says \(k \geq n\), so together we get “the first \(n\) elements are sorted” — the full result.

Proof 1: array_sum

// requires: n >= 0, a has at least n elements
// effects: returns the sum of a[0], a[1], ..., a[n-1]
int array_sum(const int *a, int n) {
    int sum = 0;
    int i = 0;
    while (i < n) {
        sum += a[i];
        i++;
    }
    return sum;
}

Let \(S_k\) denote \(a[0] + a[1] + \cdots + a[k-1]\) (the sum of the first \(k\) elements, with \(S_0 = 0\)). Our invariant is:

\[I: \quad \texttt{sum} = S_i \;\land\; 0 \leq i \leq n\]

Initialization: before the loop, sum = 0 and i = 0. \(S_0 = 0\) by definition, so sum = \(S_0\) = \(S_i\). And \(0 \leq 0 \leq n\) (since \(n \geq 0\)). ✓

Maintenance: assume \(I\) holds at the top of an iteration and \(i < n\). The body executes sum += a[i] then i++. After the body:

  • new sum = old sum + \(a[i]\) = \(S_i + a[i]\) = \(S_{i+1}\) (by definition of \(S\)).
  • new i = old \(i + 1\).
  • So sum = \(S_{\text{new }i}\), and \(0 \leq \text{new }i \leq n\) (we added 1 to \(i\) which was \(< n\), so new \(i \leq n\)). ✓

Termination: i increases by 1 each iteration, bounded by n. So the loop runs exactly n times.

Conclusion: after the loop, \(I\) holds and \(\neg(i < n)\) holds, so \(i \geq n\). Combined with \(i \leq n\), we have \(i = n\). Therefore sum \(= S_n\), which is what we want to return. ✓
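During testing, the machine-checkable parts of such an invariant can be stated as runtime assertions (a sketch; assert comes from <assert.h> and compiles away under -DNDEBUG). The sum = \(S_i\) clause is left as a comment, since checking it would mean recomputing the sum:

#include <assert.h>

int array_sum_checked(const int *a, int n) {
    int sum = 0;
    int i = 0;
    while (i < n) {
        assert(0 <= i && i < n);    // bounds part of I, plus the loop guard
        // (the sum == S_i clause is not cheaply checkable at runtime)
        sum += a[i];                // now sum = S_(i+1) ...
        i++;                        // ... and sum = S_i holds again
    }
    assert(i == n);                 // I ∧ ¬(i < n) gives i = n
    return sum;
}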

Proof 2: reverse_array

// requires: n >= 0, a has at least n elements
// effects: reverses a[0..n-1] in place
void reverse_array(int *a, int n) {
    int lo = 0, hi = n - 1;
    while (lo < hi) {
        int tmp = a[lo];
        a[lo] = a[hi];
        a[hi] = tmp;
        lo++;
        hi--;
    }
}

Let \(a_0\) denote the original array (the value of a before the function was called). Our invariant is:

\[I:\quad 0 \leq \texttt{lo} \leq \texttt{hi}+1 \leq n \;\land\; \texttt{lo} + \texttt{hi} = n - 1 \;\land\; \bigl(\forall j < \texttt{lo}:\; a[j] = a_0[n-1-j]\bigr) \;\land\; \bigl(\forall j > \texttt{hi}:\; a[j] = a_0[n-1-j]\bigr) \;\land\; \bigl(\forall j \in [\texttt{lo}, \texttt{hi}]:\; a[j] = a_0[j]\bigr)\]

In words: the outer portions (indices \(0 \ldots \texttt{lo}-1\) and \(\texttt{hi}+1 \ldots n-1\)) have been reversed into their final positions, while the middle portion (indices \(\texttt{lo} \ldots \texttt{hi}\)) is still in its original order; and the two indices move in lockstep, so \(\texttt{lo} + \texttt{hi} = n - 1\) throughout.

Initialization: lo = 0, hi = n - 1, so \(\texttt{lo} + \texttt{hi} = n - 1\). The outer portions are empty, so the reversal conditions hold vacuously. The middle portion is the entire array, still in original order. ✓

Maintenance: assume \(I\) and lo < hi. The body swaps a[lo] and a[hi], then increments lo and decrements hi (which preserves \(\texttt{lo} + \texttt{hi} = n - 1\)). Because \(\texttt{hi} = n-1-\texttt{lo}\), the swap places the old a[lo] (which was \(a_0[\texttt{lo}]\)) into position hi, exactly where the reversed array should hold \(a_0[n-1-\texttt{hi}] = a_0[\texttt{lo}]\); symmetrically, the old a[hi] \(= a_0[\texttt{hi}]\) lands in position lo, where the reversed array should hold \(a_0[n-1-\texttt{lo}] = a_0[\texttt{hi}]\). The middle portion shrinks from both ends, preserving the condition for the new middle. ✓

Termination: hi - lo decreases by 2 per iteration (or the loop exits when they cross). The loop runs \(\lfloor n/2 \rfloor\) times.

Conclusion: after the loop, lo >= hi. Combined with the invariant’s first clause (\(\texttt{lo} \leq \texttt{hi}+1\)), we have lo = hi (odd length, middle element stays in place) or lo = hi + 1 (even length, all swapped). In either case, the outer portions cover the full array and all elements are in their reversed positions. ✓

Limits: What Hoare Logic Does Not Cover

Hoare logic as presented here has two significant gaps.

First, it assumes sequential execution. In a concurrent program, another thread can modify a shared variable between any two of your statements, invalidating the invariant at arbitrary points. Concurrent reasoning requires additional machinery — separation logic, rely-guarantee conditions, or lock invariants — beyond our scope here.

Second, it assumes no undefined behavior. The proof of array_sum assumes a[i] accesses valid memory (\(0 \leq i < n\) and a has at least n elements). If the caller passes \(n < 0\) or a too-short array, the loop invariant still holds in the abstract machine, but the actual program may access invalid memory, invoking UB, and the abstract machine reasoning no longer corresponds to real execution.

The connection to CS 135 equational reasoning: both are forms of induction. In CS 135, you proved a recursive function correct by showing the base case is right and the inductive step preserves correctness. The loop invariant argument is the iterative counterpart — base case = initialization, inductive step = maintenance, conclusion = post-loop state. The substitution model was the mechanism; Hoare logic is the mechanism’s imperative equivalent.

For further reading on program verification: Ragde’s FICS2 §3 develops a simple imperative language (SIMP) and gives a formal semantics in which Hoare-style proofs are made precise. CS 245 (Logic and Computation) covers first-order logic and its connection to program reasoning more formally. CS 341 applies invariant arguments to the correctness proofs for standard algorithms.


Part II — The Machine Underneath

Chapter 5: ARM64 and x86-64 Side by Side

Why Look at Assembly?

The C abstract machine describes what a program means. Real hardware describes how it runs. Between the two sits the compiler — an extraordinarily complex piece of software that translates the former into the latter. For most programs, you never need to see the translation. But there are moments when reading assembly is the only way to understand what is happening: when a loop is slower than expected, when a debugger shows a crash inside the runtime, when you need to know whether two pointer reads are actually separate loads or whether the compiler merged them. Assembly is the ground truth.

This chapter introduces the two assembly languages you will encounter most in 2026. ARM64 (also called AArch64 or the A64 instruction set) runs on every iPhone and Android phone, every Mac with an M-series chip, and an increasing fraction of cloud servers (AWS Graviton, Azure Cobalt, Ampere Altra). x86-64 (also called AMD64) runs on the rest of the cloud — Intel and AMD servers — on all Windows PCs, and on most Linux desktops. Between them they cover essentially all of modern computing.

The two architectures reflect different design philosophies formed decades apart. ARM64 is a clean-slate RISC design: all instructions are exactly 32 bits wide, load and store are the only instructions that touch memory, arithmetic happens only in registers, and there are 31 general-purpose registers of uniform capability. x86-64 carries decades of backward compatibility: instructions range from 1 to 15 bytes, many instructions can implicitly read from or write to memory as operands, and the register set is irregular (some operations only work in certain registers, some registers have 8-, 16-, 32-, and 64-bit sub-names that partially overlap). ARM64 is easier to learn; x86-64 is what your cloud instance runs.

Registers

The register file is the CPU’s fastest storage — a small set of named locations that the ALU operates on directly. Loads bring values from memory into registers; stores write them back. Everything in between happens in registers.

ARM64 registers:

Name    | Width  | Role
x0–x7   | 64-bit | function arguments / return values
x8–x18  | 64-bit | general purpose (x8: indirect result pointer)
x19–x28 | 64-bit | callee-saved: must be preserved across calls
x29     | 64-bit | frame pointer (fp)
x30     | 64-bit | link register (lr): holds return address
sp      | 64-bit | stack pointer
xzr     | 64-bit | zero register: always reads 0, writes discarded
w0–w30  | 32-bit | lower 32 bits of the corresponding x registers

Writing to w0 zero-extends into x0 — the upper 32 bits of x0 become zero. This matters for unsigned 32-bit arithmetic. The wzr register is the 32-bit zero register.

x86-64 registers:

Name        | 64-bit | 32-bit   | 16-bit   | 8-bit    | Role
accumulator | rax    | eax      | ax       | al/ah    | return value (low), implicit in mul/div
counter     | rcx    | ecx      | cx       | cl       | 4th arg; shift count
data        | rdx    | edx      | dx       | dl       | 3rd arg; return value (high)
base        | rbx    | ebx      | bx       | bl       | callee-saved
source idx  | rsi    | esi      | si       | sil      | 2nd argument
dest idx    | rdi    | edi      | di       | dil      | 1st argument
stack ptr   | rsp    |          |          |          | stack pointer
frame ptr   | rbp    |          |          |          | frame pointer (callee-saved)
r8–r15      | r8–r15 | r8d–r15d | r8w–r15w | r8b–r15b | general purpose

x86-64’s register naming is famously irregular. Writing to eax (32-bit) zero-extends into rax (64-bit). Writing to ax (16-bit) does not zero-extend — it leaves the upper 48 bits unchanged. Writing to al (8-bit low) leaves upper 56 bits unchanged. This asymmetry is a legacy of 16-bit and 32-bit predecessors; ARM64 avoids it by having only two register widths (32 and 64 bits) with consistent zero-extension.

Instruction Set Architecture (ISA). The contract between software and hardware: the set of instructions a processor understands, the registers it provides, the memory model it exposes, and the binary encoding of all of the above. The ISA is what the compiler targets; the microarchitecture (pipeline, caches, out-of-order execution) is how the hardware implements the ISA, invisible to software.

Instruction Encoding

ARM64 instructions are always exactly 32 bits. The high bits determine the instruction type; the remaining bits encode registers and immediates. Because the encoding is fixed-width, the program counter can always advance by exactly 4 bytes to reach the next instruction, and branch targets can be computed without parsing variable-length encoding.

x86-64 instructions range from 1 byte (nop, ret) to 15 bytes (a mov with all optional prefixes). The CPU’s front-end fetches a block of bytes and decodes them sequentially, which requires careful hardware engineering to maintain throughput. The variable-length encoding gives x86-64 some code-density advantages but makes hand-decoding considerably harder.

A Minimal Example: Computing a Square

int square(int n) {
    return n * n;
}

Compiled with clang -O1 targeting both architectures:

// ARM64 (AArch64)
// argument: n in w0 (32-bit)
square:
    mul  w0, w0, w0    // w0 = w0 * w0
    ret                // return (branch to x30)
; x86-64 (Intel syntax)
; argument: n in edi (32-bit)
square:
    imul  edi, edi     ; edi = edi * edi  (signed 32×32 → 32)
    mov   eax, edi     ; return value in eax
    ret

ARM64’s mul writes the result back to w0, which is also the return-value register, so no move is needed. x86-64’s two-operand imul multiplies in place, so a second instruction is needed to get the result into eax (the return register): here imul edi, edi followed by mov eax, edi; an equally common ordering is mov eax, edi followed by imul eax, edi. Either way it is two instructions, because the register form of imul cannot write its product to a third register.

Two things to notice: ARM64 uses w registers (32-bit) for int arithmetic, not x registers (64-bit). x86-64 uses e__ registers (32-bit) for int, not r__ (64-bit). Both architectures treat C’s int as 32 bits. Operations on 32-bit registers zero-extend into the 64-bit register on both architectures (for ARM64 always; for x86-64 only when writing the e__ form).

Dual-Track: Iterative Factorial

A more realistic example: iterative factorial with a local variable.

long factorial(int n) {
    long result = 1;
    for (int i = 2; i <= n; i++) {
        result *= i;
    }
    return result;
}

ARM64 output (clang -O1, lightly edited for clarity):

// ARM64
// n in w0; return value in x0
factorial:
    cmp   w0, #1           // compare n with 1
    b.le  .Lreturn_one     // if n <= 1, skip loop
    mov   x8, #1           // result = 1
    mov   w9, #2           // i = 2
.Lloop:
    sxtw  x10, w9          // sign-extend i (32-bit) into x10
    mul   x8, x8, x10      // result *= i  (64-bit multiply)
    add   w9, w9, #1       // i++
    cmp   w9, w0           // compare i with n
    b.le  .Lloop           // if i <= n, loop
    mov   x0, x8           // move result to return register
    ret
.Lreturn_one:
    mov   x0, #1
    ret

x86-64 output (clang -O1):

; x86-64 (Intel syntax)
; n in edi; return value in rax
factorial:
    cmp   edi, 1            ; compare n with 1
    jle   .Lreturn_one      ; if n <= 1, skip loop
    mov   rax, 1            ; result = 1
    mov   ecx, 2            ; i = 2
.Lloop:
    movsxd rdx, ecx         ; sign-extend i (32-bit) to 64-bit
    imul   rax, rdx         ; result *= i  (64-bit multiply)
    inc    ecx              ; i++
    cmp    ecx, edi         ; compare i with n
    jle    .Lloop           ; if i <= n, loop
    ret                     ; return (rax already holds result)
.Lreturn_one:
    mov    eax, 1
    ret

Reading these in parallel is instructive. Both:

  • Use a compare-and-branch idiom for the loop: cmp/b.le (ARM64) and cmp/jle (x86-64).
  • Must widen i from 32-bit to 64-bit before the 64-bit multiply: sxtw then mul (ARM64 sign-extend, then multiply) and movsxd then imul (x86-64).
  • Use 64-bit arithmetic for the long result accumulator.

The differences: the ARM64 version keeps the accumulator in x8, so an explicit mov x0, x8 is needed at the end to place the result in the return register (x0 cannot hold the accumulator here, because w0 still holds n for the loop test). The x86-64 version keeps the accumulator in rax, which is already the return register, so no final move is needed. Choosing the accumulator to match the calling convention’s return register is exactly the kind of register-allocation decision compilers make automatically, and seeing the assembly makes it concrete.

Reading Your Own Assembly

On a Mac with Apple Silicon, run:

clang -S -O1 -o factorial.s factorial.c

This produces ARM64 assembly in factorial.s. On a Linux x86-64 machine, the same command produces x86-64 assembly by default. To experiment with x86-64 from the Mac, cross-compile with a target flag such as --target=x86_64-linux-gnu, and add -masm=intel if you prefer Intel syntax over the default AT&T syntax.

Compiler Explorer (godbolt.org) lets you type C on the left and see assembly for both ARM64 (clang --target=aarch64) and x86-64 (clang/gcc) on the right, with color-coded correspondence between source lines and assembly. It is the fastest way to develop intuition for how C constructs compile.


Chapter 6: The Stack, Calling Conventions, and Frames

The Stack

Every function call in C allocates a stack frame — a region of memory that holds the function’s local variables, saved registers, and bookkeeping information. The frames are arranged in a LIFO stack: when you call a function, its frame is pushed on top; when it returns, its frame is popped. The stack grows downward on both ARM64 and x86-64 (toward lower addresses), and the stack pointer (sp on ARM64, rsp on x86-64) always points to the top (lowest address in use) of the current frame.

[Figure: the stack grows downward on both ARM64 and x86-64. From high addresses to low: the main() frame (locals, saved fp); the factorial() frame (n, result, i, saved lr/x30, saved frame pointer x29); the current helper() frame (local variables, saved registers) with sp pointing at its top; and free stack space below, toward low addresses.]

ARM64 Calling Convention: AAPCS64

The ARM64 calling convention is specified in the Procedure Call Standard for the Arm 64-bit Architecture (AAPCS64), published by Arm Ltd. Its key rules:

Argument passing. The first eight integer/pointer arguments go in x0–x7 (or w0–w7 for 32-bit values). Additional arguments are passed on the stack. The first eight floating-point arguments go in v0–v7.

Return values. Integer/pointer results ≤ 64 bits are in x0. Results 65–128 bits use x0/x1. Larger results are written to memory pointed to by x8 (the indirect result location register).

Callee-saved registers. x19–x28, x29 (frame pointer), and d8–d15 (FP/SIMD) must be preserved across a call. If a function uses any of these, it must save them on entry and restore them on exit.

Caller-saved registers. x0–x18, x30 (link register), and d0–d7, d16–d31 may be clobbered by a called function. If the caller needs these values after a call, it must save them first.

The link register. bl target (branch with link) writes the return address into x30 and branches. ret branches to the address in x30. A leaf function (one that calls no other functions) can use x30 as scratch; a non-leaf function must save x30 on the stack (typically in its first prologue instruction) and restore it before returning.

Stack alignment. The stack pointer must be 16-byte aligned at the point of every function call. The compiler enforces this by rounding up the frame size to a multiple of 16.

A typical non-leaf ARM64 function prologue/epilogue:

// ARM64 prologue: save fp (x29) and lr (x30)
stp  x29, x30, [sp, #-16]!   // push {x29, x30}; decrement sp by 16
mov  x29, sp                  // set frame pointer

// ... function body ...

// ARM64 epilogue: restore and return
ldp  x29, x30, [sp], #16      // pop {x29, x30}; increment sp by 16
ret                             // branch to x30

stp is “store pair” — stores two 64-bit registers at once. The #-16 pre-decrement variant adjusts sp before the store, atomically moving the stack pointer and storing the values. ldp is the load pair counterpart.
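
To see the link-register rule in your own output, compare the prologues the compiler emits for a leaf and a non-leaf function (a sketch to compile with clang -S -O1; helper is an assumed external function):

// Leaf function: it calls nothing, so x30 is never clobbered and the
// compiler typically emits no stp/ldp of x29/x30 at all.
int leaf_add(int a, int b) {
    return a + b;
}

// Non-leaf function: the bl to helper overwrites x30, so the prologue must
// save x29/x30 with the stp shown above and restore them before ret.
int helper(int x);   // assumed external function, defined elsewhere

int nonleaf(int a) {
    return helper(a) + 1;
}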

x86-64 Calling Convention: System V AMD64 ABI

Linux, macOS, and BSDs all use the System V AMD64 Application Binary Interface for x86-64. Windows uses a different convention (Microsoft x64 ABI) with four argument registers instead of six; this matters when cross-compiling or calling into Windows DLLs.

Argument passing. Integer/pointer arguments in order: rdi, rsi, rdx, rcx, r8, r9. Floating-point in xmm0–xmm7. Additional arguments on the stack.

Return values. Integer/pointer in rax (and rdx for 128-bit). Floating-point in xmm0.

Callee-saved registers. rbx, rbp, r12–r15 must be preserved.

Caller-saved registers. rax, rcx, rdx, rsi, rdi, r8–r11, xmm0–xmm15 may be clobbered.

call and ret. call target pushes the return address (the address of the instruction after call) onto the stack and branches. ret pops the return address from the stack and branches to it. Unlike ARM64, there is no link register; the return address lives on the stack.

The red zone. Below rsp (toward lower addresses) lies a 128-byte region called the red zone that is guaranteed not to be clobbered by signal handlers. A leaf function that needs no more than 128 bytes of local storage can use the red zone without adjusting rsp — it simply uses [rsp - 8], [rsp - 16], etc. This saves two instructions (the sub rsp, N prologue and add rsp, N epilogue) for every leaf. ARM64 has no red zone.
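
A hedged illustration: a leaf function whose locals fit well under 128 bytes, so a System V compiler is allowed to keep them in the red zone without adjusting rsp (whether a given compiler actually does depends on the optimization level):

// Leaf function with 24 bytes of locals: eligible for the red zone on
// x86-64 System V, since it calls nothing and needs far less than 128 bytes.
long sum3(long a, long b, long c) {
    long buf[3] = { a, b, c };
    return buf[0] + buf[1] + buf[2];
}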

A typical x86-64 non-leaf prologue/epilogue:

; x86-64 prologue
push  rbp              ; save old frame pointer
mov   rbp, rsp         ; set frame pointer to current sp
sub   rsp, 32          ; allocate 32 bytes of locals (aligned to 16)

; ... function body ...

; x86-64 epilogue
leave                  ; equivalent to: mov rsp, rbp; pop rbp
ret

leave is a single instruction that restores the stack pointer from the frame pointer and pops rbp. It exists precisely because the prologue pattern push rbp; mov rbp, rsp is so common.

Seeing Frames in Practice: Recursive Functions

The frame mechanism is most visible in recursive functions. Consider factorial from Chapter 5, but now recursive:

long rfact(int n) {
    if (n <= 1) return 1;
    return n * rfact(n - 1);
}

ARM64 assembly:

// ARM64 (non-leaf: calls itself)
rfact:
    stp   x29, x30, [sp, #-32]!   // save fp, lr; alloc 32 bytes
    mov   x29, sp                   // set frame pointer
    cmp   w0, #1                    // compare n with 1
    b.le  .Lbase                    // if n <= 1, branch to base case
    str   w0, [sp, #16]             // save n (w0) to stack
    sub   w0, w0, #1                // argument for recursive call: n - 1
    bl    rfact                     // call rfact(n-1); result in x0
    ldrsw x8, [sp, #16]             // reload saved n, sign-extended to 64 bits
    mul   x0, x8, x0                // x0 = n * rfact(n-1)
    ldp   x29, x30, [sp], #32      // restore fp, lr; free frame
    ret
.Lbase:
    mov   x0, #1
    ldp   x29, x30, [sp], #32
    ret

x86-64 assembly:

; x86-64
rfact:
    push  rbp
    mov   rbp, rsp
    push  rbx                 ; save rbx (callee-saved; we'll use it)
    sub   rsp, 8              ; keep stack 16-byte aligned
    cmp   edi, 1
    jle   .Lbase
    mov   ebx, edi            ; save n in rbx (callee-saved)
    lea   edi, [edi - 1]      ; argument: n - 1
    call  rfact               ; rfact(n-1), result in rax
    imul  rax, rbx            ; rax = n * rfact(n-1)
    jmp   .Lexit
.Lbase:
    mov   eax, 1
.Lexit:
    add   rsp, 8
    pop   rbx
    pop   rbp
    ret

Notice that ARM64 saves n to the stack (at [sp, #16]) while x86-64 saves it in the callee-saved register rbx. Both achieve the same goal — preserving n across the recursive call, which is free to clobber the argument register — but ARM64 uses a stack slot while x86-64 borrows a callee-saved register. The costs are comparable: the stack slot costs a store and a reload per frame, while borrowing rbx costs a push and a pop per frame but keeps n in a register at the point of the multiply. Which choice wins depends on the surrounding register pressure; compilers make it anew for every function.

Frame Pointers and Stack Unwinding

The frame pointer (x29 on ARM64, rbp on x86-64) points to the base of the current frame. The frame at the base contains the saved previous frame pointer, so the frame pointers form a linked chain back to main. A debugger traverses this chain to produce a stack trace.
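
A hedged sketch of walking that chain by hand, using the GCC/Clang builtin __builtin_frame_address (not standard C). It assumes frame pointers are kept and the standard ARM64/x86-64 frame layout, where each frame stores the caller's frame pointer at offset 0 and the return address right above it:

#include <stdio.h>

static void print_backtrace(void) {
    void **fp = __builtin_frame_address(0);   // current frame pointer
    for (int depth = 0; fp != NULL && depth < 16; depth++) {
        printf("frame %d: return address %p\n", depth, fp[1]);
        fp = (void **)fp[0];                  // follow the saved frame pointer
    }
}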

With -O2 or higher, compilers often omit the frame pointer (-fomit-frame-pointer is implied by default on x86-64 at optimization levels above 0) and use it as a general-purpose register for performance. Stack unwinding then relies on unwind tables in the binary’s .eh_frame section — DWARF-encoded tables that describe how to reconstruct the caller’s register state at any program counter. libunwind, used by backtrace(3) on Linux and by C++ exception handling on all platforms, reads these tables.

For debugging, you can force frame pointers with -fno-omit-frame-pointer. For production binaries where you want crash reports, build with debug info (-g) and include unwind tables (default) even at -O2.


Chapter 7: The Address Space of a Process

The Illusion of Private Memory

When your C program runs, it sees a vast, flat address space — 48 bits on current ARM64 hardware, 48 bits on x86-64, meaning roughly 256 terabytes of addressable memory. The program behaves as if it owns all of it. Of course, it does not: the hardware’s memory management unit (MMU) translates every address the program uses into a physical RAM address, and it does so in units called pages (typically 4 KiB, sometimes 2 MiB or 1 GiB for “huge pages”). A program only has physical RAM behind the pages it has actually touched; the rest of the address space is either unmapped (touching it causes a segfault) or swapped to disk.

This is virtual memory, and the full story belongs to CS 350. For our purposes, the important consequence is: the address space is a convenient fiction managed by the OS, and the layout of that fiction follows a standard pattern.

Process Address Space Layout

A running C program’s virtual address space is divided into several segments, each with a specific purpose and protection:

[Figure: virtual address space from low to high addresses: unmapped null page (page 0); text (code), read + execute, never write; rodata, string literals and const globals, read-only; data, initialized globals and static locals; bss, zero-initialized globals (no bytes in the binary); heap, malloc/free, grows upward, managed by the allocator via brk/mmap; mmap region, shared libraries and anonymous maps; stack, grows downward with a fixed limit (~8 MB), automatic locals and saved registers.]

Text segment. The compiled machine instructions. Marked read-only and executable; writing to it causes a segfault. This protection prevents accidentally (or maliciously) overwriting code. The linker places all functions here.

Read-only data (rodata). String literals ("hello\n" in printf) and const globals live here. It is readable but not writable; an attempt to write char *p = "hello"; p[0] = 'H'; is undefined behavior (and will likely segfault, because the OS enforces the read-only permission).

Initialized data segment. Global and static variables that have non-zero initial values. The binary file contains these values verbatim, and the loader copies them into memory at program start.

BSS (Block Started by Symbol). Global and static variables initialized to zero (or not explicitly initialized, since C guarantees zero-initialization for static-duration objects). Rather than storing all those zeroes in the binary, the linker just records the size of the BSS region; the OS fills it with zeroes on demand. int global_array[1000000]; adds 4 MB to BSS but essentially nothing to the binary file.

Heap. Dynamic memory from malloc. The allocator manages a pool of memory obtained from the OS via brk(2) (which extends the data segment upward) or mmap(2) (which maps anonymous pages anywhere in the address space). We implement a miniature version of this in Chapter 8.

Memory-mapped region. Shared libraries (.dylib on macOS, .so on Linux) are mapped here by mmap. Each library is mapped once and shared between all processes that use it. This is also where large malloc allocations go — above a threshold (typically 128 KiB), malloc calls mmap directly rather than using the brk heap.

Stack. Grows downward from a fixed high address. Limited to about 8 MiB on most Linux systems (you can check with ulimit -s). The OS detects stack overflow by leaving a guard page — an unmapped page just below the stack limit — so that the first write beyond the limit causes a segfault rather than silently corrupting data in the heap below. A stack overflow (from runaway recursion) triggers exactly this.

Observing the Layout

On Linux:

cat /proc/self/maps

This prints the memory map of the running process: each line shows an address range, permissions (r, w, x, private), and the mapped file or [heap]/[stack] label. Try it in a shell or from C:

#include <stdio.h>
int main(void) {
    FILE *maps = fopen("/proc/self/maps", "r");
    if (!maps) return 1;
    char line[256];
    while (fgets(line, sizeof(line), maps)) fputs(line, stdout);
    fclose(maps);
    return 0;
}

On macOS, vmmap $$ (from a terminal) or vmmap <pid> shows the same information. The region names differ slightly (__TEXT, __DATA, __DATA_CONST, etc., reflecting Mach-O segment conventions), but the layout principles are identical.

ASLR

Address Space Layout Randomization (ASLR) randomizes the base addresses of the text, heap, stack, and library regions each time a program runs. This makes it harder for an attacker to predict where injected code or useful library functions live, because the addresses change every execution. Modern kernels enable ASLR by default; you can observe it by printing &main or argv in two successive runs of the same program and seeing different values.
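
A minimal way to watch ASLR in action (run the program twice and compare the printed addresses):

#include <stdio.h>

int a_global = 0;

int main(void) {
    int a_local = 0;
    printf("main:   %p\n", (void *)&main);      // text segment
    printf("global: %p\n", (void *)&a_global);  // data segment
    printf("local:  %p\n", (void *)&a_local);   // stack
    return 0;
}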

ASLR requires position-independent executables (PIE): the compiler emits code that works regardless of where it is loaded (using pc-relative addressing for everything). Both clang and gcc produce PIE by default on recent Linux and macOS. The performance cost is negligible on 64-bit architectures with sufficient registers for pc-relative addressing.

The full story of virtual memory — page tables, TLB misses, demand paging, copy-on-write, huge pages, memory-mapped files — is CS 350 territory. The one-line bridge: every address your program uses is a virtual address; the MMU translates it to physical RAM on every access; this translation is expensive enough that hardware caches it in the TLB, and a TLB miss causes a hardware page-table walk that costs ~100 ns. Accessing memory is not free.

For the linker and loader: the journey of a .c file through compilation, assembly, linking, and loading into the address space — including how the BSS size, rodata, and text end up in their respective segments — is the subject of CS 241. The bridge here is that malloc is just the user-space portion of the heap story: below it, the allocator calls into the OS to extend the heap, and above it, the program calls malloc without knowing whether it’s getting a fresh page or a reused chunk.


Chapter 8: malloc from Scratch

What an Allocator Does

malloc(n) returns a pointer to at least n bytes of usable memory, or NULL if it cannot satisfy the request. free(p) returns previously allocated memory to the allocator for reuse. The allocator’s job is to implement these two operations efficiently, with minimal wasted space and minimal time overhead.

The allocator sits between the application and the OS. The OS provides large chunks of memory (pages, typically 4 KiB) via sbrk/mmap. The allocator carves them up into the small pieces the application requests and reassembles them when freed. The tricky part: the allocator does not know in advance how many allocations will be made, of what sizes, or in what order they will be freed.

This chapter builds three allocators of increasing sophistication. All of them use a global arena — a large array representing the heap:

#define HEAP_SIZE (1 << 20)   // 1 MiB

static char heap[HEAP_SIZE];
static size_t heap_end = 0;   // next free byte

In a real allocator, heap would be obtained from sbrk(HEAP_SIZE) or mmap(NULL, HEAP_SIZE, PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE, -1, 0). We use a static array to keep the code self-contained.

Allocator 1: The Bump Allocator

The simplest possible allocator: just advance a pointer.

void *bump_alloc(size_t n) {
    // align to 8 bytes
    n = (n + 7) & ~7u;
    if (heap_end + n > HEAP_SIZE) return NULL;
    void *p = heap + heap_end;
    heap_end += n;
    return p;
}

void bump_free(void *p) {
    (void)p;   // no-op: we never reclaim
}

The alignment step (n + 7) & ~7u rounds n up to the next multiple of 8. Alignment matters: the C standard requires malloc to return memory suitably aligned for every standard type, which on 64-bit platforms means 16-byte alignment (to accommodate long double and SIMD vectors). Misalignment is also a performance concern: ordinary misaligned loads and stores are tolerated on both x86-64 and typical ARM64 configurations but can be slower, and ARM64 does fault on misaligned atomic and exclusive accesses (or anywhere strict alignment checking is enabled). Our bump allocator uses 8-byte alignment for simplicity.

The bump allocator is extraordinarily fast — essentially free, at two additions and a comparison. It is used in garbage collectors (allocating into a fresh “nursery” space), in arena allocators (where all allocations in a batch are freed together), and as the starting point for understanding allocation overhead. Its fatal flaw is that bump_free is a no-op: freed memory is never reclaimed, so the bump allocator only suits programs (or phases of a program) whose total allocation fits in the arena.
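
A quick check of the bump allocator's behaviour (a sketch assuming the bump_alloc definition above is in scope):

#include <stdio.h>

int main(void) {
    char *a = bump_alloc(5);    // request 5 bytes; rounded up to 8
    char *b = bump_alloc(16);
    // b starts exactly 8 bytes past a, and both are 8-byte aligned
    printf("a=%p b=%p diff=%td\n", (void *)a, (void *)b, b - a);
    return 0;
}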

Arena allocation. An allocation strategy where all objects in a group are allocated from a contiguous buffer (the arena), and the entire arena is freed at once when the group is done. No individual object requires individual deallocation. Fast and simple; used in compilers (per-compilation-unit arenas), web servers (per-request arenas), and game frames (per-frame arenas).

Allocator 2: Free-List Allocator

To support free, we need to track which regions are in use and which are available. The standard approach: embed a header before each allocation describing its size and status.

typedef struct block_hdr {
    size_t size;         // usable payload size in bytes (not counting header)
    int    in_use;       // 1 if allocated, 0 if free
    struct block_hdr *next;  // next block in free list (valid only if !in_use)
} block_hdr;

#define HDR_SIZE  (sizeof(block_hdr))

The layout of the heap becomes:

[ block_hdr | payload bytes ] [ block_hdr | payload bytes ] ...

malloc(n) searches the free list for a block with enough payload space, marks it in-use, and returns a pointer to the payload (past the header). free(p) walks back to the header (which is at (char*)p - HDR_SIZE), marks it free, and adds it to the free list.

[Figure: free-list heap layout with three consecutive blocks: block A (hdr: size=24, in_use=1) with a 24-byte payload; block B (hdr: size=40, in_use=0) with 40 free bytes; block C (hdr: size=16, in_use=1) with its payload. The free list holds only the free block: free_list → block B → NULL.]

The implementation:

static block_hdr *free_list = NULL;

void *fl_alloc(size_t n) {
    n = (n + 7) & ~7u;   // align to 8 bytes

    // search free list for a block that fits (first-fit)
    block_hdr **pp = &free_list;
    while (*pp) {
        block_hdr *b = *pp;
        if (b->size >= n) {
            *pp = b->next;     // unlink from free list
            b->in_use = 1;
            return (char *)b + HDR_SIZE;
        }
        pp = &b->next;
    }

    // no free block found: extend heap
    if (heap_end + HDR_SIZE + n > HEAP_SIZE) return NULL;
    block_hdr *b = (block_hdr *)(heap + heap_end);
    heap_end += HDR_SIZE + n;
    b->size = n;
    b->in_use = 1;
    b->next = NULL;
    return (char *)b + HDR_SIZE;
}

void fl_free(void *p) {
    if (!p) return;
    block_hdr *b = (block_hdr *)((char *)p - HDR_SIZE);
    b->in_use = 0;
    b->next = free_list;
    free_list = b;     // push onto free list
}

This is a first-fit allocator: it takes the first free block large enough to satisfy the request. Best-fit searches the entire free list for the smallest sufficient block, reducing wasted space but scanning longer. Next-fit remembers where the last search ended and picks up from there, avoiding repeatedly scanning past large free blocks at the beginning.
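
A hedged sketch of the best-fit variant, reusing the block_hdr layout and free list above (the function name fl_alloc_best is illustrative, and heap extension is omitted):

void *fl_alloc_best(size_t n) {
    n = (n + 7) & ~7u;   // align to 8 bytes, as in fl_alloc

    // scan the entire free list, remembering the smallest block that fits
    block_hdr **best_link = NULL;
    for (block_hdr **pp = &free_list; *pp; pp = &(*pp)->next) {
        if ((*pp)->size >= n &&
            (best_link == NULL || (*pp)->size < (*best_link)->size)) {
            best_link = pp;
        }
    }
    if (best_link == NULL) return NULL;   // no fit: fl_alloc would extend the heap here

    block_hdr *b = *best_link;
    *best_link = b->next;                 // unlink the winning block
    b->in_use = 1;
    return (char *)b + HDR_SIZE;
}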

Splitting and Coalescing

The allocator above has two problems:

Internal fragmentation. If the free list has a 100-byte block and we request 16 bytes, we hand the entire 100-byte block to the caller, wasting 84 bytes. The fix is splitting: carve a 16-byte block off the front and return the remaining 84 bytes to the free list.

// inside fl_alloc, when a block is found:
if (b->size >= n + HDR_SIZE + 8) {   // enough room to split?
    block_hdr *tail = (block_hdr *)((char *)b + HDR_SIZE + n);
    tail->size = b->size - HDR_SIZE - n;
    tail->in_use = 0;
    tail->next = free_list;
    free_list = tail;          // return remainder to free list
    b->size = n;               // shrink this block to exactly n
}

External fragmentation. After many alloc/free cycles, the heap may contain many small free blocks interspersed with in-use blocks. A request for a large block cannot be satisfied even if the total free space is sufficient. The fix is coalescing: when a block is freed, check whether the adjacent blocks are also free and merge them.

Coalescing requires knowing the next block in memory (trivial: advance the pointer by HDR_SIZE + b->size), but also the previous block in memory (not trivial with a singly-linked free list). The classic solution is the boundary tag technique: store an identical copy of the header at the end of each block. The previous block’s end-tag is immediately before the current block’s header.
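
A hedged sketch of the forward half of coalescing, merging a free block with the block that physically follows it (merging with the previous block additionally needs the boundary tags just described; the helper name is illustrative):

// Assumes blocks are laid out back to back from heap[0] up to heap[heap_end],
// exactly as fl_alloc lays them out, and that b itself is already free.
static void coalesce_forward(block_hdr *b) {
    char *next_addr = (char *)b + HDR_SIZE + b->size;
    if (next_addr >= heap + heap_end) return;        // b is the last block
    block_hdr *next = (block_hdr *)next_addr;
    if (next->in_use) return;                        // neighbour is allocated

    // the absorbed neighbour must also leave the free list; with a
    // singly-linked list that means a linear scan
    for (block_hdr **pp = &free_list; *pp; pp = &(*pp)->next) {
        if (*pp == next) { *pp = next->next; break; }
    }
    b->size += HDR_SIZE + next->size;                // absorb header + payload
}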

Real allocators — ptmalloc (glibc), jemalloc (FreeBSD/Firefox), tcmalloc (Google) — build on these ideas with substantial engineering: segregated free lists (one per size class), per-thread caches (to avoid locking), and careful use of OS page allocation to minimize fragmentation at the page level. Drepper’s “What Every Programmer Should Know About Memory” covers allocator internals in detail, including the interaction with CPU caches and TLB behavior.

The malloc(0) Corner

What should malloc(0) return? The C standard says: either a null pointer, or a unique pointer that shall not be dereferenced. Implementations differ:

  • glibc returns a non-null unique pointer (a distinct zero-byte allocation).
  • macOS libmalloc returns a non-null unique pointer.
  • Some embedded allocators return NULL.

Code that calls malloc(0) and then checks if (result == NULL) to detect failure is not portable — a successful malloc(0) may return NULL on some platforms. Similarly, free(malloc(0)) must be safe (you must be able to free whatever malloc(0) returned, including NULL, since free(NULL) is a no-op).
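
One portable pattern is to make the zero-size case explicit before treating NULL as failure (a sketch; the wrapper name xmalloc is illustrative):

#include <stdlib.h>

// Only a NULL return for a nonzero request counts as failure, since
// malloc(0) is allowed to return NULL on success.
void *xmalloc(size_t n) {
    void *p = malloc(n);
    if (p == NULL && n != 0) {
        abort();   // or whatever error path the program prefers
    }
    return p;
}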

Alignment and Padding

Structures in C are padded so that each member is aligned to its natural alignment (size of the type, up to the maximum alignment required by the platform, typically 8 or 16 bytes). Consider:

struct mixed {
    char  a;       // 1 byte
    // 7 bytes padding
    double b;      // 8 bytes (needs 8-byte alignment)
    int   c;       // 4 bytes
    // 4 bytes padding
};
// total: 24 bytes, not 13

The sizeof(struct mixed) is 24, not 13, because of alignment padding. The allocator must return memory aligned to the maximum alignment required by any type — 16 bytes on modern 64-bit systems (to accommodate __int128, long double, and SIMD vector types). Our (n + 7) & ~7u rounds to 8-byte alignment; a production allocator would use (n + 15) & ~15u for 16-byte alignment.
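
You can verify the padding claim directly with offsetof and sizeof (a small check; the struct repeats the definition above):

#include <stddef.h>
#include <stdio.h>

struct mixed {
    char   a;
    double b;
    int    c;
};

int main(void) {
    printf("offsetof a = %zu\n", offsetof(struct mixed, a));  // 0
    printf("offsetof b = %zu\n", offsetof(struct mixed, b));  // 8 on typical 64-bit ABIs
    printf("offsetof c = %zu\n", offsetof(struct mixed, c));  // 16
    printf("sizeof     = %zu\n", sizeof(struct mixed));       // 24
    return 0;
}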

The bridge to Chapter 9: when we design TIL’s memory layout in Part III, we will use a flat array of tagged values, and each value will be padded to a fixed size to avoid alignment arithmetic at runtime.

The Scaling Problem

Our free-list allocator in Chapter 8 has a global free list and a single bump pointer. This design has two fundamental scalability problems:

  1. Lock contention. In a multi-threaded program, every malloc and free must lock the global heap. On a 32-core server, threads spend a large fraction of their time waiting to acquire this lock, not doing useful work.

  2. Fragmentation under diverse workloads. A first-fit free list, even with coalescing, fragments badly under workloads with many different object sizes. The heap ends up looking like Swiss cheese: many small free regions between in-use regions, none big enough to satisfy a medium-sized request.

Jason Evans’ jemalloc allocator, introduced in FreeBSD 2006 and adopted by Firefox, Facebook, and many other systems, addresses both problems with a design that has three distinguishing features: size classes, arenas, and thread caches.

Size Classes

Instead of a single free list for all sizes, jemalloc maintains segregated free lists — one per size class. Size classes are chosen at powers of two with intermediate steps:

  • Small: 8, 16, 32, 48, 64, 80, 96, 112, 128, 160, 192, 224, 256, …, 3584 bytes. Approximately 50 size classes below 4 KiB.
  • Large: 4 KiB, 8 KiB, …, up to some large threshold. Satisfied from OS pages directly.
  • Huge: Multi-megabyte allocations. Obtained with mmap and tracked separately.

Every small allocation is rounded up to the nearest size class. A request for 17 bytes becomes a 32-byte allocation. This internal fragmentation (wasting up to ~50% of each small allocation) is acceptable because it eliminates external fragmentation (scattered unusable holes): all free blocks in a size-class list are the same size and immediately reusable for any request in that class.

Small objects of the same size class are packed into slabs — contiguous regions of memory, typically one or a few OS pages. Each slab contains a fixed number of slots of the class’s size, managed as a bitmap. Allocation from a slab is a bitmap scan (find a zero bit, flip it, return the corresponding address) — very fast.
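
A hedged sketch of that bitmap scan for a 64-slot slab (illustrative types and names, not jemalloc's code; __builtin_ctzll is a GCC/Clang builtin that counts trailing zero bits):

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t used;        // bit i set means slot i is allocated
    size_t   slot_size;   // this slab's size class
    char    *base;        // start of the slab's payload area
} slab;

static void *slab_alloc(slab *s) {
    if (s->used == UINT64_MAX) return NULL;        // slab is full
    int i = __builtin_ctzll(~s->used);             // index of the first free slot
    s->used |= (uint64_t)1 << i;
    return s->base + (size_t)i * s->slot_size;
}

static void slab_free(slab *s, void *p) {
    size_t i = (size_t)((char *)p - s->base) / s->slot_size;
    s->used &= ~((uint64_t)1 << i);
}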

Arenas

jemalloc partitions the heap into arenas — independent heap instances, each with its own set of size-class free lists and its own lock. The number of arenas defaults to 4 × (number of CPUs). Threads are assigned to arenas round-robin at first use.

Because threads in different arenas never contend for the same lock, the expected number of threads competing for any one lock falls roughly in proportion to the number of arenas. With the default of four arenas per CPU, a thread usually finds its arena’s lock uncontended, so lock waiting stops dominating malloc time on many-core machines.

Arenas also improve cache locality: objects allocated in the same arena tend to be on the same set of pages, and those pages tend to stay in the same CPU’s cache. A program that allocates and frees objects on the same thread (the common case) will find its arena’s pages warm in L1/L2 cache.

Thread Caches

For the hottest path — small allocations that are allocated and freed on the same thread — jemalloc maintains per-thread caches (tcaches). A thread cache is a small, per-thread array of recently freed objects, indexed by size class.

On free(p) for a small object: instead of returning p to the arena immediately, push p into the tcache for its size class (no locking needed — per-thread).

On malloc(n) for a small object: check the tcache first. If there is a cached object, pop it (no locking needed). Only if the tcache is empty, fall back to the arena (with locking).

In benchmarks on typical server workloads (allocate/use/free patterns dominated by request-scoped objects), tcache hits account for 80–95% of all malloc/free calls. The locking overhead of the arena is paid only for the remaining 5–20%. This is why jemalloc achieves near-linear throughput scaling with thread count on such workloads, where glibc’s ptmalloc shows saturation past ~8 threads.
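
A hedged sketch of the tcache fast path for a single size class (illustrative: a real tcache is an array of these, one per size class, and the arena_* functions stand in for the locked slow path):

#include <stdlib.h>

#define TCACHE_CAP 32

// Stand-ins for the locked arena slow path (a real arena allocator goes here).
static void *arena_alloc(size_t n) { return malloc(n); }
static void  arena_free(void *p)   { free(p); }

static _Thread_local void *tcache[TCACHE_CAP];
static _Thread_local int   tcache_top;

void *fast_alloc(size_t n) {
    if (tcache_top > 0) return tcache[--tcache_top];   // hit: no lock taken
    return arena_alloc(n);                             // miss: locked slow path
}

void fast_free(void *p) {
    if (tcache_top < TCACHE_CAP) { tcache[tcache_top++] = p; return; }
    arena_free(p);                                     // cache full: release to the arena
}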

Comparing Allocator Designs

Feature            Bump allocator (Ch 8)    Free-list (Ch 8)           jemalloc
Allocation speed   O(1), ~2 instructions    O(free-list length)        O(1) amortized (tcache hit)
Deallocation       No-op                    O(1) + coalesce            O(1) amortized (tcache)
Fragmentation      None (no reclaim)        External fragmentation     Internal only (bounded by size-class rounding)
Thread safety      Not thread-safe          Requires global lock       Lock-free in common case (tcache)
Suitable for       Short-lived arenas       General single-threaded    General multi-threaded

The bridge from Chapter 8: our free-list allocator is the conceptual core of any heap. jemalloc is what happens when you engineer that core for production: size classes eliminate external fragmentation, arenas reduce contention, and thread caches reduce locking to the uncommon case. Every real production allocator (jemalloc, TCMalloc, mimalloc, Hoard) is a variation on these three ideas.


Part III — Building a Tiny Language

Chapter 9: TIL — A Tiny Imperative Language

Why Design Before Implementing

When you write a C program, you are working with a language someone else designed. Its syntax, evaluation order, scoping rules, and memory model are fixed before you type a line. Building a language of your own — even a tiny one — forces you to make all of those choices explicitly. What does if mean when the condition is a function? What should happen when a variable is used before it is set? Does the language have first-class functions, and if so, what environment do they close over?

These questions have answers in every language you have ever used. Those answers are usually invisible because the language designers made them before you arrived. In this chapter and the next three, we will make them ourselves, in the open, and see what they cost to implement.

The language is called TIL — Tiny Imperative Language. It is a Turing-complete language in the sense that any computable function can be expressed in it, but it is deliberately small: no strings, no arrays, no modules, no type system (dynamic typing only), no tail-call optimization, and no error recovery beyond returning a distinguished error value. Every omission is intentional; complexity is the enemy of understanding.

TIL Values

TIL has three value types:

  • Integers. Signed integers, represented as C long for convenience, so in practice bounded at 64 bits. Arithmetic is ordinary integer arithmetic; division by zero signals an error.
  • Booleans. #t (true) and #f (false). Produced by comparison operators; consumed by if and while.
  • Closures. A function value together with its defining environment. First-class values: closures can be passed as arguments, returned from functions, stored in variables, and called.

TIL has no null value, no floating point, and no compound data (lists, structs). This is the stingiest reasonable core.

TIL Syntax

TIL uses an S-expression-style surface syntax — parenthesized prefix notation, like Racket — because it is easy to parse and unambiguous. We want to spend our implementation effort on semantics, not on resolving C-style ambiguity in *p++.

program     ::= (define (name param ...) expr) ...
expr        ::= integer
              | boolean
              | name
              | (if expr expr expr)
              | (while expr expr)
              | (begin expr ...)
              | (set! name expr)
              | (let ((name expr) ...) expr)
              | (lambda (param ...) expr)
              | (op2 expr expr)          ; binary arithmetic/comparison
              | (name expr ...)          ; function call
              | (print expr)

op2         ::= + | - | * | / | = | < | > | <= | >=

integer     ::= optional-minus digit+
boolean     ::= #t | #f
name        ::= [a-z][a-z0-9-]*        ; lowercase with digits and hyphens

A TIL program is a sequence of define forms followed by an implicit call to main. Each define introduces a named function at top level.

TIL by Example

Factorial:

(define (factorial n)
  (if (<= n 1)
      1
      (* n (factorial (- n 1)))))

Fibonacci using mutation (showing set! and while):

(define (fib n)
  (let ((a 0) (b 1) (i 0))
    (begin
      (while (< i n)
        (begin
          (set! i (+ i 1))
          (let ((tmp b))
            (begin
              (set! b (+ a b))
              (set! a tmp)))))
      a)))

Notice that while in TIL is an expression: it evaluates its body for side effects and returns #f when the condition becomes false. The real work is the set! mutations inside. Also notice the begin wrappers: the grammar gives let and while exactly one body expression, so sequencing has to be spelled out explicitly. This mirrors C, where a while statement produces no value; the useful output accumulates in mutable bindings.

Higher-order functions — passing a function as an argument:

(define (apply-twice f x)
  (f (f x)))

(define (double x) (* 2 x))

A call (apply-twice double 3) should return 12. This requires closures: double must be a first-class value, and apply-twice must be able to call whatever it receives.

Scoping Rules

TIL uses lexical scope (also called static scope): the meaning of a name is determined by where the name appears in the source code, not by the call chain at runtime. This is the same scoping rule used by Racket, Python, JavaScript, and C (for local variables).

(define (make-adder n)
  (lambda (x) (+ x n)))

(define add5 (make-adder 5))

When (make-adder 5) executes, it creates a closure over the n = 5 binding in scope at the lambda expression. The returned closure captures that environment. Later, calling (add5 3) evaluates (+ x n) in the environment of the original lambda body, where n is 5, giving 8. The fact that make-adder has already returned and its frame is conceptually “gone” does not matter — the closure keeps the environment alive.

This is where the name “closure” comes from: the function value closes over the free variables in its body (here, n).

Lexical scope is what makes closures well-defined. If TIL used dynamic scope instead (where names are looked up in the current call stack), n would refer to whatever n meant at the call site, not at the definition site. Dynamic scope makes programs harder to reason about and is not the default in any modern general-purpose language.

What TIL Deliberately Omits

No arrays. Adding arrays requires either a type system (to distinguish array-pointer from scalar) or a runtime type tag (which bloats every value). We omit arrays to keep the value representation in Chapter 12’s GC simple.

No recursion without explicit function definition. TIL’s let is not letrec: bindings in (let ((f (lambda (x) (f x)))) ...) cannot call f recursively, because the f in the lambda body refers to the outer f, not the one being defined. Top-level define forms are mutually recursive (because the top-level environment is already populated before any define’s body is evaluated). This parallels C, where a local variable cannot usefully refer to itself in its own initializer; it needs a separate assignment after the declaration.

No error handling. Division by zero, calling a non-function, or referencing an unbound name returns a special error value TIL_ERR that propagates through arithmetic but stops the program when printed. A production interpreter would use exceptions or result types.

No tail-call optimization. TIL’s factorial as written above will overflow the C stack for large n (each recursive call adds a C stack frame, since we implement TIL’s function calls via C function calls in Chapter 11). Fixing this requires converting the interpreter to continuation-passing style and trampoline dispatch, which is FICS2 §8 (Continuations) territory — beyond our scope.

Formal Semantics Preview

We will implement TIL using a big-step operational semantics in the style of Chapter 1:

\[\langle e, \sigma, \rho \rangle \Downarrow \langle v, \sigma' \rangle\]

Here \(e\) is the expression to evaluate, \(\sigma\) is the current store (mapping locations to values — this is where set! writes), \(\rho\) is the current environment (mapping names to locations — this is lexical scope), \(v\) is the resulting value, and \(\sigma'\) is the post-evaluation store.

The environment \(\rho\) is immutable per expression — it changes only when entering a new scope (let or function call). The store \(\sigma\) changes under set! and when new let bindings are introduced. This two-level design — \(\rho\) for names, \(\sigma\) for values — is what allows closures to capture \(\rho\) at definition time while still sharing mutable state through \(\sigma\).

In the implementation, \(\rho\) will be a linked list of frames, and \(\sigma\) will be implicit in the C heap (every TIL value object lives at a C heap address, and set! writes through a pointer stored in the environment). We make this concrete in Chapter 11.


Chapter 10: Lexer and Recursive-Descent Parser in C

The Pipeline

Converting TIL source text into something we can evaluate requires two stages: lexing (also called tokenization or scanning) and parsing. The lexer reads a stream of characters and produces a stream of tokens — the words of the language. The parser reads the token stream and builds an abstract syntax tree (AST) — a structured representation of the program’s meaning.

The pipeline:

source text  →  [lexer]  →  token stream  →  [parser]  →  AST

After parsing, the evaluator in Chapter 11 walks the AST, and Chapter 12’s garbage collector manages the AST nodes’ memory.

Tokens

A TIL token is one of the following kinds:

typedef enum {
    TOK_LPAREN,    // (
    TOK_RPAREN,    // )
    TOK_INT,       // integer literal: 42, -7
    TOK_BOOL,      // #t or #f
    TOK_IDENT,     // identifier: factorial, n, apply-twice
    TOK_EOF,       // end of input
    TOK_ERR,       // lexer error: unrecognized character
} tok_kind;

typedef struct {
    tok_kind kind;
    union {
        long    int_val;    // for TOK_INT
        int     bool_val;   // for TOK_BOOL: 1 or 0
        char    ident[64];  // for TOK_IDENT
    };
} Token;

A Token is a small tagged union: the kind field discriminates, and the union body holds the kind-specific data. We use a fixed-length ident buffer rather than a dynamically allocated string to avoid allocation during lexing.

Tagged union. A C idiom combining an enum discriminant with a union of possible payloads. The discriminant tells you which union member is currently valid. Tagged unions are C's way of representing values with multiple possible types — what a functional language handles with algebraic data types and pattern matching.

The Lexer

The lexer holds a pointer to the current position in the input string:

typedef struct {
    const char *src;    // full source
    size_t       pos;   // current read position
} Lexer;

static void lexer_init(Lexer *L, const char *src) {
    L->src = src;
    L->pos = 0;
}

static char peek(Lexer *L) {
    return L->src[L->pos];
}

static char advance(Lexer *L) {
    return L->src[L->pos++];
}

The core function produces the next token:

Token next_token(Lexer *L) {
    Token t;

    // skip whitespace and comments (;; to end of line)
    while (1) {
        while (peek(L) == ' ' || peek(L) == '\t' ||
               peek(L) == '\n' || peek(L) == '\r') {
            advance(L);
        }
        if (peek(L) == ';') {
            while (peek(L) && peek(L) != '\n') advance(L);
        } else break;
    }

    char c = peek(L);

    if (c == '\0') { t.kind = TOK_EOF; return t; }
    if (c == '(')  { advance(L); t.kind = TOK_LPAREN; return t; }
    if (c == ')')  { advance(L); t.kind = TOK_RPAREN; return t; }

    // boolean: #t or #f
    if (c == '#') {
        advance(L);
        char b = advance(L);
        t.kind = TOK_BOOL;
        t.bool_val = (b == 't') ? 1 : 0;
        return t;
    }

    // integer: optional minus, then digits
    if (c == '-' || isdigit(c)) {
        long val = 0;
        int neg = 0;
        if (c == '-') { neg = 1; advance(L); c = peek(L); }
        if (!isdigit(c)) {
            // lone '-' is an identifier (the subtraction operator)
            t.kind = TOK_IDENT;
            t.ident[0] = '-'; t.ident[1] = '\0';
            return t;
        }
        while (isdigit(peek(L))) {
            val = val * 10 + (advance(L) - '0');
        }
        t.kind = TOK_INT;
        t.int_val = neg ? -val : val;
        return t;
    }

    // identifier: [a-z][a-z0-9!?-]*
    if (islower(c) || c == '+' || c == '*' || c == '/' ||
        c == '=' || c == '<' || c == '>') {
        size_t i = 0;
        while (peek(L) && peek(L) != '(' && peek(L) != ')' &&
               !isspace((unsigned char)peek(L))) {
            if (i < 63) t.ident[i++] = advance(L);
            else advance(L);
        }
        t.ident[i] = '\0';
        t.kind = TOK_IDENT;
        return t;
    }

    // unrecognized
    advance(L);
    t.kind = TOK_ERR;
    return t;
}

A few design notes. We scan - and immediately check whether the next character is a digit; if so, it is a negative integer literal; if not, - is an identifier (the subtraction operator). This is a classic lexer ambiguity in Lisp-family languages. Operators like +, *, <= are all identifiers — TIL does not have separate operator tokens, and the parser handles them by name in the appropriate expression forms.

The 64-character identifier limit is a simplification. A production lexer would either use a global string intern table (so all uses of the same identifier share a pointer) or dynamically allocate identifiers. The intern table has the bonus of making identifier equality a pointer comparison rather than strcmp.
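
A hedged sketch of such an intern table (illustrative, not part of the TIL lexer above):

#include <stdlib.h>
#include <string.h>

#define MAX_INTERNED 1024

static char *interned[MAX_INTERNED];
static int   n_interned;

// Return the canonical copy of s, creating it on first use. Every identical
// spelling maps to the same pointer, so equality becomes a pointer compare.
const char *intern(const char *s) {
    for (int i = 0; i < n_interned; i++) {
        if (strcmp(interned[i], s) == 0) return interned[i];
    }
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    memcpy(copy, s, len);
    interned[n_interned++] = copy;
    return copy;
}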

The Abstract Syntax Tree

The AST represents the parsed program as a tree of nodes. Each node corresponds to one syntactic form:

typedef enum {
    EXPR_INT,       // integer literal
    EXPR_BOOL,      // boolean literal
    EXPR_NAME,      // variable reference
    EXPR_IF,        // (if cond then else)
    EXPR_WHILE,     // (while cond body)
    EXPR_BEGIN,     // (begin e1 e2 ...)
    EXPR_SET,       // (set! name e)
    EXPR_LET,       // (let ((x e) ...) body)
    EXPR_LAMBDA,    // (lambda (params...) body)
    EXPR_CALL,      // (func arg ...)
    EXPR_PRINT,     // (print e)
    EXPR_OP2,       // (op e e) — binary operator
} expr_kind;

Because different expression kinds have different shapes, we use a tagged union for the AST node:

typedef struct Expr Expr;
typedef struct ExprList ExprList;
typedef struct Binding Binding;

struct ExprList {
    Expr     *expr;
    ExprList *next;
};

struct Binding {
    char     name[64];
    Expr     *val;
    Binding  *next;
};

struct Expr {
    expr_kind kind;
    union {
        long  int_val;              // EXPR_INT
        int   bool_val;             // EXPR_BOOL
        char  name[64];             // EXPR_NAME
        struct { char name[64]; Expr *val; } set;       // EXPR_SET
        struct { Expr *cond, *then_, *else_; } if_;   // EXPR_IF
        struct { Expr *cond, *body; } while_;          // EXPR_WHILE
        ExprList *begin_list;       // EXPR_BEGIN
        struct { Binding *binds; Expr *body; } let;   // EXPR_LET
        struct { ExprList *params; Expr *body; } lam; // EXPR_LAMBDA
        struct { Expr *func; ExprList *args; } call;  // EXPR_CALL
        Expr *print_expr;           // EXPR_PRINT
        struct { char op[4]; Expr *left, *right; } op2; // EXPR_OP2
    };
};

All Expr nodes are heap-allocated with malloc. Their lifetime is the entire program execution — we never free them individually. In a production interpreter, the parser arena would own all AST nodes and free them at once after the program runs. For our purposes, the small leak is acceptable.

Recursive Descent

A recursive descent parser translates the grammar directly into a set of mutually recursive C functions — one function per grammar production. TIL’s grammar is simple enough that no lookahead beyond the current token is needed (it is LL(1): we always know which production to use by looking at the next token).

The parser holds the lexer and the current (one token) lookahead:

typedef struct {
    Lexer  lex;
    Token  current;   // one-token lookahead: fetched from the lexer, not yet consumed
} Parser;

static void parser_init(Parser *P, const char *src) {
    lexer_init(&P->lex, src);
    P->current = next_token(&P->lex);
}

static Token consume(Parser *P) {
    Token t = P->current;
    P->current = next_token(&P->lex);
    return t;
}

static Token expect(Parser *P, tok_kind k) {
    Token t = P->current;
    if (t.kind != k) {
        fprintf(stderr, "parse error: expected %d, got %d\n", k, t.kind);
        exit(1);
    }
    return consume(P);
}

The main expression parser:

static Expr *parse_expr(Parser *P) {
    Token t = P->current;

    if (t.kind == TOK_INT) {
        consume(P);
        Expr *e = malloc(sizeof(Expr));
        e->kind = EXPR_INT;
        e->int_val = t.int_val;
        return e;
    }
    if (t.kind == TOK_BOOL) {
        consume(P);
        Expr *e = malloc(sizeof(Expr));
        e->kind = EXPR_BOOL;
        e->bool_val = t.bool_val;
        return e;
    }
    if (t.kind == TOK_IDENT) {
        consume(P);
        Expr *e = malloc(sizeof(Expr));
        e->kind = EXPR_NAME;
        strncpy(e->name, t.ident, 64);
        return e;
    }
    if (t.kind == TOK_LPAREN) {
        consume(P);   // eat '('
        return parse_compound(P);
    }
    fprintf(stderr, "parse error: unexpected token\n");
    exit(1);
}

parse_compound handles all the parenthesized forms. It reads the first element to determine which form we are in:

static Expr *parse_compound(Parser *P) {
    // P->current is the token immediately after the '('
    Token head = P->current;
    Expr *e = malloc(sizeof(Expr));

    if (head.kind == TOK_IDENT && strcmp(head.ident, "if") == 0) {
        consume(P);
        e->kind = EXPR_IF;
        e->if_.cond  = parse_expr(P);
        e->if_.then_ = parse_expr(P);
        e->if_.else_ = parse_expr(P);
        expect(P, TOK_RPAREN);
        return e;
    }
    if (head.kind == TOK_IDENT && strcmp(head.ident, "while") == 0) {
        consume(P);
        e->kind = EXPR_WHILE;
        e->while_.cond = parse_expr(P);
        e->while_.body = parse_expr(P);
        expect(P, TOK_RPAREN);
        return e;
    }
    if (head.kind == TOK_IDENT && strcmp(head.ident, "begin") == 0) {
        consume(P);
        e->kind = EXPR_BEGIN;
        e->begin_list = parse_expr_list(P);  // parses until ')'
        return e;
    }
    if (head.kind == TOK_IDENT && strcmp(head.ident, "set!") == 0) {
        consume(P);
        Token name_tok = expect(P, TOK_IDENT);
        e->kind = EXPR_SET;
        strncpy(e->set.name, name_tok.ident, 64);
        e->set.val = parse_expr(P);    // the value expression
        expect(P, TOK_RPAREN);
        return e;
    }
    if (head.kind == TOK_IDENT && strcmp(head.ident, "let") == 0) {
        consume(P);
        e->kind = EXPR_LET;
        expect(P, TOK_LPAREN);    // eat the '(' that opens the binding list
        e->let.binds = parse_binding_list(P);
        e->let.body  = parse_expr(P);
        expect(P, TOK_RPAREN);
        return e;
    }
    if (head.kind == TOK_IDENT && strcmp(head.ident, "lambda") == 0) {
        consume(P);
        e->kind = EXPR_LAMBDA;
        expect(P, TOK_LPAREN);    // eat the '(' that opens the parameter list
        e->lam.params = parse_param_list(P);
        e->lam.body   = parse_expr(P);
        expect(P, TOK_RPAREN);
        return e;
    }
    if (head.kind == TOK_IDENT && strcmp(head.ident, "print") == 0) {
        consume(P);
        e->kind = EXPR_PRINT;
        e->print_expr = parse_expr(P);
        expect(P, TOK_RPAREN);
        return e;
    }
    // binary operator
    if (head.kind == TOK_IDENT && is_op2(head.ident)) {
        consume(P);
        e->kind = EXPR_OP2;
        strncpy(e->op2.op, head.ident, 4);
        e->op2.left  = parse_expr(P);
        e->op2.right = parse_expr(P);
        expect(P, TOK_RPAREN);
        return e;
    }
    // function call: first element is a function expression
    e->kind = EXPR_CALL;
    e->call.func = parse_expr(P);
    e->call.args = parse_expr_list(P);
    return e;
}

The helper functions parse_expr_list, parse_binding_list, and parse_param_list all read sequences until a ) token:

static ExprList *parse_expr_list(Parser *P) {
    if (P->current.kind == TOK_RPAREN) {
        consume(P);
        return NULL;
    }
    ExprList *node = malloc(sizeof(ExprList));
    node->expr = parse_expr(P);
    node->next = parse_expr_list(P);
    return node;
}

This is right-recursive and naturally builds a linked list. The recursion depth is bounded by the length of the list being read (one call per element), plus the expression nesting depth contributed by the nested parse_expr calls, which is reasonable for typical programs.

Parsing the Top Level

A TIL program is a list of define forms. Each define is:

(define (name param ...) body)

We store top-level definitions in a simple array:

typedef struct {
    char    name[64];
    Expr   *body;         // an EXPR_LAMBDA
} TopDef;

#define MAX_DEFS 256
TopDef defs[MAX_DEFS];
int    n_defs = 0;

void parse_program(const char *src) {
    Parser P;
    parser_init(&P, src);
    while (P.current.kind != TOK_EOF) {
        expect(&P, TOK_LPAREN);
        Token kw = expect(&P, TOK_IDENT);
        assert(strcmp(kw.ident, "define") == 0);
        expect(&P, TOK_LPAREN);
        Token fname = expect(&P, TOK_IDENT);
        // read params
        ExprList *params = parse_param_list(&P);
        Expr *body = parse_expr(&P);
        expect(&P, TOK_RPAREN);
        // wrap in lambda
        Expr *lam = malloc(sizeof(Expr));
        lam->kind = EXPR_LAMBDA;
        lam->lam.params = params;
        lam->lam.body   = body;
        strncpy(defs[n_defs].name, fname.ident, 64);
        defs[n_defs].body = lam;
        n_defs++;
    }
}

After parse_program, we populate the top-level environment with all the defs before evaluating anything. This is what gives TIL’s top-level functions mutual recursion for free.

Error Recovery

Our parser calls exit(1) on any syntax error — the simplest possible error handling. A production parser would instead record errors and continue parsing (to find additional errors), or implement panic mode recovery (consuming tokens until a safe synchronization point like the next ) or the end of a top-level form). Cooper & Torczon’s Engineering a Compiler §3 covers error recovery in systematic detail. For TIL, exit(1) is appropriate: we are building a teaching interpreter, not a production compiler.


Chapter 11: Eval and Environments

The Eval Function’s Job

The evaluator takes an AST node and an environment and produces a value. In the notation from Chapter 9:

\[\text{eval}(e, \rho, \sigma) \;\to\; (v, \sigma')\]

In C, the environment \(\rho\) is a pointer to the current frame (a linked list), and the store \(\sigma\) is implicit in the heap: every TIL value is a heap-allocated C struct, and set! modifies the struct in place. There is no separate \(\sigma\) data structure; the heap is \(\sigma\).

TIL Values in C

Every TIL value is represented as a heap-allocated Value struct:

typedef enum {
    VAL_INT,
    VAL_BOOL,
    VAL_CLOSURE,
    VAL_ERROR,
} val_kind;

typedef struct Value Value;
typedef struct Env   Env;

struct Value {
    val_kind kind;
    int      marked;    // for the GC in Chapter 12
    union {
        long    int_val;
        int     bool_val;
        struct {
            ExprList *params;  // parameter names (AST nodes of kind EXPR_NAME)
            Expr     *body;
            Env      *env;     // the environment at definition time
        } closure;
    };
};

The marked field is used exclusively by the garbage collector in Chapter 12. Every Value is heap-allocated with malloc; every Env frame is heap-allocated with malloc. The GC will manage their lifetimes.

Environments as Linked Frames

An environment maps names to Value * pointers. We represent it as a linked list of frames, where each frame is a small association list (array of name→value pairs). A new frame is pushed when entering a let expression or calling a function; the outer frames are accessible via the parent pointer.

#define FRAME_SIZE 16

typedef struct Binding_cell {
    char    name[64];
    Value  *val;       // pointer into the heap; GC may move this (we won't)
} Binding_cell;

struct Env {
    Binding_cell bindings[FRAME_SIZE];
    int          count;
    Env         *parent;
    int          marked;   // for GC
};

static Env *env_new(Env *parent) {
    Env *e = malloc(sizeof(Env));
    e->count  = 0;
    e->parent = parent;
    e->marked = 0;
    return e;
}

static void env_bind(Env *e, const char *name, Value *val) {
    assert(e->count < FRAME_SIZE);
    strncpy(e->bindings[e->count].name, name, 64);
    e->bindings[e->count].val = val;
    e->count++;
}

static Value **env_lookup(Env *e, const char *name) {
    for (; e; e = e->parent) {
        for (int i = 0; i < e->count; i++) {
            if (strcmp(e->bindings[i].name, name) == 0) {
                return &e->bindings[i].val;
            }
        }
    }
    return NULL;   // unbound name
}

env_lookup returns a Value ** — a pointer to the slot in the frame — not just the Value *. This is essential for set!: mutation writes a new Value * into the existing slot, and env_lookup gives us the address of that slot.
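
To make the point concrete, here is what evaluating (set! x 42) against a one-frame environment amounts to (a sketch; make_int is an assumed helper in the same spirit as make_bool and make_error):

// Assumed helper: wrap a C long as a TIL integer value.
static Value *make_int(long n) {
    Value *v = malloc(sizeof(Value));
    v->kind = VAL_INT; v->marked = 0; v->int_val = n;
    return v;
}

static void set_demo(void) {
    Env *global = env_new(NULL);
    env_bind(global, "x", make_int(1));

    Value **slot = env_lookup(global, "x");   // address of the frame's slot
    *slot = make_int(42);                     // the effect of (set! x 42)
    // any closure whose env chains to global now sees x as 42
}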

[Figure: the environment chain created by (make-adder 5). The global environment (parent → NULL) binds make-adder and add5 to closures. The make-adder frame binds n → 5 and has parent → global. The add5 closure records params (x) and env → the make-adder frame. When (add5 3) is called, a new frame {x → 3} is pushed with parent = the make-adder frame, so (+ x n) finds n = 5.]

The Evaluator

The central function eval dispatches on the expression kind:

Value *eval(Expr *e, Env *env) {
    if (!e) return make_error();

    switch (e->kind) {

    case EXPR_INT: {
        Value *v = malloc(sizeof(Value));
        v->kind = VAL_INT; v->marked = 0;
        v->int_val = e->int_val;
        return v;
    }

    case EXPR_BOOL: {
        Value *v = malloc(sizeof(Value));
        v->kind = VAL_BOOL; v->marked = 0;
        v->bool_val = e->bool_val;
        return v;
    }

    case EXPR_NAME: {
        Value **slot = env_lookup(env, e->name);
        if (!slot) {
            fprintf(stderr, "unbound: %s\n", e->name);
            return make_error();
        }
        return *slot;
    }

    case EXPR_SET: {
        Value **slot = env_lookup(env, e->set.name);
        if (!slot) { fprintf(stderr, "set! unbound: %s\n", e->set.name); return make_error(); }
        Value *nv = eval(e->set.val, env);    // evaluate the new value
        if (nv->kind == VAL_ERROR) return nv;
        *slot = nv;                           // write it into the frame's slot
        Value *v = malloc(sizeof(Value));
        v->kind = VAL_BOOL; v->marked = 0; v->bool_val = 0;   // set! returns #f
        return v;
    }

    case EXPR_IF: {
        Value *cond = eval(e->if_.cond, env);
        if (cond->kind == VAL_ERROR) return cond;
        int truthy = (cond->kind == VAL_BOOL) ? cond->bool_val : 1;
        return eval(truthy ? e->if_.then_ : e->if_.else_, env);
    }

    case EXPR_WHILE: {
        while (1) {
            Value *cond = eval(e->while_.cond, env);
            if (cond->kind == VAL_ERROR) return cond;
            if (cond->kind == VAL_BOOL && !cond->bool_val) break;
            Value *body_val = eval(e->while_.body, env);
            if (body_val->kind == VAL_ERROR) return body_val;
        }
        Value *v = malloc(sizeof(Value));
        v->kind = VAL_BOOL; v->marked = 0; v->bool_val = 0;
        return v;
    }

    case EXPR_BEGIN: {
        Value *last = make_bool(0);
        for (ExprList *el = e->begin_list; el; el = el->next) {
            last = eval(el->expr, env);
            if (last->kind == VAL_ERROR) return last;
        }
        return last;
    }

    case EXPR_LET: {
        Env *new_env = env_new(env);
        for (Binding *b = e->let.binds; b; b = b->next) {
            Value *v = eval(b->val, env);   // evaluate in *outer* env
            if (v->kind == VAL_ERROR) return v;
            env_bind(new_env, b->name, v);
        }
        return eval(e->let.body, new_env);
    }

    case EXPR_LAMBDA: {
        Value *v = malloc(sizeof(Value));
        v->kind = VAL_CLOSURE; v->marked = 0;
        v->closure.params = e->lam.params;
        v->closure.body   = e->lam.body;
        v->closure.env    = env;           // capture current environment
        return v;
    }

    case EXPR_CALL: {
        Value *func = eval(e->call.func, env);
        if (func->kind == VAL_ERROR) return func;
        if (func->kind != VAL_CLOSURE) {
            fprintf(stderr, "call: not a function\n");
            return make_error();
        }
        // build new frame with parameter bindings
        Env *call_env = env_new(func->closure.env);  // parent = closure's env
        ExprList *param = func->closure.params;
        ExprList *arg   = e->call.args;
        while (param && arg) {
            Value *argv = eval(arg->expr, env);
            if (argv->kind == VAL_ERROR) return argv;
            env_bind(call_env, param->expr->name, argv);
            param = param->next;
            arg   = arg->next;
        }
        if (param || arg) {
            fprintf(stderr, "arity mismatch\n");
            return make_error();
        }
        return eval(func->closure.body, call_env);
    }

    case EXPR_OP2: {
        Value *l = eval(e->op2.left,  env);
        Value *r = eval(e->op2.right, env);
        if (l->kind == VAL_ERROR) return l;
        if (r->kind == VAL_ERROR) return r;
        return apply_op2(e->op2.op, l, r);
    }

    case EXPR_PRINT: {
        Value *v = eval(e->print_expr, env);
        print_value(v);
        return v;
    }

    default:
        return make_error();
    }
}

The apply_op2 helper handles arithmetic and comparison:

static Value *apply_op2(const char *op, Value *l, Value *r) {
    if (l->kind != VAL_INT || r->kind != VAL_INT) return make_error();
    long a = l->int_val, b = r->int_val;
    Value *v = malloc(sizeof(Value));
    v->marked = 0;
    if (strcmp(op, "+") == 0)  { v->kind=VAL_INT;  v->int_val  = a + b; }
    else if (strcmp(op, "-") == 0)  { v->kind=VAL_INT;  v->int_val  = a - b; }
    else if (strcmp(op, "*") == 0)  { v->kind=VAL_INT;  v->int_val  = a * b; }
    else if (strcmp(op, "/") == 0)  {
        if (b == 0) { free(v); return make_error(); }
        v->kind=VAL_INT; v->int_val = a / b;
    }
    else if (strcmp(op, "=") == 0)  { v->kind=VAL_BOOL; v->bool_val = (a == b); }
    else if (strcmp(op, "<") == 0)  { v->kind=VAL_BOOL; v->bool_val = (a <  b); }
    else if (strcmp(op, ">") == 0)  { v->kind=VAL_BOOL; v->bool_val = (a >  b); }
    else if (strcmp(op, "<=") == 0) { v->kind=VAL_BOOL; v->bool_val = (a <= b); }
    else if (strcmp(op, ">=") == 0) { v->kind=VAL_BOOL; v->bool_val = (a >= b); }
    else { free(v); return make_error(); }
    return v;
}

The set! Case and Why It Works

The EXPR_SET case deserves a second look:

case EXPR_SET: {
    Value **slot = env_lookup(env, e->name);
    *slot = eval(e->if_.cond, env);
    ...
}

env_lookup walks the environment chain from the innermost frame outward, looking for a binding with the given name. It returns &frame->bindings[i].val — a pointer to the slot where the Value * is stored. Writing through slot changes the Value * stored in that slot, which is visible to any code that looks up the same name in the same or inner environments.

This is lexical mutation: set! modifies the nearest enclosing binding, following the static scope chain. It does not create a new binding; if the name is unbound, set! is an error. The combination of lexical scope (lambda captures env) and lexical mutation (set! modifies a slot reachable through env) gives TIL the same semantics as Scheme/Racket’s mutable bindings — as described by Ragde in FICS2 §2 with the store model from Chapter 1.

Why Every Value is Heap-Allocated

Every Value is created with malloc. This is deliberate and important: closures capture environment pointers, and environments may outlive the function call that created them. If we stack-allocated Env frames (or Value structs), they would be destroyed when the function returns, leaving the closure pointing to a dangling stack frame.

Heap allocation means values and environments persist until explicitly freed or collected. This is the setup for Chapter 12: without GC, the heap grows without bound as the program creates closures, and eventually malloc returns NULL.
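A sketch of what would go wrong with stack allocation (make_adder_bad is a hypothetical helper written directly in C, not part of TIL): the frame dies with the C call, but the returned closure still points at it.

// WRONG: a stack-allocated frame captured by a closure
Value *make_adder_bad(Env *outer, Value *n) {
    Env frame = { .count = 0, .parent = outer, .marked = 0 };   // lives in this C stack frame
    env_bind(&frame, "n", n);

    Value *clo = malloc(sizeof(Value));
    clo->kind = VAL_CLOSURE;
    clo->marked = 0;
    clo->closure.params = NULL;      // details elided; the env pointer is the point
    clo->closure.body   = NULL;
    clo->closure.env    = &frame;    // dangling as soon as this function returns
    return clo;                      // any later lookup through clo->closure.env is UB
}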

Running the Interpreter

The main loop:

int main(void) {
    /* read source from stdin (or a file) */
    char src[65536];
    size_t n = fread(src, 1, sizeof(src)-1, stdin);
    src[n] = '\0';

    /* parse */
    parse_program(src);

    /* build global environment */
    Env *global = env_new(NULL);
    for (int i = 0; i < n_defs; i++) {
        Value *clo = eval(defs[i].body, global);
        env_bind(global, defs[i].name, clo);
    }

    /* call main() */
    Value **main_slot = env_lookup(global, "main");
    if (!main_slot) { fprintf(stderr, "no main defined\n"); return 1; }
    Value *main_clo = *main_slot;
    Env *call_env = env_new(main_clo->closure.env);
    Value *result = eval(main_clo->closure.body, call_env);
    print_value(result);
    return 0;
}

Notice that all define bodies are evaluated in the global environment before main is called. Each define with a lambda body creates a closure that captures the global environment by pointer, and evaluating a lambda does not look up any names, so it does not matter that later defines are not yet bound at that point. By the time any closure body actually runs, the global environment is fully populated, so the closures can call each other, giving mutual recursion at the top level.

The evaluation of a lambda expression does not execute the body. It creates a closure value that captures the current environment and remembers the body and parameter list. The body executes only when the closure is called (EXPR_CALL). This is standard call-by-value semantics: arguments are evaluated before the call, and function bodies execute only when invoked.

SICP’s Environment Model

The design above is directly descended from the environment model in SICP (Abelson & Sussman, §3.2). SICP describes environments as “a sequence of frames, each frame being a table of bindings associating variable names with their corresponding values.” A new frame is created for each procedure application, with the enclosing environment of the procedure as its parent. set! modifies the binding in the first frame where the variable is found.

Our implementation faithfully implements this model in C. The only difference from SICP’s Scheme implementation is that we store Value * pointers rather than values directly, because C requires us to manage memory explicitly. Every conceptual choice — lexical scope, mutation via frame slots, closure capture of environment — is the same.


Chapter 12: Mark-and-Sweep, the Small One

Why GC Becomes Necessary

In Chapter 11, every call to eval creates one or more heap-allocated Value * and potentially a new Env * frame. The factorial function, called with n = 10, creates roughly:

  • 10 Env * frames (one per recursive call)
  • ~50 Value * objects (arguments, partial results, booleans from comparisons)

Most of these become garbage almost immediately — the recursive calls return and no one holds a reference to those frames or values anymore. But with malloc-only allocation and no free, they accumulate. A program that calls factorial in a loop will exhaust the heap.

The problem is deeper for closures. A closure captures an environment Env *. If the closure is stored in a variable and outlives the function call that created it, the captured environment must persist too. But determining when the environment can be freed is exactly the lifetime analysis problem that garbage collection solves automatically.

Manual free is not a solution here — it would require tracking every pointer to every Value and Env, ensuring each is freed exactly once when no other references remain. This is the reference-counting or ownership-tracking problem. Languages with GC (Java, Python, Go, Racket) solve it by letting the runtime find and free unreachable objects automatically.

We will implement the simplest GC strategy: mark-and-sweep.

The Object Graph

The live values in the program at any moment are exactly those reachable from the root set: the set of Value * and Env * pointers that the evaluator currently holds. During the execution of eval, the roots are:

  • All Env * frames currently in use (the current call chain).
  • The global environment.
  • The TIL value currently being returned (if any).

Values reachable from these roots — transitively, through closure.env, env.bindings[i].val, and env.parent — are live. All others are garbage.

The mark-and-sweep algorithm has two phases:

  1. Mark. Starting from the roots, traverse the object graph and mark every reachable Value and Env.
  2. Sweep. Scan the entire heap and free every unmarked Value and Env. Reset marks on surviving objects.

After one mark-and-sweep cycle, all garbage has been freed and the heap is ready for new allocations.

Implementation: A Pool Allocator

To implement sweep efficiently, we need to enumerate all allocated objects. With raw malloc, we cannot do this — malloc’s heap is opaque. We use a pool allocator: pre-allocate arrays of Value and Env objects, track which ones are in use, and allocate from the pool.

#define VALUE_POOL_SIZE  8192
#define ENV_POOL_SIZE    2048

static Value value_pool[VALUE_POOL_SIZE];
static int   value_used[VALUE_POOL_SIZE];   // 1 if slot is in use
static int   value_pool_top = 0;

static Env   env_pool[ENV_POOL_SIZE];
static int   env_used[ENV_POOL_SIZE];
static int   env_pool_top = 0;

Allocation from the pool:

Value *alloc_value(void) {
    // fast path: bump the top
    if (value_pool_top < VALUE_POOL_SIZE) {
        value_used[value_pool_top] = 1;
        value_pool[value_pool_top].marked = 0;
        return &value_pool[value_pool_top++];
    }
    // pool full: trigger GC, then retry
    gc_collect();
    for (int i = 0; i < VALUE_POOL_SIZE; i++) {
        if (!value_used[i]) {
            value_used[i] = 1;
            value_pool[i].marked = 0;
            return &value_pool[i];
        }
    }
    fprintf(stderr, "out of memory\n");
    exit(1);
}

Env *alloc_env(void) {
    if (env_pool_top < ENV_POOL_SIZE) {
        env_used[env_pool_top] = 1;
        env_pool[env_pool_top].marked = 0;
        return &env_pool[env_pool_top++];
    }
    gc_collect();
    for (int i = 0; i < ENV_POOL_SIZE; i++) {
        if (!env_used[i]) {
            env_used[i] = 1;
            env_pool[i].marked = 0;
            return &env_pool[i];
        }
    }
    fprintf(stderr, "out of env memory\n");
    exit(1);
}

Replace every malloc(sizeof(Value)) and malloc(sizeof(Env)) in Chapters 10 and 11 with alloc_value() and alloc_env().
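The rewrite is mechanical; for example, the EXPR_INT case of eval becomes:

case EXPR_INT: {
    Value *v = alloc_value();        // was: malloc(sizeof(Value))
    v->kind = VAL_INT;               // alloc_value has already cleared v->marked
    v->int_val = e->int_val;
    return v;
}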

The Mark Phase

Marking traverses the object graph recursively from the roots. We mark both Value and Env objects:

static void mark_value(Value *v);
static void mark_env(Env *e);

static void mark_value(Value *v) {
    if (!v || v->marked) return;
    v->marked = 1;
    if (v->kind == VAL_CLOSURE) {
        mark_env(v->closure.env);
        // AST nodes (params, body) are not heap-allocated by the pool,
        // they live in the parse arena and are never collected.
    }
}

static void mark_env(Env *e) {
    if (!e || e->marked) return;
    e->marked = 1;
    for (int i = 0; i < e->count; i++) {
        mark_value(e->bindings[i].val);
    }
    mark_env(e->parent);
}

The if (!v || v->marked) return; guard prevents infinite loops when the object graph contains cycles (which TIL can create with set! — for example, a closure stored in a variable it closes over). The marked flag is cleared in the sweep phase and set here; the second visit to a marked object terminates the recursion.

[Figure: the mark phase. Objects reachable from the roots (the global env, a closure, a call frame, int:5) are marked; unreachable objects (int:99, bool:#t, a dead closure) stay unmarked. The sweep then frees every unmarked object, resets the marks on survivors, and returns the reclaimed pool slots to the allocator.]

The Sweep Phase

After marking, every live object has marked = 1. Every dead object has marked = 0. Sweep frees the dead ones:

static void gc_sweep(void) {
    for (int i = 0; i < value_pool_top; i++) {
        if (value_used[i] && !value_pool[i].marked) {
            value_used[i] = 0;   // reclaim slot
        }
        value_pool[i].marked = 0;   // reset mark for next cycle
    }
    for (int i = 0; i < env_pool_top; i++) {
        if (env_used[i] && !env_pool[i].marked) {
            env_used[i] = 0;
        }
        env_pool[i].marked = 0;
    }
}

The sweep also resets marked on surviving objects, preparing for the next GC cycle. The pool slots marked as unused are available for new allocations in alloc_value/alloc_env.

Collecting: Assembling the Roots

The full GC entry point needs to identify the root set. This is the hard part in a real GC: the runtime must know exactly which stack slots, global variables, and registers hold live pointers. In our simplified interpreter, the roots are:

  1. The global environment pointer.
  2. Any Env * in the call chain at the moment of collection (currently active call frames).

We track the call chain with a simple global stack of environment pointers:

#define GC_STACK_SIZE 1024
static Env *gc_root_stack[GC_STACK_SIZE];
static int  gc_root_top = 0;

static void gc_push_root(Env *e) { gc_root_stack[gc_root_top++] = e; }
static void gc_pop_root(void)    { gc_root_top--; }

In eval, before any allocation that might trigger GC, we push the current environment. After the allocation, we pop it. In practice, we push at the start of every eval call and pop at the end:

Value *eval(Expr *e, Env *env) {
    gc_push_root(env);
    Value *result = eval_inner(e, env);   // the actual dispatch
    gc_pop_root();
    return result;
}

Then gc_collect marks from all root-stack frames, plus the global:

extern Env *global_env;   // set up in main()

void gc_collect(void) {
    // mark phase
    mark_env(global_env);
    for (int i = 0; i < gc_root_top; i++) {
        mark_env(gc_root_stack[i]);
    }
    // sweep phase
    gc_sweep();
}

When to Trigger Collection

Our alloc_value triggers GC when the pool is full. This is a stop-the-world collector: evaluation pauses completely while GC runs. For short-lived programs, this is perfectly fine. For long-running programs, it introduces unpredictable pauses.

A better heuristic: trigger GC when the pool is 80% full. This leaves headroom for allocations made while a collection is pending (the mark traversal itself does not allocate, but the margin is cheap insurance). Even better: trigger based on allocation rate — if the program is allocating quickly, collect more often.
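A sketch of that heuristic layered on top of the allocator above (GC_HIGH_WATER, value_slots_in_use, and alloc_value_with_headroom are illustrative names, and counting every slot on each allocation is deliberately naive):

#define GC_HIGH_WATER ((VALUE_POOL_SIZE * 4) / 5)    // 80% of the value pool

static int value_slots_in_use(void) {
    int n = 0;
    for (int i = 0; i < value_pool_top; i++) n += value_used[i];
    return n;
}

static Value *alloc_value_with_headroom(void) {
    if (value_slots_in_use() >= GC_HIGH_WATER) {
        gc_collect();                  // collect while there is still slack in the pool
    }
    return alloc_value();              // fall back to the Chapter 12 allocator
}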

The pool size we chose (8,192 values, 2,048 environments) is deliberately small to make GC visible during testing. A real TIL deployment would increase these to millions, and the pool would be extended by mmap when exhausted.

Stop-the-World and Its Costs

Our mark-and-sweep collector is stop-the-world: the program stops completely while GC runs, then resumes. This simplifies implementation enormously — there is no need to handle the case where the program allocates new objects while GC is in progress. The penalty is pause time: for a large heap with many objects, a full mark-sweep can take tens to hundreds of milliseconds.

Real GC algorithms address this in several ways:

Incremental GC. Run a small amount of GC work per allocation, spreading the cost out over time. Requires a write barrier — every pointer write must notify the GC so it can update its mark state.

Generational GC. Most objects die young (the “generational hypothesis”). Divide the heap into a nursery (young generation) and an old space (old generation). Collect the nursery frequently (minor GC) and old space rarely (major GC). Promoted objects move from nursery to old space when they survive a collection. Ragde’s FICS2 §7 describes copying GC (the algorithm behind most nurseries) in detail; CS 241E at Waterloo has students implement it.

Concurrent GC. Run the GC on a separate thread while the program continues. Requires careful synchronization and write barriers to handle objects allocated or modified during GC.

Reference counting. Each object maintains a count of how many pointers point to it; when the count drops to zero, the object is freed immediately. Simple to implement but cannot collect cycles (two objects pointing to each other have non-zero counts forever even if the program has no path to either). Python and Swift use reference counting with a separate cycle collector as a backup.
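A minimal reference-counting sketch, separate from TIL (RCObj, rc_retain, and rc_release are illustrative names), showing both the immediacy and the cycle problem:

#include <stdlib.h>

typedef struct RCObj {
    int           refcount;
    struct RCObj *other;      // one pointer field is enough to build a cycle
} RCObj;

static RCObj *rc_new(void)        { RCObj *o = calloc(1, sizeof *o); o->refcount = 1; return o; }
static RCObj *rc_retain(RCObj *o) { if (o) o->refcount++; return o; }

static void rc_release(RCObj *o) {
    if (o && --o->refcount == 0) {
        rc_release(o->other);   // freeing an object drops its reference to its child
        free(o);
    }
}

// The cycle problem: after a->other = rc_retain(b); b->other = rc_retain(a);
// releasing the program's own references to a and b leaves both counts at 1,
// so neither is ever freed even though nothing else can reach them.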

What the GC Cannot See

Our GC has a blind spot: the C stack. If a Value * lives in a C local variable (not in a pool-tracked Env frame), the GC cannot mark it. This is why we push Env * frames onto gc_root_stack rather than marking individual Value * locals — the Env frame aggregates all bindings for a scope, and marking it recursively marks all contained values.

A more complete GC would also track every C local variable that holds a Value *, either by requiring the programmer to register them (as CPython’s Py_INCREF/Py_DECREF do) or by walking the C call stack precisely (as the Java JVM does, using stack maps compiled into the bytecode). Our design sidesteps most of this complexity by keeping bindings inside Env frames, which are themselves tracked — but intermediate results held only in C locals (for example, the left operand in EXPR_OP2 while the right operand is still being evaluated) remain invisible to the collector, and a collection triggered at exactly that moment could reclaim them. A more careful implementation would push such temporaries onto the root stack as well.

This is a common simplification in pedagogical interpreter GC designs: SICP’s garbage collector for the register machine (§5.3) tracks machine registers as the root set, not the C stack. The same idea — a well-defined root set that covers all live pointers — underlies every real GC.

Connecting to the Course Sequence

The mark-and-sweep algorithm here is the simplest possible GC. It scales badly to large heaps (sweep time is proportional to total heap size, not live set size) and leaves holes (fragmentation). Its virtues are clarity and correctness.

The algorithms that follow from here in the field:

  • Copying (Cheney, 1970): Instead of sweeping, copy live objects to a fresh half-space, compacting away fragmentation. Allocation becomes a bump pointer into the fresh space. FICS2 §7 walks through a Racket implementation; CS 146 notes that CS 241 covers copying GC.
  • Tri-color incremental (Dijkstra et al., 1978): Uses a third color (gray = marked but children not yet visited) to interleave GC work with program execution.
  • G1, ZGC, Shenandoah (Java), Boehm GC (C/C++): Production collectors used in real systems, each addressing the pause-time and fragmentation problems with sophisticated engineering.

For the working systems programmer, the take-away from this chapter is not the specific algorithm — it is the structure: maintain a root set, traverse the object graph, reclaim unreachable objects, and do so without corrupting any live data. The three invariants of a correct collector are: (1) mark every reachable object, (2) free no reachable object, and (3) free every unreachable object eventually. Our implementation satisfies all three.

Wilson’s Taxonomy (1992)

Paul Wilson’s “Uniprocessor Garbage Collection Techniques” (1992) — a survey paper still cited in essentially every GC text — organizes the field into a taxonomy that Chapter 12 only hinted at. The key dimensions:

By liveness detection strategy:

  • Reference counting: each object maintains a count of pointers to it. When count reaches zero, the object is freed immediately. Fast for short-lived objects; fails for cycles.
  • Tracing (mark-and-sweep, copying): traverse the live object graph from roots, identify live objects, reclaim the rest. Our Chapter 12 implementation is tracing.

By memory management strategy (within tracing):

  • Mark-sweep: mark live objects in place, sweep and free dead objects. Heap stays in place; no compaction; subject to fragmentation.
  • Mark-compact: after marking, slide live objects together to one end of the heap (compacting), freeing the rest as a contiguous region. Eliminates fragmentation; allocation becomes a bump pointer; expensive to compact.
  • Copying: copy live objects to a new semi-space (“to-space”), leaving dead objects in the old space (“from-space”). After copying, roles are swapped. Allocation in to-space is a bump pointer; extremely fast. Wastes half the heap at any time.

By scope:

  • Whole-heap (stop-the-world): pause the program, collect everything, resume. Simple to implement; causes noticeable pauses.
  • Generational: divide the heap into generations (nursery = young, tenured = old). Collect the nursery frequently (most objects die young); collect tenured space rarely. Most objects in the nursery are dead by the time the next nursery collection runs — the generational hypothesis, empirically validated across many languages.
  • Incremental: interleave GC work with program execution. Uses a write barrier — every pointer write notifies the collector, so the collector can track which objects’ reference graphs have changed.
  • Concurrent: run the GC on a separate thread simultaneously with the program. Requires write barriers and careful synchronization; hard to implement correctly.

The Generational Hypothesis: Empirical Evidence

Appel’s 1989 paper “Simple Generational Garbage Collection and Fast Allocation” quantified the generational hypothesis across several Lisp programs: between 80% and 98% of allocated objects die before the next collection. This means a nursery collection (which only traces objects in the nursery) finds almost nothing live to promote — the collection cost is proportional to the survivors, not the total allocations. With a tiny survivor set, nursery collection is very fast.

The observation holds broadly: Java object lifetime studies show that a typical server application allocates millions of short-lived request-scoped objects (HTTP request objects, strings, intermediate results) that all die within a single request. Long-lived objects (connection pools, caches) are a small fraction of allocations but survive many collections and get promoted to old space.

The practical consequence: most modern GC-equipped runtimes (JVM, Go, .NET CLR, Ruby MRI) use generational collection as their baseline strategy. The nursery is collected frequently (every few milliseconds), old space much less frequently (every few seconds). This keeps the average pause time short while handling long-lived objects without excessive collection overhead.

Tri-Color Marking

Chapter 12’s mark phase painted objects either unmarked (not yet seen) or marked (live). Dijkstra, Lamport, Martin, Scholten, and Steffens (1978) introduced a third color for incremental and concurrent marking:

  • White: not yet seen. Will be freed if still white at sweep time.
  • Gray: marked as reachable, but its children have not yet been marked. In the work queue.
  • Black: marked as reachable, and all children have been processed. Done.

The algorithm processes the gray set: take a gray object, mark all its white children gray, then mark the object itself black. When no gray objects remain, all white objects are garbage.

The tri-color invariant: a black object never points directly to a white object. If this invariant holds, sweeping all white objects is safe — no black object will lose a reference through a now-freed white object.

For a concurrent collector, the invariant can be broken by the program: a running thread might create a new pointer from a black object to a white object while the collector is in progress. The write barrier captures this: every time a pointer field is written, the barrier checks whether the invariant would be violated and grays the affected object if so. Two common write barriers:

  • Dijkstra (incremental update): when a pointer b→w is written (black→white), gray w. Prevents the invariant from being violated at the point of the write.
  • Yuasa (snapshot-at-beginning): when any pointer is overwritten, gray the old target. Maintains a snapshot of the object graph at collection start; objects reachable from the snapshot are considered live.

The tri-color framework is the foundation of JVM’s G1 and ZGC collectors, .NET’s CLR GC, and Go’s tricolor concurrent mark-sweep. Chapter 12’s binary mark/unmark is a degenerate version of tri-color for stop-the-world collection (no concurrent mutation, so no write barrier needed).
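To make the Dijkstra barrier concrete against Chapter 12's structures, here is a sketch: in a real incremental collector the newly reachable value would be pushed onto a gray worklist, and the sketch approximates that by marking it directly. env_slot_write_barrier is an illustrative helper, not part of the TIL sources.

static void env_slot_write_barrier(Env *e, int i, Value *new_val) {
    // Incremental-update barrier: if an already-scanned ("black") frame gains a
    // pointer to an unmarked ("white") value, the value must not stay white,
    // or the sweep would free a reachable object.
    if (e->marked && new_val && !new_val->marked) {
        mark_value(new_val);      // a true incremental collector would gray it instead
    }
    e->bindings[i].val = new_val;
}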

Pause Time: The Fundamental Tradeoff

Every GC algorithm makes a tradeoff between throughput (total program work done per unit time) and pause time (worst-case length of a GC stop-the-world pause):

  • Stop-the-world mark-sweep (Chapter 12): maximum throughput (no write barrier overhead), unbounded pause time (proportional to heap size).
  • Incremental: lower throughput (write barrier on every pointer write, ~10–20% overhead for pointer-heavy code), bounded pause time (configurable, typically 1–10 ms).
  • Concurrent: similar throughput to incremental, shorter pauses (GC work done in parallel with the program).
  • Generational: high throughput for short-lived objects, short minor GC pauses (~1 ms), occasional long major GC pauses.
  • ZGC / Shenandoah (Java): sub-millisecond worst-case pauses on multi-terabyte heaps, at ~10–15% throughput overhead.

For TIL, stop-the-world is appropriate: we are building a teaching interpreter, not a server runtime. For a production system — a web server, a database, a game engine — pause time requirements vary dramatically, and choosing the wrong GC strategy causes real problems. (Java’s old stop-the-world collectors were a real pain point for low-latency applications, driving the development of G1, ZGC, and Shenandoah over the 2010s.)

Jones, Hosking, and Moss’s The Garbage Collection Handbook (2011, 2nd ed. 2023) catalogues nearly 3,400 GC-related publications. It is the standard reference for anyone implementing a production collector, covering parallel, incremental, concurrent, and real-time variants, along with chapters on the GC/OS interface (huge pages, NUMA effects) and energy-aware collection.


Interlude: The Compilation Pipeline

Between the C source file you write and the machine code that executes, four programs run in sequence: the preprocessor, the compiler, the assembler, and the linker. Most programmers invoke them all with a single command (gcc foo.c -o foo or clang foo.c -o foo), but each stage is distinct and produces a distinct artifact. Understanding the pipeline explains why header files exist, why extern declarations are necessary, and how malloc ends up in your program even though you never wrote it.

Stage 1: The Preprocessor

The preprocessor runs before the compiler sees the source. It handles all # directives:

  • #include "foo.h" is textual inclusion: the preprocessor reads foo.h and inserts its text verbatim at the #include point. Header files exist because of this: declarations need to be seen by the compiler before use, and the preprocessor propagates them.
  • #define MAX 256 defines a macro: every subsequent occurrence of MAX in the source is replaced by 256 before the compiler sees the file.
  • #if, #ifdef, #ifndef, #else, #endif perform conditional compilation, selecting code based on defined macros.
  • #pragma passes implementation-specific hints to the compiler.

You can run only the preprocessor with clang -E foo.c. The output is often surprising in its size: a #include <stdio.h> on a modern Linux system expands to thousands of lines of system headers. Everything the compiler eventually processes was inserted by the preprocessor.

The preprocessor operates purely textually — it knows nothing about C types, scopes, or expressions. This is why macros can be dangerous:

#define SQUARE(x) x * x
int y = SQUARE(1 + 2);   // expands to 1 + 2 * 1 + 2 = 5, not 9

The idiomatic safe form is #define SQUARE(x) ((x) * (x)). Better still: use an inline function, which is type-safe and obeys precedence rules.
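A sketch of the inline-function alternative (square_int is an illustrative name):

static inline int square_int(int x) { return x * x; }
// square_int(1 + 2) is 9: the argument is a real expression, evaluated exactly once,
// and precedence can no longer leak in from the expansion site.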

Stage 2: The Compiler

The compiler transforms the preprocessed C source into assembly language. It does this in several internal passes:

  1. Lexing and parsing. The same pipeline we built for TIL in Chapters 10–11, but vastly more complex (C’s grammar is famously ambiguous and context-sensitive: T * x; is either a multiplication expression or a pointer declaration depending on whether T is a type name or a variable).

  2. Semantic analysis. Type checking, name resolution, and conversion. This is where “implicit cast from int * to void *” warnings come from.

  3. Intermediate representation (IR) generation. The compiler translates the AST to an internal IR — LLVM uses LLVM IR, GCC uses GIMPLE/RTL. The IR is low-level enough to analyze but not yet committed to a specific ISA.

  4. Optimization passes. Dozens of optimizations run on the IR: constant folding, dead code elimination, loop unrolling, inlining, alias analysis, strength reduction, common subexpression elimination. The UB-exploiting optimizations from Chapter 3 all happen here, applied to the IR.

  5. Code generation. The IR is translated to assembly for the target ISA (ARM64, x86-64, etc.), with register allocation (assigning IR temporaries to machine registers) and instruction selection (choosing the best instruction sequence for each IR operation).

clang -O0 -S foo.c -o foo.s stops after Stage 2 and outputs assembly. -O1, -O2, -O3 enable progressively more optimization passes.

Stage 3: The Assembler

The assembler converts assembly source into an object file — a binary file containing:

  • Machine code for all defined functions (encoded according to the ISA’s binary format).
  • A symbol table listing all the names (function and global variable names) defined in this file, with their addresses in the object file.
  • Relocation records noting every place where an address needs to be filled in (because, at this point, we do not know the final address of anything — each object file assumes it starts at address 0).

clang -c foo.c -o foo.o produces an object file. objdump -d foo.o (Linux) or otool -tv foo.o (macOS) disassembles it.

Stage 4: The Linker

The linker combines multiple object files (and libraries, which are archives of object files) into a single executable. It:

  1. Resolves external references: when main.o calls printf, and printf is defined in libc.a, the linker matches the call site in main.o to the definition in libc.a.

  2. Assigns final addresses: the linker decides where each segment (text, data, bss) will go in the process address space and fills in all the relocation records with actual addresses.

  3. Produces the final binary: an ELF file on Linux (Executable and Linkable Format) or a Mach-O file on macOS.

Two kinds of linking:

Static linking. Library code is copied into the executable. The binary contains everything it needs; it runs without any additional files. Executables are large.

Dynamic linking. Library code is left in separate shared library files (.so on Linux, .dylib on macOS, .dll on Windows). The executable records which shared libraries it needs and which symbols it uses from them. At runtime, the dynamic linker (also called the loader) maps the shared library into the process address space and patches the program’s calls to point to the library’s functions. Executables are small; all processes sharing the same library share its pages in physical memory.

The full story — symbol resolution, GOT (Global Offset Table), PLT (Procedure Linkage Table), RPATH, and lazy binding — is the core subject of CS 241. The bridge here: every malloc call in a C program that does not statically link libc uses the PLT to redirect to libc’s malloc at runtime, and the dynamic linker performs the address-patching when your program starts.

Stage 5: The Loader

Technically part of the OS rather than the toolchain, the loader is what the kernel invokes when you run a program. It:

  1. Reads the ELF/Mach-O headers to find the segments (text, data, bss) and their sizes and permissions.
  2. Maps the segments into the process address space using mmap.
  3. Initializes BSS (zeros it out).
  4. Invokes the dynamic linker to resolve shared library dependencies.
  5. Jumps to the program’s entry point (usually _start, a runtime wrapper that calls your main).

The loader’s work is largely invisible, but it is where all the address-space layout from Chapter 7 comes from. The guard page below the stack, the randomized base addresses of each segment (ASLR), and the initial rsp (x86-64) or sp (ARM64) pointing to the top of the stack: all set up by the loader before main ever runs.

Seeing the Pipeline Directly

On macOS with clang:

clang -v -Wall -O1 factorial.c -o factorial

The -v flag prints each subprocess invocation: you will see the preprocessor command, the compiler (cc1), the assembler (as), and the linker (ld) run sequentially. On Linux with gcc, the same flag works identically.

To see intermediate artifacts:

clang -save-temps -O1 factorial.c -o factorial

This keeps factorial.i (preprocessed), factorial.s (assembly), and factorial.o (object file) alongside the final binary. Comparing factorial.i and the original factorial.c makes the scale of header expansion visceral.


Appendix A: TIL in Practice — A Worked Session

To make the interpreter concrete, let us trace through a complete TIL program by hand, showing what the lexer produces, what the AST looks like, what the environment chain looks like at each step, and when GC fires.

The Program

(define (double x) (* 2 x))
(define (apply-twice f x) (f (f x)))
(define (main) (print (apply-twice double 3)))

Expected output: 12.

Lexer Output

The lexer processes the source left to right. For (define (double x) (* 2 x)):

TOK_LPAREN
TOK_IDENT "define"
TOK_LPAREN
TOK_IDENT "double"
TOK_IDENT "x"
TOK_RPAREN
TOK_LPAREN
TOK_IDENT "*"
TOK_INT 2
TOK_IDENT "x"
TOK_RPAREN
TOK_RPAREN

The lexer produces a flat stream; all the tree structure is in the parentheses. The parser reconstructs the tree.

AST Structure

After parsing, the three define forms produce:

defs[0]: name="double"
  body: EXPR_LAMBDA
    params: [EXPR_NAME "x"]
    body: EXPR_OP2 "*"
      left:  EXPR_INT 2
      right: EXPR_NAME "x"

defs[1]: name="apply-twice"
  body: EXPR_LAMBDA
    params: [EXPR_NAME "f", EXPR_NAME "x"]
    body: EXPR_CALL
      func: EXPR_NAME "f"
      args: [EXPR_CALL
               func: EXPR_NAME "f"
               args: [EXPR_NAME "x"]]

defs[2]: name="main"
  body: EXPR_LAMBDA
    params: []
    body: EXPR_PRINT
      expr: EXPR_CALL
        func: EXPR_NAME "apply-twice"
        args: [EXPR_NAME "double", EXPR_INT 3]

The nested (f (f x)) becomes a nested EXPR_CALL — calling f with argument (f x), where (f x) is itself a call.

Evaluation Trace

Global environment setup. All three define bodies are EXPR_LAMBDA nodes. Evaluating an EXPR_LAMBDA creates a closure that captures the current environment — here a pointer to the global env, which is still being populated. After processing all three define forms:

global_env: {
  "double"      → closure(params=[x], body=*2x, env=global)
  "apply-twice" → closure(params=[f,x], body=(f(fx)), env=global)
  "main"        → closure(params=[], body=(print ...), env=global)
}

Each closure’s .env points back to global_env. This is what lets top-level definitions refer to one another: inside a call, names bound by the parameters (f and x in apply-twice) are found in the call frame, and everything else — double, apply-twice, main — is found by walking the chain out to global_env at call time.

Calling main. eval(main_body, call_env) where call_env is a new frame with no bindings (main has no parameters) and parent = global_env.

The body is EXPR_PRINT. We evaluate the inner EXPR_CALL apply-twice double 3:

  • Look up apply-twice → finds the closure above.
  • Evaluate arguments: EXPR_NAME "double" → closure for double; EXPR_INT 3 → integer 3.
  • Create call frame: {f → closure(double), x → int:3, parent → global_env}.

Inside apply-twice body: (f (f x)).

  • Outer call: f = closure(double), argument = inner (f x).
  • Inner (f x): f = closure(double), x = int:3. Create frame {x → 3, parent → global_env}. Evaluate (* 2 x) = 2 * 3 = 6. Return int:6.
  • Outer call: f = closure(double), argument = int:6. Create frame {x → 6, parent → global_env}. Evaluate (* 2 x) = 2 * 6 = 12. Return int:12.

Back in main: print int:12. Output: 12.

GC Trigger

In this small program, the pool never fills. But consider calling (apply-twice double 3) one million times in a loop:

(define (loop n)
  (if (= n 0)
      0
      (begin
        (print (apply-twice double n))
        (loop (- n 1)))))
(define (main) (loop 1000000))

Each call to apply-twice creates two call frames, two argument values (int:n and the double closure), and the intermediate result. That is roughly 5 Value objects and 2 Env objects per iteration = 5,000,000 values for 1,000,000 iterations — far more than our pool of 8,192 values. GC fires approximately every 1,638 iterations (8,192 / 5), and each cycle must identify the live objects (roughly: the current call stack of loop calls, which are O(n) deep for naive recursion) and sweep the dead ones.

This exposes the key practical concern with naive mark-sweep: if loop is implemented recursively as above and n is 1,000,000, the C stack overflows before GC is relevant. The proper fix is to implement loop iteratively using while in TIL, or to implement tail-call optimization in the interpreter (outside our scope here).


Appendix B: Looking Forward

CS 136E ends where the interesting problems begin. Every chapter has pointed toward a future course or a deeper topic; this appendix collects those pointers into a map.

From Part I

Hoare logic and formal verification. Chapter 4 introduced Hoare triples at a sketch level. The full development — weakest preconditions, strongest postconditions, handling of loops with explicit termination metrics, and proof of programs with pointers — is the subject of CS 245 (Logic and Computation) and upper-year courses in program verification. Tools like Dafny, Frama-C, and CBMC automate the proof-checking for C programs; they require the programmer to supply loop invariants but verify them mechanically.

Concurrency and undefined behavior. Chapter 3 noted that our Hoare logic breaks under concurrent access. The C11 memory model (_Atomic types and memory_order annotations) defines a formal semantics for concurrent C programs. Locking, atomics, and memory fences — the tools that make concurrent programs correct — are covered in CS 343 (Concurrent and Parallel Programming) and, from an OS perspective, in CS 350.

Type systems as UB prevention. Many of Chapter 3’s UB categories — type punning, uninitialized reads, use-after-free — are prevented entirely by Rust’s type system, which statically enforces ownership, borrowing, and initialization. Rust is essentially C with a static analyzer built into the type checker. CS 343 touches on concurrency-safe languages; a dedicated Rust course or independent study fills the gap.

From Part II

The full ISA and microarchitecture. Chapters 5 and 6 showed the programmer-visible ISA. Below the ISA lies the microarchitecture: pipelines, branch predictors, out-of-order execution, register renaming, and caches. These affect performance profoundly but are mostly invisible to correct programs. CS 350 touches on this from an OS perspective; Patterson & Hennessy (both the ARM and RISC-V editions) cover it in full. MIT OCW 6.004 (Computation Structures) offers a freely available treatment building from logic gates to a pipelined processor.

Linking, loading, and virtual memory. Chapter 7’s one-paragraph bridges pointed here explicitly. CS 241 (Foundations of Sequential Programs) has students build a linker and simple assembler for MIPS (and, in the CS 241E enrichment, a full compiler and GC). CS 350 (Operating Systems) covers virtual memory in full: page tables, TLBs, demand paging, memory-mapped files, and the swap system. Drepper’s “What Every Programmer Should Know About Memory” (freely available) covers cache and TLB effects on real programs with benchmarks.

Compiler construction. The lexer and parser in Chapter 10 cover the front end. The middle and back end — IR design, dataflow analysis, register allocation, instruction scheduling — are the subject of CS 444 (Compiler Construction) at Waterloo. Cooper & Torczon’s Engineering a Compiler is the standard textbook. LLVM’s source (github.com/llvm/llvm-project) is the most-studied production compiler; its tutorials are public and well-written.

From Part III

Continuation-passing style, trampolining, and tail calls. TIL as implemented in Chapter 11 overflows the C stack on deep recursion. The fix is trampolining: instead of calling eval recursively, return a thunk (a zero-argument closure representing “the next thing to do”), and have a trampoline loop in main that invokes thunks one at a time, never deepening the C call stack. This transformation is called continuation-passing style (CPS). Ragde’s FICS2 §8 covers it in detail.

Copying and generational GC. Chapter 12 implemented mark-and-sweep, the simplest correct collector. The next step is Cheney’s copying algorithm (1970): divide the heap into two halves; during GC, copy live objects from “from-space” to “to-space”, compacting as you go; then swap the roles of the halves. Allocation becomes a bump pointer into to-space, making it essentially free. Generational GC (Appel 1989, Wilson et al. 1992) observes that most objects die young and collects the nursery frequently while rarely collecting the old generation. CS 241E implements a generational copying collector for a full language. FICS2 §7 describes copying GC for a Racket-based interpreter.

Language design trade-offs. TIL made many choices quietly. S-expression syntax simplifies parsing but feels unfamiliar. Dynamic typing avoids a type-checker but shifts errors to runtime. Lexical scope is the right default but makes some patterns (like call/cc) require continuation objects. Each choice reflects a deliberate design position. Pierce’s Types and Programming Languages (TAPL) is the canonical reference for the theory behind type system design; Mitchell’s Concepts in Programming Languages covers language design trade-offs from a comparative perspective.

Garbage-collected language implementation. A complete garbage-collected language runtime — handling threads, exceptions, foreign function interfaces, and debuggability — is an order of magnitude more complex than TIL. The most accessible full-scale reference remains Nystrom’s Crafting Interpreters, which builds two complete language implementations (a tree-walking interpreter in Java and a bytecode VM in C) from scratch, including a mark-and-sweep GC for the C VM. Reading it after CS 136E is an excellent bridge to a serious understanding of language implementation.

The Deeper Question

Every piece of software you use — the browser, the OS, the compiler itself — is ultimately a C program (or a language whose runtime is a C program), running on either ARM64 or x86-64, managed either by malloc/free or by a GC, translated to machine code by a pipeline of the kind described in the Interlude. The gap between “I can write a C program” and “I understand what happens when it runs” is exactly the gap CS 136E has tried to close: from the abstract machine to the real one, from assignment to aliasing, from recursion to stack frames, from malloc to the GC that replaces it.

The tools are C, ARM64, x86-64, and TIL. The habits of mind are the permanent part: state as extra parameter, the optimizer as theorem prover, the call frame as a first-class object, the object graph as the runtime’s shared memory. These apply everywhere — in Python’s CPython runtime, in Go’s GC, in Rust’s ownership system, in the JVM. Different implementations, same underlying ideas. The difference between a programmer who has seen the machine and one who has not is the difference between reasoning from principles and cargo-culting patterns. CS 136E is an attempt to give you the former.


Appendix C: C Strings — The Notorious Special Case

Strings in C are perhaps the most discussed, most dangerous, and most practically important topic not covered in CS 136 itself. Since the TIL lexer in Chapter 10 processes character-by-character input using C strings, and since strings appear in virtually every real C program, a thorough treatment belongs here.

What a C String Is

A C string is a null-terminated array of char. The null terminator is the character '\0' (ASCII value 0), which marks the end of the string. In C, the string literal "hello" has type char[6] — five visible characters plus the null terminator — and it decays to a pointer to its first character in most contexts. Modifying a string literal is undefined behavior, so it behaves as read-only even though its type is not const-qualified (in C++ the type really is const char[6]).

char *s = "hello";   // pointer to read-only literal
char  t[] = "hello"; // mutable copy on the stack: t[0]='h',...,t[5]='\0'

These are fundamentally different:

  • s points to read-only memory (the string literal lives in the .rodata segment). Writing s[0] = 'H' is undefined behavior.
  • t is a mutable stack array initialized with the literal’s bytes. t[0] = 'H' is fine.

The most common string bugs flow from confusing these, or from not accounting for the null terminator in size calculations.

The strlen / sizeof Confusion

sizeof("hello") is 6 (includes the null terminator). strlen("hello") is 5 (excludes it). This off-by-one confusion causes buffer overflows:

char buf[5];
strcpy(buf, "hello");   // writes 6 bytes into a 5-byte buffer: OVERFLOW

strcpy copies characters including the null terminator. The correct size is strlen(s) + 1. The safe alternative is strncpy(buf, s, sizeof(buf) - 1); buf[sizeof(buf) - 1] = '\0'; — but strncpy does not null-terminate if the source is longer than the limit, which is why the explicit buf[...] = '\0' is necessary.

Better: use snprintf(buf, sizeof(buf), "%s", s). snprintf always null-terminates (it writes at most sizeof(buf) - 1 characters plus the null terminator) and returns the number of characters that would have been written if the buffer were large enough, allowing the caller to detect truncation.
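A short sketch of the truncation check (copy_checked is an illustrative name):

#include <stdio.h>

// Copies src into a fixed-size buffer; returns 1 if it fit, 0 if truncated or on error.
int copy_checked(char *dst, size_t dstsize, const char *src) {
    int needed = snprintf(dst, dstsize, "%s", src);
    return needed >= 0 && (size_t) needed < dstsize;
}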

Better still: avoid fixed-size buffers for user input. Read with getline (POSIX) or fgets with a dynamically grown buffer.

Dynamic Strings

To create a string on the heap:

char *strdup(const char *s) {
    size_t len = strlen(s) + 1;   // +1 for null terminator
    char *copy = malloc(len);
    if (copy) memcpy(copy, s, len);
    return copy;
}

This is so common that POSIX provides it as strdup(3). (It is not in the C standard before C23, but every major platform provides it.) The caller is responsible for freeing the returned pointer.

A common pattern in interpreter lexers is to intern strings: maintain a hash table mapping string contents to a canonical char *. All uses of the same identifier then share a single allocation, and identifier equality becomes a pointer comparison (==) instead of strcmp. TIL’s lexer does not intern: its Token.ident field is a fixed 64-byte buffer, which is acceptable overhead for a teaching lexer; a production lexer would intern.
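A minimal interning table, sketched with a linear scan in place of a hash table (the size and names are illustrative, and it reuses the strdup above):

#define MAX_INTERNED 1024

static char *interned[MAX_INTERNED];
static int   n_interned = 0;

const char *intern(const char *s) {
    for (int i = 0; i < n_interned; i++) {
        if (strcmp(interned[i], s) == 0) return interned[i];   // already present: share it
    }
    assert(n_interned < MAX_INTERNED);
    interned[n_interned] = strdup(s);                           // first occurrence: copy and keep
    return interned[n_interned++];
}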

Pointer Arithmetic on Strings

Because a string is an array, you can advance through it with pointer arithmetic:

// count occurrences of character c in string s
int count_char(const char *s, char c) {
    int n = 0;
    while (*s) {                    // loop until null terminator
        if (*s == c) n++;
        s++;                        // advance pointer by 1 byte
    }
    return n;
}

*s reads the character at the current position; s++ advances to the next. The condition *s is truthy for any non-zero character and falsy for '\0'. The idiom is standard C and reads naturally once you internalize that a char * is both a pointer and the start of an array.

The dangerous counterpart: knowing when not to advance past the end. Once s points past the null terminator, every read is undefined behavior. All standard string functions are written with this discipline, but hand-written loops often are not.

const char * vs char *const vs char const * const

The const qualifier in C binds to whatever is immediately to its left (or, when const is the first token, to what is on its right). That yields three distinct const placements, shown here alongside the unqualified pointer for comparison:

char *p;             // pointer to mutable char: both *p and p are mutable
const char *p;       // pointer to const char: *p is immutable, p is mutable
char *const p;       // const pointer to mutable char: *p is mutable, p is immutable
const char *const p; // const pointer to const char: both are immutable

Function parameters that receive a string for reading should use const char *. This communicates intent, prevents accidental writes, and lets the same function accept both mutable buffers and string literals. Returning const char * from a function signals that the caller should not free the result (it likely points into a static buffer or a literal). Returning char * signals that the caller owns the memory (and must free it).

Buffer Overflows as Security Vulnerabilities

The classic gets(buf) reads an arbitrarily long line into buf with no bounds check. It is so dangerous that it was removed from the C11 standard entirely. But any fixed-size buffer with unchecked input can overflow:

void process(const char *input) {
    char local[64];
    strcpy(local, input);    // fine if strlen(input) < 64; UB otherwise
    // ...
}

On x86-64, local lives on the stack a few bytes below the saved return address; strcpy writes upward in memory, so a sufficiently long input overwrites that return address. When process returns, it branches to the attacker-controlled address. This is a stack buffer overflow, and it is one of the most exploited vulnerability classes in history. Mitigations include:

  • Stack canaries (-fstack-protector): the compiler places a random value (“canary”) between the local buffers and the saved return address; on function exit, it checks that the canary is intact.
  • Non-executable stack (NX bit / W^X): the OS marks the stack non-executable, so code injected into a buffer cannot run.
  • ASLR (Chapter 7): randomizes where the stack and library functions land, making it harder to predict the address to jump to.

None of these is a substitute for writing correct code. The correct fix is always to bounds-check every write.


Appendix D: Variadic Functions and va_list

C’s printf family takes a variable number of arguments. This is not magic; it relies on a well-defined (though architecture-specific) mechanism for accessing variadic arguments at runtime. Understanding it illuminates both how calling conventions work and why printf with a format-string mismatch is undefined behavior.

The va_list Mechanism

A variadic function declares its last named parameter followed by ...:

int my_printf(const char *fmt, ...);

At the call site, the compiler passes arguments according to the calling convention — extra arguments beyond the named ones go in additional registers (up to the argument register limit) and then on the stack. The va_list type provides access to these extra arguments:

#include <stdarg.h>

int sum_ints(int count, ...) {
    va_list ap;
    va_start(ap, count);    // initialize ap after 'count'
    int total = 0;
    for (int i = 0; i < count; i++) {
        total += va_arg(ap, int);   // fetch next argument as 'int'
    }
    va_end(ap);
    return total;
}

va_start(ap, last_named) initializes the va_list to point past the last named argument. va_arg(ap, type) fetches the next argument and advances ap. va_end(ap) cleans up (on some ABIs it is a no-op; on others it releases resources).

The critical constraint: va_arg must be called with the correct type. The function has no way to know the type of the next argument other than out-of-band information (like printf’s format string). If va_arg(ap, int) is called when the caller passed a double, the behavior is undefined — on most ABIs, a double occupies two 32-bit argument slots (or one 64-bit FP register), and treating it as an int reads the wrong bytes.

This is why printf format string mismatches are undefined behavior, not just “might print wrong values.” A %d format with a double argument calls va_arg(ap, int) when the calling convention placed the double in an FP register — and va_arg(ap, int) reads from an integer register, getting garbage.

va_list on ARM64 vs x86-64

The implementation of va_list differs between architectures:

On x86-64 (System V ABI), va_list is a structure containing two offsets (one for integer registers, one for FP registers) and two pointers: one to a register save area and one to the stack area holding arguments that spilled. va_arg advances the appropriate offset and falls back to the stack area when the registers are exhausted. The compiler emits code in the variadic function’s prologue to dump the argument registers into the save area, so every variadic argument is reachable through memory even if it arrived in a register.

On ARM64 the details depend on the platform ABI. The standard AAPCS64 va_list is itself a small struct (a stack pointer, pointers to the integer and FP register save areas, and two offsets), with the prologue of a variadic function saving x0–x7 and v0–v7 into the save area. Apple’s ARM64 ABI simplifies matters: all variadic arguments are passed on the stack, so va_list is a single pointer that advances through them, making va_arg essentially a pointer bump.

The difference matters when writing code that manipulates va_list directly (for functions like vprintf that forward a va_list to another variadic function). Both architectures define va_copy(dest, src) to copy a va_list safely.
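A sketch of why va_copy matters (vformat is an illustrative helper): the argument list is consumed once to measure and once to format, so a second, copied list is required.

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

// Formats into a freshly malloc'd string; the caller frees the result.
char *vformat(const char *fmt, va_list ap) {
    va_list ap2;
    va_copy(ap2, ap);                            // ap is consumed by the measuring pass
    int len = vsnprintf(NULL, 0, fmt, ap);       // C99: returns the length the output would need
    if (len < 0) { va_end(ap2); return NULL; }
    char *buf = malloc((size_t) len + 1);
    if (buf) vsnprintf(buf, (size_t) len + 1, fmt, ap2);
    va_end(ap2);
    return buf;
}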

A Minimal printf Implementation

Understanding va_list is best anchored by writing a printf subset:

void mini_printf(const char *fmt, ...) {
    va_list ap;
    va_start(ap, fmt);

    for (const char *p = fmt; *p; p++) {
        if (*p != '%') {
            putchar(*p);
            continue;
        }
        p++;   // skip '%'
        if (*p == '\0') break;   // a lone '%' at the end of the format: stop before reading past it
        switch (*p) {
        case 'd': {
            int val = va_arg(ap, int);
            // print decimal: delegate the digit formatting to snprintf for brevity
            char buf[32];
            int len = snprintf(buf, sizeof(buf), "%d", val);
            for (int i = 0; i < len; i++) putchar(buf[i]);
            break;
        }
        case 's': {
            const char *s = va_arg(ap, const char *);
            while (*s) putchar(*s++);
            break;
        }
        case 'c':
            putchar(va_arg(ap, int));   // char promotes to int in varargs
            break;
        case '%':
            putchar('%');
            break;
        }
    }
    va_end(ap);
}

The char-to-int promotion in va_arg(ap, int) for the %c case is a real requirement: the C standard says that char, short, and float are promoted in variadic argument lists — char and short become int, float becomes double. Calling va_arg(ap, char) when the caller passed a char is undefined behavior on most ABIs (the argument was promoted to int at the call site; reading it as char reads only the low byte).

This is another source of latent bugs: writing a variadic function and forgetting that certain types are promoted.


Appendix E: SIMD — When One Instruction Does the Work of Many

Both ARM64 and x86-64 have vector instruction sets that operate on multiple data values in a single instruction. On ARM64, this is NEON (the Advanced SIMD extension, now called the ASIMD base). On x86-64, it is the SSE/AVX family. A brief look at both is useful: much of what makes modern CPUs fast at array processing, image manipulation, and cryptography is SIMD.

The Idea

A 128-bit vector register can hold:

  • 16 bytes (int8x16 — sixteen 8-bit integers)
  • 8 half-words (int16x8 — eight 16-bit integers)
  • 4 words (int32x4 — four 32-bit integers)
  • 2 double-words (int64x2 — two 64-bit integers)
  • 4 single-precision floats (float32x4)
  • 2 double-precision floats (float64x2)

A single SIMD instruction then adds, multiplies, shifts, or compares all lanes in parallel. add v0.4s, v1.4s, v2.4s on ARM64 NEON adds four pairs of 32-bit integers in one instruction, the same work that four scalar add instructions would do.

ARM64 NEON

ARM64’s vector registers are v0–v31, each 128 bits. They can also be accessed as 64-bit d registers (lower half) or 32-bit s registers (lower quarter). The mnemonics for NEON instructions use suffixes to indicate the element type:

// ARM64 NEON: add four pairs of int32
ld1  {v0.4s}, [x0]        // load 4 × 32-bit ints from memory pointed to by x0
ld1  {v1.4s}, [x1]        // load 4 × 32-bit ints from memory pointed to by x1
add  v2.4s, v0.4s, v1.4s  // add lane-by-lane: v2[i] = v0[i] + v1[i]
st1  {v2.4s}, [x2]        // store result to memory at x2

Four 32-bit additions in four instructions (two loads, one add, one store), versus the roughly sixteen a scalar loop body would need (eight loads, four adds, four stores).

x86-64 SSE/AVX

x86-64’s original SSE registers are xmm0–xmm15, each 128 bits. AVX extends them to 256 bits (ymm0–ymm15), and AVX-512 to 512 bits (zmm0–zmm31).

; x86-64 SSE: add four pairs of int32
movdqu  xmm0, [rdi]        ; load 4 × 32-bit ints from memory
movdqu  xmm1, [rsi]        ; load another 4 × 32-bit ints
paddd   xmm0, xmm1         ; xmm0[i] += xmm1[i] (packed add dword)
movdqu  [rdx], xmm0        ; store result

paddd is “packed add doubleword” — four simultaneous 32-bit additions. The SSE naming convention is admittedly opaque: p for packed, add for addition, d for doubleword (32-bit).

Auto-Vectorization

You rarely write NEON or SSE directly. Modern compilers auto-vectorize straightforward loops:

void add_arrays(int *a, const int *b, const int *c, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

With -O2 or -O3, clang will recognize this as a vectorizable loop and emit NEON (on ARM64) or SSE2 (on x86-64) code for the bulk, with a scalar fallback for the tail if n is not a multiple of the vector width. The conditions for auto-vectorization: the loop must have no loop-carried dependencies, no possibility that the arrays alias (since all three pointers here have the same type, strict aliasing does not help; the compiler either emits a runtime overlap check or relies on a restrict/__restrict__ annotation), and a countable trip count.

SIMD is why memcpy is fast, why strlen on modern CPUs can scan 16 or 32 bytes per cycle, and why numerical libraries (BLAS, LAPACK) achieve near-peak-hardware arithmetic throughput. The principle from Chapter 2 applies: aliasing analysis is what enables the compiler to vectorize. Telling the compiler that two pointers do not alias — either via restrict or by trusting the strict aliasing rule — is what unlocks this.

ARM64 vs x86-64: SIMD Comparison

Feature            | ARM64 NEON                | x86-64 SSE2/AVX2
Register width     | 128-bit (or 64-bit half)  | 128-bit (SSE2), 256-bit (AVX2), 512-bit (AVX-512)
Register count     | 32 (v0–v31)               | 16 (SSE), 16 (AVX), 32 (AVX-512)
Availability       | Mandatory on AArch64      | SSE2 mandatory on x86-64; AVX2 on Intel Haswell+ (2013)
Integer ADD width  | 8/16/32/64-bit elements   | Same via PADDB/PADDW/PADDD/PADDQ
Encoding overhead  | Fixed 32-bit              | VEX prefix required for AVX (2–3 extra bytes per instruction)

For most numerical workloads, AVX2 (256-bit) on x86-64 processes twice as many elements per instruction as NEON (128-bit) on ARM64. But ARM64’s larger register file (32 vs 16) reduces register spills for complex kernels. Apple’s M-series chips partially compensate with AMX (Apple Matrix extensions), a proprietary matrix multiply coprocessor not visible to general assembly code.

The practical take-away: SIMD is not something to write by hand for most programs. It is something to be aware of — to understand why clang -O2 produces different output than -O0, why profile-guided optimization can change which loops get vectorized, and why aligning arrays to 16 or 32 bytes can make a loop 2× faster by avoiding vector loads that straddle cache-line boundaries.


Appendix F: restrict and Alias-Free Programming

Chapter 2 introduced aliasing as a fundamental complication of imperative reasoning. Chapter 3 and Appendix E showed that alias analysis is what enables both compiler optimization and auto-vectorization. C99 provides a language-level tool to assert non-aliasing: the restrict qualifier.

The restrict Qualifier

restrict on a pointer parameter is a promise to the compiler: for the duration of this function’s execution, the object pointed to by this pointer is accessed only through this pointer (or values derived from it). No other pointer aliases it.

void vector_add(int * restrict a,
                const int * restrict b,
                const int * restrict c,
                int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

Without restrict, the compiler must assume that a, b, and c might alias. Writing a[i] could affect the value read in the next iteration via b[j] or c[j]. With restrict, the compiler knows they are independent and can freely vectorize, reorder, or eliminate loads.

The standard library functions memcpy and strcpy have restrict on their source and destination:

void *memcpy(void * restrict dest, const void * restrict src, size_t n);

This is why memcpy(p, p, n) (copying a buffer to itself) is undefined behavior: the two pointers alias the same memory, violating restrict. memmove is the correct function for overlapping copies — it does not have restrict.
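
A case where the overlap is intentional, so memmove is the right call: deleting an element from the middle of an array shifts the tail left over itself (a minimal sketch; the function remove_at is ours):

#include <string.h>

// remove arr[i] from an n-element array by shifting the tail left by one slot
void remove_at(int *arr, size_t n, size_t i) {
    // source and destination overlap, so memcpy would be UB; memmove is defined
    memmove(&arr[i], &arr[i + 1], (n - i - 1) * sizeof arr[0]);
}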

When restrict Is Correct

restrict is a promise, not a check. The compiler trusts you. Violating restrict invokes undefined behavior:

int arr[4] = {1, 2, 3, 4};
// undefined: both parameters alias arr
vector_add(arr, arr, arr, 4);

The rule of thumb: add restrict when you can guarantee that the pointed-to objects do not overlap for the lifetime of the function. Callers that pass overlapping buffers to a restrict-qualified function have UB.

restrict also applies to non-parameter pointer variables — you can write int * restrict local_ptr = ... inside a function body — but this is rare and requires careful reasoning about the entire function’s aliasing.
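
A sketch of what that looks like (the function scale is ours): the restrict-qualified locals promise that, within the block, each pointed-to array is reached only through its own pointer.

void scale(float *data, const float *gains, int n) {
    float * restrict d       = data;    // within this block, *d is reached only through d
    const float * restrict g = gains;   // and *g only through g
    for (int i = 0; i < n; i++) {
        d[i] *= g[i];                   // the compiler may now vectorize without overlap checks
    }
}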

restrict and the Wider Design

The existence of restrict is itself a commentary on C’s design: because the language permits arbitrary aliasing by default, alias-free programming requires an opt-in annotation. Fortran, by contrast, prohibits aliasing between formal parameters by default, which is historically why Fortran numerical code outperformed C in benchmarks for decades — Fortran compilers could auto-vectorize freely, while C compilers had to be conservative.

Rust’s ownership system provides restrict-equivalent guarantees statically: a mutable reference (&mut T) can only exist alone (exclusive access), making every &mut T parameter implicitly restrict. This is one reason Rust can achieve Fortran-class numerical performance without annotations, while C requires explicit restrict to reach the same ceiling.

For further reading: the GCC documentation on __restrict__ and LLVM’s alias analysis documentation explain how compilers use aliasing information in practice. Drepper’s “What Every Programmer Should Know About Memory” §6 discusses the performance impact of aliasing on cache behavior.


Appendix G: Function Pointers — First-Class Functions in C

TIL has first-class functions as values. C has them too — but without the syntactic elegance. A function pointer in C is a pointer to a function, with a type that encodes the function’s signature. Understanding function pointers is essential for writing generic algorithms, callbacks, and dispatch tables — the last of which is exactly what eval’s switch statement compiles down to.

Basic Syntax

The type of a pointer to a function taking int and returning int is int (*)(int). The name goes where the * is:

int square(int n) { return n * n; }

int (*fp)(int) = square;   // fp points to square
int result = fp(5);        // call through the pointer: result = 25

The &square address-of is optional: a function name in a value context decays to a pointer to itself, like an array name decays to a pointer to its first element.

The type notation is famously inside-out. Use a typedef to make it readable:

typedef int (*int_to_int)(int);
int_to_int fp = square;

Function Pointers as Parameters

The classic use is passing a comparison function to a sort:

void qsort(void *base, size_t n, size_t size,
           int (*cmp)(const void *, const void *));

The cmp parameter is a pointer to a function taking two const void * (generic pointers) and returning a negative, zero, or positive int for less-than, equal, and greater-than respectively. The caller supplies the comparison:

int compare_ints(const void *a, const void *b) {
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);   // branchless: -1, 0, or 1
}

int arr[] = {3, 1, 4, 1, 5, 9};
qsort(arr, 6, sizeof(int), compare_ints);

The cast (const int *)a converts the const void * to a const int * so we can dereference it. This is safe because we know (from context) that qsort was called with an int array. The compiler cannot verify this — it trusts the programmer. Getting the type wrong here is undefined behavior.
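
The indirection doubles when the elements are themselves pointers. Sorting an array of strings, for example, hands cmp pointers to char * elements, so the cast must go through one extra level (a hedged sketch):

#include <stdlib.h>
#include <string.h>

int compare_strings(const void *a, const void *b) {
    // a and b point to the array elements, and each element is a char *
    const char *x = *(const char * const *)a;
    const char *y = *(const char * const *)b;
    return strcmp(x, y);
}

const char *words[] = {"pointer", "array", "struct"};
qsort(words, 3, sizeof words[0], compare_strings);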

Dispatch Tables

A dispatch table (or virtual function table, vtable) is an array of function pointers indexed by some discriminant. Instead of a switch with many cases, you use the discriminant as an array index:

typedef Value *(*eval_fn)(Expr *, Env *);

static Value *eval_int_node(Expr *e, Env *env)   { /* ... */ }
static Value *eval_name_node(Expr *e, Env *env)  { /* ... */ }
static Value *eval_if_node(Expr *e, Env *env)    { /* ... */ }
// ... one function per node type ...

static eval_fn dispatch[] = {
    [EXPR_INT]    = eval_int_node,
    [EXPR_BOOL]   = eval_bool_node,
    [EXPR_NAME]   = eval_name_node,
    [EXPR_IF]     = eval_if_node,
    [EXPR_WHILE]  = eval_while_node,
    [EXPR_BEGIN]  = eval_begin_node,
    [EXPR_SET]    = eval_set_node,
    [EXPR_LET]    = eval_let_node,
    [EXPR_LAMBDA] = eval_lambda_node,
    [EXPR_CALL]   = eval_call_node,
    [EXPR_PRINT]  = eval_print_node,
    [EXPR_OP2]    = eval_op2_node,
};

Value *eval(Expr *e, Env *env) {
    return dispatch[e->kind](e, env);
}

This replaces the monolithic switch in Chapter 11 with a table lookup and an indirect call. The generated assembly is different: instead of a jump table (which the compiler generates for a switch on contiguous integer cases anyway), you get an explicit array of function pointers and a blr (ARM64) or call [dispatch + rax*8] (x86-64) indirect branch.

The performance difference is usually negligible for an interpreter, but dispatch tables are the standard implementation technique for:

  • Object-oriented vtables. Every C++ virtual method table is a dispatch table: the vptr in a class instance points to an array of function pointers, one per virtual method. obj->method() compiles to (*(obj->vptr[method_index]))(obj, ...).
  • OS syscall dispatch. The Linux kernel’s sys_call_table is an array of function pointers indexed by syscall number. int 0x80 / syscall jumps to sys_call_table[rax].
  • Interpreter bytecode dispatch. Bytecode VMs (Python’s CPython, Java’s JVM in interpreter mode) dispatch on opcode using a table or computed goto.

Function Pointers and the Type System

Function pointer types must exactly match the pointed-to function’s signature. The cast itself is allowed, but calling a function through a pointer whose type is incompatible with the function’s actual type is undefined behavior (even though it “works” on most platforms for compatible signatures):

int add(int a, int b) { return a + b; }
int (*fp)(int) = (int (*)(int))add;   // wrong arity cast
fp(5);   // undefined behavior: called with 1 arg, function expects 2

The ABI’s calling convention determines what happens in practice (extra argument registers are just unused, or garbage is passed), but the behavior is undefined. C does not have function overloading or runtime type information, so there is no way to detect the mismatch.
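
What is defined is the round trip: a function pointer may be converted to another function-pointer type and back, as long as the eventual call goes through a pointer of the correct type. A sketch using add from above (the typedef generic_fn is ours):

typedef void (*generic_fn)(void);   // a generic "pointer to some function" type

generic_fn stored = (generic_fn)add;                      // storing under a different type is fine
int (*restored)(int, int) = (int (*)(int, int))stored;    // convert back to the true type
int sum = restored(2, 3);                                 // defined: the call matches add's signature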


Appendix H: Integer Promotions and Implicit Conversions

C’s implicit numeric conversions are a source of subtle bugs, particularly at the boundary between signed and unsigned arithmetic. Understanding them is essential for writing portable, correct code — especially in functions like strlen (which returns size_t, an unsigned type) and memcmp (which returns int).

The Promotion Rules

Whenever a char, short, or bit-field appears in an arithmetic expression, it is implicitly promoted to int before any arithmetic. If int cannot represent all values of the original type, it is promoted to unsigned int. This is the integer promotion rule.

char a = 200;      // stored as 0xC8 (= -56 as signed char on typical platforms)
char b = 100;
int c = a + b;     // a and b are promoted to int first
                   // int(-56) + int(100) = int(44), so c = 44

Without promotion, char arithmetic would wrap at 8 bits whenever an intermediate result overflowed; promotion ensures that all intermediate arithmetic is carried out at int width. Note that the signedness of plain char is itself implementation-defined, so the stored value of a above (and therefore the final sum) can still differ between platforms where char is signed and platforms where it is unsigned; code that cares should say signed char or unsigned char explicitly.

Usual Arithmetic Conversions

When a binary operator has operands of different types, they are converted to a common type by the usual arithmetic conversions:

  1. Both promoted (per above).
  2. If one is long double, the other is converted to long double.
  3. Else if one is double, the other is converted to double.
  4. Else if one is float, the other is converted to float.
  5. Else (both are integer types after promotion): if both have the same signedness, convert to the wider type. If different signedness, it gets complicated.

The dangerous case is mixed signed/unsigned:

int a = -1;
unsigned int b = 1;
if (a < b) printf("yes\n");   // does NOT print

Here, a < b triggers the usual arithmetic conversions for mixed signedness: a (signed) and b (unsigned) both have the same rank, so a is converted to unsigned int. As an unsigned integer, -1 becomes UINT_MAX (the largest unsigned value), which is greater than 1. The comparison is false. This is the signed/unsigned comparison trap, and it bites C programmers regularly.

The warning -Wsign-compare (enabled by -Wextra; for C, -Wall alone does not turn it on) catches this pattern. The fix: either cast explicitly or ensure both operands have the same signedness before comparing, as sketched below.
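
Two possible fixes for the comparison above, both keeping the comparison within a single signedness (a sketch, not the only options):

int a = -1;
unsigned int b = 1;

if (a < 0 || (unsigned int)a < b) { /* negative a counts as "less than" any unsigned b */ }

if (a < (int)b) { /* only valid when b is known to fit in int */ }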

size_t and Signed Arithmetic

strlen returns size_t — an unsigned integer of pointer width (32 or 64 bits). A classic bug:

char s[] = "hello";
for (size_t i = strlen(s) - 1; i >= 0; i--) {
    // process s[i] in reverse
}

This loop appears to walk backward through the string, but it never terminates. strlen returns size_t (unsigned), so i is unsigned and the condition i >= 0 is always true: when i reaches 0 and is decremented, it wraps to SIZE_MAX instead of going negative, and the next iteration indexes far outside the string. The empty string breaks even earlier, since strlen(s) - 1 wraps to SIZE_MAX before the first iteration. The same trap appears whenever a signed index is compared against a size_t expression: the usual arithmetic conversions turn the signed value into a huge unsigned one.

One fix: compute the length as a signed int before the loop (safe as long as the string is shorter than INT_MAX):

int len = (int)strlen(s);
for (int i = len - 1; i >= 0; i--) { /* safe */ }

Or, more idiomatically, restructure the loop to avoid the subtraction entirely:

for (size_t i = strlen(s); i-- > 0; ) { /* i-- compares the old value, then decrements; the body sees the new index */ }

char Signedness and EOF

The getchar() function returns int, not char. This is deliberate: it needs to return all 256 possible byte values plus a distinct sentinel EOF (typically -1). If getchar returned char, and char is signed (as it is on most platforms), the byte 0xFF would be -1 — indistinguishable from EOF.

The correct idiom:

int c;
while ((c = getchar()) != EOF) {
    // process c
}

The error version:

char c;   // wrong type
while ((c = getchar()) != EOF) {
    // if char is signed: a 0xFF byte becomes -1, which compares equal to EOF -- early exit
    // if char is unsigned: EOF (-1) becomes 255, never equal to EOF -- infinite loop
}

The EOF / char signedness trap is one of the oldest bugs in C programming. ctype.h functions like isspace(c) and toupper(c) have the same issue: they take an int argument whose value must be representable as unsigned char or equal to EOF. Passing a signed char (which may be negative) is undefined behavior on platforms where char is signed. The correct idiom is isspace((unsigned char)c).
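
A sketch of the idiom in use (the function upcase is ours):

#include <ctype.h>

void upcase(char *s) {
    for (; *s; s++) {
        *s = (char)toupper((unsigned char)*s);   // the cast keeps negative char values out of toupper
    }
}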

A Note on int Width Portability

The C standard guarantees only minimum widths: short ≥ 16 bits, int ≥ 16 bits, long ≥ 32 bits, long long ≥ 64 bits. On every 32- and 64-bit platform you will encounter in 2026, int is 32 bits and long is either 32 (Windows) or 64 bits (Linux/macOS on 64-bit). But embedded platforms can have 16-bit int, and the C standard permits it.

For portable fixed-width integers, use <stdint.h>: int32_t, uint64_t, int8_t, etc. For pointer-sized integers, use intptr_t and uintptr_t. For sizes and offsets, use size_t and ptrdiff_t. These types make the intent explicit and prevent platform-specific surprises.


Appendix I: Structs, Unions, and Alignment in C

The tagged union pattern used throughout TIL’s implementation — a discriminant field plus a union of possible payloads — is ubiquitous in C. Understanding how structs and unions are laid out in memory, and why padding exists, is necessary for writing code that interacts with binary formats, hardware registers, and the allocator.

Struct Layout and Padding

C structs are laid out in declaration order, with padding inserted between members to satisfy alignment requirements. Each member must start at an address that is a multiple of the member’s alignment — typically the member’s size, up to a platform maximum (usually 8 or 16 bytes on 64-bit platforms).

struct example {
    char   a;    // 1 byte  at offset 0
    // 3 bytes padding
    int    b;    // 4 bytes at offset 4  (needs 4-byte alignment)
    char   c;    // 1 byte  at offset 8
    // 7 bytes padding
    double d;    // 8 bytes at offset 16 (needs 8-byte alignment)
};
// sizeof(struct example) = 24

Padding is invisible to the programmer but has real consequences: memcpy on a struct copies padding bytes (which may contain garbage), memset clears them, and comparing two structs byte-by-byte with memcmp is unreliable because padding bytes are uninitialized.

To minimize padding, order struct members from largest to smallest:

struct example_tight {
    double d;    // 8 bytes at offset 0
    int    b;    // 4 bytes at offset 8
    char   a;    // 1 byte  at offset 12
    char   c;    // 1 byte  at offset 13
    // 2 bytes padding (to bring total to multiple of 8)
};
// sizeof(struct example_tight) = 16  (vs 24 above)
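
The layout claims above can be checked at compile time with C11's _Static_assert (the exact sizes assume a typical 64-bit ABI like the ones in Part II):

_Static_assert(sizeof(struct example) == 24, "unexpected padding in struct example");
_Static_assert(sizeof(struct example_tight) == 16, "unexpected padding in struct example_tight");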

The offsetof Macro

offsetof(type, member) (from <stddef.h>) returns the byte offset of a member within a struct. It is useful for introspecting layout and for implementing intrusive data structures (embedding a struct inside a larger one and recovering the outer pointer from a pointer to the embedded member):

#include <stddef.h>

struct Node {
    int value;
    struct Node *next;
};

printf("value at offset %zu\n", offsetof(struct Node, value));  // 0
printf("next  at offset %zu\n", offsetof(struct Node, next));   // 8 (on 64-bit)

The Linux kernel uses this pattern extensively: a struct list_head is embedded inside many kernel structures, and container_of(ptr, type, member) (a macro built on offsetof) recovers the enclosing struct pointer from a list_head *.
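
A minimal version of the macro, built on offsetof exactly as described (the kernel's real container_of adds type checking on top of this; struct Thing is our example):

#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct Thing {
    int id;
    struct Node node;                         // struct Node from above, embedded as a member
};

// given a struct Node *n known to live inside a struct Thing:
//   struct Thing *t = container_of(n, struct Thing, node);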

Unions and Type Punning

A union in C overlays all its members at the same starting address. Its size is the size of the largest member:

union data {
    int   i;
    float f;
    char  bytes[4];
};
union data d;
d.i = 0x3F800000;          // store bit pattern
printf("%f\n", d.f);       // read as float: 1.0 (IEEE 754 encoding of 1.0)

Reading a different union member than the one last written reinterprets the stored bytes as the new type. C99 (with its technical corrigenda) and C11 both bless this as type punning in a footnote; the result is unspecified only if the bytes do not form a valid value of the new type. (C++ is stricter: its active-member rule makes the same read undefined.) This is the standard-blessed C way to inspect the bit representation of a float as an int, replacing the strict-aliasing-violating pointer cast (*(int *)&f). Our TIL Value struct uses a union too, but as a tagged union: the same memory holds either an int_val, a bool_val, or a closure struct, and the kind field records which member was last written, so no punning is involved.
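
A sketch of the standard-blessed pun packaged as a helper (float_bits is our name; it assumes the usual 4-byte float):

#include <stdint.h>

uint32_t float_bits(float f) {
    union { float f; uint32_t u; } pun = { .f = f };
    return pun.u;    // the same 4 bytes, reinterpreted as an unsigned integer
}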

Flexible Array Members

C99 introduced flexible array members: the last member of a struct may be an incomplete array type T[]. The struct’s size does not include the array; the caller provides storage for it by allocating the struct plus however many extra bytes the array needs:

struct String {
    size_t length;
    char   data[];   // flexible array member
};

struct String *new_string(const char *s) {
    size_t len = strlen(s);
    struct String *str = malloc(sizeof(struct String) + len + 1);
    str->length = len;
    memcpy(str->data, s, len + 1);
    return str;
}

str->data can then be used as an array of len + 1 chars (the string plus its terminator). The allocation is a single malloc for both the header and the data, which is cache-friendly and reduces pointer chasing. CPython’s variable-size objects and many kernel data structures use this pattern.


A Note on the Imperative Discipline

There is a tension at the heart of imperative programming that this course has circled repeatedly without naming directly. It is worth stating plainly now.

A pure functional program is a description — it says what things are. An imperative program is a procedure — it says what to do. The description is timeless; the procedure unfolds in time. Equational reasoning, the great tool of mathematics, applies directly to descriptions and only indirectly — with invariants, pre- and postconditions, loop invariants — to procedures.

This is not a defect of imperative programming. The world is full of processes that unfold in time: a bank ledger, a network connection, a game state, a running OS. Describing these processes in a purely functional style is possible — it is what the state-as-extra-parameter model in Chapter 1 formalizes — but it can be clumsy. Mutation, shared mutable state, and side effects are tools for matching the natural structure of these domains. The cost is that they make reasoning harder.

The discipline this course recommends: use mutation deliberately and locally. A function that takes no mutable inputs and returns a value without side effects is easy to test, easy to reason about, and easy to compose. A function that mutates shared state transitively through a chain of pointers is hard to test, hard to reason about, and fragile under refactoring. The principle of minimizing the scope of mutation — writing most of your logic in functions that transform values, and confining mutation to small, well-documented surfaces — is the practical wisdom behind both functional programming and modern imperative style guides.

The C abstract machine gives you a surface where mutation is pervasive and aliasing is unrestricted. The tools in this course — loop invariants, the UB catalogue, lifetime discipline, the allocator, the GC — are the discipline that tames that surface. They do not eliminate the complexity of mutation; they make it legible. A program is a conversation between the programmer and the machine: the programmer states what should happen, the machine executes, and the gap between intent and execution is where bugs live. Closing that gap — through proof, through testing, through sanitizers, through careful language design — is the life’s work of systems programming.

The three parts of this course are three aspects of the same discipline: learning to think about state (Part I), learning to see it in the machine (Part II), and learning to build systems that manage it so programmers do not have to (Part III). None of these is the full story; each is an entry point into a field — programming language theory, computer architecture, and language implementation — that repays decades of study.


Appendix J: static, Linkage, and Scope in C

The keyword static in C has two entirely different meanings depending on context, and both are important for understanding the address-space layout from Chapter 7 and the compilation pipeline from the Interlude.

static as a Storage Class

When applied to a local variable, static changes its lifetime from automatic to static:

int count_calls(void) {
    static int count = 0;   // initialized once, persists across calls
    count++;
    return count;
}

The variable count lives in the .data segment (or .bss if initialized to zero) rather than the stack. It is initialized exactly once (before main runs) and retains its value between calls to count_calls. The declaration static int count = 0 is not an assignment — it is an initialization that happens exactly once at program start.

This is the mechanism behind memoization in C:

int fib_memo(int n) {
    static int cache[100] = {0};   // zero-initialized, persists across calls; assumes n < 100
    static int computed[100] = {0};
    if (n < 2) return n;
    if (computed[n]) return cache[n];
    int result = fib_memo(n-1) + fib_memo(n-2);
    cache[n] = result;
    computed[n] = 1;
    return result;
}

This is exactly what Ragde describes in FICS2 §2 (Effects in Racket) under memoization — using mutable state to cache previously computed results. The Racket version uses a hash table; the C version uses a static array. Both exploit the idea that a function’s return value for a given input never changes (for pure-ish functions), so the result can be cached in persistent state.

static as a Linkage Specifier

When applied to a function or variable at file scope, static gives it internal linkage: the name is visible only within the current translation unit (.c file) and is not exported to the linker.

// foo.c
static int helper(int x) { return x * 2; }  // not visible outside foo.c
int public_fn(int x) { return helper(x); }   // visible everywhere

Without static, every non-static function and global variable has external linkage — the linker can see it and resolve references to it from other translation units. If two .c files both define a non-static helper function, the linker reports a multiply-defined symbol error. If both define static helper, there is no conflict — they are independent symbols.

The practical rule: every function and global variable that is not part of a module’s public API should be static. This reduces the risk of accidental name collisions during linking and communicates intent to readers. It also enables some compiler optimizations: a static function with a single call site may be inlined even without __attribute__((always_inline)), because the compiler knows no other file can call it.

extern and the Declaration/Definition Distinction

C distinguishes declarations (saying a name exists and has a type) from definitions (saying a name exists, has a type, and allocating storage for it). The rule:

  • A function definition has a body. A function declaration has only the signature.
  • A variable definition is int x = 5; (allocates storage). A variable declaration is extern int x; (says x exists somewhere).

The extern keyword marks a variable as declared but not defined here:

// file_a.c
int shared_counter = 0;   // definition: allocates storage

// file_b.c
extern int shared_counter;   // declaration: says the name exists
void increment(void) { shared_counter++; }

If two files both write int shared_counter = 0; (no extern), the linker sees two definitions of the same symbol and reports an error. Headers should contain only declarations (extern int x;, function signatures), never definitions (int x = 0;), to avoid this.
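
A sketch of the resulting header/source split (the file names are ours):

// counter.h -- declarations only, safe to include from many .c files
extern int shared_counter;       // declaration: no storage allocated here
void increment(void);            // function declaration: signature, no body

// counter.c -- the one translation unit that defines the symbol
#include "counter.h"
int shared_counter = 0;          // definition: storage lives in this object file
void increment(void) { shared_counter++; }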

Inline Functions

C99 added inline as a suggestion to the compiler to expand the function body at the call site rather than generating a real call. The C rules differ from C++: a function defined with plain inline in a header provides only an inline definition in each translation unit that includes it, and exactly one translation unit must still supply an external definition (typically via an extern inline declaration in one .c file) to fall back on when the compiler declines to inline or a caller takes the function’s address.

The idiomatic C pattern for header-inlined functions:

// math_utils.h
static inline int max_int(int a, int b) {
    return a > b ? a : b;
}

static inline gives internal linkage and an inlining hint. Each translation unit that includes the header gets its own copy, no linkage conflicts arise, and the compiler is free to inline the body.

The difference from a macro: static inline is type-safe, obeys scope rules, and evaluates arguments exactly once. #define MAX(a,b) ((a)>(b)?(a):(b)) evaluates either a or b twice if there are side effects, and can produce surprising results with expressions like MAX(x++, y++).

This distinction — macro vs inline function vs out-of-line function — illustrates a recurring theme: C gives you tools at multiple levels of abstraction (preprocessor text, language semantics, machine instructions), and choosing the right level is part of writing good C.


Appendix K: Tooling — The Extended Ecosystem

Chapter 3 mentioned sanitizers (-fsanitize=address,undefined) as tools for catching UB. Stanford CS107’s curriculum and Princeton COS 217’s practitioner focus both emphasize tooling as a first-class skill, not an afterthought. This appendix collects the essential tools.

Sanitizers

AddressSanitizer (ASan) (-fsanitize=address): detects out-of-bounds reads and writes, use-after-free, use-after-return, use-after-scope, double-free, and memory leaks. Instruments every heap and stack allocation with redzones — poisoned memory regions around allocations — and checks every load and store against them. Runtime overhead: roughly 2–3× memory and around 2× time. Platform: Linux, macOS, Windows (partial). Foundational paper: Serebryany et al. (Google), USENIX ATC 2012.

UndefinedBehaviorSanitizer (UBSan) (-fsanitize=undefined): detects signed integer overflow, null pointer dereferences, misaligned memory accesses, out-of-bounds array indexing, invalid enum values, invalid function pointer calls, and more. Instruments each potentially-UB operation with a runtime check. Overhead: ~10–30%. Catches the cases from Chapter 3 directly: -fsanitize=signed-integer-overflow,null,bounds,alignment covers four of the five case studies.

MemorySanitizer (MSan) (-fsanitize=memory): detects reads of uninitialized memory. Unlike ASan, which checks for out-of-bounds, MSan checks for undefined value propagation — tracking which bits of each value were initialized. Far harder to implement than ASan (requires tracking every byte’s initialization status), but catches the “uninitialized read” category that ASan misses. Linux only; overhead: ~3× time.

ThreadSanitizer (TSan) (-fsanitize=thread): detects data races — concurrent accesses to the same memory location where at least one is a write, with no synchronization between them. Instruments every memory access with happens-before tracking. Overhead: ~5–15× time; suitable for race-finding, not production. Essential for verifying lock discipline in any multi-threaded C program.

Valgrind / Memcheck

Valgrind is a dynamic binary instrumentation framework. Its primary tool, Memcheck, detects heap errors without recompilation: invalid reads/writes, use of uninitialized values, illegal frees, and memory leaks. Overhead: roughly 20–50× slowdown (it JIT-recompiles every instruction). Unlike ASan, Memcheck works on arbitrary binaries and can check library code without instrumentation. On Linux, it is often the first tool to run when debugging a memory error in code you cannot recompile.

Valgrind’s Callgrind tool records the call graph and per-function instruction counts, and with --cache-sim=yes adds simulated cache behavior. The resulting profiles can be visualized with KCachegrind, which shows cost and cache-miss rates per source line — the tool CS107 uses for its performance profiling exercises.

GDB and LLDB

A debugger is not optional for C programmers. GDB (GNU Debugger) and LLDB (LLVM Debugger) both allow:

  • Setting breakpoints at source lines, function entries, memory addresses.
  • Examining the call stack (backtrace, frame), local variables (info locals), and registers (info registers).
  • Stepping through code line-by-line (next, step) or instruction-by-instruction (nexti, stepi).
  • Watching for changes to a variable (watch).
  • Inspecting memory directly (x/16xb &buf — examine 16 bytes in hex at buf’s address).

On macOS with Apple Silicon, LLDB is the native debugger. On Linux, GDB is standard. Both support the same conceptual workflow; the commands differ slightly.

CS107 structures a large portion of its assignments around deliberate use of GDB and Valgrind — not as emergency measures when something goes wrong, but as routine tools used during development. This is the right disposition: run Valgrind before committing, use GDB to understand any assertion failure rather than adding more printfs.

Compiler Explorer (godbolt.org)

Already mentioned in Chapter 5, Compiler Explorer deserves a note here as a reasoning tool, not just a curiosity. The key use cases:

  • Verify that your loop was vectorized. Select x86-64 clang with -O2 and look for xmm or ymm register uses in the loop body (ymm appears only if AVX is also enabled, e.g. with -march=haswell). If the loop is still scalar, try -O3 or add restrict to the pointer parameters.
  • Confirm that a small function was inlined. If you call static inline foo() from bar(), verify that bar’s assembly contains foo’s instructions rather than a call instruction.
  • Check that a UB-exploiting optimization fires. Write int stupid(int a) { ... } as in Regehr’s example, compile at -O2, and confirm the output is mov eax, 1; ret.
  • Compare ARM64 vs x86-64 output. Toggle between clang --target=aarch64-linux-gnu and clang --target=x86_64-linux-gnu for the same source — exactly the dual-track comparison of Part II.

The ability to instantly see how source changes affect assembly, across multiple compilers and architectures, makes Compiler Explorer the most valuable daily tool for low-level reasoning about C. Princeton COS 217 and Stanford CS107 both assign exercises that use it.

The C Standard Document Itself

For definitive answers about undefined behavior, no source is more authoritative than the C standard. The current standard is C23 (ISO/IEC 9899:2024); the previous widely-used version is C17. Free draft versions are available:

  • N3096: the last freely available working draft before C23 was finalized.
  • N2176: the last working draft before C17 was finalized.

The standard uses precise terminology: “undefined behavior” (the standard imposes no requirements), “unspecified behavior” (the standard allows several behaviors and the implementation need not document which it chooses), “implementation-defined behavior” (the implementation must choose and document its behavior). Annex J (informative) lists common UB categories — it is a checklist worth reading once. The standard’s Rationale document (a separate publication, not part of the standard) explains why specific choices were made, which is often more illuminating than the normative text itself.
