CS 447: Software Testing, Quality Assurance, and Maintenance

Arie Gurfinkel

Estimated study time: 41 minutes


Sources and References

Primary textbook — Paul Ammann and Jeff Offutt, Introduction to Software Testing, 2nd ed., Cambridge University Press, 2016. Supplementary texts — Flemming Nielson, Hanne Riis Nielson, and Chris Hankin, Semantics with Applications: An Appetizer, Springer, 2007; Aaron R. Bradley and Zohar Manna, The Calculus of Computation: Decision Procedures with Applications to Verification, Springer, 2007; Uwe Schöning, Logic for Computer Scientists, Birkhäuser, 2008; Daniel Kroening and Ofer Strichman, Decision Procedures: An Algorithmic Point of View, 2nd ed., Springer, 2016. Online resources — Z3 SMT solver documentation (z3prover.github.io); KLEE symbolic execution engine (klee.github.io); Dafny verifier documentation (dafny.org); AddressSanitizer/MemorySanitizer LLVM documentation; LLVM compiler infrastructure project.


Chapter 1: Foundations — Faults, Failures, and the Testing Mindset

Software quality is not an accident. It emerges from disciplined engineering practices built on a precise vocabulary for characterising what goes wrong and why. The foundational framework used throughout this course is the Fault-Error-Failure model developed by Laprie and adopted in the dependability community.

A fault (also called a defect or bug) is a static artifact: a mistake in the source code, a missing check, a wrong operator. A fault by itself causes no harm until it is exercised. An error is the manifestation of a fault as an incorrect internal state during execution — a wrong register value, an invalid pointer, a corrupted heap node. An error becomes a failure only when the incorrect internal state propagates to an observable output that deviates from specification. Not every fault leads to a failure; a fault in dead code never triggers an error; an error that is masked (e.g., the wrong value is added to zero) never surfaces as a failure. This three-level distinction is fundamental: it explains why testing can miss bugs even at high coverage.
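To make the masking phenomenon concrete, here is a minimal, hypothetical Python example (the `midpoint` function is ours, not from the course): the fault is exercised on every call, but only some inputs propagate the resulting error to the output.

```python
def midpoint(a, b):
    # Fault: the specification is (a + b) // 2, but a wrong operator was typed.
    return (a - b) // 2

# With b == 0 the faulty expression coincides with the specified one:
# the error is masked and no failure is observable.
assert midpoint(8, 0) == (8 + 0) // 2

# A different input propagates the error to the output, i.e. a failure:
assert midpoint(8, 2) == 3        # the specification demands 5
```

A test suite containing only inputs with `b == 0` executes the faulty line on every run and still reports all-pass.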

Testing exercises the software against expected behaviour on selected inputs, reporting pass/fail. Debugging begins after a failure is observed and seeks to locate the underlying fault. Verification asks whether a program satisfies its specification for all inputs (often using formal methods). Validation asks whether the right product was built for the user’s needs. The V-model maps these activities to lifecycle phases: requirements validation, design verification, unit testing, integration testing, system testing, and acceptance testing, all structured as symmetric wings of a V.

Testing is fundamentally hard for deep mathematical reasons. The halting problem (Turing, 1936) establishes that no algorithm can decide for all programs and inputs whether a program terminates. Rice’s theorem generalises this: no non-trivial semantic property of programs — including “does this program crash?”, “does this function return the right value?”, or “is this loop invariant maintained?” — is decidable. Testing is therefore sampling: we execute the program on a finite, strategically chosen subset of inputs and hope that those inputs are representative. The choice of what to sample is the theory of test adequacy criteria.

A test adequacy criterion defines when we have tested enough. Line coverage and branch coverage are structural criteria; mutation score is a fault-based criterion. High structural coverage does not imply correctness: a test suite can achieve 100% branch coverage while missing entire classes of semantic bugs. The mutation score — the fraction of artificial faults (mutants) that are detected by at least one test — is empirically a better proxy for test suite quality.
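A small illustration of this gap, using a hypothetical `safe_div` function: two tests achieve 100% branch coverage, yet the semantic bug survives because the failing input class was never sampled.

```python
def safe_div(a, b):
    # Fault: the guard should test b == 0, but tests b < 0 instead.
    if b < 0:
        return 0
    return a // b

# Two tests cover both edges of the branch: 100% branch coverage.
assert safe_div(10, -1) == 0      # true branch
assert safe_div(10, 2) == 5       # false branch

# Yet the unsampled input b == 0 still fails:
try:
    safe_div(1, 0)
    crashed = False
except ZeroDivisionError:
    crashed = True
assert crashed
```

A mutant that replaces `b < 0` with `b <= 0` would survive this two-test suite, flagging its weakness.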

Software quality attributes span multiple dimensions. Correctness demands that outputs satisfy the specification. Reliability demands that failure rates remain below acceptable thresholds under operational conditions. Security requires resistance to adversarial inputs. Performance demands that latency and throughput meet requirements. The SQIP (Software Quality Improvement Process) framework ties these attributes to measurable goals.

The economic argument for early defect detection is decisive. IBM and NASA studies consistently show that the cost of fixing a fault grows exponentially with the phase in which it is discovered: a requirements-phase fix costs roughly one unit; a design-phase fix costs three to five; a code-phase fix costs ten; a test-phase fix costs forty; a production fix costs one hundred or more. Investing in rigorous testing and formal methods is cost-effective, not gold-plating.

Finally, the distinction between syntax and semantics underpins the entire enterprise. A syntactically valid program successfully parses and type-checks. A semantically correct program computes the right function. Type checkers and compilers catch the former class of errors; test suites, static analysers, and formal verifiers are needed for the latter.


Chapter 2: Coverage Criteria — Structural and Logic

Structural coverage criteria measure how thoroughly a test suite exercises the syntactic structure of a program. They provide objective, measurable stopping conditions for testing and are required by safety standards such as DO-178C (avionics) and ISO 26262 (automotive).

Control Flow and Basic Coverage

A control flow graph (CFG) represents a procedure as a directed graph. Nodes are basic blocks — maximal sequences of instructions with no branches in or out. Edges represent possible transfers of control: sequential fall-through, taken/not-taken branch edges, exception edges, and return edges. Every procedure has a unique entry node and one or more exit nodes.

Statement coverage (C0) requires that every statement (equivalently, every basic block) is executed by at least one test case. C0 is the weakest useful criterion; it is the bare minimum expected of any serious test suite. Branch coverage (C1) requires that every CFG edge is traversed. C1 implies C0: if every edge is covered, every node is executed. Measuring branch coverage also exposes unreachable branches (code that can never execute) as targets no test can satisfy; such branches often indicate logic errors. The relationship is strict: achieving C0 does not imply C1.

Path coverage requires every path from entry to exit to be executed. It is theoretically the strongest structural criterion but practically infeasible: a loop with \(n\) iterations that contains one conditional generates \(2^n\) distinct paths. Two practical relaxations are k-path coverage (all paths of length at most \(k\)) and prime-path coverage. A prime path is a simple path (no node repeated, except that the first and last may coincide) that is not a proper subpath of any other simple path; prime-path coverage requires every prime path to be toured. Prime-path coverage strictly subsumes branch coverage.

Data Flow Coverage

Data flow criteria focus on relationships between definitions and uses of variables. A definition of variable \(x\) at node \(n\) (written \(\mathrm{def}(x, n)\)) occurs when a value is assigned to \(x\). A use of \(x\) at node \(m\) is a c-use if it appears in a computation, or a p-use if it appears in a predicate. A def-use pair \((n, m)\) for \(x\) exists when there is a definition-free path from \(n\) to \(m\) for \(x\) (i.e., \(x\) is not redefined along the path).

Criterion | Requirement
All-defs | Every definition has at least one use covered
All-uses | Every def-use pair is covered
All-du-paths | Every definition-free path between each def-use pair is covered

Data flow criteria are particularly effective at detecting uninitialized variable uses, dead assignments, and incorrect variable threading through complex function call chains.

Logic Coverage and MC/DC

Condition coverage criteria decompose compound Boolean predicates. A predicate such as \((A \wedge B) \vee C\) contains three conditions: \(A\), \(B\), and \(C\).

Multiple Condition Coverage (MCC) requires all \(2^n\) combinations of \(n\) Boolean conditions to be exercised. For large predicates this is exponential in the number of conditions.

Modified Condition/Decision Coverage (MC/DC) is the standard mandated by DO-178C for Level A (catastrophic failure) avionics software. MC/DC requires: (1) every condition takes both true and false values; (2) every condition independently affects the outcome of the decision — i.e., for each condition there exist two test cases where only that condition changes and the overall decision outcome changes. MC/DC requires only \(n+1\) test cases for \(n\) conditions, making it tractable while providing strong fault-detection guarantees. MC/DC strictly subsumes branch coverage and is incomparable with path coverage.
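The independence requirement can be checked mechanically. The sketch below (helper names are ours) verifies that four tests, \(n+1\) for \(n = 3\), satisfy MC/DC for the running example \((A \wedge B) \vee C\):

```python
def decision(a, b, c):
    return (a and b) or c

# n + 1 = 4 tests for the n = 3 conditions of (A and B) or C.
tests = [(True, True, False), (False, True, False),
         (True, False, False), (True, False, True)]

def shows_independence(pos):
    """Is there a pair of tests differing only in condition `pos`
    whose decision outcomes differ?"""
    return any(
        [i for i in range(3) if t1[i] != t2[i]] == [pos]
        and decision(*t1) != decision(*t2)
        for t1 in tests for t2 in tests)

# (1) every condition takes both truth values across the suite...
assert all({t[i] for t in tests} == {True, False} for i in range(3))
# (2) ...and every condition independently affects the decision.
assert all(shows_independence(i) for i in range(3))
```

By contrast, MCC would demand all \(2^3 = 8\) combinations for the same predicate.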

Mutation Testing

Mutation testing operationalises the question: “Would my tests detect a realistic bug?” A mutant is a version of the program with one small syntactic change. Common mutation operators include:

Operator | Description | Example
AOR | Arithmetic operator replacement | \(a + b \to a - b\)
ROR | Relational operator replacement | \(x < y \to x \leq y\)
COR | Conditional operator replacement | \(\wedge \to \vee\)
LCR | Logical connector replacement | && → ||
UOI | Unary operator insertion | \(x \to -x\) or \(x \to \neg x\)

A test suite kills a mutant if at least one test case produces a different output on the mutant than on the original. The mutation score is the ratio of killed mutants to non-equivalent mutants. An equivalent mutant is syntactically different but semantically identical; detecting equivalence is undecidable in general.
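A minimal sketch of mutant killing, with a hypothetical `area` function and an AOR mutant. A weak test lets the mutant survive; a stronger test kills it.

```python
def area(w, h):
    return w * h

def area_mutant(w, h):
    # AOR mutant: * replaced by +
    return w + h

# A weak test does not kill the mutant: 2 * 2 == 2 + 2.
assert area(2, 2) == area_mutant(2, 2)

# A stronger test kills it: the outputs differ (12 vs 7).
assert area(3, 4) == 12 and area_mutant(3, 4) == 7
```

A suite containing only the `(2, 2)` test would score 0/1 against this mutant, exposing its weakness even though the test passes on the original program.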

Mutation testing tools include PIT (Pitest) for Java and MutPy for Python. The subsumption lattice of coverage criteria has the ordering: path coverage \(\supset\) branch coverage \(\supset\) statement coverage, and MC/DC \(\supset\) branch coverage.


Chapter 3: Dynamic Analysis — Sanitizers and Fuzzing

Structural coverage criteria tell us what code was executed; dynamic analysis tools tell us how it was executed and whether any safety properties were violated at runtime. This chapter covers two complementary families: sanitisers (which instrument code to detect violations as they occur) and fuzzing (which generates large volumes of inputs to find crashing or incorrect behaviour).

Memory Safety and the Sanitiser Family

C and C++ provide direct memory access without bounds checking. This efficiency comes at a security cost: the dominant classes of exploitable vulnerabilities in real-world software are memory safety errors.

AddressSanitizer (ASan) detects heap/stack/global buffer overflows, heap use-after-free, heap double-free, and use-after-return. Its mechanism is elegant: every 8 bytes of application memory are mapped to 1 byte of shadow memory. A shadow value of zero means all 8 bytes are addressable; a non-zero value \(k\) means the first \(k\) bytes are addressable and the rest are poisoned (unaddressable). Every memory access is instrumented with a shadow check; if the access falls into poisoned memory, ASan aborts with a precise error report. ASan imposes roughly a 2× slowdown and 1.5–3× memory overhead, making it practical for CI/CD pipelines. It is enabled in LLVM and GCC with -fsanitize=address.
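The shadow encoding can be sketched in a few lines of Python. This is a toy model of the scheme described above, not ASan's actual implementation:

```python
SCALE = 8   # 8 application bytes map to 1 shadow byte

def shadow_for(size):
    """Shadow bytes for a heap allocation of `size` bytes:
    0 = granule fully addressable, k in 1..7 = first k bytes addressable,
    -1 = poisoned (red zone)."""
    full, partial = divmod(size, SCALE)
    shadow = [0] * full
    if partial:
        shadow.append(partial)
    shadow.append(-1)            # red zone placed after the allocation
    return shadow

def check_access(shadow, offset):
    """ASan-style instrumentation check for a one-byte access."""
    s = shadow[offset // SCALE]
    return s == 0 or (s > 0 and offset % SCALE < s)

shadow = shadow_for(13)              # [0, 5, -1]
assert check_access(shadow, 12)      # last valid byte of the allocation
assert not check_access(shadow, 13)  # one-byte heap overflow: caught
```

Red zones around allocations are what turn an off-by-one heap overflow into a poisoned-memory access that the shadow check reports immediately.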

MemorySanitizer (MSan) tracks whether memory was initialised before it is read. It propagates taint through all computations: any value derived from uninitialized memory is itself tainted. A branch on a tainted condition triggers an MSan error. MSan requires rebuilding all linked libraries with instrumentation.

UndefinedBehaviorSanitizer (UBSan) catches signed integer overflow, division by zero, null pointer dereference, type-punning violations, and misaligned memory accesses — all of which are undefined behaviour in C/C++, meaning the compiler is free to assume they never occur and may generate surprising code when they do.

ThreadSanitizer (TSan) detects data races using a happens-before analysis. It instruments every memory access and synchronisation operation, maintaining a vector-clock per memory location to detect concurrent conflicting accesses without synchronisation.

Sanitiser | Primary target | Overhead
ASan | Memory safety (spatial and temporal) | ~2× CPU, ~2× memory
MSan | Uninitialised reads | ~3× CPU
UBSan | Undefined behaviour | ~1.2× CPU
TSan | Data races | ~5–15× CPU, ~5–10× memory

Fuzzing

Fuzzing (fuzz testing) automatically generates large quantities of inputs and monitors the program under test for crashes, assertion failures, or sanitiser alerts. The three fuzzing paradigms are:

  • Black-box fuzzing: no program knowledge; generate random inputs. Simple but low coverage.
  • Grey-box fuzzing: uses lightweight runtime feedback (edge coverage bitmaps) to guide input generation toward unexplored code paths.
  • White-box fuzzing: uses program analysis (symbolic execution) to generate inputs that reach specific paths.

Coverage-guided grey-box fuzzing (AFL, libFuzzer) is the dominant approach in practice. The fuzzer maintains a corpus of interesting inputs — inputs that exercise previously unseen edge transitions in the CFG. Each mutation of a corpus entry is executed; if it triggers new coverage, it is added to the corpus. Mutation strategies include bit flips, byte substitutions, block insertion, and dictionary-guided splicing. The feedback loop between coverage bitmaps and mutation drives the corpus toward deeper program states over time.
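The feedback loop can be sketched as a toy coverage-guided fuzzer. Everything here (the target program, the edge names, the single-byte mutation strategy) is an illustrative assumption, but the structure mirrors the AFL loop: mutate a corpus entry, keep it if it reaches new edges.

```python
import random

def target(data):
    """Toy program under test: returns the set of CFG edges exercised,
    and crashes on one deep input."""
    edges = set()
    if len(data) > 0 and data[0] == ord('F'):
        edges.add('e1')
        if len(data) > 1 and data[1] == ord('U'):
            edges.add('e2')
            if len(data) > 2 and data[2] == ord('Z'):
                raise RuntimeError("crash")
    return edges

def mutate(data):
    data = bytearray(data)
    i = random.randrange(len(data))          # flip one random byte
    data[i] = random.randrange(256)
    return bytes(data)

random.seed(0)
corpus, seen, crash = [b'AAA'], set(), None
for _ in range(200_000):
    candidate = mutate(random.choice(corpus))
    try:
        edges = target(candidate)
    except RuntimeError:
        crash = candidate
        break
    if edges - seen:             # new coverage: candidate is interesting
        seen |= edges
        corpus.append(candidate)

assert crash == b'FUZ'
```

A purely black-box fuzzer would need roughly \(256^3\) random tries to hit the crash; the coverage feedback solves each byte independently, reaching it in a few thousand executions.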

Structure-aware fuzzing extends coverage-guided fuzzing with grammar-based mutators (e.g., libProtobuf-mutator). Instead of mutating raw bytes, the fuzzer mutates parsed protocol-buffer representations, ensuring mutations remain syntactically valid. This dramatically improves effectiveness for parsers, compilers, and network protocols.

Coverage-guided fuzzers such as AFL have discovered critical vulnerabilities in OpenSSL, GnuTLS, libjpeg, libpng, and dozens of other widely deployed libraries; Heartbleed-class memory disclosures in TLS implementations were likewise found by fuzzing. Fuzzing finds bugs that code review misses because it explores unexpected value combinations and edge cases that human reviewers implicitly assume away.


Chapter 4: Propositional Logic and SAT Solving

Formal verification tools ultimately reduce program correctness questions to logic problems. The simplest and most computationally tractable fragment is propositional logic, which underlies modern SAT solvers — the workhorses of both automated testing (bounded model checking) and formal verification.

Syntax and Semantics

The syntax of propositional logic is defined inductively. Atoms (propositional variables) \(p, q, r, \ldots\) are the base formulas. From atoms, compound formulas are built using connectives: negation \(\neg \varphi\), conjunction \(\varphi \wedge \psi\), disjunction \(\varphi \vee \psi\), implication \(\varphi \to \psi\), and biconditional \(\varphi \leftrightarrow \psi\). Precedence (highest to lowest): \(\neg\), \(\wedge\), \(\vee\), \(\to\), \(\leftrightarrow\).

A truth assignment \(\nu : \mathrm{Atoms} \to \{0,1\}\) gives values to atoms; it extends to compound formulas by the usual truth tables. A formula \(\varphi\) is satisfiable if some assignment makes it true, valid (tautological) if every assignment makes it true, and unsatisfiable if no assignment makes it true. Every valid formula is also satisfiable; each formula is exactly one of valid, satisfiable but not valid, or unsatisfiable. The properties are dual: \(\varphi\) is valid iff \(\neg\varphi\) is unsatisfiable; \(\varphi\) is satisfiable iff \(\neg\varphi\) is not valid.

Normal Forms and the SAT Problem

Conjunctive Normal Form (CNF) represents a formula as a conjunction of clauses, where each clause is a disjunction of literals (atoms or their negations). Every propositional formula has an equisatisfiable CNF via the Tseitin transformation: introduce a fresh variable for each subformula, add clauses relating the fresh variable to its subformula, and conjoin them. The Tseitin transformation produces a CNF of size linear in the original formula, at the cost of introducing auxiliary variables.
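A compact sketch of the Tseitin transformation for and/or/not formulas, checked by brute-force enumeration (the AST encoding and variable naming are our own):

```python
from itertools import product

def tseitin(ast, var_ids, clauses):
    """Return a signed-integer literal standing for subformula `ast`,
    appending the defining clauses for each fresh variable."""
    if isinstance(ast, str):                       # atom
        return var_ids.setdefault(ast, len(var_ids) + 1)
    op = ast[0]
    if op == 'not':                                # negation: flip the literal
        return -tseitin(ast[1], var_ids, clauses)
    a = tseitin(ast[1], var_ids, clauses)
    b = tseitin(ast[2], var_ids, clauses)
    t = len(var_ids) + 1                           # fresh Tseitin variable
    var_ids['_aux%d' % t] = t
    if op == 'and':                                # t <-> (a and b)
        clauses += [[-t, a], [-t, b], [t, -a, -b]]
    else:                                          # 'or': t <-> (a or b)
        clauses += [[-t, a, b], [t, -a], [t, -b]]
    return t

clauses, var_ids = [], {}
root = tseitin(('or', ('and', 'A', 'B'), 'C'), var_ids, clauses)
clauses.append([root])                             # assert the whole formula

def models(cnf, n):
    """Brute-force all satisfying assignments of a CNF over n variables."""
    for bits in product([False, True], repeat=n):
        def val(lit):
            return bits[abs(lit) - 1] if lit > 0 else not bits[abs(lit) - 1]
        if all(any(val(l) for l in c) for c in cnf):
            yield bits

sat = list(models(clauses, len(var_ids)))
A, B, C = var_ids['A'], var_ids['B'], var_ids['C']
assert sat                                         # equisatisfiable: still SAT
assert all((m[A-1] and m[B-1]) or m[C-1] for m in sat)
```

Note the linearity: each binary connective contributes exactly three clauses, regardless of how deeply it is nested.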

CNF is the standard input format for SAT solvers. The SAT problem — given a CNF formula, is it satisfiable? — is NP-complete (Cook-Levin theorem, 1971). Despite worst-case exponential time, modern SAT solvers routinely handle formulas with millions of variables and clauses.

DPLL and CDCL

The DPLL algorithm (Davis-Putnam-Logemann-Loveland, 1962) solves SAT by backtracking search with two powerful simplifications:

  • Unit propagation: if a clause contains exactly one unset literal, that literal must be true; set it and propagate.
  • Pure literal elimination: if a variable appears with only one polarity, set it to satisfy all clauses containing it.

DPLL exhausts the search tree in the worst case but unit propagation prunes large portions early.
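A toy DPLL solver in Python, implementing unit propagation plus backtracking search (pure-literal elimination is omitted for brevity; clauses are lists of DIMACS-style signed integers):

```python
def unit_propagate(clauses, assignment):
    """Repeatedly assign forced literals; return None on conflict."""
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(assignment.get(abs(l)) == (l > 0) for l in clause):
                continue                        # clause already satisfied
            unassigned = [l for l in clause if abs(l) not in assignment]
            if not unassigned:
                return None                     # empty clause: conflict
            if len(unassigned) == 1:
                lit = unassigned[0]             # unit clause: forced literal
                assignment[abs(lit)] = lit > 0
                changed = True
    return assignment

def dpll(clauses, assignment=None):
    assignment = dict(assignment or {})
    if unit_propagate(clauses, assignment) is None:
        return None
    free = {abs(l) for c in clauses for l in c} - set(assignment)
    if not free:
        return assignment                       # all variables decided: SAT
    v = min(free)                               # decision variable
    for value in (True, False):
        model = dpll(clauses, {**assignment, v: value})
        if model is not None:
            return model
    return None

# (x1 or x2) and (not x1 or x3) and (not x3):
# propagation forces x3 = F, then x1 = F, then x2 = T, with no branching.
model = dpll([[1, 2], [-1, 3], [-3]])
assert model == {3: False, 1: False, 2: True}
assert dpll([[1], [-1]]) is None                # UNSAT
```

On this instance the solver never branches: three rounds of unit propagation decide everything, illustrating how propagation prunes the search tree.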

CDCL (Conflict-Driven Clause Learning) extends DPLL with non-chronological backtracking and clause learning. When a conflict (empty clause) is derived, CDCL analyses the implication graph — a DAG recording which assignments caused which unit propagations. It identifies the first unique implication point (1-UIP) and derives a learned clause that captures the generalised reason for the conflict. The solver then backtracks to the assertion level (often far earlier than the conflict level) and adds the learned clause permanently. Learned clauses prune future search, turning CDCL into a dramatically more effective procedure than DPLL for structured industrial instances.

The resolution rule underlies CDCL: from clauses \((A \vee l)\) and \((\neg l \vee B)\), derive \((A \vee B)\). Resolution proofs are refutation-complete: if a CNF is unsatisfiable, a resolution proof of the empty clause exists. The learned clauses in CDCL are exactly intermediate steps in a resolution proof.
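The rule itself is a one-liner over clauses represented as frozensets of signed literals:

```python
def resolve(c1, c2, pivot):
    """Resolve clauses (frozensets of signed-integer literals) on `pivot`."""
    assert pivot in c1 and -pivot in c2
    return (c1 | c2) - {pivot, -pivot}

# From (A or l) and (not l or B), derive (A or B):
A, B, l = 1, 2, 3
assert resolve(frozenset({A, l}), frozenset({-l, B}), l) == frozenset({A, B})
```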

Bounded model checking (BMC) encodes the question “does program \(P\) reach an error state within \(k\) steps?” as a SAT formula. The program’s transition relation is unrolled \(k\) times; the initial state and the negation of the safety property are conjoined. A satisfying assignment is a concrete counterexample trace. CBMC (C Bounded Model Checker) implements this for C programs, encoding pointer arithmetic, array accesses, and type casts as bitvector SAT constraints.


Chapter 5: SMT Solving and First-Order Logic

Propositional logic lacks the vocabulary to reason about program values: integers, arrays, pointers, and floating-point numbers. Satisfiability Modulo Theories (SMT) extends SAT solving with background theories that give semantics to these data types.

First-Order Logic

The syntax of first-order logic (FOL) extends propositional logic with terms and quantifiers. Terms are built from variables, constants, and function symbols: \(x\), \(0\), \(f(x, y)\), \(a[i]\). Atoms are formed by applying predicate symbols to terms: \(x < y\), \(p(x)\), \(x = y\). Formulas are built from atoms using propositional connectives and quantifiers: \(\forall x.\, \varphi\) and \(\exists x.\, \varphi\). An interpretation assigns a domain, denotations to constants and function symbols, and extensions to predicate symbols; it satisfies a closed formula in the standard Tarskian sense.

Full FOL is undecidable (Church and Turing, independently, 1936): no algorithm decides satisfiability of arbitrary FOL formulas. Validity is only semi-decidable: a complete proof system can enumerate all valid formulas, but no procedure terminates on every non-valid one. However, important decidable fragments exist, and these are the theories exploited by SMT solvers.

Background Theories for Program Reasoning

Theory | Fragment | Use case
Linear arithmetic over integers (QF_LIA) | Quantifier-free linear integer arithmetic | Loop bound checking, array index bounds
Linear arithmetic over reals (QF_LRA) | Quantifier-free linear real arithmetic | Numerical analysis, floating-point approximations
Theory of arrays (QF_AX) | McCarthy's read/write axioms | Heap and array modelling
Theory of bitvectors (QF_BV) | Fixed-width machine integers | Integer overflow, bitwise operations
Uninterpreted functions (QF_UF) | Equality with uninterpreted functions | Abstraction, summarisation

The theory of arrays is axiomatised by McCarthy's read-over-write axioms:

\[ \mathrm{read}(\mathrm{write}(a, i, v), j) = v \quad \text{if } i = j \]

\[ \mathrm{read}(\mathrm{write}(a, i, v), j) = \mathrm{read}(a, j) \quad \text{if } i \neq j \]

Array equality is axiomatised by extensionality: two arrays are equal iff they agree on all indices. These axioms let SMT solvers reason precisely about heap read/write patterns.

The bitvector theory models machine integers of fixed width \(n\) with wraparound arithmetic. It captures the semantics of C integers, including unsigned and signed overflow. This is critical for finding integer-overflow vulnerabilities that the theory of mathematical integers would miss.
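The difference between mathematical and machine integers is easy to demonstrate by masking to a fixed width, which is exactly the semantics QF_BV assigns:

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1

def bv_add(a, b):
    """Unsigned 32-bit addition with wraparound, as in QF_BV."""
    return (a + b) & MASK

def to_signed(x):
    """Reinterpret a 32-bit pattern as a two's-complement value."""
    return x - (1 << WIDTH) if x >> (WIDTH - 1) else x

# Mathematical integers would give 2**31, but machine arithmetic wraps
# into the sign bit: a signed-overflow bug QF_BV catches and LIA misses.
assert bv_add(2**31 - 1, 1) == 2**31
assert to_signed(bv_add(2**31 - 1, 1)) == -2**31
assert bv_add(MASK, 1) == 0                  # unsigned wraparound to zero
```

A bounds check such as `x + 1 > x` is valid over the mathematical integers but falsifiable in QF_BV, which is precisely why overflow bugs slip past integer-arithmetic reasoning.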

The DPLL(T) Architecture

SMT solvers implement the DPLL(T) architecture: a CDCL SAT solver at the propositional skeleton level, communicating with one or more theory solvers (T-solvers) via an interface. The CDCL solver treats theory atoms as propositional variables and assigns them truth values. Periodically it queries the T-solver: “Is this set of theory atoms consistent?” If not, the T-solver returns a theory conflict clause — a minimal subset of the assigned atoms that is theory-inconsistent — which the CDCL solver adds as a learned clause and backtracks on. This two-layer architecture achieves the tractability of SAT solving with the expressiveness of rich background theories.

Z3

Z3 is Microsoft Research’s open-source SMT solver, widely used in academic and industrial verification tools. Z3 accepts formulas in SMT-LIB 2 syntax and provides APIs in Python, C, Java, and .NET. In Python:

from z3 import *
x, y = Ints('x y')
s = Solver()
s.add(x + y == 10, x - y == 4)
print(s.check())   # sat
print(s.model())   # [y = 3, x = 7]

Z3’s tactics system allows users to compose solving strategies: simplification, bit-blasting (reducing bitvectors to propositional SAT), arithmetic preprocessing, and theory combination. Z3 is the backend SMT solver for Dafny, KLEE, CBMC, and many other tools covered in this course.


Chapter 6: Symbolic Execution

Symbolic execution is a program analysis technique that executes a program not on concrete input values but on symbolic values — formal expressions representing all possible values an input variable might take. The technique was originally proposed by King (1976) and has undergone a renaissance since 2005 driven by efficient SMT solvers.

The Core Idea

A symbolic executor maintains a symbolic state: a mapping from program variables to symbolic expressions, and a path condition (PC) — a conjunction of constraints over symbolic inputs that characterises the set of concrete inputs that would follow the current execution path. At every conditional branch \(\mathrm{if}\ (e)\ \mathrm{then}\ S_1\ \mathrm{else}\ S_2\), the executor forks: one branch adds \(e\) to the PC and explores \(S_1\); the other adds \(\neg e\) and explores \(S_2\). Before forking, it queries an SMT solver to check whether each branch’s path condition is satisfiable; infeasible paths are pruned. At each program path’s end (or at a safety check), the executor asks the SMT solver to produce a concrete witness — a test input that exercises that path.

Static vs. Dynamic Symbolic Execution

Static (offline) symbolic execution explores all paths simultaneously, building a tree of symbolic states. It is complete in the sense that no feasible path is missed, but it suffers from path explosion: a program with \(n\) independent binary branches has \(2^n\) paths. Loops further exacerbate this; unbounded loops require either loop summarisation or depth bounds.

Dynamic symbolic execution (concolic execution) interleaves concrete and symbolic execution. The term concolic merges “concrete” and “symbolic”. The approach (DART, Godefroid et al. 2005; CUTE; SAGE):

  1. Execute concretely with a random input, collecting the symbolic path condition.
  2. Negate the last branch condition to obtain a new PC targeting the alternate branch.
  3. Solve the new PC to obtain a concrete input for the next iteration.
  4. Repeat, systematically covering all branches.

Concolic execution avoids the path explosion of static analysis by running only one path at a time, while its systematic branch negation provides better coverage than random fuzzing.
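The four-step loop can be sketched with a hypothetical instrumented program. A brute-force search over a small input domain stands in for the SMT solver, but the negate-a-branch-and-solve structure is the same:

```python
def program(x, path):
    """Instrumented program under test: records each branch outcome."""
    c1 = x > 10
    path.append(('x > 10', c1))
    if c1:
        c2 = x % 7 == 3
        path.append(('x % 7 == 3', c2))
        if c2:
            return "bug"
    return "ok"

def run(x):
    path = []
    return program(x, path), path

def solve(prefix):
    """Stand-in for the SMT query: find an input whose path condition
    starts with `prefix` (brute force over a small domain)."""
    for x in range(200):
        if run(x)[1][:len(prefix)] == prefix:
            return x
    return None

# Concolic loop: run concretely, then negate observed branch conditions.
seen, frontier, found_bug = set(), [0], None
while frontier:
    x = frontier.pop()
    result, path = run(x)
    if result == "bug":
        found_bug = x
    if tuple(path) in seen:
        continue
    seen.add(tuple(path))
    for i in range(len(path)):
        prefix = path[:i] + [(path[i][0], not path[i][1])]
        nxt = solve(prefix)
        if nxt is not None:
            frontier.append(nxt)

assert found_bug is not None and found_bug > 10 and found_bug % 7 == 3
```

Random fuzzing would hit the guarded `"bug"` path on roughly one input in fourteen of those above 10; the concolic loop reaches it in three runs by solving each negated branch directly.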

KLEE

KLEE is an open-source symbolic execution engine built on the LLVM compiler infrastructure. KLEE operates on LLVM bitcode, tracking symbolic values through all LLVM instructions. It handles:

  • System calls: KLEE models system calls symbolically (e.g., treating file read results as symbolic) or concretises them.
  • Memory: KLEE handles symbolic pointer dereferences by forking on all plausible memory objects.
  • Test generation: At every path’s end, KLEE uses Z3 to generate a concrete input realising that path’s constraints and emits it as a test case.

Cadar et al. (2008) applied KLEE to the GNU coreutils suite (89 command-line utilities) and related codebases, finding 56 serious bugs in total, including long-standing crashes in utilities such as md5sum and seq, while achieving higher coverage than the existing developer test suites.

Challenges in symbolic execution practice:

  • Path explosion: millions of feasible paths in real programs.
  • Constraint complexity: floating-point, cryptographic operations, and hash functions produce constraints that SMT solvers cannot solve efficiently.
  • Environment modelling: real programs interact with the OS (file system, network, environment variables); perfect symbolic models are difficult to construct.
  • Symbolic pointers: a pointer whose target is symbolic can alias any of many objects, requiring a fork per possible alias.

Applications and Hybrid Approaches

Symbolic execution is used for automatic test generation (input that reaches every branch), vulnerability finding (path condition implying out-of-bounds array access or null dereference), and whitebox fuzzing (SAGE at Microsoft finds deep bugs in image parsers, PDF readers, and media codecs by combining concolic execution with fuzzing at scale).

The comparison between fuzzing and symbolic execution is not a competition: they are complementary. Fuzzing is fast and scalable but blind to deep branches guarded by complex conditions. Symbolic execution handles those conditions precisely but scales poorly. Hybrid tools such as Driller and SymCC use fuzzing for broad coverage and symbolic execution to break through fuzzing barriers, achieving better results than either approach alone.


Chapter 7: Axiomatic Semantics and Program Verification

Formal verification asks the strongest possible question: does this program satisfy its specification for every input? This chapter develops the logical framework — Hoare logic and weakest preconditions — that underlies modern deductive verification tools.

Three Semantic Frameworks

Operational semantics defines program meaning as transition sequences between machine states: a small-step or big-step rewriting system. Denotational semantics maps programs to mathematical functions (typically domain-theoretic continuous functions). Axiomatic semantics defines program meaning in terms of assertions about program states before and after execution. For verification, axiomatic semantics is the most direct: it gives us a calculus for proving program properties without enumerating executions.

Hoare Triples

A Hoare triple \(\{P\}\ C\ \{Q\}\) asserts: if the precondition \(P\) holds before executing command \(C\), and \(C\) terminates, then the postcondition \(Q\) holds after \(C\). This is partial correctness — it does not guarantee termination. Total correctness \([P]\ C\ [Q]\) adds the requirement that \(C\) terminates whenever \(P\) holds.

The core Hoare logic rules are:

\[ \{P[E/x]\}\ x := E\ \{P\} \]

where \(P[E/x]\) denotes \(P\) with all free occurrences of \(x\) replaced by expression \(E\). The precondition is obtained by substituting \(E\) for \(x\) in the postcondition: whatever \(P\) asserts about \(x\) after the assignment must hold of \(E\) before it.

\[ \frac{\{P\}\ C_1\ \{R\} \quad \{R\}\ C_2\ \{Q\}}{\{P\}\ C_1;C_2\ \{Q\}} \quad \text{(sequence)} \]

\[ \frac{\{P \wedge b\}\ C_1\ \{Q\} \quad \{P \wedge \neg b\}\ C_2\ \{Q\}}{\{P\}\ \mathrm{if}\ b\ \mathrm{then}\ C_1\ \mathrm{else}\ C_2\ \{Q\}} \quad \text{(conditional)} \]

\[ \frac{\{I \wedge b\}\ C\ \{I\}}{\{I\}\ \mathrm{while}\ b\ \mathrm{do}\ C\ \{I \wedge \neg b\}} \quad \text{(while)} \]

\[ \frac{P \Rightarrow P' \quad \{P'\}\ C\ \{Q'\} \quad Q' \Rightarrow Q}{\{P\}\ C\ \{Q\}} \quad \text{(consequence)} \]

Loop Invariants

Loop invariants are the hardest and most critical part of Hoare logic proofs. A valid loop invariant \(I\) must satisfy three conditions simultaneously: (1) initiation — \(I\) holds before the first iteration; (2) maintenance — if \(I\) and the loop guard hold before an iteration, then \(I\) holds after; (3) utility — the invariant together with the negated guard implies the postcondition. Finding the right invariant is a creative act; it captures the essential inductive structure of the loop’s computation. For a loop computing the sum of an array \(A[0..n-1]\), the invariant \(s = \sum_{j=0}^{i-1} A[j]\) encodes precisely what has been accumulated.
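The three conditions can be exercised dynamically by asserting the invariant at the loop head. A sketch for the array-sum example:

```python
def array_sum(A):
    s, i = 0, 0
    # Invariant: s == sum(A[:i]), checked at run time on every iteration.
    while i < len(A):
        assert s == sum(A[:i])     # invariant holds entering the body
        s += A[i]
        i += 1
        assert s == sum(A[:i])     # ...and is maintained by the body
    # Utility: invariant plus the negated guard (i == len(A)) gives the post.
    assert s == sum(A)
    return s

assert array_sum([3, 1, 4, 1, 5]) == 14
```

Runtime checks like these only test the invariant on executed inputs; a verifier discharges the same three conditions for all inputs.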

Weakest Preconditions

The weakest precondition \(\mathrm{wp}(C, Q)\) is the weakest assertion \(P\) such that \(\{P\}\ C\ \{Q\}\) holds. It is a predicate transformer: a function from postconditions to preconditions. The rules mirror the Hoare rules:

\[ \mathrm{wp}(x := E,\ Q) = Q[E/x] \]

\[ \mathrm{wp}(C_1;C_2,\ Q) = \mathrm{wp}(C_1,\ \mathrm{wp}(C_2, Q)) \]

\[ \mathrm{wp}(\mathrm{if}\ b\ \mathrm{then}\ C_1\ \mathrm{else}\ C_2,\ Q) = (b \Rightarrow \mathrm{wp}(C_1,Q)) \wedge (\neg b \Rightarrow \mathrm{wp}(C_2, Q)) \]

\[ \mathrm{wp}(\mathrm{while}\ b\ \mathrm{do}\ C,\ Q) = I \quad \text{(given loop invariant } I\text{; initiation, maintenance, and exit are checked as separate conditions)} \]

The wp calculus provides a mechanical procedure for generating verification conditions from annotated programs.
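Because wp is a predicate transformer, it can be written directly as a higher-order function over state predicates. A sketch for assignment, sequence, and conditional (the tuple command encoding is our own):

```python
def wp(cmd, Q):
    """wp as a predicate transformer over state dictionaries.
    Commands: ('assign', var, expr), ('seq', c1, c2), ('if', b, c1, c2)."""
    kind = cmd[0]
    if kind == 'assign':
        _, var, expr = cmd
        return lambda s: Q({**s, var: expr(s)})        # Q[E/x]
    if kind == 'seq':
        _, c1, c2 = cmd
        return wp(c1, wp(c2, Q))
    if kind == 'if':
        _, b, c1, c2 = cmd
        return lambda s: (((not b(s)) or wp(c1, Q)(s))
                          and (b(s) or wp(c2, Q)(s)))
    raise ValueError(kind)

# Absolute-value program: if x < 0 then x := -x else x := x
prog = ('if', lambda s: s['x'] < 0,
        ('assign', 'x', lambda s: -s['x']),
        ('assign', 'x', lambda s: s['x']))
post = lambda s: s['x'] >= 0

pre = wp(prog, post)                # here wp(prog, x >= 0) is `true`
assert all(pre({'x': x}) for x in range(-10, 11))
```

A verifier does the same computation symbolically, producing a formula rather than a testable Python predicate, and then asks an SMT solver whether the precondition implies it.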

Dafny

Dafny is a verification-aware programming language developed at Microsoft Research. Programs are written in an imperative style with specification annotations that Dafny verifies statically using Z3. Key annotations:

  • requires P — precondition on method inputs
  • ensures Q — postcondition guaranteed on method outputs
  • invariant I — loop invariant for a while loop
  • decreases e — termination measure (must decrease on each iteration and remain non-negative)
  • Ghost variables and ghost methods — exist only for verification purposes

When a Dafny program is compiled, Dafny generates verification conditions (VCs) — formulas that, if valid, imply program correctness — and submits them to Z3. If Z3 finds a counterexample, Dafny reports a verification error with a concrete failing state. Writing correct Dafny programs requires providing loop invariants and termination measures; the reward is a machine-checked proof of correctness.


Chapter 8: Verification Condition Generation

Verification condition (VC) generation is the algorithmic bridge between annotated programs and the logical formulas that SMT solvers discharge. The process transforms a program with specification annotations into a set of first-order logic formulas: if all of them are valid, the program is correct with respect to its specification.

VC Generation via Weakest Preconditions

For a command \(C\) annotated with precondition \(P\) and postcondition \(Q\), the central verification condition is

\[ P \Rightarrow \mathrm{wp}(C, Q) \]

This formula must be valid. For loop-free programs, wp is computed structurally using the rules from Chapter 7. For loops, the annotated loop invariant \(I\) breaks the formula into three independent VCs:

  1. Initiation: \(P \Rightarrow I\) (invariant holds initially)
  2. Maintenance: \(\{I \wedge b\}\ C_{\mathrm{body}}\ \{I\}\) (invariant preserved by loop body)
  3. Exit: \((I \wedge \neg b) \Rightarrow Q\) (invariant and negated guard imply postcondition)

Each VC is a closed first-order formula submitted to Z3. If Z3 returns SAT for the negation of a VC, it produces a concrete counterexample — an initial state satisfying the premises but violating the conclusion — which guides the programmer to fix their annotation or their code.

Procedure Calls and Modular Verification

For procedures with pre/postconditions, the modular verification discipline generates VCs for each procedure in isolation. A call site to procedure \(f\) with precondition \(\mathrm{pre}_f\) and postcondition \(\mathrm{post}_f\) requires: (1) the call-site satisfies \(\mathrm{pre}_f\); (2) the caller can assume \(\mathrm{post}_f\) after the call. The procedure body is never inlined; only its contract is used. This enables compositional reasoning that scales to large programs.

Invariant Generation

Supplying loop invariants manually is the primary bottleneck of deductive verification. Invariant generation approaches seek to automate this:

Template-based invariant generation fixes the shape of the invariant (e.g., a linear inequality \(a_1 x_1 + \ldots + a_n x_n \leq c\)) and solves for the coefficients using constraint solving. The constraint that the invariant is inductive yields a system of linear arithmetic constraints, solvable with LP or SMT.
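A brute-force miniature of this idea: fix the template \(s = a \cdot i + b\) for a doubling-sum loop and search small integer coefficients for which the template is inductive. Real tools solve for the coefficients symbolically with LP or SMT; the loop, template, and search bounds here are illustrative assumptions.

```python
# Template-based invariant search (toy): find integers a, b such that
#   s == a*i + b
# is an inductive invariant of
#   s := 0; i := 0; while i < n do (s := s + 2; i := i + 1)
def inductive(a, b, bound=8):
    # Initiation: the template holds in the initial state i = 0, s = 0.
    if not 0 == a * 0 + b:
        return False
    # Consecution: from ANY small state on the template, the body
    # re-establishes the template (checked by enumeration, not SMT).
    for i in range(bound):
        s = a * i + b
        if not s + 2 == a * (i + 1) + b:
            return False
    return True

solutions = [(a, b) for a in range(-3, 4) for b in range(-3, 4)
             if inductive(a, b)]
print(solutions)   # → [(2, 0)], i.e. the invariant s == 2*i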

Abstract interpretation (Cousot & Cousot, 1977) computes loop invariants as fixpoints of abstract transition functions. An abstract domain represents sets of program states as elements of a lattice. Common domains include:

| Domain | Precision | Shape |
| --- | --- | --- |
| Intervals | Moderate | \(l \leq x \leq u\) per variable |
| Octagons | Higher | \(\pm x \pm y \leq c\) |
| Polyhedra | Highest | Arbitrary linear inequalities |

The fixpoint computation uses widening operators to ensure termination: widening sacrifices precision to force convergence. The resulting invariant is a sound overapproximation — it includes all reachable states, possibly along with unreachable ones.
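The interval domain with widening fits in a few lines. The sketch below computes a loop-head invariant for `x := 0; while x < 100 do x := x + 1`; the interval representation and widening rule are the standard textbook ones, but this is a hand-rolled illustration, not a real abstract interpreter.

```python
# Interval-domain fixpoint with widening for:
#   x := 0; while x < 100 do x := x + 1
NEG, POS = float("-inf"), float("inf")

def join(a, b):    # least upper bound of two intervals
    return (min(a[0], b[0]), max(a[1], b[1]))

def widen(a, b):   # jump any unstable bound straight to infinity
    lo = a[0] if a[0] <= b[0] else NEG
    hi = a[1] if a[1] >= b[1] else POS
    return (lo, hi)

x = (0, 0)                                 # abstract value at loop head after x := 0
while True:
    guard = (x[0], min(x[1], 99))          # refine with the guard x < 100
    after = (guard[0] + 1, guard[1] + 1)   # abstract effect of x := x + 1
    nxt = widen(x, join(x, after))
    if nxt == x:                           # post-fixpoint reached
        break
    x = nxt

# Widened loop-head invariant, and the exit state (meet with x >= 100):
print(x, (max(x[0], 100), x[1]))           # → (0, inf) (100, inf)
```

Note the precision loss: widening jumps the upper bound to \(+\infty\) after one unstable iteration. A subsequent narrowing pass (not shown) would recover the bound \(x \leq 100\) at the loop head; the result is sound either way.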

Counterexample-guided abstraction refinement (CEGAR) iteratively refines an abstraction. Starting with a coarse (imprecise) abstract model, CEGAR checks the safety property on the abstract model. If it finds an abstract counterexample, it checks whether the counterexample is realisable in the concrete program. If not (a spurious counterexample), it refines the abstraction to exclude that spurious path and repeats.

Craig Interpolants and Lazy Abstraction

A Craig interpolant for an unsatisfiable conjunction \(A \wedge B\) is a formula \(I\) such that: \(A \Rightarrow I\), \(I \wedge B\) is unsatisfiable, and \(I\) contains only variables common to \(A\) and \(B\). Interpolants serve as intermediate assertions in proofs, and SMT solvers can extract them from UNSAT proofs. The IMPACT algorithm (lazy abstraction with interpolants) uses interpolants derived from infeasible abstract counterexamples as new invariants for abstraction refinement, combining CEGAR with automatic invariant generation.

Bounded Model Checking and k-Induction

Bounded model checking (BMC) with bound \(k\) encodes the question “does a safety property fail within \(k\) steps?” as a SAT/SMT formula. The transition relation is unrolled \(k\) times and conjoined with the negated safety property. BMC is complete for finding bugs within \(k\) steps but cannot prove safety for arbitrary \(k\).

k-induction extends BMC to unbounded proofs. It combines a base case (BMC up to \(k\)) with an inductive step: if the safety property holds for \(k\) consecutive states, it holds for the \((k+1)\)-th as well. For many practical programs, k-induction (with \(k\) small) provides complete safety proofs without requiring manual invariant annotation.
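Both checks can be sketched explicitly on a tiny finite-state system, with enumeration standing in for the SAT/SMT encoding. The transition system, property, and state bound below are hypothetical.

```python
# Explicit-state sketch of BMC and k-induction.
# System: x' = x + 1 if x < 3 else 0, init x = 0; safety property P: x != 7.
STATES = range(16)                 # small finite state space stands in for bit-vectors
init   = lambda x: x == 0
step   = lambda x: x + 1 if x < 3 else 0
P      = lambda x: x != 7

def bmc(k):
    """Does P fail within k steps from some initial state?"""
    for x0 in filter(init, STATES):
        x = x0
        for _ in range(k + 1):
            if not P(x):
                return True        # concrete counterexample trace exists
            x = step(x)
    return False

def k_induction(k):
    """Base case: BMC up to k finds no failure. Inductive step: k
    consecutive P-states (from ANY state, reachable or not) are
    always followed by a P-state."""
    if bmc(k):
        return False
    for x0 in STATES:              # arbitrary start over-approximates reachability
        xs = [x0]
        for _ in range(k):
            xs.append(step(xs[-1]))
        if all(P(x) for x in xs[:-1]) and not P(xs[-1]):
            return False           # inductive step fails at this k; try larger k
    return True

print(bmc(10), k_induction(1))     # → False True: no bug in 10 steps, proved safe
```

Because the inductive step quantifies over arbitrary (possibly unreachable) states, k-induction can fail for small \(k\) on safe systems; increasing \(k\), or strengthening the property with an auxiliary invariant, restores provability.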


Chapter 9: Program Verification Tools and Practical Applications

This chapter surveys industrial and research tools that operationalise the formal techniques of the preceding chapters. Understanding these tools — their assumptions, their limitations, and their deployment contexts — is essential for applying formal methods to real engineering problems.

CBMC

The C Bounded Model Checker (CBMC) applies BMC to ANSI C and C++ programs. CBMC translates the program (with all its pointer arithmetic, type casts, and bitwise operations) into static single assignment (SSA) form, unrolls all loops up to a specified bound, and generates a SAT/SMT formula whose satisfiability corresponds to the existence of a bug. CBMC can check array out-of-bounds access, pointer validity, integer overflow, division by zero, and user-specified assertions. It produces concrete counterexample traces (input values plus execution path) when bugs are found. CBMC has been applied to verify automotive (ISO 26262) and aerospace software; its main limitation is the loop bound — properties requiring large unrolling depths may exceed solver capacity.

Frama-C and Why3

Frama-C is a modular static analysis framework for C, developed at CEA LIST and Inria. Its WP (weakest precondition) plugin implements the VC generation approach of Chapter 8: the programmer annotates C code in ACSL (ANSI/ISO C Specification Language) with preconditions, postconditions, and loop invariants; the WP plugin generates VCs and dispatches them to SMT solvers including Z3, Alt-Ergo, and CVC5. Frama-C’s Value Analysis plugin implements abstract interpretation with the interval domain for value range analysis. Why3 provides a platform-independent VC generation infrastructure used by several verification tools as a common backend.

SLAM and Software Device Driver Verification

SLAM (Microsoft Research) was the first industrial application of predicate abstraction and CEGAR at scale. Applied to Windows device drivers — some of the most bug-prone code in the operating system — SLAM models the driver as a Boolean program whose predicates capture relevant properties (e.g., “the lock is held”, “the device is in state X”). The CEGAR loop refines the predicate set until the abstract model is either proved safe or a concrete counterexample is found. SLAM’s successor, SDV (Static Driver Verifier), is integrated into the Windows Driver Development Kit and has found thousands of driver bugs before Windows release.

Facebook Infer

Infer is an interprocedural static analyser developed at Facebook/Meta, now open-source. It targets Java, C, C++, and Objective-C. Infer’s core innovation is bi-abduction: a generalisation of abduction that simultaneously infers the frame (the part of the heap not modified by a call) and the missing precondition (heap cells the procedure needs but hasn’t been given). Bi-abduction enables Infer to compute procedure summaries fully automatically, without user annotations, and compose them interprocedurally. Infer is compositional: it analyses each procedure once, stores a summary, and reuses it at call sites — enabling analysis of millions of lines of code in continuous integration at Facebook. Infer detects null-pointer dereferences, resource leaks, and use-after-free bugs. It is “soundy” (intentionally unsound in controlled ways): it sacrifices soundness — some real bugs may be missed — to keep false-positive rates low enough for developer tooling.

Abstract Interpretation in Practice: Astrée

The theoretical framework of abstract interpretation (Cousot & Cousot, 1977) reached an industrial milestone with the Astrée static analyser, applied to the Airbus A340 primary flight control software. Astrée analysed approximately 132,000 lines of automatically generated C code (from synchronous dataflow models) using a combination of abstract domains (intervals, octagons, congruences, abstract pointer domains) tuned to this code. The 2003 analysis produced zero false alarms — proving the absence of runtime errors including array out-of-bounds access, division by zero, and integer overflow — and Astrée was subsequently applied to the larger A380 flight control code. This remains one of the most impressive industrial applications of formal methods.

The Verification Gap

No single technique provides a complete solution to software quality. The following table positions the tools discussed in this course:

| Technique | Automation | Completeness | Scalability | Best for |
| --- | --- | --- | --- | --- |
| Testing + coverage | High | No | Very high | Regression, integration |
| Fuzzing (AFL, libFuzzer) | High | No | High | Memory safety, crashes |
| Sanitisers (ASan, TSan) | High | No (limited by input coverage) | High | Bug detection in CI/CD |
| Symbolic execution (KLEE) | Medium | Path-complete (bounded) | Medium | Deep path coverage |
| BMC (CBMC) | High | Bounded | Medium | Shallow safety properties |
| Deductive (Dafny, Frama-C) | Low | Yes (with annotations) | Low-Medium | Critical components |
| Abstract interpretation (Astrée) | High | Sound | Medium-High | Certified embedded code |

The mature engineering process combines these techniques. Testing and fuzzing handle the bulk of quality assurance at low cost. Sanitisers catch memory safety errors that testing might miss. Symbolic execution targets hard-to-reach paths. Formal verification is reserved for the small, critical components where correctness guarantees are worth the annotation effort. Understanding this trade-off — what each tool promises, what it costs, and where it fails — is the mark of a competent software engineer.

