CO 255: Introduction to Optimization (Advanced)

Levent Tuncel


These notes are based on Felix Zhou’s course notes. Additional explanations, visualizations, and examples have been added for clarity.

Chapter 1: Introduction and Problem Formulations

Optimization is the mathematical backbone of modern engineering, economics, and data science. At its core, an optimization problem asks: over a set of feasible choices, which one achieves the best objective? This course focuses primarily on linear programs — the most tractable class of optimization problems — but places them in the broader context of convex programs and integer programs. Understanding when a problem is “easy” (solvable in polynomial time) versus “hard” (NP-complete) is one of the course’s major themes.

Definition (Optimization Problem). An optimization problem consists of an objective function \(f: S \to \mathbb{R}\), a feasible region \(S\), and a variable \(x\). We seek to minimize or maximize \(f(x)\) over \(x \in S\). Since \(\max f(x) = -\min(-f(x))\), minimization is general enough, and we often phrase everything as a minimization problem.


1.1 Outcomes of an Optimization Problem

Every optimization problem falls into exactly one of four categories:

  • Infeasible: \(S = \emptyset\). No solution satisfies all constraints.
  • Optimal: There exists \(x_0 \in S\) with \(f(x_0) \le f(x)\) for all \(x \in S\). We call \(x_0\) an optimal solution and \(f(x_0)\) the optimal value.
  • Unbounded: \(S \ne \emptyset\) but for every \(M \in \mathbb{R}\) there exists \(x \in S\) with \(f(x) < M\). The objective can be made arbitrarily small.
  • Feasible but no optimum attained: \(S \ne \emptyset\) and the infimum is finite but not achieved (unusual in LP but possible in NLP).

1.2 Classes of Optimization Problems

1.2.1 Linear Programs

Linear programs are the workhorse of applied optimization: they are rich enough to model most real-world allocation problems, yet tractable enough to solve in polynomial time. Their defining feature — linearity — allows a complete geometric and algebraic theory.

Definition 2.1.1 (Affine Function). \(f : \mathbb{R}^n \to \mathbb{R}\) is affine if \(f(x) = \alpha^T x + \beta\) for some \(\alpha \in \mathbb{R}^n,\ \beta \in \mathbb{R}\).

Definition 2.1.2 (Linear Function). An affine function with \(\beta = 0\).

Definition 2.1.3 (Linear Constraint). An inequality of the form \(f(x) \le g(x)\), \(f(x) = g(x)\), or \(f(x) \ge g(x)\) where \(f, g\) are affine.

Definition 2.1.4 (Linear Program). An optimization problem such that the objective function is affine, the variables satisfy \(x \in \mathbb{R}^n\), and the variables are subject to finitely many linear constraints. A feasible solution is any \(\bar x \in \mathbb{R}^n\) satisfying the constraints; an optimal solution is a feasible solution achieving the best (maximal or minimal, according to the objective direction) objective value; the feasible region is the set of all feasible solutions.

Definition 2.1.5 (Standard Inequality Form). \(\max\ c^T x\) subject to \(Ax \le b,\ x \ge 0\).

Definition 2.1.6 (Standard Equality Form). \(\max\ c^T x\) subject to \(Ax = b,\ x \ge 0\).

Proposition 2.1.1. The Standard Inequality Form and Standard Equality Form are general enough to express any linear program.

The two forms are interconvertible: introduce slack variables \(x_{n+1}, \ldots\) to turn SIF into SEF; conversely, reverse \(\ge\) constraints by multiplying by \(-1\), split each equality into two inequalities, and replace any free variable \(x_j\) by \(x_j^+ - x_j^-\) with \(x_j^+, x_j^- \ge 0\) to turn any LP into SIF.
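The SIF-to-SEF direction is mechanical enough to script. A minimal sketch (Python with numpy; the example matrices are illustrative, not from the notes):

```python
import numpy as np

def sif_to_sef(A, b, c):
    """Convert max c^T x s.t. Ax <= b, x >= 0 (SIF) into
    max c'^T x' s.t. A'x' = b, x' >= 0 (SEF) by appending one
    slack variable per inequality row."""
    m, n = A.shape
    A_eq = np.hstack([A, np.eye(m)])          # [A | I] x' = b
    c_eq = np.concatenate([c, np.zeros(m)])   # slacks cost nothing
    return A_eq, b, c_eq

# Example: two inequalities in two variables.
A = np.array([[1.0, 2.0], [3.0, 1.0]])
b = np.array([4.0, 6.0])
c = np.array([1.0, 1.0])
A_eq, b_eq, c_eq = sif_to_sef(A, b, c)
```

The reverse direction (SEF to SIF) splits each equality into a \(\le\) and a \(\ge\) pair, with the \(\ge\) row negated.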

The two standard forms are algebraically equivalent: each can be transformed into the other by introducing slack or surplus variables. Working in SEF simplifies the simplex method’s algebraic description (bases, pivots), while SIF is more natural for duality theory.

Definition (Integer Linear Program). An integer program (IP) is an LP in which some or all variables are restricted to integers; in the pure case, \(\min\ c^T x\) subject to \(Ax \le b,\ x \in \mathbb{Z}^n\). A mixed integer program restricts only a subset of the variables.

Adding integrality constraints makes the problem dramatically harder in general (NP-complete), but for special constraint matrices — such as those arising from network flow and bipartite matching — the LP relaxation automatically produces integer solutions. This is the content of Chapter 5’s results on totally unimodular matrices.

1.2.2 Convex Programs

Convex programs generalize linear programs by allowing curved objective functions and constraints, provided they are all convex. The fundamental insight is that convexity guarantees “no false local minima”: every local minimum of a convex function over a convex set is a global minimum.

Definition (Convex Set). \(S \subseteq \mathbb{R}^n\) is convex if for all \(x, y \in S\) and all \(\lambda \in [0,1]\), \(\lambda x + (1-\lambda)y \in S\). Every point on the line segment between two points in \(S\) also lies in \(S\).

Convexity of the feasible set rules out “holes” or disconnected regions — the feasible region is a single connected chunk, which simplifies both analysis and algorithms.

Definition (Convex Function). \(f: S \to \mathbb{R}\) is convex if for all \(x, y \in S\) and \(\lambda \in [0,1]\): \[ f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y). \]

\(f\) is concave if \(-f\) is convex.

A convex function lies below the chord connecting any two of its points. Geometrically, the “bowl shape” of a convex function ensures that gradient methods always descend toward the global minimum.
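The chord inequality can be probed numerically. A small sketch (plain Python; \(f(x) = x^2\) is an assumed test function, not from the notes):

```python
# Check the convexity inequality f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
# for f(x) = x^2, which is convex.
f = lambda x: x * x

def chord_gap(x, y, lam):
    """lam*f(x) + (1-lam)*f(y) - f(lam*x + (1-lam)*y).
    Nonnegative for every x, y and lam in [0,1] iff f is convex
    (on the sampled pairs)."""
    return lam * f(x) + (1 - lam) * f(y) - f(lam * x + (1 - lam) * y)
```

For instance, `chord_gap(0.0, 2.0, 0.5)` compares the chord midpoint value \(2\) with \(f(1) = 1\), giving a gap of \(1\).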

Definition (Convex Program). Minimize a convex function \(f\) over a convex set \(S\). Linear programs are a special case of convex programs (affine functions are both convex and concave).

Convex programs occupy the “tractable” end of the optimization spectrum: the KKT conditions (Chapter 5) provide necessary and sufficient optimality conditions, and interior-point methods solve them in polynomial time.

Convex program with feasible ellipse and level curves

1.3 Motivating Examples

Transportation Problem

A company has a set \(F\) of distribution centers (facilities) and a set \(C\) of clients. Facility \(i\) can supply at most \(u_i\) units; client \(j\) demands exactly \(d_j\) units; the unit cost of shipping from \(i\) to \(j\) is \(c_{ij}\). We seek a minimum-cost shipping plan.

Let \(x_{ij}\) be the number of units shipped from \(i\) to \(j\). The LP is:

\[ \min \sum_{i \in F}\sum_{j \in C} c_{ij} x_{ij} \]

subject to \(\sum_{j \in C} x_{ij} \le u_i\) for all \(i \in F\), \(\sum_{i \in F} x_{ij} = d_j\) for all \(j \in C\), and \(x_{ij} \ge 0\).
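The model is easy to instantiate. A sketch with hypothetical data (2 facilities, 2 clients; Python with numpy) that checks feasibility and cost of a shipping plan directly, rather than calling an LP solver:

```python
import numpy as np

# Tiny transportation instance (hypothetical data).
u = np.array([5.0, 5.0])           # supplies u_i
d = np.array([4.0, 3.0])           # demands d_j
cost = np.array([[1.0, 2.0],
                 [3.0, 1.0]])      # cost[i, j] = unit cost i -> j

def is_feasible(x):
    """Check the transportation constraints for a shipping plan x."""
    supply_ok = (x.sum(axis=1) <= u + 1e-9).all()   # sum_j x_ij <= u_i
    demand_ok = np.allclose(x.sum(axis=0), d)       # sum_i x_ij  = d_j
    return supply_ok and demand_ok and (x >= -1e-9).all()

def total_cost(x):
    return float((cost * x).sum())

# Each client buys entirely from its cheapest facility; capacities allow it.
plan = np.array([[4.0, 0.0],
                 [0.0, 3.0]])
```

Here `plan` is feasible with cost \(4 \cdot 1 + 3 \cdot 1 = 7\); in this instance no cheaper plan exists because every client already uses its cheapest route.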

Transportation bipartite graph with supply, demand, and costs

2-Player Zero-Sum Game

Given a payoff matrix \(A \in \mathbb{R}^{m \times n}\), player R (row) chooses row \(i\), player C (column) chooses column \(j\), and R pays C the amount \(a_{ij}\). If C plays a randomized strategy \(p\) (probability vector over columns), C’s guaranteed expected payoff is \(\min_i \sum_j a_{ij} p_j\). C’s LP (C-LP) is:

\[ \max\ v \quad \text{s.t.} \quad \sum_j a_{ij} p_j \ge v\ \forall i,\quad \sum_j p_j = 1,\ p \ge 0. \]

Symmetrically, R’s LP (R-LP) minimizes the maximum expected payout over columns. A remarkable consequence of LP duality is that the optimal values of (C-LP) and (R-LP) are equal — this is the minimax theorem.
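The quantity \(\min_i \sum_j a_{ij} p_j\) that (C-LP) maximizes can be evaluated directly. A sketch using the matching-pennies payoff matrix (an assumed example, not from the notes):

```python
import numpy as np

# Matching pennies: R pays C the entry a_ij (hypothetical 2x2 example).
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

def guaranteed_value(A, p):
    """C's guaranteed expected payoff under column mixture p:
    the worst case over R's pure rows, min_i sum_j a_ij p_j."""
    return float((A @ p).min())

uniform = np.array([0.5, 0.5])
lopsided = np.array([0.9, 0.1])
```

(C-LP) searches for the \(p\) maximizing this guarantee; here the uniform mixture guarantees \(0\), while the lopsided mixture guarantees only \(-0.8\).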

Payoff matrix for a 2-player zero-sum game

Job Assignment and Fair Division

The assignment problem — matching \(n\) workers bijectively to \(n\) jobs at minimum total cost — is an IP. Crucially, its LP relaxation (dropping integrality) automatically has integer extreme points because the constraint matrix is totally unimodular (see Chapter 5).

The fair division problem seeks to distribute items among agents to maximize the Nash welfare \(\prod_i (\sum_j u_{ij} x_{ij})\), or equivalently \(\max \sum_i \ln(\sum_j u_{ij} x_{ij})\); since the logarithm of an affine function is concave, this maximizes a concave function over linear constraints — a convex program.


Chapter 2: Linear Programming Geometry

Chapter 1 introduced LP abstractly. Chapter 2 develops the geometric language — polyhedra, vertices, faces — that makes LP theory visual and intuitive. The key insight is that LP feasible regions are polyhedra, and optimal solutions (when they exist) always occur at vertices. This reduces the infinite optimization problem to a finite combinatorial one.

2.1 Polyhedra and Feasible Regions

The feasible region of an LP is defined by finitely many linear inequalities, each cutting the space with a hyperplane and keeping one side. The intersection of all these half-spaces is a polyhedron.

Definition (Polyhedron). A polyhedron \(P \subseteq \mathbb{R}^n\) is the intersection of finitely many closed half-spaces: \(P = \{x \in \mathbb{R}^n : Ax \le b\}\) for some \(A \in \mathbb{R}^{m \times n},\ b \in \mathbb{R}^m\). Polyhedra are convex sets.
Definition (Polytope). A polytope is a bounded polyhedron.

Polytopes are finite, compact objects — all their “directions” are bounded. The simplex method works especially cleanly on polytopes because the number of vertices is finite, though potentially exponentially large.

The feasible region of any LP is a polyhedron. This geometric fact underpins all of LP theory — algorithms like the simplex method walk along the vertices (extreme points) of this polyhedron.

2D LP feasible polytope with level curves and optimal vertex

3D simplex showing vertices, edges, and faces

Extreme points are the “corners” of the polyhedron — the points that cannot be expressed as the midpoint of two other feasible points. These are precisely where optimal solutions live.

Definition (Extreme Point). Let \(S \subseteq \mathbb{R}^n\) be convex. A point \(\bar x \in S\) is an extreme point of \(S\) if it cannot be written as \(\bar x = \tfrac{1}{2}(u + v)\) for distinct \(u, v \in S \setminus \{\bar x\}\). Equivalently, \(\bar x\) is extreme if for every direction \(\alpha \ne 0\), at least one of \(\bar x + \alpha,\ \bar x - \alpha\) lies outside \(S\).

Extreme points are “pinned down” by constraints: they cannot wiggle in any direction while staying feasible. The following theorem characterizes them algebraically, connecting the geometric notion to the rank of the tight constraint matrix.

Definition 2.7.4 (Convex Hull). \(\mathrm{conv}(S)\) for \(S \subseteq \mathbb{R}^n\) is the smallest convex set containing \(S\): \(\mathrm{conv}(S) := \bigcap_{S \subseteq H,\ H \text{ convex}} H\).

Definition 2.7.5 (Convex Combination). For \(x^{(1)}, \ldots, x^{(k)} \in \mathbb{R}^n\), a convex combination is \(\sum_{i=1}^k \lambda_i x^{(i)}\) with \(\sum_{i=1}^k \lambda_i = 1,\ \lambda_i \ge 0\).

Proposition 2.7.4. \(S \subseteq \mathbb{R}^n\) is convex if and only if it contains all convex combinations of its elements.

Corollary 2.7.4.1. The convex hull of \(S\) is the set of all convex combinations of elements of \(S\).

What makes Carathéodory’s theorem (Theorem 2.7.5 below) remarkable is that it bounds the “complexity” of any convex combination: no matter how many points are in \(S\), and no matter how elaborate their convex combinations, you never need more than \(n+1\) of them. In the plane (\(n=2\)), any point in the convex hull of an arbitrary set can be written as a convex combination of just three points — a triangle suffices. In three dimensions, four points (a tetrahedron). This dimensionality bound is tight and beautiful.

The two versions of Carathéodory’s theorem — for convex hulls and for cones — are closely related via a lifting argument. For any set \(S \subseteq \mathbb{R}^n\), define the lifted set \(S^+ = \{[s; 1] : s \in S\} \subseteq \mathbb{R}^{n+1}\). Then \(x \in \operatorname{conv}(S)\) if and only if \([x; 1] \in \operatorname{cone}(S^+)\). This “homogenization” trick — appending a 1 to every vector — converts convex hull membership into cone membership, allowing the cone version of Carathéodory to imply the convex hull version: a point in \(\operatorname{cone}(S^+) \subseteq \mathbb{R}^{n+1}\) is a non-negative combination of at most \(n+1\) linearly independent vectors from \(S^+\), which after projecting back to \(\mathbb{R}^n\) gives a convex combination of at most \(n+1\) points of \(S\). The same lifting device appears throughout the course: Farkas’ Lemma lifts the constraint vectors \(a_i\) to \([a_i; b_i]\) in the proof of Helly’s theorem, and polyhedral cones are compactified via the unit box to extract extreme rays.

Theorem 2.7.5 (Carathéodory, 1907). Let \(S \subseteq \mathbb{R}^n\). Every point in \(\mathrm{conv}(S)\) can be expressed as a convex combination of at most \(n+1\) points in \(S\).

The proof is an elegant application of linear independence: if a convex combination uses more than \(n+1\) points, the coefficients satisfy a linear dependence relation (there are too many for \(\mathbb{R}^n\)), which can be used to eliminate one point while maintaining a valid convex combination. Repeating this reduction arrives at \(n+1\) or fewer points.
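The elimination step in this proof can be carried out numerically. A sketch (Python with numpy; the point set is an assumed example):

```python
import numpy as np

def caratheodory_reduce(points, lam, tol=1e-12):
    """Sketch of the elimination step in the proof of Theorem 2.7.5:
    given p = sum_i lam_i * points[i] with lam >= 0 summing to 1,
    thin the support to at most n+1 points representing the same p."""
    points = np.asarray(points, dtype=float)
    lam = np.asarray(lam, dtype=float).copy()
    n = points.shape[1]
    support = np.flatnonzero(lam > tol)
    while len(support) > n + 1:
        # More than n+1 lifted vectors [x_i; 1] in R^{n+1} must be
        # linearly dependent; read a dependence mu off the SVD nullspace.
        lifted = np.vstack([points[support].T, np.ones(len(support))])
        mu = np.linalg.svd(lifted)[2][-1]
        if mu.max() <= 0:
            mu = -mu                       # ensure some positive entry
        # sum(mu) = 0 (last lifted row is all ones), so lam - t*mu still
        # sums to 1; take the largest t keeping all weights nonnegative.
        pos = mu > tol
        t = (lam[support][pos] / mu[pos]).min()
        lam[support] -= t * mu
        lam[lam < tol] = 0.0               # at least one weight drops out
        support = np.flatnonzero(lam > tol)
    return lam

# Five points in the plane, p their average; at most 3 must suffice.
pts = np.array([[0., 0.], [2., 0.], [0., 2.], [2., 2.], [1., 1.]])
w = np.full(5, 0.2)
p = w @ pts
w_reduced = caratheodory_reduce(pts, w)
```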

The following theorem offers an equivalent, and somewhat surprising, characterization of extreme points. It says \(\bar x\) is extreme if and only if removing it from \(S\) does not destroy convexity. Intuitively, a non-extreme point is a “midpoint” of two others — removing it creates a gap in the line segment connecting those two others, breaking convexity. An extreme point, by contrast, is not “needed” as a midpoint by any pair, so removing it leaves convexity intact.

Theorem 2.7.6. Let \(S \subseteq \mathbb{R}^n\) be convex and let \(\bar x \in S\). Then \(\bar x\) is an extreme point of \(S\) if and only if \(S \setminus \{\bar x\}\) is convex.

Theorem 2.7.7 below is the algebraic heart of the theory of extreme points. It translates the geometric notion — “cannot be expressed as a midpoint” — into a linear-algebraic rank condition that is checkable by computation.

Theorem 2.7.7. Let \(P = \{x \in \mathbb{R}^n : Ax \le b\}\), \(\bar x \in P\), and let \(A^= x \le b^=\) be the constraints tight at \(\bar x\). Then \(\bar x\) is an extreme point of \(P\) if and only if \(\mathrm{rank}(A^=) = n\).

Proof. If \(\operatorname{rank}(A^=) < n\), there exists \(\alpha \ne 0\) with \(A^= \alpha = 0\). For small \(\varepsilon > 0\) both \(\bar x \pm \varepsilon \alpha\) satisfy all constraints, so \(\bar x\) is not extreme. Conversely, if \(\operatorname{rank}(A^=) = n\), then \(\bar x\) is the unique solution to \(A^= x = b^=\); any midpoint expression forces both endpoints to satisfy \(A^= u = A^= v = b^=\), contradicting uniqueness. \(\square\)
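Theorem 2.7.7 turns extremality into a rank computation. A sketch (Python with numpy; the unit-square polyhedron is an assumed example):

```python
import numpy as np

def is_extreme(A, b, x, tol=1e-9):
    """Theorem 2.7.7 as a computation: x in P = {x : Ax <= b} is an
    extreme point iff the rows tight at x have rank n."""
    A, b, x = np.asarray(A, float), np.asarray(b, float), np.asarray(x, float)
    assert (A @ x <= b + tol).all(), "x must be feasible"
    tight = np.abs(A @ x - b) <= tol       # rows of A^= x <= b^=
    if not tight.any():
        return False
    return np.linalg.matrix_rank(A[tight]) == A.shape[1]

# Unit square {0 <= x <= 1} in R^2 written as Ax <= b.
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 1.0, 0.0, 0.0])
```

The corners of the square pass the rank test; a point in the interior of an edge has only one tight row (rank 1 < 2) and fails.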

The corollary gives a useful finiteness bound: since we choose \(n\) linearly independent rows from the \(m\) rows of \(A\) to determine each extreme point, there are at most \(\binom{m}{n}\) extreme points. This bound is exponential, but it is the quantity that controls the worst-case complexity of the simplex method.

Corollary 2.7.8.1. (i) If \(\mathrm{rank}(A) < n\) there are no extreme points. (ii) Every polyhedron has at most \(\binom{m}{n}\) extreme points.
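The \(\binom{m}{n}\) bound suggests a brute-force enumeration: solve each \(n \times n\) subsystem of tight rows and keep the feasible solutions. A sketch (Python with numpy; the triangle is an assumed example) — exponential in general, but fine for tiny instances:

```python
import numpy as np
from itertools import combinations

def extreme_points(A, b, tol=1e-9):
    """Enumerate extreme points of {x : Ax <= b} per Corollary 2.7.8.1:
    try every choice of n rows, solve the n x n system, and keep the
    solutions that are feasible for the full system."""
    m, n = A.shape
    pts = []
    for rows in combinations(range(m), n):
        sub, rhs = A[list(rows)], b[list(rows)]
        if np.linalg.matrix_rank(sub) < n:
            continue                        # rows not independent
        x = np.linalg.solve(sub, rhs)
        if (A @ x <= b + tol).all() and not any(
                np.allclose(x, p) for p in pts):
            pts.append(x)
    return pts

# Triangle x1 >= 0, x2 >= 0, x1 + x2 <= 1.
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])
verts = extreme_points(A, b)
```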

A polyhedron with no extreme points must contain an entire line — it extends infinitely in both directions along some direction. The following definition and proposition capture exactly this dichotomy.

Definition 2.7.8 (Pointed Polyhedron). A polyhedron \(P\) is pointed if it does not contain a line (i.e., there is no \(x \in P\) and \(d \ne 0\) such that \(\{x + \lambda d : \lambda \in \mathbb{R}\} \subseteq P\)).

Proposition 2.7.9. \(P \subseteq \mathbb{R}^n\) is a nonempty pointed polyhedron if and only if \(P\) has an extreme point.

Full rank of tight constraints means \(\bar x\) is uniquely determined by those constraints — it is a “vertex” in the algebraic sense. If the rank falls short, a direction exists along which we can move while staying on all tight constraints, so \(\bar x\) lies in the interior of an edge or face.

Theorem 2.7.10. Let \(P = \{x \in \mathbb{R}^n : Ax \le b\}\) be a polyhedron with no lines. If the LP \(\max\ c^T x,\ x \in P\) has an optimal solution, then it has an optimal solution that is an extreme point of \(P\).

Theorem 2.7.11. Let \(P = \{x \in \mathbb{R}^n : Ax \le b\}\) be a polyhedron and \(\bar x \in P\). Then \(\bar x\) is an extreme point of \(P\) if and only if there exists \(c \in \mathbb{R}^n\) such that \(\bar x\) is the unique optimal solution to \(\max\ c^T x,\ x \in P\).

Two fundamental structural theorems about polytopes follow from the extreme point characterization. The first says a bounded polyhedron is nothing more than the convex hull of its corners; the second says the converse — any finite set of points generates a polytope. Together they close a duality between the “inequality description” and the “vertex description” of polytopes.

Definition 2.7.9 (Polytope). A polytope is a bounded polyhedron.

Theorem 2.7.12. Let \(F \subseteq \mathbb{R}^n\) be a polytope. Then \(F\) is the convex hull of its extreme points.

Theorem 2.7.13. The convex hull of any finite subset of \(\mathbb{R}^n\) is a polytope.

What about unbounded polyhedra? A polyhedron that extends to infinity cannot be described purely by its extreme points — it also has extreme rays, directions in which the feasible region extends without bound. The Minkowski sum allows us to decompose any pointed polyhedron into a compact “core” (a polytope spanned by extreme points) and a cone of extreme rays. This decomposition is the key to understanding unbounded LPs: an LP is unbounded precisely when there exists an extreme ray along which the objective improves.

The Minkowski sum gives us a powerful algebraic operation on sets: adding two sets pointwise. Geometrically, \(S + T\) is obtained by “sliding” a copy of \(T\) to every point of \(S\).

Definition 2.7.10 (Minkowski Sum). \(S + T := \{s + t \in \mathbb{R}^n : s \in S,\ t \in T\}\).
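For finite point sets the Minkowski sum is a direct double loop. A tiny sketch (plain Python; the sets are assumed examples):

```python
def minkowski_sum(S, T):
    """Pointwise sums of two finite point sets in R^n
    (Definition 2.7.10, specialized to finite sets)."""
    return {tuple(si + ti for si, ti in zip(s, t)) for s in S for t in T}

# A horizontal segment plus a vertical segment (as vertex sets)
# yields the vertex set of the unit square.
S = {(0, 0), (1, 0)}
T = {(0, 0), (0, 1)}
square = minkowski_sum(S, T)
```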

Definition 2.7.11 (Polyhedral Cone). A set that is simultaneously a cone and a polyhedron.

The next two theorems — together known as the Minkowski-Weyl theorem — provide the complete structural description of polyhedra. Theorem 2.7.14 handles the pointed case, decomposing any pointed polyhedron into a compact part (a polytope) plus a cone of unbounded directions. Theorem 2.7.15 generalizes to all polyhedra, including those containing lines.

Theorem 2.7.14. Let \(F \subseteq \mathbb{R}^n\) be a nonempty pointed polyhedron. There is a polytope \(P \subseteq \mathbb{R}^n\) and pointed polyhedral cone \(K \subseteq \mathbb{R}^n\) such that \(F = P + K\).

Theorem 2.7.15. \(P\) is a polyhedron if and only if \(P = Q + C\) where \(Q\) is a polytope and \(C\) is a polyhedral cone.

Theorem 2.7.15 is the Minkowski-Weyl theorem: every polyhedron can be built by taking a polytope and “extending” it in the directions of a polyhedral cone. This is the complete structural description of polyhedra. When the polyhedral cone is trivial (just the origin), we recover a polytope. When there is no polytope part (the cone is the whole thing), we have a polyhedral cone. The general case is always a sum of these two extreme cases.

Some polyhedra contain entire lines — they extend to infinity in both directions along some vector. Such polyhedra have no extreme points at all. The lineality space captures exactly which directions allow this bidirectional extension.

The two key structural concepts needed to handle polyhedra containing lines are affine subspaces and lineality spaces. An affine subspace is a “shifted” linear subspace — it passes through a specific point but is otherwise flat. The lineality space is the collection of all directions in which the polyhedron extends without bound in both directions simultaneously; it is the “linear part” of the polyhedron.

Definition 2.7.12 (Affine Subspace). The solution set of \(Ax = b\) for some \(A, b\). Every affine subspace of \(\mathbb{R}^n\) can be written as \(l + \{x : Ax = 0\}\) for some \(l\).

Definition 2.7.13 (Lineality Space). The lineality space of \(P = \{x : Ax \le b\}\) is \(\{d : Ad = 0\}\): the set of directions \(d\) such that \(x + \lambda d \in P\) for every \(x \in P\) and every \(\lambda \in \mathbb{R}\). It is the largest linear subspace of such two-way directions; the largest affine subspace contained in \(P\) is a translate of it.

The culminating structural theorem of this section completes the Minkowski-Weyl decomposition by adding the lineality space. Notice how \(F = P + K + L\) breaks the polyhedron into three orthogonal parts: the bounded core \(P\), the “one-directional” cone \(K\) of extreme rays, and the “two-directional” flat \(L\). This is the most general description of any nonempty polyhedron.

Theorem 2.7.18. Let \(F \subseteq \mathbb{R}^n\) be a nonempty polyhedron. There exist a pointed polyhedral cone \(K \subseteq \mathbb{R}^n\) and a polytope \(P \subseteq \mathbb{R}^n\) such that \(F = P + K + L\) where \(L\) is the lineality space of \(F\).

The full decomposition \(F = P + K + L\) is beautiful: every nonempty polyhedron is a “translated” polytope, extended by a cone of rays and a lineality space of lines. From an LP perspective, the lineality space contains directions that never affect objective value — they are the “neutral” directions. The cone contains directions that may improve the objective (unboundedness), and the polytope contains the bounded part where optimal vertices live.

Theorems 2.7.20 and 2.7.21 are the “vertex description” counterparts to the “halfspace description” of polytopes and cones. They say that the two perspectives — describing a set by the halfspaces that contain it (the \(H\)-representation) or by the vertices and rays that generate it (the \(V\)-representation) — are equivalent. Every polytope can be built as the convex hull of finitely many points, and every polyhedral cone can be built as the conic hull of finitely many rays. Converting between these representations is a fundamental computational problem in LP.

Theorem 2.7.20 (Characterization of Polytopes). \(P\) is a polytope if and only if \(P = \mathrm{conv}(S)\) for a finite set \(S \subseteq \mathbb{R}^n\). Moreover, we may take \(S\) to be the set of extreme points of \(P\).

Theorem 2.7.21. \(P\) is a polyhedral cone if and only if \(P = \mathrm{cone}(S)\) for a finite set \(S \subseteq \mathbb{R}^n\).

Theorems 2.7.12 and 2.7.20 say that a polytope is completely described by its vertices — a finite set of points. Every other feasible point is a convex combination of vertices. This is both a structural fact and the basis for LP optimality theory.

Theorem (Optimum at Extreme Point; restatement of Theorem 2.7.10). If an LP \(\max\{c^T x : x \in P\}\) over a polyhedron \(P\) with no lines has an optimal solution, then it has an optimal solution that is an extreme point of \(P\).
Proof. Take an optimal solution \(\bar x\) maximizing the number of tight constraints. If it is not an extreme point, \(\operatorname{rank}(A^=) < n\), so there is a nonzero \(\alpha\) with \(A^= \alpha = 0\). The optimality of \(\bar x\) forces \(c^T \alpha = 0\), so the whole line \(\{\bar x + \lambda \alpha\}\) has the same objective value. Since \(P\) contains no lines, this line must leave \(P\); moving to the last feasible point along it keeps all current constraints tight and activates at least one more — repeating this argument yields an extreme point with the same objective value. \(\square\)

The proof is a “greedy tightening” argument: start at any optimal point, and if it is not a vertex, slide along the constant-objective face until a new constraint becomes tight. Repeat until full rank — an extreme point — is reached. This justifies the simplex method’s vertex-walking strategy: we need only search among vertices.

2.2 Feasibility: Fourier-Motzkin and Farkas

Having characterized optimal solutions geometrically (vertices of polyhedra), we now address the prior question: is the system \(Ax \le b\) feasible at all? Fourier-Motzkin elimination gives a constructive procedure, and Farkas’ lemma extracts from it a clean algebraic certificate of infeasibility.

Fourier-Motzkin Elimination (FME) is a procedure to determine whether a system \(Ax \le b\) is feasible by successively eliminating variables. Partition rows by the sign of coefficient \(a_{in}\): let \(I^+, I^-, I^0\) index positive, negative, zero entries. For each pair \((k,l) \in I^+ \times I^-\), generate the inequality:

\[ \frac{1}{a_{ln}}\left(b_l - \sum_{j=1}^{n-1} a_{lj}x_j\right) \le \frac{1}{a_{kn}}\left(b_k - \sum_{j=1}^{n-1} a_{kj}x_j\right). \]

The new system in \(n-1\) variables is feasible if and only if the original is. Repeating until no variables remain: the final system is a set of scalar inequalities \(0 \le \gamma_i\); feasibility holds iff all \(\gamma_i \ge 0\). If infeasible, FME produces a certificate \(u \ge 0\) with \(u^T A = 0\) and \(u^T b < 0\), which is exactly the content of Farkas’ Lemma.

Theorem 2.2.2 (Fourier-Motzkin Elimination). Let \(Ax \le b\) be the given system where \(A \in \mathbb{R}^{m \times n}\). Define \(I^+ = \{i : a_{in} > 0\}\), \(I^- = \{i : a_{in} < 0\}\), \(I^0 = \{i : a_{in} = 0\}\). Generate the reduced system \(A'x' \le b'\) in \(n-1\) variables using inequalities from \(I^0\) and the \((k,l)\)-inequalities for all \(k \in I^+,\ l \in I^-\). Then \(Ax \le b\) is feasible if and only if \(A'x' \le b'\) is feasible. After eliminating all variables, the final system has only trivial scalar inequalities \(0 \le \gamma_i\) and is feasible iff all \(\gamma_i \ge 0\).

Proposition 2.2.3. (1) If either \(I^+\) or \(I^-\) is empty, there are no \((k,l)\)-inequalities. (2) Every inequality of \(A'x' \le b'\) is a nonnegative linear combination of inequalities from \(Ax \le b\). (3) \(A'x' \le b'\) has \(n-1\) variables but can have as many as \(|I^+| \cdot |I^-|\) constraints from the new \((k,l)\)-pairs alone, so in the worst case \(O(m^2)\) new constraints per variable eliminated.

The exponential blow-up in FME is severe in practice: after \(k\) rounds of elimination, the system can have up to \(O(m^{2^k})\) inequalities. This makes FME impractical as an algorithm for solving LP — it has theoretical significance (it proves the projection theorem and implies Farkas’ Lemma) but the simplex method is far more efficient. The key theoretical use of FME is that every derived inequality is a certified non-negative combination of original inequalities — so the Farkas certificate \(y\) can be traced back through the elimination steps.
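Despite the blow-up, FME is only a few lines of code for small systems. A sketch of one elimination round and the resulting feasibility test (Python with numpy; the systems below are assumed examples):

```python
import numpy as np

def fm_eliminate_last(A, b):
    """One round of Fourier-Motzkin: eliminate the last variable of
    Ax <= b, returning an equivalent system in n-1 variables."""
    Ipos = [i for i in range(len(b)) if A[i, -1] > 0]
    Ineg = [i for i in range(len(b)) if A[i, -1] < 0]
    Izer = [i for i in range(len(b)) if A[i, -1] == 0]
    rows, rhs = [A[i, :-1] for i in Izer], [b[i] for i in Izer]
    for k in Ipos:
        for l in Ineg:
            # (1/a_ln)(b_l - ...) <= (1/a_kn)(b_k - ...), rearranged into
            # coefficient form on the remaining variables:
            rows.append(A[k, :-1] / A[k, -1] - A[l, :-1] / A[l, -1])
            rhs.append(b[k] / A[k, -1] - b[l] / A[l, -1])
    return (np.array(rows, dtype=float).reshape(len(rhs), A.shape[1] - 1),
            np.array(rhs, dtype=float))

def fm_feasible(A, b):
    """Decide feasibility of Ax <= b by eliminating all variables."""
    A, b = np.asarray(A, float), np.asarray(b, float)
    while A.shape[1] > 0:
        A, b = fm_eliminate_last(A, b)
    return (b >= -1e-9).all()   # final system: 0 <= gamma_i for all i

# Feasible: x1 + x2 <= 1, x1 >= 0, x2 >= 0.
A1 = np.array([[1.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b1 = np.array([1.0, 0.0, 0.0])
```

Adding the constraint \(x_1 + x_2 \ge 3\) (i.e., the row \([-1, -1] \le -3\)) makes the system infeasible, and elimination surfaces a scalar inequality \(0 \le -2\).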

Lemma 2.2.4. \(Ax \le b\) is feasible if and only if \(A'x' \le b'\) is feasible.

Farkas’ lemma is one of the most important results in LP theory. It says feasibility and infeasibility have “certificates” — proofs that can be verified in polynomial time. The great philosophical point is this: if you claim a system is infeasible, how do you convince someone else? You need to exhibit a short proof — a Farkas certificate. The lemma says such a certificate always exists.

There are three variants of Farkas’ lemma, each suited to a different LP form. They all say the same thing in different algebraic disguises: feasibility and infeasibility are dual to each other, and exactly one holds. Understanding which variant applies to which situation is a core skill in LP.

The first variant handles the purest case: linear equations with no sign restrictions on \(x\).

Theorem 2.2.5 (Fundamental Theorem of Linear Algebra). Let \(A \in \mathbb{R}^{m \times n},\ b \in \mathbb{R}^m\). Exactly one of the following holds:

  1. \(\exists x \in \mathbb{R}^n\) with \(Ax = b\).
  2. \(\exists y \in \mathbb{R}^m\) with \(A^T y = 0\) and \(b^T y = -1\).

The key insight here is that alternative (II) is exactly a certificate of linear dependence: \(y\) shows that \(b\) is not in the column span of \(A\). This is precisely the familiar statement from linear algebra about the column space.

The second variant handles inequality systems — the most common form in LP.

Theorem 2.2.6. Let \(A \in \mathbb{R}^{m \times n},\ b \in \mathbb{R}^m\). Exactly one of the following has a solution:

  1. \(x\) such that \(Ax \le b\).
  2. \(u\) such that \(A^T u = 0,\ u \ge 0,\ b^T u < 0\).

Notice how the non-negativity \(u \ge 0\) in alternative (II) corresponds to the fact that the certificate combines the inequalities of (I) in their stated direction — you may scale each inequality by a non-negative factor but not flip it. If you could flip inequalities, you could manufacture contradictions out of thin air; the sign constraint on \(u\) prevents this.

The third variant is the most directly useful for LP: it handles inequality systems with non-negative variables, matching the Standard Inequality Form of LP. What makes this remarkable is that the certificate \(y\) in alternative (II) provides an explicit linear combination of the constraints that “derives” a contradiction — it witnesses infeasibility with arithmetic you can check by hand.

Theorem 2.2.7 (Farkas’ Lemma). Let \(A \in \mathbb{R}^{m \times n},\ b \in \mathbb{R}^m\). Exactly one of the following holds:

  1. \(\exists x \ge 0\) with \(Ax \le b\).
  2. \(\exists y \ge 0\) with \(A^T y \ge 0\) and \(b^T y < 0\).
Proof (sketch). If both (I) and (II) held, then \(0 > b^T y \ge (Ax)^T y = x^T (A^T y) \ge 0\), a contradiction. If (I) fails, Fourier-Motzkin Elimination derives \(0 \le \gamma < 0\) from nonnegative combinations of the rows, yielding the certificate \(y\) for (II). \(\square\)

The proof shows mutual exclusivity by a “dot product sandwich” argument: if (I) holds, then any \(y\) satisfying the constraints in (II) would give \(0 > b^T y \geq 0\), a contradiction. The existence of (II) when (I) fails follows from FME.

Geometric interpretation: \(b\) lies in the cone generated by the columns \(A_1, \ldots, A_n\) of \(A\) if and only if no hyperplane through the origin separates \(b\) from that cone. Farkas’ Lemma says either \(b\) is in the cone (I), or a separating hyperplane with normal \(y\) exists (II). This is the “alternative theorem” of linear algebra: exactly one of the two alternatives holds.
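Either alternative of Theorem 2.2.7 is a polynomially verifiable certificate. A sketch of the verification (Python with numpy; the one-variable system is an assumed example):

```python
import numpy as np

def check_farkas(A, b, x=None, y=None, tol=1e-9):
    """Verify a certificate for Theorem 2.2.7: either a feasible x >= 0
    with Ax <= b (alternative I), or y >= 0 with A^T y >= 0 and
    b^T y < 0 (alternative II)."""
    A, b = np.asarray(A, float), np.asarray(b, float)
    if x is not None:
        x = np.asarray(x, float)
        return bool((x >= -tol).all() and (A @ x <= b + tol).all())
    y = np.asarray(y, float)
    return bool((y >= -tol).all() and (A.T @ y >= -tol).all()
                and b @ y < -tol)

# Infeasible system: x1 <= -1 together with x1 >= 0.
A = np.array([[1.0]])
b = np.array([-1.0])
# y = [1] is a Farkas certificate: A^T y = [1] >= 0 but b^T y = -1 < 0.
```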

2.3 Helly’s Theorem

Helly’s theorem is a remarkable result in combinatorial geometry: the failure of a family of convex sets to share a common point is always witnessed by a small subfamily — small in a size determined only by the ambient dimension, not by the size of the family.

Theorem 2.3.1 (Helly, 1923). Let \(H_1, \ldots, H_m\) be closed half-spaces in \(\mathbb{R}^n\) with \(\bigcap_{i=1}^m H_i = \emptyset\). Then some \(n+1\) of them have empty intersection.

Proof. Write each \(H_i = \{x : a_i^T x \ge b_i\}\). The system \(Ax \ge b\) is infeasible, so by Farkas' Lemma there exists \(y \ge 0\) with \(A^T y = 0\) and \(b^T y = 1\). This says \[ \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \sum_{i=1}^m y_i \begin{bmatrix} a_i \\ b_i \end{bmatrix} \in \mathbb{R}^{n+1}, \] so \([0;1]\) lies in the cone of the lifted vectors \(\{[a_i; b_i] : y_i > 0\}\). By Carathéodory's theorem for cones in \(\mathbb{R}^{n+1}\), some subset \(T \subseteq [m]\) with \(|T| \le n+1\) already achieves this cone membership: there exist \(\mu_i \ge 0\) for \(i \in T\) with \(\sum_{i \in T} \mu_i a_i = 0\) and \(\sum_{i \in T} \mu_i b_i = 1\). By Farkas, this \(\mu_T\) certifies that the subsystem \(\{a_i^T x \ge b_i : i \in T\}\) is infeasible. \(\square\)

The proof is beautifully modular: Farkas converts infeasibility into a cone membership problem in \(\mathbb{R}^{n+1}\), and Carathéodory reduces the cone representation to at most \(n+1\) generators. The bound \(n+1\) in Helly’s theorem is exactly the Carathéodory bound for cones in \(\mathbb{R}^{n+1}\) — these two theorems are dual to each other in a deep sense.

The half-space version extends to general convex sets by a finite approximation argument.

Theorem 2.3.2 (Helly for Convex Sets). Let \(S_1, \ldots, S_m \subseteq \mathbb{R}^n\) be convex sets with \(\bigcap_{i=1}^m S_i = \emptyset\). Then some \(n+1\) of them have empty intersection.

Proof. Suppose toward a contradiction that every subcollection of size \(n+1\) has a common point, and for each such subcollection \(J\) pick a representative point \(x^J \in \bigcap_{i \in J} S_i\). Let \(X\) be the finite set of all representatives, and define the polytopes \(P_i = \operatorname{conv}(S_i \cap X)\) for each \(i\). Every \(n+1\) of the \(P_i\) share a common representative point. Writing each \(P_i\) as a finite intersection of half-spaces, Theorem 2.3.1 applies: if all the \(P_i\) had empty intersection, some \(n+1\) of the defining half-spaces — hence some \(n+1\) of the \(P_i\) — would have empty intersection, a contradiction. So all \(P_i\) share a common point \(\bar x\). But \(\bar x \in P_i \subseteq S_i\) for every \(i\), contradicting \(\bigcap S_i = \emptyset\). \(\square\)

Helly’s theorem fails for infinite families: the sets \(S_k = [k, \infty)\) have every finite subcollection feasible but empty total intersection. Compactness restores the theorem: if every \(S_i\) is compact and convex, and every \(n+1\) of them have a common point, then the whole infinite family has a common point. This follows from Helly’s theorem together with the finite intersection property of compact sets.
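In dimension one, where \(n + 1 = 2\), Helly’s theorem says that pairwise-intersecting closed intervals always share a common point. A quick numeric sanity check on illustrative data:

```python
# Helly in dimension n = 1 (illustrative): for closed intervals, if every
# 2 (= n+1) of them intersect pairwise, then all of them intersect.
intervals = [(0, 5), (2, 8), (4, 9), (3, 6)]

def pairwise_intersect(ivs):
    """Do all pairs of closed intervals overlap?"""
    return all(max(a1, a2) <= min(b1, b2)
               for i, (a1, b1) in enumerate(ivs)
               for (a2, b2) in ivs[i+1:])

def common_point(ivs):
    """A point in the intersection of all intervals, or None if empty."""
    lo = max(a for a, _ in ivs)
    hi = min(b for _, b in ivs)
    return lo if lo <= hi else None

assert pairwise_intersect(intervals)
print(common_point(intervals))   # 4: every interval contains the point 4
```

The witness is the point \(\max_i a_i\): if it exceeded some right endpoint \(b_j\), the pair \((i, j)\) would already fail to intersect.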


Chapter 3: LP Duality

Chapter 2 established the geometry: LP optimal solutions occur at vertices of polyhedra. Chapter 3 develops the algebra and algorithms that exploit this structure. The simplex method walks along edges from vertex to vertex, improving the objective at each step. But before the algorithm, we need the dual LP — the “other side” of the problem that provides optimality certificates.

3.1 LP Duality: Motivation

How can we certify the optimality of a feasible solution? By taking nonnegative linear combinations of the constraints: every such combination is a valid inequality, and if some combination bounds the objective function by a value that is attained at a feasible point, that point must be optimal. Example 2.3.1 below carries this out.

This motivates a systematic “dual” LP whose feasible solutions provide upper bounds on the primal objective.

Example 2.3.1 (LP). Consider \(\max\ z(x) := -3x_1 + x_2\) subject to \(x_1 - 2x_2 \le 4,\ -x_1 + x_2 \le 10,\ x \ge 0\), writing \(x \ge 0\) as the rows \(-x_1 \le 0\) and \(-x_2 \le 0\). The combination \((0,1,2,0) \cdot Ax \le (0,1,2,0) \cdot b\) gives \(-3x_1 + x_2 \le 10\). Since \(\bar x^T = (0,10)\) is feasible with \(z(\bar x) = 10\), this proves \(\bar x\) is optimal.
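The same certificate pattern can be verified numerically. Below is a sketch on separate illustrative data (a small two-constraint LP invented for this check, chosen so the arithmetic is easy to follow):

```python
# Certifying optimality by a nonnegative combination of constraints.
# Illustrative LP (not the example above):
#   max 2x1 + x2   s.t.   x1 + x2 <= 6,   x1 <= 4,   x >= 0.
A = [[1, 1], [1, 0]]
b = [6, 4]
c = [2, 1]

y = [1, 1]       # nonnegative multipliers, one per constraint
x_bar = [4, 2]   # candidate optimal solution

# If y^T A dominates c componentwise, then c^T x <= y^T (Ax) <= y^T b
# for every feasible x >= 0, so y^T b is a valid upper bound.
yTA = [y[0]*A[0][j] + y[1]*A[1][j] for j in range(2)]
bound = y[0]*b[0] + y[1]*b[1]
value = c[0]*x_bar[0] + c[1]*x_bar[1]

assert all(yTA[j] >= c[j] for j in range(2))                               # valid combination
assert all(A[i][0]*x_bar[0] + A[i][1]*x_bar[1] <= b[i] for i in range(2))  # feasible
print(bound, value)   # 10 10 -- the bound is attained, so x_bar is optimal
```

Here the multipliers \((1,1)\) reproduce the objective exactly (\(y^T A = c\)), so the bound is tight and \(\bar x = (4,2)\) is certified optimal without solving the LP.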

Definition 2.3.1 (Unbounded). An LP is unbounded if for every \(M \in \mathbb{R}\) there is a feasible \(x\) whose objective value \(z(x)\) is better than \(M\) (larger for a max-LP, smaller for a min-LP). An unbounded LP is in particular feasible.

Definition 2.3.2 (Bounded). \(S \subseteq \mathbb{R}^n\) is bounded if there is some \(M \in \mathbb{R}_{++}\) such that \(S \subseteq [-M, M]^n\).

The dual LP is constructed by associating one variable \(y_i\) with each primal constraint. Multiplying the \(i\)-th constraint by \(y_i \geq 0\) and summing gives an upper bound on the primal objective — and minimizing this upper bound over all such \(y\) is the dual problem.

Definition 2.3.3 (Dual). Let \(A \in \mathbb{R}^{m \times n},\ b \in \mathbb{R}^m,\ c \in \mathbb{R}^n\). The primal LP is \((P): \max\ c^T x\) s.t. \(Ax \le b,\ x \ge 0\). Its dual is \((D): \min\ b^T y\) s.t. \(A^T y \ge c,\ y \ge 0\). Each primal constraint gives rise to a dual variable; each primal variable gives rise to a dual constraint. \((P)\) being a max-LP gives \((D)\) a min-LP.


Each primal constraint \(a_i^T x \le b_i\) gives rise to a dual variable \(y_i \ge 0\); each primal variable \(x_j\) gives rise to a dual constraint \((A^T y)_j \ge c_j\). Taking the dual twice returns the original primal: \(\text{dual}(\text{dual}(P)) = P\).

The dual is another LP. Its feasible solutions certify upper bounds on the primal: if \(b^T y = 17\) for some feasible dual \(y\), then the primal optimum is at most 17. Strong duality says the best (smallest) such certificate exactly matches the primal optimum.

3.2 Weak and Strong Duality

Weak duality is the trivial direction: any feasible dual solution provides an upper bound on the primal.

Theorem 2.3.2 (Weak Duality). Let \((P)\) be a max LP with objective \(c^T x\) and \((D)\) its dual min LP with objective \(b^T y\). Let \(x\) be feasible for \((P)\) and \(y\) feasible for \((D)\). Then \(c^T x \le b^T y\).

Proof. \(c^T x \le y^T A x \le y^T b = b^T y\), using \(A^T y \ge c,\ x \ge 0\) for the first inequality and \(Ax \le b,\ y \ge 0\) for the second. \(\square\)

Corollary 2.3.2.1. If \(\bar x, \bar y\) are feasible for \((P), (D)\) respectively and \(c^T \bar x = b^T \bar y\), then \(\bar x, \bar y\) are both optimal.

Corollary 2.3.2.2. If \((P)\) is unbounded, then \((D)\) is infeasible; symmetrically, if \((D)\) is unbounded, then \((P)\) is infeasible.

Proposition 2.3.3. \(\mathrm{dual}(\mathrm{dual}(P)) = P\).

Weak duality says the primal and dual objectives form a “sandwich”: any primal feasible objective value is at most any dual feasible objective value. But it leaves open the question: how large is the gap? Could the primal optimum be strictly below the dual optimum? Strong duality — one of the most powerful results in all of optimization — says the answer is no. Under feasibility of both sides, the gap is exactly zero.

Notice the elegance of what this achieves. The primal problem asks “how good can we do?” The dual asks “how tight can we certify a bound?” Strong duality says these two processes converge to the same answer. The algorithm (simplex, interior point) and the certificate (dual solution) agree.

Theorem 2.3.4 (Strong Duality). If both \((P)\) and \((D)\) have feasible solutions, they both have optimal solutions \(\bar x, \bar y\) with equal optimal values: \(c^T \bar x = b^T \bar y\).

Proof. Strong duality is a consequence of the LP Uber Theorem, which completely classifies LP outcomes.

Theorem (Uber Theorem). For any LP \((P): \min\ c^T x,\ Ax \ge b\), exactly one of the following holds:

  1. Both \((P)\) and its dual \((D): \max\ b^T y,\ A^T y = c,\ y \ge 0\) have optimal solutions with equal values.
  2. \((P)\) is infeasible: there exists \(\bar y \ge 0\) with \(A^T \bar y = 0\) and \(b^T \bar y > 0\) (Farkas certificate).
  3. \((P)\) is feasible and unbounded: there exists a direction \(\bar d\) with \(A\bar d \ge 0\) and \(c^T \bar d < 0\).
Proof sketch. Consider the system \((\text{I}): Ax \ge b,\ A^T y = c,\ y \ge 0,\ b^T y - c^T x \ge 0\); by weak duality, its solutions are exactly the primal-dual optimal pairs of outcome 1. Alongside it, consider the auxiliary system
\[ \text{(B)}: \quad A^T \bar y = 0,\ \bar y \ge 0,\ A\bar d \ge 0,\ b^T \bar y - c^T \bar d \ge 1. \]

Claim 1: Exactly one of System (I) (primal-dual feasibility with equal objectives) and System (B) is feasible. This follows from a Farkas-type argument: System (B) is feasible if and only if System (I) is infeasible. The argument is a block-Farkas computation.

Claim 2: If System (I) is infeasible, any solution \((\bar y, \bar d)\) to (B) satisfies \(b^T \bar y - c^T \bar d \ge 1 > 0\). If \(b^T \bar y > 0\), then \(\bar y\) (scaled so that \(b^T \bar y = 1\)) certifies outcome 2. Otherwise \(b^T \bar y \le 0\), so \(c^T \bar d \le b^T \bar y - 1 < 0\) and \(A\bar d \ge 0\), giving outcome 3 together with any feasible \(\bar x\); if \((P)\) is infeasible, outcome 2 holds directly.

Strong duality follows immediately: since both \((P)\) and \((D)\) are feasible, outcome 2 is impossible (a Farkas certificate for infeasibility would contradict the existence of a feasible point) and outcome 3 is impossible (an unbounded direction \(A\bar d \ge 0,\ c^T \bar d < 0\) combined with a dual feasible \(y\) gives \(0 \le (A\bar d)^T y = \bar d^T A^T y = c^T \bar d < 0\), a contradiction). So outcome 1 holds. \(\square\)

The Uber Theorem is more powerful than strong duality alone: it gives a complete classification of LP outcomes in a single theorem, with explicit certificates for each case. This makes it the master tool of LP theory — every LP algorithm either finds an outcome-(I) witness (a primal-dual optimal pair) or an outcome-(II)/(III) certificate.

The proof technique is characteristic of LP theory: assume the desired system fails, derive a Farkas certificate, observe it contradicts a known fact, conclude the system must hold. This pattern — “Farkas applied one level up” — repeats throughout the theory.


Lemma 2.3.6. If \((P)\) is feasible and \((D)\) is infeasible, then \((P)\) is unbounded. (Proof via Farkas’ Lemma: there is \(d \ge 0\) with \(Ad \le 0\) and \(c^T d > 0\), so \(x(\lambda) = \bar x + \lambda d\) is feasible for all \(\lambda \ge 0\) with \(c^T x(\lambda) \to \infty\).)

Notice what Lemma 2.3.6 says geometrically: dual infeasibility means no nonneg combination of constraints pins down the objective, so there is an unbounded ray in the primal feasible region along which the objective grows without limit. The Farkas certificate \(d\) is literally this unbounded direction.

Theorem 2.3.7. If \((P)\) has an optimal solution, then \((D)\) also has an optimal solution, and the two optimal values are equal.

Theorem 2.4.1 (Fundamental Theorem of Linear Programming). Every LP is exactly one of: (i) has an optimal solution, (ii) unbounded, (iii) infeasible.


Moreover, if \((P)\) is feasible and \((D)\) is infeasible, then \((P)\) is unbounded.

This “trichotomy” is the complete classification of LP outcomes. The four-way classification from Chapter 1 reduces to three: the “feasible but no optimum attained” case cannot occur for LPs, because a linear objective that is bounded over a nonempty polyhedron attains its optimum (for instance, at a basic feasible solution). The interplay between primal and dual gives a 3×3 grid of possible outcomes for the pair \((P), (D)\), and remarkably only four entries in that grid can actually occur: both optimal, \((P)\) unbounded / \((D)\) infeasible, \((P)\) infeasible / \((D)\) unbounded, and both infeasible. This is a complete characterization.

3.3 Complementary Slackness

Strong duality tells us that at optimality, the primal and dual objective values are equal. But “the two objectives are equal” is a global condition — it is hard to check without already knowing both optima. Complementary slackness translates this global equality into a system of local “on/off” conditions on individual variables and constraints. The key insight is that the equality \(c^T \bar x = b^T \bar y\) forces each term in a sum of nonnegative quantities to be exactly zero, giving sharp pointwise conditions.

In practice, complementary slackness is used in two ways. First, given a candidate pair \((\bar x, \bar y)\), one can verify optimality by checking the CS conditions — no need to solve the LP again. Second, one can use the CS conditions to find the optimal dual given a primal solution, or vice versa: fix one solution, enforce CS conditions, and solve for the other.

Complementary slackness (CS) translates the condition \(c^T \bar x = b^T \bar y\) into a system of “on/off” conditions on the variables and constraints.

Theorem 2.5.1 (Complementary Slackness Theorem). \(\bar x, \bar y\) feasible for \((P), (D)\) respectively are optimal if and only if:

  1. For all \(j\): \(\bar x_j = 0\) or the \(j\)-th dual constraint is tight for \(\bar y\): \((A^T \bar y)_j = c_j\).
  2. For all \(i\): \(\bar y_i = 0\) or the \(i\)-th primal constraint is tight for \(\bar x\): \((A\bar x)_i = b_i\).
Proof. At optimality the chain in the proof of weak duality holds with equality, so \(0 = c^T \bar x - \bar y^T A \bar x = \sum_j (c_j - (A^T \bar y)_j)\bar x_j\). Each term is \(\le 0\) (dual feasibility gives \(c_j - (A^T \bar y)_j \le 0\), and \(\bar x_j \ge 0\)), and the terms sum to zero, so each term is exactly zero. This gives condition 1. Similarly, \(0 = \sum_i \bar y_i((A\bar x)_i - b_i)\) with every term \(\le 0\) gives condition 2. \(\square\)

The proof extracts CS from the “sandwich” equality in weak duality: the sum of nonneg terms equals zero, so every term must be zero individually. This “each term is zero” condition is exactly the complementary slackness. In practice, CS conditions allow us to verify optimality of a claimed solution by checking a list of “either-or” conditions — each variable is either zero, or its corresponding dual constraint is tight.
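The either-or checks can be written down directly. A sketch on small illustrative data (this LP is invented for the demonstration, not taken from the text):

```python
# Checking complementary slackness for the pair
#   (P): max c^T x, Ax <= b, x >= 0   and   (D): min b^T y, A^T y >= c, y >= 0.
A = [[1, 1], [1, 0]]
b = [6, 4]
c = [2, 1]
x = [4, 2]   # primal candidate
y = [1, 1]   # dual candidate

def cs_holds(A, b, c, x, y):
    """Do (x, y) satisfy both complementary slackness conditions?"""
    m, n = len(A), len(A[0])
    # (1) x_j = 0, or the j-th dual constraint is tight
    cond1 = all(x[j] == 0 or sum(A[i][j]*y[i] for i in range(m)) == c[j]
                for j in range(n))
    # (2) y_i = 0, or the i-th primal constraint is tight
    cond2 = all(y[i] == 0 or sum(A[i][j]*x[j] for j in range(n)) == b[i]
                for i in range(m))
    return cond1 and cond2

print(cs_holds(A, b, c, x, y))        # True: (x, y) are both optimal
print(cs_holds(A, b, c, [0, 6], y))   # False: second primal constraint is slack but y_2 > 0
```

The second call shows the diagnostic value of CS: the feasible point \((0, 6)\) is rejected because the constraint \(x_1 \le 4\) is slack there while its multiplier \(y_2 = 1\) is positive.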

The definitions of valid inequalities, cones, and active constraints are the vocabulary of duality geometry. A valid inequality is any constraint that every feasible point satisfies; dual variables generate valid inequalities by taking nonneg combinations of the primal constraints. Active constraints are the ones “touching” the current point — the boundary the point rests against. These concepts connect the algebra of CS conditions to the geometry of extreme points.

Definition 2.5.1 (Valid Inequality). We say that an inequality \(\alpha^T x \le \beta\), \(\alpha, x \in \mathbb{R}^n\), \(\beta \in \mathbb{R}\), is valid for a set \(S \subseteq \mathbb{R}^n\) if \(\alpha^T \bar x \le \beta\) for all \(\bar x \in S\).

Definition 2.5.2 (Cone). \(K \subseteq \mathbb{R}^n\) is a cone if: (1) \(0 \in K\); (2) \(\forall x \in K,\ \forall \lambda \ge 0,\ \lambda x \in K\); (3) \(\forall x, y \in K,\ x + y \in K\). For \(S \subseteq \mathbb{R}^n\), \(\mathrm{cone}(S)\) denotes the smallest cone containing \(S\).

Lemma 2.5.4. The intersection of an arbitrary family of cones is a cone.

Lemma 2.5.5. If \(S \subseteq \mathbb{R}^n\) is finite, \(S = \{a^{(1)}, \ldots, a^{(k)}\}\), then \(\mathrm{cone}(S) = \{x : x = \sum_i \lambda_i a^{(i)},\ \lambda_i \ge 0\}\).

Definition 2.5.3 (Tight/Active). We say \(\alpha^T x \le \beta\) is tight or active at \(\bar x\) if \(\alpha^T \bar x = \beta\).

Proposition 2.5.6 below is the geometric heart of LP optimality. It says a feasible point is optimal precisely when the objective direction \(c\) lies in the cone generated by the active constraint normals. Intuitively: if you stand at a corner of the feasible region and the “uphill” direction of the objective is blocked by the walls you are touching, you are at the optimum. If the objective points “outside” the cone of active walls, you can move along the boundary and improve.

Proposition 2.5.6 (Geometric Interpretation of Duality). Let \((P): \max\ c^T x\) s.t. \(Ax \le b\), and let \(\bar x\) be feasible. Let \(A^= x \le b^=\) be the constraints tight at \(\bar x\). Then \(\bar x\) is an optimal solution to \((P)\) if and only if \(c \in \mathrm{cone}(\text{rows of } A^=)\).

Proposition 2.5.8 shows that strong duality subsumes Farkas’ lemma — the two results are not independent but are part of the same duality fabric. Farkas is “strong duality for the trivial LP.”

Proposition 2.5.8 (Strong Duality Implies Farkas’ Lemma). Consider \((P): \max\ 0^T x\) s.t. \(Ax \le b,\ x \ge 0\). This is infeasible or has optimal value 0. Its dual \((D): \min\ b^T y\) s.t. \(y^T A \ge 0,\ y \ge 0\) is feasible with \(\bar y = 0\). Therefore, \((P)\) is infeasible if and only if \((D)\) is unbounded, i.e., there is a dual feasible solution with \(b^T y < 0\).

Physical interpretation: Think of each hyperplane \(a_i^T x = b_i\) as a wall in the feasible region. The objective \(c\) acts like a force on a particle. The particle reaches equilibrium (optimality) when \(c\) is balanced by normal forces from the walls it touches (tight constraints), with the normal force magnitudes being the dual variables. Walls not touching the particle exert no force, so their dual variables are zero.

[Figure: complementary slackness, tight vs. slack constraints at the optimum]

3.4 Duality Table and Shadow Prices

The strong duality theorem and CS conditions extend beyond the standard SIF formulation to general LP forms. The duality table below is a mechanical rule for constructing the dual from any LP: match each primal constraint type to a dual variable type, and vice versa. This table should be memorized — it is the “grammar” of LP duality.

For general LPs the primal-dual correspondence is:

| Primal (max) | Dual (min) |
| --- | --- |
| \(\le\) constraint | \(\ge 0\) variable |
| \(\ge\) constraint | \(\le 0\) variable |
| \(=\) constraint | free variable |
| \(\ge 0\) variable | \(\ge\) constraint |
| \(\le 0\) variable | \(\le\) constraint |
| free variable | \(=\) constraint |
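The table transcribes directly into a mechanical dualization routine. This is a sketch with hypothetical helper names, covering only constraint senses and variable signs (not the data \(A, b, c\)):

```python
# Mechanical transcription of the duality table for a max primal:
# each primal constraint sense determines a dual variable sign,
# and each primal variable sign determines a dual constraint sense.
CONSTRAINT_TO_VARIABLE = {'<=': '>= 0', '>=': '<= 0', '=': 'free'}
VARIABLE_TO_CONSTRAINT = {'>= 0': '>=', '<= 0': '<=', 'free': '='}

def dualize(constraint_senses, variable_signs):
    """Return (dual variable signs, dual constraint senses) for a max primal."""
    dual_vars = [CONSTRAINT_TO_VARIABLE[s] for s in constraint_senses]
    dual_cons = [VARIABLE_TO_CONSTRAINT[v] for v in variable_signs]
    return dual_vars, dual_cons

# Primal: max c^T x s.t. a_1^T x <= b_1, a_2^T x = b_2, with x1 >= 0, x2 free.
dv, dc = dualize(['<=', '='], ['>= 0', 'free'])
print(dv)   # ['>= 0', 'free']
print(dc)   # ['>=', '=']
```

Reading the dual off the table in this way is exactly how one dualizes by hand; the routine just makes the lookup explicit.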

Economic interpretation: In the resource-allocation LP \(\max c^T x\) s.t. \(Ax \le b,\ x \ge 0\), where \(b_i\) is the supply of resource \(i\), the optimal dual variable \(\bar y_i\) is the shadow price — the rate of change of the optimal value with respect to an increase in \(b_i\). This follows directly from strong duality: if \(b_i \to b_i + \varepsilon\), the optimal value changes by approximately \(\varepsilon \bar y_i\).
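The shadow-price claim can be checked by hand on a tiny resource LP. This is a sketch with illustrative data; for this instance the optimal dual \(y = (1, 1)\) can be verified via the CS conditions:

```python
# Shadow prices on an illustrative resource LP:
#   max 2x1 + x2   s.t.   x1 + x2 <= b1,   x1 <= b2,   x >= 0,
# whose optimum (for b near (6, 4)) sits where both constraints are tight,
# with optimal dual y = (1, 1).
def optimal_value(b1, b2):
    # Both constraints tight at the optimum: x1 = b2, x2 = b1 - b2.
    x1, x2 = b2, b1 - b2
    assert x1 >= 0 and x2 >= 0
    return 2*x1 + x2

y = (1, 1)   # optimal dual solution = shadow prices of the two resources
eps = 0.25
base = optimal_value(6, 4)
print(optimal_value(6 + eps, 4) - base)   # 0.25 = eps * y_1
print(optimal_value(6, 4 + eps) - base)   # 0.25 = eps * y_2
```

Each extra unit of either resource raises the optimal value at exactly the rate given by the corresponding dual variable, as long as the perturbation is small enough that the optimal basis does not change.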

LP equivalence is important because the simplex method, duality theory, and Farkas’ lemma are each stated for specific standard forms. Equivalence ensures that all these tools apply regardless of how the LP is originally written. Two LPs are equivalent not just when they have the same feasible region, but when certificates of optimality, infeasibility, and unboundedness translate efficiently between them.

Definition 2.6.1 (LP Equivalence). LPs \((P)\) and \((P')\) are equivalent if: (1) \((P)\) has an optimal solution iff \((P')\) does; (2) \((P)\) is infeasible iff \((P')\) is; (3) \((P)\) is unbounded iff \((P')\) is; (4) certificates of optimality, infeasibility, and unboundedness can be “easily” converted between the two problems.

The following strong duality theorem for general LPs is the culmination of all the duality theory developed so far. It applies to any LP, regardless of whether variables are free or sign-constrained, and regardless of whether constraints are equalities or inequalities. The proof reduces to the SIF case using equivalence transformations.

Theorem 2.6.4 (Strong Duality for General LPs). Let \((P), (D)\) be primal-dual LPs.

  1. If \((P), (D)\) both have feasible solutions, they both have optimal solutions with equal optimal objective values.
  2. If one of \((P), (D)\) has an optimal solution, then both do, with equal optimal objective values.

Theorem 2.6.5 (CS Conditions for General LPs). Let \((P), (D)\) be a primal-dual LP pair. Let \(\bar x\) be feasible in \((P)\) and \(\bar y\) feasible in \((D)\). Then \(\bar x, \bar y\) are optimal in their respective problems if and only if: \(\forall j \in [n],\ \bar x_j = 0\) or the \(j\)-th dual constraint is tight; \(\forall i \in [m],\ \bar y_i = 0\) or the \(i\)-th primal constraint is tight.

3.5 Geometry of Duality

The algebraic optimality conditions have a beautiful geometric interpretation. A feasible point is optimal if and only if the objective vector is “trapped” by the normals of the active constraints — it cannot point in any direction that would increase the objective while staying feasible.

Definition (Valid Inequality). \(\alpha^T x \le \beta\) is a valid inequality for \(P \subseteq \mathbb{R}^n\) if \(\alpha^T \bar x \le \beta\) for all \(\bar x \in P\).

Valid inequalities are constraints that every feasible point satisfies — they can be added to the LP formulation without changing the feasible region. They are also the “witnesses” in duality: a dual feasible \(y\) produces a valid inequality \(c^T x \le b^T y\) for the primal.

Theorem (Geometric Optimality). Let \(\bar x\) be a feasible solution to \((P): \max\ c^T x,\ Ax \le b\). Let \(A^= x \le b^=\) be the tight constraints at \(\bar x\). Then \(\bar x\) is optimal if and only if \(c \in \text{cone}(\text{rows of } A^=)\).

This says: \(\bar x\) is optimal precisely when the objective direction \(c\) lies in the cone generated by the active constraint normals — exactly the equilibrium condition described physically above. If \(c\) pointed outside this cone, we could move in the direction of \(c\) while staying feasible (by moving away from the active constraints), increasing the objective — contradicting optimality.

[Figure: primal and dual feasible regions side-by-side]

3.6 Game Theory and the Minimax Theorem

LP duality provides an elegant proof of one of the foundational results in game theory. A two-player zero-sum game is specified by a payoff matrix \(A \in \mathbb{R}^{m \times n}\): if the row player (Rose) plays row \(i\) and the column player (Colin) plays column \(j\), Rose pays Colin the amount \(A_{ij}\). In a zero-sum game the two players have directly opposing interests — every dollar Colin gains is a dollar Rose loses.

In a mixed strategy, each player randomizes. Rose picks a probability distribution \(p \in \Delta_m = \{p \ge 0 : \mathbf{1}^T p = 1\}\) and Colin picks \(q \in \Delta_n\). The expected payout is \(p^T A q\). Rose wants to minimize this; Colin wants to maximize it.

Rose’s problem and its LP formulation:
\[ (\text{Rose}): \quad v^* = \min_{p \in \Delta_m} \max_{j \in [n]} (A^T p)_j, \]
\[ \min\ v \quad \text{s.t.} \quad (A^T p)_j \le v\ \forall j \in [n],\quad \mathbf{1}^T p = 1,\quad p \ge 0. \]
Colin’s problem and its LP formulation:
\[ (\text{Colin}): \quad w^* = \max_{q \in \Delta_n} \min_{i \in [m]} (Aq)_i, \]
\[ \max\ w \quad \text{s.t.} \quad (Aq)_i \ge w\ \forall i \in [m],\quad \mathbf{1}^T q = 1,\quad q \ge 0. \]

Theorem (Minimax Theorem, von Neumann). For every \(A \in \mathbb{R}^{m \times n}\),
\[ \min_{p \in \Delta_m} \max_{j \in [n]} (A^T p)_j = \max_{q \in \Delta_n} \min_{i \in [m]} (Aq)_i. \]
Proof. The two LPs (Rose) and (Colin) are duals of each other. One can verify this by converting (Rose) to standard form and deriving its dual, which is exactly (Colin). Both are feasible (take \(p = \mathbf{1}/m\) and \(q = \mathbf{1}/n\)), so LP strong duality gives equal optimal values \(v^* = w^*\). \(\square\)

The minimax theorem was regarded as a deep result when von Neumann proved it in 1928 using fixed-point theory. LP duality reduces it to a one-line argument. The common optimal value \(v^* = w^*\) is the value of the game: Colin can guarantee receiving at least \(v^*\) (by playing \(\bar q\)), and Rose can guarantee paying at most \(v^*\) (by playing \(\bar p\)). The strategies \((\bar p, \bar q)\) form a Nash equilibrium: neither player benefits from deviating unilaterally.

A striking feature is the symmetry: it does not matter who commits first. Whether Rose reveals her strategy before Colin responds, or vice versa, the equilibrium payout is the same \(v^*\). This is LP strong duality in game-theoretic language: the min-over-primal equals the max-over-dual.
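A minimal numeric sanity check, using the classic matching-pennies game as illustrative data:

```python
# Matching pennies: Rose pays Colin A[i][j].  The value of the game is 0,
# achieved by the uniform mixed strategies -- a sketch of the minimax theorem.
A = [[1, -1],
     [-1, 1]]

def rose_guarantee(p):
    """Worst case for Rose playing rows with distribution p:
    Colin picks the best column, so Rose pays max_j sum_i p_i A[i][j]."""
    return max(sum(p[i]*A[i][j] for i in range(2)) for j in range(2))

def colin_guarantee(q):
    """Worst case for Colin playing columns with distribution q:
    Rose picks the best row, so Colin receives min_i sum_j A[i][j] q_j."""
    return min(sum(A[i][j]*q[j] for j in range(2)) for i in range(2))

p_bar = q_bar = [0.5, 0.5]
print(rose_guarantee(p_bar))    # 0.0: Rose never pays more than 0 in expectation
print(colin_guarantee(q_bar))   # 0.0: Colin never receives less than 0

# Weak-duality sandwich: every p does at least this badly, every q at most this well.
grid = [[t/10, 1 - t/10] for t in range(11)]
assert all(rose_guarantee(p) >= 0 for p in grid)
assert all(colin_guarantee(q) <= 0 for q in grid)
```

The grid check is exactly the sandwich \( \min_p \max_j \ge v^* \ge \max_q \min_i \) restricted to a finite sample of strategies; the uniform strategies close the gap at \(v^* = 0\).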


Chapter 4: The Simplex Method

Chapter 3 developed LP duality and optimality conditions. Chapter 4 turns these conditions into a concrete algorithm. The simplex method is a vertex-walking algorithm for LPs in Standard Equality Form (SEF). It exploits the theorem that an optimal solution, if it exists, can always be found at a vertex (extreme point) of the feasible polyhedron. Each pivot moves from one vertex to an adjacent one with a strictly better objective value — a finite process that must terminate at an optimum.

4.1 Bases and Basic Feasible Solutions

Work with \((P): \max\ c^T x,\ Ax = b,\ x \ge 0\) where \(A \in \mathbb{R}^{m \times n}\), \(\operatorname{rank}(A) = m\).

The algebraic notion of a basis corresponds to the geometric notion of a vertex: each vertex of the polyhedron is determined by exactly \(m\) linearly independent tight constraints. To turn this geometric insight into an algorithm, we need an algebraic handle on vertices — that is what bases and basic feasible solutions provide.

Think of it this way. We have \(n\) variables and \(m\) equality constraints. The equality constraints form an underdetermined system: there are infinitely many solutions. A basis \(B\) of size \(m\) picks out \(m\) “special” variables (the basic variables) and forces all other variables to zero. The resulting unique solution \(x_B = A_B^{-1} b,\ x_N = 0\) is the basic solution. If it happens to be non-negative, it is a basic feasible solution — and this corresponds exactly to a vertex of the feasible polytope.

The word “simplex” in the simplex method comes from the following geometric object — in low dimensions, the simplest possible polytope with the most basic structure.

Definition 2.8.1 (Simplex). \(S \subseteq \mathbb{R}^n\) is a simplex if \(S = \mathrm{conv}(v^{(1)}, \ldots, v^{(n+1)})\) where the \(v^{(i)}\) are affinely independent.

Definition 2.8.2 (Basis). We say \(B \subseteq [n]\) is a basis of \(A\) if \(A_B\) is square and nonsingular, equivalently \(|B| = m\) and \(\mathrm{rank}(A_B) = m\). The complementary index set is \(N = [n] \setminus B\). The unique solution \(x_B = A_B^{-1} b,\ x_N = 0\) is the basic solution corresponding to \(B\).

Definition 2.8.3 (Basic Solution). \(\bar x\) is a basic solution to \(Ax = b\) if there is a basis \(B\) such that \(\bar x\) is the basic solution corresponding to \(B\).

Definition 2.8.4 (Basic Feasible Solution). A basic solution \(\bar x \ge 0\) is a basic feasible solution (BFS). Variables \(x_j,\ j \in B\) are basic variables; the others are non-basic variables.

Definition 2.8.5 (Feasible Basis). \(B \subseteq [n]\) is a feasible basis for \((P)\) if \(B\) is a basis of \(A\) and the BFS determined by \(B\) is feasible in \((P)\). A feasible basis determines a unique BFS.

A BFS corresponds to a vertex of the feasible polyhedron: the non-basic variables at zero are the “tight constraints” \(x_j = 0\), and together with \(Ax = b\) they pin down the vertex uniquely.

The equivalence between BFS, linear independence, and extreme points is the cornerstone of the simplex method’s correctness:

Theorem 2.8.1. Let \(A \in \mathbb{R}^{m \times n}\), \(\mathrm{rank}\ A = m\), \(b \in \mathbb{R}^m\), \(F = \{x \in \mathbb{R}^n : Ax = b,\ x \ge 0\}\). Suppose \(\bar x \in F\). The following are equivalent:

  1. \(\bar x\) is a basic feasible solution to \(Ax = b,\ x \ge 0\).
  2. The columns \(\{A_j : \bar x_j > 0\}\) are linearly independent.
  3. \(\bar x\) is an extreme point of \(F\).

This three-way equivalence confirms that the simplex method — which moves between BFS — is indeed walking along the vertices of the feasible polytope. The algebraic pivot corresponds precisely to the geometric step from one vertex to an adjacent one.
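Theorem 2.8.1 can be made concrete by brute force: enumerate all candidate bases of a tiny system and classify the resulting basic solutions. A sketch with illustrative data:

```python
# Enumerating bases of Ax = b, x >= 0 for a tiny instance (illustrative):
# A is 2x4 with rank 2, so a basis is a pair of linearly independent columns.
from fractions import Fraction
from itertools import combinations

A = [[1, 1, 1, 0],
     [1, 0, 0, 1]]
b = [6, 4]

def basic_solution(B):
    """Solve A_B x_B = b for a 2x2 basis by Cramer's rule; return the
    full basic solution, or None if the columns are dependent."""
    j1, j2 = B
    det = A[0][j1]*A[1][j2] - A[0][j2]*A[1][j1]
    if det == 0:
        return None                      # not a basis
    x = [Fraction(0)] * 4
    x[j1] = Fraction(b[0]*A[1][j2] - b[1]*A[0][j2], det)
    x[j2] = Fraction(A[0][j1]*b[1] - A[1][j1]*b[0], det)
    return x

for B in combinations(range(4), 2):
    x = basic_solution(B)
    if x is not None:
        tag = "BFS (vertex)" if min(x) >= 0 else "infeasible basic solution"
        print(B, [str(v) for v in x], tag)
```

For this system the enumeration finds four bases with nonnegative basic solutions (the vertices of the feasible region), one basic solution with a negative entry, and one column pair that is not a basis at all.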

4.2 Simplex Algorithm: Pivot Steps

Each simplex pivot moves from the current BFS to a better adjacent one. The key insight is that each step decomposes the problem into: (a) which non-basic variable to “bring in” (the entering variable), and (b) which basic variable to “kick out” (the leaving variable). The entering variable is chosen to improve the objective (positive reduced cost); the leaving variable is determined by the ratio test, which ensures we do not violate feasibility by overshooting.

The reduced costs \(\bar c_N = c_N - A_N^T \bar y\) are the true “value” of increasing each non-basic variable, accounting for the indirect effect through the basic variables. A positive reduced cost means increasing that variable strictly improves the objective — it is “under-priced” relative to its shadow price. When all reduced costs are non-positive, no non-basic variable is worth increasing, and the current BFS is optimal.

The ratio test in step 7 deserves special attention. When we increase the entering variable \(x_k\) from zero, the basic variables change as \(x_B \leftarrow x_B - d x_k\). The minimum ratio \(\bar\alpha = \min\{x_j/d_j : d_j > 0\}\) tells us how much we can increase \(x_k\) before some basic variable hits zero (and must leave the basis). This is exactly the constraint that keeps all variables non-negative.

Given a feasible basis \(B\) and corresponding BFS \(\bar x\):

  1. Compute the dual variable \(\bar y = A_B^{-T} c_B\) (solution to \(A_B^T y = c_B\)).
  2. Compute reduced costs \(\bar c_N = c_N - A_N^T \bar y\).
  3. If \(\bar c_N \le 0\): stop — \(\bar x\) is optimal (and \(\bar y\) is dual optimal).
  4. Pick any entering variable \(k \in N\) with \(\bar c_k > 0\) (improving direction).
  5. Compute pivot direction \(d = A_B^{-1} A_k\).
  6. If \(d \le 0\): stop — \((P)\) is unbounded.
  7. Compute step size \(\bar\alpha = \min\{x_j / d_j : j \in B,\ d_j > 0\}\) and leaving variable \(l\) achieving the minimum (ratio test).
  8. Update basis: \(B' = B \cup \{k\} \setminus \{l\}\), set \(\bar x_k \leftarrow \bar\alpha\) and \(\bar x_B \leftarrow \bar x_B - \bar\alpha d\) (equivalently \(\bar x' = \bar x + \bar\alpha \bar d\), where \(\bar d_k = 1\), \(\bar d_B = -d\), and \(\bar d_j = 0\) otherwise).
Simplex Method (SEF)

Input: \((A, b, c)\) defining (P) in SEF; initial feasible basis \(B\) with BFS \(\bar x\).

Loop:

  1. Solve \(A_B^T \bar y = c_B\). Compute \(\bar c_N = c_N - A_N^T \bar y\).
  2. If \(\bar c_N \le 0\): return optimal \((\bar x, \bar y)\).
  3. Pick \(k \in N\) with \(\bar c_k > 0\) (e.g., smallest index: Bland's rule).
  4. Solve \(A_B d = A_k\).
  5. If \(d \le 0\): return unbounded.
  6. Compute \(\bar\alpha = \min\{x_j/d_j : j \in B, d_j > 0\}\); let \(l\) achieve the min.
  7. Update: \(B \leftarrow B \cup \{k\} \setminus \{l\}\); set \(\bar x_k \leftarrow \bar\alpha\) and \(\bar x_B \leftarrow \bar x_B - \bar\alpha d\).
  8. Go to step 1.
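The loop above can be written out as a complete program. This is a teaching sketch under the stated assumptions (SEF input, a known feasible basis, exact rational arithmetic), not a production solver:

```python
from fractions import Fraction

def solve_square(M, rhs):
    """Gauss-Jordan solve of an m x m nonsingular system (exact Fractions)."""
    m = len(M)
    T = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for col in range(m):
        piv = next(r for r in range(col, m) if T[r][col] != 0)
        T[col], T[piv] = T[piv], T[col]
        T[col] = [v / T[col][col] for v in T[col]]
        for r in range(m):
            if r != col and T[r][col] != 0:
                f = T[r][col]
                T[r] = [v - f*w for v, w in zip(T[r], T[col])]
    return [T[i][m] for i in range(m)]

def simplex(A, b, c, B):
    """Simplex for max c^T x s.t. Ax = b, x >= 0 (SEF), from feasible basis B.
    Bland's rule (smallest index) for both entering and leaving variables."""
    m, n = len(A), len(A[0])
    A = [[Fraction(v) for v in row] for row in A]
    b = [Fraction(v) for v in b]
    c = [Fraction(v) for v in c]
    B = sorted(B)
    while True:
        AB = [[A[i][j] for j in B] for i in range(m)]
        xB = solve_square(AB, b)                            # A_B x_B = b
        y = solve_square([list(col) for col in zip(*AB)],   # A_B^T y = c_B
                         [c[j] for j in B])
        # entering variable: smallest nonbasic index with positive reduced cost
        k = next((j for j in range(n) if j not in B
                  and c[j] - sum(A[i][j]*y[i] for i in range(m)) > 0), None)
        if k is None:                                       # c_N - A_N^T y <= 0
            x = [Fraction(0)]*n
            for pos, j in enumerate(B):
                x[j] = xB[pos]
            return "optimal", x
        d = solve_square(AB, [A[i][k] for i in range(m)])   # A_B d = A_k
        ratios = [(xB[i]/d[i], B[i]) for i in range(m) if d[i] > 0]
        if not ratios:
            return "unbounded", None
        _, leave = min(ratios)                              # ratio test + Bland
        B = sorted([j for j in B if j != leave] + [k])

# max 2x1 + x2 s.t. x1 + x2 + x3 = 6, x1 + x4 = 4, x >= 0 (x3, x4 slacks)
status, x = simplex([[1, 1, 1, 0], [1, 0, 0, 1]], [6, 4], [2, 1, 0, 0], [2, 3])
print(status, [str(v) for v in x])   # optimal ['4', '2', '0', '0']
```

Starting from the slack basis \(\{3, 4\}\) (0-indexed as `[2, 3]`), the run pivots twice and halts with \(\bar c_N \le 0\) at the vertex \((4, 2)\), objective value 10, matching the dual certificate \(y = (1, 1)\).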

[Figure: the simplex method walking along vertices to the optimum]

Correctness of a pivot step. The three key properties of each non-degenerate pivot are captured by three claims, stated here in the inequality form \(\min c^T x,\ Ax \ge b\), where a pivot relaxes a tight constraint \(j\) whose multiplier satisfies \(\bar y_j < 0\) and the direction \(d\) satisfies \(A^= d = e_j\):

Claim 1: the direction \(d\) satisfies \(c^T d < 0\). *Proof*: \(c^T d = \bar y^T A^= d = \bar y^T e_j = \bar y_j < 0\), where \(j\) is the leaving constraint index with \(\bar y_j < 0\), and \(e_j\) is the \(j\)-th standard basis vector.

Claim 2: for sufficiently small \(\varepsilon > 0\), the point \(\tilde x + \varepsilon d\) is feasible. *Proof*: the tight constraints remain satisfied (\(A^= (\tilde x + \varepsilon d) = b^= + \varepsilon e_j \ge b^=\)), and the slack constraints remain satisfied by continuity.

Claim 3: after the pivot, the new point \(\tilde x + \bar\lambda d\) is an extreme point with strictly better objective. *Proof*: it is a BFS by Proposition 2.8.2, and \(c^T(\tilde x + \bar\lambda d) = c^T \tilde x + \bar\lambda c^T d < c^T \tilde x\) since \(\bar\lambda > 0\) and \(c^T d < 0\) by Claim 1.

Why does the simplex method improve the objective? The objective rewrites as:

\[ c^T x = c_B^T A_B^{-1} b + \sum_{j \in N} \bar c_j x_j. \]

The constant term is the current value. Since \(\bar c_k > 0\) and we increase \(x_k\), the objective strictly increases (when \(\bar\alpha > 0\)).

4.3 Termination and Degeneracy

The simplex method must terminate because there are only finitely many bases. But degeneracy — where multiple bases correspond to the same BFS — can cause the algorithm to cycle indefinitely without making progress.

The next two propositions confirm that each step of the simplex method maintains the invariants that make it work: after a pivot, the new solution is still a BFS (Proposition 2.8.2), and when no pivot can improve the objective, the current solution is optimal (Proposition 2.8.3). Together they establish the correctness of the algorithm.

Proposition 2.8.2. If the pivot step does not result in an unbounded direction, then the updated solution \(x'\) is a BFS with basis \(B' = B \cup \{k\} \setminus \{l\}\).

Proposition 2.8.3. If \(\bar c_N \le 0\), then the current BFS \(\bar x\) is optimal in \((P)\).

Proposition 2.8.3 is the “certificate of optimality” half of the simplex method: when the algorithm halts with \(\bar c_N \le 0\), the dual solution \(\bar y\) computed in step 1 is dual feasible, and by construction the primal-dual pair \((\bar x, \bar y)\) satisfies complementary slackness. Strong duality then confirms they are both optimal. The simplex method does not merely find a solution — it simultaneously finds a proof of optimality.

Theorem 2.8.4. The Simplex Method applied to LP problems in SEF with a basic feasible solution terminates in at most \(\binom{n}{m}\) iterations provided that \(\bar\alpha > 0\) in each iteration. When it stops, it gives either a certificate of optimality or unboundedness.


When \(\bar\alpha > 0\), every pivot strictly improves the objective, so no basis can be revisited. The bound \(\binom{n}{m}\) is the number of possible bases — an exponential bound, but one that is rarely approached in practice.

Definition 2.8.6 (Degenerate). A BFS \(\bar x\) determined by basis \(B\) of \(A\) is degenerate if \(\bar x_i = 0\) for some \(i \in B\). When degeneracy occurs, \(\bar\alpha = 0\) and the algorithm may not make progress (it may cycle).

Definition 2.8.7 (Nondegenerate). An LP problem \((P)\) with constraints \(Ax = b,\ x \ge 0\) is nondegenerate if every basis of \(A\) is nondegenerate.

Degeneracy occurs when more than \(m\) constraints are tight at a vertex — the vertex is an intersection of “too many” hyperplanes. Geometrically, degenerate vertices look like regular vertices but may cause the algorithm to walk in circles through different bases all representing the same vertex.

Bland’s Rule: Choose the entering variable \(k\) and leaving variable \(l\) both with the smallest index among all eligible candidates. This guarantees termination.

Theorem (Bland's Rule). The simplex method with Bland's rule terminates in at most \(\binom{n}{m}\) iterations, even in the presence of degeneracy.

Bland’s rule breaks the symmetry that causes cycling: by imposing a fixed total order on variable choices, it guarantees that no sequence of degenerate pivots can ever revisit a basis, even when the objective does not improve.
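As a concrete sketch, Bland's rule amounts to two tiny selection functions. This is my own illustration, assuming a maximization LP in SEF; the inputs `reduced_costs` (nonbasic index to \(\bar c_j\)) and `ratios` (basic index to its ratio-test value) are hypothetical names, not notation from the notes.

```python
# A minimal sketch of Bland's pivot rule (my own helper names).

def bland_entering(reduced_costs):
    """Smallest-index nonbasic variable with positive reduced cost (max LP)."""
    eligible = [j for j, cbar in sorted(reduced_costs.items()) if cbar > 0]
    return eligible[0] if eligible else None  # None => current BFS is optimal

def bland_leaving(ratios):
    """Among basic variables attaining the minimum ratio, pick the smallest index."""
    if not ratios:
        return None  # no upper bound on the step => unbounded direction
    best = min(ratios.values())
    return min(i for i, r in ratios.items() if r == best)

# x3 and x5 both have positive reduced cost; Bland enters x3.
print(bland_entering({3: 2.0, 5: 1.0, 4: -1.0}))  # 3
# Ties in the ratio test (degeneracy) are broken toward the smaller index.
print(bland_leaving({2: 0.0, 1: 0.0, 6: 3.0}))    # 1
```

Replacing both smallest-index choices with, say, a largest-coefficient rule recovers a classical (and cycling-prone) pivot rule.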

4.4 Lexicographic Simplex and Two-Phase Method

The degeneracy problem — stalling at a degenerate vertex — and the initialization problem — finding an initial BFS — are the two practical obstacles to running the simplex method on a real LP. The lexicographic pivot rule addresses the former; the two-phase method addresses the latter.

An elegant alternative to Bland’s rule is perturbation: slightly modify the right-hand side \(b\) to eliminate all degeneracy. The perturbed LP has no degenerate vertices, so the simplex method terminates by the basic finite termination theorem.

To handle degeneracy more elegantly, perturb \(b\) formally by replacing \(b_i\) with \(b_i - \varepsilon^i\), where \(\varepsilon\) is a formal indeterminate and computations take place in the polynomial ring \(\mathbb{R}[\varepsilon]\) with the ordering: \(p(\varepsilon) < q(\varepsilon)\) if \(p(\varepsilon') < q(\varepsilon')\) for all sufficiently small \(\varepsilon' > 0\). The lexicographic simplex compares ratio-test candidates under this polynomial ordering, breaking all ties. The perturbed LP is automatically nondegenerate, guaranteeing strict improvement at every step.
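The ordering on \(\mathbb{R}[\varepsilon]\) is easy to realize in code: for all sufficiently small \(\varepsilon > 0\), comparing polynomials amounts to comparing their coefficient tuples lexicographically, lowest degree first. A minimal sketch (my own illustration):

```python
# Represent p(ε) = p0 + p1·ε + p2·ε² + ... by its coefficient tuple
# (p0, p1, p2, ...). For all sufficiently small ε > 0, p(ε) < q(ε) exactly
# when the tuples compare lexicographically: the lowest-degree difference
# dominates.

def poly_less(p, q):
    """p, q: equal-length coefficient tuples, lowest degree first."""
    return list(p) < list(q)  # Python compares lists lexicographically

# b1 - ε versus b1 - ε² with b1 = 5: for small ε, 5 - ε < 5 - ε².
print(poly_less((5, -1, 0), (5, 0, -1)))        # True
# When the constant terms differ, they alone decide.
print(poly_less((3, 100, 100), (4, -100, -100)))  # True
```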

Proposition 2.9.1. The perturbed LP problem \((P')\) (obtained by replacing \(b_i\) with \(b_i - \varepsilon^i\)) is nondegenerate.

Proof. Suppose for contradiction that \(\tilde x\) is a degenerate BFS of \((P')\). Then more constraints are tight at \(\tilde x\) than the dimension allows, so some subset \(\{a_i : i \in D\}\) of the tight rows is linearly dependent: \(\sum_{i \in D} \lambda_i a_i = 0\) with \(\lambda \ne 0\). Applying \(\lambda\) to the tight equations \(a_i^T \tilde x = b_i - \varepsilon^i\) gives \(0 = \sum_{i \in D} \lambda_i (b_i - \varepsilon^i) = \sum_{i \in D} \lambda_i b_i - \sum_{i \in D} \lambda_i \varepsilon^i\). But \(\sum_{i \in D} \lambda_i \varepsilon^i\) is a nonzero polynomial in \(\varepsilon\) (its terms \(\lambda_i \varepsilon^i\) have distinct degrees, so cancellation is impossible), while \(\sum_{i \in D} \lambda_i b_i\) is a constant; their difference is therefore a nonzero element of \(\mathbb{R}[\varepsilon]\), hence nonzero for all sufficiently small \(\varepsilon' > 0\), contradicting the equation above. \(\square\)

Theorem 2.9.2. The Lexicographic Simplex Method applied to \((P)\) with a starting BFS terminates in at most \(\binom{n}{m}\) iterations.

The initialization problem is the other practical challenge: the simplex method requires a starting BFS, but none may be obvious from the LP’s original formulation. The two-phase method is a beautiful resolution: solve a modified LP (Phase I) whose optimal value tells us whether the original LP is feasible, and whose optimal solution gives us the starting BFS for Phase II. The auxiliary LP always has an obvious starting BFS — the artificial variables \(s_i\) are set to \(b_i\) (assuming \(b \ge 0\) after possibly flipping rows).

To find an initial feasible basis when none is obvious, the Two-Phase Method introduces artificial variables \(s_1, \ldots, s_m\) and solves:

\[ \text{(P-AUX):}\ \max\ -\mathbf{1}^T s \quad \text{s.t.}\ Ax + Is = b,\ x,s \ge 0. \]

The basis \(B = \{n+1,\ldots,n+m\}\) gives an immediate BFS with \(s = b \ge 0\). If the optimal value of (P-AUX) is zero, then a feasible basis for (P) has been found; otherwise (P) is infeasible.
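Setting up (P-AUX) is purely mechanical. A minimal sketch with my own helper names (the basis is 0-indexed here, unlike the \(\{n+1, \ldots, n+m\}\) convention in the text):

```python
# Build the Phase I auxiliary problem from A, b given as Python lists.
# Rows with b_i < 0 are flipped first so the right-hand side is nonnegative.

def make_phase_one(A, b):
    m, n = len(A), len(A[0])
    A = [row[:] if bi >= 0 else [-a for a in row] for row, bi in zip(A, b)]
    b = [abs(bi) for bi in b]
    # Append the identity block for the artificial variables s_1..s_m.
    A_aux = [row + [1 if i == j else 0 for j in range(m)]
             for i, row in enumerate(A)]
    c_aux = [0] * n + [-1] * m       # objective: max -1^T s
    basis = list(range(n, n + m))    # starting basis: the artificial columns
    start = [0] * n + b              # obvious BFS: x = 0, s = b
    return A_aux, c_aux, basis, start

A_aux, c_aux, basis, start = make_phase_one([[1, 2], [3, -4]], [5, -6])
print(A_aux)   # [[1, 2, 1, 0], [-3, 4, 0, 1]]
print(start)   # [0, 0, 5, 6]
```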

Proposition 2.10.1. \((\text{P-AUX})\) has an optimal objective value equal to zero if and only if \((P)\) has a feasible solution.

The simplex method not only finds a primal optimal solution \(\bar x\) — it simultaneously computes a dual optimal solution \(\bar y = A_B^{-T} c_B\). At termination, the pair \((\bar x, \bar y)\) satisfies complementary slackness. The abridged CS theorem below gives a stronger form of this: not only is one side of each CS pair zero, but their sum is strictly positive. This “complementary” relationship is exact — no partial zeroes are possible.

Theorem 2.11.1 (Abridged Complementary Slackness). Consider \((P): \max\ c^T x,\ Ax = b,\ x \ge 0\) and \((D): \min\ b^T y,\ A^T y \ge c\). Suppose \((P)\) has an optimal solution. Then \((P), (D)\) have optimal solutions \(\bar x, \bar y\) such that for all \(j = 1, 2, \ldots, n\):

\[ \bar x_j (A^T \bar y - c)_j = 0 \quad \text{and} \quad \bar x_j + (A^T \bar y - c)_j > 0. \]

This is a strengthening of ordinary CS: not only does one of the two quantities vanish, but their sum is strictly positive — so exactly one is zero and the other is strictly positive.
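The conditions of Theorem 2.11.1 are easy to check numerically. A small sketch on a toy primal-dual pair of my own construction: \((P)\ \max x_1\) s.t. \(x_1 + x_2 = 1,\ x \ge 0\), with dual \((D)\ \min y\) s.t. \(y \ge 1,\ y \ge 0\).

```python
# Verify strict (abridged) complementary slackness for a candidate pair.

def strict_cs_holds(x, y, A, c):
    m, n = len(A), len(c)
    slack = [sum(A[i][j] * y[i] for i in range(m)) - c[j] for j in range(n)]
    # For every j: one of x_j, slack_j is zero and the other strictly positive.
    return all(x[j] * slack[j] == 0 and x[j] + slack[j] > 0 for j in range(n))

A, c = [[1, 1]], [1, 0]
print(strict_cs_holds([1, 0], [1], A, c))       # True: x = (1,0), y = 1 are optimal
print(strict_cs_holds([0, 1], [0], A, [0, 0]))  # False: x_1 and its slack both vanish
```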


Chapter 5: Applications and Special Structure

With the theory and algorithm of LP developed, Chapter 5 turns to applications: combinatorial optimization (matchings, flows, games), special matrix structure (totally unimodular matrices), and extensions to nonlinear optimization (convex programs, KKT conditions). The common thread is that linear programming, when applied to problems with special structure, yields much more than just a solution — it yields structural insight, integer optima automatically, and a bridge to combinatorial min-max theorems.

5.1 LP Unboundedness and Infeasibility (Geometric)

An LP can fail in two ways. Geometrically:

  • Infeasible: the constraint halfspaces have empty intersection.
  • Unbounded: the feasible region extends infinitely in the direction of improving objective.

Figure: an unbounded LP (the feasible region extends to infinity in an improving direction).

Figure: an infeasible LP (the constraint halfspaces have empty intersection).

5.2 Combinatorial Optimization and Integer Programs

Many optimization problems arising from combinatorics — who matches to whom, which path is shortest, how much flow fits through a network — are naturally formulated as integer programs. The key insight is that for many such problems, the LP relaxation (dropping integrality) automatically produces integer solutions, making them solvable in polynomial time despite their integer formulation.

Definition 3.1.1 (Graph). \(G = (V, E)\) is a graph where \(E\) is a subset of pairs \(\{u, v\}\) with \(u \ne v \in V\). We take graphs to be simple by default.

Definition 3.1.2 (Matching). A matching in a graph \(G\) is a subset \(M \subseteq E\) such that every vertex \(v \in V(G)\) is incident with at most one edge in \(M\).

Definition 3.1.3 (Perfect Matching). A matching \(M\) in \(G\) is perfect if every vertex is incident with an edge of \(M\) (i.e., \(|M| = |V|/2\)).

Definition 3.1.4 (Maximum Weight Matching Problem). Let \(w_e \in \mathbb{R}\) for every \(e \in E\). The maximum weight matching problem in \(G\) is to find a matching \(M\) such that \(\sum_{e \in M} w_e\) is maximized.

Definition 3.1.5 (Maximum Weight Perfect Matching Problem). The maximum weight matching problem with the added constraint that the matching must be perfect.

Definition 3.1.6 (Bipartite). A graph \(G = (V, E)\) is bipartite if there is a bipartition \(A, B\) of \(V\) such that every edge \(\{u,v\} \in E\) has \(u \in A,\ v \in B\) or vice versa.

Definition 3.1.7 (Complete). A graph is complete if \(\{u, v\} \in E\) for every pair of distinct vertices \(u, v \in V\).

Many combinatorial problems (matching, shortest path, max-flow) are naturally formulated as integer programs. The LP relaxation drops the integrality constraint:

\[ \text{opt(IP)} := \max\{c^T x : Ax \le b,\ x \in \mathbb{Z}^n\} \;\le\; \max\{c^T x : Ax \le b\} =: \text{opt(LP)}. \]

The LP optimal value is an upper bound on the IP optimal value. Equality holds when the LP solution is automatically integral — a situation governed by totally unimodular matrices.
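A tiny example makes the bound concrete. The instance below is my own, with brute force standing in for an IP solver: \(\max 2x\) subject to \(2x \le 3\).

```python
# The LP optimum is x = 3/2 with value 3; over the integers the best is
# x = 1 with value 2, so opt(IP) <= opt(LP).
from fractions import Fraction

lp_opt = 2 * Fraction(3, 2)                        # LP optimum at x = 3/2
ip_opt = max(2 * x for x in range(-10, 11) if 2 * x <= 3)
print(lp_opt, ip_opt)   # 3 2
assert ip_opt <= lp_opt  # the relaxation bound
```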

The integer hull is the convex hull of the integer points of \(P\): the tightest polyhedron containing them, and hence the best possible LP relaxation of the integer program.

The integer hull is the most important polyhedral object in integer programming. It is the convex hull of all integer feasible points — the “LP relaxation of the LP relaxation” that is tight for integer solutions. If we knew the integer hull explicitly (as a set of linear inequalities), we could solve any IP over these constraints as an LP. The challenge of IP is precisely that the integer hull may require exponentially many inequalities to describe. Theorems 3.2.1 and 3.2.2 confirm that the integer hull is always a polyhedron — it is a well-structured object, even if its description is complex.

Definition 3.2.1 (Integer Hull). If \(P := \{x \in \mathbb{R}^n : Ax \le b\}\), the integer hull of \(P\) is \(P_I := \mathrm{conv}(P \cap \mathbb{Z}^n)\).

Theorem 3.2.1. Let \(A \in \mathbb{R}^{m \times n}\), \(b \in \mathbb{R}^m\), \(S := \{x \in \mathbb{Z}^n : Ax \le b\}\) be bounded. Then \(\mathrm{conv}(S)\) is a polytope.

Theorem 3.2.2. Let \(A \in \mathbb{Q}^{m \times n}\), \(b \in \mathbb{Q}^m\), \(P := \{x \in \mathbb{R}^n : Ax \le b\}\). Then \(P_I\) is a polyhedron.

Corollary 3.2.2.1. Let \(A \in \mathbb{Q}^{m \times n}\), \(b \in \mathbb{R}^m\), \(S := \{x \in \mathbb{Z}^n : Ax \le b\}\). Then \(\mathrm{conv}(S)\) is a polyhedron.

When \(P = P_I\), the LP and the IP have the same optimal value for every objective — the LP is an “exact” relaxation. This is the ideal situation, and it holds whenever \(A\) is totally unimodular.

Theorem 3.2.3. Let \(A \in \mathbb{Q}^{m \times n}\), \(b \in \mathbb{Q}^m\), \(P := \{x \in \mathbb{R}^n : Ax \le b\}\) nonempty and bounded. Then \(P = P_I\) if and only if for all \(c \in \mathbb{Z}^n\), the LP \(\max\ c^T x,\ x \in P\) has an integer optimal value.

Theorem 3.2.4 collects five equivalent characterizations of the condition \(P = P_I\). This five-way equivalence is remarkable: it shows that the condition “the LP relaxation always has an integral optimal solution” is equivalent to “every extreme point is integral” (a structural condition), “the LP optimal is always integral” (an objective-value condition), and even “the dual LP optimal is always integral” (a condition on the dual). Each equivalence opens a different avenue for proving or refuting \(P = P_I\). In practice, the most useful direction is (1) \(\Rightarrow\) (3): if we can verify that every extreme point of \(P\) is integral (e.g., by showing \(A\) is TU), then we conclude the LP always has an integral optimal solution.

Theorem 3.2.4. Let \(A \in \mathbb{Q}^{m \times n}\), \(b \in \mathbb{Q}^m\). Suppose \(P := \{x \in \mathbb{R}^n : Ax \le b\}\) is nonempty and bounded. The following are equivalent:

  1. \(P = P_I\)
  2. Every extreme point of \(P\) is in \(\mathbb{Z}^n\)
  3. \(\forall c \in \mathbb{R}^n\), the LP \(\max\{c^T x : Ax \le b\}\) has an optimal solution \(\bar x \in \mathbb{Z}^n\)
  4. \(\forall c \in \mathbb{Z}^n\), \(\max\{c^T x : Ax \le b\} \in \mathbb{Z}\)
  5. \(\forall c \in \mathbb{Z}^n\), \(\min\{b^T y : A^T y = c,\ y \ge 0\} \in \mathbb{Z}\)

Theorem (Characterization of \(P = P_I\)). For bounded \(P\) with rational \(A, b\), \(P = P_I\) if and only if for all \(c \in \mathbb{Z}^n\), the LP \(\max\{c^T x : x \in P\}\) has an integer optimal solution.

5.3 Totally Unimodular Matrices

The condition that makes LP relaxations automatically integral is total unimodularity — a condition on the constraint matrix \(A\) that ensures all bases have determinant \(\pm 1\), forcing Cramer’s rule to produce integers.

Definition 3.3.1 (Submatrix). Let \(I \subseteq [m],\ J \subseteq [n]\); then \(A_{IJ}\) is the submatrix of \(A\) with rows indexed by \(I\) and columns by \(J\): \([a_{ij}]_{i \in I, j \in J}\).

Definition 3.3.2 (Totally Unimodular). \(A \in \mathbb{Z}^{m \times n}\) is totally unimodular (TU) if the determinant of every square submatrix of \(A\) is in \(\{-1, 0, 1\}\). In particular, every entry of \(A\) must lie in \(\{-1, 0, 1\}\).

Total unimodularity is a strong algebraic condition: not just the full matrix, but every square submatrix must have determinant in \(\{-1, 0, 1\}\). Yet many natural constraint matrices from combinatorics satisfy it.

Theorem 3.3.3 establishes that total unimodularity is equivalent to two other properties: determinant \(\pm 1\) for every basis, and integer-valued inverse for every basis. The equivalence (1) \(\Leftrightarrow\) (3) is immediate from Cramer’s rule: if \(\det A_B = \pm 1\), then \(A_B^{-1} = \frac{\text{adj}(A_B)}{\det A_B}\) has integer entries. The equivalence (2) \(\Leftrightarrow\) (3) is the crucial LP statement: integer right-hand sides and integer inverse bases produce integer extreme points.

Theorem 3.3.3. Let \(A \in \mathbb{Z}^{m \times n}\), \(\mathrm{rank}\ A = m \le n\). The following are equivalent:

  1. \(|\det A_B| = 1\) for every basis \(B\) of \(A\).
  2. Every extreme point of \(\{x \in \mathbb{R}^n : Ax = b,\ x \ge 0\}\) is in \(\mathbb{Z}^n\) for every \(b \in \mathbb{Z}^m\).
  3. \(A_B^{-1} \in \mathbb{Z}^{m \times m}\) for all bases \(B\) of \(A\).

Proposition 3.3.4 lists the closure properties of TU matrices — operations that preserve total unimodularity. These are extremely useful in practice: they tell us that we can augment a TU matrix with an identity block, take its transpose, or scale its rows by \(\pm 1\), and the result is still TU. This allows us to recognize TU structure even when it is disguised by routine LP transformations such as adding slack variables or switching to SEF.

Proposition 3.3.4. Let \(A \in \{-1, 0, 1\}^{m \times n}\). The following are equivalent: (i) \(A\) is TU; (ii) \(A^T\) is TU; (iii) \([A \mid I]\) is TU; (iv) \(\begin{pmatrix}A \\ I\end{pmatrix}\) is TU; (v) \([A \mid A]\) is TU; (vi) \(DA\) is TU for any diagonal matrix \(D\) with \(\pm 1\) entries.

Theorems 3.3.5 and 3.3.6 are the payoff of TU for LP in practice. They say that if \(A\) is TU and \(b\) is integral, then the LP over \(\{Ax \le b\}\) or \(\{Ax \le b, x \ge 0\}\) automatically has integral extreme points — no matter what the objective is. This means: solve the LP (ignoring integrality), and the optimal solution you get is already integral. No branch-and-bound, no cutting planes, no extra work.

Theorem 3.3.5. Let \(A \in \{-1, 0, 1\}^{m \times n}\) be TU and \(b \in \mathbb{Z}^m\). Then every extreme point of \(\{x \in \mathbb{R}^n : Ax \le b\}\) is in \(\mathbb{Z}^n\).

Theorem 3.3.6. Let \(A \in \{-1, 0, 1\}^{m \times n}\) be TU and \(b \in \mathbb{Z}^m\). Then every extreme point of \(\{x \in \mathbb{R}^n : Ax \le b,\ x \ge 0\}\) is in \(\mathbb{Z}^n\).

Now the critical question: which of the matrices that actually arise in practice are TU? The two most important examples come from graphs. Theorem 3.3.7 says the node-arc incidence matrix of any directed graph is TU — this single fact implies that network flow problems (shortest path, maximum flow, minimum cost flow) all have integral LP relaxations. The corollary then gives us the bipartite case, which drives the matching theory.

Theorem 3.3.7. The node-arc incidence matrix of every directed graph is TU.

Corollary 3.3.8.1. The incidence matrix of every undirected bipartite graph is TU.

Theorem (TU and Integer Optima). If \(A\) is TU and \(b \in \mathbb{Z}^m\), then every extreme point of \(\{x \in \mathbb{R}^n : Ax \le b\}\) (or \(\{x : Ax = b, x \ge 0\}\)) is an integer vector. Hence the LP relaxation of any IP with such \(A\) automatically has an integral optimal solution.
Proof (sketch). Every extreme point \(\bar x\) is determined by a basis \(B\) of \(A\) via \(\bar x_B = A_B^{-1} b\). By TU, \(|\det A_B| = 1\), so by Cramer's rule \(A_B^{-1} \in \mathbb{Z}^{m \times m}\). Therefore \(\bar x_B = A_B^{-1} b \in \mathbb{Z}^m\). \(\square\)

The proof is a simple application of Cramer’s rule: the inverse of an integer matrix with determinant \(\pm 1\) is again an integer matrix. TU makes every basis inversion produce an integer result, so every extreme point is integral.
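This fact is easy to verify with exact arithmetic. A small sketch of my own (cofactor expansion, suitable only for tiny matrices):

```python
# Check that an integer matrix with determinant ±1 has an integer inverse.
from fractions import Fraction

def det(M):
    """Determinant by cofactor expansion along the first row (tiny matrices only)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([r[:j] + r[j + 1:] for r in M[1:]])
               for j in range(len(M)))

def inverse(M):
    """Exact inverse via Cramer's rule: adj(M) / det(M)."""
    n, d = len(M), det(M)
    cof = [[(-1) ** (i + j) * det([r[:j] + r[j + 1:]
            for k, r in enumerate(M) if k != i]) for j in range(n)]
           for i in range(n)]
    return [[Fraction(cof[j][i], d) for j in range(n)] for i in range(n)]

B = [[1, 1], [0, 1]]   # integer matrix with det = 1 (a candidate TU basis)
inv = inverse(B)
print(det(B))                                    # 1
print([[int(f) for f in row] for row in inv])    # [[1, -1], [0, 1]]
# det = ±1 forces every entry of the inverse to be an integer:
assert all(f.denominator == 1 for row in inv for f in row)
```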

Key examples of TU matrices:

  1. The node-arc incidence matrix of any directed graph is TU. Each column (arc) has exactly one \(+1\) (head) and one \(-1\) (tail), so any square submatrix has determinant in \(\{-1, 0, 1\}\) (by induction on size).

  2. The incidence matrix of any bipartite graph is TU. Proof: orient all edges from one side of the bipartition to the other to get a directed graph; its incidence matrix is TU by (1), and the undirected incidence matrix differs only by row signs.

These facts are tight: the incidence matrix of an odd cycle (non-bipartite graph) is not TU. The determinant of the incidence matrix of a triangle is \(\pm 2\), violating TU. This is why bipartiteness is necessary for König’s theorem: non-bipartite graphs have incidence matrices that are not TU, so their LP relaxations may have fractional optima.
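Total unimodularity can be checked by brute force on small matrices, which makes the triangle counterexample concrete. A sketch of my own (exponential time, illustration only):

```python
# Brute-force TU test: every square submatrix determinant must be in {-1,0,1}.
from itertools import combinations

def det(M):
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([r[:j] + r[j + 1:] for r in M[1:]])
               for j in range(len(M)))

def is_tu(A):
    m, n = len(A), len(A[0])
    for k in range(1, min(m, n) + 1):
        for rows in combinations(range(m), k):
            for cols in combinations(range(n), k):
                if det([[A[i][j] for j in cols] for i in rows]) not in (-1, 0, 1):
                    return False
    return True

# Incidence matrix of the triangle (odd cycle): determinant ±2, not TU.
triangle = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
print(is_tu(triangle), abs(det(triangle)))   # False 2
# Incidence matrix of the bipartite path a–x–b (edges ax, xb): TU.
path = [[1, 0], [1, 1], [0, 1]]
print(is_tu(path))   # True
```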

5.4 König’s Theorem

König’s theorem — first proved in 1916 by purely combinatorial means — follows from total unimodularity as an almost immediate corollary. The LP lens makes the underlying structure transparent. What makes this theorem striking is that it asserts the equality of two quantities that are trivially in a one-sided relationship: a matching and a vertex cover cannot have the same size unless both are optimal. König says they always can be made equal in bipartite graphs — a min-max equality that has no analog in general graphs.

The key insight here is that maximum matching is a primal LP (maximize the number of edges selected, subject to each vertex being covered at most once), and minimum vertex cover is its dual (minimize the number of vertices selected, subject to each edge being covered by at least one selected vertex). Strong LP duality gives equal objective values; TU of the bipartite incidence matrix gives integral solutions. These two facts together prove König in two lines.

Definition 3.4.1 (Vertex Cover). A set \(C \subseteq V(G)\) such that every edge is incident with some vertex in \(C\).

Theorem 3.4.1 (König’s Theorem, 1916). In every bipartite graph, the cardinality of a maximum matching equals the cardinality of a minimum vertex cover.

Proof. Let \(A\) be the incidence matrix of a bipartite graph \(G\). Since \(A\) is TU, the LP relaxations of both the maximum matching IP and the minimum vertex cover IP have integral optimal solutions. The two LPs are duals of each other, so by LP strong duality their optimal values are equal; hence the integer optima coincide. \(\square\)

The proof is beautifully concise: the two facts (TU gives integrality; LP duality gives equal optima) combine to prove a non-trivial combinatorial equality in two lines. This is the power of the LP/TU framework: combinatorial theorems become consequences of algebraic structure.
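König's equality can also be verified exhaustively on small instances. A brute-force sketch on a 3-edge bipartite graph of my own choosing (exhaustive search, not an algorithm):

```python
# Maximum matching vs minimum vertex cover on a small bipartite graph.
from itertools import combinations

V = ['a', 'b', 'x', 'y']                       # bipartition {a, b} | {x, y}
E = [('a', 'x'), ('a', 'y'), ('b', 'y')]

def is_matching(M):
    verts = [v for e in M for v in e]
    return len(verts) == len(set(verts))       # no vertex used twice

def is_cover(C):
    return all(u in C or v in C for u, v in E)

max_matching = max(k for k in range(len(E) + 1)
                   if any(is_matching(M) for M in combinations(E, k)))
min_cover = min(len(C) for k in range(len(V) + 1)
                for C in combinations(V, k) if is_cover(C))
print(max_matching, min_cover)   # 2 2
assert max_matching == min_cover  # König's equality
```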

Theorem 3.8.1 is a clean dichotomy theorem for perfect matchings in balanced bipartite graphs: either the graph has a perfect matching, or it has a “deficient set”: a subset of one side whose neighbourhood is too small for all of its vertices to be matched. This is the dichotomy form of Hall’s theorem, and it gives an explicit obstruction certificate whenever a perfect matching does not exist.

Theorem 3.8.1. A bipartite graph \(G = (V, E)\) with bipartition \((W, J)\), \(|W| = |J|\), either (i) has a perfect matching, or (ii) has a deficient set.

5.5 Regular Bipartite Graphs, Edge Coloring, and the Card Puzzle

The LP relaxation framework is particularly powerful for regular graphs. An \(r\)-regular graph has every vertex with degree exactly \(r\). The key observation is that for regular bipartite graphs, an explicit fractional solution lies in the perfect matching polytope — and TU forces integer extreme points.

Theorem 5.5.1. Every \(r\)-regular bipartite graph has a perfect matching.

Proof. Let \(A\) be the incidence matrix of the bipartite graph \(G = (X \cup Y, E)\) with \(|X| = |Y|\). Since every vertex has degree \(r\), the vector \(x = \tfrac{1}{r}\mathbf{1}_E\) satisfies \(Ax = \mathbf{1}\) (each row of \(A\) sums to \(r\)) and \(x \ge 0\), so \(x\) lies in the perfect matching polytope \(P = \{x \ge 0 : Ax = \mathbf{1}\}\). Since \(A\) is TU and \(\mathbf{1}\) is integral, every extreme point of \(P\) is a \(0\)-\(1\) vector — the characteristic vector of a perfect matching. As \(x\) is a convex combination of extreme points, at least one matching appears with positive coefficient and hence exists. \(\square\)

This proof is a template for the LP/TU proof machine: exhibit a fractional point in the polytope, invoke TU for integrality of extreme points, and conclude by convexity. No alternating-path argument is needed.

Corollary 5.5.2. Every \(r\)-regular bipartite graph (or multigraph) is \(r\)-edge-colorable.

Proof. By induction on \(r\). Base case \(r = 1\): the single perfect matching is the unique color class. Inductive step: by Theorem 5.5.1, the graph has a perfect matching \(M\); assign color \(r\) to all edges of \(M\). Removing \(M\) leaves an \((r-1)\)-regular bipartite graph (each vertex loses exactly one incident edge), which is \((r-1)\)-edge-colorable by induction. \(\square\)
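The inductive proof doubles as an algorithm: repeatedly extract a perfect matching and assign it a color. A sketch of my own using augmenting paths (Kuhn's algorithm; the function names are hypothetical), run on \(K_{3,3}\):

```python
# Peel r perfect matchings off an r-regular bipartite graph: an r-edge-coloring.

def perfect_matching(adj, n_right):
    """Kuhn's augmenting-path algorithm; returns a right->left matching."""
    match_r = [-1] * n_right
    def try_augment(u, seen):
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                if match_r[v] == -1 or try_augment(match_r[v], seen):
                    match_r[v] = u
                    return True
        return False
    for u in range(len(adj)):
        if not try_augment(u, set()):
            return None   # cannot happen for a regular bipartite graph
    return match_r

def edge_color_regular(adj, n_right, r):
    adj = [list(a) for a in adj]     # work on a copy
    colors = []
    for _ in range(r):
        m = perfect_matching(adj, n_right)
        colors.append(m)
        for v, u in enumerate(m):
            adj[u].remove(v)         # removing a matching keeps the graph regular
    return colors

adj = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]   # K_{3,3} is 3-regular
coloring = edge_color_regular(adj, 3, 3)
print(len(coloring))                                  # 3
print(all(sorted(m) == [0, 1, 2] for m in coloring))  # True: each color is perfect
```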

The same result holds for multigraphs by the same proof, since the incidence matrix of a bipartite multigraph is also TU.

The card puzzle. Take a standard 52-card deck and deal all cards into a \(4 \times 13\) grid. Can you always rearrange the cards within each column so that every row contains each of the 13 ranks exactly once? Yes. Model this as a bipartite multigraph \(G\) with parts \(\{\text{columns } 1, \ldots, 13\}\) and \(\{\text{ranks } A, 2, \ldots, K\}\): for each card in the grid, add an edge between its column and its rank. Each column holds 4 cards and each rank occurs 4 times (once per suit), so \(G\) is 4-regular. By Corollary 5.5.2, \(G\) has a proper 4-edge-coloring; interpret the four colors as rows and move each card to the row given by the color of its edge. The coloring is proper at each column vertex, so every column keeps one card per row, and each color class is a perfect matching, so every row contains each rank exactly once — exactly the desired arrangement.

5.6 Maximum Flow — Minimum Cut

The Max-Flow Min-Cut theorem admits a proof identical in structure to König’s, once we observe that the flow LP has a TU constraint matrix. This unifies two seemingly different theorems under the same algebraic roof. Notice how the “LP + TU” framework becomes a proof machine: once you identify the right LP formulation and verify TU of the constraint matrix, you get (a) an integral optimal solution automatically, and (b) via LP duality, a min-max equality theorem. The Max-Flow Min-Cut theorem, König’s theorem, and Hall’s theorem are all instances of this same machine applied to different combinatorial structures.

Definition 3.5.1 (Cut). For a directed graph \(\tilde G = (V, \tilde E)\) and \(U \subseteq V\), the cut is \(\delta(U) := \{ij \in \tilde E : i \in U,\ j \notin U\}\).

Definition 3.5.2 (st-Cut). Let \(U \subseteq V\) with \(s \in U,\ t \notin U\). Then \(\delta(U)\) is an st-cut.

Definition 3.5.3 (Cut Capacity). The capacity of an st-cut \(W\) is \(\sum_{ij \in \delta(W)} u_{ij}\).

Theorem 3.5.1. The maximum st-flow is equal to the minimum st-cut (capacity).

Theorem 3.5.2 (Maximum-Flow Minimum-Cut). Let \(\tilde G = (V, \tilde E)\) be a directed graph with two distinct nodes \(s, t \in V\) and \(u \in \mathbb{R}^{\tilde E}_+\). The value of the maximum flow in \(\tilde G\) equals the capacity of the minimum st-cut. Furthermore, if \(u \in \mathbb{Z}^{\tilde E}_+\), there exists a maximum flow that is integral.

The proof is identical in structure to König’s: the flow LP has a TU constraint matrix (the node-arc incidence matrix of the directed graph), so the LP and its dual both have integral optima; the dual LP encodes the minimum cut problem. The integrality of the maximum flow is a free consequence — it follows from TU rather than from the Ford-Fulkerson algorithm.

The polyhedral theory underlying these results relies on the notions of dimension, faces, and facets. These concepts give us a “stratification” of a polyhedron: extreme points (dimension 0), edges (dimension 1), two-dimensional faces, and so on up to facets (dimension \(n-1\)). This hierarchy is not just combinatorially interesting — it has direct algorithmic implications. Facets are the “essential” inequalities in any LP formulation: Theorem 3.6.1 says every minimal description of a full-dimensional polyhedron has exactly one inequality per facet. Knowing the facets of the integer hull \(P_I\) would give the tightest possible LP relaxation.

Definition 3.6.1 (Dimension). For a polyhedron \(P \subseteq \mathbb{R}^n\), \(\dim P\) is one less than the maximum number of affinely independent points in \(P\).

Definition 3.6.2 (Valid Inequality, revisited). Given \(a \in \mathbb{R}^n,\ \alpha \in \mathbb{R}\), \(a^T x \le \alpha\) is a valid inequality for \(P\) if \(P \subseteq \{x \in \mathbb{R}^n : a^T x \le \alpha\}\).

Definition 3.6.3 (Face). \(P_f \subseteq P\) is a face of \(P\) if \(P_f = P \cap \{x \in \mathbb{R}^n : a^T x = \alpha\}\) for some valid inequality \(a^T x \le \alpha\). Every face of \(P\) is a polyhedron. Dimensions: \(\dim \emptyset = -1\), extreme points have \(\dim 0\), edges have \(\dim 1\), facets have \(\dim(P) - 1\).

Definition 3.6.4 (Facet). A face \(P_f\) of \(P\) is a facet if \(\dim P_f = \dim P - 1\).

Theorem 3.6.1. Let \(P \subseteq \mathbb{R}^n\) with \(\dim P = n\). Every description of \(P\) in terms of linear inequalities contains at least one inequality for each facet, and all minimal descriptions have exactly one inequality per facet. Moreover, minimal descriptions are unique up to scaling by positive constants.

Facets of \(P_I\) give the strongest valid inequalities for the integer hull and provide a benchmark for judging the strength of IP formulations.

The following theorems are “integer analogs” of Farkas’ lemma — alternative theorems in the setting of integer solutions rather than real solutions. They answer the question: when does the system \(Ax = b\) have an integer solution? Just as Farkas’ lemma characterizes real feasibility via a dual certificate, these results characterize integer feasibility via rational certificates involving the integrality of \(A^T y\) and \(b^T y\). Notice how each result steps up in generality: from rational systems over \(\mathbb{Q}\) to real systems with approximate integer solutions.

Theorem 3.8.2 (Fundamental Theorem of Linear Algebra, Gauss). Given \(A \in \mathbb{Q}^{m \times n}\) and \(b \in \mathbb{Q}^m\), exactly one of the following holds: (I) \(\exists \bar x \in \mathbb{Q}^n\) such that \(A\bar x = b\); (II) \(\exists y \in \mathbb{Q}^m\) such that \(A^T y = 0,\ b^T y \ne 0\).

Theorem 3.8.4 (Kronecker). Let \(a \in \mathbb{Q}^n,\ b \in \mathbb{Q}\). Exactly one of the following has a solution: (I) \(\exists x \in \mathbb{Z}^n\) with \(a^T x = b\); (II) \(\exists y \in \mathbb{Q}\) with \(ya \in \mathbb{Z}^n,\ yb \notin \mathbb{Z}\).
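In one dimension the alternative is a gcd statement: \(a^T x = b\) has an integer solution iff \(\gcd(a)\) divides \(b\), and otherwise \(y = 1/\gcd(a)\) certifies alternative (II), since \(ya\) is integral while \(yb\) is not. A sketch of my own framing of this special case:

```python
# Decide alternative (I) vs (II) of Kronecker's theorem for a single equation.
from math import gcd
from functools import reduce
from fractions import Fraction

def integer_solvable(a, b):
    g = reduce(gcd, a)
    if b % g == 0:
        return ('solvable', g)         # an integer x exists (construction omitted)
    y = Fraction(1, g)                 # certificate for alternative (II)
    assert all((y * ai).denominator == 1 for ai in a)   # y·a is integral
    assert (y * b).denominator != 1                     # y·b is not
    return ('certificate', y)

print(integer_solvable([6, 10, 15], 7))   # gcd = 1 divides 7: solvable
print(integer_solvable([4, 6], 7))        # gcd = 2; y = 1/2 certifies no solution
```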

Kronecker’s Approximation Theorem is a beautiful analytic result: for real (possibly irrational) systems, exact integer solutions may not exist, but the question is whether arbitrarily close approximations can be found. The alternative is again a certificate — a dual vector \(y\) that “distinguishes” the integer lattice from the target \(b\). This is the bridge between number theory and LP that makes the theory of integer programming so deep.

Theorem 3.8.5 (Kronecker’s Approximation Theorem, 1884). Let \(A \in \mathbb{R}^{m \times n},\ b \in \mathbb{R}^m\). Exactly one of the following holds: (I) For all \(\varepsilon > 0\), there is \(x \in \mathbb{Z}^n\) such that \(\|Ax - b\| < \varepsilon\); (II) There is \(y \in \mathbb{R}^m\) such that \(A^T y \in \mathbb{Z}^n\) and \(b^T y \notin \mathbb{Z}\).

Corollary 3.8.5.1. Let \(A \in \mathbb{Q}^{m \times n},\ b \in \mathbb{Q}^m\). Exactly one of the following holds: (I) \(\exists x \in \mathbb{Z}^n\) with \(Ax = b\); (II) \(\exists y \in \mathbb{Q}^m\) with \(A^T y \in \mathbb{Z}^n\) and \(b^T y \notin \mathbb{Z}\).

The Hermite Normal Form is the integer analog of row reduction. Just as every matrix over a field can be reduced to row echelon form by elementary row operations (which correspond to unimodular transformations over \(\mathbb{Z}\)), every integer matrix can be reduced to Hermite Normal Form by unimodular column operations. This provides a canonical form for integer linear algebra, useful for solving integer systems and analyzing the lattice structure underlying IP.

Definition 3.8.1 (Unimodular). \(U \in \mathbb{Z}^{n \times n}\) is called unimodular if \(|\det U| = 1\).

Definition 3.8.2 (Hermite Normal Form). A matrix \(H\) (not necessarily square) in Hermite Normal Form satisfies: \(h_{ii} > 0\) for all \(i\); \(0 \le -h_{ij} < h_{ii}\) for \(j < i\); \(h_{ij} = 0\) for \(j > i\). Every \(A \in \mathbb{Z}^{m \times n}\) with \(\mathrm{rank}\ A = m\) can be written as \(A = HU\) where \(U \in \mathbb{Z}^{n \times n}\) is unimodular and \(H\) is in Hermite Normal Form.
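Hermite Normal Form is computable by Euclidean column operations. The sketch below is my own; it follows the notes' sign convention (\(h_{ii} > 0\), \(0 \le -h_{ij} < h_{ii}\) for \(j < i\)) and assumes integer input with full row rank. It is an illustration, not an efficient implementation:

```python
# Reduce an integer matrix to HNF using only unimodular column operations.

def hermite_normal_form(A):
    """Column-operation HNF sketch; assumes integer A with full row rank."""
    H = [row[:] for row in A]
    m, n = len(H), len(H[0])
    def colop(j, i, q):          # column_j -= q * column_i (unimodular for q in Z)
        for r in range(m):
            H[r][j] -= q * H[r][i]
    def swap(i, j):
        for r in range(m):
            H[r][i], H[r][j] = H[r][j], H[r][i]
    for i in range(m):
        # Euclidean steps: the gcd of row i over columns >= i lands on H[i][i],
        # zeroing everything to its right.
        while any(H[i][j] != 0 for j in range(i + 1, n)):
            j = min((jj for jj in range(i, n) if H[i][jj] != 0),
                    key=lambda jj: abs(H[i][jj]))
            if j != i:
                swap(i, j)
            for jj in range(i + 1, n):
                colop(jj, i, H[i][jj] // H[i][i])
        if H[i][i] < 0:
            colop(i, i, 2)       # flip the sign of column i
        # Reduce entries left of the diagonal into (-H[i][i], 0].
        for j in range(i):
            colop(j, i, -((-H[i][j]) // H[i][i]))
    return H

print(hermite_normal_form([[3, 1], [1, 2]]))   # [[1, 0], [-3, 5]]
print(hermite_normal_form([[2, 4], [0, 3]]))   # [[2, 0], [0, 3]]
```

Since only unimodular column operations are applied, the result differs from \(A\) by right-multiplication with a unimodular matrix, matching the existence claim in Definition 3.8.2.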

5.7 Matrix Rounding

The TU framework yields a striking result about rounding real matrices to integers while preserving sums.

Theorem 5.7.1 (Matrix Rounding). Let \(A \in \mathbb{R}^{m \times n}\). There exists an integer matrix \(\tilde A \in \mathbb{Z}^{m \times n}\) such that:

  1. Each entry: \(\tilde a_{ij} \in \{\lfloor a_{ij} \rfloor, \lceil a_{ij} \rceil\}\).
  2. Column sums: \(\sum_i \tilde a_{ij} \in \left\{\lfloor \sum_i a_{ij} \rfloor,\, \lceil \sum_i a_{ij} \rceil\right\}\) for each \(j\).
  3. Row sums: \(\sum_j \tilde a_{ij} \in \left\{\lfloor \sum_j a_{ij} \rfloor,\, \lceil \sum_j a_{ij} \rceil\right\}\) for each \(i\).
  4. Grand total: \(\sum_{ij} \tilde a_{ij} \in \left\{\lfloor \sum_{ij} a_{ij} \rfloor,\, \lceil \sum_{ij} a_{ij} \rceil\right\}\).
Proof (sketch). Subtracting \(\lfloor a_{ij} \rfloor\) from each entry reduces to the case \(0 \le a_{ij} < 1\). Form an LP over matrices with entries in \([0, 1]\), with each row sum and column sum constrained to lie between the floor and ceiling of the corresponding fractional sum. The constraint matrix is the incidence matrix of the complete bipartite graph \(K_{m,n}\) (rows and columns as the two vertex classes), which is TU by Corollary 3.3.8.1. With integer bounds on the right-hand side, every extreme point of the LP is integral, giving the desired \(\{0,1\}\) matrix. \(\square\)

A naive rounding — rounding each entry independently and then correcting column sums — generally fails: small rounding errors in entries can add up to a large error in the column sum. The theorem says this can always be avoided. The key is that the feasible region of the rounding problem is an integral polytope (by TU), so an integer solution always exists.
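For a small matrix, the theorem can be verified by brute force over all floor/ceil choices. A sketch on a \(2 \times 2\) example of my own:

```python
# Search every entrywise floor/ceil rounding for one whose row sums,
# column sums, and grand total all round correctly too.
from itertools import product
from math import floor, ceil

def rounds_ok(value, target):          # value must be floor or ceil of target
    return value in (floor(target), ceil(target))

A = [[0.3, 0.4], [0.8, 0.4]]
m, n = 2, 2
found = None
for choice in product(*[[floor(a), ceil(a)] for row in A for a in row]):
    T = [list(choice[i * n:(i + 1) * n]) for i in range(m)]
    row_ok = all(rounds_ok(sum(T[i]), sum(A[i])) for i in range(m))
    col_ok = all(rounds_ok(sum(r[j] for r in T), sum(r[j] for r in A))
                 for j in range(n))
    tot_ok = rounds_ok(sum(map(sum, T)), sum(map(sum, A)))
    if row_ok and col_ok and tot_ok:
        found = T
        break
print(found)   # [[0, 0], [1, 0]]
```

The theorem guarantees this search always succeeds; naive entry-by-entry rounding (here, rounding every entry to 0) would miss the column-sum constraints.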

Application to statistics. Publishing tables of census data requires rounding. The matrix rounding theorem guarantees that a table of fractional averages can always be rounded to integers so that every row total and every column total rounds correctly — a non-obvious property that naive entry-by-entry rounding does not guarantee.

5.8 Hall’s Theorem and Weighted Matching

Hall’s theorem gives the existential condition for perfect matchings in bipartite graphs. The LP/duality perspective gives an algorithmic extension: the Hungarian algorithm for maximum weight perfect matching iteratively updates the dual to expand the “equality subgraph” until it admits a perfect matching.

Definition 3.7.1 (Neighbour). Let \(G = (V, E)\) be bipartite with bipartition \((W, J)\). Given \(S \subseteq V\), the neighbour set of \(S\) is \(N(S) := \{u \in V : u \notin S,\ \exists v \in S,\ \{u,v\} \in E\}\).

Theorem 3.7.1 (Hall, 1939). Let \(G = (V, E)\) be bipartite with bipartition \((W, J)\) and \(|W| = |J|\). \(G\) has a perfect matching if and only if \(|N(S)| \ge |S|\) for all \(S \subseteq W\).

Definition 3.7.2 (Deficient Set). A set \(S \subseteq W\) is deficient if \(|N(S)| < |S|\).

Hall’s theorem is the classical existential statement. The LP approach upgrades it to an algorithm: by maintaining dual feasibility and expanding the equality subgraph, we find a maximum weight perfect matching rather than merely certifying existence.

For the maximum weight perfect matching problem in bipartite graphs, one maintains a dual feasible solution \(\bar y\) (one variable per vertex) satisfying \(\bar y_u + \bar y_v \ge c_{uv}\) for all edges \(uv\). The equality subgraph \(G(\bar y) = \{uv : \bar y_u + \bar y_v = c_{uv}\}\) is updated until it admits a perfect matching, at which point both primal and dual are optimal by complementary slackness. When \(G(\bar y)\) lacks a perfect matching, Hall’s theorem identifies a deficient set \(S\), which guides the dual update:

\[ \bar y_v' = \begin{cases} \bar y_v - \varepsilon & v \in S \\ \bar y_v + \varepsilon & v \in N_{G(\bar y)}(S) \\ \bar y_v & \text{otherwise} \end{cases} \]

where \(\varepsilon = \min\{\bar y_u + \bar y_v - c_{uv} : uv \in E,\ u \in S,\ v \notin N_{G(\bar y)}(S)\}\). Each iteration decreases the dual objective \(\sum_v \bar y_v\) by \(\varepsilon\,(|S| - |N_{G(\bar y)}(S)|) \ge \varepsilon\); with integer edge weights, \(\varepsilon\) is a positive integer at every iteration, guaranteeing termination.
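One iteration of this dual update can be sketched in code. The function below is an illustration under the conventions above (dual values on both sides of the bipartition, a deficient set \(S\) of left vertices, and its equality-subgraph neighbourhood \(N\)); it is not the course's implementation, and the instance at the end is an invented toy example.

```python
def dual_update(C, y_w, y_j, S, N):
    """One Hungarian-method dual update for max-weight perfect matching.

    C[u][v] is the weight of edge uv; (y_w, y_j) is dual feasible
    (y_w[u] + y_j[v] >= C[u][v] for all u, v); S is a deficient set of
    left vertices and N its neighbourhood in the equality subgraph.
    """
    n = len(C[0])
    # eps = smallest slack on an edge leaving S that avoids N(S)
    eps = min(y_w[u] + y_j[v] - C[u][v]
              for u in S for v in range(n) if v not in N)
    y_w = [y - eps if u in S else y for u, y in enumerate(y_w)]
    y_j = [y + eps if v in N else y for v, y in enumerate(y_j)]
    return y_w, y_j, eps

# Toy instance: after one update the equality subgraph gains edge (1, 1)
# and admits the perfect matching {(0, 0), (1, 1)}.
C = [[3, 1], [1, 0]]
y_w, y_j, eps = dual_update(C, [3, 1], [0, 0], S={0, 1}, N={0})
```

Note how the update preserves dual feasibility: edges inside \(S \times N(S)\) have both endpoints adjusted in opposite directions, and every other edge either gains slack or loses at most \(\varepsilon\) of it.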

5.9 Nonlinear and Convex Optimization

The final section steps beyond LP to the broader world of nonlinear optimization. The key results parallel the LP theory: there are first-order optimality conditions (analogous to LP’s CS conditions), second-order tests (via the Hessian), and gradient methods (analogous to the simplex method). For convex programs, these conditions are both necessary and sufficient.

In nonlinear optimization, the feasible set and/or objective can be arbitrary (smooth) functions. Convex programs occupy the “easy” end of this spectrum.

Before developing the theory of convex optimization, we need to recall the topological foundation. The key concepts — interior, closure, compactness — serve as the “stage” on which convex analysis plays out. These are not abstract formalities; they directly govern when optimizers exist and when gradient conditions are meaningful.

Definition 4.1.1 (Open Euclidean Ball). \(B(\bar x, \delta) := \{x \in \mathbb{R}^n : \|x - \bar x\|_2 < \delta\}\) for \(\delta > 0\).

Definition 4.1.2 (Interior). For \(S \subseteq \mathbb{R}^n\), \(\mathrm{int}\ S := \{x \in S : B(x, \delta_x) \subseteq S \text{ for some } \delta_x > 0\}\).

Definition 4.1.4 (Closure). Given \(S \subseteq \mathbb{R}^n\), \(\mathrm{cl}\ S := \{\bar x \in \mathbb{R}^n : \exists \{x^{(k)}\} \subseteq S \text{ with } x^{(k)} \to \bar x\}\).

Definition 4.1.5 (Compact). \(S \subseteq \mathbb{R}^n\) is compact if it is closed and bounded.

Compactness is the magic condition that guarantees optima exist. Without it, a continuous function on a closed set might not attain its minimum (it approaches the infimum asymptotically). The Bolzano-Weierstrass theorem is the key lemma: every sequence in a compact set has a convergent subsequence that stays in the set.

Theorem 4.1.1 (Bolzano-Weierstrass). Let \(S \subseteq \mathbb{R}^n\) be compact. Then every sequence \(\{x^{(k)}\} \subseteq S\) has a convergent subsequence \(\{x^{(l)}\}\) with \(x^{(l)} \to \bar x \in S\).

Proof. Reduce to \(n = 1\) by applying the result coordinate-by-coordinate (a standard diagonal argument extracts a subsequence converging in each coordinate). For \(n = 1\): let \(S \subseteq [-N, N]\). Define \(t = \inf\{t' \in [-N, N] : \{k : x^{(k)} \le t'\} \text{ is infinite}\}\). The set \(\{t' : \text{infinitely many }x^{(k)} \le t'\}\) is non-empty (contains \(N\)) and bounded below, so \(t\) is well-defined. For every \(\varepsilon > 0\), both \((t - \varepsilon, t]\) and \([t, t + \varepsilon)\) contain infinitely many sequence elements (by definition of \(t\) as the infimum), so the interval \((t - \varepsilon, t + \varepsilon)\) is hit infinitely often. Extracting one index from each such interval yields a subsequence converging to \(t\), which lies in the closed set \(S\). \(\square\)

Definition 4.1.7 (Level Sets). For \(f : S \to \mathbb{R}\), the level set (sublevel set) at \(\alpha\) is \(\mathrm{Level}_\alpha(f) := \{x \in S : f(x) \le \alpha\}\).

The Weierstrass theorem is the fundamental existence result for optimization: any continuous function on a nonempty compact set attains its minimum and maximum. This is why “compact feasible region + continuous objective” is the gold standard for a well-posed optimization problem. In LP theory, polytopes are compact, so a linear (hence continuous) objective always attains its optimum — the LP property that a finite infimum is always a minimum.

Theorem 4.1.4 (Weierstrass). Let \(S \subseteq \mathbb{R}^n\) be nonempty and compact, \(f : S \to \mathbb{R}\) continuous. Then \(f\) attains both its infimum and supremum over \(S\).

For unbounded feasible regions (closed but not compact), compactness fails and Weierstrass does not directly apply. Coercivity is the replacement condition: a coercive function “blows up” as we walk to infinity, so we can always restrict attention to a bounded sublevel set without loss of generality. This converts the unbounded problem to a compact one, recovering the Weierstrass guarantee.

Definition 4.1.10 (Coercive). \(f : S \to \mathbb{R}\) is coercive on \(S\) if the level sets \(\{x \in S : f(x) \le \alpha\}\) are bounded for every \(\alpha \in \mathbb{R}\). Equivalently, \(\|x^{(k)}\| \to \infty \Rightarrow f(x^{(k)}) \to \infty\).

Theorem 4.1.5. Let \(S \subseteq \mathbb{R}^n\) be nonempty and closed, \(f : S \to \mathbb{R}\) continuous and coercive. Then \(f\) attains its infimum over \(S\).
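Two standard one-dimensional examples (not from the notes) show what coercivity buys on the closed but unbounded set \(S = \mathbb{R}\):

\[ f(x) = x^2: \quad \mathrm{Level}_\alpha(f) = [-\sqrt{\alpha},\, \sqrt{\alpha}] \text{ is bounded for every } \alpha \ge 0, \quad \text{so } f \text{ is coercive and attains its infimum at } x = 0; \]

\[ g(x) = e^x: \quad \mathrm{Level}_\alpha(g) = (-\infty,\, \ln \alpha] \text{ is unbounded for } \alpha > 0, \quad \text{so } g \text{ is not coercive, and indeed } \inf_x g(x) = 0 \text{ is not attained.} \]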

Definition 4.1.12 (Positive Semidefinite). A symmetric matrix \(A \in \mathbb{R}^{n \times n}\) is positive semidefinite if all eigenvalues are nonnegative, equivalently \(h^T A h \ge 0\) for every \(h \in \mathbb{R}^n\).

Definition 4.1.13 (Positive Definite). A positive semidefinite matrix is positive definite if every eigenvalue is strictly positive, equivalently \(h^T A h > 0\) for every \(h \ne 0\).

Definition 4.2.2 (Strictly Convex). \(f\) is strictly convex if for \(\lambda \in (0,1)\) and \(u \ne v \in S\): \(f(\lambda u + (1-\lambda)v) < \lambda f(u) + (1-\lambda)f(v)\).

The epigraph is a powerful conceptual device: it “lifts” a function into a set in one higher dimension, converting questions about functions into questions about convex sets. The graph of \(f\) is the boundary of its epigraph; the epigraph is everything “above” the graph. This geometric perspective unifies many convexity arguments, since tools from convex geometry (like separation theorems) can be applied directly to the epigraph.

Definition 4.2.4 (Epigraph). For \(f : \mathbb{R}^n \to \mathbb{R}\), \(\mathrm{epi}(f) := \{(\mu, x) \in \mathbb{R} \times \mathbb{R}^n : f(x) \le \mu\}\).

Theorem 4.2.2. \(f : \mathbb{R}^n \to \mathbb{R}\) is a convex function if and only if \(\mathrm{epi}(f)\) is a convex set in \(\mathbb{R}^{n+1}\).

Theorem 4.2.3 gives the gradient characterization of convexity. It says convex functions lie above their first-order Taylor expansions: the tangent hyperplane at any point is a global lower bound. This is the “supporting hyperplane property” for graphs of functions. For optimization purposes, it means that a local condition (the gradient at one point) gives global information — if \(\nabla f(\bar x) = 0\) and \(f\) is convex, then \(\bar x\) is a global minimizer.

Theorem 4.2.3. Let \(S \subseteq \mathbb{R}^n\) be convex, \(f : S \to \mathbb{R}\) continuously differentiable on \(S\). Then \(f\) is convex if and only if \(f(v) \ge f(u) + \nabla f(u)^T (v - u)\) for all \(u, v \in S\).
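As a sanity check (a standard example, not from the notes), take \(f(x) = x^2\) on \(\mathbb{R}\): the gradient inequality reads

\[ v^2 \ge u^2 + 2u(v - u) \iff (v - u)^2 \ge 0, \]

which holds for all \(u, v\), confirming convexity.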

Definition 4.2.9 (Hessian Matrix). For \(f : S \to \mathbb{R}\) twice continuously differentiable, \(H_f(\bar x) = \left[\frac{\partial^2 f}{\partial x_i \partial x_j}(\bar x)\right]_{i,j \in [n]}\).

Theorem 4.2.7. Let \(S \subseteq \mathbb{R}^n\) be convex, \(f : S \to \mathbb{R}\) twice continuously differentiable. Then \(f\) is convex on \(S\) if and only if \(H_f(x)\) is positive semidefinite for every \(x \in S\).

Theorem 4.2.9 gives the monotone gradient characterization: a differentiable function is convex if and only if its gradient is a monotone map (the inner product \([\nabla f(u) - \nabla f(v)]^T (u - v)\) is never negative). This condition is useful for verifying convexity of functions whose Hessian is difficult to compute, and it underlies the convergence analyses of gradient methods.

Theorem 4.2.9. Let \(S \subseteq \mathbb{R}^n\) be convex, \(f : S \to \mathbb{R}\) continuously differentiable. Then \(f\) is convex on \(S\) if and only if \([\nabla f(u) - \nabla f(v)]^T (u - v) \ge 0\) for all \(u, v \in S\).

Definition 4.2.11 (Search Direction). In iterative optimization, \(d \in \mathbb{R}^n\) is the search direction.

Definition 4.2.12 (Step Size). \(\alpha > 0\) is the step size, producing the update \(\bar x \leftarrow \bar x + \alpha d\).

Theorem 4.3.1 (First-Order Optimality). Let \(S \subseteq \mathbb{R}^n\) be nonempty and convex, \(f : S \to \mathbb{R}\) convex and differentiable at \(\bar x \in S\). Then \(\bar x\) minimizes \(f\) over \(S\) if and only if \([\nabla f(\bar x)]^T (x - \bar x) \ge 0\) for all \(x \in S\).

The first-order optimality condition says the gradient at \(\bar x\) must point “outward” relative to \(S\): moving to any other feasible point would not decrease the objective. For unconstrained minimization (\(S = \mathbb{R}^n\)), this reduces to \(\nabla f(\bar x) = 0\).

The Hessian characterization gives a practical test for convexity: check that all eigenvalues of \(H_f(x)\) are non-negative at every point. Positive semidefiniteness means the function “curves upward” in every direction — the bowl shape that defines convexity.
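The eigenvalue test is easy to run numerically. A minimal sketch (assuming NumPy is available; the quadratic \(f(x, y) = x^2 + xy + y^2\) is an illustrative choice, not an example from the notes):

```python
import numpy as np

# Hessian of f(x, y) = x^2 + x*y + y^2 is constant, so one check suffices.
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues = np.linalg.eigvalsh(H)   # eigvalsh: eigenvalues of a symmetric matrix
is_psd = bool(np.min(eigenvalues) >= 0)
# Eigenvalues are 1 and 3, both positive: H is positive definite,
# so f is (strictly) convex on all of R^2.
```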

Gradient methods iteratively update \(\bar x \leftarrow \bar x + \alpha d\):

  • Steepest descent: \(d = -\nabla f(\bar x)\). Simple but slow (linear convergence).
  • Newton’s method: \(d = -[H_f(\bar x)]^{-1} \nabla f(\bar x)\). Quadratic convergence near the optimum but requires computing and inverting the Hessian.
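The two updates can be compared on a convex quadratic. A small sketch (assuming NumPy; the matrix, starting point, and step size are illustrative choices):

```python
import numpy as np

# Minimize f(x) = 1/2 x^T H x, whose gradient is H x and Hessian is H.
H = np.array([[2.0, 1.0], [1.0, 2.0]])
grad = lambda x: H @ x
x0 = np.array([1.0, -2.0])

# Steepest descent with a fixed step size: steady but slow progress.
x_sd = x0.copy()
for _ in range(100):
    x_sd = x_sd - 0.1 * grad(x_sd)

# Newton's method: d = -H^{-1} grad, exact in one step for a quadratic.
x_nt = x0 - np.linalg.solve(H, grad(x0))
# x_nt is the exact minimizer 0; x_sd is only close to it after 100 steps.
```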

The Separating Hyperplane Theorem is the geometric backbone of all of duality theory — it says that convexity is the condition under which “you cannot be confused about which side you are on.” If a point is outside a convex set, there is always a hyperplane that cleanly separates them. This result was used implicitly throughout the LP theory (in the geometric proof of Farkas’ lemma, and in the cone optimality condition), and now it becomes explicit.

Theorem 4.4.2 (Separating Hyperplane). Let \(S \subseteq \mathbb{R}^n\) be nonempty, closed, and convex. Then for every \(u \in \mathbb{R}^n \setminus S\), there exist \(a \in \mathbb{R}^n \setminus \{0\},\ \alpha \in \mathbb{R}\) such that \(a^T x \le \alpha\) for all \(x \in S\) and \(a^T u > \alpha\). In other words, the hyperplane \(a^T x = \alpha\) separates \(u\) from \(S\).

Proof. Let \(\bar x\) be the nearest point in \(S\) to \(u\) (it exists by Weierstrass applied to the compact set \(S \cap \bar B(u, r)\) for large \(r\), and is unique by convexity of \(S\)). Set \(a = u - \bar x \ne 0\) and \(\alpha = a^T \bar x\). Then:
  • \(a^T u = a^T(u - \bar x) + a^T \bar x = \|u - \bar x\|^2 + \alpha > \alpha\).
  • For all \(x \in S\): since \(\bar x\) is the nearest point in \(S\) to \(u\), for any \(x \in S\) and \(\lambda \in [0,1]\) the point \(\bar x + \lambda(x - \bar x) \in S\) (by convexity), so \(\|u - \bar x\|^2 \le \|u - \bar x - \lambda(x - \bar x)\|^2\). Expanding and dividing by \(\lambda > 0\) then taking \(\lambda \to 0\) gives \((u - \bar x)^T(x - \bar x) \le 0\), i.e., \(a^T x \le a^T \bar x = \alpha\). \(\square\)
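The construction is directly computable whenever projection onto \(S\) is easy. A sketch for the box \(S = [0,1]^n\) (an illustrative choice, where the nearest point is coordinate-wise clipping; assumes NumPy):

```python
import numpy as np

def separate_from_box(u):
    """Separating hyperplane for a point u outside the box S = [0,1]^n."""
    x_bar = np.clip(u, 0.0, 1.0)   # nearest point of S to u
    a = u - x_bar                  # normal vector from the proof
    alpha = a @ x_bar
    return a, alpha

u = np.array([2.0, 3.0])
a, alpha = separate_from_box(u)
# a^T x <= alpha for every x in S, while a^T u = alpha + ||u - x_bar||^2 > alpha.
```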

The nearest-point construction gives an explicit, computable separating hyperplane. It also proves that every closed convex set is the intersection of all closed half-spaces containing it — the converse of the obvious fact that such intersections are closed and convex.

A supporting hyperplane is a “touching” hyperplane — it meets the set at the boundary, and the entire set lies on one side. Supporting hyperplanes at boundary points are the nonlinear analog of tight constraints at extreme points. Notice that if \(S\) is smooth at a boundary point \(\bar x\), the supporting hyperplane is unique (it is the tangent hyperplane); if \(S\) has a “corner” at \(\bar x\), multiple supporting hyperplanes may exist — and their normals span the cone used in the KKT conditions.

Definition 4.4.1 (Supporting Hyperplane). A hyperplane \(a^T x = \alpha\) is a supporting hyperplane for \(S\) if \(S \cap \{x : a^T x = \alpha\} \ne \emptyset\) and \(S \subseteq \{x : a^T x \le \alpha\}\).

The Separating Hyperplane Theorem states that for any nonempty closed convex \(S\) and \(u \notin S\), there exists a hyperplane separating \(u\) from \(S\). This is the geometric heart of Farkas’ Lemma and strong duality for convex programs.

5.10 Tangent Cones and Normal Cones

The optimality conditions for LP (Proposition 2.5.6: \(c \in \operatorname{cone}(\text{rows of } A^=)\)) generalize to all convex programs via the language of tangent and normal cones. This framework unifies the LP and convex optimization theories.

Definition (Tangent Cone). For a convex set \(S \subseteq \mathbb{R}^n\) and \(\tilde x \in S\), the tangent cone of \(S\) at \(\tilde x\) is

\[ T(S, \tilde x) = \overline{\operatorname{cone}}(S - \{\tilde x\}) = \operatorname{cl}\{\lambda(x - \tilde x) : x \in S,\ \lambda \ge 0\}. \]

It contains all directions one can move from \(\tilde x\) while staying (approximately) in \(S\).

Definition (Normal Cone). The normal cone of \(S\) at \(\tilde x\) is the polar of the tangent cone:

\[ N(S, \tilde x) = T(S, \tilde x)^* = \{c \in \mathbb{R}^n : c^T(x - \tilde x) \le 0 \ \forall x \in S\}. \]

It contains all vectors that “point away from \(S\)” at \(\tilde x\). The normal cone is always a closed convex cone.

Optimality has a clean normal-cone characterization: \(\tilde x\) minimizes \(c^T x\) over \(S\) if and only if \(-c\) lies in the normal cone \(N(S, \tilde x)\). Equivalently, moving to any other feasible point does not decrease \(c^T x\) — every direction in \(T(S, \tilde x)\) has non-negative inner product with \(c\).

For a polyhedron \(P = \{x : Ax \ge b\}\), both cones are explicit:

\[ T(P, \tilde x) = \{d : A^= d \ge 0\}, \qquad N(P, \tilde x) = \{-(A^=)^T y : y \ge 0\}, \]

where \(A^=\) is the submatrix of tight constraints at \(\tilde x\). The optimality condition \(-c \in N(P, \tilde x)\), i.e., \(c \in \operatorname{cone}(\text{rows of } A^=)\), is exactly the LP optimality condition of Proposition 2.5.6 and its complementary slackness form.

The tangent and normal cones are polar duals of each other: \(T(S, \tilde x)^* = N(S, \tilde x)\) and \(N(S, \tilde x)^* = T(S, \tilde x)\). This duality generalizes LP duality to all convex sets. Moreover, the normal cone satisfies a key decomposition theorem when the feasible region is an intersection: if \(S = S_1 \cap \cdots \cap S_m\) with \(\operatorname{int} S_1 \cap \cdots \cap \operatorname{int} S_m \ne \emptyset\), then for every \(\tilde x \in S\),

\[ N(S, \tilde x) = N(S_1, \tilde x) + \cdots + N(S_m, \tilde x). \]

The theorem says the normal cone of an intersection equals the sum of the individual normal cones — provided the interiors have a common point (the analogue of Slater’s condition). For LP, \(P\) is the intersection of half-spaces \(H_i = \{x : a_i^T x \ge b_i\}\); the normal cone of each half-space at \(\tilde x\) is \(\{0\}\) if the constraint is slack, or \(\operatorname{cone}(\{-a_i\})\) if it is tight. Summing gives \(N(P, \tilde x) = \{-(A^=)^T y : y \ge 0\}\), recovering the LP optimality condition. The LP sum is always closed (a cone generated by finitely many vectors is polyhedral), which is why Theorem 5.9.3 gives an exact equality rather than merely a closure.

When the interior condition fails, the theorem can fail. For example, two closed balls \(B_1\) and \(B_2\) tangent at the origin satisfy \(N(B_1, 0) + N(B_2, 0) \ne N(B_1 \cap B_2, 0)\): the left side is a line, while the right side is all of \(\mathbb{R}^2\).

For sublevel sets of convex functions, gradients supply normal vectors: if \(S = \mathrm{Level}_\alpha(f)\) for a convex differentiable \(f\) and \(\tilde x \in S\) satisfies \(f(\tilde x) = \alpha\), then

\[ \nabla f(\tilde x) \in N(S, \tilde x). \]

This is because for any \(x \in S\), the gradient inequality gives \(f(x) \ge f(\tilde x) + \nabla f(\tilde x)^T(x - \tilde x)\), so \(0 \ge f(x) - \alpha \ge \nabla f(\tilde x)^T(x - \tilde x)\). Geometrically, the gradient of a convex function at a boundary point of a sublevel set is an outward normal — it points in the direction of increasing \(f\), away from the set.

5.11 Lagrangian Duality and KKT Conditions

The LP duality theory of Chapter 3 extends to general convex programs via the Lagrangian. By “relaxing” constraints into the objective (with multipliers \(\lambda\)), we obtain a dual problem that provides lower bounds on the primal optimum.

For a general constrained program \((P): \inf f(x)\) s.t. \(g_i(x) \le 0\), \(x \in S\), the Lagrangian is:

\[ L(x, \lambda) = f(x) + \lambda^T g(x), \quad \lambda \ge 0. \]

The Lagrangian dual is \((D): \sup_{\lambda \ge 0} h(\lambda)\) where \(h(\lambda) = \inf_{x \in S} L(x, \lambda)\). Weak duality gives \(\sup_\lambda h(\lambda) \le \inf f(x)\); strong duality (requiring a Slater point \(\hat x\) with \(g(\hat x) < 0\) in the convex case) closes this gap.

The Lagrangian is constructed by moving the inequality constraints \(g_i(x) \le 0\) from the feasible region into the objective, weighted by multipliers \(\lambda_i \ge 0\). Rather than forbidding infeasible points outright, we “tax” constraint violations: each unit of violation of \(g_i(x) \le 0\) (i.e., each unit of positive \(g_i(x)\)) costs \(\lambda_i\) in the objective. The key insight is that for any \(\lambda \ge 0\), minimizing the Lagrangian over all \(x \in S\) gives a lower bound on the constrained optimum, and for the right choice of \(\bar\lambda\) this bound is tight. The Lagrangian dual problem is to find the best (largest) such lower bound.

Definition 4.5.1 (Lagrangian). \(L : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}\) defined by \(L(x, \lambda) := f(x) + \lambda^T g(x)\).

Definition 4.5.2 (Lagrange Multipliers). The scalars \(\lambda_1, \ldots, \lambda_m\) in the Lagrangian.

Definition 4.5.3 (Lagrangian Dual). The Lagrangian dual of \((P)\) is \((D): \sup_{\lambda \ge 0} h(\lambda)\) where \(h(\lambda) := \inf_{x \in S} L(x, \lambda)\).
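A one-variable worked example (standard, not from the notes): take \((P): \inf\{x^2 : 1 - x \le 0\}\), whose optimal value is \(1\) at \(x = 1\). The Lagrangian and dual function are

\[ L(x, \lambda) = x^2 + \lambda(1 - x), \qquad h(\lambda) = \inf_{x \in \mathbb{R}} L(x, \lambda) = \lambda - \frac{\lambda^2}{4}, \]

the inner infimum being attained at \(x = \lambda/2\). Maximizing over \(\lambda \ge 0\) gives \(\bar\lambda = 2\) and \(h(2) = 1\): the dual optimal value equals the primal optimal value, as strong duality predicts (a Slater point exists, e.g. \(\hat x = 2\) with \(1 - 2 < 0\)).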

Theorem 4.5.1 is the “sufficiency” direction of the KKT theorem: if a primal-dual pair \((\bar x, \bar\lambda)\) satisfies these three conditions, then \(\bar x\) is optimal. This is the easy direction — it is proved directly from the Lagrangian weak duality inequality. Notice how condition (ii) says \(\bar x\) is a minimizer of the Lagrangian \(L(\cdot, \bar\lambda)\) over \(S\), while condition (iii) is exactly the complementary slackness from LP theory, now applied to inequality constraints \(g_i(x) \le 0\).

Theorem 4.5.1. Let \(S \subseteq \mathbb{R}^n\), \(\bar x \in S\). Then \(\bar x\) is an optimal solution of \((P)\) if there exists \(\bar\lambda \in \mathbb{R}^m\) such that: (i) \(g(\bar x) \le 0\) (primal feasibility); (ii) \(\bar\lambda \ge 0\) and \(f(\bar x) + \bar\lambda^T g(\bar x) = \inf_{x \in S} L(x, \bar\lambda)\) (dual feasibility); (iii) \(\bar\lambda_i \cdot g_i(\bar x) = 0\) for all \(i\) (complementary slackness).

Definition 4.5.4 (Slater Point). \(\hat x \in \mathbb{R}^n\) is a Slater point for \((P): \inf\{f(x) : x \in \mathbb{R}^n,\ g(x) \le 0\}\) if \(g(\hat x) < 0\).

The Slater condition — the existence of a strictly feasible point — is the convex analog of LP feasibility. Under this mild condition, strong duality holds and the KKT conditions characterize optimality completely. Notice how close this is to the LP setting: LP strong duality required both the primal and dual to be feasible; Slater’s condition (one strictly feasible point) is the convex program’s replacement for that requirement. In practice, Slater’s condition is almost always easy to verify — if the problem has any strictly feasible point, you have a Slater point. Degenerate cases where Slater fails are rare but can create duality gaps that complicate the analysis.

Theorem 4.5.2 is the “saddle point” characterization of optimality under Slater’s condition. The saddle point condition \(L(\bar x, \lambda) \le L(\bar x, \bar\lambda) \le L(x, \bar\lambda)\) says that \((\bar x, \bar\lambda)\) is simultaneously a minimizer of \(L(\cdot, \bar\lambda)\) (over \(x\)) and a maximizer of \(L(\bar x, \cdot)\) (over \(\lambda \ge 0\)). This minimax structure is the nonlinear analog of strong duality: the primal minimization and the dual maximization meet at the same value. The Slater condition ensures no “duality gap” — this is the nonlinear version of the LP duality gap being zero.

Theorem 4.5.2. Let \(S = \mathbb{R}^n\), \(f, g_1, \ldots, g_m : \mathbb{R}^n \to \mathbb{R}\) convex. Suppose \((P)\) has a Slater point. Then a feasible solution \(\bar x\) of \((P)\) is optimal if and only if there exists \(\bar\lambda \ge 0\) such that: (i) \(L(\bar x, \lambda) \le L(\bar x, \bar\lambda) \le L(x, \bar\lambda)\) for all \(x \in \mathbb{R}^n,\ \lambda \ge 0\) (saddle point condition); (ii) \(\bar\lambda^T g(\bar x) = 0\) (complementary slackness).

Theorem 4.5.3 is the KKT theorem in its most elegant and geometrically meaningful form. Under Slater’s condition, \(\bar x\) is optimal if and only if the negative gradient \(-\nabla f(\bar x)\) lies in the cone generated by the gradients of the active inequality constraints. This is precisely the geometric optimality condition from Proposition 2.5.6 in LP: “the objective direction lies in the cone of active constraint normals.” The KKT theorem is thus a direct generalization of the LP geometric optimality criterion to smooth nonlinear programs.

What makes this remarkable is that the KKT conditions are not just necessary (which would follow from a Lagrange multiplier theorem) — under convexity and the Slater condition, they are also sufficient. This “if and only if” character makes KKT the definitive tool for nonlinear convex optimization: to certify optimality of a point \(\bar x\), exhibit multipliers \(\bar\lambda\) satisfying the conditions, and the proof is complete.

Theorem 4.5.3 (Karush 1939, Kuhn-Tucker 1951). Consider the same \((P)\) as above with a Slater point. Suppose \(\bar x \in \mathbb{R}^n\) satisfies \(g(\bar x) \le 0\), and \(f, g_i\) for \(i \in J(\bar x) := \{i : g_i(\bar x) = 0\}\) are differentiable at \(\bar x\). Then \(\bar x\) is optimal in \((P)\) if and only if \(-\nabla f(\bar x) \in \mathrm{cone}\{\nabla g_i(\bar x) : i \in J(\bar x)\}\).
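A small worked instance (a standard example, not from the notes): minimize \(f(x) = x_1^2 + x_2^2\) subject to \(g_1(x) = 1 - x_1 - x_2 \le 0\). At \(\bar x = (\tfrac12, \tfrac12)\) the constraint is active, and

\[ -\nabla f(\bar x) = (-1, -1) = 1 \cdot \nabla g_1(\bar x), \]

so \(-\nabla f(\bar x) \in \mathrm{cone}\{\nabla g_1(\bar x)\}\) with multiplier \(\bar\lambda_1 = 1\), certifying optimality. A Slater point exists (e.g. \(\hat x = (1, 1)\), since \(1 - 2 < 0\)).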

When both equality and inequality constraints are present, the Slater condition no longer applies directly (a strict equality cannot be “strictly satisfied”). The Mangasarian-Fromovitz Constraint Qualification (MFCQ) is the appropriate generalization: it requires that the equality constraint gradients are linearly independent, and that a direction exists that simultaneously satisfies all active inequality constraints strictly and all equality constraints. Under MFCQ, the KKT conditions are again necessary for optimality — though for general nonconvex programs they are not sufficient.

Theorem 4.5.4 (KKT under the Mangasarian-Fromovitz Constraint Qualification). Consider \((P): \inf f(x)\) s.t. \(g_i(x) \le 0\) for \(i \in [m]\), \(h_j(x) = 0\) for \(j \in [p]\), \(x \in S\). Suppose the feasible point \(\bar x\) satisfies: (i) \(\bar x \in \mathrm{int}(S)\), \(g(\bar x) \le 0\), \(h(\bar x) = 0\); (ii) \(f, g_i, h_j\) are continuous, and \(f\), \(g_i\) for \(i \in J(\bar x)\), and \(h_j\) are differentiable at \(\bar x\); (iii) \(\{\nabla h_j(\bar x) : j \in [p]\}\) is linearly independent; (iv) \(\{d \in \mathbb{R}^n : \nabla g_i(\bar x)^T d < 0\ \forall i \in J(\bar x),\ \nabla h_j(\bar x)^T d = 0\ \forall j \in [p]\} \ne \emptyset\). If \(\bar x\) is a local optimum of \((P)\), then there exist \(\bar\lambda \ge 0\) and \(\bar\mu \in \mathbb{R}^p\) with \(\bar\lambda_i g_i(\bar x) = 0\) for all \(i\) and \(\nabla f(\bar x) + \sum_{i=1}^m \bar\lambda_i \nabla g_i(\bar x) + \sum_{j=1}^p \bar\mu_j \nabla h_j(\bar x) = 0\).

Theorem (KKT Conditions, Karush 1939 / Kuhn–Tucker 1951). Under Slater's condition, \(\bar x\) is optimal for convex \((P)\) if and only if there exists \(\bar\lambda \ge 0\) with:
  1. \(g(\bar x) \le 0\) (primal feasibility)
  2. \(\nabla f(\bar x) + \sum_i \bar\lambda_i \nabla g_i(\bar x) = 0\) (stationarity); equivalently, \(-\nabla f(\bar x) \in \text{cone}\{\nabla g_i(\bar x) : i \in J(\bar x)\}\) where \(J(\bar x) = \{i : g_i(\bar x) = 0\}\)
  3. \(\bar\lambda_i \cdot g_i(\bar x) = 0\) for all \(i\) (complementary slackness)

The KKT conditions are three-part: primal feasibility (the point is in the feasible region), stationarity (the objective gradient is balanced by active constraint gradients), and complementary slackness (inactive constraints have zero multiplier). Together they characterize the optimum completely — just as LP complementary slackness characterizes LP optima.

This is the nonlinear analog of the LP complementary slackness theorem. For LP, the KKT conditions reduce to exactly the LP CS conditions derived from strong duality. The entire LP theory of Chapter 3 is thus a special case of the general Lagrangian duality theory, and this unification reveals why both theories have the same “feel”: they are instances of the same convex optimization structure.


Chapter 6: The Ellipsoid Method

6.1 From Optimization to Feasibility

The KKT theory of Chapter 5 provides necessary and sufficient conditions for optimality of convex programs, but it does not by itself yield an efficient algorithm. The simplex method is efficient in practice but not in theory — there exist polyhedra on which it takes exponentially many steps. A fundamentally different approach, due to Khachiyan (1979) building on earlier work by Shor and Yudin-Nemirovskii, solves LP in polynomial time. The key idea is to convert the optimization problem to a sequence of feasibility problems, and to solve each feasibility problem by shrinking a sequence of ellipsoids around the feasible region.

The first reduction goes as follows. Suppose we want to minimize a linear objective \(\min\ c^T x\) over a closed convex set \(S\) contained in a ball of radius \(R\). If we can binary-search on the value \(\alpha\) — asking, for various \(\alpha\), whether \(S \cap \{x : c^T x \le \alpha\}\) is nonempty — then we can find the optimum to any prescribed precision \(\varepsilon\). The number of queries is \(O(\log(2R/\varepsilon))\), which is polynomial in the input size. So it suffices to solve the convex feasibility problem: given a closed convex set \(S\), determine whether \(S = \emptyset\).
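The reduction can be sketched as a bisection loop around an assumed feasibility oracle (`feasible` below is a hypothetical black box answering whether \(S \cap \{x : c^T x \le \alpha\} \ne \emptyset\); the toy oracle at the end is an invented example):

```python
def minimize_via_feasibility(feasible, R, eps):
    """Binary search on the objective value, given a feasibility oracle.

    Assumes the optimal value lies in [-R, R]; returns a value within
    eps of the optimum using O(log(2R/eps)) oracle queries.
    """
    lo, hi = -R, R
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if feasible(mid):
            hi = mid      # some feasible point has objective value <= mid
        else:
            lo = mid      # every feasible point has objective value > mid
    return hi

# Toy oracle: feasible region S = [1, 5] with objective c^T x = x, optimum 1.
approx = minimize_via_feasibility(lambda alpha: alpha >= 1, R=10, eps=1e-6)
```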

The convex set is presented to the algorithm not by an explicit list of inequalities but by a separation oracle: given any point \(x \in \mathbb{R}^n\), the oracle either confirms \(x \in S\) or returns a hyperplane separating \(x\) from \(S\). This oracle model is natural: for a polyhedron \(Ax \ge b\), the oracle simply checks each inequality and returns any violated one as the separating hyperplane. Two assumptions are made: (1) \(S \subseteq B(0, R)\), a ball of known radius \(R\), and (2) the oracle is available. The first assumption can be removed for LP by using a polynomial-size bound on extreme-point solutions.
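For a polyhedron given explicitly, the oracle is a single pass over the rows. A sketch (assuming NumPy; the orthant instance is an illustrative choice):

```python
import numpy as np

def separation_oracle(A, b, x):
    """Oracle for S = {x : A x >= b}: returns None if x is in S, else a
    violated row (a_i, b_i). The hyperplane a_i^T z = b_i separates x
    from S, since a_i^T x < b_i while a_i^T z >= b_i for all z in S."""
    for a_i, b_i in zip(A, b):
        if a_i @ x < b_i:
            return a_i, b_i
    return None

A = np.array([[1.0, 0.0], [0.0, 1.0]])  # S = nonnegative orthant
b = np.array([0.0, 0.0])
hit = separation_oracle(A, b, np.array([-1.0, 2.0]))  # first row is violated
```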

6.2 Ellipsoids and Volume

An ellipsoid in \(\mathbb{R}^n\) is an affine image of the unit ball. Given a symmetric positive definite matrix \(D \in \mathbb{R}^{n \times n}\) and a center \(c \in \mathbb{R}^n\), the associated ellipsoid is:

\[ E(c, D) = \{ x \in \mathbb{R}^n : (x - c)^T D^{-1} (x - c) \le 1 \}. \]

When \(D = r^2 I\), this reduces to the ball \(B(c, r)\). The matrix \(D\) encodes both the shape (which directions are “stretched”) and the scale. Since \(D\) is positive definite, it has positive eigenvalues, and the ellipsoid is a genuine bounded convex body.

A key formula governs the volume of an ellipsoid. Under an invertible linear map \(x \mapsto Ax + c\), volumes scale by \(|\det A|\). The unit ball maps to an ellipsoid with \(D = A A^T\), so \(\det A = \sqrt{\det D}\). Therefore:

\[ \operatorname{vol}(E(c, D)) = \sqrt{\det D} \cdot \operatorname{vol}(B(0, 1)). \]

For an initial ellipsoid \(B(0, R)\) we have \(D = R^2 I\), so \(\det D = R^{2n}\) and \(\operatorname{vol}(B(0,R)) = R^n \operatorname{vol}(B(0,1))\). Since the unit ball fits in the box \([-1,1]^n\), its volume is at most \(2^n\), giving a crude but useful upper bound \(\operatorname{vol}(E) \le (2R)^n\).
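The determinant formula can be checked numerically. A Monte Carlo sketch in the plane (illustrative, assumes NumPy; \(D = \mathrm{diag}(4, 9)\) gives an ellipse with semi-axes \(2\) and \(3\)):

```python
import numpy as np

rng = np.random.default_rng(0)
D = np.diag([4.0, 9.0])
D_inv = np.linalg.inv(D)

# Sample the box [-3, 3]^2 (area 36), which contains the ellipse.
pts = rng.uniform(-3.0, 3.0, size=(200_000, 2))
quad = np.einsum('ij,jk,ik->i', pts, D_inv, pts)   # (x - c)^T D^{-1} (x - c), c = 0
mc_vol = (quad <= 1.0).mean() * 36.0

exact = np.sqrt(np.linalg.det(D)) * np.pi          # vol(B(0,1)) = pi in R^2
# mc_vol should be close to exact = 6 * pi ~ 18.85
```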

6.3 The Ellipsoid Algorithm

The algorithm proceeds as follows. Initialize with the ball \(E_0 = B(0, R)\), which is guaranteed to contain \(S\). At each step, we have an ellipsoid \(E_k = E(c_k, D_k)\) known to contain \(S\).

  1. Query the oracle at the center \(c_k\). If \(c_k \in S\), halt — we have found a feasible point.
  2. If \(c_k \notin S\), the oracle returns a halfspace \(H = \{x : a^T x \le a^T c_k\}\) with \(S \subseteq H\). This traps \(S\) in the “cap” \(E_k \cap H\).
  3. Compute a new ellipsoid \(E_{k+1}\) of minimum volume that contains \(E_k \cap H\), and repeat.
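The update in step 3 has a well-known closed form, stated here without derivation (a sketch assuming NumPy; the unit-ball example at the end is an invented illustration):

```python
import numpy as np

def ellipsoid_step(c, D, a):
    """Minimum-volume ellipsoid containing E(c, D) cut by a^T x <= a^T c."""
    n = len(c)
    Da = D @ a
    s = np.sqrt(a @ Da)                  # norm of a in the D metric
    c_new = c - Da / ((n + 1) * s)       # shift the center into the kept half
    D_new = (n**2 / (n**2 - 1.0)) * (D - (2.0 / (n + 1)) * np.outer(Da, Da) / s**2)
    return c_new, D_new

# Cutting the unit ball in R^2 with a = e_1: the new ellipse has center
# (-1/3, 0), and its volume ratio sqrt(det D_new / det D) is below e^{-1/6}.
c_new, D_new = ellipsoid_step(np.zeros(2), np.eye(2), np.array([1.0, 0.0]))
```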

The key geometric fact is that the minimum-volume ellipsoid containing the half-ellipsoid \(E \cap \{x : a^T(x - c) \le 0\}\) (where the hyperplane passes through the center \(c\)) has a closed-form description, and its volume satisfies:

\[ \frac{\operatorname{vol}(E_{k+1})}{\operatorname{vol}(E_k)} \le e^{-1/(2n+2)}. \]

This bound is established by reducing (via the affine invariance of the volume ratio) to the case where \(E\) is the unit ball and the hyperplane passes through the origin, then computing the minimum bounding ellipsoid of a half-ball explicitly. The ratio \(e^{-1/(2n+2)}\) is only slightly below \(1\), but the progress compounds: after \(k(2n+2)\) iterations, the volume has shrunk by a factor of \(e^{-k}\).

The termination criterion is: if \(\operatorname{vol}(E_k) < \varepsilon\), conclude that \(S\) has volume less than \(\varepsilon\) (or is empty). Since the volume of \(E_k\) is at most \(e^{-k/(2n+2)} \cdot (2R)^n\), the algorithm terminates after

\[ k = O\!\left(n^2 \log(2R) + n \log(1/\varepsilon)\right) \]

iterations. Each iteration requires a constant number of arithmetic operations on numbers whose bit size is polynomially bounded. Since both \(\log R\) and \(\log(1/\varepsilon)\) are polynomially bounded in the input size for LP, the total running time is polynomial.

Theorem 6.3.1 (Ellipsoid Method). Let \(S \subseteq B(0, R)\) be a closed convex set given by a separation oracle. For any \(\varepsilon > 0\), the ellipsoid algorithm terminates in \(O\!\left(n^2 \log(2R) + n \log(1/\varepsilon)\right)\) iterations, each requiring polynomial-time computation, and either finds a point in \(S\) or certifies that \(\operatorname{vol}(S) < \varepsilon\).

6.4 Khachiyan’s Theorem: LP is Polynomial Time

Applying the ellipsoid method to LP feasibility requires overcoming two obstacles that do not arise in the abstract convex setting.

Obstacle 1 — unboundedness. The polyhedron \(Q = \{x : Ax \ge b\}\) may be unbounded, so no ball of known radius contains it. The resolution uses a result proved in Chapter 7: if \(Q\) is nonempty, it contains an extreme point whose coordinates have bit size polynomially bounded in the size of \(A\) and \(b\). Therefore, any feasible solution can be assumed to lie in the box \([-R, R]^n\) for an \(R\) that is polynomial in the input size, and we replace \(Q\) with \(Q \cap [-R, R]^n\) without affecting feasibility.

Obstacle 2 — zero volume. The ellipsoid method terminates by declaring infeasibility when the current ellipsoid has volume below \(\varepsilon\), but it cannot certify exact infeasibility. If \(Q\) is nonempty but low-dimensional (zero volume), the algorithm never declares the set small enough to stop.

The fix is a small perturbation: subtract \(\varepsilon_0\) from each right-hand side, replacing \(Ax \ge b\) by \(Ax \ge b - \varepsilon_0 \mathbf{1}\). Here \(\varepsilon_0 > 0\) is chosen below any relevant infeasibility gap: by the bit-size bounds on extreme points, if the LP \(\min\{\varepsilon : Ax \ge b - \varepsilon \mathbf{1}\}\) has a positive optimal value, that value is at least \(2^{-p}\) for some polynomial \(p\) of the input size, so we may take \(\varepsilon_0 = 2^{-p-1}\). This ensures:

  • If \(Q\) was nonempty, it remains nonempty after perturbation (relaxing constraints preserves feasibility).
  • If \(Q\) was empty, the perturbed system \(Ax \ge b - \varepsilon_0 \mathbf{1}\) remains empty (since \(\varepsilon_0\) is smaller than the optimal \(\varepsilon\) for making the system feasible).
  • The perturbed polytope, if nonempty, is full-dimensional, so its volume is at least \(2^{-q}\) for some polynomial \(q\) of the input size (via the sub-determinant bound applied to a simplex spanned by its extreme points).

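A one-dimensional toy makes the perturbation concrete. The helper `interval` below is hypothetical (it just intersects half-lines); it shows a zero-volume feasible set becoming full-dimensional while an infeasible system stays infeasible.

```python
from fractions import Fraction

def interval(constraints):
    """Solve the 1-D system {x : a*x >= b for each (a, b)}.
    Return (lo, hi) bounds, or None if the system is empty."""
    lo, hi = None, None
    for a, b in constraints:
        bound = Fraction(b) / a          # a != 0 assumed in this sketch
        if a > 0:
            lo = bound if lo is None else max(lo, bound)
        else:
            hi = bound if hi is None else min(hi, bound)
    if lo is not None and hi is not None and lo > hi:
        return None                      # empty
    return (lo, hi)

def perturb(constraints, eps):
    """Relax every right-hand side: a*x >= b becomes a*x >= b - eps."""
    return [(a, Fraction(b) - eps) for a, b in constraints]

eps0 = Fraction(1, 1000)

flat = [(1, 1), (-1, -1)]     # x >= 1 and x <= 1: single point, zero volume
empty = [(1, 2), (-1, -1)]    # x >= 2 and x <= 1: infeasible, with gap 1

flat_p = interval(perturb(flat, eps0))    # widens to [999/1000, 1001/1000]
empty_p = interval(perturb(empty, eps0))  # still empty: gap far exceeds eps0
```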
With these two modifications, the ellipsoid method solves LP feasibility in polynomial time. Since LP optimization reduces to LP feasibility via the strong duality system (primal feasibility + dual feasibility + equal objective values), LP optimization is also polynomial time.

Theorem 6.4.1 (Khachiyan, 1979). Linear programming can be solved in polynomial time.

This was a landmark result. Throughout the 1970s, the question of whether LP admitted a polynomial-time algorithm was the central open problem of combinatorial optimization. Khachiyan’s theorem — proved using the ellipsoid method — settled it. In practice, the simplex method is far faster on typical instances, but the ellipsoid method’s theoretical guarantee is crucial for complexity-theoretic arguments.


Chapter 7: Complexity Theory

7.1 Instances, Size, and Bit Complexity

Every mathematical question about optimization can be asked in a complexity-theoretic framework by specifying how the input is encoded. A problem is a collection of instances, and an algorithm for the problem must, for every instance, terminate and produce the correct answer.

For LP, an instance is a matrix \(A \in \mathbb{Z}^{m \times n}\) and vectors \(b \in \mathbb{Z}^m\), \(c \in \mathbb{Z}^n\). The size of an instance is the total number of bits required to write down all numbers in binary. If \(L(A)\) denotes the number of bits needed to write the largest entry of \(A\) (including sign), then the instance size is \(O(mn \cdot L(A))\). We use big-\(O\) notation throughout to suppress constant factors that depend on encoding choices.
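One possible encoding convention can be computed directly; the helpers below are illustrative (the exact constants depend on encoding choices, which the big-\(O\) suppresses).

```python
def bitsize(z):
    """Bits to write the integer z in binary, plus one sign bit."""
    return 1 + max(1, abs(z).bit_length())

def instance_size(A, b, c):
    """Total encoding length of an LP instance (A, b, c) under this convention."""
    entries = [e for row in A for e in row] + list(b) + list(c)
    return sum(bitsize(e) for e in entries)

A = [[3, -1], [2, 5]]
size = instance_size(A, b=[4, -7], c=[1, 1])   # sum of per-entry bit sizes
```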

This bit-complexity model differs from the arithmetic-operations model familiar from linear algebra courses: here we must also account for the fact that multiplying two \(k\)-bit numbers takes \(O(k^2)\) bit operations. Numbers that grow during a calculation can make an algorithm slow even if the number of arithmetic operations is small.

7.2 Polynomial-Time Algorithms and the Class P

An algorithm is polynomial-time if its running time (in bit operations) on any instance \(I\) is at most \(p(\lvert I \rvert)\) for some fixed polynomial \(p\), where \(\lvert I \rvert\) denotes the size of \(I\). The class P is the set of all decision problems that admit polynomial-time algorithms.

Polynomial time is the theoretical threshold for “efficient.” Exponential-time algorithms are practically unusable: a machine that runs a \(2^n\)-time algorithm on a size-100 instance in one second needs \(2^{25}\) seconds (about a year) for size 125, and the age of the universe is under \(2^{60}\) seconds. By contrast, a polynomial can always be composed with another polynomial to yield a polynomial, so polynomial subroutines compose freely — a key property for building complex algorithms from simpler ones.

7.3 The Classes NP and co-NP

For many combinatorial problems, we cannot find solutions efficiently but can verify them efficiently once given. A decision problem is in NP (nondeterministic polynomial time) if there exists a polynomial-time algorithm \(A\) and a polynomial \(p\) such that:

  1. For every YES instance \(I\), there exists a certificate \(C\) with \(\lvert C \rvert \le p(\lvert I \rvert)\) such that \(A(I, C)\) accepts.
  2. For every NO instance \(I\), \(A(I, C)\) rejects for every certificate \(C\).

The certificate is a short proof that the instance is a YES instance. The verifier \(A\) checks the certificate in polynomial time. For example, in the Hamiltonian cycle problem (does a graph have a cycle visiting all vertices?), the certificate is the cycle itself — once given, it can be verified by inspection in linear time.
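A certificate verifier for the Hamiltonian cycle problem takes time linear in the size of the graph. A minimal sketch (the function name and edge format are illustrative choices):

```python
def verify_hamiltonian_cycle(n, edges, cycle):
    """Check the certificate: `cycle` lists all n vertices exactly once,
    and consecutive vertices (cyclically) are joined by graph edges."""
    if sorted(cycle) != list(range(n)):
        return False                     # must visit every vertex exactly once
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset((cycle[i], cycle[(i + 1) % n])) in edge_set
               for i in range(n))

# A 4-cycle plus one chord: 0-1-2-3-0 is a Hamiltonian cycle.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
ok = verify_hamiltonian_cycle(4, edges, [0, 1, 2, 3])    # valid certificate
bad = verify_hamiltonian_cycle(4, edges, [0, 2, 1, 3])   # (1,3) is not an edge
```

Finding such a cycle may take exponential search, but checking a claimed one is easy — exactly the asymmetry NP captures.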

The class co-NP consists of problems whose complements are in NP: a problem is in co-NP if NO instances have polynomial-size certificates. LP infeasibility, for instance, has a Farkas certificate for every infeasible instance.

The inclusion \(P \subseteq \text{NP} \cap \text{co-NP}\) holds trivially, since a polynomial-time algorithm serves as its own verifier with the empty string as certificate. Edmonds’ conjecture asserts the converse: \(P = \text{NP} \cap \text{co-NP}\). This conjecture has profoundly influenced the field, guiding researchers to first prove that a problem is in both NP and co-NP as evidence that a polynomial algorithm should exist. A famous illustration of the asymmetry is the Hamiltonian cycle problem, which is trivially in NP but not known to be in co-NP.

7.4 Linear Algebra Feasibility is in P

The question “does \(Ax = b\) have a solution?” (with integer \(A, b\)) is in P. The algorithm is Gaussian elimination, but the polynomial-time analysis requires bounding the intermediate numbers.

Intermediate value bound. Apply forward Gaussian elimination, permuting rows and columns so no intermediate pivoting is needed (this can be arranged by running the algorithm once to determine the permutation, then re-running without permutations). After \(k\) elimination steps, each entry \(\gamma\) of the remaining submatrix satisfies:

\[ \gamma = \frac{\det(D_1)}{\det(D_0)}, \]

where \(D_0\) is the leading \(k \times k\) submatrix of \(A\) (after the permutations) and \(D_1\) is the \((k+1) \times (k+1)\) submatrix of \([A \mid b]\) obtained by appending to \(D_0\) the row and column passing through the entry \(\gamma\). This is proved by induction: row operations do not change determinants, so all intermediate entries are ratios of subdeterminants of the original data.

Size bound. An \(m \times m\) integer matrix with entries of size \(L\) has determinant bounded in absolute value by \(m! \cdot 2^{mL}\), so the size of its determinant is \(O(m \log m + m L)\). The ratios of subdeterminants therefore have bit size \(O(m \log m + m L(A))\), which is polynomial in the input size.

The same analysis applies to backward elimination. Since each step involves polynomially many arithmetic operations on numbers of polynomially bounded size, Gaussian elimination runs in polynomial time.
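The exact-arithmetic version of the algorithm can be sketched with Python's `Fraction` type, which keeps every intermediate entry as a reduced ratio of integers. This sketch handles only nonsingular square systems; `solve_exact` is an illustrative name.

```python
from fractions import Fraction

def solve_exact(A, b):
    """Gaussian elimination over the rationals for a nonsingular square
    system Ax = b.  Intermediate entries are ratios of subdeterminants
    of [A | b], so their bit sizes stay polynomially bounded."""
    n = len(A)
    M = [[Fraction(A[i][j]) for j in range(n)] + [Fraction(b[i])]
         for i in range(n)]
    for k in range(n):
        # Row permutation: find a nonzero pivot in column k.
        pivot = next(i for i in range(k, n) if M[i][k] != 0)
        M[k], M[pivot] = M[pivot], M[k]
        for i in range(k + 1, n):        # forward elimination
            factor = M[i][k] / M[k][k]
            M[i] = [mij - factor * mkj for mij, mkj in zip(M[i], M[k])]
    x = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):       # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j]
                              for j in range(i + 1, n))) / M[i][i]
    return x

x = solve_exact([[2, 1, -1], [-3, -1, 2], [-2, 1, 2]], [8, -11, -3])
```

The returned solution is exact, with no floating-point rounding; the size bound above is what guarantees the fractions do not blow up.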

NP and co-NP certificates. Linear algebra feasibility is also in NP (the solution itself certifies YES) and in co-NP (the Fundamental Theorem of Linear Algebra certifies NO: if \(Ax = b\) has no solution, there exists \(y\) with \(A^T y = 0\) and \(b^T y = 1\), and this \(y\) has polynomial size by the same sub-determinant argument). In this sense, linear algebra feasibility was already in P, NP, and co-NP long before complexity theory was formalized — Gauss’s algorithm has been “polynomial time” for centuries.

7.5 LP Feasibility is in NP and co-NP

LP in NP. We must show that if \(Ax \ge b\) is feasible, it has a feasible solution of polynomial size. The argument has two parts. If the polyhedron \(Q = \{x : Ax \ge b\}\) has an extreme point, that extreme point is the unique solution of a square system of \(n\) tight constraints; by Cramer’s rule its coordinates are ratios of subdeterminants, so its size is polynomially bounded by the size bound above. If \(Q\) has no extreme point, it contains a line; we lift to \(\mathbb{R}^{2n}\) by writing each variable as the difference of two non-negative variables. The lifted polyhedron contains no line (all variables are bounded below by zero) and hence has an extreme point. Projecting back gives a feasible solution of polynomial size. Therefore LP feasibility is in NP.

LP in co-NP. Farkas’ lemma provides the certificate. If \(Ax \ge b\) is infeasible, there exists \(y \ge 0\) with \(A^T y = 0\) and \(b^T y > 0\). This \(y\) is itself a feasible solution of a system of linear inequalities and therefore has polynomial size by the NP argument applied to that system. So LP infeasibility is certifiable efficiently: just present the Farkas vector.
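Checking a Farkas certificate is simple arithmetic. A toy checker, using the convention that \(y \ge 0\), \(A^T y = 0\), \(b^T y > 0\) refutes \(Ax \ge b\) (the inner product with \(b\) must be strictly positive, since any feasible \(x\) would give \(0 = y^T A x \ge y^T b\)):

```python
from fractions import Fraction

def check_farkas_certificate(A, b, y):
    """Verify y >= 0, A^T y = 0, and b.y > 0, proving Ax >= b infeasible."""
    m, n = len(A), len(A[0])
    y = [Fraction(yi) for yi in y]
    if any(yi < 0 for yi in y):
        return False                          # certificate must be nonnegative
    ATy = [sum(A[i][j] * y[i] for i in range(m)) for j in range(n)]
    return all(v == 0 for v in ATy) and \
        sum(bi * yi for bi, yi in zip(b, y)) > 0

# x >= 1 and -x >= 0 together are infeasible; y = (1, 1) certifies it:
# A^T y = 1 - 1 = 0 and b.y = 1 > 0.
ok = check_farkas_certificate([[1], [-1]], [1, 0], [1, 1])
```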

LP optimization in NP and co-NP. For LP optimization, the natural decision question is: “Is the optimal value of \(\max c^T x,\, Ax \ge b\) at least \(\alpha\)?” This is equivalent to the feasibility of the system \(Ax \ge b,\, c^T x \ge \alpha\), which is in NP and co-NP by the above.

7.6 LP is in P via the Ellipsoid Method

Combining Chapter 6 with the size bounds above:

Theorem 7.6.1. Linear programming is solvable in polynomial time.

The proof is the ellipsoid method with the two fixes from Section 6.4: replace the polyhedron with a bounded one (using the polynomial-size bound on extreme points) and perturb to full dimension (using a polynomially computable perturbation that preserves feasibility status). The resulting polytope is bounded, full-dimensional if nonempty, and has volume at least \(2^{-q}\) for some polynomial \(q\) of the input size. The ellipsoid method then terminates in polynomially many iterations, each in polynomial time, yielding an exact LP solution.

Notably, the simplex method is not known to be polynomial-time: for every commonly used pivot rule, there are families of polyhedra (such as the Klee–Minty cubes) that force exponentially many pivots. Whether some pivot rule achieves polynomial time, and even whether the graph of a polyhedron always has polynomially bounded diameter (the polynomial Hirsch conjecture), remain open questions.

7.7 Integer Feasibility is in NP

The integer feasibility problem asks: given an integer matrix \(A\) and vector \(b\), does \(Ax \ge b\) have an integer solution? The fundamental result of this section is that this problem is in NP.

Theorem 7.7.1. If \(Ax \ge b\) (with \(A, b\) integer) has an integer solution, then it has one of polynomial size.

Proof. Let \(Q = \{x \in \mathbb{R}^n : Ax \ge b\}\). By lifting (replacing each variable \(x_i\) with \(x_i^+ - x_i^-\), \(x_i^\pm \ge 0\)), we may assume \(Q\) is pointed. Since \(Q\) is a pointed polyhedron, the Minkowski–Weyl theorem gives a decomposition \(Q = \operatorname{conv}(X_0) + \operatorname{cone}(D_0)\), where \(X_0\) is the finite set of extreme points (rational, polynomial-size) and \(D_0\) is a finite set of vectors. A second application of Minkowski–Weyl, to the recession cone \(\{d : Ad \ge 0\}\), shows that \(D_0\) can be taken to consist of integer vectors, each of polynomial size (take the extreme points of the polytope \(B \cap \{d : Ad \ge 0\}\) with \(B = [-1,1]^n\), which are rational of polynomial size, and scale them to integers).

Given an integer solution \(\tilde x \in Q\), write \(\tilde x = \hat x + \hat d\) where \(\hat x = \sum_j \alpha_j x_j \in \operatorname{conv}(X_0)\) with \(\alpha_j \ge 0\), \(\sum_j \alpha_j = 1\), and \(\hat d \in \operatorname{cone}(D_0)\). By Carathéodory’s theorem for cones, express \(\hat d = \sum_{i=1}^l \beta_i d_i\) with \(d_1, \ldots, d_l \in D_0\) linearly independent, \(\beta_i \ge 0\), and \(l \le n\). If any \(\beta_i \ge 1\), replace \(\tilde x\) with \(\tilde x - d_i\) (still an integer point of \(Q\): \(d_i\) is an integer vector, and the coefficient of \(d_i\) drops to \(\beta_i - 1 \ge 0\), so the decomposition remains valid); repeat until all \(\beta_i \in [0,1)\). Now \(\tilde x = \sum_j \alpha_j x_j + \sum_i \beta_i d_i\) is a sum of at most \(n+1\) vectors of polynomial size with coefficients in \([0,1]\), so every coordinate of \(\tilde x\) is bounded in absolute value by \(n+1\) times the largest coordinate appearing in \(X_0 \cup D_0\). Since \(\tilde x\) is an integer vector of polynomially bounded magnitude, its bit size is polynomially bounded. \(\square\)

This proof is a showcase of polyhedral geometry: the Minkowski-Weyl decomposition, Carathéodory for cones, and the sub-determinant size bound all contribute to a result that appears number-theoretic but is fundamentally geometric.

7.8 NP-Completeness and Integer Programming

Once integer feasibility is known to be in NP, the natural questions are: Is it in co-NP? Is it in P? Strong evidence says no to both.

NP-hardness. A problem is NP-complete if it is in NP and every problem in NP reduces to it via polynomial-time reductions. Cook (1971) proved that the zero-one feasibility problem (does a system \(Ax \ge b\) have a 0-1 solution?) is NP-complete. The reduction encodes the execution of any NP certificate-checking algorithm as a system of linear inequalities over 0-1 variables:

  • Model the algorithm’s memory at each time step as 0-1 variables \(x_{ij}\) (bit \(i\) at time \(j\)).
  • Encode each instruction (a local memory update depending on neighboring bits) as a linear inequality. Boolean operations are modelled directly: \(x_C = \lnot x_A\) becomes \(x_C = 1 - x_A\); \(x_C = x_A \land x_B\) becomes \(x_C \ge x_A + x_B - 1,\, x_C \le x_A,\, x_C \le x_B\); \(x_C = x_A \lor x_B\) becomes \(x_C \le x_A + x_B,\, x_C \ge x_A,\, x_C \ge x_B\).
  • The accepting condition (bit \(x_{1,P} = 1\) at the final step \(P\)) is a single linear equality.

The entire NP certificate-checking algorithm, run on an unknown certificate, becomes a 0-1 feasibility system: a solution is precisely a certificate that makes the algorithm accept. The system has polynomially many variables and constraints (since the algorithm runs in polynomial time, the memory and number of steps are polynomial). Therefore, any NP problem reduces to 0-1 feasibility in polynomial time.
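The gate encodings in the bullets above can be checked exhaustively: over 0-1 values, each set of inequalities should hold exactly when the output bit equals the gate's value. A small truth-table verification:

```python
from itertools import product

# The inequality encodings of NOT, AND, OR over 0-1 variables.
def not_ok(a, c):
    return c == 1 - a

def and_ok(a, b, c):
    return c >= a + b - 1 and c <= a and c <= b

def or_ok(a, b, c):
    return c <= a + b and c >= a and c >= b

# For every 0-1 assignment, the inequalities hold iff c equals the gate value.
not_exact = all(not_ok(a, c) == (c == 1 - a)
                for a, c in product((0, 1), repeat=2))
and_exact = all(and_ok(a, b, c) == (c == (a and b))
                for a, b, c in product((0, 1), repeat=3))
or_exact = all(or_ok(a, b, c) == (c == (a or b))
               for a, b, c in product((0, 1), repeat=3))
```

Note the encodings are only exact over 0-1 variables; over real \([0,1]\) values they relax, which is precisely why integrality is the hard part.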

Integer feasibility extends 0-1 feasibility (since 0-1 constraints are just \(0 \le x_i \le 1\) with integrality). Integer feasibility also reduces to 0-1 feasibility: if an integer solution has polynomially bounded size, each variable can be expressed in binary with polynomially many bits, and each bit is a 0-1 variable. Substituting the binary expansion into \(Ax \ge b\) yields a 0-1 system. Hence integer feasibility and 0-1 feasibility are equivalent under polynomial reductions — both are NP-complete.
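The binary-expansion reduction can be illustrated on a toy constraint. The helper `substitute_bits` is hypothetical and simply enumerates bit vectors to exhibit the correspondence; the point is that substituting \(x = \sum_i 2^i b_i\) into a linear constraint yields a constraint that is still linear in the bits \(b_i\).

```python
from itertools import product

def substitute_bits(coef, const, k):
    """Encode coef*x >= const with 0 <= x < 2^k integer as a 0-1 system over
    bits b_0..b_{k-1} via x = sum 2^i b_i; return the satisfying bit vectors.
    (The substituted constraint sum coef*2^i*b_i >= const is linear in the b_i.)"""
    solutions = []
    for bits in product((0, 1), repeat=k):
        if sum(coef * 2 ** i * bi for i, bi in enumerate(bits)) >= const:
            solutions.append(bits)
    return solutions

# 3x >= 10 with 0 <= x < 8: the integer solutions are x in {4, 5, 6, 7}.
bit_solutions = substitute_bits(3, 10, 3)
xs = sorted(sum(2 ** i * bi for i, bi in enumerate(bits))
            for bits in bit_solutions)
```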

Theorem 7.8.1 (Cook, 1971). The zero-one feasibility problem is NP-complete. Hence integer feasibility is NP-complete.

Corollary 7.8.2. (i) If integer feasibility is in P, then P = NP. (ii) If integer feasibility is in co-NP, then NP = co-NP.

Both conclusions are considered extremely unlikely. The maximum clique problem (is there a clique of size \(k\)?), the Hamiltonian cycle problem, 3-satisfiability, and thousands of other combinatorial problems are all NP-complete — they reduce to and from integer feasibility in polynomial time.

7.9 Polynomial Systems: Beyond LP and IP

To close the course, we survey what happens when the linearity restriction is dropped.

Over the reals. A system of polynomial equations with integer coefficients, viewed as a decision problem over \(\mathbb{R}\), is decidable: there is an algorithm (though not a polynomial-time one) that determines whether the system has a real solution. The key is that the reals are a real closed field, and the theory of real closed fields admits quantifier elimination (Tarski, 1951): any first-order formula over the reals is equivalent to one with no quantifiers, hence decidable. Inequalities \(p(x) \ge 0\) can be modelled as polynomial equations by introducing a slack variable \(z\) with \(p(x) - z^2 = 0\).

Over the complex numbers. Polynomial systems over \(\mathbb{C}\) are also decidable and admit a beautiful certificate theorem: Hilbert’s Nullstellensatz asserts that a system \(p_1 = \cdots = p_m = 0\) has no complex solution if and only if there exist polynomials \(q_1, \ldots, q_m\) such that \(\sum_i q_i p_i = 1\). This is the exact nonlinear analog of Farkas’ lemma. The effective Nullstellensatz bounds the degrees of the \(q_i\) in terms of the degrees and number of the \(p_i\), converting the problem to a linear system in the coefficients of the \(q_i\) — though the resulting linear system has doubly-exponential size.

As a striking application: the three-colorability of a graph can be encoded as a polynomial system over \(\mathbb{C}\). Assign each vertex a color from the three cube roots of unity \(\{1, \omega, \omega^2\}\) (where \(\omega = e^{2\pi i/3}\)), impose the constraint \(x_v^3 - 1 = 0\) for each vertex \(v\), and encode “distinct colors on adjacent vertices” via \(\prod_{uv \in E} (x_u - x_v) \ne 0\), which in turn is written as a polynomial equation with an auxiliary variable. The system has a complex solution if and only if the graph is three-colorable.
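The encoding can be tested numerically for tiny graphs with Python's complex arithmetic. This brute-force search (the function `three_colorable` and the tolerance are illustrative, not part of the course's algebraic argument) checks whether some assignment of cube roots of unity makes the edge product nonzero:

```python
import cmath
from itertools import product

OMEGA = cmath.exp(2j * cmath.pi / 3)     # primitive cube root of unity
ROOTS = [1, OMEGA, OMEGA ** 2]           # the three "colors": each x_v^3 = 1

def three_colorable(n, edges, tol=1e-9):
    """Search assignments of cube roots of unity for one where the product
    prod_{uv in E} (x_u - x_v) is nonzero, i.e. adjacent vertices differ."""
    for assignment in product(ROOTS, repeat=n):
        p = 1.0
        for u, v in edges:
            p *= assignment[u] - assignment[v]
        if abs(p) > tol:
            return True                  # a complex solution of the system
    return False

triangle = [(0, 1), (1, 2), (2, 0)]
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
tri_ok = three_colorable(3, triangle)    # K3 is 3-colorable
k4_ok = three_colorable(4, k4)           # K4 needs 4 colors
```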

Over the integers — undecidability. Polynomial equations over \(\mathbb{Z}\) lie in a fundamentally different realm. Hilbert’s Tenth Problem asked (in 1900): is there an algorithm to determine whether a multivariate polynomial with integer coefficients has an integer solution? The answer, proved by Matiyasevich (1970) building on work of Davis, Putnam, and Robinson, is no — no algorithm exists.

The proof encodes any computer program as a system of polynomial equations: a solution exists if and only if the program terminates on a given input. Since the halting problem is undecidable, so is Hilbert’s Tenth Problem. The expressive power of polynomial equations over \(\mathbb{Z}\) is staggering: there exists an explicit system of 14 polynomial equations in 26 integer variables whose solutions for the variable \(k\) are precisely the primes (shifted by 2). Any decidable combinatorial conjecture can be reformulated as an instance of integer polynomial feasibility.

Over the rationals — an open problem. Whether polynomial equations over \(\mathbb{Q}\) are decidable remains one of the deepest open problems in mathematics. Fermat’s Last Theorem can be reformulated as the insolubility of \(x^n + y^n - z^n = 0\) in positive rationals for every integer \(n \ge 3\) — and this was not proved until Wiles (1995), over 350 years after Fermat’s conjecture. The intractability of even specific rational polynomial systems suggests that a general algorithm is unlikely, and the conjecture of undecidability is actively pursued (work of Poonen and others). The boundary between what is algorithmically solvable and what is not passes through this deceptively simple question.


Summary of Key Theorems

| Theorem | Statement |
| --- | --- |
| Weak Duality | \(c^T x \le b^T y\) for all feasible \(x, y\) |
| Strong Duality | If both feasible: \(\max c^T x = \min b^T y\) |
| Complementary Slackness | \(\bar x, \bar y\) optimal iff CS conditions (a)(b) hold |
| Fundamental Theorem of LP | An LP is optimal, unbounded, or infeasible |
| Farkas’ Lemma | Exactly one of (I) \(Ax \ge b\) or (II) \(y \ge 0,\, A^T y = 0,\, b^T y > 0\) is solvable |
| TU Guarantee | TU \(A\) + integral \(b\) → all extreme points integral |
| König | Bipartite: max matching = min vertex cover |
| Max-Flow Min-Cut | Integer capacities: max flow = min cut capacity |
| KKT | Convex program with Slater point: optimality iff KKT conditions hold |
| Ellipsoid Method | Separation oracle + bounded feasible region → polynomial feasibility check |
| Khachiyan 1979 | LP is solvable in polynomial time |
| LP in NP ∩ co-NP | Polynomial-size certificates for both feasibility and infeasibility |
| Cook 1971 | Zero-one (and integer) feasibility is NP-complete |
| Matiyasevich 1970 | Integer polynomial feasibility is undecidable |