PMATH 455: Convex Analysis and Geometry
Kateryna Tatarko
Estimated study time: 5 hr
These notes synthesize and enrich material from the primary sources listed below. The goal is to present convex geometry as a living subject — with motivation, geometric intuition, fully worked examples, and connections to analysis and optimization — rather than as a dry catalogue of definitions and theorems. All mathematical content is paraphrased into original prose; no text is reproduced verbatim from any source.
Sources and References
These notes draw primarily on the following works, paraphrased and synthesized into original exposition:
- Primary: D. Hug and W. Weil, Lectures on Convex Geometry, Springer, 2020. This is the main reference for the geometric theory of convex bodies, mixed volumes, and the Brunn–Minkowski theory.
- Supplementary (combinatorial/polyhedral): I. Pak, Lectures on Discrete and Polyhedral Geometry (draft manuscript), covering f-vectors, Euler’s formula, and combinatorial aspects of polytopes.
- Supplementary (asymptotic): S. Artstein-Avidan, A. Giannopoulos, and V. D. Milman, Asymptotic Geometric Analysis, Parts I and II, AMS Mathematical Surveys and Monographs, 2015 and 2021. Used for log-concavity, functional inequalities, and duality.
- Analysis background: E. Lieb and M. Loss, Analysis, 2nd ed., AMS Graduate Studies in Mathematics, 2001. Particularly relevant for rearrangement inequalities and functional analytic tools.
- Rearrangement inequalities: A. Burchard, A Short Course on Rearrangement Inequalities, online lecture notes, University of Toronto (available at math.toronto.edu/almut/rearrange.pdf). Principal reference for Chapter 7.
- Online resources: MIT OpenCourseWare lecture notes on convex analysis (6.253 and related courses); publicly available course notes from the University of Washington, Berkeley, and other institutions on convex geometry and functional analysis.
Chapter 1: Basic Convexity
The notion of convexity is deceptively simple — a set is convex if any two of its points can be “seen” from each other without leaving the set — yet it underpins an enormous range of mathematics. Convex geometry provides the geometric framework for optimization (every local minimum of a convex function is a global minimum), probability (log-concave distributions are the natural class for concentration inequalities), and analysis (the Hahn–Banach theorem is at heart a separation result for convex sets). In this chapter we lay the affine and convex foundations that will be needed throughout the course.
1.1 Affine Structure in \(\mathbb{R}^n\)
Before studying convexity, we must recall the affine geometry of Euclidean space. Fix the ambient space \(\mathbb{R}^n\) with its standard inner product \(\langle \cdot, \cdot \rangle\) and norm \(\|\cdot\|\).
The key distinction between linear and affine structure is that affine geometry forgets the special role of the origin. In a linear subspace, the origin must be present; in an affine subspace it need not be. This flexibility is essential when we study convex bodies that are not symmetric about the origin.
Affine maps are precisely the maps of the form \(x \mapsto Ax + b\) (a linear part plus a translation). They preserve affine combinations and therefore preserve convexity: if \(f(x) = Ax + b\) and \(C\) is convex, then \(f(C)\) is convex. Pure linear maps preserve convexity too, but restricting to them would privilege the origin: the linear image of any set containing \(0\) again contains \(0\), just as linear subspaces must pass through the origin. Affine maps are exactly general enough: they can translate, rotate, and shear, but they cannot “fold” or “curve,” so they map segments to segments and convex sets to convex sets.
The practical consequence: throughout this course, definitions and theorems will be stated in terms of affine combinations (where scalars sum to 1) rather than linear combinations (where scalars are unrestricted). The affine hull \(\mathrm{aff}(S)\), the relative interior \(\mathrm{relint}(C)\), and Carathéodory’s theorem (Section 2.1) all reflect this choice. Working affinely means the theory is invariant under translations — a convex body translated to a new position has exactly the same convex geometry, just at a different location.
Every affine subspace of \(\mathbb{R}^n\) is a translate of a unique linear subspace; that is, it has the form \(x_0 + V\) where \(V\) is a linear subspace and \(x_0 \in \mathbb{R}^n\). The dimension of an affine subspace is the dimension of its underlying linear subspace \(V\). Affine subspaces of dimensions \(0, 1, n-1, n\) are called points, lines, hyperplanes, and the full space, respectively.
A finite set \(S = \{x_0, \ldots, x_k\}\) is called affinely independent if no point of \(S\) is an affine combination of the others — equivalently, if the vectors \(x_1 - x_0, \ldots, x_k - x_0\) are linearly independent for any labeling of the points of \(S\).
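The equivalent characterization above reduces affine independence to a rank computation on the difference vectors. A minimal pure-Python sketch (the function name and the small Gaussian elimination are illustrative, not from the source):

```python
def affinely_independent(points, tol=1e-9):
    """Check affine independence: the vectors x_i - x_0 must be linearly independent,
    which we test by computing the rank of the matrix of difference vectors."""
    x0 = points[0]
    rows = [[xi - x0i for xi, x0i in zip(p, x0)] for p in points[1:]]
    cols = len(x0)
    r = [row[:] for row in rows]
    rank = 0
    for c in range(cols):
        # Find a pivot row for column c among the not-yet-used rows.
        piv = next((i for i in range(rank, len(r)) if abs(r[i][c]) > tol), None)
        if piv is None:
            continue
        r[rank], r[piv] = r[piv], r[rank]
        for i in range(len(r)):
            if i != rank:
                f = r[i][c] / r[rank][c]
                r[i] = [a - f * b for a, b in zip(r[i], r[rank])]
        rank += 1
    return rank == len(rows)
```

For instance, the vertices of a triangle in \(\mathbb{R}^2\) are affinely independent, three collinear points are not, and any four points in \(\mathbb{R}^2\) fail automatically (the maximum is \(n+1 = 3\)).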
1.1b A Warm-Up: Convexity as Absence of “Shortcuts”
Before the formal definition, here is a way to build geometric intuition. Think of a convex set as a “no-shortcut” domain: if you are at any two points inside the domain, you can travel in a straight line between them without leaving the domain. Non-convex sets are those with “dents” or “holes” that force you to detour.
- Convex: Open ball \(\{x : \|x\| < 1\}\), closed ball \(\|x\| \leq 1\), half-space \(\{x : a_1 x_1 + \cdots + a_n x_n \leq c\}\), line segment, affine subspace, the positive orthant \(\{x : x_i \geq 0 \,\forall i\}\), the cone of positive semidefinite matrices.
- Not convex: The unit sphere \(\|x\| = 1\) (the segment between two antipodal points passes through the center, which is not on the sphere), the set \(\{x : \|x\| \geq 1\}\) (the complement of the open ball — a segment between two points on the same side can dip inside radius 1), any set with a "bite" taken out of it, a donut (torus), the set \(\{(x,y) : xy \geq 1\}\) (the two branches of a hyperbola — the segment from \((1,1)\) on one branch to \((-1,-1)\) on the other passes through the origin, which is not in the set). Note that a single branch, \(\{(x,y) : xy \geq 1,\, x > 0\}\), is convex: it is the region above the graph of the convex function \(y = 1/x\).
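The "no-shortcut" criterion can be probed numerically: sample two points of the set and test whether random points of the joining segment stay inside. A sketch (helper names are illustrative; this is a randomized test that can refute convexity but never prove it):

```python
import math
import random

def seems_convex(indicator, sampler, trials=2000, seed=0):
    """Randomized convexity test: draw pairs of points of the set and check
    that a random point of the segment between them stays in the set."""
    rng = random.Random(seed)
    for _ in range(trials):
        p, q = sampler(rng), sampler(rng)
        t = rng.random()
        mid = tuple((1 - t) * a + t * b for a, b in zip(p, q))
        if not indicator(mid):
            return False          # found a segment that leaves the set
    return True                   # no counterexample found (not a proof!)

# Closed unit disk: convex.
def disk(x):
    return x[0] ** 2 + x[1] ** 2 <= 1 + 1e-12

def disk_sampler(rng):
    while True:                   # rejection sampling from the bounding square
        p = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        if disk(p):
            return p

# Unit circle (the boundary sphere alone): not convex.
def circle(x):
    return abs(math.hypot(x[0], x[1]) - 1) < 1e-9

def circle_sampler(rng):
    a = rng.uniform(0, 2 * math.pi)
    return (math.cos(a), math.sin(a))
```

Running `seems_convex` on the disk finds no counterexample, while on the circle the very first sampled chord leaves the set.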
Looking ahead: for polytopes, only finitely many directions matter — those of the facet normals. An \(n\)-dimensional polytope with \(m\) facets is determined by \(m\) numbers (the support function values at the \(m\) facet normals, i.e., the signed distances from the origin to the facet hyperplanes), giving a finite-dimensional parametrization of the “polytope moduli space.”
1.2 Convex Sets and the Convex Hull
Elementary examples of convex sets include: the empty set, singletons, line segments, open and closed balls, half-spaces \(\{x : \langle a, x \rangle \leq b\}\), affine subspaces, and the entire space \(\mathbb{R}^n\). Intersections of convex sets are convex; unions of convex sets need not be. More precisely, if \(C\) and \(D\) are convex subsets of \(\mathbb{R}^n\), then:
- The intersection \(C \cap D\) is convex.
- For any scalar \(\lambda \in \mathbb{R}\), the dilation \(\lambda C = \{\lambda x : x \in C\}\) is convex.
- The Minkowski sum \(C + D = \{x + y : x \in C, y \in D\}\) is convex.
- Affine images and preimages of convex sets are convex.
- If \(\{C_\alpha\}_\alpha\) is any family of convex sets, then \(\bigcap_\alpha C_\alpha\) is convex.
This closure under intersections is the key reason that convex sets are ubiquitous in optimization: feasible regions defined by linear inequalities (each defining a half-space, which is convex) are automatically convex.
1.2b Key Examples: Visualizing Convex Sets in Low Dimensions
Before proceeding to supporting hyperplanes, it is helpful to develop a library of standard examples and to understand how the convex hull operation behaves in familiar settings.
In dimension 1, the nonempty convex sets are exactly the intervals (possibly unbounded), and the convex hull of a compact set \(S \subset \mathbb{R}\) is the interval \([\min S, \max S]\). This 1D case is the foundation for the Prékopa–Leindler proof and the 1D Brunn–Minkowski inequality: for intervals, \(|A + B| = |A| + |B|\) (equality, not just inequality).
The convex hull in \(\mathbb{R}^2\) of a finite point cloud is a convex polygon — the smallest convex polygon containing all the points. Computing this convex hull is one of the fundamental problems in computational geometry, solvable in \(O(n \log n)\) time by the Graham scan or Jarvis march algorithms.
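To make the \(O(n \log n)\) claim concrete, here is a sketch of the monotone-chain variant of Graham's scan (a standard alternative to the scan and march named above; the function name is illustrative):

```python
def convex_hull(points):
    """Andrew's monotone chain: O(n log n) convex hull of 2D points.
    Returns the hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o): positive for a left turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:                       # build the lower hull left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):             # build the upper hull right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]      # endpoints are shared, drop duplicates
```

The sort dominates the running time; each point is pushed and popped at most once, so the two passes are linear.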
In the space of symmetric \(n \times n\) matrices, the positive semidefinite (PSD) matrices form a convex cone: if \(A, B \succeq 0\), then \(\lambda A + \mu B \succeq 0\) for all \(\lambda, \mu \geq 0\). The PSD cone is the fundamental object in semidefinite programming (SDP), which generalizes linear programming by replacing linear inequality constraints with matrix inequality constraints \(X \succeq 0\). Many problems in combinatorial optimization (maximum cut, graph coloring) can be relaxed to SDPs.
1.3 Supporting Hyperplanes and Separation
One of the most powerful tools in convex geometry is the ability to separate convex sets by hyperplanes. Geometrically, if two convex sets do not overlap, there should be a “wall” — a hyperplane — that keeps them apart. This intuition is made precise by the separation theorems, which form the geometric foundation of convex optimization duality.
The connection to optimization is direct: the KKT conditions for constrained optimization are equivalent to separation of the primal feasible set from the “target” set, and the strong duality theorem in linear programming is equivalent to the strict separation of disjoint convex sets.
1.4 Extreme Points
The extreme points are the “irreducible” building blocks of a convex set — the points that cannot be decomposed as mixtures of other points. In a polytope they are exactly the vertices; in a ball they are the points on the boundary sphere; in a half-space they do not exist (the half-space is unbounded and every boundary point can be expressed as a convex combination of other boundary points).
- For the closed unit disk \(\mathbb{B}^2 \subset \mathbb{R}^2\), the extreme points are precisely the points on the boundary circle \(\mathbb{S}^1\). Any interior point \(p\) with \(\|p\| < 1\) can be written as the midpoint of \(p - \varepsilon e\) and \(p + \varepsilon e\) (for small \(\varepsilon\) and any unit vector \(e\)), both of which lie in the disk.
- For a closed triangle (filled), the extreme points are the three vertices. Any point on an edge (but not a vertex) is the average of the two endpoints.
- For a closed line segment \(\left[a, b\right]\), the extreme points are just \(a\) and \(b\).
- For the positive semidefinite cone \(\mathcal{S}^n_+ = \{A \in \mathbb{R}^{n \times n} : A \succeq 0\}\), the extreme rays are those spanned by the rank-one matrices \(vv^T\), \(v \in \mathbb{R}^n \setminus \{0\}\).
The Krein–Milman theorem — every compact convex set is the convex hull of its extreme points — is proven in Chapter 7 in the context of integral representations; it shows that extreme points completely determine compact convex sets. It is a deep result: it says that any compact convex body, no matter how “round” and featureless it appears, is always built from its most rigid, irreducible elements.
For a simplex (triangle in 2D, tetrahedron in 3D, \(n\)-simplex in \(\mathbb{R}^n\)): the extreme points are exactly the \(n+1\) vertices, and Carathéodory’s theorem tells us every point in the simplex is a convex combination of at most \(n+1\) vertices — exactly the Krein–Milman content for this case. For the hypercube \([-1,1]^n\): the extreme points are the \(2^n\) vertices \((\pm 1, \ldots, \pm 1)\), and their convex hull is the entire cube. For the unit ball: the sphere generates the ball. The content of Krein–Milman is that this always works, no matter how “smooth” or “round” the set is, and even in infinite dimensions.
1.4b Carathéodory’s Theorem for Linear Hulls and Extreme Points
The following result is a linear (rather than affine) analogue of Carathéodory’s theorem, and connects Chapter 1 to Chapter 7.
This characterization is important for the simplex method: the vertices of the feasible polytope (the extreme points of the feasible region) are the only candidates for the optimal LP solution. The simplex method searches only among vertices (zero-dimensional faces), exploiting Krein–Milman (every compact convex set is the convex hull of its extreme points, so the max of a linear functional is attained at an extreme point).
1.5 Polyhedra, Polytopes, and the Minkowski Sum
The equivalence between the “vertex description” \(P = \mathrm{conv}(V)\) and the “half-space description” \(P = \{x : Ax \leq b\}\) is called the Weyl–Minkowski theorem. It is non-trivial: converting between the two representations is computationally hard in general (the “vertex enumeration problem”), but the theoretical equivalence is fundamental.
This operation — adding a ball to “smooth” a polytope \(P\) — is called taking the parallel body of \(P\) at radius 1. It is precisely the set of points within distance 1 of \(P\). The Steiner formula (Chapter 4) expresses the volume of this parallel body as a polynomial in the radius.
The Minkowski sum is the fundamental operation of convex geometry. It appears in the Steiner formula for the volume of dilated convex bodies, in the definition of mixed volumes, and in the Brunn–Minkowski inequality.
Chapter 2: Combinatorial Theorems
The three great combinatorial theorems of convex geometry — Carathéodory’s theorem, Radon’s theorem, and Helly’s theorem — all concern the interplay between convexity and dimension. They have the same flavor: “how many points do you need” or “what do small families force about large families.” Together they form the toolkit for the combinatorial geometry of convex sets.
2.0 The Big Picture: Combinatorial Convexity
Before diving into the theorems, it is worth understanding what questions they answer and why the answers are not obvious.
Carathéodory: If a point is in the convex hull of a set, how many generating points do you need? The naive answer is “possibly infinitely many” (e.g., a circle can generate a disk, and a single point on the circle is not enough). Carathéodory says: always at most \(n+1\) points in \(\mathbb{R}^n\). This is remarkable and tight: for the standard simplex (the convex hull of \(n+1\) points), an interior point genuinely requires all \(n+1\) vertices.
Radon: Can you always partition a set of points into two parts with intersecting convex hulls? For \(n+1\) affinely independent points, the answer is no — the convex hulls of any partition of a simplex’s vertices into two non-empty parts are disjoint. But add one more point (\(n+2\) total) and the answer becomes: always yes. This is the threshold.
Helly: If a collection of convex sets satisfies a “small-scale” intersection condition (every \(n+1\) of them share a point), does it satisfy a “global” intersection condition (all of them share a point)? Again, the answer is: yes (for finite families), and the threshold \(n+1\) is tight: there exist \(n+1\) convex sets in \(\mathbb{R}^n\), every \(n\) of which share a point, yet with empty total intersection — take the \(n+1\) facets of a simplex. Strengthening the hypothesis to every \(n+1\) rules out such examples.
The number \(n+1\) appearing in all three theorems is not a coincidence: it is the affine dimension of \(\mathbb{R}^n\) plus one, and reflects the fact that \(n+1\) is the maximum size of an affinely independent set in \(\mathbb{R}^n\).
2.1 Carathéodory’s Theorem
The first major combinatorial result tells us that convex combinations can always be realized with few points.
Since \(k > n+1\), the \(k\) points \(x_1, \ldots, x_k\) cannot be affinely independent (as \(\mathbb{R}^n\) has affine dimension \(n\), so any affinely independent set has at most \(n+1\) elements). Therefore there exist real numbers \(\mu_1, \ldots, \mu_k\), not all zero, with \(\sum_{i=1}^k \mu_i = 0\) and \(\sum_{i=1}^k \mu_i x_i = 0\).
For any \(s \in \mathbb{R}\), the combination \(x = \sum_{i=1}^k (\lambda_i - s\mu_i) x_i\) still has coefficients summing to \(\sum \lambda_i - s \sum \mu_i = 1\). Since \(\sum \mu_i = 0\) and not all \(\mu_i\) vanish, some \(\mu_i > 0\); choose \(s^* = \min\left\{\frac{\lambda_i}{\mu_i} : \mu_i > 0\right\}\). Note \(s^* \geq 0\), so the coefficients with \(\mu_i \leq 0\) can only grow. Then all coefficients \(\lambda_i - s^*\mu_i\) are nonnegative, and at least one equals zero (the one achieving the minimum). This expresses \(x\) as a convex combination of at most \(k-1\) points. Repeating this argument reduces to at most \(n+1\) points.
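The coefficient-shifting step above is effectively an algorithm, and can be sketched numerically (assuming NumPy; the affine dependence \(\mu\) is extracted from an SVD nullspace, and all names are illustrative):

```python
import numpy as np

def caratheodory_reduce(points, weights):
    """Reduce a convex combination to at most n+1 of the given points,
    following the coefficient-shifting argument in the proof."""
    P = np.asarray(points, dtype=float)      # shape (k, n)
    lam = np.asarray(weights, dtype=float)   # shape (k,), sums to 1
    n = P.shape[1]
    while len(lam) > n + 1:
        # Affine dependence: mu with sum(mu) = 0 and sum(mu_i x_i) = 0.
        M = np.vstack([P.T, np.ones(len(lam))])   # (n+1) x k, k > n+1
        _, _, Vt = np.linalg.svd(M)
        mu = Vt[-1]                               # a nullspace vector
        if not np.any(mu > 1e-12):
            mu = -mu                              # ensure some mu_i > 0
        pos = mu > 1e-12
        s = np.min(lam[pos] / mu[pos])            # s* = min over mu_i > 0
        lam = lam - s * mu                        # still sums to 1; one coeff hits 0
        keep = lam > 1e-10
        P, lam = P[keep], lam[keep]
    return P, lam
```

Each pass drops at least one point while leaving the represented point \(\sum_i \lambda_i x_i\) unchanged, exactly as in the proof.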
2.2 Radon’s Theorem
Partition the indices into \(I^+ = \{i : \mu_i \geq 0\}\) and \(I^- = \{i : \mu_i < 0\}\). Let \(\Lambda = \sum_{i \in I^+} \mu_i = \sum_{i \in I^-} (-\mu_i) > 0\) (both sums equal \(\Lambda\) since \(\sum \mu_i = 0\)). Then the point \(p = \sum_{i \in I^+} \frac{\mu_i}{\Lambda} x_i = \sum_{i \in I^-} \frac{-\mu_i}{\Lambda} x_i\) is a convex combination of points in \(\{x_i : i \in I^+\}\) and simultaneously of points in \(\{x_i : i \in I^-\}\), so \(p\) lies in both convex hulls.
Radon’s theorem is the foundation of Helly’s theorem. It can also be stated as: any \(n+2\) points admit a Radon partition \(A \cup B\) with \(A \cap B = \emptyset\) and \(\mathrm{conv}(A) \cap \mathrm{conv}(B) \neq \emptyset\).
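The proof of Radon's theorem is constructive, and the construction can be sketched in code (assuming NumPy; helper names are illustrative):

```python
import numpy as np

def radon_partition(points):
    """Split n+2 points in R^n into two sets with intersecting convex hulls,
    returning both parts and the common 'Radon point'."""
    P = np.asarray(points, dtype=float)      # shape (k, n) with k >= n+2
    k, n = P.shape
    assert k >= n + 2
    # Affine dependence mu: sum(mu) = 0 and sum(mu_i x_i) = 0.
    M = np.vstack([P.T, np.ones(k)])
    _, _, Vt = np.linalg.svd(M)
    mu = Vt[-1]
    pos = mu > 0                             # both signs occur since sum(mu) = 0
    Lam = mu[pos].sum()
    radon_point = (mu[pos] / Lam) @ P[pos]   # = the same point computed from ~pos
    return P[pos], P[~pos], radon_point
```

For the four points \((0,0), (2,0), (0,2), (0.5,0.5)\) in \(\mathbb{R}^2\), the partition separates \((0.5,0.5)\) from the triangle containing it, and the Radon point is \((0.5,0.5)\) itself.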
2.3 Helly’s Theorem
For the inductive step, assume the theorem holds for any \(m-1\) sets (with \(m > n+1\)). For each \(i \in \{1, \ldots, m\}\), the hypothesis guarantees every \(n+1\) of the remaining \(m-1\) sets \(\{C_j : j \neq i\}\) have a common point (since those \(n+1\) sets are also \(n+1\) of the full family). By the inductive hypothesis applied to the family \(\{C_j : j \neq i\}\), there exists a point \(x_i \in \bigcap_{j \neq i} C_j\).
We now have \(m\) points \(x_1, \ldots, x_m \in \mathbb{R}^n\). Since \(m \geq n+2\), Radon’s theorem applies: partition \(\{x_1, \ldots, x_m\}\) into disjoint sets \(A\) and \(B\) with \(\mathrm{conv}(A) \cap \mathrm{conv}(B) \ni p\) for some point \(p\).
Fix any index \(k\). Either \(x_k \in A\) or \(x_k \in B\). If \(x_k \in A\), then every point of \(B\) (say \(x_j \in B\)) satisfies \(j \neq k\), so \(x_j \in C_k\). Since \(C_k\) is convex and \(p \in \mathrm{conv}(B)\), we get \(p \in C_k\). Similarly if \(x_k \in B\). Since \(k\) was arbitrary, \(p \in \bigcap_{k=1}^m C_k\).
Now take instead \([0,2]\), \([1,3]\), \([2,4]\). Every two overlap, and the global intersection is \(\{2\}\) — just a single point. Modifying the last interval to \([2.1, 4]\) destroys the common intersection: \([0,2]\) and \([2.1, 4]\) fail to intersect, violating the pairwise condition.
Consider the four half-planes \(C_1 = \{x \geq 0\}\), \(C_2 = \{y \geq 0\}\), \(C_3 = \{x + y \leq 2\}\), and \(C_4 = \{x - y \leq 1\}\) in \(\mathbb{R}^2\). Every three of them have a common point:
- \(C_1 \cap C_2 \cap C_3\): the triangle \(\{x,y \geq 0,\, x+y \leq 2\}\), nonempty (contains origin).
- \(C_1 \cap C_2 \cap C_4\): the region \(x \geq 0, y \geq 0, x - y \leq 1\). Contains \((0,0)\).
- \(C_1 \cap C_3 \cap C_4\): \(x \geq 0, x+y \leq 2, x-y \leq 1\). Contains \((1,0)\).
- \(C_2 \cap C_3 \cap C_4\): \(y \geq 0, x+y \leq 2, x-y \leq 1\). Contains \((0,0)\).
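The four-set example can be verified by brute force; a sketch (the half-planes below are reconstructed from the listed triple intersections, and the grid search is a crude stand-in for an LP feasibility check):

```python
from itertools import combinations

# Four half-planes consistent with the triple intersections listed above:
# C1 = {x >= 0}, C2 = {y >= 0}, C3 = {x + y <= 2}, C4 = {x - y <= 1}.
C = [lambda p: p[0] >= 0,
     lambda p: p[1] >= 0,
     lambda p: p[0] + p[1] <= 2,
     lambda p: p[0] - p[1] <= 1]

def common_point(sets, step=0.05, lo=-3.0, hi=3.0):
    """Brute-force grid search for a point in the intersection."""
    n = int(round((hi - lo) / step))
    for i in range(n + 1):
        for j in range(n + 1):
            p = (lo + i * step, lo + j * step)
            if all(c(p) for c in sets):
                return p
    return None

# Helly in R^2: every 3 of the 4 sets intersect, hence all 4 must.
triples_ok = all(common_point(list(t)) is not None for t in combinations(C, 3))
all_four = common_point(C)
```

Here Helly's theorem is confirmed by inspection: every triple has a common point, and so does the full family.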
There is a fourth theorem in this family worth knowing: Tverberg’s theorem (1966), which generalizes Radon. It states that any set of \((r-1)(n+1)+1\) points in \(\mathbb{R}^n\) can be partitioned into \(r\) subsets whose convex hulls have a common point. For \(r=2\) this is exactly Radon’s theorem (\(n+2\) points, partition into 2). Tverberg’s theorem is the starting point for an active modern area (topological Tverberg theorems and their combinatorial consequences).
Together, these results illustrate a deep principle: in convex geometry, dimension bounds “what can happen.” In \(\mathbb{R}^n\), the threshold is always \(n+1\) or \(n+2\) — the maximum size of an affinely independent set, and the number of points at which affine dependence is forced.
2.4 Kirchberger’s Theorem
Kirchberger’s theorem concerns the separability of two finite point sets.
Chapter 3: Geometry of Polytopes
A polytope is the convex hull of finitely many points, or equivalently (by Weyl–Minkowski) the bounded intersection of finitely many half-spaces. Polytopes are the “molecules” of convex geometry: they are the most concrete convex bodies, they can be described explicitly, and their combinatorial structure is both rich and tractable. The study of polytopes connects combinatorics, geometry, and algebra in deep ways.
3.0 Motivation: Why Study Polytopes?
Polytopes occupy a special place in convex geometry: they are the most concrete convex bodies (described by finitely many data), yet their study reveals deep truths that apply to all convex bodies. Every convex body can be approximated by polytopes (in the Hausdorff metric, polytopes are dense in the space of convex bodies), so proving results for polytopes and passing to a limit often gives results for general convex bodies.
The study of polytopes also connects to:
- Linear programming: the feasible region of an LP is a polyhedron, and the optimal solution is at a vertex. The simplex method navigates the face structure of the polytope to find the optimum.
- Combinatorics: the face lattice of a polytope encodes combinatorial information (Euler’s formula, Dehn–Sommerville relations, the \(g\)-theorem). The connection between polytope theory and algebraic combinatorics (via toric varieties and Stanley–Reisner rings) is one of the most productive interfaces in modern mathematics.
- Algebraic geometry: the Newton polytope of a polynomial controls many of its geometric and algebraic properties. The theory of tropical geometry replaces polynomial rings by polytope combinatorics.
3.1 Faces, f-Vectors, and Euler’s Formula
A convex polytope \(P \subset \mathbb{R}^n\) is the convex hull of finitely many points. The combinatorial structure of \(P\) is encoded in its faces.
A cleaner modern proof of Euler’s formula uses simplicial homology: the Euler characteristic \(\chi = V - E + F\) equals the alternating sum of Betti numbers, and for \(\mathbb{S}^2\) these are \(\beta_0 = \beta_2 = 1, \beta_1 = 0\), giving \(\chi = 2\).
| Solid | \(V\) | \(E\) | \(F\) | \(V-E+F\) |
|---|---|---|---|---|
| Tetrahedron | 4 | 6 | 4 | 2 |
| Cube | 8 | 12 | 6 | 2 |
| Octahedron | 6 | 12 | 8 | 2 |
| Dodecahedron | 20 | 30 | 12 | 2 |
| Icosahedron | 12 | 30 | 20 | 2 |
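The table can be checked mechanically; a trivial sketch with the same data:

```python
# (V, E, F) for the five Platonic solids, as in the table above.
solids = {
    "tetrahedron": (4, 6, 4),
    "cube": (8, 12, 6),
    "octahedron": (6, 12, 8),
    "dodecahedron": (20, 30, 12),
    "icosahedron": (12, 30, 20),
}
euler = {name: v - e + f for name, (v, e, f) in solids.items()}
# Every value equals 2, as Euler's formula predicts.
```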
3.2 Gauss–Bonnet Theorem for Polytopes
The classical Gauss–Bonnet theorem for smooth surfaces has a discrete analogue for polytopes, expressed in terms of angle defects.
3.3 Cauchy’s Rigidity Theorem
A striking theorem in the geometry of polytopes is that convex polytopes are rigid—their shape is determined by the shapes (but not the sizes) of their faces and the combinatorial structure.
3.3b Normal Fans and the Structure of Polytopes
The combinatorial structure of a polytope is encoded in its normal fan — a partition of \(\mathbb{R}^n\) into cones that remembers which faces are “optimal” in each direction.
The Minkowski sum \(P + Q\) has as its normal fan the common refinement of the normal fans of \(P\) and \(Q\). This is why Minkowski sums can “refine” the combinatorial structure of a polytope (adding more faces), and it is the precise geometric content of the formula \(h_{P+Q} = h_P + h_Q\).
3.4 Minkowski’s Existence and Uniqueness Theorem
A central question in polytope theory: given facet normals and areas, does a convex polytope with those data exist, and is it unique?
Chapter 4: Mixed Volumes and Symmetrizations
The theory of mixed volumes is the heart of Brunn–Minkowski theory. It generalizes the notion of volume to a multilinear setting, capturing how the volume of a Minkowski combination of bodies depends on all the pieces simultaneously. The Alexandrov–Fenchel inequalities are among the deepest results in convex geometry, with connections to algebraic geometry, combinatorics, and information theory.
4.0 Motivation: Volume as a Polynomial
Before the Steiner formula, let us appreciate why it is remarkable that the volume of a parallel body should be a polynomial. For a single point \(K = \{p\}\), the parallel body \(K_\varepsilon = B(p, \varepsilon)\) is a ball of radius \(\varepsilon\), with volume \(\omega_n \varepsilon^n\) — a polynomial. For a line segment \(K = [a,b]\) in \(\mathbb{R}^3\) (1-dimensional), the parallel body is a “capsule” (cylinder with hemispherical caps), with volume \(\pi\varepsilon^2 |b-a| + \frac{4}{3}\pi\varepsilon^3\) — again a polynomial. For a flat polygon \(K\) in \(\mathbb{R}^3\), the parallel body has volume equal to the polygon area times \(2\varepsilon\) (the flat slab) plus edge contributions (quarter-cylinders) plus corner contributions (spherical wedges). The formula sums to a polynomial in \(\varepsilon\), and the coefficient structure is exactly the Steiner formula.
The general Steiner formula says this always works, for any convex body in any dimension. The coefficients \(W_k(K)\) are the quermassintegrals — intrinsic geometric measurements of \(K\) at all scales. The formula simultaneously encodes the volume (\(k=0\)), the surface area (\(k=1\)), the mean width (\(k=n-1\)), and the Euler characteristic (implicitly, through the Gauss–Bonnet theorem).
4.1 The Steiner Formula
Let \(K \subset \mathbb{R}^n\) be a convex body (compact convex set with nonempty interior) and \(\mathbb{B}^n\) the unit ball. The parallel body of \(K\) at distance \(\varepsilon \geq 0\) is \(K_\varepsilon = K + \varepsilon \mathbb{B}^n\). A fundamental discovery of Jakob Steiner is that the volume of \(K_\varepsilon\) is a polynomial in \(\varepsilon\).
\textbf{Unit ball.} For \(K = \mathbb{B}^n\), we have \(K + \varepsilon \mathbb{B}^n = (1+\varepsilon)\mathbb{B}^n\), so \(\mathrm{Vol}_n(K + \varepsilon \mathbb{B}^n) = (1+\varepsilon)^n \omega_n\) where \(\omega_n = \mathrm{Vol}_n(\mathbb{B}^n)\). Expanding by the binomial theorem: \((1+\varepsilon)^n \omega_n = \sum_{k=0}^n \binom{n}{k} \omega_n \varepsilon^k\). So \(W_k(\mathbb{B}^n) = \omega_n\) for all \(k\). The quermassintegrals of the ball are all equal to its volume (up to normalization).
\textbf{Box.} For the box \(K = [0,a] \times [0,b] \times [0,c] \subset \mathbb{R}^3\), the parallel body decomposes into the box itself, slabs over the six faces, quarter-cylinders along the twelve edges, and eighth-spheres at the eight corners:
\[ \mathrm{Vol}_3(K + \varepsilon \mathbb{B}^3) = abc + 2(ab + bc + ca)\,\varepsilon + \pi(a+b+c)\,\varepsilon^2 + \frac{4\pi}{3} \varepsilon^3. \]Matching with the Steiner formula \(\mathrm{Vol}_3(K + \varepsilon\mathbb{B}^3) = \sum_{k=0}^3 \binom{3}{k} W_k(K)\,\varepsilon^k\): \(W_0 = abc\), \(3W_1 = 2(ab+bc+ca)\) (one third of the surface area times three), \(3W_2 = \pi(a+b+c)\), and \(W_3 = \frac{4\pi}{3} = \omega_3\), as expected (the unit ball has volume \(\frac{4\pi}{3}\)).
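The box computation can be sanity-checked by Monte Carlo: sample points near the box and count those within distance \(\varepsilon\). A sketch (names are illustrative; the tolerance in the comparison reflects sampling error):

```python
import math
import random

def dist_to_box(p, dims):
    """Euclidean distance from p to the axis-aligned box [0,a] x [0,b] x [0,c]."""
    return sum(max(0.0, -x, x - d) ** 2 for x, d in zip(p, dims)) ** 0.5

def parallel_body_volume_mc(dims, eps, samples=200_000, seed=1):
    """Monte Carlo volume of K + eps*B for the box K with side lengths dims."""
    rng = random.Random(seed)
    vol_bound = 1.0
    for d in dims:                 # bounding box [-eps, d_i + eps] per axis
        vol_bound *= d + 2 * eps
    hits = 0
    for _ in range(samples):
        p = [rng.uniform(-eps, d + eps) for d in dims]
        if dist_to_box(p, dims) <= eps:
            hits += 1
    return vol_bound * hits / samples

a, b, c, eps = 1.0, 2.0, 3.0, 0.5
steiner = (a * b * c + 2 * (a * b + b * c + c * a) * eps
           + math.pi * (a + b + c) * eps ** 2 + (4 * math.pi / 3) * eps ** 3)
mc = parallel_body_volume_mc((a, b, c), eps)
```

The two values agree to within the Monte Carlo sampling error, confirming the polynomial.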
4.2 Mixed Volumes
The Steiner formula motivates a multilinear generalization. For convex bodies \(K_1, \ldots, K_m\) and non-negative scalars \(t_1, \ldots, t_m\), the volume \(\mathrm{Vol}_n(t_1 K_1 + \cdots + t_m K_m)\) is a polynomial in \(t_1, \ldots, t_m\) of degree \(n\).
Mixed volumes thus form a multilinear algebra for convex bodies: they assign to any \(n\)-tuple of convex bodies a single non-negative real number that varies multilinearly. This is analogous to how the determinant is a multilinear function of \(n\) vectors in \(\mathbb{R}^n\). Indeed, for line segments \(K_i = [0, v_i]\), the mixed volume \(V(K_1, \ldots, K_n) = |\det(v_1, \ldots, v_n)| / n!\) — the mixed volume of segments is (up to a constant) the absolute value of the determinant of their direction vectors.
Here \(V(K_1, K_1) = \mathrm{Area}(K_1) = 4\), \(V(K_2, K_2) = \mathrm{Area}(\mathbb{B}^2) = \pi\), and \(V(K_1, K_2)\) is the mixed area. By the formula \(V(K, L) = \frac{1}{2}\left(\mathrm{Area}(K+L) - \mathrm{Area}(K) - \mathrm{Area}(L)\right)\), we use \(K_1 + K_2 = [-1,1]^2 + \mathbb{B}^2\) (the rounded square). Its area is \(4 + 4\pi/4 \cdot 4 + \pi \cdot 1 = 4 + 4 + \pi\) — wait, more carefully: the rounded square has 4 flat sides each of length 2 at distance 1 from the center, and 4 quarter-circles of radius 1. Area \(= 2 \times 2 \times 4 \text{ (rectangle core)} + \pi \times 1^2 \text{ (disk)} = 4 + \pi\) for the unit ball part. Actually: \(\mathrm{Area}([-1,1]^2 + \mathbb{B}^2) = 4 + \mathrm{perimeter of } [-1,1]^2 + \pi = 4 + 8 + \pi\). So \(V(K_1, K_2) = \frac{1}{2}(4 + 8 + \pi - 4 - \pi) = \frac{8}{2} = 4\). Geometrically, the mixed area of a convex body and the disk equals half its perimeter — a beautiful connection between mixed volumes and surface area.
The Alexandrov–Fenchel inequalities are deep generalizations of the Brunn–Minkowski inequality and have profound combinatorial consequences (including log-concavity of the sequence of mixed volumes).
4.2b The Mixed Volume Formula for Polytopes
For polytopes, mixed volumes have an explicit combinatorial formula. Let \(P\) and \(Q\) be polytopes in \(\mathbb{R}^n\). The mixed volume \(V(P, Q, \ldots, Q)\) (with \(P\) appearing once and \(Q\) appearing \(n-1\) times) is related to the “boundary interaction” between \(P\) and \(Q\).
For \(P = Q = \mathbb{B}^2\): the “edges” become infinitesimal arcs, and the formula gives \(2V(\mathbb{B}^2, \mathbb{B}^2) = \int_{\mathbb{S}^1} h_{\mathbb{B}^2}(u) \,d\sigma(u) = \int_{\mathbb{S}^1} 1\, d\sigma = 2\pi\), consistent with \(2V(\mathbb{B}^2, \mathbb{B}^2) = 2\,\mathrm{Area}(\mathbb{B}^2) = 2\pi\).
For \(P\) a rectangle \([0,a] \times [0,b]\) and \(Q = \mathbb{B}^2\): the rectangle has four edges with outward normals \(e_1, -e_1, e_2, -e_2\) and lengths \(b, b, a, a\). The support function of \(\mathbb{B}^2\) is \(h = 1\) (constant on unit vectors). So \(2V(P, Q) = 1 \cdot b + 1 \cdot b + 1 \cdot a + 1 \cdot a = 2(a+b)\), giving \(V(P, \mathbb{B}^2) = a+b = \frac{1}{2}\mathrm{perimeter}(P)\). This confirms the general formula: \(V(K, \mathbb{B}^2) = \frac{1}{2}\mathrm{perimeter}(K)\) in the plane.
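The identity \(V(K, \mathbb{B}^2) = \frac{1}{2}\mathrm{perimeter}(K)\) can be verified numerically by approximating the disk with a regular polygon and using the fact that for convex polygons \(P + Q = \mathrm{conv}\{p + q : p \in P,\ q \in Q\}\). A sketch (names are illustrative):

```python
import math

def hull(points):
    """Monotone-chain convex hull, counter-clockwise, endpoints not repeated."""
    pts = sorted(set(points))
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lo, up = [], []
    for p in pts:
        while len(lo) >= 2 and cross(lo[-2], lo[-1], p) <= 0:
            lo.pop()
        lo.append(p)
    for p in reversed(pts):
        while len(up) >= 2 and cross(up[-2], up[-1], p) <= 0:
            up.pop()
        up.append(p)
    return lo[:-1] + up[:-1]

def area(poly):
    """Shoelace formula for a simple polygon given in order."""
    return 0.5 * abs(sum(x1 * y2 - x2 * y1
                         for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1])))

def minkowski_sum(P, Q):
    """For convex polygons, P + Q is the hull of all pairwise vertex sums."""
    return hull([(p[0] + q[0], p[1] + q[1]) for p in P for q in Q])

square = [(0, 0), (1, 0), (1, 1), (0, 1)]          # perimeter 4
m = 64
disk = [(math.cos(2 * math.pi * k / m), math.sin(2 * math.pi * k / m))
        for k in range(m)]                          # regular 64-gon ~ unit disk
mixed = 0.5 * (area(minkowski_sum(square, disk)) - area(square) - area(disk))
# mixed should be close to half the square's perimeter, i.e. 2.
```

Because the 64-gon has vertices exactly in the directions \(\pm e_1, \pm e_2\), its widths in the axis directions match the disk's, so the mixed area comes out as \(2\) up to floating-point error.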
- \(V(K,K,K) = \mathrm{Vol}(K)\) — the volume of \(K\).
- \(3V(K,K,B) = S(K)\), the surface area of \(K\) — so \(V(K,K,B) = S(K)/3\).
- \(3V(K,B,B) = M(K)\), where \(M(K) = \int_{\partial K} H\,dA\) is the integral of the mean curvature \(H = \frac{\kappa_1 + \kappa_2}{2}\) (for smooth \(K\)); equivalently \(3V(K,B,B) = 2\pi w(K)\) with \(w(K)\) the mean width — so \(V(K,B,B) = M(K)/3\).
- \(V(B,B,B) = \mathrm{Vol}(B) = \omega_3 = \frac{4\pi}{3}\).
4.3 Symmetrizations
A symmetrization is a procedure that replaces a convex body by a more symmetric one while preserving or improving certain geometric quantities.
- \(S_u(K)\) is a convex body.
- \(\mathrm{Vol}_n(S_u(K)) = \mathrm{Vol}_n(K)\) (volume is preserved).
- The diameter satisfies \(\mathrm{diam}(S_u(K)) \leq \mathrm{diam}(K)\).
- The surface area does not increase: \(\mathrm{Vol}_{n-1}(\partial S_u(K)) \leq \mathrm{Vol}_{n-1}(\partial K)\).
- Repeated Steiner symmetrizations in appropriately chosen directions converge (in Hausdorff metric) to a ball.
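Steiner symmetrization is easy to simulate for a body described by its vertical chords: each chord is simply recentred about the axis. A sketch in which the body is a triangle sampled column by column (names are illustrative; volume preservation is exact by construction, and the diameter comparison is over sampled chord endpoints):

```python
def steiner_symmetrize(columns):
    """Steiner symmetrization w.r.t. the x-axis for a convex body stored
    column-wise: columns[x] = (y_min, y_max); each chord is recentred at y = 0."""
    return {x: (-(y1 - y0) / 2, (y1 - y0) / 2) for x, (y0, y1) in columns.items()}

def chord_area(columns, dx):
    """Riemann-sum area from the chord lengths."""
    return sum((y1 - y0) * dx for y0, y1 in columns.values())

def diameter(columns):
    """Max distance between sampled chord endpoints (attained on the boundary)."""
    pts = [(x, y) for x, (y0, y1) in columns.items() for y in (y0, y1)]
    return max(((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
               for a in pts for b in pts)

# Example: the (non-centred) triangle with vertices (0,0), (1,0), (0,3),
# described by its vertical chords 0 <= y <= 3(1 - x).
dx = 0.01
tri = {i * dx: (0.0, 3 * (1 - i * dx)) for i in range(100)}
sym = steiner_symmetrize(tri)
```

The symmetrized body has exactly the same chord lengths (hence the same area), while its diameter drops from about \(3.16\) to \(3\), illustrating the first three properties in the list above.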
4.4 The Brunn–Minkowski Inequality
The Brunn–Minkowski inequality is the central inequality of convex geometry, and arguably the single deepest theorem in the subject. Minkowski discovered it around 1887 in the context of mixed volumes; Brunn had proved the concavity of cross-sectional volumes slightly earlier. Together their work revealed that volume behaves “like a linear function” when \(1/n\)-th powers are taken — a profound geometric-analytic statement that encodes the isoperimetric inequality, the AM–GM inequality, log-concavity of measures, and much more.
The inequality has the form of a concavity statement: the function \(t \mapsto \mathrm{Vol}((1-t)A + tB)^{1/n}\) is concave. This is what makes convex bodies “well-behaved” under mixing: as you interpolate between two bodies, the volume (properly normalized) does not drop below the linear interpolation of the endpoint volumes. The normalization \(1/n\) is forced by homogeneity — it is the unique power that makes the inequality scale-invariant.
The inequality has been called “the backbone of convex geometry” (by Schneider) and “the most important inequality in geometry” (by numerous authors). It implies:
- The isoperimetric inequality (ball minimizes surface area for given volume) — Section 4.5.
- Brunn’s theorem on log-concavity of cross-sectional volumes — Section 5.3.
- The Prékopa–Leindler inequality (in fact, the two are equivalent) — Section 5.2.
- An information-theoretic analogue: the Shannon entropy power inequality, whose formal similarity to Brunn–Minkowski was made precise by Costa and Cover.
- AM–GM as a special case in dimension 1.
To summarize its reach: the inequality quantifies the relationship between volumes of convex bodies and their Minkowski combinations, and it serves as the geometric foundation for entropy inequalities, the Prékopa–Leindler inequality, and optimal transport.
The \(1/n\) power is the right normalization: it makes the inequality dimensionally homogeneous (scaling both \(A\) and \(B\) by \(\lambda\) multiplies both sides by \(\lambda\)). Without the \(n\)-th root, the concavity statement fails: for \(A = \{0\}\) a single point and \(B = \mathbb{B}^n\), \(\mathrm{Vol}((1-t)A + tB) = t^n \omega_n\), which for \(0 < t < 1\) lies strictly below the linear interpolation \(t\,\omega_n\).
Step 1. In dimension 1, for non-empty intervals \(A = \left[a_1, a_2\right]\) and \(B = \left[b_1, b_2\right]\), the Minkowski sum is \(A + B = \left[a_1 + b_1, a_2 + b_2\right]\), so \(|A + B| = |A| + |B|\), confirming equality.
Step 2. For boxes \(A = \prod_i [0, a_i]\) and \(B = \prod_i [0, b_i]\) (after translation),\[ \mathrm{Vol}(A + B) = \prod_i (a_i + b_i). \]Apply the AM–GM inequality, in the form \(\prod_i x_i \leq \left(\frac{1}{n}\sum_i x_i\right)^n\), to both \(x_i = \frac{a_i}{a_i+b_i}\) and \(x_i = \frac{b_i}{a_i+b_i}\); adding the resulting \(n\)-th roots gives \(\left(\prod_i \frac{a_i}{a_i+b_i}\right)^{1/n} + \left(\prod_i \frac{b_i}{a_i+b_i}\right)^{1/n} \leq \frac{1}{n}\sum_i \frac{a_i}{a_i+b_i} + \frac{1}{n}\sum_i \frac{b_i}{a_i+b_i} = 1\), which rearranges to \(\prod_i (a_i + b_i) \geq \left(\prod_i a_i^{1/n} + \prod_i b_i^{1/n}\right)^n\), the Brunn–Minkowski inequality for boxes.
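The box inequality can be sanity-checked numerically (an illustration, not part of the proof; the helper name `bm_check` is ours):

```python
import math
import random

def bm_check(a, b):
    """For axis-parallel boxes with side lengths a and b, return
    (Vol(A+B)^(1/n), Vol(A)^(1/n) + Vol(B)^(1/n))."""
    n = len(a)
    lhs = math.prod(ai + bi for ai, bi in zip(a, b)) ** (1 / n)
    rhs = math.prod(a) ** (1 / n) + math.prod(b) ** (1 / n)
    return lhs, rhs

random.seed(0)
for _ in range(1000):
    n = random.randint(1, 6)
    a = [random.uniform(0.1, 5.0) for _ in range(n)]
    b = [random.uniform(0.1, 5.0) for _ in range(n)]
    lhs, rhs = bm_check(a, b)
    assert lhs >= rhs - 1e-9          # Brunn-Minkowski for boxes

# equality for homothetic boxes (b = 2a):
lhs, rhs = bm_check([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
assert abs(lhs - rhs) < 1e-9
```

The equality check at the end reflects Step 4 below: homothetic boxes saturate the inequality.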
Step 3. For finite unions of boxes and then by approximation for arbitrary compact sets, one uses induction on the number of boxes via a “bisection” argument: any compact set can be approximated by finite unions of boxes, and the inequality is preserved in the limit by the continuity of the volume functional.
Step 4. Equality analysis shows that equality forces \(A/\mathrm{Vol}(A)^{1/n}\) and \(B/\mathrm{Vol}(B)^{1/n}\) to be translates of each other, i.e., \(A\) and \(B\) are homothetic.
We prove \(\mathrm{Vol}_n(A + B)^{1/n} \geq \mathrm{Vol}_n(A)^{1/n} + \mathrm{Vol}_n(B)^{1/n}\).
By homogeneity, we may normalize so that \(\mathrm{Vol}_n(A) = \mathrm{Vol}_n(B) = 1\) and show \(\mathrm{Vol}_n(A+B) \geq 2^n\); if the normalized result holds, the general result follows by scaling. It is often cleaner to prove the equivalent normalized form: if \(\mathrm{Vol}(A) = \mathrm{Vol}(B) = 1\), then \(\mathrm{Vol}_n((1-t)A + tB) \geq 1\) for all \(t \in [0,1]\). The two forms pass into each other by rescaling \(A\), \(B\), and their combination.
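Concretely, the passage from the additive form to the normalized form is a one-line homogeneity computation (with \(\mathrm{Vol}(A) = \mathrm{Vol}(B) = 1\)):

```latex
% additive form => normalized form, using Vol(lambda C) = lambda^n Vol(C):
\mathrm{Vol}_n\big((1-t)A + tB\big)^{1/n}
  \;\geq\; \mathrm{Vol}_n\big((1-t)A\big)^{1/n} + \mathrm{Vol}_n\big(tB\big)^{1/n}
  \;=\; (1-t)\,\mathrm{Vol}_n(A)^{1/n} + t\,\mathrm{Vol}_n(B)^{1/n}
  \;=\; (1-t) + t \;=\; 1.
```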
The Hadwiger–Ohmann proof uses the following bisection idea. By translating \(A\) and \(B\), we may assume a hyperplane \(H = \{x_n = 0\}\) bisects both \(A\) and \(B\) simultaneously (a single hyperplane can bisect any two bodies by the ham sandwich theorem; for the Brunn–Minkowski proof we use a simpler version — choose \(H\) so that \(\mathrm{Vol}(A \cap \{x_n \geq 0\}) = \frac{1}{2}\mathrm{Vol}(A)\), then translate \(B\) along the \(x_n\)-axis so the same hyperplane bisects \(B\) as well).
Write \(A^\pm = A \cap \{\pm x_n \geq 0\}\) and \(B^\pm = B \cap \{\pm x_n \geq 0\}\). The sets \(A^+ + B^+\) and \(A^- + B^-\) lie in the upper and lower half-spaces respectively, so they overlap only in the hyperplane \(H\) (a null set), and both are contained in \(A + B\). Hence\[ \mathrm{Vol}_n(A+B) \geq \mathrm{Vol}_n(A^+ + B^+) + \mathrm{Vol}_n(A^- + B^-). \]Applying the inductive hypothesis to each half (in the Hadwiger–Ohmann argument the induction runs on the total number of boxes, with each bisection leaving fewer boxes in each half), and using \(\mathrm{Vol}(A^\pm) = \frac{1}{2}\mathrm{Vol}(A)\) and \(\mathrm{Vol}(B^\pm) = \frac{1}{2}\mathrm{Vol}(B)\), one obtains the inequality. The details require careful bookkeeping of the half-space volumes, but the idea is transparent: each bisection step reduces to smaller pieces, and the inequality propagates upward.
4.4b Equality Cases in Brunn–Minkowski
The equality conditions in the Brunn–Minkowski inequality are as important as the inequality itself.
- A cube \(A = [-1,1]^3\) and a ball \(B = \mathbb{B}^3\) are NOT homothetic (one is polyhedral, one is smooth). So the inequality is strict for this pair.
- A cube \(A = [-1,1]^3\) and a rescaled cube \(B = [-r,r]^3\) ARE homothetic (both are scaled copies of the same convex body). So equality holds.
- Two simplices of different shapes (one equilateral, one right-angled) in \(\mathbb{R}^2\) are NOT homothetic unless they have the same proportions. Equality fails for most pairs of simplices.
In the isoperimetric inequality (which follows from Brunn–Minkowski), the equality condition corresponds to the ball being the unique extremal shape — the body that saturates the isoperimetric inequality is the ball, and the equality in Brunn–Minkowski (with \(B = \varepsilon\mathbb{B}^n\)) becomes tight only when \(A\) is itself a ball.
Now take \(A = \{0, 1\}\) (two points, a non-convex set) and \(B = [0,1]\). Then \(A+B = [0,1] \cup [1,2] = [0,2]\). Since \(|A| = 0\) (two points have measure zero), the inequality only demands \(|A+B| \geq |A| + |B| = 0 + 1 = 1\), and indeed \(|A+B| = 2 \geq 1\): a strict inequality. Since \(A\) is not a convex body (it has empty interior), the equality condition does not apply.
4.5 Isoperimetric and Isodiametric Inequalities
Equality holds iff equality holds in Brunn–Minkowski, which requires \(K\) and \(\varepsilon\mathbb{B}^n\) to be homothetic — i.e., \(K\) is itself a ball.
Chapter 5: Functional Inequalities
Convexity has deep implications for functions and measures, not just sets. The study of log-concave functions and measures is one of the most active areas of modern probability theory and analysis, with applications ranging from random sampling algorithms (the ball walk, Langevin diffusion) to central limit theorems for convex bodies (the KLS conjecture).
5.0 Motivation: Why Log-Concavity?
Log-concavity occupies a central place in modern analysis and probability for the following reason: it is the natural class of functions and measures that are “close to Gaussian” while not requiring any specific parametric form. The Gaussian distribution is log-concave (with \(\log f(x) = -\|x\|^2/2\) concave), as are all uniform distributions on convex bodies, all exponential family distributions with concave log-partition functions, and many combinatorial sequences.
The fundamental theorem motivating the study of log-concave measures is: marginals of log-concave functions are log-concave. Integrating a log-concave function over some of its variables preserves log-concavity.
This is the content of Prékopa’s theorem (Section 5.2). Its significance: the Gaussian measure is log-concave, and the class of measures with properties similar to the Gaussian is exactly the log-concave class. Every theorem about Gaussians (concentration, CLT, entropy bounds) has a generalization to log-concave measures — and those generalizations often require the Brunn–Minkowski/Prékopa–Leindler framework.
5.1 Log-Concave Functions and Measures
The class of log-concave measures is the natural class for proving concentration of measure, isoperimetric inequalities, and central limit theorems. The Bobkov–Madiman inequality, the KLS conjecture (now nearly resolved), and the central limit theorem for convex bodies all concern log-concave measures.
- The standard Gaussian \(f(x) = e^{-\|x\|^2/2}\) is log-concave: \(\log f(x) = -\|x\|^2/2\), which is strictly concave.
- The uniform distribution \(f = \mathbf{1}_C\) on a convex set \(C\) is log-concave: for \(x, y \in C\), any convex combination \((1-t)x + ty \in C\) as well, so \(f\) takes value 1 at the convex combination, satisfying the inequality.
- Products of log-concave functions are log-concave: if \(f\) and \(g\) are log-concave, so is \(fg\) (take logarithms).
- The function \(f(x) = e^{-\|x\|}\) (Laplace distribution) is log-concave. More generally, \(e^{-|x|^p}\) on \(\mathbb{R}\) is log-concave for \(p \geq 1\); for \(0 < p < 1\) it fails to be log-concave, since \(-|x|^p\) is then convex on each half-line but not concave across the origin.
- The binomial coefficient \(\binom{n}{k}\) as a function of \(k\) is log-concave: \(\binom{n}{k}^2 \geq \binom{n}{k-1}\binom{n}{k+1}\). Log-concavity is stronger than unimodality: it forces the sequence to rise and then fall with monotone ratios.
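The binomial example is easy to verify computationally; the check below (helper name `is_log_concave` is ours) also exhibits a unimodal sequence that is not log-concave, illustrating that the two notions differ:

```python
import math

def is_log_concave(seq):
    """Check a_k^2 >= a_{k-1} * a_{k+1} for all interior indices k."""
    return all(seq[k] ** 2 >= seq[k - 1] * seq[k + 1]
               for k in range(1, len(seq) - 1))

# every row of Pascal's triangle is log-concave in k
for n in range(2, 40):
    assert is_log_concave([math.comb(n, k) for k in range(n + 1)])

# rising-then-falling alone is weaker: this unimodal sequence fails
assert not is_log_concave([1, 2, 8, 9, 1])
```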
5.2 The Prékopa–Leindler Inequality
The Prékopa–Leindler inequality is a functional version of the Brunn–Minkowski inequality. It is in some sense more general and can be used to derive Brunn–Minkowski as a corollary.
But the arithmetic-geometric mean inequality gives \((1-t)\alpha + t\beta \geq \alpha^{1-t}\beta^t\), completing the one-dimensional proof.
The \(n\)-dimensional case proceeds by induction with Fubini: apply the one-dimensional inequality in the innermost variable; the resulting functions of the remaining \(n-1\) variables again satisfy the Prékopa–Leindler hypothesis, so the induction closes.
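The one-dimensional inequality can be illustrated numerically. Below is a grid check of the \(t = \tfrac12\) hypothesis and conclusion for Gaussian data, where equality is expected because \(f\) and \(g\) are translates of the same Gaussian (a numerical sketch only; `integral` is our midpoint-rule helper):

```python
import math

# Prekopa-Leindler with t = 1/2 for Gaussian data:
#   f(x) = exp(-x^2),  g(y) = exp(-(y-1)^2),
#   h(z) = sup_{(x+y)/2 = z} sqrt(f(x) g(y)) = exp(-(z - 1/2)^2).
t = 0.5

def f(x): return math.exp(-x ** 2)
def g(y): return math.exp(-(y - 1.0) ** 2)
def h(z): return math.exp(-(z - 0.5) ** 2)

# hypothesis h((1-t)x + ty) >= f(x)^(1-t) g(y)^t, checked on a grid
pts = [-4.0 + 0.05 * i for i in range(161)]
for x in pts:
    for y in pts:
        assert h((x + y) / 2) >= math.sqrt(f(x) * g(y)) - 1e-12

def integral(fn, lo=-8.0, hi=9.0, steps=20000):
    """Midpoint rule; very accurate for rapidly decaying integrands."""
    dx = (hi - lo) / steps
    return sum(fn(lo + (i + 0.5) * dx) for i in range(steps)) * dx

int_f, int_g, int_h = integral(f), integral(g), integral(h)
# conclusion: int h >= (int f)^(1-t) (int g)^t, with equality here
# because f and g are translates of the same Gaussian
assert int_h >= int_f ** (1 - t) * int_g ** t - 1e-9
assert abs(int_h - int_f ** (1 - t) * int_g ** t) < 1e-6
```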
5.2b Prékopa–Leindler: A Second Proof and Deeper Understanding
The Prékopa–Leindler inequality has a beautiful “transportation” interpretation that connects it to optimal transport theory.
This transport perspective was developed by McCann (1997) and gives a clean proof of Brunn–Minkowski: the volume of the interpolated body \((1-t)A + tB\) is the “mass” of the interpolated measure between uniform distributions on \(A\) and \(B\). Convexity of the displacement path (guaranteed by the convexity of \(A\) and \(B\)) then directly gives the Brunn–Minkowski inequality.
Since all three integrals equal \((2\pi)^{n/2}\), we get \((2\pi)^{n/2} \geq (2\pi)^{n/2}\) — equality, as expected since the Gaussian is self-similar under interpolation.
5.3 Applications: Log-Concavity of Volumes
Chapter 6: Convex Functions
Convex functions are to convex geometry what linear maps are to linear algebra — they are the morphisms that respect the structure. Every convex set determines an indicator function (\(0\) on the set, \(+\infty\) outside), a gauge function (Minkowski functional), and a support function, all of which are convex. Conversely, every convex function determines a convex set (its epigraph). This chapter develops the theory of convex functions in parallel with the geometric theory of convex sets.
6.0 Motivation: Why Convex Functions?
Convex functions are to analysis what linear functions are to algebra — they are the simplest, most well-behaved class of functions, and understanding them completely is already a rich and non-trivial task. The key properties that make convex functions tractable are:
No false minima. Every local minimum of a convex function is a global minimum. This is the property that makes convex optimization problems tractable: any algorithm that finds a local minimum has in fact found the global minimum. For non-convex functions, gradient descent can get stuck in local minima and the problem is generally NP-hard.
Unique minima for strictly convex functions. A strictly convex function has at most one minimizer. This ensures well-posedness of optimization problems: the solution (if it exists) is unique, and numerical algorithms converge to a definite answer.
Epigraph characterization. A function is convex if and only if its epigraph is a convex set. This converts functional analysis questions (about functions) into geometric questions (about sets), allowing all the machinery of convex geometry to apply.
Subgradients everywhere. At every point in the interior of its domain, a convex function has at least one subgradient — a slope of a supporting hyperplane to the epigraph. This provides a “calculus” for non-smooth functions, enabling optimization algorithms even when the function has kinks.
Jensen’s inequality. Convex functions satisfy \(f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]\), an inequality with countless applications in probability, information theory, and analysis.
The higher-dimensional setting introduces new phenomena not present in 1D: a convex function on \(\mathbb{R}^n\) can fail to be differentiable on an uncountable (though still Lebesgue-null) set (for example, \(f(x) = |x_1|\) is non-differentiable along an entire hyperplane), whereas in 1D the set of non-differentiability is at most countable. Moreover, the subdifferential is a compact convex set rather than just an interval.
6.1 Definitions and Basic Properties
This epigraph characterization is powerful: it converts questions about convex functions into questions about convex sets, allowing all the geometric machinery to apply.
Dually, the sublevel set \(\{x : f(x) \leq c\}\) is a convex set for every \(c\) (since it is the projection of the convex set \(\mathrm{epi}(f) \cap \{r \leq c\}\)). This is why convex programs (minimizing a convex function over a convex set) have convex feasible regions.
- Any linear function \(f(x) = \langle a, x \rangle + b\).
- Any norm \(f(x) = \|x\|\).
- The function \(f(x) = \|x\|^2\) (strictly convex).
- The negative entropy \(f(x) = \sum_i x_i \log x_i\) on the positive orthant.
- The log-sum-exp function \(f(x_1, \ldots, x_n) = \log\left(\sum_i e^{x_i}\right)\).
- Positive scaling: If \(f\) is convex and \(\alpha \geq 0\), then \(\alpha f\) is convex.
- Sum: If \(f\) and \(g\) are convex, so is \(f + g\).
- Maximum: If \(f\) and \(g\) are convex, so is \(\max(f,g)\). More generally, the pointwise supremum of any family of convex functions is convex (even uncountable families).
- Composition with affine: If \(f\) is convex and \(A\) is a linear map and \(b\) is a vector, then \(x \mapsto f(Ax + b)\) is convex.
- Composition with non-decreasing convex scalar function: If \(f : \mathbb{R}^n \to \mathbb{R}\) is convex and \(g : \mathbb{R} \to \mathbb{R}\) is convex and non-decreasing, then \(g \circ f\) is convex.
6.1b Operations Preserving Convexity: A Complete Reference
The following lists all fundamental operations that preserve convexity of functions. These are the building blocks from which convex functions are constructed in practice.
- Non-negative linear combination: \(\alpha f + \beta g\) is convex for \(\alpha, \beta \geq 0\).
- Pointwise supremum (possibly uncountable): \(\sup_\alpha f_\alpha\) is convex if each \(f_\alpha\) is convex. This is the key property for minimax problems.
- Affine composition: \(f(Ax + b)\) is convex if \(f\) is convex. In particular, restrictions to affine subspaces preserve convexity.
- Perspective function: \(g(x,t) = t f(x/t)\) for \(t > 0\) is convex if \(f\) is convex. The perspective operation connects convexity to projective geometry.
- Infimal convolution: \((f \Box g)(x) = \inf_y (f(y) + g(x-y))\) is convex if \(f\) and \(g\) are convex. This operation "smears" one function with another and is the convex analogue of Minkowski sum.
- Partial minimization: \(h(x) = \inf_y f(x,y)\) is convex in \(x\) if \(f(x,y)\) is jointly convex and the infimum is finite for all \(x\). This is the convex version of marginalizing out a variable.
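The infimal convolution entry can be made concrete: convolving \(|x|\) with the quadratic \(z^2/(2\lambda)\) produces the Huber function (this particular computation is also the Moreau envelope of Section 7.6). A brute-force check, with our own helper names `inf_conv` and `huber`:

```python
def inf_conv(f, g, x, grid):
    """Brute-force infimal convolution (f box g)(x) = inf_y f(y) + g(x - y)."""
    return min(f(y) + g(x - y) for y in grid)

lam = 0.5

def f(y): return abs(y)                  # convex
def g(z): return z * z / (2.0 * lam)     # convex quadratic

def huber(x):
    """Closed form of |.| box (1/(2 lam))(.)^2: the Huber function."""
    return x * x / (2.0 * lam) if abs(x) <= lam else abs(x) - lam / 2.0

grid = [-5.0 + 0.001 * i for i in range(10001)]
for x in [-2.0, -0.3, 0.0, 0.2, 1.5]:
    assert abs(inf_conv(f, g, x, grid) - huber(x)) < 1e-3
```

The result is smooth even though \(|x|\) is not, which is the "smearing" promised above: infimal convolution with a smooth convex function regularizes kinks.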
6.2 Jensen’s Inequality
More generally, for any convex \(g\), Jensen gives \(g(\mathbb{E}[X]) \leq \mathbb{E}[g(X)]\). Applying this with \(g(x) = x^2\) gives \((\mathbb{E}[X])^2 \leq \mathbb{E}[X^2]\), i.e., \(\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \geq 0\): a familiar fact, recovered here as a consistency check. Applying it with \(g = |\cdot|^p\) for \(p \geq 1\) gives \(|\mathbb{E}[X]|^p \leq \mathbb{E}[|X|^p]\), which upon taking \(p\)-th roots reads \(|\mathbb{E}[X]| \leq \|X\|_p\).
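Since Jensen applies to the empirical distribution of any sample, these inequalities hold for every finite data set, not just in expectation; a minimal check:

```python
import random

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(10000)]
mean = sum(xs) / len(xs)

# g(x) = x^2 (convex): g(E X) <= E g(X), i.e. Var >= 0
assert mean ** 2 <= sum(x * x for x in xs) / len(xs)

# g(x) = |x|^3 (convex): |E X|^3 <= E |X|^3
assert abs(mean) ** 3 <= sum(abs(x) ** 3 for x in xs) / len(xs)
```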
6.3 Subdifferentials
For convex functions that need not be differentiable, the subdifferential generalizes the gradient.
- \(\partial f(x)\) is a nonempty, compact convex set for every \(x\) in the interior of the domain of \(f\).
- If \(f\) is differentiable at \(x\), then \(\partial f(x) = \{\nabla f(x)\}\).
- \(x^*\) is a global minimum of \(f\) if and only if \(0 \in \partial f(x^*)\).
- The subdifferential map \(x \mapsto \partial f(x)\) is monotone: \(\langle v - w, x - y \rangle \geq 0\) whenever \(v \in \partial f(x)\) and \(w \in \partial f(y)\).
Proof that \(0 \in \partial f(x^*)\) iff \(x^*\) is a global minimum. By definition, \(0 \in \partial f(x^*)\) means \(f(x) \geq f(x^*) + \langle 0, x - x^* \rangle = f(x^*)\) for all \(x\), i.e., \(x^*\) is a global minimum.
- For \(x \neq 0\), \(\partial f(x) = \{\mathrm{sgn}(x)\}\): the single slope \(\{1\}\) for \(x > 0\) and \(\{-1\}\) for \(x < 0\).
- \(\partial f(0) = \left[-1, 1\right]\).
At the origin \((0,0)\): any subgradient \((g_1, g_2)\) must satisfy \(|x_1| + |x_2| \geq g_1 x_1 + g_2 x_2\) for all \((x_1, x_2)\). Testing \(x = (\pm 1, 0)\) and \(x = (0, \pm 1)\) forces \(|g_1| \leq 1\) and \(|g_2| \leq 1\); conversely, if \(\max(|g_1|, |g_2|) \leq 1\), then \(g_1 x_1 + g_2 x_2 \leq |x_1| + |x_2|\) termwise. So \(\partial f(0,0) = \{(g_1,g_2) : \max(|g_1|, |g_2|) \leq 1\} = [-1,1]^2\), the unit ball of the dual norm \(\ell^\infty\). The subdifferential of a norm at the origin is always the unit ball of the dual norm, i.e., the polar body of the norm's own unit ball: a precise duality between the function and its conjugate.
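A numerical check of the subgradient inequality for \(\|\cdot\|_1\) at the origin: vectors with \(\max_i |g_i| \leq 1\) (the unit ball of the dual \(\ell^\infty\) norm) always pass, while vectors outside it fail on random test points. The sampling helper `is_subgradient_at_zero` is ours:

```python
import random

def is_subgradient_at_zero(g, trials=2000, seed=0):
    """Sample test of ||x||_1 >= <g, x> for all x (the subgradient
    inequality for the l1 norm at 0); random search for a violation."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-10.0, 10.0) for _ in g]
        if sum(abs(xi) for xi in x) < sum(gi * xi for gi, xi in zip(g, x)) - 1e-12:
            return False
    return True

assert is_subgradient_at_zero([1.0, 1.0])       # max|g_i| = 1: passes
assert is_subgradient_at_zero([0.5, -0.9])      # inside [-1,1]^2: passes
assert not is_subgradient_at_zero([1.5, 0.0])   # max|g_i| > 1: violation found
```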
In one direction: if \(0 \in \partial f(x^*)\), then by definition, \(f(x) \geq f(x^*) + \langle 0, x - x^* \rangle = f(x^*)\) for all \(x\) — so \(x^*\) is a global minimum. In the other direction: if \(x^*\) is a global minimum, then for any direction \(d\), the directional derivative \(f'(x^*; d) = \lim_{t \downarrow 0} (f(x^* + td) - f(x^*))/t \geq 0\) (since \(x^*\) is a minimum). For a convex function, this means the zero vector is a subgradient.
This principle unifies all of calculus optimization:
- Smooth unconstrained: \(0 = \nabla f(x^*)\) (since \(\partial f = \{\nabla f\}\) when differentiable).
- Non-smooth unconstrained: \(0 \in \partial f(x^*)\), e.g., \(f(x) = |x|\) is minimized at \(x^* = 0\) since \(0 \in [-1,1] = \partial|x|(0)\).
- Constrained optimization: For \(\min f(x)\) subject to \(x \in C\), define \(f_C(x) = f(x) + \delta_C(x)\), where \(\delta_C\) is the convex indicator of \(C\) (\(0\) on \(C\), \(+\infty\) outside). Then, under mild regularity, \(0 \in \partial f_C(x^*) = \partial f(x^*) + N_C(x^*)\), where \(N_C(x^*)\) is the normal cone to \(C\) at \(x^*\). This is the generalized KKT condition.
In convex optimization, the KKT conditions for \(\min f(x) \text{ s.t. } g(x) \leq 0\) can be written as \(0 \in \partial f(x^*) + \lambda^* \partial g(x^*)\) (for optimal \(\lambda^* \geq 0\)), a direct generalization using subdifferentials.
6.4 Convexity and Differentiability
- Finite \(\Rightarrow\) continuous (locally, on the interior of the domain). A convex function that is finite on an open convex set is automatically locally Lipschitz; in finite dimensions no measurability hypothesis is needed (measurability matters only for upgrading midpoint convexity to convexity). The geometric reason: the epigraph is a convex set, and the "sharp corners" that would prevent continuity cannot occur in the interior.
- Continuous \(\Rightarrow\) locally Lipschitz (on compact subsets of the interior). For any compact subset \(C'\) of the interior of the domain, there is a Lipschitz constant \(L\) such that \(|f(x) - f(y)| \leq L\|x-y\|\) for \(x, y \in C'\).
- Locally Lipschitz \(\Rightarrow\) differentiable a.e. by Rademacher's theorem.
- Differentiable + convex \(\Rightarrow\) \(C^1\) on the interior (the gradient is continuous).
- For twice differentiable \(f\): convex \(\iff\) \(D^2 f \succeq 0\) everywhere.
6.5 Convex Conjugates: Computations and Geometric Interpretation
The Legendre–Fenchel transform (treated in depth in Chapter 8) is already motivated by the subdifferential theory of convex functions. Here we develop the connection.
- At each point \(x\), the gradient \(\nabla f(x) = y\) tells us the "slope" (direction of steepest ascent).
- The Legendre transform is \(f^*(y) = \langle x, y\rangle - f(x)\), evaluated at the point \(x = (\nabla f)^{-1}(y)\) whose slope is \(y\); this recipe is well-defined when \(f\) is smooth and strictly convex, so that \(\nabla f\) is invertible.
This is exactly the classical Legendre transform from mechanics and thermodynamics. In thermodynamics, the internal energy \(U(S, V)\) (entropy, volume) is Legendre-transformed to the Helmholtz free energy \(F(T, V) = \inf_S (U(S,V) - TS)\), equivalently \(-F(T,V) = \sup_S (TS - U(S,V))\), where temperature \(T = \partial U/\partial S\) is the "slope" dual to entropy.
Chapter 7: Rearrangement Inequalities
Rearrangement inequalities concern the question: when is an integral \(\int fg\) maximized or minimized? The answer involves replacing \(f\) and \(g\) by their “rearrangements” — functions that have the same distribution but are sorted in the same or opposite order. This theory, developed by Hardy, Littlewood, and Pólya in the 1920s–30s and extended by many others, has deep connections to convexity, majorization, and the geometry of function spaces.
7.0 Motivation: What is a Rearrangement and Why Should We Care?
A “rearrangement” of a function is a way to restructure it to be “nicer” — typically more symmetric or more sorted — while preserving its distribution. The key insight is that many important functionals of a function (integrals, norms, energy) are not changed by rearrangement, or are improved by it. Rearrangement inequalities then say: the “nicest” rearrangement (the symmetric decreasing one) maximizes or minimizes various functionals.
The physical motivation: the symmetric decreasing rearrangement of a density concentrates mass toward the origin. Physical systems often minimize energy by concentrating mass (gravitational collapse, electron orbitals). So rearrangement inequalities describe the ground states of physical systems — they encode which configurations are energetically optimal.
The probabilistic motivation: if \(X\) and \(Y\) are random variables with fixed marginal distributions, then \(\mathbb{E}[XY]\) is maximized over all couplings when \(X\) and \(Y\) are "comonotone": large values of \(X\) occur together with large values of \(Y\). The Hardy–Littlewood inequality is the function-space version of this principle, with similarly sorted rearrangements playing the role of the comonotone coupling.
7.1 Symmetric Decreasing Rearrangements
- \(f^*\) is radially symmetric: \(f^*(x)\) depends only on \(\|x\|\).
- \(f^*\) is non-increasing as a function of \(\|x\|\).
- \(f^*\) is equimeasurable with \(f\): \(|\{f^* > t\}| = |\{f > t\}|\) for all \(t \geq 0\).
7.1b Rearrangements and Sobolev Inequalities
One of the most important applications of the symmetric decreasing rearrangement is to Sobolev inequalities — inequalities that bound the \(L^p\) norm of a function by the \(L^p\) norm of its gradient.
This pattern — reduce to a symmetric case by rearrangement, then solve — appears throughout the calculus of variations and PDE theory.
7.2 The Hardy–Littlewood Inequality
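In its discrete form, the Hardy–Littlewood principle is the classical rearrangement inequality: \(\sum_i x_i y_{\sigma(i)}\) is maximized when both sequences are sorted the same way. A brute-force verification over all pairings (helper names are ours):

```python
import itertools
import random

def rearrangement_max(xs, ys):
    """Maximum of sum_i x_i * y_sigma(i) over all pairings sigma."""
    return max(sum(a * b for a, b in zip(xs, perm))
               for perm in itertools.permutations(ys))

random.seed(2)
for _ in range(200):
    xs = [random.uniform(-3.0, 3.0) for _ in range(5)]
    ys = [random.uniform(-3.0, 3.0) for _ in range(5)]
    # the similarly-sorted pairing attains the maximum
    sorted_pairing = sum(a * b for a, b in zip(sorted(xs), sorted(ys)))
    assert abs(rearrangement_max(xs, ys) - sorted_pairing) < 1e-9
```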
7.3 Majorization and the Hardy–Littlewood–Pólya Theorem
- \(x \prec y\) (i.e., \(x\) is majorized by \(y\)).
- \(x = Dy\) for some doubly stochastic matrix \(D\).
- \(\sum_{i=1}^n \phi(x_i) \leq \sum_{i=1}^n \phi(y_i)\) for all convex functions \(\phi : \mathbb{R} \to \mathbb{R}\).
7.4 The Birkhoff–von Neumann Theorem
For the converse, take any doubly stochastic matrix \(D\) that is not a permutation matrix. By the Frobenius theorem (or Hall’s marriage theorem applied to the bipartite graph of nonzero entries), there exists a permutation \(\sigma\) such that \(D_{i,\sigma(i)} > 0\) for all \(i\). Let \(\lambda = \min_i D_{i,\sigma(i)} > 0\). Then \(D' = (D - \lambda P_\sigma)/(1-\lambda)\) is again doubly stochastic, so \(D = \lambda P_\sigma + (1-\lambda) D'\). This decomposes \(D\) as a convex combination, showing it is not extreme. By induction, any doubly stochastic matrix is a convex combination of permutation matrices.
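The proof is constructive, and the greedy extraction it describes can be run directly. The sketch below brute-forces the positive-support permutation, so it is only suitable for small matrices (a real implementation would use a bipartite-matching algorithm; the helper name `birkhoff` is ours):

```python
import itertools

def birkhoff(D, tol=1e-12):
    """Greedy Birkhoff-von Neumann decomposition: returns a list of
    (weight, permutation) pairs whose weighted permutation matrices
    sum to the doubly stochastic matrix D."""
    n = len(D)
    D = [row[:] for row in D]          # work on a copy
    parts = []
    while sum(D[0]) > tol:             # all row sums are equal
        # a permutation supported on positive entries (exists by Hall/Konig)
        sigma = next(p for p in itertools.permutations(range(n))
                     if all(D[i][p[i]] > tol for i in range(n)))
        lam = min(D[i][sigma[i]] for i in range(n))
        parts.append((lam, sigma))
        for i in range(n):
            D[i][sigma[i]] -= lam
    return parts

D = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]
parts = birkhoff(D)
assert abs(sum(w for w, _ in parts) - 1.0) < 1e-9
# rebuild D from the weighted permutation matrices
R = [[0.0] * 3 for _ in range(3)]
for w, sigma in parts:
    for i in range(3):
        R[i][sigma[i]] += w
assert all(abs(R[i][j] - D[i][j]) < 1e-9 for i in range(3) for j in range(3))
```

Each extraction zeroes at least one entry, so the loop terminates after at most \(n^2\) steps, mirroring the induction in the proof.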
7.4b Applications of Majorization: Entropy and Quantum Mechanics
Majorization connects convexity, entropy, and the foundations of quantum mechanics in a unified framework.
The uniform distribution \(p^* = (1/n, \ldots, 1/n)\) satisfies \(p^* \prec p\) for every probability vector \(p\) with \(\sum p_i = 1\): having all components equal, \(p^*\) is "maximally flat" and is majorized by every distribution. Since entropy is Schur-concave, \(H(p) \leq H(p^*) = \log n\) for all probability distributions — the maximum entropy principle from Jensen’s inequality (Section 6.2) is a special case of Schur-concavity.
Nielsen’s theorem (1999): A bipartite pure quantum state \(|\psi\rangle\) can be converted to \(|\phi\rangle\) by LOCC if and only if the Schmidt coefficient vector of \(|\psi\rangle\) is majorized by that of \(|\phi\rangle\). This is the quantum mechanical manifestation of majorization — and it is proved using the Birkhoff–von Neumann theorem (doubly stochastic matrices = convex combinations of permutations) applied to the density matrix algebra.
Decreasing rearrangements: \(p_{[1]} = 0.5, p_{[2]} = 0.3, p_{[3]} = 0.2\) (already sorted); \(q_{[1]} = 0.6, q_{[2]} = 0.3, q_{[3]} = 0.1\).
Check: \(p_{[1]} = 0.5 \leq 0.6 = q_{[1]}\) ✓. \(p_{[1]}+p_{[2]} = 0.8 \leq 0.9 = q_{[1]}+q_{[2]}\) ✓. \(\sum p_i = \sum q_i = 1\) ✓. So \(p \prec q\). This means \(p\) is “more equal” than \(q\) — and indeed \(H(p) = -0.5\log_2 0.5 - 0.3\log_2 0.3 - 0.2\log_2 0.2 \approx 1.485\) bits, while \(H(q) = -0.6\log_2 0.6 - 0.3\log_2 0.3 - 0.1\log_2 0.1 \approx 1.295\) bits — confirming \(H(p) > H(q)\) as predicted by Schur-concavity of entropy.
By Hardy–Littlewood–Pólya, \(p = Dq\) for some doubly stochastic matrix \(D\); equivalently, by Birkhoff, \(p\) is a convex combination of permutations of \(q\). Here a single "T-transform" suffices: writing \(q' = (0.1, 0.3, 0.6)\) for \(q\) with its first and third coordinates swapped, we have \(p = 0.8\,q + 0.2\,q'\), as one checks coordinate by coordinate (\(0.8 \cdot 0.6 + 0.2 \cdot 0.1 = 0.5\), \(0.3\), and \(0.8 \cdot 0.1 + 0.2 \cdot 0.6 = 0.2\)).
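The worked example can be verified mechanically (the helpers `majorized` and `H` are ours; entropies in bits):

```python
import math

p = [0.5, 0.3, 0.2]
q = [0.6, 0.3, 0.1]

def majorized(x, y):
    """x majorized by y: sorted partial sums of x dominated by y's,
    with equal totals."""
    xs, ys = sorted(x, reverse=True), sorted(y, reverse=True)
    sx = sy = 0.0
    for a, b in zip(xs, ys):
        sx += a
        sy += b
        if sx > sy + 1e-12:
            return False
    return abs(sx - sy) < 1e-12

def H(dist):
    """Shannon entropy in bits."""
    return -sum(pi * math.log2(pi) for pi in dist if pi > 0)

assert majorized(p, q) and not majorized(q, p)
assert H(p) > H(q)                     # Schur-concavity: p is "more equal"
assert abs(H(p) - 1.485) < 1e-3 and abs(H(q) - 1.295) < 1e-3

# an explicit convex combination: p = 0.8 q + 0.2 q', q' = q reversed
mix = [0.8 * a + 0.2 * b for a, b in zip(q, reversed(q))]
assert all(abs(m - pi) < 1e-12 for m, pi in zip(mix, p))
```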
7.5 The Krein–Milman Theorem
Existence of extreme points. Among all faces of \(C\) (intersections with supporting hyperplanes), choose a minimal one \(F\) (in finite dimensions, passing to a proper face strictly drops dimension, so a minimal face exists). A minimal face must be a single point \(\{p\}\): otherwise a supporting hyperplane inside the affine hull of \(F\) would cut out a strictly smaller face. Such a \(p\) is extreme: if \(p = (x+y)/2\) with \(x, y \in C\), then \(x, y \in F\) because faces are extreme sets, forcing \(x = y = p\). So \(\mathrm{ext}(C) \neq \emptyset\).
Density. Suppose for contradiction that \(\overline{\mathrm{conv}}(\mathrm{ext}(C)) \subsetneq C\). Then there exists \(p \in C \setminus \overline{\mathrm{conv}}(\mathrm{ext}(C))\). By the separation theorem, there is a linear functional \(\ell\) such that \(\ell(p) > \max_{x \in \overline{\mathrm{conv}}(\mathrm{ext}(C))} \ell(x)\). But the maximum of \(\ell\) over the compact set \(C\) is attained at some face \(F'\) of \(C\), and every extreme point of \(F'\) is an extreme point of \(C\), hence lies in \(\overline{\mathrm{conv}}(\mathrm{ext}(C))\). This contradicts \(\ell(p) > \max \ell\) over \(C\).
7.6 Proximal Operators and Moreau Envelopes
The proximal operator is the key tool bridging convex analysis and algorithms. It provides a computationally tractable way to handle non-smooth convex functions.
Moreover, \(M_\lambda f(x) \nearrow f(x)\) as \(\lambda \searrow 0\) (the Moreau envelope approaches \(f\) from below). The Moreau envelope is thus a smooth approximation to \(f\), and the proximal operator provides a “step” in the direction of the approximation’s gradient.
Soft thresholding is the core operation in the LASSO, compressed sensing, and wavelet denoising. The fact that it is the proximal operator of \(\|\cdot\|_1\) explains why LASSO can be solved efficiently by iterative soft-thresholding algorithms.
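A sketch comparing the closed-form soft-thresholding formula with a brute-force minimization of \(\lambda|y| + \tfrac12 (y-x)^2\) (helper names are ours):

```python
def soft_threshold(x, lam):
    """prox of lam * |.|: shrink x toward 0 by lam, clipping at 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def prox_bruteforce(x, lam, lo=-10.0, hi=10.0, steps=40001):
    """argmin_y lam*|y| + (1/2)(y - x)^2, approximated on a fine grid."""
    best_y, best_v = lo, float("inf")
    for i in range(steps):
        y = lo + (hi - lo) * i / (steps - 1)
        v = lam * abs(y) + 0.5 * (y - x) ** 2
        if v < best_v:
            best_y, best_v = y, v
    return best_y

for x in [-3.0, -0.4, 0.0, 0.7, 2.5]:
    assert abs(prox_bruteforce(x, 1.0) - soft_threshold(x, 1.0)) < 1e-3
```

Note the kink of \(|\cdot|\) at 0 is what pins small inputs exactly to zero, which is the mechanism behind sparsity in LASSO solutions.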
7.7 Strongly Convex Functions
Strong convexity is a quantitative strengthening of convexity that gives faster convergence for optimization algorithms.
In optimization: strongly convex functions have unique minimizers, and gradient descent on a \(\mu\)-strongly convex, \(L\)-smooth function converges geometrically with rate \((1 - \mu/L)^k\). The condition number \(\kappa = L/\mu\) is the key parameter — problems with \(\kappa \approx 1\) are “well-conditioned” and converge quickly; problems with large \(\kappa\) are “ill-conditioned” and converge slowly.
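A minimal demonstration of the geometric rate on the diagonal quadratic \(f(x) = \tfrac12(\mu x_1^2 + L x_2^2)\), which is \(\mu\)-strongly convex and \(L\)-smooth (a toy setup of our own, not from the text):

```python
# f(x) = (1/2)(mu * x1^2 + L * x2^2) has its minimizer at the origin.
# Gradient descent with step 1/L contracts the distance to the minimizer
# by at least the factor (1 - mu/L) per iteration.
mu, L = 1.0, 10.0
step = 1.0 / L

def dist(v):
    return (v[0] ** 2 + v[1] ** 2) ** 0.5

x = [1.0, 1.0]
d_prev = dist(x)
for k in range(50):
    grad = [mu * x[0], L * x[1]]
    x = [x[0] - step * grad[0], x[1] - step * grad[1]]
    d = dist(x)
    assert d <= (1 - mu / L) * d_prev + 1e-12   # geometric contraction
    d_prev = d

assert dist(x) < 1e-2   # roughly (1 - mu/L)^50 = 0.9^50 ~ 5e-3
```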
Chapter 8: Duality and the Legendre Transform
Duality is a recurring theme in mathematics: a geometric object has a “dual” object that encodes the same information from a different perspective. In convex geometry, the polar dual of a convex body \(K\) is a body \(K^\circ\) that is “small where \(K\) is large.” In convex analysis, the Legendre–Fenchel transform converts a convex function into its “dual” function. These two dualities are deeply connected and together form the foundation of convex optimization theory.
8.1 Polar Sets and Support Functions
- \(K^\circ\) is always closed, convex, and contains the origin.
- If \(K\) is a closed convex set containing the origin, then \((K^\circ)^\circ = K\) (the bipolar theorem).
- If \(K \subseteq L\), then \(L^\circ \subseteq K^\circ\) (polarity reverses inclusion).
- For the unit ball: \((\mathbb{B}^n)^\circ = \mathbb{B}^n\).
- For a polytope \(P = \mathrm{conv}\{v_1, \ldots, v_m\}\), its polar is \(P^\circ = \{y : \langle v_i, y \rangle \leq 1 \text{ for all } i\}\).
This is exactly the cross-polytope: the condition \(\langle v, y\rangle \leq 1\) for all sign vectors \(v \in \{-1,+1\}^n\) is equivalent to \(\max_{v \in \{-1,+1\}^n} \sum_i v_i y_i = \sum_i |y_i| \leq 1\).
\[ \mathcal{C}_n^\circ = \{y : \langle \pm e_i, y\rangle \leq 1 \text{ for all } i\} = \{y : |y_i| \leq 1 \text{ for all } i\} = [-1,1]^n = K. \]So the hypercube and the cross-polytope are indeed polar duals of each other. This is a concrete verification of \((K^\circ)^\circ = K\).
In \(\mathbb{R}^2\): the square \([-1,1]^2\) (4 vertices, 4 edges) is polar dual to the diamond \(\{|x_1| + |x_2| \leq 1\}\) (4 vertices, 4 edges). Notice that the number of vertices of one equals the number of facets of the other (in 2D both counts are 4; the square and diamond are combinatorially equivalent, related by a \(45°\) rotation composed with a dilation by \(\sqrt{2}\)).
\textbf{Unit ball.} For \(K = \mathbb{B}^n\), the maximum of \(\langle u, x\rangle\) over \(\|x\| \leq 1\) is \(\|u\|\) (achieved at \(x = u/\|u\|\)). So \(h_{\mathbb{B}^n}(u) = \|u\|\). The support function of the ball is the Euclidean norm — a smooth function.
\textbf{Square.} For \(K = [-1,1]^2\), the maximum of \(\langle u, x\rangle = u_1 x_1 + u_2 x_2\) over \(|x_i| \leq 1\) is \(|u_1| + |u_2|\) (achieved by setting \(x_i = \mathrm{sgn}(u_i)\)). So \(h_{[-1,1]^2}(u) = |u_1| + |u_2| = \|u\|_1\). The support function of the square is the \(\ell^1\) norm — a piecewise linear function with kinks on the axes.
\textbf{Cross-polytope.} For \(K = \mathcal{C}_2 = \{|x_1|+|x_2| \leq 1\}\) (diamond), the maximum of \(u_1 x_1 + u_2 x_2\) subject to \(|x_1|+|x_2| \leq 1\) is \(\max(|u_1|, |u_2|) = \|u\|_\infty\). So \(h_{\mathcal{C}_2}(u) = \|u\|_\infty\).
This confirms the duality: the support function of \(K^\circ\) is the gauge of \(K\), and we see \(h_{[-1,1]^2}(u) = \|u\|_1\) while the gauge of \([-1,1]^2\) is \(\|u\|_\infty = h_{\mathcal{C}_2}(u)\). The ball is self-dual because the \(\ell^2\) norm is the support function of \(\mathbb{B}^n\) and also its gauge.
This is a piecewise linear function. For \(u = (1, 0, \ldots, 0)\): \(h = 1\) (achieved at \(e_1\)). For \(u = (-1, -1, \ldots, -1)\): \(h = 0\) (achieved at the origin). For \(u = (3, 1, 2)\): \(h = \max(0, 3, 1, 2) = 3\) (achieved at \(e_1\)).
- For \(K = \{x_0\}\) (a single point), \(h_K(u) = \langle u, x_0 \rangle\). The support function of a point is the linear functional that evaluates at that point. This is the "degenerate" case — it is a linear function of \(u\), not merely convex.
- For \(K = [a, b]\) (a line segment in \(\mathbb{R}^n\), not just \(\mathbb{R}\)), \(h_K(u) = \max(\langle u, a\rangle, \langle u, b\rangle)\) since the maximum of a linear function over a segment is achieved at an endpoint.
- For the box \(K = [-a_1, a_1] \times [-a_2, a_2] \times \cdots \times [-a_n, a_n]\) with \(a_i > 0\): \(h_K(u) = \sum_{i=1}^n a_i |u_i|\). This is an \(\ell^1\)-type expression, weighted by the half-lengths. In particular, for the unit cube \([-1,1]^n\): \(h(u) = \|u\|_1\).
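For polytopes, the support function can be computed exactly by maximizing over the vertices; a check of the box formula (helper name `support_box` is ours):

```python
import itertools
import random

def support_box(a, u):
    """h_K(u) for K = prod_i [-a_i, a_i], maximizing over the 2^n vertices."""
    return max(sum(s * ai * ui for s, ai, ui in zip(signs, a, u))
               for signs in itertools.product((-1, 1), repeat=len(a)))

random.seed(3)
for _ in range(500):
    n = random.randint(1, 5)
    a = [random.uniform(0.1, 3.0) for _ in range(n)]
    u = [random.uniform(-2.0, 2.0) for _ in range(n)]
    closed_form = sum(ai * abs(ui) for ai, ui in zip(a, u))
    assert abs(support_box(a, u) - closed_form) < 1e-9

# unit cube [-1,1]^3: h(u) = ||u||_1
assert support_box([1, 1, 1], [3, -1, 2]) == 6
```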
Furthermore, every positively 1-homogeneous convex function \(h: \mathbb{R}^n \to \mathbb{R}\) is the support function of some unique convex body — the body \(K = \{x : \langle x, u\rangle \leq h(u) \text{ for all } u \in \mathbb{S}^{n-1}\}\). So the support function parametrizes the space of convex bodies as a convex cone in the space of functions, and Minkowski addition corresponds to addition in this cone.
- \(h_K(u_1, u_2) = \max_{(x_1,x_2) \in [0,1]^2}(u_1 x_1 + u_2 x_2) = \max(0, u_1) + \max(0, u_2) = u_1^+ + u_2^+\).
- \(h_L(u) = \|u\|\).
- \(h_{K+L}(u) = u_1^+ + u_2^+ + \|u\|\).
8.1b Polar Duality: Proofs and Further Examples
We now prove the bipolar theorem and work through additional examples of polar duals.
For the reverse, suppose \(x_0 \notin K\). Since \(K\) is closed and convex and \(x_0 \notin K\), the strict separation theorem gives \(v \in \mathbb{R}^n\) and scalars \(\alpha > \beta\) with \(\langle v, x_0\rangle > \alpha > \beta \geq \langle v, x\rangle\) for all \(x \in K\). Since \(0 \in K\), we have \(\beta \geq 0\), hence \(\alpha > 0\). Let \(\tilde{v} = v/\alpha\); then \(\langle \tilde{v}, x_0\rangle > 1\) and \(\langle \tilde{v}, x\rangle \leq \beta/\alpha \leq 1\) for all \(x \in K\). So \(\tilde{v} \in K^\circ\) but \(\langle \tilde{v}, x_0\rangle > 1\), meaning \(x_0 \notin (K^\circ)^\circ\). This proves \((K^\circ)^\circ \subseteq K\), completing the proof.
This inversive relationship is central to John’s theorem: the John ellipsoid of \(K\) (maximal volume inscribed ellipsoid) is related to the Loewner ellipsoid of \(K^\circ\) (minimal volume circumscribed ellipsoid), and the Mahler volume \(\mathrm{Vol}(K)\mathrm{Vol}(K^\circ)\) is a key affine invariant. The Mahler conjecture asks for its minimum: the conjectured minimizers are the cube (among origin-symmetric bodies) and the simplex (in general); the symmetric case has been settled in dimension 3 but remains open in higher dimensions.
8.2 The Legendre–Fenchel Transform
- If \(f(x) = \frac{1}{2}\|x\|^2\), then \(f^*(y) = \frac{1}{2}\|y\|^2\): the quadratic \(\frac{1}{2}\|x\|^2\) is its own conjugate. Derivation: \(\sup_x (\langle x,y\rangle - \frac{1}{2}\|x\|^2)\). Differentiating in \(x\): \(y - x = 0\), so \(x^* = y\), and \(f^*(y) = \langle y,y\rangle - \frac{1}{2}\|y\|^2 = \frac{1}{2}\|y\|^2\).
- If \(f(x) = \frac{1}{p}\|x\|^{p}\) for \(p > 1\), then \(f^*(y) = \frac{1}{q}\|y\|^{q}\) where \(1/p + 1/q = 1\). This is the Hölder duality for powers: the conjugate of \(x^p/p\) is \(y^q/q\).
- If \(f(x) = \|x\|\), then \(f^*(y) = \mathbf{1}_{\mathbb{B}^n}(y)\) (the indicator of the unit ball, zero inside and \(+\infty\) outside). Derivation: consider \(\sup_x (\langle x,y\rangle - \|x\|)\). If \(\|y\| \leq 1\), then \(\langle x,y\rangle \leq \|x\|\|y\| \leq \|x\|\), so \(\langle x,y\rangle - \|x\| \leq 0\), achieved at \(x=0\). If \(\|y\| > 1\), take \(x = t y/\|y\|\): then \(\langle x,y\rangle - \|x\| = t\|y\| - t = t(\|y\|-1) \to \infty\) as \(t \to \infty\). So \(f^*(y) = 0\) for \(\|y\| \leq 1\) and \(+\infty\) for \(\|y\| > 1\).
- If \(f(x) = e^x\) on \(\mathbb{R}\), then \(f^*(y) = y\log y - y\) for \(y > 0\) (and \(0\) at \(y = 0\), \(+\infty\) for \(y < 0\)). Derivation: \(\sup_x (xy - e^x)\). Differentiating: \(y - e^x = 0\), so \(x^* = \log y\), and \(f^*(y) = y \log y - y\). This conjugate is the negative entropy function: the exponential and the entropy are Legendre duals.
- If \(f(x) = -\log x\) on \((0,\infty)\), then \(f^*(y) = -1 - \log(-y)\) for \(y < 0\) (and \(+\infty\) for \(y \geq 0\)). Derivation: \(\sup_{x>0}(xy - (-\log x)) = \sup_{x>0}(xy + \log x)\). Differentiating: \(y + 1/x = 0\), so \(x^* = -1/y\) (valid for \(y < 0\)), and \(f^*(y) = y \cdot (-1/y) + \log(-1/y) = -1 + \log(-1/y) = -1 - \log(-y)\).
- For a convex body \(K\), the support function \(h_K(u) = \sup_{x \in K} \langle x, u \rangle\) is the Legendre transform of the indicator function \(\mathbf{1}_K\) (which equals 0 on \(K\) and \(+\infty\) off \(K\)).
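These conjugate formulas can be verified by brute force: the Python sketch below (function names ours) approximates \(f^*(y) = \sup_x (xy - f(x))\) by maximizing over a fine one-dimensional grid, and checks two of the entries above: the conjugate of \(e^x\) is \(y\log y - y\), and \(\frac{1}{2}x^2\) is self-conjugate.

```python
import math

def conjugate(f, y, lo=-20.0, hi=20.0, steps=40001):
    # brute-force Legendre transform f*(y) = sup_x (x*y - f(x)) on a grid
    best = -float("inf")
    for k in range(steps):
        x = lo + (hi - lo) * k / (steps - 1)
        best = max(best, x * y - f(x))
    return best

# f(x) = e^x  =>  f*(y) = y log y - y  for y > 0
for y in (0.5, 1.0, 2.0, 5.0):
    exact = y * math.log(y) - y
    assert abs(conjugate(math.exp, y) - exact) < 1e-3

# f(x) = x^2/2 is its own conjugate
for y in (-3.0, 0.0, 1.5):
    assert abs(conjugate(lambda x: x * x / 2, y) - y * y / 2) < 1e-3
```

The grid restriction \([-20, 20]\) is harmless here because in each test the supremum is attained at an interior point (\(x^* = \log y\), resp. \(x^* = y\)) well inside the window.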
This illustrates a fundamental principle: the Legendre transform converts “curvature in the \(x\)-space” (captured by \(A\)) into “curvature in the \(y\)-space” (captured by \(A^{-1}\)). High curvature (large eigenvalues of \(A\)) in one space corresponds to low curvature (small eigenvalues of \(A^{-1}\)) in the dual.
- \(f^*\) is always convex and lower semicontinuous (even if \(f\) is not).
- Young–Fenchel inequality: \(\langle x, y \rangle \leq f(x) + f^*(y)\) for all \(x, y\).
- Fenchel–Moreau biconjugate theorem: If \(f\) is convex, proper, and lower semicontinuous, then \(f^{**} = (f^*)^* = f\).
- The conjugate of a sum (with appropriate domains) satisfies the infimal convolution formula: \((f + g)^*(y) = \inf_{u} (f^*(u) + g^*(y - u))\).
For the reverse, notice that \(f^{**}(x) = \sup\{l(x) : l \text{ affine}, l \leq f\}\) (supremum over all affine minorants of \(f\)). Since convex lsc functions are pointwise suprema of their affine minorants (by the Hahn–Banach theorem), we get \(f^{**}(x) = f(x)\).
Geometrically: \(f^{**}(x)\) is the convex lower semicontinuous envelope of \(f\) — the largest convex lsc function below \(f\). If \(f\) is already convex and lsc, this envelope equals \(f\) itself.
8.2b Computing the Legendre Transform: Worked Examples in \(\mathbb{R}^n\)
We collect several more computations of the Legendre transform that are important in applications.
Summary: \(\text{(log-sum-exp)}^* = \text{(negative entropy on simplex)}\). This is the fundamental duality in optimization and information geometry: the log-partition function and the negative entropy are Legendre duals.
For a polytope \(P = \{x : Ax \leq b\}\), the support function can be computed by LP: \(h_P(y) = \max\{\langle y, x\rangle : Ax \leq b\}\). By LP duality, \(h_P(y) = \min\{\langle b, \lambda\rangle : A^T\lambda = y, \lambda \geq 0\}\) — so the support function is computed by the dual LP.
This formula is dual to the fact that the support function of \(K + L\) is \(h_K + h_L\): the Minkowski sum of bodies corresponds to the addition of support functions, which are precisely the Legendre transforms of the bodies' (convex-analysis) indicator functions.
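The LP-duality formula for \(h_P\) can be seen concretely for the square \(P = [-1,1]^2\), written as \(\{x : Ax \leq b\}\) with rows \(\pm e_i\) and \(b = (1,1,1,1)\). The Python sketch below (our own names) evaluates the primal by scanning vertices and the dual by exhibiting an explicit feasible multiplier \(\lambda\), confirming equal values.

```python
# P = [-1,1]^2 as {x : Ax <= b}; primal h_P(y) = max <y,x> = |y1| + |y2|,
# dual h_P(y) = min <b, lam> subject to A^T lam = y, lam >= 0.
A = [(1, 0), (-1, 0), (0, 1), (0, -1)]
b = [1, 1, 1, 1]

def h_primal(y):
    # maximum of <y, x> over the four vertices of the square
    verts = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
    return max(y[0] * vx + y[1] * vy for vx, vy in verts)

def h_dual(y):
    # explicit dual-feasible multiplier achieving the optimum:
    # lam = (y1^+, y1^-, y2^+, y2^-) satisfies A^T lam = y, lam >= 0
    lam = [max(y[0], 0), max(-y[0], 0), max(y[1], 0), max(-y[1], 0)]
    assert abs(lam[0] - lam[1] - y[0]) < 1e-12  # feasibility A^T lam = y
    assert abs(lam[2] - lam[3] - y[1]) < 1e-12
    return sum(bi * li for bi, li in zip(b, lam))

for y in [(3.0, -2.0), (0.0, 1.0), (-1.5, -2.5)]:
    assert abs(h_primal(y) - h_dual(y)) < 1e-12   # strong LP duality
```

For a general polytope the dual minimum would be found by an LP solver; here the optimal \(\lambda\) can be written down by hand because \(A\) has such simple rows.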
8.3 Fenchel Duality and Young’s Inequality
8.3b Fenchel Duality in Practice: From Theory to Algorithms
The Fenchel duality theorem is not merely a theoretical curiosity — it is the practical engine behind modern convex optimization solvers.
This is the practical implementation of the abstract duality theory: the “coordinate splitting” structure of ADMM is exactly the Fenchel duality decomposition of the optimization problem.
8.4 Applications and the Dual Perspective
The Legendre–Fenchel duality framework unifies many inequalities. We record two key applications.
Appendix: Connections, Context, and Further Directions
A.1b Concentration of Measure: The Lévy–Milman Phenomenon
The concentration of measure phenomenon is one of the most striking discoveries of high-dimensional geometry. It says that in high-dimensional spaces, nicely-behaved functions are essentially constant — the measure is so concentrated that deviations from the median are exponentially rare.
The geometric explanation: on the high-dimensional sphere, almost all the measure is concentrated near the equator (the hyperplane perpendicular to any fixed direction). This is the content of the “central limit theorem for projections” — a random projection of a high-dimensional uniform distribution on a sphere looks Gaussian.
A.1 Convexity in Optimization Theory
The deepest motivation for studying convex geometry is its role in optimization. When a problem has a convex structure, the landscape of the objective function has no “false minima” — every local minimum is a global minimum. This single property underlies the success of modern optimization algorithms (gradient descent, the simplex method, interior point methods, ADMM) in machine learning, engineering, and economics.
The key dictionary between convex geometry and optimization is:
- Supporting hyperplane theorem \(\leftrightarrow\) KKT conditions: The KKT (Karush–Kuhn–Tucker) conditions for a constrained convex program are equivalent to the existence of a supporting hyperplane to the feasible set at the optimal point.
- Separation theorem \(\leftrightarrow\) Strong LP duality: strong duality for linear programs (equal primal and dual optimal values when the primal has a finite optimum) is proved by a separation argument, in the form of Farkas' lemma.
- Subdifferential \(\leftrightarrow\) Lagrange multipliers: At the optimal point, the zero subgradient condition \(0 \in \partial f(x^*) + \lambda^* \partial g(x^*)\) generalizes \(\nabla f = \lambda \nabla g\).
- Fenchel duality \(\leftrightarrow\) Lagrangian duality: The Fenchel dual problem \(\sup_y (-f^*(-y) - g^*(y))\) is the general form of the Lagrangian dual.
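The last entry of the dictionary can be checked on the smallest nontrivial instance: \(f(x) = \frac{1}{2}x^2\), \(g(x) = |x|\) on \(\mathbb{R}\). Then \(f^*(z) = \frac{1}{2}z^2\) and \(g^*\) is the indicator of \([-1,1]\), so the Fenchel dual \(\sup_y(-f^*(-y) - g^*(y)) = \sup_{|y|\leq 1}(-\frac{1}{2}y^2) = 0\) matches the primal \(\inf_x(f+g) = 0\). A brute-force Python check (illustrative, with our own names):

```python
# f(x) = x^2/2, g(x) = |x| on R.
# Primal: inf_x (f + g) = 0, attained at x = 0.
# Dual:   sup_y (-f*(-y) - g*(y)) with f*(z) = z^2/2 and
#         g* = indicator of [-1, 1]; maximum 0, attained at y = 0.
def primal_value():
    return min((x / 1000) ** 2 / 2 + abs(x / 1000)
               for x in range(-5000, 5001))

def dual_value():
    # g*(y) = 0 only on [-1, 1], so the sup ranges over that interval
    return max(-(y / 1000) ** 2 / 2 for y in range(-1000, 1001))

assert abs(primal_value() - 0.0) < 1e-9
assert abs(dual_value() - 0.0) < 1e-9
assert abs(primal_value() - dual_value()) < 1e-9   # no duality gap
```

Zero duality gap is guaranteed here because both functions are convex with overlapping domains, which is exactly the qualification hypothesis of the Fenchel duality theorem.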
A.1c Convexity in Statistics and Machine Learning
The connection between convexity and statistical inference is fundamental — many of the core algorithms in machine learning are convex optimization problems.
The Hessian of \(A(\theta)\) is the covariance matrix \(\mathrm{Cov}_\theta[T(X)]\) — a positive semidefinite matrix. So the log-partition function is strictly convex, and the MLE is unique whenever the exponential family is regular (full rank). All of this follows from the convexity theory developed in Chapters 6 and 8.
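The identity "Hessian of the log-partition function = covariance of the sufficient statistic" can be seen in the simplest exponential family. As an illustrative assumption (this specific family is not worked out above), take the Bernoulli family with sufficient statistic \(T(x) = x\) on \(\{0,1\}\), so \(A(\theta) = \log(1 + e^\theta)\) and \(\mathrm{Var}_\theta[X] = \sigma(\theta)(1-\sigma(\theta))\):

```python
import math

def A(theta):
    # log-partition function of the Bernoulli family: A(t) = log(1 + e^t)
    return math.log(1.0 + math.exp(theta))

def mean(theta):
    # A'(theta) = sigmoid(theta) = E_theta[X]
    return 1.0 / (1.0 + math.exp(-theta))

h = 1e-4
for t in (-3.0, -0.5, 0.0, 1.2, 4.0):
    second_diff = (A(t + h) - 2 * A(t) + A(t - h)) / h ** 2
    var = mean(t) * (1.0 - mean(t))       # Var_theta[X] for a Bernoulli
    assert second_diff >= 0.0             # convexity of A
    assert abs(second_diff - var) < 1e-4  # A''(theta) = Cov_theta[T(X)]
```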
The geometric interpretation: the SVM finds the maximum-margin hyperplane separating the two classes. This is exactly the supporting hyperplane theorem applied to the convex hulls of the two class point clouds — the optimal separating hyperplane is the one that maximizes the distance between the two convex hulls.
A.2 John’s Ellipsoid and the Banach–Mazur Distance
A fundamental question in convex geometry is: “how round is a convex body?” The theory of John’s ellipsoid provides a quantitative answer.
A.2b Volume of Convex Bodies and the Curse of Dimensionality
A key aspect of high-dimensional geometry is that volumes behave very differently from low-dimensional intuition.
By contrast, for the cross-polytope \(\mathcal{C}_n = \{x : \|x\|_1 \leq 1\}\), the volume is \(\mathrm{Vol}(\mathcal{C}_n) = 2^n/n!\), even smaller than the ball. So in high dimensions: cross-polytope \(\ll\) ball \(\ll\) cube (in volume), even though all three have “radius” 1 in their respective norms.
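The three volume formulas (ball \(\omega_n = \pi^{n/2}/\Gamma(n/2+1)\), cube \(2^n\) for \([-1,1]^n\), cross-polytope \(2^n/n!\)) make the ordering easy to verify numerically with the standard library:

```python
import math

def vol_ball(n):
    # volume of the unit Euclidean ball B^n
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

def vol_cube(n):
    # volume of [-1,1]^n (the unit ball of the sup norm)
    return 2.0 ** n

def vol_cross(n):
    # volume of the cross-polytope {x : ||x||_1 <= 1}
    return 2.0 ** n / math.factorial(n)

# in high dimension: cross-polytope << ball << cube
for n in (5, 10, 20, 50):
    assert vol_cross(n) < vol_ball(n) < vol_cube(n)

# the ball's share of the cube shrinks super-exponentially
assert vol_ball(20) / vol_cube(20) < 1e-7
```

The last assertion quantifies the "curse of dimensionality": a point sampled uniformly from the cube \([-1,1]^{20}\) lands in the inscribed ball with probability below \(10^{-7}\).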
A.3 Log-Concavity in Combinatorics and Probability
Log-concavity appears throughout combinatorics in the sequences of binomial and multinomial coefficients, in characteristic polynomials of matroids, and in the generating functions of many combinatorial objects. The following is a brief tour.
A.3b Connections to Information Theory: The Entropy Power Inequality
The analogy between the Brunn–Minkowski inequality and information theory is deep and precise.
The analogy is not coincidental:
- Volume \(\mathrm{Vol}(K)^{1/n}\) corresponds to the entropy power \(e^{2h(X)/n}\).
- Minkowski sum \(K + L\) corresponds to convolution \(X + Y\) (independent sum of random variables).
- Indicator function \(\mathbf{1}_K\) corresponds to density \(f\).
- Ball (maximizes volume for given "spread") corresponds to Gaussian (maximizes entropy for given variance).
The Cramér–Rao inequality \(\mathrm{Var}(\hat{\theta}) \geq 1/I(\theta)\) (variance of any unbiased estimator is at least the reciprocal of Fisher information) is a direct consequence of the Cauchy–Schwarz inequality applied to the score function, connecting convexity (Cauchy–Schwarz is a consequence of the AM–GM/Jensen framework) to statistical estimation.
A.4 The Brunn–Minkowski–Lusternik and Concentration of Measure
The Brunn–Minkowski inequality and its functional analogue the Prékopa–Leindler inequality are the foundations of the “concentration of measure” phenomenon in high-dimensional geometry. The key result is:
The concentration of measure principle has profound implications for randomized algorithms (random projections nearly preserve distances), compressed sensing (random matrices satisfy the restricted isometry property), and machine learning (the “blessing of dimensionality” — in high dimensions, random samples are approximately equidistant from each other and from the center).
A.4b Worked Problems: Connecting All the Themes
The following worked problems synthesize material from across the course.
Solution. If \(K = L\), clearly \(h_K(u) = \max_{x\in K}\langle u,x\rangle = \max_{x\in L}\langle u,x\rangle = h_L(u)\) for all \(u\).
For the converse, recall that a compact convex set is the intersection of its supporting half-spaces:
\[ K = \bigcap_{u \in \mathbb{S}^{n-1}} \{x : \langle u, x\rangle \leq h_K(u)\}. \]
Since \(h_K = h_L\), the two families of half-spaces are identical, so \(K = L\).
This shows the support function is a “complete invariant” of compact convex sets. It is the basis for all support function-based proofs in the Brunn–Minkowski theory.
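The half-space representation used in the solution also gives a practical membership test: \(x \in K\) iff \(\langle u, x\rangle \leq h_K(u)\) for every direction \(u\). A Python sketch (our own names) for the square \(K = [-1,1]^2\), where \(h_K(u) = |u_1| + |u_2|\), testing a fine grid of directions:

```python
import math

def h_square(u):
    # support function of K = [-1,1]^2: h_K(u) = |u1| + |u2|
    return abs(u[0]) + abs(u[1])

def in_body(x, h, n_dirs=720):
    # x is in the closed convex body iff <u, x> <= h(u) for all directions u;
    # we test a fine grid of directions on the circle
    for k in range(n_dirs):
        t = 2 * math.pi * k / n_dirs
        u = (math.cos(t), math.sin(t))
        if u[0] * x[0] + u[1] * x[1] > h(u) + 1e-9:
            return False
    return True

assert in_body((0.5, 0.5), h_square)
assert in_body((-1.0, 0.99), h_square)
assert not in_body((1.5, 0.0), h_square)  # u = e_1 witnesses 1.5 > 1
```

For points outside \(K\) the separating direction found by the loop is exactly a witness from the strict separation theorem.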
Solution. By definition, \(v \in \partial f(x)\) iff \(f(y) \geq f(x) + \langle v, y-x\rangle\) for all \(y\), i.e., \(g(Ay) \geq g(Ax) + \langle v, y-x\rangle\).
Heuristically, if \(A\) were invertible we could substitute \(z = Ay\), \(z_0 = Ax\) and rewrite the inequality as \(g(z) \geq g(z_0) + \langle w, z - z_0\rangle\) with \(v = A^T w\). Since \(A\) need not be invertible, we instead argue directly with the adjoint: the candidate subgradients of \(f\) at \(x\) are exactly the vectors \(A^T w\) with \(w \in \partial g(Ax)\).
More precisely: \(v \in \partial f(x)\) iff for all \(y\), \(g(Ay) \geq g(Ax) + \langle v, y-x\rangle\). Let \(w \in \partial g(Ax)\): then \(g(z) \geq g(Ax) + \langle w, z-Ax\rangle\) for all \(z\). Setting \(z = Ay\): \(g(Ay) \geq g(Ax) + \langle w, Ay - Ax\rangle = g(Ax) + \langle A^T w, y-x\rangle\). So \(A^T w \in \partial f(x)\). This gives \(A^T \partial g(Ax) \subseteq \partial f(x)\). The reverse inclusion follows by noting that any \(v \in \partial f(x)\) satisfies the inequality, and back-solving for \(w \in \partial g(Ax)\) with \(v = A^T w\) (using the identification of subgradients with supporting hyperplanes of the epigraph, composed with \(A\)).
By AM–GM, \(\left(\prod_{i=1}^n \frac{a_i}{a_i+b_i}\right)^{1/n} \leq \frac{1}{n}\sum_{i=1}^n \frac{a_i}{a_i+b_i}\), and similarly with the roles of \(a_i\) and \(b_i\) exchanged. Adding the two inequalities: \(\left(\prod_i \frac{a_i}{a_i+b_i}\right)^{1/n} + \left(\prod_i \frac{b_i}{a_i+b_i}\right)^{1/n} \leq \frac{1}{n}\sum_i \frac{a_i+b_i}{a_i+b_i} = 1\). This is exactly the BM inequality for boxes, confirming that AM–GM is the “1D engine” driving Brunn–Minkowski.
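The box case of Brunn–Minkowski is easy to stress-test numerically: for boxes with side lengths \(a_i\) and \(b_i\), the Minkowski sum is the box with sides \(a_i + b_i\). A Python sketch (our own names) checking the inequality on random boxes and the equality case of homothetic boxes:

```python
import math
import random

def check_bm_for_boxes(a, b):
    # Brunn-Minkowski for boxes: Vol(A+B)^{1/n} >= Vol(A)^{1/n} + Vol(B)^{1/n},
    # where A, B have side lengths a_i, b_i and A + B has sides a_i + b_i
    n = len(a)
    vA = math.prod(a) ** (1.0 / n)
    vB = math.prod(b) ** (1.0 / n)
    vS = math.prod(ai + bi for ai, bi in zip(a, b)) ** (1.0 / n)
    return vS >= vA + vB - 1e-12

random.seed(0)
for _ in range(1000):
    n = random.randint(1, 8)
    a = [random.uniform(0.1, 10.0) for _ in range(n)]
    b = [random.uniform(0.1, 10.0) for _ in range(n)]
    assert check_bm_for_boxes(a, b)

# equality when B is a dilate of A (sides proportional), matching the
# known equality case of Brunn-Minkowski for homothetic bodies
a, b, n = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0], 3
lhs = math.prod(x + y for x, y in zip(a, b)) ** (1 / n)
rhs = math.prod(a) ** (1 / n) + math.prod(b) ** (1 / n)
assert abs(lhs - rhs) < 1e-9
```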
A.4c The Geometry of Duality: A Synthesis
The following remarks connect the duality theory of Chapter 8 with the geometric results of Chapters 1–5.
Set-level duality: The polar body \(K^\circ\) is dual to \(K\). Large bodies have small polars; the ball is self-dual. The bipolar theorem \((K^\circ)^\circ = K\) shows this is a perfect involution on closed convex bodies containing the origin.
Function-level duality: The Legendre–Fenchel conjugate \(f^*\) is dual to \(f\). The biconjugate theorem \(f^{**} = f\) (for convex lsc \(f\)) mirrors the bipolar theorem. The Young–Fenchel inequality \(\langle x,y\rangle \leq f(x) + f^*(y)\) mirrors the defining property of the polar body.
Norm-level duality: The dual norm \(\|y\|_* = \sup_{\|x\|\leq 1}\langle x,y\rangle\) is the support function of the unit ball. The generalized Cauchy–Schwarz inequality \(\langle x,y\rangle \leq \|x\|\,\|y\|_*\) is the Young–Fenchel inequality applied to homogeneous functions.
These three are not separate; they are facets of a single algebraic structure. Set-level polarity is the “indicator function” version: \(\mathbf{1}_K^* = h_K\) (the Legendre transform of the indicator is the support function), and by the biconjugate theorem \(h_K^* = \mathbf{1}_K\). Polarity itself enters through the gauge: \(h_{K^\circ}\) is the gauge (Minkowski functional) of \(K\). The norm duality is the “one-homogeneous” version.
Lutwak’s \(L^p\) Brunn–Minkowski theory (1990s) unifies the classical and dual theories by introducing a parameter \(p \in [-\infty, \infty]\), with \(p=1\) giving classical Brunn–Minkowski and \(p=-1\) giving the dual theory. The extension to \(p \neq 1\) involves new geometric operations and new inequalities that are active areas of current research.
The Blaschke selection theorem (compactness): any sequence of convex bodies in a bounded region has a convergent subsequence (in Hausdorff metric). This is used in existence proofs (John’s ellipsoid, Minkowski’s theorem) — the optimizer exists because we can extract a convergent subsequence from a minimizing sequence.
A.5 Glossary of Key Concepts
For quick reference, here is a summary of the principal concepts introduced in this course:
- Affine hull \(\mathrm{aff}(S)\): the smallest affine subspace containing \(S\).
- Convex hull \(\mathrm{conv}(S)\): the smallest convex set containing \(S\); all convex combinations of points in \(S\).
- Extreme point of \(C\): a point that cannot be expressed as a nontrivial convex combination within \(C\).
- Supporting hyperplane: a hyperplane touching the boundary of a convex set with the set on one side.
- Minkowski sum \(A + B\): the set of all sums \(a + b\) with \(a \in A\), \(b \in B\).
- Steiner formula: \(\mathrm{Vol}(K + \varepsilon B^n) = \sum_k \binom{n}{k} W_k(K) \varepsilon^k\).
- Mixed volume \(V(K_1, \ldots, K_n)\): the multilinear coefficient in the expansion of \(\mathrm{Vol}(\sum t_i K_i)\).
- Quermassintegrals \(W_k(K)\): the coefficients in the Steiner formula; include volume, surface area, mean width.
- Brunn–Minkowski inequality: \(\mathrm{Vol}(A+B)^{1/n} \geq \mathrm{Vol}(A)^{1/n} + \mathrm{Vol}(B)^{1/n}\).
- Support function \(h_K(u) = \max_{x \in K} \langle u, x\rangle\): encodes the geometry of \(K\); \(h_{A+B} = h_A + h_B\).
- Polar body \(K^\circ = \{y : \langle x, y\rangle \leq 1 \,\forall x \in K\}\): inclusion-reversing duality; \((K^\circ)^\circ = K\) for closed convex \(K\) containing the origin.
- Legendre–Fenchel transform \(f^*(y) = \sup_x (\langle x,y\rangle - f(x))\): functional analogue of polarity; \(f^{**} = f\) for convex lsc \(f\).
- Subdifferential \(\partial f(x_0)\): the set of subgradients (slopes of supporting affine minorants) at \(x_0\).
- Log-concave function: \(f((1-t)x + ty) \geq f(x)^{1-t} f(y)^t\); equivalently, \(\log f\) is concave.
- Prékopa–Leindler inequality: the functional Brunn–Minkowski inequality; it implies (Prékopa’s theorem) that marginals of log-concave functions are log-concave.
- Carathéodory’s theorem: every point of \(\mathrm{conv}(S)\) is a convex combination of \(\leq n+1\) points of \(S\).
- Radon’s theorem: any \(n+2\) points admit a partition with intersecting convex hulls.
- Helly’s theorem: if every \(n+1\) of \(m\) convex sets intersect, all \(m\) sets intersect.
- Krein–Milman theorem: every compact convex set is the closed convex hull of its extreme points.
- Birkhoff–von Neumann theorem: doubly stochastic matrices are convex combinations of permutation matrices.
- John’s ellipsoid: the maximal-volume ellipsoid inside \(K\); satisfies \(E \subseteq K \subseteq \sqrt{n} E\) for symmetric \(K\).
- Proximal operator: \(\mathrm{prox}_{\lambda f}(x) = \arg\min_z (f(z) + \frac{1}{2\lambda}\|z-x\|^2)\); generalizes projection onto a convex set.
- Moreau envelope: \(M_\lambda f(x) = \min_z(f(z) + \frac{1}{2\lambda}\|z-x\|^2)\); always differentiable, with gradient \(\frac{1}{\lambda}(x - \mathrm{prox}_{\lambda f}(x))\).
- Strong convexity: \(f\) is \(\mu\)-strongly convex if \(f - \frac{\mu}{2}\|\cdot\|^2\) is convex (for \(C^2\) functions, \(D^2 f \succeq \mu I\)); equivalent to \(f^*\) having a \((1/\mu)\)-Lipschitz gradient.
- Log-partition function: \(A(\theta) = \log\int e^{\langle\theta,T(x)\rangle}dx\); convex in \(\theta\), Legendre dual of the negative entropy.
- Mahler volume: \(\mathcal{M}(K) = \mathrm{Vol}(K)\mathrm{Vol}(K^\circ)\); affine invariant, maximized by the ball (Santaló), conjectured minimized by the cube (Mahler conjecture).
- Normal fan: partition of \(\mathbb{R}^n\) into cones \(N_P(F)\) for each face \(F\) of a polytope \(P\); encodes the face structure in terms of optimizing directions.
- Mixed discriminant: the polarization of the determinant, a symmetric multilinear form on \(n\)-tuples of matrices (for diagonal matrices it reduces to a permanent up to normalization); the linear-algebra analogue of mixed volumes, used in Alexandrov's proof of the Alexandrov–Fenchel inequality.
- Dual norm: \(\|y\|_* = \sup_{\|x\|\leq 1}\langle x,y\rangle\); the support function of the unit ball; the bidual norm equals the original norm, mirroring the bipolar theorem.
- Concentration of measure: for 1-Lipschitz functions on \(\mathbb{S}^{n-1}\), \(\mathbb{P}(|f - M_f| \geq t) \leq 2e^{-(n-1)t^2/2}\); functions are essentially constant in high dimensions.
- Dvoretzky’s theorem: every convex body in \(\mathbb{R}^n\) has a \(k\)-dimensional section that is \((1+\varepsilon)\)-close to an ellipsoid, with \(k \geq c(\varepsilon)\log n\).
- KLS conjecture: the conjecture that the Poincaré (equivalently, Cheeger) constant of every isotropic log-concave measure is bounded by a universal constant; after breakthroughs by Chen (2021) and Klartag–Lehec, the best known bounds are polylogarithmic in \(n\), but the conjecture itself remains open.
A.6 Index of Major Results
For quick reference, here are the major theorems with their locations:
- Carathéodory’s theorem (§2.1): every point of \(\mathrm{conv}(S)\) in \(\mathbb{R}^n\) is a convex combination of \(\leq n+1\) points.
- Radon’s theorem (§2.2): any \(n+2\) points in \(\mathbb{R}^n\) admit a Radon partition.
- Helly’s theorem (§2.3): if every \(n+1\) of finitely many convex sets in \(\mathbb{R}^n\) have a common point, then all of them do.
- Kirchberger’s theorem (§2.4): separability is determined by \((n+2)\)-subsets.
- Supporting hyperplane theorem (§1.3): every boundary point has a supporting hyperplane.
- Strict separation theorem (§1.3): compact convex and closed convex (disjoint) can be strictly separated.
- Krein–Milman theorem (§7.5): compact convex set = closed convex hull of its extreme points.
- Birkhoff–von Neumann theorem (§7.4): doubly stochastic matrices = convex hull of permutation matrices.
- Hardy–Littlewood–Pólya (§7.3): majorization characterized by doubly stochastic matrices and convex sums.
- Steiner formula (§4.1): \(\mathrm{Vol}(K+\varepsilon\mathbb{B}) = \sum_k \binom{n}{k} W_k(K)\varepsilon^k\).
- Brunn–Minkowski inequality (§4.4): \(\mathrm{Vol}(A+B)^{1/n} \geq \mathrm{Vol}(A)^{1/n} + \mathrm{Vol}(B)^{1/n}\).
- Prékopa–Leindler inequality (§5.2): functional Brunn–Minkowski; implies BM as special case.
- Isoperimetric inequality (§4.5): ball minimizes surface area for given volume; proved via BM.
- Prékopa’s theorem (§5.1): marginals of log-concave functions are log-concave.
- Alexandrov–Fenchel inequality (§4.2): \(V(K_1,K_2,K_3,\ldots)^2 \geq V(K_1,K_1,K_3,\ldots)\cdot V(K_2,K_2,K_3,\ldots)\).
- Bipolar theorem (§8.1b): \((K^\circ)^\circ = K\) for closed convex sets containing origin.
- Fenchel–Moreau theorem (§8.2): \(f^{**} = f\) for convex lower semicontinuous \(f\).
- Fenchel duality theorem (§8.3): \(\inf(f+g) = \sup(-f^*(-\cdot)-g^*(\cdot))\).
- John’s theorem (§A.2): maximal inscribed ellipsoid satisfies \(E \subseteq K \subseteq \sqrt{n}E\) (symmetric \(K\)).
- Santaló inequality (§9.2): \(\mathrm{Vol}(K)\mathrm{Vol}(K^\circ) \leq \omega_n^2\), with equality for ellipsoids.
- Dvoretzky’s theorem (§9.3): every convex body has nearly Euclidean sections of dimension \(\sim \log n\).
Chapter 9: Convex Bodies, Volumes, and the Geometry of High Dimensions
The final chapter takes a broader perspective, exploring how the theory developed in the preceding chapters illuminates the geometry of high-dimensional convex bodies. The key themes are: how convex bodies can be approximated by ellipsoids (John’s theorem), how their volumes behave (the Santaló inequality), and what the “typical” convex body looks like (the concentration of measure phenomenon).
9.1 John’s Theorem: The Best-Fit Ellipsoid
We gave John’s theorem in the appendix; here we develop it more fully, with the characterization conditions and applications.
- \(E_J(K) \subseteq K \subseteq n \cdot E_J(K)\) (where \(n \cdot E_J\) denotes dilation by \(n\) about the center of \(E_J\)).
- If \(K\) is centrally symmetric (\(K = -K\)), the bound improves to \(E_J(K) \subseteq K \subseteq \sqrt{n} \cdot E_J(K)\).
- The ellipsoid \(E_J(K) = \mathbb{B}^n\) (the unit ball) if and only if there exist contact points \(u_1, \ldots, u_m \in \partial \mathbb{B}^n \cap \partial K\) and positive weights \(c_1, \ldots, c_m > 0\) such that: \[ \sum_{i=1}^m c_i u_i = 0 \quad \text{and} \quad \sum_{i=1}^m c_i u_i \otimes u_i = I_n. \] (Here \(u_i \otimes u_i\) denotes the rank-1 matrix \(u_i u_i^T\).)
For the cube \(K = [-1,1]^n\), the John ellipsoid is the inscribed unit ball. The contact points are the \(2n\) points \(\pm e_i\), each with weight \(c_i = 1/2\): then \(\sum_i c_i u_i = 0\) by symmetry, and \(\sum_{i=1}^n \frac{1}{2}\big(e_i e_i^T + (-e_i)(-e_i)^T\big) = \sum_i e_i e_i^T = I_n\), verifying John's condition. (Note that \(\sum_i c_i = n\); this is forced by taking traces in \(\sum_i c_i u_i \otimes u_i = I_n\), since each \(u_i\) is a unit vector.) The cube extends to corners at distance \(\sqrt{n}\) from the origin, confirming the bound \(K \subseteq \sqrt{n} E_J(K) = \sqrt{n}\, \mathbb{B}^n\).
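The contact-point identity for the cube can be checked mechanically. The Python sketch below (our own names; plain nested lists instead of a matrix library) builds \(\sum_i c_i u_i\) and \(\sum_i c_i\, u_i u_i^T\) for the \(2n\) contact points \(\pm e_i\) with weights \(c_i = 1/2\) and confirms John's condition:

```python
def outer(u):
    # rank-1 matrix u u^T for a vector u in R^n
    n = len(u)
    return [[u[i] * u[j] for j in range(n)] for i in range(n)]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def john_check(n):
    # contact points +/- e_i of the unit ball inside the cube [-1,1]^n,
    # each with weight c = 1/2, should satisfy
    #   sum c u = 0   and   sum c u u^T = I_n
    points = []
    for i in range(n):
        e = [0.0] * n
        e[i] = 1.0
        points.append(e)
        points.append([-x for x in e])
    c = 0.5
    vec = [sum(c * u[i] for u in points) for i in range(n)]
    mat = [[0.0] * n for _ in range(n)]
    for u in points:
        mat = mat_add(mat, [[c * x for x in row] for row in outer(u)])
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return all(abs(v) < 1e-12 for v in vec) and mat == identity

for n in (2, 3, 7):
    assert john_check(n)
```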
Relatedly, the simplex is the “least round” convex body in the sense of the John ratio: the factor \(n\) in John's theorem is attained by the simplex. This extreme eccentricity is one geometric reflection of worst-case difficulty in linear programming: on carefully constructed polytopes (the Klee–Minty examples), the simplex method can be forced to traverse exponentially many edges.
9.2 The Santaló Inequality and the Mahler Conjecture
The Santaló inequality can be proved using Steiner symmetrization: each symmetrization preserves \(\mathrm{Vol}(K)\) and does not decrease \(\mathrm{Vol}(K^\circ)\), and iterated symmetrizations converge to a ball, where equality holds.
For non-symmetric bodies, the conjectured minimizer is the simplex \(\Delta_n\), with \(\mathcal{M}(\Delta_n) = (n+1)^{n+1}/(n!)^2\).
9.3 Dvoretzky’s Theorem: Almost Spherical Sections
One of the most remarkable theorems of high-dimensional convex geometry is that every high-dimensional convex body has a nearly spherical low-dimensional section.
The theorem is closely related to the concentration of measure phenomenon: a random \(k\)-dimensional subspace with \(k = c\,\varepsilon^2 \log n\) gives a \((1+\varepsilon)\)-spherical section with high probability. Milman’s proof of Dvoretzky’s theorem (1971) uses the Lipschitz concentration of the norm function on the sphere; the key point is that any norm on \(\mathbb{R}^n\), restricted to a random subspace of this dimension, is close to a multiple of the Euclidean norm.
Dvoretzky’s theorem is one of the founding results of asymptotic geometric analysis and has applications in approximation theory, compressed sensing, and quantum information.
9.4 The Central Limit Theorem for Convex Bodies
A striking recent development is that log-concave probability measures satisfy a “central limit theorem” even without independence.
9.5 The Log-Concave Sampling Problem and Connections to Computer Science
The study of log-concave measures is deeply connected to algorithms. The fundamental problem is: given access to a log-concave density \(f\) (e.g., the uniform distribution on a convex body given by an oracle), how quickly can we sample from it?
The fastest known algorithms (hit-and-run, Dikin walk) mix in time \(\tilde{O}(n^2)\) to \(\tilde{O}(n^3)\) steps, and the KLS conjecture would imply the ball walk mixes in \(\tilde{O}(n^2)\) steps. The theory of Markov chain mixing for log-concave distributions is one of the most active intersections of convex geometry and theoretical computer science.
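To make the sampling problem concrete, here is a minimal hit-and-run sketch in Python for the simplest log-concave target, the uniform distribution on the Euclidean unit ball (illustrative only: real implementations work with a membership oracle for a general body and require careful mixing-time analysis; all names are ours). From the current point, pick a uniform random direction, intersect the line with the ball, and jump to a uniform point on the resulting chord:

```python
import math
import random

def hit_and_run_ball(n, steps, seed=0):
    # hit-and-run for the uniform distribution on the unit ball B^n:
    # from x, pick a uniform direction d, intersect the line x + t d
    # with the ball, and move to a uniform point on that chord
    rng = random.Random(seed)
    x = [0.0] * n
    for _ in range(steps):
        d = [rng.gauss(0, 1) for _ in range(n)]
        norm = math.sqrt(sum(v * v for v in d))
        d = [v / norm for v in d]
        xd = sum(a * b for a, b in zip(x, d))
        xx = sum(a * a for a in x)
        # solve ||x + t d||^2 = 1, i.e. t^2 + 2 t <x,d> + (||x||^2 - 1) = 0
        disc = math.sqrt(xd * xd - (xx - 1.0))
        t = rng.uniform(-xd - disc, -xd + disc)
        x = [a + t * b for a, b in zip(x, d)]
    return x

x = hit_and_run_ball(n=10, steps=2000)
assert sum(v * v for v in x) <= 1.0 + 1e-9   # iterates stay in the ball
```

For a general convex body the chord endpoints would be found by binary search against a membership oracle; the uniform-on-the-chord step is what makes the walk reversible with respect to the uniform measure.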
Chapter 10: Advanced Topics and Open Problems
This final chapter collects several advanced topics and major open problems, connecting the material of the course to the frontiers of current research.
10.1 The Alexandrov-Fenchel Inequality: Full Statement and Proof Sketch
We stated the Alexandrov–Fenchel inequalities in Chapter 4. Here we give more detail.
- Taking \(K_3 = K_4 = \cdots = K_n = \mathbb{B}^n\) and using \(V(\underbrace{K,\ldots,K}_{n-k},\underbrace{\mathbb{B}^n,\ldots,\mathbb{B}^n}_{k}) \propto W_k(K)\) (quermassintegrals), the Alexandrov–Fenchel inequality specializes to the isoperimetric inequalities \(W_k(K)^2 \geq W_{k-1}(K) W_{k+1}(K)\) — log-concavity of the sequence of quermassintegrals.
- Taking \(K_1 = K_2 = K\) and \(K_3 = \cdots = K_n = L\), we get \(V(K,L,L,\ldots,L)^2 \geq V(K,K,L,\ldots,L) \cdot V(L,L,\ldots,L)\), which is the Brunn–Minkowski inequality in its mixed-volume form.
- The log-concavity of the sequence \(W_0(K), W_1(K), \ldots, W_n(K)\) chains together to give \(W_1(K)^n \geq W_0(K)^{n-1} W_n(K)\); with \(W_0 = \mathrm{Vol}(K)\), \(W_1 = S(K)/n\), and \(W_n = \omega_n\), this is the classical isoperimetric inequality \(S(K)^n \geq n^n \omega_n \mathrm{Vol}(K)^{n-1}\).
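The log-concavity of the quermassintegrals can be checked in a case where they have a closed form. For the unit cube \([0,1]^n\), the Steiner formula gives \(W_k = \kappa_k\), the volume of the unit \(k\)-ball (a known computation, which we take as given here: it follows from \(V_j(\text{cube}) = \binom{n}{j}\) for the intrinsic volumes). A Python check of \(W_k^2 \geq W_{k-1}W_{k+1}\):

```python
import math

def kappa(k):
    # volume of the unit k-ball (kappa_0 = 1, kappa_1 = 2, kappa_2 = pi)
    return math.pi ** (k / 2) / math.gamma(k / 2 + 1)

# For the unit cube [0,1]^n the Steiner coefficients are W_k = kappa_k
# (assumption of this example; see the lead-in). Check log-concavity
# W_k^2 >= W_{k-1} W_{k+1} along the sequence:
for k in range(1, 30):
    assert kappa(k) ** 2 >= kappa(k - 1) * kappa(k + 1)
```

The inequality here is equivalent to the log-convexity of the Gamma function, a nice instance of the analytic machinery behind Alexandrov–Fenchel appearing already in a one-line special case.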
10.2 The Busemann Intersection Inequality and Dual Brunn–Minkowski
Just as the Brunn–Minkowski inequality governs Minkowski sums, there is a “dual” theory for intersection bodies.
10.3 The Reverse Isoperimetric Inequality and Rogers-Shephard
The inequality has applications in the geometry of numbers (bounding the number of lattice points in a body), in additive combinatorics (the Freiman–Ruzsa theorem concerns analogous bounds for sumsets), and in the study of covering numbers of convex bodies.
10.3b The Geometry of Convex Position: Erdős–Szekeres and Combinatorial Convexity
The combinatorial side of convex geometry includes beautiful results about point configurations in “convex position.”
The connection to Helly’s theorem is thematic: both results say that local conditions force global structure. Helly says that if every \(n+1\) of the sets intersect, then all of them do. Erdős–Szekeres says that every sufficiently large point set in general position contains a large subset in convex position. Both are quantitative manifestations of the principle that convexity is a rigid constraint in finite dimensions.
10.4 Open Problems in Convex Geometry
To conclude, we list several major open problems: