CO 448: Probabilistic Methods in Combinatorics

Estimated study time: 3 hr 18 min

Sources and References

Primary textbook — Noga Alon and Joel H. Spencer, The Probabilistic Method, 4th ed., Wiley-Interscience, 2016. The foundational text for this subject.

Supplementary texts — Michael Molloy and Bruce Reed, Graph Colouring and the Probabilistic Method, Springer, 2002; Stasys Jukna, Extremal Combinatorics, 2nd ed., Springer, 2011; Svante Janson, Tomasz Łuczak, and Andrzej Ruciński, Random Graphs, Wiley, 2000.

Online resources — Jacob Fox and Yufei Zhao, MIT 18.226 Probabilistic Methods in Combinatorics lecture notes (2023, freely available); David Conlon, Oxford C5.4 Probabilistic Methods lecture notes; Nati Linial and Yuval Peled, Hebrew University “Probabilistic Methods” course notes; Benny Sudakov, ETH Zürich “Probabilistic Methods in Combinatorics” lecture notes.


Chapter 1: The Probabilistic Method — Philosophy and Foundations

The probabilistic method is one of the most elegant and powerful paradigms in combinatorics. Its central idea is deceptively simple: to prove that a combinatorial object with some desired property exists, one does not construct it explicitly. Instead, one defines a suitable probability space over a family of objects, and shows that a randomly chosen element satisfies the desired property with strictly positive probability. Since the probability is positive, such an element must actually exist. The method is thus inherently non-constructive — it certifies existence without providing a witness. Paul Erdős, who pioneered the technique in the 1940s, was fond of calling this an argument from “the Book,” a divine volume containing perfect proofs. For decades, probabilistic existence results were regarded as fundamentally non-constructive, until algorithmic breakthroughs in the 1980s–2010s began to close that gap.

Before stating the first major theorem, we establish the foundational framework. A probability space is a triple \((\Omega, \mathcal{F}, \Pr)\) where \(\Omega\) is the sample space, \(\mathcal{F}\) is a \(\sigma\)-algebra of events, and \(\Pr: \mathcal{F} \to [0,1]\) is a probability measure. In combinatorics, \(\Omega\) is typically finite and all subsets are events. The key principle underpinning every application of the probabilistic method is:

Key Principle. If \(\Pr[A] > 0\) for some event \(A \subseteq \Omega\), then \(A \neq \emptyset\). Equivalently, if \(\mathbb{E}[X] > 0\) for a non-negative integer-valued random variable \(X\), then \(\Pr[X > 0] > 0\), so \(X > 0\) for at least one outcome.

This tautological-sounding principle is extraordinarily useful in practice because computing probabilities is often easier than constructing objects directly. The first landmark application, due to Erdős in 1947, gave the first nontrivial lower bound for Ramsey numbers.

Definition (Ramsey number). The Ramsey number \(R(k,k)\) is the minimum \(n\) such that every 2-colouring of the edges of the complete graph \(K_n\) contains a monochromatic \(k\)-clique. Equivalently, any graph on \(R(k,k)\) vertices contains either a clique of size \(k\) or an independent set of size \(k\).
Theorem (Erdős 1947 — Ramsey lower bound). \(R(k,k) > 2^{k/2}\). More precisely, for \(n = \lfloor 2^{k/2} \rfloor\) there exists a 2-colouring of the edges of \(K_n\) with no monochromatic \(k\)-clique.
Proof. Colour each edge of \(K_n\) red or blue independently, each with probability \(1/2\). For a fixed set \(S\) of \(k\) vertices, let \(A_S\) be the event that the induced subgraph on \(S\) is monochromatic (all red or all blue). Since there are \(\binom{k}{2}\) edges in \(S\), we have \(\Pr[A_S] = 2 \cdot 2^{-\binom{k}{2}} = 2^{1-\binom{k}{2}}\). By the union bound, \[ \Pr\!\left[\bigcup_{|S|=k} A_S\right] \leq \binom{n}{k} \cdot 2^{1-\binom{k}{2}}. \] For \(n = \lfloor 2^{k/2} \rfloor\), we have \(\binom{n}{k} \leq n^k / k! \leq 2^{k^2/2}/k!\). Since \(\binom{k}{2} = k(k-1)/2\), we get \(\binom{n}{k} \cdot 2^{1-\binom{k}{2}} \leq \frac{2^{k^2/2}}{k!} \cdot 2^{1-k(k-1)/2} = \frac{2^{k/2+1}}{k!} < 1\) for \(k \geq 3\). Thus with positive probability, no \(k\)-clique is monochromatic, certifying \(R(k,k) > n\). \(\square\)
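The union-bound computation above is easy to sanity-check numerically. The following short script (the helper name is ours) finds, for a given \(k\), the largest \(n\) for which \(\binom{n}{k} 2^{1-\binom{k}{2}} < 1\), i.e. the best lower bound on \(R(k,k)\) that this exact argument certifies:

```python
from math import comb

def union_bound_ramsey(k):
    """Largest n with C(n,k) * 2^(1 - C(k,2)) < 1, so the argument gives R(k,k) > n.

    The equivalent comparison 2*C(n,k) < 2^C(k,2) is done in exact integer arithmetic.
    """
    threshold = 2 ** comb(k, 2)
    n = k  # C(k,k) = 1, so the union bound is 2 < threshold for k >= 3
    while 2 * comb(n + 1, k) < threshold:
        n += 1
    return n
```

For \(k = 10\) this certifies \(R(10,10) > 100\), noticeably better than \(\lfloor 2^{k/2} \rfloor = 32\): the simplification \(\binom{n}{k} \leq n^k/k!\) used in the proof gives away a constant in the exponent.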

This proof is the prototypical application of the probabilistic method. The argument says nothing about what such a colouring looks like — it merely certifies its existence. For comparison, the upper bound \(R(k,k) \leq \binom{2k-2}{k-1} \leq 4^k\) follows from Pascal’s recurrence, so we have \(2^{k/2} < R(k,k) \leq 4^k\). Despite decades of effort, determining the correct exponential base remains one of the central open problems in combinatorics.

Remark (The gap between \(2^{k/2}\) and \(4^k\)). The upper bound \(R(k,k) \leq 4^k\) is a clean consequence of the recurrence \(R(s,t) \leq R(s-1,t) + R(s,t-1)\) with boundary conditions \(R(1,t) = R(s,1) = 1\). Iterating this Pascal-type recurrence gives \(R(k,k) \leq \binom{2k-2}{k-1}\), and \(\binom{2k-2}{k-1} \leq 2^{2k-2} = 4^{k-1}\). Thus the bounds have stood at \(2^{k/2} < R(k,k) \leq 4^k\) since Erdős's 1947 paper, a multiplicative gap of \(2^{3k/2}\). The exponential bases have barely moved: Spencer (1975) improved the lower bound by a constant factor using the Lovász Local Lemma, Sah (2020) sharpened the upper bound to \(4^{k - c\log^2 k}\), and Campos, Griffiths, Morris, and Sahasrabudhe (2023) broke the \(4^k\) barrier for the first time in 75 years, establishing \(R(k,k) \leq (4 - \varepsilon)^k\) for an explicit but tiny \(\varepsilon > 0\). This remains one of the oldest unsolved quantitative problems in combinatorics: what is the correct exponential base?

The workhorse behind most first-moment arguments is linearity of expectation. Unlike variance or higher moments, linearity holds regardless of any dependence structure among the variables.

Remark (Lovász Local Lemma — alternative views of the proof). The standard proof of the asymmetric LLL proceeds by induction on the number of conditioned events. An illuminating alternative viewpoint is Shearer's criterion: the LLL condition \(p_i \leq x_i \prod_{j \in \Gamma(i)} (1-x_j)\) amounts to weighting the dependency graph by the \(x_i\) so that the contributions of neighbouring events factor as though they were independent, and Shearer determined exactly which probability vectors admit such a certificate for a given dependency graph. The benchmark is the fully independent case, where \(\Pr[\bigcap_i \bar{A}_i] = \prod_i \Pr[\bar{A}_i] > 0\); the LLL quantifies how much of this product survives when the events are dependent but the dependency is sparse (as encoded by the dependency graph).

The probabilistic method and the LLL are the two foundational tools of the field. The union bound says “avoid all bad events when their probabilities sum to less than 1.” The LLL says “avoid all bad events even when the sum of probabilities is large, provided the events are nearly independent.” Together, these two principles, combined with the moment methods and concentration inequalities of later chapters, give the complete toolkit of modern probabilistic combinatorics.

Theorem (Linearity of Expectation). For any random variables \(X_1, \ldots, X_n\) (not necessarily independent) on a common probability space, \[ \mathbb{E}\!\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n \mathbb{E}[X_i]. \]

This theorem, simple as it is, is the engine behind an enormous range of existence results. One immediate application gives a lower bound on the independence number of a graph — a probabilistic counterpart of Turán's theorem, and tight for disjoint unions of equal-sized cliques.

Theorem (Turán-type lower bound on independence number). For a graph \(G\) with \(n\) vertices and \(m\) edges, the independence number satisfies \[ \alpha(G) \geq \frac{n^2}{2m + n}. \]
Proof. Consider a uniformly random permutation of the vertices. Include vertex \(v\) in set \(I\) if and only if \(v\) precedes all its neighbours in the permutation. The set \(I\) is always an independent set. By linearity of expectation, \[ \mathbb{E}[|I|] = \sum_{v} \Pr[v \in I] = \sum_{v} \frac{1}{\deg(v)+1}. \] By convexity of \(x \mapsto 1/(x+1)\) and the constraint \(\sum_v \deg(v) = 2m\), the sum \(\sum_v 1/(\deg(v)+1)\) is minimized when all degrees are equal, giving \(\mathbb{E}[|I|] \geq n^2/(2m+n)\) by the Cauchy-Schwarz or harmonic-arithmetic mean inequality. Since \(\alpha(G) \geq \mathbb{E}[|I|]\), the result follows. \(\square\)
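The permutation argument translates directly into a sampling procedure (a minimal sketch; the graph encoding and names are ours):

```python
import random

def permutation_independent_set(n, edges, rng):
    """One sample from the proof: v enters I iff it precedes all its
    neighbours in a uniformly random ordering of the vertices."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    order = list(range(n))
    rng.shuffle(order)
    pos = {v: i for i, v in enumerate(order)}
    return {v for v in range(n) if all(pos[v] < pos[u] for u in adj[v])}
```

Averaging \(|I|\) over many samples approximates \(\sum_v 1/(\deg(v)+1)\); for the 5-cycle this is \(5/3\), which equals \(n^2/(2m+n)\) since the graph is regular.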

A recurring refinement of the basic probabilistic method is the alteration technique: rather than requiring the random object to immediately satisfy all conditions, one generates a random object, identifies a small set of “violations,” and removes or fixes them at a modest cost. Applied to the independence number, alteration gives a bound slightly weaker than the permutation argument above, but its two-step structure is the template for many deeper applications.

Remark (The alteration technique — general framework). The alteration method has the following general structure:
  1. Random construction: generate a random object \(G\) from some distribution over the "ambient" family.
  2. First moment analysis: compute \(\mathbb{E}[V]\) where \(V\) is the number of "violations" (elements of \(G\) that need to be removed or fixed).
  3. Alteration: remove or fix each violation at a deterministic cost (often 1 element per violation). The resulting object \(G'\) satisfies all constraints.
  4. Lower bound: by linearity of expectation, \(\mathbb{E}[|G'|] \geq \mathbb{E}[|G|] - c \cdot \mathbb{E}[V]\) for some cost \(c\). If \(\mathbb{E}[V] \ll \mathbb{E}[|G|]\), the alteration is cheap.
The key optimisation at each step is to choose the parameters of the random construction (e.g., inclusion probability \(p\)) to balance the two terms \(\mathbb{E}[|G|]\) and \(c \cdot \mathbb{E}[V]\). This is an elementary calculus optimisation, and the resulting bound is often tight or near-tight.
Theorem (Alteration bound on independence number). For a graph \(G\) with \(n\) vertices, \(m \geq n/2\) edges, and average degree \(\bar{d} = 2m/n \geq 1\), \[ \alpha(G) \geq \frac{n^2}{4m} = \frac{n}{2\bar{d}}. \]
Proof. Include each vertex in a set \(S\) independently with probability \(p \in [0,1]\). The expected number of edges spanned by \(S\) is \(mp^2\); deleting one endpoint of each such edge leaves an independent set of expected size at least \(np - mp^2\). Optimising at \(p = n/(2m) = 1/\bar{d}\) gives \(\alpha(G) \geq n^2/(4m) = n/(2\bar{d})\). \(\square\)
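The two-step recipe can be run directly; here is a minimal sketch (names ours), with inclusion probability \(p = n/(2m)\):

```python
import random

def alteration_independent_set(n, edges, rng):
    """Step 1: include each vertex independently with probability p = n/(2m).
    Step 2: delete one endpoint of every edge spanned by the sample."""
    m = len(edges)
    p = min(1.0, n / (2 * m))
    S = {v for v in range(n) if rng.random() < p}
    for u, v in edges:
        if u in S and v in S:
            S.discard(v)  # one deletion repairs this edge
    return S
```

Before repair the sample has expected size \(np\); at most one vertex is deleted per edge inside the sample, so the output has expected size at least \(np - mp^2\).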

The alteration method will recur throughout this course. The key idea is always the same: the random step gives you most of what you need, and the deterministic repair step costs only a lower-order amount.

Remark (Constructive implications of the alteration bound). These bounds are not merely existential. The simplest greedy algorithm (repeatedly pick any vertex \(v\), add it to \(I\), and delete \(v\) together with all its neighbours) removes at most \(\Delta + 1\) vertices per step, where \(\Delta\) is the maximum degree, so it outputs an independent set of size at least \(n/(\Delta+1)\). The refinement that always picks a vertex of minimum degree in the remaining graph is known to achieve the stronger Caro-Wei bound \(\sum_v 1/(\deg(v)+1) \geq n/(\bar{d}+1)\). Both algorithms terminate in at most \(n\) steps. This is one of the rare cases where the probabilistic bound immediately yields an efficient deterministic algorithm without needing the method of conditional expectations or any further derandomization.

Chapter 2: The First Moment Method

The first moment method formalizes the observation that if the expected number of “bad” events is less than 1, then with positive probability there are no bad events at all. Conversely, if the expected number of objects of a certain type is at least 1, then such an object exists with positive probability. These two directions — showing existence and showing non-emptiness — are dual facets of the same principle.

A clean first illustration of the first moment method combined with alteration is an upper bound on the domination number of graphs.

Definition (Dominating set). A set \(D \subseteq V(G)\) is a dominating set if every vertex \(v \notin D\) has at least one neighbour in \(D\). The domination number \(\gamma(G)\) is the minimum size of a dominating set.
Theorem (Domination number upper bound). For a graph \(G\) on \(n\) vertices with minimum degree \(\delta \geq 1\), \[ \gamma(G) \leq \frac{n(1 + \ln(\delta+1))}{\delta+1}. \]
Proof. Include each vertex in a set \(S\) independently with probability \(p\) (to be chosen). A vertex \(v\) is not dominated if neither \(v\) nor any of its at least \(\delta\) neighbours is in \(S\). Let \(T\) be the set of undominated vertices. Then \(\mathbb{E}[|T|] \leq n(1-p)^{\delta+1}\). The set \(D = S \cup \{\text{one vertex per undominated vertex}\}\) is a dominating set of expected size at most \(np + n(1-p)^{\delta+1} \leq np + ne^{-p(\delta+1)}\). Set \(p = \ln(\delta+1)/(\delta+1)\); then \(e^{-p(\delta+1)} = 1/(\delta+1)\), so \(\mathbb{E}[|D|] \leq n \cdot \frac{\ln(\delta+1)}{\delta+1} + \frac{n}{\delta+1} = \frac{n(1+\ln(\delta+1))}{\delta+1}\). Since \(\gamma(G) \leq |D|\) for some outcome, the bound follows. \(\square\)
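The proof is again executable as written (a sketch; repairing by adding each undominated vertex itself is one valid instance of “one vertex per undominated vertex”):

```python
import math
import random

def random_dominating_set(n, edges, rng):
    """Sample S with p = ln(delta+1)/(delta+1), then add every vertex
    left undominated (one repair per undominated vertex)."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    delta = min(len(adj[v]) for v in range(n))  # minimum degree
    p = math.log(delta + 1) / (delta + 1)
    S = {v for v in range(n) if rng.random() < p}
    undominated = {v for v in range(n) if v not in S and not (adj[v] & S)}
    return S | undominated
```

The output always dominates by construction, and its expected size is the quantity bounded in the proof.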

The logarithmic factor cannot be removed in general: for random \(\delta\)-regular graphs the domination number is \(\Theta(n \ln \delta / \delta)\) with high probability, matching the bound up to a constant factor. The bound can be far from tight for specific dense graphs, however: the complete bipartite graph \(K_{\delta,\delta}\) is dominated by just 2 vertices, while the formula gives roughly \(2\ln\delta\).

Example (Domination bound for the Petersen graph). The Petersen graph \(G\) has \(n = 10\) vertices, is 3-regular (\(\delta = 3\)), and is vertex-transitive. Applying the domination bound with \(\delta = 3\): \[ \gamma(G) \leq \frac{n(1 + \ln(\delta+1))}{\delta+1} = \frac{10(1 + \ln 4)}{4} = \frac{10(1 + 1.386\ldots)}{4} \approx \frac{10 \times 2.386}{4} \approx 5.97, \] so, since \(\gamma\) is an integer, the bound gives \(\gamma(G) \leq 5\). The true domination number is \(\gamma(G) = 3\). Label the outer 5-cycle \(1,2,3,4,5\), join each vertex \(i\) to the inner vertex \(i+5\), and let the inner pentagram be the cycle \(6\text{–}8\text{–}10\text{–}7\text{–}9\text{–}6\). Then \(\{1, 8, 9\}\) is a dominating set: its closed neighbourhoods are \(\{1,2,5,6\}\), \(\{8,3,6,10\}\), and \(\{9,4,6,7\}\), which together cover all 10 vertices. No 2 vertices suffice, since each closed neighbourhood has size 4 and \(2 \times 4 < 10\). The probabilistic bound of 5 is thus not tight here; the exact value exploits the specific structure of the Petersen graph.
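The exact value can be confirmed by exhaustive search over vertex subsets (labels as in the example):

```python
from itertools import combinations

# Petersen graph: outer cycle 1..5, spokes i <-> i+5,
# inner pentagram 6-8-10-7-9-6
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1),
         (1, 6), (2, 7), (3, 8), (4, 9), (5, 10),
         (6, 8), (8, 10), (10, 7), (7, 9), (9, 6)]
adj = {v: set() for v in range(1, 11)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def dominates(D):
    return all(v in D or (adj[v] & D) for v in range(1, 11))

gamma = next(k for k in range(1, 11)
             if any(dominates(set(D)) for D in combinations(range(1, 11), k)))
print(gamma)  # -> 3, attained e.g. by {1, 8, 9}
```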

The first moment method also yields a beautiful result in combinatorial geometry via a probabilistic argument for the crossing number inequality, which in turn implies the Szemerédi-Trotter theorem on point-line incidences.

Theorem (Crossing Number Inequality). For a simple graph \(G\) with \(n\) vertices and \(m \geq 4n\) edges, \[ \mathrm{cr}(G) \geq \frac{m^3}{64n^2}, \] where \(\mathrm{cr}(G)\) denotes the crossing number of \(G\).
Proof sketch. The Euler formula gives \(\mathrm{cr}(G) \geq m - 3n\) for any drawing. Now apply this to a random subgraph: keep each vertex independently with probability \(p \in (0,1)\), and include an edge only if both endpoints are kept. The subgraph has expected \(pn\) vertices, \(p^2 m\) edges, and \(p^4\,\mathrm{cr}(G)\) crossings. Applying \(\mathrm{cr} \geq m - 3n\) to this subgraph in expectation: \(p^4\,\mathrm{cr}(G) \geq p^2 m - 3pn\). Rearranging gives \(\mathrm{cr}(G) \geq m/p^2 - 3n/p^3\). Optimize over \(p = 4n/m \leq 1\) (valid since \(m \geq 4n\)) to obtain \(\mathrm{cr}(G) \geq m^3/(64n^2)\). \(\square\)

The elegance here is that we apply a weak planar-graph inequality to a random sparse subgraph, then transfer information back to the original dense graph by undoing the sampling. This sampling-and-averaging strategy appears throughout probabilistic combinatorics.

Remark (The crossing number inequality — applications to incidences). The crossing number inequality \(\mathrm{cr}(G) \geq m^3/(64n^2)\) (for \(m \geq 4n\)) is the key ingredient in the Szemerédi-Trotter theorem on point-line incidences: for \(m\) points and \(\ell\) lines in \(\mathbb{R}^2\), the number of incidences satisfies \(I(P,L) = O(m^{2/3}\ell^{2/3} + m + \ell)\). The probabilistic subgraph sampling in the crossing number proof can be seen as a very early instance of the "random restriction" technique, which later became central in Boolean function analysis and computational complexity theory.

The bound \(m^3/(64n^2)\) has the correct exponent: there exist graphs (expanders, complete graphs minus a matching) achieving \(\mathrm{cr}(G) = \Omega(m^3/n^2)\). For the complete graph \(K_n\), \(m = \binom{n}{2} \approx n^2/2\) and the probabilistic bound gives \(\mathrm{cr}(K_n) \geq \Omega(n^4)\), while the true crossing number of \(K_n\) is conjectured to be \(\frac{1}{4}\lfloor n/2 \rfloor \lfloor (n-1)/2 \rfloor \lfloor (n-2)/2 \rfloor \lfloor (n-3)/2 \rfloor \sim n^4/64\) (the Harary-Hill conjecture, with the formula due to Guy; verified for small \(n\) but open in general).
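For concreteness, one can tabulate the two quantities (helper names ours; the conjectured formula is the one quoted above):

```python
from math import comb

def crossing_lower_bound(n, m):
    """Probabilistic bound cr(G) >= m^3/(64 n^2), valid for m >= 4n."""
    assert m >= 4 * n
    return m ** 3 / (64 * n ** 2)

def conjectured_cr_complete(n):
    """The conjectured value of cr(K_n) quoted above, in exact arithmetic."""
    return (n // 2) * ((n - 1) // 2) * ((n - 2) // 2) * ((n - 3) // 2) // 4
```

For \(K_{10}\) (\(m = 45 \geq 4n\)), the probabilistic bound gives \(\approx 14.2\) against the conjectured value 60: the right order \(n^4\), with a constant that the sampling argument gives away.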

In combinatorial number theory, the first moment gives the existence of Sidon sets (also called \(B_2\) sets) of near-optimal size.

Definition (Sidon set). A set \(A \subseteq \{1, \ldots, n\}\) is a Sidon set if all pairwise sums \(a + b\) (with \(a, b \in A\), \(a \leq b\)) are distinct.
Theorem (Probabilistic Sidon set construction). For every \(n\), there exists a Sidon set \(A \subseteq \{1, \ldots, n\}\) with \(|A| \geq c \cdot n^{1/3}\) for an absolute constant \(c > 0\).
Proof sketch. Include each element of \(\{1,\ldots,n\}\) in \(A\) independently with probability \(p = \tfrac{1}{2} n^{-2/3}\). The expected size of \(A\) is \(\tfrac{1}{2}n^{1/3}\). The number of quadruples \((a,b,c,d)\) in \(\{1,\ldots,n\}\) with \(a+b = c+d\) and \(\{a,b\} \neq \{c,d\}\) is at most \(n^3\) (choosing \(a, b, c\) determines \(d\)), so the expected number of colliding quadruples inside \(A\) is at most \(n^3 p^4 = \tfrac{1}{16}n^{1/3}\). Removing one element per colliding quadruple destroys all collisions, leaving a Sidon set of expected size at least \(\tfrac{1}{2}n^{1/3} - \tfrac{1}{16}n^{1/3} = \Omega(n^{1/3})\). \(\square\)

The counting upper bound: if \(|A| = k\), the \(k(k+1)/2\) sums \(a + b\) with \(a \leq b\) all lie in \(\{2, \ldots, 2n\}\), so \(k(k+1)/2 \leq 2n-1\), giving \(k \leq (2+o(1))n^{1/2}\). The simple alteration argument thus falls short of the truth (\(n^{1/3}\) against \(n^{1/2}\)); closing this gap requires the algebraic constructions described next.
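The deletion argument is easy to run (a sketch with names ours; we sample at \(p \propto n^{-2/3}\), the exponent at which the expected number of collisions stays below the expected size):

```python
import random

def find_collision(A):
    """Return an element involved in a repeated pairwise sum, or None if A is Sidon."""
    seen = set()
    for i in range(len(A)):
        for j in range(i, len(A)):
            s = A[i] + A[j]
            if s in seen:
                return A[j]
            seen.add(s)
    return None

def sidon_by_alteration(n, rng):
    """Sample with p = n^(-2/3)/2, then delete one participant per collision."""
    A = [x for x in range(1, n + 1) if rng.random() < 0.5 * n ** (-2 / 3)]
    while (bad := find_collision(A)) is not None:
        A.remove(bad)
    return A
```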

Remark (Sidon sets: explicit constructions and the exact maximum). The alteration argument yields only \(\Theta(n^{1/3})\), which raises the question of whether Sidon sets of the optimal size \(\Theta(n^{1/2})\) exist, and whether they can be constructed explicitly. They can, but the known constructions are algebraic rather than probabilistic:
  • Singer (1938): perfect difference sets arising from the projective plane \(\mathrm{PG}(2,q)\) give, for every prime power \(q\), a Sidon set of size \(q+1\) inside \(\{1, \ldots, q^2+q+1\}\), achieving \(|A| = (1+o(1))n^{1/2}\).
  • Bose (1942): A similar construction from finite fields.
These explicit constructions match the counting upper bound. In fact the Erdős-Turán theorem (1941) sharpens the upper bound to \(|A| \leq n^{1/2} + O(n^{1/4})\), so the maximum size of a Sidon set in \(\{1, \ldots, n\}\) is \((1+o(1))n^{1/2}\).

The first moment method also interacts beautifully with structural graph theory. The following theorem shows how a greedy probabilistic idea implies a fundamental result about graph colouring.

Theorem (Degeneracy and chromatic number). If every subgraph of \(G\) has minimum degree at most \(2(k-1)\), then \(\chi(G) \leq 2k - 1\).
Proof. We say \(G\) is \(d\)-degenerate if every subgraph has minimum degree at most \(d\). If \(G\) is \(d\)-degenerate, we can order the vertices \(v_1, v_2, \ldots, v_n\) such that each \(v_i\) has at most \(d\) neighbours among \(v_1, \ldots, v_{i-1}\) (achieved by repeatedly extracting a minimum-degree vertex). Greedily colouring in this order, vertex \(v_i\) needs a colour different from at most \(d\) earlier neighbours, so \(d+1\) colours suffice. Taking \(d = 2(k-1)\), we get \(\chi(G) \leq 2k - 1\). \(\square\)
Remark. The contrapositive is the useful direction: if \(\chi(G) > 2k-1\), then \(G\) contains a subgraph with minimum degree at least \(2k-1\), hence average degree at least \(2k-1 > 2(k-1)\). Large chromatic number thus forces dense subgraphs — a theme central to extremal combinatorics, and one that recurs when probabilistic constructions of graphs with high chromatic number are combined with this greedy colouring argument.
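The ordering-plus-greedy procedure from the proof, as code (a sketch; names ours):

```python
def degeneracy_coloring(n, edges):
    """Extract a minimum-degree vertex repeatedly; reversing the extraction
    order gives an ordering in which every vertex has at most d earlier
    neighbours (d = degeneracy). Then colour greedily in that order."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    alive = set(range(n))
    extracted = []
    while alive:
        v = min(alive, key=lambda x: (len(adj[x] & alive), x))
        extracted.append(v)
        alive.remove(v)
    colour = {}
    for v in reversed(extracted):
        used = {colour[u] for u in adj[v] if u in colour}
        colour[v] = next(c for c in range(n + 1) if c not in used)
    return colour
```

On a \(d\)-degenerate graph the returned colouring uses at most \(d+1\) colours, since each vertex sees at most \(d\) previously coloured neighbours.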

Chapter 3: The Second Moment Method and Chebyshev’s Inequality

The first moment method tells us when something exists, but it cannot tell us whether a random variable is concentrated near its mean. The second moment method addresses exactly this: it provides a lower bound on \(\Pr[X > 0]\) in terms of the ratio \((\mathbb{E}[X])^2 / \mathbb{E}[X^2]\).

Chebyshev's Inequality. For any random variable \(X\) with finite variance, \[ \Pr[|X - \mathbb{E}[X]| \geq t] \leq \frac{\mathrm{Var}(X)}{t^2}. \]
Second Moment Method (Paley-Zygmund Inequality). For a non-negative integer-valued random variable \(X\), \[ \Pr[X > 0] \geq \frac{(\mathbb{E}[X])^2}{\mathbb{E}[X^2]}. \]
Proof. By Cauchy-Schwarz, \(\mathbb{E}[X] = \mathbb{E}[X \cdot \mathbf{1}_{X > 0}] \leq \sqrt{\mathbb{E}[X^2]} \cdot \sqrt{\Pr[X > 0]}\). Squaring and rearranging gives the result. \(\square\)

A key computation when applying the second moment method is expanding \(\mathbb{E}[X^2]\) using indicator random variables. If \(X = \sum_i X_i\) where \(X_i \in \{0,1\}\), then

\[ \mathbb{E}[X^2] = \sum_i \mathbb{E}[X_i^2] + 2\sum_{i < j}\mathbb{E}[X_i X_j] = \mathbb{E}[X] + \sum_{i \neq j} \Pr[X_i = 1, X_j = 1]. \]

The method succeeds — meaning \(\Pr[X > 0] \to 1\) — when the “off-diagonal” covariance terms are negligible relative to \((\mathbb{E}[X])^2\).

Remark (When Chebyshev is tight). Chebyshev's inequality \(\Pr[|X - \mu| \geq t] \leq \sigma^2/t^2\) is an equality for a three-point distribution: let \(X = \mu - t\) with probability \(\sigma^2/(2t^2)\), \(X = \mu + t\) with probability \(\sigma^2/(2t^2)\), and \(X = \mu\) with the remaining probability \(1 - \sigma^2/t^2\) (valid when \(\sigma \leq t\)). For this distribution, \(\Pr[|X - \mu| \geq t] = \sigma^2/t^2\) exactly. In combinatorial applications, Chebyshev is nearly tight when the random variable is "spread out" with many values contributing roughly equally to the variance — as happens near the threshold of a random graph property where both \(\{X = 0\}\) and \(\{X \gg 1\}\) have non-negligible probability.

A fundamental and illuminating application of the second moment method is determining the threshold for the presence of triangles in \(G(n,p)\).

Theorem (Threshold for triangles in \(G(n,p)\)). Let \(T\) denote the number of triangles in \(G(n,p)\).
  • If \(p = c/n\) with \(c > 0\) fixed, then \(\mathbb{E}[T] \to c^3/6\) as \(n \to \infty\).
  • If \(p \ll n^{-1}\), then \(\Pr[T \geq 1] \to 0\) (whp there are no triangles).
  • If \(p \gg n^{-1}\), then \(\Pr[T \geq 1] \to 1\) (whp there are triangles).
The threshold function is \(p^* = 1/n\).
Proof. Index triangles by 3-element subsets \(S \subseteq [n]\), and let \(I_S\) be the indicator that \(S\) is a triangle. Then \(T = \sum_{|S|=3} I_S\) and \[ \mathbb{E}[T] = \binom{n}{3} p^3 = \frac{n(n-1)(n-2)}{6} p^3 \xrightarrow{p = c/n} \frac{c^3}{6}. \] For the 0-statement: when \(p \ll 1/n\), we have \(\mathbb{E}[T] \ll 1\), so Markov's inequality gives \(\Pr[T \geq 1] \leq \mathbb{E}[T] \to 0\).

For the 1-statement, we apply the second moment method. When \(p \gg 1/n\), write \(\mathbb{E}[T^2] = \sum_{S, S'} \Pr[I_S = I_{S'} = 1]\). Partition by \(|S \cap S'|\):

  • \(|S \cap S'| = 0\) or \(1\): \(S\) and \(S'\) share no edge, so \(\Pr[I_S = I_{S'} = 1] = p^6\). There are \((1+o(1))\binom{n}{3}^2\) such pairs, contributing \((1+o(1))(\mathbb{E}[T])^2\) to \(\mathbb{E}[T^2]\).
  • \(|S \cap S'| = 2\): \(S\) and \(S'\) share one edge (and together span 4 vertices, 5 edges). There are \(O(n^4)\) such pairs, each with probability \(p^5\), contributing \(O(n^4 p^5)\). Since \(\mathbb{E}[T] = \Theta(n^3 p^3)\), this term is \(O(n^4 p^5) = O(\mathbb{E}[T]^2 / (n^2 p))\), which is \(o(\mathbb{E}[T]^2)\) when \(n^2 p \to \infty\), i.e., \(p \gg 1/n^2\).
  • \(|S \cap S'| = 3\): \(S = S'\), contributing \(\sum_S \Pr[I_S = 1] = \mathbb{E}[T]\), which is \(o((\mathbb{E}[T])^2)\) since \(\mathbb{E}[T] \to \infty\) when \(p \gg 1/n\).
Thus \(\mathbb{E}[T^2] = (1 + o(1))(\mathbb{E}[T])^2\) when \(p \gg n^{-1}\), and the Paley-Zygmund inequality gives \(\Pr[T > 0] \geq (\mathbb{E}[T])^2/\mathbb{E}[T^2] \to 1\). \(\square\)
Example (Threshold for \(k\)-cliques in \(G(n,p)\)). The same second moment argument generalises. Let \(X_k\) denote the number of copies of \(K_k\) in \(G(n,p)\). Then \[ \mathbb{E}[X_k] = \binom{n}{k} p^{\binom{k}{2}} \approx \frac{n^k}{k!} p^{k(k-1)/2}. \] This is \(\Theta(1)\) when \(n^k p^{k(k-1)/2} = \Theta(1)\), i.e., when \(p = \Theta(n^{-2/(k-1)})\). So the threshold for \(K_k\) is \[ p^*(k) = n^{-2/(k-1)}. \]

Explicitly:

  • \(k = 3\) (triangles): \(p^* = n^{-1}\). At \(p = c/n\), \(\mathbb{E}[X_3] \to c^3/6\) as computed above.
  • \(k = 4\) (4-cliques): \(p^* = n^{-2/3}\). At \(p = c \cdot n^{-2/3}\), \(\mathbb{E}[X_4] = \binom{n}{4}p^6 \approx \frac{n^4}{24} c^6 n^{-4} = c^6/24 = \Theta(1)\).
  • \(k = 5\) (5-cliques): \(p^* = n^{-1/2}\). At \(p = c \cdot n^{-1/2}\), \(\mathbb{E}[X_5] \approx \frac{n^5}{120} c^{10} n^{-5} = c^{10}/120\).
Below each threshold (in the sense \(p/p^*(k) \to 0\)), \(G(n,p)\) is whp \(K_k\)-free; above it, \(K_k\) appears whp. The second moment argument confirming the lower threshold (that \(X_k > 0\) whp) works identically to the triangle case: the dominant contribution to \(\mathbb{E}[X_k^2]/(\mathbb{E}[X_k])^2\) comes from disjoint pairs of cliques, and the "overlapping" contribution is lower-order.
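The expectation computations in these bullets are one-liners to verify with exact binomial coefficients:

```python
from math import comb

def expected_cliques(n, p, k):
    """E[number of K_k copies in G(n,p)] = C(n,k) * p^C(k,2)."""
    return comb(n, k) * p ** comb(k, 2)
```

At \(n = 10^6\): taking \(p = 3/n\) gives a triangle mean \(\approx 3^3/6 = 4.5\), and taking \(p = 2n^{-2/3}\) gives a \(K_4\) mean \(\approx 2^6/24 \approx 2.67\), matching the bullets with \(c = 3\) and \(c = 2\).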

The paradigmatic application of the second moment method is the analysis of clique numbers in random graphs.

Definition (Erdős-Rényi random graph). The random graph \(G(n,p)\) is the probability space on graphs with vertex set \([n] = \{1,\ldots,n\}\) where each of the \(\binom{n}{2}\) edges is included independently with probability \(p\).
Theorem (Clique number concentration in \(G(n,1/2)\)). Let \(k^* = k^*(n)\) be the largest integer \(k\) such that \(\binom{n}{k} 2^{-\binom{k}{2}} \geq 1\). Then \(\omega(G(n,1/2))\) is concentrated on at most two consecutive values \(\{k^*, k^*+1\}\) with high probability. Moreover, \(k^* = (2-o(1))\log_2 n\).
Proof sketch. Let \(X_k\) be the number of \(k\)-cliques in \(G(n,1/2)\). Then \(\mathbb{E}[X_k] = \binom{n}{k} 2^{-\binom{k}{2}}\). For the upper bound: if \(k = k^* + 2\), then \(\mathbb{E}[X_k] \to 0\), so by Markov, \(\Pr[\omega \geq k] \leq \mathbb{E}[X_k] \to 0\). For the lower bound: at \(k = k^* - 1\), compute \(\mathbb{E}[X_{k-1}^2]\) by summing \(\Pr[S \text{ and } T \text{ both cliques}]\) over pairs of \((k-1)\)-sets \(S, T\). The dominant contribution comes from pairs sharing \(j\) vertices. After careful bookkeeping, one shows \(\mathbb{E}[X_{k-1}^2] = (1+o(1))(\mathbb{E}[X_{k-1}])^2\), so the second moment method gives \(\Pr[X_{k-1} > 0] \to 1\). \(\square\)

This two-value concentration is a remarkable feature of the random graph: even though the number of cliques varies enormously, the maximum clique size is essentially determined. The proof reveals why the second moment method can fail: near \(k^*\) the variance blows up because cliques sharing many vertices create strong positive correlations.

Example (Computing \(k^*\) for \(G(n, 1/2)\) at \(n = 1024\)). With \(n = 1024 = 2^{10}\), we find \(k^*\) as the largest \(k\) such that \(\binom{n}{k} 2^{-\binom{k}{2}} \geq 1\). Using the approximation \(\binom{n}{k} \approx n^k/k!\) and Stirling's approximation \(k! \approx \sqrt{2\pi k}(k/e)^k\): \[ \frac{n^k}{k!} \cdot 2^{-k(k-1)/2} \approx \frac{(1024)^k}{\sqrt{2\pi k}(k/e)^k} \cdot 2^{-k(k-1)/2} = \frac{e^k \cdot 2^{10k}}{\sqrt{2\pi k} \cdot k^k \cdot 2^{k(k-1)/2}}. \] Setting this to 1 and taking logs: \[ k + 10k \ln 2 - k \ln k - \frac{k(k-1)}{2}\ln 2 \approx \frac{1}{2}\ln(2\pi k), \]\[ k\left(1 + 10\ln 2 - \ln k - \frac{(k-1)\ln 2}{2}\right) \approx 0. \]

The leading-order balance \(10\ln 2 \approx \frac{k \ln 2}{2}\) gives \(k \approx 20 = 2\log_2(1024)\), but at this modest \(n\) the lower-order terms are substantial: the \(-k\ln k\) term (the \(-2\log_2\log_2 n\) correction in the asymptotic formula) pushes the answer well below 20. Solving the exact condition \(\binom{1024}{k} 2^{-\binom{k}{2}} \geq 1\) numerically gives \(k^* = 15\), so the theorem says \(\omega(G(1024, 1/2))\) is concentrated on \(\{15, 16\}\). Asymptotically \(\omega(G(n,1/2)) = (2 - o(1))\log_2 n\), but the \(o(1)\) term decays slowly; at \(n = 1024\) the naive estimate \(2\log_2 n = 20\) overshoots by about a third.
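The defining condition for \(k^*\) can be checked exactly in integer arithmetic, with no Stirling approximation (helper name ours):

```python
from math import comb

def k_star(n):
    """Largest k with C(n,k) * 2^(-C(k,2)) >= 1, i.e. C(n,k) >= 2^(k(k-1)/2).

    The ratio E[X_{k+1}]/E[X_k] = (n-k)/((k+1) 2^k) is decreasing in k, so
    {k : E[X_k] >= 1} is an interval starting at k = 1 and a linear scan
    finds its right endpoint.
    """
    k = 1
    while comb(n, k + 1) >= 2 ** comb(k + 1, 2):
        k += 1
    return k
```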

Contrast with the greedy lower bound: greedily build a clique by adding vertices that are adjacent to all current clique members. With \(n = 1024\) and \(p = 1/2\), after choosing \(j\) vertices, the next vertex is in the common neighbourhood of all \(j\) with probability \((1/2)^j\), so the expected number of candidates among the remaining \(n-j\) vertices is \((n-j)/2^j\). This stays positive as long as \(2^j \ll n\), giving a greedy clique of size \(\approx \log_2 n = 10\), asymptotically only half of the true clique number \((2-o(1))\log_2 n\). The gap between the greedy bound and the true clique number is an important open problem (the “algorithmic” clique problem: can we find a \((1+\varepsilon)\log_2 n\) clique efficiently?).

Remark (The algorithmic clique problem). The two-value concentration theorem tells us that \(\omega(G(n,1/2)) \approx 2\log_2 n\) with high probability. Yet no polynomial-time algorithm is known to find a clique of size \((1+\varepsilon)\log_2 n\), despite the fact that such cliques exist with overwhelming probability. This is the "planted clique" problem: if a random \(k\)-clique is hidden in \(G(n,1/2)\), detecting it is believed to be computationally hard for \(k \ll \sqrt{n}\). The problem is connected to the hardness of various statistical estimation tasks and serves as a canonical "computational hardness" assumption in average-case complexity. Spectral methods can detect planted cliques of size \(k = \Omega(\sqrt{n})\), but nothing better is known unconditionally.

The second moment method is also the natural tool for analyzing threshold phenomena in \(G(n,p)\).

Definition (Threshold function). For a monotone graph property \(\mathcal{P}\), a function \(p^*(n)\) is a threshold if: \(\Pr[G(n,p) \in \mathcal{P}] \to 0\) when \(p/p^* \to 0\), and \(\to 1\) when \(p/p^* \to \infty\).
Theorem (Connectivity threshold). The threshold for connectivity of \(G(n,p)\) is \(p^* = \ln n / n\). More precisely, if \(p = (\ln n + c)/n\) then \(\Pr[G(n,p) \text{ is connected}] \to e^{-e^{-c}}\); in particular, the probability tends to 1 if \(c \to +\infty\) and to 0 if \(c \to -\infty\).
Proof sketch. The dominant obstruction to connectivity is isolated vertices. Let \(X\) be the number of isolated vertices. Then \(\mathbb{E}[X] = n(1-p)^{n-1} \approx n e^{-pn} \approx e^{-c}\). Two distinct vertices are both isolated with probability \((1-p)^{2n-3}\): the edge between them and all \(2(n-2)\) edges to the other vertices must be absent. A short calculation then shows that the factorial moments of \(X\) converge to those of a Poisson distribution with mean \(e^{-c}\), so \(X \to \mathrm{Poisson}(e^{-c})\) in distribution and \(\Pr[X = 0] \to e^{-e^{-c}}\). A separate argument shows that whp the graph has no component of size between 2 and \(n/2\), so the graph is connected exactly when there are no isolated vertices, giving the stated limit. \(\square\)

This is a prototypical example of a sharp threshold: the transition from disconnected to connected happens in an interval of width \(O(1/n)\) around \(\ln n / n\), which is much narrower than the threshold itself.

Remark (The Poisson paradigm and the connectivity threshold). The connectivity proof illustrates the "Poisson paradigm": for a sum \(X = \sum_i I_i\) of weakly dependent Bernoulli variables with small mean, the distribution of \(X\) converges to a Poisson distribution with the same mean. Here \(X\) is the number of isolated vertices and \(\mathbb{E}[X] \to e^{-c}\). Two given vertices are both isolated with probability \((1-p)^{2n-3} \approx e^{-2pn} = n^{-2}e^{-2c}\), which is asymptotically the product of the individual probabilities, so the events are nearly independent and \(X \to \mathrm{Poisson}(e^{-c})\); this can be made rigorous via the method of moments or the Chen-Stein method.

The key point is that connectivity is governed by isolated vertices: once there are no isolated vertices (which happens whp for \(c \to +\infty\)), the graph is whp connected. The reason is that for \(p = (\ln n + O(1))/n\), whp every component of \(G(n,p)\) other than the giant component is a single isolated vertex — larger small components are rarer still. This "no small components" fact has a separate first-moment proof: a component on \(s\) vertices contains a spanning tree and sends no edges to the rest of the graph, so the expected number of components of size \(s\) with \(2 \leq s \leq n/2\) is at most \(\binom{n}{s} s^{s-2} p^{s-1} (1-p)^{s(n-s)}\), and the sum of these expectations tends to 0 in this range of \(p\).

Example (Connectivity threshold — numerical illustration). For \(n = 1000\) and \(p = (\ln n + c)/n = (6.91 + c)/1000\):
  • \(c = 0\): \(\Pr[\text{connected}] \to e^{-e^0} = e^{-1} \approx 0.368\). The graph is disconnected about 63% of the time.
  • \(c = 2\): \(\Pr[\text{connected}] \to e^{-e^{-2}} = e^{-0.135} \approx 0.874\). Connected about 87% of the time.
  • \(c = -2\): \(\Pr[\text{connected}] \to e^{-e^{2}} = e^{-7.39} \approx 0.00062\). Almost certainly disconnected (expected \(e^2 \approx 7.4\) isolated vertices).
  • \(c = 5\): \(\Pr[\text{connected}] \to e^{-e^{-5}} = e^{-0.0067} \approx 0.993\). Connected with probability exceeding 99%.
These values show the sharp transition: the probability rises from nearly 0 to nearly 1 over a window of \(c \in [-5, 5]\), i.e., \(p \in [(6.91-5)/1000, (6.91+5)/1000] = [0.00191, 0.01191]\) — an interval of width \(\approx 0.01\) around the threshold \(p^* = \ln(1000)/1000 \approx 0.00691\).
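These limiting values are easy to check numerically. The sketch below (function names are ours, not from the text) computes \(e^{-e^{-c}}\) and compares it with the empirical connectivity frequency of sampled \(G(n,p)\), using a union-find over the sampled edges; even at \(n = 300\) the agreement is already reasonably close.

```python
import math
import random

def connect_prob_limit(c):
    """Limiting probability that G(n, (ln n + c)/n) is connected, as n -> infinity."""
    return math.exp(-math.exp(-c))

def is_connected(n, p, rng):
    """Sample G(n, p) and test connectivity via union-find over the sampled edges."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    components = n
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                ru, rv = find(u), find(v)
                if ru != rv:
                    parent[ru] = rv
                    components -= 1
    return components == 1

rng = random.Random(0)
n, c = 300, 2.0
p = (math.log(n) + c) / n
empirical = sum(is_connected(n, p, rng) for _ in range(80)) / 80
print(f"limit: {connect_prob_limit(c):.3f}, empirical (n={n}): {empirical:.2f}")
```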

Chapter 4: The Lovász Local Lemma

The union bound \(\Pr[\bigcup_i A_i] \leq \sum_i \Pr[A_i]\) is a blunt instrument: it gives good estimates only when the total probability is small. But many problems involve a large number of individually rare bad events, each pair of which is nearly independent. In such cases, one might expect that all bad events can be simultaneously avoided even though the union bound fails. The Lovász Local Lemma (LLL) makes this intuition precise.

Definition (Dependency graph). Given events \(A_1, \ldots, A_n\), a graph \(H\) on vertex set \([n]\) is a dependency graph if each \(A_i\) is mutually independent of all events \(\{A_j : j \notin \Gamma(i) \cup \{i\}\}\), where \(\Gamma(i)\) is the neighbourhood of \(i\) in \(H\).
Lovász Local Lemma (Symmetric version, Erdős-Lovász 1975). Let \(A_1, \ldots, A_n\) be events in a probability space with dependency graph \(H\). Suppose each event has probability at most \(p\), and each vertex in \(H\) has degree at most \(d\). If \[ ep(d+1) \leq 1, \] then \(\Pr\!\left[\bigcap_{i=1}^n \overline{A_i}\right] > 0\).
Lovász Local Lemma (General / Asymmetric version). Let \(A_1, \ldots, A_n\) be events with dependency graph \(H\). Suppose there exist values \(x_1, \ldots, x_n \in [0,1)\) such that for each \(i\), \[ \Pr[A_i] \leq x_i \prod_{j \in \Gamma(i)} (1 - x_j). \] Then \[ \Pr\!\left[\bigcap_{i=1}^n \overline{A_i}\right] \geq \prod_{i=1}^n (1-x_i) > 0. \]
Proof sketch. The key claim is: for any \(i\) and any set \(S \subseteq [n] \setminus \{i\}\), \[ \Pr\!\left[A_i \,\Big|\, \bigcap_{j \in S} \overline{A_j}\right] \leq x_i. \] This is proved by induction on \(|S|\). The base case \(S = \emptyset\) is immediate from the hypothesis. For the inductive step, partition \(S\) into \(S_1 = S \cap \Gamma(i)\) and \(S_2 = S \setminus \Gamma(i)\), and write the conditional probability as a quotient: \[ \Pr\!\left[A_i \,\Big|\, \bigcap_{j \in S} \overline{A_j}\right] = \frac{\Pr\!\left[A_i \cap \bigcap_{j \in S_1} \overline{A_j} \,\Big|\, \bigcap_{j \in S_2}\overline{A_j}\right]}{\Pr\!\left[\bigcap_{j \in S_1} \overline{A_j} \,\Big|\, \bigcap_{j \in S_2}\overline{A_j}\right]}. \]

The numerator is at most \(\Pr[A_i \mid \bigcap_{j \in S_2}\overline{A_j}] = \Pr[A_i] \leq x_i \prod_{j \in \Gamma(i)}(1-x_j)\), since \(A_i\) is mutually independent of the events indexed by \(S_2\). The denominator is at least \(\prod_{j \in S_1}(1-x_j)\): expand it one event at a time and apply the induction hypothesis to each factor. Dividing gives the claim. The lower bound on \(\Pr[\bigcap \overline{A_i}]\) follows by writing it as a telescoping product of conditional probabilities, each at least \(1 - x_i\). \(\square\)

The LLL has numerous striking applications. Before stating the most classical one, let us work out how the asymmetric LLL is applied in practice.

Example (Applying the asymmetric LLL). Consider a random \(k\)-colouring of the edges of the complete graph \(K_n\) with \(k\) colours. We want to avoid monochromatic triangles. Let \(A_{ijk}\) be the event that triangle \(\{i,j,k\}\) is monochromatic (all three edges the same colour). Then \(\Pr[A_{ijk}] = k \cdot (1/k)^3 = 1/k^2\).

The dependency graph: \(A_{ijk}\) depends on \(A_{i'j'k'}\) only if they share an edge, i.e., \(|\{i,j,k\} \cap \{i',j',k'\}| \geq 2\). Each triangle shares an edge with \(3(n-3)\) other triangles (choose the shared edge in 3 ways, then pick any of the remaining \(n-3\) vertices). So \(d \leq 3(n-3) \leq 3n\).

Symmetric LLL condition: \(e \cdot (1/k^2) \cdot (3n+1) \leq 1\), which gives \(k^2 \geq 3en\), i.e., \(k \geq \sqrt{3en}\). So for \(k = \lceil\sqrt{3en}\rceil\) colours, there exists a colouring of \(K_n\) with no monochromatic triangle. This is a remarkably clean result: the LLL certifies that \(\Theta(\sqrt{n})\) colours suffice for triangle-free edge colouring.

Using the asymmetric LLL with the uniform choice \(x_i = 1/(d+1)\) recovers the symmetric condition; tuning the weights \(x_i\) against the precise dependency structure (here, weights of order \(1/(3n)\)) can often squeeze slightly better constants.
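The colour count certified by the symmetric condition \(e \cdot (1/k^2) \cdot (d+1) \leq 1\) with \(d = 3(n-3)\) is easy to tabulate. A minimal sketch (the helper name is ours):

```python
import math

def lll_colour_bound(n):
    """Smallest k for which the symmetric LLL condition e * (1/k^2) * (d+1) <= 1
    holds for the monochromatic-triangle events in a random k-colouring of K_n,
    using the dependency-degree bound d <= 3(n-3)."""
    d = 3 * (n - 3)
    k = math.isqrt(math.ceil(math.e * (d + 1)))  # start near sqrt(e(d+1))
    while math.e * (d + 1) > k * k:
        k += 1
    return k

for n in (100, 1000, 10000):
    print(n, lll_colour_bound(n))  # grows like sqrt(3en)
```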

The first, and most classical, application is satisfiability of \(k\)-CNF formulas.

Theorem (LLL for \(k\)-SAT). Let \(\Phi\) be a \(k\)-CNF formula (each clause has exactly \(k\) literals). If each clause shares a variable with at most \(2^k / e - 1\) other clauses, then \(\Phi\) is satisfiable.
Proof. Assign each variable independently and uniformly at random in \(\{0,1\}\). Let \(A_i\) be the event that clause \(i\) is not satisfied. Then \(\Pr[A_i] = 2^{-k}\). Take the dependency graph in which \(A_i\) is adjacent to \(A_j\) exactly when the two clauses share a variable; by hypothesis its maximum degree satisfies \(d \leq 2^k/e - 1\). The symmetric LLL condition then holds: \(ep(d+1) \leq e \cdot 2^{-k} \cdot 2^k/e = 1\). Hence with positive probability every clause is satisfied, so a satisfying assignment exists. \(\square\)

The LLL also gives beautiful results in graph colouring and combinatorial geometry. In routing and scheduling, it provides the existence of efficient protocols:

Theorem (Packet routing, Leighton-Maggs-Rao). In any network with \(n\) nodes where each packet must traverse a path of length at most \(L\) and the maximum congestion (packets per edge) is at most \(C\), there exists a routing schedule (with queuing) of makespan \(O(C + L \log n)\).

The proof introduces an artificial random delay for each packet, defines bad events for edge congestion exceeding a threshold, and applies the LLL to show all bad events can be simultaneously avoided. The \(\log n\) factor arises from the LLL condition.

Example (LLL for graph colouring). The LLL gives a simple colouring result: colour each vertex of \(G\) independently and uniformly with one of \(C\) colours; the bad events are the monochromatic edges, each of probability \(1/C\), and each edge event depends on at most \(2(\Delta - 1)\) others (the edges sharing an endpoint). The symmetric condition \(e(1/C)(2\Delta - 1) \leq 1\) holds for \(C \geq \lceil 2e\Delta \rceil\), so \(\chi(G) \leq \lceil 2e\Delta \rceil\). This is worse than the greedy bound \(\chi(G) \leq \Delta + 1\), but the LLL approach generalises to list colouring and defective colouring.

More usefully, LLL-based arguments give genuinely better bounds for graphs without short cycles, where a random colouring has much more local slack. The cleanest statement is for triangle-free graphs:

Theorem (Johansson 1996; sharpened by Molloy 2019). Every triangle-free graph \(G\) with maximum degree \(\Delta\) satisfies \(\chi(G) = O(\Delta / \log \Delta)\).

The proof uses the LLL in a sophisticated “semi-random” setting: vertices are coloured from a list greedily, with random choices, and bad events correspond to conflicts or exhausted lists. The key insight is that triangle-freedom prevents “domino” cascades where one conflict triggers many others, allowing the LLL condition to be satisfied with a list size of \(O(\Delta/\log\Delta)\).

One of the most important recent developments in the theory is the constructive LLL of Moser and Tardos (2010), which transformed probabilistic existence results into efficient algorithms.

Theorem (Moser-Tardos 2010 — Constructive LLL). Suppose the events \(A_1, \ldots, A_n\) are determined by independent random variables \(X_1, \ldots, X_m\), and the LLL conditions hold with values \(x_1, \ldots, x_n \in [0,1)\). Then the following randomized algorithm finds an assignment avoiding all \(A_i\): sample all variables; while some event \(A_i\) is violated, pick one (arbitrarily) and resample the variables it depends on. The expected total number of resampling steps is at most \(\sum_{i=1}^n x_i/(1-x_i)\).

The analysis proceeds through the notion of witness trees: each resampling step can be “charged” to a witness tree in a forest whose total expected size is bounded by the LLL parameters. The Moser-Tardos theorem was a breakthrough because it showed that the non-constructive LLL could, in essentially all combinatorial applications, be made polynomial-time algorithmic.

Remark (LLL for \(k\)-SAT: a worked example). Let us make the LLL application to \(k\)-SAT fully concrete. Consider a \(k\)-CNF formula \(\Phi\) on variables \(x_1, \ldots, x_n\) with \(m\) clauses \(C_1, \ldots, C_m\), each clause containing exactly \(k\) distinct literals. Assign each variable independently and uniformly in \(\{0,1\}\). Let \(A_i\) be the bad event that clause \(C_i\) is violated (all \(k\) literals false). Since each literal is false with probability \(1/2\) and the literals in a clause involve distinct variables, \(\Pr[A_i] = 2^{-k}\). Define the dependency graph: \(A_i\) and \(A_j\) are adjacent if \(C_i\) and \(C_j\) share at least one variable. Each clause uses \(k\) variables, each appearing in at most \(f\) clauses, so each \(A_i\) is adjacent to at most \(k(f-1) \leq kf\) other events.

The symmetric LLL condition \(ep(d+1) \leq 1\) becomes \(e \cdot 2^{-k} \cdot (d+1) \leq 1\), i.e., \(d \leq 2^k/e - 1\). For \(k = 3\): \(d \leq 8/e - 1 \approx 1.94\), so a 3-CNF formula in which each clause shares a variable with at most one other clause is satisfiable — a modest conclusion, reflecting how weak the bound is at \(k = 3\). In terms of the variable frequency \(f\), using \(d \leq k(f-1)\), the condition becomes \(f \leq 2^k/(ek) + O(1)\), which is nontrivial once \(k \geq 5\). The bound \(2^k/e\) on the number of clauses sharing a variable with a given clause is the classical LLL \(k\)-SAT threshold.

For comparison, the asymmetric LLL with optimised weights (e.g. \(x_i = 1/(d+1)\)) tolerates slightly more dependencies. And the Moser-Tardos algorithm (2010) finds a satisfying assignment using at most \(\sum_i x_i/(1-x_i) = O(m)\) resamplings in expectation, which is polynomial whenever the LLL condition holds.
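The Moser-Tardos resampling loop is short enough to state as code. The sketch below (names and the toy instance are ours) resamples an arbitrary violated clause until none remains; on instances satisfying the LLL condition the expected number of resamplings is small.

```python
import random

def moser_tardos_sat(clauses, n_vars, seed=0, max_rounds=100000):
    """Moser-Tardos resampling for k-SAT: repeatedly pick a violated clause and
    resample its variables uniformly; returns a satisfying assignment.
    A clause is a tuple of non-zero ints: literal v means x_v True, -v means False."""
    rng = random.Random(seed)
    assign = [rng.random() < 0.5 for _ in range(n_vars + 1)]  # 1-indexed; slot 0 unused

    def violated(clause):
        return all(assign[abs(l)] != (l > 0) for l in clause)  # every literal false

    for _ in range(max_rounds):
        bad = [c for c in clauses if violated(c)]
        if not bad:
            return assign
        for lit in bad[0]:  # resample the variables of one violated clause
            assign[abs(lit)] = rng.random() < 0.5
    raise RuntimeError("did not converge")

# A small 3-CNF in which each clause shares variables with few others.
clauses = [(1, 2, 3), (-3, 4, 5), (-5, 6, 7), (-7, -1, 8)]
assign = moser_tardos_sat(clauses, n_vars=8)
print(all(any(assign[abs(l)] == (l > 0) for l in c) for c in clauses))
```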


Chapter 5: Correlation Inequalities

The tools we have seen so far — first and second moments, the LLL — treat events primarily as obstacles to be avoided or as contributions to an expectation. Correlation inequalities instead describe the qualitative structure of dependencies between events. The central question is: when do two events tend to occur together (positive correlation), and when does knowing one occurred make the other less likely (negative correlation)?

Definition (Monotone events). In a product probability space \(\{0,1\}^n\) with any product measure, an event \(A\) is increasing (or monotone) if \(\omega \in A\) and \(\omega' \geq \omega\) coordinate-wise implies \(\omega' \in A\). It is decreasing if the complement is increasing.
Theorem (Harris 1960; FKG Inequality, Fortuin-Kasteleyn-Ginibre 1971). Let \(\mu\) be a product measure on \(\{0,1\}^n\) (or more generally, a measure on a distributive lattice satisfying the log-supermodularity condition \(\mu(x \vee y)\mu(x \wedge y) \geq \mu(x)\mu(y)\)). Then for any two increasing events \(A\) and \(B\), \[ \Pr[A \cap B] \geq \Pr[A] \cdot \Pr[B]. \] Equivalently, \(\mathrm{Cov}(\mathbf{1}_A, \mathbf{1}_B) \geq 0\).
Proof sketch (product measure case). By induction on \(n\). For \(n=1\), use the identity \[ \mathbb{E}_\mu[\mathbf{1}_A \mathbf{1}_B] - \mathbb{E}_\mu[\mathbf{1}_A]\mathbb{E}_\mu[\mathbf{1}_B] = \frac{1}{2}\sum_{x,y \in \{0,1\}} \mu(x)\mu(y)\,(\mathbf{1}_A(x)-\mathbf{1}_A(y))(\mathbf{1}_B(x)-\mathbf{1}_B(y)) \geq 0, \] where the summand is nonnegative because monotonicity forces the two factors to have the same sign (Chebyshev's sum inequality). For the inductive step, condition on the last coordinate \(X_n\): both events remain increasing after conditioning, the conditional covariance is nonnegative by the \((n-1)\)-dimensional case, and the remaining one-dimensional covariance over \(X_n\) is nonnegative by the base case. \(\square\)

The FKG inequality has striking applications in percolation theory and random graph theory. In bond percolation on a graph \(G\), each edge is open independently with probability \(p\). The event “vertices \(u\) and \(v\) are connected by an open path” is increasing in the edge set. Therefore, by FKG, any two connectivity events are positively correlated — information that open paths exist elsewhere can only help (or at worst, not affect) the probability of a given connectivity event.

Remark (Historical context of correlation inequalities). The FKG inequality was proved in 1971 by Fortuin, Kasteleyn, and Ginibre in the context of Ising and Potts models in statistical mechanics. The Ising model on a graph \(G\) assigns spins \(\sigma_v \in \{-1,+1\}\) to vertices with Boltzmann weight proportional to \(\exp(\beta \sum_{\{u,v\}\in E} \sigma_u \sigma_v)\); FKG shows that all "increasing" observables (those that increase when spins flip from \(-1\) to \(+1\)) are positively correlated. Harris (1960) proved the special case for percolation (the "Harris-FKG inequality"), motivating the general theorem.

The physical meaning: in a ferromagnet (large \(\beta\)), if we know region \(A\) has all spins up, this makes it more likely that nearby region \(B\) also has spins up — a positive correlation. The FKG inequality formalises this as a purely probabilistic statement about product measures on distributive lattices, stripping away the physics. The connection to combinatorics: FKG applies to the random set model (each element included independently), which is the setting for most probabilistic combinatorics. The positive correlation of monotone events is used repeatedly — for instance, any two increasing events in \(G(n,p)\), such as containing a large matching and containing a large clique, are positively correlated; likewise any two decreasing events, such as events about large independent sets, are positively correlated with each other.

Theorem (Kleitman's Lemma). Let \(\mathcal{A}, \mathcal{B} \subseteq 2^{[n]}\) be two families of subsets of \([n]\) that are both monotone increasing (or both monotone decreasing). Then \[ |\mathcal{A} \cap \mathcal{B}| \cdot 2^n \geq |\mathcal{A}| \cdot |\mathcal{B}|. \]

This is a combinatorial restatement of the FKG inequality for the uniform measure on the hypercube, and it is often proved by a direct switching (injection) argument. The lemma is useful in extremal set theory and in bounding inclusion probabilities.
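Kleitman's inequality can be verified by brute force on a small hypercube. The sketch below (helper names are ours) generates random increasing families as upward closures of a few generator sets, encoded as bitmasks, and checks \(|\mathcal{A} \cap \mathcal{B}| \cdot 2^n \geq |\mathcal{A}| \cdot |\mathcal{B}|\) for each pair:

```python
import random

N = 4
CUBE = list(range(1 << N))  # subsets of [4] encoded as bitmasks

def up_closure(generators):
    """Smallest monotone increasing family containing the given sets (bitmasks)."""
    return {s for s in CUBE if any(s & g == g for g in generators)}

rng = random.Random(1)
for _ in range(200):
    A = up_closure(rng.sample(CUBE, 3))
    B = up_closure(rng.sample(CUBE, 3))
    # Kleitman's lemma: |A ∩ B| * 2^n >= |A| * |B| for increasing families
    assert len(A & B) * (1 << N) >= len(A) * len(B)
print("Kleitman's inequality verified on 200 random increasing families")
```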

The Four Functions Theorem of Ahlswede and Daykin (1978) generalizes FKG substantially:

Theorem (Ahlswede-Daykin 1978). Let \(f_1, f_2, f_3, f_4: 2^{[n]} \to \mathbb{R}_{\geq 0}\) satisfy \(f_1(A)f_2(B) \leq f_3(A \cup B)f_4(A \cap B)\) for all \(A, B \subseteq [n]\). Then \[ \left(\sum_A f_1(A)\right)\!\left(\sum_B f_2(B)\right) \leq \left(\sum_A f_3(A)\right)\!\left(\sum_B f_4(B)\right). \]

For negative correlation, the BK inequality (van den Berg-Kesten) provides a complementary picture: for two increasing events in a product space, the probability of their “disjoint occurrence” (witnessing each event on disjoint edge sets) is at most the product of their individual probabilities. It is an essential decoupling tool in percolation theory.

Example (FKG in bond percolation). Consider bond percolation on the \(n \times n\) grid graph \(\mathbb{Z}_n^2\) where each edge is open independently with probability \(p\). Let \(A\) be the event that there is a left-to-right crossing (an open path from the left column to the right column) and \(B\) the event that there is a top-to-bottom crossing. Both \(A\) and \(B\) are increasing events (adding open edges can only help create a crossing). By FKG: \(\Pr[A \cap B] \geq \Pr[A]\Pr[B]\).

At the self-dual point \(p = 1/2\), symmetry of the grid (and a duality argument) shows \(\Pr[A] = \Pr[B]\). Also, by the planar duality of the square lattice, the complement of a left-right crossing in the primal graph corresponds to a top-bottom crossing in the dual graph. This forces \(\Pr[A] \geq 1/2\) at \(p = 1/2\) (since the event “no left-right crossing” is equivalent to a dual crossing, which has the same probability). Combined with FKG, one gets \(\Pr[A \cap B] \geq 1/4\), meaning both a horizontal and vertical crossing occur simultaneously with positive probability at \(p = 1/2\). This is the foundation of the Russo-Seymour-Welsh (RSW) theory of percolation.
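The positive correlation of the two crossing events is visible in simulation. A minimal sketch (the grid encoding and names are ours): sample bond percolation on a small square grid at \(p = 1/2\), detect crossings by BFS over open edges, and compare \(\Pr[A \cap B]\) with \(\Pr[A]\Pr[B]\).

```python
import random
from collections import deque

def sample_open_edges(m, p, rng):
    """Bond percolation on the m x m grid: each edge open independently with prob p."""
    horiz = [[rng.random() < p for _ in range(m - 1)] for _ in range(m)]  # (r,c)-(r,c+1)
    vert = [[rng.random() < p for _ in range(m)] for _ in range(m - 1)]   # (r,c)-(r+1,c)
    return horiz, vert

def has_crossing(m, horiz, vert, left_right):
    """BFS over open edges from one side of the grid to the opposite side."""
    if left_right:
        start = [(r, 0) for r in range(m)]
        done = lambda r, c: c == m - 1
    else:
        start = [(0, c) for c in range(m)]
        done = lambda r, c: r == m - 1
    seen, queue = set(start), deque(start)
    while queue:
        r, c = queue.popleft()
        if done(r, c):
            return True
        nbrs = []
        if c + 1 < m and horiz[r][c]: nbrs.append((r, c + 1))
        if c - 1 >= 0 and horiz[r][c - 1]: nbrs.append((r, c - 1))
        if r + 1 < m and vert[r][c]: nbrs.append((r + 1, c))
        if r - 1 >= 0 and vert[r - 1][c]: nbrs.append((r - 1, c))
        for x in nbrs:
            if x not in seen:
                seen.add(x)
                queue.append(x)
    return False

rng = random.Random(0)
m, trials = 8, 600
cnt_a = cnt_b = cnt_ab = 0
for _ in range(trials):
    horiz, vert = sample_open_edges(m, 0.5, rng)
    a = has_crossing(m, horiz, vert, True)   # left-right crossing
    b = has_crossing(m, horiz, vert, False)  # top-bottom crossing
    cnt_a += a; cnt_b += b; cnt_ab += a and b
pa, pb, pab = cnt_a / trials, cnt_b / trials, cnt_ab / trials
print(f"P[A]={pa:.2f} P[B]={pb:.2f} P[A and B]={pab:.2f} >= P[A]P[B]={pa * pb:.2f}")
```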

Remark (Negative correlations and the BK inequality). Increasing events are always positively correlated by FKG, but one sometimes needs to bound probabilities of disjoint occurrences — where two events are "witnessed" by disjoint sets of edges. The BK inequality states: for increasing events \(A\) and \(B\) in a product space, \(\Pr[A \circ B] \leq \Pr[A]\Pr[B]\), where \(A \circ B\) denotes the event that \(A\) and \(B\) occur "disjointly" (there exist disjoint subsets of coordinates witnessing each event). Van den Berg and Kesten (1985) proved this for increasing events and conjectured it in general; Reimer (2000) proved the general form \(\Pr[A \circ B] \leq \Pr[A]\Pr[B]\) for arbitrary events. It is a basic decoupling tool in percolation theory — for instance, in bounding the number of disjoint open crossings of a region — and it appears in the analysis of noise sensitivity of percolation (see Garban-Steif).

Chapter 6: Exponential Concentration Inequalities

Chebyshev’s inequality gives a tail bound that decays only as \(1/t^2\). In many combinatorial applications — particularly those involving sums of many independent random variables or Lipschitz functions of independent inputs — one needs tails that decay exponentially in \(t\). This chapter develops the key exponential concentration tools.

Theorem (Chernoff Bounds). Let \(X = X_1 + \cdots + X_n\) where \(X_i\) are independent Bernoulli random variables (possibly with different parameters), and let \(\mu = \mathbb{E}[X]\). Then for \(\delta > 0\): \[ \Pr[X \geq (1+\delta)\mu] \leq \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu \leq e^{-\mu \delta^2/3} \quad (\delta \leq 1), \] \[ \Pr[X \leq (1-\delta)\mu] \leq e^{-\mu\delta^2/2}. \]
Proof. For the upper tail, apply Markov's inequality to the moment generating function: for any \(t > 0\), \[ \Pr[X \geq (1+\delta)\mu] = \Pr[e^{tX} \geq e^{t(1+\delta)\mu}] \leq \frac{\mathbb{E}[e^{tX}]}{e^{t(1+\delta)\mu}} = \frac{\prod_{i=1}^n \mathbb{E}[e^{tX_i}]}{e^{t(1+\delta)\mu}}. \] Using \(\mathbb{E}[e^{tX_i}] = 1 - p_i + p_i e^t \leq e^{p_i(e^t-1)}\), the product telescopes to \(e^{\mu(e^t-1)}\). Setting \(t = \ln(1+\delta)\) minimizes the upper bound and gives the stated result. \(\square\)

Chernoff bounds are powerful but require independence. A far more flexible tool, applicable to functions of independent variables, is the Azuma-Hoeffding martingale inequality.

Theorem (Azuma-Hoeffding Inequality). Let \(X_0, X_1, \ldots, X_n\) be a martingale with \(|X_k - X_{k-1}| \leq c_k\) for all \(k\). Then for any \(t > 0\), \[ \Pr[|X_n - X_0| \geq t] \leq 2\exp\!\left(\frac{-t^2}{2\sum_{k=1}^n c_k^2}\right). \]

A particularly important consequence is the bounded differences inequality:

Theorem (McDiarmid's Bounded Differences Inequality). Let \(X_1, \ldots, X_n\) be independent random variables, and let \(f(x_1,\ldots,x_n)\) be a function satisfying \[ \sup_{x_1,\ldots,x_n,\,x_i'} |f(x_1,\ldots,x_i,\ldots,x_n) - f(x_1,\ldots,x_i',\ldots,x_n)| \leq c_i \] for each \(i\). Then \(\Pr[|f(X_1,\ldots,X_n) - \mathbb{E}[f]| \geq t] \leq 2\exp\!\left(-2t^2/\sum c_i^2\right).\)
Proof sketch. Define the Doob martingale \(Z_k = \mathbb{E}[f(X_1,\ldots,X_n) \mid X_1,\ldots,X_k]\) with \(Z_0 = \mathbb{E}[f]\) and \(Z_n = f\). The bounded differences condition implies \(|Z_k - Z_{k-1}| \leq c_k\), so Azuma's inequality applies. \(\square\)

A key application is concentration of the chromatic number of a random graph.

Theorem (Chromatic number concentration, Shamir-Spencer 1987). The chromatic number \(\chi(G(n,1/2))\) is concentrated on an interval of length \(O(\sqrt{n})\): for any function \(\omega(n) \to \infty\), whp \(|\chi(G(n,1/2)) - \mathbb{E}[\chi]| \leq \omega(n)\sqrt{n}\).
Proof sketch. Expose the vertices one at a time rather than the edges: let \(Z_k = \mathbb{E}[\chi \mid \text{the edges within the first } k \text{ vertices}]\). This is a Doob martingale with \(n\) steps, and changing the edges at a single vertex changes \(\chi\) by at most 1 (that vertex can always receive a fresh colour), so \(|Z_k - Z_{k-1}| \leq 1\). Azuma's inequality gives \[ \Pr[|\chi - \mathbb{E}[\chi]| \geq t] \leq 2\exp\!\left(\frac{-t^2}{2n}\right), \] which is \(o(1)\) for \(t = \omega(n)\sqrt{n}\). (Edge exposure, with \(\binom{n}{2}\) steps of bounded difference 1, would only give a trivial window of order \(n\).) \(\square\)

For sharper concentration, one needs Talagrand’s inequality, which accounts for the “certifiability” structure of combinatorial properties.

Theorem (Talagrand's Inequality, certifiable version). Let \(X = (X_1,\ldots,X_n)\) be a vector of independent random variables. For a set \(A\) of configurations, let \[ d_H(x, A) = \min_{y \in A} |\{i : x_i \neq y_i\}| \] denote the Hamming distance from \(x\) to \(A\). If membership in \(A\) is \(r\)-certifiable (it can be witnessed by the values of \(r\) coordinates), then for any \(t \geq 0\), \[ \Pr[X \in A] \cdot \Pr[d_H(X, A) \geq t] \leq e^{-t^2/(4r)}. \]

In particular, if \(\Pr[X \in A] \geq 1/2\), then \(\Pr[d_H(X,A) \geq t] \leq 2e^{-t^2/(4r)}\).

Talagrand’s inequality gives dramatically tighter bounds than Azuma for combinatorial random variables that are “certifiable.” For example, the event that a random permutation has an increasing subsequence of length \(\ell\) is certified by the \(\ell\) positions of one such subsequence, so the longest increasing subsequence is highly certifiable, and Talagrand concentrates it in a window of order \(n^{1/4}\) around its median \(\sim 2\sqrt{n}\) — far smaller than the \(O(\sqrt{n})\) window Azuma provides. (The true order of fluctuation is \(n^{1/6}\), by Baik-Deift-Johansson.) Certifiability-based concentration was also a key ingredient in the Kim-Vu work on concentration of graph parameters.

Example (Chernoff bound for coin flips — a concrete calculation). Let \(X = X_1 + \cdots + X_{100}\) where each \(X_i \sim \mathrm{Bernoulli}(1/2)\), so \(\mu = 50\). What is the probability that \(X \geq 75\) (exceeding the mean by 50%)?

Using the Chernoff bound with \(\delta = 1/2\) (since \((1+\delta)\mu = 75\)):

\[ \Pr[X \geq 75] \leq \left(\frac{e^{1/2}}{(3/2)^{3/2}}\right)^{50} = \left(\frac{\sqrt{e}}{(3/2)^{3/2}}\right)^{50}. \]

Now \((3/2)^{3/2} = (3/2)\sqrt{3/2} \approx 1.5 \times 1.225 \approx 1.837\), and \(\sqrt{e} \approx 1.649\), so the base is \(\approx 1.649/1.837 \approx 0.897\). Thus \(\Pr[X \geq 75] \leq (0.897)^{50} \approx e^{50 \ln 0.897} \approx e^{-5.41} \approx 0.0045\).

Alternatively, the simpler bound \(e^{-\mu \delta^2/3} = e^{-50/12} \approx e^{-4.17} \approx 0.016\) gives a slightly weaker but easier estimate. The exact binomial probability is \(\Pr[X \geq 75] = \sum_{k=75}^{100}\binom{100}{k}2^{-100} \approx 2.8 \times 10^{-7}\), showing that the Chernoff bound, while useful, is not tight here — exponential concentration bounds can overestimate the tail by several orders of magnitude, but they are sufficient for most applications.
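The three quantities in this example — the optimised Chernoff bound, the simplified bound, and the exact binomial tail — can be computed directly:

```python
import math

n, k0, mu, delta = 100, 75, 50, 0.5

# Multiplicative Chernoff bound (e^d / (1+d)^(1+d))^mu at delta = 1/2
chernoff = (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu
# Simplified bound exp(-mu * delta^2 / 3)
simple = math.exp(-mu * delta ** 2 / 3)
# Exact tail P[X >= 75] for X ~ Bin(100, 1/2)
exact = sum(math.comb(n, k) for k in range(k0, n + 1)) / 2 ** n

print(f"Chernoff: {chernoff:.4g}, simplified: {simple:.4g}, exact: {exact:.4g}")
```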

Remark (The Doob martingale construction). The Azuma-Hoeffding inequality is most useful through the Doob martingale (also called the exposure martingale). Given a function \(f(X_1, \ldots, X_n)\) of independent random variables, define \(Z_k = \mathbb{E}[f \mid X_1, \ldots, X_k]\). Then \(Z_0 = \mathbb{E}[f]\), \(Z_n = f\), and \(Z_0, Z_1, \ldots, Z_n\) is a martingale. The bounded differences condition on \(f\) translates directly to bounded increments \(|Z_k - Z_{k-1}| \leq c_k\) — the key insight being that changing \(X_k\) from any value to any other can move the conditional expectation by at most the maximum change in \(f\) due to that coordinate. The Azuma-Hoeffding inequality then gives exponential concentration of \(f\) around its mean. The canonical examples in combinatorics: \(f = \chi(G)\) (chromatic number, changing one edge changes \(\chi\) by at most 1), \(f = \mathrm{LIS}(\sigma)\) (longest increasing subsequence, changing one element of a permutation changes LIS by at most 1), and \(f = |C^*|\) (maximum matching, bounded differences of 1).
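The LIS example above can be explored numerically: patience sorting computes LIS in \(O(n \log n)\), and sampling permutations shows fluctuations far below the \(O(\sqrt{n})\) scale that Azuma alone guarantees. (A sketch; names are ours.)

```python
import bisect
import random
import statistics

def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []  # tails[i] = smallest tail of an increasing subsequence of length i+1
    for x in seq:
        i = bisect.bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

rng = random.Random(0)
n, samples = 2000, 200
values = []
for _ in range(samples):
    perm = list(range(n))
    rng.shuffle(perm)
    values.append(lis_length(perm))
mean, sd = statistics.mean(values), statistics.pstdev(values)
print(f"mean near 2*sqrt(n) = {2 * n ** 0.5:.1f}: observed {mean:.1f}, std {sd:.2f}")
```

The observed standard deviation is a small constant multiple of \(n^{1/6} \approx 3.6\) here, while \(\sqrt{n} \approx 45\).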

Finally, Janson’s inequality handles lower tails for sums of dependent 0-1 variables arising in random graph theory.

Theorem (Janson's Inequality). Let \(I_S\) be the indicator that edge-set \(S \subseteq \binom{[n]}{2}\) is present in \(G(n,p)\), and let \(X = \sum_S I_S\) over a family of edge-sets (each \(S\) represents a copy of some fixed graph \(H\)). Define \(\mu = \mathbb{E}[X]\) and \[ \Delta = \sum_{\substack{S, T : \\ S \cap T \neq \emptyset, \, S \neq T}} \Pr[I_S = 1, I_T = 1] \] where the sum is over ordered pairs of distinct sets with \(S \cap T \neq \emptyset\). Then \[ \Pr[X = 0] \leq \exp\!\left(-\mu + \frac{\Delta}{2}\right). \]

A convenient unified form, useful when \(\Delta\) is large, is \(\Pr[X = 0] \leq \exp\!\left(-\mu^2 / (\mu + \Delta)\right)\), which interpolates between the regimes \(\Delta \ll \mu\) and \(\Delta \gg \mu\).

Proof sketch. Order the sets in the family as \(S_1, \ldots, S_N\) and write \[ \Pr[X = 0] = \prod_{t=1}^N \Pr\!\left[I_{S_t} = 0 \,\Big|\, I_{S_1} = \cdots = I_{S_{t-1}} = 0\right]. \] For each factor, split the earlier sets into those intersecting \(S_t\) and those disjoint from it. Conditioning on the disjoint sets being absent does not increase \(\Pr[I_{S_t} = 1]\) (a correlation argument using FKG, since all \(I_S\) are increasing), and the intersecting sets cost at most \(\sum_{s < t,\, S_s \cap S_t \neq \emptyset} \Pr[I_{S_s} = I_{S_t} = 1]\). Thus each factor is at most \(1 - \Pr[I_{S_t}=1] + (\text{overlap terms})\); using \(1 - x \leq e^{-x}\) and summing over \(t\) yields \(\Pr[X=0] \leq \exp(-\mu + \Delta/2)\). \(\square\)
Example (Janson's inequality for triangle containment). In \(G(n,p)\), let \(X\) = number of triangles, with \(p = c/n\). We computed \(\mu = \mathbb{E}[X] \approx c^3/6\). Now compute \(\Delta\). Two distinct triangles can share at most one edge (sharing two edges would force them to coincide), so the only overlapping pairs share exactly one edge: they span 4 vertices and 5 edges in total. There are \(O(n^4)\) such ordered pairs (choose the shared edge, then an ordered pair of distinct third vertices), and each occurs with probability \(p^5\). Total: \(\Delta = O(n^4 p^5) = O(c^5/n) \to 0\).
So \(\Delta = O(c^5/n) = o(\mu) = o(c^3/6)\) for fixed \(c\). The Janson bound gives: \[ \Pr[X = 0] \leq \exp\!\left(-\mu + \Delta/2\right) \approx \exp(-c^3/6). \]

For \(c = 1\): \(\Pr[X = 0] \leq e^{-1/6} \approx 0.85\). For \(c = 3\): \(\Pr[X = 0] \leq e^{-4.5} \approx 0.011\). This matches the intuition that once \(c\) is large — well inside the regime \(p = \Theta(1/n)\) where triangles emerge — a triangle is almost certain to appear.
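The quantities \(\mu\), \(\Delta\), and the Janson bound can be computed exactly for given \(n\) and \(c\); the pair count below uses the observation that overlapping triangles share exactly one edge (function name ours):

```python
import math

def janson_triangle_bound(n, c):
    """Janson upper bound exp(-mu + Delta/2) on P[no triangle in G(n, c/n)]."""
    p = c / n
    mu = math.comb(n, 3) * p ** 3
    # Ordered pairs of distinct triangles sharing an edge: choose the common
    # edge, then an ordered pair of distinct third vertices (5 edges total).
    pairs = math.comb(n, 2) * (n - 2) * (n - 3)
    delta = pairs * p ** 5
    return mu, delta, math.exp(-mu + delta / 2)

for c in (1.0, 3.0):
    mu, delta, bound = janson_triangle_bound(1000, c)
    print(f"c={c}: mu={mu:.3f}, Delta={delta:.4f}, P[X=0] <= {bound:.4f}")
```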

Janson’s inequality is the primary tool for lower-tailing events in random graphs — for example, showing that a random graph near the threshold for containing a given subgraph \(H\) has \(H\) with high probability. The correction term \(\Delta\) captures the extent to which overlapping copies of \(H\) positively correlate, and when \(\Delta \ll \mu^2\), the Janson bound is nearly tight.


Chapter 7: The Probabilistic Method in Graph Theory

Graph theory is the home territory of the probabilistic method. Nearly every major structural result about graphs — from Ramsey theory to colouring to discrepancy — has a probabilistic proof or a probabilistic component in the best known bound.

The most famous open problem in extremal combinatorics concerns Ramsey numbers. We have seen \(R(k,k) > 2^{k/2}\). The upper bound, via Pascal's identity (Erdős-Szekeres), is \(R(k,k) \leq \binom{2k-2}{k-1} \leq 4^k\). Despite the simplicity of both arguments, the gap between the exponential bases \(\sqrt{2}\) and \(4\) has not been closed in 75 years.

Theorem (Asymptotics of \(R(3,k)\)). \(R(3,k) = \Theta(k^2 / \log k)\). The upper bound is due to Ajtai, Komlós, and Szemerédi (1980); the matching lower bound is due to Kim (1995), via a semi-random ("nibble") construction, later recovered from the triangle-free process.
Proof sketch (weaker lower bound via the LLL). We construct a triangle-free graph on \(n\) vertices with independence number less than \(k\). In \(G(n,p)\) with \(p = c/\sqrt{n}\), take as bad events the potential triangles, each of probability \(p^3 = c^3 n^{-3/2}\), and the potential independent sets of size \(k\), each of probability \((1-p)^{\binom{k}{2}}\). With \(k = C\sqrt{n}\log n\) for a suitable constant \(C\), the asymmetric LLL shows that all bad events can be avoided simultaneously, yielding a triangle-free graph with \(\alpha(G) < k\), hence \(R(3,k) = \Omega(k^2/\log^2 k)\). Closing the remaining logarithmic factor to reach the true order \(k^2/\log k\) requires Kim's semi-random method; by contrast, a naive alteration argument (deleting a vertex from every triangle of \(G(n,p)\)) gives only \(R(3,k) = \Omega((k/\log k)^{3/2})\). \(\square\)

One of Erdős’s most celebrated theorems shows that the probabilistic method can construct graphs with simultaneously extreme properties — high chromatic number and high girth — that seem contradictory.

Theorem (Erdős 1959 — High girth and high chromatic number). For any integers \(k \geq 2\) and \(g \geq 3\), there exists a graph \(G\) with \(\chi(G) \geq k\) and \(\mathrm{girth}(G) \geq g\).
Proof. Let \(n\) be large, and consider \(G = G(n, p)\) with \(p = n^{1/g - 1}\). Count short cycles: for \(3 \leq \ell \leq g - 1\), the expected number of cycles of length \(\ell\) is \(\frac{1}{2\ell}\cdot\frac{n!}{(n-\ell)!} \cdot p^\ell \leq n^\ell p^\ell / (2\ell) = n^{\ell/g} / (2\ell)\), so the expected total number of short cycles is \(O(n^{(g-1)/g}) = o(n)\). By Markov, with probability at least \(1/2\) the number of short cycles is \(o(n)\). Separately, a standard first-moment calculation gives \(\alpha(G(n,p)) \leq 3(\ln n)/p = 3n^{1-1/g}\ln n\) with probability \(1 - o(1)\). So with positive probability both events hold; fix such a \(G\) and delete one vertex from every short cycle to obtain a subgraph \(G'\) with \(n - o(n)\) vertices and girth at least \(g\). Since deleting vertices cannot increase the independence number, \(\chi(G') \geq |V(G')| / \alpha(G') \geq (n - o(n)) / (3n^{1-1/g}\ln n) = \Omega(n^{1/g}/\ln n)\). For large enough \(n\), this exceeds \(k\). \(\square\)

This theorem was the watershed moment for the probabilistic method. Before 1959, many mathematicians believed that high girth would force large independent sets (and hence low chromatic number), via a heuristic analogy with tree-like sparse graphs. Erdős’s proof demolished this intuition entirely.

Remark (Quantitative girth-chromatic number trade-off). The above proof gives explicit quantitative bounds. With \(p = n^{1/g - 1}\), one constructs a subgraph \(G'\) with:
  • Girth \(\mathrm{girth}(G') \geq g\),
  • \(|V(G')| \geq n/2\),
  • \(\alpha(G') \leq 3n^{1-1/g}\log n\) (with high probability),
  • \(\chi(G') \geq n^{1/g} / (6\log n)\).
For example, to guarantee \(\chi \geq k\) and \(\mathrm{girth} \geq g\), take \(n = (6k\log n)^g\), which gives \(n = e^{O(g \log k)}\). The resulting graphs have \(\Theta(n^{1+1/g})\) edges (since each vertex has expected degree \(\sim np = n^{1/g}\)). So graphs with girth \(g\) and chromatic number \(k\) exist with roughly \(k^{g+O(1)}\) vertices — an exponential dependence on girth. Whether this is tight (whether one needs exponentially many vertices in \(g\)) is related to the "cage problem" in graph theory.

In fact, exponential dependence on \(g\) is necessary. Every \(k\)-chromatic graph contains a subgraph of minimum degree at least \(k-1\), and the Moore bound forces any graph with minimum degree \(d\) and girth \(g\) to have \(\Omega(d^{\lfloor(g-1)/2\rfloor})\) vertices. So graphs with girth \(g\) and chromatic number \(k\) require exponentially many (in \(g\)) vertices — the same phenomenon that drives the cage problem, where one seeks the minimum number of vertices in a \(d\)-regular graph of girth \(g\).

Example (Explicit small case: girth 4, chromatic number 4). A graph with girth \(\geq 4\) (no triangles) and \(\chi \geq 4\) (non-3-colourable) can be constructed explicitly via the Mycielski construction, but the probabilistic argument gives it with overwhelming probability — for large enough \(n\). Take \(n = 100\), \(p = 100^{1/4-1} = 100^{-3/4} \approx 0.0316\). Then:
  • Expected number of 3-cycles: \(\binom{100}{3}p^3 \approx 161{,}700 \times 3.2 \times 10^{-5} \approx 5\). Few triangles; removing one vertex per triangle costs \(\approx 5\) vertices in expectation.
  • Expected independence number: the first moment gives \(\alpha(G(100, p)) \lesssim 3 \ln(100)/p = 3n^{3/4}\ln n \approx 3 \times 31.6 \times 4.6 \approx 437\)... but \(n = 100\), so this bound is trivial.
This illustrates that the probabilistic argument requires \(n\) to be large. The explicit Mycielski construction is more efficient for small chromatic number: the Mycielskian of a \(k\)-chromatic triangle-free graph is a \((k+1)\)-chromatic triangle-free graph. Starting from \(K_2\) (chromatic number 2, girth \(\infty\)), the Mycielski chain gives triangle-free graphs with chromatic numbers \(3, 4, 5, \ldots\) using \(5, 11, 23, \ldots\) vertices respectively — much smaller than the probabilistic construction.
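The Mycielski chain is small enough to verify by brute force; the sketch below builds \(K_2 \to C_5 \to\) Grötzsch graph, checking triangle-freeness and chromatic number at each step (the backtracking colourability test is only feasible because the graphs are tiny).

```python
def mycielskian(n, edges):
    """Mycielskian: vertices 0..n-1 original, n..2n-1 shadows, 2n apex."""
    new_edges = set(edges)
    for u, v in edges:
        new_edges.add((u, n + v))
        new_edges.add((v, n + u))
    for i in range(n):
        new_edges.add((n + i, 2 * n))
    return 2 * n + 1, new_edges

def adjacency(n, edges):
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def triangle_free(n, edges):
    adj = adjacency(n, edges)
    return all(not (adj[u] & adj[v]) for u, v in edges)

def colourable(n, edges, k):
    """Exhaustive backtracking k-colourability test (fine for ~11 vertices)."""
    adj = adjacency(n, edges)
    col = [-1] * n

    def backtrack(v):
        if v == n:
            return True
        for c in range(k):
            if all(col[u] != c for u in adj[v]):
                col[v] = c
                if backtrack(v + 1):
                    return True
        col[v] = -1
        return False

    return backtrack(0)

def chromatic_number(n, edges):
    k = 1
    while not colourable(n, edges, k):
        k += 1
    return k

# Start from K2 and apply Mycielski twice: C5 (5 vertices, chi = 3),
# then the Groetzsch graph (11 vertices, chi = 4).
n, edges = 2, {(0, 1)}
sizes, chis = [n], [chromatic_number(n, edges)]
for _ in range(2):
    n, edges = mycielskian(n, edges)
    assert triangle_free(n, edges)
    sizes.append(n)
    chis.append(chromatic_number(n, edges))
print(sizes, chis)  # [2, 5, 11] [2, 3, 4]

# One more step gives a 23-vertex triangle-free graph (chi = 5, not checked here).
n, edges = mycielskian(n, edges)
assert triangle_free(n, edges) and n == 23
```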

The probabilistic method also underlies the Szemerédi Regularity Lemma, one of the most powerful structural tools in graph theory.

Theorem (Szemerédi Regularity Lemma, 1976). For every \(\varepsilon > 0\), there exists \(M = M(\varepsilon)\) such that every graph \(G\) on \(n\) vertices has an \(\varepsilon\)-regular partition of its vertex set into \(k\) parts (with \(1/\varepsilon \leq k \leq M\)): for all but at most \(\varepsilon k^2\) pairs \((V_i, V_j)\), the bipartite graph between \(V_i\) and \(V_j\) is \(\varepsilon\)-regular in the sense that for all \(A \subseteq V_i\), \(B \subseteq V_j\) with \(|A| \geq \varepsilon|V_i|\) and \(|B| \geq \varepsilon|V_j|\), \[ \left| d(A,B) - d(V_i, V_j) \right| \leq \varepsilon, \] where \(d(X,Y) = e(X,Y)/(|X||Y|)\) is the edge density.
Proof sketch (energy increment). Define the "energy" or "index" of a partition \(\mathcal{P} = \{V_1, \ldots, V_k\}\) as \[ q(\mathcal{P}) = \frac{1}{k^2} \sum_{i,j} d(V_i, V_j)^2. \] Note \(0 \leq q(\mathcal{P}) \leq 1\). If the partition is not \(\varepsilon\)-regular, then for each of the \(\geq \varepsilon k^2\) irregular pairs \((V_i, V_j)\) there are witnesses \(A \subseteq V_i\), \(B \subseteq V_j\) violating \(\varepsilon\)-regularity. Refining the partition simultaneously by all these witness sets increases \(q(\mathcal{P})\) by at least \(\varepsilon^5\) per round (each irregular pair contributes an energy gain of order \(\varepsilon^4/k^2\)). Since \(q \leq 1\), at most \(\varepsilon^{-5}\) refinement rounds occur; each round can split every part into exponentially many pieces, which is why the final number of parts is bounded only by a tower-type function \(M(\varepsilon) = \mathrm{tower}(O(\varepsilon^{-5}))\). \(\square\)
Remark (The regularity lemma and Ramsey-Turán theory). The regularity lemma is the engine behind the proof of the triangle removal lemma: if a graph on \(n\) vertices has \(o(n^3)\) triangles, it can be made triangle-free by removing \(o(n^2)\) edges. This in turn implies Roth's theorem (every positive-density subset of the integers contains an arithmetic progression of length 3) via a graph-theoretic encoding. The probabilistic method enters the regularity machinery in two ways: the energy increment argument has a probabilistic interpretation (randomly sample from irregular pairs), and the counting lemma (which counts copies of a fixed subgraph in a regular partition) uses independence of the regular pairs as a proxy for independence of edge choices. The interplay between regularity and probabilistic arguments is the foundation of the "sparse regularity" theory (Conlon-Fox-Zhao 2015), which extends the regularity lemma to sparse random and pseudorandom graphs.

The probabilistic method also illuminates discrepancy theory, which asks how evenly a 2-colouring of a ground set can distribute mass across a set system.

Definition (Discrepancy). For a set system \((X, \mathcal{F})\) with \(|X| = n\) and \(\mathcal{F} = \{S_1,\ldots,S_m\}\), the discrepancy is \[ \mathrm{disc}(\mathcal{F}) = \min_{\chi: X \to \{+1,-1\}} \max_{S \in \mathcal{F}} \left|\sum_{x \in S} \chi(x)\right|. \]
Theorem (Spencer 1985 — Six standard deviations suffice). For any set system with \(n\) sets on \(n\) points, \(\mathrm{disc}(\mathcal{F}) \leq 6\sqrt{n}\).

The proof uses an ingenious partial colouring lemma: by a counting (pigeonhole/entropy) argument combined with a probabilistic step, one can always find a partial \(\pm 1\) colouring of a constant fraction of the points such that every set has discrepancy at most \(C\sqrt{n}\). Iterating gives the full bound. The order \(\sqrt{n}\) is optimal — there exist set systems (for instance, those built from Hadamard matrices) requiring discrepancy \(\Omega(\sqrt{n})\).

Remark (Beck-Fiala theorem and the discrepancy of arithmetic progressions). Spencer's theorem gives \(\mathrm{disc} = O(\sqrt{n})\) for \(n\) sets on \(n\) points. When each point appears in at most \(t\) sets (the degree is at most \(t\)), the Beck-Fiala theorem gives \(\mathrm{disc} \leq 2t - 1\) — a bound depending only on the maximum degree, not on the number of sets. Whether the true bound is \(O(\sqrt{t})\) (the Beck-Fiala conjecture) remains open.

For arithmetic progressions in \(\{1, \ldots, n\}\), Roth’s theorem (1964) shows that for any \(\pm 1\) colouring, there exists an arithmetic progression of discrepancy \(\Omega(n^{1/4})\). This is proved using Fourier analysis on \(\mathbb{Z}_n\): the discrepancy is lower-bounded by the \(\ell^2\) norm of the Fourier transform of the colouring weighted by the Fourier coefficients of arithmetic progressions. Matoušek and Spencer (1996) showed the correct bound is \(\Theta(n^{1/4})\), closing the question for 1-dimensional arithmetic progressions.

The connection to the probabilistic method: the random \(\pm 1\) colouring has expected discrepancy \(\Theta(\sqrt{n \log m})\) for \(m\) sets of size \(\Theta(n)\), by the union bound and Chernoff. Spencer's insight was that the random colouring is not optimal — the probabilistic method only provides the starting point, and the real work is done by the partial colouring lemma; efficient algorithms matching Spencer's bound came only much later (Bansal 2010, Lovett-Meka 2012).
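The random-colouring baseline is easy to simulate. The sketch below (an illustrative toy: a random set system with a fixed seed, not an extremal one) prints the discrepancy of one random colouring next to the union-bound scale \(\sqrt{n\log m}\) and Spencer's \(6\sqrt{n}\); note that at this small \(n\) the two guarantees are still comparable — the asymptotic separation only appears for very large \(n\).

```python
import random
from math import sqrt, log

rng = random.Random(4)
n = m = 200
# Random set system: each of the m sets contains each of n points with prob 1/2.
sets = [[x for x in range(n) if rng.random() < 0.5] for _ in range(m)]

# One uniformly random +/-1 colouring and its discrepancy over the system.
chi = [rng.choice((-1, 1)) for _ in range(n)]
disc = max(abs(sum(chi[x] for x in S)) for S in sets)

# Chernoff + union bound predict disc on the order of sqrt(n log m);
# Spencer guarantees a (non-random) colouring with disc <= 6 sqrt(n).
print(f"random disc = {disc}, sqrt(n log m) = {sqrt(n * log(m)):.1f}, "
      f"6 sqrt(n) = {6 * sqrt(n):.1f}")
```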

The probabilistic method also connects to spectral graph theory through expander graphs and the tight relationship between randomness and pseudorandomness.

Example (Probabilistic construction of a Ramsey graph). To prove \(R(k,k) > 2^{k/2}\), the probabilistic construction takes a uniformly random 2-colouring of the edges of \(K_n\) with \(n = \lfloor 2^{k/2}\rfloor\). Let us trace through a specific small case. For \(k = 4\): \(R(4,4) = 18\), so a 2-colouring of \(K_{17}\) with no monochromatic \(K_4\) exists. The probabilistic bound only gives \(R(4,4) > \lfloor 2^2 \rfloor = 4\), far from the truth; to construct an explicit \(K_{17}\) colouring avoiding monochromatic \(K_4\), one uses the Paley graph on 17 vertices: connect \(i\) and \(j\) iff \(i - j\) is a quadratic residue mod 17. The QRs mod 17 are \(\{1, 2, 4, 8, 9, 13, 15, 16\}\) — a set of 8 elements. The Paley graph on 17 vertices is \(\mathrm{srg}(17, 8, 3, 4)\). One verifies that this graph contains neither a clique nor an independent set of size 4 — consistent with \(R(4,4) = 18 > 17\).
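The Paley colouring can be checked exhaustively in a few lines — a verification sketch (quadratic residues computed directly rather than hardcoded):

```python
from itertools import combinations

q = 17
qr = {(i * i) % q for i in range(1, q)}   # quadratic residues mod 17
assert qr == {1, 2, 4, 8, 9, 13, 15, 16}  # 16 = -1 is a QR, so the rule below
                                          # is symmetric in i and j

def adjacent(i, j):
    return (i - j) % q in qr

# Red = Paley edge, blue = non-edge.  Check all C(17,4) = 2380 quadruples:
# none should induce a monochromatic K4 (all-red clique or all-blue 4-set).
checked = 0
for quad in combinations(range(q), 4):
    pair_colours = {adjacent(i, j) for i, j in combinations(quad, 2)}
    assert len(pair_colours) == 2
    checked += 1
print(checked, "quadruples checked: no monochromatic K4")
```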
Remark (Ramsey numbers and graph Ramsey theory). Graph Ramsey theory extends the study of \(R(k,k)\) to heterogeneous settings. The graph Ramsey number \(R(H_1, H_2)\) is the minimum \(n\) such that every 2-colouring of \(K_n\) contains either a red copy of \(H_1\) or a blue copy of \(H_2\). For the off-diagonal numbers, \(R(3, k) = \Theta(k^2/\log k)\): the upper bound is due to Ajtai-Komlós-Szemerédi (1980), the matching lower bound to Kim (1995). The alteration method gives a clean but weaker lower bound: take \(G = G(n, cn^{-2/3})\) and remove one vertex from each triangle. The expected number of triangles is \(O(n)\), so the resulting triangle-free graph \(G'\) retains \(\Omega(n)\) vertices, and \(\alpha(G') = O(n^{2/3}\log n)\) with high probability; choosing \(n = \Theta((k/\log k)^{3/2})\) yields a triangle-free graph with no independent set of size \(k\), so \(R(3,k) = \Omega((k/\log k)^{3/2})\). The Lovász Local Lemma improves this to \(\Omega(k^2/(\log k)^2)\), and closing the remaining logarithmic gap required Kim's vastly more sophisticated semi-random ("nibble") algorithm.
Theorem (Random regular graphs are expanders). For any fixed degree \(d \geq 3\), a uniformly random \(d\)-regular graph on \(n\) vertices is an expander with high probability: there exists \(\lambda < d\) such that the second eigenvalue of the adjacency matrix is at most \(\lambda\) with probability \(1 - o(1)\).

The expansion property — every small set has many neighbours outside it — is fundamental in theoretical computer science for constructing error-correcting codes, pseudorandom generators, and derandomization schemes.

Theorem (Alon-Boppana bound). For any \(d\)-regular graph \(G\) on \(n\) vertices, the second eigenvalue \(\lambda_2(G)\) of the adjacency matrix satisfies \[ \lambda_2(G) \geq 2\sqrt{d-1} - o(1), \] where the \(o(1)\) term goes to 0 as \(n \to \infty\). A \(d\)-regular graph achieving \(\lambda_2 \leq 2\sqrt{d-1}\) is called a Ramanujan graph.
Remark (Ramanujan graphs and optimal expanders). Ramanujan graphs are optimal expanders: the Alon-Boppana bound shows that \(2\sqrt{d-1}\) is the smallest possible second eigenvalue for an infinite family of \(d\)-regular graphs. The first explicit constructions were given by Lubotzky-Phillips-Sarnak (1988) and Margulis (1988) using deep number theory — specifically, the Ramanujan conjecture (proved by Deligne) bounding the Hecke eigenvalues of modular forms. The name "Ramanujan graph" honours this connection. The probabilistic method shows that random \(d\)-regular graphs are nearly Ramanujan (Friedman's theorem, 2004: for every fixed \(\varepsilon > 0\), whp \(\lambda_2(G_{n,d}) \leq 2\sqrt{d-1} + \varepsilon\)), but the existence of (bipartite) Ramanujan graphs of every degree was only established in 2015 by Marcus-Spielman-Srivastava via the method of interlacing polynomials.

Chapter 8: Algorithmic Applications and Derandomization

The probabilistic method originally produced purely existential results: objects with certain properties exist, but no efficient procedure was known to find them. Over the past four decades, this limitation has largely been overcome through two complementary approaches: derandomization (converting probabilistic arguments into deterministic algorithms) and direct randomized algorithms (using the probabilistic argument constructively).

The algorithmic counterpart of the probabilistic method raises a deep question: when a probabilistic argument certifies existence, how hard is it to find the certifying object? The answer varies:

  • If the argument is a first moment argument (remove one element per violation), derandomization via conditional expectations almost always gives a polynomial-time algorithm.
  • If the argument uses the LLL, the Moser-Tardos algorithm (2010) gives an efficient randomized algorithm (converted to deterministic via derandomization of the resamplings).
  • If the argument uses second moments or non-constructive concentration, finding the object is often computationally hard — the "planted clique" problem is a canonical example.
This trichotomy — first moment = efficient, LLL = efficient, second moment = possibly hard — is not a theorem but a widely observed empirical pattern in probabilistic combinatorics. Understanding when probabilistic existence implies efficient constructibility is one of the central open questions of modern theoretical computer science.
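The Moser-Tardos algorithm from the second bullet is short enough to sketch in full. The instance below is a hedged toy example (not from the text): a 5-uniform hypergraph whose edges overlap at most 4 others, so the symmetric LLL condition holds and the resampling loop terminates quickly in expectation.

```python
import random

def moser_tardos_2colour(n, edges, seed=0, max_rounds=100_000):
    """Moser-Tardos for hypergraph 2-colouring: while some hyperedge is
    monochromatic (a 'bad event'), resample just that edge's vertices."""
    rng = random.Random(seed)
    colour = [rng.randrange(2) for _ in range(n)]
    for _ in range(max_rounds):
        bad = next((e for e in edges if len({colour[v] for v in e}) == 1), None)
        if bad is None:
            return colour          # no bad event left: proper 2-colouring
        for v in bad:              # resample only the violated event's variables
            colour[v] = rng.randrange(2)
    raise RuntimeError("did not converge")

# Toy instance: 5-uniform edges on 30 vertices; each edge overlaps at most
# 4 others, so the symmetric LLL condition e * 2^{1-5} * (4+1) <= 1 holds.
n, k = 30, 5
edges = [tuple(range(i, i + k)) for i in range(0, n - k + 1, 2)]
col = moser_tardos_2colour(n, edges)
assert all(len({col[v] for v in e}) == 2 for e in edges)
print("proper 2-colouring found by resampling")
```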

The simplest illustration is the MAX-CUT problem, whose probabilistic proof is one of the cleanest in all of algorithm design.

Theorem (MAX-CUT — Probabilistic 1/2-approximation). For any graph \(G = (V, E)\) with \(m\) edges, there exists a cut \((S, \overline{S})\) with at least \(m/2\) crossing edges.
Proof. Assign each vertex to \(S\) or \(\overline{S}\) independently with probability \(1/2\). Each edge crosses the cut with probability \(1/2\), so the expected number of cut edges is \(m/2\). Hence some cut has \(\geq m/2\) edges. \(\square\)
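A direct simulation of this argument (an illustrative sketch; the test graph is an arbitrary seeded \(G(12, 1/2)\)) confirms that the average cut size over random assignments is \(m/2\):

```python
import random
from itertools import combinations

rng = random.Random(1)

# Arbitrary test graph: G(12, 0.5) with a fixed seed; m = number of edges.
n = 12
edges = [e for e in combinations(range(n), 2) if rng.random() < 0.5]
m = len(edges)

def random_cut_size():
    """Assign each vertex to a side with probability 1/2; count crossing edges."""
    side = [rng.randrange(2) for _ in range(n)]
    return sum(1 for u, v in edges if side[u] != side[v])

trials = 20_000
avg = sum(random_cut_size() for _ in range(trials)) / trials
print(f"m/2 = {m / 2}, empirical average = {avg:.2f}")
```

By linearity of expectation the average must converge to exactly \(m/2\), whatever the graph.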

The \(m/2\) bound is essentially tight for dense random graphs, whose maximum cut is \(m/2 + o(m)\); for worst-case approximation, the Goemans-Williamson SDP algorithm achieves ratio \(\alpha_{\text{GW}} \approx 0.878\), which is optimal under the Unique Games Conjecture. The existential proof immediately yields an algorithm: the randomized algorithm succeeds in expectation, and repeated sampling achieves any desired success probability.

Derandomization extracts a deterministic algorithm from a probabilistic argument by replacing random choices with deterministic ones, guided by a potential function.

Theorem (Method of Conditional Expectations). If a probabilistic argument establishes \(\mathbb{E}[f(X_1,\ldots,X_n)] \geq c\) for some function \(f\) and independent random variables \(X_1,\ldots,X_n\), then a deterministic assignment achieving \(f \geq c\) can be found in polynomial time by sequentially setting each \(X_i\) to the value that maximizes \(\mathbb{E}[f \mid X_1=x_1,\ldots,X_i=x_i]\).

The key requirement is that the conditional expectation can be computed efficiently. For the MAX-CUT application, each vertex is assigned greedily to the side that maximizes the expected number of cut edges given previous assignments, recovering a \(1/2\)-approximation deterministically.

A more sophisticated derandomization applies to hypergraph 2-colouring.

Theorem (Erdős 1963 — Property B). If \(\mathcal{H} = (V, \mathcal{E})\) is a \(k\)-uniform hypergraph with \(|\mathcal{E}| < 2^{k-1}\), then \(V\) has a 2-colouring with no monochromatic hyperedge. (The potential-function derandomization below is in the spirit of the Erdős-Selfridge argument from positional games.)
Proof. Colour each vertex \(\pm 1\) independently. Each hyperedge is monochromatic with probability \(2^{1-k}\). The expected number of monochromatic hyperedges is \(|\mathcal{E}| \cdot 2^{1-k} < 1\). Hence some colouring avoids all monochromatic edges. Derandomization via conditional expectations yields an efficient deterministic algorithm. \(\square\)
Remark (The method of conditional expectations in detail). The derandomization of the hypergraph colouring proceeds as follows. Order the vertices \(v_1, v_2, \ldots, v_n\). Define the potential function \(\Phi(x_1, \ldots, x_j) = \mathbb{E}[\text{number of monochromatic edges} \mid v_1 = x_1, \ldots, v_j = x_j]\). At each step \(j\), choose \(x_j \in \{+1, -1\}\) to be whichever value minimises \(\Phi(x_1, \ldots, x_j)\). Since \(\Phi(x_1, \ldots, x_{j-1}) = \frac{1}{2}[\Phi(x_1,\ldots,x_{j-1},+1) + \Phi(x_1,\ldots,x_{j-1},-1)]\), choosing the minimiser ensures \(\Phi\) is non-increasing. At the start, \(\Phi() = |\mathcal{E}| \cdot 2^{1-k} < 1\). At the end, \(\Phi(x_1,\ldots,x_n)\) is the actual number of monochromatic edges, which must be a non-negative integer less than 1, hence 0. The resulting colouring is valid. The key requirement is that each conditional expectation can be computed efficiently — here \(\Phi\) is a sum over hyperedges of the conditional probability of being monochromatic: \(0\) if the edge already contains both colours, \(2^{-(k-s)}\) if its \(s \geq 1\) coloured vertices all share one colour, and \(2^{1-k}\) if none of its vertices is coloured yet; this takes \(O(k|\mathcal{E}|)\) time per step.
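A compact implementation of this derandomization — the 4-uniform hypergraph below is an arbitrary toy instance satisfying \(|\mathcal{E}| = 7 < 2^{k-1} = 8\):

```python
def conditional_expectation(colour, edges, k):
    """Phi = sum over edges of Pr[edge monochromatic | partial colouring]."""
    total = 0.0
    for e in edges:
        cols = {colour[v] for v in e if colour[v] is not None}
        uncoloured = sum(1 for v in e if colour[v] is None)
        if len(cols) >= 2:
            continue                      # already bichromatic: probability 0
        if len(cols) == 1:
            total += 2.0 ** -uncoloured   # must match the one colour seen
        else:
            total += 2.0 ** (1 - k)       # fully uncoloured edge
    return total

# Toy instance: k = 4, seven edges (< 2^{k-1} = 8) on 8 vertices.
k, n = 4, 8
edges = [(0, 1, 2, 3), (4, 5, 6, 7), (0, 2, 4, 6), (1, 3, 5, 7),
         (0, 3, 4, 7), (1, 2, 5, 6), (0, 1, 6, 7)]
colour = [None] * n
for v in range(n):
    choices = []
    for c in (0, 1):
        colour[v] = c
        choices.append((conditional_expectation(colour, edges, k), c))
    colour[v] = min(choices)[1]           # keep the colour minimising Phi

mono = sum(1 for e in edges if len({colour[v] for v in e}) == 1)
assert mono == 0   # guaranteed: Phi starts at 7/8 < 1 and never increases
print(colour)
```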
Example (Derandomizing MAX-CUT for the Petersen graph). The Petersen graph has \(n = 10\) vertices and \(m = 15\) edges. The probabilistic argument guarantees a cut with \(\geq 15/2 = 7.5\), hence \(\geq 8\) crossing edges. Let us apply the conditional expectations algorithm:

Order the vertices \(1, 2, \ldots, 10\). At each step, assign vertex \(v_j\) to side \(S\) or \(\overline{S}\) greedily. The conditional expected number of cut edges after assigning \(v_1, \ldots, v_j\) is:

\[ \Phi(x_1,\ldots,x_j) = (\text{cut edges so far}) + \frac{1}{2} \times (\text{edges between assigned and unassigned vertices}) + \frac{1}{2} \times (\text{edges among unassigned vertices}). \]

At each step \(j\), we choose the assignment \(x_j\) (to \(S\) or \(\overline{S}\)) that maximises the conditional expectation — equivalently, the side opposite the majority of \(v_j\)'s already-assigned neighbours, since edges to unassigned vertices contribute \(1/2\) either way. For the Petersen graph, this greedy process produces a cut of size \(\geq 8\), as guaranteed. (The maximum cut of the Petersen graph is in fact 12; since the graph is not bipartite — its independence number is 4 — no cut can contain all 15 edges.)
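The greedy step above can be run end-to-end; a minimal sketch, with the vertex order and tie-breaking chosen arbitrarily:

```python
# Petersen graph: outer 5-cycle, spokes, inner pentagram (15 edges, 3-regular).
edges = (
    [(i, (i + 1) % 5) for i in range(5)]             # outer cycle
    + [(i, i + 5) for i in range(5)]                 # spokes
    + [(i + 5, (i + 2) % 5 + 5) for i in range(5)]   # inner pentagram
)

side = {}
for v in range(10):
    # Conditional expectations: edges to unassigned vertices contribute 1/2
    # regardless of the choice, so it suffices to place v opposite the
    # majority of its already-assigned neighbours.
    nbrs = [u + w - v for u, w in edges if v in (u, w)]
    on_zero = sum(1 for u in nbrs if side.get(u) == 0)
    on_one = sum(1 for u in nbrs if side.get(u) == 1)
    side[v] = 1 if on_zero >= on_one else 0

cut = sum(1 for u, w in edges if side[u] != side[w])
print("cut size:", cut)
assert cut >= 8   # guaranteed: the conditional expectation starts at 7.5
```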

Pairwise independence allows significant reduction in the randomness used by probabilistic algorithms while preserving key concentration properties.

Definition (Pairwise-independent hash family). A family \(\mathcal{H}\) of functions \(h: [n] \to [m]\) is pairwise independent (also called strongly 2-universal; the weaker "2-universal" condition bounds only collision probabilities) if for all distinct \(x, y \in [n]\) and all \(a, b \in [m]\), \(\Pr_{h \in \mathcal{H}}[h(x)=a,\, h(y)=b] = 1/m^2\).

Pairwise-independent hash families can be constructed with \(O(\log n)\) random bits (versus \(\Theta(n)\) for full independence), yet suffice for applications that only require pairwise independence — such as Chebyshev’s inequality with \(n\) variables and sum \(X = \sum X_i\), where \(\mathrm{Var}(X) = \sum \mathrm{Var}(X_i)\) still holds under pairwise independence (since covariance terms are zero).

Example (Pairwise-independent hash family from linear maps over \(\mathbb{F}_p\)). Let \(p\) be a prime and consider the family \(\mathcal{H} = \{h_{a,b}: \mathbb{F}_p \to \mathbb{F}_p \mid a, b \in \mathbb{F}_p\}\) where \(h_{a,b}(x) = ax + b\). This family has size \(p^2\) and is pairwise independent: for any \(x \neq y \in \mathbb{F}_p\) and any \(c, d \in \mathbb{F}_p\), the system \(ax + b = c\) and \(ay + b = d\) has a unique solution \((a, b) = ((c-d)/(x-y), c - ax)\). So \(\Pr_{h}[h(x) = c, h(y) = d] = 1/p^2\). \(\checkmark\)

This construction uses only \(2\log_2 p = O(\log p)\) random bits (choosing \(a\) and \(b\)) to define the hash function. Contrast with a truly random function from \(\mathbb{F}_p\) to \(\mathbb{F}_p\), which requires \(p \log_2 p = \Omega(p)\) bits. The savings in randomness come at no cost in the pairwise independence property. For \(k\)-wise independence (any \(k\) inputs are independent), use polynomials of degree \(k-1\) over \(\mathbb{F}_p\): \(h(x) = a_{k-1}x^{k-1} + \cdots + a_1 x + a_0\). This requires \(k\log_2 p\) random bits and gives \(k\)-wise independence.
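The defining property can be checked exhaustively for a small prime — a verification sketch with \(p = 5\):

```python
from itertools import product

p = 5  # work over F_5; the family h_{a,b}(x) = ax + b has p^2 = 25 members

pairs_checked = 0
for x, y in product(range(p), repeat=2):
    if x == y:
        continue
    # Tally, over all 25 functions, which output pair (h(x), h(y)) occurs.
    hits = {}
    for a, b in product(range(p), repeat=2):
        key = ((a * x + b) % p, (a * y + b) % p)
        hits[key] = hits.get(key, 0) + 1
    # Pairwise independence: every output pair is hit by exactly one function,
    # i.e. probability exactly 1/p^2 for each (c, d).
    assert all(hits.get((c, d), 0) == 1 for c, d in product(range(p), repeat=2))
    pairs_checked += 1
print(pairs_checked, "ordered input pairs verified")
```

The check succeeds because, for \(x \neq y\), the map \((a, b) \mapsto (h_{a,b}(x), h_{a,b}(y))\) is a bijection of \(\mathbb{F}_p^2\) — exactly the unique-solution argument in the example above.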

Remark (Derandomization via limited independence). The Chernoff bound requires full independence, but many weaker results hold with only \(O(\log n)\)-wise independence. For example: the Lovász Local Lemma can sometimes be implemented with \(O(\log n)\)-wise independence (Moser-Tardos uses full independence but the seed length is polylogarithmic); the MAX-CUT random algorithm achieves \(m/2\) expected cut edges with only pairwise independence (since only covariances matter); and many randomised algorithms can be "pseudo-randomised" using hash families as a replacement for truly random strings. The theory of pseudorandom generators (PRGs) formalises this: a PRG is a function \(G: \{0,1\}^s \to \{0,1\}^n\) with seed length \(s \ll n\) whose output is computationally indistinguishable from a uniform random string. Constructing explicit PRGs for combinatorial algorithms is one of the central problems of derandomization theory.
Theorem (Lovász Local Lemma for graph colouring). If \(G\) is a graph with maximum degree \(\Delta \geq 2\), then \(\chi(G) \leq \lceil 2e\Delta \rceil\). (Refinements of the LLL give sharper bounds for colourings under local structural constraints, such as girth conditions.)
Proof. Choose a random colouring of \(V(G)\) with \(c = \lceil 2e\Delta \rceil\) colours uniformly and independently. For each edge \(\{u,v\}\), let \(A_{uv}\) be the event that \(u\) and \(v\) receive the same colour; then \(\Pr[A_{uv}] = 1/c\). The event \(A_{uv}\) is mutually independent of all events indexed by edges disjoint from \(\{u,v\}\), so it depends on at most \(d = 2(\Delta-1) = 2\Delta - 2\) other events. The symmetric LLL condition \(e \cdot \Pr[A_{uv}] \cdot (d+1) \leq 1\) reads \(e(2\Delta-1)/c \leq 1\), which holds for \(c = \lceil 2e\Delta \rceil\). Hence with positive probability no event occurs, and the colouring is proper. \(\square\)
Remark (The LLL vs. greedy colouring). The greedy bound \(\chi(G) \leq \Delta + 1\) is substantially better than what the LLL gives for plain proper colouring, which loses a constant factor; the LLL is not competitive with greedy for this specific problem. However, the LLL becomes powerful in constrained settings: list colouring (where each vertex has its own list of allowed colours), defective colouring (where some monochromatic edges are allowed), acyclic colouring, and improper colouring. In these settings, the LLL allows constraints between non-adjacent vertices to be handled gracefully, something the greedy algorithm cannot do.

Expander graphs provide perhaps the deepest connection between probabilistic combinatorics and algorithms.

Definition (Spectral expander). A \(d\)-regular graph \(G\) on \(n\) vertices is an \((n,d,\lambda)\)-expander if every adjacency eigenvalue other than \(d\) is at most \(\lambda < d\) in absolute value.
Theorem (Expander Mixing Lemma). In a \((n,d,\lambda)\)-expander, for any two sets \(S, T \subseteq V\), \[ \left|e(S,T) - \frac{d|S||T|}{n}\right| \leq \lambda \sqrt{|S||T|}, \] where \(e(S,T)\) is the number of edges between \(S\) and \(T\).

This lemma shows that expanders behave like random graphs with respect to edge distribution. Since random regular graphs are expanders (by probabilistic arguments), explicit constructions of expanders (Ramanujan graphs, zig-zag products) provide the “derandomized” analogues of random graphs for applications in coding theory and pseudorandomness.
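The mixing inequality can be checked concretely. The sketch below uses the Petersen graph as an illustration (its adjacency spectrum is \(3, 1^{(5)}, (-2)^{(4)}\), so \(\lambda = 2\) — taken as a known fact here rather than computed), sampling random set pairs:

```python
import random
from math import sqrt

# Petersen graph: 3-regular on 10 vertices; adjacency eigenvalues are
# 3, 1 (x5), -2 (x4), so lambda = max(|1|, |-2|) = 2.
edges = (
    [(i, (i + 1) % 5) for i in range(5)]
    + [(i, i + 5) for i in range(5)]
    + [(i + 5, (i + 2) % 5 + 5) for i in range(5)]
)
n, d, lam = 10, 3, 2

def e_between(S, T):
    """Ordered count e(S,T): pairs (u,v) with u in S, v in T, u ~ v."""
    return sum((u in S and v in T) + (v in S and u in T) for u, v in edges)

rng = random.Random(2)
for _ in range(2000):
    S = {v for v in range(n) if rng.random() < 0.5}
    T = {v for v in range(n) if rng.random() < 0.5}
    lhs = abs(e_between(S, T) - d * len(S) * len(T) / n)
    assert lhs <= lam * sqrt(len(S) * len(T)) + 1e-9
print("expander mixing lemma verified on 2000 random set pairs")
```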


Chapter 9: Applications in Number Theory, Geometry, and Information Theory

The probabilistic method reaches far beyond graph theory. This final chapter surveys its impact across number theory, discrete geometry, information theory, and the connections to modern extremal combinatorics.

Number Theory

Schur’s theorem states that for any \(r\)-colouring of \(\{1,\ldots,N\}\) with \(N\) large enough, there exists a monochromatic solution to \(x + y = z\). Probabilistic arguments give lower bounds on how large \(N\) must be.

Remark (Probabilistic bounds on Schur numbers). The Schur number \(S(r)\) is the maximum \(N\) such that \(\{1, \ldots, N\}\) can be \(r\)-coloured without a monochromatic solution to \(x + y = z\). The first moment method gives only a weak bound: for a random \(r\)-colouring, the expected number of monochromatic triples \((x,y,z)\) with \(x+y=z\) is at most \(N^2 \cdot r^{-2}\) (each pair \((x,y)\) forces \(z = x+y\), and both must have the same colour as \(z\), probability \(r^{-2}\)). When this is less than 1, some colouring avoids all monochromatic triples, giving \(S(r) = \Omega(r)\). A standard Lovász Local Lemma computation improves this to \(S(r) = \Omega(r^2)\), since each triple shares an element with only \(O(N)\) other triples. Explicit recursive constructions do far better: \(S(r+1) \geq 3S(r) + 1\), hence \(S(r) \geq (3^r - 1)/2\), exponential in \(r\).
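The smallest case can be settled exhaustively — a quick verification sketch that \(S(2) = 4\) (that is, \(\{1,\ldots,4\}\) admits a good 2-colouring while \(\{1,\ldots,5\}\) does not):

```python
from itertools import product

def has_mono_schur_triple(colouring):
    """colouring[i] is the colour of i+1; look for monochromatic x + y = z."""
    N = len(colouring)
    col = {i + 1: c for i, c in enumerate(colouring)}
    return any(
        col[x] == col[y] == col[x + y]
        for x in range(1, N + 1)
        for y in range(x, N - x + 1)   # x <= y and x + y <= N
    )

good_4 = [c for c in product(range(2), repeat=4) if not has_mono_schur_triple(c)]
good_5 = [c for c in product(range(2), repeat=5) if not has_mono_schur_triple(c)]
print(len(good_4), len(good_5))  # prints: 2 0
```

The two good colourings of \(\{1,\ldots,4\}\) are \(\{1,4\}\)/\(\{2,3\}\) and its colour swap; every one of the 32 colourings of \(\{1,\ldots,5\}\) contains a monochromatic triple.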

The probabilistic method gives large Sidon sets, as seen in Chapter 2. More generally, for \(B_h\) sets (where all \(h\)-fold sums are distinct), the alteration method gives sets of size about \(n^{1/(2h-1)}\) in \([n]\), while the optimal order \(\Theta(n^{1/h})\) comes from algebraic constructions (Singer, Bose-Chowla); the matching upper bound \(O(n^{1/h})\) is a simple counting argument.

Remark (The probabilistic method in additive combinatorics). The interplay between probabilistic combinatorics and additive combinatorics has deepened significantly since 2000. The "sparse random sets" version of Szemerédi's theorem (Conlon-Fox-Zhao, 2015; Schacht, 2016): if \(A\) is a \(p\)-random subset of \(\{1,\ldots,N\}\) (each element included independently with probability \(p\)), then whp every subset \(B \subseteq A\) with \(|B| \geq \delta|A|\) contains a 3-term arithmetic progression, provided \(p \geq C(\delta) N^{-1/2}\). This "sparse regularity" result requires the full machinery of transference from the dense setting (using sparse regularity lemmas) and is one of the deepest results combining probabilistic methods with additive combinatorics.

More concretely: Roth’s theorem (dense case) says every subset of \([N]\) with density \(\delta > 0\) contains a 3-AP. The probabilistic extension says that even in a random sparse subset of density \(p = N^{-1/2+\varepsilon}\), the Roth density condition still implies the existence of a 3-AP. This is tight: for \(p \ll N^{-1/2}\), the random set contains fewer 3-APs than elements in expectation, so deleting one element per 3-AP leaves a subset of relative density \(1 - o(1)\) inside \(A\) with no 3-AP at all — consistent with the \(p = N^{-1/2}\) threshold.

Sets avoiding arithmetic progressions provide a deep application. Behrend’s construction (1946) of 3-AP-free subsets of \([n]\) of size \(n \exp(-C\sqrt{\log n})\) uses a probabilistic ingredient: project a dense subset of the sphere onto a grid and observe that spheres contain no 3-term APs (since \(a + c = 2b\) and \(\|a\|^2 = \|b\|^2 = \|c\|^2\) force \(a = b = c\) by the parallelogram law).

Remark (Behrend's construction and the Kelley-Meka bound). Behrend's 3-AP-free set of density \(\exp(-C\sqrt{\log n})\) remained essentially the record for 60 years; Elkin (2008) and Green-Wolf (2010) improved it only by lower-order factors. In the other direction, Kelley and Meka (2023) proved that every subset of \([n]\) of density at least \(\exp(-c(\log n)^{1/12})\) contains a 3-AP — a dramatic improvement on earlier upper bounds, and of the same shape as Behrend's construction, suggesting that Behrend-type densities may be close to the truth.

At the other extreme, the Green-Tao theorem (2004) states that the prime numbers (a set of density 0 in \(\mathbb{N}\)) contain arithmetic progressions of every finite length. The proof combines Szemerédi’s theorem with a “transference principle”: one shows that the primes sit inside a pseudorandom majorant, so that, for the purpose of counting APs, Szemerédi’s theorem (which applies to dense sets) transfers to the primes. This is a profound application of probabilistic reasoning to number theory — the primes are not dense, but they are “pseudorandom” in a precise technical sense that makes the Szemerédi machinery applicable.

Geometry

In discrete geometry, the probabilistic method is central to the theory of \(\varepsilon\)-nets, which underpin nearly all of modern computational geometry.

Theorem (Turán's theorem — probabilistic proof). For a \(K_{r+1}\)-free graph \(G\) on \(n\) vertices (containing no complete subgraph on \(r+1\) vertices), the number of edges satisfies \[ m \leq \left(1 - \frac{1}{r}\right)\frac{n^2}{2}. \]
Proof (probabilistic). Choose a uniformly random permutation \(\sigma\) of \(V(G)\), and let \(C\) be the "greedy clique": include \(v\) in \(C\) iff \(v\) is adjacent to every vertex preceding it in \(\sigma\). Any two vertices of \(C\) are adjacent (the later one is adjacent to the earlier one), so \(C\) is a clique. A vertex \(v\) lands in \(C\) precisely when it precedes all \(n - 1 - d_v\) of its non-neighbours, so \(\Pr[v \in C] = 1/(n - d_v)\), and by convexity of \(x \mapsto 1/(n-x)\),
\[ \mathbb{E}[|C|] = \sum_{v \in V} \frac{1}{n - d_v} \geq \frac{n}{n - \bar d} = \frac{n^2}{n^2 - 2m}, \]
where \(\bar d = 2m/n\) is the average degree. Since \(G\) is \(K_{r+1}\)-free, every clique has at most \(r\) vertices, so \(r \geq \mathbb{E}[|C|] \geq n^2/(n^2 - 2m)\); rearranging gives \(m \leq \left(1 - \frac{1}{r}\right)\frac{n^2}{2}\). \(\square\)

Remark (Turán's theorem as a first moment bound). The probabilistic proof is a first-moment argument: one computes the expectation of a clique-counting random variable and compares it with the structural constraint \(\omega(G) \leq r\); it is the clique-side mirror of the Caro-Wei bound \(\alpha(G) \geq \sum_v 1/(d_v + 1)\). The Turán graph \(T(n,r)\) (complete \(r\)-partite with equal parts) achieves equality and is the unique extremal graph — a fact proved by a separate stability argument. This probabilistic perspective generalises to the hypergraph Turán problem and to Ramsey-Turán theory, where one seeks the maximum number of edges in a \(K_{r+1}\)-free graph with bounded independence number.
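The extremal example is easy to check directly — a small verification sketch for \(T(9,3) = K_{3,3,3}\), confirming the edge count and \(K_4\)-freeness:

```python
from itertools import combinations

# Turan graph T(9,3) = K_{3,3,3}: three parts of size 3, all cross edges.
n, r = 9, 3
part_of = {v: v // 3 for v in range(n)}
edges = {(u, v) for u, v in combinations(range(n), 2) if part_of[u] != part_of[v]}

# Edge count meets the Turan bound (1 - 1/r) n^2 / 2 = 27 with equality.
assert len(edges) == n * n * (r - 1) // (2 * r) == 27

# K4-freeness: any 4 vertices include two in one part (pigeonhole) — a non-edge.
for quad in combinations(range(n), 4):
    assert any((u, v) not in edges for u, v in combinations(quad, 2))
print("T(9,3): 27 edges, K4-free, Turan bound met with equality")
```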
Definition (\(\varepsilon\)-net). For a range space \((X, \mathcal{R})\) and \(\varepsilon > 0\), a set \(N \subseteq X\) is an \(\varepsilon\)-net if every range \(R \in \mathcal{R}\) with \(|R| \geq \varepsilon|X|\) contains at least one point of \(N\).
Theorem (Haussler-Welzl \(\varepsilon\)-net theorem, 1987). Let \((X, \mathcal{R})\) be a range space with VC dimension \(d\). For any \(\varepsilon, \delta > 0\), a random sample of \(m = O\!\left(\frac{d}{\varepsilon}\log\frac{d}{\varepsilon} + \frac{1}{\varepsilon}\log\frac{1}{\delta}\right)\) points chosen uniformly and independently from \(X\) is an \(\varepsilon\)-net with probability at least \(1-\delta\).
Proof sketch. Suppose some "heavy" range \(R\) with \(|R| \geq \varepsilon|X|\) is missed by a sample \(N\) of size \(m\). The probability that a single random point misses \(R\) is at most \(1-\varepsilon\). Thus \(\Pr[N \cap R = \emptyset] \leq (1-\varepsilon)^m \leq e^{-\varepsilon m}\). A union bound over all heavy ranges fails because there are too many ranges. Instead, use the VC dimension to bound the number of "distinct" ranges induced on a sample: by the Sauer-Shelah lemma, at most \(O((2m)^d)\) ranges are non-empty on any fixed sample. A double-counting (or double-sampling) argument then gives the result. \(\square\)

The \(\varepsilon\)-net theorem is the foundation of PAC (probably approximately correct) learning theory and underpins algorithms for range searching, minimum enclosing balls, and linear programming in fixed dimension.

Example (\(\varepsilon\)-nets for halfplanes). Let \(X = P\) be a set of \(n\) points in \(\mathbb{R}^2\) and \(\mathcal{R}\) be the ranges defined by halfplanes: for each halfplane \(h\), the range is \(h \cap P\). The VC dimension of this range space is \(d = 3\) (three points in general position can be shattered — every subset can be cut off by a halfplane — but no set of 4 points can be shattered). The \(\varepsilon\)-net theorem guarantees that a random sample of \(m = O(1/\varepsilon \cdot \log(1/\varepsilon))\) points from \(P\) is an \(\varepsilon\)-net: every halfplane containing at least \(\varepsilon n\) points of \(P\) contains at least one sample point.
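For a compact runnable check, one can use an even simpler range space than halfplanes — intervals on the line, VC dimension 2 (a hedged illustration; the halfplane case is analogous but harder to verify exhaustively). Here a sample is an \(\varepsilon\)-net iff no run of \(\varepsilon n\) consecutive points is entirely unsampled:

```python
import random

# Ground set: points 0..n-1; ranges: all contiguous intervals.
n, eps = 1000, 0.1
m = 300  # sample size, comfortably above the (1/eps) log(1/eps) scale

rng = random.Random(3)
sample = sorted(rng.randrange(n) for _ in range(m))

def is_eps_net(sample, n, eps):
    """A sample hits every interval of >= eps*n points iff its largest
    gap of consecutive unsampled points is shorter than eps*n."""
    heavy = int(eps * n)
    boundaries = [-1] + sample + [n]
    max_gap = max(b - a - 1 for a, b in zip(boundaries, boundaries[1:]))
    return max_gap < heavy

print("is eps-net:", is_eps_net(sample, n, eps))
```

With these parameters the failure probability is astronomically small (roughly \(n(1-\varepsilon)^m\)), matching the theorem's \(\log(1/\delta)\) dependence.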

Application to geometric algorithms: the “random sampling” technique of Clarkson and Shor uses this idea. A random sample \(R\) of \(r\) points from \(P\) partitions the problem: each edge of the hull of \(R\) has an associated “conflict list” of points of \(P\) lying outside the current hull near that edge, and \(\varepsilon\)-net-type arguments show that with high probability every conflict list has size \(O((n/r)\log r)\). Recursing on the conflict lists yields output-sensitive randomised hull algorithms such as the expected \(O(n \log h)\) bound (where \(h\) is the number of hull vertices), and the same sampling paradigm gives optimal randomised algorithms for segment intersection and halfspace range searching.

Remark (Sauer-Shelah lemma and VC dimension). The VC dimension \(d\) governs the complexity of range spaces in a precise sense. The Sauer-Shelah lemma states: if a range space \((X, \mathcal{R})\) has VC dimension \(d\), then the number of distinct subsets induced on any \(n\)-element set is at most \(\sum_{i=0}^d \binom{n}{i} = O(n^d)\). This polynomial growth (vs. the exponential \(2^n\) in the unrestricted case) is what allows the double-counting argument in the \(\varepsilon\)-net proof to work: we union-bound over only \(O(m^d)\) distinct "breakpoints" of the ranges on the sample, giving the logarithmic factor in the sample size. In learning theory, VC dimension is the fundamental measure of the "complexity" of a hypothesis class, and the \(\varepsilon\)-net theorem is the probabilistic foundation of PAC learnability: a class is PAC learnable iff its VC dimension is finite.

The Szemerédi-Trotter theorem on point-line incidences is another classical gem, proved probabilistically.

Remark (The probabilistic method in computational geometry — cuttings). Beyond \(\varepsilon\)-nets, random sampling underlies the theory of cuttings: a partition of \(\mathbb{R}^d\) into simplices such that each simplex intersects at most \(n/r\) of a given collection of hyperplanes. A random sample of \(r\) hyperplanes from \(n\) gives the starting point: the complement of the sample's arrangement has \(O(r^d)\) cells (by the Upper Bound Theorem for convex polytopes), and each cell is intersected by \(O(n/r \cdot \log r)\) of the remaining hyperplanes on average. After a technical pruning (using the probabilistic method to bound the worst-case intersections), this gives a \((1/r)\)-cutting of size \(O(r^d)\). Cuttings are the foundation of optimal algorithms for halfspace range searching (\(O(n^{1-1/d})\) query time) and for Hopcroft's incidence problem.
Theorem (Szemerédi-Trotter 1983). For \(m\) points and \(n\) lines in \(\mathbb{R}^2\), the number of incidences satisfies \[ I(P, L) = O\!\left(m^{2/3}n^{2/3} + m + n\right). \]
Proof (Székely's probabilistic version). Use the crossing number inequality from Chapter 2. Build a graph \(G\) on the \(m\) points by connecting two points if they are consecutive on some line. Then \(|V(G)| = m\) and \(|E(G)| \geq I - n\) (since each line with \(k\) incidences contributes \(k-1\) edges). Drawing \(G\) with straight-line edges, every crossing occurs at an intersection of two lines, so \(\mathrm{cr}(G) \leq \binom{n}{2} \leq n^2\). If \(I - n < 4m\) then \(I = O(m + n)\) and we are done; otherwise the crossing number inequality gives \(\mathrm{cr}(G) \geq (I-n)^3/(64m^2)\), so \(n^2 \geq (I-n)^3/(64m^2)\) and \(I = O(m^{2/3}n^{2/3} + n)\). Combining the two cases, \(I = O(m^{2/3}n^{2/3} + m + n)\). \(\square\)
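The bound is easy to sanity-check by brute force on a small instance; a sketch with grid points and integer lines (the constant 4 in the bound is an illustrative choice, not the optimal one):

```python
from itertools import product

# Points: a 10x10 integer grid; lines: y = a*x + b for small integers a, b.
pts = [(x, y) for x in range(10) for y in range(10)]
lines = [(a, b) for a in range(4) for b in range(10)]

# Count incidences by direct enumeration.
I = sum(1 for (x, y), (a, b) in product(pts, lines) if y == a * x + b)

m, n = len(pts), len(lines)
bound = 4 * (m ** (2 / 3) * n ** (2 / 3) + m + n)   # illustrative constant 4
print(I, round(bound))   # incidences well below the Szemeredi-Trotter bound
```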

Information Theory

Claude Shannon’s 1948 proof that random codes achieve channel capacity is the foundational example of the probabilistic method in information theory. Shannon’s proof is arguably the most consequential early application of the probabilistic method, because it simultaneously established the existence of capacity-achieving codes and made clear that finding such codes efficiently is a separate, highly non-trivial problem — a separation that took six decades to fully close (with polar codes by Arıkan in 2009 and LDPC codes achieving capacity under belief propagation by Richardson-Urbanke in the 2000s).

Theorem (Shannon 1948 — Existence of capacity-achieving codes). For a binary symmetric channel with crossover probability \(p < 1/2\) and capacity \(C = 1 - H(p)\) (where \(H\) is binary entropy), for any rate \(R < C\), there exists a binary code of rate \(R\) and block length \(n\) with error probability \(\exp(-\Omega(n))\).
Proof sketch. Choose a code \(\mathcal{C}\) by selecting \(2^{Rn}\) codewords uniformly at random from \(\{0,1\}^n\). For a fixed codeword \(\mathbf{c}\), the received word after transmission through the channel lies at Hamming distance roughly \(pn\) from \(\mathbf{c}\); by a Chernoff bound, it lies in the Hamming ball of radius \((p+\epsilon)n\) with probability \(1-\exp(-\Omega(n))\). The expected number of other codewords in this ball is roughly \(2^{Rn} \cdot 2^{nH(p+\epsilon)} / 2^n\). Since \(R < 1 - H(p+\epsilon)\) for small \(\epsilon\), this expectation is exponentially small, so the expected decoding-error probability over the random code is exponentially small, and some fixed code achieves it. \(\square\)
Example (Shannon's random code — explicit parameters). Consider a binary symmetric channel with crossover probability \(p = 0.1\) and capacity \(C = 1 - H(0.1) \approx 1 - 0.469 = 0.531\) bits per transmission. Shannon's theorem guarantees that for any rate \(R < 0.531\), say \(R = 0.4\), there exists a binary code of rate 0.4 and block length \(n\) with error probability \(\exp(-\Omega(n))\). Concretely: for a random code with \(2^{0.4n}\) codewords chosen uniformly from \(\{0,1\}^n\), the union-bound term in the error probability is \[ P_e \leq 2^{0.4n} \cdot 2^{nH(0.1+\varepsilon)} / 2^n = 2^{-n(1 - H(0.1+\varepsilon) - 0.4)}. \] For \(\varepsilon = 0.02\) we have \(H(0.12) \approx 0.529\), so \(P_e \leq 2^{-0.07n}\); for \(n = 1000\) this is about \(2^{-70} \approx 10^{-21}\) — negligible in practice (the atypicality term is also exponentially small for large \(n\)). The probabilistic method guarantees existence, but finding explicit codes achieving capacity remained open for 60 years (resolved by polar codes, Arıkan 2009, and LDPC codes).
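These quantities are straightforward to compute; a minimal sketch (note that \(H(p+\varepsilon)\) must be evaluated exactly — the binary entropy is far from linear near \(p = 0.1\), and the choice \(\varepsilon = 0.02\) is illustrative):

```python
import math

def H(x):
    """Binary entropy in bits."""
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)

p, R, eps, n = 0.1, 0.4, 0.02, 1000
C = 1 - H(p)                    # capacity, ~0.531 bits/transmission
exponent = 1 - H(p + eps) - R   # union-bound error exponent
print(round(C, 3), round(exponent, 4))
print(f"P_e <= 2^-{n * exponent:.0f}")
```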

The entropy method for combinatorial bounds uses Shannon entropy as a counting tool. The key inequality is Shearer’s lemma:

Theorem (Shearer's Lemma). Let \((X_1,\ldots,X_n)\) be a joint random variable, and let \(\mathcal{F}\) be a family of subsets of \([n]\) such that each \(i \in [n]\) belongs to at least \(t\) members of \(\mathcal{F}\). Then \[ t \cdot H(X_1,\ldots,X_n) \leq \sum_{S \in \mathcal{F}} H(X_S), \] where \(H(X_S)\) denotes the entropy of the variables indexed by \(S\).

Shearer’s lemma gives clean proofs of several classical combinatorial inequalities. For instance, taking \(\mathcal{F}\) to be the family of the \(n\) coordinate-hyperplane projections (each index lies in exactly \(n-1\) of them) immediately recovers the Loomis-Whitney inequality bounding the volume of a set in terms of its projections. Applied to graph homomorphisms, Shearer’s lemma gives tight bounds on the number of copies of one graph inside another.

Example (Shearer's lemma and counting triangles). Let \(G\) be a graph with \(m\) edges and \(t\) triangles. Pick a uniformly random ordered triangle \((X_1, X_2, X_3)\) (an ordered triple of vertices spanning a triangle of \(G\)), and apply Shearer's lemma with \(\mathcal{F} = \{\{1,2\}, \{1,3\}, \{2,3\}\}\), so that each index \(i \in \{1,2,3\}\) appears in exactly 2 of the 3 subsets: \[ 2 H(X_1, X_2, X_3) \leq H(X_1, X_2) + H(X_1, X_3) + H(X_2, X_3). \] The left side equals \(2 \log(6t)\), since there are \(6t\) ordered triangles and the triple is uniform. Each pair \((X_i, X_j)\) is an ordered edge of \(G\), so \(H(X_i, X_j) \leq \log(2m)\). Hence \(2\log(6t) \leq 3\log(2m)\), i.e. \[ t \leq \frac{(2m)^{3/2}}{6}, \] which is tight up to the constant for complete graphs. The same argument, run with fractional covers of a fixed graph \(H\), bounds the number of copies of \(H\) in any graph with \(m\) edges — a theorem of Friedgut and Kahn.
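The triangle bound \(t \leq (2m)^{3/2}/6\) from Shearer's lemma can be checked directly on small graphs; a sketch on \(K_5\), where it holds with room to spare:

```python
import itertools

def count_triangles(edges):
    """Count triangles by enumerating vertex triples against an adjacency map."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    vs = sorted(adj)
    return sum(1 for a, b, c in itertools.combinations(vs, 3)
               if b in adj[a] and c in adj[a] and c in adj[b])

edges = list(itertools.combinations(range(5), 2))   # K_5: m = 10 edges
t = count_triangles(edges)                          # C(5,3) = 10 triangles
shearer = (2 * len(edges)) ** 1.5 / 6               # (2m)^{3/2}/6 ~ 14.9
print(t, shearer)
```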

Pseudorandomness and Explicit Constructions

A recurring theme in the course is the gap between probabilistic existence and explicit construction. Random objects are often near-optimal, but finding them efficiently — or constructing explicit analogues — is much harder.

Definition (Pseudorandom graph). An \((n,d,\lambda)\)-graph is a \(d\)-regular graph on \(n\) vertices in which every adjacency eigenvalue other than \(d\) is at most \(\lambda\) in absolute value. When \(\lambda \ll d\) (the spectral gap is large), the graph is called pseudorandom: it mimics many properties of \(G(n, d/n)\) while being deterministically constructible.
Theorem (Explicit Ramanujan graphs — Lubotzky-Phillips-Sarnak, 1988). For primes \(p, q \equiv 1 \pmod 4\) with \(p \neq q\), there exist explicit \((p+1)\)-regular Ramanujan graphs — graphs in which every nontrivial eigenvalue \(\lambda\) satisfies \(|\lambda| \leq 2\sqrt{p}\) — on \(q(q^2-1)\) vertices. These graphs are constructed as Cayley graphs of \(\mathrm{PGL}_2(\mathbb{F}_q)\) with generators given by certain quaternion representations.
Remark (The Ramanujan connection). The eigenvalue bound \(\lambda_2 \leq 2\sqrt{d-1}\) for \(d\)-regular Ramanujan graphs is exactly the bound conjectured by Ramanujan for the Fourier coefficients of holomorphic cusp forms of weight 2: the Ramanujan conjecture (Ramanujan's \(\tau\)-function) states that \(|\tau(p)| \leq 2p^{11/2}\), which is equivalent to the Hecke eigenvalues of \(\Delta(z)\) being bounded in absolute value by \(2p^{(k-1)/2}\) for weight \(k = 12\). Deligne's proof (1974) of the Weil conjectures included the Ramanujan conjecture as a special case, and Lubotzky-Phillips-Sarnak used Deligne's theorem to certify the eigenvalue bounds for their explicit constructions. This is one of the most striking applications of deep algebraic geometry (étale cohomology, \(\ell\)-adic representations) to combinatorics.

Modern Developments: Threshold Phenomena

The course concludes with a survey of deep modern results connecting probabilistic combinatorics to Boolean function analysis.

Remark (Random algebraic constructions — Komlós and Babai-Frankl). The probabilistic method applies not only to combinatorial objects but to algebraic ones. A striking example: a random \(n \times n\) \(\pm 1\) matrix is nonsingular over \(\mathbb{R}\) with high probability. Komlós (1967) proved the singularity probability is \(o(1)\); Kahn, Komlós, and Szemerédi (1995) improved this to \(e^{-cn}\) for a constant \(c > 0\), with refinements by Bourgain, Vu, and Wood (2010) and, finally, Tikhomirov (2020), who determined it to be \((1/2 + o(1))^n\). The proofs rest on Littlewood-Offord-type anti-concentration arguments. This is the random matrix theory connection: the probabilistic method gives bounds on the smallest singular value of random matrices, which in turn control the condition number and numerical stability.

Another algebraic application: Babai and Frankl used the probabilistic method to show that most \(k\)-element subsets of \(\mathbb{F}_q^n\) are “independent” in a specific coding-theoretic sense, giving random constructions of codes meeting the Plotkin bound.

Example (Random balanced incomplete block designs). The probabilistic method applies directly to design theory: for a random \((v, k, \lambda)\)-BIBD, include each \(k\)-element subset of a \(v\)-element set as a block with probability \(p = \lambda / \binom{v-2}{k-2}\), chosen so that the expected number of blocks containing any fixed pair is \(\binom{v-2}{k-2} \cdot p = \lambda\). Let \(X_{xy}\) be the indicator that pair \(\{x,y\}\) has the wrong number of blocks (not equal to \(\lambda\)). By the Lovász Local Lemma, if the local dependencies are sparse enough, one can find a choice of blocks where all pairs have exactly \(\lambda\) common blocks — but this only works if the "repair" after the random construction is feasible, which typically requires a global algebraic structure. This is the starting point of Keevash's randomised algebraic construction for designs.
Theorem (Kahn-Kalai-Linial, 1988). There is a universal constant \(c > 0\) such that for any Boolean function \(f: \{0,1\}^n \to \{0,1\}\), the influences \(\mathrm{Inf}_i(f) = \Pr[f(X) \neq f(X^{(i)})]\) satisfy \[ \max_i \mathrm{Inf}_i(f) \geq c \cdot \frac{\mathrm{Var}(f) \cdot \log n}{n}. \] Some variable has influence \(\Omega(\log n / n)\); this is tight for the tribes function of Ben-Or and Linial, for which every influence is \(\Theta(\log n / n)\). (For majority, by contrast, every influence is \(\Theta(1/\sqrt{n})\), comfortably above the KKL bound.)

This theorem says that monotone functions near threshold cannot be “immune” to all individual variables — some variable must have logarithmically large influence. It is a key ingredient in the theory of sharp thresholds for random graph properties.
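Influences of small Boolean functions can be computed exactly by enumeration; a sketch showing that each of majority's five variables has influence \(\binom{4}{2}/2^4 = 0.375\), of order \(1/\sqrt{n}\):

```python
import itertools

def influences(f, n):
    """Exact influences Inf_i(f) = Pr[f(X) != f(X^(i))] by full enumeration."""
    counts = [0] * n
    for x in itertools.product((0, 1), repeat=n):
        for i in range(n):
            y = list(x)
            y[i] ^= 1                       # flip coordinate i
            if f(x) != f(tuple(y)):
                counts[i] += 1
    return [c / 2 ** n for c in counts]

def maj5(x):
    return int(sum(x) >= 3)

print(influences(maj5, 5))   # each equals C(4,2)/2^4 = 0.375
```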

Remark (Influences and sharp thresholds — Friedgut's theorem). Friedgut's theorem (1999) characterises coarse thresholds: a monotone graph property of \(G(n,p)\) has a coarse threshold (a transition window whose width is comparable to \(p\) itself) only if the property is well-approximated by a "local" property, one determined by the appearance of small subgraphs. By the Margulis-Russo lemma, the width of the threshold window is inversely governed by the total influence \(\sum_i \mathrm{Inf}_i(f)\) at the critical probability, so sharp thresholds correspond to large total influence. The KKL theorem complements this: even when the total influence is small, some single edge has influence \(\Omega(\mathrm{Var}(f) \log n / n)\). Properties like connectivity and Hamiltonicity have sharp thresholds; "contains a fixed subgraph" (e.g. a triangle) is the canonical coarse-threshold example, as is "has maximum degree \(\geq k\)" — both are determined by local configurations.

The Fourier-analytic approach: write the indicator function \(f = \mathbf{1}[\mathcal{P}]\) in the Fourier-Walsh basis \(\{W_S\}_{S \subseteq \binom{[n]}{2}}\) where \(W_S = \prod_{e \in S}(2\mathbf{1}[e \in G]-1)\). Then \(\mathrm{Inf}_e(f) = \sum_{S \ni e} \hat{f}(S)^2\) and the total influence equals the “average sensitivity” of \(f\). The KKL theorem says that some edge's influence — the total Fourier weight on sets containing that edge — is \(\Omega(\mathrm{Var}(f)\log n/n)\), connecting the combinatorial notion of influence to the Fourier spectrum.

Theorem (Park-Pham 2023 — Kahn-Kalai Conjecture). For any monotone property \(\mathcal{P}\) whose minimal witnesses have size at most \(\ell\), the threshold \(p_c\) satisfies \[ p_c \leq K \cdot q(\mathcal{P}) \cdot \log \ell, \] where \(K\) is a universal constant and \(q(\mathcal{P})\) is the expectation threshold — informally, the point at which a first-moment calculation over (covers by) potential witnesses stops ruling the property out. For monotone graph properties, \(\ell \leq \binom{n}{2}\), so \(p_c = O(q(\mathcal{P}) \log n)\).

The Kahn-Kalai conjecture (proved by Park and Pham in 2023) resolves a long-standing open problem by showing that thresholds are determined, up to a logarithmic factor, by a first-moment quantity. The gap between threshold and expectation threshold is at most logarithmic for all monotone properties — a remarkable unification of the first moment method with the full theory of threshold phenomena. The proof combines a clever iterative fragmentation argument with the notion of spread measures (also central to the recent progress on the sunflower conjecture by Alweiss, Lovett, Wu, and Zhang), and represents the current state of the art in probabilistic combinatorics.

Remark (The Kahn-Kalai conjecture — what it means). To appreciate the Park-Pham theorem, consider the property \(\mathcal{P}\) of containing a perfect matching in \(G(n,p)\) (with \(n\) even). The expectation threshold \(q(\mathcal{P})\) is the smallest \(q\) such that the expected number of perfect matchings in \(G(n,q)\) is at least 1. A perfect matching consists of \(n/2\) edges, and \(K_n\) has \((n-1)!! = \frac{n!}{(n/2)! \cdot 2^{n/2}}\) perfect matchings, so \(\mathbb{E}[\text{matchings}] = (n-1)!! \cdot q^{n/2}\). Setting this to 1 and using Stirling (\((n-1)!! \approx \sqrt{2}\,(n/e)^{n/2}\)) gives \(q(\mathcal{P}) \approx e/n\). The actual threshold for a perfect matching is \(p^* = (\ln n + c)/n\) (determined by isolated vertices — the same obstruction as connectivity!). The Park-Pham theorem says \(p^* \leq O(q(\mathcal{P}) \log n) = O(\log n / n)\), which matches the actual threshold up to constants.

This example shows the power and the limitations of the theorem: it gives the right order of magnitude (logarithmic factors) but not the exact constant. The expectation threshold is a purely first-moment quantity, yet it captures the true threshold up to \(O(\log n)\). For spanning subgraphs (spanning trees, Hamilton cycles, perfect matchings), the expectation threshold is computable, and the Park-Pham theorem gives tight bounds.
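The expectation threshold for perfect matchings can be evaluated numerically from its first-moment definition; a sketch that solves \((n-1)!!\, q^{n/2} = 1\) exactly and compares with the Stirling asymptotic \(q \sim e/n\):

```python
import math

def q_expectation_threshold(n):
    """Solve (n-1)!! * q^(n/2) = 1 for q (n even): the first-moment threshold."""
    log_dfact = sum(math.log(k) for k in range(n - 1, 0, -2))  # ln (n-1)!!
    return math.exp(-2 * log_dfact / n)

for n in (100, 1000, 10000):
    q = q_expectation_threshold(n)
    print(n, q * n / math.e)     # ratio tends to 1, confirming q ~ e/n
```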

Example (Hamilton cycle threshold). A Hamilton cycle in \(G(n,p)\) requires \(n\) edges forming a single cycle through all \(n\) vertices. The threshold is \(p^* = \ln n / n\) (same as connectivity — the binding constraint is again isolated vertices, since a Hamilton cycle requires each vertex to have degree \(\geq 2\)). More precisely, \(\Pr[G(n,p) \text{ has a Hamilton cycle}] \to 1\) iff \(p = (\ln n + \ln \ln n + \omega(n))/n\) with \(\omega(n) \to +\infty\). The extra \(\ln \ln n\) comes from the requirement that all vertices have degree \(\geq 2\) (the minimum degree condition for a Hamilton cycle, which is stricter than degree \(\geq 1\) for connectivity).

The expectation threshold: \(\mathbb{E}[\text{Hamilton cycles}] = \frac{(n-1)!}{2} p^n\). Setting to 1: \(p^n \sim 2/(n-1)! \sim 2e^n/n^n\), giving \(q \sim e/n\). The Park-Pham theorem gives \(p^* \leq O(q \log n) = O(\log n / n)\) — matching the actual threshold \(p^* = \ln n / n\).


Reflections on the Probabilistic Method

Looking back across the nine chapters, several themes recur.

The power of linearity. Linearity of expectation, despite being an elementary identity, underlies every first-moment argument. Its power comes from decoupling: even when events are highly correlated, their contributions to the expectation simply add, with no correction terms. This decoupling is the key reason why probabilistic arguments often give cleaner bounds than direct combinatorial counting.

The tension between existence and construction. The probabilistic method certifies existence without providing a witness. The degree of difficulty in finding the witness varies enormously: MAX-CUT witnesses are easy (any random assignment suffices on average), LLL witnesses are efficiently findable (Moser-Tardos), but second-moment witnesses (large cliques, Ramsey colourings) are believed to be computationally hard. This tension — between “it exists” and “you can find it” — is one of the deepest themes in modern theoretical computer science.

The role of independence. Full independence is rarely available in combinatorial structures, but the second moment method, LLL, and FKG inequality each provide a framework for “partial independence” that suffices. The LLL requires only a bounded dependency graph; the second moment method requires only that the covariance terms be bounded; FKG requires only monotonicity. Understanding how much independence is truly needed for a given argument is an active and productive area.

Concentration and pseudorandomness. Random objects are concentrated near their mean (Chernoff, Azuma, Talagrand), and this concentration is precisely what makes them “pseudorandom” — dense random graphs behave like quasirandom graphs of the same edge density, and random matrices concentrate around their expected spectra. Explicit pseudorandom objects (Ramanujan graphs, explicit designs, extractors) try to capture this concentration deterministically.

Topic | Key Technique | Primary Reference
Ramsey lower bounds | First moment, union bound | Alon-Spencer Ch. 1
Independence number | Alteration | Alon-Spencer Ch. 2
Clique number in \(G(n,p)\) | Second moment | Alon-Spencer Ch. 4
Lovász Local Lemma | Dependency graphs | Alon-Spencer Ch. 5
FKG inequality | Lattice correlations | Alon-Spencer Ch. 6
Chernoff bounds | MGF method | Alon-Spencer Ch. 7
Azuma / Talagrand | Martingale concentration | Alon-Spencer Ch. 7
High girth + high chromatic number | Alteration, first moment | Alon-Spencer Ch. 8
Expanders | Spectral methods | Alon-Spencer Ch. 9
\(\varepsilon\)-nets | Double sampling | Haussler-Welzl 1987
Shannon capacity | Random codes, Chernoff | Shannon 1948
Shearer’s lemma | Entropy method | Chung-Frankl-Graham-Shearer 1986
Park-Pham theorem | Spread measures | Park-Pham 2023

Chapter 10: The Algorithmic Lovász Local Lemma

The classical LLL guarantees existence but provides no algorithm. For nearly thirty years, a central open problem was whether the LLL witness — an assignment of variables avoiding all bad events — could be found efficiently. The answer, given by Moser and Tardos in 2010, is a beautiful yes: a surprisingly simple resampling procedure finds the witness in expected polynomial time, and its analysis is an elegant application of information-theoretic ideas.

10.1 The Moser-Tardos Resampling Algorithm

The setting is the variable version of the LLL. We have independent random variables \(X_1, \ldots, X_n\) with arbitrary distributions, and a collection of bad events \(A_1, \ldots, A_m\), each depending on some subset \(\mathrm{vbl}(A_i) \subseteq \{1, \ldots, n\}\) of the variables. The dependency graph \(G\) has vertex set \(\{A_i\}\) with edges between events sharing a variable.

Definition (Moser-Tardos algorithm). Sample \(X_1, \ldots, X_n\) independently. While some bad event \(A_i\) holds, pick any such event, resample all variables in \(\mathrm{vbl}(A_i)\) freshly and independently. Repeat until no bad event holds.

Correctness upon termination is immediate: the loop exits only when no bad event holds. The remarkable content is that, under the LLL condition, the algorithm terminates quickly in expectation.

Theorem (Moser-Tardos). Suppose the asymmetric LLL condition holds with witnesses \(x_i \in (0,1)\), i.e. \(\Pr[A_i] \leq x_i \prod_{A_j \sim A_i} (1 - x_j)\). Then the expected number of times \(A_i\) is resampled is at most \(x_i/(1-x_i)\), so the expected total number of resamplings is at most \(\sum_i x_i/(1-x_i)\). Under the symmetric condition \(ep(d+1) \leq 1\) this gives an expected total of at most \(m/d\).
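The resampling loop is a few lines of code. A minimal sketch for 2-colouring a 3-uniform hypergraph (the instance below is an illustrative choice; it does not literally satisfy the symmetric LLL condition, but the resampling walk still terminates almost surely, and quickly in practice):

```python
import random

def moser_tardos_2colouring(n, edges, seed=0):
    """Resample variables of violated events until no edge is monochromatic."""
    rng = random.Random(seed)
    colour = [rng.randrange(2) for _ in range(n)]
    resamples = 0
    while True:
        # Pick any currently-bad event (monochromatic edge).
        bad = next((e for e in edges
                    if len({colour[v] for v in e}) == 1), None)
        if bad is None:
            return colour, resamples
        for v in bad:                 # resample ALL variables of the event
            colour[v] = rng.randrange(2)
        resamples += 1

# A 3-uniform "tight cycle" on 12 vertices: edges {i, i+1, i+2} mod 12.
edges = [(i, (i + 1) % 12, (i + 2) % 12) for i in range(12)]
colouring, r = moser_tardos_2colouring(12, edges)
assert all(len({colouring[v] for v in e}) > 1 for e in edges)
```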

10.2 The Entropy Compression Argument

The proof technique, known as entropy compression, is one of the most elegant arguments in modern combinatorics. The idea is to encode the execution history of the algorithm in a compressed form, and show that long runs would require an impossible amount of information.

Proof sketch (entropy compression). Consider an execution of the algorithm that makes \(t\) resamplings, and let \(\sigma\) be the sequence of events that were resampled (in order). We claim that the "resampling table" — the random values consumed during the run — can be reconstructed from \(\sigma\) together with the final state.

Specifically, run the algorithm in reverse: given the final state and the sequence \(\sigma\) read backwards, one can reconstruct the resampling table backwards. This means the resampling table (a sequence of roughly \(t + n\) fresh samples) is encoded by \((\sigma, \text{final state})\). Crudely, the number of possible sequences \(\sigma\) of length \(t\) over \(m\) events is \(m^t\), and the number of final states is bounded by the size of the probability space, so the table has at most \(m^t \cdot |\Omega|\) encodings, while there are \(|\Omega|^{t+n}\) possible tables; when these counts are incompatible, long runs are impossible. In the full argument, \(\sigma\) is encoded more efficiently via the dependency graph — about \(\log d\) bits per step rather than \(\log m\) — which is where the LLL condition enters; careful bookkeeping with the witness values \(x_i\) gives the bound. \(\square\)

The entropy compression method is more broadly applicable than the specific Moser-Tardos setting: it has been used to prove bounds on nonrepetitive sequences and colourings (Grytczuk-Kozik-Micek, 2013) and appears in proofs of acyclic edge colouring bounds (Esperet-Parreau, 2013).

10.3 Applications of the Algorithmic LLL

Example (Efficient proper colouring via the LLL). Given a graph \(G\) with maximum degree \(\Delta\), a proper colouring from a palette of \(C = \lceil 3e\Delta \rceil\) colours can be found in expected linear time by the Moser-Tardos algorithm. Assign each vertex a uniform random colour in \([C]\). Each edge is "bad" if its endpoints share a colour, which happens with probability \(p = 1/C\). Each bad event depends on at most \(d = 2(\Delta-1)\) others (edges sharing an endpoint). The symmetric condition \(ep(d+1) \leq 1\) reads \(e(2\Delta-1)/(3e\Delta) \leq 1\), which holds for every \(\Delta \geq 1\). Moser-Tardos then terminates after \(O(m/\Delta)\) expected resamplings. (Note that a palette of \(\Delta+1\) colours does not satisfy the LLL condition — there \(p(d+1) \approx 2\), far above \(1/e\) — so the LLL is wasteful for plain proper colouring.)

This is not the best colouring algorithm — greedy achieves a proper \((\Delta+1)\)-colouring in \(O(n + m)\) time deterministically — but the algorithmic LLL shines in constrained settings where greedy fails: list colouring, defective colouring, and colouring with local constraints.

Example (Packet routing and Scheduling). In a network with paths \(P_1, \ldots, P_n\) (each of length at most \(L\)) and congestion \(C\) (maximum number of paths through any edge), the Leighton-Maggs-Rao algorithm routes all packets in \(O(C + L)\) steps using random delays. The algorithmic LLL shows that an explicit schedule exists and can be computed efficiently: each "conflict" (two paths using the same edge at the same time) is a bad event, and the LLL dependency structure follows from the path structure. The resulting schedule achieves the information-theoretic optimum up to constants.

The Moser-Tardos algorithm requires that events depend on independent variables. An extension due to Achlioptas and Iliopoulos (2016) handles flaws with more complex dependency structure, at the cost of weaker bounds. Algorithmic versions of the LLL beyond the variable setting remain an active area of research.


Chapter 11: Coding Theory and the Probabilistic Method

Shannon’s 1948 paper “A Mathematical Theory of Communication” is arguably the founding document of information theory. At its heart is a probabilistic existence argument: random codes achieve the channel capacity. This chapter develops the connections between the probabilistic method and coding theory.

11.1 Random Codes Achieve Shannon Capacity

The binary symmetric channel with crossover probability \(p\) flips each bit independently with probability \(p\). Its capacity is \(C = 1 - H(p)\) where \(H(p) = -p\log_2 p - (1-p)\log_2(1-p)\) is the binary entropy.

Theorem (Shannon, 1948). For any rate \(R < C\) and \(\varepsilon > 0\), there exists a code with \(2^{Rn}\) codewords of length \(n\) such that the probability of decoding error is at most \(\varepsilon\), for all large enough \(n\).
Proof (probabilistic argument). Choose \(2^{Rn}\) codewords uniformly and independently at random from \(\{0,1\}^n\). Given transmitted word \(c\), the received word \(y\) agrees with \(c\) in each position with probability \(1-p\). By the law of large numbers, the received word is within Hamming distance \((p+\varepsilon)n\) from \(c\) with high probability (for \(n\) large). The decoder outputs the nearest codeword.

A decoding error occurs if some other codeword \(c' \neq c\) is as close to \(y\) as \(c\). The Hamming ball of radius \((p+\varepsilon)n\) around \(y\) has size at most \(2^{nH(p+\varepsilon)}\). For a random codeword \(c'\), the probability that \(c'\) falls in this ball is \(2^{nH(p+\varepsilon)}/2^n = 2^{-n(1 - H(p+\varepsilon))}\). By the union bound over the \(2^{Rn} - 1\) other codewords, the probability of any decoding error is at most:

\[ 2^{Rn} \cdot 2^{-n(1 - H(p+\varepsilon))} = 2^{-n(1 - H(p+\varepsilon) - R)}. \]

For \(R < C\) and \(\varepsilon\) small enough, the exponent \(1 - H(p+\varepsilon) - R\) is positive, so this probability goes to 0 exponentially in \(n\). Since the average error (over random codes) is small, some specific code has small error. \(\square\)

The argument is pure first-moment method: the expected error probability over random codes is small, so a good code exists. It is existential and non-constructive — finding explicitly capacity-achieving codes is a major engineering challenge, partially resolved by turbo codes (Berrou-Glavieux, 1993) and LDPC codes (Gallager, 1962, rediscovered 1996).

11.2 The Gilbert-Varshamov Bound

The Gilbert-Varshamov (GV) bound gives the best known asymptotic lower bound on the size of binary codes with prescribed minimum distance.

Theorem (Gilbert-Varshamov bound). There exists a binary code \(C \subseteq \{0,1\}^n\) with minimum distance \(\geq d\) and \(|C| \geq 2^n / \text{Vol}(n, d-1)\), where \(\text{Vol}(n,r) = \sum_{i=0}^r \binom{n}{i}\) is the volume of a Hamming ball of radius \(r\).
Proof (greedy construction). Build the code greedily: add codewords one at a time, each time choosing a vector not within distance \(d-1\) of any previously chosen codeword. The process stops only when every vector in \(\{0,1\}^n\) is within distance \(d-1\) of some codeword — i.e., the code is maximal, and the covering condition implies \(|C| \cdot \text{Vol}(n,d-1) \geq 2^n\). A probabilistic proof (choose codewords at random and alter) also works and is instructive. \(\square\)
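The greedy construction is directly implementable; a sketch for \(n = 8\), \(d = 3\), where the GV bound guarantees at least \(\lceil 256/37 \rceil = 7\) codewords:

```python
import itertools
import math

def greedy_gv_code(n, d):
    """Greedily collect vectors at pairwise Hamming distance >= d."""
    code = []
    for x in itertools.product((0, 1), repeat=n):
        if all(sum(a != b for a, b in zip(x, c)) >= d for c in code):
            code.append(x)
    return code

def vol(n, r):
    """Volume of a Hamming ball of radius r in {0,1}^n."""
    return sum(math.comb(n, i) for i in range(r + 1))

n, d = 8, 3
code = greedy_gv_code(n, d)
print(len(code), 2 ** n / vol(n, d - 1))   # |C| vs. the GV guarantee 256/37
```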

For rate \(R\) and relative distance \(\delta = d/n\), the GV bound gives \(R \geq 1 - H(\delta)\); the Hamming (sphere-packing) bound gives the upper bound \(R \leq 1 - H(\delta/2)\). Whether the binary GV bound is asymptotically tight is one of the deepest open problems in coding theory: no family of binary codes beating it is known, and it is widely conjectured to be tight. Over large alphabets the picture changes — algebraic-geometry codes (Tsfasman-Vlăduț-Zink, 1982) beat the \(q\)-ary GV bound for \(q \geq 49\).

11.3 Linear Codes and the Probabilistic Argument

A linear code is a subspace \(C \leq \mathbb{F}_2^n\). Its generator matrix \(G \in \mathbb{F}_2^{k \times n}\) has rows forming a basis of \(C\). The minimum distance equals the minimum weight of a nonzero codeword.

Theorem (Random linear codes achieve GV). Choose a random linear code \(C \leq \mathbb{F}_2^n\) of dimension \(k\) (equivalently, choose a random generator matrix \(G \in \mathbb{F}_2^{k \times n}\) uniformly). Then with high probability, \(C\) has minimum distance \(\geq (H^{-1}(1 - R) - \varepsilon)n\) where \(R = k/n\).

The proof is a first-moment calculation: each fixed nonzero word lies in \(C\) with probability (essentially) \(2^{k-n}\), so the expected number of nonzero codewords of weight \(\leq \delta n\) is \(\text{Vol}(n, \delta n) \cdot 2^{k-n} = \text{Vol}(n, \delta n) / 2^{n(1-R)}\), which is \(< 1\) when \(R < 1 - H(\delta)\). Hence with positive probability, no such codeword exists. Random linear codes are thus asymptotically as good as the GV bound, and they are far easier to analyse (and to encode) than random nonlinear codes — though decoding them efficiently is believed to be hard.
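The first-moment calculation is a one-liner; a sketch at block length 100, rate \(R = 0.25 < 1 - H(0.2) \approx 0.28\), where the expectation indeed falls below 1:

```python
import math

def expected_low_weight(n, k, d):
    """E[# nonzero codewords of weight < d] for a random linear [n, k] code."""
    # Each fixed nonzero word lies in the code with probability 2^(k-n).
    return sum(math.comb(n, i) for i in range(1, d)) * 2.0 ** (k - n)

# Rate R = 0.25 and relative distance delta ~ 0.2: R < 1 - H(0.2) ~ 0.28.
E = expected_low_weight(100, 25, 20)
print(E)   # < 1, so a [100, 25] code with distance >= 20 exists
```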

Remark (Expander codes). An important explicit construction of good codes uses expander graphs. A Sipser-Spielman expander code is defined by a bipartite expander graph: the left vertices carry the bits of the codeword, the right vertices are parity-check constraints, and the edges specify which bits enter which checks. The expansion property of the graph ensures that any nonzero codeword violates many checks, so errors can be corrected by a local iterative (bit-flipping) algorithm. Expander codes with constant rate, constant relative distance, and linear-time encoding and decoding (due to Spielman, 1995) represent one of the most elegant applications of spectral graph theory to coding theory.

Chapter 12: The Deletion Method and Hypergraph Coloring

The deletion method refines the first-moment argument by adding a “cleaning step”: after a probabilistic construction shows a large structure exists, delete elements causing problems, yielding a smaller but cleaner structure. This chapter develops the method and applies it to hypergraph coloring, discrepancy, and independent sets.

12.1 The Deletion Method

The archetype is the proof that every graph \(G\) with average degree \(d\) has an independent set of size \(\geq n/(2d)\) (a weaker form of the Turán bound).

Theorem (Deletion method for independent sets). Every graph \(G\) on \(n\) vertices with \(m \geq n/2\) edges has an independent set of size at least \(n^2/(4m) = n/(2\bar{d})\), where \(\bar{d} = 2m/n\) is the average degree.
Proof. Include each vertex in a set \(S\) independently with probability \(p\) (to be chosen). Let \(T \subseteq S\) be obtained by deleting one endpoint from each edge of \(G[S]\). Then \(T\) is an independent set and \(\mathbb{E}[|T|] \geq \mathbb{E}[|S|] - \mathbb{E}[e(S)] = pn - p^2 m\). Optimising at \(p = n/(2m) \leq 1\) gives \(\mathbb{E}[|T|] \geq n^2/(4m)\). Since \(T\) has this size in expectation, some choice of \(S\) achieves it, and the independent set exists. \(\square\)

The deletion method appears simple, but its power comes from the flexibility in choosing the deletion rule. The key idea is that we delete at most one vertex per violating structure, so the expected final size is the initial expected size minus the expected number of violations.
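The sample-then-delete step is mechanical; a minimal sketch on a random instance (all parameters are illustrative choices):

```python
import random

def deletion_independent_set(n, edges, seed=0):
    """Sample S with p = n/(2m), then delete one endpoint per surviving edge."""
    rng = random.Random(seed)
    p = min(1.0, n / (2 * len(edges)))
    S = {v for v in range(n) if rng.random() < p}
    for u, v in edges:
        if u in S and v in S:
            S.discard(u)       # removals only shrink S, so S ends independent
    return S

random.seed(1)
n = 60
edges = [(u, v) for u in range(n) for v in range(u + 1, n)
         if random.random() < 0.2]
ind = deletion_independent_set(n, edges)
assert all(not (u in ind and v in ind) for u, v in edges)
print(len(ind), n * n / (4 * len(edges)))   # one run vs. the n^2/(4m) benchmark
```

A single run may fall below the benchmark \(n^2/(4m)\) — the guarantee is only in expectation, so one reruns with fresh randomness as needed.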

12.2 Hypergraph Coloring

A hypergraph \(\mathcal{H} = (V, \mathcal{E})\) is \(k\)-uniform if every edge has \(k\) vertices. A 2-coloring is a function \(\chi: V \to \{+1,-1\}\). An edge \(e\) is monochromatic if all vertices in \(e\) receive the same color. The property B (or 2-colorability) problem asks: does \(\mathcal{H}\) have a 2-coloring with no monochromatic edge?

Theorem (Erdős, 1963). Every \(k\)-uniform hypergraph with fewer than \(2^{k-1}\) edges has property B.
Proof. Color each vertex \(\pm 1\) uniformly and independently. Each edge is monochromatic with probability \(2^{1-k}\). Expected number of monochromatic edges \(= |\mathcal{E}| \cdot 2^{1-k} < 1\). Some coloring avoids all monochromatic edges. \(\square\)

This is tight up to lower-order factors: there exist \(k\)-uniform hypergraphs with \(O(k^2 \cdot 2^k)\) edges that are NOT 2-colorable (Erdős, 1964). In the other direction, Radhakrishnan and Srinivasan (2000) improved the \(2^{k-1}\) bound: every \(k\)-uniform hypergraph with at most \(c \cdot 2^k \sqrt{k/\ln k}\) edges has property B.

The Radhakrishnan-Srinivasan improvement uses a randomised recolouring (semi-random) idea: first colour all vertices uniformly at random, then recolour the vertices of monochromatic edges in a careful random order, and analyse the residual hypergraph.

Theorem (Radhakrishnan-Srinivasan, 2000). Every \(k\)-uniform hypergraph \(\mathcal{H}\) with maximum degree \(\Delta\) has property B if \(\Delta \leq c \cdot 2^k / \sqrt{k}\) for a universal constant \(c > 0\).

The proof is one of the cleanest applications of the Lovász Local Lemma combined with the alteration idea: one first shows that a random 2-coloring leaves few highly monochromatic edges, then uses the LLL to color the residual problem. The threshold \(2^k/\sqrt{k}\) (up to constants) for the maximum degree condition is believed to be tight.

12.3 The Rödl Nibble

The nibble method (Rödl, 1985) is a powerful technique for constructing combinatorial structures by a sequence of random “nibbles” — small random selections that each make a little progress without disturbing the remaining structure too much.

Theorem (Rödl's packing theorem). For any fixed \(k \geq 2\), there exists an almost-perfect packing of \(K_k\) in \(K_n\): a collection of edge-disjoint \(K_k\) subgraphs of \(K_n\) covering all but \(o(n^2)\) edges. More precisely, the number of \(K_k\) in the packing is \((1 - o(1)) \binom{n}{2} / \binom{k}{2}\).

The nibble argument works as follows: at each round, randomly select a sparse random set of \(K_k\) copies (each included independently with a small probability \(p\)). Among those selected, include a copy in the packing only if it doesn’t conflict with previously selected copies. The key computation shows that (1) the expected number of new edges covered in each round is large, (2) the maximum degree of the “residual” graph (uncovered edges) decreases geometrically, and (3) the process terminates with \(o(n^2)\) uncovered edges.
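The conflict-avoidance step can be seen in the random greedy relative of the nibble, sketched below. This is a simplification: the true nibble takes many sparse rounds and tracks residual degrees, which this one-pass version does not.

```python
import itertools, random

def random_greedy_triangle_packing(n, seed=0):
    """Process the triangles of K_n in random order, keeping each one
    iff its three edges are all still uncovered."""
    rng = random.Random(seed)
    triangles = list(itertools.combinations(range(n), 3))
    rng.shuffle(triangles)
    covered, packing = set(), []
    for a, b, c in triangles:
        tri = {(a, b), (a, c), (b, c)}
        if covered.isdisjoint(tri):
            packing.append((a, b, c))
            covered |= tri
    return packing, covered

packing, covered = random_greedy_triangle_packing(12)
# Edge-disjointness invariant: each kept triangle covers 3 new edges.
assert 3 * len(packing) == len(covered)
```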

The nibble method has since been systematized into semi-random and random greedy algorithms, and culminated in Keevash's proof (2014) of the existence of designs (in particular Steiner systems) for all admissible parameters, one of the major results of twenty-first-century combinatorics.

12.4 The Lovász Local Lemma for List Coloring

Theorem (List coloring via LLL). Suppose every vertex \(v\) of a graph \(G\) has a list \(L(v)\) of exactly \(k\) colors, and each color \(c \in L(v)\) appears in the lists of at most \(k/(2e)\) neighbors of \(v\). Then \(G\) has a proper list coloring.
Proof. Color each vertex uniformly at random from its list. For each edge \(\{u,v\}\) and each color \(c \in L(u) \cap L(v)\), let \(A_{uv,c}\) be the event that both \(u\) and \(v\) are colored \(c\); then \(\Pr[A_{uv,c}] = 1/k^2\). The event \(A_{uv,c}\) depends only on the colors of \(u\) and \(v\), so it is mutually independent of all events \(A_{u'v',c'}\) with \(\{u',v'\}\) disjoint from \(\{u,v\}\). At each of the two endpoints the number of events is at most \(\sum_{c' \in L(u)} \#\{\text{neighbors } w : c' \in L(w)\} \leq k \cdot k/(2e) = k^2/(2e)\), so the dependency degree is \(d \leq k^2/e\). Applying the symmetric LLL with \(p = 1/k^2\), the condition \(ep(d+1) \leq 1\) holds up to a lower-order term, which disappears if the constant \(2e\) is replaced by any marginally larger constant. \(\square\)
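LLL existence proofs of this kind can be made algorithmic by Moser-Tardos resampling: repeatedly pick a violated event and resample its variables. A minimal sketch (the 4-cycle instance with 2-color lists is illustrative and far from the theorem's tight regime):

```python
import random

def resample_list_coloring(adj, lists, max_rounds=10000, seed=0):
    """Moser-Tardos style: while some edge is monochromatic, pick a
    violated edge and resample both endpoints from their lists."""
    rng = random.Random(seed)
    color = {v: rng.choice(lists[v]) for v in adj}
    for _ in range(max_rounds):
        bad = [(u, v) for u in adj for v in adj[u]
               if u < v and color[u] == color[v]]
        if not bad:
            return color
        u, v = rng.choice(bad)
        color[u] = rng.choice(lists[u])  # resample the violated event
        color[v] = rng.choice(lists[v])
    return None  # did not converge within max_rounds

# Illustrative instance: a 4-cycle where every list is {red, blue}.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
lists = {v: ["red", "blue"] for v in adj}
c = resample_list_coloring(adj, lists)
assert c is not None and all(c[u] != c[v] for u in adj for v in adj[u])
```

Under LLL-type conditions, Moser-Tardos guarantees an expected number of resamplings linear in the number of events; this sketch simply caps the rounds.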

Chapter 13: The Polynomial Method

The polynomial method is a fundamentally different paradigm from the probabilistic method, but the two interact deeply. Algebraic identities constrain combinatorial structures, and probabilistic arguments can be translated into polynomial language. This chapter introduces the key results.

13.1 The Combinatorial Nullstellensatz

Theorem (Alon, 1999 — Combinatorial Nullstellensatz). Let \(\mathbb{F}\) be a field, let \(f \in \mathbb{F}[x_1, \ldots, x_n]\) have total degree \(t_1 + \cdots + t_n\), and let \(S_1, \ldots, S_n \subseteq \mathbb{F}\). If the coefficient of \(x_1^{t_1} \cdots x_n^{t_n}\) in \(f\) is nonzero, and \(|S_i| > t_i\) for each \(i\), then there exists \((s_1, \ldots, s_n) \in S_1 \times \cdots \times S_n\) such that \(f(s_1, \ldots, s_n) \neq 0\).

The proof is short: consider the polynomial modulo \(\prod_i \prod_{s \in S_i} (x_i - s)\); the degree constraint means the leading monomial survives the reduction, so \(f\) is not identically zero on the grid \(S_1 \times \cdots \times S_n\).

Example (Davenport constant). Let \(G = \mathbb{Z}_n\). The Davenport constant \(D(G)\) is the smallest \(d\) such that every sequence of \(d\) elements of \(G\) has a non-empty zero-sum subsequence. Here \(D(\mathbb{Z}_n) = n\). The sequence of \(n-1\) copies of \(1\) has no zero-sum subsequence, so \(D(\mathbb{Z}_n) > n-1\); conversely, for any \(a_1, \ldots, a_n \in \mathbb{Z}_n\) the prefix sums \(s_0 = 0, s_1 = a_1, \ldots, s_n = a_1 + \cdots + a_n\) take \(n+1\) values in \(\mathbb{Z}_n\), so two coincide and the block between them is a zero-sum subsequence. For prime \(n = p\), the upper bound also follows from the Chevalley-Warning theorem of the next section, applied to \(f(x) = \sum_{i=1}^p a_i x_i^{p-1}\): a non-trivial zero of \(f\) has support \(S\) with \(\sum_{i \in S} a_i = 0\).
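The prefix-sum pigeonhole argument runs in one pass, and both directions of \(D(\mathbb{Z}_n) = n\) can be checked on small inputs:

```python
def zero_sum_subsequence(a, n):
    """Pigeonhole on prefix sums mod n: once len(a) >= n, two of the
    n+1 prefix sums coincide and the block between them sums to 0."""
    seen = {0: 0}                  # prefix-sum value -> index
    s = 0
    for j, x in enumerate(a, start=1):
        s = (s + x) % n
        if s in seen:
            return a[seen[s]:j]    # consecutive zero-sum block
        seen[s] = j
    return None

assert zero_sum_subsequence([3, 5, 5, 2, 4, 1], 6) == [5, 5, 2]
assert zero_sum_subsequence([1] * 5, 6) is None   # n-1 ones: no zero-sum
```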

13.2 The Chevalley-Warning Theorem

Theorem (Chevalley-Warning). Let \(\mathbb{F}_q\) be a finite field with \(q = p^r\). If \(f_1, \ldots, f_k \in \mathbb{F}_q[x_1, \ldots, x_n]\) satisfy \(\sum_{i=1}^k \deg f_i < n\), and \(N\) is the number of common zeros of \(f_1, \ldots, f_k\) in \(\mathbb{F}_q^n\), then \(p \mid N\).
Proof sketch. The key fact: for \(j \geq 0\), \[ \sum_{x \in \mathbb{F}_q} x^j = \begin{cases} -1 & \text{if } j > 0 \text{ and } (q-1) \mid j, \\ 0 & \text{otherwise.} \end{cases} \] Any monomial of total degree less than \(n(q-1)\) has some variable whose exponent is not a positive multiple of \(q-1\), so it sums to \(0\) over \(\mathbb{F}_q^n\); hence \(\sum_{x \in \mathbb{F}_q^n} f(x) = 0\) whenever \(\deg f < n(q-1)\). Apply this to \(f = \prod_{i=1}^k (1 - f_i^{q-1})\), which equals \(1\) at common zeros and \(0\) elsewhere, so \(\sum_x f(x) = N \cdot 1\) in \(\mathbb{F}_q\). Since \(\deg f = (q-1)\sum_i \deg f_i < n(q-1)\), we get \(N \equiv 0 \pmod{p}\). In particular (Chevalley's theorem), if each \(f_i\) has zero constant term, then \(0\) is a common zero, and \(p \mid N\) with \(N \geq 1\) forces a non-trivial common zero. \(\square\)
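The one-variable power-sum fact driving the proof can be verified numerically for small primes (exponents up to \(p-1\) shown here):

```python
# Power sums over F_p: sum_x x^j is -1 mod p when (p-1) | j and j > 0,
# and 0 mod p otherwise (checked for exponents 0 <= j <= p-1).
for p in (3, 5, 7, 11):
    for j in range(p):
        total = sum(pow(x, j, p) for x in range(p)) % p
        expected = (p - 1) if j == p - 1 else 0   # p-1 is -1 mod p
        assert total == expected, (p, j)
```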
Remark (Application to combinatorics). The linear algebra (polynomial space) method, algebraic in the same spirit though not a consequence of Warning's theorem, gives the following: if \(\mathcal{F}\) is a set system on \([n]\) where every pairwise intersection has size \(\equiv 0 \pmod{p}\) but \(|F| \not\equiv 0 \pmod{p}\) for each \(F \in \mathcal{F}\), then \(|\mathcal{F}| \leq \binom{n}{0} + \binom{n}{1} + \cdots + \binom{n}{p-1}\). This is the **Frankl-Wilson theorem**, and it gives:
  • An exponential lower bound for the chromatic number of the unit-distance graph of \(\mathbb{R}^n\): \(\chi(\mathbb{R}^n) \geq (1.2)^n\), proved by Frankl and Wilson (1981).
  • Borsuk’s conjecture counterexample: there exist bounded sets in \(\mathbb{R}^n\) that cannot be partitioned into fewer than \((1.2)^{\sqrt{n}}\) parts of smaller diameter, due to Kahn and Kalai (1993).

These combinatorial bounds come from pure algebra, with no probabilistic component — a reminder that the probabilistic method, powerful as it is, is one tool among many.

13.3 The Cap Set Problem

The cap set problem asks for the maximum size of a subset of \(\mathbb{F}_3^n\) with no 3-term arithmetic progression (equivalently, no three elements summing to zero). This is the “cap set” problem because \(\mathbb{F}_3^n\) is an \(n\)-dimensional affine space over \(\mathbb{F}_3\), and arithmetic progressions are affine lines.

For decades, the best upper bound had the shape \(3^n/n^{1+\varepsilon}\) (Bateman-Katz, 2012, via Fourier analysis), only polynomially better than the trivial \(3^n\). In 2016, Croot-Lev-Pach introduced a polynomial method for the analogous problem in \(\mathbb{Z}_4^n\), and within weeks Ellenberg and Gijswijt adapted it to \(\mathbb{F}_3^n\), obtaining the first exponentially smaller bound via the slice rank method: cap sets have size at most \(O(2.756^n)\).

Theorem (Ellenberg-Gijswijt, 2017). Any subset of \(\mathbb{F}_3^n\) containing no 3-term arithmetic progression has size at most \(O(2.756^n)\).

The proof uses the following key idea: encode the 3-AP condition via the polynomial \(f(x,y,z) = \prod_{i=1}^n (1 - (x_i + y_i + z_i)^2)\), which over \(\mathbb{F}_3\) equals \(1\) when \(x + y + z = 0\) and \(0\) otherwise. The slice rank of this tensor (the minimum number of terms in a decomposition where each term is a product of a function of one variable and a function of the other two) controls the size of a 3-AP-free set \(A\): the diagonal restriction to \(A\) forces slice rank at least \(|A|\), while the low degree of \(f\) bounds the slice rank by roughly \(2.756^n\). This method has since been applied to many other problems (sunflower-free sets, matrix multiplication complexity) and represents one of the most active areas in extremal combinatorics.
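For tiny dimensions the extremal cap sizes can be brute-forced, which is a useful sanity check on the definitions (three distinct points of \(\mathbb{F}_3^n\) form a line iff they sum to zero coordinatewise):

```python
import itertools

def max_cap(n):
    """Brute-force the largest 3-AP-free subset of F_3^n (tiny n only).
    Three distinct points form a line iff they sum to 0 coordinatewise."""
    points = list(itertools.product(range(3), repeat=n))
    def has_line(S):
        return any(all((x[i] + y[i] + z[i]) % 3 == 0 for i in range(n))
                   for x, y, z in itertools.combinations(S, 3))
    for r in range(len(points), 0, -1):
        if any(not has_line(S) for S in itertools.combinations(points, r)):
            return r
    return 0

assert max_cap(1) == 2   # F_3 itself is the unique line
assert max_cap(2) == 4   # the classical maximum cap in AG(2,3)
```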


Chapter 14: Turán-Type Problems and the Probabilistic Method

14.1 Turán’s Theorem and Its Probabilistic Proof

Theorem (Turán, 1941). The maximum number of edges in a \(K_{r+1}\)-free graph on \(n\) vertices is achieved (uniquely) by the complete \(r\)-partite graph \(T(n,r)\) with nearly equal part sizes, which has \(\left(1 - \frac{1}{r}\right)\frac{n^2}{2} + O(n)\) edges.

The classical proof is algebraic-combinatorial. A probabilistic complement: the Turán number \(\mathrm{ex}(n, K_{r+1})\) also arises naturally from the probabilistic method.

Remark (Probabilistic proof of Turán). Turán's bound follows from a random-permutation argument applied to cliques. Order the vertices by a uniformly random permutation and let \(C\) be the set of vertices adjacent to all of their predecessors; then \(C\) is a clique, and \(v \in C\) exactly when \(v\) precedes all of its \(n - 1 - d(v)\) non-neighbors, which happens with probability \(1/(n - d(v))\). Hence \[ \mathbb{E}[|C|] = \sum_v \frac{1}{n - d(v)} \geq \frac{n}{n - \bar{d}}, \qquad \bar{d} = \frac{2m}{n}, \] by convexity. If \(G\) is \(K_{r+1}\)-free, every clique has at most \(r\) vertices, so \(r \geq n^2/(n^2 - 2m)\); rearranging gives \(m \leq (1 - 1/r)n^2/2\). This recovers Turán's bound but not the uniqueness of the extremal graph; for that, Zykov symmetrization (repeatedly replacing a vertex by a copy of a non-adjacent vertex of larger degree) is the cleaner route.
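The independent-set form of this permutation argument (the Caro-Wei bound: keep \(v\) iff it precedes all its neighbors, so \(\mathbb{E}[|I|] = \sum_v 1/(d(v)+1)\)) is easy to run; the 5-cycle below is an arbitrary test graph.

```python
import random

def caro_wei_independent_set(adj, seed=0):
    """Random-permutation argument: keep v iff it precedes all its
    neighbors; each v survives with probability 1/(deg(v)+1)."""
    rng = random.Random(seed)
    order = list(adj)
    rng.shuffle(order)
    pos = {v: i for i, v in enumerate(order)}
    return {v for v in adj if all(pos[v] < pos[u] for u in adj[v])}

# Arbitrary test graph: a 5-cycle, where E|I| = 5 * (1/3) = 5/3.
adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
I = caro_wei_independent_set(adj)
assert all(not (u in I and v in I) for u in adj for v in adj[u])
assert len(I) >= 1   # the first vertex in the order is always kept
```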

14.2 The Kruskal-Katona Theorem

The Kruskal-Katona theorem gives a sharp bound on the size of the shadow of a set system — a cornerstone of extremal set theory.

Definition (Shadow). For a family \(\mathcal{F} \subseteq \binom{[n]}{k}\) of \(k\)-element sets, the shadow \(\partial \mathcal{F} \subseteq \binom{[n]}{k-1}\) consists of all \((k-1)\)-element sets that are subsets of some member of \(\mathcal{F}\).
Theorem (Kruskal 1963, Katona 1968). Among all families \(\mathcal{F} \subseteq \binom{[n]}{k}\) with \(|\mathcal{F}| = m\), the one minimizing \(|\partial \mathcal{F}|\) is the family consisting of the first \(m\) sets in the **colexicographic order** on \(\binom{[n]}{k}\).

The probabilistic method gives a clean proof of the weaker bound \(|\partial \mathcal{F}| \geq (k/n) |\mathcal{F}|\): pick a random element uniformly from \([n]\), and let \(X\) be the number of sets in \(\mathcal{F}\) containing it. Then \(\mathbb{E}[X] = k |\mathcal{F}|/n\). For each element \(x\), the sets in \(\mathcal{F}\) containing \(x\) project down to sets in \(\partial \mathcal{F}\) by removing \(x\). Different projection images are distinct, giving \(|\partial \mathcal{F}| \geq \mathbb{E}[X] = k|\mathcal{F}|/n\).
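A quick numerical check of this weak bound, on an arbitrary small family (the first ten 3-subsets of \([6]\) in lexicographic order, i.e. the star at 0):

```python
from itertools import combinations

def shadow(family):
    """All (k-1)-sets contained in some member of a k-uniform family."""
    return {F - {x} for F in family for x in F}

n, k = 6, 3
family = [frozenset(s) for s in combinations(range(n), k)][:10]
assert len(shadow(family)) == 15                    # all pairs meeting 0's star
assert len(shadow(family)) >= k * len(family) / n   # the weak bound: 15 >= 5
```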

14.3 Ramsey Multiplicity

Ramsey multiplicity asks not just whether a monochromatic copy of a graph \(H\) exists, but how many must exist in any 2-coloring of \(K_n\).

Definition (Ramsey multiplicity). The Ramsey multiplicity \(M(H, n)\) is the minimum number of monochromatic copies of \(H\) over all 2-colorings of the edges of \(K_n\).
Theorem (Goodman's formula for triangles). Color the edges of \(K_n\) red and blue, and write \(d_r(v), d_b(v)\) for the red and blue degrees of \(v\), so \(d_r(v) + d_b(v) = n - 1\). Since each non-monochromatic triangle has exactly two vertices whose two triangle edges differ in color, the number of monochromatic triangles is exactly \[ t_3(\text{red}) + t_3(\text{blue}) = \binom{n}{3} - \frac{1}{2}\sum_{v} d_r(v)\, d_b(v). \] The subtracted sum is maximized when \(d_r(v) = d_b(v) = (n-1)/2\) for every \(v\), so \[ M(K_3, n) \geq \binom{n}{3} - \frac{n}{2}\left(\frac{n-1}{2}\right)^2 = \frac{1}{4}\binom{n}{3} - O(n^2). \]

A random 2-coloring gives \(\mathbb{E}[\text{monochromatic } K_3] = \binom{n}{3} \cdot 2 \cdot (1/2)^3 = \binom{n}{3}/4\), consistent with Goodman’s formula — the random coloring is optimal for triangles.
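Goodman's identity, in the form \(\text{mono} = \binom{n}{3} - \frac{1}{2}\sum_v d_r(v) d_b(v)\), can be checked against a direct triangle count for a sampled coloring:

```python
import itertools, random

def goodman_check(n, seed=0):
    """Compare a direct monochromatic-triangle count with the identity
    mono = C(n,3) - (1/2) * sum_v d_red(v) * d_blue(v)."""
    rng = random.Random(seed)
    color = {e: rng.choice("RB") for e in itertools.combinations(range(n), 2)}
    col = lambda u, v: color[(min(u, v), max(u, v))]
    mono = sum(1 for a, b, c in itertools.combinations(range(n), 3)
               if col(a, b) == col(a, c) == col(b, c))
    deg = {v: {"R": 0, "B": 0} for v in range(n)}
    for (u, v), c in color.items():
        deg[u][c] += 1
        deg[v][c] += 1
    # The sum below is always even: it counts bicolored cherries, two
    # per non-monochromatic triangle.
    identity = n * (n - 1) * (n - 2) // 6 - sum(
        deg[v]["R"] * deg[v]["B"] for v in range(n)) // 2
    return mono, identity

mono, identity = goodman_check(8)
assert mono == identity
```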

For larger cliques, Erdős conjectured that the random coloring minimizes the number of monochromatic \(K_r\) for all \(r\). Thomason (1989) disproved this for \(r \geq 4\), constructing colorings of \(K_n\) with strictly fewer monochromatic \(K_4\)'s than the random one; determining the true minimum \(K_4\) density remains open and is one of the most subtle problems in Ramsey theory, touching on the theory of “quasi-random” graphs and flag algebras.

Remark (Flag algebras). Razborov's method of flag algebras (2007) is a systematic tool for proving extremal combinatorics bounds via semidefinite programming. One sets up a linear program (or SDP) over "graph homomorphism densities" — the limiting densities of small graphs in a large graph. The constraints are positivity conditions and linear relations between densities. The resulting bounds often match or improve the best known results. Flag algebras have settled long-open problems including the Turán density of the Fano plane, bounds on Ramsey multiplicity, and crossing number inequalities. The method is inherently non-constructive (the witnesses come from SDP solutions) but gives exact or near-exact bounds where classical probabilistic arguments give only order-of-magnitude results.

Chapter 15: Bipartite Turán Problems and the Kővári-Sós-Turán Theorem

15.1 The Zarankiewicz Problem

The Zarankiewicz problem asks for the maximum number of edges in a bipartite graph on \(n + n\) vertices containing no complete bipartite subgraph \(K_{s,t}\). This is denoted \(z(n; s, t)\).

Theorem (Kővári-Sós-Turán, 1954). For \(s \leq t\), any bipartite graph on \(n + n\) vertices with no \(K_{s,t}\) subgraph has at most \[ \frac{1}{2}\left((t-1)^{1/s} n^{2 - 1/s} + (s-1)n\right) \] edges.
Proof. Let \(G\) be a bipartite graph with parts \(A, B\), \(|A| = |B| = n\). Count the number of "cherries" — pairs \((b, \{a_1, a_2, \ldots, a_s\})\) with \(b \in B\) adjacent to all of \(a_1, \ldots, a_s \in A\). If vertex \(b\) has degree \(d_b\), the number of such cherries through \(b\) is \(\binom{d_b}{s}\). The total is \(\sum_{b \in B} \binom{d_b}{s}\). By Jensen's inequality applied to the convex function \(\binom{x}{s}\), this is at least \(n \binom{\bar{d}}{s}\) where \(\bar{d} = e/n\) is the average degree, \(e = |E(G)|\).

On the other hand, for any \(s\)-subset \(\{a_1, \ldots, a_s\} \subseteq A\), the number of vertices in \(B\) adjacent to all of them is at most \(t-1\) (otherwise we’d have a \(K_{s,t}\)). So the total cherry count is at most \(\binom{n}{s}(t-1)\). Combining:

\[ n \binom{e/n}{s} \leq \binom{n}{s}(t-1). \]

Solving for \(e\) gives the claimed bound. \(\square\)

The KST theorem is a model example of double counting combined with convexity. Its applications are vast: since any bipartite \(H\) with smaller part of size \(s\) embeds in some \(K_{s,t}\), it implies \(\mathrm{ex}(n, H) = O(n^{2-1/s})\) for every such \(H\); and it appears in incidence geometry (Szemerédi-Trotter), algebraic combinatorics, and the analysis of algorithms. (The Bondy-Simonovits theorem is the analogous statement for even cycles: \(\mathrm{ex}(n, C_{2k}) = O(n^{1+1/k})\).)

15.2 Lower Bounds via Random Algebraic Constructions

The KST upper bound \(z(n; s, t) = O(n^{2-1/s})\) is matched by algebraic lower bounds using polarity maps over finite fields.

Example (Incidence graphs of projective planes achieve KST for \(s = t = 2\)). The incidence graph of \(\mathrm{PG}(2,q)\) (the projective plane over \(\mathbb{F}_q\)) has \(q^2 + q + 1\) points and \(q^2 + q + 1\) lines, each incident to \(q+1\) others. Any two points are on exactly 1 common line, so there is no \(K_{2,2}\) subgraph (two points share at most one line). The number of incidences is \((q^2+q+1)(q+1) = q^3 + 2q^2 + 2q + 1 \approx n^{3/2}\) where \(n = q^2 + q + 1\). This gives \(z(n; 2, 2) = \Theta(n^{3/2})\), matching the KST bound \(O(n^{3/2})\).
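The construction can be verified computationally for the Fano plane (\(q = 2\)). The sketch below builds PG(2,q) for prime \(q\) with incidence given by the standard bilinear form, matching the polarity description above:

```python
import itertools

def pg2_incidence(q):
    """Points/lines of PG(2,q) for prime q, as canonical projective
    triples; point x lies on line y iff <x, y> = 0 mod q."""
    pts, seen = [], set()
    for v in itertools.product(range(q), repeat=3):
        if any(v):
            # canonical representative of the projective point [v]
            rep = min(tuple(a * t % q for a in v) for t in range(1, q))
            if rep not in seen:
                seen.add(rep)
                pts.append(rep)
    incid = {(x, y) for x in pts for y in pts
             if sum(a * b for a, b in zip(x, y)) % q == 0}
    return pts, incid

pts, incid = pg2_incidence(2)                 # the Fano plane
assert len(pts) == 7 and len(incid) == 21     # (q^2+q+1)(q+1) incidences
for x1, x2 in itertools.combinations(pts, 2):
    common = [y for y in pts if (x1, y) in incid and (x2, y) in incid]
    assert len(common) == 1                   # exactly one common line
```

The final loop checks the \(K_{2,2}\)-freeness directly: any two points lie on exactly one common line.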

For \(K_{3,3}\)-free graphs, Brown's construction (1966), built from spheres in \(\mathbb{F}_q^3\), achieves \(\Theta(n^{5/3})\) edges, matching the KST bound \(O(n^{5/3})\). These algebraic constructions are often the only known exact-order constructions, and whether the KST exponent \(2 - 1/s\) is correct for \(K_{s,s}\)-free graphs with \(s \geq 4\) is a major open problem, even allowing non-explicit constructions.

15.3 The Probabilistic Method for Bipartite Turán Problems

For \(K_{s,t}\) with \(s = 2\) and large \(t\), the Zarankiewicz problem connects to extremal set theory: viewing one side of the bipartition as a set system \(\mathcal{F}\) over the vertices of the other side, the incidence graph is \(K_{2,t}\)-free iff no two vertices lie together in \(t\) or more sets of \(\mathcal{F}\).

The matching lower bounds are algebraic: norm graphs (Kollár-Rónyai-Szabó, 1996; Alon-Rónyai-Szabó, 1999) and the random algebraic method (Bukh, 2015, with later refinements), which samples low-degree algebraic varieties over finite fields, give:

\[ z(n; s, t) = \Omega\left(n^{2-1/s}\right) \]

for all \(s \leq t\) with \(t \geq (s-1)! + 1\), matching the KST bound in this range. For smaller \(t\), and in particular for \(t = s \geq 4\), the correct order of \(z(n; s, t)\) remains unknown.


Chapter 16: Open Problems and the Frontier

The probabilistic method is not a static body of techniques but a living research program. We close with a brief survey of the most important open problems.

The \(R(5,5)\) question. The Ramsey number \(R(5,5)\) satisfies \(43 \leq R(5,5) \leq 48\). The lower bound \(R(5,5) \geq 43\) comes from an explicit 2-coloring of \(K_{42}\) with no monochromatic \(K_5\) (Exoo, 1989). The naive recurrence gives only \(R(5,5) \leq R(4,5) + R(5,4) = 50\); the bound 48 is due to McKay and Radziszowski, via a massive computer-assisted search. Despite decades of effort and the use of computers to check large classes of colorings, the exact value remains unknown. Closing this gap likely requires new ideas beyond both the probabilistic method and exhaustive search.

The Erdős-Hajnal conjecture. For every graph \(H\) there is a constant \(c(H) > 0\) such that every \(H\)-free graph on \(n\) vertices has a clique or independent set of size \(n^{c(H)}\). For general graphs, Ramsey theory guarantees only size \(\Omega(\log n)\); Erdős and Hajnal proved that \(H\)-free graphs contain cliques or independent sets of size \(e^{c\sqrt{\log n}}\), and the conjecture posits a polynomial bound. It has been proved for a few special graphs \(H\) (including \(P_4\), the bull, and recently \(C_5\)), but the general case is wide open.

Explicit constructions for Ramsey graphs. The probabilistic method gives graphs on \(n\) vertices with no clique or independent set of size \(2 \log_2 n\). Explicitly matching this is a long-standing challenge: Frankl-Wilson graphs have no clique or independent set of size \(2^{O(\sqrt{\log n \log\log n})} = n^{o(1)}\), and two-source extractor constructions (Chattopadhyay-Zuckerman, 2016) push this down to quasi-polylogarithmic size. Paley graphs are conjectured to have polylogarithmic clique and independence numbers, but the best proven bound for them is only \(O(\sqrt{n})\).

The sunflower conjecture. Erdős and Rado (1960) showed that any family of more than \((r-1)^k \cdot k!\) sets of size \(k\) contains an \(r\)-sunflower (\(r\) sets whose pairwise intersections all equal a common core), and conjectured that \(c(r)^k\) sets suffice for some constant \(c(r)\). Alweiss-Lovett-Wu-Zhang (2020) improved the bound to \((O(r \log k))^k\), a dramatic improvement using the spread measure technique that later appeared in the Park-Pham proof of the Kahn-Kalai conjecture. The conjectured bound \(c(r)^k\) remains open.

Frankl’s union-closed conjecture. If \(\mathcal{F}\) is a union-closed family of sets (meaning \(A, B \in \mathcal{F} \Rightarrow A \cup B \in \mathcal{F}\)) with \(\mathcal{F} \neq \{\emptyset\}\), then some element appears in at least half the sets. Despite its elementary statement, this has resisted proof since 1979. A breakthrough by Gilmer (2022) via the entropy method gave the first constant-fraction bound: some element appears in at least \(0.01|\mathcal{F}|\) sets. Chase-Lovett (2023) improved this to \((3 - \sqrt{5})/2 \approx 0.382\), still short of 1/2.

These open problems illustrate a recurring pattern: the probabilistic method establishes existence, provides intuition, and even suggests the right answer, but the exact combinatorial truth often requires more algebraic or analytic structure than pure probability can provide. The interplay between random and deterministic constructions remains one of the richest and most productive areas in discrete mathematics.


Appendix A: Probability Toolbox

The probabilistic method draws on a compact but powerful set of probabilistic facts. This appendix collects the key inequalities and identities for reference.

A.1 Basic Expectations and Variance

For a random variable \(X\) taking non-negative integer values, the most useful identities are:

\[ \mathbb{E}[X] = \sum_{k=1}^\infty \Pr[X \geq k] = \sum_{k=0}^\infty \Pr[X > k], \]

and for a sum \(X = \sum_{i} X_i\) of indicators (by linearity of expectation):

\[ \mathbb{E}[X] = \sum_i \Pr[X_i = 1]. \]

The second moment method uses:

\[ \Pr[X > 0] \geq \frac{(\mathbb{E}[X])^2}{\mathbb{E}[X^2]}, \]

known as the Paley-Zygmund inequality. For \(X = \sum_i X_i\) (indicator sum):

\[ \mathbb{E}[X^2] = \mathbb{E}[X] + \sum_{i \neq j} \Pr[X_i = 1, X_j = 1]. \]

A.2 Tail Bounds Summary

| Bound | Assumption | Statement |
| --- | --- | --- |
| Markov | \(X \geq 0\) | \(\Pr[X \geq t] \leq \mathbb{E}[X]/t\) |
| Chebyshev | Finite variance | \(\Pr[\lvert X - \mathbb{E}[X]\rvert \geq t] \leq \mathrm{Var}(X)/t^2\) |
| Chernoff (upper) | Independent Bernoulli | \(\Pr[X \geq (1+\delta)\mu] \leq e^{-\mu\delta^2/3}\) for \(\delta \leq 1\) |
| Chernoff (lower) | Independent Bernoulli | \(\Pr[X \leq (1-\delta)\mu] \leq e^{-\mu\delta^2/2}\) for \(\delta \leq 1\) |
| Azuma | Martingale, \(\lvert X_i - X_{i-1}\rvert \leq c_i\) | \(\Pr[\lvert X_n - X_0\rvert \geq t] \leq 2e^{-t^2/(2\sum_i c_i^2)}\) |
| Talagrand | Lipschitz, certifiable | \(\Pr[X \leq m - t]\Pr[X \geq m] \leq e^{-t^2/(4b^2)}\) |
| Janson (upper) | Positively correlated | \(\Pr[X = 0] \leq e^{-\mu + \Delta/2}\) |
| LLL (symmetric) | \(d\)-bounded dependence | \(\Pr[\bigcap_i \overline{A_i}] > 0\) if \(ep(d+1) \leq 1\) |

A.3 Key Entropy Facts

The entropy of a discrete random variable \(X\) is \(H(X) = -\sum_x \Pr[X=x]\log_2\Pr[X=x]\). Key properties used in proofs:

  1. Chain rule: \(H(X_1, \ldots, X_n) = H(X_1) + H(X_2 | X_1) + \cdots + H(X_n | X_1, \ldots, X_{n-1})\).
  2. Sub-additivity: \(H(X_1, \ldots, X_n) \leq \sum_i H(X_i)\), with equality iff the \(X_i\) are independent.
  3. Shearer’s lemma: If each index \(i \in [n]\) appears in at least \(k\) of the sets \(S_1, \ldots, S_m \subseteq [n]\), then \(k \cdot H(X_1, \ldots, X_n) \leq \sum_{j=1}^m H(X_{S_j})\).
  4. Entropy and counting: For a uniform distribution on a set \(\mathcal{F}\), \(H = \log_2 |\mathcal{F}|\). Sub-additivity then gives direct bounds on \(|\mathcal{F}|\) from projections.

A.4 Lovász Local Lemma (All Forms)

Theorem (LLL — General Form). Let \(A_1, \ldots, A_m\) be events in a probability space, with dependency graph \(G\). If there exist reals \(x_1, \ldots, x_m \in (0,1)\) such that for all \(i\): \[ \Pr[A_i] \leq x_i \prod_{j \sim i} (1 - x_j), \] then \(\Pr\!\left[\bigcap_i \overline{A_i}\right] \geq \prod_i (1 - x_i) > 0\).
Corollary (Symmetric LLL). If each \(\Pr[A_i] \leq p\) and each event depends on at most \(d\) others, and \(ep(d+1) \leq 1\), then \(\Pr[\bigcap_i \overline{A_i}] > 0\).

The symmetric condition is obtained by taking \(x_i = 1/(d+1)\): then \[ x_i \prod_{j \sim i}(1 - x_j) \geq \frac{1}{d+1}\left(1 - \frac{1}{d+1}\right)^d \geq \frac{1}{e(d+1)} \geq p. \]
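A numeric sanity check of the middle inequality, over a hypothetical range of dependency degrees:

```python
import math

# With x_i = 1/(d+1):  x_i (1 - x_i)^d  >=  1/(e(d+1))  for all d >= 1,
# since (1 - 1/(d+1))^d decreases to 1/e from above.
for d in range(1, 200):
    lhs = (1 / (d + 1)) * (1 - 1 / (d + 1)) ** d
    assert lhs >= 1 / (math.e * (d + 1))
```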

The LLL can also be formulated for the lopsided dependency graph (where \(A_i\) is negatively correlated with some of its “neighbors”), giving a slightly stronger result useful for permutations and other structures with strong negative dependencies.

A.5 Graph Theoretic Quick Reference

| Result | Statement | Use |
| --- | --- | --- |
| Turán’s theorem | \(\mathrm{ex}(n, K_{r+1}) = (1-1/r)n^2/2 + O(n)\) | Extremal clique-free graphs |
| KST bound | \(\mathrm{ex}(n, K_{s,t}) = O(n^{2-1/s})\) | Bipartite Turán numbers |
| Ramsey bounds | \(R(k,k) \leq 4^k/\sqrt{k}\) (Erdős–Szekeres); \(R(k,k) > 2^{k/2}\) (Erdős, 1947) | Clique/indep. set sizes |
| Lovász theta | \(\alpha(G) \leq \vartheta(G)\); \(\vartheta(G)\,\vartheta(\bar{G}) \geq n\) (Lovász 1979) | Shannon capacity bound |
| Szemerédi regularity | Every graph has an \(\varepsilon\)-regular partition into at most \(T(\varepsilon)\) parts | Structure of dense graphs |
| Deletion bound | \(\mathrm{ex}(n, K_{s,t}) = \Omega(n^{2-(s+t-2)/(st-1)})\) | Lower bound on Zarankiewicz |