CO 448: Probabilistic Methods in Combinatorics
Estimated study time: 3 hr 18 min
Table of contents
Sources and References
Primary textbook — Noga Alon and Joel H. Spencer, The Probabilistic Method, 4th ed., Wiley-Interscience, 2016. The foundational text for this subject.
Supplementary texts — Michael Molloy and Bruce Reed, Graph Colouring and the Probabilistic Method, Springer, 2002; Stasys Jukna, Extremal Combinatorics, 2nd ed., Springer, 2011; Svante Janson, Tomasz Łuczak, and Andrzej Ruciński, Random Graphs, Wiley, 2000.
Online resources — Jacob Fox and Yufei Zhao, MIT 18.226 Probabilistic Methods in Combinatorics lecture notes (2023, freely available); David Conlon, Oxford C5.4 Probabilistic Methods lecture notes; Nati Linial and Yuval Peled, Hebrew University “Probabilistic Methods” course notes; Benny Sudakov, ETH Zürich “Probabilistic Methods in Combinatorics” lecture notes.
Chapter 1: The Probabilistic Method — Philosophy and Foundations
The probabilistic method is one of the most elegant and powerful paradigms in combinatorics. Its central idea is deceptively simple: to prove that a combinatorial object with some desired property exists, one does not construct it explicitly. Instead, one defines a suitable probability space over a family of objects, and shows that a randomly chosen element satisfies the desired property with strictly positive probability. Since the probability is positive, such an element must actually exist. The method is thus inherently non-constructive — it certifies existence without providing a witness. Paul Erdős, who pioneered the technique in the 1940s, was fond of calling this an argument from “the Book,” a divine volume containing perfect proofs. For decades, probabilistic existence results were regarded as fundamentally non-constructive, until algorithmic breakthroughs in the 1980s–2010s began to close that gap.
Before stating the first major theorem, we establish the foundational framework. A probability space is a triple \((\Omega, \mathcal{F}, \Pr)\) where \(\Omega\) is the sample space, \(\mathcal{F}\) is a \(\sigma\)-algebra of events, and \(\Pr: \mathcal{F} \to [0,1]\) is a probability measure. In combinatorics, \(\Omega\) is typically finite and all subsets are events. The key principle underpinning every application of the probabilistic method is:
This tautological-sounding principle is extraordinarily useful in practice because computing probabilities is often easier than constructing objects directly. The first landmark application, due to Erdős in 1947, gave the first nontrivial lower bound for Ramsey numbers.
This proof is the prototypical application of the probabilistic method. The argument says nothing about what such a colouring looks like — it merely certifies its existence. For comparison, the upper bound \(R(k,k) \leq \binom{2k-2}{k-1} \leq 4^k\) follows from Pascal’s recurrence, so we have \(2^{k/2} < R(k,k) \leq 4^k\). Despite decades of effort, improving the exponential base in either bound remains one of the central open problems in combinatorics.
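The union-bound computation behind Erdős's lower bound is easy to check by machine: a uniformly random 2-colouring of \(K_n\) has fewer than one monochromatic \(K_k\) in expectation whenever \(\binom{n}{k} 2^{1-\binom{k}{2}} < 1\). The following sketch (my own check, not part of the course materials) verifies that \(n = \lfloor 2^{k/2} \rfloor\) satisfies this for a range of \(k\):

```python
from math import comb, floor

def expected_mono_cliques(n: int, k: int) -> float:
    """Expected number of monochromatic K_k's in a uniform 2-colouring of K_n."""
    return comb(n, k) * 2.0 ** (1 - comb(k, 2))

def erdos_lower_bound_holds(k: int) -> bool:
    """Check that n = floor(2^(k/2)) gives expectation < 1, hence R(k,k) > n."""
    n = floor(2 ** (k / 2))
    return expected_mono_cliques(n, k) < 1

# The 1947 argument works for every k >= 3.
for k in range(3, 21):
    assert erdos_lower_bound_holds(k)
```

Since the expectation is below 1, some colouring has zero monochromatic \(K_k\)'s, exactly as in the proof.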
The workhorse behind most first-moment arguments is linearity of expectation. Unlike variance or higher moments, linearity holds regardless of any dependence structure among the variables.
The probabilistic method and the LLL are the two foundational tools of the field. The union bound says “avoid all bad events when their probabilities sum to less than 1.” The LLL says “avoid all bad events even when the sum of probabilities is large, provided the events are nearly independent.” Together, these two principles, combined with the moment methods and concentration inequalities of later chapters, give the complete toolkit of modern probabilistic combinatorics.
This theorem, simple as it is, is the engine behind an enormous range of existence results. One immediate application gives a lower bound on the independence number of a graph — a result that mirrors and complements the Turán theorem from an existential direction.
A recurring refinement of the basic probabilistic method is the alteration technique: rather than requiring the random object to immediately satisfy all conditions, one generates a random object, identifies a small set of “violations,” and removes or fixes them at a modest cost. The following version gives a slightly better bound.
- Random construction: generate a random object \(G\) from some distribution over the "ambient" family.
- First moment analysis: compute \(\mathbb{E}[V]\) where \(V\) is the number of "violations" (elements of \(G\) that need to be removed or fixed).
- Alteration: remove or fix each violation at a deterministic cost (often 1 element per violation). The resulting object \(G'\) satisfies all constraints.
- Lower bound: by linearity of expectation, \(\mathbb{E}[|G'|] \geq \mathbb{E}[|G|] - c \cdot \mathbb{E}[V]\) for some cost \(c\). If \(\mathbb{E}[V] \ll \mathbb{E}[|G|]\), the alteration is cheap.
The alteration method will recur throughout this course. The key idea is always the same: the random step gives you most of what you need, and the deterministic repair step costs only a lower-order amount.
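As a concrete instance of this template (a sketch of the standard textbook argument, not an excerpt from the course): sample each vertex independently with probability \(p\), then delete one endpoint of every edge that survives inside the sample. Linearity gives \(\mathbb{E}[|G'|] \geq np - mp^2\), which equals \(n^2/(4m)\) at the optimal \(p = n/(2m)\).

```python
import random

def independent_set_by_alteration(n, edges, p, rng=random):
    """Random construction + alteration: returns an independent set."""
    # Random construction: keep each vertex independently with probability p.
    kept = {v for v in range(n) if rng.random() < p}
    # Alteration: every edge inside the sample is a violation;
    # delete one endpoint per violation.
    for u, v in edges:
        if u in kept and v in kept:
            kept.discard(u)
    # No edge survives with both endpoints, so `kept` is independent.
    return kept

# Example: a cycle on 12 vertices; here m = n, so the optimal p is 1/2.
n = 12
edges = [(i, (i + 1) % n) for i in range(n)]
s = independent_set_by_alteration(n, edges, p=0.5)
assert all(not (u in s and v in s) for u, v in edges)
```

The returned set is independent deterministically; only its size is random, with expectation at least \(np - mp^2 = 3\) in this example.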
Chapter 2: The First Moment Method
The first moment method formalizes the observation that if the expected number of “bad” events is less than 1, then with positive probability there are no bad events at all. Conversely, if the expected number of objects of a certain type is at least 1, then such an object exists with positive probability. These two directions — showing existence and showing non-emptiness — are dual facets of the same principle.
A clean illustration of the first moment method in a graph-theoretic setting is the domination number of graphs.
The logarithmic factor cannot be removed in general: for random graphs of average degree \(\delta\), the domination number is \(\Theta(n\ln\delta/\delta)\), matching the bound up to constants. On specific graphs the bound can be loose: for the complete bipartite graph \(K_{\delta,\delta}\), two vertices dominate everything, while the formula gives approximately \(2\ln\delta\).
The first moment method also yields a beautiful result in combinatorial geometry via a probabilistic argument for the crossing number inequality, which in turn implies the Szemerédi-Trotter theorem on point-line incidences.
The elegance here is that we apply a weak planar-graph inequality to a random sparse subgraph, then transfer information back to the original dense graph by undoing the sampling. This sampling-and-averaging strategy appears throughout probabilistic combinatorics.
The bound \(m^3/(64n^2)\) has the correct exponent: there exist graphs (expanders, complete graphs minus a matching) achieving \(\mathrm{cr}(G) = \Omega(m^3/n^2)\). For the complete graph \(K_n\), \(m = \binom{n}{2} \approx n^2/2\) and the probabilistic bound gives \(\mathrm{cr}(K_n) \geq \Omega(n^4)\), while the true crossing number of \(K_n\) is conjectured to be \(\frac{1}{4}\lfloor n/2 \rfloor \lfloor (n-1)/2 \rfloor \lfloor (n-2)/2 \rfloor \lfloor (n-3)/2 \rfloor \sim n^4/64\) (the Harary-Hill conjecture, verified for small \(n\) but open in general).
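These magnitudes are easy to tabulate. The sketch below (illustrative arithmetic only; recall the crossing number inequality requires \(m \geq 4n\)) compares the probabilistic lower bound with the conjectured crossing number of \(K_n\):

```python
from math import comb

def cr_lower_bound(n: int, m: int) -> float:
    """Probabilistic bound cr(G) >= m^3 / (64 n^2), valid when m >= 4n."""
    return m ** 3 / (64 * n ** 2)

def cr_conjectured_Kn(n: int) -> int:
    """Conjectured crossing number of K_n: (1/4) * product of four floors."""
    return (n // 2) * ((n - 1) // 2) * ((n - 2) // 2) * ((n - 3) // 2) // 4

n = 20
m = comb(n, 2)               # K_20 has 190 edges, comfortably above 4n = 80
lb = cr_lower_bound(n, m)    # about 268
conj = cr_conjectured_Kn(n)  # 1620
assert lb <= conj            # the bound sits below the conjectured truth
```

The gap between roughly 268 and 1620 reflects the constant \(1/64\), which is not optimal; the exponent \(m^3/n^2\), however, is.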
In combinatorial number theory, the first moment gives the existence of Sidon sets (also called \(B_2\) sets) of near-optimal size.
The classical upper bound \(|A| \leq (1+o(1))n^{1/2}\) is a theorem of Erdős and Turán; simple counting already gives the right order: if \(|A| = k\), the \(k(k-1)\) ordered differences \(a - a'\) with \(a \neq a'\) are pairwise distinct (by the Sidon property) and lie in \(\{-(n-1), \ldots, n-1\} \setminus \{0\}\), so \(k(k-1) \leq 2n-2\), giving \(k \leq (1+o(1))(2n)^{1/2}\). The probabilistic construction thus achieves the right order of magnitude.
- Singer (1938): for prime powers \(q\), perfect difference sets arising from the projective plane \(\mathrm{PG}(2,q)\) give a Sidon set in \(\{1, \ldots, q^2+q+1\}\) of size \(q+1\), achieving \(|A| = (1+o(1))n^{1/2}\).
- Bose (1942): a related construction from the finite field \(\mathbb{F}_{q^2}\) gives a Sidon set of size \(q\) in \(\{1, \ldots, q^2-1\}\).
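A hands-on way to see Sidon sets (not the Singer or Bose constructions, just the naive greedy, written as my own sketch): add each integer whose pairwise sums with the current set stay distinct. This produces the Mian-Chowla sequence, whose density in \(\{1,\ldots,n\}\) is only about \(n^{1/3}\), illustrating that greed falls well short of both the probabilistic and the algebraic constructions.

```python
def greedy_sidon(limit):
    """Greedily build a Sidon set in {1, ..., limit}:
    keep x iff all pairwise sums (doubles included) remain distinct."""
    a, sums = [], set()
    for x in range(1, limit + 1):
        new_sums = {x + y for y in a} | {2 * x}
        if sums.isdisjoint(new_sums):
            a.append(x)
            sums |= new_sums
        # else: x would repeat an existing pairwise sum; skip it.
    return a

s = greedy_sidon(100)
# First terms of the Mian-Chowla sequence.
assert s[:8] == [1, 2, 4, 8, 13, 21, 31, 45]
```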
The first moment method also interacts beautifully with structural graph theory. The following theorem shows how a greedy probabilistic idea implies a fundamental result about graph colouring.
Chapter 3: The Second Moment Method and Chebyshev’s Inequality
The first moment method tells us when something exists, but it cannot tell us whether a random variable is concentrated near its mean. The second moment method addresses exactly this: it provides a lower bound on \(\Pr[X > 0]\) in terms of the ratio \((\mathbb{E}[X])^2 / \mathbb{E}[X^2]\).
A key computation when applying the second moment method is expanding \(\mathbb{E}[X^2]\) using indicator random variables. If \(X = \sum_i X_i\) where \(X_i \in \{0,1\}\), then
\[ \mathbb{E}[X^2] = \sum_i \mathbb{E}[X_i^2] + 2\sum_{i < j}\mathbb{E}[X_i X_j] = \mathbb{E}[X] + \sum_{i \neq j} \Pr[X_i = 1, X_j = 1]. \]The method succeeds — meaning \(\Pr[X > 0] \to 1\) — when the “off-diagonal” covariance terms are negligible relative to \((\mathbb{E}[X])^2\).
A fundamental and illuminating application of the second moment method is determining the threshold for the presence of triangles in \(G(n,p)\).
- If \(p = c/n\) with \(c > 0\) fixed, then \(\mathbb{E}[T] \to c^3/6\) as \(n \to \infty\).
- If \(p \ll n^{-1}\), then \(\Pr[T \geq 1] \to 0\) (whp there are no triangles).
- If \(p \gg n^{-1}\), then \(\Pr[T \geq 1] \to 1\) (whp there are triangles).
For the lower threshold, we apply the second moment method. When \(p \gg 1/n\), write \(\mathbb{E}[T^2] = \sum_{S, S'} \Pr[I_S = I_{S'} = 1]\). Partition by \(|S \cap S'|\):
- \(|S \cap S'| = 0\) or \(1\): \(S\) and \(S'\) share no edge, so \(\Pr[I_S = I_{S'} = 1] = p^6\). There are \((1-o(1))\binom{n}{3}^2\) such ordered pairs, contributing \((1+o(1))(\mathbb{E}[T])^2\) to \(\mathbb{E}[T^2]\).
- \(|S \cap S'| = 2\): \(S\) and \(S'\) share one edge (and together span 4 vertices, 5 edges). There are \(O(n^4)\) such pairs, each with probability \(p^5\), contributing \(O(n^4 p^5)\). Since \(\mathbb{E}[T] = \Theta(n^3 p^3)\), this term is \(O(n^4 p^5) = O(\mathbb{E}[T]^2 / (n^2 p))\), which is \(o(\mathbb{E}[T]^2)\) when \(n^2 p \to \infty\), i.e., \(p \gg 1/n^2\).
- \(|S \cap S'| = 3\): \(S = S'\), contributing \(\mathbb{E}[T]\), which is \(o((\mathbb{E}[T])^2)\) precisely because \(\mathbb{E}[T] \to \infty\) when \(p \gg 1/n\).
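These contributions can be computed exactly rather than asymptotically. The sketch below (my own check, with the pair counts written out explicitly) evaluates \(\mathbb{E}[T]\), \(\mathbb{E}[T^2]\), and the second moment ratio \((\mathbb{E}[T])^2/\mathbb{E}[T^2]\), which lower-bounds \(\Pr[T > 0]\) and tends to 1 when \(p \gg 1/n\):

```python
from math import comb

def triangle_moments(n, p):
    """Exact first and second moments of the triangle count in G(n, p)."""
    t = comb(n, 3)                       # number of potential triangles
    mean = t * p**3
    shared_edge = t * 3 * (n - 3)        # ordered pairs S != S' sharing one edge
    diagonal = t                         # S == S'
    disjoint = t * t - shared_edge - diagonal  # share at most one vertex
    second = disjoint * p**6 + shared_edge * p**5 + diagonal * p**3
    return mean, second

# Second moment bound: Pr[T > 0] >= mean^2 / second; it approaches 1 as n grows.
for n in (100, 1000, 10_000):
    mean, second = triangle_moments(n, 10 / n)   # p = 10/n, above the threshold
    ratio = mean**2 / second
    assert 0.9 < ratio <= 1
```

The count of ordered pairs sharing exactly one edge, \(\binom{n}{3}\cdot 3(n-3)\), matches the \(O(n^4)\) estimate in the case analysis above.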
Explicitly:
- \(k = 3\) (triangles): \(p^* = n^{-1}\). At \(p = c/n\), \(\mathbb{E}[X_3] \to c^3/6\) as computed above.
- \(k = 4\) (4-cliques): \(p^* = n^{-2/3}\). At \(p = c \cdot n^{-2/3}\), \(\mathbb{E}[X_4] = \binom{n}{4}p^6 \approx \frac{n^4}{24} c^6 n^{-4} = c^6/24 = \Theta(1)\).
- \(k = 5\) (5-cliques): \(p^* = n^{-1/2}\). At \(p = c \cdot n^{-1/2}\), \(\mathbb{E}[X_5] \approx \frac{n^5}{120} c^{10} n^{-5} = c^{10}/120\).
The paradigmatic application of the second moment method is the analysis of clique numbers in random graphs.
This two-value concentration is a remarkable feature of the random graph: even though the number of cliques varies enormously, the maximum clique size is essentially determined. The proof reveals why the second moment method can fail: near \(k^*\) the variance blows up because cliques sharing many vertices create strong positive correlations.
The main balance is \(\ln n \approx \frac{k \ln 2}{2}\); with \(\ln n = 10\ln 2\), this gives \(k \approx 20 = 2\log_2(1024)\). More carefully, \(k^* = 20\) or \(k^* = 19\) depending on lower-order corrections. The theorem says \(\omega(G(1024, 1/2))\) is concentrated on \(\{19, 20\}\) or \(\{20, 21\}\). For large \(n\), \(\omega(G(n,1/2)) = (2 - o(1))\log_2 n\), so for \(n = 1024\) we expect \(\omega \approx 20\).
Contrast with the greedy lower bound: greedily build a clique by adding vertices that are adjacent to all current clique members. With \(n = 1024\) and \(p = 1/2\), after choosing \(j\) vertices, the next vertex is in the neighbourhood of all \(j\) with probability \((1/2)^j\), so the expected number of candidates among remaining \(n-j\) vertices is \((n-j)/2^j\). This stays positive as long as \(2^j \ll n\), giving a greedy clique of size \(\approx \log_2 n = 10\) — about half of \(k^* \approx 20\). The gap between the greedy bound and the true clique number is an important open problem (the “algorithmic” clique problem: can we find a \((1+\varepsilon)\log_2 n\) clique efficiently?).
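The greedy heuristic is easy to simulate using the principle of deferred decisions: each adjacency coin is flipped only when first needed. A sketch (my own simulation, not course code) that typically returns a clique of size near \(\log_2 n = 10\) for \(n = 1024\):

```python
import random

def greedy_clique(n, p=0.5, rng=None):
    """Greedily grow a clique in G(n, p), revealing edges lazily:
    a candidate survives each round iff it is adjacent to the new vertex."""
    rng = rng or random.Random()
    candidates = list(range(n))
    clique = []
    while candidates:
        v = candidates.pop()           # take an arbitrary remaining vertex
        clique.append(v)
        # Each remaining candidate stays with probability p (edge to v present).
        candidates = [u for u in candidates if rng.random() < p]
    return clique

sizes = [len(greedy_clique(1024)) for _ in range(200)]
avg = sum(sizes) / len(sizes)
# Concentrated near log2(1024) = 10, far below the true clique number ~20.
assert 8 < avg < 14
```

Each round halves the candidate pool in expectation, which is exactly why the process stops after about \(\log_2 n\) steps.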
The second moment method is also the natural tool for analyzing threshold phenomena in \(G(n,p)\).
This is a prototypical example of a sharp threshold: the transition from disconnected to connected happens in an interval of width \(O(1/n)\) around \(\ln n / n\), which is much narrower than the threshold itself.
The key point is that connectivity is governed by isolated vertices: once there are no isolated vertices, the graph is whp connected. Indeed, for \(p\) in this range, \(G(n,p)\) whp consists of a single giant component together with isolated vertices; no component of intermediate size survives. The proof of this “no small components” fact is a separate first-moment argument: the expected number of components of size \(s\) with \(2 \leq s \leq n/2\) is \(O\!\left(\sum_s \binom{n}{s} s^{s-2} p^{s-1}(1-p)^{s(n-s)}\right) \to 0\) for \(p = (\ln n + c)/n\).
- \(c = 0\): \(\Pr[\text{connected}] \to e^{-e^0} = e^{-1} \approx 0.368\). The graph is disconnected about 63% of the time.
- \(c = 2\): \(\Pr[\text{connected}] \to e^{-e^{-2}} = e^{-0.135} \approx 0.874\). Connected about 87% of the time.
- \(c = -2\): \(\Pr[\text{connected}] \to e^{-e^{2}} = e^{-7.39} \approx 0.00062\). Almost certainly disconnected (expected \(e^2 \approx 7.4\) isolated vertices).
- \(c = 5\): \(\Pr[\text{connected}] \to e^{-e^{-5}} = e^{-0.0067} \approx 0.993\). Connected with probability exceeding 99%.
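Both the double-exponential limit and the isolated-vertex heuristic can be checked by machine. The sketch below (my own; \(n = 300\) and 40 trials are arbitrary choices for speed) tabulates \(e^{-e^{-c}}\) and estimates the empirical connectivity probability at \(p = (\ln n + c)/n\):

```python
import math, random

def limiting_connectivity(c):
    """Limiting value of Pr[G(n, (ln n + c)/n) is connected] as n -> infinity."""
    return math.exp(-math.exp(-c))

def is_connected(n, p, rng):
    """Sample G(n, p) and test connectivity with a depth-first search."""
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    seen, stack = {0}, [0]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n

# The limiting formula reproduces the four values listed above.
for c, target in [(0, 0.368), (2, 0.874), (-2, 0.00062), (5, 0.993)]:
    assert abs(limiting_connectivity(c) - target) < 0.01

# Empirical check at c = 2 (n = 300 keeps the simulation fast).
rng = random.Random(0)
n, c = 300, 2.0
freq = sum(is_connected(n, (math.log(n) + c) / n, rng) for _ in range(40)) / 40
```

At this modest \(n\) the empirical frequency already lands near the limiting value \(e^{-e^{-2}} \approx 0.874\).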
Chapter 4: The Lovász Local Lemma
The union bound \(\Pr[\bigcup_i A_i] \leq \sum_i \Pr[A_i]\) is a blunt instrument: it gives good estimates only when the total probability is small. But many problems involve a large number of individually rare bad events, each pair of which is nearly independent. In such cases, one might expect that all bad events can be simultaneously avoided even though the union bound fails. The Lovász Local Lemma (LLL) makes this intuition precise.
Since \(A_i\) is independent of events indexed by \(S_2\), the first factor equals \(\Pr[A_i] \leq x_i \prod_{j \in \Gamma(i)}(1-x_j) \leq x_i \prod_{j \in S_1}(1-x_j)\), from which the claim follows by induction. The lower bound on \(\Pr[\bigcap \overline{A_i}]\) follows by telescoping. \(\square\)
The LLL has numerous striking applications. Before stating the most classical one, let us work out how the asymmetric LLL is applied in practice.
The dependency graph: \(A_{ijk}\) depends on \(A_{i'j'k'}\) only if they share an edge, i.e., \(|\{i,j,k\} \cap \{i',j',k'\}| \geq 2\). Each triangle shares an edge with \(3(n-3)\) other triangles (choose the shared edge in 3 ways, then pick any of the remaining \(n-3\) vertices). So \(d \leq 3(n-3) \leq 3n\).
Symmetric LLL condition: \(e \cdot (1/k^2) \cdot (3n+1) \leq 1\), which gives \(k^2 \geq 3en\), i.e., \(k \geq \sqrt{3en}\). So for \(k = \lceil\sqrt{3en}\rceil\) colours, there exists a colouring of \(K_n\) with no monochromatic triangle. This is a remarkably clean result: the LLL certifies that \(\Theta(\sqrt{n})\) colours suffice for triangle-free edge colouring.
Using the asymmetric LLL with weights \(x_i = p = 1/(d+1)\) for the symmetric case, or optimising \(x_i\) using the condition \(p_i \leq x_i \prod_{j \in \Gamma(i)}(1-x_j)\), one can often squeeze slightly better constants. Here, choosing \(x_{ijk} = 2/(3n)\) (balancing the product term) gives a slightly improved bound.
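The symmetric condition is a one-liner to verify numerically. The sketch below (my own check) finds, for each \(n\), the least \(k\) with \(e \cdot k^{-2} \cdot (3n+1) \leq 1\) and confirms it is \(\Theta(\sqrt{n})\):

```python
import math

def min_colours_lll(n):
    """Smallest k satisfying the symmetric LLL condition
    e * (1/k^2) * (3n + 1) <= 1 for monochromatic triangles in K_n."""
    return math.ceil(math.sqrt(math.e * (3 * n + 1)))

for n in (10, 100, 10_000, 1_000_000):
    k = min_colours_lll(n)
    assert math.e * (3 * n + 1) / k**2 <= 1        # LLL condition holds
    assert (k - 1)**2 < math.e * (3 * n + 1)       # and k is minimal
    assert k <= 3 * math.isqrt(n) + 4              # Theta(sqrt(n)) scale
```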
The first, and most classical, application is satisfiability of \(k\)-CNF formulas.
The LLL also gives beautiful results in graph colouring and combinatorial geometry. In routing and scheduling, it provides the existence of efficient protocols:
The proof introduces an artificial random delay for each packet, defines bad events for edge congestion exceeding a threshold, and applies the LLL to show all bad events can be simultaneously avoided. The \(\log n\) factor arises from the LLL condition.
More usefully, the LLL underlies colouring bounds for graphs without short cycles: girth conditions thin out the dependencies among the bad events, allowing fewer colours. The cleanest such statement is for triangle-free graphs (the proof in fact gives the same bound for list colouring):
Theorem (Johansson-Molloy 1996, simplified). Every triangle-free graph \(G\) with maximum degree \(\Delta\) satisfies \(\chi(G) = O(\Delta / \log \Delta)\).
The proof uses the LLL in a sophisticated “semi-random” setting: vertices are coloured from a list greedily, with random choices, and bad events correspond to conflicts or exhausted lists. The key insight is that triangle-freedom prevents “domino” cascades where one conflict triggers many others, allowing the LLL condition to be satisfied with a list size of \(O(\Delta/\log\Delta)\).
One of the most important recent developments in the theory is the constructive LLL of Moser and Tardos (2010), which transformed probabilistic existence results into efficient algorithms.
The analysis proceeds through the notion of witness trees: each resampling step can be “charged” to a witness tree in a forest whose total expected size is bounded by the LLL parameters. The Moser-Tardos theorem was a breakthrough because it showed that the non-constructive LLL could, in essentially all combinatorial applications, be made polynomial-time algorithmic.
The symmetric LLL condition \(ep(d+1) \leq 1\) becomes \(e \cdot 2^{-k} \cdot (k(f-1) + 1) \leq 1\), since a clause shares a variable with at most \(k(f-1)\) other clauses. For \(k = 4\): this reads \(4(f-1) + 1 \leq 16/e \approx 5.89\), so \(f \leq 2\). That is, if each variable appears in at most 2 clauses, every 4-CNF formula is satisfiable. For general \(k\): the condition becomes \(f \leq (1+o(1)) \cdot 2^k/(ek)\). The bound of roughly \(2^k/e\) on the number of clauses with which any clause may share a variable is the classical LLL \(k\)-SAT condition.
For comparison, using the asymmetric LLL with weights \(x_i\) optimised per clause, one can tolerate slightly more dependencies. And the Moser-Tardos algorithm (2010) finds a satisfying assignment after an expected \(\sum_i x_i/(1-x_i) = O(m)\) resamplings, which is polynomial whenever the LLL condition holds.
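The Moser-Tardos procedure is short enough to state in full code. The sketch below (my own implementation on a toy formula, not from the course) resamples the variables of an arbitrary violated clause until none remain; under the LLL condition this terminates after an expected \(O(m)\) resamplings.

```python
import random

def moser_tardos(n_vars, clauses, rng=None, max_steps=100_000):
    """Find a satisfying assignment by resampling violated clauses.
    Each clause is a list of literals: +i means x_i, -i means not x_i (i >= 1)."""
    rng = rng or random.Random()
    x = [rng.random() < 0.5 for _ in range(n_vars + 1)]  # x[0] unused

    def violated(cl):
        # A clause is violated iff every one of its literals is false.
        return all(x[abs(l)] != (l > 0) for l in cl)

    for _ in range(max_steps):
        bad = [cl for cl in clauses if violated(cl)]
        if not bad:
            return x[1:]
        for lit in rng.choice(bad):          # resample an arbitrary bad clause
            x[abs(lit)] = rng.random() < 0.5
    raise RuntimeError("did not converge")

# Toy 3-CNF: (x1 v x2 v x3), (~x1 v x2 v ~x4), (x3 v ~x2 v x4).
clauses = [[1, 2, 3], [-1, 2, -4], [3, -2, 4]]
assignment = moser_tardos(4, clauses, random.Random(1))
assert all(any(assignment[abs(l) - 1] == (l > 0) for l in cl) for cl in clauses)
```

The striking feature of the algorithm is its simplicity: the entire LLL machinery is hidden in the analysis (via witness trees), not in the code.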
Chapter 5: Correlation Inequalities
The tools we have seen so far — first and second moments, the LLL — treat events primarily as obstacles to be avoided or as contributions to an expectation. Correlation inequalities instead describe the qualitative structure of dependencies between events. The central question is: when do two events tend to occur together (positive correlation), and when does knowing one occurred make the other less likely (negative correlation)?
The FKG inequality has striking applications in percolation theory and random graph theory. In bond percolation on a graph \(G\), each edge is open independently with probability \(p\). The event “vertices \(u\) and \(v\) are connected by an open path” is increasing in the edge set. Therefore, by FKG, any two connectivity events are positively correlated — information that open paths exist elsewhere can only help (or at worst, not affect) the probability of a given connectivity event.
The physical meaning: in a ferromagnet (large \(\beta\)), if we know region \(A\) has all spins up, this makes it more likely that nearby region \(B\) also has spins up — a positive correlation. The FKG inequality formalises this as a purely probabilistic statement about product measures on distributive lattices, stripping away the physics. The connection to combinatorics: FKG applies to the random set model (each element included independently), which is the setting for most probabilistic combinatorics. The positive correlation of increasing events is used repeatedly — for instance, the events of containing a large matching and containing a large clique in \(G(n,p)\) are both increasing (each becomes more likely as \(p\) grows) and hence positively correlated. A decreasing event, such as containing a large independent set, is instead negatively correlated with every increasing event.
This is a combinatorial restatement of the FKG inequality for the uniform measure on the hypercube, and it is often proved by a direct switching (injection) argument. The lemma is useful in extremal set theory and in bounding inclusion probabilities.
The Four Functions Theorem of Ahlswede and Daykin (1978) generalizes FKG substantially:
For negative correlation, the BK inequality (van den Berg-Kesten) provides a complementary picture: for two increasing events in a product space, the probability of their “disjoint occurrence” (witnessing each event on disjoint edge sets) is at most the product of their individual probabilities. This is deeply connected to the RSW theory in percolation.
At the self-dual point \(p = 1/2\), symmetry of the grid (and a duality argument) shows \(\Pr[A] = \Pr[B]\). Also, by the planar duality of the square lattice, the complement of a left-right crossing in the primal graph corresponds to a top-bottom crossing in the dual graph. This forces \(\Pr[A] \geq 1/2\) at \(p = 1/2\) (since the event “no left-right crossing” is equivalent to a dual crossing, which has the same probability). Combined with FKG, one gets \(\Pr[A \cap B] \geq 1/4\), meaning both a horizontal and vertical crossing occur simultaneously with positive probability at \(p = 1/2\). This is the foundation of the Russo-Seymour-Welsh (RSW) theory of percolation.
Chapter 6: Exponential Concentration Inequalities
Chebyshev’s inequality gives a tail bound that decays only as \(1/t^2\). In many combinatorial applications — particularly those involving sums of many independent random variables or Lipschitz functions of independent inputs — one needs tails that decay exponentially in \(t\). This chapter develops the key exponential concentration tools.
Chernoff bounds are powerful but require independence. A far more flexible tool, applicable to functions of independent variables, is the Azuma-Hoeffding martingale inequality.
A particularly important consequence is the bounded differences inequality:
A key application is concentration of the chromatic number of a random graph.
For sharper concentration, one needs Talagrand’s inequality, which accounts for the “certifiability” structure of combinatorial properties.
In particular, if \(\Pr[X \in A] \geq 1/2\), then \(\Pr[d_L(X,A) \geq t] \leq 2e^{-t^2/(4r)}\).
Talagrand’s inequality gives dramatically tighter bounds than Azuma for combinatorial random variables that are “certifiable.” For example, the length of the longest increasing subsequence of a random permutation is 2-certifiable (any particular increasing subsequence certifies itself via its \(\ell\) positions), and Talagrand gives variance \(O(n^{1/3})\) — far smaller than Azuma’s bound of \(O(n)\). This was a key ingredient in the Kim-Vu work on concentration of graph parameters.
Using the Chernoff bound with \(\delta = 1/2\) (since \((1+\delta)\mu = 75\)):
\[ \Pr[X \geq 75] \leq \left(\frac{e^{1/2}}{(3/2)^{3/2}}\right)^{50} = \left(\frac{\sqrt{e}}{(3/2)^{3/2}}\right)^{50}. \]Now \((3/2)^{3/2} = (3/2)\sqrt{3/2} \approx 1.5 \times 1.225 \approx 1.837\), and \(\sqrt{e} \approx 1.649\), so the base is \(\approx 1.649/1.837 \approx 0.897\). Thus \(\Pr[X \geq 75] \leq (0.897)^{50} \approx e^{50 \ln 0.897} \approx e^{-5.4} \approx 0.0045\).
Alternatively, the simpler bound \(e^{-\mu \delta^2/3} = e^{-50/12} \approx e^{-4.17} \approx 0.015\) gives a slightly weaker but easier estimate. The exact binomial probability is \(\Pr[X \geq 75] = \sum_{k=75}^{100}\binom{100}{k}2^{-100} \approx 2.8 \times 10^{-7}\), showing that the Chernoff bound, while useful, is not tight here — exponential concentration bounds can overestimate the tail by several orders of magnitude, but they are sufficient for most applications.
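The whole comparison can be carried out exactly, since Python's binomial coefficients are exact integers; a short script of mine, not from the notes:

```python
from math import comb, exp

n, mu, delta = 100, 50, 0.5
threshold = 75

# Multiplicative Chernoff bound: (e^d / (1+d)^(1+d))^mu.
chernoff = (exp(delta) / (1 + delta) ** (1 + delta)) ** mu
# Simpler form exp(-mu * delta^2 / 3).
simple = exp(-mu * delta**2 / 3)
# Exact upper tail of Bin(100, 1/2), via exact integer arithmetic.
exact = sum(comb(n, k) for k in range(threshold, n + 1)) / 2**n

assert 0.004 < chernoff < 0.005
assert 0.014 < simple < 0.016
assert 2e-7 < exact < 4e-7
assert exact < chernoff < simple
```

The three values line up as the text describes: the exact tail is several orders of magnitude below both Chernoff estimates.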
Finally, Janson’s inequality handles lower tails for sums of dependent 0-1 variables arising in random graph theory.
Equivalently, combining with the first moment: \(\Pr[X = 0] \leq \exp\!\left(-\mu^2 / (\mu + \Delta)\right)\) via the second moment version.
- Sharing 1 edge (2 common vertices): the two triangles span 4 vertices and 5 edges, so each pair contributes \(p^5\). There are \(O(n^4)\) such pairs (choose the 4 vertices, then the shared edge). Total: \(O(n^4 p^5) = O(c^5/n) \to 0\).
- Sharing 2 edges is impossible for distinct triangles: two shared edges force all 3 vertices to coincide, so \(S = S'\), which is excluded from \(\Delta\).
For \(c = 1\): \(\Pr[X = 0] \leq e^{-1/6} \approx 0.85\). For \(c = 3\): \(\Pr[X = 0] \leq e^{-4.5} \approx 0.011\). This matches the intuition that well above the threshold \(c = 1\), triangles are almost certain.
Janson’s inequality is the primary tool for lower-tailing events in random graphs — for example, showing that a random graph near the threshold for containing a given subgraph \(H\) has \(H\) with high probability. The correction term \(\Delta\) captures the extent to which overlapping copies of \(H\) positively correlate, and when \(\Delta \ll \mu^2\), the Janson bound is nearly tight.
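The quantities \(\mu\) and \(\Delta\) entering the bound can be written out exactly for triangles. The sketch below (my own, using the convention \(\Pr[X=0] \leq e^{-\mu + \Delta/2}\) with \(\Delta\) summed over ordered overlapping pairs; conventions differ by a factor of 2) reproduces the numbers quoted for \(c = 1\) and \(c = 3\):

```python
from math import comb, exp

def janson_triangle_bound(n, c):
    """Janson-type upper bound on Pr[no triangle] in G(n, p) with p = c/n:
    exp(-mu + Delta/2), where Delta runs over ordered pairs of distinct
    triangles sharing an edge (5 edges in the union)."""
    p = c / n
    mu = comb(n, 3) * p**3
    delta = comb(n, 3) * 3 * (n - 3) * p**5
    return exp(-mu + delta / 2)

# As n grows, the bound approaches exp(-c^3/6): ~0.85 for c = 1, ~0.011 for c = 3.
for c, target in [(1.0, exp(-1 / 6)), (3.0, exp(-4.5))]:
    assert abs(janson_triangle_bound(10_000, c) - target) < 0.01
```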
Chapter 7: The Probabilistic Method in Graph Theory
Graph theory is the home territory of the probabilistic method. Nearly every major structural result about graphs — from Ramsey theory to colouring to discrepancy — has a probabilistic proof or a probabilistic component in the best known bound.
The most famous open problem in extremal combinatorics concerns Ramsey numbers. We have seen \(R(k,k) > 2^{k/2}\). The upper bound from Pascal’s identity gives \(R(k,k) \leq 4^k\). Despite the simplicity of both bounds, the gap has not been closed in 75 years.
One of Erdős’s most celebrated theorems shows that the probabilistic method can construct graphs with simultaneously extreme properties — high chromatic number and high girth — that seem contradictory.
This theorem was the watershed moment for the probabilistic method. Before 1959, many mathematicians believed that high girth would force large independent sets (and hence low chromatic number), via a heuristic analogy with tree-like sparse graphs. Erdős’s proof demolished this intuition entirely.
- Girth \(\mathrm{girth}(G') \geq g\),
- \(|V(G')| \geq n/2\),
- \(\alpha(G') \leq 3n^{1-1/g}\log n\) (with high probability),
- \(\chi(G') \geq n^{1/g} / (6\log n)\).
An important quantitative remark: graphs of girth \(g\) and chromatic number \(k \geq 4\) must be large, since any graph with \(\chi \geq 4\) contains a subgraph of minimum degree at least 3, and the Moore bound then forces \(2^{\Omega(g)}\) vertices. This mirrors the cage problem, where one seeks the minimum number of vertices in a \(d\)-regular graph of girth \(g\): the Moore bound gives roughly \(\Omega(d^{(g-2)/2})\) vertices, showing that high girth forces exponential size in \(g\).
- Expected number of 3-cycles: \(\binom{100}{3}p^3 \approx \frac{10^6}{6} \times (0.0056)^3 \approx 162{,}000 \times 1.76 \times 10^{-7} \approx 0.029\). Very few triangles; removing one vertex per triangle costs \(\approx 0.029\) vertices.
- Expected independence number: the first moment gives \(\alpha(G(100, p)) \lesssim 3 \ln(100)/p \approx 3 \times 4.6 / 0.0056 \approx 2460\), but \(n = 100\), so this bound is trivial; the union-bound estimate \(\alpha \leq 3n^{3/4}\log n \approx 3 \times 31.6 \times 4.6 \approx 436\) is likewise too large relative to \(n = 100\). The moral is that the construction is genuinely asymptotic and says nothing at such small \(n\).
The probabilistic method also underlies the Szemerédi Regularity Lemma, one of the most powerful structural tools in graph theory.
The probabilistic method also illuminates discrepancy theory, which asks how evenly a 2-colouring of a ground set can distribute mass across a set system.
The proof uses an ingenious partial colouring lemma: by a linear-algebraic argument combined with a probabilistic step, one can always find a partial \(\pm 1\) colouring of a constant fraction of points such that every set has discrepancy at most \(C\sqrt{n}\). Iterating gives the full bound. The constant 6 is nearly optimal — there exist systems requiring discrepancy \(\Omega(\sqrt{n})\).
For arithmetic progressions in \(\{1, \ldots, n\}\), Roth’s theorem (1964) shows that for any \(\pm 1\) colouring, there exists an arithmetic progression of discrepancy \(\Omega(n^{1/4})\). This is proved using Fourier analysis on \(\mathbb{Z}_n\): the discrepancy is lower-bounded by the \(\ell^2\) norm of the Fourier transform of the colouring weighted by the Fourier coefficients of arithmetic progressions. Matoušek and Spencer (1996) showed the correct bound is \(\Theta(n^{1/4})\), closing the question for 1-dimensional arithmetic progressions.
The connection to the probabilistic method: the probabilistic colouring (random \(\pm 1\) assignment) has expected discrepancy \(\Theta(\sqrt{n \log m})\) for \(m\) sets of size \(n/2\) by the union bound and Chernoff. Spencer’s insight was that the random colouring is not optimal — the probabilistic method only provides the starting point, and the real work is in the derandomization via the partial colouring lemma.
The probabilistic method also connects to spectral graph theory through expander graphs and the tight relationship between randomness and pseudorandomness.
The expansion property — every small set has many neighbours outside it — is fundamental in theoretical computer science for constructing error-correcting codes, pseudorandom generators, and derandomization schemes.
Chapter 8: Algorithmic Applications and Derandomization
The probabilistic method originally produced purely existential results: objects with certain properties exist, but no efficient procedure was known to find them. Over the past four decades, this limitation has largely been overcome through two complementary approaches: derandomization (converting probabilistic arguments into deterministic algorithms) and direct randomized algorithms (using the probabilistic argument constructively).
The algorithmic counterpart of the probabilistic method raises a deep question: when a probabilistic argument certifies existence, how hard is it to find the certifying object? The answer varies:
- If the argument is a first moment argument (remove one element per violation), derandomization via conditional expectations almost always gives a polynomial-time algorithm.
- If the argument uses the LLL, the Moser-Tardos algorithm (2010) gives an efficient randomized algorithm (converted to deterministic via derandomization of the resamplings).
- If the argument uses second moments or non-constructive concentration, finding the object is often computationally hard — the "planted clique" problem is a canonical example.
The simplest illustration is the MAX-CUT problem, whose probabilistic proof is one of the cleanest in all of algorithm design.
The factor \(1/2\) is not the end of the story: the Goemans-Williamson SDP algorithm achieves the approximation ratio \(\alpha_{\text{GW}} \approx 0.878\), which is optimal assuming the Unique Games Conjecture. The existential proof immediately yields an algorithm: the randomized algorithm succeeds in expectation, and repeated sampling achieves any desired success probability.
Derandomization extracts a deterministic algorithm from a probabilistic argument by replacing random choices with deterministic ones, guided by a potential function.
The key requirement is that the conditional expectation can be computed efficiently. For the MAX-CUT application, each vertex is assigned greedily to the side that maximizes the expected number of cut edges given previous assignments, recovering a \(1/2\)-approximation deterministically.
A more sophisticated derandomization applies to hypergraph 2-colouring.
Order the vertices \(1, 2, \ldots, 10\). At each step, assign vertex \(v_j\) to side \(S\) or \(\overline{S}\) greedily. The conditional expected number of cut edges after assigning \(v_1, \ldots, v_j\) is:
\[ \Phi(x_1,\ldots,x_j) = (\text{cut edges so far}) + \frac{1}{2} \times (\text{edges between assigned and unassigned vertices}) + \frac{1}{2} \times (\text{edges among unassigned vertices}). \]At each step \(j\), we choose the assignment \(x_j\) (to \(S\) or \(\overline{S}\)) that maximises the expected cut from edges involving \(v_j\). For the Petersen graph, this greedy process produces a cut of size \(\geq 8\). Much more is achievable: the maximum cut of the Petersen graph is 12. It cannot be 15, since the graph is not bipartite, but vertex-transitivity and triangle-freeness make large cuts easy to find.
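The greedy step is a few lines of code. The sketch below (my own, using one standard labelling of the Petersen graph's edges) runs the conditional-expectations rule and certifies a cut of size at least \(\lceil 15/2 \rceil = 8\):

```python
# Petersen graph: outer 5-cycle 0..4, inner pentagram 5..9, spokes i -- i+5.
edges = ([(i, (i + 1) % 5) for i in range(5)]
         + [(5 + i, 5 + (i + 2) % 5) for i in range(5)]
         + [(i, i + 5) for i in range(5)])
assert len(edges) == 15

def derandomized_max_cut(n, edges):
    """Conditional expectations: place each vertex on the side that cuts
    more of its edges to already-assigned neighbours (ties -> side 0)."""
    side = {}
    for v in range(n):
        # cut0 = edges cut if v goes to side 0, i.e. assigned neighbours on side 1.
        cut0 = sum(1 for u, w in edges
                   if (u == v and side.get(w) == 1) or (w == v and side.get(u) == 1))
        cut1 = sum(1 for u, w in edges
                   if (u == v and side.get(w) == 0) or (w == v and side.get(u) == 0))
        side[v] = 0 if cut0 >= cut1 else 1
    return side

side = derandomized_max_cut(10, edges)
cut = sum(1 for u, v in edges if side[u] != side[v])
assert cut >= 8  # guaranteed: the conditional expectation never drops below 15/2
```

On this labelling the greedy rule happens to find a cut of size 12, the true maximum, though only \(\geq 8\) is guaranteed by the argument.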
Pairwise independence allows significant reduction in the randomness used by probabilistic algorithms while preserving key concentration properties.
Pairwise-independent hash families can be constructed with \(O(\log n)\) random bits (versus \(\Theta(n)\) for full independence), yet suffice for applications that only require pairwise independence — such as Chebyshev’s inequality with \(n\) variables and sum \(X = \sum X_i\), where \(\mathrm{Var}(X) = \sum \mathrm{Var}(X_i)\) still holds under pairwise independence (since covariance terms are zero).
This construction uses only \(2\log_2 p = O(\log p)\) random bits (choosing \(a\) and \(b\)) to define the hash function. Contrast with a truly random function from \(\mathbb{F}_p\) to \(\mathbb{F}_p\), which requires \(p \log_2 p = \Omega(p)\) bits. The savings in randomness come at no cost in the pairwise independence property. For \(k\)-wise independence (any \(k\) inputs are independent), use polynomials of degree \(k-1\) over \(\mathbb{F}_p\): \(h(x) = a_{k-1}x^{k-1} + \cdots + a_1 x + a_0\). This requires \(k\log_2 p\) random bits and gives \(k\)-wise independence.
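The pairwise-independence claim for \(h(x) = ax + b\) can be verified exactly by enumeration: for any fixed \(x \neq y\), the map \((a, b) \mapsto (h(x), h(y))\) is a bijection on \(\mathbb{F}_p^2\), so the pair is uniform. A small check with the arbitrary choice \(p = 5\):

```python
from itertools import product

p = 5  # small prime, arbitrary choice for the check
for x, y in product(range(p), repeat=2):
    if x == y:
        continue
    counts = {}
    # enumerate all p^2 seeds (a, b) of the hash family h(x) = ax + b mod p
    for a, b in product(range(p), repeat=2):
        pair = ((a * x + b) % p, (a * y + b) % p)
        counts[pair] = counts.get(pair, 0) + 1
    # pairwise independence: every value pair occurs exactly once
    assert len(counts) == p * p and set(counts.values()) == {1}
print("pairwise independence verified for p =", p)
```

The same enumeration with degree-\((k-1)\) polynomials verifies \(k\)-wise independence, at the cost of \(p^k\) seeds.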
Expander graphs provide perhaps the deepest connection between probabilistic combinatorics and algorithms.
This lemma shows that expanders behave like random graphs with respect to edge distribution. Since random regular graphs are expanders (by probabilistic arguments), explicit constructions of expanders (Ramanujan graphs, zig-zag products) provide the “derandomized” analogues of random graphs for applications in coding theory and pseudorandomness.
Chapter 9: Applications in Number Theory, Geometry, and Information Theory
The probabilistic method reaches far beyond graph theory. This final chapter surveys its impact across number theory, discrete geometry, information theory, and the connections to modern extremal combinatorics.
Number Theory
Schur’s theorem states that for any \(r\)-colouring of \(\{1,\ldots,N\}\) with \(N\) large enough, there exists a monochromatic solution to \(x + y = z\). Probabilistic arguments give exponential lower bounds on the minimal such \(N\) (the Schur number \(S(r)\)), whose exact growth rate remains unknown.
The probabilistic method gives good but not optimal Sidon sets, as seen in Chapter 2: the deletion method yields Sidon subsets of \([n]\) of size \(\Theta(n^{1/3})\), while algebraic constructions (Singer, Bose-Chowla) achieve the optimal order \(n^{1/2}\). More generally, for \(B_h\) sets (where all \(h\)-fold sums are distinct), deletion gives sets of size \(\Theta(n^{1/(2h-1)})\), and the Bose-Chowla construction achieves the optimal order \(n^{1/h}\).
More concretely: Roth’s theorem (dense case) says every subset of \([N]\) with density \(\delta > 0\) contains a 3-AP. The sparse random analogue (conjectured by Kohayakawa-Łuczak-Rödl, proved by Conlon-Gowers and Schacht) says that for a random subset of \([N]\) of density \(p \geq N^{-1/2+\varepsilon}\), with high probability every subset of positive relative density contains a 3-AP. The exponent \(-1/2\) is optimal: for \(p \ll N^{-1/2}\), the random set contains only \(O(p^3 N^2) = o(pN)\) 3-APs, so deleting one element from each leaves a 3-AP-free subset of relative density \(1 - o(1)\).
Sets avoiding arithmetic progressions provide a deep application. Behrend’s construction (1946) of 3-AP-free subsets of \([n]\) of size \(n \exp(-C\sqrt{\log n})\) uses a probabilistic ingredient: project a dense subset of the sphere onto a grid and observe that spheres contain no 3-term APs (since \(a + c = 2b\) and \(\|a\|^2 = \|b\|^2 = \|c\|^2\) force \(a = b = c\) by the parallelogram law).
At the other extreme, the Green-Tao theorem (2004) states that the prime numbers (a set of density 0 in \(\mathbb{N}\)) contain arithmetic progressions of every finite length. The proof combines Szemerédi’s theorem with a “transference principle”: one shows that the primes behave, for the purpose of counting APs, like a dense subset of a pseudorandom set, so Szemerédi’s theorem (which applies to dense sets) transfers to the primes. This is a profound application of probabilistic reasoning to number theory — the primes are not dense, but they sit inside a “pseudorandom” majorant in a precise technical sense that makes the Szemerédi machinery applicable.
Geometry
In discrete geometry, the probabilistic method is central to the theory of \(\varepsilon\)-nets, which underpin nearly all of modern computational geometry.
Turán’s theorem is proved instead as follows: take a uniformly random ordering of the vertices, and let \(X\) count the vertices that are adjacent to every vertex preceding them. Such vertices form a clique, and a vertex \(v\) is counted exactly when it precedes all of its \(n - 1 - d(v)\) non-neighbours, which happens with probability \(1/(n - d(v))\). Hence \(\mathbb{E}[X] = \sum_v 1/(n - d(v))\). If \(G\) is \(K_{r+1}\)-free, every clique has size at most \(r\), so by the Cauchy-Schwarz inequality \(r \geq \mathbb{E}[X] \geq \frac{n^2}{\sum_v (n - d(v))} = \frac{n^2}{n^2 - 2m}\). Rearranging gives \(m \leq (1 - 1/r)n^2/2\), the edge count of the Turán graph \(T(n,r)\) up to lower-order terms. \(\square\)
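The random-ordering (Caro-Wei style) computation behind Turán-type bounds can be checked numerically. A sketch on the Turán graph \(T(12,3)\) (three parts of size 4, an illustrative choice), where \(\sum_v 1/(n - d(v))\) equals \(r\) exactly and every ordering realizes the bound:

```python
import random

random.seed(0)
n, r, part = 12, 3, 4           # Turan graph T(12,3): parts {0..3}, {4..7}, {8..11}
deg = [n - part] * n            # every vertex has degree 8

# the random-ordering clique bound: sum_v 1/(n - d(v)) = 12 * 1/4 = 3 = r
bound = sum(1 / (n - d) for d in deg)
assert abs(bound - r) < 1e-9

for _ in range(500):
    order = random.sample(range(n), n)
    # X = number of vertices adjacent to every earlier vertex
    #   = number of vertices that are first within their own part
    X = sum(1 for i, v in enumerate(order)
            if all(order[j] // part != v // part for j in range(i)))
    assert X == r               # in T(12,3), exactly one vertex per part qualifies
print("bound:", bound, "and X = 3 in every sampled ordering")
```

For the Turán graph the inequality \(\mathbb{E}[X] \geq n^2/(n^2 - 2m)\) is an equality, which is exactly the extremal case of the theorem.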
The \(\varepsilon\)-net theorem is the foundation of PAC (probably approximately correct) learning theory and underpins algorithms for range searching, minimum enclosing balls, and linear programming in fixed dimension.
Application to geometric algorithms: the Clarkson-Shor random sampling technique for convex hull and related computations uses this. For a random sample \(R\) of \(r\) points from \(P\), each region defined by the hull of \(R\) has, in expectation, \(O(n/r)\) points of \(P\) in conflict with it (for instance, lying outside the corresponding hull facet). Recursing on these conflict lists, with the sample size chosen to balance the cost of processing \(R\) against the cost of resolving conflicts, yields randomized incremental algorithms that run in optimal \(O(n \log n)\) expected time — far better than the naive \(O(n^2)\) approach.
The Szemerédi-Trotter theorem on point-line incidences is another classical gem, proved probabilistically.
Information Theory
Claude Shannon’s 1948 proof that random codes achieve channel capacity is the foundational example of the probabilistic method in information theory. It is sometimes described as the most consequential application of the probabilistic method, because it simultaneously established the existence of capacity-achieving codes and exposed a separate, highly non-trivial problem — finding such codes efficiently — whose resolution took some six decades (polar codes by Arıkan in 2009, and LDPC codes shown to approach capacity under belief propagation by Richardson-Urbanke in the 2000s).
The entropy method for combinatorial bounds uses Shannon entropy as a counting tool. The key inequality is Shearer’s lemma:
Shearer’s lemma gives clean proofs of several classical combinatorial inequalities. For instance, taking \(\mathcal{F}\) to be the edge-set of a \(d\)-regular bipartite graph immediately recovers the Loomis-Whitney inequality bounding the volume of a set in terms of its projections. Applied to graph homomorphisms, Shearer’s lemma gives tight bounds on the number of copies of one graph inside another.
Pseudorandomness and Explicit Constructions
A recurring theme in the course is the gap between probabilistic existence and explicit construction. Random objects are often near-optimal, but finding them efficiently — or constructing explicit analogues — is much harder.
Modern Developments: Threshold Phenomena
The course concludes with a survey of deep modern results connecting probabilistic combinatorics to Boolean function analysis.
Another algebraic application: Babai and Frankl used the probabilistic method to show that most \(k\)-element subsets of \(\mathbb{F}_q^n\) are “independent” in a specific coding-theoretic sense, giving random constructions of codes meeting the Plotkin bound.
This theorem says that monotone functions near threshold cannot be “immune” to all individual variables — some variable must have logarithmically large influence. It is a key ingredient in the theory of sharp thresholds for random graph properties.
The Fourier-analytic approach: write the indicator function \(f = \mathbf{1}[\mathcal{P}]\) in the Fourier-Walsh basis \(\{W_S\}_{S \subseteq \binom{[n]}{2}}\) where \(W_S = \prod_{e \in S}(2\mathbf{1}[e \in G]-1)\). Then \(\mathrm{Inf}_e(f) = \sum_{S \ni e} \hat{f}(S)^2\) and the total influence equals the “average sensitivity” of \(f\). The KKL theorem says the maximum influence is \(\Omega(\mathrm{Var}(f)\log n/n)\), connecting the combinatorial notion of influence to the Fourier spectrum.
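Influences can be computed by brute force for small \(n\), which makes the definitions concrete. A sketch comparing the majority function (all influences equal, \(\binom{4}{2}/2^4 = 0.375\) for \(n = 5\)) with a dictator (one influence 1, the rest 0); the function choices are illustrative:

```python
from itertools import product

def influences(f, n):
    """Inf_i(f) = Pr[f(x) != f(x with bit i flipped)] over uniform x in {0,1}^n."""
    infs = []
    for i in range(n):
        flips = sum(1 for x in product([0, 1], repeat=n)
                    if f(x) != f(x[:i] + (1 - x[i],) + x[i + 1:]))
        infs.append(flips / 2 ** n)
    return infs

maj = lambda x: int(sum(x) > len(x) / 2)   # majority of 5 bits
dict0 = lambda x: x[0]                     # dictator on bit 0

assert influences(maj, 5) == [0.375] * 5   # bit pivotal iff other 4 split 2-2
assert influences(dict0, 5) == [1.0, 0.0, 0.0, 0.0, 0.0]
```

Majority already shows the KKL phenomenon in miniature: a balanced, symmetric function cannot make every influence tiny.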
The Kahn-Kalai conjecture (proved by Park and Pham in 2022) resolves a long-standing open problem by showing that thresholds are determined, up to a logarithmic factor, by the first moment. The gap between threshold and expectation threshold is at most logarithmic for all monotone properties — a remarkable unification of the first moment method with the full theory of threshold phenomena. The proof combines entropy-style encoding arguments with spread measures (a concept from the Alweiss-Lovett-Wu-Zhang sunflower work) and a probabilistic sunflower lemma, and represents the current state of the art in probabilistic combinatorics.
This example shows the power and the limitations of the theorem: it gives the right order of magnitude (logarithmic factors) but not the exact constant. The expectation threshold is a purely first-moment quantity, yet it captures the true threshold up to \(O(\log n)\). For spanning subgraphs (spanning trees, Hamilton cycles, perfect matchings), the expectation threshold is computable, and the Park-Pham theorem gives tight bounds.
The expectation threshold: \(\mathbb{E}[\text{Hamilton cycles}] = \frac{(n-1)!}{2} p^n\). Setting to 1: \(p^n \sim 2/(n-1)! \sim 2e^n/n^n\), giving \(q \sim e/n\). The Park-Pham theorem gives \(p^* \leq O(q \log n) = O(\log n / n)\) — matching the actual threshold \(p^* = \ln n / n\).
Reflections on the Probabilistic Method
Looking back across the nine chapters, several themes recur.
The power of linearity. Linearity of expectation, despite being a tautology, underlies every first-moment argument. Its power comes from decoupling: even when events are highly correlated, their individual contributions to the expectation are independent. This decoupling is the key reason why probabilistic arguments often give cleaner bounds than direct combinatorial counting.
The tension between existence and construction. The probabilistic method certifies existence without providing a witness. The degree of difficulty in finding the witness varies enormously: MAX-CUT witnesses are easy (any random assignment suffices on average), LLL witnesses are efficiently findable (Moser-Tardos), but second-moment witnesses (large cliques, Ramsey colourings) are believed to be computationally hard. This tension — between “it exists” and “you can find it” — is one of the deepest themes in modern theoretical computer science.
The role of independence. Full independence is rarely available in combinatorial structures, but the second moment method, LLL, and FKG inequality each provide a framework for “partial independence” that suffices. The LLL requires only a bounded dependency graph; the second moment method requires only that the covariance terms be bounded; FKG requires only monotonicity. Understanding how much independence is truly needed for a given argument is an active and productive area.
Concentration and pseudorandomness. Random objects are concentrated near their mean (Chernoff, Azuma, Talagrand), and this concentration is precisely what makes them “pseudorandom” — random graphs behave like the complete graph with every edge weighted by the density \(p\), and random matrices behave like their expected counterpart. Explicit pseudorandom objects (Ramanujan graphs, explicit designs, extractors) try to capture this concentration deterministically.
| Topic | Key Technique | Primary Reference |
|---|---|---|
| Ramsey lower bounds | First moment, union bound | Alon-Spencer Ch. 1 |
| Independence number | Alteration | Alon-Spencer Ch. 3 |
| Clique number in \(G(n,p)\) | Second moment | Alon-Spencer Ch. 4 |
| Lovász Local Lemma | Dependency graphs | Alon-Spencer Ch. 5 |
| FKG inequality | Lattice correlations | Alon-Spencer Ch. 6 |
| Chernoff bounds | MGF method | Alon-Spencer Ch. 7 |
| Azuma / Talagrand | Martingale concentration | Alon-Spencer Ch. 7 |
| High girth + high chromatic number | Alteration, first moment | Alon-Spencer Ch. 3 |
| Expanders | Spectral methods | Alon-Spencer Ch. 9 |
| \(\varepsilon\)-nets | Double sampling | Haussler-Welzl 1987 |
| Shannon capacity | Random codes, Chernoff | Shannon 1948 |
| Shearer’s lemma | Entropy method | Chung-Frankl-Graham-Shearer 1986 |
| Park-Pham theorem | Spread measures | Park-Pham 2022 |
Chapter 10: The Algorithmic Lovász Local Lemma
The classical LLL guarantees existence but provides no algorithm. For nearly thirty years, a central open problem was whether the LLL witness — an assignment of variables avoiding all bad events — could be found efficiently. The answer, given by Moser and Tardos in 2010, is a beautiful yes: a surprisingly simple resampling procedure finds the witness in expected polynomial time, and its running-time analysis is an elegant application of information theory.
10.1 The Moser-Tardos Resampling Algorithm
The setting is the variable version of the LLL. We have independent random variables \(X_1, \ldots, X_n\) with arbitrary distributions, and a collection of bad events \(A_1, \ldots, A_m\), each depending on some subset \(\mathrm{vbl}(A_i) \subseteq \{1, \ldots, n\}\) of the variables. The dependency graph \(G\) has vertex set \(\{A_i\}\) with edges between events sharing a variable.
Correctness is immediate: if the algorithm terminates, the final assignment avoids every bad event. The remarkable content is the efficiency bound — under the LLL condition, the algorithm terminates after at most \(\sum_i x_i/(1-x_i)\) resamplings in expectation.
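The resampling procedure is short enough to state as code. A sketch for 3-SAT, a direct instance of the variable model (clauses are the bad events, each depending on its own variables); the formula below is an arbitrary small satisfiable example, and the event-selection rule (first violated clause) is one of several valid choices:

```python
import random

random.seed(1)

# A clause is a list of signed literals: (var, want) requires x[var] == want.
clauses = [[(0, 1), (1, 0), (2, 1)],
           [(1, 1), (2, 0), (3, 1)],
           [(0, 0), (2, 1), (3, 0)],
           [(0, 1), (1, 1), (3, 0)]]
n = 4

def violated(x, clause):
    """A clause is a bad event: it holds when every literal is falsified."""
    return all(x[v] != want for v, want in clause)

x = [random.randint(0, 1) for _ in range(n)]   # initial uniform sample
resamples = 0
for _ in range(100000):                        # safety cap; terminates a.s.
    bad = [c for c in clauses if violated(x, c)]
    if not bad:
        break
    for v, _ in bad[0]:                        # resample vbl(A) for one bad event
        x[v] = random.randint(0, 1)
    resamples += 1

assert not any(violated(x, c) for c in clauses)
print("satisfying assignment after", resamples, "resamplings:", x)
```

Nothing cleverer than "resample the variables of a violated clause" is happening; the entropy compression argument of the next section is what shows this cannot run for long.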
10.2 The Entropy Compression Argument
The proof technique, known as entropy compression, is one of the most elegant arguments in modern combinatorics. The idea is to encode the execution history of the algorithm in a compressed form, and show that long runs would require an impossible amount of information.
Specifically, run the algorithm in reverse: given the final state and the sequence \(\sigma\) read backwards, one can reconstruct the resampling table backwards. This means the resampling table (a sequence of \(t + n\) fresh samples) is encoded by \((\sigma, \text{final state})\). The number of possible sequences \(\sigma\) of length \(t\) over \(m\) events is \(m^t\). The number of final states is bounded by the size of the probability space. So the resampling table has at most \(m^t \cdot |\Omega|\) encodings, while there are \(|\Omega|^{t+n}\) possible tables. When these counts are incompatible, long runs are impossible. Careful bookkeeping with the LLL witness values \(x_i\) gives the bound. \(\square\)
The entropy compression method is more broadly applicable than the specific Moser-Tardos setting: it has been used to bound nonrepetitive sequences and colourings (Thue-type problems), and appears in proofs of acyclic edge colouring bounds.
10.3 Applications of the Algorithmic LLL
This is not the fastest known colouring algorithm (greedy is \(O(n + m)\) deterministically), but the algorithmic LLL shines in constrained settings where greedy fails — list colouring, defective colouring, and coloring with local constraints.
The Moser-Tardos algorithm requires that events depend on independent variables. An extension due to Achlioptas and Iliopoulos (2016) handles more complex random structures, such as permutations, via a random-walk framework, at the cost of a weaker bound. Algorithmic versions of the LLL beyond the variable setting remain an active area of research.
Chapter 11: Coding Theory and the Probabilistic Method
Shannon’s 1948 paper “A Mathematical Theory of Communication” is arguably the founding document of information theory. At its heart is a probabilistic existence argument: random codes achieve the channel capacity. This chapter develops the connections between the probabilistic method and coding theory.
11.1 Random Codes Achieve Shannon Capacity
A binary channel with crossover probability \(p\) flips each bit independently with probability \(p\). The capacity is \(C = 1 - H(p)\) where \(H(p) = -p\log_2 p - (1-p)\log_2(1-p)\) is the binary entropy.
A decoding error occurs if some other codeword \(c' \neq c\) is as close to \(y\) as \(c\). The Hamming ball of radius \((p+\varepsilon)n\) around \(y\) has size at most \(2^{nH(p+\varepsilon)}\). For a random codeword \(c'\), the probability that \(c'\) falls in this ball is \(2^{nH(p+\varepsilon)}/2^n = 2^{-n(1 - H(p+\varepsilon))}\). By the union bound over the \(2^{Rn} - 1\) other codewords, the probability of any decoding error is at most:
\[ 2^{Rn} \cdot 2^{-n(1 - H(p+\varepsilon))} = 2^{-n(C - R - H(p+\varepsilon) + H(p))}. \]For \(R < C\) and \(\varepsilon\) small enough, the exponent is negative, so this probability goes to 0 exponentially in \(n\). Since the average error (over random codes) is small, some specific code has small error. \(\square\)
The argument is pure first-moment method: the expected error probability over random codes is small, so a good code exists. It is existential and non-constructive — finding explicitly capacity-achieving codes is a major engineering challenge, partially resolved by turbo codes (Berrou-Glavieux, 1993) and LDPC codes (Gallager, 1962, rediscovered 1996).
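The quantities in the random-coding bound are easy to compute. A numeric companion, with illustrative parameter choices (\(p = 0.11\), \(\varepsilon = 0.02\), and a rate set to half the achievable maximum): it evaluates \(C = 1 - H(p)\) and the union-bound estimate \(2^{-n(1 - R - H(p+\varepsilon))}\), whose exponent is positive whenever \(R < 1 - H(p+\varepsilon)\):

```python
from math import log2

def H(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

p, eps = 0.11, 0.02                 # crossover probability and slack (arbitrary)
C = 1 - H(p)                        # channel capacity
R = 0.5 * (1 - H(p + eps))          # any rate below 1 - H(p + eps) works
exponent = 1 - R - H(p + eps)       # per-symbol exponent of the error bound
assert 0 < C < 1 and exponent > 0

for n in (100, 200, 400):
    # union-bound estimate on the decoding error probability
    print(n, 2 ** (-n * exponent))  # shrinks exponentially in the block length
```

Doubling \(n\) squares the bound, which is the exponential decay the proof promises.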
11.2 The Gilbert-Varshamov Bound
The Gilbert-Varshamov (GV) bound gives the best known asymptotic lower bound on the size of binary codes with prescribed minimum distance.
For large codes (rate \(R\) and relative distance \(\delta = d/n\)), the GV bound gives \(R \geq 1 - H(\delta)\), comfortably inside the Hamming upper bound \(R \leq 1 - H(\delta/2)\). Whether the GV bound is tight for binary codes is one of the deepest open problems in coding theory — it is widely conjectured to be tight, and no family of binary codes asymptotically beating it is known. (Over alphabets of size \(q \geq 49\), the algebraic-geometry codes of Tsfasman-Vlăduţ-Zink do beat the \(q\)-ary GV bound.)
11.3 Linear Codes and the Probabilistic Argument
A linear code is a subspace \(C \leq \mathbb{F}_2^n\). Its generator matrix \(G \in \mathbb{F}_2^{k \times n}\) has rows forming a basis of \(C\). The minimum distance equals the minimum weight of a nonzero codeword.
The proof is a first-moment calculation: the expected number of nonzero codewords of weight \(\leq \delta n\) is \(\binom{n}{\leq \delta n} \cdot 2^{k-n} = \text{Vol}(n, \delta n) / 2^{n(1-R)}\), which is \(< 1\) when \(R < 1 - H(\delta)\). Hence with positive probability, no such codeword exists. Random linear codes are thus asymptotically as good as the GV bound, and they are far easier to analyze (and to encode) than random nonlinear codes.
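The first-moment calculation can be watched in miniature: sample random generator matrices, enumerate the \(2^k - 1\) nonzero codewords, and compare the best minimum distance found with the Gilbert-style volume count. Parameters (\(n = 12\), \(k = 4\)) are arbitrary and small enough to enumerate; with these values \(\mathrm{Vol}(12, 2) = 79 < 2^{8}\), so distance 3 is guaranteed to be reachable:

```python
import random
from itertools import product
from math import comb

random.seed(0)
n, k = 12, 4   # small illustrative parameters

def min_distance(G):
    """Minimum weight of a nonzero codeword of the code generated by G."""
    best = n
    for msg in product([0, 1], repeat=k):
        if not any(msg):
            continue
        word = [sum(msg[i] * G[i][j] for i in range(k)) % 2 for j in range(n)]
        best = min(best, sum(word))
    return best

# Gilbert-style count: distance d is achievable once Vol(n, d-1) < 2^(n-k)
assert sum(comb(n, i) for i in range(3)) < 2 ** (n - k)   # 79 < 256, so d = 3

best = max(min_distance([[random.randint(0, 1) for _ in range(n)]
                         for _ in range(k)])
           for _ in range(200))
assert best >= 3
print("best minimum distance over 200 random linear codes:", best)
```

A single random matrix already has distance \(\geq 3\) with probability at least \(\approx 0.7\) by the expectation count, so 200 trials find one essentially surely.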
Chapter 12: The Deletion Method and Hypergraph Coloring
The deletion method refines the first-moment argument by adding a “cleaning step”: after a probabilistic construction shows a large structure exists, delete elements causing problems, yielding a smaller but cleaner structure. This chapter develops the method and applies it to hypergraph coloring, discrepancy, and independent sets.
12.1 The Deletion Method
The archetype is the proof that every graph \(G\) with average degree \(d\) has an independent set of size \(\geq n/(2d)\) (a weaker form of the Turán bound).
The deletion method appears simple, but its power comes from the flexibility in choosing the deletion rule. The key idea is that we delete at most one vertex per violating structure, so the expected final size is the initial expected size minus the expected number of violations.
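The archetype runs in a few lines: sample each vertex with probability \(1/d\), then delete one endpoint of every surviving edge, so the expected final size is \(np - mp^2 = n/d - m/d^2 = n/(2d)\) when the average degree is \(d\). A sketch on a random graph with arbitrary small parameters:

```python
import random

random.seed(0)

n, d = 60, 6
p_edge = d / (n - 1)                      # G(n, p) with average degree ~ d
edges = [(u, v) for u in range(n) for v in range(u + 1, n)
         if random.random() < p_edge]

def deletion_independent_set(n, edges, d):
    # step 1: keep each vertex independently with probability p = 1/d
    S = {v for v in range(n) if random.random() < 1 / d}
    # step 2: delete one endpoint of every edge that survived inside S
    for u, v in edges:
        if u in S and v in S:
            S.discard(u)
    return S

S = deletion_independent_set(n, edges, d)
assert not any(u in S and v in S for u, v in edges)   # independence holds
print("independent set of size", len(S), "vs guarantee ~", n / (2 * d))
```

The cleaning step can only help: each violating edge costs at most one vertex, which is exactly the accounting the proof uses.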
12.2 Hypergraph Coloring
A hypergraph \(\mathcal{H} = (V, \mathcal{E})\) is \(k\)-uniform if every edge has \(k\) vertices. A 2-coloring is a function \(\chi: V \to \{+1,-1\}\). An edge \(e\) is monochromatic if all vertices in \(e\) receive the same color. The property B (or 2-colorability) problem asks: does \(\mathcal{H}\) have a 2-coloring with no monochromatic edge?
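The classical first-moment bound — a \(k\)-uniform hypergraph with \(m < 2^{k-1}\) edges is 2-colorable, since a uniform random coloring leaves \(m \cdot 2^{1-k} < 1\) monochromatic edges in expectation — turns directly into an algorithm by repeated sampling, with success probability at least \(1 - m/2^{k-1}\) per attempt. A sketch with an arbitrary example (\(k = 4\), \(m = 7 < 8\)):

```python
import random

random.seed(0)

k, n = 4, 10
edges = [(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7), (6, 7, 8, 9),
         (0, 3, 6, 9), (1, 4, 7, 8), (0, 2, 5, 8)]
assert len(edges) < 2 ** (k - 1)          # first-moment condition: 7 < 8

def monochromatic(chi, e):
    return len({chi[v] for v in e}) == 1

for attempt in range(1000):
    chi = [random.choice([+1, -1]) for _ in range(n)]
    if not any(monochromatic(chi, e) for e in edges):
        break

assert not any(monochromatic(chi, e) for e in edges)
print("proper 2-coloring found on attempt", attempt + 1)
```

Here each attempt succeeds with probability at least \(1/8\), so 1000 attempts fail only with probability \((7/8)^{1000}\), which is negligible.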
This is tight up to lower-order factors: there exist \(k\)-uniform hypergraphs with \(O(k^2 \cdot 2^k)\) edges that are NOT 2-colorable (Erdős, 1964), constructed as random \(k\)-sets drawn from a vertex set of size \(O(k^2)\). In the other direction, Radhakrishnan and Srinivasan (2000) showed that \(\Omega(2^k \sqrt{k/\ln k})\) edges are necessary — the best lower bound known.
The Radhakrishnan-Srinivasan improvement uses the nibble method (or “semi-random method”). Instead of coloring all vertices uniformly, color vertices in rounds, each round independently, and analyze the residual hypergraph.
The proof is one of the cleanest applications of the Lovász Local Lemma combined with the alteration idea: one first shows that a random 2-coloring leaves few highly monochromatic edges, then uses the LLL to color the residual problem. The threshold \(2^k/\sqrt{k}\) (up to constants) for the maximum degree condition is believed to be tight.
12.3 The Rödl Nibble
The nibble method (Rödl, 1985) is a powerful technique for constructing combinatorial structures by a sequence of random “nibbles” — small random selections that each make a little progress without disturbing the remaining structure too much.
The nibble argument works as follows: at each round, randomly select a sparse random set of \(K_k\) copies (each included independently with a small probability \(p\)). Among those selected, include a copy in the packing only if it doesn’t conflict with previously selected copies. The key computation shows that (1) the expected number of new edges covered in each round is large, (2) the maximum degree of the “residual” graph (uncovered edges) decreases geometrically, and (3) the process terminates with \(o(n^2)\) uncovered edges.
The nibble method has since been systematized into the random greedy algorithm and absorption-based approaches, and culminated in Keevash’s proof (2014) of the existence of designs — including Steiner systems — for all admissible parameters, one of the biggest results in combinatorics of the 21st century.
12.4 The Lovász Local Lemma for List Coloring
Chapter 13: The Polynomial Method
The polynomial method is a fundamentally different paradigm from the probabilistic method, but the two interact deeply. Algebraic identities constrain combinatorial structures, and probabilistic arguments can be translated into polynomial language. This chapter introduces the key results.
13.1 The Combinatorial Nullstellensatz
The proof is short: reduce the polynomial modulo \(g_i(x_i) = \prod_{s \in S_i} (x_i - s)\), one variable at a time; the degree constraint guarantees that the leading monomial survives the reduction, and the reduced polynomial agrees with \(f\) on the grid \(S_1 \times \cdots \times S_n\), so \(f\) cannot vanish identically there.
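A flagship corollary of the Combinatorial Nullstellensatz is the Cauchy-Davenport theorem: \(|A + B| \geq \min(p, |A| + |B| - 1)\) for nonempty \(A, B \subseteq \mathbb{Z}_p\). The statement can be checked exhaustively for a small prime (here \(p = 7\), an arbitrary choice):

```python
from itertools import combinations

p = 7  # small prime; all (A, B) pairs are enumerable
for ka in range(1, p + 1):
    for kb in range(1, p + 1):
        for A in combinations(range(p), ka):
            for B in combinations(range(p), kb):
                sumset = {(a + b) % p for a in A for b in B}
                # Cauchy-Davenport: |A + B| >= min(p, |A| + |B| - 1)
                assert len(sumset) >= min(p, ka + kb - 1)
print("Cauchy-Davenport verified exhaustively for p =", p)
```

The bound is sharp for arithmetic progressions with a common difference, e.g. intervals, which is visible already in this enumeration.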
13.2 The Chevalley-Warning Theorem
- An exponential lower bound for the chromatic number of the unit-distance graph of \(\mathbb{R}^n\): \(\chi(\mathbb{R}^n) \geq (1.2)^n\), proved by Frankl and Wilson (1981).
- Borsuk’s conjecture counterexample: there exist sets in \(\mathbb{R}^n\) that cannot be partitioned into fewer than \((1.2)^{\sqrt{n}}\) parts of smaller diameter, due to Kahn and Kalai (1993).
These combinatorial bounds come from pure algebra, with no probabilistic component — a reminder that the probabilistic method, powerful as it is, is one tool among many.
13.3 The Cap Set Problem
The cap set problem asks for the maximum size of a subset of \(\mathbb{F}_3^n\) with no 3-term arithmetic progression (equivalently, no three elements summing to zero). This is the “cap set” problem because \(\mathbb{F}_3^n\) is an \(n\)-dimensional affine space over \(\mathbb{F}_3\), and arithmetic progressions are affine lines.
For decades, the best upper bound was \(O(3^n/n^{1+\varepsilon})\) (Bateman-Katz, via Fourier analysis). In 2016, Croot-Lev-Pach and Ellenberg-Gijswijt gave a stunning polynomial method proof that the maximum cap set size is at most \(O(2.756^n)\) — an exponential improvement over all previous bounds, obtained via the slice rank method.
The proof uses the following key idea: encode a 3-AP-free set \(A \subseteq \mathbb{F}_3^n\) via the polynomial \(f(x,y,z) = \prod_{i=1}^n (1 - (x_i + y_i + z_i)^2)\), which is nonzero when \(x + y + z = 0\). The slice rank of this tensor (the minimum rank over decompositions into terms depending on at most two of the three variables) controls the size of \(A\). The polynomial method gives an upper bound on the slice rank in terms of \(n\), yielding the exponential bound. This method has since been applied to many other problems (sunflower conjecture, matrix multiplication complexity) and represents one of the most active areas in extremal combinatorics.
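For tiny dimensions the cap set problem can be solved by brute force, which grounds the definitions: the largest 3-AP-free subsets of \(\mathbb{F}_3^1\) and \(\mathbb{F}_3^2\) have sizes 2 and 4. A sketch (exhaustive search, feasible only for \(n \leq 2\)):

```python
from itertools import combinations, product

def max_cap(n):
    """Largest subset of F_3^n with no three distinct elements summing to 0."""
    points = list(product(range(3), repeat=n))

    def is_cap(S):
        # in F_3, three distinct points form a 3-AP iff x + y + z = 0 coordinatewise
        return not any(all((x[i] + y[i] + z[i]) % 3 == 0 for i in range(n))
                       for x, y, z in combinations(S, 3))

    return max(len(S)
               for size in range(1, len(points) + 1)
               for S in combinations(points, size) if is_cap(S))

assert max_cap(1) == 2   # {0,1,2} itself is a line
assert max_cap(2) == 4   # the cap number of AG(2,3)
print("cap sizes for n = 1, 2:", max_cap(1), max_cap(2))
```

Already at \(n = 4\) (cap size 20) exhaustive search is hopeless, which is why the slice rank bound is striking.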
Chapter 14: Turán-Type Problems and the Probabilistic Method
14.1 Turán’s Theorem and Its Probabilistic Proof
The classical proof is algebraic-combinatorial. A probabilistic complement: the Turán number \(\mathrm{ex}(n, K_{r+1})\) also arises naturally from the probabilistic method.
14.2 The Kruskal-Katona Theorem
The Kruskal-Katona theorem gives a sharp bound on the size of the shadow of a set system — a cornerstone of extremal set theory.
The probabilistic method gives a clean proof of the weaker bound \(|\partial \mathcal{F}| \geq (k/n) |\mathcal{F}|\): pick a random element uniformly from \([n]\), and let \(X\) be the number of sets in \(\mathcal{F}\) containing it. Then \(\mathbb{E}[X] = k |\mathcal{F}|/n\). For each element \(x\), the sets in \(\mathcal{F}\) containing \(x\) project down to sets in \(\partial \mathcal{F}\) by removing \(x\). Different projection images are distinct, giving \(|\partial \mathcal{F}| \geq \mathbb{E}[X] = k|\mathcal{F}|/n\).
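The weak shadow bound \(|\partial \mathcal{F}| \geq (k/n)|\mathcal{F}|\) is easy to sanity-check on random families; a sketch with arbitrary small parameters:

```python
import random
from itertools import combinations

random.seed(0)
n, k = 8, 4
universe = list(combinations(range(n), k))   # all k-subsets of [n]

for trial in range(100):
    F = random.sample(universe, random.randint(1, len(universe)))
    # the shadow: every (k-1)-set obtained by deleting one element
    shadow = {tuple(sorted(set(A) - {x})) for A in F for x in A}
    assert len(shadow) >= k * len(F) / n     # weak Kruskal-Katona bound
print("shadow bound verified on 100 random families")
```

Equality demands very structured families (initial segments of colex order), which is the content of the full Kruskal-Katona theorem.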
14.3 Ramsey Multiplicity
Ramsey multiplicity asks not just whether a monochromatic copy of a graph \(H\) exists, but how many must exist in any 2-coloring of \(K_n\).
A random 2-coloring gives \(\mathbb{E}[\text{monochromatic } K_3] = \binom{n}{3} \cdot 2 \cdot (1/2)^3 = \binom{n}{3}/4\), consistent with Goodman’s formula — the random coloring is optimal for triangles.
For larger cliques, Ramsey multiplicity is related to a conjecture of Erdős: that the random coloring minimizes the number of monochromatic \(K_r\) for all \(r\). Thomason (1989) disproved this for \(r \geq 4\); determining the true minimum remains one of the most subtle open problems in Ramsey theory, touching on the theory of “quasi-random” graphs and flag algebras.
Chapter 15: Bipartite Turán Problems and the Kővári-Sós-Turán Theorem
15.1 The Zarankiewicz Problem
The Zarankiewicz problem asks for the maximum number of edges in a bipartite graph on \(n + n\) vertices containing no complete bipartite subgraph \(K_{s,t}\). This is denoted \(z(n; s, t)\).
On the other hand, for any \(s\)-subset \(\{a_1, \ldots, a_s\} \subseteq A\), the number of vertices in \(B\) adjacent to all of them is at most \(t-1\) (otherwise we’d have a \(K_{s,t}\)). So the total cherry count is at most \(\binom{n}{s}(t-1)\). Combining:
\[ n \binom{e/n}{s} \leq \binom{n}{s}(t-1). \]Solving for \(e\) gives the claimed bound. \(\square\)
The KST theorem is a model example of double counting combined with convexity. Its applications are vast: it implies that every bipartite graph \(H\) has \(\mathrm{ex}(n, H) = O(n^{2-1/s})\), where \(s\) is the size of the smaller part of the bipartition, and it appears in incidence geometry (Szemerédi-Trotter), algebraic combinatorics, and the analysis of algorithms.
15.2 Lower Bounds via Random Algebraic Constructions
The KST upper bound \(z(n; s, t) = O(n^{2-1/s})\) is matched by algebraic lower bounds using polarity maps over finite fields.
For \(K_{3,3}\)-free graphs, Brown’s construction over \(\mathbb{F}_q\) achieves \(\Theta(n^{5/3})\) edges, matching the KST bound \(O(n^{5/3})\). These algebraic constructions are often the only known exact-order constructions (for \(K_{2,2}\), the Erdős-Rényi polarity graphs of projective planes give \(\Theta(n^{3/2})\) explicitly); whether the KST exponent \(2 - 1/s\) is correct already for \(K_{4,4}\) is a major open problem.
15.3 The Probabilistic Method for Bipartite Turán Problems
For \(K_{s,t}\) with \(s = 2\) and large \(t\), the Zarankiewicz problem connects to extremal set theory: viewing one side as a set system \(\mathcal{F} \subseteq 2^{[n]}\), the incidence graph is \(K_{2,t}\)-free iff every pair of elements of \([n]\) lies in at most \(t - 1\) sets of \(\mathcal{F}\) — a bounded pair-degree condition.
Norm graphs (Kollár-Rónyai-Szabó, 1996; Alon-Rónyai-Szabó, 1999) and the random algebraic method (Bukh, 2015; Conlon-Lee, 2021), which uses random low-degree algebraic varieties over finite fields, construct \(K_{s,t}\)-free graphs giving:
\[ z(n; s, t) = \Omega\left(n^{2-1/s}\right) \]for all \(s\) and \(t \geq (s-1)! + 1\), matching the KST bound in this range. This is the current state of the art; for \(t\) close to \(s\) — already for \(K_{4,4}\) — the correct order of \(z(n; s, t)\) remains unknown.
Chapter 16: Open Problems and the Frontier
The probabilistic method is not a static body of techniques but a living research program. We close with a brief survey of the most important open problems.
The \(R(5,5)\) question. The Ramsey number \(R(5,5)\) satisfies \(43 \leq R(5,5) \leq 48\). The lower bound comes from an explicit 2-colouring of \(K_{42}\) with no monochromatic \(K_5\) (Exoo, 1989); the upper bound is due to extensive computer-assisted work of McKay and Radziszowski, sharpening the naive recurrence \(R(5,5) \leq R(4,5) + R(5,4) = 50\). Despite decades of effort and the use of computers to check large classes of colorings, the exact value remains unknown. Closing this gap likely requires new ideas beyond both the probabilistic method and exhaustive search.
The Erdős-Hajnal conjecture. For every graph \(H\), every \(H\)-free graph on \(n\) vertices has a clique or independent set of size \(n^{c(H)}\) for some \(c(H) > 0\). For general graphs, Ramsey theory guarantees only size \(\Omega(\log n)\); Erdős and Hajnal themselves proved a bound of \(2^{c(H)\sqrt{\log n}}\) for \(H\)-free graphs. The conjecture posits a polynomial lower bound. It is proved for a few special graphs \(H\), but the general case is wide open.
Explicit constructions for Ramsey graphs. The probabilistic method gives graphs with no clique or independent set of size \(2 \log_2 n\). Paley graphs are only known to achieve \(O(\sqrt{n})\); Frankl-Wilson constructions achieve \(2^{O(\sqrt{\log n \log\log n})}\), and two-source extractor constructions (Chattopadhyay-Zuckerman, 2016) give \(n^{o(1)}\). An explicit construction matching the probabilistic \(O(\log n)\) bound remains a major open problem.
The sunflower conjecture. Erdős and Rado (1960) proved that any family of more than \(k! \cdot (r-1)^k\) \(k\)-sets contains an \(r\)-sunflower (\(r\) sets whose pairwise intersections all equal a common core), and conjectured that \(C(r)^k\) sets suffice for some constant \(C(r)\). Alweiss-Lovett-Wu-Zhang (2020) improved the bound to \((O(r \log k))^k\), a dramatic improvement using the spread measure technique that later appeared in the Park-Pham proof. The conjectured bound \(C(r)^k\) remains open.
Frankl’s union-closed conjecture. If \(\mathcal{F}\) is a union-closed family of sets (meaning \(A, B \in \mathcal{F} \Rightarrow A \cup B \in \mathcal{F}\)) with \(\mathcal{F} \neq \{\emptyset\}\), then some element appears in at least half the sets. Despite its elementary statement, this has resisted proof since 1979. A breakthrough by Gilmer (2022) via the entropy method gave the first constant-fraction bound: some element appears in at least \(0.01|\mathcal{F}|\) sets. Chase-Lovett (2023) improved this to \((3 - \sqrt{5})/2 \approx 0.382\), still short of 1/2.
These open problems illustrate a recurring pattern: the probabilistic method establishes existence, provides intuition, and even suggests the right answer, but the exact combinatorial truth often requires more algebraic or analytic structure than pure probability can provide. The interplay between random and deterministic constructions remains one of the richest and most productive areas in discrete mathematics.
Appendix A: Probability Toolbox
The probabilistic method draws on a compact but powerful set of probabilistic facts. This appendix collects the key inequalities and identities for reference.
A.1 Basic Expectations and Variance
For a random variable \(X\) taking non-negative integer values, the most useful identities are:
\[ \mathbb{E}[X] = \sum_{k=1}^\infty \Pr[X \geq k] = \sum_{k=0}^\infty \Pr[X > k], \]and for a sum \(X = \sum_{i} X_i\) of indicators (by linearity of expectation):
\[ \mathbb{E}[X] = \sum_i \Pr[X_i = 1]. \]The second moment method uses:
\[ \Pr[X > 0] \geq \frac{(\mathbb{E}[X])^2}{\mathbb{E}[X^2]}, \]known as the Paley-Zygmund inequality. For \(X = \sum_i X_i\) (indicator sum):
\[ \mathbb{E}[X^2] = \mathbb{E}[X] + \sum_{i \neq j} \Pr[X_i = 1, X_j = 1]. \]
A.2 Tail Bounds Summary
| Bound | Assumption | Statement |
|---|---|---|
| Markov | \(X \geq 0\) | \(\Pr[X \geq t] \leq \mathbb{E}[X]/t\) |
| Chebyshev | Finite variance | \(\Pr[\lvert X - \mathbb{E}[X]\rvert \geq t] \leq \mathrm{Var}(X)/t^2\) |
| Chernoff (upper) | Independent Bernoulli | \(\Pr[X \geq (1+\delta)\mu] \leq e^{-\mu\delta^2/3}\) for \(\delta \leq 1\) |
| Chernoff (lower) | Independent Bernoulli | \(\Pr[X \leq (1-\delta)\mu] \leq e^{-\mu\delta^2/2}\) for \(\delta \leq 1\) |
| Azuma | Bounded differences \(c_i\) | \(\Pr[\lvert X - \mathbb{E}[X]\rvert \geq t] \leq 2e^{-t^2/(2\sum_i c_i^2)}\) |
| Talagrand | Lipschitz, certifiable | \(\Pr[X \leq m - t]\Pr[X \geq m] \leq e^{-t^2/(4b^2)}\) |
| Janson (upper) | Positively correlated | \(\Pr[X = 0] \leq e^{-\mu + \Delta/2}\) |
| LLL (symmetric) | \(d\)-bounded dependence | \(\Pr[\bigcap_i \overline{A_i}] > 0\) if \(ep(d+1) \leq 1\) |
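The tail bounds in the table can be sanity-checked against exact binomial tails. A minimal sketch, assuming \(X \sim \mathrm{Bin}(n, 1/2)\) (so \(\mu = n/2\)); the helper `tail_ge` is introduced here for illustration, and the Chernoff expression is the upper-tail bound from the table.

```python
from math import comb, exp

n, p = 1000, 0.5          # X ~ Binomial(n, p), mu = n * p
mu = n * p

def tail_ge(t):
    """Exact Pr[X >= t] for X ~ Bin(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t, n + 1))

for delta in (0.05, 0.1, 0.2):
    exact = tail_ge(int(mu * (1 + delta)))
    chernoff = exp(-mu * delta**2 / 3)   # Chernoff (upper) from the table
    markov = 1 / (1 + delta)             # Markov, for comparison
    assert exact <= chernoff <= 1        # the bound really does hold
    print(f"delta={delta}: exact={exact:.3e}  chernoff={chernoff:.3e}  markov={markov:.3f}")
```

The comparison makes the qualitative point of the table concrete: Markov decays not at all in \(n\), Chebyshev polynomially, and Chernoff exponentially in \(\mu\delta^2\).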
A.3 Key Entropy Facts
The entropy of a discrete random variable \(X\) is \(H(X) = -\sum_x \Pr[X=x]\log_2\Pr[X=x]\). Key properties used in proofs:
- Chain rule: \(H(X_1, \ldots, X_n) = H(X_1) + H(X_2 | X_1) + \cdots + H(X_n | X_1, \ldots, X_{n-1})\).
- Sub-additivity: \(H(X_1, \ldots, X_n) \leq \sum_i H(X_i)\), with equality iff the \(X_i\) are independent.
- Shearer’s lemma: If each index \(i \in [n]\) appears in at least \(k\) of the sets \(S_1, \ldots, S_m \subseteq [n]\), then \(k \cdot H(X_1, \ldots, X_n) \leq \sum_{j=1}^m H(X_{S_j})\).
- Entropy and counting: For a uniform distribution on a set \(\mathcal{F}\), \(H = \log_2 |\mathcal{F}|\). Sub-additivity then gives direct bounds on \(|\mathcal{F}|\) from projections.
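These entropy facts can be verified numerically on a small set system. A minimal sketch (the helper `entropy` is introduced here; the family \(\mathcal{F}\) below is an arbitrary example, chosen so its coordinates are pairwise but not mutually independent, which makes sub-additivity strict):

```python
from math import log2
from collections import Counter

def entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())

# Uniform distribution over F ⊆ {0,1}^3, encoded as tuples (X1, X2, X3).
F = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

joint = entropy(F)  # all tuples distinct, so H = log2 |F| = 2 exactly
marginals = sum(entropy([x[i] for x in F]) for i in range(3))
print(joint, marginals)  # 2.0 3.0 — sub-additivity: 2 <= 3, strictly
```

Here each coordinate is a fair bit (\(H(X_i) = 1\)) and any two determine the third, so the \(X_i\) are not mutually independent and the inequality \(H(X_1, X_2, X_3) \leq \sum_i H(X_i)\) is strict, as the equality condition predicts.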
A.4 Lovász Local Lemma (All Forms)
General (asymmetric) form: if there exist reals \(x_i \in [0,1)\) with \(\Pr[A_i] \leq x_i \prod_{j \sim i}(1 - x_j)\) for every \(i\), then \(\Pr[\bigcap_i \overline{A_i}] \geq \prod_i (1 - x_i) > 0\). The symmetric condition is obtained by taking \(x_i = 1/(d+1)\): since \((1/(d+1))(1 - 1/(d+1))^d \geq 1/(e(d+1))\), the hypothesis \(p \leq 1/(e(d+1))\) — equivalently \(ep(d+1) \leq 1\) — suffices.
The LLL can also be formulated for the lopsided dependency graph (where \(A_i\) is negatively correlated with some of its “neighbors”), giving a slightly stronger result useful for permutations and other structures with strong negative dependencies.
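The symmetric condition is a one-line check in practice. The sketch below applies it to the classic \(k\)-SAT example (the helper `lll_applies` is an illustrative name): a clause on \(k\) variables is violated by a uniformly random assignment with probability \(p = 2^{-k}\), so if each clause shares a variable with at most \(d\) others and \(e \cdot 2^{-k}(d+1) \leq 1\), a satisfying assignment exists.

```python
from math import e

def lll_applies(p, d):
    """Symmetric Lovász Local Lemma: if every bad event has probability
    at most p and depends on at most d others, then ep(d+1) <= 1
    guarantees all bad events can be simultaneously avoided."""
    return e * p * (d + 1) <= 1

k = 10
p = 2.0 ** -k  # probability a fixed k-clause is violated
print(lll_applies(p, d=300))  # e * 301 / 1024 ≈ 0.80 <= 1: True
print(lll_applies(p, d=400))  # e * 401 / 1024 ≈ 1.06 > 1: False
```

Note the condition is oblivious to the total number of clauses — the whole point of the LLL is that the bound depends only on the local dependency degree \(d\).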
A.5 Graph Theoretic Quick Reference
| Result | Statement | Use |
|---|---|---|
| Turán’s theorem | \(\mathrm{ex}(n, K_{r+1}) = (1-1/r)n^2/2 + O(n)\) | Extremal clique-free graphs |
| KST bound | \(\mathrm{ex}(n, K_{s,t}) = O(n^{2-1/s})\) for \(s \leq t\) | Bipartite Turán numbers |
| Ramsey bound | \(R(k,k) \leq \binom{2k-2}{k-1} = O(4^k/\sqrt{k})\) (Erdős-Szekeres); \(R(k,k) \geq 2^{k/2}\) (Erdős 1947) | Clique/indep. set sizes |
| Lovász theta | \(\alpha(G) \leq \vartheta(G) \leq \chi(\bar{G})\); \(\Theta(G) \leq \vartheta(G)\) (Lovász 1979) | Shannon capacity bound |
| Szemerédi regularity | Any graph has an \(\varepsilon\)-regular partition of size \(\leq T(\varepsilon)\) | Structure of dense graphs |
| Deletion method | \(\mathrm{ex}(n, K_{s,t}) = \Omega(n^{2-(s+t-2)/(st-1)})\) | Lower bound on Zarankiewicz |
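The Ramsey lower bound in the table is just a union-bound computation, and it is instructive to run it. A minimal sketch (the helper `erdos_ramsey_lower` is an illustrative name): for a uniformly random 2-colouring of \(K_n\), the expected number of monochromatic copies of \(K_k\) is \(\binom{n}{k} 2^{1-\binom{k}{2}}\); whenever this is below 1, some colouring has no monochromatic \(K_k\), so \(R(k,k) > n\).

```python
from math import comb

def erdos_ramsey_lower(k):
    """Largest n for which the union bound
    C(n, k) * 2^(1 - C(k, 2)) < 1 certifies R(k, k) > n."""
    n = k
    # Grow n while a random colouring of K_{n+1} still has, in
    # expectation, fewer than one monochromatic K_k.
    while comb(n + 1, k) * 2 ** (1 - comb(k, 2)) < 1:
        n += 1
    return n

for k in (4, 10, 20):
    print(k, erdos_ramsey_lower(k))
# e.g. k = 4 certifies only R(4,4) > 6 (the true value is 18):
# the first-moment bound is weak for small k but grows like 2^(k/2).
```

The output grows roughly like \(2^{k/2} \cdot k\), matching the \(R(k,k) \geq 2^{k/2}\) row of the table up to lower-order factors.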