CO 432: Information Theory and Applications

Vijay Bhattiprolu


Sources and References

Primary textbook — T.M. Cover & J.A. Thomas, Elements of Information Theory (2nd ed., Wiley, 2006)

Supplementary texts —
V. Guruswami & M. Cheraghchi, Information Theory and Its Applications in Theory of Computation (CMU course notes, cs.cmu.edu/~venkatg)
S. Arora & B. Barak, Computational Complexity: A Modern Approach (Cambridge, 2009) — for PCP chapter
M.A. Nielsen & I.L. Chuang, Quantum Computation and Quantum Information (Cambridge, 2000) — for quantum chapter
Y. Polyanskiy & Y. Wu, Information Theory: From Coding to Learning (MIT 6.441 lecture notes, 2024)
T. Weissman, Information Theory (Stanford EE 376A/B course notes)
A. Lapidoth, A Foundation in Digital Communication (Cambridge, 2nd ed., 2017)
I. Csiszár & J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems (2nd ed., Cambridge, 2011)

Online resources — MIT OCW 6.441 Information Theory; Lecture notes from Madhu Sudan (Harvard CS 229r); ETH Zürich Information Theory lecture notes (Lapidoth)


Chapter 1: Entropy and Source Coding

How much information is contained in a random variable? This deceptively simple question, posed and answered by Claude Shannon in his landmark 1948 paper, gave birth to the entire field of information theory. Shannon’s key insight was that information is intimately connected to uncertainty: the more unpredictable a source, the more bits are needed on average to describe its output. The precise quantitative measure he introduced is the Shannon entropy.

Definition (Shannon Entropy): Let \(X\) be a discrete random variable taking values in a finite alphabet \(\mathcal{X}\) with probability mass function \(p(x) = P(X = x)\). The Shannon entropy of \(X\) is \[ H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x), \] where logarithms are base 2 and the convention \(0 \log 0 = 0\) is adopted. The unit of \(H(X)\) is bits.

The definition has immediate intuitive appeal. For a deterministic variable — one that takes a single value with probability 1 — entropy is zero: there is no uncertainty, and no bits are required to communicate the outcome. At the opposite extreme, a uniform distribution over \(|\mathcal{X}|\) outcomes maximises entropy, yielding \(H(X) = \log |\mathcal{X}|\) bits.

Theorem (Basic Properties of Entropy):
  1. Non-negativity: \(H(X) \geq 0\), with equality iff \(X\) is deterministic.
  2. Maximum entropy: \(H(X) \leq \log |\mathcal{X}|\), with equality iff \(X\) is uniform over \(\mathcal{X}\).
  3. Concavity: \(H(X)\) is a concave function of the probability vector \((p(x))_{x \in \mathcal{X}}\).

Proof sketch. Non-negativity follows because \(p(x) \in [0,1]\) implies \(-\log p(x) \geq 0\). For the maximum, write \(H(X) = \mathbb{E}[\log(1/p(X))] \leq \log \mathbb{E}[1/p(X)] \leq \log |\mathcal{X}|\) by Jensen's inequality applied to the concave logarithm, with equality iff \(p\) is uniform. Concavity follows because \(H\) is a sum of the concave functions \(-p(x) \log p(x)\).

The binary entropy function \(h(p) = -p \log p - (1-p)\log(1-p)\) for a Bernoulli\((p)\) variable will recur throughout the course. Note \(h(0) = h(1) = 0\) and \(h(1/2) = 1\).
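These quantities are trivial to compute. The following minimal Python sketch (illustrative, not part of the course materials) checks the boundary values just noted.

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a probability vector, with the 0 log 0 = 0 convention."""
    s = -sum(x * math.log2(x) for x in p if x > 0)
    return s + 0.0  # normalises IEEE -0.0 to 0.0 for deterministic inputs

def h(p):
    """Binary entropy function h(p) of a Bernoulli(p) variable."""
    return entropy([p, 1 - p])

print(h(0.5))               # 1.0: a fair coin carries one full bit
print(h(0.0), h(1.0))       # 0.0 0.0: deterministic outcomes carry none
print(entropy([0.25] * 4))  # 2.0 = log2(4): uniform maximises entropy
```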

Prefix-Free Codes and the Kraft Inequality

An operational interpretation of entropy is provided by source coding. We wish to encode outcomes of \(X\) as binary strings, and we require a prefix-free code: no codeword is a prefix of another. This ensures instantaneous decodability.

Definition (Prefix-Free Code): A code \(C: \mathcal{X} \to \{0,1\}^*\) is prefix-free if no codeword \(C(x)\) is a proper prefix of another codeword \(C(x')\). Denote the codeword length by \(\ell(x) = |C(x)|\).
Theorem (Kraft Inequality): For any prefix-free code with lengths \(\ell_1, \ell_2, \ldots, \ell_{|\mathcal{X}|}\), \[ \sum_{x \in \mathcal{X}} 2^{-\ell(x)} \leq 1. \] Conversely, given any lengths satisfying this inequality, a prefix-free code exists with those lengths.
Proof sketch. Each codeword of length \(\ell\) "consumes" a fraction \(2^{-\ell}\) of the binary tree's leaf capacity. Prefix-freeness requires these fractions to be disjoint subtrees, so the total cannot exceed 1. The converse is constructive: assign codewords greedily from the binary tree.
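The greedy converse is easy to make concrete. Below is an illustrative Python sketch (function names are ours, not from any library) that evaluates the Kraft sum and builds a prefix-free code by walking the binary tree left to right, shortest lengths first.

```python
def kraft_sum(lengths):
    """Left-hand side of the Kraft inequality."""
    return sum(2.0 ** -l for l in lengths)

def prefix_code(lengths):
    """Greedy converse to the Kraft inequality: build a prefix-free code
    with the given lengths, possible iff kraft_sum(lengths) <= 1."""
    assert kraft_sum(lengths) <= 1, "lengths violate the Kraft inequality"
    words, counter, depth = [], 0, 0
    for l in sorted(lengths):
        counter <<= (l - depth)             # descend to depth l in the binary tree
        words.append(format(counter, f'0{l}b'))
        counter += 1                        # skip the subtree this codeword consumed
        depth = l
    return words

def is_prefix_free(words):
    return not any(a != b and b.startswith(a) for a in words for b in words)

print(prefix_code([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```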

Uniquely Decodable Codes and McMillan’s Theorem

A code is uniquely decodable if every sequence of codeword concatenations has a unique parse into codewords. Prefix-free codes are a strict subclass: every prefix-free code is uniquely decodable, but the converse is not true. However, McMillan’s theorem shows that we pay no cost for requiring prefix-freeness.

Theorem (McMillan, 1956): For any uniquely decodable code with codeword lengths \(\ell_1, \ldots, \ell_{|\mathcal{X}|}\), \[ \sum_{x \in \mathcal{X}} 2^{-\ell(x)} \leq 1. \]
Proof. Consider the \(k\)-th power of the Kraft sum: \(\left(\sum_x 2^{-\ell(x)}\right)^k = \sum_{n} A_k(n) \cdot 2^{-n}\), where \(A_k(n)\) counts the number of \(k\)-tuples of codewords with total length \(n\). Unique decodability implies \(A_k(n) \leq 2^n\) (each length-\(n\) binary string parses into at most one \(k\)-tuple). Hence \(\left(\sum_x 2^{-\ell(x)}\right)^k \leq \sum_{n=k}^{k \ell_{\max}} 1 \leq k \ell_{\max}\). Taking \(k\)-th roots and letting \(k \to \infty\) gives the result.

McMillan’s theorem has an important operational corollary: since uniquely decodable codes and prefix-free codes achieve the same Kraft sum, and the minimum expected length \(\sum p(x)\ell(x)\) subject to the Kraft inequality is the same in both cases, prefix-free codes are optimal among all uniquely decodable codes. There is no advantage to allowing delayed decodability.

The Source Coding Theorem

With the Kraft inequality in hand, we can precisely characterise the minimum expected code length.

Theorem (Shannon Source Coding / Noiseless Coding): For any random variable \(X\) and any prefix-free code with expected length \(L = \sum_x p(x) \ell(x)\), \[ H(X) \leq L < H(X) + 1. \] Moreover, if we encode blocks of \(n\) i.i.d. copies of \(X\), the optimal expected length per symbol \(L_n / n\) satisfies \(L_n / n \to H(X)\) as \(n \to \infty\).
Proof sketch. The lower bound follows from the Kraft inequality and the optimisation \(\min \sum p(x) \ell(x)\) subject to \(\sum 2^{-\ell(x)} \leq 1\); Lagrange multipliers give the optimal (possibly non-integer) lengths \(\ell^*(x) = -\log p(x)\). Since lengths must be integers, we take \(\ell(x) = \lceil -\log p(x) \rceil\), giving the upper bound. The block coding argument uses the AEP.

Huffman Codes and Optimality

The Huffman algorithm constructs an optimal prefix-free code by greedily combining the two least probable symbols.

Theorem (Huffman Optimality): The Huffman code achieves minimum expected length among all prefix-free codes for a given distribution \(p\).
Proof by induction. Base case: two symbols — every codeword has length at least 1, so lengths 1 and 1 are optimal. Inductive step: let \(x^*\) and \(y^*\) be the two symbols with smallest probabilities. Key structural lemma: there exists an optimal code in which \(x^*\) and \(y^*\) are siblings (have the same parent in the code tree) and have the longest codeword length. Given this, replacing the sibling pair by a new merged symbol \(z\) with \(p(z) = p(x^*) + p(y^*)\) gives a reduced problem. Any optimal code for the reduced problem, expanded back by splitting \(z\) into \(x^*, y^*\) and appending 0 and 1, gives an optimal code for the original. The Huffman algorithm does exactly this reduction.

The expected length of a Huffman code satisfies \(H(X) \leq L_{\text{Huff}} < H(X) + 1\). For a source with dyadic probabilities (all \(p(x) = 2^{-k}\) for integers \(k\)), the Huffman code achieves \(L = H(X)\) exactly.
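The greedy merging procedure fits in a few lines. The following is a minimal Python sketch (dictionary-based, with an arbitrary tie-breaking counter; not an optimised implementation), run on the dyadic source just mentioned.

```python
import heapq
from itertools import count

def huffman(p):
    """Huffman code for a distribution p (dict: symbol -> probability).
    Repeatedly merges the two least probable nodes."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(prob, next(tiebreak), {sym: ''}) for sym, prob in p.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c0.items()}   # left subtree
        merged.update({s: '1' + w for s, w in c1.items()})  # right subtree
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
code = huffman(p)
L = sum(p[s] * len(w) for s, w in code.items())
print(code, L)  # lengths 1, 2, 3, 3; expected length 1.75 = H(X) for this dyadic source
```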

Asymptotic Equipartition Property

The AEP is the law of large numbers for information theory: it says that for large \(n\), the sample \((X_1, \ldots, X_n)\) almost surely has log-probability close to \(nH(X)\).

Theorem (Asymptotic Equipartition Property): Let \(X_1, X_2, \ldots\) be i.i.d. \(\sim p(x)\). Then \[ -\frac{1}{n} \log p(X_1, X_2, \ldots, X_n) \xrightarrow{P} H(X) \quad \text{as } n \to \infty. \]

This follows directly from the (weak) law of large numbers applied to the i.i.d. sequence \(-\log p(X_i)\), whose mean is \(H(X)\). The AEP has a dramatic structural consequence: define the typical set

\[ T_\varepsilon^{(n)} = \left\{ (x_1, \ldots, x_n) : \left| -\frac{1}{n} \log p(x_1, \ldots, x_n) - H(X) \right| \leq \varepsilon \right\}. \]
Theorem (Typical Set Properties):
  1. For every \(\delta > 0\), \(P\left( (X_1, \ldots, X_n) \in T_\varepsilon^{(n)} \right) \geq 1 - \delta\) for large enough \(n\).
  2. \(\left| T_\varepsilon^{(n)} \right| \leq 2^{n(H(X) + \varepsilon)}\).
  3. \(\left| T_\varepsilon^{(n)} \right| \geq (1 - \delta) 2^{n(H(X) - \varepsilon)}\) for large enough \(n\).

The typical set has size roughly \(2^{nH(X)}\) — exponentially smaller than \(|\mathcal{X}|^n\) if \(H(X) < \log |\mathcal{X}|\) — yet it captures nearly all the probability mass. This is the geometric heart of Shannon’s theory: with high probability, nature draws from a tiny typical set, and we only need \(nH(X)\) bits to index into it.
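The exponential size gap is easy to see numerically. The sketch below (illustrative Python, parameters chosen arbitrarily) computes the exact mass and log-size of \(T_\varepsilon^{(n)}\) for a Bernoulli(0.3) source, exploiting the fact that typicality depends only on the number of ones.

```python
import math

p, n, eps = 0.3, 1000, 0.02
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def empirical_rate(k):
    """-(1/n) log2 of the probability of any sequence with exactly k ones."""
    return (-k * math.log2(p) - (n - k) * math.log2(1 - p)) / n

typical_ks = [k for k in range(n + 1) if abs(empirical_rate(k) - H) <= eps]
mass = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in typical_ks)
log_size = math.log2(sum(math.comb(n, k) for k in typical_ks))
print(f"H = {H:.3f}, typical-set mass = {mass:.3f}")
print(f"log2 |T| = {log_size:.1f}  vs  n(H+eps) = {n*(H+eps):.1f}  vs  n = {n}")
```

The typical set here has log-size around \(900\) bits against the ambient \(1000\), and already captures most of the probability mass.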

Arithmetic Coding

Arithmetic coding implements the entropy bound tightly, approaching the ideal codeword length \(-\log p(x_1^n)\) for each specific sequence. The encoder maintains an interval \([l, r) \subseteq [0,1)\), initially \([0,1)\). Upon seeing symbol \(x_t\), it shrinks the interval to the sub-interval corresponding to \(x_t\) in the cumulative distribution function of \(X_t | x_1^{t-1}\). After \(n\) symbols, the interval has width \(p(x_1^n)\); the encoder outputs the shortest binary description of any number in this interval, which requires \(\lceil -\log p(x_1^n) \rceil + 1\) bits. Arithmetic coding is thus an implementation of the Shannon bound for sequences, using only \(H(X_1, \ldots, X_n) + 2\) bits on average — no block-length overhead. It is particularly powerful for adaptive and universal sources.
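A toy version of the encoder can be written with exact rational arithmetic. The sketch below is illustrative only: real arithmetic coders use finite precision and streaming renormalisation, and this one assumes a known i.i.d. source. It exhibits the interval narrowing and the \(\lceil -\log p \rceil + 1\) output length.

```python
from fractions import Fraction
import math

def arithmetic_encode(seq, probs):
    """Interval-narrowing encoder for an i.i.d. source with exact rational probs.
    Returns a bit string naming a dyadic point inside the final interval [l, l+w)."""
    symbols = sorted(probs)
    l, w = Fraction(0), Fraction(1)
    for s in seq:
        for t in symbols:          # shrink to the sub-interval of s in the CDF
            if t == s:
                w = w * probs[s]
                break
            l += w * probs[t]
    k = math.ceil(-math.log2(w)) + 1   # ceil(-log2 p(seq)) + 1 output bits
    z = math.ceil(l * 2**k)            # smallest multiple of 2^-k that is >= l
    return format(z, f'0{k}b')         # z/2^k lies in [l, l+w) since 2^-k <= w/2

probs = {'a': Fraction(1, 2), 'b': Fraction(1, 4), 'c': Fraction(1, 4)}
bits = arithmetic_encode('aab', probs)
print(bits, len(bits))  # p('aab') = 1/16, so 4 + 1 = 5 bits
```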

Lempel-Ziv Compression and Universality

A universal compressor achieves the entropy rate of any stationary ergodic source without knowing the source distribution. The LZ78 algorithm (Ziv-Lempel, 1978) parses the input string greedily: maintain a dictionary of previously seen phrases; read the input until a string is seen that is not in the dictionary; add it as a new phrase. LZ77 (1977) uses a sliding window instead of an explicit dictionary. The LZW algorithm (Welch, 1984) is a dictionary-based refinement used in GIF compression.

Theorem (Universality of LZ78): For any stationary ergodic source with entropy rate \(\bar{H}\), \[ \lim_{n \to \infty} \frac{\text{bits output by LZ78 on } X_1^n}{n} = \bar{H} \quad \text{almost surely.} \]

The proof proceeds by showing that the number of phrases \(c(n)\) in a length-\(n\) parse is at most \(O(n / \log n)\), and each phrase is encoded by a dictionary index plus one symbol, costing roughly \(\log c(n)\) bits; the compressed length is therefore about \(c(n) \log c(n)\) bits, and a finer analysis shows \(c(n) \log c(n) / n \to \bar{H}\) almost surely. The remarkable aspect is that LZ78 is entirely blind to the source — it adapts purely from the observed data.
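The LZ78 parse itself is a few lines of code. Below is an illustrative Python sketch of the parser and the matching decoder, emitting (dictionary index, next symbol) pairs; the trailing-phrase handling is one of several common conventions.

```python
def lz78_parse(s):
    """Greedy LZ78 parse: each phrase = longest previously seen phrase + one symbol.
    Returns (phrase_index, new_symbol) pairs; index 0 is the empty phrase."""
    dictionary = {'': 0}
    out, phrase = [], ''
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch
        else:
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ''
    if phrase:  # flush a trailing phrase that is already in the dictionary
        out.append((dictionary[phrase[:-1]], phrase[-1]))
    return out

def lz78_decode(pairs):
    phrases, out = [''], []
    for idx, ch in pairs:
        phrases.append(phrases[idx] + ch)
        out.append(phrases[-1])
    return ''.join(out)

print(lz78_parse('abaababa'))  # phrases: a, b, aa, ba, (ba)
```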

Entropy Rate of Stochastic Processes

For non-i.i.d. sources, the relevant quantity is the entropy rate.

Definition (Entropy Rate): For a stationary stochastic process \(\{X_i\}\), \[ \bar{H}(X) = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n), \] provided the limit exists. Equivalently, \(\bar{H}(X) = \lim_{n \to \infty} H(X_n | X_{n-1}, \ldots, X_1)\).

For a stationary Markov chain with transition matrix \(P\) and stationary distribution \(\pi\), the entropy rate is

\[ \bar{H} = -\sum_{i,j} \pi_i P_{ij} \log P_{ij} = \sum_i \pi_i H(X_{t+1} | X_t = i). \]

This is the weighted average of the entropies of each row of \(P\). For an i.i.d. source, it reduces to \(H(X)\), recovering the classical definition.

The limit \(\bar{H}(X) = \lim_{n} H(X_n | X_{n-1}, \ldots, X_1)\) exists for any stationary process because \(H(X_n | X_{n-1}, \ldots, X_1)\) is monotone non-increasing in \(n\) (conditioning reduces entropy) and non-negative, so it converges. The two definitions agree because Cesàro means of a convergent sequence converge to the same limit. The entropy rate is the fundamental quantity for universal data compression and for characterising the “complexity” of a dynamical system; for an irreducible Markov shift it coincides with the topological entropy exactly when the chain carries the measure of maximal entropy (the Parry measure).
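The Markov-chain formula is immediate to evaluate. A minimal Python sketch (the two-state example is ours):

```python
import math

def entropy_rate_markov(P, pi):
    """Entropy rate (bits/symbol) of a stationary Markov chain:
    the pi-weighted average of the row entropies of P."""
    return -sum(pi[i] * P[i][j] * math.log2(P[i][j])
                for i in range(len(P)) for j in range(len(P)) if P[i][j] > 0)

# Two-state chain that flips state with probability q: rate = h(q).
q = 0.1
P = [[1 - q, q], [q, 1 - q]]
pi = [0.5, 0.5]  # stationary by symmetry: pi P = pi
print(entropy_rate_markov(P, pi))  # h(0.1) ~ 0.469 bits/symbol
```

With identical rows (an i.i.d. source) the same function returns \(H(X)\), as the text notes.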


Chapter 2: Conditional Entropy, Chain Rules, and Combinatorial Applications

Entropy becomes far richer when we consider multiple random variables. The key quantities are the joint entropy \(H(X,Y)\) and the conditional entropy \(H(X|Y)\), which measure the uncertainty of a pair and the residual uncertainty of \(X\) given \(Y\), respectively.

Definition (Conditional Entropy): For jointly distributed \((X, Y)\), \[ H(X \mid Y) = \sum_{y} p(y) \, H(X \mid Y = y) = -\sum_{x,y} p(x,y) \log p(x \mid y). \]
Theorem (Subadditivity of Entropy): \[ H(X, Y) \leq H(X) + H(Y), \] with equality if and only if \(X\) and \(Y\) are independent.
Proof sketch. \(H(X,Y) = H(X) + H(Y|X)\), and \(H(Y|X) \leq H(Y)\) since conditioning cannot increase entropy. Equality holds iff \(p(x,y) = p(x)p(y)\) for all \(x, y\).
Theorem (Chain Rule for Entropy): \[ H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, X_2, \ldots, X_{i-1}). \]

This simple but powerful identity decomposes joint uncertainty into a sequence of conditional uncertainties. Notice that each term \(H(X_i | X_1, \ldots, X_{i-1}) \leq H(X_i)\), so subadditivity follows immediately.

Tail Bounds via Entropy

Entropy provides elegant large-deviation bounds for the binomial distribution. If \(B \sim \mathrm{Bin}(n, p)\), then for \(\alpha < p\),

\[ P(B \leq \alpha n) \leq 2^{-n D(\alpha \| p)}, \]

where \(D(\alpha \| p) = \alpha \log(\alpha/p) + (1-\alpha)\log((1-\alpha)/(1-p))\) is the KL divergence (introduced in Chapter 3). This will be derived once we have the full machinery of KL divergence.

Bregman’s Theorem

Theorem (Bregman, 1973): Let \(A\) be an \(n \times n\) (0,1)-matrix with row sums \(r_1, \ldots, r_n\). Then the permanent \[ \mathrm{perm}(A) \leq \prod_{i=1}^{n} (r_i!)^{1/r_i}. \] In particular, for an \(n \times n\) matrix with all row sums equal to \(r\), \(\mathrm{perm}(A) \leq (r!)^{n/r}\).
Proof sketch (Radhakrishnan's entropy proof). Let \(\sigma\) be a uniformly random permutation with \(A_{i,\sigma(i)} = 1\) for all \(i\), so that \(H(\sigma) = \log \mathrm{perm}(A)\). Reveal the values \(\sigma(i)\) one row at a time, in a uniformly random order, and bound \(H(\sigma)\) by the chain rule: when row \(i\) is revealed, the number of columns still available to it is at most \(r_i\), and averaging over the random revelation order shows that row \(i\) contributes at most \(\frac{1}{r_i} \log(r_i!)\) in expectation. Summing over rows gives \(\log \mathrm{perm}(A) \leq \sum_i \frac{1}{r_i} \log(r_i!)\).

Bregman’s theorem is a striking example of entropy as a purely combinatorial tool, with no probability or coding in the conclusion.

Shearer’s Lemma

Theorem (Shearer's Lemma): Let \(X = (X_1, \ldots, X_n)\) and let \(S_1, \ldots, S_m \subseteq [n]\) be a family of subsets such that each index \(i \in [n]\) appears in at least \(t\) of the sets. Then \[ \sum_{j=1}^{m} H(X_{S_j}) \geq t \cdot H(X_1, \ldots, X_n). \]

Shearer’s lemma is a generalisation of subadditivity (the case \(S_j = \{j\}\), \(t = 1\)) and is extraordinarily useful for counting. The canonical application bounds the number of copies of a fixed graph \(H\) in a graph with \(m\) edges, by applying the lemma to a uniformly random copy of \(H\) with the subsets \(S_j\) chosen as the edges of \(H\); the simplest instance is the triangle bound below.

Example (Counting Triangles): Let \(G\) be a graph with \(m\) edges, and let \((X_1, X_2, X_3)\) be a uniformly random ordered triangle of \(G\). Applying Shearer's lemma with the sets \(\{1,2\}, \{1,3\}, \{2,3\}\) (each index is covered \(t = 2\) times) gives \(2 H(X_1, X_2, X_3) \leq \sum_j H(X_{S_j}) \leq 3 \log(2m)\), since each pair \((X_i, X_j)\) is an ordered edge. Hence the number of triangles is at most \((2m)^{3/2}/6 = O(m^{3/2})\), which is tight up to constants for the complete graph \(K_k\) with \(m = \binom{k}{2}\).

Loomis-Whitney Inequality

The Loomis-Whitney inequality is a higher-dimensional version of Shearer’s lemma with a beautiful geometric interpretation. For a finite set \(S \subseteq \mathbb{Z}^d\), let \(S_k\) denote the projection of \(S\) onto the coordinate hyperplane orthogonal to the \(k\)-th axis.

Theorem (Loomis-Whitney, 1949): For any finite set \(S \subseteq \mathbb{Z}^d\), \[ |S|^{d-1} \leq \prod_{k=1}^{d} |S_k|, \] where \(S_k\) is the projection of \(S\) onto the hyperplane \(\{x_k = 0\}\). In three dimensions: \(|S|^2 \leq |S_{xy}| \cdot |S_{xz}| \cdot |S_{yz}|\).
Proof via Shearer's Lemma. Let \((X_1, \ldots, X_d)\) be uniform over \(S\). Apply Shearer's lemma with the \(d\) subsets \(S_k = [d] \setminus \{k\}\), each of size \(d-1\). Each index \(i\) appears in \(d-1\) of the subsets. Shearer gives \(\sum_k H(X_{S_k}) \geq (d-1) H(X_1, \ldots, X_d)\). Since \(H(X_{S_k}) \leq \log |S_k|\) and \(H(X_1, \ldots, X_d) = \log |S|\), the inequality follows.

Applications include: bounding the number of incidences between points and hyperplanes; sumset and projection estimates in additive combinatorics; and discrete isoperimetric inequalities for finite sets.
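The inequality is easy to sanity-check on small point sets. An illustrative Python sketch (the example set is arbitrary):

```python
def loomis_whitney_holds(S, d=3):
    """Check |S|^(d-1) <= product of the sizes of the d coordinate projections."""
    prod = 1
    for k in range(d):
        # project by deleting coordinate k
        proj = {tuple(x for i, x in enumerate(pt) if i != k) for pt in S}
        prod *= len(proj)
    return len(S) ** (d - 1) <= prod

# An irregular six-point set in Z^3.
S = {(0, 0, 0), (1, 0, 0), (2, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)}
print(loomis_whitney_holds(S))  # True
```

For a combinatorial box such as \(\{0,1\}^3\) the inequality holds with equality (\(64 \leq 64\)), matching the extremal case.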

Kahn’s Theorem on Independent Sets

The entropy method also resolves a conjecture about the number of independent sets in regular graphs.

Theorem (Kahn, 2001): Among all \(d\)-regular bipartite graphs on \(2n\) vertices, the number of independent sets is maximised by the disjoint union of \(n/d\) copies of the complete bipartite graph \(K_{d,d}\). Consequently, if \(i(G)\) denotes the number of independent sets in a \(d\)-regular bipartite graph \(G\) on \(2n\) vertices, \[ i(G) \leq (2^{d+1} - 1)^{n/d}. \]
Proof sketch via entropy. Let \(I\) be a uniformly random independent set of \(G\). For each edge \(e = (u,v)\), consider \(H(I_u, I_v)\) — the entropy of the pair of bits indicating whether \(u\) and \(v\) are in \(I\). Since \(I\) is an independent set, \((I_u, I_v) \neq (1,1)\), so \(H(I_u, I_v) \leq \log 3\). Summing over all edges and applying the regularity condition via Shearer's lemma gives \(\log i(G) = H(I) \leq (n/d) \log(2^{d+1}-1)\).

Chapter 3: Relative Entropy and Its Consequences

While Shannon entropy measures uncertainty about a single distribution, relative entropy (or KL divergence) measures the cost of mistaking one distribution for another. It is arguably the most fundamental information-theoretic quantity.

Definition (KL Divergence / Relative Entropy): For probability distributions \(P\) and \(Q\) on the same alphabet \(\mathcal{X}\) (with \(P \ll Q\)), \[ D(P \| Q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}. \]
Theorem (Gibbs' Inequality / Non-negativity): \(D(P \| Q) \geq 0\), with equality if and only if \(P = Q\).
Proof sketch. Apply Jensen's inequality to the strictly convex function \(-\ln t\): \(-\sum p(x) \ln(q(x)/p(x)) \geq -\ln \sum p(x) \cdot (q(x)/p(x)) = 0\). Equality holds iff \(q(x)/p(x)\) is constant, i.e., \(P = Q\).

Crucially, \(D(P \| Q)\) is not symmetric and is not a metric. Nevertheless, it is the natural notion of “directed distance” between distributions.
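The asymmetry is worth seeing on a concrete pair of distributions. A minimal Python sketch (example distributions chosen arbitrarily):

```python
import math

def kl(p, q):
    """KL divergence D(P || Q) in bits; requires supp(P) within supp(Q)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]
print(kl(P, Q))  # ~0.737 bits
print(kl(Q, P))  # ~0.531 bits: the two directions genuinely differ
```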

f-Divergences

KL divergence is one member of a broad family of divergences.

Definition (f-Divergence): Let \(f: (0,\infty) \to \mathbb{R}\) be a convex function with \(f(1) = 0\). The f-divergence between distributions \(P\) and \(Q\) is \[ D_f(P \| Q) = \sum_{x} q(x) \, f\!\left(\frac{p(x)}{q(x)}\right). \]

Special cases: KL divergence (\(f(t) = t \log t\)); total variation (\(f(t) = |t-1|/2\)); \(\chi^2\)-divergence (\(f(t) = (t-1)^2\)); Hellinger distance squared (\(f(t) = (\sqrt{t}-1)^2\)); reverse KL (\(f(t) = -\log t\)). All f-divergences inherit the data processing inequality: if \(Y\) is a stochastic function of \(X\), then \(D_f(P_Y \| Q_Y) \leq D_f(P_X \| Q_X)\). This follows from Jensen’s inequality applied to the convex \(f\) and the transition kernel.

Chain Rule and Data Processing Inequality for KL

Theorem (Chain Rule for KL Divergence): \[ D(P_{XY} \| Q_{XY}) = D(P_X \| Q_X) + D(P_{Y|X} \| Q_{Y|X}), \] where \(D(P_{Y|X} \| Q_{Y|X}) = \sum_x p(x) D(P_{Y|X=x} \| Q_{Y|X=x})\).
Theorem (Data Processing Inequality for KL): If \(X \to Y \to Z\) is a Markov chain, then \[ D(P_X \| Q_X) \geq D(P_Y \| Q_Y) \geq D(P_Z \| Q_Z). \]

The DPI encapsulates a fundamental principle: processing cannot create information.

Variational Representation of KL Divergence

Theorem (Donsker-Varadhan Variational Formula): For any distributions \(P\) and \(Q\), \[ D(P \| Q) = \sup_{f: \mathcal{X} \to \mathbb{R}} \left\{ \mathbb{E}_P[f(X)] - \log \mathbb{E}_Q[e^{f(X)}] \right\}, \] where the supremum is over all measurable functions \(f\) and logarithms are natural (so \(D\) is in nats). The supremum is achieved by \(f^*(x) = \ln(p(x)/q(x))\).

This variational formula has profound consequences. In statistical physics it is the Gibbs variational principle: the relative entropy between a distribution and the Gibbs (equilibrium) distribution equals the excess free energy, up to temperature normalisation. It also underpins the information-theoretic treatment of large deviations (Cramér’s theorem) and is the foundation of importance sampling and variational inference in machine learning.
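On a finite alphabet the variational formula can be verified directly: plugging in the optimiser \(f^*\) recovers \(D(P\|Q)\) exactly, and any other \(f\) falls short. An illustrative Python sketch (in nats; the distributions and the suboptimal \(f\) are arbitrary):

```python
import math

def kl_nats(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dv_objective(f, p, q):
    """Donsker-Varadhan objective E_P[f] - log E_Q[e^f], in nats."""
    return (sum(pi * fi for pi, fi in zip(p, f))
            - math.log(sum(qi * math.exp(fi) for qi, fi in zip(q, f))))

P = [0.7, 0.2, 0.1]
Q = [0.4, 0.4, 0.2]
f_star = [math.log(pi / qi) for pi, qi in zip(P, Q)]
print(kl_nats(P, Q), dv_objective(f_star, P, Q))  # equal: the optimiser attains D
print(dv_objective([1.0, -0.5, 0.3], P, Q))       # any other f gives strictly less
```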

Stein’s Lemma and Hypothesis Testing

One of the most operationally significant interpretations of KL divergence is in hypothesis testing. We observe \(n\) i.i.d. samples from either \(P\) or \(Q\), and must decide which distribution generated them.

Theorem (Stein's Lemma): Consider testing \(H_0: P\) vs. \(H_1: Q\) with \(n\) i.i.d. samples. Fix the type-I error (false rejection of \(H_0\)) at level \(\alpha \in (0,1)\). The optimal type-II error (false acceptance of \(H_0\)) \(\beta_n^*\) satisfies \[ \lim_{n \to \infty} -\frac{1}{n} \log \beta_n^* = D(P \| Q). \]

The optimal test is the Neyman-Pearson likelihood ratio test: accept \(H_0\) iff \(\frac{1}{n}\log \frac{p(X_1^n)}{q(X_1^n)} > \tau\) for a threshold \(\tau\). The exponent \(D(P \| Q)\) is the “price of the wrong model” — samples from \(P\) rule out the model \(Q\) at an exponential rate governed by \(D(P \| Q)\). Stein’s lemma thus gives KL divergence a direct operational meaning as a distinguishability rate.

Fisher Information and the Cramér-Rao Bound

For parametric families \(\{p(\cdot;\theta)\}_{\theta \in \Theta}\), Fisher information quantifies how much information an observation carries about the parameter.

Definition (Fisher Information): For a parametric family \(p(x;\theta)\), \[ I(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log p(X;\theta)\right)^2\right] = -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log p(X;\theta)\right]. \] The quantity \(\partial_\theta \log p(x;\theta)\) is called the score function.
Theorem (Cramér-Rao Bound): For any unbiased estimator \(T(X_1, \ldots, X_n)\) of \(\theta\), \[ \mathrm{Var}(T) \geq \frac{1}{n \, I(\theta)}. \]

The bound is achieved asymptotically by the maximum likelihood estimator. Fisher information is additive for i.i.d. samples: \(I_n(\theta) = n \cdot I(\theta)\), exactly paralleling the additivity of entropy. The link to KL divergence is:

\[ D(p(\cdot;\theta+\varepsilon) \| p(\cdot;\theta)) = \frac{\varepsilon^2}{2} I(\theta) + O(\varepsilon^3). \]

Fisher information is thus the second-order Taylor coefficient of KL divergence, and it defines the Riemannian metric of information geometry on the statistical manifold.

Chernoff–Hoeffding Bound

Theorem (Chernoff–Hoeffding Bound): Let \(B \sim \mathrm{Bin}(n, p)\). For any \(\alpha \in (0, p)\), \[ P(B \leq \alpha n) \leq 2^{-n \, D(\alpha \| p)}, \] where \(D(\alpha \| p) = \alpha \log \frac{\alpha}{p} + (1-\alpha) \log \frac{1-\alpha}{1-p}\).
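The bound is easy to compare against the exact binomial tail. A minimal Python sketch (parameters arbitrary):

```python
import math

def kl_bern(a, p):
    """Binary KL divergence D(a || p) in bits."""
    terms = []
    if a > 0:
        terms.append(a * math.log2(a / p))
    if a < 1:
        terms.append((1 - a) * math.log2((1 - a) / (1 - p)))
    return sum(terms)

def binom_lower_tail(n, p, alpha):
    """Exact P(B <= alpha * n) for B ~ Bin(n, p)."""
    k_max = math.floor(alpha * n)
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_max + 1))

n, p, alpha = 200, 0.5, 0.4
exact = binom_lower_tail(n, p, alpha)
bound = 2 ** (-n * kl_bern(alpha, p))
print(exact, bound)  # the exact tail sits below the Chernoff-Hoeffding bound
```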

Entropy Power Inequality

The entropy power inequality (EPI) is the most powerful known inequality governing entropy of sums of random variables, with far-reaching consequences for Gaussian channels.

Theorem (Entropy Power Inequality, Shannon 1948 / Stam 1959): Let \(X\) and \(Y\) be independent continuous random vectors in \(\mathbb{R}^n\) with densities. Then \[ e^{2H(X+Y)/n} \geq e^{2H(X)/n} + e^{2H(Y)/n}, \] where \(H\) denotes differential entropy in nats.

For one-dimensional variables: \(2^{2H(X+Y)} \geq 2^{2H(X)} + 2^{2H(Y)}\). Shannon conjectured this in 1948; Stam proved it in 1959 using de Bruijn’s identity: if \(X_t = X + \sqrt{t} Z\) where \(Z \sim \mathcal{N}(0,1)\) is independent of \(X\), then \(\frac{d}{dt} H(X_t) = \frac{1}{2} J(X_t)\), where \(J\) denotes the Fisher information of the density with respect to location. The EPI then follows from the Fisher information inequality for sums. Gaussians are the extremal distributions: \(H(X+Y)\) is minimised (for given \(H(X)\) and \(H(Y)\)) when both are Gaussian, with equality in the EPI iff \(X\) and \(Y\) are Gaussian with proportional covariances. The EPI is used to prove the capacity of the Gaussian broadcast channel.

Joint Convexity and Pinsker’s Inequality

Theorem (Joint Convexity of KL Divergence): The map \((P, Q) \mapsto D(P \| Q)\) is jointly convex in \((P, Q)\).
Theorem (Pinsker's Inequality): For any distributions \(P\) and \(Q\), \[ \| P - Q \|_{\mathrm{TV}} \leq \sqrt{\frac{\ln 2}{2} \, D(P \| Q)}, \] where \(D\) is in bits (equivalently \(\sqrt{D/2}\) with \(D\) in nats) and \(\| P - Q \|_{\mathrm{TV}} = \frac{1}{2}\sum_x |p(x) - q(x)|\) is total variation distance.

Pinsker’s inequality is the bridge between the analytically tractable KL divergence and the operationally natural total variation distance. Together with Stein’s lemma and Fisher information, it completes a rich geometric picture of the space of probability distributions.
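A quick numerical check on an arbitrary pair of distributions (illustrative Python; with \(D\) measured in bits the tight Pinsker constant is \(\sqrt{(\ln 2 / 2) D}\), equivalently \(\sqrt{D/2}\) in nats):

```python
import math

def kl_bits(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv(p, q):
    """Total variation distance (half the L1 distance)."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

P = [0.6, 0.3, 0.1]
Q = [0.3, 0.4, 0.3]
lhs = tv(P, Q)
rhs = math.sqrt(math.log(2) / 2 * kl_bits(P, Q))
print(lhs, rhs)  # Pinsker: lhs <= rhs
```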


Chapter 4: Mutual Information

The mutual information between two random variables is the reduction in uncertainty about one achieved by observing the other. It synthesises all the quantities we have defined.

Definition (Mutual Information): The mutual information between \(X\) and \(Y\) is \[ I(X; Y) = D(P_{XY} \| P_X \otimes P_Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y). \]

Since KL divergence is non-negative, \(I(X;Y) \geq 0\), with equality iff \(X\) and \(Y\) are independent. Mutual information is symmetric: \(I(X;Y) = I(Y;X)\).
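Computing \(I(X;Y)\) from a joint table via \(H(X) + H(Y) - H(X,Y)\) is a one-liner. An illustrative Python sketch, using a binary symmetric channel with crossover 0.1 as the example:

```python
import math

def H(p):
    """Shannon entropy (bits) of a probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), from a joint pmf given as a nested list."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    pxy = [v for row in joint for v in row]
    return H(px) + H(py) - H(pxy)

# X a uniform bit, Y = X flipped with probability 0.1.
joint = [[0.45, 0.05], [0.05, 0.45]]
print(mutual_information(joint))  # 1 - h(0.1), about 0.531 bits
```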

Data Processing Inequality for Mutual Information

Theorem (DPI for Mutual Information): If \(X \to Y \to Z\) is a Markov chain, then \[ I(X; Z) \leq I(X; Y). \]

Processing \(Y\) to obtain \(Z\) cannot increase the information that \(Z\) carries about \(X\). This principle is fundamental to information-theoretic security and to lower bounds in communication complexity.

Fano’s Inequality

Fano’s inequality quantifies the necessary imprecision of any estimator.

Theorem (Fano's Inequality): Let \(X\) be a random variable taking values in \(\mathcal{X}\), and let \(\hat{X} = g(Y)\) be an estimate of \(X\) from \(Y\). Let \(P_e = P(\hat{X} \neq X)\). Then \[ H(X \mid Y) \leq h(P_e) + P_e \log(|\mathcal{X}| - 1) \leq 1 + P_e \log |\mathcal{X}|. \]
Proof sketch. Introduce the error indicator \(E = \mathbf{1}[\hat{X} \neq X]\). Since \(E\) is determined by \((X, Y)\), the chain rule gives \(H(X \mid Y) = H(E, X \mid Y) = H(E \mid Y) + H(X \mid E, Y)\). Now \(H(E \mid Y) \leq H(E) = h(P_e)\), and \(H(X \mid E, Y) \leq P_e \log(|\mathcal{X}| - 1)\): given \(E = 0\), \(X = \hat{X}\) is determined by \(Y\), while given \(E = 1\), \(X\) ranges over at most \(|\mathcal{X}| - 1\) values.

Fano’s inequality is the cornerstone of the converse proof in channel coding: it shows that if the mutual information is low, any decoder must err with bounded probability.

Network Information Theory Preview

Mutual information extends naturally to multi-user settings. Two key theorems illustrate this.

Multiple access channel (MAC): Two senders, each independently choosing inputs \(X_1\) and \(X_2\), transmit to a common receiver who observes \(Y\). The capacity region is all rate pairs \((R_1, R_2)\) satisfying

\[ R_1 \leq I(X_1; Y \mid X_2), \quad R_2 \leq I(X_2; Y \mid X_1), \quad R_1 + R_2 \leq I(X_1, X_2; Y), \]

for some product input distribution \(P_{X_1} \times P_{X_2}\). The three constraints correspond to: sender 1’s rate is bounded by what \(Y\) tells about \(X_1\) given knowledge of \(X_2\); symmetrically for sender 2; and the sum rate is bounded by the total information the channel can convey.

Slepian-Wolf theorem (distributed source coding): Two correlated sources \(X\) and \(Y\) are encoded separately and decoded jointly. Despite no communication between encoders, reliable reconstruction requires

\[ R_X \geq H(X \mid Y), \quad R_Y \geq H(Y \mid X), \quad R_X + R_Y \geq H(X, Y). \]

The surprising content is that the sum rate \(R_X + R_Y\) need not exceed \(H(X, Y)\): there is no penalty for distributed encoding, provided that joint decoding is allowed. The achievability proof uses random binning, which is a direct application of the joint typicality framework.

Rate-Distortion Theory

Source coding with a fidelity criterion is governed by the rate-distortion function.

Definition (Rate-Distortion Function): For a source \(X\) and distortion measure \(d(x, \hat{x})\), the rate-distortion function is \[ R(D) = \min_{\substack{p(\hat{x}|x):\\ \mathbb{E}[d(X,\hat{X})] \leq D}} I(X; \hat{X}). \]
Theorem (Shannon Rate-Distortion Theorem, 1959): \(R(D)\) bits per symbol are necessary and sufficient to reproduce \(X\) with average distortion at most \(D\).

For a Bernoulli(\(1/2\)) source with Hamming distortion, \(R(D) = 1 - h(D)\) for \(D \leq 1/2\), and \(R(D) = 0\) for \(D \geq 1/2\). For a Gaussian source \(X \sim \mathcal{N}(0, \sigma^2)\) with MSE distortion, the rate-distortion function is

\[ R(D) = \max\!\left(0,\; \frac{1}{2}\log \frac{\sigma^2}{D}\right). \]

This is the scalar case of reverse water-filling: to achieve distortion \(D\), describe the source only down to precision \(D\). Each additional bit of rate reduces the achievable MSE by a factor of four (equivalently, halves the RMS error), since \(D = \sigma^2 2^{-2R}\). For a vector Gaussian source with covariance matrix \(\Sigma\), the rate-distortion function under MSE distortion generalises to

\[ R(D) = \frac{1}{2} \sum_i \max\!\left(0, \log \frac{\lambda_i}{\mu}\right), \]

where \(\lambda_i\) are the eigenvalues of \(\Sigma\) and the water level \(\mu\) is chosen so that \(\sum_i \min(\lambda_i, \mu) = D\). This is the reverse water-filling formula: allocate bits to the most “energetic” dimensions first, and cut off dimensions whose variance is below the water level \(\mu\). The same water-filling solution appears (in dual form) in Gaussian channel capacity with power allocation across parallel channels.
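Reverse water-filling reduces to a one-dimensional search for \(\mu\). The sketch below (illustrative Python, bisection on the water level, spectrum chosen arbitrarily) evaluates \(R(D)\) in bits:

```python
import math

def reverse_waterfill(eigs, D):
    """Reverse water-filling for a Gaussian vector source with spectrum eigs:
    find mu with sum_i min(lambda_i, mu) = D, then
    R(D) = 1/2 sum_i max(0, log2(lambda_i / mu)) bits."""
    assert 0 < D <= sum(eigs)
    lo, hi = 0.0, max(eigs)
    for _ in range(100):  # bisect: total distortion is increasing in mu
        mu = (lo + hi) / 2
        if sum(min(l, mu) for l in eigs) < D:
            lo = mu
        else:
            hi = mu
    return 0.5 * sum(max(0.0, math.log2(l / mu)) for l in eigs)

eigs = [4.0, 1.0, 0.25]
print(reverse_waterfill(eigs, 0.75))  # all components above water: R = 3.0
print(reverse_waterfill(eigs, 2.0))   # the weakest component is cut off
```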

Strong Data Processing Inequality

For a fixed channel \(p(y|x)\), one might ask: how much does passing a correlated source through the channel shrink mutual information? The strong data processing inequality (SDPI) captures this contraction.

Definition (Contraction Coefficient): For a channel \(p(y|x)\) and divergence measure \(D_f\), the contraction coefficient is \[ \eta_f(p_{Y|X}) = \sup_{P \neq Q} \frac{D_f(P_Y \| Q_Y)}{D_f(P_X \| Q_X)}, \] where \(P_Y(y) = \sum_x P_X(x) p(y|x)\) and similarly for \(Q_Y\). For the KL divergence, \(\eta_{\mathrm{KL}} < 1\) whenever all transition probabilities \(p(y|x)\) are strictly positive.

The contraction coefficient of the BSC(\(p\)) under KL divergence equals \((1-2p)^2\), so a bit passed through a chain of \(n\) consecutive BSC(\(p\)) steps has its KL divergence contracted by a factor \((1-2p)^{2n}\) — an exponential decay in distinguishability. This connects to hypercontractivity (Bonami-Beckner inequality) and noise stability of Boolean functions, with applications in hardness of approximation (Håstad’s proof) and social choice theory (Arrow’s theorem via information theory).

Perfect Secrecy and Key Size

Theorem (Shannon's Perfect Secrecy): In a perfectly secure cipher (where the ciphertext \(C\) reveals nothing about the message \(M\), i.e., \(I(M; C) = 0\)), the key \(K\) must satisfy \(H(K) \geq H(M)\). In particular, if \(M\) and \(K\) are uniform over their respective sets, then \(|K| \geq |M|\).

This is Shannon’s impossibility theorem: the one-time pad is optimal, and no cipher can compress the key below the message entropy while maintaining perfect secrecy.

Lower Bound for the Index Function

The Index function is the communication complexity problem where Alice holds \(x \in \{0,1\}^n\) and Bob holds \(i \in [n]\), and they must compute \(x_i\).

Theorem (Index Lower Bound): Any randomised one-way protocol (Alice sends a single message to Bob) for Index requires \(\Omega(n)\) bits of communication.
Proof sketch. Let \(M\) be Alice's message and let \(X \in \{0,1\}^n\) be uniform. Fano's inequality applied to Bob's output gives \(H(X_I \mid M, I) \leq h(P_e)\), so \(I(X_i; M) \geq 1 - h(P_e)\) on average over \(i\). Since the \(X_i\) are independent, \(I(X; M) \geq \sum_i I(X_i; M) \geq n(1 - h(P_e))\), and therefore \(|M| \geq I(X; M) = \Omega(n)\).

Chapter 5: Communication Complexity

Communication complexity, introduced by Yao in 1979, studies how many bits two parties must exchange to jointly compute a function. Information theory provides matching lower bounds by quantifying how much “useful” information must actually be transmitted.

Definition (Two-Party Communication Protocol): Alice holds \(x \in \mathcal{X}\), Bob holds \(y \in \mathcal{Y}\). A protocol \(\Pi\) is a binary tree where each internal node is labelled by a player who sends a bit (dependent on their input and the transcript so far), and each leaf is labelled by the output. The communication complexity \(\mathrm{CC}(f)\) is the minimum, over protocols computing \(f\), of the maximum total number of bits exchanged over all inputs.

Information Complexity

Information complexity refines communication complexity by measuring the actual information content of the transcript, rather than its length.

Definition (Information Complexity): For a protocol \(\Pi\) and input distribution \(\mu\) on \(\mathcal{X} \times \mathcal{Y}\), the internal information cost is \[ \mathrm{IC}_\mu(\Pi) = I(X; \Pi \mid Y) + I(Y; \Pi \mid X), \] where \((X, Y) \sim \mu\) and \(\Pi = \Pi(X,Y)\) denotes the transcript. The information complexity of \(f\) is \(\mathrm{IC}_\mu(f) = \inf_\Pi \mathrm{IC}_\mu(\Pi)\) over all protocols computing \(f\).

Information complexity is always at most communication complexity: \(\mathrm{IC}_\mu(f) \leq \mathrm{CC}(f)\). The key result is that this gap can be closed asymptotically.

Theorem (Direct Sum for Information Complexity, Braverman-Rao 2011): For any function \(f\) and distribution \(\mu\), \[ \mathrm{IC}_{\mu^n}(f^n) = n \cdot \mathrm{IC}_\mu(f), \] where \(f^n\) denotes \(n\) independent copies of \(f\) on inputs drawn i.i.d. from \(\mu\). Moreover, for product distributions, the amortised communication satisfies \(\mathrm{CC}(f^n) = \Theta(n \cdot \mathrm{IC}(f))\).

This direct sum theorem is one of the most important results in communication complexity. It says that computing \(n\) independent instances of \(f\) requires \(n\) times as much information as a single instance — information is not “shareable” across independent instances.

Public vs. Private Coins

Theorem (Newman's Theorem): Any public-coin randomised protocol with communication complexity \(c\) can be simulated by a private-coin protocol with communication complexity \(c + O(\log n + \log(1/\delta))\), where \(n\) is the input length and \(\delta\) is the allowed increase in error probability.

The key insight: a random string of length \(O(\log n)\) suffices to index into a small set of “good” shared random strings. This separates public-coin from private-coin randomised complexity by only a logarithmic additive gap — far smaller than the gap between randomised and deterministic complexity.

Equality function: The equality function \(\mathrm{EQ}_n(x,y) = \mathbf{1}[x=y]\) has deterministic complexity \(\mathrm{D}(\mathrm{EQ}_n) = n+1\) (one party must send their full input), but randomised public-coin complexity \(R^{\mathrm{pub}}(\mathrm{EQ}_n) = O(1)\): Alice picks a random hash \(h\), sends \(h(x)\), and Bob checks if \(h(y) = h(x)\). This is an exponential gap. On the other hand, the inner product function \(\mathrm{IP}_n(x,y) = \sum_i x_i y_i \pmod{2}\) satisfies \(R^{\mathrm{pub}}(\mathrm{IP}_n) = \Omega(n)\) — public randomness does not help here.
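The public-coin equality protocol is easy to sketch in Python: here the shared randomness selects random parity checks, each of which catches a difference with probability \(1/2\) (the specific hash family and the repetition count \(t\) are illustrative choices):

```python
import random

def eq_protocol(x, y, t=20, seed=0):
    """One-way public-coin equality test: Alice sends t random-parity
    bits of x; Bob compares them with the same parities of y.  If
    x == y the protocol always accepts; if x != y, each parity differs
    with probability 1/2, so the error probability is at most 2**-t."""
    rng = random.Random(seed)    # stands in for the shared random string
    n = len(x)
    for _ in range(t):
        r = [rng.randrange(2) for _ in range(n)]
        hx = sum(ri * xi for ri, xi in zip(r, x)) % 2   # Alice's bit
        hy = sum(ri * yi for ri, yi in zip(r, y)) % 2   # Bob's check
        if hx != hy:
            return False
    return True
```

Total communication is \(t = O(1)\) bits, independent of \(n\).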

Set Disjointness

The Set Disjointness problem DISJ\(_n\): Alice has \(x \subseteq [n]\), Bob has \(y \subseteq [n]\), output 1 iff \(x \cap y = \emptyset\).

Theorem (Deterministic Lower Bound for DISJ): \(\mathrm{CC}(\mathrm{DISJ}_n) = \Omega(n)\).
Theorem (Randomised Lower Bound for DISJ, Razborov 1992): The randomised communication complexity of DISJ\(_n\) is \(\Omega(n)\).

The information-complexity proof proceeds via a hard distribution: choose \(j \in [n]\) uniformly; for each coordinate \(i \neq j\), draw \((x_i, y_i)\) uniformly from \(\{(0,0), (0,1), (1,0)\}\), so that the sets never intersect outside coordinate \(j\); at coordinate \(j\), set \((x_j, y_j)\) either to a non-intersecting pattern (disjoint) or to \((1,1)\) (intersecting). Any protocol that decides DISJ must, on this distribution, essentially determine whether coordinate \(j\) intersects. The information cost per coordinate is \(\Omega(1)\), giving the \(\Omega(n)\) lower bound.

Corruption bound: An alternative technique for randomised lower bounds. The corruption bound for DISJ states that any protocol with small error on the hard distribution above must have communication complexity \(\Omega(n)\). The corruption bound and the discrepancy method are related lower-bound techniques; together they give tight bounds for functions with specific combinatorial structure.

Cell-Probe Model and Data Structures

Theorem (Communication Lower Bound for Data Structures): In the cell-probe model, where a data structure answers queries by probing memory cells of \(w\) bits each, any data structure for a function \(f\) with input \((x, q)\) (database \(x\), query \(q\)) that answers each query with \(t\) probes satisfies \[ t \geq \frac{I(x; \text{answer})}{w}, \] where \(I\) is the information the answer carries about the database. In particular, any data structure for Lopsided Set Disjointness (where one set is much larger than the other) requires \(t = \Omega(n / w)\) probes per query.

Chapter 6: Channel Capacity and the Noisy Coding Theorem

We now turn from compression to the dual problem: reliable transmission over a noisy channel. Shannon’s channel coding theorem — perhaps the deepest result in information theory — asserts that reliable communication is possible at any rate below the channel capacity, and impossible above it.

Definition (Discrete Memoryless Channel): A channel is specified by input alphabet \(\mathcal{X}\), output alphabet \(\mathcal{Y}\), and a transition probability \(P_{Y|X}(y|x)\). The channel is memoryless if \(P(y_1, \ldots, y_n | x_1, \ldots, x_n) = \prod_{i=1}^n P_{Y|X}(y_i | x_i)\).
Definition (Channel Capacity): \[ C = \max_{P_X} I(X; Y). \]
Example (Binary Symmetric Channel BSC(\(p\))): \(\mathcal{X} = \mathcal{Y} = \{0,1\}\), output equals input with probability \(1-p\) and flips with probability \(p\). The capacity-achieving distribution is uniform: \(C = 1 - h(p)\) bits per channel use.
Example (Binary Erasure Channel BEC(\(\varepsilon\))): Output is the input with probability \(1-\varepsilon\) and a special erasure symbol \(?\) with probability \(\varepsilon\). Capacity: \(C = 1 - \varepsilon\) bits per channel use.

Gaussian Channel Capacity

One of the most practically significant results in information theory is the capacity of the additive white Gaussian noise channel.

Theorem (Gaussian Channel Capacity, Shannon 1948): For the channel \(Y = X + Z\) where \(Z \sim \mathcal{N}(0, N)\) is independent of \(X\), and the input is subject to a power constraint \(\mathbb{E}[X^2] \leq P\), the capacity is \[ C = \frac{1}{2} \log\!\left(1 + \frac{P}{N}\right) \text{ bits per channel use.} \]
Proof sketch (achievability). Use a Gaussian codebook: each codeword \(x^n(m)\) is drawn i.i.d. from \(\mathcal{N}(0, P)\). By the entropy power inequality and the fact that Gaussian maximises entropy under power constraint, the jointly typical decoder achieves rate up to \(C\). The noise variance \(N\) limits how finely we can pack codewords in an \(n\)-dimensional sphere of radius \(\sqrt{nP}\); the received vector lies in a sphere of radius \(\sqrt{n(P+N)}\), and each noise ball has radius \(\approx \sqrt{nN}\). The number of disjoint noise balls is \(\approx (1 + P/N)^{n/2}\), giving rate \((1/2)\log(1+P/N)\).
Proof sketch (converse). Writing \(h\) for differential entropy: since the Gaussian maximises differential entropy under a variance constraint, \(I(X;Y) = h(Y) - h(Y|X) = h(Y) - h(Z) \leq (1/2)\log(2\pi e(P+N)) - (1/2)\log(2\pi e N) = (1/2)\log(1 + P/N)\).

The signal-to-noise ratio (SNR) \(= P/N\) enters logarithmically. At high SNR, doubling the power buys only \(\frac{1}{2}\log 2 = 1/2\) bit of rate per channel use; each additional bit costs roughly a fourfold (6 dB) increase in power. Linear gains in capacity require exponential growth in power — a fundamental constraint on all communication systems.

Wideband Capacity and Bandwidth-Power Tradeoff

For a band-limited AWGN channel with bandwidth \(W\) Hz and noise power spectral density \(N_0/2\),

\[ C(W) = W \log\!\left(1 + \frac{P}{N_0 W}\right) \text{ bits per second.} \]

As \(W \to \infty\), \(C(W) \to P / (N_0 \ln 2)\) bits per second — the infinite-bandwidth limit. This shows that bandwidth is not the limiting factor at low SNR; power is. In the wideband regime, each joule of energy delivers \(1/(N_0 \ln 2)\) bits of capacity — a universal constant. Satellite and deep-space communications operate in this regime.
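Both formulas are one-liners, and a short Python sketch makes the two regimes concrete (the parameter values in the usage lines are arbitrary illustrations):

```python
import math

def awgn_capacity(P, N):
    """C = (1/2) log2(1 + P/N) bits per (real) channel use."""
    return 0.5 * math.log2(1 + P / N)

def bandlimited_capacity(P, N0, W):
    """C(W) = W log2(1 + P/(N0 W)) bits per second."""
    return W * math.log2(1 + P / (N0 * W))

# At high SNR, doubling the power adds about half a bit per channel use:
delta = awgn_capacity(200, 1) - awgn_capacity(100, 1)

# As W grows, C(W) approaches the wideband limit P/(N0 ln 2) bits/second:
wideband = bandlimited_capacity(1.0, 1.0, 1e6)
```

The first computation illustrates the logarithmic power penalty; the second converges to the universal energy-per-bit limit.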

Joint Typicality and the Channel Coding Theorem

Definition (Jointly Typical Sequences): A pair \((x^n, y^n)\) is \(\varepsilon\)-jointly typical with respect to \(P_{XY}\) if \[ \left| -\frac{1}{n}\log p(x^n) - H(X) \right| \leq \varepsilon, \quad \left| -\frac{1}{n}\log p(y^n) - H(Y) \right| \leq \varepsilon, \quad \left| -\frac{1}{n}\log p(x^n, y^n) - H(X,Y) \right| \leq \varepsilon. \]

The joint typicality lemma: if \(\tilde{X}^n\) and \(\tilde{Y}^n\) are drawn independently from the marginals \(P_X^n\) and \(P_Y^n\), then \(P((\tilde{X}^n, \tilde{Y}^n) \text{ jointly typical}) \leq 2^{-n(I(X;Y) - 3\varepsilon)}\).

Theorem (Shannon's Noisy Channel Coding Theorem, 1948): For a discrete memoryless channel with capacity \(C\):
  • Achievability: For any rate \(R < C\) and \(\varepsilon > 0\), there exist codes with error \(\to 0\).
  • Converse: For any rate \(R > C\), any code has error \(\to 1\).

Error Exponents and the Sphere-Packing Bound

The channel coding theorem says error goes to zero for \(R < C\), but how fast? The answer is given by the error exponent.

Definition (Error Exponent): The error exponent at rate \(R\) is \[ E(R) = \lim_{n \to \infty} -\frac{1}{n} \log P_e^{(n)}(R), \] where \(P_e^{(n)}(R)\) is the minimum error probability of any rate-\(R\) code of length \(n\).

The random coding exponent \(E_r(R)\) (Gallager 1965) is achieved by random codes and gives an achievable exponent:

\[ E_r(R) = \max_{\rho \in [0,1]} \max_{P_X} \left[ E_0(\rho, P_X) - \rho R \right], \]

where \(E_0(\rho, P_X) = -\log \sum_y \left(\sum_x P_X(x) P_{Y|X}(y|x)^{1/(1+\rho)}\right)^{1+\rho}\) is Gallager’s \(E_0\) function. The sphere-packing (Fano) bound \(E_{\mathrm{sp}}(R) \geq E_r(R)\) gives a matching converse for rates near capacity, showing \(E_r(R)\) is exact for \(R\) close to \(C\). For rates near zero, the expurgation bound improves upon random coding.
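For the BSC with uniform input, \(E_0\) has the closed form \(E_0(\rho) = \rho - (1+\rho)\log_2\!\big(p^{1/(1+\rho)} + (1-p)^{1/(1+\rho)}\big)\), and \(E_r(R)\) can be evaluated by a grid search over \(\rho\). A Python sketch (the grid resolution is an arbitrary choice):

```python
import math

def e0_bsc(rho, p):
    """Gallager's E0 for BSC(p) with the uniform input distribution."""
    s = p ** (1 / (1 + rho)) + (1 - p) ** (1 / (1 + rho))
    return rho - (1 + rho) * math.log2(s)

def random_coding_exponent(R, p, steps=2000):
    """E_r(R) = max over rho in [0,1] of E0(rho) - rho*R (grid search)."""
    return max(e0_bsc(i / steps, p) - (i / steps) * R
               for i in range(steps + 1))
```

As expected, the exponent is strictly positive below capacity and vanishes at and above it.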

Feedback and the Strong Converse

Theorem (Feedback Does Not Increase Capacity): The capacity of a discrete memoryless channel is unchanged whether or not the encoder has causal access to past channel outputs.
Proof. With feedback, the encoder's input \(X_i\) can depend on the message \(M\) and past outputs \(Y_1, \ldots, Y_{i-1}\). Nevertheless, \(I(M; Y^n) = \sum_i I(M; Y_i \mid Y^{i-1}) \leq \sum_i [H(Y_i) - H(Y_i \mid Y^{i-1}, M, X_i)] = \sum_i [H(Y_i) - H(Y_i \mid X_i)] = \sum_i I(X_i; Y_i) \leq nC\), where the middle equality holds because \(X_i\) is a function of \((M, Y^{i-1})\) and the channel is memoryless. Fano's inequality then gives the same capacity bound.

This is one of Shannon’s most surprising theorems. While feedback dramatically simplifies coding (enabling sequential transmission and adaptive schemes), it provides no capacity gain for point-to-point memoryless channels. The situation is very different for multi-user channels: feedback can strictly increase capacity of broadcast channels and MACs.

Theorem (Strong Converse and Channel Dispersion): For a DMC with capacity \(C\) and channel dispersion \[ V = \sum_{y} P_{Y|X}(y|x^*) \left(\log \frac{P_{Y|X}(y|x^*)}{P_Y(y)} - C\right)^2 \] (evaluated at the capacity-achieving input \(x^*\)), the maximum rate achievable with blocklength \(n\) and error probability \(\leq \varepsilon\) satisfies \[ R^*(n, \varepsilon) = C - \sqrt{\frac{V}{n}} Q^{-1}(\varepsilon) + O\!\left(\frac{\log n}{n}\right), \]

where \(Q^{-1}\) is the inverse of the Gaussian tail function \(Q(x) = P(\mathcal{N}(0,1) > x)\).

This second-order or finite-blocklength result (Polyanskiy-Poor-Verdú 2010) quantifies the penalty for using a finite-length code. The \(1/\sqrt{n}\) gap from capacity is governed by the channel dispersion \(V\), in direct analogy with the central limit theorem. The BSC has dispersion \(V = p(1-p)(\log((1-p)/p))^2\). For the AWGN channel with \(\mathrm{SNR} = P/N\), \(V = \frac{\mathrm{SNR}(\mathrm{SNR}+2)}{2(\mathrm{SNR}+1)^2}\log^2 e\).
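The normal approximation is directly computable. A Python sketch for the BSC, using the standard library's NormalDist for \(Q^{-1}\) (the parameter values in the test are illustrative):

```python
import math
from statistics import NormalDist

def bsc_normal_approx(n, eps, p):
    """Normal approximation to the best achievable rate at blocklength n
    and error probability eps on BSC(p): C - sqrt(V/n) * Qinv(eps)."""
    C = 1 - (-p * math.log2(p) - (1 - p) * math.log2(1 - p))
    V = p * (1 - p) * math.log2((1 - p) / p) ** 2
    qinv = NormalDist().inv_cdf(1 - eps)     # Q^{-1}(eps)
    return C - math.sqrt(V / n) * qinv
```

At \(p = 0.11\), \(n = 1000\), \(\varepsilon = 10^{-3}\) the approximation sits noticeably below the capacity \(C \approx 0.5\), and the gap shrinks like \(1/\sqrt{n}\).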


Chapter 7: Error-Correcting Codes

Where Shannon’s coding theorem guarantees existence, coding theory asks for explicit constructions with efficient encoding and decoding algorithms. The key parameters of a code form a triangle of competing requirements.

Definition (Code Parameters): An \([n, k, d]_q\)-code is a set \(\mathcal{C} \subseteq \mathbb{F}_q^n\) of \(q^k\) codewords of length \(n\) with minimum Hamming distance \(d = \min_{c \neq c'} \Delta(c, c')\). The information rate is \(R = k/n\), the relative distance is \(\delta = d/n\).

The Singleton bound gives \(d \leq n - k + 1\) (equivalently \(R + \delta \leq 1 + 1/n\)). Codes meeting this bound are called Maximum Distance Separable (MDS).

A fundamental table of bounds constrains achievable code parameters:

Bound | Form | Notes
Singleton | \(R + \delta \leq 1\) | MDS codes (RS) achieve equality
Plotkin | \(\delta \leq 1/2\) for \(R > 0\) | Tight for binary codes
Hamming (sphere-packing) | \(2^{nR} \cdot V(n, \lfloor \delta n/2 \rfloor) \leq 2^n\) | Perfect codes achieve equality
Gilbert-Varshamov | \(R \geq 1 - H_q(\delta)\) | Random codes achieve this
Elias-Bassalygo | \(R \leq 1 - H_q(J_q(\delta))\) | For list decoding

where \(V(n,t) = \sum_{i=0}^t \binom{n}{i}\) is the Hamming ball volume and \(J_q(\delta)\) is the \(q\)-ary Johnson bound. For binary codes, whether any family can beat the Gilbert-Varshamov tradeoff — and whether explicit constructions can even match it — remains a central open problem in coding theory.

BSC Capacity and Random Linear Codes

Theorem (Capacity of BSC via Random Linear Codes): The capacity of BSC(\(p\)) is \(C = 1 - H_b(p)\). There exist (existentially, via random linear codes) binary linear codes of every rate \(R < C\) that decode correctly with high probability over the channel noise, i.e. correct the typical error patterns of weight \(\approx pn\).

A random linear code of rate \(R\) in \(\mathbb{F}_2^n\) is a code whose generator matrix is chosen uniformly at random. By the probabilistic method, the expected number of low-weight codewords is small, and a union bound shows most random linear codes have minimum distance \(\geq (H_b^{-1}(1-R) - \varepsilon)n\) — the Gilbert-Varshamov bound. Turning these existential guarantees into explicit capacity-achieving codes with polynomial-time decoding was a longstanding open problem, eventually resolved by polar codes; explicitly matching the Gilbert-Varshamov distance bound remains open.
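A quick experiment along these lines, assuming nothing beyond the definitions: draw a random generator matrix and compute the minimum distance exhaustively (feasible only at toy parameters such as the \(n = 20\), \(k = 5\) used here):

```python
import random

def min_distance(G):
    """Minimum weight of a nonzero codeword of the binary linear code
    generated by the rows of G (exhaustive over all 2^k messages)."""
    k, n = len(G), len(G[0])
    best = n
    for m in range(1, 2 ** k):
        cw = [0] * n
        for i in range(k):
            if (m >> i) & 1:
                cw = [a ^ b for a, b in zip(cw, G[i])]
        w = sum(cw)
        if 0 < w < best:
            best = w
    return best

random.seed(1)
n, k = 20, 5
G = [[random.randrange(2) for _ in range(n)] for _ in range(k)]
d = min_distance(G)   # typically a small handful, per the GV tradeoff
```

Repeating this over many seeds gives an empirical picture of the GV distance distribution of random linear codes.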

Plotkin Bound and Singleton Bound

Theorem (Plotkin Bound): For a binary code with \(M\) codewords and minimum distance \(d\) satisfying \(2d > n\), \[ M \leq 2\left\lfloor \frac{d}{2d - n} \right\rfloor. \] Consequently, any code family with relative distance \(\delta > 1/2\) has only \(O(1)\) codewords, hence rate \(R = 0\); the asymptotic form of the bound gives \(R \leq 1 - 2\delta\) for \(\delta \leq 1/2\).

The Plotkin bound shows that exceeding relative distance \(1/2\) with positive rate is impossible for binary codes. Reed-Muller codes achieve the Plotkin bound for small parameters.

Reed–Solomon Codes

Definition (Reed–Solomon Code): Fix a finite field \(\mathbb{F}_q\) with \(q \geq n\). Choose \(n\) distinct evaluation points \(\alpha_1, \ldots, \alpha_n \in \mathbb{F}_q\). The Reed–Solomon code \(\mathrm{RS}[n, k]\) encodes \((a_0, \ldots, a_{k-1}) \in \mathbb{F}_q^k\) as \[ \mathrm{RS}(a) = (f(\alpha_1), f(\alpha_2), \ldots, f(\alpha_n)), \quad f(X) = \sum_{i=0}^{k-1} a_i X^i. \]
Theorem (RS Code Properties): The Reed–Solomon code \(\mathrm{RS}[n, k]\) has minimum distance \(d = n - k + 1\). It is MDS, meeting the Singleton bound with equality.

Efficient decoding up to \(\lfloor (d-1)/2 \rfloor\) errors is achieved by the Berlekamp-Welch algorithm. Beyond half the distance, list decoding is required.
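Reed–Solomon encoding over a prime field takes only a few lines; the sketch below also checks that two distinct degree-\(<k\) messages agree in fewer than \(k\) positions (the field size and messages are illustrative):

```python
def rs_encode(msg, alphas, p):
    """Reed-Solomon encoding over the prime field GF(p): evaluate the
    message polynomial (coefficient list msg, degree < k) at each alpha."""
    def poly_eval(coeffs, x):
        acc = 0
        for c in reversed(coeffs):   # Horner's rule
            acc = (acc * x + c) % p
        return acc
    return [poly_eval(msg, a) for a in alphas]

p = 13
alphas = list(range(p))                  # n = 13 evaluation points
c1 = rs_encode([1, 2, 3], alphas, p)     # f(X) = 1 + 2X + 3X^2
c2 = rs_encode([1, 2, 4], alphas, p)     # g(X) = 1 + 2X + 4X^2
agreements = sum(a == b for a, b in zip(c1, c2))
```

Here \(f - g = -X^2\) vanishes only at \(X = 0\), so the codewords agree in exactly one position and their distance meets \(d \geq n - k + 1\).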

List Decoding

Standard unique decoding corrects up to \(\lfloor (d-1)/2 \rfloor\) errors. List decoding allows outputting a short list of possible codewords, enabling correction of far more errors.

Definition (List Decoding): A code \(\mathcal{C}\) is \((p, L)\)-list decodable if for every received word \(y\), the number of codewords within Hamming distance \(pn\) from \(y\) is at most \(L\).
Theorem (Johnson Bound): Every code with minimum distance \(d = \delta n\) is \((p, L)\)-list decodable for any \(p < J(\delta) = 1 - \sqrt{1-\delta}\), with list size \(L = O(1/\varepsilon^2)\) where \(\varepsilon = J(\delta) - p\). The Johnson radius is the largest list-decoding radius that minimum distance alone guarantees.
Theorem (Guruswami-Sudan List Decoding of RS Codes): Reed-Solomon codes can be list decoded up to the Johnson bound in polynomial time. The algorithm constructs a bivariate polynomial \(Q(X,Y)\) vanishing on the received word, then factors it to find all consistent codewords. This corrects \(n - \sqrt{nk}\) errors — far beyond the unique decoding radius \((n-k)/2\).
Theorem (List Decoding Capacity): For any \(\varepsilon > 0\), random linear codes of rate \(R\) are \((1-R-\varepsilon, \mathrm{poly}(1/\varepsilon))\)-list decodable with high probability. Moreover, this is essentially optimal: no code of rate \(R\) can be list decoded beyond fraction \(1-R\) of errors with bounded list size.

This theorem (Zyablov-Pinsker, Guruswami-Hastad-Sudan-Zuckerman) is the “list decoding capacity theorem” — the analogue of Shannon’s capacity theorem for the combinatorial setting.

LDPC Codes

Low-density parity-check (LDPC) codes (Gallager 1960, rediscovered by MacKay and Neal 1996) approach capacity with practical encoding and decoding.

Definition (LDPC Code): An LDPC code is a linear code defined by a sparse parity check matrix \(H\), typically represented by a bipartite Tanner graph where variable nodes (columns of \(H\)) connect to check nodes (rows of \(H\)) with low degree \(d\).

Decoding proceeds by belief propagation (message passing): variable nodes iteratively send “soft” probability estimates to check nodes, which combine them and send updated beliefs back. On the binary erasure channel, belief propagation can be analysed exactly via density evolution: track the erasure probability of messages through the iterations. Decoding exhibits a sharp threshold: there is a critical erasure probability below which BP succeeds with high probability and above which it fails. LDPC codes achieve capacity on the BEC: for any \(\gamma > 0\) there is a degree distribution of rate at least \(C - \gamma\) whose BP threshold exceeds the channel's erasure probability. LDPC codes are used in the 5G NR standard for data channels.
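Density evolution for a \((d_v, d_c)\)-regular ensemble on the BEC is a one-line recursion, \(x_{t+1} = \varepsilon\,(1 - (1-x_t)^{d_c-1})^{d_v-1}\); bisecting on \(\varepsilon\) recovers the BP threshold of the \((3,6)\) ensemble, known to be \(\approx 0.429\). A Python sketch (iteration counts and tolerances are arbitrary choices):

```python
def bp_erasure_fixed_point(eps, dv=3, dc=6, iters=2000):
    """Density evolution for a (dv, dc)-regular LDPC ensemble on BEC(eps):
    x_t is the erasure probability of variable-to-check messages."""
    x = eps
    for _ in range(iters):
        x = eps * (1 - (1 - x) ** (dc - 1)) ** (dv - 1)
    return x

def bp_succeeds(eps):
    return bp_erasure_fixed_point(eps) < 1e-6

# bisect for the BP threshold of the (3,6) ensemble
lo, hi = 0.0, 1.0
for _ in range(40):
    mid = (lo + hi) / 2
    if bp_succeeds(mid):
        lo = mid
    else:
        hi = mid
threshold = (lo + hi) / 2
```

The \((3,6)\) ensemble has rate \(1/2\), so its threshold \(\approx 0.429\) falls short of the Shannon limit \(\varepsilon = 0.5\); optimised irregular degree distributions close this gap.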

Polar Codes

Arıkan’s polar codes (2009) are the first explicit family of codes that provably achieve capacity for any symmetric binary-input memoryless channel.

Theorem (Channel Polarization, Arıkan 2009): Consider \(n = 2^m\) uses of a binary-input channel \(W\). After the polar transform (a recursive butterfly structure), the resulting \(n\) "synthetic" channels \(W_n^{(i)}\) polarize: for any \(\varepsilon > 0\), the fraction of channels with capacity \(\geq 1-\varepsilon\) converges to \(C(W)\), and the fraction with capacity \(\leq \varepsilon\) converges to \(1 - C(W)\). The fraction of "mixed" channels (capacity in \((\varepsilon, 1-\varepsilon)\)) goes to zero.

The key insight is the recursion: given two copies of channel \(W\) with inputs \(x_1 = u_1 \oplus u_2\) and \(x_2 = u_2\), form the “minus” channel \(W^- : u_1 \to (y_1, y_2)\), which sees \(u_1\) through the interference of the unknown \(u_2\), and the “plus” channel \(W^+ : u_2 \to (y_1, y_2, u_1)\), which additionally knows \(u_1\). Then \(I(W^-) + I(W^+) = 2I(W)\) and \(I(W^+) \geq I(W) \geq I(W^-)\) — the channels “spread” in capacity while preserving the average. Iterating this polarizes them completely.

Successive cancellation (SC) decoding decodes \(u_1, u_2, \ldots, u_n\) in order, using previously decoded bits. On the “good” channels, \(u_i\) is decoded correctly; on the “bad” channels, \(u_i\) is frozen to a known value (typically 0). SC decoding runs in \(O(n \log n)\) time. Polar codes are used in 5G NR for control channels.
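For the BEC, polarization can be simulated exactly, since every synthetic channel is again a BEC: the erasure parameter evolves as \(z \mapsto 2z - z^2\) (minus) and \(z \mapsto z^2\) (plus). A Python sketch (the depth and the 0.01/0.99 thresholds are illustrative):

```python
def polarize_bec(eps, levels):
    """Erasure parameters of the 2**levels synthetic channels obtained
    by polarizing BEC(eps): minus gives 2z - z^2, plus gives z^2."""
    zs = [eps]
    for _ in range(levels):
        zs = [t for z in zs for t in (2 * z - z * z, z * z)]
    return zs

zs = polarize_bec(0.3, 16)                   # 65536 synthetic channels
good = sum(z < 0.01 for z in zs) / len(zs)   # nearly perfect channels
bad = sum(z > 0.99 for z in zs) / len(zs)    # nearly useless channels
```

The mean synthetic capacity is preserved exactly at \(C = 0.7\) at every level, while the fractions of near-perfect and near-useless channels approach \(0.7\) and \(0.3\); the bad channels are the ones frozen in SC decoding.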

Linear Codes and Duality

Definition (Linear Code): An \([n, k, d]\) linear code over \(\mathbb{F}_q\) is a \(k\)-dimensional subspace of \(\mathbb{F}_q^n\). It is specified by a generator matrix \(G \in \mathbb{F}_q^{k \times n}\) or a parity-check matrix \(H \in \mathbb{F}_q^{(n-k) \times n}\) with \(\mathcal{C} = \ker(H)\). The dual code \(\mathcal{C}^\perp\) has generator matrix \(H\).

If the dual code \(\mathcal{C}^\perp\) has minimum distance \(d^\perp\), then the coordinates of a uniformly random codeword of \(\mathcal{C}\) are \((d^\perp - 1)\)-wise independent — connecting coding theory to pseudorandomness and the probabilistic method.


Chapter 8: k-Wise Independence and Derandomisation

Many randomised algorithms only require their random bits to be “locally independent” — specifically, any \(k\) of the \(n\) bits are mutually independent. This allows dramatic savings in the number of truly random bits needed.

Definition (k-Wise Independence): Random variables \(X_1, \ldots, X_n\) over \(\{0,1\}\) are k-wise independent if for every \(k\)-element subset \(S \subseteq [n]\), the variables \((X_i)_{i \in S}\) are mutually independent.

The key construction uses Reed–Solomon codes: choose a random polynomial \(f \in \mathbb{F}_q[X]\) of degree \(< k\) uniformly, and set \(X_i = f(\alpha_i)\) for distinct field elements \(\alpha_i\). Since any \(k\) evaluations determine a unique polynomial of degree \(< k\), the variables are \(k\)-wise independent and uniform over \(\mathbb{F}_q\) (taking \(q = 2^m\) and outputting a fixed bit of each \(f(\alpha_i)\) yields binary variables). This requires only \(k \log q\) random bits — an exponential saving when \(k \ll n\).
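The construction can be verified exhaustively at toy parameters: over \(\mathbb{F}_5\) with \(k = 3\), the \(5^3\) polynomials induce every value triple exactly once at any 3 evaluation points. A Python sketch:

```python
from itertools import product
from collections import Counter

def kwise_counts(p, k, idx):
    """Evaluate every polynomial of degree < k over GF(p) at the
    points in idx and count the resulting value tuples; k-wise
    independence means each tuple appears equally often."""
    counts = Counter()
    for coeffs in product(range(p), repeat=k):
        def f(x):
            return sum(c * x ** j for j, c in enumerate(coeffs)) % p
        counts[tuple(f(i) for i in idx)] += 1
    return counts
```

At 3 points every triple appears exactly once (the interpolation bijection); at 2 points every pair appears exactly \(p^{k-2} = 5\) times.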

Application to E3-SAT

Theorem (Approximation Algorithm for MAX-E3-SAT): Using a 3-wise independent distribution, the algorithm that assigns variables from a 3-wise independent distribution satisfies each 3-clause with probability exactly \(7/8\). Derandomisation by enumerating all \(q^3\) seeds of the Reed-Solomon construction gives a deterministic polynomial-time algorithm achieving \(7/8\) of clauses.

Small-Bias Spaces and \(\varepsilon\)-Biased Sets

For tests that depend only on the XOR of subsets of variables (a broad class), we can use an even weaker notion than \(k\)-wise independence.

Definition (\(\varepsilon\)-Biased Distribution): A distribution \(\mu\) on \(\{0,1\}^n\) is \(\varepsilon\)-biased if for every non-empty \(S \subseteq [n]\), \[ \left| \Pr_\mu\!\left[\bigoplus_{i \in S} x_i = 0\right] - \frac{1}{2} \right| \leq \varepsilon. \]
Theorem (Explicit \(\varepsilon\)-Biased Sets, Naor-Naor 1990): There exist explicit \(\varepsilon\)-biased sets over \(\{0,1\}^n\) of size \(\mathrm{poly}(n/\varepsilon)\). The constructions of Alon-Goldreich-Håstad-Peralta, including one via linear feedback shift registers (LFSRs) over \(\mathbb{F}_{2^m}\), give size \(O((n/\varepsilon)^2)\); Ta-Shma's construction (2017) achieves size \(O(n/\varepsilon^{2+o(1)})\), nearly matching the information-theoretic lower bound \(\Omega(n/\varepsilon^2)\).

Small-bias spaces are a tool for derandomising tests that depend only on XORs of input variables; for such tests they can be far smaller than full \(k\)-wise independent sample spaces.

Expander Graphs and Pseudorandomness

Expander graphs are sparse graphs with strong connectivity properties — they behave “like” complete graphs for many purposes.

Definition (Spectral Expander): A \(d\)-regular graph \(G\) on \(n\) vertices is an \((n, d, \lambda)\)-expander if the second largest absolute eigenvalue of the adjacency matrix satisfies \(\lambda_2 \leq \lambda\). The spectral gap is \(d - \lambda\).
Theorem (Expander Mixing Lemma): For any \((n,d,\lambda)\)-expander and any subsets \(S, T \subseteq V\), \[ \left| e(S, T) - \frac{d|S||T|}{n} \right| \leq \lambda \sqrt{|S||T|}, \] where \(e(S,T)\) is the number of edges between \(S\) and \(T\). This says the graph is "pseudorandom": edge densities between large sets are close to what they would be in the complete graph.

The Alon-Boppana theorem gives the limit of the spectral gap: every \(d\)-regular graph satisfies \(\lambda \geq 2\sqrt{d-1} - o(1)\) as \(n \to \infty\). Ramanujan graphs achieving \(\lambda \leq 2\sqrt{d-1}\) exist (Margulis 1988, Lubotzky-Phillips-Sarnak 1988). Expanders have mixing time \(O(\log n / \log(d/\lambda))\) for random walks, making them useful for pseudorandomness, error reduction in randomised algorithms, and constructing LDPC codes (Sipser-Spielman LDPC codes from expanders have linear-time decoding).

Pseudorandom Generators

Definition (PRG): A function \(G: \{0,1\}^s \to \{0,1\}^n\) is a pseudorandom generator for a class \(\mathcal{F}\) of tests with error \(\varepsilon\) if for every \(f \in \mathcal{F}\), \[ \left| \Pr_{x \sim \{0,1\}^s}[f(G(x)) = 1] - \Pr_{y \sim \{0,1\}^n}[f(y) = 1] \right| \leq \varepsilon. \]

The Nisan-Wigderson (NW) PRG: given a Boolean function \(f: \{0,1\}^k \to \{0,1\}\) that is hard to compute on \(1/2 + \delta\) fraction of inputs for all polynomial-size circuits, the NW construction produces a PRG with seed length \(O(k)\) that fools all polynomial-size circuits with error \(\delta\). This establishes the equivalence between computational hardness and pseudorandomness — a cornerstone of complexity theory. For space-bounded computation, Nisan’s PRG (1992) fools width-\(w\) branching programs with seed length \(O(\log^2 n)\).


Chapter 9: Locally Testable Codes and PCPs

The theory of probabilistically checkable proofs (PCPs) is one of the great achievements of complexity theory, and its connection to coding theory runs deep. At its heart lies the remarkable linearity of the Hadamard code.

Hadamard Code as a Linear Code

Definition (Hadamard Code): For \(k\) bits of message \(x \in \{0,1\}^k\), the Hadamard encoding is \(\mathrm{Had}(x) = (x \cdot y \bmod 2)_{y \in \{0,1\}^k} \in \{0,1\}^{2^k}\). This is a \([2^k, k, 2^{k-1}]\) linear code with rate \(k/2^k\) and relative distance \(1/2\).

The Hadamard code is the dual of the repetition code. Its codewords are exactly the characters of the group \(\mathbb{F}_2^k\), i.e., the functions \(x \mapsto \langle a, x \rangle\) for each \(a \in \mathbb{F}_2^k\). The Walsh-Hadamard transform \(\hat{f}(a) = \sum_x (-1)^{f(x)+\langle a,x\rangle}\) diagonalises the code’s structure. Every pair of distinct codewords has Hamming distance exactly \(2^{k-1}\) — the code is equidistant.

Self-Testing and Self-Correction

Theorem (Hadamard Self-Correction): Given oracle access to \(f: \{0,1\}^k \to \{0,1\}\) that agrees with the Hadamard encoding of some \(x^*\) on a \(1 - \delta\) fraction of inputs (with \(\delta < 1/4\)), the following recovers \(x^* \cdot z\) for any \(z\) with probability \(\geq 1 - 2\delta\): sample uniform \(r \in \{0,1\}^k\) and output \(f(r) \oplus f(r \oplus z)\).

Linearity testing (BLR test): to test whether \(f\) is close to a linear function, pick random \(x, y \in \{0,1\}^k\) and check \(f(x) \oplus f(y) = f(x \oplus y)\). The BLR test (Blum-Luby-Rubinfeld 1993) has perfect completeness and detects any function that is \(\delta\)-far from linear with probability \(\Omega(\delta)\). This is the first example of a locally testable code (LTC): a code where membership can be probabilistically verified by reading a constant number of positions.
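The BLR test is a few lines of Python; below it runs on a genuinely linear function and on the majority of 4 bits, which is far from linear (the trial counts and specific functions are illustrative):

```python
import random

def blr_reject_rate(f, k, trials, rng):
    """Estimate the rejection probability of the BLR linearity test
    f(x) XOR f(y) != f(x XOR y) for f: {0,1}^k -> {0,1} on bitmask inputs."""
    rejections = 0
    for _ in range(trials):
        x, y = rng.randrange(2 ** k), rng.randrange(2 ** k)
        if f(x) ^ f(y) != f(x ^ y):
            rejections += 1
    return rejections / trials

def linear_fn(x):
    """A genuinely linear function: parity of the bits selected by a = 1011."""
    return bin(x & 0b1011).count("1") % 2

def majority4(x):
    """Majority of 4 bits: far from every linear function."""
    return 1 if bin(x & 0b1111).count("1") > 2 else 0
```

The linear function is never rejected (perfect completeness), while the far-from-linear function is rejected a constant fraction of the time.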

Low-Degree Testing and the Algebraic Approach

The extension from the Hadamard code (degree-1 polynomials) to higher degrees vastly increases the rate while preserving testability.

Theorem (Low-Degree Testing, Rubinfeld-Sudan 1996): There is a test that, given oracle access to a function \(f: \mathbb{F}_q^m \to \mathbb{F}_q\), accepts if \(f\) is a degree-\(d\) polynomial and rejects with probability \(\Omega(\varepsilon)\) if \(f\) is \(\varepsilon\)-far from every degree-\(d\) polynomial. The test reads \(O(1)\) values of \(f\) on random affine lines.

This enables the sum-check protocol: given a low-degree polynomial \(g: \mathbb{F}_q^n \to \mathbb{F}_q\) and a claimed sum \(K = \sum_{x \in \{0,1\}^n} g(x)\), the prover and verifier interact for \(n\) rounds. In round \(i\), the prover sends a univariate polynomial \(s_i(X_i)\) (the partial sum with the first \(i-1\) variables fixed to earlier challenges), and the verifier checks \(s_i(0) + s_i(1)\) against the previous round's value before sending a random challenge \(r_i\). At the end, the verifier evaluates \(g\) at the random point \((r_1, \ldots, r_n)\) (one oracle query). The protocol uses \(n\) rounds and \(O(n d \log q)\) bits of communication; applied to an arithmetization of a Boolean formula, it places \(\#\mathrm{SAT}\) in \(\mathcal{IP}\).
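The protocol is short enough to implement directly. A Python sketch with an honest prover computing partial sums by brute force, for a multilinear \(g\) so that each round's polynomial is determined by two evaluations (the field modulus and example polynomial are arbitrary choices):

```python
import random

P = 2 ** 31 - 1   # prime field modulus

def sumcheck(g, n, claimed, rng):
    """Interactive sum-check for K = sum over {0,1}^n of g(x), with g
    multilinear.  Round i's message is s_i(X) = sum over the remaining
    boolean variables of g(r_1..r_{i-1}, X, b); two evaluations suffice
    since s_i is linear in X."""
    rs = []
    expected = claimed % P
    for i in range(n):
        def partial(xi):
            tot, rest = 0, n - i - 1
            for m in range(2 ** rest):
                pt = rs + [xi] + [(m >> j) & 1 for j in range(rest)]
                tot = (tot + g(pt)) % P
            return tot
        s0, s1 = partial(0), partial(1)          # prover's message
        if (s0 + s1) % P != expected:            # verifier's consistency check
            return False
        r = rng.randrange(P)                     # verifier's random challenge
        expected = (s0 + r * ((s1 - s0) % P)) % P  # s_i(r) by interpolation
        rs.append(r)
    return g(rs) % P == expected                 # final oracle query
```

A true claim is always accepted, and a false claim fails the very first consistency check.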

PCP Theorem and Hardness of Approximation

Theorem (PCP Theorem, Arora-Safra; Arora-Lund-Motwani-Sudan-Szegedy 1992/1998): NP has proofs that can be verified probabilistically using \(O(\log n)\) random bits and \(O(1)\) adaptive queries to the proof, with perfect completeness and constant soundness.

The PCP theorem has a dramatic corollary for approximation algorithms.

Theorem (Håstad's 3-Bit PCP): For any \(\varepsilon > 0\), approximating MAX-3SAT beyond a factor of \(7/8 + \varepsilon\) is NP-hard. Since the random assignment achieves \(7/8\), this is a tight inapproximability result.

Håstad’s PCP uses 3-query tests built on the long code: the encoding of an index \(i \in [n]\) as the dictator function \(f(x) = x_i\). The test picks random \(x, y\) and a sparse random noise vector \(\eta\), and checks \(f(x) \oplus f(y) \oplus f(x \oplus y \oplus \eta) = 0\) (with folding to rule out degenerate tables). The Unique Games Conjecture (Khot 2002) — that it is NP-hard to distinguish \((1-\varepsilon)\)-satisfiable 2-CSPs with unique constraints from \(\varepsilon\)-satisfiable ones — if true, would give tight inapproximability for many problems including VERTEX-COVER and MAX-CUT.

Exponential-Length PCPs for 3-SAT

Theorem (Exponential PCP for 3-SAT): There is a PCP verifier for 3-SAT that uses a proof of length \(2^{O(n^2)}\) and makes \(O(1)\) queries, with perfect completeness and soundness \(< 1\).

The construction: reduce satisfiability to a system of quadratic equations over \(\mathbb{F}_2\), and encode the satisfying assignment \(x \in \{0,1\}^n\) by the Hadamard codeword of \(x\) together with the Hadamard codeword of the outer product \(x \otimes x\) (all products \(x_i x_j\)), for total proof length \(2^{n + n^2}\). The verifier runs the BLR linearity test on both tables, checks that they are consistent with each other, and then checks a random linear combination of the quadratic constraints by reading a constant number of (self-corrected) bits. Soundness follows because a random linear combination of violated constraints is itself violated with probability \(1/2\).


Chapter 10: Graph Entropy and Monotone Complexity

Graph entropy, introduced by Körner (1973), extends the notion of Shannon entropy to combinatorial structures. It provides tight lower bounds on formula complexity that purely combinatorial or algebraic methods cannot match.

Definition (Graph Entropy): Let \(G = (V, E)\) be a graph and \(P\) a probability distribution on \(V\). The graph entropy of \(G\) under \(P\) is \[ H(G, P) = \min_{(X, Y)} I(X; Y), \] where the minimum is over pairs of random variables \((X, Y)\) such that \(Y \sim P\), \(X\) takes values in the independent (stable) sets of \(G\), and \(Y \in X\) with probability 1.

Graph entropy satisfies \(H(G, P) \leq H(P)\), with equality when \(G\) is the complete graph. For an independent set (edgeless graph), \(H(\bar{K}_n, P) = 0\).

Graph Entropy and Fractional Chromatic Number

Theorem (Graph Entropy and Fractional Chromatic Number): For every graph and distribution, \(H(G, P) \leq \log \chi_f(G)\), and for a vertex-transitive graph \(G\) under the uniform distribution \(U\), \[ H(G, U) = \log \chi_f(G), \] where \(\chi_f(G)\) is the fractional chromatic number of \(G\). Moreover, for any distribution \(P\), \(H(G, P) + H(\bar{G}, P) \geq H(P)\), where \(\bar{G}\) is the complement graph.

This connects graph theory and information theory beautifully. The fractional chromatic number \(\chi_f(G) = \min\{n/k : G \to K_{n:k}\}\) where \(K_{n:k}\) is the Kneser graph, so graph entropy literally encodes chromatic information.

Körner’s Zero-Error Capacity

Shannon introduced the zero-error capacity \(C_0(G)\) of a channel: the maximum rate of communication with zero probability of error, where \(G\) is the channel’s confusability graph. Specifically, \(C_0(G) = \sup_n \frac{1}{n} \log \alpha(G^{\boxtimes n})\), where \(\alpha\) is the independence number and \(G^{\boxtimes n}\) is the \(n\)-fold strong product: two length-\(n\) sequences are confusable iff they are equal or confusable in every coordinate.

Theorem (Lovász \(\theta\)-Function, Lovász 1979): For any graph \(G\), \[ \alpha(G) \leq \vartheta(G) \leq \chi_f(\bar{G}), \] where \(\vartheta(G)\) is the Lovász theta function (computable by semidefinite programming). Moreover, \(C_0(G) \leq \log \vartheta(G)\). For the 5-cycle \(C_5\): \(\vartheta(C_5) = \sqrt{5}\), so \(C_0(C_5) = \frac{1}{2}\log 5\), which is tight.
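The value \(\alpha(C_5 \boxtimes C_5) = 5\) behind \(C_0(C_5) = \frac{1}{2}\log 5\) can be verified by brute force: Shannon's construction \(\{(i, 2i \bmod 5)\}\) is an independent set of size 5, and a small branch-and-bound search confirms nothing larger exists. A Python sketch:

```python
from itertools import product

def c5_strong_square():
    """Adjacency of the strong product C5 x C5: distinct vertices are
    adjacent iff both coordinates are equal or adjacent on the 5-cycle."""
    def close(a, b):
        return a == b or (a - b) % 5 in (1, 4)
    verts = list(product(range(5), repeat=2))
    adj = {v: {u for u in verts
               if u != v and close(u[0], v[0]) and close(u[1], v[1])}
           for v in verts}
    return verts, adj

def max_independent_set_size(verts, adj):
    """Simple branch-and-bound maximum independent set."""
    best = 0
    def grow(cands, size):
        nonlocal best
        if size + len(cands) <= best:
            return
        if not cands:
            best = max(best, size)
            return
        v = cands[0]
        grow([u for u in cands[1:] if u not in adj[v]], size + 1)  # take v
        grow(cands[1:], size)                                       # skip v
    grow(verts, 0)
    return best
```

Since \(\alpha(C_5) = 2\), a single use of the channel achieves only rate \(\log 2 = 1 < \frac{1}{2}\log 5\): using the channel twice genuinely helps.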

The Lovász theta function is the first example of a polynomial-time computable graph invariant sandwiched between the independence number \(\alpha(G)\) and the fractional clique cover number \(\chi_f(\bar{G})\) — a landmark in the interplay between information theory, combinatorics, and semidefinite programming.
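For odd cycles, Lovász gave the closed form \(\vartheta(C_n) = \frac{n \cos(\pi/n)}{1 + \cos(\pi/n)}\), which for \(n = 5\) evaluates to \(\sqrt{5}\). A quick numerical check of this special case (no SDP solver needed):

```python
import math

def theta_odd_cycle(n):
    """Lovász theta function of the odd cycle C_n (closed form, Lovász 1979)."""
    c = math.cos(math.pi / n)
    return n * c / (1 + c)

print(theta_odd_cycle(5))     # sqrt(5) = 2.2360679...
print(0.5 * math.log2(5))     # zero-error capacity C_0(C5) = (1/2) log 5 bits
```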

Theorem (Sub-additivity of Graph Entropy): If \(G = G_1 \cup G_2\) (edge union on the same vertex set), then \(H(G, P) \leq H(G_1, P) + H(G_2, P)\).

Entropy and Graph Coloring

Graph entropy also connects to list coloring and Witsenhausen’s rate.

Definition (Witsenhausen Rate): For a source \(X \sim P\) on \(V(G)\) where confusable vertices are adjacent in \(G\) (the sender and receiver must agree on a value, but the receiver may observe a "nearby" \(X\)), the Witsenhausen rate \(W(G, P)\) is the minimum asymptotic rate of a zero-error encoding of \(X\). The single-shot quantity is the chromatic entropy \(H_\chi(G, P) = \min_{\text{proper coloring } c} H(c(X))\), which satisfies \(H_\chi(G, P) \leq \log \chi(G)\).

For the cycle \(C_5\) with uniform distribution, the best proper colouring has colour classes of sizes \((2,2,1)\), giving \(H_\chi(C_5, U) = \frac{4}{5}\log\frac{5}{2} + \frac{1}{5}\log 5 \approx 1.52\) bits, while the Körner entropy is \(H(C_5, U) = \log\frac{5}{2} \approx 1.32\) (since \(\chi_f(C_5) = 5/2\)). The gap between chromatic entropy and Körner entropy reflects the gap between integer and fractional chromatic number. This has operational meaning: the Witsenhausen rate governs the minimum number of bits per source symbol needed for zero-error communication when the receiver observes a confusable version of the source.
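The chromatic entropy of \(C_5\) under the uniform distribution can be found by brute force over proper colourings (a sketch; under a uniform source, splitting a colour class only raises entropy, so restricting to 3 colours is enough here):

```python
import math
from itertools import product

edges = [(i, (i + 1) % 5) for i in range(5)]   # the 5-cycle

def coloring_entropy(col):
    """Entropy of c(X) for X uniform on the 5 vertices."""
    n = len(col)
    return -sum((c / n) * math.log2(c / n)
                for c in (col.count(k) for k in set(col)))

h_chi = min(coloring_entropy(col)
            for col in product(range(3), repeat=5)
            if all(col[u] != col[v] for u, v in edges))
# the minimum is attained by colour classes of sizes (2, 2, 1)
```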

Lower Bound for the Threshold Function

Theorem (Threshold Formula Complexity): Any monotone formula computing the \(k\)-out-of-\(n\) threshold function \(T_k^n\) requires size \(\Omega(n^{5/2})\).

The proof uses the graph entropy of the Kneser graph \(K(n, k)\): vertices are the \(k\)-subsets of \([n]\), with edges between disjoint subsets. Graph entropy lower-bounds formula size: each gate of the formula can “resolve” only a bounded amount of graph entropy. The Kneser graph has \(\chi_f(K(n,k)) = n/k\), so \(H(K(n,k), U) = \log(n/k)\), and the formula complexity lower bound follows by propagating this through the tree structure of the formula.

The argument proceeds as follows: define a “complexity measure” \(\mu(f)\) for each monotone Boolean function \(f\) as the graph entropy of its associated confusability graph under the uniform distribution. One shows (i) \(\mu(T_k^n) = \Theta(\log(n/k))\), (ii) \(\mu\) is subadditive over gates: \(\mu(f \vee g) \leq \mu(f) + \mu(g)\), and (iii) each input literal \(x_i\) has \(\mu(x_i) = O(1/n)\). Combining: any formula for \(T_k^n\) of size \(s\) satisfies \(\mu(T_k^n) \leq s \cdot O(1/n)\), giving \(s = \Omega(n \log(n/k))\). For \(k = \lfloor n/2 \rfloor\), this yields \(\Omega(n)\), and a more refined analysis using the Kneser graph structure gives the \(\Omega(n^{5/2})\) bound.


Chapter 11: The Lovász Local Lemma

The Lovász Local Lemma (LLL) is a fundamental probabilistic existence argument. Given a collection of “bad” events that each occur with some probability but are “mostly independent” of each other, LLL guarantees that with positive probability, none of the bad events occurs.

Theorem (Symmetric Lovász Local Lemma): Let \(A_1, \ldots, A_m\) be events in a probability space. Suppose each \(A_i\) has \(P(A_i) \leq p\), and each \(A_i\) is mutually independent of all but at most \(d\) other events. If \[ e p (d + 1) \leq 1, \] then \(P\left( \bigcap_{i=1}^m \overline{A_i} \right) > 0\).
Theorem (Asymmetric LLL): If there exist \(x_i \in (0,1)\) such that for all \(i\), \[ P(A_i) \leq x_i \prod_{j \in \Gamma(i)} (1 - x_j), \] where \(\Gamma(i)\) is the set of events not independent of \(A_i\), then \(P(\bigcap_i \bar{A}_i) \geq \prod_i (1-x_i) > 0\).
Example (LLL for Ek-SAT): In an Ek-SAT instance where each clause involves exactly \(k\) distinct variables and each variable appears in at most \(r\) clauses, set \(A_i\) = "clause \(i\) is violated" under a uniformly random assignment. Then \(P(A_i) = 2^{-k}\), and each \(A_i\) depends on at most \(d = k(r-1)\) other events. By LLL, the instance is satisfiable whenever \(e \cdot 2^{-k} \cdot (k(r-1)+1) \leq 1\).
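The symmetric condition is easy to tabulate: for each clause width \(k\), one can compute the largest per-variable occurrence count \(r\) that the bound certifies (a small helper; the function names are illustrative):

```python
import math

def lll_certifies(k, r):
    """Symmetric LLL condition for Ek-SAT: e * 2^-k * (k(r-1) + 1) <= 1."""
    return math.e * 2.0 ** (-k) * (k * (r - 1) + 1) <= 1

def max_occurrences(k):
    """Largest r such that every Ek-SAT instance with each variable
    in at most r clauses is certified satisfiable by the symmetric LLL."""
    r = 1
    while lll_certifies(k, r + 1):
        r += 1
    return r

print([(k, max_occurrences(k)) for k in range(3, 9)])
```

The certified \(r\) grows roughly like \(2^k/(ek)\), reflecting the exponential slack in the clause-violation probability.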

Algorithmic LLL via Entropy Compression

Theorem (Algorithmic LLL, Moser–Tardos 2010): Under the conditions of LLL, the resampling algorithm terminates in expected polynomial time.
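The resampling algorithm itself is only a few lines: sample a uniform assignment, then, while some bad event holds, resample the variables of one violated event. A minimal sketch for \(k\)-SAT (a clause is a list of signed 1-indexed literals; names and the fixed seed are illustrative):

```python
import random

def moser_tardos_ksat(clauses, n_vars, seed=0):
    """Resample violated clauses until the assignment satisfies everything.
    Literal +v means variable v true, -v means variable v false (1-indexed)."""
    rng = random.Random(seed)
    assign = [rng.random() < 0.5 for _ in range(n_vars)]

    def violated(cl):
        return all(assign[abs(l) - 1] != (l > 0) for l in cl)

    resamples = 0
    while True:
        bad = next((cl for cl in clauses if violated(cl)), None)
        if bad is None:
            return assign, resamples
        for l in bad:                     # resample vbl(A_i) uniformly
            assign[abs(l) - 1] = rng.random() < 0.5
        resamples += 1

clauses = [[1, 2, 3], [-1, -2, 3], [1, -2, -3], [-1, 2, -3]]
assignment, steps = moser_tardos_ksat(clauses, 3)
assert all(any(assignment[abs(l) - 1] == (l > 0) for l in cl) for cl in clauses)
```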

The entropy compression proof (Moser 2009) gives the most illuminating argument. Define the following encoding scheme for the algorithm’s execution: the algorithm uses a random tape \(R\) (bits for the initial assignment and for resamplings). Whenever a bad event \(A_i\) occurs at time \(t\), record a witness tree — a tree whose root is \(A_i\) and whose children encode which events caused further resamplings. The key claim: the witness tree at time \(t\), together with the current assignment, determines the bits of \(R\) consumed so far, so the random tape can be decoded from a compressed representation whose length is proportional to the tree size. Since uniformly random bits are incompressible (no encoding of \(N\) uniform bits has expected length much below \(N\)), the number of resampling steps must be bounded. Concretely: if there are \(T\) resampling steps, the witness-tree encoding uses \(O(T \log d) + O(\log m)\) bits, while the random tape consumed is at least \(kT\) bits, where \(k\) is the number of fresh bits per resampling. When the LLL condition holds, \(\log d\) is enough smaller than \(k\) that the encoding is a genuine compression, bounding \(T\) polynomially in \(n\) and \(m\).

This argument makes a beautiful connection: the LLL condition \(ep(d+1) \leq 1\) is exactly the condition under which the witness tree encoding compresses the random tape, i.e., the condition under which the algorithm cannot run forever without violating the incompressibility of randomness.

Applications and Extensions of LLL

\(k\)-colorability of sparse graphs: A uniformly random \(k\)-colouring of a graph with maximum degree \(\Delta\) is proper with positive probability whenever \(k \geq e(2\Delta - 1)\). This follows from LLL applied to the bad events “edge \((u,v)\) is monochromatic,” each with probability \(1/k\) and dependency degree at most \(2(\Delta-1)\): setting \(p = 1/k\) and \(d = 2(\Delta-1)\) gives the condition \((e/k)(2\Delta - 1) \leq 1\). (Greedy colouring already achieves \(k = \Delta + 1\); the point here is the method, which generalises to list colouring and to colourings with extra constraints.)

Property B (2-colorability of hypergraphs): A hypergraph \(\mathcal{H} = (V, \mathcal{E})\) in which every edge has size \(\geq k\) and every edge intersects at most \(d\) other edges is 2-colorable whenever \(e \cdot 2^{1-k} \cdot (d+1) \leq 1\). This follows from LLL with \(p = 2^{1-k}\), the probability that a given edge is monochromatic under a uniformly random 2-colouring.

Variable version of LLL (Moser-Tardos full version): In the variable setting, each event \(A_i\) is determined by a set of variables \(\mathrm{vbl}(A_i)\), and two events are dependent exactly when their variable sets intersect. After resampling the variables in \(\mathrm{vbl}(A_i)\), the new values are independent of everything outside \(\mathrm{vbl}(A_i)\). This structure makes the witness tree argument particularly clean: each node in the witness tree contributes fresh randomness to the encoding. The expected number of resamplings of event \(A_i\) is at most \(x_i / (1 - x_i)\) in the asymmetric LLL setting, giving a precise bound on algorithm runtime.

Lossless coding proof: The original Moser proof can be phrased as a lossless coding argument. The compression scheme encodes the sequence of resampled events as a prefix-free code. The LLL condition ensures the code has length less than the number of random bits used — a contradiction with Shannon’s source coding lower bound (or equivalently, Kolmogorov complexity lower bound) if the algorithm runs for too many steps.


Chapter 12: Extended Formulations and Non-Negative Rank

The theory of extended formulations connects linear programming with communication complexity — a surprising and deep bridge that allows information-theoretic methods to prove lower bounds on LP sizes.

Extended Formulations and Slack Matrices

Definition (Extended Formulation): A polytope \(P \subseteq \mathbb{R}^d\) has an extended formulation if there exists a polytope \(Q \subseteq \mathbb{R}^{d+e}\) and a linear projection \(\pi: \mathbb{R}^{d+e} \to \mathbb{R}^d\) with \(\pi(Q) = P\). The extension complexity \(\mathrm{xc}(P)\) is the minimum number of facets of \(Q\) over all extended formulations.
Example (Permutahedron): The permutahedron \(\Pi_n\) (convex hull of all permutations of \((1, 2, \ldots, n)\)) has \(2^n - 2\) facets. However, it has an extended formulation of size \(O(n \log n)\) using the "sorting network" construction. This exponential gap illustrates the power of extended formulations.
Definition (Slack Matrix): Given a polytope \(P\) with vertices \(v_1, \ldots, v_V\) and facets \(h_1, \ldots, h_F\) (inequalities \(h_j(x) \leq 0\) for \(x \in P\)), the slack matrix \(S \in \mathbb{R}^{V \times F}\) is \(S_{ij} = -h_j(v_i) \geq 0\).
Theorem (Yannakakis Factorization Theorem, 1991): The extension complexity of \(P\) equals the non-negative rank of its slack matrix: \[ \mathrm{xc}(P) = \mathrm{rank}_+(S). \] Here \(\mathrm{rank}_+(M) = \min\left\{ r : M = \sum_{k=1}^r u_k v_k^T, \; u_k, v_k \geq 0 \right\}\).

Lower Bound via Communication Complexity

Theorem (Non-Negative Rank and Communication, Yannakakis): For a non-negative matrix \(M\), \[ \log \mathrm{rank}_+(M) \geq N(f_M), \] where \(f_M(i,j) = \mathbf{1}[M_{ij} > 0]\) is the support function and \(N\) is nondeterministic communication complexity: a rank-\(r\) non-negative factorisation \(M = \sum_{k=1}^r u_k v_k^T\) covers the support of \(M\) by the \(r\) rectangles \(\mathrm{supp}(u_k) \times \mathrm{supp}(v_k)\). Stronger, quantitative lower bounds on \(\mathrm{rank}_+\) follow from the corruption bound.

Applied to the correlation polytope COR\(_n\) (convex hull of \(\{xx^T : x \in \{0,1\}^n\}\)): its slack matrix contains the unique-disjointness matrix, and Razborov’s corruption bound for Set Disjointness lifts to show \(\mathrm{rank}_+(\text{slack}) \geq 2^{\Omega(n)}\) (Fiorini-Massar-Pokutta-Tiwari-de Wolf 2012). Therefore \(\mathrm{xc}(\mathrm{COR}_n) = 2^{\Omega(n)}\). This resolved Yannakakis’s 1991 question and showed the symmetric TSP polytope (which contains COR\(_n\) as a face) has no polynomial-size LP formulation.

Rectangle Bound and Partition Bound

For randomised communication complexity, the corruption bound and the partition bound give strong lower bounds.

Definition (Rectangle and Corruption Bound): A combinatorial rectangle is a set \(A \times B \subseteq \mathcal{X} \times \mathcal{Y}\). A deterministic protocol with \(c\) bits partitions \(\mathcal{X} \times \mathcal{Y}\) into at most \(2^c\) monochromatic rectangles. The corruption bound states: if, under a distribution \(\mu\), every rectangle \(R\) containing at most an \(\varepsilon\) fraction of 0-inputs (\(\mu(R \cap f^{-1}(0)) \leq \varepsilon\, \mu(R)\)) must be small (\(\mu(R) \leq \gamma\)), then any randomised protocol with sufficiently small error communicates \(\Omega(\log(1/\gamma))\) bits.

The partition bound (Jain-Klauck 2010) is a linear programming relaxation of randomised communication complexity that dominates the corruption bound, the discrepancy bound, and the smooth rectangle bound. The smooth rectangle bound has a clean operational interpretation: it measures how well the communication matrix can be covered by rectangles that are nearly monochromatic with respect to a slightly perturbed version of the function. For Set Disjointness, the corruption bound already yields the tight \(\Omega(n)\) lower bound on randomised complexity.

The hierarchy of lower bound methods for randomised CC, roughly from weakest to strongest: discrepancy \(\leq\) corruption/rectangle \(\leq\) smooth rectangle \(\leq\) partition bound. The partition bound is tight for many natural problems, including DISJ, IP, and GT (Greater-Than), and it connects to information-theoretic methods: the relaxed partition bound is dominated by information complexity (Kerenidis et al. 2012), which in turn equals amortised communication (Braverman-Rao 2011).

Discrepancy method: For a function \(f\) and distribution \(\mu\), the discrepancy is \(\mathrm{disc}_\mu(f) = \max_{R = A \times B} |\mu(R \cap f^{-1}(1)) - \mu(R \cap f^{-1}(0))|\). Any randomised protocol with error \(\varepsilon\) satisfies \(\mathrm{disc}_\mu(f) \geq (1-2\varepsilon) \cdot 2^{-c}\), where \(c\) is the number of bits communicated. For the inner product function IP\(_n\) over \(\mathbb{F}_2^n\) with the uniform distribution, the discrepancy is at most \(2^{-n/2}\) (Lindsey's lemma), giving \(R^{\mathrm{pub}}(\mathrm{IP}_n) = \Omega(n)\). The discrepancy method connects combinatorics (rectangle covers), Fourier analysis (characters of the distribution), and communication complexity.
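For tiny \(n\), the discrepancy of IP can be computed exactly by enumerating all rectangles — a brute-force check that the \(2^{-n/2}\) bound holds (the enumeration is doubly exponential in \(n\), so this is only feasible for \(n = 2\)):

```python
from itertools import combinations, product

n = 2
inputs = list(product([0, 1], repeat=n))      # {0,1}^n

def ip(x, y):
    return sum(a * b for a, b in zip(x, y)) % 2

def subsets(s):
    for r in range(len(s) + 1):
        yield from combinations(s, r)

N = len(inputs) ** 2                          # uniform distribution over pairs
disc = max(
    abs(sum(1 if ip(x, y) else -1 for x in A for y in B)) / N
    for A in subsets(inputs) for B in subsets(inputs)
)
assert disc <= 2 ** (-n / 2)                  # Lindsey-type bound
```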

Approximate Nonnegative Rank and Quantum Communication

The connection between communication complexity and LP complexity extends to quantum computation.

Definition (Approximate Nonnegative Rank): The \(\varepsilon\)-approximate nonnegative rank \(\mathrm{rank}_{+,\varepsilon}(M)\) is the minimum nonnegative rank of a nonnegative matrix \(\widetilde{M}\) with \(\|M - \widetilde{M}\|_\infty \leq \varepsilon\). The approximate rank of \(f\) (over \(\mathbb{R}\)) is \(\mathrm{rank}_\varepsilon(M_f)\), where \(M_f\) is the communication matrix.
Theorem (Quantum CC and Log-Rank): For any Boolean function \(f\), the quantum communication complexity satisfies \[ Q(f) = \Omega(\log \mathrm{rank}_\varepsilon(M_f)). \] The log-rank conjecture (Lovász-Saks 1988) states that \(\mathrm{D}(f) \leq (\log \mathrm{rank}(M_f))^c\) for some constant \(c\) — i.e., deterministic CC is at most polynomial in log-rank.

Chapter 13: Quantum Information Theory

Quantum information theory extends classical information theory to quantum mechanical systems. The central object replaces probability distributions with density matrices, and entropy is generalised accordingly.

Definition (Density Matrix): A quantum state on a \(d\)-dimensional Hilbert space is described by a density matrix \(\rho\): a \(d \times d\) positive semidefinite matrix with \(\mathrm{Tr}(\rho) = 1\). A pure state is \(\rho = |\psi\rangle\langle\psi|\) for some unit vector \(|\psi\rangle\). A mixed state is a convex combination of pure states.
Definition (Von Neumann Entropy): The von Neumann entropy of a density matrix \(\rho\) is \[ S(\rho) = -\mathrm{Tr}(\rho \log \rho) = -\sum_i \lambda_i \log \lambda_i, \] where \(\lambda_1, \ldots, \lambda_d\) are the eigenvalues of \(\rho\).

Von Neumann entropy inherits properties from Shannon entropy: \(S(\rho) \geq 0\) with equality for pure states; \(S(\rho) \leq \log d\) with equality for the maximally mixed state \(I/d\); and concavity in \(\rho\). But quantum entropy also has properties with no classical analogue.
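These properties are easy to check numerically for a qubit, where the eigenvalues of a real symmetric \(2 \times 2\) density matrix have a closed form (a sketch; real off-diagonal entries only):

```python
import math

def von_neumann_entropy_2x2(rho):
    """S(rho) in bits for a real symmetric 2x2 density matrix [[a, b], [b, d]]."""
    a, b, d = rho[0][0], rho[0][1], rho[1][1]
    tr, det = a + d, a * d - b * b
    disc = math.sqrt(max(tr * tr - 4 * det, 0.0))
    eigs = [(tr + disc) / 2, (tr - disc) / 2]
    return -sum(l * math.log2(l) for l in eigs if l > 1e-12)

S_mixed = von_neumann_entropy_2x2([[0.5, 0.0], [0.0, 0.5]])  # I/2: 1 bit
S_pure = von_neumann_entropy_2x2([[0.5, 0.5], [0.5, 0.5]])   # |+><+|: 0 bits
```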

Strong Subadditivity

Theorem (Strong Subadditivity of Quantum Entropy, Lieb-Ruskai 1973): For any tripartite quantum state \(\rho_{ABC}\), \[ S(\rho_{ABC}) + S(\rho_B) \leq S(\rho_{AB}) + S(\rho_{BC}). \]

Strong subadditivity (SSA) is one of the deepest inequalities in quantum information theory. It was conjectured by Lanford-Robinson in 1968 and proved by Lieb-Ruskai using Lieb’s concavity theorem for matrix functions. The classical analogue is the elementary inequality \(H(X|Y,Z) \leq H(X|Y)\) (conditioning reduces entropy); SSA has no comparably elementary quantum proof and requires sophisticated operator inequalities. A striking quantum departure: \(S(\rho_{AB})\) can be smaller than \(S(\rho_A)\) — for a pure entangled state \(S(\rho_{AB}) = 0\) while \(S(\rho_A) > 0\) — whereas the monotonicity \(S(\rho_{AB}) \geq S(\rho_A)\) does hold when \(B\) is a classical system.

Holevo’s Theorem

Theorem (Holevo, 1973): Suppose Alice prepares state \(\rho_i\) with probability \(p_i\), sends it to Bob, and Bob performs any quantum measurement. The accessible classical information is bounded by the Holevo quantity: \[ I(A; B) \leq \chi = S\!\left( \sum_i p_i \rho_i \right) - \sum_i p_i S(\rho_i). \]

For pure states, \(S(\rho_i) = 0\), so \(\chi = S(\rho_\mathrm{avg}) \leq \log d\). This means a qubit carries at most 1 bit of accessible classical information. The bound is achieved only asymptotically, using collective measurements over many channel uses: the Holevo-Schumacher-Westmoreland (HSW) theorem states that the classical capacity of a quantum channel is the regularised Holevo information \(\lim_{n \to \infty} (1/n)\, \chi(\mathcal{N}^{\otimes n})\).
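A concrete instance: the ensemble \(\{|0\rangle, |+\rangle\}\) with equal probabilities. Both states are pure, so \(\chi = S(\rho_{\mathrm{avg}})\), and \(\rho_{\mathrm{avg}} = \left(\begin{smallmatrix}3/4 & 1/4\\ 1/4 & 1/4\end{smallmatrix}\right)\) has eigenvalues \((1 \pm 1/\sqrt{2})/2\). A numerical check:

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

lam = (1 + 1 / math.sqrt(2)) / 2   # larger eigenvalue of rho_avg
chi = h2(lam)                      # Holevo quantity: about 0.60 bits, strictly < 1
assert chi < 1.0
```

Non-orthogonal signal states cannot be distinguished perfectly, and the Holevo quantity makes the resulting loss of accessible information quantitative.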

Quantum Channel Capacity and Superadditivity

Definition (Holevo Capacity): The classical capacity of a quantum channel \(\mathcal{N}: \mathcal{L}(A) \to \mathcal{L}(B)\) is \[ C(\mathcal{N}) = \lim_{n \to \infty} \frac{1}{n}\, \chi(\mathcal{N}^{\otimes n}), \qquad \chi(\mathcal{N}) = \max_{\{p_i, \rho_i\}} \left[ S\!\left(\mathcal{N}\Big(\sum_i p_i \rho_i\Big)\right) - \sum_i p_i\, S(\mathcal{N}(\rho_i)) \right]. \]

A fundamental open question was whether the Holevo quantity is additive: does \(\chi(\mathcal{N}_1 \otimes \mathcal{N}_2) = \chi(\mathcal{N}_1) + \chi(\mathcal{N}_2)\)? Equivalently, can entangled inputs increase the classical capacity? Hastings (2009) proved the additivity conjecture is false: there exist channels for which entangled inputs across multiple uses give strictly higher classical capacity. This is a purely quantum phenomenon with no classical analogue.

The quantum capacity \(Q(\mathcal{N})\) of a channel is the maximum rate of reliable transmission of quantum information (qubits). The quantum channel coding theorem (Lloyd-Shor-Devetak, 1997-2005) states that \(Q(\mathcal{N}) = \lim_{n \to \infty} (1/n) I_c(\mathcal{N}^{\otimes n})\), where \(I_c\) is the coherent information. Anti-degradable channels (and PPT channels) have zero quantum capacity; no complete characterisation of zero-capacity channels is known. Coherent information is also superadditive: \(I_c(\mathcal{N}_1 \otimes \mathcal{N}_2) > I_c(\mathcal{N}_1) + I_c(\mathcal{N}_2)\) is possible — indeed, two channels each with zero quantum capacity can have positive joint capacity (superactivation, Smith-Yard 2008).

Quantum Error Correction

Quantum systems are fragile: any interaction with the environment can cause decoherence. Quantum error correction (QEC) protects quantum information against noise.

Definition (Quantum Error Model): Common noise models: bit flip channel \(\mathcal{E}(\rho) = (1-p)\rho + p X\rho X\); phase flip channel \(\mathcal{E}(\rho) = (1-p)\rho + p Z\rho Z\); depolarising channel \(\mathcal{E}(\rho) = (1-p)\rho + (p/3)(X\rho X + Y\rho Y + Z\rho Z)\). The Pauli matrices are \(X = \left(\begin{smallmatrix}0&1\\1&0\end{smallmatrix}\right)\), \(Z = \left(\begin{smallmatrix}1&0\\0&-1\end{smallmatrix}\right)\), \(Y = iXZ\).
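These channels are direct to simulate on \(2 \times 2\) density matrices (a sketch using plain nested tuples and Python's complex arithmetic; no external libraries):

```python
# Pauli matrices as 2x2 complex matrices
X = ((0, 1), (1, 0))
Z = ((1, 0), (0, -1))
Y = ((0, -1j), (1j, 0))

def mat_mul(A, B):
    return tuple(tuple(sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2))
                 for i in range(2))

def depolarize(rho, p):
    """Depolarising channel E(rho) = (1-p) rho + (p/3)(X rho X + Y rho Y + Z rho Z)."""
    terms = [mat_mul(mat_mul(P, rho), P) for P in (X, Y, Z)]
    return tuple(tuple((1 - p) * rho[i][j] + (p / 3) * sum(t[i][j] for t in terms)
                       for j in range(2)) for i in range(2))

rho0 = ((1, 0), (0, 0))          # |0><0|
out = depolarize(rho0, 0.3)      # diag(1 - 2p/3, 2p/3): trace preserved
```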

The 3-qubit repetition code encodes \(|\psi\rangle = \alpha|0\rangle + \beta|1\rangle\) as \(\alpha|000\rangle + \beta|111\rangle\), correcting single bit flips by majority vote. The Shor code (9 qubits) concatenates a 3-qubit phase-flip code with a 3-qubit bit-flip code, correcting any single-qubit error.

Definition (Stabilizer Code): An \([[n,k,d]]\) stabilizer code encodes \(k\) logical qubits into \(n\) physical qubits with distance \(d\). It is defined by an abelian subgroup \(\mathcal{S}\) of the \(n\)-qubit Pauli group, with the codespace being the \(+1\) eigenspace of all stabilizers. The CSS (Calderbank-Shor-Steane) construction builds stabilizer codes from classical codes \(\mathcal{C}_1\) and \(\mathcal{C}_2\) with \(\mathcal{C}_2 \subseteq \mathcal{C}_1\): \(X\)-stabilizers from \(\mathcal{C}_2^\perp\), \(Z\)-stabilizers from \(\mathcal{C}_1^\perp\).
Theorem (Quantum Hamming Bound): An \([[n,k,d]]\) quantum code satisfies \[ \sum_{j=0}^{t} \binom{n}{j} 3^j \leq 2^{n-k}, \] where \(t = \lfloor (d-1)/2 \rfloor\). This parallels the classical Hamming bound.
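The bound is a one-liner to check. The five-qubit \([[5,1,3]]\) code meets it with equality (a perfect quantum code), while a hypothetical \([[4,1,3]]\) code would violate it:

```python
from math import comb

def q_hamming_ok(n, k, d):
    """Quantum Hamming bound for an [[n, k, d]] code."""
    t = (d - 1) // 2
    return sum(comb(n, j) * 3 ** j for j in range(t + 1)) <= 2 ** (n - k)

assert q_hamming_ok(5, 1, 3)        # 1 + 5*3 = 16 = 2^4: equality, perfect code
assert not q_hamming_ok(4, 1, 3)    # 1 + 4*3 = 13 > 2^3: impossible
assert q_hamming_ok(9, 1, 3)        # the Shor code obeys it comfortably
```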
Theorem (Fault-Tolerant Threshold Theorem): If the physical error rate per gate is below a threshold \(p_{\mathrm{th}} \approx 10^{-3}\) to \(10^{-2}\), then arbitrarily long quantum computations can be performed fault-tolerantly with polylogarithmic overhead. The threshold is achieved by concatenating quantum error-correcting codes.

BB84 Quantum Key Distribution

Quantum mechanics enables information-theoretically secure key distribution — a task impossible classically against a computationally unbounded eavesdropper.

Protocol (BB84, Bennett-Brassard 1984):
  1. Alice generates a random bit string \(b \in \{0,1\}^n\) (the key) and a random basis string \(\theta \in \{+,\times\}^n\).
  2. Alice sends \(n\) qubits, encoding each bit \(b_i\) in basis \(\theta_i\): basis \(+\) uses \(\{|0\rangle, |1\rangle\}\), basis \(\times\) uses \(\{|+\rangle, |-\rangle\}\).
  3. Bob measures each qubit in a randomly chosen basis \(\theta'_i\). When \(\theta_i = \theta'_i\), he obtains \(b_i\) exactly; otherwise his result is random.
  4. Alice and Bob announce bases over a public channel, discard mismatched positions (~50%), and perform error estimation. High error rate signals eavesdropping.
  5. Privacy amplification and information reconciliation yield a shared secret key.
Theorem (Composable Security of BB84): BB84 is information-theoretically secure against any eavesdropping attack, including coherent attacks across multiple qubits. Security follows from the no-cloning theorem: any eavesdropper who measures qubits necessarily disturbs them (since \(|0\rangle\) and \(|+\rangle\) are non-orthogonal). The security proof (Mayers 1996; Shor-Preskill 2000 via CSS codes) shows that an eavesdropper's information about the final key is exponentially small in the key length, unconditionally.
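Steps 1-4 are easy to simulate classically in the no-eavesdropper case, since measuring in the wrong basis just yields a uniformly random bit (a sketch; the names and fixed seed are illustrative):

```python
import random

def bb84_sift(n, seed=1):
    rng = random.Random(seed)
    bits   = [rng.randrange(2) for _ in range(n)]   # Alice's key bits
    a_bas  = [rng.randrange(2) for _ in range(n)]   # Alice's bases: 0 = '+', 1 = 'x'
    b_bas  = [rng.randrange(2) for _ in range(n)]   # Bob's bases
    # Bob gets the correct bit when bases match, a random bit otherwise
    results = [bits[i] if a_bas[i] == b_bas[i] else rng.randrange(2)
               for i in range(n)]
    # keep only the positions where the bases matched (about half)
    return [(bits[i], results[i]) for i in range(n) if a_bas[i] == b_bas[i]]

sifted = bb84_sift(1000)
# with no eavesdropper the sifted key agrees exactly
assert all(a == b for a, b in sifted)
```

An intercept-resend eavesdropper would flip roughly 25% of the sifted bits, which step 4's error estimation detects.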

Entanglement as a Resource

Entanglement is a purely quantum phenomenon with no classical analogue. It enables communication tasks that are impossible classically.

Theorem (Dense Coding, Bennett-Wiesner 1992): Using 1 pre-shared ebit (maximally entangled qubit pair \(|\Phi^+\rangle = (|00\rangle + |11\rangle)/\sqrt{2}\)) and 1 qubit of communication, Alice can communicate 2 classical bits to Bob. She does so by applying one of four Pauli operations (\(I, X, Y, Z\)) to her half of the ebit, then sending her qubit. Bob measures the resulting 2-qubit state in the Bell basis to recover both bits.
Theorem (Quantum Teleportation, Bennett et al. 1993): Using 1 pre-shared ebit and 2 classical bits of communication, Alice can transmit 1 qubit to Bob. Alice measures her qubit and her half of the ebit in the Bell basis (2 classical bits result), sends the result to Bob, who applies the appropriate Pauli correction to his half of the ebit. The original state is transferred exactly, without any quantum channel from Alice to Bob.

Dense coding and teleportation are dual protocols: one trades quantum communication for classical + entanglement, the other trades entanglement + classical for quantum communication. Together they define the three fundamental resources of quantum information: classical bits (cbits), quantum bits (qubits), and shared entanglement (ebits). The resource inequalities

\[ [q \to q] + [qq] \geq 2\,[c \to c], \qquad [qq] + 2\,[c \to c] \geq [q \to q] \]

(dense coding, teleportation) form the basis of quantum Shannon theory, where the goal is to characterise all achievable tradeoffs among these three resources over a quantum channel.
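Dense coding can be checked with a tiny state-vector simulation on length-4 amplitude vectors (a sketch with real amplitudes; the \(Y\) encoding is represented as \(XZ\), which is \(-iY\) up to global phase):

```python
import math

s = 1 / math.sqrt(2)
BELL = {                      # Bell basis over |q1 q2> amplitude order 00,01,10,11
    "00": [s, 0, 0, s],       # Phi+
    "01": [0, s, s, 0],       # Psi+
    "10": [s, 0, 0, -s],      # Phi-
    "11": [0, s, -s, 0],      # Psi-
}

def pauli_on_first(op, state):
    """Apply I, X, Z, or XZ to the first qubit of a 2-qubit real state vector."""
    a, b, c, d = state
    return {"I": [a, b, c, d], "X": [c, d, a, b],
            "Z": [a, b, -c, -d], "XZ": [-c, -d, a, b]}[op]

def dense_coding(two_bits):
    op = {"00": "I", "01": "X", "10": "Z", "11": "XZ"}[two_bits]
    sent = pauli_on_first(op, BELL["00"])   # Alice acts on her half of Phi+
    # Bob's Bell measurement: the outcome is the Bell state with |overlap| = 1
    return next(m for m, v in BELL.items()
                if abs(sum(x * y for x, y in zip(v, sent))) > 0.99)

for m in ("00", "01", "10", "11"):
    assert dense_coding(m) == m
```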

Schumacher Compression

Theorem (Schumacher Compression, 1995): Let \(\{(p_i, |\psi_i\rangle)\}\) be a pure-state quantum source — a source that emits \(|\psi_i\rangle\) with probability \(p_i\). The minimum number of qubits per source symbol required for reliable compression (fidelity \(\to 1\)) is the von Neumann entropy \(S(\rho)\) of the ensemble, where \(\rho = \sum_i p_i |\psi_i\rangle\langle\psi_i|\).

The proof is a direct quantum analogue of the AEP proof. The typical subspace is the eigenspace of \(\rho^{\otimes n}\) corresponding to eigenvalues in the range \(2^{-n(S(\rho) \pm \varepsilon)}\) — this subspace has dimension \(\approx 2^{nS(\rho)}\) and captures nearly all of the \(n\)-copy state’s support.
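Because the eigenvalue bookkeeping is identical to the classical AEP, the typical-subspace dimension can be tallied with binomial coefficients. A sketch for a qubit source with spectrum \((0.9, 0.1)\) (parameters illustrative):

```python
import math
from math import comb

p, n, eps = 0.9, 200, 0.05
S = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # S(rho) ~ 0.469 bits

def typical_dim():
    """Eigenvalues of rho^{(x)n} are p^(n-k) (1-p)^k with multiplicity C(n, k);
    keep those whose per-copy log-eigenvalue lies within S +/- eps."""
    total = 0
    for k in range(n + 1):
        log_eig = (n - k) * math.log2(p) + k * math.log2(1 - p)
        if abs(-log_eig / n - S) <= eps:
            total += comb(n, k)
    return total

rate = math.log2(typical_dim()) / n   # close to S(rho), far below 1 qubit per copy
```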

Schumacher compression is the quantum source coding theorem: it shows that von Neumann entropy is the quantum analogue of Shannon entropy, not just formally but operationally.

| Classical | Quantum |
| --- | --- |
| Probability distribution \(p\) | Density matrix \(\rho\) |
| Shannon entropy \(H(p)\) | Von Neumann entropy \(S(\rho)\) |
| Typical set \(T_\varepsilon^{(n)}\) | Typical subspace |
| Source coding rate \(\to H(p)\) | Compression rate \(\to S(\rho)\) |
| AEP via the LLN | Quantum AEP via operator norm |

Summary: The Information-Theoretic Method

The course has developed a single powerful idea — entropy as a measure of uncertainty — and deployed it across six distinct areas of mathematics and computer science.

As a compression principle: Shannon entropy is the fundamental limit of lossless compression (source coding theorem). The Lempel-Ziv algorithm achieves this limit universally without knowing the source. Rate-distortion theory extends this to lossy compression, with the Gaussian case giving the elegant water-filling formula.

As a combinatorial counting tool: Shearer’s lemma and the Loomis-Whitney inequality turn entropy into a counting machine. Kahn’s theorem on independent sets, Bregman’s permanent bound, and the Fredman-Komlós bound on perfect hash families are all entropy arguments in disguise.

As a channel coding limit: Mutual information \(I(X;Y)\) quantifies the maximum reliable throughput of a noisy channel. The Gaussian channel capacity \(C = (1/2)\log(1+\text{SNR})\) is the cornerstone of wireless communication. Error exponents, the strong converse, and channel dispersion refine this to finite blocklengths.

As a lower-bound tool in complexity: Fano’s inequality, the data processing inequality, and information complexity give communication lower bounds for Set Disjointness, Index, and Inner Product. These propagate to circuit lower bounds, data structure lower bounds, and via Yannakakis’s theorem, to LP lower bounds.

As a probabilistic method: LLL via entropy compression connects the condition \(ep(d+1) \leq 1\) to the incompressibility of random strings. \(k\)-wise independence and small-bias spaces show how limited randomness can stand in for full independence in probabilistic algorithms.

In quantum systems: Von Neumann entropy, strong subadditivity, and the Holevo bound govern quantum communication. Dense coding and teleportation trade classical bits, qubits, and entanglement at precise exchange rates determined by quantum entropy. Quantum error correction protects against decoherence, with the fault-tolerance threshold theorem making arbitrarily long quantum computation feasible in principle.

The thread connecting all these is the idea that information is a physical, computational, and combinatorial resource — one that can be quantified, manipulated, and traded against other resources with precise exchange rates governed by the inequalities of information theory.
