CO 432: Information Theory and Applications
Vijay Bhattiprolu
Sources and References
Primary textbook — T.M. Cover & J.A. Thomas, Elements of Information Theory (2nd ed., Wiley, 2006)
Supplementary texts —
- V. Guruswami & M. Cheraghchi, Information Theory and Its Applications in Theory of Computation (CMU course notes, cs.cmu.edu/~venkatg)
- S. Arora & B. Barak, Computational Complexity: A Modern Approach (Cambridge, 2009) — for PCP chapter
- M.A. Nielsen & I.L. Chuang, Quantum Computation and Quantum Information (Cambridge, 2000) — for quantum chapter
- Y. Polyanskiy & Y. Wu, Information Theory: From Coding to Learning (MIT 6.441 lecture notes, 2024)
- T. Weissman, Information Theory (Stanford EE 376A/B course notes)
- A. Lapidoth, A Foundation in Digital Communication (2nd ed., Cambridge, 2017)
- I. Csiszár & J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems (2nd ed., Cambridge, 2011)
Online resources — MIT OCW 6.441 Information Theory; Lecture notes from Madhu Sudan (Harvard CS 229r); ETH Zürich Information Theory lecture notes (Lapidoth)
Chapter 1: Entropy and Source Coding
How much information is contained in a random variable? This deceptively simple question, posed and answered by Claude Shannon in his landmark 1948 paper, gave birth to the entire field of information theory. Shannon’s key insight was that information is intimately connected to uncertainty: the more unpredictable a source, the more bits are needed on average to describe its output. The precise quantitative measure he introduced is the Shannon entropy.
The definition has immediate intuitive appeal. For a deterministic variable — one that takes a single value with probability 1 — entropy is zero: there is no uncertainty, and no bits are required to communicate the outcome. At the opposite extreme, a uniform distribution over \(|\mathcal{X}|\) outcomes maximises entropy, yielding \(H(X) = \log |\mathcal{X}|\) bits.
- Non-negativity: \(H(X) \geq 0\), with equality iff \(X\) is deterministic.
- Maximum entropy: \(H(X) \leq \log |\mathcal{X}|\), with equality iff \(X\) is uniform over \(\mathcal{X}\).
- Concavity: \(H(X)\) is a concave function of the probability vector \((p(x))_{x \in \mathcal{X}}\).
Proof sketch. Non-negativity follows because \(p(x) \in [0,1]\) implies \(-\log p(x) \geq 0\). The maximum is achieved at the uniform distribution by Jensen's inequality applied to the concave function \(-t \log t\). Concavity follows from the same Jensen argument: \(H\) is a sum of concave functions of the \(p(x)\).
The binary entropy function \(h(p) = -p \log p - (1-p)\log(1-p)\) for a Bernoulli\((p)\) variable will recur throughout the course. Note \(h(0) = h(1) = 0\) and \(h(1/2) = 1\).
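As a concrete check of these properties, a minimal Python sketch of the entropy and binary entropy functions (in bits):

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability vector p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def h(p):
    """Binary entropy function h(p) of a Bernoulli(p) variable."""
    return entropy([p, 1 - p])
```

For example, `entropy([0.25]*4)` returns \(\log 4 = 2\) bits (maximum entropy for four outcomes), while `h(0.5) = 1` and `h(1.0) = 0`, matching the endpoint values above.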
Prefix-Free Codes and the Kraft Inequality
An operational interpretation of entropy is provided by source coding. We wish to encode outcomes of \(X\) as binary strings, and we require a prefix-free code: no codeword is a prefix of another. This ensures instantaneous decodability.
Proof sketch. Each codeword of length \(\ell\) "consumes" a fraction \(2^{-\ell}\) of the binary tree's leaf capacity. Prefix-freeness requires these fractions to be disjoint subtrees, so the total cannot exceed 1. The converse is constructive: assign codewords greedily from the binary tree.
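Both directions of the Kraft inequality can be sketched in a few lines; `assign_codewords` below is an illustrative helper implementing the greedy tree assignment from the constructive converse (it assumes the given lengths satisfy the Kraft inequality):

```python
def kraft_sum(lengths):
    """Kraft sum 2^{-l_1} + ... + 2^{-l_k} of a list of codeword lengths."""
    return sum(2.0 ** -l for l in lengths)

def assign_codewords(lengths):
    """Greedy converse of Kraft: given lengths with Kraft sum <= 1,
    assign prefix-free binary codewords, shortest lengths first."""
    code = []
    next_val = 0   # next free node, read as an integer at the current depth
    prev_len = 0
    for l in sorted(lengths):
        next_val <<= (l - prev_len)          # descend to depth l
        code.append(format(next_val, f'0{l}b'))
        next_val += 1                        # skip past this codeword's subtree
        prev_len = l
    return code
```

For lengths \((1,2,3,3)\) (Kraft sum exactly 1) this yields `['0', '10', '110', '111']`, and no codeword is a prefix of another.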
Uniquely Decodable Codes and McMillan’s Theorem
A code is uniquely decodable if every sequence of codeword concatenations has a unique parse into codewords. Prefix-free codes are a strict subclass: every prefix-free code is uniquely decodable, but the converse is not true. However, McMillan’s theorem shows that we pay no cost for requiring prefix-freeness.
McMillan’s theorem has an important operational corollary: since uniquely decodable codes and prefix-free codes achieve the same Kraft sum, and the minimum expected length \(\sum p(x)\ell(x)\) subject to the Kraft inequality is the same in both cases, prefix-free codes are optimal among all uniquely decodable codes. There is no advantage to allowing delayed decodability.
The Source Coding Theorem
With the Kraft inequality in hand, we can precisely characterise the minimum expected code length.
Proof sketch. The lower bound follows from the Kraft inequality and the optimisation \(\min \sum p(x) \ell(x)\) subject to \(\sum 2^{-\ell(x)} \leq 1\); Lagrange multipliers give the optimal (possibly non-integer) lengths \(\ell^*(x) = -\log p(x)\). Since lengths must be integers, we take \(\ell(x) = \lceil -\log p(x) \rceil\), giving the upper bound. The block coding argument uses the AEP.
Huffman Codes and Optimality
The Huffman algorithm constructs an optimal prefix-free code by greedily combining the two least probable symbols.
Proof by induction. Base case: two symbols — any prefix-free code assigns lengths 1 and 1, optimal since \(H(X) \leq 1\). Inductive step: let \(x^*\) and \(y^*\) be the two symbols with smallest probabilities. Key structural lemma: there exists an optimal code in which \(x^*\) and \(y^*\) are siblings (have the same parent in the code tree) and have the longest codeword length. Given this, replacing the sibling pair by a new merged symbol \(z\) with \(p(z) = p(x^*) + p(y^*)\) gives a reduced problem. Any optimal code for the reduced problem, expanded back by splitting \(z\) into \(x^*, y^*\) and appending 0 and 1, gives an optimal code for the original. The Huffman algorithm does exactly this reduction.
The expected length of a Huffman code satisfies \(H(X) \leq L_{\text{Huff}} < H(X) + 1\). For a source with dyadic probabilities (all \(p(x) = 2^{-k}\) for integers \(k\)), the Huffman code achieves \(L = H(X)\) exactly.
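The greedy merge can be sketched as follows, tracking only the codeword lengths rather than the explicit tree (a merge deepens every leaf in the merged subtrees by one):

```python
import heapq

def huffman_lengths(probs):
    """Codeword lengths of a Huffman code for the given probabilities."""
    # Heap entries: (probability, tiebreak counter, symbol indices in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)      # two least probable subtrees
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:                    # merging deepens all their leaves
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths
```

On the dyadic source \((1/2, 1/4, 1/8, 1/8)\) this produces lengths \((1,2,3,3)\) with expected length \(1.75 = H(X)\), achieving the entropy exactly as claimed.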
Asymptotic Equipartition Property
The AEP is the law of large numbers for information theory: it says that for large \(n\), the sample \((X_1, \ldots, X_n)\) almost surely has log-probability close to \(nH(X)\).
This follows directly from the (weak) law of large numbers applied to the i.i.d. sequence \(-\log p(X_i)\), whose mean is \(H(X)\). The AEP has a dramatic structural consequence: define the typical set
\[ T_\varepsilon^{(n)} = \left\{ (x_1, \ldots, x_n) : \left| -\frac{1}{n} \log p(x_1, \ldots, x_n) - H(X) \right| \leq \varepsilon \right\}. \]
- \(P\left( (X_1, \ldots, X_n) \in T_\varepsilon^{(n)} \right) \geq 1 - \delta\) for large enough \(n\).
- \(\left| T_\varepsilon^{(n)} \right| \leq 2^{n(H(X) + \varepsilon)}\).
- \(\left| T_\varepsilon^{(n)} \right| \geq (1 - \delta) 2^{n(H(X) - \varepsilon)}\) for large enough \(n\).
The typical set has size roughly \(2^{nH(X)}\) — exponentially smaller than \(|\mathcal{X}|^n\) if \(H(X) < \log |\mathcal{X}|\) — yet it captures nearly all the probability mass. This is the geometric heart of Shannon’s theory: with high probability, nature draws from a tiny typical set, and we only need \(nH(X)\) bits to index into it.
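A small simulation makes the concentration concrete (the parameters below are illustrative, not canonical): for a Bernoulli(\(p\)) source, the empirical normalised log-probability of almost every length-\(n\) sample falls within \(\varepsilon\) of \(H(X)\).

```python
import math, random

def aep_demo(p=0.2, n=2000, eps=0.05, trials=500, seed=0):
    """Estimate P(sample is eps-typical) for i.i.d. Bernoulli(p)^n."""
    rng = random.Random(seed)
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    hits = 0
    for _ in range(trials):
        k = sum(rng.random() < p for _ in range(n))   # number of 1s drawn
        # -(1/n) log p(x^n) depends only on the count k
        nll = -(k * math.log2(p) + (n - k) * math.log2(1 - p)) / n
        hits += abs(nll - H) <= eps
    return hits / trials
```

With these parameters nearly all trials land in the typical set, consistent with the first AEP property above.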
Arithmetic Coding
Arithmetic coding implements the entropy bound tightly, approaching the ideal codeword length \(-\log p(x_1^n)\) for each specific sequence. The encoder maintains an interval \([l, r) \subseteq [0,1)\), initially \([0,1)\). Upon seeing symbol \(x_t\), it shrinks the interval to the sub-interval corresponding to \(x_t\) in the cumulative distribution function of \(X_t | x_1^{t-1}\). After \(n\) symbols, the interval has width \(p(x_1^n)\); the encoder outputs the shortest binary description of any number in this interval, which requires \(\lceil -\log p(x_1^n) \rceil + 1\) bits. Arithmetic coding is thus an implementation of the Shannon bound for sequences, using only \(H(X_1, \ldots, X_n) + 2\) bits on average — no block-length overhead. It is particularly powerful for adaptive and universal sources.
Lempel-Ziv Compression and Universality
A universal compressor achieves the entropy rate of any stationary ergodic source without knowing the source distribution. The LZ78 algorithm (Ziv-Lempel, 1978) parses the input string greedily: maintain a dictionary of previously seen phrases; read the input until a string is seen that is not in the dictionary; add it as a new phrase. LZ77 (1977) uses a sliding window instead of an explicit dictionary. The LZW algorithm (Welch, 1984) is a dictionary-based refinement used in GIF compression.
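The greedy parse translates directly into code; the handling of a trailing partial phrase below is one common convention, not the only one:

```python
def lz78_parse(s):
    """Greedy LZ78 parse: returns a list of (dictionary index, new symbol),
    where index 0 denotes the empty phrase."""
    dictionary = {'': 0}
    phrases = []
    phrase = ''
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch                  # keep extending the current phrase
        else:
            phrases.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ''
    if phrase:                            # trailing phrase already in dictionary
        phrases.append((dictionary[phrase[:-1]], phrase[-1]))
    return phrases
```

For example, `lz78_parse("abababab")` parses the string as `a | b | ab | aba | b`, i.e. `[(0,'a'), (0,'b'), (1,'b'), (3,'a'), (0,'b')]`; repetitive strings produce long phrases and hence few dictionary entries.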
The proof proceeds by showing that the number of phrases \(c(n)\) in a length-\(n\) parse satisfies \(c(n) = O(n/\log n)\); each phrase is encoded by a dictionary index plus one fresh symbol, about \(\log c(n)\) bits per phrase, for a total of roughly \(c(n)\log c(n)\) bits. A counting argument (Ziv's inequality) then shows \(c(n)\log c(n)/n \to \bar{H}\) bits per symbol for stationary ergodic sources. The remarkable aspect is that LZ78 is entirely blind to the source — it adapts purely from the observed data.
Entropy Rate of Stochastic Processes
For non-i.i.d. sources, the relevant quantity is the entropy rate.
For a stationary Markov chain with transition matrix \(P\) and stationary distribution \(\pi\), the entropy rate is
\[ \bar{H} = -\sum_{i,j} \pi_i P_{ij} \log P_{ij} = \sum_i \pi_i H(X_{t+1} | X_t = i). \]
This is the weighted average of the entropies of each row of \(P\). For an i.i.d. source, it reduces to \(H(X)\), recovering the classical definition.
The limit \(\bar{H}(X) = \lim_{n} H(X_n | X_{n-1}, \ldots, X_1)\) exists for any stationary process because \(H(X_n | X_{n-1}, \ldots, X_1)\) is monotone non-increasing in \(n\) (conditioning reduces entropy) and non-negative, so it converges. The two definitions agree because Cesàro means of a convergent sequence converge to the same limit. The entropy rate is the fundamental quantity for universal data compression and for characterising the “complexity” of a dynamical system — it is closely related to the topological entropy of the underlying shift.
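The row-averaged formula above translates directly into code; a minimal sketch:

```python
import math

def markov_entropy_rate(P, pi):
    """Entropy rate (bits/symbol) of a stationary Markov chain with
    transition matrix P (list of rows) and stationary distribution pi."""
    return -sum(pi[i] * P[i][j] * math.log2(P[i][j])
                for i in range(len(P)) for j in range(len(P))
                if P[i][j] > 0)
```

For the symmetric two-state chain with flip probability \(a\), the stationary distribution is uniform and the rate equals the binary entropy \(h(a)\); for a chain whose rows are identical (an i.i.d. source), the rate reduces to \(H(X)\), as stated above.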
Chapter 2: Conditional Entropy, Chain Rules, and Combinatorial Applications
Entropy becomes far richer when we consider multiple random variables. The key quantities are the joint entropy \(H(X,Y)\) and the conditional entropy \(H(X|Y)\), which measure the uncertainty of a pair and the residual uncertainty of \(X\) given \(Y\), respectively.
Proof sketch. \(H(X,Y) = H(X) + H(Y|X)\), and \(H(Y|X) \leq H(Y)\) since conditioning cannot increase entropy. Equality holds iff \(p(x,y) = p(x)p(y)\) for all \(x, y\).
This simple but powerful identity decomposes joint uncertainty into a sequence of conditional uncertainties. Notice that each term \(H(X_i | X_1, \ldots, X_{i-1}) \leq H(X_i)\), so subadditivity follows immediately.
Tail Bounds via Entropy
Entropy provides elegant large-deviation bounds for the binomial distribution. If \(B \sim \mathrm{Bin}(n, p)\), then for \(\alpha < p\),
\[ P(B \leq \alpha n) \leq 2^{-n D(\alpha \| p)}, \]
where \(D(\alpha \| p) = \alpha \log(\alpha/p) + (1-\alpha)\log((1-\alpha)/(1-p))\) is the KL divergence (introduced in Chapter 3). This will be derived once we have the full machinery of KL divergence.
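A quick numerical sanity check of the bound against the exact binomial tail (the helper names here are illustrative):

```python
import math
from math import comb

def kl_bernoulli(a, p):
    """Binary KL divergence D(a || p) in bits."""
    terms = 0.0
    if a > 0:
        terms += a * math.log2(a / p)
    if a < 1:
        terms += (1 - a) * math.log2((1 - a) / (1 - p))
    return terms

def binom_tail(n, p, alpha):
    """Exact lower-tail probability P(Bin(n,p) <= alpha*n)."""
    k_max = math.floor(alpha * n)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_max + 1))
```

For instance, with \(n = 100\), \(p = 1/2\), \(\alpha = 0.3\), the exact tail is a few times \(10^{-5}\) while the entropy bound \(2^{-100\,D(0.3\|0.5)}\) is a few times \(10^{-4}\) — a valid, slightly loose upper bound, as expected.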
Bregman’s Theorem
Proof sketch. Consider a uniformly random perfect matching (permutation \(\sigma\)) with \(A_{i,\sigma(i)} = 1\) for all \(i\), so that \(H(\sigma) = \log \mathrm{perm}(A)\). Reveal the rows in a uniformly random order and bound \(H(\sigma)\) by the chain rule: when row \(i\) is revealed, the number of columns still available to it, averaged over the random order, behaves like a uniform draw from \(\{1, \ldots, r_i\}\), giving \(H(\sigma) \leq \sum_i \frac{1}{r_i} \log(r_i!)\). Exponentiating yields \(\mathrm{perm}(A) \leq \prod_i (r_i!)^{1/r_i}\).
Bregman’s theorem is a striking example of entropy as a purely combinatorial tool, with no probability or coding in the conclusion.
Shearer’s Lemma
Shearer’s lemma is a generalisation of subadditivity and is extraordinarily useful for counting. As a canonical application: a graph with \(m\) edges contains at most \((2m)^{3/2}/6\) triangles. Take \((X_1, X_2, X_3)\) uniform over ordered triangles and apply Shearer’s lemma with the three pair-projections \(\{1,2\}, \{2,3\}, \{1,3\}\); each projection lands in the set of ordered edges, of size at most \(2m\), so \(2\log(6N) \leq 3\log(2m)\). Similar arguments yield bounds on graph homomorphism counts and an entropy proof of (a weak form of) the Kruskal–Katona theorem.
Loomis-Whitney Inequality
The Loomis-Whitney inequality is a higher-dimensional version of Shearer’s lemma with a beautiful geometric interpretation. For a finite set \(S \subseteq \mathbb{Z}^d\), let \(\pi_k(S)\) denote the projection of \(S\) onto the coordinate hyperplane orthogonal to the \(k\)-th axis; the inequality states \(|S|^{d-1} \leq \prod_{k=1}^d |\pi_k(S)|\).
Proof via Shearer's Lemma. Let \((X_1, \ldots, X_d)\) be uniform over \(S\). Apply Shearer's lemma with the \(d\) index sets \(F_k = [d] \setminus \{k\}\), each of size \(d-1\); each coordinate \(i\) appears in \(d-1\) of them, so \(\sum_k H(X_{F_k}) \geq (d-1) H(X_1, \ldots, X_d)\). Since \(X_{F_k}\) takes values in \(\pi_k(S)\), we have \(H(X_{F_k}) \leq \log |\pi_k(S)|\), while \(H(X_1, \ldots, X_d) = \log |S|\), and the inequality follows.
Applications include: bounding the number of incidences between points and hyperplanes; sumset estimates in additive combinatorics; and establishing geometric measure inequalities such as edge-isoperimetric inequalities for finite sets.
Kahn’s Theorem on Independent Sets
The entropy method also resolves a conjecture about the number of independent sets in regular graphs.
Proof sketch via entropy. Let \(I\) be a uniformly random independent set of the \(d\)-regular bipartite graph \(G\) on \(n\) vertices. For each edge \(e = (u,v)\), consider \(H(I_u, I_v)\) — the entropy of the pair of bits indicating whether \(u\) and \(v\) are in \(I\). Since \(I\) is an independent set, \((I_u, I_v) \neq (1,1)\), so \(H(I_u, I_v) \leq \log 3\). A more careful argument — conditioning on one side of the bipartition and applying Shearer's lemma over the \(nd/2\) edges, each vertex being covered \(d\) times — gives \(\log i(G) = H(I) \leq \frac{n}{2d} \log(2^{d+1}-1)\), i.e. \(i(G) \leq (2^{d+1}-1)^{n/(2d)}\).
Chapter 3: Relative Entropy and Its Consequences
While Shannon entropy measures uncertainty about a single distribution, relative entropy (or KL divergence) measures the cost of mistaking one distribution for another. It is arguably the most fundamental information-theoretic quantity.
Proof sketch. Apply Jensen's inequality to the strictly convex function \(-\ln t\): \(-\sum p(x) \ln(q(x)/p(x)) \geq -\ln \sum p(x) \cdot (q(x)/p(x)) = 0\). Equality holds iff \(q(x)/p(x)\) is constant, i.e., \(P = Q\).
Crucially, \(D(P \| Q)\) is not symmetric and is not a metric. Nevertheless, it is the natural notion of “directed distance” between distributions.
f-Divergences
KL divergence is one member of a broad family of divergences.
Special cases: KL divergence (\(f(t) = t \log t\)); total variation (\(f(t) = |t-1|/2\)); \(\chi^2\)-divergence (\(f(t) = (t-1)^2\)); Hellinger distance squared (\(f(t) = (\sqrt{t}-1)^2\)); reverse KL (\(f(t) = -\log t\)). All f-divergences inherit the data processing inequality: if \(Y\) is a stochastic function of \(X\), then \(D_f(P_Y \| Q_Y) \leq D_f(P_X \| Q_X)\). This follows from Jensen’s inequality applied to the convex \(f\) and the transition kernel.
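All of these special cases share one evaluation template; a minimal sketch on finite alphabets (assuming \(q(x) > 0\) wherever \(p(x) > 0\)):

```python
import math

def f_divergence(P, Q, f):
    """D_f(P||Q) = sum_x q(x) f(p(x)/q(x)) over a finite alphabet."""
    return sum(q * f(p / q) for p, q in zip(P, Q) if q > 0)

# Specialisations via the generator function f:
kl   = lambda P, Q: f_divergence(P, Q, lambda t: t * math.log2(t) if t > 0 else 0.0)
tv   = lambda P, Q: f_divergence(P, Q, lambda t: abs(t - 1) / 2)
chi2 = lambda P, Q: f_divergence(P, Q, lambda t: (t - 1) ** 2)
```

For \(P = (1/2, 1/2)\) and \(Q = (9/10, 1/10)\), `tv(P, Q)` returns \(0.4\), agreeing with the direct formula \(\frac{1}{2}\sum_x |p(x)-q(x)|\), and every divergence vanishes at \(P = Q\) since \(f(1) = 0\).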
Chain Rule and Data Processing Inequality for KL
The DPI encapsulates a fundamental principle: processing cannot create information.
Variational Representation of KL Divergence
This variational formula has profound consequences. It connects KL divergence to free energy in statistical physics: \(D(P \| Q) = F(P) - F(Q)\) where \(F\) is the free energy functional. It also underpins the generalised information-theoretic framework for large deviations (Cramér’s theorem) and is the foundation of importance sampling and variational inference in machine learning.
Stein’s Lemma and Hypothesis Testing
One of the most operationally significant interpretations of KL divergence is in hypothesis testing. We observe \(n\) i.i.d. samples from either \(P\) or \(Q\), and must decide which distribution generated them.
The optimal test is the Neyman-Pearson likelihood ratio test: reject \(H_0\) iff \(\frac{1}{n}\log \frac{p(X_1^n)}{q(X_1^n)} > \tau\) for a threshold \(\tau\). The exponent \(D(P \| Q)\) is the “price of the wrong model” — using model \(Q\) when \(P\) is true forces an exponentially higher type-II error. Stein’s lemma thus gives KL divergence a direct operational meaning as a distinguishability rate.
Fisher Information and the Cramér-Rao Bound
For parametric families \(\{p(\cdot;\theta)\}_{\theta \in \Theta}\), Fisher information quantifies how much information an observation carries about the parameter.
The bound is achieved asymptotically by the maximum likelihood estimator. Fisher information is additive for i.i.d. samples: \(I_n(\theta) = n \cdot I(\theta)\), exactly paralleling the additivity of entropy. The link to KL divergence is:
\[ D(p(\cdot;\theta+\varepsilon) \| p(\cdot;\theta)) = \frac{\varepsilon^2}{2} I(\theta) + O(\varepsilon^3). \]
KL divergence is thus the “second-order Taylor approximation” of the Fisher information geometry of the statistical manifold.
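This second-order relationship is easy to verify numerically for the Bernoulli family, where \(I(\theta) = 1/(\theta(1-\theta))\); we work in nats so the identity holds without unit conversions:

```python
import math

def kl_bern_nats(a, b):
    """KL divergence D(Bern(a) || Bern(b)) in nats, for a, b in (0,1)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def fisher_bern(theta):
    """Fisher information of the Bernoulli family at theta."""
    return 1.0 / (theta * (1.0 - theta))
```

Taking \(\theta = 0.3\) and \(\varepsilon = 10^{-3}\), the exact divergence \(D(\mathrm{Bern}(\theta+\varepsilon)\|\mathrm{Bern}(\theta))\) agrees with \(\frac{\varepsilon^2}{2} I(\theta)\) to well under one percent, with the residual shrinking like \(\varepsilon^3\).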
Chernoff–Hoeffding Bound
Entropy Power Inequality
The entropy power inequality (EPI) is the most powerful known inequality governing entropy of sums of random variables, with far-reaching consequences for Gaussian channels.
For one-dimensional variables: \(2^{2H(X+Y)} \geq 2^{2H(X)} + 2^{2H(Y)}\). Shannon conjectured this in 1948; Stam proved it in 1959 using de Bruijn’s identity: if \(X_t = X + \sqrt{t}\, Z\) where \(Z \sim \mathcal{N}(0,1)\) is independent of \(X\), then \(\frac{d}{dt} H(X_t) = \frac{1}{2} J(X_t)\), where \(J\) denotes the Fisher information with respect to location. The EPI then follows from the Fisher information inequality for sums. Gaussians are the extremal distributions: \(H(X+Y)\) is minimised (for given \(H(X)\) and \(H(Y)\)) when both are Gaussian with proportional variances. The EPI is used to prove the capacity of the Gaussian broadcast channel.
Joint Convexity and Pinsker’s Inequality
Pinsker’s inequality is the bridge between the analytically tractable KL divergence and the operationally natural total variation distance. Together with Stein’s lemma and Fisher information, it completes a rich geometric picture of the space of probability distributions.
Chapter 4: Mutual Information
The mutual information between two random variables is the reduction in uncertainty about one achieved by observing the other. It synthesises all the quantities we have defined.
Since KL divergence is non-negative, \(I(X;Y) \geq 0\), with equality iff \(X\) and \(Y\) are independent. Mutual information is symmetric: \(I(X;Y) = I(Y;X)\).
Data Processing Inequality for Mutual Information
Processing \(Y\) to obtain \(Z\) cannot increase the information that \(Z\) carries about \(X\). This principle is fundamental to information-theoretic security and to lower bounds in communication complexity.
Fano’s Inequality
Fano’s inequality quantifies the necessary imprecision of any estimator.
Proof sketch. Introduce the error indicator \(E = \mathbf{1}[\hat{X} \neq X]\). By the chain rule, \(H(E, X | Y) = H(X | Y)\) (since \(E\) is a deterministic function of \((X, \hat{X})\)). Also \(H(E, X | Y) = H(E|Y) + H(X | E, Y) \leq h(P_e) + P_e \log(|\mathcal{X}|-1)\).
Fano’s inequality is the cornerstone of the converse proof in channel coding: it shows that if the mutual information is low, any decoder must err with bounded probability.
Network Information Theory Preview
Mutual information extends naturally to multi-user settings. Two key theorems illustrate this.
Multiple access channel (MAC): Two senders, each independently choosing inputs \(X_1\) and \(X_2\), transmit to a common receiver who observes \(Y\). The capacity region is all rate pairs \((R_1, R_2)\) satisfying
\[ R_1 \leq I(X_1; Y \mid X_2), \quad R_2 \leq I(X_2; Y \mid X_1), \quad R_1 + R_2 \leq I(X_1, X_2; Y), \]for some product input distribution \(P_{X_1} \times P_{X_2}\). The three constraints correspond to: sender 1’s rate is bounded by what \(Y\) tells about \(X_1\) given knowledge of \(X_2\); symmetrically for sender 2; and the sum rate is bounded by the total information the channel can convey.
Slepian-Wolf theorem (distributed source coding): Two correlated sources \(X\) and \(Y\) are encoded separately and decoded jointly. Despite no communication between encoders, reliable reconstruction requires
\[ R_X \geq H(X \mid Y), \quad R_Y \geq H(Y \mid X), \quad R_X + R_Y \geq H(X, Y). \]
The surprising content is that the sum rate \(R_X + R_Y\) need not exceed \(H(X, Y)\): there is no penalty for distributed encoding, provided that joint decoding is allowed. The achievability proof uses random binning, which is a direct application of the joint typicality framework.
Rate-Distortion Theory
Source coding with a fidelity criterion is governed by the rate-distortion function.
For a Bernoulli(\(1/2\)) source with Hamming distortion, \(R(D) = 1 - h(D)\) for \(D \leq 1/2\), and \(R(D) = 0\) for \(D \geq 1/2\). For a Gaussian source \(X \sim \mathcal{N}(0, \sigma^2)\) with MSE distortion, the rate-distortion function is
\[ R(D) = \max\!\left(0,\; \frac{1}{2}\log \frac{\sigma^2}{D}\right). \]
Equivalently \(D(R) = \sigma^2 2^{-2R}\): each additional bit of rate cuts the achievable distortion by a factor of four (6 dB per bit). For a vector Gaussian source with covariance matrix \(\Sigma\), the rate-distortion function under MSE distortion generalises to
\[ R(D) = \frac{1}{2} \sum_i \max\!\left(0, \log \frac{\lambda_i}{\mu}\right), \]
where \(\lambda_i\) are the eigenvalues of \(\Sigma\) and the water level \(\mu\) is chosen so that \(\sum_i \min(\lambda_i, \mu) = D\). This is the reverse water-filling formula: components with variance below the water level \(\mu\) are not described at all, and each remaining component is described down to distortion \(\mu\). The same water-filling solution appears (in dual form) in Gaussian channel capacity with power allocation across parallel channels.
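The water level \(\mu\) has no closed form in general, but bisection on the constraint \(\sum_i \min(\lambda_i, \mu) = D\) suffices; a minimal sketch:

```python
import math

def gaussian_rd(eigs, D):
    """Reverse water-filling: rate (bits) for a Gaussian vector source with
    covariance eigenvalues `eigs` at total MSE distortion D <= sum(eigs)."""
    lo, hi = 0.0, max(eigs)
    for _ in range(200):                      # bisect on the water level mu
        mu = (lo + hi) / 2
        if sum(min(l, mu) for l in eigs) < D:
            lo = mu
        else:
            hi = mu
    mu = (lo + hi) / 2
    return sum(0.5 * math.log2(l / mu) for l in eigs if l > mu)
```

In the scalar case \(\sigma^2 = 4\), \(D = 1\), this recovers \(R = \frac{1}{2}\log 4 = 1\) bit; for eigenvalues \((4, 1)\) with \(D = 2\), the water level settles at \(\mu = 1\), the weak component is dropped entirely, and only the strong one is described.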
Strong Data Processing Inequality
For a fixed channel \(p(y|x)\), one might ask: how much does passing a correlated source through the channel shrink mutual information? The strong data processing inequality (SDPI) captures this contraction.
The contraction coefficient for the BSC(\(p\)) under KL divergence equals \((1-2p)^2\), so passing a signal through a cascade of \(n\) independent BSC(\(p\)) channels shrinks divergence by a factor \((1-2p)^{2n}\) — an exponential decay in distinguishability. This connects to hypercontractivity (the Bonami-Beckner inequality) and noise stability of Boolean functions, with applications in hardness of approximation (Håstad’s long-code tests) and social choice theory (Arrow’s theorem via information theory).
Perfect Secrecy and Key Size
This is Shannon’s impossibility theorem: the one-time pad is optimal, and no cipher can compress the key below the message entropy while maintaining perfect secrecy.
Lower Bound for the Index Function
The Index function is the communication complexity problem where Alice holds \(x \in \{0,1\}^n\) and Bob holds \(i \in [n]\), and they must compute \(x_i\).
Proof sketch. Let \(M\) be Alice's message and let \(X = (X_1, \ldots, X_n)\) be uniform. Success with error \(P_e\) on a uniform index \(I\) implies, via Fano's inequality, \(I(X_i; M) \geq 1 - h(P_e)\) on average over \(i\). Since the \(X_i\) are independent, \(|M| \geq I(X; M) \geq \sum_{i} I(X_i; M) \geq n(1 - h(P_e)) = \Omega(n)\).
Chapter 5: Communication Complexity
Communication complexity, introduced by Yao in 1979, studies how many bits two parties must exchange to jointly compute a function. Information theory provides matching lower bounds by quantifying how much “useful” information must actually be transmitted.
Information Complexity
Information complexity refines communication complexity by measuring the actual information content of the transcript, rather than its length.
Information complexity is always at most communication complexity: \(\mathrm{IC}_\mu(f) \leq \mathrm{CC}(f)\). The key result is that this gap can be closed asymptotically.
This direct sum theorem is one of the most important results in communication complexity. It says that computing \(n\) independent instances of \(f\) requires \(n\) times as much information as a single instance — information is not “shareable” across independent instances.
Public vs. Private Coins
The key insight: a random string of length \(O(\log n)\) suffices to index into a small set of “good” shared random strings. This separates public-coin from private-coin randomised complexity by only a logarithmic additive gap — far smaller than the gap between randomised and deterministic complexity.
Equality function: The equality function \(\mathrm{EQ}_n(x,y) = \mathbf{1}[x=y]\) has deterministic complexity \(\mathrm{D(EQ)} = n+1\) (one party must essentially send their full input), but randomised public-coin complexity \(R^{\mathrm{pub}}(\mathrm{EQ}) = O(1)\): using the shared randomness, both parties pick a random hash \(h\); Alice sends \(h(x)\), and Bob checks whether \(h(y) = h(x)\). This is an exponential gap. On the other hand, the inner product function \(\mathrm{IP}_n(x,y) = \sum_i x_i y_i \pmod{2}\) satisfies \(R^{\mathrm{pub}}(\mathrm{IP}) = \Omega(n)\) — public randomness does not help here.
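One standard instantiation of the public-coin protocol uses random parities as the hash; a sketch, where the fixed seed stands in for the shared random tape visible to both parties:

```python
import random

def eq_protocol(x, y, reps=20, seed=42):
    """Public-coin protocol for EQ: compare <r,x> and <r,y> mod 2 for `reps`
    shared random vectors r; one-sided error 2^{-reps} when x != y."""
    rng = random.Random(seed)          # shared randomness: both parties see r
    n = len(x)
    for _ in range(reps):
        r = [rng.randrange(2) for _ in range(n)]
        if sum(ri * xi for ri, xi in zip(r, x)) % 2 != \
           sum(ri * yi for ri, yi in zip(r, y)) % 2:
            return False               # a parity differs, so x != y for sure
    return True                        # all parities agreed
```

If \(x = y\) every parity agrees and the protocol always accepts; if \(x \neq y\), each random parity distinguishes them with probability \(1/2\), so the error is \(2^{-\mathrm{reps}}\) while Alice sends only `reps` bits, independent of \(n\).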
Set Disjointness
The Set Disjointness problem DISJ\(_n\): Alice has \(x \subseteq [n]\), Bob has \(y \subseteq [n]\), output 1 iff \(x \cap y = \emptyset\).
The information-complexity proof proceeds via a hard distribution: in each coordinate \(i\), the pair \((x_i, y_i)\) is drawn from \(\{(0,0),(0,1),(1,0)\}\), so the sets are disjoint except possibly at a single secretly planted coordinate \(j\), where \((x_j, y_j)\) may be \((1,1)\). Any protocol that decides DISJ must, on this distribution, essentially determine the bit of intersection at every coordinate. The information cost per coordinate is \(\Omega(1)\), and a direct sum argument over the \(n\) coordinates gives the \(\Omega(n)\) lower bound.
Corruption bound: An alternative technique for randomised lower bounds. The corruption bound for DISJ states that any protocol with small error on the hard distribution above must have communication complexity \(\Omega(n)\). The corruption bound and the discrepancy method are related lower-bound techniques; together they give tight bounds for functions with specific combinatorial structure.
Cell-Probe Model and Data Structures
Chapter 6: Channel Capacity and the Noisy Coding Theorem
We now turn from compression to the dual problem: reliable transmission over a noisy channel. Shannon’s channel coding theorem — perhaps the deepest result in information theory — asserts that reliable communication is possible at any rate below the channel capacity, and impossible above it.
Gaussian Channel Capacity
One of the most practically significant results in information theory is the capacity of the additive white Gaussian noise channel.
Proof sketch (achievability). Use a Gaussian codebook: each codeword \(x^n(m)\) is drawn i.i.d. from \(\mathcal{N}(0, P)\). By the entropy power inequality and the fact that Gaussian maximises entropy under power constraint, the jointly typical decoder achieves rate up to \(C\). The noise variance \(N\) limits how finely we can pack codewords in an \(n\)-dimensional sphere of radius \(\sqrt{nP}\); the received vector lies in a sphere of radius \(\sqrt{n(P+N)}\), and each noise ball has radius \(\approx \sqrt{nN}\). The number of disjoint noise balls is \(\approx (1 + P/N)^{n/2}\), giving rate \((1/2)\log(1+P/N)\).
Proof sketch (converse). By the data processing inequality and the fact that Gaussian maximises entropy under variance constraint: \(I(X;Y) = H(Y) - H(Z) \leq (1/2)\log(2\pi e(P+N)) - (1/2)\log(2\pi e N) = (1/2)\log(1 + P/N)\).
The signal-to-noise ratio (SNR) \(= P/N\) enters logarithmically. This means: at high SNR, doubling the power buys only half a bit of rate per (real) channel use. Improving capacity requires exponentially more power — a fundamental constraint on all communication systems.
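A quick numerical illustration of the logarithmic power dependence for the real-channel formula \(C = \frac{1}{2}\log(1 + P/N)\):

```python
import math

def awgn_capacity(snr):
    """Capacity (bits per real channel use) of an AWGN channel at SNR = P/N."""
    return 0.5 * math.log2(1.0 + snr)
```

At SNR \(= 1\) the capacity is exactly \(0.5\) bits per use, and the gain from doubling the power, \(C(2\,\mathrm{SNR}) - C(\mathrm{SNR}) = \frac{1}{2}\log_2\frac{1+2\,\mathrm{SNR}}{1+\mathrm{SNR}}\), is always below half a bit, approaching it only as SNR \(\to \infty\).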
Wideband Capacity and Bandwidth-Power Tradeoff
For a band-limited AWGN channel with bandwidth \(W\) Hz and noise power spectral density \(N_0/2\),
\[ C(W) = W \log\!\left(1 + \frac{P}{N_0 W}\right) \text{ bits per second.} \]
As \(W \to \infty\), \(C(W) \to P / (N_0 \ln 2)\) bits per second — the infinite-bandwidth limit. This shows that bandwidth is not the limiting factor at low SNR; power is. In the wideband regime, each joule of energy delivers \(1/(N_0 \ln 2)\) bits of capacity — a universal constant. Satellite and deep-space communications operate in this regime.
Joint Typicality and the Channel Coding Theorem
The joint typicality lemma: if \(\tilde{X}^n\) and \(\tilde{Y}^n\) are drawn independently from the marginals \(P_X^n\) and \(P_Y^n\), then \(P((\tilde{X}^n, \tilde{Y}^n) \text{ jointly typical}) \leq 2^{-n(I(X;Y) - 3\varepsilon)}\).
- Achievability: For any rate \(R < C\) and \(\varepsilon > 0\), there exist codes with error \(\to 0\).
- Converse: For any rate \(R > C\), any code has error \(\to 1\).
Error Exponents and the Sphere-Packing Bound
The channel coding theorem says error goes to zero for \(R < C\), but how fast? The answer is given by the error exponent.
The random coding exponent \(E_r(R)\) (Gallager 1965) is achieved by random codes and gives an achievable exponent:
\[ E_r(R) = \max_{\rho \in [0,1]} \max_{P_X} \left[ E_0(\rho, P_X) - \rho R \right], \]
where \(E_0(\rho, P_X) = -\log \sum_y \left(\sum_x P_X(x) P_{Y|X}(y|x)^{1/(1+\rho)}\right)^{1+\rho}\) is Gallager’s \(E_0\) function. The sphere-packing (Fano) bound \(E_{\mathrm{sp}}(R) \geq E_r(R)\) gives a matching converse for rates near capacity, showing \(E_r(R)\) is exact for \(R\) close to \(C\). For rates near zero, the expurgation bound improves upon random coding.
Feedback and the Strong Converse
Proof. With feedback, the encoder's input \(X_i\) can depend on the message \(M\) and past outputs \(Y_1, \ldots, Y_{i-1}\). Nevertheless, \(I(M; Y^n) \leq \sum_i I(X_i; Y_i) \leq nC\) by the data processing inequality and the memoryless channel structure. Fano's inequality then gives the same capacity bound.
This is one of Shannon’s most surprising theorems. While feedback dramatically simplifies coding (enabling sequential transmission and adaptive schemes), it provides no capacity gain for point-to-point memoryless channels. The situation is very different for multi-user channels: feedback can strictly increase capacity of broadcast channels and MACs.
\[ \log M^*(n, \varepsilon) = nC - \sqrt{nV}\, Q^{-1}(\varepsilon) + O(\log n), \]
where \(M^*(n, \varepsilon)\) is the maximum code size achievable at blocklength \(n\) and error probability \(\varepsilon\), \(V\) is the channel dispersion, and \(Q^{-1}\) is the inverse of the Gaussian tail function \(Q(x) = P(\mathcal{N}(0,1) > x)\).
This second-order or finite-blocklength result (Polyanskiy-Poor-Verdú 2010) quantifies the penalty for using a finite-length code. The \(1/\sqrt{n}\) gap from capacity is governed by the channel dispersion \(V\), in direct analogy with the central limit theorem. The BSC has dispersion \(V = p(1-p)(\log((1-p)/p))^2\). For the AWGN channel with SNR \(= P/N\), \(V = \frac{\mathrm{SNR}(\mathrm{SNR}+2)}{2(\mathrm{SNR}+1)^2}\log^2 e\).
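The normal approximation \(nC - \sqrt{nV}\,Q^{-1}(\varepsilon)\) is easy to evaluate for the BSC using the dispersion formula \(V = p(1-p)\log_2^2((1-p)/p)\); the sketch below drops the \(O(\log n)\) term:

```python
import math
from statistics import NormalDist

def bsc_normal_approx(n, p, eps):
    """Normal approximation log2 M*(n, eps) ~ nC - sqrt(nV) Q^{-1}(eps)
    for the BSC(p), ignoring the O(log n) term."""
    C = 1 - (-p * math.log2(p) - (1 - p) * math.log2(1 - p))   # capacity
    V = p * (1 - p) * (math.log2((1 - p) / p)) ** 2            # dispersion
    qinv = NormalDist().inv_cdf(1 - eps)                       # Q^{-1}(eps)
    return n * C - math.sqrt(n * V) * qinv
```

For the BSC(\(0.11\)) (capacity \(\approx 0.5\)) at blocklength \(n = 1000\) and \(\varepsilon = 10^{-3}\), this gives roughly \(408\) bits — a back-off of about \(18\%\) from the asymptotic \(nC \approx 500\).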
Chapter 7: Error-Correcting Codes
Where Shannon’s coding theorem guarantees existence, coding theory asks for explicit constructions with efficient encoding and decoding algorithms. The key parameters of a code form a triangle of competing requirements.
The Singleton bound gives \(d \leq n - k + 1\) (equivalently \(R + \delta \leq 1 + 1/n\)). Codes meeting this bound are called Maximum Distance Separable (MDS).
A fundamental table of bounds constrains achievable code parameters:
| Bound | Form | Notes |
|---|---|---|
| Singleton | \(R + \delta \leq 1\) | MDS codes (RS) achieve equality |
| Plotkin | \(\delta \leq 1/2\) for \(R > 0\) | Tight for binary codes |
| Hamming (sphere-packing) | \(2^{nR} \cdot V(n, \lfloor \delta n/2 \rfloor) \leq 2^n\) | Perfect codes achieve equality |
| Gilbert-Varshamov | \(R \geq 1 - H_q(\delta)\) | Random codes achieve this |
| Elias-Bassalygo | \(R \leq 1 - H_q(J_q(\delta))\) | For list decoding |
where \(V(n,t) = \sum_{i=0}^t \binom{n}{i}\) is the Hamming ball volume and \(J_q(\delta)\) is the \(q\)-ary Johnson bound. Over large alphabets, Reed-Solomon codes meet the Singleton bound; for binary codes, whether explicit constructions can match — or any codes can beat — the Gilbert-Varshamov bound is a central open problem in coding theory.
BSC Capacity and Random Linear Codes
A random linear code of rate \(R\) over \(\mathbb{F}_2\) is a code whose generator matrix is chosen uniformly at random. By the probabilistic method, the expected number of codewords of weight below \((H_b^{-1}(1-R) - \varepsilon)n\) is exponentially small, so most random linear codes have minimum distance \(\geq (H_b^{-1}(1-R) - \varepsilon)n\) — the Gilbert-Varshamov bound. Random linear codes also achieve the BSC capacity \(1 - h(p)\) under maximum-likelihood decoding. Matching these guarantees with explicit constructions and polynomial-time decoding is a central open problem of coding theory.
Plotkin Bound and Singleton Bound
The Plotkin bound shows that exceeding relative distance \(1/2\) with positive rate is impossible for binary codes. Reed-Muller codes achieve the Plotkin bound for small parameters.
Reed–Solomon Codes
Efficient decoding up to \(\lfloor (d-1)/2 \rfloor\) errors is achieved by the Berlekamp-Welch algorithm. Beyond half the distance, list decoding is required.
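A minimal Berlekamp-Welch decoder fits in a few dozen lines. The sketch below works over the prime field \(\mathbb{F}_{97}\), with parameters chosen only for the demo (\(n = 9\), \(k = 3\), so up to \(e = \lfloor(n-k)/2\rfloor = 3\) errors are correctable): it solves the linear system \(Q(\alpha_i) = r_i E(\alpha_i)\) for a monic error locator \(E\) of degree \(e\) and a polynomial \(Q\) of degree \(< e + k\), then recovers the message as \(Q/E\).

```python
p = 97  # field size (assumption: prime, with n <= p)

def solve_mod(A, b, p):
    """Gaussian elimination over GF(p); returns one solution of A x = b."""
    m, ncols = len(A), len(A[0])
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    piv_cols, r = [], 0
    for c in range(ncols):
        piv = next((i for i in range(r, m) if M[i][c] % p), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        inv = pow(M[r][c], p - 2, p)
        M[r] = [x*inv % p for x in M[r]]
        for i in range(m):
            if i != r and M[i][c]:
                M[i] = [(x - M[i][c]*y) % p for x, y in zip(M[i], M[r])]
        piv_cols.append(c); r += 1
    x = [0]*ncols
    for ri, c in enumerate(piv_cols):
        x[c] = M[ri][-1]
    return x

def poly_eval(coeffs, x, p):  # coefficients listed low degree -> high
    return sum(c*pow(x, i, p) for i, c in enumerate(coeffs)) % p

def bw_decode(r_word, alphas, k, p):
    n = len(alphas); e = (n - k)//2
    # Unknowns: Q_0..Q_{e+k-1}, then E_0..E_{e-1} (E is monic of degree e).
    # Each position gives:  Q(a) - r_i * (E(a) - a^e) = r_i * a^e  (mod p)
    A, b = [], []
    for a, ri in zip(alphas, r_word):
        row = [pow(a, j, p) for j in range(e + k)]            # Q coefficients
        row += [(-ri*pow(a, j, p)) % p for j in range(e)]     # E coefficients
        A.append(row); b.append(ri*pow(a, e, p) % p)
    sol = solve_mod(A, b, p)
    Q, E = sol[:e + k], sol[e + k:] + [1]      # append monic leading term
    # Long division Q / E over GF(p) recovers the message polynomial.
    Q, f = Q[:], [0]*k
    for d in range(k - 1, -1, -1):
        c = Q[d + e] * pow(E[e], p - 2, p) % p
        f[d] = c
        for j in range(e + 1):
            Q[d + j] = (Q[d + j] - c*E[j]) % p
    return f

# Round trip: encode, corrupt 2 positions, decode.
k, n = 3, 9
msg = [5, 17, 42]                              # message poly, degree < k
alphas = list(range(1, n + 1))
code = [poly_eval(msg, a, p) for a in alphas]
recv = code[:]; recv[1] = (recv[1] + 10) % p; recv[6] = (recv[6] + 3) % p
assert bw_decode(recv, alphas, k, p) == msg
print("decoded:", bw_decode(recv, alphas, k, p))
```

The correctness argument is the classical one: for any solution \((Q, E)\) of the system, \(Q - fE\) has degree \(\leq e + k - 1 < n - e\) but vanishes on all uncorrupted positions, so it is identically zero.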
List Decoding
Standard unique decoding corrects up to \(\lfloor (d-1)/2 \rfloor\) errors. List decoding allows outputting a short list of possible codewords, enabling correction of far more errors.
This theorem (Zyablov-Pinsker, Guruswami-Håstad-Sudan-Zuckerman) is the “list decoding capacity theorem” — the analogue of Shannon’s capacity theorem for the combinatorial setting.
LDPC Codes
Low-density parity-check (LDPC) codes (Gallager 1960, rediscovered by MacKay-Neal 1996) approach capacity with practical encoding and decoding complexity.
Decoding proceeds by belief propagation (message passing): variable nodes iteratively send “soft” probability estimates to check nodes, which combine them and send updated beliefs back. On the binary erasure channel, belief propagation is exact and can be analyzed via density evolution: track the evolution of erasure probability through the iterations. LDPC codes achieve capacity on the BEC with threshold phenomenon: for any \(\varepsilon > 0\), there exists a degree distribution such that BP decodes correctly for erasure probability \((1-\varepsilon)C\). LDPC codes are used in the 5G NR standard for data channels.
Polar Codes
Arıkan’s polar codes (2009) are the first explicit family of codes that provably achieve capacity for any symmetric binary-input memoryless channel.
The key insight is the recursion: given two copies of channel \(W\), transmit \(u_1 \oplus u_2\) through the first copy and \(u_2\) through the second. The “minus” channel \(W^- : u_1 \to (y_1, y_2)\) sees \(u_1\) only through this mixture, with \(u_2\) unknown; the “plus” channel \(W^+ : u_2 \to (y_1, y_2, u_1)\) additionally knows \(u_1\). Then \(I(W^-) + I(W^+) = 2I(W)\) and \(I(W^+) \geq I(W) \geq I(W^-)\) — the channels “spread” in capacity while preserving the average. Iterating this polarizes them completely.
Successive cancellation (SC) decoding decodes \(u_1, u_2, \ldots, u_n\) in order, using previously decoded bits. On the “good” channels, \(u_i\) is decoded correctly; on the “bad” channels, \(u_i\) is frozen to a known value (typically 0). SC decoding runs in \(O(n \log n)\) time. Polar codes are used in 5G NR for control channels.
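On the BEC the recursion is exactly computable: if \(I = I(W)\), then \(I(W^-) = I^2\) and \(I(W^+) = 2I - I^2\). A few lines of code exhibit polarization numerically (the depth 10 and the 0.01/0.99 thresholds are arbitrary demo choices):

```python
# Channel polarization on BEC(0.5): I(W^-) = I^2, I(W^+) = 2I - I^2.
# After m levels there are 2^m synthetic channels; track their capacities.
I0 = 0.5
caps = [I0]
for _ in range(10):                      # 2^10 = 1024 synthetic channels
    caps = [c for I in caps for c in (I*I, 2*I - I*I)]

avg = sum(caps)/len(caps)                # preserved exactly at I0
extreme = sum(1 for c in caps if c < 0.01 or c > 0.99)
print(f"average capacity {avg:.3f}, fraction polarized {extreme/len(caps):.3f}")
```

The average stays pinned at \(I_0 = 0.5\) (the capacity is conserved at each split), while the bulk of the synthetic channels migrate toward 0 or 1.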
Linear Codes and Duality
If the dual code has minimum distance greater than \(k\), the uniform distribution over codewords is \(k\)-wise independent; thus the dual distance governs the degree of independence achievable in the code, connecting coding theory to pseudorandomness and the probabilistic method.
Chapter 8: k-Wise Independence and Derandomisation
Many randomised algorithms only require their random bits to be “locally independent” — specifically, any \(k\) of the \(n\) bits are mutually independent. This allows dramatic savings in the number of truly random bits needed.
The key construction uses Reed–Solomon codes: choose a random polynomial \(f \in \mathbb{F}_q[X]\) of degree \(< k\) uniformly, and set \(X_i = f(\alpha_i)\) for distinct field elements \(\alpha_i\). Since any \(k\) points determine a unique polynomial of degree \(< k\), the variables are \(k\)-wise independent. This requires only \(k \log q\) random bits — an exponential saving when \(k \ll n\).
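A short simulation of this construction, with \(q = 101\), \(k = 2\) (pairwise independence), and 30 variables, all demo choices. Note the output bits are only almost unbiased, since an odd field size cannot split exactly in half:

```python
# k-wise independent bits from a random degree-<k polynomial over GF(q).
# Sample space has size q^k instead of 2^n; here k = 2 (pairwise).
import random
random.seed(0)

q, k = 101, 2                        # q prime; need n <= q variables
alphas = list(range(1, 31))          # 30 distinct evaluation points

def sample():
    f = [random.randrange(q) for _ in range(k)]      # random poly, deg < k
    return [sum(c*pow(a, i, q) for i, c in enumerate(f)) % q % 2
            for a in alphas]         # reduce each field element to a bit

# Empirically check a pair of coordinates behaves independently.
N = 20000
S = [sample() for _ in range(N)]
i, j = 3, 17
pij = sum(s[i] & s[j] for s in S)/N
pi  = sum(s[i] for s in S)/N
pj  = sum(s[j] for s in S)/N
print(f"P[Xi=1]={pi:.3f} P[Xj=1]={pj:.3f} P[both]={pij:.3f} product={pi*pj:.3f}")
```

Since any two evaluations of a random degree-\(<2\) polynomial are jointly uniform on \(\mathbb{F}_q^2\), the empirical joint frequency matches the product of marginals up to sampling noise.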
Application to E3-SAT
Small-Bias Spaces and \(\varepsilon\)-Biased Sets
For tests that depend only on the XOR of subsets of variables (a broad class), we can use an even weaker notion than \(k\)-wise independence.
Small-bias spaces are a tool for derandomising tests that depend only on XOR of variables. They are strictly more efficient than pairwise-independent sets for large \(n\).
Expander Graphs and Pseudorandomness
Expander graphs are sparse graphs with strong connectivity properties — they behave “like” complete graphs for many purposes.
The Alon-Boppana theorem gives the optimal spectral gap: every \(d\)-regular graph has \(\lambda \geq 2\sqrt{d-1} - o(1)\) as the number of vertices grows. Ramanujan graphs, with \(\lambda \leq 2\sqrt{d-1}\), achieve this bound (Margulis 1988, Lubotzky-Phillips-Sarnak 1988). Expanders have mixing time \(O(\log n / \log(d/\lambda))\) for random walks, making them useful for pseudorandomness, error reduction in randomised algorithms, and constructing LDPC codes (Sipser-Spielman LDPC codes from expanders have linear-time decoding).
Pseudorandom Generators
The Nisan-Wigderson (NW) PRG: given a Boolean function \(f: \{0,1\}^k \to \{0,1\}\) such that no polynomial-size circuit computes \(f\) correctly on more than a \(1/2 + \delta\) fraction of inputs, the NW construction produces a PRG with seed length \(O(k)\) that fools all polynomial-size circuits with error governed by \(\delta\). This hardness-versus-randomness connection is a cornerstone of complexity theory. For space-bounded computation, Nisan’s PRG (1992) fools width-\(w\) branching programs with seed length \(O(\log^2 n)\).
Chapter 9: Locally Testable Codes and PCPs
The theory of probabilistically checkable proofs (PCPs) is one of the great achievements of complexity theory, and its connection to coding theory runs deep. At its heart lies the remarkable linearity of the Hadamard code.
Hadamard Code as a Linear Code
The Hadamard code’s codewords are exactly the linear functions on \(\mathbb{F}_2^k\), i.e., the maps \(x \mapsto \langle a, x \rangle\) for each \(a \in \mathbb{F}_2^k\); punctured at the position \(x = 0\), it becomes the simplex code, the dual of the Hamming code. The Walsh-Hadamard transform \(\hat{f}(a) = \sum_x (-1)^{f(x)+\langle a,x\rangle}\) measures the correlation of \(f\) with each codeword. Every pair of distinct codewords has Hamming distance exactly \(2^{k-1}\) — the code is equidistant.
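The equidistance property is easy to verify exhaustively for small \(k\):

```python
# Hadamard code over F_2^k: the codeword for a is (<a,x> mod 2 : x in F_2^k).
from itertools import product

k = 4
points = list(product([0, 1], repeat=k))

def codeword(a):
    return [sum(ai*xi for ai, xi in zip(a, x)) % 2 for x in points]

words = [codeword(a) for a in points]        # 2^k codewords of length 2^k
dists = {sum(u != v for u, v in zip(w1, w2))
         for w1 in words for w2 in words if w1 != w2}
print(dists)   # every distinct pair at distance exactly 2^(k-1) -> {8}
```

Every nonzero linear function takes the value 1 on exactly half the points, which is why the single distance \(2^{k-1}\) appears.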
Self-Testing and Self-Correction
Linearity testing (BLR test): to test whether \(f\) is close to a linear function, pick random \(x, y \in \{0,1\}^k\) and check \(f(x) \oplus f(y) = f(x \oplus y)\). The BLR test (Blum-Luby-Rubinfeld 1993) has perfect completeness and detects any function that is \(\delta\)-far from linear with probability \(\Omega(\delta)\). This is the first example of a locally testable code (LTC): a code where membership can be probabilistically verified by reading a constant number of positions.
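The test is equally easy to run empirically. Below, `far_f` is a hypothetical function at distance \(1/4\) from the nearest linear function (parity corrupted on inputs divisible by 4), and the rejection rate is estimated by sampling:

```python
# BLR linearity test: accept iff f(x) XOR f(y) == f(x XOR y).
import random
random.seed(1)

k = 10

def linear_f(x):               # <a, x> for a fixed mask a: truly linear
    return bin(x & 0b1011001).count("1") % 2

def far_f(x):                  # parity, corrupted on x = 0 mod 4 (1/4-far)
    par = bin(x).count("1") % 2
    return par if x % 4 else 1 - par

def reject_rate(f, trials=20000):
    rej = 0
    for _ in range(trials):
        x, y = random.randrange(2**k), random.randrange(2**k)
        rej += f(x) ^ f(y) != f(x ^ y)
    return rej/trials

print("linear:", reject_rate(linear_f))      # perfect completeness: 0.0
print("corrupted:", reject_rate(far_f))
```

A short case analysis on the low two bits of \(x\) and \(y\) shows the corrupted function is rejected with probability exactly \(10/16 = 0.625\), comfortably \(\Omega(\delta)\) for \(\delta = 1/4\).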
Low-Degree Testing and the Algebraic Approach
The extension from the Hadamard code (degree-1 polynomials) to higher degrees vastly increases the rate while preserving testability.
This enables the sum-check protocol: given (oracle access to) a low-degree polynomial \(g\) over \(\mathbb{F}_q\) and a claimed sum \(K = \sum_{x \in \{0,1\}^n} g(x)\), the prover and verifier interact for \(n\) rounds. In round \(i\), the prover sends a univariate polynomial \(s_i(X_i)\), and the verifier checks consistency: \(s_i(0) + s_i(1)\) must equal the value claimed in the previous round. At the end, the verifier evaluates \(g\) at a random point (one oracle query). The sum-check protocol runs in \(n\) rounds and \(O(n d \log q)\) bits of communication, and underlies the proofs that \(\#\mathrm{P} \subseteq \mathcal{IP}\) and \(\mathcal{IP} = \mathcal{PSPACE}\).
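Here is a self-contained sketch of the protocol with an honest prover, for a small hypothetical 3-variate polynomial over \(\mathbb{F}_p\). The prover's messages are sent as evaluations at \(0, \ldots, d\), which determine the degree-\(\leq d\) univariate polynomial; the verifier interpolates them at its random challenge:

```python
# Sum-check with an honest prover: verify K = sum over {0,1}^n of g(x).
import random, itertools
random.seed(2)
p = 2**31 - 1                    # prime field size

def g(x):                        # demo 3-variate polynomial over F_p
    return (x[0]*x[1] + 3*x[2] + x[0]*x[2]*x[2] + 5) % p

n, deg = 3, 2                    # deg bounds each variable's degree in g

def partial_sum(prefix):
    """Sum g(prefix, b) over all boolean suffixes b."""
    m = n - len(prefix)
    return sum(g(list(prefix) + list(b))
               for b in itertools.product([0, 1], repeat=m)) % p

claimed = partial_sum([])        # the true sum; prover claims this value
r, check = [], claimed
for i in range(n):
    # Prover sends s_i as evaluations at 0..deg (enough to interpolate).
    s = [partial_sum(r + [t]) for t in range(deg + 1)]
    assert (s[0] + s[1]) % p == check     # verifier: s_i(0)+s_i(1) = prev
    ri = random.randrange(p)
    # Lagrange-evaluate s_i at the random challenge ri.
    check = 0
    for j, sj in enumerate(s):
        num = den = 1
        for t in range(deg + 1):
            if t != j:
                num = num*(ri - t) % p
                den = den*(j - t) % p
        check = (check + sj*num*pow(den, p - 2, p)) % p
    r.append(ri)
assert check == g(r)             # final oracle query at the random point
print("sum-check accepted; sum =", claimed)
```

A dishonest prover that lies about the sum must eventually send a wrong univariate polynomial, which the random challenge catches with probability \(\geq 1 - nd/p\).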
PCP Theorem and Hardness of Approximation
The PCP theorem has a dramatic corollary for approximation algorithms.
Håstad’s PCP uses 3-query projection games with near-linear size. The key ingredient is a long code test: for a function \(f: \{0,1\}^n \to \{0,1\}\), pick random \(x, y\) and a sparse random noise vector \(\eta\), and test \(f(x) \oplus f(y) \oplus f(x \oplus y \oplus \eta) = 0\) (with folding to rule out trivial encodings). The Unique Games Conjecture (Khot 2002) — that it is NP-hard to distinguish \((1-\varepsilon)\)-satisfiable 2-CSPs with bijective constraints from \(\varepsilon\)-satisfiable ones — if true, would give tight inapproximability for many problems including VERTEX-COVER and MAX-CUT.
Exponential-Length PCPs for 3-SAT
The construction: encode the satisfying assignment \(x \in \{0,1\}^n\) as its Hadamard codeword \(\hat{x} \in \{0,1\}^{2^n}\), and encode all degree-2 products \((x_i x_j)\) similarly. Each clause \(\ell_1 \vee \ell_2 \vee \ell_3\) is violated only if \((1 - \ell_1)(1 - \ell_2)(1 - \ell_3) = 1\), which can be checked by reading a constant number of bits of these codewords, using self-correction to guard against corrupted positions. The Hadamard encoding of \(x\) and its degree-2 extension enable the verifier to check all constraints without reading the full proof.
Chapter 10: Graph Entropy and Monotone Complexity
Graph entropy, introduced by Körner (1973), extends the notion of Shannon entropy to combinatorial structures. It provides tight lower bounds on formula complexity that purely combinatorial or algebraic methods cannot match.
Graph entropy satisfies \(H(G, P) \leq H(P)\), with equality when \(G\) is the complete graph. For an independent set (edgeless graph), \(H(\bar{K}_n, P) = 0\).
Graph Entropy and Fractional Chromatic Number
This connects graph theory and information theory beautifully. The fractional chromatic number \(\chi_f(G) = \min\{n/k : G \to K_{n:k}\}\) where \(K_{n:k}\) is the Kneser graph, so graph entropy literally encodes chromatic information.
Körner’s Zero-Error Capacity
Shannon introduced the zero-error capacity \(C_0(G)\) of a channel: the maximum rate of communication with zero probability of error, where \(G\) is the channel’s confusability graph. Specifically, \(C_0(G) = \sup_n (1/n) \log \alpha(G^{\boxtimes n})\), where \(\alpha\) is the independence number and \(G^{\boxtimes n}\) is the \(n\)-fold strong product.
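The classic example is \(C_5\): a brute-force computation confirms \(\alpha(C_5) = 2\) but \(\alpha(C_5 \boxtimes C_5) = 5\), so \(C_0(C_5) \geq \frac{1}{2}\log 5\) (Lovász proved this is tight). The exhaustive search below is a demo for these tiny graphs only:

```python
# Zero-error capacity of C5: alpha(C5) = 2 but alpha(C5 x C5) = 5
# under the strong product, so C0(C5) >= (1/2) log 5.
from itertools import combinations

def c5_adj(u, v):                # confusable iff adjacent on the 5-cycle
    return (u - v) % 5 in (1, 4)

def strong_adj(a, b):            # strong product: adjacent iff every
    return all(x == y or c5_adj(x, y) for x, y in zip(a, b))  # coordinate agrees or is adjacent

def alpha(vertices, adj):        # independence number by brute force
    r = 1
    while True:
        bigger = any(all(not adj(u, v) for u, v in combinations(S, 2))
                     for S in combinations(vertices, r + 1))
        if not bigger:
            return r
        r += 1

V1 = list(range(5))
V2 = [(i, j) for i in range(5) for j in range(5)]
print("alpha(C5) =", alpha(V1, c5_adj))
print("alpha(C5 x C5) =", alpha(V2, strong_adj))
```

The size-5 independent set in the product is the famous \(\{(i, 2i \bmod 5)\}\) construction: pairs of symbols that cannot be simultaneously confused in both coordinates.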
The Lovász theta function is the first example of a polynomial-time computable graph invariant sandwiched between the independence number and the clique cover number, \(\alpha(G) \leq \vartheta(G) \leq \bar{\chi}(G)\) — a landmark in the interplay between information theory, combinatorics, and semidefinite programming.
Entropy and Graph Coloring
Graph entropy also connects to list coloring and Witsenhausen’s rate.
For the cycle \(C_5\) with uniform distribution, \(H_\chi(C_5, U) = \log 3\) (since \(\chi(C_5) = 3\)), while the Körner entropy \(H(C_5, U) = \log(5/2)\) (since \(\chi_f(C_5) = 5/2\)). The gap between chromatic entropy and Körner entropy reflects the gap between integer and fractional chromatic number. This has operational meaning: the Witsenhausen rate governs the minimum number of bits per source symbol needed for zero-error communication when the receiver observes a confusable version of the source.
Lower Bound for the Threshold Function
The proof uses the graph entropy of the Kneser graph \(K(n, k)\): vertices are \(k\)-subsets of \([n]\), edges connect disjoint subsets. The entropy of the Kneser graph lower-bounds the formula’s size: each gate in the formula can “resolve” at most a bounded amount of graph entropy per unit of formula size. The Kneser graph has \(\chi_f(K(n,k)) = n/k\), so \(H(K(n,k), U) = \log(n/k)\), and the formula complexity lower bound follows from iterating this argument through the tree structure of the formula.
The argument proceeds as follows: define a “complexity measure” \(\mu(f)\) for each monotone Boolean function \(f\) as the graph entropy of its associated confusability graph under the uniform distribution. One shows (i) \(\mu(T_k^n) = \Theta(\log(n/k))\), (ii) \(\mu\) is subadditive under gate composition: \(\mu(f \vee g) \leq \mu(f) + \mu(g)\), and (iii) each input literal \(x_i\) has \(\mu(x_i) = O(1/n)\). Combining: any formula for \(T_k^n\) of size \(s\) satisfies \(\mu(T_k^n) \leq s \cdot O(1/n)\), giving \(s = \Omega(n \log(n/k))\). For \(k = \lfloor n/2 \rfloor\), this yields \(\Omega(n)\), and a more refined analysis using the Kneser graph structure gives the \(\Omega(n^{5/2})\) bound.
Chapter 11: The Lovász Local Lemma
The Lovász Local Lemma (LLL) is a fundamental probabilistic existence argument. Given a collection of “bad” events that each occur with some probability but are “mostly independent” of each other, LLL guarantees that with positive probability, none of the bad events occurs.
Algorithmic LLL via Entropy Compression
The entropy compression proof (Moser 2009) gives the most illuminating argument. Define the following encoding scheme for the algorithm’s execution: the algorithm uses a random tape \(R\) (bits for initial assignment and resamplings). Whenever a bad event \(A_i\) occurs at time \(t\), record a witness tree — a tree whose root is \(A_i\) and whose children encode which events caused further resamplings. The key claim: the witness tree at time \(t\) plus the bits of \(R\) used for resampled variables can be decoded from a compressed representation of length proportional to the tree size times \(\log m\) bits. Since the original bits are uniformly random (and hence incompressible by Shannon’s source coding theorem), the number of resampling steps is bounded by the information-theoretic lower bound on their entropy. Concretely: if there are \(T\) resampling steps, the compressed encoding uses \(\leq ck \log m \cdot T\) bits (for some constant \(c\)) while the random tape used is \(\geq kT\) bits. When LLL conditions hold, the encoding is a genuine compression, bounding \(T\) polynomially in \(n\) and \(m\).
This argument makes a beautiful connection: the LLL condition \(ep(d+1) \leq 1\) is exactly the condition under which the witness tree encoding compresses the random tape, i.e., the condition under which the algorithm cannot run forever without violating the incompressibility of randomness.
Applications and Extensions of LLL
\(k\)-colorability of sparse graphs: A uniformly random \(k\)-coloring of a graph with maximum degree \(\Delta\) is proper with positive probability whenever \(k \geq e(2\Delta - 1)\). This follows from LLL applied to the bad events “edge \((u,v)\) is monochromatic,” each with probability \(1/k\) and dependency degree at most \(2(\Delta-1)\): setting \(p = 1/k\) and \(d = 2(\Delta-1)\), the condition \(ep(d+1) \leq 1\) becomes \((e/k)(2\Delta - 1) \leq 1\). (Greedy coloring already achieves \(\Delta + 1\) colors; the point here is the LLL template, which extends to settings where greedy fails.)
Property B (2-colorability of hypergraphs): A hypergraph \(\mathcal{H} = (V, \mathcal{E})\) in which every edge has size \(\geq k\) and intersects at most \(d\) other edges is 2-colorable whenever \(e \cdot 2^{1-k} \cdot (d+1) \leq 1\). This follows from LLL with \(p = 2^{1-k}\) under a uniformly random 2-coloring.
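The Moser-Tardos algorithm for this setting is a one-loop program: resample any currently monochromatic edge until none remains. The instance below (a random 8-uniform hypergraph, with sizes chosen only for the demo) typically terminates after only a few resamplings:

```python
# Moser-Tardos resampling for Property B: 2-color the vertices so that
# no edge is monochromatic, resampling a violated edge's vertices.
import random
random.seed(3)

n, k, m = 200, 8, 400                       # vertices, edge size, edge count
edges = [random.sample(range(n), k) for _ in range(m)]

color = [random.randrange(2) for _ in range(n)]
steps = 0
while True:
    bad = next((e for e in edges
                if len({color[v] for v in e}) == 1), None)
    if bad is None:
        break
    for v in bad:                            # resample only the violated edge
        color[v] = random.randrange(2)
    steps += 1

assert all(len({color[v] for v in e}) > 1 for e in edges)
print("proper 2-coloring found after", steps, "resamplings")
```

The expected number of initially monochromatic edges here is only \(m \cdot 2^{1-k} \approx 3\), so the witness-tree bound is far from tight on this easy instance; the point is the shape of the algorithm.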
Variable version of LLL (Moser-Tardos full version): In the variable setting, events are associated with disjoint sets of variables. After any resampling of variables in \(\mathrm{vbl}(A_i)\), the resample is truly independent of all events disjoint from \(A_i\). This structure makes the witness tree argument particularly clean: each node in the witness tree contributes independent randomness to the encoding. The expected number of resamplings of event \(A_i\) is at most \(x_i / (1 - x_i)\) in the asymmetric LLL setting, giving a precise bound on algorithm runtime.
Lossless coding proof: The original Moser proof can be phrased as a lossless coding argument. The compression scheme encodes the sequence of resampled events as a prefix-free code. The LLL condition ensures the code has length less than the number of random bits used — a contradiction with Shannon’s source coding lower bound (or equivalently, Kolmogorov complexity lower bound) if the algorithm runs for too many steps.
Chapter 12: Extended Formulations and Non-Negative Rank
The theory of extended formulations connects linear programming with communication complexity — a surprising and deep bridge that allows information-theoretic methods to prove lower bounds on LP sizes.
Extended Formulations and Slack Matrices
Lower Bound via Communication Complexity
Applied to the correlation polytope COR\(_n\) (convex hull of \(\{xx^T : x \in \{0,1\}^n\}\)): its slack matrix contains the unique-disjointness matrix, whose nonnegative rank is \(2^{\Omega(n)}\) by the rectangle corruption bound for Set Disjointness. Therefore \(\mathrm{xc}(\mathrm{COR}_n) = 2^{\Omega(n)}\). This resolved Yannakakis’s 1991 question (Fiorini-Massar-Pokutta-Tiwary-de Wolf 2012) and showed the symmetric TSP polytope (which contains COR\(_n\) as a face) has no polynomial-size LP formulation.
Rectangle Bound and Partition Bound
For randomised communication complexity, the corruption bound and the partition bound give strong lower bounds.
The partition bound (Jain-Klauck 2010) is a linear programming relaxation that simultaneously dominates the corruption bound, the discrepancy bound, and the rectangle bound. A closely related relaxation, the smooth rectangle bound, has a clean operational interpretation in terms of covering the communication matrix by near-monochromatic rectangles. For Set Disjointness, these bounds give a tight \(\Omega(n)\) lower bound for randomised complexity.
The hierarchy of lower bound methods for randomised CC, from weakest to strongest, is approximately: discrepancy bound \(\leq\) corruption/rectangle bound \(\leq\) smooth rectangle bound \(\leq\) partition bound. The partition bound is tight for many natural problems, including DISJ, IP, and GT (the Greater-Than function), and the relaxed partition bound in turn lower-bounds information complexity (Kerenidis-Laplante-Lerays-Roland-Xiao 2012, building on Braverman-Rao 2011).
Discrepancy method: For a function \(f\) and distribution \(\mu\), the discrepancy is \(\mathrm{disc}_\mu(f) = \max_{R = A \times B} |\mu(R \cap f^{-1}(1)) - \mu(R \cap f^{-1}(0))|\). Any randomised protocol with error \(\varepsilon\) satisfies \(\mathrm{disc}_\mu(f) \geq (1-2\varepsilon) \cdot 2^{-c}\), where \(c\) is the number of bits communicated. For the inner product function IP\(_n\) over \(\mathbb{F}_2^n\) with uniform distribution, the discrepancy is \(2^{-n/2}\), giving \(R^{\mathrm{pub}}(\mathrm{IP}_n) = \Omega(n)\). The discrepancy method connects combinatorics (rectangle covers), Fourier analysis (characters of the distribution), and communication complexity.
Approximate Nonnegative Rank and Quantum Communication
The connection between communication complexity and LP complexity extends to quantum computation.
Chapter 13: Quantum Information Theory
Quantum information theory extends classical information theory to quantum mechanical systems. The central object replaces probability distributions with density matrices, and entropy is generalised accordingly.
Von Neumann entropy inherits properties from Shannon entropy: \(S(\rho) \geq 0\) with equality for pure states; \(S(\rho) \leq \log d\) with equality for the maximally mixed state \(I/d\); and concavity in \(\rho\). But quantum entropy also has properties with no classical analogue.
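For a single qubit the eigenvalues, and hence \(S(\rho)\), follow by hand from the trace and determinant; a small helper illustrates the pure vs. maximally mixed extremes:

```python
# Von Neumann entropy of a qubit density matrix via its eigenvalues.
import math

def qubit_entropy(rho):
    """S(rho) in bits for a 2x2 Hermitian density matrix [[a, b], [b*, d]]."""
    (a, b), (_, d) = rho
    tr, det = a + d, a*d - abs(b)**2
    disc = math.sqrt(max(tr*tr - 4*det, 0.0))
    eigs = [(tr + disc)/2, (tr - disc)/2]      # roots of the characteristic poly
    return -sum(l*math.log2(l) for l in eigs if l > 1e-12) + 0.0

pure  = [[1.0, 0.0], [0.0, 0.0]]               # |0><0|
plus  = [[0.5, 0.5], [0.5, 0.5]]               # |+><+| (also pure)
mixed = [[0.5, 0.0], [0.0, 0.5]]               # maximally mixed I/2
print(qubit_entropy(pure), qubit_entropy(plus), qubit_entropy(mixed))
# -> 0.0 0.0 1.0
```

Both pure states have zero entropy regardless of basis, while the maximally mixed state attains the maximum \(\log 2 = 1\) bit.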
Strong Subadditivity
Strong subadditivity (SSA) is one of the deepest inequalities in quantum information theory. It was conjectured by Lanford-Robinson in 1968 and proved by Lieb-Ruskai (1973) using Lieb’s concavity theorem for matrix functions. The classical analogue is “conditioning reduces entropy,” \(H(X|Y,Z) \leq H(X|Y)\), which is elementary; SSA has no comparably elementary quantum proof and requires sophisticated operator inequalities. Unlike the classical case, the monotonicity \(S(\rho_{AB}) \geq S(\rho_A)\) can fail: for a pure entangled state, \(S(\rho_{AB}) = 0\) while \(S(\rho_A) > 0\) — a purely quantum phenomenon. It does hold when \(B\) is a classical system.
Holevo’s Theorem
For pure states, \(S(\rho_i) = 0\), so \(\chi = S(\rho_\mathrm{avg}) \leq \log d\). This means a qubit carries at most 1 bit of accessible classical information. The bound is in general attained only asymptotically, via collective measurements over many channel uses: the Holevo-Schumacher-Westmoreland (HSW) theorem states that the classical capacity of a quantum channel equals \(\lim_{n \to \infty} (1/n) \max_{\{p_i, \rho_i\}} \chi^{(n)}\), the regularised Holevo information.
Quantum Channel Capacity and Superadditivity
A fundamental open question was whether the Holevo quantity is additive: does \(\chi(\mathcal{N}_1 \otimes \mathcal{N}_2) = \chi(\mathcal{N}_1) + \chi(\mathcal{N}_2)\)? Equivalently, can entangled inputs increase the classical capacity? Hastings (2009) proved the additivity conjecture is false: there exist channels for which entangled inputs across multiple uses give strictly higher classical capacity. This is a purely quantum phenomenon with no classical analogue.
The quantum capacity \(Q(\mathcal{N})\) of a channel is the maximum rate of reliable transmission of quantum information (qubits). The quantum channel coding theorem (Lloyd-Shor-Devetak 1997-2005) states that \(Q(\mathcal{N}) = \lim_{n \to \infty} (1/n) I_c(\mathcal{N}^{\otimes n})\) where \(I_c\) is the coherent information. Anti-degradable channels have zero quantum capacity (by a no-cloning argument), though not every zero-capacity channel is anti-degradable. The quantum capacity is also superadditive: \(I_c(\mathcal{N}_1 \otimes \mathcal{N}_2) > I_c(\mathcal{N}_1) + I_c(\mathcal{N}_2)\) is possible.
Quantum Error Correction
Quantum systems are fragile: any interaction with the environment can cause decoherence. Quantum error correction (QEC) protects quantum information against noise.
The 3-qubit repetition code encodes \(|\psi\rangle = \alpha|0\rangle + \beta|1\rangle\) as \(\alpha|000\rangle + \beta|111\rangle\), correcting single bit flips by majority vote. The Shor code (9 qubits) concatenates a 3-qubit phase-flip code with a 3-qubit bit-flip code, correcting any single-qubit error.
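A direct state-vector simulation of the 3-qubit code makes the syndrome logic concrete. (Reading the basis states from the amplitudes is a simulation shortcut; a real decoder measures the parities \(Z_1Z_2, Z_2Z_3\) without learning the amplitudes.)

```python
# 3-qubit repetition code: encode a|0> + b|1> as a|000> + b|111>, apply
# a bit flip, then correct via the syndrome (parities of adjacent qubits).
a, b = 0.6, 0.8                          # amplitudes, |a|^2 + |b|^2 = 1
state = [0.0]*8
state[0b000], state[0b111] = a, b        # encoded logical state (8 amplitudes)

def flip(psi, q):                        # Pauli X on qubit q (0 = leftmost)
    bit = 1 << (2 - q)
    return [psi[i ^ bit] for i in range(8)]

corrupted = flip(state, 1)               # unknown single bit-flip error

# Syndrome from any basis state in the support (simulation shortcut).
idx = next(i for i, amp in enumerate(corrupted) if amp != 0)
b0, b1, b2 = (idx >> 2) & 1, (idx >> 1) & 1, idx & 1
syndrome = (b0 ^ b1, b1 ^ b2)
which = {(0, 0): None, (1, 0): 0, (1, 1): 1, (0, 1): 2}[syndrome]

recovered = flip(corrupted, which) if which is not None else corrupted
assert recovered == state
print("syndrome", syndrome, "-> flip qubit", which)
```

The four syndrome values distinguish "no error" from a flip on each of the three qubits, exactly the majority-vote logic phrased as parity checks.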
BB84 Quantum Key Distribution
Quantum mechanics enables information-theoretically secure key distribution — a task impossible classically against a computationally unbounded eavesdropper.
- Alice generates a random bit string \(b \in \{0,1\}^n\) (the key) and a random basis string \(\theta \in \{+,\times\}^n\).
- Alice sends \(n\) qubits, encoding each bit \(b_i\) in basis \(\theta_i\): basis \(+\) uses \(\{|0\rangle, |1\rangle\}\), basis \(\times\) uses \(\{|+\rangle, |-\rangle\}\).
- Bob measures each qubit in a randomly chosen basis \(\theta'_i\). When \(\theta_i = \theta'_i\), he obtains \(b_i\) exactly; otherwise his result is random.
- Alice and Bob announce bases over a public channel, discard mismatched positions (~50%), and perform error estimation. High error rate signals eavesdropping.
- Privacy amplification and information reconciliation yield a shared secret key.
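The sifting steps above are easy to simulate; without an eavesdropper, the sifted keys agree exactly and about half the positions survive:

```python
# BB84 sifting simulation (no eavesdropper): matched bases keep the bit.
import random
random.seed(4)

n = 10000
alice_bits  = [random.randrange(2) for _ in range(n)]
alice_basis = [random.choice("+x") for _ in range(n)]
bob_basis   = [random.choice("+x") for _ in range(n)]

# Matched basis: Bob reads the bit exactly; mismatched: uniformly random.
bob_bits = [bit if ab == bb else random.randrange(2)
            for bit, ab, bb in zip(alice_bits, alice_basis, bob_basis)]

sifted = [(a, b) for a, b, ab, bb in
          zip(alice_bits, bob_bits, alice_basis, bob_basis) if ab == bb]
errors = sum(a != b for a, b in sifted)
print(f"kept {len(sifted)}/{n} positions, {errors} disagreements")
```

Inserting an intercept-resend eavesdropper into this simulation would raise the disagreement rate to about 25% of the sifted positions, which is what the error-estimation step detects.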
Entanglement as a Resource
Entanglement is a purely quantum phenomenon with no classical analogue. It enables communication tasks that are impossible classically.
Dense coding and teleportation are dual protocols: one trades quantum communication for classical + entanglement, the other trades entanglement + classical for quantum communication. Together they define the three fundamental resources of quantum information: classical bits (cbits), quantum bits (qubits), and shared entanglement (ebits). The resource inequalities
\[ [qq] + [q \to q] \geq 2[c \to c], \qquad [qq] + 2[c \to c] \geq [q \to q] \](dense coding, teleportation) form the basis of quantum Shannon theory, where the goal is to characterise all achievable tradeoffs among these three resources over a quantum channel.
Schumacher Compression
The proof is a direct quantum analogue of the AEP proof. The typical subspace is the eigenspace of \(\rho^{\otimes n}\) corresponding to eigenvalues in the range \(2^{-n(S(\rho) \pm \varepsilon)}\) — this subspace has dimension \(\approx 2^{nS(\rho)}\) and captures nearly all of the \(n\)-copy state’s support.
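For a qubit source with eigenvalues \((p, 1-p)\), the spectrum of \(\rho^{\otimes n}\) is explicit: eigenvalue \(p^k(1-p)^{n-k}\) with multiplicity \(\binom{n}{k}\). The typical-subspace claims can therefore be checked numerically (here \(p = 0.2\), \(n = 1000\), \(\varepsilon = 0.05\) are arbitrary demo values):

```python
# Typical subspace of rho^{(x)n} for a qubit with eigenvalues (p, 1-p).
import math

p, n, eps = 0.2, 1000, 0.05
S = -p*math.log2(p) - (1-p)*math.log2(1-p)    # S(rho) = H(p)

mass, dim = 0.0, 0
for k in range(n + 1):
    loglam = k*math.log2(p) + (n - k)*math.log2(1 - p)   # log2 of eigenvalue
    if -n*(S + eps) <= loglam <= -n*(S - eps):           # typical window
        mult = math.comb(n, k)
        mass += mult * 2.0**loglam    # probability mass captured
        dim  += mult                  # dimension of the typical subspace
print(f"mass {mass:.3f}, log2(dim)/n = {math.log2(dim)/n:.3f}, S = {S:.3f}")
```

Nearly all of the state's weight lies in a subspace of dimension roughly \(2^{nS(\rho)}\), which is exactly the compression rate Schumacher's theorem promises.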
Schumacher compression is the quantum source coding theorem: it shows that von Neumann entropy is the quantum analogue of Shannon entropy, not just formally but operationally.
| Classical | Quantum |
|---|---|
| Probability distribution \(p\) | Density matrix \(\rho\) |
| Shannon entropy \(H(p)\) | Von Neumann entropy \(S(\rho)\) |
| Typical set \(T_\varepsilon^{(n)}\) | Typical subspace |
| Source coding rate \(\to H(p)\) | Compression rate \(\to S(\rho)\) |
| AEP via LLN | Quantum AEP via operator norm |
Summary: The Information-Theoretic Method
The course has developed a single powerful idea — entropy as a measure of uncertainty — and deployed it across six distinct areas of mathematics and computer science.
As a compression principle: Shannon entropy is the fundamental limit of lossless compression (source coding theorem). The Lempel-Ziv algorithm achieves this limit universally without knowing the source. Rate-distortion theory extends this to lossy compression, with the Gaussian case giving the elegant reverse water-filling formula.
As a combinatorial counting tool: Shearer’s lemma and the Loomis-Whitney inequality turn entropy into a counting machine. Kahn’s theorem on independent sets, Bregman’s permanent bound, and the Fredman-Komlós bound on perfect hash families are all entropy arguments in disguise.
As a channel coding limit: Mutual information \(I(X;Y)\) quantifies the maximum reliable throughput of a noisy channel. The Gaussian channel capacity \(C = (1/2)\log(1+\text{SNR})\) is the cornerstone of wireless communication. Error exponents, the strong converse, and channel dispersion refine this to finite blocklengths.
As a lower-bound tool in complexity: Fano’s inequality, the data processing inequality, and information complexity give communication lower bounds for Set Disjointness, Index, and Inner Product. These propagate to circuit lower bounds, data structure lower bounds, and via Yannakakis’s theorem, to LP lower bounds.
As a probabilistic method: LLL via entropy compression connects the condition \(ep(d+1) \leq 1\) to the incompressibility of random strings. K-wise independence and small-bias spaces trade between randomness and universality of probabilistic algorithms.
In quantum systems: Von Neumann entropy, strong subadditivity, and the Holevo bound govern quantum communication. Dense coding and teleportation trade classical bits, qubits, and entanglement at precise exchange rates determined by quantum entropy. Quantum error correction protects against decoherence, with the fault-tolerance threshold theorem making arbitrarily long quantum computation feasible in principle.
The thread connecting all these is the idea that information is a physical, computational, and combinatorial resource — one that can be quantified, manipulated, and traded against other resources with precise exchange rates governed by the inequalities of information theory.