STAT 908: Statistical Inference

Shoja'eddin Chenouri


Sources and References

Primary texts — A.W. van der Vaart, Asymptotic Statistics (Cambridge University Press, 1998); Anirban DasGupta, Asymptotic Theory of Statistics and Probability (Springer, 2008); Martin Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint (Cambridge University Press, 2019). Supplementary — E.L. Lehmann, Theory of Point Estimation (Wiley, 1983); E.L. Lehmann and J.P. Romano, Testing Statistical Hypotheses (3rd ed., Springer, 2005); T.S. Ferguson, Mathematical Statistics: A Decision Theoretic Approach (Academic Press, 1967). Online resources — MIT 18.650 Statistics for Applications notes; Stanford STATS 300B (Theory of Statistics) lecture notes; Larry Wasserman’s CMU 36-705 Intermediate Statistics notes.

Chapter 1: Measure-Theoretic Foundations and Hilbert Spaces

Statistical inference at the graduate level demands a rigorous foundation in measure theory. The language of measurable spaces, sigma-algebras, and integration not only gives precise meaning to probability and expectation but also supplies the machinery for proving asymptotic results throughout this course. This chapter reviews the essential elements at the level expected for STAT 908.

Section 1.1: Probability Spaces and Measurable Functions

A measurable space is a pair \((\Omega, \mathcal{F})\) where \(\Omega\) is a nonempty set and \(\mathcal{F}\) is a sigma-algebra of subsets of \(\Omega\). A probability measure \(P : \mathcal{F} \to [0,1]\) satisfies \(P(\Omega) = 1\) and countable additivity: for disjoint \(A_1, A_2, \ldots \in \mathcal{F}\),

\[ P\!\left(\bigcup_{i=1}^\infty A_i\right) = \sum_{i=1}^\infty P(A_i). \]

The triple \((\Omega, \mathcal{F}, P)\) is a probability space. A random variable is a measurable function \(X : (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))\), where \(\mathcal{B}(\mathbb{R})\) denotes the Borel sigma-algebra generated by the open sets of \(\mathbb{R}\). Measurability means \(X^{-1}(B) \in \mathcal{F}\) for every \(B \in \mathcal{B}(\mathbb{R})\), ensuring that events of the form \(\{X \in B\}\) are well-defined.

The distribution or law of \(X\) is the probability measure \(P_X = P \circ X^{-1}\) on \((\mathbb{R}, \mathcal{B}(\mathbb{R}))\). The cumulative distribution function (CDF) is \(F_X(t) = P(X \leq t)\).

Section 1.2: Integration and Expectation

The Lebesgue integral is constructed in three stages: first for nonnegative simple functions, then for nonnegative measurable functions as a supremum of simple approximations, and finally for general functions by decomposing \(f = f^+ - f^-\) where \(f^+ = \max(f,0)\) and \(f^- = \max(-f,0)\). The integral \(\int_\Omega f \, dP\) is defined when at least one of \(\int f^+ \, dP\) or \(\int f^- \, dP\) is finite.

The expectation of a random variable \(X\) is \(\mathbb{E}[X] = \int_\Omega X \, dP\). The key convergence theorems are:

Monotone Convergence Theorem. If \(0 \leq X_1 \leq X_2 \leq \cdots\) are measurable functions and \(X_n \to X\) pointwise, then \(\mathbb{E}[X_n] \to \mathbb{E}[X]\).
Dominated Convergence Theorem. If \(X_n \to X\) almost surely and \(|X_n| \leq Y\) for all \(n\) where \(\mathbb{E}[Y] < \infty\), then \(\mathbb{E}[X_n] \to \mathbb{E}[X]\).

These theorems underpin the interchange of limits and expectations that appears throughout asymptotic theory.

Section 1.3: Modes of Convergence

Let \(X, X_1, X_2, \ldots\) be random variables on a common probability space \((\Omega, \mathcal{F}, P)\).

Almost Sure Convergence. We say \(X_n \xrightarrow{\text{a.s.}} X\) if \(P(\{\omega : X_n(\omega) \to X(\omega)\}) = 1\), equivalently if \[ P\!\left(\limsup_{n\to\infty} \{|X_n - X| > \epsilon\}\right) = 0 \quad \text{for all } \epsilon > 0. \]
Convergence in Probability. We say \(X_n \xrightarrow{P} X\) if for all \(\epsilon > 0\), \[ P(|X_n - X| > \epsilon) \to 0 \quad \text{as } n \to \infty. \]
Convergence in \(L^p\). For \(p \geq 1\), we say \(X_n \xrightarrow{L^p} X\) if \(\mathbb{E}[|X_n - X|^p] \to 0\).
Convergence in Distribution. We say \(X_n \xrightarrow{d} X\) if \(\mathbb{E}[f(X_n)] \to \mathbb{E}[f(X)]\) for all bounded continuous functions \(f : \mathbb{R} \to \mathbb{R}\), equivalently if \(F_{X_n}(t) \to F_X(t)\) at every continuity point \(t\) of \(F_X\).

The relationships among these modes are summarized by the following implications, all of which are strict in general:

\[ \text{a.s.} \implies \text{in probability} \implies \text{in distribution}, \qquad L^p \implies \text{in probability} \implies \text{in distribution}. \]

The implication from almost sure convergence to convergence in probability follows from continuity of measure: \(P(|X_n - X| > \epsilon) \leq P\!\left(\bigcup_{m \geq n} \{|X_m - X| > \epsilon\}\right) \downarrow P\!\left(\limsup_n \{|X_n - X| > \epsilon\}\right) = 0\). The converse fails: the classic “typewriter sequence” of indicators of shrinking intervals sweeping repeatedly across \([0,1]\) converges in probability to zero but not almost surely.
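
The typewriter sequence mentioned above can be constructed explicitly. A minimal pure-Python sketch (the helper names and the choice \(\omega = 0.3\) are illustrative): block \(k\) sweeps \(2^k\) dyadic intervals of width \(2^{-k}\) across \([0,1)\), so \(P(X_n = 1) \to 0\) while every fixed \(\omega\) is hit once per block.

```python
# Typewriter sequence: X_n = indicator of the n-th dyadic interval,
# sweeping across [0, 1) in blocks of width 2^-k for k = 0, 1, 2, ...
def block_and_offset(n):
    """Map index n to (k, j): the j-th interval of width 2**-k."""
    k = 0
    while n >= 2 ** k:
        n -= 2 ** k
        k += 1
    return k, n

def X(n, omega):
    k, j = block_and_offset(n)
    return 1 if j / 2 ** k <= omega < (j + 1) / 2 ** k else 0

omega = 0.3
N = 1023                     # indices covering blocks k = 0 .. 9 exactly
probs = [2.0 ** -block_and_offset(n)[0] for n in range(N)]   # P(X_n = 1)
hits = [n for n in range(N) if X(n, omega) == 1]
print(probs[-1])   # = 2**-9: P(X_n = 1) -> 0, so X_n -> 0 in probability
print(len(hits))   # = 10: X_n(omega) = 1 once per block, so no a.s. limit
```

Since \(X_n(\omega) = 1\) infinitely often for every \(\omega\), the sequence has no almost-sure limit even though it tends to zero in probability.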

Markov’s inequality states that for any \(a > 0\) and \(p > 0\),

\[ P(|X| \geq a) \leq \frac{\mathbb{E}[|X|^p]}{a^p}, \]

and Chebyshev’s inequality is the special case \(p=2\) applied to \(X - \mu\):

\[ P(|X - \mu| \geq a) \leq \frac{\text{Var}(X)}{a^2}. \]

These inequalities provide the fundamental bridge between moments and tail probabilities.
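
As a quick sanity check of Chebyshev's inequality, the following simulation (a sketch with arbitrary seed and sample size) compares the empirical tail frequency of a standard normal sample against the bound \(\text{Var}(X)/a^2\):

```python
import random
import statistics

# Empirical check of Chebyshev's inequality for a standard normal sample:
# the observed tail frequency never exceeds Var(X) / a^2.
random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]
mu = statistics.fmean(xs)
var = statistics.pvariance(xs, mu)

checks = []
for a in (1.0, 2.0, 3.0):
    tail = sum(abs(x - mu) >= a for x in xs) / len(xs)
    checks.append((a, tail, var / a ** 2, tail <= var / a ** 2))

for a, tail, bound, ok in checks:
    print(f"a={a}: tail={tail:.4f}, bound={bound:.4f}, holds={ok}")
```

The bound is loose (at \(a = 2\) the true normal tail is about 0.046 versus the bound 0.25), which reflects that Chebyshev uses only two moments.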

Section 1.4: Uniform Integrability

Convergence in probability does not imply \(L^1\) convergence unless a uniform integrability condition holds.

Uniform Integrability. A family of random variables \(\{X_\alpha\}\) is uniformly integrable if \[ \lim_{M \to \infty} \sup_\alpha \mathbb{E}\!\left[|X_\alpha| \mathbf{1}_{|X_\alpha| > M}\right] = 0. \]
Vitali Convergence Theorem. \(X_n \xrightarrow{L^1} X\) if and only if \(X_n \xrightarrow{P} X\) and \(\{X_n\}\) is uniformly integrable.

A practical sufficient condition: if \(\sup_n \mathbb{E}[|X_n|^{1+\delta}] < \infty\) for some \(\delta > 0\), then \(\{X_n\}\) is uniformly integrable (a special case of the de la Vallée-Poussin criterion). The condition is not necessary, but it is the easiest criterion to check in practice.

Section 1.5: Hilbert Space Structure of \(L^2\)

The space \(L^2(\Omega, \mathcal{F}, P)\) of square-integrable random variables forms a Hilbert space under the inner product

\[ \langle X, Y \rangle = \mathbb{E}[XY] \]

with corresponding norm \(\|X\|_2 = (\mathbb{E}[X^2])^{1/2}\). Elements are equivalence classes under almost-sure equality. This structure is fundamental to the geometry of statistical estimation.

Projection Theorem. Let \(\mathcal{H}\) be a closed linear subspace of \(L^2(\Omega, \mathcal{F}, P)\) and let \(X \in L^2\). There exists a unique \(\hat{X} \in \mathcal{H}\) minimizing \(\|X - Y\|_2\) over all \(Y \in \mathcal{H}\). This \(\hat{X}\) is characterized by the orthogonality condition \(X - \hat{X} \perp \mathcal{H}\), meaning \(\mathbb{E}[(X - \hat{X})Y] = 0\) for all \(Y \in \mathcal{H}\).

The conditional expectation \(\mathbb{E}[X \mid \mathcal{G}]\) is precisely the \(L^2\)-projection of \(X\) onto \(L^2(\Omega, \mathcal{G}, P)\) for any sub-sigma-algebra \(\mathcal{G} \subseteq \mathcal{F}\). The tower property \(\mathbb{E}[\mathbb{E}[X \mid \mathcal{G}] \mid \mathcal{H}] = \mathbb{E}[X \mid \mathcal{H}]\) for \(\mathcal{H} \subseteq \mathcal{G}\) follows directly from the projection interpretation: projecting onto a larger space and then onto a smaller one is the same as projecting onto the smaller space directly.
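
The orthogonality characterization of conditional expectation can be verified exactly on a finite probability space. In this sketch (the sample space and test functions are illustrative choices), \(\Omega = \{0,1,2,3\}\) with uniform measure, \(X(\omega) = \omega\), and \(\mathcal{G} = \sigma(\text{parity})\); the projection averages \(X\) over each parity class:

```python
from fractions import Fraction

# Finite exact example: Omega = {0,1,2,3} uniform, X(w) = w,
# G = sigma(parity).  E[X | G] averages X over each parity class, and
# X - E[X | G] is orthogonal to every G-measurable Y (here Y = g(parity)).
omega_space = [0, 1, 2, 3]
p = Fraction(1, 4)
cond = {b: Fraction(sum(w for w in omega_space if w % 2 == b), 2) for b in (0, 1)}
xhat = [cond[w % 2] for w in omega_space]          # E[X | G] on Omega
residual = [w - xh for w, xh in zip(omega_space, xhat)]

inner_products = []
for g in ({0: 1, 1: 0}, {0: -3, 1: 7}):           # G-measurable test functions
    inner = sum(p * r * g[w % 2] for w, r in zip(omega_space, residual))
    inner_products.append(inner)
print(cond, inner_products)    # E[(X - Xhat) Y] = 0 for every such Y
```

Because every \(\mathcal{G}\)-measurable function is a function of parity, checking two generic test functions suffices to exhibit the orthogonality \(\mathbb{E}[(X - \hat{X})Y] = 0\) exactly (the `Fraction` arithmetic avoids rounding).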


Chapter 2: Decision Theory and Optimal Estimation

Decision theory provides a unified framework for evaluating and comparing statistical procedures. Rather than seeking procedures that are “good” in some intuitive sense, decision theory formalizes the notion of optimality through loss functions and risk functions.

Section 2.1: The Statistical Decision Problem

A statistical decision problem consists of:

  • A parameter space \(\Theta\), which may be finite-dimensional or infinite-dimensional.
  • An action space \(\mathcal{A}\) of possible decisions.
  • A loss function \(L : \Theta \times \mathcal{A} \to [0, \infty)\), where \(L(\theta, a)\) measures the cost of taking action \(a\) when the true parameter is \(\theta\).

A decision rule or estimator \(\delta : \mathcal{X} \to \mathcal{A}\) is a measurable function mapping observations to actions. The risk function is the expected loss:

\[ R(\theta, \delta) = \mathbb{E}_\theta\!\left[L(\theta, \delta(X))\right]. \]

Common loss functions include squared error loss \(L(\theta, a) = (\theta - a)^2\), absolute error loss \(L(\theta, a) = |\theta - a|\), and 0-1 loss for testing.

Admissibility. A decision rule \(\delta\) is inadmissible if there exists another rule \(\delta'\) such that \(R(\theta, \delta') \leq R(\theta, \delta)\) for all \(\theta\) with strict inequality for at least one \(\theta\). Otherwise \(\delta\) is admissible.
Minimax Rule. A decision rule \(\delta^*\) is minimax if \[ \sup_\theta R(\theta, \delta^*) = \inf_\delta \sup_\theta R(\theta, \delta). \]
Bayes Risk. Given a prior \(\pi\) on \(\Theta\), the Bayes risk of \(\delta\) is \[ r(\pi, \delta) = \int_\Theta R(\theta, \delta) \, d\pi(\theta). \]

A Bayes rule minimizes the Bayes risk over all decision rules.

A key connection is that if a Bayes rule \(\delta_\pi\) has constant risk, then it is minimax. This is the classical method for establishing minimaxity: find a least-favorable prior such that the Bayes estimator achieves constant risk.
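
The standard textbook instance of this method is binomial estimation under squared-error loss: the Bayes estimator under the \(\text{Beta}(\sqrt{n}/2, \sqrt{n}/2)\) prior, \(\delta(X) = (X + \sqrt{n}/2)/(n + \sqrt{n})\), has constant risk and is therefore minimax. The sketch below (the choice \(n = 25\) and the grid of \(p\) values are arbitrary) verifies the constancy by computing the exact risk:

```python
import math

# For X ~ Binomial(n, p) under squared-error loss, the Bayes estimator
# under the Beta(sqrt(n)/2, sqrt(n)/2) prior,
#   delta(X) = (X + sqrt(n)/2) / (n + sqrt(n)),
# has constant risk, hence is minimax.  Verify constancy exactly.
n = 25

def risk(p):
    delta = lambda x: (x + math.sqrt(n) / 2) / (n + math.sqrt(n))
    return sum(math.comb(n, x) * p ** x * (1 - p) ** (n - x) * (delta(x) - p) ** 2
               for x in range(n + 1))

risks = [risk(k / 10) for k in range(11)]
const = n / (4 * (n + math.sqrt(n)) ** 2)   # the common value of the risk
print(max(risks) - min(risks), const)       # risk is flat in p
```

The risk is flat at \(n / (4(n + \sqrt{n})^2)\) for every \(p\), exhibiting the least-favorable-prior construction in closed form.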

Section 2.2: The Cramér-Rao Lower Bound

The Cramér-Rao lower bound (CRLB) provides a fundamental lower bound on the variance of any unbiased estimator in a regular parametric family.

Let \(\{P_\theta : \theta \in \Theta\}\) be a family with densities \(f(x; \theta)\) with respect to a sigma-finite measure. The score function is

\[ s(x; \theta) = \frac{\partial}{\partial \theta} \log f(x; \theta), \]

and the Fisher information is

\[ \mathcal{I}(\theta) = \mathbb{E}_\theta\!\left[s(X; \theta)^2\right] = -\mathbb{E}_\theta\!\left[\frac{\partial^2}{\partial \theta^2} \log f(X; \theta)\right], \]

with equality of the two forms holding under suitable regularity (exchange of differentiation and integration).

Cramér-Rao Lower Bound. Under regularity conditions (the support of \(f(\cdot; \theta)\) does not depend on \(\theta\), and differentiation may pass under the integral), if \(\delta(X)\) is an unbiased estimator of \(g(\theta)\), then \[ \text{Var}_\theta(\delta(X)) \geq \frac{\left[g'(\theta)\right]^2}{\mathcal{I}(\theta)}. \]
Proof. By the unbiasedness condition, \[ g(\theta) = \mathbb{E}_\theta[\delta(X)] = \int \delta(x) f(x;\theta) \, d\mu(x). \]

Differentiating with respect to \(\theta\) and exchanging differentiation and integration:

\[ g'(\theta) = \int \delta(x) \frac{\partial}{\partial\theta} f(x;\theta) \, d\mu(x) = \int \delta(x) s(x;\theta) f(x;\theta) \, d\mu(x) = \text{Cov}_\theta(\delta(X), s(X;\theta)). \]

Note \(\mathbb{E}_\theta[s(X;\theta)] = 0\) (under regularity, differentiating \(\int f \, d\mu = 1\)). By the Cauchy-Schwarz inequality,

\[ \left[g'(\theta)\right]^2 = \left[\text{Cov}_\theta(\delta(X), s(X;\theta))\right]^2 \leq \text{Var}_\theta(\delta(X)) \cdot \text{Var}_\theta(s(X;\theta)) = \text{Var}_\theta(\delta(X)) \cdot \mathcal{I}(\theta). \]

Dividing both sides by \(\mathcal{I}(\theta)\) yields the result. \(\square\)

An estimator achieving the CRLB with equality is called efficient. The CRLB may not be achievable; the Rao-Blackwell and Lehmann-Scheffé theorems characterize when an optimal unbiased estimator exists.
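
A Monte-Carlo sketch of an efficient estimator (seed, \(\lambda\), and sample sizes are arbitrary; the `poisson` sampler is a hypothetical helper using Knuth's method): for i.i.d. Poisson(\(\lambda\)) data, \(\mathcal{I}(\lambda) = 1/\lambda\) per observation, and the sample mean attains the CRLB \(\lambda/n\) exactly.

```python
import math
import random
import statistics

# Check that the sample mean of n i.i.d. Poisson(lam) draws attains the
# CRLB: I(lam) = 1/lam per observation, so Var(Xbar) = lam/n exactly.
random.seed(1)

def poisson(lam):
    # Knuth's multiplication method (fine for moderate lam)
    threshold, k, prod = math.exp(-lam), 0, random.random()
    while prod > threshold:
        k += 1
        prod *= random.random()
    return k

lam, n, reps = 4.0, 50, 20_000
means = [statistics.fmean(poisson(lam) for _ in range(n)) for _ in range(reps)]
var_hat = statistics.pvariance(means)
crlb = lam / n          # [g'(lam)]^2 / (n I(lam)) with g(lam) = lam
print(var_hat, crlb)    # the two agree up to Monte-Carlo error
```

Here equality holds because the Poisson family is a full exponential family and \(\bar{X}\) is its natural sufficient statistic.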

Section 2.3: Sufficiency and the Neyman Factorization Theorem

Sufficient Statistic. A statistic \(T = T(X)\) is sufficient for \(\theta\) if the conditional distribution of \(X\) given \(T(X) = t\) does not depend on \(\theta\).

Sufficiency means that \(T\) captures all information in the data about \(\theta\). The following theorem provides the standard criterion.

Neyman Factorization Theorem. \(T(X)\) is sufficient for \(\theta\) if and only if the density (or probability mass function) factorizes as \[ f(x; \theta) = g(T(x), \theta) \cdot h(x) \]

for nonnegative functions \(g\) and \(h\), where \(h\) does not depend on \(\theta\).

The exponential family \(f(x;\theta) = \exp(\eta(\theta)^T T(x) - A(\theta)) h(x)\) has \(T(x)\) as a natural sufficient statistic by the factorization theorem.

Section 2.4: Completeness, Basu’s Theorem, and UMVUE

Complete Statistic. A statistic \(T\) is complete for the family \(\{P_\theta\}\) if for any measurable function \(g\), \[ \mathbb{E}_\theta[g(T)] = 0 \text{ for all } \theta \implies P_\theta(g(T) = 0) = 1 \text{ for all } \theta. \]
Basu's Theorem. If \(T\) is a complete sufficient statistic for \(\theta\) and \(V\) is an ancillary statistic (whose distribution does not depend on \(\theta\)), then \(T\) and \(V\) are independent under every \(P_\theta\).
UMVUE. An estimator \(\delta^*(X)\) is a uniformly minimum variance unbiased estimator (UMVUE) of \(g(\theta)\) if it is unbiased and \(\text{Var}_\theta(\delta^*(X)) \leq \text{Var}_\theta(\delta(X))\) for every unbiased estimator \(\delta\) and every \(\theta \in \Theta\).
Rao-Blackwell Theorem. Let \(\delta(X)\) be any unbiased estimator of \(g(\theta)\) and \(T\) be sufficient for \(\theta\). Define \(\delta^*(T) = \mathbb{E}[\delta(X) \mid T]\). Then \(\delta^*(T)\) is unbiased and \(\text{Var}_\theta(\delta^*(T)) \leq \text{Var}_\theta(\delta(X))\) for all \(\theta\), with equality if and only if \(\delta(X) = \delta^*(T)\) a.s.
Proof sketch. Unbiasedness of \(\delta^*\) follows from the tower property: \(\mathbb{E}[\delta^*(T)] = \mathbb{E}[\mathbb{E}[\delta(X) \mid T]] = \mathbb{E}[\delta(X)] = g(\theta)\). The variance reduction follows from the law of total variance: \[ \text{Var}(\delta(X)) = \mathbb{E}[\text{Var}(\delta(X) \mid T)] + \text{Var}(\mathbb{E}[\delta(X) \mid T]) \geq \text{Var}(\delta^*(T)). \quad \square \]
Lehmann-Scheffé Theorem. If \(T\) is a complete sufficient statistic and \(\delta^*(T)\) is any unbiased estimator of \(g(\theta)\) that is a function of \(T\) alone, then \(\delta^*(T)\) is the unique UMVUE of \(g(\theta)\).

The uniqueness follows from completeness: if two functions of \(T\) are both unbiased for \(g(\theta)\), their difference has zero expectation for all \(\theta\), so by completeness they are equal almost surely.


Chapter 3: Classical Asymptotic Theory

Asymptotic theory studies the behavior of statistical procedures as the sample size \(n \to \infty\). This chapter develops the fundamental tools: modes of convergence, limit theorems, and the perturbation calculus of the delta method.

Section 3.1: Weak Convergence and the Portmanteau Theorem

Convergence in distribution (weak convergence) is the central mode of convergence in asymptotic statistics. The Portmanteau theorem provides a collection of equivalent characterizations.

Portmanteau Theorem. Let \(X_n, X\) be random variables with distributions \(\mu_n, \mu\) on \(\mathbb{R}\). The following are equivalent:
  1. \(\mu_n \xrightarrow{d} \mu\): \(\mathbb{E}[f(X_n)] \to \mathbb{E}[f(X)]\) for all bounded continuous \(f\).
  2. \(\mathbb{E}[f(X_n)] \to \mathbb{E}[f(X)]\) for all bounded uniformly continuous \(f\).
  3. \(\limsup_n \mu_n(F) \leq \mu(F)\) for all closed sets \(F\).
  4. \(\liminf_n \mu_n(G) \geq \mu(G)\) for all open sets \(G\).
  5. \(\mu_n(A) \to \mu(A)\) for all Borel sets \(A\) with \(\mu(\partial A) = 0\) (continuity sets).
  6. \(F_{X_n}(t) \to F_X(t)\) for all continuity points \(t\) of \(F_X\).

The equivalence of conditions (3), (4), and (5) is especially useful when dealing with tail probabilities.

Section 3.2: Tightness and Prohorov’s Theorem

Tightness. A sequence of probability measures \(\{\mu_n\}\) on \(\mathbb{R}\) is tight if for every \(\epsilon > 0\) there exists a compact set \(K_\epsilon\) such that \(\mu_n(K_\epsilon) \geq 1 - \epsilon\) for all \(n\).

Tightness prevents probability mass from escaping to infinity. It holds automatically whenever \(\sup_n \mathbb{E}[|X_n|] < \infty\), by Markov’s inequality.

Prohorov's Theorem. A sequence of probability measures is relatively compact (every subsequence has a weakly convergent subsequence) if and only if it is tight.
Helly's Selection Lemma. Every sequence of uniformly bounded, non-decreasing functions \(\{F_n\}\) on \(\mathbb{R}\) has a subsequence converging, at every continuity point of the limit, to a non-decreasing right-continuous function (which need not be a CDF, since mass may escape to \(\pm\infty\)).

Prohorov’s theorem strengthens Helly’s lemma by ensuring that the limit is a proper probability measure, precisely when tightness holds.

Section 3.3: \(O_p\) and \(o_p\) Notation

Stochastic order notation provides a concise language for the magnitude of random sequences.

\(X_n = O_p(a_n)\) if for every \(\epsilon > 0\) there exist \(M, N < \infty\) such that \(P(|X_n/a_n| > M) < \epsilon\) for all \(n > N\). Equivalently, the sequence \(X_n / a_n\) is tight.

\(X_n = o_p(a_n)\) if \(X_n / a_n \xrightarrow{P} 0\).

Key algebraic rules: \(O_p(a_n) \cdot O_p(b_n) = O_p(a_n b_n)\), \(o_p(a_n) + O_p(b_n) = O_p(\max(a_n, b_n))\), and if \(X_n \xrightarrow{d} X\) then \(X_n = O_p(1)\).

The continuous mapping theorem states: if \(X_n \xrightarrow{d} X\) and \(g : \mathbb{R} \to \mathbb{R}\) is continuous \(\mu\)-almost everywhere (where \(\mu\) is the law of \(X\)), then \(g(X_n) \xrightarrow{d} g(X)\). Slutsky’s theorem states: if \(X_n \xrightarrow{d} X\) and \(Y_n \xrightarrow{P} c\) (a constant), then \(X_n + Y_n \xrightarrow{d} X + c\) and \(X_n Y_n \xrightarrow{d} cX\).
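
Slutsky's theorem is what makes the studentized mean usable: \(\sqrt{n}(\bar{X}_n - \mu)/S_n \xrightarrow{d} N(0,1)\) because \(S_n \xrightarrow{P} \sigma\). A simulation sketch with Exp(1) data (so \(\mu = \sigma = 1\); seed and sizes are arbitrary):

```python
import math
import random
import statistics

# Slutsky in action: sqrt(n)(Xbar - mu)/S is asymptotically N(0,1) because
# S ->P sigma.  With Exp(1) data the two-sided 5% rejection rate should be
# close to the nominal 0.05.
random.seed(2)
n, reps = 200, 10_000
rejections = 0
for _ in range(reps):
    xs = [random.expovariate(1.0) for _ in range(n)]
    t = math.sqrt(n) * (statistics.fmean(xs) - 1.0) / statistics.stdev(xs)
    rejections += abs(t) > 1.96
rate = rejections / reps
print(rate)    # close to the nominal 0.05
```

The slight excess over 0.05 at moderate \(n\) comes from the skewness of the exponential distribution, a second-order effect invisible to the first-order normal approximation.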

Section 3.4: Laws of Large Numbers

Weak Law of Large Numbers (WLLN). Let \(X_1, X_2, \ldots\) be i.i.d. with \(\mathbb{E}[|X_1|] < \infty\) and mean \(\mu\). Then \(\bar{X}_n = n^{-1}\sum_{i=1}^n X_i \xrightarrow{P} \mu\).
Strong Law of Large Numbers (SLLN). Under the same conditions, \(\bar{X}_n \xrightarrow{\text{a.s.}} \mu\).

The SLLN requires more careful analysis via the Borel-Cantelli lemma. For the WLLN, Chebyshev’s inequality suffices when \(\text{Var}(X_1) < \infty\).

Section 3.5: Central Limit Theorems

Classical CLT. Let \(X_1, X_2, \ldots\) be i.i.d. with mean \(\mu\) and variance \(0 < \sigma^2 < \infty\). Then \[ \sqrt{n}\left(\bar{X}_n - \mu\right) \xrightarrow{d} N(0, \sigma^2). \]

The Lindeberg-Feller CLT extends this to triangular arrays of independent but not identically distributed random variables.

Lindeberg-Feller CLT. For each \(n\), let \(X_{n1}, \ldots, X_{nn}\) be independent with \(\mathbb{E}[X_{ni}] = 0\) and \(\sigma_{ni}^2 = \text{Var}(X_{ni}) < \infty\). Let \(s_n^2 = \sum_{i=1}^n \sigma_{ni}^2\). If the Lindeberg condition holds: \[ \frac{1}{s_n^2} \sum_{i=1}^n \mathbb{E}\!\left[X_{ni}^2 \mathbf{1}_{|X_{ni}| > \epsilon s_n}\right] \to 0 \quad \text{for all } \epsilon > 0, \]

then \(s_n^{-1} \sum_{i=1}^n X_{ni} \xrightarrow{d} N(0,1)\). Moreover, if the Feller condition \(\max_i \sigma_{ni}^2 / s_n^2 \to 0\) holds, then the Lindeberg condition is also necessary for the CLT.

The proof proceeds via the characteristic function approach: write the characteristic function of the normalized sum as a product, use the Lindeberg condition to show each factor converges to \(e^{-t^2/2}\), then invoke the continuity theorem for characteristic functions.

Lyapounov’s CLT is a simpler sufficient condition: if \(\mathbb{E}[|X_{ni}|^{2+\delta}] < \infty\) and \(s_n^{-(2+\delta)} \sum_i \mathbb{E}[|X_{ni}|^{2+\delta}] \to 0\) for some \(\delta > 0\), the Lindeberg condition holds, and hence the CLT follows.

Section 3.6: Berry-Esseen Bound

The CLT provides a qualitative statement about convergence to normality, but the Berry-Esseen theorem gives a quantitative rate.

Berry-Esseen Theorem. Let \(X_1, \ldots, X_n\) be i.i.d. with mean 0, variance \(\sigma^2\), and \(\rho = \mathbb{E}[|X_1|^3] < \infty\). Then \[ \sup_t \left|P\!\left(\frac{\sqrt{n}\bar{X}_n}{\sigma} \leq t\right) - \Phi(t)\right| \leq \frac{C\rho}{\sigma^3 \sqrt{n}}, \]

where \(C\) is an absolute constant (the best known value is approximately 0.4748) and \(\Phi\) is the standard normal CDF.

The bound is tight in the order \(n^{-1/2}\): the Bernoulli distribution achieves this order. The proof uses a smoothing inequality relating the characteristic function to the CDF, then bounds the difference between the characteristic functions of the standardized sum and the standard normal.
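
The Berry-Esseen bound can be checked exactly, with no simulation, for Rademacher variables \(X_i = \pm 1\) with probability \(1/2\) each: then \(\sigma = 1\) and \(\rho = \mathbb{E}|X_1|^3 = 1\), so the bound is \(0.4748/\sqrt{n}\). A sketch (the choice \(n = 100\) is arbitrary) that evaluates \(\sup_t\) over the atoms of the exact binomial distribution:

```python
import math

# Exact check of Berry-Esseen for Rademacher variables (+-1 w.p. 1/2):
# sigma = 1 and rho = E|X|^3 = 1, so the bound is 0.4748/sqrt(n).
def norm_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

n = 100
# S_n has atoms at n - 2k with probability C(n, k) / 2^n
atoms = [((n - 2 * k) / math.sqrt(n), math.comb(n, k) / 2 ** n)
         for k in range(n + 1)]
atoms.sort()
cdf, sup_diff = 0.0, 0.0
for a, p in atoms:
    sup_diff = max(sup_diff, abs(cdf - norm_cdf(a)))   # just below the atom
    cdf += p
    sup_diff = max(sup_diff, abs(cdf - norm_cdf(a)))   # at the atom
bound = 0.4748 / math.sqrt(n)
print(sup_diff, bound)    # sup_diff <= bound; both are of order n^{-1/2}
```

The supremum is attained at the atom nearest zero and is of genuine order \(n^{-1/2}\), confirming that the rate in the theorem cannot be improved for lattice distributions.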

Section 3.7: The Delta Method

The delta method extends CLTs to smooth functions of sample means.

Delta Method (First Order). If \(\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} N(0, \sigma^2)\) and \(g\) is differentiable at \(\theta\) with \(g'(\theta) \neq 0\), then \[ \sqrt{n}(g(\hat{\theta}_n) - g(\theta)) \xrightarrow{d} N(0, \left[g'(\theta)\right]^2 \sigma^2). \]
Proof. By a first-order Taylor expansion, \[ g(\hat{\theta}_n) - g(\theta) = g'(\theta)(\hat{\theta}_n - \theta) + o(|\hat{\theta}_n - \theta|). \]

Multiplying by \(\sqrt{n}\) and noting that \(|\hat{\theta}_n - \theta| = O_p(n^{-1/2})\), we see that \(\sqrt{n} \cdot o(|\hat{\theta}_n - \theta|) = o_p(1)\). By Slutsky’s theorem, \(\sqrt{n}(g(\hat{\theta}_n) - g(\theta)) \xrightarrow{d} g'(\theta) \cdot N(0,\sigma^2) = N(0, [g'(\theta)]^2 \sigma^2)\). \(\square\)

When \(g'(\theta) = 0\), a second-order expansion is needed: if \(g''(\theta)\) exists and is nonzero, then \(n(g(\hat{\theta}_n) - g(\theta)) \xrightarrow{d} \tfrac{1}{2}\sigma^2 g''(\theta)\, \chi^2_1\).

Variance-stabilizing transformations exploit the delta method. For a Poisson sample mean with \(\text{Var}(\bar{X}) = \lambda/n\), we have \(\sqrt{n}(\bar{X} - \lambda) \xrightarrow{d} N(0, \lambda)\). Setting \(g(\lambda) = 2\sqrt{\lambda}\) gives \([g'(\lambda)]^2 \lambda = 1\), so \(\sqrt{n}(2\sqrt{\bar{X}} - 2\sqrt{\lambda}) \xrightarrow{d} N(0,1)\) regardless of \(\lambda\). Similarly, for the binomial proportion \(\hat{p}\) with \(\text{Var}(\hat{p}) = p(1-p)/n\), the arcsine transformation \(g(p) = \arcsin(\sqrt{p})\) gives \([g'(p)]^2 p(1-p) = 1/4\), yielding the variance stabilization \(2\sqrt{n}(\arcsin\sqrt{\hat{p}} - \arcsin\sqrt{p}) \xrightarrow{d} N(0,1)\).
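
A simulation sketch of the Poisson square-root transformation (seed, sample sizes, and the `poisson` helper via Knuth's method are illustrative assumptions): the scaled variance \(n \cdot \text{Var}(2\sqrt{\bar{X}})\) should be near 1 for every \(\lambda\).

```python
import math
import random
import statistics

# g(lam) = 2 sqrt(lam) stabilizes the Poisson variance:
# n * Var(2 sqrt(Xbar)) should be close to 1 for every lam.
random.seed(3)

def poisson(lam):
    # Knuth's multiplication method (fine for moderate lam)
    threshold, k, prod = math.exp(-lam), 0, random.random()
    while prod > threshold:
        k += 1
        prod *= random.random()
    return k

n, reps = 50, 3_000
stabilized = {}
for lam in (2.0, 8.0, 20.0):
    vals = [2.0 * math.sqrt(statistics.fmean(poisson(lam) for _ in range(n)))
            for _ in range(reps)]
    stabilized[lam] = n * statistics.pvariance(vals)
print(stabilized)    # all three values are near 1, independent of lam
```

The untransformed variance \(n \cdot \text{Var}(\bar{X}) = \lambda\) would range over a factor of 10 across these three values of \(\lambda\); after transformation all three are close to 1.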


Chapter 4: Empirical Processes

The theory of empirical processes studies the uniform behavior of sample-based functions over classes of sets or functions. It provides the theoretical foundation for nonparametric statistics and statistical learning theory.

Section 4.1: The Empirical Measure and Empirical CDF

Given i.i.d. observations \(X_1, \ldots, X_n\) from distribution \(P\) on \((\mathcal{X}, \mathcal{A})\), the empirical measure is

\[ \mathbb{P}_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}, \]

where \(\delta_x\) is the Dirac measure at \(x\). For any function \(f\), \(\mathbb{P}_n f = n^{-1}\sum_{i=1}^n f(X_i)\). The empirical CDF is

\[ \mathbb{F}_n(t) = \mathbb{P}_n((-\infty, t]) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{X_i \leq t}. \]

At each fixed \(t\), \(\mathbb{F}_n(t)\) is an unbiased estimator of \(F(t)\) with variance \(F(t)(1-F(t))/n\), so \(\sqrt{n}(\mathbb{F}_n(t) - F(t)) \xrightarrow{d} N(0, F(t)(1-F(t)))\) by the CLT.

Section 4.2: The Glivenko-Cantelli Theorem

Glivenko-Cantelli Theorem. Let \(X_1, X_2, \ldots\) be i.i.d. from CDF \(F\). Then \[ \sup_t |\mathbb{F}_n(t) - F(t)| \xrightarrow{\text{a.s.}} 0. \]
Proof sketch. Fix \(\epsilon > 0\). Choose finitely many points \(-\infty = t_0 < t_1 < \cdots < t_k = \infty\) such that \(F(t_j^-) - F(t_{j-1}) \leq \epsilon/2\) for all \(j\) (possible because this gap excludes the jumps at the endpoints). By the SLLN, \(\mathbb{F}_n(t_j) \to F(t_j)\) and \(\mathbb{F}_n(t_j^-) \to F(t_j^-)\) almost surely for each \(j\). For any \(t \in [t_{j-1}, t_j)\), monotonicity gives \[ \mathbb{F}_n(t) - F(t) \leq \mathbb{F}_n(t_j^-) - F(t_{j-1}) = [\mathbb{F}_n(t_j^-) - F(t_j^-)] + [F(t_j^-) - F(t_{j-1})] \leq [\mathbb{F}_n(t_j^-) - F(t_j^-)] + \epsilon/2, \]

with a symmetric lower bound. Taking \(n\) large enough that \(|\mathbb{F}_n(t_j) - F(t_j)| < \epsilon/2\) and \(|\mathbb{F}_n(t_j^-) - F(t_j^-)| < \epsilon/2\) simultaneously for all \(j\) (by the SLLN on the finite set) yields \(\sup_t |\mathbb{F}_n(t) - F(t)| < \epsilon\). Since \(\epsilon\) was arbitrary, the result follows. \(\square\)

Section 4.3: The DKW Inequality

The Glivenko-Cantelli theorem guarantees almost sure uniform convergence but gives no rate. Massart’s version of the Dvoretzky-Kiefer-Wolfowitz inequality provides a sharp exponential bound.

Massart's DKW Inequality. For i.i.d. observations from any CDF \(F\), \[ P\!\left(\sup_t |\mathbb{F}_n(t) - F(t)| > \epsilon\right) \leq 2e^{-2n\epsilon^2} \]

for all \(\epsilon > 0\). The constant 2 is sharp.

This inequality is remarkable: the bound is distribution-free and depends on the sample size and \(\epsilon\) only through \(n\epsilon^2\). It implies that with probability at least \(1 - \delta\), the confidence band

\[ \left[\mathbb{F}_n(t) \pm \sqrt{\frac{\log(2/\delta)}{2n}}\right] \]

contains \(F(t)\) uniformly in \(t\). This is the Dvoretzky-Kiefer-Wolfowitz confidence band, which is exactly distribution-free.
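
A coverage check of this band by simulation (a sketch: seed, \(n\), and replication count are arbitrary), using \(U(0,1)\) data so that \(F(x) = x\) and the Kolmogorov-Smirnov statistic can be computed at the order statistics:

```python
import math
import random

# Coverage of the DKW band at level 1 - delta = 0.95 with U(0,1) data
# (F(x) = x): the band should contain F uniformly in >= ~95% of runs.
random.seed(4)
n, reps, delta = 200, 2_000, 0.05
eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
covered = 0
for _ in range(reps):
    xs = sorted(random.random() for _ in range(n))
    # sup_t |F_n(t) - t| is attained at the order statistics
    d_n = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
    covered += d_n <= eps
print(covered / reps, 1 - delta)   # empirical coverage vs nominal level
```

Because the DKW inequality is an upper bound on the failure probability, the empirical coverage typically sits slightly above the nominal \(1 - \delta\); the band is mildly conservative.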

Section 4.4: Donsker’s Theorem

The uniform CLT for the empirical process is Donsker’s theorem. Define the centered and scaled empirical process:

\[ \mathbb{G}_n(t) = \sqrt{n}(\mathbb{F}_n(t) - F(t)). \]
Donsker's Theorem. The empirical process \(\mathbb{G}_n\) converges weakly in \(\ell^\infty(\mathbb{R})\) (equipped with the sup-norm) to the Brownian bridge \(\mathbb{B}_F\), where \(\mathbb{B}_F(t) = \mathbb{B}(F(t))\) and \(\mathbb{B}\) is the standard Brownian bridge on \([0,1]\). Specifically, \[ \mathbb{G}_n \xrightarrow{d} \mathbb{B}_F \quad \text{in } \ell^\infty(\mathbb{R}), \]

where the Brownian bridge \(\mathbb{B}\) is a Gaussian process with \(\mathbb{E}[\mathbb{B}(s)] = 0\) and \(\text{Cov}(\mathbb{B}(s), \mathbb{B}(t)) = \min(s,t) - st\).

As a consequence, \(\sqrt{n} \sup_t |\mathbb{F}_n(t) - F(t)| \xrightarrow{d} \sup_{t \in [0,1]} |\mathbb{B}(t)|\), the Kolmogorov distribution, which underlies the Kolmogorov-Smirnov test.

Section 4.5: VC Theory

VC (Vapnik-Chervonenkis) theory provides uniform laws of large numbers for classes of indicator functions.

VC Dimension. Let \(\mathcal{C}\) be a class of subsets of \(\mathcal{X}\). The shatter coefficient (or growth function) is \[ m(\mathcal{C}, n) = \max_{x_1, \ldots, x_n} |\{(1_{C}(x_1), \ldots, 1_C(x_n)) : C \in \mathcal{C}\}|. \]

The VC dimension \(V(\mathcal{C})\) is the largest \(n\) such that \(m(\mathcal{C}, n) = 2^n\) (i.e., some set of \(n\) points can be shattered by \(\mathcal{C}\)).

Sauer-Shelah Lemma. If \(V(\mathcal{C}) = d < \infty\), then \[ m(\mathcal{C}, n) \leq \sum_{j=0}^d \binom{n}{j} \leq \left(\frac{en}{d}\right)^d. \]
Proof. We prove \(m(\mathcal{C}, n) \leq \sum_{j=0}^d \binom{n}{j}\) by induction on \(n + d\). The base cases \(n = 0\) or \(d = 0\) are immediate. For the inductive step, consider \(n\) points \(x_1, \ldots, x_n\) and partition the collection of subsets realized by \(\mathcal{C}\) into those that include or exclude \(x_n\). The subsets not involving \(x_n\) are realized on \(x_1, \ldots, x_{n-1}\), giving at most \(\sum_{j=0}^d \binom{n-1}{j}\) such patterns. The pairs correspond to sets where \(x_n\)'s inclusion changes the pattern on \(x_1, \ldots, x_{n-1}\); these can only occur if \(\mathcal{C}\) restricted to \(x_1, \ldots, x_{n-1}\) has VC dimension at most \(d-1\) (otherwise we could shatter a set of size \(d\) among the first \(n-1\) points along with \(x_n\)), giving at most \(\sum_{j=0}^{d-1} \binom{n-1}{j}\) pairs. Combining: \(m(\mathcal{C},n) \leq \sum_{j=0}^d \binom{n-1}{j} + \sum_{j=0}^{d-1}\binom{n-1}{j} = \sum_{j=0}^d \binom{n}{j}\) by Pascal's identity. \(\square\)
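
The Sauer-Shelah bound can be verified by brute force for a concrete class. For closed intervals on the line (VC dimension \(d = 2\)), an interval picks out a contiguous block of the sorted points, so the realized patterns can be enumerated directly; this sketch (the range of \(n\) is arbitrary) checks the bound, which intervals in fact attain with equality:

```python
import math

# Brute-force check of the Sauer-Shelah bound for the class of closed
# intervals on the line (VC dimension d = 2).  An interval picks out a
# contiguous block of the sorted points.
def shatter_coefficient(n):
    patterns = {frozenset()}                  # the empty pattern
    for i in range(n):
        for j in range(i, n):
            patterns.add(frozenset(range(i, j + 1)))
    return len(patterns)

d = 2
results = []
for n in range(1, 9):
    m = shatter_coefficient(n)
    bound = sum(math.comb(n, j) for j in range(d + 1))
    results.append((n, m, bound))
    print(n, m, bound)    # intervals attain the Sauer-Shelah bound exactly
```

Here \(m(\mathcal{C}, n) = 1 + n(n+1)/2 = \binom{n}{0} + \binom{n}{1} + \binom{n}{2}\), so intervals are a "maximum" class: the polynomial growth guaranteed by the lemma is achieved with equality.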
VC Inequality. For a class \(\mathcal{C}\) with VC dimension \(d\), \[ \mathbb{E}\!\left[\sup_{C \in \mathcal{C}} |\mathbb{P}_n(C) - P(C)|\right] \leq c\sqrt{\frac{d \log n}{n}}, \]

where \(c\) is a universal constant, and with high probability the supremum is of the same order \(\sqrt{d \log n / n}\); a chaining argument removes the \(\log n\) factor.

Section 4.6: Concentration Inequalities

Concentration inequalities quantify how a random variable concentrates around its mean or median.

Hoeffding's Inequality. Let \(X_1, \ldots, X_n\) be independent with \(a_i \leq X_i \leq b_i\). Then for any \(t > 0\), \[ P\!\left(\sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \geq t\right) \leq \exp\!\left(\frac{-2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right). \]

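An empirical sketch of Hoeffding's inequality for fair coin flips (seed and constants arbitrary): with \(a_i = 0\), \(b_i = 1\), the bound on \(P(S_n - n/2 \geq t)\) is \(e^{-2t^2/n}\).

```python
import math
import random

# Empirical check of Hoeffding's inequality for n fair coin flips
# (a_i = 0, b_i = 1): P(S_n - n/2 >= t) <= exp(-2 t^2 / n).
random.seed(5)
n, reps, t = 100, 20_000, 10.0
exceed = 0
for _ in range(reps):
    s = sum(random.random() < 0.5 for _ in range(n))
    exceed += (s - n / 2) >= t
empirical = exceed / reps
bound = math.exp(-2.0 * t ** 2 / n)
print(empirical, bound)   # the empirical tail sits well below the bound
```

The gap (roughly 0.03 observed versus the bound \(e^{-2} \approx 0.135\)) is typical: Hoeffding uses only the range of the summands, not their variance, which Bernstein-type inequalities exploit.
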
A random variable \(X\) is sub-Gaussian with parameter \(\sigma^2\) if \(\mathbb{E}[e^{\lambda X}] \leq e^{\lambda^2 \sigma^2 / 2}\) for all \(\lambda \in \mathbb{R}\). Sub-Gaussian random variables satisfy Gaussian-type tail bounds: \(P(X > t) \leq e^{-t^2/(2\sigma^2)}\).

Sub-exponential random variables are heavier-tailed: \(X\) is sub-exponential with parameters \((\nu^2, b)\) if \(\mathbb{E}[e^{\lambda X}] \leq e^{\nu^2 \lambda^2/2}\) for \(|\lambda| \leq 1/b\). A convenient sufficient condition is Bernstein’s moment condition: if \(\mathbb{E}[|X - \mathbb{E}X|^k] \leq \frac{1}{2} k!\, \nu^2 b^{k-2}\) for all \(k \geq 2\), then \(X - \mathbb{E}X\) is sub-exponential. Centered chi-squared variables, products of Gaussian variables, and bounded variables all fall into this class.


Chapter 5: Von Mises Calculus and Functional Delta Method

Von Mises (1947) introduced a calculus for statistical functionals that extends the scalar delta method to functionals of the empirical distribution. This framework unifies many nonparametric estimators and provides their asymptotic distributions.

Section 5.1: Statistical Functionals

Statistical Functional. A statistical functional is a mapping \(T : \mathcal{P} \to \mathbb{R}\) (or to \(\mathbb{R}^k\)), where \(\mathcal{P}\) is a set of probability distributions on \(\mathcal{X}\). The plug-in estimator of \(T(P)\) is \(T(\mathbb{P}_n)\).

Examples: the mean functional \(T(F) = \int x \, dF(x)\), the variance functional \(T(F) = \int (x - \int y \, dF(y))^2 \, dF(x)\), the quantile functional \(T(F) = F^{-1}(p)\), and the Mann-Whitney functional \(T(F,G) = \int F(x) \, dG(x)\).

Section 5.2: Hadamard Differentiability

Hadamard Differentiability. A functional \(T : \mathbb{D} \to \mathbb{E}\) (between normed spaces) is Hadamard differentiable at \(F \in \mathbb{D}\) tangentially to \(\mathbb{D}_0 \subseteq \mathbb{D}\) if there exists a continuous linear map \(T'_F : \mathbb{D}_0 \to \mathbb{E}\) such that \[ \frac{T(F + t_n h_n) - T(F)}{t_n} \to T'_F(h) \]

for all sequences \(t_n \to 0\) and \(h_n \to h\) in \(\mathbb{D}_0\).

Hadamard differentiability is weaker than Fréchet differentiability (which requires uniform convergence over all bounded sets of directions) but stronger than Gâteaux differentiability (which only considers fixed directions). Hadamard differentiability is precisely what is needed for the functional delta method.

Section 5.3: The Influence Function

Influence Function. The influence function of \(T\) at \(F\) is \[ \text{IF}(x; T, F) = \lim_{t \to 0} \frac{T((1-t)F + t\delta_x) - T(F)}{t}, \]

the Gâteaux derivative of \(T\) at \(F\) in the direction \(\delta_x - F\).

The influence function measures the effect of a small point mass contamination at \(x\) on the functional. It plays the role of a “derivative” that determines the first-order asymptotic behavior of \(T(\mathbb{P}_n)\).

Example: Sample Mean. For \(T(F) = \int x \, dF(x)\), \(\text{IF}(x; T, F) = x - \int y \, dF(y) = x - \mu\).

Example: Sample Variance. For \(T(F) = \int (x - \mu)^2 \, dF(x)\), \(\text{IF}(x; T, F) = (x - \mu)^2 - \sigma^2\).
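
The variance example can be checked numerically: evaluate the Gâteaux difference quotient \([T((1-t)F + t\delta_x) - T(F)]/t\) at a small \(t\) and compare with \((x-\mu)^2 - \sigma^2\). A sketch with an illustrative discrete \(F\) and contamination point \(x = 5\):

```python
# Numerical Gateaux derivative of the variance functional at a discrete F,
# compared with the closed-form influence function (x - mu)^2 - sigma^2.
def variance_functional(dist):
    mu = sum(v * p for v, p in dist.items())
    return sum(p * (v - mu) ** 2 for v, p in dist.items())

F = {0.0: 0.25, 1.0: 0.25, 2.0: 0.25, 3.0: 0.25}
mu, sigma2 = 1.5, variance_functional(F)          # mu = 1.5, sigma2 = 1.25
x, t = 5.0, 1e-6
contaminated = {v: (1 - t) * p for v, p in F.items()}
contaminated[x] = contaminated.get(x, 0.0) + t    # (1 - t) F + t delta_x
finite_diff = (variance_functional(contaminated) - sigma2) / t
influence = (x - mu) ** 2 - sigma2
print(finite_diff, influence)     # both equal (5 - 1.5)^2 - 1.25 = 11
```

The unboundedness of \((x-\mu)^2 - \sigma^2\) in \(x\) is exactly the sense in which the sample variance is not robust: a single distant contamination point moves the functional without limit.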

Section 5.4: The Functional Delta Method

Functional Delta Method. If \(T\) is Hadamard differentiable at \(P\) tangentially to \(\mathbb{D}_0\), and \(\sqrt{n}(\mathbb{P}_n - P) \xrightarrow{d} \mathbb{G}\) in \(\mathbb{D}_0\) (with \(\mathbb{G}\) a tight Borel measurable map), then \[ \sqrt{n}(T(\mathbb{P}_n) - T(P)) \xrightarrow{d} T'_P(\mathbb{G}). \]

For real-valued \(T\) with influence function \(\text{IF}(x; T, P)\), the linear functional \(T'_P(h) = \int \text{IF}(x; T, P) \, dh(x)\). By Donsker’s theorem and the functional delta method:

\[ \sqrt{n}(T(\mathbb{P}_n) - T(P)) \xrightarrow{d} N\!\left(0, \int \text{IF}^2(x; T, P) \, dP(x)\right), \]

provided \(\int \text{IF}^2 \, dP < \infty\) and \(\int \text{IF}(x; T, P) \, dP(x) = 0\) (which holds for Fisher-consistent \(T\)).

Section 5.5: U-Statistics

U-statistics are another class of estimators for which the influence function approach yields clean asymptotics.

U-statistic. Given a symmetric kernel \(h : \mathcal{X}^m \to \mathbb{R}\) with \(\theta = \mathbb{E}[h(X_1, \ldots, X_m)]\), the U-statistic of degree \(m\) is \[ U_n = \binom{n}{m}^{-1} \sum_{1 \leq i_1 < \cdots < i_m \leq n} h(X_{i_1}, \ldots, X_{i_m}). \]

Define the projection \(h_1(x) = \mathbb{E}[h(x, X_2, \ldots, X_m)] - \theta\). Hoeffding’s decomposition gives

\[ \sqrt{n}(U_n - \theta) = \sqrt{n} \cdot \frac{m}{n} \sum_{i=1}^n h_1(X_i) + o_p(1) \xrightarrow{d} N(0, m^2 \sigma_1^2), \]

where \(\sigma_1^2 = \text{Var}(h_1(X_1))\). The leading term is \(m\) times the sample average of \(h_1\), so the asymptotic variance is \(m^2 \sigma_1^2\). The two-sample Mann-Whitney statistic \(U_{mn} = (mn)^{-1}\sum_{i,j} \mathbf{1}_{X_i < Y_j}\) is a canonical example of a (generalized, two-sample) U-statistic.
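For intuition, the symmetric kernel \(h(x, y) = (x - y)^2/2\) has \(\mathbb{E}\,h = \text{Var}(X)\), and the resulting degree-2 U-statistic is exactly the unbiased sample variance. A small illustration (standard library only; not from the course materials):

```python
import itertools
import statistics

def u_statistic(data, kernel, m):
    """U-statistic of degree m: average of the kernel over all C(n, m) subsets."""
    vals = [kernel(*c) for c in itertools.combinations(data, m)]
    return sum(vals) / len(vals)

# Kernel h(x, y) = (x - y)^2 / 2 is symmetric with E h = Var(X),
# so the degree-2 U-statistic coincides with the unbiased sample variance.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
u = u_statistic(x, lambda a, b: (a - b) ** 2 / 2, 2)
print(u, statistics.variance(x))  # the two values agree
```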


Chapter 6: Asymptotic Approximations

Beyond the first-order normal approximation, Edgeworth expansions provide corrections that account for skewness and kurtosis, while Laplace’s method approximates integrals with a Gaussian kernel centered at the mode.

Section 6.1: Edgeworth Expansions

Let \(X_1, \ldots, X_n\) be i.i.d. with mean 0, variance \(\sigma^2\), skewness \(\kappa_3 = \mathbb{E}[X^3]\), and excess kurtosis \(\kappa_4 = \mathbb{E}[X^4] - 3\sigma^4\). Let \(S_n = (X_1 + \cdots + X_n)/(\sigma\sqrt{n})\). The Edgeworth expansion of the CDF of \(S_n\) to order \(n^{-1/2}\) is:

\[ P(S_n \leq x) = \Phi(x) - \phi(x) \frac{\kappa_3}{6\sigma^3 \sqrt{n}} H_2(x) + O(n^{-1}), \]

where \(\phi\) is the standard normal density, and \(H_2(x) = x^2 - 1\) is the second Hermite polynomial. The correction term reflects the skewness of the distribution and is of order \(n^{-1/2}\).

To order \(n^{-1}\), additional terms involving \(\kappa_4\) and \(\kappa_3^2\) appear:

\[ P(S_n \leq x) = \Phi(x) - \phi(x)\left[\frac{\kappa_3}{6\sigma^3\sqrt{n}} H_2(x) + \frac{1}{n}\left(\frac{\kappa_4}{24\sigma^4} H_3(x) + \frac{\kappa_3^2}{72\sigma^6} H_5(x)\right)\right] + O(n^{-3/2}), \]

where \(H_k\) are the (probabilists') Hermite polynomials, which satisfy \(H_k'(x) = k H_{k-1}(x)\) and \(-\frac{d}{dx}\!\left[\phi(x) H_{k-1}(x)\right] = \phi(x) H_k(x)\).
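The first-order correction can be checked against an exact computation. For \(X_i \sim \text{Exp}(1)\) (so \(\sigma = 1\) and \(\kappa_3 = 2\)), the sum is Gamma-distributed, and the Gamma CDF with integer shape is available in closed form via the Poisson tail identity. A sketch (standard library only; not from the course materials):

```python
import math

def gamma_cdf(x, n):
    """P(Gamma(n, 1) <= x) for integer shape n, via the Poisson tail identity."""
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= x / k
        total += term
    return 1.0 - math.exp(-x) * total

def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

n = 50
kappa3 = 2.0  # third central moment of Exp(1); here sigma = 1
for x in (-1.0, 0.0, 1.0):
    exact = gamma_cdf(n + x * math.sqrt(n), n)      # P(S_n <= x), exactly
    normal = Phi(x)
    edgeworth = Phi(x) - phi(x) * kappa3 / (6 * math.sqrt(n)) * (x * x - 1)
    print(x, exact, normal, edgeworth)
```

At \(x = 0\) the normal approximation errs by about \(0.019\) while the Edgeworth-corrected value is accurate to roughly \(10^{-4}\); at \(x = \pm 1\) the correction vanishes because \(H_2(\pm 1) = 0\).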

The Cornish-Fisher expansion inverts the Edgeworth expansion to give quantile approximations: the \(p\)-th quantile of \(S_n\) satisfies

\[ q_{n,p} = z_p + \frac{\kappa_3}{6\sigma^3 \sqrt{n}}(z_p^2 - 1) + O(n^{-1}), \]

where \(z_p = \Phi^{-1}(p)\). This correction improves upon the naive approximation \(q_{n,p} \approx z_p\).

Section 6.2: Laplace Approximation

The Laplace approximation evaluates integrals of the form \(\int e^{n h(x)} dx\) by expanding around the mode.

Laplace Approximation. Suppose \(h : \mathbb{R} \to \mathbb{R}\) is smooth with a unique maximum at \(x_0\) satisfying \(h'(x_0) = 0\) and \(h''(x_0) < 0\). Then as \(n \to \infty\), \[ \int_{-\infty}^\infty e^{n h(x)} \, dx = \sqrt{\frac{2\pi}{n |h''(x_0)|}} \, e^{n h(x_0)} \left(1 + O(n^{-1})\right). \]
Proof sketch. Taylor-expand \(h\) around \(x_0\): \(h(x) \approx h(x_0) + \frac{1}{2} h''(x_0)(x-x_0)^2\). Substituting: \[ \int e^{nh(x)} dx \approx e^{nh(x_0)} \int e^{n h''(x_0)(x-x_0)^2/2} dx = e^{nh(x_0)} \sqrt{\frac{2\pi}{n|h''(x_0)|}}, \]

where we used the Gaussian integral \(\int e^{-a u^2/2} du = \sqrt{2\pi/a}\) for \(a = n|h''(x_0)|\). Higher-order error terms arise from the cubic and quartic terms in the Taylor expansion. \(\square\)
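A classical sanity check is Stirling's formula, which is the Laplace approximation applied to \(n! = \int_0^\infty x^n e^{-x}\,dx\) after the substitution \(x = nt\). A short illustration (not from the course materials):

```python
import math

def laplace_approx(h, d2h, x0, n):
    """Laplace approximation to the integral of exp(n h(x)) around the mode x0."""
    return math.sqrt(2 * math.pi / (n * abs(d2h(x0)))) * math.exp(n * h(x0))

# Stirling's formula: n! = n^(n+1) * integral of exp(n (log t - t)) dt,
# with h(t) = log t - t maximized at t = 1, where h''(1) = -1.
n = 20
approx = n ** (n + 1) * laplace_approx(lambda t: math.log(t) - t,
                                       lambda t: -1.0 / t ** 2, 1.0, n)
exact = math.factorial(n)
print(approx / exact)  # close to 1 - 1/(12n), the first correction term
```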

Application: Bayesian Posterior Approximation (Bernstein-von Mises Theorem). For a regular parametric model with likelihood \(L(\theta) = \prod_{i=1}^n f(X_i; \theta)\) and prior \(\pi(\theta)\), the posterior is

\[ \pi(\theta \mid X_1, \ldots, X_n) \propto L(\theta) \pi(\theta). \]

Taking \(h(\theta) = n^{-1}(\log L(\theta) + \log\pi(\theta))\), the Laplace approximation gives a Gaussian approximation to the posterior centered at the posterior mode (which agrees with the MLE \(\hat{\theta}_n\) to first order) with variance \((\mathcal{I}(\theta_0))^{-1}/n\). The Bernstein-von Mises theorem formalizes this: under regularity conditions, the total variation distance between the posterior \(\pi(\cdot \mid X_1,\ldots,X_n)\) and \(N(\hat{\theta}_n, n^{-1}\mathcal{I}(\theta_0)^{-1})\) converges to zero in \(P_{\theta_0}\)-probability.


Chapter 7: Resampling Methods

Resampling methods are computationally intensive alternatives to asymptotic approximations. The jackknife and bootstrap estimate the sampling distribution of a statistic without deriving analytic formulas, and permutation tests provide exact finite-sample level guarantees.

Section 7.1: The Jackknife

Let \(\hat{\theta}_n = \theta(X_1, \ldots, X_n)\) be an estimator. Let \(\hat{\theta}_{n,-i}\) denote the estimator computed on the sample with \(X_i\) deleted, and let \(\bar{\theta}_{.} = n^{-1}\sum_{i=1}^n \hat{\theta}_{n,-i}\).

Jackknife Bias Estimate. \[ \hat{b}_J = (n-1)\left(\bar{\theta}_{.} - \hat{\theta}_n\right). \]

The jackknife bias-corrected estimator is \(\hat{\theta}_J = \hat{\theta}_n - \hat{b}_J = n\hat{\theta}_n - (n-1)\bar{\theta}_{.}\).

For a smooth functional \(T(F)\), the jackknife bias estimator satisfies \(\hat{b}_J = O_p(n^{-1})\), and the corrected estimator \(\hat{\theta}_J\) has bias of order \(n^{-2}\) rather than \(n^{-1}\). The jackknife variance estimator is

\[ \hat{V}_J = \frac{n-1}{n} \sum_{i=1}^n (\hat{\theta}_{n,-i} - \bar{\theta}_{.})^2. \]

For smooth statistics (differentiable functionals of the empirical distribution), \(\hat{V}_J\) is a consistent estimator of \(\text{Var}(\hat{\theta}_n)\). The delete-\(d\) jackknife generalizes this by deleting \(d\) observations at a time, which is necessary for consistent variance estimation when \(\hat{\theta}\) is not a smooth functional (e.g., extreme order statistics).
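A small illustration (assuming NumPy; not from the course materials), using \(\hat{\theta}_n = \bar{X}^2\), whose bias \(\sigma^2/n\) the jackknife removes. For this particular statistic a short calculation shows \(\hat{b}_J = s^2/n\) exactly, where \(s^2\) is the unbiased sample variance:

```python
import numpy as np

def jackknife(data, estimator):
    """Jackknife bias and variance estimates for a statistic of an i.i.d. sample."""
    n = len(data)
    theta_hat = estimator(data)
    loo = np.array([estimator(np.delete(data, i)) for i in range(n)])  # leave-one-out
    theta_bar = loo.mean()
    bias = (n - 1) * (theta_bar - theta_hat)
    variance = (n - 1) / n * np.sum((loo - theta_bar) ** 2)
    return theta_hat - bias, bias, variance   # corrected estimate, b_J, V_J

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=200)
# The plug-in statistic (mean)^2 has bias Var(X)/n; the jackknife estimates it.
corrected, bias, var = jackknife(data, lambda x: x.mean() ** 2)
print(corrected, bias, var)
```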

Section 7.2: The Bootstrap

Efron’s bootstrap (1979) estimates the sampling distribution of \(\hat{\theta}_n\) by resampling from the empirical distribution \(\mathbb{P}_n\).

The Bootstrap Algorithm:

  1. Draw \(X_1^*, \ldots, X_n^*\) i.i.d. from \(\mathbb{P}_n\) (sampling with replacement from the observed data).
  2. Compute \(\hat{\theta}_n^* = T(\mathbb{P}_n^*)\) where \(\mathbb{P}_n^*\) is the empirical measure of the bootstrap sample.
  3. Repeat \(B\) times to obtain \(\hat{\theta}_{n,1}^*, \ldots, \hat{\theta}_{n,B}^*\).
  4. Approximate the distribution of \(\hat{\theta}_n - T(P)\) by the conditional distribution of \(\hat{\theta}_n^* - \hat{\theta}_n\) given the data.
Bootstrap Consistency. Let \(T(\mathbb{P}_n)\) be a smooth statistical functional (Hadamard differentiable at \(P\)). Then the bootstrap distribution consistently estimates the sampling distribution of \(\sqrt{n}(T(\mathbb{P}_n) - T(P))\) in the sense that \[ \sup_x \left|P^*\!\left(\sqrt{n}(T(\mathbb{P}_n^*) - T(\mathbb{P}_n)) \leq x\right) - P\!\left(\sqrt{n}(T(\mathbb{P}_n) - T(P)) \leq x\right)\right| \xrightarrow{P} 0. \]

The proof uses the functional delta method: under the bootstrap, \(\sqrt{n}(\mathbb{P}_n^* - \mathbb{P}_n) \xrightarrow{d^*} \mathbb{G}_P\) (Brownian bridge) conditionally in probability, so by Hadamard differentiability, the bootstrap distribution of \(\sqrt{n}(T(\mathbb{P}_n^*) - T(\mathbb{P}_n))\) converges to the true limiting distribution \(T'_P(\mathbb{G}_P)\).

Bootstrap Confidence Intervals:

  • Percentile interval: \(\left[\hat{\theta}^*_{(\alpha/2)}, \hat{\theta}^*_{(1-\alpha/2)}\right]\) where quantiles are taken from the bootstrap distribution.
  • Basic (reflected) interval: \(\left[2\hat{\theta}_n - \hat{\theta}^*_{(1-\alpha/2)}, 2\hat{\theta}_n - \hat{\theta}^*_{(\alpha/2)}\right]\).
  • Bootstrap-\(t\) interval: \(\left[\hat{\theta}_n - z^*_{1-\alpha/2} \hat{\text{se}}, \hat{\theta}_n - z^*_{\alpha/2} \hat{\text{se}}\right]\) where \(z^*\) are quantiles of the bootstrapped t-statistic.
  • BCa (Bias-corrected and accelerated): Adjusts for bias and skewness using a bias-correction constant \(z_0\) and an acceleration constant \(a\), achieving second-order accuracy.
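The resampling loop and the first two intervals can be sketched as follows (illustrative code, assuming NumPy; the median and \(B = 2000\) are arbitrary choices, not prescriptions from the text):

```python
import numpy as np

def bootstrap_dist(data, statistic, B=2000, rng=None):
    """B bootstrap replicates: resample n points with replacement, recompute."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(data)
    return np.array([statistic(data[rng.integers(0, n, n)]) for _ in range(B)])

def percentile_and_basic_ci(data, statistic, alpha=0.05, B=2000):
    theta_hat = statistic(data)
    boot = bootstrap_dist(data, statistic, B)
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    # Percentile interval uses the bootstrap quantiles directly;
    # the basic interval reflects them around the point estimate.
    return (lo, hi), (2 * theta_hat - hi, 2 * theta_hat - lo)

rng = np.random.default_rng(1)
data = rng.exponential(scale=1.0, size=100)
pct, basic = percentile_and_basic_ci(data, np.median)
print("percentile:", pct, "basic:", basic)
```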

Section 7.3: Permutation Tests

Permutation tests provide exact finite-sample level guarantees under the null hypothesis of exchangeability.

Exactness of Permutation Tests. Under the null hypothesis \(H_0\) that \(X_1, \ldots, X_n\) are exchangeable, the permutation test that rejects when the observed test statistic exceeds the \((1-\alpha)\)-quantile of its permutation distribution has exact size \(\alpha\) (attainable exactly when \(\alpha\) is a multiple of \(1/n!\)).
Proof. Under \(H_0\), all \(n!\) permutations of the data are equally likely. The permutation distribution of any statistic \(T\) is the uniform distribution over the \(n!\) values \(\{T(\sigma(X_1,\ldots,X_n)) : \sigma \in S_n\}\). The observed statistic is exactly one of these equally likely values, so \(P(T \geq q_{1-\alpha}) = \alpha\) exactly (modulo discreteness). \(\square\)

The advantage over the bootstrap is exactness; the disadvantage is that validity requires exchangeability under \(H_0\), which may be a strong assumption. The bootstrap is more flexible but only approximately valid. For two-sample testing, permutation tests require that the two populations have the same distribution under \(H_0\) (not just the same mean), which is the complete null hypothesis.
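A minimal two-sample permutation test using the absolute difference of means (standard library only; not from the course materials). A random subset of permutations approximates the full permutation distribution, with an add-one correction preserving validity at finite Monte Carlo size:

```python
import random
import statistics

def permutation_pvalue(x, y, num_perm=5000, seed=0):
    """Two-sample permutation test of H0: same distribution, via |mean difference|."""
    rng = random.Random(seed)
    pooled = x + y
    n = len(x)
    observed = abs(statistics.fmean(x) - statistics.fmean(y))
    count = 0
    for _ in range(num_perm):
        rng.shuffle(pooled)  # a uniformly random relabeling of the pooled sample
        diff = abs(statistics.fmean(pooled[:n]) - statistics.fmean(pooled[n:]))
        if diff >= observed:
            count += 1
    # Add-one correction: counts the identity permutation, keeping the test valid.
    return (count + 1) / (num_perm + 1)

x = [8.5, 9.1, 7.9, 8.8, 9.4, 8.2]
y = [6.9, 7.2, 6.5, 7.8, 7.0, 7.4]
print(permutation_pvalue(x, y))  # small p-value: the groups barely overlap
```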


Chapter 8: High-Dimensional Inference

Classical statistical theory assumes the dimension \(p\) is fixed while \(n \to \infty\). High-dimensional inference addresses the regime where \(p\) grows with \(n\), potentially with \(p \gg n\). New phenomena emerge: the curse of dimensionality, the phase transition in regularization, and the breakdown of classical tests.

Section 8.1: Multiple Testing

When testing \(m\) hypotheses simultaneously, the probability of at least one false rejection (the family-wise error rate, FWER) inflates unless correction is applied.

FWER and FDR. Let \(V\) be the number of false rejections (false discoveries) and \(R\) be the total number of rejections. The family-wise error rate is \(\text{FWER} = P(V \geq 1)\). The false discovery rate is \[ \text{FDR} = \mathbb{E}\!\left[\frac{V}{\max(R, 1)}\right]. \]

Bonferroni correction controls FWER: reject hypothesis \(H_i\) if \(p_i \leq \alpha/m\). Then \(\text{FWER} \leq \alpha\) by a union bound. This is conservative when the tests are positively correlated.

Benjamini-Hochberg Procedure. Let \(p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}\) be the ordered p-values. Define \(k = \max\{i : p_{(i)} \leq i\alpha/m\}\) and reject \(H_{(1)}, \ldots, H_{(k)}\). If the test statistics are independent (or positively dependent in the PRDS sense), then \[ \text{FDR} \leq \frac{m_0}{m} \alpha \leq \alpha, \]

where \(m_0\) is the number of true null hypotheses.

Proof (independence case). Let \(\mathcal{H}_0\) index the set of true nulls with \(|\mathcal{H}_0| = m_0\). For any fixed realization of the non-null p-values, the p-values from the nulls are i.i.d. Uniform\([0,1]\). The BH procedure's false discovery proportion is \[ \frac{V}{R} = \frac{|\{i \in \mathcal{H}_0 : \text{rejected}\}|}{R}. \]

Using the step-up structure, for each \(i \in \mathcal{H}_0\) the event that \(H_i\) is rejected is \(\{p_i \leq R\alpha/m\}\), where \(R\) is the (data-dependent) number of rejections. A leave-one-out argument, comparing the procedure with the one applied to the remaining \(m-1\) p-values and using \(p_i \sim \text{Uniform}[0,1]\), shows that \(\mathbb{E}[\mathbf{1}_{i \text{ rejected}}/R] \leq \alpha/m\). Summing over the true nulls:

\[ \text{FDR} = \sum_{i \in \mathcal{H}_0} \mathbb{E}\!\left[\frac{\mathbf{1}_{i \text{ rejected}}}{R}\right] \leq \sum_{i \in \mathcal{H}_0} \frac{\alpha}{m} = \frac{m_0 \alpha}{m} \leq \alpha. \quad \square \]

Storey’s approach estimates the proportion of true nulls \(\hat{\pi}_0 = \#\{p_i > \lambda\} / (m(1-\lambda))\) for a tuning parameter \(\lambda \in (0,1)\), then uses \(\hat{\pi}_0\) in place of 1 in the BH threshold, yielding an adaptive procedure with higher power when many nulls are false.
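A direct implementation of the step-up rule (illustrative, assuming NumPy; the p-value lists are made-up examples). Note that the second example rejects all three hypotheses even though \(p_{(1)} = 0.02 > \alpha/3\): the step-up structure lets larger indices rescue smaller ones:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up: reject H_(1), ..., H_(k) with k = max{i : p_(i) <= i*alpha/m}."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = np.nonzero(p[order] <= alpha * np.arange(1, m + 1) / m)[0]
    rejected = np.zeros(m, dtype=bool)
    if passed.size:
        rejected[order[:passed[-1] + 1]] = True   # step-up: all of p_(1), ..., p_(k)
    return rejected

print(benjamini_hochberg([0.001, 0.012, 0.014, 0.20, 0.60], alpha=0.1))
print(benjamini_hochberg([0.04, 0.03, 0.02], alpha=0.05))
```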

Section 8.2: The James-Stein Estimator and SURE

Consider estimating \(\theta \in \mathbb{R}^p\) from \(X \sim N(\theta, I_p)\). The MLE is \(\hat{\theta} = X\), with risk \(\mathbb{E}[\|X - \theta\|^2] = p\).

James-Stein Estimator. For \(p \geq 3\), the James-Stein estimator \[ \hat{\theta}^{JS} = \left(1 - \frac{p-2}{\|X\|^2}\right) X \]

has risk \(\mathbb{E}[\|\hat{\theta}^{JS} - \theta\|^2] = p - (p-2)^2\, \mathbb{E}\!\left[\|X\|^{-2}\right] < p\),

so the MLE \(X\) is inadmissible under squared error loss for \(p \geq 3\).

This is a striking result: shrinkage toward zero uniformly dominates the MLE in dimension three or higher. The shrinkage factor \((p-2)/\|X\|^2\) adaptively adjusts to the signal strength.
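A Monte Carlo sketch of the risk comparison at \(\theta = 0\), the most favorable case for shrinkage, where \(\|X\|^2 \sim \chi^2_p\), \(\mathbb{E}[\|X\|^{-2}] = 1/(p-2)\), and the exact James-Stein risk is \(p - (p-2) = 2\) (illustrative code, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
p, reps = 10, 5000
theta = np.zeros(p)                       # theta = 0: shrinkage helps the most here
X = rng.normal(loc=theta, scale=1.0, size=(reps, p))
norm2 = np.sum(X ** 2, axis=1)
js = (1 - (p - 2) / norm2)[:, None] * X   # James-Stein shrinkage of each draw
mle_risk = np.mean(np.sum((X - theta) ** 2, axis=1))   # close to p = 10
js_risk = np.mean(np.sum((js - theta) ** 2, axis=1))   # close to 2 at theta = 0
print(mle_risk, js_risk)
```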

Stein’s Unbiased Risk Estimate (SURE) provides an unbiased estimator of the risk of any differentiable shrinkage estimator without knowing \(\theta\).

SURE. For \(X \sim N(\theta, \sigma^2 I_p)\) and an estimator \(\hat{\theta}(X) = X + g(X)\) where \(g\) is weakly differentiable, an unbiased estimate of the risk \(\mathbb{E}[\|\hat{\theta} - \theta\|^2]\) is \[ \text{SURE}(g) = p\sigma^2 + \|g(X)\|^2 + 2\sigma^2 \sum_{i=1}^p \frac{\partial g_i}{\partial x_i}(X) = p\sigma^2 + \|g\|^2 + 2\sigma^2 \nabla \cdot g. \]

The identity \(\mathbb{E}[g_i(X)(X_i - \theta_i)] = \sigma^2 \mathbb{E}[\partial g_i/\partial x_i]\) (Stein’s identity) is the key tool.

Section 8.3: Donoho-Johnstone Wavelet Shrinkage

Consider the normal means problem: observe \(Y_i = \theta_i + \epsilon_i\), \(i = 1, \ldots, n\), with \(\epsilon_i \sim N(0, \sigma^2)\) i.i.d., and \(\theta = (\theta_1, \ldots, \theta_n)\) assumed sparse.

Soft thresholding at level \(\lambda\) is \(\eta_\lambda(y) = \text{sign}(y)(|y| - \lambda)_+\). Donoho and Johnstone (1994) proved that the soft-threshold estimator \(\hat{\theta}_i = \eta_\lambda(Y_i)\) with \(\lambda = \sigma\sqrt{2\log n}\) (the universal threshold) achieves near-minimax risk over sparse and smooth function classes.

Donoho-Johnstone Risk Bound. For soft thresholding at level \(\lambda = \sigma\sqrt{2\log n}\), the risk satisfies \[ \mathbb{E}\!\left[\sum_{i=1}^n (\hat{\theta}_i - \theta_i)^2\right] \leq (2\log n + 1)\left(\sigma^2 + \sum_{i=1}^n \min(\theta_i^2, \sigma^2)\right), \]

where \(\sum_{i=1}^n \min(\theta_i^2, \sigma^2)\) is the risk of the oracle that keeps exactly the coordinates with \(|\theta_i| > \sigma\); soft thresholding thus pays at most a logarithmic factor over the oracle risk.

SURE provides an adaptive choice of threshold by minimizing \(\text{SURE}(\lambda)\) over \(\lambda\), yielding a data-driven threshold without the \(\log n\) factor in many practical problems.
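A sketch combining soft thresholding with SURE-based threshold selection (illustrative, assuming NumPy). For soft thresholding, the general SURE formula specializes to \(n\sigma^2 + \sum_i \min(y_i^2, \lambda^2) - 2\sigma^2\,\#\{i : |y_i| \leq \lambda\}\):

```python
import numpy as np

def soft_threshold(y, lam):
    """Soft thresholding: sign(y) * (|y| - lam)_+ applied coordinatewise."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def sure_soft(y, lam, sigma=1.0):
    """Stein's unbiased risk estimate for soft thresholding at level lam."""
    n = len(y)
    return (n * sigma ** 2
            + np.sum(np.minimum(y ** 2, lam ** 2))
            - 2 * sigma ** 2 * np.sum(np.abs(y) <= lam))

rng = np.random.default_rng(0)
n = 1000
theta = np.zeros(n)
theta[:20] = 5.0                               # sparse mean vector
y = theta + rng.normal(size=n)
universal = np.sqrt(2 * np.log(n))             # universal threshold, about 3.72
grid = np.linspace(0.1, universal, 60)
lam_sure = grid[np.argmin([sure_soft(y, lam) for lam in grid])]
print(universal, lam_sure)
```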

Section 8.4: RKHS and Function Estimation

A reproducing kernel Hilbert space (RKHS) \(\mathcal{H}\) on \(\mathcal{X}\) is a Hilbert space of functions \(f : \mathcal{X} \to \mathbb{R}\) with reproducing kernel \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) satisfying:

  1. \(K(\cdot, x) \in \mathcal{H}\) for all \(x \in \mathcal{X}\).
  2. \(\langle f, K(\cdot, x)\rangle_\mathcal{H} = f(x)\) for all \(f \in \mathcal{H}\), \(x \in \mathcal{X}\) (the reproducing property).

By the Moore–Aronszajn theorem, every positive semidefinite kernel is the reproducing kernel of a unique RKHS; Mercer’s theorem additionally provides a spectral expansion of continuous kernels on compact spaces. The kernel ridge regression estimator minimizes

\[ \frac{1}{n}\sum_{i=1}^n (Y_i - f(X_i))^2 + \lambda \|f\|_\mathcal{H}^2, \]

and by the representer theorem, the solution has the form \(\hat{f}(x) = \sum_{i=1}^n \hat{\alpha}_i K(x, X_i)\), reducing the infinite-dimensional optimization to an \(n\)-dimensional linear system.
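A minimal kernel ridge regression with Gaussian kernel \(K(x, x') = e^{-\gamma(x - x')^2}\) (illustrative code, assuming NumPy; the bandwidth \(\gamma\) and penalty \(\lambda\) are arbitrary choices). The coefficients solve the \(n\)-dimensional system \((K + n\lambda I)\alpha = Y\):

```python
import numpy as np

def krr_fit(X, y, lam, gamma):
    """Kernel ridge regression with a Gaussian kernel: solve (K + n*lam*I) alpha = y."""
    K = np.exp(-gamma * np.subtract.outer(X, X) ** 2)
    n = len(X)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return alpha, K

def krr_predict(alpha, X_train, x_new, gamma):
    """The representer-theorem form: f_hat(x) = sum_i alpha_i K(x, X_i)."""
    return np.exp(-gamma * (x_new - X_train) ** 2) @ alpha

X = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.sin(2 * X)
alpha, K = krr_fit(X, y, lam=1e-4, gamma=2.0)
print([round(krr_predict(alpha, X, x, 2.0), 3) for x in X])
```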

Section 8.5: Model Selection (AIC and BIC)

Akaike Information Criterion (AIC): Akaike (1973) proposed selecting the model minimizing

\[ \text{AIC} = -2\ell(\hat{\theta}) + 2k, \]

where \(\ell(\hat{\theta})\) is the log-likelihood at the MLE and \(k\) is the number of parameters. The derivation shows that the maximized log-likelihood overestimates the expected out-of-sample log-likelihood by approximately \(k\), because the same data are used both to fit and to evaluate the model; equivalently, \(-2\ell(\hat{\theta})/n\) is a biased estimate, with bias of order \(k/n\), of the Kullback-Leibler risk \(D_{KL}(f_{\text{true}} \,\|\, f_{\hat{\theta}})\) up to a constant. The \(2k\) penalty corrects this bias, so model selection by AIC amounts to choosing the model with the best asymptotically unbiased estimate of predictive KL divergence.

Bayesian Information Criterion (BIC): Schwarz (1978) proposed

\[ \text{BIC} = -2\ell(\hat{\theta}) + k\log n. \]

The penalty \(k\log n\) arises from a Laplace approximation to the marginal likelihood (Bayes factor): for models \(\mathcal{M}_1\) and \(\mathcal{M}_2\) with equal prior probability,

\[ \log P(\mathcal{M}_1 \mid X) - \log P(\mathcal{M}_2 \mid X) \approx \frac{1}{2}(\text{BIC}_2 - \text{BIC}_1). \]

BIC is consistent for model selection: when the true model is among the candidates, it is selected with probability tending to 1 as \(n \to \infty\). AIC is not consistent in this sense, but it has better prediction properties when the true model is not in the candidate set.
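A small model-selection sketch (illustrative, assuming NumPy; the data-generating model is made up): polynomial regression with Gaussian errors, where \(\sigma^2\) is profiled out so that \(-2\ell(\hat{\theta}) = n\log(\text{RSS}/n)\) up to constants that cancel across models:

```python
import numpy as np

def gaussian_ic(y, yhat, k):
    """AIC and BIC for a Gaussian model with k mean parameters (sigma^2 profiled out)."""
    n = len(y)
    rss = np.sum((y - yhat) ** 2)
    loglik = -0.5 * n * np.log(rss / n)   # maximized log-likelihood up to constants
    return -2 * loglik + 2 * k, -2 * loglik + k * np.log(n)

rng = np.random.default_rng(0)
n = 100
x = np.linspace(-1, 1, n)
y = 1 + 2 * x + 3 * x ** 2 + 0.1 * rng.normal(size=n)   # true model: degree 2
scores = []
for d in range(6):
    beta = np.polyfit(x, y, d)
    aic, bic = gaussian_ic(y, np.polyval(beta, x), d + 1)
    scores.append((d, aic, bic))
best_bic = min(scores, key=lambda t: t[2])[0]
print(best_bic)
```

With this strong signal-to-noise ratio, the degree-2 fit dominates the underfit models by a wide margin, and the \(\log n\) penalty typically rules out the overfit ones.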


Chapter 9: Advanced Topics

Section 9.1: Post-Selection Inference

Classical inference assumes the model (and in particular, which parameters to test) is chosen prior to examining the data. In practice, model selection is performed on the same data used for inference, invalidating classical p-values and confidence intervals.

The Lasso estimator \(\hat{\beta}(\lambda) = \arg\min_\beta \left[\frac{1}{2n}\|Y - X\beta\|^2 + \lambda\|\beta\|_1\right]\) selects a sparse model, but the selected predictors are correlated with the response, so ordinary OLS inference on the selected model yields inflated type-I errors.

Post-selection inference (POSI) seeks valid inference after model selection. The selective inference framework (Berk et al., 2013; Lee et al., 2016) conditions on the selected model to construct valid tests.

Selective Type-I Error. An inference procedure controls the selective type-I error at level \(\alpha\) if, conditioning on the event that a particular model \(\hat{M}\) is selected, the test of any hypothesis about parameters of model \(\hat{M}\) has size \(\leq \alpha\).

For the Lasso at fixed \(\lambda\), the selected model \(\hat{M} = \text{supp}(\hat{\beta}(\lambda))\) is a polyhedral set: \(\hat{M} = M\) if and only if \(AY \leq b\) for a matrix \(A\) and vector \(b\) depending on \(M\) and \(\lambda\). Conditioning on this polyhedral set, the truncated normal distribution (via the Polyhedral Lemma of Lee et al.) yields pivot statistics with exact uniform distribution under the null.

Polyhedral Lemma. Let \(Y \sim N(\mu, \sigma^2 I)\) and let \(\eta^T Y\) be a linear contrast. Conditional on \(\{AY \leq b\}\) and on the component of \(Y\) orthogonal to \(\eta\), the pivot \[ F = \frac{\Phi\!\left(\frac{\eta^T Y}{\sigma \|\eta\|}\right) - \Phi\!\left(\frac{\mathcal{V}^-}{\sigma\|\eta\|}\right)}{\Phi\!\left(\frac{\mathcal{V}^+}{\sigma\|\eta\|}\right) - \Phi\!\left(\frac{\mathcal{V}^-}{\sigma\|\eta\|}\right)} \]

is Uniform\([0,1]\) under \(\eta^T \mu = 0\), where \(\mathcal{V}^-\) and \(\mathcal{V}^+\) are the truncation bounds on \(\eta^T Y\) induced by the polyhedron.

This enables p-values and confidence intervals for regression coefficients that are valid conditional on Lasso model selection.
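The truncation bounds \(\mathcal{V}^\pm\) can be computed by decomposing \(Y\) into \(\eta^T Y\) and its orthogonal complement, in the spirit of Lee et al. A sketch for \(\Sigma = \sigma^2 I\) (illustrative code, assuming NumPy; the toy selection event below is hypothetical, not from the text):

```python
import math
import numpy as np

def truncation_bounds(A, b, y, eta):
    """Range of t = eta'Y compatible with {AY <= b} when Y moves only along eta
    (the orthogonal part z is held fixed), for Sigma = sigma^2 I."""
    c = eta / np.dot(eta, eta)
    t = np.dot(eta, y)
    z = y - c * t
    Ac = A @ c
    resid = b - A @ z                      # constraint becomes (Ac) t' <= resid
    neg, pos = Ac < 0, Ac > 0
    v_minus = np.max(resid[neg] / Ac[neg]) if neg.any() else -np.inf
    v_plus = np.min(resid[pos] / Ac[pos]) if pos.any() else np.inf
    return v_minus, v_plus

def Phi(x):
    if x == np.inf:
        return 1.0
    if x == -np.inf:
        return 0.0
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def truncated_pivot(A, b, y, eta, sigma=1.0):
    """Truncated-normal CDF of eta'Y on [V-, V+], the pivot under eta'mu = 0."""
    vm, vp = truncation_bounds(A, b, y, eta)
    s = sigma * np.linalg.norm(eta)
    t = np.dot(eta, y)
    return (Phi(t / s) - Phi(vm / s)) / (Phi(vp / s) - Phi(vm / s))

# Toy selection event {Y_1 >= 0} in R^2, encoded as A = [[-1, 0]], b = [0]:
A = np.array([[-1.0, 0.0]])
b = np.array([0.0])
y = np.array([0.5, 2.0])
eta = np.array([1.0, 0.0])
print(truncated_pivot(A, b, y, eta))
```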

Section 9.2: Inference Under Differential Privacy

Differential privacy (Dwork et al., 2006) provides a rigorous definition of privacy for statistical inference. It quantifies the maximum information leakage about any individual from the output of a computation.

Differential Privacy. A randomized mechanism \(\mathcal{M} : \mathcal{X}^n \to \mathcal{O}\) satisfies \((\epsilon, \delta)\)-differential privacy if for all pairs of datasets \(D, D'\) differing in a single entry and all measurable sets \(S \subseteq \mathcal{O}\), \[ P(\mathcal{M}(D) \in S) \leq e^\epsilon P(\mathcal{M}(D') \in S) + \delta. \]

Pure \(\epsilon\)-differential privacy corresponds to \(\delta = 0\).

The local differential privacy (LDP) model requires each individual to privatize their own data before sharing. Under \(\epsilon\)-LDP, each user sends a privatized version \(Z_i = \mathcal{M}(X_i)\) satisfying \(P(Z_i \in S \mid X_i = x) \leq e^\epsilon P(Z_i \in S \mid X_i = x')\) for all \(x, x', S\).

Privacy-Accuracy Tradeoff and Fisher Information Bounds:

Fisher Information under LDP. For any \(\epsilon\)-LDP mechanism applied to i.i.d. observations, the Fisher information available for estimating a one-dimensional parameter satisfies \[ \mathcal{I}_n^{\text{LDP}}(\theta) \leq \frac{n(e^\epsilon - 1)^2}{(e^\epsilon + 1)^2} \cdot \mathcal{I}(\theta), \]

where \(\mathcal{I}(\theta)\) is the single-observation Fisher information. For small \(\epsilon\), the reduction factor is approximately \(\epsilon^2/4\), meaning that LDP reduces effective sample size by a factor of \(\Theta(\epsilon^2)\).

Randomized Response (Warner, 1965): The canonical LDP mechanism for a binary attribute \(X \in \{0,1\}\) is: report \(Z = X\) with probability \(e^\epsilon/(1+e^\epsilon)\) and \(Z = 1-X\) with probability \(1/(1+e^\epsilon)\). This satisfies \(\epsilon\)-LDP. The unbiased estimator of \(P(X=1) = p\) from \(n\) independent responses is

\[ \hat{p} = \frac{n^{-1}\sum_{i=1}^n Z_i \cdot (1+e^\epsilon) - 1}{e^\epsilon - 1}, \]

with variance \(\text{Var}(\hat{p}) = p(1-p)/n + e^\epsilon/[n(e^\epsilon - 1)^2]\), illustrating the variance inflation due to privacy.
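A sketch of Warner's mechanism and the unbiased inversion (standard library only; not from the course materials). The first check verifies algebraically that plugging \(\mathbb{E}[Z] = (1-q) + (2q-1)p\), with \(q = e^\epsilon/(1+e^\epsilon)\), into \(\hat{p}\) recovers \(p\):

```python
import math
import random

def randomize(x, eps, rng):
    """Warner's mechanism: report x truthfully w.p. e^eps/(1+e^eps), else flip it."""
    keep = math.exp(eps) / (1 + math.exp(eps))
    return x if rng.random() < keep else 1 - x

def estimate_p(z_bar, eps):
    """Unbiased inversion of the mechanism: (z_bar (1+e^eps) - 1) / (e^eps - 1)."""
    e = math.exp(eps)
    return (z_bar * (1 + e) - 1) / (e - 1)

eps, p = 1.0, 0.3
q = math.exp(eps) / (1 + math.exp(eps))
expected_z = (1 - q) + (2 * q - 1) * p        # E[Z] under true proportion p
print(estimate_p(expected_z, eps))            # recovers p = 0.3

rng = random.Random(0)
reports = [randomize(1 if rng.random() < p else 0, eps, rng) for _ in range(20000)]
print(estimate_p(sum(reports) / len(reports), eps))  # near 0.3, inflated variance
```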

Under the Gaussian mechanism, a real-valued query \(q(D)\) with sensitivity \(\Delta q = \sup_{D,D'} |q(D) - q(D')|\) is privatized as \(\mathcal{M}(D) = q(D) + N(0, \sigma^2)\) with \(\sigma = \Delta q \sqrt{2\log(1.25/\delta)} / \epsilon\) to achieve \((\epsilon,\delta)\)-DP. The noise level \(\sigma\) directly trades privacy (large \(\sigma\)) against accuracy (small \(\sigma\)).
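A minimal sketch of the Gaussian mechanism for a bounded mean (standard library only; the dataset size and query value are illustrative). The mean of \(n\) values in \([0, 1]\) has sensitivity \(1/n\) under single-entry changes:

```python
import math
import random

def gaussian_mechanism(value, sensitivity, eps, delta, rng):
    """Release q(D) + N(0, sigma^2) with sigma = sensitivity*sqrt(2 ln(1.25/delta))/eps."""
    sigma = sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / eps
    return value + rng.gauss(0.0, sigma), sigma

rng = random.Random(0)
n, true_mean = 1000, 0.42          # a mean of n values in [0, 1]: sensitivity 1/n
private, sigma = gaussian_mechanism(true_mean, 1.0 / n, eps=1.0, delta=1e-5, rng=rng)
print(sigma, private)
```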

The field of private hypothesis testing seeks to control type I and type II errors while maintaining differential privacy. For the simple hypothesis test \(H_0: \theta = \theta_0\) vs. \(H_1: \theta = \theta_1\), adding Laplace noise to the log-likelihood ratio yields a private likelihood ratio test, but the private sample complexity grows as \(O(1/\epsilon)\) times the non-private sample complexity, illustrating the fundamental cost of privacy.


Appendix: Key Notation and Symbols

Symbol | Meaning
\(\mathbb{P}_n\) | Empirical measure \(n^{-1}\sum_{i=1}^n \delta_{X_i}\)
\(\mathbb{F}_n\) | Empirical CDF
\(\mathbb{G}_n\) | Centered empirical process \(\sqrt{n}(\mathbb{F}_n - F)\)
\(\mathbb{B}\) | Brownian bridge
\(\mathcal{I}(\theta)\) | Fisher information
\(O_p, o_p\) | Stochastic order notation
\(\xrightarrow{d}\) | Convergence in distribution
\(\xrightarrow{P}\) | Convergence in probability
\(\xrightarrow{\text{a.s.}}\) | Almost sure convergence
\(\text{IF}(x; T, F)\) | Influence function
\(\text{FWER}\) | Family-wise error rate
\(\text{FDR}\) | False discovery rate
\(\ell^\infty(\mathcal{F})\) | Space of bounded functions on \(\mathcal{F}\) with sup-norm
\(V(\mathcal{C})\) | VC dimension of class \(\mathcal{C}\)