STAT 908: Statistical Inference
Shoja'eddin Chenouri
Estimated study time: 1 hr 25 min
Sources and References
Primary texts — A.W. van der Vaart, Asymptotic Statistics (Cambridge University Press, 1998); Anirban DasGupta, Asymptotic Theory of Statistics and Probability (Springer, 2008); Martin Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint (Cambridge University Press, 2019). Supplementary — E.L. Lehmann, Theory of Point Estimation (Wiley, 1983); E.L. Lehmann and J.P. Romano, Testing Statistical Hypotheses (3rd ed., Springer, 2005); T.S. Ferguson, Mathematical Statistics: A Decision Theoretic Approach (Academic Press, 1967). Online resources — MIT 18.650 Statistics for Applications notes; Stanford STATS 300B (Theory of Statistics) lecture notes; Larry Wasserman’s CMU 36-705 Intermediate Statistics notes.
Chapter 1: Measure-Theoretic Foundations and Hilbert Spaces
Statistical inference at the graduate level demands a rigorous foundation in measure theory. The language of measurable spaces, sigma-algebras, and integration not only gives precise meaning to probability and expectation but also supplies the machinery for proving asymptotic results throughout this course. This chapter reviews the essential elements at the level expected for STAT 908.
Section 1.1: Probability Spaces and Measurable Functions
A measurable space is a pair \((\Omega, \mathcal{F})\) where \(\Omega\) is a nonempty set and \(\mathcal{F}\) is a sigma-algebra of subsets of \(\Omega\). A probability measure \(P : \mathcal{F} \to [0,1]\) satisfies \(P(\Omega) = 1\) and countable additivity: for disjoint \(A_1, A_2, \ldots \in \mathcal{F}\),
\[ P\!\left(\bigcup_{i=1}^\infty A_i\right) = \sum_{i=1}^\infty P(A_i). \]The triple \((\Omega, \mathcal{F}, P)\) is a probability space. A random variable is a measurable function \(X : (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))\), where \(\mathcal{B}(\mathbb{R})\) denotes the Borel sigma-algebra generated by the open sets of \(\mathbb{R}\). Measurability means \(X^{-1}(B) \in \mathcal{F}\) for every \(B \in \mathcal{B}(\mathbb{R})\), ensuring that events of the form \(\{X \in B\}\) are well-defined.
The distribution or law of \(X\) is the probability measure \(P_X = P \circ X^{-1}\) on \((\mathbb{R}, \mathcal{B}(\mathbb{R}))\). The cumulative distribution function (CDF) is \(F_X(t) = P(X \leq t)\).
Section 1.2: Integration and Expectation
The Lebesgue integral is constructed in three stages: first for nonnegative simple functions, then for nonnegative measurable functions as a supremum of simple approximations, and finally for general functions by decomposing \(f = f^+ - f^-\) where \(f^+ = \max(f,0)\) and \(f^- = \max(-f,0)\). The integral \(\int_\Omega f \, dP\) is defined when at least one of \(\int f^+ \, dP\) or \(\int f^- \, dP\) is finite.
The expectation of a random variable \(X\) is \(\mathbb{E}[X] = \int_\Omega X \, dP\). The key convergence theorems are:
- Monotone Convergence Theorem: if \(0 \leq X_n \uparrow X\) almost surely, then \(\mathbb{E}[X_n] \uparrow \mathbb{E}[X]\).
- Fatou's Lemma: if \(X_n \geq 0\), then \(\mathbb{E}[\liminf_n X_n] \leq \liminf_n \mathbb{E}[X_n]\).
- Dominated Convergence Theorem: if \(X_n \to X\) almost surely and \(|X_n| \leq Y\) for some \(Y\) with \(\mathbb{E}[Y] < \infty\), then \(\mathbb{E}[X_n] \to \mathbb{E}[X]\).
These theorems underpin the interchange of limits and expectations that appears throughout asymptotic theory.
Section 1.3: Modes of Convergence
Let \(X, X_1, X_2, \ldots\) be random variables on a common probability space \((\Omega, \mathcal{F}, P)\). The principal modes of convergence are:
- Almost surely: \(X_n \xrightarrow{a.s.} X\) if \(P(\lim_n X_n = X) = 1\).
- In probability: \(X_n \xrightarrow{P} X\) if \(P(|X_n - X| > \epsilon) \to 0\) for every \(\epsilon > 0\).
- In \(L^p\): \(X_n \xrightarrow{L^p} X\) if \(\mathbb{E}[|X_n - X|^p] \to 0\).
- In distribution: \(X_n \xrightarrow{d} X\) if \(F_{X_n}(t) \to F_X(t)\) at every continuity point \(t\) of \(F_X\).
The relationships among these modes are summarized by the following implications, all of which are strict in general:
\[ \text{a.s.} \implies \text{in probability} \implies \text{in distribution}, \]\[ L^p \implies \text{in probability} \implies \text{in distribution.} \]The implication from almost sure to in probability holds because almost sure convergence is equivalent to \(P(\sup_{m \geq n} |X_m - X| > \epsilon) \to 0\) for every \(\epsilon > 0\), which dominates \(P(|X_n - X| > \epsilon) \to 0\). The converse fails: the classic “typewriter sequence” of indicator functions on \([0,1]\) converges in probability to zero but not almost surely.
Markov’s inequality states that for any \(a > 0\) and \(p \geq 1\),
\[ P(|X| \geq a) \leq \frac{\mathbb{E}[|X|^p]}{a^p}, \]and Chebyshev’s inequality is the special case \(p=2\) applied to \(X - \mu\):
\[ P(|X - \mu| \geq a) \leq \frac{\text{Var}(X)}{a^2}. \]These inequalities provide the fundamental bridge between moments and tail probabilities.
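As a quick numeric sketch of how conservative the moment bound can be, the following compares Chebyshev's bound to the exact two-sided tail of an Exponential(1) variable (an illustrative choice with \(\mu = \text{Var} = 1\)):

```python
import math

# Chebyshev: P(|X - mu| >= a) <= Var(X) / a^2.
# Check against the exact tail of X ~ Exponential(1) (mu = 1, Var = 1).
def exact_tail(a):
    """P(|X - 1| >= a) for X ~ Exp(1)."""
    upper = math.exp(-(1 + a))                        # P(X >= 1 + a)
    lower = 1 - math.exp(-(1 - a)) if a < 1 else 0.0  # P(X <= 1 - a)
    return upper + lower

for a in [0.5, 1.0, 2.0, 4.0]:
    bound = 1.0 / a**2   # Var(X)/a^2 with Var = 1
    print(f"a={a}: exact={exact_tail(a):.4f}  chebyshev={bound:.4f}")
    assert exact_tail(a) <= bound
```

The gap widens rapidly with \(a\): polynomial bounds cannot capture the exponential tail decay, which is the motivation for the concentration inequalities of Chapter 4.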
Section 1.4: Uniform Integrability
Convergence in probability does not imply \(L^1\) convergence unless a uniform integrability condition holds.
A practical sufficient condition: if \(\sup_n \mathbb{E}[|X_n|^{1+\delta}] < \infty\) for some \(\delta > 0\), then \(\{X_n\}\) is uniformly integrable. (This is not an equivalence; the de la Vallée-Poussin criterion gives the exact characterization: uniform integrability holds if and only if \(\sup_n \mathbb{E}[\varphi(|X_n|)] < \infty\) for some convex \(\varphi\) with \(\varphi(t)/t \to \infty\).)
Section 1.5: Hilbert Space Structure of \(L^2\)
The space \(L^2(\Omega, \mathcal{F}, P)\) of square-integrable random variables forms a Hilbert space under the inner product
\[ \langle X, Y \rangle = \mathbb{E}[XY] \]with corresponding norm \(\|X\|_2 = (\mathbb{E}[X^2])^{1/2}\). Elements are equivalence classes under almost-sure equality. This structure is fundamental to the geometry of statistical estimation.
For \(X \in L^2\), the conditional expectation \(\mathbb{E}[X \mid \mathcal{G}]\) is precisely the \(L^2\)-projection of \(X\) onto \(L^2(\Omega, \mathcal{G}, P)\) for any sub-sigma-algebra \(\mathcal{G} \subseteq \mathcal{F}\). The tower property \(\mathbb{E}[\mathbb{E}[X \mid \mathcal{G}] \mid \mathcal{H}] = \mathbb{E}[X \mid \mathcal{H}]\) for \(\mathcal{H} \subseteq \mathcal{G}\) follows directly from the projection interpretation: projecting onto a larger space and then onto a smaller one is the same as projecting onto the smaller space directly.
Chapter 2: Decision Theory and Optimal Estimation
Decision theory provides a unified framework for evaluating and comparing statistical procedures. Rather than seeking procedures that are “good” in some intuitive sense, decision theory formalizes the notion of optimality through loss functions and risk functions.
Section 2.1: The Statistical Decision Problem
A statistical decision problem consists of:
- A parameter space \(\Theta\), which may be finite-dimensional or infinite-dimensional.
- An action space \(\mathcal{A}\) of possible decisions.
- A loss function \(L : \Theta \times \mathcal{A} \to [0, \infty)\), where \(L(\theta, a)\) measures the cost of taking action \(a\) when the true parameter is \(\theta\).
A decision rule or estimator \(\delta : \mathcal{X} \to \mathcal{A}\) is a measurable function mapping observations to actions. The risk function is the expected loss:
\[ R(\theta, \delta) = \mathbb{E}_\theta\!\left[L(\theta, \delta(X))\right]. \]Common loss functions include squared error loss \(L(\theta, a) = (\theta - a)^2\), absolute error loss \(L(\theta, a) = |\theta - a|\), and 0-1 loss for testing.
A Bayes rule minimizes the Bayes risk over all decision rules.
A key connection is that if a Bayes rule \(\delta_\pi\) has constant risk, then it is minimax. This is the classical method for establishing minimaxity: find a least-favorable prior such that the Bayes estimator achieves constant risk.
Section 2.2: The Cramér-Rao Lower Bound
The Cramér-Rao lower bound (CRLB) provides a fundamental lower bound on the variance of any unbiased estimator in a regular parametric family.
Let \(\{P_\theta : \theta \in \Theta\}\) be a family with densities \(f(x; \theta)\) with respect to a sigma-finite measure. The score function is
\[ s(x; \theta) = \frac{\partial}{\partial \theta} \log f(x; \theta), \]and the Fisher information is
\[ \mathcal{I}(\theta) = \mathbb{E}_\theta\!\left[s(X; \theta)^2\right] = -\mathbb{E}_\theta\!\left[\frac{\partial^2}{\partial \theta^2} \log f(X; \theta)\right], \]with equality of the two forms holding under suitable regularity (exchange of differentiation and integration).
Proof sketch. Let \(\delta(X)\) be an unbiased estimator of \(g(\theta) = \mathbb{E}_\theta[\delta(X)]\). Differentiating with respect to \(\theta\) and exchanging differentiation and integration:
\[ g'(\theta) = \int \delta(x) \frac{\partial}{\partial\theta} f(x;\theta) \, d\mu(x) = \int \delta(x) s(x;\theta) f(x;\theta) \, d\mu(x) = \text{Cov}_\theta(\delta(X), s(X;\theta)). \]Note \(\mathbb{E}_\theta[s(X;\theta)] = 0\) (under regularity, differentiating \(\int f \, d\mu = 1\)). By the Cauchy-Schwarz inequality,
\[ \left[g'(\theta)\right]^2 = \left[\text{Cov}_\theta(\delta(X), s(X;\theta))\right]^2 \leq \text{Var}_\theta(\delta(X)) \cdot \text{Var}_\theta(s(X;\theta)) = \text{Var}_\theta(\delta(X)) \cdot \mathcal{I}(\theta). \]Dividing both sides by \(\mathcal{I}(\theta)\) yields the result. \(\square\)
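The bound can be checked concretely. For a Bernoulli(\(p\)) sample, a minimal numeric sketch computes \(\mathcal{I}(p) = \mathbb{E}[s(X;p)^2] = 1/(p(1-p))\) directly from the score and confirms that the sample mean attains the CRLB:

```python
# CRLB sketch for Bernoulli(p): the score is s(x;p) = x/p - (1-x)/(1-p),
# Fisher information I(p) = E[s(X;p)^2] = 1/(p(1-p)), and the sample mean
# attains the bound: Var(Xbar) = p(1-p)/n = 1/(n I(p)).
def fisher_info_bernoulli(p):
    s1 = 1.0 / p            # score at x = 1
    s0 = -1.0 / (1.0 - p)   # score at x = 0
    return p * s1**2 + (1 - p) * s0**2   # E[s(X;p)^2]

p, n = 0.3, 50
I = fisher_info_bernoulli(p)
var_xbar = p * (1 - p) / n   # exact variance of the sample mean
crlb = 1.0 / (n * I)         # Cramer-Rao lower bound for unbiased estimators
print(f"I(p)={I:.4f}, Var(Xbar)={var_xbar:.6f}, CRLB={crlb:.6f}")
assert abs(I - 1.0 / (p * (1 - p))) < 1e-12
assert abs(var_xbar - crlb) < 1e-12
```

Here the bound is attained exactly, reflecting that the Bernoulli family is exponential with \(\bar{X}\) as the natural efficient estimator of \(p\).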
An estimator achieving the CRLB with equality is called efficient. The CRLB may not be achievable; the Rao-Blackwell and Lehmann-Scheffé theorems characterize when an optimal unbiased estimator exists.
Section 2.3: Sufficiency and the Neyman Factorization Theorem
Sufficiency means that \(T\) captures all information in the data about \(\theta\). The following theorem provides the standard criterion.
Neyman Factorization Theorem: a statistic \(T(X)\) is sufficient for \(\theta\) if and only if the density factors as
\[ f(x; \theta) = g(T(x), \theta)\, h(x) \]for nonnegative functions \(g\) and \(h\), where \(h\) does not depend on \(\theta\).
The exponential family \(f(x;\theta) = \exp(\eta(\theta)^T T(x) - A(\theta)) h(x)\) has \(T(x)\) as a natural sufficient statistic by the factorization theorem.
Section 2.4: Completeness, Basu’s Theorem, and UMVUE
The uniqueness follows from completeness: if two functions of \(T\) are both unbiased for \(g(\theta)\), their difference has zero expectation for all \(\theta\), so by completeness they are equal almost surely.
Chapter 3: Classical Asymptotic Theory
Asymptotic theory studies the behavior of statistical procedures as the sample size \(n \to \infty\). This chapter develops the fundamental tools: modes of convergence, limit theorems, and the perturbation calculus of the delta method.
Section 3.1: Weak Convergence and the Portmanteau Theorem
Convergence in distribution (weak convergence) is the central mode of convergence in asymptotic statistics. The Portmanteau theorem provides a collection of equivalent characterizations.
1. \(\mu_n \xrightarrow{d} \mu\), i.e., \(\mathbb{E}[f(X_n)] \to \mathbb{E}[f(X)]\) for all bounded continuous \(f\).
2. \(\mathbb{E}[f(X_n)] \to \mathbb{E}[f(X)]\) for all bounded uniformly continuous \(f\).
3. \(\limsup_n \mu_n(F) \leq \mu(F)\) for all closed sets \(F\).
4. \(\liminf_n \mu_n(G) \geq \mu(G)\) for all open sets \(G\).
5. \(\mu_n(A) \to \mu(A)\) for all Borel sets \(A\) with \(\mu(\partial A) = 0\) (continuity sets).
6. \(F_{X_n}(t) \to F_X(t)\) for all continuity points \(t\) of \(F_X\).
The equivalence of conditions (3), (4), and (5) is especially useful when dealing with tail probabilities.
Section 3.2: Tightness and Prohorov’s Theorem
Tightness prevents probability mass from escaping to infinity. It is automatically satisfied when \(\sup_n \mathbb{E}[|X_n|] < \infty\).
Prohorov’s theorem strengthens Helly’s lemma by ensuring that the limit is a proper probability measure, precisely when tightness holds.
Section 3.3: \(O_p\) and \(o_p\) Notation
Stochastic order notation provides a concise language for the magnitude of random sequences.
\(X_n = O_p(a_n)\) if for every \(\epsilon > 0\) there exists \(M < \infty\) such that \(\sup_n P(|X_n|/a_n > M) < \epsilon\); \(X_n = o_p(a_n)\) if \(X_n / a_n \xrightarrow{P} 0\).
Key algebraic rules: \(O_p(a_n) \cdot O_p(b_n) = O_p(a_n b_n)\), \(o_p(a_n) + O_p(b_n) = O_p(\max(a_n, b_n))\), and if \(X_n \xrightarrow{d} X\) then \(X_n = O_p(1)\).
The continuous mapping theorem states: if \(X_n \xrightarrow{d} X\) and \(g : \mathbb{R} \to \mathbb{R}\) is continuous \(\mu\)-almost everywhere (where \(\mu\) is the law of \(X\)), then \(g(X_n) \xrightarrow{d} g(X)\). Slutsky’s theorem states: if \(X_n \xrightarrow{d} X\) and \(Y_n \xrightarrow{P} c\) (a constant), then \(X_n + Y_n \xrightarrow{d} X + c\) and \(X_n Y_n \xrightarrow{d} cX\).
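A simulation sketch of Slutsky's theorem in its most common use: in the \(t\)-statistic \(\sqrt{n}(\bar{X} - \mu)/S_n\), the numerator obeys a CLT while \(S_n \xrightarrow{P} \sigma\), so the ratio is asymptotically standard normal even for skewed data (exponential data here, an illustrative choice):

```python
import numpy as np

# Slutsky sketch: sqrt(n)(Xbar - mu)/S combines a CLT for the numerator with
# S ->_P sigma, giving N(0,1) in the limit. Exponential(1): mu = sigma = 1.
rng = np.random.default_rng(2)
n, reps = 500, 20000
x = rng.exponential(1.0, size=(reps, n))
t = np.sqrt(n) * (x.mean(axis=1) - 1.0) / x.std(axis=1, ddof=1)
print(f"simulated var ~ {t.var():.3f}, P(T > 1.96) ~ {(t > 1.96).mean():.4f}")
assert abs(t.var() - 1.0) < 0.1   # close to the N(0,1) variance
```

The variance matches the normal limit well at \(n = 500\); the one-sided tail probability still shows a residual skewness effect, which is exactly what the Edgeworth expansion of Chapter 6 quantifies.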
Section 3.4: Laws of Large Numbers
The SLLN requires more careful analysis via the Borel-Cantelli lemma. For the WLLN, Chebyshev’s inequality suffices when \(\text{Var}(X_1) < \infty\).
Section 3.5: Central Limit Theorems
The Lindeberg-Feller CLT extends this to triangular arrays of independent but not identically distributed random variables. Let \(X_{n1}, \ldots, X_{nn}\) be independent with \(\mathbb{E}[X_{ni}] = 0\), \(\text{Var}(X_{ni}) = \sigma_{ni}^2\), and \(s_n^2 = \sum_{i=1}^n \sigma_{ni}^2\). If the Lindeberg condition holds, namely for every \(\epsilon > 0\),
\[ \frac{1}{s_n^2} \sum_{i=1}^n \mathbb{E}\!\left[X_{ni}^2 \mathbf{1}_{|X_{ni}| > \epsilon s_n}\right] \to 0, \]then \(s_n^{-1} \sum_{i=1}^n X_{ni} \xrightarrow{d} N(0,1)\). Moreover, if the Feller condition \(\max_i \sigma_{ni}^2 / s_n^2 \to 0\) holds, then the Lindeberg condition is also necessary for the CLT.
The proof proceeds via the characteristic function approach: write the characteristic function of the normalized sum as a product, use the Lindeberg condition to show each factor converges to \(e^{-t^2/2}\), then invoke the continuity theorem for characteristic functions.
Lyapounov’s CLT is a simpler sufficient condition: if \(\mathbb{E}[|X_{ni}|^{2+\delta}] < \infty\) and \(s_n^{-(2+\delta)} \sum_i \mathbb{E}[|X_{ni}|^{2+\delta}] \to 0\) for some \(\delta > 0\), the Lindeberg condition holds, and hence the CLT follows.
Section 3.6: Berry-Esseen Bound
The CLT provides a qualitative statement about convergence to normality, but the Berry-Esseen theorem gives a quantitative rate: if \(X_1, \ldots, X_n\) are i.i.d. with mean \(\mu\), variance \(\sigma^2 > 0\), and \(\rho = \mathbb{E}[|X_1 - \mu|^3] < \infty\), then
\[ \sup_x \left| P\!\left(\frac{\sum_{i=1}^n (X_i - \mu)}{\sigma\sqrt{n}} \leq x\right) - \Phi(x) \right| \leq \frac{C\rho}{\sigma^3 \sqrt{n}}, \]where \(C\) is an absolute constant (the best known value is approximately 0.4748) and \(\Phi\) is the standard normal CDF.
The bound is tight in the order \(n^{-1/2}\): the Bernoulli distribution achieves this order. The proof uses a smoothing inequality relating the characteristic function to the CDF, then bounds the difference between the characteristic functions of the standardized sum and the standard normal.
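The bound can be verified exactly for a Bernoulli sum, since the binomial CDF is computable in closed form. A minimal sketch (with \(n = 20\), \(p = 0.3\) as illustrative choices) evaluates the sup-distance at the jump points of the standardized binomial and compares it to the Berry-Esseen bound with \(C = 0.4748\):

```python
import math

# Berry-Esseen sketch: for a standardized sum of n i.i.d. Bernoulli(p),
# sup_x |P(S_n <= x) - Phi(x)| <= C * rho / (sigma^3 sqrt(n)), C ~ 0.4748.
def phi_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

n, p = 20, 0.3
sigma = math.sqrt(p * (1 - p))
rho = p * (1 - p) * (p**2 + (1 - p)**2)   # E|X - p|^3 for Bernoulli(p)

# Exact binomial pmf/CDF via the iterative pmf ratio.
pmf = [(1 - p)**n]
for k in range(1, n + 1):
    pmf.append(pmf[-1] * (n - k + 1) / k * p / (1 - p))
cdf, acc = [], 0.0
for v in pmf:
    acc += v
    cdf.append(acc)

# The sup is attained at the jump points x_k = (k - np)/(sigma sqrt(n));
# check both sides of each jump.
gap = 0.0
for k in range(n + 1):
    x_k = (k - n * p) / (sigma * math.sqrt(n))
    left = cdf[k - 1] if k > 0 else 0.0
    gap = max(gap, abs(cdf[k] - phi_cdf(x_k)), abs(left - phi_cdf(x_k)))

bound = 0.4748 * rho / (sigma**3 * math.sqrt(n))
print(f"sup gap = {gap:.4f}, Berry-Esseen bound = {bound:.4f}")
assert gap <= bound
```

For this discrete distribution the bound is not far from tight: the actual sup-distance is within a factor of roughly 1.3 of the bound, consistent with the Bernoulli case driving the optimal constant.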
Section 3.7: The Delta Method
The delta method extends CLTs to smooth functions of sample means.
Proof sketch. A first-order Taylor expansion gives \(g(\hat{\theta}_n) - g(\theta) = g'(\theta)(\hat{\theta}_n - \theta) + o(|\hat{\theta}_n - \theta|)\). Multiplying by \(\sqrt{n}\) and noting that \(|\hat{\theta}_n - \theta| = O_p(n^{-1/2})\), the remainder satisfies \(\sqrt{n} \cdot o_p(|\hat{\theta}_n - \theta|) = o_p(1)\). By Slutsky’s theorem, \(\sqrt{n}(g(\hat{\theta}_n) - g(\theta)) \xrightarrow{d} g'(\theta) \cdot N(0,\sigma^2) = N(0, [g'(\theta)]^2 \sigma^2)\). \(\square\)
When \(g'(\theta) = 0\), a second-order expansion is needed: if \(g''(\theta)\) exists and is nonzero, then \(n(g(\hat{\theta}_n) - g(\theta)) \xrightarrow{d} \frac{1}{2} g''(\theta) \chi^2_1 \sigma^2\).
Variance-stabilizing transformations exploit the delta method. For a Poisson random variable with mean \(\lambda\), \(\text{Var}(\bar{X}) = \lambda/n\), so \(\sqrt{n}(\bar{X} - \lambda) \approx N(0, \lambda)\). Setting \(g(\lambda) = 2\sqrt{\lambda}\), we get \([g'(\lambda)]^2 \lambda = 1\), so \(\sqrt{n}(2\sqrt{\bar{X}} - 2\sqrt{\lambda}) \xrightarrow{d} N(0,1)\) regardless of \(\lambda\). Similarly, for the Binomial proportion \(\hat{p}\) with \(\text{Var}(\hat{p}) = p(1-p)/n\), the arcsine transformation \(g(p) = \arcsin(\sqrt{p})\) gives \([g'(p)]^2 p(1-p) = 1/4\) and variance stabilization: \(2\sqrt{n}(\arcsin\sqrt{\hat{p}} - \arcsin\sqrt{p}) \xrightarrow{d} N(0,1)\).
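A simulation sketch of the Poisson case: the variance of \(\sqrt{n}\,(2\sqrt{\bar{X}} - 2\sqrt{\lambda})\) should be close to 1 for any \(\lambda\) (the values of \(\lambda\), \(n\), and the replication count are illustrative):

```python
import numpy as np

# Variance-stabilization sketch for the Poisson mean:
# sqrt(n)*(2*sqrt(Xbar) - 2*sqrt(lam)) should be approximately N(0,1)
# regardless of lam.
rng = np.random.default_rng(0)
n, reps = 400, 20000
for lam in [2.0, 10.0]:
    xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
    z = np.sqrt(n) * (2 * np.sqrt(xbar) - 2 * np.sqrt(lam))
    print(f"lam={lam}: var of transformed statistic = {z.var():.3f}")
    assert abs(z.var() - 1.0) < 0.1
```

Without the transformation, the variance of \(\sqrt{n}(\bar{X} - \lambda)\) would be \(\lambda\) itself; the square-root transform removes the dependence on the unknown parameter, which is what makes it useful for constructing confidence intervals.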
Chapter 4: Empirical Processes
The theory of empirical processes studies the uniform behavior of sample-based functions over classes of sets or functions. It provides the theoretical foundation for nonparametric statistics and statistical learning theory.
Section 4.1: The Empirical Measure and Empirical CDF
Given i.i.d. observations \(X_1, \ldots, X_n\) from distribution \(P\) on \((\mathcal{X}, \mathcal{A})\), the empirical measure is
\[ \mathbb{P}_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}, \]where \(\delta_x\) is the Dirac measure at \(x\). For any function \(f\), \(\mathbb{P}_n f = n^{-1}\sum_{i=1}^n f(X_i)\). The empirical CDF is
\[ \mathbb{F}_n(t) = \mathbb{P}_n((-\infty, t]) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{X_i \leq t}. \]At each fixed \(t\), \(\mathbb{F}_n(t)\) is an unbiased estimator of \(F(t)\) with variance \(F(t)(1-F(t))/n\), so \(\sqrt{n}(\mathbb{F}_n(t) - F(t)) \xrightarrow{d} N(0, F(t)(1-F(t)))\) by the CLT.
Section 4.2: The Glivenko-Cantelli Theorem
The Glivenko-Cantelli theorem states that \(\sup_t |\mathbb{F}_n(t) - F(t)| \to 0\) almost surely. Proof sketch: fix \(\epsilon > 0\) and choose finitely many points \(-\infty = t_0 < t_1 < \cdots < t_k = \infty\) such that \(F(t_j^-) - F(t_{j-1}) < \epsilon/2\) for each \(j\); monotonicity of \(\mathbb{F}_n\) and \(F\) reduces the supremum over all \(t\) to these finitely many points. Taking \(n\) large enough that \(|\mathbb{F}_n(t_j) - F(t_j)| < \epsilon/2\) and \(|\mathbb{F}_n(t_j^-) - F(t_j^-)| < \epsilon/2\) simultaneously for all \(j\) (by the SLLN on the finite set) yields \(\sup_t |\mathbb{F}_n(t) - F(t)| < \epsilon\). Since \(\epsilon\) was arbitrary, the result follows. \(\square\)
Section 4.3: The DKW Inequality
The Glivenko-Cantelli theorem guarantees almost sure uniform convergence but gives no rate. Massart’s version of the Dvoretzky-Kiefer-Wolfowitz inequality provides a sharp exponential bound:
\[ P\!\left(\sup_t |\mathbb{F}_n(t) - F(t)| > \epsilon\right) \leq 2 e^{-2n\epsilon^2} \]for all \(\epsilon > 0\). The constant 2 is sharp.
This inequality is remarkable: the bound is distribution-free and depends on the sample size and \(\epsilon\) only through \(n\epsilon^2\). It implies that with probability at least \(1 - \delta\), the confidence band
\[ \left[\mathbb{F}_n(t) \pm \sqrt{\frac{\log(2/\delta)}{2n}}\right] \]contains \(F(t)\) uniformly in \(t\). This is the Dvoretzky-Kiefer-Wolfowitz confidence band, which is exactly distribution-free.
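A simulation sketch of the band's coverage, using Uniform(0,1) data so that \(F(t) = t\) and the Kolmogorov-Smirnov statistic has a simple closed form at the order statistics (sample size, \(\delta\), and replication count are illustrative):

```python
import numpy as np

# DKW sketch: the band F_n(t) +/- sqrt(log(2/delta)/(2n)) covers F uniformly
# with probability at least 1 - delta; check by simulation for Uniform(0,1).
rng = np.random.default_rng(1)
n, delta, reps = 100, 0.05, 2000
eps = np.sqrt(np.log(2 / delta) / (2 * n))
covered = 0
for _ in range(reps):
    x = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    # sup_t |F_n(t) - t|: check both sides of each jump of the empirical CDF
    ks = max(np.max(i / n - x), np.max(x - (i - 1) / n))
    covered += ks <= eps
print(f"empirical coverage = {covered / reps:.3f} (target >= {1 - delta})")
assert covered / reps >= 1 - delta - 0.02
```

The empirical coverage sits at or slightly above \(1 - \delta\), consistent with the sharpness of the constant 2: at moderate \(n\) the bound is close to, but not below, the nominal level.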
Section 4.4: Donsker’s Theorem
The uniform CLT for the empirical process is Donsker’s theorem. Define the centered and scaled empirical process:
\[ \mathbb{G}_n(t) = \sqrt{n}(\mathbb{F}_n(t) - F(t)). \]Donsker’s theorem asserts that \(\mathbb{G}_n\) converges weakly in the Skorokhod space \(D[-\infty,\infty]\) to the process \(\mathbb{B} \circ F\), where the Brownian bridge \(\mathbb{B}\) is a Gaussian process with \(\mathbb{E}[\mathbb{B}(s)] = 0\) and \(\text{Cov}(\mathbb{B}(s), \mathbb{B}(t)) = \min(s,t) - st\).
As a consequence, for continuous \(F\), \(\sqrt{n} \sup_t |\mathbb{F}_n(t) - F(t)| \xrightarrow{d} \sup_{t \in [0,1]} |\mathbb{B}(t)|\), whose law is the Kolmogorov distribution underlying the Kolmogorov-Smirnov test.
Section 4.5: VC Theory
VC (Vapnik-Chervonenkis) theory provides uniform laws of large numbers for classes of indicator functions.
The shatter coefficient (growth function) of a class of sets \(\mathcal{C}\) is \(m(\mathcal{C}, n) = \max_{x_1,\ldots,x_n} \#\{C \cap \{x_1,\ldots,x_n\} : C \in \mathcal{C}\}\), the largest number of distinct subsets of \(n\) points picked out by \(\mathcal{C}\). The VC dimension \(V(\mathcal{C})\) is the largest \(n\) such that \(m(\mathcal{C}, n) = 2^n\) (i.e., some set of \(n\) points can be shattered by \(\mathcal{C}\)). By Sauer’s lemma, if \(V(\mathcal{C}) = d < \infty\) then \(m(\mathcal{C}, n) \leq \sum_{j=0}^d \binom{n}{j} \leq (n+1)^d\): the growth is polynomial rather than exponential. Symmetrization then yields the VC inequality
\[ \mathbb{E}\!\left[\sup_{C \in \mathcal{C}} |\mathbb{P}_n(C) - P(C)|\right] \leq c\sqrt{\frac{d \log(n+1)}{n}} \]for a universal constant \(c\), and with high probability the supremum is of order \(\sqrt{d \log n / n}\).
Section 4.6: Concentration Inequalities
Concentration inequalities quantify how a random variable concentrates around its mean or median.
A mean-zero random variable \(X\) is sub-Gaussian with parameter \(\sigma^2\) if \(\mathbb{E}[e^{\lambda X}] \leq e^{\lambda^2 \sigma^2 / 2}\) for all \(\lambda \in \mathbb{R}\). Sub-Gaussian random variables satisfy Gaussian-type tail bounds: \(P(X > t) \leq e^{-t^2/(2\sigma^2)}\) for all \(t \geq 0\), by the Chernoff argument \(P(X > t) \leq e^{-\lambda t}\,\mathbb{E}[e^{\lambda X}]\) optimized at \(\lambda = t/\sigma^2\).
Sub-exponential random variables are heavier-tailed: \(X\) is sub-exponential with parameters \((\nu^2, b)\) if \(\mathbb{E}[e^{\lambda X}] \leq e^{\nu^2 \lambda^2/2}\) for \(|\lambda| \leq 1/b\). The Bernstein condition holds: \(\mathbb{E}[|X|^k] \leq k! b^{k-2} \nu^2 / 2\) for \(k \geq 2\). Chi-squared variables, products of Gaussian variables, and bounded variables fall into this class.
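The simplest sub-Gaussian example is a Rademacher variable (\(\pm 1\) with probability \(1/2\)): its MGF is \(\cosh\lambda\), and the term-by-term series comparison \(\cosh\lambda \leq e^{\lambda^2/2}\) shows it is sub-Gaussian with \(\sigma^2 = 1\). A minimal deterministic check on a grid:

```python
import math

# Sub-Gaussian sketch: a Rademacher variable has MGF E[e^{lam X}] = cosh(lam),
# and cosh(lam) <= exp(lam^2/2) for all lam (compare the series term by term:
# lam^{2k}/(2k)! <= lam^{2k}/(2^k k!)), so sigma^2 = 1 works.
for lam in [x / 10 for x in range(-50, 51)]:
    mgf = math.cosh(lam)
    bound = math.exp(lam**2 / 2)
    assert mgf <= bound + 1e-12
print("cosh(lam) <= exp(lam^2/2) verified on a grid over [-5, 5]")
```

Combined with independence, this bound is exactly what drives Hoeffding's inequality for sums of bounded variables.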
Chapter 5: Von Mises Calculus and Functional Delta Method
Von Mises (1947) introduced a calculus for statistical functionals that extends the scalar delta method to functionals of the empirical distribution. This framework unifies many nonparametric estimators and provides their asymptotic distributions.
Section 5.1: Statistical Functionals
A statistical functional is a map \(T\) from a set of distribution functions to \(\mathbb{R}\) (or another normed space); the natural plug-in estimator of \(T(F)\) is \(T(\mathbb{F}_n)\). Examples: the mean functional \(T(F) = \int x \, dF(x)\), the variance functional \(T(F) = \int (x - \int y \, dF(y))^2 \, dF(x)\), the quantile functional \(T(F) = F^{-1}(p)\), and the Mann-Whitney functional \(T(F,G) = \int F(x) \, dG(x)\).
Section 5.2: Hadamard Differentiability
A functional \(T : \mathbb{D} \to \mathbb{E}\) between normed spaces is Hadamard differentiable at \(F\) tangentially to \(\mathbb{D}_0\) if there exists a continuous linear map \(T'_F : \mathbb{D}_0 \to \mathbb{E}\) such that
\[ \frac{T(F + t_n h_n) - T(F)}{t_n} \to T'_F(h) \]for all sequences \(t_n \to 0\) and \(h_n \to h\) in \(\mathbb{D}_0\).
Hadamard differentiability is weaker than Fréchet differentiability (which requires uniform convergence over all bounded sets of directions) but stronger than Gâteaux differentiability (which only considers fixed directions). Hadamard differentiability is precisely what is needed for the functional delta method.
Section 5.3: The Influence Function
The influence function of \(T\) at \(F\) is
\[ \text{IF}(x; T, F) = \lim_{\epsilon \downarrow 0} \frac{T\big((1-\epsilon)F + \epsilon\,\delta_x\big) - T(F)}{\epsilon}, \]the Gâteaux derivative of \(T\) at \(F\) in the direction \(\delta_x - F\).
The influence function measures the effect of a small point mass contamination at \(x\) on the functional. It plays the role of a “derivative” that determines the first-order asymptotic behavior of \(T(\mathbb{P}_n)\).
Example: Sample Mean. For \(T(F) = \int x \, dF(x)\), \(\text{IF}(x; T, F) = x - \int y \, dF(y) = x - \mu\).
Example: Sample Variance. For \(T(F) = \int (x - \mu)^2 \, dF(x)\), \(\text{IF}(x; T, F) = (x - \mu)^2 - \sigma^2\).
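The variance example can be checked numerically: for a discrete \(F\), the contaminated functional \(T((1-\epsilon)F + \epsilon\delta_x)\) is computable in closed form, and a finite-difference Gâteaux derivative should match \((x-\mu)^2 - \sigma^2\). A minimal sketch (the four-point support is an illustrative choice):

```python
import numpy as np

# Influence-function sketch: for T(F) = int (x - mu_F)^2 dF,
# IF(x) = (x - mu)^2 - sigma^2. Check via a finite-difference Gateaux
# derivative at a discrete F with equal weights.
data = np.array([1.0, 2.0, 4.0, 7.0])   # support of a discrete F
mu, sigma2 = data.mean(), data.var()    # mean and variance of this F

def T_contaminated(x, eps):
    """Variance of F_eps = (1 - eps) F + eps * delta_x, computed exactly."""
    mu_eps = (1 - eps) * mu + eps * x
    return (1 - eps) * np.mean((data - mu_eps)**2) + eps * (x - mu_eps)**2

x, eps = 10.0, 1e-6
fd = (T_contaminated(x, eps) - sigma2) / eps   # finite-difference derivative
if_formula = (x - mu)**2 - sigma2
print(f"finite difference = {fd:.4f}, formula = {if_formula:.4f}")
assert abs(fd - if_formula) < 1e-3
```

The quadratic growth of the influence function in \(x\) is the formal statement that the variance is not a robust functional: a single distant contamination point has unbounded effect.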
Section 5.4: The Functional Delta Method
For real-valued \(T\) with influence function \(\text{IF}(x; T, P)\), the linear functional \(T'_P(h) = \int \text{IF}(x; T, P) \, dh(x)\). By Donsker’s theorem and the functional delta method:
\[ \sqrt{n}(T(\mathbb{P}_n) - T(P)) \xrightarrow{d} N\!\left(0, \int \text{IF}^2(x; T, P) \, dP(x)\right), \]provided \(\int \text{IF}^2 \, dP < \infty\) and \(\int \text{IF}(x; T, P) \, dP(x) = 0\) (which holds for Fisher-consistent \(T\)).
Section 5.5: U-Statistics
U-statistics are another class of estimators for which the influence function approach yields clean asymptotics.
Define the projection \(h_1(x) = \mathbb{E}[h(x, X_2, \ldots, X_m)] - \theta\). Hoeffding’s decomposition gives
\[ \sqrt{n}(U_n - \theta) = \sqrt{n} \cdot \frac{m}{n} \sum_{i=1}^n h_1(X_i) + o_p(1) \xrightarrow{d} N(0, m^2 \sigma_1^2), \]where \(\sigma_1^2 = \text{Var}(h_1(X_1))\). The leading term is \(m\) times the sample average of \(h_1\), so the asymptotic variance is \(m^2 \sigma_1^2\). The Mann-Whitney statistic \(U_{mn} = (mn)^{-1}\sum_{i,j} \mathbf{1}_{X_i < Y_j}\) is a canonical example.
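A concrete sketch: with the order-2 kernel \(h(x,y) = (x-y)^2/2\), the U-statistic is exactly the unbiased sample variance \(s^2\), via the identity \(\sum_{i<j}(x_i - x_j)^2 = n\sum_i (x_i - \bar{x})^2\):

```python
import itertools
import numpy as np

# U-statistic sketch: with kernel h(x, y) = (x - y)^2 / 2 (order m = 2),
# the U-statistic equals the unbiased sample variance s^2.
def u_statistic(x, h, m=2):
    combos = list(itertools.combinations(range(len(x)), m))
    return sum(h(*(x[i] for i in idx)) for idx in combos) / len(combos)

x = np.array([2.0, 3.0, 5.0, 8.0, 13.0])
h = lambda a, b: (a - b)**2 / 2
u = u_statistic(x, h)
s2 = x.var(ddof=1)   # unbiased sample variance
print(f"U-statistic = {u:.4f}, s^2 = {s2:.4f}")
assert abs(u - s2) < 1e-12
```

The brute-force enumeration over \(\binom{n}{m}\) subsets is only for illustration; in practice U-statistics of low order admit \(O(n)\) or \(O(n\log n)\) computations.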
Chapter 6: Asymptotic Approximations
Beyond the first-order normal approximation, Edgeworth expansions provide corrections that account for skewness and kurtosis, while Laplace’s method approximates integrals with a Gaussian kernel centered at the mode.
Section 6.1: Edgeworth Expansions
Let \(X_1, \ldots, X_n\) be i.i.d. with mean 0, variance \(\sigma^2\), skewness \(\kappa_3 = \mathbb{E}[X^3]\), and excess kurtosis \(\kappa_4 = \mathbb{E}[X^4] - 3\sigma^4\). Let \(S_n = (X_1 + \cdots + X_n)/(\sigma\sqrt{n})\). The Edgeworth expansion of the CDF of \(S_n\) to order \(n^{-1/2}\) is:
\[ P(S_n \leq x) = \Phi(x) - \phi(x) \frac{\kappa_3}{6\sigma^3 \sqrt{n}} H_2(x) + O(n^{-1}), \]where \(\phi\) is the standard normal density, and \(H_2(x) = x^2 - 1\) is the second Hermite polynomial. The correction term reflects the skewness of the distribution and is of order \(n^{-1/2}\).
To order \(n^{-1}\), additional terms involving \(\kappa_4\) and \(\kappa_3^2\) appear:
\[ P(S_n \leq x) = \Phi(x) - \phi(x)\left[\frac{\kappa_3}{6\sigma^3\sqrt{n}} H_2(x) + \frac{1}{n}\left(\frac{\kappa_4}{24\sigma^4} H_3(x) + \frac{\kappa_3^2}{72\sigma^6} H_5(x)\right)\right] + O(n^{-3/2}), \]where \(H_k\) are the (probabilists’) Hermite polynomials, defined by \((-1)^k \phi^{(k)}(x) = \phi(x) H_k(x)\) and satisfying \(H_k'(x) = k H_{k-1}(x)\).
The Cornish-Fisher expansion inverts the Edgeworth expansion to give quantile approximations: the \(p\)-th quantile of \(S_n\) satisfies
\[ q_{n,p} = z_p + \frac{\kappa_3}{6\sigma^3 \sqrt{n}}(z_p^2 - 1) + O(n^{-1}), \]where \(z_p = \Phi^{-1}(p)\). This correction improves upon the naive approximation \(q_{n,p} \approx z_p\).
Section 6.2: Laplace Approximation
The Laplace approximation evaluates integrals of the form \(\int e^{n h(x)} dx\) by expanding around the mode: if \(h\) has a unique interior maximum at \(x_0\) with \(h''(x_0) < 0\), the second-order Taylor expansion \(h(x) \approx h(x_0) + \tfrac{1}{2}h''(x_0)(x - x_0)^2\) gives
\[ \int e^{n h(x)}\, dx = e^{n h(x_0)} \sqrt{\frac{2\pi}{n|h''(x_0)|}}\left(1 + O(n^{-1})\right), \]where we used the Gaussian integral \(\int e^{-a u^2/2} du = \sqrt{2\pi/a}\) for \(a = n|h''(x_0)|\). Higher-order error terms arise from the cubic and quartic terms in the Taylor expansion. \(\square\)
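The classical worked example is Stirling's formula: writing \(n! = \int_0^\infty x^n e^{-x}\,dx\) and substituting \(x = nu\) gives \(n^{n+1}\int e^{n(\log u - u)}\,du\) with mode \(u_0 = 1\) and \(h''(u_0) = -1\), so Laplace's formula yields \(n! \approx \sqrt{2\pi n}\,(n/e)^n\). A quick numeric check:

```python
import math

# Laplace-approximation sketch: n! = int_0^inf x^n e^{-x} dx; with x = n*u
# this is n^{n+1} * int e^{n(log u - u)} du, maximized at u0 = 1 with
# h''(u0) = -1, so Laplace gives Stirling: n! ~ sqrt(2 pi n) (n/e)^n.
def stirling(n):
    return math.sqrt(2 * math.pi * n) * (n / math.e)**n

for n in [5, 10, 20]:
    ratio = math.factorial(n) / stirling(n)
    print(f"n={n}: n!/Stirling = {ratio:.5f}")
    assert abs(ratio - 1) < 1.0 / (10 * n)   # relative error ~ 1/(12n)
```

The relative error decays like \(1/(12n)\), matching the \(O(n^{-1})\) term in the Laplace expansion.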
Application: Bayesian Posterior Approximation (Bernstein-von Mises Theorem). For a regular parametric model with likelihood \(L(\theta) = \prod_{i=1}^n f(X_i; \theta)\) and prior \(\pi(\theta)\), the posterior is
\[ \pi(\theta \mid X_1, \ldots, X_n) \propto L(\theta) \pi(\theta). \]Taking \(h(\theta) = n^{-1}(\log L(\theta) + \log\pi(\theta))\), the Laplace approximation gives a Gaussian approximation to the posterior centered at the MLE \(\hat{\theta}_n\) with variance \((\mathcal{I}(\theta_0))^{-1}/n\). The Bernstein-von Mises theorem formalizes this: under regularity conditions, the total variation distance between the posterior \(\pi(\cdot \mid X_1,\ldots,X_n)\) and \(N(\hat{\theta}_n, n^{-1}\mathcal{I}(\theta_0)^{-1})\) converges to zero in \(P_{\theta_0}\)-probability.
Chapter 7: Resampling Methods
Resampling methods are computationally intensive alternatives to asymptotic approximations. The jackknife and bootstrap estimate the sampling distribution of a statistic without deriving analytic formulas, and permutation tests provide exact finite-sample level guarantees.
Section 7.1: The Jackknife
Let \(\hat{\theta}_n = \theta(X_1, \ldots, X_n)\) be an estimator. Let \(\hat{\theta}_{n,-i}\) denote the estimator computed on the sample with \(X_i\) deleted, and let \(\bar{\theta}_{.} = n^{-1}\sum_{i=1}^n \hat{\theta}_{n,-i}\).
Quenouille’s jackknife bias estimator is \(\hat{b}_J = (n-1)(\bar{\theta}_{.} - \hat{\theta}_n)\), and the jackknife bias-corrected estimator is \(\hat{\theta}_J = \hat{\theta}_n - \hat{b}_J = n\hat{\theta}_n - (n-1)\bar{\theta}_{.}\).
For a smooth functional \(T(F)\), the jackknife bias estimator satisfies \(\hat{b}_J = O_p(n^{-1})\), and the corrected estimator \(\hat{\theta}_J\) has bias of order \(n^{-2}\) rather than \(n^{-1}\). The jackknife variance estimator is
\[ \hat{V}_J = \frac{n-1}{n} \sum_{i=1}^n (\hat{\theta}_{n,-i} - \bar{\theta}_{.})^2. \]For smooth statistics (differentiable functionals of the empirical distribution), \(\hat{V}_J\) is a consistent estimator of \(\text{Var}(\hat{\theta}_n)\). The delete-\(d\) jackknife generalizes this by deleting \(d\) observations at a time, which is necessary for consistent variance estimation when \(\hat{\theta}\) is not a smooth functional (e.g., extreme order statistics).
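A minimal sketch of the variance formula: for the sample mean, \(\hat{\theta}_{n,-i} - \bar{\theta}_{.} = -(X_i - \bar{X})/(n-1)\), so the jackknife variance estimator reproduces the classical \(s^2/n\) exactly, which the code below verifies on illustrative data:

```python
import numpy as np

# Jackknife sketch: leave-one-out recomputation of a statistic. For the
# sample mean, V_J equals the classical estimate s^2 / n exactly.
def jackknife_variance(x, stat):
    n = len(x)
    loo = np.array([stat(np.delete(x, i)) for i in range(n)])   # leave-one-out
    return (n - 1) / n * np.sum((loo - loo.mean())**2)

x = np.array([3.1, 4.7, 2.2, 5.9, 4.4, 3.8])
vj = jackknife_variance(x, np.mean)
classical = x.var(ddof=1) / len(x)   # s^2 / n
print(f"jackknife variance = {vj:.6f}, s^2/n = {classical:.6f}")
assert abs(vj - classical) < 1e-12
```

The same `jackknife_variance` function applies unchanged to any smooth statistic (e.g., `np.median` would run, but, per the delete-\(d\) discussion above, the delete-1 jackknife is inconsistent for quantiles).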
Section 7.2: The Bootstrap
Efron’s bootstrap (1979) estimates the sampling distribution of \(\hat{\theta}_n\) by resampling from the empirical distribution \(\mathbb{P}_n\).
The Bootstrap Algorithm:
- Draw \(X_1^*, \ldots, X_n^*\) i.i.d. from \(\mathbb{P}_n\) (sampling with replacement from the observed data).
- Compute \(\hat{\theta}_n^* = T(\mathbb{P}_n^*)\) where \(\mathbb{P}_n^*\) is the empirical measure of the bootstrap sample.
- Repeat \(B\) times to obtain \(\hat{\theta}_{n,1}^*, \ldots, \hat{\theta}_{n,B}^*\).
- Approximate the distribution of \(\hat{\theta}_n - T(P)\) by the conditional distribution of \(\hat{\theta}_n^* - \hat{\theta}_n\) given the data.
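The steps above can be sketched directly; the data-generating distribution, sample size, and \(B\) are illustrative choices, with the sample median as the statistic:

```python
import numpy as np

# Bootstrap sketch of the algorithm above: standard error and percentile
# interval for the sample median of exponential data.
rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=100)   # true median = 2*ln 2 ~ 1.386
B = 2000
boot = np.array([np.median(rng.choice(x, size=len(x), replace=True))
                 for _ in range(B)])
se = boot.std(ddof=1)                       # bootstrap standard error
lo, hi = np.quantile(boot, [0.025, 0.975])  # percentile interval
print(f"median = {np.median(x):.3f}, boot SE = {se:.3f}, "
      f"95% percentile CI = [{lo:.3f}, {hi:.3f}]")
assert lo < np.median(x) < hi and se > 0
```

No analytic variance formula for the median was needed; the resampling replaces the density-at-the-median calculation that the asymptotic approach would require.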
The proof uses the functional delta method: under the bootstrap, \(\sqrt{n}(\mathbb{P}_n^* - \mathbb{P}_n) \xrightarrow{d^*} \mathbb{G}_P\) (Brownian bridge) conditionally in probability, so by Hadamard differentiability, the bootstrap distribution of \(\sqrt{n}(T(\mathbb{P}_n^*) - T(\mathbb{P}_n))\) converges to the true limiting distribution \(T'_P(\mathbb{G}_P)\).
Bootstrap Confidence Intervals:
- Percentile interval: \(\left[\hat{\theta}^*_{(\alpha/2)}, \hat{\theta}^*_{(1-\alpha/2)}\right]\) where quantiles are taken from the bootstrap distribution.
- Basic (reflected) interval: \(\left[2\hat{\theta}_n - \hat{\theta}^*_{(1-\alpha/2)}, 2\hat{\theta}_n - \hat{\theta}^*_{(\alpha/2)}\right]\).
- Bootstrap-\(t\) interval: \(\left[\hat{\theta}_n - z^*_{1-\alpha/2} \hat{\text{se}}, \hat{\theta}_n - z^*_{\alpha/2} \hat{\text{se}}\right]\) where \(z^*\) are quantiles of the bootstrapped t-statistic.
- BCa (Bias-corrected and accelerated): Adjusts for bias and skewness using a bias-correction constant \(z_0\) and an acceleration constant \(a\), achieving second-order accuracy.
Section 7.3: Permutation Tests
Permutation tests provide exact finite-sample level guarantees under the null hypothesis of exchangeability.
The advantage over the bootstrap is exactness; the disadvantage is that validity requires exchangeability under \(H_0\), which may be a strong assumption. The bootstrap is more flexible but only approximately valid. For two-sample testing, permutation tests require that the two populations have the same distribution under \(H_0\) (not just the same mean), which is the complete null hypothesis.
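A minimal two-sample sketch using the difference of means as the test statistic; the Monte Carlo p-value (with the add-one correction that preserves validity) approximates the exact permutation p-value. The sample sizes, shift, and permutation count are illustrative:

```python
import numpy as np

# Permutation-test sketch: two-sample test of equal distributions under
# exchangeability, using random permutations of the pooled sample.
def perm_test(x, y, n_perm=5000, seed=0):
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    observed = x.mean() - y.mean()
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        stat = perm[:len(x)].mean() - perm[len(x):].mean()
        count += abs(stat) >= abs(observed)
    return (count + 1) / (n_perm + 1)   # add-one correction keeps the test valid

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, size=30)
y = rng.normal(1.0, 1.0, size=30)   # shifted alternative
p = perm_test(x, y)
print(f"permutation p-value = {p:.4f}")
assert 0 < p <= 1
```

Under \(H_0\) every permutation of the pooled sample is equally likely, which is what makes the p-value valid at finite \(n\) without any distributional assumptions beyond exchangeability.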
Chapter 8: High-Dimensional Inference
Classical statistical theory assumes the dimension \(p\) is fixed while \(n \to \infty\). High-dimensional inference addresses the regime where \(p\) grows with \(n\), potentially with \(p \gg n\). New phenomena emerge: the curse of dimensionality, the phase transition in regularization, and the breakdown of classical tests.
Section 8.1: Multiple Testing
When testing \(m\) hypotheses simultaneously, the probability of at least one false rejection (the family-wise error rate, FWER) inflates unless correction is applied.
Bonferroni correction controls FWER: reject hypothesis \(H_i\) if \(p_i \leq \alpha/m\). Then \(\text{FWER} \leq \alpha\) by a union bound. This is conservative when the tests are positively correlated.
The Benjamini-Hochberg (BH) procedure instead controls the false discovery rate \(\text{FDR} = \mathbb{E}[V/\max(R,1)]\), the expected proportion of false rejections among all rejections: order the p-values \(p_{(1)} \leq \cdots \leq p_{(m)}\), let \(k = \max\{i : p_{(i)} \leq i\alpha/m\}\), and reject the hypotheses with the \(k\) smallest p-values. Under independence of the null p-values, \(\text{FDR} \leq m_0 \alpha / m \leq \alpha\), where \(m_0\) is the number of true null hypotheses.
Proof idea. Using the step-up structure and the fact that each null p-value is Uniform\((0,1)\), a careful argument with indicator random variables (conditioning on the configuration of the remaining \(m-1\) p-values) shows that
\[ \mathbb{E}\!\left[\frac{\mathbf{1}_{i \text{ rejected}}}{R}\right] \leq \frac{\alpha}{m} \quad \text{for each } i \in \mathcal{H}_0. \]Summing over the true nulls gives:
\[ \text{FDR} = \sum_{i \in \mathcal{H}_0} E\!\left[\frac{\mathbf{1}_{i \text{ rejected}}}{R}\right] \leq \sum_{i \in \mathcal{H}_0} \frac{\alpha}{m} = \frac{m_0 \alpha}{m} \leq \alpha. \quad \square \]Storey’s approach estimates the proportion of true nulls \(\hat{\pi}_0 = \#\{p_i > \lambda\} / (m(1-\lambda))\) for a tuning parameter \(\lambda \in (0,1)\), then uses \(\hat{\pi}_0\) in place of 1 in the BH threshold, yielding an adaptive procedure with higher power when many nulls are false.
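The step-up procedure is short to implement; a minimal sketch with an illustrative p-value vector (note that \(p_{(4)} = 0.04\) passes its threshold \(4\alpha/5\), so all four smallest p-values are rejected even though they need not each pass their own thresholds individually):

```python
import numpy as np

# Benjamini-Hochberg sketch: step-up procedure at level alpha; returns the
# boolean rejection vector for a vector of p-values.
def benjamini_hochberg(pvals, alpha=0.05):
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m     # i*alpha/m for sorted p_(i)
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])         # largest i with p_(i) <= i*alpha/m
        reject[order[:k + 1]] = True             # reject the k+1 smallest p-values
    return reject

pvals = [0.001, 0.018, 0.02, 0.04, 0.74]
rej = benjamini_hochberg(pvals, alpha=0.05)
print(rej)
assert rej.tolist() == [True, True, True, True, False]
```

By contrast, Bonferroni at the same level would reject only the hypotheses with \(p_i \leq 0.01\), illustrating the power gain of FDR control.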
Section 8.2: The James-Stein Estimator and SURE
Consider estimating \(\theta \in \mathbb{R}^p\) from \(X \sim N(\theta, I_p)\). The MLE is \(\hat{\theta} = X\), with risk \(\mathbb{E}[\|X - \theta\|^2] = p\).
The James-Stein estimator
\[ \hat{\theta}^{JS} = \left(1 - \frac{p-2}{\|X\|^2}\right) X \]has risk \(\mathbb{E}[\|\hat{\theta}^{JS} - \theta\|^2] = p - (p-2)^2\, \mathbb{E}_\theta\!\left[\|X\|^{-2}\right] < p\) for every \(\theta\) when \(p \geq 3\), so the MLE \(X\) is inadmissible under squared error loss for \(p \geq 3\).
This is a striking result: shrinkage toward zero uniformly dominates the MLE in dimension three or higher. The shrinkage factor \((p-2)/\|X\|^2\) adaptively adjusts to the signal strength.
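A simulation sketch comparing the two risks (the dimension, truth \(\theta\), and replication count are illustrative choices):

```python
import numpy as np

# James-Stein sketch: simulated risk of the MLE (exactly p) vs. the shrinkage
# estimator (1 - (p-2)/||X||^2) X for X ~ N(theta, I_p).
rng = np.random.default_rng(3)
p, reps = 10, 20000
theta = np.full(p, 0.5)                       # a moderate-signal truth
X = theta + rng.standard_normal((reps, p))
shrink = 1 - (p - 2) / np.sum(X**2, axis=1)   # can go negative: plain JS
js = shrink[:, None] * X
risk_mle = np.mean(np.sum((X - theta)**2, axis=1))
risk_js = np.mean(np.sum((js - theta)**2, axis=1))
print(f"MLE risk ~ {risk_mle:.2f}, James-Stein risk ~ {risk_js:.2f} (p = {p})")
assert risk_js < risk_mle
```

The positive-part variant \((1 - (p-2)/\|X\|^2)_+ X\) dominates the plain version shown here and is the form used in practice.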
Stein’s Unbiased Risk Estimate (SURE) provides an unbiased estimator of the risk of any differentiable shrinkage estimator without knowing \(\theta\).
The identity \(\mathbb{E}[g_i(X)(X_i - \theta_i)] = \sigma^2 \mathbb{E}[\partial g_i/\partial x_i]\) (Stein’s identity) is the key tool.
Section 8.3: Donoho-Johnstone Wavelet Shrinkage
Consider the normal means problem: observe \(Y_i = \theta_i + \epsilon_i\), \(i = 1, \ldots, n\), with \(\epsilon_i \sim N(0, \sigma^2)\) i.i.d., and \(\theta = (\theta_1, \ldots, \theta_n)\) assumed sparse.
Soft thresholding at level \(\lambda\) is \(\eta_\lambda(y) = \text{sign}(y)(|y| - \lambda)_+\). Donoho and Johnstone (1994) proved that the soft-threshold estimator \(\hat{\theta}_i = \eta_\lambda(Y_i)\) with \(\lambda = \sigma\sqrt{2\log n}\) (the universal threshold) achieves near-minimax risk over sparse and smooth function classes.
Their oracle inequality bounds the risk of the universal-threshold estimator by
\[ \mathbb{E}\|\hat{\theta} - \theta\|^2 \leq (2\log n + 1)\left(\sigma^2 + \sum_{i=1}^n \min(\theta_i^2, \sigma^2)\right), \]showing at most a \(\log n\) factor over the oracle risk \(\sum_i \min(\theta_i^2, \sigma^2)\), which is attainable only with knowledge of \(\theta\).
SURE provides an adaptive choice of threshold by minimizing \(\text{SURE}(\lambda)\) over \(\lambda\), yielding a data-driven threshold without the \(\log n\) factor in many practical problems.
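A minimal sketch of soft thresholding at the universal threshold, together with the closed-form SURE for soft thresholding (the sparse truth and dimensions are illustrative):

```python
import numpy as np

# Soft-thresholding sketch for the normal means model Y_i = theta_i + N(0,1),
# with the universal threshold and SURE for the soft-threshold family.
def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def sure_soft(y, lam, sigma=1.0):
    """Stein's unbiased risk estimate for soft thresholding at level lam:
    SURE = n*sigma^2 + sum(min(|y|, lam)^2) - 2*sigma^2 * #{|y| <= lam}."""
    n = len(y)
    clipped = np.minimum(np.abs(y), lam)
    return n * sigma**2 + np.sum(clipped**2) - 2 * sigma**2 * np.sum(np.abs(y) <= lam)

rng = np.random.default_rng(11)
n, sigma = 1000, 1.0
theta = np.zeros(n); theta[:20] = 5.0        # sparse truth: 20 strong means
y = theta + rng.standard_normal(n)
lam_univ = sigma * np.sqrt(2 * np.log(n))    # universal threshold ~ 3.72
theta_hat = soft_threshold(y, lam_univ)
loss = np.sum((theta_hat - theta)**2)
print(f"threshold = {lam_univ:.3f}, loss = {loss:.1f}, "
      f"SURE at threshold = {sure_soft(y, lam_univ):.1f}")
assert loss < np.sum((y - theta)**2)         # beats the raw observations here
```

Minimizing `sure_soft` over a grid of \(\lambda\) values gives the SureShrink-style data-driven threshold mentioned above.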
Section 8.4: RKHS and Function Estimation
A reproducing kernel Hilbert space (RKHS) \(\mathcal{H}\) on \(\mathcal{X}\) is a Hilbert space of functions \(f : \mathcal{X} \to \mathbb{R}\) with reproducing kernel \(K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) satisfying:
- \(K(\cdot, x) \in \mathcal{H}\) for all \(x \in \mathcal{X}\).
- \(\langle f, K(\cdot, x)\rangle_\mathcal{H} = f(x)\) for all \(f \in \mathcal{H}\), \(x \in \mathcal{X}\) (the reproducing property).
By Mercer’s theorem, any continuous positive semidefinite kernel \(K\) on a compact space induces an RKHS. The kernel ridge regression estimator minimizes
\[ \frac{1}{n}\sum_{i=1}^n (Y_i - f(X_i))^2 + \lambda \|f\|_\mathcal{H}^2, \]and by the representer theorem, the solution has the form \(\hat{f}(x) = \sum_{i=1}^n \hat{\alpha}_i K(x, X_i)\), reducing the infinite-dimensional optimization to an \(n\)-dimensional linear system.
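A minimal sketch of kernel ridge regression via the representer theorem: with \(K_{ij} = K(X_i, X_j)\), the objective reduces to solving \((K + n\lambda I)\alpha = y\). The Gaussian kernel bandwidth and \(\lambda\) below are illustrative choices:

```python
import numpy as np

# Kernel ridge regression sketch: solve (K + n*lam*I) alpha = y, then
# predict f(x) = sum_i alpha_i K(x, X_i). Gaussian (RBF) kernel in 1-D.
def rbf_kernel(a, b, gamma=10.0):
    return np.exp(-gamma * (a[:, None] - b[None, :])**2)

rng = np.random.default_rng(5)
n, lam = 60, 1e-3
X = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(n)

K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)   # representer coefficients
fhat = K @ alpha                                      # fitted values at the X_i
mse = np.mean((fhat - np.sin(2 * np.pi * X))**2)
print(f"MSE against the true regression function = {mse:.4f}")
assert mse < 0.05
```

The infinite-dimensional penalized problem over \(\mathcal{H}\) has collapsed to a single \(n \times n\) linear solve, which is the practical content of the representer theorem.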
Section 8.5: Model Selection (AIC and BIC)
Akaike Information Criterion (AIC): Akaike (1973) proposed selecting the model minimizing
\[ \text{AIC} = -2\ell(\hat{\theta}) + 2k, \]where \(\ell(\hat{\theta})\) is the log-likelihood at the MLE and \(k\) is the number of parameters. The derivation shows that \(-2\ell(\hat{\theta})/n\) is a biased estimator of the Kullback-Leibler divergence \(D_{KL}(f_{true} \| f_{\hat{\theta}})\), with bias approximately \(k/n\). AIC corrects for this bias, so model selection by AIC is equivalent to choosing the model with the best asymptotically unbiased estimate of prediction KL divergence.
Bayesian Information Criterion (BIC): Schwarz (1978) proposed
\[ \text{BIC} = -2\ell(\hat{\theta}) + k\log n. \]The penalty \(k\log n\) arises from a Laplace approximation to the marginal likelihood (Bayes factor): for models \(\mathcal{M}_1\) and \(\mathcal{M}_2\) with equal prior probability,
\[ \log P(\mathcal{M}_1 \mid X) - \log P(\mathcal{M}_2 \mid X) \approx \frac{1}{2}(\text{BIC}_2 - \text{BIC}_1). \]BIC is consistent (it selects the true model with probability tending to 1 as \(n \to \infty\) when the true model is in the candidate set), while AIC is not consistent but has better prediction properties when the true model is not in the candidate set.
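A small sketch comparing AIC and BIC across polynomial regression orders under Gaussian errors (simulated data; the log-likelihood uses the profiled variance \(\hat{\sigma}^2\), and \(k\) counts the regression coefficients plus the variance):

```python
import numpy as np

def gaussian_aic_bic(y, y_fit, k):
    """AIC and BIC for a Gaussian model with profiled error variance."""
    n = y.size
    sigma2 = np.mean((y - y_fit) ** 2)                    # MLE of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)  # log-likelihood at the MLE
    return -2 * loglik + 2 * k, -2 * loglik + k * np.log(n)

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + 0.3 * rng.normal(size=n)  # true model: quadratic

results = {}
for degree in range(5):
    coefs = np.polyfit(x, y, degree)
    aic, bic = gaussian_aic_bic(y, np.polyval(coefs, x), k=degree + 2)
    results[degree] = (aic, bic)
    print(degree, round(aic, 1), round(bic, 1))
```

Both criteria penalize the underfit degrees heavily; BIC's \(k \log n\) penalty (here \(\log 200 \approx 5.3\) per parameter versus AIC's 2) discourages the superfluous higher-order terms more strongly.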
Chapter 9: Advanced Topics
Section 9.1: Post-Selection Inference
Classical inference assumes the model (and in particular, which parameters to test) is chosen prior to examining the data. In practice, model selection is performed on the same data used for inference, invalidating classical p-values and confidence intervals.
The Lasso estimator \(\hat{\beta}(\lambda) = \arg\min_\beta \left[\frac{1}{2n}\|Y - X\beta\|^2 + \lambda\|\beta\|_1\right]\) selects a sparse model, but because the selected set depends on the same response used for fitting, naive OLS inference on the selected model yields inflated type-I error rates.
Post-selection inference (POSI) seeks valid inference after model selection. The selective inference framework (Berk et al., 2013; Lee et al., 2016) conditions on the selected model to construct valid tests.
For the Lasso at fixed \(\lambda\), the event that the selected model \(\hat{M} = \text{supp}(\hat{\beta}(\lambda))\) equals \(M\) with a given sign pattern is polyhedral: it occurs if and only if \(AY \leq b\) for a matrix \(A\) and vector \(b\) depending on \(M\), the signs, and \(\lambda\). Conditioning on this polyhedral set, the Polyhedral Lemma of Lee et al. yields an exact pivot: for \(Y \sim N(\mu, \sigma^2 I)\) and a contrast vector \(\eta\),
\[ F^{[\mathcal{V}^-,\,\mathcal{V}^+]}_{\eta^T\mu,\;\sigma^2\|\eta\|^2}(\eta^T Y) \]is Uniform\([0,1]\) conditional on the selection event, where \(F^{[a,b]}_{m,v}\) denotes the CDF of a \(N(m, v)\) distribution truncated to \([a,b]\) and \(\mathcal{V}^-\), \(\mathcal{V}^+\) are the truncation bounds on \(\eta^T Y\) induced by the polyhedron. Evaluating the pivot under the null \(\eta^T\mu = 0\) gives exact p-values, and inverting it gives confidence intervals for regression coefficients that are valid conditional on Lasso model selection.
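A one-dimensional simulation sketch of the truncated-normal pivot (the interval here plays the role of \([\mathcal{V}^-, \mathcal{V}^+]\); the setup is illustrative, not the full Lasso construction). Under the null, pivot values computed from draws conditioned on the truncation region are exactly Uniform\([0,1]\):

```python
import numpy as np
from scipy.stats import norm

def truncated_normal_pivot(v, lower, upper, mean=0.0, sd=1.0):
    """CDF of N(mean, sd^2) truncated to [lower, upper], evaluated at v."""
    num = norm.cdf((v - mean) / sd) - norm.cdf((lower - mean) / sd)
    den = norm.cdf((upper - mean) / sd) - norm.cdf((lower - mean) / sd)
    return num / den

rng = np.random.default_rng(0)
lower, upper = 0.5, 3.0   # truncation bounds (playing the role of V-, V+)

# Rejection-sample Z ~ N(0,1) conditioned on lower <= Z <= upper,
# then compute the pivot for each accepted draw.
z = rng.normal(size=200_000)
z = z[(z >= lower) & (z <= upper)]
pivots = truncated_normal_pivot(z, lower, upper)

print(pivots.mean())      # approximately 0.5, as for a Uniform[0,1] sample
```

The same CDF transform, applied to \(\eta^T Y\) with the data-dependent bounds \(\mathcal{V}^\pm\), is what produces selective p-values in the Lee et al. framework.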
Section 9.2: Inference Under Differential Privacy
Differential privacy (Dwork et al., 2006) provides a rigorous definition of privacy for statistical inference, quantifying the maximum information leakage about any individual from the output of a computation. A randomized mechanism \(\mathcal{M}\) is \((\epsilon, \delta)\)-differentially private if for all datasets \(D, D'\) differing in a single record and all measurable sets \(S\),
\[ P(\mathcal{M}(D) \in S) \leq e^\epsilon\, P(\mathcal{M}(D') \in S) + \delta. \]Pure \(\epsilon\)-differential privacy corresponds to \(\delta = 0\).
The local differential privacy (LDP) model requires each individual to privatize their own data before sharing. Under \(\epsilon\)-LDP, each user sends a privatized version \(Z_i = \mathcal{M}(X_i)\) satisfying \(P(Z_i \in S \mid X_i = x) \leq e^\epsilon P(Z_i \in S \mid X_i = x')\) for all \(x, x', S\).
Privacy-Accuracy Tradeoff and Fisher Information Bounds: For any \(\epsilon\)-LDP mechanism, the Fisher information \(\mathcal{I}_Z(\theta)\) carried by the privatized observation \(Z\) is bounded as
\[ \mathcal{I}_Z(\theta) \leq \frac{(e^\epsilon - 1)^2}{4}\, \mathcal{I}(\theta), \]where \(\mathcal{I}(\theta)\) is the single-observation Fisher information. For small \(\epsilon\), the reduction factor is approximately \(\epsilon^2/4\), meaning that LDP reduces effective sample size by a factor of \(\Theta(\epsilon^2)\).
Randomized Response (Warner, 1965): The canonical LDP mechanism for a binary attribute \(X \in \{0,1\}\) is: report \(Z = X\) with probability \(e^\epsilon/(1+e^\epsilon)\) and \(Z = 1-X\) with probability \(1/(1+e^\epsilon)\). This satisfies \(\epsilon\)-LDP. The unbiased estimator of \(P(X=1) = p\) from \(n\) independent responses is
\[ \hat{p} = \frac{n^{-1}\sum_{i=1}^n Z_i \cdot (1+e^\epsilon) - 1}{e^\epsilon - 1}, \]with variance \(\text{Var}(\hat{p}) = \frac{p(1-p)}{n} + \frac{e^\epsilon}{n(e^\epsilon - 1)^2}\), illustrating the inflation in variance due to privacy: the second term blows up as \(\epsilon \to 0\).
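A simulation sketch of randomized response and its debiased estimator (sample size and parameters are illustrative):

```python
import numpy as np

def randomized_response(x, eps, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    keep = rng.uniform(size=x.size) < np.exp(eps) / (1 + np.exp(eps))
    return np.where(keep, x, 1 - x)

def debias(z, eps):
    """Unbiased estimator of p = P(X = 1) from privatized bits."""
    return ((1 + np.exp(eps)) * z.mean() - 1) / (np.exp(eps) - 1)

rng = np.random.default_rng(0)
n, p, eps = 100_000, 0.3, 1.0
x = (rng.uniform(size=n) < p).astype(int)   # true private bits

z = randomized_response(x, eps, rng)
p_hat = debias(z, eps)
print(p_hat)   # close to p = 0.3, but noisier than x.mean() would be
```

The raw proportion \(\bar{Z}\) is biased toward \(1/2\); the affine correction removes the bias at the cost of the extra variance term derived above.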
Under the Gaussian mechanism, a real-valued query \(q(D)\) with sensitivity \(\Delta q = \sup_{D \sim D'} |q(D) - q(D')|\) (the supremum over neighboring datasets) is privatized as \(\mathcal{M}(D) = q(D) + N(0, \sigma^2)\) with \(\sigma = \Delta q \sqrt{2\log(1.25/\delta)} / \epsilon\), which achieves \((\epsilon,\delta)\)-DP for \(\epsilon < 1\). The added noise is mean-zero, so the cost of privacy appears as variance: stronger privacy (smaller \(\epsilon\)) forces a larger \(\sigma\) and a less accurate answer.
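A sketch of the Gaussian mechanism applied to a mean query on records bounded in \([0,1]\), where changing one record moves the mean by at most \(\Delta q = 1/n\) (setup illustrative; the calibration above assumes \(\epsilon < 1\)):

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, eps, delta, rng):
    """Privatize a query value by adding N(0, sigma^2) noise,
    with sigma = sensitivity * sqrt(2 log(1.25/delta)) / eps."""
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps
    return value + sigma * rng.normal(), sigma

rng = np.random.default_rng(0)
n, eps, delta = 10_000, 0.5, 1e-5
data = rng.uniform(size=n)          # records bounded in [0, 1]

true_mean = data.mean()
private_mean, sigma = gaussian_mechanism(
    true_mean, sensitivity=1.0 / n, eps=eps, delta=delta, rng=rng
)
print(true_mean, private_mean, sigma)
```

Because the sensitivity of the mean is \(1/n\), the noise scale shrinks with the sample size: privacy is cheap for aggregate statistics over large datasets.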
The field of private hypothesis testing seeks to control type I and type II errors while maintaining differential privacy. For the simple hypothesis test \(H_0: \theta = \theta_0\) vs. \(H_1: \theta = \theta_1\), adding Laplace noise to the log-likelihood ratio yields a private likelihood ratio test, but the private sample complexity grows as \(O(1/\epsilon)\) times the non-private sample complexity, illustrating the fundamental cost of privacy.
Appendix: Key Notation and Symbols
| Symbol | Meaning |
|---|---|
| \(\mathbb{P}_n\) | Empirical measure \(n^{-1}\sum_{i=1}^n \delta_{X_i}\) |
| \(\mathbb{F}_n\) | Empirical CDF |
| \(\mathbb{G}_n\) | Centered empirical process \(\sqrt{n}(\mathbb{F}_n - F)\) |
| \(\mathbb{B}\) | Brownian bridge |
| \(\mathcal{I}(\theta)\) | Fisher information |
| \(O_p, o_p\) | Stochastic order notation |
| \(\xrightarrow{d}\) | Convergence in distribution |
| \(\xrightarrow{P}\) | Convergence in probability |
| \(\xrightarrow{\text{a.s.}}\) | Almost sure convergence |
| \(\text{IF}(x; T, F)\) | Influence function |
| \(\text{FWER}\) | Family-wise error rate |
| \(\text{FDR}\) | False discovery rate |
| \(\ell^\infty(\mathcal{F})\) | Space of bounded functions on \(\mathcal{F}\) with sup-norm |
| \(V(\mathcal{C})\) | VC dimension of class \(\mathcal{C}\) |