STAT 946: Topics in Probability and Statistics
Peijun Sang; Martin Lysy
Table of contents
Sources and References
Part I (Functional Data Analysis) — J.O. Ramsay and B.W. Silverman, Functional Data Analysis (2nd ed., Springer, 2005); Piotr Kokoszka and Matthew Reimherr, Introduction to Functional Data Analysis (Chapman and Hall/CRC, 2017); Tailen Hsing and Randall Eubank, Theoretical Foundations of Functional Data Analysis (Wiley, 2015).
Part II (Bayesian Computing) — Andrew Gelman et al., Bayesian Data Analysis (3rd ed., CRC Press); Jun Liu, Monte Carlo Strategies in Scientific Computing (Springer, 2001).
Online resources — Ramsay and Silverman fda R package documentation; Michael Betancourt’s HMC conceptual introduction; David Blei’s variational inference tutorial.
Part I: Functional Data Analysis (Fall 2025, Peijun Sang)
Chapter 1: Introduction to Functional Data Analysis
1.1 What Is Functional Data?
Classical multivariate statistics treats observations as vectors in \(\mathbb{R}^p\). Functional data analysis (FDA) takes a different perspective: each observation is a function — a curve, surface, or other object that varies over a continuum. The continuum is typically time or space, but it can be any ordered domain such as wavelength, age, or spatial coordinate.
In practice, functions are observed discretely: for each subject \(i\), one records measurements \(y_{ij} = X_i(t_{ij}) + \varepsilon_{ij}\) at a finite set of time points \(\{t_{ij}\}\). A central task of FDA is to recover the underlying functions from these noisy, discrete measurements.
FDA subsumes several related areas. Longitudinal data analysis studies repeated measurements on individuals, but the number of time points per subject is typically small and the focus is on population-level parameters. FDA treats curves as the primary objects of inference and routinely works with dozens to thousands of observation points per curve.
1.2 Basis Expansions for Functional Objects
The most practical approach to representing functional data is to expand each function in a finite-dimensional basis. Let \(\{\phi_k\}_{k=1}^K\) be a basis for a subspace of \(L^2(\mathcal{T})\). One approximates
\[ X_i(t) \approx \sum_{k=1}^{K} c_{ik}\,\phi_k(t). \]The vector \(\mathbf{c}_i = (c_{i1}, \ldots, c_{iK})^\top\) reduces the infinite-dimensional problem to a finite-dimensional one. Three basis systems dominate applications.
Fourier basis. For periodic functions on \([0,T]\), the Fourier basis is \(\{1, \sin(2\pi t/T), \cos(2\pi t/T), \sin(4\pi t/T), \cos(4\pi t/T), \ldots\}\). Fourier coefficients decay at a rate governed by the smoothness of \(X\): a function with \(m\) square-integrable derivatives has Fourier coefficients that satisfy \(|c_k| = O(k^{-m})\). The Fourier basis is ideal for signals without sharp local features.
B-spline basis. For non-periodic data, B-splines of order \(m\) with knot sequence \(\xi_0 < \xi_1 < \cdots < \xi_{K+m}\) form a basis for piecewise polynomial functions of degree \(m-1\) with \(m-2\) continuous derivatives at the interior knots. B-splines have compact support (each basis function is nonzero on at most \(m\) knot intervals), making the resulting Gram matrix \(\int \phi_k \phi_l\) banded and computationally efficient.
Wavelet basis. Wavelets provide multi-resolution representations: a dilation parameter controls the scale and a translation parameter controls the location of each basis function. Wavelets excel at representing functions with spatially localized features such as discontinuities or sharp peaks.
1.3 Penalized Least Squares for Smoothing
Given discrete noisy measurements \((t_j, y_j)\) for a single curve, the aim is to estimate the underlying smooth function \(x\). A natural criterion penalizes a goodness-of-fit term by a roughness penalty.
The penalized least squares (PLS) criterion is
\[ \text{PLS}_\lambda(x) = \sum_{j=1}^{n} \left[y_j - x(t_j)\right]^2 + \lambda \int_{\mathcal{T}} \left[x''(t)\right]^2 dt. \]The second term penalizes curvature; larger \(\lambda\) produces smoother estimates.
When the solution is expanded in a B-spline basis \(x(t) = \boldsymbol{\phi}(t)^\top \mathbf{c}\), the criterion becomes \(\|\mathbf{y} - \boldsymbol{\Phi}\mathbf{c}\|^2 + \lambda\,\mathbf{c}^\top \mathbf{P} \mathbf{c}\), where \(\boldsymbol{\Phi}_{jk} = \phi_k(t_j)\) and \(\mathbf{P}_{kl} = \int \phi_k'' \phi_l''\). The minimizer satisfies the normal equations
\[ \left(\boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \lambda \mathbf{P}\right)\hat{\mathbf{c}} = \boldsymbol{\Phi}^\top \mathbf{y}. \]The fitted values \(\hat{\mathbf{y}} = \boldsymbol{\Phi}\hat{\mathbf{c}} = \mathbf{H}_\lambda \mathbf{y}\) are linear in the data, where the hat matrix is
\[ \mathbf{H}_\lambda = \boldsymbol{\Phi}\left(\boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \lambda\mathbf{P}\right)^{-1}\boldsymbol{\Phi}^\top. \]The effective degrees of freedom of the smoother is \(\operatorname{tr}(\mathbf{H}_\lambda)\), which decreases from \(K\) (when \(\lambda = 0\)) toward the dimension of the penalty's null space as \(\lambda \to \infty\) (2 for the curvature penalty, since linear functions are unpenalized).
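As a concrete sketch of the normal equations, the following simulates noisy observations of a curve, builds a Fourier basis (for which the curvature penalty matrix is diagonal), and computes the penalized fit, hat matrix, and effective degrees of freedom. The data and basis choice are illustrative, not from the course.

```python
import numpy as np

# Simulated noisy observations of a smooth curve (illustrative data).
rng = np.random.default_rng(0)
n, K = 100, 11
t = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * t) + rng.normal(scale=0.2, size=n)

# Basis matrix Phi_{jk} = phi_k(t_j), here a Fourier basis on [0, 1].
cols = [np.ones(n)]
for m in range(1, (K - 1) // 2 + 1):
    cols += [np.sin(2 * np.pi * m * t), np.cos(2 * np.pi * m * t)]
Phi = np.column_stack(cols)

# Penalty P_{kl} = int phi_k'' phi_l'' is diagonal for the Fourier basis:
# phi'' = -(2 pi m)^2 phi and int sin^2 = int cos^2 = 1/2 on [0, 1].
freqs = np.concatenate([[0], np.repeat(np.arange(1, (K - 1) // 2 + 1), 2)])
P = np.diag((2 * np.pi * freqs) ** 4 / 2)

lam = 1e-4
A = Phi.T @ Phi + lam * P
c_hat = np.linalg.solve(A, Phi.T @ y)        # normal equations
H = Phi @ np.linalg.solve(A, Phi.T)          # hat matrix H_lambda
df = np.trace(H)                             # effective degrees of freedom
y_hat = Phi @ c_hat
```

With \(\lambda > 0\) the trace of the hat matrix falls strictly below \(K = 11\), reflecting the shrinkage applied to the high-frequency coefficients.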
1.4 Smoothing Parameter Selection
The choice of \(\lambda\) governs the bias-variance tradeoff. Two principled selection methods are standard.
Cross-validation (CV). The leave-one-out CV score is
\[ \text{CV}(\lambda) = \sum_{j=1}^{n} \left[y_j - \hat{x}^{(-j)}(t_j)\right]^2, \]where \(\hat{x}^{(-j)}\) is the PLS estimate computed without observation \(j\). By the influence-function trick for linear smoothers, this equals
\[ \text{CV}(\lambda) = \sum_{j=1}^{n} \left[\frac{y_j - \hat{x}(t_j)}{1 - h_{jj}(\lambda)}\right]^2, \]where \(h_{jj}\) is the \(j\)-th diagonal of \(\mathbf{H}_\lambda\). This allows efficient computation without refitting \(n\) times.
Generalized cross-validation (GCV). GCV replaces the individual leverage values with their average, yielding
\[ \text{GCV}(\lambda) = \frac{n^{-1} \|\mathbf{y} - \hat{\mathbf{y}}\|^2}{\left[1 - n^{-1}\operatorname{tr}(\mathbf{H}_\lambda)\right]^2}. \]GCV is rotation-invariant and avoids the instability that arises when individual \(h_{jj}\) are large.
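A minimal GCV grid search using the same hat-matrix algebra; the fixed basis and the identity-style penalty below are simplified stand-ins for an actual roughness penalty, so the selected \(\lambda\) is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 80
t = np.linspace(0, 1, n)
y = np.cos(2 * np.pi * t) + rng.normal(scale=0.3, size=n)

# Illustrative basis: intercept, linear trend, and three Fourier pairs.
Phi = np.column_stack([np.ones(n), t,
                       np.sin(2 * np.pi * t), np.cos(2 * np.pi * t),
                       np.sin(4 * np.pi * t), np.cos(4 * np.pi * t),
                       np.sin(6 * np.pi * t), np.cos(6 * np.pi * t)])
K = Phi.shape[1]
# Stand-in penalty: shrink all but the unpenalized intercept and trend.
P = np.eye(K)
P[0, 0] = P[1, 1] = 0.0

def gcv(lam):
    """GCV(lambda) = mean squared residual / (1 - tr(H)/n)^2."""
    H = Phi @ np.linalg.solve(Phi.T @ Phi + lam * P, Phi.T)
    resid = y - H @ y
    return np.mean(resid ** 2) / (1 - np.trace(H) / n) ** 2

lams = 10.0 ** np.arange(-6, 3)
scores = np.array([gcv(l) for l in lams])
lam_best = lams[np.argmin(scores)]
```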
Smoothing with GCV-based selection of \(\lambda\) is implemented in the fda library.
Chapter 2: Nonparametric Regression
2.1 Kernel Regression
Kernel regression provides nonparametric estimates of a regression function \(m(t) = E\left[Y \mid T = t\right]\) without assuming a global parametric form.
The Nadaraya–Watson estimator is the locally weighted average
\[ \hat{m}_h(t) = \frac{\sum_{i=1}^{n} K\!\left(\frac{t - t_i}{h}\right) y_i}{\sum_{i=1}^{n} K\!\left(\frac{t - t_i}{h}\right)}. \]The kernel \(K\) is typically a symmetric, non-negative function integrating to one, such as the Epanechnikov kernel \(K(u) = \frac{3}{4}(1-u^2)\mathbf{1}_{|u|\le 1}\) or the Gaussian kernel \(K(u) = (2\pi)^{-1/2}e^{-u^2/2}\). The bandwidth \(h\) plays the role of \(\lambda^{-1}\): small \(h\) gives a rough, high-variance estimate; large \(h\) gives a smooth, high-bias estimate.
Local polynomial regression. The Nadaraya–Watson estimator is the zeroth-order member of a family. Local polynomial regression of degree \(p\) fits, at each target point \(t\), a degree-\(p\) polynomial to the nearby data using weighted least squares with kernel weights:
\[ \hat{\boldsymbol{\beta}}(t) = \operatorname*{arg\,min}_{\boldsymbol{\beta}} \sum_{i=1}^{n} K\!\left(\frac{t - t_i}{h}\right) \left[y_i - \sum_{j=0}^{p} \beta_j (t_i - t)^j\right]^2. \]The estimate of \(m(t)\) is \(\hat{\beta}_0(t)\). Local linear regression (\(p=1\)) is preferred in practice because it automatically corrects boundary bias and achieves the minimax optimal rate for estimating functions with two bounded derivatives.
For the local linear estimator at an interior point, standard asymptotics give
\[ \operatorname{Bias}\!\left[\hat{m}(t)\right] \approx \frac{h^2}{2}\,\mu_2(K)\,m''(t), \qquad \operatorname{Var}\!\left[\hat{m}(t)\right] \approx \frac{\sigma^2\,\|K\|_2^2}{n h\, f(t)}, \]where \(\mu_2(K) = \int u^2 K(u)\,du\), \(\|K\|_2^2 = \int K^2(u)\,du\), and \(f\) is the design density. The MSE-optimal bandwidth satisfies \(h_{\text{opt}} \propto n^{-1/5}\), yielding MSE of order \(n^{-4/5}\).
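A minimal local linear smoother with a Gaussian kernel, fit by weighted least squares at each target point; the simulated data and bandwidth are illustrative.

```python
import numpy as np

def local_linear(t0, t, y, h):
    """Local linear estimate of m(t0) with a Gaussian kernel."""
    w = np.exp(-0.5 * ((t - t0) / h) ** 2)          # kernel weights
    X = np.column_stack([np.ones_like(t), t - t0])  # local design [1, t_i - t0]
    W = np.diag(w)
    # Weighted least squares: beta = (X' W X)^{-1} X' W y; m_hat(t0) = beta_0.
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[0]

rng = np.random.default_rng(2)
n = 200
t = np.sort(rng.uniform(0, 1, n))
y = np.sin(3 * t) + rng.normal(scale=0.1, size=n)

grid = np.linspace(0.05, 0.95, 10)
m_hat = np.array([local_linear(t0, t, y, h=0.08) for t0 in grid])
```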
2.2 Reproducing Kernel Hilbert Spaces
Reproducing kernel Hilbert spaces (RKHS) provide a unifying framework for a broad class of smoothing methods and connect kernel regression, spline smoothing, and Gaussian process methods.
An RKHS \(\mathcal{H}\) is a Hilbert space of functions on \(\mathcal{T}\) in which every evaluation functional \(f \mapsto f(t)\) is bounded. By the Riesz representation theorem, for each \(t\) there exists \(R(\cdot, t) \in \mathcal{H}\) with the reproducing property \(f(t) = \langle f, R(\cdot, t)\rangle\) for all \(f \in \mathcal{H}\). The function \(R : \mathcal{T} \times \mathcal{T} \to \mathbb{R}\) defined this way is called the reproducing kernel of \(\mathcal{H}\). It satisfies (i) \(R(s,t) = R(t,s)\) (symmetry) and (ii) \(\sum_{i,j} a_i a_j R(t_i, t_j) \ge 0\) for all finite collections (positive semi-definiteness).
Conversely, given any symmetric, positive semi-definite kernel \(R\) on \(\mathcal{T} \times \mathcal{T}\), the Moore–Aronszajn theorem guarantees the existence of a unique RKHS \(\mathcal{H}_R\) for which \(R\) is the reproducing kernel.
Examples. (a) The Sobolev space \(W^m_2[0,1] = \{f : f^{(m-1)}\text{ abs. continuous},\, f^{(m)} \in L^2\}\) with inner product \(\langle f, g\rangle = \sum_{k=0}^{m-1} f^{(k)}(0)g^{(k)}(0) + \int_0^1 f^{(m)}g^{(m)}\) is an RKHS. (b) The Gaussian kernel \(R(s,t) = \exp(-\|s-t\|^2/2\sigma^2)\) on \(\mathbb{R}^d\) is the reproducing kernel of a specific RKHS of smooth functions.
Representer Theorem. Consider minimizing a penalized empirical risk over \(f \in \mathcal{H}_R\):
\[ \min_{f \in \mathcal{H}_R}\; \sum_{i=1}^{n} L\!\left(y_i, f(t_i)\right) + \lambda\,\|f\|_{\mathcal{H}_R}^2. \]Any minimizer admits the finite-dimensional representation
\[ \hat{f}(\cdot) = \sum_{i=1}^{n} \alpha_i\, R(\cdot, t_i). \]The Representer Theorem is fundamental: despite optimizing over an infinite-dimensional space, the solution lives in the \(n\)-dimensional span of kernel sections. This reduces the optimization to finding \(\boldsymbol{\alpha} \in \mathbb{R}^n\).
2.3 Smoothing Splines as RKHS Minimizers
The cubic smoothing spline minimizer of the PLS criterion in Section 1.3 is precisely the solution to an RKHS problem in \(W^2_2[0,1]\). The reproducing kernel for the penalty \(\int (f'')^2\) is
\[ R(s,t) = \int_0^1 (s - u)_+\,(t - u)_+\,du, \]where \((x)_+ = \max(x,0)\). By the Representer Theorem, the minimizer is a natural cubic spline with knots exactly at the \(n\) observation points \(t_1, \ldots, t_n\).
2.4 Penalized Splines
Smoothing splines place a knot at every data point, giving \(O(n)\) basis functions. Penalized splines (P-splines) use a fixed, modest number of knots \(K \ll n\), making them computationally lighter while retaining most of the smoothing-spline flexibility.
A P-spline fit uses \(K\) B-spline basis functions with evenly spaced knots and penalizes the \(d\)-th order differences of adjacent B-spline coefficients:
\[ \hat{\mathbf{c}} = \operatorname*{arg\,min}_{\mathbf{c}} \left\{ \|\mathbf{y} - \boldsymbol{\Phi}\mathbf{c}\|^2 + \lambda\,\mathbf{c}^\top \mathbf{D}_d^\top \mathbf{D}_d\,\mathbf{c} \right\}, \]where \(\mathbf{D}_d\) is the \(d\)-th difference matrix. When \(d = 2\), this approximately penalizes curvature. P-splines admit a mixed-model representation (Ruppert, Wand, and Carroll), enabling REML-based selection of \(\lambda\).
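The criterion above can be sketched directly; here Gaussian bumps at equally spaced knots stand in for a true B-spline basis (an assumption made purely to keep the example self-contained), while the difference-penalty algebra is exactly as in the display.

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, d = 150, 20, 2
t = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * t) + rng.normal(scale=0.2, size=n)

# Local basis at equally spaced knots. Gaussian bumps stand in for
# B-splines here; in practice Phi would come from a B-spline basis.
knots = np.linspace(0, 1, K)
Phi = np.exp(-0.5 * ((t[:, None] - knots[None, :]) / 0.05) ** 2)

# d-th order difference matrix D_d: d-th differences of the identity rows.
D = np.diff(np.eye(K), n=d, axis=0)          # shape (K - d, K)

lam = 1.0
c_hat = np.linalg.solve(Phi.T @ Phi + lam * D.T @ D, Phi.T @ y)
y_hat = Phi @ c_hat
```

The penalty \(\mathbf{D}_2^\top \mathbf{D}_2\) shrinks second differences of adjacent coefficients toward zero, i.e., toward a locally linear coefficient sequence.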
2.5 Semiparametric Regression
Semiparametric models combine a parametric component for covariates with known structure and a nonparametric component for covariates with unknown form.
The canonical example is the partially linear model
\[ Y_i = \boldsymbol{\gamma}^\top \mathbf{Z}_i + m(T_i) + \varepsilon_i, \]where \(\mathbf{Z}_i \in \mathbb{R}^q\) is a vector of parametric covariates, \(T_i\) is a scalar covariate, \(m\) is an unknown smooth function, and \(\varepsilon_i\) has mean zero and variance \(\sigma^2\).
Robinson’s (1988) double-residual estimator achieves \(\sqrt{n}\)-consistency for \(\boldsymbol{\gamma}\): partial out \(m(T)\) and \(E[\mathbf{Z} \mid T]\) by nonparametric regression, then run OLS on the residuals. The functional form \(m\) can subsequently be estimated by kernel or spline methods after removing \(\hat{\boldsymbol{\gamma}}^\top \mathbf{Z}\).
Chapter 3: Covariance Modeling and Functional PCA
3.1 Mean and Covariance Functions
Let \(X\) be a square-integrable stochastic process on \(\mathcal{T}\), meaning \(E\left[\int_{\mathcal{T}} X^2(t)\,dt\right] < \infty\). The fundamental second-order summary statistics are the mean function and the covariance kernel.
These are
\[ \mu(t) = E\left[X(t)\right], \qquad C(s,t) = \operatorname{Cov}\!\left(X(s), X(t)\right). \]The covariance operator \(\mathcal{C} : L^2(\mathcal{T}) \to L^2(\mathcal{T})\) acts as \((\mathcal{C} f)(s) = \int_{\mathcal{T}} C(s,t)\,f(t)\,dt\).
The covariance operator is self-adjoint, positive semi-definite (since \(\int\!\int C(s,t)f(s)f(t)\,ds\,dt = \operatorname{Var}\!\left[\int f(t)X(t)\,dt\right] \ge 0\)), and of trace class when \(\int C(t,t)\,dt < \infty\).
3.2 Mercer’s Theorem
Mercer’s theorem is the continuous analogue of the spectral theorem for positive semi-definite matrices. It guarantees that the covariance operator has an eigendecomposition and provides the foundation for FPCA.
If \(C\) is continuous, there exist an orthonormal sequence of eigenfunctions \(\{\psi_k\}\) in \(L^2(\mathcal{T})\) and nonincreasing eigenvalues \(\lambda_1 \ge \lambda_2 \ge \cdots \ge 0\) such that
\[ C(s,t) = \sum_{k=1}^{\infty} \lambda_k\,\psi_k(s)\,\psi_k(t), \]where the series converges absolutely and uniformly.
3.3 Functional Principal Component Analysis
Functional PCA (FPCA) seeks directions of maximal variation in the function space: the first component maximizes \(\operatorname{Var}\!\left(\langle X - \mu, \psi\rangle\right)\) over unit-norm \(\psi \in L^2(\mathcal{T})\), and subsequent components do the same subject to orthogonality with the preceding ones.
The \(k\)-th eigenfunction \(\psi_k\) solves \(\mathcal{C}\psi_k = \lambda_k \psi_k\). The functional principal component score of observation \(X_i\) on the \(k\)-th component is \(\xi_{ik} = \langle X_i - \mu, \psi_k \rangle = \int_{\mathcal{T}} (X_i(t) - \mu(t))\psi_k(t)\,dt\).
The Karhunen–Loève expansion expresses the process as
\[ X(t) = \mu(t) + \sum_{k=1}^{\infty} \xi_k\,\psi_k(t), \]where the scores satisfy \(E[\xi_k] = 0\), \(E[\xi_k \xi_l] = \lambda_k \delta_{kl}\), and the convergence is in mean square. Truncating at \(K\) components gives the best rank-\(K\) approximation in the \(L^2\) sense, explaining a fraction \(\sum_{k=1}^{K} \lambda_k / \sum_{k=1}^{\infty} \lambda_k\) of total variance.
3.4 Estimation of Covariance and Eigenfunctions
In practice, the mean function and covariance kernel must be estimated from data. When curves are densely observed, one can:
- Estimate \(\mu(t)\) by the pointwise sample mean \(\bar{X}(t) = n^{-1}\sum_i X_i(t)\).
- Estimate \(C(s,t)\) by the pointwise sample covariance \(\hat{C}(s,t) = (n-1)^{-1}\sum_i (X_i(s) - \bar{X}(s))(X_i(t) - \bar{X}(t))\).
- Compute the empirical covariance operator and solve the eigenproblem numerically.
When functions are observed on a common grid \(\{t_1, \ldots, t_J\}\), this reduces to the eigendecomposition of the \(J \times J\) sample covariance matrix, weighted by the quadrature weights for the integral \(\int \psi_k \psi_l\).
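The grid-based eigenproblem can be sketched as follows: the example simulates a two-component Karhunen–Loève process (true eigenvalues 4 and 1 by construction) and recovers eigenvalues and eigenfunctions from the weighted sample covariance.

```python
import numpy as np

# Densely observed curves on a common grid (simulated two-component truth).
rng = np.random.default_rng(5)
n, J = 300, 50
t = np.linspace(0, 1, J)
psi1 = np.sqrt(2) * np.sin(np.pi * t)        # orthonormal in L2[0, 1]
psi2 = np.sqrt(2) * np.sin(2 * np.pi * t)
X = rng.normal(0, 2, (n, 1)) * psi1 + rng.normal(0, 1, (n, 1)) * psi2

Xc = X - X.mean(axis=0)
C_hat = Xc.T @ Xc / (n - 1)                  # J x J sample covariance

# Quadrature weight w = grid spacing; the eigenproblem of C_hat * w
# approximates the integral operator. Eigenvectors are rescaled by
# 1/sqrt(w) so the estimated eigenfunctions are orthonormal in L2.
w = t[1] - t[0]
evals, evecs = np.linalg.eigh(C_hat * w)
order = np.argsort(evals)[::-1]
lam_hat = evals[order][:2]
psi_hat = evecs[:, order[:2]] / np.sqrt(w)
```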
3.5 PACE for Sparse Functional Data
A critical challenge in biomedical applications is sparse and irregular functional data: subject \(i\) is observed only at a few time points \(\{t_{ij}\}_{j=1}^{m_i}\) with \(m_i\) often between 2 and 10, and the grids differ across subjects.
The PACE algorithm (Principal Analysis by Conditional Estimation; Yao, Müller, and Wang, 2005) addresses this by pooling information across subjects.
- Smooth all pairs \((t_{ij}, y_{ij})\) pooled across subjects to estimate \(\mu(t)\) by local polynomial regression.
- Form raw covariance products \(\tilde{C}_{ijl} = (y_{ij} - \hat{\mu}(t_{ij}))(y_{il} - \hat{\mu}(t_{il}))\) for \(j \ne l\), excluding diagonal to avoid noise contamination. Smooth the off-diagonal products by local polynomial surface smoothing to obtain \(\hat{C}(s,t)\).
- Estimate eigenvalues \(\hat{\lambda}_k\) and eigenfunctions \(\hat{\psi}_k\) from the discretized \(\hat{C}\).
- Predict scores by conditional expectation: \(\hat{\xi}_{ik} = \hat{\lambda}_k \left(\hat{\boldsymbol{\psi}}_k^{i}\right)^{\!\top} \hat{\boldsymbol{\Sigma}}_{Y_i}^{-1} (\mathbf{y}_i - \hat{\boldsymbol{\mu}}_i)\), where \(\hat{\boldsymbol{\Sigma}}_{Y_i} = \hat{\mathbf{C}}_i + \hat{\sigma}^2 \mathbf{I}\) and the index \(i\) denotes restriction to subject \(i\)'s observation grid.
The PACE score estimates are the best linear predictors of \(\xi_{ik}\) given the observations \(\mathbf{y}_i\), exploiting the structure of the Karhunen–Loève expansion. Under Gaussianity, these are the full conditional expectations \(E[\xi_{ik} \mid \mathbf{y}_i]\).
Chapter 4: Functional Linear Models and Functional GLMs
4.1 Scalar-on-Function Regression
The scalar-on-function regression model relates a scalar response \(Y\) to a functional predictor \(X(\cdot)\) through an integral transform involving a coefficient function \(\beta(\cdot)\).
The model is
\[ Y_i = \alpha + \int_{\mathcal{T}} \beta(t)\,X_i(t)\,dt + \varepsilon_i, \]where \(\alpha \in \mathbb{R}\) is an intercept, \(\beta \in L^2(\mathcal{T})\) is the slope function, and \(\varepsilon_i\) are i.i.d. errors with mean zero and variance \(\sigma^2\).
The functional inner product \(\int \beta(t) X_i(t)\,dt\) generalizes the linear predictor \(\boldsymbol{\beta}^\top \mathbf{x}_i\) of classical regression to the infinite-dimensional setting. Identifiability requires that the distribution of \(X\) is not supported on a proper closed subspace of \(L^2(\mathcal{T})\).
4.2 Estimation via Basis Expansion
Approach 1: Basis expansion for \(\beta\). Write \(\beta(t) = \sum_{k=1}^{K_\beta} b_k \phi_k(t)\) in a chosen basis. Then
\[ \int_{\mathcal{T}} \beta(t)\,X_i(t)\,dt = \sum_{k=1}^{K_\beta} b_k \int_{\mathcal{T}} \phi_k(t)\,X_i(t)\,dt = \mathbf{b}^\top \mathbf{w}_i, \]where \(w_{ik} = \int \phi_k(t) X_i(t)\,dt\). This reduces to a standard linear regression of \(Y\) on \(\mathbf{w}_i\). Penalization of \(\mathbf{b}\) via \(\lambda \mathbf{b}^\top \mathbf{P} \mathbf{b}\) (roughness penalty on \(\beta\)) corresponds to ridge regression with a structured penalty.
Approach 2: FPCA reduction. Project \(X_i\) onto the leading \(K\) functional principal components: the model becomes \(Y_i = \alpha + \sum_{k=1}^{K} \xi_{ik} \beta_k + \varepsilon_i\), where \(\beta_k = \int \beta(t)\psi_k(t)\,dt\) are the FPC coefficients of \(\beta\). This is standard multiple regression on the FPC scores.
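A simulation sketch of Approach 2: estimate FPC scores, regress the scalar response on them, and map the fitted coefficients back to a coefficient function. All quantities are simulated; note that sign flips in the estimated eigenfunctions cancel in \(\hat{\beta}\).

```python
import numpy as np

rng = np.random.default_rng(6)
n, J = 400, 60
t = np.linspace(0, 1, J)
w = t[1] - t[0]                                  # quadrature weight
psi1 = np.sqrt(2) * np.sin(np.pi * t)
psi2 = np.sqrt(2) * np.sin(2 * np.pi * t)
xi = rng.normal(0, [2.0, 1.0], (n, 2))           # FPC scores
X = xi @ np.vstack([psi1, psi2])                 # curves on the grid

# True coefficient function beta(t) = 1.5 psi1(t) - 0.5 psi2(t).
beta_true = 1.5 * psi1 - 0.5 * psi2
Y = X @ beta_true * w + rng.normal(0, 0.1, n)    # Y = int beta X + noise

# Estimate eigenfunctions, form scores, and regress Y on them.
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh((Xc.T @ Xc / (n - 1)) * w)
order = np.argsort(evals)[::-1][:2]
psi_hat = evecs[:, order] / np.sqrt(w)
scores = Xc @ psi_hat * w                        # xi_hat = int Xc psi_hat
coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), scores]), Y,
                           rcond=None)
beta_hat = psi_hat @ coef[1:]                    # back to a function of t
```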
Approach 3: RKHS regularization. The minimizer of the penalized criterion \(\sum_{i=1}^{n} \left(Y_i - \alpha - \int \beta(t)X_i(t)\,dt\right)^2 + \lambda\,\|\beta\|_{\mathcal{H}_R}^2\) over \(\beta \in \mathcal{H}_R\) has the representer form \(\hat{\beta}(\cdot) = \sum_{i=1}^{n} \hat{a}_i \int_{\mathcal{T}} R(\cdot, t) X_i(t)\,dt\). Estimation reduces to a finite-dimensional penalized least squares problem in the coefficients \(\hat{\mathbf{a}}\).
4.3 Function-on-Function Regression
When both the response and predictor are functions, the regression coefficient becomes an integral operator.
The model is
\[ Y_i(s) = \alpha(s) + \int_{\mathcal{T}} \beta(s,t)\,X_i(t)\,dt + \varepsilon_i(s), \qquad s \in \mathcal{S}, \]where \(\beta(s,t)\) is a bivariate coefficient surface and \(\varepsilon_i\) is a zero-mean stochastic process on \(\mathcal{S}\).
Estimation proceeds by expanding \(\beta(s,t) = \sum_{k,l} b_{kl}\,\phi_k(s)\phi_l(t)\) in a tensor product basis, with the surface penalized to enforce smoothness simultaneously in the \(s\) and \(t\) directions. Alternatively, the concurrent model restricts the response to depend only on the predictor at the same argument: \(Y_i(t) = \alpha(t) + \beta(t)X_i(t) + \varepsilon_i(t)\), a pointwise regression.
4.4 Functional Generalized Linear Models
Functional GLMs extend scalar-on-function regression to non-Gaussian responses by linking the conditional distribution to the functional predictor through a link function.
For a binary response, functional logistic regression specifies \(\operatorname{logit} P(Y_i = 1 \mid X_i) = \alpha + \int_{\mathcal{T}} \beta(t)\,X_i(t)\,dt\). More generally, for a response \(Y_i\) in an exponential family with canonical link \(g\),
\[ g\!\left(E[Y_i \mid X_i]\right) = \alpha + \int_{\mathcal{T}} \beta(t)\,X_i(t)\,dt. \]Estimation uses penalized likelihood. The log-likelihood is
\[ \ell(\alpha, \beta) = \sum_{i=1}^{n} \ell_i\!\left(\alpha + \int \beta(t)X_i(t)\,dt\right) - \lambda \int \left[\beta''(t)\right]^2 dt, \]and Newton–Raphson (IRLS) adapted to the functional setting applies iteratively.
4.5 Inference for Functional Coefficients
Pointwise confidence bands for \(\hat{\beta}(t)\) can be constructed by bootstrap or by asymptotic theory. The functional t-test for \(H_0 : \beta = 0\) requires careful treatment of the infinite-dimensional null hypothesis; popular approaches include the global F-type test using the \(L^2\) norm \(\|\hat{\beta}\|^2\) and permutation-based p-values.
Part II: Bayesian Computing (Winter 2025, Martin Lysy)
Chapter 5: Bayesian Foundations and Normal Approximation
5.1 Prior, Posterior, and Predictive Distributions
The Bayesian paradigm treats all unknowns — parameters, latent variables, and future observations — as random variables. Uncertainty about a parameter vector \(\boldsymbol{\theta} \in \Theta\) is encoded in a prior distribution \(p(\boldsymbol{\theta})\), and updated upon observing data \(\mathbf{y}\) via Bayes’ rule.
Bayes' rule gives
\[ p(\boldsymbol{\theta} \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \boldsymbol{\theta})\,p(\boldsymbol{\theta})}{p(\mathbf{y})}, \qquad p(\mathbf{y}) = \int_{\Theta} p(\mathbf{y} \mid \boldsymbol{\theta})\,p(\boldsymbol{\theta})\,d\boldsymbol{\theta}. \]The normalizing constant \(p(\mathbf{y})\) is the marginal likelihood or evidence.
Conjugate models. A prior \(p(\boldsymbol{\theta})\) is conjugate to the likelihood \(p(\mathbf{y} \mid \boldsymbol{\theta})\) if the posterior \(p(\boldsymbol{\theta} \mid \mathbf{y})\) is in the same parametric family. The Beta-Binomial, Normal-Normal, Dirichlet-Multinomial, and Normal-Inverse-Gamma models are the canonical conjugate examples.
The posterior predictive distribution for a new observation \(\tilde{y}\) marginalizes out \(\boldsymbol{\theta}\):
\[ p(\tilde{y} \mid \mathbf{y}) = \int p(\tilde{y} \mid \boldsymbol{\theta})\,p(\boldsymbol{\theta} \mid \mathbf{y})\,d\boldsymbol{\theta}. \]This propagates full posterior uncertainty into predictions, in contrast to plug-in estimators.
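A quick check of the predictive integral in the Beta-Binomial conjugate model, where the predictive probability of one more success equals the posterior mean of the success probability. Prior and data values are arbitrary.

```python
import numpy as np

# Beta(a0, b0) prior on the success probability; conjugate update.
a0, b0 = 2.0, 2.0
y, n = 7, 10                               # successes out of n trials
a_post, b_post = a0 + y, b0 + n - y        # posterior is Beta(a0+y, b0+n-y)

# Posterior predictive for one new trial: P(y_new = 1 | y) = E[theta | y].
p_pred = a_post / (a_post + b_post)

# Monte Carlo check of the predictive integral: draw theta from the
# posterior, then y_new | theta from the likelihood.
rng = np.random.default_rng(7)
theta = rng.beta(a_post, b_post, size=100_000)
y_new = rng.binomial(1, theta)
mc_pred = y_new.mean()
```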
5.2 Hierarchical Models
Hierarchical models introduce a structured prior with multiple levels, allowing partial pooling of information across groups.
The generic two-level structure is
\[ \mathbf{y}_i \mid \boldsymbol{\theta}_i \sim p(\mathbf{y}_i \mid \boldsymbol{\theta}_i), \qquad \boldsymbol{\theta}_i \mid \boldsymbol{\phi} \sim p(\boldsymbol{\theta}_i \mid \boldsymbol{\phi}), \qquad \boldsymbol{\phi} \sim p(\boldsymbol{\phi}), \]for groups \(i = 1, \ldots, m\). The hyperparameter \(\boldsymbol{\phi}\) links the groups. The joint posterior is \(p(\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_m, \boldsymbol{\phi} \mid \mathbf{y}) \propto p(\boldsymbol{\phi})\prod_{i=1}^{m} p(\boldsymbol{\theta}_i \mid \boldsymbol{\phi})\,p(\mathbf{y}_i \mid \boldsymbol{\theta}_i)\).
The eight-schools example of Gelman et al. illustrates shrinkage: group-level estimates \(\hat{\theta}_i\) are pulled toward the grand mean, with more shrinkage for groups whose estimates have larger standard errors.
5.3 Bayesian Normal Approximation (Laplace)
When the posterior is unimodal and the sample size is large, a normal approximation centered at the posterior mode (MAP estimate) is often accurate.
Let \(\hat{\boldsymbol{\theta}} = \operatorname*{arg\,max}_{\boldsymbol{\theta}} \log p(\boldsymbol{\theta} \mid \mathbf{y})\) be the posterior mode and \(\hat{\mathbf{I}} = -\nabla^2 \log p(\boldsymbol{\theta} \mid \mathbf{y})\big|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}\) the negative Hessian of the log-posterior at the mode. The Laplace approximation is \(p(\boldsymbol{\theta} \mid \mathbf{y}) \approx \mathcal{N}\!\left(\hat{\boldsymbol{\theta}},\, \hat{\mathbf{I}}^{-1}\right)\).
The approximation is justified by a Taylor expansion of \(\log p(\boldsymbol{\theta} \mid \mathbf{y})\) around \(\hat{\boldsymbol{\theta}}\). Under mild regularity conditions (Bernstein–von Mises), the posterior concentrates around the MLE at rate \(n^{-1/2}\) and is asymptotically normal with covariance \(\mathbf{I}(\boldsymbol{\theta}_0)^{-1}/n\), where \(\mathbf{I}(\boldsymbol{\theta}_0)\) is the Fisher information at the true parameter.
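A worked Laplace approximation in a Poisson-Gamma model, chosen because the exact posterior is available to compare against: Newton's method finds the mode, and the curvature there gives the approximate standard deviation. Hyperparameters and data are arbitrary.

```python
import numpy as np

# Poisson likelihood with Gamma(a, b) prior: exact posterior is
# Gamma(a + sum(y), b + n), so the approximation can be checked.
rng = np.random.default_rng(8)
a, b = 2.0, 1.0
y = rng.poisson(3.0, size=50)
a_post, b_post = a + y.sum(), b + len(y)

# First and second derivatives of the log-posterior in theta.
def grad(theta):
    return (a_post - 1) / theta - b_post

def hess(theta):
    return -(a_post - 1) / theta ** 2

# Newton's method for the posterior mode (MAP estimate).
theta_hat = y.mean()
for _ in range(50):
    theta_hat = theta_hat - grad(theta_hat) / hess(theta_hat)

laplace_sd = np.sqrt(-1.0 / hess(theta_hat))   # curvature-based sd
exact_mean = a_post / b_post
exact_sd = np.sqrt(a_post) / b_post
```

With 50 observations the normal approximation already matches the exact Gamma posterior's spread to well under a percent.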
Bayes factors and model comparison. The Bayes factor comparing models \(\mathcal{M}_1\) and \(\mathcal{M}_2\) is
\[ \text{BF}_{12} = \frac{p(\mathbf{y} \mid \mathcal{M}_1)}{p(\mathbf{y} \mid \mathcal{M}_2)}. \]The Laplace approximation to the evidence, \(\log p(\mathbf{y}) \approx \log p(\mathbf{y} \mid \hat{\boldsymbol{\theta}}) + \log p(\hat{\boldsymbol{\theta}}) + \frac{p}{2}\log(2\pi) - \frac{1}{2}\log\det\hat{\mathbf{I}}\), connects to the BIC via its leading terms.
Chapter 6: Markov Chain Monte Carlo
6.1 The Metropolis-Hastings Algorithm
When the posterior is intractable, MCMC constructs a Markov chain whose stationary distribution is the target \(\pi(\boldsymbol{\theta}) \propto p(\mathbf{y} \mid \boldsymbol{\theta})\,p(\boldsymbol{\theta})\).
- Propose \(\boldsymbol{\theta}^* \sim q(\cdot \mid \boldsymbol{\theta}^{(t)})\) from a proposal distribution.
- Compute the acceptance ratio \[ \alpha = \min\!\left(1,\; \frac{\pi(\boldsymbol{\theta}^*)\,q(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^*)}{\pi(\boldsymbol{\theta}^{(t)})\,q(\boldsymbol{\theta}^* \mid \boldsymbol{\theta}^{(t)})}\right). \]
- Set \(\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^*\) with probability \(\alpha\); otherwise \(\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)}\).
The resulting chain satisfies detailed balance with respect to \(\pi\): \(\pi(\boldsymbol{\theta})\,q(\boldsymbol{\theta}' \mid \boldsymbol{\theta})\,\alpha(\boldsymbol{\theta}, \boldsymbol{\theta}') = \pi(\boldsymbol{\theta}')\,q(\boldsymbol{\theta} \mid \boldsymbol{\theta}')\,\alpha(\boldsymbol{\theta}', \boldsymbol{\theta})\), ensuring \(\pi\) is the stationary distribution.
Random-walk MH. The proposal \(\boldsymbol{\theta}^* = \boldsymbol{\theta}^{(t)} + \epsilon\), \(\epsilon \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})\) is symmetric, so \(q(\boldsymbol{\theta}' \mid \boldsymbol{\theta}) = q(\boldsymbol{\theta} \mid \boldsymbol{\theta}')\) and the acceptance ratio simplifies to \(\min(1, \pi(\boldsymbol{\theta}^*)/\pi(\boldsymbol{\theta}^{(t)}))\). The optimal acceptance rate for a \(d\)-dimensional Gaussian target is approximately 0.234 (Roberts, Gelman, and Gilks, 1997), achieved by scaling \(\sigma^2 \approx 2.38^2/d\).
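A minimal random-walk MH sampler on a correlated bivariate Gaussian target, using the \(2.38/\sqrt{d}\) step scaling; the target and chain length are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))

def log_target(th):
    """Unnormalized log density of a zero-mean correlated Gaussian."""
    return -0.5 * th @ Sigma_inv @ th

d, T = 2, 20_000
sigma = 2.38 / np.sqrt(d)                    # Roberts-Gelman-Gilks scaling
chain = np.zeros((T, d))
th = np.zeros(d)
lp = log_target(th)
accepts = 0
for s in range(T):
    prop = th + sigma * rng.normal(size=d)   # symmetric proposal
    lp_prop = log_target(prop)
    # Symmetric proposal: accept with prob min(1, pi(prop)/pi(th)).
    if np.log(rng.uniform()) < lp_prop - lp:
        th, lp = prop, lp_prop
        accepts += 1
    chain[s] = th
accept_rate = accepts / T
```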
6.2 Gibbs Sampler
The Gibbs sampler exploits conditional independence structure by cycling through full conditional distributions.
Partition \(\boldsymbol{\theta} = (\theta_1, \ldots, \theta_d)\). At each iteration, the sampler draws, for \(j = 1, \ldots, d\),
\[ \theta_j^{(t+1)} \sim p\!\left(\theta_j \mid \theta_1^{(t+1)}, \ldots, \theta_{j-1}^{(t+1)}, \theta_{j+1}^{(t)}, \ldots, \theta_d^{(t)}, \mathbf{y}\right). \]Each Gibbs step is an MH step with acceptance probability exactly 1.
In hierarchical models, the full conditionals are often available in closed form (conjugate structure), making the Gibbs sampler the algorithm of choice. Data augmentation (Tanner and Wong, 1987) introduces auxiliary variables to render full conditionals tractable.
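A two-block Gibbs sampler for a bivariate normal with correlation \(\rho\), where both full conditionals are univariate normal in closed form (a standard textbook target, used here for illustration):

```python
import numpy as np

rng = np.random.default_rng(10)
rho, T = 0.9, 20_000
x = np.zeros((T, 2))
x1, x2 = 0.0, 0.0
for s in range(T):
    # Full conditionals: x1 | x2 ~ N(rho * x2, 1 - rho^2), symmetrically
    # for x2 | x1. Every draw is accepted.
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))
    x[s] = (x1, x2)
corr_hat = np.corrcoef(x.T)[0, 1]
```

The chain mixes more slowly as \(\rho \to 1\): each coordinate moves only a fraction \(1 - \rho^2\) of the way per sweep, which is the classic argument for reparameterizing highly correlated targets.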
6.3 Convergence Diagnostics
Assessing convergence is critical: MCMC averages are reliable only after the chain has dissipated the influence of its starting value and mixed through the target distribution.
The Gelman–Rubin diagnostic \(\hat{R}\) runs \(M \ge 2\) parallel chains from overdispersed starting values. Let \(W\) be the within-chain variance and \(B\) the between-chain variance. The potential scale reduction factor is
\[ \hat{R} = \sqrt{\frac{\hat{V}}{W}}, \quad \hat{V} = \frac{T-1}{T}W + \frac{M+1}{MT}B, \]where \(T\) is the chain length after warmup. Values \(\hat{R} < 1.01\) are considered adequate (Vehtari et al., 2021 revision).
The effective sample size (ESS) accounts for autocorrelation:
\[ \text{ESS} = \frac{MT}{1 + 2\sum_{\tau=1}^{\infty} \rho_\tau}, \]where \(\rho_\tau\) is the autocorrelation at lag \(\tau\). Highly correlated chains (slow mixing) produce low ESS. A rule of thumb is ESS \(\ge 400\) for reliable posterior summaries.
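The \(\hat{R}\) formula above, implemented directly and checked on four well-mixed chains versus a set where one chain is artificially shifted off target; packaged implementations additionally split each chain in half.

```python
import numpy as np

def rhat(chains):
    """Gelman-Rubin R-hat for an (M, T) array of M chains of length T."""
    M, T = chains.shape
    means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()           # within-chain variance
    B = T * means.var(ddof=1)                       # between-chain variance
    V = (T - 1) / T * W + (M + 1) / (M * T) * B
    return np.sqrt(V / W)

rng = np.random.default_rng(11)
mixed = rng.normal(size=(4, 5_000))                 # 4 well-mixed chains
stuck = mixed + np.array([[0.0], [0.0], [0.0], [3.0]])  # one chain shifted
r_good, r_bad = rhat(mixed), rhat(stuck)
```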
6.4 Adaptive MCMC
Adaptive MCMC tunes the proposal distribution using past samples, trading off the simplicity of fixed-kernel MCMC for improved mixing.
The Adaptive Metropolis (AM) algorithm (Haario, Saksman, and Tamminen, 2001) uses a proposal covariance that adapts to the empirical covariance of the chain:
\[ q_t(\boldsymbol{\theta}^* \mid \boldsymbol{\theta}^{(t)}) = \mathcal{N}\!\left(\boldsymbol{\theta}^{(t)},\; \frac{2.38^2}{d}\hat{\boldsymbol{\Sigma}}_t + \varepsilon \mathbf{I}\right), \]where \(\hat{\boldsymbol{\Sigma}}_t\) is the sample covariance of \(\boldsymbol{\theta}^{(1)}, \ldots, \boldsymbol{\theta}^{(t)}\) and \(\varepsilon > 0\) prevents degeneracy. Although adaptation violates the Markov property, the AM chain is ergodic if adaptation diminishes asymptotically (Atchadé and Rosenthal, 2003).
Chapter 7: Hamiltonian Monte Carlo
7.1 Hamiltonian Dynamics
Hamiltonian Monte Carlo (HMC) exploits gradient information to make large moves in parameter space while maintaining high acceptance rates, dramatically improving over random-walk MH in high dimensions.
HMC augments the target with an auxiliary momentum variable \(\mathbf{p} \in \mathbb{R}^d\) and simulates Hamiltonian dynamics on the joint space \((\boldsymbol{\theta}, \mathbf{p})\). The Hamiltonian is
\[ H(\boldsymbol{\theta}, \mathbf{p}) = U(\boldsymbol{\theta}) + K(\mathbf{p}), \]where \(U(\boldsymbol{\theta}) = -\log\pi(\boldsymbol{\theta})\) is the potential energy and \(K(\mathbf{p}) = \frac{1}{2}\mathbf{p}^\top \mathbf{M}^{-1}\mathbf{p}\) is the kinetic energy (with mass matrix \(\mathbf{M}\)). Hamilton’s equations are
\[ \frac{d\boldsymbol{\theta}}{d\tau} = \frac{\partial H}{\partial \mathbf{p}} = \mathbf{M}^{-1}\mathbf{p}, \qquad \frac{d\mathbf{p}}{d\tau} = -\frac{\partial H}{\partial \boldsymbol{\theta}} = \nabla_{\boldsymbol{\theta}} \log\pi(\boldsymbol{\theta}). \]The Hamiltonian is conserved along trajectories, and the flow preserves the canonical distribution \(p(\boldsymbol{\theta}, \mathbf{p}) \propto e^{-H(\boldsymbol{\theta},\mathbf{p})}\).
7.2 The Leapfrog Integrator
Exact Hamiltonian dynamics are approximated by the leapfrog integrator, which is time-reversible and volume-preserving (symplectic):
\[ \mathbf{p}^{(\ell + 1/2)} = \mathbf{p}^{(\ell)} + \frac{\epsilon}{2}\nabla_{\boldsymbol{\theta}}\log\pi\!\left(\boldsymbol{\theta}^{(\ell)}\right), \]\[ \boldsymbol{\theta}^{(\ell+1)} = \boldsymbol{\theta}^{(\ell)} + \epsilon\,\mathbf{M}^{-1}\mathbf{p}^{(\ell+1/2)}, \]\[ \mathbf{p}^{(\ell+1)} = \mathbf{p}^{(\ell+1/2)} + \frac{\epsilon}{2}\nabla_{\boldsymbol{\theta}}\log\pi\!\left(\boldsymbol{\theta}^{(\ell+1)}\right), \]for step size \(\epsilon\). After \(L\) leapfrog steps, the proposed \((\boldsymbol{\theta}^*, \mathbf{p}^*)\) is accepted with MH probability \(\min(1, e^{-H(\boldsymbol{\theta}^*, \mathbf{p}^*) + H(\boldsymbol{\theta}^{(t)}, \mathbf{p}^{(t)})})\). For an ideal integrator with no discretization error, acceptance would be 1; in practice, the optimal acceptance rate is approximately 0.651.
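A bare-bones HMC sampler on a one-dimensional standard normal target with unit mass matrix; the leapfrog loop follows the three updates above, and the step size and trajectory length are illustrative.

```python
import numpy as np

rng = np.random.default_rng(12)

def grad_log_pi(th):
    """Gradient of log N(0, 1) density: -theta."""
    return -th

eps, L, T = 0.2, 10, 5_000
chain = np.empty(T)
th = 0.0
accepts = 0
for s in range(T):
    p = rng.normal()                         # fresh momentum, M = 1
    th_new, p_new = th, p
    # Leapfrog: half momentum step, alternating full steps, half step.
    p_new += 0.5 * eps * grad_log_pi(th_new)
    for _ in range(L - 1):
        th_new += eps * p_new
        p_new += eps * grad_log_pi(th_new)
    th_new += eps * p_new
    p_new += 0.5 * eps * grad_log_pi(th_new)
    # Accept with probability min(1, exp(H_old - H_new)).
    H_old = 0.5 * th ** 2 + 0.5 * p ** 2
    H_new = 0.5 * th_new ** 2 + 0.5 * p_new ** 2
    if np.log(rng.uniform()) < H_old - H_new:
        th = th_new
        accepts += 1
    chain[s] = th
accept_rate = accepts / T
```

Because the leapfrog discretization error on this quadratic target is tiny at \(\epsilon = 0.2\), the acceptance rate sits near 1, far above the random-walk regime.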
7.3 The No-U-Turn Sampler and Stan
HMC requires tuning the step size \(\epsilon\) and number of leapfrog steps \(L\). The No-U-Turn Sampler (NUTS) (Hoffman and Gelman, 2014) eliminates the need to specify \(L\) by dynamically extending the trajectory until it makes a U-turn (i.e., until \(\mathbf{p} \cdot (\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)}) < 0\)).
Concretely, doubling stops when
\[ (\boldsymbol{\theta}^+ - \boldsymbol{\theta}^-) \cdot \mathbf{p}^- < 0 \quad \text{or} \quad (\boldsymbol{\theta}^+ - \boldsymbol{\theta}^-) \cdot \mathbf{p}^+ < 0, \]where \(\boldsymbol{\theta}^+\) and \(\boldsymbol{\theta}^-\) are the rightmost and leftmost positions of the current tree and \(\mathbf{p}^+, \mathbf{p}^-\) the corresponding momenta.
Stan is a probabilistic programming language implementing NUTS with dual averaging for automatic step-size adaptation during warmup. Users specify a joint model in Stan’s modeling language; Stan auto-differentiates the log-posterior to supply the required gradients. Stan’s performance degrades in the presence of non-identifiabilities and discrete parameters (which it cannot handle directly), motivating the use of marginalization or alternative parameterizations.
Chapter 8: State-Space Models and Particle Filtering
8.1 Hidden Markov Models and State-Space Models
State-space models (SSMs) describe a latent process \(\{X_t\}_{t \ge 0}\) observed noisily through \(\{Y_t\}_{t \ge 0}\). The Markov structure is:
\[ X_{t+1} \mid X_t \sim p(x_{t+1} \mid x_t), \quad Y_t \mid X_t \sim p(y_t \mid x_t), \]with \(Y_t \perp (X_{0:t-1}, Y_{0:t-1}) \mid X_t\) and \(X_{t+1} \perp (X_{0:t-1}, Y_{0:t}) \mid X_t\).
Hidden Markov models (HMMs) are discrete-state SSMs. Linear Gaussian SSMs (also called Gaussian linear dynamical systems) have
\[ X_{t+1} = \mathbf{F} X_t + \mathbf{q}_t, \quad Y_t = \mathbf{H} X_t + \mathbf{r}_t, \]with \(\mathbf{q}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{Q})\) and \(\mathbf{r}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{R})\). This special case admits a closed-form solution: the Kalman filter.
8.2 The Kalman Filter
The filter alternates a prediction step and an update step. With filtered mean \(\hat{\mathbf{x}}_{t-1 \mid t-1}\) and covariance \(\mathbf{P}_{t-1 \mid t-1}\),
\[ \hat{\mathbf{x}}_{t \mid t-1} = \mathbf{F}\hat{\mathbf{x}}_{t-1 \mid t-1}, \qquad \mathbf{P}_{t \mid t-1} = \mathbf{F}\mathbf{P}_{t-1 \mid t-1}\mathbf{F}^\top + \mathbf{Q}, \]\[ \mathbf{K}_t = \mathbf{P}_{t \mid t-1}\mathbf{H}^\top\!\left(\mathbf{H}\mathbf{P}_{t \mid t-1}\mathbf{H}^\top + \mathbf{R}\right)^{-1}, \]\[ \hat{\mathbf{x}}_{t \mid t} = \hat{\mathbf{x}}_{t \mid t-1} + \mathbf{K}_t\left(\mathbf{y}_t - \mathbf{H}\hat{\mathbf{x}}_{t \mid t-1}\right), \qquad \mathbf{P}_{t \mid t} = \left(\mathbf{I} - \mathbf{K}_t\mathbf{H}\right)\mathbf{P}_{t \mid t-1}. \]The matrix \(\mathbf{K}_t\) is the Kalman gain.
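The scalar case of the Kalman recursions, run on a simulated AR(1)-plus-noise model; parameter values are arbitrary.

```python
import numpy as np

# Scalar linear-Gaussian SSM: x_{t+1} = F x_t + q_t, y_t = H x_t + r_t.
rng = np.random.default_rng(13)
F, H, Q, R = 0.9, 1.0, 0.5, 1.0
T = 200

# Simulate states and observations.
x = np.empty(T)
x[0] = rng.normal()
for t in range(1, T):
    x[t] = F * x[t - 1] + rng.normal(0, np.sqrt(Q))
y = H * x + rng.normal(0, np.sqrt(R), T)

# Kalman filter: predict, then update, at each time step.
m, P = 0.0, 1.0                              # prior on x_0
m_filt = np.empty(T)
P_filt = np.empty(T)
for t in range(T):
    if t > 0:
        m, P = F * m, F * P * F + Q          # predict
    K = P * H / (H * P * H + R)              # Kalman gain
    m = m + K * (y[t] - H * m)               # update mean
    P = (1 - K * H) * P                      # update variance
    m_filt[t], P_filt[t] = m, P
```

The filtered mean squared error falls below the raw observation error, and the filtered variance converges quickly to its steady-state value.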
The Kalman smoother (Rauch–Tung–Striebel) computes the smoothing distribution \(p(x_t \mid y_{0:T})\) for all \(t \le T\) using a backward pass.
8.3 The Bootstrap Particle Filter
For nonlinear, non-Gaussian SSMs, the Kalman filter is not exact. Particle filters approximate the filtering distribution \(p(x_t \mid y_{0:t})\) by a weighted empirical distribution over \(N\) particles \(\{x_t^{(j)}, w_t^{(j)}\}_{j=1}^N\).
- Initialize: Draw \(x_0^{(j)} \sim p(x_0)\), set \(w_0^{(j)} = 1/N\).
- At each time \(t\):
- Resample: Draw indices \(a^{(j)} \sim \text{Multinomial}(\mathbf{w}_{t-1})\), set \(\tilde{x}_{t-1}^{(j)} = x_{t-1}^{(a^{(j)})}\).
- Propagate: Draw \(x_t^{(j)} \sim p(x_t \mid \tilde{x}_{t-1}^{(j)})\) from the transition kernel.
- Weight: Set \(w_t^{(j)} \propto p(y_t \mid x_t^{(j)})\).
The filtering expectation \(E\left[h(X_t) \mid y_{0:t}\right]\) is approximated by \(\sum_{j=1}^N w_t^{(j)} h(x_t^{(j)})\).
The sequential importance resampling (SIR) step prevents weight degeneracy, where a single particle would accumulate all weight. The effective sample size \(\text{ESS}_t = \left(\sum_j (w_t^{(j)})^2\right)^{-1}\) diagnoses degeneracy; resampling is typically triggered when \(\text{ESS}_t < N/2\).
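The three steps above can be sketched as a bootstrap particle filter on a linear-Gaussian AR(1) model, chosen so the example stays simple; for this model the Kalman filter is exact, so the particle filter here is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(14)
F, Q, R, T, N = 0.9, 0.5, 1.0, 100, 1_000

# Simulate the latent AR(1) state and noisy observations.
x = np.empty(T)
x[0] = rng.normal()
for t in range(1, T):
    x[t] = F * x[t - 1] + rng.normal(0, np.sqrt(Q))
y = x + rng.normal(0, np.sqrt(R), T)

particles = rng.normal(0, 1, N)
weights = np.full(N, 1.0 / N)
m_pf = np.empty(T)
for t in range(T):
    # Resample (multinomial), propagate through the transition, reweight.
    idx = rng.choice(N, size=N, p=weights)
    particles = F * particles[idx] + rng.normal(0, np.sqrt(Q), N)
    logw = -0.5 * (y[t] - particles) ** 2 / R   # log p(y_t | x_t^{(j)})
    w = np.exp(logw - logw.max())               # stabilized weights
    weights = w / w.sum()
    m_pf[t] = np.sum(weights * particles)       # filtering mean estimate
```

This version resamples at every step; triggering resampling only when \(\text{ESS}_t < N/2\), as described above, reduces the extra Monte Carlo variance that resampling introduces.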
Particle filters form the basis of particle MCMC (Andrieu, Doucet, and Holenstein, 2010), which uses particle filter estimates of the likelihood \(p(\mathbf{y} \mid \boldsymbol{\theta})\) inside an MH sampler for \(\boldsymbol{\theta}\).
Chapter 9: Nonparametric Bayes
9.1 The Dirichlet Process
The Dirichlet process (DP) is the fundamental building block of Bayesian nonparametric inference. It defines a prior over the infinite-dimensional space of probability measures.
Definition. \(G \sim \text{DP}(\alpha, G_0)\), with concentration parameter \(\alpha > 0\) and base measure \(G_0\), if for every finite measurable partition \((A_1, \ldots, A_k)\) of the sample space, \((G(A_1), \ldots, G(A_k)) \sim \text{Dirichlet}(\alpha G_0(A_1), \ldots, \alpha G_0(A_k))\). Key properties of the DP: (i) \(E[G(A)] = G_0(A)\), so \(G_0\) is the prior mean measure; (ii) \(\operatorname{Var}(G(A)) = G_0(A)(1 - G_0(A))/(\alpha + 1)\), so \(\alpha\) controls concentration around \(G_0\); (iii) realizations of \(G\) are almost surely discrete — a foundational fact exploited in mixture models.
Stick-breaking construction (Sethuraman, 1994). An equivalent constructive definition is:
\[ G = \sum_{k=1}^{\infty} \pi_k\,\delta_{\theta_k}, \]where \(\theta_k \overset{\text{iid}}{\sim} G_0\), \(V_k \overset{\text{iid}}{\sim} \text{Beta}(1, \alpha)\), and \(\pi_k = V_k \prod_{l=1}^{k-1}(1 - V_l)\). The weights \(\pi_k\) are a random partition of unity, constructed by successively “breaking” a unit stick.
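The stick-breaking weights are easy to simulate. The sketch below truncates the infinite sum at a finite number of atoms (a practical necessity; the truncation level 500 and \(\alpha = 2\) are illustrative) and checks how little of the stick remains unbroken.

```python
import numpy as np

def stick_breaking(alpha, n_atoms, rng):
    """Truncated stick-breaking weights pi_k = V_k * prod_{l<k}(1 - V_l),
    with V_k ~ Beta(1, alpha). Returns the first n_atoms weights."""
    v = rng.beta(1.0, alpha, n_atoms)
    # prod_{l<k}(1 - V_l): 1 for k = 1, then cumulative products
    stick_left = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * stick_left

rng = np.random.default_rng(42)
pi = stick_breaking(alpha=2.0, n_atoms=500, rng=rng)
remaining = 1.0 - pi.sum()   # unbroken stick; decays geometrically in k
```

Pairing these weights with atoms \(\theta_k \overset{\text{iid}}{\sim} G_0\) gives a (truncated) draw of \(G\) itself.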
Chinese Restaurant Process (CRP). The predictive distribution of the DP yields the CRP: given a sequence \(\theta_1, \ldots, \theta_{n-1}\) drawn from a DP-based model, the next draw satisfies
\[ \theta_n \mid \theta_1, \ldots, \theta_{n-1} \sim \frac{\alpha}{\alpha + n - 1}\,G_0 + \frac{1}{\alpha + n - 1}\sum_{i=1}^{n-1}\delta_{\theta_i}. \]With probability proportional to \(\alpha\), a new table (cluster) is opened; with probability proportional to its current count, an existing table is joined. This gives a distribution over partitions of \(\{1, \ldots, n\}\) in which the expected number of clusters grows as \(\alpha\log n\).
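The CRP predictive rule translates directly into a sequential simulation; the sketch below seats customers one at a time (the values \(n = 1000\) and \(\alpha = 1\) are illustrative).

```python
import numpy as np

def crp_partition(n, alpha, rng):
    """Seat n customers by the Chinese restaurant process;
    returns a table (cluster) label for each customer."""
    tables = []                      # current table sizes
    labels = np.empty(n, dtype=int)
    for i in range(n):
        # existing table j with prob n_j / (alpha + i), new table with prob alpha / (alpha + i)
        probs = np.array(tables + [alpha], dtype=float)
        probs /= probs.sum()
        j = rng.choice(len(probs), p=probs)
        if j == len(tables):
            tables.append(1)         # open a new table
        else:
            tables[j] += 1           # join an existing table
        labels[i] = j
    return labels

rng = np.random.default_rng(0)
labels = crp_partition(1000, alpha=1.0, rng=rng)
n_clusters = labels.max() + 1        # grows roughly like alpha * log(n)
```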
9.2 Dirichlet Process Mixture Models
Since DP realizations are discrete, they are directly useful as mixing distributions: the DP mixture model posits \(y_i \mid \theta_i \sim f(\cdot \mid \theta_i)\) with \(\theta_i \mid G \overset{\text{iid}}{\sim} G\) and \(G \sim \text{DP}(\alpha, G_0)\).
Marginalizing out \(G\), the CRP structure induces a random clustering of the observations. The DP mixture allows an unbounded number of components, with the data determining the effective number. Posterior inference is typically via:
- Blocked Gibbs sampler: Truncate the stick-breaking at \(K\) terms and use conjugate updates.
- Algorithm 8 of Neal (2000): A collapsed sampler for non-conjugate models that updates cluster assignments under the CRP prior, representing potential new clusters with auxiliary parameter draws.
9.3 Gaussian Process Priors and Regression
A Gaussian process (GP) is a stochastic process such that any finite-dimensional marginal is multivariate Gaussian: writing \(f \sim \mathcal{GP}(m, k)\) with mean function \(m\) and covariance kernel \(k\), for any \(t_1, \ldots, t_n\),
\[ (f(t_1), \ldots, f(t_n))^\top \sim \mathcal{N}(\mathbf{m}, \mathbf{K}), \]where \(m_i = m(t_i)\) and \(K_{ij} = k(t_i, t_j)\). GPs provide flexible, nonparametric priors over function spaces.
In GP regression, one places a GP prior on a latent function \(f\) and observes \(Y_i = f(t_i) + \varepsilon_i\) with \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\). The posterior is also a GP; for a zero-mean prior, its mean and covariance are
\[ \hat{f}(t) = \mathbf{k}(t)^\top(\mathbf{K} + \sigma^2\mathbf{I})^{-1}\mathbf{y}, \qquad \hat{k}(t, t') = k(t, t') - \mathbf{k}(t)^\top(\mathbf{K} + \sigma^2\mathbf{I})^{-1}\mathbf{k}(t'), \]where \(\mathbf{k}(t) = (k(t, t_1), \ldots, k(t, t_n))^\top\) and \(\mathbf{K}_{ij} = k(t_i, t_j)\).
Computing the GP posterior requires solving a linear system with the \(n \times n\) matrix \(\mathbf{K} + \sigma^2\mathbf{I}\), costing \(O(n^3)\) via Cholesky decomposition. Sparse approximations (inducing points, Nyström) reduce this to \(O(nm^2)\) for \(m \ll n\) inducing inputs.
Deep GPs compose multiple GP layers, \(f^{(L)} \circ \cdots \circ f^{(1)}\), yielding richer representations but requiring approximate inference (variational or MCMC).
Chapter 10: Approximate Inference
10.1 Variational Inference
Variational inference (VI) converts posterior inference into optimization. One approximates the intractable posterior \(p(\boldsymbol{\theta} \mid \mathbf{y})\) by the closest member of a tractable family \(\mathcal{Q} = \{q_\phi : \phi \in \Phi\}\), measured by KL divergence. The key identity is
\[ \log p(\mathbf{y}) = \mathcal{L}(\phi) + \text{KL}\!\left(q_\phi(\boldsymbol{\theta})\,\middle\|\,p(\boldsymbol{\theta} \mid \mathbf{y})\right), \]where the ELBO is
\[ \mathcal{L}(\phi) = E_{q_\phi}\!\left[\log p(\mathbf{y}, \boldsymbol{\theta})\right] - E_{q_\phi}\!\left[\log q_\phi(\boldsymbol{\theta})\right]. \]Since \(\text{KL} \ge 0\), maximizing \(\mathcal{L}(\phi)\) minimizes the KL divergence and tightens the lower bound on the log-evidence.
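The bound can be checked numerically in a model where the evidence is known. The sketch below assumes a conjugate toy model \(\theta \sim \mathcal{N}(0,1)\), \(y \mid \theta \sim \mathcal{N}(\theta, 1)\) with a single illustrative observation \(y = 1.3\), so that marginally \(y \sim \mathcal{N}(0, 2)\) and the exact posterior is \(\mathcal{N}(y/2, 1/2)\).

```python
import numpy as np

rng = np.random.default_rng(0)
y = 1.3   # one illustrative observation; model: theta ~ N(0,1), y | theta ~ N(theta, 1)

def log_joint(theta):
    """log p(y, theta) = log N(theta; 0, 1) + log N(y; theta, 1)."""
    return (-0.5 * np.log(2 * np.pi) - 0.5 * theta**2
            - 0.5 * np.log(2 * np.pi) - 0.5 * (y - theta)**2)

def elbo(m, s, n_mc=100_000):
    """Monte Carlo estimate of E_q[log p(y, theta)] - E_q[log q(theta)], q = N(m, s^2)."""
    theta = m + s * rng.normal(size=n_mc)
    log_q = -0.5 * np.log(2 * np.pi * s**2) - 0.5 * ((theta - m) / s)**2
    return np.mean(log_joint(theta) - log_q)

log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - 0.25 * y**2   # y ~ N(0, 2)
tight = elbo(y / 2, np.sqrt(0.5))   # q = exact posterior: bound is tight
loose = elbo(0.0, 1.0)              # q = prior: slack equals KL(q || posterior)
```

When \(q\) equals the posterior, the integrand \(\log p(y, \theta) - \log q(\theta)\) is constant in \(\theta\), so the Monte Carlo estimate hits \(\log p(y)\) with essentially zero variance.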
10.2 Mean-Field Variational Inference and CAVI
The mean-field family assumes full factorization: \(q_\phi(\boldsymbol{\theta}) = \prod_{j=1}^{d} q_j(\theta_j)\). While this ignores posterior correlations, it often yields reasonable marginal approximations.
The Coordinate Ascent Variational Inference (CAVI) algorithm iteratively optimizes each factor \(q_j\) while holding the others fixed. The optimal update is
\[ q_j^{\ast}(\theta_j) \propto \exp\!\left\{E_{-j}\!\left[\log p(\mathbf{y}, \boldsymbol{\theta})\right]\right\}, \]where the expectation is over all variables except \(\theta_j\). CAVI iterates these updates to convergence; each update cannot decrease the ELBO.
For exponential family models with conjugate structure, the CAVI updates are available in closed form and reduce to updating natural parameters of the variational distributions.
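As a concrete instance of these closed-form updates, the sketch below runs CAVI for the classic conjugate normal model \(y_i \sim \mathcal{N}(\mu, \tau^{-1})\) with \(\mu \mid \tau \sim \mathcal{N}(\mu_0, (\lambda_0\tau)^{-1})\) and \(\tau \sim \text{Gamma}(a_0, b_0)\); the prior hyperparameter values are illustrative defaults.

```python
import numpy as np

def cavi_normal(y, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=50):
    """CAVI for y_i ~ N(mu, 1/tau) with conjugate priors.
    Mean-field factors: q(mu) = N(mu_n, 1/lam_n), q(tau) = Gamma(a_n, b_n)."""
    n, ybar = len(y), np.mean(y)
    # these two quantities are fixed across iterations
    mu_n = (lam0 * mu0 + n * ybar) / (lam0 + n)
    a_n = a0 + (n + 1) / 2.0
    e_tau = a0 / b0                       # initialize E_q[tau] at the prior mean
    for _ in range(n_iter):
        lam_n = (lam0 + n) * e_tau        # update q(mu) given E_q[tau]
        e_mu, e_mu2 = mu_n, mu_n**2 + 1.0 / lam_n
        # update q(tau) given E_q[mu], E_q[mu^2]
        b_n = b0 + 0.5 * (np.sum(y**2) - 2 * e_mu * np.sum(y) + n * e_mu2
                          + lam0 * (e_mu2 - 2 * mu0 * e_mu + mu0**2))
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n
```

With moderate \(n\), the variational mean \(\mu_n\) sits very close to the sample mean, and \(E_q[\tau] = a_n/b_n\) tracks the inverse of the sample variance.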
10.3 Stochastic Gradient Variational Bayes
For large datasets, the ELBO sum over \(n\) observations is expensive to compute. Stochastic gradient variational Bayes (SGVB) uses mini-batches and the reparameterization trick to obtain low-variance gradient estimates.
If \(\boldsymbol{\theta} = g_\phi(\boldsymbol{\epsilon})\) for \(\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})\) independent of \(\phi\) (a differentiable reparameterization), then
\[ \nabla_\phi\, E_{q_\phi}[f(\boldsymbol{\theta})] = E_{p(\boldsymbol{\epsilon})}\!\left[\nabla_\phi f(g_\phi(\boldsymbol{\epsilon}))\right]. \]This expectation is estimated by Monte Carlo and differentiated by automatic differentiation, enabling gradient ascent on \(\mathcal{L}(\phi)\). For Gaussian \(q_\phi = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\), the reparameterization is \(\boldsymbol{\theta} = \boldsymbol{\mu} + \mathbf{L}\boldsymbol{\epsilon}\), where \(\mathbf{L}\mathbf{L}^\top = \boldsymbol{\Sigma}\).
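A minimal sketch of the reparameterization gradient in one dimension, assuming the target posterior is \(\mathcal{N}(2, 1)\) (known here only through \(\nabla_\theta \log p\)) and \(q_\phi = \mathcal{N}(\mu, e^{2\rho})\); the learning rate, Monte Carlo batch size, and step count are illustrative. The Gaussian entropy term is handled in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_p(theta):
    """Score of the (assumed) target posterior N(2, 1)."""
    return -(theta - 2.0)

# variational family N(mu, exp(2*rho)); reparameterize theta = mu + exp(rho) * eps
mu, rho = 0.0, 0.0
lr, n_mc = 0.05, 64
for step in range(2000):
    eps = rng.normal(size=n_mc)
    s = np.exp(rho)
    theta = mu + s * eps
    g = grad_log_p(theta)                 # d/dtheta of log p at the samples
    # chain rule: d theta/d mu = 1, d theta/d rho = s * eps;
    # the entropy 0.5*log(2*pi*e*s^2) contributes 1 to d/d rho and 0 to d/d mu
    grad_mu = np.mean(g)
    grad_rho = np.mean(g * s * eps) + 1.0
    mu += lr * grad_mu                    # stochastic gradient ascent on the ELBO
    rho += lr * grad_rho
# since both q and p are Gaussian, the optimum is the exact posterior: mu -> 2, s -> 1
```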
Variational autoencoders (VAEs) use this technique with neural network encoder and decoder: the encoder outputs the parameters \((\boldsymbol{\mu}_\phi(y), \boldsymbol{\Sigma}_\phi(y))\) of the variational distribution \(q_\phi(z \mid y)\) over a latent code \(z\), and the decoder models \(p_\theta(y \mid z)\).
10.4 Approximate Bayesian Computation
In some models, the likelihood \(p(\mathbf{y} \mid \boldsymbol{\theta})\) is intractable but data can be simulated from the model. Approximate Bayesian computation (ABC) replaces likelihood evaluation with simulation; the basic rejection sampler repeats the following steps:
- Draw \(\boldsymbol{\theta}^* \sim p(\boldsymbol{\theta})\) from the prior.
- Simulate \(\mathbf{y}^* \sim p(\cdot \mid \boldsymbol{\theta}^*)\).
- Accept \(\boldsymbol{\theta}^*\) if \(\rho(S(\mathbf{y}^*), S(\mathbf{y})) \le \varepsilon\), where \(S\) is a summary statistic vector, \(\rho\) is a distance, and \(\varepsilon > 0\) is a tolerance.
The resulting sample approximates the ABC posterior \(p_\varepsilon(\boldsymbol{\theta} \mid \mathbf{y}) \propto P(\rho(S(\mathbf{y}^*), S(\mathbf{y})) \le \varepsilon \mid \boldsymbol{\theta})\,p(\boldsymbol{\theta})\).
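The rejection scheme can be sketched on a toy model where the exact answer is known: \(n\) observations from \(\mathcal{N}(\theta, 1)\) with the sample mean as summary statistic (which is sufficient here, so the \(\varepsilon \to 0\) limit is exact). The true value \(\theta = 1.5\), the \(\mathcal{N}(0, 9)\) prior, and the tolerance are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# "observed" data: n draws from N(theta_true, 1); summary S(y) = sample mean
theta_true, n = 1.5, 100
y_obs = rng.normal(theta_true, 1.0, n)
s_obs = y_obs.mean()

accepted = []
eps = 0.05
for _ in range(20000):
    theta = rng.normal(0.0, 3.0)            # draw theta* from the prior N(0, 9)
    y_sim = rng.normal(theta, 1.0, n)       # simulate y* from the model
    if abs(y_sim.mean() - s_obs) <= eps:    # rho(S(y*), S(y)) <= eps
        accepted.append(theta)
accepted = np.array(accepted)
# accepted values approximate the ABC posterior; its mean should sit near
# the exact posterior mean, which is very close to s_obs for this prior
```

Note the inefficiency typical of rejection ABC: only a small fraction of the 20,000 prior draws survive the tolerance check.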
As \(\varepsilon \to 0\), the ABC posterior converges to the exact posterior only if \(S\) is a sufficient statistic. In practice, \(S\) is a lower-dimensional summary chosen to retain information about \(\boldsymbol{\theta}\).
ABC-MCMC embeds the acceptance step within a Metropolis scheme, targeting the ABC posterior. Synthetic likelihood (Wood, 2010) replaces the indicator \(\mathbf{1}[\rho \le \varepsilon]\) with a Gaussian likelihood for \(S(\mathbf{y}^*)\), running MH on the resulting approximation to \(p(S(\mathbf{y}) \mid \boldsymbol{\theta})\).
10.5 Normalizing Flows
Normalizing flows construct flexible variational approximations by transforming a simple base density \(p(\boldsymbol{\epsilon})\) through a sequence of invertible differentiable maps. For a single invertible map \(f\) with \(\boldsymbol{\theta} = f(\boldsymbol{\epsilon})\), the change-of-variables formula gives \(q(\boldsymbol{\theta}) = p(\boldsymbol{\epsilon})\left|\det \mathbf{J}_f(\boldsymbol{\epsilon})\right|^{-1}\). For a composition \(f = f_K \circ \cdots \circ f_1\),
\[ \log q(\boldsymbol{\theta}) = \log p(\boldsymbol{\epsilon}) - \sum_{k=1}^{K} \log \left|\det \mathbf{J}_{f_k}(\boldsymbol{\epsilon}_{k-1})\right|, \]where \(\boldsymbol{\epsilon}_{k} = f_k(\boldsymbol{\epsilon}_{k-1})\).
Popular flow architectures include planar flows, RealNVP (affine coupling layers with triangular Jacobians), and autoregressive flows. The ELBO can be maximized via gradient ascent using the reparameterization gradient, as each \(f_k\) is differentiable and invertible. Normalizing flows offer richer approximations than mean-field VI at the cost of more complex architectures and longer training times.
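A minimal sketch of a planar flow, \(f(\mathbf{z}) = \mathbf{z} + \mathbf{u}\,h(\mathbf{w}^\top \mathbf{z} + b)\) with \(h = \tanh\), whose Jacobian log-determinant has the closed form \(\log\left|1 + h'(\mathbf{w}^\top \mathbf{z} + b)\,\mathbf{u}^\top\mathbf{w}\right|\); the base density is assumed standard normal and the parameter values below are illustrative (chosen so that \(\mathbf{u}^\top\mathbf{w} > -1\), which guarantees invertibility).

```python
import numpy as np

def planar_flow(z, u, w, b):
    """Planar flow f(z) = z + u * tanh(w^T z + b) applied row-wise to z (n, d);
    returns (f(z), log|det J_f|) per row."""
    a = z @ w + b                        # (n,)
    f = z + np.outer(np.tanh(a), u)      # (n, d)
    h_prime = 1.0 - np.tanh(a)**2        # tanh'(a)
    log_det = np.log(np.abs(1.0 + h_prime * (u @ w)))
    return f, log_det

def log_q(eps, flows):
    """log-density of theta = (f_K o ... o f_1)(eps) for eps ~ N(0, I),
    accumulated by the change-of-variables formula."""
    d = eps.shape[1]
    logq = -0.5 * d * np.log(2 * np.pi) - 0.5 * np.sum(eps**2, axis=1)
    z = eps
    for (u, w, b) in flows:
        z, log_det = planar_flow(z, u, w, b)
        logq -= log_det                  # subtract each layer's log|det J|
    return z, logq
```

The analytic log-determinant can be checked in one dimension against a finite-difference derivative of \(f\), which is a useful sanity test for any hand-written flow layer.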