STAT 330: Mathematical Statistics
Peijun Sang
Estimated study time: 1 hr 43 min
Sources and References
- Primary notes — Cameron Roopnarine (Hextical), STAT 330: Mathematical Statistics, Fall 2020, hextical.github.io/university-notes
- Supplementary notes — David Duan, STAT 330 Master Notes, david-duan.me
- Textbook — Bain and Engelhardt, Introduction to Probability and Mathematical Statistics, 2nd Edition
Chapter 1: Univariate Random Variables
1.1 The Probability Model
A probability model provides a mathematical framework for describing a random experiment. It consists of three components:
- A sample space \( S \), the set of all possible outcomes of the experiment.
- An event \( A \), which is a subset of \( S \).
- A probability function \( P \) that assigns a real number to each event.
The probability function \( P \) must satisfy three axioms: (i) \( P(A) \geq 0 \) for every event \( A \subseteq S \). (ii) \( P(S) = 1 \). (iii) Countable additivity: If \( A_1, A_2, \ldots \) are pairwise mutually exclusive events (i.e., \( A_i \cap A_j = \varnothing \) for \( i \neq j \)), then
\[ P\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i). \]From these axioms, several useful properties follow immediately:
- \( P(\varnothing) = 0 \).
- Complement rule: \( P(\bar{A}) = 1 - P(A) \).
- Inclusion-exclusion: \( P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2) \).
- Monotonicity: If \( A_1 \subseteq A_2 \), then \( P(A_1) \leq P(A_2) \).
- \( 0 \leq P(A) \leq 1 \) for all events \( A \).
For events \( A \) and \( B \) with \( P(B) > 0 \), the conditional probability of \( A \) given \( B \) is \( P(A \mid B) = \frac{P(A \cap B)}{P(B)} \). Events \( A \) and \( B \) are independent if \( P(A \cap B) = P(A)\,P(B) \). Equivalently, when \( P(B) > 0 \), independence holds if and only if \( P(A \mid B) = P(A) \). Intuitively, the occurrence of one event does not influence the probability of the other.
1.2 Random Variables and Cumulative Distribution Functions
The main purpose of a random variable is to quantify outcomes of a random experiment numerically.
A random variable \( X \) assigns a real number to each outcome in \( S \); its cumulative distribution function (CDF) is \( F(x) = P(X \leq x) \) for \( x \in \mathbb{R} \). The CDF satisfies: (1) \( F \) is non-decreasing: if \( x_1 \leq x_2 \), then \( F(x_1) \leq F(x_2) \). (2) \( \lim_{x \to \infty} F(x) = 1 \) and \( \lim_{x \to -\infty} F(x) = 0 \). (3) \( F \) is right-continuous: \( \lim_{x \to a^+} F(x) = F(a) \) for all \( a \in \mathbb{R} \). (4) \( P(a < X \leq b) = F(b) - F(a) \). (5) \( P(X = a) = F(a) - \lim_{x \to a^-} F(x) \).
1.3 Discrete Random Variables
A discrete random variable takes values in a countable set; its probability mass function (PMF) is \( f(x) = P(X = x) \). The support of \( X \) is the set \( A = \{x : f(x) > 0\} \). The PMF satisfies \( f(x) \geq 0 \) for all \( x \) and \( \sum_{x \in A} f(x) = 1 \).
Common Discrete Distributions
| Distribution | PMF \( f(x) \) | Support | Mean | Variance | MGF |
|---|---|---|---|---|---|
| \( \text{Bernoulli}(p) \) | \( p^x(1-p)^{1-x} \) | \( x \in \{0,1\} \) | \( p \) | \( p(1-p) \) | \( 1 - p + pe^t \) |
| \( \text{Binomial}(n,p) \) | \( \binom{n}{x}p^x(1-p)^{n-x} \) | \( x = 0,1,\ldots,n \) | \( np \) | \( np(1-p) \) | \( (pe^t + 1-p)^n \) |
| \( \text{Geometric}(p) \) | \( (1-p)^x p \) | \( x = 0,1,2,\ldots \) | \( (1-p)/p \) | \( (1-p)/p^2 \) | \( \frac{p}{1-(1-p)e^t} \) |
| \( \text{NegBin}(r,p) \) | \( \binom{x+r-1}{x}(1-p)^x p^r \) | \( x = 0,1,2,\ldots \) | \( r(1-p)/p \) | \( r(1-p)/p^2 \) | \( \left(\frac{p}{1-(1-p)e^t}\right)^r \) |
| \( \text{Poisson}(\lambda) \) | \( e^{-\lambda}\lambda^x / x! \) | \( x = 0,1,2,\ldots \) | \( \lambda \) | \( \lambda \) | \( \exp\{\lambda(e^t - 1)\} \) |
In the “number of failures before the first success” convention used here, the Geometric distribution counts the failures preceding the first success; the Negative Binomial generalizes this to the number of failures before the \( r \)-th success.
Letting \( n \to \infty \) and \( p \to 0 \) with \( np = \lambda \) held fixed, the \( \text{Binomial}(n, p) \) PMF converges to \( e^{-\lambda}\lambda^x / x! \), which is the Poisson PMF. The key step uses \( \lim_{n \to \infty}(1 + z/n)^n = e^z \).
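A quick numerical sketch of this limit using `scipy.stats` (the rate \( \lambda = 3 \) and the grid of \( x \) values are arbitrary choices, not from the notes):

```python
# Sketch: compare the Binomial(n, lambda/n) PMF to the Poisson(lambda) PMF as n grows.
from scipy.stats import binom, poisson

lam = 3.0
for n in (10, 100, 1000):
    p = lam / n
    # maximum absolute PMF difference over x = 0, ..., 20
    max_diff = max(abs(binom.pmf(x, n, p) - poisson.pmf(x, lam)) for x in range(21))
    print(f"n = {n:5d}: max |Binomial - Poisson| PMF gap = {max_diff:.5f}")
```

The gap shrinks toward zero as \( n \) increases, as the limit predicts.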
1.4 Continuous Random Variables
A random variable \( X \) is continuous if its CDF \( F(x) \) is a continuous function of \( x \); its probability density function (PDF) is \( f(x) = F'(x) \) wherever the derivative exists, and \( f(x) = 0 \) otherwise. The PDF satisfies:
(i) \( f(x) \geq 0 \) for all \( x \in \mathbb{R} \). (ii) \( \int_{-\infty}^{\infty} f(x)\,dx = 1 \). (iii) \( F(x) = \int_{-\infty}^{x} f(t)\,dt \). (iv) \( P(a < X \leq b) = P(a \leq X \leq b) = \int_a^b f(x)\,dx \), since \( P(X = c) = 0 \) for any single point \( c \).
The Gamma Function
The gamma function is defined by \( \Gamma(\alpha) = \int_0^{\infty} y^{\alpha - 1} e^{-y}\,dy \) for \( \alpha > 0 \). Key properties: (1) \( \Gamma(\alpha) = (\alpha - 1)\,\Gamma(\alpha - 1) \) for \( \alpha > 1 \). (2) \( \Gamma(n) = (n-1)! \) for positive integers \( n \). (3) \( \Gamma(1/2) = \sqrt{\pi} \).
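A minimal check of these three properties, assuming `scipy` is installed (the value \( \alpha = 4.7 \) is arbitrary):

```python
# Sketch: verify the three gamma-function properties numerically.
import math
from scipy.special import gamma

alpha = 4.7
print(math.isclose(gamma(alpha), (alpha - 1) * gamma(alpha - 1)))  # recurrence
print(math.isclose(gamma(6), math.factorial(5)))                   # Gamma(n) = (n-1)!
print(math.isclose(gamma(0.5), math.sqrt(math.pi)))                # Gamma(1/2) = sqrt(pi)
```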
Common Continuous Distributions
| Distribution | PDF \( f(x) \) | Support | Mean | Variance | MGF |
|---|---|---|---|---|---|
| \( \text{Uniform}(a,b) \) | \( \frac{1}{b-a} \) | \( a < x < b \) | \( \frac{a+b}{2} \) | \( \frac{(b-a)^2}{12} \) | \( \frac{e^{tb}-e^{ta}}{t(b-a)} \) |
| \( \text{Exponential}(\theta) \) | \( \frac{1}{\theta}e^{-x/\theta} \) | \( x > 0 \) | \( \theta \) | \( \theta^2 \) | \( (1-\theta t)^{-1} \) |
| \( \text{Gamma}(\alpha,\beta) \) | \( \frac{x^{\alpha-1}e^{-x/\beta}}{\Gamma(\alpha)\beta^\alpha} \) | \( x > 0 \) | \( \alpha\beta \) | \( \alpha\beta^2 \) | \( (1-\beta t)^{-\alpha} \) |
| \( N(\mu,\sigma^2) \) | \( \frac{1}{\sqrt{2\pi}\sigma}\exp\!\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\} \) | \( x \in \mathbb{R} \) | \( \mu \) | \( \sigma^2 \) | \( \exp\!\left\{\mu t + \frac{\sigma^2 t^2}{2}\right\} \) |
| \( \text{Weibull}(\theta,\beta) \) | \( \frac{\beta}{\theta^\beta}x^{\beta-1}\exp\!\left\{-\left(\frac{x}{\theta}\right)^\beta\right\} \) | \( x > 0 \) | — | — | — |
By symmetry, the integral equals \( 2\int_0^{\infty} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx \). Using the substitution \( y = x^2/2 \) (so \( dx = \frac{1}{\sqrt{2y}}\,dy \)):
\[ \frac{2}{\sqrt{2\pi}} \int_0^{\infty} e^{-y} \frac{1}{\sqrt{2y}}\,dy = \frac{1}{\sqrt{\pi}} \int_0^{\infty} y^{-1/2} e^{-y}\,dy = \frac{1}{\sqrt{\pi}}\,\Gamma\!\left(\frac{1}{2}\right) = \frac{\sqrt{\pi}}{\sqrt{\pi}} = 1. \]For general \( X \sim N(\mu, \sigma^2) \), writing \( X = \sigma Z + \mu \) with \( Z \sim N(0,1) \) and substituting \( z = (x - \mu)/\sigma \) reduces the general case to the standard one.
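As a numerical counterpart to this argument, the following sketch (assuming `scipy`; the values of \( \mu \) and \( \sigma \) are arbitrary) integrates the \( N(\mu, \sigma^2) \) density directly:

```python
# Sketch: check numerically that the N(mu, sigma^2) density integrates to 1.
import numpy as np
from scipy.integrate import quad

mu, sigma = 2.0, 3.0
pdf = lambda x: np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
total, _ = quad(pdf, -np.inf, np.inf)
print(total)  # ~1.0
```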
1.5 Expectation and Variance
For discrete \( X \) with PMF \( f(x) \) and support \( A \):
\[ E[X] = \sum_{x \in A} x\,f(x), \quad \text{provided } \sum_{x \in A} |x|\,f(x) < \infty. \]For continuous \( X \) with PDF \( f(x) \):
\[ E[X] = \int_{-\infty}^{\infty} x\,f(x)\,dx, \quad \text{provided } \int_{-\infty}^{\infty} |x|\,f(x)\,dx < \infty. \]If the absolute convergence condition fails, \( E[X] \) does not exist.
More generally, for a function \( g \), \( E[g(X)] = \sum_{x \in A} g(x)\,f(x) \) in the discrete case and \( E[g(X)] = \int_{-\infty}^{\infty} g(x)\,f(x)\,dx \) in the continuous case, provided the sum or integral converges absolutely.
The variance of \( X \) is \( \text{Var}(X) = E[(X - \mu)^2] = E[X^2] - \mu^2 \), where \( \mu = E[X] \).
Properties of variance:
- \( \text{Var}(a) = 0 \) for any constant \( a \).
- \( \text{Var}(X) \geq 0 \).
- \( \text{Var}(X + a) = \text{Var}(X) \) (invariant under location shifts).
- \( \text{Var}(aX) = a^2\,\text{Var}(X) \).
- \( \text{Var}(aX + bY) = a^2\,\text{Var}(X) + b^2\,\text{Var}(Y) + 2ab\,\text{Cov}(X,Y) \).
- If \( X_1, \ldots, X_n \) are independent: \( \text{Var}\!\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \text{Var}(X_i) \).
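A minimal Monte Carlo sanity check of the last two variance properties (a sketch assuming `numpy`; the covariance matrix, coefficients, seed, and sample size are arbitrary choices):

```python
# Sketch: Monte Carlo check of Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y).
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, -1.5
cov = np.array([[1.0, 0.6], [0.6, 2.0]])          # Var(X)=1, Var(Y)=2, Cov(X,Y)=0.6
X, Y = rng.multivariate_normal([0, 0], cov, size=200_000).T

lhs = np.var(a * X + b * Y)
rhs = a**2 * cov[0, 0] + b**2 * cov[1, 1] + 2 * a * b * cov[0, 1]
print(lhs, rhs)  # both close to 4.9
```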
Moments
The \( k \)-th moment about the origin (or raw moment) is \( E[X^k] \). The \( k \)-th central moment (moment about the mean) is \( E[(X - \mu)^k] \).
In particular, the first moment is the mean and the second central moment is the variance.
For \( X \sim \text{Gamma}(\alpha, \beta) \), the raw moments are \( E[X^p] = \frac{\beta^p\,\Gamma(\alpha + p)}{\Gamma(\alpha)} \). Setting \( p = 1 \): \( E[X] = \alpha\beta \). Setting \( p = 2 \): \( E[X^2] = \alpha(\alpha+1)\beta^2 \). Therefore \( \text{Var}(X) = \alpha\beta^2 \).
For \( X \sim \text{Poisson}(\theta) \), a similar calculation yields \( E[X^2] = \theta^2 + \theta \), so \( \text{Var}(X) = \theta \).
1.6 Moment Generating Functions
The moment generating function (MGF) of \( X \) is \( M_X(t) = E[e^{tX}] \), provided this expectation exists (is finite) for all \( t \) in some open interval \( (-h, h) \) containing 0.
If \( M_X(t) \) exists, then \( E[X^k] = M_X^{(k)}(0) \) for every positive integer \( k \), where \( M_X^{(k)}(0) \) denotes the \( k \)-th derivative of \( M_X(t) \) evaluated at \( t = 0 \).
Differentiating \( k \) times under the expectation gives \( M_X^{(k)}(t) = E[X^k e^{tX}] \). Setting \( t = 0 \) yields \( M_X^{(k)}(0) = E[X^k \cdot 1] = E[X^k] \).
For \( Z \sim N(0, 1) \), \( M_Z(t) = e^{t^2/2} \). This is derived by completing the square in the exponent: \( tx - x^2/2 = -(x-t)^2/2 + t^2/2 \).
For \( X \sim N(\mu, \sigma^2) \), writing \( X = \sigma Z + \mu \) gives:
\[ M_X(t) = e^{\mu t}\,M_Z(\sigma t) = \exp\!\left\{\mu t + \frac{\sigma^2 t^2}{2}\right\}. \]
Chapter 2: Multivariate Random Variables
2.1 Joint and Marginal Distribution Functions
The joint CDF of \( (X, Y) \) is \( F(x, y) = P(X \leq x, Y \leq y) \). Properties of the joint CDF:
- \( F \) is non-decreasing in each argument when the other is held fixed.
- \( \lim_{x \to -\infty} F(x, y) = 0 \) and \( \lim_{y \to -\infty} F(x, y) = 0 \).
- \( \lim_{(x,y) \to (-\infty,-\infty)} F(x,y) = 0 \) and \( \lim_{(x,y) \to (\infty,\infty)} F(x,y) = 1 \).
The marginal CDF of \( X \) is recovered as \( F_X(x) = \lim_{y \to \infty} F(x, y) \). Similarly, \( F_Y(y) = \lim_{x \to \infty} F(x, y) \).
2.2 Bivariate Discrete Distributions
For discrete \( (X, Y) \), the joint PMF is \( f(x, y) = P(X = x, Y = y) \). The joint support is \( A = \{(x, y) : f(x, y) > 0\} \). It satisfies \( f(x,y) \geq 0 \) and \( \sum_{(x,y) \in A} f(x,y) = 1 \).
Example: let \( f(x, y) = (1-p)^2 p^{x+y} \) for \( x, y \in \{0, 1, 2, \ldots\} \), where \( 0 < p < 1 \). The marginal PMF of \( X \): \( f_X(x) = \sum_{y=0}^{\infty} (1-p)^2 p^{x+y} = (1-p)^2 p^x \cdot \frac{1}{1-p} = (1-p)p^x \), which is a Geometric PMF (with success probability \( 1-p \) in the convention of Section 1.3).
To compute \( P(X \leq Y) \):
\[ P(X \leq Y) = \sum_{x=0}^{\infty} \sum_{y=x}^{\infty} (1-p)^2 p^{x+y} = (1-p) \sum_{x=0}^{\infty} p^{2x} = \frac{1-p}{1-p^2} = \frac{1}{1+p}. \]
2.3 Bivariate Continuous Distributions
If the joint CDF \( F(x, y) \) is a continuous function of \( (x, y) \), then \( X \) and \( Y \) are jointly continuous with joint probability density function
\[ f(x, y) = \frac{\partial^2 F(x, y)}{\partial x\,\partial y} \]wherever this mixed partial derivative exists.
Properties: \( f(x,y) \geq 0 \) and \( \iint_{\mathbb{R}^2} f(x,y)\,dx\,dy = 1 \). For any region \( R \subseteq \mathbb{R}^2 \):
\[ P((X, Y) \in R) = \iint_R f(x, y)\,dx\,dy. \]The marginal PDF of \( X \) is \( f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy \), and similarly \( f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx \).
Example: let \( f(x, y) = x + y \) for \( 0 \leq x \leq 1 \), \( 0 \leq y \leq 1 \) (and 0 otherwise). Verification: \( \int_0^1 \int_0^1 (x+y)\,dy\,dx = \int_0^1 (x + 1/2)\,dx = 1 \).
Marginal of \( X \): \( f_X(x) = \int_0^1 (x+y)\,dy = x + 1/2 \) for \( 0 \leq x \leq 1 \).
\( P(X \leq Y) = \int_0^1 \int_x^1 (x+y)\,dy\,dx = 1/2 \).
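These three integrals can be checked numerically; the following sketch assumes `scipy` and reproduces the total mass and \( P(X \leq Y) \) by double integration:

```python
# Sketch: numerical check of the joint-density example f(x, y) = x + y on the unit square.
from scipy.integrate import dblquad

f = lambda y, x: x + y  # dblquad integrates its first argument (y) as the inner variable

total, _ = dblquad(f, 0, 1, 0, 1)                # integrate over the whole square: should be 1
p_x_le_y, _ = dblquad(f, 0, 1, lambda x: x, 1)   # for each x, integrate y from x to 1
print(total, p_x_le_y)                           # ~1.0 and ~0.5
```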
2.4 Independence
Random variables \( X \) and \( Y \) are independent if and only if either of the following equivalent conditions holds: (1) \( F(x, y) = F_X(x)\,F_Y(y) \) for all \( (x,y) \in \mathbb{R}^2 \). (2) \( f(x, y) = f_X(x)\,f_Y(y) \) for all \( (x,y) \).
Factorization criterion: \( X \) and \( Y \) are independent if and only if (i) \( A = A_X \times A_Y \) (the support is a “rectangle”), and (ii) there exist non-negative functions \( g \) and \( h \) such that \( f(x, y) = g(x)\,h(y) \) for all \( (x,y) \in A \).
2.5 Joint Expectation, Covariance, and Correlation
Discrete case: \( E[h(X, Y)] = \sum_{(x,y) \in A} h(x, y)\,f(x, y) \).
Continuous case: \( E[h(X, Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(x, y)\,f(x, y)\,dx\,dy \).
Linearity of expectation: \( E[aX + bY] = a\,E[X] + b\,E[Y] \). This holds regardless of whether the variables are independent.
If \( X \) and \( Y \) are independent, then \( E[g(X)\,h(Y)] = E[g(X)]\,E[h(Y)] \). More generally, if \( X_1, \ldots, X_n \) are independent: \( E\!\left[\prod_{i=1}^n g_i(X_i)\right] = \prod_{i=1}^n E[g_i(X_i)] \).
The covariance of \( X \) and \( Y \) is \( \text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y] \). If \( X \) and \( Y \) are independent, then \( \text{Cov}(X, Y) = 0 \). The converse is not generally true.
The correlation coefficient is \( \rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}} \). It satisfies \( -1 \leq \rho(X, Y) \leq 1 \). Equality \( |\rho| = 1 \) holds if and only if \( Y = aX + b \) (with probability 1) for some constants with \( a \neq 0 \).
(1) \( \text{Var}(aX + bY) = a^2\,\text{Var}(X) + b^2\,\text{Var}(Y) + 2ab\,\text{Cov}(X, Y) \).
(2) \( \text{Var}\!\left(\sum_{i=1}^n a_i X_i\right) = \sum_{i=1}^n a_i^2\,\text{Var}(X_i) + 2\sum_{i < j} a_i a_j\,\text{Cov}(X_i, X_j) \).
(3) If \( X_1, \ldots, X_n \) are independent: \( \text{Var}\!\left(\sum_{i=1}^n a_i X_i\right) = \sum_{i=1}^n a_i^2\,\text{Var}(X_i) \).
2.6 Conditional Distributions and Expectation
Discrete: \( f_X(x \mid y) = \frac{f(x, y)}{f_Y(y)} \), provided \( f_Y(y) > 0 \).
Continuous: \( f_X(x \mid y) = \frac{f(x, y)}{f_Y(y)} \), provided \( f_Y(y) > 0 \).
Similarly, \( f_Y(y \mid x) = f(x, y) / f_X(x) \).
These conditional functions are themselves valid probability distributions (they are non-negative and sum/integrate to 1).
Conditional Expectation
The conditional mean is \( E[Y \mid X = x] \) and the conditional variance is
\[ \text{Var}(Y \mid X = x) = E[Y^2 \mid X = x] - (E[Y \mid X = x])^2. \]The law of total expectation gives \( E[Y] = E[E[Y \mid X]] \), and the law of total variance gives \( \text{Var}(Y) = E[\text{Var}(Y \mid X)] + \text{Var}(E[Y \mid X]) \).
The first term captures the average within-group variance, and the second captures the between-group variance.
Example: suppose \( Y \sim \text{Poisson}(\theta) \) and \( X \mid Y \sim \text{Binomial}(Y, p) \). Then \( E[X \mid Y] = Yp \), so \( E[X] = E[Yp] = p\theta \).
\( \text{Var}(X \mid Y) = Yp(1-p) \), so:
\[ \text{Var}(X) = E[Yp(1-p)] + \text{Var}(Yp) = p(1-p)\theta + p^2\theta = p\theta. \]The matching mean and variance are consistent with \( X \sim \text{Poisson}(p\theta) \); the full result can be verified with the MGF technique.
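A short simulation of this thinning construction, assuming `numpy` (the values \( \theta = 5 \), \( p = 0.3 \), the seed, and the replication count are arbitrary):

```python
# Sketch: simulate Y ~ Poisson(theta) and X | Y ~ Binomial(Y, p), then compare X with Poisson(p*theta).
import numpy as np

rng = np.random.default_rng(1)
theta, p, n_sim = 5.0, 0.3, 200_000

Y = rng.poisson(theta, size=n_sim)
X = rng.binomial(Y, p)            # one Binomial(Y_i, p) draw per simulated Y_i

print(X.mean(), X.var())          # both should be close to p*theta = 1.5
```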
2.7 Joint Moment Generating Functions
The joint MGF of \( (X, Y) \) is \( M(t_1, t_2) = E[\exp\{t_1 X + t_2 Y\}] \), provided this exists for \( |t_1| < h_1 \) and \( |t_2| < h_2 \) for some \( h_1, h_2 > 0 \).
For \( n \) random variables: \( M(t_1, \ldots, t_n) = E[\exp\{t_1 X_1 + \cdots + t_n X_n\}] \).
Key applications:
- Recovering marginal MGFs: \( M_X(t_1) = M(t_1, 0) \) and \( M_Y(t_2) = M(0, t_2) \).
- Testing independence: \( X \) and \( Y \) are independent if and only if \( M(t_1, t_2) = M_X(t_1)\,M_Y(t_2) \).
If \( X \sim \text{Poisson}(\theta_1) \) and \( Y \sim \text{Poisson}(\theta_2) \) are independent, then \( M_{X+Y}(t) = M_X(t)\,M_Y(t) = \exp\{(\theta_1 + \theta_2)(e^t - 1)\} \), which is the MGF of \( \text{Poisson}(\theta_1 + \theta_2) \). By uniqueness, \( X + Y \sim \text{Poisson}(\theta_1 + \theta_2) \).
2.8 The Multinomial Distribution
Suppose \( n \) independent trials each result in exactly one of \( k \) types with probabilities \( p_1, \ldots, p_k \) (where \( \sum_{i=1}^k p_i = 1 \)), and let \( X_i \) count the type-\( i \) outcomes. Then \( (X_1, \ldots, X_k) \sim \text{Multinomial}(n; p_1, \ldots, p_k) \) with joint PMF \( f(x_1, \ldots, x_k) = \frac{n!}{x_1!\cdots x_k!}\,p_1^{x_1}\cdots p_k^{x_k} \), where \( x_i \in \{0, 1, \ldots, n\} \) and \( \sum_{i=1}^k x_i = n \).
Properties of the Multinomial distribution:
- Joint MGF: \( M(t_1, \ldots, t_k) = (p_1 e^{t_1} + \cdots + p_k e^{t_k})^n \).
- Marginals: Each \( X_i \sim \text{Binomial}(n, p_i) \).
- Pairwise sums: \( X_i + X_j \sim \text{Binomial}(n, p_i + p_j) \) for \( i \neq j \).
- Covariance: \( \text{Cov}(X_i, X_j) = -np_i p_j \) for \( i \neq j \).
- Conditional distributions: \( X_i \mid X_j = x_j \sim \text{Binomial}\!\left(n - x_j,\, \frac{p_i}{1 - p_j}\right) \) for \( i \neq j \).
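The covariance property above can be checked by simulation; this sketch assumes `numpy` (the values of \( n \), the probability vector, and the seed are arbitrary):

```python
# Sketch: Monte Carlo check of Cov(X_i, X_j) = -n p_i p_j for the Multinomial.
import numpy as np

rng = np.random.default_rng(2)
n, probs = 20, [0.2, 0.3, 0.5]
draws = rng.multinomial(n, probs, size=200_000)      # each row is (X_1, X_2, X_3)

emp_cov = np.cov(draws[:, 0], draws[:, 1])[0, 1]
print(emp_cov, -n * probs[0] * probs[1])             # both close to -1.2
```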
2.9 The Bivariate Normal Distribution
The pair \( (X_1, X_2) \) has a bivariate normal (BVN) distribution if its joint PDF is \( f(\mathbf{x}) = \frac{1}{2\pi\,|\Sigma|^{1/2}}\exp\!\left\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})\right\} \), where \( \boldsymbol{\mu} = (\mu_1, \mu_2)^\top \) and \( \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \).
Properties of the bivariate normal:
- Joint MGF: \( M(t_1, t_2) = \exp\!\left\{\mathbf{t}^\top\boldsymbol{\mu} + \frac{1}{2}\mathbf{t}^\top\Sigma\mathbf{t}\right\} \).
- Marginals: \( X_1 \sim N(\mu_1, \sigma_1^2) \) and \( X_2 \sim N(\mu_2, \sigma_2^2) \).
- Conditional distributions: \( X_1 \mid X_2 = x_2 \sim N\!\left(\mu_1 + \rho\frac{\sigma_1}{\sigma_2}(x_2 - \mu_2),\; \sigma_1^2(1 - \rho^2)\right) \), and symmetrically for \( X_2 \mid X_1 = x_1 \).
- Covariance: \( \text{Cov}(X_1, X_2) = \rho\sigma_1\sigma_2 \) and \( \text{Corr}(X_1, X_2) = \rho \).
- Independence: \( \rho = 0 \iff X_1 \) and \( X_2 \) are independent. (For the BVN, uncorrelated implies independent.)
- Linear combinations: any linear combination \( c_1 X_1 + c_2 X_2 \sim N(c_1\mu_1 + c_2\mu_2,\, \mathbf{c}^\top\Sigma\mathbf{c}) \).
- Quadratic form: \( (\mathbf{X} - \boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{X} - \boldsymbol{\mu}) \sim \chi^2(2) \).
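A quick simulation of the correlation and linear-combination properties, assuming `numpy` (all parameter values, the coefficient vector, and the seed are arbitrary choices):

```python
# Sketch: simulate a bivariate normal; check Corr(X1, X2) = rho and Var(c'X) = c' Sigma c.
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -2.0])
sigma1, sigma2, rho = 2.0, 1.0, 0.7
Sigma = np.array([[sigma1**2, rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2**2]])

X = rng.multivariate_normal(mu, Sigma, size=200_000)
c = np.array([1.0, 3.0])

print(np.corrcoef(X[:, 0], X[:, 1])[0, 1])   # ~0.7
print(np.var(X @ c), c @ Sigma @ c)          # both ~21.4
```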
Chapter 3: Functions of Random Variables
Given random variables \( X_1, \ldots, X_n \) and a function \( h \), we often need the distribution of \( Y = h(X_1, \ldots, X_n) \). Three principal techniques are available: the CDF technique, the one-to-one transformation (Jacobian) method, and the MGF technique.
3.1 The CDF Technique
The CDF technique is the most general method and works for both discrete and continuous variables.
Procedure (continuous case):
- For each \( y \in \mathbb{R} \), determine the region \( R_y = \{(x_1, \ldots, x_n) : h(x_1, \ldots, x_n) \leq y\} \).
- Compute \( F_Y(y) = P(Y \leq y) = \int_{R_y} f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n \).
- Differentiate: \( f_Y(y) = F_Y'(y) \).
Example: let \( X \sim N(0, 1) \) and \( Y = X^2 \). The support of \( Y \) is \( [0, \infty) \). For \( y > 0 \):
\[ F_Y(y) = P(X^2 \leq y) = P(-\sqrt{y} \leq X \leq \sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y}). \]Differentiating:
\[ f_Y(y) = \frac{1}{2\sqrt{y}}\left[f_X(\sqrt{y}) + f_X(-\sqrt{y})\right] = \frac{1}{2\sqrt{y}} \cdot \frac{2}{\sqrt{2\pi}}\,e^{-y/2} = \frac{1}{\sqrt{2\pi}}\,y^{-1/2}\,e^{-y/2}. \]This is the PDF of a \( \chi^2(1) \) distribution, equivalently \( \text{Gamma}(1/2, 2) \).
Example: let \( X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Uniform}(0, \theta) \). For the maximum \( X_{(n)} = \max_i X_i \): since the \( X_i \) are independent,
\[ F_{X_{(n)}}(y) = P(X_1 \leq y, \ldots, X_n \leq y) = \left(\frac{y}{\theta}\right)^n, \quad 0 < y < \theta. \]So \( f_{X_{(n)}}(y) = \frac{n}{\theta^n}\,y^{n-1} \) for \( 0 < y < \theta \).
For the minimum \( X_{(1)} = \min_i X_i \):
\[ F_{X_{(1)}}(y) = 1 - P(X_1 > y, \ldots, X_n > y) = 1 - \left(\frac{\theta - y}{\theta}\right)^n, \quad 0 < y < \theta. \]So \( f_{X_{(1)}}(y) = \frac{n}{\theta}\left(1 - \frac{y}{\theta}\right)^{n-1} \) for \( 0 < y < \theta \).
Probability integral transform: if \( X \) has a continuous CDF \( F \), then \( U = F(X) \sim \text{Uniform}(0, 1) \). Conversely, if \( U \sim \text{Uniform}(0,1) \), then \( X = F^{-1}(U) \) has CDF \( F \).
For the forward direction, \( P(F(X) \leq u) = P(X \leq F^{-1}(u)) = F(F^{-1}(u)) = u \) for \( 0 < u < 1 \), which is the CDF of \( \text{Uniform}(0,1) \).
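The converse direction is exactly inverse-CDF sampling. A minimal sketch assuming `numpy`, using the Exponential(\( \theta \)) inverse CDF \( F^{-1}(u) = -\theta\ln(1-u) \) (the value of \( \theta \) and the seed are arbitrary):

```python
# Sketch: inverse-CDF sampling for Exponential(theta) in the mean-theta parametrization.
import numpy as np

rng = np.random.default_rng(4)
theta = 2.0
U = rng.uniform(size=200_000)
X = -theta * np.log(1 - U)          # X = F^{-1}(U) should be Exponential(theta)

print(X.mean(), X.var())            # ~theta and ~theta^2, i.e. ~2 and ~4
```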
3.2 One-to-One Transformations (Univariate)
If \( Y = h(X) \), where \( h \) is one-to-one on the support \( A \) of \( X \), then \( f_Y(y) = f_X\!\big(h^{-1}(y)\big)\left|\frac{d}{dy}h^{-1}(y)\right| \), where \( x = h^{-1}(y) \) and the support of \( Y \) is \( h(A) \).
Example: let \( X \) have PDF \( f_X(x) = \theta/x^{\theta+1} \) for \( x > 1 \), and let \( Y = \ln(X) \). Since \( y = \ln(x) \implies x = e^y \) and \( dx/dy = e^y \), for \( y > 0 \):
\[ f_Y(y) = f_X(e^y)\,|e^y| = \frac{\theta}{(e^y)^{\theta+1}}\,e^y = \theta\,e^{-y\theta}. \]Thus \( Y \sim \text{Exponential}(1/\theta) \).
3.3 One-to-One Transformations (Bivariate Jacobian Method)
If \( (U, V) = (h_1(X, Y), h_2(X, Y)) \) is a one-to-one transformation with inverse \( x = w_1(u, v) \), \( y = w_2(u, v) \), then the joint PDF of \( (U, V) \) is \( g(u, v) = f\big(w_1(u, v), w_2(u, v)\big)\left|\frac{\partial(x,y)}{\partial(u,v)}\right| \), where the Jacobian is the absolute value of the determinant
\[ \frac{\partial(x,y)}{\partial(u,v)} = \begin{vmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{vmatrix}. \]Example: let \( X, Y \stackrel{\text{iid}}{\sim} N(0, 1) \), \( U = X + Y \), \( V = X - Y \). Inverse: \( x = (u+v)/2 \), \( y = (u-v)/2 \). The Jacobian is
\[ \frac{\partial(x,y)}{\partial(u,v)} = \begin{vmatrix} 1/2 & 1/2 \\ 1/2 & -1/2 \end{vmatrix} = -1/2. \]The joint PDF of \( (X, Y) \) is \( f(x,y) = \frac{1}{2\pi}\exp\{-(x^2+y^2)/2\} \). Substituting:
\[ g(u,v) = \frac{1}{2\pi}\exp\!\left\{-\frac{(u+v)^2/4 + (u-v)^2/4}{2}\right\}\cdot\frac{1}{2} = \frac{1}{4\pi}\exp\!\left\{-\frac{u^2+v^2}{4}\right\}. \]This factors as \( g_U(u)\,g_V(v) \), confirming that \( U \sim N(0,2) \) and \( V \sim N(0,2) \) are independent.
Example: let \( X, Y \stackrel{\text{iid}}{\sim} \text{Exponential}(1) \) and set \( U = X + Y \), \( V = X \). Then \( x = v \), \( y = u - v \), with Jacobian \( |J| = 1 \). The support becomes \( 0 < v < u < \infty \) (from \( x > 0 \) and \( y > 0 \)).
\[ g(u,v) = e^{-u}, \quad 0 < v < u. \]Marginalizing: \( g_U(u) = \int_0^u e^{-u}\,dv = u\,e^{-u} \) for \( u > 0 \), which is \( \text{Gamma}(2, 1) \).
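A quick check of this conclusion by simulation, assuming `numpy`/`scipy` (sample size and seed are arbitrary):

```python
# Sketch: the sum of two independent Exponential(1) variables should look like Gamma(2, 1).
import numpy as np
from scipy.stats import gamma, kstest

rng = np.random.default_rng(5)
U = rng.exponential(scale=1.0, size=100_000) + rng.exponential(scale=1.0, size=100_000)

# Gamma(alpha=2, beta=1) in the shape/scale parametrization used in these notes
print(U.mean(), U.var())                       # ~2 and ~2
print(kstest(U, gamma(a=2, scale=1).cdf))      # large p-value expected
```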
3.4 The MGF Technique
If \( X_1, \ldots, X_n \) are independent with MGFs \( M_{X_1}(t), \ldots, M_{X_n}(t) \) and \( T = \sum_{i=1}^n X_i \), then \( M_T(t) = \prod_{i=1}^n M_{X_i}(t) \). If additionally the \( X_i \) are identically distributed with common MGF \( M(t) \), then \( M_T(t) = [M(t)]^n \).
Important Distributions Derived via MGFs
If \( Z_1, \ldots, Z_k \stackrel{\text{iid}}{\sim} N(0,1) \), then \( Q = \sum_{i=1}^k Z_i^2 \sim \chi^2(k) \), the chi-squared distribution with \( k \) degrees of freedom. The MGF is \( M_Q(t) = (1 - 2t)^{-k/2} \) for \( t < 1/2 \). Note that \( \chi^2(k) = \text{Gamma}(k/2, 2) \).
If \( Y_i \sim \chi^2(k_i) \) are independent, then \( \sum Y_i \sim \chi^2(\sum k_i) \).
Linear combinations of independent normal random variables are again normal. In particular, if \( X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} N(\mu, \sigma^2) \), then \( \bar{X}_n \sim N(\mu, \sigma^2/n) \).
If \( Z \sim N(0,1) \) and \( W \sim \chi^2(\nu) \) are independent, then \( T = \frac{Z}{\sqrt{W/\nu}} \) follows the Student’s t-distribution with \( \nu \) degrees of freedom. Its support is \( (-\infty, \infty) \).
Also, if \( X \sim \chi^2(n) \) and \( Y \sim \chi^2(m) \) are independent, then \( X + Y \sim \chi^2(n + m) \).
Let \( X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} N(\mu, \sigma^2) \). Then: (1) \( \bar{X} \sim N(\mu, \sigma^2/n) \). (2) \( \frac{(n-1)S^2}{\sigma^2} = \frac{\sum(X_i - \bar{X})^2}{\sigma^2} \sim \chi^2(n-1) \). (3) \( \bar{X} \) and \( S^2 \) are independent. (4) \( \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1) \).
For (4), since \( \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1) \) and \( \frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1) \) are independent, the ratio has a \( t(n-1) \) distribution by definition.
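A short simulation of results (2) and (4), assuming `numpy`/`scipy` (the parameter values, sample size \( n = 10 \), and replication count are arbitrary):

```python
# Sketch: simulate (n-1)S^2/sigma^2 and the t statistic from N(mu, sigma^2) samples.
import numpy as np
from scipy.stats import chi2, t

rng = np.random.default_rng(6)
mu, sigma, n, reps = 5.0, 2.0, 10, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
S2 = samples.var(axis=1, ddof=1)

q = (n - 1) * S2 / sigma**2
tstat = (xbar - mu) / np.sqrt(S2 / n)

print(q.mean(), chi2(n - 1).mean())      # both ~ n-1 = 9
print(tstat.var(), t(n - 1).var())       # both ~ (n-1)/(n-3) = 9/7
```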
Chapter 4: Limiting and Asymptotic Distributions
4.1 Convergence in Distribution
A sequence of random variables \( X_1, X_2, \ldots \) with CDFs \( F_1, F_2, \ldots \) converges in distribution to a random variable \( X \) with CDF \( F \), written \( X_n \xrightarrow{d} X \), if \( \lim_{n \to \infty} F_n(x) = F(x) \) for all \( x \) at which \( F \) is continuous. The CDF \( F \) is called the limiting (or asymptotic) distribution of the sequence.
Example: let \( X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Uniform}(0, 1) \). For \( nX_{(1)} \): the CDF is \( F_n(x) = 1 - (1 - x/n)^n \) for \( 0 < x < n \). As \( n \to \infty \):
\[ \lim_{n \to \infty} F_n(x) = 1 - e^{-x}, \quad x > 0, \]which is the CDF of \( \text{Exponential}(1) \). So \( nX_{(1)} \xrightarrow{d} \text{Exponential}(1) \).
For \( X_{(n)} \): The CDF \( F_n(x) = x^n \) on \( (0,1) \) converges to the degenerate distribution at 1: \( X_{(n)} \xrightarrow{d} 1 \).
4.2 Convergence in Probability
A sequence \( X_n \) converges in probability to \( X \), written \( X_n \xrightarrow{P} X \), if \( \lim_{n \to \infty} P(|X_n - X| \geq \varepsilon) = 0 \) for every \( \varepsilon > 0 \). When \( X = b \) is a constant, we write \( X_n \xrightarrow{P} b \).
Convergence in probability implies convergence in distribution; the converse does not hold in general. However, for the special case of convergence to a constant:
\[ X_n \xrightarrow{d} b \iff X_n \xrightarrow{P} b. \]
4.3 The Weak Law of Large Numbers
Markov’s inequality: for any \( c > 0 \) and \( k > 0 \), \( P(|X| \geq c) \leq E[|X|^k]/c^k \). The most common case is \( k = 2 \) (Chebyshev’s inequality): \( P(|X| \geq c) \leq E[X^2]/c^2 \), which applied to \( X - \mu \) gives \( P(|X - \mu| \geq c) \leq \text{Var}(X)/c^2 \).
Weak law of large numbers: if \( X_1, \ldots, X_n \) are iid with mean \( \mu \) and finite variance \( \sigma^2 \), then \( \bar{X}_n \xrightarrow{P} \mu \). Proof sketch: by Chebyshev’s inequality, \( 0 \leq P(|\bar{X}_n - \mu| \geq \varepsilon) \leq \frac{\text{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0 \) as \( n \to \infty \). By the squeeze theorem, \( P(|\bar{X}_n - \mu| \geq \varepsilon) \to 0 \).
4.4 The Central Limit Theorem
Central limit theorem: if \( X_1, \ldots, X_n \) are iid with mean \( \mu \) and finite variance \( \sigma^2 \), then \( \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1) \). Equivalently, \( \sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2) \).
Proof sketch (assuming the MGF exists): let \( Y_i = (X_i - \mu)/\sigma \), so that \( \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} = \frac{1}{\sqrt{n}}\sum_{i=1}^n Y_i \) has MGF \( [M_Y(t/\sqrt{n})]^n \). Expanding by Taylor’s theorem: \( M_Y(t/\sqrt{n}) = 1 + \frac{t^2}{2n} + o(t^2/n) \). Therefore:
\[ \lim_{n \to \infty}\left[1 + \frac{t^2}{2n} + o\!\left(\frac{t^2}{n}\right)\right]^n = \exp\!\left\{\frac{t^2}{2}\right\}, \]which is the MGF of \( N(0,1) \). By the MGF convergence theorem, the result follows.
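A small simulation of the CLT, assuming `numpy`/`scipy` and using Exponential(1) summands (so \( \mu = \sigma^2 = 1 \)); the sample size, replication count, and seed are arbitrary:

```python
# Sketch: CLT in action for Exponential(1) samples.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n, reps = 50, 100_000
xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
Z = np.sqrt(n) * (xbar - 1.0)            # approximately N(0, 1) for large n

# empirical upper-tail probability vs. the exact normal tail (0.025); the gap shrinks as n grows
print((Z > 1.96).mean(), 1 - norm.cdf(1.96))
```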
4.5 Slutsky’s Theorem and the Continuous Mapping Theorem
Continuous mapping theorem: let \( g \) be a continuous function. (1) If \( X_n \xrightarrow{P} a \), then \( g(X_n) \xrightarrow{P} g(a) \). (2) If \( X_n \xrightarrow{d} X \), then \( g(X_n) \xrightarrow{d} g(X) \).
Slutsky’s theorem: if \( X_n \xrightarrow{d} X \) and \( Y_n \xrightarrow{P} b \) for a constant \( b \), then: (a) \( X_n + Y_n \xrightarrow{d} X + b \). (b) \( X_n Y_n \xrightarrow{d} bX \). (c) \( X_n / Y_n \xrightarrow{d} X/b \) provided \( b \neq 0 \).
Example: let \( X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Poisson}(\mu) \), so that \( \sigma^2 = \mu \). Since \( \bar{X}_n \xrightarrow{P} \mu \) by the WLLN, the continuous mapping theorem gives \( \sqrt{\mu}/\sqrt{\bar{X}_n} \xrightarrow{P} 1 \). By Slutsky’s theorem:
\[ \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sqrt{\bar{X}_n}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sqrt{\mu}} \cdot \frac{\sqrt{\mu}}{\sqrt{\bar{X}_n}} \xrightarrow{d} N(0,1). \]
4.6 The Delta Method
Delta method: suppose \( \sqrt{n}(X_n - \theta) \xrightarrow{d} N(0, \sigma^2) \) and \( g \) is differentiable at \( \theta \) with \( g'(\theta) \neq 0 \). Then \( \sqrt{n}[g(X_n) - g(\theta)] \xrightarrow{d} N(0, [g'(\theta)]^2\sigma^2) \). Equivalently, \( g(X_n) \) is approximately \( N(g(\theta),\, [g'(\theta)]^2\sigma^2/n) \) for large \( n \).
The intuition comes from a first-order Taylor expansion: \( g(X_n) \approx g(\theta) + g'(\theta)(X_n - \theta) \), so
\[ \sqrt{n}[g(X_n) - g(\theta)] \approx g'(\theta)\,\sqrt{n}(X_n - \theta), \]and the right side converges to \( g'(\theta) \cdot N(0, \sigma^2) = N(0, [g'(\theta)]^2\sigma^2) \).
Example (Poisson): here \( \sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \mu) \). Taking \( g(x) = \sqrt{x} \), so \( g'(x) = \frac{1}{2}x^{-1/2} \) and \( [g'(\mu)]^2 = \frac{1}{4\mu} \):
\[ \sqrt{n}\!\left(\sqrt{\bar{X}_n} - \sqrt{\mu}\right) \xrightarrow{d} N\!\left(0,\, \frac{1}{4\mu} \cdot \mu\right) = N\!\left(0, \frac{1}{4}\right). \]This is the variance-stabilizing transformation for the Poisson: the asymptotic variance \( 1/4 \) does not depend on \( \mu \).
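A simulation sketch of this stabilization, assuming `numpy` (the three values of \( \mu \), the sample size, and the seed are arbitrary):

```python
# Sketch: the variance of sqrt(n)(sqrt(Xbar) - sqrt(mu)) is ~1/4 regardless of mu.
import numpy as np

rng = np.random.default_rng(8)
n, reps = 200, 100_000
for mu in (2.0, 8.0, 20.0):
    xbar = rng.poisson(mu, size=(reps, n)).mean(axis=1)
    stat = np.sqrt(n) * (np.sqrt(xbar) - np.sqrt(mu))
    print(mu, stat.var())                # each close to 0.25
```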
Example: with \( \sqrt{n}(\bar{X}_n - \theta) \xrightarrow{d} N(0, \theta^2) \) (as for an \( \text{Exponential}(\theta) \) sample), for \( g(x) = \ln(x) \), \( g'(\theta) = 1/\theta \), so:
\[ \sqrt{n}(\ln\bar{X}_n - \ln\theta) \xrightarrow{d} N\!\left(0,\, \frac{1}{\theta^2}\cdot\theta^2\right) = N(0, 1). \]
4.7 Normal Approximations
The CLT enables us to approximate probabilities for sums and means of large samples. If \( X_1, \ldots, X_n \) are iid with mean \( \mu \) and variance \( \sigma^2 \), then for large \( n \):
\[ P\!\left(\bar{X}_n \leq x\right) \approx \Phi\!\left(\frac{x - \mu}{\sigma/\sqrt{n}}\right), \]where \( \Phi \) is the standard normal CDF. Similarly for the sum \( T_n = \sum X_i \):
\[ P(T_n \leq t) \approx \Phi\!\left(\frac{t - n\mu}{\sigma\sqrt{n}}\right). \]
Chapter 5: Point Estimation
5.1 Introduction and Basic Concepts
Suppose \( X_1, \ldots, X_n \) are iid random variables from a distribution with PDF (or PMF) \( f(x; \boldsymbol{\theta}) \), where \( \boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)^\top \) is an unknown parameter vector in the parameter space \( \Theta \). The goal of point estimation is to use the observed data to produce a single “best guess” for \( \boldsymbol{\theta} \).
5.2 Method of Moments
- Population moments: \( \mu_j = E[X^j] = \mu_j(\boldsymbol{\theta}) \) for \( j = 1, \ldots, k \).
- Sample moments: \( \hat{\mu}_j = \frac{1}{n}\sum_{i=1}^n X_i^j \).
- Solve the system \( \mu_j(\hat{\boldsymbol{\theta}}) = \hat{\mu}_j \) for \( j = 1, \ldots, k \) to obtain \( \hat{\boldsymbol{\theta}}_{\text{MM}} \).
Example: for \( N(\mu, \sigma^2) \), \( \mu_1(\mu, \sigma^2) = \mu \) and \( \mu_2(\mu, \sigma^2) = \mu^2 + \sigma^2 \).
Setting \( \hat{\mu}_1 = \bar{X}_n \) and \( \hat{\mu}_2 = \frac{1}{n}\sum X_i^2 \):
\[ \hat{\mu}_{\text{MM}} = \bar{X}_n, \qquad \hat{\sigma}^2_{\text{MM}} = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2. \]
5.3 Maximum Likelihood Estimation
The likelihood function is \( L(\theta) = \prod_{i=1}^n f(x_i; \theta) \), and the maximum likelihood estimator (MLE) \( \hat{\theta}_{\text{ML}} \) is the value of \( \theta \) that maximizes it. The log-likelihood is \( \ell(\theta) = \ln L(\theta) = \sum_{i=1}^n \ln f(x_i; \theta) \).
Example (Poisson): for \( X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Poisson}(\theta) \), setting \( \ell'(\theta) = \frac{\sum x_i}{\theta} - n = 0 \) gives \( \hat{\theta}_{\text{ML}} = \bar{X}_n \), the same as the MM estimator.
By invariance, the MLE of \( P(X_1 = 0) = e^{-\theta} \) is \( e^{-\bar{X}_n} \).
Example (Normal): for \( X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} N(\mu, \sigma^2) \), setting the partial derivatives of \( \ell(\mu, \sigma^2) \) to zero yields:
\[ \hat{\mu}_{\text{ML}} = \bar{X}_n, \qquad \hat{\sigma}^2_{\text{ML}} = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2. \]Note that \( \hat{\sigma}^2_{\text{ML}} \) is biased (its expectation is \( \frac{n-1}{n}\sigma^2 \)), unlike the unbiased estimator \( S^2 = \frac{1}{n-1}\sum(X_i - \bar{X}_n)^2 \).
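A quick simulation of this bias, assuming `numpy` (the normal parameters, small sample size, and seed are arbitrary):

```python
# Sketch: the ML variance estimator averages to about ((n-1)/n) * sigma^2, while S^2 is unbiased.
import numpy as np

rng = np.random.default_rng(11)
mu, sigma, n, reps = 0.0, 2.0, 5, 200_000
samples = rng.normal(mu, sigma, size=(reps, n))

sigma2_ml = samples.var(axis=1, ddof=0)   # divide by n
S2 = samples.var(axis=1, ddof=1)          # divide by n-1

print(sigma2_ml.mean(), (n - 1) / n * sigma**2)   # both ~3.2
print(S2.mean(), sigma**2)                        # both ~4.0
```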
Example (Uniform): for \( X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Uniform}(0, \theta) \), the likelihood is \( L(\theta) = \theta^{-n} \) for \( \theta \geq x_{(n)} \) and 0 otherwise. For \( \theta \geq x_{(n)} \), \( L(\theta) = \theta^{-n} \) is strictly decreasing, so the maximum is at \( \hat{\theta}_{\text{ML}} = X_{(n)} = \max(X_1, \ldots, X_n) \).
This differs from \( \hat{\theta}_{\text{MM}} = 2\bar{X}_n \). The calculus-based approach of setting derivatives to zero does not apply here because the maximum occurs at the boundary of the parameter space.
Example: for \( f(x; \theta) = \theta x^{\theta - 1} \), \( 0 < x < 1 \), the log-likelihood is \( \ell(\theta) = n\ln\theta + (\theta - 1)\sum\ln x_i \). Setting \( \ell'(\theta) = n/\theta + \sum\ln x_i = 0 \) gives:
\[ \hat{\theta}_{\text{ML}} = -\frac{n}{\sum_{i=1}^n \ln X_i}. \]
5.4 Properties of the MLE
Score Function and Information
The score function is \( S(\theta) = \ell'(\theta) = \frac{d}{d\theta}\ln L(\theta) \). When the support does not depend on \( \theta \), the MLE satisfies \( S(\hat{\theta}) = 0 \).
The Fisher (expected) information is \( J(\theta) = -E[\ell''(\theta)] = n\,J_1(\theta) \), where \( J_1(\theta) = -E\!\left[\frac{d^2}{d\theta^2}\ln f(X_1; \theta)\right] \) is the Fisher information from a single observation.
The Cramer-Rao Lower Bound
Cramer-Rao lower bound: under regularity conditions, any unbiased estimator \( \tilde{\theta} \) of \( \theta \) satisfies \( \text{Var}(\tilde{\theta}) \geq \frac{1}{J(\theta)} = \frac{1}{n\,J_1(\theta)} \). An estimator achieving the Cramer-Rao lower bound is called efficient (or a minimum variance unbiased estimator, MVUE).
Example (Poisson): \( \frac{d^2}{d\theta^2}\ln f(X; \theta) = -X/\theta^2 \), so \( J_1(\theta) = E[X]/\theta^2 = 1/\theta \), and \( J(\theta) = n/\theta \).
The MLE \( \hat{\theta} = \bar{X}_n \) has \( \text{Var}(\hat{\theta}) = \theta/n = 1/J(\theta) \), which equals the Cramer-Rao bound. Hence \( \bar{X}_n \) is an efficient estimator.
Asymptotic Properties of the MLE
(1) Consistency: \( \hat{\theta} \xrightarrow{P} \theta \) as \( n \to \infty \).
(2) Asymptotic normality: \( \sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} N\!\left(0, \frac{1}{J_1(\theta)}\right) \).
(3) Asymptotic efficiency: The MLE achieves the Cramer-Rao lower bound asymptotically, meaning \( \text{Var}(\hat{\theta}) \approx \frac{1}{J(\theta)} \) for large \( n \).
(4) Delta method extension: \( \sqrt{n}[g(\hat{\theta}) - g(\theta)] \xrightarrow{d} N\!\left(0, \frac{[g'(\theta)]^2}{J_1(\theta)}\right) \).
Example: let \( X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Poisson}(\theta) \). (i) MLE: \( \hat{\theta} = \bar{X}_n \). (ii) MLE of \( g(\theta) = e^{-\theta} \): by invariance, \( \hat{g} = e^{-\bar{X}_n} \). (iii) \( \sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} N(0, \theta) \) since \( J_1(\theta) = 1/\theta \). (iv) By the delta method with \( g(x) = e^{-x} \), \( g'(\theta) = -e^{-\theta} \):
\[ \sqrt{n}(e^{-\bar{X}_n} - e^{-\theta}) \xrightarrow{d} N(0,\, e^{-2\theta}\theta). \](v) \( \hat{\theta} = \bar{X}_n \) is exactly unbiased: \( E[\bar{X}_n] = \theta \). However, \( e^{-\bar{X}_n} \) is biased for \( e^{-\theta} \) in finite samples, though it is asymptotically unbiased.
Chapter 6: Confidence Intervals and Hypothesis Testing
6.1 Pivotal Quantities and Confidence Intervals
A pivotal quantity is a function \( Q = Q(X_1, \ldots, X_n; \theta) \) of the data and the parameter whose distribution does not depend on any unknown parameter. A pivotal quantity is the basis for constructing confidence intervals. The general procedure is:
- Find a pivotal quantity \( Q \) involving the parameter of interest.
- Determine constants \( a, b \) such that \( P(a \leq Q \leq b) = 1 - \alpha \).
- Rearrange the inequality to isolate \( \theta \), obtaining a \( 100(1-\alpha)\% \) confidence interval.
Example (normal mean, \( \sigma^2 \) known): \( Z = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \) is a pivotal quantity. A \( 100(1-\alpha)\% \) confidence interval for \( \mu \) is
\[ \bar{X}_n \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}, \]where \( z_{\alpha/2} \) is the upper \( \alpha/2 \) quantile of \( N(0,1) \).
When \( \sigma^2 \) is unknown, \( T = \frac{\bar{X}_n - \mu}{S/\sqrt{n}} \sim t(n-1) \) serves as the pivotal quantity, giving the CI: \( \bar{X}_n \pm t_{\alpha/2, n-1}\,\frac{S}{\sqrt{n}} \).
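A minimal sketch computing this t interval two ways, assuming `numpy`/`scipy` (the simulated data, sample size, and seed are arbitrary):

```python
# Sketch: a 95% t interval for mu, by hand and via scipy.stats.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
x = rng.normal(10.0, 3.0, size=25)
n, xbar, s = len(x), x.mean(), x.std(ddof=1)

tcrit = stats.t.ppf(0.975, df=n - 1)
print(xbar - tcrit * s / np.sqrt(n), xbar + tcrit * s / np.sqrt(n))
print(stats.t.interval(0.95, df=n - 1, loc=xbar, scale=s / np.sqrt(n)))
```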
For \( \sigma^2 \), the pivotal quantity \( \frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1) \) gives \( \left(\frac{(n-1)S^2}{\chi^2_{\alpha/2,\,n-1}},\; \frac{(n-1)S^2}{\chi^2_{1-\alpha/2,\,n-1}}\right) \) as a \( 100(1-\alpha)\% \) confidence interval for \( \sigma^2 \).
Large-Sample Confidence Intervals via the CLT
When the exact distribution of a pivotal quantity is not available, the CLT and asymptotic normality of the MLE provide approximate confidence intervals. By the asymptotic distribution of the MLE:
\[ \hat{\theta} \;\dot{\sim}\; N\!\left(\theta,\, \frac{1}{nJ_1(\theta)}\right), \]so an approximate \( 100(1-\alpha)\% \) CI for \( \theta \) is \( \hat{\theta} \pm z_{\alpha/2}\,\sqrt{1/(nJ_1(\hat{\theta}))} \).
6.2 Hypothesis Testing Fundamentals
A hypothesis test consists of: a null hypothesis \( H_0 \) and an alternative hypothesis \( H_1 \); a test statistic \( T(X_1, \ldots, X_n) \); and a rejection region (critical region) \( C \): reject \( H_0 \) if \( T \in C \).
Type I error: Rejecting \( H_0 \) when \( H_0 \) is true. Probability: \( \alpha = P(\text{reject } H_0 \mid H_0 \text{ true}) \). Type II error: Failing to reject \( H_0 \) when \( H_1 \) is true. Probability: \( \beta = P(\text{fail to reject } H_0 \mid H_1 \text{ true}) \).
The significance level is \( \alpha \). The power of the test is \( 1 - \beta \).
6.3 The Neyman-Pearson Lemma
Neyman-Pearson lemma: for testing the simple hypotheses \( H_0: \theta = \theta_0 \) versus \( H_1: \theta = \theta_1 \), the most powerful test of size \( \alpha \) rejects \( H_0 \) when the likelihood ratio \( \Lambda = \frac{L(\theta_1)}{L(\theta_0)} \) exceeds a critical value \( k \), where \( k \) is chosen so that \( P(\Lambda > k \mid H_0) = \alpha \).
6.4 Likelihood Ratio Tests
For testing composite hypotheses, the generalized likelihood ratio is commonly used.
The generalized likelihood ratio is \( \lambda = \frac{\sup_{\theta \in \Theta_0} L(\theta)}{\sup_{\theta \in \Theta} L(\theta)} = \frac{L(\hat{\theta}_0)}{L(\hat{\theta})} \), where \( \hat{\theta}_0 \) is the MLE under \( H_0 \) and \( \hat{\theta} \) is the unrestricted MLE. Note \( 0 \leq \lambda \leq 1 \), and we reject \( H_0 \) for small values of \( \lambda \).
Under \( H_0 \) (and regularity conditions), \( -2\ln\lambda \xrightarrow{d} \chi^2(r) \) as \( n \to \infty \), where \( r = \dim(\Theta) - \dim(\Theta_0) \) is the number of restrictions imposed by \( H_0 \). This provides a large-sample test: reject \( H_0 \) if \( -2\ln\lambda > \chi^2_{\alpha, r} \).
Example (normal mean, \( \sigma^2 \) known): test \( H_0: \mu = \mu_0 \) against \( H_1: \mu \neq \mu_0 \). Under \( H_0 \): \( L(\mu_0) \propto \exp\{-\frac{n(\bar{x} - \mu_0)^2}{2\sigma^2}\} \cdot \exp\{-\frac{\sum(x_i - \bar{x})^2}{2\sigma^2}\} \).
Under the full model: \( L(\hat{\mu}) \propto \exp\{-\frac{\sum(x_i - \bar{x})^2}{2\sigma^2}\} \).
The ratio simplifies to \( \lambda = \exp\{-\frac{n(\bar{x} - \mu_0)^2}{2\sigma^2}\} \), and \( -2\ln\lambda = \frac{n(\bar{x} - \mu_0)^2}{\sigma^2} = z^2 \), which has a \( \chi^2(1) \) distribution under \( H_0 \). Equivalently, the test rejects when \( |z| > z_{\alpha/2} \), consistent with the standard two-sided z-test.
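A minimal sketch of this test on simulated data, assuming `numpy`/`scipy` (the data are generated under \( H_0 \) with arbitrary parameter values and seed):

```python
# Sketch: the LRT statistic -2 ln(lambda) = n(xbar - mu0)^2 / sigma^2 equals z^2; compare to chi2(1).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(10)
sigma, mu0, n = 2.0, 5.0, 30
x = rng.normal(mu0, sigma, size=n)          # data generated under H0

z = (x.mean() - mu0) / (sigma / np.sqrt(n))
lrt = n * (x.mean() - mu0)**2 / sigma**2
print(lrt, z**2)                            # identical
print(lrt > chi2.ppf(0.95, df=1))           # reject H0 at level 0.05? (usually False under H0)
```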
6.5 Summary of Common Test Statistics
| Setting | Hypotheses | Test Statistic | Distribution under \( H_0 \) |
|---|---|---|---|
| Normal mean, \( \sigma^2 \) known | \( H_0: \mu = \mu_0 \) | \( Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \) | \( N(0,1) \) |
| Normal mean, \( \sigma^2 \) unknown | \( H_0: \mu = \mu_0 \) | \( T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \) | \( t(n-1) \) |
| Normal variance | \( H_0: \sigma^2 = \sigma_0^2 \) | \( \chi^2 = \frac{(n-1)S^2}{\sigma_0^2} \) | \( \chi^2(n-1) \) |
| Two normal variances | \( H_0: \sigma_1^2 = \sigma_2^2 \) | \( F = \frac{S_1^2}{S_2^2} \) | \( F(n_1-1, n_2-1) \) |
| Large-sample proportion | \( H_0: p = p_0 \) | \( Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} \) | \( N(0,1) \) approx |