Sources and References
Primary notes — Cameron Roopnarine (Hextical), STAT 330: Mathematical Statistics, Fall 2020, hextical.github.io/university-notes
Supplementary notes — David Duan, STAT 330 Master Notes, david-duan.me
Textbook — Bain and Engelhardt, Introduction to Probability and Mathematical Statistics, 2nd Edition
Chapter 1: Univariate Random Variables
1.1 The Probability Model
A probability model provides a mathematical framework for describing a random experiment. It consists of three components:
- A sample space \( S \), the set of all possible outcomes of the experiment.
- An event \( A \), which is a subset of \( S \).
- A probability function \( P \) that assigns a real number to each event.
Probability Function. A function \( P \) defined on events in \( S \) is a probability function if it satisfies the following axioms:
- \( P(A) \geq 0 \) for every event \( A \).
- \( P(S) = 1 \).
- Countable additivity: for any sequence of mutually exclusive events \( A_1, A_2, \ldots \) (i.e., \( A_i \cap A_j = \varnothing \) for \( i \neq j \)),
\[
P\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).
\]
From these axioms, several useful properties follow immediately:
- \( P(\varnothing) = 0 \).
- Complement rule: \( P(\bar{A}) = 1 - P(A) \).
- Inclusion-exclusion: \( P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2) \).
- Monotonicity: If \( A_1 \subseteq A_2 \), then \( P(A_1) \leq P(A_2) \).
- \( 0 \leq P(A) \leq 1 \) for all events \( A \).
Conditional Probability. For events \( A \) and \( B \) with \( P(B) > 0 \), the conditional probability of \( A \) given \( B \) is
\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)}.
\]
Independence of Events. Two events \( A \) and \( B \) are independent if and only if
\[
P(A \cap B) = P(A)\,P(B).
\]
Equivalently, when \( P(B) > 0 \), independence holds if and only if \( P(A \mid B) = P(A) \). Intuitively, the occurrence of one event does not influence the probability of the other.
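These definitions can be checked by brute-force enumeration on a small sample space. A minimal Python sketch (two fair dice, exact arithmetic via `fractions`); the events chosen here, "the sum is 7" and "the first die shows 3", are illustrative:

```python
from fractions import Fraction

# sample space: ordered outcomes of two fair dice, all equally likely
S = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def P(event):
    """Probability of an event (a predicate on outcomes) under equal likelihood."""
    return Fraction(sum(1 for s in S if event(s)), len(S))

A = lambda s: s[0] + s[1] == 7   # sum is 7
B = lambda s: s[0] == 3          # first die shows 3

cond = P(lambda s: A(s) and B(s)) / P(B)   # P(A | B) by the definition
print(cond, P(A))                          # equal, so A and B are independent
```

Since \( P(A \mid B) = 1/6 = P(A) \), these two events are independent even though they look related.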
1.2 Random Variables and Cumulative Distribution Functions
Random Variable. A random variable \( X \) is a function from the sample space \( S \) to the real numbers \( \mathbb{R} \), i.e., \( X : S \to \mathbb{R} \), such that for every \( x \in \mathbb{R} \), the set \( \{X \leq x\} = \{s \in S : X(s) \leq x\} \) is an event (a subset of \( S \) with a well-defined probability).
The main purpose of a random variable is to quantify outcomes of a random experiment numerically.
Cumulative Distribution Function (CDF). The cumulative distribution function of a random variable \( X \) is defined by
\[
F(x) = P(X \leq x), \quad x \in \mathbb{R}.
\]
Properties of the CDF. Let \( F \) be the CDF of a random variable \( X \). Then:
(1) \( F \) is non-decreasing: if \( x_1 \leq x_2 \), then \( F(x_1) \leq F(x_2) \).
(2) \( \lim_{x \to \infty} F(x) = 1 \) and \( \lim_{x \to -\infty} F(x) = 0 \).
(3) \( F \) is right-continuous: \( \lim_{x \to a^+} F(x) = F(a) \) for all \( a \in \mathbb{R} \).
(4) \( P(a < X \leq b) = F(b) - F(a) \).
(5) \( P(X = a) = F(a) - \lim_{x \to a^-} F(x) \).
1.3 Discrete Random Variables
Discrete Random Variable. A random variable \( X \) is discrete if it takes on a finite or countable number of values. Its CDF is a right-continuous step function.
Probability Mass Function (PMF). The probability mass function (also called the probability function) of a discrete random variable \( X \) is
\[
f(x) = P(X = x).
\]
The support of \( X \) is the set \( A = \{x : f(x) > 0\} \). The PMF satisfies \( f(x) \geq 0 \) for all \( x \) and \( \sum_{x \in A} f(x) = 1 \).
Common Discrete Distributions
| Distribution | PMF \( f(x) \) | Support | Mean | Variance | MGF |
|---|---|---|---|---|---|
| \( \text{Bernoulli}(p) \) | \( p^x(1-p)^{1-x} \) | \( x \in \{0,1\} \) | \( p \) | \( p(1-p) \) | \( 1 - p + pe^t \) |
| \( \text{Binomial}(n,p) \) | \( \binom{n}{x}p^x(1-p)^{n-x} \) | \( x = 0,1,\ldots,n \) | \( np \) | \( np(1-p) \) | \( (pe^t + 1-p)^n \) |
| \( \text{Geometric}(p) \) | \( (1-p)^x p \) | \( x = 0,1,2,\ldots \) | \( (1-p)/p \) | \( (1-p)/p^2 \) | \( \frac{p}{1-(1-p)e^t} \) |
| \( \text{NegBin}(r,p) \) | \( \binom{x+r-1}{x}(1-p)^x p^r \) | \( x = 0,1,2,\ldots \) | \( r(1-p)/p \) | \( r(1-p)/p^2 \) | \( \left(\frac{p}{1-(1-p)e^t}\right)^r \) |
| \( \text{Poisson}(\lambda) \) | \( e^{-\lambda}\lambda^x / x! \) | \( x = 0,1,2,\ldots \) | \( \lambda \) | \( \lambda \) | \( \exp\{\lambda(e^t - 1)\} \) |
Poisson as a Binomial Limit. If \( X \sim \text{Binomial}(n,p) \) with \( n \to \infty \) and \( np = \lambda \) held fixed (so \( p = \lambda/n \to 0 \)), then
\[
P(X = x) = \binom{n}{x}\left(\frac{\lambda}{n}\right)^x\!\left(1 - \frac{\lambda}{n}\right)^{n-x} \;\xrightarrow{n \to \infty}\; \frac{e^{-\lambda}\lambda^x}{x!},
\]
which is the Poisson PMF. The key step uses \( \lim_{n \to \infty}(1 + z/n)^n = e^z \).
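A quick numerical check of this limit, sketched in Python with standard-library functions only (the choices \( \lambda = 3 \) and \( x = 2 \) are arbitrary):

```python
from math import comb, exp, factorial

lam, x = 3.0, 2  # arbitrary rate and point at which to compare PMFs

def binom_pmf(n):
    """Binomial(n, lam/n) probability of exactly x successes."""
    p = lam / n
    return comb(n, x) * p**x * (1 - p)**(n - x)

poisson_pmf = exp(-lam) * lam**x / factorial(x)

for n in (10, 100, 10_000):
    print(n, binom_pmf(n))
print("Poisson limit:", poisson_pmf)
```

The binomial probabilities visibly converge to the Poisson value as \( n \) grows with \( np = \lambda \) fixed.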
1.4 Continuous Random Variables
Continuous Random Variable. A random variable \( X \) is continuous if its CDF \( F(x) \) is continuous for all \( x \in \mathbb{R} \) and differentiable except possibly at countably many points.
Probability Density Function (PDF). The probability density function of a continuous random variable \( X \) is
\[
f(x) = F'(x)
\]
wherever the derivative exists, and \( f(x) = 0 \) otherwise. The PDF satisfies:
(i) \( f(x) \geq 0 \) for all \( x \in \mathbb{R} \).
(ii) \( \int_{-\infty}^{\infty} f(x)\,dx = 1 \).
(iii) \( F(x) = \int_{-\infty}^{x} f(t)\,dt \).
(iv) \( P(a < X \leq b) = P(a \leq X \leq b) = \int_a^b f(x)\,dx \), since \( P(X = c) = 0 \) for any single point \( c \).
The Gamma Function
Gamma Function. For \( \alpha > 0 \), the gamma function is defined by
\[
\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x}\,dx.
\]
Key properties:
(1) \( \Gamma(\alpha) = (\alpha - 1)\,\Gamma(\alpha - 1) \) for \( \alpha > 1 \).
(2) \( \Gamma(n) = (n-1)! \) for positive integers \( n \).
(3) \( \Gamma(1/2) = \sqrt{\pi} \).
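All three properties can be verified directly with Python's `math.gamma` (a numerical check, not a proof):

```python
from math import gamma, factorial, sqrt, pi, isclose

# (1) recurrence Gamma(a) = (a - 1) * Gamma(a - 1), checked at a = 3.7
assert isclose(gamma(3.7), 2.7 * gamma(2.7))
# (2) Gamma(n) = (n - 1)! at positive integers
assert all(isclose(gamma(n), factorial(n - 1)) for n in range(1, 10))
# (3) Gamma(1/2) = sqrt(pi)
assert isclose(gamma(0.5), sqrt(pi))
print("all gamma-function identities check out")
```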
Common Continuous Distributions
| Distribution | PDF \( f(x) \) | Support | Mean | Variance | MGF |
|---|---|---|---|---|---|
| \( \text{Uniform}(a,b) \) | \( \frac{1}{b-a} \) | \( a < x < b \) | \( \frac{a+b}{2} \) | \( \frac{(b-a)^2}{12} \) | \( \frac{e^{tb}-e^{ta}}{t(b-a)} \) |
| \( \text{Exponential}(\theta) \) | \( \frac{1}{\theta}e^{-x/\theta} \) | \( x > 0 \) | \( \theta \) | \( \theta^2 \) | \( (1-\theta t)^{-1} \) |
| \( \text{Gamma}(\alpha,\beta) \) | \( \frac{x^{\alpha-1}e^{-x/\beta}}{\Gamma(\alpha)\beta^\alpha} \) | \( x > 0 \) | \( \alpha\beta \) | \( \alpha\beta^2 \) | \( (1-\beta t)^{-\alpha} \) |
| \( N(\mu,\sigma^2) \) | \( \frac{1}{\sqrt{2\pi}\sigma}\exp\!\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\} \) | \( x \in \mathbb{R} \) | \( \mu \) | \( \sigma^2 \) | \( \exp\!\left\{\mu t + \frac{\sigma^2 t^2}{2}\right\} \) |
| \( \text{Weibull}(\theta,\beta) \) | \( \frac{\beta}{\theta^\beta}x^{\beta-1}\exp\!\left\{-\left(\frac{x}{\theta}\right)^\beta\right\} \) | \( x > 0 \) | \( \theta\,\Gamma(1+1/\beta) \) | \( \theta^2\!\left[\Gamma(1+2/\beta) - \Gamma(1+1/\beta)^2\right] \) | — |
Verifying the Normal PDF integrates to 1. For the standard normal \( Z \sim N(0,1) \), we need to show
\[
\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\!\left\{-\frac{x^2}{2}\right\} dx = 1.
\]
By symmetry, the integral equals \( 2\int_0^{\infty} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx \). Using the substitution \( y = x^2/2 \) (so \( dx = \frac{1}{\sqrt{2y}}\,dy \)):
\[
\frac{2}{\sqrt{2\pi}} \int_0^{\infty} e^{-y} \frac{1}{\sqrt{2y}}\,dy = \frac{1}{\sqrt{\pi}} \int_0^{\infty} y^{-1/2} e^{-y}\,dy = \frac{1}{\sqrt{\pi}}\,\Gamma\!\left(\frac{1}{2}\right) = \frac{\sqrt{\pi}}{\sqrt{\pi}} = 1.
\]
For general \( X \sim N(\mu, \sigma^2) \), writing \( X = \sigma Z + \mu \) with \( Z \sim N(0,1) \) and substituting \( z = (x - \mu)/\sigma \) reduces the general case to the standard one.
1.5 Expectation and Variance
Expected Value. The expectation (or expected value, or mean) of a random variable \( X \) is, in the discrete case,
\[
E[X] = \sum_{x \in A} x\,f(x), \quad \text{provided } \sum_{x \in A} |x|\,f(x) < \infty,
\]
and in the continuous case,
\[
E[X] = \int_{-\infty}^{\infty} x\,f(x)\,dx, \quad \text{provided } \int_{-\infty}^{\infty} |x|\,f(x)\,dx < \infty.
\]
If the absolute convergence condition fails, \( E[X] \) does not exist.
Expectation of a Function. For a function \( g \) applied to a random variable \( X \):
\[
E[g(X)] = \begin{cases} \sum_{x \in A} g(x)\,f(x) & \text{if } X \text{ is discrete}, \\[4pt] \int_{-\infty}^{\infty} g(x)\,f(x)\,dx & \text{if } X \text{ is continuous}, \end{cases}
\]
provided the sum or integral converges absolutely.
Linearity of Expectation. For real constants \( a, b, c \) and functions \( g, h \):
\[
E[ag(X) + bh(X) + c] = a\,E[g(X)] + b\,E[h(X)] + c.
\]
Variance. The variance of a random variable \( X \) is
\[
\text{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2,
\]
where \( \mu = E[X] \).
Proof of the shortcut formula.
\[
\text{Var}(X) = E[(X - \mu)^2] = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu\,E[X] + \mu^2 = E[X^2] - 2\mu^2 + \mu^2 = E[X^2] - \mu^2.
\]
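For a concrete check, both expressions can be evaluated for a fair six-sided die (a small Python sketch):

```python
from math import isclose

faces = range(1, 7)                      # fair die: each face has probability 1/6
mu  = sum(x / 6 for x in faces)          # E[X] = 3.5
ex2 = sum(x * x / 6 for x in faces)      # E[X^2] = 91/6

var_definition = sum((x - mu) ** 2 / 6 for x in faces)  # E[(X - mu)^2]
var_shortcut   = ex2 - mu ** 2                          # E[X^2] - (E[X])^2

print(var_definition, var_shortcut)      # both equal 35/12
```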
Properties of variance:
- \( \text{Var}(a) = 0 \) for any constant \( a \).
- \( \text{Var}(X) \geq 0 \).
- \( \text{Var}(X + a) = \text{Var}(X) \) (invariant under location shifts).
- \( \text{Var}(aX) = a^2\,\text{Var}(X) \).
- \( \text{Var}(aX + bY) = a^2\,\text{Var}(X) + b^2\,\text{Var}(Y) + 2ab\,\text{Cov}(X,Y) \).
- If \( X_1, \ldots, X_n \) are independent: \( \text{Var}\!\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \text{Var}(X_i) \).
Moments
Moments. Let \( X \) be a random variable with mean \( \mu \).
The \( k \)-th moment about the origin (or raw moment) is \( E[X^k] \).
The \( k \)-th central moment (moment about the mean) is \( E[(X - \mu)^k] \).
In particular, the first moment is the mean and the second central moment is the variance.
Moments of the Gamma distribution. If \( X \sim \text{Gamma}(\alpha, \beta) \), then for \( p > -\alpha \):
\[
E[X^p] = \frac{\beta^p\,\Gamma(\alpha + p)}{\Gamma(\alpha)}.
\]
Setting \( p = 1 \): \( E[X] = \alpha\beta \). Setting \( p = 2 \): \( E[X^2] = \alpha(\alpha+1)\beta^2 \). Therefore \( \text{Var}(X) = \alpha\beta^2 \).
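This moment formula is easy to sanity-check numerically with `math.gamma`; the parameter values below are arbitrary:

```python
from math import gamma, isclose

alpha, beta = 2.5, 1.5   # arbitrary shape and scale parameters

def raw_moment(p):
    """E[X^p] for X ~ Gamma(alpha, beta), valid for p > -alpha."""
    return beta ** p * gamma(alpha + p) / gamma(alpha)

mean = raw_moment(1)
var = raw_moment(2) - mean ** 2
print(mean, var)   # should equal alpha*beta and alpha*beta**2
```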
Moments of the Poisson distribution. If \( X \sim \text{Poisson}(\theta) \), then
\[
E[X] = \sum_{x=0}^{\infty} x\,\frac{\theta^x}{x!}\,e^{-\theta} = \theta \sum_{y=0}^{\infty} \frac{\theta^y}{y!}\,e^{-\theta} = \theta.
\]
A similar calculation yields \( E[X^2] = \theta^2 + \theta \), so \( \text{Var}(X) = \theta \).
1.6 Moment Generating Functions
Moment Generating Function (MGF). The moment generating function of a random variable \( X \) is
\[
M_X(t) = E[e^{tX}],
\]
provided this expectation exists (is finite) for all \( t \) in some open interval \( (-h, h) \) containing 0.
Moment Extraction. If \( X \) has MGF \( M_X(t) \) defined on \( (-h, h) \), then for \( k = 1, 2, \ldots \):
\[
E[X^k] = M_X^{(k)}(0),
\]
where \( M_X^{(k)}(0) \) denotes the \( k \)-th derivative of \( M_X(t) \) evaluated at \( t = 0 \).
Since \( M_X(t) = E[e^{tX}] \), differentiating under the expectation gives
\[
M_X^{(k)}(t) = E[X^k e^{tX}].
\]
Setting \( t = 0 \) yields \( M_X^{(k)}(0) = E[X^k \cdot 1] = E[X^k] \).
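As a numerical illustration (not part of the proof), the first two moments of a \( \text{Poisson}(2) \) variable can be recovered from finite-difference approximations to the derivatives of its MGF \( M(t) = \exp\{\lambda(e^t - 1)\} \):

```python
from math import exp

lam = 2.0
M = lambda t: exp(lam * (exp(t) - 1.0))   # Poisson(lam) MGF

h = 1e-5
m1 = (M(h) - M(-h)) / (2 * h)             # central difference ~ M'(0) = E[X]
m2 = (M(h) - 2 * M(0) + M(-h)) / h ** 2   # ~ M''(0) = E[X^2]

print(m1, m2)   # close to lam = 2 and lam + lam**2 = 6
```

The approximations agree with \( E[X] = \lambda \) and \( E[X^2] = \lambda + \lambda^2 \) up to discretization error.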
Uniqueness Theorem. If two random variables \( X \) and \( Y \) have MGFs that are equal in a neighbourhood of 0, i.e., \( M_X(t) = M_Y(t) \) for all \( t \in (-h, h) \) with \( h > 0 \), then \( X \) and \( Y \) have the same distribution.
MGF of a Linear Function. If \( Y = aX + b \), then
\[
M_Y(t) = e^{bt}\,M_X(at).
\]
\( M_Y(t) = E[e^{tY}] = E[e^{t(aX+b)}] = e^{bt}\,E[e^{(at)X}] = e^{bt}\,M_X(at) \).
MGF of the Normal distribution. For \( Z \sim N(0,1) \):
\[
M_Z(t) = \int_{-\infty}^{\infty} e^{tx}\,\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx = \exp\!\left\{\frac{t^2}{2}\right\}.
\]
This is derived by completing the square in the exponent: \( tx - x^2/2 = -(x-t)^2/2 + t^2/2 \). For general \( X \sim N(\mu, \sigma^2) \), write \( X = \sigma Z + \mu \) and apply the MGF of a linear function:
\[
M_X(t) = e^{\mu t}\,M_Z(\sigma t) = \exp\!\left\{\mu t + \frac{\sigma^2 t^2}{2}\right\}.
\]
Identifying a distribution via its MGF. Suppose \( M_X(t) = (1 - 2t)^{-1} \). This matches the form \( (1 - \beta t)^{-\alpha} \) with \( \alpha = 1, \beta = 2 \), which is the MGF of a \( \text{Gamma}(1, 2) = \text{Exponential}(2) \) distribution.
Chapter 2: Multivariate Random Variables
2.1 Joint and Marginal Distribution Functions
Joint CDF. The joint cumulative distribution function of random variables \( X \) and \( Y \) is
\[
F(x, y) = P(X \leq x, Y \leq y), \quad (x, y) \in \mathbb{R}^2.
\]
Properties of the joint CDF:
- \( F \) is non-decreasing in each argument when the other is held fixed.
- \( \lim_{x \to -\infty} F(x, y) = 0 \) and \( \lim_{y \to -\infty} F(x, y) = 0 \).
- \( \lim_{(x,y) \to (-\infty,-\infty)} F(x,y) = 0 \) and \( \lim_{(x,y) \to (\infty,\infty)} F(x,y) = 1 \).
Marginal CDFs. The marginal CDF of \( X \) is obtained by letting \( y \to \infty \):
\[
F_X(x) = \lim_{y \to \infty} F(x, y) = P(X \leq x).
\]
Similarly, \( F_Y(y) = \lim_{x \to \infty} F(x, y) \).
2.2 Bivariate Discrete Distributions
Joint PMF. If \( X \) and \( Y \) are jointly discrete, their joint probability mass function is
\[
f(x, y) = P(X = x, Y = y), \quad (x, y) \in \mathbb{R}^2.
\]
The joint support is \( A = \{(x, y) : f(x, y) > 0\} \). It satisfies \( f(x,y) \geq 0 \) and \( \sum_{(x,y) \in A} f(x,y) = 1 \).
Marginal PMFs. The marginal PMFs are obtained by summing out the other variable:
\[
f_X(x) = \sum_y f(x, y), \qquad f_Y(y) = \sum_x f(x, y).
\]
Joint PMF example. Suppose \( f(x,y) = (1-p)^2 p^{x+y} \) for \( x, y = 0, 1, 2, \ldots \) and \( 0 < p < 1 \). Then:
The marginal PMF of \( X \): \( f_X(x) = \sum_{y=0}^{\infty} (1-p)^2 p^{x+y} = (1-p)^2 p^x \cdot \frac{1}{1-p} = (1-p)p^x \), a Geometric distribution with success probability \( 1-p \) (i.e., \( \text{Geometric}(1-p) \) in the parameterization of the table above).
\[
P(X \leq Y) = \sum_{x=0}^{\infty} \sum_{y=x}^{\infty} (1-p)^2 p^{x+y} = (1-p) \sum_{x=0}^{\infty} p^{2x} = \frac{1-p}{1-p^2} = \frac{1}{1+p}.
\]
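Both computations can be confirmed by truncating the infinite sums in Python (the terms decay geometrically, so 200 terms are ample; \( p = 0.4 \) is arbitrary):

```python
p = 0.4
N = 200  # truncation point; the neglected tail is of order p**N

f = lambda x, y: (1 - p) ** 2 * p ** (x + y)   # joint PMF

marginal_x0 = sum(f(0, y) for y in range(N))                # should be (1-p)*p**0
prob_x_le_y = sum(f(x, y) for x in range(N) for y in range(x, N))

print(marginal_x0, prob_x_le_y)   # 1-p and 1/(1+p)
```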
2.3 Bivariate Continuous Distributions
Joint PDF. If the joint CDF can be written as
\[
F(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f(s, t)\,dt\,ds,
\]
then \( X \) and \( Y \) are jointly continuous with joint probability density function
\[
f(x, y) = \frac{\partial^2 F(x, y)}{\partial x\,\partial y}
\]
wherever this mixed partial derivative exists. For any region \( R \subseteq \mathbb{R}^2 \),
\[
P((X, Y) \in R) = \iint_R f(x, y)\,dx\,dy.
\]
Marginal PDFs. The marginal density of \( X \) is
\[
f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy,
\]
and similarly \( f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx \).
Working with a joint PDF. Let \( f(x,y) = x + y \) for \( 0 \leq x \leq 1 \), \( 0 \leq y \leq 1 \) and zero otherwise. Then:
Verification: \( \int_0^1 \int_0^1 (x+y)\,dy\,dx = \int_0^1 (x + 1/2)\,dx = 1 \).
Marginal of \( X \): \( f_X(x) = \int_0^1 (x+y)\,dy = x + 1/2 \) for \( 0 \leq x \leq 1 \).
\( P(X \leq Y) = \int_0^1 \int_x^1 (x+y)\,dy\,dx = 1/2 \).
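A crude midpoint-rule check of both calculations (pure Python; the grid size is arbitrary, and the boundary \( x = y \) introduces a small discretization error in the second probability):

```python
n = 400
h = 1.0 / n

total = 0.0      # should integrate to 1
prob  = 0.0      # should approach P(X <= Y) = 1/2
for i in range(n):
    x = (i + 0.5) * h
    for j in range(n):
        y = (j + 0.5) * h
        mass = (x + y) * h * h   # density times cell area
        total += mass
        if x <= y:
            prob += mass

print(total, prob)
```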
2.4 Independence
Independence of Random Variables. Random variables \( X \) and \( Y \) are independent if and only if for all sets \( A, B \subseteq \mathbb{R} \):
\[
P(X \in A, Y \in B) = P(X \in A)\,P(Y \in B).
\]
Equivalent Conditions for Independence. \( X \) and \( Y \) are independent if and only if any one of the following holds:
(1) \( F(x, y) = F_X(x)\,F_Y(y) \) for all \( (x,y) \in \mathbb{R}^2 \).
(2) \( f(x, y) = f_X(x)\,f_Y(y) \) for all \( (x,y) \).
Factorization Theorem for Independence. Suppose \( (X, Y) \) has joint density (or mass function) \( f(x, y) \) with joint support \( A \), and the marginal supports are \( A_X \) and \( A_Y \). Then \( X \) and \( Y \) are independent if and only if:
(i) \( A = A_X \times A_Y \) (the support is a “rectangle”), and
(ii) there exist non-negative functions \( g \) and \( h \) such that \( f(x, y) = g(x)\,h(y) \) for all \( (x,y) \in A \).
Functions of Independent Variables. If \( X \) and \( Y \) are independent and \( g, h \) are real-valued functions, then \( g(X) \) and \( h(Y) \) are independent.
2.5 Joint Expectation, Covariance, and Correlation
Joint Expectation. For a function \( h(x, y) \) of jointly distributed random variables:
Discrete case: \( E[h(X, Y)] = \sum_{(x,y) \in A} h(x, y)\,f(x, y) \).
Continuous case: \( E[h(X, Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(x, y)\,f(x, y)\,dx\,dy \).
Linearity of Expectation (general). For any random variables \( X_1, \ldots, X_n \) and constants \( a_1, \ldots, a_n \):
\[
E\!\left[\sum_{i=1}^n a_i X_i\right] = \sum_{i=1}^n a_i\,E[X_i].
\]
This holds regardless of whether the variables are independent.
Independence and Expectation. If \( X \) and \( Y \) are independent and \( g(x), h(y) \) are real-valued functions, then
\[
E[g(X)\,h(Y)] = E[g(X)]\,E[h(Y)].
\]
More generally, if \( X_1, \ldots, X_n \) are independent: \( E\!\left[\prod_{i=1}^n g_i(X_i)\right] = \prod_{i=1}^n E[g_i(X_i)] \).
Covariance. The covariance of \( X \) and \( Y \) is
\[
\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]\,E[Y].
\]
If \( X \) and \( Y \) are independent, then \( \text{Cov}(X, Y) = 0 \). The converse is not generally true.
Correlation Coefficient. The correlation coefficient is the standardized covariance:
\[
\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)}\,\sqrt{\text{Var}(Y)}}.
\]
It satisfies \( -1 \leq \rho(X, Y) \leq 1 \). Equality \( |\rho| = 1 \) holds if and only if \( Y = aX + b \) with probability 1 for some constants \( a \neq 0 \) and \( b \).
Variance of a Linear Combination.
(1) \( \text{Var}(aX + bY) = a^2\,\text{Var}(X) + b^2\,\text{Var}(Y) + 2ab\,\text{Cov}(X, Y) \).
(2) \( \text{Var}\!\left(\sum_{i=1}^n a_i X_i\right) = \sum_{i=1}^n a_i^2\,\text{Var}(X_i) + 2\sum_{i < j} a_i a_j\,\text{Cov}(X_i, X_j) \).
(3) If \( X_1, \ldots, X_n \) are independent: \( \text{Var}\!\left(\sum_{i=1}^n a_i X_i\right) = \sum_{i=1}^n a_i^2\,\text{Var}(X_i) \).
2.6 Conditional Distributions and Expectation
Conditional PMF/PDF. The conditional distribution of \( X \) given \( Y = y \) is defined, in both the discrete and continuous cases, by
\[
f_X(x \mid y) = \frac{f(x, y)}{f_Y(y)}, \quad \text{provided } f_Y(y) > 0.
\]
Similarly, \( f_Y(y \mid x) = f(x, y) / f_X(x) \).
These conditional functions are themselves valid probability distributions (they are non-negative and sum/integrate to 1).
Product Rule. The joint density factors as
\[
f(x, y) = f_X(x \mid y)\,f_Y(y) = f_Y(y \mid x)\,f_X(x).
\]
Independence via Conditional Distributions. \( X \) and \( Y \) are independent if and only if \( f_X(x \mid y) = f_X(x) \) for all \( x, y \) (and equivalently \( f_Y(y \mid x) = f_Y(y) \)).
Conditional Expectation
Conditional Expectation. The conditional expectation of \( g(Y) \) given \( X = x \) is
\[
E[g(Y) \mid X = x] = \begin{cases} \sum_y g(y)\,f_Y(y \mid x) & \text{if } Y \text{ is discrete}, \\[4pt] \int_{-\infty}^{\infty} g(y)\,f_Y(y \mid x)\,dy & \text{if } Y \text{ is continuous}. \end{cases}
\]
The conditional mean is \( E[Y \mid X = x] \) and the conditional variance is
\[
\text{Var}(Y \mid X = x) = E[Y^2 \mid X = x] - (E[Y \mid X = x])^2.
\]
Law of Total Expectation (Double Expectation Theorem). For any random variables \( X \) and \( Y \):
\[
E[g(Y)] = E[E[g(Y) \mid X]].
\]
In particular, \( E[Y] = E[E[Y \mid X]] \).
For the continuous case (stated here with the roles of \( X \) and \( Y \) interchanged, which is immaterial):
\[
E[E[g(X) \mid Y]] = \int_{-\infty}^{\infty}\!\left[\int_{-\infty}^{\infty} g(x)\,f_X(x \mid y)\,dx\right] f_Y(y)\,dy = \int_{-\infty}^{\infty} g(x) \underbrace{\left[\int_{-\infty}^{\infty} f(x,y)\,dy\right]}_{f_X(x)} dx = E[g(X)].
\]
Law of Total Variance. For any random variables \( X \) and \( Y \):
\[
\text{Var}(Y) = E[\text{Var}(Y \mid X)] + \text{Var}(E[Y \mid X]).
\]
The first term captures the average within-group variance, and the second captures the between-group variance.
Application of total expectation and variance. Suppose \( Y \sim \text{Poisson}(\theta) \) and \( X \mid Y = y \sim \text{Binomial}(y, p) \). Then:
\( E[X \mid Y] = Yp \), so \( E[X] = E[Yp] = p\theta \).
\[
\text{Var}(X) = E[Yp(1-p)] + \text{Var}(Yp) = p(1-p)\theta + p^2\theta = p\theta.
\]
This confirms \( X \sim \text{Poisson}(p\theta) \).
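The conclusion \( X \sim \text{Poisson}(p\theta) \) can also be verified pointwise by mixing the binomial conditional PMF over the Poisson weights (truncated sums in Python; \( \theta \) and \( p \) are arbitrary):

```python
from math import exp, factorial, comb

theta, p = 3.0, 0.4

def pois(rate, k):
    """Poisson PMF at k."""
    return exp(-rate) * rate ** k / factorial(k)

for x in range(6):
    # P(X = x) = sum_y P(Y = y) * P(Binomial(y, p) = x), truncated at y = 80
    mixed = sum(pois(theta, y) * comb(y, x) * p ** x * (1 - p) ** (y - x)
                for y in range(x, 80))
    assert abs(mixed - pois(p * theta, x)) < 1e-12
print("marginal of X matches Poisson(p*theta)")
```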
2.7 Joint Moment Generating Functions
Joint MGF. The joint moment generating function of \( X \) and \( Y \) is
\[
M(t_1, t_2) = E[e^{t_1 X + t_2 Y}],
\]
provided this exists for \( |t_1| < h_1 \) and \( |t_2| < h_2 \) for some \( h_1, h_2 > 0 \).
For \( n \) random variables: \( M(t_1, \ldots, t_n) = E[\exp\{t_1 X_1 + \cdots + t_n X_n\}] \).
Key applications:
- Recovering marginal MGFs: \( M_X(t_1) = M(t_1, 0) \) and \( M_Y(t_2) = M(0, t_2) \).
- Testing independence: \( X \) and \( Y \) are independent if and only if \( M(t_1, t_2) = M_X(t_1)\,M_Y(t_2) \).
Additivity of independent Poissons. If \( X \sim \text{Poisson}(\theta_1) \) and \( Y \sim \text{Poisson}(\theta_2) \) are independent, then
\[
M_{X+Y}(t) = E[e^{tX}]\,E[e^{tY}] = \exp\{\theta_1(e^t - 1)\}\,\exp\{\theta_2(e^t - 1)\} = \exp\{(\theta_1 + \theta_2)(e^t - 1)\},
\]
which is the MGF of \( \text{Poisson}(\theta_1 + \theta_2) \). By uniqueness, \( X + Y \sim \text{Poisson}(\theta_1 + \theta_2) \).
2.8 The Multinomial Distribution
Multinomial Distribution. Suppose \( n \) independent trials are performed, each resulting in one of \( k \) categories with probabilities \( p_1, \ldots, p_k \) (\( \sum p_i = 1 \)). Let \( X_i \) count the number of outcomes in category \( i \). Then \( (X_1, \ldots, X_k) \sim \text{Multinomial}(n; p_1, \ldots, p_k) \) with joint PMF
\[
f(x_1, \ldots, x_k) = \frac{n!}{x_1!\,x_2!\cdots x_k!}\,p_1^{x_1}\cdots p_k^{x_k},
\]
where \( x_i \in \{0, 1, \ldots, n\} \) and \( \sum_{i=1}^k x_i = n \).
Properties of the Multinomial distribution:
- Joint MGF: \( M(t_1, \ldots, t_k) = (p_1 e^{t_1} + \cdots + p_k e^{t_k})^n \).
- Marginals: Each \( X_i \sim \text{Binomial}(n, p_i) \).
- Pairwise sums: \( X_i + X_j \sim \text{Binomial}(n, p_i + p_j) \) for \( i \neq j \).
- Covariance: \( \text{Cov}(X_i, X_j) = -np_i p_j \) for \( i \neq j \).
- Conditional distributions: \( X_i \mid X_j = x_j \sim \text{Binomial}\!\left(n - x_j,\, \frac{p_i}{1 - p_j}\right) \) for \( i \neq j \).
2.9 The Bivariate Normal Distribution
Bivariate Normal Distribution. Random variables \( X_1 \) and \( X_2 \) follow a bivariate normal distribution, written \( \mathbf{X} = (X_1, X_2)^\top \sim \text{BVN}(\boldsymbol{\mu}, \Sigma) \), if their joint PDF is
\[
f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\!\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 - \frac{2\rho(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2}\right]\right\},
\]
where \( \boldsymbol{\mu} = (\mu_1, \mu_2)^\top \) and \( \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \).
Properties of the bivariate normal:
- Joint MGF: \( M(t_1, t_2) = \exp\!\left\{\mathbf{t}^\top\boldsymbol{\mu} + \frac{1}{2}\mathbf{t}^\top\Sigma\mathbf{t}\right\} \).
- Marginals: \( X_1 \sim N(\mu_1, \sigma_1^2) \) and \( X_2 \sim N(\mu_2, \sigma_2^2) \).
- Conditional distributions:
\[
X_2 \mid X_1 = x_1 \sim N\!\left(\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x_1 - \mu_1),\; \sigma_2^2(1-\rho^2)\right).
\]
- Covariance: \( \text{Cov}(X_1, X_2) = \rho\sigma_1\sigma_2 \) and \( \text{Corr}(X_1, X_2) = \rho \).
- Independence: \( \rho = 0 \iff X_1 \) and \( X_2 \) are independent. (For BVN, uncorrelated implies independent.)
- Linear combinations: Any linear combination \( c_1 X_1 + c_2 X_2 \sim N(c_1\mu_1 + c_2\mu_2,\, \mathbf{c}^\top\Sigma\mathbf{c}) \).
- Quadratic form: \( (\mathbf{x} - \boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu}) \sim \chi^2(2) \).
Chapter 3: Functions of Random Variables
Given random variables \( X_1, \ldots, X_n \) and a function \( h \), we often need the distribution of \( Y = h(X_1, \ldots, X_n) \). Three principal techniques are available: the CDF technique, the one-to-one transformation (Jacobian) method, and the MGF technique.
3.1 The CDF Technique
The CDF technique is the most general method and works for both discrete and continuous variables.
Procedure (continuous case):
- For each \( y \in \mathbb{R} \), determine the region \( R_y = \{(x_1, \ldots, x_n) : h(x_1, \ldots, x_n) \leq y\} \).
- Compute \( F_Y(y) = P(Y \leq y) = \int_{R_y} f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n \).
- Differentiate: \( f_Y(y) = F_Y'(y) \).
Distribution of \( Y = X^2 \) where \( X \sim N(0,1) \). For \( y > 0 \):
\[
F_Y(y) = P(X^2 \leq y) = P(-\sqrt{y} \leq X \leq \sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y}).
\]\[
f_Y(y) = \frac{1}{2\sqrt{y}}\left[f_X(\sqrt{y}) + f_X(-\sqrt{y})\right] = \frac{1}{2\sqrt{y}} \cdot \frac{2}{\sqrt{2\pi}}\,e^{-y/2} = \frac{1}{\sqrt{2\pi}}\,y^{-1/2}\,e^{-y/2}.
\]
This is the PDF of a \( \chi^2(1) \) distribution, equivalently \( \text{Gamma}(1/2, 2) \).
Distribution of order statistics. Let \( X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Uniform}(0, \theta) \).
\[
F_{X_{(n)}}(y) = P(X_1 \leq y, \ldots, X_n \leq y) = \left(\frac{y}{\theta}\right)^n, \quad 0 < y < \theta.
\]
So \( f_{X_{(n)}}(y) = \frac{n}{\theta^n}\,y^{n-1} \) for \( 0 < y < \theta \).
\[
F_{X_{(1)}}(y) = 1 - P(X_1 > y, \ldots, X_n > y) = 1 - \left(\frac{\theta - y}{\theta}\right)^n, \quad 0 < y < \theta.
\]
So \( f_{X_{(1)}}(y) = \frac{n}{\theta}\left(1 - \frac{y}{\theta}\right)^{n-1} \) for \( 0 < y < \theta \).
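The maximum's density can be sanity-checked numerically: it should integrate to 1 over \( (0, \theta) \), and its mean works out to \( \frac{n}{n+1}\theta \) (a midpoint-rule sketch in Python; \( n = 5 \) and \( \theta = 2 \) are arbitrary):

```python
n, theta = 5, 2.0
m = 100_000                 # number of grid cells for the midpoint rule
h = theta / m

mass = 0.0
mean = 0.0
for i in range(m):
    y = (i + 0.5) * h
    f = n * y ** (n - 1) / theta ** n   # density of the sample maximum
    mass += f * h
    mean += y * f * h

print(mass, mean)   # ~1 and ~ n*theta/(n+1)
```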
Probability Integral Transform. If \( X \) is a continuous random variable with CDF \( F \), then \( Y = F(X) \sim \text{Uniform}(0, 1) \).
Conversely, if \( U \sim \text{Uniform}(0,1) \), then \( X = F^{-1}(U) \) has CDF \( F \).
For \( 0 < y < 1 \):
\[
P(Y \leq y) = P(F(X) \leq y) = P(X \leq F^{-1}(y)) = F(F^{-1}(y)) = y,
\]
which is the CDF of \( \text{Uniform}(0,1) \).
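The converse direction is the basis of inverse-CDF sampling. A seeded Python sketch generating \( \text{Exponential}(\theta) \) draws from uniforms via \( F^{-1}(u) = -\theta\ln(1-u) \) (\( \theta = 2 \) is arbitrary):

```python
import random
from math import log

random.seed(42)
theta = 2.0

def exp_draw():
    """One Exponential(theta) draw via the probability integral transform."""
    u = random.random()          # U ~ Uniform(0, 1)
    return -theta * log(1 - u)   # F^{-1}(U), where F(x) = 1 - e^{-x/theta}

xs = [exp_draw() for _ in range(100_000)]
sample_mean = sum(xs) / len(xs)
print(sample_mean)   # should be near E[X] = theta
```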
3.2 One-to-One Transformations
Univariate Transformation Theorem. Let \( X \) be a continuous random variable with PDF \( f_X(x) \) and support \( A \). If \( h \) is a one-to-one (monotone) function on \( A \), and \( Y = h(X) \), then the PDF of \( Y \) is
\[
f_Y(y) = f_X(x) \left|\frac{dx}{dy}\right|,
\]
where \( x = h^{-1}(y) \) and the support of \( Y \) is \( h(A) \).
Log transformation. Let \( f_X(x) = \frac{\theta}{x^{\theta+1}} \) for \( x \geq 1 \) (a Pareto distribution) and \( Y = \ln(X) \).
\[
f_Y(y) = f_X(e^y)\,|e^y| = \frac{\theta}{(e^y)^{\theta+1}}\,e^y = \theta\,e^{-\theta y}, \quad y > 0.
\]
Thus \( Y \sim \text{Exponential}(1/\theta) \).
CDF transformation yields Uniform. If \( X \sim N(0,1) \) and \( Y = \Phi(X) \) where \( \Phi \) is the standard normal CDF, then \( Y \sim \text{Uniform}(0,1) \) by the Probability Integral Transform.
3.3 Bivariate Transformations
Bivariate Transformation Theorem. Let \( (X, Y) \) have joint PDF \( f(x, y) \). Define \( U = h_1(X, Y) \) and \( V = h_2(X, Y) \), where the transformation is one-to-one with inverse \( x = w_1(u, v) \), \( y = w_2(u, v) \). The joint PDF of \( (U, V) \) is
\[
g(u, v) = f(w_1(u,v),\, w_2(u,v))\,\left|\frac{\partial(x, y)}{\partial(u, v)}\right|,
\]
where the Jacobian determinant (whose absolute value appears above) is
\[
\frac{\partial(x,y)}{\partial(u,v)} = \begin{vmatrix} \partial x/\partial u & \partial x/\partial v \\ \partial y/\partial u & \partial y/\partial v \end{vmatrix}.
\]
Sum and difference of independent normals. Let \( X, Y \stackrel{\text{iid}}{\sim} N(0, 1) \). Define \( U = X + Y \) and \( V = X - Y \).
\[
\frac{\partial(x,y)}{\partial(u,v)} = \begin{vmatrix} 1/2 & 1/2 \\ 1/2 & -1/2 \end{vmatrix} = -1/2.
\]\[
g(u,v) = \frac{1}{2\pi}\exp\!\left\{-\frac{(u+v)^2/4 + (u-v)^2/4}{2}\right\}\cdot\frac{1}{2} = \frac{1}{4\pi}\exp\!\left\{-\frac{u^2+v^2}{4}\right\}.
\]
This factors as \( g_U(u)\,g_V(v) \), confirming that \( U \sim N(0,2) \) and \( V \sim N(0,2) \) are independent.
Finding a marginal via an auxiliary variable. Suppose \( f(x,y) = e^{-x-y} \) for \( x > 0 \) and \( y > 0 \) (so \( X \) and \( Y \) are independent \( \text{Exponential}(1) \) variables). To find the PDF of \( U = X + Y \), set \( V = X \); the inverse transformation \( x = v \), \( y = u - v \) has Jacobian \( -1 \), so
\[
g(u,v) = e^{-u}, \quad 0 < v < u.
\]
Marginalizing: \( g_U(u) = \int_0^u e^{-u}\,dv = u\,e^{-u} \) for \( u > 0 \), which is \( \text{Gamma}(2, 1) \).
3.4 The MGF Technique
MGF of a Sum of Independents. If \( X_1, \ldots, X_n \) are independent random variables, then \( T = \sum_{i=1}^n X_i \) has MGF
\[
M_T(t) = \prod_{i=1}^n M_{X_i}(t).
\]
If additionally the \( X_i \) are identically distributed with common MGF \( M(t) \), then \( M_T(t) = [M(t)]^n \).
Important Distributions Derived via MGFs
Chi-Squared Distribution. If \( Z_1, \ldots, Z_k \stackrel{\text{iid}}{\sim} N(0, 1) \), then
\[
Q = \sum_{i=1}^k Z_i^2 \sim \chi^2(k).
\]
The MGF is \( M_Q(t) = (1 - 2t)^{-k/2} \) for \( t < 1/2 \). Note that \( \chi^2(k) = \text{Gamma}(k/2, 2) \).
If \( Y_i \sim \chi^2(k_i) \) are independent, then \( \sum Y_i \sim \chi^2(\sum k_i) \).
Linear Combinations of Independent Normals. If \( X_i \sim N(\mu_i, \sigma_i^2) \) independently for \( i = 1, \ldots, n \), then
\[
\sum_{i=1}^n a_i X_i \sim N\!\left(\sum_{i=1}^n a_i\mu_i,\; \sum_{i=1}^n a_i^2\sigma_i^2\right).
\]
In particular, if \( X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} N(\mu, \sigma^2) \), then \( \bar{X}_n \sim N(\mu, \sigma^2/n) \).
Student's t-Distribution. If \( Z \sim N(0, 1) \) and \( Q \sim \chi^2(\nu) \) are independent, then
\[
T = \frac{Z}{\sqrt{Q/\nu}} \sim t(\nu),
\]
the Student's t-distribution with \( \nu \) degrees of freedom. Its support is \( (-\infty, \infty) \).
F-Distribution. If \( X \sim \chi^2(n) \) and \( Y \sim \chi^2(m) \) are independent, then
\[
\frac{X/n}{Y/m} \sim F(n, m).
\]
Also, if \( X \) and \( Y \) are independent chi-squared, then \( X + Y \sim \chi^2(n + m) \).
Sampling from the Normal Distribution. If \( X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} N(\mu, \sigma^2) \), define \( \bar{X} = \frac{1}{n}\sum X_i \) and \( S^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2 \). Then:
(1) \( \bar{X} \sim N(\mu, \sigma^2/n) \).
(2) \( \frac{(n-1)S^2}{\sigma^2} = \frac{\sum(X_i - \bar{X})^2}{\sigma^2} \sim \chi^2(n-1) \).
(3) \( \bar{X} \) and \( S^2 \) are independent.
(4) \( \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1) \).
For (2), the key identity is \( \sum(X_i - \mu)^2 = \sum(X_i - \bar{X})^2 + n(\bar{X} - \mu)^2 \). Dividing by \( \sigma^2 \), the left side is \( \chi^2(n) \) and \( n(\bar{X}-\mu)^2/\sigma^2 \sim \chi^2(1) \). Writing \( U = (n-1)S^2/\sigma^2 \) and using the independence of \( \bar{X} \) and \( S^2 \), comparing MGFs gives \( (1-2t)^{-n/2} = M_U(t)\,(1-2t)^{-1/2} \), so \( M_U(t) = (1-2t)^{-(n-1)/2} \), confirming \( U \sim \chi^2(n-1) \).
For (4), since \( \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0,1) \) and \( \frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1) \) are independent, the ratio has a \( t(n-1) \) distribution by definition.
Chapter 4: Limiting and Asymptotic Distributions
4.1 Convergence in Distribution
Convergence in Distribution. Let \( X_1, X_2, \ldots \) be a sequence of random variables with CDFs \( F_1, F_2, \ldots \), and let \( X \) be a random variable with CDF \( F \). We say \( X_n \) converges in distribution to \( X \), written \( X_n \xrightarrow{d} X \), if
\[
\lim_{n \to \infty} F_n(x) = F(x)
\]
for all \( x \) at which \( F \) is continuous. The CDF \( F \) is called the limiting (or asymptotic) distribution of the sequence.
Limiting distribution of order statistics. Let \( X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Uniform}(0, 1) \) and consider the scaled minimum \( nX_{(1)} \), with CDF
\[
F_n(x) = P(nX_{(1)} \leq x) = 1 - \left(1 - \frac{x}{n}\right)^n, \quad 0 < x < n.
\]
Then
\[
\lim_{n \to \infty} F_n(x) = 1 - e^{-x}, \quad x > 0,
\]
which is the CDF of \( \text{Exponential}(1) \). So \( nX_{(1)} \xrightarrow{d} \text{Exponential}(1) \).
For \( X_{(n)} \): The CDF \( F_n(x) = x^n \) on \( (0,1) \) converges to the degenerate distribution at 1: \( X_{(n)} \xrightarrow{d} 1 \).
4.2 Convergence in Probability
Convergence in Probability. A sequence \( X_1, X_2, \ldots \) converges in probability to a random variable \( X \), written \( X_n \xrightarrow{P} X \), if for every \( \varepsilon > 0 \):
\[
\lim_{n \to \infty} P(|X_n - X| \geq \varepsilon) = 0.
\]
When \( X = b \) is a constant, we write \( X_n \xrightarrow{P} b \).
Convergence in probability implies convergence in distribution. If \( X_n \xrightarrow{P} X \), then \( X_n \xrightarrow{d} X \). The converse is not true in general, but it does hold when the limit is a constant \( b \):
\[
X_n \xrightarrow{d} b \iff X_n \xrightarrow{P} b.
\]
4.3 The Weak Law of Large Numbers
Markov's Inequality. For any random variable \( X \) and constants \( k > 0 \), \( c > 0 \):
\[
P(|X| \geq c) \leq \frac{E[|X|^k]}{c^k}.
\]
Applying this with \( k = 2 \) to \( X - \mu \) gives Chebyshev's inequality: \( P(|X - \mu| \geq c) \leq \text{Var}(X)/c^2 \).
Weak Law of Large Numbers (WLLN). Let \( X_1, X_2, \ldots \) be iid random variables with \( E[X_i] = \mu \) and \( \text{Var}(X_i) = \sigma^2 < \infty \). Then the sample mean
\[
\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{P} \mu.
\]
By Chebyshev's inequality, for any \( \varepsilon > 0 \):
\[
P(|\bar{X}_n - \mu| \geq \varepsilon) \leq \frac{E[(\bar{X}_n - \mu)^2]}{\varepsilon^2} = \frac{\text{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0
\]
as \( n \to \infty \). Since probabilities are nonnegative, the squeeze theorem gives \( P(|\bar{X}_n - \mu| \geq \varepsilon) \to 0 \), i.e., \( \bar{X}_n \xrightarrow{P} \mu \).
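The WLLN and the Chebyshev bound can be checked by simulation (illustrative; the Uniform(0, 1) population, tolerance \( \varepsilon = 0.05 \), and sample sizes are arbitrary choices, not from the notes):

```python
import random

random.seed(2)  # reproducibility

mu, var = 0.5, 1 / 12         # mean and variance of Uniform(0, 1)
eps, reps = 0.05, 4000

def tail_prob(n):
    """Monte Carlo estimate of P(|Xbar_n - mu| >= eps)."""
    hits = 0
    for _ in range(reps):
        xbar = sum(random.random() for _ in range(n)) / n
        if abs(xbar - mu) >= eps:
            hits += 1
    return hits / reps

# Estimated tail probability vs. the Chebyshev bound sigma^2 / (n * eps^2).
results = {n: (tail_prob(n), var / (n * eps ** 2)) for n in (50, 500)}
for n, (p_hat, bound) in results.items():
    print(n, round(p_hat, 4), round(bound, 4))
```

The estimated tail probability shrinks toward 0 as \( n \) grows and stays below the (typically loose) Chebyshev bound \( \sigma^2/(n\varepsilon^2) \).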
4.4 The Central Limit Theorem
Central Limit Theorem (CLT). Let \( X_1, X_2, \ldots \) be iid random variables with \( E[X_i] = \mu \) and \( \text{Var}(X_i) = \sigma^2 \in (0, \infty) \). Then
\[
\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1).
\]
Equivalently, \( \sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2) \).
Proof via MGFs. Let \( Y_i = (X_i - \mu)/\sigma \), so the \( Y_i \) are iid with \( E[Y_i] = 0 \), \( \text{Var}(Y_i) = 1 \), and MGF \( M_Y(t) \), assumed to exist in a neighborhood of 0. The MGF of \( \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} = \frac{1}{\sqrt{n}}\sum Y_i \) is
\[
M_n(t) = \left[M_Y\!\left(\frac{t}{\sqrt{n}}\right)\right]^n.
\]
Expanding by Taylor's theorem about 0, using \( M_Y(0) = 1 \), \( M_Y'(0) = E[Y_1] = 0 \), and \( M_Y''(0) = E[Y_1^2] = 1 \): \( M_Y(t/\sqrt{n}) = 1 + \frac{t^2}{2n} + o(t^2/n) \). Therefore:
\[
\lim_{n \to \infty}\left[1 + \frac{t^2}{2n} + o\!\left(\frac{t^2}{n}\right)\right]^n = \exp\!\left\{\frac{t^2}{2}\right\},
\]
which is the MGF of \( N(0,1) \). By the MGF convergence theorem, the result follows.
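The key limit in this proof can be verified numerically, ignoring the \( o(t^2/n) \) remainder (a small illustrative check, not from the notes):

```python
import math

# Check numerically that [1 + t^2/(2n)]^n -> exp(t^2/2), the N(0, 1) MGF.
for t in (0.5, 1.0, 2.0):
    target = math.exp(t ** 2 / 2)
    for n in (10, 1000, 100_000):
        approx = (1 + t ** 2 / (2 * n)) ** n
        print(f"t={t} n={n}: {approx:.6f} vs {target:.6f}")
```

For each fixed \( t \), the gap shrinks as \( n \) grows, exactly as the classical limit \( (1 + a/n)^n \to e^a \) with \( a = t^2/2 \) predicts.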
MGF Convergence Theorem. Let \( X_n \) have MGF \( M_n(t) \) and \( X \) have MGF \( M(t) \). If there exists \( h > 0 \) such that \( \lim_{n \to \infty} M_n(t) = M(t) \) for all \( t \in (-h, h) \), then \( X_n \xrightarrow{d} X \).
4.5 Slutsky’s Theorem and the Continuous Mapping Theorem
Continuous Mapping Theorem. Let \( g(\cdot) \) be a continuous function. Then:
(1) If \( X_n \xrightarrow{P} a \), then \( g(X_n) \xrightarrow{P} g(a) \).
(2) If \( X_n \xrightarrow{d} X \), then \( g(X_n) \xrightarrow{d} g(X) \).
Slutsky's Theorem. If \( X_n \xrightarrow{d} X \) and \( Y_n \xrightarrow{P} b \) (a constant), then:
(a) \( X_n + Y_n \xrightarrow{d} X + b \). (This also holds if \( b \) is replaced by a random variable \( Y \).)
(b) \( X_n Y_n \xrightarrow{d} bX \).
(c) \( X_n / Y_n \xrightarrow{d} X/b \) provided \( b \neq 0 \).
Standardizing with estimated variance. Let \( X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Poisson}(\mu) \). By the CLT,
\[
\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sqrt{\mu}} \xrightarrow{d} N(0, 1).
\]
Since \( \bar{X}_n \xrightarrow{P} \mu \) by the WLLN, the continuous mapping theorem gives \( \sqrt{\mu}/\sqrt{\bar{X}_n} \xrightarrow{P} 1 \). By Slutsky's theorem:
\[
\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sqrt{\bar{X}_n}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sqrt{\mu}} \cdot \frac{\sqrt{\mu}}{\sqrt{\bar{X}_n}} \xrightarrow{d} N(0,1).
\]
4.6 The Delta Method
Delta Method. Suppose \( \sqrt{n}(X_n - \theta) \xrightarrow{d} N(0, \sigma^2) \) and \( g \) is differentiable at \( \theta \) with \( g'(\theta) \neq 0 \). Then
\[
\sqrt{n}\left[g(X_n) - g(\theta)\right] \xrightarrow{d} N\!\left(0,\, [g'(\theta)]^2\,\sigma^2\right).
\]
Equivalently, \( g(X_n) \) is approximately \( N(g(\theta),\, [g'(\theta)]^2\sigma^2/n) \) for large \( n \).
Proof sketch. A first-order Taylor expansion of \( g \) about \( \theta \) gives
\[
\sqrt{n}[g(X_n) - g(\theta)] \approx g'(\theta)\,\sqrt{n}(X_n - \theta),
\]
and the right side converges in distribution to \( g'(\theta) \cdot N(0, \sigma^2) = N(0, [g'(\theta)]^2\sigma^2) \).
Delta method for the Poisson. Let \( X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Poisson}(\mu) \). Since \( \text{Var}(X_i) = \mu \), the CLT gives \( \sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \mu) \). Taking \( g(x) = \sqrt{x} \), so that \( g'(\mu) = 1/(2\sqrt{\mu}) \), the delta method yields
\[
\sqrt{n}\!\left(\sqrt{\bar{X}_n} - \sqrt{\mu}\right) \xrightarrow{d} N\!\left(0,\, \frac{1}{4\mu} \cdot \mu\right) = N\!\left(0, \frac{1}{4}\right).
\]
This is the variance-stabilizing transformation for the Poisson: the asymptotic variance \( 1/4 \) does not depend on \( \mu \).
Delta method for the Exponential. Let \( X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Exponential}(\theta) \). Then \( E[X_i] = \theta \), \( \text{Var}(X_i) = \theta^2 \), and \( \sqrt{n}(\bar{X}_n - \theta) \xrightarrow{d} N(0, \theta^2) \). Taking \( g(x) = \ln x \), so that \( g'(\theta) = 1/\theta \), the delta method gives
\[
\sqrt{n}(\ln\bar{X}_n - \ln\theta) \xrightarrow{d} N\!\left(0,\, \frac{1}{\theta^2}\cdot\theta^2\right) = N(0, 1).
\]
This is the variance-stabilizing transformation for the Exponential: the asymptotic variance does not depend on \( \theta \).
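A simulation check of this Exponential result (illustrative; \( \theta = 2 \), the sample size, and the replication count are arbitrary choices):

```python
import math
import random

random.seed(3)  # reproducibility

theta, n, reps = 2.0, 400, 3000

# sqrt(n) * (ln Xbar_n - ln theta) should be approximately N(0, 1),
# whatever the value of theta (variance-stabilizing property of the log).
# random.expovariate takes a rate, so rate 1/theta gives mean theta.
zs = []
for _ in range(reps):
    xbar = sum(random.expovariate(1 / theta) for _ in range(n)) / n
    zs.append(math.sqrt(n) * (math.log(xbar) - math.log(theta)))

mean_z = sum(zs) / reps
var_z = sum(z ** 2 for z in zs) / reps - mean_z ** 2
print(round(mean_z, 3), round(var_z, 3))
```

The simulated mean lands near 0 and the variance near 1, matching the \( N(0,1) \) limit.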
4.7 Normal Approximations
The CLT enables us to approximate probabilities for means and sums of large samples. If \( X_1, \ldots, X_n \) are iid with mean \( \mu \) and variance \( \sigma^2 \), and \( T_n = \sum_{i=1}^n X_i \) denotes the sample total, then for large \( n \):
\[
P\!\left(\bar{X}_n \leq x\right) \approx \Phi\!\left(\frac{x - \mu}{\sigma/\sqrt{n}}\right), \qquad
P(T_n \leq t) \approx \Phi\!\left(\frac{t - n\mu}{\sigma\sqrt{n}}\right),
\]
where \( \Phi \) is the standard normal CDF.
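These approximations can be computed with \( \Phi(z) = \tfrac{1}{2}\bigl[1 + \operatorname{erf}(z/\sqrt{2})\bigr] \). The population and the numbers below are illustrative choices, not from the notes:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Illustrative setup: X_i ~ Uniform(0, 1), so mu = 0.5 and sigma^2 = 1/12.
mu, sigma, n = 0.5, math.sqrt(1 / 12), 48

# P(Xbar_48 <= 0.55): sigma / sqrt(n) = 1/24, so z = 0.05 * 24 = 1.2.
p_mean = phi((0.55 - mu) / (sigma / math.sqrt(n)))

# P(T_48 <= 26.4) with T_n the sample total: the same z = 1.2 by algebra.
p_sum = phi((26.4 - n * mu) / (sigma * math.sqrt(n)))
print(round(p_mean, 4), round(p_sum, 4))
```

Both probabilities equal \( \Phi(1.2) \approx 0.885 \), illustrating that the mean and total versions of the approximation are the same statement rescaled.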
Chapter 5: Point Estimation
5.1 Introduction and Basic Concepts
Suppose \( X_1, \ldots, X_n \) are iid random variables from a distribution with PDF (or PMF) \( f(x; \boldsymbol{\theta}) \), where \( \boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)^\top \) is an unknown parameter vector in the parameter space \( \Theta \). The goal of point estimation is to use the observed data to produce a single “best guess” for \( \boldsymbol{\theta} \).
Statistic. A statistic is any function \( T = T(X_1, \ldots, X_n) \) of the data that does not depend on any unknown parameter.
Estimator and Estimate. If a statistic \( \hat{\theta} = T(X_1, \ldots, X_n) \) is used to estimate \( \theta \), it is called an estimator of \( \theta \). When evaluated at observed data \( (x_1, \ldots, x_n) \), the value \( \hat{\theta}(x_1, \ldots, x_n) \) is called an estimate.
Bias and Unbiasedness. An estimator \( \hat{\theta} \) is unbiased for \( \theta \) if \( E[\hat{\theta}] = \theta \) for all \( \theta \in \Theta \). The bias is \( \text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta \).
5.2 Method of Moments
Method of Moments (MM) Estimator. The procedure equates population moments to sample moments and solves for the parameters.
- Population moments: \( \mu_j = E[X^j] = \mu_j(\boldsymbol{\theta}) \) for \( j = 1, \ldots, k \).
- Sample moments: \( \hat{\mu}_j = \frac{1}{n}\sum_{i=1}^n X_i^j \).
- Solve the system \( \mu_j(\hat{\boldsymbol{\theta}}) = \hat{\mu}_j \) for \( j = 1, \ldots, k \) to obtain \( \hat{\boldsymbol{\theta}}_{\text{MM}} \).
MM for the Exponential. If \( X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \text{Exponential}(\theta) \), then \( \mu_1 = E[X] = \theta \). Setting \( \hat{\theta} = \hat{\mu}_1 = \bar{X}_n \) gives the MM estimator \( \hat{\theta}_{\text{MM}} = \bar{X}_n \).
MM for the Uniform. If \( X_i \stackrel{\text{iid}}{\sim} \text{Uniform}(0, \theta) \), then \( \mu_1 = \theta/2 \). Setting \( \hat{\theta}/2 = \bar{X}_n \) gives \( \hat{\theta}_{\text{MM}} = 2\bar{X}_n \).
MM for the Normal. If \( X_i \stackrel{\text{iid}}{\sim} N(\mu, \sigma^2) \), we have two parameters, so we use two moment equations: \( \mu_1(\mu, \sigma^2) = \mu \) and \( \mu_2(\mu, \sigma^2) = \mu^2 + \sigma^2 \). Solving \( \mu = \hat{\mu}_1 \) and \( \mu^2 + \sigma^2 = \hat{\mu}_2 \) gives
\[
\hat{\mu}_{\text{MM}} = \bar{X}_n, \qquad \hat{\sigma}^2_{\text{MM}} = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2.
\]
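A small worked example (the data values are made up) showing that the two-moment solution reduces to plugging sample moments into these formulas:

```python
# Hypothetical data for illustration.
xs = [2.1, 3.4, 1.9, 2.8, 3.0, 2.4]
n = len(xs)

m1 = sum(xs) / n                  # first sample moment (sample mean)
m2 = sum(x * x for x in xs) / n   # second sample moment

mu_mm = m1
sigma2_mm = m2 - m1 ** 2          # MM variance estimate

# Algebraic identity: m2 - m1^2 = (1/n) * sum((x - xbar)^2).
sigma2_direct = sum((x - m1) ** 2 for x in xs) / n
print(mu_mm, round(sigma2_mm, 4))
```

Here \( \hat{\mu}_{\text{MM}} = 2.6 \) and \( \hat{\sigma}^2_{\text{MM}} = 0.27 \), and the two forms of the variance estimate agree, as the identity guarantees.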
5.3 Maximum Likelihood Estimation
Likelihood Function. Given iid observations from \( f(x; \theta) \), the likelihood function is
\[
L(\theta) = L(\theta;\, x_1, \ldots, x_n) = \prod_{i=1}^n f(x_i; \theta).
\]
The log-likelihood is \( \ell(\theta) = \ln L(\theta) = \sum_{i=1}^n \ln f(x_i; \theta) \).
Maximum Likelihood Estimator (MLE). The MLE is the value \( \hat{\theta} \) that maximizes \( L(\theta) \) (equivalently, \( \ell(\theta) \)):
\[
\hat{\theta} = \arg\max_{\theta \in \Theta}\, L(\theta) = \arg\max_{\theta \in \Theta}\, \ell(\theta).
\]
Invariance Property of the MLE. If \( \hat{\theta} \) is the MLE of \( \theta \), then for any function \( \tau(\theta) \), the MLE of \( \tau(\theta) \) is \( \tau(\hat{\theta}) \).
MLE for the Poisson. If \( X_i \stackrel{\text{iid}}{\sim} \text{Poisson}(\theta) \):
\[
\ell(\theta) = \left(\sum x_i\right)\ln\theta - n\theta - \sum\ln(x_i!).
\]
Setting \( \ell'(\theta) = \frac{\sum x_i}{\theta} - n = 0 \) gives \( \hat{\theta}_{\text{ML}} = \bar{X}_n \), the same as the MM estimator.
By invariance, the MLE of \( P(X_1 = 0) = e^{-\theta} \) is \( e^{-\bar{X}_n} \).
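A tiny numeric illustration of the Poisson MLE and the invariance property (the counts are made up):

```python
import math

# Hypothetical Poisson counts for illustration.
xs = [3, 1, 4, 2, 2, 0, 5, 3]
n = len(xs)

theta_ml = sum(xs) / n        # MLE of theta is the sample mean
p0_ml = math.exp(-theta_ml)   # by invariance, MLE of P(X = 0) = e^{-theta}
print(theta_ml, round(p0_ml, 4))
```

No separate optimization is needed for \( e^{-\theta} \): invariance lets us plug \( \hat{\theta} \) straight into the function.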
MLE for the Normal. If \( X_i \stackrel{\text{iid}}{\sim} N(\mu, \sigma^2) \):
\[
\ell(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2.
\]
Setting partial derivatives to zero yields:
\[
\hat{\mu}_{\text{ML}} = \bar{X}_n, \qquad \hat{\sigma}^2_{\text{ML}} = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2.
\]
Note that \( \hat{\sigma}^2_{\text{ML}} \) is biased (its expectation is \( \frac{n-1}{n}\sigma^2 \)), unlike the unbiased estimator \( S^2 = \frac{1}{n-1}\sum(X_i - \bar{X}_n)^2 \).
MLE for the Uniform (boundary case). If \( X_i \stackrel{\text{iid}}{\sim} \text{Uniform}(0, \theta) \), the likelihood is
\[
L(\theta) = \begin{cases} \theta^{-n} & \text{if } 0 \leq x_{(1)} \leq x_{(n)} \leq \theta, \\ 0 & \text{otherwise}. \end{cases}
\]
For \( \theta \geq x_{(n)} \), \( L(\theta) = \theta^{-n} \) is strictly decreasing, so the maximum is at \( \hat{\theta}_{\text{ML}} = X_{(n)} = \max(X_1, \ldots, X_n) \).
This differs from \( \hat{\theta}_{\text{MM}} = 2\bar{X}_n \). The calculus-based approach of setting derivatives to zero does not apply here because the maximum occurs at the boundary of the parameter space.
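A numeric contrast between the two estimators (the observations are made up, chosen so the difference is stark):

```python
# Hypothetical Uniform(0, theta) observations for illustration.
xs = [0.2, 0.3, 0.4, 3.5, 0.6]

theta_mle = max(xs)               # likelihood is maximized at the boundary
theta_mm = 2 * sum(xs) / len(xs)  # method-of-moments estimate

# Here theta_mm = 2.0 < max(xs) = 3.5: the MM estimate is contradicted by
# the data (an observation exceeds it), while the MLE never can be.
print(theta_mle, theta_mm)
```

This illustrates a known defect of the MM estimator for this model: it can fall below the sample maximum, a parameter value under which the observed data are impossible.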
MLE for a power-law density. If \( f(x; \theta) = \theta x^{\theta-1} \) for \( 0 < x < 1 \) and \( \theta > 0 \):
\[
\ell(\theta) = n\ln\theta + (\theta - 1)\sum_{i=1}^n \ln x_i.
\]
Setting \( \ell'(\theta) = n/\theta + \sum\ln x_i = 0 \) gives:
\[
\hat{\theta}_{\text{ML}} = -\frac{n}{\sum_{i=1}^n \ln X_i}.
\]
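A short computation of this MLE on made-up data, with a check that the score vanishes at the estimate:

```python
import math

# Hypothetical observations in (0, 1) for illustration.
xs = [0.5, 0.8, 0.9, 0.3, 0.7]
n = len(xs)

sum_logs = sum(math.log(x) for x in xs)  # negative, since each x < 1
theta_ml = -n / sum_logs

# Sanity check: the score n/theta + sum(ln x_i) is zero at theta_ml,
# and ell''(theta) = -n/theta^2 < 0, so this is a maximum.
score_at_mle = n / theta_ml + sum_logs
print(round(theta_ml, 4), score_at_mle)
```

Since each \( \ln x_i < 0 \), the estimator is always positive, as the parameter space requires.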
5.4 Properties of the MLE
Score Function. The score function is
\[
S(\theta) = \frac{d}{d\theta}\ell(\theta) = \frac{d}{d\theta}\ln L(\theta).
\]
When the support does not depend on \( \theta \), the MLE satisfies \( S(\hat{\theta}) = 0 \).
Observed Information. The observed information is
\[
I(\theta) = -\frac{d^2}{d\theta^2}\ell(\theta).
\]
Fisher Information. The Fisher information (or expected information) is
\[
J(\theta) = E[I(\theta)] = -n\,E\!\left[\frac{d^2}{d\theta^2}\ln f(X_1; \theta)\right] = n\,J_1(\theta),
\]
where \( J_1(\theta) = -E\!\left[\frac{d^2}{d\theta^2}\ln f(X_1; \theta)\right] \) is the Fisher information from a single observation.
The Cramer-Rao Lower Bound
Cramer-Rao Lower Bound. If \( \hat{\theta} \) is any unbiased estimator of \( \theta \), then under regularity conditions:
\[
\text{Var}(\hat{\theta}) \geq \frac{1}{J(\theta)} = \frac{1}{n\,J_1(\theta)}.
\]
Cramer-Rao Bound for Functions. If \( T \) is an unbiased estimator of \( g(\theta) \), then
\[
\text{Var}(T) \geq \frac{[g'(\theta)]^2}{J(\theta)}.
\]
An estimator achieving the Cramer-Rao lower bound is called efficient (or a minimum variance unbiased estimator, MVUE).
Fisher information for Poisson. If \( X_i \stackrel{\text{iid}}{\sim} \text{Poisson}(\theta) \):
\[
\ln f(x; \theta) = x\ln\theta - \theta - \ln(x!), \quad \frac{d^2}{d\theta^2}\ln f(x;\theta) = -\frac{x}{\theta^2}.
\]
So \( J_1(\theta) = E[X]/\theta^2 = 1/\theta \), and \( J(\theta) = n/\theta \).
The MLE \( \hat{\theta} = \bar{X}_n \) has \( \text{Var}(\hat{\theta}) = \theta/n = 1/J(\theta) \), which equals the Cramer-Rao bound. Hence \( \bar{X}_n \) is an efficient estimator.
Asymptotic Properties of the MLE
Asymptotic Properties of the MLE. Under regularity conditions (support independent of \( \theta \), sufficient smoothness):
(1) Consistency: \( \hat{\theta} \xrightarrow{P} \theta \) as \( n \to \infty \).
(2) Asymptotic normality: \( \sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} N\!\left(0, \frac{1}{J_1(\theta)}\right) \).
(3) Asymptotic efficiency: The MLE achieves the Cramer-Rao lower bound asymptotically, meaning \( \text{Var}(\hat{\theta}) \approx \frac{1}{J(\theta)} \) for large \( n \).
(4) Delta method extension: \( \sqrt{n}[g(\hat{\theta}) - g(\theta)] \xrightarrow{d} N\!\left(0, \frac{[g'(\theta)]^2}{J_1(\theta)}\right) \).
Complete MLE analysis for Poisson. Let \( X_i \stackrel{\text{iid}}{\sim} \text{Poisson}(\theta) \). The MLE \( \hat{\theta} = \bar{X}_n \) is exactly unbiased (\( E[\bar{X}_n] = \theta \)), consistent, and efficient, as shown above. For \( g(\theta) = e^{-\theta} = P(X_1 = 0) \), the delta method with \( g'(\theta) = -e^{-\theta} \) gives
\[
\sqrt{n}(e^{-\bar{X}_n} - e^{-\theta}) \xrightarrow{d} N(0,\, e^{-2\theta}\theta).
\]
Note that \( e^{-\bar{X}_n} \) is biased for \( e^{-\theta} \) in finite samples, though it is asymptotically unbiased.
Chapter 6: Confidence Intervals and Hypothesis Testing
6.1 Pivotal Quantities and Confidence Intervals
Pivotal Quantity. A pivotal quantity is a function \( Q(X_1, \ldots, X_n; \theta) \) of the data and the parameter \( \theta \) whose distribution does not depend on any unknown parameters.
A pivotal quantity is the basis for constructing confidence intervals. The general procedure is:
- Find a pivotal quantity \( Q \) involving the parameter of interest.
- Determine constants \( a, b \) such that \( P(a \leq Q \leq b) = 1 - \alpha \).
- Rearrange the inequality to isolate \( \theta \), obtaining a \( 100(1-\alpha)\% \) confidence interval.
CI for a normal mean (known variance). If \( X_i \stackrel{\text{iid}}{\sim} N(\mu, \sigma^2) \) with \( \sigma^2 \) known, then
\[
Z = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)
\]
is a pivotal quantity. A \( 100(1-\alpha)\% \) confidence interval for \( \mu \) is
\[
\bar{X}_n \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}},
\]
where \( z_{\alpha/2} \) is the upper \( \alpha/2 \) quantile of \( N(0,1) \).
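A numeric sketch of the known-variance interval (the data and \( \sigma \) are made up; \( z_{0.025} \approx 1.96 \) is the standard 95% quantile):

```python
import math

# Hypothetical sample from N(mu, sigma^2) with sigma = 2 known.
xs = [4.8, 5.6, 6.1, 4.9, 5.3, 5.7, 6.4, 5.2]
n, sigma, z = len(xs), 2.0, 1.96

xbar = sum(xs) / n
half = z * sigma / math.sqrt(n)     # half-width z_{alpha/2} * sigma / sqrt(n)
ci = (xbar - half, xbar + half)
print(round(ci[0], 3), round(ci[1], 3))
```

The interval is centered at \( \bar{x} = 5.5 \) with half-width \( 1.96 \cdot 2/\sqrt{8} \approx 1.386 \); shrinking \( \sigma \) or growing \( n \) narrows it at rate \( 1/\sqrt{n} \).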
CI for a normal mean (unknown variance). If \( X_i \stackrel{\text{iid}}{\sim} N(\mu, \sigma^2) \) with \( \sigma^2 \) unknown, the pivot is
\[
T = \frac{\bar{X}_n - \mu}{S/\sqrt{n}} \sim t(n-1),
\]
giving the CI: \( \bar{X}_n \pm t_{\alpha/2, n-1}\,\frac{S}{\sqrt{n}} \).
CI for a normal variance. The pivot \( \frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1) \) yields
\[
\left(\frac{(n-1)S^2}{\chi^2_{\alpha/2, n-1}},\; \frac{(n-1)S^2}{\chi^2_{1-\alpha/2, n-1}}\right)
\]
as a \( 100(1-\alpha)\% \) confidence interval for \( \sigma^2 \).
Large-Sample Confidence Intervals via the CLT
By the asymptotic normality of the MLE (Section 5.4), for large \( n \)
\[
\hat{\theta} \;\dot{\sim}\; N\!\left(\theta,\, \frac{1}{nJ_1(\theta)}\right),
\]
so an approximate \( 100(1-\alpha)\% \) CI for \( \theta \) is \( \hat{\theta} \pm z_{\alpha/2}\,\sqrt{1/(nJ_1(\hat{\theta}))} \).
6.2 Hypothesis Testing Fundamentals
Hypothesis Test. A hypothesis test consists of:
- A null hypothesis \( H_0 \) and an alternative hypothesis \( H_1 \).
- A test statistic \( T(X_1, \ldots, X_n) \).
- A rejection region (critical region) \( C \): reject \( H_0 \) if \( T \in C \).
Types of Error.
- Type I error: rejecting \( H_0 \) when \( H_0 \) is true. Probability: \( \alpha = P(\text{reject } H_0 \mid H_0 \text{ true}) \).
- Type II error: failing to reject \( H_0 \) when \( H_1 \) is true. Probability: \( \beta = P(\text{fail to reject } H_0 \mid H_1 \text{ true}) \).
The significance level is \( \alpha \). The power of the test is \( 1 - \beta \).
Simple and Composite Hypotheses. A hypothesis is simple if it specifies the distribution completely (e.g., \( H_0: \theta = \theta_0 \)). It is composite if it specifies a set of distributions (e.g., \( H_1: \theta > \theta_0 \)).
6.3 The Neyman-Pearson Lemma
Neyman-Pearson Lemma. For testing the simple null \( H_0: \theta = \theta_0 \) against the simple alternative \( H_1: \theta = \theta_1 \) at significance level \( \alpha \), the most powerful test rejects \( H_0 \) when the likelihood ratio
\[
\Lambda = \frac{L(\theta_1)}{L(\theta_0)} = \frac{\prod_{i=1}^n f(x_i; \theta_1)}{\prod_{i=1}^n f(x_i; \theta_0)}
\]
exceeds a critical value \( k \), where \( k \) is chosen so that \( P(\Lambda > k \mid H_0) = \alpha \).
6.4 Likelihood Ratio Tests
For testing composite hypotheses, the generalized likelihood ratio is commonly used.
Likelihood Ratio Test Statistic. For testing \( H_0: \theta \in \Theta_0 \) versus \( H_1: \theta \in \Theta_0^c \), the likelihood ratio statistic is
\[
\lambda = \frac{\sup_{\theta \in \Theta_0} L(\theta)}{\sup_{\theta \in \Theta} L(\theta)} = \frac{L(\hat{\theta}_0)}{L(\hat{\theta})},
\]
where \( \hat{\theta}_0 \) is the MLE under \( H_0 \) and \( \hat{\theta} \) is the unrestricted MLE. Note \( 0 \leq \lambda \leq 1 \), and we reject \( H_0 \) for small values of \( \lambda \).
Wilks' Theorem (asymptotic distribution). Under regularity conditions and under \( H_0 \), as \( n \to \infty \):
\[
-2\ln\lambda \xrightarrow{d} \chi^2(r),
\]
where \( r = \dim(\Theta) - \dim(\Theta_0) \) is the number of restrictions imposed by \( H_0 \). This provides a large-sample test: reject \( H_0 \) if \( -2\ln\lambda > \chi^2_{\alpha, r} \).
Testing a normal mean. For \( X_i \stackrel{\text{iid}}{\sim} N(\mu, \sigma^2) \) with \( \sigma^2 \) known, testing \( H_0: \mu = \mu_0 \) vs. \( H_1: \mu \neq \mu_0 \):
Under \( H_0 \): \( L(\mu_0) \propto \exp\{-\frac{n(\bar{x} - \mu_0)^2}{2\sigma^2}\} \cdot \exp\{-\frac{\sum(x_i - \bar{x})^2}{2\sigma^2}\} \).
Under the full model: \( L(\hat{\mu}) \propto \exp\{-\frac{\sum(x_i - \bar{x})^2}{2\sigma^2}\} \).
The ratio simplifies to \( \lambda = \exp\{-\frac{n(\bar{x} - \mu_0)^2}{2\sigma^2}\} \), and \( -2\ln\lambda = \frac{n(\bar{x} - \mu_0)^2}{\sigma^2} = z^2 \), which has a \( \chi^2(1) \) distribution under \( H_0 \). Equivalently, the test rejects when \( |z| > z_{\alpha/2} \), consistent with the standard two-sided z-test.
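A numeric check (the sample numbers are made up) that \( -2\ln\lambda = z^2 \) and that both forms of the test give the same decision:

```python
import math

# Hypothetical numbers: n = 25, sigma = 3 known, testing H0: mu = 10.
n, sigma, mu0, xbar = 25, 3.0, 10.0, 11.4

stat = n * (xbar - mu0) ** 2 / sigma ** 2   # -2 ln(lambda)
z = (xbar - mu0) / (sigma / math.sqrt(n))   # equivalent z statistic

chi2_crit = 3.841   # chi^2_{0.05, 1}, which equals z_{0.025}^2 = 1.96^2
reject = stat > chi2_crit
print(round(stat, 3), round(z, 3), reject)
```

Here \( -2\ln\lambda = z^2 \approx 5.44 > 3.841 \), so the LRT and the two-sided z-test both reject \( H_0 \) at the 5% level, as the algebra guarantees they must.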
Relationship between CIs and tests. There is a duality between confidence intervals and hypothesis tests. A \( 100(1-\alpha)\% \) confidence interval consists of all parameter values \( \theta_0 \) for which the corresponding level-\( \alpha \) test would not reject \( H_0: \theta = \theta_0 \). Conversely, a level-\( \alpha \) test rejects \( H_0: \theta = \theta_0 \) if and only if \( \theta_0 \) lies outside the \( 100(1-\alpha)\% \) confidence interval.
6.5 Summary of Common Test Statistics
| Setting | Hypotheses | Test Statistic | Distribution under \( H_0 \) |
|---|---|---|---|
| Normal mean, \( \sigma^2 \) known | \( H_0: \mu = \mu_0 \) | \( Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \) | \( N(0,1) \) |
| Normal mean, \( \sigma^2 \) unknown | \( H_0: \mu = \mu_0 \) | \( T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \) | \( t(n-1) \) |
| Normal variance | \( H_0: \sigma^2 = \sigma_0^2 \) | \( \chi^2 = \frac{(n-1)S^2}{\sigma_0^2} \) | \( \chi^2(n-1) \) |
| Two normal variances | \( H_0: \sigma_1^2 = \sigma_2^2 \) | \( F = \frac{S_1^2}{S_2^2} \) | \( F(n_1-1, n_2-1) \) |
| Large-sample proportion | \( H_0: p = p_0 \) | \( Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} \) | \( N(0,1) \) approx |