STAT 340: Stochastic Simulation Methods
Erik Hintz
Estimated study time: 1 hr 10 min
Sources and References
- Textbooks: Ross, S. M. (2013), Simulation, 5th Edition; Law, A. M. (2014), Simulation Modeling and Analysis, 5th Edition; Ripley, B. D. (1987), Stochastic Simulation
- Advanced references: Lemieux, C. (2009), Monte Carlo and Quasi-Monte Carlo Sampling; Asmussen, S. and Glynn, P. W. (2007), Stochastic Simulation: Algorithms and Analysis; Devroye, L. (1986), Non-Uniform Random Variate Generation
- Online resources: MIT OCW simulation materials; Cambridge lecture notes on Monte Carlo methods
Chapter 1: Introduction to Simulation
1.1 What Is Simulation?
A simulation is an imitation of a real-world process or system over time. One builds a model – a set of assumptions about how a system operates, expressed as mathematical or logical relationships – and then uses a computer to generate artificial histories of the system. By observing these histories one draws inferences about the operating characteristics of the real system.
Three key terms recur throughout the course:
- System: a collection of entities that interact over time to accomplish one or more goals.
- Model: an abstract representation of a system, typically involving variables, parameters, and equations that capture essential behaviour.
- Simulation: the process of running a model on a computer to produce output that can be analysed statistically.
Simulation is particularly valuable when the system is too complex for analytical treatment, when closed-form solutions do not exist, or when real-world experimentation is prohibitively expensive or dangerous.
Deterministic vs. Stochastic Simulation
In a deterministic simulation all input quantities and relationships are fixed; running the model twice produces identical output. In a stochastic (or Monte Carlo) simulation at least some inputs are random, so each replication produces different output. Stochastic simulation is the focus of this course: it requires generating random numbers, transforming them into random variates from desired distributions, collecting output data, and performing statistical analysis on the results.
1.2 Overview of Monte Carlo Methods
Monte Carlo estimation targets a quantity expressible as an expectation, \(\theta = E[g(X)]\). Given iid replications \(X_1, \ldots, X_n\) of \(X\), the sample mean \[ \hat{\theta}_n = \frac{1}{n}\sum_{i=1}^{n} g(X_i) \] is an unbiased estimator of \(\theta\), and the strong law of large numbers guarantees \(\hat{\theta}_n \to \theta\) almost surely as \(n \to \infty\).
Monte Carlo methods appear throughout science and engineering: evaluating high-dimensional integrals in physics, pricing financial derivatives, simulating queueing networks, assessing system reliability, performing Bayesian inference, and solving partial differential equations.
1.3 Floating-Point Arithmetic
Since all Monte Carlo methods are implemented on digital computers, understanding how computers represent numbers is essential. Errors introduced by finite-precision arithmetic can accumulate and corrupt simulation results if one is not careful.
Binary Representation of Integers
Any non-negative integer \(n\) has a binary expansion \[ n = \sum_{k=0}^{K} b_k \, 2^k, \qquad b_k \in \{0,1\}. \] To convert a decimal integer to binary, one repeatedly divides by 2 and records the remainders from bottom to top.
Binary Representation of Fractions
Similarly, any fraction in \([0,1)\) has a binary expansion \[ 0.d_1 d_2 d_3 \cdots = \sum_{k=1}^{\infty} d_k \, 2^{-k}, \qquad d_k \in \{0,1\}. \] The procedure to convert a decimal fraction to binary is: multiply by 2, record the integer part as the next binary digit, keep the fractional part, and repeat.
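A minimal Python sketch of this multiply-and-record procedure (the helper name `frac_to_binary` is ours); note that many decimal fractions, such as 0.1, have no finite binary expansion:

```python
# Convert the fractional part of a decimal number to binary by repeated
# doubling. Stops after `bits` digits, since the expansion may be infinite.
def frac_to_binary(x, bits=10):
    digits = []
    for _ in range(bits):
        x *= 2
        d = int(x)          # the integer part becomes the next binary digit
        digits.append(str(d))
        x -= d              # keep only the fractional part
        if x == 0:
            break
    return "0." + "".join(digits)

print(frac_to_binary(0.625))  # 0.101, since 0.625 = 1/2 + 1/8
print(frac_to_binary(0.1))    # 0.0001100110..., no finite expansion
```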
IEEE 754 Double Precision
Modern computers store floating-point numbers according to the IEEE 754 standard. A double-precision (64-bit) number uses:
| Component | Bits | Purpose |
|---|---|---|
| Sign | 1 | 0 for positive, 1 for negative |
| Exponent | 11 | Biased exponent (bias = 1023) |
| Mantissa (significand) | 52 | Fractional part of the normalised form |
where the leading 1 before the binary point is implicit (the “hidden bit”), so the mantissa effectively has 53 bits of precision.
The range of representable doubles spans roughly \(2.2 \times 10^{-308}\) to \(1.8 \times 10^{308}\). Numbers smaller than the minimum are underflow and are typically rounded to zero; numbers larger than the maximum cause overflow and are set to infinity.
Rounding Errors and Their Consequences
Floating-point arithmetic is not exact. Two important phenomena arise:
Rounding error: when a real number is stored in finite precision, it is rounded to the nearest representable value. The relative error satisfies \(|\text{fl}(x) - x|/|x| \leq \varepsilon_{\text{mach}}/2\), where for doubles the machine epsilon is \(\varepsilon_{\text{mach}} = 2^{-52} \approx 2.2 \times 10^{-16}\).
Catastrophic cancellation: when two nearly equal numbers are subtracted, the leading significant digits cancel, and the result retains few meaningful digits.
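A small Python example of cancellation: for tiny \(x\), computing \(1 - \cos x\) directly loses all significant digits, while the algebraically equivalent form \(2\sin^2(x/2)\) does not.

```python
import math

# For x = 1e-8, cos(x) = 1 - 5e-17 + ... rounds to exactly 1.0 in double
# precision, so the naive difference is 0.0: every significant digit is lost.
x = 1e-8
naive = 1.0 - math.cos(x)          # catastrophic cancellation: gives 0.0
stable = 2.0 * math.sin(x / 2)**2  # same quantity, ~5e-17, full precision

print(naive, stable)
```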
For simulation work, these considerations matter in several ways: random number generators involve modular arithmetic on integers (where exact integer arithmetic is essential), cumulative sums over many iterations can drift, and likelihood computations for rare events may underflow.
1.4 Motivation and Applications
Stochastic simulation is applied across a vast range of disciplines:
- Finance: pricing exotic options and computing risk measures (Value-at-Risk, CVaR) when closed-form solutions are unavailable.
- Operations research: modelling queueing systems (call centres, hospital emergency departments, manufacturing lines) to estimate waiting times and throughput.
- Reliability engineering: estimating system failure probabilities by simulating component lifetimes.
- Bayesian statistics: Markov chain Monte Carlo (MCMC) methods for posterior inference.
- Physics: simulating particle transport, Ising models, and molecular dynamics.
The power of simulation lies in its generality: if one can describe the probabilistic mechanism that governs a system, one can simulate it.
Chapter 2: Pseudo-Random Number Generation
2.1 The Need for Random Numbers
Every stochastic simulation begins with a source of randomness. In principle one could use physical random devices (radioactive decay, thermal noise), but such true random number generators are slow and non-reproducible. Instead, simulation practitioners use pseudo-random number generators (PRNGs): deterministic algorithms that produce sequences of numbers which, for all practical purposes, appear to be independent realisations of a \(\text{Uniform}(0,1)\) distribution.
Properties of a Good PRNG
A high-quality PRNG should possess the following properties:
- Uniformity: the output values should closely approximate a uniform distribution on \((0,1)\).
- Independence: successive values should show no discernible correlation.
- Long period: the sequence should not repeat for a very long time. Since the generator has finitely many internal states, it must eventually cycle; the length of this cycle is the period.
- Reproducibility: given the same seed, the generator must produce the same sequence, enabling debugging and verification.
- Efficiency: generating each number should require minimal computation.
- Portability: the generator should produce identical results on different platforms.
2.2 Linear Congruential Generators
The most classical family of PRNGs is the linear congruential generator (LCG), introduced by D. H. Lehmer in 1951. Starting from a seed \(x_0\), it produces integers via the recurrence \[ x_{n+1} = (a\,x_n + c) \bmod m, \] with multiplier \(a\), increment \(c\), and modulus \(m\); the output \(u_n = x_n / m\) serves as an approximate \(\text{Uniform}(0,1)\) value.
When \(c = 0\) the generator is called a multiplicative congruential generator. When \(c \neq 0\) it is called a mixed congruential generator.
Modular Arithmetic Review
The operation \(a \bmod m\) returns the remainder when \(a\) is divided by \(m\). Formally, \(a \bmod m = a - m\lfloor a/m \rfloor\). Key properties:
- \((a + b) \bmod m = ((a \bmod m) + (b \bmod m)) \bmod m\)
- \((a \cdot b) \bmod m = ((a \bmod m) \cdot (b \bmod m)) \bmod m\)
These identities allow intermediate computations to be kept within range, preventing integer overflow.
Full-Period Conditions
By the Hull–Dobell theorem, a mixed LCG (\(c \neq 0\)) attains the full period \(m\) if and only if:
- \(\gcd(c, m) = 1\) (the increment and modulus are coprime).
- If \(p\) is a prime factor of \(m\), then \(p \mid (a - 1)\) (i.e., \(a - 1\) is divisible by every prime factor of \(m\)).
- If \(4 \mid m\), then \(4 \mid (a - 1)\).
For the multiplicative case (\(c = 0\)), the maximum achievable period is \(m - 1\) (since \(x_n = 0\) is an absorbing state). This maximum period is achieved when \(m\) is prime and \(a\) is a primitive root modulo \(m\), meaning the smallest positive integer \(k\) with \(a^k \equiv 1 \pmod{m}\) is \(k = m - 1\).
Choice of Parameters
Historically popular choices include:
| Generator | \(a\) | \(c\) | \(m\) | Period |
|---|---|---|---|---|
| RANDU (IBM) | 65539 | 0 | \(2^{31}\) | \(2^{29}\) |
| MINSTD (Park–Miller) | 16807 | 0 | \(2^{31} - 1\) | \(2^{31} - 2\) |
| Numerical Recipes | 1664525 | 1013904223 | \(2^{32}\) | \(2^{32}\) |
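As a sketch, the MINSTD parameters above can be run directly; Python's arbitrary-precision integers sidestep the overflow problem that the modular-arithmetic identities address in fixed-width languages like C:

```python
# Minimal LCG sketch with Park-Miller (MINSTD) defaults from the table above.
class LCG:
    def __init__(self, seed, a=16807, c=0, m=2**31 - 1):
        self.x, self.a, self.c, self.m = seed, a, c, m

    def next_uniform(self):
        self.x = (self.a * self.x + self.c) % self.m
        return self.x / self.m   # map the integer state to (0, 1)

gen = LCG(seed=1)
print([gen.next_uniform() for _ in range(3)])
```

From seed 1, the internal states are 16807, 282475249, 1622650073, the well-known opening of the MINSTD sequence.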
2.3 Lattice Structure and Spectral Test
Every LCG has an inherent lattice structure: the \(d\)-tuples \((u_i, u_{i+1}, \ldots, u_{i+d-1})\) lie on a finite number of parallel hyperplanes in the \(d\)-dimensional unit cube. The spectral test (introduced by Coveyou and MacPherson in 1967) measures the maximum distance between adjacent hyperplanes. A good generator has small inter-plane distances in all dimensions up to at least \(d = 6\).
2.4 The Mersenne Twister
The Mersenne Twister (MT19937), developed by Makoto Matsumoto and Takuji Nishimura in 1998, is the default PRNG in many software packages including R, Python, and MATLAB. Its key properties are:
- Period of \(2^{19937} - 1\), a Mersenne prime (hence the name).
- Equidistributed in 623 consecutive dimensions up to 32-bit accuracy.
- Uses a 624-element array of 32-bit integers as its internal state.
- Passes most standard statistical tests for randomness.
The algorithm works with a linear recurrence over \(\mathbb{F}_2\) (the field with two elements). At each step it combines three state words using bitwise XOR and shift operations, then applies a tempering transformation to improve the statistical properties of the output.
2.5 Testing Randomness
No deterministic generator is truly random, so we subject PRNGs to a battery of statistical tests. The null hypothesis is always that the generated values are iid \(\text{Uniform}(0,1)\).
Chi-Squared Goodness-of-Fit Test
Partition \([0,1)\) into \(k\) equal cells and count the number \(O_j\) of the \(n\) generated values falling in cell \(j\); the expected count is \(E_j = n/k\). The test statistic is \[ \chi^2 = \sum_{j=1}^{k} \frac{(O_j - E_j)^2}{E_j}. \] Under the null hypothesis, \(\chi^2 \sim \chi^2_{k-1}\) approximately for large \(n\). We reject uniformity if \(\chi^2\) is too large (or, for the purpose of testing PRNGs, also if it is suspiciously small, which indicates too-perfect regularity).
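A sketch of the statistic in Python; the only ingredients beyond the formula above are the binning rule and the \(\chi^2_9\) critical value (16.92 at the 5% level for \(k = 10\)):

```python
import random

# Chi-squared uniformity statistic: bin n values into k equal cells and
# compare observed counts to the expected n/k. In practice the statistic
# is compared to a chi-squared(k-1) quantile, e.g. 16.92 for k=10 at 5%.
def chi2_uniformity(us, k=10):
    n = len(us)
    observed = [0] * k
    for u in us:
        observed[min(int(u * k), k - 1)] += 1
    expected = n / k
    return sum((o - expected)**2 / expected for o in observed)

random.seed(42)
stat = chi2_uniformity([random.random() for _ in range(10_000)], k=10)
print(round(stat, 2))
```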
Serial Test (Pairs Test)
The serial test checks for independence between consecutive values. Partition \([0,1)^2\) into \(k^2\) cells and count how many of the pairs \((u_i, u_{i+1})\) fall into each cell. Apply a chi-squared test with \(k^2 - 1\) degrees of freedom.
Runs Test
A run is a maximal monotone stretch of the sequence. For the runs-up-and-down test, the total number of runs \(R\) among \(n\) values satisfies, under the null hypothesis, \[ E[R] = \frac{2n - 1}{3}, \qquad \text{Var}(R) = \frac{16n - 29}{90}. \] The standardised statistic \(Z = (R - E[R])/\sqrt{\text{Var}(R)}\) is approximately \(N(0,1)\).
The Spectral Test
The spectral test (Coveyou and MacPherson, 1967) is the definitive test for evaluating the global structure of an LCG. Since all points \((u_n, u_{n+1}, \ldots, u_{n+s-1})\) generated by an LCG lie on a family of parallel hyperplanes in \([0,1)^s\), the spectral test measures the maximum distance \(\nu_s\) between adjacent hyperplanes.
A commonly cited benchmark is \(\nu_s \leq m^{-1/s}/2\), which corresponds to a grid spacing comparable to what one would expect from a good uniform distribution in dimension \(s\). The figure of merit \(\mu_s = \nu_s \cdot m^{1/s}\) is normalised so that \(\mu_s \approx 1\) for a good generator.
The spectral test reveals a fundamental limitation of LCGs: in dimension \(s\), the generated vectors lie on at most \((s!\,m)^{1/s}\) hyperplanes (Marsaglia's theorem), a bound of order \(m^{1/s}\). Even for a 64-bit modulus \(m \approx 2^{64}\), this count shrinks rapidly as \(s\) grows and quickly becomes inadequate. This is one reason the Mersenne Twister, which is equidistributed in up to 623 dimensions, is preferred for serious simulation work.
TestU01 and Modern Test Suites
L’Ecuyer and Simard’s TestU01 package provides over 80 statistical tests organised into batteries:
- SmallCrush (10 tests, \(\approx 4\) seconds): a quick screen for obvious flaws.
- Crush (96 tests, \(\approx 1.5\) hours): a thorough examination.
- BigCrush (106 tests, \(\approx 6\) hours): the most demanding battery.
Modern generators such as PCG and xorshift\({}^{**}\) pass all BigCrush tests; the Mersenne Twister passes all but a handful of linear-complexity tests, a consequence of its purely \(\mathbb{F}_2\)-linear structure. LCGs with small moduli routinely fail Crush. TestU01 is the current gold standard for PRNG evaluation.
Other Tests
Additional tests used in comprehensive test suites include:
- Gap test: examines the lengths of gaps between successive values falling in a specified subinterval.
- Poker test: groups digits into “hands” and checks their frequency.
- Birthday spacings test: looks at the distribution of spacings between sorted output values.
- Kolmogorov–Smirnov test: compares the empirical CDF of the generated values to the uniform CDF.
Chapter 3: Random Variate Generation
3.1 The Inverse Transform Method
The inverse transform method is the most fundamental technique for generating random variates from non-uniform distributions. It rests on a simple but powerful result: if \(F\) is a CDF with generalised inverse \(F^{-1}(u) = \inf\{x : F(x) \geq u\}\) and \(U \sim \text{Uniform}(0,1)\), then \(X = F^{-1}(U)\) has CDF \(F\).
Algorithm for Continuous Distributions
To generate a random variate \(X\) with continuous CDF \(F\):
- Generate \(U \sim \text{Uniform}(0,1)\).
- Return \(X = F^{-1}(U)\).
Inverse Transform for Discrete Distributions
For a discrete random variable \(X\) taking values \(x_1 < x_2 < x_3 < \cdots\) with PMF \(p(x_i)\) and CDF \(F(x_i) = \sum_{j \leq i} p(x_j)\):
- Generate \(U \sim \text{Uniform}(0,1)\).
- Return \(X = x_i\) where \(i\) is the smallest index such that \(U \leq F(x_i)\).
Equivalently, set \(X = x_i\) if \(F(x_{i-1}) < U \leq F(x_i)\), with the convention \(F(x_0) = 0\).
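Both algorithms in miniature (Python sketches; the exponential case uses \(F^{-1}(u) = -\ln(1-u)/\lambda\)):

```python
import math
import random

# Continuous inverse transform: Exp(rate) via F^{-1}(u) = -ln(1-u)/rate.
def exp_variate(rate, u=None):
    u = random.random() if u is None else u
    return -math.log(1.0 - u) / rate

# Discrete inverse transform: linear search for the smallest i with U <= F(x_i).
def discrete_variate(xs, probs, u=None):
    u = random.random() if u is None else u
    cum = 0.0
    for x, p in zip(xs, probs):
        cum += p                  # cum = F(x_i)
        if u <= cum:
            return x
    return xs[-1]                 # guard against rounding in the last cell

print(exp_variate(2.0, u=0.5))    # median of Exp(2): ln(2)/2 ~ 0.3466
print(discrete_variate([1, 2, 3], [0.2, 0.5, 0.3], u=0.6))  # 2: 0.2 < 0.6 <= 0.7
```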
3.2 The Acceptance-Rejection Method
When the inverse CDF is not available in closed form (or is expensive to compute), the acceptance-rejection (AR) method provides an elegant alternative. It was introduced by John von Neumann in 1951. Suppose the target density \(f\) satisfies \(f(x) \leq c\,g(x)\) for all \(x\), where \(g\) is a density that is easy to sample from and \(c \geq 1\) is a constant. The algorithm is:
- Generate \(Y\) from density \(g\).
- Generate \(U \sim \text{Uniform}(0,1)\), independent of \(Y\).
- If \(U \leq f(Y)/(c\, g(Y))\), accept \(X = Y\). Otherwise, return to step 1.
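A sketch for the illustrative target \(f(x) = 6x(1-x)\) (the Beta(2,2) density), using the uniform proposal \(g \equiv 1\) and \(c = 1.5 = \sup_x f(x)/g(x)\), so the acceptance probability is \(1/c = 2/3\):

```python
import random

# Acceptance-rejection for f(x) = 6x(1-x) with Uniform(0,1) proposal, c = 1.5.
def beta22():
    while True:
        y = random.random()              # step 1: Y ~ g
        u = random.random()              # step 2: U ~ Uniform(0,1), independent
        if u <= 6 * y * (1 - y) / 1.5:   # step 3: accept w.p. f(Y) / (c g(Y))
            return y

random.seed(0)
xs = [beta22() for _ in range(50_000)]
print(round(sum(xs) / len(xs), 3))       # sample mean near E[X] = 0.5
```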
3.3 Composition and Table Look-Up Methods
Composition (Mixture) Method
Suppose the target density can be written as a mixture \[ f(x) = \sum_{j=1}^{k} p_j \, f_j(x), \qquad p_j \geq 0, \quad \sum_{j=1}^{k} p_j = 1, \] where each component density \(f_j\) is easy to sample from. Then:
- Generate an index \(J\) with \(P(J = j) = p_j\).
- Generate \(X\) from density \(f_J\).
The resulting \(X\) has density \(f\).
Table Look-Up (Alias Method)
For discrete distributions with finitely many outcomes, the alias method (due to A. J. Walker, 1977) allows sampling in \(O(1)\) time after an \(O(k)\) preprocessing step, where \(k\) is the number of possible values.
The idea is to construct a table with \(k\) entries, each containing a probability \(q_j\) and an “alias” index \(a_j\). To generate a variate:
- Generate an integer \(J\) uniformly from \(\{1, 2, \ldots, k\}\).
- Generate \(U \sim \text{Uniform}(0,1)\).
- If \(U \leq q_J\), return \(J\); otherwise return \(a_J\).
The preprocessing step constructs the table so that the correct PMF is reproduced. This method is particularly valuable when many variates from the same distribution are needed.
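A sketch of Walker's construction and the three-step sampler above; the pairing of "small" cells (scaled probability below 1) with "large" ones is the standard bookkeeping:

```python
import random

# Walker's alias method: O(k) table construction, O(1) sampling.
def build_alias(probs):
    k = len(probs)
    q = [p * k for p in probs]                  # scaled probabilities, mean 1
    alias = [0] * k
    small = [i for i, x in enumerate(q) if x < 1.0]
    large = [i for i, x in enumerate(q) if x >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                            # cell s tops up from cell l
        q[l] -= 1.0 - q[s]
        (small if q[l] < 1.0 else large).append(l)
    return q, alias

def alias_sample(q, alias):
    j = random.randrange(len(q))                # step 1: uniform column
    return j if random.random() <= q[j] else alias[j]   # steps 2-3

q, alias = build_alias([0.1, 0.2, 0.3, 0.4])
random.seed(1)
counts = [0] * 4
for _ in range(100_000):
    counts[alias_sample(q, alias)] += 1
print([c / 100_000 for c in counts])            # close to [0.1, 0.2, 0.3, 0.4]
```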
3.4 Generating from Specific Distributions
Normal Distribution
Several specialised methods exist for generating standard normal variates.
Box–Muller method: if \(U_1, U_2 \sim \text{Uniform}(0,1)\) are independent, then \[ Z_1 = \sqrt{-2\ln U_1}\,\cos(2\pi U_2), \qquad Z_2 = \sqrt{-2\ln U_1}\,\sin(2\pi U_2) \]are independent \(N(0,1)\) random variables.
The polar (Marsaglia) method avoids the trigonometric evaluations by sampling a point uniformly on the unit disc via rejection from the square \([-1,1]^2\); the acceptance probability is \(\pi/4 \approx 0.785\).
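The Box–Muller pair above, as a Python sketch:

```python
import math
import random

# Box-Muller: two independent uniforms give two independent N(0,1) variates.
def box_muller(rng):
    u1 = 1.0 - rng.random()   # in (0, 1]: avoids log(0)
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

rng = random.Random(7)
zs = [z for _ in range(50_000) for z in box_muller(rng)]
mean = sum(zs) / len(zs)
var = sum(z * z for z in zs) / len(zs) - mean**2
print(round(mean, 3), round(var, 3))   # near 0 and 1
```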
Gamma Distribution
For \(X \sim \text{Gamma}(\alpha, 1)\) with integer \(\alpha\), the sum-of-exponentials method works: \(X = -\sum_{i=1}^{\alpha} \ln U_i\). For non-integer \(\alpha > 1\), acceptance-rejection methods are used. A widely-used algorithm is due to Ahrens and Dieter (1974).
For \(\alpha < 1\), one can use the identity: if \(Y \sim \text{Gamma}(\alpha + 1, 1)\) and \(U \sim \text{Uniform}(0,1)\) are independent, then \(X = Y \cdot U^{1/\alpha} \sim \text{Gamma}(\alpha, 1)\).
Poisson Distribution
For \(X \sim \text{Poisson}(\lambda)\), the inverse transform via the CDF is natural: starting from \(p = e^{-\lambda}\), \(F = p\), and \(x = 0\), generate \(U\); while \(U > F\), update \(p \leftarrow p \cdot \lambda / (x+1)\), \(F \leftarrow F + p\), and \(x \leftarrow x + 1\). Return \(x\). This is efficient for small \(\lambda\) but slow for large \(\lambda\), where one switches to acceptance-rejection or normal approximation methods.
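The loop just described, transcribed to Python:

```python
import math
import random

# Inverse transform for Poisson(lam) via the recurrence p_{x+1} = p_x * lam/(x+1).
def poisson_variate(lam, rng):
    p = f = math.exp(-lam)    # p = P(X = 0), f = running CDF
    x, u = 0, rng.random()
    while u > f:
        x += 1
        p *= lam / x          # P(X = x) from P(X = x - 1)
        f += p
    return x

rng = random.Random(0)
xs = [poisson_variate(4.0, rng) for _ in range(50_000)]
print(round(sum(xs) / len(xs), 2))   # sample mean near lam = 4
```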
3.5 Generating Multivariate Distributions
Multivariate Normal
To generate \(\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) in \(\mathbb{R}^d\):
- Compute the Cholesky decomposition \(\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^T\), where \(\mathbf{L}\) is lower triangular.
- Generate \(\mathbf{Z} = (Z_1, \ldots, Z_d)^T\) with each \(Z_i \sim N(0,1)\) independently.
- Set \(\mathbf{X} = \boldsymbol{\mu} + \mathbf{L}\mathbf{Z}\).
Then \(E[\mathbf{X}] = \boldsymbol{\mu}\) and \(\text{Cov}(\mathbf{X}) = \mathbf{L}\, E[\mathbf{Z}\mathbf{Z}^T]\, \mathbf{L}^T = \mathbf{L}\mathbf{I}\mathbf{L}^T = \boldsymbol{\Sigma}\).
Conditional Distribution Method
\[ f(x_1, \ldots, x_d) = f_1(x_1)\, f_2(x_2 \mid x_1)\, f_3(x_3 \mid x_1, x_2) \cdots f_d(x_d \mid x_1, \ldots, x_{d-1}). \]Sample \(X_1\) from its marginal, then \(X_2\) from its conditional given \(X_1\), and so on.
3.6 Copulas
A copula provides a way to model the dependence structure between random variables separately from their marginal distributions. This is invaluable in finance, insurance, and risk management.
Gaussian Copula
\[ C_R^{\text{Ga}}(u_1, \ldots, u_d) = \Phi_R(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_d)), \]where \(\Phi_R\) is the CDF of the standard multivariate normal with correlation \(\mathbf{R}\) and \(\Phi^{-1}\) is the standard normal quantile function.
To sample from a Gaussian copula with given marginals \(F_1, \ldots, F_d\):
- Generate \(\mathbf{Z} \sim N(\mathbf{0}, \mathbf{R})\) using the Cholesky method.
- Set \(U_i = \Phi(Z_i)\) for \(i = 1, \ldots, d\). The vector \((U_1, \ldots, U_d)\) has the Gaussian copula.
- Set \(X_i = F_i^{-1}(U_i)\) to obtain the desired marginals.
Student-\(t\) Copula
The \(t\) copula allows for heavier tails and stronger tail dependence than the Gaussian copula. If \(\mathbf{R}\) is a correlation matrix and \(\nu > 0\) is the degrees of freedom:
- Generate \(\mathbf{Z} \sim N(\mathbf{0}, \mathbf{R})\).
- Generate \(W \sim \chi^2_\nu / \nu\) independently.
- Set \(\mathbf{T} = \mathbf{Z}/\sqrt{W}\). Each \(T_i\) has a univariate \(t_\nu\) distribution.
- Set \(U_i = t_\nu(T_i)\) where \(t_\nu\) is the CDF of the \(t_\nu\) distribution.
- Set \(X_i = F_i^{-1}(U_i)\).
Archimedean Copulas
\[ C(u_1, \ldots, u_d) = \psi(\psi^{-1}(u_1) + \cdots + \psi^{-1}(u_d)), \]where \(\psi : [0, \infty) \to [0, 1]\) is a generator function satisfying \(\psi(0) = 1\), \(\psi(\infty) = 0\), and appropriate monotonicity conditions.
Common examples include:
| Family | Generator \(\psi(t)\) | Parameter |
|---|---|---|
| Clayton | \((1 + t)^{-1/\theta}\) | \(\theta > 0\) |
| Gumbel | \(\exp(-t^{1/\theta})\) | \(\theta \geq 1\) |
| Frank | \(-\frac{1}{\theta}\ln\!\left(1 + e^{-t}(e^{-\theta} - 1)\right)\) | \(\theta \neq 0\) |
Sampling from Archimedean copulas in two dimensions can be done using the conditional method: generate \(U_1 \sim \text{Uniform}(0,1)\), then generate \(U_2\) from the conditional distribution \(C(u_2 \mid u_1) = \partial C / \partial u_1\) evaluated at \(u_1\).
3.7 Simulating Stochastic Processes
Brownian Motion
A standard Brownian motion (Wiener process) \(\{W(t) : t \geq 0\}\) satisfies: (i) \(W(0) = 0\), (ii) the increments \(W(t) - W(s) \sim N(0, t-s)\) for \(0 \leq s < t\), and (iii) non-overlapping increments are independent.
To simulate \(W\) at times \(0 = t_0 < t_1 < \cdots < t_n\):
- For \(i = 1, \ldots, n\), generate \(Z_i \sim N(0,1)\) independently.
- Set \(W(t_i) = W(t_{i-1}) + \sqrt{t_i - t_{i-1}}\, Z_i\).
For equal spacing \(\Delta t = t_i - t_{i-1}\), this reduces to \(W(t_i) = W(t_{i-1}) + \sqrt{\Delta t}\, Z_i\).
Geometric Brownian motion, the standard stock-price model, is obtained from \(W\) by \[ S(t) = S(0)\,\exp\!\left((\mu - \sigma^2/2)\,t + \sigma\,W(t)\right), \] where \(\mu\) is the drift and \(\sigma\) is the volatility.
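The recursion for \(W\) on an equally spaced grid, as a Python sketch (a GBM path follows by applying the exponential transform to the simulated \(W\)):

```python
import math
import random

# Simulate a standard Brownian path on an equally spaced grid in [0, T]
# via W(t_i) = W(t_{i-1}) + sqrt(dt) * Z_i with Z_i ~ N(0,1).
def brownian_path(T=1.0, n=1000, rng=None):
    rng = rng or random.Random(0)
    dt = T / n
    w, path = 0.0, [0.0]
    for _ in range(n):
        w += math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(w)
    return path

path = brownian_path()
print(len(path), path[0])   # n + 1 grid points, starting at W(0) = 0
```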
Poisson Process
A Poisson process \(\{N(t) : t \geq 0\}\) with rate \(\lambda\) satisfies: (i) \(N(0) = 0\), (ii) independent increments, and (iii) \(N(t) - N(s) \sim \text{Poisson}(\lambda(t-s))\) for \(s < t\).
The key simulation insight is that the inter-arrival times \(T_1, T_2, \ldots\) are iid \(\text{Exp}(\lambda)\). To simulate the arrival times on \([0, T]\):
- Set \(t = 0\), \(n = 0\).
- Generate \(U \sim \text{Uniform}(0,1)\) and set \(t \leftarrow t - \ln(U)/\lambda\).
- If \(t > T\), stop. Otherwise, \(n \leftarrow n + 1\), record \(S_n = t\), and return to step 2.
For a non-homogeneous Poisson process with time-varying rate \(\lambda(t) \leq \lambda^*\), one can use thinning (Lewis and Shedler, 1979): generate arrivals from a homogeneous Poisson process at rate \(\lambda^*\), then independently accept each arrival at time \(t\) with probability \(\lambda(t)/\lambda^*\).
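Both the homogeneous algorithm and Lewis–Shedler thinning as Python sketches; the rate function \(\lambda(t) = 1 + \sin^2 t\) with bound \(\lambda^* = 2\) is our illustrative choice:

```python
import math
import random

# Homogeneous Poisson arrivals on [0, T] via iid Exp(lam) inter-arrival times.
def poisson_arrivals(lam, T, rng):
    t, times = 0.0, []
    while True:
        t += -math.log(1.0 - rng.random()) / lam   # add an Exp(lam) gap
        if t > T:
            return times
        times.append(t)

# Lewis-Shedler thinning: propose at rate lam_max, accept w.p. lam_t(t)/lam_max.
def thinned_arrivals(lam_t, lam_max, T, rng):
    return [t for t in poisson_arrivals(lam_max, T, rng)
            if rng.random() <= lam_t(t) / lam_max]

rng = random.Random(3)
hom = poisson_arrivals(2.0, 100.0, rng)
nonhom = thinned_arrivals(lambda t: 1.0 + math.sin(t)**2, 2.0, 100.0, rng)
print(len(hom), len(nonhom))   # around lam*T = 200 and around 150
```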
Chapter 4: Monte Carlo Estimation
4.1 Monte Carlo Estimators for Integrals
Suppose we wish to estimate \[ \theta = \int_a^b h(x)\, dx. \] Writing \(\theta = (b-a)\,E[h(U)]\) with \(U \sim \text{Uniform}(a,b)\) gives the crude Monte Carlo estimator \[ \hat{\theta}_n = \frac{b-a}{n}\sum_{i=1}^{n} h(U_i), \qquad U_i \sim \text{Uniform}(a,b) \text{ iid}. \] More generally, any integral written as \(\theta = E[g(X)]\) for a samplable distribution of \(X\) is estimated by \[ \hat{\theta}_n = \frac{1}{n}\sum_{i=1}^{n} g(X_i). \]
Multidimensional Integration
In \(s\) dimensions, the Monte Carlo estimator \[ \hat\theta = \frac{1}{n}\sum_{i=1}^n g(\boldsymbol{X}_i), \qquad \boldsymbol{X}_i \sim f_{\boldsymbol{X}}, \] achieves standard error \(\sigma/\sqrt{n}\) regardless of dimension, where \(\sigma^2 = \text{Var}(g(\boldsymbol{X}))\). For \(s = 10\), a product rule with even 10 points per dimension requires \(10^{10}\) function evaluations, while MC achieves a standard error of \(0.01\) with roughly \(10^4\,\sigma^2\) samples, independent of \(s\).
This advantage makes Monte Carlo the method of choice for integrals over more than 3–4 dimensions, which arise naturally in Bayesian posterior computations, high-dimensional finance models, and particle transport simulations.
Hit-or-Miss Estimator
An alternative approach for estimating \(\theta = \int_0^1 h(x)\, dx\) (assuming \(0 \leq h(x) \leq 1\)) uses the geometric interpretation:
- Generate \((U_1, U_2)\) uniformly on the unit square \([0,1]^2\).
- If \(U_2 \leq h(U_1)\), count a “hit.”
Then \(\hat{\theta}_n = (\text{number of hits})/n\) is an unbiased estimator because \(P(U_2 \leq h(U_1)) = \int_0^1 h(x)\, dx = \theta\).
4.2 Properties of Monte Carlo Estimators
Unbiasedness
\[ E[\hat{\theta}_n] = \frac{1}{n}\sum_{i=1}^{n} E[g(X_i)] = E[g(X)] = \theta. \]Consistency
By the strong law of large numbers, \(\hat{\theta}_n \to \theta\) almost surely as \(n \to \infty\). This means that with enough samples, the estimator converges to the true value.
Mean Squared Error
\[ \text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \text{Var}(\hat{\theta}) + (\text{Bias}(\hat{\theta}))^2. \]For the unbiased sample-mean estimator, \(\text{MSE}(\hat{\theta}_n) = \text{Var}(\hat{\theta}_n) = \sigma^2/n\), where \(\sigma^2 = \text{Var}(g(X))\).
Central Limit Theorem for MC Estimators
By the CLT, \(\sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} N(0, \sigma^2)\), so for large \(n\) the estimator is approximately \(N(\theta, \sigma^2/n)\). Replacing \(\sigma\) by the sample standard deviation \(S_n\) yields asymptotically valid confidence intervals.
Confidence Intervals
An approximate \(100(1-\alpha)\%\) confidence interval for \(\theta\) is \[ \hat{\theta}_n \pm z_{\alpha/2}\, \frac{S_n}{\sqrt{n}}, \] where \(z_{\alpha/2}\) is the \((1-\alpha/2)\)-quantile of the standard normal distribution. For a 95% confidence interval, \(z_{0.025} = 1.96\).
Determining Sample Size
To achieve a confidence-interval half-width of at most \(\varepsilon\), the sample size must satisfy \[ n \geq \left(\frac{z_{\alpha/2}\, S}{\varepsilon}\right)^2. \] In practice, one runs a pilot study to estimate \(S\), then determines the required total sample size.
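A sketch combining the point estimate and CI for the illustrative integrand \(g(u) = e^u\), whose true mean is \(e - 1\):

```python
import math
import random

# Crude MC estimate of theta = E[g(U)], U ~ Uniform(0,1), with a 95% CI
# built from the sample standard deviation.
def mc_with_ci(g, n, rng, z=1.96):
    ys = [g(rng.random()) for _ in range(n)]
    mean = sum(ys) / n
    s2 = sum((y - mean)**2 for y in ys) / (n - 1)   # sample variance
    half = z * math.sqrt(s2 / n)                    # CI half-width
    return mean, (mean - half, mean + half)

est, (lo, hi) = mc_with_ci(math.exp, 100_000, random.Random(0))
print(round(est, 4), round(hi - lo, 4))   # estimate near e - 1 ~ 1.7183
```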
4.3 Examples and Applications
Estimating \(\pi\)
Generate points \((U_{1i}, U_{2i})\) uniformly on the unit square; the fraction landing inside the quarter circle \(u_1^2 + u_2^2 \leq 1\) estimates its area \(\pi/4\), giving \[ \hat{\pi}_n = \frac{4}{n}\sum_{i=1}^{n} \mathbf{1}(U_{1i}^2 + U_{2i}^2 \leq 1). \] This is a hit-or-miss estimator.
Option Pricing
Under the Black–Scholes model, the price of a European call with strike \(K\) and maturity \(T\) is \[ C = e^{-rT}\, E[\max(S(T) - K, 0)], \] where the expectation is under the risk-neutral measure, with \(S(T) = S_0 e^{(r-\sigma^2/2)T + \sigma\sqrt{T}Z}\) and \(Z \sim N(0,1)\). The MC estimator is \[ \hat{C}_n = \frac{e^{-rT}}{n}\sum_{i=1}^{n}\max(S_0 e^{(r-\sigma^2/2)T + \sigma\sqrt{T}Z_i} - K,\, 0). \]
Queueing Systems
Consider a single-server queue with Poisson arrivals (rate \(\lambda\)) and exponential service times (rate \(\mu\)). While the \(M/M/1\) queue has a known steady-state distribution, more complex systems (e.g., \(G/G/k\) queues with priorities) lack analytical solutions and must be studied via simulation. Typical output measures include average waiting time, server utilisation, and maximum queue length.
Reliability
For a system of \(d\) components with independent random lifetimes \(T_1, \ldots, T_d\), the system lifetime \(T_{\text{sys}} = h(T_1, \ldots, T_d)\) depends on the system structure function \(h\). For complex configurations (mixtures of series and parallel), Monte Carlo simulation estimates the system reliability \(P(T_{\text{sys}} > t)\) by generating many independent replications of component lifetimes.
4.4 Quasi-Monte Carlo Methods
Standard Monte Carlo uses pseudo-random sequences which, while statistically uniform, contain random clusters and gaps. Quasi-Monte Carlo (QMC) methods replace the random sequence with a low-discrepancy sequence that fills the unit hypercube more evenly.
The star discrepancy \(D_n^*\) measures the worst-case deviation of the empirical distribution of the points from uniform over anchored boxes \([0, \boldsymbol{a}) \subseteq [0,1]^s\). For a pseudo-random sequence, \(D_n^* = O(\sqrt{\log\log n/n})\) almost surely (by the law of the iterated logarithm). Low-discrepancy sequences achieve the superior rate \(D_n^* = O((\log n)^s / n)\).
Halton Sequence
Write \(n\) in base \(b\) as \(n = \sum_k d_k b^k\); the radical inverse mirrors the digits about the radix point: \[ v_b(n) = \sum_k d_k b^{-(k+1)}. \] For \(s = 2\): use base 2 for dimension 1 and base 3 for dimension 2. The first 8 Halton points in 2D are far more evenly distributed than 8 pseudo-random points.
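The radical inverse as a Python sketch; pairing bases 2 and 3 yields the 2-D Halton points:

```python
# Radical inverse: peel off base-b digits of n and mirror them about the
# radix point, so digit d_k contributes d_k * b^-(k+1).
def radical_inverse(n, b):
    v, inv = 0.0, 1.0 / b
    while n > 0:
        n, d = divmod(n, b)
        v += d * inv
        inv /= b
    return v

halton_2d = [(radical_inverse(n, 2), radical_inverse(n, 3)) for n in range(1, 9)]
print(halton_2d[0])   # (0.5, 1/3): n = 1 is the digit string "1" in both bases
```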
Sobol’ Sequences
Sobol’ sequences are the most widely used low-discrepancy sequences in high-dimensional problems. They are constructed over the base-2 field using direction numbers that optimise equidistribution. Key properties:
- Dimension \(s\) up to several thousand is supported by precomputed direction number tables.
- They are scrambled in modern implementations (using random digital shifts) to enable unbiased estimation and valid confidence intervals.
- Convergence rate of the QMC estimator variance: \(O((\log n)^{2s}/n^2)\), compared to \(O(1/n)\) for standard MC.
QMC vs. MC: When to Use Each
| Feature | Standard MC | Quasi-MC (Sobol') |
|---|---|---|
| Convergence rate | \(O(n^{-1/2})\) | \(O(n^{-1}(\log n)^s)\) for smooth functions |
| Confidence intervals | Easy (CLT) | Requires scrambling |
| Dimension | Unlimited | Degrades for very high dimensions |
| Parallelism | Trivially parallel | Requires careful implementation |
QMC is most beneficial when the integrand is smooth and of low effective dimension. For rough integrands, rare event problems, or Markov chain simulations, standard MC or specialised techniques (such as importance sampling) are preferred.
Chapter 5: Variance Reduction Techniques
5.1 Motivation
The standard error of the Monte Carlo estimator decreases as \(O(n^{-1/2})\). To reduce the error by a factor of 10, one needs 100 times as many samples – which can be prohibitively expensive. Variance reduction techniques (VRTs) aim to produce estimators with smaller variance without increasing the number of replications, or equivalently, to achieve the same precision with fewer replications.
5.2 Antithetic Variates
The idea of antithetic variates is to introduce negative correlation between pairs of replications to reduce the variance of the average.
Since \(1 - U_i \sim \text{Uniform}(0,1)\) whenever \(U_i\) does, both halves of \[ \hat{\theta}_{\text{AV}} = \frac{1}{n}\sum_{i=1}^{n/2} \left[\, g(U_i) + g(1 - U_i) \,\right] \] are valid estimators of \(\theta\). If \(g\) is monotone, the two halves are negatively correlated and \(\hat{\theta}_{\text{AV}}\) has smaller variance than the crude estimator based on \(n\) independent uniforms.
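A sketch comparing crude and antithetic estimation of the illustrative integrand \(E[e^U] = e - 1\); since \(e^u\) is monotone, a substantial reduction is expected:

```python
import math
import random

# Compare the variance per uniform draw of crude vs. antithetic sampling
# for g(u) = exp(u). Antithetic uses n/2 negatively correlated pairs.
def crude(n, rng):
    return [math.exp(rng.random()) for _ in range(n)]

def antithetic(n, rng):
    out = []
    for _ in range(n // 2):
        u = rng.random()
        out.append(0.5 * (math.exp(u) + math.exp(1.0 - u)))  # one pair average
    return out

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m)**2 for x in xs) / (len(xs) - 1)

rng = random.Random(0)
v_c, v_a = var(crude(100_000, rng)), var(antithetic(100_000, rng))
print(round(v_c / (2 * v_a), 1))   # variance-reduction factor per uniform pair
```

For this integrand the theoretical reduction factor works out to roughly 30.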
5.3 Stratified Sampling
Stratified sampling partitions the sample space into non-overlapping strata and samples from each stratum separately, guaranteeing that each region of the space is represented proportionally.
Method
Partition \([0,1)\) into \(k\) equal strata \([(j-1)/k,\, j/k)\) and draw \(n_j\) points from stratum \(j\); the stratified estimator is \[ \hat{\theta}_{\text{str}} = \sum_{j=1}^{k} \frac{1}{k} \cdot \frac{1}{n_j}\sum_{i=1}^{n_j} g\!\left(\frac{j-1+U_{ji}}{k}\right), \] where \(U_{ji} \sim \text{Uniform}(0,1)\) independently.
With proportional allocation (\(n_j = n/k\) for all \(j\)), the total number of samples is \(n\) and the estimator simplifies.
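A sketch with \(k = 100\) equal strata and proportional allocation, again for the illustrative integrand \(g(u) = e^u\):

```python
import math
import random

# Stratified sampling of [0,1): one uniform draw per stratum per pass,
# i.e. proportional allocation with n = k * passes total samples.
def stratified_mean(g, k, passes, rng):
    total = 0.0
    for _ in range(passes):
        for j in range(k):
            total += g((j + rng.random()) / k)   # uniform on stratum j
    return total / (k * passes)

est = stratified_mean(math.exp, 100, 100, random.Random(0))
print(round(est, 5))   # close to e - 1 ~ 1.71828
```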
5.4 Importance Sampling
Importance sampling is arguably the most powerful and flexible variance reduction technique. The idea is to sample from a different distribution that places more probability mass in the “important” regions of the integrand.
Derivation
Let \(h\) be a proposal density with \(h(x) > 0\) wherever \(g(x) f(x) \neq 0\). Then \[ \theta = \int g(x) f(x)\, dx = \int g(x) \frac{f(x)}{h(x)} h(x)\, dx = E_h\!\left[g(X)\frac{f(X)}{h(X)}\right], \] leading to the estimator \[ \hat{\theta}_{\text{IS}} = \frac{1}{n}\sum_{i=1}^{n} g(X_i)\, \frac{f(X_i)}{h(X_i)}, \qquad X_i \sim h \text{ iid}. \] The ratio \(w(x) = f(x)/h(x)\) is called the likelihood ratio or importance weight.
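A sketch for the tail probability \(P(Z > 3)\), \(Z \sim N(0,1)\) (true value about \(1.35 \times 10^{-3}\)), using the shifted proposal \(h = N(3,1)\), our illustrative choice; the weight \(f(x)/h(x) = e^{9/2 - 3x}\) follows from the ratio of the two normal densities:

```python
import math
import random

# Importance sampling for theta = P(Z > 3): sample from N(3,1) so the
# "rare" region is common, and reweight by f(x)/h(x) = exp(4.5 - 3x).
def is_tail(n, rng):
    total = 0.0
    for _ in range(n):
        x = rng.gauss(3.0, 1.0)               # X ~ h = N(3, 1)
        if x > 3.0:
            total += math.exp(4.5 - 3.0 * x)  # indicator times likelihood ratio
    return total / n

est = is_tail(100_000, random.Random(0))
print(est)   # near 1.35e-3
```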
5.5 Control Variates
Suppose \(Y\) is an unbiased estimator of \(\theta\) and \(C\) is a correlated output with known mean \(\mu_C = E[C]\). The control-variate estimator is \[ Y^{(c)} = Y - \beta(C - \mu_C), \] for some constant \(\beta\). Since \(E[Y^{(c)}] = E[Y] = \theta\) regardless of \(\beta\), the estimator is unbiased. Its variance is minimised at \(\beta^* = \text{Cov}(Y, C)/\text{Var}(C)\), which reduces the variance by the factor \(1 - \rho_{YC}^2\).
In practice, \(\beta^*\) is unknown and must be estimated from the simulation output (e.g., using a pilot run or from the same data).
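A sketch estimating the illustrative target \(E[e^U]\) with control \(C = U\) (known mean \(\mu_C = 1/2\)), with \(\beta\) estimated from the same sample as described above:

```python
import math
import random

# Control variates for theta = E[exp(U)] with control C = U, mu_C = 0.5.
# beta is the sample regression coefficient Cov(Y, C) / Var(C).
def control_variate_mean(n, rng):
    us = [rng.random() for _ in range(n)]
    ys = [math.exp(u) for u in us]
    y_bar, c_bar = sum(ys) / n, sum(us) / n
    cov = sum((y - y_bar) * (u - c_bar) for y, u in zip(ys, us)) / (n - 1)
    var_c = sum((u - c_bar)**2 for u in us) / (n - 1)
    beta = cov / var_c
    return y_bar - beta * (c_bar - 0.5)   # Y_bar - beta (C_bar - mu_C)

est = control_variate_mean(100_000, random.Random(0))
print(round(est, 4))   # near e - 1 ~ 1.7183
```

Because \(e^U\) and \(U\) are almost perfectly correlated, the residual variance is a small fraction of the crude variance.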
5.6 Combined Methods
Variance reduction techniques can often be combined for even greater effect.
Antithetic + Control Variates
One can apply antithetic variates to generate negatively correlated pairs, and then apply a control variate adjustment to each pair. Care must be taken to estimate the optimal control coefficient appropriately.
Stratification + Importance Sampling
Different importance sampling densities can be used within different strata, optimising the proposal locally. This is particularly useful when the integrand has different characteristics in different regions of the domain.
5.7 Applications to Option Pricing
European Options
\[ \hat{C} = \frac{e^{-rT}}{n}\sum_{i=1}^{n} \max(S_0 e^{(r - \sigma^2/2)T + \sigma\sqrt{T}Z_i} - K,\, 0). \]Antithetic variates replace each \(Z_i\) with the pair \((Z_i, -Z_i)\). Since \(\max(S(T) - K, 0)\) is an increasing function of \(Z\), the antithetic pairs are negatively correlated.
Asian Options
An Asian call option has payoff \(\max(\bar{S} - K, 0)\) where \(\bar{S} = \frac{1}{m}\sum_{j=1}^{m} S(t_j)\) is the arithmetic average of the stock price at \(m\) monitoring dates. No closed-form price exists, so Monte Carlo is the standard tool. Variance reduction is essential: the geometric average Asian option (which has a closed-form price) serves as an excellent control variate because it is highly correlated with the arithmetic average version.
Barrier Options
A knock-out barrier option becomes worthless if the stock price crosses a barrier level during the option’s life. Simulating this requires tracking the entire price path, and importance sampling can shift the path distribution to make barrier-crossing events more frequent, improving efficiency for deep out-of-the-money barrier options.
5.8 Rare Event Simulation
Suppose we wish to estimate a small probability \(\theta = P(A) = E[\mathbf{1}_A]\). The relative error of the crude MC estimator is \[ \frac{\sqrt{\text{Var}(\mathbf{1}_A)}}{\theta\sqrt{n}} = \frac{\sqrt{\theta(1-\theta)}}{\theta\sqrt{n}} \approx \frac{1}{\sqrt{\theta \cdot n}}, \] which requires \(n = O(1/\theta)\) samples for a fixed relative error.
Importance sampling is the primary tool for rare event simulation. The idea is to change the probability measure so that the rare event becomes common, then correct via importance weights. For example, in estimating the probability of ruin in insurance, or the probability of buffer overflow in telecommunications, exponential tilting (also called exponential change of measure) shifts the distribution to make large losses more likely.
5.9 Conditional Monte Carlo
Conditional Monte Carlo reduces variance by analytically computing the conditional expectation over one or more random variables, thereby eliminating their contribution to the variance: by the law of total variance, \(\text{Var}(E[Y \mid Z]) = \text{Var}(Y) - E[\text{Var}(Y \mid Z)] \leq \text{Var}(Y)\), so replacing \(Y\) by \(E[Y \mid Z]\) (when it is computable) never increases variance.
5.10 Common Random Numbers
To compare two system configurations with outputs \(Y_1\) and \(Y_2\), we estimate \(E[Y_1 - Y_2]\). Driving both simulations with the same stream of uniforms makes the outputs positively correlated, and \[ \text{Var}(Y_1 - Y_2) = \text{Var}(Y_1) + \text{Var}(Y_2) - 2\,\text{Cov}(Y_1, Y_2). \] With CRN, \(\text{Cov}(Y_1, Y_2) > 0\), reducing the variance of the estimated difference. This technique does not reduce the variance of estimating either \(E[Y_1]\) or \(E[Y_2]\) alone, but it dramatically improves the comparison.
5.11 Summary of Variance Reduction Techniques
| Technique | Key Idea | When It Works Best | Typical Reduction |
|---|---|---|---|
| Antithetic variates | Negative correlation via \(U\) and \(1-U\) | Monotone integrands | 2–50x |
| Stratified sampling | Partition domain, sample each stratum | Integrand varies across strata | 2–20x |
| Importance sampling | Resample from better distribution | Rare events, peaked integrands | 10–10000x |
| Control variates | Subtract known-mean correlated variable | Highly correlated control available | 2–100x |
| Conditional MC | Integrate out some variables analytically | Conditional expectation is tractable | 2–50x |
| Common random numbers | Same random numbers for system comparison | Comparing alternatives | 2–20x |
Appendix: Key Algorithms and Results Summary
Random Variate Generation: Decision Tree
Choosing the right generation method depends on the form of the target density:
| Situation | Recommended Method |
|---|---|
| Invertible CDF in closed form | Inverse transform |
| Easy-to-sample density \(f_e\) with \(c\,f_e(x) \geq f(x)\) exists | Acceptance-rejection |
| Target is a mixture \(\sum_k p_k f_k\) | Composition (mixture) |
| Target is multivariate normal | Cholesky decomposition |
| Target involves a copula | Conditional simulation |
| Target is a stochastic process path | Sequential simulation (Euler, exact) |
Acceptance-Rejection: Choosing the Envelope
The acceptance probability is \(1/c\), where \(c = \sup_x f(x)/f_e(x)\); a smaller \(c\) means fewer proposals are rejected. The optimal envelope \(f_e^*(x) \propto f(x)\) would give \(c = 1\), but then \(f_e^* = f\), which is precisely the density we cannot sample from. In practice, \(f_e\) is chosen from a family of easily sampled distributions, and \(c\) is minimized by matching the shape of \(f\) as closely as possible.
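As a concrete instance, the sketch below (assuming numpy) samples the Beta(2,2) density \(f(x) = 6x(1-x)\) using a Uniform(0,1) envelope, for which \(c = \sup_x f(x) = 1.5\) and the acceptance probability is \(1/c = 2/3\):

```python
import numpy as np

rng = np.random.default_rng(0)

def beta22_rejection(n):
    # Target f(x) = 6x(1-x) on (0,1); envelope f_e = Uniform(0,1), c = 1.5
    c = 1.5
    out = []
    while len(out) < n:
        x = rng.random()              # proposal from the envelope
        u = rng.random()
        if u <= 6 * x * (1 - x) / c:  # accept with probability f(x) / (c * f_e(x))
            out.append(x)
    return np.array(out)

s = beta22_rejection(20_000)
print(s.mean())  # Beta(2,2) has mean 1/2
```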
Monte Carlo Error Summary
| Quantity | Formula |
|---|---|
| Variance of MC estimator | \(\sigma^2/n\) where \(\sigma^2 = \text{Var}(g(X))\) |
| Standard error | \(\sigma/\sqrt{n}\) |
| 95% CI half-width | \(1.96\,S_n/\sqrt{n}\) |
| Required sample size for half-width \(\varepsilon\) | \(n \geq (z_{\alpha/2}\,S/\varepsilon)^2\) |
| Convergence rate of standard error | \(O(n^{-1/2})\), independent of dimension |
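The last two rows combine into a simple planning calculation; the pilot standard deviation \(S\) and target half-width \(\varepsilon\) below are assumed values for illustration:

```python
from math import ceil

z = 1.96    # z_{alpha/2} for a 95% confidence interval
S = 2.0     # standard deviation from a pilot run (assumed)
eps = 0.1   # desired CI half-width

# Smallest n with z * S / sqrt(n) <= eps
n = ceil((z * S / eps) ** 2)
print(n)
```

Here \((1.96 \cdot 2.0 / 0.1)^2 = 39.2^2 = 1536.64\), so \(n = 1537\) replications suffice.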
Variance Reduction: Quantitative Benchmarks
The standard errors below come from a single illustrative pricing example at a fixed sample size; the effective factor in parentheses is the crude standard error divided by the method's standard error.
| Method | SE of estimate |
|---|---|
| Crude MC | 0.228 |
| Antithetic variates | 0.021 (effective factor 10.9) |
| Control variates (forward price) | 0.019 (effective factor 12.0) |
| Stratified sampling (10 strata) | 0.073 (effective factor 3.1) |
| Importance sampling (optimal shift) | 0.009 (effective factor 25.3) |
The best single technique is application-dependent. In practice, control variates and antithetic variates are the most widely used because they are easy to implement and provide substantial gains for smooth payoffs.
Notation
| Symbol | Meaning |
|---|---|
| \(U \sim \text{Uniform}(0,1)\) | Uniform random variate (the basic building block for all generation methods) |
| \(F^{-1}\) | Quantile (inverse CDF) function |
| \(c\) | Envelope constant in acceptance-rejection (\(c \geq \sup f/f_e\)) |
| \(\hat\theta_n\) | MC estimate after \(n\) replications |
| \(S_n^2\) | Sample variance of \(g(X_i)\) values |
| \(\rho\) | Correlation between control variate and estimator |
| \(\beta^*\) | Optimal control variate coefficient |
Markov Chain Monte Carlo: A Brief Introduction
The methods covered in this course generate independent random variates. In many statistical problems — particularly Bayesian inference — the target distribution is a high-dimensional density \(\pi(x)\) that cannot be directly sampled. Markov chain Monte Carlo (MCMC) methods construct a Markov chain \(X_0, X_1, X_2, \ldots\) with \(\pi\) as its stationary distribution; after a burn-in period, the chain values \(X_t\) approximately follow \(\pi\).
Metropolis–Hastings Algorithm
The Metropolis–Hastings (MH) algorithm is the most general MCMC method. Given the current state \(X_t = x\):
- Propose \(Y \sim q(y \mid x)\) from a proposal distribution \(q\).
- Compute the acceptance ratio \[ \alpha(x, y) = \min\!\left(1, \frac{\pi(y)\,q(x \mid y)}{\pi(x)\,q(y \mid x)}\right). \]
- With probability \(\alpha(x, y)\), set \(X_{t+1} = y\); otherwise set \(X_{t+1} = x\).
The chain satisfies detailed balance \(\pi(x) q(y \mid x) \alpha(x,y) = \pi(y)q(x \mid y)\alpha(y,x)\), which ensures \(\pi\) is stationary. A key advantage is that only the ratio \(\pi(y)/\pi(x)\) is needed, so any normalizing constant in \(\pi\) cancels.
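The algorithm above can be sketched in a few lines. This is a minimal random-walk Metropolis implementation, assuming numpy; the proposal \(N(x, \text{step}^2)\) is symmetric, so the \(q\)-ratio cancels and \(\alpha = \min(1, \pi(y)/\pi(x))\). The target is the unnormalized standard normal, and the step size is an assumption of the example:

```python
import numpy as np

rng = np.random.default_rng(3)

def rw_metropolis(log_pi, x0, n, step):
    # Random-walk Metropolis: the proposal N(x, step^2) is symmetric,
    # so the q-ratio cancels and alpha = min(1, pi(y)/pi(x))
    x = x0
    chain = np.empty(n)
    for t in range(n):
        y = x + step * rng.standard_normal()
        # Accept/reject on the log scale for numerical stability
        if np.log(rng.random()) < log_pi(y) - log_pi(x):
            x = y
        chain[t] = x
    return chain

# Target known only up to a constant: log pi(x) = -x^2/2 + const
chain = rw_metropolis(lambda x: -0.5 * x * x, 0.0, 50_000, step=2.4)
print(chain[1000:].mean(), chain[1000:].var())  # approximately 0 and 1
```

Note that only `log_pi` differences enter the acceptance test, illustrating that the normalizing constant of \(\pi\) is never needed.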
Gibbs Sampling
When \(\pi\) is a joint distribution of \((X_1, \ldots, X_d)\) and all full conditionals \(\pi(X_j \mid X_{-j})\) are available in closed form, Gibbs sampling cycles through coordinates, sampling each from its full conditional. Gibbs sampling is a special case of MH with acceptance probability 1.
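A sketch for a bivariate normal target with correlation \(\rho\), assuming numpy; each full conditional is \(X_1 \mid X_2 = x_2 \sim N(\rho x_2,\, 1 - \rho^2)\), and symmetrically for \(X_2\):

```python
import numpy as np

rng = np.random.default_rng(5)

def gibbs_bivariate_normal(rho, n):
    # Target: (X1, X2) bivariate normal, means 0, variances 1, correlation rho.
    # Full conditionals: X1 | X2 = x2 ~ N(rho * x2, 1 - rho^2), and symmetrically.
    s = np.sqrt(1.0 - rho**2)
    x1, x2 = 0.0, 0.0
    out = np.empty((n, 2))
    for t in range(n):
        x1 = rng.normal(rho * x2, s)  # sample X1 from its full conditional
        x2 = rng.normal(rho * x1, s)  # then X2 given the new X1
        out[t] = (x1, x2)
    return out

draws = gibbs_bivariate_normal(0.8, 50_000)
print(np.corrcoef(draws[:, 0], draws[:, 1])[0, 1])  # approximately 0.8
```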
MCMC vs. Direct Simulation
| Property | Direct MC | MCMC |
|---|---|---|
| Independence | Exact | Approximate (correlated chain) |
| Convergence diagnostics | None needed | Required (burn-in, ESS, trace plots) |
| Applicability | Known, tractable distributions | Any distribution with computable density ratio |
| Variance of estimators | \(\sigma^2/n\) | \(\sigma^2/(n/\tau)\), \(\tau\) = integrated autocorrelation |
The effective sample size (ESS) for MCMC is \(n / \tau\) where \(\tau = 1 + 2\sum_{k=1}^\infty \text{Corr}(X_t, X_{t+k})\). Highly correlated chains have large \(\tau\) and small ESS, requiring more iterations to achieve the same accuracy as direct MC.
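In practice \(\tau\) is estimated from the empirical autocorrelations. The sketch below truncates the sum at the first negative autocorrelation estimate, which is one common heuristic (the window choice is an assumption of the example), and applies it to an AR(1) chain with coefficient 0.5, for which \(\rho_k = 0.5^k\) and hence the true \(\tau = 1 + 2\sum_k 0.5^k = 3\):

```python
import numpy as np

def ess(chain, max_lag=200):
    # Effective sample size n / tau with tau = 1 + 2 * sum_k rho_k,
    # truncating the sum at the first negative autocorrelation estimate
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    var = np.dot(x, x) / n
    tau = 1.0
    for k in range(1, min(max_lag, n - 1)):
        rho_k = np.dot(x[:-k], x[k:]) / (n * var)
        if rho_k < 0:
            break
        tau += 2.0 * rho_k
    return n / tau

# AR(1) chain with coefficient 0.5: true tau is 3, so ESS should be near n/3
rng = np.random.default_rng(2)
n = 50_000
x = np.empty(n)
x[0] = rng.standard_normal()
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
print(ess(x))
```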