ACTSC 454: Longevity and Mortality Using Predictive Analytics
Emily Kozlowski
Estimated study time: 30 minutes
Sources and References
- Primary textbook — Klein, J.P., Moeschberger, M.L. Survival Analysis: Techniques for Censored and Truncated Data, 2nd ed., Springer, 2003.
- Supplementary texts — Dickson, D.C.M., Hardy, M.R., Waters, H.R. Actuarial Mathematics for Life Contingent Risks, 3rd ed., Cambridge University Press, 2020 (Chapters 18–19); Klugman, S.A., Panjer, H.H., Willmot, G.E. Loss Models, 5th ed., Wiley, 2019; Pitacco, E., Denuit, M., Haberman, S., Olivieri, A. Modelling Longevity Dynamics for Pensions and Annuity Business, Oxford University Press, 2009; Cairns, A.J.G., Blake, D., Dowd, K. “A Two-Factor Model for Stochastic Mortality with Parameter Uncertainty,” Journal of Risk and Insurance, 2006.
- Online resources — Society of Actuaries Mortality Improvement Scale documents (MP-2014, MP-2021); Human Mortality Database (mortality.org); Continuous Mortality Investigation (CMI) Bureau working papers; Hyndman lecture notes on demographic forecasting.
Chapter 1: Survival Data, Censoring, and Truncation
Longevity and mortality analysis begins with the recognition that lifetime data are rarely observed completely. A subject may leave the study before dying, be enrolled only after reaching a particular age, or be observed only through a window of calendar time. These features motivate the formal machinery of censoring and truncation, which distinguishes survival analysis from classical statistics.
Types of Censoring
Let \(T\) denote a non-negative random variable representing time until death (or some other event). The survival function is \(S(t) = \Pr(T > t)\) and the hazard function is \(\mu(t) = f(t)/S(t)\), with cumulative hazard \(H(t) = -\ln S(t)\). Censoring describes the situation in which a subject’s exact lifetime is not observed, but a bound on it is known.
Right censoring — the most common form — occurs when the study ends, the subject withdraws, or a competing event removes the subject from observation while they are still alive. The observed data for individual \(i\) are the pair \((T_i, \delta_i)\), where \(T_i = \min(X_i, C_i)\) is the minimum of the true lifetime \(X_i\) and the censoring time \(C_i\), and \(\delta_i = \mathbb{1}\{X_i \leq C_i\}\) is the death indicator.
Left censoring arises when an event is known to have occurred before observation began, but the exact time is unknown. For example, a dementia study might record that a subject already had symptoms at enrollment, without knowing when they first appeared.
Interval censoring occurs when the event is known only to lie between two observation times \((L, R)\). This is typical of studies with periodic follow-up visits: if a subject is alive at visit \(k\) but dead at visit \(k+1\), the death time is only known to lie in that interval.
Truncation and Risk Sets
Truncation differs from censoring in that the subject is entirely unobserved outside a certain region. Left truncation, also called delayed entry, is critical in actuarial applications: a life insurance policyholder only enters the data at the age of policy issue, so any death before that age is invisible. Conditional on being observed, the likelihood contribution for subject \(i\) must condition on \(T > L_i\), where \(L_i\) is the entry age.
The risk set at time \(t\), denoted \(R(t)\), is the collection of individuals under observation and still alive just before \(t\). With left truncation, a subject contributes to \(R(t)\) only for \(t \in [L_i, T_i]\). Most nonparametric estimators depend on \(R(t)\) only through its size \(n(t) = |R(t)|\) and the number of events \(d(t)\) occurring at \(t\).
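The risk-set bookkeeping above is easy to make concrete. The following sketch (not from the course notes; data are hypothetical) counts the lives at risk just before time \(t\) under left truncation, treating each life as observed on \((L_i, T_i]\):

```python
# Sketch: risk-set size n(t) under left truncation. Life i belongs to
# the risk set at t only if it entered before t (L_i < t) and is still
# under observation just before t (t <= T_i).
def risk_set_size(t, entries, exits):
    """Number of lives at risk just before time t."""
    return sum(1 for L, T in zip(entries, exits) if L < t <= T)

entries = [0.0, 0.0, 2.0, 5.0]   # entry (truncation) times L_i
exits   = [6.0, 3.0, 8.0, 9.0]   # observed exit times T_i

print(risk_set_size(3.0, entries, exits))  # lives 1, 2, 3 at risk -> 3
```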
Chapter 2: Empirical Estimation of Survival Functions
Nonparametric estimators of the survival function make no assumption on the shape of \(S\) and form the backbone of descriptive mortality analysis. The two leading estimators — Kaplan–Meier and Nelson–Aalen — rest on the same risk-set logic but treat discrete jumps differently.
The Kaplan–Meier Estimator
Suppose events occur at the ordered times \(t_1 < t_2 < \cdots < t_k\), with \(d_i\) deaths at time \(t_i\) and risk set size \(n_i\). The Kaplan–Meier (product-limit) estimator of \(S(t)\) is
\[\hat S(t) = \prod_{t_i \leq t}\left(1 - \frac{d_i}{n_i}\right).\]It is the nonparametric maximum-likelihood estimator of \(S\) under right censoring. The conditional survival probability through each event time, \(1 - d_i/n_i\), is an empirical estimate of the one-step survival. Between event times, \(\hat S\) is constant, producing the characteristic step function. When the largest observation is censored, \(\hat S\) does not drop to zero; when it is a death, it does.
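The product-limit formula translates directly into code. This minimal sketch (hypothetical data; real work would use a survival library) encodes each subject as the pair \((T_i, \delta_i)\):

```python
# Minimal Kaplan-Meier: (times, died) encode (T_i, delta_i).
from collections import Counter

def kaplan_meier(times, died):
    """Return [(t_i, S_hat(t_i))] at each distinct death time."""
    deaths = Counter(t for t, d in zip(times, died) if d)
    s, out = 1.0, []
    for t in sorted(deaths):
        n = sum(1 for u in times if u >= t)   # risk-set size n_i
        s *= 1.0 - deaths[t] / n              # one-step conditional survival
        out.append((t, s))
    return out

times = [2, 3, 3, 5, 7, 8]
died  = [1, 1, 0, 1, 0, 1]
print(kaplan_meier(times, died))
```

Because the largest observation here (t = 8) is a death, the final step drops to zero, as noted above.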
The Nelson–Aalen Estimator
An alternative is the Nelson–Aalen estimator of the cumulative hazard,
\[\hat H(t) = \sum_{t_i \leq t}\frac{d_i}{n_i},\]from which one obtains \(\hat S_{NA}(t) = e^{-\hat H(t)}\). The two estimators satisfy \(\hat S_{KM}(t) \approx \exp(-\hat H_{NA}(t))\) for small hazards, since \(\log(1 - x) \approx -x\) when \(x\) is small. In practice Nelson–Aalen tends to be slightly larger than Kaplan–Meier, because \(-\log(1-x) > x\). The Nelson–Aalen estimator is preferred when hazard estimation is the goal or when sample sizes are small and the product form is unstable.
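On the same hypothetical data as the Kaplan-Meier sketch, the Nelson-Aalen sum and the implied survival estimate \(\hat S_{NA} = e^{-\hat H}\) can be computed in a few lines, illustrating that \(\hat S_{NA}\) sits slightly above \(\hat S_{KM}\):

```python
# Nelson-Aalen cumulative hazard from the same (T_i, delta_i) encoding.
import math
from collections import Counter

def nelson_aalen(times, died):
    deaths = Counter(t for t, d in zip(times, died) if d)
    H, out = 0.0, []
    for t in sorted(deaths):
        n = sum(1 for u in times if u >= t)   # risk-set size n_i
        H += deaths[t] / n                    # add the jump d_i / n_i
        out.append((t, H))
    return out

times = [2, 3, 3, 5, 7, 8]
died  = [1, 1, 0, 1, 0, 1]
na = nelson_aalen(times, died)
s_na = [(t, math.exp(-H)) for t, H in na]
print(s_na)   # each value slightly above the Kaplan-Meier step
```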
Greenwood’s Variance and Confidence Intervals
The variance of \(\hat S(t)\) is estimated by Greenwood’s formula
\[\widehat{\mathrm{Var}}(\hat S(t)) = [\hat S(t)]^2 \sum_{t_i \leq t}\frac{d_i}{n_i(n_i - d_i)}.\]A naive Wald confidence interval \(\hat S(t) \pm z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat S(t))}\) may stray outside \([0,1]\). Two remedies are standard.
The log-transformed interval uses \(\log \hat S(t)\) and back-transforms:
\[\hat S(t)\,\exp\!\bigl(\pm z_{1-\alpha/2}\,\sigma_L(t)\bigr),\]where \(\sigma_L^2(t) = \widehat{\mathrm{Var}}(\hat S(t))/[\hat S(t)]^2\). The log–log interval, based on \(\log(-\log \hat S(t))\), is
\[\hat S(t)^{\exp(\mp z_{1-\alpha/2}\,\sigma_L(t)/\log\hat S(t))},\]and has the advantage that both endpoints always lie in \((0,1)\). The log–log transform is preferred in the tails, where \(\hat S(t)\) is close to zero and symmetric intervals distort badly.
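A short numeric sketch (illustrative \(d_i, n_i\) values, not from the notes) shows Greenwood's formula and the log–log interval staying inside \((0,1)\):

```python
# Greenwood variance and the log-log 95% interval at the last event time.
import math

d = [1, 1, 1]          # deaths d_i at the ordered event times
n = [10, 8, 6]         # risk-set sizes n_i
z = 1.96               # 95% normal quantile

S = 1.0
gw = 0.0               # running Greenwood sum of d_i / [n_i (n_i - d_i)]
for di, ni in zip(d, n):
    S *= 1 - di / ni
    gw += di / (ni * (ni - di))

var = S * S * gw                      # Greenwood's formula
sigma_L = math.sqrt(var) / S          # sd of log S_hat
# log-log interval: S_hat ** exp(-+ z * sigma_L / log S_hat)
lo = S ** math.exp(-z * sigma_L / math.log(S))
hi = S ** math.exp(+z * sigma_L / math.log(S))
print(round(S, 4), round(lo, 4), round(hi, 4))
assert 0 < lo < S < hi < 1            # endpoints stay inside (0, 1)
```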
Kernel Density Estimation of the Hazard
While \(\hat H\) is a step function, the hazard \(\mu(t) = H'(t)\) is not directly estimable without smoothing. A kernel estimator replaces each jump \(\Delta \hat H(t_i) = d_i/n_i\) by a smooth bump of bandwidth \(b\):
\[\hat \mu(t) = \frac{1}{b}\sum_{i=1}^{k} K\!\left(\frac{t - t_i}{b}\right)\frac{d_i}{n_i},\]where the sum runs over all \(k\) event times and \(K\) is a symmetric kernel (Epanechnikov, biweight, Gaussian). The bandwidth \(b\) trades bias against variance; cross-validation or plug-in rules select it. Kernel hazards are indispensable for drawing smooth mortality curves from raw data.
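The kernel estimator amounts to spreading each Nelson-Aalen jump over a window of width \(b\). A sketch with an Epanechnikov kernel on hypothetical increments:

```python
# Kernel-smoothed hazard from Nelson-Aalen increments d_i / n_i
# (hypothetical event data; bandwidth b is chosen by hand here).
def epanechnikov(u):
    return 0.75 * (1 - u * u) if abs(u) < 1 else 0.0

def kernel_hazard(t, event_times, increments, b):
    """hat mu(t) = (1/b) * sum_i K((t - t_i)/b) * (d_i/n_i)."""
    return sum(epanechnikov((t - ti) / b) * dH
               for ti, dH in zip(event_times, increments)) / b

event_times = [2.0, 3.0, 5.0, 8.0]
increments  = [1/6, 1/5, 1/3, 1.0]   # jumps of H_hat
print(kernel_hazard(4.0, event_times, increments, b=2.0))  # -> 0.15
```

A larger `b` averages over more jumps (lower variance, higher bias); a smaller `b` tracks the raw increments more closely.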
Asymptotic Normality
Under mild regularity conditions the Kaplan–Meier estimator is asymptotically normal: for each fixed \(t\),
\[\sqrt{n}\,\bigl(\hat S(t) - S(t)\bigr) \;\longrightarrow\; N\bigl(0, \sigma^2(t)\bigr)\]
in distribution, with \(\sigma^2(t)\) consistently estimated by Greenwood's formula scaled by \(n\). The proof proceeds via the martingale central limit theorem applied to the counting process \(N(t) = \sum_i \delta_i \mathbb{1}\{T_i \leq t\}\).
Chapter 3: Cox Proportional Hazards Model
When the analyst wishes to attach covariates to a lifetime distribution, the Cox proportional hazards model offers a semiparametric framework of striking elegance. It decomposes the hazard into a baseline function of time multiplied by a covariate-dependent scale.
Model Specification
For a subject with covariate vector \(\mathbf{x}\), the hazard is
\[h(t \mid \mathbf{x}) = h_0(t)\,\exp(\boldsymbol\beta^\top \mathbf{x}),\]where \(h_0(t)\) is an unspecified baseline hazard and \(\boldsymbol\beta\) is the coefficient vector. The defining feature is that the ratio \(h(t \mid \mathbf{x}_1)/h(t \mid \mathbf{x}_2) = \exp(\boldsymbol\beta^\top(\mathbf{x}_1 - \mathbf{x}_2))\) does not depend on \(t\): hazards are proportional through time. A one-unit increase in a covariate multiplies the hazard by \(e^{\beta_j}\), the hazard ratio.
Partial Likelihood
The genius of Cox’s approach is that \(\boldsymbol\beta\) can be estimated without specifying \(h_0\). Conditional on the risk set \(R(t_i)\) at an event time \(t_i\), the probability that the actual death is subject \(i\) is
\[\frac{e^{\boldsymbol\beta^\top \mathbf{x}_i}}{\sum_{j \in R(t_i)} e^{\boldsymbol\beta^\top \mathbf{x}_j}}.\]Multiplying across all event times gives the partial likelihood
\[L(\boldsymbol\beta) = \prod_{i : d_i = 1}\frac{e^{\boldsymbol\beta^\top \mathbf{x}_i}}{\sum_{j \in R(t_i)} e^{\boldsymbol\beta^\top \mathbf{x}_j}}.\]Maximising \(\log L\) numerically yields \(\hat{\boldsymbol\beta}\); the observed information matrix provides standard errors. The baseline cumulative hazard is then estimated by the Breslow estimator
\[\hat H_0(t) = \sum_{t_i \leq t}\frac{d_i}{\sum_{j \in R(t_i)} e^{\hat{\boldsymbol\beta}^\top \mathbf{x}_j}}.\]
Extensions
Time-varying covariates are incorporated by letting \(\mathbf{x}\) depend on \(t\); the risk-set sums are recomputed at each event. Stratification allows different baseline hazards for subgroups (e.g., smokers versus non-smokers) while sharing \(\boldsymbol\beta\) — useful when proportionality fails for a categorical factor.
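The partial likelihood is simple enough to maximise by hand for one covariate. This sketch uses a single binary covariate, hypothetical subjects, and a crude grid search (a real fit would use Newton–Raphson or a package such as lifelines):

```python
# Log partial likelihood for a single binary covariate, maximised by
# grid search (illustrative data, not a production fitting routine).
import math

# (time, death indicator, covariate x) for five hypothetical subjects
data = [(2, 1, 1), (3, 1, 0), (4, 0, 1), (5, 1, 0), (7, 1, 1)]

def log_partial_lik(beta):
    ll = 0.0
    for t, d, x in data:
        if not d:
            continue                                   # censored: no factor
        risk = [xj for tj, dj, xj in data if tj >= t]  # risk set at t
        ll += beta * x - math.log(sum(math.exp(beta * xj) for xj in risk))
    return ll

betas = [i / 100 for i in range(-300, 301)]
beta_hat = max(betas, key=log_partial_lik)
print(round(beta_hat, 2))   # estimated log hazard ratio
```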
Chapter 4: Parametric Mortality Models and Graduation
Insurance applications demand a single mortality table \(\{q_x\}\) indexed by integer age. Raw crude rates \(D_x / E_x\) are noisy, especially at older ages with few exposures. The process of smoothing crude rates into a regular sequence is called graduation, and the underlying statistical machinery rests on the binomial and Poisson models of death counts.
Binomial Model
Assume that \(E_x\) lives are observed through age \([x, x+1]\), with each life independently dying with probability \(q_x\) during the year. Then
\[D_x \sim \mathrm{Binomial}(E_x, q_x), \quad \hat q_x = D_x/E_x.\]Here \(E_x\) is the initial exposure, counting each life as a full year of exposure regardless of whether they died mid-year. The variance is \(E_x q_x (1-q_x)\). This model is conceptually simple but requires that every life be observed for the entire age interval, which is unrealistic under withdrawals.
Poisson Model
Letting \(E^c_x\) denote central exposure — the total person-years actually lived between ages \(x\) and \(x+1\) — and assuming a constant force of mortality \(\mu_x\) across the year,
\[D_x \sim \mathrm{Poisson}(E^c_x \mu_x), \quad \hat\mu_x = D_x/E^c_x.\]The maximum-likelihood estimator has standard error \(\sqrt{D_x}/E^c_x\). Under the constant-force assumption, \(q_x = 1 - e^{-\mu_x}\). The Poisson model handles withdrawals naturally because each life contributes only the time it was actually observed.
The two exposure concepts are related approximately by \(E^c_x \approx E_x - \tfrac12 D_x\) when deaths are assumed to occur uniformly through the year, or more generally by careful bookkeeping of withdrawals and new entrants.
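The exposure bookkeeping and the Poisson estimator fit in a few lines; the counts below are hypothetical:

```python
# Central-exposure bookkeeping and the Poisson force estimator for one
# age cell (illustrative numbers).
import math

E_init = 1000.0            # initial exposure E_x (lives at exact age x)
D = 12                     # deaths observed in [x, x+1)

E_central = E_init - 0.5 * D        # E^c_x = E_x - D/2 under uniform deaths
mu_hat = D / E_central              # MLE of the force of mortality
se = math.sqrt(D) / E_central       # Poisson standard error sqrt(D)/E^c
q_hat = 1 - math.exp(-mu_hat)       # constant-force conversion to q_x

print(round(mu_hat, 6), round(se, 6), round(q_hat, 6))
```

Note \(q_x < \mu_x\) whenever \(\mu_x > 0\), since \(1 - e^{-\mu} < \mu\).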
Parametric Mortality Laws
Several analytic families are used to describe how \(\mu_x\) varies with age:
- Gompertz law: \(\mu_x = B c^x\), capturing exponential ageing.
- Makeham law: \(\mu_x = A + B c^x\), adding a constant accidental term.
- Heligman–Pollard 8-parameter law: \(\dfrac{q_x}{1-q_x} = A^{(x+B)^C} + D\,e^{-E(\ln x - \ln F)^2} + G H^x\).
The Heligman–Pollard form decomposes mortality into three components: infant and childhood mortality (first term), an adult accident hump centred near young-adult ages (second term), and senescent Gompertz-type ageing (third term). Each of the eight parameters \(A,\ldots,H\) has an interpretable role.
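The simpler laws are direct to evaluate. A sketch with illustrative (not calibrated) parameter values:

```python
# Gompertz and Makeham forces of mortality at a few ages.
# Parameter values below are illustrative, not fitted to any table.
def gompertz(x, B=3e-5, c=1.09):
    return B * c ** x          # mu_x = B c^x

def makeham(x, A=5e-4, B=3e-5, c=1.09):
    return A + gompertz(x, B, c)   # mu_x = A + B c^x

for age in (40, 60, 80):
    print(age, round(gompertz(age), 5), round(makeham(age), 5))
```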
Spline and Whittaker–Henderson Graduation
When no parametric form is imposed, the graduated values \(\{\dot q_x\}\) are chosen to balance fidelity to the data against smoothness. The Whittaker–Henderson penalty minimises
\[\sum_x w_x (\dot q_x - \hat q_x)^2 + h\sum_x (\Delta^z \dot q_x)^2,\]where \(\Delta^z\) is the \(z\)-th finite difference (usually \(z = 3\)) and \(h\) is a smoothing parameter. Larger \(h\) gives smoother graduations. Cubic or P-splines offer similar flexibility with a basis-expansion interpretation.
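Setting the gradient of the Whittaker–Henderson objective to zero gives the linear system \((W + h D^\top D)\dot q = W \hat q\), where \(D\) is the \(z\)-th difference matrix. A small sketch (second differences \(z = 2\) for brevity, whereas the notes cite \(z = 3\); crude rates are illustrative):

```python
# Whittaker-Henderson graduation with z = 2, solving
# (W + h D'D) v = W u by Gaussian elimination (M is positive definite).
def wh_graduate(u, w, h):
    n = len(u)
    c = [1.0, -2.0, 1.0]                      # second-difference coefficients
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        M[i][i] = w[i]                        # the W part
    for r in range(n - 2):                    # accumulate h * D'D
        for a in range(3):
            for b in range(3):
                M[r + a][r + b] += h * c[a] * c[b]
    rhs = [w[i] * u[i] for i in range(n)]
    for i in range(n):                        # forward elimination
        for j in range(i + 1, n):
            f = M[j][i] / M[i][i]
            for k in range(i, n):
                M[j][k] -= f * M[i][k]
            rhs[j] -= f * rhs[i]
    v = [0.0] * n                             # back substitution
    for i in range(n - 1, -1, -1):
        v[i] = (rhs[i] - sum(M[i][k] * v[k] for k in range(i + 1, n))) / M[i][i]
    return v

crude = [0.010, 0.014, 0.011, 0.018, 0.016, 0.022]
weights = [1.0] * 6
smooth = wh_graduate(crude, weights, h=10.0)
print([round(v, 4) for v in smooth])
```

The graduated sequence has a strictly smaller second-difference penalty than the crude rates, at the cost of some fidelity.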
Tests of Fit
After graduation, the analyst checks whether the smoothed rates \(\dot q_x\) are statistically consistent with the data. Common tests include:
- Chi-square test: compute \(\sum_x (D_x - E_x \dot q_x)^2/[E_x \dot q_x (1 - \dot q_x)]\) and compare to \(\chi^2\) with degrees of freedom equal to the number of age cells minus parameters estimated.
- Signs test: count the number of ages where the crude rate exceeds the graduated rate; under the null this count is binomial with parameter 0.5.
- Runs test (Stevens test): count the runs of consecutive positive or negative deviations; too few runs indicates clumping, too many indicates alternation.
- Serial correlation test: correlation of standardised deviations at lag 1.
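The signs test is the easiest of these to implement exactly. A sketch with illustrative crude and graduated rates:

```python
# Exact two-sided signs test: under the null, the number of positive
# deviations (crude minus graduated) is Binomial(m, 0.5).
from math import comb

def signs_test_pvalue(crude, graduated):
    devs = [c - g for c, g in zip(crude, graduated) if c != g]
    m = len(devs)
    k = sum(1 for d in devs if d > 0)
    tail = min(k, m - k)
    p = sum(comb(m, j) for j in range(tail + 1)) / 2 ** (m - 1)
    return min(p, 1.0)             # cap when k sits at the exact centre

crude     = [0.010, 0.014, 0.011, 0.018, 0.016, 0.022]
graduated = [0.011, 0.013, 0.014, 0.016, 0.018, 0.020]
print(signs_test_pvalue(crude, graduated))
```

Balanced positive and negative deviations (as here) give a p-value near 1; an imbalance signals systematic over- or under-graduation.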
Chapter 5: MLE for Markov Multi-State Mortality
Many actuarial products involve more than a simple alive/dead dichotomy: think of disability income (healthy, disabled, dead), critical illness (healthy, diagnosed, dead), or long-term care models with multiple impairment levels. These are described by Markov multi-state models with transition intensities \(\mu^{ij}_t\) between states \(i\) and \(j\).
The Multi-State Likelihood
Let \(N^{ij}(t)\) denote the count of transitions from \(i\) to \(j\) by time \(t\), and \(v^i_t\) the number of lives in state \(i\) at time \(t\). Under the assumption of piecewise-constant intensities \(\mu^{ij}\), the log-likelihood contribution is
\[\ell(\mu^{ij}) = N^{ij}\log\mu^{ij} - \mu^{ij}\int v^i_t\,dt,\]summed over all transition types. Maximising gives the natural estimator
\[\hat\mu^{ij} = \frac{N^{ij}}{\int v^i_t\,dt},\]which generalises the Poisson force estimator to the multi-state setting. The denominator \(\int v^i_t\,dt\) is the total person-time spent in state \(i\), a direct multi-state analogue of central exposure.
Poisson Approximation
Because transitions out of state \(i\) at rate \(\mu^{ij}\) on a population of total exposure \(E^i = \int v^i_t\,dt\) yield \(N^{ij} \sim \mathrm{Poisson}(\mu^{ij} E^i)\) (approximately, under independence), standard errors follow the Poisson form: \(\mathrm{se}(\hat\mu^{ij}) = \sqrt{\hat\mu^{ij}/E^i} = \sqrt{N^{ij}}/E^i\). Confidence intervals, hypothesis tests on intensities, and likelihood-ratio comparisons of nested models all proceed as in standard GLM theory.
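The occurrence/exposure estimator and its Poisson standard error are a one-liner per transition. A sketch for a hypothetical three-state disability model (all counts and exposures illustrative):

```python
# hat mu^{ij} = N^{ij} / E^i with Poisson standard error sqrt(N^{ij})/E^i,
# for a healthy (H) / disabled (D) / dead (X) model.
import math

transitions = {            # N^{ij}: observed transition counts
    ("H", "D"): 40,        # healthy -> disabled
    ("H", "X"): 10,        # healthy -> dead
    ("D", "H"): 15,        # recovery
    ("D", "X"): 12,        # disabled -> dead
}
exposure = {"H": 5000.0, "D": 400.0}   # E^i: person-years in state i

for (i, j), N in transitions.items():
    mu = N / exposure[i]               # intensity estimate
    se = math.sqrt(N) / exposure[i]    # Poisson standard error
    print(f"mu_{i}{j} = {mu:.5f} (se {se:.5f})")
```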
The framework accommodates competing risks (multiple absorbing states), reversible transitions (e.g., recovery from disability), and duration-dependent intensities via time-since-entry as a covariate.
Chapter 6: Longevity Risk and Stochastic Mortality
Insurance and pension liabilities extending over decades are sensitive to changes in mortality, not merely its current level. Since the 1950s developed-country mortality has declined steadily, especially at older ages, producing substantial longevity risk: the risk that people live longer than anticipated, increasing the cost of life annuities, defined-benefit pensions, and longevity-indexed reinsurance treaties.
Deterministic Mortality Improvement
The simplest approach projects a base-year table forward using improvement factors. Let \(q_{x,0}\) denote the base rate and \(\phi_{x,s}\) the one-year improvement in year \(s\). Then
\[q_{x,t} = q_{x,0}\prod_{s=1}^{t}(1 - \phi_{x,s}).\]Two-dimensional improvement scales such as Scale BB, MP-2014, MP-2021, and the Canadian CPM2014 publish tables of \(\phi_{x,s}\) varying by age and calendar year. The improvement structure may include a cohort component (lives born in the same year share a common improvement), as in the Continuous Mortality Investigation CMI Model used in the United Kingdom.
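The projection formula is a running product of survival-of-the-rate factors. A sketch with made-up improvement factors (not an actual published scale):

```python
# q_{x,t} = q_{x,0} * prod_{s=1}^{t} (1 - phi_{x,s})
# phi values below are illustrative, not from MP-2021 or any SOA scale.
def project_q(q0, phis):
    q = q0
    for phi in phis:
        q *= 1 - phi
    return q

q0 = 0.02                       # base-year rate q_{x,0}
phis = [0.015, 0.012, 0.010]    # one-year improvements for years 1..3
print(round(project_q(q0, phis), 6))
```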
Deterministic scales are easy to implement and communicate, but they provide only a central projection and do not quantify uncertainty. For solvency capital or reinsurance pricing, a stochastic model is essential.
The Lee–Carter Model
The Lee–Carter (1992) model is the canonical stochastic mortality model. It assumes
\[\log m_{x,t} = a_x + b_x k_t + \epsilon_{x,t},\]where \(m_{x,t}\) is the central death rate at age \(x\) in year \(t\), \(a_x\) is an age-specific average log-rate, \(k_t\) is a time index capturing overall mortality level, and \(b_x\) describes how strongly each age responds to changes in \(k_t\). Identification is pinned down by the constraints \(\sum_x b_x = 1\) and \(\sum_t k_t = 0\).
Estimation proceeds by setting \(\hat a_x\) equal to the mean of \(\log m_{x,t}\) over \(t\), then extracting the first singular vector of the demeaned matrix by singular value decomposition. The resulting \(\hat k_t\) is a unidimensional summary of period mortality which is then forecast with an ARIMA model (typically a random walk with drift) to produce projections and prediction intervals.
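The estimation recipe can be sketched in pure Python: row means give \(\hat a_x\), and an alternating-least-squares rank-1 fit stands in for the SVD step (the two agree at convergence on the leading component). The log-rate matrix below is illustrative, not real data:

```python
# Lee-Carter sketch: a_x = row means of log m_{x,t}; then fit b_x k_t
# to the demeaned matrix by alternating least squares (a stand-in for
# the first SVD component), with constraints sum b_x = 1, sum k_t = 0.
logm = [                     # illustrative log central death rates
    [-6.0, -6.1, -6.25, -6.3],   # ages in rows,
    [-5.0, -5.1, -5.20, -5.3],   # calendar years in columns
    [-3.9, -4.0, -4.15, -4.2],
]
X, T = len(logm), len(logm[0])

a = [sum(row) / T for row in logm]                              # a_x
Z = [[logm[x][t] - a[x] for t in range(T)] for x in range(X)]   # demeaned

b, k = [1.0 / X] * X, [0.0] * T
for _ in range(200):                                  # ALS iterations
    bb = sum(v * v for v in b)
    k = [sum(b[x] * Z[x][t] for x in range(X)) / bb for t in range(T)]
    kk = sum(v * v for v in k)
    b = [sum(k[t] * Z[x][t] for t in range(T)) / kk for x in range(X)]

s = sum(b)                   # impose sum b_x = 1 (sum k_t = 0 holds
b = [v / s for v in b]       # automatically since Z has zero row means)
k = [v * s for v in k]

drift = (k[-1] - k[0]) / (T - 1)      # random-walk-with-drift estimate
print([round(v, 3) for v in k], round(drift, 4))
```

The fitted \(k_t\) declines over the sample, so the drift is negative and the central forecast continues the mortality improvement.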
The Cairns–Blake–Dowd Model
For older ages, where the accident hump is absent and mortality rises almost linearly in log-odds space, Cairns, Blake, and Dowd (2006) proposed the two-factor model
\[\mathrm{logit}\,q_{x,t} = \kappa^{(1)}_t + (x - \bar x)\,\kappa^{(2)}_t,\]where \(\bar x\) is the mean age in the fit and \(\kappa^{(1)}_t, \kappa^{(2)}_t\) follow a bivariate random walk with drift. \(\kappa^{(1)}\) measures the overall level and \(\kappa^{(2)}\) the slope (the spread between young-old and old-old mortality). Unlike Lee–Carter, CBD has no age-specific \(b_x\), which both simplifies estimation and avoids overfitting when only older ages are modelled.
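Because the age regressor is centred at \(\bar x\), fitting one calendar year by ordinary least squares makes \(\kappa^{(1)}_t\) the mean logit rate and \(\kappa^{(2)}_t\) the slope. A sketch on illustrative older-age rates:

```python
# One-year CBD fit: logit q_{x,t} = kappa1 + (x - xbar) * kappa2.
# The q values are illustrative, not from a real mortality table.
import math

ages = [65, 70, 75, 80, 85]
q = [0.012, 0.020, 0.034, 0.058, 0.098]       # q_{x,t} for one year t

logit = [math.log(v / (1 - v)) for v in q]
xbar = sum(ages) / len(ages)
cx = [x - xbar for x in ages]                 # centred ages

kappa1 = sum(logit) / len(logit)              # level: mean logit rate
kappa2 = (sum(c * y for c, y in zip(cx, logit))
          / sum(c * c for c in cx))           # slope across ages
print(round(kappa1, 4), round(kappa2, 4))
```

Repeating this year by year yields the \((\kappa^{(1)}_t, \kappa^{(2)}_t)\) series that the bivariate random walk with drift is then fitted to.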
Extensions add cohort effects (CBD-X), quadratic age terms (M7), or additional period factors. The choice between models is typically made by out-of-sample forecast accuracy on held-back calendar years.
Comparison and Longevity Risk Transfer
Lee–Carter is flexible across the full age range but imposes perfectly correlated innovations across ages; CBD is parsimonious and well-suited to ages 60+, with a transparent interpretation of level and slope shocks. Both permit Monte Carlo simulation of future liabilities to compute value at risk and conditional tail expectation on annuity books.
Longevity risk is increasingly transferred to capital markets via longevity bonds (coupons indexed to a survivor fraction of a reference cohort), longevity swaps (fixed payments exchanged for realised annuity payments), and index-based reinsurance tied to published population mortality. Pricing such instruments requires a stochastic model, a risk-neutral adjustment, and calibration to available market prices.