STAT 331: Applied Linear Models
Samuel Wong
Estimated study time: 1 hr 35 min
Table of contents
Sources and References
Primary notes — Cameron Roopnarine (Hextical), STAT 331: Applied Linear Models, Fall 2020, hextical.github.io/university-notes
Supplementary notes — Richard Wu, STAT 331 Notes, richardwu.ca; David Duan, STAT 331 Master Notes, david-duan.me
Textbook — Abraham and Ledolter, Introduction to Regression Modeling
Chapter 1: Simple Linear Regression
1.1 Overview and Motivation
Regression analysis is the study of how a response variable (also called the dependent variable) relates to one or more explanatory variables (also called independent variables, predictors, or covariates). We denote the response by \(Y\) and the predictors by \(x_1, x_2, \ldots, x_p\). A regression model captures the functional relationship between the response and the predictors, and in this course the focus is on models that are linear in the parameters.
\[ Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon, \]where \(\beta_0\) is the intercept, \(\beta_1, \ldots, \beta_p\) are regression coefficients that quantify the effect of each predictor on the response, and \(\varepsilon\) is a random error term satisfying \(\varepsilon \sim \mathcal{N}(0, \sigma^2)\). The term “linear” refers to linearity in the parameters \(\beta_j\), not necessarily in the predictors themselves.
1.2 The Simple Linear Regression Model
For observations \((x_i, y_i)\), \(i = 1, \ldots, n\), the simple linear regression (SLR) model is \[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2). \]The standard assumptions, often remembered by the acronym LINE, are:
- Linearity: the expected response is a linear function of \(x\).
- Independence: the error terms \(\varepsilon_1, \ldots, \varepsilon_n\) are independent.
- Normality: each \(\varepsilon_i\) follows a normal distribution.
- Equal variance (homoscedasticity): \(\operatorname{Var}(\varepsilon_i) = \sigma^2\) for all \(i\).
Here \(\beta_0, \beta_1, \sigma^2\) are fixed but unknown parameters; \(\varepsilon_i\) are unobserved random quantities; and \(y_i, x_i\) are observed data (we treat \(x_i\) as fixed constants throughout this course).
Interpretation of Parameters
- \(\beta_0\) is the expected value of \(Y\) when \(x = 0\). This may not always be scientifically meaningful if \(x = 0\) is outside the range of observed data.
- \(\beta_1\) is the expected change in \(Y\) for a one-unit increase in \(x\): \(\mathbb{E}[Y \mid x = x^* + 1] - \mathbb{E}[Y \mid x = x^*] = \beta_1\).
Sample Correlation
The sample correlation between \(x\) and \(y\) is \[ r = \frac{S_{xy}}{\sqrt{S_{xx} \, S_{yy}}}, \]where \[ S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2, \qquad S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}), \qquad S_{yy} = \sum_{i=1}^n (y_i - \bar{y})^2. \]The correlation satisfies \(-1 \leq r \leq 1\). Values of \(|r|\) near 1 indicate a strong linear relationship, while values near 0 suggest a weak one.
1.3 Ordinary Least Squares Estimation
The least squares estimates minimize the sum of squared deviations \[ S(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2. \]Setting the partial derivatives with respect to \(\beta_0\) and \(\beta_1\) to zero yields the normal equations, whose solution is \[ \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. \]
Equivalence with Maximum Likelihood
\[ L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right). \]\[ \ell(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2. \]Maximizing \(\ell\) with respect to \(\beta_0\) and \(\beta_1\) is equivalent to minimizing the sum of squares, so the ML estimators coincide with the LS estimators.
Fitted Values and Residuals
The fitted values are \(\hat{\mu}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\) and the residuals are \(e_i = y_i - \hat{\mu}_i\). The normal equations impose two linear constraints on the residuals: \(\sum_{i=1}^n e_i = 0\) and \(\sum_{i=1}^n x_i e_i = 0\). These constraints reduce the degrees of freedom from \(n\) to \(n - 2\), since two parameters have been estimated.
Estimating the Error Variance
\[ \hat{\sigma}^2 = \frac{\text{SS(Res)}}{n - 2} = \frac{\sum_{i=1}^n e_i^2}{n - 2}. \]The denominator \(n - 2\) corrects for the two estimated parameters. The residual standard error is \(\hat{\sigma} = \sqrt{\hat{\sigma}^2}\).
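The estimates \(\hat{\beta}_0\), \(\hat{\beta}_1\), and \(\hat{\sigma}^2\) can be computed directly from the sums of squares. A minimal pure-Python sketch on a small made-up dataset (the numbers are illustrative, not from the course):

```python
# Least squares fit of a simple linear regression, computed from first principles.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.3, 5.9, 8.2, 9.8]      # made-up, roughly linear data
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
b1 = Sxy / Sxx                       # slope estimate
b0 = ybar - b1 * xbar                # intercept estimate
resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
ss_res = sum(e ** 2 for e in resid)
sigma2_hat = ss_res / (n - 2)        # unbiased estimate of sigma^2
print(b1, b0, sigma2_hat)
```

Note that the residuals sum to zero, as the normal equations require.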
1.4 Inference on the Slope and Intercept
Sampling Distribution of the Estimators
Writing \(\hat{\beta}_1 = \sum_{i=1}^n c_i Y_i\) with \(c_i = (x_i - \bar{x})/S_{xx}\) shows that \(\hat{\beta}_1\) is a linear combination of the normally distributed \(Y_i\), so it is itself normal: \[ \hat{\beta}_1 \sim \mathcal{N}\!\left(\beta_1, \; \frac{\sigma^2}{S_{xx}}\right). \]
Standard Error and the t-Statistic
Replacing \(\sigma\) by \(\hat{\sigma}\) gives the standard error \(\operatorname{SE}(\hat{\beta}_1) = \hat{\sigma}/\sqrt{S_{xx}}\), and the studentized quantity \[ T = \frac{\hat{\beta}_1 - \beta_1}{\operatorname{SE}(\hat{\beta}_1)} \sim t(n - 2). \]
Confidence Intervals
\[ \hat{\beta}_1 \pm t_{1-\alpha/2, \, n-2} \cdot \operatorname{SE}(\hat{\beta}_1), \]where \(t_{1-\alpha/2, \, n-2}\) is the appropriate quantile of the \(t\)-distribution with \(n - 2\) degrees of freedom. An analogous interval exists for \(\beta_0\).
Hypothesis Testing
\[ t = \frac{\hat{\beta}_1}{\operatorname{SE}(\hat{\beta}_1)} \sim t(n-2) \quad \text{under } H_0. \]The two-sided p-value is \(p = \Pr(|T| \geq |t|)\). We reject \(H_0\) at significance level \(\alpha\) when \(p < \alpha\). A rejection implies that there is statistically significant evidence of a linear relationship between \(x\) and \(Y\).
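The slope test can be sketched in Python on a small made-up dataset; the 5% critical value \(t_{0.975,\,3} \approx 3.182\) is hard-coded for illustration rather than computed from a t-distribution:

```python
import math

# t-test for H0: beta1 = 0 in simple linear regression (made-up data).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.3, 5.9, 8.2, 9.8]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / Sxx
b0 = ybar - b1 * xbar
ss_res = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
sigma_hat = math.sqrt(ss_res / (n - 2))
se_b1 = sigma_hat / math.sqrt(Sxx)   # SE(beta1_hat)
t_stat = b1 / se_b1                  # test statistic for H0: beta1 = 0
# Reject at the 5% level when |t_stat| exceeds t_{0.975, n-2} (about 3.182 here).
print(t_stat)
```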
1.5 ANOVA Decomposition for SLR
\[ \underbrace{\sum_{i=1}^n (y_i - \bar{y})^2}_{\text{SS(Total)}} = \underbrace{\sum_{i=1}^n (\hat{\mu}_i - \bar{y})^2}_{\text{SS(Reg)}} + \underbrace{\sum_{i=1}^n (y_i - \hat{\mu}_i)^2}_{\text{SS(Res)}}. \]This identity holds because the cross-product term vanishes: \(\sum e_i(\hat{\mu}_i - \bar{y}) = 0\), a consequence of the orthogonality of the residual and fitted value vectors.
The coefficient of determination is defined as \[ R^2 = \frac{\text{SS(Reg)}}{\text{SS(Total)}} = 1 - \frac{\text{SS(Res)}}{\text{SS(Total)}}, \]and represents the proportion of variability in \(Y\) explained by the linear model. In simple linear regression, \(R^2 = r^2\), where \(r\) is the sample correlation.
1.6 Prediction and Confidence Intervals
Confidence Interval for the Mean Response
For a fixed value \(x_0\), the estimated mean response \(\hat{\mu}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0\) satisfies \[ \hat{\mu}_0 \sim \mathcal{N}\!\left(\mu_0, \; \sigma^2\!\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)\right), \]giving the \(100(1-\alpha)\%\) confidence interval \[ \hat{\mu}_0 \pm t_{1-\alpha/2,\,n-2}\;\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}. \]
Prediction Interval for a New Observation
A new observation \(Y_0\) at \(x_0\) is independent of the fitted line, so the prediction error has variance \[ \operatorname{Var}(Y_0 - \hat{\mu}_0) = \sigma^2 + \sigma^2\!\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right) = \sigma^2\!\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right), \]and the \(100(1-\alpha)\%\) prediction interval is \[ \hat{\mu}_0 \pm t_{1-\alpha/2,\,n-2}\;\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}. \]
1.7 Correlation and the Pearson Coefficient
For jointly distributed random variables \(X\) and \(Y\), the (Pearson) population correlation is \[ \rho = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}}. \]Its sample estimate is \(r = S_{xy}/\sqrt{S_{xx} S_{yy}}\), which satisfies \(-1 \leq r \leq 1\). In SLR, \(r = \hat\beta_1\sqrt{S_{xx}/S_{yy}}\), and crucially, \(R^2 = r^2\).
Testing \(H_0\colon \rho = 0\)
\[ T = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t(n-2) \quad \text{under } H_0\colon \rho = 0 \]is identical to the t-test for \(H_0\colon \beta_1 = 0\) in SLR. Thus testing zero correlation is the same as testing zero slope.
Confidence Interval for \(\rho\)
Because the sampling distribution of \(r\) is skewed when \(\rho \neq 0\), we use Fisher's z-transformation \[ z = \frac{1}{2}\ln\!\left(\frac{1+r}{1-r}\right) = \tanh^{-1}(r). \]The transformed statistic satisfies \(z \approx \mathcal{N}(\tanh^{-1}(\rho), 1/(n-3))\) for large \(n\). A \(100(1-\alpha)\%\) CI for \(\rho\) is obtained by:
- Computing \(z \pm z_{1-\alpha/2}/\sqrt{n-3}\), where \(z_{1-\alpha/2}\) is the standard normal quantile.
- Back-transforming via \(\rho = (e^{2z}-1)/(e^{2z}+1) = \tanh(z)\).
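The two steps can be sketched in Python; `r` and `n` are made-up values, and `statistics.NormalDist` supplies the normal quantile:

```python
import math
from statistics import NormalDist

# 95% CI for rho via Fisher's z-transform; r and n are illustrative values.
r, n = 0.75, 30
z = math.atanh(r)                     # 0.5 * ln((1+r)/(1-r))
zq = NormalDist().inv_cdf(0.975)      # standard normal quantile, about 1.96
lo = z - zq / math.sqrt(n - 3)
hi = z + zq / math.sqrt(n - 3)
ci = (math.tanh(lo), math.tanh(hi))   # back-transform to the rho scale
print(ci)
```

The interval is asymmetric around \(r\) on the original scale, which is exactly what the transformation is meant to achieve.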
Chapter 2: Multiple Linear Regression
2.1 Matrix Formulation
\[ Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad \varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2), \]\[ \boldsymbol{Y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \]\[ \boldsymbol{Y} = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}_{n \times 1}, \quad X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix}_{n \times (p+1)}, \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_p \end{bmatrix}_{(p+1) \times 1}, \quad \boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}_{n \times 1}. \]The matrix \(X\) is called the design matrix. Under the model assumptions, \(\boldsymbol{\varepsilon} \sim \text{MVN}(\boldsymbol{0}, \sigma^2 I_n)\), so \(\boldsymbol{Y} \sim \text{MVN}(X\boldsymbol{\beta}, \sigma^2 I_n)\).
Random Vectors and Covariance Matrices
For a random vector \(\boldsymbol{Y} = (Y_1, \ldots, Y_n)^\top\), the mean vector is \(\boldsymbol{\mu} = \mathbb{E}[\boldsymbol{Y}]\) and the covariance matrix \(\Sigma = \operatorname{Var}(\boldsymbol{Y})\) has entries \(\Sigma_{ij} = \operatorname{Cov}(Y_i, Y_j)\); it is symmetric and positive semi-definite.
Multivariate Normal Distribution
We write \(\boldsymbol{Y} \sim \text{MVN}(\boldsymbol{\mu}, \Sigma)\) when \(\boldsymbol{Y}\) follows the multivariate normal distribution with mean \(\boldsymbol{\mu}\) and covariance matrix \(\Sigma\).
Important properties of the MVN:
- Any subset of \(\boldsymbol{Y}\) is also MVN.
- Any linear transformation \(A\boldsymbol{Y} + \boldsymbol{d}\) is MVN with mean \(A\boldsymbol{\mu} + \boldsymbol{d}\) and covariance \(A\Sigma A^\top\).
- Components \(Y_i\) and \(Y_j\) are independent if and only if \(\Sigma_{ij} = 0\).
2.2 OLS in Matrix Form
The OLS estimator minimizes \(\lVert \boldsymbol{y} - X\boldsymbol{\beta} \rVert^2 = (\boldsymbol{y} - X\boldsymbol{\beta})^\top(\boldsymbol{y} - X\boldsymbol{\beta})\). Provided \(X^\top X\) is invertible, the solution of the normal equations \(X^\top X \boldsymbol{\beta} = X^\top \boldsymbol{y}\) is \[ \hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \boldsymbol{y}. \]
2.3 The Hat Matrix
The fitted values are \(\hat{\boldsymbol{\mu}} = X\hat{\boldsymbol{\beta}} = X(X^\top X)^{-1}X^\top \boldsymbol{y} = H\boldsymbol{y}\), where \(H = X(X^\top X)^{-1}X^\top\) is called the hat matrix because it "puts the hat on" \(\boldsymbol{y}\). Key properties of \(H\):
- Symmetric: \(H^\top = H\).
- Idempotent: \(H^2 = H\).
- \(I - H\) is also symmetric and idempotent, and \(\boldsymbol{e} = (I - H)\boldsymbol{y}\).
- \(\operatorname{tr}(H) = p + 1\) and \(\operatorname{tr}(I - H) = n - (p + 1)\).
Moreover, \(\hat{\boldsymbol{\beta}}\) and \(\boldsymbol{e}\) are independent, which is crucial for the validity of \(t\)- and \(F\)-tests.
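These properties can be verified numerically for a small simple-regression design. A pure-Python sketch (the x-values are made up, and the \(2 \times 2\) matrix \(X^\top X\) is inverted directly):

```python
# Build the hat matrix H = X (X^T X)^{-1} X^T for an SLR design and check tr(H).
xs = [1, 2, 3, 4, 5]
n = len(xs)
X = [[1.0, x] for x in xs]                     # n x 2 design matrix
# For SLR, X^T X = [[n, sum x], [sum x, sum x^2]]; invert the 2x2 directly.
sx, sxx = sum(xs), sum(x * x for x in xs)
det = n * sxx - sx * sx
XtX_inv = [[sxx / det, -sx / det], [-sx / det, n / det]]
H = [[sum(X[i][a] * XtX_inv[a][b] * X[j][b] for a in range(2) for b in range(2))
      for j in range(n)] for i in range(n)]
trace_H = sum(H[i][i] for i in range(n))
print(trace_H)                                 # equals p + 1 = 2
```

Symmetry and idempotence of `H` can be checked the same way by comparing `H` with its transpose and with `H @ H`.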
2.4 Distribution of the OLS Estimator and Inference
Since \(\hat{\boldsymbol{\beta}} = (X^\top X)^{-1}X^\top \boldsymbol{Y}\) is a linear transformation of the MVN vector \(\boldsymbol{Y}\), \[ \hat{\boldsymbol{\beta}} \sim \text{MVN}\!\left(\boldsymbol{\beta}, \; \sigma^2 (X^\top X)^{-1}\right), \]and the standard error of an individual coefficient is \(\operatorname{SE}(\hat{\beta}_j) = \hat{\sigma}\sqrt{[(X^\top X)^{-1}]_{jj}}\).
Confidence Intervals and Hypothesis Tests for Individual Coefficients
\[ T = \frac{\hat{\beta}_j - \beta_j}{\operatorname{SE}(\hat{\beta}_j)} \sim t(n - p - 1) \]\[ \hat{\beta}_j \pm t_{1-\alpha/2, \, n-p-1}\;\operatorname{SE}(\hat{\beta}_j). \]To test \(H_0\colon \beta_j = 0\), the test statistic is \(t = \hat{\beta}_j / \operatorname{SE}(\hat{\beta}_j)\) compared to \(t(n-p-1)\).
Interpretation of Coefficients
- \(\hat{\beta}_0\) estimates the expected response when all predictors equal zero.
- \(\hat{\beta}_j\) estimates the change in the expected response for a one-unit increase in \(x_j\), holding all other predictors constant.
Gauss–Markov Theorem
Among all unbiased linear estimators of \(\boldsymbol{\beta}\), the OLS estimator has the smallest variance. Formally, for any other unbiased estimator \(\hat{\boldsymbol{\beta}}^* = M^*\boldsymbol{Y}\), we have \(\operatorname{Var}(\hat{\boldsymbol{\beta}}^*) - \operatorname{Var}(\hat{\boldsymbol{\beta}})\) positive semi-definite. This result holds without requiring normality of errors.
2.5 Prediction in MLR
For a new covariate vector \(\boldsymbol{x}_0 = (1, x_{01}, \ldots, x_{0p})\), the predicted mean response is \(\hat{\mu}_0 = \boldsymbol{x}_0 \hat{\boldsymbol{\beta}}\).
A \(100(1-\alpha)\%\) confidence interval for the mean response is \[ \hat{\mu}_0 \pm t_{1-\alpha/2, \, n-p-1}\;\hat{\sigma}\sqrt{\boldsymbol{x}_0 (X^\top X)^{-1} \boldsymbol{x}_0^\top}, \]while the prediction interval for a new observation is \[ \hat{\mu}_0 \pm t_{1-\alpha/2, \, n-p-1}\;\hat{\sigma}\sqrt{1 + \boldsymbol{x}_0 (X^\top X)^{-1} \boldsymbol{x}_0^\top}. \]The prediction interval is wider because the variance of the prediction error is \(\sigma^2(1 + \boldsymbol{x}_0(X^\top X)^{-1}\boldsymbol{x}_0^\top)\), incorporating both the inherent noise \(\sigma^2\) and estimation uncertainty.
2.6 Partial F-Tests and ANOVA
ANOVA Decomposition
\[ \text{SS(Total)} = \text{SS(Reg)} + \text{SS(Res)}, \]\[ \text{SS(Total)} = \sum_{i=1}^n (y_i - \bar{y})^2, \quad \text{SS(Reg)} = \sum_{i=1}^n (\hat{\mu}_i - \bar{y})^2, \quad \text{SS(Res)} = \sum_{i=1}^n (y_i - \hat{\mu}_i)^2. \]
| Source | d.f. | SS | MS | F |
|---|---|---|---|---|
| Regression | \(p\) | SS(Reg) | SS(Reg)/\(p\) | MS(Reg)/MS(Res) |
| Residual | \(n - p - 1\) | SS(Res) | SS(Res)/\((n-p-1)\) | |
| Total | \(n - 1\) | SS(Total) | | |
Overall F-Test
\[ H_0\colon \beta_1 = \beta_2 = \cdots = \beta_p = 0 \qquad \text{vs.} \qquad H_A\colon \text{at least one } \beta_j \neq 0. \]\[ F = \frac{\text{SS(Reg)}/p}{\text{SS(Res)}/(n-p-1)} = \frac{\text{MS(Reg)}}{\text{MS(Res)}} \sim F(p, n-p-1) \quad \text{under } H_0. \]
Partial F-Test (Comparing Nested Models)
Suppose the full model has \(p\) predictors and \(H_0\) states that a specified subset of \(\ell\) coefficients is zero. Let \(\text{SS(Res)}_A\) denote the residual sum of squares of the reduced model (model A) fit without those \(\ell\) predictors. Then \[ F = \frac{(\text{SS(Res)}_A - \text{SS(Res)})/\ell}{\text{SS(Res)}/(n-p-1)} \sim F(\ell, n-p-1) \quad \text{under } H_0. \]
Coefficient of Determination
\[ R^2 = \frac{\text{SS(Reg)}}{\text{SS(Total)}} = 1 - \frac{\text{SS(Res)}}{\text{SS(Total)}}. \]\(R^2\) always increases (or stays the same) when predictors are added. It is bounded between 0 and 1, but a high \(R^2\) does not imply the model is appropriate.
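The decomposition and the identities \(R^2 = r^2\) and \(F = t^2\) (for a single predictor) can be checked numerically; the data below are made up:

```python
# ANOVA decomposition, R^2, and the overall F statistic for an SLR fit.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.3, 5.9, 8.2, 9.8]          # made-up data
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
Syy = sum((y - ybar) ** 2 for y in ys)
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
fitted = [b0 + b1 * x for x in xs]
ss_total = Syy
ss_reg = sum((m - ybar) ** 2 for m in fitted)
ss_res = sum((y - m) ** 2 for y, m in zip(ys, fitted))
R2 = ss_reg / ss_total
F = (ss_reg / 1) / (ss_res / (n - 2))   # p = 1 predictor
r = Sxy / (Sxx * Syy) ** 0.5
print(R2, F)
```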
2.7 Confidence Region for the Coefficient Vector
Individual confidence intervals for \(\beta_j\) do not account for the joint distribution of all coefficients. A \(100(1-\alpha)\%\) confidence region for the entire vector \(\boldsymbol{\beta}\) (or a subvector) is the ellipsoid \[ \left\{\boldsymbol{\beta} : (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^\top (X^\top X) (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \leq (p+1)\,\hat{\sigma}^2\, F_{1-\alpha}(p+1,\, n-p-1)\right\}. \]
This ellipsoid is centred at \(\hat{\boldsymbol\beta}\) and has axes determined by the eigenvectors of \((X^\top X)^{-1}\). When predictors are correlated, the ellipsoid is elongated, reflecting the joint uncertainty in the coefficients.
A joint confidence region for a subvector \(({\beta}_j, \beta_k)\) is obtained by projecting the full ellipsoid onto the \((\beta_j, \beta_k)\)-plane. This region can be a non-axis-aligned ellipse, illustrating that the individual \(t\)-based confidence intervals for \(\beta_j\) and \(\beta_k\) do not fully capture the joint uncertainty.
2.8 Worked Example: Real Estate Data
Individual inferences:
- Each additional m\(^2\) of floor area is associated with a \$1,720 increase in price (95% CI: \$950 to \$2,490), adjusting for bathrooms. This is highly significant (\(t = 1.72/(34.5\sqrt{0.000012}) = 14.4\)).
- Each additional bathroom adds \$22,300 on average (95% CI: \$0.8K to \$43.8K), adjusting for floor area (\(t = 2.13\), \(p = 0.044\)).
Overall F-test: \(F = \text{MS(Reg)}/\text{MS(Res)} = 198.9\) on \((2, 22)\) df, \(p < 0.0001\). Adjusted \(R^2 = 0.944\). The residuals vs. fitted plot shows no obvious patterns.
Chapter 3: Categorical Predictors and Interactions
3.1 Indicator (Dummy) Variables
When an explanatory variable is categorical with \(k\) levels, we encode it using \(k - 1\) indicator (or dummy) variables. One level is designated as the reference category and is absorbed into the intercept. For example, suppose a response is modeled with a three-level promotion factor (control, promo1, promo2), taking control as the reference:
- \(\beta_0\) is the mean response for the control group (reference).
- \(\beta_1\) is the difference in mean response between promo1 and control.
- \(\beta_2\) is the difference between promo2 and control.
- \(\beta_1 - \beta_2\) is the difference between promo1 and promo2.
Why Not Use All \(k\) Indicators?
If we create an indicator for every level, the column of ones (intercept) becomes a linear combination of the indicator columns, making \(X^\top X\) singular. With \(k\) levels, we always use \(k - 1\) indicator variables plus an intercept.
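Reference coding is mechanical; a sketch in Python (`dummy_code` is a hypothetical helper written for this illustration, not course code):

```python
# Encode a k-level factor as k-1 indicator columns relative to a reference level.
def dummy_code(levels, reference):
    cats = [c for c in sorted(set(levels)) if c != reference]
    rows = [[1 if lvl == c else 0 for c in cats] for lvl in levels]
    return rows, cats

groups = ["control", "promo1", "promo2", "control", "promo1"]
cols, names = dummy_code(groups, reference="control")
print(names)   # the k-1 = 2 non-reference levels
print(cols)    # one (promo1, promo2) indicator pair per observation
```

Appending a column of ones for the intercept to `cols` gives a full-rank design matrix; adding a third indicator for `control` would make it singular.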
ANOVA as Regression
A one-way ANOVA comparing \(k\) group means is equivalent to a regression model with \(k-1\) indicator variables and an intercept. The F-test for overall group differences corresponds to the partial F-test for the joint significance of all \(k-1\) indicator coefficients.
Comparing Two Non-Reference Groups
To compare two non-reference groups, test \(H_0\colon \beta_j = \beta_k\) using \(\hat{\beta}_j - \hat{\beta}_k\) with standard error \[ \operatorname{SE}(\hat{\beta}_j - \hat{\beta}_k) = \hat{\sigma}\sqrt{V_{jj} + V_{kk} - 2V_{jk}}, \]where \(V = (X^\top X)^{-1}\).
3.2 Interaction Terms
For two continuous predictors, an interaction is modeled with a product term: \[ Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2} + \varepsilon_i. \]Collecting the terms involving \(x_{i1}\), \[ Y_i = \beta_0 + (\beta_1 + \beta_3 x_{i2}) x_{i1} + \beta_2 x_{i2} + \varepsilon_i, \]showing that the slope of \(Y\) with respect to \(x_1\) changes with \(x_2\).
Interactions with Categorical Variables
Now let \(x_{i2}\) be a binary group indicator:
\[ Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2} + \varepsilon_i. \]
- For group 0 (\(x_{i2} = 0\)): \(\mathbb{E}[Y] = \beta_0 + \beta_1 x_1\) (intercept \(\beta_0\), slope \(\beta_1\)).
- For group 1 (\(x_{i2} = 1\)): \(\mathbb{E}[Y] = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)x_1\) (different intercept, different slope).
3.3 Analysis of Covariance (ANCOVA)
ANCOVA compares group means while adjusting for a continuous covariate. With covariate \(x_1\) and group indicator \(x_2\), the additive model is \[ Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i. \]Under this model the two groups have parallel regression lines with different intercepts: the difference in adjusted group means is \(\beta_2\), and it is estimated holding \(x_1\) constant.
If the interaction term \(\beta_3 x_{i1} x_{i2}\) is also included, the group difference \(\beta_2 + \beta_3 x_1\) varies with the covariate, and the interpretation requires specifying the value of \(x_1\).
3.4 General Linear Hypotheses and Extra Sum of Squares
To test a hypothesis that places \(r\) independent linear restrictions on the coefficients, compare the full model with the reduced model obtained by imposing them: \[ F = \frac{(\text{SS(Reg, Full)} - \text{SS(Reg, Reduced)})/r}{\text{SS(Res, Full)}/(n - p - 1)} \sim F(r, n - p - 1). \]The numerator, \(\text{SS(Reg, Full)} - \text{SS(Reg, Reduced)}\), is the extra sum of squares attributed to the restricted coefficients. Under \(H_0\), this quantity divided by \(\sigma^2\) follows a \(\chi^2(r)\) distribution, which leads to the F-test after dividing by the independent \(\chi^2(n-p-1)/\sigma^2\) in the denominator.
Connecting Extra SS to the Coefficient of Determination
The extra SS framework unifies multiple special cases:
| Hypothesis | Extra SS | F-test |
|---|---|---|
| All slopes zero: \(\beta_1 = \cdots = \beta_p = 0\) | SS(Reg, Full) | Overall F in ANOVA table |
| Subset of slopes zero: \(\beta_{q+1} = \cdots = \beta_p = 0\) | SS(Reg, Full) \(-\) SS(Reg, Reduced) | Partial F-test |
| Single coefficient: \(\beta_j = 0\) | SS(\(\beta_j\) \(\mid\) others) | \(F = t_j^2\) |
In all cases, the numerator is a difference in SS(Reg) between the full and reduced model.
Chapter 4: Model Checking and Diagnostics
4.1 Residual Analysis
The fundamental purpose of residual analysis is to assess whether the LINE assumptions hold. Since the true errors \(\varepsilon_i\) are unobservable, we use the residuals \(e_i = y_i - \hat{y}_i\) as surrogates. However, residuals are not identical to errors: \(\boldsymbol{e} = (I - H)\boldsymbol{\varepsilon}\), so the residuals are correlated and have unequal variances even when the errors are iid.
Ordinary vs. Studentized Residuals
The internally studentized residuals rescale each residual by its estimated standard deviation: \[ r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}. \]These have approximately constant variance (approximately \(t\)-distributed in finite samples), making them more appropriate for diagnostic plots.
Key Diagnostic Plots
- Residuals vs. fitted values (\(e_i\) or \(r_i\) vs. \(\hat{\mu}_i\)): should show a random scatter with no pattern. A funnel shape indicates heteroscedasticity; a curve indicates nonlinearity.
- Residuals vs. each predictor (\(e_i\) vs. \(x_{ij}\)): helps detect nonlinear relationships with specific predictors.
- Normal QQ plot of residuals: points should fall approximately along a straight line. Departures suggest non-normality.
- Residuals vs. observation order: useful for detecting autocorrelation in time-ordered data.
4.2 Non-Constant Variance and Data Transformations
When residual plots reveal non-constant variance (heteroscedasticity), a variance-stabilizing transformation on the response may help. The idea is to find a function \(g(\cdot)\) such that \(\operatorname{Var}(g(Y_i)) \approx \text{constant}\).
Using a first-order Taylor expansion, if \(\operatorname{Var}(Y_i) = h(\mu_i)\sigma^2\), we need \([g'(\mu_i)]^2 \propto 1/h(\mu_i)\), which gives:
| Variance structure | \(h(\mu)\) | Transformation \(g(y)\) |
|---|---|---|
| \(\operatorname{Var} \propto \mu\) | \(\mu\) | \(\sqrt{y}\) |
| \(\operatorname{Var} \propto \mu^2\) | \(\mu^2\) | \(\ln(y)\) |
| General power | \(\mu^C\) | Box-Cox |
Box-Cox Transformations
\[ g(y_i) = \begin{cases} (y_i^\lambda - 1)/\lambda & \lambda \neq 0 \\ \ln(y_i) & \lambda = 0 \end{cases}. \]The parameter \(\lambda\) is chosen to maximize the profile log-likelihood. Special cases include \(\lambda = 1/2\) (square root), \(\lambda = 0\) (log), \(\lambda = 1\) (identity), and \(\lambda = -1\) (reciprocal).
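Choosing \(\lambda\) by maximizing the profile log-likelihood can be sketched as a grid search. The data below are synthetic (log-linear with multiplicative noise), so the search should land near \(\lambda = 0\); the profile log-likelihood used is \(-\tfrac{n}{2}\ln(\text{RSS}(\lambda)/n) + (\lambda - 1)\sum_i \ln y_i\), up to an additive constant:

```python
import math

def boxcox(y, lam):
    # Box-Cox transform of a single positive response value.
    return math.log(y) if lam == 0 else (y ** lam - 1) / lam

def profile_loglik(xs, ys, lam):
    # Fit SLR to the transformed response, then evaluate the profile
    # log-likelihood: -n/2 * log(RSS/n) plus the Jacobian term.
    zs = [boxcox(y, lam) for y in ys]
    n = len(xs)
    xbar, zbar = sum(xs) / n, sum(zs) / n
    b1 = sum((x - xbar) * (z - zbar) for x, z in zip(xs, zs)) \
         / sum((x - xbar) ** 2 for x in xs)
    b0 = zbar - b1 * xbar
    rss = sum((z - b0 - b1 * x) ** 2 for x, z in zip(xs, zs))
    return -n / 2 * math.log(rss / n) + (lam - 1) * sum(math.log(y) for y in ys)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
noise = [1.1, 0.9, 1.05, 0.95, 1.02, 0.98, 1.04, 0.96]
ys = [math.exp(0.5 * x) * f for x, f in zip(xs, noise)]   # multiplicative errors
grid = [i / 10 for i in range(-20, 21)]
best = max(grid, key=lambda lam: profile_loglik(xs, ys, lam))
print(best)   # should be near 0, favouring the log transform
```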
Transforming Predictors and Adding Polynomial Terms
\[ Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i. \]This model is still linear in the parameters \(\beta_0, \beta_1, \beta_2\) and can be fit by OLS.
4.3 Weighted Least Squares
\[ \sum_{i=1}^n w_i (y_i - \boldsymbol{x}_i^\top \boldsymbol{\beta})^2, \qquad w_i = \frac{1}{\sigma_i^2}. \]In matrix form, \(\hat{\boldsymbol{\beta}}_W = (X^\top W X)^{-1}X^\top W \boldsymbol{y}\), where \(W = \operatorname{diag}(w_1, \ldots, w_n)\). The WLS estimator is unbiased with variance \((X^\top W X)^{-1}\).
An equivalent viewpoint is to premultiply the model by \(W^{1/2}\), producing a transformed model \(\boldsymbol{y}_w = X_w \boldsymbol{\beta} + \boldsymbol{\varepsilon}_w\) where the new errors satisfy \(\boldsymbol{\varepsilon}_w \sim \mathcal{N}(\boldsymbol{0}, I)\).
In practice, the weights are often unknown and must be estimated, leading to an iteratively reweighted least squares procedure:
- Fit OLS to obtain initial residuals \(e_i\).
- Estimate \(\sigma_i^2\) from the residuals and fitted values (e.g., by regressing \(|e_i|\) or \(e_i^2\) on \(\hat{y}_i\)).
- Set \(w_i = 1/\hat{\sigma}_i^2\) and compute \(\hat{\boldsymbol{\beta}}_W\).
- Repeat until convergence.
4.4 Outliers, Leverage, and Influence
Leverage
The leverage of observation \(i\) is \(h_{ii}\), the \(i\)-th diagonal of the hat matrix. It measures how far \(\boldsymbol{x}_i\) is from the center of the predictor space. Properties:
- \(1/n \leq h_{ii} \leq 1\).
- \(\sum_{i=1}^n h_{ii} = p + 1\), so the average leverage is \(\bar{h} = (p+1)/n\).
- A point is considered high leverage if \(h_{ii} > 2\bar{h} = 2(p+1)/n\).
In SLR, leverage has the explicit form \(h_{ii} = 1/n + (x_i - \bar{x})^2/S_{xx}\), confirming that points far from the mean of \(x\) have high leverage.
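The SLR leverage formula and the \(2(p+1)/n\) cutoff can be sketched directly; the x-values are made up so that one point sits far from the rest:

```python
# Leverage in SLR: h_ii = 1/n + (x_i - xbar)^2 / Sxx, flagging h_ii > 2(p+1)/n.
xs = [1, 2, 3, 4, 5, 12]     # made-up x's; 12 is far from the others
n = len(xs)
xbar = sum(xs) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
lev = [1 / n + (x - xbar) ** 2 / Sxx for x in xs]
cutoff = 2 * (1 + 1) / n     # 2(p+1)/n with p = 1
high = [x for x, h in zip(xs, lev) if h > cutoff]
print(high)                  # only the far-out point is flagged
```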
Outliers
An observation is an outlier if its studentized residual is unusually large, typically \(|r_i| > 2.5\) or \(|r_i| > 3\). High-leverage points can mask outliers because their residuals are pulled toward zero.
Externally Studentized (Jackknife) Residuals
Estimating \(\sigma\) with observation \(i\) removed gives \(\hat{\sigma}_{(-i)}\) and the externally studentized residual \[ r_{i(-i)} = \frac{e_i}{\hat{\sigma}_{(-i)}\sqrt{1 - h_{ii}}} = r_i \left[\frac{n - p - 2}{n - p - 1 - r_i^2}\right]^{1/2}. \]This can be computed from a single model fit without refitting \(n\) times.
Cook’s Distance
Cook's distance combines leverage and residual size: \[ D_i = \frac{h_{ii}}{1 - h_{ii}} \cdot \frac{r_i^2}{p + 1}. \]An equivalent form is \[ D_i = \frac{(\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(-i)})^\top (X^\top X)(\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(-i)})}{(p+1)\hat{\sigma}^2}, \]which directly measures how much the entire coefficient vector shifts when observation \(i\) is deleted.
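Cook's distance for a simple regression with one corrupted response, as a pure-Python sketch (the data are made up so that the last point is influential):

```python
import math

# Cook's distance D_i = h/(1-h) * r_i^2/(p+1) from one full-data fit.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.3, 5.9, 8.2, 20.0]     # made-up; the last y is pulled far off the line
n, p = len(xs), 1
xbar, ybar = sum(xs) / n, sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / Sxx
b0 = ybar - b1 * xbar
resid = [y - b0 - b1 * x for x, y in zip(xs, ys)]
sigma_hat = math.sqrt(sum(e * e for e in resid) / (n - p - 1))
lev = [1 / n + (x - xbar) ** 2 / Sxx for x in xs]
r_int = [e / (sigma_hat * math.sqrt(1 - h)) for e, h in zip(resid, lev)]
cooks = [h / (1 - h) * r * r / (p + 1) for h, r in zip(lev, r_int)]
worst = max(range(n), key=lambda i: cooks[i])
print(worst, cooks[worst])          # the corrupted last observation dominates
```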
DFFITS and DFBETAS
Cook’s distance measures global influence. Two related per-observation diagnostics provide more focused information:
- \[
\text{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(-i)}}{\hat\sigma_{(-i)}\sqrt{h_{ii}}} = r_{i(-i)}\sqrt{\frac{h_{ii}}{1 - h_{ii}}}.
\]
A rule of thumb flags \(|\text{DFFITS}_i| > 2\sqrt{(p+1)/n}\).
- \[
\text{DFBETAS}_{ij} = \frac{\hat\beta_j - \hat\beta_{j(-i)}}{\hat\sigma_{(-i)}\sqrt{[(X^\top X)^{-1}]_{jj}}}.
\]
Flagged when \(|\text{DFBETAS}_{ij}| > 2/\sqrt{n}\). This pinpoints which coefficient is most affected by each influential observation.
4.5 What to Do When Assumptions Fail
When diagnostic plots reveal assumption violations, several remedies are available:
| Violation | Diagnostic Signal | Remedy |
|---|---|---|
| Non-linearity | Curved residual vs. fitted | Add polynomial terms, transform \(x\) or \(y\) |
| Heteroscedasticity | Fanning in residual plot | WLS, log transform, Box-Cox |
| Non-normality | Heavy tails in QQ plot | Robust regression (Huber, LAD), GLMs |
| Outliers | Large \(\lvert r_i \rvert\) or \(D_i\) | Investigate, correct if erroneous; otherwise report sensitivity |
| Autocorrelation | Runs pattern in time-ordered residuals | Include time trend, time series models |
| Multicollinearity | Large VIF | Remove correlated predictors, ridge regression, PCA |
Robust Regression
An alternative to deleting outliers is to replace squared-error loss with a loss that is less sensitive to them. The Huber loss \[ \rho_k(e) = \begin{cases} e^2/2 & |e| \leq k \\ k|e| - k^2/2 & |e| > k \end{cases} \]transitions from squared loss (for small residuals) to absolute loss (for large residuals), downweighting outliers without discarding them. The parameter \(k\) (typically \(k = 1.345\hat\sigma\)) is chosen to achieve 95% efficiency under normality. Estimation proceeds by IRLS: iteratively reweight observations by \(w_i \propto \rho_k'(e_i)/e_i\) and solve a WLS problem.
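The IRLS scheme can be sketched for a simple regression. For simplicity the tuning constant `k` is held fixed rather than tied to a scale estimate \(\hat\sigma\), and the data are made up with one gross outlier:

```python
def huber_weights(resid, k):
    # w = psi(e)/e for the Huber loss: 1 inside [-k, k], k/|e| outside.
    return [1.0 if abs(e) <= k else k / abs(e) for e in resid]

def wls_slr(xs, ys, ws):
    # Weighted least squares for a line: minimize sum w_i (y_i - b0 - b1 x_i)^2.
    sw = sum(ws)
    xw = sum(w * x for w, x in zip(ws, xs)) / sw
    yw = sum(w * y for w, y in zip(ws, ys)) / sw
    b1 = sum(w * (x - xw) * (y - yw) for w, x, y in zip(ws, xs, ys)) \
         / sum(w * (x - xw) ** 2 for w, x in zip(ws, xs))
    return yw - b1 * xw, b1

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 2.1, 2.9, 4.2, 5.0, 30.0]        # made-up; the last point is a gross outlier
b0, b1 = wls_slr(xs, ys, [1.0] * len(xs))   # start from the OLS fit
for _ in range(50):                         # IRLS with Huber weights, k fixed
    resid = [y - b0 - b1 * x for x, y in zip(xs, ys)]
    b0, b1 = wls_slr(xs, ys, huber_weights(resid, k=1.5))
print(b0, b1)
```

The Huber fit stays close to the slope of the five clean points, while plain OLS is dragged up by the outlier.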
Chapter 5: Model Selection
5.1 The Model Selection Problem
Given \(p\) candidate predictors, there are \(2^p\) possible subsets of predictors. Model selection aims to identify a subset that balances goodness of fit with parsimony. Motivations for preferring simpler models include:
- Interpretability: fewer parameters are easier to understand.
- Precision: including irrelevant predictors inflates the variances of the coefficient estimates (and of predictions), which increases standard errors.
- Prediction: overly complex models overfit the training data and predict poorly on new data.
5.2 Selection Criteria
Adjusted \(R^2\)
\[ R^2_{\text{adj}} = 1 - \frac{\text{SS(Res)}/(n - k - 1)}{\text{SS(Total)}/(n - 1)} = 1 - \frac{\hat{\sigma}^2}{s^2}, \]\[ R^2_{\text{adj}} = R^2 - \frac{k}{n - k - 1}(1 - R^2). \]Maximizing \(R^2_{\text{adj}}\) is equivalent to minimizing \(\hat{\sigma}^2\). Unlike \(R^2\), the adjusted version can decrease when a useless predictor is added.
Akaike Information Criterion (AIC)
\[ \text{AIC} = 2q - 2\ln L(\hat{\theta}), \]\[ \text{AIC} = n \ln\!\left(\frac{\text{SS(Res)}}{n}\right) + 2q. \]Lower AIC is preferred. AIC penalizes complexity through the \(2q\) term.
Bayesian Information Criterion (BIC)
\[ \text{BIC} = q \ln(n) - 2\ln L(\hat{\theta}). \]BIC applies a stronger penalty than AIC (since \(\ln n > 2\) for \(n \geq 8\)), so it tends to select simpler models.
Mallows' \(C_p\)
\[ C_p = \frac{\text{SS(Res)}_k}{\text{MS(Res)}_{\text{full}}} + 2(k + 1) - n, \]where \(\text{MS(Res)}_{\text{full}}\) is the residual mean square from the full \(p\)-predictor model. A model is considered suitable if \(C_p \leq k + 1\). When \(C_p\) is much larger than \(k+1\), the model suffers from substantial bias.
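The criteria above can be compared side by side. The SS(Res) values below are made up for illustration, with \(n = 25\) and a three-predictor full model; AIC and BIC are shown up to additive constants, counting \(q = k + 1\) mean parameters:

```python
import math

# Compare candidate models by adjusted R^2, AIC, BIC, and Mallows' C_p.
n, ss_total = 25, 1000.0
models = {("x1",): 400.0, ("x1", "x2"): 150.0, ("x1", "x2", "x3"): 148.0}
full_ms_res = 148.0 / (n - 3 - 1)        # MS(Res) of the full 3-predictor model

for preds, ss_res in models.items():
    k = len(preds)
    r2_adj = 1 - (ss_res / (n - k - 1)) / (ss_total / (n - 1))
    aic = n * math.log(ss_res / n) + 2 * (k + 1)
    bic = n * math.log(ss_res / n) + (k + 1) * math.log(n)
    cp = ss_res / full_ms_res + 2 * (k + 1) - n
    print(preds, round(r2_adj, 3), round(aic, 1), round(bic, 1), round(cp, 1))
```

With these numbers, the two-predictor model wins on every criterion: the third predictor barely reduces SS(Res), so the complexity penalties dominate. The full model's \(C_p\) equals \(k + 1\) by construction.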
5.3 Search Strategies
Forward Selection
- Start with only the intercept.
- Among all \(p\) one-predictor models, choose the one that best improves the criterion.
- With the chosen predictor fixed, evaluate all two-predictor models, and add the next best.
- Continue until no further improvement is achieved.
Backward Elimination
- Start with all \(p\) predictors.
- Remove the predictor whose removal least worsens the criterion (or improves it).
- Continue until all remaining predictors are deemed important.
Stepwise Selection
Combine forward and backward steps: after each addition, check whether any previously included predictor can now be dropped. This prevents variables from being “locked in” early.
Exhaustive Search
Fit all \(2^p\) possible models and select the best by the chosen criterion. Computationally feasible only for moderate \(p\) (e.g., \(p \leq 20\) with modern algorithms like the branch-and-bound method in the leaps package).
5.4 Multicollinearity and Variance Inflation Factors
Multicollinearity occurs when predictor variables are strongly linearly related. It inflates the variances of the OLS estimates (the diagonal entries of \(\sigma^2(X^\top X)^{-1}\)), making individual coefficients unreliable without substantially affecting overall prediction.
The variance inflation factor for predictor \(j\) is \(\text{VIF}_j = 1/(1 - R_j^2)\), where \(R_j^2\) is the coefficient of determination from regressing \(x_j\) on the remaining predictors. Guidelines:
- \(\text{VIF}_j = 1\) means no collinearity with other predictors.
- \(\text{VIF}_j > 10\) (i.e., \(R_j^2 > 0.9\)) indicates serious multicollinearity.
- Remedies include removing redundant predictors, combining correlated predictors, or using regularization methods (ridge, LASSO).
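With only two predictors, \(R_j^2\) comes from a simple regression of one column on the other, so the VIF is easy to compute by hand. A sketch with made-up, nearly collinear columns:

```python
# VIF for x1 from the auxiliary regression of x1 on x2; with a single other
# predictor, R_j^2 equals the squared sample correlation between the columns.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 2.0, 3.2, 3.9, 5.1, 6.0]   # made-up, nearly collinear with x1
n = len(x1)
m1, m2 = sum(x1) / n, sum(x2) / n
s11 = sum((a - m1) ** 2 for a in x1)
s22 = sum((b - m2) ** 2 for b in x2)
s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
r2_aux = s12 ** 2 / (s11 * s22)       # R^2 of x1 on x2
vif = 1 / (1 - r2_aux)
print(vif)                            # far above the VIF > 10 threshold
```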
Chapter 6: Cross-Validation and Prediction
6.1 Prediction Error and Overfitting
For a new observation, the target is the mean squared prediction error \[ \text{MSPE} = \mathbb{E}\!\left[(Y_{\text{new}} - \hat{y}_{\text{new}})^2\right], \]which we wish to estimate honestly.
6.2 Holdout Validation
Split the data into a training set, used to fit the model, and a test set \(S_{\text{test}}\) of size \(n_{\text{test}}\), used only to evaluate it: \[ \text{MSE}_{\text{test}} = \frac{1}{n_{\text{test}}} \sum_{i \in S_{\text{test}}} (y_i - \hat{y}_i)^2. \]The disadvantage is that a large portion of the data is not used for fitting, potentially increasing variance.
6.3 Leave-One-Out Cross-Validation (LOOCV)
Each observation is held out in turn and predicted from the remaining \(n - 1\): \[ \text{MSE}_{\text{CV}} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_{i(-i)})^2. \]For a linear model fit by least squares, the leave-one-out residual satisfies the identity \[ y_i - \hat{y}_{i(-i)} = \frac{e_i}{1 - h_{ii}}, \]so \[ \text{MSE}_{\text{CV}} = \frac{1}{n}\sum_{i=1}^n \left(\frac{e_i}{1 - h_{ii}}\right)^2, \]which requires fitting the model only once.
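The shortcut can be verified against brute-force refitting. A pure-Python sketch on made-up data (for OLS the two agree exactly, up to floating-point error):

```python
# LOOCV for SLR: hat-value shortcut vs. refitting the model n times.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.3, 5.9, 8.2, 9.8]   # made-up data
n = len(xs)

def slr_fit(xs, ys):
    m = len(xs)
    xbar, ybar = sum(xs) / m, sum(ys) / m
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
         / sum((x - xbar) ** 2 for x in xs)
    return ybar - b1 * xbar, b1

# Shortcut: one fit, residuals scaled by 1/(1 - h_ii).
b0, b1 = slr_fit(xs, ys)
xbar = sum(xs) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
cv_shortcut = sum(((y - b0 - b1 * x) / (1 - (1 / n + (x - xbar) ** 2 / Sxx))) ** 2
                  for x, y in zip(xs, ys)) / n

# Brute force: refit n times, each time leaving one observation out.
total = 0.0
for i in range(n):
    a0, a1 = slr_fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    total += (ys[i] - a0 - a1 * xs[i]) ** 2
cv_brute = total / n
print(cv_shortcut, cv_brute)     # identical up to rounding
```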
6.4 \(K\)-Fold Cross-Validation
Randomly partition the data into \(K\) roughly equal subsets (folds). For each fold \(k\):
- Fit the model on the remaining \(K - 1\) folds.
- Predict the held-out fold and compute \(\text{MSE}_k = \frac{1}{n_k}\sum_{i \in S_k}(y_i - \hat{y}_i)^2\).
Common choices are \(K = 5\) or \(K = 10\). The overall CV estimate averages the \(\text{MSE}_k\) across folds (weighted by fold size if the folds are unequal). Setting \(K = n\) recovers LOOCV.
6.5 Combining Cross-Validation with Model Selection
When selecting among models, the recommended workflow involves three conceptual data splits:
- Training set: used to fit candidate models.
- Validation set (or CV folds within training): used to compare models and choose the best.
- Test set: reserved exclusively for estimating the prediction error of the final selected model.
Cross-validation can replace the validation set by repeatedly splitting the training data. After selecting the best model by CV, refit it on the full training data, then report the test-set error as the final performance estimate.
The One-Standard-Error Rule
Let \(\widehat{\text{MSPE}}_{\min}\) be the smallest cross-validated error among the candidate models and \(\widehat{\text{SE}}_{\min}\) its standard error across folds; then \[ \text{select the most parsimonious model with} \quad \widehat{\text{MSPE}} \leq \widehat{\text{MSPE}}_{\min} + \widehat{\text{SE}}_{\min}. \]This guards against overfitting by preferring simpler models when the evidence for a more complex one is not compelling.
Worked Example: Variable Selection via CV
6.6 Bias-Variance Tradeoff
The expected squared prediction error at a new point decomposes as \[ \mathbb{E}\!\left[(Y_0 - \hat{y}_0)^2\right] = \operatorname{Var}(Y_0) + \operatorname{Bias}^2(\hat{y}_0) + \operatorname{Var}(\hat{y}_0). \]
- Irreducible error \(\operatorname{Var}(Y_0) = \sigma^2\): cannot be reduced by any model.
- Bias: model underfitting (too few predictors) introduces systematic error. Decreases as model complexity increases.
- Variance of prediction: overfitting means \(\hat{\boldsymbol\beta}\) varies considerably across training samples. Increases with model complexity.
Adding predictors always decreases in-sample RSS, but may increase MSPE if the variance increase outweighs the bias reduction. This is the fundamental tension that cross-validation addresses empirically.
Connection to Information Criteria
When the true model is in the candidate set and the sample is large, cross-validation and information criteria are asymptotically equivalent:
- AIC minimizes estimated prediction error: \(\text{AIC} = n\ln(\hat\sigma^2) + 2(p+1)\).
- BIC minimizes \(n\ln(\hat\sigma^2) + (p+1)\ln(n)\), applying a heavier penalty that favours sparser models.
- LOOCV is asymptotically equivalent to model selection by AIC for linear models (Stone, 1977).
In small samples, CV provides a more direct and distribution-free estimate of predictive performance than AIC or BIC.
Chapter 7: Generalized Linear Models
7.1 Motivation: Beyond the Normal Linear Model
The normal linear model assumes \(Y_i \sim \mathcal{N}(\mu_i, \sigma^2)\) with \(\mu_i = \boldsymbol{x}_i^\top \boldsymbol{\beta}\). However, many practical situations involve responses that are not normally distributed:
- Binary responses (\(Y_i \in \{0, 1\}\)): success/failure outcomes.
- Count data (\(Y_i \in \{0, 1, 2, \ldots\}\)): number of events in a fixed period.
Generalized linear models (GLMs) extend the linear model framework to handle such responses.
7.2 Exponential Family and Link Functions
A GLM consists of three components:
- Random component: \(Y_i\) follows a distribution from the exponential family (e.g., Normal, Bernoulli, Binomial, Poisson).
- Systematic component: a linear predictor \(\eta_i = \boldsymbol{x}_i^\top \boldsymbol{\beta}\).
- Link function: a monotone differentiable function \(g\) relating the mean to the linear predictor: \(g(\mu_i) = \eta_i\).
| Response type | Distribution | Link function \(g(\mu)\) | Link name |
|---|---|---|---|
| Continuous | Normal | \(\mu\) | Identity |
| Binary | Bernoulli | \(\ln(\mu/(1-\mu))\) | Logit |
| Count | Poisson | \(\ln(\mu)\) | Log |
7.3 Logistic Regression
For a binary response \(Y_i\) with success probability \(\pi_i = \Pr(Y_i = 1)\), logistic regression models the log-odds as linear in the predictors: \[ \ln\!\left(\frac{\pi_i}{1 - \pi_i}\right) = \boldsymbol{x}_i^\top \boldsymbol{\beta}. \]Inverting the logit gives \[ \pi_i = \frac{e^{\boldsymbol{x}_i^\top \boldsymbol{\beta}}}{1 + e^{\boldsymbol{x}_i^\top \boldsymbol{\beta}}}. \]Since \(Y_i \sim \text{Bernoulli}(\pi_i)\), we have \(\mathbb{E}[Y_i] = \pi_i\) and \(\operatorname{Var}(Y_i) = \pi_i(1 - \pi_i)\). Note that the variance is a function of the mean; the errors are inherently heteroscedastic.
Parameter Estimation
The Bernoulli log-likelihood is \[ \ell(\boldsymbol{\beta}) = \sum_{i=1}^n \bigl[y_i \ln(\pi_i) + (1 - y_i)\ln(1 - \pi_i)\bigr] = \sum_{i=1}^n \bigl[y_i \boldsymbol{x}_i^\top \boldsymbol{\beta} - \ln(1 + e^{\boldsymbol{x}_i^\top \boldsymbol{\beta}})\bigr]. \]There is no closed-form solution. The MLE is found by iteratively reweighted least squares (IRLS), a form of Newton-Raphson applied to the score equations.
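A minimal Newton-Raphson sketch for one predictor, on synthetic 0/1 data; the score vector and information matrix are the standard logistic-regression expressions, and the 2x2 system is solved directly:

```python
import math

# Newton-Raphson (equivalently IRLS) for logistic regression with one predictor.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]      # made-up binary outcomes
b0, b1 = 0.0, 0.0
for _ in range(25):
    ps = [1 / (1 + math.exp(-(b0 + b1 * x))) for x in xs]
    # Score vector of the log-likelihood ...
    g0 = sum(y - p for y, p in zip(ys, ps))
    g1 = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
    # ... and the Fisher information X^T W X with W = diag(p(1-p)).
    w = [p * (1 - p) for p in ps]
    h00, h01 = sum(w), sum(wi * x for wi, x in zip(w, xs))
    h11 = sum(wi * x * x for wi, x in zip(w, xs))
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det    # Newton step: solve H d = g
    b1 += (h00 * g1 - h01 * g0) / det
print(b0, b1)
```

At convergence the score equations hold, so the fitted probabilities reproduce the observed number of successes.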
Interpretation of Coefficients
A one-unit increase in \(x_j\), holding the other predictors fixed, increases the log-odds by \(\beta_j\); equivalently, it multiplies the odds \(\pi/(1-\pi)\) by the odds ratio \(e^{\beta_j}\).
Binomial Count Data
For grouped data with \(Y_i \sim \text{Binomial}(n_i, \pi_i)\) independently for \(i = 1, \ldots, m\), the log-likelihood takes the same form up to an additive constant \(C\) (the binomial coefficients, which do not involve \(\boldsymbol{\beta}\)):
\[ \ell(\boldsymbol{\beta}) = \sum_{i=1}^m \bigl[y_i \ln(\pi_i) + (n_i - y_i)\ln(1 - \pi_i)\bigr] + C. \]
7.4 Poisson Regression
For count responses \(Y_i \sim \text{Poisson}(\mu_i)\), the log link gives
\[ \ln(\mu_i) = \boldsymbol{x}_i^\top \boldsymbol{\beta}, \]
so \(\mu_i = e^{\boldsymbol{x}_i^\top \boldsymbol{\beta}}\). Here \(\mathbb{E}[Y_i] = \operatorname{Var}(Y_i) = \mu_i\).
The coefficient \(\beta_j\) has a multiplicative interpretation: a one-unit increase in \(x_j\) multiplies the expected count by \(e^{\beta_j}\).
Offset Terms
When counts are observed over different exposures \(t_i\) (time periods, populations, traffic volumes), we model the rate \(\lambda_i = \mu_i / t_i\) via
\[ \ln(\mu_i) = \ln(t_i) + \boldsymbol{x}_i^\top \boldsymbol{\beta}, \]
where \(\ln(t_i)\) is the offset (a known constant, not an estimated coefficient). This models the rate \(\lambda_i\) as a log-linear function of the predictors.
Example: suppose intersection accident counts are modelled with a log-exposure offset and two predictors, daily traffic volume (in thousands of vehicles) and a bike-lane indicator, with estimates \(\hat\beta_{\text{traffic}} = 0.31\) and \(\hat\beta_{\text{bike}} = -0.47\). Interpretation:
- Each 1,000-vehicle increase in daily traffic multiplies the expected accident rate by \(e^{0.31} = 1.36\) (36% increase).
- Presence of a bike lane multiplies the rate by \(e^{-0.47} = 0.62\) (38% reduction), adjusting for traffic volume.
A likelihood ratio test comparing this model to the intercept-only model gives \(\Delta D = 41.2\) on 2 df (\(p < 0.001\)), confirming the joint significance of both predictors.
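The multiplicative readings above follow directly from exponentiating the coefficients quoted in the example; a quick check:

```python
import math

# Rate model from the example: ln(mu_i) = ln(t_i) + 0.31*traffic_i - 0.47*bike_i
b_traffic, b_bike = 0.31, -0.47

rr_traffic = math.exp(b_traffic)  # ≈ 1.363: each extra 1,000 vehicles/day
                                  # raises the expected rate by about 36%
rr_bike = math.exp(b_bike)        # ≈ 0.625: a bike lane cuts the rate by
                                  # roughly 37-38%, holding traffic fixed
print(rr_traffic, rr_bike)
```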
7.5 Inference in GLMs
Wald Statistic
To test a single coefficient, the Wald statistic compares the estimate to its standard error:
\[ z = \frac{\hat{\beta}_j}{\operatorname{SE}(\hat{\beta}_j)} \;\dot{\sim}\; \mathcal{N}(0, 1) \quad \text{under } H_0\colon \beta_j = 0. \]
The squared Wald statistic \(z^2 \;\dot{\sim}\; \chi^2_1\) can also be used.
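A short sketch of the Wald test using the standard-normal identity \(p = \operatorname{erfc}(|z|/\sqrt{2})\) for a two-sided p-value; the estimate and standard error here are purely illustrative, not from a real fit:

```python
import math

# Wald test for H0: beta_j = 0 (hypothetical estimate and SE for illustration)
beta_hat, se = 0.31, 0.12

z = beta_hat / se                      # ≈ 2.58
p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value, ≈ 0.01
print(z, p)                            # reject H0 at the 5% level
```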
Deviance
The deviance measures the discrepancy between the fitted model and the saturated model (which has one parameter per observation and fits the data perfectly):
\[ D = 2\bigl[\ell(\text{saturated}) - \ell(\text{fitted})\bigr]. \]
For a well-fitting model, \(D/(n - p - 1) \approx 1\). More precisely, under the null hypothesis that the model is correct, \(D \;\dot{\sim}\; \chi^2(n - p - 1)\) for sufficiently large samples (this approximation is unreliable for ungrouped binary data).
Comparing Models via Deviance
For nested models, the drop in deviance from the reduced to the full model tests the extra parameters:
\[ D_{\text{red}} - D_{\text{full}} = 2\bigl[\ell(\text{full}) - \ell(\text{reduced})\bigr] \;\dot{\sim}\; \chi^2(df_{\text{red}} - df_{\text{full}}). \]
This is the likelihood ratio test and is analogous to the partial F-test in normal linear models.
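For 2 degrees of freedom the chi-square survival function has the closed form \(P(\chi^2_2 > x) = e^{-x/2}\), so the traffic-accident test from the Poisson example (\(\Delta D = 41.2\) on 2 df) can be checked by hand:

```python
import math

# LRT from the accident-rate example: Delta D = 41.2 on 2 df.
# For exactly 2 degrees of freedom, P(chi^2_2 > x) = exp(-x/2).
delta_D = 41.2
p = math.exp(-delta_D / 2)   # chi-square survival function, df = 2 only
print(p < 0.001)             # → True: reject the intercept-only model
```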
AIC for GLMs
\[ \text{AIC} = 2(p + 1) - 2\ell(\hat{\boldsymbol{\beta}}). \]
Lower AIC is preferred. Stepwise selection can be carried out using AIC as the criterion, as implemented in R’s stepAIC function (from the MASS package).
7.6 Diagnostics for GLMs
Ordinary residuals \(y_i - \hat\mu_i\) are not meaningful for GLMs because the variance of \(Y_i\) depends on \(\mu_i\). Instead, standardised residuals are based on the Pearson residual or deviance residual.
The Pearson residual standardises each raw residual by the standard deviation implied by the model:
\[ r_i^P = \frac{y_i - \hat\mu_i}{\sqrt{V(\hat\mu_i)}}, \]
where \(V(\mu)\) is the variance function (\(V(\mu) = \mu(1-\mu)/n_i\) for the binomial proportion, \(V(\mu) = \mu\) for Poisson). Plotting \(r_i^P\) against the linear predictor \(\hat\eta_i\) or against individual predictors checks for lack of fit and heterogeneity.
The deviance residual for observation \(i\) is \(d_i = \operatorname{sign}(y_i - \hat\mu_i)\sqrt{2[\ell_i^{\text{sat}} - \ell_i^{\text{fit}}]}\), where \(\ell_i^{\text{sat}}\) is observation \(i\)'s contribution to the saturated log-likelihood and \(\ell_i^{\text{fit}}\) its contribution under the fitted model. The sum of squared deviance residuals equals the total deviance \(D\).
Overdispersion occurs when the observed variance exceeds that implied by the assumed variance function. For grouped binomial or count data, it manifests as \(\sum_i (r_i^P)^2 / (n - p - 1) > 1\). Remedies include quasi-likelihood models (which estimate a dispersion parameter) and beta-binomial regression.
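The dispersion check above amounts to a few lines of arithmetic. A sketch for a Poisson fit, where the observed counts and fitted means are hypothetical values chosen for illustration:

```python
import math

# Dispersion check via Pearson residuals for a Poisson fit.
# y and mu_hat are hypothetical observed counts and fitted means.
y      = [2, 0, 5, 3, 9, 1, 4, 6]
mu_hat = [1.8, 0.9, 4.2, 3.5, 5.1, 1.2, 4.0, 5.3]
p = 1                  # number of predictors in the fitted model (assumed)
n = len(y)

# Pearson residuals with Poisson variance function V(mu) = mu
r_pearson = [(yi - mi) / math.sqrt(mi) for yi, mi in zip(y, mu_hat)]

# Dispersion estimate: sum of squared Pearson residuals over residual df
phi_hat = sum(r * r for r in r_pearson) / (n - p - 1)
print(phi_hat)   # values well above 1 would suggest overdispersion
```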
Appendix: Notation, Key Results, and Model Comparison Summary
Notation
| Symbol | Meaning |
|---|---|
| \(Y_i\) | Response for the \(i\)-th observation |
| \(\boldsymbol{X}\) | \(n \times (p+1)\) design matrix (includes column of 1s) |
| \(\boldsymbol{\beta}\) | \((p+1)\)-vector of regression coefficients |
| \(\hat{\boldsymbol{\beta}} = (X^\top X)^{-1}X^\top\boldsymbol{y}\) | OLS (MLE) estimator |
| \(\boldsymbol{H} = X(X^\top X)^{-1}X^\top\) | Hat (projection) matrix |
| \(\boldsymbol{e} = (I - H)\boldsymbol{y}\) | Vector of OLS residuals |
| \(h_{ii}\) | \(i\)-th diagonal of \(H\) (leverage) |
| \(\hat{\sigma}^2 = \|\boldsymbol{e}\|^2/(n-p-1)\) | Unbiased estimator of \(\sigma^2\) |
| \(r_i = e_i / (\hat\sigma\sqrt{1-h_{ii}})\) | Studentized residual |
Key Distributional Results
- \(\hat{\boldsymbol\beta} \sim \mathcal{N}(\boldsymbol\beta, \sigma^2(X^\top X)^{-1})\).
- \((n-p-1)\hat\sigma^2/\sigma^2 \sim \chi^2(n-p-1)\), independent of \(\hat{\boldsymbol\beta}\).
- \(\hat\beta_j / (\hat\sigma\sqrt{[(X^\top X)^{-1}]_{jj}}) \sim t(n-p-1)\) under \(H_0\colon \beta_j = 0\).
- The partial F-statistic \(\sim F(r, n-p-1)\) under \(H_0\colon C\boldsymbol\beta = \boldsymbol 0\).
Model Comparison Criteria Summary
| Criterion | Formula | Direction | Focus |
|---|---|---|---|
| \(R^2\) | \(1 - \text{SS(Res)}/\text{SS(Tot)}\) | Maximize | In-sample fit |
| Adjusted \(R^2\) | \(1 - \hat\sigma^2/s^2\) | Maximize | Fit adjusted for complexity |
| AIC | \(n\ln(\text{SS(Res)}/n) + 2(p+2)\) | Minimize | Predictive loss |
| BIC | \(n\ln(\text{SS(Res)}/n) + (p+2)\ln n\) | Minimize | Model posterior |
| Mallows’ \(C_p\) | \(\text{SS(Res)}_k/\hat\sigma_{\text{full}}^2 + 2(k+1) - n\) | Minimize (\(C_p \lesssim k+1\)) | Total MSE |
| LOOCV-MSPE | \(\frac{1}{n}\sum_i \bigl(e_i/(1-h_{ii})\bigr)^2\) | Minimize | Prediction |
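The LOOCV-MSPE row uses the leave-one-out shortcut \(\hat y_{(i)} - y_i = -e_i/(1 - h_{ii})\), so no model is ever refit with a point deleted. For simple linear regression the leverages have the closed form \(h_{ii} = 1/n + (x_i - \bar x)^2 / S_{xx}\), which makes the whole computation a few lines (toy data made up for illustration):

```python
# LOOCV-MSPE for simple linear regression via the leverage shortcut
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3]   # toy data, roughly y = 2x

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)

# OLS fit: slope and intercept
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar

# Residuals and closed-form SLR leverages h_ii = 1/n + (x_i - xbar)^2 / Sxx
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
h = [1 / n + (xi - xbar) ** 2 / Sxx for xi in x]

# LOOCV-MSPE = (1/n) * sum of (e_i / (1 - h_ii))^2 -- no refitting needed
loocv_mspe = sum((ei / (1 - hi)) ** 2 for ei, hi in zip(e, h)) / n
print(loocv_mspe)   # small value: the line predicts held-out points well
```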
Glossary of Key Terms
Gauss-Markov Theorem: OLS estimator is BLUE (Best Linear Unbiased Estimator) under the assumptions of linearity, independence, zero-mean errors, and homoscedasticity. Normality is not required for BLUE.
Leverage: A high-leverage point has unusual predictor values. It is not necessarily an outlier (if it lies on the regression surface) but has the potential to strongly influence the fitted line.
Influential point: A point is influential if removing it substantially changes the fitted model. Cook’s distance combines leverage and residual size into a single diagnostic.
Multicollinearity: Near-linear dependence among predictors inflates \(\operatorname{Var}(\hat\beta_j)\) and makes coefficient estimates unstable, but does not affect prediction accuracy when the correlation structure is preserved in new data.
Identifiability: A parameter is identifiable if distinct values lead to distinct distributions. In MLR, \(\boldsymbol\beta\) is identifiable if and only if \(X^\top X\) is invertible (the columns of \(X\) are linearly independent).