AFM 113: Analytic Methods for Business 2

Daniel Jiang

Estimated study time: 1 hr 28 min


Sources and References

Primary textbook:
  • Balka, J. Introductory Statistics Explained (open access). Available at jbstatistics.com.
Supplementary:
  • Devore, J.L. (2016). Probability and Statistics for Engineering and the Sciences (9th ed.). Cengage.
  • Newbold, P., Carlson, W.L., & Thorne, B. (2013). Statistics for Business and Economics (8th ed.). Pearson.
Online resources:
  • OpenIntro Statistics (openintro.org)
  • MIT OCW 18.650: Statistics for Applications
  • Khan Academy Statistics and Probability


Chapter 1: Hypothesis Testing — Core Framework

1.1 The Logic of Hypothesis Testing

Hypothesis testing is the engine of classical statistical inference. It provides a structured, reproducible procedure for deciding whether sample evidence is strong enough to warrant rejecting a baseline claim about a population. The process translates a business question into competing statistical hypotheses, computes a test statistic from data, and renders a verdict calibrated to a pre-specified tolerance for error.

Null Hypothesis \(H_0\): The baseline claim — a statement of "no effect," "no difference," or "status quo." It is assumed true until evidence to the contrary accumulates. Mathematically it is always expressed as an equality (or weak inequality): e.g., \(\mu = \mu_0\), \(\mu \leq \mu_0\), or \(p = p_0\).
Alternative Hypothesis \(H_a\): The claim the analyst seeks to establish — that some effect exists, some difference is real, or some parameter has moved away from the null value. It is the logical complement of \(H_0\) in the region of interest.

The two hypotheses are exhaustive and mutually exclusive. In practice, we never “prove” \(H_0\) true — we either reject it (sufficient evidence against it) or fail to reject it (insufficient evidence, not proof of truth). The asymmetry is intentional: the null is a straw-man position that requires compelling evidence to overturn.

Steps in a Hypothesis Test

  1. State \(H_0\) and \(H_a\) precisely, in terms of a named population parameter.
  2. Choose a significance level \(\alpha\) — the maximum probability of incorrectly rejecting \(H_0\) the analyst will tolerate. Common choices: 0.10, 0.05, 0.01.
  3. Select the appropriate test statistic and verify conditions for its validity.
  4. Compute the test statistic from sample data.
  5. Find the p-value (probability of a result at least as extreme as observed, under \(H_0\)) or compare the test statistic to the critical value.
  6. Decide: reject \(H_0\) if p-value \(< \alpha\) (equivalently, if |test statistic| exceeds the critical value for two-tailed tests).
  7. Interpret in context — translate the statistical conclusion into a business-meaningful statement.

1.2 One-Tailed vs. Two-Tailed Tests

The choice of tail direction is dictated by the research question, not by what you observe in the data — it must be specified before looking at results.

Two-tailed test: \(H_a: \mu \neq \mu_0\). Used when a difference in either direction is of interest. Rejection region is split equally between both tails: each tail has area \(\alpha/2\).
Upper-tailed (right-tailed) test: \(H_a: \mu > \mu_0\). The entire rejection region lies in the upper tail. Used when only an increase is of concern.
Lower-tailed (left-tailed) test: \(H_a: \mu < \mu_0\). The entire rejection region lies in the lower tail. Used when only a decrease is of concern.
One-tailed tests are more powerful than two-tailed tests at detecting deviations in the anticipated direction — the full \(\alpha\) is concentrated in one tail rather than split. However, they provide no protection against surprises in the opposite direction. In accounting and finance, where unexpected deviations in either direction can be material, two-tailed tests are the default unless strong prior theory dictates directionality.

1.3 Type I and Type II Errors

Any binary decision procedure operating under uncertainty will sometimes be wrong. There are exactly two ways to err:

Decision \ Truth | \(H_0\) Actually True | \(H_0\) Actually False
Fail to Reject \(H_0\) | Correct (probability \(1 - \alpha\)) | Type II Error (probability \(\beta\))
Reject \(H_0\) | Type I Error (probability \(\alpha\)) | Correct — Power (probability \(1 - \beta\))

Type I Error: Rejecting \(H_0\) when it is in fact true — a false positive. The probability is \(\alpha\), the significance level. In auditing, a Type I error might mean flagging a compliant transaction as fraudulent.
Type II Error: Failing to reject \(H_0\) when it is in fact false — a false negative. The probability is \(\beta\). In auditing, a Type II error means failing to detect an actual fraud.
Statistical Power: The probability of correctly rejecting a false \(H_0\). Power \(= 1 - \beta\). Power depends on (a) the true effect size \(\delta = |\mu - \mu_0|\), (b) sample size \(n\), (c) population standard deviation \(\sigma\), and (d) significance level \(\alpha\).

There is an inherent trade-off: decreasing \(\alpha\) (making it harder to reject \(H_0\)) reduces Type I errors but increases \(\beta\) (Type II errors), lowering power. The only way to reduce both simultaneously is to increase sample size. In financial auditing, regulatory settings, and drug approval, the costs of each error type differ sharply — this asymmetry should inform the choice of \(\alpha\).

Power Analysis and Sample Size

The power of a one-sample z-test against the specific alternative \(\mu = \mu_1\) is:

\[ \text{Power} = P\!\left(Z > z_\alpha - \frac{|\mu_1 - \mu_0|}{\sigma/\sqrt{n}}\right) \]

for an upper-tailed test. To achieve power \(1 - \beta\) at effect size \(\delta = \mu_1 - \mu_0\) with significance level \(\alpha\):

\[ n = \left(\frac{(z_\alpha + z_\beta)\,\sigma}{\delta}\right)^2 \]

For a two-tailed test replace \(z_\alpha\) with \(z_{\alpha/2}\). This formula is the foundation of sample size planning in surveys, clinical trials, and audit procedures.

Power analysis: An auditor wants to detect a mean overstatement of \$500 in an account where \(\sigma = \$1{,}200\). Using \(\alpha = 0.05\) (two-tailed, \(z_{0.025} = 1.960\)) and desired power 80% (\(z_{0.20} = 0.842\)): \[ n = \left(\frac{(1.960 + 0.842) \times 1200}{500}\right)^2 = \left(\frac{2.802 \times 1200}{500}\right)^2 = \left(\frac{3362.4}{500}\right)^2 = (6.725)^2 \approx 45.2 \]

Round up to \(n = 46\) invoices. Sampling fewer than 46 items would leave the test underpowered and likely to miss a real misstatement of $500.
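
The same calculation can be scripted in R; a minimal sketch using the z-based formula above (power.t.test() in base R does a comparable calculation with the t distribution and returns a slightly larger n):

alpha <- 0.05; power <- 0.80
delta <- 500;  sigma <- 1200
z_alpha <- qnorm(1 - alpha / 2)    # 1.960 (two-tailed)
z_beta  <- qnorm(power)            # 0.842
n <- ((z_alpha + z_beta) * sigma / delta)^2
ceiling(n)                         # 46 invoices
# Cross-check: power.t.test(delta = 500, sd = 1200, sig.level = 0.05,
#                           power = 0.80, type = "one.sample")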

1.4 The p-Value: Interpretation and Misuse

p-value: The probability, computed assuming \(H_0\) is true, of obtaining a test statistic at least as extreme as the one observed in the sample. A small p-value means the observed data would be unusual if \(H_0\) were true.

The p-value is not the probability that \(H_0\) is true. It is not the probability that results occurred by chance. It is not a measure of the effect’s practical importance. These are among the most common misinterpretations in applied statistics.

Common misinterpretations to avoid:

  • “The p-value is 0.03, so there is a 3% probability that \(H_0\) is true.” — FALSE. \(H_0\) is either true or not; the p-value is a conditional probability about data, not about hypotheses.
  • “p = 0.06 means no effect exists.” — FALSE. Failing to reject \(H_0\) is not evidence that \(H_0\) is true; it only means insufficient evidence to reject it.
  • “p = 0.001 means the effect is large and important.” — FALSE. With a huge sample, even a trivially small and practically unimportant effect will produce a tiny p-value.
Statistical significance and practical significance are distinct concepts. Always report effect sizes alongside p-values. An effect that is statistically significant may be too small to matter in practice; an effect that is practically large may fail to reach significance only because the sample is too small.

Chapter 2: The One-Sample t-Test and z-Test

2.1 One-Sample z-Test (Known \(\sigma\))

When the population standard deviation \(\sigma\) is known (rare in practice), the test statistic under \(H_0: \mu = \mu_0\) is:

\[ z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \]

This follows a standard normal distribution exactly (for normal populations) or approximately via the CLT (for large \(n\)).

Critical values for common significance levels:

Test Type | \(\alpha = 0.10\) | \(\alpha = 0.05\) | \(\alpha = 0.01\)
Two-tailed | \(\pm 1.645\) | \(\pm 1.960\) | \(\pm 2.576\)
Upper-tailed | \(1.282\) | \(1.645\) | \(2.326\)
Lower-tailed | \(-1.282\) | \(-1.645\) | \(-2.326\)

2.2 One-Sample t-Test (Unknown \(\sigma\))

In practice \(\sigma\) must be estimated by the sample standard deviation \(s\). The test statistic is:

\[ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]

Under \(H_0\) and the normality assumption, this follows a Student’s t-distribution with \(n - 1\) degrees of freedom. The t-distribution has heavier tails than the normal, reflecting the additional uncertainty from estimating \(\sigma\).

Student's t-distribution: A symmetric, bell-shaped distribution indexed by degrees of freedom \(\nu\). As \(\nu \to \infty\), it converges to \(N(0,1)\). For small samples (\(n < 30\)), the t-distribution is noticeably heavier-tailed and critical values are correspondingly larger.
Invoice processing time: An internal auditor suspects that the average invoice processing time has increased above the standard of 3.5 days. A sample of 36 recent invoices yields \(\bar{x} = 3.9\) days with \(s = 1.2\) days.

Hypotheses: \(H_0: \mu = 3.5\) vs. \(H_a: \mu > 3.5\).

Test statistic:

\[ t = \frac{3.9 - 3.5}{1.2 / \sqrt{36}} = \frac{0.4}{0.2} = 2.00, \quad \text{df} = 35 \]

p-value (upper-tailed): In R, pt(2.00, df = 35, lower.tail = FALSE) \(\approx 0.027\).

Decision: Since \(p = 0.027 < \alpha = 0.05\), reject \(H_0\).

Interpretation: There is statistically significant evidence that the mean invoice processing time exceeds 3.5 days. Management should investigate the cause of the slowdown.

Conditions for the One-Sample t-Test

  1. Random sample: Observations are independently drawn from the population.
  2. Normality or large sample: Either the population is approximately normal, or \(n \geq 30\) so the CLT applies. For \(n < 30\), assess normality with a histogram, QQ plot, or Shapiro-Wilk test.
  3. Continuous outcome: The variable being measured is quantitative.

R Implementation

# One-sample t-test (two-tailed)
t.test(x, mu = 3.5, alternative = "two.sided")

# One-sample t-test (upper-tailed)
t.test(x, mu = 3.5, alternative = "greater")

# One-sample t-test (lower-tailed)
t.test(x, mu = 3.5, alternative = "less")

The output reports: t statistic, degrees of freedom, p-value, sample mean, and 95% confidence interval for \(\mu\).


Chapter 3: Two-Sample Inference

3.1 Independent Samples t-Test

Many business questions require comparing two independent groups: Does the experimental branch outperform the control branch in customer satisfaction? Is the mean claim amount different between two insurance product lines?

Hypotheses (two-tailed): \(H_0: \mu_1 - \mu_2 = D_0\) (typically \(D_0 = 0\)) vs. \(H_a: \mu_1 - \mu_2 \neq D_0\).

Test statistic (Welch, not assuming equal variances):

\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \]

Degrees of freedom (Welch-Satterthwaite approximation):

\[ df \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}} \]

This df is generally not an integer; R uses the fractional value directly in the t-distribution. Welch’s test is preferred over the pooled (equal-variance) t-test because it remains valid whether or not the variances are equal, while incurring little efficiency loss when they happen to be equal.

Regional sales performance: A retail chain wants to know whether the eastern and western regions differ in average weekly sales. A random sample of 25 weeks from the East region gives \(\bar{x}_E = \$142{,}000\), \(s_E = \$18{,}000\). A random sample of 30 weeks from the West gives \(\bar{x}_W = \$128{,}000\), \(s_W = \$22{,}000\).

\(H_0: \mu_E = \mu_W\) vs. \(H_a: \mu_E \neq \mu_W\).

\[ t = \frac{142000 - 128000}{\sqrt{\dfrac{18000^2}{25} + \dfrac{22000^2}{30}}} = \frac{14000}{\sqrt{12{,}960{,}000 + 16{,}133{,}333}} = \frac{14000}{\sqrt{29{,}093{,}333}} = \frac{14000}{5394} \approx 2.595 \]

Using t.test(east_sales, west_sales) in R gives \(df \approx 52.8\), p-value \(\approx 0.012\).

At \(\alpha = 0.05\), \(p = 0.012 < 0.05\): reject \(H_0\). The East region has significantly higher average weekly sales than the West.
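
With only the summary statistics available, the Welch statistic, df, and p-value can be reproduced directly; a minimal sketch (with the raw data, t.test(east_sales, west_sales) is simpler):

xbar_e <- 142000; s_e <- 18000; n_e <- 25
xbar_w <- 128000; s_w <- 22000; n_w <- 30
se <- sqrt(s_e^2 / n_e + s_w^2 / n_w)        # standard error of the difference
t_stat <- (xbar_e - xbar_w) / se             # about 2.60
df <- se^4 / ((s_e^2 / n_e)^2 / (n_e - 1) + (s_w^2 / n_w)^2 / (n_w - 1))   # about 53
2 * pt(abs(t_stat), df, lower.tail = FALSE)  # two-tailed p-value, about 0.012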

3.2 Paired Samples t-Test

When each observation in Group 1 is logically matched with an observation in Group 2 (before/after measurements, matched vendors, twin stores), the paired design eliminates between-subject variability and is more powerful. Define differences \(d_i = x_{1i} - x_{2i}\) and compute:

\[ t = \frac{\bar{d} - D_0}{s_d / \sqrt{n}}, \quad df = n - 1 \]

where \(\bar{d}\) is the mean difference and \(s_d\) is the standard deviation of the differences.

New payables process: A company implements a redesigned accounts payable process. Payment cycle times (days) are recorded for the same 12 vendors before and after implementation.
Vendor | Before | After | \(d_i\)
1 | 18 | 14 | 4
2 | 22 | 19 | 3
3 | 15 | 16 | -1
4 | 28 | 21 | 7
5 | 19 | 15 | 4
6 | 24 | 22 | 2
7 | 20 | 17 | 3
8 | 17 | 13 | 4
9 | 25 | 24 | 1
10 | 21 | 18 | 3
11 | 23 | 19 | 4
12 | 16 | 14 | 2

\(\bar{d} = 36/12 = 3.0\), \(s_d \approx 1.95\).

\[ t = \frac{3.0 - 0}{1.95 / \sqrt{12}} = \frac{3.0}{0.563} \approx 5.3, \quad df = 11 \]

p-value (two-tailed) \(< 0.001\). Strong evidence that the new process reduces cycle time. The average reduction is 3 days per vendor.
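
In R, the same test on the vendor data transcribed from the table above:

before <- c(18, 22, 15, 28, 19, 24, 20, 17, 25, 21, 23, 16)
after  <- c(14, 19, 16, 21, 15, 22, 17, 13, 24, 18, 19, 14)
t.test(before, after, paired = TRUE)   # mean difference 3.0, t about 5.3, df = 11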

3.3 F-Test for Equality of Variances

Before applying the pooled t-test (which assumes \(\sigma_1^2 = \sigma_2^2\)), analysts sometimes test equality of variances. The F-test for this purpose uses:

\[ F = \frac{s_1^2}{s_2^2} \]

Under \(H_0: \sigma_1^2 = \sigma_2^2\), \(F\) follows an F-distribution with \((n_1 - 1, n_2 - 1)\) degrees of freedom.

The F-test for equality of variances is sensitive to non-normality. In practice, Welch's t-test is recommended regardless of whether variances appear equal, making the variance pre-test largely unnecessary. In R: var.test(x, y).

3.4 Comparing Two Proportions

Two-sample z-test for proportions: Tests \(H_0: p_1 = p_2\) using the pooled proportion \(\hat{p} = (x_1 + x_2)/(n_1 + n_2)\): \[ z = \frac{(\hat{p}_1 - \hat{p}_2)}{\sqrt{\hat{p}(1 - \hat{p})\!\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \]
Default rates: A bank compares loan default rates between two credit tiers. In Tier A, 45 of 400 loans defaulted (\(\hat{p}_A = 0.1125\)). In Tier B, 72 of 500 loans defaulted (\(\hat{p}_B = 0.144\)). Test whether default rates differ (\(\alpha = 0.05\)).

\(\hat{p} = (45 + 72)/(400 + 500) = 117/900 = 0.130\).

\[ z = \frac{0.1125 - 0.144}{\sqrt{0.130 \times 0.870 \times (1/400 + 1/500)}} = \frac{-0.0315}{\sqrt{0.1131 \times 0.0045}} = \frac{-0.0315}{\sqrt{0.000509}} = \frac{-0.0315}{0.02256} \approx -1.396 \]

p-value (two-tailed) \(= 2 \times P(Z < -1.396) \approx 2 \times 0.0814 = 0.163\).

At \(\alpha = 0.05\), fail to reject \(H_0\). Insufficient evidence that default rates differ between tiers.
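
In R, prop.test() runs this comparison; it applies a continuity correction by default, so correct = FALSE matches the hand calculation more closely:

prop.test(x = c(45, 72), n = c(400, 500), correct = FALSE)
# The reported X-squared is the square of the z statistic (1.396^2, about 1.95); p about 0.16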


Chapter 4: Chi-Square Tests

4.1 Goodness-of-Fit Test

The chi-square goodness-of-fit test asks whether the observed distribution of a categorical variable matches a specified theoretical distribution.

Chi-square goodness-of-fit statistic: \[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \]

where \(O_i\) is the observed frequency in category \(i\) and \(E_i = n \cdot p_i\) is the expected frequency under \(H_0\). Under \(H_0\), \(\chi^2\) approximately follows a chi-square distribution with \(k - 1\) degrees of freedom, less one degree of freedom for each parameter estimated from the data (so simply \(k - 1\) when the null distribution is fully specified).

Conditions: All expected cell counts \(E_i \geq 5\) (combine categories if necessary).

Sales by quarter: A retailer claims that quarterly sales are evenly distributed throughout the year (25% each quarter). Actual annual sales (in millions): Q1 = \$18M, Q2 = \$24M, Q3 = \$21M, Q4 = \$27M. Total = \$90M. Expected per quarter = \$22.5M. \[ \chi^2 = \frac{(18-22.5)^2}{22.5} + \frac{(24-22.5)^2}{22.5} + \frac{(21-22.5)^2}{22.5} + \frac{(27-22.5)^2}{22.5} \]\[ = \frac{20.25}{22.5} + \frac{2.25}{22.5} + \frac{2.25}{22.5} + \frac{20.25}{22.5} = 0.900 + 0.100 + 0.100 + 0.900 = 2.000 \]

df = 3, critical value at \(\alpha = 0.05\) is \(\chi^2_{3, 0.05} = 7.815\). Since \(2.000 < 7.815\), fail to reject. No evidence against the uniform quarterly distribution.
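
In R, treating the quarterly dollar totals as the observed frequencies, as in the example:

observed <- c(Q1 = 18, Q2 = 24, Q3 = 21, Q4 = 27)
chisq.test(observed, p = rep(0.25, 4))   # X-squared = 2, df = 3, p about 0.57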

4.2 Test of Independence

The chi-square test of independence assesses whether two categorical variables are associated in a contingency table.

Expected cell counts under independence:

\[ E_{ij} = \frac{(\text{Row } i \text{ Total}) \times (\text{Column } j \text{ Total})}{n} \]\[ \chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \quad df = (r-1)(c-1) \]
Customer satisfaction by region: A bank surveys 500 customers and records satisfaction (Low, Medium, High) and region (North, South, East, West). Results:
Satisfaction | North | South | East | West | Row Total
Low | 20 | 30 | 25 | 15 | 90
Medium | 50 | 60 | 55 | 45 | 210
High | 55 | 60 | 45 | 40 | 200
Col Total | 125 | 150 | 125 | 100 | 500

Expected count for (Low, North): \(E_{11} = 90 \times 125 / 500 = 22.5\).

After computing all 12 expected counts and the \(\chi^2\) statistic (df = \((3-1)(4-1) = 6\)), compare to \(\chi^2_{6, 0.05} = 12.592\). In R: chisq.test(satisfaction_table).
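
A minimal sketch of building that table from the counts above and running the test (the object name satisfaction_table mirrors the call in the text):

satisfaction_table <- matrix(
  c(20, 30, 25, 15,
    50, 60, 55, 45,
    55, 60, 45, 40),
  nrow = 3, byrow = TRUE,
  dimnames = list(Satisfaction = c("Low", "Medium", "High"),
                  Region = c("North", "South", "East", "West")))
chisq.test(satisfaction_table)   # compare the statistic to the chi-square critical value with df = 6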

The chi-square test of independence detects any form of association (not just linear). It does not measure the strength of the relationship. Cramér's V quantifies effect size for an \(r \times c\) table (for \(2 \times 2\) tables it reduces to the phi coefficient): \[ V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1, c-1)}} \]

Chapter 5: One-Way Analysis of Variance (ANOVA)

5.1 Motivation: Comparing More Than Two Means

When comparing means across \(k \geq 3\) groups, running all pairwise t-tests is problematic: with \(k = 5\) groups there are \(\binom{5}{2} = 10\) pairwise tests, and at \(\alpha = 0.05\) each, the probability of at least one false rejection (familywise error rate) rises to \(1 - 0.95^{10} \approx 0.40\). ANOVA tests all \(k\) group means simultaneously at a single \(\alpha\) level.

One-way ANOVA: Tests \(H_0: \mu_1 = \mu_2 = \cdots = \mu_k\) (all group population means are equal) against \(H_a\): at least one pair of means differs. The test statistic is the F-ratio of between-group variance to within-group variance.

5.2 Partitioning Total Variation

ANOVA decomposes the total variation in the response into two sources:

\[ SST = SSB + SSW \]

where:

  • \(SST = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar{y})^2\) — Total Sum of Squares: total variation of all observations around the grand mean \(\bar{y}\).
  • \(SSB = \sum_{i=1}^{k} n_i(\bar{y}_i - \bar{y})^2\) — Between-Groups Sum of Squares: variation of group means around the grand mean (explained by group membership).
  • \(SSW = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2\) — Within-Groups Sum of Squares: variation of individual observations around their group mean (unexplained noise).

Mean squares are obtained by dividing by degrees of freedom:

\[ MSB = \frac{SSB}{k - 1}, \qquad MSW = \frac{SSW}{N - k} \]

where \(N = \sum n_i\) is the total sample size.

5.3 The F-Statistic

F-statistic for one-way ANOVA: \[ F = \frac{MSB}{MSW} \]

Under \(H_0\), both \(MSB\) and \(MSW\) estimate \(\sigma^2\), so \(F \approx 1\). When \(H_0\) is false, \(MSB\) exceeds \(MSW\) systematically (between-group differences inflate \(MSB\) but not \(MSW\)). The test is always upper-tailed: reject \(H_0\) if \(F > F_{\alpha, k-1, N-k}\).

5.4 The ANOVA Table

Source | SS | df | MS | F | p-value
Between Groups | SSB | \(k - 1\) | \(MSB = SSB/(k-1)\) | \(MSB/MSW\) | \(P(F_{k-1, N-k} > F)\)
Within Groups (Error) | SSW | \(N - k\) | \(MSW = SSW/(N-k)\) | |
Total | SST | \(N - 1\) | | |

5.5 Assumptions of One-Way ANOVA

  1. Independence: Observations within and across groups are independent.
  2. Normality: Within each group, the response is approximately normally distributed. ANOVA is robust to moderate departures when group sizes are equal and reasonably large.
  3. Equal variances (homoscedasticity): Population variances are equal across all groups: \(\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2\). Assess with Levene’s test (leveneTest() in R) or Bartlett’s test.
Performance by sales region: A national retailer divides stores into four regions. Monthly sales (thousands of dollars) are recorded for random samples from each region.
Region | \(n_i\) | \(\bar{y}_i\) | \(s_i\)
North | 8 | 142 | 18
South | 10 | 128 | 22
East | 9 | 155 | 15
West | 7 | 133 | 20

\(N = 34\), grand mean \(\bar{y} = (8 \times 142 + 10 \times 128 + 9 \times 155 + 7 \times 133) / 34\).

\(\bar{y} = (1136 + 1280 + 1395 + 931)/34 = 4742/34 = 139.47\).

\[ SSB = 8(142-139.47)^2 + 10(128-139.47)^2 + 9(155-139.47)^2 + 7(133-139.47)^2 \]\[ = 8(6.40) + 10(131.55) + 9(241.06) + 7(41.89) \]\[ = 51.2 + 1315.5 + 2169.5 + 293.2 = 3829.4 \]\[ MSB = 3829.4 / 3 = 1276.5 \]

\(SSW\) is computed from within-group variances: \(SSW \approx \sum(n_i - 1)s_i^2 = 7(324) + 9(484) + 8(225) + 6(400) = 2268 + 4356 + 1800 + 2400 = 10824\).

\(MSW = 10824 / 30 = 360.8\).

\[ F = 1276.5 / 360.8 = 3.538 \]

Critical value \(F_{3, 30, 0.05} = 2.922\). Since \(3.538 > 2.922\), reject \(H_0\). At least one region differs in mean monthly sales. In R: summary(aov(sales ~ region, data = store_data)).
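
The hand calculation can also be reproduced from the group summaries alone; a minimal sketch (with raw data, aov() as above is the standard route):

n_i  <- c(8, 10, 9, 7); ybar <- c(142, 128, 155, 133); s_i <- c(18, 22, 15, 20)
grand <- sum(n_i * ybar) / sum(n_i)                   # 139.47
SSB <- sum(n_i * (ybar - grand)^2);  MSB <- SSB / (length(n_i) - 1)
SSW <- sum((n_i - 1) * s_i^2);       MSW <- SSW / (sum(n_i) - length(n_i))
F_stat <- MSB / MSW                                   # about 3.54
pf(F_stat, length(n_i) - 1, sum(n_i) - length(n_i), lower.tail = FALSE)   # p about 0.026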

5.6 Post-Hoc Tests

A significant ANOVA F-test tells us only that at least one pair of means differs — it does not say which pairs. Post-hoc tests make pairwise comparisons while controlling the familywise error rate.

Tukey’s Honestly Significant Difference (HSD)

Tukey's HSD: Compares all \(\binom{k}{2}\) pairs of group means simultaneously, controlling the familywise error rate at \(\alpha\). The critical difference for declaring \(\bar{y}_i - \bar{y}_j\) significant is: \[ \text{HSD} = q_{\alpha, k, N-k} \sqrt{\frac{MSW}{n}} \]

where \(q_{\alpha, k, N-k}\) is the Studentized range critical value and \(n\) is the common group size (for balanced designs). For unbalanced designs, use the Tukey-Kramer adjustment.

Bonferroni Correction

Bonferroni correction: To maintain familywise error rate \(\alpha\) across \(m\) pairwise comparisons, compare each individual p-value to the adjusted threshold \(\alpha / m\). More conservative (wider intervals) than Tukey's HSD, but applicable to any set of comparisons, not just all pairwise.

In R:

# One-way ANOVA
model_anova <- aov(sales ~ region, data = store_data)
summary(model_anova)

# Post-hoc: Tukey's HSD
TukeyHSD(model_anova)

# Post-hoc: Bonferroni-adjusted pairwise t-tests
pairwise.t.test(store_data$sales, store_data$region, p.adjust.method = "bonferroni")

Chapter 6: Two-Way ANOVA

6.1 Extending ANOVA to Two Factors

Two-way ANOVA examines the effects of two categorical factors simultaneously and tests whether the factors interact. It is more efficient than running two separate one-way ANOVAs because it partitions variation more precisely and can detect interaction effects.

Notation: Factor A has \(a\) levels, Factor B has \(b\) levels, with \(n\) replications per cell. The model is:

\[ y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk} \]

where \(\alpha_i\) is the main effect of Factor A level \(i\), \(\beta_j\) is the main effect of Factor B level \(j\), \((\alpha\beta)_{ij}\) is the interaction effect, and \(\varepsilon_{ijk} \sim N(0, \sigma^2)\).

6.2 Sum of Squares Decomposition

\[ SST = SSA + SSB + SSAB + SSE \]
Source | df | MS | F
Factor A | \(a - 1\) | \(MSA\) | \(MSA/MSE\)
Factor B | \(b - 1\) | \(MSB\) | \(MSB/MSE\)
Interaction AB | \((a-1)(b-1)\) | \(MSAB\) | \(MSAB/MSE\)
Error | \(ab(n-1)\) | \(MSE\) |
Total | \(abn - 1\) | |

6.3 Interpreting Interaction

Interaction effect: An interaction between Factor A and Factor B exists when the effect of one factor on the response depends on the level of the other factor. If interaction is significant, the main effects cannot be interpreted in isolation.

An interaction plot (plotting group means with lines connecting levels of one factor, across levels of the other) reveals interaction visually: parallel lines indicate no interaction; converging, crossing, or diverging lines indicate interaction.

Advertising channel and market: A firm tests two advertising channels (Digital, Traditional) across three markets (Urban, Suburban, Rural) with 5 replications each. Two-way ANOVA reveals: - Factor A (Channel): \(F = 8.24\), \(p = 0.006\) — significant. - Factor B (Market): \(F = 12.11\), \(p < 0.001\) — significant. - Interaction: \(F = 4.87\), \(p = 0.014\) — significant.

The significant interaction means: the advantage of Digital over Traditional advertising differs by market (e.g., large advantage in Urban markets, negligible in Rural markets). Reporting main effects alone would be misleading.
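
A sketch of how such a design would be fit in R, assuming a hypothetical data frame ad_data with one row per observation and columns response, channel, and market:

ad_model <- aov(response ~ channel * market, data = ad_data)
summary(ad_model)   # F-tests for channel, market, and the channel:market interaction
interaction.plot(ad_data$market, ad_data$channel, ad_data$response)   # parallel lines suggest no interaction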


Chapter 7: Simple Linear Regression

7.1 The SLR Model

Simple Linear Regression (SLR) quantifies the linear relationship between a single predictor \(X\) and a continuous response \(Y\). In business settings, it answers questions like: How does advertising spend predict sales? How do machine-hours predict overhead costs? How do changes in interest rates affect bond prices?

Population SLR model: \[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2) \]

\(\beta_0\) is the true intercept, \(\beta_1\) is the true slope, and \(\varepsilon_i\) is the irreducible random error for observation \(i\).

Model assumptions (LINE):

  • Linearity: The true relationship between \(X\) and \(E(Y)\) is linear.
  • Independence: Errors are mutually independent.
  • Normality: Errors are normally distributed.
  • Equal variance (homoscedasticity): \(\text{Var}(\varepsilon_i) = \sigma^2\) constant across all \(X\).

7.2 Ordinary Least Squares (OLS) Estimation

OLS finds \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize the Residual Sum of Squares:

\[ SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \]

Taking partial derivatives and setting them to zero yields the normal equations:

\[ \frac{\partial SSE}{\partial \hat{\beta}_0} = -2\sum(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \]\[ \frac{\partial SSE}{\partial \hat{\beta}_1} = -2\sum x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \]

Solving:

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

The fitted values are \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\) and the residuals are \(e_i = y_i - \hat{y}_i\).

7.3 Interpretation of Coefficients

  • \(\hat{\beta}_1\) (slope): The estimated average change in \(Y\) for a one-unit increase in \(X\). Units: units of \(Y\) per unit of \(X\).
  • \(\hat{\beta}_0\) (intercept): The estimated average value of \(Y\) when \(X = 0\). This is only meaningful if \(X = 0\) is within the plausible range of the data; otherwise it is an extrapolation artifact.
Advertising spend and sales revenue: A consumer goods company records monthly advertising spend (thousands of dollars) and sales revenue (thousands of dollars) for 24 months.
Summary statistic | Value
\(\bar{x}\) (avg. ad spend) | 45.2
\(\bar{y}\) (avg. revenue) | 312.8
\(S_{xx} = \sum(x_i - \bar{x})^2\) | 8,640
\(S_{xy} = \sum(x_i-\bar{x})(y_i-\bar{y})\) | 37,152

\[ \hat{\beta}_1 = \frac{37152}{8640} = 4.30 \]\[ \hat{\beta}_0 = 312.8 - 4.30 \times 45.2 = 312.8 - 194.4 = 118.4 \]

Fitted model: \(\hat{\text{Revenue}} = 118.4 + 4.30 \times \text{AdSpend}\)

Interpretation: Each additional $1{,}000 in advertising is associated with an estimated $4{,}300 increase in monthly sales revenue, on average. The intercept of $118{,}400 represents estimated revenue when advertising spend is zero — plausible as a baseline (brand recognition, repeat customers).
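
A quick check of these estimates from the summary quantities in the table:

x_bar <- 45.2; y_bar <- 312.8; Sxx <- 8640; Sxy <- 37152
b1 <- Sxy / Sxx            # 4.30
b0 <- y_bar - b1 * x_bar   # 118.4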

7.4 Coefficient of Determination

\[ SST = SSR + SSE \]

where \(SSR = \sum(\hat{y}_i - \bar{y})^2\) is the regression sum of squares (variation explained by \(X\)) and \(SSE = \sum(y_i - \hat{y}_i)^2\) is the residual sum of squares (unexplained variation).

\[ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} \]

\(R^2\) ranges from 0 to 1. In SLR, \(R^2 = r^2\) where \(r\) is the Pearson correlation coefficient.

Residual Standard Error

The estimated standard deviation of the errors:

\[ \hat{\sigma} = s_e = \sqrt{\frac{SSE}{n - 2}} = \sqrt{MSE} \]

The denominator is \(n - 2\) because two parameters (\(\beta_0\), \(\beta_1\)) were estimated. \(s_e\) is reported as “Residual Standard Error” in R output and measures the typical size of prediction errors in the units of \(Y\).

7.5 Inference on the Slope

Standard Error of \(\hat{\beta}_1\)

\[ SE(\hat{\beta}_1) = \frac{s_e}{\sqrt{S_{xx}}} = \sqrt{\frac{MSE}{S_{xx}}} \]

This decreases as (a) \(n\) increases, (b) \(S_{xx}\) increases (wider spread in \(X\) values), or (c) \(s_e\) decreases (better fit).

t-Test for the Slope

\[ H_0: \beta_1 = 0 \quad \text{(no linear relationship)} \quad \text{vs.} \quad H_a: \beta_1 \neq 0 \]\[ t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}, \quad df = n - 2 \]

A significant result means there is evidence of a linear relationship between \(X\) and \(Y\) in the population. Note: the F-test in the ANOVA table of regression output is equivalent to this t-test in SLR (i.e., \(F = t^2\)).

Confidence Interval for \(\beta_1\)

\[ \hat{\beta}_1 \pm t_{\alpha/2,\, n-2} \cdot SE(\hat{\beta}_1) \]
Continuing the advertising example: \(n = 24\), \(\hat{\beta}_1 = 4.30\), \(s_e = 28.4\), \(S_{xx} = 8640\). \[ SE(\hat{\beta}_1) = \frac{28.4}{\sqrt{8640}} = \frac{28.4}{92.95} = 0.306 \]\[ t = \frac{4.30}{0.306} = 14.05, \quad df = 22, \quad p < 0.0001 \]

95% CI for \(\beta_1\): \(4.30 \pm t_{0.025, 22} \times 0.306 = 4.30 \pm 2.074 \times 0.306 = 4.30 \pm 0.63 = (3.67,\ 4.93)\).

We are 95% confident that each additional $1{,}000 of advertising is associated with between $3{,}670 and $4{,}930 in additional revenue, on average.
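
The same slope inference from the summary quantities; a minimal sketch (with raw data, summary(lm(...)) and confint() report these directly):

n <- 24; b1 <- 4.30; s_e <- 28.4; Sxx <- 8640
se_b1 <- s_e / sqrt(Sxx)                              # 0.306
t_stat <- b1 / se_b1                                  # about 14.1
2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)   # p-value, far below 0.001
b1 + c(-1, 1) * qt(0.975, df = n - 2) * se_b1         # 95% CI, about (3.67, 4.93)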

7.6 Confidence Intervals for Mean Response and Prediction Intervals

At a given predictor value \(x^*\), two types of interval estimate are available:

Confidence interval for the mean response \(E(Y|X = x^*)\): Estimates the average \(Y\) for the entire subpopulation with \(X = x^*\): \[ \hat{y}^* \pm t_{\alpha/2,\, n-2} \cdot s_e \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}} \]

This interval narrows as \(x^*\) approaches \(\bar{x}\) (the center of the data) and widens toward the extremes.

Prediction interval for an individual new observation \(Y^*\): Estimates where a single future \(Y\) value at \(X = x^*\) will fall: \[ \hat{y}^* \pm t_{\alpha/2,\, n-2} \cdot s_e \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}} \]

The extra “1” under the radical accounts for the individual variability of the new observation around the mean response. Prediction intervals are always wider than the corresponding confidence intervals.

The further \(x^*\) is from \(\bar{x}\), the wider both intervals become. Extrapolating far beyond the range of observed \(X\) values leads to intervals so wide as to be nearly useless, and also relies on the unverifiable assumption that the linear model continues to hold.
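
In R, both intervals come from predict() applied to a fitted lm object; a sketch assuming a fitted model such as model <- lm(revenue ~ ad_spend, data = marketing_data) from the next section:

new_x <- data.frame(ad_spend = 50)   # predictor value x* (e.g., $50K of ad spend)
predict(model, newdata = new_x, interval = "confidence")   # CI for the mean response
predict(model, newdata = new_x, interval = "prediction")   # wider PI for a single new month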

7.7 Residual Analysis

Residual analysis is the primary tool for checking whether the LINE assumptions are satisfied.

Key residual plots:

  1. Residuals vs. Fitted values: Should show a horizontal band with no pattern. A funnel shape indicates heteroscedasticity; a curve indicates non-linearity.
  2. Normal QQ plot of residuals: Points should fall approximately on a straight diagonal line. Systematic deviations indicate non-normality.
  3. Residuals vs. predictor \(X\): Equivalent to (1) in SLR; helps identify non-linearity.
  4. Scale-Location plot (square root of |standardized residuals| vs. fitted): Assesses homoscedasticity — should be a horizontal band.
model <- lm(revenue ~ ad_spend, data = marketing_data)
par(mfrow = c(2, 2))
plot(model)   # Four residual diagnostic plots

Chapter 8: Multiple Linear Regression

8.1 Extending to Multiple Predictors

Multiple Linear Regression (MLR) models the relationship between a response \(Y\) and \(p\) predictor variables \(X_1, X_2, \ldots, X_p\):

\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \varepsilon_i \]

In matrix notation: \(\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\), where \(\mathbf{X}\) is the \(n \times (p+1)\) design matrix. The OLS estimator is:

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]

8.2 Interpretation of Coefficients in MLR

Partial effect: In MLR, \(\hat{\beta}_j\) is the estimated change in \(Y\) associated with a one-unit increase in \(X_j\), holding all other predictors constant. This is the key distinction from SLR: the coefficient controls for the other variables in the model.
Predicting overhead costs: A manufacturing firm regresses monthly overhead cost (Y, \$000s) on machine-hours (\(X_1\)) and number of production runs (\(X_2\)).

Fitted model: \(\hat{Y} = 42.3 + 3.15 X_1 + 8.90 X_2\)

  • \(\hat{\beta}_1 = 3.15\): Holding number of production runs fixed, each additional machine-hour is associated with $3{,}150 in additional overhead, on average.
  • \(\hat{\beta}_2 = 8.90\): Holding machine-hours fixed, each additional production run is associated with $8{,}900 in additional overhead, on average.
  • \(\hat{\beta}_0 = 42.3\): Estimated overhead when both predictors are zero — an extrapolation anchor with limited practical meaning here.

8.3 Adjusted R-Squared

Adding any predictor to a model will never decrease \(R^2\), even if the predictor has no real relationship with \(Y\) (overfitting). The adjusted \(R^2\) penalizes for model complexity:

\[ \bar{R}^2 = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} = 1 - (1 - R^2)\frac{n-1}{n-p-1} \]

Unlike \(R^2\), adjusted \(R^2\) can decrease when a predictor is added that does not sufficiently improve the fit. It is used for comparing models with different numbers of predictors.

8.4 F-Test for Overall Significance

Tests whether at least one predictor has a non-zero coefficient:

\[ H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \quad \text{vs.} \quad H_a: \text{at least one } \beta_j \neq 0 \]\[ F = \frac{SSR/p}{SSE/(n-p-1)} = \frac{MSR}{MSE}, \quad \text{df} = (p,\ n-p-1) \]

The p-value for this F-test appears in the last line of R’s summary(lm(...)) output. A significant overall F-test should precede interpretation of individual coefficients.

8.5 Individual t-Tests for Coefficients

Each coefficient has its own t-test: \(H_0: \beta_j = 0\) vs. \(H_a: \beta_j \neq 0\):

\[ t_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}, \quad df = n - p - 1 \]

These test the partial effect of \(X_j\) given all other predictors in the model. A predictor that is individually significant in SLR may become insignificant in MLR if it is correlated with another predictor (multicollinearity).

8.6 Multicollinearity

Multicollinearity: A condition in which two or more predictor variables are highly linearly correlated with each other. It does not bias the OLS estimates, but it inflates standard errors, making individual coefficients imprecisely estimated and t-tests unreliable.

Detection: Compute the Variance Inflation Factor (VIF):

\[ VIF_j = \frac{1}{1 - R_j^2} \]

where \(R_j^2\) is the \(R^2\) from regressing \(X_j\) on all other predictors. As a rule of thumb, \(VIF > 10\) (or even \(VIF > 5\) in some fields) indicates problematic multicollinearity.

Consequences: Large standard errors → wide confidence intervals → difficulty identifying which variables matter → unstable coefficient estimates (small changes in data lead to large changes in coefficients).

Remedies: Drop one of the correlated predictors; combine correlated predictors into a single index; use ridge regression or principal components regression.

library(car)
vif(model_mlr)   # Variance inflation factors

8.7 Heteroscedasticity

Heteroscedasticity: Violation of the equal-variance assumption — the variance of the errors \(\text{Var}(\varepsilon_i)\) changes as a function of the predictor values. Common in financial data (larger companies have more variable revenues) and cross-sectional data.

Detection: Residuals vs. fitted values plot (funnel/fan shape); Breusch-Pagan test (bptest() in the lmtest package).

Consequences: OLS estimates remain unbiased but are no longer minimum variance (BLUE). Standard errors are incorrect, invalidating t-tests and F-tests.

Remedies: Weighted least squares (WLS); heteroscedasticity-consistent (robust) standard errors (coeftest(model, vcov = vcovHC(model, type = "HC3")), using the lmtest and sandwich packages); log-transforming the response.

8.8 Model Selection: AIC, BIC, and Stepwise Regression

When many candidate predictors are available, model selection criteria help balance fit against complexity.

Akaike Information Criterion (AIC): \[ AIC = -2\ln(\hat{L}) + 2(p + 1) \]

where \(\hat{L}\) is the maximized likelihood. Lower AIC is better. The penalty term \(2(p+1)\) discourages overfitting.

Bayesian Information Criterion (BIC): \[ BIC = -2\ln(\hat{L}) + (p + 1)\ln(n) \]

BIC applies a stronger penalty for model complexity than AIC, especially for large \(n\), and tends to select more parsimonious models.

Stepwise regression: An algorithmic procedure that adds or removes predictors one at a time based on a criterion (AIC, p-value, or \(R^2\)). Variants: forward selection (start empty, add best predictor iteratively), backward elimination (start full, remove least-significant predictor iteratively), stepwise (bidirectional). These are exploratory tools — results should be interpreted cautiously and validated on new data.

# AIC-based stepwise selection
library(MASS)
full_model   <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = df)
step_model   <- stepAIC(full_model, direction = "both")
summary(step_model)

# AIC and BIC for a given model
AIC(model_mlr)
BIC(model_mlr)
Model selection should ideally use cross-validation or a held-out test set, not just in-sample criteria. AIC and BIC minimize different approximations to prediction error and may select different models. When both point to the same model, confidence in that model is higher.

Chapter 9: Non-Parametric Tests

9.1 When to Use Non-Parametric Tests

Parametric tests (t-tests, ANOVA, regression) rely on distributional assumptions — primarily normality of the response or residuals. Non-parametric tests make fewer assumptions and are appropriate when:

  • The sample is small and normality cannot be verified.
  • The data are ordinal (ranked) rather than continuous.
  • Outliers strongly distort the mean, making it an inappropriate summary.
  • The Shapiro-Wilk test or QQ plot indicates significant non-normality.

Non-parametric tests trade some statistical efficiency for robustness: when the parametric assumptions do hold, they need somewhat larger samples to achieve the same power.

9.2 Wilcoxon Signed-Rank Test (One-Sample / Paired)

Wilcoxon signed-rank test: The non-parametric alternative to the one-sample t-test (or paired t-test). Tests whether the population median equals a hypothesized value \(M_0\), based on the ranks of the absolute differences \(|d_i|\).

Procedure:

  1. Compute differences \(d_i = x_i - M_0\) (or \(d_i = x_{1i} - x_{2i}\) for paired data).
  2. Discard zero differences; rank the absolute values of remaining differences.
  3. Compute \(W^+\) = sum of ranks with positive differences, \(W^-\) = sum of ranks with negative differences.
  4. Use \(W = \min(W^+, W^-)\) as the test statistic. Compare to the Wilcoxon critical value table, or use R.

For large samples (\(n \geq 10\)), \(W^+\) is approximately normal with:

\[ \mu_{W^+} = \frac{n(n+1)}{4}, \qquad \sigma_{W^+}^2 = \frac{n(n+1)(2n+1)}{24} \]
wilcox.test(x, mu = M_0, alternative = "two.sided")         # One-sample
wilcox.test(after, before, paired = TRUE, alternative = "two.sided") # Paired

9.3 Mann-Whitney U Test (Two Independent Samples)

Mann-Whitney U test: The non-parametric alternative to the independent-samples t-test. Tests whether one population tends to produce larger values than the other; under \(H_0\) the two distributions have the same location. Also called the Wilcoxon rank-sum test.

Procedure:

  1. Pool all observations from both groups and rank from smallest to largest (averaging ranks for ties).
  2. Compute \(R_1\) = sum of ranks for Group 1.
  3. \(U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - R_1\); \(U_2 = n_1 n_2 - U_1\).
  4. Test statistic: \(U = \min(U_1, U_2)\).

For large samples, \(U\) is approximately normal with \(\mu_U = n_1 n_2 / 2\) and \(\sigma_U^2 = n_1 n_2(n_1 + n_2 + 1)/12\).

wilcox.test(group1, group2, alternative = "two.sided")   # Mann-Whitney
Customer wait times: Two bank branches record customer wait times (minutes). Branch A: \{3, 5, 4, 7, 6\}, Branch B: \{8, 10, 7, 12, 9\}. The Shapiro-Wilk test suggests non-normality in Branch B. Use Mann-Whitney.

Pooling and ranking: 3(1), 4(2), 5(3), 6(4), 7(5.5), 7(5.5), 8(7), 9(8), 10(9), 12(10).

\(R_A = 1 + 2 + 3 + 4 + 5.5 = 15.5\); \(U_A = 5 \times 5 + 15 - 15.5 = 24.5\); \(U_B = 25 - 24.5 = 0.5\).

\(U = 0.5\). With \(n_1 = n_2 = 5\), the critical value at \(\alpha = 0.05\) two-tailed is \(U^* = 2\). Since \(0.5 < 2\), reject \(H_0\). Branch B has significantly longer wait times.
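
In R (the tied pair of 7s makes an exact p-value unavailable, so wilcox.test() falls back to a normal approximation and warns accordingly):

branch_a <- c(3, 5, 4, 7, 6)
branch_b <- c(8, 10, 7, 12, 9)
wilcox.test(branch_a, branch_b, alternative = "two.sided")   # W = 0.5, p about 0.02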

9.4 Kruskal-Wallis Test (Non-Parametric ANOVA)

Kruskal-Wallis test: The non-parametric analog of one-way ANOVA. Tests whether \(k\) groups come from the same distribution (have the same median), based on pooled ranks. \[ H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1) \]

Under \(H_0\), \(H \sim \chi^2_{k-1}\) approximately (for \(n_i \geq 5\)).

kruskal.test(response ~ group, data = df)

Chapter 10: Tests for Proportions and the Chi-Square Distribution

10.1 One-Sample Test for a Proportion

One-sample z-test for a proportion: \[ H_0: p = p_0 \quad \text{vs.} \quad H_a: p \neq p_0 \text{ (or } > \text{ or } < \text{)} \]\[ z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} \]

Valid when \(np_0 \geq 10\) and \(n(1 - p_0) \geq 10\).

Audit sampling: An auditor believes that the error rate in a batch of expense claims has risen above the historical rate of 8%. A sample of 200 claims finds 22 errors (\(\hat{p} = 0.11\)). \[ z = \frac{0.11 - 0.08}{\sqrt{0.08 \times 0.92 / 200}} = \frac{0.03}{\sqrt{0.000368}} = \frac{0.03}{0.01918} = 1.565 \]

p-value (upper-tailed) = \(P(Z > 1.565) \approx 0.059\). At \(\alpha = 0.05\), fail to reject \(H_0\) — marginal evidence. At \(\alpha = 0.10\), would reject. The auditor might request a larger sample or investigate the 22 errors identified.
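
In R, either the normal-approximation test (set correct = FALSE to match the hand z-calculation) or an exact binomial test can be used:

prop.test(x = 22, n = 200, p = 0.08, alternative = "greater", correct = FALSE)   # p about 0.059
binom.test(x = 22, n = 200, p = 0.08, alternative = "greater")                   # exact version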

10.2 The Chi-Square Distribution

Chi-square distribution \(\chi^2_\nu\): A right-skewed distribution indexed by degrees of freedom \(\nu > 0\). If \(Z_1, \ldots, Z_\nu \overset{iid}{\sim} N(0,1)\), then \(\chi^2 = Z_1^2 + \cdots + Z_\nu^2 \sim \chi^2_\nu\). Mean = \(\nu\), variance = \(2\nu\). As \(\nu\) increases, the distribution becomes more symmetric.

The chi-square distribution arises naturally in: (a) tests for variance (\(\chi^2 = (n-1)s^2/\sigma_0^2\)), (b) goodness-of-fit and independence tests, and (c) as the distribution of the log-likelihood ratio in maximum likelihood estimation.

One-Sample Test for Variance

\[ H_0: \sigma^2 = \sigma_0^2, \quad \chi^2 = \frac{(n-1)s^2}{\sigma_0^2}, \quad df = n-1 \]

This test is sensitive to departures from normality; use with caution.
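
Base R has no built-in one-sample variance test; a minimal sketch of the statistic and a two-sided p-value:

# x: sample vector; sigma0_sq: hypothesized variance (both supplied by the user)
var_test_one <- function(x, sigma0_sq) {
  n    <- length(x)
  stat <- (n - 1) * var(x) / sigma0_sq
  p    <- 2 * min(pchisq(stat, df = n - 1), pchisq(stat, df = n - 1, lower.tail = FALSE))
  c(statistic = stat, df = n - 1, p.value = p)
}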


Chapter 11: Introduction to Time Series Analysis

11.1 Components of a Time Series

A time series is a sequence of observations recorded at successive equally-spaced points in time. In business and finance: monthly sales, quarterly earnings, daily stock prices, weekly unemployment claims. Time series data violates the independence assumption of standard regression — observations close in time tend to be correlated.

Decomposition model: A time series \(Y_t\) can be thought of as composed of:
  • Trend (\(T_t\)): The long-term upward or downward direction of the series.
  • Seasonal component (\(S_t\)): Regular, predictable fluctuations that repeat over a fixed period (e.g., higher retail sales every December).
  • Cyclical component (\(C_t\)): Longer-term undulations associated with business cycles (expansions and contractions), with irregular duration.
  • Irregular (random) component (\(I_t\)): Unpredictable, random noise remaining after the other components are accounted for.

Additive model: \(Y_t = T_t + S_t + C_t + I_t\) — appropriate when seasonal variation is roughly constant in absolute magnitude.

Multiplicative model: \(Y_t = T_t \times S_t \times C_t \times I_t\) — appropriate when seasonal variation grows proportionally with the trend. Can be linearized by taking logarithms: \(\ln Y_t = \ln T_t + \ln S_t + \ln C_t + \ln I_t\).

11.2 Moving Averages

A moving average smooths a time series by averaging consecutive windows of observations, removing short-term fluctuations and revealing the trend.

Simple moving average (SMA) of order \(m\): \[ \hat{T}_t = \frac{1}{m}\sum_{j=-(m-1)/2}^{(m-1)/2} Y_{t+j} \quad (m \text{ odd}) \]

For \(m = 5\): \(\hat{T}_t = (Y_{t-2} + Y_{t-1} + Y_t + Y_{t+1} + Y_{t+2})/5\).

Choosing \(m\) equal to the length of the seasonal period (e.g., \(m = 12\) for monthly data, \(m = 4\) for quarterly data) eliminates the seasonal component and exposes the underlying trend-cycle. A centered moving average (\(2 \times 12\) MA) is used for even-period data to maintain alignment with the original time index.

Quarterly revenue trend: A firm's quarterly revenues (millions): Q1'23 = 42, Q2'23 = 58, Q3'23 = 67, Q4'23 = 71, Q1'24 = 48, Q2'24 = 63, Q3'24 = 72, Q4'24 = 76.

4-period moving average:

  • Centered at Q2'23–Q3'23: \((42+58+67+71)/4 = 59.5\)
  • Centered at Q3'23–Q4'23: \((58+67+71+48)/4 = 61.0\)
  • Centered at Q4'23–Q1'24: \((67+71+48+63)/4 = 62.25\)
  • Centered at Q1'24–Q2'24: \((71+48+63+72)/4 = 63.5\)
  • Centered at Q2'24–Q3'24: \((48+63+72+76)/4 = 64.75\)

The smoothed values reveal a steady upward trend (roughly \$1.3M per quarter) that the raw seasonal swings obscure.
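
The smoothed values can be reproduced in a few lines of base R:

rev_q <- c(42, 58, 67, 71, 48, 63, 72, 76)
ma4 <- sapply(1:(length(rev_q) - 3), function(i) mean(rev_q[i:(i + 3)]))
ma4   # 59.50 61.00 62.25 63.50 64.75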

11.3 Exponential Smoothing

Exponential smoothing is a powerful forecasting method that assigns exponentially decreasing weights to past observations — recent data receives more weight than older data.

Simple Exponential Smoothing (SES): \[ \hat{Y}_{t+1} = \alpha Y_t + (1 - \alpha)\hat{Y}_t = \hat{Y}_t + \alpha(Y_t - \hat{Y}_t) \]

where \(\alpha \in (0, 1)\) is the smoothing parameter. Large \(\alpha\) (near 1) weights recent observations heavily (more responsive, less smooth); small \(\alpha\) (near 0) weights historical observations more (smoother, slower to react).

Expanding the recursion:

\[ \hat{Y}_{t+1} = \alpha Y_t + \alpha(1-\alpha)Y_{t-1} + \alpha(1-\alpha)^2 Y_{t-2} + \cdots \]

The weights \(\alpha(1-\alpha)^j\) decline geometrically, confirming the exponential weighting. SES produces optimal forecasts for series with no trend and no seasonality (i.e., a random walk with noise).

Choosing \(\alpha\): Minimize the sum of squared forecast errors over the historical data:

\[ SSE(\alpha) = \sum_{t=2}^{T}(Y_t - \hat{Y}_t)^2 \]

In R: HoltWinters(ts, beta = FALSE, gamma = FALSE) fits SES and optimizes \(\alpha\) automatically.
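
A minimal sketch of the SES recursion and an SSE-based grid search for \(\alpha\) (HoltWinters() does this optimization automatically; sales_y is a placeholder for the historical series):

ses_forecast <- function(y, alpha) {
  f <- numeric(length(y)); f[1] <- y[1]            # initialise the first forecast at y[1]
  for (t in seq_len(length(y) - 1)) f[t + 1] <- alpha * y[t] + (1 - alpha) * f[t]
  f                                                # f[t] is the one-step-ahead forecast of y[t]
}
sse_alpha <- function(alpha, y) sum((y[-1] - ses_forecast(y, alpha)[-1])^2)
alphas <- seq(0.05, 0.95, by = 0.05)
best_alpha <- alphas[which.min(sapply(alphas, sse_alpha, y = sales_y))]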

11.4 Holt’s Two-Parameter Exponential Smoothing (Trend Corrected)

SES does not account for trend. Holt’s method adds a trend equation:

\[ \text{Level: } L_t = \alpha Y_t + (1-\alpha)(L_{t-1} + T_{t-1}) \]\[ \text{Trend: } T_t = \beta(L_t - L_{t-1}) + (1-\beta)T_{t-1} \]\[ \text{Forecast: } \hat{Y}_{t+h} = L_t + h T_t \]

where \(\alpha\) smooths the level and \(\beta\) smooths the trend. Both parameters are optimized by minimizing SSE.

11.5 Holt-Winters Seasonal Exponential Smoothing

Holt-Winters extends Holt’s method to handle seasonality of period \(s\):

Additive version:

\[ L_t = \alpha(Y_t - S_{t-s}) + (1-\alpha)(L_{t-1} + T_{t-1}) \]\[ T_t = \beta(L_t - L_{t-1}) + (1-\beta)T_{t-1} \]\[ S_t = \gamma(Y_t - L_t) + (1-\gamma)S_{t-s} \]\[ \hat{Y}_{t+h} = L_t + hT_t + S_{t+h-s} \]

Three parameters to optimize: \(\alpha\) (level), \(\beta\) (trend), \(\gamma\) (seasonality).

Retail sales forecasting: Monthly retail sales exhibit both an upward trend and a strong December seasonal spike. Holt-Winters fits \(\alpha = 0.3\), \(\beta = 0.1\), \(\gamma = 0.8\) (high \(\gamma\) because seasonality is pronounced). The model captures the December surge and the underlying growth trend simultaneously, producing 12-month-ahead forecasts with associated prediction intervals.

In R:

ts_data <- ts(sales_vector, start = c(2020, 1), frequency = 12)
hw_model <- HoltWinters(ts_data)
library(forecast)                          # provides forecast() and its plot method
forecast_hw <- forecast(hw_model, h = 12)
plot(forecast_hw)

11.6 Trend Regression

When a time series has a clear linear trend, a regression of \(Y_t\) on time \(t\) is a natural starting point:

\[ Y_t = \beta_0 + \beta_1 t + \varepsilon_t \]

Forecasting: \(\hat{Y}_{t} = \hat{\beta}_0 + \hat{\beta}_1 t\). Extend \(t\) beyond the sample to forecast.

For exponential (multiplicative) growth, linearize with a log transform: regress \(\ln Y_t\) on \(t\), then exponentiate the forecast.

Caution: Regression residuals in time series are often autocorrelated — the Durbin-Watson statistic tests for first-order autocorrelation (\(DW \approx 2\) indicates no autocorrelation; values \(<1.5\) or \(>2.5\) signal problems). Autocorrelated errors invalidate standard t-tests and confidence intervals.
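
A sketch of a linear trend fit with a Durbin-Watson check, assuming a numeric series y (dwtest() is from the lmtest package):

library(lmtest)
trend_df  <- data.frame(t = seq_along(y), y = y)   # y: the observed series (placeholder)
trend_fit <- lm(y ~ t, data = trend_df)
summary(trend_fit)    # slope = estimated trend per period
dwtest(trend_fit)     # DW near 2 suggests no first-order autocorrelation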


Chapter 12: Putting It All Together — Statistical Workflow in R

12.1 A Complete Analysis Pipeline

A rigorous quantitative analysis in business follows a standard pipeline. Each stage connects to specific tools covered in AFM 113.

Stage 1 — Understand and clean the data:

library(tidyverse)
glimpse(df)
summary(df)
df %>% filter(is.na(revenue))   # Check missing values

Stage 2 — Explore distributions:

ggplot(df, aes(x = revenue)) + geom_histogram(bins = 20) + theme_minimal()
ggplot(df, aes(sample = revenue)) + stat_qq() + stat_qq_line()
shapiro.test(df$revenue[1:50])  # Shapiro-Wilk (for n <= 5000)

Stage 3 — Choose and apply the right test:

Question | Parametric Test | Non-Parametric Alternative
One group vs. target mean | One-sample t-test | Wilcoxon signed-rank
Two independent group means | Welch two-sample t-test | Mann-Whitney U
Two related group means | Paired t-test | Wilcoxon signed-rank (paired)
Three or more group means | One-way ANOVA + Tukey | Kruskal-Wallis
Two categorical variables | Chi-square independence test | Fisher's exact test
One proportion vs. target | One-sample z-test for proportions | Exact binomial test
Two proportions | Two-sample z-test for proportions | Fisher's exact test

Stage 4 — Build a regression model:

model <- lm(revenue ~ ad_spend + stores + region, data = df)
summary(model)    # Coefficients, SE, t-stats, p-values, R-squared, F-test
confint(model)    # 95% CI for all coefficients
library(car)
vif(model)        # Multicollinearity check (vif() is from the car package)
plot(model)       # Residual diagnostics

Stage 5 — Forecast (time series):

ts_sales <- ts(df$sales, start = c(2020, 1), frequency = 12)
hw <- HoltWinters(ts_sales)
library(forecast)
forecast(hw, h = 6)   # 6-month-ahead forecast

12.2 Worked Comprehensive Example

Retail chain analysis: A national retail chain wants to understand what drives store-level monthly revenue. Data are available for 120 stores over 12 months (1{,}440 observations). Variables: Revenue (\$000), AdSpend (\$000), StoreSize (sq ft, 000s), Region (North/South/East/West), OnlineComp (local online competition index, 0–100).

Step 1 — Descriptive statistics reveal right-skewed revenue distribution (mean $312K, median $285K, SD $98K). Log-transform considered.

Step 2 — ANOVA tests whether mean revenue differs by region. F-statistic = 11.43 (\(p < 0.001\)). Tukey post-hoc shows: East significantly higher than West and South (adjusted p \(< 0.05\)); North vs. South not significant.

Step 3 — Correlation shows AdSpend (\(r = 0.71\)), StoreSize (\(r = 0.65\)), OnlineComp (\(r = -0.38\)) all correlated with Revenue.

Step 4 — MLR:

\[ \widehat{\text{Revenue}} = 88.3 + 3.42\,\text{AdSpend} + 5.18\,\text{StoreSize} - 1.22\,\text{OnlineComp} + \text{(region indicators)} \]

\(R^2 = 0.734\), adjusted \(R^2 = 0.726\), overall \(F = 87.3\) (\(p < 0.001\)). All predictors significant (\(p < 0.05\)). VIFs all below 3 (no serious multicollinearity). Residual plots show approximate homoscedasticity and normality.

Interpretation: Holding other factors constant, each $1{,}000 additional advertising is associated with $3{,}420 additional monthly revenue. An additional 1{,}000 sq ft of store size adds $5{,}180, on average. Each 1-point increase in the online competition index reduces revenue by $1{,}220 — highlighting the strategic importance of managing competitive pressure from e-commerce.

12.3 Common Errors and How to Avoid Them

Error | Description | Correct Approach
Multiple testing | Running 10 tests at \(\alpha = 0.05\) without adjustment; familywise error \(\approx 40\%\) | Apply Bonferroni or FDR correction; use ANOVA instead of pairwise t-tests
p-hacking | Trying many models until \(p < 0.05\) appears | Pre-register hypotheses; report all tests performed
Extrapolation | Predicting far outside the range of observed \(X\) | Report the observed range; qualify predictions as extrapolations
Confusing CI and PI | Using a confidence interval for the mean to bound individual predictions | Use prediction intervals for individual outcomes
Ignoring residual diagnostics | Trusting regression results without checking assumptions | Always plot residuals vs. fitted, and QQ plot of residuals
Omitted variable bias | Leaving a confounding variable out of MLR | Use domain knowledge to include all relevant predictors; acknowledge limitations
Treating \(R^2\) as the only metric | High \(R^2\) can coexist with poor predictions, heteroscedasticity, multicollinearity | Report residual standard error, VIF, and diagnostic plots alongside \(R^2\)

12.4 Glossary of Key Terms

Term | Definition
p-value | Probability of observed (or more extreme) data under \(H_0\)
Significance level \(\alpha\) | Pre-set maximum Type I error rate
Statistical power \(1 - \beta\) | Probability of correctly rejecting a false \(H_0\)
Confidence interval | Range of plausible values for a parameter at a given confidence level
Prediction interval | Range for a single future observation; always wider than CI
F-statistic | Ratio of explained to unexplained variance (ANOVA, regression)
Residual | Difference between observed and fitted value: \(e_i = y_i - \hat{y}_i\)
SST / SSR / SSE | Total / Regression / Error sums of squares
Multicollinearity | High correlation among predictors in MLR
Heteroscedasticity | Non-constant error variance across predictor values
VIF | Variance inflation factor; measures severity of multicollinearity
AIC / BIC | Model selection criteria penalizing complexity
Moving average | Smoothed time series estimate using local window of observations
Exponential smoothing | Forecast weighting past observations geometrically
Holt-Winters | Triple exponential smoothing for trend + seasonality
MAPE | Mean absolute percentage error — common forecast accuracy metric

Appendix A: Critical Value Reference Tables

Standard Normal Critical Values

Confidence Level | \(\alpha\) | \(z_{\alpha/2}\) (two-tailed)
90% | 0.10 | 1.645
95% | 0.05 | 1.960
99% | 0.01 | 2.576
99.9% | 0.001 | 3.291

Student’s t Critical Values (two-tailed, \(\alpha = 0.05\))

df | \(t_{0.025, df}\) | df | \(t_{0.025, df}\)
5 | 2.571 | 20 | 2.086
10 | 2.228 | 30 | 2.042
15 | 2.131 | 60 | 2.000
18 | 2.101 | 120 | 1.980
19 | 2.093 | \(\infty\) | 1.960

Chi-Square Critical Values (\(\alpha = 0.05\), upper tail)

df | \(\chi^2_{df, 0.05}\) | df | \(\chi^2_{df, 0.05}\)
1 | 3.841 | 8 | 15.507
2 | 5.991 | 9 | 16.919
3 | 7.815 | 10 | 18.307
4 | 9.488 | 15 | 24.996
5 | 11.070 | 20 | 31.410
6 | 12.592 | 25 | 37.652
7 | 14.067 | 30 | 43.773

Appendix B: R Quick Reference for AFM 113

# ---- Descriptive Statistics ----
mean(x); median(x); sd(x); var(x); IQR(x)
summary(x)
table(group)

# ---- Normality Assessment ----
shapiro.test(x)
qqnorm(x); qqline(x)

# ---- One-Sample Tests ----
t.test(x, mu = mu0, alternative = "two.sided")   # t-test
prop.test(x = k, n = n, p = p0)                 # Proportion z-test
wilcox.test(x, mu = M0)                          # Wilcoxon signed-rank

# ---- Two-Sample Tests ----
t.test(x, y, alternative = "two.sided")          # Welch t-test
t.test(x, y, paired = TRUE)                      # Paired t-test
prop.test(c(x1, x2), c(n1, n2))                 # Two proportions
var.test(x, y)                                   # F-test for variances
wilcox.test(x, y)                                # Mann-Whitney U

# ---- Chi-Square Tests ----
chisq.test(table_data)                           # Independence or GoF
chisq.test(x, p = c(0.25, 0.25, 0.25, 0.25))   # GoF with specified probs

# ---- ANOVA ----
model_aov <- aov(y ~ group, data = df)
summary(model_aov)
TukeyHSD(model_aov)
pairwise.t.test(y, group, p.adjust.method = "bonferroni")
kruskal.test(y ~ group, data = df)               # Non-parametric ANOVA

# ---- Two-Way ANOVA ----
model_2way <- aov(y ~ factorA * factorB, data = df)
summary(model_2way)
interaction.plot(df$factorA, df$factorB, df$y)

# ---- Simple Linear Regression ----
model_slr <- lm(y ~ x, data = df)
summary(model_slr)
confint(model_slr)
predict(model_slr, newdata = data.frame(x = x_star), interval = "confidence")
predict(model_slr, newdata = data.frame(x = x_star), interval = "prediction")
plot(model_slr)   # Four diagnostic plots

# ---- Multiple Linear Regression ----
model_mlr <- lm(y ~ x1 + x2 + x3, data = df)
summary(model_mlr)
library(car); vif(model_mlr)
library(MASS); stepAIC(model_mlr, direction = "both")
AIC(model_mlr); BIC(model_mlr)

# ---- Heteroscedasticity-Robust SEs ----
library(lmtest); library(sandwich)
coeftest(model_mlr, vcov = vcovHC(model_mlr, type = "HC3"))

# ---- Time Series ----
ts_data <- ts(y, start = c(2020, 1), frequency = 12)
plot(ts_data)
hw <- HoltWinters(ts_data)
library(forecast)
forecast_hw <- forecast(hw, h = 12)
plot(forecast_hw)
autoplot(forecast_hw)

Appendix C: Formula Sheet Summary

One-sample t-statistic:

\[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}, \quad df = n-1 \]

Two-sample Welch t-statistic:

\[ t = \frac{(\bar{x}_1 - \bar{x}_2)}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} \]

Paired t-statistic:

\[ t = \frac{\bar{d}}{s_d/\sqrt{n}}, \quad df = n-1 \]

Proportion z-statistic (one-sample):

\[ z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} \]

Chi-square statistic:

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \]

OLS slope and intercept:

\[ \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]

R-squared:

\[ R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST} \]

Adjusted R-squared:

\[ \bar{R}^2 = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)} \]

ANOVA F-statistic:

\[ F = \frac{MSB}{MSW} = \frac{SSB/(k-1)}{SSW/(N-k)} \]

Prediction interval for new observation:

\[ \hat{y}^* \pm t_{\alpha/2,\, n-2} \cdot s_e \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}} \]

Simple exponential smoothing:

\[ \hat{Y}_{t+1} = \alpha Y_t + (1 - \alpha)\hat{Y}_t \]

Sample size for given power (two-tailed):

\[ n = \left(\frac{(z_{\alpha/2} + z_\beta)\sigma}{\delta}\right)^2 \]