AFM 112: Analytic Methods for Business 1

Laila Rohani

Estimated study time: 1 hr 38 min

Sources and References

Primary reference:
  • Balka, J. (open access). Statistics: A First Course (various editions, freely available online). This text is the backbone of the quantitative methods material.
Supplementary:
  • Wackerly, D., Mendenhall, W., & Scheaffer, R. (2008). Mathematical Statistics with Applications (7th ed.). Brooks/Cole.
  • DeVore, J. L. (2016). Probability and Statistics for Engineering and the Sciences (9th ed.). Cengage Learning.


Chapter 1: Types of Data and Measurement Scales

1.1 Why Data Classification Matters

Before applying any statistical technique, the analyst must understand the nature of the data at hand. The appropriate summary statistics, charts, and inferential procedures all depend on whether a variable is categorical, ordinal, or numerical. Using a histogram for a nominal variable or computing a mean on an ordinal scale produces misleading results. For AFM students, misclassifying a variable is a routine source of error that shows up in both spreadsheet work and programmatic analysis.

Variable: A characteristic that can assume different values across observational units. In a dataset of loan applications, variables might include applicant age (numerical, continuous), number of prior credit inquiries (numerical, discrete), home ownership status (categorical, nominal), and credit rating (ordinal).
Observation (or case): A single unit of measurement—one row in a rectangular dataset. Each observation records the values of all variables for that unit.

1.2 Scales of Measurement

Nominal Scale

Nominal variables classify observations into named categories with no inherent order. Arithmetic operations on nominal codes are meaningless.

Examples in accounting and finance:

  • Industry sector (Technology, Financial Services, Energy, Healthcare)
  • Type of audit opinion (Unqualified, Qualified, Adverse, Disclaimer)
  • Payment method (Cash, Credit card, Wire transfer, Cheque)
  • Province of incorporation

Because nominal categories carry no numerical meaning, the only meaningful summary statistics are counts and proportions. The appropriate chart is a bar chart or pie chart (with the caveat that pie charts become difficult to read with more than four or five categories).

Ordinal Scale

Ordinal variables have categories that can be ranked, but the gaps between adjacent ranks are not necessarily equal. A customer satisfaction score of 4 is better than 3, but the difference between 4 and 3 need not equal the difference between 3 and 2.

Examples in accounting and finance:

  • Bond credit rating (AAA, AA, A, BBB, BB, B, CCC, D)
  • Audit risk level (Low, Medium, High, Very High)
  • Employee performance tier (Below expectations, Meets expectations, Exceeds expectations, Outstanding)
  • Education level of survey respondent (High school, Bachelor’s, Master’s, Doctoral)

The median and mode are meaningful for ordinal data, but the mean is technically not, because computing a mean requires that differences between values be comparable. In practice, means are often computed on Likert-scale responses (1–5), but this is a modelling assumption, not a mathematical fact.

Interval Scale

Interval scales have equal spacing between values but lack a true zero. The zero on an interval scale is arbitrary, so ratios are not interpretable.

Classic example: Temperature in Celsius. 20°C is not “twice as warm” as 10°C, because 0°C does not represent the absence of temperature.

Finance-adjacent example: Calendar dates. The difference between January 1 and February 1 is 31 days (interval is meaningful), but a ratio of dates is not.

Ratio Scale

Ratio scales have equal spacing and a meaningful zero that represents the complete absence of the quantity. All arithmetic operations are valid.

Examples in accounting and finance:

  • Account balance (a balance of $0 means no money; a balance of $200,000 is twice a balance of $100,000)
  • Revenue, cost, gross profit
  • Number of transactions
  • Days payable outstanding
  • Interest rate (%)

The majority of quantitative financial variables are on a ratio scale. All standard descriptive statistics (mean, variance, standard deviation, coefficient of variation) and inferential procedures apply.

1.3 Cross-Sectional vs. Time-Series Data

Cross-sectional data: Observations collected on multiple units at a single point in time. Example: the closing stock prices of all TSX-listed companies on March 31, 2026.
Time-series data: Repeated observations on a single unit across multiple time periods. Example: the quarterly revenue of Royal Bank of Canada from 2010 Q1 to 2025 Q4.
Panel data (longitudinal data): Repeated observations on multiple units across multiple time periods. Example: annual earnings per share for each company in the S&P/TSX 60 from 2015 to 2025.

The type of data structure governs the appropriate model. Time-series data requires techniques that account for autocorrelation (the tendency for observations close in time to be correlated). Cross-sectional data requires accounting for heterogeneity across units.

1.4 Population vs. Sample

Population: The complete set of all observations of interest. The population is defined by the research question. For an internal auditor at a bank, the population might be all wire transfers above $50,000 processed in fiscal year 2025.
Sample: A subset of the population actually observed. Because populations are often too large to examine completely, statistical methods allow us to draw conclusions about the population from the sample.
Parameter: A numerical summary of a population (e.g., population mean \( \mu \), population variance \( \sigma^2 \)). Parameters are usually unknown.
Statistic: A numerical summary computed from sample data (e.g., sample mean \( \bar{x} \), sample variance \( s^2 \)). Statistics are used to estimate parameters.

Chapter 2: Frequency Distributions and Histograms

2.1 Organizing Raw Data

Raw data in its unordered form is difficult to interpret. A frequency distribution reorganizes the data to reveal patterns in the distribution.

Frequency distribution: A tabular summary showing how many observations fall into each category (for categorical data) or each class interval (for numerical data).

Categorical Frequency Table

Example 2.1 — Audit opinions issued by a regional accounting firm: In a sample of 120 audit engagements, the following opinions were issued:
Opinion Type | Frequency | Relative Frequency | Percentage
-------------|-----------|--------------------|-----------
Unqualified  | 94        | 0.783              | 78.3%
Qualified    | 18        | 0.150              | 15.0%
Adverse      | 5         | 0.042              | 4.2%
Disclaimer   | 3         | 0.025              | 2.5%
Total        | 120       | 1.000              | 100.0%

The relative frequency for each category is computed as: frequency / total. The most common opinion is unqualified (78.3%), which is typical for a firm with a rigorous client-acceptance process.
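A few lines of Python (variable names are mine, not from the text) reproduce the relative-frequency column of the table:

```python
# Recomputing the relative-frequency column of Example 2.1.
# Counts are transcribed from the audit-opinion table.
counts = {"Unqualified": 94, "Qualified": 18, "Adverse": 5, "Disclaimer": 3}
total = sum(counts.values())  # 120 engagements

for opinion, freq in counts.items():
    rel = freq / total  # relative frequency = frequency / total
    print(f"{opinion:<12} {freq:>4} {rel:.3f} {rel:.1%}")
```

The same pattern scales to any categorical frequency table: tally the counts, then divide each by the grand total.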

2.2 Frequency Distributions for Numerical Data

For numerical data, observations are grouped into class intervals (also called bins or classes). The choice of the number of classes affects how much detail the distribution reveals.

Sturges’ rule suggests the number of classes \( k \) for a dataset of size \( n \):

\[ k \approx 1 + 3.322 \log_{10}(n) \]

For \( n = 100 \): \( k \approx 1 + 3.322 \times 2 = 7.6 \), so use 7 or 8 classes. For \( n = 1000 \): \( k \approx 1 + 3.322 \times 3 = 10.97 \), so use 10 or 11 classes.

Class width is then approximately:

\[ \text{Width} \approx \frac{\text{Maximum} - \text{Minimum}}{k} \]

Round up to a convenient number so the classes cover the full range.

Example 2.2 — Processing time for insurance claims: An insurance company records the number of days to process 80 personal injury claims. The minimum is 3 days and the maximum is 62 days.

Using Sturges’ rule: \( k \approx 1 + 3.322 \log_{10}(80) \approx 7.3 \), so 7 classes.

Class width \( \approx (62 - 3)/7 = 8.4 \), rounded up to 9. Starting at 3:

Class Interval | Midpoint | Frequency | Relative Frequency
---------------|----------|-----------|-------------------
3 – 11         | 7        | 12        | 0.150
12 – 20        | 16       | 22        | 0.275
21 – 29        | 25       | 19        | 0.238
30 – 38        | 34       | 13        | 0.163
39 – 47        | 43       | 8         | 0.100
48 – 56        | 52       | 4         | 0.050
57 – 65        | 61       | 2         | 0.025
Total          |          | 80        | 1.001

(Rounding to three decimal places causes the relative frequencies to sum to 1.001 rather than exactly 1.000.)

The distribution is right-skewed: most claims are processed within 20 days, but a small number take over 50 days.
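The binning arithmetic in Example 2.2 can be checked with a short Python sketch (names are illustrative):

```python
import math

# Sturges' rule and class width for Example 2.2:
# n = 80 claims, minimum 3 days, maximum 62 days.
n, x_min, x_max = 80, 3, 62

k = round(1 + 3.322 * math.log10(n))      # 7.32 -> 7 classes
width = math.ceil((x_max - x_min) / k)    # 59 / 7 = 8.43 -> round up to 9
print(k, width)  # 7 9
```

Rounding the width up (rather than to the nearest integer) guarantees the seven classes cover the full 3-to-62 range.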

2.3 Cumulative Frequency Distributions

A cumulative frequency column records the number of observations at or below the upper bound of each class. A cumulative relative frequency records this as a proportion.

Extending Example 2.2:

Class (Upper Bound) | Frequency | Cumulative Frequency | Cumulative Relative Frequency
--------------------|-----------|----------------------|------------------------------
≤ 11                | 12        | 12                   | 0.150
≤ 20                | 22        | 34                   | 0.425
≤ 29                | 19        | 53                   | 0.663
≤ 38                | 13        | 66                   | 0.825
≤ 47                | 8         | 74                   | 0.925
≤ 56                | 4         | 78                   | 0.975
≤ 65                | 2         | 80                   | 1.000

From this table we can immediately answer questions such as: “What proportion of claims are processed within 29 days?” Answer: 66.3%.
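The cumulative column can be rebuilt mechanically with a running sum; the sketch below uses the class frequencies from Example 2.2:

```python
from itertools import accumulate

# Cumulative (relative) frequencies for the seven classes of Example 2.2.
upper_bounds = [11, 20, 29, 38, 47, 56, 65]
freqs = [12, 22, 19, 13, 8, 4, 2]

cum = list(accumulate(freqs))                 # running totals
cum_rel = [c / cum[-1] for c in cum]          # e.g. 53/80 = 0.6625 for "<= 29"

for ub, c in zip(upper_bounds, cum):
    print(f"<= {ub}: {c} of {cum[-1]} claims")
```

Note that 53/80 is exactly 0.6625; the table reports it as 0.663 after rounding half up.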

2.4 Histograms

A histogram is the graphical counterpart of a numerical frequency distribution. The horizontal axis represents the measurement scale and the vertical axis represents frequency or relative frequency. The bars are contiguous (touching), reflecting that the underlying variable is continuous.

Key histogram features to look for:
  • Shape: Is the distribution symmetric, right-skewed (long right tail), or left-skewed (long left tail)?
  • Centre: Around what value is the mass of the distribution located?
  • Spread: How wide is the distribution?
  • Modality: Does the histogram show one peak (unimodal), two peaks (bimodal), or more?
  • Outliers: Are there isolated bars far from the main body?

Right-skewed (positively skewed) distributions are common in financial data: income, transaction values, insurance claim amounts, and waiting times all tend to have long right tails because values cannot be negative but can be arbitrarily large.

Left-skewed (negatively skewed) distributions occur when values cluster near a natural maximum—for example, exam scores that are truncated at 100.


Chapter 3: Measures of Central Tendency

3.1 The Arithmetic Mean

The arithmetic mean is the most widely used measure of centre. For a sample of \( n \) observations \( x_1, x_2, \ldots, x_n \):

\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

For a population of \( N \) observations, the population mean is denoted \( \mu \):

\[ \mu = \frac{1}{N} \sum_{i=1}^{N} x_i \]

Properties of the Mean

  1. Uniqueness: Every dataset has exactly one mean.
  2. Uses all observations: Every data value contributes to the mean.
  3. Algebraic tractability: The mean is the centre of gravity of the distribution — the sum of deviations from the mean is exactly zero: \( \sum_{i=1}^{n}(x_i - \bar{x}) = 0 \).
  4. Sensitivity to outliers: A single extreme value can pull the mean far from the bulk of the data.
Example 3.1 — Average account balance: A credit union records the savings account balances (in thousands of dollars) for seven customers: 4.2, 8.7, 3.1, 12.4, 6.8, 5.5, 9.1.

\[ \bar{x} = \frac{4.2 + 8.7 + 3.1 + 12.4 + 6.8 + 5.5 + 9.1}{7} = \frac{49.8}{7} \approx 7.114 \]

The average account balance is $7,114. Now suppose a wealthy customer joins with a balance of $210,000. The new mean becomes \( (49.8 + 210)/8 = 259.8/8 = 32.475 \), or $32,475—a value larger than seven of the eight customers’ balances. This illustrates the mean’s sensitivity to outliers.
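A quick sketch using the standard-library statistics module reproduces both means (the 210 figure is the new customer's balance in $000s):

```python
from statistics import mean

# Example 3.1: the effect of one large balance on the mean.
balances = [4.2, 8.7, 3.1, 12.4, 6.8, 5.5, 9.1]   # $000s

before = mean(balances)              # 49.8 / 7 ≈ 7.114
after = mean(balances + [210.0])     # 259.8 / 8 ≈ 32.475
print(round(before, 3), round(after, 3))
```

One observation moved the mean from about $7,114 to $32,475, illustrating the sensitivity described above.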

Weighted Mean

When observations carry different weights (e.g., different numbers of units, different portfolio allocations), the weighted mean is appropriate:

\[ \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]
Example 3.2 — Portfolio weighted return: An investor holds three assets with the following characteristics:
Asset             | Weight | Annual Return
------------------|--------|--------------
Canadian equities | 0.50   | 8.2%
US equities       | 0.30   | 11.4%
Fixed income      | 0.20   | 3.6%
\[ \bar{r}_w = (0.50)(8.2) + (0.30)(11.4) + (0.20)(3.6) = 4.10 + 3.42 + 0.72 = 8.24\% \]

The portfolio’s expected return is 8.24% per annum.
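The same weighted-mean arithmetic in Python, with the lists transcribed from the table above:

```python
# Example 3.2: portfolio return as an allocation-weighted mean.
weights = [0.50, 0.30, 0.20]
returns = [8.2, 11.4, 3.6]   # annual %, one entry per asset

# Dividing by sum(weights) makes the formula correct even when the
# weights do not already sum to 1.
r_w = sum(w * r for w, r in zip(weights, returns)) / sum(weights)
print(round(r_w, 2))  # 8.24
```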

3.2 The Median

The median is the value that divides the sorted dataset into two equal halves. It is the 50th percentile.

Procedure:

  1. Sort the observations from smallest to largest.
  2. If \( n \) is odd, the median is the middle value at position \( (n+1)/2 \).
  3. If \( n \) is even, the median is the average of the two middle values at positions \( n/2 \) and \( n/2 + 1 \).
Example 3.3 — Median salary in an accounting department: Nine staff members earn (in thousands of dollars per year): 52, 58, 61, 65, 68, 72, 80, 95, 145.

Sorted (already sorted). \( n = 9 \), so the median is at position \( (9+1)/2 = 5 \).

Median = 68 (the 5th value) = $68,000.

The mean: \( \bar{x} = (52+58+61+65+68+72+80+95+145)/9 = 696/9 \approx 77.3 \) thousand dollars.

The senior partner’s salary of $145,000 pulls the mean up to $77,333, well above seven of the nine employees’ salaries. The median of $68,000 better represents a “typical” salary in this department.
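Example 3.3's mean-median comparison, verified with the statistics module:

```python
from statistics import mean, median

# Example 3.3: the median resists the senior partner's salary; the mean does not.
salaries = [52, 58, 61, 65, 68, 72, 80, 95, 145]   # $000s

print(median(salaries))           # 68
print(round(mean(salaries), 1))   # 77.3
```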

When to prefer the median over the mean: Use the median when the distribution is skewed or contains outliers. Common examples include house prices, executive compensation, insurance claims, and time-to-event data (e.g., time until a customer defaults).

3.3 The Mode

The mode is the most frequently occurring value (or values) in the dataset. A dataset can be unimodal (one mode), bimodal (two modes), multimodal (more than two), or have no mode if all values are distinct.

Example 3.4 — Mode in retail transaction analysis: A retailer's checkout data for a Saturday shows transaction totals (rounded to the nearest dollar): many transactions cluster at $5, $10, $15, and $20 due to popular fixed-price meal deals. The mode of $10 identifies the most popular transaction value and informs inventory decisions.

The mode is especially useful for nominal variables, where the mean and median are not defined. “The modal payment method is credit card” is a meaningful statement; “the mean payment method is 2.3” is not.

3.4 Relationship Among Mean, Median, and Mode

Empirical relationship for unimodal, moderately skewed distributions:

For a unimodal distribution that is roughly symmetric, mean ≈ median ≈ mode.

For a right-skewed distribution: mode < median < mean. The long right tail pulls the mean upward.

For a left-skewed distribution: mean < median < mode. The long left tail pulls the mean downward.

This relationship is a useful quick check: if you observe \(\bar{x} > \text{Median}\), the distribution is likely right-skewed, which motivates checking for outliers or using a log transformation.


Chapter 4: Measures of Spread

4.1 The Range

\[ \text{Range} = x_{\max} - x_{\min} \]

The range is the simplest measure of spread and is trivially easy to compute. Its weakness is that it depends on only two observations and is highly sensitive to outliers.

Example 4.1: Daily returns (%) for two mutual funds over five days:

Fund A: 0.5, 0.8, 0.6, 0.7, 0.9 — Range = 0.9 − 0.5 = 0.4%
Fund B: −1.2, 0.0, 0.7, 0.5, 2.0 — Range = 2.0 − (−1.2) = 3.2%

Fund B is considerably more volatile. The range captures the difference but provides no information about how observations are distributed within that interval.

4.2 Variance and Standard Deviation

The sample variance measures the average squared deviation of each observation from the sample mean:

\[ s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} \]

The denominator is \( n - 1 \) (degrees of freedom) rather than \( n \) to make \( s^2 \) an unbiased estimator of the population variance \( \sigma^2 \). This is called Bessel’s correction.

The sample standard deviation is the square root of the sample variance:

\[ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}} \]

The standard deviation is expressed in the same units as the original data, making it interpretable as a “typical distance from the mean.”

The population variance and population standard deviation use \( N \) in the denominator:

\[ \sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}, \qquad \sigma = \sqrt{\sigma^2} \]
Example 4.2 — Computing variance and standard deviation by hand: Five quarterly revenue figures (in millions of dollars): 12, 15, 11, 18, 14.

Step 1 — Compute the mean:

\[ \bar{x} = \frac{12 + 15 + 11 + 18 + 14}{5} = \frac{70}{5} = 14 \]

Step 2 — Compute deviations and squared deviations:

\( x_i \) | \( x_i - \bar{x} \) | \( (x_i - \bar{x})^2 \)
----------|---------------------|------------------------
12        | −2                  | 4
15        | +1                  | 1
11        | −3                  | 9
18        | +4                  | 16
14        | 0                   | 0
Sum       | 0                   | 30

Step 3 — Compute variance and standard deviation:

\[ s^2 = \frac{30}{5 - 1} = \frac{30}{4} = 7.5 \text{ (millions}^2\text{)} \]

\[ s = \sqrt{7.5} \approx 2.739 \text{ million dollars} \]

Revenue in a typical quarter deviates from the mean of $14M by about $2.74M.
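The same computation with the statistics module, which also uses the \( n - 1 \) denominator:

```python
from statistics import mean, variance, stdev

# Example 4.2 verified with the standard library (n - 1 denominators).
revenues = [12, 15, 11, 18, 14]   # quarterly revenue, $M

m = mean(revenues)        # 14
s2 = variance(revenues)   # 30 / 4 = 7.5 ($M squared)
s = stdev(revenues)       # sqrt(7.5) ≈ 2.739
print(m, s2, round(s, 3))
```

For population (N-denominator) versions, the stdlib provides `pvariance` and `pstdev`.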

Computational Formula for Variance

When computing by hand with large datasets, the computational formula is convenient because it needs only the running totals \( \sum x_i \) and \( \sum x_i^2 \), avoiding the error that accumulates when a rounded mean is subtracted from every observation:

\[ s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} \]

4.3 The Interquartile Range (IQR)

Percentiles and quartiles divide the sorted data into equal-sized groups:

  • \( Q_1 \) (first quartile, 25th percentile): 25% of observations fall below this value.
  • \( Q_2 \) (second quartile, 50th percentile): the median.
  • \( Q_3 \) (third quartile, 75th percentile): 75% of observations fall below this value.

The interquartile range is:

\[ \text{IQR} = Q_3 - Q_1 \]

The IQR is the range of the middle 50% of the data. It is robust to outliers because extreme values in the tails do not affect \( Q_1 \) or \( Q_3 \).

Example 4.3 — IQR for loan processing times: Sorted processing times (days) for 12 loans: 2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 18, 35.

\( Q_1 \) = average of 3rd and 4th values = (4 + 5)/2 = 4.5 days. \( Q_3 \) = average of 9th and 10th values = (12 + 15)/2 = 13.5 days.

\[ \text{IQR} = 13.5 - 4.5 = 9 \text{ days} \]

Note that the outlier of 35 days has no impact on the IQR, though it would substantially inflate the standard deviation.
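Example 4.3's quartiles follow the "average the two straddling values" convention used in the text; the sketch below hard-codes that convention rather than calling a library quantile function, since interpolation methods differ across libraries:

```python
# Example 4.3: quartiles and IQR for 12 sorted loan processing times.
times = sorted([2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 18, 35])

q1 = (times[2] + times[3]) / 2   # average of 3rd and 4th values -> 4.5
q3 = (times[8] + times[9]) / 2   # average of 9th and 10th values -> 13.5
iqr = q3 - q1
print(q1, q3, iqr)  # 4.5 13.5 9.0
```

Replacing 35 with any larger value leaves q1, q3, and the IQR unchanged, which is the robustness property described above.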

4.4 The Coefficient of Variation

The coefficient of variation (CV) expresses the standard deviation as a percentage of the mean, enabling meaningful comparison of variability across datasets with different scales or units:

\[ \text{CV} = \frac{s}{\bar{x}} \times 100\% \]
Example 4.4 — Comparing fund volatility:
Fund               | Mean Annual Return | Standard Deviation | CV
-------------------|--------------------|--------------------|------
Fund A (Large Cap) | 9.5%               | 3.2%               | 33.7%
Fund B (Small Cap) | 14.1%              | 8.6%               | 61.0%
Fund C (Bond)      | 4.2%               | 1.1%               | 26.2%

Fund C has the lowest CV (least variable relative to its return), while Fund B has the highest (most variable relative to its return). On a relative-volatility basis, an investor would favour Fund C overall, and Fund A among the equity funds.

Note: CV is undefined when the mean is zero and misleading when the mean is negative or close to zero, which is why it must be used carefully with return data that can include negative years.
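Computing and ranking the CVs from Example 4.4 (figures transcribed from the table; the dictionary layout is mine):

```python
# Example 4.4: coefficient of variation for three funds.
funds = {                         # fund: (mean return %, std dev %)
    "Fund A (Large Cap)": (9.5, 3.2),
    "Fund B (Small Cap)": (14.1, 8.6),
    "Fund C (Bond)": (4.2, 1.1),
}

cv = {name: s / m * 100 for name, (m, s) in funds.items()}
for name in sorted(cv, key=cv.get):       # least to most relative volatility
    print(f"{name}: CV = {cv[name]:.1f}%")
```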


Chapter 5: Skewness, Kurtosis, and Box Plots

5.1 Skewness

Skewness quantifies the asymmetry of a distribution. The most common formula for the sample skewness coefficient is:

\[ g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{s^3} \]
  • \( g_1 = 0 \): Perfectly symmetric distribution.
  • \( g_1 > 0 \): Right-skewed (positive skew); the right tail is longer.
  • \( g_1 < 0 \): Left-skewed (negative skew); the left tail is longer.

As a practical guideline:

  • \( |g_1| < 0.5 \): approximately symmetric.
  • \( 0.5 \leq |g_1| < 1 \): moderate skewness.
  • \( |g_1| \geq 1 \): substantial skewness.
Business implications of skewness: Right-skewed financial data (income, losses, claim amounts) means that the mean overstates what a typical observation looks like. The mean loss on an insurance policy might be $2,000, but the median might be only $400 — meaning most claims are small, with a few catastrophic losses driving up the average. Actuaries must model the full distribution, not just its mean, to price policies correctly.
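The coefficient \( g_1 \) defined above can be coded directly. The function below follows the text's formula, mixing the \( 1/n \) third moment with the \( n-1 \) standard deviation, and is applied to the loan processing times of Example 4.3, which are visibly right-skewed:

```python
from statistics import mean, stdev

def skewness(data):
    """Sample skewness g1: (1/n) * sum of cubed deviations, over s^3."""
    n = len(data)
    xbar = mean(data)
    s = stdev(data)                       # n - 1 denominator
    return sum((x - xbar) ** 3 for x in data) / n / s ** 3

times = [2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 18, 35]
g1 = skewness(times)
print(round(g1, 2))  # positive and above 1: substantial right skew
```

Be aware that statistical packages implement slightly different skewness estimators (some apply small-sample corrections), so values can differ a little across tools.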

5.2 Kurtosis

Kurtosis measures the heaviness of the tails relative to a normal distribution. Excess kurtosis (kurtosis minus 3) is more commonly reported:

\[ \text{Excess kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^4}{s^4} - 3 \]
  • Excess kurtosis = 0: tails like a normal distribution (mesokurtic).
  • Excess kurtosis > 0: heavier tails than normal (leptokurtic); more extreme observations than a normal distribution would predict.
  • Excess kurtosis < 0: lighter tails (platykurtic).

Financial returns are well-documented to exhibit leptokurtosis (excess kurtosis > 0), meaning extreme gains and losses occur more often than a normal distribution would suggest. This is called the “fat tails” property and is critical for risk management—Value at Risk models that assume normality systematically underestimate tail risk.

5.3 Five-Number Summary

The five-number summary compresses a distribution into five key values:

\[ \{ x_{\min},\; Q_1,\; Q_2 \text{ (Median)},\; Q_3,\; x_{\max} \} \]

This provides a complete picture of a distribution’s location, spread, and tail behaviour.

Example 5.1: For the loan processing times from Example 4.3 (sorted: 2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 18, 35):

Five-number summary: \(\{2,\; 4.5,\; 7.5,\; 13.5,\; 35\}\)

Median = (7 + 8)/2 = 7.5 days. The large gap between \( Q_3 = 13.5 \) and the maximum of 35 reveals the right-skewed tail.

5.4 Box Plots (Box-and-Whisker Plots)

A box plot graphically displays the five-number summary with additional identification of outliers.

Construction:

  1. Draw a box from \( Q_1 \) to \( Q_3 \). The length of the box equals the IQR.
  2. Draw a line inside the box at the median \( Q_2 \).
  3. Compute fences (outlier thresholds):
    • Lower fence: \( Q_1 - 1.5 \times \text{IQR} \)
    • Upper fence: \( Q_3 + 1.5 \times \text{IQR} \)
  4. Draw whiskers extending from the box to the most extreme data points that are still within the fences.
  5. Plot any observations beyond the fences as individual points — these are potential outliers.
Example 5.2 — Box plot for loan processing times:

\( Q_1 = 4.5 \), \( Q_3 = 13.5 \), \( \text{IQR} = 9 \).

Lower fence: \( 4.5 - 1.5(9) = 4.5 - 13.5 = -9 \) days (not meaningful here; no negative times). Upper fence: \( 13.5 + 1.5(9) = 13.5 + 13.5 = 27 \) days.

The value of 35 days exceeds the upper fence of 27 days and is plotted as an outlier. The upper whisker extends to 18 (the largest non-outlier value). The lower whisker extends to 2 (the minimum, which is within the lower fence).
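A sketch of the fence computation for Example 5.2, with the quartiles carried over from Example 4.3:

```python
# Example 5.2: Tukey fences and outlier flagging for the loan times.
times = [2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 18, 35]
q1, q3 = 4.5, 13.5               # quartiles from Example 4.3
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr     # -9.0 (harmless: no negative times exist)
upper_fence = q3 + 1.5 * iqr     # 27.0

outliers = [x for x in times if x < lower_fence or x > upper_fence]
upper_whisker = max(x for x in times if x <= upper_fence)
print(outliers, upper_whisker)   # [35] 18
```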

Box plots are particularly valuable for comparing distributions across groups. A side-by-side box plot showing loan processing times by branch can immediately reveal whether one branch is systematically slower than others.


Chapter 6: Data Visualization

6.1 Principles of Effective Charts

Edward Tufte’s foundational principle of visualization is the data-ink ratio: the proportion of a chart’s ink that is devoted to displaying actual data. Maximizing the data-ink ratio — eliminating chart junk such as unnecessary gridlines, 3D effects, shadows, and decorative images — produces cleaner, more readable charts.

A second key principle is encode quantitative comparisons using position on a common scale wherever possible. Position is the most accurately perceived visual attribute. Length, area, and angle are progressively less accurate encodings — which is why bar charts (position/length) are easier to read than bubble charts (area) or pie charts (angle).

6.2 Chart Types and When to Use Them

Bar Charts (Categorical Data)

A bar chart displays the frequency, relative frequency, or mean of a categorical variable. Bars are separated by gaps (unlike a histogram), emphasizing the discrete nature of the categories.

  • Use a vertical bar chart when category names are short and the number of categories is small (≤ 6).
  • Use a horizontal bar chart when category labels are long or the number of categories is larger (makes labels readable).
Example 6.1 — Revenue by product segment: A financial services firm reports the following annual revenue by product:
Product            | Revenue ($M)
-------------------|-------------
Retail Banking     | 420
Commercial Lending | 310
Wealth Management  | 185
Insurance          | 97
Other              | 42

A horizontal bar chart with bars sorted from longest to shortest (Retail Banking at top, Other at bottom) makes it immediately clear that Retail Banking contributes nearly 40% of total revenue.

Histograms (Continuous Data)

Use a histogram to show the distribution (shape, centre, spread, outliers) of a continuous variable. Key choices:

  • Number of bins: Too few bins over-smooth and hide structure; too many bins create noise. Sturges’ rule provides a starting point.
  • Axis labels: Both axes must be labelled. The y-axis can show count, relative frequency, or density.

Scatter Plots (Relationship Between Two Variables)

A scatter plot places one variable on the x-axis and another on the y-axis, with one point per observation. Use scatter plots to:

  • Detect linear or nonlinear relationships between two numerical variables.
  • Identify outliers in the joint distribution.
  • Assess whether a linear model is appropriate before running a regression.
Example 6.2 — Scatter plot interpretation: A scatter plot of advertising spend (x-axis, \$000s) against quarterly sales (y-axis, \$M) for 40 quarters of a retail chain shows a positive linear trend with one point in the upper-right corner that is substantially above the trend line. This outlier warrants investigation: was there an unusual promotional event that quarter? Data errors? Seasonal factors?

Time Series Plots (Data Over Time)

A time series plot (line chart) connects successive observations in time order. It is the appropriate chart for tracking a variable over time. Key features to look for:

  • Trend: A long-run upward or downward movement.
  • Seasonality: A regular, repeating pattern within each year (e.g., retail sales peak in December).
  • Cyclical variation: Longer-run fluctuations tied to business cycles.
  • Irregular variation: Random, unpredictable noise.
Choosing the right chart — a decision guide:
Data Situation                          | Recommended Chart
----------------------------------------|--------------------------
One categorical variable — frequencies  | Bar chart
One continuous variable — distribution  | Histogram or box plot
Two continuous variables — relationship | Scatter plot
One variable over time                  | Time series (line) plot
Comparing distributions across groups   | Side-by-side box plots
Part-to-whole (≤ 5 categories)          | Stacked bar or pie chart
Geographic variation                    | Choropleth map

6.3 Common Visualization Errors to Avoid

  1. Truncated y-axis: A bar chart starting at a non-zero y-value exaggerates differences. Values of 96 and 100 differ by only about 4%, but with an axis starting at 95 the second bar is drawn five times as tall as the first.
  2. Dual y-axes: Two scales on the same chart can create spurious visual correlations that do not exist in the data.
  3. 3D charts: Three-dimensional bar charts distort perceived heights due to perspective effects.
  4. Pie charts with too many slices: More than five slices become impossible to read; use a bar chart instead.
  5. Overplotting: In a scatter plot with thousands of points, individual points overlap and the underlying pattern is obscured. Solutions include transparency (alpha blending), jitter, or hexagonal bin plots.

Chapter 7: Probability Fundamentals

7.1 Definitions and Notation

Random experiment: A process whose outcome is uncertain before it occurs. Examples: rolling a die, observing next quarter's revenue, testing a financial control.
Sample space (\( S \)): The set of all possible outcomes of a random experiment.
  • Die roll: \( S = \{1, 2, 3, 4, 5, 6\} \)
  • Loan outcome: \( S = \{\text{Default}, \text{No Default}\} \)
  • Daily return: \( S = (-\infty, +\infty) \) (a continuous sample space)
Event: A subset of the sample space — a collection of outcomes. An event occurs if the observed outcome belongs to the event set.

Set operations on events:

  • Union \( A \cup B \): “Either A or B (or both) occurs.”
  • Intersection \( A \cap B \): “Both A and B occur.”
  • Complement \( A^c \): “A does not occur.”
  • Mutually exclusive: \( A \cap B = \emptyset \) — A and B cannot both occur.
  • Exhaustive: \( A_1 \cup A_2 \cup \cdots \cup A_k = S \) — at least one of the events must occur.

7.2 Probability Axioms

Probability is a function that assigns a number to each event. Three axioms, due to Kolmogorov, define a valid probability measure:

Kolmogorov's Axioms:
  1. \( P(A) \geq 0 \) for any event \( A \).
  2. \( P(S) = 1 \) (the probability of some outcome occurring is 1).
  3. If \( A \) and \( B \) are mutually exclusive, \( P(A \cup B) = P(A) + P(B) \).

From these axioms, all other probability rules can be derived.

Derived rules:

  • \( P(A^c) = 1 - P(A) \) — complement rule.
  • \( P(\emptyset) = 0 \) — the empty event has probability zero.
  • \( 0 \leq P(A) \leq 1 \) for all events.

7.3 General Addition Rule

\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \]

When \( A \) and \( B \) are mutually exclusive, \( P(A \cap B) = 0 \) and the rule reduces to the additive axiom.

Example 7.1 — Quality control: A manufacturing firm inspects components coming off an assembly line. Based on historical data: 4% of components have a dimensional defect, 3% have a surface defect, and 1% have both types of defects.

What is the probability that a randomly selected component has at least one defect?

\[ P(\text{Dimensional} \cup \text{Surface}) = 0.04 + 0.03 - 0.01 = 0.06 \]

Six percent of components have at least one defect.

7.4 Conditional Probability

The conditional probability of event \( A \) given that event \( B \) has occurred is:

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0 \]

Conditioning reduces the sample space to only those outcomes in \( B \) and asks what fraction of those also belong to \( A \).

Example 7.2 — Conditional probability in audit: An auditor reviews 200 accounts. Of these, 30 contain material errors. Of the 30 error-containing accounts, 24 trigger the auditor's analytical procedure test (the test is positive). Of the 170 error-free accounts, 17 also trigger the test (false positives).

Construct the contingency table:

Error PresentError AbsentTotal
Test Positive241741
Test Negative6153159
Total30170200
\[ P(\text{Error} \mid \text{Test Positive}) = \frac{24/200}{41/200} = \frac{24}{41} \approx 0.585 \]

Given a positive test result, there is a 58.5% probability that the account actually contains a material error. This is the positive predictive value of the test.
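The positive predictive value falls straight out of the table counts:

```python
# Example 7.2: positive predictive value from the contingency table.
error_positive, clean_positive = 24, 17        # flagged accounts
total_positive = error_positive + clean_positive

ppv = error_positive / total_positive          # P(Error | Test Positive)
print(round(ppv, 3))  # 0.585
```

Dividing the two counts directly is equivalent to the (24/200)/(41/200) ratio above, since the sample size cancels.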

7.5 Independence

Events \( A \) and \( B \) are independent if the occurrence of one does not change the probability of the other (assuming \( P(B) > 0 \), so that the conditional probability is defined):

\[ P(A \mid B) = P(A) \iff A \text{ and } B \text{ are independent} \]

Equivalently:

\[ A \text{ and } B \text{ are independent} \iff P(A \cap B) = P(A) \cdot P(B) \]
Independence vs. mutual exclusivity: These concepts are easily confused but are logically opposite in spirit. Two events with positive probability cannot be both mutually exclusive and independent: if they are mutually exclusive, \( P(A \cap B) = 0 \), but independence requires \( P(A \cap B) = P(A) \cdot P(B) > 0 \). Knowing that \( A \) occurred makes \( B \) impossible — the strongest possible dependence.
Example 7.3 — Independent controls: A company uses two independent electronic controls to prevent unauthorized payments. Control A fails with probability 0.02; Control B fails with probability 0.05. What is the probability that both controls fail simultaneously (allowing an unauthorized payment)?

\[ P(\text{Both fail}) = P(A \text{ fails}) \times P(B \text{ fails}) = 0.02 \times 0.05 = 0.001 \]

The independent backup control reduces the failure probability from 2% to 0.1%.

7.6 The Law of Total Probability

If \( B_1, B_2, \ldots, B_k \) are mutually exclusive and exhaustive events (a partition of \( S \)), then for any event \( A \):

\[ P(A) = \sum_{i=1}^{k} P(A \mid B_i) \cdot P(B_i) \]
Example 7.4 — Loan default rates by tier: A bank classifies loans into three risk tiers: Low (40% of portfolio), Medium (35%), and High (25%). Historical default rates: Low = 1%, Medium = 5%, High = 15%. What is the overall default rate?

\[ P(\text{Default}) = (0.01)(0.40) + (0.05)(0.35) + (0.15)(0.25) = 0.004 + 0.0175 + 0.0375 = 0.059 \]

The overall default rate is 5.9%.
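The weighted sum in Example 7.4 maps directly to one line of code; a minimal Python sketch (tier labels taken from the example):

```python
# Overall default rate via the law of total probability (Example 7.4).
# Each tier: (portfolio weight P(B_i), conditional default rate P(Default | B_i)).
tiers = {"Low": (0.40, 0.01), "Medium": (0.35, 0.05), "High": (0.25, 0.15)}

# P(Default) = sum_i P(Default | B_i) * P(B_i)
p_default = sum(w * r for w, r in tiers.values())
print(f"Overall default rate: {p_default:.3f}")  # 0.059
```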

7.7 Bayes’ Theorem

Bayes’ theorem combines the law of total probability with conditional probability to update a probability in light of new information:

\[ P(B_i \mid A) = \frac{P(A \mid B_i) \cdot P(B_i)}{\sum_{j=1}^{k} P(A \mid B_j) \cdot P(B_j)} \]

The terms have specific names in Bayesian reasoning:

  • \( P(B_i) \): Prior probability — our belief about \( B_i \) before observing \( A \).
  • \( P(A \mid B_i) \): Likelihood — how probable is \( A \) if \( B_i \) is true?
  • \( P(B_i \mid A) \): Posterior probability — our updated belief after observing \( A \).
Example 7.5 — Medical diagnostic testing (business analogy: fraud detection): A fraud detection system flags transactions as "suspicious." Historically, 2% of transactions are fraudulent. The system correctly flags 90% of fraudulent transactions (sensitivity = 0.90) and incorrectly flags 5% of legitimate transactions (false positive rate = 0.05).

A transaction is flagged. What is the probability it is actually fraudulent?

Let \( F \) = transaction is fraudulent, \( F^c \) = legitimate. Let \( + \) = flagged by system.

Given: \( P(F) = 0.02 \), \( P(F^c) = 0.98 \), \( P(+ \mid F) = 0.90 \), \( P(+ \mid F^c) = 0.05 \).

\[ P(+) = P(+ \mid F) P(F) + P(+ \mid F^c) P(F^c) = (0.90)(0.02) + (0.05)(0.98) \]\[ = 0.018 + 0.049 = 0.067 \]\[ P(F \mid +) = \frac{P(+ \mid F) \cdot P(F)}{P(+)} = \frac{(0.90)(0.02)}{0.067} = \frac{0.018}{0.067} \approx 0.269 \]

Despite a 90% sensitivity, only 26.9% of flagged transactions are actually fraudulent. This low positive predictive value is driven by the low base rate (2%) of actual fraud. The system generates many false positives that require manual review.

This example illustrates why base rates matter so much in risk scoring and why high sensitivity alone does not guarantee a useful screening tool.
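The fraud-detection numbers from Example 7.5 can be run through Bayes' theorem in a few lines of Python (variable names are illustrative):

```python
# Bayes' theorem for the fraud-detection screen (Example 7.5).
p_fraud = 0.02         # prior P(F)
sensitivity = 0.90     # P(+ | F)
false_positive = 0.05  # P(+ | F^c)

# Law of total probability: overall flag rate P(+).
p_flag = sensitivity * p_fraud + false_positive * (1 - p_fraud)
# Posterior P(F | +): the positive predictive value of the screen.
posterior = sensitivity * p_fraud / p_flag

print(f"P(+) = {p_flag:.3f}, P(F | +) = {posterior:.3f}")
```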


Chapter 8: Discrete Probability Distributions

8.1 Random Variables

Random variable: A function that assigns a numerical value to each outcome of a random experiment. Denoted by capital letters (X, Y, Z); specific values taken by lower-case letters (x, y, z).
Discrete random variable: Takes a countable number of distinct values (e.g., 0, 1, 2, 3, ...).
Continuous random variable: Can take any value in an interval (e.g., account balance, rate of return).

Probability Mass Function (PMF)

For a discrete random variable \( X \), the probability mass function \( p(x) \) gives the probability that \( X \) equals each possible value:

\[ p(x) = P(X = x), \quad \text{for each } x \text{ in the support of } X \]

Valid PMFs satisfy: \( p(x) \geq 0 \) for all \( x \), and \( \sum_{\text{all } x} p(x) = 1 \).

Expected Value and Variance of a Discrete Random Variable

\[ E(X) = \mu = \sum_{\text{all }x} x \cdot p(x) \]\[ \text{Var}(X) = \sigma^2 = \sum_{\text{all }x} (x - \mu)^2 \cdot p(x) = E(X^2) - [E(X)]^2 \]\[ \text{SD}(X) = \sigma = \sqrt{\sigma^2} \]
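The three formulas can be evaluated directly from any tabulated PMF. A Python sketch, using a small hypothetical PMF (the late-payment distribution below is invented purely for illustration):

```python
import math

# Hypothetical PMF: X = number of late payments on an account (0 to 3).
pmf = {0: 0.50, 1: 0.30, 2: 0.15, 3: 0.05}
assert abs(sum(pmf.values()) - 1.0) < 1e-12  # probabilities sum to 1

mu = sum(x * p for x, p in pmf.items())                        # E(X)
var = sum((x - mu) ** 2 * p for x, p in pmf.items())           # Var(X), definition
var_shortcut = sum(x * x * p for x, p in pmf.items()) - mu**2  # E(X^2) - mu^2
sd = math.sqrt(var)                                            # SD(X)

print(f"E(X) = {mu:.2f}, Var(X) = {var:.4f}, SD(X) = {sd:.4f}")
```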

8.2 Bernoulli Distribution

The simplest discrete distribution models a single trial with two outcomes: success (1) or failure (0).

Bernoulli(\(p\)) distribution: A random variable \( X \) is Bernoulli with parameter \( p \) if: \[ P(X = 1) = p, \quad P(X = 0) = 1 - p = q \]\[ E(X) = p, \quad \text{Var}(X) = p(1-p) = pq \]

The Bernoulli distribution underlies the Binomial distribution and is the building block for binary outcome modelling in credit scoring, insurance underwriting, and quality control.

8.3 Binomial Distribution

The Binomial distribution models the number of successes in a fixed number of independent Bernoulli trials.

Conditions for the Binomial model: 1. A fixed number \( n \) of trials. 2. Each trial results in exactly one of two outcomes (success / failure). 3. Trials are independent. 4. The probability of success \( p \) is constant across all trials.
Binomial(\(n, p\)) PMF: \[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, 2, \ldots, n \]

where \( \binom{n}{k} = \frac{n!}{k!(n-k)!} \) is the binomial coefficient (number of ways to choose \( k \) successes from \( n \) trials).

\[ E(X) = np, \quad \text{Var}(X) = np(1-p), \quad \text{SD}(X) = \sqrt{np(1-p)} \]
Example 8.1 — Defective components in a production run: A supplier claims that 3% of its components are defective. A quality inspector randomly selects 20 components for testing. Let \( X \) = number of defective components.

Here \( X \sim \text{Binomial}(n = 20, p = 0.03) \).

\[ E(X) = 20 \times 0.03 = 0.6 \text{ defective components (on average)} \]\[ \text{Var}(X) = 20 \times 0.03 \times 0.97 = 0.582 \]\[ \text{SD}(X) = \sqrt{0.582} \approx 0.763 \]

Probability of finding zero defectives:

\[ P(X = 0) = \binom{20}{0}(0.03)^0(0.97)^{20} = (0.97)^{20} \approx 0.5438 \]

There is a 54.4% probability of finding no defectives in the sample of 20, even if the true defective rate is 3%. This illustrates the challenge of quality control with small samples.

Probability of finding at least 2 defectives:

\[ P(X \geq 2) = 1 - P(X = 0) - P(X = 1) \]\[ P(X = 1) = \binom{20}{1}(0.03)^1(0.97)^{19} = 20 \times 0.03 \times (0.97)^{19} \approx 20 \times 0.03 \times 0.5604 \approx 0.3362 \]\[ P(X \geq 2) = 1 - 0.5438 - 0.3362 = 0.1200 \]

There is a 12.0% probability of finding two or more defectives.
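Example 8.1 can be verified with Python's `math.comb`; a minimal sketch:

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 20, 0.03  # 20 sampled components, claimed 3% defect rate

p0 = binom_pmf(0, n, p)  # P(X = 0): no defectives found
p1 = binom_pmf(1, n, p)  # P(X = 1)
p_ge2 = 1 - p0 - p1      # complement rule for P(X >= 2)

print(f"P(X=0) = {p0:.4f}, P(X>=2) = {p_ge2:.4f}")  # 0.5438, 0.1198
```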

Normal Approximation to the Binomial

When \( n \) is large and \( p \) is not too close to 0 or 1, the Binomial distribution is approximately Normal:

\[ X \approx N(np,\; np(1-p)) \]

The rule of thumb for this approximation: use it when \( np \geq 10 \) and \( n(1-p) \geq 10 \).

When applying the normal approximation, the continuity correction improves accuracy by treating the discrete value \( k \) as the continuous interval \( [k - 0.5, k + 0.5] \):

\[ P(X \leq k) \approx P\!\left(Z \leq \frac{k + 0.5 - np}{\sqrt{np(1-p)}}\right) \]
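One way to see how well the approximation works is to compare the exact Binomial CDF with the continuity-corrected normal value. The parameters below (\( n = 200 \), \( p = 0.10 \), so \( np = 20 \) and \( n(1-p) = 180 \)) are illustrative choices that satisfy the rule of thumb:

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, k = 200, 0.10, 25  # illustrative parameters; np = 20, n(1-p) = 180

# Exact Binomial tail: P(X <= k) summed term by term.
exact = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Normal approximation with continuity correction: evaluate at k + 0.5.
mu, sigma = n * p, sqrt(n * p * (1 - p))
approx = NormalDist(mu, sigma).cdf(k + 0.5)

print(f"exact = {exact:.4f}, normal approx = {approx:.4f}")
```

With these parameters the two values agree to roughly two decimal places, which is typical once the rule-of-thumb conditions hold.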

8.4 Poisson Distribution

The Poisson distribution models the number of events occurring in a fixed interval of time or space, when events occur independently at a constant average rate.

Conditions for the Poisson model: 1. Events occur independently. 2. Events occur at a constant average rate \( \lambda \) (per unit of time/space). 3. Two events cannot occur simultaneously.
Poisson(\(\lambda\)) PMF: \[ P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, 2, \ldots \]\[ E(X) = \lambda, \quad \text{Var}(X) = \lambda, \quad \text{SD}(X) = \sqrt{\lambda} \]

A distinctive property: the mean and variance are both equal to \( \lambda \).

Example 8.2 — Calls to a bank's customer service centre: A bank's call centre receives an average of 8 customer calls per hour. The number of calls follows a Poisson distribution with \( \lambda = 8 \).

Probability of receiving exactly 10 calls in the next hour:

\[ P(X = 10) = \frac{e^{-8} \cdot 8^{10}}{10!} = \frac{0.000335 \times 1,073,741,824}{3,628,800} \approx 0.0993 \]

There is approximately a 9.93% probability of receiving exactly 10 calls.

Probability of receiving fewer than 5 calls:

\[ P(X < 5) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4) \]\[ = e^{-8}\!\left(\frac{8^0}{0!} + \frac{8^1}{1!} + \frac{8^2}{2!} + \frac{8^3}{3!} + \frac{8^4}{4!}\right) \]\[ = e^{-8}(1 + 8 + 32 + \tfrac{512}{6} + \tfrac{4096}{24}) = e^{-8} \times 297 \]\[ = 0.00033546 \times 297 \approx 0.0996 \]

There is about a 9.96% probability of fewer than 5 calls in an hour. Management can use this to staff the call centre to keep wait times acceptable at various confidence levels.
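Both calculations from Example 8.2 in a short Python sketch (carrying full precision avoids the small rounding drift that creeps into hand calculations with \( e^{-8} \approx 0.000335 \)):

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

lam = 8  # average of 8 calls per hour (Example 8.2)

p_exactly_10 = poisson_pmf(10, lam)                     # P(X = 10)
p_fewer_5 = sum(poisson_pmf(k, lam) for k in range(5))  # P(X < 5)

print(f"P(X=10) = {p_exactly_10:.4f}, P(X<5) = {p_fewer_5:.4f}")  # 0.0993, 0.0996
```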


Chapter 9: Continuous Probability Distributions

9.1 Probability Density Functions

For a continuous random variable \( X \), probabilities are computed as areas under the probability density function (PDF) \( f(x) \):

\[ P(a \leq X \leq b) = \int_a^b f(x)\, dx \]

Properties of a valid PDF: \( f(x) \geq 0 \) for all \( x \), and \( \int_{-\infty}^{\infty} f(x)\, dx = 1 \).

An important consequence: for a continuous random variable, \( P(X = c) = 0 \) for any specific value \( c \), because the area under a single point is zero. Thus \( P(X \leq c) = P(X < c) \).

9.2 Uniform Distribution

Uniform(\(a, b\)) distribution: \( X \) is uniformly distributed on the interval \([a, b]\) if all values in the interval are equally likely. \[ f(x) = \frac{1}{b - a}, \quad a \leq x \leq b \]\[ E(X) = \frac{a + b}{2}, \quad \text{Var}(X) = \frac{(b-a)^2}{12} \]
Example 9.1 — Uniform wait time: Processing time for an inter-bank settlement is uniformly distributed between 30 minutes and 90 minutes. \[ E(X) = (30 + 90)/2 = 60 \text{ minutes} \]\[ \text{SD}(X) = (90-30)/\sqrt{12} = 60/3.464 \approx 17.3 \text{ minutes} \]

Probability processing takes more than 75 minutes:

\[ P(X > 75) = \frac{90 - 75}{90 - 30} = \frac{15}{60} = 0.25 \]

9.3 Exponential Distribution

The exponential distribution models the time between successive events in a Poisson process. If events occur at rate \( \lambda \) per unit time (Poisson), the time between events is exponential with parameter \( \lambda \).

Exponential(\(\lambda\)) distribution: \[ f(x) = \lambda e^{-\lambda x}, \quad x \geq 0 \]\[ P(X > x) = e^{-\lambda x} \quad \text{(survival function)} \]\[ E(X) = \frac{1}{\lambda}, \quad \text{Var}(X) = \frac{1}{\lambda^2}, \quad \text{SD}(X) = \frac{1}{\lambda} \]
Memoryless property: For any \( s, t \geq 0 \): \[ P(X > s + t \mid X > s) = P(X > t) \]

The remaining wait time, given that you have already waited \( s \) units, has the same distribution as the original wait time. The past waiting time is irrelevant.

Example 9.2 — Time between customer arrivals: Customers arrive at a bank branch at an average rate of 12 per hour, implying an average inter-arrival time of 5 minutes (\( \lambda = 12/\text{hour} \), or equivalently \( 1/5 \) per minute).

Probability a customer waits more than 8 minutes for the next arrival:

\[ P(X > 8) = e^{-\lambda \cdot 8} = e^{-(1/5)(8)} = e^{-1.6} \approx 0.2019 \]

There is about a 20.2% probability of a gap longer than 8 minutes between successive arrivals.
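A Python sketch of the survival-function calculation, with a numerical check of the memoryless property (the elapsed wait \( s = 10 \) minutes is an arbitrary illustration):

```python
from math import exp

rate = 1 / 5  # lambda = 0.2 arrivals per minute (12 per hour, Example 9.2)

def survival(t: float) -> float:
    """P(X > t) for X ~ Exponential(rate)."""
    return exp(-rate * t)

print(f"P(X > 8) = {survival(8):.4f}")  # 0.2019

# Memoryless check: P(X > s + t | X > s) equals P(X > t).
s, t = 10, 8
conditional = survival(s + t) / survival(s)
assert abs(conditional - survival(t)) < 1e-12
```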

9.4 Normal Distribution

The normal distribution is the most important continuous distribution in statistics. Many natural phenomena, measurement errors, and sample averages follow normal distributions or are well-approximated by them.

Normal(\(\mu, \sigma^2\)) distribution: \[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad -\infty < x < \infty \]

Parameters: \( \mu \) = mean (location), \( \sigma^2 \) = variance (spread).

\[ E(X) = \mu, \quad \text{Var}(X) = \sigma^2, \quad \text{SD}(X) = \sigma \]

Key properties:

  • Symmetric and bell-shaped, centred at \( \mu \).
  • Increasing \( \sigma \) flattens and widens the curve; decreasing \( \sigma \) makes it taller and narrower.
  • The total area under the curve is 1.
  • Tails extend to \( \pm\infty \) but become negligible beyond \( \pm 3\sigma \).

The Empirical Rule (68-95-99.7 Rule)

Empirical Rule: For any Normal(\(\mu, \sigma^2\)) distribution: \[ P(\mu - \sigma < X < \mu + \sigma) \approx 0.6827 \]\[ P(\mu - 2\sigma < X < \mu + 2\sigma) \approx 0.9545 \]\[ P(\mu - 3\sigma < X < \mu + 3\sigma) \approx 0.9973 \]

Approximately 68% of observations fall within one standard deviation of the mean, 95% within two, and 99.7% within three.

9.5 The Standard Normal Distribution and Z-Scores

The standard normal distribution is the special case \( Z \sim N(0, 1) \) — mean 0 and variance 1.

Standardization: Any normal random variable \( X \sim N(\mu, \sigma^2) \) can be converted to a standard normal variable by subtracting the mean and dividing by the standard deviation:

\[ Z = \frac{X - \mu}{\sigma} \]

The \( z \)-score measures how many standard deviations an observation lies above or below the mean. A \( z \)-score of \( +2.0 \) means the observation is 2 standard deviations above the mean; a \( z \)-score of \( -1.5 \) means 1.5 standard deviations below.

Computing normal probabilities:

  1. Standardize: convert to \( Z \).
  2. Look up \( \Phi(z) = P(Z \leq z) \) in a standard normal table, or use software.
  3. Use symmetry and complementation as needed.
Example 9.3 — Annual return on a portfolio: Annual portfolio returns follow a Normal distribution with mean \( \mu = 8\% \) and standard deviation \( \sigma = 12\% \).

(a) What is the probability of a negative return in any given year?

\[ P(X < 0) = P\!\left(Z < \frac{0 - 8}{12}\right) = P(Z < -0.667) \]

From a standard normal table: \( \Phi(-0.67) \approx 0.2514 \).

There is approximately a 25.1% probability of a negative return.

(b) What is the probability of a return exceeding 20%?

\[ P(X > 20) = P\!\left(Z > \frac{20 - 8}{12}\right) = P(Z > 1.00) = 1 - \Phi(1.00) = 1 - 0.8413 = 0.1587 \]

There is approximately a 15.9% probability of a return exceeding 20%.

(c) What return is at the 95th percentile?

Find \( x \) such that \( P(X \leq x) = 0.95 \). From the standard normal table, \( \Phi(1.645) = 0.95 \), so \( z_{0.95} = 1.645 \).

\[ x = \mu + z \cdot \sigma = 8 + 1.645 \times 12 = 8 + 19.74 = 27.74\% \]

The 95th percentile annual return is approximately 27.7%.
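Parts (a)-(c) can be reproduced with `statistics.NormalDist` from the Python standard library. Software evaluates the exact z-score (here \(-0.667\)) rather than a two-decimal table value, so part (a) comes out marginally higher (about 0.2525 versus the table's 0.2514):

```python
from statistics import NormalDist

returns = NormalDist(mu=8, sigma=12)  # annual return in percent (Example 9.3)

p_negative = returns.cdf(0)       # (a) P(X < 0)
p_above_20 = 1 - returns.cdf(20)  # (b) P(X > 20)
pct_95 = returns.inv_cdf(0.95)    # (c) 95th percentile

print(f"(a) {p_negative:.4f}  (b) {p_above_20:.4f}  (c) {pct_95:.2f}%")
```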

Example 9.4 — Value at Risk (VaR) application: A bank's daily trading portfolio has returns normally distributed with mean 0.05% and standard deviation 1.2%. The risk management team wants to find the 1% daily Value at Risk (VaR) — the loss that will not be exceeded on 99% of trading days.

The 1% VaR corresponds to the 1st percentile of the daily return distribution.

\( z_{0.01} = -2.326 \) (from tables: \( P(Z < -2.326) = 0.01 \)).

\[ x_{0.01} = 0.05 + (-2.326)(1.2) = 0.05 - 2.791 = -2.741\% \]

The daily VaR is 2.74%: on a portfolio worth $100 million, the bank should expect to lose more than $2.74 million on only 1% of trading days (about 2-3 days per year under the normality assumption).

9.6 Using the Standard Normal Table

Standard normal tables give \( \Phi(z) = P(Z \leq z) \) for \( z \geq 0 \). For negative values, use symmetry:

\[ P(Z \leq -z) = 1 - P(Z \leq z) = 1 - \Phi(z) \]

Key probabilities to memorize:

| \( z \) | \( \Phi(z) = P(Z \leq z) \) |
|---|---|
| 0.00 | 0.5000 |
| 1.00 | 0.8413 |
| 1.28 | 0.8997 (≈ 0.90) |
| 1.645 | 0.9500 |
| 1.96 | 0.9750 |
| 2.00 | 0.9772 |
| 2.326 | 0.9900 |
| 2.576 | 0.9950 |
| 3.00 | 0.9987 |

The values 1.28, 1.645, 1.96, and 2.576 are particularly important as they correspond to the 90th, 95th, 97.5th, and 99.5th percentiles and appear repeatedly in confidence interval construction.
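These table values can be generated (and extended to any \( z \)) with `statistics.NormalDist` instead of being read from a printed table:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

for z in (0.00, 1.00, 1.28, 1.645, 1.96, 2.00, 2.326, 2.576, 3.00):
    print(f"Phi({z:5.3f}) = {Z.cdf(z):.4f}")

# inv_cdf goes the other way: percentile -> critical value.
print(round(Z.inv_cdf(0.95), 3))   # 1.645
print(round(Z.inv_cdf(0.975), 3))  # 1.96
```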


Chapter 10: Sampling Distributions and the Central Limit Theorem

10.1 Why Sampling Distributions Matter

Statistical inference draws conclusions about population parameters using sample statistics. The key insight is that a sample statistic — like \( \bar{x} \) — is itself a random variable: it varies from sample to sample. The sampling distribution describes this variability.

Sampling distribution of a statistic: The probability distribution of the statistic computed over all possible samples of size \( n \) drawn from the population.

10.2 Sampling Distribution of the Sample Mean

Suppose the population has mean \( \mu \) and variance \( \sigma^2 \) (finite). A random sample of size \( n \) is drawn. The sample mean \( \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i \) has the following properties:

Properties of the sampling distribution of \(\bar{X}\): \[ E(\bar{X}) = \mu \]\[ \text{Var}(\bar{X}) = \frac{\sigma^2}{n} \]\[ \text{SD}(\bar{X}) = \frac{\sigma}{\sqrt{n}} \]

The standard deviation of the sample mean is called the standard error (SE).

Two key results:

  1. The sample mean is an unbiased estimator of \( \mu \): its expected value equals the population mean.
  2. The standard error decreases as sample size increases: larger samples produce more precise estimates. To halve the standard error, quadruple the sample size.

10.3 The Central Limit Theorem

Central Limit Theorem (CLT): Let \( X_1, X_2, \ldots, X_n \) be a random sample from any population with mean \( \mu \) and finite variance \( \sigma^2 \). Then, as \( n \to \infty \): \[ \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1) \]

In practice, \(\bar{X}\) is approximately normally distributed for sufficiently large \( n \), regardless of the shape of the population distribution.

Practical rule of thumb: The CLT approximation is generally adequate for \( n \geq 30 \) when the population distribution is not too severely skewed. For nearly symmetric populations, the normal approximation for \( \bar{X} \) becomes adequate at even smaller sizes (sometimes as low as \( n = 5 \) or \( n = 10 \)).

The CLT is arguably the most important theorem in applied statistics. It explains why the normal distribution appears so widely in inference: even if the underlying data are skewed, Binomial, or Poisson, the distribution of sample averages converges to a normal distribution for large samples.

Example 10.1 — CLT applied to insurance claims: An insurance company's individual claim amounts have a strongly right-skewed distribution with mean \( \mu = \$2,500 \) and standard deviation \( \sigma = \$4,200 \).

The company receives \( n = 150 \) claims per week. What is the probability that the average weekly claim amount exceeds $3,000?

By the CLT:

\[ \bar{X} \approx N\!\left(2500, \frac{4200^2}{150}\right) = N(2500, 117{,}600) \]\[ \text{SE} = \frac{4200}{\sqrt{150}} = \frac{4200}{12.247} \approx 343.0 \]\[ P(\bar{X} > 3000) = P\!\left(Z > \frac{3000 - 2500}{343.0}\right) = P(Z > 1.458) \]\[ = 1 - \Phi(1.458) \approx 1 - 0.9277 = 0.0723 \]

There is approximately a 7.2% probability that the average claim exceeds $3,000 in any given week.
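The CLT calculation from Example 10.1, sketched in Python:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 2500, 4200, 150  # claim mean, claim SD, weekly claim count

se = sigma / sqrt(n)  # standard error of the weekly average

# By the CLT the weekly average is approximately N(mu, se^2),
# even though individual claims are strongly right-skewed.
p_over_3000 = 1 - NormalDist(mu, se).cdf(3000)

print(f"SE = {se:.1f}, P(average > 3000) = {p_over_3000:.4f}")
```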

Example 10.2 — Effect of sample size on standard error: The population of customer account balances has \( \mu = \$15,000 \) and \( \sigma = \$8,000 \).
| Sample Size \( n \) | Standard Error \( \sigma/\sqrt{n} \) |
|---|---|
| 10 | $2,530 |
| 25 | $1,600 |
| 100 | $800 |
| 400 | $400 |
| 1,000 | $253 |

Quadrupling the sample size halves the standard error. This illustrates the diminishing returns of larger samples: going from \( n = 100 \) to \( n = 400 \) cuts the SE in half (a worthwhile improvement), but going from \( n = 400 \) to \( n = 1,600 \) achieves the same reduction for four times the sampling cost.


Chapter 11: Point Estimation

11.1 Estimators and Estimates

Estimator: A rule (formula) that uses sample data to produce an estimate of a population parameter. An estimator is a random variable — its value changes from sample to sample.
Estimate: The specific numerical value produced by applying an estimator to a particular sample.

11.2 Unbiasedness

Unbiased estimator: An estimator \( \hat{\theta} \) is unbiased for parameter \( \theta \) if \( E(\hat{\theta}) = \theta \) — the expected value of the estimator equals the true parameter.

Key unbiased estimators:

  • The sample mean \( \bar{X} \) is an unbiased estimator of the population mean \( \mu \).
  • The sample proportion \( \hat{p} = X/n \) is an unbiased estimator of the population proportion \( p \).
  • The sample variance \( S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2 \) is an unbiased estimator of \( \sigma^2 \). (This is why we divide by \( n-1 \), not \( n \).)

Bias: If \( E(\hat{\theta}) \neq \theta \), the estimator is biased. The bias is \( E(\hat{\theta}) - \theta \).
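The \( n-1 \) divisor can be checked by simulation: averaging many sample variances recovers \( \sigma^2 \), while dividing by \( n \) systematically underestimates it. A sketch (the Normal(0, 2²) population and the sample size \( n = 5 \) are arbitrary illustrative choices):

```python
import random
from statistics import pvariance, variance

random.seed(42)  # reproducible
true_var = 4.0   # population is Normal(0, sd=2), so sigma^2 = 4
n, reps = 5, 20_000

sum_n_minus_1 = sum_n = 0.0
for _ in range(reps):
    sample = [random.gauss(0, 2) for _ in range(n)]
    sum_n_minus_1 += variance(sample)  # divides by n - 1 (sample variance)
    sum_n += pvariance(sample)         # divides by n

print(f"average of s^2 (n-1 divisor): {sum_n_minus_1 / reps:.3f}")  # near 4.0
print(f"average with n divisor:       {sum_n / reps:.3f}")          # near 4*(n-1)/n = 3.2
```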

11.3 Efficiency

Among all unbiased estimators of a parameter, the one with the smallest variance is called the most efficient (or minimum variance unbiased estimator, MVUE). For a normal population, the sample mean is the MVUE of \( \mu \).

The bias-variance tradeoff: Occasionally a biased estimator is preferred in practice if it has much lower variance. The mean squared error (MSE) balances both: \[ \text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + [\text{Bias}(\hat{\theta})]^2 \]

An estimator with small MSE can be preferred even if it is slightly biased.


Chapter 12: Confidence Intervals

12.1 The Logic of Confidence Intervals

A point estimate is a single number that serves as our “best guess” of a parameter. But a point estimate conveys no information about precision. A confidence interval (CI) supplements the point estimate with a margin of error, producing a range of plausible values for the parameter.

95% confidence interval: A procedure for constructing an interval from sample data such that, if repeated many times on independent samples, 95% of the resulting intervals would contain the true population parameter.
Common misinterpretation: A 95% CI does NOT mean "there is a 95% probability that the parameter lies in this specific interval." Once the interval is computed, the parameter either is or is not in it — probability is not applicable to fixed numbers. The 95% refers to the long-run coverage frequency of the procedure.

12.2 Confidence Interval for a Population Mean — Large Sample

When the sample size is large (\( n \geq 30 \)) or the population standard deviation \( \sigma \) is known, the sampling distribution of \( \bar{X} \) is approximately normal and the CI for \( \mu \) is:

\[ \bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \]

where \( z_{\alpha/2} \) is the critical value from the standard normal distribution corresponding to confidence level \( (1-\alpha) \times 100\% \):

| Confidence Level | \( \alpha \) | \( z_{\alpha/2} \) |
|---|---|---|
| 90% | 0.10 | 1.645 |
| 95% | 0.05 | 1.960 |
| 99% | 0.01 | 2.576 |

When \( \sigma \) is unknown (the usual case in practice), we substitute the sample standard deviation \( s \). For large samples, the result is approximately valid.

Margin of Error (ME):

\[ \text{ME} = z_{\alpha/2} \cdot \frac{s}{\sqrt{n}} \]

The CI is then: \( (\bar{x} - \text{ME},\; \bar{x} + \text{ME}) \).

Example 12.1 — Average invoice processing time: An accounts payable department samples 64 invoices and records the processing time (in hours). Results: \( \bar{x} = 18.5 \) hours, \( s = 6.4 \) hours.

Construct a 95% confidence interval for the true mean processing time.

\[ \text{ME} = 1.960 \times \frac{6.4}{\sqrt{64}} = 1.960 \times \frac{6.4}{8} = 1.960 \times 0.80 = 1.568 \]\[ \text{95% CI} = (18.5 - 1.568,\; 18.5 + 1.568) = (16.93,\; 20.07) \text{ hours} \]

We are 95% confident that the true mean processing time is between 16.9 and 20.1 hours.

If management’s target is 18 hours average processing time, note that 18 hours is inside the confidence interval — the data are consistent with meeting the target, but there is uncertainty.
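The interval from Example 12.1 in code form:

```python
from math import sqrt

xbar, s, n = 18.5, 6.4, 64  # sample mean, sample SD, sample size
z = 1.960                   # z critical value for 95% confidence

me = z * s / sqrt(n)        # margin of error
lo, hi = xbar - me, xbar + me

print(f"ME = {me:.3f}, 95% CI = ({lo:.2f}, {hi:.2f})")  # ME = 1.568, CI = (16.93, 20.07)
```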

12.3 Confidence Interval for a Population Mean — Small Sample (t-Distribution)

When \( n < 30 \) and \( \sigma \) is unknown, we cannot rely on the CLT to guarantee approximate normality of \( \bar{X} \). If we additionally assume the population is approximately normal, we use the t-distribution with \( n - 1 \) degrees of freedom:

\[ \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1} \]

The t-distribution is symmetric and bell-shaped like the standard normal, but has heavier tails — it accounts for the additional uncertainty in estimating \( \sigma \) from a small sample. As \( n \to \infty \), the t-distribution converges to the standard normal.

Small-sample CI for \( \mu \):

\[ \bar{x} \pm t_{\alpha/2,\; n-1} \cdot \frac{s}{\sqrt{n}} \]

where \( t_{\alpha/2,\; n-1} \) is the \( (1 - \alpha/2) \) quantile of the t-distribution with \( n-1 \) degrees of freedom.

Selected critical values:

| Degrees of Freedom | \( t_{0.025} \) (95% CI) | \( t_{0.005} \) (99% CI) |
|---|---|---|
| 5 | 2.571 | 4.032 |
| 10 | 2.228 | 3.169 |
| 15 | 2.131 | 2.947 |
| 20 | 2.086 | 2.845 |
| 25 | 2.060 | 2.787 |
| 30 | 2.042 | 2.750 |
| ∞ (Normal) | 1.960 | 2.576 |
Example 12.2 — Mean expense reimbursement (small sample): An internal auditor reviews 12 randomly selected expense claims to estimate the mean reimbursement amount. Summary statistics: \( \bar{x} = \$287.40 \), \( s = \$64.20 \).

Assuming expense amounts are approximately normally distributed, construct a 90% confidence interval for the mean reimbursement.

Degrees of freedom: \( df = n - 1 = 11 \). \( t_{0.05, 11} = 1.796 \) (from t-table).

\[ \text{ME} = 1.796 \times \frac{64.20}{\sqrt{12}} = 1.796 \times \frac{64.20}{3.464} = 1.796 \times 18.53 = 33.29 \]\[ \text{90% CI} = (287.40 - 33.29,\; 287.40 + 33.29) = (\$254.11,\; \$320.69) \]

The auditor is 90% confident that the true mean expense claim lies between $254 and $321.
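A sketch of Example 12.2. The critical value is hard-coded from the t-table because Python's standard library has no t-distribution; `scipy.stats.t.ppf(0.95, 11)` would supply it programmatically:

```python
from math import sqrt

xbar, s, n = 287.40, 64.20, 12
t_crit = 1.796  # t_{0.05, 11} from a t-table (90% confidence, df = 11)

me = t_crit * s / sqrt(n)  # margin of error
lo, hi = xbar - me, xbar + me

print(f"ME = {me:.2f}, 90% CI = (${lo:.2f}, ${hi:.2f})")
```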

12.4 Confidence Interval for a Population Proportion

Let \( X \) be the number of successes in \( n \) trials. The sample proportion is \( \hat{p} = X/n \).

For large samples (where \( n\hat{p} \geq 10 \) and \( n(1-\hat{p}) \geq 10 \)), by the CLT:

\[ \hat{p} \approx N\!\left(p,\; \frac{p(1-p)}{n}\right) \]

The large-sample CI for a population proportion is:

\[ \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
Example 12.3 — Proportion of overdue accounts: A collections manager samples 200 customer accounts and finds that 34 are overdue by more than 30 days. \[ \hat{p} = \frac{34}{200} = 0.17 \]

Construct a 95% confidence interval for the true proportion of overdue accounts.

\[ \text{SE}(\hat{p}) = \sqrt{\frac{0.17 \times 0.83}{200}} = \sqrt{\frac{0.1411}{200}} = \sqrt{0.000706} \approx 0.02657 \]\[ \text{95% CI} = 0.17 \pm 1.96 \times 0.02657 = 0.17 \pm 0.0521 = (0.1179,\; 0.2221) \]

The manager is 95% confident that the true proportion of overdue accounts is between 11.8% and 22.2%.
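Example 12.3 in code, including an explicit check of the large-sample conditions:

```python
from math import sqrt

x, n = 34, 200  # overdue accounts out of sampled accounts
p_hat = x / n   # sample proportion, 0.17

# Large-sample conditions for the normal-based interval.
assert n * p_hat >= 10 and n * (1 - p_hat) >= 10

se = sqrt(p_hat * (1 - p_hat) / n)  # standard error of p_hat
me = 1.96 * se                      # 95% margin of error

print(f"95% CI = ({p_hat - me:.4f}, {p_hat + me:.4f})")  # (0.1179, 0.2221)
```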

12.5 Determining the Required Sample Size

Before collecting data, analysts must plan how large a sample is needed to achieve a desired margin of error. For estimating a population mean with margin of error \( E \) and confidence level \( (1-\alpha) \):

\[ n = \left(\frac{z_{\alpha/2} \cdot \sigma}{E}\right)^2 \]

Since \( \sigma \) is often unknown at the planning stage, use a pilot-study estimate, a conservative bound, or literature-based values.

For a population proportion:

\[ n = \frac{z_{\alpha/2}^2 \cdot p(1-p)}{E^2} \]

If \( p \) is completely unknown, use \( p = 0.5 \), which maximizes \( p(1-p) \) and gives the largest (most conservative) required sample size.

Example 12.4 — Sample size for proportion estimation: A bank wants to estimate the proportion of customers who would use a new mobile payment feature. They want the estimate to be within 3 percentage points with 99% confidence. No prior estimate of \( p \) is available. \[ n = \frac{(2.576)^2 \times 0.5 \times 0.5}{(0.03)^2} = \frac{6.636 \times 0.25}{0.0009} = \frac{1.659}{0.0009} \approx 1{,}843.3 \]

The bank needs to survey at least 1,844 customers (sample-size formulas are always rounded up to the next whole respondent). This relatively large sample is required because the margin of error (3 percentage points) is small and the confidence level (99%) is high.
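Sample-size planning always rounds up to the next whole respondent, which `math.ceil` makes explicit; a small sketch:

```python
from math import ceil

def sample_size_proportion(z: float, e: float, p: float = 0.5) -> int:
    """Required n to estimate a proportion within margin e.

    p = 0.5 is the conservative (worst-case) default when no prior
    estimate of the proportion is available.
    """
    return ceil(z**2 * p * (1 - p) / e**2)

print(sample_size_proportion(z=2.576, e=0.03))  # Example 12.4 -> 1844
print(sample_size_proportion(z=1.96, e=0.05))   # the classic "+/-5% at 95%" -> 385
```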


Chapter 13: Integrative Review and Applied Problems

13.1 Connecting the Pieces

The topics covered in this course form a logical chain:

  1. Data classification determines which summaries are valid.
  2. Frequency distributions and histograms reveal the shape of distributions.
  3. Measures of centre and spread quantify location and variability.
  4. Probability theory provides the theoretical foundation for drawing inferences.
  5. Probability distributions model specific data-generating processes.
  6. Sampling distributions and the CLT explain how sample statistics behave.
  7. Confidence intervals translate sample information into statements about population parameters.

Each step builds on the previous, forming the foundation for more advanced techniques in regression analysis, hypothesis testing, and predictive modelling (AFM 113 and beyond).

13.2 Comprehensive Worked Example

Example 13.1 — Full analysis of retail transaction data:

A national retailer’s internal audit team investigates the efficiency of its point-of-sale system. They record the transaction processing time (in seconds) for a random sample of \( n = 36 \) transactions. The results are:

8, 12, 7, 15, 11, 9, 14, 8, 10, 13, 6, 18, 9, 11, 12, 7, 10, 14, 8, 11, 15, 9, 13, 7, 10, 12, 8, 16, 11, 9, 13, 7, 10, 14, 9, 11

Step 1: Summary statistics.

Sum = 8+12+7+15+11+9+14+8+10+13+6+18+9+11+12+7+10+14+8+11+15+9+13+7+10+12+8+16+11+9+13+7+10+14+9+11 = 387

\[ \bar{x} = \frac{387}{36} = 10.75 \text{ seconds} \]

Sorted data (for percentiles): 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 11, 12, 12, 12, 13, 13, 13, 14, 14, 14, 15, 15, 16, 18

\( Q_2 \) (Median) = average of 18th and 19th values = (10 + 11)/2 = 10.5 seconds. \( Q_1 \) = average of 9th and 10th values = (8 + 9)/2 = 8.5 seconds. \( Q_3 \) = average of 27th and 28th values = (13 + 13)/2 = 13.0 seconds. \( \text{IQR} = 13.0 - 8.5 = 4.5 \) seconds.

Sample standard deviation (computed): \( s \approx 2.90 \) seconds.

\[ \text{CV} = \frac{2.90}{10.75} \times 100\% \approx 27.0\% \]

The mean (10.75s) is slightly larger than the median (10.5s), suggesting mild right skewness, consistent with the value of 18 seconds being a high outlier.

Step 2: 95% confidence interval for the mean processing time.

With \( n = 36 \geq 30 \), use the normal approximation:

\[ \text{95% CI} = 10.75 \pm 1.96 \times \frac{2.90}{\sqrt{36}} = 10.75 \pm 1.96 \times 0.483 = 10.75 \pm 0.947 = (9.80,\; 11.70) \text{ seconds} \]

Step 3: Probability calculation.

If processing times follow \( N(10.75, 2.90^2) \), what proportion of transactions take more than 15 seconds?

\[ P(X > 15) = P\!\left(Z > \frac{15 - 10.75}{2.90}\right) = P(Z > 1.466) \approx 1 - \Phi(1.47) \approx 1 - 0.9292 = 0.0708 \]

Approximately 7.1% of transactions are expected to take more than 15 seconds.

Step 4: Business interpretation.

The average processing time of 10.75 seconds is well within the company’s service standard of 15 seconds. The 95% CI (9.8s to 11.7s) confirms this with high confidence. However, about 7% of transactions exceed 15 seconds, warranting investigation of what drives slower transactions (specific cashiers, product types, payment methods).
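Steps 1 through 3 can be reproduced end to end with the Python `statistics` module, which is a useful cross-check on the hand calculations:

```python
from statistics import NormalDist, mean, median, stdev

# Processing times (seconds) for the n = 36 sampled transactions (Example 13.1).
times = [8, 12, 7, 15, 11, 9, 14, 8, 10, 13, 6, 18, 9, 11, 12, 7, 10, 14,
         8, 11, 15, 9, 13, 7, 10, 12, 8, 16, 11, 9, 13, 7, 10, 14, 9, 11]

n = len(times)
xbar = mean(times)  # Step 1: sample mean
s = stdev(times)    # sample SD (n - 1 divisor)
se = s / n**0.5     # standard error of the mean

lo, hi = xbar - 1.96 * se, xbar + 1.96 * se  # Step 2: 95% CI for the mean
p_slow = 1 - NormalDist(xbar, s).cdf(15)     # Step 3: P(X > 15) under normality

print(f"mean = {xbar:.2f}, median = {median(times)}, s = {s:.2f}")
print(f"95% CI = ({lo:.2f}, {hi:.2f}), P(X > 15) = {p_slow:.3f}")
```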

13.3 Summary of Key Formulas

| Topic | Formula |
|---|---|
| Sample mean | \( \bar{x} = \frac{1}{n}\sum x_i \) |
| Sample variance | \( s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1} \) |
| IQR | \( Q_3 - Q_1 \) |
| CV | \( s/\bar{x} \times 100\% \) |
| Conditional probability | \( P(A\mid B) = P(A \cap B)/P(B) \) |
| Bayes’ theorem | \( P(B_i \mid A) = P(A \mid B_i)P(B_i)/P(A) \) |
| Binomial PMF | \( P(X=k) = \binom{n}{k}p^k(1-p)^{n-k} \) |
| Binomial mean/variance | \( np \) ; \( np(1-p) \) |
| Poisson PMF | \( P(X=k) = e^{-\lambda}\lambda^k/k! \) |
| Normal standardization | \( Z = (X - \mu)/\sigma \) |
| Standard error | \( \text{SE} = \sigma/\sqrt{n} \) |
| Large-sample CI for \( \mu \) | \( \bar{x} \pm z_{\alpha/2} \cdot s/\sqrt{n} \) |
| Small-sample CI for \( \mu \) | \( \bar{x} \pm t_{\alpha/2,n-1} \cdot s/\sqrt{n} \) |
| CI for proportion | \( \hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n} \) |
| Sample size (mean) | \( n = (z_{\alpha/2}\sigma/E)^2 \) |
| Sample size (proportion) | \( n = z_{\alpha/2}^2 p(1-p)/E^2 \) |

13.4 Practice Problems

Problem 1. A random sample of 15 audit engagements records the number of hours billed: 42, 58, 35, 71, 49, 63, 38, 55, 67, 44, 52, 78, 41, 60, 47.

(a) Compute the mean, median, and mode. (b) Compute the range, IQR, variance, and standard deviation. (c) Compute the coefficient of variation. (d) Identify any outliers using the 1.5 × IQR rule. (e) Describe the shape of the distribution.

Problem 2. A credit card issuer believes 8% of its cardholders carry a balance greater than $10,000. A random sample of 250 accounts is drawn.

(a) What is the expected number of accounts with balances over $10,000? (b) What is the standard deviation of the count? (c) What is the probability that exactly 20 accounts have balances over $10,000? (d) Using the normal approximation, what is the probability that more than 25 accounts have balances over $10,000?

Problem 3. Transaction processing times at an ATM follow an exponential distribution with a mean of 45 seconds.

(a) What is the rate parameter \( \lambda \)? (b) What is the probability a transaction takes more than 1 minute? (c) What is the median transaction time? (Hint: use the formula \( \text{Median} = \ln(2)/\lambda \).) (d) What transaction time is exceeded by only 5% of transactions?

Problem 4. Annual returns for a market index are approximately normally distributed with mean 7.2% and standard deviation 15.6%.

(a) What is the probability of a loss exceeding 10% in a given year? (b) What return separates the top 10% of years from the rest? (c) In what range do 80% of annual returns fall? (d) Over 25 independent years, what is the probability that the average annual return is negative?

Problem 5. A mortgage lender samples 50 recently approved mortgages to estimate the mean loan-to-value ratio (LTV). Results: \( \bar{x} = 0.783 \), \( s = 0.094 \).

(a) Construct a 90%, 95%, and 99% confidence interval for the mean LTV. (b) How do the widths of these intervals compare? What drives the difference? (c) How large a sample would be needed to achieve a margin of error of 0.01 at 95% confidence? (Use \( \sigma \approx 0.094 \).)


These notes draw on Balka’s open-access statistics text, Wackerly, Mendenhall, and Scheaffer’s *Mathematical Statistics with Applications*, and DeVore’s *Probability and Statistics for Engineering and the Sciences*. Worked examples are original and designed for an AFM audience.
