AFM 112: Analytic Methods for Business 1
Estimated study time: 19 minutes
Sources and References
- Primary textbook: Stratopoulos, T. (2025). *Introduction to Data Analytics Process: From Spreadsheets to R and Python*. VitalSource.
- Supplementary: Provost, F., & Fawcett, T. (2013). *Data Science for Business*. O'Reilly Media.
- Supplementary: Watt, A., & Eng, N. (2014). *Database Design* (2nd ed.). BCcampus Open Education.
- Online resources: DataCamp (R and spreadsheet exercises); MIT OpenCourseWare 15.060: Data, Models, and Decisions; Google's Machine Learning Crash Course.
Chapter 1: Data Analytics in Business Context
Why Analytics Matters for Finance Professionals
Modern accounting and finance professionals are increasingly expected to work with data at scale. The era of compiling reports from manually entered spreadsheets is giving way to environments where vast datasets must be queried, cleaned, modelled, and communicated—often under time pressure and with business decisions hanging in the balance. An understanding of data analytics principles is no longer optional for professionals in accounting, auditing, or corporate finance; it is a baseline expectation.
The Four Types of Analytics
| Type | Question Answered | Complexity | Example |
|---|---|---|---|
| Descriptive | What happened? | Low | Monthly sales dashboard |
| Diagnostic | Why did it happen? | Moderate | Variance analysis by segment |
| Predictive | What will happen? | High | Customer churn probability |
| Prescriptive | What should we do? | Highest | Optimal inventory reorder point |
This course traverses all four types through a sequence of business cases, beginning with descriptive statistics and building toward regression-based prediction.
Generative AI as a Productivity Tool
Generative AI (GenAI) tools such as large language models (LLMs) can serve as powerful accelerators for data analytics work: generating boilerplate code, explaining error messages, translating natural-language descriptions of analyses into code, and providing first drafts of reports. However, effective use of GenAI requires sufficient domain knowledge to evaluate whether the output is correct. A model that writes R code with a logical error will produce incorrect results confidently; only a practitioner who understands what the code should do can catch the error.
In this course, prompt engineering—the craft of writing clear, specific instructions to a GenAI model—is introduced early. A well-designed prompt specifies the task precisely, provides relevant context, and requests the output in a usable format. Poorly specified prompts produce generic or incorrect results.
Chapter 2: Data Management
Data Fundamentals
Before analysing data, professionals must understand what data is, how it is structured, and where its quality problems originate.
Variables and Data Types
Data types shape the analysis tools that are appropriate:
- Categorical (qualitative): Values represent categories with no natural ordering (e.g., province, product category, account type). Subtype: nominal.
- Ordinal: Categories with a natural ordering but no consistent spacing (e.g., credit rating from AAA to D, customer satisfaction from 1 to 5).
- Numerical — discrete: Countable integer values (e.g., number of transactions, number of employees).
- Numerical — continuous: Can take any value within a range (e.g., account balance, interest rate, duration in days).
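These distinctions map directly onto R's type system, which the course introduces later; a minimal sketch, with all values invented for illustration:

```r
# Categorical (nominal): a factor with unordered levels
province <- factor(c("ON", "BC", "ON", "QC"))

# Ordinal: a factor with an explicit level ordering
rating <- factor(c("AA", "BBB", "AA", "D"),
                 levels = c("D", "BBB", "AA", "AAA"), ordered = TRUE)

# Numerical, discrete: integer counts
n_transactions <- c(12L, 7L, 30L)

# Numerical, continuous: doubles
balance <- c(1520.75, 98.10, 22400.00)

is.ordered(rating)  # TRUE: comparisons such as rating > "BBB" are meaningful
```

Declaring the ordering explicitly matters: without `ordered = TRUE`, R would treat credit ratings as unordered categories and refuse magnitude comparisons.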
Tidy Data Principles
The concept of tidy data (formalized by Hadley Wickham) describes a standardized way of organizing tabular data to facilitate analysis:
- Each variable occupies exactly one column.
- Each observation occupies exactly one row.
- Each type of observational unit forms a separate table.
Messy data—where multiple variables are stored in one column, where rows represent multiple observations, or where values are in column headers—must be reshaped before analysis can proceed.
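As a sketch of that reshaping step, the tidyr package (part of the same tidyverse ecosystem as dplyr, used later in the course) can move values out of column headers and into a proper variable; the table and column names here are invented:

```r
library(tidyr)

# Messy: one row per region, with quarters stored in column headers
messy <- data.frame(
  region = c("East", "West"),
  Q1 = c(100, 150),
  Q2 = c(110, 140)
)

# Tidy: one row per (region, quarter) observation
tidy <- pivot_longer(messy, cols = c(Q1, Q2),
                     names_to = "quarter", values_to = "revenue")
```

After reshaping, `quarter` is an ordinary column, so grouping and filtering by quarter work like any other variable.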
Derived Variables
A derived variable is created by applying a formula or transformation to one or more existing variables. Examples in an accounting context:
- Gross margin ratio = Gross profit / Revenue — derived from two income statement items.
- Days sales outstanding = (Accounts receivable / Revenue) × 365 — derived from balance sheet and income statement.
- Year-over-year growth = (Current year value − Prior year value) / |Prior year value| — derived from two time points.
Creating derived variables is a critical step in feature engineering for both descriptive and predictive analyses.
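A minimal base-R sketch of two of these derived variables, using invented income statement figures:

```r
# Hypothetical figures for three consecutive periods
revenue      <- c(500000, 550000, 480000)
gross_profit <- c(200000, 231000, 182400)

# Gross margin ratio = gross profit / revenue
gross_margin <- gross_profit / revenue

# Year-over-year growth for periods 2..3, relative to the prior period
yoy_growth <- diff(revenue) / abs(head(revenue, -1))
```

Both derived variables are vectors of the same length as the period structure they describe, so they can be added as new columns of the source data frame.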
Data Quality Issues
Real business data is rarely clean. Common quality problems include:
- Missing values: Rows where one or more fields are blank or null.
- Duplicates: Multiple identical or near-identical rows representing the same observation.
- Inconsistent formatting: Dates recorded as “2026-01-15,” “Jan 15, 2026,” and “15/01/26” in the same column.
- Outliers: Values that are far from the bulk of the data, which may represent errors or genuine extreme observations.
- Encoding errors: A value of “999” or “-1” used to indicate a missing value rather than a true observation.
Documenting and addressing data quality issues is a non-negotiable first step in any analysis; analyses built on dirty data produce unreliable conclusions.
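A quick audit of these problems might look like the following base-R sketch; the data frame and the -1 sentinel are invented for illustration:

```r
df <- data.frame(
  customer_id = c(1, 2, 2, 3, 4),
  balance     = c(1500, NA, NA, -1, 2200)  # -1 used as a missing-value code
)

colSums(is.na(df))   # missing values per column
sum(duplicated(df))  # count of exact duplicate rows

# Recode the -1 sentinel to a true NA before computing any statistics
df$balance[which(df$balance == -1)] <- NA
```

Recoding sentinels matters because a "-1" left in place would silently distort means, minimums, and any downstream model.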
Chapter 3: The CRISP-DM Framework
Cross-Industry Standard Process for Data Mining
CRISP-DM (Cross-Industry Standard Process for Data Mining) is a widely adopted framework for structuring data analytics projects. It provides a roadmap that ensures analytical work is grounded in business needs and produces actionable results.
Phase 1: Business Understanding
The starting point is a precise formulation of the business problem or question. Analytics projects often fail when this phase is rushed. Key questions include:
- What decision needs to be made?
- What information would reduce uncertainty about that decision?
- Who are the stakeholders, and what do they care about?
- What does success look like?
For example, a retail bank might ask: “Which customers are likely to close their accounts in the next 90 days?” This frames the problem as a binary classification task (churn / no churn), which guides every subsequent phase.
Framing a problem as a decision or question is the first CRISP-DM step applied in this course, practiced through simulated case scenarios that mirror real financial industry situations.
Phase 2: Data Understanding
With the question formulated, analysts identify what data is available and assess its suitability:
- What datasets exist that could contain relevant variables?
- How much data is available? Over what time period?
- At what granularity is the data (transaction-level, daily, monthly)?
- What are the data’s quality issues?
Exploratory data analysis (EDA) is the primary tool in this phase: summary statistics, frequency tables, and visualizations reveal the distribution, spread, and relationships among variables.
Phase 3: Data Preparation
Data preparation is typically the most time-consuming phase, often accounting for 60–80% of total project effort. Tasks include:
- Selecting relevant variables — excluding irrelevant or redundant columns.
- Cleaning — handling missing values (impute, delete, flag), removing duplicates, standardizing formats.
- Transforming — creating derived variables, scaling numeric features, encoding categorical variables.
- Subsetting — filtering rows to the relevant population, time period, or segment.
- Merging — joining multiple tables on key columns (customer ID, transaction ID).
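These preparation tasks chain naturally in a dplyr pipeline; a sketch with invented tables and column names:

```r
library(dplyr)

transactions <- data.frame(
  customer_id = c(1, 1, 2, 3),
  region      = c("ON", "ON", "BC", "ON"),
  amount      = c(100, NA, 250, 80)
)
customers <- data.frame(
  customer_id = c(1, 2, 3),
  segment     = c("retail", "retail", "commercial")
)

prepared <- transactions %>%
  filter(!is.na(amount)) %>%                    # cleaning: drop missing amounts
  mutate(amount_k = amount / 1000) %>%          # transforming: derived variable
  filter(region == "ON") %>%                    # subsetting: one region
  left_join(customers, by = "customer_id") %>%  # merging on a key column
  select(customer_id, segment, amount_k)        # selecting relevant variables
```

Each verb maps to one of the preparation tasks above, which is why pipelines like this double as documentation of the cleaning decisions made.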
Phase 4: Modeling
With a clean, prepared dataset, analysts apply statistical or machine learning models. The choice of model depends on the question type:
- Regression — predicting a continuous outcome (e.g., next-quarter revenue).
- Classification — predicting a categorical outcome (e.g., loan default / no default).
- Clustering — discovering natural groupings in unlabelled data (e.g., customer segments).
- Association rules — identifying items frequently purchased together (market basket analysis).
In AFM 112, modelling is performed first in Google Sheets (using formulas and pivot tables) and then in R (using packages such as dplyr and ggplot2, and base functions such as lm()).
Phase 5: Evaluation
A model is not useful unless its quality has been assessed. Evaluation asks:
- How accurately does the model predict on data it has not seen?
- Does the model generalize, or has it overfit to the training data?
- Does the model’s output actually inform the business decision (Phase 1)?
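A minimal sketch of the first two questions, holding out a test set and measuring error with base R's lm() on simulated data:

```r
set.seed(42)

# Simulated data: a known linear relationship plus noise
n <- 200
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n)

# Random 75% / 25% train-test split
train_idx <- sample(n, 150)
train <- data.frame(x = x[train_idx],  y = y[train_idx])
test  <- data.frame(x = x[-train_idx], y = y[-train_idx])

# Fit on training data only, then score the unseen test rows
model <- lm(y ~ x, data = train)
pred  <- predict(model, newdata = test)
rmse  <- sqrt(mean((test$y - pred)^2))  # error on data the model never saw
```

Comparing this test-set RMSE with the training-set RMSE is the simplest check for overfitting: a model that fits the training data far better than the test data has not generalized.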
Phase 6: Deployment
Deployment means delivering the model’s output to decision-makers in a usable form: a dashboard, a report, a scoring system integrated into a business process, or a recommendation to management.
Chapter 4: Descriptive Analytics — The Big Picture
Summary Statistics
Descriptive analytics summarizes what is in the data without inferring beyond it. The two fundamental dimensions of description are central tendency and spread.
Measures of Central Tendency
Mean (arithmetic average): \( \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \)
Sensitive to outliers; best used with symmetric, outlier-free distributions.
Median: The middle value when observations are sorted. Robust to outliers. Preferred for skewed distributions (e.g., income, house prices).
Mode: The most frequently occurring value. Useful for categorical data or identifying dominant categories.
Measures of Spread
Range: Maximum − Minimum. Simplest but highly sensitive to outliers.
Variance: \( s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \)
The average squared deviation from the mean. Uses \(n-1\) (Bessel’s correction) for sample data.
Standard deviation: \( s = \sqrt{s^2} \). Returns spread to the original units of measurement.
Interquartile Range (IQR): \( Q_3 - Q_1 \). The range of the middle 50% of observations. Robust to outliers.
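Each of these measures is a one-liner in base R; an illustration on an invented vector of account balances that includes one extreme outlier:

```r
balances <- c(500, 1200, 1800, 2500, 3100, 90000)  # one extreme outlier

mean(balances)         # pulled far upward by the outlier
median(balances)       # robust: the middle of the sorted values
sd(balances)           # sample standard deviation (n - 1 denominator)
IQR(balances)          # Q3 - Q1, robust to the outlier
diff(range(balances))  # range: maximum minus minimum
```

Running this makes the robustness claims concrete: the mean and range are dominated by the single large balance, while the median and IQR barely move if that value is changed.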
Visualizing Distributions
Histogram: Shows the shape of the distribution—whether it is symmetric, left-skewed, right-skewed, or bimodal.
Box plot: Displays the median, IQR, whiskers (typically 1.5 × IQR), and individual outliers. Excellent for comparing distributions across groups.
Bar chart: Appropriate for categorical variables—shows frequency or proportion of each category.
Scatter plot: Shows the relationship between two continuous variables. Used in diagnostics and regression analysis.
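As a sketch, these chart types map onto ggplot2 geoms (the data here are simulated):

```r
library(ggplot2)

set.seed(1)
df <- data.frame(
  segment = rep(c("Retail", "Commercial"), each = 100),
  balance = c(rlnorm(100, 7, 1), rlnorm(100, 8, 1))  # right-skewed balances
)

# Histogram: shape of the overall distribution
p_hist <- ggplot(df, aes(x = balance)) +
  geom_histogram(bins = 30)

# Box plot: compare median, IQR, and outliers across segments
p_box <- ggplot(df, aes(x = segment, y = balance)) +
  geom_boxplot()
```

Printing either object (e.g., `p_hist`) renders the chart; storing plots in variables like this makes it easy to refine them layer by layer.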
Descriptive Analytics in a Business Context
A typical descriptive analysis for a financial services firm might involve:
- Importing transaction data and computing total and average transaction value by customer segment.
- Identifying the distribution of account balances: what percentage of customers fall below $1,000, between $1,000 and $10,000, or above $10,000?
- Tracking revenue trends over time with a time series line chart.
- Using pivot tables to compute total and average values by product line, region, and quarter.
These summaries form the basis of management reporting and are prerequisite to any deeper diagnostic or predictive analysis.
Chapter 5: Diagnostic Analytics — Segment Analysis
The Purpose of Segmentation
Aggregate summaries can obscure important variation. A company may report a healthy overall profit margin while one product line hemorrhages cash, or an overall strong satisfaction score while one customer segment is deeply dissatisfied. Diagnostic analytics disaggregates the big picture to find patterns that explain outcomes.
Segmentation Variables
Segmentation partitions the dataset into mutually exclusive groups based on one or more variables. Common segmentation dimensions in accounting and finance:
- Customer demographics: Age, income tier, geographic region.
- Product: Product line, SKU, category.
- Time: Quarter, fiscal year, month.
- Behavior: Recency, frequency, monetary value (RFM segmentation).
- Risk profile: Credit score band, default history.
Conditional Aggregation in Spreadsheets
In Google Sheets or Excel, conditional aggregation computes summary statistics for a subset of rows meeting specified criteria:
- `SUMIF(range, criteria, sum_range)`: sums values where the condition is met.
- `COUNTIF(range, criteria)`: counts rows meeting the condition.
- `AVERAGEIF(range, criteria, average_range)`: averages values meeting the condition.
Pivot tables automate this for cross-tabular analyses: drag and drop dimensions to produce group-level summaries instantly.
Segment Analysis in R
In R, the dplyr package provides a grammar for data manipulation:
```r
library(dplyr)

segment_summary <- data %>%
  group_by(product_category) %>%
  summarise(
    avg_revenue   = mean(revenue, na.rm = TRUE),
    total_revenue = sum(revenue, na.rm = TRUE),
    count         = n()
  )
```
This pipeline groups the data by product_category, then computes the mean revenue, total revenue, and count for each group. The %>% (pipe) operator chains operations left to right, producing readable, modular code.
Chapter 6: Introduction to R for Data Analytics
Why R?
R is an open-source programming language designed specifically for statistical computing and data analysis. It has become the lingua franca of academic statistics and is widely used in finance, actuarial science, economics, and data science. Key advantages:
- Extensive statistical libraries (base R + CRAN packages).
- High-quality visualization via ggplot2.
- Strong community and documentation.
- Reproducible research via R Markdown / Quarto.
Core R Concepts
Data Structures
- Vector: A one-dimensional array of elements of the same type. `c(1, 2, 3, 4)` creates a numeric vector.
- Data frame: The standard tabular data structure in R. Each column is a vector; all columns must have the same length. Equivalent to a spreadsheet or SQL table.
- List: A collection of objects that can be of different types and lengths.
Basic Operations
```r
# Assign a value
x <- 42

# Create a vector
scores <- c(85, 92, 78, 96, 88)

# Compute mean and standard deviation
mean(scores)
sd(scores)

# Create a data frame
df <- data.frame(
  name = c("Alice", "Bob", "Carol"),
  revenue = c(120000, 85000, 200000)
)

# Access a column
df$revenue
```
Data Visualization with ggplot2
ggplot2 uses a layered “grammar of graphics” to build charts:
```r
library(ggplot2)

ggplot(df, aes(x = name, y = revenue)) +
  geom_col(fill = "steelblue") +
  labs(title = "Revenue by Salesperson",
       x = "Salesperson", y = "Revenue ($)") +
  theme_minimal()
```
Each + adds a layer: the aesthetic mapping (aes), the geometry (geom_col for bar charts), labels (labs), and theme.
Chapter 7: Predictive Analytics — Introduction to Modelling
Predictive Analytics Overview
Predictive analytics uses historical data to estimate future outcomes. In business contexts, predictive models are used to:
- Forecast revenue for the next quarter.
- Score customers by probability of loan default.
- Estimate employee turnover in the next year.
- Predict optimal pricing levels.
The building blocks of predictive modelling are covered in more depth in AFM 113 (inferential statistics). In AFM 112, the focus is on applying a simple modelling tool within the CRISP-DM framework through two case studies.
OkCupid Case — Predictive Analytics for Matching
This case study applies predictive logic to a dating platform dataset, using user attributes to predict compatibility or matching outcomes. It illustrates how the same analytical techniques used in financial risk scoring (logistic regression, classification) appear in very different industry contexts—reinforcing the transferability of analytic skills across domains.
Key concepts applied:
- Binary outcome variable (match / no match).
- Selection of predictor variables.
- Evaluation of predictive accuracy.
Pet Adoption Case — Prescriptive Analytics
Prescriptive analytics extends prediction to recommendation: given a prediction about future outcomes, what is the best course of action?
In the pet adoption case, the analysis asks: which animals in a shelter are at risk of not being adopted, and what interventions (featured placement, discounted adoption fees, targeted outreach) would be most effective? The analyst must both predict adoption probability and optimize the allocation of limited promotional resources.
Chapter 8: Data Visualization and Dashboards
Principles of Effective Visualization
Data visualization communicates analytical findings to audiences who may not have technical backgrounds. Poor visualizations bury insights in clutter; excellent visualizations make the key message obvious at a glance.
The Preattentive Attributes
Certain visual properties are processed by the human visual system almost instantaneously, before conscious attention:
- Color hue — use sparingly; reserve distinct colors for distinct categories.
- Color intensity — effective for encoding magnitude (e.g., heat maps).
- Size — encodes quantitative differences (e.g., bubble charts).
- Position — the most accurate encoding for quantitative comparisons (position on a common scale).
- Shape — useful for distinguishing groups in scatter plots.
Choosing the Right Chart Type
| Analytical Purpose | Recommended Chart |
|---|---|
| Compare values across categories | Bar chart |
| Show trend over time | Line chart |
| Show distribution | Histogram, box plot |
| Show relationship between two variables | Scatter plot |
| Show part-to-whole | Stacked bar; avoid pie charts for > 3 slices |
| Show geographic patterns | Choropleth map |
Avoiding Common Visualization Mistakes
- Truncated y-axes: Starting a bar chart’s y-axis at a value other than zero exaggerates differences visually.
- 3D charts: Three-dimensional bar charts add visual distortion without adding information.
- Chartjunk (Tufte’s term): Decorative elements that add no information—gridlines, borders, shadows, unnecessary images.
- Dual axes: Two y-axes with different scales on the same chart can create misleading visual correlations.
Interactive Dashboards
A dashboard is a visual display of the key information a decision-maker needs, arranged on a single screen. Interactive dashboards allow users to filter, drill down, and explore data without requiring programming knowledge.
In AFM 112, dashboard construction is practiced using Google Sheets (slicers, charts linked to pivot tables) and potentially R Shiny or Tableau Public. The group project requires producing a dashboard for a non-technical audience; this forces students to make deliberate choices about what to show, what to omit, and how to frame the findings.
Design Principles for Dashboards
- Single audience, single purpose: Know who the dashboard serves and what decision it supports.
- Hierarchy: Put the most important metric at the top-left (where the eye goes first).
- Consistency: Use the same color palette and chart types throughout.
- Context: Show a metric alongside its target or prior-period value. A revenue figure of $1.2M means nothing without knowing whether the target was $1.0M or $1.5M.
- Minimize cognitive load: Fewer charts done well outperform many charts done poorly.
Chapter 9: Probability Review for Business Analytics
Why Probability Matters in Analytics
Many analytic outputs are probabilistic: a predictive model estimates the probability that a loan will default; a simulation model produces a distribution of possible revenue outcomes; hypothesis tests produce p-values that quantify the probability of observing the data if a null hypothesis were true. Without a grounding in probability, these outputs cannot be interpreted correctly.
Basic Probability Rules
Addition rule (for mutually exclusive events):
\[ P(A \cup B) = P(A) + P(B) \]

General addition rule:

\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \]

Multiplication rule (independent events):

\[ P(A \cap B) = P(A) \times P(B) \]

Conditional probability:

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]

Bayes’ theorem:

\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \]

Bayes’ theorem is particularly relevant in audit and risk contexts: given that an audit test detected a discrepancy, what is the probability that the account contains a material misstatement? The answer depends on the prior probability of misstatement (the “base rate”) and the sensitivity of the test.
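The audit question can be computed directly from Bayes' theorem; all rates below are invented for illustration:

```r
# Assumed figures for illustration only
prior       <- 0.02  # base rate: P(material misstatement)
sensitivity <- 0.90  # P(test flags | misstatement)
false_pos   <- 0.10  # P(test flags | no misstatement)

# P(flag), by the law of total probability
p_flag <- sensitivity * prior + false_pos * (1 - prior)

# Bayes' theorem: P(misstatement | flag)
posterior <- sensitivity * prior / p_flag
```

Even with a sensitive test, the low base rate keeps the posterior modest (here roughly 16%), which is why ignoring base rates is such a common interpretive error.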
Expected Value
For a random variable \(X\) taking values \(x_1, x_2, \ldots, x_k\) with probabilities \(p_1, p_2, \ldots, p_k\):
\[ E(X) = \sum_{i=1}^{k} x_i \cdot p_i \]

Expected value is the long-run average outcome and is foundational to decision analysis under uncertainty.
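A direct sketch of the formula, with invented payoffs and probabilities:

```r
# A project with three possible payoffs (figures invented)
payoff <- c(-50000, 20000, 150000)
prob   <- c(0.2, 0.5, 0.3)

# E(X) = sum of value times probability
ev <- sum(payoff * prob)
```

Here E(X) = (−50,000)(0.2) + (20,000)(0.5) + (150,000)(0.3) = $45,000: the project is attractive on average even though its worst case is a loss.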
This probability review module bridges the descriptive analytics content of AFM 112 with the inferential statistics content of AFM 113, ensuring students have the conceptual grounding to engage meaningfully with sampling distributions, confidence intervals, and hypothesis tests.