AFM 112: Analytic Methods for Business 1
Estimated study time: 19 minutes
Sources and References
- Primary textbook: Stratopoulos, T. (2025). *Introduction to Data Analytics Process: From Spreadsheets to R and Python*. VitalSource.
- Supplementary: Provost, F., & Fawcett, T. (2013). *Data Science for Business*. O'Reilly Media.
- Supplementary: Watt, A., & Eng, N. (2014). *Database Design* (2nd ed.). BCcampus Open Education.
- Online resources: DataCamp (R and spreadsheet exercises); MIT OpenCourseWare 15.060: Data, Models, and Decisions; Google's Machine Learning Crash Course.
Chapter 1: Data Analytics in Business Context
Why Analytics Matters for Finance Professionals
Modern accounting and finance professionals are increasingly expected to work with data at scale. The era of compiling reports from manually entered spreadsheets is giving way to environments where vast datasets must be queried, cleaned, modelled, and communicated—often under time pressure and with business decisions hanging in the balance. An understanding of data analytics principles is no longer optional for professionals in accounting, auditing, or corporate finance; it is a baseline expectation.
The Four Types of Analytics
| Type | Question Answered | Complexity | Example |
|---|---|---|---|
| Descriptive | What happened? | Low | Monthly sales dashboard |
| Diagnostic | Why did it happen? | Moderate | Variance analysis by segment |
| Predictive | What will happen? | High | Customer churn probability |
| Prescriptive | What should we do? | Highest | Optimal inventory reorder point |
This course traverses all four types through a sequence of business cases, beginning with descriptive statistics and building toward regression-based prediction.
Generative AI as a Productivity Tool
Generative AI (GenAI) tools such as large language models (LLMs) can serve as powerful accelerators for data analytics work: generating boilerplate code, explaining error messages, translating natural-language descriptions of analyses into code, and providing first drafts of reports. However, effective use of GenAI requires sufficient domain knowledge to evaluate whether the output is correct. A model that writes R code with a logical error will produce incorrect results confidently; only a practitioner who understands what the code should do can catch the error.
In this course, prompt engineering—the craft of writing clear, specific instructions to a GenAI model—is introduced early. A well-designed prompt specifies the task precisely, provides relevant context, and requests the output in a usable format. Poorly specified prompts produce generic or incorrect results.
Chapter 2: Data Management
Data Fundamentals
Before analysing data, professionals must understand what data is, how it is structured, and where its quality problems originate.
Variables and Data Types
Data types shape the analysis tools that are appropriate:
- Categorical (qualitative): Values represent categories with no natural ordering (e.g., province, product category, account type). Subtype: nominal.
- Ordinal: Categories with a natural ordering but no consistent spacing (e.g., credit rating from AAA to D, customer satisfaction from 1 to 5).
- Numerical — discrete: Countable integer values (e.g., number of transactions, number of employees).
- Numerical — continuous: Can take any value within a range (e.g., account balance, interest rate, duration in days).
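These distinctions map directly onto R's type system, which the course introduces later; a minimal sketch, with all values invented for illustration:

```r
# Categorical (nominal): a factor with unordered levels
province <- factor(c("ON", "BC", "ON", "QC"))

# Ordinal: a factor with an explicit level ordering
rating <- factor(c("AA", "BBB", "AA", "D"),
                 levels = c("D", "BBB", "AA", "AAA"), ordered = TRUE)

# Numerical, discrete: integer counts
n_transactions <- c(12L, 7L, 30L)

# Numerical, continuous: doubles
balance <- c(1520.75, 98.10, 22400.00)

is.ordered(rating)  # TRUE: comparisons such as rating > "BBB" are meaningful
```

Declaring the ordering explicitly matters: without `ordered = TRUE`, R would treat credit ratings as unordered categories and refuse magnitude comparisons.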
Tidy Data Principles
The concept of tidy data (formalized by Hadley Wickham) describes a standardized way of organizing tabular data to facilitate analysis:
- Each variable occupies exactly one column.
- Each observation occupies exactly one row.
- Each type of observational unit forms a separate table.
Messy data—where multiple variables are stored in one column, where rows represent multiple observations, or where values are in column headers—must be reshaped before analysis can proceed.
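As a sketch of that reshaping step, the tidyr package (part of the same tidyverse ecosystem as dplyr, used later in the course) can move values out of column headers and into a proper variable; the table and column names here are invented:

```r
library(tidyr)

# Messy: one row per region, with quarters stored in column headers
messy <- data.frame(
  region = c("East", "West"),
  Q1 = c(100, 150),
  Q2 = c(110, 140)
)

# Tidy: one row per (region, quarter) observation
tidy <- pivot_longer(messy, cols = c(Q1, Q2),
                     names_to = "quarter", values_to = "revenue")
```

After reshaping, `quarter` is an ordinary column, so grouping and filtering by quarter work like any other variable.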
Derived Variables
A derived variable is created by applying a formula or transformation to one or more existing variables. Examples in an accounting context:
- Gross margin ratio = Gross profit / Revenue — derived from two income statement items.
- Days sales outstanding = (Accounts receivable / Revenue) × 365 — derived from balance sheet and income statement.
- Year-over-year growth = (Current year value − Prior year value) / |Prior year value| — derived from two time points.
Creating derived variables is a critical step in feature engineering for both descriptive and predictive analyses.
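A minimal base-R sketch of two of these derived variables, using invented income statement figures:

```r
# Hypothetical figures for three consecutive periods
revenue      <- c(500000, 550000, 480000)
gross_profit <- c(200000, 231000, 182400)

# Gross margin ratio = gross profit / revenue
gross_margin <- gross_profit / revenue

# Year-over-year growth for periods 2..3, relative to the prior period
yoy_growth <- diff(revenue) / abs(head(revenue, -1))
```

Both derived variables are vectors of the same length as the period structure they describe, so they can be added as new columns of the source data frame.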
Data Quality Issues
Real business data is rarely clean. Common quality problems include:
- Missing values: Rows where one or more fields are blank or null.
- Duplicates: Multiple identical or near-identical rows representing the same observation.
- Inconsistent formatting: Dates recorded as “2026-01-15,” “Jan 15, 2026,” and “15/01/26” in the same column.
- Outliers: Values that are far from the bulk of the data, which may represent errors or genuine extreme observations.
- Encoding errors: A value of “999” or “-1” used to indicate a missing value rather than a true observation.
Documenting and addressing data quality issues is a non-negotiable first step in any analysis; analyses built on dirty data produce unreliable conclusions.
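A quick audit of these problems might look like the following base-R sketch; the data frame and the -1 sentinel are invented for illustration:

```r
df <- data.frame(
  customer_id = c(1, 2, 2, 3, 4),
  balance     = c(1500, NA, NA, -1, 2200)  # -1 used as a missing-value code
)

colSums(is.na(df))   # missing values per column
sum(duplicated(df))  # count of exact duplicate rows

# Recode the -1 sentinel to a true NA before computing any statistics
df$balance[which(df$balance == -1)] <- NA
```

Recoding sentinels matters because a "-1" left in place would silently distort means, minimums, and any downstream model.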
Chapter 3: The CRISP-DM Framework
Cross-Industry Standard Process for Data Mining
CRISP-DM (Cross-Industry Standard Process for Data Mining) is a widely adopted framework for structuring data analytics projects. It provides a roadmap that ensures analytical work is grounded in business needs and produces actionable results.
Phase 1: Business Understanding
The starting point is a precise formulation of the business problem or question. Analytics projects often fail when this phase is rushed. Key questions include:
- What decision needs to be made?
- What information would reduce uncertainty about that decision?
- Who are the stakeholders, and what do they care about?
- What does success look like?
For example, a retail bank might ask: “Which customers are likely to close their accounts in the next 90 days?” This frames the problem as a binary classification task (churn / no churn), which guides every subsequent phase.
Framing a problem as a decision or question is the first CRISP-DM step applied in this course, practiced through simulated case scenarios that mirror real financial industry situations.
Phase 2: Data Understanding
With the question formulated, analysts identify what data is available and assess its suitability:
- What datasets exist that could contain relevant variables?
- How much data is available? Over what time period?
- At what granularity is the data (transaction-level, daily, monthly)?
- What are the data’s quality issues?
Exploratory data analysis (EDA) is the primary tool in this phase: summary statistics, frequency tables, and visualizations reveal the distribution, spread, and relationships among variables.
Phase 3: Data Preparation
Data preparation is typically the most time-consuming phase, often accounting for 60–80% of total project effort. Tasks include:
- Selecting relevant variables — excluding irrelevant or redundant columns.
- Cleaning — handling missing values (impute, delete, flag), removing duplicates, standardizing formats.
- Transforming — creating derived variables, scaling numeric features, encoding categorical variables.
- Subsetting — filtering rows to the relevant population, time period, or segment.
- Merging — joining multiple tables on key columns (customer ID, transaction ID).
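These preparation tasks chain naturally in a dplyr pipeline; a sketch with invented tables and column names:

```r
library(dplyr)

transactions <- data.frame(
  customer_id = c(1, 1, 2, 3),
  region      = c("ON", "ON", "BC", "ON"),
  amount      = c(100, NA, 250, 80)
)
customers <- data.frame(
  customer_id = c(1, 2, 3),
  segment     = c("retail", "retail", "commercial")
)

prepared <- transactions %>%
  filter(!is.na(amount)) %>%                    # cleaning: drop missing amounts
  mutate(amount_k = amount / 1000) %>%          # transforming: derived variable
  filter(region == "ON") %>%                    # subsetting: one region
  left_join(customers, by = "customer_id") %>%  # merging on a key column
  select(customer_id, segment, amount_k)        # selecting relevant variables
```

Each verb maps to one of the preparation tasks above, which is why pipelines like this double as documentation of the cleaning decisions made.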
Phase 4: Modeling
With a clean, prepared dataset, analysts apply statistical or machine learning models. The choice of model depends on the question type:
- Regression — predicting a continuous outcome (e.g., next-quarter revenue).
- Classification — predicting a categorical outcome (e.g., loan default / no default).
- Clustering — discovering natural groupings in unlabelled data (e.g., customer segments).
- Association rules — identifying items frequently purchased together (market basket analysis).
In AFM 112, modelling is performed first in Google Sheets (using formulas and pivot tables) and then in R (using packages such as dplyr and ggplot2, and base functions such as lm()).
Phase 5: Evaluation
A model is not useful unless its quality has been assessed. Evaluation asks:
- How accurately does the model predict on data it has not seen?
- Does the model generalize, or has it overfit to the training data?
- Does the model’s output actually inform the business decision (Phase 1)?
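A minimal sketch of the first two questions, holding out a test set and measuring error with base R's lm() on simulated data:

```r
set.seed(42)

# Simulated data: a known linear relationship plus noise
n <- 200
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n)

# Random 75% / 25% train-test split
train_idx <- sample(n, 150)
train <- data.frame(x = x[train_idx],  y = y[train_idx])
test  <- data.frame(x = x[-train_idx], y = y[-train_idx])

# Fit on training data only, then score the unseen test rows
model <- lm(y ~ x, data = train)
pred  <- predict(model, newdata = test)
rmse  <- sqrt(mean((test$y - pred)^2))  # error on data the model never saw
```

Comparing this test-set RMSE with the training-set RMSE is the simplest check for overfitting: a model that fits the training data far better than the test data has not generalized.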
Phase 6: Deployment
Deployment means delivering the model’s output to decision-makers in a usable form: a dashboard, a report, a scoring system integrated into a business process, or a recommendation to management.
Chapter 4: Descriptive Analytics — The Big Picture
Summary Statistics
Descriptive analytics summarizes what is in the data without inferring beyond it. The two fundamental dimensions of description are central tendency and spread.
Measures of Central Tendency
Mean (arithmetic average): \( \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \)
Sensitive to outliers; best used with symmetric, outlier-free distributions.
Median: The middle value when observations are sorted. Robust to outliers. Preferred for skewed distributions (e.g., income, house prices).
Mode: The most frequently occurring value. Useful for categorical data or identifying dominant categories.
Measures of Spread
Range: Maximum − Minimum. Simplest but highly sensitive to outliers.
Variance: \( s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \)
The average squared deviation from the mean. Uses \(n-1\) (Bessel’s correction) for sample data.
Standard deviation: \( s = \sqrt{s^2} \). Returns spread to the original units of measurement.
Interquartile Range (IQR): \( Q_3 - Q_1 \). The range of the middle 50% of observations. Robust to outliers.
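Each of these measures is a one-liner in base R; an illustration on an invented vector of account balances that includes one extreme outlier:

```r
balances <- c(500, 1200, 1800, 2500, 3100, 90000)  # one extreme outlier

mean(balances)         # pulled far upward by the outlier
median(balances)       # robust: the middle of the sorted values
sd(balances)           # sample standard deviation (n - 1 denominator)
IQR(balances)          # Q3 - Q1, robust to the outlier
diff(range(balances))  # range: maximum minus minimum
```

Running this makes the robustness claims concrete: the mean and range are dominated by the single large balance, while the median and IQR barely move if that value is changed.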
Visualizing Distributions
Histogram: Shows the shape of the distribution—whether it is symmetric, left-skewed, right-skewed, or bimodal.
Box plot: Displays the median, IQR, whiskers (typically 1.5 × IQR), and individual outliers. Excellent for comparing distributions across groups.
Bar chart: Appropriate for categorical variables—shows frequency or proportion of each category.
Scatter plot: Shows the relationship between two continuous variables. Used in diagnostics and regression analysis.
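As a sketch, these chart types map onto ggplot2 geoms (the data here are simulated):

```r
library(ggplot2)

set.seed(1)
df <- data.frame(
  segment = rep(c("Retail", "Commercial"), each = 100),
  balance = c(rlnorm(100, 7, 1), rlnorm(100, 8, 1))  # right-skewed balances
)

# Histogram: shape of the overall distribution
p_hist <- ggplot(df, aes(x = balance)) +
  geom_histogram(bins = 30)

# Box plot: compare median, IQR, and outliers across segments
p_box <- ggplot(df, aes(x = segment, y = balance)) +
  geom_boxplot()
```

Printing either object (e.g., `p_hist`) renders the chart; storing plots in variables like this makes it easy to refine them layer by layer.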
Descriptive Analytics in a Business Context
A typical descriptive analysis for a financial services firm might involve:
- Importing transaction data and computing total and average transaction value by customer segment.
- Identifying the distribution of account balances: what percentage of customers fall below $1,000, between $1,000 and $10,000, or above $10,000?
- Tracking revenue trends over time with a time series line chart.
- Using pivot tables to compute total and average values by product line, region, and quarter.
These summaries form the basis of management reporting and are prerequisite to any deeper diagnostic or predictive analysis.
Chapter 5: Diagnostic Analytics — Segment Analysis
The Purpose of Segmentation
Aggregate summaries can obscure important variation. A company may report a healthy overall profit margin while one product line hemorrhages cash, or an overall strong satisfaction score while one customer segment is deeply dissatisfied. Diagnostic analytics disaggregates the big picture to find patterns that explain outcomes.
Segmentation Variables
Segmentation partitions the dataset into mutually exclusive groups based on one or more variables. Common segmentation dimensions in accounting and finance:
- Customer demographics: Age, income tier, geographic region.
- Product: Product line, SKU, category.
- Time: Quarter, fiscal year, month.
- Behavior: Recency, frequency, monetary value (RFM segmentation).
- Risk profile: Credit score band, default history.
Conditional Aggregation in Spreadsheets
In Google Sheets or Excel, conditional aggregation computes summary statistics for a subset of rows meeting specified criteria:
- `SUMIF(range, criteria, sum_range)`: sums values where the condition is met.
- `COUNTIF(range, criteria)`: counts rows meeting the condition.
- `AVERAGEIF(range, criteria, average_range)`: averages values meeting the condition.
Pivot tables automate this for cross-tabular analyses: drag and drop dimensions to produce group-level summaries instantly.
Segment Analysis in R
In R, the dplyr package provides a grammar for data manipulation:
```r
library(dplyr)

segment_summary <- data %>%
  group_by(product_category) %>%
  summarise(
    avg_revenue   = mean(revenue, na.rm = TRUE),
    total_revenue = sum(revenue, na.rm = TRUE),
    count         = n()
  )
```
This pipeline groups the data by product_category, then computes the mean revenue, total revenue, and count for each group. The %>% (pipe) operator chains operations left to right, producing readable, modular code.
Chapter 6: Introduction to R for Data Analytics
Why R?
R is an open-source programming language designed specifically for statistical computing and data analysis. It has become the lingua franca of academic statistics and is widely used in finance, actuarial science, economics, and data science. Key advantages:
- Extensive statistical libraries (base R + CRAN packages).
- High-quality visualization via ggplot2.
- Strong community and documentation.
- Reproducible research via R Markdown / Quarto.
Core R Concepts
Data Structures
- Vector: A one-dimensional array of elements of the same type. `c(1, 2, 3, 4)` creates a numeric vector.
- Data frame: The standard tabular data structure in R. Each column is a vector; all columns must have the same length. Equivalent to a spreadsheet or SQL table.
- List: A collection of objects that can be of different types and lengths.
Basic Operations
```r
# Assign a value
x <- 42

# Create a vector
scores <- c(85, 92, 78, 96, 88)

# Compute mean and standard deviation
mean(scores)
sd(scores)

# Create a data frame
df <- data.frame(
  name = c("Alice", "Bob", "Carol"),
  revenue = c(120000, 85000, 200000)
)

# Access a column
df$revenue
```
Data Visualization with ggplot2
ggplot2 uses a layered “grammar of graphics” to build charts:
```r
library(ggplot2)

ggplot(df, aes(x = name, y = revenue)) +
  geom_col(fill = "steelblue") +
  labs(title = "Revenue by Salesperson",
       x = "Salesperson", y = "Revenue ($)") +
  theme_minimal()
```
Each + adds a layer: the aesthetic mapping (aes), the geometry (geom_col for bar charts), labels (labs), and theme.
Chapter 7: Predictive Analytics — Introduction to Modelling
Predictive Analytics Overview
Predictive analytics uses historical data to estimate future outcomes. In business contexts, predictive models are used to:
- Forecast revenue for the next quarter.
- Score customers by probability of loan default.
- Estimate employee turnover in the next year.
- Predict optimal pricing levels.
The building blocks of predictive modelling are covered in more depth in AFM 113 (inferential statistics). In AFM 112, the focus is on applying a simple modelling tool within the CRISP-DM framework through two case studies.
OkCupid Case — Predictive Analytics for Matching
This case study applies predictive logic to a dating platform dataset, using user attributes to predict compatibility or matching outcomes. It illustrates how the same analytical techniques used in financial risk scoring (logistic regression, classification) appear in very different industry contexts—reinforcing the transferability of analytic skills across domains.
Key concepts applied:
- Binary outcome variable (match / no match).
- Selection of predictor variables.
- Evaluation of predictive accuracy.
Pet Adoption Case — Prescriptive Analytics
Prescriptive analytics extends prediction to recommendation: given a prediction about future outcomes, what is the best course of action?
In the pet adoption case, the analysis asks: which animals in a shelter are at risk of not being adopted, and what interventions (featured placement, discounted adoption fees, targeted outreach) would be most effective? The analyst must both predict adoption probability and optimize the allocation of limited promotional resources.
Chapter 8: Data Visualization and Dashboards
Principles of Effective Visualization
Data visualization communicates analytical findings to audiences who may not have technical backgrounds. Poor visualizations bury insights in clutter; excellent visualizations make the key message obvious at a glance.
The Preattentive Attributes
Certain visual properties are processed by the human visual system almost instantaneously, before conscious attention:
- Color hue — use sparingly; reserve distinct colors for distinct categories.
- Color intensity — effective for encoding magnitude (e.g., heat maps).
- Size — encodes quantitative differences (e.g., bubble charts).
- Position — the most accurate encoding for quantitative comparisons (position on a common scale).
- Shape — useful for distinguishing groups in scatter plots.
Choosing the Right Chart Type
| Analytical Purpose | Recommended Chart |
|---|---|
| Compare values across categories | Bar chart |
| Show trend over time | Line chart |
| Show distribution | Histogram, box plot |
| Show relationship between two variables | Scatter plot |
| Show part-to-whole | Stacked bar; avoid pie charts for > 3 slices |
| Show geographic patterns | Choropleth map |
Avoiding Common Visualization Mistakes
- Truncated y-axes: Starting a bar chart’s y-axis at a value other than zero exaggerates differences visually.
- 3D charts: Three-dimensional bar charts add visual distortion without adding information.
- Chartjunk (Tufte’s term): Decorative elements that add no information—gridlines, borders, shadows, unnecessary images.
- Dual axes: Two y-axes with different scales on the same chart can create misleading visual correlations.
Interactive Dashboards
A dashboard is a visual display of the key information a decision-maker needs, arranged on a single screen. Interactive dashboards allow users to filter, drill down, and explore data without requiring programming knowledge.
In AFM 112, dashboard construction is practiced using Google Sheets (slicers, charts linked to pivot tables) and potentially R Shiny or Tableau Public. The group project requires producing a dashboard for a non-technical audience; this forces students to make deliberate choices about what to show, what to omit, and how to frame the findings.
Design Principles for Dashboards
- Single audience, single purpose: Know who the dashboard serves and what decision it supports.
- Hierarchy: Put the most important metric at the top-left (where the eye goes first).
- Consistency: Use the same color palette and chart types throughout.
- Context: Show a metric alongside its target or prior-period value. A revenue figure of $1.2M means nothing without knowing whether the target was $1.0M or $1.5M.
- Minimize cognitive load: Fewer charts done well outperform many charts done poorly.
Chapter 9: Probability Review for Business Analytics
Why Probability Matters in Analytics
Many analytic outputs are probabilistic: a predictive model estimates the probability that a loan will default; a simulation model produces a distribution of possible revenue outcomes; hypothesis tests produce p-values that quantify the probability of observing the data if a null hypothesis were true. Without a grounding in probability, these outputs cannot be interpreted correctly.
Basic Probability Rules
Addition rule (for mutually exclusive events):
\[ P(A \cup B) = P(A) + P(B) \]

General addition rule:

\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \]

Multiplication rule (independent events):

\[ P(A \cap B) = P(A) \times P(B) \]

Conditional probability:

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]

Bayes’ theorem:

\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \]

Bayes’ theorem is particularly relevant in audit and risk contexts: given that an audit test detected a discrepancy, what is the probability that the account contains a material misstatement? The answer depends on the prior probability of misstatement (the “base rate”) and the sensitivity of the test.
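The audit question can be computed directly from Bayes' theorem; all rates below are invented for illustration:

```r
# Assumed figures for illustration only
prior       <- 0.02  # base rate: P(material misstatement)
sensitivity <- 0.90  # P(test flags | misstatement)
false_pos   <- 0.10  # P(test flags | no misstatement)

# P(flag), by the law of total probability
p_flag <- sensitivity * prior + false_pos * (1 - prior)

# Bayes' theorem: P(misstatement | flag)
posterior <- sensitivity * prior / p_flag
```

Even with a sensitive test, the low base rate keeps the posterior modest (here roughly 16%), which is why ignoring base rates is such a common interpretive error.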
Expected Value
For a random variable \(X\) taking values \(x_1, x_2, \ldots, x_k\) with probabilities \(p_1, p_2, \ldots, p_k\):
\[ E(X) = \sum_{i=1}^{k} x_i \cdot p_i \]

Expected value is the long-run average outcome and is foundational to decision analysis under uncertainty.
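A direct sketch of the formula, with invented payoffs and probabilities:

```r
# A project with three possible payoffs (figures invented)
payoff <- c(-50000, 20000, 150000)
prob   <- c(0.2, 0.5, 0.3)

# E(X) = sum of value times probability
ev <- sum(payoff * prob)
```

Here E(X) = (−50,000)(0.2) + (20,000)(0.5) + (150,000)(0.3) = $45,000: the project is attractive on average even though its worst case is a loss.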
This probability review module bridges the descriptive analytics content of AFM 112 with the inferential statistics content of AFM 113, ensuring students have the conceptual grounding to engage meaningfully with sampling distributions, confidence intervals, and hypothesis tests.