AFM 423: AI and Finance

Tony Wirjanto

Estimated study time: 1 hr 16 min

Table of contents

Sources and References

The exposition that follows is built on top of a core set of references that together define the modern landscape of machine learning for quantitative investing.

Coqueret, Guillaume, and Tony Guida. Machine Learning for Factor Investing: R Version. Chapman & Hall / CRC Financial Mathematics Series, 2020. A Python edition followed in 2023. The book is also maintained as a free HTML edition at https://www.mlfactor.com/. This text is the structural spine of these notes; its notation, data conventions and worked examples are reused throughout.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning, with Applications in R. 2nd ed. Springer Texts in Statistics, 2021.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer Series in Statistics, 2009.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
J.P. Morgan Global Quantitative & Derivatives Strategy team. Big Data and AI Strategies: Machine Learning and Alternative Data Approach to Investing. May 2017.
Lopez de Prado, Marcos. Advances in Financial Machine Learning. Wiley, 2018. Source for the triple-barrier labelling method, meta-labelling, purged and combinatorially purged cross-validation, fractional differentiation and the deflated Sharpe ratio.
Stanford University, CS229, Machine Learning lecture notes by Andrew Ng and colleagues.
Stanford University, CS230, Deep Learning lecture notes and course materials.
Massachusetts Institute of Technology, 6.036 / 6.867, Introduction to Machine Learning and Machine Learning lecture notes.
Kenneth R. French data library, Tuck School of Business at Dartmouth, https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html. Primary source for Fama-French factor returns, industry portfolios, and book-to-market sorts.

Chapter 1: Foundations of Factor Investing

1.1 Notation, Data Conventions and the Panel Structure of Returns

Quantitative investing begins with a panel of firms observed over time. Let the universe of assets be indexed by \( i = 1, \ldots, N \) and calendar time by \( t = 1, \ldots, T \), where the time index corresponds to whatever rebalancing frequency is in use (daily, weekly, monthly, quarterly). The object of study is the matrix of excess total returns

\[ r_{i,t+1} = \frac{P_{i,t+1} + D_{i,t+1}}{P_{i,t}} - 1 - r^f_{t+1}, \]

where \( P_{i,t} \) is the price of asset \( i \) at the close of period \( t \), \( D_{i,t+1} \) is the cash dividend paid during \( (t, t+1] \) and \( r^f_{t+1} \) is the risk-free rate. The one-period-ahead return \( r_{i,t+1} \) is the canonical target of supervised learning in this field.

Alongside returns, each firm-date pair \( (i,t) \) carries a vector of features (equivalently, predictors, characteristics, or factors) \( \mathbf{x}_{i,t} \in \mathbb{R}^K \). A feature might be last period’s market capitalization, book-to-market ratio, twelve-month past return, gross profitability, asset growth, or a sentiment score extracted from news. The typical factor-investing dataset is therefore a stacked panel

\[ \mathcal{D} = \{(\mathbf{x}_{i,t}, r_{i,t+1}) : i = 1, \ldots, N_t, \ t = 1, \ldots, T\}, \]

where \( N_t \) can vary with time because firms enter (IPO) and exit (delist, merge) the sample. Two features of this data structure distinguish it from the standard i.i.d. setup assumed in introductory machine learning texts. First, observations are not independent: they cluster along the time axis (autocorrelation and common shocks) and along the cross-section (industry comovement). Second, the data generating process is non-stationary: parameters linking features to returns drift as markets, regulations and investor attention shift.

Definition 1.1 (Supervised learning for factor investing). Given the panel \( \mathcal{D} \), a supervised learning problem seeks a function \( f \colon \mathbb{R}^K \to \mathbb{R} \) chosen from a hypothesis class \( \mathcal{F} \) such that \( f(\mathbf{x}_{i,t}) \) forecasts \( r_{i,t+1} \) in a loss-minimizing sense. When the target is a binary or multiclass label derived from \( r_{i,t+1} \), the task becomes classification.

Returns are typically winsorized at the 1% and 99% cross-sectional quantiles to limit the influence of extreme events. Features are almost always normalized cross-sectionally at each date, either by computing standardized z-scores or by replacing each raw feature with its rank within the cross-section and mapping the ranks to the interval \( [0, 1] \) or \( [-1, 1] \). This cross-sectional transformation has two virtues: it makes features comparable across dates when their marginal distributions drift over time, and it focuses the learning problem on the relative ordering of firms, which is precisely what matters for long-short portfolios.

1.2 The Cross-Section of Stock Returns and Classical Anomalies

Empirical asset pricing has accumulated, over forty years, a zoo of characteristics that are robustly correlated with average stock returns. These cross-sectional patterns, often called anomalies, form the ground on which modern machine-learning-based factor models operate.

The size anomaly, documented by Banz in 1981 and folded into Fama and French’s subsequent work, is the observation that small-capitalization stocks have historically earned higher average returns than large-cap stocks, even after adjusting for market beta. The value anomaly of Basu (1977), Rosenberg, Reid and Lanstein (1985), and Fama and French (1992) states that stocks with a high ratio of book equity to market equity (value stocks) outperform those with a low ratio (growth stocks). The momentum anomaly of Jegadeesh and Titman (1993) finds that stocks with high returns over the past three to twelve months continue to outperform stocks with low past returns over the next three to twelve months. Profitability (Novy-Marx, 2013) and investment (Titman, Wei and Xie, 2004) are the two pillars of Fama and French’s 2015 five-factor model: firms with high gross profits over assets earn higher future returns, while firms that aggressively grow their assets earn lower future returns. Finally, the low-volatility anomaly (Ang, Hodrick, Xing and Zhang, 2006; Baker, Bradley and Wurgler, 2011) is the empirical fact that stocks with low idiosyncratic or total volatility deliver abnormally high risk-adjusted returns, in direct contradiction to the textbook mean-variance tradeoff.

Remark 1.2. A 2016 survey by Harvey, Liu and Zhu listed more than 300 candidate factors documented in the academic literature. A large fraction fail to replicate out of sample, a concern that motivates the multiple-testing adjustments and deflated Sharpe ratios discussed in Chapter 8.

1.3 From CAPM to Fama-French to ML Factor Models

The simplest equilibrium model, the Capital Asset Pricing Model (CAPM) of Sharpe, Lintner and Mossin, states that expected excess returns are linear in a single market factor:

\[ \mathbb{E}[r_{i,t+1}] = \beta_i \, \mathbb{E}[r^{m}_{t+1}], \]

with \( \beta_i = \operatorname{Cov}(r_{i}, r^{m})/\operatorname{Var}(r^{m}) \). The CAPM’s empirical failures motivated Fama and French (1993) to add two additional factors, size (SMB, small-minus-big) and value (HML, high-minus-low), producing the three-factor model. Carhart (1997) added a momentum factor (MOM or WML, winners-minus-losers), and Fama and French (2015) added profitability (RMW, robust-minus-weak) and investment (CMA, conservative-minus-aggressive) factors to form a five-factor model, later extended to six by re-adding momentum.

Letting \( \mathbf{f}_t \) be a vector of traded factor returns, classical factor models take the linear time-series form

\[ r_{i,t} = \alpha_i + \boldsymbol{\beta}_i^\top \mathbf{f}_t + \varepsilon_{i,t}. \]

Machine-learning factor models instead use firm characteristics rather than traded factor returns as inputs. The unifying framework, articulated in Gu, Kelly and Xiu (2020), Kelly, Pruitt and Su (2019) and the Coqueret-Guida textbook, is

\[ \mathbb{E}[r_{i,t+1} \mid \mathbf{x}_{i,t}] = g^\star(\mathbf{x}_{i,t}), \]

where \( g^\star \) is an unknown, potentially highly non-linear function of firm characteristics. The job of the ML model is to approximate \( g^\star \) well enough that sorting firms by \( \widehat{g}(\mathbf{x}_{i,t}) \) yields a portfolio with genuine out-of-sample alpha.

Example 1.3 (A simple characteristic portfolio). Suppose \( K = 1 \) and the only feature is last month's return reversal, \( x_{i,t} = -r_{i,t} \). A zero-cost long-short portfolio that at each month buys the top decile of \( x_{i,t} \) and shorts the bottom decile implements the short-term reversal strategy of Jegadeesh (1990). A simple ML model \( \widehat{g}(x) = x \) would reproduce this portfolio exactly; the interest of richer ML models is that they can combine dozens of such features into a single ranking that exploits interactions and non-linearities.

1.4 Firm Characteristics as Predictors

Standard feature panels used in factor investing research include the 94 characteristics compiled by Green, Hand and Zhang (2017) and reused by Gu, Kelly and Xiu (2020). They can be grouped into:

Group	Typical features
Valuation	book-to-market, earnings-to-price, sales-to-price, cash flow-to-price
Momentum and reversal	1-month reversal, 12-1 month momentum, 36-month long-term reversal
Quality	ROE, ROA, gross profitability, asset turnover, accruals
Investment and growth	asset growth, net stock issuance, capex-to-assets
Risk	realized volatility, market beta, maximum daily return, idiosyncratic skewness
Liquidity	Amihud illiquidity, bid-ask spread, share turnover, dollar volume
Trading frictions	short interest, institutional ownership, days-to-cover

The assumption that makes ML models useful is that the mapping from these features to expected returns is non-linear and involves interactions. A classical linear regression imposes that the effect of, say, book-to-market does not depend on the level of momentum; a tree ensemble or neural network can capture the fact that value signals work much better among winners than among losers, or that momentum is stronger in small stocks. Chapter 5 and Chapter 6 return to this point with concrete architectures.

Chapter 2: Portfolio Back-testing Framework

2.1 The Back-testing Mentality

A back-test is a simulation of a trading strategy on historical data that answers the counterfactual question “if I had deployed this model in the past, how would it have performed?”. Back-testing is the empirical laboratory of quantitative finance, but it is also a laboratory in which it is embarrassingly easy to fool oneself. The central fact is that any sufficiently flexible class of strategies will contain one that looks great on any finite dataset purely by chance. Careful back-testing is therefore as much about disciplined protocol as about code.

Definition 2.1 (Back-test). A back-test is a mapping from a dataset \( \mathcal{D} \), a model class \( \mathcal{F} \), a training procedure \( \mathcal{A} \) and a portfolio rule \( \pi \) to a simulated time series of portfolio returns \( \{R^{\pi}_t\}_{t=1}^T \), together with summary statistics (mean return, volatility, Sharpe ratio, drawdowns, turnover).

2.2 Train/Test Splits in Time-Series Finance

In i.i.d. machine learning, the standard protocol is to partition data randomly into training, validation and test sets. In a time-series panel this is inadmissible because random splitting leaks information from the future into the past. The minimum discipline is a chronological split: the training window ends at some date \( T_1 \), the validation window runs from \( T_1 + 1 \) to \( T_2 \), and the test window from \( T_2 + 1 \) to \( T \). The model is estimated on the training set, hyperparameters are tuned on the validation set, and the test set is touched only once, at the end.

A more realistic framework mimics how a strategy would actually be deployed, re-estimating the model periodically and producing out-of-sample forecasts at every date. Two implementations dominate:

Expanding window: at rebalancing date \( t \), the model is re-estimated on data up to \( t \), using every historical observation available. The training set grows over time. This is the default in academic papers.
Rolling window: at rebalancing date \( t \), the model is re-estimated on a window of fixed length \( L \) ending at \( t \). The training set has constant size. This copes better with non-stationarity by discarding stale data.

Remark 2.2. In practice, expanding windows work well for slow-moving relationships (e.g. value) while rolling windows work better for fast-moving relationships (e.g. microstructure-driven strategies). A compromise is an exponentially weighted window that gives recent observations more weight without completely discarding old ones.

2.3 Rebalancing, Transaction Costs and Turnover

A back-test that ignores trading costs is worthless. Let \( \mathbf{w}_t \in \mathbb{R}^N \) denote the vector of portfolio weights at the start of period \( t \), with \( \mathbf{1}^\top \mathbf{w}_t = 1 \) for long-only portfolios and \( \mathbf{1}^\top \mathbf{w}_t = 0 \) for dollar-neutral long-short portfolios. The realized return after costs is

\[ R^{\pi}_{t+1} = \mathbf{w}_t^\top \mathbf{r}_{t+1} - c \sum_{i=1}^N \big| w_{i,t} - w_{i,t-1}^{-} \big|, \]

where \( c \) is the per-dollar transaction cost (a blend of bid-ask spread, market impact and commissions) and \( w_{i,t-1}^{-} \) is the weight on asset \( i \) at the end of period \( t-1 \) after price drift. The term

\[ \mathrm{TO}_t = \sum_{i=1}^N \big| w_{i,t} - w_{i,t-1}^{-} \big| \]

is the turnover, the total fraction of the portfolio replaced at date \( t \). Typical institutional values are \( c \in [5, 20] \) basis points for liquid developed-market equities and an order of magnitude higher in small caps. Back-testing with realistic costs almost always reveals that the most predictive models (high cross-sectional \( R^2 \)) are not the most profitable: their signal turns over too fast.

Example 2.3 (Cost-adjusted Sharpe). A momentum strategy with an annualized gross Sharpe of 1.2 and monthly turnover of 80% will pay \( 12 \times 0.8 \times 0.001 = 0.96\% \) per year in costs at a one-sided cost of 10 bps. If annualized vol is 15%, this subtracts about 0.064 from the Sharpe ratio, a modest but material hit. A weekly-rebalanced reversal strategy with turnover 300% per week and the same cost rate loses \( 52 \times 3 \times 0.001 = 15.6\% \) per year and is almost certainly unprofitable.

2.4 Long-only and Long-short Portfolio Construction

Let \( s_{i,t} = \widehat{g}(\mathbf{x}_{i,t}) \) denote the model’s predicted return (the “signal”) for asset \( i \) at date \( t \). Several standard portfolio rules map signals to weights:

Sort-and-hold deciles: rank firms by \( s_{i,t} \), form ten equal-weight or value-weight portfolios, and report the return of the top minus bottom decile. This is the workhorse of academic factor research because it is simple and robust to the scale of the signal.
Equal-weight long-short: buy the top quintile and short the bottom quintile, equal weighting within each leg, dollar neutral overall.
Linear weight: set \( w_{i,t} \propto s_{i,t} - \bar{s}_t \) with normalization \( \sum_i |w_{i,t}| = 1 \). This uses the full cross-section rather than only the tails.
Mean-variance optimized: feed the signal and an estimated covariance matrix \( \widehat{\boldsymbol{\Sigma}}_t \) into a Markowitz problem.

Long-only versions of these rules replace short positions with zero weights and usually renormalize. In practice regulatory frictions (UCITS, 40-Act funds) and shorting costs make long-only implementations much more common than academic papers suggest; the trade-off is that removing negative weights substantially reduces the effective information coefficient captured by the portfolio.

2.5 The Gap Between In-sample Fit and Out-of-sample Profitability

In classical statistics, a model with a good fit on training data is presumed to generalize. In finance, this presumption is false. The basic reason is non-stationarity: the data-generating process changes over time, so a relationship that was strong in-sample may have weakened, flipped sign, or disappeared out-of-sample as it became crowded. A secondary reason is signal-to-noise ratio: monthly equity returns are roughly 80-95% noise, so training algorithms tend to latch onto idiosyncratic features of the training set that have no predictive value.

The gap is quantified by comparing in-sample \( R^2 \), the squared correlation between fitted and realized returns on the training data, with out-of-sample \( R^2_{\mathrm{OOS}} \), computed on data the model has never seen. Gu, Kelly and Xiu (2020) report monthly cross-sectional \( R^2_{\mathrm{OOS}} \) numbers of 0.3% to 0.7% for neural networks on US equities, with tree ensembles nearly as good. These numbers look trivial but compound into annualized Sharpe ratios of 1.5-2.5 in long-short portfolios constructed from the forecasts.

2.6 Common Pitfalls

Remark 2.4 (Look-ahead bias). Features must be constructed using information available at the start of period \( t \), not information that becomes available during or at the end of \( t \). Accounting data require particular care: Compustat fiscal-year data are typically available with a three- to six-month lag, so using them without a lag would peek into the future.

Remark 2.5 (Survivorship bias). A database that includes only firms currently listed understates the risk and overstates the return of any strategy. A dataset used for back-testing must be survivorship-bias free: it must include firms that were subsequently delisted, merged or went bankrupt, with accurate delisting returns (CRSP provides these).

Remark 2.6 (Multiple testing). Running 1000 strategies on the same data and keeping the best guarantees a result that looks good even if all strategies are pure noise. The Sharpe ratio of the best-of-1000 pure-noise strategy is approximately \( \sqrt{2 \ln 1000} \approx 3.7 \). López de Prado's deflated Sharpe ratio (Chapter 8) adjusts for this.

Chapter 3: Penalized Regressions and Sparse Portfolios

3.1 Linear Regression Revisited

Let \( \mathbf{y} \in \mathbb{R}^n \) be a vector of stacked realized returns and \( \mathbf{X} \in \mathbb{R}^{n \times K} \) the matrix of standardized features. Ordinary least squares (OLS) solves

\[ \widehat{\boldsymbol{\beta}}^{\mathrm{OLS}} = \arg\min_{\boldsymbol{\beta} \in \mathbb{R}^K} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}, \]

provided \( \mathbf{X}^\top \mathbf{X} \) is non-singular. In a typical factor-investing dataset with \( K \) of order 100 and cross-sectional sample size \( n \) in the thousands or tens of thousands, \( \mathbf{X}^\top \mathbf{X} \) is non-singular but very ill-conditioned: features are correlated with each other, so small changes in the data induce large changes in \( \widehat{\boldsymbol{\beta}}^{\mathrm{OLS}} \). The variance of OLS estimates explodes, and the resulting forecasts generalize poorly.

3.2 Ridge Regression

Ridge regression, introduced by Hoerl and Kennard (1970), adds an \( \ell_2 \) penalty to the objective:

\[ \widehat{\boldsymbol{\beta}}^{\mathrm{ridge}} = \arg\min_{\boldsymbol{\beta}} \Big\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2 \Big\}, \]

with closed-form solution

\[ \widehat{\boldsymbol{\beta}}^{\mathrm{ridge}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}_K)^{-1} \mathbf{X}^\top \mathbf{y}. \]

The tuning parameter \( \lambda \ge 0 \) controls the trade-off between fit and shrinkage: \( \lambda = 0 \) recovers OLS, and \( \lambda \to \infty \) shrinks all coefficients to zero. Ridge can be derived as the posterior mean under a Gaussian prior \( \boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, (\sigma^2/\lambda) \mathbf{I}_K) \) and has the pleasant property of always being well defined, even when \( K > n \) or when columns of \( \mathbf{X} \) are exactly collinear.

Theorem 3.1 (Bias-variance decomposition of ridge). For fixed \( \mathbf{X} \) and true parameter \( \boldsymbol{\beta}^\star \), writing \( \mathbf{W}_\lambda = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{X} \), \[ \operatorname{MSE}(\widehat{\boldsymbol{\beta}}^{\mathrm{ridge}}) = \underbrace{\|(\mathbf{I} - \mathbf{W}_\lambda)\boldsymbol{\beta}^\star\|_2^2}_{\text{squared bias}} + \underbrace{\sigma^2\operatorname{tr}\!\big[(\mathbf{X}^\top\mathbf{X}+\lambda\mathbf{I})^{-2}\mathbf{X}^\top\mathbf{X}\big]}_{\text{variance}}. \]

There always exists a strictly positive \( \lambda \) at which the total mean-squared error is smaller than that of OLS.

3.3 Lasso Regression

The lasso, proposed by Tibshirani (1996), replaces the \( \ell_2 \) penalty with an \( \ell_1 \) penalty:

\[ \widehat{\boldsymbol{\beta}}^{\mathrm{lasso}} = \arg\min_{\boldsymbol{\beta}} \Big\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1 \Big\}. \]

The \( \ell_1 \) norm is non-differentiable at zero, and its geometry induces sparsity: for sufficiently large \( \lambda \), many coefficients are exactly zero. In factor investing, this is extremely appealing. Out of 100 candidate characteristics, we expect that only a handful genuinely forecast returns; lasso’s built-in variable selection automatically discards the rest, and the resulting “sparse” model is easier to interpret, cheaper to back-test and more robust to regime changes.

There is no closed-form solution, but the problem is convex and efficiently solvable. The most common algorithm is coordinate descent, which cycles through coordinates \( j = 1, \ldots, K \), updating each in closed form by soft-thresholding:

\[ \widehat{\beta}_j \leftarrow \mathcal{S}_{\lambda}\!\left( \frac{1}{n}\sum_{i=1}^n x_{ij}\!\left(y_i - \sum_{\ell \neq j} x_{i\ell}\widehat{\beta}_\ell \right) \right), \]

where the soft-threshold operator is

\[ \mathcal{S}_\lambda(z) = \operatorname{sign}(z) \cdot \max(|z| - \lambda, 0). \]

Coordinate descent is Friedman, Hastie and Tibshirani’s (2010) celebrated algorithm behind the glmnet package, which computes a full regularization path over a grid of \( \lambda \) values in essentially the same time as a single OLS fit.

Example 3.2 (Lasso selects sparse factor models). Applied to the monthly panel of US equities with 94 Green-Hand-Zhang characteristics, a lasso typically retains about 10 to 20 non-zero coefficients at the cross-validation-optimal \( \lambda \). The retained features almost always include short-term reversal, 12-1 momentum, book-to-market and gross profitability, consistent with the factors isolated in the academic literature.

3.4 Elastic Net

The lasso has two known weaknesses: when several features are highly correlated, it picks one arbitrarily and zeros the rest, and when \( K > n \) it selects at most \( n \) variables. The elastic net of Zou and Hastie (2005) blends ridge and lasso:

\[ \widehat{\boldsymbol{\beta}}^{\mathrm{enet}} = \arg\min_{\boldsymbol{\beta}} \Big\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda_1 \|\boldsymbol{\beta}\|_1 + \lambda_2 \|\boldsymbol{\beta}\|_2^2 \Big\}. \]

Reparameterized as \( \lambda_1 = \lambda \alpha \) and \( \lambda_2 = \lambda (1 - \alpha)/2 \), the elastic net interpolates between lasso (\( \alpha = 1 \)) and ridge (\( \alpha = 0 \)). The ridge component encourages correlated features to enter the model together, reducing the arbitrariness of variable selection; the lasso component preserves sparsity.

3.5 Penalized Minimum-Variance Portfolios

Markowitz’s mean-variance problem for a long-only portfolio with expected return vector \( \boldsymbol{\mu} \) and covariance \( \boldsymbol{\Sigma} \) is

\[ \min_{\mathbf{w}} \mathbf{w}^\top \boldsymbol{\Sigma} \mathbf{w} \quad \text{s.t.} \quad \mathbf{1}^\top \mathbf{w} = 1. \]

The global minimum-variance (GMV) solution is \( \mathbf{w}^{\mathrm{GMV}} = \boldsymbol{\Sigma}^{-1} \mathbf{1} / (\mathbf{1}^\top \boldsymbol{\Sigma}^{-1} \mathbf{1}) \). When \( N \) is large relative to the sample size used to estimate \( \boldsymbol{\Sigma} \), the sample covariance \( \mathbf{S} \) is ill-conditioned and its inverse is unstable. The resulting portfolio often contains enormous long and short positions that cancel each other; worse, it is highly sensitive to estimation error.

Two standard remedies are closely related to penalized regression:

Linear shrinkage (Ledoit and Wolf, 2004): replace \( \mathbf{S} \) with \( (1 - \alpha) \mathbf{S} + \alpha \mathbf{F} \), where \( \mathbf{F} \) is a structured target such as \( \sigma^2 \mathbf{I} \) (scalar target) or a constant-correlation matrix. The shrinkage intensity \( \alpha \in [0,1] \) is chosen to minimize Frobenius-norm risk; Ledoit and Wolf derive closed-form optimal choices.
\(\ell_1\) penalized portfolios: solve

\[ \min_{\mathbf{w}} \mathbf{w}^\top \mathbf{S} \mathbf{w} + \lambda \|\mathbf{w}\|_1 \quad \text{s.t.} \quad \mathbf{1}^\top \mathbf{w} = 1, \]

which induces sparsity in \( \mathbf{w} \). Sparse portfolios are easier to trade, have lower turnover and mechanically avoid the “short everything” pathology.

Remark 3.3 (Penalized regression as portfolio construction). DeMiguel et al. (2009) and Brodie et al. (2009) show that the Markowitz problem with a target return constraint can be rewritten as an unconstrained regression, so that \( \ell_1 \)-penalization on portfolio weights is equivalent to a lasso regression in a cleverly rotated coordinate system. The equivalence explains why penalized regression and robust portfolio construction are two sides of the same coin.

3.6 Regularization Path and Tuning

Tuning \( \lambda \) is not optional: it controls the model. For ridge and lasso, the efficient strategy is to compute the entire regularization path, tracking \( \widehat{\boldsymbol{\beta}}(\lambda) \) as \( \lambda \) varies from very large (the null model) to zero. Cross-validation then picks the \( \lambda \) that minimizes out-of-sample MSE (see Chapter 7 for the time-series variant of CV). The “one-standard-error rule” of Breiman et al. (1984) is a useful heuristic: select the largest \( \lambda \) whose CV error is within one standard error of the minimum, producing a sparser and more robust model at essentially no cost in accuracy.

Chapter 4: Data Preprocessing for Factor Investing

4.1 Why Preprocessing Dominates Model Choice

Practitioners who have implemented many factor models concur on an uncomfortable truth: the choice of algorithm matters less than the discipline of data preparation. A well-preprocessed dataset fed to a linear regression will usually outperform a messy dataset fed to an elaborate deep network. This chapter walks through the main preprocessing tools: cross-sectional normalization, neutralization, winsorization, imputation, and labelling.

4.2 Cross-sectional Ranking and Normalization

Raw features are rarely comparable across time. Book-to-market ratios have drifted structurally higher during equity bubbles and lower afterwards; market capitalization is denominated in nominal dollars whose value changes with inflation. Two cross-sectional transformations repair this:

Z-score: at each date \( t \), replace each feature \( x_{i,t}^k \) by

\[ z_{i,t}^k = \frac{x_{i,t}^k - \mu_{t}^k}{\sigma_t^k}, \]

with \( \mu_t^k \) and \( \sigma_t^k \) the cross-sectional mean and standard deviation of feature \( k \) at date \( t \).

Uniform rank: at each date \( t \), replace \( x_{i,t}^k \) by the rank of firm \( i \) in the cross-section, divided by the number of firms, yielding a value in \( (0, 1) \). Optionally shift to \( (-0.5, 0.5) \) or apply an inverse normal CDF transformation.

Both transformations neutralize level drift but retain cross-sectional ordering. Rank-based features are robust to outliers and to heavy-tailed distributions, at the cost of discarding information about the magnitude of differences.

4.3 Neutralization by Sector, Country and Style

Two firms in very different industries should not be compared on the same value score without adjustment, because value means different things to a bank and to a technology firm. Sector neutralization subtracts from each firm’s feature the cross-sectional mean of its sector, so that the engineered feature measures each firm’s deviation from its sector peers:

\[ \tilde{x}_{i,t}^k = x_{i,t}^k - \bar{x}_{s(i), t}^k, \]

where \( s(i) \) is the sector of firm \( i \) and \( \bar{x}_{s,t}^k \) is the within-sector mean. The same idea applies to country neutralization, style neutralization (removing exposure to size or value), and even to “beta neutralization” (regressing the feature on market beta and taking residuals).

Example 4.1 (Regression-based neutralization). To neutralize a raw momentum signal against size, run a cross-sectional regression at each date: \[ x_{i,t}^{\mathrm{mom}} = a_t + b_t \log(\mathrm{Mcap}_{i,t}) + \varepsilon_{i,t}, \]

and use the residual \( \widehat{\varepsilon}_{i,t} \) as the new momentum feature. By construction it is orthogonal to size within the cross-section.

4.4 Winsorization and Outlier Handling

Financial features have heavy tails. A single observation of a zero-book-equity firm with a tiny positive price can generate a book-to-market ratio of 1000, which will dominate any regression fit. Winsorization caps each feature at specified cross-sectional quantiles:

\[ x_{i,t}^{k, \mathrm{wins}} = \min\!\left(\max(x_{i,t}^k, q_{0.01, t}^k), q_{0.99, t}^k\right). \]

An alternative is to trim (drop) observations outside the quantile bounds, but winsorization is preferred because it retains all firms, which matters for balanced-panel regressions.

4.5 Missing-value Imputation

Missing data in factor investing is pervasive: firms have different fiscal years, short-interest data is reported only semi-monthly, and some features (analyst forecasts) are available only for covered stocks. Simple imputation strategies are:

Cross-sectional mean or median within the same date-and-sector bucket.
Cross-sectional rank with missing-as-median (effectively assigning rank 0.5 to missing observations).
Explicit missing indicator: create a binary feature that flags missingness, and replace the original value by zero. Tree models can then split on missingness explicitly.

Remark 4.2. One must never impute using information from future dates, even via the simplest devices such as forward-filling. In a panel, the rule is "replacement values at date \( t \) may depend only on information available at date \( t \)".

4.6 Labelling Schemes

The choice of target variable is itself a modelling decision.

Raw forward returns: \( y_{i,t} = r_{i,t+1} \), used in regression tasks.
Decile labels: \( y_{i,t} \in \{1, \ldots, 10\} \), the cross-sectional decile of the forward return. This converts a noisy continuous target into a cleaner ordinal signal.
Binary classification: \( y_{i,t} = \mathbb{1}\{r_{i,t+1} > m_t\} \) for some threshold \( m_t \) (the cross-sectional median, zero, the risk-free rate, or the top-quintile cutoff).
Triple-barrier method (López de Prado, 2018): for each observation, place an upper barrier at \( +u\sigma_t \), a lower barrier at \( -d\sigma_t \) and a vertical (time) barrier at \( t + h \). The label is \( +1 \), \( -1 \) or \( 0 \) depending on which barrier is hit first.
Meta-labelling (López de Prado, 2018): train a first model to produce a side (long/short), then train a second binary model to decide whether to take the trade and at what size. Meta-labelling aims to improve precision at the cost of recall.

4.7 Sample-weighting, Fractional Differentiation and Stationarity

Financial time series are generally not stationary. A standard remedy, taking first differences, destroys most of the memory in the series. Fractional differentiation (López de Prado, 2018, Chapter 5) replaces first differences with a fractional operator \( (1 - L)^d \) for some \( d \in (0, 1) \), making the series stationary while preserving as much long memory as possible:

\[ (1 - L)^d X_t = \sum_{k=0}^\infty \binom{d}{k} (-1)^k X_{t-k}. \]

The expansion is truncated in practice at a tolerance on the coefficients. Features computed from fractionally differenced prices retain predictive content that simple returns cannot capture.

Observations are also weighted non-uniformly. Because overlapping forward-return labels share information (e.g. a 12-month forward return labelled at date \( t \) overlaps the one labelled at \( t+1 \)), López de Prado recommends weighting each observation inversely to its uniqueness — the number of other observations whose label window it intersects. This reduces the effective sample size but prevents the model from being dominated by repeated information.

Chapter 5: Decision Trees and Tree Ensembles

5.1 Regression and Classification Trees

A decision tree partitions the feature space \( \mathbb{R}^K \) into disjoint rectangular regions \( R_1, \ldots, R_M \) and fits a constant within each region. Formally, the prediction is

\[ \widehat{f}(\mathbf{x}) = \sum_{m=1}^M c_m \, \mathbb{1}\{\mathbf{x} \in R_m\}, \]

where \( c_m \) is the mean of the training targets in region \( m \) (for regression) or the majority class (for classification). Finding the optimal tree is NP-hard, so CART (Breiman, Friedman, Olshen and Stone, 1984) uses a greedy top-down algorithm: at each node, consider every feature \( j \) and every threshold \( s \), and pick the split that maximizes the reduction in an impurity criterion.

Impurity criteria for regression are usually mean-squared error (MSE):

\[ \mathrm{MSE}(R) = \frac{1}{|R|} \sum_{i \in R} (y_i - \bar{y}_R)^2. \]

For classification the standard choices are the Gini index and entropy:

\[ \mathrm{Gini}(R) = \sum_{k=1}^C p_{Rk}(1 - p_{Rk}), \qquad \mathrm{Entropy}(R) = -\sum_{k=1}^C p_{Rk} \log_2 p_{Rk}, \]

where \( p_{Rk} \) is the proportion of class \( k \) in region \( R \). A split of parent \( R \) into children \( R_L \) and \( R_R \) is scored by the impurity reduction

\[ \Delta \mathrm{I} = \mathrm{I}(R) - \frac{|R_L|}{|R|} \mathrm{I}(R_L) - \frac{|R_R|}{|R|} \mathrm{I}(R_R). \]

Trees are recursively split until some stopping criterion (minimum leaf size, maximum depth, minimum impurity reduction). Deep trees fit the training data perfectly but overfit; hence pruning, either by minimum cost-complexity pruning (adding a penalty \( \alpha |T| \) on the number of leaves) or by early stopping.

5.2 Why Trees Matter for Factor Investing

A single decision tree is a flexible, interpretable non-linear model. It automatically handles variable interactions (by splitting on different features at different depths), is invariant to monotone transformations of inputs, and handles missing data gracefully. For factor investing, trees capture effects such as “momentum is profitable only among small stocks with low analyst coverage” or “value works in all sectors except technology during tech bubbles”. These conditional patterns are difficult to specify by hand in a linear model.

The down side of a single tree is high variance: small changes in training data produce very different splits. Tree ensembles fix this.

5.3 Bagging and Random Forests

Bagging (Breiman, 1996) generates \( B \) bootstrap samples of the training data, fits a tree on each, and averages the predictions:

\[ \widehat{f}^{\mathrm{bag}}(\mathbf{x}) = \frac{1}{B} \sum_{b=1}^B \widehat{f}^{(b)}(\mathbf{x}). \]

Averaging decorrelated unbiased predictors reduces variance by a factor close to \( B \), and bagged trees typically outperform a single tree substantially. Random forests (Breiman, 2001) go further by adding, at each candidate split, a random subset of \( m < K \) features from which the split can be chosen. This forced feature decorrelation makes the individual trees even less correlated, amplifying the variance-reduction benefit.

A standard rule of thumb is \( m = \lfloor K/3 \rfloor \) for regression and \( m = \lfloor \sqrt{K} \rfloor \) for classification. Random forests are largely self-tuning: beyond choosing \( m \) and \( B \), the model is robust to most hyperparameter choices. They also deliver a free out-of-bag estimate of generalization error: each training point is unused in roughly \( 1/e \approx 37\% \) of the bootstraps, and its prediction can be formed by averaging only those trees that did not see it.

Definition 5.1 (Variable importance). For a random forest, the mean decrease in impurity importance of feature \( j \) is the total reduction in impurity produced by splits on \( j \), averaged over all trees: \[ \mathrm{VI}(j) = \frac{1}{B} \sum_{b=1}^B \sum_{\text{nodes }\nu \text{ split on }j} \Delta \mathrm{I}_\nu. \]

An alternative is permutation importance: the drop in out-of-bag accuracy when feature \( j \) is randomly shuffled.

5.4 Gradient Boosting Machines

Boosting constructs an additive model iteratively by fitting new trees to the residuals of the current ensemble. Given a differentiable loss \( \ell(y, f) \), gradient boosting (Friedman, 2001) at iteration \( m \) performs:

Compute the negative gradient \( g_{i,m} = -\partial \ell(y_i, f_{m-1}(\mathbf{x}_i)) / \partial f_{m-1}(\mathbf{x}_i) \) for each training point.
Fit a regression tree \( h_m(\mathbf{x}) \) to the pseudo-residuals \( \{g_{i,m}\}_{i=1}^n \).
Update the ensemble: \( f_m(\mathbf{x}) = f_{m-1}(\mathbf{x}) + \eta \, h_m(\mathbf{x}) \), where \( \eta \in (0,1] \) is a learning rate.

The learning rate, number of trees and tree depth are the key hyperparameters. Small \( \eta \) (e.g. 0.01-0.1) combined with many trees usually produces the best generalization, at the cost of computation.

Modern gradient boosting implementations include XGBoost (Chen and Guestrin, 2016), LightGBM (Ke et al., 2017) and CatBoost (Prokhorenkova et al., 2018). XGBoost’s innovation is a second-order Taylor expansion of the loss combined with an explicit regularization of tree complexity. With \( g_i \) and \( h_i \) the first and second derivatives of the loss, the optimal leaf value for leaf \( j \) containing points \( I_j \) is

\[ w_j^\star = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}, \]

and the corresponding objective contribution is

\[ -\frac{1}{2} \frac{\Big(\sum_{i \in I_j} g_i\Big)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma. \]

The parameter \( \lambda \) plays a ridge-like role on leaf weights and \( \gamma \) is a cost per leaf that discourages overgrowth.

Example 5.2 (Boosted trees on US equities). Gu, Kelly and Xiu (2020) report that gradient-boosted regression trees on 94 firm characteristics produce a monthly out-of-sample \( R^2 \) of about 0.34% in US equities, generating a long-short quintile strategy with annualized Sharpe near 2.4 on paper. Neural networks edge them out by a small margin, but boosted trees are the most competitive interpretable alternative.

5.5 Interpreting Tree Ensembles

Tree ensembles are not black boxes. Two interpretability tools dominate:

Partial dependence plots (PDP): fix feature \( j \) at a grid of values, average predictions over the empirical distribution of the remaining features, and plot the result. A PDP reveals the marginal effect of feature \( j \) on predictions.
SHAP values (Lundberg and Lee, 2017): for each prediction, assign to each feature a contribution \( \phi_j \) satisfying consistency and local accuracy axioms. TreeSHAP computes these in polynomial time for tree models.

SHAP summaries are widely used in factor investing research to explain which features drive a particular forecast and to audit the ensemble for spurious dependencies.

Chapter 6: Deep Learning for Asset Returns

6.1 Multilayer Perceptrons

A multilayer perceptron (MLP) or feedforward network composes a sequence of affine transformations and non-linear activations. For input \( \mathbf{x} \in \mathbb{R}^K \), hidden layer widths \( h_1, \ldots, h_L \), and output dimension \( o \), the forward pass is

\[ \mathbf{a}^{(0)} = \mathbf{x}, \qquad \mathbf{z}^{(\ell)} = \mathbf{W}^{(\ell)} \mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}, \qquad \mathbf{a}^{(\ell)} = \phi(\mathbf{z}^{(\ell)}), \]

for \( \ell = 1, \ldots, L \), with a final linear or softmax layer producing the output \( \widehat{\mathbf{y}} = \mathbf{W}^{(L+1)} \mathbf{a}^{(L)} + \mathbf{b}^{(L+1)} \). The scalar activation \( \phi \) is applied element-wise; standard choices are the rectified linear unit (ReLU) \( \phi(z) = \max(z, 0) \), the sigmoid \( \phi(z) = 1/(1 + e^{-z}) \), and the hyperbolic tangent \( \phi(z) = \tanh(z) \). ReLU is the default in modern networks because it is piecewise linear, cheap to compute, and avoids the saturating gradients that plague sigmoids for deep networks.

Theorem 6.1 (Universal approximation, Cybenko 1989 / Hornik 1991). A feedforward network with one hidden layer and a sufficiently rich non-polynomial activation can approximate any continuous function on a compact subset of \( \mathbb{R}^K \) to arbitrary accuracy, provided the hidden layer is allowed to be wide enough.

The theorem guarantees expressiveness in principle but says nothing about the sample size or optimization difficulty required to find a good approximation.

6.2 Backpropagation

Training an MLP means minimizing an empirical loss \( L(\theta) = (1/n)\sum_{i=1}^n \ell(y_i, \widehat{y}_i(\theta)) \) over network parameters \( \theta = \{\mathbf{W}^{(\ell)}, \mathbf{b}^{(\ell)}\} \). Gradient-based optimization requires \( \nabla_\theta L \), which the backpropagation algorithm computes in one forward and one backward pass through the network.

Writing \( \delta^{(\ell)} = \partial \ell / \partial \mathbf{z}^{(\ell)} \), the recursion is

\[ \delta^{(L+1)} = \nabla_{\widehat{y}} \ell, \qquad \delta^{(\ell)} = (\mathbf{W}^{(\ell+1)})^\top \delta^{(\ell+1)} \odot \phi'(\mathbf{z}^{(\ell)}), \]

and the parameter gradients are

\[ \frac{\partial \ell}{\partial \mathbf{W}^{(\ell)}} = \delta^{(\ell)} (\mathbf{a}^{(\ell-1)})^\top, \qquad \frac{\partial \ell}{\partial \mathbf{b}^{(\ell)}} = \delta^{(\ell)}. \]

Here \( \odot \) is element-wise multiplication. Modern deep-learning frameworks (PyTorch, TensorFlow, JAX) compute these gradients automatically via reverse-mode automatic differentiation, freeing the user from manual derivation.

6.3 Optimization: SGD, Momentum and Adam

Because \( n \) is large, networks are trained with mini-batch stochastic gradient descent (SGD), which at each step samples a batch \( \mathcal{B} \) of size \( b \ll n \) and updates

\[ \theta \leftarrow \theta - \eta \cdot \frac{1}{b} \sum_{i \in \mathcal{B}} \nabla_\theta \ell(y_i, \widehat{y}_i(\theta)). \]

Adding momentum smooths the trajectory by maintaining a velocity \( v_t = \mu v_{t-1} + \nabla_\theta L \) and updating \( \theta_t = \theta_{t-1} - \eta v_t \). The Adam optimizer (Kingma and Ba, 2014) combines momentum with per-parameter learning-rate adaptation:

\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \]

with bias-corrected estimates \( \widehat{m}_t = m_t/(1 - \beta_1^t) \), \( \widehat{v}_t = v_t/(1 - \beta_2^t) \) and update

\[ \theta_t = \theta_{t-1} - \eta \cdot \frac{\widehat{m}_t}{\sqrt{\widehat{v}_t} + \epsilon}. \]

Typical defaults are \( \beta_1 = 0.9 \), \( \beta_2 = 0.999 \), \( \epsilon = 10^{-8} \) and \( \eta \in [10^{-4}, 10^{-3}] \).

6.4 Regularization

Deep networks applied to noisy financial data overfit aggressively unless regularization is applied. The main tools are:

Weight decay: add \( (\lambda/2) \|\theta\|_2^2 \) to the loss, equivalent to ridge regression on the weights.
Dropout (Srivastava et al., 2014): during training, zero out each hidden unit with probability \( p \); at test time, scale activations by \( 1 - p \). This prevents co-adaptation of units and acts as an approximate Bayesian model average.
Batch normalization (Ioffe and Szegedy, 2015): normalize hidden activations within each mini-batch to have zero mean and unit variance, then apply learned affine scaling. Improves optimization and provides a mild regularizing effect.
Early stopping: track validation loss every few epochs and halt training when it begins to rise.
Ensembling: average predictions across multiple networks trained with different seeds or architectures.

Coqueret and Guida emphasize that financial time series benefit enormously from strong regularization because the signal-to-noise ratio is small and the training set is effectively much smaller than the number of observations would suggest (due to high intra-panel correlation).

6.5 Recurrent Networks, GRU and LSTM

Cross-sectional predictions at date \( t \) use only information from \( t \); they ignore the temporal structure of each firm’s feature history. Recurrent networks model sequences \( \mathbf{x}_{i,1}, \mathbf{x}_{i,2}, \ldots, \mathbf{x}_{i,t} \) by maintaining a hidden state that summarizes the past. A vanilla RNN evolves as

\[ \mathbf{h}_t = \tanh(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b}), \]

with output \( \widehat{y}_t = \mathbf{W}_o \mathbf{h}_t + b_o \). In practice, vanilla RNNs struggle to learn long-range dependencies because gradients explode or vanish when backpropagated through many time steps.

The long short-term memory (LSTM) unit of Hochreiter and Schmidhuber (1997) introduces a memory cell \( \mathbf{c}_t \) and three gates:

\[ \begin{aligned} \mathbf{f}_t &= \sigma(\mathbf{W}_f [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f), \\ \mathbf{i}_t &= \sigma(\mathbf{W}_i [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i), \\ \mathbf{o}_t &= \sigma(\mathbf{W}_o [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o), \\ \tilde{\mathbf{c}}_t &= \tanh(\mathbf{W}_c [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c), \\ \mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t, \\ \mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t). \end{aligned} \]

The forget gate \( \mathbf{f}_t \) decides what to erase from the memory cell, the input gate \( \mathbf{i}_t \) decides what to write, and the output gate \( \mathbf{o}_t \) decides what to expose as the hidden state. These gates allow gradients to flow unimpeded through time, enabling the LSTM to learn dependencies on scales of hundreds of steps.

The gated recurrent unit (GRU) of Cho et al. (2014) is a simplified variant with two gates instead of three, which is often competitive with the LSTM while being faster to train. For monthly panels with \( T \sim 60 \) observations per firm, GRUs typically generalize better than LSTMs because they have fewer parameters.

Example 6.2 (LSTM for volatility forecasting). Model the sequence of realized volatilities \( \{\sigma_{i,t}\} \) for an index constituent \( i \) using an LSTM with features \( \mathbf{x}_{i,t} = (\log \sigma_{i,t-1}, r_{i,t-1}, r_{i,t-1}^2, \mathrm{VIX}_{t-1}) \). Trained with MSE loss on log-volatility, the network typically beats GARCH(1,1) by a modest margin on S&P 500 out-of-sample forecasts, and combines well with a HAR (heterogeneous autoregressive) baseline.

6.6 Autoencoder Factor Models

Kelly, Pruitt and Su (2019) and Gu, Kelly and Xiu (2021) propose autoencoder asset pricing models that jointly learn latent factors and their loadings from data. In the simplest form, \( r_{i,t} = \boldsymbol{\beta}_i^\top \mathbf{f}_t + \varepsilon_{i,t} \) with \( \boldsymbol{\beta}_i = g_\theta(\mathbf{x}_{i,t}) \) a neural network mapping characteristics to betas, and \( \mathbf{f}_t \) a small vector of latent factors extracted from returns via PCA-like reconstruction. The model can be trained by minimizing reconstruction error, yielding a flexible nonlinear generalization of Fama-French with a handful of unobserved factors that deliver impressive pricing performance.

Chapter 7: Model Validation and Hyperparameter Tuning

7.1 Training vs Testing

The distinction between training and testing is the atomic discipline of machine learning. A model that fits training data perfectly but generalizes poorly is memorizing, not learning. Validation techniques estimate the out-of-sample loss \( \mathbb{E}_{(x,y) \sim \mathcal{P}} \ell(y, \widehat{f}(x)) \) using held-out data and serve two purposes: tuning hyperparameters and, separately, obtaining an unbiased estimate of generalization error on a final untouched test set.

7.2 K-fold Cross-validation and Its Failure in Time Series

Standard k-fold cross-validation partitions the dataset into \( k \) disjoint folds, trains the model on \( k-1 \) folds and validates on the remaining fold, cycling through all \( k \) possibilities and averaging the validation losses. For i.i.d. data, k-fold CV is an almost-unbiased estimator of generalization error.

For financial panels it is not. Two issues arise:

Temporal leakage: a fold that contains observations before and after the held-out fold allows the model to learn from the future.
Label overlap: when forward returns are computed over a horizon \( h \) longer than the rebalancing period, adjacent labels share information, and randomly splitting them puts closely related observations in different folds.

The naive fix is a chronological hold-out: train on the first 70% of dates, validate on the next 15%, test on the last 15%. This works but wastes data and gives a single point estimate with high variance.

7.3 Walk-forward and Rolling Origin Cross-validation

Walk-forward validation (equivalently, rolling origin CV or time-series CV) trains on dates \( 1, \ldots, t \), validates on \( t+1, \ldots, t+w \), advances \( t \) by \( w \) and repeats. Every validation point is strictly after every training point used to predict it. The total validation loss is the average of the per-window losses.

Two variants are common:

Expanding origin: the training window grows.
Sliding origin: the training window has fixed length \( L \) and slides forward.

Walk-forward CV is the natural analogue of k-fold CV for time-series data and is what most quantitative investment research uses.

7.4 Purged and Combinatorially Purged Cross-validation

López de Prado (2018) refines walk-forward validation with two additions designed for label overlap:

Purging: if a training observation’s label window \( [t, t+h] \) overlaps any validation date, that training observation is removed.
Embargo: after each validation window, a further small buffer of \( e \) observations is removed from the subsequent training set to prevent information leakage through autocorrelated features.

Combined with a purging rule, combinatorial purged cross-validation (CPCV) generates all \( \binom{N}{k} \) choices of \( k \) validation folds out of \( N \) total folds, producing many near-independent back-test paths. CPCV answers the question “how would my strategy have performed across a family of plausible histories?” and produces a distribution of Sharpe ratios rather than a single number.

Definition 7.1 (Embargo). Given training-validation partitions with validation dates \( \mathcal{V} \) and embargo length \( e \), the training set excludes any observation whose date lies in \( \mathcal{V} \cup \{t + 1, \ldots, t + e : t \in \mathcal{V}\} \).

7.5 Performance Metrics

Different metrics answer different questions, and a practitioner’s choice of metric should match the downstream use of the model.

Mean squared error \( \mathrm{MSE} = (1/n)\sum_i (y_i - \widehat{y}_i)^2 \): standard regression loss, emphasizes large errors.
Mean absolute error \( \mathrm{MAE} = (1/n)\sum_i |y_i - \widehat{y}_i| \): robust to outliers.
Out-of-sample \( R^2 \): \( R^2_{\mathrm{OOS}} = 1 - \sum_i (y_i - \widehat{y}_i)^2 / \sum_i y_i^2 \), omitting the sample mean subtraction for a harder benchmark.
Hit rate or directional accuracy: fraction of predictions whose sign matches the realized return. In noisy equity return problems, 52% hit rates can translate into strong Sharpes.
Information coefficient (IC): the Spearman rank correlation between \( \widehat{y}_{i,t} \) and \( y_{i,t+1} \) across the cross-section at each date. Useful because it focuses on ordering, which is what a long-short portfolio exploits.
Sharpe ratio: \( \mathrm{SR} = \mathbb{E}[R^\pi]/\mathrm{Std}[R^\pi] \), annualized by multiplying by \( \sqrt{12} \) for monthly data.
Information ratio: Sharpe of portfolio returns in excess of a benchmark, i.e. \( \mathbb{E}[R^\pi - R^b]/\mathrm{Std}[R^\pi - R^b] \).
Maximum drawdown: \( \mathrm{MDD} = \max_{t} \left(\max_{s \le t} V_s - V_t\right)/\max_{s \le t} V_s \), where \( V_t \) is the cumulative portfolio value. Captures worst peak-to-trough loss.
Calmar ratio: annualized return divided by \( \mathrm{MDD} \).

7.6 Hyperparameter Tuning

Hyperparameters are settings that are not learned from data but chosen by the practitioner: \( \lambda \) in penalized regression, tree depth in boosted trees, learning rate and dropout rate in neural networks. Tuning is itself a statistical learning problem, and three common strategies are:

Grid search: specify a finite grid of candidate values and evaluate all combinations by cross-validation. Simple and exhaustive but exponential in the number of hyperparameters.
Random search (Bergstra and Bengio, 2012): sample configurations uniformly at random from the grid. Surprisingly effective, especially when only a few hyperparameters matter.
Bayesian optimization: model the validation loss as a function of hyperparameters with a Gaussian process (or a tree-structured Parzen estimator, as in Hyperopt/Optuna) and use an acquisition function (expected improvement, upper confidence bound) to propose the next configuration to try. Sample-efficient, widely used for deep-learning hyperparameter tuning.

Remark 7.2. Hyperparameter tuning is a form of model selection and therefore must be accounted for when reporting generalization error. The rigorous protocol uses nested cross-validation: an outer loop evaluates model performance, and an inner loop selects hyperparameters. For time series, both loops respect the walk-forward structure.

7.7 The Bias-Variance Tradeoff in Financial ML

Generalization error decomposes as

\[ \mathbb{E}[(y - \widehat{f}(x))^2] = \underbrace{(\mathbb{E}\widehat{f}(x) - f^\star(x))^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}(\widehat{f}(x))}_{\text{variance}} + \underbrace{\sigma_\varepsilon^2}_{\text{irreducible}}. \]

Linear models have low variance but high bias; deep networks have low bias but high variance. In financial data, the irreducible noise \( \sigma_\varepsilon^2 \) dominates, which means that variance reduction (via regularization, ensembling, shrinkage) is usually more valuable than bias reduction. A 2% reduction in variance translates directly into a tighter Sharpe ratio, while a 2% reduction in bias on top of a dominant noise floor is essentially invisible.

Chapter 8: Other Topics in Financial Machine Learning

8.1 Ensemble Learning: Stacking, Blending and Model Averaging

Even after choosing a best single model, better results can often be obtained by combining models. Three ensemble strategies are widely used:

Simple averaging / blending: average the out-of-sample forecasts of several models, possibly with weights proportional to their individual validation Sharpe ratios.
Stacking (Wolpert, 1992): train a meta-learner whose inputs are the out-of-fold predictions of base learners. Regularized linear models (ridge, lasso) are a natural choice for the meta-learner to avoid overfitting to the small meta-training set.
Bayesian model averaging: weight models by their posterior probabilities, requiring explicit priors and marginal likelihoods.

Ensembles work because different model classes make different mistakes; averaging uncorrelated errors reduces variance without increasing bias. In factor investing, the typical blend combines a penalized linear model (captures monotone signals), a gradient-boosted tree (captures interactions) and a neural network (captures smooth non-linearities), averaging their forecasts at each rebalance.

8.2 Interpretability: Variable Importance, SHAP, LIME, PDP

Interpretability matters in finance for three reasons: regulatory audit, trust from portfolio managers, and debugging. The standard tools are:

Permutation importance: randomly shuffle feature \( j \) in the validation set and record the drop in performance. Unlike tree-based impurity importance, this is model-agnostic.
Partial dependence plots: plot \( \mathbb{E}_{\mathbf{x}_{-j}}[\widehat{f}(\mathbf{x}_{-j}, x_j)] \) against \( x_j \) to visualize the marginal effect of feature \( j \).
Individual conditional expectation (ICE) plots: like PDP but for individual observations, revealing heterogeneity that partial dependence averaging hides.
LIME (Ribeiro, Singh and Guestrin, 2016): fit a sparse local linear approximation around a specific prediction by sampling perturbed inputs.
SHAP values (Lundberg and Lee, 2017): assign to each feature its Shapley value from cooperative game theory, producing an additive decomposition \( \widehat{f}(\mathbf{x}) = \phi_0 + \sum_j \phi_j \) with attractive theoretical properties (efficiency, symmetry, null player, additivity).

8.3 The Deflated Sharpe Ratio

A back-tested Sharpe ratio is an estimator of the true Sharpe ratio. When \( N \) strategies are tried and the best is reported, the distribution of the maximum is biased upwards. López de Prado (2014) formalizes the correction. Given the first four moments of realized returns and \( N \) trials, the expected maximum Sharpe ratio under the null of zero true Sharpe is approximately

\[ \mathbb{E}\left[\max_{i \le N} \widehat{\mathrm{SR}}_i\right] \approx (1 - \gamma) \, \Phi^{-1}\!\left(1 - \frac{1}{N}\right) + \gamma \, \Phi^{-1}\!\left(1 - \frac{1}{N e}\right), \]

with \( \gamma \approx 0.5772 \) the Euler-Mascheroni constant and \( \Phi \) the standard normal CDF. Writing \( \mathrm{SR}^\star_0 \) for this expected maximum, the deflated Sharpe ratio is the probability that the observed Sharpe exceeds \( \mathrm{SR}^\star_0 \) given the estimated variance of the Sharpe estimator:

\[ \mathrm{DSR} = \Phi\!\left( \frac{(\widehat{\mathrm{SR}} - \mathrm{SR}^\star_0)\sqrt{T - 1}}{\sqrt{1 - \widehat{\gamma}_3 \widehat{\mathrm{SR}} + \tfrac{\widehat{\gamma}_4 - 1}{4} \widehat{\mathrm{SR}}^2}} \right), \]

where \( \widehat{\gamma}_3 \) and \( \widehat{\gamma}_4 \) are the sample skewness and kurtosis of the back-tested returns. A deflated Sharpe above 0.95 is a strong signal of a genuine effect; below 0.5 the reported Sharpe is consistent with noise plus selection.

Remark 8.1. The DSR requires an honest count of \( N \), the number of strategies tried. Hyperparameter sweeps, alternative preprocessing choices and informal model iterations all count; practitioners frequently underestimate this number by an order of magnitude.

8.4 Support Vector Machines

Support vector machines (SVMs) (Cortes and Vapnik, 1995) solve a convex optimization problem that finds the maximum-margin hyperplane separating two classes. For linearly separable data,

\[ \min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|_2^2 \quad \text{s.t.} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 \quad \forall i. \]

The soft-margin version allows violations with slack variables and penalty \( C \):

\[ \min_{\mathbf{w}, b, \xi} \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i, \ \xi_i \ge 0. \]

The kernel trick replaces inner products \( \mathbf{x}_i^\top \mathbf{x}_j \) in the dual problem with a kernel \( k(\mathbf{x}_i, \mathbf{x}_j) \), implicitly mapping inputs into a high-dimensional (possibly infinite-dimensional) feature space. The Gaussian (RBF) kernel \( k(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2) \) is the default. SVMs were dominant in the 1990s and 2000s and retain niches in small-sample factor problems where they are competitive with tree ensembles.

8.5 Causality vs Prediction

Machine learning in finance lives almost entirely in the realm of prediction, not causation. A model might discover that firms with high analyst dispersion underperform without saying anything about whether dispersion causes the underperformance. Causal inference methods, increasingly influential in economics, attempt to move beyond correlation by using natural experiments, instrumental variables, regression discontinuities and double-machine-learning methods (Chernozhukov et al., 2018). For an investor who only wants to rank stocks, predictive correlation is sufficient; for a researcher who wants to understand why an anomaly exists, causality is indispensable. Conflating the two is a common source of false alpha: a predictive signal can decay quickly once its causal mechanism is understood and arbitraged.

8.6 Non-stationarity and Regime Changes

All financial series exhibit non-stationarity. Model parameters that were estimated on a period of high interest rates may be obsolete in a period of low interest rates. Three responses are available:

Adaptive re-estimation: retrain the model at every rebalance using only recent data (rolling window), accepting that recent history is more informative than distant history.
Regime-switching models: fit separate parameters in distinct regimes (bull/bear, high/low volatility) identified either by a Hidden Markov Model or by exogenous state variables.
Change-point detection: monitor the model’s performance in real time and trigger a retraining or a shutdown if a structural break is detected.

The Diebold-Mariano test (1995) is useful for comparing the predictive accuracy of two competing models out of sample. Given loss differentials \( d_t = \ell(y_t, \widehat{y}_t^{(1)}) - \ell(y_t, \widehat{y}_t^{(2)}) \), the statistic

\[ \mathrm{DM} = \frac{\bar{d}}{\sqrt{\widehat{\operatorname{Var}}(\bar{d})}} \]

is asymptotically standard normal under the null of equal predictive accuracy, with \( \widehat{\operatorname{Var}}(\bar{d}) \) estimated via HAC (Newey-West) to account for serial correlation.

8.7 Unsupervised Learning: Clustering and PCA

Unsupervised methods serve as auxiliary tools in factor investing.

Principal Component Analysis finds orthogonal directions that maximize variance. Applied to a covariance matrix of stock returns, the first few principal components recover latent risk factors that usually correspond to market, size and value-like patterns. A spectral cutoff on the eigenvalues of the sample covariance matrix, following Marchenko-Pastur theory, is a principled way to de-noise \( \boldsymbol{\Sigma} \) before plugging it into a mean-variance optimizer (Laloux et al., 1999; Bun, Bouchaud and Potters, 2017).

Clustering algorithms partition assets into groups with similar characteristics. K-means minimizes within-cluster sum of squares; hierarchical clustering builds a dendrogram from pairwise distances; HDBSCAN identifies density-based clusters without requiring \( k \) in advance. López de Prado’s hierarchical risk parity (HRP, 2016) uses a hierarchical clustering tree over correlations to recursively allocate risk, producing portfolios that are more stable than Markowitz and more diversified than equal weights.

8.8 Reinforcement Learning for Portfolio Choice

Portfolio optimization is naturally sequential: today’s trades affect tomorrow’s state (positions, capital), and the investor wants to maximize long-run utility. Reinforcement learning (RL) formalizes this with a Markov decision process \( (\mathcal{S}, \mathcal{A}, P, r, \gamma) \), where \( s_t \) is a state summarizing the portfolio and market, \( a_t \) is a trading action, \( P(s_{t+1} \mid s_t, a_t) \) is the transition kernel, \( r(s_t, a_t) \) is a reward (typically portfolio return minus a transaction cost, or utility thereof), and \( \gamma \in (0,1) \) is a discount factor.

The goal is to learn a policy \( \pi^\star(a \mid s) \) maximizing \( \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t r_t\right] \). Methods used in finance include Q-learning, deep Q-networks (DQN), policy gradient methods (REINFORCE, PPO) and actor-critic methods (A2C, SAC). The Merton (1969) continuous-time optimal consumption-investment problem is the classical analytical benchmark; RL methods offer data-driven generalizations that can incorporate transaction costs, frictions, and rich state variables at the price of giving up closed-form solutions.

Remark 8.2. RL in real markets is hard because the "environment" does not let you replay the past — you cannot cheaply generate billions of training trajectories as you can in Atari or Go. Practical deployments typically train on market simulators and rely heavily on regularization and imitation learning from an expert policy such as mean-variance optimization.

8.9 A Concluding Perspective

Machine learning has redrawn the boundaries of quantitative investing, but it has not repealed the two iron laws that govern the field. First, the no-free-lunch principle: any edge discovered by an algorithm can be arbitraged away as more practitioners find it, so yesterday’s anomaly is rarely tomorrow’s profit. Second, the signal-to-noise ratio of financial returns is and always will be small, so the value of an ML method is measured not by its flexibility but by its ability to extract faint signal from overwhelming noise. Ridge, lasso, random forests, gradient boosting and deep networks all have their place in a disciplined factor-investing workflow; what ties them together is not a particular algorithm but an insistence on rigorous validation, honest back-testing and humble interpretation of the results. The practitioner who remembers that the goal is generalization, not fit, and who treats every impressive Sharpe ratio with the scepticism it deserves, is far more likely to build strategies that earn real returns in real markets than the one who chases the latest architecture.