CS 485: Machine Learning
Shai Ben-David
Based on lecture notes by Sibelius Peng
Sources and References
Primary textbook
- Shalev-Shwartz, S. & Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014 (freely available at cs.huji.ac.il/~shais/UnderstandingMachineLearning).
Supplementary texts
- Mohri, M., Rostamizadeh, A. & Talwalkar, A. Foundations of Machine Learning, 2nd ed. MIT Press, 2018.
- Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.
- Kearns, M. & Vazirani, U. An Introduction to Computational Learning Theory. MIT Press, 1994.
Online resources
- Ben-David’s CS485 lecture videos on newworldai.com
- MIT 9.520/6.860 Statistical Learning Theory (Poggio, Rosasco)
- Tong Zhang’s ML theory lectures
- Bartlett & Mendelson’s lecture notes on Rademacher complexity
Chapter 1: What Is Learning?
From Experience to Expertise
Machine learning is the study of algorithms that improve through experience. Unlike traditional programming — where a human encodes explicit rules — a learning algorithm derives rules automatically from data. The defining challenge, and what makes ML non-trivial, is generalization: the learned rule must perform well on new, unseen examples, not merely on the training data. Memorizing training examples is trivial; learning is the extraction of general patterns.
The connection between prior knowledge and sample efficiency is fundamental. A rat can learn from a single poisoning experience (bait shyness) because evolutionary prior knowledge directs its attention to food characteristics. Without such priors, the rat would need far more experience to determine what caused the sickness. Inductive bias — the assumptions built into a learning algorithm about which patterns are plausible — is the machine learning analogue of this prior knowledge. All successful learning systems encode substantial inductive bias; the No-Free-Lunch theorem (Chapter 5) will make this formal.
The Learning Taxonomy
The primary distinction in machine learning is between supervised and unsupervised learning. In supervised learning, training examples come with labels: \((x_1, y_1), \ldots, (x_m, y_m)\) where \(x_i\) are inputs and \(y_i\) are desired outputs. The task is to find a predictor \(h \colon X \to Y\) that generalizes well. In classification, \(Y = \{0, 1\}\) or a finite set; in regression, \(Y = \mathbb{R}\). In unsupervised learning, there are no labels; the task is to discover structure (clusters, representations, generative models) from unlabeled data.
Additional axes of taxonomy include batch versus online learning (all data at once versus sequential arrival), passive versus active learning (data is given versus the learner queries for labels), cooperative versus adversarial settings (the data source is benign or adversarial), and the presence or absence of computational constraints.
Machine learning is distinguished from classical statistics by three emphases: (1) it is algorithmic — explicit, efficient algorithms are required, not just existence proofs; (2) it is distribution-free — no parametric assumption about the data distribution is made; (3) it is finite-sample — rigorous guarantees must hold for small \(m\), not asymptotically.
Chapter 2: The Formal PAC Learning Framework
The Learning Problem Formalized
The formal setup for supervised learning is:
- Domain set \(X\) (the instance space, e.g., \(\mathbb{R}^d\))
- Label set \(Y\) (e.g., \(\{0, 1\}\) for binary classification)
- Target distribution \(\mathcal{D}\) over \(X\) (the distribution from which instances are drawn)
- Labeling rule \(f \colon X \to Y\) (the unknown true function we want to approximate)
- Training sample \(S = ((x_1, y_1), \ldots, (x_m, y_m))\) where \(x_i \sim \mathcal{D}\) i.i.d. and \(y_i = f(x_i)\)
The true risk (or generalization error) of a hypothesis \(h \colon X \to Y\) is:
\[ L_{\mathcal{D},f}(h) = \mathcal{D}[\{x : h(x) \neq f(x)\}] = \Pr_{x \sim \mathcal{D}}[h(x) \neq f(x)]. \]The empirical risk on sample \(S\) is \(L_S(h) = \frac{1}{m}\sum_{i=1}^m \mathbf{1}[h(x_i) \neq y_i]\). The fundamental problem is that we can observe \(L_S\) but we want to minimize \(L_{\mathcal{D},f}\).
Empirical Risk Minimization (ERM) with respect to a hypothesis class \(\mathcal{H}\) returns \(h_S = \arg\min_{h \in \mathcal{H}} L_S(h)\). Unrestricted ERM (with no hypothesis class) simply memorizes the training data; this overfits catastrophically. ERM over a restricted \(\mathcal{H}\) introduces inductive bias through the choice of \(\mathcal{H}\).
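As a toy illustration, ERM over a small finite class can be written directly. This is a minimal sketch: the class of integer-threshold classifiers and the sample below are invented for illustration, not taken from the course.

```python
# Minimal ERM sketch over a finite class of threshold classifiers
# h_a(x) = 1[x > a]; the class and the sample are illustrative.

def empirical_risk(h, sample):
    """Fraction of labeled examples (x, y) that h misclassifies."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def erm(hypotheses, sample):
    """Return a hypothesis minimizing empirical risk on the sample."""
    return min(hypotheses, key=lambda h: empirical_risk(h, sample))

def make_threshold(a):
    return lambda x: 1 if x > a else 0

# Finite class: thresholds a in {0, 1, ..., 9}
H = [make_threshold(a) for a in range(10)]
S = [(1, 0), (2, 0), (3, 0), (6, 1), (7, 1), (9, 1)]  # separable sample

h_S = erm(H, S)
print(empirical_risk(h_S, S))  # 0.0: some threshold fits this sample perfectly
```

With an unrestricted class the same procedure would memorize any sample; the restriction to ten thresholds is exactly the inductive bias.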
PAC Learning
Probably Approximately Correct (PAC) learning, introduced by Valiant in 1984, is the central framework for learning theory.
A hypothesis class \(\mathcal{H}\) is PAC learnable if there exist a function \(m_{\mathcal{H}} \colon (0,1)^2 \to \mathbb{N}\) and a learning algorithm such that for every \(\epsilon, \delta \in (0,1)\), every distribution \(\mathcal{D}\) over \(X\), and every labeling function \(f \in \mathcal{H}\), running the algorithm on \(m \geq m_{\mathcal{H}}(\epsilon, \delta)\) i.i.d. examples returns a hypothesis \(h\) with \(L_{\mathcal{D},f}(h) \leq \epsilon\) with probability at least \(1 - \delta\).
The parameters \(\epsilon\) (accuracy) and \(\delta\) (confidence) are the “approximately” and “probably” in PAC. The sample complexity \(m_{\mathcal{H}}(\epsilon, \delta)\) is the number of examples needed to learn to accuracy \(\epsilon\) with confidence \(1-\delta\). The definition requires learnability for all distributions \(\mathcal{D}\) and all labelings \(f \in \mathcal{H}\) — a worst-case guarantee.
Agnostic PAC Learning
The realizability assumption (\(f \in \mathcal{H}\)) is restrictive. In the agnostic setting, no assumption is made on \(f\); the best hypothesis in \(\mathcal{H}\) may still err. The definition of agnostic PAC learning requires:
\[ \Pr_{S \sim \mathcal{D}^m}[L_{\mathcal{D}}(h_S) \leq \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon] \geq 1 - \delta, \]where \(h_S\) is the hypothesis returned by the learner on sample \(S\). The learner only needs to be approximately as good as the best hypothesis in \(\mathcal{H}\), without matching the Bayes-optimal predictor. The approximation error \(\min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)\) measures the cost of restricting to \(\mathcal{H}\); the estimation error \(L_{\mathcal{D}}(h_S) - \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)\) measures the statistical cost of learning from a finite sample.
Chapter 3: Uniform Convergence and Generalization Bounds
Finite Classes Are Agnostically PAC Learnable
The key concept for bounding generalization error is uniform convergence: the empirical risk converges to the true risk uniformly over all hypotheses in \(\mathcal{H}\).
The proof uses the union bound over all \(|\mathcal{H}|\) hypotheses, combined with Hoeffding’s inequality (which bounds the probability that the empirical mean deviates from the true mean). The sample complexity is \(O(\log|\mathcal{H}|/\epsilon^2)\) — the log of the hypothesis class size.
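Made concrete, the Hoeffding-plus-union-bound argument gives an explicit sufficient sample size. The helper below (function name is illustrative) computes \(m \geq \ln(2|\mathcal{H}|/\delta)/(2\epsilon^2)\), the standard uniform-convergence bound for a finite class:

```python
import math

def uniform_convergence_m(H_size, eps, delta):
    """Sample size sufficient for |L_S(h) - L_D(h)| <= eps to hold
    simultaneously for all h in a finite class, with probability >= 1 - delta
    (Hoeffding + union bound): m >= ln(2|H|/delta) / (2 eps^2)."""
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps ** 2))

# The dependence on |H| is logarithmic: doubling the class adds only
# ~ ln(2) / (2 eps^2) extra examples.
m1 = uniform_convergence_m(1000, 0.05, 0.01)
m2 = uniform_convergence_m(2000, 0.05, 0.01)
print(m1, m2 - m1)
```

The logarithmic dependence on \(|\mathcal{H}|\) is what makes even astronomically large finite classes learnable from modest samples.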
\[ L_{\mathcal{D}}(h_S) \leq L_S(h_S) + \epsilon \leq L_S(h^*) + \epsilon \leq L_{\mathcal{D}}(h^*) + 2\epsilon, \]where \(h^* = \arg\min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)\).
Chapter 4: The No-Free-Lunch Theorem and the Bias-Complexity Tradeoff
No-Free-Lunch
The most fundamental negative result in learning theory:
Let \(A\) be any learning algorithm for binary classification over a domain \(X\), and let the training set size satisfy \(m \leq |X|/2\). Then there exists a distribution \(\mathcal{D}\) over \(X \times \{0,1\}\) admitting a zero-error labeling function such that, with probability at least \(1/7\) over \(S \sim \mathcal{D}^m\), \(L_{\mathcal{D}}(A(S)) \geq 1/8\).
The theorem says that without assumptions about the target — no restriction to a hypothesis class — no algorithm can guarantee learning. Any algorithm that succeeds on one family of distributions must fail on another. Equivalently, there is no “best” universal learning algorithm; good performance on any problem class requires matching inductive bias to the problem structure.
The proof considers all \(2^{2m}\) possible labelings of a finite domain of \(2m\) points; each induces a distribution with zero Bayes error, but for any algorithm, at least one of these distributions causes the algorithm to predict no better than randomly on the unseen half of the domain. The argument is purely combinatorial.
The Bias-Complexity Tradeoff
The interplay between the approximation error and estimation error determines the overall generalization:
\[ L_{\mathcal{D}}(h_S) = \underbrace{L_{\mathcal{D}}(h_S) - \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)}_{\text{estimation error}} + \underbrace{\min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)}_{\text{approximation error}}. \]- A richer \(\mathcal{H}\) (more expressive) reduces approximation error but increases estimation error (harder to learn from finite samples).
- A simpler \(\mathcal{H}\) is easy to learn from data but may not contain any good predictors.
Minimizing both errors simultaneously is the bias-variance tradeoff of classical statistics, now stated in a distribution-free framework. The choice of \(\mathcal{H}\) encodes the inductive bias; the optimal choice depends on the true data distribution, which is unknown.
Chapter 5: The VC Dimension
When Infinite Classes Are Learnable
Finite classes have sample complexity \(O(\log|\mathcal{H}|)\). But many important classes — halfspaces in \(\mathbb{R}^d\), polynomial classifiers, neural networks — are infinite. Are these learnable? The answer is yes, but the relevant complexity measure is not \(|\mathcal{H}|\) but the Vapnik-Chervonenkis (VC) dimension.
The key concept is shattering: a set of points \(C = \{x_1, \ldots, x_k\} \subset X\) is shattered by \(\mathcal{H}\) if for every labeling \(y \in \{0,1\}^k\), there exists \(h \in \mathcal{H}\) with \(h(x_i) = y_i\) for all \(i\). Shattering means \(\mathcal{H}\) can express any pattern on \(C\), with no restriction.
The VC dimension measures the richness of \(\mathcal{H}\): how large a set can it shatter? Examples:
- Threshold functions on \(\mathbb{R}\) (\(h_a(x) = \mathbf{1}[x > a]\)): \(\mathrm{VCdim} = 1\).
- Intervals on \(\mathbb{R}\) (\(h_{a,b}(x) = \mathbf{1}[a \leq x \leq b]\)): \(\mathrm{VCdim} = 2\).
- Halfspaces in \(\mathbb{R}^d\): \(\mathrm{VCdim} = d+1\).
- Finite sets \(\mathcal{H}\): \(\mathrm{VCdim}(\mathcal{H}) \leq \log_2 |\mathcal{H}|\).
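Shattering can be verified by brute force for simple classes. The sketch below (illustrative code, not from the lecture) enumerates the labelings that the interval class achieves on a point set and checks whether all \(2^k\) labelings appear — confirming \(\mathrm{VCdim} = 2\) for intervals:

```python
def labelings_by_intervals(points):
    """All labelings of `points` realizable by h_{a,b}(x) = 1[a <= x <= b].
    Trying endpoints at the points, between them, and outside them suffices."""
    pts = sorted(points)
    cands = [pts[0] - 1] + [(p + q) / 2 for p, q in zip(pts, pts[1:])] + [pts[-1] + 1]
    cands = pts + cands
    achieved = set()
    for a in cands:
        for b in cands:  # a > b gives the empty interval (all-zeros labeling)
            achieved.add(tuple(1 if a <= x <= b else 0 for x in points))
    return achieved

def shattered(points):
    return len(labelings_by_intervals(points)) == 2 ** len(points)

print(shattered([1.0, 2.0]))       # intervals shatter any 2 distinct points
print(shattered([1.0, 2.0, 3.0]))  # no 3 points: labeling (1, 0, 1) fails
```

The failing labeling \((1, 0, 1)\) is the witness: a single interval cannot contain the outer points while excluding the middle one.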
The Fundamental Theorem of PAC Learning
A class \(\mathcal{H}\) is PAC learnable (equivalently, agnostically PAC learnable) if and only if \(\mathrm{VCdim}(\mathcal{H}) < \infty\); moreover, the sample complexity satisfies \(m_{\mathcal{H}}(\epsilon, \delta) = \Theta\!\left(\frac{d + \log(1/\delta)}{\epsilon}\right)\) in the realizable case and \(\Theta\!\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)\) in the agnostic case, where \(d = \mathrm{VCdim}(\mathcal{H})\).
The fundamental theorem is one of the great results in theoretical computer science and statistics — it identifies a single combinatorial number that completely characterizes learnability. The proof of the upper bound uses Sauer’s lemma (the growth function is polynomial when VC dimension is finite), which combined with uniform convergence bounds gives the sample complexity guarantee.
Sauer’s Lemma and the Growth Function
The growth function \(\tau_{\mathcal{H}}(m) = \max_{x_1,\ldots,x_m} |\{(h(x_1),\ldots,h(x_m)) : h \in \mathcal{H}\}|\) counts the maximum number of distinct labelings \(\mathcal{H}\) can achieve on \(m\) points.
Sauer’s lemma says that a finite VC dimension forces the growth function to grow polynomially rather than exponentially. This polynomial growth is what allows uniform convergence arguments to go through for infinite classes: the effective number of “distinct” hypotheses on any sample of \(m\) points is at most \(O(m^d)\), giving a sample complexity of \(O(d\log(m)/\epsilon^2)\), which solves to \(O(d\log(d/\epsilon)/\epsilon^2)\) via the standard bound.
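Sauer's lemma is easy to evaluate numerically. A small sketch contrasting the polynomial bound \(\sum_{i=0}^{d}\binom{m}{i}\) with the exponential \(2^m\):

```python
from math import comb

def sauer_bound(m, d):
    """Sauer's lemma: tau_H(m) <= sum_{i=0}^{d} C(m, i) when VCdim(H) = d."""
    return sum(comb(m, i) for i in range(min(d, m) + 1))

# For d = 3: the bound equals 2^m up to m = d, then falls behind polynomially.
for m in (3, 10, 100):
    print(m, sauer_bound(m, 3), 2 ** m)
```

At \(m = 100\) with \(d = 3\), the class realizes at most 166,751 labelings out of \(2^{100}\) — the “effective size” \(O(m^d)\) that makes the union-bound argument survive for infinite classes.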
Chapter 6: Nonuniform Learnability and Model Selection
Structural Risk Minimization
PAC learnability requires a fixed hypothesis class \(\mathcal{H}\). But in practice, we want to choose the complexity of the model from the data — using cross-validation, held-out error, or principled model selection criteria.
Nonuniform learnability generalizes PAC learning: a class \(\mathcal{H}\) is nonuniformly learnable if there is an algorithm \(A\) such that for every \(h^* \in \mathcal{H}\) and every \(\epsilon, \delta > 0\), there exists \(m_0(h^*, \epsilon, \delta)\) with: if \(m \geq m_0\), then \(L(A(S)) \leq L(h^*) + \epsilon\) with probability \(1-\delta\). The key difference from PAC learning is that \(m_0\) may depend on the specific \(h^*\), not just the class \(\mathcal{H}\).
Structural Risk Minimization (SRM) achieves nonuniform learnability by decomposing \(\mathcal{H} = \bigcup_{n=1}^\infty \mathcal{H}_n\) into a hierarchy of nested classes with increasing complexity, assigning weights \(w(n)\) (with \(\sum_n w(n) \leq 1\)) and choosing the hypothesis minimizing the regularized empirical risk:
\[ h_{\mathrm{SRM}} = \arg\min_{n,\, h \in \mathcal{H}_n} \left[L_S(h) + \epsilon_n(m, w(n)\cdot\delta)\right], \]where \(\epsilon_n(m, w(n)\cdot\delta)\) is the generalization bound for class \(\mathcal{H}_n\) at sample size \(m\); the weights split the confidence budget \(\delta\) across the classes in the hierarchy.
Minimum Description Length
MDL (Minimum Description Length) learning is a Bayesian-inspired variant: assign each hypothesis a “description length” (or complexity) \(\ell(h)\) in bits, and minimize the total description length of hypothesis and the labeled training data:
\[ h_{\mathrm{MDL}} = \arg\min_{h \in \mathcal{H}} \left[L_S(h) + \frac{\ell(h)}{m}\right]. \]Hypotheses with shorter descriptions are preferred, and hypotheses that fit the data well incur smaller empirical risk. MDL connects to Kolmogorov complexity and to Bayesian model selection (with the prior \(P(h) \propto 2^{-\ell(h)}\)), and gives a principled form of Occam’s razor — prefer the simplest explanation consistent with the data.
Chapter 7: Computational Complexity of Learning
Efficient Learnability
PAC learnability says how many examples are needed. A separate question is how efficiently the learning algorithm can run. A class is efficiently PAC learnable if there is a polynomial-time algorithm (in \(1/\epsilon\), \(1/\delta\), and the input dimension \(n\)) that achieves PAC learning.
The two questions are independent: there exist classes that are information-theoretically PAC learnable (finite VC dimension) but for which no efficient algorithm is known, under standard cryptographic hardness assumptions. Learning DNF (disjunctive normal form) formulas is a famous example: polynomial-size DNF has polynomial VC dimension, so it is information-theoretically learnable, yet no polynomial-time algorithm is known. The case of 3-term DNF is instructive: learning it properly (outputting a 3-term DNF) is NP-hard, but it is efficiently learnable improperly by representing hypotheses as 3-CNF formulas — the choice of output representation can make the computational difference.
The hardness of learning problems is typically shown through cryptographic reductions: an efficient learning algorithm would imply breaking a cryptographic scheme (factoring, discrete logarithm, etc.). Learning Parity with Noise (LPN) and related assumptions are the key tools.
Hardness of Learning
Some learning problems are hard to learn efficiently not because of statistical reasons but computational ones:
- Learning parities in the presence of random classification noise is conjectured to require superpolynomial time (the LPN assumption; the best known algorithm, due to Blum, Kalai & Wasserman, runs in time \(2^{O(n/\log n)}\)).
- Agnostic learning of halfspaces — finding a halfspace whose error approximately matches the best halfspace — is NP-hard in the worst case.
- Learning DNF is hard under cryptographic assumptions.
These results suggest that for practical learning problems, either special structure must be exploited (convexity, margin, sparsity) or one must accept approximate guarantees that may not hold in the adversarial worst case.
Chapter 8: Linear Predictors
Halfspaces and the Perceptron
The most fundamental class of binary classifiers is halfspaces:
\[ \mathcal{H}_{\mathrm{HS}} = \{x \mapsto \mathrm{sign}(\mathbf{w}^T\mathbf{x} + b) : \mathbf{w} \in \mathbb{R}^d, b \in \mathbb{R}\}. \]A halfspace is defined by a hyperplane \(\mathbf{w}^T\mathbf{x} + b = 0\) separating the positive from the negative class. The VC dimension of halfspaces in \(\mathbb{R}^d\) is \(d+1\): any \(d+1\) points in general position are shattered. This means the sample complexity of learning halfspaces in the realizable case is \(O(d/\epsilon)\) up to logarithmic factors — linear in the dimension.
When the training data is linearly separable (there exists a halfspace with zero training error), the perceptron algorithm (Rosenblatt 1957) finds it:
Initialize w = 0
Repeat until convergence:
For each misclassified (x_i, y_i):
w ← w + y_i · x_i
The perceptron convergence theorem guarantees at most \((R/\gamma)^2\) updates, where \(R = \max_i \|\mathbf{x}_i\|\) and \(\gamma\) is the margin of the best separating hyperplane. The bound is independent of \(d\) and \(m\) — only the margin matters. When the data is not separable, the perceptron may not converge; soft-margin classifiers (SVMs) handle this case.
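For concreteness, here is a minimal runnable version of the perceptron (pure Python; the toy data, the epoch cap, and the trick of folding the bias into the weight vector are illustrative choices):

```python
def perceptron(data, max_epochs=1000):
    """Perceptron on labeled data [(x, y)] with y in {-1, +1}; x is a tuple.
    A constant 1 is appended to each x so the bias b is folded into w."""
    dim = len(data[0][0]) + 1
    w = [0.0] * dim
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            xb = list(x) + [1.0]  # homogeneous coordinates
            if y * sum(wi * xi for wi, xi in zip(w, xb)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, xb)]  # w <- w + y*x
                mistakes += 1
        if mistakes == 0:  # a full pass with no updates: converged
            return w
    raise RuntimeError("data may not be linearly separable")

# Separable toy data: label = sign of x1 - x2
S = [((2.0, 0.0), 1), ((0.0, 2.0), -1), ((3.0, 1.0), 1), ((1.0, 3.0), -1)]
w = perceptron(S)
converged = all(
    y * sum(wi * xi for wi, xi in zip(w, list(x) + [1.0])) > 0 for x, y in S
)
print(converged)  # True: the returned w separates the sample
```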
Chapter 9: Boosting
Weak Learnability and AdaBoost
A weak learner is an algorithm that achieves accuracy only slightly better than random: there exists \(\gamma > 0\) such that the weak learner produces \(h\) with \(L_{\mathcal{D}}(h) \leq 1/2 - \gamma\). The insight behind boosting is that weak learners can be combined into a strong learner (arbitrary accuracy) by sequential training on adaptively reweighted distributions.
The AdaBoost algorithm (Freund & Schapire, 1995) works by iteratively reweighting the training data to focus on examples misclassified by the current ensemble:
- Initialize: \(D_1(i) = 1/m\) for all \(i\).
- For \(t = 1, \ldots, T\):
- Train weak learner on \((S, D_t)\), get hypothesis \(h_t\).
- Compute weighted error \(\epsilon_t = \sum_i D_t(i) \cdot \mathbf{1}[h_t(x_i) \neq y_i]\).
- Compute weight \(\alpha_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)\).
- Update: \(D_{t+1}(i) \propto D_t(i) \cdot \exp(-\alpha_t y_i h_t(x_i))\).
- Output: \(H(x) = \mathrm{sign}\!\left(\sum_t \alpha_t h_t(x)\right)\).
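The loop above can be sketched end-to-end with decision stumps as the weak learner. This is an illustrative implementation; the 1-D dataset is chosen so that no single stump is perfect but three rounds of boosting fit it exactly:

```python
import math

def stump_learner(S, D):
    """Weak learner: the decision stump s * sign(x_j > t) minimizing
    weighted error under distribution D."""
    best = None
    for j in range(len(S[0][0])):
        for t in sorted({x[j] for x, _ in S}):
            for s in (1, -1):
                err = sum(d for (x, y), d in zip(S, D)
                          if s * (1 if x[j] > t else -1) != y)
                if best is None or err < best[0]:
                    best = (err, j, t, s)
    _, j, t, s = best
    return lambda x: s * (1 if x[j] > t else -1)

def adaboost(S, T=10):
    m = len(S)
    D = [1.0 / m] * m
    ensemble = []
    for _ in range(T):
        h = stump_learner(S, D)
        eps = sum(d for (x, y), d in zip(S, D) if h(x) != y)
        if eps == 0 or eps >= 0.5:
            break  # weak-learning assumption violated (or already perfect)
        alpha = 0.5 * math.log((1 - eps) / eps)
        D = [d * math.exp(-alpha * y * h(x)) for (x, y), d in zip(S, D)]
        Z = sum(D)
        D = [d / Z for d in D]  # renormalize to a distribution
        ensemble.append((alpha, h))
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

# 1-D data with a negative interval in the middle: every single stump errs
# on at least two points, but three boosted stumps classify all six correctly.
S = [((1.0,), 1), ((2.0,), 1), ((3.0,), -1),
     ((4.0,), -1), ((5.0,), 1), ((6.0,), 1)]
H = adaboost(S, T=3)
errs = sum(1 for x, y in S if H(x) != y)
print(errs)
```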
If each weak learner achieves error at most \(1/2 - \gamma\), AdaBoost achieves zero training error after \(T = O(\log m / \gamma^2)\) rounds. The training error decreases as \(\exp(-2\gamma^2 T)\). Moreover, even after the training error reaches zero, the generalization error continues to improve — AdaBoost “maximizes the margin,” connecting it to SVMs.
Chapter 10: Support Vector Machines
Hard-SVM and the Maximum Margin Classifier
For linearly separable data, there are infinitely many separating hyperplanes. The Support Vector Machine (SVM) selects the one that maximizes the margin — the minimum distance from any training point to the hyperplane.
For data with labels \(y_i \in \{-1, +1\}\), the margin of a hyperplane \(\mathbf{w}, b\) on example \((\mathbf{x}_i, y_i)\) is \(y_i(\mathbf{w}^T\mathbf{x}_i + b)/\|\mathbf{w}\|\). The hard-SVM optimization problem is:
\[ \min_{\mathbf{w}, b} \|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1 \text{ for all } i. \]This is a convex quadratic program. The constraint \(y_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1\) normalizes the margin to 1; minimizing \(\|\mathbf{w}\|^2\) maximizes the geometric margin \(1/\|\mathbf{w}\|\).
The solution is determined only by the support vectors — the training points lying exactly on the margin boundaries (\(y_i(\mathbf{w}^T\mathbf{x}_i + b) = 1\)). The KKT conditions of the dual problem reveal that \(\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i\), a sparse combination of support vectors.
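The constraint normalization can be illustrated directly (a minimal sketch; the 1-D toy data and the candidate hyperplane are hand-picked): a feasible \((\mathbf{w}, b)\) whose constraints are tight at the support vectors attains geometric margin exactly \(1/\|\mathbf{w}\|\).

```python
import math

def feasible(w, b, S):
    """Check the hard-SVM constraints y_i (w.x_i + b) >= 1 on sample S."""
    return all(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1 - 1e-9
               for x, y in S)

def geometric_margin(w, b, S):
    """Minimum distance from a point of S to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, _ in S)

# 1-D toy set: positives at x >= 3, negatives at x <= 1.
S = [((3.0,), 1), ((5.0,), 1), ((1.0,), -1), ((-2.0,), -1)]
# Midpoint hyperplane x = 2, scaled so the constraints are tight at the
# support vectors x = 3 and x = 1:
w, b = (1.0,), -2.0
ok, gm = feasible(w, b, S), geometric_margin(w, b, S)
print(ok, gm)  # True 1.0 — the margin equals 1/||w||
```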
The sample complexity of hard-SVM does not depend on the dimension \(d\) but only on the margin: for instances normalized to \(\|\mathbf{x}\| \leq 1\) and separable with margin \(\gamma\), \(m = O(1/(\gamma^2\epsilon))\) suffices — entirely independent of \(d\). This is a major advantage in high dimensions when the data is well-separated.
Chapter 11: Kernel Methods
Feature Maps and the Kernel Trick
Many real-world problems are not linearly separable in the original feature space but become separable in a higher-dimensional (possibly infinite-dimensional) feature space. A feature map \(\phi \colon X \to \mathcal{F}\) transforms instances; the SVM is then applied in \(\mathcal{F}\).
The kernel trick avoids explicitly computing \(\phi(\mathbf{x})\) by working only with inner products. A kernel function \(k(x, x') = \langle\phi(x), \phi(x')\rangle_{\mathcal{F}}\) computes the inner product in feature space without materializing \(\phi(x)\). The SVM dual depends on the data only through the Gram matrix \(K_{ij} = k(x_i, x_j)\), so the entire computation requires only evaluating \(k\) — which may be \(O(d)\) even when \(\mathcal{F}\) is infinite-dimensional.
A function \(k \colon X \times X \to \mathbb{R}\) is a valid (Mercer) kernel if and only if the Gram matrix \(K\) is positive semidefinite for any finite set of inputs. Important kernels:
- Linear kernel: \(k(x, x') = x^T x'\) (original features).
- Polynomial kernel: \(k(x, x') = (x^T x' + c)^p\) (degree-\(p\) polynomial features).
- Gaussian (RBF) kernel: \(k(x, x') = \exp(-\|x-x'\|^2/(2\sigma^2))\) (infinite-dimensional feature space).
- String kernels, graph kernels: work on non-Euclidean domains.
By the Representer Theorem, the solution to any kernel regularized risk minimization problem (of the form \(\min_h \lambda\|h\|^2_{\mathcal{H}} + \frac{1}{m}\sum_i L(y_i, h(x_i))\)) lies in the span of the kernel evaluations \(\{k(\cdot, x_i)\}\) — the solution has the form \(h(x) = \sum_i \alpha_i k(x, x_i)\), reducing an infinite-dimensional optimization to a finite-dimensional one.
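The kernel trick can be checked directly for the homogeneous degree-2 polynomial kernel \(k(x, x') = (x^\top x')^2\), whose explicit feature map is \(\phi(x) = (x_i x_j)_{i,j}\) (a minimal sketch; the random test vectors are arbitrary):

```python
import random

def poly2_features(x):
    """Explicit degree-2 feature map for k(x, x') = (x.x')^2:
    phi(x) = (x_i * x_j) over all ordered pairs (i, j) — d^2 coordinates."""
    return [xi * xj for xi in x for xj in x]

def k_poly2(x, xp):
    """Kernel evaluation: O(d) work, never materializing phi."""
    return sum(a * b for a, b in zip(x, xp)) ** 2

random.seed(0)
x = [random.gauss(0, 1) for _ in range(3)]
xp = [random.gauss(0, 1) for _ in range(3)]

lhs = k_poly2(x, xp)                       # O(d) kernel evaluation
rhs = sum(a * b for a, b in zip(poly2_features(x), poly2_features(xp)))  # O(d^2)
print(abs(lhs - rhs) < 1e-9)  # True: the kernel trick in miniature
```

For degree \(p\) the explicit feature space has \(d^p\) coordinates while the kernel evaluation stays \(O(d)\) — the gap the trick exploits.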
Chapter 12: Neural Networks
Feedforward Networks
A feedforward neural network with \(L\) layers computes:
\[ h^{(l)} = \sigma\!\left(W^{(l)} h^{(l-1)} + b^{(l)}\right), \quad l = 1, \ldots, L, \]where \(W^{(l)}\) and \(b^{(l)}\) are weight matrices and biases for layer \(l\), and \(\sigma\) is an activation function applied elementwise (e.g., ReLU: \(\sigma(z) = \max(0,z)\), sigmoid: \(\sigma(z) = 1/(1+e^{-z})\)). The output layer typically applies a softmax (for classification) or linear activation (for regression).
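As a concrete instance of the layer equation, here is a hand-wired network with one hidden layer computing XOR (weights chosen by hand for illustration; nothing is learned):

```python
def relu(z):
    return [max(0.0, v) for v in z]

def layer(W, b, h):
    """One layer: sigma(W h + b), with W given as a list of rows."""
    return relu([sum(wij * hj for wij, hj in zip(row, h)) + bi
                 for row, bi in zip(W, b)])

# Classic XOR construction: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1),
# output = relu(h1 - 2 * h2).
W1, b1 = [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]
W2, b2 = [[1.0, -2.0]], [0.0]

def net(x):
    return layer(W2, b2, layer(W1, b1, x))[0]

outs = [net([a, b]) for a in (0.0, 1.0) for b in (0.0, 1.0)]
print(outs)  # [0.0, 1.0, 1.0, 0.0] — XOR of the two inputs
```

No halfspace computes XOR, so this is the smallest illustration of the expressive power a single hidden layer adds.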
The universal approximation theorem (Cybenko 1989, Hornik 1991) establishes that a network with a single hidden layer and a sufficient number of hidden units can approximate any continuous function on a compact domain to arbitrary accuracy. In this sense, neural networks are universal. However, the theorem gives no guidance on how many neurons are needed or how to find the weights — these are computational questions.
Learning Neural Networks via Backpropagation
The parameters are learned by minimizing the training loss using stochastic gradient descent (SGD). The gradients are computed by backpropagation — the chain rule applied efficiently in reverse through the network:
\[ \frac{\partial \mathcal{L}}{\partial W^{(l)}} = \frac{\partial \mathcal{L}}{\partial h^{(L)}} \cdot \frac{\partial h^{(L)}}{\partial h^{(L-1)}} \cdots \frac{\partial h^{(l+1)}}{\partial h^{(l)}} \cdot \frac{\partial h^{(l)}}{\partial W^{(l)}}. \]Backpropagation computes gradients in \(O(\text{total parameters})\) — the same cost as a forward pass. The optimization landscape of deep networks is non-convex; yet in practice, SGD finds solutions with excellent generalization. Understanding why — why local minima are rare, why overparameterization helps, how implicit bias of SGD shapes the learned solution — is among the central open problems of deep learning theory.
Chapter 13: Beyond This Course
Natural Next Topics
Rademacher Complexity and Uniform Convergence
The VC dimension gives tight bounds for binary classification. For general loss functions and unbounded hypothesis classes, Rademacher complexity is the right notion. The Rademacher complexity of \(\mathcal{H}\) on a sample \(S\) is:
\[ \hat{\mathcal{R}}_S(\mathcal{H}) = \mathbb{E}_{\sigma}\!\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\sum_{i=1}^m \sigma_i h(x_i)\right], \]where \(\sigma_1, \ldots, \sigma_m\) are i.i.d. Rademacher random variables (\(\pm 1\) with equal probability). The generalization bound via Rademacher complexity is:
\[ L_{\mathcal{D}}(h_S) \leq L_S(h_S) + 2\mathcal{R}_m(\mathcal{H}) + O\!\left(\sqrt{\frac{\log(1/\delta)}{m}}\right). \]Rademacher complexity handles regression, margin-based classifiers, and kernel methods in a unified framework. For linear predictors with bounded norm (\(\|\mathbf{w}\| \leq B\)) over inputs with \(\|\mathbf{x}\| \leq R\), the Rademacher complexity is at most \(BR/\sqrt{m}\) — giving the margin-based bound directly.
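The empirical Rademacher complexity of a small finite class can be estimated by Monte Carlo over random sign vectors (an illustrative sketch; hypothesis classes are represented by their vectors of outputs on the sample):

```python
import random

def empirical_rademacher(H_outputs, n_trials=20000, seed=0):
    """Monte Carlo estimate of R_S(H) = E_sigma sup_h (1/m) sum_i sigma_i h(x_i).
    H_outputs: one vector (h(x_1), ..., h(x_m)) per hypothesis h."""
    rng = random.Random(seed)
    m = len(H_outputs[0])
    total = 0.0
    for _ in range(n_trials):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(sum(s * h for s, h in zip(sigma, hv)) / m
                     for hv in H_outputs)
    return total / n_trials

m = 10
# Two constant hypotheses {always -1, always +1}: the sup picks the sign of
# the majority of the sigma_i, so R_S = E|mean(sigma)| ~ 1/sqrt(m).
H_small = [[-1] * m, [1] * m]
# A class realizing all 2^m sign patterns (it shatters the sample): R_S = 1.
H_huge = [[1 if (n >> i) & 1 else -1 for i in range(m)] for n in range(2 ** m)]

r_huge = empirical_rademacher(H_huge, n_trials=200)
r_small = empirical_rademacher(H_small)
print(r_huge)          # exactly 1.0: some hypothesis matches every sigma
print(r_small < 0.5)   # True: a tiny class correlates weakly with noise
```

The contrast is the whole story: a class that can fit random signs has maximal complexity, while a small class cannot chase the noise.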
Generalization in Deep Learning
The classical PAC theory predicts that overparameterized models (far more parameters than training points) should overfit. Modern deep networks are massively overparameterized yet generalize remarkably well. The double descent phenomenon — test error follows the classical U-shaped curve as model complexity grows, then descends again beyond the interpolation threshold — is not captured by classical bias-variance theory. Understanding generalization in overparameterized regimes requires new tools: implicit regularization of gradient descent, flat minima and the connection to generalization, NTK (neural tangent kernel, which linearizes training dynamics at initialization), and the PAC-Bayes framework (bounding generalization via a prior over hypotheses).
Follow-up Courses and Reading
At the University of Waterloo
CS 886 (Seminar on Selected Topics in Machine Learning) covers current research. CS 480/680 (Introduction to Machine Learning) is the companion course with more emphasis on methods and applications. STAT 441/841 (Statistical Learning — Classification) develops the statistical perspective. CO 671 (Semidefinite Programming) is relevant to kernel methods and relaxations.
Standard Texts
Shalev-Shwartz & Ben-David Understanding Machine Learning (Cambridge) — the course text, freely available, rigorous and comprehensive. Mohri, Rostamizadeh & Talwalkar Foundations of Machine Learning (MIT Press, 2018, 2nd ed.) covers similar material with particular strength in Rademacher complexity and boosting. Bartlett & Mendelson (2002, JMLR) is the definitive paper on Rademacher complexity. Wainwright High-Dimensional Statistics: A Non-Asymptotic Viewpoint (Cambridge, 2019) covers concentration inequalities, uniform convergence, and estimation in modern high-dimensional settings.
Open Problems and Active Research
Computational vs. Statistical Complexity
The relationship between computational hardness and statistical hardness in learning is poorly understood. The Statistical Query (SQ) model (Kearns 1998) captures a broad class of algorithms and has been used to show hardness for problems like learning sparse parities and some planted clique problems. But it remains open whether every SQ-hard problem is computationally hard or whether some can be solved by non-SQ algorithms. The planted clique conjecture — that no polynomial-time algorithm can recover a planted clique of size \(o(\sqrt{n})\) in a random graph — underlies many hardness results in learning and testing.
Benign Overfitting and Interpolation
Recent work (Bartlett et al., Belkin et al.) shows that in sufficiently high dimensions, interpolating classifiers (with zero training error) can still generalize well. The exact conditions under which interpolation is benign — depending on signal-to-noise ratio, dimensionality, and the structure of the hypothesis class — are active research. This connects to the theory of ridgeless regression, random features, and the double descent risk curve.
Privacy-Preserving Learning
Differential privacy (Dwork et al. 2006) provides a rigorous definition of privacy for learning algorithms: a randomized algorithm \(A\) is \((\epsilon, \delta)\)-differentially private if changing one training example changes the probability of any output event by at most an \(e^{\epsilon}\) multiplicative factor (plus \(\delta\) additive). The fundamental question is the tradeoff between privacy and sample complexity: how many extra examples are needed to learn accurately under \((\epsilon_P, \delta_P)\)-differential privacy? Recent results show this overhead is polynomial in the VC dimension, and a deep recent result shows that private learnability is characterized by online learnability (via the Littlestone dimension).