CS 479: Neural Networks

Jeff Orchard

Estimated study time: 55 minutes


Based on lecture notes by Sibelius Peng — PDF source

Sources and References

  • Primary textbook — Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. Freely available at deeplearningbook.org.
  • Supplementary texts — Bishop, C.M. Pattern Recognition and Machine Learning. Springer, 2006. Nielsen, M.A. Neural Networks and Deep Learning. Determination Press, 2015 (freely available at neuralnetworksanddeeplearning.com).
  • Online resources — Karpathy, A. CS231n: Convolutional Neural Networks for Visual Recognition (Stanford, cs231n.stanford.edu); lecture notes and assignments freely available. MIT 6.S191 Introduction to Deep Learning (introtodeeplearning.com). Fast.ai Practical Deep Learning for Coders (fast.ai).
  • Lecture notes — Peng, S. CS 479 Lecture Notes (Winter 2021). https://pdf.sibeliusp.com/1211/cs479.pdf

Chapter 1: Biological Neurons and Mathematical Abstractions

Neural networks take inspiration from biology — from the densely interconnected web of neurons in the brain — but quickly depart from it. Understanding the bridge between biology and mathematics clarifies what properties we are trying to capture and which biological details we deliberately abandon.

Biological Neurons

A neuron consists of a dendrite tree (input fibres that receive electrochemical signals from other neurons), a cell body (soma) that integrates these signals, and an axon (output fibre) that transmits a signal to other neurons via synapses. The neuron’s response is all-or-nothing: if the integrated input exceeds a threshold, it fires an action potential — a brief \(\approx 1\text{ ms}\) voltage spike that travels down the axon.

The brain contains roughly \(10^{11}\) neurons, each connected to approximately 7,000 others on average, giving around \(10^{14}\) synapses total. The diversity is immense: cortical pyramidal cells, cerebellar Purkinje cells, retinal ganglion cells — each type has distinct morphology, firing patterns, and connectivity.

The Hodgkin–Huxley Model

The Hodgkin–Huxley model (1952) describes the neuron's membrane potential \(V\) with a system of coupled differential equations:

\[ C_m \frac{dV}{dt} = I_\text{ext} - \bar{g}_\text{Na} m^3 h (V - E_\text{Na}) - \bar{g}_\text{K} n^4 (V - E_\text{K}) - g_L (V - E_L), \]

where \(m, h, n\) are gating variables satisfying their own first-order ODEs. The three ion channels are sodium (Na), potassium (K), and a leak current. The nonlinear gating variables produce the characteristic spike waveform: rapid depolarisation (sodium inflow), repolarisation (potassium outflow), and hyperpolarisation (refractory period).

While the H-H model is biophysically accurate, it is too complex to scale to millions of neurons. Simpler abstractions are needed.

The Leaky Integrate-and-Fire Model

The leaky integrate-and-fire (LIF) model reduces the neuron to a single equation for the membrane potential:

\[ \tau_m \frac{dV}{dt} = -(V - V_\text{rest}) + R_m I(t), \]

where \(\tau_m = R_m C_m\) is the membrane time constant. When \(V\) reaches threshold \(V_\text{th}\), a spike is emitted and \(V\) resets to \(V_\text{rest}\). The LIF model captures the integration of inputs, the leaky decay, and the threshold-firing behaviour, while discarding spike shape detail.

Artificial Neuron Model and Activation Functions

The artificial neuron strips the biology down to a weighted sum of inputs followed by a nonlinear activation function:

\[ z = \sum_i w_i x_i + b = \mathbf{w}^T \mathbf{x} + b, \qquad y = \sigma(z). \]

Common activation functions and their properties:

  • Sigmoid: \(\sigma(z) = 1/(1 + e^{-z})\). Range \((0,1)\), smooth, gradient \(\sigma(1-\sigma)\). Historically important. Suffers from vanishing gradients for large \(|z|\).
  • Tanh: \(\tanh(z)\). Range \((-1, 1)\), zero-centred (better for gradient flow). Still vanishes for large \(|z|\).
  • ReLU (Rectified Linear Unit): \(\max(0, z)\). No saturation for \(z > 0\), extremely fast to compute. Suffers from dying ReLU (neurons stuck at zero). Dominant in deep learning since Nair–Hinton 2010.
  • Softmax (for output layer): \(\text{softmax}(\mathbf{z})_j = e^{z_j} / \sum_k e^{z_k}\). Converts a vector of scores to a probability distribution over classes.

The connection weight between neuron \(j\) and neuron \(i\) is \(w_{ij}\); the total input to neuron \(i\) is \(a_i = \sum_j w_{ij} y_j\). Synaptic plasticity — changes to weights — underlies learning.

Chapter 2: The Learning Framework

What Neural Networks Learn

A neural network is a parameterised function \(f_\theta : \mathcal{X} \to \mathcal{Y}\), where \(\theta\) denotes all weights and biases. Learning means adjusting \(\theta\) so that \(f_\theta\) performs well on some task, typically measured by a loss function \(\mathcal{L}(\theta)\) computed over a dataset.

The fundamental challenge is generalisation: the network must perform well not just on training data but on unseen examples from the same distribution. A network that memorises training data but fails on new inputs is overfit.

Universal Approximation

Theorem (Universal Approximation, Cybenko 1989; Hornik et al. 1989). A feedforward neural network with a single hidden layer of sufficient width, using a non-polynomial activation function \(\sigma\), can approximate any continuous function on a compact subset of \(\mathbb{R}^n\) to arbitrary precision.

This theorem guarantees expressive power but says nothing about how to find the parameters or how many neurons are needed. Depth — using many layers rather than one wide layer — is not explained by the UAT; deeper networks often generalise better and are more efficient (exponentially fewer neurons needed for certain functions). The depth separation results by Telgarsky (2016) and Eldan–Shamir (2016) show that some functions computable by depth-\(k\) networks of polynomial size require exponential size for depth \(k-1\) networks.

Loss Functions

The choice of loss function embeds the learning objective:

Mean squared error (MSE):

\[ \mathcal{L}_\text{MSE} = \frac{1}{N} \sum_{i=1}^N \| \hat{y}_i - y_i \|^2. \]

Appropriate for regression. Under Gaussian noise assumptions, MSE is the maximum likelihood estimator.

Binary cross-entropy (BCE), for binary classification with \(\hat{y}_i \in (0,1)\):

\[ \mathcal{L}_\text{BCE} = -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]. \]

This is the negative log-likelihood under a Bernoulli model, and penalises confident wrong predictions heavily.

Categorical cross-entropy, for multi-class classification over \(C\) classes:

\[ \mathcal{L}_\text{CE} = -\frac{1}{N} \sum_{i=1}^N \sum_{j=1}^C y_{ij} \log \hat{y}_{ij}. \]

When \(y_i\) is a one-hot vector, this simplifies to \(-\log \hat{y}_{i, \text{true class}}\) — the negative log-probability assigned to the correct class.

Chapter 3: Backpropagation

Backpropagation is the algorithm that made deep learning practical. It computes the gradient of the loss with respect to all parameters of a network in time proportional to the forward pass, using the chain rule of calculus.

Gradient Descent

Gradient descent updates the parameters by stepping against the gradient of the loss:

\[ \theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t), \]

where \(\eta > 0\) is the learning rate (step size). The gradient points uphill, so subtracting it moves downhill. For a convex loss and a suitably small learning rate, gradient descent converges to the global minimum; for the nonconvex loss of neural networks, it converges to a stationary point (not necessarily a global minimum, but empirically often a good one).

Numerical differentiation approximates \(\partial \mathcal{L} / \partial \theta_j \approx [\mathcal{L}(\theta + h e_j) - \mathcal{L}(\theta - h e_j)] / (2h)\), but this requires two forward passes per parameter — \(O(|\theta|)\) passes total. For networks with millions of parameters, this is completely intractable.
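As an illustration, a central-difference gradient check might look like the following sketch (loss_fn stands for any function of a flat NumPy parameter vector; names are illustrative):

```python
import numpy as np

def numerical_gradient(loss_fn, theta, h=1e-5):
    """Central-difference estimate of d loss / d theta.

    Requires two loss evaluations (forward passes) per parameter,
    which is why it is only practical as a correctness check."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = h
        grad[j] = (loss_fn(theta + e) - loss_fn(theta - e)) / (2 * h)
    return grad

# Example: compare against the analytic gradient of a quadratic loss.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
loss = lambda t: 0.5 * t @ A @ t          # analytic gradient: A @ t (A symmetric)
theta = np.array([1.0, -2.0])
print(numerical_gradient(loss, theta))    # ~ A @ theta = [ 1., -3.]
```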

The Backpropagation Algorithm

For a layered network, define the pre-activation \(a_j^{(\ell)}\) and output \(y_j^{(\ell)}\) of neuron \(j\) in layer \(\ell\),

\[ a_j^{(\ell)} = \sum_i w_{ji}^{(\ell)} y_i^{(\ell-1)} + b_j^{(\ell)}, \quad y_j^{(\ell)} = \sigma(a_j^{(\ell)}), \]

and the error signal

\[ \delta_j^{(\ell)} = \frac{\partial \mathcal{L}}{\partial a_j^{(\ell)}}. \]

Output layer: \(\delta_j^{(L)} = \frac{\partial \mathcal{L}}{\partial y_j^{(L)}} \cdot \sigma'(a_j^{(L)})\).

Hidden layers (propagating the errors backward through the weights):

\[ \delta_i^{(\ell)} = \sigma'(a_i^{(\ell)}) \sum_j w_{ji}^{(\ell+1)} \delta_j^{(\ell+1)}. \]

Weight gradients:

\[ \frac{\partial \mathcal{L}}{\partial w_{ji}^{(\ell)}} = \delta_j^{(\ell)} \cdot y_i^{(\ell-1)}. \]

The total cost is one forward pass plus one backward pass — roughly \(2\times\) the forward pass cost, regardless of the number of parameters. This is the fundamental computational advantage of backprop.
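A minimal NumPy sketch of these equations for a two-layer network with a sigmoid hidden layer, a linear output layer, and MSE loss (shapes and names are illustrative, not the course's reference code):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Tiny 2-layer network: x -> hidden (sigmoid) -> output (linear), MSE loss.
rng = np.random.default_rng(0)
x, t = rng.normal(size=3), rng.normal(size=2)          # input and target
W1, b1 = rng.normal(size=(4, 3)) * 0.1, np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)) * 0.1, np.zeros(2)

# Forward pass: store pre-activations a and activations y.
a1 = W1 @ x + b1;  y1 = sigmoid(a1)
a2 = W2 @ y1 + b2; y2 = a2                             # linear output layer
loss = 0.5 * np.sum((y2 - t) ** 2)

# Backward pass: delta at the output, then recurse toward the input.
delta2 = (y2 - t)                                      # dL/da2 (sigma' = 1 for linear output)
delta1 = (W2.T @ delta2) * y1 * (1 - y1)               # dL/da1, using sigma' = y(1-y)

# Weight gradients: dL/dW^(l) = outer(delta^(l), y^(l-1)).
dW2, db2 = np.outer(delta2, y1), delta2
dW1, db1 = np.outer(delta1, x),  delta1
```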

Chapter 4: Automatic Differentiation

Modern deep learning frameworks (PyTorch, JAX, TensorFlow) implement automatic differentiation (AD), a generalisation of backpropagation that works for arbitrary computation graphs, not just layered networks.

Forward Mode vs. Reverse Mode

A computation \(f : \mathbb{R}^n \to \mathbb{R}^m\) is broken into elementary operations (add, multiply, exp, etc.) forming a computation graph. AD has two modes:

Forward mode: accumulates derivatives from inputs to outputs. For each input direction \(v\), compute the directional derivative \(Jv\) where \(J = \partial f / \partial x\). Cost: \(O(1)\) forward passes per input dimension. Efficient when \(n \ll m\).

Reverse mode (the one used in neural networks): accumulates from outputs back to inputs. Starting from output gradient \(\bar{y} = 1\) (or the gradient of the loss), propagate \(\bar{x} = J^T \bar{y}\). Cost: \(O(1)\) reverse passes per output dimension. Efficient when \(m \ll n\) — which is always true for scalar losses (\(m = 1\)) over many parameters (\(n \gg 1\)).

Backpropagation in a layered network is exactly reverse-mode AD on the chain-structured computation graph.
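A quick illustration of the two modes in Python using JAX (the function f is illustrative): jax.jvp returns the Jacobian-vector product \(Jv\), and jax.vjp returns a function computing \(J^T u\).

```python
import jax
import jax.numpy as jnp

# f : R^3 -> R^2 (illustrative)
def f(x):
    return jnp.array([jnp.sin(x[0]) * x[1], x[1] * x[2] ** 2])

x = jnp.array([0.5, 2.0, -1.0])

# Forward mode: Jacobian-vector product Jv (one pass per input direction).
v = jnp.array([1.0, 0.0, 0.0])
_, jv = jax.jvp(f, (x,), (v,))

# Reverse mode: vector-Jacobian product J^T u (one pass per output direction).
u = jnp.array([1.0, 0.0])
_, vjp_fn = jax.vjp(f, x)
(jtu,) = vjp_fn(u)

print(jv)    # first column of the Jacobian J
print(jtu)   # first row of the Jacobian J
```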

Matrix AD extends this to matrix operations. The Jacobian of a matrix product \(Y = AX\) with respect to \(X\) is computed by the chain rule on unrolled entries, but efficient implementations use matrix-vector products rather than materialising the full Jacobian.

Chapter 5: Generalisation and Regularisation

A model that perfectly fits its training data often performs poorly on unseen test data — this is overfitting. The bias–variance tradeoff formalises this: a model with too few parameters (high bias) underfits; a model with too many parameters (high variance) memorises training noise and overfits.

Bias–Variance Decomposition

\[ \mathbb{E}[(y - \hat{y})^2] = \text{Bias}^2 + \text{Variance} + \text{Noise}. \]

The bias is the error from wrong assumptions in the model family; the variance is the error from sensitivity to fluctuations in training data. Increasing model complexity reduces bias but increases variance.

For deep neural networks, this classical picture is complicated by the double descent phenomenon (Belkin et al. 2019): test error decreases, then increases (classical overfitting), then decreases again as the number of parameters far exceeds the number of training examples. This interpolation regime — where the model exactly fits training data yet generalises well — is not explained by classical bias-variance theory and is an active research area.

Combating Overfitting

Train/validation/test split: hold out a validation set to tune hyperparameters (learning rate, regularisation strength, architecture), and a completely separate test set for final evaluation. Cross-validation averages over multiple train-validation splits.

L2 regularisation (weight decay) adds a penalty on the squared norm of the parameters,

\[ \mathcal{L}_\text{reg} = \mathcal{L} + \frac{\lambda}{2} \|\theta\|^2, \]

which encourages small weights and limits the model’s effective capacity. The gradient update becomes \(\theta \leftarrow (1 - \lambda \eta) \theta - \eta \nabla \mathcal{L}\) — a decay step before the gradient step.
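The equivalence between the penalty and a decay step is easy to see in code (a minimal sketch; grad_loss stands for the gradient of the unregularised loss):

```python
import numpy as np

def sgd_step_weight_decay(theta, grad_loss, lr=0.1, weight_decay=1e-4):
    """One SGD step on L + (lambda/2)||theta||^2.

    The penalty's gradient is lambda * theta, so the update is equivalent
    to shrinking theta by (1 - lr*lambda) before the usual gradient step."""
    return (1.0 - lr * weight_decay) * theta - lr * grad_loss

theta = np.array([1.0, -2.0, 3.0])
grad = np.array([0.5, 0.5, 0.5])
print(sgd_step_weight_decay(theta, grad))
```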

Data augmentation artificially increases training set size by applying label-preserving transformations: random crops, flips, colour jitter, rotation for images; synonym replacement or back-translation for text. Augmentation is one of the most effective regularisers in practice.

Dropout randomly silences neurons during training: each activation is multiplied by an independent Bernoulli mask and rescaled (inverted dropout),

\[ \tilde{y}_j = y_j \cdot \text{Bernoulli}(1-p) \cdot \frac{1}{1-p}. \]

At test time, all neurons are active (the \(1/(1-p)\) scaling makes expectations match). Dropout prevents neurons from co-adapting and acts as an ensemble of exponentially many thinned networks. It is most effective in fully connected layers; batch normalisation has largely supplanted it in convolutional layers.
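A minimal sketch of inverted dropout as described above (training-time only; at test time the layer is the identity):

```python
import numpy as np

def dropout(y, p=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: zero each activation with probability p and
    scale survivors by 1/(1-p) so the expectation matches test time."""
    if not training or p == 0.0:
        return y
    mask = rng.random(y.shape) >= p          # keep with probability 1-p
    return y * mask / (1.0 - p)

activations = np.ones(8)
print(dropout(activations, p=0.5))           # roughly half zeros, survivors = 2.0
```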

Chapter 6: Optimisation in Practice

Vanishing and Exploding Gradients

The gradient with respect to an early-layer weight is a product of Jacobians through every later layer:

\[ \frac{\partial \mathcal{L}}{\partial w^{(1)}} = \frac{\partial \mathcal{L}}{\partial y^{(L)}} \cdot \frac{\partial y^{(L)}}{\partial y^{(L-1)}} \cdots \frac{\partial y^{(2)}}{\partial y^{(1)}} \cdot \frac{\partial y^{(1)}}{\partial w^{(1)}}. \]

If the Jacobian norms are \(< 1\) (as with saturating sigmoids), the product vanishes exponentially — gradients in early layers are negligible, and learning stops (vanishing gradient). If the norms are \(> 1\), the product explodes — training diverges (exploding gradient).

Mitigations:

  • ReLU activations: gradient is exactly 1 for active neurons, avoiding saturation.
  • Careful weight initialisation: He initialisation sets \(w \sim \mathcal{N}(0, 2/n_\text{in})\) for ReLU; Glorot/Xavier initialisation for tanh. These ensure the variance of activations is approximately preserved across layers.
  • Gradient clipping: cap the gradient norm at a threshold \(\|\nabla\|_2 \leq c\) before the update step. Standard in RNN training.
  • Skip connections (residual networks): the identity shortcut \(y = F(x, W) + x\) ensures that gradients have a direct path back through the network, circumventing the product of Jacobians.

Stochastic Gradient Descent and Variants

Batch gradient descent computes the gradient over the entire dataset — accurate but slow for large datasets. Stochastic gradient descent (SGD) uses a single example per update — fast but noisy. Mini-batch SGD uses a small batch (typically 32–256 examples) — a compromise that leverages vectorised computation and provides enough noise to escape sharp minima.

SGD with momentum smooths the update with an exponential moving average of gradients:

\[ v_{t+1} = \beta v_t + (1-\beta) \nabla_\theta \mathcal{L}, \quad \theta_{t+1} = \theta_t - \eta v_{t+1}. \]

With momentum coefficient \(\beta \approx 0.9\), the effective learning rate in persistent directions is amplified by \(1/(1-\beta) = 10\times\), while oscillations across dimensions are damped.

Adam (Kingma–Ba 2015) combines momentum with per-parameter adaptive step sizes, using bias-corrected estimates of the first and second moments of the gradient:

\[ m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2, \]\[ \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \quad \theta_t = \theta_{t-1} - \frac{\eta \hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}. \]

Adam is the default optimiser for most deep learning research. Despite its practical success, its convergence theory is weaker than SGD — there exist simple convex problems where Adam diverges.
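A hedged NumPy sketch of one momentum step and one bias-corrected Adam step, using the common default hyperparameters:

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, beta=0.9):
    v = beta * v + (1 - beta) * grad
    return theta - lr * v, v

def adam_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)        # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

theta = np.array([1.0, -1.0])
m = v = np.zeros_like(theta)
grad = np.array([0.2, -0.4])
theta, m, v = adam_step(theta, m, v, grad, t=1)
print(theta)
```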

Chapter 7: Convolutional Neural Networks and Autoencoders

Autoencoders

An autoencoder learns to compress data and reconstruct it. It consists of an encoder \(f : \mathcal{X} \to \mathcal{Z}\) and a decoder \(g : \mathcal{Z} \to \mathcal{X}\), trained to minimise reconstruction loss \(\| g(f(x)) - x \|^2\). The bottleneck forces the network to learn a compressed representation (latent code) \(z = f(x) \in \mathcal{Z}\) with \(\dim \mathcal{Z} \ll \dim \mathcal{X}\).

Applications: dimensionality reduction (analogous to PCA, but nonlinear), denoising (train with noisy input, clean target), anomaly detection (high reconstruction error signals novel inputs), and as building blocks for generative models.

A denoising autoencoder learns to recover \(x\) from a corrupted version \(\tilde{x} = x + \epsilon\). This forces the encoder to learn which features are robust to noise — an implicit form of feature selection. Vincent et al. (2008) showed that denoising autoencoders learn features that capture the structure of the data manifold.
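As a sketch (not the course's reference model), a fully connected autoencoder in PyTorch might look like this; layer sizes and names are illustrative:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                                     nn.Linear(256, d_latent))
        self.decoder = nn.Sequential(nn.Linear(d_latent, 256), nn.ReLU(),
                                     nn.Linear(256, d_in))

    def forward(self, x):
        z = self.encoder(x)          # compressed latent code
        return self.decoder(z)       # reconstruction

model = Autoencoder()
x = torch.rand(16, 784)              # a batch of flattened images
loss = nn.functional.mse_loss(model(x), x)   # reconstruction loss ||g(f(x)) - x||^2
loss.backward()                      # gradients for an optimiser step
```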

The Visual System and CNNs

The mammalian visual cortex is hierarchically organised: V1 detects oriented edges, V2 combines edges into contours, V4 processes colour and shape, the inferotemporal cortex recognises complex objects. Crucially, neurons in V1 respond to local patches of the visual field (their receptive field), not the entire image.

Convolutional Neural Networks (LeCun et al. 1989, popularised by AlexNet 2012) implement these principles:

  • Local connectivity: each neuron connects to a small receptive field (e.g., \(3 \times 3\) patch).
  • Weight sharing: the same filter (weights) is applied at every spatial position. This introduces a translation equivariance prior: detecting a horizontal edge works the same way regardless of where in the image it appears.
  • Pooling: max-pooling or average-pooling over local regions provides translation invariance and reduces spatial dimensions.

A convolutional layer with \(K\) filters of size \(F \times F\) produces an output of depth \(K\); stacking layers builds up a hierarchy from edges (first layers) to object parts (middle layers) to object categories (final layers). The parameter count scales as \(K \cdot F^2 \cdot C_\text{in}\) per layer rather than \(O(H^2 W^2)\) for a fully connected layer — a massive reduction.
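The parameter-count comparison can be checked directly in PyTorch (shapes chosen for illustration):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
fc   = nn.Linear(3 * 32 * 32, 64 * 32 * 32)   # a fully connected layer of the same I/O size

conv_params = sum(p.numel() for p in conv.parameters())
fc_params   = sum(p.numel() for p in fc.parameters())
print(conv_params)    # 64 * 3 * 3 * 3 + 64 = 1,792
print(fc_params)      # ~201 million: weight sharing is a massive reduction

x = torch.rand(1, 3, 32, 32)
print(conv(x).shape)  # torch.Size([1, 64, 32, 32]) -- output depth K = 64
```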

Chapter 8: Hopfield Networks and Energy-Based Models

Hopfield Networks as Associative Memory

A Hopfield network is a fully connected network of binary neurons \(s_i \in \{-1, +1\}\) with symmetric weights \(W\), updated asynchronously by

\[ s_i \leftarrow \text{sgn}\left(\sum_j W_{ij} s_j\right). \]

Each update never increases the energy

\[ E(s) = -\frac{1}{2} s^T W s. \]

If patterns \(\xi^\mu \in \{-1,+1\}^N\) are stored with the Hebbian rule

\[ W = \frac{1}{N} \sum_\mu \xi^\mu (\xi^\mu)^T, \]

then given a partial or noisy version of a stored pattern, the network converges to the nearest stored pattern. The storage capacity is approximately \(0.14N\) patterns for \(N\) neurons (Amit–Gutfreund–Sompolinsky 1985).
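A small NumPy sketch of this storage-and-recall behaviour (pattern count and corruption level are illustrative; zeroing the diagonal is a common convention not stated above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 5
patterns = rng.choice([-1, 1], size=(P, N))            # stored patterns xi^mu

W = (patterns.T @ patterns) / N                        # Hebbian rule
np.fill_diagonal(W, 0)                                  # no self-connections

def recall(s, W, sweeps=5):
    """Asynchronous updates: s_i <- sgn(sum_j W_ij s_j)."""
    s = s.copy()
    for _ in range(sweeps):
        for i in rng.permutation(len(s)):
            s[i] = 1 if W[i] @ s >= 0 else -1
    return s

# Corrupt 15% of one stored pattern, then let the network clean it up.
probe = patterns[0].copy()
flip = rng.choice(N, size=15, replace=False)
probe[flip] *= -1
print(np.mean(recall(probe, W) == patterns[0]))         # usually 1.0
```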

Modern Hopfield networks (Ramsauer et al. 2020) generalise this to continuous states. With the stored patterns as the columns of \(X\) and a query state \(\xi\), the energy is

\[ E = -\text{lse}(\beta, X^T \xi) + \frac{1}{2}\xi^T \xi + \frac{1}{\beta} \log N + \frac{1}{2} M^2, \]

where lse is the log-sum-exp function. The update rule that minimises this energy is exactly the attention mechanism used in transformers, establishing a formal connection between associative memory and modern attention-based architectures. Modern Hopfield networks can store a number of patterns exponential in the dimension of the pattern space.

Batch Normalisation

Batch normalisation (Ioffe–Szegedy 2015) normalises each activation using statistics computed over the current mini-batch \(\mathcal{B}\):

\[ \hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \varepsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta, \]

where \(\mu_\mathcal{B}\) and \(\sigma_\mathcal{B}^2\) are the batch mean and variance, and \(\gamma, \beta\) are learned scale and shift parameters. Batch norm was introduced to reduce internal covariate shift (the distribution of activations changes as weights update); in practice it enables higher learning rates and acts as a regulariser. At inference, running mean and variance accumulated during training are used.
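A hedged sketch of the training-time computation (per-feature batch statistics; the running statistics used at inference are omitted):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalise each feature over the batch,
    then apply the learned scale gamma and shift beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))
gamma, beta = np.ones(4), np.zeros(4)
y = batch_norm_train(x, gamma, beta)
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))   # ~0 mean, ~1 std per feature
```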

Chapter 9: Variational Autoencoders

A standard autoencoder maps each input to a single point in latent space. This makes generation difficult: points in latent space that were never trained on produce garbage. Variational autoencoders (VAEs; Kingma–Welling 2013) impose a probabilistic structure on the latent space.

The VAE Objective

The encoder outputs a distribution \(q_\phi(z \mid x)\) rather than a point, and training maximises the evidence lower bound (ELBO):

\[ \mathcal{L}_\text{ELBO}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - D_\text{KL}(q_\phi(z \mid x) \| p(z)). \]

The first term is a reconstruction loss; the second (KL divergence) regularises the posterior toward the standard normal prior. Together, these encourage a smooth, connected latent space where nearby points decode to similar images.

Reparameterisation trick: to backpropagate through the sampling step \(z \sim q_\phi(z \mid x)\), rewrite \(z = \mu_\phi(x) + \sigma_\phi(x) \odot \varepsilon\) where \(\varepsilon \sim \mathcal{N}(0, I)\). The stochasticity is now in \(\varepsilon\), which does not depend on \(\phi\), so gradients flow through \(\mu_\phi\) and \(\sigma_\phi\) normally.
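In PyTorch, the reparameterised sample and a Gaussian-posterior ELBO look roughly like this (a sketch: the encoder and decoder are assumed to exist elsewhere, and the reconstruction term is written for a Bernoulli likelihood with inputs in \([0,1]\)):

```python
import torch

def reparameterise(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I); gradients flow through mu and sigma."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def negative_elbo(x, x_recon, mu, log_var):
    # Reconstruction term (Bernoulli likelihood -> binary cross-entropy; x, x_recon in [0, 1]).
    recon = torch.nn.functional.binary_cross_entropy(x_recon, x, reduction="sum")
    # KL( N(mu, sigma^2) || N(0, I) ) in closed form.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl          # negative ELBO, to be minimised
```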

VAEs allow smooth interpolation in latent space (walking from one face to another), conditional generation (given a class label as auxiliary input), and anomaly detection (high KL or low reconstruction likelihood signals anomalies).

Chapter 10: Recurrent Neural Networks

Feedforward networks process fixed-length inputs. For sequential data — time series, text, audio — we need networks with memory: the ability to use context from earlier inputs when processing later ones.

Recurrent Neural Networks

An RNN maintains a hidden state \(h_t\), updated at each time step from the previous state and the current input:

\[ h_t = \sigma_h(W_h h_{t-1} + W_x x_t + b), \qquad y_t = \sigma_y(W_y h_t + b_y). \]

All time steps share the same weight matrices, making RNNs parameter-efficient. They can in principle remember information over arbitrary time spans.
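A minimal NumPy sketch of the recurrence with a tanh hidden activation (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 3, 5
W_x = rng.normal(size=(d_h, d_in)) * 0.1
W_h = rng.normal(size=(d_h, d_h)) * 0.1
b = np.zeros(d_h)

def rnn_forward(xs):
    """xs: sequence of input vectors. The same weights are reused at every step."""
    h = np.zeros(d_h)
    hidden_states = []
    for x in xs:
        h = np.tanh(W_h @ h + W_x @ x + b)
        hidden_states.append(h)
    return hidden_states

sequence = [rng.normal(size=d_in) for _ in range(10)]
print(rnn_forward(sequence)[-1])    # final hidden state summarises the sequence
```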

Backpropagation through time (BPTT) unrolls the RNN for \(T\) time steps and applies backprop to the resulting (very deep) feedforward network. The gradient through \(T\) steps involves repeated multiplication by \(W_h\) (roughly \(W_h\) raised to the \(T\)th power): if the largest singular value of \(W_h\) is \(> 1\), gradients explode; if \(< 1\), they vanish. This is the vanishing/exploding gradient problem, severe for long sequences.

Long Short-Term Memory

LSTM (Hochreiter–Schmidhuber 1997) introduces a separate cell state \(c_t\) that can carry information over long time spans, modulated by three gates:

  • Forget gate: \(f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)\) — how much of the previous cell state to forget.
  • Input gate: \(i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)\) — how much of a new candidate \(\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)\) to add.
  • Output gate: \(o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\) — how much of the cell state to expose as output.

Cell state update: \(c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\). Hidden state: \(h_t = o_t \odot \tanh(c_t)\).

The multiplicative gating allows the cell state to be selectively written and erased, providing a gradient highway that avoids the vanishing gradient problem for relevant dependencies. LSTMs dominated sequential modelling until the attention mechanism (Chapter 13) took over.
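A direct NumPy transcription of these gate equations (random placeholder weights; a real implementation would learn them):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
d_cat = d_h + d_in                         # gates act on the concatenation [h_{t-1}, x_t]
W_f, W_i, W_c, W_o = (rng.normal(size=(d_h, d_cat)) * 0.1 for _ in range(4))
b_f = b_i = b_c = b_o = np.zeros(d_h)

def lstm_step(h_prev, c_prev, x):
    v = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ v + b_f)             # forget gate
    i = sigmoid(W_i @ v + b_i)             # input gate
    c_tilde = np.tanh(W_c @ v + b_c)       # candidate cell state
    o = sigmoid(W_o @ v + b_o)             # output gate
    c = f * c_prev + i * c_tilde           # cell state update
    h = o * np.tanh(c)                     # hidden state
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for x in [rng.normal(size=d_in) for _ in range(5)]:
    h, c = lstm_step(h, c, x)
print(h)
```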

Chapter 11: Adversarial Attacks and Defences

Adversarial Examples

Deep neural networks, despite high accuracy on natural images, are remarkably vulnerable to carefully crafted adversarial examples: perturbations to inputs that are imperceptible to humans but cause the network to misclassify with high confidence (Szegedy et al. 2014, Goodfellow et al. 2015).

The fast gradient sign method (FGSM; Goodfellow et al. 2015) constructs an adversarial example in a single gradient step:

\[ x_\text{adv} = x + \varepsilon \cdot \text{sign}(\nabla_x \mathcal{L}(\theta, x, y)). \]

The perturbation moves each pixel by \(\pm \varepsilon\) in the direction that increases the loss. For small \(\varepsilon\), the image looks identical to humans but fools the classifier.

Projected gradient descent (PGD) iterates smaller gradient-sign steps of size \(\alpha\), projecting back onto the \(\varepsilon\)-ball \(B_\varepsilon(x)\) around the original input after each step:

\[ x_{t+1} = \Pi_{B_\varepsilon(x)} \left( x_t + \alpha \cdot \text{sign}(\nabla_x \mathcal{L}(\theta, x_t, y)) \right). \]

PGD finds stronger adversarial examples and is used as the inner maximisation in adversarial training.
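A hedged PyTorch sketch of the single-step FGSM attack (the classifier model and the batch x, y are assumed; PGD repeats a smaller step like this, followed by projection onto the \(\varepsilon\)-ball):

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """One-step FGSM: perturb x by epsilon * sign(grad_x loss)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()
        x_adv = x_adv.clamp(0.0, 1.0)      # keep a valid image
    return x_adv.detach()
```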

Transferability: adversarial examples often transfer across models — an example crafted against model A often fools model B, even if the architectures differ. This enables black-box attacks where the attacker has no access to the target model’s weights or gradients.

Adversarial Defences

Adversarial training (Madry et al. 2018) minimises the worst-case loss within the perturbation ball:

\[ \min_\theta \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ \max_{\delta \in B_\varepsilon} \mathcal{L}(\theta, x + \delta, y) \right]. \]

This is a minimax problem: the inner maximisation finds the worst-case perturbation, and the outer minimisation learns to be robust to it. PGD-based adversarial training provides strong empirical (though not certified) robustness within the \(\varepsilon\)-ball, but significantly increases training cost and often reduces clean accuracy.

TRADES (Zhang et al. 2019) trades off clean accuracy against robustness explicitly:

\[ \mathcal{L}_\text{TRADES} = \mathcal{L}_\text{CE}(f(x), y) + \beta \cdot \max_{\delta \in B_\varepsilon} D_\text{KL}(f(x) \| f(x+\delta)). \]

The KL term penalises the model for changing its prediction under adversarial perturbation, without requiring the adversarial prediction to be correct. TRADES achieves a better robustness–accuracy tradeoff than vanilla adversarial training.

Chapter 12: Generative Adversarial Networks

The GAN Framework

A generative adversarial network (Goodfellow et al. 2014) trains a generator and a discriminator as opponents in a minimax game:

\[ \min_G \max_D\; \mathbb{E}_{x \sim p_\text{data}} [\log D(x)] + \mathbb{E}_{z \sim p_z} [\log(1 - D(G(z)))]. \]

The generator \(G : \mathcal{Z} \to \mathcal{X}\) maps noise vectors \(z \sim \mathcal{N}(0,I)\) to synthetic images. The discriminator \(D : \mathcal{X} \to [0,1]\) outputs the probability that its input is real (rather than generated). The optimal discriminator is \(D^*(x) = p_\text{data}(x) / (p_\text{data}(x) + p_G(x))\).

At the Nash equilibrium, \(G\) perfectly reproduces the data distribution (\(p_G = p_\text{data}\)) and \(D\) outputs \(1/2\) everywhere. With the optimal \(D^*\) substituted in, the generator's objective equals \(2\, D_\text{JS}(p_\text{data} \,\|\, p_G) - \log 4\), so training \(G\) minimises the Jensen–Shannon divergence between \(p_\text{data}\) and \(p_G\).

Training GANs is notoriously unstable. Two common failure modes:

  • Mode collapse: the generator learns to produce only a few types of outputs (ignoring the full diversity of the training set).
  • Training oscillation: the generator and discriminator enter a cycle without converging.

Practical improvements: Wasserstein GAN (Arjovsky et al. 2017) replaces JSD with the Wasserstein-1 distance, providing better gradients and more stable training. Spectral normalisation of the discriminator controls its Lipschitz constant. Progressive growing (Karras et al. 2018) trains at progressively higher resolutions.

Chapter 13: Word Embeddings and Representation Learning

Distributional Semantics

The distributional hypothesis (Harris 1954) states that words appearing in similar contexts have similar meanings. This motivates learning word representations from co-occurrence statistics.

Word2Vec (Mikolov et al. 2013) trains a shallow neural network to predict either:

  • CBOW (Continuous Bag of Words): predict a word from its context.
  • Skip-gram: predict the context from a word.

The skip-gram objective is the average log-probability of context words within a window of radius \(c\):

\[ \mathcal{L} = \frac{1}{T} \sum_{t=1}^T \sum_{-c \leq j \leq c, j \neq 0} \log p(w_{t+j} \mid w_t), \]

where \(p(w_O \mid w_I) = e^{v_{w_O}^T v_{w_I}} / \sum_w e^{v_w^T v_{w_I}}\). Since the vocabulary is large (often 100k–1M words), computing the softmax denominator over every word is too expensive; word2vec instead uses negative sampling: sample a handful of random “noise” words and train a binary classifier to distinguish the true context word from the noise words.

Geometry of word2vec: the learned embeddings capture semantic relationships geometrically. The famous example: \(\text{vec}(\text{king}) - \text{vec}(\text{man}) + \text{vec}(\text{woman}) \approx \text{vec}(\text{queen})\). Countries relate to capitals, verb tenses relate to base forms — the embedding space captures a surprising amount of linguistic structure.

Beyond word2vec, GloVe (Pennington et al. 2014) factorises the log-probability co-occurrence matrix directly. FastText (Bojanowski et al. 2017) represents words as bags of character n-grams, handling out-of-vocabulary words. Contextual embeddings (ELMo, BERT) represent each word occurrence differently depending on context, addressing the polysemy limitation of static embeddings.

Chapter 14: Predictive Coding and Biological Plausibility

The Predictive Coding Hypothesis

Standard backpropagation is biologically implausible: it requires symmetric weights between forward and backward passes, non-local computations (the error signal depends on all downstream layers), and a separate backward pass. Yet the brain is a learning system.

Predictive coding (Rao–Ballard 1999) proposes that the brain constantly generates predictions of sensory input and updates representations based on prediction errors. At each cortical level, higher areas send top-down predictions to lower areas; lower areas return bottom-up prediction errors. Learning minimises the free energy (prediction error variance).

Weight updates in predictive coding are local and Hebbian-like:

\[ \Delta w_{ij} \propto \varepsilon_i^{(\ell+1)} \cdot y_j^{(\ell)}, \]

where \(\varepsilon_i^{(\ell+1)} = y_i^{(\ell+1)} - \hat{y}_i^{(\ell+1)}\) is the prediction error at level \(\ell+1\). This has a local form — each weight update depends only on pre- and post-synaptic activities and the local prediction error — making it more biologically plausible than backprop.

Millidge et al. (2020) showed that under certain approximations, predictive coding performs gradient descent on the same loss as backpropagation, establishing a mathematical equivalence. Ongoing research investigates whether the equivalence extends to the full convergence behaviour.

Chapter 15: Beyond This Course

Natural Next Topics

CS 479 covers the foundations of neural networks through the lens of biological plausibility, learning theory, and practical architectures including RNNs, CNNs, Hopfield networks, VAEs, and GANs. The field has advanced rapidly since 2021, and several major architectural and theoretical developments extend directly from this foundation.

Transformers and attention. The transformer (Vaswani et al. 2017) replaces recurrence with self-attention; its core operation is scaled dot-product attention,

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V, \]

where queries \(Q = XW^Q\), keys \(K = XW^K\), and values \(V = XW^V\) are linear projections of the input sequence \(X\). Each attention head computes a weighted average of the values, with weights given by the softmax of the dot-product similarity between queries and keys. Multi-head attention runs \(h\) attention heads in parallel and concatenates their outputs.
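A single-head version of this formula in PyTorch (sequence length, model dimension, and tensor names are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k). Returns softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)    # each query's distribution over keys
    return weights @ V

X = torch.rand(10, 64)                          # a toy input sequence
W_Q, W_K, W_V = (torch.rand(64, 64) for _ in range(3))
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)                                # torch.Size([10, 64])
```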

The transformer architecture also includes: positional encodings (since attention is permutation-invariant, we add sinusoidal or learned position signals to break symmetry), feed-forward sublayers (position-wise MLPs applied after attention), layer normalisation, and residual connections. The encoder-decoder structure of the original Transformer was designed for machine translation; decoder-only transformers (GPT) excel at language modelling; encoder-only (BERT) excel at understanding tasks.

Pre-trained large language models represent the largest step-change in AI capability in a generation. GPT-2 (1.5B parameters, 2019) showed emergent text generation; GPT-3 (175B parameters, 2020) demonstrated few-shot learning without fine-tuning — the model solves new tasks from just a handful of examples in the prompt. GPT-4 and equivalent systems (2023) achieve expert-level performance on professional exams. Understanding why large models exhibit emergent capabilities — abilities that suddenly appear as scale increases — is a central open problem.

Vision Transformers (ViT) apply the transformer directly to image patches: a \(224 \times 224\) image is split into \(14 \times 14 = 196\) non-overlapping patches of size \(16 \times 16\) pixels, linearly embedded, and fed as a sequence to a transformer encoder. ViT matches or exceeds CNNs on large datasets. CLIP (Contrastive Language-Image Pre-Training; Radford et al. 2021) jointly trains a visual encoder and text encoder to match image-text pairs, enabling zero-shot image classification by comparing image embeddings to text embeddings of class names.

Diffusion models. Denoising diffusion probabilistic models (Ho et al. 2020) define a forward process that gradually corrupts the data with Gaussian noise,

\[ q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I), \]

and train a network \(\varepsilon_\theta\) to predict the added noise via the simplified objective

\[ \mathcal{L}_\text{simple} = \mathbb{E}_{t, x_0, \varepsilon}\left[\|\varepsilon - \varepsilon_\theta(\sqrt{\bar\alpha_t} x_0 + \sqrt{1-\bar\alpha_t}\varepsilon,\, t)\|^2\right], \]

where \(\bar\alpha_t = \prod_{s=1}^t (1-\beta_s)\). Sampling starts from pure Gaussian noise \(x_T \sim \mathcal{N}(0, I)\) and iteratively denoises. Diffusion models underlie Stable Diffusion, DALL-E 2/3, and Imagen. The mathematical connection is to score matching (Hyvärinen 2005): the optimal denoising network estimates \(\nabla_x \log p(x)\) — the score of the data distribution — and the reverse process is a stochastic differential equation driven by this score.
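A hedged sketch of one training step under the simplified objective (the noise-prediction network eps_model, taking the noisy sample and the timestep, is assumed to exist; the schedule values are illustrative):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule beta_t
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # \bar{alpha}_t = prod_s (1 - beta_s)

def diffusion_loss(eps_model, x0):
    """L_simple: predict the noise added at a randomly chosen timestep."""
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast to x0's shape
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * eps
    return torch.mean((eps - eps_model(x_t, t)) ** 2)
```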

Classifier-free guidance (Ho–Salimans 2022) sharpens conditional generation by extrapolating between the conditional and unconditional noise predictions:

\[ \tilde\varepsilon_\theta(x_t, c) = (1 + w)\,\varepsilon_\theta(x_t, c) - w\,\varepsilon_\theta(x_t, \emptyset). \]

Higher guidance weight \(w\) produces images more faithful to the condition at the cost of diversity. Flow matching (Lipman et al. 2022) provides a cleaner theoretical framework: train the vector field that maps a noise distribution to the data distribution along straight-line paths, enabling faster sampling.

Reinforcement learning and RLHF. Reinforcement learning (RL) trains an agent to maximise cumulative reward by interacting with an environment. The central challenge is the credit assignment problem: which actions in a long sequence contributed to a reward received much later? Policy gradient methods (REINFORCE, Actor-Critic, PPO) directly optimise the expected return using gradient estimates. Q-learning (DQN; Mnih et al. 2013) learns a value function estimating future rewards and selects the greedy action. Deep RL combined with self-play produced superhuman performance at Go (AlphaGo/AlphaZero 2017), Chess, and Starcraft II (AlphaStar 2019).

Reinforcement Learning from Human Feedback (RLHF) is the key technique behind instruction-following large language models. The procedure: (1) pre-train a language model on text; (2) fine-tune on demonstrations of desired behaviour (SFT); (3) train a reward model on human preference data (which of two responses is better?); (4) optimise the language model against the reward model using PPO, with a KL penalty against the SFT model to prevent reward hacking. InstructGPT (Ouyang et al. 2022), ChatGPT, Claude, and Gemini all use RLHF or direct preference optimisation (DPO; Rafailov et al. 2023) variants.

Neural ODEs and continuous normalising flows. Residual networks iterate \(h_{t+1} = h_t + f(h_t, \theta)\) — each layer adds a small update to the hidden state. Taking the limit \(\Delta t \to 0\) gives the Neural ODE (Chen et al. 2018): the hidden state evolves according to an ODE \(dh/dt = f(h(t), t, \theta)\), and the forward pass integrates this ODE from \(t = 0\) to \(t = 1\). The adjoint method computes gradients of the loss with respect to \(\theta\) without storing intermediate states — only the final state and a reverse-time ODE are needed.

Normalising flows are invertible neural networks that transform a simple distribution (e.g., Gaussian) into a complex data distribution, with an explicit likelihood. Flow matching (also: conditional flow matching, rectified flow) trains a neural network to predict the velocity field of an ODE that moves samples from noise to data along near-straight paths, enabling high-quality generation with fewer steps than diffusion.

Follow-up Courses and Reading

Moving from CS 479 to research-level deep learning requires both theoretical depth (statistical learning theory, information theory) and practical breadth (distributed systems, hardware, empirical methodology). Here is a structured path.

At UWaterloo: CS 485 (Machine Learning, Shai Ben-David) provides the statistical learning theory foundations — PAC learning, VC dimension, Rademacher complexity, and uniform convergence — that underpin generalisation theory. CS 680 (Introduction to Machine Learning) and CS 886 (Advanced Topics: Deep Learning) are graduate-level continuations. STAT 441/841 (Classification) and STAT 946 (Topics in Probability and Statistics) provide the probabilistic and statistical foundations. CS 798 (Research Topics in Machine Learning) offers seminar exposure to current research.

Stanford courses (all materials freely available): CS231n (CNNs, Karpathy/Li/Johnson) covers image recognition architectures in depth — every lecture slide and assignment is on the course website. CS224n (NLP, Manning/Socher/Clark) covers RNNs, attention, transformers, and modern LLMs. CS234 (Reinforcement Learning, Brunskill) covers RL theory and practice. CS330 (Multi-Task Learning, Finn) covers meta-learning and transfer.

MIT courses: 6.S191 Introduction to Deep Learning (introtodeeplearning.com) provides a fast modern overview with labs in TensorFlow. 6.867 Machine Learning (Jaakkola/Barzilay) provides the theoretical treatment.

Key textbooks and online resources:

  • Goodfellow, Bengio, Courville Deep Learning (MIT Press, 2016; deeplearningbook.org) — the comprehensive graduate textbook covering feedforward networks, CNNs, RNNs, optimisation, regularisation, and generative models.
  • Bishop Pattern Recognition and Machine Learning (Springer, 2006) — Bayesian and probabilistic perspective, covering EM, Gaussian processes, and variational inference.
  • Murphy Probabilistic Machine Learning: An Introduction (MIT Press, 2022) and Advanced Topics (2023) — the most current probabilistic treatment, free online at probml.ai.
  • Shalev-Shwartz and Ben-David Understanding Machine Learning (Cambridge, 2014) — rigorous PAC learning and statistical learning theory.
  • Fast.ai Practical Deep Learning for Coders (course, fast.ai) — top-down practical approach; excellent for building intuition before diving into theory.

For research: follow arXiv cs.LG (learning), cs.CV (vision), cs.CL (language), and stat.ML. Anthropic’s Transformer Circuits thread (transformer-circuits.pub), Chris Olah’s Distill articles, and Andrej Karpathy’s “Zero to Hero” video series are outstanding pedagogical resources.

Frameworks and tools: PyTorch (pytorch.org) and JAX (jax.readthedocs.io) are the dominant research frameworks. Hugging Face Transformers provides implementations of hundreds of pre-trained models. For large-scale training: DeepSpeed (Microsoft), Megatron-LM (Nvidia), and JAX’s pmap provide distributed training utilities.

Open Problems and Active Research

Neural networks are the most active area of AI research, with major open problems spanning theory, systems, and safety.

Generalisation theory for deep networks. The double descent phenomenon — test error decreasing as model size grows far beyond the interpolation threshold — contradicts classical bias-variance theory. Partial explanations exist: the Neural Tangent Kernel (NTK) theory (Jacot et al. 2018) shows that infinitely wide networks trained with gradient descent behave as kernel methods, with the NTK as the kernel. However, NTK theory requires infinite width and is a poor approximation for practical finite-width networks with feature learning. Feature learning — the crucial property that finite-width networks learn useful representations that evolve during training, unlike kernel methods with fixed features — lacks a complete theoretical treatment. The connection between sharpness of the loss landscape, flatness of the minima found by SGD, and generalisation remains debated. Tight generalisation bounds for practical networks are an open problem.

Mechanistic interpretability. Transformers are deployed in safety-critical settings but remain largely opaque. Mechanistic interpretability aims to reverse-engineer the algorithms implemented by neural network weights. Progress includes: discovery of induction heads (circuits that implement in-context learning by attending to previous instances of a pattern), indirect object identification circuits (multi-step computation over token representations), and superposition (the hypothesis that networks represent more features than they have neurons, by superimposing multiple features as nearly orthogonal directions). Whether full mechanistic interpretability is achievable in principle — whether neural networks implement intelligible algorithms or perform fundamentally statistical pattern matching — is philosophically and empirically open.

Scaling laws and emergent capabilities. Kaplan et al. (2020) showed that language model loss scales as a power law in compute, data, and parameters: \(L(N) \propto N^{-\alpha}\) for model size \(N\), with \(\alpha \approx 0.076\). Chinchilla (Hoffmann et al. 2022) optimised the compute-data tradeoff. But emergent capabilities — abilities that appear suddenly at scale — are not predicted by smooth scaling laws. Tasks like arithmetic, chain-of-thought reasoning, and code generation appear only above a threshold model size. Whether emergent capabilities are truly discontinuous (a phase transition in capabilities) or merely an artifact of evaluation metrics is an active debate. Understanding what determines which capabilities emerge at what scale is a central open question.

Alignment and goal specification. As large models are deployed as autonomous agents, ensuring they pursue intended goals becomes critical. Reward hacking (optimising a proxy reward in unintended ways) and Goodhart’s law (“when a measure becomes a target, it ceases to be a good measure”) are empirically observed in RLHF-trained models. Goal misgeneralisation occurs when the training process selects for a policy that generalises correctly in training but pursues a different objective in deployment — a form of distributional shift in the objective itself. Constitutional AI (Anthropic), debate (Irving et al.), and scalable oversight (asking the AI to assist in evaluating itself) are proposed approaches, but none is fully satisfactory. The fundamental difficulty: evaluating whether an agent is “aligned” requires understanding its internals, which is the interpretability problem.

Robustness and distribution shift. Neural networks trained on a distribution \(P_\text{train}\) often fail on inputs from a shifted distribution \(P_\text{test} \neq P_\text{train}\) (different hospitals, different geographic regions, different demographics). The failure mode is spurious correlations: features that are predictive in training data due to dataset biases but not causally related to the label. Causal machine learning (Schölkopf et al., Arjovsky et al.’s IRM) seeks representations that are causally invariant across environments. Whether invariant risk minimisation or similar approaches can robustly generalise out of distribution — and when this is information-theoretically impossible — is open.

Energy efficiency and spiking networks. GPT-4 is estimated to consume ~500 megawatt-hours per training run. The human brain operates at ~20 watts with comparable cognitive capabilities in some domains. The efficiency gap is partly architectural (dense float32 matrix multiplications vs. sparse binary spike events) and partly hardware (CPUs/GPUs optimised for dense computation vs. neuromorphic chips with event-driven processing). Intel Loihi 2 and IBM NorthPole demonstrate orders-of-magnitude better energy-per-inference for specific workloads. Whether spiking neural networks (SNNs) can match dense networks on complex tasks — and whether the biological efficiency advantage can be replicated in silicon — is an open research question combining neuroscience, hardware design, and algorithms.

Theoretical foundations of diffusion and flow models. The empirical success of diffusion models is striking — they produce higher-quality images than GANs with more stable training. But the theoretical understanding is incomplete. Why do denoising score networks generalise from training data to novel generations? What is the correct information-theoretic analysis of the score matching objective? How does the sampling algorithm interact with the learned score to produce samples that are both diverse and high-quality? The connection to stochastic differential equations (Langevin dynamics, score-based SDEs; Song et al. 2021) provides a partial framework, but a complete theoretical account of why diffusion models produce such high-quality samples remains an open problem.

In-context learning and the nature of few-shot capabilities. Large language models exhibit a striking ability known as in-context learning (ICL): given a few input-output examples in the prompt at inference time (without updating any weights), the model adapts its behaviour to the demonstrated task. GPT-3 performs arithmetic, translation, and code generation from a handful of examples without any fine-tuning. The theoretical question is: what mechanism enables ICL? One hypothesis (Akyürek et al. 2022; von Oswald et al. 2023) is that transformers implement implicit gradient descent — the attention mechanism can simulate one step of linear regression or gradient descent on the in-context examples. This would mean transformers learn an “inner optimiser” during pre-training that runs during inference. Another view (Xie et al. 2022) is Bayesian: ICL performs implicit Bayesian inference over a latent concept class, with the pre-training distribution inducing a prior. Whether these views are consistent, and whether they extend to non-linear tasks, is an active research direction with implications for understanding why scale produces new capabilities.

Multimodal foundation models and world models. Recent systems combine vision, language, audio, and action modalities in a single neural network. GPT-4V, Gemini Ultra, Claude 3 Opus, and LLaMA-based multimodal models process images and text together; AudioLM generates coherent speech and music; RT-2 (Google, 2023) uses vision-language pre-training to control physical robots, generalising to novel objects and tasks. The architectural innovation enabling this is typically a shared transformer backbone that processes tokenised representations from multiple modalities — images become patches (ViT-style), audio becomes spectrograms, actions become discrete tokens. World models (Ha–Schmidhuber 2018, DreamerV3, Genie) learn compact internal representations of environment dynamics and use them for planning and imagination. Whether a neural network can learn a world model that generalises reliably — that correctly predicts the consequences of novel actions in novel environments — is a key prerequisite for safe autonomous systems. The question of “what is a world model” and how to evaluate its accuracy and completeness is one of the most important open problems in AI.

Causal reasoning and compositionality. Current neural networks excel at pattern recognition but struggle with systematic compositional generalisation — applying known rules to new combinations. The SCAN benchmark (Lake–Baroni 2018) showed that LSTMs trained to map commands to action sequences completely fail when the test commands combine familiar words in novel ways that require compositional understanding (“jump twice and walk thrice” from “jump” and “walk” and “twice”). Transformers do much better but still fail on sufficiently out-of-distribution composites. Causal reasoning — understanding not just correlation but the effect of interventions (“if I do X, what happens?”) — requires models of cause and effect, not just associations. Pearl’s do-calculus provides a formal framework for causal inference from observational and interventional data; whether neural networks trained on passive data can learn causal graphs (or whether architectural inductive biases can encode causal structure) is a central open question at the boundary of machine learning and cognitive science. The answer has direct implications for the reliability of AI in medical diagnosis, autonomous driving, and scientific discovery — domains where spurious correlations can be actively harmful.

Neural networks for scientific discovery. Beyond pattern recognition in natural language and images, neural networks are transforming scientific computing. Physics-informed neural networks (PINNs) (Raissi–Perdikaris–Karniadakis 2019) solve partial differential equations by training a network to minimise the residual of the PDE at collocation points, enabling fast surrogate models for fluid dynamics, heat transfer, and structural mechanics. Neural operators (DeepONet, Fourier Neural Operator; Li et al. 2020) learn operators between function spaces — mapping boundary conditions to solutions, or initial conditions to time-evolved states — with architecture that respects the continuous nature of functions. AlphaFold2 (discussed in the CS 482 context) is perhaps the clearest demonstration of neural networks for scientific discovery: a problem that took experimental biologists decades of work per protein was essentially solved in a single model. Graph neural networks (GNNs) have transformed molecular property prediction, drug discovery (predicting binding affinity, toxicity, ADMET properties from molecular graphs), and materials science (predicting crystal properties, finding new superconductors). Whether neural networks can go beyond interpolation in scientific domains — whether they can propose genuinely novel physical mechanisms, not just efficient interpolators of training data — is a fundamental question about the limits of learned representations as a tool for science.

Reinforcement learning for real-world decision making. CS 479 introduced RLHF and the basic RL framework. Research-level RL covers many more settings. Offline RL (also called batch RL) trains policies from a fixed dataset of previously collected experience, without any further environment interaction — essential when environment interaction is expensive (medical treatment, robotics, power grid control). Conservative Q-learning (CQL) and Decision Transformer are two prominent approaches to offline RL. Reward-free exploration asks how many environment interactions are needed to learn a good policy for any reward function specified later — an exploration problem that underlies efficient algorithm design for large state spaces. Safe RL constrains the agent to avoid catastrophic actions during training and deployment; Lyapunov-based methods and constrained MDPs provide formal safety guarantees but are computationally intensive. RL for large language model alignment connects the CS 479 material on RLHF to the fundamental alignment questions: whether RLHF-trained models are actually safer, how reward models are evaluated, and whether iterative RLHF can converge to human-aligned behaviour or simply learns to game the reward model. The ongoing debate between RLHF and Direct Preference Optimisation (DPO) — which avoids explicit reward modelling by directly fine-tuning on preference data — is one of the most active practical research questions in LLM development.
