PHYS 449: Machine Learning in Physics

Roger Melko

Estimated study time: 26 minutes


Sources and References

Primary references

  • I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016) — deeplearningbook.org
  • K. Murphy, Probabilistic Machine Learning: An Introduction (MIT Press, 2022) — probml.github.io/pml-book/book1.html
  • C. Bishop, Pattern Recognition and Machine Learning (Springer, 2006)

Physics-specific references

  • J. Carrasquilla and R. G. Melko, “Machine learning phases of matter,” Nature Physics 13 (2017), 431–434
  • P. Mehta et al., “A high-bias, low-variance introduction to machine learning for physicists,” Physics Reports 810 (2019), 1–124
  • G. Carleo et al., “Machine learning and the physical sciences,” Rev. Mod. Phys. 91 (2019), 045002

Online resources

  • Stanford CS229 materials — cs229.stanford.edu
  • UBC CPSC 340/540 notes — cs.ubc.ca

Chapter 1: Learning Theory in Physics — Connections and Foundations

Section 1.1: Why Physics and Machine Learning?

Machine learning and statistical physics share deep mathematical structures. The central theme of this course:

Statistical mechanics provides a natural language for machine learning, and ML provides powerful tools for physics.

Key correspondences:

| Statistical Mechanics | Machine Learning |
| --- | --- |
| Energy function \(E(\mathbf{s})\) | Loss function / negative log-likelihood |
| Boltzmann distribution \(p(\mathbf{s}) \propto e^{-\beta E(\mathbf{s})}\) | Softmax / Gibbs distribution |
| Partition function \(Z = \sum_\mathbf{s} e^{-\beta E(\mathbf{s})}\) | Normalizing constant |
| Free energy \(F = -\frac{1}{\beta}\log Z\) | Log-partition function |
| Entropy \(S = -\sum_i p_i \log p_i\) | Shannon entropy / cross-entropy loss |
| Phase transition | Change of regime in learning dynamics |
| Renormalization group (RG) | Deep learning hierarchy / coarse-graining |

Section 1.2: Statistical Learning Theory

Learning problem: given a dataset \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N\) drawn i.i.d. from an unknown distribution \(p(\mathbf{x}, y)\), find a function \(f_\theta: \mathcal{X} \to \mathcal{Y}\) that generalizes to new samples.

The generalization error (true risk):

\[ R[f] = \mathbb{E}_{(\mathbf{x},y) \sim p}[\ell(y, f(\mathbf{x}))]. \]

The empirical risk (training error):

\[ \hat{R}[f] = \frac{1}{N}\sum_{i=1}^N \ell(y_i, f(\mathbf{x}_i)). \]

The bias-variance tradeoff: for mean-squared error,

\[ \mathbb{E}[(y - f(\mathbf{x}))^2] = \underbrace{(\mathbb{E}[f(\mathbf{x})] - y)^2}_{\text{bias}^2} + \underbrace{\mathrm{Var}[f(\mathbf{x})]}_{\text{variance}} + \underbrace{\sigma^2_\varepsilon}_{\text{noise}}. \]

Simple models have high bias (underfitting); complex models have high variance (overfitting).
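The tradeoff can be seen directly by fitting polynomials of increasing degree to noisy data (the target function, noise level, and degrees below are illustrative choices, not course data):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a smooth target: y = sin(pi x) + Gaussian noise."""
    x = rng.uniform(-1, 1, n)
    return x, np.sin(np.pi * x) + 0.3 * rng.normal(size=n)

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

mse = {}
for degree in (1, 3, 12):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    mse[degree] = {
        "train": float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)),
        "test": float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)),
    }

for d, e in mse.items():
    print(f"degree {d:2d}: train MSE {e['train']:.3f}, test MSE {e['test']:.3f}")
```

Training error can only decrease with degree (the model families are nested), while test error is minimized at an intermediate complexity.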

Section 1.3: Information Theory

Shannon entropy quantifies uncertainty in a distribution \(p\):

\[ H(p) = -\sum_i p_i \log p_i = \mathbb{E}[-\log p]. \]

KL divergence (relative entropy) measures how much distribution \(q\) differs from \(p\):

\[ D_{KL}(p \| q) = \sum_i p_i \log \frac{p_i}{q_i} \geq 0, \quad \text{with equality iff } p = q. \]

Cross-entropy:

\[ H(p, q) = H(p) + D_{KL}(p \| q) = -\sum_i p_i \log q_i. \]

Minimizing cross-entropy loss is equivalent to maximum likelihood estimation: if \(p\) is the true label distribution and \(q_\theta\) is the model’s predicted distribution, then \(\min_\theta H(p, q_\theta)\) is equivalent to \(\max_\theta \sum_i \log q_\theta(y_i|\mathbf{x}_i)\).
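The identity \(H(p, q) = H(p) + D_{KL}(p \| q)\) is easy to verify numerically (the two distributions below are arbitrary examples):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

H_p = -np.sum(p * np.log(p))      # Shannon entropy H(p)
D_kl = np.sum(p * np.log(p / q))  # KL divergence D_KL(p || q)
H_pq = -np.sum(p * np.log(q))     # cross-entropy H(p, q)

assert np.isclose(H_pq, H_p + D_kl)  # H(p, q) = H(p) + D_KL(p || q)
assert D_kl >= 0                     # Gibbs' inequality
print(H_p, D_kl, H_pq)
```

Since \(H(p)\) does not depend on the model, minimizing cross-entropy over \(q\) is the same as minimizing the KL divergence.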


Chapter 2: The Ising Model — A Physical Introduction to Learning

Section 2.1: The Ising Model

The Ising model is a prototypical model of ferromagnetism, consisting of binary spins \(s_i \in \{-1, +1\}\) on a lattice with Hamiltonian:

\[ H(\mathbf{s}) = -J \sum_{\langle i,j \rangle} s_i s_j - h \sum_i s_i, \]

where \(J > 0\) is the ferromagnetic coupling, \(h\) is an external field, and \(\langle i,j \rangle\) denotes nearest-neighbor pairs.

The thermal equilibrium state is the Boltzmann distribution:

\[ p(\mathbf{s}) = \frac{1}{Z} e^{-\beta H(\mathbf{s})}, \qquad Z = \sum_{\mathbf{s}} e^{-\beta H(\mathbf{s})}, \qquad \beta = 1/(k_B T). \]

Section 2.2: Phase Transition

In 2D on the square lattice, the Ising model undergoes a continuous phase transition at the critical temperature:

\[ T_c = \frac{2J}{k_B \ln(1 + \sqrt{2})} \approx 2.269 J/k_B. \]
  • Low \(T\) (\(T < T_c\)): spontaneous magnetization \(\langle M \rangle \neq 0\); ordered ferromagnetic phase.
  • High \(T\) (\(T > T_c\)): \(\langle M \rangle = 0\); disordered paramagnetic phase.
  • At \(T_c\): power-law correlations, diverging correlation length, universal scaling.

The Ising model serves as a test bed for ML: given configurations sampled at various temperatures, can a ML model identify the phase transition?

Carrasquilla and Melko (2017) showed that a simple convolutional neural network trained to classify Ising configurations as ordered/disordered learns to output a phase diagram with a sharp transition — demonstrating that ML can identify phases of matter.

Section 2.3: Monte Carlo and Sampling

To generate configurations distributed according to the Boltzmann distribution, Markov Chain Monte Carlo (MCMC) methods are used:

Metropolis-Hastings algorithm:

  1. Propose a spin flip: \(\mathbf{s}' = \mathbf{s}\) with spin \(s_i \to -s_i\).
  2. Compute the energy change: \(\Delta E = H(\mathbf{s}') - H(\mathbf{s})\).
  3. Accept with probability: \(\min(1, e^{-\beta \Delta E})\).

This generates a Markov chain that converges to the Boltzmann distribution.
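The three steps above can be sketched for the 2D Ising model; this is a minimal pure-NumPy implementation, with lattice size, sweep count, and the cold start chosen for illustration:

```python
import numpy as np

def metropolis_ising(L=16, T=2.0, sweeps=100, seed=0):
    """Single-spin-flip Metropolis sampling of the 2D Ising model (J = 1, h = 0)."""
    rng = np.random.default_rng(seed)
    s = np.ones((L, L), dtype=int)  # cold start: avoids slow domain-wall relaxation
    beta = 1.0 / T
    for _ in range(sweeps):
        for _ in range(L * L):
            i, j = rng.integers(0, L, size=2)
            # Sum of the four nearest neighbors with periodic boundaries
            nn = (s[(i + 1) % L, j] + s[(i - 1) % L, j]
                  + s[i, (j + 1) % L] + s[i, (j - 1) % L])
            dE = 2.0 * s[i, j] * nn  # energy change of flipping s[i, j]
            if dE <= 0 or rng.random() < np.exp(-beta * dE):
                s[i, j] = -s[i, j]   # accept with probability min(1, e^{-beta dE})
    return s

# Below T_c ≈ 2.269 the magnetization stays large; above it the lattice disorders
m_cold = abs(metropolis_ising(T=1.5).mean())
m_hot = abs(metropolis_ising(T=3.5).mean())
print(f"|m| at T=1.5: {m_cold:.2f}; |m| at T=3.5: {m_hot:.2f}")
```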

Connection to optimization: Simulated annealing uses the Metropolis algorithm with a slowly decreasing temperature (“annealing schedule”) to find the minimum-energy configuration — a global optimization strategy motivated by statistical mechanics.


Chapter 3: Linear Models for Regression and Classification

Section 3.1: Linear Regression

Linear regression models the relationship \(y = \mathbf{w}^T \mathbf{x} + b + \varepsilon\). Minimizing the mean squared error gives the OLS estimator \(\hat{\mathbf{w}} = (X^T X)^{-1} X^T \mathbf{y}\).

Probabilistic interpretation: if \(\varepsilon \sim \mathcal{N}(0, \sigma^2)\), then minimizing MSE is equivalent to maximum likelihood under a Gaussian model.

Ridge regression: add \(L_2\) regularization to the loss: \(\mathcal{L} = \|X\mathbf{w} - \mathbf{y}\|^2 + \lambda\|\mathbf{w}\|^2\). Solution: \(\hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}\). Regularization prevents overfitting by shrinking weights.

Bayesian interpretation: Ridge regression corresponds to a Gaussian prior on \(\mathbf{w}\): maximizing the posterior \(p(\mathbf{w}|\mathbf{y}, X) \propto p(\mathbf{y}|X,\mathbf{w})p(\mathbf{w})\) with \(p(\mathbf{w}) = \mathcal{N}(0, \lambda^{-1}I)\).
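The two closed-form solutions above can be checked numerically; problem dimensions, noise level, and the value of \(\lambda\) below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 10
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=N)

def ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y; lam = 0 gives OLS."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_ols = ridge(X, y, 0.0)
w_reg = ridge(X, y, 10.0)

# Regularization shrinks the weight vector toward zero
print(np.linalg.norm(w_ols), np.linalg.norm(w_reg))
```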

Section 3.2: Logistic Regression

For binary classification (\(y \in \{0,1\}\)), logistic regression models:

\[ p(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}}. \]

Training minimizes the binary cross-entropy loss:

\[ \mathcal{L} = -\sum_i [y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)]. \]

For multi-class classification, the softmax generalizes the sigmoid:

\[ p(y=k|\mathbf{x}) = \frac{e^{\mathbf{w}_k^T \mathbf{x}}}{\sum_j e^{\mathbf{w}_j^T \mathbf{x}}}. \]

This is formally the Boltzmann distribution with “energy” \(-\mathbf{w}_k^T \mathbf{x}\) — connecting classification to statistical mechanics.
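The correspondence can be made concrete by adding an inverse-temperature parameter \(\beta\) to the softmax (the logit values are arbitrary examples):

```python
import numpy as np

def softmax(logits, beta=1.0):
    """Boltzmann distribution over classes with energies E_k = -logit_k."""
    z = beta * logits
    z = z - z.max()   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
assert np.isclose(p.sum(), 1.0)
print(p)

# beta -> infinity freezes the distribution onto the "ground state" (largest
# logit), just as low temperature freezes a spin system into its minimum-energy
# configuration
p_cold = softmax(logits, beta=50.0)
assert p_cold.argmax() == 0 and p_cold[0] > 0.99
```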


Chapter 4: Optimization

Section 4.1: Gradient Descent and Variants

Gradient descent iteratively follows the negative gradient:

\[ \boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} - \eta \nabla_\theta \mathcal{L}(\boldsymbol{\theta}^{(k)}). \]

Stochastic gradient descent (SGD): use a mini-batch estimate \(\hat{\nabla} \mathcal{L}\). Noisy gradients help escape saddle points and sharp minima.

Momentum: accumulate gradient history to reduce oscillations:

\[ \mathbf{v}^{(k+1)} = \mu \mathbf{v}^{(k)} - \eta \nabla \mathcal{L}, \qquad \boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} + \mathbf{v}^{(k+1)}. \]

Adam (Kingma & Ba, 2015): adaptive per-parameter learning rates using first and second moment estimates:

\[ m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2, \]\[ \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \varepsilon} \hat{m}_t. \]

Adam is the default optimizer in most deep learning applications.
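A minimal full-batch Adam loop, following Kingma & Ba (2015); the test problem (an anisotropic quadratic) and hyperparameters are illustrative:

```python
import numpy as np

def adam_minimize(grad, theta0, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Adam: adaptive per-parameter steps from first/second moment estimates."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g ** 2   # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias corrections
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Minimize L(theta) = (theta_1^2 + 100 theta_2^2) / 2: the badly scaled
# curvature is exactly what per-parameter step sizes help with
grad = lambda th: np.array([1.0, 100.0]) * th
theta = adam_minimize(grad, [3.0, -2.0])
print(theta)   # close to the minimum at the origin
```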

Section 4.2: The Loss Landscape

The loss landscape of a deep neural network is a high-dimensional non-convex surface. Key features:

  • Saddle points: gradient zero but not a minimum (second-order derivatives have mixed signs). Dominant in high dimensions; SGD noise helps escape.
  • Flat minima vs. sharp minima: flat minima generalize better than sharp minima (Hochreiter & Schmidhuber, 1997). SGD with large batch size tends to find sharp minima.
  • Mode connectivity: recent work shows that most good minima of neural networks are connected by low-loss paths.

Chapter 5: Neural Networks

Section 5.1: Architecture

A deep neural network with \(L\) layers is a composition:

\[ f(\mathbf{x}; \theta) = f^{(L)} \circ \cdots \circ f^{(1)}(\mathbf{x}), \]

where \(f^{(\ell)}(\mathbf{h}) = \sigma(W^{(\ell)}\mathbf{h} + \mathbf{b}^{(\ell)})\). The parameters \(\theta = \{W^{(\ell)}, \mathbf{b}^{(\ell)}\}\) are learned by backpropagation.

Depth vs. width: Depth allows exponentially more efficient representation of certain functions. Width ensures expressivity at each layer.

Section 5.2: Backpropagation

Backpropagation is the chain rule applied backwards. For loss \(\mathcal{L}\) and parameters \(W^{(\ell)}\):

  • Forward pass: compute \(\mathbf{h}^{(\ell)}\) for each layer.
  • Backward pass: compute \(\delta^{(\ell)} = \partial \mathcal{L}/\partial \mathbf{z}^{(\ell)}\) recursively.
  • Weight gradient: \(\nabla_{W^{(\ell)}} \mathcal{L} = \delta^{(\ell)} (\mathbf{h}^{(\ell-1)})^T\).

Cost: \(O(\text{parameters})\) — same as a forward pass.
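The three steps above, written out for a tiny two-layer network and checked against a finite-difference gradient (layer sizes and the loss are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Tiny network: x -> h = tanh(W1 x) -> out = W2 h, squared loss against y
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))
x = rng.normal(size=(3, 1))
y = np.array([[0.5]])

def loss(W1, W2):
    h = np.tanh(W1 @ x)
    return 0.5 * float((W2 @ h - y) ** 2)

# Forward pass
h = np.tanh(W1 @ x)
out = W2 @ h
# Backward pass: propagate delta = dL/dz through each layer
delta2 = out - y                       # dL/d(out)
grad_W2 = delta2 @ h.T                 # dL/dW2 = delta2 h^T
delta1 = (W2.T @ delta2) * (1 - h**2)  # chain rule through tanh
grad_W1 = delta1 @ x.T                 # dL/dW1 = delta1 x^T

# Check one entry against a forward finite difference
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
num = (loss(W1p, W2) - loss(W1, W2)) / eps
print(grad_W1[0, 0], num)   # the two values agree to finite-difference accuracy
```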

Section 5.3: Regularization

  • Dropout (Srivastava et al., 2014): randomly zero out neurons during training with probability \(p\). Equivalent to training an exponential number of sub-networks and averaging.
  • Batch normalization: normalize activations within each mini-batch; reduces internal covariate shift and accelerates training.
  • Weight decay: \(L_2\) regularization on weights; equivalent to a Gaussian prior in Bayesian terms.
  • Early stopping: stop training when validation loss starts increasing.

Chapter 6: Supervised and Unsupervised Learning

Section 6.1: Classification and Regression

Supervised learning has labeled training data. Loss functions:

  • Classification: cross-entropy loss \(-\sum_i y_i \log \hat{p}_i\).
  • Regression: mean squared error \(\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2\).

Convolutional neural networks (CNNs): exploit spatial translational invariance via shared filter kernels. Dominant for image classification, materials property prediction from crystal structures.

Section 6.2: Unsupervised Learning

Unsupervised learning discovers structure in unlabeled data.

Clustering: K-means minimizes within-cluster variance; assigns each point to the nearest centroid.

Principal Component Analysis (PCA): linear dimensionality reduction by projecting onto the top eigenvectors of the data covariance matrix (equivalently computed via the SVD of the centered data matrix).
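A compact NumPy sketch of PCA on synthetic correlated data (the dimensions, noise scale, and correlation direction are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
# Correlated 2D data: variance lies mostly along the (1, 1) direction
z = rng.normal(size=(500, 1))
data = np.hstack([z, z]) + 0.1 * rng.normal(size=(500, 2))

Xc = data - data.mean(axis=0)            # center the data
cov = Xc.T @ Xc / (len(Xc) - 1)          # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                     # top principal component

# pc1 points along (1, 1)/sqrt(2), the direction of maximum variance
print(pc1, eigvals[::-1])
```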

Autoencoders: neural networks trained to reconstruct their input through a bottleneck. The encoder \(z = f(\mathbf{x})\) learns a compressed representation; the decoder \(\hat{\mathbf{x}} = g(z)\) reconstructs the input.


Chapter 7: Generative Models

Section 7.1: Variational Autoencoders (VAEs)

The VAE (Kingma & Welling, 2014) is a generative model that learns a latent variable model \(p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}|z) p(z) dz\). Training maximizes the Evidence Lower Bound (ELBO):

\[ \mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z|\mathbf{x})}[\log p_\theta(\mathbf{x}|z)] - D_{KL}(q_\phi(z|\mathbf{x}) \| p(z)), \]

where \(q_\phi(z|\mathbf{x})\) is the encoder (approximate posterior) and \(p_\theta(\mathbf{x}|z)\) is the decoder (generative model). The KL term regularizes the latent space to be compact.

Section 7.2: Generative Adversarial Networks (GANs)

GANs (Goodfellow et al., 2014) train two networks simultaneously:

  • Generator \(G_\theta\): maps random noise \(\mathbf{z} \sim p(\mathbf{z})\) to samples in data space.
  • Discriminator \(D_\phi\): classifies inputs as real or generated.

The adversarial training objective:

\[ \min_G \max_D \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}[\log(1 - D(G(\mathbf{z})))]. \]

At Nash equilibrium, \(G\) produces samples indistinguishable from real data: \(p_G = p_{\text{data}}\).


Chapter 8: Hopfield Networks, RBMs, and Energy-Based Models

Section 8.1: Hopfield Networks

The Hopfield network (Hopfield, 1982) is a recurrent network of \(N\) binary neurons \(s_i \in \{-1, +1\}\) with symmetric weights \(W_{ij}\). The energy function is:

\[ E(\mathbf{s}) = -\frac{1}{2} \sum_{i \neq j} W_{ij} s_i s_j. \]

This is exactly the Ising Hamiltonian with \(J_{ij} = W_{ij}\)!

The network stores patterns via Hebbian learning: \(W_{ij} = \frac{1}{N}\sum_\mu \xi_i^\mu \xi_j^\mu\), where \(\{\boldsymbol{\xi}^\mu\}\) are the patterns to be stored. Retrieval proceeds by updating spins to minimize energy:

\[ s_i \leftarrow \mathrm{sign}\!\left(\sum_j W_{ij} s_j\right). \]

The energy decreases monotonically with updates, so the network converges to a local minimum — an attractor corresponding to a stored pattern (or a spurious state).
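Hebbian storage and retrieval can be sketched in a few lines; for simplicity this uses synchronous updates of all spins (the textbook rule updates one spin at a time), and the network size, pattern count, and corruption level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 64
patterns = rng.choice([-1, 1], size=(3, N))   # three random patterns to store

# Hebbian rule: W_ij = (1/N) sum_mu xi_i^mu xi_j^mu, with no self-coupling
W = (patterns.T @ patterns) / N
np.fill_diagonal(W, 0)

def recall(s, steps=20):
    """Synchronous update variant of s_i <- sign(sum_j W_ij s_j)."""
    s = s.copy()
    for _ in range(steps):
        s = np.sign(W @ s)
        s[s == 0] = 1   # break ties toward +1
    return s

# Corrupt a stored pattern by flipping 10 spins, then let the network relax
probe = patterns[0].copy()
flip = rng.choice(N, size=10, replace=False)
probe[flip] *= -1
recovered = recall(probe)
overlap = (recovered @ patterns[0]) / N
print("overlap with stored pattern:", overlap)
```

An overlap near 1 means the attractor dynamics restored the corrupted pattern.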

Section 8.2: Restricted Boltzmann Machines

An RBM is a bipartite energy-based model with visible units \(\mathbf{v} \in \{0,1\}^n\) and hidden units \(\mathbf{h} \in \{0,1\}^m\), with energy:

\[ E(\mathbf{v}, \mathbf{h}) = -\mathbf{a}^T \mathbf{v} - \mathbf{b}^T \mathbf{h} - \mathbf{v}^T W \mathbf{h}. \]

The joint distribution is \(p(\mathbf{v}, \mathbf{h}) = Z^{-1} e^{-E(\mathbf{v},\mathbf{h})}\). Since the graph is bipartite, the conditional distributions factorize:

\[ p(h_j=1|\mathbf{v}) = \sigma\!\left(b_j + \sum_i v_i W_{ij}\right), \quad p(v_i=1|\mathbf{h}) = \sigma\!\left(a_i + \sum_j W_{ij} h_j\right). \]

Training maximizes the log-likelihood of the data:

\[ \mathcal{L} = \sum_{\mathbf{v} \in \mathcal{D}} \log p(\mathbf{v}) = \sum_{\mathbf{v}} \left[\log \sum_\mathbf{h} e^{-E(\mathbf{v},\mathbf{h})} - \log Z\right]. \]

The gradient involves the model’s own expected statistics:

\[ \nabla_{W_{ij}} \mathcal{L} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}. \]

The data term is tractable (one Gibbs pass); the model term requires MCMC sampling (Contrastive Divergence approximation).
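A minimal CD-1 training loop for a binary RBM, fitting a single pattern; the sizes, learning rate, and the choice to leave the biases untrained are simplifications for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n_vis, n_hid = 6, 4
W = 0.1 * rng.normal(size=(n_vis, n_hid))
a = np.zeros(n_vis)   # visible biases (left untrained here for brevity)
b = np.zeros(n_hid)   # hidden biases (likewise)

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, v0, lr=0.1):
    """One CD-1 update: <v h>_data minus a one-Gibbs-step estimate of <v h>_model."""
    ph0 = sigmoid(b + v0 @ W)                  # positive phase: p(h | data)
    h0 = (rng.random(n_hid) < ph0).astype(float)
    pv1 = sigmoid(a + h0 @ W.T)                # negative phase: one Gibbs pass
    v1 = (rng.random(n_vis) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)
    return W + lr * (np.outer(v0, ph0) - np.outer(v1, ph1))

data = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])   # a single binary "pattern"
for _ in range(500):
    W = cd1_step(W, data)

# After training, a reconstruction pass reproduces the pattern
ph = sigmoid(b + data @ W)
recon = sigmoid(a + (ph > 0.5).astype(float) @ W.T)
print(np.round(recon, 2))
```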

Connection to physics: the RBM is formally identical to an Ising model on a bipartite graph. RBMs have been used as neural quantum states (Carleo & Troyer, 2017) to represent quantum many-body wave functions.


Chapter 9: Autoregressive Models

Section 9.1: Autoregressive Decomposition

Any joint distribution can be factored by the chain rule:

\[ p(\mathbf{x}) = p(x_1) \prod_{i=2}^n p(x_i | x_1, \ldots, x_{i-1}). \]

An autoregressive model parameterizes each conditional \(p_\theta(x_i \mid x_{<i})\) with a neural network, giving exact likelihoods and allowing sequential (ancestral) sampling.
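A toy version over three binary variables, where each conditional is a logistic function of the earlier variables (the weights are arbitrary untrained values, chosen only to illustrate the factorization):

```python
import numpy as np

rng = np.random.default_rng(5)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# p(x_i = 1 | x_<i) = sigmoid(b_i + w_i . x_<i); weights[i] has length i
weights = [np.array([]), np.array([2.0]), np.array([1.0, -1.0])]
biases = [0.0, -1.0, 0.5]

def logprob(x):
    """Exact log p(x) via the chain-rule factorization."""
    lp = 0.0
    for i, (w, b) in enumerate(zip(weights, biases)):
        p1 = sigmoid(b + w @ np.array(x[:i], dtype=float))
        lp += np.log(p1 if x[i] else 1.0 - p1)
    return lp

def sample():
    """Ancestral sampling: draw x_1, then x_2 | x_1, then x_3 | x_1, x_2."""
    x = []
    for w, b in zip(weights, biases):
        p1 = sigmoid(b + w @ np.array(x, dtype=float))
        x.append(int(rng.random() < p1))
    return x

# The conditionals normalize automatically: all 2^3 configurations sum to 1
total = sum(np.exp(logprob([i, j, k]))
            for i in (0, 1) for j in (0, 1) for k in (0, 1))
assert np.isclose(total, 1.0)
print(sample(), total)
```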

PixelCNN: applies this idea to images, modeling pixels left-to-right and top-to-bottom using masked convolutions.

Autoregressive neural quantum states: model the wave function \(\psi(\mathbf{s}) = \langle \mathbf{s}|\psi\rangle\) of a quantum system using an autoregressive network. Enables direct sampling from \(|\psi(\mathbf{s})|^2\) without MCMC.


Chapter 10: Recurrent Neural Networks and Transformers

Section 10.1: Recurrent Neural Networks

A vanilla RNN processes a sequence \(\mathbf{x}_1, \ldots, \mathbf{x}_T\) by maintaining a hidden state:

\[ \mathbf{h}_t = \sigma(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t + \mathbf{b}). \]

Long Short-Term Memory (LSTM): solves the vanishing gradient problem with gated memory cells that can maintain information across long sequences.

Section 10.2: The Transformer

The Transformer (Vaswani et al., 2017) processes sequences using self-attention rather than recurrence. Given query, key, and value matrices \(Q, K, V\) (linear projections of the token embeddings):

\[ \text{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V. \]

Self-attention computes pairwise interactions between all positions — \(O(n^2)\) complexity — but can be parallelized across the sequence dimension.
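The attention formula in plain NumPy, for a single head with no masking (sequence length and embedding dimensions below are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # n x n pairwise compatibilities
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(6)
n, d_k, d_v = 5, 8, 4
Q, K = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out, w = attention(Q, K, V)
print(out.shape)   # each position gets a value vector: (5, 4)
```

The \(n \times n\) weight matrix is exactly the pairwise "interaction" structure referred to below: each row is a probability distribution over which positions a given position attends to.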

Physical analogy: self-attention is related to the interaction energy in a many-body system. The query-key dot product measures “compatibility” between positions, analogous to pairwise coupling in a spin model.


Chapter 11: Diffusion Models and Normalizing Flows

Section 11.1: Diffusion Models

Denoising Diffusion Probabilistic Models (DDPMs) (Ho et al., 2020) learn to reverse a gradual noising process.

Forward process: gradually add Gaussian noise to data over \(T\) steps:

\[ q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t I). \]

After \(T\) steps, \(\mathbf{x}_T \approx \mathcal{N}(0, I)\).

Reverse process: learn a neural network \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\) that predicts the noise added at step \(t\). Sampling generates data by iteratively denoising from pure noise.
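A useful consequence of the Gaussian forward kernel is the closed form \(q(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}(\sqrt{\bar\alpha_t}\,\mathbf{x}_0, (1-\bar\alpha_t)I)\) with \(\bar\alpha_t = \prod_{s\le t}(1-\beta_s)\), so \(\mathbf{x}_t\) can be sampled in one shot. A sketch with a linear schedule and trivial toy "data" (both illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)      # cumulative product prod(1 - beta_s)

def q_sample(x0, t):
    """Sample x_t | x_0 directly: sqrt(abar_t) x0 + sqrt(1 - abar_t) * eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = np.ones(10_000)                      # toy "data": all ones
x_early = q_sample(x0, 10)
x_late = q_sample(x0, T - 1)

# Early on, x_t still remembers x_0; by t = T it is close to standard normal
print(x_early.mean(), x_late.mean(), x_late.std())
```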

Connection to statistical mechanics: the forward diffusion process is an Ornstein-Uhlenbeck process (Langevin dynamics); the reverse process is trained to denoise, analogous to score-based generative models that estimate the score function \(\nabla_\mathbf{x} \log p(\mathbf{x})\).

Section 11.2: Normalizing Flows

Normalizing flows model a complex distribution \(p(\mathbf{x})\) by applying a sequence of invertible transformations \(f_1, \ldots, f_K\) to a simple base distribution \(p_0(\mathbf{z})\):

\[ \mathbf{x} = f_K \circ \cdots \circ f_1(\mathbf{z}), \quad \mathbf{z} \sim p_0(\mathbf{z}). \]

By the change-of-variables formula:

\[ \log p(\mathbf{x}) = \log p_0(\mathbf{z}) - \sum_{k=1}^K \log\left|\det \frac{\partial f_k}{\partial \mathbf{z}_k}\right|. \]

The Jacobian determinant must be efficiently computable; this is achieved with affine coupling layers (RealNVP, Glow). Flows give exact likelihoods and allow exact inference, in contrast to VAEs and GANs.
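The change-of-variables formula can be checked on the simplest possible flow, a single affine map \(x = sz + t\) on a standard-normal base (the parameter values are arbitrary examples):

```python
import numpy as np

# One affine flow x = f(z) = s*z + t applied to a standard normal base density
s, t = 2.0, 1.0

def log_prob(x):
    """Exact log p(x) from the change-of-variables formula."""
    z = (x - t) / s                                    # inverse transformation
    log_base = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)   # log N(z; 0, 1)
    return log_base - np.log(abs(s))                   # minus log|det df/dz|

# Cross-check against the known closed form: x ~ N(t, s^2)
x = 1.7
expected = -0.5 * ((x - t) / s) ** 2 - 0.5 * np.log(2 * np.pi * s**2)
assert np.isclose(log_prob(x), expected)
print(log_prob(x))
```

A deep flow is just a stack of such invertible maps, with the log-Jacobian terms summed over layers.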


Chapter 12: Machine Learning for Quantum Physics

Section 12.1: Neural Quantum States

Carleo and Troyer (2017) proposed representing the many-body wave function \(\Psi(\mathbf{s})\) using a neural network. For a spin-1/2 system with \(N\) spins, the Hilbert space has \(2^N\) dimensions — exponentially large. A neural network can represent approximate ground states compactly.

The Restricted Boltzmann Machine ansatz:

\[ \Psi(\mathbf{s}) = e^{\sum_i a_i s_i} \prod_j \cosh\!\left(b_j + \sum_i W_{ij} s_i\right). \]

Training minimizes the variational energy \(E = \langle \Psi | H | \Psi \rangle / \langle \Psi | \Psi \rangle\) using stochastic reconfiguration (a natural gradient method adapted for quantum wave functions).

Section 12.2: Machine Learning for Phase Transitions

Supervised: train a CNN to classify spin configurations as ordered/disordered. The classifier’s output (confidence) serves as an order parameter, automatically locating the phase transition.

Unsupervised: apply PCA or autoencoders to spin configurations across temperatures. The dominant principal component captures the order parameter; anomalous changes in the latent representation signal phase transitions.

Confusion scheme (van Nieuwenburg et al., 2017): train a binary classifier to distinguish “low T” and “high T” configurations while varying the boundary. The classification accuracy as a function of boundary position peaks at the critical temperature.

Section 12.3: Quantum Machine Learning

Variational quantum circuits: parameterized quantum circuits trained with classical optimizers (variational quantum eigensolver, QAOA). Related to deep learning on quantum hardware.

Barren plateaus: a fundamental obstacle — the gradients of variational quantum circuits vanish exponentially with system size (in analogy with vanishing gradients in deep networks), making training intractable at large scales.
