PHYS 449: Machine Learning in Physics
Roger Melko
Sources and References
Primary references
- I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016) — deeplearningbook.org
- K. Murphy, Probabilistic Machine Learning: An Introduction (MIT Press, 2022) — probml.github.io/pml-book/book1.html
- C. Bishop, Pattern Recognition and Machine Learning (Springer, 2006)
Physics-specific references
- J. Carrasquilla and R. G. Melko, “Machine learning phases of matter,” Nature Physics 13 (2017), 431–434
- P. Mehta et al., “A high-bias, low-variance introduction to machine learning for physicists,” Physics Reports 810 (2019), 1–124
- G. Carleo et al., “Machine learning and the physical sciences,” Rev. Mod. Phys. 91 (2019), 045002
Online resources
- Stanford CS229 materials — cs229.stanford.edu
- UBC CPSC 340/540 notes — cs.ubc.ca
Chapter 1: Learning Theory in Physics — Connections and Foundations
Section 1.1: Why Physics and Machine Learning?
Machine learning and statistical physics share deep mathematical structures. The central theme of this course:
Statistical mechanics provides a natural language for machine learning, and ML provides powerful tools for physics.
Key correspondences:
| Statistical Mechanics | Machine Learning |
|---|---|
| Energy function \(E(\mathbf{s})\) | Loss function / negative log-likelihood |
| Boltzmann distribution \(p(\mathbf{s}) \propto e^{-\beta E(\mathbf{s})}\) | Softmax / Gibbs distribution |
| Partition function \(Z = \sum_\mathbf{s} e^{-\beta E(\mathbf{s})}\) | Normalizing constant |
| Free energy \(F = -\frac{1}{\beta}\log Z\) | Negative log-partition function (log-sum-exp) |
| Entropy \(S = -\sum_i p_i \log p_i\) | Shannon entropy / cross-entropy loss |
| Phase transition | Change of regime in learning dynamics |
| Renormalization group (RG) | Deep learning hierarchy / coarse-graining |
Section 1.2: Statistical Learning Theory
Learning problem: given a dataset \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N\) drawn i.i.d. from an unknown distribution \(p(\mathbf{x}, y)\), find a function \(f_\theta: \mathcal{X} \to \mathcal{Y}\) that generalizes to new samples.
The generalization error (true risk):
\[ R[f] = \mathbb{E}_{(\mathbf{x},y) \sim p}[\ell(y, f(\mathbf{x}))]. \]
The empirical risk (training error):
\[ \hat{R}[f] = \frac{1}{N}\sum_{i=1}^N \ell(y_i, f(\mathbf{x}_i)). \]
The bias-variance tradeoff: for mean-squared error,
\[ \mathbb{E}[(y - f(\mathbf{x}))^2] = \underbrace{(\mathbb{E}[f(\mathbf{x})] - y)^2}_{\text{bias}^2} + \underbrace{\mathrm{Var}[f(\mathbf{x})]}_{\text{variance}} + \underbrace{\sigma^2_\varepsilon}_{\text{noise}}. \]
Simple models have high bias (underfitting); complex models have high variance (overfitting).
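The tradeoff can be seen numerically. Below is a sketch under assumed settings: the ground truth \(f(x)=\sin(2\pi x)\), the polynomial model class, the test point, and the noise level are illustrative choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # hypothetical ground-truth function
x_star, noise = 0.3, 0.1              # test point and label-noise level

def bias_variance(degree, trials=300, N=30):
    # Refit a polynomial on fresh datasets; collect predictions at x_star
    preds = []
    for _ in range(trials):
        x = rng.uniform(-1, 1, N)
        y = f(x) + noise * rng.normal(size=N)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_star))
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x_star)) ** 2
    return bias2, preds.var()

b_low, v_low = bias_variance(degree=1)     # underfits: large bias
b_high, v_high = bias_variance(degree=12)  # flexible fit: small bias, more variance
assert b_low > b_high                      # complexity trades bias for variance
```

The low-degree fit misses the oscillation entirely (large bias); the high-degree fit tracks it but its prediction fluctuates from dataset to dataset (larger variance).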
Section 1.3: Information Theory
Shannon entropy quantifies uncertainty in a distribution \(p\):
\[ H(p) = -\sum_i p_i \log p_i = \mathbb{E}[-\log p]. \]
KL divergence (relative entropy) measures how much distribution \(q\) differs from \(p\):
\[ D_{KL}(p \| q) = \sum_i p_i \log \frac{p_i}{q_i} \geq 0, \quad \text{with equality iff } p = q. \]
Cross-entropy:
\[ H(p, q) = H(p) + D_{KL}(p \| q) = -\sum_i p_i \log q_i. \]
Minimizing cross-entropy loss is equivalent to maximum likelihood estimation: if \(p\) is the true label distribution and \(q_\theta\) is the model’s predicted distribution, then \(\min_\theta H(p, q_\theta)\) is equivalent to \(\max_\theta \sum_i \log q_\theta(y_i|\mathbf{x}_i)\).
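The identity \(H(p, q) = H(p) + D_{KL}(p \| q)\) is easy to verify numerically; the two distributions below are arbitrary examples:

```python
import numpy as np

def entropy(p):
    # H(p) = -sum_i p_i log p_i, with the convention 0 log 0 = 0
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl_divergence(p, q):
    # D_KL(p || q) = sum_i p_i log(p_i / q_i); assumes q_i > 0 wherever p_i > 0
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i log q_i
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz]))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
# Verify H(p, q) = H(p) + D_KL(p || q), and D_KL >= 0 (Gibbs' inequality)
assert np.isclose(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
assert kl_divergence(p, q) >= 0
```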
Chapter 2: The Ising Model — A Physical Introduction to Learning
Section 2.1: The Ising Model
The Ising model is a prototypical model of ferromagnetism, consisting of binary spins \(s_i \in \{-1, +1\}\) on a lattice with Hamiltonian:
\[ H(\mathbf{s}) = -J \sum_{\langle i,j \rangle} s_i s_j - h \sum_i s_i, \]where \(J > 0\) is the ferromagnetic coupling, \(h\) is an external field, and \(\langle i,j \rangle\) denotes nearest-neighbor pairs.
The thermal equilibrium state is the Boltzmann distribution:
\[ p(\mathbf{s}) = \frac{1}{Z} e^{-\beta H(\mathbf{s})}, \qquad Z = \sum_{\mathbf{s}} e^{-\beta H(\mathbf{s})}, \qquad \beta = 1/(k_B T). \]
Section 2.2: Phase Transition
In 2D on the square lattice, the Ising model undergoes a continuous phase transition at the critical temperature:
\[ T_c = \frac{2J}{k_B \ln(1 + \sqrt{2})} \approx 2.269\, J/k_B. \]
- Low \(T\) (\(T < T_c\)): spontaneous magnetization \(\langle M \rangle \neq 0\); ordered ferromagnetic phase.
- High \(T\) (\(T > T_c\)): \(\langle M \rangle = 0\); disordered paramagnetic phase.
- At \(T_c\): power-law correlations, diverging correlation length, universal scaling.
The Ising model serves as a test bed for ML: given configurations sampled at various temperatures, can a ML model identify the phase transition?
Carrasquilla and Melko (2017) showed that a simple convolutional neural network trained to classify Ising configurations as ordered/disordered learns to output a phase diagram with a sharp transition — demonstrating that ML can identify phases of matter.
Section 2.3: Monte Carlo and Sampling
To generate configurations distributed according to the Boltzmann distribution, Markov Chain Monte Carlo (MCMC) methods are used:
Metropolis-Hastings algorithm:
- Propose a spin flip: \(\mathbf{s}'\) identical to \(\mathbf{s}\) except for a single flipped spin, \(s_i \to -s_i\).
- Compute the energy change: \(\Delta E = H(\mathbf{s}') - H(\mathbf{s})\).
- Accept with probability: \(\min(1, e^{-\beta \Delta E})\).
This generates a Markov chain that converges to the Boltzmann distribution.
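The three steps above can be sketched for the 2D model in a few lines of numpy; the lattice size, temperature, and sweep count below are illustrative choices.

```python
import numpy as np

def energy(s, J=1.0):
    # Nearest-neighbor Ising energy with periodic boundaries (each bond counted once)
    return -J * np.sum(s * (np.roll(s, 1, axis=0) + np.roll(s, 1, axis=1)))

def metropolis_sweep(s, beta, J=1.0, rng=None):
    # One sweep of single-spin-flip Metropolis updates on an L x L lattice
    rng = np.random.default_rng() if rng is None else rng
    L = s.shape[0]
    for _ in range(L * L):
        i, j = rng.integers(L), rng.integers(L)
        nn = (s[(i + 1) % L, j] + s[(i - 1) % L, j]
              + s[i, (j + 1) % L] + s[i, (j - 1) % L])
        dE = 2.0 * J * s[i, j] * nn          # energy cost of flipping spin (i, j)
        if dE <= 0 or rng.random() < np.exp(-beta * dE):
            s[i, j] *= -1                    # accept with prob min(1, e^{-beta dE})
    return s

rng = np.random.default_rng(0)
s = rng.choice([-1, 1], size=(16, 16))
E0 = energy(s)
for _ in range(200):                         # equilibrate well below T_c
    metropolis_sweep(s, beta=1.0, rng=rng)
assert energy(s) < E0                        # low-T sampling finds low-energy states
```

Well below \(T_c\) the chain drives a random initial configuration toward large ordered domains; near \(T_c\), single-flip updates suffer critical slowing down, which motivates cluster algorithms.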
Connection to optimization: Simulated annealing uses the Metropolis algorithm with a slowly decreasing temperature (“annealing schedule”) to find the minimum-energy configuration — a global optimization strategy motivated by statistical mechanics.
Chapter 3: Linear Models for Regression and Classification
Section 3.1: Linear Regression
Linear regression models the relationship \(y = \mathbf{w}^T \mathbf{x} + b + \varepsilon\). Minimizing the mean squared error gives the OLS estimator \(\hat{\mathbf{w}} = (X^T X)^{-1} X^T \mathbf{y}\).
Probabilistic interpretation: if \(\varepsilon \sim \mathcal{N}(0, \sigma^2)\), then minimizing MSE is equivalent to maximum likelihood under a Gaussian model.
Ridge regression: add \(L_2\) regularization to the loss: \(\mathcal{L} = \|X\mathbf{w} - \mathbf{y}\|^2 + \lambda\|\mathbf{w}\|^2\). Solution: \(\hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}\). Regularization prevents overfitting by shrinking weights.
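Both closed-form estimators can be checked directly; the synthetic data and the value of \(\lambda\) below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 3
w_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(N, d))
y = X @ w_true + 0.1 * rng.normal(size=N)

# OLS: w = (X^T X)^{-1} X^T y  (solve is numerically preferable to an explicit inverse)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: w = (X^T X + lambda I)^{-1} X^T y
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Regularization shrinks the weight vector toward zero
assert np.linalg.norm(w_ridge) < np.linalg.norm(w_ols)
```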
Bayesian interpretation: Ridge regression corresponds to a Gaussian prior on \(\mathbf{w}\): maximizing the posterior \(p(\mathbf{w}|\mathbf{y}, X) \propto p(\mathbf{y}|X,\mathbf{w})p(\mathbf{w})\) with \(p(\mathbf{w}) = \mathcal{N}(0, \lambda^{-1}I)\) (taking unit noise variance \(\sigma^2 = 1\)).
Section 3.2: Logistic Regression
For binary classification (\(y \in \{0,1\}\)), logistic regression models:
\[ p(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}}. \]
Training minimizes the binary cross-entropy loss:
\[ \mathcal{L} = -\sum_i [y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)]. \]
For multi-class classification, the softmax generalizes the sigmoid:
\[ p(y=k|\mathbf{x}) = \frac{e^{\mathbf{w}_k^T \mathbf{x}}}{\sum_j e^{\mathbf{w}_j^T \mathbf{x}}}. \]
This is formally the Boltzmann distribution with “energy” \(-\mathbf{w}_k^T \mathbf{x}\) — connecting classification to statistical mechanics.
Chapter 4: Optimization
Section 4.1: Gradient Descent and Variants
Gradient descent iteratively follows the negative gradient:
\[ \boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} - \eta \nabla_\theta \mathcal{L}(\boldsymbol{\theta}^{(k)}). \]
Stochastic gradient descent (SGD): use a mini-batch estimate \(\hat{\nabla} \mathcal{L}\). Noisy gradients help escape saddle points and sharp minima.
Momentum: accumulate gradient history to reduce oscillations:
\[ \mathbf{v}^{(k+1)} = \mu \mathbf{v}^{(k)} - \eta \nabla \mathcal{L}, \qquad \boldsymbol{\theta}^{(k+1)} = \boldsymbol{\theta}^{(k)} + \mathbf{v}^{(k+1)}. \]
Adam (Kingma & Ba, 2015): adaptive per-parameter learning rates using first and second moment estimates:
\[ m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2, \]\[ \theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \varepsilon} \hat{m}_t, \]
where \(\hat{m}_t = m_t/(1-\beta_1^t)\) and \(\hat{v}_t = v_t/(1-\beta_2^t)\) are the bias-corrected moment estimates. Adam is the default optimizer in most deep learning applications.
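A minimal Adam loop, shown on a simple quadratic loss. The bias corrections \(\hat{m}_t = m_t/(1-\beta_1^t)\) and \(\hat{v}_t = v_t/(1-\beta_2^t)\) are standard; the learning rate, step count, and test function are illustrative choices.

```python
import numpy as np

def adam(grad, theta0, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    # Minimal Adam optimizer: exponential moving averages of g and g^2,
    # bias-corrected, with a per-parameter adaptive step size.
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)          # bias-corrected first moment
        v_hat = v / (1 - beta2**t)          # bias-corrected second moment
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Quadratic bowl: L(theta) = ||theta||^2, gradient 2*theta
theta = adam(lambda th: 2 * th, theta0=[3.0, -2.0])
assert np.sum(theta**2) < 0.05   # loss driven far below its initial value of 13
```

With a fixed \(\eta\), Adam oscillates in a small neighborhood of the minimum rather than converging exactly, which is why learning-rate schedules are used in practice.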
Section 4.2: The Loss Landscape
The loss landscape of a deep neural network is a high-dimensional non-convex surface. Key features:
- Saddle points: gradient zero but not a minimum (second-order derivatives have mixed signs). Dominant in high dimensions; SGD noise helps escape.
- Flat minima vs. sharp minima: flat minima generalize better than sharp minima (Hochreiter & Schmidhuber, 1997). SGD with large batch size tends to find sharp minima.
- Mode connectivity: recent work shows that most good minima of neural networks are connected by low-loss paths.
Chapter 5: Neural Networks
Section 5.1: Architecture
A deep neural network with \(L\) layers is a composition:
\[ f(\mathbf{x}; \theta) = f^{(L)} \circ \cdots \circ f^{(1)}(\mathbf{x}), \]where \(f^{(\ell)}(\mathbf{h}) = \sigma(W^{(\ell)}\mathbf{h} + \mathbf{b}^{(\ell)})\). The parameters \(\theta = \{W^{(\ell)}, \mathbf{b}^{(\ell)}\}\) are learned by backpropagation.
Depth vs. width: Depth allows exponentially more efficient representation of certain functions. Width ensures expressivity at each layer.
Section 5.2: Backpropagation
Backpropagation is the chain rule applied backwards. For loss \(\mathcal{L}\) and parameters \(W^{(\ell)}\):
- Forward pass: compute \(\mathbf{h}^{(\ell)}\) for each layer.
- Backward pass: compute \(\delta^{(\ell)} = \partial \mathcal{L}/\partial \mathbf{z}^{(\ell)}\) recursively.
- Weight gradient: \(\nabla_{W^{(\ell)}} \mathcal{L} = \delta^{(\ell)} (\mathbf{h}^{(\ell-1)})^T\).
Cost: \(O(\text{parameters})\) — same as a forward pass.
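The three steps can be written out for a small two-layer network and verified against a finite-difference gradient; the tanh activation, squared loss, and layer sizes are arbitrary illustrative choices.

```python
import numpy as np

def forward_backward(x, y, W1, b1, W2, b2):
    # Tiny 2-layer network: x -> tanh(W1 x + b1) -> W2 h + b2, squared loss
    z1 = W1 @ x + b1
    h1 = np.tanh(z1)                         # forward pass stores activations
    yhat = W2 @ h1 + b2
    loss = 0.5 * np.sum((yhat - y) ** 2)
    # Backward pass: delta^(l) = dL/dz^(l), propagated by the chain rule
    delta2 = yhat - y
    dW2 = np.outer(delta2, h1)
    db2 = delta2
    delta1 = (W2.T @ delta2) * (1 - h1**2)   # tanh'(z) = 1 - tanh(z)^2
    dW1 = np.outer(delta1, x)
    db1 = delta1
    return loss, (dW1, db1, dW2, db2)

# Finite-difference check on one entry of W1
rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
loss, grads = forward_backward(x, y, W1, b1, W2, b2)
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
loss_p, _ = forward_backward(x, y, W1p, b1, W2, b2)
assert np.isclose((loss_p - loss) / eps, grads[0][0, 0], atol=1e-4)
```

The gradient check is the standard way to debug a hand-written backward pass: every analytic derivative should match its numerical estimate to several digits.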
Section 5.3: Regularization
- Dropout (Srivastava et al., 2014): randomly zero out neurons during training with probability \(p\). Equivalent to training an exponential number of sub-networks and averaging.
- Batch normalization: normalize activations within each mini-batch; reduces internal covariate shift and accelerates training.
- Weight decay: \(L_2\) regularization on weights; equivalent to a Gaussian prior in Bayesian terms.
- Early stopping: stop training when validation loss starts increasing.
Chapter 6: Supervised and Unsupervised Learning
Section 6.1: Classification and Regression
Supervised learning has labeled training data. Loss functions:
- Classification: cross-entropy loss \(-\sum_i y_i \log \hat{p}_i\).
- Regression: mean squared error \(\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2\).
Convolutional neural networks (CNNs): exploit spatial translational invariance via shared filter kernels. Dominant for image classification, materials property prediction from crystal structures.
Section 6.2: Unsupervised Learning
Unsupervised learning discovers structure in unlabeled data.
Clustering: K-means minimizes within-cluster variance; assigns each point to the nearest centroid.
Principal Component Analysis (PCA): linear dimensionality reduction by projecting onto the top eigenvectors of the data covariance matrix (equivalently, computed from the SVD of the centered data matrix).
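A PCA sketch via the SVD; the synthetic correlated data set below is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2D data: most variance lies along the (1, 1) direction
N = 500
t = rng.normal(size=N)
X = np.column_stack([t + 0.1 * rng.normal(size=N),
                     t + 0.1 * rng.normal(size=N)])

Xc = X - X.mean(axis=0)                  # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                              # top principal component (unit vector)
explained = S**2 / np.sum(S**2)          # fraction of variance per component

# The leading component should align with (1, 1)/sqrt(2) and dominate the variance
assert abs(abs(pc1 @ np.array([1, 1]) / np.sqrt(2)) - 1) < 0.01
assert explained[0] > 0.9
```

For Ising configurations sampled across temperatures, the same projection recovers the magnetization as the leading component (a point revisited in Chapter 12).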
Autoencoders: neural networks trained to reconstruct their input through a bottleneck. The encoder \(z = f(\mathbf{x})\) learns a compressed representation; the decoder \(\hat{\mathbf{x}} = g(z)\) reconstructs the input.
Chapter 7: Generative Models
Section 7.1: Variational Autoencoders (VAEs)
The VAE (Kingma & Welling, 2014) is a generative model that learns a latent variable model \(p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x}|z) p(z) dz\). Training maximizes the Evidence Lower Bound (ELBO):
\[ \mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(z|\mathbf{x})}[\log p_\theta(\mathbf{x}|z)] - D_{KL}(q_\phi(z|\mathbf{x}) \| p(z)), \]where \(q_\phi(z|\mathbf{x})\) is the encoder (approximate posterior) and \(p_\theta(\mathbf{x}|z)\) is the decoder (generative model). The KL term regularizes the latent space to be compact.
Section 7.2: Generative Adversarial Networks (GANs)
GANs (Goodfellow et al., 2014) train two networks simultaneously:
- Generator \(G_\theta\): maps random noise \(\mathbf{z} \sim p(\mathbf{z})\) to samples in data space.
- Discriminator \(D_\phi\): classifies inputs as real or generated.
The adversarial training objective:
\[ \min_G \max_D \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}[\log(1 - D(G(\mathbf{z})))]. \]
At Nash equilibrium, \(G\) produces samples indistinguishable from real data: \(p_G = p_{\text{data}}\).
Chapter 8: Hopfield Networks, RBMs, and Energy-Based Models
Section 8.1: Hopfield Networks
The Hopfield network (Hopfield, 1982) is a recurrent network of \(N\) binary neurons \(s_i \in \{-1, +1\}\) with symmetric weights \(W_{ij}\). The energy function is:
\[ E(\mathbf{s}) = -\frac{1}{2} \sum_{i \neq j} W_{ij} s_i s_j. \]
This is exactly the Ising Hamiltonian with \(J_{ij} = W_{ij}\)!
The network stores patterns via Hebbian learning: \(W_{ij} = \frac{1}{N}\sum_\mu \xi_i^\mu \xi_j^\mu\), where \(\{\boldsymbol{\xi}^\mu\}\) are the patterns to be stored. Retrieval proceeds by updating spins to minimize energy:
\[ s_i \leftarrow \mathrm{sign}\!\left(\sum_j W_{ij} s_j\right). \]
The energy decreases monotonically with updates, so the network converges to a local minimum — an attractor corresponding to a stored pattern (or a spurious state).
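A small demonstration of Hebbian storage and attractor retrieval; the network size, pattern count, and corruption level are illustrative (well below the \(\approx 0.14 N\) capacity limit).

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 3
patterns = rng.choice([-1, 1], size=(P, N))      # patterns xi^mu to store

# Hebbian rule: W_ij = (1/N) sum_mu xi_i^mu xi_j^mu, with no self-coupling
W = (patterns.T @ patterns) / N
np.fill_diagonal(W, 0.0)

def recall(s, W, sweeps=10):
    # Asynchronous updates s_i <- sign(sum_j W_ij s_j); energy never increases
    s = s.copy()
    for _ in range(sweeps):
        for i in range(len(s)):
            s[i] = 1 if W[i] @ s >= 0 else -1
    return s

# Corrupt a stored pattern by flipping 10 of its 100 spins, then retrieve it
probe = patterns[0].copy()
flip = rng.choice(N, size=10, replace=False)
probe[flip] *= -1
recovered = recall(probe, W)
assert np.mean(recovered == patterns[0]) > 0.9   # overlap with stored pattern
```

The corrupted probe falls inside the basin of attraction of the stored pattern, so the dynamics clean it up — associative ("content-addressable") memory.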
Section 8.2: Restricted Boltzmann Machines
An RBM is a bipartite energy-based model with visible units \(\mathbf{v} \in \{0,1\}^n\) and hidden units \(\mathbf{h} \in \{0,1\}^m\), with energy:
\[ E(\mathbf{v}, \mathbf{h}) = -\mathbf{a}^T \mathbf{v} - \mathbf{b}^T \mathbf{h} - \mathbf{v}^T W \mathbf{h}. \]
The joint distribution is \(p(\mathbf{v}, \mathbf{h}) = Z^{-1} e^{-E(\mathbf{v},\mathbf{h})}\). Since the graph is bipartite, the conditional distributions factorize:
\[ p(h_j=1|\mathbf{v}) = \sigma\!\left(b_j + \sum_i v_i W_{ij}\right), \quad p(v_i=1|\mathbf{h}) = \sigma\!\left(a_i + \sum_j W_{ij} h_j\right). \]
Training maximizes the log-likelihood of the data:
\[ \mathcal{L} = \sum_{\mathbf{v} \in \mathcal{D}} \log p(\mathbf{v}) = \sum_{\mathbf{v} \in \mathcal{D}} \left[\log \sum_\mathbf{h} e^{-E(\mathbf{v},\mathbf{h})} - \log Z\right]. \]
The gradient involves the model’s own expected statistics:
\[ \nabla_{W_{ij}} \mathcal{L} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}. \]
The data term is tractable (the conditional \(p(\mathbf{h}|\mathbf{v})\) is exact); the model term requires MCMC sampling, approximated in practice by Contrastive Divergence.
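Contrastive Divergence (CD-1) replaces the intractable model expectation with statistics from a single Gibbs step started at the data. A sketch for a binary RBM; the layer sizes, batch size, and initialization below are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_gradient(v0, W, a, b, rng):
    # One step of Contrastive Divergence (CD-1) for a binary RBM.
    # Positive phase: exact conditional p(h | v) evaluated on the data batch
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step v0 -> h0 -> v1 -> h1
    pv1 = sigmoid(a + h0 @ W.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)
    # Gradient estimate: <v h>_data - <v h>_model, batch-averaged
    dW = (v0.T @ ph0 - v1.T @ ph1) / v0.shape[0]
    da = (v0 - v1).mean(axis=0)
    db = (ph0 - ph1).mean(axis=0)
    return dW, da, db

rng = np.random.default_rng(0)
n_vis, n_hid, batch = 6, 4, 32
W = 0.01 * rng.normal(size=(n_vis, n_hid))
a, b = np.zeros(n_vis), np.zeros(n_hid)
v0 = (rng.random((batch, n_vis)) < 0.5).astype(float)   # toy binary data batch
dW, da, db = cd1_gradient(v0, W, a, b, rng)
assert dW.shape == (n_vis, n_hid)
```

In a training loop one ascends these gradients (\(W \mathrel{+}= \eta\, dW\), etc.); running the Gibbs chain for \(k > 1\) steps (CD-\(k\)) gives a better but costlier estimate.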
Connection to physics: the RBM is formally identical to an Ising model on a bipartite graph. RBMs have been used as neural quantum states (Carleo & Troyer, 2017) to represent quantum many-body wave functions.
Chapter 9: Autoregressive Models
Section 9.1: Autoregressive Decomposition
Any joint distribution can be factored by the chain rule:
\[ p(\mathbf{x}) = p(x_1) \prod_{i=2}^n p(x_i | x_1, \ldots, x_{i-1}). \]
An autoregressive model parameterizes each conditional \(p_\theta(x_i \mid x_{<i})\) with a neural network; sampling then proceeds sequentially, one variable at a time.
PixelCNN: applies this idea to images, modeling pixels left-to-right and top-to-bottom using masked convolutions.
Autoregressive neural quantum states: model the wave function \(\psi(\mathbf{s}) = \langle \mathbf{s}|\psi\rangle\) of a quantum system using an autoregressive network. Enables direct sampling from \(|\psi(\mathbf{s})|^2\) without MCMC.
Chapter 10: Recurrent Neural Networks and Transformers
Section 10.1: Recurrent Neural Networks
A vanilla RNN processes a sequence \(\mathbf{x}_1, \ldots, \mathbf{x}_T\) by maintaining a hidden state:
\[ \mathbf{h}_t = \sigma(W_h \mathbf{h}_{t-1} + W_x \mathbf{x}_t + \mathbf{b}). \]
Long Short-Term Memory (LSTM): solves the vanishing gradient problem with gated memory cells that can maintain information across long sequences.
Section 10.2: The Transformer
The Transformer (Vaswani et al., 2017) processes sequences using self-attention rather than recurrence. The input embeddings are mapped by learned linear projections to queries, keys, and values \(Q, K, V\):
\[ \text{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V. \]
Self-attention computes pairwise interactions between all positions — \(O(n^2)\) complexity — but can be parallelized across the sequence dimension.
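Scaled dot-product attention fits in a few lines of numpy; the sequence length and dimensions below are illustrative.

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row is a distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 8, 4            # 5 positions, key dim 8, value dim 4
Q, K = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out, weights = attention(Q, K, V)
assert out.shape == (n, d_v)
assert np.allclose(weights.sum(axis=-1), 1.0)
```

Each output row is a convex combination of the value vectors, weighted by the (Boltzmann-like) softmax over query-key compatibilities.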
Physical analogy: self-attention is related to the interaction energy in a many-body system. The query-key dot product measures “compatibility” between positions, analogous to pairwise coupling in a spin model.
Chapter 11: Diffusion Models and Normalizing Flows
Section 11.1: Diffusion Models
Denoising Diffusion Probabilistic Models (DDPMs) (Ho et al., 2020) learn to reverse a gradual noising process.
Forward process: gradually add Gaussian noise to data over \(T\) steps:
\[ q(\mathbf{x}_t | \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t I). \]
After \(T\) steps, \(\mathbf{x}_T \approx \mathcal{N}(0, I)\).
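Because every step is Gaussian, the forward process has a closed form: \(\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}\) with \(\bar\alpha_t = \prod_{s \le t}(1-\beta_s)\) and \(\boldsymbol{\epsilon} \sim \mathcal{N}(0, I)\). A sketch (the linear schedule endpoints follow Ho et al.; the toy "data" are illustrative):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)         # alpha_bar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, rng):
    # Jump straight to step t: x_t = sqrt(ab_t) x0 + sqrt(1 - ab_t) eps
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=1000) * 3.0 + 1.0     # toy non-standard "data" distribution
xT = q_sample(x0, T - 1, rng)
# By the last step essentially all signal is destroyed: x_T is close to N(0, 1)
assert alpha_bar[-1] < 1e-4
assert abs(xT.mean()) < 0.2 and abs(xT.std() - 1.0) < 0.2
```

This closed form is what makes training efficient: one can sample a random \(t\), noise \(\mathbf{x}_0\) directly to \(\mathbf{x}_t\), and ask the network to predict \(\boldsymbol{\epsilon}\), without simulating the full chain.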
Reverse process: learn a neural network \(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\) that predicts the noise added at step \(t\). Sampling generates data by iteratively denoising from pure noise.
Connection to statistical mechanics: the forward diffusion process is an Ornstein-Uhlenbeck process (Langevin dynamics); the reverse process is trained to denoise, analogous to score-based generative models that estimate the score function \(\nabla_\mathbf{x} \log p(\mathbf{x})\).
Section 11.2: Normalizing Flows
Normalizing flows model a complex distribution \(p(\mathbf{x})\) by applying a sequence of invertible transformations \(f_1, \ldots, f_K\) to a simple base distribution \(p_0(\mathbf{z})\):
\[ \mathbf{x} = f_K \circ \cdots \circ f_1(\mathbf{z}), \quad \mathbf{z} \sim p_0(\mathbf{z}). \]
By the change-of-variables formula, with intermediate variables \(\mathbf{z}_k = f_k \circ \cdots \circ f_1(\mathbf{z})\) and \(\mathbf{z}_0 = \mathbf{z}\):
\[ \log p(\mathbf{x}) = \log p_0(\mathbf{z}) - \sum_{k=1}^K \log\left|\det \frac{\partial f_k}{\partial \mathbf{z}_{k-1}}\right|. \]
The Jacobian determinant must be efficiently computable; this is achieved with affine coupling layers (RealNVP, Glow). Flows give exact likelihoods and allow exact inference, in contrast to VAEs and GANs.
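The change-of-variables formula can be checked on the simplest invertible map, a 1D affine transformation; the parameter values are arbitrary.

```python
import numpy as np

# One invertible affine layer: x = a * z + b, with log|det J| = log|a|
a, b = 2.0, 1.0

def flow_logp(x):
    # Change of variables: log p(x) = log p0(z) - log|df/dz|, with z = (x - b)/a
    z = (x - b) / a
    log_p0 = -0.5 * (z**2 + np.log(2 * np.pi))   # standard normal base density
    return log_p0 - np.log(abs(a))

# The pushforward of N(0, 1) through x = a z + b is N(b, a^2); compare densities
x = np.array([0.0, 1.0, 3.0])
analytic = -0.5 * ((x - b)**2 / a**2 + np.log(2 * np.pi * a**2))
assert np.allclose(flow_logp(x), analytic)
```

Coupling layers play the same role in high dimensions: they make \(\log|\det J|\) a cheap sum of per-dimension terms while keeping the map expressive.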
Chapter 12: Machine Learning for Quantum Physics
Section 12.1: Neural Quantum States
Carleo and Troyer (2017) proposed representing the many-body wave function \(\Psi(\mathbf{s})\) using a neural network. For a spin-1/2 system with \(N\) spins, the Hilbert space has \(2^N\) dimensions — exponentially large. A neural network can represent approximate ground states compactly.
The Restricted Boltzmann Machine ansatz:
\[ \Psi(\mathbf{s}) = e^{\sum_i a_i s_i} \prod_j \cosh\!\left(b_j + \sum_i W_{ij} s_i\right). \]
Training minimizes the variational energy \(E = \langle \Psi | H | \Psi \rangle / \langle \Psi | \Psi \rangle\) using stochastic reconfiguration (a natural gradient method adapted for quantum wave functions).
Section 12.2: Machine Learning for Phase Transitions
Supervised: train a CNN to classify spin configurations as ordered/disordered. The classifier’s output (confidence) serves as an order parameter, automatically locating the phase transition.
Unsupervised: apply PCA or autoencoders to spin configurations across temperatures. The dominant principal component captures the order parameter; anomalous changes in the latent representation signal phase transitions.
Confusion scheme (van Nieuwenburg et al., 2017): train a binary classifier to distinguish “low T” and “high T” configurations while varying the boundary. The classification accuracy as a function of boundary position peaks at the critical temperature.
Section 12.3: Quantum Machine Learning
Variational quantum circuits: parameterized quantum circuits trained with classical optimizers (variational quantum eigensolver, QAOA). Related to deep learning on quantum hardware.
Barren plateaus: a fundamental obstacle — the gradients of variational quantum circuits vanish exponentially with system size (in analogy with vanishing gradients in deep networks), making training intractable at large scales.