AMATH 496: AI and Scientific Discovery
Estimated study time: 1 hr 9 min
Readings and resources
- Jumper, John, et al. “Highly Accurate Protein Structure Prediction with AlphaFold.” Nature 596 (2021): 583–589.
- Abramson, Josh, et al. “Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3.” Nature 630 (2024): 493–500.
- Davies, Alex, et al. “Advancing Mathematics by Guiding Human Intuition with AI.” Nature 600 (2021): 70–74.
- Romera-Paredes, Bernardino, et al. “Mathematical Discoveries from Program Search with Large Language Models.” Nature 625 (2024): 468–475. [FunSearch]
- Trinh, Trieu H., et al. “Solving Olympiad Geometry without Human Demonstrations.” Nature 625 (2024): 476–482. [AlphaGeometry]
- Karniadakis, George Em, et al. “Physics-Informed Machine Learning.” Nature Reviews Physics 3 (2021): 422–440.
- Raissi, Maziar, Paris Perdikaris, and George Em Karniadakis. “Physics-Informed Neural Networks: A Deep Learning Framework for Solving Forward and Inverse Problems Involving Nonlinear PDEs.” Journal of Computational Physics 378 (2019): 686–707.
- Li, Zongyi, et al. “Fourier Neural Operator for Parametric Partial Differential Equations.” ICLR 2021.
- Lu, Lu, Pengzhan Jin, and George Em Karniadakis. “DeepONet: Learning Nonlinear Operators for Identifying Differential Equations Based on the Universal Approximation Theorem of Operators.” Nature Machine Intelligence 3 (2021): 218–229.
- Lam, Remi, et al. “Learning Skillful Medium-Range Global Weather Forecasting.” Science 382 (2023): 1416–1421. [GraphCast]
- Bi, Kaifeng, et al. “Accurate Medium-Range Global Weather Forecasting with 3D Neural Networks.” Nature 619 (2023): 533–538. [Pangu-Weather]
- Merchant, Amil, et al. “Scaling Deep Learning for Materials Discovery.” Nature 624 (2023): 80–85. [GNoME]
- Stokes, Jonathan M., et al. “A Deep Learning Approach to Antibiotic Discovery.” Cell 180 (2020): 688–702.
- Cranmer, Kyle, Johann Brehmer, and Gilles Louppe. “The Frontier of Simulation-Based Inference.” PNAS 117 (2020): 30055–30062.
- Lu, Chris, et al. “The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery.” arXiv:2408.06292, 2024.
- Online resources: MIT 6.S898 (AI for Science) lecture notes; Stanford AA 290 (Scientific Machine Learning) materials; Cambridge AI4Science programme; DeepMind AI for Science blog; Climate Change AI (climatechange.ai) workshop proceedings.
Chapter 1: AI as a Scientific Instrument
Every major expansion of scientific knowledge has been preceded or enabled by a new kind of instrument. The telescope did not merely magnify what the naked eye could already see; it revealed an entirely new stratum of observable phenomena — moons of Jupiter, the phases of Venus, the resolution of the Milky Way into discrete stars — and thereby forced a reorganisation of theory. The microscope extended the boundary of the visible in the opposite direction, inaugurating microbiology and cell theory as disciplines that could not have been conceptualised without the observational substrate the instrument provided. Particle accelerators extended this logic into regimes inaccessible to direct perception at any scale, generating particle tracks whose interpretation required new theoretical frameworks. In each case the instrument did not replace the scientist but enlarged the domain of phenomena over which scientific reasoning could operate, and in doing so it changed what scientific reasoning was required to be.
The claim motivating this course is that machine learning constitutes a new kind of scientific instrument — one that is structurally different from its predecessors in ways that matter epistemologically. A telescope extends human visual perception along a single spatial axis; a particle accelerator converts relativistic kinetic energy into detectable decay products. Both of these instruments produce data that human cognition, suitably trained, can inspect and interpret directly. Machine learning systems, by contrast, find structure in data whose dimensionality and volume lie beyond the reach of direct human inspection. A graph neural network trained on millions of molecular structures does not present the scientist with a magnified image of a molecule; it produces a latent representation in a high-dimensional space whose geometric structure encodes patterns that no human analyst extracted or even specified in advance. This distinction — between instruments that extend perception and instruments that find patterns beyond cognitive reach — is the first analytic distinction this course will maintain throughout.
Within the broad category of AI applied to science, there is a further distinction that carries significant weight: the difference between automation (自动化) and structurally novel discovery. Automation in science refers to the use of AI to perform tasks that scientists previously performed manually but that do not in themselves generate new scientific knowledge — classifying galaxy morphologies from telescope images, extracting named entities from literature, screening compound libraries for binding affinity by exhaustively applying a known scoring function. These applications are valuable and often transformative in scale, but they do not change the logical structure of scientific inquiry; they compress the time between hypothesis and observation. What is structurally different, and what makes several of the systems examined in this course philosophically interesting, is AI that generates candidate hypotheses from combinatorially large spaces that are simply inaccessible to exhaustive human search. When FunSearch identifies a new construction for the cap set problem by searching program space, or when AlphaFold generates a structural prediction for a sequence with no known homologues, the system is not automating a task a human could have performed more slowly; it is doing something that could not, as a practical matter, have been done at all.
Cranmer and colleagues frame one central domain of AI-for-science problems under the concept of simulation-based inference (模拟推断). The setting is one that arises repeatedly across the natural sciences: a forward model — a simulator, a physical theory, a computational model — is known and can be evaluated, but the inverse problem is intractable analytically. Given observed data \( x \), one wishes to infer the parameter vector \( \theta \) that generated it, which in the probabilistic frame means computing the posterior \( p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta) \). When the likelihood \( p(x \mid \theta) \) cannot be written in closed form — as is the case for gravitational wave signals, high-energy particle collision events, and epidemiological time series from stochastic agent models — classical Bayesian inference cannot be directly applied. Cranmer et al. show that neural networks can learn to approximate either the likelihood, the likelihood ratio, or the posterior directly from simulation runs, providing tractable approximate inference over parameters. The forward model remains the scientific theory; the ML system learns to invert it.
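The training principle behind neural simulation-based inference can be illustrated with a minimal sketch. The example below is not from Cranmer et al.: it assumes a toy simulator (observation equals parameter plus Gaussian noise) and a simple Gaussian family for the learned posterior; practical toolkits such as the `sbi` package use more expressive conditional density estimators (normalising flows), but the idea of training on simulated parameter–observation pairs is the same.

```python
# Minimal sketch of neural posterior estimation for simulation-based inference.
# Assumptions (illustrative only): a toy simulator and a Gaussian posterior family.
import torch
import torch.nn as nn

def simulator(theta):
    """Toy forward model: observation is the parameter plus Gaussian noise."""
    return theta + 0.5 * torch.randn_like(theta)

class GaussianPosteriorNet(nn.Module):
    """Maps an observation x to the mean and log-std of q(theta | x)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))
    def forward(self, x):
        mu, log_sigma = self.net(x).chunk(2, dim=-1)
        return mu, log_sigma

net = GaussianPosteriorNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    theta = 2.0 * torch.randn(256, 1)       # draw parameters from the prior
    x = simulator(theta)                    # run the simulator (sampling is all we need)
    mu, log_sigma = net(x)
    # Maximise log q(theta | x): the network learns to recover the parameter
    # that generated each simulated observation.
    loss = (log_sigma + 0.5 * ((theta - mu) / log_sigma.exp()) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Amortised inference: for a new observation, an approximate posterior is available instantly.
x_obs = torch.tensor([[1.3]])
mu, log_sigma = net(x_obs)
print(f"q(theta | x_obs): mean = {mu.item():.2f}, std = {log_sigma.exp().item():.2f}")
```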
The epistemological question this raises is not easily dismissed. Does a pattern found by a neural network constitute scientific knowledge if no human understands why the pattern holds? This question becomes acute in a specific way in science that it does not in engineering: engineering design often requires only that a system be reliable and efficient, whereas science has traditionally required that its products be explanatory — that they provide not merely a correct answer but an account of why the answer is correct. The Rissanen minimum description length (MDL) tradition offers one framing: learning is compression, and a model that compresses data efficiently has found genuine structure in it. But compression and explanation are not the same thing. A neural network that accurately predicts protein structure from sequence has found a compressive mapping between sequence and structure, and this is scientifically significant; but the physical mechanism by which amino acid sequence determines three-dimensional conformation is not contained in the mapping as such, and recovering that mechanism requires additional interpretive work that the network does not provide.
The tension between predictive accuracy and mechanistic understanding runs through every domain treated in this course and will be revisited from different angles in each chapter. In Chapter 9, the tension is examined philosophically in the tradition of Popper, Duhem, and Kuhn. Here it is worth registering its practical form: research programmes in AI for science increasingly distinguish between surrogate models (代理模型), which are valued precisely for predictive accuracy and computational speed without any claim to mechanistic transparency, and scientific machine learning in the stricter sense, which insists that ML-derived models be interpretable, physically consistent, or expressible in symbolic form. Both modes coexist in current practice, but they represent genuinely different commitments about what science is for.
A further structural feature of AI as a scientific instrument deserves note before proceeding. Classical instruments are passive amplifiers or transducers: they do not themselves generate hypotheses, and their outputs are, in principle, theory-neutral (though the history and philosophy of science since Hanson has complicated this claim). Machine learning systems, by contrast, are trained on existing scientific data and therefore inherit the biases, gaps, and theoretical commitments embedded in their training corpora. AlphaFold’s training data consists of experimentally determined protein structures, which represent a biased sample of protein space skewed toward water-soluble, stable, crystallisable proteins. GNoME’s training data consists of structures reported in the Inorganic Crystal Structure Database, which reflects the history of experimental inorganic chemistry. A scientific instrument that inherits the biases of the knowledge it was trained on is a structurally different kind of instrument than a telescope, which has no prior commitments about what the sky contains. This is not an objection to AI as a scientific instrument — it is a characteristic that must be understood and managed — but it marks a genuine difference in kind.
Chapter 2: AI for Mathematics — Conjecture, Proof, and Intuition
Mathematics occupies a unique position among the disciplines considered in this course. The empirical sciences seek to model a world that exists independently of human cognition and that is, in principle, always capable of generating surprising new observations; mathematics seeks necessary truths derivable by finite chains of valid inference from axioms. If machine learning is valuable in the empirical sciences because it finds patterns in large, noisy datasets, it is far less obvious what role it could play in a discipline whose objects are formal and whose standards of justification are deductive proof. The last decade has produced a sufficiently rich body of results that this scepticism requires revision, though the precise nature and limits of AI involvement in mathematics remain genuinely contested.
Two projects that must be carefully distinguished are automated theorem proving (自动定理证明, ATP) and AI-assisted discovery of mathematical conjectures. ATP is in principle a search problem: given a set of axioms and inference rules, find a derivation tree whose root is the target theorem. Systems such as E and Vampire use resolution- and superposition-based search to find first-order proofs automatically, while SMT solvers such as Z3 combine propositional search with decision procedures for background theories; the Lean theorem prover and its library Mathlib have formalised large bodies of advanced mathematics, and the Xena Project led by Kevin Buzzard at Imperial College London has been systematically formalising undergraduate and graduate mathematics in Lean 4, with the stated goal of making it possible to check any publishable mathematical proof by machine. The significance of Lean for AI-assisted mathematics is not that Lean proves theorems automatically — it does not — but that it provides a formal verification infrastructure against which AI-generated proof steps can be checked, enabling a productive division of labour: the AI proposes, the proof assistant verifies.
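The "propose, then verify" division of labour can be made concrete with a deliberately trivial Lean 4 example (not drawn from any of the systems discussed here): the statement is fixed in advance, and whatever proof term is supplied, however it was generated, either type-checks against the kernel or is rejected.

```lean
-- Minimal Lean 4 illustration of "propose, then verify": the statement below is fixed;
-- any candidate proof term, however it was produced, is checked mechanically by the
-- kernel. Here the proposed proof is a single application of a core library lemma.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```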
Davies and colleagues (Nature 2021) represent a different modality: using machine learning not to prove theorems but to find patterns in mathematical data that guide human mathematicians toward new conjectures. The experimental design is, in retrospect, elegant. A supervised ML model is trained to predict a mathematical invariant from a high-dimensional feature vector derived from mathematical objects; the model’s internal attribution scores (using gradient-based saliency) are then used to identify which features are most predictive. This attribution does not prove anything — it is a pattern in data — but it provides a mathematician with a hypothesis about which quantities to look for a relationship between. In two domains — knot theory and representation theory — the patterns identified by the ML model led to new theorems that were subsequently proved by human mathematicians using conventional proof techniques. In knot theory, the result was a relationship between the signature of a knot and its hyperbolic geometric invariants; in representation theory, it was a new understanding of the structure of Kazhdan–Lusztig polynomials. The ML system did not prove either result; it found a signal that made the conjectures plausible enough to be worth the human effort of proof.
AlphaGeometry (Trinh et al., Nature 2024) operates in a more constrained but technically impressive setting: solving International Mathematical Olympiad geometry problems. The architecture combines two components: a language model trained on synthetic geometry theorems that proposes auxiliary constructions — adding a new point, drawing a new line, defining a circle — and a symbolic deduction engine (a deductive database coupled with algebraic reasoning) that checks whether a chain of deductions from the proposed constructions reaches the target statement. The critical insight is that IMO geometry problems often require a non-obvious auxiliary construction without which the solution appears unreachable; once the right construction is found, the deductive chain is relatively mechanical. The LM learns to propose useful constructions by training on 100 million synthetic theorem–proof pairs generated by a process that randomly instantiates geometric configurations and records which statements can be derived. On a benchmark of 30 geometry problems adapted from past IMO competitions, AlphaGeometry solved 25, compared with 10 for the strongest previous symbolic approach (based on Wu's method). Human gold medallists solve, on average, about 25 of the same 30 problems, placing AlphaGeometry at roughly gold-medal level on this benchmark.
AlphaProof (DeepMind, 2024) extends the domain from geometry to the full breadth of IMO mathematics, using a reinforcement learning system that generates and evaluates proof attempts in Lean 4. At the 2024 International Mathematical Olympiad, AlphaProof (combined with AlphaGeometry 2) solved 4 of 6 problems, including Problem 6, a difficult functional-equation problem that only 5 of 609 human contestants solved during the competition. The total score of 28 points corresponds to silver-medal standard — the first time any AI system has achieved such a result on the IMO. The RL training loop works as follows: AlphaProof generates proof attempts in Lean, which are verified by the proof checker; successful proofs become training examples that improve the policy; failed attempts provide learning signal through the reward structure. The self-improvement dynamic — solving easier problems, adding them to the training set, then attempting harder problems — resembles in structure the curriculum learning used in game-playing RL agents such as AlphaGo.
FunSearch (Romera-Paredes et al., Nature 2024) takes a third approach, framing mathematical discovery as a program search problem. The system uses a large language model as a mutation operator over a population of programs that compute mathematical objects; programs are evaluated by executing them and scoring their output against a mathematical objective function; programs that score highly are retained and used as context for generating further mutations. Applied to the cap set problem (帽集问题) — finding large subsets of \( \mathbb{F}_3^n \) containing no three-term arithmetic progression — FunSearch found a construction in dimension \( n = 8 \) larger than any previously known, the largest improvement in the size of known cap sets in two decades; it also found constructions that improve the asymptotic lower bound on the cap set capacity. Applied to online bin packing, FunSearch discovered heuristics that outperform standard baselines such as first fit and best fit on the benchmark distributions tested. The significance is that the outputs are programs that can be inspected and understood by human mathematicians, making the discovery interpretable in a way that a trained neural network is not.
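The structure of the search loop can be sketched in a few lines. In the sketch below the cap set evaluator is real (it executes a candidate program and checks the set it builds), while `llm_propose` is a hypothetical placeholder for the language-model call that the published system supplies; none of the names or interfaces below come from the FunSearch codebase.

```python
# Schematic FunSearch-style loop: LLM as mutation operator, executable evaluator as ground truth.
def score_capset_program(program_source: str, n: int = 8) -> float:
    """Run the candidate, which must define build_capset(n) -> list of vectors over F_3^n;
    return the set's size if it is a valid cap set, else 0."""
    env: dict = {}
    try:
        exec(program_source, env)
        s = {tuple(v) for v in env["build_capset"](n)}
    except Exception:
        return 0.0
    for a in s:
        for b in s:
            if a == b:
                continue
            c = tuple((-x - y) % 3 for x, y in zip(a, b))
            if c in s and c != a and c != b:
                return 0.0              # three distinct points on a line: not a cap set
    return float(len(s))

def llm_propose(parents: list[str]) -> str:
    """Hypothetical placeholder: prompt a language model with the best programs so far
    and return a rewritten candidate program."""
    raise NotImplementedError

def funsearch(seed: str, iterations: int = 1000, keep: int = 20):
    pool = [(score_capset_program(seed), seed)]
    for _ in range(iterations):
        best = [src for _, src in sorted(pool, key=lambda t: t[0], reverse=True)[:3]]
        child = llm_propose(best)                        # LLM acts as the mutation operator
        pool.append((score_capset_program(child), child))
        pool = sorted(pool, key=lambda t: t[0], reverse=True)[:keep]
    return pool[0]                                        # highest-scoring program found
```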
The Ramanujan Machine (developed by a group at the Technion) represents yet another approach: automated discovery of continued-fraction conjectures about mathematical constants such as \( \pi \), \( e \), and the Catalan constant. The system uses algorithmic search (meet-in-the-middle matching and gradient-descent-style optimisation in its original versions) over the space of continued-fraction expressions to conjecture identities, which are then submitted for human verification. Several of its conjectures have subsequently been proved. The system is named for Srinivasa Ramanujan, whose prodigious and largely unexplained capacity for conjecturing correct formulas about mathematical constants has long attracted philosophical interest; the Ramanujan Machine raises in acute form the question of whether conjecture-generation can be separated from insight.
The philosophical question raised by these results is whether AlphaGeometry’s derivation or AlphaProof’s Lean proof constitutes a proof in the epistemically meaningful sense. A Lean proof is, by construction, formally valid: it is a term of the correct type in the dependent type theory underlying Lean 4, and its validity is checkable mechanically. But mathematical proof has historically served functions beyond formal validity: it provides understanding, illuminates structure, enables generalisation, and generates intuition about why a result is true. AlphaGeometry’s proof of an IMO geometry problem may pass formal verification while remaining opaque to any human who wishes to understand the geometry. The question of what mathematical understanding consists in — and whether a formally valid derivation produced by a pattern-matching system constitutes understanding — connects directly to the arguments examined in PHIL 459b (Philosophy of Artificial Intelligence), where the concept of understanding as distinct from competent behaviour is examined through the frameworks of Searle, Dreyfus, and their contemporary interlocutors.
Chapter 3: Physics-Informed Neural Networks and Neural Operators
Classical numerical methods for partial differential equations represent one of the most mature fields of applied mathematics, with a history reaching back through spectral methods (Gottlieb and Orszag, 1977), finite element analysis (Strang and Fix, 1973), and finite difference methods to the foundational contributions of Courant, Friedrichs, and Lewy in the 1920s. The computational cost of these methods scales poorly with problem complexity: a finite element discretisation of the three-dimensional Navier–Stokes equations at turbulence-resolving resolution requires meshes with \( O(10^{12}) \) degrees of freedom, and explicit time-integration schemes impose CFL stability conditions that further multiply the number of required timesteps. For many applications — real-time flow simulation, many-query optimisation, uncertainty quantification requiring thousands of forward evaluations — classical methods are computationally prohibitive, and the field has long sought principled approximations.
Machine learning offers two structurally distinct responses to this challenge, and maintaining the distinction between them is important for understanding what each can and cannot offer. The first paradigm, physics-informed neural networks (物理信息神经网络, PINNs), treats the neural network as an ansatz for the solution function itself. The key insight, due to Raissi, Perdikaris, and Karniadakis (2019), is that a neural network \( u_\theta(x, t) \) parameterised by weights \( \theta \) can be trained to satisfy a PDE by including the PDE residual in the training loss. For a PDE of the form \( \mathcal{N}[u](x,t) = f(x,t) \) on domain \( \Omega \), with boundary condition \( u = g \) on \( \partial\Omega \), the PINN loss takes the form
\[ \mathcal{L}(\theta) = \mathcal{L}_\text{data}(\theta) + \lambda_r \mathcal{L}_\text{PDE}(\theta) + \lambda_b \mathcal{L}_\text{BC}(\theta), \]where \( \mathcal{L}_\text{data} \) measures fit to any available observational data, \( \mathcal{L}_\text{PDE} = \frac{1}{N_r} \sum_{i=1}^{N_r} |\mathcal{N}[u_\theta](x_i, t_i) - f(x_i, t_i)|^2 \) is the mean-squared PDE residual evaluated at a set of collocation points (配置点) sampled from the interior of the domain, and \( \mathcal{L}_\text{BC} \) enforces boundary conditions. The PDE residual is evaluated using automatic differentiation (自动微分), which computes the exact partial derivatives of the neural network output with respect to its inputs. No discretisation of the domain is required; the collocation points can be placed wherever the PDE is expected to be most difficult to satisfy.
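A minimal sketch of this loss structure, assuming a manufactured 1D Poisson problem \( u''(x) = f(x) \) on \([0,1]\) with \( u(0) = u(1) = 0 \) and exact solution \( \sin(\pi x) \), is shown below. The architecture, collocation sampling, and unit loss weights are illustrative choices, not the settings of Raissi et al.

```python
# Minimal PINN sketch: PDE residual via automatic differentiation + boundary penalty.
import math
import torch
import torch.nn as nn

u_net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(),
                      nn.Linear(32, 32), nn.Tanh(),
                      nn.Linear(32, 1))
opt = torch.optim.Adam(u_net.parameters(), lr=1e-3)

def f(x):
    # Manufactured source term; the exact solution of u'' = f with zero BCs is sin(pi x).
    return -math.pi ** 2 * torch.sin(math.pi * x)

for step in range(5000):
    x = torch.rand(128, 1, requires_grad=True)            # interior collocation points
    u = u_net(x)
    du = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    d2u = torch.autograd.grad(du, x, torch.ones_like(du), create_graph=True)[0]
    loss_pde = ((d2u - f(x)) ** 2).mean()                  # mean-squared PDE residual
    xb = torch.tensor([[0.0], [1.0]])
    loss_bc = (u_net(xb) ** 2).mean()                      # enforce u = 0 on the boundary
    loss = loss_pde + loss_bc                              # unit weights, for illustration only
    opt.zero_grad(); loss.backward(); opt.step()
```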
The forward-problem PINN is essentially a PDE-constrained regression; its greater power lies in the inverse problem. If the governing PDE contains unknown parameters — a viscosity coefficient, a reaction rate, a source term — then those parameters can be incorporated into the trainable parameter set \( \theta \), and the PINN trained on sparse observational data will simultaneously reconstruct the solution field and infer the unknown physical coefficients. Karniadakis and colleagues have applied this to inferring blood viscosity from velocity measurements in cardiovascular flows, identifying hidden sources in groundwater transport, and recovering turbulent velocity fields from pressure measurements, all in settings where classical inverse methods would be severely ill-conditioned.
The second paradigm, neural operators (神经算子), is structurally different in a way that matters for practical use. A PINN, once trained, represents a single solution for a specific choice of initial conditions, boundary conditions, and PDE parameters; solving for a different set of conditions requires retraining from scratch. A neural operator, by contrast, learns a mapping between function spaces: it maps an input function (an initial condition, a forcing term, a coefficient field) to an output function (the corresponding solution), and having learned this mapping, it can evaluate it at a new input function without retraining. This is the regime of parametric PDEs (参数化偏微分方程): one wishes to solve the same PDE family repeatedly under varying inputs.
The theoretical foundation for neural operators is the universal approximation theorem for operators, established by Chen and Chen (1995) and revisited by Lu, Jin, and Karniadakis in the DeepONet (深度算子网络) framework (Nature Machine Intelligence 2021). The key architectural idea in DeepONet is a separation between a branch net (分支网络) that encodes the input function evaluated at a fixed set of sensor points, and a trunk net (主干网络) that encodes the query location at which the output function is to be evaluated. The operator output is the inner product of the branch and trunk representations:
\[ \mathcal{G}_\theta(u)(y) = \sum_{k=1}^p b_k(u(x_1), \ldots, u(x_m)) \cdot t_k(y), \]where \( b_k \) are outputs of the branch net and \( t_k \) are outputs of the trunk net. This architecture is provably a universal approximator of continuous nonlinear operators between Banach spaces, under mild conditions.
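A minimal forward pass with this branch–trunk structure is sketched below; the layer sizes, sensor count, and query batching are assumptions made for illustration, not the published DeepONet configuration.

```python
# Minimal DeepONet-style forward pass: inner product of branch and trunk features.
import torch
import torch.nn as nn

class TinyDeepONet(nn.Module):
    def __init__(self, m_sensors: int = 100, p: int = 64):
        super().__init__()
        # Branch net: encodes the input function sampled at m fixed sensor points.
        self.branch = nn.Sequential(nn.Linear(m_sensors, 128), nn.Tanh(), nn.Linear(128, p))
        # Trunk net: encodes the query location y at which G(u) is evaluated.
        self.trunk = nn.Sequential(nn.Linear(1, 128), nn.Tanh(), nn.Linear(128, p))

    def forward(self, u_sensors, y):
        # u_sensors: (batch, m_sensors); y: (batch, n_query, 1)
        b = self.branch(u_sensors)                  # (batch, p)
        t = self.trunk(y)                           # (batch, n_query, p)
        return torch.einsum("bp,bqp->bq", b, t)     # sum_k b_k(u) * t_k(y)

model = TinyDeepONet()
u = torch.randn(8, 100)        # 8 input functions, each sampled at 100 sensors
y = torch.rand(8, 50, 1)       # 50 query locations per input function
out = model(u, y)              # (8, 50): the operator output at the query points
```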
The Fourier Neural Operator (傅里叶神经算子, FNO, Li et al., ICLR 2021) achieves the same function-space mapping goal through a different mechanism, one particularly well-suited to problems on regular grids. The FNO applies a sequence of layers each of which applies a local linear transformation (convolution in the physical domain) and a global convolution parameterised in Fourier space. The key observation is that a kernel integral operator
\[ (\mathcal{K} u)(x) = \int_D \kappa(x - y)\, u(y)\, dy \]can be evaluated in Fourier space as a pointwise multiplication: \( \widehat{(\mathcal{K} u)}(\xi) = \widehat{\kappa}(\xi) \cdot \widehat{u}(\xi) \), which by the convolution theorem takes only \( O(n \log n) \) operations using the FFT. The FNO learns the Fourier-space kernel \( \widehat{\kappa} \) directly as a trainable parameter, truncating to the lowest \( k_\text{max} \) Fourier modes. This architecture is discretisation-invariant (离散化不变): a model trained on a \( 64 \times 64 \) grid can be evaluated on a \( 256 \times 256 \) grid at inference time without architectural change, because the Fourier transform adapts naturally to different resolutions. On the Navier–Stokes benchmark with viscosity \( \nu = 10^{-3} \), FNO achieves accuracy comparable to classical pseudo-spectral methods at three orders of magnitude lower computational cost.
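The essential mechanism, written for a single 1D layer, is sketched below: transform to Fourier space, multiply the lowest \( k_\text{max} \) modes by learned complex weights, transform back, and add a pointwise local path. This is a simplified sketch; the published FNO uses 2D/3D transforms, lifting and projection layers, and different initialisation.

```python
# Sketch of one 1D Fourier layer: learned kernel applied to truncated Fourier modes.
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    def __init__(self, channels: int, k_max: int):
        super().__init__()
        self.k_max = k_max
        scale = 1.0 / channels
        # Learned complex Fourier-space kernel on the lowest k_max modes.
        self.weights = nn.Parameter(
            scale * torch.randn(channels, channels, k_max, dtype=torch.cfloat))
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)   # local linear path

    def forward(self, x):
        # x: (batch, channels, n_grid); works for any grid with n_grid // 2 + 1 >= k_max.
        x_hat = torch.fft.rfft(x, dim=-1)                                # to Fourier space
        out_hat = torch.zeros_like(x_hat)
        out_hat[:, :, :self.k_max] = torch.einsum(
            "bik,iok->bok", x_hat[:, :, :self.k_max], self.weights)      # pointwise multiplication
        x_spec = torch.fft.irfft(out_hat, n=x.size(-1), dim=-1)          # back to physical space
        return torch.relu(x_spec + self.pointwise(x))

layer = SpectralConv1d(channels=16, k_max=12)
u64 = layer(torch.randn(4, 16, 64))     # same weights evaluate on a 64-point grid...
u256 = layer(torch.randn(4, 16, 256))   # ...or a 256-point grid (discretisation-invariant)
```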
The applications of PINNs and neural operators span a remarkable range of physical domains. In turbulence modelling, neural operators have been used as closure models (闭合模型) that approximate the subgrid-scale stress tensor in large-eddy simulation, achieving turbulence statistics that match direct numerical simulation at a fraction of the cost. In seismology, PINNs have been used to infer subsurface velocity models from surface seismic measurements, formulated as the inverse Helmholtz equation. In cardiovascular medicine, both PINNs and FNO-based operators have been applied to patient-specific blood flow simulation, where the geometry of a patient’s vasculature is derived from medical imaging and the goal is rapid estimation of wall shear stress relevant to aneurysm risk. Climate emulation is a particularly important application domain: training a neural operator on outputs from a comprehensive general circulation model, then using the operator as a fast emulator for uncertainty quantification and parameter sensitivity analysis, is a key strategy in the ClimSim project.
Limitations of the two paradigms are complementary in an instructive way. PINNs struggle on problems with sharp gradients, boundary layers, and stiff ODEs/PDEs, where the gradient of the loss landscape becomes poorly conditioned and optimisation stalls. The choice of collocation point distribution and loss weighting \( \lambda_r, \lambda_b \) is highly problem-dependent and currently requires manual tuning. Neural operators require large training corpora of input–output function pairs, which must either be generated by classical solvers (expensive) or obtained from experimental measurements (sparse). Both paradigms face distribution shift (分布偏移): a model trained on a family of initial conditions performs poorly when evaluated on initial conditions drawn from a different distribution, which limits reliability in extrapolation regimes.
Chapter 4: AI in Physics — From Colliders to Cosmology
High-energy physics at the Large Hadron Collider presents a data problem of a scale that has no parallel in most scientific disciplines. Each proton–proton collision event at the LHC generates approximately 1 megabyte of raw detector data; collisions occur at 40 MHz, producing approximately 40 terabytes per second. The LHC's data acquisition system must decide in real time, using only partial detector information, which events are worth recording for offline analysis; roughly one collision in a million survives the selection. This trigger system (触发系统) is one of the earliest and largest-scale deployments of machine learning in frontier physics: boosted decision trees and later neural networks were introduced into the trigger pipeline to improve signal efficiency over the cut-based selection algorithms they replaced.
The technical vocabulary of LHC physics introduces several ML problems of intrinsic interest. Jet tagging (喷注标记) refers to the classification of collimated sprays of particles (jets) by their origin: a top-quark jet, a bottom-quark jet (b-jet), a gluon jet, and a jet initiated by a boosted Higgs boson decaying to two b-quarks all have distinct but subtle substructure signatures. Early ML approaches used hand-engineered jet substructure variables (transverse momentum ratios, N-subjettiness) as inputs to gradient-boosted classifiers. More recent approaches treat the jet as a variable-length set of particle four-momenta and apply graph neural networks (图神经网络, GNNs) that learn jet-substructure features end-to-end: each particle is a node, edges connect particles within a cone or by k-nearest neighbour in momentum space, and message-passing aggregation builds global representations. ParticleNet (Qu and Gouskos, 2019) and similar architectures achieve substantially better b-tagging and top-tagging performance than their hand-engineered predecessors.
Particle track reconstruction (粒子径迹重建) is computationally the most demanding component of LHC event reconstruction. Each event contains \( O(10^5) \) silicon pixel hits from charged particle tracks traversing the detector; associating hits to tracks is a combinatorial assignment problem whose difficulty scales super-linearly with instantaneous luminosity. The TrackML challenge (2018) specifically benchmarked ML approaches against this problem; several graph-network-based track finders subsequently demonstrated performance competitive with classical Kalman-filter-based approaches at a fraction of the CPU cost.
Simulation-based inference, as described in Chapter 1, has found some of its most sophisticated applications in particle physics parameter estimation. At the LHC, one wishes to measure the values of Standard Model parameters — coupling constants, masses, mixing angles — from the distribution of observed collision events. The likelihood function is a high-dimensional integral over the phase space of unobserved final-state particles, evaluable only approximately by Monte Carlo simulation. Cranmer and colleagues have developed a programme of likelihood-ratio estimators that use neural networks to learn the ratio \( p(x \mid \theta_0) / p(x \mid \theta_1) \) from simulated event pairs, enabling efficient frequentist inference over parameters without explicitly computing the intractable likelihood. This has been applied to Higgs coupling measurements and searches for new physics beyond the Standard Model.
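The classifier-based likelihood-ratio trick can be demonstrated with a toy stand-in for the event generator; the Gaussian "simulator" below is an illustrative assumption, not an LHC tool. A classifier trained to distinguish events simulated at \( \theta_0 \) from events simulated at \( \theta_1 \) recovers the likelihood ratio from its output probability.

```python
# Sketch of the likelihood-ratio trick: a binary classifier between two simulated
# datasets yields r(x) = p(x | theta0) / p(x | theta1) = (1 - s(x)) / s(x),
# where s is the classifier's probability of the theta1 class.
import torch
import torch.nn as nn

def simulate(theta, n):
    """Toy 'event generator': observable x ~ N(theta, 1). Illustrative stand-in only."""
    return theta + torch.randn(n, 1)

clf = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

theta0, theta1 = 0.0, 1.0
for step in range(3000):
    x0, x1 = simulate(theta0, 256), simulate(theta1, 256)
    x = torch.cat([x0, x1])
    labels = torch.cat([torch.zeros(256, 1), torch.ones(256, 1)])   # 1 = simulated at theta1
    loss = bce(clf(x), labels)
    opt.zero_grad(); loss.backward(); opt.step()

x_test = torch.linspace(-2.0, 3.0, 6).unsqueeze(1)
s = torch.sigmoid(clf(x_test))
ratio = (1 - s) / s            # estimate of p(x | theta0) / p(x | theta1)
```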
Symbolic regression (符号回归) represents a separate and philosophically interesting application in physics: the automated discovery of interpretable functional relationships from numerical data. The approach associated with Miles Cranmer and collaborators (the PySR library, with an accompanying 2023 paper) uses evolutionary, genetic-programming-style search over trees of mathematical operations, and has been combined with neural networks by first fitting a network to the data and then distilling its learned internal functions into symbolic form. Demonstrations in this literature have included recovering Newton's law of gravitation, Kepler's third law, and the relativistic energy formula \( E = mc^2 / \sqrt{1 - v^2/c^2} \) from data without providing the functional form in advance. The recovered expressions are exact symbolic formulas rather than neural network weights, making them interpretable and generalisable in a way that trained networks are not.
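A hedged sketch of how such a search is invoked in practice is given below, using PySR on synthetic Kepler-like data (period squared proportional to semi-major axis cubed). The operator lists, iteration count, and data are illustrative choices; consult the PySR documentation for actual recommended settings.

```python
# Sketch: recovering a power law from synthetic data with PySR's evolutionary search.
import numpy as np
from pysr import PySRRegressor

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 5.0, size=(500, 1))                         # semi-major axis (arbitrary units)
period = a[:, 0] ** 1.5 * (1 + 0.01 * rng.normal(size=500))      # noisy Kepler's third law

model = PySRRegressor(
    niterations=40,                                   # illustrative search budget
    binary_operators=["+", "-", "*", "/", "^"],
    unary_operators=["sqrt"],
)
model.fit(a, period)          # evolutionary search over expression trees
print(model.sympy())          # best symbolic expression found, ideally ~ a**1.5
```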
In gravitational wave astronomy, the LIGO–Virgo–KAGRA network detects signals from merging compact objects embedded in detector noise. Classical matched-filter analysis cross-correlates the data stream against a bank of template waveforms covering the expected parameter space, at substantial computational cost. Deep learning approaches — initially convolutional networks operating on time-frequency spectrograms, subsequently architectures operating directly on time-series data — have been used both for rapid event classification (distinguishing astrophysical signals from noise glitches) and for parameter estimation. The latter task is framed as a posterior inference problem: given a gravitational wave strain h(t), infer the posterior over merger parameters (chirp mass, mass ratio, spin magnitudes, sky location). Normalising flows trained to produce the posterior directly from the detector data have been shown to recover parameter posteriors consistent with classical nested-sampling results in seconds rather than the hours required by Markov chain Monte Carlo.
Cosmology presents a different statistical challenge. Large-scale structure surveys such as the Rubin Observatory Legacy Survey of Space and Time (LSST) will image billions of galaxies; weak gravitational lensing of those galaxies is sensitive to the matter power spectrum and thereby to the values of cosmological parameters \( \Omega_m \) and \( \sigma_8 \). Classical analysis pipelines compress the sky maps to two-point statistics (the angular power spectrum), which discard higher-order non-Gaussian information. ML approaches applied directly to convergence maps or galaxy density fields have been shown to extract more Fisher information from the same data, effectively constituting a lossless or near-lossless compression that preserves non-Gaussian signatures of modified gravity and neutrino mass. N-body emulators (N体模拟仿真器) such as the Aemulus and Euclid emulators use Gaussian processes or neural networks trained on a suite of N-body simulations spanning a grid of cosmological parameters to interpolate the matter power spectrum at arbitrary parameter values, enabling Markov chain Monte Carlo sampling of cosmological parameters at a fraction of the cost of running new simulations.
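The emulator idea is simple enough to sketch directly. The example below fits a Gaussian process to a handful of "simulation" outputs (a made-up scalar clustering amplitude as a function of two cosmological parameters) and queries it at a new parameter point; real emulators such as Aemulus interpolate full power spectra across many parameters, but the structure is the same. The `expensive_simulation` function is a toy stand-in, not an N-body code.

```python
# Minimal Gaussian-process emulator sketch: fit to a grid of simulated cosmologies,
# then query instantly at arbitrary parameter values.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def expensive_simulation(omega_m, sigma8):
    """Stand-in for an N-body run: returns a toy clustering amplitude."""
    return sigma8 ** 2 * omega_m ** 0.5

# Design points: a small grid of simulated cosmologies.
grid = np.array([(om, s8) for om in np.linspace(0.25, 0.35, 5)
                          for s8 in np.linspace(0.7, 0.9, 5)])
y = np.array([expensive_simulation(om, s8) for om, s8 in grid])

gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(length_scale=[0.05, 0.05]),
                              normalize_y=True)
gp.fit(grid, y)

# Query the emulator at a new parameter point (instant, versus hours for a simulation).
mean, std = gp.predict(np.array([[0.31, 0.82]]), return_std=True)
```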
The cultural dimensions of ML adoption in physics are worth examining. Particle physics has a strong tradition of deriving predictions from first principles — the Standard Model Lagrangian predicts cross-sections from coupling constants, and the ability to trace a measurement back to a Feynman diagram is considered essential to understanding. The introduction of black-box classifiers into the data analysis pipeline has been met with genuine intellectual concern: if a GNN-based b-tagger improves signal efficiency but cannot be understood in terms of physically interpretable jet substructure, has the discipline gained information or merely classification power? The distinction Anderson drew in “More is Different” between emergent regularities and reductive understanding is relevant here; so is Weinberg’s defence of the explanatory primacy of fundamental laws. ML-found patterns in LHC data may capture emergent regularities of QCD without providing a reductive account at the level of quarks and gluons, and whether such patterns constitute physical understanding depends on contested commitments about what understanding in physics consists of.
Chapter 5: AlphaFold and the Protein Structure Revolution
The protein structure prediction problem has served, for a quarter century, as the primary benchmark for computational structural biology and one of the canonical hard problems at the interface of biology and computation. Its difficulty stems from Levinthal’s paradox: a protein of \( n \) residues has \( O(3^n) \) possible backbone conformations in even the coarsest discretisation of the Ramachandran angles, yet proteins fold reliably to their native state in microseconds to seconds. Anfinsen’s thermodynamic hypothesis, confirmed by his 1972 Nobel Prize work on ribonuclease A, establishes that the native structure corresponds to the global minimum of the free energy — a statement that the structure is determined in principle by the sequence, but one that does not provide a tractable route to computing it. The CASP (Critical Assessment of Protein Structure Prediction) competitions, running biennially since 1994, provided the community with a standardised benchmark by releasing sequences of proteins whose structures had been determined experimentally but not yet published, allowing independent structure prediction groups to submit models that were then scored against the experimental reference after a deadline.
At CASP13 in 2018, the first DeepMind AlphaFold system achieved a median GDT score substantially above all competitors, attracting widespread attention. At CASP14 in 2020, AlphaFold 2 (Jumper et al., Nature 2021) produced results so dramatically above the previous state of the art — a median TM-score exceeding 0.9 across free-modelling targets, where a TM-score above 0.5 is conventionally considered a correct fold — that the CASP organisers described the protein folding problem as “largely solved” for single-chain proteins with detectable sequence homologues. The TM-score (模板匹配得分) between a predicted structure and a reference structure is defined as
\[ \text{TM-score} = \max \frac{1}{L_\text{ref}} \sum_{i=1}^{L_\text{aligned}} \frac{1}{1 + (d_i / d_0(L_\text{ref}))^2}, \]where \( d_i \) is the distance between the \( i \)-th aligned residue pair after optimal superposition, and \( d_0 \) is a length-dependent normalisation factor. The score is bounded in \([0, 1]\) and is length-independent, making it the standard metric for structural similarity.
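The expression transcribes directly into code for a fixed alignment and superposition, using the standard length normalisation \( d_0(L) = 1.24\,(L - 15)^{1/3} - 1.8 \); the full definition additionally maximises over superpositions, which this sketch omits.

```python
# TM-score for a fixed alignment: per-residue distances d_i and reference length L_ref.
import numpy as np

def d0(L_ref: int) -> float:
    """Length-dependent normalisation used in the TM-score definition."""
    return 1.24 * (L_ref - 15) ** (1.0 / 3.0) - 1.8

def tm_score(distances: np.ndarray, L_ref: int) -> float:
    """distances: distances d_i (in angstroms) for the aligned residue pairs."""
    return float(np.sum(1.0 / (1.0 + (distances / d0(L_ref)) ** 2)) / L_ref)

# Example: 90 of 100 residues aligned at ~2 angstrom error gives a high TM-score.
print(tm_score(np.full(90, 2.0), L_ref=100))
```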
The architectural innovations in AlphaFold 2 are closely coupled to biological insights about the sources of structural information in protein sequences. The central observation is that proteins evolve under the constraint of maintaining their three-dimensional fold; when two residue positions are in physical contact in the native structure, mutations at one position tend to be compensated by correlated mutations at the other, preserving the contact. These evolutionary co-variation (进化协变) signals are contained in multiple sequence alignments (多序列比对, MSAs) of the query protein against its evolutionary homologues, and detecting them is key to structural inference. The Evoformer module in AlphaFold 2 is a stack of attention-based transformer layers that jointly processes the MSA representation (an \( N_\text{seq} \times L \) matrix, where rows are homologous sequences and columns are residue positions) and a pairwise representation (an \( L \times L \) matrix encoding relationships between every pair of residue positions). The MSA representation is updated by row- and column-wise (axial) attention, with the row-wise attention biased by the pairwise representation; the pairwise representation is in turn updated from the MSA via an outer-product mean and refined by triangle attention and triangle multiplicative updates, enabling the network to extract structural constraints from co-evolutionary signals in a way that is learned end-to-end rather than specified by hand.
The structure module (结构模块) takes the output of the Evoformer and produces three-dimensional backbone coordinates, representing each residue as a rigid frame (translation and rotation in \( \mathbb{R}^3 \)) and iteratively refining the frame assembly through an equivariant attention mechanism. The full network is applied iteratively through recycling (循环机制): the predicted structure from one pass is fed back as an additional input to the next, allowing progressive refinement. The per-residue confidence score pLDDT (预测局部距离差异测试) is produced as a by-product and has proven remarkably useful as an indicator of intrinsic disorder: residues with pLDDT below 50 are typically in disordered regions, and this signal has been used to map the intrinsically disordered regions of the proteome at scale.
The AlphaFold Protein Structure Database, launched in 2021 and expanded in 2022, currently contains predicted structures for over 200 million proteins spanning virtually the entire known proteome, freely accessible to the scientific community. The scientific impact has been rapid and multidimensional. Structural biologists who previously spent years determining a single protein structure by X-ray crystallography or cryo-electron microscopy can now obtain a high-quality structural model in minutes. Drug design programmes that depend on identifying binding pockets and designing complementary ligands can proceed without waiting for experimental structure determination; virtual screening libraries of billions of compounds can be scored against AlphaFold structures rather than homology models. In basic science, AlphaFold has enabled the structural characterisation of protein families that had resisted experimental determination for decades, including membrane proteins and large multi-domain complexes.
AlphaFold 3 (Abramson et al., Nature 2024) extends the system from single-chain proteins to biomolecular complexes including protein–DNA, protein–RNA, protein–glycan, and protein–small molecule (ligand) interactions. The key architectural change is the replacement of the structure module's equivariant frame assembly with a diffusion-based structure generation (扩散式结构生成) process: the system learns a denoising diffusion model over 3D atomic coordinates, conditioned on the representations produced by a modified Evoformer. This allows the system to model the full all-atom geometry of a complex without the assumption of rigid backbone frames. On the PoseBusters benchmark for drug-like molecule docking, AlphaFold 3 substantially outperforms previous methods, including classical physics-based docking programs. ESMFold (Meta AI, Lin et al. 2023) and RoseTTAFold (Baker lab, Baek et al. 2021) represent complementary approaches: ESMFold uses the internal representations of the ESM2 protein language model — a large transformer pre-trained on sequences alone, without MSAs — as the input to a structure prediction head, achieving competitive accuracy with far shorter inference times and no requirement for multiple sequence alignment computation, while RoseTTAFold uses a three-track architecture that jointly refines sequence, pairwise, and three-dimensional coordinate representations.
The limitations of AlphaFold are as important to understand as its achievements, particularly for users encountering the system’s outputs in research contexts. The system predicts the static ground-state structure of a protein; protein function often depends critically on conformational dynamics (构象动力学), including large-scale domain motions, allosteric communication, and the sampling of excited states. Intrinsically disordered regions — a substantial fraction of eukaryotic proteomes — do not have a well-defined ground-state structure and should not be interpreted as having one even when a prediction is produced. Long-range epistatic interactions between widely separated residues in large multimeric assemblies remain challenging, and predicted structures of novel folds without any homologues in the training data are less reliable than predictions in well-populated regions of protein space. The system has been trained on structures deposited in the Protein Data Bank, which is biased toward stable, soluble, crystallisable proteins; the blind spots of the training set are the blind spots of the model.
Chapter 6: AI for Chemistry — Drug Discovery and Materials Design
The combinatorial scale of chemical space is one of the defining numerical facts of modern science. The number of drug-like organic molecules with molecular weight below 500 Da is estimated at between \( 10^{23} \) and \( 10^{60} \), depending on the precise definition of drug-likeness and the rules used to count valid molecular graphs; the larger estimates exceed the number of atoms in the observable universe by many orders of magnitude. Experimental high-throughput screening, capable of testing perhaps \( 10^6 \) compounds per year per laboratory, is therefore categorically incapable of exhaustive exploration; even the most ambitious virtual screening programme based on classical docking can evaluate perhaps \( 10^9 \) compounds. This combinatorial problem is the fundamental motivation for applying machine learning to molecular discovery: a trained model that accurately predicts biological activity from molecular structure enables in-silico prioritisation of the combinatorial space, directing experimental effort toward the fraction that a model believes is most promising.
Deep learning for molecular property prediction (分子性质预测) addresses the question of how to represent a molecule in a form suitable for neural network processing. Early approaches used fixed-length fingerprints (分子指纹) — bit vectors encoding which molecular fragments are present — as input to fully connected networks. The more powerful and now-standard approach treats the molecule as a graph: atoms are nodes with feature vectors encoding atomic number, formal charge, hybridisation, and aromaticity; bonds are edges with feature vectors encoding bond order and ring membership. The message-passing neural network (消息传递神经网络, MPNN, Gilmer et al. 2017) framework formalises this: in each message-passing step, each atom aggregates feature vectors from its bonded neighbours, updates its own representation using a learned function, and passes updated messages in the next step. After \( k \) steps, each atom’s representation encodes information about all atoms within \( k \) bonds, and a global readout function (读出函数) aggregates atomic representations into a molecular property prediction. The choice of \( k \), the aggregation function, and the global readout are architectural hyperparameters with known connections to the expressive power limitations of MPNNs as analysed by Xu et al. in the Weisfeiler–Leman graph isomorphism test framework.
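A schematic message-passing step over a molecular graph, following the framework just described, is sketched below in plain PyTorch with a dense adjacency matrix; the layer sizes and number of steps are illustrative, and production implementations (Chemprop, PyTorch Geometric) use sparse edge lists with bond features.

```python
# Tiny MPNN sketch: neighbour aggregation, learned atom update, sum readout.
import torch
import torch.nn as nn

class TinyMPNN(nn.Module):
    def __init__(self, atom_dim: int = 16, hidden: int = 64, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.embed = nn.Linear(atom_dim, hidden)
        self.message = nn.Linear(hidden, hidden)
        self.update = nn.GRUCell(hidden, hidden)
        self.readout = nn.Linear(hidden, 1)

    def forward(self, atom_feats, adjacency):
        # atom_feats: (n_atoms, atom_dim); adjacency: (n_atoms, n_atoms) 0/1 bond matrix
        h = torch.relu(self.embed(atom_feats))
        for _ in range(self.steps):
            m = adjacency @ self.message(h)       # sum messages from bonded neighbours
            h = self.update(m, h)                 # learned update of each atom state
        return self.readout(h.sum(dim=0))         # graph-level readout -> property prediction

model = TinyMPNN()
atoms = torch.randn(9, 16)                                    # e.g. 9 atoms, 16-dim features
bonds = torch.zeros(9, 9); bonds[0, 1] = bonds[1, 0] = 1.0    # one illustrative bond
prediction = model(atoms, bonds)                              # scalar property for the molecule
```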
Stokes and colleagues (Cell 2020) provide one of the clearest demonstrations of the potential of ML-guided antibiotic discovery. The Chemprop MPNN was trained on bacterial growth inhibition data for approximately 2,500 molecules, then used to predict antibiotic activity for a structurally diverse set of 6,000 FDA-approved drugs and natural products, followed by a library of approximately 107 million drug-like molecules. The top-scoring candidate, identified by the model as structurally novel and highly active, was halicin (哈利星) — a compound originally developed for diabetes but never characterised as an antibiotic. Experimental validation confirmed potent activity against drug-resistant Mycobacterium tuberculosis and Acinetobacter baumannii (including extensively drug-resistant strains), acting through a mechanism distinct from all existing antibiotic classes (disruption of transmembrane electrochemical gradients). The key quantitative result is the ratio of molecules tested in silico to molecules synthesised for experimental validation: roughly \( 10^8 \) to \( 10^2 \), a compression factor of six orders of magnitude.
Generative molecular design (生成式分子设计) addresses the inverse problem: rather than predicting properties for given molecules, generate novel molecules with desired target properties. Two architectures dominate current practice. The Junction Tree Variational Autoencoder (JTVAE, Jin et al. 2018) encodes molecules as trees of chemical fragments (rings, chains, functional groups), learns a continuous latent space, and decodes latent vectors into valid molecular structures by first predicting the tree structure and then assembling the atoms within each fragment — guaranteeing chemical validity by construction. More recent approaches use 3D diffusion models (三维扩散模型) that directly generate all-atom 3D structures by learning to denoise Gaussian noise applied to atomic positions; DiffSBDD and DiffDock operate in this paradigm, with DiffDock specifically generating the 3D pose of a ligand within a protein binding pocket (structure-based drug design).
GNoME (Merchant et al., Nature 2023) applies graph neural networks to the materials discovery problem at a scale that substantially exceeds previous computational materials science efforts. The system learns to predict the formation energy (生成能) of inorganic crystal structures — the energy per atom relative to the constituent elemental solids — which determines thermodynamic stability. The GNoME networks were initially trained on data from the Materials Project and then iteratively improved through active learning: candidate structures were generated by element substitution into known prototype structure families and by symmetry-aware random generation, screened by the model, and the most promising candidates were verified with density functional theory calculations whose results were fed back into training. The outcome is roughly 2.2 million crystal structures predicted to be stable with respect to the previously known convex hull, of which about 381,000 lie on the updated convex hull, expanding the number of known stable inorganic crystals by roughly an order of magnitude. A companion effort at the autonomous A-Lab at Lawrence Berkeley National Laboratory synthesised 41 of 58 attempted GNoME-predicted compounds, and hundreds of the predicted structures had already been realised independently by external experimental groups. The predicted stable materials include potential candidates for solid electrolytes, photovoltaic absorbers, and topological insulators, and the dataset has been released publicly for community use.
Retrosynthesis prediction (逆合成预测) addresses the synthetic chemistry problem: given a target molecule, identify a sequence of known chemical reactions that could produce it from commercially available starting materials. This is a search problem over reaction space, and ML approaches have framed it as either a template-based classification problem (selecting which known reaction template applies to which part of the target molecule) or a template-free sequence-to-sequence prediction problem (treating SMILES strings as text). AiZynthFinder (Genheden et al. 2020) combines a neural network policy for template selection with Monte Carlo tree search over the synthetic tree, a strategy structurally analogous to AlphaGo's use of a policy network to guide tree search. The Molecular Transformer (Schwaller et al. 2019) treats reaction prediction as machine translation, with reactant SMILES as the source language and product SMILES as the target in a standard sequence-to-sequence transformer, achieving high accuracy on standard benchmark reaction datasets; the same sequence-to-sequence machinery has been run in the retrosynthetic direction for template-free single-step retrosynthesis prediction.
Chemical language models (化学语言模型) pre-trained on large SMILES databases represent a parallel development to AlphaFold's use of protein sequence representations. ChemBERTa (Chithrananda et al. 2020), MolBERT, and similar models use the BERT masked-language-modelling pre-training objective on millions of SMILES strings to learn representations that encode chemical similarity, functional group presence, and molecular property correlations in their latent spaces. These representations can be fine-tuned on small labelled datasets for specific property prediction tasks, achieving competitive performance with MPNN-based approaches that process molecular graphs directly. The analogy with protein language models (ESM2, ProtTrans) is close: in both cases, a large pre-trained model learns a latent chemistry of the molecular domain from sequence data alone, without 3D structural supervision.
The loop from computational prediction to experimental validation to updated model is being closed by lab automation (实验室自动化). Robotic chemistry platforms — the Chemputer (Cronin group), MIT’s self-driving chemistry laboratory, and the Pasteur robotic synthesis system — can execute multi-step organic syntheses under computational control, automatically characterise products by NMR and mass spectrometry, and feed results back into the model training pipeline. This active-learning closed loop is qualitatively different from the one-way relationship between computation and experiment that has characterised most of the history of computational chemistry: the machine proposes, the robot tests, and the result informs the next proposal, without human intermediation at the synthesis and characterisation steps.
Chapter 7: AI for Earth and Climate Science
The problem of weather forecasting is, at a fundamental level, a problem of initial-value sensitivity in a chaotic dynamical system. Lorenz’s 1963 demonstration that the three-component convective system he studied exhibits sensitive dependence on initial conditions — the butterfly effect — established the theoretical basis for the predictability horizon of approximately two weeks beyond which deterministic forecasting must give way to probabilistic ensemble methods. Richardson’s 1922 vision of a numerical weather prediction system, in which a vast hall of human calculators would integrate the primitive equations of fluid dynamics simultaneously across the globe, was realised in electronic form beginning with Charney and von Neumann’s 1950 computation; the ECMWF Integrated Forecasting System (IFS), first operational in 1979 and continuously developed since, currently represents the gold standard of operational global NWP.
The computational cost of global NWP is substantial. A single 10-day global forecast at 9 km horizontal resolution with 137 vertical model levels requires on the order of \( 10^{15} \) floating-point operations; producing an ensemble of 50 such forecasts for uncertainty quantification multiplies this by 50. The core of the IFS is a semi-Lagrangian semi-implicit time integration of the hydrostatic primitive equations, with sub-grid processes (convection, boundary-layer turbulence, radiation, cloud microphysics, land surface) parameterised by empirically calibrated schemes. The data assimilation (数据同化) step, which merges the 12-hour model forecast with incoming observations from radiosondes, satellite radiances, surface stations, aircraft, and ocean buoys using 4D-Var variational optimisation, is computationally comparable in cost to the forecast itself and epistemologically central: the initial condition of the forecast system encodes everything the observing network knows about the current state of the atmosphere.
Machine learning weather models represent a methodological departure of a different kind from those discussed in earlier chapters. They do not improve on an existing numerical method by making it faster, as PINNs and neural operators do; they bypass the numerical integration of physical equations entirely, replacing it with a direct learned mapping from atmospheric state at time \( t \) to atmospheric state at time \( t + \Delta t \), trained on decades of reanalysis data. The training dataset used by all major ML weather models is ERA5 (ERA5再分析资料), the ECMWF fifth-generation global atmospheric reanalysis, which provides atmospheric variables on a \( 0.25° \times 0.25° \) latitude–longitude grid at 37 pressure levels from 1940 to the present, assimilating all available historical observations into a dynamically consistent reconstruction of the global atmospheric state.
FourCastNet (Pathak et al., NVIDIA 2022) was among the first ML weather models to demonstrate ECMWF-competitive performance at global scale, using an adaptive Fourier neural operator architecture (a transformer-style variant of the FNO introduced in Chapter 3) trained on ERA5. Its principal advantage is speed: global 6-hourly forecasts over 10 days are produced in seconds on a single GPU, enabling ensemble generation at a cost that would be prohibitive for the IFS. Pangu-Weather (Bi et al., Nature 2023) uses a hierarchical 3D transformer architecture that explicitly models the vertical pressure-level structure of the atmosphere, treating each pressure level as a distinct feature channel and applying 3D attention over the latitude–longitude–pressure volume. Trained on ERA5, it achieves lower RMSE than ECMWF HRES (the high-resolution deterministic forecast) on 3D wind, temperature, and geopotential height at most lead times beyond 3 days.
GraphCast (Lam et al., Science 2023) represents the most thoroughly evaluated ML weather model as of its publication. It uses a message-passing graph neural network (消息传递图神经网络) on a multi-scale icosahedral mesh: the globe is discretised by a refinement of the icosahedron into approximately 40,000 mesh nodes, with edges connecting each node to its nearest neighbours at multiple resolution levels. The encoder maps the latitude–longitude ERA5 grid onto the mesh; the processor applies 16 rounds of message passing on the mesh; the decoder maps back to the ERA5 grid. The model is trained autoregressively on 39 years of ERA5 data, with a loss that weights all 227 predicted atmospheric variables by their physical importance. On the WeatherBench 2 benchmark, GraphCast outperforms ECMWF HRES on approximately 90% of atmospheric variables at all lead times from 1 to 10 days, including a historically difficult test case — tropical cyclone track prediction — where GraphCast’s track errors are substantially below those of the IFS. ECMWF has subsequently integrated ML-based models, including a version of GraphCast, into its operational ensemble as additional members.
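The way such a model is used at forecast time is a simple autoregressive rollout: a learned single-step operator maps the atmospheric state (plus the preceding step) to the state six hours later and is applied repeatedly. The sketch below illustrates only this rollout structure; `step_model` is a trivial stand-in for a trained model such as GraphCast, and the coarse grid shape is an assumption for illustration.

```python
# Schematic autoregressive rollout of an ML weather model: 40 x 6-hour steps = 10 days.
import torch

def step_model(state_prev, state_curr):
    """Stand-in for a trained one-step forecast model (e.g. an encoder-processor-decoder
    GNN); here, a trivial persistence-plus-trend extrapolation for illustration."""
    return state_curr + (state_curr - state_prev)

def rollout(state_prev, state_curr, n_steps: int = 40):
    """Apply the one-step model repeatedly, feeding each prediction back in as input."""
    trajectory = []
    for _ in range(n_steps):
        state_next = step_model(state_prev, state_curr)
        trajectory.append(state_next)
        state_prev, state_curr = state_curr, state_next
    return torch.stack(trajectory)

# Two consecutive states of 227 variables on a coarse illustrative lat-lon grid.
prev, curr = torch.randn(2, 227, 32, 64)
forecast = rollout(prev, curr)            # (40, 227, 32, 64): the 10-day trajectory
```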
The success of ML weather models raises a question structurally parallel to that raised by AlphaFold: what is the epistemic status of a forecast produced by a system that has learned a mapping from reanalysis to reanalysis, rather than by integrating physical conservation laws? The IFS forecast is explainable in the sense that any element of it can in principle be traced back to the physical parameterisation scheme responsible; a GraphCast forecast is not. For operational forecasting purposes — issuing warnings about tropical cyclones and severe weather events that will affect human safety — predictive skill is the primary criterion, and by that criterion ML models are competitive. For climate science, where the goal is to understand the physical mechanisms governing atmospheric dynamics, not merely to predict the next 10 days, the black-box nature of ML models is more problematic.
The deeper limitation concerns distribution shift under climate change. ERA5 spans roughly 1940 to the present, a period during which global mean surface temperature has increased by approximately 1.2°C. ML weather models trained on this record are, in an important sense, interpolating within the historical distribution of atmospheric states. As warming continues, the atmosphere will enter regimes — altered Hadley cell extent, changed sea ice cover, modified land surface albedo — that have no precedent in the ERA5 record. A physics-based model, because it encodes the relevant conservation laws, can in principle extrapolate to these novel states if its parameterisations remain valid; an ML model trained on historical reanalysis has no such guarantee. Climate Change AI (climatechange.ai) has documented this concern extensively, and it motivates hybrid approaches that combine fast ML emulators with physics-based constraints.
ML for climate emulation differs from weather forecasting in targeting longer time scales and different scientific questions. The ClimSim project (Yu et al. 2023) has produced a large dataset for emulating the expensive subgrid physics of the E3SM multi-scale modelling framework, enabling ML models to replace computationally costly parameterisation calls while preserving the dynamical core of the host model. Statistical downscaling using diffusion models — generating high-resolution local climate projections conditioned on coarse-resolution GCM output — has shown promise for providing climate services at the scales required by infrastructure planning. Extreme event attribution, which asks whether a specific weather extreme was made more likely by anthropogenic climate forcing, has been addressed by training classifiers on factual and counterfactual climate simulations.
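The mechanics of a learned parameterisation emulator can be sketched in a few lines. In the hypothetical loop below, a per-column surrogate (here just a fixed linear map) stands in for a trained network and for the expensive physics call it replaces; the feature counts, time step, and function names are illustrative and do not correspond to any particular climate model's interface.

```python
import numpy as np

N_COLUMNS, N_IN, N_OUT = 10_000, 124, 64      # atmospheric columns; per-column feature counts

rng = np.random.default_rng(0)
W = rng.normal(size=(N_IN, N_OUT)) / np.sqrt(N_IN)   # stands in for a trained emulator network

def learned_emulator(columns: np.ndarray) -> np.ndarray:
    """Cheap learned surrogate: per-column map from coarse state to subgrid tendencies."""
    return columns @ W

def time_step(state: np.ndarray, dt: float) -> np.ndarray:
    """One host-model step: resolved dynamics stay physics-based (omitted here);
    the expensive parameterisation call is replaced by the learned emulator."""
    tendencies = learned_emulator(state)
    new_state = state.copy()
    new_state[:, :N_OUT] += dt * tendencies   # apply emulated tendencies to prognostic fields
    return new_state

state = rng.normal(size=(N_COLUMNS, N_IN))
state = time_step(state, dt=1800.0)           # a 30-minute step, purely illustrative
```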
Chapter 8: Foundation Models for Science
The concept of a foundation model — a large neural network pre-trained on a broad corpus and then adapted to specific tasks — has reorganised the landscape of natural language processing and, more recently, vision and speech. The question this chapter examines is what foundation models mean for science: whether the paradigm that produced GPT-4 and its successors can be applied to scientific domains in a way that produces genuinely new scientific knowledge rather than sophisticated retrieval and summarisation of existing knowledge.
The most direct extension of the GPT paradigm to science is a large language model pre-trained on scientific text. Galactica (Taylor et al., Meta AI 2022) was trained on 48 million scientific papers, textbooks, chemistry databases, and reference materials, with a special tokenisation scheme for SMILES molecular strings, LaTeX mathematical expressions, and protein sequences, allowing it to process heterogeneous scientific content in a unified architecture. The 120-billion-parameter model demonstrated surprising capabilities: predicting molecular properties from SMILES strings without task-specific fine-tuning, writing literature review summaries with citations drawn from its training data, and solving mathematical problems stated in natural language. However, after three days of public access, Meta AI withdrew the public demo following widespread criticism that the model generated scientifically plausible but factually incorrect statements — hallucinated citations, incorrect molecular properties, false mechanistic claims — delivered with the confident tone of established fact.
The hallucination problem in scientific LLMs is structurally more serious than in general-purpose LLMs. When a general LLM writes a plausible but incorrect biography, the error is typically detectable by anyone with relevant knowledge. When a scientific LLM produces a plausible but incorrect molecular property — the toxicity of a compound, the melting point of a crystal, the binding affinity of a ligand — the error may not be immediately detectable by the user, and acting on it represents a real research misdirection with costs in time, money, and potentially patient safety. SciBERT (Beltagy et al. 2019) and other domain-specific BERT-style models pre-trained on scientific corpora with a masked-language-model (MLM) objective, along with compact domain language models such as BioMedLM (Bolton et al. 2022), represent a more conservative deployment of language-model technology in science: they are evaluated on specific, well-defined benchmarks (named entity recognition, relation extraction, question answering over scientific text) with known train/test splits, and their outputs are used as structured information-extraction tools rather than as generative science systems.
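This conservative mode of use can be made concrete with a short probe. The sketch below assumes the Hugging Face `transformers` library and the publicly released `allenai/scibert_scivocab_uncased` checkpoint; it uses masked-token prediction as a scoring tool rather than asking the model to generate free-form scientific claims.

```python
# Masked-token probing with a science-domain BERT: the model scores candidate
# completions of a masked statement instead of generating free text.
# Assumes the Hugging Face transformers library and the public SciBERT
# checkpoint; downloading the weights requires network access.
from transformers import pipeline

fill = pipeline("fill-mask", model="allenai/scibert_scivocab_uncased")

for candidate in fill("The enzyme catalyses the hydrolysis of [MASK]."):
    print(f"{candidate['token_str']:>15s}  p = {candidate['score']:.3f}")
```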
AlphaFold as a scientific foundation model is perhaps the most instructive case for understanding what foundation model pre-training can offer to science. The Evoformer’s representations, learned from the task of predicting protein structure, transfer to a range of downstream tasks that were not in the training objective: protein–protein interaction prediction, protein function annotation, protein fitness landscape prediction, and binding specificity estimation. This transfer is structurally analogous to the transfer of BERT representations to downstream NLP tasks — the pre-training has learned a general-purpose encoding of protein sequence-structure-function relationships, not just a narrow predictor for the CASP task. The analogy suggests a broader principle: when a scientific domain has a large, well-curated dataset with a clear self-supervised learning objective (MSA masking, structure prediction, property regression), training a large model on that objective may produce representations that encode much of the domain’s information in a transferable form.
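The transfer pattern described here, freezing a pre-trained representation and fitting a small task-specific head, is simple enough to sketch directly. In the code below, `pretrained_embed` is a hypothetical stand-in for a frozen Evoformer- or protein-language-model encoder, and the sequences and labels are synthetic; the point is the workflow, not the numbers.

```python
# Transfer learning on frozen pre-trained embeddings: fit a small ridge
# "head" for a downstream annotation task. All data here is synthetic.
import numpy as np

rng = np.random.default_rng(0)
D_EMBED = 384

def pretrained_embed(sequences: list[str]) -> np.ndarray:
    """Placeholder for a frozen pre-trained encoder: one vector per sequence."""
    return rng.normal(size=(len(sequences), D_EMBED))

sequences = ["MKTLLVAG", "GAVLWWST", "LLSDKKRE"] * 50    # synthetic sequences
labels = rng.integers(0, 2, size=len(sequences))         # synthetic binary annotations

X = pretrained_embed(sequences)                          # frozen features
# Ridge-regularised least squares stands in for the downstream head
# (logistic regression or a small MLP in practice).
w = np.linalg.solve(X.T @ X + 1.0 * np.eye(D_EMBED), X.T @ (2 * labels - 1))
predictions = (X @ w > 0).astype(int)
print("training accuracy:", (predictions == labels).mean())
```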
The AI Scientist (Lu et al., arXiv 2024) represents perhaps the most provocative proposal in the current literature: a system that autonomously proposes research ideas, writes code implementing experiments, runs those experiments, analyses the results, writes a full paper including abstract, introduction, related work, methods, results, and discussion sections, and submits it for automated peer review. The system uses a large language model as its orchestrator, with code execution, plotting, and literature search as tools. Some papers generated by the AI Scientist have been scored at or above an acceptance threshold by the system's own automated reviewer, and some propose apparently novel training tricks for small neural network models in areas such as diffusion modelling and transformers. The system raises, in unusually acute form, a cluster of questions about authorship and credit: if a machine proposes, executes, and writes up a research programme without human direction beyond specification of the domain, who is the author? Who is responsible for errors? Who can be held accountable for false results that influence subsequent research? Existing norms of scientific authorship — which require that authors take intellectual responsibility for the work — were not written with autonomous AI systems in mind.
The AI for drug discovery foundation model ecosystem illustrates the practical dynamics of scientific foundation model development. Models such as ChemBERTa, MolBERT, Uni-Mol, and ESMFold occupy a spectrum from relatively narrow SMILES-pretrained models to large joint models over protein sequences, molecular structures, and biological assay data. The scaling behaviour of these models — whether larger pre-training datasets and larger model size consistently improve downstream task performance — is an active area of study with implications for investment in pre-training infrastructure. BioMedGPT (Luo et al. 2023), a multimodal foundation model jointly trained on protein sequences, molecular SMILES, and biological knowledge graphs, demonstrates promising performance on zero-shot molecular property prediction tasks. Genie 2 (Google DeepMind, 2024) and similar video-based world models, while not specifically designed for science, raise the question of whether a generative model trained on a sufficiently rich and structured scientific simulation corpus could serve as an implicit model of physical dynamics — a world model for physics rather than for video game environments.
There is a suggestive philosophical parallel between large-scale pre-training and the scientific process of theory formation. Scientific theories, on standard accounts, are compact, generalisable structures abstracted from a large body of particular observations; a good theory compresses a large dataset by identifying its underlying generating mechanisms. A foundation model trained on a large scientific corpus is, in the minimum-description-length (MDL) sense, compressing that corpus — and the structure of its latent representations may reflect something about the structure of the domain. But theory formation in science has an additional demand that latent representations do not automatically satisfy: theories must be articulable, communicable, and criticisable. The representations of a large transformer, however predictively powerful, are not in themselves a theory.
Chapter 9: The Epistemology of AI-Assisted Science
The entry of machine learning into scientific practice at scale prompts a re-examination of epistemological foundations that scientists have, for the most part, taken for granted. These foundations were largely articulated in the mid-twentieth century and have since served as the background framework within which methodological debates in specific sciences take place. The challenge AI now poses is not simply that it introduces new tools, but that some of those tools unsettle the role of human understanding as the final arbiter of scientific knowledge claims.
The classical picture of scientific method, in the tradition of Popper’s hypothetico-deductivism, proceeds as follows: a scientist formulates a hypothesis, derives testable consequences, and submits those consequences to experimental test. A hypothesis that survives genuine attempts at falsification acquires a degree of corroboration, though it can never be confirmed with logical finality. The Duhem–Quine thesis complicates this picture by noting that no hypothesis is tested in isolation: an anomalous experimental result always admits the response that some auxiliary hypothesis, rather than the target hypothesis, is false. Kuhn’s account of paradigm shifts complicates it further by situating individual hypothesis testing within a larger structure of shared exemplars, values, and conceptual frameworks that are not themselves testable in Popper’s sense. Neither Popper nor Kuhn imagined a scientific instrument that generates hypotheses autonomously, and adapting their frameworks to accommodate this is not straightforward.
Three modes of AI involvement in science can be analytically distinguished. The first, automation, does not challenge the classical picture at all: AI performs tasks that scientists previously performed manually — classifying spectra, extracting tabular data from papers, aligning sequences — without changing the epistemic structure of the research. The scientist remains the agent who formulates hypotheses and interprets evidence; the AI merely accelerates the data-processing pipeline. The second mode, discovery assistance, is more interesting and better illustrated by the cases examined throughout this course. Davies et al. use ML to identify patterns in knot invariants that guide human mathematicians toward new conjectures; AlphaGeometry proposes auxiliary constructions that human geometers can verify and understand; GNoME produces a list of stable materials that human chemists then synthesise and characterise. In this mode, AI shifts the locus of hypothesis generation from the human scientist to the machine, but human scientists retain interpretive authority over the outputs: they decide which patterns are interesting, which conjectures are worth pursuing, and what the results mean. The third mode, autonomous discovery, as exemplified by the AI Scientist, removes even this residual human authority from the discovery loop. The AI Scientist proposes, tests, and publishes without human intermediation; the results are attributed to “the AI Scientist system,” and the question of what it means for a machine to claim a scientific result — with its attendant norms of accountability, reproducibility, and intellectual responsibility — is posed in practice rather than in thought experiment.
The explainability problem in science is more acute than the familiar explainability problem in applied AI. In engineering, a reliable black-box predictor that improves bridge design, credit scoring, or image classification is valuable because the goal is instrumental. Science’s goal is not merely instrumental: a PINN that accurately solves the Navier–Stokes equations does not explain fluid turbulence; it produces correct numbers without identifying the mechanism that generates them. GraphCast’s superior 10-day forecast does not illuminate the atmospheric dynamics responsible for the forecast features; it produces a numerically accurate prediction from an internal representation that cannot be read off as a physical account of why pressure systems evolve as they do. The community response to this limitation has been a complementary programme of extracting interpretable structure from data and from trained models. The SINDy framework (sparse identification of nonlinear dynamics; Brunton, Proctor, and Kutz, 2016) identifies dynamical systems as sparse linear combinations of basis functions from a large candidate library, producing compact symbolic equations from data. Symbolic regression (Chapter 4) recovers exact functional relationships from numerical simulations. Attention analyses and feature attribution methods (SHAP, integrated gradients) attempt to decompose neural network predictions into contributions from interpretable input features. These approaches do not eliminate the tension between predictive accuracy and mechanistic understanding, but they provide structured ways of extracting interpretable content from ML systems.
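The SINDy idea is compact enough to illustrate end to end. The sketch below simulates a known two-dimensional linear system, builds a small polynomial library, and applies sequentially thresholded least squares, the core SINDy optimiser; the system, library, and threshold are illustrative choices, not the authors’ reference implementation.

```python
import numpy as np

# Simulate a known 2D linear system so we can check what the sparse regression recovers:
#   dx/dt = -0.1 x + 2 y,   dy/dt = -2 x - 0.1 y
dt, n = 0.01, 5000
X = np.zeros((n, 2))
X[0] = [2.0, 0.0]
A_true = np.array([[-0.1, 2.0], [-2.0, -0.1]])
for k in range(n - 1):
    X[k + 1] = X[k] + dt * (A_true @ X[k])        # forward Euler integration

dXdt = np.gradient(X, dt, axis=0)                 # numerical derivatives of the trajectory

# Candidate library: [1, x, y, x^2, x*y, y^2]
x, y = X[:, 0], X[:, 1]
Theta = np.column_stack([np.ones(n), x, y, x**2, x * y, y**2])
names = ["1", "x", "y", "x^2", "x*y", "y^2"]

def stlsq(Theta, dXdt, threshold=0.05, n_iter=10):
    """Sequentially thresholded least squares: regress, zero small terms, refit on support."""
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(n_iter):
        Xi[np.abs(Xi) < threshold] = 0.0
        for j in range(dXdt.shape[1]):
            big = np.abs(Xi[:, j]) >= threshold
            if big.any():
                Xi[big, j] = np.linalg.lstsq(Theta[:, big], dXdt[:, j], rcond=None)[0]
    return Xi

Xi = stlsq(Theta, dXdt)
for j, var in enumerate(["dx/dt", "dy/dt"]):
    terms = [f"{c:+.2f} {t}" for c, t in zip(Xi[:, j], names) if c != 0]
    print(var, "=", " ".join(terms))
```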
The distribution shift problem as an epistemological issue is particularly significant for science. A model that accurately interpolates within its training distribution may fail catastrophically on examples from outside that distribution — and in science, the most important discoveries typically occur precisely at the boundaries and outside the established distribution of knowledge. AlphaFold performs reliably on proteins with deep multiple sequence alignment coverage because its training data is dense in that region of sequence space; its predictions for truly novel protein families with no evolutionary relatives are less reliable. GraphCast, trained on ERA5, may fail on atmospheric states produced by a substantially warmer climate. GNoME’s predictions for crystal structures with compositional types absent from the Inorganic Crystal Structure Database carry large uncertainty. These are not technical failures but structural features of the generalisation problem: a learning system trained to interpolate cannot be trusted to extrapolate, and the regimes of greatest scientific interest are precisely the regimes where extrapolation is required.
The reproducibility crisis in machine learning for science takes several specific forms. AlphaFold benchmarks have been scrutinised for potential data leakage: if any of the CASP14 target structures, or close homologues of them, were present in the PDB subset used to train AlphaFold 2, the performance estimates may be inflated. The CASP organisers take precautions against this, including the use of targets whose structures were determined but not publicly released before the submission deadline; nonetheless, the density of the PDB’s homology network means that a truly leakage-free evaluation is difficult to guarantee. Comparisons of ML weather models with ECMWF HRES have been criticised as not exactly fair: HRES is initialised from real-time observations assimilated by 4D-Var, whereas ML models typically take ERA5 (itself the output of a data assimilation system) as their initial conditions, which may advantage the ML models if the ERA5 initial conditions are smoother and less uncertain than operational NWP analyses.
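One standard mitigation for the leakage problem is to split by homology cluster rather than by individual structure, so that no close relative of a test target appears in training. The sketch below assumes cluster assignments supplied by an external tool (for example, sequence-identity clustering); the assignments here are synthetic placeholders.

```python
# Cluster-aware train/test split: hold out whole homology clusters so that
# training and test sets share no close relatives (at the clustering's granularity).
import numpy as np

rng = np.random.default_rng(0)
n_structures = 1000
cluster_of = rng.integers(0, 200, size=n_structures)    # placeholder cluster ids per structure

clusters = np.unique(cluster_of)
rng.shuffle(clusters)
test_clusters = set(clusters[: len(clusters) // 10])     # hold out ~10% of clusters

train_idx = [i for i in range(n_structures) if cluster_of[i] not in test_clusters]
test_idx = [i for i in range(n_structures) if cluster_of[i] in test_clusters]

# No cluster appears on both sides of the split.
assert not (set(cluster_of[train_idx]) & set(cluster_of[test_idx]))
```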
The credit and priority problem in AI-assisted science is non-trivial. When a major discovery involves a deep learning system trained by a team of hundreds of engineers and scientists, and a biological insight emerges from applying that system to a new protein family, how is scientific credit distributed between the system developers, the researchers who applied the system, and the biologists who interpreted the result? Merton’s norms of science — communalism (results should be shared), universalism (claims should be evaluated on their merits regardless of their source), disinterestedness (scientists should not be motivated primarily by personal gain), and organised scepticism (claims should be subjected to systematic doubt) — are under strain from the involvement of commercial AI laboratories in science. DeepMind’s publication of AlphaFold 2 was a highly communalistic act; the simultaneous filing of patents on aspects of the AlphaFold methodology was a partial contradiction of that norm. The AI Scientist raises the question of whether communalism and organised scepticism can apply to a system that has no personal interests, cannot be held accountable, and cannot respond to peer criticism.
The co-scientist vision that motivates much of the current enthusiasm for AI in science imagines a division of labour in which AI systems handle combinatorial search, pattern detection in high-dimensional data, and hypothesis generation from large corpora, while human scientists provide interpretive, theoretical, and ethical framing. In this division, the machine handles the parts of science that are computationally intensive and cognitively exhausting, while the human handles the parts that require understanding, judgment, and accountability. The historical parallel most often invoked is the introduction of electronic computers into theoretical physics and chemistry in the 1950s and 1960s: computational quantum chemistry, molecular dynamics simulation, and Monte Carlo methods all required computers to become tractable, and the effect was to transform rather than replace human theoretical reasoning, making new theoretical work possible that could not have been done without the computational infrastructure. Whether the current transformation is of the same kind or represents a qualitative departure is the central unresolved question that this course has been designed to hold open rather than prematurely close. In particular, does the autonomy of AI systems in hypothesis generation and experimental design represent a genuine change in the epistemic structure of science, or merely a change in the speed and scale of its execution?
For the philosophical dimensions of machine creativity and understanding — including the debate between computationalist and embodied accounts of cognition, and the question of whether a formal symbol manipulation system can be said to understand what it processes — the arguments examined in PHIL 459b (Philosophy of Artificial Intelligence) provide the relevant philosophical context. For the historical context of computing as a scientific instrument — tracing the genealogy from Turing machines and von Neumann architectures through the computerisation of scientific laboratories in the postwar decades — the narrative is examined in detail in HIST 415 (A History of Artificial Intelligence), where the social, institutional, and political dimensions of computing’s entry into science are treated alongside the technical.