CS 482: Algorithms in Bioinformatics

Bin Ma

Estimated study time: 41 minutes


Based on lecture notes by Sibelius Peng — PDF source

Sources and References

Primary textbook
  • Durbin, R., Eddy, S.R., Krogh, A., and Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.

Supplementary texts
  • Pevzner, P.A. Computational Molecular Biology: An Algorithmic Approach. MIT Press, 2000.
  • Jones, N.C. and Pevzner, P.A. An Introduction to Bioinformatics Algorithms. MIT Press, 2004.

Online resources
  • MIT 6.047/6.878 Computational Biology: Genomes, Networks, Evolution (Kellis, freely available).
  • Ben Langmead’s Computational Genomics (Johns Hopkins, lecture materials on GitHub).
  • EMBL-EBI bioinformatics training materials.
  • NCBI BLAST documentation and tutorials.

Lecture notes
  • Peng, S. CS 482 Lecture Notes (Winter 2022). https://pdf.sibeliusp.com/1221/cs482.pdf

Chapter 1: Sequence Alignment and Biological Motivation

The fundamental question in computational biology is: given two biological sequences (DNA, RNA, or protein), how similar are they, and what does that similarity imply about their shared ancestry, structure, and function? Sequence alignment is the algorithmic foundation for answering this question.

Biological Background

DNA is a double-stranded polymer of four nucleotides: adenine (A), cytosine (C), guanine (G), thymine (T). A gene encodes a protein: the DNA sequence is transcribed into mRNA and translated (via the genetic code) into an amino acid sequence — a protein. The human genome has \(\approx 3.2 \times 10^9\) base pairs; about 20,000–25,000 protein-coding genes.

Evolutionary divergence: two species sharing a common ancestor will have similar genome sequences. Genes performing the same function in different species (orthologues) evolve under selective pressure to preserve function, so their sequences diverge slowly. Paralogues — genes duplicated within a genome — can evolve more freely. By comparing sequences across species, we can reconstruct evolutionary history, identify functional regions, and predict the function of unknown genes.

Mutations that occur during evolution: substitutions (one nucleotide replaced by another), insertions, and deletions. The latter two are collectively called indels. An alignment of two sequences places them one above the other, with gaps inserted to accommodate indels:

ACGT-ACGT
ACG-TACGT

The minus signs (\(-\)) are gaps. Every alignment corresponds to an edit history; the edit distance between two strings is the minimum number of character insertions, deletions, and substitutions to transform one into the other.

Global Alignment: Needleman–Wunsch

Global alignment aligns two sequences in their entirety, from end to end. Given sequences \(x = x_1 \cdots x_m\) and \(y = y_1 \cdots y_n\), and a scoring function \(s(a,b)\) for matching/mismatching characters and a gap penalty \(g\), the optimal alignment maximises the total score.

\[ F(i, j) = \max\begin{cases} F(i-1, j-1) + s(x_i, y_j) & \text{(match/mismatch)}\\ F(i-1, j) + g & \text{(gap in } y\text{)}\\ F(i, j-1) + g & \text{(gap in } x\text{)} \end{cases} \]

with base cases \(F(0, j) = jg\), \(F(i, 0) = ig\). The optimal score is \(F(m, n)\), and the optimal alignment is recovered by traceback. Complexity: \(O(mn)\) time and space. This is the Needleman–Wunsch algorithm (1970).
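The recurrence and base cases above can be sketched as a short score-only implementation (a minimal sketch; the scoring parameters are illustrative defaults, and traceback is omitted):

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment score via the Needleman-Wunsch DP (score only)."""
    m, n = len(x), len(y)
    # F[i][j] = best score aligning x[:i] with y[:j]
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap          # base case: x[:i] vs the empty prefix
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # match/mismatch
                          F[i - 1][j] + gap,     # gap in y
                          F[i][j - 1] + gap)     # gap in x
    return F[m][n]
```

For example, `needleman_wunsch("ACGT", "AGT")` scores 1 under these parameters: three matches and one gap.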

Affine Gap Penalty

A single gap of length \(k\) is biologically more likely than \(k\) separate gaps — evolution tends to insert or delete in chunks, not character by character. The affine gap penalty \(g_\text{aff}(k) = g_\text{open} + (k-1) \cdot g_\text{ext}\) charges a larger penalty for opening a gap and a smaller one for extending it.

The affine gap DP maintains three tables: \(M(i,j)\) for an alignment ending with a match/mismatch, \(X(i,j)\) for an alignment ending with a gap in \(y\), and \(Y(i,j)\) for a gap in \(x\). The recurrences couple these three tables. Still \(O(mn)\) time and space.
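A minimal sketch of the three-table recurrence, assuming illustrative scoring parameters (score only, traceback omitted; the function name `affine_align` is chosen here, not from the notes):

```python
NEG = float("-inf")

def affine_align(x, y, match=1, mismatch=-1, gap_open=-3, gap_ext=-1):
    """Affine-gap global alignment score with three DP tables.

    M ends in a match/mismatch; X ends with a gap in y (x_i unmatched);
    Y ends with a gap in x. A gap of length k costs gap_open + (k-1)*gap_ext.
    """
    m, n = len(x), len(y)
    M = [[NEG] * (n + 1) for _ in range(m + 1)]
    X = [[NEG] * (n + 1) for _ in range(m + 1)]
    Y = [[NEG] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = gap_open + (i - 1) * gap_ext
    for j in range(1, n + 1):
        Y[0][j] = gap_open + (j - 1) * gap_ext
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]) + s
            X[i][j] = max(M[i-1][j] + gap_open,   # open a new gap in y
                          X[i-1][j] + gap_ext,    # extend the current gap
                          Y[i-1][j] + gap_open)
            Y[i][j] = max(M[i][j-1] + gap_open,
                          Y[i][j-1] + gap_ext,
                          X[i][j-1] + gap_open)
    return max(M[m][n], X[m][n], Y[m][n])
```

With these parameters, aligning `"AAAA"` to `"AA"` scores \(2 - (3 + 1) = -2\): two matches plus one gap of length 2, cheaper than two separate gaps.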

Scoring Functions: PAM and BLOSUM

The choice of scoring matrix matters enormously. For protein alignments, a substitution of Leu (leucine) for Ile (isoleucine) — both hydrophobic — is much more likely than Leu for Lys (a charged amino acid). The scoring matrix should reflect evolutionary substitution probabilities.

PAM matrices (Point Accepted Mutation; Dayhoff et al. 1978): derived from closely related protein families by counting observed substitutions and extrapolating via matrix powers (1 PAM = 1% expected change). PAM250 is for distantly related sequences.

BLOSUM matrices (Blocks Substitution Matrix; Henikoff–Henikoff 1992): derived by comparing ungapped blocks of multiply aligned proteins. BLOSUM62 — built from blocks in which sequences sharing at least 62% identity were clustered, to avoid over-weighting close homologues — is the default in BLAST for database search.

The score \(s(a, b) = \log[q(a,b) / (p(a) p(b))]\) where \(q(a,b)\) is the joint probability and \(p(a), p(b)\) are background frequencies — this is a log-odds score. A positive score means the pair \((a,b)\) is more likely to align by homology than by chance; negative means less likely. The expected score \(\sum_{a,b} p(a) p(b) s(a,b)\) must be negative (otherwise a random alignment would have positive expected score, making all alignments significant).
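A toy numeric check of the log-odds construction — the joint and background probabilities below are made up for illustration, not taken from PAM or BLOSUM:

```python
import math

# Toy two-letter alphabet with illustrative (made-up) probabilities.
p = {"A": 0.6, "B": 0.4}                      # background frequencies
q = {("A", "A"): 0.45, ("A", "B"): 0.075,     # joint probabilities in true
     ("B", "A"): 0.075, ("B", "B"): 0.40}     # alignments (sums to 1)

def log_odds(a, b):
    """s(a,b) = log2 of q(a,b) / (p(a) p(b))."""
    return math.log2(q[(a, b)] / (p[a] * p[b]))

# The expected score under the background model must be negative,
# otherwise random alignments would look significant.
expected = sum(p[a] * p[b] * log_odds(a, b) for a in p for b in p)
```

Here \(s(A,A)\) and \(s(B,B)\) come out positive (the pairs are enriched in true alignments) while the mixed pairs score negative, and the background expectation is below zero, as the text requires.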

Chapter 2: Local Alignment and Database Search

Global alignment finds the best end-to-end alignment. For database search — comparing a query protein against millions of known proteins — we want to find the region of best local similarity, regardless of the overall sequences.

Smith–Waterman Local Alignment

\[ H(i, j) = \max\begin{cases} 0 & \\ H(i-1, j-1) + s(x_i, y_j) \\ H(i-1, j) + g \\ H(i, j-1) + g \end{cases} \]

The addition of the \(0\) option (compared to Needleman–Wunsch) allows the alignment to restart with zero score, avoiding negative-score extensions. The best local alignment score is \(\max_{i,j} H(i,j)\). Still \(O(mn)\) time and space.
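The local-alignment recurrence differs from the global one only in the \(0\) option and in taking the maximum over all cells; a minimal score-only sketch with illustrative parameters:

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, score only)."""
    m, n = len(x), len(y)
    H = [[0] * (n + 1) for _ in range(m + 1)]  # first row/column stay 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0,                       # restart the alignment
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

Note that two sequences with no similarity score 0, never negative — the 0 option guarantees this.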

Linear-space alignment (Hirschberg 1975, originally for the longest common subsequence; adapted to sequence alignment by Myers and Miller 1988): the traceback matrix requires \(O(mn)\) space, which is prohibitive for long sequences. The linear-space trick uses divide-and-conquer: find the midpoint of the optimal alignment (the cell \((m/2, k)\) on the alignment path) using only \(O(n)\) space (forward and backward score arrays), then recurse on the two halves. Total time: \(O(mn)\), space: \(O(m + n)\).

Smith–Waterman on a database of billions of amino acids would take days. BLAST (Basic Local Alignment Search Tool; Altschul et al. 1990) achieves near-linear time by filtering candidates using short exact matches.

BLAST algorithm (for proteins):

  1. Word generation: decompose the query into overlapping words (k-mers, k = 3 for proteins). For each query k-mer, generate all k-mers with score above threshold \(T\) under the scoring matrix.
  2. Lookup: find all database positions matching any generated word using a hash table or suffix array — these are hits.
  3. Extension: extend each hit in both directions (ungapped, then gapped) until the score drops below a threshold.
  4. Significance: report alignments whose score exceeds a threshold corresponding to a specified E-value (expected number of such alignments by chance in a database of this size).
\[ E = K m n e^{-\lambda S}, \]

where \(S\) is the raw alignment score, \(m\) and \(n\) are query and database lengths, and \(K, \lambda\) are parameters depending on the scoring matrix and gap penalties.
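A small illustration of how the E-value scales with the raw score; the \(K\) and \(\lambda\) values below are of the magnitude reported for ungapped BLOSUM62 statistics but are illustrative only — real BLAST estimates them per search:

```python
import math

def blast_evalue(S, m, n, K=0.13, lam=0.318):
    """Karlin-Altschul E-value E = K*m*n*exp(-lambda*S).

    K and lam are illustrative values of the kind reported for ungapped
    BLOSUM62 searches, not universal constants.
    """
    return K * m * n * math.exp(-lam * S)

# Each extra score unit multiplies E by exp(-lambda), so E drops
# exponentially in S while growing only linearly in m and n.
e40 = blast_evalue(40, m=300, n=10**8)
e60 = blast_evalue(60, m=300, n=10**8)
```

The ratio `e40 / e60` equals \(e^{20\lambda}\) exactly, which is why a modest score difference separates chance hits from significant ones.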

Spaced seeds: BLAST uses consecutive k-mers (“solid seeds”), but spaced seeds (patterns with don’t-care positions, e.g., \(1 \# 1 \# 1\) meaning match in positions 1, 3, 5 but ignore positions 2, 4) provide better sensitivity at the same specificity. Ma, Tromp, Li (2002) showed that spaced seeds outperform consecutive k-mers; optimising seed patterns (maximising sensitivity for a given hit probability) is a DP on the seed pattern.
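Seed hit detection can be sketched as a hash lookup on the sampled positions; the helper names here are hypothetical, and `#` marks a don't-care position as in the text:

```python
from collections import defaultdict

def seed_key(seq, i, pattern):
    """Characters of seq at the pattern's match ('1') positions, from i."""
    return tuple(seq[i + p] for p, c in enumerate(pattern) if c == "1")

def spaced_seed_hits(query, db, pattern="1#1#1"):
    """All (query_pos, db_pos) pairs agreeing on the '1' positions."""
    span = len(pattern)
    index = defaultdict(list)
    for i in range(len(db) - span + 1):       # index the database
        index[seed_key(db, i, pattern)].append(i)
    hits = []
    for i in range(len(query) - span + 1):    # probe with the query
        for j in index.get(seed_key(query, i, pattern), []):
            hits.append((i, j))
    return hits

# Mismatches at the don't-care ('#') positions do not destroy the hit:
hits = spaced_seed_hits("ACGTA", "AGGCA")   # sequences differ at 2 and 4
```

The same two sequences produce no hit at all under the consecutive seed `"111"` of equal weight, which is the sensitivity advantage the text describes.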

Chapter 3: Multiple Sequence Alignment

Comparing two sequences is the foundation; comparing many simultaneously reveals conserved patterns and builds evolutionary models. Multiple sequence alignment (MSA) aligns \(k\) sequences simultaneously.

The natural extension of pairwise DP to \(k\) sequences requires a \(k\)-dimensional DP table of size \(O(n^k)\) — exponential in \(k\). Exact MSA is NP-hard for the SP (sum-of-pairs) score under general gap penalties.

Heuristic MSA: Progressive Alignment

Progressive alignment (CLUSTALW; Thompson et al. 1994): first compute all pairwise alignment scores; build a guide tree (UPGMA or neighbor-joining from pairwise distances); align sequences in order dictated by the guide tree, merging profiles.

A profile is an \(L \times |\Sigma|\) matrix, where \(L\) is the alignment length, giving the frequency of each amino acid at each column across the already-aligned sequences. Profile–profile alignment extends Needleman–Wunsch to score columns against the profile distributions. Progressive alignment runs in \(O(k^2 n^2)\) total and works well when sequences are closely related.

Approximation for SP-MSA

For the SP (sum-of-pairs) objective — total score over all \(\binom{k}{2}\) pairwise alignments induced by the MSA — a 2-approximation exists:

Theorem (2-approximation for SP-MSA). The “star alignment” algorithm — choose a center sequence \(c^*\), align all other sequences to \(c^*\), then merge — achieves cost at most 2 OPT for the SP objective.

The proof uses the fact that each pairwise alignment in the star is a projection of the MSA, so the projected score is at most the corresponding contribution in the optimal MSA. Summing over all pairs gives the 2-factor.

Exact algorithms for small \(k\): the exact DP is feasible for \(k \leq 5\) or so. For large \(k\), modern tools (MUSCLE, MAFFT, T-Coffee) use iterative refinement: run progressive alignment, then improve by realigning subsets of sequences in a round-robin fashion.

Chapter 4: Proteomics and Mass Spectrometry

DNA sequencing tells us which genes are present; proteomics studies which proteins are actually expressed, and at what levels. Mass spectrometry (MS) has become the dominant technique for large-scale protein identification and quantification.

Mass Spectrometry Basics

A mass spectrometer measures the mass-to-charge ratio (\(m/z\)) of ions. The three main components are:

  1. Ionisation: the sample (usually a tryptic digest of proteins) is ionised. Electrospray ionisation (ESI) produces multiply charged ions (\([M + zH]^{z+}\)) in solution; MALDI ablates a matrix-embedded sample with a laser, producing singly charged ions.
  2. Mass analyser: separates ions by \(m/z\). Quadrupole (low resolution, fast), TOF (time-of-flight; measures flight time — high resolution, broad range), Orbitrap (ions oscillate around a central electrode at a frequency proportional to \((m/z)^{-1/2}\); the signal is Fourier-transformed into a spectrum — very high resolution).
  3. Detector: records ion arrival times/frequencies.

Tandem MS (MS/MS or MS\(^2\)): first stage selects a precursor ion of a specific \(m/z\); second stage fragments it (by collision-induced dissociation) and analyses fragment masses. Each fragment ion retains either the N-terminal (b-ions) or C-terminal (y-ions) part of the peptide:

  • \(b_k\) ion: amino acids \(1, \ldots, k\), plus one proton, no water.
  • \(y_k\) ion: amino acids \(n-k+1, \ldots, n\), plus one water, plus one proton.

The mass of a peptide is \(\sum_i m(a_i) + m(\text{H}_2\text{O})\); b-ions and y-ions together uniquely determine the sequence if the spectrum is complete.
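The b/y ion masses follow directly from these definitions; a sketch using standard monoisotopic residue masses, computed for \(k = 1, \ldots, n-1\) (the full-length ions are rarely observed):

```python
# Monoisotopic amino acid residue masses (Da), rounded to 5 decimals.
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
           "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
           "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
           "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931}
WATER, PROTON = 18.01056, 1.00728

def by_ions(peptide):
    """Singly charged b- and y-ion m/z values, k = 1..n-1."""
    n = len(peptide)
    # b_k: N-terminal residues 1..k plus one proton (no water)
    b = [sum(RESIDUE[a] for a in peptide[:k]) + PROTON for k in range(1, n)]
    # y_k: C-terminal k residues plus water plus one proton
    y = [sum(RESIDUE[a] for a in peptide[n - k:]) + WATER + PROTON
         for k in range(1, n)]
    return b, y

b, y = by_ions("PEPTIDE")
```

By construction \(b_k + y_{n-k} = M + 2\,m(\text{H}^+)\), where \(M\) is the neutral peptide mass — the complementarity that lets b- and y-ion ladders confirm each other.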

In practice, a spectrum is noisy and incomplete. Database search identifies the peptide by comparing the observed spectrum against theoretical spectra computed for all peptides in a proteome database.

Sequest scoring (Eng–McCormack–Yates 1994): for each candidate peptide, compute the theoretical b/y ion spectrum and cross-correlate with the observed spectrum. The normalised cross-correlation score \(\text{XCorr}\) ranks candidates. The top hit is the peptide identification.

Target–decoy estimation searches the spectra against a decoy database (reversed or shuffled sequences) alongside the target database; the false discovery rate (FDR) at a score threshold is then estimated as

\[ \text{FDR} = \frac{\text{\# decoy hits above threshold}}{\text{\# target hits above threshold}}. \]

Setting the threshold to achieve \(\text{FDR} = 1\%\) is standard.

Pitfalls in FDR estimation: target-decoy FDR is valid under the assumption that decoy hit scores have the same distribution as false target hits. If the target database contains many similar sequences (e.g., many isoforms), or if decoys are constructed from the same database (correlated scores), FDR is underestimated. Separate-search FDR (searching target and decoy separately) and entrapment databases (adding unrelated sequences as controls) address these pitfalls.
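A minimal sketch of target–decoy thresholding on synthetic score lists (the scores below are made up to illustrate the threshold sweep):

```python
def fdr_threshold(target_scores, decoy_scores, fdr=0.01):
    """Smallest score threshold whose estimated target-decoy FDR <= fdr.

    FDR(t) = (# decoy hits >= t) / (# target hits >= t).
    Returns None if no threshold achieves the requested FDR.
    """
    for t in sorted(set(target_scores)):         # sweep thresholds upward
        targets = sum(s >= t for s in target_scores)
        decoys = sum(s >= t for s in decoy_scores)
        if targets and decoys / targets <= fdr:
            return t
    return None

# Synthetic example: true hits score high; decoys and false targets low.
targets = [90, 85, 80, 75, 70, 40, 35, 30, 25, 20]
decoys = [45, 38, 33, 28, 22, 18, 15, 12, 10, 8]
```

On this toy data the threshold lands at 70: below it, at least one decoy survives and the estimated FDR exceeds 1%.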

De Novo Peptide Sequencing

Without a database, de novo sequencing recovers the peptide sequence directly from the spectrum. It treats sequencing as a graph problem: create a spectrum graph where each node is a peak mass and each edge of mass difference \(\Delta m = m(a)\) (the mass of an amino acid \(a\)) represents a possible residue.

The optimal path through the spectrum graph corresponds to the best peptide sequence explaining the observed peaks. Dynamic programming on the DAG (topologically ordered by mass) finds the highest-score path.
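The DP on the spectrum graph can be sketched as a longest-path computation on the mass-ordered DAG; the three-residue alphabet and the ideal noise-free peak list below are illustrative:

```python
# Monoisotopic residue masses for a toy three-letter alphabet.
AA = {"G": 57.02146, "A": 71.03711, "S": 87.03203}
TOL = 0.01  # mass tolerance in Da

def de_novo_path(masses):
    """Read a residue sequence off the longest path in the spectrum graph.

    Nodes are peak masses sorted ascending; an edge m1 -> m2 exists when
    m2 - m1 matches a residue mass within TOL. dp[j] = most peaks
    explained by a path ending at node j.
    """
    nodes = sorted(masses)
    dp = [1] * len(nodes)
    back = [None] * len(nodes)          # (predecessor index, residue)
    for j in range(len(nodes)):
        for i in range(j):
            for aa, mass in AA.items():
                if abs(nodes[j] - nodes[i] - mass) < TOL and dp[i] + 1 > dp[j]:
                    dp[j], back[j] = dp[i] + 1, (i, aa)
    # Trace back from the best end node to recover the residue labels.
    j = max(range(len(nodes)), key=dp.__getitem__)
    seq = []
    while back[j] is not None:
        j, aa = back[j]
        seq.append(aa)
    return "".join(reversed(seq))

# Ideal prefix-mass ladder for the peptide "GAS": 0, G, GA, GAS.
peaks = [0.0, 57.02146, 128.05857, 215.0906]
```

On the clean ladder above the DP recovers "GAS"; real spectra add noise peaks and missing ions, which the scoring function must absorb.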

Novor (Ma 2015): a modern de novo tool using a machine-learning-trained scoring function to evaluate each edge (residue) transition. Novor processes hundreds of spectra per second and achieves high accuracy on high-quality spectra.

Chapter 5: Hidden Markov Models and Gene Prediction

Hidden Markov Models

A Hidden Markov Model (HMM) has a sequence of hidden states \(q_1, q_2, \ldots, q_T\) (not directly observed) and a sequence of observations \(o_1, o_2, \ldots, o_T\). The model is specified by:

  • Transition probabilities \(a_{ij} = P(q_{t+1} = j \mid q_t = i)\).
  • Emission probabilities \(b_i(o) = P(o_t = o \mid q_t = i)\).
  • Initial distribution \(\pi_i = P(q_1 = i)\).

Three canonical HMM algorithms:

  1. Forward algorithm: compute \(P(\text{observations} \mid \text{model})\) in \(O(TN^2)\) (where \(N\) = number of states).
  2. Viterbi algorithm: find the most likely hidden state sequence given observations — the argmax over all state paths, computed by DP: \[ v_t(j) = \max_{i} v_{t-1}(i)\, a_{ij}\, b_j(o_t). \]
  3. Baum–Welch algorithm: learn HMM parameters \(\lambda = (A, B, \pi)\) from observations by EM (expectation-maximisation). E-step: compute posterior state probabilities via forward-backward; M-step: update parameters as weighted statistics.
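The Viterbi recurrence above can be sketched directly; the two-state "occasionally dishonest casino" parameters below are a standard toy example, not from the notes:

```python
def viterbi(obs, states, pi, a, b):
    """Most likely hidden state path for an observation sequence.

    pi[i]: initial prob; a[i][j]: transition prob; b[i][o]: emission prob.
    """
    v = [{s: pi[s] * b[s][obs[0]] for s in states}]   # v_1(j)
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for j in states:
            # v_t(j) = max_i v_{t-1}(i) * a[i][j] * b[j](o_t)
            best = max(states, key=lambda i: v[t - 1][i] * a[i][j])
            v[t][j] = v[t - 1][best] * a[best][j] * b[j][obs[t]]
            back[t][j] = best
    last = max(states, key=lambda s: v[-1][s])        # best final state
    path = [last]
    for t in range(len(obs) - 1, 0, -1):              # trace back
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy "occasionally dishonest casino": loaded die L emits '6' half the time;
# '-' stands for any non-six roll.
states = ["F", "L"]
pi = {"F": 0.5, "L": 0.5}
a = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.1, "L": 0.9}}
b = {"F": {"6": 1 / 6, "-": 5 / 6}, "L": {"6": 0.5, "-": 0.5}}
```

A run of sixes is decoded as the loaded state, a run of non-sixes as fair — exactly the segmentation behaviour gene-finding HMMs exploit. (A production implementation works in log space to avoid underflow.)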

Gene Prediction with HMMs

A gene in a DNA sequence is a region that is transcribed to RNA and (for protein-coding genes) translated to protein. Gene prediction is the problem of identifying gene locations in a newly sequenced genome.

The biological signal includes:

  • Start codon (ATG) and stop codon (TAA, TAG, or TGA).
  • Splice sites: introns begin with GT and end with AG (GT-AG rule for eukaryotes).
  • Coding sequences (CDS): coding DNA has a periodic statistical signal in trinucleotides (because of codon usage bias) — a periodic HMM can detect this.
  • Promoters: upstream regulatory sequences (TATA box at -30 bp from the transcription start site in many genes).

A gene-finding HMM has states for intergenic regions, exons, introns, start/stop codons, and splice sites. The Viterbi algorithm finds the most likely gene structure given the DNA sequence. Tools like GENSCAN (Burge–Karlin 1997) and AUGUSTUS (Stanke–Morgenstern 2005) use such models. Ab initio gene finders use only sequence statistics; evidence-based finders incorporate expressed sequence tags (ESTs), RNA-seq data, and cross-species conservation to improve accuracy.

Chapter 6: Phylogenetic Tree Reconstruction

Phylogenetics reconstructs the evolutionary history (tree) of a set of species (or genes) from observed sequence data. The phylogenetic tree (or species tree) shows the branching pattern of lineages from their common ancestor.

Parsimony

The maximum parsimony criterion seeks the tree topology (and ancestral sequences) that minimises the total number of evolutionary changes. For discrete characters (any number of states), Fitch's algorithm solves the small parsimony problem (given a fixed tree, find the assignment of ancestral states minimising changes) in \(O(kn)\) time for \(k\) characters and \(n\) leaves:

  • Bottom-up: each internal node’s state is the intersection of its children’s states (if non-empty) or their union (if empty, adding 1 to the parsimony score).
  • Top-down traceback: assign states consistent with the sets.
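The bottom-up phase can be sketched for a single character (the tree and leaf states below are illustrative; the top-down assignment is omitted):

```python
def fitch(tree):
    """Fitch's small parsimony, bottom-up phase, for one character.

    A leaf is a state (e.g. 'A'); an internal node is a (left, right) pair.
    Returns (candidate state set at the root, parsimony score).
    """
    if not isinstance(tree, tuple):            # leaf: singleton state set
        return {tree}, 0
    left, ls = fitch(tree[0])
    right, rs = fitch(tree[1])
    if left & right:                           # non-empty intersection:
        return left & right, ls + rs           # no change needed here
    return left | right, ls + rs + 1           # union: one extra change

# Character states at the four leaves of the tree ((A,C),(C,G)):
root_set, score = fitch((("A", "C"), ("C", "G")))
```

Each cherry forces a union (one change each), and the root intersects to {C}, giving a parsimony score of 2 for this character.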

The large parsimony problem (finding the tree minimising parsimony) is NP-hard; heuristics explore the tree space by nearest-neighbor interchange (NNI) and subtree pruning and regrafting (SPR).

Perfect phylogeny: a special case where every character changes state exactly once and never reverts. For binary characters, a perfect phylogeny exists iff every pair of characters passes the four-gamete test (the four combinations 00, 01, 10, 11 do not all appear). The compatibility check runs in \(O(k^2 n)\) time.
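The four-gamete test is a few lines over all column pairs (matrix layout assumed here: rows are taxa, columns are binary characters):

```python
from itertools import combinations

def perfect_phylogeny_exists(matrix):
    """Four-gamete test on a binary character matrix (rows = taxa).

    A perfect phylogeny exists iff no pair of columns exhibits all four
    gametes 00, 01, 10, 11.
    """
    n_chars = len(matrix[0])
    for c1, c2 in combinations(range(n_chars), 2):
        gametes = {(row[c1], row[c2]) for row in matrix}
        if len(gametes) == 4:       # all four combinations present
            return False
    return True

compatible = [[0, 0], [0, 0], [1, 0], [1, 1]]      # only three gametes
incompatible = [[0, 0], [0, 1], [1, 0], [1, 1]]    # all four gametes
```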

Distance-Based Methods

Distance-based methods first estimate pairwise evolutionary distances \(d_{ij}\) (from sequence divergence, corrected for multiple hits via Jukes–Cantor or similar substitution models), then reconstruct the tree from the distance matrix.

UPGMA (Unweighted Pair Group Method with Arithmetic Mean): a hierarchical clustering algorithm. Merge the two closest taxa at each step, updating distances as averages. UPGMA is fast (\(O(n^2)\)) but assumes a molecular clock (constant evolutionary rate across lineages) — it gives an incorrect tree when rates vary.

Neighbor-joining (Saitou–Nei 1987): an agglomerative algorithm that corrects for unequal rates. At each step, merge the pair that minimises a corrected distance \(Q_{ij} = (n-2)d_{ij} - r_i - r_j\) where \(r_i = \sum_k d_{ik}\). NJ is consistent (converges to the true tree as sequence length increases) under the tree model and runs in \(O(n^3)\).
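One agglomeration step — computing \(Q\) and choosing the pair to merge — can be sketched as follows; the distance matrix below is derived from a small additive tree, so any chosen pair is a true cherry:

```python
from itertools import combinations

def nj_pick_pair(names, d):
    """One neighbor-joining step: return the pair minimising Q_ij.

    Q_ij = (n-2) d_ij - r_i - r_j, with r_i = sum_k d_ik.
    """
    n = len(names)
    r = {i: sum(d[i][k] for k in names if k != i) for i in names}
    return min(combinations(names, 2),
               key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])

# Additive distances from the tree ((A:1,B:4):2,(C:2,D:3)):
d = {"A": {"A": 0, "B": 5, "C": 5, "D": 6},
     "B": {"A": 5, "B": 0, "C": 8, "D": 9},
     "C": {"A": 5, "B": 8, "C": 0, "D": 5},
     "D": {"A": 6, "B": 9, "C": 5, "D": 0}}
pair = nj_pick_pair(["A", "B", "C", "D"], d)
```

Note the rate correction at work: UPGMA would merge the globally closest pair, while \(Q\) subtracts each taxon's total divergence \(r_i\), which is what recovers the correct cherry when rates are unequal. A full NJ implementation would then replace the pair with a new node and update the distance matrix.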

Maximum Likelihood Phylogenetics

Maximum likelihood (Felsenstein 1981) finds the tree topology and branch lengths that maximise the probability of the observed sequences. Under the GTR model (General Time Reversible), the substitution rate matrix \(Q\) has six reversible rates and four nucleotide frequencies, giving a continuous-time Markov model for sequence evolution.

Felsenstein's pruning algorithm computes the conditional likelihood \(L_k(b)\) of the data below node \(k\), given state \(b\) at \(k\), by recursing over the children:

\[ L_k(b) = \prod_{j \in \text{children}(k)} \left(\sum_{c} P_{bc}(t_j) L_j(c)\right), \]

where \(P_{bc}(t)\) is the probability of substituting from state \(b\) to \(c\) in time \(t\). ML tree search (used in RAxML, IQ-TREE) combines NNI/SPR moves with ML optimisation.

Chapter 7: Indexing Sequences — Suffix Trees and Arrays

For large-scale sequence analysis (aligning a query to a reference genome of 3 billion bases), we need fast string indexing structures.

Suffix Trees

A suffix tree of a string \(S\) of length \(n\) is a compressed trie of all \(n\) suffixes \(S[i..n]\). Each leaf is labelled with the starting position of the corresponding suffix; each internal node represents a common prefix of multiple suffixes; each edge is labelled with a (non-empty) substring. The compressed structure stores only branching nodes, reducing total nodes to \(O(n)\).

Key operations:

  • Substring search: find all occurrences of pattern \(P\) in \(S\) in \(O(|P| + \text{occ})\) time — follow the path for \(P\) from the root, then report all leaves in the subtree.
  • Longest common substring of two strings \(S_1, S_2\): build a generalised suffix tree for \(S_1\$_1 S_2\$_2\); the deepest internal node with descendant leaves from both strings gives the LCS.
  • Maximal Unique Match (MUM): a match between two genomes that occurs exactly once in each. Found by traversing the generalised suffix tree for nodes whose subtrees have exactly one leaf from each genome. Used in MUMmer for whole-genome alignment.

Ukkonen’s linear-time algorithm (1995) builds the suffix tree in \(O(n)\) time using suffix links and the observation that extending one suffix implicitly extends all others.

Suffix Arrays

A suffix array \(\text{SA}[1..n]\) stores the starting positions of all suffixes in lexicographic order. Together with the LCP array (longest common prefix between consecutive suffixes in sorted order), it provides the same functionality as the suffix tree with lower constant factors in practice.

Search for pattern \(P\): binary search on the SA in \(O(|P| \log n)\); using the LCP array, this reduces to \(O(|P| + \log n)\). Construction: \(O(n)\) time algorithms exist (SA-IS; Nong, Zhang, Chan 2009), though \(O(n \log n)\) sort-based algorithms are simpler and often used in practice.
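Construction by direct sorting plus the two binary searches can be sketched in a few lines (direct sorting is \(O(n^2 \log n)\) in the worst case; SA-IS replaces it at scale):

```python
def suffix_array(s):
    """Suffix array by direct comparison sorting (simple, not linear-time)."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def sa_search(s, sa, p):
    """All occurrences of pattern p via binary search on the suffix array.

    Suffixes prefixed by p form one contiguous block in sorted order.
    """
    lo, hi = 0, len(sa)
    while lo < hi:                                   # lower bound of block
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                                   # upper bound of block
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(p)] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

s = "banana"
sa = suffix_array(s)            # [5, 3, 1, 0, 4, 2]
occ = sa_search(s, sa, "ana")   # starting positions of "ana" (0-based)
```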

The FM-index (full-text index; Ferragina–Manzini 2000) combines the suffix array with the Burrows–Wheeler transform (BWT) for compressed text indexing and backward search. The BWT rearranges the string so that characters with similar context appear together, enabling compression. The FM-index enables exact-match search in \(O(|P|)\) time with the index using only \(O(n \log \sigma)\) bits (near the information-theoretic minimum). BWA and Bowtie use FM-indexes for fast read mapping to reference genomes.
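Backward search on the BWT can be sketched with full \(C\) and Occ tables (a real FM-index samples these tables to reach the compressed space bound):

```python
def bwt(s):
    """Burrows-Wheeler transform via sorted rotations of s + '$'."""
    s += "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def fm_count(s, p):
    """Count occurrences of p in s by backward search on the BWT."""
    L = bwt(s)
    # C[c]: number of characters in the text strictly smaller than c.
    sorted_L = sorted(L)
    C = {c: sorted_L.index(c) for c in set(L)}
    # occ[c][i]: occurrences of c in L[:i] (full table for simplicity).
    occ = {c: [0] for c in set(L)}
    for ch in L:
        for c in occ:
            occ[c].append(occ[c][-1] + (ch == c))
    lo, hi = 0, len(L)
    for c in reversed(p):               # consume the pattern right to left
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]          # LF-mapping narrows the SA range
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

Each pattern character costs a constant number of table lookups, giving the \(O(|P|)\) search time claimed above; reporting positions additionally needs a sampled suffix array.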

Chapter 8: Spectrum Prediction and Deep Learning in Bioinformatics

Deep Learning for Spectrum Prediction

Traditional fragmentation models predict b/y ion intensities from physicochemical properties of the peptide. Deep neural networks instead learn the complex relationship between peptide sequence and spectrum directly from training data.

pDeep and Prosit (2019) train neural networks (LSTM or transformer-based) to predict fragment ion intensities from peptide sequences, charge states, and collision energies. Inputs are embedded amino acid sequences; outputs are predicted intensities for each b/y ion type. Training on hundreds of thousands of experimental spectra from public repositories (PRIDE, MassIVE) enables highly accurate intensity prediction.

Applications: improving database search scores (replace simple theoretical spectra with predicted spectra), data-independent acquisition (DIA) analysis where all co-eluting peptides are fragmented simultaneously, and spectral library generation for new organisms.

AlphaFold and Protein Structure Prediction

The protein folding problem — predicting three-dimensional structure from sequence — was a 50-year grand challenge. AlphaFold2 (Jumper et al., DeepMind 2021) achieved near-experimental accuracy for the first time, winning the CASP14 competition.

AlphaFold2’s key innovations:

  • Multiple sequence alignment (MSA) features: evolutionary co-variation across homologous sequences encodes structural contacts.
  • Evoformer: a transformer-based module that iterates attention over the MSA and a pair representation (capturing residue-residue relationships).
  • Structure module: explicitly predicts 3D coordinates by iterating “invariant point attention” — attention that respects SE(3) (rotation-translation) equivariance.

The AlphaFold Protein Structure Database now contains predicted structures for essentially all known proteins (\(>200\) million), accelerating drug discovery, functional annotation, and evolutionary analysis.

Chapter 9: Beyond This Course

Natural Next Topics

CS 482 covers the algorithmic foundations of bioinformatics — sequence alignment, mass spectrometry, HMMs, phylogenetics, and suffix-based indexing — with a strong algorithms perspective. The course deliberately prioritises mathematical and algorithmic rigour over wet-lab biology, which positions its graduates to engage with the rapidly evolving computational biology literature. Several major directions grow naturally from this foundation.

Next-generation sequencing analysis and the read-mapping problem. Modern high-throughput sequencers (Illumina short reads, 150 bp; PacBio HiFi long reads, \(\sim\)15 kb; Oxford Nanopore, up to megabases) produce billions of reads per instrument run. Read mapping — aligning each read to a reference genome — is the entry point for virtually every downstream analysis. The BWT-FM index (Burrows–Wheeler transform with an FM-index) enables exact and approximate matching in \(O(m)\) time per query, essentially independent of reference genome size. BWA-MEM2 and Minimap2 extend this to paired-end and long-read data with seed-chain-extend heuristics. The read-mapping problem itself connects directly to the suffix-array material in CS 482: the FM-index is a compressed suffix array that achieves space close to the compressed genome while preserving \(O(m)\) lookup.

Once reads are mapped, variant calling identifies differences between an individual’s genome and the reference: single-nucleotide polymorphisms (SNPs), insertions-deletions (indels), and structural variants (SVs — inversions, translocations, copy-number changes). GATK’s HaplotypeCaller uses local assembly (reassembling reads in active regions using a de Bruijn graph) to call SNPs and small indels with high accuracy. DeepVariant reformulates variant calling as an image classification problem: it converts pileup data at each candidate site to a tensor (read, position, base channel) and applies a ResNet to predict genotypes. Structural variant detection is harder because SVs span kilobases to megabases; methods rely on split-read signatures, discordant read pairs, and read-depth anomalies.

Graph genomes and pangenomics. A single linear reference genome (e.g., GRCh38 for Homo sapiens) introduces reference bias: reads from individuals with alleles absent from the reference are misaligned or unmapped, causing systematic errors in downstream analyses. Graph genomes represent the reference as a directed acyclic graph (or more generally a variation graph), where nodes are sequence segments and edges represent alternative paths — alleles, indels, inversions. The algorithmic challenges are significant: efficient construction from thousands of assembled genomes, read mapping to a graph (generalising BWT), variant calling relative to graph paths, and population-level genotyping. The Human Pangenome Reference Consortium (2023) released the first high-quality human pangenome of 47 individuals, covering \(\sim\)119 Mb of sequence absent from GRCh38. Efficient algorithms for graph genome analysis are an active research frontier.

Single-cell genomics. Microfluidics-based methods (10x Genomics Chromium) can barcode mRNA molecules from tens of thousands of individual cells simultaneously, producing a cell-by-gene count matrix of dimension \(\sim 10^4 \times 10^4\) to \(\sim 10^5 \times 10^4\). The matrix is extremely sparse (most genes are not expressed in a given cell) and contaminated by technical noise (dropout events — true expression that goes undetected). The computational pipeline — ambient RNA removal, doublet detection, normalisation, dimensionality reduction (PCA → UMAP), clustering (Leiden algorithm on a k-nearest-neighbour graph), cell-type annotation, and differential expression — is a computational biology workflow in itself.

More sophisticated single-cell analyses include pseudotime analysis (ordering cells along a developmental trajectory), RNA velocity (using unspliced-to-spliced mRNA ratios to infer the time derivative of expression state, modelled by a system of ODEs), and cell-cell communication (inferring ligand-receptor interactions between cell types from expression data). Spatial transcriptomics (Visium, MERFISH, seqFISH+) adds physical coordinates to transcriptomic measurements, allowing reconstruction of tissue architecture at cellular resolution. The key algorithmic challenges in this area are scalability (datasets now reach \(10^6\) cells), batch effect correction across experimental conditions, and statistical rigour in differential expression when cells from the same individual are not independent.

Metagenomics. When genomic material is sequenced directly from an environmental sample (soil, ocean water, gut microbiome), reads originate from thousands to millions of species at widely varying abundances. Metagenomic assembly must reconstruct genomes (or genome fragments) from this mixture without a reference. The assembly graph (de Bruijn graph) for a metagenome is vastly more complex than for a single genome: strain-level variation, horizontal gene transfer, and low-coverage species create ambiguities. Binning — grouping assembled contigs into putative genomes — uses tetranucleotide composition (k-mer profiles) and coverage depth, exploiting that different species have characteristic sequence compositions. Deep learning approaches (DAS Tool, SemiBin2) use neural embeddings of contig features for binning. Taxonomic profiling (Kraken2, Metaphlan4) assigns reads to taxa using k-mer databases and marker genes. The human microbiome contains \(\sim 10^{13}\) microbial cells; understanding how its composition relates to health, disease, and drug metabolism is a major biomedical research direction.

Protein language models and structure-function prediction. The CS 482 coverage of AlphaFold2 introduces the transformer-based architecture that revolutionised structural biology. Subsequent developments extend this in several directions. ESM-2 (Facebook AI Research, 2022) is a 15-billion-parameter protein language model trained purely on sequence data; its residue embeddings encode evolutionary and structural information without explicit MSA computation. RoseTTAFold uses a three-track architecture (sequence, MSA, and pair representations) and achieves accuracy competitive with AlphaFold2. AlphaFold3 (DeepMind, 2024) extends structure prediction to protein-DNA, protein-RNA, and protein-ligand complexes — critical for drug discovery. The open problem is not structure prediction (now largely solved for single proteins) but function prediction: given a structure, what does the protein do? The sequence-structure-function relationship remains incompletely understood.

Follow-up Courses and Reading

At UWaterloo, CS 683 (Computational Biology) is the natural graduate continuation, covering genome assembly theory (de Bruijn graphs, string graphs, assembly polishing), statistical genomics (GWAS, eQTL mapping), and population genetics. STAT 440 (Computational Inference) covers Markov chain Monte Carlo methods that underlie Bayesian phylogenetics (MrBayes, BEAST), and variational inference methods used in probabilistic single-cell analysis (scVI). BIOL 439 (Genomics and Bioinformatics) provides the biological context — gene structure, regulatory elements, epigenomics — that makes the algorithms meaningful. CS 889 (special topics) has covered deep learning for biology in recent years.

At peer universities, MIT’s 7.91J / 20.490J (Foundations of Computational and Systems Biology) is one of the strongest introductory graduate courses; all lecture videos and problem sets are freely available through MIT OpenCourseWare. Stanford CS 374 (Algorithms for Biology) covers similar ground with a stronger algorithms emphasis. UC Berkeley’s CS 176 (Algorithms in Bioinformatics) is another strong option. Johns Hopkins, which has a particularly deep computational biology research community (Steven Salzberg, Ben Langmead, Michael Schatz), offers advanced courses through its department of Biomedical Engineering; Ben Langmead’s computational genomics materials (github.com/BenLangmead/comp-genomics-class) are among the best freely available resources for NGS analysis.

Primary textbooks. Durbin, Eddy, Krogh and Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids (Cambridge University Press, 1998) remains the authoritative reference for HMMs, profile HMMs, pairwise alignment, and phylogenetic models. Pevzner and Compeau, Bioinformatics Algorithms: An Active Learning Approach (3rd edition, 2018; interactive version at bioinformaticsalgorithms.org) is more accessible and includes programming exercises. Gusfield, Algorithms on Strings, Trees, and Sequences (Cambridge, 1997) gives the definitive algorithmic treatment of suffix trees, suffix arrays, and string matching — essential background for Chapter 7. For genomics pipelines, Heng Li’s documentation and papers for BWA and Minimap2 are indispensable; for population genetics and GWAS, Population Genetics: A Concise Guide by Gillespie and Human Evolutionary Genetics by Jobling, Hurles and Tyler-Smith provide the biological background.

For deep learning in biology. Eraslan et al. Deep Learning: New Computational Modelling Techniques for Genomics (Nature Reviews Genetics, 2019) is an excellent review. Zou et al. A Primer on Deep Learning in Genomics (Nature Genetics, 2019) focuses on sequence-based models. For protein structure, the AlphaFold2 paper (Jumper et al., Nature, 2021) and its supplementary material are required reading; the AlphaFold3 paper (Abramson et al., Nature, 2024) extends the framework to biomolecular complexes. The DREAM challenges (dreamchallenges.org) provide community benchmarking problems spanning gene expression, drug response prediction, and network inference; participating in a DREAM challenge is an excellent way to calibrate your methods against the field.

For self-study. The Rosalind platform (rosalind.info) offers hundreds of bioinformatics programming problems with an interactive judge, organised by topic (string algorithms, phylogenetics, mass spectrometry). Completing the full Rosalind set provides solid algorithmic foundations. The EMBL-EBI Training portal (training.ebi.ac.uk) provides short courses on specific databases and tools (BLAST, InterPro, UniProt, Ensembl) that are used in every bioinformatics pipeline. For statistical genomics, the GTEx and gnomAD consortium papers are good entry points to understanding large-scale genomic datasets.

Open Problems and Active Research

The algorithmic foundations of bioinformatics are mature, but the field is far from closed. Several major open problems sit at the boundary of algorithms, statistics, and biology.

Reference bias, graph genomes, and pangenomics. The single linear reference genome remains the standard for most bioinformatics pipelines, but its limitations are well-documented. Reads from individuals of non-European ancestry are systematically harder to map because GRCh38 over-represents European genomic variation. Medically important regions (MHC, segmental duplications, centromeres) are structurally complex and poorly represented in a linear reference. The Human Pangenome Reference Consortium (HPRC) has released a pangenome of 47 high-quality haplotype-resolved genomes; extending this to thousands of individuals requires compressing the variation graph to manageable size while preserving mapping accuracy. The key open algorithmic problems are: (1) efficient variation graph construction and indexing at population scale, (2) read mapping to a graph that is provably sensitive and specific, (3) variant calling relative to a graph — which allele is the reference? — and (4) population-level statistics on graph-structured data. The HPRC papers (Liao et al. 2023 and subsequent) and the GBZ format for graph genome storage represent the current state of the art.
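As a minimal illustration of why a graph represents variation naturally, the toy node-labelled DAG below (hypothetical node names and sequences; real pangenome tools such as vg apply the same idea at genome scale) encodes a SNP and an indel as bubbles, and every source-to-sink path spells one haplotype:

```python
# Toy variation graph: a SNP bubble (a/g) and an indel bubble (d optional).
# Each node carries a sequence label and a list of successors.
graph = {
    "s": ("ACGT", ["a", "g"]),
    "a": ("A",    ["m"]),      # SNP allele 1
    "g": ("G",    ["m"]),      # SNP allele 2
    "m": ("TTC",  ["d", "e"]),
    "d": ("GG",   ["e"]),      # indel: present on some haplotypes only
    "e": ("CA",   []),
}

def haplotypes(node="s"):
    """Enumerate every sequence spelled by a source-to-sink path."""
    label, succ = graph[node]
    if not succ:
        return [label]
    return [label + h for s in succ for h in haplotypes(s)]

print(sorted(haplotypes()))
```

Reads from any of the four haplotypes map exactly to some path through the graph, whereas against a single linear reference three of the four would accumulate mismatches — the essence of reference bias.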

The sequence-structure-function gap. AlphaFold2 essentially solved the problem of predicting a protein’s tertiary structure from its amino acid sequence, achieving accuracy comparable to X-ray crystallography for many proteins. But structure is only the means; function is the end. The same fold can have different functions in different organisms (convergent evolution), and proteins with unrelated structures can perform the same biochemical reaction (analogous enzymes). Predicting enzyme specificity, allosteric regulation, protein-protein interaction interfaces, and post-translational modification sites from structure alone remains beyond current methods. The CAFA (Critical Assessment of protein Function Annotation) community challenges consistently show that computational methods miss a substantial fraction of true GO-term annotations, particularly for molecular function and cellular component. The missing ingredient may be dynamics: proteins are not static structures, and their conformational flexibility — captured only partially in a single AlphaFold prediction — is often essential for function.

De novo genome assembly and the repeat problem. Despite the complete telomere-to-telomere (T2T) human genome assembly by the T2T Consortium (Nurk et al., Science, 2022), many biologically important genomes remain intractable. Plant genomes are often large and polyploid (bread wheat is hexaploid, with a \(\sim 16\) Gb genome) and rich in transposable elements that create assembly ambiguities. Marine invertebrate genomes can be highly heterozygous (up to 5% divergence between haplotypes), making it difficult to distinguish allelic variation from repeat copies. The remaining open problems are: (1) phasing algorithms that can resolve long-range haplotype structure from long-read data, (2) assembly of highly repetitive centromeric and telomeric regions at scale, and (3) reference-quality assemblies for the \(\sim\)70% of eukaryotic species whose genomes have not yet been sequenced. The algorithmic challenges — designing string graph algorithms that handle heterozygosity and repeats gracefully — are closely related to fundamental questions in combinatorics on words and de Bruijn graph theory.

3D genome organisation and regulatory interpretation. Gene expression is regulated not just by the linear sequence of promoters and enhancers but by the three-dimensional architecture of chromosomes. Topologically associating domains (TADs) — regions where sequences preferentially contact each other rather than sequences outside — emerge from the loop extrusion mechanism (cohesin motor + CTCF boundary elements) and confine regulatory interactions. Hi-C experiments capture chromosome contact frequencies genome-wide; the data are a noisy, biased estimate of the contact probability matrix, which must be normalised (KR normalisation, ICE) before interpretation. Predicting TAD boundaries, loop anchors, and long-range enhancer-promoter contacts from sequence is an active deep learning problem. Orca (Zhou et al. 2022) and Akita (Fudenberg et al.) achieve reasonable Hi-C prediction, but the models are not interpretable — it is unclear which sequence features are responsible for which structural features. The open problem is connecting sequence variation (mutations, structural variants) to changes in 3D architecture to changes in gene expression; this chain of causality underlies many human disease mechanisms, including cancer driver rearrangements and developmental disorder mutations.
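The iterative correction idea behind ICE can be sketched in a few lines — a minimal version assuming a dense symmetric contact matrix; production tools (e.g. cooler, HiCExplorer) operate on sparse matrices and filter low-coverage bins more carefully:

```python
import numpy as np

def ice_normalise(contacts, n_iter=50, eps=1e-10):
    """Iterative correction (ICE) of a symmetric Hi-C contact matrix:
    repeatedly divide by the outer product of the (rescaled) row sums
    until every bin has the same total contact count, removing
    multiplicative per-bin biases such as mappability and GC content."""
    m = contacts.astype(float).copy()
    bias = np.ones(m.shape[0])
    for _ in range(n_iter):
        s = m.sum(axis=1)
        s = s / s[s > eps].mean()   # rescale so the typical row sum is 1
        s[s <= eps] = 1.0           # leave empty bins untouched
        m = m / np.outer(s, s)
        bias *= s
    return m, bias

# toy 3-bin symmetric contact matrix with one over-covered bin
c = np.array([[10., 4., 2.],
              [ 4., 8., 3.],
              [ 2., 3., 6.]])
c[0, :] *= 3
c[:, 0] *= 3    # bin 0 now has roughly 3x the coverage of the others
balanced, bias = ice_normalise(c)
print(np.round(balanced.sum(axis=1), 3))   # row sums are now ~equal
```

The accumulated `bias` vector is exactly the per-bin correction factor that downstream analyses divide out before calling TADs or loops.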

Multi-omics integration and causal inference. Modern single-cell technology can simultaneously measure multiple modalities in the same cell: RNA expression (scRNA-seq), chromatin accessibility (scATAC-seq), protein surface markers (CITE-seq), and DNA methylation. The paired data from these assays offer an unprecedented view of cellular state, but the statistical challenges are severe. Each modality has different noise models, different dimensionalities, and different sparsity patterns. Methods for multi-modal integration (Seurat, Muon, MultiVI) typically learn a joint embedding, but it is unclear what information is lost in dimensionality reduction and whether the embedding is interpretable. A deeper problem is causal inference: most multi-omics analyses are correlational — gene A’s expression is correlated with chromatin accessibility at enhancer B. Inferring causal direction (does B regulate A, or vice versa?) requires perturbation data (Perturb-seq: CRISPR knockouts combined with scRNA-seq) and causal structure learning algorithms. Scaling perturbation experiments to the full transcription factor repertoire (\(\sim\)1600 in human) and integrating causal graphs across cell types is a major open research direction at the intersection of machine learning (causal discovery, e.g. NOTEARS, DAG-GNN) and systems biology.

RNA design and the inverse folding problem. CS 482 covers RNA secondary structure prediction by dynamic programming — base-pair maximisation via the Nussinov algorithm, and minimum free energy folding via the Zuker algorithm with the Turner energy model, as implemented in ViennaRNA’s RNAfold. The inverse RNA folding problem asks: given a target secondary structure, design a sequence that folds into it with high probability. This is an optimisation problem over a combinatorial search space (\(4^n\) sequences of length \(n\)). The challenge is threefold: thermodynamic predictions are inaccurate, the designed sequence must also avoid off-target folds, and biological RNA must satisfy additional constraints (no premature termination, required splice sites). RNA design has become industrially important since the SARS-CoV-2 mRNA vaccines: the vaccine mRNA sequence was codon-optimised and structure-designed to maximise stability and translational efficiency. Design methods — from the classic adaptive-walk tool RNAinverse to reinforcement and deep learning approaches (LEARNA, RDesign) — are an active area connecting algorithms with structural biology and bioengineering.
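The base-pair maximisation DP fits in a few lines — a minimal illustration with Watson-Crick and GU wobble pairs and a minimum hairpin loop of 3; real MFE folders such as RNAfold use the full Turner energy model rather than pair counts:

```python
def nussinov(seq, min_loop=3):
    """Nussinov DP: maximum number of nested base pairs in an RNA sequence.

    N[i][j] holds the best count for seq[i..j]; min_loop enforces a
    minimum number of unpaired bases inside every hairpin loop.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    N = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # fill by increasing span
        for i in range(n - span):
            j = i + span
            best = N[i][j - 1]                   # case 1: j unpaired
            for k in range(i, j - min_loop):     # case 2: j pairs with k
                if (seq[k], seq[j]) in pairs:
                    left = N[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + N[k + 1][j - 1])
            N[i][j] = best
    return N[0][n - 1] if n else 0

print(nussinov("GGGAAAUCC"))  # stem of three pairs closing an AAA loop
```

A traceback over the same table recovers one optimal structure; the \(O(n^3)\) time and \(O(n^2)\) space of this recurrence carry over directly to the energy-based Zuker algorithm.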

Genome-scale metabolic modelling and flux balance analysis. Bioinformatics is not only about sequence analysis. A genome-scale metabolic model (GEM) is a stoichiometric model of all metabolic reactions encoded in an organism’s genome, assembled by annotating enzyme-coding genes and mapping them to biochemical reactions from databases (KEGG, BiGG). The model is a matrix \(S \in \mathbb{R}^{m \times n}\) where \(S_{ij}\) is the stoichiometric coefficient of metabolite \(i\) in reaction \(j\). At steady state, \(S \mathbf{v} = \mathbf{0}\) where \(\mathbf{v} \in \mathbb{R}^n\) is the vector of reaction fluxes. Flux balance analysis (FBA) optimises a linear objective (typically biomass production or ATP yield) subject to \(S\mathbf{v} = \mathbf{0}\) and flux bounds \(\mathbf{v}_\text{lb} \leq \mathbf{v} \leq \mathbf{v}_\text{ub}\) — a linear programme. Genome-scale metabolic models now encompass thousands of reactions and are used to predict drug targets, engineer industrial fermentation strains (yeast, E. coli), and understand metabolic adaptation in cancer cells. The algorithmic challenges connect directly to the linear programming, duality, and network flow material in CO 353 and CO 370.
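The FBA linear programme above can be solved directly with scipy.optimize.linprog; the four-reaction network below is a hypothetical toy for illustration, not a curated GEM:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy network:
#   v1: uptake -> A        v2: A -> B
#   v3: A -> C             v4: B + C -> biomass
# Rows = internal metabolites (A, B, C); columns = reactions v1..v4.
S = np.array([[ 1, -1, -1,  0],   # A
              [ 0,  1,  0, -1],   # B
              [ 0,  0,  1, -1]])  # C

c = [0, 0, 0, -1]                 # maximise v4, so minimise -v4
bounds = [(0, 10),                # uptake capacity v1 <= 10
          (0, None), (0, None), (0, None)]

# Steady state S v = 0 plus flux bounds: a standard LP.
res = linprog(c, A_eq=S, b_eq=np.zeros(3), bounds=bounds)
print("biomass flux:", res.x[3])
```

Since producing one unit of biomass consumes one B and one C, each of which costs one A, the optimum saturates the uptake bound and splits it evenly, giving a biomass flux of 5.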

Algorithmic challenges in long-read sequencing. Oxford Nanopore and PacBio SMRT sequencing produce long reads (1–200 kb) but with higher raw error rates than short-read platforms. The fundamental error model differs: Nanopore errors are dominated by homopolymer miscalls (runs of identical bases are systematically miscounted), while PacBio HiFi errors are nearly random at \(\sim\)0.1% after circular consensus sequencing. Aligning long reads to a reference or to each other requires algorithms that tolerate higher per-read error rates: Minimap2 uses minimiser-based seeding (a space-efficient k-mer sampling scheme) followed by co-linear chaining (a dynamic programme over anchor matches) and base-level alignment. Basecalling — converting the raw ionic current signal from a Nanopore sequencer to nucleotide sequence — is itself a deep learning problem. Early methods used hidden Markov models; modern basecallers (Guppy, Dorado) use deep neural networks — recurrent and, more recently, transformer architectures — trained on reference-aligned reads. The accuracy of Nanopore basecalling has improved dramatically (Q20+ chemistry), and further algorithmic improvements — particularly for modified bases (5-methylcytosine detection) — are an active research area. Modified-base detection from Nanopore signal is particularly valuable: it sequences the epigenome directly, without bisulfite conversion, opening new avenues for cancer diagnostics and developmental biology that complement the sequence alignment and HMM material covered in CS 482.
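The minimiser sampling step can be sketched directly — a simplified version using lexicographic order to pick the smallest k-mer per window, where Minimap2 instead hashes k-mers to avoid biased selection of low-complexity sequence:

```python
def minimisers(seq, k=5, w=4):
    """Window minimisers: for each window of w consecutive k-mers, keep
    the position and sequence of the lexicographically smallest one.
    Ties go to the leftmost occurrence."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        best = min(range(w), key=lambda j: window[j])
        picked.add((start + best, window[best]))
    return sorted(picked)

seq = "ACGTACGTGGACGTTACG"
print(minimisers(seq))
```

The key guarantee is that every window of w consecutive k-mers contributes at least one selected position, so two reads sharing a long enough exact substring are guaranteed to share a minimiser — which is what makes the sampled index both small and sensitive as a seed set for chaining.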

Computational immunogenomics and antigen prediction. The adaptive immune system uses recombination (V(D)J recombination) to generate an enormous diversity of T-cell receptor (TCR) and B-cell receptor (BCR) sequences from a limited set of gene segments — estimated at \(\sim 10^{15}\) possible TCR sequences, far exceeding the number of T cells in a human body (\(\sim 10^{11}\)). High-throughput immune repertoire sequencing (AIRR-seq) quantifies TCR and BCR clonotype frequencies in a sample. Algorithmic challenges include: clonotype clustering (grouping reads that derive from the same V(D)J rearrangement despite sequencing errors and somatic hypermutation), inference of antigen specificity from TCR sequence (TCRdist, TCRGP, NetTCR), and modelling clonal expansion dynamics. Predicting which peptide-MHC (pMHC) complexes are recognised by a given TCR — the TCR specificity prediction problem — is directly relevant to cancer neoantigen vaccine design and autoimmune disease understanding. The sequence alignment and HMM foundations of CS 482 underpin both the IMGT database annotation of immune receptor genes and the machine learning models for specificity prediction, connecting the course directly to one of the most active areas of computational immunology.

Benchmarking competitions such as CASP (protein structure), CAMI (metagenomics assembly), and the peptide-MHC binding prediction challenges on IEDB serve as annual community calibration events, measuring progress and exposing gaps — a useful model for understanding where algorithms in bioinformatics stand relative to the underlying biological ground truth.

The immunogenomics problem also has a strong statistical component: public TCR databases (VDJdb, McPAS-TCR) are heavily biased toward well-studied viral epitopes (CMV, EBV, influenza), meaning that models trained on them may not generalise to cancer neoantigens or autoimmune epitopes. Developing methods that explicitly model this dataset shift, and designing wet-lab experimental pipelines that generate more diverse training data, are joint algorithmic and biological challenges where methods from CS 482 — particularly probabilistic sequence models and profile HMMs — remain central tools.
