BIOL 365: Methods in Bioinformatics
Brendan McConkey
Estimated study time: 23 minutes
Sources and References
Online resources — NCBI (ncbi.nlm.nih.gov): GenBank, RefSeq, PubMed; UniProt (uniprot.org); Ensembl Genome Browser (ensembl.org); KEGG Pathway Database (genome.jp/kegg); STRING protein interaction database (string-db.org); UCSC Genome Browser; Galaxy Project (usegalaxy.org); Bioconductor (bioconductor.org)
Chapter 1: Introduction to Bioinformatics
Section 1.1: The Scope and Origins of Bioinformatics
Bioinformatics emerged from the convergence of molecular biology, computer science, and statistics in the late twentieth century. Its central challenge is the management, analysis, and interpretation of the vast quantities of biological data generated by modern experimental methods — particularly the sequence data produced by high-throughput nucleic acid and protein sequencing technologies. The sequencing of the human genome, completed in draft form in 2001, was a watershed moment: it demonstrated both the power of computational approaches to biological problems and the scale of the analytical challenge ahead. A human genome contains approximately 3.2 billion base pairs organized into roughly 20,000–25,000 protein-coding genes, embedded in a vast landscape of regulatory sequences, repetitive elements, and non-coding RNAs whose functions are still being characterized.
The term bioinformatics encompasses database development and curation, algorithm design and software engineering, statistical modeling of biological data, and the application of all of these to specific biological problems. It is fundamentally an integrative discipline — one that requires literacy in molecular biology (to understand what the data mean), mathematics and statistics (to extract signals from noisy data), and computer programming (to implement analyses at scale). The R statistical computing environment has become a dominant platform for bioinformatics data analysis, particularly for transcriptomics and genomics, because of its rich ecosystem of specialized packages (many available through the Bioconductor repository) and its powerful statistical modeling capabilities.
Section 1.2: Reproducible Workflows in Bioinformatics
A recurring challenge in computational biology is reproducibility — the ability of another researcher (or the original researcher at a later time) to obtain the same results from the same data using the same methods. Given the complexity of bioinformatics workflows (which may involve dozens of software tools, each with multiple version-dependent behaviors and adjustable parameters), achieving reproducibility requires explicit documentation of every computational step.
Best practices for reproducible bioinformatics include: use of version control (such as Git) for code and analysis scripts; documentation of all software versions, parameters, and computational environments (ideally through containerization with Docker or Singularity); use of workflow management systems (such as Snakemake or Nextflow) that explicitly define the dependencies among analysis steps; and deposition of both raw data and processed results in public repositories with persistent identifiers.
Chapter 2: Biological Sequence Databases
Section 2.1: Primary and Secondary Databases
The central infrastructure of bioinformatics is a network of biological databases that serve as repositories of experimentally determined and computationally predicted biological information. These databases are essential resources: without a means of storing and accessing the sequences and annotations of millions of genes and proteins, the comparative analyses that constitute much of modern genomics would be impossible.
Primary sequence databases store raw sequence data deposited directly by experimenters. The three major international nucleotide sequence databases — GenBank (NCBI, USA), EMBL-Bank (EBI, UK), and DDBJ (Japan) — participate in the International Nucleotide Sequence Database Collaboration (INSDC) and exchange data daily, maintaining effectively identical content. Each sequence entry in GenBank contains the nucleotide sequence itself, information about the organism of origin, the sequencing method, and annotations — computational or experimentally verified features of the sequence (gene locations, coding sequences, regulatory elements, repeat elements).
UniProt is the primary database for protein sequences and functional annotations. It is organized into two divisions: Swiss-Prot, a manually curated, high-quality database in which each entry has been reviewed and annotated by a biocurator with reference to the experimental literature; and TrEMBL, a computationally annotated database that is far larger but has lower annotation confidence. The Swiss-Prot database contains approximately 570,000 entries (as of 2024); TrEMBL contains over 250 million.
Secondary databases are derived from primary databases through additional computational analysis. Pfam is a database of protein domain families, each defined by a hidden Markov model (HMM) trained on a curated alignment of representative family members; querying Pfam reveals the domain architecture of a query protein. InterPro integrates information from multiple protein signature databases. KEGG (Kyoto Encyclopedia of Genes and Genomes) links sequence data to metabolic pathways and biochemical reactions. The UCSC Genome Browser and Ensembl provide graphical interfaces to annotated genome assemblies with tracks of genes, conservation, regulatory features, and variant data.
Chapter 3: Sequence Alignment — Algorithms and Scoring Statistics
Section 3.1: Pairwise Sequence Alignment
Sequence alignment is the foundational operation of bioinformatics. The principle is straightforward: two sequences that share a common evolutionary ancestor will have retained some similarity in their nucleotide or amino acid sequences; identifying and quantifying this similarity allows us to infer evolutionary relationships and functional equivalence. However, sequences diverge not only by base substitutions but also by insertions and deletions (indels), requiring the introduction of gaps into alignments.
A global alignment aligns two sequences over their entire length, which is appropriate when the two sequences are known to be homologous throughout. The Needleman-Wunsch algorithm solves the global alignment problem exactly using dynamic programming. A local alignment identifies the regions of greatest similarity between two sequences, which is more appropriate when comparing sequences that may share only a single conserved domain amid otherwise divergent sequence. The Smith-Waterman algorithm is the dynamic programming solution for optimal local alignment.
Both algorithms fill a scoring matrix in which each cell represents the optimal alignment score for the corresponding pair of prefixes of the two sequences. The score of an alignment is the sum of match scores (positive), mismatch penalties (negative), and gap penalties (negative). The choice of substitution matrix and gap penalties profoundly influences alignment quality. For amino acid alignments, the BLOSUM matrices (BLOcks SUbstitution Matrix) are most widely used; they are derived empirically from alignments of conserved sequence blocks in related proteins. The BLOSUM62 matrix, built from blocks in which sequences sharing more than 62% identity were clustered together before counting substitutions, is the default in most protein alignment tools.
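The dynamic-programming recurrence can be sketched in a few lines of Python. This is an illustrative scheme with fixed match/mismatch/gap scores rather than a BLOSUM matrix, and it returns only the optimal score (a full implementation would also store traceback pointers to recover the alignment):

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # F[i][j] = best score for aligning the prefix a[:i] with b[:j]
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap          # prefix of a aligned entirely to gaps
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            F[i][j] = max(diag, F[i-1][j] + gap, F[i][j-1] + gap)
    return F[-1][-1]
```

Converting this to Smith-Waterman local alignment requires only two changes: floor every cell at zero, and report the maximum cell anywhere in the matrix rather than the bottom-right corner.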
Section 3.2: Statistical Significance of Alignments
Finding that two sequences produce an alignment score of, say, 150 bits is meaningless without knowing what score would be expected by chance for sequences of the same length and composition. The E-value (Expectation value) answers this question: it represents the number of alignments with score ≥ S that would be expected by chance when searching a database of a given size. An E-value of 0.01 means that, in a search of this database, only 0.01 alignments scoring this high are expected by chance; equivalently, roughly one such chance hit would appear in every 100 equivalent searches.
The statistical theory underlying E-values was developed by Karlin and Altschul and assumes that alignment scores of unrelated sequences follow an extreme value distribution. The key parameters are \( \lambda \) (related to the expected score per position) and \( K \) (related to the effective search space). The E-value for a score \( S \) in a database search is:
\[ E = K \cdot m \cdot n \cdot e^{-\lambda S} \]
where \( m \) and \( n \) are the lengths of the query and database. Importantly, E-values scale with database size: searching a database twice as large doubles the E-value for any given score. For this reason, the significance threshold (typically E < 0.001 or E < 10⁻⁵) must be interpreted in the context of database size.
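The formula can be evaluated directly. The λ and K defaults below are placeholder values for illustration only, not the calibrated parameters of any particular scoring matrix:

```python
import math

def e_value(S, m, n, K=0.13, lambda_=0.32):
    """Expected number of chance alignments with score >= S
    for a query of length m searched against a database of total
    length n.  K and lambda_ are illustrative placeholders."""
    return K * m * n * math.exp(-lambda_ * S)
```

The scaling behavior described above falls straight out of the formula: doubling n doubles E, while increasing S decreases E exponentially.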
Chapter 4: BLAST and Multiple Sequence Alignments
Section 4.1: BLAST — Basic Local Alignment Search Tool
The Smith-Waterman algorithm, though optimal, is too slow for database searches when millions of query-database sequence pairs must be evaluated. BLAST (Basic Local Alignment Search Tool), introduced by Altschul and colleagues in 1990, uses a heuristic approach to achieve near-optimal sensitivity at dramatically greater speed. The core strategy is to first identify short “word” matches between query and database sequences, retain only those whose word score exceeds a threshold, and then extend these seeds into full local alignments. BLAST exists in several flavors: BLASTN (nucleotide vs. nucleotide), BLASTP (protein vs. protein), BLASTX (translated nucleotide vs. protein), TBLASTN (protein vs. translated nucleotide), and TBLASTX (translated nucleotide vs. translated nucleotide). Each flavor suits a different biological question.
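The seeding step can be illustrated with a toy sketch that indexes the exact length-w words of a query and reports positions shared with a subject sequence. Real BLAST also admits near-matching “neighborhood” words scored against a threshold before extension; this sketch uses exact words only:

```python
def seed_matches(query, subject, w=3):
    """Return (query_offset, subject_offset) pairs where the two
    sequences share an exact word of length w."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i+w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j+w], []):
            hits.append((i, j))   # seed pair to be extended
    return hits
```

Each returned pair is a candidate seed; BLAST would attempt to extend the alignment outward from it in both directions, stopping when the running score drops too far below its maximum.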
Section 4.2: Multiple Sequence Alignments
A multiple sequence alignment (MSA) aligns three or more sequences simultaneously. Compared to pairwise alignment, MSA provides far richer evolutionary information: positions that are conserved across many taxa are likely functionally important; regions of high variability may be under relaxed selective constraint; and insertions and deletions in particular lineages are readily identified. MSA is the basis for phylogenetic analysis, identification of conserved protein domains, and the derivation of sequence logos and HMM profiles for database searching.
The most widely used MSA algorithms are ClustalW/ClustalOmega (progressive alignment: sequences are aligned pairwise, then combined in order of decreasing similarity according to a guide tree) and MUSCLE and MAFFT (which use profile-based refinement to improve alignments iteratively). For protein alignments where very distant homologs are being compared, structure-based alignment can be more accurate than sequence-based methods, since protein three-dimensional structure is more conserved than sequence.
A sequence logo is a graphical representation of the information content at each position of an MSA: the height of each column represents the conservation (information content) at that position in bits, and the height of each letter within the column represents its relative frequency. Highly conserved positions appear as tall single-letter columns; variable positions appear as stacks of letters of roughly equal height.
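For DNA, the column height of a logo is 2 bits minus the Shannon entropy of the column. This sketch ignores gap characters and omits the small-sample correction that full logo software applies:

```python
import math
from collections import Counter

def column_information(column):
    """Information content (bits) of one MSA column of DNA,
    out of a 2-bit maximum; gaps are excluded from the counts."""
    counts = Counter(c for c in column if c != "-")
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in counts.values())
    return 2.0 - entropy
```

A perfectly conserved column scores the full 2 bits; a column with all four bases at equal frequency scores 0 bits, matching the tall-single-letter versus flat-stack appearance described above.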
Chapter 5: Molecular Phylogeny and Phylogenetic Analysis
Section 5.1: Principles of Molecular Phylogenetics
Evolutionary relationships among organisms — the phylogeny — can be inferred from molecular sequence data by comparing DNA, RNA, or protein sequences. The fundamental assumption is that sequences sharing a common ancestor will be more similar to each other than sequences with more distant ancestry, and that — given a model of molecular evolution — the pattern of sequence similarities and differences can be used to reconstruct the branching order and relative timing of evolutionary divergences.
A phylogenetic tree is a graph representing the inferred evolutionary relationships among a set of sequences or organisms. The terminal nodes (tips or leaves) represent the observed sequences (operational taxonomic units, OTUs); the internal nodes represent hypothetical common ancestors; and the branches connecting them represent evolutionary lineages. Trees may be rooted (with a single ancestral node from which all others descend, usually identified by including an outgroup — a sequence known to be more distantly related than any of the ingroup sequences) or unrooted (showing the topology of relationships without specifying direction of evolution).
Section 5.2: Methods of Phylogenetic Reconstruction
Several methods are available for inferring phylogenetic trees, each based on different assumptions about the evolutionary process.
Distance-based methods (UPGMA, Neighbor-Joining) reduce sequence data to a pairwise distance matrix (typically the proportion of sites that differ, corrected for multiple substitutions) and then cluster sequences by distance. Neighbor-Joining (NJ) is by far the most widely used distance method: it sequentially identifies and collapses the pair of taxa that minimizes the total branch length of the tree. NJ trees are unrooted, computationally fast, and generally consistent (converging on the true tree as data increase).
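The distance matrix these methods start from can be sketched directly: the raw p-distance (proportion of differing sites) and the Jukes-Cantor (JC69) correction for multiple substitutions at the same site:

```python
import math

def p_distance(seq1, seq2):
    """Proportion of aligned sites that differ (sequences assumed
    pre-aligned and of equal length)."""
    diffs = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return diffs / len(seq1)

def jukes_cantor(p):
    """JC69-corrected evolutionary distance; defined only for p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

Because some sites have changed more than once, the corrected distance always exceeds the raw p-distance, and the gap widens as sequences diverge.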
Maximum parsimony identifies the tree that requires the smallest total number of evolutionary changes to explain the observed sequence data. It is conceptually simple but computationally intractable for large datasets (there are \( (2n-5)!! \) possible unrooted trees for \( n \) taxa) and relies on the assumption that the true tree minimizes character changes — an assumption that fails when substitution rates are high or vary among lineages.
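The double-factorial count of topologies makes this intractability concrete and is simple to compute:

```python
def unrooted_tree_count(n):
    """Number of distinct unrooted binary tree topologies for n taxa,
    i.e. (2n-5)!! = 3 * 5 * ... * (2n-5), for n >= 3."""
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count
```

Already at 10 taxa there are 2,027,025 unrooted topologies, so exhaustive evaluation quickly gives way to heuristic tree searches.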
Maximum likelihood (ML) identifies the tree (and associated branch lengths and model parameters) that maximizes the probability of the observed data given the tree and a specified model of sequence evolution. ML is statistically consistent and can incorporate complex substitution models (accounting for base frequency heterogeneity, rate variation among sites [modeled with a gamma distribution], and invariant sites). Common ML software includes RAxML and IQ-TREE.
Bayesian inference uses Bayes’ theorem to compute the posterior probability distribution of trees given the data and prior distributions on tree parameters. The Markov Chain Monte Carlo (MCMC) algorithm (implemented in MrBayes and BEAST) samples trees from this distribution; the result is a set of credible trees with associated posterior probabilities, providing a natural measure of statistical support. Bayesian methods are particularly powerful for molecular clock analyses — using the timing of known evolutionary events (fossil calibrations) to estimate the absolute ages of divergences.
Chapter 6: Gene Finding and Genome Annotation
Section 6.1: Annotating a Newly Sequenced Genome
Sequencing the DNA of a genome is only the first step; the sequence must then be annotated — the locations and functions of genes and other functional elements must be identified. This process is called genome annotation and involves both computational prediction and experimental verification.
Gene finding in prokaryotic genomes is relatively straightforward: genes are densely packed (approximately 85% of the genome is coding in bacteria), introns are absent, and genes can be identified as long open reading frames (ORFs) — stretches of sequence between a start codon (ATG) and a stop codon (TAA, TAG, or TGA) in the same reading frame. Tools such as Prodigal and Glimmer identify prokaryotic ORFs using training sets derived from verified genes in the same or related organisms.
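A minimal forward-strand ORF scan illustrates the idea. Real tools such as Prodigal also scan the reverse complement, allow alternative start codons, and score codon usage; the min_codons cutoff here is purely illustrative:

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) coordinates of ATG-to-stop stretches in the
    three forward reading frames (end includes the stop codon)."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i+3]
            if codon == "ATG" and start is None:
                start = i                      # first in-frame start
            elif codon in STOPS and start is not None:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # reset for the next ORF
    return orfs
```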
Eukaryotic gene finding is dramatically more challenging because only approximately 2% of the human genome is protein-coding, genes are interrupted by multiple introns (some of which are hundreds of kilobases long), alternative splicing can produce multiple proteins from a single gene locus, and regulatory signals are more complex. Gene finding in eukaryotes typically combines: ab initio prediction (using machine learning models trained on known genes to identify splice sites, promoters, and coding regions); homology-based prediction (aligning known protein sequences from other species to find conserved coding regions); and transcript-supported prediction (aligning RNA-seq data to the genome to directly identify expressed regions). The MAKER pipeline integrates these approaches.
Chapter 7: High-Throughput Sequencing and Assembly
Section 7.1: Next-Generation Sequencing Technologies
The revolution in biological data generation began with the introduction of massively parallel — or next-generation sequencing (NGS) — technologies in the mid-2000s. The dominant platform, Illumina sequencing, uses sequencing-by-synthesis with fluorescently labeled, reversible terminator nucleotides. Library DNA is first fragmented and adapters are ligated to both ends; fragments are then amplified into spatially distinct clonal clusters on a flow cell surface. During each synthesis cycle, the incorporation of a labeled nucleotide is recorded as a fluorescence image; base calls are made for all clusters simultaneously. Illumina instruments produce hundreds of millions to billions of reads of typically 150 bp (paired-end) per run, with raw base-call error rates of approximately 0.1–1%. The cost of sequencing has decreased approximately 100,000-fold since 2001, far outpacing Moore’s Law.
Long-read sequencing platforms (Pacific Biosciences SMRT sequencing and Oxford Nanopore Technologies) produce reads of 10–100 kb or more, enabling resolution of repetitive sequences, structural variants, and complex genomic regions that are inaccessible to short reads. Long reads carry higher per-base error rates but these can be corrected by computational polishing with high-coverage short reads or by consensus calling from multiple long reads.
Section 7.2: Genome Assembly
Genome assembly is the computational process of reconstructing the complete genome sequence from millions to billions of short sequencing reads. The fundamental challenge is the repetitive nature of eukaryotic genomes: if the same repeat sequence is present at many locations in the genome, short reads cannot be unambiguously placed to specific locations.
Two main approaches exist. Overlap-layout-consensus (OLC) assemblers identify all pairwise overlaps among reads, construct a graph (the overlap graph) that represents these overlaps, find a path through the graph that explains the reads (layout), and derive a consensus sequence. OLC assemblers are well-suited to long reads. De Bruijn graph assemblers (Velvet, SPAdes, Trinity for RNA-seq) decompose reads into short overlapping subsequences of fixed length k (k-mers), build a directed graph in which nodes represent (k-1)-mers and each k-mer contributes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and traverse this graph to produce contigs. De Bruijn assemblers are the dominant approach for short reads.
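The graph construction itself is only a few lines; assemblers differ mainly in how they clean and traverse the resulting graph:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: each k-mer adds a directed edge from
    its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i+k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)
```

Contigs then correspond to unbranched paths through this graph; repeats longer than k-1 appear as nodes with multiple incoming or outgoing edges, which is exactly where short-read assemblies fragment.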
Assembly quality is assessed by metrics including N50 (the contig length such that 50% of the total assembly is contained in contigs of that length or longer — larger is better), the total assembly size relative to the estimated genome size, and the completeness of single-copy orthologs (assessed with the BUSCO tool).
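N50 follows directly from its definition, walking the contigs longest-first until half the assembly length is covered:

```python
def n50(contig_lengths):
    """Length of the contig at which the running (longest-first) sum
    first reaches half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length
```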
Chapter 8: Gene Expression and Differential Expression Analysis
Section 8.1: RNA-seq Data Analysis Workflow
RNA sequencing (RNA-seq) is the application of NGS technology to RNA — specifically to complementary DNA (cDNA) reverse-transcribed from mRNA or total RNA — to quantify gene expression. It has largely replaced microarrays for transcriptomics because it offers greater dynamic range, does not require prior knowledge of gene sequences, and can simultaneously detect alternative splicing, novel transcripts, and RNA editing events.
The RNA-seq analysis workflow consists of several steps: quality control and adapter trimming of raw reads (FastQC, Trimmomatic); alignment of reads to the reference genome or transcriptome (STAR, HISAT2 for genome alignment; Salmon, kallisto for pseudoalignment); read counting — counting the number of reads overlapping each annotated gene (featureCounts, HTSeq-count); normalization to account for differences in library size and gene length; and statistical testing for differential expression.
Normalization is critical because raw read counts are influenced both by biological differences in expression and by technical factors (total sequencing depth, RNA quality). The most commonly reported normalization metrics are RPKM (Reads Per Kilobase per Million mapped reads) and TPM (Transcripts Per Million). For differential expression analysis, normalization is performed within the statistical model itself by tools such as DESeq2 and edgeR, which use negative binomial statistical models appropriate for count data.
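TPM can be computed from raw counts and gene lengths; this sketch assumes lengths are supplied in kilobases:

```python
def tpm(counts, lengths_kb):
    """Transcripts Per Million: divide counts by gene length (per-kb
    rates), then rescale so the sample sums to one million."""
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]
```

Because every sample sums to exactly one million, TPM values are directly comparable across samples in a way RPKM values are not; this is why TPM is now generally preferred for reporting.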
Section 8.2: Differential Expression Analysis
Differential expression (DE) analysis identifies genes whose expression levels differ significantly between two or more experimental conditions (treated vs. untreated, disease vs. healthy). The core statistical challenge is distinguishing true biological differences from the noise inherent in count data.
DESeq2 (developed by Love, Huber, and Anders) models read counts with a negative binomial distribution and uses shrinkage estimation of dispersion parameters (which account for gene-specific variability) and fold changes to stabilize estimates for low-expression genes. The result for each gene is a log₂ fold change (indicating the direction and magnitude of expression change) and an adjusted p-value (Benjamini-Hochberg False Discovery Rate correction for multiple testing). A gene is typically considered differentially expressed if the adjusted p-value falls below 0.05 (i.e., the FDR is controlled at 5%) and the absolute log₂ fold change exceeds a threshold (often 1, corresponding to a twofold change).
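The Benjamini-Hochberg adjustment itself is short enough to sketch. This is the standard step-up procedure applied to a list of raw p-values, not DESeq2's full negative binomial model:

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values: scale each p by m/rank,
    then enforce monotonicity from the largest rank downward."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    prev = 1.0
    for rank in range(m, 0, -1):        # walk from largest p to smallest
        i = order[rank - 1]
        prev = min(prev, pvalues[i] * m / rank)
        adjusted[i] = prev
    return adjusted
```

Genes whose adjusted p-value falls below 0.05 form a list in which, on average, no more than 5% of entries are expected to be false positives.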
DE analysis alone identifies which genes change; pathway analysis and gene set enrichment analysis (GSEA) then determine whether the differentially expressed genes are enriched for particular biological functions, metabolic pathways, or molecular processes — providing mechanistic interpretation. GSEA tests whether a predefined set of genes (representing a pathway or biological process) shows coordinated up- or down-regulation across an experiment, using a running-sum statistic (normalized enrichment score, NES) compared against a permutation-derived null distribution.
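The running-sum idea can be illustrated with an unweighted toy version. The published GSEA statistic weights hits by their correlation with the phenotype; this sketch treats all hits equally and assumes the ranked list contains both members and non-members of the set:

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score: step up at gene-set
    members, down otherwise; return the maximum deviation from zero."""
    hits = sum(1 for g in ranked_genes if g in gene_set)
    misses = len(ranked_genes) - hits
    up, down = 1.0 / hits, 1.0 / misses
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += up if gene in gene_set else -down
        if abs(running) > abs(best):
            best = running
    return best
```

A set concentrated at the top of the ranking yields a large positive score, one concentrated at the bottom a large negative score; GSEA then compares the observed score to a permutation-derived null distribution.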
Chapter 9: Clustering and Descriptive Statistics in Bioinformatics
Section 9.1: Dimensionality Reduction
High-dimensional biological data — such as gene expression matrices containing expression values for tens of thousands of genes across dozens to hundreds of samples — are challenging to visualize and interpret. Dimensionality reduction methods transform the data into a lower-dimensional space that captures the most important sources of variation.
Principal Component Analysis (PCA) is a linear dimensionality reduction method that identifies orthogonal axes (principal components, PCs) in the data space along which variance is maximized. The first PC captures the greatest variance, the second PC (orthogonal to the first) captures the next greatest, and so on. Projecting samples onto the first two or three PCs provides an overview of overall gene expression variation and reveals sample clustering, outliers, and batch effects. PCA is sensitive to scale and is typically applied to log-transformed, normalized expression data.
t-SNE (t-distributed Stochastic Neighbor Embedding) is a nonlinear dimensionality reduction method designed to preserve the local structure of high-dimensional data — that is, samples that are similar in high-dimensional space appear close together in the 2D t-SNE plot. It is widely used for visualizing single-cell RNA-seq data, where individual cells form clusters corresponding to cell types or states. UMAP (Uniform Manifold Approximation and Projection) is a newer alternative that is computationally faster and better at preserving global data structure.
Section 9.2: Clustering Methods
Clustering partitions a set of data points (typically samples or genes) into groups based on similarity. Hierarchical clustering iteratively merges the most similar pairs into clusters, producing a dendrogram — a tree-like diagram showing the nested structure of similarities. The choice of linkage method (single, complete, average, or Ward’s) affects cluster shape. When applied to gene expression data, hierarchical clustering of both genes and samples simultaneously produces a heatmap with dendrogram annotations — perhaps the most widely used visualization in transcriptomics.
k-means clustering partitions data into k pre-specified clusters by iteratively assigning each point to its nearest centroid and updating the centroids. It requires specifying k in advance, which can be guided by the elbow method (plotting within-cluster sum of squares against k and identifying the point of diminishing return).
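A bare-bones sketch of Lloyd's algorithm, the alternating assign/update loop at the heart of k-means. Initial centroids are supplied by the caller here rather than chosen by a seeding heuristic such as k-means++:

```python
def kmeans(points, centroids, iters=20):
    """Lloyd's algorithm on points given as coordinate tuples.
    Returns the final centroids and the final cluster assignments."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:   # assign each point to its nearest centroid
            nearest = min(range(len(centroids)),
                          key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [      # move each centroid to its cluster mean
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters
```

In practice the loop is run to convergence from several random initializations, keeping the solution with the lowest within-cluster sum of squares, which is the quantity plotted in the elbow method.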
Chapter 10: Metagenomics
Section 10.1: Principles and Applications of Metagenomics
Metagenomics is the direct sequencing of genomic DNA from all organisms present in an environmental sample (soil, water, gut contents, skin) without prior culture, enabling the characterization of microbial community composition and functional potential. Classical microbiology was limited to organisms that could be cultured in the laboratory — only approximately 1% of environmental bacteria by some estimates. Metagenomics circumvents this limitation, known as the “great plate count anomaly,” revealing the vast “uncultured majority” of the microbial world.
Two broad approaches are used. Amplicon sequencing (often called 16S rRNA gene sequencing for bacteria) amplifies and sequences a phylogenetically informative marker gene present in all members of a target group. The 16S ribosomal RNA gene contains both highly conserved regions (used for universal primer design) and variable regions (V1–V9) that vary among taxa and are used to identify organisms to genus or species level. Sequences are clustered into Operational Taxonomic Units (OTUs) or assigned to exact sequence variants (ASVs) and compared against reference databases (SILVA, Greengenes) for taxonomic assignment.
Shotgun metagenomics sequences all genomic DNA in a sample indiscriminately, providing not only taxonomic information but also functional gene content — the metabolic potential of the community. Assembly of metagenomic reads into contigs, followed by binning (clustering contigs that likely derive from the same organism based on composition and co-coverage) and annotation, can yield draft genomes of individual organisms — metagenome-assembled genomes (MAGs) — directly from environmental samples.
Diversity metrics are central to metagenomic analyses. Alpha diversity measures within-sample diversity using metrics such as species richness (number of unique taxa), Shannon entropy (which weights species by their relative abundances), and Faith’s phylogenetic diversity (which accounts for evolutionary distinctiveness). Beta diversity measures between-sample diversity — how different two community compositions are from each other — using distances (Bray-Curtis, UniFrac) that can be visualized by ordination (PCoA, NMDS) to reveal patterns of community variation associated with environmental gradients or experimental treatments.
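Both kinds of metric are straightforward to compute from taxon count vectors. A sketch of Shannon entropy (natural log, as commonly reported) for alpha diversity and Bray-Curtis dissimilarity for beta diversity:

```python
import math

def shannon(counts):
    """Shannon entropy (natural log) of a vector of taxon counts."""
    total = sum(counts)
    return -sum((n / total) * math.log(n / total)
                for n in counts if n > 0)

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two samples with taxa in the
    same order: 0 for identical compositions, 1 for no shared taxa."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    return 1.0 - 2.0 * shared / (sum(x) + sum(y))
```

Pairwise Bray-Curtis values over all samples form the distance matrix that PCoA or NMDS ordinations are computed from.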