CHEM 481: Rational Design of Potential Drug Candidates

Dr. Subha Kalyaanamoorthy

Estimated study time: 29 minutes

Table of contents

Sources and References

Supplementary texts — Leach, A.R. Molecular Modelling: Principles and Applications, 2nd ed. Prentice Hall, 2001. | Erlanson, D.A. & Jahnke, W. (eds.) Fragment-Based Drug Discovery: Lessons and Outlook. Wiley-VCH, 2016. | Silverman, R.B. & Holladay, M.W. The Organic Chemistry of Drug Design and Drug Action, 3rd ed. Academic Press, 2014. Online resources — Protein Data Bank (rcsb.org); PubChem (pubchem.ncbi.nlm.nih.gov); ChEMBL (ebi.ac.uk/chembl); ZINC database (zinc.docking.org); AutoDock Vina documentation (vina.scripps.edu); VMD documentation (ks.uiuc.edu/Research/vmd)


Chapter 1: The Drug Discovery Process

Section 1.1: From Disease to Drug — An Overview

The development of a new drug is one of the most complex, costly, and time-consuming endeavors in modern science. From the identification of a molecular target linked to a disease to the arrival of an approved therapeutic on pharmacy shelves, the process typically spans 10–15 years and costs, on average, over one billion US dollars. The high attrition rate (fewer than 1 in 10,000 compounds screened ever reaches patients) makes efficiency in the early stages of the pipeline critically important, which is precisely where computational drug discovery — the subject of this course — plays its most powerful role.

The drug discovery and development pipeline can be conceptualized as a cascade of progressively more selective filters. The earliest stage is target identification and validation: scientists seek to establish that a particular protein, nucleic acid, or other biomolecule plays a causal role in the disease of interest, and that modulating its activity will produce therapeutic benefit without unacceptable toxicity. Target validation draws on evidence from genetics (human genetic variants that affect disease risk), biochemistry (enzyme kinetics, structural biology), and animal models (knockouts, transgenics).

Once a target is validated, hit identification begins: the search for small organic molecules that interact with the target. Traditional approaches include high-throughput screening (HTS), in which robotic systems test libraries of hundreds of thousands of compounds in biochemical or cell-based assays. More modern, computationally efficient approaches — virtual screening and fragment-based drug discovery — dramatically reduce the experimental burden by computationally filtering a much larger virtual library before synthesis and testing.

Hit-to-lead optimization refines the initial hits into lead compounds with improved potency (binding affinity), selectivity (activity at the intended target but not at off-targets that could cause side effects), metabolic stability (resistance to degradation by cytochrome P450 enzymes), and aqueous solubility (for oral bioavailability). This iterative process of design–synthesis–testing is guided by structure–activity relationships (SAR) derived from the properties of many structural analogues.

Lead optimization continues to improve the ADMET profile (absorption, distribution, metabolism, excretion, toxicity) of the lead compound. The Lipinski Rule of Five predicts that a compound is likely to have good oral bioavailability if: molecular weight ≤ 500 Da, calculated log P (lipophilicity) ≤ 5, number of hydrogen bond donors ≤ 5, and number of hydrogen bond acceptors ≤ 10. Compounds violating two or more of these rules are unlikely to be orally bioavailable (though natural products and biologics are exceptions).

After lead optimization, a candidate drug is identified and enters preclinical development: toxicology studies in cell-based and animal models, pharmacokinetic profiling, and formulation work. If these studies support continuation, an Investigational New Drug (IND) application is filed with the regulatory authority, and clinical trials begin in three phases — Phase I (safety, dose escalation in healthy volunteers), Phase II (efficacy and dose ranging in patients), and Phase III (large-scale efficacy and safety in diverse patient populations). Finally, a New Drug Application (NDA) or Biologics License Application (BLA) is submitted and reviewed by the regulatory authority (FDA in the US, EMA in Europe).


Chapter 2: Computational Drug Discovery — Concepts and Methods

Section 2.1: Structure-Based Drug Design

Structure-based drug design (SBDD) uses the three-dimensional structure of the biological target — almost always determined by X-ray crystallography, cryo-electron microscopy (cryo-EM), or nuclear magnetic resonance (NMR) — as the foundation for designing molecules that fit precisely into the target’s binding site. The critical information is the shape and electrostatic character of the binding pocket: its dimensions (which constrain molecular size), the hydrogen-bond donors and acceptors along its walls (which specify complementary features in the ligand), and the hydrophobic patches (where non-polar ligand groups can make favorable van der Waals contacts with hydrophobic protein residues).

The Protein Data Bank (PDB) is the central repository for experimentally determined three-dimensional protein structures, containing over 200,000 structures as of 2025. Before using a PDB structure for drug design, the raw crystallographic coordinates must be processed: water molecules are typically retained (those in the binding site may mediate key interactions with ligands), any co-crystallized ligand or inhibitor is examined to understand the binding mode, hydrogen atoms are added (not determined by X-ray crystallography, which sees only heavy atoms), protonation states of ionizable residues are assigned (using tools like PropKa or H++), and any missing loops or disordered regions are modeled.

The binding site is identified either from the co-crystallized ligand position, from computational solvent mapping (FTmap, SiteMap), or from the literature. A pharmacophore model describes the binding site in terms of the minimum set of structural and electronic features — hydrogen-bond donor, hydrogen-bond acceptor, hydrophobic center, aromatic ring, positively or negatively charged group — that a ligand must present to achieve high-affinity binding.

Section 2.2: Molecular Docking

Molecular docking is the computational method that predicts the preferred orientation (binding mode or “pose”) of a small molecule within the binding site of a target protein, and assigns a score (a numerical estimate of the binding affinity or complementarity) to each pose.

Molecular docking consists of two algorithmic components: a search algorithm that generates candidate poses by exploring the conformational and translational space of the ligand within the binding site, and a scoring function that evaluates the energy of each pose to identify the most likely binding mode and estimate the binding affinity. The search and scoring steps together define the docking protocol.

The most widely used search algorithms in modern docking include genetic algorithms (AutoDock 4), Lamarckian genetic algorithms (AutoDock 4 and AutoDock Vina precursors), iterative local search (AutoDock Vina), and Monte Carlo random walks (Glide, GOLD). These algorithms explore the torsional degrees of freedom of the ligand and, in some cases, the side chain flexibility of the protein, while keeping the protein backbone rigid (the most common approximation, rigid-receptor docking).

Scoring functions can be classified as force-field based (evaluating the molecular mechanics energy of the protein–ligand complex, including van der Waals, electrostatic, and torsional terms), empirical (summing contributions from identified interaction types, with coefficients calibrated against experimental binding data), or knowledge-based (using statistical pair potentials derived from the frequency of atom–atom contacts in the PDB). Hybrid scoring functions combining elements of all three classes are common in commercial docking software.

The AutoDock Vina scoring function is empirical/knowledge-based, using terms for van der Waals interactions, hydrogen bonding, hydrophobic contacts, and torsional entropy. The final score (in kcal mol\(^{-1}\)) is the estimated binding free energy; more negative values indicate stronger predicted binding.

Subsection 2.2.1: Docking Protocol in Practice

A typical docking workflow in AutoDock Vina proceeds as follows. The protein structure is prepared using tools such as AutoDockTools or UCSF Chimera: the structure is loaded, water molecules outside the binding site are removed, polar hydrogens are added, and Gasteiger partial charges are assigned to all heavy atoms. A grid box is defined around the known or predicted binding site, with dimensions large enough to accommodate the ligand and allow some translational freedom (typically 20–30 Å per side). The ligand is prepared by establishing appropriate protonation states, assigning partial charges, and specifying which bonds are rotatable (typically all non-ring, non-amide single bonds).

The docking calculation generates a set of poses ranked by predicted binding energy. The top poses are visually inspected in a molecular visualization program such as VMD or PyMOL to assess their chemical reasonableness: are hydrogen bonds formed with the appropriate protein residues? are large hydrophobic groups buried in hydrophobic pockets? do key pharmacophore features overlap with those of known active ligands? Poses that score well but form no hydrogen bonds to the protein, or that place charged groups in hydrophobic environments, are likely artifacts of the scoring function.

Virtual screening applies docking to large libraries of compounds (thousands to millions) to rank them by predicted binding affinity, enabling selection of the most promising candidates for experimental testing. The efficiency of virtual screening depends critically on the accuracy of the scoring function — high false-positive rates (compounds that score well but don’t bind) and false-negative rates (active compounds that score poorly) are persistent challenges in the field.


Chapter 3: Molecular Visualization and Structure Analysis

Section 3.1: Molecular Visualization with VMD

VMD (Visual Molecular Dynamics) is a freely available molecular visualization and analysis program developed at the University of Illinois at Urbana–Champaign, particularly well-suited for viewing MD simulation trajectories and analyzing protein–ligand interactions. It supports a wide range of molecular file formats (PDB, DCD, PSF, mol2, etc.) and rendering styles (lines, licorice/stick, CPK/VDW spheres, ribbon/cartoon, surface representations).

The VMD command-line interface uses Tcl scripts (Tool Command Language, the scripting language that controls VMD) to automate analysis tasks. A basic Tcl script in VMD can load a trajectory, select a subset of atoms using an intuitive selection language, measure geometric quantities (distances, angles, dihedrals), and write output files. The atom selection language supports expressions such as protein and resname HIS and within 5 of resname LIG, which selects all atoms of histidine residues within 5 Å of the ligand — a natural way to identify the residues participating in ligand binding.

Section 3.2: Biomolecular Interaction Analysis

Understanding a protein–ligand complex requires quantitative analysis of the non-covalent interactions that stabilize binding. The principal interaction types in protein–ligand complexes are:

Hydrogen bonds: directional interactions between a hydrogen-bond donor (N–H or O–H) and a hydrogen-bond acceptor (lone pair on N, O, or S). In protein–ligand complexes, hydrogen bonds to backbone amide NH and C=O groups and to side chains of Ser, Thr, Tyr, Asn, Gln, Lys, Arg, Asp, Glu, and His are particularly important. Typical O···N or O···O hydrogen-bond distances are 2.5–3.5 Å; the angle at the donor H is typically > 120°.

Hydrophobic contacts: non-specific, favorable van der Waals interactions between non-polar ligand substituents (methyl, phenyl, cyclohexyl) and hydrophobic protein residues (Leu, Ile, Val, Met, Phe, Trp, Pro). The transfer of non-polar surfaces from water into a hydrophobic pocket is entropically favorable (the “hydrophobic effect”) and enthalpically slightly unfavorable (weaker van der Waals interactions), making hydrophobic contacts primarily entropy-driven.

Pi-stacking (aromatic–aromatic) interactions: face-to-face or edge-to-face (T-shaped) stacking between aromatic rings of the ligand and aromatic side chains of Phe, Tyr, or Trp. Typical centroid–centroid distances are 3.5–5 Å. Face-to-face stacking is electrostatically favorable when the rings have complementary electrostatic surfaces.

Cation–pi interactions: electrostatic attraction between a cationic group (protonated amine, guanidinium of Arg) and the face of an aromatic \( \pi \) system. These are among the strongest non-covalent interactions in biology, with energies of 10–50 kJ mol\(^{-1}\) in the gas phase.

Salt bridges (ionic interactions): Coulombic attraction between oppositely charged groups — e.g., a protonated amine of the ligand and the carboxylate of Asp or Glu, or a guanidinium group and a phosphate.


Chapter 4: Molecular Mechanics and Force Fields

Section 4.1: The Concept of a Force Field

A force field is a mathematical model (a set of equations and parameters) that describes the potential energy of a molecular system as a function of the positions of its atoms. The force field does not treat electrons explicitly — it is a classical (non-quantum) model — but it parameterizes the effective interatomic interactions to reproduce experimental or quantum mechanical data.

The total potential energy in a molecular mechanics force field is a sum of bonded (through-bond) and non-bonded (through-space) terms: \[ V_{\text{total}} = V_{\text{bonds}} + V_{\text{angles}} + V_{\text{dihedrals}} + V_{\text{improper}} + V_{\text{vdW}} + V_{\text{elec}}. \]

The bonded terms include:

Bond stretching — typically modeled as a harmonic potential:

\[ V_{\text{bond}} = \sum_{\text{bonds}} \frac{1}{2} k_b (r - r_0)^2, \]

where \( k_b \) is the bond spring constant and \( r_0 \) is the equilibrium bond length. More accurate force fields use a Morse potential for covalent bonds, which correctly goes to zero at infinite separation (bond dissociation).

Angle bending — also harmonic:

\[ V_{\text{angle}} = \sum_{\text{angles}} \frac{1}{2} k_\theta (\theta - \theta_0)^2. \]

Dihedral (torsional) potential — modeled as a Fourier series:

\[ V_{\text{dihedral}} = \sum_{\text{dihedrals}} \frac{V_n}{2}\left[1 + \cos(n\phi - \phi_0)\right], \]

where \( n \) is the periodicity (1, 2, or 3 for most dihedrals), \( \phi \) is the torsional angle, \( V_n \) is the barrier height, and \( \phi_0 \) is the phase offset.

The non-bonded terms include:

van der Waals interactions — modeled by the Lennard-Jones potential:

\[ V_{\text{vdW}}(r_{ij}) = 4\varepsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^6\right], \]

where \( r_{ij} \) is the interatomic distance, \( \varepsilon_{ij} \) is the depth of the potential well (a measure of the interaction strength), and \( \sigma_{ij} \) is the distance at which the potential is zero (related to the sum of van der Waals radii). The \( r^{-12} \) term models short-range repulsion (Pauli repulsion) and the \( r^{-6} \) term models long-range attraction (London dispersion).

Electrostatic interactions — modeled by Coulomb’s law:

\[ V_{\text{elec}}(r_{ij}) = \frac{q_i q_j}{4\pi\varepsilon_0 \varepsilon_r r_{ij}}, \]

where \( q_i \) and \( q_j \) are the partial atomic charges and \( \varepsilon_r \) is the dielectric constant of the medium. Partial atomic charges are not physical observables but are assigned by fitting to reproduce quantum mechanically computed electrostatic potentials (for small molecules, the RESP or AM1-BCC procedures) or by chemical intuition (for the parameterization of protein force fields like AMBER, CHARMM, GROMOS, and OPLS).


Chapter 5: Molecular Dynamics Simulations

Section 5.1: Principles of Molecular Dynamics

Molecular dynamics (MD) simulation computes the time evolution of a molecular system by numerically integrating Newton’s equations of motion for every atom. Given the positions \( \mathbf{r}_i \) and velocities \( \mathbf{v}_i \) of all \( N \) atoms at time \( t \), the force on each atom is computed from the gradient of the force-field potential:

\[ \mathbf{F}_i = -\nabla_i V_{\text{total}}, \]

and Newton’s second law \( m_i \ddot{\mathbf{r}}_i = \mathbf{F}_i \) is integrated numerically. The most common integration algorithm is the Verlet algorithm (or its velocity-Verlet variant):

\[ \mathbf{r}_i(t + \Delta t) = 2\mathbf{r}_i(t) - \mathbf{r}_i(t - \Delta t) + \frac{\mathbf{F}_i(t)}{m_i}(\Delta t)^2 + \mathcal{O}(\Delta t)^4. \]

The time step \( \Delta t \) must be small enough to accurately integrate the fastest motions in the system — typically the O–H and N–H bond stretching vibrations (\( \sim 10 \) fs period), requiring \( \Delta t \approx 1 \) fs for a stable simulation. Using the SHAKE or LINCS algorithm to constrain all bond lengths allows the time step to be increased to 2 fs, since the fast stretching modes are frozen. With heavy hydrogen atoms (virtual sites), \( \Delta t \) can be extended to 4–5 fs.

Section 5.2: Simulation Setup and Equilibration

Preparing a protein–ligand system for MD simulation requires a series of careful steps. The protein structure (from the PDB or generated by homology modeling) is protonated and placed in a simulation box — typically a truncated octahedron or cubic box — with dimensions large enough that the protein does not interact with its periodic image across the box boundary (at least 10–12 Å of water between the protein surface and the box edge in any direction). The box is filled with explicit water molecules modeled by a water force field such as TIP3P, TIP4P, or SPC/E.

If the protein–ligand system carries a net charge, counterions (Na\(^+\) or Cl\(^-\)) are added to neutralize the system — charged periodic systems create artifacts due to infinite charge–charge interactions in the Ewald summation scheme used for long-range electrostatics. Additional salt (typically 150 mM NaCl, matching physiological conditions) can be added to more accurately model the ionic environment.

After assembling the system, a minimization step removes bad contacts (atom clashes) introduced during the setup. Then equilibration in the NVT ensemble (constant number of particles, volume, and temperature) heats the system from 0 K to the target temperature (usually 300 K), followed by equilibration in the NPT ensemble (constant pressure and temperature) to equilibrate the density. Finally, the production run is performed, during which configurations are saved at regular intervals (every 1–10 ps) for subsequent analysis.

Section 5.3: Analysis of MD Trajectories

The MD trajectory — the time series of atomic coordinates — contains a wealth of information about the dynamics of the protein–ligand complex. Standard analyses include:

Root-mean-square deviation (RMSD) measures the structural deviation of the protein backbone (or the ligand) from a reference structure (typically the starting X-ray structure) over the course of the simulation:

\[ \text{RMSD}(t) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left|\mathbf{r}_i(t) - \mathbf{r}_i^{\text{ref}}\right|^2}. \]

A stable, equilibrated system shows an RMSD that fluctuates around a plateau value. Large RMSD values indicate significant conformational changes or simulation instability.

Root-mean-square fluctuation (RMSF) measures the time-averaged flexibility of individual residues: a high RMSF indicates a flexible region (loops, termini), while a low RMSF indicates a rigid, well-ordered region (secondary structure elements, buried residues). RMSF values can be compared with experimental crystallographic B-factors.

Hydrogen-bond analysis tracks the frequency with which each hydrogen bond between the ligand and the protein (or within the protein) is formed over the course of the simulation. A hydrogen bond is typically defined geometrically (donor–acceptor distance < 3.5 Å and D–H···A angle > 120°). A stable, functionally important hydrogen bond is formed in >50% of simulation frames; transient interactions are seen at lower frequencies.

Binding free energy estimation by MM-PBSA (Molecular Mechanics / Poisson–Boltzmann Surface Area) or MM-GBSA (using the Generalized Born model) provides a computationally tractable estimate of the binding free energy from snapshots extracted from the MD trajectory:

\[ \Delta G_{\text{bind}} \approx \langle \Delta E_{\text{MM}} \rangle + \langle \Delta G_{\text{solvation}} \rangle - T\langle \Delta S_{\text{MM}} \rangle. \]

While MM-PBSA does not capture all contributions to the binding free energy with chemical accuracy, it is widely used for ranking congeneric series of ligands during lead optimization.


Chapter 6: Ligand-Based Drug Design and Pharmacophore Modeling

Section 6.1: Ligand-Based Approaches

When the three-dimensional structure of the target protein is not available, ligand-based drug design (LBDD) methods exploit the structural and physicochemical properties of known active and inactive compounds to infer the features required for activity. These methods are complementary to structure-based approaches and are particularly powerful when a large, structurally diverse set of active compounds is available.

Section 6.2: Pharmacophore Modeling

A pharmacophore (from the Greek for “carrier of pharmacological activity”) is the ensemble of steric and electronic features that a molecule must possess to achieve optimal supramolecular interactions with a specific biological macromolecule and trigger (or block) a biological response. In modern practice, a pharmacophore is a three-dimensional arrangement of chemical feature types (hydrogen-bond donor, hydrogen-bond acceptor, hydrophobic center, aromatic ring, positive ionizable, negative ionizable) with associated volume and distance constraints.

Pharmacophore models can be built either from the structure of the target binding site (structure-based pharmacophore) or from the superposition of known active ligands (ligand-based pharmacophore). In the ligand-based approach, a conformational model of each active compound is generated, and algorithms such as CATALYST (Accelrys) or Phase (Schrödinger) identify the common chemical features and their relative spatial arrangement across all the active conformers.

Once a validated pharmacophore model is in hand, it can be used to screen databases of compounds (pharmacophore virtual screening) to identify novel scaffolds that present the required features in the correct geometry, even without structural similarity to the training set ligands (scaffold hopping). This is a powerful strategy for identifying non-obvious leads and for circumventing intellectual property barriers.

Section 6.3: Quantitative Structure–Activity Relationships

Quantitative structure–activity relationship (QSAR) modeling establishes mathematical relationships between the physicochemical and structural descriptors of a series of compounds and their observed biological activity. Classical QSAR uses the Hansch approach: the biological activity (often expressed as \( \log(1/C) \), where \( C \) is the molar concentration required for a specified effect) is correlated with Hammett substituent constants (\( \sigma \), for electronic effects), Hansch hydrophobic constants (\( \pi = \log P_{\text{substituted}} - \log P_{\text{unsubstituted}} \)), and steric parameters (Taft’s \( E_s \), molar refractivity) by multiple linear regression.

Modern QSAR uses a much broader set of molecular descriptors computed from the two-dimensional or three-dimensional structure of the molecule: topological indices (connectivity indices, Wiener index), constitutional descriptors (molecular weight, atom count, ring count), geometric descriptors (surface area, volume, dipole moment), and quantum chemical descriptors (HOMO/LUMO energies, partial charges). Machine learning methods — from partial least squares (PLS) to random forests to deep neural networks — are applied to correlate these high-dimensional descriptor vectors with activity, selectivity, and ADMET properties.


Chapter 7: Bioinformatics and Unix for Drug Discovery

Section 7.1: Unix-Based Computing for Molecular Modelling

Nearly all professional molecular modelling and drug discovery software runs in a Unix or Linux environment. Proficiency with the Unix command line is therefore an essential skill for the computational chemist. The core Unix commands for file management, text processing, and job submission provide a productive environment for running docking calculations, MD simulations, and analysis scripts.

Basic navigation uses ls (list directory contents), cd (change directory), mkdir (create directory), rm (remove files), cp (copy), and mv (move or rename). Text processing tools such as grep (search for patterns), awk (column-oriented text processing), and sed (stream editor for substitution) allow rapid manipulation of the large text-based output files generated by molecular modelling programs.

Shell scripts combine Unix commands and control structures (loops, conditionals) to automate repetitive tasks — for example, running AutoDock Vina with a list of ligands, collecting the output binding energies, and sorting the list by score. A simple bash loop iterates over all files in a directory:

for ligand in *.pdbqt; do
    vina --receptor receptor.pdbqt --ligand "$ligand" \
         --config config.txt --out "${ligand%.pdbqt}_out.pdbqt" \
         --log "${ligand%.pdbqt}.log"
done

Tcl scripts in VMD automate structure analysis: loading a trajectory, selecting atoms, computing geometric measurements (hydrogen bond distances, RMSD, contact maps), and outputting results to files for plotting in Python or gnuplot.

Section 7.2: Programming for Cheminformatics

Python has become the dominant programming language for computational chemistry and cheminformatics, due to its readable syntax, extensive scientific libraries (NumPy, SciPy, Matplotlib, Pandas), and dedicated chemistry toolkits. RDKit is the most widely used open-source cheminformatics library; it provides tools for reading and writing chemical file formats (SMILES, SDF, MOL2, PDB), computing molecular descriptors (fingerprints, physicochemical properties, pharmacophore features), performing substructure searching, and generating 2D and 3D molecular representations.

A typical cheminformatics workflow in Python might load a library of compounds in SMILES format, compute their Lipinski parameters, filter for drug-likeness, generate 3D conformers using ETKDG (RDKit’s distance geometry method), and export the filtered, conformation-generated compounds as SDF files ready for docking. The entire workflow can be scripted in a few dozen lines of Python, demonstrating the power of this computational approach for rapid, reproducible library preparation.


Chapter 8: Research Project — Drug Candidate Design

Section 8.1: Identifying and Characterizing a Drug Target

The capstone of this course is an independent research project in which students apply the computational tools and concepts developed throughout the semester to a realistic drug discovery problem. The project begins with target selection: choosing a disease-relevant protein with a known or modeled three-dimensional structure and a well-defined binding site. Suitable targets include enzyme active sites (proteases, kinases, phosphatases, transferases), protein–protein interaction interfaces, and allosteric pockets.

A thorough literature review establishes the biological rationale for the target (what disease is it linked to? what evidence implicates it?), the structural biology of the binding site (what co-crystal structures are available in the PDB? what are the key pharmacophore features?), and the landscape of known inhibitors (what are the most potent compounds? what structural features are essential for activity?). This review culminates in a project proposal — a concise document that states the target, the disease rationale, the proposed computational strategy, and the expected outcome of the virtual screening or design effort.

Section 8.2: Virtual Screening and Hit Identification

The computational strategy for the project may involve virtual screening of a commercially available compound library (such as the ZINC database or the Enamine library), de novo design (computationally assembling novel structures that fit the pharmacophore), or scaffold hopping (identifying bioisosteric replacements for the core scaffold of a known inhibitor to improve selectivity or patent position).

For virtual screening, the workflow follows the docking protocol described in Chapter 3, extended to a library of hundreds to thousands of compounds. The top-scoring compounds are filtered by: visual inspection of the predicted binding mode (is the docking pose chemically reasonable?), Lipinski’s Rule of Five filtering (drug-likeness), PAINS (Pan-Assay INterference compoundS) filtering (removing compounds likely to give false positives in assays), and structural diversity analysis (ensuring a diverse set of chemotypes is selected for experimental follow-up).

Section 8.3: Communicating Drug Discovery Research

The ability to communicate computational drug discovery research clearly and persuasively — to an audience that includes not only computational chemists but also medicinal chemists, biologists, and (in industry) management — is as important as the computational work itself. The course project culminates in both a written report (following scientific journal conventions) and an oral presentation to the class.

The written report presents the biological rationale for the target, the computational methods used (with sufficient detail for reproducibility), the results of the virtual screening or design (binding poses, predicted affinities, ADMET predictions), and a critical evaluation of the results (limitations of the scoring function, confidence in the predicted binding modes, proposed experimental validation). The oral presentation distills this material into a 15-minute talk with clear visual aids, emphasizing the key scientific question, the approach taken, the main results, and the implications for further work.

The ability to tell a coherent and compelling scientific story — choosing the right level of detail for the audience, anticipating objections, and presenting data honestly including its limitations — is a skill that transcends any particular computational tool or drug target. It is, ultimately, the skill that determines whether a computational drug discovery finding leads to experimental follow-up and eventually to new medicines.

Back to top