BME 555: Artificial Intelligence in Health and Medicine
Estimated study time: 50 minutes
Readings and resources
- Topol, Eric J. Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again. Basic Books, 2019.
- Rajkomar, Alvin, Jeff Dean, and Isaac Kohane. “Machine Learning in Medicine.” New England Journal of Medicine 380 (2019): 1347–1358.
- Esteva, Andre, et al. “Dermatologist-Level Classification of Skin Cancer with Deep Neural Networks.” Nature 542 (2017): 115–118.
- Obermeyer, Ziad, et al. “Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations.” Science 366 (2019): 447–453.
- Char, Danton S., Nigam H. Shah, and David Magnus. “Implementing Machine Learning in Health Care — Addressing Ethical Challenges.” New England Journal of Medicine 378 (2018): 981–983.
- World Health Organization. Ethics and Governance of Artificial Intelligence for Health. WHO, 2021.
- Marblestone, Adam H., Greg Wayne, and Konrad P. Kording. “Toward an Integration of Deep Learning and Neuroscience.” Frontiers in Computational Neuroscience 10 (2016): 94.
- Hochberg, Leigh R. “Turning Thought into Action.” New England Journal of Medicine 359 (2008): 1175–1177.
- Saria, Suchi, et al. “Integration of Early Physiological Responses Predicts Later Illness Severity in Preterm Infants.” Science Translational Medicine 2, no. 48 (2010).
- Miotto, Riccardo, et al. “Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records.” Scientific Reports 6 (2016): 26094.
- Beam, Andrew L., and Isaac S. Kohane. “Big Data and Machine Learning in Health Care.” JAMA 319, no. 13 (2018): 1317–1318.
- Mnih, Volodymyr, et al. “Human-Level Control through Deep Reinforcement Learning.” Nature 518 (2015): 529–533.
- Online resources: Stanford BIODS 220 course materials; MIT 6.S897 lecture notes; Harvard BMI 715 syllabus; Toronto MHI 1006 materials; WHO AI for Health resources (who.int/health-topics/artificial-intelligence); ClinicalTrials.gov for AI in medicine studies.
Chapter 1: AI Enters the Clinic — Promises, Realities, and Stakes
The history of artificial intelligence in medicine is not a story of sudden arrival but of long anticipation, periodic disappointment, and finally a period of genuine, if uneven, transformation. Eric Topol opens Deep Medicine with a diagnosis: medicine has become dehumanised. The clinician of the early twenty-first century spends an extraordinary fraction of the working day entering data into electronic records, clicking through administrative checkboxes, and performing pattern-recognition tasks — reading an electrocardiogram, interpreting a chest radiograph, scanning a pathology slide — that, in Topol’s view, neither require nor deserve the full attention of a trained physician. His central thesis is consequently counterintuitive: AI in medicine will not replace doctors but will give doctors time to be doctors again. By absorbing the pattern-recognition burden, AI systems free clinicians for the deeply human dimensions of care — listening, comforting, explaining, navigating the moral complexity of end-of-life decisions — that no algorithm can or should perform. This is an aspirational vision rather than a description of current practice, but it provides an organising principle for the field and a standard against which specific developments can be measured.
The intellectual lineage of medical AI has three distinct waves, each defined by the computational tools available and the clinical problems they targeted. The first wave, spanning roughly the 1970s through the 1990s, consisted of rule-based clinical decision support: expert systems such as MYCIN, which encoded physician knowledge as logical if-then rules to recommend antibiotic therapies for bacterial infections, and INTERNIST-1, which attempted differential diagnosis across a broad range of internal medicine problems. These systems were technically impressive for their era and demonstrated proof-of-concept for machine-augmented clinical reasoning, but they proved brittle — they could not generalise beyond the narrow domains their creators had manually encoded, and their maintenance burden grew prohibitive as medical knowledge expanded. The second wave brought statistical machine learning: logistic regression, Bayesian networks, support vector machines, and random forests applied to clinical prediction problems. These methods could learn from data rather than requiring manual rule specification, and they produced a generation of validated clinical prediction tools — the APACHE score for ICU mortality, the Framingham cardiovascular risk equations — that remain in use today. The third and current wave, inaugurated by the deep learning revolution beginning around 2012, applies hierarchical representation learning to the massive data sets that modern healthcare generates: medical images, electronic health records, genomic sequences, continuous physiological monitoring streams, and clinical notes.
Medicine is simultaneously one of the most promising and one of the most dangerous domains for the deployment of artificial intelligence, and understanding both sides of that duality is foundational to the course. The promise is grounded in data scale: modern hospitals generate hundreds of terabytes of imaging data per year; electronic health records accumulate decades of clinical observations on millions of patients; genomic sequencing costs have fallen from billions of dollars per genome to hundreds. These data streams, if properly curated and linked, constitute a learning resource of extraordinary richness that dwarfs the implicit experience of any individual clinician. The danger is equally grounded in context: clinical decisions are high-stakes, frequently irreversible, and embedded in relationships of trust and vulnerability. When a streaming recommendation algorithm suggests the wrong film, the consequence is mild disappointment. When a sepsis prediction algorithm fails to fire in the early hours of a patient’s deterioration, or fires so frequently that clinicians learn to ignore it, the consequences can be catastrophic. Beam and Kohane, writing in JAMA, argue that medicine’s combination of high stakes, complex multi-modal inputs, severe error consequences, and historically entrenched systemic inequalities makes it both the domain where AI can do the most good and the domain where insufficient rigour can do the most harm.
Rajkomar, Dean, and Kohane, in their landmark 2019 New England Journal of Medicine review, map the landscape of machine learning in medicine across three primary data modalities: medical images, clinical time series, and text. Their review is notable not only for its breadth but for its intellectual honesty about the gap between benchmark performance and clinical utility. A model that achieves state-of-the-art accuracy on a curated benchmark dataset, they observe, may perform substantially worse when deployed in a different hospital with different imaging equipment, a different patient population, or different data-entry practices. The distribution shift problem — the divergence between the distribution of the development data and the deployment context — is not a technical footnote but a fundamental challenge for clinical AI that no amount of benchmark optimisation resolves. The course proceeds through medical imaging, EHR analytics, public health, ethics, aging, neuroscience, brain-computer interfaces, governance, and futures — with this foundational tension between laboratory performance and real-world clinical utility as a constant reference point.
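The distribution shift problem can be made concrete with a toy simulation (all numbers hypothetical): a threshold classifier tuned at a development site loses substantial accuracy when a deployment site's instrument reads systematically higher, even though nothing about the disease has changed.

```python
import random

random.seed(0)

def sample(n, shift=0.0):
    """Simulated biomarker: diseased patients (y = True) read higher.
    `shift` mimics a deployment site whose scanner or assay reads
    systematically higher than the development site."""
    data = []
    for _ in range(n):
        y = random.random() < 0.5
        x = random.gauss(1.0 if y else 0.0, 0.5) + shift
        data.append((x, y))
    return data

def accuracy(data, threshold):
    return sum((x > threshold) == y for x, y in data) / len(data)

threshold = 0.5  # midpoint of the class means at the development site
internal = accuracy(sample(5000), threshold)             # same distribution
external = accuracy(sample(5000, shift=0.7), threshold)  # shifted deployment site
print(f"internal accuracy {internal:.2f}, external accuracy {external:.2f}")
```

The model is unchanged between the two evaluations; only the data-generating context moved, which is exactly why benchmark optimisation alone cannot resolve the problem.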
Chapter 2: AI in Medical Imaging — Radiology, Pathology, and Dermatology
The most technically mature and extensively validated applications of clinical AI concern medical imaging. The reasons are structural: medical images are high-dimensional but regularly formatted data; large labelled datasets have been assembled through retrospective review of clinical archives; and performance can be benchmarked against the ground-truth judgments of expert clinicians in ways that are more straightforward than for many clinical prediction tasks. The so-called ImageNet moment in medical imaging arrived in the mid-2010s, when researchers began applying the deep convolutional architectures that had revolutionised natural image recognition to chest radiographs, retinal fundus photographs, histopathology slides, and dermatoscopic images. Rajpurkar et al.’s CheXNet achieved radiologist-equivalent performance on pneumonia detection from chest X-rays. Gulshan et al., working at Google, demonstrated that a convolutional network trained on 128,175 retinal fundus images could detect referable diabetic retinopathy with sensitivity and specificity meeting or exceeding those of ophthalmologists — a result with enormous implications for diabetic retinopathy screening in populations without consistent access to specialist care.
The canonical demonstration of expert-level clinical AI performance in imaging is the Esteva et al. Nature paper published in 2017. The study trained a single Inception-v3 convolutional neural network on 129,450 clinical images spanning 2,032 different skin disease classes, and then evaluated its performance on a test set of biopsy-confirmed lesions against 21 board-certified dermatologists. The network classified keratinocyte carcinomas versus benign seborrhoeic keratoses and malignant melanomas versus benign naevi at a level of accuracy that matched or exceeded the dermatologist panel. The methodological rigour of the comparison — using receiver operating characteristic curves rather than a single operating point, and testing against clinicians with varying levels of experience — made it a landmark. Esteva et al. conclude that their results demonstrate the feasibility of a smartphone-based skin cancer screening tool capable of providing expert-level guidance in primary care and community settings. The paper was immediately recognised as a proof-of-concept for a broader programme of AI-based specialist-level clinical decision support accessible outside specialist settings.
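The methodological point about ROC analysis can be illustrated with a short sketch (toy malignancy scores, no external libraries): a rank-based AUC summarises discrimination across all thresholds, whereas a single operating point reports only one sensitivity/specificity pair and can flatter or penalise a classifier depending on where the threshold happens to sit.

```python
def roc_auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive
    case scores higher than a randomly chosen negative case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def operating_point(scores, labels, threshold):
    """Sensitivity and specificity at one fixed threshold."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fn = sum(s < threshold and y == 1 for s, y in zip(scores, labels))
    tn = sum(s < threshold and y == 0 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    return tp / (tp + fn), tn / (tn + fp)

# toy scores for 4 malignant (1) and 4 benign (0) lesions
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
print(roc_auc(scores, labels))                 # threshold-free summary
print(operating_point(scores, labels, 0.5))    # one point on the curve
```

Comparing a model to clinicians along the full curve, as Esteva et al. did, avoids the arbitrariness of choosing any single threshold in advance.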
The question of whether clinical AI augments or replaces radiologists has become a structuring debate in the field, and the current weight of evidence favours augmentation over replacement. McKinney et al.’s 2020 Nature paper on AI screening mammography — developed on de-identified mammograms from roughly 29,000 women and evaluated on independent UK and US test sets — found that the AI system reduced false positives by 5.7 percent and false negatives by 9.4 percent on the US data compared to standard clinical reads, and matched or exceeded individual radiologist performance. Crucially, however, the study did not test the combination of AI and radiologist against radiologist alone, leaving open the question of whether the optimal deployment model is AI-assisted reading, AI triage, or sequential double reading. A growing body of evidence from colonoscopy, chest CT, and ophthalmology suggests that the AI-plus-radiologist combination often outperforms either AI or radiologist alone on detection tasks, a result that aligns with Topol’s augmentation thesis: the value of AI in imaging is not to replace human expert judgment but to provide a complementary signal that reduces the error modes unique to human perception — fatigue, anchoring, satisfaction-of-search.
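A back-of-the-envelope calculation shows why complementary error modes favour combined reading: if reader and model each miss 15 percent of cancers and their misses were statistically independent (an idealisation that real error modes only partially satisfy, since sensitivities vary by study and misses correlate), arbitrated double reading would miss only about 2 percent.

```python
# Idealised double reading: a case is recalled if EITHER the radiologist
# or the model flags it. Sensitivities are illustrative, not from any study.
sens_reader = 0.85
sens_model = 0.85

# Under independence, both miss the same case with probability 0.15 * 0.15.
sens_combined = 1 - (1 - sens_reader) * (1 - sens_model)
print(round(sens_combined, 4))
```

The gain shrinks as the two readers' misses become correlated, which is why empirical studies of AI-assisted reading, rather than independence arithmetic, are needed to settle the deployment question.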
The regulatory landscape governing AI-based medical imaging tools has evolved rapidly. In the United States, AI and machine learning-based software that meets the definition of Software as a Medical Device (SaMD) is subject to FDA oversight under the 510(k) or De Novo pathways, which require demonstration of substantial equivalence to a predicate device or novel safety and effectiveness data respectively. The FDA cleared its first AI-based medical imaging tool — IDx-DR, an autonomous AI for diabetic retinopathy detection — in 2018 via De Novo, marking the first regulatory authorisation of an AI diagnostic system that does not require clinician input to interpret results. The FDA’s 2021 action plan for AI/ML-based SaMD introduced the concept of a predetermined change control plan: a regulatory pathway that allows manufacturers to specify in advance the types of model updates and retraining that can occur without requiring a new regulatory submission, addressing the dynamic nature of continuously learning clinical AI systems. The EU Medical Device Regulation (MDR) applies similar conformity assessment requirements through a CE marking process, and the EU AI Act — which classifies medical AI as high risk — adds additional transparency and documentation requirements that are likely to shape global practice as manufacturers seek unified regulatory strategies.
Chapter 3: Electronic Health Records, Predictive Analytics, and Clinical Decision Support
The electronic health record (EHR) is simultaneously the most data-rich and the most methodologically treacherous substrate for clinical AI. A modern hospital EHR contains structured data — diagnoses encoded in ICD-10, medications in RxNorm, laboratory values with timestamps — alongside unstructured clinical notes, imaging reports, and procedure logs. Linked over years or decades for a large patient population, this constitutes a longitudinal observational dataset of formidable richness. The challenge is that EHR data is a record of clinical decisions, not a direct observation of disease. What appears in the record is shaped by which patients sought care, which clinicians ordered which tests, what the prevailing coding incentives were, and how the documentation culture of the institution evolved over time. Missing data in EHR is not random: the absence of a laboratory measurement often means the clinician did not think it necessary to order, not that the test was ordered and yielded a normal result. This distinction, familiar to epidemiologists as informative missingness (data missing not at random), is easy to overlook but profoundly affects what machine learning models trained on EHR data are actually learning.
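A minimal simulation (hypothetical probabilities throughout) shows why missingness itself carries signal: if clinicians order a laboratory test far more often when they suspect illness, then the mere presence or absence of a result predicts the outcome, before any measured value is even consulted.

```python
import random

random.seed(1)

# Simulate 10,000 patients. Clinicians order the lab far more often
# when they suspect illness, so "test ordered" is itself informative.
patients = []
for _ in range(10_000):
    sick = random.random() < 0.2
    ordered = random.random() < (0.9 if sick else 0.1)
    patients.append((sick, ordered))

def sick_rate(condition):
    subset = [sick for sick, ordered in patients if condition(ordered)]
    return sum(subset) / len(subset)

print("P(sick | lab ordered) =", round(sick_rate(lambda o: o), 3))
print("P(sick | lab missing) =", round(sick_rate(lambda o: not o), 3))
```

A model trained naively on such data learns the clinician's ordering behaviour as much as the patient's biology, which is precisely the trap the paragraph above describes.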
Miotto et al.’s Deep Patient (2016), trained on the de-identified EHR records of 700,000 patients at the Icahn School of Medicine at Mount Sinai, demonstrated that an unsupervised deep autoencoder could learn a low-dimensional representation of patient health state from raw EHR data — diagnoses, medications, and procedures — that was substantially more predictive of future diagnoses than features engineered by clinical experts. The Deep Patient representation improved prediction of future liver disease, psychosis, and cancer, among dozens of other outcomes, suggesting that deep learning can discover clinically meaningful latent structure in messy, heterogeneous EHR data that eludes conventional feature engineering. The paper was widely cited as a demonstration of the potential of unsupervised representation learning for clinical prediction. It was also frank about interpretability: the authors reported that the learned representation was difficult to interpret and that its clinical utility for individual patient management — as opposed to population-level risk stratification — remained unclear, a tension that recurs throughout EHR-based clinical AI.
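The following sketch is not Deep Patient's architecture (which stacked denoising autoencoders over the records of 700,000 real patients) but a minimal illustration of the same idea: an autoencoder compresses binary code vectors into a low-dimensional representation that captures latent phenotype structure. The data, dimensions, and training schedule here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "EHR" matrix: rows are patients, columns are diagnosis/medication
# codes. Two hidden phenotypes each switch on a block of correlated codes.
n, codes, latent = 200, 12, 2
z = (rng.random((n, latent)) < 0.5).astype(float)
blocks = np.zeros((latent, codes))
blocks[0, :6] = 1.0
blocks[1, 6:] = 1.0
x = z @ blocks  # binary code indicators per patient

# One-hidden-layer autoencoder: sigmoid encoder, linear decoder, batch GD.
sig = lambda t: 1.0 / (1.0 + np.exp(-t))
w1 = rng.normal(0.0, 0.1, (codes, latent)); b1 = np.zeros(latent)
w2 = rng.normal(0.0, 0.1, (latent, codes)); b2 = np.zeros(codes)

def forward(x):
    h = sig(x @ w1 + b1)     # low-dimensional patient representation
    return h, h @ w2 + b2    # reconstructed code vector

_, r0 = forward(x)
mse0 = float(np.mean((r0 - x) ** 2))

lr = 0.1
for _ in range(5000):
    h, r = forward(x)
    d_out = (r - x) / n                      # gradient at the linear output
    d_hid = (d_out @ w2.T) * h * (1.0 - h)   # backprop through the sigmoid
    w2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0)
    w1 -= lr * (x.T @ d_hid); b1 -= lr * d_hid.sum(axis=0)

_, r1 = forward(x)
mse1 = float(np.mean((r1 - x) ** 2))
print(f"reconstruction MSE {mse0:.3f} -> {mse1:.3f}")
```

The two hidden units are forced to summarise twelve codes, so a good reconstruction is only possible if they recover something like the underlying phenotypes, which is the sense in which such representations can out-predict hand-engineered features.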
Saria et al.’s work on the neonatal intensive care unit constitutes one of the earliest and most methodologically careful demonstrations of the predictive health paradigm: the use of continuously collected physiological time series to predict clinical deterioration before it becomes clinically apparent. Their 2010 Science Translational Medicine paper showed that probabilistic analysis of heart rate and respiratory dynamics recorded in the first hours of life could predict which preterm infants would go on to develop serious later complications, including infection-related morbidity, more accurately than standard clinical risk scores — advance warning of potentially enormous value in a population where sepsis progression can be rapid and devastating. The underlying biological insight is that physiological systems exhibit subtle dynamical changes during early illness — reduced heart rate variability is a well-documented early signal of infection — that are below the threshold of clinical notice but detectable by algorithms sensitive to time-series structure. This work established the template for a family of ICU early warning systems that now includes prediction models for sepsis, acute kidney injury, deterioration, and mortality, and that are deployed, with varying degrees of clinical uptake, in hospitals worldwide.
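A toy illustration of the time-series signal involved (simulated RR intervals, illustrative numbers): a rolling standard deviation of inter-beat intervals, a crude heart-rate-variability proxy far simpler than the probabilistic models Saria et al. used, falls visibly when variability is damped.

```python
import random
import statistics

random.seed(2)

def rolling_hrv(rr_ms, window=30):
    """Rolling population standard deviation of RR intervals,
    a crude heart-rate-variability proxy."""
    return [statistics.pstdev(rr_ms[i:i + window])
            for i in range(len(rr_ms) - window + 1)]

# Simulated inter-beat (RR) intervals in milliseconds: normal variability,
# then the damped dynamics associated with early infection.
healthy = [random.gauss(400, 20) for _ in range(120)]
damped = [random.gauss(390, 5) for _ in range(120)]
hrv = rolling_hrv(healthy + damped)
print(f"HRV at start ~{hrv[0]:.0f} ms, at end ~{hrv[-1]:.0f} ms")
```

The mean heart rate barely moves in this simulation; the signal lives in the variance, which is why time-series-aware methods detect what a spot check of vital signs misses.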
The last mile problem in clinical AI refers to the gap between an algorithm that is technically accurate and an algorithm that is clinically useful — a gap that is often wider than technical performance metrics suggest. A sepsis prediction model that fires appropriately on 80 percent of true sepsis cases achieves nothing if the resulting alert appears in a corner of the EHR interface that busy nurses do not notice, if the recommended response action is not clearly specified, if the alert fires at a level of acuity where clinical protocols already specify management, or if the clinical team’s workflow does not include a designated responder for AI alerts. The science of clinical AI implementation — understanding how to integrate algorithmic outputs into clinical workflows in ways that actually change clinician behaviour and improve outcomes — is a research domain in its own right, and one that lags substantially behind the technical development of the algorithms. Rajkomar, Dean, and Kohane argue that this implementation gap, not technical performance limitations, is the primary constraint on clinical AI value in the near term, a judgment that aligns with the broader implementation science literature on clinical decision support.
Chapter 4: AI, Public Health, and Epidemiology
The application of machine learning to public health and epidemiology encompasses a spectrum of tasks: disease surveillance and outbreak detection, risk factor identification, intervention targeting, and pandemic response — and the track record across these applications is strikingly uneven. The cautionary tale most frequently cited is Google Flu Trends, launched in 2008 on the premise that the volume of flu-related search queries on Google could serve as a real-time proxy for influenza incidence, providing a two-week lead on CDC surveillance data. For several seasons, Google Flu Trends performed impressively. Then, in 2012–13, it overestimated flu prevalence by a factor of two. Lazer et al.’s 2014 Science analysis identified the cause: the Google search algorithm had itself changed, promoting flu-related health news content that drove search traffic independent of actual flu activity. The model had been trained on an assumption of stable relationships between search behaviour and disease prevalence, but search behaviour is shaped by the search algorithm itself — a feedback loop that epidemiological surveillance models based on internet data are structurally vulnerable to. Epidemiological surveillance based on digital traces requires not only statistical sophistication but an understanding of the data-generating process that includes the behavioural and platform dynamics shaping what users search, tweet, or report.
The SARS-CoV-2 pandemic generated what may be the largest rapid proliferation of clinical AI models in history: hundreds of prognostic models for COVID-19 severity, mortality, and ICU admission were developed and published within months of the outbreak’s onset. The systematic review by Wynants et al., published in BMJ in 2020 and updated through subsequent waves, examined 169 prediction models and found that the vast majority were at high risk of bias due to poor reporting, small sample sizes, inappropriate exclusion criteria, and inadequate handling of missing data. Almost none had been externally validated in independent populations, and none met the review’s bar for recommendation for clinical use. The COVID-19 AI modelling episode is a case study in how the combination of data availability, publication pressure, and genuine clinical urgency can produce a literature of models that are technically complex but scientifically fragile — and how the systematic review and reporting-guideline infrastructure that the evidence-based medicine movement built for clinical trials had not yet been fully adapted to the evaluation of clinical prediction models.
AI for global health introduces a further dimension of inequity that domestic applications can obscure. Diabetic retinopathy screening using AI-based fundus photograph analysis has been piloted in multiple low- and middle-income country settings — India, Thailand, Mexico — with the explicit motivation that AI can extend specialist-equivalent screening to populations without access to ophthalmologists. The equity promise is genuine: diabetic retinopathy is a leading cause of preventable blindness globally, and the treatment is effective when applied early. The equity challenge is equally genuine: the models deployed in these settings were trained predominantly on data from high-income country patient populations with different baseline retinopathy severity distributions, different fundus camera characteristics, and different image quality profiles than those encountered in low-resource field settings. Studies evaluating commercial retinopathy AI tools in sub-Saharan African contexts have found reduced sensitivity for the specific retinopathy features most prevalent in those populations. The principle that Obermeyer et al. articulate for domestic clinical AI — that models trained on data reflecting access inequities tend to underserve those with less prior healthcare utilisation — generalises globally: models trained on data from high-resource settings may systematically fail the populations they are positioned as helping.
Precision public health is the extension of precision medicine logic — using multi-omic, behavioural, and social data to stratify risk and target interventions at the individual level — to population health programmes. The intellectual case is straightforward: if high-risk individuals can be identified with sufficient accuracy before they experience a negative health event, targeted preventive intervention can be both more effective and more efficient than universal programmes. The challenge is that this framing may obscure the structural determinants of population health — poverty, housing, air quality, occupational exposure, historical disinvestment — that individual-level risk prediction and targeting cannot address. A precision public health algorithm that identifies individuals at high risk of asthma hospitalisation and targets them for inhaler adherence support does not change the air quality in the industrial neighbourhood where they live. Krieger’s ecosocial theory and Rose’s classical public health argument for population-level over individual-level intervention both suggest that the precision public health paradigm, however technically sophisticated, may systematically misidentify where leverage lies in population health improvement.
Chapter 5: The Ethics of Clinical AI
Char, Shah, and Magnus, writing in the New England Journal of Medicine in 2018, identified four ethical challenges that clinical machine learning presents, framing them in structural terms that have become canonical in the biomedical ethics literature. The first is the opacity challenge, which they frame not merely as a technical limitation but as an ethical one: a clinical recommendation that cannot be explained to the patient or justified to the clinician disrupts the informed consent process, undermines shared decision-making, and conflicts with the norms of evidence-based practice. The second is the bias challenge: training data that does not represent the population to which a model will be applied encodes historical patterns of differential access and differential treatment into the model’s predictions. The third is the feedback loop challenge: when deployed AI systems generate data — prescription patterns, referral decisions, test orders — that is used to retrain or validate subsequent model versions, the system risks amplifying rather than correcting its initial biases. The fourth is the consent challenge: using patient data for model development raises questions about the scope of the consent patients gave when their data was collected, and about whether patients have a right to opt out of data uses they did not anticipate.
The Obermeyer et al. Science 2019 paper provides the most powerful and precise illustration of racial bias in a deployed clinical AI system yet published, and its analytical architecture deserves careful attention. The study examined a widely used commercial algorithm — deployed across hundreds of US hospitals to identify patients who would benefit from enrolment in high-complexity care management programmes — that used predicted healthcare costs as a proxy for health needs. The central finding was that, controlling for predicted cost, Black patients were substantially sicker than white patients: at any given risk score generated by the algorithm, Black patients had more active chronic conditions than white patients with the same score. The consequence was that the algorithm systematically underestimated the severity of illness in Black patients, effectively requiring Black patients to be significantly sicker than white patients to receive the same level of care management. Obermeyer et al. demonstrate that the mechanism was the proxy variable: Black patients with the same health needs as white patients incurred substantially lower healthcare costs, because they had less access to care. Using cost as a proxy for health needs therefore imported the access disparity directly into the risk score.
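The proxy mechanism can be reproduced in a few lines of simulation (all numbers hypothetical): give two groups identical distributions of true health need, let one group incur lower costs per unit of need because of reduced access, rank patients by cost, and the flagged members of the lower-access group turn out to be substantially sicker at the same score.

```python
import random

random.seed(3)

# Two groups with IDENTICAL health needs; group B uses less care per
# unit of need (an access gap), so its costs run lower.
patients = []
for _ in range(20_000):
    group = random.choice("AB")
    need = random.gauss(5.0, 2.0)            # true health need, same for both
    access = 1.0 if group == "A" else 0.6    # group B incurs lower costs
    cost = max(0.0, need * access + random.gauss(0.0, 0.5))
    patients.append((group, need, cost))

# Emulate the algorithm: rank by predicted cost, flag the top quartile
# for care management.
patients.sort(key=lambda p: p[2], reverse=True)
flagged = patients[: len(patients) // 4]

def mean_need(group, pool):
    vals = [need for g, need, _ in pool if g == group]
    return sum(vals) / len(vals)

print(f"mean true need among flagged, group A: {mean_need('A', flagged):.2f}")
print(f"mean true need among flagged, group B: {mean_need('B', flagged):.2f}")
```

No variable labelled "race" appears anywhere in the scoring step; the disparity enters entirely through the cost proxy, which is the structural point of the Obermeyer et al. analysis.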
The informed consent landscape for AI-assisted clinical care remains unsettled both normatively and practically. Current practice in most health systems does not specifically disclose to patients when AI systems contribute to their diagnosis or treatment recommendations, on the grounds that clinicians routinely use many computational tools — laboratory reference ranges, pharmacokinetic calculators, clinical prediction scores — without specific disclosure. Critics argue that AI diagnostic tools are categorically different: they operate at a level of complexity and autonomy that makes meaningful clinician verification difficult, they are proprietary and opaque in ways that laboratory instruments are not, and their potential for demographic disparate impact creates a specific duty of disclosure. Patient attitude studies — by Longoni et al. and others — consistently find that patients prefer human decision-making and feel less satisfied with diagnoses attributed to AI, a finding that interacts with the disclosure question in complex ways: transparency that improves informed consent may simultaneously reduce therapeutic confidence.
Liability and accountability for AI-assisted clinical harm remain unresolved in most legal systems. The EU AI Act’s classification of medical AI as high-risk imposes pre-market conformity assessment, post-market monitoring, and documentation requirements that may clarify the manufacturer’s obligations without fully resolving the allocation of responsibility between manufacturer, deploying institution, and treating clinician. The WHO’s 2021 Ethics and Governance of Artificial Intelligence for Health articulates six normative principles — protecting autonomy, promoting well-being, ensuring transparency, fostering responsibility, ensuring inclusiveness and equity, and promoting sustainable and responsive AI — that represent a global consensus aspiration. The WHO document is notable for its emphasis on the systemic conditions required for ethical AI in health: regulatory capacity, data governance infrastructure, workforce training, and public participation in AI governance decisions. These structural prerequisites are unevenly distributed globally, and the WHO’s honest acknowledgment of this gap distinguishes its framework from more abstract bioethical analyses that treat ethics as a matter of principle rather than institutional capacity.
Chapter 6: AI and Aging — Assistive Technology, Surveillance, and Dignity
Population aging is one of the defining demographic facts of the twenty-first century. In OECD countries, the fraction of the population aged sixty-five and over is projected to exceed twenty percent by 2040, and the ratio of working-age adults to older adults is declining rapidly in most high-income nations. The resulting pressure on formal geriatric care systems — already strained by workforce shortages and funding constraints — has generated substantial interest in AI-assisted technologies that could support independent living, monitor health status, and detect deterioration in older adults without requiring constant human supervision. Aging in place — the goal of enabling older adults to remain in their own homes as long as possible — is both a stated preference of most older adults and a policy objective for healthcare systems seeking to reduce the cost of institutional care. AI technologies are positioned as an enabling infrastructure for aging in place, providing the monitoring and support capabilities that make it safe without requiring continuous human presence.
Fall detection and prediction is among the most clinically consequential and technically developed applications of AI in aging. Falls are the leading cause of injury-related death and disability in adults over sixty-five, and the consequences of an undetected fall — lying on the floor unable to summon help — can be severe even when the fall itself causes only minor injury. Wearable accelerometer-based fall detection systems have been commercially available for decades, with well-understood limitations: high false-positive rates from non-fall events such as sitting down suddenly, and missed events when the dedicated wrist-worn device is not actually being worn. Computer vision approaches — using depth cameras or radar sensors to classify body pose and detect falls — have shown improved specificity in controlled evaluations, though they introduce the privacy intrusion of continuous video surveillance. AI-based fall prediction, as distinct from detection, uses longitudinal gait analysis, balance assessment, and activity monitoring to identify individuals at elevated future fall risk and target them for preventive intervention — a paradigm that has shown promise in institutional settings but whose clinical implementation raises the same alert fatigue concerns as EHR-based prediction tools.
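A classic accelerometer heuristic, an impact spike followed by near-stillness, can be sketched as follows. The thresholds are illustrative, not clinically validated, and real systems add far more signal processing; the sketch simply makes the detection logic concrete.

```python
import math

def detect_fall(samples, impact_g=2.5, still_g=0.15, window=20):
    """Flag an acceleration spike followed by sustained near-stillness
    (magnitude near 1 g, i.e. lying motionless). Illustrative thresholds."""
    mags = [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]
    for i, m in enumerate(mags):
        if m > impact_g:
            after = mags[i + 1 : i + 1 + window]
            if len(after) == window and all(abs(a - 1.0) < still_g for a in after):
                return True
    return False

# simulated (x, y, z) accelerometer samples in g
walking = [(0.0, 0.0, 1.0 + 0.2 * math.sin(i / 3)) for i in range(100)]
fall = walking[:50] + [(0.5, 0.5, 3.0)] + [(0.0, 0.05, 1.0)] * 40

print(detect_fall(walking), detect_fall(fall))
```

The requirement of stillness after the spike is what suppresses false positives from abrupt but recoverable movements such as sitting down hard, the failure mode noted above.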
Social robots for older adults occupy a position at the intersection of affective computing, gerontology, and bioethics that makes them simultaneously promising and philosophically troubling. PARO, the robotic therapeutic seal developed by Takanori Shibata at Japan’s National Institute of Advanced Industrial Science and Technology, has accumulated a substantial evidence base for its efficacy in reducing agitation and loneliness in people with dementia in institutional settings, with randomised controlled trial evidence for improvements in mood and social engagement and reductions in psychoactive medication use. Stevie, developed at Trinity College Dublin, and Pepper, the Softbank humanoid robot, have been deployed in eldercare settings for medication reminders, video calling facilitation, and cognitive games. Sherry Turkle, in her long-term ethnographic work on human-robot interaction, raises a critique that the effectiveness literature does not resolve: when an older adult forms an emotional attachment to a robot companion, they are experiencing something that resembles care without receiving care — what Turkle calls the relationship we cannot feel but can simulate. The ethical question is whether simulated companionship that reduces measurable loneliness is a legitimate therapeutic intervention or a morally problematic substitute for the human care that social conditions deny.
A structural bias that is less discussed than race or gender discrimination in clinical AI but equally consequential is age discrimination in model development: the systematic underrepresentation of older adults in the research datasets from which clinical AI is developed. Clinical trials for AI-based medical devices, like pharmaceutical trials, often apply exclusion criteria that skew toward younger, healthier patients — partly because comorbidity complicates outcome attribution and partly because of implicit assumptions about the primary market for new technologies. The consequence is that models deployed in geriatric clinical contexts — where patients often have multiple chronic conditions, polypharmacy, atypical disease presentations, and different physiological parameter ranges — are evaluated in populations that do not represent them. Skin cancer classification models trained on younger adult dermatoscopic datasets perform worse on the distinctive presentation patterns of ageing skin. Speech-based cognitive screening tools calibrated on middle-aged adults may misclassify age-normal vocabulary changes as pathological. Correcting this structural bias requires not only technical solutions — stratified sampling, age-specific validation cohorts — but a prior commitment to including older adults as equal beneficiaries of the clinical AI development enterprise.
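Age-specific validation can be as simple as reporting a metric per age stratum rather than only in aggregate. A minimal sketch with invented toy data; the bin boundaries and the choice of accuracy as the metric are illustrative assumptions:

```python
def stratified_accuracy(y_true, y_pred, ages, bins=((0, 65), (65, 200))):
    """Report accuracy separately for each age stratum.

    All arguments are parallel lists; `bins` are [low, high) age ranges.
    An acceptable aggregate figure can mask a deficit in one stratum.
    """
    report = {}
    for lo, hi in bins:
        idx = [i for i, a in enumerate(ages) if lo <= a < hi]
        if not idx:
            continue
        correct = sum(y_true[i] == y_pred[i] for i in idx)
        report[f"{lo}-{hi}"] = correct / len(idx)
    return report

# Toy cohort: the model is right for every younger patient and wrong for
# every older one, yet aggregate accuracy is a respectable-looking 0.75.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
ages   = [40, 45, 50, 55, 60, 62, 70, 80]
print(stratified_accuracy(y_true, y_pred, ages))  # {'0-65': 1.0, '65-200': 0.0}
```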
Chapter 7: Neuroscience, Brain-Computer Interfaces, and AI
The relationship between artificial intelligence and neuroscience is bidirectional and historically deep. Artificial neural networks were explicitly inspired by the architecture of biological neural circuits — the perceptron by the Hebbian model of synaptic learning, the convolutional neural network by Hubel and Wiesel’s characterisation of the visual cortex’s receptive field hierarchy. But the field diverged: deep learning advanced through gradient descent and large-scale computation rather than through progressively refined biological realism, and neuroscience developed its own computational frameworks that had limited contact with machine learning practice for several decades. Marblestone, Wayne, and Kording’s 2016 Frontiers in Computational Neuroscience paper makes the case for renewed integration through three conceptual contributions. First, they argue that the brain’s diverse neural circuits can be understood as implementing cost function minimisation — that the apparent diversity of learning mechanisms across brain regions reflects different implicit objective functions rather than fundamentally different learning algorithms. Second, they analyse the parallels and disanalogies between artificial and biological neural networks in a systematic way, identifying which architectural features of deep learning systems have biological analogues and which do not. Third, and most ambitiously, they argue that progress in understanding biological intelligence and progress in building artificial intelligence are not merely parallel but mutually informative — that solving the neuroscience problem requires and will be accelerated by solving the AI problem.
Brain-computer interfaces (BCI) are devices that establish a direct communication channel between the brain and an external computing system, bypassing the normal efferent pathways of the peripheral nervous system. The clinical motivation is the restoration of communication and motor function to individuals with conditions — amyotrophic lateral sclerosis, spinal cord injury, brainstem stroke — that have severed the neural pathways connecting intention to action while leaving the cortical substrate of motor planning intact. The BrainGate consortium, a multi-institutional academic collaboration, has been the primary vehicle for translating invasive BCI from animal models to human clinical application. Hochberg et al.’s 2006 Nature paper, summarised in Hochberg’s 2008 NEJM essay “Turning Thought into Action,” reported the first demonstration in a human participant that the BrainGate microelectrode array — a 4×4 mm silicon substrate bearing 96 platinum-tipped electrodes implanted in the hand-knob area of motor cortex — could record neural population activity sufficient to control a computer cursor, operate a television, and modulate robot arm movements by imagining the corresponding physical actions. This result established the clinical proof-of-concept for invasive BCI and initiated a research programme that has progressively advanced communication rates and motor repertoire.
Neuralink, Elon Musk’s neurotechnology company, occupies an unusual position in the BCI landscape: technically sophisticated and well-funded, but operating with a gap between public communications and peer-reviewed scientific output that the academic BCI community has documented with concern. Neuralink’s device — a flexible polymer thread with 1,024 electrodes implanted by a neurosurgical robot — represents genuine engineering innovation in electrode count and implantation precision relative to the silicon arrays used by BrainGate. The company received FDA Breakthrough Device designation in 2023 and initiated the PRIME human clinical trial; the first human implantation was reported in January 2024. The clinical endpoints and performance data from the PRIME study have been communicated primarily through company press releases and public demonstrations rather than peer-reviewed publications, which limits independent evaluation. The controversy over the conduct of Neuralink’s animal trials — including regulatory citations for transport of potentially contaminated implants and, separately, concerns raised by former employees about experimental rigour — has further complicated assessment of the company’s scientific programme, independent of its technical capabilities.
AlphaFold, DeepMind’s transformer-based model for protein structure prediction, represents a contribution of AI to biomedicine that is structurally different from clinical decision support but arguably of comparable long-term significance. Jumper et al.’s 2021 Nature paper demonstrated that AlphaFold 2 predicted protein structures with atomic-level accuracy across the Critical Assessment of Protein Structure Prediction (CASP) benchmark — a problem that the structural biology community had been unable to solve computationally for fifty years. The implications for neuropharmacology are direct: the structure of neurodegenerative disease-associated proteins — tau, alpha-synuclein, TDP-43 — and their interaction partners can now be predicted and analysed at a resolution that was previously available only through years of crystallographic or cryogenic electron microscopy work. AlphaFold-enabled structure prediction does not identify drug candidates automatically, but it dramatically accelerates the rational drug design process by providing the three-dimensional structural context that is prerequisite to target-based drug discovery. In the domain of neurodegenerative disease, where the decades-long history of clinical trial failure has been partly attributed to the poor structural understanding of the relevant protein aggregates, this acceleration could be clinically consequential.
Chapter 8: AI Governance in Healthcare — Regulation, Evaluation, and Equity
The global regulatory landscape for clinical AI is fragmented, evolving, and not yet adequate to the pace of development. In the United States, the FDA’s Software as a Medical Device framework is the primary regulatory pathway, but its application is complicated by two features unique to AI-based devices relative to conventional medical technology. The first is the distinction between locked and adaptive algorithms: a locked algorithm, like a conventional diagnostic device, produces consistent outputs from the same inputs; an adaptive machine learning model may be retrained as new data accumulates, changing its behaviour over time in ways that a one-time regulatory clearance does not evaluate. The FDA’s 2021 action plan for AI/ML-based SaMD acknowledges this challenge and proposes the predetermined change control plan as a regulatory mechanism, but the plan’s implementation remains a work in progress. The second complication is performance across subgroups: regulatory submissions can demonstrate aggregate performance across a test dataset without demonstrating that performance is equivalent across the demographic subgroups that will be encountered in clinical deployment. The FDA has published guidance encouraging applicants to report performance stratified by age, sex, and race and ethnicity, but this guidance is not yet a mandatory requirement with standardised reporting formats.
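Stratified reporting of the kind this guidance encourages amounts to computing the headline metric within each subgroup rather than only on the pooled test set. A minimal sketch using a pairwise (Mann-Whitney) AUROC; the scores, labels, and subgroup assignments are invented:

```python
def auroc(scores, labels):
    """AUROC by pairwise comparison: the fraction of (positive, negative)
    pairs in which the positive case receives the higher score."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def subgroup_auroc(scores, labels, groups):
    """Report AUROC separately per subgroup: an acceptable pooled AUROC
    can coexist with a markedly worse AUROC in one subgroup."""
    out = {}
    for g in sorted(set(groups)):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        out[g] = auroc([scores[i] for i in idx], [labels[i] for i in idx])
    return out

# Invented data: perfect discrimination in group X, chance-level in group Y
scores = [0.9, 0.8, 0.2, 0.1, 0.6, 0.4, 0.7, 0.3]
labels = [1,   1,   0,   0,   1,   1,   0,   0]
groups = ["X"] * 4 + ["Y"] * 4
print(subgroup_auroc(scores, labels, groups))  # {'X': 1.0, 'Y': 0.5}
```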
The distinction between technical validation and clinical validation is among the most consequential methodological distinctions in clinical AI evaluation, and it is routinely collapsed in both the scientific literature and regulatory submissions. Technical validation asks whether an algorithm performs well on a dataset: does it classify images accurately, predict outcomes with high AUROC, or assign risk scores that correlate with outcomes? Clinical validation asks whether deploying the algorithm changes clinical behaviour and improves patient outcomes — a substantially harder question that requires prospective study in clinical settings. The CONSORT-AI and SPIRIT-AI reporting guidelines, published in Nature Medicine and BMJ in 2020, provide reporting standards for randomised trials involving AI interventions and for their protocols, respectively, and constitute the biomedical AI community’s most important methodological contribution to closing the technical-clinical validation gap. Despite these guidelines, the proportion of clinical AI studies that include prospective clinical outcome evaluation remains low: a systematic review by Nagendran et al. in BMJ found that the evidence base for clinical AI consisted overwhelmingly of retrospective technical validation studies, with very few randomised trials evaluating clinical outcomes, and that the handful of randomised trials that existed were generally small and at high risk of bias.
Algorithmic auditing in healthcare is an emerging practice that draws on frameworks developed in anti-discrimination law, financial regulation, and the emerging algorithmic accountability literature. New York City Local Law 144, enacted in 2021 to require bias audits of automated employment decision tools, provides one regulatory model: it requires that covered tools be evaluated by independent auditors for impact ratios across race and sex categories, and that audit results be publicly disclosed. Proposals to apply analogous requirements to clinical AI have been advanced by health equity researchers and AI governance scholars, with arguments that independent third-party auditing of deployed clinical AI systems — especially those used for resource allocation, triage, or insurance decisions — should be a regulatory requirement rather than a voluntary best practice. The practical challenges are substantial: clinical AI developers often treat model architectures and training data as proprietary trade secrets, which conflicts with the transparency requirements of meaningful independent auditing. The tension between intellectual property protection and public accountability for systems that make consequential health decisions about millions of patients is one the field has not resolved.
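The impact-ratio computation at the core of a Local Law 144-style audit is simple arithmetic: each group's selection rate is divided by the highest group rate, and ratios falling below the conventional four-fifths (0.8) threshold are flagged for scrutiny. A sketch with invented data; the group labels and outcomes are illustrative, and a clinical analogue might audit, say, rates of being flagged for a care-management programme:

```python
from collections import defaultdict

def impact_ratios(selected, groups):
    """Selection-rate impact ratios in the style of a four-fifths-rule
    bias audit.

    `selected` is a parallel list of 0/1 outcomes (e.g. flagged for an
    intervention); `groups` holds category labels. Each group's rate is
    divided by the highest group rate; values below 0.8 warrant review.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for s, g in zip(selected, groups):
        totals[g] += 1
        hits[g] += s
    rates = {g: hits[g] / totals[g] for g in totals}
    best = max(rates.values())
    return {g: r / best for g, r in rates.items()}

# Invented audit data: group A is selected at 0.75, group B at 0.25
selected = [1, 1, 1, 0, 1, 0, 0, 0]
groups   = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(impact_ratios(selected, groups))  # A: 1.0, B: ~0.33 — below 0.8
```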
The WHO’s 2021 Ethics and Governance of Artificial Intelligence for Health calls for transparent reporting of training data provenance, performance across demographic groups, and post-deployment monitoring; the call is ambitious and well-reasoned, and candid about the distance between aspiration and current practice. The document identifies four structural prerequisites for ethical AI in health that extend beyond algorithm design: regulatory capacity in health ministries that currently lack technical expertise to evaluate AI submissions; data governance infrastructure to manage the privacy, consent, and access dimensions of large-scale health data use; clinical and technical workforce training to enable both clinicians and patients to engage meaningfully with AI systems; and mechanisms for public participation in the governance decisions that shape which AI systems are developed and deployed. These structural prerequisites are unevenly distributed globally, with significant gaps in low- and middle-income countries that are, paradoxically, both the populations where AI-assisted healthcare extension has the greatest potential impact and the settings least equipped to regulate and evaluate the systems deployed in them.
Chapter 9: Futures — Precision Medicine, AI Surgery, and the Augmented Clinician
Precision medicine is the aspiration to tailor prevention, diagnosis, and treatment to the individual patient’s genomic, physiological, behavioural, and social profile rather than applying population-average guidelines derived from heterogeneous clinical trial populations. The National Institutes of Health All of Us Research Programme, launched in 2018 with a goal of enrolling one million diverse participants and collecting genomic, electronic health record, wearable sensor, and social determinant data, is the most ambitious infrastructure investment in precision medicine to date. Its explicit emphasis on diversity — it aims to include at least fifty percent of participants from racial and ethnic minority groups historically underrepresented in biomedical research — is a direct response to the well-documented skew of prior genomic research toward populations of European ancestry, which limits the applicability of polygenic risk scores and pharmacogenomic associations derived from those studies to non-European populations. Whether the All of Us data, once assembled and linked, will generate clinical insights of the magnitude its funding justifies remains to be seen; the history of precision medicine is populated with biomarkers that replicate at the population level but do not improve clinical outcomes when used to guide individual management.
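A polygenic risk score is, at its core, an additive sum: each variant's allele dosage multiplied by its estimated per-allele effect. The sketch below uses placeholder variant IDs and made-up effect sizes; real scores draw weights from GWAS summary statistics and can span millions of variants, and the portability problem described above arises precisely because those weights are estimated in a particular ancestry group:

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Additive polygenic risk score.

    `dosages` maps variant ID -> allele count (0, 1, or 2);
    `effect_sizes` maps variant ID -> per-allele effect (e.g. log odds
    ratio from a GWAS). The score is the weighted sum over variants.
    """
    return sum(dosages[v] * effect_sizes[v] for v in effect_sizes)

# Placeholder variants and invented effect sizes, for illustration only
effects  = {"rs_demo_1": 0.12, "rs_demo_2": -0.05, "rs_demo_3": 0.30}
genotype = {"rs_demo_1": 2, "rs_demo_2": 1, "rs_demo_3": 0}
print(polygenic_risk_score(genotype, effects))  # 2(0.12) + 1(-0.05) = 0.19
```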
AI in surgical robotics is advancing along a trajectory from surgeon-controlled teleoperation toward increasing automation of surgical subtasks, with regulatory and liability frameworks that have not yet caught up. The da Vinci Surgical System — the dominant platform for robotic-assisted minimally invasive surgery with over a million procedures per year — currently functions as a precision teleoperation system that amplifies and filters surgeon hand movements without autonomous behaviour. Shademan et al.’s 2016 Science Translational Medicine paper reported the first demonstration of an autonomous robotic system — the Smart Tissue Autonomous Robot (STAR) — performing supervised laparoscopic bowel anastomosis in a porcine model with greater consistency than human surgeons, as measured by anastomotic leak rates and suture spacing. The STAR system used near-infrared fluorescent tissue markers and computer vision tracking to compensate for tissue deformation, the primary challenge for autonomous soft-tissue surgery. The path from supervised animal trials to unsupervised human surgical autonomy involves not only technical challenges — real-time tissue tracking in the presence of bleeding and retraction, reliable failure mode detection — but regulatory and ethical challenges for which no established framework yet exists.
Federated learning and privacy-preserving AI represent a partial technical response to one of the most significant governance barriers in clinical AI development: the difficulty of assembling multi-institutional, diverse training datasets given the privacy, liability, and competitive concerns that impede centralised data sharing. In federated learning, model training is distributed across data custodians — hospitals, health systems, genomic biobanks — with gradient updates rather than raw patient data transmitted to a central aggregation server. The NVIDIA FLARE platform and the HealthChain consortium in Europe have demonstrated federated training of clinical imaging models across multiple institutions, with performance approaching that of centrally trained models on some tasks. The approach is not a complete privacy solution — gradient updates can leak information about training data through membership inference attacks, and the federated architecture requires trust that the aggregation server and participating nodes are not colluding — but it is a meaningful step toward enabling the multi-institutional diversity in training data that both technical performance and equity objectives require.
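The aggregation step at the heart of federated averaging (FedAvg) is a dataset-size-weighted mean of client parameters. A minimal sketch with plain Python lists standing in for model weight vectors; the three hospitals and their cohort sizes are invented:

```python
def federated_average(client_weights, client_sizes):
    """One FedAvg aggregation round.

    `client_weights` is a list of per-site parameter vectors (lists of
    floats); `client_sizes` gives each site's local dataset size. Only
    these parameter vectors, never raw patient records, leave a site.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Three hypothetical hospitals with different cohort sizes; larger sites
# pull the global model toward their local parameters.
weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes   = [100, 300, 600]
print(federated_average(weights, sizes))  # [4.0, 5.0]
```

The membership-inference caveat in the text applies here: even these averaged updates can leak information, which is why deployments layer on secure aggregation or differential privacy.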
The concept of the augmented clinician provides the synthesis toward which the course has been building. Topol’s vision — AI absorbs pattern recognition, clinicians reclaim the human dimensions of care — is compelling as an aspiration but requires more precise articulation of what the human dimensions of medicine consist of and why AI cannot or should not perform them. Empathy, the capacity to understand and share the emotional experience of another person, requires subjective experience that AI systems, as currently understood, do not possess. Communication — the negotiation of meaning and values in a relationship of care — is not reducible to information transfer and requires contextual judgment of a kind that large language models can approximate linguistically but not enact relationally. Moral reasoning in clinical contexts — weighing competing goods under uncertainty, navigating the irreducible particularity of individual patient circumstances — is a practice that develops through the experience of clinical relationships and cannot be separated from the care relationships that instantiate it. Algorithmic systems that replicate the surface form of moral reasoning without the relational context that grounds it may produce outputs that are linguistically indistinguishable from those of an experienced clinician while missing the ethical substance entirely. The augmented clinician framework does not require resolving philosophical debates about machine consciousness; it requires only the practical judgment that the specific human capacities most essential to good medicine — attentive listening, trust-building, moral navigation in relationships of vulnerability — are precisely those that current and foreseeable AI is least equipped to replicate, and that the value of AI in medicine consists in protecting those capacities from displacement by the pattern-recognition burden that currently overwhelms them.