BME 555: Artificial Intelligence in Health and Medicine

Estimated study time: 50 minutes

Table of contents

  • Why this course
  • Chapter 1: AI Enters the Clinic — Promises, Realities, and Stakes
  • Chapter 2: AI in Medical Imaging — Radiology, Pathology, and Dermatology
  • Chapter 3: Electronic Health Records, Predictive Analytics, and Clinical Decision Support
  • Chapter 4: AI, Public Health, and Epidemiology
  • Chapter 5: The Ethics of Clinical AI
  • Chapter 6: AI and Aging — Assistive Technology, Surveillance, and Dignity
  • Chapter 7: Neuroscience, Brain-Computer Interfaces, and AI
  • Chapter 8: AI Governance in Healthcare — Regulation, Evaluation, and Equity
  • Chapter 9: Futures — Precision Medicine, AI Surgery, and the Augmented Clinician

Why this course
UW’s biomedical-engineering and health programs cover individual AI-adjacent topics — BME 550 (Medical Imaging), BME 544 (Neural Interfaces), HLTH 333 (Healthcare Systems) — but no course unifies AI in clinical medicine, public health, aging, and neuroscience. This course follows Topol’s Deep Medicine, the Rajkomar–Dean–Kohane NEJM review, Esteva et al.’s dermatology paper, Obermeyer et al. on racial bias in clinical algorithms, the WHO’s 2021 Ethics and Governance of AI for Health, Marblestone–Wayne–Kording on neuroscience and deep learning, and Neuralink’s published BCI work. Its design draws on Stanford BIODS 220, MIT 6.S897, Harvard BMI 715, the Toronto MHI program, and Oxford’s EBM AI module.
  • Topol, Eric J. Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again. Basic Books, 2019.
  • Rajkomar, Alvin, Jeff Dean, and Isaac Kohane. “Machine Learning in Medicine.” New England Journal of Medicine 380 (2019): 1347–1358.
  • Esteva, Andre, et al. “Dermatologist-Level Classification of Skin Cancer with Deep Neural Networks.” Nature 542 (2017): 115–118.
  • Obermeyer, Ziad, et al. “Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations.” Science 366 (2019): 447–453.
  • Char, Danton S., Nigam H. Shah, and David Magnus. “Implementing Machine Learning in Health Care — Addressing Ethical Challenges.” New England Journal of Medicine 378 (2018): 981–983.
  • World Health Organization. Ethics and Governance of Artificial Intelligence for Health. WHO, 2021.
  • Marblestone, Adam H., Greg Wayne, and Konrad P. Kording. “Toward an Integration of Deep Learning and Neuroscience.” Frontiers in Computational Neuroscience 10 (2016): 94.
  • Hochberg, Leigh R. “Turning Thought into Action.” New England Journal of Medicine 359 (2008): 1175–1177.
  • Saria, Suchi, et al. “Integration of Early Physiological Responses Predicts Later Illness Severity in Preterm Infants.” Science Translational Medicine 2, no. 48 (2010).
  • Miotto, Riccardo, et al. “Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records.” Scientific Reports 6 (2016): 26094.
  • Beam, Andrew L., and Isaac S. Kohane. “Big Data and Machine Learning in Health Care.” JAMA 319, no. 13 (2018): 1317–1318.
  • Mnih, Volodymyr, et al. “Human-Level Control through Deep Reinforcement Learning.” Nature 518 (2015): 529–533.
  • Online resources: Stanford BIODS 220 course materials; MIT 6.S897 lecture notes; Harvard BMI 715 syllabus; Toronto MHI 1006 materials; WHO AI for Health resources (who.int/health-topics/artificial-intelligence); ClinicalTrials.gov for AI in medicine studies.

Chapter 1: AI Enters the Clinic — Promises, Realities, and Stakes

The history of artificial intelligence in medicine is not a story of sudden arrival but of long anticipation, periodic disappointment, and finally a period of genuine, if uneven, transformation. Eric Topol opens Deep Medicine with a diagnosis: medicine has become dehumanised. The clinician of the early twenty-first century spends an extraordinary fraction of the working day entering data into electronic records, clicking through administrative checkboxes, and performing pattern-recognition tasks — reading an electrocardiogram, interpreting a chest radiograph, scanning a pathology slide — that, in Topol’s view, neither require nor deserve the full attention of a trained physician. His central thesis is consequently counterintuitive: AI in medicine will not replace doctors but will give doctors time to be doctors again. By absorbing the pattern-recognition burden, AI systems free clinicians for the deeply human dimensions of care — listening, comforting, explaining, navigating the moral complexity of end-of-life decisions — that no algorithm can or should perform. This is an aspirational vision rather than a description of current practice, but it provides an organising principle for the field and a standard against which specific developments can be measured.

The intellectual lineage of medical AI has three distinct waves, each defined by the computational tools available and the clinical problems they targeted. The first wave, spanning roughly the 1970s through the 1990s, consisted of rule-based clinical decision support: expert systems such as MYCIN, which encoded physician knowledge as logical if-then rules to recommend antibiotic therapies for bacterial infections, and INTERNIST-1, which attempted differential diagnosis across a broad range of internal medicine problems. These systems were technically impressive for their era and demonstrated proof-of-concept for machine-augmented clinical reasoning, but they proved brittle — they could not generalise beyond the narrow domains their creators had manually encoded, and their maintenance burden grew prohibitive as medical knowledge expanded. The second wave brought statistical machine learning: logistic regression, Bayesian networks, support vector machines, and random forests applied to clinical prediction problems. These methods could learn from data rather than requiring manual rule specification, and they produced a generation of validated clinical prediction tools — the APACHE score for ICU mortality, the Framingham cardiovascular risk equations — that remain in use today. The third and current wave, inaugurated by the deep learning revolution beginning around 2012, applies hierarchical representation learning to the massive data sets that modern healthcare generates: medical images, electronic health records, genomic sequences, continuous physiological monitoring streams, and clinical notes.

Medicine is simultaneously one of the most promising and one of the most dangerous domains for the deployment of artificial intelligence, and understanding both sides of that duality is foundational to the course. The promise is grounded in data scale: modern hospitals generate hundreds of terabytes of imaging data per year; electronic health records accumulate decades of clinical observations on millions of patients; genomic sequencing costs have fallen from billions of dollars per genome to hundreds. These data streams, if properly curated and linked, constitute a learning resource of extraordinary richness that dwarfs the implicit experience of any individual clinician. The danger is equally grounded in context: clinical decisions are high-stakes, frequently irreversible, and embedded in relationships of trust and vulnerability. When a streaming recommendation algorithm suggests the wrong film, the consequence is mild disappointment. When a sepsis prediction algorithm fails to fire in the early hours of a patient’s deterioration, or fires so frequently that clinicians learn to ignore it, the consequences can be catastrophic. Beam and Kohane, writing in JAMA, argue that medicine’s combination of high stakes, complex multi-modal inputs, severe error consequences, and historically entrenched systemic inequalities makes it both the domain where AI can do the most good and the domain where insufficient rigour can do the most harm.

The black box problem: Deep learning systems — convolutional neural networks, transformer architectures, recurrent networks — learn representations through millions of gradient descent updates on high-dimensional data. The resulting models can achieve remarkable predictive accuracy, but their internal representations are not interpretable by human inspection. A radiologist who judges a scan abnormal can, in principle, articulate which features motivated the judgment; a convolutional network that makes the same judgment does so through billions of weighted arithmetic operations that have no natural linguistic description. In clinical AI, this opacity raises three distinct concerns. First, it makes error analysis difficult: when the system is wrong, it is often not possible to identify which aspect of its reasoning failed. Second, it complicates clinical trust: clinicians trained in evidence-based reasoning find it difficult to integrate recommendations from systems they cannot interrogate. Third, it creates liability ambiguity: in a healthcare system built on documented clinical rationale, a recommendation whose justification cannot be articulated creates legal exposure for both the deploying institution and the responsible clinician.

Rajkomar, Dean, and Kohane, in their landmark 2019 New England Journal of Medicine review, map the landscape of machine learning in medicine across three primary data modalities: medical images, clinical time series, and text. Their review is notable not only for its breadth but for its intellectual honesty about the gap between benchmark performance and clinical utility. A model that achieves state-of-the-art accuracy on a curated benchmark dataset, they observe, may perform substantially worse when deployed in a different hospital with different imaging equipment, a different patient population, or different data-entry practices. The distribution shift problem — the divergence between the distribution of the development data and the deployment context — is not a technical footnote but a fundamental challenge for clinical AI that no amount of benchmark optimisation resolves. The course proceeds through medical imaging, EHR analytics, public health, ethics, aging, neuroscience, brain-computer interfaces, governance, and futures — with this foundational tension between laboratory performance and real-world clinical utility as a constant reference point.
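
To make the distribution shift problem concrete, the sketch below trains a logistic model at a development site and evaluates it at a deployment site. It is entirely synthetic: the invented coefficient vectors stand in for two hospitals whose case mix and coding practices differ, and only the qualitative internal-versus-external gap mirrors the phenomenon the review describes.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)

    def make_site(n, coefs):
        # Five generic features; the feature-outcome relationship differs
        # between sites (different equipment, coding practices, case mix).
        X = rng.normal(size=(n, 5))
        p = 1 / (1 + np.exp(-(X @ coefs)))
        return X, (rng.random(n) < p).astype(int)

    X_dev, y_dev = make_site(6000, np.array([1.0, -0.5, 0.8, 0.0, 0.3]))  # development site
    X_ext, y_ext = make_site(3000, np.array([0.3, -0.5, 0.8, 1.2, 0.3]))  # deployment site

    model = LogisticRegression(max_iter=1000).fit(X_dev[:4000], y_dev[:4000])
    auc_int = roc_auc_score(y_dev[4000:], model.predict_proba(X_dev[4000:])[:, 1])
    auc_ext = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])
    print(f"internal AUROC {auc_int:.2f}, external AUROC {auc_ext:.2f}")  # external is lower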


Chapter 2: AI in Medical Imaging — Radiology, Pathology, and Dermatology

The most technically mature and extensively validated applications of clinical AI concern medical imaging. The reasons are structural: medical images are high-dimensional but regularly formatted data; large labelled datasets have been assembled through retrospective review of clinical archives; and performance can be benchmarked against the ground-truth judgments of expert clinicians in ways that are more straightforward than for many clinical prediction tasks. The so-called ImageNet moment in medical imaging arrived in the mid-2010s, when researchers began applying the deep convolutional architectures that had revolutionised natural image recognition to chest radiographs, retinal fundus photographs, histopathology slides, and dermatoscopic images. Rajpurkar et al.’s CheXNet achieved radiologist-equivalent performance on pneumonia detection from chest X-rays. Gulshan et al., working at Google, demonstrated that a convolutional network trained on 128,175 retinal fundus images could detect referable diabetic retinopathy with sensitivity and specificity meeting or exceeding those of ophthalmologists — a result with enormous implications for diabetic retinopathy screening in populations without consistent access to specialist care.

The canonical demonstration of imaging-level clinical AI performance is the Esteva et al. Nature paper published in 2017. The study trained a single Inception-v3 convolutional neural network on 129,450 clinical images spanning 2,032 different skin disease classes, and then evaluated its performance on a test set of biopsy-confirmed lesions against 21 board-certified dermatologists. The network classified keratinocyte carcinomas versus benign seborrhoeic keratoses and malignant melanomas versus benign naevi at a level of accuracy that matched or exceeded the dermatologist panel. The methodological rigour of the comparison — using receiver operating characteristic curves rather than a single operating point, and testing against clinicians with varying levels of experience — made it a landmark. Esteva et al. conclude that their results demonstrate the feasibility of a smartphone-based skin cancer screening tool capable of providing expert-level guidance in primary care and community settings. The paper was immediately recognised as a proof-of-concept for a broader programme of AI-based specialist-level clinical decision support accessible outside specialist settings.
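
The engineering recipe behind the Esteva result (take an ImageNet-pretrained Inception-v3, replace its classification head, and fine-tune on labelled clinical images) can be sketched in a few lines of PyTorch. The fragment below is illustrative rather than a reproduction of the study: the dataset directory is a hypothetical placeholder, and the loop omits the augmentation, validation, and class-weighting machinery a real study requires.

    import torch
    import torch.nn as nn
    from torchvision import datasets, models, transforms

    NUM_CLASSES = 2032  # disease classes in the paper's taxonomy

    tfm = transforms.Compose([
        transforms.Resize(342),
        transforms.CenterCrop(299),  # Inception-v3 expects 299x299 inputs
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    # "derm_images/train" is a hypothetical ImageFolder layout, one folder per class.
    train_ds = datasets.ImageFolder("derm_images/train", transform=tfm)
    loader = torch.utils.data.DataLoader(train_ds, batch_size=32, shuffle=True)

    model = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # replace the classifier head
    model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, NUM_CLASSES)

    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for images, labels in loader:
        opt.zero_grad()
        out, aux = model(images)  # training mode returns main and auxiliary logits
        loss = loss_fn(out, labels) + 0.4 * loss_fn(aux, labels)
        loss.backward()
        opt.step()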

The limitations of the Esteva result became apparent in subsequent work and deserve close attention as a case study in the generalisability problem. The training data drew overwhelmingly from lighter-skinned patient populations, reflecting both the demographics of the clinical archives used and the historical underrepresentation of darker skin tones in dermatological research. A 2018 viewpoint by Adamson and Smith in JAMA Dermatology noted that the majority of publicly available dermatoscopic image datasets contained fewer than five percent of images from patients with Fitzpatrick skin types V and VI. Studies evaluating commercial AI dermatology tools on darker skin tones have consistently found substantially degraded performance — precisely in the populations where access to specialist dermatological care is most limited. The irony is acute: the populations for whom a smartphone-based expert-level screening tool would be most beneficial are those for whom existing tools perform worst. This is a specific instance of a general pattern that Buolamwini and Gebru documented in facial recognition: systems trained on homogeneous datasets encode the biases of those datasets into their predictions, and those biases compound existing inequities rather than reducing them.

The question of whether clinical AI augments or replaces radiologists has become a structuring debate in the field, and the current weight of evidence favours augmentation over replacement. McKinney et al.’s 2020 Nature paper on AI screening mammography — evaluated on test sets of 25,856 women in the UK and 3,097 in the US — found that the AI system reduced false positives by 5.7 percent and false negatives by 9.4 percent in the US data (1.2 and 2.7 percent respectively in the UK data) compared to standard care, and matched or exceeded individual radiologist performance. Crucially, however, the study did not prospectively test the combination of AI and radiologist against radiologist alone, leaving open the question of whether the optimal deployment model is AI-assisted reading, AI triage, or sequential double reading. A growing body of evidence from colonoscopy, chest CT, and ophthalmology suggests that the AI-plus-radiologist combination consistently outperforms either AI or radiologist alone on detection tasks, a result that aligns with Topol’s augmentation thesis: the value of AI in imaging is not to replace human expert judgment but to provide a complementary signal that reduces the error modes unique to human perception — fatigue, anchoring, satisfaction-of-search.

The regulatory landscape governing AI-based medical imaging tools has evolved rapidly. In the United States, AI and machine learning-based software that meets the definition of Software as a Medical Device (SaMD) is subject to FDA oversight under the 510(k) or De Novo pathways, which require demonstration of substantial equivalence to a predicate device or novel safety and effectiveness data respectively. The FDA authorised its first autonomous AI diagnostic for medical imaging — IDx-DR, which detects diabetic retinopathy from fundus photographs — in 2018 via the De Novo pathway, the first regulatory authorisation of an AI diagnostic system that does not require clinician input to interpret results. The FDA’s 2021 action plan for AI/ML-based SaMD introduced the concept of a predetermined change control plan: a regulatory pathway that allows manufacturers to specify in advance the types of model updates and retraining that can occur without requiring a new regulatory submission, addressing the dynamic nature of continuously learning clinical AI systems. The EU Medical Device Regulation (MDR) applies similar conformity assessment requirements through a CE marking process, and the EU AI Act — which classifies medical AI as high risk — adds additional transparency and documentation requirements that are likely to shape global practice as manufacturers seek unified regulatory strategies.


Chapter 3: Electronic Health Records, Predictive Analytics, and Clinical Decision Support

The electronic health record (EHR) is simultaneously the most data-rich and the most methodologically treacherous substrate for clinical AI. A modern hospital EHR contains structured data — diagnoses encoded in ICD-10, medications in RxNorm, laboratory values with timestamps — alongside unstructured clinical notes, imaging reports, and procedure logs. Linked over years or decades for a large patient population, this constitutes a longitudinal observational dataset of formidable richness. The challenge is that EHR data is a record of clinical decisions, not a direct observation of disease. What appears in the record is shaped by which patients sought care, which clinicians ordered which tests, what the prevailing coding incentives were, and how the documentation culture of the institution evolved over time. Missing data in EHR is not random: the absence of a laboratory measurement often means the clinician did not think it necessary to order, not that the test was ordered and yielded a normal result. This distinction, familiar to epidemiologists as informative missingness — data that are missing not at random — is easy to overlook but profoundly affects what machine learning models trained on EHR data are actually learning.

Miotto et al.’s Deep Patient (2016), trained on the de-identified EHR records of 700,000 patients at the Icahn School of Medicine at Mount Sinai, demonstrated that an unsupervised deep autoencoder could learn a low-dimensional representation of patient health state from raw EHR data — diagnoses, medications, and procedures — that was substantially more predictive of future diagnoses than features engineered by clinical experts. The Deep Patient representation improved prediction of future liver disease, psychosis, and cancer, among dozens of other outcomes, suggesting that deep learning can discover clinically meaningful latent structure in messy, heterogeneous EHR data that eludes conventional feature engineering. The paper was widely cited as a demonstration of the potential of unsupervised representation learning for clinical prediction. It was also frank about interpretability: the authors reported that the learned representation was difficult to interpret and that its clinical utility for individual patient management — as opposed to population-level risk stratification — remained unclear, a tension that recurs throughout EHR-based clinical AI.
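
A minimal sketch of the Deep Patient idea, compressing a sparse patient-by-concept matrix through a denoising autoencoder and reusing the latent code as features, might look as follows in PyTorch. The data are random binary stand-ins for EHR codes, and the architecture sizes are invented, not those of the Mount Sinai system.

    import torch
    import torch.nn as nn

    n_patients, n_concepts, latent = 1000, 500, 50
    X = (torch.rand(n_patients, n_concepts) < 0.05).float()  # sparse synthetic code matrix

    encoder = nn.Sequential(nn.Linear(n_concepts, 200), nn.Sigmoid(),
                            nn.Linear(200, latent), nn.Sigmoid())
    decoder = nn.Sequential(nn.Linear(latent, 200), nn.Sigmoid(),
                            nn.Linear(200, n_concepts))
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()

    for epoch in range(50):
        noisy = X * (torch.rand_like(X) > 0.05).float()  # randomly mask inputs (denoising)
        recon = decoder(encoder(noisy))
        loss = loss_fn(recon, X)  # reconstruct the uncorrupted record
        opt.zero_grad()
        loss.backward()
        opt.step()

    patient_repr = encoder(X).detach()  # latent features for downstream risk prediction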

Saria et al.’s work on the neonatal intensive care unit constitutes one of the earliest and most methodologically careful demonstrations of the predictive health paradigm: the use of continuously collected physiological time series to predict clinical deterioration before it becomes clinically apparent. Their 2010 Science Translational Medicine paper showed that machine learning analysis of heart rate variability, respiratory patterns, and temperature in the first hours of life could predict which preterm infants would go on to develop severe illness, including late-onset sepsis, well before clinical presentation — a lead time of potentially enormous value in a population where sepsis progression can be rapid and devastating. The underlying biological insight is that physiological systems exhibit subtle dynamical changes during early infection — reduced heart rate variability is a well-documented early signal — that are below the threshold of clinical notice but detectable by algorithms sensitive to time-series structure. This work established the template for a family of ICU early warning systems that now includes prediction models for sepsis, acute kidney injury, deterioration, and mortality, and that are deployed, with varying degrees of clinical uptake, in hospitals worldwide.
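
The kind of signal such systems consume can be illustrated with a toy computation: a rolling estimate of heart rate variability (here SDNN, the standard deviation of inter-beat intervals) that collapses before a simulated deterioration. The series, window, and alarm threshold are all invented; deployed models learn far richer multivariate dynamics from outcome-labelled data.

    import numpy as np

    rng = np.random.default_rng(0)
    # One summary inter-beat interval (ms) per minute for a synthetic day.
    rr_ms = 400 + rng.normal(0, 15, size=24 * 60)
    rr_ms[-180:] = 400 + rng.normal(0, 4, size=180)  # variability collapses in the last 3 hours

    window = 60  # one-hour rolling window
    sdnn = np.array([rr_ms[i - window:i].std() for i in range(window, len(rr_ms))])
    alarm = sdnn < 8.0  # crude threshold; real systems learn this from outcome data
    print("first alarm at minute:", int(np.argmax(alarm)) + window if alarm.any() else None)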

Alert fatigue: The systematic reduction in clinical responsiveness to computer-generated alerts that occurs when alert volumes are high, alert specificity is low, and workflow integration is poor. Alert fatigue is among the most consequential barriers to effective clinical decision support deployment. A 2014 study in the Journal of the American Medical Informatics Association found that physicians overrode 90 percent of drug–drug interaction alerts in a major academic medical centre EHR — not because the interactions were clinically insignificant in all cases, but because the alert system generated so many low-priority warnings that clinicians had learned to dismiss them reflexively. The Epic Sepsis Model (ESM), a widely deployed commercial sepsis prediction tool, was evaluated by Sendak et al. and by Wong et al. in independent validation studies that found sensitivity and positive predictive value substantially below those reported in the model's development cohort. Critically, both studies found high alert volumes and low specificity, conditions predictive of alert fatigue. The ESM case illustrates the full lifecycle challenge of clinical AI: a model can be technically proficient on its development dataset and yet, when deployed in the heterogeneous complexity of real clinical workflows, generate more noise than signal.
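
The arithmetic behind alert fatigue is worth making explicit: at the low prevalence typical of inpatient sepsis, even a reasonably sensitive and specific model produces mostly false alarms. The numbers below are illustrative and not drawn from any particular deployed model.

    def ppv(sensitivity, specificity, prevalence):
        # Positive predictive value via Bayes' rule.
        true_pos = sensitivity * prevalence
        false_pos = (1 - specificity) * (1 - prevalence)
        return true_pos / (true_pos + false_pos)

    # Assume sepsis affects ~2% of monitored encounters (illustrative).
    print(ppv(0.80, 0.90, 0.02))  # ~0.14: roughly six of seven alerts are false alarms
    print(ppv(0.80, 0.99, 0.02))  # ~0.62: specificity, not sensitivity, drives alert burden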

The last mile problem in clinical AI refers to the gap between an algorithm that is technically accurate and an algorithm that is clinically useful — a gap that is often wider than technical performance metrics suggest. A sepsis prediction model that fires appropriately on 80 percent of true sepsis cases achieves nothing if the resulting alert appears in a corner of the EHR interface that busy nurses do not notice, if the recommended response action is not clearly specified, if the alert fires at a level of acuity where clinical protocols already specify management, or if the clinical team’s workflow does not include a designated responder for AI alerts. The science of clinical AI implementation — understanding how to integrate algorithmic outputs into clinical workflows in ways that actually change clinician behaviour and improve outcomes — is a research domain in its own right, and one that lags substantially behind the technical development of the algorithms. Rajkomar, Dean, and Kohane argue that this implementation gap, not technical performance limitations, is the primary constraint on clinical AI value in the near term, a judgment that aligns with the broader implementation science literature on clinical decision support.


Chapter 4: AI, Public Health, and Epidemiology

The application of machine learning to public health and epidemiology encompasses a spectrum of tasks: disease surveillance and outbreak detection, risk factor identification, intervention targeting, and pandemic response — and the track record across these applications is strikingly uneven. The cautionary tale most frequently cited is Google Flu Trends, launched in 2008 on the premise that the volume of flu-related search queries on Google could serve as a real-time proxy for influenza incidence, providing a two-week lead on CDC surveillance data. For several seasons, Google Flu Trends performed impressively. Then, in 2012–13, it overestimated flu prevalence by a factor of two. Lazer et al.’s 2014 Science analysis identified the cause: the Google search algorithm had itself changed, promoting flu-related health news content that drove search traffic independent of actual flu activity. The model had been trained on an assumption of stable relationships between search behaviour and disease prevalence, but search behaviour is shaped by the search algorithm itself — a feedback loop to which epidemiological surveillance models based on internet data are structurally vulnerable. Epidemiological surveillance based on digital traces requires not only statistical sophistication but an understanding of the data-generating process that includes the behavioural and platform dynamics shaping what users search, tweet, or report.

The SARS-CoV-2 pandemic generated what may be the largest rapid proliferation of clinical AI models in history: hundreds of prognostic models for COVID-19 severity, mortality, and ICU admission were developed and published within months of the outbreak’s onset. The systematic review by Wynants et al., published in BMJ in 2020 and updated through subsequent waves, examined 169 prediction models and found that the vast majority were at high risk of bias due to poor reporting, small sample sizes, inappropriate exclusion criteria, and inadequate handling of missing data. Almost none had been externally validated in independent populations, and none met the review’s bar for recommendation for clinical use. The COVID-19 AI modelling episode is a case study in how the combination of data availability, publication pressure, and genuine clinical urgency can produce a literature of models that are technically complex but scientifically fragile — and how the systematic review and reporting-guideline infrastructure that the evidence-based medicine movement built for clinical trials had not yet been fully adapted to the evaluation of clinical prediction models.

Contact tracing apps represent a different point in the public health AI landscape — one where the technical design challenge is inseparable from the privacy design challenge. The DP-3T protocol (Decentralised Privacy-Preserving Proximity Tracing), developed by Troncoso et al. and adopted as the basis for the Apple/Google Exposure Notification (EN) API, uses Bluetooth low-energy broadcasts and locally stored rolling identifiers to detect close contacts without centralising contact graphs on a server. The architecture was explicitly designed to resist re-identification: the EN API does not transmit location data, does not reveal which specific individual a user was exposed to, and computes risk scores on-device. The epidemiological effectiveness of EN-based apps proved difficult to measure; randomised evaluation was not performed, and observational analyses yielded inconsistent estimates. The DP-3T/EN case illustrates a recurring tension in public health AI: the privacy-protective design that makes a surveillance tool politically acceptable may reduce its epidemiological power, and the optimal trade-off between surveillance effectiveness and civil liberty protection is not a technical question but a political one that algorithms cannot resolve.
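
The decentralised pattern can be conveyed with a deliberately simplified sketch in which a device derives short-lived rolling identifiers from a local secret, and only the keys of confirmed-positive users are ever published. This is not the DP-3T specification; the key schedule and constants here are invented for illustration.

    import hashlib
    import hmac
    import os

    def next_day_key(day_key):
        # Ratchet: tomorrow's key is a hash of today's, so a published key
        # reveals nothing about earlier days.
        return hashlib.sha256(day_key).digest()

    def rolling_ids(day_key, n_epochs=96):
        # One ephemeral identifier per epoch (e.g., 15 minutes), derived
        # from the day key; observers cannot link IDs across epochs.
        prf = hmac.new(day_key, b"broadcast-key", hashlib.sha256).digest()
        return [hmac.new(prf, i.to_bytes(2, "big"), hashlib.sha256).digest()[:16]
                for i in range(n_epochs)]

    sk_today = os.urandom(32)            # device-local secret, never leaves the phone
    broadcast = rolling_ids(sk_today)    # overheard and stored locally by nearby devices
    sk_tomorrow = next_day_key(sk_today)
    # Exposure check: a phone recomputes IDs from the published keys of
    # positive users and intersects them with the IDs it overheard.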

AI for global health introduces a further dimension of inequity that domestic applications can obscure. Diabetic retinopathy screening using AI-based fundus photograph analysis has been piloted in multiple low- and middle-income country settings — India, Thailand, Mexico — with the explicit motivation that AI can extend specialist-equivalent screening to populations without access to ophthalmologists. The equity promise is genuine: diabetic retinopathy is a leading cause of preventable blindness globally, and the treatment is effective when applied early. The equity challenge is equally genuine: the models deployed in these settings were trained predominantly on data from high-income country patient populations with different baseline retinopathy severity distributions, different fundus camera characteristics, and different image quality profiles than those encountered in low-resource field settings. Studies evaluating commercial retinopathy AI tools in sub-Saharan African contexts have found reduced sensitivity for the specific retinopathy features most prevalent in those populations. The principle that Obermeyer et al. articulate for domestic clinical AI — that models trained on data reflecting access inequities tend to underserve those with less prior healthcare utilisation — generalises globally: models trained on data from high-resource settings may systematically fail the populations they are positioned as helping.

Precision public health is the extension of precision medicine logic — using multi-omic, behavioural, and social data to stratify risk and target interventions at the individual level — to population health programmes. The intellectual case is straightforward: if high-risk individuals can be identified with sufficient accuracy before they experience a negative health event, targeted preventive intervention can be both more effective and more efficient than universal programmes. The challenge is that this framing may obscure the structural determinants of population health — poverty, housing, air quality, occupational exposure, historical disinvestment — that individual-level risk prediction and targeting cannot address. A precision public health algorithm that identifies individuals at high risk of asthma hospitalisation and targets them for inhaler adherence support does not change the air quality in the industrial neighbourhood where they live. Krieger’s ecosocial theory and Rose’s classical public health argument for population-level over individual-level intervention both suggest that the precision public health paradigm, however technically sophisticated, may systematically misidentify where leverage lies in population health improvement.


Chapter 5: The Ethics of Clinical AI

Char, Shah, and Magnus, writing in the New England Journal of Medicine in 2018, identified four ethical challenges posed by clinical machine learning, in a structural framing that has become canonical in the biomedical ethics literature. The first is the opacity challenge, which they frame not merely as a technical limitation but as an ethical one: a clinical recommendation that cannot be explained to the patient or justified to the clinician disrupts the informed consent process, undermines shared decision-making, and conflicts with the norms of evidence-based practice. The second is the bias challenge: training data that does not represent the population to which a model will be applied encodes historical patterns of differential access and differential treatment into the model’s predictions. The third is the feedback loop challenge: when deployed AI systems generate data — prescription patterns, referral decisions, test orders — that is used to retrain or validate subsequent model versions, the system risks amplifying rather than correcting its initial biases. The fourth is the consent challenge: using patient data for model development raises questions about the scope of the consent patients gave when their data was collected, and about whether patients have a right to opt out of data uses they did not anticipate.

The Obermeyer et al. Science 2019 paper provides the most powerful and precise illustration of racial bias in a deployed clinical AI system yet published, and its analytical architecture deserves careful attention. The study examined a widely used commercial algorithm — deployed across hundreds of US hospitals to identify patients who would benefit from enrolment in high-complexity care management programmes — that used predicted healthcare costs as a proxy for health needs. The central finding was that, controlling for predicted cost, Black patients were substantially sicker than white patients: at any given risk score generated by the algorithm, Black patients had more active chronic conditions than white patients with the same score. The consequence was that the algorithm systematically underestimated the severity of illness in Black patients, effectively requiring Black patients to be significantly sicker than white patients to receive the same level of care management. Obermeyer et al. demonstrate that the mechanism was the proxy variable: Black patients with the same health needs as white patients incurred substantially lower healthcare costs, because they had less access to care. Using cost as a proxy for health needs therefore imported the access disparity directly into the risk score.

Proxy variable bias: A form of algorithmic bias in which the outcome variable used to train a predictive model is a proxy for the true quantity of interest, and the relationship between the proxy and the true quantity differs systematically across demographic groups. In the Obermeyer et al. case, healthcare cost is a proxy for healthcare need, but the relationship between cost and need is mediated by healthcare access, which differs by race. The bias does not require any intent to discriminate and does not require explicitly using race as a feature; it arises through the structural relationship between race, access, and cost in the US healthcare system. Obermeyer et al. estimate that correcting the proxy variable — replacing predicted cost with a direct measure of comorbidity burden — would increase the fraction of Black patients enrolled in care management from 17.7 percent to 46.5 percent at the threshold previously used. The paper establishes that proxy variable selection is not a statistical nicety but an ethical decision with direct consequences for health equity.
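
The mechanism is easy to reproduce in a toy simulation: give two groups identical distributions of true need, suppress observed cost for one group through an access factor, and compare who is selected when ranking by cost versus ranking by need. All parameters are invented; only the qualitative effect mirrors the paper.

    import numpy as np

    rng = np.random.default_rng(42)
    n = 100_000
    group_b = rng.random(n) < 0.5                    # half the population
    need = rng.gamma(2.0, 1.0, size=n)               # true comorbidity burden, identical by group
    access = np.where(group_b, 0.6, 1.0)             # group B receives less care per unit need
    cost = need * access * rng.lognormal(0, 0.3, n)  # observed spending

    k = int(0.03 * n)                                # programme capacity: top 3 percent
    by_cost = np.argsort(-cost)[:k]
    by_need = np.argsort(-need)[:k]
    print("group B share, cost proxy:", round(float(group_b[by_cost].mean()), 3))   # well below 0.5
    print("group B share, direct need:", round(float(group_b[by_need].mean()), 3))  # ~0.5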

The informed consent landscape for AI-assisted clinical care remains unsettled both normatively and practically. Current practice in most health systems does not specifically disclose to patients when AI systems contribute to their diagnosis or treatment recommendations, on the grounds that clinicians routinely use many computational tools — laboratory reference ranges, pharmacokinetic calculators, clinical prediction scores — without specific disclosure. Critics argue that AI diagnostic tools are categorically different: they operate at a level of complexity and autonomy that makes meaningful clinician verification difficult, they are proprietary and opaque in ways that laboratory instruments are not, and their potential for demographic disparate impact creates a specific duty of disclosure. Patient attitude surveys — conducted by Longoni et al. and others — consistently find that patients prefer human decision-making and feel less satisfied with diagnoses attributed to AI, a finding that interacts with the disclosure question in complex ways: transparency that improves informed consent may simultaneously reduce therapeutic confidence.

Liability and accountability for AI-assisted clinical harm remain unresolved in most legal systems. The EU AI Act’s classification of medical AI as high-risk imposes pre-market conformity assessment, post-market monitoring, and documentation requirements that may clarify the manufacturer’s obligations without fully resolving the allocation of responsibility between manufacturer, deploying institution, and treating clinician. The WHO’s 2021 Ethics and Governance of Artificial Intelligence for Health articulates six normative principles — protecting autonomy, promoting well-being, ensuring transparency, fostering responsibility, ensuring inclusiveness and equity, and promoting sustainable and responsive AI — that represent a global consensus aspiration. The WHO document is notable for its emphasis on the systemic conditions required for ethical AI in health: regulatory capacity, data governance infrastructure, workforce training, and public participation in AI governance decisions. These structural prerequisites are unevenly distributed globally, and the WHO’s honest acknowledgment of this gap distinguishes its framework from more abstract bioethical analyses that treat ethics as a matter of principle rather than institutional capacity.


Chapter 6: AI and Aging — Assistive Technology, Surveillance, and Dignity

Population aging is one of the defining demographic facts of the twenty-first century. In OECD countries, the fraction of the population aged sixty-five and over is projected to exceed twenty percent by 2040, and the ratio of working-age adults to older adults is declining rapidly in most high-income nations. The resulting pressure on formal geriatric care systems — already strained by workforce shortages and funding constraints — has generated substantial interest in AI-assisted technologies that could support independent living, monitor health status, and detect deterioration in older adults without requiring constant human supervision. Aging in place — the goal of enabling older adults to remain in their own homes as long as possible — is both a stated preference of most older adults and a policy objective for healthcare systems seeking to reduce the cost of institutional care. AI technologies are positioned as an enabling infrastructure for aging in place, providing the monitoring and support capabilities that make it safe without requiring continuous human presence.

Fall detection and prediction is among the most clinically consequential and technically developed applications of AI in aging. Falls are the leading cause of injury-related death and disability in adults over sixty-five, and the consequences of an undetected fall — lying on the floor unable to summon help — can be severe even when the fall itself causes only minor injury. Wearable accelerometer-based fall detection systems have been commercially available for decades, with well-understood limitations: high false-positive rates from non-fall events such as sitting down suddenly, and detection latency from the requirement for a specific wrist-worn device. Computer vision approaches — using depth cameras or radar sensors to classify body pose and detect falls — have shown improved specificity in controlled evaluations, though they introduce the privacy intrusion of continuous video surveillance. AI-based fall prediction, as distinct from detection, uses longitudinal gait analysis, balance assessment, and activity monitoring to identify individuals at elevated future fall risk and target them for preventive intervention — a paradigm that has shown promise in institutional settings but whose clinical implementation raises the same alert fatigue concerns as EHR-based prediction tools.
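
A minimal threshold-based detector conveys the basic signal-processing idea: a high-impact spike in acceleration magnitude followed by a near-stationary period. The thresholds are illustrative; commercial systems learn them, and their many exceptions, from labelled data.

    import numpy as np

    def detect_fall(acc_xyz, fs=50, impact_g=2.5, still_g=0.15):
        # acc_xyz: (n_samples, 3) acceleration in g at sampling rate fs (Hz).
        mag = np.linalg.norm(acc_xyz, axis=1)
        for i in np.where(mag > impact_g)[0]:        # candidate impact samples
            after = mag[i + fs : i + 3 * fs]         # 1 to 3 seconds after impact
            if len(after) and np.abs(after - 1.0).max() < still_g:
                return True                          # impact followed by lying still
        return False

    # Synthetic 5-second trace: quiet standing, one impact spike, then stillness.
    rng = np.random.default_rng(1)
    trace = np.tile([0.0, 0.0, 1.0], (250, 1)) + rng.normal(0, 0.02, (250, 3))
    trace[100] = [2.0, 1.5, 2.0]                     # |a| ~ 3.2 g
    print(detect_fall(trace))                        # True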

Social robots for older adults occupy a position at the intersection of affective computing, gerontology, and bioethics that makes them simultaneously promising and philosophically troubling. PARO, the robotic therapeutic seal developed by Takanori Shibata at Japan’s National Institute of Advanced Industrial Science and Technology, has accumulated a substantial evidence base for its efficacy in reducing agitation and loneliness in people with dementia in institutional settings, with randomised controlled trial evidence for improvements in mood and social engagement and reductions in medication use. Stevie, developed at Trinity College Dublin, and Pepper, the SoftBank humanoid robot, have been deployed in eldercare settings for medication reminders, video calling facilitation, and cognitive games. Sherry Turkle, in her long-term ethnographic work on human-robot interaction, raises a critique that the effectiveness literature does not resolve: when an older adult forms an emotional attachment to a robot companion, they are experiencing something that resembles care without receiving care — what Turkle calls the relationship we cannot feel but can simulate. The ethical question is whether simulated companionship that reduces measurable loneliness is a legitimate therapeutic intervention or a morally problematic substitute for the human care that social conditions deny.

Cognitive monitoring for early dementia detection exemplifies the dual-use tension that is characteristic of AI in aging. Passive digital biomarkers — changes in speech rate and lexical diversity detectable in phone conversations, gait alterations measurable through smartphone accelerometers, navigation patterns in GPS traces, typing rhythm changes captured by mobile keyboards — have shown prospective associations with cognitive decline onset in longitudinal studies. The promise is substantial: Alzheimer's disease has a presymptomatic phase of fifteen to twenty years during which pathological changes are accumulating but clinical manifestations are absent, and early detection could, in principle, enable interventions during a window when they might modify disease trajectory. The ethical complication is that an accurate early warning of cognitive decline is not a benign notification. It carries insurance implications, driving licence implications, employment implications, and profound personal implications for life planning, relationship disclosure, and existential reckoning. The right to receive, and the right not to receive, a dementia prognosis is a contested bioethical question that AI-based cognitive monitoring systems force into a new form: the question of whether continuous passive monitoring that generates risk signals without clinical confirmation constitutes a form of diagnostic disclosure that requires consent.

A structural bias that is less discussed than race or gender discrimination in clinical AI but equally consequential is age discrimination in model development: the systematic underrepresentation of older adults in the research datasets from which clinical AI is developed. Clinical trials for AI-based medical devices, like pharmaceutical trials, often apply exclusion criteria that skew toward younger, healthier patients — partly because comorbidity complicates outcome attribution and partly because of implicit assumptions about the primary market for new technologies. The consequence is that models deployed in geriatric clinical contexts — where patients often have multiple chronic conditions, polypharmacy, atypical disease presentations, and different physiological parameter ranges — are evaluated in populations that do not represent them. Skin cancer classification models trained on younger adult dermatoscopic datasets perform worse on the distinctive presentation patterns of ageing skin. Speech-based cognitive screening tools calibrated on middle-aged adults may misclassify age-normal vocabulary changes as pathological. Correcting this structural bias requires not only technical solutions — stratified sampling, age-specific validation cohorts — but a prior commitment to including older adults as equal beneficiaries of the clinical AI development enterprise.


Chapter 7: Neuroscience, Brain-Computer Interfaces, and AI

The relationship between artificial intelligence and neuroscience is bidirectional and historically deep. Artificial neural networks were explicitly inspired by the architecture of biological neural circuits — the perceptron by the Hebbian model of synaptic learning, the convolutional neural network by Hubel and Wiesel’s characterisation of the visual cortex’s receptive field hierarchy. But the field diverged: deep learning advanced through gradient descent and large-scale computation rather than through progressively refined biological realism, and neuroscience developed its own computational frameworks that had limited contact with machine learning practice for several decades. Marblestone, Wayne, and Kording’s 2016 Frontiers in Computational Neuroscience paper makes the case for renewed integration through three conceptual contributions. First, they argue that the brain’s diverse neural circuits can be understood as implementing cost function minimisation — that the apparent diversity of learning mechanisms across brain regions reflects different implicit objective functions rather than fundamentally different learning algorithms. Second, they analyse the parallels and disanalogies between artificial and biological neural networks in a systematic way, identifying which architectural features of deep learning systems have biological analogues and which do not. Third, and most ambitiously, they argue that progress in understanding biological intelligence and progress in building artificial intelligence are not merely parallel but mutually informative — that solving the neuroscience problem requires and will be accelerated by solving the AI problem.

Brain-computer interfaces (BCI) are devices that establish a direct communication channel between the brain and an external computing system, bypassing the normal efferent pathways of the peripheral nervous system. The clinical motivation is the restoration of communication and motor function to individuals with conditions — amyotrophic lateral sclerosis, spinal cord injury, brainstem stroke — that have severed the neural pathways connecting intention to action while leaving the cortical substrate of motor planning intact. The BrainGate consortium, a multi-institutional academic collaboration, has been the primary vehicle for translating invasive BCI from animal models to human clinical application. Hochberg et al.’s 2006 Nature paper, summarised in Hochberg’s 2008 NEJM essay “Turning Thought into Action,” reported the first demonstration in a human participant that the BrainGate microelectrode array — a 4×4 mm silicon substrate bearing 96 platinum-tipped electrodes implanted in the hand-knob area of motor cortex — could record neural population activity sufficient to control a computer cursor, operate a television, and modulate robot arm movements by imagining the corresponding physical actions. This result established the clinical proof-of-concept for invasive BCI and initiated a research programme that has progressively advanced communication rates and motor repertoire.

Neural decoding: The computational process of inferring intended movement, speech, or cognitive state from recorded neural population activity. In BCI systems using microelectrode arrays, neural decoding typically employs statistical models — Kalman filters, recurrent neural networks, population vector algorithms — trained on simultaneous neural recordings and behavioural observations. Willett et al.'s 2021 Nature paper demonstrated that a recurrent neural network trained on neural activity recorded during imagined handwriting could decode 90 characters per minute with 94.1 percent raw accuracy — substantially faster than previous BCI communication benchmarks. The improvement came from the combination of high-dimensional neural data (192 electrodes recorded simultaneously), a powerful recurrent decoder, and a language model prior that corrected decoding errors using word probability. The Willett et al. result illustrates the contribution of modern AI methods to BCI performance: the neural signal contains substantial information about intended hand movements, but extracting that information at high rates requires deep learning decoders that exploit temporal dependencies in neural population dynamics.
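
Before recurrent decoders, the workhorse of BCI decoding was the linear Kalman filter, and it remains the standard baseline. The sketch below uses random placeholder matrices; in a real system the tuning model and noise covariances are fit from paired neural and movement recordings.

    import numpy as np

    rng = np.random.default_rng(7)
    n_ch, n_state = 96, 2                        # electrode channels; state is (vx, vy)

    A = 0.95 * np.eye(n_state)                   # smooth-velocity prior
    W = 0.03 * np.eye(n_state)                   # process noise
    C = rng.normal(0, 0.5, (n_ch, n_state))      # tuning model: firing ~ C @ velocity
    Q = np.eye(n_ch)                             # observation noise

    x, P = np.zeros(n_state), np.eye(n_state)    # state estimate and covariance
    for t in range(200):
        # Binned spike counts for a constant rightward intent, plus noise.
        y = C @ np.array([1.0, 0.0]) + rng.normal(0, 1, n_ch)
        x, P = A @ x, A @ P @ A.T + W                    # predict
        K = P @ C.T @ np.linalg.inv(C @ P @ C.T + Q)     # Kalman gain
        x = x + K @ (y - C @ x)                          # update
        P = (np.eye(n_state) - K @ C) @ P
    print(x.round(2))                            # decoded velocity approaches (1, 0)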

Neuralink, Elon Musk’s neurotechnology company, occupies an unusual position in the BCI landscape: technically sophisticated and well-funded, but operating with a gap between public communications and peer-reviewed scientific output that the academic BCI community has documented with concern. Neuralink’s device — flexible polymer threads carrying 1,024 electrodes, implanted by a neurosurgical robot — represents genuine engineering innovation in electrode count and implantation precision relative to the silicon arrays used by BrainGate. The company received FDA approval to begin its first-in-human study in 2023 and initiated the PRIME clinical trial; the first human implantation was reported in January 2024. The clinical endpoints and performance data from the PRIME study have been communicated primarily through company press releases and public demonstrations rather than peer-reviewed publications, which limits independent evaluation. The controversy over the conduct of Neuralink’s animal trials — including regulatory citations for transport of potentially contaminated implants and, separately, concerns raised by former employees about experimental rigour — has further complicated assessment of the company’s scientific programme, independent of its technical capabilities.

AlphaFold, DeepMind’s transformer-based model for protein structure prediction, represents a contribution of AI to biomedicine that is structurally different from clinical decision support but arguably of comparable long-term significance. Jumper et al.’s 2021 Nature paper demonstrated that AlphaFold 2 predicted protein structures with atomic-level accuracy across the Critical Assessment of Protein Structure Prediction (CASP) benchmark — a problem that the structural biology community had been unable to solve computationally for fifty years. The implications for neuropharmacology are direct: the structure of neurodegenerative disease-associated proteins — tau, alpha-synuclein, TDP-43 — and their interaction partners can now be predicted and analysed at a resolution that was previously available only through years of crystallographic or cryogenic electron microscopy work. AlphaFold-enabled structure prediction does not identify drug candidates automatically, but it dramatically accelerates the rational drug design process by providing the three-dimensional structural context that is prerequisite to target-based drug discovery. In the domain of neurodegenerative disease, where the decades-long history of clinical trial failure has been partly attributed to the poor structural understanding of the relevant protein aggregates, this acceleration could be clinically consequential.


Chapter 8: AI Governance in Healthcare — Regulation, Evaluation, and Equity

The global regulatory landscape for clinical AI is fragmented, evolving, and not yet adequate to the pace of development. In the United States, the FDA’s Software as a Medical Device framework is the primary regulatory pathway, but its application is complicated by two features unique to AI-based devices relative to conventional medical technology. The first is the distinction between locked and adaptive algorithms: a conventional diagnostic device produces consistent outputs from the same inputs; an adaptive machine learning model may be retrained as new data accumulates, changing its behaviour over time in ways that a one-time regulatory clearance does not evaluate. The FDA’s 2021 action plan for AI/ML-based SaMD acknowledges this challenge and proposes the predetermined change control plan as a regulatory mechanism, but the plan’s implementation remains a work in progress. The second complication is performance across subgroups: regulatory submissions can demonstrate aggregate performance across a test dataset without demonstrating that performance is equivalent across the demographic subgroups that will be encountered in clinical deployment. The FDA has published guidance encouraging applicants to report performance stratified by age, sex, and race and ethnicity, but this guidance is not yet a mandatory requirement with standardised reporting formats.

The distinction between technical validation and clinical validation is among the most consequential methodological distinctions in clinical AI evaluation, and it is routinely collapsed in both the scientific literature and regulatory submissions. Technical validation asks whether an algorithm performs well on a dataset: does it classify images accurately, predict outcomes with high AUROC, or assign risk scores that correlate with outcomes? Clinical validation asks whether deploying the algorithm changes clinical behaviour and improves patient outcomes — a substantially harder question that requires prospective study in clinical settings. The CONSORT-AI and SPIRIT-AI reporting guidelines, published in Nature Medicine and BMJ in 2020, provide reporting standards for randomised trials and protocols respectively involving AI interventions, and constitute the biomedical AI community’s most important methodological contribution to closing the technical-clinical validation gap. Despite these guidelines, the proportion of clinical AI studies that include prospective clinical outcome evaluation remains low: a systematic review by Nagendran et al. in BMJ found that the evidence base for clinical AI consisted overwhelmingly of retrospective technical validation studies, with very few randomised trials evaluating clinical outcomes, and that the handful of randomised trials that existed were generally small and at high risk of bias.

Health equity in AI deployment does not require intentional discrimination to produce discriminatory effects. As Obermeyer et al. demonstrate, and as the skin tone performance gap in dermatology AI illustrates, biased outcomes can emerge from structurally compromised training data even when demographic variables are not explicitly included in the model and no discriminatory intent is present. The concept of equity-by-design — building equity considerations into the model development process from the outset, including training data curation, outcome variable selection, and subgroup performance requirements — is proposed by health equity researchers as an alternative to the more common post-hoc disparate impact analysis, which identifies bias after a model is deployed and typically results in reactive correction rather than prevention. Equity-by-design requires specifying, before model development, which demographic groups constitute equity-relevant subgroups, what minimum performance standards must be met for each subgroup, and how training data will be curated or augmented to meet those standards. This approach adds complexity and cost to the development process, but its proponents argue that the alternative — deploying discriminatory systems and addressing equity concerns reactively — creates greater total harm and greater institutional liability.
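
In practice, equity-by-design implies release checks of roughly the following shape: compute the pre-specified metric for each subgroup and block deployment if any subgroup falls below its floor. The data, the subgroups, and the 0.85 floor are all illustrative.

    import numpy as np

    def subgroup_sensitivity(y_true, y_pred, groups):
        out = {}
        for g in np.unique(groups):
            positives = (groups == g) & (y_true == 1)
            out[g] = y_pred[positives].mean()    # fraction of true cases detected
        return out

    rng = np.random.default_rng(3)
    y_true = rng.integers(0, 2, 5000)
    groups = rng.choice(["A", "B", "C"], 5000)
    detect_rate = np.where(groups == "C", 0.70, 0.90)   # simulated underperformance on C
    y_pred = (y_true == 1) & (rng.random(5000) < detect_rate)

    FLOOR = 0.85                                 # pre-specified minimum sensitivity
    for g, s in subgroup_sensitivity(y_true, y_pred, groups).items():
        print(f"group {g}: sensitivity {s:.2f}", "OK" if s >= FLOOR else "FAIL")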

Algorithmic auditing in healthcare is an emerging practice that draws on frameworks developed in anti-discrimination law, financial regulation, and the emerging algorithmic accountability literature. New York City Local Law 144, enacted in 2021 to require bias audits of automated employment decision tools, provides one regulatory model: it requires that covered tools be evaluated by independent auditors for impact ratios across race and sex categories, and that audit results be publicly disclosed. Proposals to apply analogous requirements to clinical AI have been advanced by health equity researchers and AI governance scholars, with arguments that independent third-party auditing of deployed clinical AI systems — especially those used for resource allocation, triage, or insurance decisions — should be a regulatory requirement rather than a voluntary best practice. The practical challenges are substantial: clinical AI developers often treat model architectures and training data as proprietary trade secrets, which conflicts with the transparency requirements of meaningful independent auditing. The tension between intellectual property protection and public accountability for systems that make consequential health decisions about millions of patients is one the field has not resolved.

The WHO’s 2021 Ethics and Governance of Artificial Intelligence for Health calls for transparent reporting of training data provenance, performance across demographic groups, and post-deployment monitoring — a call that is ambitious and well-reasoned, and honest about the distance between aspiration and current practice. The document identifies four structural prerequisites for ethical AI in health that extend beyond algorithm design: regulatory capacity in health ministries that currently lack technical expertise to evaluate AI submissions; data governance infrastructure to manage the privacy, consent, and access dimensions of large-scale health data use; clinical and technical workforce training to enable both clinicians and patients to engage meaningfully with AI systems; and mechanisms for public participation in the governance decisions that shape which AI systems are developed and deployed. These structural prerequisites are unevenly distributed globally, with significant gaps in low- and middle-income countries that are, paradoxically, both the populations where AI-assisted healthcare extension has the greatest potential impact and the settings least equipped to regulate and evaluate the systems deployed in them.


Chapter 9: Futures — Precision Medicine, AI Surgery, and the Augmented Clinician

Precision medicine is the aspiration to tailor prevention, diagnosis, and treatment to the individual patient’s genomic, physiological, behavioural, and social profile rather than applying population-average guidelines derived from heterogeneous clinical trial populations. The National Institutes of Health All of Us Research Program, launched in 2018 with a goal of enrolling one million diverse participants and collecting genomic, electronic health record, wearable sensor, and social determinant data, is the most ambitious infrastructure investment in precision medicine to date. Its explicit emphasis on diversity — it aims to include at least fifty percent of participants from racial and ethnic minority groups historically underrepresented in biomedical research — is a direct response to the well-documented skew of prior genomic research toward populations of European ancestry, which limits the applicability of polygenic risk scores and pharmacogenomic associations derived from those studies to non-European populations. Whether the All of Us data, once assembled and linked, will generate clinical insights of the magnitude its funding justifies remains to be seen; the history of precision medicine is populated with biomarkers that replicate at the population level but do not improve clinical outcomes when used to guide individual management.

AI in surgical robotics is advancing along a trajectory from surgeon-controlled teleoperation toward increasing automation of surgical subtasks, with regulatory and liability frameworks that have not yet caught up. The da Vinci Surgical System — the dominant platform for robotic-assisted minimally invasive surgery with over a million procedures per year — currently functions as a precision teleoperation system that amplifies and filters surgeon hand movements without autonomous behaviour. Shademan et al.’s 2016 Science Translational Medicine paper reported the first demonstration of an autonomous robotic system — the Smart Tissue Autonomous Robot (STAR) — performing supervised laparoscopic bowel anastomosis in a porcine model with greater consistency than human surgeons, as measured by anastomotic leak rates and suture spacing. The STAR system used near-infrared fluorescent tissue markers and computer vision tracking to compensate for tissue deformation, the primary challenge for autonomous soft-tissue surgery. The path from supervised animal trials to unsupervised human surgical autonomy involves not only technical challenges — real-time tissue tracking in the presence of bleeding and retraction, reliable failure mode detection — but regulatory and ethical challenges for which no established framework yet exists.

Ambient clinical intelligence (ACI): A class of AI systems that automatically generate structured clinical documentation from physician-patient conversations, using automatic speech recognition and natural language processing to produce the clinical note that would otherwise require post-encounter dictation or direct EHR entry. Commercial ACI tools, including Nuance's Dragon Ambient eXperience (DAX) and its successor DAX Copilot (Nuance is now part of Microsoft), have been deployed in large health systems and evaluated in feasibility studies that document substantial reductions in documentation time and improvements in physician-reported experience. The risks of ACI deployment at scale include documentation errors propagated without detection — a miscaptured medication dose or a misattributed clinical finding in an automatically generated note could persist in the medical record and influence future clinical decisions. The liability of ACI documentation errors is currently governed by the same standard applicable to physician-generated documentation, meaning the responsible clinician bears liability for the content of notes produced by ACI tools they use without adequate review. This allocation may be appropriate given current ACI performance characteristics, but it may not remain appropriate as ACI systems become more capable and the practical expectation of close physician review of every generated sentence becomes less realistic.

Federated learning and privacy-preserving AI represent a partial technical response to one of the most significant governance barriers in clinical AI development: the difficulty of assembling multi-institutional, diverse training datasets given the privacy, liability, and competitive concerns that impede centralised data sharing. In federated learning, model training is distributed across data custodians — hospitals, health systems, genomic biobanks — with gradient updates rather than raw patient data transmitted to a central aggregation server. The NVIDIA FLARE platform and the HealthChain consortium in Europe have demonstrated federated training of clinical imaging models across multiple institutions, with performance approaching that of centrally trained models on some tasks. The approach is not a complete privacy solution — gradient updates can leak information about training data through membership inference attacks, and the federated architecture requires trust that the aggregation server and participating nodes are not colluding — but it is a meaningful step toward enabling the multi-institutional diversity in training data that both technical performance and equity objectives require.
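
The core federated-averaging (FedAvg) loop is compact enough to sketch: each site takes local gradient steps on its own data, and only model weights travel to the aggregator, which averages them weighted by site size. The per-site datasets are synthetic stand-ins for hospital data, and real deployments add secure aggregation and other protections this sketch omits.

    import numpy as np

    rng = np.random.default_rng(5)
    d = 10
    true_w = rng.normal(size=d)
    sites = []                                   # three "hospitals"; data never pooled
    for n_local in (200, 400, 300):
        X = rng.normal(size=(n_local, d))
        y = (X @ true_w + rng.normal(0, 0.1, n_local) > 0).astype(float)
        sites.append((X, y))

    def local_update(w, X, y, lr=0.1, steps=10):
        # A few full-batch logistic-regression gradient steps on local data.
        for _ in range(steps):
            p = 1 / (1 + np.exp(-(X @ w)))
            w = w - lr * X.T @ (p - y) / len(X)
        return w

    w_global = np.zeros(d)
    for round_ in range(20):                     # communication rounds
        local_ws = [local_update(w_global.copy(), X, y) for X, y in sites]
        sizes = [len(X) for X, _ in sites]
        w_global = np.average(local_ws, axis=0, weights=sizes)  # FedAvg aggregation

    cos = w_global @ true_w / (np.linalg.norm(w_global) * np.linalg.norm(true_w))
    print(round(float(cos), 2))                  # recovered direction, close to 1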

The concept of the augmented clinician provides the synthesis toward which the course has been building. Topol’s vision — AI absorbs pattern recognition, clinicians reclaim the human dimensions of care — is compelling as an aspiration but requires more precise articulation of what the human dimensions of medicine consist of and why AI cannot or should not perform them. Empathy, the capacity to understand and share the emotional experience of another person, requires subjective experience that AI systems, as currently understood, do not possess. Communication — the negotiation of meaning and values in a relationship of care — is not reducible to information transfer and requires contextual judgment of a kind that large language models can approximate linguistically but not enact relationally. Moral reasoning in clinical contexts — weighing competing goods under uncertainty, navigating the irreducible particularity of individual patient circumstances — is a practice that develops through the experience of clinical relationships and cannot be separated from the care relationships that instantiate it. Algorithmic systems that replicate the surface form of moral reasoning without the relational context that grounds it may produce outputs that are linguistically indistinguishable from those of an experienced clinician while missing the ethical substance entirely. The augmented clinician framework does not require resolving philosophical debates about machine consciousness; it requires only the practical judgment that the specific human capacities most essential to good medicine — attentive listening, trust-building, moral navigation in relationships of vulnerability — are precisely those that current and foreseeable AI is least equipped to replicate, and that the value of AI in medicine consists in protecting those capacities from displacement by the pattern-recognition burden that currently overwhelms them.
