About
165
Publications
33,759
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
8,917
Citations
Current institution
Publications
Publications (165)
The increased capabilities of generative artificial intelligence (AI) have dramatically expanded its possible use cases in medicine. We provide a comprehensive overview of generative AI use cases for clinicians, patients, clinical trial organizers, researchers, and trainees. We then discuss the many challenges—including maintaining privacy and secu...
Machine learning-driven rankings, where individuals (or items) are ranked in response to a query, mediate search exposure or attention in a variety of safety-critical settings. Thus, it is important to ensure that such rankings are fair. Under the goal of equal opportunity, attention allocated to an individual on a ranking interface should be propo...
Many practical vision-language applications require models that understand negation, e.g., when using natural language to retrieve images which contain certain objects but not others. Despite advancements in vision-language models (VLMs) through large-scale training, their ability to comprehend negation remains underexplored. This study addresses t...
Without careful dissection of the ways in which biases can be encoded into artificial intelligence (AI) health technologies, there is a risk of perpetuating existing health inequalities at scale. One major source of bias is the data that underpins such technologies. The STANDING Together recommendations aim to encourage transparency regarding limit...
The increased capabilities of generative AI have dramatically expanded its possible use cases in medicine. We provide a comprehensive overview of generative AI use cases for clinicians, patients, clinical trial organizers, researchers, and trainees. We then discuss the many challenges -- including maintaining privacy and security, improving transpa...
Spurious features associated with class labels can lead image classifiers to rely on shortcuts that don't generalize well to new domains. This is especially problematic in medical settings, where biased models fail when applied to different hospitals or systems. In such cases, data-driven methods to reduce spurious correlations are preferred, as cl...
Sepsis is a life-threatening condition defined by end-organ dysfunction due to a dysregulated host response to infection. Although the Surviving Sepsis Campaign has launched and has been releasing sepsis treatment guidelines to unify and normalize the care for sepsis patients, it has been reported in numerous studies that disparities in care exist...
Vision-language model (VLM) embeddings have been shown to encode biases present in their training data, such as societal biases that prescribe negative characteristics to members of various racial and gender identities. VLMs are being quickly adopted for a variety of tasks ranging from few-shot classification to text-guided image generation, making...
Vision-language models, like CLIP (Contrastive Language Image Pretraining), are becoming increasingly popular for a wide range of multimodal retrieval tasks. However, prior work has shown that large language and deep vision models can learn historical biases contained in their training sets, leading to perpetuation of stereotypes and potential down...
Vision-language models, like CLIP (Contrastive Language Image Pretraining), are becoming increasingly popular for a wide range of multimodal retrieval tasks. However, prior work has shown that large language and deep vision models can learn historical biases contained in their training sets, leading to perpetuation of stereotypes and potential down...
To induce desired behaviors in large language models (LLMs) for interaction-driven tasks, the instruction-tuning stage typically trains LLMs on instruction-response pairs using the next-token prediction (NTP) loss. Previous work aiming to improve instruction-tuning performance often emphasizes the need for higher-quality supervised fine-tuning (SFT...
Background
There are known racial disparities in the organ transplant allocation system in the United States. However, prior work has yet to establish if transplant center decisions on offer acceptance—the final step in the allocation process— contribute to these disparities.
Objective
To estimate racial differences in the acceptance of organ offe...
For medical imaging AI models to be clinically impactful, they must generalize. However, this goal is hindered by (i) diverse types of distribution shifts, such as temporal, demographic, and label shifts, and (ii) limited diversity in datasets that are siloed within single medical institutions. While these limitations have spurred interest in feder...
Large repositories of image-caption pairs are essential for the development of vision-language models. However, these datasets are often extracted from noisy data scraped from the web, and contain many mislabeled examples. In order to improve the reliability of downstream models, it is important to identify and filter images with incorrect captions...
Machine learning models can fail on subgroups that are underrepresented during training. While techniques such as dataset balancing can improve performance on underperforming groups, they require access to training group annotations and can end up removing large portions of the dataset. In this paper, we introduce Data Debiasing with Datamodels (D3...
Supervised learning methods have been found to exhibit inductive biases favoring simpler features. When such features are spuriously correlated with the label, this can result in suboptimal performance on minority subgroups. Despite the growing popularity of methods which learn from unlabeled data, the extent to which these representations rely on...
The TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) statement was published in 2015 to provide the minimum reporting recommendations for studies developing or evaluating the performance of a prediction model. Methodological advances in the field of prediction have since included the widesprea...
While novel oral anticoagulants are increasingly used to reduce risk of stroke in patients with atrial fibrillation, vitamin K antagonists such as warfarin continue to be used extensively for stroke prevention across the world. While effective in reducing the risk of strokes, the complex pharmacodynamics of warfarin make it difficult to use clinica...
The study provides a comprehensive review of OpenAI’s Generative Pre-trained Transformer 4 (GPT-4) technical report, with an emphasis on applications in high-risk settings like healthcare. A diverse team, including experts in artificial intelligence (AI), natural language processing, public health, law, policy, social science, healthcare research,...
Stress is associated with numerous chronic health conditions, both mental and physical. However, the heterogeneity of these associations at the individual level is poorly understood. While data generated from individuals in their day-to-day lives “in the wild” may best represent the heterogeneity of stress, gathering these data and separating signa...
Artificial intelligence as a medical device is increasingly being applied to healthcare for diagnosis, risk stratification and resource allocation. However, a growing body of evidence has highlighted the risk of algorithmic bias, which may perpetuate existing health inequity. This problem arises in part because of systemic inequalities in dataset c...
Social determinants of health (SDOH) -- the conditions in which people live, grow, and age -- play a crucial role in a person's health and well-being. There is a large, compelling body of evidence in population health studies showing that a wide range of SDOH is strongly correlated with health outcomes. Yet, a majority of the risk prediction models...
Data sharing is crucial for open science and reproducible research, but the legal sharing of clinical data requires the removal of protected health information from electronic health records. This process, known as de-identification, is often achieved through the use of machine learning algorithms by many commercial and open-source systems. While t...
As governments and industry turn to increased use of automated decision systems, it becomes essential to consider how closely such systems can reproduce human judgment. We identify a core potential failure, finding that annotators label objects differently depending on whether they are being asked a factual question or a normative question. This ch...
Background
Prior research has shown that artificial intelligence (AI) systems often encode biases against minority subgroups. However, little work has focused on ways to mitigate the harm discriminatory algorithms can cause in high-stakes settings such as medicine.
Methods
In this study, we experimentally evaluated the impact biased AI recommendat...
Large pre-trained models decay over long-term deployment as input distributions shift, user requirements change, or crucial knowledge gaps are discovered. Recently, model editors have been proposed to modify a model's behavior by adjusting its weights during deployment. However, when editing the same model multiple times, these approaches quickly d...
Performance of machine learning models may differ between training and deployment for many reasons. For instance, model performance can change between environments due to changes in data quality, observing a different population than the one in training, or changes in the relationship between labels and features. These manifest as changes to the un...
The standard approach to personalization in machine learning consists of training a model with group attributes like sex, age group, and blood type. In this work, we show that this approach to personalization fails to improve performance for all groups who provide personal data. We discuss how this effect inflicts harm in applications where models...
Clinical notes are becoming an increasingly important data source for machine learning (ML) applications in healthcare. Prior research has shown that deploying ML models can perpetuate existing biases against racial minorities, as bias can be implicitly embedded in data. In this study, we investigate the level of implicit race information available...
Machine learning models in safety-critical settings like healthcare are often blackboxes: they contain a large number of parameters which are not transparent to users. Post-hoc explainability methods where a simple, human-interpretable model imitates the behavior of these blackbox models are often proposed to help users trust model predictions. In...
Background
Previous studies in medical imaging have shown disparate abilities of artificial intelligence (AI) to detect a person's race, yet there is no known correlation for race on medical imaging that would be obvious to human experts when interpreting the images. We aimed to conduct a comprehensive evaluation of the ability of AI to recognise a...
[This corrects the article DOI: 10.2196/32165.].
Objective
To predict older adults’ risk of avoidable hospitalisation related to ambulatory care sensitive conditions (ACSC) using machine learning applied to administrative health data of Ontario, Canada.
Design, setting and participants
A retrospective cohort study was conducted on a large cohort of all residents covered under a single-payer syst...
Artificial intelligence systems for health care, like any other medical device, have the potential to fail. However, specific qualities of artificial intelligence systems, such as the tendency to learn spurious correlates in training data, poor generalisability to new deployment settings, and a paucity of reliable explainability mechanisms, mean th...
Health care is a human process that generates data from human lives, as well as the care they receive. Machine learning has worked in health to bring new technology into this sociotechnical environment, using data to support a vision of healthier living for everyone. Interdisciplinary fields of research like machine learning for health bring differ...
Deep learning models have reached or surpassed human-level performance in the field of medical imaging, especially in disease diagnosis using chest x-rays. However, prior work has found that such classifiers can exhibit biases in the form of gaps in predictive performance across protected groups. In this paper, we question whether striving to achie...
Deep metric learning (DML) enables learning with less supervision through its emphasis on the similarity structure of representations. There has been much work on improving generalization of DML in settings like zero-shot retrieval, but little is known about its implications for fairness. In this paper, we are the first to evaluate state-of-the-art...
Reinforcement learning (RL) tasks are typically framed as Markov Decision Processes (MDPs), assuming that decisions are made at fixed time intervals. However, many applications of great importance, including healthcare, do not satisfy this assumption, yet they are commonly modelled as MDPs after an artificial reshaping of the data. In addition, mos...
The Better Understanding the Metamorphosis of Pregnancy (BUMP) study is a longitudinal feasibility study aimed to gain a deeper understanding of the pre-pregnancy and pregnancy symptom experience using digital tools. The present paper describes the protocol for the BUMP study. Over 1000 participants are being recruited through a patient provider-pl...
Predictive models for clinical outcomes that are accurate on average in a patient population may underperform drastically for some subpopulations, potentially introducing or reinforcing inequities in care access and quality. Model training approaches that aim to maximize worst-case model performance across subpopulations, such as distributionally r...
Purpose:
To develop an end-to-end pipeline to localize and identify cervical spine hardware brands on routine cervical spine radiographs.
Materials and methods:
In this single-center retrospective study, patients who received cervical spine implants between 2014 and 2018 were identified. Information on the implant model was retrieved from the su...
Checklists are simple decision aids that are often used to promote safety and reliability in clinical applications. In this paper, we present a method to learn checklists for clinical decision support. We represent predictive checklists as discrete linear classifiers with binary features and unit weights. We then learn globally optimal predictive c...
Artificial intelligence (AI) systems have increasingly achieved expert-level performance in medical imaging applications. However, there is growing concern that such AI systems may reflect and amplify human bias, and reduce the quality of their performance in historically under-served populations such as female patients, Black patients, or patients...
The black-box nature of current artificial intelligence (AI) has caused some to question whether AI must be explainable to be used in high-stakes scenarios such as medicine. It has been argued that explainable AI will engender trust with the health-care workforce, provide transparency into the AI decision making process, and potentially mitigate va...
Recently, neural natural language models have attained state-of-the-art performance on a wide variety of tasks, but the high performance can result from superficial, surface-level cues (Bender and Koller, 2020; Niven and Kao, 2020). These surface cues, as the ``shortcuts'' inherent in the datasets, do not contribute to the *task-specific informatio...
Machine learning has successfully framed many sequential decision making problems as either supervised prediction, or optimal decision-making policy identification via reinforcement learning. In data-constrained offline settings, both approaches may fail as they assume fully optimal behavior or rely on exploring alternatives that may not exist. We...
Following the publication of the original article [1], we were notified that current affiliations 17, 18 and 19 were erroneously added to the first author rather than the senior author (Marzyeh Ghassemi). The original article has now been corrected.
Predictive models for clinical outcomes that are accurate on average in a patient population may underperform drastically for some subpopulations, potentially introducing or reinforcing inequities in care access and quality. Model training approaches that aim to maximize worst-case model performance across subpopulations, such as distributionally r...
Machine learning models achieve state-of-the-art performance on many supervised learning tasks. However, prior evidence suggests that these models may learn to rely on shortcut biases or spurious correlations (intuitively, correlations that do not hold in the test as they hold in train) for good predictive performance. Such models cannot be trusted...
Background
Interest in the application of machine learning (ML) to the design, conduct, and analysis of clinical trials has grown, but the evidence base for such applications has not been surveyed. This manuscript reviews the proceedings of a multi-stakeholder conference to discuss the current and future state of ML for clinical research. Key areas...
Background: In medical imaging, prior studies have demonstrated disparate AI performance by race, yet there is no known correlation for race on medical imaging that would be obvious to the human expert interpreting the images. Methods: Using private and public datasets we evaluate: A) performance quantification of deep learning models to detect rac...
Machine learning can be used to make sense of healthcare data. Probabilistic machine learning models help provide a complete picture of observed data in healthcare. In this review, we examine how probabilistic machine learning can advance healthcare. We consider challenges in the predictive model building pipeline where probabilistic models can be...
The use of machine learning (ML) in healthcare raises numerous ethical concerns, especially as models can amplify existing health inequities. Here, we outline ethical considerations for equitable ML in the advancement of healthcare. Specifically, we frame ethics of ML in healthcare through the lens of social justice. We describe ongoing efforts and...
BACKGROUND
Background: Several app-based studies share similar characteristics of a ‘light touch’ approach that recruit, enroll, and onboard via a smartphone app and attempt to minimize burden through low-friction active study tasks, while emphasizing the collection of passive data with minimal human contact. However, engagement is a common challen...
Background:
Several app-based studies share similar characteristics of a 'light touch' approach that recruit, enroll, and onboard via a smartphone app and attempt to minimize burden through low-friction active study tasks, while emphasizing the collection of passive data with minimal human contact. However, engagement is a common challenge across...
Commercial wearable devices are surfacing as an appealing mechanism to detect COVID-19 and potentially other public health threats, due to their widespread use. To assess the validity of wearable devices as population health screening tools, it is essential to evaluate predictive methodologies based on wearable devices by mimicking their real-world...
Commercial wearable devices are surfacing as an appealing mechanism to detect COVID-19 and potentially other public health threats, due to their widespread use. To assess the validity of wearable devices as population health screening tools, it is essential to evaluate predictive methodologies based on wearable devices by mimicking their real-world...
National Institute of Biomedical Imaging and Bioengineering (Grants EB017205 and 1928481)
Machine learning for health must be reproducible to ensure reliable clinical use. We evaluated 511 scientific papers across several machine learning subfields and found that machine learning for health compared poorly to other areas regarding reproducibility metrics, such as dataset and code accessibility. We propose recommendations to address this...
Clinical machine learning models experience significantly degraded performance in datasets not seen during training, e.g., new hospitals or populations. Recent developments in domain generalization offer a promising solution to this problem by creating models that learn invariances across environments. In this work, we benchmark the performance of...
Objective
To predict older adults' risk of avoidable hospitalization related to ambulatory care sensitive conditions (ACSC) using machine learning applied to administrative health data of Ontario, Canada.
Design, Setting, and Participants
A retrospective cohort study was conducted on a large cohort of all residents covered under a single-payer syst...
Artificial intelligence (AI) systems have increasingly achieved expert-level performance, particularly in medical imaging (1). However, there is growing concern that AI systems will reflect and amplify human bias against under-served subpopulations (2-7). Such biases are especially troubling in the context of underdiagnosis: if AI systems falsel...
Machine learning systems have received much attention recently for their ability to achieve expert-level performance on clinical tasks, particularly in medical imaging. Here, we examine the extent to which state-of-the-art deep learning classifiers trained to yield diagnostic labels from X-ray images are biased with respect to protected attributes....
Introduction
There is widespread enthusiasm to improve health through the application of artificial intelligence and machine learning (AI/ML) methods to large population-level health datasets. Achieving this may require successful collaboration between institutions as well as between computer scientists (CS), machine learning researchers (MLR) and...
Reinforcement Learning (RL) has recently been applied to sequential estimation and prediction problems identifying and developing hypothetical treatment strategies for septic patients, with a particular focus on offline learning with observational data. In practice, successful RL relies on informative latent states derived from sequential observati...
Reliable treatment effect estimation from observational data depends on the availability of all confounding information. While much work has targeted treatment effect estimation from observational data, there is relatively little work in the setting of confounding variable missingness, where collecting more information on confounders is often costl...
National Institute of Health (Grant R01-EB017205)
Machine learning models in health care are often deployed in settings where it is important to protect patient privacy. In such settings, methods for differentially private (DP) learning provide a general-purpose approach to learn models with privacy guarantees. Modern methods for DP learning ensure privacy through mechanisms that censor informatio...
Machine learning can be used to make sense of healthcare data. Probabilistic machine learning models help provide a complete picture of observed data in healthcare. In this review, we examine how probabilistic machine learning can advance healthcare. We consider challenges in the predictive model building pipeline where probabilistic models can be...
The use of machine learning (ML) in health care raises numerous ethical concerns, especially as models can amplify existing health inequities. Here, we outline ethical considerations for equitable ML in the advancement of health care. Specifically, we frame ethics of ML in health care through the lens of social justice. We describe ongoing efforts...
Pretrained embedding representations of biological sequences which capture meaningful properties can alleviate many problems associated with supervised learning in biology. We apply the principle of mutual information maximization between local and global information as a self-supervised pretraining signal for protein embeddings. To do so, we divid...
Multi-task learning (MTL) is a machine learning technique aiming to improve model performance by leveraging information across many tasks. It has been used extensively on various data modalities, including electronic health record (EHR) data. However, despite significant use on EHR data, there has been little systematic investigation of the utility...
It is often infeasible or impossible to obtain ground truth labels for medical data. To circumvent this, one may build rule-based or other expert-knowledge driven labelers to ingest data and yield silver labels absent any ground-truth training data. One popular such labeler is CheXpert, a labeler that produces diagnostic labels for chest X-ray radi...
Reliably transferring treatment policies learned in one clinical environment to another is currently limited by challenges related to domain shift. In this paper we address off-policy learning for sequential decision making under domain shift -- a scenario susceptible to catastrophic overconfidence -- which is highly relevant to a high-stakes clini...