npj Digital Medicine

Published by Springer Nature
Online ISSN: 2398-6352
Discipline: Medicine
Aims and scope

npj Digital Medicine is an online open-access journal dedicated to publishing high quality peer-reviewed research in all aspects of digital medicine including the clinical implementation of digital and mobile technologies, virtual healthcare, data analytic methodologies and innovative sensor development to provide the necessary data and longitudinal monitoring to best inform the broadest medical community. The journal aims to guide innovation and the transformation of health and healthcare through the incorporation of novel digital and mobile technologies.



Recent publications
Risk of bias and applicability concerns summary: review authors' judgements about each domain for each study. High, unclear and low risk of bias and applicability concerns are represented as shown in the legend.
Risk of bias and applicability concerns graph: review authors' judgements about each domain presented as percentages across included studies. High, unclear and low risk of bias or applicability concerns are represented as shown in the legend.
Summary test accuracy of surgical site infection diagnosis by index test method.
The SARS-CoV-2 pandemic catalysed integration of telemedicine worldwide. This systematic review assesses its accuracy for diagnosis of Surgical Site Infection (SSI). Databases were searched for telemedicine and wound infection studies. All types of studies were included, but only paired designs were taken to meta-analysis. QUADAS-2 assessed methodological quality. 1400 titles and abstracts were screened, 61 full text reports were assessed for eligibility and 17 studies were included in meta-analysis; mean age was 47.1 ± 13.3 years. Summary sensitivity and specificity were 87.8% (95% CI 68.4–96.1) and 96.8% (95% CI 93.5–98.4), respectively. The overall SSI rate was 5.6%. Photograph methods had lower sensitivity and specificity at 63.9% (95% CI 30.4–87.8) and 92.6% (95% CI 89.9–94.5). Telemedicine is highly specific for SSI diagnosis, giving rise to great potential for utilisation in excluding SSI. Further work is needed to investigate the feasibility of telemedicine in the elderly population group.
Illustration of the seven-step pipeline for making aggregated predictions, using fictional examples. The seven-step pipeline is illustrated at the point of prediction, assuming there are four patients in the ED at that moment in time, and an unknown number of patients yet to arrive, who will be admitted within the prediction window. Figures on the right show fictional examples of the output from each step.
Machine learning for hospital operations is under-studied. We present a prediction pipeline that uses live electronic health-records for patients in a UK teaching hospital’s emergency department (ED) to generate short-term, probabilistic forecasts of emergency admissions. A set of XGBoost classifiers applied to 109,465 ED visits yielded AUROCs from 0.82 to 0.90 depending on elapsed visit-time at the point of prediction. Patient-level probabilities of admission were aggregated to forecast the number of admissions among current ED patients and, incorporating patients yet to arrive, total emergency admissions within specified time-windows. The pipeline gave a mean absolute error (MAE) of 4.0 admissions (mean percentage error of 17%) versus 6.5 (32%) for a benchmark metric. Models developed with 104,504 later visits during the Covid-19 pandemic gave AUROCs of 0.68–0.90 and MAE of 4.2 (30%) versus a 4.9 (33%) benchmark. We discuss how we surmounted challenges of designing and implementing models for real-time use, including temporal framing, data preparation, and changing operational conditions.
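The aggregation step described above can be illustrated with a minimal sketch. It assumes that each patient's admission is an independent Bernoulli event, so the number of admissions among current ED patients follows a Poisson binomial distribution. The function name and the four probabilities are hypothetical and are not taken from the paper, which may aggregate differently.

```python
def admission_count_distribution(probs):
    """Return [P(0 admissions), P(1 admission), ..., P(n admissions)].

    Assumes admissions are independent Bernoulli events; this is an
    assumption of the sketch, not a detail confirmed by the abstract.
    """
    dist = [1.0]  # with no patients, zero admissions is certain
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # this patient is not admitted
            nxt[k + 1] += mass * p      # this patient is admitted
        dist = nxt
    return dist


# Hypothetical admission probabilities for four patients currently in the ED.
probs = [0.9, 0.5, 0.2, 0.1]
dist = admission_count_distribution(probs)
expected = sum(k * mass for k, mass in enumerate(dist))
```

Under this assumption the expected number of admissions equals the sum of the patient-level probabilities, while the full distribution supports probabilistic statements such as the chance of at least two admissions.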
Heatmaps. A few examples of heatmaps, generated with the attention network's weights, compared with manual pixel-wise annotations made by a pathologist. Each pair of examples includes a heatmap (on the left) and the corresponding pathologist pixel-wise annotations (on the right). The highlighted regions within the heatmaps are those to which the model assigned the highest importance for the global diagnosis. The comparison of the heatmaps and the annotations shows that the attention network gives greater importance to regions including relevant patches.
Overview of the dataset composition.
CNN performance overview. Performance at WSI-level (private data)
The mapping adopted on the publicly available datasets.
Overview of SKET performance, reporting the precision, the recall and the F1-score of the single classes, in both Catania and Radboudumc reports.
The digitalization of clinical workflows and the increasing performance of deep learning algorithms are paving the way towards new methods for tackling cancer diagnosis. However, the availability of medical specialists to annotate digitized images and free-text diagnostic reports does not scale with the need for large datasets required to train robust computer-aided diagnosis methods that can target the high variability of clinical cases and data produced. This work proposes and evaluates an approach to eliminate the need for manual annotations to train computer-aided diagnosis tools in digital pathology. The approach includes two components, to automatically extract semantically meaningful concepts from diagnostic reports and use them as weak labels to train convolutional neural networks (CNNs) for histopathology diagnosis. The approach is trained (through 10-fold cross-validation) on 3’769 clinical images and reports, provided by two hospitals and tested on over 11’000 images from private and publicly available datasets. The CNN, trained with automatically generated labels, is compared with the same architecture trained with manual labels. Results show that combining text analysis and end-to-end deep neural networks allows building computer-aided diagnosis tools that reach solid performance (micro-accuracy = 0.908 at image-level) based only on existing clinical data without the need for manual annotations.
Elements of digital surgery identified by the Delphi panel. Consensus elements were grouped into three themes: data; analysis; and applications.
Future research goals for digital surgery identified and ranked highest to lowest in order of importance by the Delphi panel.
The use of digital technology is increasing rapidly across surgical specialities, yet there is no consensus for the term ‘digital surgery’. This is critical as digital health technologies present technical, governance, and legal challenges which are unique to the surgeon and surgical patient. We aim to define the term digital surgery and the ethical issues surrounding its clinical application, and to identify barriers and research goals for future practice. 38 international experts, across the fields of surgery, AI, industry, law, ethics and policy, participated in a four-round Delphi exercise. Issues were generated by an expert panel and public panel through a scoping questionnaire around key themes identified from the literature and voted upon in two subsequent questionnaire rounds. Consensus was defined if >70% of the panel deemed the statement important and <30% unimportant. A final online meeting was held to discuss consensus statements. The definition of digital surgery as the use of technology for the enhancement of preoperative planning, surgical performance, therapeutic support, or training, to improve outcomes and reduce harm achieved 100% consensus agreement. We highlight key ethical issues concerning data, privacy, confidentiality and public trust, consent, law, litigation and liability, and commercial partnerships within digital surgery and identify barriers and research goals for future practice. Developers and users of digital surgery must not only have an awareness of the ethical issues surrounding digital applications in healthcare, but also the ethical considerations unique to digital surgery. Future research into these issues must involve all digital surgery stakeholders including patients.
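The consensus threshold stated above can be written as a small check. This is only a sketch: the function name and the vote counts are hypothetical, and integer arithmetic is used so that a result of exactly 70% or 30% is not affected by floating-point rounding.

```python
def reaches_consensus(n_important, n_unimportant, n_panel):
    """Apply the stated rule: >70% rate the item important and <30% unimportant."""
    if n_panel <= 0:
        raise ValueError("panel size must be positive")
    important_ok = n_important * 100 > 70 * n_panel
    unimportant_ok = n_unimportant * 100 < 30 * n_panel
    return important_ok and unimportant_ok


# Hypothetical votes from a 38-member panel (not data from the study).
print(reaches_consensus(27, 5, 38))  # 27/38 is about 71.1%
print(reaches_consensus(26, 5, 38))  # 26/38 is about 68.4%
```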
Angular Velocity Comparison. Top: Comparison of angular velocity calculations using APDM Opal (blue) and iPhone (orange) samples with both devices placed at lumbar region. Bottom: Comparison of angular velocity calculated using Opal (blue) and Apple Watch (orange) samples with both devices placed on the same wrist of the subject.
Comparison between active vs. passive digital assessments.
Smartphones and wearables are widely recognised as the foundation for novel Digital Health Technologies (DHTs) for the clinical assessment of Parkinson’s disease. Yet, only limited progress has been made towards their regulatory acceptability as effective drug development tools. A key barrier in achieving this goal relates to the influence of a wide range of sources of variability (SoVs) introduced by measurement processes incorporating DHTs, on their ability to detect relevant changes to PD. This paper introduces a conceptual framework to assist clinical research teams investigating a specific Concept of Interest within a particular Context of Use, to identify, characterise, and when possible, mitigate the influence of SoVs. We illustrate how this conceptual framework can be applied in practice through specific examples, including two data-driven case studies.
Study inclusion flowchart.
Model performance assessment. Receiver operating characteristic (ROC) curves are shown for our (a) inpatient care and (b) critical care outcome prediction models. ROC curves and measurements of area under the curve (AUC) are shown for three separate validation cohorts: retrospective out-of-sample (retro), prospective but prior to decision support activation (silent) and prospective after decision support activation (visible). Performance assessment was limited to patients not meeting outcome criteria prior to ED disposition decision.
Clinical decision support interface. a Model-generated COVID-19 Deterioration Risk Levels were displayed in real-time for every patient with or under investigation for COVID-19 within the electronic health record (EHR). A screenshot of the emergency clinician disposition (Dispo) module is shown. b A hyperlink embedded within the Dispo module (bottom left of panel a) allowed emergency clinicians to access a more detailed explanation of model development and function within the EHR.
Study cohort characteristics.
Patient-oriented outcome measures.
Demand has outstripped healthcare supply during the coronavirus disease 2019 (COVID-19) pandemic. Emergency departments (EDs) are tasked with distinguishing patients who require hospital resources from those who may be safely discharged to the community. The novelty and high variability of COVID-19 have made these determinations challenging. In this study, we developed, implemented and evaluated an electronic health record (EHR) embedded clinical decision support (CDS) system that leverages machine learning (ML) to estimate short-term risk for clinical deterioration in patients with or under investigation for COVID-19. The system translates model-generated risk for critical care needs within 24 h and inpatient care needs within 72 h into rapidly interpretable COVID-19 Deterioration Risk Levels made viewable within ED clinician workflow. ML models were derived in a retrospective cohort of 21,452 ED patients who visited one of five ED study sites and were prospectively validated in 15,670 ED visits that occurred before (n = 4322) or after (n = 11,348) CDS implementation; model performance and numerous patient-oriented outcomes including in-hospital mortality were measured across study periods. Incidence of critical care needs within 24 h and inpatient care needs within 72 h were 10.7% and 22.5%, respectively, and were similar across study periods. ML model performance was excellent under all conditions, with AUC ranging from 0.85 to 0.91 for prediction of critical care needs and 0.80–0.90 for inpatient care needs. Total mortality was unchanged across study periods but was reduced among high-risk patients after CDS implementation.
The text-guided diffusion model GLIDE (Guided Language to Image Diffusion for Generation and Editing) is the state of the art in text-to-image generative artificial intelligence (AI). GLIDE has rich representations, but medical applications of this model have not been systematically explored. If GLIDE had useful medical knowledge, it could be used for medical image analysis tasks, a domain in which AI systems are still highly engineered towards a single use-case. Here we show that the publicly available GLIDE model has reasonably strong representations of key topics in cancer research and oncology, in particular the general style of histopathology images and multiple facets of diseases, pathological processes and laboratory assays. However, GLIDE seems to lack useful representations of the style and content of radiology data. Our findings demonstrate that domain-agnostic generative AI models can learn relevant medical concepts without explicit training. Thus, GLIDE and similar models might be useful for medical image processing tasks in the future - particularly with additional domain-specific fine-tuning.
Drug-drug interaction (DDI) Prediction Pipeline Overview. (Step 1) Reliable Food and Drug Administration (FDA) drug labels were used through the DailyMed website to build the pharmacokinetic (PK)-DDI dataset. A total of 38,711 FDA drug labels were obtained (Evaluation date: May 2020) from sentences/pictures/tables in the clinical pharmacology and drug interaction sections. (Step 2) Information on various drug properties from DrugBank (Evaluation date: March 2021) was obtained. Drug properties data may be arranged around the perpetrator and the victim drugs, and various polypeptides are radially linked. Polypeptide-PD (pharmacodynamics)-Drug-Type (PPDT) tokenization was proposed to represent drug pairs. A bag-of-words containing 2830 unique tokens was obtained. Each drug-drug pair was encoded as a 2830-dimensional vector through normalization with a Term Frequency-Inverse Document Frequency (tf-idf) matrix of bag-of-words. (Step 3) The Bagged (Bootstrap Aggregation) Tree method was used as an application model. The tree consisted of 615 branches and had 308 nodes for which fold change values were determined. (Step 4) A standalone application PK-DDI prediction (PK-DDIP) model is provided. Through this application, users may obtain predicted and reported fold change values, drug polypeptide information and its plot, single-nucleotide polymorphisms action, and alternative drug recommendation information at the 4th anatomical therapeutic chemical level.
Comparison with real patients' result. The comparison of pharmacokinetic drug-drug interaction prediction (PKDDIP) model results (predicted fold change values) and observed real-world patients' results using tacrolimus as a victim drug in a tertiary hospital clinical data warehouse. SNUH, Seoul National University Hospital.
Many machine learning techniques provide a simple prediction for drug-drug interactions (DDIs). However, a systematically constructed database with pharmacokinetic (PK) DDI information does not exist, nor is there a machine learning model that numerically predicts PK fold change (FC) with it. Therefore, we propose a PK DDI prediction (PK-DDIP) model for quantitative DDI prediction with high accuracy, while constructing a highly reliable PK-DDI database. Reliable information on 3,627 PK DDIs was constructed from 3,587 drugs using 38,711 Food and Drug Administration (FDA) drug labels. This PK-DDIP model predicted the FC of the area under the time-concentration curve (AUC) within ± 0.5959. The prediction proportions within 0.8–1.25-fold, 0.67–1.5-fold, and 0.5–2-fold of the AUC were 75.77, 86.68, and 94.76%, respectively. Two external validations confirmed good prediction performance for newly updated FDA labels and for FC from patients. This model enables potential DDI evaluation before clinical trials, which will save time and cost.
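The fold-range proportions reported above can be illustrated with a short sketch. It assumes that "within 0.8–1.25-fold" means the ratio of predicted to observed fold change lies in that interval; this reading, the function name, and the example values are assumptions and are not taken from the paper.

```python
def proportion_within_fold(predicted, observed, lower, upper):
    """Fraction of cases whose predicted/observed ratio lies in [lower, upper]."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, o in zip(predicted, observed) if lower <= p / o <= upper)
    return hits / len(predicted)


# Hypothetical predicted and observed AUC fold changes (not data from the study).
predicted = [1.0, 2.0, 0.5, 3.0]
observed = [1.1, 1.5, 1.0, 1.0]

within_narrow = proportion_within_fold(predicted, observed, 0.8, 1.25)
within_medium = proportion_within_fold(predicted, observed, 0.67, 1.5)
within_wide = proportion_within_fold(predicted, observed, 0.5, 2.0)
```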
Parkinson’s disease (PD) lacks sensitive, objective, and reliable measures for disease progression and response. This presents a challenge for clinical trials given the multifaceted and fluctuating nature of PD symptoms. Innovations in digital health and wearable sensors promise to more precisely measure aspects of patient function and well-being. Beyond research trials, digital biomarkers and clinical outcome assessments may someday support clinician-initiated or closed-loop treatment adjustments. A recent study from Verily Life Sciences presents results for a smartwatch-based motor exam intended to accelerate the development and evaluation of therapies for PD.
Networks of feature associations through pregnancy. a Venn diagram to show common features shared with three pregnancy periods, and specific features to each period. b The network to display the associations of selected clinical features with each pregnancy period. c The network for the 17 time points in the antepartum. The two networks were constructed by connecting predictive features with respective PE time point. The squares signify different time points of PE, and the round nodes represent the identified predictive features with their sizes proportional to the feature importance. The red edges indicate risk associations (adjusted OR > 1) while the blue edges indicate protective associations (adjusted OR < 1). The edge width reflects the significance of predictive features. Different feature categories are represented with different colors and also laid out together. The networks were visualized using Cytoscape 3.7.2.
Feature inspection for antepartum based on moving average. a 28 days moving average of systolic blood pressure for PE and control patients. The dashed line shows the normal range of systolic blood pressure. b Distribution of urine protein for PE and control patients. c 28 days moving average of fibrinogen for PE and control patients. The dashed line represents the reference ranges for fibrinogen. d 28 days moving average of mean corpuscular hemoglobin (HGB) for PE and control patients. The dashed line represents the normal range of mean corpuscular hemoglobin. In the moving average plots, the shaded areas indicate the standard deviation and solid lines represent the average value across the pregnancy.
Feature inspection for intrapartum based on SHAP value. a SHAP summary plot for top 20 clinical features for PE prediction shows the SHAP values for the most important features from gradient boosting model in the training data. Features in the summary plot (Y-axis) are ordered by the mean absolute SHAP values (in the parenthesis after each feature name), which represents the importance of the feature in driving the intrapartum PE prediction. Values of the feature for each patient are colored by their relative value with red indicating high value and blue indicating low value. Positive SHAP values indicate increased risks for intrapartum PE and negative values indicate protective effects to intrapartum PE. b The average feature group contribution calculated from averaging mean absolute SHAP values for each feature set. c The dependence plot with maximum SBP measured in antepartum versus PE relative risk, along with the interaction of African American race. The plot shows how different values of the feature can affect relative risks and ultimately impact classifier decision. Data points are colored by the African American race. The solid line represents the mean of SHAP values.
Feature inspection for postpartum based on SHAP value. a SHAP summary plot for top 20 features. Features in the summary plot (Y-axis) are ordered by the mean absolute SHAP values (in the parenthesis after each feature name), representing the importance of the feature in driving the postpartum PE prediction. Values of the feature for each patient are colored by their relative value with red signifying high value and blue signifying low value. Positive SHAP values indicate increased risks for postpartum PE and negative values indicate protective effects to postpartum PE. b The average feature category contribution. c The dependence plot of PE relative risk in terms of maximum SBP measured in postpartum. d The dependence plot of PE relative risk versus ibuprofen. The SHAP dependence plots indicate how different values of the features can affect relative risks and ultimately impact classifier decision for SBP and ibuprofen stratified by African American race. The solid line shows the mean of SHAP values.
Preeclampsia is a heterogeneous and complex disease associated with rising morbidity and mortality in pregnant women and newborns in the US. Early recognition of patients at risk is a pressing clinical need to reduce the risk of adverse outcomes. We assessed whether information routinely collected in electronic medical records (EMR) could enhance the prediction of preeclampsia risk beyond what is achieved in standard of care assessments. We developed a digital phenotyping algorithm to curate 108,557 pregnancies from EMRs across the Mount Sinai Health System, accurately reconstructing pregnancy journeys and normalizing these journeys across different hospital EMR systems. We then applied machine learning approaches to a training dataset (N = 60,879) to construct predictive models of preeclampsia across three major pregnancy time periods (ante-, intra-, and postpartum). The resulting models predicted preeclampsia with high accuracy across the different pregnancy periods, with areas under the receiver operating characteristic curves (AUC) of 0.92, 0.82, and 0.89 at 37 gestational weeks, intrapartum and postpartum, respectively. We observed comparable performance in two independent patient cohorts. While our machine learning approach identified known risk factors of preeclampsia (such as blood pressure, weight, and maternal age), it also identified other potential risk factors, such as complete blood count related characteristics for the antepartum period. Our model not only has utility for earlier identification of patients at risk for preeclampsia, but given the prediction accuracy exceeds what is currently achieved in clinical practice, our model provides a path for promoting personalized precision therapeutic strategies for patients at risk.
Participant demographics and characteristics at their first clinic visit.
Sensor-based remote monitoring could help better track Parkinson’s disease (PD) progression, and measure patients’ response to putative disease-modifying therapeutic interventions. To be useful, the remotely-collected measurements should be valid, reliable, and sensitive to change, and people with PD must engage with the technology. We developed a smartwatch-based active assessment that enables unsupervised measurement of motor signs of PD. Participants with early-stage PD (N = 388, 64% men, average age 63) wore a smartwatch for a median of 390 days. Participants performed unsupervised motor tasks both in-clinic (once) and remotely (twice weekly for one year). Dropout rate was 5.4%. Median wear-time was 21.1 h/day, and 59% of per-protocol remote assessments were completed. Analytical validation was established for in-clinic measurements, which showed moderate-to-strong correlations with consensus MDS-UPDRS Part III ratings for rest tremor (ρ = 0.70), bradykinesia (ρ = −0.62), and gait (ρ = −0.46). Test-retest reliability of remote measurements, aggregated monthly, was good-to-excellent (ICC = 0.75–0.96). Remote measurements were sensitive to the known effects of dopaminergic medication (on vs off Cohen’s d = 0.19–0.54). Of note, in-clinic assessments often did not reflect the patients’ typical status at home. This demonstrates the feasibility of smartwatch-based unsupervised active tests, and establishes the analytical validity of associated digital measurements. Weekly measurements provide a real-life distribution of disease severity, as it fluctuates longitudinally. Sensitivity to medication-induced change and improved reliability imply that these methods could help reduce sample sizes needed to demonstrate a response to therapeutic interventions or disease progression.
Selected AI devices that are reimbursed by US Medicare.
Over the past 7 years, regulatory agencies have approved hundreds of artificial intelligence (AI) devices for clinical use. In late 2020, payers began reimbursing clinicians and health systems for each use of select image-based AI devices. The experience with traditional medical devices has shown that per-use reimbursement may result in the overuse of AI. We review current models of paying for AI in medicine and describe five alternative and complementary reimbursement approaches, including incentivizing outcomes instead of volume, utilizing advance market commitments and time-limited reimbursements for new AI applications, and rewarding interoperability and bias mitigation. As AI rapidly integrates into routine healthcare, careful design of payment for AI is essential for improving patient outcomes while maximizing cost-effectiveness and equity.
Due to its enormous capacity for benefit, harm, and cost, health care is among the most tightly regulated industries in the world. But with the rise of smartphones, an explosion of direct-to-consumer mobile health applications has challenged the role of centralized gatekeepers. As interest in health apps continues to climb, national regulatory bodies have turned their attention toward strategies to protect consumers from apps that mine and sell health data, recommend unsafe practices, or simply do not work as advertised. To characterize the current state and outlook of these efforts, Essén and colleagues map the nascent landscape of national health app policies and raise several considerations for cross-border collaboration. Strategies to increase transparency, organize app marketplaces, and monitor existing apps are needed to ensure that the global wave of new digital health tools fulfills its promise to improve health at scale.
Store-and-forward consultation steps. Capture gross image of lesion. Wipe lesion with alcohol pad before taking dermoscopy photos.
The Decision Tree of literature search. Literature search, review, and article inclusions.
Teledermoscopy, or the utilization of dermatoscopic images in telemedicine, can help diagnose dermatologic disease remotely, triage lesions of concern (i.e., determine whether to recommend in-person consultation with a dermatologist, perform a biopsy, or reassure the patient), and monitor dermatologic lesions over time. Handheld dermatoscopes, a type of magnifying apparatus, have become commonly utilized tools for providers in many healthcare settings and professions and allow users to view microstructures of the epidermis and dermis. This Dermoscopy Practice Guideline reflects current knowledge in the field of telemedicine to demonstrate the correct capture, usage, and incorporation of dermoscopic images into everyday practice.
Digital approaches are increasingly common in clinical trial recruitment, retention, analysis, and dissemination. Community engagement processes have contributed to the successful implementation of clinical trials and are crucial in enhancing equity in trials. However, few studies focus on how digital approaches can be implemented to enhance community engagement in clinical trials. This narrative review examines three key areas for digital approaches to deepen community engagement in clinical trials—the use of digital technology for trial processes to decentralize trials, digital crowdsourcing to develop trial components, and digital qualitative research methods. We highlight how digital approaches enhanced community engagement through a greater diversity of participants, and deepened community engagement through the decentralization of research processes. We discuss new possibilities that digital technologies offer for community engagement, and highlight potential strengths, weaknesses, and practical considerations. We argue that strengthening community engagement using a digital approach can enhance equity and improve health outcomes.
CAD PRS and statin efficacy. A CAD PRS can be introduced as a "risk enhancer" under current clinical guidelines (left), influencing preventive health decision-making through either re-classification of individuals across clinical risk tiers or, more often, by influencing the degree of therapeutic intervention used within each clinical risk tier. Each human figure represents a statin treated individual, with the figure in orange representing a heart attack prevented. The CAD PRS tiers and number needed to treat values depicted are derived from the landmark Mega et al. study (ref. 18).
MyGeneRank Screenshots. The two central screens depicted return of CAD PRS results alone (left) as well as the integration of the CAD PRS with 10-year clinical risk in a dynamic risk-reducing interface.
Initiation and discontinuation of lipid-lowering therapy.
Changes in the use of lipid-lowering therapy.
We developed a smartphone application, MyGeneRank, to conduct a prospective observational cohort study (NCT03277365) involving the automated generation, communication, and electronic capture of response to a polygenic risk score (PRS) for coronary artery disease (CAD). Adults with a smartphone and an existing 23andMe genetic profiling self-referred to the study. We evaluated self-reported actions taken in response to personal CAD PRS information, with special interest in the initiation of lipid-lowering therapy. 19% (721/3,800) of participants provided complete responses for baseline and follow-up use of lipid-lowering therapy. 20% (n = 19/95) of high CAD PRS vs 7.9% (n = 8/101) of low CAD PRS participants initiated lipid-lowering therapy at follow-up (p-value = 0.002). Initiation of both statin and non-statin lipid-lowering therapy was associated with degree of CAD PRS: 15.2% (n = 14/92) vs 6.0% (n = 6/100) for statins (p-value = 0.018) and 6.8% (n = 8/118) vs 1.6% (n = 2/123) for non-statins (p-value = 0.022) in high vs low CAD PRS, respectively. High CAD PRS was also associated with earlier initiation of lipid-lowering therapy (average age of 52 vs 65 years in high vs low CAD PRS respectively, p-value = 0.007). Overall, degree of CAD PRS was associated with use of any lipid-lowering therapy at follow-up: 42.4% (n = 56/132) vs 28.5% (n = 37/130) (p-value = 0.009). We find that digital communication of personal CAD PRS information is associated with increased and earlier lipid-lowering initiation in individuals of high CAD PRS. Loss to follow-up is the primary limitation of this study. Alternative communication routes and long-term studies with EHR-based outcomes are needed to understand the generalizability and durability of this finding.
SARS-CoV-2 case counts by phenotyping strategy. The absolute cumulative SARS-CoV-2 cases by adjudication strategy across the study period. The cases are based on either principal diagnosis or any diagnosis, compared with a polymerase chain reaction or antigen test for SARS-CoV-2.
Overlap of SARS-CoV-2 case counts by computational phenotyping strategies. Computable phenotypes for SARS-CoV-2 infection across the study period at Yale New Haven Health System.
SARS-CoV-2 case counts by computational phenotyping strategies in the Mayo Clinic System. Computable phenotypes for SARS-CoV-2 infection across the study period at the Mayo Clinic System, a across all Mayo Clinic sites, b Rochester, c Arizona, and d Florida.
Diagnosis codes are used to study SARS-CoV-2 infections and COVID-19 hospitalizations in administrative and electronic health record (EHR) data. Using EHR data (April 2020–March 2021) at the Yale-New Haven Health System and the three hospital systems of the Mayo Clinic, computable phenotype definitions based on ICD-10 diagnosis of COVID-19 (U07.1) were evaluated against positive SARS-CoV-2 PCR or antigen tests. We included 69,423 patients at Yale and 75,748 at Mayo Clinic with either a diagnosis code or a positive SARS-CoV-2 test. The precision and recall of a COVID-19 diagnosis for a positive test were 68.8% and 83.3%, respectively, at Yale, with higher precision (95%) and lower recall (63.5%) at Mayo Clinic, varying from 59.2% in Rochester to 97.3% in Arizona. For hospitalizations with a principal COVID-19 diagnosis, 94.8% at Yale and 80.5% at Mayo Clinic had an associated positive laboratory test, with secondary diagnosis of COVID-19 identifying additional patients. These patients had a twofold higher in-hospital mortality than those identified by principal diagnosis. Standardization of coding practices is needed before the use of diagnosis codes in clinical research and epidemiological surveillance of COVID-19.
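The precision and recall figures above treat a positive laboratory test as the reference and the diagnosis code as the phenotype under evaluation. A minimal sketch of that calculation follows; the function name and the patient identifiers are hypothetical and do not come from the study data.

```python
def precision_recall(coded_ids, test_positive_ids):
    """Precision and recall of a diagnosis code against a positive test.

    precision = coded patients with a positive test / all coded patients
    recall    = coded patients with a positive test / all test-positive patients
    """
    coded = set(coded_ids)
    positive = set(test_positive_ids)
    if not coded or not positive:
        raise ValueError("both groups must be non-empty")
    true_positives = len(coded & positive)
    return true_positives / len(coded), true_positives / len(positive)


# Hypothetical cohort: every patient has a diagnosis code, a positive test, or both.
coded_ids = [1, 2, 3, 4]
test_positive_ids = [3, 4, 5, 6, 7, 8]
precision, recall = precision_recall(coded_ids, test_positive_ids)
```

In this toy cohort half of the coded patients have a positive test and one third of the test-positive patients carry the code, which shows how the two measures can diverge between sites with different coding practices.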
The COVID-19 pandemic has pushed healthcare systems globally to a breaking point. The urgent need for effective and affordable COVID-19 treatments calls for repurposing combinations of approved drugs. The challenge is to identify which combinations are likely to be most effective and at what stages of the disease. Here, we present the first disease-stage executable signalling network model of SARS-CoV-2-host interactions used to predict effective repurposed drug combinations for treating early- and late-stage severe disease. Using our executable model, we performed in silico screening of 9870 pairs of 140 potential targets and have identified nine new drug combinations. Camostat and Apilimod were predicted to be the most promising combination in effectively suppressing viral replication in the early stages of severe disease and were validated experimentally in human Caco-2 cells. Our study further demonstrates the power of executable mechanistic modelling to enable rapid pre-clinical evaluation of combination therapies tailored to disease progression. It also presents a novel resource and expandable model system that can respond to further needs in the pandemic.
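The size of the pair screen can be pictured by enumerating unordered target pairs. The sketch below uses placeholder target names. Counting pairs in which a target may be paired with itself gives 140 × 141 / 2 = 9870, which matches the stated figure, whereas strictly distinct pairs would give 9730. Whether the authors counted pairs this way is an assumption, not something the abstract states.

```python
from itertools import combinations, combinations_with_replacement

# Placeholder names; the real target list is not given in the abstract.
targets = [f"target_{i:03d}" for i in range(140)]

# Unordered pairs of two different targets.
distinct_pairs = list(combinations(targets, 2))

# Unordered pairs that also allow a target paired with itself
# (assumed here to stand for single-target perturbations).
pairs_with_self = list(combinations_with_replacement(targets, 2))

print(len(distinct_pairs))
print(len(pairs_with_self))
```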
To identify Coronavirus disease (COVID-19) cases efficiently, affordably, and at scale, recent work has shown how audio (including cough, breathing and voice) based approaches can be used for testing. However, there is a lack of exploration of how biases and methodological decisions impact these tools’ performance in practice. In this paper, we explore the realistic performance of audio-based digital testing of COVID-19. To investigate this, we collected a large crowdsourced respiratory audio dataset through a mobile app, alongside symptoms and COVID-19 test results. Within the collected dataset, we selected 5240 samples from 2478 English-speaking participants and split them into participant-independent sets for model development and validation. In addition to controlling the language, we also balanced demographics for model training to avoid potential acoustic bias. We used these audio samples to construct an audio-based COVID-19 prediction model. The unbiased model took features extracted from breathing, coughs and voice signals as predictors and yielded an AUC-ROC of 0.71 (95% CI: 0.65–0.77). We further explored several scenarios with different types of unbalanced data distributions to demonstrate how biases and participant splits affect the performance. With these different, but less appropriate, evaluation strategies, the performance could be overestimated, reaching an AUC up to 0.90 (95% CI: 0.85–0.95) in some circumstances. We found that an unrealistic experimental setting can result in misleading, sometimes over-optimistic, performance. Instead, we reported complete and reliable results on crowd-sourced data, which would allow medical professionals and policy makers to accurately assess the value of this technology and facilitate its deployment.
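A participant-independent split, as mentioned above, assigns whole participants rather than individual recordings to the development and validation sets, so that no voice appears on both sides. The sketch below is a minimal illustration; the function name, the 30% test fraction, and the toy sample identifiers are all assumptions rather than details from the study.

```python
import random


def participant_independent_split(samples, test_fraction=0.3, seed=0):
    """Split (participant_id, sample) tuples so no participant is in both sets."""
    participants = sorted({pid for pid, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(participants)
    n_test = max(1, round(len(participants) * test_fraction))
    held_out = set(participants[:n_test])
    train = [s for s in samples if s[0] not in held_out]
    test = [s for s in samples if s[0] in held_out]
    return train, test


# Hypothetical data: 25 recordings from 10 participants.
samples = [(f"p{i % 10}", f"audio_{i}") for i in range(25)]
train, test = participant_independent_split(samples)

train_ids = {pid for pid, _ in train}
test_ids = {pid for pid, _ in test}
print(len(train), len(test), train_ids.isdisjoint(test_ids))
```

Splitting at the recording level instead would let a model recognise a participant it has already heard, which is one way the inflated AUC values described in the abstract can arise.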
Approaches to digital health tool selection. Various digital health-product-selection approaches, and important considerations for each approach. We recommend investigating the viability of all four possibilities in parallel; the optimal approach will depend on the type of problem being addressed and characteristics of the health system.
In recent years, the number of digital health tools with the potential to significantly improve delivery of healthcare services has grown tremendously. However, the use of these tools in large, complex health systems remains comparatively limited. The adoption and implementation of digital health tools at an enterprise level is a challenge; few strategies exist to help tools cross the chasm from clinical validation to integration within the workflows of a large health system. Many previously proposed frameworks for digital health implementation are difficult to operationalize in these dynamic organizations. In this piece, we put forth nine dimensions along which clinically validated digital health tools should be examined by health systems prior to adoption, and propose strategies for selecting digital health tools and planning for implementation in this setting. By evaluating prospective tools along these dimensions, health systems can evaluate which existing digital health solutions are worthy of adoption, ensure they have sufficient resources for deployment and long-term use, and devise a strategic plan for implementation.
Correspondence between structured and unstructured codes.
Performance of NBC models on the test set.
Performance of BRF models on the test set.
Structured-unstructured feature pairs A-B with high interaction heterogeneity (IH) values.
Clinical risk prediction models powered by electronic health records (EHRs) are becoming increasingly widespread in clinical practice. With suicide-related mortality rates rising in recent years, it is becoming increasingly urgent to understand, predict, and prevent suicidal behavior. Here, we compare the predictive value of structured and unstructured EHR data for predicting suicide risk. We find that Naive Bayes Classifier (NBC) and Random Forest (RF) models trained on structured EHR data perform better than those based on unstructured EHR data. An NBC model trained on both structured and unstructured data yields similar performance (AUC = 0.743) to an NBC model trained on structured data alone (0.742, p = 0.668), while an RF model trained on both data types yields significantly better results (AUC = 0.903) than an RF model trained on structured data alone (0.887, p < 0.001), likely due to the RF model’s ability to capture interactions between the two data types. To investigate these interactions, we propose and implement a general framework for identifying specific structured-unstructured feature pairs whose interactions differ between case and non-case cohorts, and thus have the potential to improve predictive performance and increase understanding of clinical risk. We find that such feature pairs tend to capture heterogeneous pairs of general concepts, rather than homogeneous pairs of specific concepts. These findings and this framework can be used to improve current and future EHR-based clinical modeling efforts.
Human versus hybrid artificial intelligence (AI)-human workflow for serious illness communication (SIC). The current workflow relies on human judgment to identify SIC-eligible patients and manual effort to initiate SIC, to document SIC, and to locate SIC documentation in the electronic health record (EHR). A hybrid AI-human workflow would leverage AI to identify SIC-eligible patients more accurately and to streamline the workflow by helping complete essential menial tasks, thus ensuring more seriously ill patients will receive timely SIC and allowing clinicians more time and energy to focus on the higher-order cognitive and emotional tasks, including problem-solving. Natacha Meyer designed and illustrated the figure and provided permission to use this figure in the manuscript.
Delivery of serious illness communication (SIC) is necessary to ensure that all seriously ill patients receive goal-concordant care. However, the current SIC delivery process contains barriers that prevent the delivery of timely and effective SIC. In this paper, we describe the current bottlenecks of the traditional SIC workflow and explore how a hybrid artificial intelligence-human workflow may improve the efficiency and effectiveness of SIC delivery in busy practice settings.
Artificial intelligence (AI) centred diagnostic systems are increasingly recognised as robust solutions in healthcare delivery pathways. In turn, there has been a concurrent rise in secondary research studies regarding these technologies in order to influence key clinical and policymaking decisions. It is therefore essential that these studies accurately appraise methodological quality and risk of bias within shortlisted trials and reports. In order to assess whether this critical step is performed, we undertook a meta-research study evaluating adherence to the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool within AI diagnostic accuracy systematic reviews. A literature search was conducted on all studies published from 2000 to December 2020. Of 50 included reviews, 36 performed the quality assessment, of which 27 utilised the QUADAS-2 tool. Bias was reported across all four domains of QUADAS-2. Two hundred forty-three of 423 studies (57.5%) across all systematic reviews utilising QUADAS-2 reported a high or unclear risk of bias in the patient selection domain, 110 (26%) reported a high or unclear risk of bias in the index test domain, 121 (28.6%) in the reference standard domain and 157 (37.1%) in the flow and timing domain. This study demonstrates the incomplete uptake of quality assessment tools in reviews of AI-based diagnostic accuracy studies and highlights inconsistent reporting across all domains of quality assessment. Poor standards of reporting act as barriers to clinical implementation. The creation of an AI-specific extension for quality assessment tools of diagnostic accuracy AI studies may facilitate the safe translation of AI tools into clinical practice.
Study overview and system architecture. a Overview of our study methodology. b System architecture of PhenoPad: 1. Cloud server with data storage; 2. Audio capturing devices including a Raspberry Pi, a microphone array, and a power bank; 3. Note taking devices including a Microsoft Surface Book and a Surface Pen.
Note generation interface design. a Speech Transcripts Panel is for presenting the transcripts of the conversations (right) and medical information recognized from the transcripts (left). When clicking on a medical term, the positions where it appears are highlighted as yellow bars (or an orange bar if it is the one currently being shown). b Note Writing Panel is a regular text box for note writing and editing. Clinicians can add information from (a) into (b) by dragging and dropping the text in (a) into the desired position in (b). c Raw Notes Panel presents the raw notes including handwritings, drawings, photos, and/or videos. Clinicians have the option to switch between (a) and (c). d ICD 10 List contains a list of ICD 10 codes recognized from the conversation and used for billing purposes.
Experience of patients with PhenoPad and usability evaluation by physicians. a Results of the questionnaire for evaluating patients' experience with PhenoPad. b Evaluation results on the usability of PhenoPad: b1 System usability scale assessment results. b2 Component-level Likert-scale assessment results.
Current clinical note-taking approaches cannot capture the entirety of information available from patient encounters and detract from patient-clinician interactions. By surveying healthcare providers’ current note-taking practices and attitudes toward new clinical technologies, we developed a patient-centered paradigm for clinical note-taking that makes use of hybrid tablet/keyboard devices and artificial intelligence (AI) technologies. PhenoPad is an intelligent clinical note-taking interface that captures free-form notes and standard phenotypic information via a variety of modalities, including speech and natural language processing techniques, handwriting recognition, and more. The output is unobtrusively presented on mobile devices to clinicians for real-time validation and can be automatically transformed into digital formats that would be compatible with integration into electronic health record systems. Semi-structured interviews and trials in clinical settings rendered positive feedback from both clinicians and patients, demonstrating that AI-enabled clinical note-taking under our design improves ease and breadth of information captured during clinical visits without compromising patient-clinician interactions. We open source a proof-of-concept implementation that can lay the foundation for broader clinical use cases.
During the critical early stages of an emerging pandemic, limited availability of pathogen-specific testing can severely inhibit individualized risk screening and pandemic tracking. Standard clinical laboratory tests offer a widely available complementary data source for first-line risk screening and pandemic surveillance. Here, we propose an integrated framework for developing clinical-laboratory indicators for novel pandemics that combines population-level and individual-level analyses. We apply this framework to 7,520,834 clinical laboratory tests recorded over five years and find clinical-lab-test combinations that are strongly associated with SARS-CoV-2 PCR test results and Multisystem Inflammatory Syndrome in Children (MIS-C) diagnoses: Interleukin-related tests (e.g. IL4, IL10) were most strongly associated with SARS-CoV-2 infection and MIS-C, while other more widely available tests (ferritin, D-dimer, fibrinogen, alanine transaminase, and C-reactive protein) also had strong associations. When novel pandemics emerge, this framework can be used to identify specific combinations of clinical laboratory tests for public health tracking and first-line individualized risk screening.
PRISMA study flow diagram. Summary of number of articles screened and included.
Publication trends for machine learning studies in vascular surgery between 1991 and 2021. Each bar represents a 5-year interval.
Machine learning (ML) is a rapidly advancing field with increasing utility in health care. We conducted a systematic review and critical appraisal of ML applications in vascular surgery. MEDLINE, Embase, and Cochrane CENTRAL were searched from inception to March 1, 2021. Study screening, data extraction, and quality assessment were performed by two independent reviewers, with a third author resolving discrepancies. All original studies reporting ML applications in vascular surgery were included. Publication trends, disease conditions, methodologies, and outcomes were summarized. Critical appraisal was conducted using the PROBAST risk-of-bias and TRIPOD reporting adherence tools. We included 212 studies from a pool of 2235 unique articles. ML techniques were used for diagnosis, prognosis, and image segmentation in carotid stenosis, aortic aneurysm/dissection, peripheral artery disease, diabetic foot ulcer, venous disease, and renal artery stenosis. The number of publications on ML in vascular surgery increased from 1 (1991–1996) to 118 (2016–2021). Most studies were retrospective and single center, with no randomized controlled trials. The median area under the receiver operating characteristic curve (AUROC) was 0.88 (range 0.61–1.00), with 79.5% (62/78) of studies reporting AUROC ≥ 0.80. Out of 22 studies comparing ML techniques to existing prediction tools, clinicians, or traditional regression models, 20 performed better and 2 performed similarly. Overall, 94.8% (201/212) of studies had a high risk of bias, and adherence to reporting standards was poor, with a rate of 41.4%. Despite improvements over time, study quality and reporting remain inadequate. Future studies should consider standardized tools such as PROBAST and TRIPOD to improve study quality and clinical applicability.
Study cohorts' summary. A Dataset generation based on emergency department visits to an academic medical center and a community hospital; B The distributions of ECG-K⁺ and Lab-K⁺ at the academic medical center and community hospital.
Risk matrices of different ECG-K⁺ and Lab-K⁺ groups on adverse outcomes in combined analysis. The baseline model of combined analysis is adjusted for each hospital site and based on a Cox proportional hazards model or logistic regression as appropriate for each outcome. The color gradient represents the risk of the corresponding group and non-significant results are colored white. Model 1 includes significant demographic data (All-cause mortality: gender, age, SBP, and DBP; Hospitalization: gender, age, BMI, DBP, and smoking; ED revisit in 30 days: gender). Model 2 includes the variables in model 1 and additional significant disease histories (All-cause mortality: HLP; Hospitalization: HLP, STK, and HF; ED revisit in 30 days: DM, CAD, STK, and COPD). Model 3 includes the variables in model 2 and additional significant laboratory tests (All-cause mortality: Hb, HCO3, blood pH, Na, AST, ALT, Alb, CRP, pBNP, and D-dimer; Hospitalization: WBC, Hb, PLT, HCO3, pH, Na, Cl, tCa, GLU, AST, CK, Alb, CRP, TnI, and D-dimer; ED revisit in 30 days: Hb and Na).
Dyskalemias are common electrolyte disorders associated with high cardiovascular risk. Artificial intelligence (AI)-assisted electrocardiography (ECG) has been evaluated as an early-detection approach for dyskalemia. The aims of this study were to determine the clinical accuracy of AI-assisted ECG for dyskalemia and prognostic ability on clinical outcomes such as all-cause mortality, hospitalizations, and ED revisits. This retrospective cohort study was done at two hospitals within a health system from May 2019 to December 2020. In total, 26,499 patients with 34,803 emergency department (ED) visits to an academic medical center and 6492 ED visits from 4747 patients to a community hospital who had a 12-lead ECG to estimate ECG-K⁺ and serum laboratory potassium measurement (Lab-K⁺) within 1 h were included. ECG-K⁺ had mean absolute errors (MAEs) of ≤0.365 mmol/L. Area under receiver operating characteristic curves for ECG-K⁺ to predict moderate-to-severe hypokalemia (Lab-K⁺ ≤ 3 mmol/L) and moderate-to-severe hyperkalemia (Lab-K⁺ ≥ 6 mmol/L) were >0.85 and >0.95, respectively. The U-shaped relationships between K⁺ concentration and adverse outcomes were more prominent for ECG-K⁺ than for Lab-K⁺. ECG-K⁺ and Lab-K⁺ hyperkalemia were associated with high HRs for 30-day all-cause mortality. Compared to hypokalemic Lab-K⁺, patients with hypokalemic ECG-K⁺ had significantly higher risk for adverse outcomes after full confounder adjustment. In addition, patients with normal Lab-K⁺ but dyskalemic ECG-K⁺ (pseudo-positive) also exhibited more co-morbidities and had worse outcomes. Point-of-care bloodless AI ECG-K⁺ not only rapidly identified potentially severe hypo- and hyperkalemia, but also may serve as a biomarker for medical complexity and an independent predictor for adverse outcomes.
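The area under the ROC curve for detecting hypokalemia can be read as the probability that a randomly chosen hypokalemic visit has a lower ECG-K⁺ estimate than a randomly chosen non-hypokalemic visit. The sketch below computes it by direct pairwise comparison. The paired potassium values are invented for illustration; only the Lab-K⁺ ≤ 3 mmol/L threshold comes from the abstract.

```python
def auroc(scores, labels):
    """AUROC as P(positive score > negative score); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical paired measurements in mmol/L.
ecg_k = [2.8, 3.1, 3.9, 4.2, 2.9, 4.5, 3.4, 5.0]
lab_k = [2.7, 2.9, 4.0, 4.1, 3.3, 4.4, 3.0, 4.8]

labels = [k <= 3.0 for k in lab_k]   # moderate-to-severe hypokalemia by Lab-K+
scores = [-k for k in ecg_k]         # lower ECG-K+ means a higher risk score
auc = auroc(scores, labels)
print(f"AUROC = {auc:.3f}")
```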
AUCs of 3555 operations for predicting death in the next 7 days. Colored vertical bars from left to right indicate the AUCs of the standard deviation of oxygen saturation, mean heart rate, mean oxygen saturation, standard deviation of heart rate, and a novel measure, successive increases of heart rate.
Heat maps of the absolute values of the correlation coefficients between results of operations. a Correlations between all 3555 candidate algorithmic operations. b Correlations between 20 cluster medoids. The reduced feature set explains 81% of the variance in the full set.
Demographics of the patient population.
Causes of death.
Model performances as a function of days until death.
To seek new signatures of illness in heart rate and oxygen saturation vital signs from Neonatal Intensive Care Unit (NICU) patients, we implemented highly comparative time-series analysis to discover features of all-cause mortality in the next 7 days. We collected 0.5 Hz heart rate and oxygen saturation vital signs of infants in the University of Virginia NICU from 2009 to 2019. We applied 4998 algorithmic operations from 11 mathematical families to random daily 10 min segments from 5957 NICU infants, 205 of whom died. We clustered the results and selected a representative from each, and examined multivariable logistic regression models. 3555 operations were usable; 20 cluster medoids held more than 81% of the information, and a multivariable model had AUC 0.83. New algorithms outperformed others: moving threshold, successive increases, surprise, and random walk. We computed provenance of the computations and constructed a software library with links to the data. We conclude that highly comparative time-series analysis revealed new vital sign measures to identify NICU patients at the highest risk of death in the next week.
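Choosing one representative per cluster, as described above, is commonly done by taking the medoid: the member with the smallest total distance to the other members. The sketch below shows that selection on a toy cluster. The operation names and distances are hypothetical, and the use of one minus absolute correlation as the distance is an assumption suggested by the correlation heat maps, not a stated detail.

```python
def medoid(cluster, distance):
    """Return the member with the smallest total distance to all members."""
    return min(cluster, key=lambda a: sum(distance[a][b] for b in cluster))


# Hypothetical pairwise distances, e.g. 1 - |correlation| between the
# outputs of three candidate operations in the same cluster.
distance = {
    "op_a": {"op_a": 0.0, "op_b": 0.1, "op_c": 0.3},
    "op_b": {"op_a": 0.1, "op_b": 0.0, "op_c": 0.15},
    "op_c": {"op_a": 0.3, "op_b": 0.15, "op_c": 0.0},
}

representative = medoid(["op_a", "op_b", "op_c"], distance)
print(representative)
```

Unlike a centroid, a medoid is always an actual operation from the set, so the reduced feature list stays directly interpretable.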
While COVID-19 diagnosis and prognosis artificial intelligence models exist, very few can be implemented for practical use given their high risk of bias. We aimed to develop a diagnosis model that addresses notable shortcomings of prior studies, integrating it into a fully automated triage pipeline that examines chest radiographs for the presence, severity, and progression of COVID-19 pneumonia. Scans were collected using the DICOM Image Analysis and Archive, a system that communicates with a hospital’s image repository. The authors collected over 6,500 non-public chest X-rays comprising diverse COVID-19 severities, along with radiology reports and RT-PCR data. The authors provisioned one internally held-out and two external test sets to assess model generalizability and compare performance to traditional radiologist interpretation. The pipeline was evaluated on a prospective cohort of 80 radiographs, reporting a 95% diagnostic accuracy. The study mitigates bias in AI model development and demonstrates the value of an end-to-end COVID-19 triage platform.
Agreement between the reference heart rate values (computed from the ECG and PPG) and the camera estimates for the valid video camera data, comprising a total estimated time of approximately 103.0 h. a The differences between the camera and reference monitors are normally distributed. b The Bland-Altman plot presents a minimal sensor bias. c The scatter plot shows high correlation between the two devices, with a Pearson correlation coefficient of 0.98. d The distribution of the mean values shows that most of the heart rate estimates are within the expected physiological range for adults.
Visualisation of the local respiratory effort of a patient later diagnosed as having a ruptured diaphragm, as confirmed by chest CT, and a right-sided pneumothorax. (a) Respiratory rate estimated from the camera. Heat maps show the magnitude of the respiratory effort from regions on the chest during time stages (b) S1, (c) S2 and (d) S3. Each super-pixel in the heat maps was coloured according to the amplitude of the respiratory signal (in pixel units) computed from the 30 × 30 ROI centred on the super-pixel. The non-stationary respiratory rate was measured with a sustained increase in respiratory rate from t = 18 min to t = 27 min. A progression towards subdiaphragmatic left-lateral respiratory movements is observed as increased signal intensity over the lower left side in the respiratory map for stage S3.
Summary of the study population.
Summary of vital-sign estimation results for all recording sessions.
Summary of vital-sign estimation results for the day-time and night-time periods.
Prolonged non-contact camera-based monitoring in critically ill patients presents unique challenges, but may facilitate safe recovery. A study was designed to evaluate the feasibility of introducing a non-contact video camera monitoring system into an acute clinical setting. We assessed the accuracy and robustness of the video camera-derived estimates of the vital signs against the electronically recorded reference values in both day and night environments. We demonstrated non-contact monitoring of heart rate and respiratory rate for extended periods of time in 15 post-operative patients. Across day and night, heart rate was estimated for up to 53.2% (103.0 h) of the total valid camera data with a mean absolute error (MAE) of 2.5 beats/min in comparison to two reference sensors. We obtained respiratory rate estimates for 63.1% (119.8 h) of the total valid camera data with an MAE of 2.4 breaths/min against the reference value computed from the chest impedance pneumogram. Non-contact estimates detected relevant changes in the vital-sign values between routine clinical observations. Pivotal respiratory events in a post-operative patient could be identified from the analysis of video-derived respiratory information. Continuous vital-sign monitoring supported by non-contact video camera estimates could be used to track early signs of physiological deterioration during post-operative care.
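Agreement figures of this kind (the MAE against a reference sensor, and the bias shown in a Bland-Altman plot) can be computed as in the sketch below. The heart rate values are invented for illustration, and the 1.96 × SD limits of agreement are the conventional Bland-Altman choice rather than a detail confirmed by the abstract.

```python
from statistics import mean, stdev


def mean_absolute_error(estimates, reference):
    """Average absolute difference between paired estimates and references."""
    return mean(abs(e - r) for e, r in zip(estimates, reference))


def bland_altman(estimates, reference):
    """Return bias and 95% limits of agreement (bias +/- 1.96 SD of differences)."""
    diffs = [e - r for e, r in zip(estimates, reference)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical paired heart rates in beats/min.
camera_hr = [72.0, 80.0, 65.0, 90.0, 77.0]
reference_hr = [70.0, 83.0, 66.0, 88.0, 77.0]

mae = mean_absolute_error(camera_hr, reference_hr)
bias, lower, upper = bland_altman(camera_hr, reference_hr)
print(f"MAE={mae:.2f} bias={bias:.2f} limits=({lower:.2f}, {upper:.2f})")
```

MAE and bias answer different questions: errors of opposite sign cancel in the bias but not in the MAE, so a near-zero bias can coexist with a non-trivial MAE.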
Quality of the literature by each domain. The figure shows the number of studies scoring on each study quality item. 2 points are given for fully addressing quality criteria, 1 point for partially addressing quality criteria, and 0 points for failing to address quality criteria.
Details for studies analysing combined features using regression models.
The use of digital tools to measure physiological and behavioural variables of potential relevance to mental health is a growing field sitting at the intersection between computer science, engineering, and clinical science. We summarised the literature on remote measuring technologies, mapping methodological challenges and threats to reproducibility, and identified leading digital signals for depression. Medical and computer science databases were searched between January 2007 and November 2019. Published studies linking depression and objective behavioural data obtained from smartphone and wearable device sensors in adults with unipolar depression and healthy subjects were included. A descriptive approach was taken to synthesise study methodologies. We included 51 studies and found threats to reproducibility and transparency arising from failure to provide comprehensive descriptions of recruitment strategies, sample information, feature construction and the determination and handling of missing data. The literature is characterised by small sample sizes, short follow-up duration and great variability in the quality of reporting, limiting the interpretability of pooled results. Bivariate analyses show consistency in statistically significant associations between depression and digital features from sleep, physical activity, location, and phone use data. Machine learning models indicated that aggregated features have predictive value. Given the pitfalls in the combined literature, these results should be taken purely as a starting point for hypothesis generation. Since this research is ultimately aimed at informing clinical practice, we recommend improvements in reporting standards including consideration of generalisability and reproducibility, such as wider diversity of samples, thorough reporting methodology and the reporting of potential bias in studies with numerous features.
While the opportunities of ML and AI in healthcare are promising, the growth of complex data-driven prediction models requires careful quality and applicability assessment before they are applied and disseminated in daily practice. This scoping review aimed to identify actionable guidance for those closely involved in AI-based prediction model (AIPM) development, evaluation and implementation including software engineers, data scientists, and healthcare professionals and to identify potential gaps in this guidance. We performed a scoping review of the relevant literature providing guidance or quality criteria regarding the development, evaluation, and implementation of AIPMs using a comprehensive multi-stage screening strategy. PubMed, Web of Science, and the ACM Digital Library were searched, and AI experts were consulted. Topics were extracted from the identified literature and summarized across the six phases at the core of this review: (1) data preparation, (2) AIPM development, (3) AIPM validation, (4) software development, (5) AIPM impact assessment, and (6) AIPM implementation into daily healthcare practice. From 2683 unique hits, 72 relevant guidance documents were identified. Substantial guidance was found for data preparation, AIPM development and AIPM validation (phases 1–3), while the later phases (software development, impact assessment and implementation) have clearly received less attention in the scientific literature. The six phases of the AIPM development, evaluation and implementation cycle provide a framework for responsible introduction of AI-based prediction models in healthcare. Additional domain- and technology-specific research may be necessary and more practical experience with implementing AIPMs is needed to support further guidance.
In times of crisis, communication by leaders is essential for mobilizing an effective public response. During the COVID-19 pandemic, compliance with public health guidelines has been critical for the prevention of infections and deaths. We assembled a corpus of over 1500 pandemic-related speeches, containing over 4 million words, delivered by all 50 US state governors during the initial months of the COVID-19 pandemic. We analyzed the semantic, grammatical and linguistic-complexity properties of these speeches, and examined their relationships to COVID-19 case rates over space and time. We found that as COVID-19 cases rose, governors used stricter language to issue guidance, employed greater negation to defend their actions and highlight prevailing uncertainty, and used more extreme descriptive adjectives. As cases surged to their highest levels, governors used shorter words with fewer syllables. Investigating and understanding such characteristic responses to stress is important for improving effective public communication during major health crises.
Performance of the multimodal HAIM framework on various demonstrations for healthcare operations. a Average and standard deviation values of the area under the receiver operating characteristic curve (AUROC) for all demonstrations including pathology diagnosis (i.e., lung lesions, fractures, atelectasis, lung opacities, pneumothorax, enlarged cardiomediastinum, cardiomegaly, pneumonia, consolidation, and edema), as well as length-of-stay and 48 h mortality prediction. The number of modalities refers to the coverage among tabular, time-series, text, and image data. The number of sources refers to the coverage among available input data sources (10 for pathology diagnosis, and 11 for length-of-stay and 48 h mortality prediction). Thus, the position (Modality = 2, Sources = 3) corresponds to the average AUROC of all models across all input combinations covering any 2 modalities using any 3 input sources. Increasing gradients on average AUROC appear to follow from increasing the number of modalities and number of sources across all evaluated tasks. Decreasing gradients on AUROC standard deviations follow from less variability in performance as a higher number of modalities and data sources is used. b Receiver operating characteristic (ROC) curves for a typical HAIM model with multimodal inputs across all use cases. c ROC curves for a best-performing model with single-modality inputs across the same use cases. Consistent averaged improvements across all tasks are observed in multimodality as compared to single-modality systems. AUC Area under the curve, AUROC Area under the receiver operating characteristic curve, CM Cardiomediastinum, Dx Diagnosis, HAIM Holistic Artificial Intelligence in Medicine, Ops Operations, SD Standard deviation.
Multimodal HAIM framework is a flexible and robust method to improve predictive capacity for healthcare machine learning systems as compared to single-modality approaches. a Average percent change of area under the receiver operating characteristic curve (Avg. ΔAUROC) for all tested multimodality HAIM models as compared to their single-source single-modality counterparts. While different models exhibit varying degrees of improvement, all tested models show positive Avg. ΔAUROC percentages. The number of modalities refers to the coverage among tabular, time-series, text, and image data. The number of sources refers to the coverage among available input data sources (10 for pathology diagnosis, and 11 for length-of-stay and 48 h mortality prediction). Thus, the position (Modality = 2, Sources = 3) corresponds to the average AUROC of all models across all input combinations covering any 2 modalities using any 3 input sources. b Expanded Avg. ΔAUROC percentages for all tested multimodality HAIM models, ordered by the number of used modalities (i.e., tabular, time-series, text, or images) as well as the number of used data sources. c Waterfall plots of aggregated Shapley values for independent data modalities per predictive task. While Shapley values for all data modalities appear to be positively contributing to the predictive capacity of all models, different tasks exhibit distinct distributions of aggregated Shapley values. d High-level schematic of the HAIM pipeline developed to support the presented work. After data collection or sourcing (HAIM-MIMIC-MM for this work), a process of feature selection and embedding extraction is applied to feed fusion embeddings into a process of iterative architecture engineering (model and hyperparameter selection). After particular models are selected and trained, they can be benchmarked to test and report results. This process concludes with the selection of a model for deployment in a use case scenario.
Artificial intelligence (AI) systems hold great promise to improve healthcare over the next decades. Specifically, AI systems leveraging multiple data sources and input modalities are poised to become a viable method to deliver more accurate results and deployable pipelines across a wide range of applications. In this work, we propose and evaluate a unified Holistic AI in Medicine (HAIM) framework to facilitate the generation and testing of AI systems that leverage multimodal inputs. Our approach uses generalizable data pre-processing and machine learning modeling stages that can be readily adapted for research and deployment in healthcare environments. We evaluate our HAIM framework by training and characterizing 14,324 independent models based on HAIM-MIMIC-MM, a multimodal clinical database (N = 34,537 samples) containing 7279 unique hospitalizations and 6485 patients, spanning all possible input combinations of 4 data modalities (i.e., tabular, time-series, text, and images), 11 unique data sources and 12 predictive tasks. We show that this framework can consistently and robustly produce models that outperform similar single-source approaches across various healthcare demonstrations (by 6–33%), including 10 distinct chest pathology diagnoses, along with length-of-stay and 48 h mortality predictions. We also quantify the contribution of each modality and data source using Shapley values, which demonstrates the heterogeneity in data modality importance and the necessity of multimodal inputs across different healthcare-relevant tasks. The generalizable properties and flexibility of the HAIM framework could offer a promising pathway for future multimodal predictive systems in clinical and operational healthcare settings.
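The idea of evaluating every input combination and grouping results by number of modalities and number of sources can be illustrated with a minimal sketch. The source names and their modality assignments below are hypothetical placeholders (not the 11 sources of HAIM-MIMIC-MM), and `percent_change` is only one plausible reading of a relative AUROC improvement over a single-source baseline.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical source -> modality mapping; illustrative names only.
SOURCES = {
    "demographics": "tabular",
    "vitals": "time-series",
    "labs": "time-series",
    "notes": "text",
    "chest_xray": "image",
}


def group_combinations(sources):
    """Group every non-empty subset of sources by (n_modalities, n_sources)."""
    groups = defaultdict(list)
    names = sorted(sources)
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            n_modalities = len({sources[name] for name in combo})
            groups[(n_modalities, k)].append(combo)
    return dict(groups)


def percent_change(multi_auroc, single_auroc):
    """Relative AUROC change (in %) of a multimodal model over a baseline.

    Assumed definition; the source does not spell out the formula.
    """
    return 100.0 * (multi_auroc - single_auroc) / single_auroc


groups = group_combinations(SOURCES)
```

In such a grouping, `groups[(2, 3)]` would hold every 3-source combination that spans exactly 2 modalities, which matches how a grid position such as (Modality = 2, Sources = 3) is described in the figure captions.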
Signal extraction and deep learning pipeline. a PPG signal extraction occurs after collecting video data from the smartphone camera and applying per-channel gains so that each channel stays within a usable range (no clipping or saturation). Gains for the R, G, and B channels were empirically determined and held constant across all subjects to avoid clipping or biasing towards one channel. b Pre-processing extracts the PPG signal for each channel by computing the mean pixel value of that channel across the entirety of each frame. c Training and evaluation were performed using Leave-One-Out Cross-Validation (LOOCV) by using 5 subjects' data as the training set, holding out one of these subjects' data as the validation set for optimizing the model, and then evaluating the trained model on one test subject. d The deep learning model consists of 3 convolutional layers and 2 linear layers operating on an input of 3 s of RGB video data (90 frames at 30 fps). The output is a prediction of the individual's current blood-oxygen saturation (SpO₂ %), which was evaluated using Mean Absolute Error (MAE) against the ground-truth reading from a standalone pulse oximeter. e Equations for the loss and MAE used in training and evaluating the model.
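The pre-processing step in panel b (one mean value per colour channel per frame) and the MAE metric can be sketched in a few lines. This is a minimal illustration, assuming frames are given as nested lists of (R, G, B) tuples; the function names are hypothetical and the gain step, windowing, and network are omitted.

```python
def frame_channel_means(frame):
    """Return the mean (R, G, B) value over all pixels of one frame.

    `frame` is assumed to be a list of rows, each a list of (R, G, B) tuples.
    """
    sums = [0.0, 0.0, 0.0]
    n_pixels = 0
    for row in frame:
        for pixel in row:
            for channel in range(3):
                sums[channel] += pixel[channel]
            n_pixels += 1
    if n_pixels == 0:
        raise ValueError("frame contains no pixels")
    return tuple(total / n_pixels for total in sums)


def extract_ppg(frames):
    """Turn a sequence of frames into a 3-channel PPG signal (one sample per frame)."""
    return [frame_channel_means(frame) for frame in frames]


def mean_absolute_error(predicted, reference):
    """MAE between predicted and reference SpO2 readings (same length, non-empty)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)
```

At 30 fps, a 3 s window would correspond to 90 consecutive entries of the list returned by `extract_ppg`.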
Hypoxemia, a medical condition that occurs when the blood is not carrying enough oxygen to adequately supply the tissues, is a leading indicator for dangerous complications of respiratory diseases like asthma, COPD, and COVID-19. While purpose-built pulse oximeters can provide accurate blood-oxygen saturation (SpO₂) readings that allow for diagnosis of hypoxemia, enabling this capability in unmodified smartphone cameras via a software update could give more people access to important information about their health. Towards this goal, we performed the first clinical development validation on a smartphone camera-based SpO₂ sensing system using a varied fraction of inspired oxygen (FiO₂) protocol, creating a clinically relevant validation dataset for solely smartphone-based contact PPG methods on a wider range of SpO₂ values (70–100%) than prior studies (85–100%). We built a deep learning model using this data to demonstrate an overall MAE = 5.00% SpO₂ while identifying positive cases of low SpO₂ < 90% with 81% sensitivity and 79% specificity. We also provide the data in open-source format, so that others may build on this work.
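As an illustration of how sensitivity and specificity for detecting low oxygen saturation (below 90%) could be computed from paired predictions and reference readings, consider the following minimal sketch. The function name and the choice to apply the same threshold to both the prediction and the reference are assumptions, not details taken from the study.

```python
def sensitivity_specificity(predicted, reference, threshold=90.0):
    """Classify readings below `threshold` as positive (low SpO2).

    Returns (sensitivity, specificity); a value is NaN when undefined.
    """
    tp = fn = tn = fp = 0
    for pred, ref in zip(predicted, reference):
        actual_low = ref < threshold
        predicted_low = pred < threshold
        if actual_low and predicted_low:
            tp += 1
        elif actual_low:
            fn += 1
        elif predicted_low:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

For example, calling it with reference readings `[85, 88, 95, 97, 92]` and predictions `[86, 91, 93, 89, 96]` counts one true positive, one false negative, two true negatives and one false positive.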
Flow-chart of the literature search according to the recommendation of the PRISMA guidelines.
Cognitive behavioral therapy (CBT) represents one of the major treatment options for depressive disorders besides pharmacological interventions. While newly developed digital CBT approaches hold important advantages due to higher accessibility, their relative effectiveness compared to traditional CBT remains unclear. We conducted a systematic literature search to identify all studies that conducted a CBT-based intervention (face-to-face or digital) in patients with major depression. Random-effects meta-analytic models of the standardized mean change using raw score standardization (SMCR) were computed. In 106 studies including n = 11854 patients face-to-face CBT shows superior clinical effectiveness compared to digital CBT when investigating depressive symptoms ( p < 0.001, face-to-face CBT: SMCR = 1.97, 95%-CI: 1.74–2.13, digital CBT: SMCR = 1.20, 95%-CI: 1.08–1.32) and adherence ( p = 0.014, face-to-face CBT: 82.4%, digital CBT: 72.9%). However, after accounting for differences between face-to-face and digital CBT studies, both approaches indicate similar effectiveness. Important variables with significant moderation effects include duration of the intervention, baseline severity, adherence and the level of human guidance in digital CBT interventions. After accounting for potential confounders our analysis indicates comparable effectiveness of face-to-face and digital CBT approaches. These findings underline the importance of moderators of clinical effects and provide a basis for the future personalization of CBT treatment in depression.
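To make the pooling step concrete, here is a minimal sketch of a random-effects meta-analysis. It assumes the DerSimonian-Laird estimator for the between-study variance, which the abstract does not specify, and takes per-study effect sizes (for example SMCR values) with their sampling variances as given inputs.

```python
import math


def random_effects_pool(effects, variances):
    """Pool study effects with a DerSimonian-Laird random-effects model.

    Returns (pooled_effect, standard_error, tau_squared).
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    fixed_mean = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q measures heterogeneity around the fixed-effect mean.
    q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(weights, effects))
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau_squared = max(0.0, (q - (len(effects) - 1)) / c)
    # Random-effects weights add the between-study variance to each study.
    re_weights = [1.0 / (v + tau_squared) for v in variances]
    sum_re = sum(re_weights)
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum_re
    return pooled, math.sqrt(1.0 / sum_re), tau_squared
```

A moderator analysis like the one described above would go further, for instance by fitting a meta-regression on intervention duration or baseline severity; that step is not shown here.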
The mobile health (mHealth) industry is an enormous global market; however, user dropout is a major challenge that limits its positive outcomes. To date, the results of studies on the factors that influence continued use have been inconsistent, and research on the pooled effects of these factors on the continuance intention of mHealth is limited. Therefore, this study aims to systematically analyze quantitative studies on the continuance intention of mHealth and explore the pooled effect of each direct and indirect impact factor. Eight literature databases were searched up to October 2021. Fifty-eight peer-reviewed studies on the impact factors and their effects on continuance intention of mHealth were included. Out of the 19 direct impact factors of continuance intention, 15 are significant, with attitude (β = 0.450; 95% CI: 0.135, 0.683), satisfaction (β = 0.406; 95% CI: 0.292, 0.509), health empowerment (β = 0.359; 95% CI: 0.204, 0.497), perceived usefulness (β = 0.343; 95% CI: 0.280, 0.403), and perceived quality of health life (β = 0.315; 95% CI: 0.211, 0.412) having the largest pooled effect coefficients on continuance intention. There is high heterogeneity between the studies; thus, we conducted a subgroup analysis to explore the moderating effect of different characteristics on the impact effects. The geographic region, user type, mHealth type, user age, and publication year significantly moderate influential relationships, such as that between trust and continuance intention. Thus, mHealth developers should develop personalized continuous use promotion strategies based on user characteristics.
Important considerations across a development supply chain, showing cross-disciplinary involvement across components, that should be addressed early in a vertically integrated approach.
All supply chain components are essential for deployment and must work synergistically to support continued AI use. A focus on establishing a supply chain has benefits over an isolated focus on producing an accurate model.
Substantial interest and investment in clinical artificial intelligence (AI) research has not resulted in widespread translation to deployed AI solutions. Current attention has focused on bias and explainability in AI algorithm development, external validity and model generalisability, and lack of equity and representation in existing data. While of great importance, these considerations also reflect a model-centric approach seen in published clinical AI research, which focuses on optimising architecture and performance of an AI model on best available datasets. However, even robustly built models using state-of-the-art algorithms may fail once tested in realistic environments due to unpredictability of real-world conditions, out-of-dataset scenarios, characteristics of deployment infrastructure, and lack of added value to clinical workflows relative to cost and potential clinical risks. In this perspective, we define a vertically integrated approach to AI development that incorporates early, cross-disciplinary consideration of impact evaluation, data lifecycles, and AI production, and explore its implementation in two contrasting AI development pipelines: a scalable “AI factory” (Mayo Clinic, Rochester, United States), and an end-to-end cervical cancer screening platform for resource poor settings (Paps AI, Mbarara, Uganda). We provide practical recommendations for implementers, and discuss future challenges and novel approaches, including a decentralised federated architecture being developed in the NHS (AI4VBH, London, UK). Growth in global clinical AI research continues unabated, and introduction of vertically integrated teams and development practices can increase the translational potential of future clinical AI projects.
Prediction of survival for patients in intensive care units (ICUs) has been subject to intense research. However, no models exist that embrace the multiverse of data in ICUs. It is an open question whether deep learning methods using automated data integration with minimal pre-processing of mixed data domains such as free text, medical history and high-frequency data can provide discrete-time survival estimates for individual ICU patients. We trained a deep learning model on data from patients admitted to ten ICUs in the Capital Region of Denmark and the Region of Southern Denmark between 2011 and 2018. Inspired by natural language processing, we mapped the electronic patient record data to an embedded representation and fed the data to a recurrent neural network with a multi-label output layer representing the chance of survival at different follow-up times. We evaluated the performance using the time-dependent concordance index. In addition, we quantified and visualized the drivers of survival predictions using the SHAP methodology. We included 37,355 admissions of 29,417 patients in our study. Our deep learning models outperformed traditional Cox proportional-hazard models, with concordance indices in the ranges 0.72–0.73, 0.71–0.72, 0.71, and 0.69–0.70 for models applied at baseline 0, 24, 48, and 72 h, respectively. Deep learning models based on a combination of entity embeddings and survival modelling are a feasible approach to obtain individualized survival estimates in data-rich settings such as the ICU. The interpretable nature of the models enables us to understand the impact of the different data domains.
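For readers unfamiliar with concordance-based evaluation, the following minimal sketch computes a plain Harrell-style concordance index from follow-up times, event indicators and predicted risk scores. It is a simplification: the study reports a time-dependent concordance index, which this sketch does not implement, and all names are illustrative.

```python
def concordance_index(times, events, risk_scores):
    """Fraction of comparable pairs in which the higher-risk subject fails first.

    A pair (i, j) is comparable when subject i had an observed event and a
    shorter follow-up time than subject j. Ties in risk count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives some context for the reported range of roughly 0.69 to 0.73.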
Average number of primary care visits per patient remains stable from 2019 to 2021 across insurance groups. This figure shows the average number of encounters per year for all patients and matched patients by payor type. The numbers of patients in each insurance category are as follows: Commercial (621,490 total; 176,543 matched), Medicaid (74,853 total; 20,050 matched), Medicare (225,575 total; 128,137 matched), Other (42,306 total; 7291 matched).
Telehealth use occurs more in patients with multiple primary care visits. Number and percent of patients with no (blue) or at least one telehealth visit (orange) grouped by number of primary care appointments in that year for matched patients (top) and all patients (bottom).
The expanded availability of telehealth due to the COVID-19 pandemic presents a concern that telehealth may result in an unnecessary increase in utilization. We analyzed 4,114,651 primary care encounters (939,134 unique patients) from three healthcare systems between 2019 and 2021 and found little change in utilization as telehealth became widely available. Results suggest telehealth availability is not resulting in additional primary care visits and federal policies should support telehealth use.
With the explosive growth of biomarker data in Alzheimer’s disease (AD) clinical trials, numerous mathematical models have been developed to characterize disease-relevant biomarker trajectories over time. While some of these models are purely empiric, others are causal, built upon various hypotheses of AD pathophysiology, a complex and incompletely understood area of research. One of the most challenging problems in computational causal modeling is using a purely data-driven approach to derive the model’s parameters and the mathematical model itself, without any prior hypothesis bias. In this paper, we develop an innovative data-driven modeling approach to build and parameterize a causal model to characterize the trajectories of AD biomarkers. This approach integrates causal model learning, population parameterization, parameter sensitivity analysis, and personalized prediction. By applying this integrated approach to a large multicenter database of AD biomarkers, the Alzheimer’s Disease Neuroimaging Initiative, several causal models for different AD stages are revealed. In addition, personalized models for each subject are calibrated and provide accurate predictions of future cognitive status.
A process diagram of the various statuses from initial contact to enrolment. Participants progress from 'needs contact' (left) to 'enrolled' (right). Boxes along the bottom row detail non-participation statuses.
A recruitment flowchart from initial contact to enrolment for the RADAR-MDD London site. 1 Of total contacts, denominator = 1104. 2 Of willing & assessed for eligibility, denominator = 581.
The use of remote measurement technologies (RMTs) across mobile health (mHealth) studies is becoming popular, given their potential for providing rich data on symptom change and indicators of future state in recurrent conditions such as major depressive disorder (MDD). Understanding recruitment into RMT research is fundamental for improving historically small sample sizes, reducing loss of statistical power, and ultimately producing results worthy of clinical implementation. There is a need for the standardisation of best practices for successful recruitment into RMT research. The current paper reviews lessons learned from recruitment into the Remote Assessment of Disease and Relapse- Major Depressive Disorder (RADAR-MDD) study, a large-scale, multi-site prospective cohort study using RMT to explore the clinical course of people with depression across the UK, the Netherlands, and Spain. More specifically, the paper reflects on key experiences from the UK site and consolidates these into four key recruitment strategies, alongside a review of barriers to recruitment. Finally, the strategies and barriers outlined are combined into a model of lessons learned. This work provides a foundation for future RMT study design, recruitment and evaluation.
Baseline characteristics of study samples.
Physical activity is regarded as favorable to health but effects across the spectrum of human disease are poorly quantified. In contrast to self-reported measures, wearable accelerometers can provide more precise and reproducible activity quantification. Using wrist-worn accelerometry data from the UK Biobank prospective cohort study, we test associations between moderate-to-vigorous physical activity (MVPA), both total MVPA minutes and whether MVPA is above a guideline-based threshold of ≥150 min/week, and incidence of 697 diseases using Cox proportional hazards models adjusted for age, sex, body mass index, smoking, Townsend Deprivation Index, educational attainment, diet quality, alcohol use, blood pressure, and anti-hypertensive use. We correct for multiplicity at a false discovery rate of 1%. We perform analogous testing using self-reported MVPA. Among 96,244 adults wearing accelerometers for one week (age 62 ± 8 years), MVPA is associated with 373 (54%) tested diseases over a median 6.3 years of follow-up. Greater MVPA is overwhelmingly associated with lower disease risk (98% of associations) with hazard ratios (HRs) ranging 0.70–0.98 per 150 min increase in weekly MVPA, and associations spanning all 16 disease categories tested. Overall, associations with lower disease risk are enriched for cardiac (16%), digestive (14%), endocrine/metabolic (10%), and respiratory conditions (8%) (chi-square p
In the metaverse, users will actively engage with 3D content using extended reality (XR). Such XR platforms can stimulate a revolution in health communication, moving from information-based to experience-based content. We outline three major application domains and describe how the XR affordances (presence, agency and embodiment) can improve healthy behaviour by targeting the users' threat and coping appraisal. We discuss how health communication via XR can help to address long-standing health challenges.
As the world's population ages, the study of ageing and degeneration of physiological characteristics in human skin, bones, and muscles has become an important topic. Research on the ageing of bones, especially the skull, has received much attention in recent years. In this study, a novel deep learning method representing ageing-related dynamic attention (ARDA) is proposed. The proposed method can quantitatively display the ageing salience of the bones and their change patterns with age on lateral cephalometric radiograph (LCR) images containing the craniofacial region and cervical spine. An age estimation-based deep learning model is trained on 14142 LCR images from individuals aged 4 to 40 years to extract ageing-related features, and based on these features the ageing salience maps are generated by the Grad-CAM method. All ageing salience maps with the same age are merged into an ARDA map corresponding to that age. Ageing salience maps show that ARDA is mainly concentrated in three regions in LCR images: the teeth, craniofacial, and cervical spine regions. Furthermore, the dynamic distribution of ARDA at different ages and instances in LCR images is quantitatively analyzed. The experimental results on 3014 cases show that ARDA can accurately reflect the development and degeneration patterns in LCR images.
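The step of merging all salience maps that share an age into one map per age can be sketched as follows. This is a minimal illustration that assumes merging means an element-wise average over equally sized maps; the abstract does not state the merging rule, and the function name is hypothetical.

```python
def merge_salience_maps_by_age(samples):
    """Average salience maps that share the same age.

    `samples` is an iterable of (age, salience_map) pairs, where each map is a
    2D list of floats and all maps have the same shape.
    Returns a dict mapping age -> averaged 2D map.
    """
    sums = {}
    counts = {}
    for age, salience_map in samples:
        if age not in sums:
            sums[age] = [[0.0] * len(row) for row in salience_map]
            counts[age] = 0
        for r, row in enumerate(salience_map):
            for c, value in enumerate(row):
                sums[age][r][c] += value
        counts[age] += 1
    return {
        age: [[value / counts[age] for value in row] for row in grid]
        for age, grid in sums.items()
    }
```

In practice the per-image maps would come from a Grad-CAM implementation applied to the trained age-estimation network; here they are simply treated as given inputs.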
Health digital twins (HDTs) are defined as virtual representations (“digital twin”) of patients (“physical twin”) that are generated from multimodal patient data, population data, and real-time updates on patient and environmental variables. With appropriate use, HDTs can model random perturbations on the digital twin to gain insight into the expected behavior of the physical twin, offering groundbreaking applications in precision medicine, clinical trials, and public health. Main considerations for translating HDT research into clinical practice include computational requirements, clinical implementation, data governance, and product oversight.
Journal metrics
Article Processing Charges (APC): $3290 / €2690 / £2390
Submission to first decision: 5 days
Journal Impact Factor™: 15.357 (2021)
Immediacy Index: 3.126 (2021)
0.012 (2021)
Article Influence Score: 4.533 (2021)
Top-cited authors
Katherine Chou
  • Google Inc.
Gavin Duggan
  • Google Inc.
James Wexler
  • Google Inc.
Yun Liu
  • Google Inc.
Ronald M Summers
  • U.S. Department of Health and Human Services