Karel G M Moons’s research while affiliated with University Medical Center Utrecht and other places


Publications (559)


Figure 2: Gwet's AC for each AMSTAR-PF question with dichotomised answering responses (Y/PY and N/PN collapsed). Interrater and inter-pair panels show the AC with its standard error as error bars; the intrapair panel shows the average AC across pairs with the standard error of the mean (SEM) as error bars. Error bars are capped at 1.0. Questions marked (^) had an N/A option. Benchmark interpretation is colour-coded and calculated using 95% cumulative probabilities for Landis and Koch's benchmark categories: < 0, poor; 0.0-0.2, slight; 0.2-0.4, fair; 0.4-0.6, moderate; 0.6-0.8, substantial; and 0.8-1.0, almost perfect
Figure 3: Time to complete the appraisal and time to reach consensus for each article, in order of completion. The box represents the IQR, the horizontal line the median, and the dotted lines the range. Outliers are represented as circles.
Agreeability testing of AMSTAR-PF, a tool for quality appraisal of systematic reviews of prognostic factor studies
  • Preprint
  • File available

April 2025 · 6 Reads

Michael Henry · Neil O'Connell · Richard Riley · [...] · Lorimer Moseley

Background: This paper details initial testing of the agreeability and usability of a novel quality appraisal tool for systematic reviews of prognostic factor studies: AMSTAR-PF. Methods: Fourteen appraisers each assessed eight systematic reviews using AMSTAR-PF. Their ratings for each question and each article were compared, with interrater, inter-pair and intrapair agreeability calculated using Gwet's agreement coefficient. Time of use and time to reach consensus were also recorded. Results: Interrater agreement averaged 0.59 (range 0.21-0.90), inter-pair 0.61 (range 0.24-0.91) and intrapair 0.75 (range 0.45-0.95) across the domains. Agreement for the overall rating was 0.46 (95% CI 0.30-0.62) for interrater, 0.46 (95% CI 0.17-0.74) for inter-pair, and 0.68 (range of averages 0.22-1.00) for intrapair agreement. The majority (60.7%) of intrapair ratings were identical, and 94.6% of final ratings for the overall appraisal were either identical or only one category apart. The time taken to appraise a study with AMSTAR-PF improved with use and averaged around 34 minutes after the first two appraisals. Conclusions: Despite some variance in agreeability across domains and between appraisers, the testing results suggest that AMSTAR-PF has clear utility for appraising the quality of systematic reviews of prognostic factor studies.
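
To make the agreement statistic concrete, here is a minimal sketch (not the authors' code) of Gwet's first-order agreement coefficient (AC1) for two appraisers and dichotomised yes/no answers; the function name and example ratings are hypothetical.

```python
# Minimal sketch: Gwet's AC1 for two raters and dichotomised (0/1) answers,
# as would be computed per AMSTAR-PF question.
def gwet_ac1(ratings_a, ratings_b):
    """ratings_a, ratings_b: lists of 0/1 answers from two appraisers for the same items."""
    n = len(ratings_a)
    # Observed agreement: proportion of items where both appraisers gave the same answer
    pa = sum(int(x == y) for x, y in zip(ratings_a, ratings_b)) / n
    # Chance agreement under Gwet's model: 2 * pi * (1 - pi),
    # where pi is the mean proportion of "yes" answers across the two raters
    pi = (sum(ratings_a) / n + sum(ratings_b) / n) / 2
    pe = 2 * pi * (1 - pi)
    return (pa - pe) / (1 - pe)

# Hypothetical example: two appraisers rating eight reviews on one question
print(gwet_ac1([1, 1, 0, 1, 0, 1, 1, 0], [1, 1, 0, 1, 1, 1, 1, 0]))  # ≈ 0.78
```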


PROBAST+AI: an updated quality, risk of bias, and applicability assessment tool for prediction models using regression or artificial intelligence methods

March 2025 · 134 Reads · 2 Citations

The BMJ

The Prediction model Risk Of Bias ASsessment Tool (PROBAST) is used to assess the quality, risk of bias, and applicability of prediction models or algorithms and of prediction model/algorithm studies. Since PROBAST’s introduction in 2019, much progress has been made in the methodology for prediction modelling and in the use of artificial intelligence, including machine learning, techniques. An update to PROBAST-2019 is thus needed. This article describes the development of PROBAST+AI. PROBAST+AI consists of two distinctive parts: model development and model evaluation. For model development, PROBAST+AI users assess quality and applicability using 16 targeted signalling questions. For model evaluation, PROBAST+AI users assess the risk of bias and applicability using 18 targeted signalling questions. Both parts contain four domains: participants and data sources, predictors, outcome, and analysis. Applicability of the prediction model is rated for the participants and data sources, predictors, and outcome domains. PROBAST+AI may replace the original PROBAST tool and allows all key stakeholders (eg, model developers, AI companies, researchers, editors, reviewers, healthcare professionals, guideline developers, and policy organisations) to examine the quality, risk of bias, and applicability of any type of prediction model in the healthcare sector, irrespective of whether regression modelling or AI techniques are used.


Fig. 2 Association of metabolic scores in tertiles with overall survival. Kaplan-Meier curves of cumulative survival of patients using Cox proportional hazards regression (n = 327).
Associations between metabolomic scores and clinical outcomes in hospitalized COVID-19 patients

March 2025 · 12 Reads

GeroScience

The disease course and outcome of COVID-19 vary greatly between individuals. To explore which biological systems may contribute to this variation, we examined how individual metabolites and three metabolic scores relate to COVID-19 outcomes in hospitalized COVID-19 patients. The metabolome of 346 patients was measured using the 1H-NMR Nightingale platform. The associations of individual metabolomic features and multi-biomarker scores, i.e., MetaboHealth, MetaboAge, and the Infectious Disease Score (IDS) (higher scores reflect poorer health), with in-hospital disease course, long-term recovery, and overall survival were analyzed. Higher values of the metabolites phenylalanine (HR = 1.33, CI = 1.14–1.56), glucose (HR = 1.37, CI = 1.16–1.62) and lactate (HR = 1.38, CI = 1.16–1.63) were associated with mortality. For all three metabolic scores, higher scores were significantly associated with higher odds of a poorer in-hospital disease course (MetaboHealth: OR = 1.61, CI = 1.29–2.02; ΔMetaboAge: OR = 1.42, CI = 1.16–1.74; IDS: OR = 1.55, CI = 1.25–1.93) and with overall survival (MetaboHealth: HR = 1.57, CI = 1.28–1.92; ΔMetaboAge: HR = 1.34, CI = 1.15–1.57; IDS: HR = 1.56, CI = 1.27–1.93). MetaboHealth and ΔMetaboAge showed stronger associations in younger patients (< 70 years) than in older patients. No clear patterns were found in the associations between the three scores and measures of long-term recovery. In conclusion, the heterogeneity in disease course after SARS-CoV-2 infection may be explained either by generic biological frailty reflected by the three metabolomic scores or by glycemic control (glucose, lactate) and respiratory distress (phenylalanine).
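
As a rough illustration of the kind of analysis summarised above (hazard ratios per standardised metabolic marker from Cox proportional hazards regression), the sketch below uses the lifelines package on simulated data; the variable names and numbers are placeholders, not the study's data or code.

```python
# Illustrative sketch: Cox proportional hazards regression relating a standardised
# metabolic marker to overall survival (simulated data, assumed column names).
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 300
marker = rng.normal(size=n)                                 # z-scored metabolic marker
time = rng.exponential(scale=365 / np.exp(0.3 * marker))    # shorter survival for higher marker values
df = pd.DataFrame({
    "followup_days": np.minimum(time, 365),                 # administrative censoring at one year
    "died": (time <= 365).astype(int),
    "marker": marker,
})

cph = CoxPHFitter().fit(df, duration_col="followup_days", event_col="died")
cph.print_summary()  # the exp(coef) column for 'marker' is the hazard ratio per SD increase
```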



Summary of imbalance corrections included in the simulation study.
The Harms of Class Imbalance Corrections for Machine Learning Based Prediction Models: A Simulation Study

February 2025 · 41 Reads · 7 Citations

Statistics in Medicine

Introduction Risk prediction models are increasingly used in healthcare to aid in clinical decision-making. In most clinical contexts, model calibration (i.e., assessing the reliability of risk estimates) is critical. Data available for model development are often not perfectly balanced with respect to the modeled outcome (i.e., individuals with vs. without the event of interest are not equally prevalent in the data). It is common for researchers to correct for class imbalance, yet the effect of such imbalance corrections on the calibration of machine learning models is largely unknown. Methods We studied the effect of imbalance corrections on model calibration for a variety of machine learning algorithms. Using extensive Monte Carlo simulations, we compared the out-of-sample predictive performance of models developed with an imbalance correction to those developed without a correction for class imbalance across different data-generating scenarios (varying sample size, number of predictors, and event fraction). Our findings were illustrated in a case study using MIMIC-III data. Results In all simulation scenarios, prediction models developed without a correction for class imbalance consistently had equal or better calibration performance than prediction models developed with a correction for class imbalance. The miscalibration introduced by correcting for class imbalance was characterized by an overestimation of risk and could not always be corrected by recalibration. Conclusion Correcting for class imbalance is not always necessary and may even be harmful to clinical prediction models that aim to produce reliable risk estimates on an individual basis.
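
The sketch below illustrates the phenomenon described in the abstract on synthetic data: a logistic model developed after a simple random oversampling of events (standing in for the imbalance corrections studied) tends to over-estimate risk relative to a model developed on the original data. The data, sample sizes, and oversampling scheme are assumptions for illustration only.

```python
# Illustrative sketch: effect of a class imbalance correction (random oversampling)
# on calibration-in-the-large, using synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=20_000, n_features=8, weights=[0.95, 0.05],
                           random_state=0)
X_dev, y_dev, X_val, y_val = X[:10_000], y[:10_000], X[10_000:], y[10_000:]

# Model developed on the original (imbalanced) data
uncorrected = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)

# Model developed after randomly oversampling events to a 1:1 ratio
events, nonevents = np.where(y_dev == 1)[0], np.where(y_dev == 0)[0]
oversampled = np.concatenate([nonevents, rng.choice(events, size=len(nonevents))])
corrected = LogisticRegression(max_iter=1000).fit(X_dev[oversampled], y_dev[oversampled])

# Calibration-in-the-large: the mean predicted risk should match the observed event rate
print("observed event rate:      ", y_val.mean())
print("mean risk, no correction: ", uncorrected.predict_proba(X_val)[:, 1].mean())
print("mean risk, oversampled:   ", corrected.predict_proba(X_val)[:, 1].mean())
```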


Alternative diagnoses of 707 included patients without pulmonary embolism stratified by YEARS classification
Contingency table
YEARS clinical decision rule for diagnosing pulmonary embolism: a prospective diagnostic cohort follow-up study in primary care

February 2025 · 2 Reads

BMJ Open

Objectives The Wells rule is often used in primary care to rule out pulmonary embolism (PE), but its efficiency is low as many referred patients do not have PE. In this study, we evaluated in primary care an alternative and potentially more efficient diagnostic strategy—the YEARS algorithm, a simplified three-item version of the Wells rule combined with a pretest-probability-adjusted D-dimer interpretation. Design In this comprehensive prospective diagnostic validation study, primary care patients suspected of PE were enrolled by their general practitioner. All three YEARS items were collected in addition to D-dimer results, and patients were followed for 3 months to establish the final diagnosis. Setting Primary care in the Netherlands. Participants 753 patients with suspected acute PE were included. Five patients (0.7%) were lost to follow-up. Main outcome measures Failure rate (number of PE cases among patients classified by the algorithm as 'PE ruled out') and efficiency (fraction of patients classified as 'PE probable/further imaging needed'). Results Prevalence of PE was 5.5% (41/748 patients). In total, 603 patients were classified as 'PE ruled out' by the YEARS algorithm (532 with zero YEARS items and a D-dimer < 1000 ng/mL, and 71 with ≥ 1 positive YEARS item and a D-dimer < 500 ng/mL), resulting in an efficiency of 80.6% (603/748 patients, 95% CI 77.6% to 83.4%). Of these patients, three had a diagnosis of non-fatal PE during 3 months of follow-up, all three with zero YEARS items and a D-dimer between 500 and 1000 ng/mL, resulting in an overall diagnostic failure rate of 0.50% (3/603 patients, 95% CI 0.13% to 1.57%). In the patients categorised as 'imaging needed' (n = 145), a total of 38 (26.2%) were indeed diagnosed with PE. Conclusions Our study suggests that acute PE can be safely ruled out in 80% of patients by the YEARS algorithm in a primary care setting, while only 20% of patients required referral to hospital care for imaging tests. In those classified as 'imaging needed', PE was present in about one in every four patients, demonstrating a high detection proportion.
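
The YEARS decision logic described above (zero items with D-dimer < 1000 ng/mL, or one or more items with D-dimer < 500 ng/mL, rules PE out) can be written in a few lines; the sketch below is a minimal rendering of that rule, not the study's software.

```python
# Minimal sketch of the YEARS decision logic as described in the abstract.
# years_items: number of positive YEARS items (0-3); ddimer: D-dimer in ng/mL.
def years_rule(years_items: int, ddimer: float) -> str:
    threshold = 1000 if years_items == 0 else 500  # pretest-probability-adjusted D-dimer cut-off
    return "PE ruled out" if ddimer < threshold else "imaging needed"

print(years_rule(0, 800))  # 'PE ruled out'   (zero items, D-dimer < 1000 ng/mL)
print(years_rule(2, 800))  # 'imaging needed' (>= 1 item, D-dimer >= 500 ng/mL)
```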


Flowchart for inclusion of older individuals in the nursing home cohort
Distributions of predicted risks by penalized models stratified by outcome. Plot A shows the 28-day mortality risks as predicted by the base model, whereas plot B shows the risks predicted by the CCI model (both models showed very similar distributions). For each plot, the distribution of predicted risks for patients who survived 28 days is shown in blue, and the distribution of predicted risks for patients who died within 28 days is shown in yellow
Charlson comorbidity index has no incremental value for mortality risk prediction in nursing home residents with COVID-19 disease

January 2025 · 13 Reads

BMC Geriatrics

Background During the COVID-19 pandemic, nursing home (NH) residents faced the highest risk of severe COVID-19 disease and mortality. Given their frailty status, comorbidity burden could serve as a useful predictive indicator of vulnerability in this population. However, the prognostic value of cumulative comorbidity scores such as the Charlson comorbidity index (CCI) remained unclear in this population. We evaluated the incremental predictive value of the CCI for predicting 28-day mortality in NH residents with COVID-19, compared to prediction using age and sex only. Methods We included older individuals of ≥ 70 years of age in a large retrospective observational cohort across NHs in the Netherlands. Individuals with a PCR-confirmed COVID-19 diagnosis from 1 March 2020 to 31 December 2021 were included. The CCI score was computed from the comorbidities recorded in the electronic patient records. All-cause mortality within 28 days was predicted using logistic regression based on age and sex only (base model) and by adding the CCI to the base model (CCI model). The predictive performance of the base model and the CCI model was compared visually by the distribution of predicted risks and by the area under the receiver operating characteristic curve (AUROC), scaled Brier score, and calibration slope. Results A total of 4318 older NH residents were included in this study, with a median age of 88 years [IQR: 83–93] and a median CCI score of 6 [IQR: 5–7]. In total, 1357 (31%) residents died within 28 days after COVID-19 diagnosis. The base model, with age and sex as predictors, had an AUROC of 0.61 (CI: 0.60 to 0.63), a scaled Brier score of 0.03 (CI: 0.02 to 0.04), and a calibration slope of 0.97 (CI: 0.83 to 1.13). The addition of the CCI did not improve these predictive performance measures. Conclusion The addition of the CCI as a vulnerability indicator did not improve short-term mortality prediction in NH residents. The similarly high age and number of comorbidities across the NH population could reduce the effectiveness of these predictors, emphasizing the need for other population-specific predictors that can be used in frail NH residents.
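
For readers less familiar with the performance measures reported above, the sketch below shows how the AUROC, scaled Brier score, and calibration slope can be computed from a vector of predicted risks on held-out data; the helper function and its inputs are assumptions for illustration, not the study code.

```python
# Illustrative helper: AUROC, scaled Brier score, and calibration slope for a set of
# predicted risks p (numpy array, strictly between 0 and 1) and 0/1 outcomes y.
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score, brier_score_loss

def performance(y, p):
    auroc = roc_auc_score(y, p)
    brier = brier_score_loss(y, p)
    # Reference Brier score of a "predict the event rate for everyone" model
    brier_null = brier_score_loss(y, np.full_like(p, y.mean()))
    scaled_brier = 1 - brier / brier_null
    # Calibration slope: slope of a logistic regression of the outcome on the linear predictor
    linear_predictor = np.log(p / (1 - p))
    slope = sm.Logit(y, sm.add_constant(linear_predictor)).fit(disp=0).params[1]
    return {"AUROC": auroc, "scaled Brier": scaled_brier, "calibration slope": slope}
```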


The TRIPOD-LLM reporting guideline for studies using large language models

January 2025 · 43 Reads · 21 Citations

Nature Medicine

Large language models (LLMs) are rapidly being adopted in healthcare, necessitating standardized reporting guidelines. We present transparent reporting of a multivariable model for individual prognosis or diagnosis (TRIPOD)-LLM, an extension of the TRIPOD + artificial intelligence statement, addressing the unique challenges of LLMs in biomedical applications. TRIPOD-LLM provides a comprehensive checklist of 19 main items and 50 subitems, covering key aspects from title to discussion. The guidelines introduce a modular format accommodating various LLM research designs and tasks, with 14 main items and 32 subitems applicable across all categories. Developed through an expedited Delphi process and expert consensus, TRIPOD-LLM emphasizes transparency, human oversight and task-specific performance reporting. We also introduce an interactive website ( https://tripod-llm.vercel.app/ ) facilitating easy guideline completion and PDF generation for submission. As a living document, TRIPOD-LLM will evolve with the field, aiming to enhance the quality, reproducibility and clinical applicability of LLM research in healthcare through comprehensive reporting.


Flow chart for model selection in the study
Discriminative performance of selected prediction models for predicting in-hospital mortality (Berzuini et al., Wang et al., Zhang et al.) or ICU admission (Zhou et al.)
Calibration plots (observed vs predicted probabilities) of selected prediction models for predicting in-hospital mortality (Berzuini et al., Wang et al., Zhang et al.) or ICU admission (Zhou et al.). Calibration plots and loess lines were drawn in the stacked dataset (including all 50 imputed datasets) in the total population
Validation of prognostic models predicting mortality or ICU admission in patients with COVID-19 in low- and middle-income countries: a global individual participant data meta-analysis

December 2024 · 12 Reads

Diagnostic and Prognostic Research

Background We evaluated the performance of prognostic models for predicting mortality or ICU admission in hospitalized patients with COVID-19 in the World Health Organization (WHO) Global Clinical Platform, a repository of individual-level clinical data of patients hospitalized with COVID-19, including in low- and middle-income countries (LMICs). Methods We identified eligible multivariable prognostic models for predicting overall mortality and ICU admission during hospital stay in patients with confirmed or suspected COVID-19 from a living review of COVID-19 prediction models. These models were evaluated using data contributed to the WHO Global Clinical Platform for COVID-19 from nine LMICs (Burkina Faso, Cameroon, Democratic Republic of Congo, Guinea, India, Niger, Nigeria, Zambia, and Zimbabwe). Model performance was assessed in terms of discrimination and calibration. Results Out of 144 eligible models, 140 were excluded due to a high risk of bias, predictors unavailable in LMICs, or insufficient model description. Among 11,338 participants, the remaining models showed good discrimination for predicting in-hospital mortality (3 models), with areas under the curve (AUCs) ranging between 0.76 (95% CI 0.71–0.81) and 0.84 (95% CI 0.77–0.89). An AUC of 0.74 (95% CI 0.70–0.78) was found for predicting ICU admission risk (one model). All models showed signs of miscalibration and overfitting, with extensive heterogeneity between countries. Conclusions Among the available COVID-19 prognostic models, only a few could be validated on data collected from LMICs, mainly due to limited predictor availability. Despite their discriminative ability, the selected models for mortality prediction or ICU admission showed varying and suboptimal calibration.
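
Because the abstract highlights extensive heterogeneity between countries, a common way to summarise country-specific discrimination in an individual participant data meta-analysis is a random-effects pooling of (logit-transformed) AUCs. The sketch below is a generic DerSimonian-Laird implementation on made-up numbers, not the analysis used in the paper.

```python
# Generic DerSimonian-Laird random-effects pooling of per-country estimates
# (e.g. logit-AUCs) with their within-country variances. Numbers are made up.
import numpy as np

def dersimonian_laird(estimates, variances):
    y, v = np.asarray(estimates, float), np.asarray(variances, float)
    w = 1 / v                                          # fixed-effect weights
    y_fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - y_fixed) ** 2)                 # Cochran's Q
    tau2 = max(0.0, (q - (len(y) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_star = 1 / (v + tau2)                            # random-effects weights
    pooled = np.sum(w_star * y) / np.sum(w_star)
    return pooled, np.sqrt(1 / np.sum(w_star)), tau2   # pooled estimate, its SE, tau^2

pooled_logit, se, tau2 = dersimonian_laird([1.1, 1.4, 0.9, 1.3], [0.02, 0.05, 0.03, 0.04])
print(1 / (1 + np.exp(-pooled_logit)))  # back-transform the pooled logit-AUC to the AUC scale
```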


Figure 4. Decision curve with net benefit (A), standardized net benefit (B), and expected cost (C) for the case study. We show the full x-axis range for educational purposes. As explained in Box 1, a reasonable range of decision thresholds could be 0.05 to 0.40. This corresponds one-to-one with the normalized cost of a false negative on the expected cost curve. When showing the decision curve in a validation study, the x-axis should be restricted to the reasonable range. Panel A also shows a smoothed curve using central moving averages. "All" (respectively "None") refers to the net benefit or expected cost of the default strategy of classifying all individuals as high (respectively low) risk.
Overview of performance measures and the assessment of the two key characteristics.
Performance measures for the ADNEX model before and after recalibration.
Performance evaluation of predictive AI models to support medical decisions: Overview and guidance

December 2024 · 124 Reads · 1 Citation

A myriad of measures to illustrate the performance of predictive artificial intelligence (AI) models have been proposed in the literature. Selecting appropriate performance measures is essential for predictive AI models that are developed to be used in medical practice, because poorly performing models may harm patients and lead to increased costs. We aim to assess the merits of classic and contemporary performance measures when validating predictive AI models for use in medical practice. We focus on models with a binary outcome. We discuss 32 performance measures covering five performance domains (discrimination, calibration, overall, classification, and clinical utility) along with accompanying graphical assessments. The first four domains cover statistical performance; the fifth domain covers decision-analytic performance. We explain why two key characteristics are important when selecting which performance measures to assess: (1) whether the measure's expected value is optimized when it is calculated using the correct probabilities (i.e., a "proper" measure), and (2) whether the measure reflects purely statistical performance or decision-analytic performance by properly considering misclassification costs. Seventeen measures exhibited both characteristics, fourteen measures exhibited one characteristic, and one measure (the F1 measure) possessed neither. All classification measures (such as classification accuracy and F1) are improper for clinically relevant decision thresholds other than 0.5 or the prevalence. We recommend the following measures and plots as essential to report: AUROC, a calibration plot, a clinical utility measure such as net benefit with decision curve analysis, and a plot with probability distributions per outcome category.
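
As a concrete illustration of the clinical utility domain described above, the following is a minimal sketch (not the paper's code) of net benefit for a model and for the default "treat all" strategy, which is what a decision curve plots across a range of thresholds; the function names and the example threshold range are assumptions.

```python
# Minimal sketch: net benefit at a decision threshold, for a model and for the
# default strategies used in decision curve analysis.
import numpy as np

def net_benefit(y, p, threshold):
    """y: 0/1 outcomes; p: predicted risks; threshold: decision threshold in (0, 1)."""
    n = len(y)
    classify_high = p >= threshold
    tp = np.sum(classify_high & (y == 1))
    fp = np.sum(classify_high & (y == 0))
    w = threshold / (1 - threshold)  # odds at the threshold, i.e. the FP:TP cost ratio
    return tp / n - w * fp / n

def net_benefit_treat_all(y, threshold):
    prev = np.mean(y)  # classify everyone as high risk
    return prev - (1 - prev) * threshold / (1 - threshold)

# "Treat none" has net benefit 0 at every threshold. A decision curve evaluates
# net_benefit over the clinically reasonable threshold range, e.g.:
# thresholds = np.arange(0.05, 0.41, 0.01)
```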


Citations (82)


... In other areas of medicine, the advantages of new tools and techniques may be more clear cut. Karel Moons and colleagues put forward a tool to assess the quality, risk of bias, and applicability of prediction models that use regression or artificial intelligence methods (doi:10.1136/bmj-2024-082505). 4 Carole Lunny and her team have developed a way of assessing the risk of bias within the individual network meta-analyses conducted as part of a systematic review (doi:10.1136/bmj-2024-079839). 5 And Timothy Feeney and colleagues explain how directed acyclic graphs can help communicate a clinical research study's strengths or limitations (doi:10.1136/bmj-2023-078226). 6 In a clinical setting, modern surveillance devices offer an opportunity to help deal with abusive incidents, as body cameras can provide evidence to support meaningful action in response (doi:10.1136/bmj.r529). 7 Adopting renewable energy technologies, such as solar, wind, tidal, and geothermal, will not only reduce air pollution but may also improve health (doi:10.1136/bmj-2025-084352). 8 And advances in evaluating data on safety signals provide a chance to improve the active surveillance of medical devices, enhancing patient safety and reducing healthcare costs (doi:10.1136/bmj.r484). ...

Reference:

Seeking lightbulb moments
PROBAST+AI: an updated quality, risk of bias, and applicability assessment tool for prediction models using regression or artificial intelligence methods

The BMJ

... Differences across populations in the relationship between predictors and the outcome, differences in predictor assessment approaches across settings, and variations in outcome ascertainment methods further add to our uncertainties 39 . Fortunately, there have been recent calls on the importance of characterizing and communicating uncertainty around model predictions 40 . These calls encourage investigators to fully document and communicate uncertainty in model coefficients in reports of model development (e.g., reporting the covariance matrix of model coefficients, or reporting the coefficients of models fitted in bootstrapped copies of the original sample) 40 . ...

Uncertainty of risk estimates from clinical prediction models: rationale, challenges, and approaches

The BMJ

... BS and MSE are used in different contexts, as the BS compares a probability to an outcome of the binary random variable in the sense of scoring rules [17], while the MSE usually compares two real continuous values, in statistics typically comparing an estimator to the true value [11]. In particular, misconceptions about the BS are not uncommon and can sometimes be reinforced by potentially misleading statements in the literature [32,33,31,1,36,10]. Given the importance of accurate interpretation in clinical applications, it is crucial to address these misunderstandings. ...

The Harms of Class Imbalance Corrections for Machine Learning Based Prediction Models: A Simulation Study

Statistics in Medicine

... To ensure transparency and standardization in the reporting of LLM models for individual prognosis or diagnosis, the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis-LLM (TRIPOD-LLM) is one of the first guidelines, with a checklist of 19 main items and 50 subitems [75]. This guideline builds on the prior reporting statement of TRIPOD + AI [75]. ...

The TRIPOD-LLM reporting guideline for studies using large language models
  • Citing Article
  • January 2025

Nature Medicine

... The model has been developed for patients admitted in the first pandemic wave and showed a good discriminative performance with an AUC of 0.80 (95% CI 0.76-0.84) [16]. Because mortality is not the only important outcome for older patients, in future studies it may also be useful to validate the APOP screener and CFS for more patientcentered outcomes, for example functional decline or quality of life. ...

The influence of the dynamic context of the pandemic on the predictive performance of mortality predictions over time in older patients hospitalised for COVID-19
  • Citing Article
  • December 2024

Journal of Clinical Epidemiology

... The most informative assessment of calibration performance for a prediction model is a calibration plot, to assess whether, among patients with the same estimated risk of the event, the observed proportion of events equals the estimated risk. 4,19 Summarizing calibration statistics exist, such as the observed over expected (O:E) ratio, calibration intercept, calibration slope, expected calibration error (ECE), estimated calibration index (ECI) or integrated calibration index (ICI). Such summarizing statistics are by definition less informative than the calibration plot. ...

Performance evaluation of predictive AI models to support medical decisions: Overview and guidance

... Incorporating psychological and cognitive science into AR development is essential to ensure that training tools are practical in ideal circumstances and functional and safe in realworld emergencies. Addressing this gap requires interdisciplinary research teams that include engineers and surgeons, psychologists, neuroscientists, and human factors specialists [88][89][90]. ...

Updating methods for AI-based clinical prediction models: a scoping review
  • Citing Article
  • December 2024

Journal of Clinical Epidemiology

... Mean platelet volume (MPV) serves as a direct measure of platelet size. Elevated MPV levels generally signify enhanced platelet activation and are associated with severe inflammatory responses, including sepsis [6]. Another crucial indicator, red blood cell distribution width (RDW) is a parameter that reflects the variability of red blood cell volume. ...

Added value of inflammatory markers to vital signs for predicting mortality in patients with suspected infection: external validation and model development
  • Citing Article
  • November 2024

Internal and Emergency Medicine

... Addressing these concerns calls for initiatives like the proposed CARE-AI (Collaborative Assessment for Responsible and Ethical AI Implementation) framework, which seeks to align AI technologies with rigorous ethical standards and practical safeguards [44]. CARE-AI emphasizes risk assessment for misinformation, data privacy, fairness across diverse patient populations, and transparent declarations of an AI system's non-human nature. ...

An ethics assessment tool for artificial intelligence implementation in healthcare: CARE-AI
  • Citing Article
  • October 2024

Nature Medicine

... The study was exempt from research ethics board approval and the need for informed consent in accordance with European law, given the lack of involvement of human participants or patient data. We utilized the TRIPOD-LLM guideline for reporting [13]. ...

The TRIPOD-LLM Statement: A Targeted Guideline For Reporting Large Language Models Use