Marvin N. Wright

  • Dr.
  • Leibniz Institute for Prevention Research and Epidemiology - BIPS

About

87
Publications
57,247
Reads
11,236
Citations


Publications (87)
Article
Full-text available
We introduce the C++ application and R package ranger. The software is a fast implementation of random forests for high dimensional data. Ensembles of classification, regression and survival trees are supported. We describe the implementation, provide examples, validate the package with a reference implementation, and compare runtime and memory usa...
Article
The most popular approach for analyzing survival data is the Cox regression model. The Cox model may, however, be misspecified, and its proportionality assumption may not always be fulfilled. An alternative approach for survival prediction is random forests for survival outcomes. The standard split criterion for random survival forests is the log-r...
Article
Full-text available
Background Random forests have often been claimed to uncover interaction effects. However, if and how interaction effects can be differentiated from marginal effects remains unclear. In extensive simulation studies, we investigate whether random forest variable importance measures capture or detect gene-gene interactions. With capturing interaction...
Article
Random forests are one of the most successful methods for statistical learning and prediction. Here we consider random survival forests (RSF), which are an extension of the original random forest method to right-censored outcome variables. RSF use the log-rank split criterion to form an ensemble of survival trees, the prediction accuracy of the ens...
Preprint
Full-text available
We propose a principled method for autoencoding with random forests. Our strategy builds on foundational results from nonparametric statistics and spectral graph theory to learn a low-dimensional embedding of the model that optimally represents relationships in the data. We provide exact and approximate solutions to the decoding problem via constra...
Article
Full-text available
Background and Objectives To examine whether changes in the Mediterranean Diet (MD) or any of its MD food groups modulate the genetic susceptibility to obesity in European youth, both in cross‐sectional and longitudinal analyses. Methods For cross‐sectional analysis, 1982 participants at baseline, 1649 in follow‐up 1 (FU1) and 1907 in follow‐up 2...
Article
Full-text available
Purpose This study aimed to examine the impact of preprocessing and inclusion of various features on predicting the energy expenditure (EE) of preschool children (3.0–6.99 years). Methods The internal Canadian sample consisted of 36 children, equipped with accelerometers on their wrists (OPAL) and right hip (ActiGraph GT9X). The external German sa...
Article
This paper proposes a method for measuring conditional feature importance via generative modeling. In explainable artificial intelligence (XAI), conditional feature importance assesses the impact of a feature on a prediction model's performance given the information of other features. Model-agnostic post hoc methods to do so typically evaluate chan...
Preprint
Full-text available
Deep learning survival models often outperform classical methods in time-to-event predictions, particularly in personalized medicine, but their "black box" nature hinders broader adoption. We propose a framework for gradient-based explanation methods tailored to survival neural networks, extending their use beyond regression and classification. We...
Article
Variable selection is an important step in the analysis of high‐dimensional data, yet there are limited options for survival outcomes in the presence of competing risks. Commonly employed penalized Cox regression considers each event type separately through cause‐specific models, neglecting possibly shared information between them. We adapt the fea...
Preprint
Full-text available
This paper proposes a method for measuring conditional feature importance via generative modeling. In explainable artificial intelligence (XAI), conditional feature importance assesses the impact of a feature on a prediction model's performance given the information of other features. Model-agnostic post hoc methods to do so typically evaluate chan...
Article
Full-text available
The R package innsight offers a general toolbox for revealing variable-wise interpretations of deep neural networks' predictions with so-called feature attribution methods. Aside from the unified and user-friendly framework, the package stands out in three ways: It is generally the first R package implementing feature attribution methods for neural...
Preprint
Many existing interpretation methods are based on Partial Dependence (PD) functions that, for a pre-trained machine learning model, capture how a subset of the features affects the predictions by averaging over the remaining features. Notable methods include Shapley additive explanations (SHAP) which computes feature contributions based on a game t...
Article
Full-text available
Objective: This study aimed to develop convolutional neural network (CNN) models to predict the energy expenditure (EE) of children from raw accelerometer data. Additionally, this study sought to externally validate the CNN models in addition to the linear regression (LM), random forest (RF), and fully connected neural network (FcNN) models publ...
Article
Full-text available
Introduction Pharmacovigilance is vital for drug safety. The process typically involves two key steps: initial signal generation from spontaneous reporting systems (SRSs) and subsequent expert review to assess the signals’ (potential) causality and decide on the appropriate action. Methods We propose a novel discovery and verification approach to...
Article
Random survival forests (RSF) can be applied to many time‐to‐event research questions and are particularly useful in situations where the relationship between the independent variables and the event of interest is rather complex. However, in many clinical settings, the occurrence of the event of interest is affected by competing events, which means...
Chapter
Full-text available
While machine learning (ML) models are increasingly used due to their high predictive power, their use in understanding the data-generating process (DGP) is limited. Understanding the DGP requires insights into feature-target associations, which many ML models cannot directly provide due to their opaque internal mechanisms. Feature importance (FI)...
Chapter
Full-text available
Counterfactual explanations elucidate algorithmic decisions by pointing to scenarios that would have led to an alternative, desired outcome. Giving insight into the model’s behavior, they hint users towards possible actions and give grounds for contesting decisions. As a crucial factor in achieving these goals, counterfactuals must be plausible, i....
Preprint
Full-text available
This work presents the first large-scale neutral benchmark experiment focused on single-event, right-censored, low-dimensional survival data. Benchmark experiments are essential in methodological research to scientifically compare new and existing model classes through proper empirical evaluation. Existing benchmarks in the survival literature are...
Preprint
Full-text available
While machine learning (ML) models are increasingly used due to their high predictive power, their use in understanding the data-generating process (DGP) is limited. Understanding the DGP requires insights into feature-target associations, which many ML models cannot directly provide, due to their opaque internal mechanisms. Feature importance (FI)...
Article
Full-text available
Due to their flexibility and superior performance, machine learning models frequently complement and outperform traditional statistical survival models. However, their widespread adoption is hindered by a lack of user-friendly tools to explain their internal operations and prediction rationales. To tackle this issue, we introduce the survex R packa...
Conference Paper
Full-text available
Shapley values have achieved great popularity in explainable artificial intelligence. However, with standard sampling methods, resulting feature attributions are susceptible to adversarial attacks. This originates from target function evaluations at extrapolated data points, which are easily detectable and hence, enable models to behave accordingly...
Conference Paper
Full-text available
Scientists and practitioners increasingly rely on machine learning to model data and draw conclusions. Compared to statistical modeling approaches, machine learning makes fewer explicit assumptions about data structures, such as linearity. Consequently, the parameters of machine learning models usually cannot be easily related to the data generatin...
Preprint
Due to their flexibility and superior performance, machine learning models frequently complement and outperform traditional statistical survival models. However, their widespread adoption is hindered by a lack of user-friendly tools to explain their internal operations and prediction rationales. To tackle this issue, we introduce the survex R packa...
Article
Full-text available
Aims/hypothesis There is increasing evidence for the existence of shared genetic predictors of metabolic traits and neurodegenerative disease. We previously observed a U-shaped association between fasting insulin in middle-aged women and dementia up to 34 years later. In the present study, we performed genome-wide association (GWA) analyses for fas...
Preprint
Full-text available
The R package innsight offers a general toolbox for revealing variable-wise interpretations of deep neural networks' predictions with so-called feature attribution methods. Aside from the unified and user-friendly framework, the package stands out in three ways: It is generally the first R package implementing feature attribution methods for neural...
Article
Full-text available
Despite the popularity of feature importance (FI) measures in interpretable machine learning, the statistical adequacy of these methods is rarely discussed. From a statistical perspective, a major distinction is between analysing a variable’s importance before and after adjusting for covariates—i.e., between marginal and conditional measures. Our w...
Conference Paper
Full-text available
We propose methods for density estimation and data synthesis using a novel form of unsupervised random forests. Inspired by generative adversarial networks, we implement a recursive procedure in which trees gradually learn structural properties of the data through alternating rounds of generation and discrimination. The method is provably consisten...
Chapter
Full-text available
In this paper, we examine the bias towards high-entropy features exhibited by SHAP values on tree-based structures such as classification and regression trees, random forests or gradient boosted trees. Previous work has shown that many feature importance measures for tree-based models assign higher values to high-entropy features, i.e. with high ca...
Preprint
Full-text available
Despite the popularity of feature importance measures in interpretable machine learning, the statistical adequacy of these methods is rarely discussed. From a statistical perspective, a major distinction is between analyzing a variable's importance before and after adjusting for covariates - i.e., between marginal and conditional measures. Our work...
Preprint
We consider a global explanation of a regression or classification function by decomposing it into the sum of main components and interaction components of arbitrary order. When adding an identification constraint that is motivated by a causal interpretation, we find q-interaction SHAP to be the unique solution to that constraint. Here, q denotes t...
Data
The document provides information for transparency and reproducibility of the study according to the standard for species distribution modeling (ODMAP protocol) from Zurell et al. (2020). Additional information reported includes: (1) implementation strategy and results of spatial filtering operation, (2) hyperparameter space for model optimization...
Article
Full-text available
This paper describes a data-driven framework based on spatiotemporal machine learning to produce distribution maps for 16 tree species (Abies alba Mill., Castanea sativa Mill., Corylus avellana L., Fagus sylvatica L., Olea europaea L., Picea abies L. H. Karst., Pinus halepensis Mill., Pinus nigra J. F. Arnold, Pinus pinea L., Pinus sylvestris L., P...
Preprint
Full-text available
This paper describes a data-driven framework based on spatiotemporal machine learning to produce distribution maps for 16 tree species (Abies alba Mill., Castanea sativa Mill., Corylus avellana L., Fagus sylvatica L., Olea europaea L., Picea abies L. H. Karst., Pinus halepensis Mill., Pinus nigra J. F. Arnold, Pinus pinea L., Pinus sylvestris L., P...
Preprint
Full-text available
Density estimation is a fundamental problem in statistics, and any attempt to do so in high dimensions typically requires strong assumptions or complex deep learning architectures. An important application for density estimators is synthetic data generation, an area currently dominated by neural networks that often demand enormous training datasets...
Preprint
Introduction Pharmacovigilance shifted its focus from spontaneous reporting systems to electronic health care (EHC) data. Usually, a single statistical method is used to detect signals, i.e., potential adverse drug reactions (ADRs). Objective and Method We present a novel approach to detect ADRs in EHC databases. It aggregates the results of multi...
Preprint
Full-text available
This paper describes a data-driven framework based on spatiotemporal machine learning to produce distribution maps for 16 tree species (Abies alba Mill., Castanea sativa Mill., Corylus avellana L., Fagus sylvatica L., Olea europaea L., Picea abies L. H. Karst., Pinus halepensis Mill., Pinus nigra J. F. Arnold, Pinus pinea L., Pinus sylvestris L., P...
Preprint
Full-text available
Scientists and practitioners increasingly rely on machine learning to model data and draw conclusions. Compared to statistical modeling approaches, machine learning makes fewer explicit assumptions about data structures, such as linearity. However, their model parameters usually cannot be easily related to the data generating process. To learn abou...
Article
This study explored the relationship between motor abilities and accelerometer-derived measures of physical activity (PA) within preschool-aged children. A total of 193 children (101 girls, 4.2 ± 0.7 years) completed five tests to assess motor abilities, shuttle run (SR), standing long jump, lateral jumping, one-leg stand, and sit and reach. Four P...
Article
Full-text available
We propose the conditional predictive impact (CPI), a consistent and unbiased estimator of the association between one or several features and a given outcome, conditional on a reduced feature set. Building on the knockoff framework of Candès et al. (J R Stat Soc Ser B 80:551–577, 2018), we develop a novel testing procedure that works in conjunctio...
Preprint
Full-text available
Normalizing flows leverage the Change of Variables Formula (CVF) to define flexible density models. Yet, the requirement of smooth transformations (diffeomorphisms) in the CVF poses a significant challenge in the construction of these models. To enlarge the design space of flows, we introduce $\mathcal{L}$-diffeomorphisms as generalized transformat...
Article
Full-text available
Background Childhood obesity is a complex multifaceted condition, which is influenced by genetics, environmental factors, and their interaction. However, these interactions have mainly been studied in twin studies and evidence from population-based cohorts is limited. Here, we analyze the interaction of an obesity-related genome-wide polygenic risk...
Article
The Translational Machine (TM) is a machine learning (ML)‐based analytic pipeline that translates genotypic/variant call data into biologically contextualized features that richly characterize complex variant architectures and permit greater interpretability and biological replication. It also reduces potentially confounding effects of population s...
Article
Full-text available
Danish municipalities monitor older persons who are at high risk of declining health and would later need home care services. However, there is no established strategy yet on how to accurately identify those who are at high risk. Therefore, there is great potential to optimise the municipalities’ prevention strategies. Denmark’s comprehensive set o...
Article
Background: To evaluate a multicomponent health promotion program targeting preschoolers' physical activity (PA). Methods: PA of children from 23 German daycare facilities (DFs; 13 intervention DFs, 10 control DFs) was measured via accelerometry at baseline and after 12 months. Children's sedentary time, light PA, and moderate to vigorous PA wer...
Article
Full-text available
Random survival forests (RSF) are a powerful nonparametric method for building prediction models with a time-to-event outcome. RSF do not rely on the proportional hazards assumption and can be readily applied to both low- and higher-dimensional data. A remaining limitation of RSF, however, arises from the fact that the method is almost entirely foc...
Preprint
Objectives To investigate the degree by which the inherited susceptibility to obesity is modified by environmental factors during childhood and adolescence. Design Cohort study with repeated measurements of diet, lifestyle factors and anthropometry. Setting The pan-European IDEFICS/I.Family cohort Participants 8,609 repeated observations from 3,...
Article
Full-text available
In this paper, we give an overview of methodological issues related to the use of statistical learning approaches when analyzing high-dimensional genetic data. The focus is set on regression models and machine learning algorithms taking genetic variables as input and returning a classification or a prediction for the target variable of interest; fo...
Article
Random forests have become an established tool for classification and regression, in particular in high-dimensional settings and in the presence of non-additive predictor-response relationships. For bounded outcome variables restricted to the unit interval, however, classical modeling approaches based on mean squared error loss may severely suffer...
Article
Full-text available
Background In the last years more and more multi-omics data are becoming available, that is, data featuring measurements of several types of omics data for each patient. Using multi-omics data as covariate data in outcome prediction is both promising and challenging due to the complex structure of such data. Random forest is a prediction method kno...
Article
Full-text available
One reason for the widespread success of random forests (RFs) is their ability to analyze most datasets without preprocessing. For example, in contrast to many other statistical methods and machine learning approaches, no recoding such as dummy coding is required to handle ordinal and nominal predictors. The standard approach for nominal predictors...
Article
Full-text available
The random forest (RF) algorithm has several hyperparameters that have to be set by the user, for example, the number of observations drawn randomly for each tree and whether they are drawn with or without replacement, the number of variables drawn randomly for each split, the splitting rule, the minimum number of samples that a node must contain,...
Preprint
Full-text available
We propose a general test of conditional independence. The conditional predictive impact (CPI) is a provably consistent and unbiased estimator of one or several features' association with a given outcome, conditional on a (potentially empty) reduced feature set. The measure can be calculated using any supervised learning algorithm and loss function...
Preprint
Full-text available
Random forests have become an established tool for classification and regression, in particular in high-dimensional settings and in the presence of complex predictor-response relationships. For bounded outcome variables restricted to the unit interval, however, classical random forest approaches may severely suffer as they do not account for the he...
Article
Full-text available
Random forest and similar Machine Learning techniques are already used to generate spatial predictions, but spatial location of points (geography) is often ignored in the modeling process. Spatial auto-correlation, especially if still existent in the cross-validation residuals, indicates that the predictions are maybe biased, and this is suboptimal...
Data
RFsp—Random Forest for spatial data (R tutorial)
Preprint
Full-text available
Random forest and similar Machine Learning techniques are already used to generate spatial predictions, but spatial location of points (geography) is often ignored in the modeling process. Spatial auto-correlation, especially if still existent in the cross-validation residuals, indicates that the predictions are maybe biased, and this is suboptimal...
Preprint
Full-text available
Random forest and similar Machine Learning techniques are already used to generate spatial predictions, but spatial location of points (geography) is often ignored in the modeling process. Spatial auto-correlation, especially if still existent in the cross-validation residuals, indicates that the predictions are maybe biased, and this is suboptimal...
Article
Adverse drug reactions are among the leading causes of death. Pharmacovigilance aims to monitor drugs after they have been released to the market in order to detect potential risks. Data sources commonly used to this end are spontaneous reports sent in by doctors or pharmaceutical companies. Reports alone are rather limited when it comes to detecti...
Article
Full-text available
This article introduces the R package survivalsvm, implementing support vector machines for survival analysis. Three approaches are available in the package: The regression approach takes censoring into account when formulating the inequality constraints of the support vector problem. In the ranking approach, the inequality constraints set the obje...
Preprint
Full-text available
Random forest and similar Machine Learning techniques are already used to generate spatial predictions, but spatial location of points (geography) is often ignored in the modeling process. Spatial auto-correlation, especially if still existent in the cross-validation residuals, indicates that the predictions are maybe biased, and this is suboptimal...
Article
Full-text available
Motivation: Random forests are fast, flexible and represent a robust approach to analyze high dimensional data. A key advantage over alternative machine learning algorithms are variable importance measures, which can be used to identify relevant features or perform variable selection. Measures based on the impurity reduction of splits, such as the...
Article
Full-text available
Mutations in mitochondrial DNA (mtDNA) lead to heteroplasmy, i.e., the intracellular coexistence of wild-type and mutant mtDNA strands, which impact a wide spectrum of diseases but also physiological processes, including endurance exercise performance in athletes. However, the phenotypic consequences of limited levels of naturally arising heteropla...
Preprint
Full-text available
The random forest algorithm (RF) has several hyperparameters that have to be set by the user, e.g., the number of observations drawn randomly for each tree and whether they are drawn with or without replacement, the number of variables drawn randomly for each split, the splitting rule, the minimum number of samples that a node must contain and the...
Preprint
Full-text available
Random forest and similar Machine Learning techniques are already used to generate spatial predictions, but spatial location of points (geography) is often ignored in the modeling process. Spatial auto-correlation, especially if still existent in the cross-validation residuals, indicates that the predictions are maybe biased, and this is suboptimal...
Chapter
Full-text available
The advancement of high-throughput sequencing technologies enables sequencing of human genomes at steadily decreasing costs and increasing quality. Before variants can be analyzed, e.g., in association studies, the raw data obtained from the sequencer need to be preprocessed. These preprocessing steps include the removal of adapters, duplicates, an...
Article
Full-text available
Mutations in mitochondrial DNA (mtDNA) lead to heteroplasmy, i.e. the intracellular coexistence of wild-type and mutant mtDNA strands, which impact a wide spectrum of diseases but also physiological processes, including endurance exercise performance in athletes. However, the phenotypic consequences of limited levels of naturally-arising heteroplas...
Article
Full-text available
This paper describes the technical development and accuracy assessment of the most recent and improved version of the SoilGrids system at 250m resolution (June 2016 update). SoilGrids provides global predictions for standard numeric soil properties (organic carbon, bulk density, Cation Exchange Capacity (CEC), pH, soil texture fractions and coarse...
Article
Random survival forests (RSF) are a powerful method for risk prediction of right-censored outcomes in biomedical research. RSF use the log-rank split criterion to form an ensemble of survival trees. The most common approach to evaluate the prediction accuracy of a RSF model is Harrell's concordance index for survival data (‘C index’). Conceptually,...
Article
Objective: To compare the phenotype, clinical course, and outcome of myeloperoxidase (MPO)-antineutrophil cytoplasmic antibody (ANCA)-positive granulomatosis with polyangiitis (Wegener's) (GPA) to proteinase 3 (PR3)-ANCA-positive GPA and to MPO-ANCA-positive microscopic polyangiitis (MPA). Methods: We characterized all MPO-ANCA-positive patients...
Article
The most popular approach for analyzing survival data is the Cox regression model. The Cox model may, however, be misspecified, and its proportionality assumption is not always fulfilled. An alternative approach is random forests for survival outcomes. The standard split criterion for random survival forests is the log-rank test statistics, which f...
Article
To evaluate the clinical presentation and long-term outcome of a vasculitis centre cohort of patients with microscopic polyangiitis (MPA) with respect to organ manifestations, treatment, chronic damage and mortality. We performed a retrospective chart review at our vasculitis referral centre. MPA patients admitted between 1991 and 2013 classified b...
Article
Caries infiltration is a novel treatment option for proximal caries lesions. The idea is to build a diffusion barrier inside the lesion to slow down or stop the caries progression. If a lesion still reaches a critical size, restorative treatment is required. Clinical trials investigating caries infiltration thus produce multiple censored ordinal da...
Article
The PARAFAC method is an approach to decompose multidimensional arrays into component matrices for a given number of components. The most common way for calculating the decomposition is the alternating least squares method (ALS). Many other algorithms are modifications of ALS, including algorithms utilizing line search, enhanced line search or Tikh...
