
Tim Verdonck
- Professor at KU Leuven
About
- 119 Publications
- 42,207 Reads
- 2,005 Citations
Publications (119)
Random forests are widely used in regression. However, the decision trees used as base learners are poor approximators of linear relationships. To address this limitation we propose RaFFLE (Random Forest Featuring Linear Extensions), a novel framework that integrates the recently developed PILOT trees (Piecewise Linear Organic Trees) as base learne...
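A minimal sketch (not the RaFFLE implementation) of the limitation this abstract refers to: a standard random forest fits a purely linear signal with piecewise-constant steps, whereas a linear model recovers it directly. All names and data below are illustrative.

```python
# Hypothetical illustration (not RaFFLE): a standard random forest approximates
# a purely linear signal with axis-aligned steps, the weakness that PILOT-based
# base learners are meant to address.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)   # linear ground truth

X_test = np.linspace(-3, 3, 200).reshape(-1, 1)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
lin = LinearRegression().fit(X, y)

# The forest's prediction is piecewise constant; the linear fit is exact up to noise.
print(rf.predict(X_test[:5]))
print(lin.predict(X_test[:5]))
```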
Introduction
The study of attention has been pivotal in advancing our comprehension of cognition. The goal of this study is to investigate which EEG data representations or features are most closely linked to attention, and to what extent they can handle the cross-subject variability.
Methods
We explore the features obtained from the univariate ti...
As global fertilizer application rates increase, high-quality datasets are paramount for comprehensive analyses to support informed decision-making and policy formulation in crucial areas such as food security or climate change. This study aims to fill existing data gaps by employing two machine learning models, eXtreme Gradient Boosting and HistGr...
Robust estimation provides essential tools for analyzing data that contain outliers, ensuring that statistical models remain reliable even in the presence of some anomalous data. While robust methods have long been available in R, users of Python have lacked a comprehensive package that offers these methods in a cohesive framework. RobPy addresses...
Robust estimation provides essential tools for analyzing data that contain outliers, ensuring that statistical models remain reliable even in the presence of some anomalous data. While robust methods have long been available in R, Python users have lacked a comprehensive package that offers these methods in a cohesive framework. RobPy addresses thi...
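RobPy's own API is not reproduced here; as a generic illustration of what robust estimation guards against, the sketch below contrasts ordinary least squares with scikit-learn's Huber-type M estimator on contaminated data.

```python
# Generic robust-regression illustration (not RobPy's API): a handful of
# outliers pulls ordinary least squares away from the true slope, while a
# Huber-type M estimator stays close to it.
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)
y[:10] += 30.0                      # contaminate 5% of the responses

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)
print("OLS slope:", ols.coef_[0], " Huber slope:", huber.coef_[0])
```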
Linear model trees are regression trees that incorporate linear models in the leaf nodes. This preserves the intuitive interpretation of decision trees and at the same time enables them to better capture linear relationships, which is hard for standard decision trees. But most existing methods for fitting linear model trees are time consuming and t...
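A minimal sketch of the linear-model-tree idea (not the PILOT algorithm itself, which grows the tree and the linear fits jointly): partition the feature space with a shallow decision tree, then fit one ordinary linear model per leaf. The helper functions are hypothetical.

```python
# Minimal linear-model-tree sketch (not PILOT): partition with a shallow tree,
# then fit one linear model per leaf.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

def fit_linear_model_tree(X, y, max_depth=3):
    tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, y)
    leaves = tree.apply(X)
    models = {leaf: LinearRegression().fit(X[leaves == leaf], y[leaves == leaf])
              for leaf in np.unique(leaves)}
    return tree, models

def predict_linear_model_tree(tree, models, X):
    leaves = tree.apply(X)
    y_hat = np.empty(len(X))
    for leaf, model in models.items():
        mask = leaves == leaf
        if mask.any():
            y_hat[mask] = model.predict(X[mask])
    return y_hat

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(400, 2))
y = np.where(X[:, 0] > 0, 1.5 * X[:, 1], -2.0 * X[:, 1]) + rng.normal(scale=0.1, size=400)
tree, models = fit_linear_model_tree(X, y)
print(predict_linear_model_tree(tree, models, X[:5]))
```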
As global fertilizer application rates increase, high-quality datasets are paramount for comprehensive analyses to support informed decision-making and policy formulation in crucial areas such as food security or climate change. This study aims to fill existing data gaps by employing two machine learning models, eXtreme Gradient Boosting and HistGr...
Estimating conditional average dose responses (CADR) is an important but challenging problem. Estimators must correctly model the potentially complex relationships between covariates, interventions, doses, and outcomes. In recent years, the machine learning community has shown great interest in developing tailored CADR estimators that target specif...
Money laundering presents a pervasive challenge, burdening society by financing illegal activities. To more effectively combat and detect money laundering, the use of network information is increasingly being explored, exploiting the fact that money laundering necessarily involves interconnected parties. This has led to a surge in literature on network ana...
Insurance pricing is the process of determining the premiums that policyholders pay in exchange for insurance coverage. In order to estimate premiums, actuaries use statistically based methods, assessing various factors such as the probability of certain events occurring (like accidents or damages), where Generalized Linear Models (GLMs) are the...
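A hedged, minimal example of the GLM approach the abstract refers to: a Poisson claim-frequency model with the policy exposure as an offset, fitted with statsmodels. The variable names and simulated data are purely illustrative.

```python
# Illustrative claim-frequency GLM (variable names are hypothetical): a Poisson
# model with log-exposure as offset, the standard GLM starting point in pricing.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 1000
df = pd.DataFrame({
    "driver_age": rng.integers(18, 80, size=n),
    "urban": rng.integers(0, 2, size=n),
    "exposure": rng.uniform(0.1, 1.0, size=n),
})
lam = df["exposure"] * np.exp(-2.0 + 0.01 * (60 - df["driver_age"]) + 0.3 * df["urban"])
df["claims"] = rng.poisson(lam)

X = sm.add_constant(df[["driver_age", "urban"]])
model = sm.GLM(df["claims"], X, family=sm.families.Poisson(),
               offset=np.log(df["exposure"])).fit()
print(model.summary())
```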
There has been an increasing interest in fraud detection methods, driven by new regulations and by the financial losses linked to fraud. One of the state-of-the-art methods to fight fraud is network analytics. Network analytics leverages the interactions between different entities to detect complex patterns that are indicative of fraud. However, ne...
Over the past few years, the scale of sensor networks has greatly expanded. This generates extended spatiotemporal datasets, which form a crucial information resource in numerous fields, ranging from sports and healthcare to environmental science and surveillance. Unfortunately, these datasets often contain missing values due to systematic or inadv...
Outliers contaminating data sets are a challenge to statistical estimators. Even a small fraction of outlying observations can heavily influence most classical statistical methods. In this paper we propose generalized spherical principal component analysis, a new robust version of principal component analysis that is based on the generalized spatia...
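As context, a sketch of the classical spherical PCA idea that the paper generalizes: centre the data robustly, project every observation onto the unit sphere so outliers lose their leverage, then compute ordinary principal directions of the projected points. The coordinatewise median used as centre is a simplifying assumption.

```python
# Sketch of classical spherical PCA (the paper proposes a generalized version):
# robust centring, projection to the unit sphere, then ordinary PCA directions.
import numpy as np

def spherical_pca(X, center=None):
    if center is None:
        center = np.median(X, axis=0)            # simple robust centre (assumption)
    Z = X - center
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    S = Z / np.where(norms == 0, 1.0, norms)     # points on the unit sphere
    # principal directions of the sphered data are resistant to gross outliers
    _, _, Vt = np.linalg.svd(S - S.mean(axis=0), full_matrices=False)
    return Vt

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
X[:15] += 50.0                                   # gross outliers
print(spherical_pca(X)[:2])                      # first two robust directions
```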
Class imbalance is a common problem in (binary) classification problems. It appears in many application domains, such as text classification, fraud detection, churn prediction and medical diagnosis. A widely used approach to cope with this problem at the data level is the Synthetic Minority Oversampling Technique (SMOTE) which uses the K-Nearest Ne...
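A minimal from-scratch SMOTE sketch (in practice one would typically use imbalanced-learn's implementation): each synthetic minority sample is a linear interpolation between a minority point and one of its K nearest minority neighbours.

```python
# Minimal SMOTE sketch: interpolate between a minority point and one of its
# K nearest minority neighbours to create synthetic samples.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1: each point is its own neighbour
    _, idx = nn.kneighbors(X_min)
    samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i][rng.integers(1, k + 1)]                 # skip the point itself
        gap = rng.uniform()
        samples.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(samples)

rng = np.random.default_rng(5)
X_min = rng.normal(size=(20, 3))          # minority class
print(smote(X_min, n_new=10).shape)       # (10, 3)
```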
Changes in climate can greatly affect the phenology of plants, which can have important feedback effects, such as altering the carbon cycle. These phenological feedback effects are often induced by a shift in the start or end dates of the growing season of plants. The normalized difference vegetation index (NDVI) serves as a straightforward indica...
Within the broader context of improving interactions between artificial intelligence and humans, the question has arisen regarding whether auditory and rhythmic support could increase attention for visual stimuli that do not stand out clearly from an information stream. To this end, we designed an experiment inspired by pip-and-pop but more appropr...
Insurance fraud is a non self-revealing type of fraud. The true historical labels (fraud or legitimate) are only as precise as the investigators’ efforts and successes to uncover them. Popular approaches of supervised and unsupervised learning fail to capture the ambiguous nature of uncertain labels. Imprecisely observed labels can be represented i...
Logistic regression is one of the most popular statistical techniques for solving (binary) classification problems in various applications (e.g. credit scoring, cancer detection, ad click predictions and churn classification). Typically, the maximum likelihood estimator is used, which is very sensitive to outlying observations. In this paper, we pr...
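A small demonstration of the sensitivity the paper addresses (this is the plain maximum likelihood fit, not the proposed robust estimator): a few far-away, mislabeled observations noticeably shift the fitted coefficients.

```python
# Plain maximum-likelihood logistic regression is sensitive to mislabeled
# outliers; the coefficients shift once contamination is added.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 2))
y = (X @ np.array([2.0, -1.0]) + rng.logistic(size=500) > 0).astype(int)

clean = LogisticRegression(C=1e6).fit(X, y)                  # near-unpenalized MLE

X_out = np.vstack([X, rng.normal(loc=8.0, size=(10, 2))])    # far-away points
y_out = np.concatenate([y, np.zeros(10, dtype=int)])         # with the "wrong" label
contaminated = LogisticRegression(C=1e6).fit(X_out, y_out)

print("clean coefficients:       ", clean.coef_)
print("contaminated coefficients:", contaminated.coef_)
```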
The exploration and analysis of large high-dimensional data sets calls for well-thought techniques to extract the salient information from the data, such as co-clustering. Latent block models cast co-clustering in a probabilistic framework that extends finite mixture models to the two-way setting. Real-world data sets often contain anomalies which...
In lending, where prices are specific to both customers and products, having a well-functioning personalized pricing policy in place is essential to effective business making. Typically, such a policy must be derived from observational data, which introduces several challenges. While the problem of "endogeneity" is prominently studied in the esta...
Estimating the effects of treatments with an associated dose on an instance's outcome, the "dose response", is relevant in a variety of domains, from healthcare to business, economics, and beyond. Such effects, also known as continuous-valued treatment effects, are typically estimated from observational data, which may be subject to dose selection...
One of the established approaches to causal discovery consists of combining directed acyclic graphs (DAGs) with structural causal models (SCMs) to describe the functional dependencies of effects on their causes. Possible identifiability of SCMs given data depends on assumptions made on the noise variables and the functional classes in the SCM. For...
We study the feasibility of applying machine learning to predict the performance of road cyclists using publicly available data. The performance is investigated by predicting the presence or absence in the top places of next year’s ranking based on a rider’s characteristics and results in the current and previous years. We apply several classificat...
Performance measurement is an essential task once a statistical model is created. The area under the receiver operating characteristic curve (AUC) is the most popular measure for evaluating the quality of a binary classifier. In this case, the AUC is equal to the concordance probability, a frequently used measure to evaluate the discriminatory po...
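A quick numerical check of the equality mentioned here: the AUC coincides with the concordance probability P(score of a random positive > score of a random negative), with ties counted as one half. The simulated scores are illustrative.

```python
# The AUC equals the concordance probability (Mann-Whitney form), ties counted as 1/2.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
y = rng.integers(0, 2, size=300)
scores = rng.normal(size=300) + 1.5 * y          # informative but noisy scores

pos, neg = scores[y == 1], scores[y == 0]
pairs = pos[:, None] - neg[None, :]
concordance = np.mean(pairs > 0) + 0.5 * np.mean(pairs == 0)

print(concordance, roc_auc_score(y, scores))     # the two values agree
```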
In this study, we investigated the relationship between age and performance in professional road cycling. We considered 1864 male riders present in the yearly top 500 ranking of ProCyclingStats (PCS) from 1993 until 2021 with more than 700 PCS Points. We applied a data-driven approach for finding natural clusters of the rider’s speciality (General...
Linear model trees are regression trees that incorporate linear models in the leaf nodes. This preserves the intuitive interpretation of decision trees and at the same time enables them to better capture linear relationships, which is hard for standard decision trees. But most existing methods for fitting linear model trees are time consuming and t...
The direpack package establishes a set of modern statistical dimensionality reduction techniques into the Python universe as a single, consistent package. Several of the methods included are only available as open source through direpack, whereas the package also offers competitive Python implementations of methods previously only available in othe...
Instance-dependent cost-sensitive (IDCS) learning methods have proven useful for binary classification tasks where individual instances are associated with variable misclassification costs. However, we demonstrate in this paper by means of a series of experiments that IDCS methods are sensitive to noise and outliers in relation to instance-dependen...
We present a personalized approach for frequent fitness monitoring in road cycling solely relying on sensor data collected during bike rides and without the need for maximal effort tests. We use competition and training data of three world-class cyclists of Team Jumbo–Visma to construct personalised heart rate models that relate the heart rate duri...
The literature on fraud analytics and fraud detection has seen a substantial increase in output in the past decade. This has led to a wide range of research topics and overall little organization of the many aspects of fraud analytical research. The focus of academics ranges from identifying fraudulent credit card payments to spotting illegitimate...
This paper presents a multinomial multi-state micro-level reserving model, denoted mCube. We propose a unified framework for modelling the time and the payment process for IBNR and RBNS claims and for modelling IBNR claim counts. We use multinomial distributions for the time process and spliced mixture models for the payment process. We illustrate t...
Generalized Linear Models (GLMs) are a popular class of regression models when the responses follow a distribution in the exponential family. In real data the variability often deviates from the relation imposed by the exponential family distribution, which results in over- or underdispersion. Dispersion effects may even vary in the data. Such data...
Machine maintenance is a challenging operational problem, where the goal is to plan sufficient preventive maintenance to avoid machine failures and overhauls. Maintenance is often imperfect in reality and does not make the asset as good as new. Although a variety of imperfect maintenance policies have been proposed in the literature, these rely on...
Premium fraud concerns data misrepresentation committed by an insurance customer with the intent to benefit from an unduly low premium at the underwriting of a policy. In this paper, we propose a novel approach for evaluating the risk of underwriting premium fraud at the time of application in the presence of potentially misrepresented self-reporte...
A central problem in business concerns the optimal allocation of limited resources to a set of available tasks, where the payoff of these tasks is inherently uncertain. In credit card fraud detection, for instance, a bank can only assign a small subset of transactions to their fraud investigations team. Typically, such problems are solved using a c...
Predictive models are increasingly being used to optimize decision-making and minimize costs. A conventional approach is predict-then-optimize: first, a predictive model is built; then, this model is used to optimize decision-making. A drawback of this approach, however, is that it only incorporates costs in the second stage. Conversely, the predic...
Two new methods for sparse dimension reduction are introduced, based on martingale difference divergence and ball covariance, respectively. These methods can be utilized straightforwardly as sufficient dimension reduction (SDR) techniques to estimate a sufficient dimension reduced subspace, which contains all information sufficient to explain a dep...
In many practical applications, such as fraud detection, credit risk modeling or medical decision making, classification models for assigning instances to a predefined set of classes are required to be both precise and interpretable. Linear modeling methods such as logistic regression are often adopted since they offer an acceptable balance between...
The concordance probability, also called the C-index, is a popular measure to capture the discriminatory ability of a predictive model. In this article, the definition of this measure is adapted to the specific needs of the frequency and severity model, typically used during the technical pricing of a non-life insurance product. For the frequency m...
Professional road cycling is a very competitive sport, and many factors influence the outcome of the race. These factors can be internal (e.g., psychological preparedness, physiological profile of the rider, and the preparedness or fitness of the rider) or external (e.g., the weather or strategy of the team) to the rider, or even completely unpredi...
Topological data analysis is a recent and fast growing field that approaches the analysis of datasets using techniques from (algebraic) topology. Its main tool, persistent homology (PH), has seen a notable increase in applications in the last decade. Often cited as the most favourable property of PH and the main reason for practical success are the...
Topological data analysis is a recent and fast growing field that approaches the analysis of datasets using techniques from (algebraic) topology. Its main tool, persistent homology (PH), has seen a notable increase in applications in the last decade. Often cited as the most favourable property of PH and the main reason for practical success are the...
In order to improve the performance of any machine learning model, it is important to focus more on the data itself instead of continuously developing new algorithms. This is exactly the aim of feature engineering. It can be defined as the clever engineering of data, thereby exploiting the intrinsic bias of the machine learning technique to our benef...
A major challenge when trying to detect fraud is that the fraudulent activities form a minority class that makes up a very small proportion of the data set. Detecting fraud in an imbalanced data set typically leads to predictions that favor the majority group, causing fraud to remain undetected. We discuss some popular oversampling techniques that...
Background
Current EULAR guidelines recommend treating RA early, intensively and to-target. A data-driven tool for planning the optimal moment for subsequent visits might adapt visit schedules more to the patient’s needs, without losing treatment quality.
Objectives
To determine the optimal statistical model and clinical factors to predict the tim...
Performance measurement is an essential task once a statistical model is created. The Area Under the Receiver Operating Characteristic Curve (AUC) is the most popular measure for evaluating the quality of a binary classifier. In this case, AUC is equal to the concordance probability, a frequently used measure to evaluate the discriminatory power...
Card transaction fraud is a growing problem affecting card holders worldwide. Financial institutions increasingly rely upon data-driven methods for developing fraud detection systems, which are able to automatically detect and block fraudulent transactions. From a machine learning perspective, the task of detecting fraudulent transactions is a bina...
Financial institutions increasingly rely upon data-driven methods for developing fraud detection systems, which are able to automatically detect and block fraudulent transactions. From a machine learning perspective, the task of detecting suspicious transactions is a binary classification problem and therefore many techniques can be applied. Interp...
In many practical applications, such as fraud detection, credit risk modeling or medical decision making, classification models for assigning instances to a predefined set of classes are required to be both precise and interpretable. Linear modeling methods such as logistic regression are often adopted, since they offer an acceptable balance...
As its name suggests, sufficient dimension reduction (SDR) targets to estimate a subspace from data that contains all information sufficient to explain a dependent variable. Ample approaches exist to SDR, some of the most recent of which rely on minimal to no model assumptions. These are defined according to an optimization criterion that maximizes...
Predicting cycling race results has always been a task left to experts with a lot of domain knowledge. This is largely due to the fact that the outcomes of cycling races can be rather surprising and depend on an extensive set of parameters. Examples of such factors are, among others, the preparedness of a rider, the weather, the team strategy, and...
In this work, a method is proposed for exploiting the predictive power of a geo-tagged dataset as a means of identification of user-relevant points of interest (POI). The proposed methodology is subsequently applied in an insurance context for the automatic identification of a driver’s residence address, solely based on his pattern of movements on...
The direpack package aims to establish a set of modern statistical dimension reduction techniques into the Python universe as a single, consistent package. The dimension reduction methods included fall into three categories: projection pursuit based dimension reduction, sufficient dimension reduction, and robust M estimators for dimension reducti...
Card transaction fraud is a growing problem affecting card holders worldwide. Financial institutions increasingly rely upon data-driven methods for developing fraud detection systems, which are able to automatically detect and block fraudulent transactions. From a machine learning perspective, the task of detecting fraudulent transactions is a bina...
A major challenge when trying to detect fraud is that the fraudulent activities form a minority class that makes up a very small proportion of the data set. In most data sets, fraud typically occurs in less than 0.5% of the cases. Detecting fraud in such a highly imbalanced data set typically leads to predictions that favor the majority group, caus...
In time-to-event studies, censoring often occurs and models that take this into account are widespread. In the presence of outliers, standard estimators of model parameters may be affected such that results and conclusions are not reliable anymore. This in turn also hampers the detection of these outliers due to masking effects. To cope with outli...
The cellwise robust M regression estimator is introduced as the first estimator of its kind that intrinsically yields both a map of cellwise outliers consistent with the linear model, and a vector of regression coefficients that is robust against vertical outliers and leverage points. As a by-product, the method yields a weighted and imputed data s...
The minimum covariance determinant (MCD) approach estimates the location and scatter matrix using the subset of given size with lowest sample covariance determinant. Its main drawback is that it cannot be applied when the dimension exceeds the subset size. We propose the minimum regularized covariance determinant (MRCD) approach, which differs from...
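The MRCD estimator itself is not shown here, but the plain MCD that it regularizes is available in scikit-learn as MinCovDet; a short illustration of how it resists contamination that pulls the classical estimate away.

```python
# The plain MCD (which MRCD regularizes for high dimensions) via scikit-learn's
# MinCovDet: the robust location stays near zero despite 10% contamination.
import numpy as np
from sklearn.covariance import MinCovDet, EmpiricalCovariance

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 4))
X[:20] += 10.0                                   # 10% contamination

print("classical location:", EmpiricalCovariance().fit(X).location_)
print("MCD location:      ", MinCovDet(random_state=0).fit(X).location_)
```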
We propose a minimum distance estimator for the higher-order comoments of a multivariate distribution exhibiting a lower dimensional latent factor structure. We derive the influence function of the proposed estimator and prove its consistency and asymptotic normality. The simulation study confirms the large gains in accuracy compared to the traditi...
The cellwise robust M regression estimator is introduced as the first estimator of its kind that intrinsically yields both a map of cellwise outliers consistent with the linear model, and a vector of regression coefficients that is robust against vertical outliers and leverage points. As a by-product, the method yields a weighted and imputed data s...
The concordance probability or C-index is a popular measure to capture the discriminatory ability of a regression model. In this article, the definition of this measure is adapted to the specific needs of the frequency and severity model, typically used during the technical pricing of a non-life insurance product. Due to the typical large sample si...
Outlier detection is an inevitable step to most statistical data analyses. However, the mere detection of an outlying case does not always answer all scientific questions associated with that data point. Outlier detection techniques, classical and robust alike, will typically flag the entire case as outlying, or attribute a specific case weight to...
The Minimum Covariance Determinant (MCD) approach robustly estimates the location and scatter matrix using the subset of given size with lowest sample covariance determinant. Its main drawback is that it cannot be applied when the dimension exceeds the subset size. We propose the Minimum Regularized Covariance Determinant (MRCD) approach, which dif...
Decision-making in finance often requires an accurate estimate of the coskewness matrix to optimize the allocation to random variables with asymmetric distributions. The classical sample estimator of the coskewness matrix performs poorly for small sample sizes. A solution is to use shrinkage estimators, defined as the convex combination between the...
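A hedged sketch of the shrinkage idea described here: the estimator is a convex combination of the sample coskewness matrix and a target. The zero (no-skewness) target used below is an illustrative choice, not necessarily the one studied in the paper.

```python
# Shrinkage coskewness sketch: convex combination of the sample coskewness
# matrix and a target (here the zero matrix, purely for illustration).
import numpy as np

def sample_coskewness(X):
    Z = X - X.mean(axis=0)
    n = len(Z)
    # d x d^2 coskewness matrix: mean over observations of  z (z kron z)'
    return sum(np.outer(z, np.kron(z, z)) for z in Z) / n

def shrunk_coskewness(X, lam=0.5, target=None):
    S = sample_coskewness(X)
    if target is None:
        target = np.zeros_like(S)                # "no skewness" target (assumption)
    return (1 - lam) * S + lam * target

rng = np.random.default_rng(9)
X = rng.gamma(shape=2.0, size=(100, 3))          # skewed data
print(shrunk_coskewness(X, lam=0.3).shape)       # (3, 9)
```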
The chain ladder method is a popular technique to estimate the future reserves needed to handle claims that are not fully settled. Since the predictions of the aggregate portfolio (consisting of different subportfolios) do not need to be equal to the sum of the predictions of the subportfolios, a general multivariate chain ladder (GMCL) method has...
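For reference, the basic deterministic chain ladder step on a single run-off triangle (the paper works with the multivariate extension of this): development factors are ratios of column sums, applied to complete the unobserved part of the triangle. The toy triangle is made up.

```python
# Basic (univariate) chain ladder on a cumulative run-off triangle; NaNs mark
# the unobserved future cells that the development factors fill in.
import numpy as np

tri = np.array([            # accident year x development year, cumulative claims
    [100., 150., 175., 180.],
    [110., 168., 192., np.nan],
    [ 95., 140., np.nan, np.nan],
    [125., np.nan, np.nan, np.nan],
])

for j in range(tri.shape[1] - 1):
    known = ~np.isnan(tri[:, j + 1])
    # development factor: ratio of column sums over rows observed in both columns
    f = tri[known, j + 1].sum() / tri[known, j].sum()
    fill = np.isnan(tri[:, j + 1]) & ~np.isnan(tri[:, j])
    tri[fill, j + 1] = f * tri[fill, j]

print(np.round(tri, 1))     # completed triangle; last column holds the ultimates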
The chain ladder method is a popular technique to estimate the future reserves needed to handle claims that are not fully settled. Since the predictions of the aggregate portfolio (consisting of different subportfolios) in general differ from the sum of the predictions of the subportfolios, a general multivariate chain ladder (GMCL) method has alre...
In this paper, a multivariate constrained robust M‐regression method is developed to estimate shaping coefficients for electricity forward prices. An important benefit of the new method is that model arbitrage can be ruled out at an elementary level, as all shaping coefficients are treated simultaneously. Moreover, the new method is robust to outli...
In this paper, a multivariate constrained robust M-regression (MCRM) method is developed to estimate shaping coefficients for electricity forward prices. An important benefit of the new method is that model arbitrage can be ruled out at an elementary level, as all shaping coefficients are treated simultaneously. Moreover, the new method is robust t...
Customer retention campaigns increasingly rely on predictive models to detect potential churners in a vast customer base. From the perspective of machine learning, the task of predicting customer churn can be presented as a binary classification problem. Using data on historic behavior, classification algorithms are built with the purpose of accura...
Customer retention campaigns increasingly rely on predictive models to detect potential churners in a vast customer base. From the perspective of machine learning, the task of predicting customer churn can be presented as a binary classification problem. Using data on historic behavior, classification algorithms are built with the purpose of accura...
Outlier detection is an inevitable step to most statistical data analyses. However, the mere detection of an outlying case does not always answer all scientific questions associated with that data point. Outlier detection techniques, classical and robust alike, will typically flag the entire case as outlying, or attribute a specific case weight to...
The seemingly unrelated regression (SUR) model is a generalization of a linear regression model consisting of more than one equation, where the error terms of these equations are contemporaneously correlated. The standard Feasible Generalized Least Squares (FGLS) estimator is efficient as it takes into account the covariance structure of the error...
The Minimum Covariance Determinant (MCD) approach robustly estimates the location and scatter matrix using the subset of given size with lowest sample covariance determinant. Its main drawback is that it cannot be applied when the dimension exceeds the subset size. We propose the Minimum Regularized Covariance Determinant (MRCD) approach, which dif...
Insurers are faced with the challenge of estimating the future reserves needed to handle historic and outstanding claims that are not fully settled. A well-known and widely used technique is the chain-ladder method, which is a deterministic algorithm. To include a stochastic component one may apply generalized linear models to the run-off triangles...
Insurers are faced with the challenge of estimating the future reserves needed to handle historic and outstanding claims that are not fully settled. A well-known and widely used technique is the chain-ladder method, which is a deterministic algorithm. To include a stochastic component one may apply generalized linear models to the run-off triangles...
A new sparse PCA algorithm is presented which is robust against outliers. The approach is based on the ROBPCA algorithm which generates robust but non-sparse loadings. The construction of the new ROSPCA method is detailed, as well as a selection criterion for the sparsity parameter. An extensive simulation study and a real data example are performe...
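As a non-robust point of reference for this abstract (ROSPCA itself is not implemented here), scikit-learn's SparsePCA yields sparse loadings but, like plain PCA, can be distorted by a few outlying cases.

```python
# Point of reference only: sparse but non-robust loadings from scikit-learn's
# SparsePCA on data with a sparse structure and a few outlying cases.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(10)
scores = rng.normal(size=(200, 2))
loadings = np.zeros((2, 10))
loadings[0, :3] = 1.0                      # sparse ground-truth structure
loadings[1, 3:6] = 1.0
X = scores @ loadings + rng.normal(scale=0.1, size=(200, 10))
X[:10] += 20.0                             # a few outlying cases

print("PCA loadings:\n", np.round(PCA(n_components=2).fit(X).components_, 2))
print("Sparse PCA loadings:\n",
      np.round(SparsePCA(n_components=2, random_state=0).fit(X).components_, 2))
```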