Book

Robust Regression and Outlier Detection

Authors: Peter J. Rousseeuw, Annick M. Leroy
... Various robust regression methods exist for robust parameter estimation. These include the M-estimator, the MM-estimator, the S-estimator, the least trimmed squares method, the least median of squares method, the least absolute deviation (LAD) method, and others (Rousseeuw 1984; Rousseeuw and Leroy 1987; Hawkins 1993; Maronna et al. 2006). ...
... However, a shortcoming of this method is its vulnerability to leverage points (outliers in the covariates) and its inability to perform variable selection (Giloni et al. 2006b). When both outliers and leverage points are present in a model, the breakdown point of LAD matches that of least squares, as observed by Rousseeuw and Leroy (1987). The minimum attainable value for the breakdown point is 1/n. ...
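For context, the breakdown point referred to in the excerpt is usually the finite-sample replacement breakdown point in the sense of Donoho and Huber; a sketch of the standard definition:

\varepsilon_n^{*}(T; Z) = \min \left\{ \frac{m}{n} : \sup_{Z'_m} \lVert T(Z'_m) \rVert = \infty \right\},

where Z'_m ranges over all samples obtained by replacing any m of the n points of Z by arbitrary values. Least squares and LAD attain the minimum value 1/n (a single bad point suffices), whereas high-breakdown estimators such as LMS and LTS attain values close to 1/2.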
... Through simulations, they demonstrated that the performance of WLAD regression estimators competes favorably with high-breakdown-point regression estimators, especially in the presence of leverage points. For more details, refer to Hubert and Rousseeuw (1997) and Rousseeuw and Leroy (1987). ...
Article
Full-text available
The Least Absolute Shrinkage and Selection Operator (LASSO) is widely used for parameter estimation and variable selection but can encounter challenges with outliers and heavy-tailed error distributions. Integrating variable selection methods such as LASSO with Weighted Least Absolute Deviation (WLAD) has been explored in limited studies to handle these problems. In this study, we proposed the integration of Weighted Least Absolute Deviation with Liu-LASSO to handle variable selection, parameter estimation, and heavy-tailed error distributions, due to the advantages of the Liu-LASSO approach over traditional LASSO methods. This approach is demonstrated through a simple simulation study and a real-world application. Our findings showcase the superiority of our method over existing techniques while maintaining asymptotic efficiency comparable to that of the unpenalized LAD estimator.
... Determining the application scope of the suggested GPR-Rational Quadratic model and GMDH correlation (as the best intelligence paradigms developed in this work), along with possible outliers existing in the data bank, can be achieved by employing the well-known leverage technique (Rousseeuw and Leroy 1987; Goodall 1993; Gramatica 2007). To this end, standardized residuals (SR) are defined as the discrepancies between the model's estimations and the actual laboratory data, where MSE is the mean square error, e_i is the error value, and H_i is the ith leverage value (Rousseeuw and Leroy 1987; Hadavimoghaddam et al. 2022): ...
... In addition, leverage values, as the diagonal elements of the Hat matrix, are computed in this approach, considering the Hat matrix as follows (Rousseeuw and Leroy 1987; Hadavimoghaddam et al. 2023): ...
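The equations referenced in these excerpts are not reproduced on this page; as a sketch, the standard quantities used in the leverage (Williams plot) technique are

H = X (X^{\top} X)^{-1} X^{\top}, \qquad H_i = H_{ii}, \qquad \mathrm{SR}_i = \frac{e_i}{\sqrt{\mathrm{MSE}\,(1 - H_i)}},

with the warning leverage commonly set to H^{*} = 3(p+1)/n, where p is the number of input variables and n the number of data points. Whether the cited works use exactly this standardization and cutoff is an assumption based on common practice.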
Article
Full-text available
The injection of carbon dioxide (CO₂) into coal seams is a prominent technique that can provide carbon sequestration in addition to enhancing coalbed methane extraction. However, CO₂ injection into coal seams can alter the coal strength properties and their long-term integrity. In this work, the strength alteration of coals induced by CO₂ exposure was modeled using 147 laboratory-measured unconfined compressive strength (UCS) data points, considering CO₂ saturation pressure, CO₂ interaction temperature, CO₂ interaction time, and coal rank as input variables. Advanced white-box and black-box machine learning algorithms including Gaussian process regression (GPR) with a rational quadratic kernel, extreme gradient boosting (XGBoost), categorical boosting (CatBoost), adaptive boosting decision tree (AdaBoost-DT), multivariate adaptive regression splines (MARS), K-nearest neighbor (KNN), gene expression programming (GEP), and group method of data handling (GMDH) were used in the modeling process. The results demonstrated that GPR-Rational Quadratic provided the most accurate estimates of the UCS of coals, with average absolute percent relative error (AAPRE) values of 3.53%, 3.62%, and 3.55% for the train, test, and total data sets, respectively. Also, the overall determination coefficient (R²) value of 0.9979 was additional proof of the excellent accuracy of this model compared with the other models. Moreover, the first mathematical correlations to estimate the change in coal strength induced by CO₂ exposure were established in this work by the GMDH and GEP algorithms with acceptable accuracy. Sensitivity analysis revealed that the Spearman correlation coefficient reflects the relative importance of the input parameters on the coal strength better than the Pearson correlation coefficient. Among the inputs, coal rank had the greatest influence on the coal strength (strong nonlinear relationship) based on the Spearman correlation coefficient. After that, CO₂ interaction time and CO₂ saturation pressure showed relatively strong nonlinear relationships with the model output, respectively. The CO₂ interaction temperature had the smallest impact on coal strength alteration induced by CO₂ exposure based on both the Pearson and Spearman correlation coefficients. Finally, the leverage technique revealed that the laboratory database used for modeling CO₂-induced strength alteration of coals was highly reliable, and the suggested GPR-Rational Quadratic model and GMDH correlation could be applied for predicting the UCS of coals exposed to CO₂ with high statistical accuracy and reliability. Introduction: The escalating negative impact of greenhouse gas emissions on global climate change has led to the emergence of diverse strategies aimed at mitigating the rapid increase of these gases in the atmosphere. Carbon capture and storage technologies have gained significant attention due to their potential to reduce CO₂ emissions into the atmosphere. Among these techniques, CO₂ sequestration in coal seams has been identified as an effective innovative approach (Wang et al. 2007; Self et al. 2012; Yamazaki et al. 2006; Jiang et al. 2023; Li and Fang 2014). Studies revealed that coal has a higher adsorption capacity for CO₂ than for methane (CH₄), which naturally occurs within the coal seam (Ottiger et al. 2008; Luu et al. 2022; Pandey et al. 2022). Injection of CO₂ into the coal matrix causes CH₄ displacement in the coal seam.
Therefore, this technique not only mitigates the destructive effects of CO₂ on the environment but is also known as enhanced coalbed methane recovery (He et al. 2017; Liu et al. 2019; Omotilewa et al. 2021; Kong et al. 2021). The process of injecting and sequestering CO₂ in coal seams causes chemical and physical interactions between the coal and CO₂, leading to alterations in the coal structure. These structural changes have been determined to have an impact on the performance of CO₂ sequestration in coal seams (Zagorščak and Thomas 2018; Yan et al. 2019; Majewska et al. 2013). The storage conditions of CO₂ in coal seams lead to the transition of CO₂ to the supercritical state, which is characterized by complex properties different from those of the subcritical state (Vishal et al. 2013; Patel et al. 2016). The process of injecting CO₂ into coal seams results in the extraction of some organic groups or mineral compositions, thereby inducing significant alterations in the volumetric strength of coal (Mazumder et al. 2006; Kędzior 2019; Farmer and Pooley 1967; Liu et al. 2021; Gathitu et al. 2009). Furthermore, the phenomenon of coal swelling in the presence of CO₂ and variations in permeability pose significant challenges to the practicality of the sequestration process (Hol and Spiers 2012; Masoudian 2016; Vandamme et al. 2010;
... A common approach is to use the median absolute deviation (MAD) of the residuals as a robust estimate of σ and then set the threshold to a multiple (usually 4 or 6) of the MAD. This ensures that the threshold is adaptive to the variability of the data and provides a good balance between robustness and efficiency in parameter estimation (Rousseeuw & Leroy, 1987). ...
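As an illustration of the MAD-based cutoff described above, here is a minimal Python sketch (not code from the cited work; the multiplier of 4 is one of the values mentioned in the excerpt):

import numpy as np

def mad_outlier_flags(residuals, multiplier=4.0):
    """Flag residuals whose absolute deviation from the median exceeds
    a multiple (e.g. 4 or 6) of the median absolute deviation (MAD)."""
    residuals = np.asarray(residuals, dtype=float)
    med = np.median(residuals)
    mad = np.median(np.abs(residuals - med))   # robust scale estimate
    return np.abs(residuals - med) > multiplier * mad

# Example: two gross errors among otherwise small residuals
print(mad_outlier_flags([0.1, -0.2, 0.05, 8.0, 0.0, -7.5]))
# -> [False False False  True False  True]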
Article
Full-text available
Robust regression estimation is crucial in addressing the influence of outliers and model misspecification in statistical modelling. This study proposes a Doubly Weighted M-Estimation (DWME) approach, integrating an adaptive weighting scheme with Generalized Jackknife Resampling (GJR) to enhance efficiency and robustness in parameter estimation. The DWME method incorporates case-specific and parameter-specific weighting functions, ensuring resistance against leverage points and heavy-tailed distributions. By leveraging GJR, the proposed estimator achieves reduced bias and variance while maintaining asymptotic efficiency under mild regularity conditions. Empirical analyses demonstrate that DWME outperforms traditional M-estimators, Least Absolute Deviation (LAD), and Huber regression in terms of robustness, efficiency, and predictive accuracy. The proposed methodology offers a reliable alternative for robust estimation in heteroscedastic, non-normal, and contaminated datasets, making it particularly valuable for econometric and high-dimensional applications.
... Here, y was the dependent variable being predicted and x was the independent predictor, based on the squared residuals of the baseline distance at a specific date; SLR was the slope of the regression line, measured in m/year; α is the constant coefficient of x; and e is the prediction error. The LMS is a reliable regression estimation method that works through an adaptive process computing all potential slope values within a constrained range of angles (Rousseeuw & Leroy, 1987). For a given set of dates, the point where the transects and the bankline overlap determines the slope angle (x values). ...
... Several researchers have examined the robust least median of squares (LMS) method, which fits the hyperplane that minimizes the median of the squared residuals [22]. Although the LMS estimator has been the subject of most publications on robust estimation in the field of linear models, Rousseeuw and Leroy [23] noted that LMS is not the optimal option due to its statistical properties. They asserted that the least trimmed squares (LTS) estimator is the better alternative because both LTS and LMS have the same breakdown point, approximately 50%, but the objective function of LTS is smoother than that of LMS. ...
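For reference, the two objective functions compared in the excerpt are, in the standard notation of Rousseeuw (1984) and Rousseeuw and Leroy (1987),

\hat{\beta}_{\mathrm{LMS}} = \arg\min_{\beta} \operatorname{med}_{i} r_i^2(\beta), \qquad \hat{\beta}_{\mathrm{LTS}} = \arg\min_{\beta} \sum_{i=1}^{h} r_{(i)}^2(\beta),

where r_i(\beta) = y_i - x_i^{\top}\beta are the residuals, r_{(1)}^2 \le \dots \le r_{(n)}^2 are the ordered squared residuals, and h \approx n/2 (commonly h = \lfloor (n+p+1)/2 \rfloor), which gives both estimators a breakdown point of roughly 50%.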
Article
Full-text available
Regression analysis frequently encounters two issues: multicollinearity among the explanatory variables, and the existence of outliers in the data set. Multicollinearity in the semiparametric regression model causes the variance of the ordinary least-squares estimator to become inflated. Furthermore, the existence of multicollinearity may lead to wide confidence intervals for the individual parameters and even produce estimates with wrong signs. On the other hand, as is well known, the ordinary least-squares estimator is extremely sensitive to outliers, and it may be completely corrupted by the existence of even a single outlier in the data. Due to such drawbacks of the least-squares method, a robust Liu estimator based on the least trimmed squares (LTS) method for the regression parameters is introduced under some linear restrictions on the whole parameter space of the linear part in a semiparametric model. Considering that the covariance matrix of the error terms is usually unknown in practice, the feasible forms of the proposed estimators are substituted, and their asymptotic distributional properties are derived. Moreover, necessary and sufficient conditions for the superiority of the Liu-type estimators over their counterparts for choosing the biasing Liu parameter d are extracted. The performance of the feasible type of robust Liu estimators is compared with the classical ones in constrained semiparametric regression models using extensive Monte-Carlo simulation experiments and a real data example.
... The conclusion of Sorokina et al. (2013) is that previous event studies do not adequately address this problem. Therefore, following the methodology of Hundt et al. (2017), this study also applies robust regression according to Rousseeuw (1984), Rousseeuw and Leroy (1987) and Mount et al. (2014). To calculate the regression coefficients, a least trimmed squares (LTS) regression is used instead of the ordinary least squares (OLS) regression used in standard event study approaches, i.e.: ...
Article
Full-text available
In order to be attractive to the capital market, companies are under increasing pressure to incorporate renewable energy (RE) targets into their business strategies. One of the most credible ways to demonstrate the renewable origin of electricity and to achieve a positive signalling effect is to enter into a power purchase agreement (PPA). A special form of this contract, the virtual PPA (VPPA), acts as a financial hedge, allowing the industrial buyer to achieve both a decarbonisation effect and a risk-minimising hedge. As the effect of a VPPA on the shareholder wealth of the electricity buyer has not yet been investigated in the literature, the purpose of this study is to fill this research gap. To this end, we analyse the abnormal stock returns of 89 VPPA announcements using a modified event study based on the Fama-French five-factor model (FFM5). Our results show significant positive abnormal returns around the announcement of a VPPA deal. This confirms the expectation that VPPAs are wealth-creating.
... The bisquare weights are obtained by using Eq. (7) (Rousseeuw and Leroy, 1987). ...
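Eq. (7) itself is not shown in the excerpt; the textbook Tukey bisquare weight function it most likely refers to is (as a sketch)

w(u) = \left[ 1 - (u/c)^2 \right]^2 \ \text{for } |u| \le c, \qquad w(u) = 0 \ \text{for } |u| > c,

where u is a standardized residual and the tuning constant c = 4.685 gives approximately 95% efficiency at the normal model.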
Article
Full-text available
In recent years, a shift in basic assumptions has occurred: organizations now view risk management (RM) from a holistic perspective instead of a silo-based point of view. The comprehensive approach to RM that is developed from this holistic perspective and adopted by companies is called enterprise risk management (ERM). From a general point of view, it is expected that companies will achieve sustainable operations, improved performance and value, and controlled risks by using the ERM concept. In this article, the relations of ERM adoption with firm performance, firm value, and risks were investigated through ordinal variables, such as the implementation level of ERM and the sophistication level of the organizational structure of ERM, in regression models based on panel data (PD). Empirical evidence based on a sample of ten banking companies listed on the Borsa Istanbul (BIST) Banks index (XBANK) for the period from 2019 to 2022 confirms the above basic argument for firm performance, firm value, and the insolvency risk of banks in general terms. In addition, the prediction accuracy of the PD models is calculated for the performance, value, and risk indicators of banks, and the partial least squares regression model is proposed as an alternative prediction model from data mining.
Article
Full-text available
Accurate elemental analysis is a critical requirement for mineral exploration, particularly in regions like Iran, where the mining sector has experienced a substantial increase in exploration activities over the past decade. Inductively Coupled Plasma Mass Spectrometry (ICP-MS) methods have long been regarded as the gold standard due to their high sensitivity and precision; however, their widespread adoption is often limited by high operational costs and complex sample preparation requirements. As Iran’s mining industry shifts toward more efficient and sustainable practices, with quantitative studies indicating a significant demand for cost-effective analytical solutions, there is a pressing need for alternative approaches that maintain the analytical strengths of ICP-MS while mitigating its limitations. This demand has paved the way for integrating advanced deep learning techniques with conventional methods, offering promising new avenues for cost-effective and rapid geochemical analysis. This study proposes an advanced deep learning-based approach for predicting critical elements, such as arsenic (As), lithium (Li), antimony (Sb), and vanadium (V), in the Gohar Zamin iron ore mining area in southwest Kerman, Iran. Using X-ray fluorescence (XRF) geochemical data as input, three deep learning models were developed and compared: Convolutional Neural Networks (CNN), Gated Recurrent Units (GRU), and Spatial Attention Networks (SAN). Among the models tested, the CNN demonstrated superior performance in predicting the concentrations of the target elements, achieving the lowest error rates and effectively capturing complex spatial patterns in the geochemical data. The model’s ability to extract meaningful relationships from multidimensional data allowed it to outperform both the GRU and SAN models, particularly across low and high concentration ranges. Moreover, the results from CNN-based 3D modeling revealed significant potential for mineral exploration. This research introduces a novel AI-driven framework for utilizing low-cost XRF data in mineral prediction, reducing reliance on expensive analytical techniques while enhancing decision-making in mining operations. The proposed approach offers an efficient and environmentally friendly alternative for geochemical data analysis, contributing to more sustainable mineral exploration practices.
Article
Full-text available
Identifying influential points in linear regression is vital for ensuring the validity of inferential conclusions. Traditional diagnostic measures, such as DFFITs (DFT), Cook’s D (CKD), COVRATIO (CVR), Hadi’s measure (HAD), Pena’s statistic (PEN), and Atkinson statistic (ATK), are typically based on the Ordinary Least Squares (OLS) estimator, which assumes no violation of the basic linear regression assumptions. This study develops new diagnostic measures for these statistics using the New Two-Parameter (NTP) estimator to address multicollinearity. The study evaluated the performance of these measures through simulation studies with 1,000 replications under varying levels of multicollinearity, error variances, outlier percentages and magnitudes, and sample sizes. Results revealed that the newly proposed CVR measure with the NTP estimator achieved 100% detection of influential points and recorded the highest detection counts, outperforming all other measures. While traditional measures like CKD, PEN, and ATK based on OLS were effective only for small sample sizes in the absence of multicollinearity, their performance declined when multicollinearity was present. Conversely, CVRNTP consistently demonstrated superior performance when multicollinearity was mitigated. These findings suggest that the proposed CVRNTP is a robust tool for identifying influential points in datasets affected by multicollinearity. Real-life data applications further validated their performances.
Article
Full-text available
The utilization of 2D Light Detection and Ranging (LiDAR) measurements does not always provide the precision needed to accurately determine the motion range or recalibrate the position of Autonomous Guided Vehicles (AGVs). Consequently, it is essential to employ filtering and calibration methods to enhance the precision and accuracy of measurements derived from 2D LiDAR. The article proposes a multi-sectional calibration (MSC) method incorporating a median filtration (MF) phase to enhance the measurement accuracy of 2D LiDAR. The investigation focused on identifying the optimal window width for the MF module among a selection of 2D LiDAR systems. The division of the complete measurement range into sections resulted in a significant enhancement in sensitivity to deviations in measurements. The efficacy of the proposed method is evidenced by its ability to enhance accuracy in distance measurements by up to 89% for the optimal window width. The experiments indicated that the proposed method has a significant impact on the precision and accuracy of distance measurements for 2D LiDAR systems.
Article
Full-text available
The Least Trimmed Squares (LTS) regression estimator is known to be very robust to the presence of “outliers”. It is based on a clear and intuitive idea: in a sample of size n, it searches for the h-subsample of observations with the smallest sum of squared residuals. The remaining n−h observations are declared “outliers”. Fast algorithms for its computation exist. Nevertheless, the existing asymptotic theory for LTS, based on the traditional ε-contamination model, shows that the asymptotic behavior of both the regression and scale estimators depends on nuisance parameters. Using a recently proposed new model, in which the LTS estimator is maximum likelihood, we show that the asymptotic behavior of both the LTS regression and scale estimators is free of nuisance parameters. Thus, with the new model as a benchmark, standard inference procedures apply while allowing a broad range of contamination.
Article
Full-text available
Metal organic frameworks (MOFs) have demonstrated remarkable performance in hydrogen storage due to their unique properties, such as high gravimetric densities, rapid kinetics, and reversibility. This paper models the hydrogen storage capacity of MOFs utilizing numerous machine learning approaches, such as the Deep Neural Network (DNN), Convolutional Neural Network (CNN), and Gaussian Process Regression (GPR). Here, Radial Basis Function (RBF) and Rational Quadratic (RQ) kernel functions were employed in GPR. To this end, a comprehensive databank including 1729 experimental data points was compiled from various literature surveys. Temperature, pressure, surface area, and pore volume were utilized as input variables in this databank. The results indicate that the GPR-RQ intelligent model achieved superior performance, delivering highly accurate predictions with a mean absolute error (MAE) of 0.0036, Root Mean Square Error (RMSE) of 0.0247, and a correlation coefficient (R²) of 0.9998. In terms of RMSE values, the models GPR-RQ, GPR-RBF, CNN, and DNN were ranked in order of their performance, respectively. Moreover, by calculating the Pearson correlation coefficient, the sensitivity analysis showed that pore volume and surface area emerged as the most influential factors in hydrogen storage, boasting absolute relevancy factors of 0.45 and 0.47, respectively. Lastly, outlier detection assessment employing the leverage approach revealed that almost 98% of the data points utilized in the modeling are reliable and fall within the valid range. This study contributes to understanding how input features collectively influence the estimation of the hydrogen storage capacity of MOFs.
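A minimal scikit-learn sketch of the GPR-RQ setup described in the abstract (illustrative only; the placeholder data, kernel hyperparameters, and preprocessing are assumptions, not the authors' exact configuration):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RationalQuadratic, WhiteKernel
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Inputs: temperature, pressure, surface area, pore volume; target: H2 uptake
X = np.random.rand(200, 4)     # placeholder for the experimental databank
y = np.random.rand(200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)

kernel = RationalQuadratic(length_scale=1.0, alpha=1.0) + WhiteKernel(1e-3)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gpr.fit(scaler.transform(X_tr), y_tr)

y_pred, y_std = gpr.predict(scaler.transform(X_te), return_std=True)
print("RMSE:", np.sqrt(np.mean((y_te - y_pred) ** 2)))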
Article
Full-text available
Routine least squares regression analyses may sometimes miss important aspects of data. To exemplify this point we analyse a set of 1171 observations from a questionnaire intended to illuminate the relationship between customer loyalty and perceptions of such factors as price and community outreach. Our analysis makes much use of graphics and data monitoring to provide a paradigmatic example of the use of modern robust statistical tools based on graphical interaction with data. We start with regression. We perform such an analysis and find significant regression on all factors. However, a variety of plots show that there are some unexplained features, which are not eliminated by response transformation. Accordingly, we turn to robust analyses, intended to give answers unaffected by the presence of data contamination. A robust analysis using a non-parametric model leads to the increased significance of transformations of the explanatory variables. These transformations provide improved insight into consumer behaviour. We provide suggestions for a structured approach to modern robust regression and give links to the software used for our data analyses.
Article
Full-text available
Positive effects of plant diversity on productivity have been globally demonstrated and explained by two main effects: complementarity effects and selection effects [1–4]. However, plant diversity experiments have shown substantial variation in these effects, with driving factors poorly understood [4–6]. On the basis of a meta-analysis of 452 experiments across the globe, we show that productivity increases on average by 15.2% from monocultures to species mixtures with an average species richness of 2.6; net biodiversity effects are stronger in grassland and forest experiments and weaker in container, cropland and aquatic ecosystems. Of the net biodiversity effects, complementarity effects and selection effects contribute 65.6% and 34.4%, respectively. Complementarity effects increase with phylogenetic diversity, the mixing of nitrogen-fixing and non-nitrogen-fixing species and the functional diversity of leaf nitrogen contents, which indicate the key roles of niche partitioning, biotic feedback and abiotic facilitation in complementarity effects. More positive selection effects occur with higher species biomass inequality in their monocultures. Complementarity effects increase over time, whereas selection effects decrease over time, and they remain consistent across global variations in climates. Our results provide key insights into understanding global variations in plant diversity effects on productivity and underscore the importance of integrating both complementarity and selection effects into strategies for biodiversity conservation and ecological restoration.
Article
In the last few years, the number of R packages implementing different robust statistical methods has increased substantially. There are now numerous packages for computing robust multivariate location and scatter, robust multivariate analysis like principal components and discriminant analysis, robust linear models, and other algorithms dedicated to coping with outliers and other irregularities in the data. This abundance of package options may be overwhelming for both beginners and more experienced R users. Here we provide an overview of the 25 most important R packages for different tasks. As metrics for the importance of each package, we consider its maturity and history, the number of total and average monthly downloads from CRAN (The Comprehensive R Archive Network), and the number of reverse dependencies. Then we briefly describe what each of these packages does. After that, we elaborate on the several above-mentioned topics of robust statistics, presenting the methodology and the implementation in R and illustrating the application on real data examples. Particular attention is paid to the robust methods and algorithms suitable for high-dimensional data. The code for all examples is accessible on the GitHub repository https://github.com/valentint/robust‐R‐ecosystem‐WIREs .
Article
Entropy, a key measure of chaos or diversity, has recently found intriguing applications in the realm of management science. Traditional entropy-based approaches for data analysis, however, prove inadequate when dealing with high-dimensional datasets. In this paper, a novel uncertainty coefficient based on entropy is proposed for categorical data, together with a pattern discovery method suitable for management applications. Furthermore, we present a robust fractal-inspired technique for estimating covariance matrices in multivariate data. The efficacy of this method is thoroughly examined using three real datasets with economic relevance. The results demonstrate the superior performance of our approach, even in scenarios involving a limited number of variables. This suggests that managerial decision-making processes should reflect the inherent fractal structure present in the given multivariate data. The work emphasizes the importance of considering fractal characteristics in managerial decision-making, thereby advancing the applicability and effectiveness of entropy-based methods in management science.
Article
Full-text available
Background: This research project, examining the moderating role of the Scout Movement in supporting mental health through the shaping of personal competence, is based on Bandura’s conception of social development (social cognitive theory) in terms of generating a sense of general self-efficacy. Methods: This research examined the moderating value of Scouting with regard to the connection between self-esteem, a sense of efficacy, and styles of coping with stress in a group of 683 volunteers. Results: The results suggest that Scouting is a moderator of the relationship between the intensity of an emotion-focused stress coping style and a sense of self-efficacy: being a Scout intensifies the blocking effect of self-esteem on emotions in stressful situations, which can positively influence emotion regulation. Conclusions: The features described suggest the need to research Scouting as a non-formal education strategy to support the development of young people’s mental health in different theoretical and methodological contexts. This work provides conclusions regarding the role of Scouting as a moderator in coping with stress and, consequently, in ensuring good mental health. It details the knowledge pertaining to the specific mechanisms through which Scouting can influence the development of emotional regulation and adaptive responses to stressful situations.
Article
Full-text available
Imbalanced data significantly affects the performance of standard classification models. Data-level approaches primarily use oversampling methods, such as the synthetic minority oversampling technique (SMOTE), to address this problem. However, because methods such as SMOTE generate instances via linear interpolation, the synthetic data space may appear similar to a star or tree. Thus, some methods apply Gaussian weights to linear interpolation to address this issue. In this study, we propose a Gaussian-based minority oversampling with adaptive outlier filtering and class overlap weighting (GMO-AC) for imbalanced datasets. Unlike existing oversampling techniques, our method employs a Gaussian mixture model (GMM) to approximate the distribution of the minority class and generate new instances that follow this distribution. As outliers can affect the distribution approximation, GMO-AC identifies outliers by calculating the Mahalanobis distance for each instance and the covariance determinant. This process uses segmented linear regression to assess whether an instance falls outside the expected distribution. In addition, we defined the degree of class overlap to generate additional instances in the overlapping areas to improve the classification of the minority class in those areas. Experiments were conducted on synthetic and benchmark datasets, comparing the performance of GMO-AC with that of other methods, such as SMOTE. Experimental results show that GMO-AC yielded better AUROC and G-mean.
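The following is not the authors' GMO-AC implementation, only a minimal sketch of the two ideas it combines: filtering minority-class outliers by Mahalanobis distance and then sampling synthetic minority instances from a fitted Gaussian mixture. The chi-square cutoff and component count are assumptions (GMO-AC itself uses segmented linear regression and the covariance determinant for the filtering step):

import numpy as np
from scipy.stats import chi2
from sklearn.mixture import GaussianMixture

def gmm_oversample(X_min, n_new, n_components=3, alpha=0.975, random_state=0):
    """Drop extreme minority points by Mahalanobis distance, then draw
    synthetic minority samples from a Gaussian mixture fitted to the rest."""
    X_min = np.asarray(X_min, dtype=float)
    mu = X_min.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X_min, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", X_min - mu, cov_inv, X_min - mu)
    keep = d2 <= chi2.ppf(alpha, df=X_min.shape[1])

    gmm = GaussianMixture(n_components=n_components,
                          random_state=random_state).fit(X_min[keep])
    X_new, _ = gmm.sample(n_new)
    return X_new

# Usage: augment a 2-D minority class of 60 points with 40 synthetic ones
X_synthetic = gmm_oversample(np.random.randn(60, 2), n_new=40)
print(X_synthetic.shape)   # (40, 2)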
Chapter
Artificial intelligence is nowadays equipped with a plethora of tools for obtaining relevant knowledge from (possibly big) data in a variety of tasks. While the commonly used machine learning methods applicable within artificial intelligence tools can be characterized as black boxes, practical applications often require an understanding of why a particular conclusion (e.g. a decision) was made, which arouses interest in explainable machine learning. This chapter is devoted to variable selection methods for finding the most relevant variables for the given task. If statistically robust variable selection methods are exploited, the harmful influence of data contamination by outlying values on the results is typically eliminated or downweighted. Principles of prior, intrinsic, and posterior variable selection approaches are recalled and compared on three real datasets related to gene expression measurements, neighborhood crime rate, and tourism infrastructure. These examples reveal robust approaches to machine learning that outperform non-robust ones if the data are contaminated by outliers.
Article
Full-text available
In this study, we introduce an innovative methodology for anomaly detection of curves, applicable to both multivariate and multi-argument functions. This approach distinguishes itself from prior methods by its capability to identify outliers within clustered functional data sets. We achieve this by extending the recent AA + kNN technique, originally designed for multivariate analysis, to functional data contexts. Our method demonstrates superior performance through a comprehensive comparative analysis against twelve state-of-the-art techniques, encompassing simulated scenarios with either a single functional cluster or multiple clusters. Additionally, we substantiate the effectiveness of our approach through its application in three distinct computer vision tasks and a signal processing problem. To facilitate transparency and replication of our results, we provide access to both the code and the datasets used in this research.
Chapter
Regression estimates functional dependencies between features. Linear regression models can be efficiently computed from covariances but are restricted to linear dependencies. Substitution makes it possible to identify specific types of nonlinear dependencies by linear regression. Robust regression finds models that are robust against outliers. A popular class of nonlinear regression methods is universal approximators. We present two well-known examples of universal approximators from the field of artificial neural networks: the multilayer perceptron and radial basis function networks. Universal approximators can realize arbitrarily small training errors, but cross-validation is required to find models with low validation errors that generalize well to other data sets. Feature selection allows us to include only relevant features in regression models, leading to more accurate models.
Article
Full-text available
The least trimmed squares (LTS) estimator is popular in the location, regression, machine learning, and AI literature. Although the empirical version of the LTS has been studied repeatedly in the literature, the population version of the LTS has never been introduced and studied. The lack of a population version hinders the study of the large-sample properties of the LTS using empirical process theory. Novel properties of the objective function in both the empirical and population settings of the LTS, along with other properties, are established for the first time in this article. The primary properties of the objective function facilitate the establishment of other original results, including the influence function and Fisher consistency. Strong consistency is established with the help of a generalized Glivenko–Cantelli theorem over a class of functions for the first time. Differentiability and stochastic equicontinuity promote the establishment of asymptotic normality with a concise and novel approach.
Article
Full-text available
In this contribution, we introduce a novel methodology for outlier identification in GNSS networks. The new method consists of a multilayer perceptron neural network-based meta-classifier. Meta-classifiers are classification models that use machine learning algorithms to integrate multiple base classifiers. A statistical testing procedure for outlier identification can be interpreted as a classifier. An observation is classified as an outlier or not based on the decision rule of the testing procedure. Here, we utilize the decision response of an observation being flagged as an outlier or not from the following procedures: iterative data-snooping (IDS), the minimum L1-norm (MinL1), Sequential Likelihood Ratio Tests for Multiple Outliers, and the minimum L∞-norm (MinL∞). The binary classification of whether or not the observation is an outlier from these procedures, together with their corresponding test statistics, was employed as attributes to construct our meta-classifier. The experiments were conducted for GNSS networks with low (r<0.5), medium (r=0.5) and high redundancy (r>0.5). Results show that the proposed approach via meta-classification performs better than all base classifiers in low-redundancy GNSS networks by a large margin. Its mean success rate in outlier identification was always more than 11 percentage points higher than that of the best base classifier (MinL1, in this case). Moreover, it presented higher performance in outlier identification with fewer collected baselines, which reduces the financial cost of the GNSS network, a key factor in surveying engineering projects. For medium- and high-redundancy networks, there was no significant improvement in the meta-classification performance over the best base classifier (IDS, in this case). As the results for low-redundancy GNSS networks seem very promising, several potential future work suggestions were also made considering the reality of several countries.
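A schematic sketch of the meta-classification idea described above, using generic stacking rather than the authors' exact architecture: the binary flags and test statistics produced by the base outlier-identification procedures become the feature vector of an MLP (the placeholder data and network size below are assumptions):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 500
# Columns 0-3: binary flags from IDS, MinL1, SLRT and MinLinf;
# columns 4-7: their test statistics (placeholder values here).
flags = rng.integers(0, 2, size=(n, 4)).astype(float)
stats = rng.normal(size=(n, 4))
X = np.hstack([flags, stats])
y = rng.integers(0, 2, size=n)     # ground-truth outlier label per observation

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
meta = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)
meta.fit(X_tr, y_tr)
print("meta-classifier accuracy:", meta.score(X_te, y_te))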
Article
In order to handle the overwhelming effects of the hydrogen sulphide (H₂S) removed from natural gas and industrial waste gases on the environment, H₂S can be converted to elemental sulphur. Among the available processes for sulphur recovery, the most widely employed is the modified Claus process. In this work, first, the least squares version of the support vector machine (LS‐SVM) approach is utilized for determining the properties of sulphur, including the heat of vaporization, heat of condensation (S₆, S₈), heat of dissociation (S₆, S₈), and heat capacity of equilibrium sulphur vapours as a function of temperature. An illustrative example is given to show the usefulness of the presented computer‐based models with two parameters for the design and operation of the Claus sulphur recovery unit (SRU). According to the error analysis results, the values predicted by the proposed intelligent models are in excellent agreement with the data reported in the literature for the aforementioned sulphur properties, where the coefficient of determination (R²) is higher than 0.99 for all developed models. The average absolute relative deviation percent (%AARD) is less than 1.3 while predicting the heat capacity of equilibrium sulphur vapours. The other proposed models’ predictions show less than 0.2% AARD from the target values. In addition, a mathematical algorithm on the basis of the leverage approach is proposed to define the domain of applicability of the developed LS‐SVM models. It was found that the presented models are statistically valid and that the data points employed for developing the models are within the range of their applicability.
Article
Recently, point clouds have been widely used in computer vision, whereas their collection is time-consuming and expensive. As such, point cloud datasets are the valuable intellectual property of their owners and deserve protection. To detect and prevent unauthorized use of these datasets, especially for commercial or open-sourced ones that cannot be sold again or used commercially without permission, we intend to identify whether a suspicious third-party model is trained on our protected dataset under the black-box setting. We achieve this goal by designing a scalable clean-label backdoor-based dataset watermark for point clouds that ensures both effectiveness and stealthiness. Unlike existing clean-label watermark schemes, which were susceptible to the number of categories, our method can watermark samples from all classes instead of only from the target one. Accordingly, it can still preserve high effectiveness even on large-scale datasets with many classes. Specifically, we perturb selected point clouds with non-target categories in both shape-wise and point-wise manners before inserting trigger patterns without changing their labels. The features of perturbed samples are similar to those of benign samples from the target class. As such, models trained on the watermarked dataset will have a distinctive yet stealthy backdoor behavior, i.e., misclassifying samples from the target class whenever triggers appear, since the trained DNNs will treat the inserted trigger pattern as a signal to deny predicting the target label. We also design a hypothesis-test-guided dataset ownership verification based on the proposed watermark. Extensive experiments on benchmark datasets are conducted, verifying the effectiveness of our method and its resistance to potential removal methods.
Article
Full-text available
This study examines the bias in weighted least absolute deviation (WL1) estimation within the context of stationary first-order bifurcating autoregressive (BAR(1)) models, which are frequently employed to analyze binary tree-like data, including applications in cell lineage studies. Initial findings indicate that WL1 estimators can exhibit substantial and problematic biases, especially with small to moderate sample sizes. The autoregressive parameter and the correlation between model errors influence the magnitude and direction of the bias. To address this issue, we propose two bootstrap-based bias-corrected estimators for the WL1 estimator. We conduct extensive simulations to assess the performance of these bias-corrected estimators. Our empirical findings demonstrate that these estimators effectively reduce the bias inherent in WL1 estimators, with their performance being particularly pronounced at the extremes of the autoregressive parameter range.
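The abstract does not reproduce the correction formulas; the generic bootstrap bias correction that such estimators build on is, as a sketch (the paper proposes two specific variants for the WL1/BAR(1) setting),

\widehat{\mathrm{bias}} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{*}_{b} - \hat{\theta}, \qquad \hat{\theta}_{\mathrm{BC}} = \hat{\theta} - \widehat{\mathrm{bias}} = 2\hat{\theta} - \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{*}_{b},

where \hat{\theta} is the WL1 estimate from the original sample and \hat{\theta}^{*}_{b}, b = 1, \dots, B, are the estimates recomputed on bootstrap resamples.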
Article
The widespread use of fossil fuels drives greenhouse gas emissions, prompting the need for cleaner energy alternatives like hydrogen. Underground hydrogen storage (UHS) is a promising solution, but measuring the hydrogen (H₂) solubility in brine is complex and costly. Machine learning can provide accurate and reliable predictions of H₂ solubility by analyzing diverse input variables, surpassing traditional methods. This advancement is crucial for improving UHS, making it a viable component of the sustainable energy infrastructure. Given its importance, this study utilized convolutional neural network (CNN) and long short-term memory (LSTM) deep learning algorithms in combination with growth optimization (GO) and gray wolf optimization (GWO) algorithms to predict H₂ solubility. A total of 1078 data points were collected from laboratory results, including the variables temperature (T), pressure (P), salinity (S), and salt type (ST). After removing 97 data points, which were identified as outliers and duplicates, the remaining 981 data points were divided into training and testing sets using the best separation ratio selected based on sensitivity analysis. Standalone and hybrid forms of deep learning algorithms were then applied to the training data to develop predictive models with optimized control parameters for both deep learning and optimization algorithms. Among the developed models, CNN-GO has the lowest root-mean-square error (RMSE, train: 0.00006 mole fraction and test: 0.00021 mole fraction) compared to other hybrid and standalone deep learning models. The application of scoring and regression error characteristic (REC) curve analysis showed that this model generated the best prediction performance. Shapley additive explanation analysis indicated that P was the most important factor influencing H₂ solubility, followed by S, T, and ST, in that order. Partial dependency analysis for the CNN-GO model revealed its ability to capture complex nonlinear relationships between input features and the target variable.
Article
With the expanding scale of current industries, monitoring systems centered around Key Performance Indicators (KPIs) play an increasingly crucial role. KPI anomaly detection can monitor potential risks based on KPI data and has garnered widespread attention due to its rapid responsiveness and adaptability to dynamic changes. Considering the absence of labels and the high cost of manual annotation of KPI data, self-supervised approaches have been proposed. Among them, mask modeling methods have drawn great attention, as they can learn the intrinsic distribution of data without relying on prior assumptions. However, conventional mask modeling often overlooks the relationships between unsynchronized variables, treating them with equal importance and inducing inaccurate detection results. To address this, this paper proposes a Dual Masked modeling Approach combined with Similarity Aggregation, named DMASA. Starting from a self-supervised approach based on mask modeling, DMASA incorporates spectral residual techniques to explore inter-variable dependencies and aggregates information from similar data to eliminate interference from irrelevant variables in anomaly detection. Extensive experiments on eight datasets and state-of-the-art results demonstrate the effectiveness of our approach.
Article
Accurate and frequent monitoring of the solid content (SC) of drilling fluids is necessary to avoid the issues associated with improper solid particle concentrations. Conventional methods for determining SC, such as retort analysis, lack immediacy and are labor-intensive. This study applies machine learning (ML) techniques to develop SC predictive models using readily available data: Marsh funnel viscosity and fluid density. A dataset of 1290 data records was collected from 17 wells drilled in two oil fields located in southwest Iran. Four ML models were developed to predict SC from the compiled dataset: least squares support vector machine (LSSVM), multilayered perceptron neural network, extreme learning machine, and generalized regression neural network. Multiple assessment techniques were applied to carefully evaluate the models’ prediction performances and select the best-performing SC prediction model. The LSSVM model generated the smallest errors, exhibiting the lowest root-mean-square error values for the training (1.80%) and testing (1.84%) subsets. The narrowest confidence interval, 0.18, achieved by the LSSVM model confirmed its reliability for SC prediction. Leverage analysis revealed minimal influence of outlier data on the LSSVM model's SC prediction performance. The trained LSSVM model was further validated on unseen data from another well drilled in one of the studied oil fields, demonstrating the model’s generalizability for providing credible close-to-real-time SC predictions in the studied fields.
Article
Full-text available
Background: The heavily right-skewed data seen in recently reported Alzheimer’s disease (AD) clinical trials influenced treatment contrasts when data were analyzed via the typical mixed-effects model for repeated measures (MMRM). Methods: An MMRM analysis similar to what is commonly used in AD clinical trials was compared with robust regression (RR) and the non-parametric Hodges–Lehmann estimator (HL). Results: Results in simulated data patterned after AD trials showed that imbalance across treatment arms in the number of patients in the extreme right tail (those with rapid disease progression) frequently occurred. Each analysis method controlled Type I error at or below the nominal level. The RR analysis yielded smaller standard errors and more power than MMRM and HL. In data sets with appreciable imbalance in the number of rapidly progressing patients, MMRM results favored the treatment arm with fewer rapid progressors. Results from HL showed the same trend but to a lesser degree. Robust regression yielded similar results regardless of the ratio of rapid progressors. Conclusions: Although more research is needed over a wider range of scenarios, it should not be assumed that MMRM is the optimal approach for trials in early Alzheimer’s disease.
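For reference, the two-sample Hodges–Lehmann estimator of a treatment contrast mentioned above is the median of all pairwise between-group differences (standard definition; the trial-specific implementation may differ):

\hat{\Delta}_{\mathrm{HL}} = \operatorname{med}_{i,j} \left( Y_j - X_i \right), \qquad i = 1, \dots, m, \; j = 1, \dots, n,

where the X_i are outcomes (for example, change from baseline) in one arm and the Y_j are outcomes in the other.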
Article
Full-text available
The Pearson product-moment correlation coefficient is the most commonly used measure to assess the strength and direction of the linear relationship between a pair of variables. However, it is extremely sensitive to outliers in the data, so it may be necessary to use its robust counterparts. This paper considers a correlation coefficient based on the L1-norm that has recently been proposed in the literature. Its properties are investigated for the first time and an original computational algorithm is presented, since the index cannot be computed directly but requires an iterative approach. Finally, some illustrative examples are discussed and a comparison with conventional measures is shown.
Article
Maximum correntropy criterion (MCC) is a robust and powerful technique for handling heavy-tailed non-Gaussian noise, with many applications in the fields of vision, signal processing, machine learning, etc. In this paper, we introduce several contributions to the MCC and propose an augmented MCC (AMCC), which raises the robustness of classic MCC variants for robust fitting to an unprecedented level. Our first contribution is to present an accurate bandwidth estimation algorithm based on probability density function (PDF) matching, which solves the instability problem of Silverman's rule. Our second contribution is to introduce the idea of graduated non-convexity (GNC) and a worst-rejection strategy into MCC, which compensates for the sensitivity of MCC to high outlier ratios. Our third contribution is to provide a definition of a local distribution measure (LDM) to evaluate the quality of inliers, which makes the MCC no longer limited to random outliers but generally suitable for both random and clustered outliers. Our fourth contribution is to show the generalizability of the proposed AMCC by providing eight application examples in geometry perception and performing comprehensive evaluations on five of them. Our experiments demonstrate that (i) AMCC is empirically robust to 80%–90% of random outliers across applications, which is much better than Cauchy M-estimation, MCC, and GNC-GM; (ii) AMCC achieves excellent performance on clustered outliers, with a success rate 60–70 percentage points higher than that of the second-ranked method at 80% outliers; (iii) AMCC can run in real time, being 10–100 times faster than RANSAC-type methods in low-dimensional estimation problems with high outlier ratios. This gap will increase exponentially with the model dimension. Our source code is available at https://github.com/LJY-WHU/AMCC .
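For orientation, the classic MCC objective that AMCC builds on (not the augmented criterion itself) maximizes the Gaussian-kernel correntropy of the residuals:

\hat{\theta}_{\mathrm{MCC}} = \arg\max_{\theta} \frac{1}{N} \sum_{i=1}^{N} \exp\!\left( -\frac{e_i(\theta)^2}{2\sigma^2} \right),

where e_i(\theta) are the model residuals and \sigma is the kernel bandwidth, whose estimation the paper addresses via PDF matching instead of Silverman's rule.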
Article
The major advantage of applying classical least-squares to multi-parameter regression is that the most efficient unbiased estimates of the parameters can be obtained when the observations come from a normal population. These estimates, however, may lose their reliability and efficiency when the normal distribution is contaminated by gross errors. To address this deficiency of traditional least-squares, robust estimators based on two "contaminated" normal distribution models are proposed in this paper. The efficiency and reliability of these robust estimators are then evaluated when the distribution of the contaminated part is unknown. Comparisons between the robust and classical estimators for different types of data are also made. Finally, a numerical example is presented to illustrate how to apply the robust estimators to real data.
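The "contaminated" normal model referred to above is commonly written as Huber's gross-error mixture (a sketch; the paper's two specific models are not detailed in the abstract):

F_{\varepsilon}(x) = (1 - \varepsilon)\, \Phi\!\left( \frac{x - \mu}{\sigma} \right) + \varepsilon\, H(x), \qquad 0 \le \varepsilon < 1,

where \Phi is the standard normal distribution function, H is an arbitrary (typically heavier-tailed) contaminating distribution, and \varepsilon is the contamination fraction.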
Article
Robust estimation with independent observations has been investigated in the field of geodesy. However, robust estimation in the dependent situation has not been widely studied. Robust estimation models for correlated observations, based on the principles of M-estimation and equivalent weights, are established in this paper. The general linear expressions of the solutions and their corresponding influence functions are derived. Based on iterative calculation procedures, six computation schemes are performed and compared. A new equivalent weight function for correlated observations is proposed.
Article
We present a novel deep hypergraph modeling architecture (called DHM-Net) for feature matching in this paper. Our network focuses on learning reliable correspondences between two sets of initial feature points by establishing a dynamic hypergraph structure that models group-wise relationships and assigns weights to each node. Compared to existing feature matching methods that only consider pair-wise relationships via a simple graph, our dynamic hypergraph is capable of modeling nonlinear higher-order group-wise relationships among correspondences in an interaction-capturing and attention-representation-learning fashion. Specifically, we propose a novel Deep Hypergraph Modeling block, which initializes an overall hypergraph by utilizing neighbor information, and then adopts node-to-hyperedge and hyperedge-to-node strategies to propagate interaction information among correspondences while assigning weights based on hypergraph attention. In addition, we propose a Differentiation Correspondence-Aware Attention mechanism to optimize the hypergraph for promoting representation learning. The proposed mechanism is able to effectively locate the exact position of the object of importance via correspondence-aware encoding and a simple feature gating mechanism to distinguish candidate inliers. In short, we learn a dynamic hypergraph format that embeds deep group-wise interactions to explicitly infer categories of correspondences. To demonstrate the effectiveness of DHM-Net, we perform extensive experiments on both real-world outdoor and indoor datasets. In particular, experimental results show that DHM-Net surpasses the state-of-the-art method by a sizable margin. Our approach obtains an 11.65% improvement under an error threshold of 5° for the relative pose estimation task on the YFCC100M dataset. Code will be released at https://github.com/CSX777/DHM-Net .
Article
For the issue of target tracking in nonlinear and non-stationary heavy-tailed noise systems, this paper proposes a novel Robust Bayesian Recursive Ensemble Kalman Filter (RBREnKF), breaking through the limitations of the EnKF under highly nonlinear and non-Gaussian noise conditions. Initially, to counteract the divergence observed in the EnKF under highly nonlinear conditions, a Bayesian Recursive Update (BRU) method is introduced to further improve the performance of the filter, resulting in a more precise nonlinear approximation. Subsequently, a more robust Gaussian-Generalized Hyperbolic (GGH) distribution is adopted to model the non-stationary heavy-tailed measurement noise, while the Student’s t distribution is used to model the non-Gaussian process noise. The variational Bayesian (VB) method is then applied to solve the joint posterior probability density of the target state, yielding the new Robust Bayesian Recursive Ensemble Kalman Filter. Finally, simulation results in scenarios of both point-target and extended-target tracking, along with a sensitivity analysis of the filter’s parameters, demonstrate the superiority of the proposed algorithm over existing methods, exhibiting more accurate estimation of the target state and significantly improved performance.
Conference Paper
One of the objectives of public procurement is to avoid overpriced contracts; however, identifying such cases is not always trivial. This article presents a methodology for identifying overpricing, focusing on the detection of outlying values and treating the fitting of models that describe the behavior of the unit price as a preliminary step. As an example, items from electronic invoices (Notas Fiscais Eletrônicas, NF-e) for regular gasoline purchased by public institutions of Santa Catarina were used. The results showed that applying the price modeling process before outlier identification techniques is important to avoid misidentifying items as overpriced or failing to identify overpriced items.
Article
Full-text available
The fast-trimmed likelihood estimate is a robust method to estimate the parameters of a mixture regression model. However, this method is vulnerable to the presence of bad leverage points, which are outliers in the direction of independent variables. To address this issue, we propose the weighted fast-trimmed likelihood estimate to mitigate the impact of leverage points. The proposed method applies the weights of the minimum covariance determinant to the rows suspected of containing leverage points. Notably, both real data and simulation studies were considered to determine the efficiency of the proposed method compared to the previous methods. The results reveal that the weighted fast-trimmed estimate method is more robust and reliable than the fast-trimmed likelihood estimate and the expectation–maximization (EM) methods, particularly in cases with small sample sizes.
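A minimal sketch of the MCD-based weighting idea used above, i.e. downweighting rows flagged as leverage points by the minimum covariance determinant (here via scikit-learn's MinCovDet; the chi-square cutoff and hard 0/1 weights are assumptions, not the authors' exact scheme):

import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def mcd_leverage_weights(X, alpha=0.975, random_state=0):
    """Return 0/1 weights: 0 for rows whose robust (MCD-based) Mahalanobis
    distance marks them as leverage points, 1 otherwise."""
    X = np.asarray(X, dtype=float)
    mcd = MinCovDet(random_state=random_state).fit(X)
    d2 = mcd.mahalanobis(X)                    # squared robust distances
    return (d2 <= chi2.ppf(alpha, df=X.shape[1])).astype(float)

# Usage: design matrix with two obvious leverage points appended
X = np.vstack([np.random.randn(100, 2), [[8.0, 8.0], [9.0, -9.0]]])
w = mcd_leverage_weights(X)
print(int(w.sum()), "of", len(w), "rows kept at full weight")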
Article
The severity of climate change and global warming necessitates a transition from traditional hydrocarbon-based energy sources to renewable energy sources. One intrinsic challenge with renewable energy sources is their intermittent nature, which can be addressed by transforming excess energy into hydrogen and storing it safely for future use. To securely store hydrogen underground, comprehensive knowledge of the interactions between hydrogen and the residing fluids is required. Interfacial tension is an important variable influenced by cushion gases such as CO₂ and CH₄. This research developed explicit correlations for approximating the interfacial tension of a hydrogen–brine mixture using two advanced machine-learning techniques: gene expression programming and the group method of data handling. The interfacial tension of the hydrogen–brine mixture was considered to be heavily influenced by temperature, pressure, water salinity, and the average critical temperature of the gas mixture. The results indicated a higher performance of the correlation based on the group method of data handling, showing an average absolute relative error of 4.53%. Subsequently, the Pearson, Spearman, and Kendall methods were used to assess the influence of the individual input variables on the outputs of the correlations. The analysis showed that the temperature and the average critical temperature of the gas mixture had considerable inverse impacts on the estimated interfacial tension values. Finally, the reliability of the gathered databank and the scope of application of the proposed correlations were verified using the leverage approach, with 97.6% of the gathered data falling within the valid range of the Williams plot.