Article

Regression Shrinkage and Selection via the LASSO

Authors: Robert Tibshirani

... Song and Bickel (2011) assume a sparse structure for the lags and apply group sparsity to the columns of the parameter matrix. These studies build on the l1/l2-norm penalties proposed by Hoerl and Kennard (1988), Tibshirani (1996) and Zou and Hastie (2005), also known as ridge regression, the lasso, and naïve elastic net. Yuan and Lin (2006) develop the group sparsity method for regression models. ...
... We investigate the performance of DINAR in a large simulation study with various numbers of observations and different strengths of persistence, which correspond to different magnitudes of the parameters. We compare DINAR against LASSO, SCAD and BGR (Tibshirani, 1996; Fan and Li, 2001; Bańbura et al., 2010). The results of the simulation study show that DINAR is overall more accurate in identifying the influential groups, in particular for high dimensions. ...
... We compare the DINAR model against three alternative methods, namely LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001) and BGR (Bańbura et al., 2010). The former two methods are designed for parameter selection but, unlike DINAR, do not have a layer for network identification. ...
Preprint
Full-text available
Known as an active global virtual money network, the Bitcoin blockchain, with millions of accounts, has played an ever more important role in fund transfers, digital payments and hedging. We propose a method to Detect Influencers in Network AutoRegressive models (DINAR) via sparse-group regularization to detect regions influencing others across borders. For a granular analysis, we examine whether the transaction record size plays a role in the dynamics of cross-border transactions in the network. With its two-layer sparsity, DINAR enables discovering 1) the active regions with influential impact on the global digital money network and 2) whether changes in the transaction record size affect the dynamic evolution of Bitcoin transactions. We illustrate the finite-sample performance of DINAR in intensive simulation studies and investigate its asymptotic properties. In a real data analysis of the Bitcoin blockchain from February 2012 to December 2021, we find that in the earlier years (2012-2016) network effects came, surprisingly, from Africa and South America; in 2017 Asia and Europe dominated, whereas from 2018 onward the effects mainly originated from North America. The effects are robust with respect to different groupings, evaluation periods and choices of regularization parameters.
... Regularised regression is quite sensitive to the selection of the penalty coefficients. [24][25][26] To tune this parameter appropriately, the approach was to estimate performance for different values using cross-validation (CV). 27 Compared with other methods, LASSO was fast and accurate, with the advantage of automatically avoiding overfitting. ...
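A minimal sketch of the penalty-tuning step described in the excerpt above, written in Python with scikit-learn on synthetic data rather than the study's clinical variables; the data set and settings are placeholders:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in for the candidate predictors and outcome.
X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

# LassoCV fits the model over a grid of penalty values (alphas) and keeps the
# value with the best average cross-validated error.
model = LassoCV(cv=5, random_state=0).fit(X, y)
print("selected penalty (alpha):", model.alpha_)
print("non-zero coefficients:", int(np.sum(model.coef_ != 0)))
```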
... The average disease course was 28.98±15.36 months (IQR: 18–36), with the longest being 60 months and the shortest being 6 months. The average score on the caregiver burden scale was 39.28±17.11 ...
Article
Full-text available
Objectives There is significant burden on caregivers of patients with amyotrophic lateral sclerosis (ALS). However, only a few studies have focused on caregivers, and traditional research methods have obvious shortcomings in dealing with multiple influencing factors. This study was designed to explore influencing factors on caregiver burden among ALS patients and their caregivers from a new perspective. Design Cross-sectional study. Setting The data were collected at an affiliated hospital in Guangzhou, Guangdong, China. Participants Fifty-seven pairs of patients with ALS and their caregivers were investigated by standardised questionnaires. Main outcome measures This study primarily assessed the influencing factor of caregiver burden including age, gender, education level, economic status, anxiety, depression, social support, fatigue, sleep quality and stage of disease through data mining. Statistical analysis was performed using SPSS 24.0, and least absolute shrinkage and selection operator (LASSO) regression model was established by Python 3.8.1 to minimise the effect of multicollinearity. Results According to LASSO regression model, we found 10 variables had weights. Among them, Milano-Torinos (MITOS) stage (0–1) had the highest weight (−12.235), followed by younger age group (−3.198), lower-educated group (2.136), fatigue (1.687) and social support (-0.455). Variables including sleep quality, anxiety, depression and sex (male) had moderate weights in this model. Economic status (common), economic status (better), household (city), household (village), educational level (high), sex (female), age (older) and MITOS stage (2–4) had a weight of zero. Conclusions Our study demonstrates that the severity of ALS patients is the most influencing factor in caregiver burden. Caregivers of ALS patients may suffer less from caregiver burden when the patients are less severe, and the caregivers are younger. Low educational status could increase caregiver burden. Caregiver burden is positively correlated with the degree of fatigue and negatively correlated with social support. Hopefully, more attention should be paid to caregivers of ALS, and effective interventions can be developed to relieve this burden.
... Even using the leave-one-out type of Gibbs sampling scheme (Held and Holmes, 2006; Kang et al., 2021), the algorithms can still be computationally costly. Another common example is the regression model with a norm constraint on the parameters, ‖β‖q ≤ C, such as the Lasso (l1 norm, q = 1) (Tibshirani, 1996a) or the bridge estimator (lq norm, q ≥ 0) (Frank and Friedman, 1993b; Fu, 1998). Other examples include copula models, latent Dirichlet allocation, covariance matrix estimation, and nonparametric density function estimation. ...
... When q = 1, this corresponds to Lasso (least absolute shrinkage and selection operator) proposed by Tibshirani (1996a) which allows the model to force some of the coefficients to become exactly zero (i.e., become excluded from the model). When q = 2, this model is known as ridge regression. ...
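As a hedged illustration of the q = 1 versus q = 2 cases described above (Python/scikit-learn on synthetic data; the alpha value is arbitrary), the lasso sets some coefficients exactly to zero while ridge regression only shrinks them:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=80, n_features=20, n_informative=4,
                       noise=1.0, random_state=1)

lasso = Lasso(alpha=1.0).fit(X, y)   # q = 1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # q = 2 penalty

print("coefficients set to zero by the lasso:", int(np.sum(lasso.coef_ == 0)))
print("coefficients set to zero by ridge:   ", int(np.sum(ridge.coef_ == 0)))
```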
Preprint
Full-text available
The problem of sampling constrained continuous distributions frequently appears in many machine/statistical learning models. Many Markov chain Monte Carlo (MCMC) sampling methods have been adapted to handle different types of constraints on the random variables. Among these methods, Hamiltonian Monte Carlo (HMC) and related approaches have shown significant advantages in terms of computational efficiency compared to other counterparts. In this article, we first review HMC and some extended sampling methods, and then we concretely explain three constrained HMC-based sampling methods: reflection, reformulation, and spherical HMC. For illustration, we apply these methods to three well-known constrained sampling problems: truncated multivariate normal distributions, Bayesian regularized regression, and nonparametric density estimation. In this review, we also connect constrained sampling with a similar problem in the statistical design of experiments with a constrained design space.
... Specifically, we aimed to predict the shared disease effects using concatenated multiscale features of microstructural and functional gradients, cytoarchitectonic features (i.e., mean, SD, skewness, kurtosis, and externopyramidization), and transmitter maps (i.e., D1, D2, 5-HT1a, 5-HT1b, 5-HT2a, FDOPA, GABAa, DAT, NAT, and SERT). We used five-fold nested cross-validation [78][79][80] with LASSO regression 76. Nested cross-validation split the dataset into training (4/5) and test (1/5) partitions, and each training partition was further split into inner training and testing folds using another five-fold cross-validation. ...
... We assessed associations between the shared disease dimension and microstructural and functional connectivity gradients, cytoarchitectonic features calculated from the BigBrain 66 , and neurotransmitter maps obtained from independent PET/SPECT studies 54-59 based on linear correlations with 1000 spin-tests followed by FDR 164,165 . We opted for supervised machine learning to associate multiscale features and shared effects based on five-fold nested cross-validation [78][79][80] with LASSO regression 76 . ...
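A minimal sketch of five-fold nested cross-validation around a lasso model, in the spirit of the excerpts above (Python/scikit-learn; the data, penalty grid and scoring metric are placeholders, not those of the cited study):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=150, n_features=40, noise=10.0, random_state=0)

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # tunes the penalty
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates performance

search = GridSearchCV(Lasso(max_iter=10000),
                      param_grid={"alpha": np.logspace(-3, 1, 20)},
                      cv=inner_cv)
scores = cross_val_score(search, X, y, cv=outer_cv, scoring="r2")
print("nested-CV R^2: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```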
Article
Full-text available
It is increasingly recognized that multiple psychiatric conditions are underpinned by shared neural pathways, affecting similar brain systems. Here, we carried out a multiscale neural contextualization of shared alterations of cortical morphology across six major psychiatric conditions (autism spectrum disorder, attention deficit/hyperactivity disorder, major depression disorder, obsessive-compulsive disorder, bipolar disorder, and schizophrenia). Our framework cross-referenced shared morphological anomalies with respect to cortical myeloarchitecture and cytoarchitecture, as well as connectome and neurotransmitter organization. Pooling disease-related effects on MRI-based cortical thickness measures across six ENIGMA working groups, including a total of 28,546 participants (12,876 patients and 15,670 controls), we identified a cortex-wide dimension of morphological changes that described a sensory-fugal pattern, with paralimbic regions showing the most consistent alterations across conditions. The shared disease dimension was closely related to cortical gradients of microstructure as well as neurotransmitter axes, specifically cortex-wide variations in serotonin and dopamine. Multiple sensitivity analyses confirmed robustness with respect to slight variations in analytical choices. Our findings embed shared effects of common psychiatric conditions on brain structure in multiple scales of brain organization, and may provide insights into neural mechanisms of transdiagnostic vulnerability.
... For that reason, a penalized survival regression model was used to examine which set of characteristics best predicts the outcome. More specifically, the so-called L1 penalty, λ1, also known as the least absolute shrinkage and selection operator (LASSO; Tibshirani, 1996), was used. An advantage of this method is that the selection of predictors takes place simultaneously and that the method also directly determines how many predictors are included in the model. ...
... A penalized survival regression model can be used to examine which set of variables best predicts the outcome. More specifically, the so-called L1 penalty, λ1, also known as the least absolute shrinkage and selection operator (LASSO; Tibshirani, 1996), was used. Normally, the optimal coefficients in a regression model are found by minimizing the negative log-(partial) likelihood with respect to these coefficients. ...
Research
Full-text available
This study found no evidence to support the effectiveness of the BORG training programme in terms of recidivism among perpetrators of intimate partner violence. In accordance with and in supplement to previous studies on the BORG training programme, various problems and barriers in the execution of the programme have been identified. For example, the programme does not appear to be reaching the intended target group, the set objectives do not seem feasible and are not formulated in SMART terms, participants’ partners are insufficiently involved in the programme, the content of the programme is insufficiently based on evidence-based techniques and on a robust comprehensive theoretical framework, and there is insufficient nationwide management and control of the execution of the programme. These problems, however, provide starting points for the improvement of the BORG training programme, which may make the programme effective after all. In addition, a substantial percentage of participants seems positive about the training programme. Furthermore, the BORG training programme could potentially be used at an early stage for a large group of perpetrators (couples) of intimate partner violence, given that the Probation and Parole Service is already involved by the police in any arrests due to domestic violence. By devoting attention to the aforementioned problems identified in this study within the planned revision of the BORG training programme, and through the establishment of better framework conditions for an integrated, cross-domain, system-oriented approach to domestic violence by the Ministry of Justice and Security and the Ministry of Health, Welfare and Sport, an effective intervention could be created that would contribute to reducing intimate partner violence in society.
... One patient was treated with ICBT alone, and 88 patients were treated with a combination of EBRT and ICBT. Three-dimensional radiotherapy planning using an X-ray beam (6–18) was performed for all the patients who received EBRT. Patients without PAN metastasis received whole pelvis irradiation (WPI), and patients with PAN metastasis received extended-field irradiation. ...
... The least absolute shrinkage and selection operator (LASSO) regression model was used with MATLAB code to prevent overfitting [14,15]. The most significant predictive features were selected with the LASSO regression, which reduces the dimensionality by selecting from among all the candidate features in the training dataset. ...
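A hedged sketch of lasso-based feature selection as described above, using Python/scikit-learn rather than the study's MATLAB code; the simulated matrix merely stands in for the radiomics features of the 63 training patients:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(63, 200))          # 63 patients, 200 candidate features
y_train = X_train[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=63)  # toy outcome

# Standardize, then let the cross-validated lasso zero out uninformative features.
X_scaled = StandardScaler().fit_transform(X_train)
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y_train)

selected = np.flatnonzero(lasso.coef_)
print("features retained by the lasso:", selected.size)
```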
Article
Full-text available
Background: The current study aims to predict the recurrence of cervical cancer in patients treated with radiotherapy from radiomics features on pretreatment T1- and T2-weighted MR images. Methods: A total of 89 patients were split into model training (63 patients) and model testing (26 patients) sets. The predictors of recurrence were selected using least absolute shrinkage and selection operator (LASSO) regression. The machine learning model used neural network classifiers. Results: Using LASSO analysis of radiomics, we found 25 features from the T1-weighted and 4 features from the T2-weighted MR images. The accuracy was highest with the combination of T1- and T2-weighted MR images. The model performances with T1- or T2-weighted MR images alone were 86.4% or 89.4% accuracy, 74.9% or 38.1% sensitivity, 81.8% or 72.2% specificity, and an area under the curve (AUC) of 0.89 or 0.69, respectively. The model performance with the combination of T1- and T2-weighted MR images was 93.1% accuracy, 81.6% sensitivity, 88.7% specificity, and an AUC of 0.94. Conclusions: The radiomics analysis with T1- and T2-weighted MR images could highly predict the recurrence of cervical cancer after radiotherapy. The variation of the distribution and the difference in the pixel number at the periphery and the center were important predictors.
... This method includes an L1 penalty [Tibshirani, 1996] for variable selection and an L2 penalty for identifying clusters, based on k-means [Forgy, 1965], that are associated with the response variables. This differs from the previously mentioned methods in that the estimation of the regression coefficients and the grouping of the response variables are conducted simultaneously. ...
... When u_sℓ = 1, Xβ_s belongs to the ℓ-th cluster; otherwise u_sℓ = 0. v_ℓ (ℓ = 1, 2, ..., k) is the partial regression coefficient for the ℓ-th cluster centroid. The second term of the formula is the lasso penalty [Tibshirani, 1996], and the third term of Eq. (4) is the k-means clustering function. ...
Preprint
Full-text available
We propose a method for high-dimensional multivariate regression that is robust to random error distributions that are heavy-tailed or contain outliers, while preserving estimation accuracy for normal random error distributions. We extend Wilcoxon-type regression to a multivariate regression model as a tuning-free approach to robustness. Furthermore, the proposed method regularizes the coefficients with L1 and L2 terms of a clustering based on k-means, which is extended from the multivariate cluster elastic net. The estimation of the regression coefficients and variable selection are produced simultaneously. Moreover, accounting for the correlations among the response variables through the clustering is expected to improve the estimation performance. Numerical simulations demonstrate that our proposed method outperformed the multivariate cluster method and other multiple regression methods in the case of heavy-tailed error distributions and outliers. It also showed stability for normal error distributions. Finally, we confirm the efficacy of our proposed method using a data example of genes associated with breast cancer.
... However, they execute FS while constructing the predictive model; consequently, the computational complexity generally falls between that of filter and wrapper methods (Cadenas et al., 2013). Well-known examples of embedded methods are the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1996), Support Vector Machine Recursive Feature Elimination (SVM-RFE) (Guyon et al., 2002) and Random Forest Importance (Breiman, 2001). ...
... Least Absolute Shrinkage and Selection Operator (Lasso): lasso is a regression analysis approach that combines variable selection with regularisation to improve the statistical model's prediction accuracy and interpretability (Tibshirani, 1996). ...
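For reference, the lasso estimator from Tibshirani (1996) can be stated in its constrained form and its equivalent penalized (Lagrangian) form:

```latex
% Constrained form (Tibshirani, 1996):
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\,
  \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
  \quad \text{subject to} \quad \sum_{j=1}^{p}\lvert\beta_j\rvert \le t

% Equivalent penalized (Lagrangian) form:
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\,
  \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
  + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert
```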
Article
Full-text available
This study presents SPFSR, a novel stochastic approximation approach for performing simultaneous k-best feature ranking (FR) and feature selection (FS) based on Simultaneous Perturbation Stochastic Approximation (SPSA) with Barzilai and Borwein (BB) non-monotone gains. SPFSR is a wrapper-based method which may be used in conjunction with any given classifier or regressor with respect to any suitable corresponding performance metric. Numerical experiments are performed on 47 public datasets which contain both classification and regression problems, with the mean accuracy and R2 reported from four different classifiers and four different regressors respectively. In over 80% of classification experiments and over 85% of regression experiments SPFSR provided a statistically significant improvement or equivalent performance compared to existing, well-known FR techniques. Furthermore, SPFSR obtained a better classification accuracy and R-squared on average compared to utilizing the entire feature set.
... Moreover, the Lasso performs automatic variable selection, which is not the case for the L2-norm penalty. Although the performance of the Lasso does not uniformly dominate that of ridge regression (Tibshirani, 1996), the L1-norm CSVR appears very promising because variable selection is increasingly important in modern data science. ...
Preprint
Full-text available
Nonparametric regression subject to convexity or concavity constraints is increasingly popular in economics, finance, operations research, machine learning, and statistics. However, the conventional convex regression based on the least squares loss function often suffers from overfitting and outliers. This paper proposes to address these two issues by introducing the convex support vector regression (CSVR) method, which effectively combines the key elements of convex regression and support vector regression. Numerical experiments demonstrate the performance of CSVR in prediction accuracy and robustness that compares favorably with other state-of-the-art methods.
... In order to identify racial subgroup performance variations within the probabilistic models, we built the following steps into a framework that we can later re-use for a wider variety of phenotypes. We started by selecting three of the most popular classical machine learning models to evaluate: LASSO [45], since regression-based classifiers are widely used for statistical learning purposes with EHR/medical data [46], Random Forest (RF) [47], and support vector machines (SVM) [48]. Note that the three models listed above are the ones supported by default by APHRODITE; however, any model supported by the caret R package can also be included [49]. ...
Preprint
Full-text available
Objective: Biases within probabilistic electronic phenotyping algorithms are largely unexplored. In this work, we characterize differences in sub-group performance of phenotyping algorithms for Alzheimer's Disease and Related Dementias (ADRD) in older adults. Materials and methods: We created an experimental framework to characterize the performance of probabilistic phenotyping algorithms under different racial distributions allowing us to identify which algorithms may have differential performance, by how much, and under what conditions. We relied on rule-based phenotype definitions as reference to evaluate probabilistic phenotype algorithms created using the Automated PHenotype Routine for Observational Definition, Identification, Training and Evaluation (APHRODITE) framework. Results: We demonstrate that some algorithms have performance variations anywhere from 3 to 30% for different populations, even when not using race as an input variable. We show that while performance differences in subgroups are not present for all phenotypes, they do affect some phenotypes and groups more disproportionately than others. Discussion: Our analysis establishes the need for a robust evaluation framework for subgroup differences. The underlying patient populations for the algorithms showing subgroup performance differences have great variance between model features when compared to the phenotypes with little to no differences. Conclusion: We have created a framework to identify systematic differences in the performance of probabilistic phenotyping algorithms specifically in the context of ADRD as a use case. Differences in subgroup performance of probabilistic phenotyping algorithms are not widespread nor do they occur consistently. This highlights the great need for careful ongoing monitoring to evaluate, measure, and try to mitigate such differences.
... We study the classical regression problem with one continuous response for multigroup data. Although there are many well-established regression techniques for homogeneous data (Hoerl and Kennard, 1970;Tibshirani, 1996), they may not be suitable for multi-group data. One naive approach is to ignore data heterogeneity and fit a global model using these techniques. ...
Preprint
Multi-group data are commonly seen in practice. Such data structure consists of data from multiple groups and can be challenging to analyze due to data heterogeneity. We propose a novel Joint and Individual Component Regression (JICO) model to analyze multi-group data. In particular, our proposed model decomposes the response into shared and group-specific components, which are driven by low-rank approximations of joint and individual structures from the predictors respectively. The joint structure has the same regression coefficients across multiple groups, whereas individual structures have group-specific regression coefficients. Moreover, the choice of global and individual ranks allows our model to cover global and group-specific models as special cases. For model estimation, we formulate this framework under the representation of latent components and propose an iterative algorithm to solve for the joint and individual scores under the new representation. To construct the latent scores, we utilize the Continuum Regression (CR), which provides a unified framework that covers the Ordinary Least Squares (OLS), the Partial Least Squares (PLS), and the Principal Component Regression (PCR) as its special cases. We show that JICO attains a good balance between global and group-specific models and remains flexible due to the usage of CR. Finally, we conduct simulation studies and analysis of an Alzheimer's disease dataset to further demonstrate the effectiveness of JICO. R implementation of JICO is available online at https://github.com/peiyaow/JICO.
... where the loss function of (16) is modified with the addition of a regularization function R(⋅), which imposes some form of constraint upon the coefficient vector, and a weighting parameter λ, which controls the importance of the regularization in the minimization. Specifically for computing PCEs, it is common to use ℓ1-regularization, such that R(ŝ) = ‖ŝ‖1 [32,39,40,72], a technique closely connected to the least absolute shrinkage and selection operator (LASSO) method in statistics [73] and to compressed sensing methods in signal processing [74]. The ℓ1-regularization forces multiple coefficients to zero, thus effectively reducing the model parameters and resulting in a sparse PCE. ...
... Linear regression is a popular and interpretable approach to study the relationship between soil properties and cavity behavior [50]. We applied Lasso regularization [41] to all of our linear regressions, as it can shrink the coefficients of less predictive features to zero and improve model generalizability. The first regression model uses only two numerical values (cavity depth H and soil density d) as input features, while the second model adds the vectorized stress field σrr as an extra feature (Table 2). ...
Article
Full-text available
Estimating soil properties from the mechanical reaction to a displacement is a common strategy, used not only in in situ soil characterization (e.g., pressuremeter and dilatometer tests) but also by biological organisms (e.g., roots, earthworms, razor clams), which sense stresses to explore the subsurface. Still, the absence of analytical solutions to predict the stress and deformation fields around cavities subject to geostatic stress has prevented the development of characterization methods that resemble the strategies adopted by nature. We use the finite element method (FEM) to model the displacement-controlled expansion of cavities under a wide range of stress conditions and soil properties. The radial stress distribution at the cavity wall during expansion is extracted. Then, methods are proposed to prepare, transform and use such stress distributions to back-calculate the far field stresses and the mechanical parameters of the material around the cavity (Mohr-Coulomb friction angle ϕ, Young's modulus E). Results show that: (i) The initial stress distribution around the cavity can be fitted to a sum of cosines to estimate the far field stresses; (ii) By encoding the stress distribution as intensity images, in addition to certain scalar parameters, convolutional neural networks can consistently and accurately back-calculate the friction angle and Young's modulus of the soil.
... ML implementations were performed using existing libraries and methods from RStudio and Python software. The ML methods implemented were LASSO [26], Elastic Net, and Ridge [27] (Eq. (1)) regressions due to their practicality and output form. ...
Article
Features such as a fast growth rate, a high strength-to-weight ratio, and a high carbon-sequestering capability, amongst others, make bamboo an excellent alternative environmentally friendly construction material. Therefore, it is very important to establish the most appropriate geometrical and/or physical properties that can be used to infer capacities as well as structural grades for bamboo such as Guadua angustifolia Kunth (GAK). Thus, an extensive experimental characterization of the physical and mechanical properties of GAK was conducted by two independent laboratories, with samples from the same plantation in Colombia. Pooling of the two datasets was performed in order to create a larger dataset and undertake a more rigorous statistical analysis using machine learning (ML) methods. In addition, regression equations of mean and characteristic values for parallel-to-fiber compression, shear and bending capacities, and flexural stiffness were determined based on ML methods employing geometrical and physical properties. Finally, ML methods were used to propose a classification method based on four capacity classes that could enable a simpler grading process for structural bamboo species such as GAK.
... , p}, p is the number of covariates, V is the orthonormal set of eigenvectors, and δ is the vector of estimated regression coefficients. Lasso is a regression with an l1-norm penalty aiming to find β = {βj}, which minimizes Equation (3) [35]. ...
Article
Full-text available
To obtain a better performance when modeling soil spectral data for attribute prediction, researchers frequently resort to data pretreatment, aiming to reduce noise and highlight the spectral features. Even with the awareness of the existence of dimensionality reduction statistical approaches that can cope with data sparse dimensionality, few studies have explored its applicability in soil sensing. Therefore, this study’s objective was to assess the predictive performance of two dimensionality reduction statistical models that are not widespread in the proximal soil sensing community: principal components regression (PCR) and least absolute shrinkage and selection operator (lasso). Here, these two approaches were compared with multiple linear regressions (MLR). All of the modelling strategies were applied without employing pretreatment techniques for soil attribute determination using X-ray fluorescence spectroscopy (XRF) and visible and near-infrared diffuse reflectance spectroscopy (Vis-NIR) data. In addition, the achieved results were compared against the ones reported in the literature that applied pretreatment techniques. The study was carried out with 102 soil samples from two distinct fields. Predictive models were developed for nine chemical and physical soil attributes, using lasso, PCR and MLR. Both Vis-NIR and XRF raw spectral data presented a great performance for soil attribute prediction when modelled with PCR and the lasso method. In general, similar results were found comparing the root mean squared error (RMSE) and coefficient of determination (R2) from the literature that applied pretreatment techniques and this study. For example, considering base saturation (V%), for Vis-NIR combined with PCR, in this study, RMSE and R2 values of 10.60 and 0.79 were found compared with 10.38 and 0.80, respectively, in the literature. In addition, looking at potassium (K), XRF associated with lasso yielded an RMSE value of 0.60 and R2 of 0.92, and in the literature, RMSE and R2 of 0.53 and 0.95, respectively, were found. The major discrepancy was observed for phosphorus (P) and organic matter (OM) prediction applying PCR in the XRF data, which showed R2 of 0.33 (for P) and 0.52 (for OM) without using pretreatment techniques in this study, and R2 of 0.01 (for P) and 0.74 (for OM) when using preprocessing techniques in the literature. These results indicate that data pretreatment can be disposable for predicting some soil attributes when using Vis-NIR and XRF raw data modeled with dimensionality reduction statistical models. Despite this, there is no consensus on the best way to calibrate data, as this seems to be attribute and area specific.
... The models used to classify and predict RPE data were selected based on a combination of models used in the current sport literature to impute missing RPE data as well as models that can be implemented using open-source software [2,3,10,11,14]. In this investigation, RPE values were classified and predicted using daily team mean substitution [2,3], regression models (linear (R stats package), stepwise (R MASS), lasso, ridge, and elastic net (R glmnet)), k-nearest neighbours (R FNN), random forest (R randomForest), support vector machine (R e1071), and neural network models [24][25][26][27][28][29][30][31][32][33][34][35]. ...
Article
Full-text available
Rate of perceived exertion (RPE) is used to calculate athlete load. Incomplete load data, due to missing athlete-reported RPE, can increase injury risk. The current standard for missing RPE imputation is daily team mean substitution. However, RPE reflects an individual's effort; group mean substitution may be suboptimal. This investigation assessed an ideal method for imputing RPE. A total of 987 datasets were collected from women's rugby sevens competitions. Daily team mean substitution, k-nearest neighbours, random forest, support vector machine, neural network, linear, stepwise, lasso, ridge, and elastic net regression models were assessed at different missingness levels. Statistical equivalence of true and imputed scores by model was evaluated. An ANOVA of accuracy by model and missingness was completed. While all models were equivalent to the true RPE, differences by model existed. Daily team mean substitution was the poorest performing model, and random forest the best. Accuracy was low in all models, affirming RPE as multifaceted and requiring quantification of potentially overlapping factors. While group mean substitution is discouraged, practitioners are recommended to scrutinize any imputation method relating to athlete load.
... Lasso (least absolute shrinkage and selection operator) regression was first proposed by Tibshirani (1996). It is a biased estimation method that can be used for feature selection in high-dimensional data. ...
Article
Full-text available
Globally, all countries encounter air pollution problems along their development path. As a significant indicator of air quality, PM2.5 concentration has long been proven to affect the population's death rate. Machine learning algorithms, proven to outperform traditional statistical approaches, are widely used in air pollution prediction. However, research on model selection and on the environmental interpretation of model prediction results is still scarce and is urgently needed to guide policy making on air pollution control. Our research compared four types of machine learning algorithms (LinearSVR, K-Nearest Neighbor, Lasso regression, and Gradient Boosting) by examining their performance in predicting PM2.5 concentrations across different cities and seasons. The results show that the machine learning models are able to forecast the next-day PM2.5 concentration from the previous five days' data with good accuracy. The comparative experiments show that, at the city level, the Gradient Boosting prediction model has better prediction performance, with a mean absolute error (MAE) of 9 ug/m³ and a root mean square error (RMSE) of 10.25–16.76 ug/m³, lower than the other three models, and, at the season level, all four models have their best prediction performance in winter and their worst in summer. More importantly, the demonstration of the models' different performances in each city and each season is of great significance for environmental policy implications.
... To avoid this, a minimization with an additional penalty term τ added to the error function has been proposed. Among these, a Lasso function [25] is defined to be ...
Preprint
Full-text available
A novel framework has recently been proposed for designing the molecular structure of chemical compounds with a desired chemical property, where design of novel drugs is an important topic in bioinformatics and chemo-informatics. The framework infers a desired chemical graph by solving a mixed integer linear program (MILP) that simulates the computation process of a feature function defined by a two-layered model on chemical graphs and a prediction function constructed by a machine learning method. A set of graph theoretical descriptors in the feature function plays a key role to derive a compact formulation of such an MILP. To improve the learning performance of prediction functions in the framework maintaining the compactness of the MILP, this paper utilizes the product of two of those descriptors as a new descriptor and then designs a method of reducing the number of descriptors. The results of our computational experiments suggest that the proposed method improved the learning performance for many chemical properties and can infer a chemical structure with up to 50 non-hydrogen atoms.
... The NP-hardness of the problem has contributed to the belief that discrete optimization problems are intractable [3]. For this reason, plenty of impressive sparsity-promoting techniques have focused on computationally feasible algorithms for solving the approximations, including the Lasso [33], the Elastic-net [37], nonconvex regularization [15,19] and stepwise regression [13]. These approximations induce obscure sparsity via regularization that often includes a large set of active terms (many are correlated terms and the coefficients are shrunken to zero to avoid overfitting) in order to deliver good prediction. ...
Preprint
Full-text available
The identification of governing equations for dynamical systems is an everlasting challenge for fundamental research in science and engineering. Machine learning has exhibited great success in learning and predicting dynamical systems from data. However, a fundamental challenge still exists: discovering the exact governing equations from highly noisy data. In the present work, we propose a compressive sensing-assisted mixed integer optimization (CS-MIO) method to make a step forward from a modern discrete optimization lens. In particular, we first formulate the problem as a mixed integer optimization model. The discrete optimization nature of the model leads to exact variable selection by means of a cardinality constraint, and thereby a powerful capability for exact discovery of governing equations from noisy data. This capability is further enhanced by incorporating compressive sensing and regularization techniques for highly noisy data and high-dimensional problems. Case studies on classical dynamical systems show that CS-MIO can discover the exact governing equations from large-noise data, with up to two orders of magnitude larger noise compared with the state-of-the-art method. We also show its effectiveness for high-dimensional dynamical system identification through the chaotic Lorenz 96 system.
... The expression of each gene was modeled as a linear combination of its own gene copy number and the expression values of all other genes. The parameters of the underlying linear models were computed using the R package regNet [30], which uses lasso regression [76] in combination with a significance test for lasso [77], to determine the most relevant predictors for each gene-specific linear model. Depending on the sign of the learned parameter, a selected predictor can either represent a potential activator (positive sign) or a potential inhibitor (negative sign) of the considered gene. ...
Article
Full-text available
T-cell prolymphocytic leukemia (T-PLL) is a rare blood cancer with poor prognosis. Overexpression of the proto-oncogene TCL1A and missense mutations of the tumor suppressor ATM are putative main drivers of T-PLL development, but so far only little is known about the existence of T-PLL gene expression subtypes. We performed an in-depth computational reanalysis of 68 gene expression profiles of one of the largest currently existing T-PLL patient cohorts. Hierarchical clustering combined with bootstrapping revealed three robust T-PLL gene expression subgroups. Additional comparative analyses revealed similarities and differences of these subgroups at the level of individual genes, signaling and metabolic pathways, and associated gene regulatory networks. Differences were mainly reflected at the transcriptomic level, whereas gene copy number profiles of the three subgroups were much more similar to each other, except for few characteristic differences like duplications of parts of the chromosomes 7, 8, 14, and 22. At the network level, most of the 41 predicted potential major regulators showed subgroup-specific expression levels that differed at least in comparison to one other subgroup. Functional annotations suggest that these regulators contribute to differences between the subgroups by altering processes like immune responses, angiogenesis, cellular respiration, cell proliferation, apoptosis, or migration. Most of these regulators are known from other cancers and several of them have been reported in relation to leukemia (e.g. AHSP, CXCL8, CXCR2, ELANE, FFAR2, G0S2, GIMAP2, IL1RN, LCN2, MBTD1, PPP1R15A). The existence of the three revealed T-PLL subgroups was further validated by a classification of T-PLL patients from two other smaller cohorts. Overall, our study contributes to an improved stratification of T-PLL and the observed subgroup-specific molecular characteristics could help to develop urgently needed targeted treatment strategies.
... In particular, these multiplicative constants can be calibrated by the slope heuristic approach in a finite-sample setting. Then, in the spirit of the methods based on concentration inequalities developed in [60,61,22,23], a number of finite-sample oracle type inequalities have been established for the least absolute shrinkage and selection operator (LASSO) [94] and general penalized maximum likelihood estimators (PMLE). These results include the works for high-dimensional Gaussian graphical models [33], Gaussian mixture model selection [62,63,64], finite mixture regression models [68,28,29,31,32], and LinBoSGaME models, outside the high-dimensional setting [70]. ...
... A multivariate (multi-input, multi-output) regression model with the L1 penalty, commonly known as LASSO [24], was designed. We selected this model since it achieves sparsity in the estimated model by setting the regression coefficients to zero for those features that do not affect the output or target values. ...
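A hedged sketch of a multi-input, multi-output sparse regression in Python/scikit-learn. MultiTaskLasso uses a joint (L2,1) penalty, a group-sparse relative of the plain L1 penalty described above, which zeroes a feature's coefficients for all outputs at once; the data and alpha value are illustrative only, not the cited study's setup:

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
W = np.zeros((30, 4))
W[:5] = rng.normal(size=(5, 4))                  # only the first 5 features matter
Y = X @ W + 0.1 * rng.normal(size=(200, 4))

model = MultiTaskLasso(alpha=0.1).fit(X, Y)      # coef_ has shape (n_outputs, n_features)
kept = np.flatnonzero(np.any(model.coef_ != 0, axis=0))
print("features kept across all outputs:", kept)
```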
Preprint
Detailed flow distributions in vascular systems are the key to identifying hemodynamic risk factors for the development and progression of vascular diseases. Although computational fluid dynamics (CFD) has been widely used in bioengineering research on hemodynamics predictions, high-fidelity CFD simulations are not only time-consuming and computationally expensive, but also unfriendly to clinical applications due to the difficulty of comprehensive numerical calculations. Machine learning (ML) algorithms that estimate the flow field in vascular systems based on angiographic images of the blood flow obtained with existing diagnostic tools are emerging as a new pathway to facilitate the mapping of hemodynamics. In the present work, dye injection in a water flow was simulated as an analogy of contrast perfusion in blood flow using CFD. In the simulation, light passes through the flow field and generates projective images, as an analogy of X-ray imaging. The simulations provide both the ground truth velocity field and the projective images of the flow with dye patterns. A rough velocity field was estimated using the optical flow method (OFM) based on the projective images. ML algorithms are then trained using the ground truth CFD data and the OFM velocity estimation as the input. Finally, the interpretable (logistic regression) and deep (neural networks, convolutional neural networks, long short-term memory) machine learning models are validated using parallel in vitro experiments on the same flow setup. The validation results showed that the employed ML models significantly reduced the error rate from 53.5% to 2.5% on average for the v-velocity estimation.
... The LASSO (least absolute shrinkage and selection operator, Tibshirani) method is a shrinkage estimation method [21]. By shrinking the regression coefficients and reducing some of them to zero, a penalty function can be constructed to obtain a more refined model. ...
Preprint
Full-text available
Background: Multiple myeloma (MM) is an incurable, relapse-prone disease with apparent prognostic heterogeneity. At present, the risk stratification of myeloma is still incomplete. Pyroptosis, a type of programmed cell death, has been shown to regulate tumor growth and may have potential prognostic value. However, the role of pyroptosis-related genes (PRGs) in MM remains undetermined. The aim of this study was to identify potential prognostic biomarkers and construct a predictive model related to PRGs. Methods: Sequencing and clinical data were obtained from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) databases. Non-negative matrix factorization (NMF) was performed for molecular subtype screening. LASSO regression was used to screen for prognostic markers. The maxstat package was utilized to calculate the optimal cutoff value for the risk score. Patients were then divided into high/low risk groups depending on the cutoff value, and survival curves were plotted using the Kaplan-Meier (K-M) method. The nomogram and a calibration curve of the multi-factor model were established using the rms package. Results: A total of 33 PRGs were extracted from the TCGA database, based on which 4 MM molecular subtypes were defined. Patients in cluster 1 had poorer survival than those in cluster 2 (p = 0.035), and the infiltration degree of many immune cells was the opposite in these two clusters. A total of 9 PRGs were screened out as prognostic markers, and the risk score constructed from them had the best predictive ability for 3-year survival (AUC = 0.658). Patients in the high-risk group had worse survival than those in the low-risk group (p < 0.0001), consistent with the results verified in the GSE2658 dataset. The nomogram constructed from gender, age, ISS stage and risk score had better prognostic predictive performance, with a c-index of 0.721. Conclusions: Our model could enhance the predictive ability of ISS staging and provide a reference for clinical decision-making. The new prognostic pyroptosis-related markers in MM screened out here may facilitate the development of novel risk stratification for MM. Clinical trial registration: Not applicable.
... (3) DNAm surrogates of risk factors/biomarkers were constructed, regressing the response variables on the top 1% CpG sites, adjusting for sex and age. Finally, we applied L1 penalised estimation for enforcing sparsity in the regression coefficients employing the LASSO procedure[60] or the corresponding penalised mixed model[61] (for the biomarkers showing difference by centre) depending on the biomarker. For the latter method, ad hoc R routines were devised: the source code is freely available in the form of an R package at https:// github. ...
Article
Full-text available
Background Recent evidence highlights the epidemiological value of blood DNA methylation (DNAm) as surrogate biomarker for exposure to risk factors for non-communicable diseases (NCD). DNAm surrogate of exposures predicts diseases and longevity better than self-reported or measured exposures in many cases. Consequently, disease prediction models based on blood DNAm surrogates may outperform current state-of-the-art prediction models. This study aims to develop novel DNAm surrogates for cardiovascular diseases (CVD) risk factors and develop a composite biomarker predictive of CVD risk. We compared the prediction performance of our newly developed risk score with the state-of-the-art DNAm risk scores for cardiovascular diseases, the ‘next-generation’ epigenetic clock DNAmGrimAge, and the prediction model based on traditional risk factors SCORE2. Results Using data from the EPIC Italy cohort, we derived novel DNAm surrogates for BMI, blood pressure, fasting glucose and insulin, cholesterol, triglycerides, and coagulation biomarkers. We validated them in four independent data sets from Europe and the USA. Further, we derived a DNAmCVDscore predictive of the time-to-CVD event as a combination of several DNAm surrogates. ROC curve analyses show that DNAmCVDscore outperforms previously developed DNAm scores for CVD risk and SCORE2 for short-term CVD risk. Interestingly, the performance of DNAmGrimAge and DNAmCVDscore was comparable (slightly lower for DNAmGrimAge, although the differences were not statistically significant). Conclusions We described novel DNAm surrogates for CVD risk factors useful for future molecular epidemiology research, and we described a blood DNAm-based composite biomarker, DNAmCVDscore, predictive of short-term cardiovascular events. Our results highlight the usefulness of DNAm surrogate biomarkers of risk factors in epigenetic epidemiology to identify high-risk populations. In addition, we provide further evidence on the effectiveness of prediction models based on DNAm surrogates and discuss methodological aspects for further improvements. Finally, our results encourage testing this approach for other NCD diseases by training and developing DNAm surrogates for disease-specific risk factors and exposures.
... Although the solution using the Moore-Penrose pseudoinverse [3], b = yX⁺ (5), might be better as it satisfies (y − bX)² = 0 under the condition of min_b ‖b‖₂, it is unclear whether min_b ‖b‖₂ is a good constraint from the biological viewpoint. Adding the L1-norm regularization term [4] to Eq (1), (y − bX)² ...
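A hedged sketch (Python/NumPy and scikit-learn, using the column-vector convention rather than the row-vector form above) contrasting the minimum-norm pseudoinverse solution with an L1-regularized solution for an underdetermined system:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))          # fewer samples than unknowns
b_true = np.zeros(100)
b_true[:3] = [2.0, -1.5, 1.0]           # sparse ground truth
y = X @ b_true

b_pinv = np.linalg.pinv(X) @ y          # minimum L2-norm solution, zero residual
b_lasso = Lasso(alpha=0.01, max_iter=50000).fit(X, y).coef_  # adds the L1 term

print("non-zeros (pseudoinverse):", int(np.sum(np.abs(b_pinv) > 1e-8)))
print("non-zeros (lasso):        ", int(np.sum(np.abs(b_lasso) > 1e-8)))
```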
Article
Full-text available
Identifying differentially expressed genes is difficult because of the small number of available samples compared with the large number of genes. Conventional gene selection methods employing statistical tests have the critical problem of heavy dependence of P-values on sample size. Although the recently proposed principal component analysis (PCA) and tensor decomposition (TD)-based unsupervised feature extraction (FE) has often outperformed these statistical test-based methods, the reason why it worked so well is unclear. In this study, we aim to understand this reason in the context of projection pursuit (PP), which was proposed a long time ago to solve the problem of dimensionality; we can relate the space spanned by singular value vectors with that spanned by the optimal cluster centroids obtained from K-means. Thus, the success of PCA- and TD-based unsupervised FE can be understood through this equivalence. In addition, empirical threshold-adjusted P-values of 0.01, assuming the null hypothesis that singular value vectors attributed to genes obey the Gaussian distribution, empirically correspond to threshold-adjusted P-values of 0.1 when the null distribution is generated by gene order shuffling. For this purpose, we newly applied PP to the three data sets to which PCA- and TD-based unsupervised FE were previously applied; these data sets covered two topics, biomarker identification for kidney cancers (the first two) and drug discovery for COVID-19 (the third one). We found that the agreement between PP and PCA- or TD-based unsupervised FE is quite good. The shuffling procedures described above were also successfully applied to these three data sets. These findings thus rationalize the success of PCA- and TD-based unsupervised FE for the first time.
... Therefore, we used a statistical regularization framework to optimize model selection for predictive ability rather than strictly inference (Gerber & Northrup, 2020; Tredennick et al., 2021). We used the least absolute shrinkage and selection operator (LASSO; Tibshirani, 1996) as a method for variable selection to determine a final model for predictions. We fit LASSO models using a downweighted Poisson generalized linear regression to approximate an inhomogeneous Poisson point process, with a set of 200,000 random points distributed across our study area to serve as quadrature locations (Renner et al., 2015). ...
... Logistic Regression is a well-known and often used statistical technique for binary classification [73] that is easy to interpret but has limited model capacity and suffers from overfitting when many features are considered. Lasso regularization often helps in overcoming overfitting [74]. It achieves this by adding a penalty proportional to the absolute values of the coefficients. ...
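A minimal sketch of logistic regression with a lasso (L1) penalty to curb overfitting when many features are considered (Python/scikit-learn; the data set and the regularization strength C are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# 'liblinear' (or 'saga') supports the L1 penalty; smaller C means a stronger penalty.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print("coefficients forced to zero:", int(np.sum(clf.coef_ == 0)))
```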
Article
Full-text available
Objective This study used machine learning (ML) to test an empirically derived set of risk factors for marijuana use. Models were built separately for child welfare (CW) and non-CW adolescents in order to compare the variables selected as important features/risk factors. Method Data were from a Time 4 (M age = 18.22) of longitudinal study of the effects of maltreatment on adolescent development (n = 350; CW = 222; non-CW = 128; 56%male). Marijuana use in the past 12 months (none versus any) was obtained from a single item self-report. Risk factors entered into the model included mental health, parent/family social support, peer risk behavior, self-reported risk behavior, self-esteem, and self-reported adversities (e.g., abuse, neglect, witnessing family violence or community violence). Results The ML approaches indicated 80% accuracy in predicting marijuana use in the CW group and 85% accuracy in the non-CW group. In addition, the top features differed for the CW and non-CW groups with peer marijuana use emerging as the most important risk factor for CW youth, whereas externalizing behavior was the most important for the non-CW group. The most important common risk factor between group was gender, with males having higher risk. Conclusions This is the first study to examine the shared and unique risk factors for marijuana use for CW and non-CW youth using a machine learning approach. The results support our assertion that there may be similar risk factors for both groups, but there are also risks unique to each population. Therefore, risk factors derived from normative populations may not have the same importance when used for CW youth. These differences should be considered in clinical practice when assessing risk for substance use among adolescents.
... Logistic regression is a simple and efficient classification algorithm, where each predictor's importance is explicitly denoted by its corresponding coefficient. LASSO is an L1-penalized regression approach which attempts to avoid overfitting by forcing the coefficients of the least contributive predictors to be exactly zero [17]. Random forest is a machine learning algorithm that trains a number of decision trees using a combination of bootstrap aggregating and random feature selection. ...
Article
Full-text available
Background: Hospitalization-associated acute kidney injury (AKI), affecting one-in-five inpatients, is associated with increased mortality and major adverse cardiac/kidney endpoints. Early AKI risk stratification may enable closer monitoring and prevention. Given the complexity and resource utilization of existing machine learning models, we aimed to develop a simpler prediction model. Methods: Models were trained and validated to predict risk of AKI using electronic health record (EHR) data available at 24 h of inpatient admission. Input variables included demographics, laboratory values, medications, and comorbidities. Missing values were imputed using multiple imputation by chained equations. Results: 26,410 of 209,300 (12.6%) inpatients developed AKI during admission between 13 July 2012 and11 July 2018. The area under the receiver operating characteristic curve (AUROC) was 0.86 for Random Forest and 0.85 for LASSO. Based on Youden’s Index, a probability cutoff of >0.15 provided sensitivity and specificity of 0.80 and 0.79, respectively. AKI risk could be successfully predicted in 91% patients who required dialysis. The model predicted AKI an average of 2.3 days before it developed. Conclusions: The proposed simpler machine learning model utilizing data available at 24 h of admission is promising for early AKI risk stratification. It requires external validation and evaluation of effects of risk prediction on clinician behavior and patient outcomes.
... To avoid overfitting in the outcome model, we then utilize elastic net regularization or penalization [69] to estimate the coefficients. The elastic net penalty is a mixture of the ridge regression [70] penalty and least absolute shrinkage and selection operator (LASSO) [71] penalty. The ridge regression penalty partially deletes all variables by shrinking the coefficient estimates toward, but not entirely to, zero. ...
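A small illustrative sketch of the elastic net as a mixture of the ridge (L2) and lasso (L1) penalties (Python/scikit-learn; l1_ratio controls the mix, with 1.0 reducing to the lasso and 0.0 to a ridge-like penalty; values here are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=120, n_features=25, n_informative=6,
                       noise=5.0, random_state=0)

enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)  # half lasso, half ridge
print("coefficients zeroed by the elastic net:", int(np.sum(enet.coef_ == 0)))
```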
Article
Full-text available
Background Increasing attention is being given to assessing treatment effect heterogeneity among individuals belonging to qualitatively different latent subgroups. Inference routinely proceeds by first partitioning the individuals into subgroups, then estimating the subgroup-specific average treatment effects. However, because the subgroups are only latently associated with the observed variables, the actual individual subgroup memberships are rarely known with certainty in practice and thus have to be imputed. Ignoring the uncertainty in the imputed memberships precludes misclassification errors, potentially leading to biased results and incorrect conclusions. Methods We propose a strategy for assessing the sensitivity of inference to classification uncertainty when using such classify-analyze approaches for subgroup effect analyses. We exploit each individual’s typically nonzero predictive or posterior subgroup membership probabilities to gauge the stability of the resultant subgroup-specific average causal effects estimates over different, carefully selected subsets of the individuals. Because the membership probabilities are subject to sampling variability, we propose Monte Carlo confidence intervals that explicitly acknowledge the imprecision in the estimated subgroup memberships via perturbations using a parametric bootstrap. The proposal is widely applicable and avoids stringent causal or structural assumptions that existing bias-adjustment or bias-correction methods rely on. Results Using two different publicly available real-world datasets, we illustrate how the proposed strategy supplements existing latent subgroup effect analyses to shed light on the potential impact of classification uncertainty on inference. First, individuals are partitioned into latent subgroups based on their medical and health history. Then within each fixed latent subgroup, the average treatment effect is assessed using an augmented inverse propensity score weighted estimator. Finally, utilizing the proposed sensitivity analysis reveals different subgroup-specific effects that are mostly insensitive to potential misclassification. Conclusions Our proposed sensitivity analysis is straightforward to implement, provides both graphical and numerical summaries, and readily permits assessing the sensitivity of any machine learning-based causal effect estimator to classification uncertainty. We recommend making such sensitivity analyses more routine in latent subgroup effect analyses.
... LASSO reduces collinearity and increases precision. The goal of using it is to find which covariates (risk factors and comorbidities in our case) are selected, what their coefficients are, and to interpret them [8]. Selected variables were then entered into the final multivariable logistic regression models to generate adjusted odds ratios (aORs). ...
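A minimal sketch of this two-step use of the LASSO (illustrative only; the simulated data and tuning choices are assumptions, not the cited study's code):

    import numpy as np
    from sklearn.linear_model import LogisticRegressionCV, LogisticRegression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 15))                     # 15 candidate risk factors / comorbidities
    logit = 1.2 * X[:, 0] - 0.8 * X[:, 3]              # only two truly affect the outcome
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

    # Step 1: L1-penalized (LASSO) logistic regression selects a sparse subset of covariates.
    sel = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=10, cv=5).fit(X, y)
    keep = np.flatnonzero(sel.coef_.ravel() != 0)

    # Step 2: refit an (effectively) unpenalized logistic model on the selected covariates;
    # exponentiated coefficients give the adjusted odds ratios (aORs).
    final = LogisticRegression(C=1e6, max_iter=1000).fit(X[:, keep], y)
    print("selected columns:", keep)
    print("aORs:", np.exp(final.coef_.ravel()))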
Article
Full-text available
Background Haiti’s first COVID-19 cases were confirmed on March 18, 2020, and the virus subsequently spread throughout the country. The objective of this study was to describe clinical manifestations of COVID-19 in Haitian outpatients and to identify risk factors for severity of clinical manifestations. Methods We conducted a retrospective study of COVID-19 outpatients diagnosed from March 18-August 4, 2020, using demographic, epidemiological, and clinical data reported to the Ministry of Health (MoH). We used univariate and multivariate analysis, including multivariable logistic regression, to explore the risk factors and specific symptoms related to persons with symptomatic COVID-19 and the severity of symptomatic COVID-19 disease. Results Of 5,389 cases reported to MoH during the study period, 1,754 (32.5%) were asymptomatic. Among symptomatic persons, 2,747 (75.6%) had mild COVID-19 and 888 (24.4%) had moderate-to-severe disease; the most common symptoms were fever (69.6%), cough (51.9%), and myalgia (45.8%). The odds of having moderate-to-severe disease were highest among persons with hypertension (aOR = 1.72, 95% Confidence Interval [CI] (1.34, 2.20)), chronic pulmonary disease (aOR = 3.93, 95% CI (1.93, 8.17)) and tuberculosis (aOR = 3.44, 95% CI (1.35, 9.14)) compared to persons without those conditions. The odds of having moderate-to-severe disease increased with age, but elevated odds were also seen among children aged 0–4 years (OR: 1.73, 95% CI (0.93, 3.08)), when using 30–39 years old as the reference group. All of the older age groups (50–64 years, 65–74 years, 75–84 years, and 85+ years) had significantly higher odds of having moderate-to-severe COVID-19 compared with ages 30–39 years. Diabetes was associated with elevated odds of moderate-to-severe disease in bivariate analysis (OR = 2.17, 95% CI (1.58, 2.98)), but this association did not hold in multivariable analyses (aOR = 1.22, 95% CI (0.86, 1.72)). Conclusion These findings from a resource-constrained country highlight the importance of surveillance systems to track emerging infections and their risk factors. In addition to co-morbidities described elsewhere, tuberculosis was a risk factor for moderate-to-severe COVID-19 disease.
... LASSO, short for Least Absolute Shrinkage and Selection Operator, is a statistical method whose main purposes are feature selection and regularization of data models. The method was first introduced in 1996 by Statistics Professor Robert Tibshirani [14]. LASSO constrains the model by bounding the sum of the absolute values of its parameters, so that the coefficients are kept within an allowable range. ...
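In standard notation (a generic statement of the constrained lasso, not taken verbatim from the cited text), the estimator solves

\[
\hat{\beta} = \arg\min_{\beta}\; \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
\quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \le t,
\]

where the bound \(t \ge 0\) controls how strongly the coefficients are shrunk; small values of \(t\) force some \(\beta_j\) exactly to zero.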
Article
Full-text available
In many applications, indexing of high-dimensional data has become increasingly important. High-dimensional data are characterized by a very large number of dimensions; there can be thousands, if not millions, of dimensions in applications. Classical methods cannot analyse this kind of data set, so appropriate alternative methods are needed. In high-dimensional data sets, since the number of predictors is greater than the sample size, it is generally impossible to apply classical methods to fit an efficient model. A popular method for combating the challenge of the high-dimensionality curse is to solve a penalized least squares optimization problem, which combines the residual sum of squares loss function, measuring the goodness of fit of the model to the data, with penalization terms that promote the underlying structure. The penalized methods can thus analyse and provide a good fit for high-dimensional data sets. In this paper, we describe some of these approaches and then apply them to the eye data set to investigate their computational performance.
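Schematically (generic notation, assumed rather than quoted from the paper), the penalized least squares problems referred to above take the form

\[
\hat{\beta} = \arg\min_{\beta}\; \|y - X\beta\|_2^{2} + \sum_{j=1}^{p} p_{\lambda}(|\beta_j|),
\]

where \(p_{\lambda}(|\beta_j|) = \lambda|\beta_j|\) gives the lasso, \(p_{\lambda}(|\beta_j|) = \lambda\beta_j^{2}\) gives ridge regression, and nonconvex choices such as the SCAD penalty reduce the bias of large coefficients while still producing sparse solutions.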
... Two major approaches for improving explainability are implementing transparency in the ML models themselves and inferring explanations via post-hoc techniques [3]. For example, transparency can be directly introduced to ML models by incorporating simulatability (i.e., developing the model in such a way that the model can be simulated in the thoughts of a human analyzing the model output) [50] or decomposability (i.e., enabling understandability of each part of the model individually) [33]. ...
Preprint
Full-text available
Concept induction, which is based on formal logical reasoning over description logics, has been used in ontology engineering in order to create ontology (TBox) axioms from the base data (ABox) graph. In this paper, we show that it can also be used to explain data differentials, for example in the context of Explainable AI (XAI), and we show that it can in fact be done in a way that is meaningful to a human observer. Our approach utilizes a large class hierarchy, curated from the Wikipedia category hierarchy, as background knowledge.
... - Logistic regression (ridge [13], LASSO [19] or ElasticNet [21]) - Linear Support Vector Machine (SVM) [7] - Passive Aggressive algorithms [6] (PA-I and PA-II) ...
Preprint
Adaptive Multi-Agent Systems (AMAS) transform dynamic problems into problems of local cooperation between agents. We present smapy, an ensemble based AMAS implementation for mobility prediction, whose agents are provided with machine learning models in addition to their cooperation rules. With a detailed methodology, we propose a framework to transform a classification problem into a cooperative tiling of the input variable space. We show that it is possible to use linear classifiers for online non-linear classification on three benchmark toy problems chosen for their different levels of linear separability, if they are integrated in a cooperative Multi-Agent structure. The results obtained show a significant improvement of the performance of linear classifiers in non-linear contexts in terms of classification accuracy and decision boundaries, thanks to the cooperative approach.
... The objective is to create a wide variety of ML models, both inherently interpretable and not inherently interpretable, to allow for comparison. The two inherently interpretable models implemented are a decision tree [9] and a logistic regression model with L1 regularization (LASSO) [10]. The implemented models that are not inherently interpretable are an adaptive boosting classifier [11], a gradient boosting classifier [12], an XGBoost classifier [13], a random forest classifier [14], a support vector machine (SVM) [15], and a multilayer perceptron neural network [16]. ...
Preprint
Criminal recidivism models are tools that have gained widespread adoption by parole boards across the United States to assist with parole decisions. These models take in large amounts of data about an individual and then predict whether an individual would commit a crime if released on parole. Although such models are not the only or primary factor in making the final parole decision, questions have been raised about their accuracy, fairness, and interpretability. In this paper, various machine learning-based criminal recidivism models are created based on a real-world parole decision dataset from the state of Georgia in the United States. The recidivism models are comparatively evaluated for their accuracy, fairness, and interpretability. It is found that there are noted differences and trade-offs between accuracy, fairness, and being inherently interpretable. Therefore, choosing the best model depends on the desired balance between accuracy, fairness, and interpretability, as no model is perfect or consistently the best across different criteria.
... 9 The LASSO can mitigate potential collinearity among variables measured from the same patient and reduce over-fitting. 22 To identify the optimal tuning parameter lambda in LASSO regression, we performed 5-fold cross-validation and applied the 1-standard-error rule relative to the minimum-error criterion. Using the resulting lambda value, variables with nonzero coefficients in the model were selected. ...
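A minimal sketch of this tuning procedure (illustrative assumptions throughout; scikit-learn's LassoCV only returns the error-minimizing penalty, so the 1-standard-error rule is applied manually here):

    import numpy as np
    from sklearn.linear_model import LassoCV, Lasso

    rng = np.random.default_rng(2)
    X = rng.normal(size=(300, 40))
    y = X[:, :4] @ np.array([2.0, -1.5, 1.0, 0.5]) + rng.normal(size=300)

    cv = LassoCV(cv=5).fit(X, y)                       # 5-fold cross-validation over a lambda path
    mean_mse = cv.mse_path_.mean(axis=1)               # mean CV error per candidate lambda
    se_mse = cv.mse_path_.std(axis=1) / np.sqrt(5)     # standard error per candidate lambda
    best = mean_mse.argmin()

    # 1-SE rule: take the largest (most penalizing) lambda whose CV error is
    # within one standard error of the minimum.
    within = np.flatnonzero(mean_mse <= mean_mse[best] + se_mse[best])
    lam_1se = cv.alphas_[within].max()

    selected = np.flatnonzero(Lasso(alpha=lam_1se).fit(X, y).coef_ != 0)
    print("lambda (1-SE rule):", lam_1se, "selected variables:", selected)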
Article
Full-text available
Objective To develop an inflammation-based risk stratification tool for operative mortality in patients with acute type A aortic dissection. Methods Between January 1, 2016 and December 31, 2021, 3124 patients from Beijing Anzhen Hospital were included for derivation, 571 patients from the same hospital were included for internal validation, and 1319 patients from 12 other hospitals were included for external validation. The primary outcome was operative mortality according to the Society of Thoracic Surgeons criteria. Least absolute shrinkage and selection operator regression was used to identify clinical risk factors. A model was developed using different machine learning algorithms. The performance of the model was determined using the area under the receiver operating characteristic curve (AUC) for discrimination, and calibration curves and the Brier score for calibration. The final model (5A score) was compared against existing clinical scores. Results Extreme gradient boosting was selected for model training (5A score) using 12 predictor variables (the ratio of platelet to leukocyte count, creatinine level, age, hemoglobin level, prior cardiac surgery, extent of dissection extension, cerebral perfusion, aortic regurgitation, sex, pericardial effusion, shock, and coronary perfusion), which yielded the highest AUC (0.873 [95% confidence interval (CI) 0.845-0.901]). The AUC of the 5A score was 0.875 (95% CI 0.814-0.936), 0.845 (95% CI 0.811-0.878), and 0.852 (95% CI 0.821-0.883) in the internal, external, and total cohorts, respectively, which outperformed the best existing risk score (German Registry for Acute Type A Aortic Dissection score, AUC 0.709 [95% CI 0.669-0.749]). Conclusion The 5A score is a novel, internally and externally validated inflammation-based tool for risk stratification of patients before surgical repair, potentially advancing individualized treatment. Trial Registration clinicaltrials.gov Identifier: NCT04918108
Article
Early detection and treatment of Alzheimer’s Disease (AD) are significant. Recently, multi-modality imaging data have promoted the development of automatic diagnosis of AD. This paper proposes a method based on latent feature fusion to make full use of the information in multi-modality image data. Specifically, we learn a specific projection matrix for each modality by introducing a binary label matrix and local geometry constraints, and then project the original features of each modality into a low-dimensional target space. In this space, we fuse the latent feature representations of different modalities for AD classification. The experimental results on the Alzheimer’s Disease Neuroimaging Initiative database demonstrate the proposed method’s effectiveness in classifying AD.
Article
To enhance the sparseness of the network, improve its generalization ability and accelerate its training, we propose a novel pruning approach for sigma-pi-sigma neural network (SPSNN) under the relaxed condition by adding smoothing group L1/2 regularization and adaptive momentum. The main strength of this method is that it can prune both the redundant nodes between groups in the network, and also the redundant weights of the non-redundant nodes within the group, so as to achieve the sparseness of the network. Another strength is that the non-smooth absolute value function in the traditional L1/2 regularization method is replaced by a smooth function. This reduces the oscillations of learning and enables us to more effectively prove the convergence of the proposed algorithm. Finally, the numerical simulation results demonstrate the effectiveness of the proposed algorithm.
Article
Batteries are vital energy storage carriers in industry and in our daily life. There is continued interest in the developments of batteries with excellent service performance and safety. Traditional trial-and-error experimental approaches have the limitations of high-cost and low-efficiency. Atomistic computational simulations are relatively expensive and take long time to screen massive materials. The rapid development of machine learning (ML) has brought innovations in many fields and has also changed the paradigm of the battery research. Numerous ML applications have emerged in the battery community, such as novel materials discovery, property prediction, and characterization. In this review, we introduced the workflow of ML, where the task, data, feature engineering, and evaluation were involved. Several typical ML models used in batteries were highlighted. In addition, we summarized the applications of ML for the discovery of novel materials, and for property and battery state prediction. The challenges for the application of ML in batteries were also discussed.
Article
Carbon neutrality has been proposed as a solution for the current severe energy and climate crisis caused by the overuse of fossil fuels, and machine learning (ML) has exhibited excellent performance in accelerating related research owing to its powerful capacity for big data processing. This review presents a detailed overview of ML accelerated carbon neutrality research with a focus on energy management, screening of novel energy materials, and ML interatomic potentials (MLIPs), with illustrations of two selected MLIP algorithms: moment tensor potential (MTP) and neural equivariant interatomic potential (NequIP). We conclude by outlining the important role of ML in accelerating the achievement of carbon neutrality from global-scale energy management, unprecedented screening of advanced energy materials in massive chemical space, to the revolution of atomic-scale simulations of MLIPs, which has the bright prospect of applications.
Article
The Internet of Things (IoT) has altered daily life by allowing devices/things to be controlled over the Internet. IoT has provided many smart solutions for daily problems, transforming cyber‐physical systems (CPS) and other classical fields into smart regions. Most of the edge devices that make up the Internet of Things have very minimal processing power. To bring down the IoT network, attackers can utilize these devices to conduct a variety of network attacks. In addition, as more and more IoT devices are added, the potential for new and unknown threats grows exponentially. For this reason, an intelligent security framework for IoT networks must be developed that can identify such threats. In this paper, we have developed an unsupervised ensemble learning model that is able to detect new or unknown attacks in an IoT network from an unlabeled dataset. The system‐generated labeled dataset is used to train a deep learning model to detect IoT network attacks. Additionally, the research presents a feature selection mechanism for identifying the most relevant aspects in the dataset for detecting attacks. The study shows that the suggested model is able to identify the unlabeled IoT network datasets and that the DBN (Deep Belief Network) outperforms the other models, with a detection accuracy of 97.5% and a false alarm rate of 2.3% when trained using the labeled dataset supplied by the proposed approach.
Article
A group of variables are commonly seen in diagnostic medicine when multiple prognostic factors are aggregated into a composite score to represent the risk profile. A model selection method considers these covariates as all‐in or all‐out types. Model selection procedures for grouped covariates and their applications have thrived in recent years, in part because of the development of genetic research in which gene–gene or gene–environment interactions and regulatory network pathways are considered groups of individual variables. However, little has been discussed on how to utilize grouped covariates to grow a classification tree. In this paper, we propose a nonparametric method to address the selection of split variables for grouped covariates and their following selection of split points. Comprehensive simulations were implemented to show the superiority of our procedures compared to a commonly used recursive partition algorithm. The practical use of our method is demonstrated through a real data analysis that uses a group of prognostic factors to classify the successful mobilization of peripheral blood stem cells.
Chapter
In this paper, price analysis and prediction are carried out for shared homes in the Boston area on Airbnb, and price performance is examined through the processing and analysis of platform data. We found that not only the property itself and its geographical location, but also the landlord’s own situation, attitude, and professionalism have a great impact on the price. Among the models tested, gradient boosting fits the data best, with an RMSE of 169.53 and an R² of 0.26. Keywords: Machine learning, Statistical modelling, Data mining, Airbnb, Pricing
Article
The principle of causality constrains the real and imaginary parts of the complex modulus \(G^{*} = G^{\prime } + i G^{\prime \prime }\) via Kramers–Kronig relations (KKR). Thus, the consistency of observed elastic or storage (\(G^{\prime }\)) and viscous or loss (\(G^{\prime \prime }\)) moduli can be ascertained by checking whether they obey KKR. This is important when master curves of the complex modulus are constructed by transforming a number of individual datasets; for example, during time-temperature superposition. We adapt a recently developed statistical technique called the ‘Sum of Maxwell Elements using Lasso’ or SMEL test to assess the KKR compliance of linear viscoelastic data. We validate this test by successfully using it on real and synthetic datasets that follow and violate KKR. The SMEL test is found to be both accurate and efficient. As a byproduct, the parameters inferred during the SMEL test provide a noisy estimate of the discrete relaxation spectrum. Strategies to improve the quality and interpretability of the extracted discrete spectrum are explored by appealing to the principle of parsimony to first reduce the number of parameters, and then to nonlinear regression to fine tune the spectrum. Comparisons with spectra obtained from the open-source program pyReSpect suggest possible tradeoffs between speed and accuracy.
Article
Full-text available
Background The identification of baseline host determinants that associate with robust HIV-1 vaccine-induced immune responses could aid HIV-1 vaccine development. We aimed to assess both the collective and relative performance of baseline characteristics in classifying individual participants in nine different Phase 1-2 HIV-1 vaccine clinical trials (26 vaccine regimens, conducted in Africa and in the Americas) as High HIV-1 vaccine responders. Methods This was a meta-analysis of individual participant data, with studies chosen based on participant-level (vs. study-level summary) data availability within the HIV-1 Vaccine Trials Network. We assessed the performance of 25 baseline characteristics (demographics, safety haematological measurements, vital signs, assay background measurements) and estimated the relative importance of each characteristic in classifying 831 participants as High (defined as within the top 25th percentile among positive responders or above the assay upper limit of quantification) versus Non-High responders. Immune response outcomes included HIV-1-specific serum IgG binding antibodies and Env-specific CD4+ T-cell responses assessed two weeks post-last dose, all measured at central HVTN laboratories. Three variable importance approaches based on SuperLearner ensemble machine learning were considered. Findings Overall, 30.1%, 50.5%, 36.2%, and 13.9% of participants were categorized as High responders for gp120 IgG, gp140 IgG, gp41 IgG, and Env-specific CD4+ T-cell vaccine-induced responses, respectively. When including all baseline characteristics, moderate performance was achieved for the classification of High responder status for the binding antibody responses, with cross-validated areas under the ROC curve (CV-AUC) of 0.72 (95% CI: 0.68, 0.76) for gp120 IgG, 0.73 (0.69, 0.76) for gp140 IgG, and 0.67 (95% CI: 0.63, 0.72) for gp41 IgG. In contrast, the collection of all baseline characteristics yielded little improvement over chance for predicting High Env-specific CD4+ T-cell responses [CV-AUC: 0.53 (0.48, 0.58)]. While estimated variable importance patterns differed across the three approaches, female sex assigned at birth, lower height, and higher total white blood cell count emerged as significant predictors of High responder status across multiple immune response outcomes using Approach 1. Of these three baseline variables, total white blood cell count ranked highly across all three approaches for predicting vaccine-induced gp41 and gp140 High responder status. Interpretation The identified features should be studied further in pursuit of intervention strategies to improve vaccine responses and may be adjusted for in analyses of immune response data to enhance statistical power. Funding National Institute of Allergy and Infectious Diseases (UM1AI068635 to YH, UM1AI068614 to GDT, UM1AI068618 to MJM, and UM1 AI069511 to MCK), the Duke CFAR P30 AI064518 to GDT, and National Institute of Dental and Craniofacial Research (R01DE027245 to JJK). This work was also supported by the Bill and Melinda Gates Foundation. The content is solely the responsibility of the authors and does not necessarily represent the official views of any of the funding sources.
Preprint
Full-text available
BACKGROUND: Biomarkers provide a framework for a biological diagnosis of Alzheimer’s disease (AD), whereas polygenic risk scores (PRS) provide a method to estimate genetic risk. We derive biomarker-based PRS by incorporating endophenotype genetic risk relevant to amyloid, tau, neurodegeneration and cerebrovascular (A/T/N/V) pathology. METHODS: Endophenotype PRSs (PRS_A, PRS_T, PRS_N, PRS_V) and combined PRSs (PRS_AT, PRS_ATNV) were generated using the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data. Prediction performance of the PRSs was assessed in terms of dementia risk, age at onset (AAO) and longitudinal change of 14 important AD biomarkers. RESULTS: PRS_A and PRS_T explained more amyloid and tau variability than the combined PRSs (CSF amyloid: R² for PRS_A = 9.22%; CSF tau: R² for PRS_T = 6.37%; CSF ptau: R² for PRS_T = 7.10%). Combined PRSs explained more neurodegeneration-related variability (R² for PRS_ATNV range: 1.22%-4.20%) and were strong predictors of dementia risk (HR and OR p-values < 8.3e-03) and AAO (correlation of predicted vs. observed AAO: r_AT = 0.76). CONCLUSIONS: PRS_A and PRS_T are AD-specific, while combined PRSs are linked to neurodegeneration in general. Biomarker-derived PRSs provide mechanistic insights beyond aggregate disease susceptibility, supporting the development of precision medicine for dementia.
Article
Data science is the foundation of our modern world. It underlies applications used by billions of people every day, providing new tools, forms of entertainment, economic growth, and potential solutions to difficult, complex problems. These opportunities come with significant societal consequences, raising fundamental questions about issues such as data quality, fairness, privacy, and causation. In this book, four leading experts convey the excitement and promise of data science and examine the major challenges in gaining its benefits and mitigating its harms. They offer frameworks for critically evaluating the ingredients and the ethical considerations needed to apply data science productively, illustrated by extensive application examples. The authors' far-ranging exploration of these complex issues will stimulate data science practitioners and students, as well as humanists, social scientists, scientists, and policy makers, to study and debate how data science can be used more effectively and more ethically to better our world.
Article
Cervical and anal carcinoma are neoplastic diseases with various intraepithelial neoplasia stages. The underlying mechanisms for cancer initiation and progression have not been fully revealed. DNA methylation has been shown to be aberrantly regulated during tumorigenesis in anal and cervical carcinoma, revealing the important roles of DNA methylation signaling as a biomarker to distinguish cancer stages in clinics. In this research, several machine learning methods were used to analyze the methylation profiles on anal and cervical carcinoma samples, which were divided into three classes representing various stages of tumor progression. Advanced feature selection methods, including Boruta, LASSO, LightGBM, and MCFS, were used to select methylation features that are highly correlated with cancer progression. Some methylation probes including cg01550828 and its corresponding gene RNF168 have been reported to be associated with human papilloma virus-related anal cancer. As for biomarkers for cervical carcinoma, cg27012396 and its functional gene HDAC4 were confirmed to regulate the glycolysis and survival of hypoxic tumor cells in cervical carcinoma. Furthermore, we developed effective classifiers for identifying various tumor stages and derived classification rules that reflect the quantitative impact of methylation on tumorigenesis. The current study identified methylation signals associated with the development of cervical and anal carcinoma at qualitative and quantitative levels using advanced machine learning methods.
Chapter
Sparse optimization for the identification of the parametric model structure of an ARX model is equivalent to estimation of the parameter vector. This paper discusses the sparse estimation of output error models using a nonconvex penalty. The number of parameters that define the model is not known a priori, and the non-convexity of the penalty function makes the search for the true model computationally intensive. The sparse estimation technique relaxes the assumption on the order of the model and directly estimates the optimum model based on a validation criterion. The Lagrangian equivalent of the estimation problem is constructed, and the parameters are estimated using genetic algorithm optimization. To address the non-convexity of the penalty function, the \(l_q\) penalty with \(0 < q < 1\) is considered.
Article
Full-text available
A crucial problem in building a multiple regression model is the selection of predictors to include. The main thrust of this article is to propose and develop a procedure that uses probabilistic considerations for selecting promising subsets. This procedure entails embedding the regression setup in a hierarchical normal mixture model where latent variables are used to identify subset choices. In this framework the promising subsets of predictors can be identified as those with higher posterior probability. The computational burden is then alleviated by using the Gibbs sampler to indirectly sample from this multinomial posterior distribution on the set of possible subset choices. Those subsets with higher probability—the promising ones—can then be identified by their more frequent appearance in the Gibbs sample.
Article
Full-text available
The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses – the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferroni-type procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.
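A minimal sketch of the step-up procedure described above (illustrative only): with ordered p-values \(p_{(1)} \le \dots \le p_{(n)}\), find the largest \(k\) such that \(p_{(k)} \le (k/n)\,q\) and reject the \(k\) hypotheses with the smallest p-values.

    import numpy as np

    def benjamini_hochberg(pvals, q=0.05):
        # Return a boolean array marking hypotheses rejected at FDR level q.
        pvals = np.asarray(pvals, dtype=float)
        n = pvals.size
        order = np.argsort(pvals)                          # ascending p-values
        thresholds = q * np.arange(1, n + 1) / n           # (k/n) * q for k = 1..n
        below = np.flatnonzero(pvals[order] <= thresholds)
        rejected = np.zeros(n, dtype=bool)
        if below.size:                                     # reject the k smallest p-values
            rejected[order[: below.max() + 1]] = True
        return rejected

    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.2, 0.9]))
    # rejects the two smallest p-values at q = 0.05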
Article
We propose a new method for estimation in linear models. The ‘lasso’ minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree‐based models are briefly described.
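A minimal sketch of the zeroing behaviour described above (not from the paper; the data and penalty values are arbitrary), contrasting the lasso with ridge regression:

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 10))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)   # only 2 relevant predictors

    lasso = Lasso(alpha=0.1).fit(X, y)
    ridge = Ridge(alpha=1.0).fit(X, y)
    print("lasso coefficients set exactly to zero:", (lasso.coef_ == 0).sum())   # most of the 8 irrelevant ones
    print("ridge coefficients set exactly to zero:", (ridge.coef_ == 0).sum())   # typically none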
Article
A new method, called the nonnegative (nn) garrote, is proposed for doing subset regression. It both shrinks and zeroes coefficients. In tests on real and simulated data, it produces lower prediction error than ordinary subset selection. It is also compared to ridge regression. If the regression equations generated by a procedure do not change drastically with small changes in the data, the procedure is called stable. Subset selection is unstable, ridge is very stable, and the nn-garrote is intermediate. Simulation results illustrate the effects of instability on prediction error.
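In symbols (the standard two-stage statement of the nonnegative garrote, with generic notation), the method scales the ordinary least squares estimates \(\hat{\beta}_j^{\mathrm{OLS}}\) by nonnegative factors:

\[
\min_{c_1,\dots,c_p \ge 0}\; \sum_{i=1}^{n}\Bigl(y_i - \sum_{j=1}^{p} c_j\,\hat{\beta}_j^{\mathrm{OLS}} x_{ij}\Bigr)^{2}
\quad \text{subject to} \quad \sum_{j=1}^{p} c_j \le s,
\qquad \tilde{\beta}_j = c_j\,\hat{\beta}_j^{\mathrm{OLS}},
\]

so coefficients with \(c_j = 0\) are removed and the rest are shrunk, which is the "shrinks and zeroes" behaviour described above.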
Article
We discuss the following problem: given a random sample \(X = (X_1, X_2, \ldots, X_n)\) from an unknown probability distribution \(F\), estimate the sampling distribution of some prespecified random variable \(R(X, F)\) on the basis of the observed data \(x\). (Standard jackknife theory gives an approximate mean and variance in the case \(R(X, F) = \theta(\hat{F}) - \theta(F)\), with \(\theta\) some parameter of interest.) A general method, called the “bootstrap”, is introduced and shown to work satisfactorily on a variety of estimation problems. The jackknife is shown to be a linear approximation method for the bootstrap. The exposition proceeds by a series of examples: variance of the sample median, error rates in a linear discriminant analysis, ratio estimation, estimating regression parameters, etc.
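A minimal sketch of the nonparametric bootstrap for one of the examples listed above, the variance of the sample median (the observed sample and number of resamples are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.exponential(scale=2.0, size=50)       # the observed sample (n = 50)

    B = 2000                                      # number of bootstrap resamples
    medians = np.array([
        np.median(rng.choice(x, size=x.size, replace=True))   # resample x with replacement
        for _ in range(B)
    ])
    print("bootstrap estimate of Var(sample median):", medians.var(ddof=1))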
Article
Markov chain Monte Carlo methods for Bayesian computation have until recently been restricted to problems where the joint distribution of all variables has a density with respect to some fixed standard underlying measure. They have therefore not been available for application to Bayesian model determination, where the dimensionality of the parameter vector is typically not fixed. This paper proposes a new framework for the construction of reversible Markov chain samplers that jump between parameter subspaces of differing dimensionality, which is flexible and entirely constructive. It should therefore have wide applicability in model determination problems. The methodology is illustrated with applications to multiple change-point analysis in one and two dimensions, and to a Bayesian comparison of binomial experiments.
Compressed sensing. Available from http
  • D Donoho
  • Stanford
Coordinate descent procedures for lasso penalized regression
  • T Wu
  • K Lange
Stability selection (with discussion)
  • N Meinshausen
  • P Bühlmann
Better subset selection using the non-negative garotte
  • Breiman