Article

The Analysis and Selection of Variables in Linear Regression

... The advancement of computing power in the early 1960s provided significant impetus to research in this area. The majority of early research was conducted by statisticians and focused on linear regression; for example, Hocking [3] conducted a literature review on variable selection for linear regression. Variable selection research has since expanded to classification and clustering problems as well. ...
... It is also possible to specify the number of iterations and to obtain the computed prediction performance of the model [83]. A small fraction of the existing feature set is then randomly included or excluded, together with a parameter c that controls how quickly features are perturbed. Following the calculation of the acceptance probability, a random uniform value is generated. ...
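The excerpt above outlines a stochastic, simulated-annealing-style feature search: a small fraction of the current feature subset is randomly flipped, an acceptance probability is computed, and a random uniform draw decides whether the perturbed subset is kept. The following sketch is only a generic illustration of that loop under assumed details; the scoring function, the flipped fraction, and the exact role of the parameter c are not specified in the excerpt.

```python
import numpy as np

def perturb(mask, frac, rng):
    """Randomly flip a small fraction of the feature-inclusion mask."""
    mask = mask.copy()
    n_flip = max(1, int(frac * mask.size))
    idx = rng.choice(mask.size, size=n_flip, replace=False)
    mask[idx] = ~mask[idx]
    return mask

def anneal_select(score, n_features, n_iter=200, frac=0.05, c=0.01, seed=0):
    """Simulated-annealing-style feature search (illustrative sketch).

    score(mask) -> higher is better (e.g., cross-validated prediction performance).
    c controls how quickly worse subsets stop being accepted.
    """
    rng = np.random.default_rng(seed)
    current = rng.random(n_features) < 0.5          # random initial subset
    best, s_cur = current.copy(), None
    s_cur = score(current)
    s_best = s_cur
    for t in range(1, n_iter + 1):
        cand = perturb(current, frac, rng)
        s_cand = score(cand)
        # Acceptance probability: always accept improvements; otherwise
        # accept with a probability that shrinks as iterations proceed.
        accept_p = 1.0 if s_cand >= s_cur else np.exp((s_cand - s_cur) / (c * n_iter / t))
        if rng.uniform() < accept_p:                # compare with a random uniform value
            current, s_cur = cand, s_cand
            if s_cur > s_best:
                best, s_best = current.copy(), s_cur
    return best, s_best
```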
... Big data are defined as "a dataset whose size exceeds the capability of typical dataset management systems in gathering, storing, processing, and analyzing." It usually has three characteristics: huge volume, wide variety, and rapid change [1-3]. The challenge posed by these 3V characteristics, namely volume, variety, and velocity, has become the focus of learning methods when dealing with extensive data. ...
Article
Full-text available
Feature selection is employed to reduce feature dimensions and computational complexity by eliminating irrelevant and redundant features. Ever-increasing volumes of data and their processing generate many feature sets, which are reduced by the feature selection process to improve performance in classification, regression, and clustering models. This research performs a detailed analysis of motivation and concentrates on the fundamental architecture of feature selection. The study aims to establish a structured framework relating popular methods such as filter, wrapper, and embedded approaches to search strategies, evaluation criteria, and learning methods. The benefits and drawbacks of the different methods are compared, together with multiple classification algorithms and standard validation measures. The diversity of applications in multiple domains, such as data retrieval, prediction analysis, and medical, intrusion, and industrial applications, is highlighted. The study also covers additional feature selection methods for handling big data. Nonetheless, new challenges have surfaced in the analysis of such data, which are also addressed in this study. Reflecting on commonly encountered challenges and clarifying how to choose the most suitable feature selection method are significant components of this study.
... This approach yields several benefits: (a) it allows a better understanding of the different techniques, (b) it allows the combination of different ranking and number-selection schemes, and (c) it produces a more complete view of the variable selection problem. We consider five different ranking methods and also compare the results to the classical ranking method based on p-values [9], [17]. Moreover, we apply the best sequence search (keeping fixed the number of variables). ...
... They allow us to perform a more robust analysis, as discussed in Section IV. As an additional final check on the obtained results, we also apply a classical ranking method based on p-values [9], [17]. In Section III-B, we also describe the best sequence search (keeping fixed the number M of variables in the sequence) and an alternating optimization technique for obtaining the optimal sequence. ...
... More specifically, although we will see that RM1 and RM2 provide the best performance in terms of prediction error, the results of RM3, RM4 and RM5 reveal other important aspects shown by the rest of our analyses below. Moreover, to complete our view, we will also apply a classical ranking method based on p-values [9], [17], and show the results in Table IV. ...
Article
Full-text available
In the last decade, soundscapes have become one of the most active topics in Acoustics, providing a holistic approach to the acoustic environment, which involves human perception and context. Soundscape-elicited emotions are central, yet substantially subtle and often unnoticed (compared to those elicited by speech or music). Currently, soundscape emotion recognition is a very active topic in the literature. We provide an exhaustive variable selection study (i.e., a selection of the soundscape indicators) on a well-known dataset (Emo-Soundscapes). We consider linear soundscape emotion models for two soundscape descriptors: arousal and valence. Several ranking schemes and procedures for selecting the number of variables are applied. We have also performed an alternating optimization scheme for obtaining the best sequences while keeping a certain number of features fixed. Furthermore, we have designed a novel technique based on Gibbs sampling, which provides a more complete and clear view of the relevance of each variable. Finally, we have also compared our results with the analysis obtained by the classical methods based on p-values. As a result of our study, we suggest two simple and parsimonious linear models of only 7 and 16 variables (out of 122 possible features) for the two outputs (arousal and valence), respectively. The suggested linear models provide very good and competitive performance, with $R^{2}>0.86$ and $R^{2}>0.63$ (values obtained after a cross-validation procedure), respectively.
... It started with a complete (saturated) model and gradually removed variables at each step to obtain a condensed model that best explained the data. The stepwise technique is beneficial since it reduces multicollinearity, identifies the significant predictors, and addresses overfitting of the model to the data [36]. The final model for each simulator sickness domain was determined using the Akaike information criterion (AIC) [36]. ...
... The stepwise technique is beneficial since it reduces multicollinearity, identifies the significant predictors, and addresses overfitting of the model to the data [36]. The final model for each simulator sickness domain was determined using the Akaike information criterion (AIC) [36]. Statistical significance was set to an alpha level of 0.05. ...
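As a concrete illustration of the backward-elimination-with-AIC procedure described in these excerpts, here is a minimal sketch using statsmodels; the stopping rule (stop when no single removal lowers the AIC) and the data layout are assumptions for illustration, not the cited study's exact implementation.

```python
import statsmodels.api as sm

def backward_stepwise_aic(X, y):
    """Start from the full (saturated) model and drop, at each step, the
    predictor whose removal lowers the AIC the most; stop when no removal helps."""
    selected = list(X.columns)
    current_aic = sm.OLS(y, sm.add_constant(X[selected])).fit().aic
    while selected:
        trials = []
        for col in selected:
            reduced = [c for c in selected if c != col]
            aic = sm.OLS(y, sm.add_constant(X[reduced])).fit().aic
            trials.append((aic, col))
        best_aic, drop_col = min(trials)      # removal that gives the lowest AIC
        if best_aic < current_aic:            # removal improves the model
            selected.remove(drop_col)
            current_aic = best_aic
        else:
            break
    return selected, current_aic
```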
Article
Full-text available
Highly autonomous vehicles (HAV) have the potential of improving road safety and providing alternative transportation options. Given the novelty of HAVs, high-fidelity driving simulators operating in an autonomous mode are a great way to expose transportation users to HAV prior to HAV adoption. In order to avoid the undesirable effects of simulator sickness, it is important to examine whether factors such as age, sex, visual processing speed, and exposure to an acclimation scenario predict simulator sickness in driving simulator experiments designed to replicate the HAV experience. This study identified predictors of simulator sickness provocation across the lifespan (N = 210). Multiple stepwise backward regressions identified that slower visual processing speed predicts the Nausea and Dizziness domains, whereas age did not predict any domain. Neither sex nor exposure to an acclimation scenario predicted any of the four domains of simulator sickness provocation, namely Queasiness, Nausea, Dizziness, and Sweatiness. No attrition occurred in the study due to simulator sickness; thus, the study suggests that high-fidelity driving simulators may be a viable way to introduce drivers across the lifespan to HAV, a strategy that may enhance future HAV acceptance and adoption.
... In such regression models, there are generally two aims: to determine which of the independent variables thought to affect the dependent variable influence it the most, and to predict the value of the dependent variable with the help of the independent variables identified as affecting it (Alpar, 2013). Hocking (1976) points to six potential uses of the regression equations given by Mallows (1973). These are: providing a good explanation of the variation of the dependent variable; using the analysis for estimating the dependent variable and for forecasting; determining confidence intervals for the parameters; providing parameter estimates; providing control of a process at different levels of data input; and supporting the development of the true model (Fox, 2015). ...
... This is obtained through tests of the individual partial regression coefficients. Hocking (1976) stated that there is more than one selection criterion in the all-possible-subsets method. Two of them are Mallows' Cp and the coefficient of determination R² (Rao, 1998). ...
Book
Full-text available
... The choice of variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some prespecified criterion (Hocking, 1976). Application of this approach to biomarker selection can be found in Xiong et al. (2001) and Lu et al. (2020). ...
... Within these approaches, the choice of variables is carried out by a procedure that, at each step, considers adding a variable to or removing one from the set of explanatory variables according to a prespecified criterion (Hocking, 1976). Applications of this approach to biomarker selection can be found in Xiong et al. (2001) and Lu et al. (2020). ...
Thesis
With the genomic revolution and the new era of precision medicine, the identification of biomarkers that are informative (i.e. active) for a response (endpoint) is becoming increasingly important in clinical research. These biomarkers are beneficial to better understand the progression of a disease (prognostic biomarkers) and to better identify patients more likely to benefit from a given treatment (predictive biomarkers). Biomarker data (e.g. genomics, transcriptomics, and proteomics) usually have a high-dimensional nature, with the number of measured biomarkers (variables) much larger than the sample size. However, only a fraction of biomarkers is truly active, hence the need for variable selection. Among various statistical learning approaches, regularized methods such as the Lasso have become very popular for high-dimensional variable selection due to their statistical and numerical performance. However, their selection consistency is not guaranteed when the biomarkers are highly correlated. Throughout my PhD, several novel regularized approaches were developed to perform variable selection in this challenging context. More precisely, four methods were proposed in different statistical models (linear regression model, ANCOVA-type model, and logistic regression model). The main idea is to remove the correlations by whitening the design matrix. For one of the methods, results on sign consistency were established under mild conditions. The proposed approaches were evaluated through simulation studies and applications on publicly available datasets. The results suggest that our approaches outperform the compared methods for selecting prognostic and predictive biomarkers in high-dimensional (correlated) settings. Three of our methods are implemented in the R packages WLasso, PPLasso, and WLogit, available from CRAN (the Comprehensive R Archive Network).
... The model was selected using Mallows' Cp statistic [30], which should be lower than or equal to p + 1 (where p is the number of independent variables included in the model) to avoid bias due to the omission of relevant explanatory variables. When this criterion was satisfied, the model with the minimum value of the statistics, namely the Sp statistic (SP) [31], Final Prediction Error (JP) [31,32], Amemiya's Prediction Criterion (PC) [32,33] and Akaike's Information Criterion (AIC) [34], was selected. After selecting the independent variables to be included in the multiple linear regression (MLR) models, the parameter estimation was performed using PROC REG. ...
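For reference, Mallows' Cp for a subset with p predictors is commonly computed as Cp = SSE_p / s² − (n − 2(p + 1)), where s² is the mean squared error of the full model, and the excerpt's screening rule keeps models with Cp ≤ p + 1. The sketch below is a generic illustration of that computation, not the SAS workflow used in the cited study; the variable names and the exhaustive search over small subsets are assumptions.

```python
import numpy as np
from itertools import combinations

def mallows_cp(X, y, subset, s2_full):
    """Mallows' Cp for a subset of predictor columns.

    Cp = SSE_p / s2_full - (n - 2*(p + 1)), with p predictors plus an intercept.
    """
    n = len(y)
    Xs = np.column_stack([np.ones(n), X[:, subset]])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    sse = np.sum((y - Xs @ beta) ** 2)
    p = len(subset)
    return sse / s2_full - (n - 2 * (p + 1))

def screen_subsets(X, y, max_size=3):
    """Keep candidate subsets satisfying Cp <= p + 1, as in the excerpt above."""
    n, k = X.shape
    X_full = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X_full, y, rcond=None)
    s2_full = np.sum((y - X_full @ beta) ** 2) / (n - k - 1)   # full-model MSE
    keep = []
    for size in range(1, max_size + 1):
        for subset in combinations(range(k), size):
            cp = mallows_cp(X, y, list(subset), s2_full)
            if cp <= size + 1:
                keep.append((subset, cp))
    return keep
```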
Article
Full-text available
The aim of this study was to assess and validate, using independent data, the prediction equations obtained to estimate in vivo carcass composition using bioelectrical impedance analysis (BIA) to determine the nutrient retention and overall energy and nitrogen retention efficiencies of growing rabbits. Seventy-five rabbits grouped into five different ages (25, 35, 49, 63 and 77 days) were used in the study. A four-terminal body-composition analyzer was applied to obtain resistance (Rs, Ω) and reactance (Xc, Ω) values. All the animals were stunned and bled at each selected age, and the chilled carcasses were analyzed to determine water, fat, crude protein (CP), ash and gross energy (GE). Multiple linear regression analysis was conducted to determine the equations, using body weight, length and impedance data as independent variables. The coefficients of determination (R2) to estimate the content of water, protein, fat and ash in grams, and energy in megajoules (MJ), were: 0.99, 0.99, 0.95, 0.96 and 0.98, respectively, and the relative mean prediction errors (RMPE) were: 4.20, 5.48, 21.9, 9.10 and 6.77%, respectively. Carcass yield (%) estimation had values of 0.50 and 10.0 for R2 and RMPE, respectively. When water content was expressed as a percentage, the R2 and RMPE were 0.79 and 1.62%, respectively. When the protein, fat and ash were expressed as a percentage of dry matter (%DM) and the energy content as kJ/100 g DM, the R2 values were 0.68, 0.76, 0.66 and 0.82, respectively, and the RMPEs were 3.22, 10.5, 5.82 and 2.54%, respectively. Energy Retention Efficiency was 20.4 ± 7.29%, 21.0 ± 4.18% and 20.8 ± 2.79% from 35 to 49, from 49 to 63 and from 35 to 63 d, respectively. Nitrogen Retention Efficiency was 46.9 ± 11.7%, 34.5 ± 7.32% and 39.1 ± 3.23% for the same periods. Energy was retained in body tissues for growth with an efficiency of approximately 52.5%, and the energy efficiency for protein and fat retention was 33.3 and 69.9%, respectively. This work shows that BIA is a good, non-invasive method to estimate in vivo carcass composition and to determine the nutrient retention of growing rabbits from 25 to 77 days of age.
... To calculate the proportion of protein level variability explained by pQTL SNPs, we first refined the search for significantly associated SNPs in a multivariable model following the approach in [34,35]. Specifically, among the top 50 univariate SNP-protein level associations, the best set of SNPs, in an explanatory sense, was selected using a stepwise regression procedure [36], fixing a threshold α = 1 × 10⁻⁴ for the optimal statistically significant subset of SNPs included in the model. The best set of statistically significant SNPs was then included in the multivariable LMM model formulated as in Equation (7), and the respective marginal proportion of protein level variability explained by these SNPs was calculated using the marginal R² statistic as defined by Nakagawa and Schielzeth [37], where σ²_F is the variance of the fixed-effects component (i.e., sex) and σ²_SNPs is the variance of the significant SNPs. ...
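For context, the marginal R² of Nakagawa and Schielzeth expresses the variance explained by fixed effects as a fraction of the total variance. A plausible form for the partition described above (a fixed effect for sex, the selected SNPs, a random component, and residual noise) is sketched below; the exact decomposition used in the cited paper may differ:

$$
R^{2}_{\mathrm{SNPs}} = \frac{\sigma^{2}_{\mathrm{SNPs}}}{\sigma^{2}_{F} + \sigma^{2}_{\mathrm{SNPs}} + \sigma^{2}_{u} + \sigma^{2}_{\varepsilon}}
$$

where σ²_u denotes the random-effect variance and σ²_ε the residual variance.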
... HE regression highlighted 13 out of 56 MS-related proteins with unadjusted p-value ≤ 0.05 under the null hypothesis of null h²: Gc, Anxa1, Plat, Sod1, Irf8, Ptger4, Fadd, Il-7, Mmp8, Pdgfa, Il-21, Tnfsf13, Il7, and Apex1. Among these 13 proteins, 7, i.e., Gc (h² = 0.77, 95%CI: 0.36 ...
Article
Full-text available
This work aimed at estimating narrow-sense heritability, defined as the proportion of the phenotypic variance explained by the sum of additive genetic effects, via Haseman–Elston regression for a subset of 56 plasma protein levels related to Multiple Sclerosis (MS). These were measured in 212 related individuals (with 69 MS cases and 143 healthy controls) obtained from 20 Sardinian families with MS history. Using pedigree information, we found seven statistically significant heritable plasma protein levels (after multiple testing correction), i.e., Gc (h2 = 0.77; 95%CI: 0.36, 1.00), Plat (h2 = 0.70; 95%CI: 0.27, 0.95), Anxa1 (h2 = 0.68; 95%CI: 0.27, 1.00), Sod1 (h2 = 0.58; 95%CI: 0.18, 0.96), Irf8 (h2 = 0.56; 95%CI: 0.19, 0.99), Ptger4 (h2 = 0.45; 95%CI: 0.10, 0.96), and Fadd (h2 = 0.41; 95%CI: 0.06, 0.84). A subsequent analysis was performed on these statistically significant heritable plasma protein levels employing Immunochip genotyping data obtained in 155 healthy controls (92 related and 63 unrelated); we found a meaningful proportion of heritable plasma protein levels’ variability explained by a small set of SNPs. Overall, the results obtained, for these seven MS-related proteins, emphasized a high additive genetic variance component explaining plasma levels’ variability.
... In the main text, we outlined two variable selection procedures: an exhaustive best subset selection for models with fewer than 15 species and a forward stepwise selection algorithm for more complex models. Other common algorithms such as ridge regression [87] and the Least Absolute Shrinkage and Selection Operator (LASSO) are frequently applied in a wide variety of machine learning and model validation contexts [13, 17, 88-90]. These methods seek to reduce the influence of extraneous parameters via L2 or L1 regularization, respectively. ...
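As a minimal, generic illustration of the L2 and L1 penalties mentioned in this excerpt (a scikit-learn sketch with placeholder penalty strengths, not the cited ecological workflow):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                  # 20 candidate predictors
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.5, 1.0]                # only three truly active predictors
y = X @ beta_true + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)              # L2: shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)              # L1: drives some coefficients to exactly zero

print("ridge nonzero coefficients:", int(np.sum(np.abs(ridge.coef_) > 1e-8)))
print("lasso nonzero coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-8)))
```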
Preprint
Full-text available
The Cassini spacecraft discovered that Saturn's moon Enceladus possesses a series of jets erupting from its South Polar Terrain. Previous studies of in situ data collected by Cassini's Ion and Neutral Mass Spectrometer (INMS) have identified H$_2$O, CO$_2$, CH$_4$, H$_2$, and NH$_3$ within the plume of ejected material. Identification of minor species in the plume remains an ongoing challenge, owing to the large number of possible combinations that can be used to fit the INMS data. Here, we present the discovery of several new compounds of strong importance to the habitability of Enceladus, including HCN, CH$_2$O, C$_2$H$_2$, and C$_3$H$_6$. Our analyses of the low velocity INMS data coupled with our detailed statistical framework enable discriminating between previously ambiguous species in the plume by alleviating the effects of high-dimensional model fitting. Together with plausible mineralogical catalysts and redox gradients derived from surface radiolysis, these compounds could potentially support extant microbial communities or drive complex organic synthesis leading to the origin of life.
... The concrete values of 23 descriptors in 64 study basins are shown in Tables A1-A3. Because of the multicollinearity among the 23 basin descriptors selected, stepwise regression [38] is performed to analyze the significant impacts of the descriptors on the 13 percentile flows in the above-selected basins. It determines the predictors (basin descriptors) individually and identifies a set of predictors with the lowest Akaike Information Criterion (AIC). ...
Article
Full-text available
Flow duration curves (FDCs) that represent the streamflow regime through an empirical relationship between the FDC parameters and basin descriptors are widely adopted for hydrologic applications. However, the applications of this method are highly dependent on the availability of observation data. Hence, it is still of great significance to explore the process controls underpinning regional patterns in streamflow regimes. In this study, we developed a new regionalization method of FDCs to solve the problem of runoff prediction for ungauged mountainous basins. Five empirical equations (power, exponential, logarithmic, quadratic, and cubic) were used to fit the observed FDCs in the 64 mountainous basins in eastern China, and the power model outperformed the other models. Stepwise regression was used to explore the differentiated control of 23 basin descriptors on the 13 percentile flows of FDCs, and seven descriptors remained as independent variables for further developing the regional FDCs. Application results with different combinations of these selected descriptors showed that five indices, i.e., average annual rainfall (P), average elevation (H), average gradient (β), average topographic index (TI), and maximum 7d of annual rainfall (Max7d), were the main control factors of FDCs in these areas. Through the regional method, we found that 95.31% of all the basins have NSE values greater than 0.60 and ε (namely the relative mean square error) values less than 20%. In conclusion, our study can guide runoff predictions to help manage booming demands for water resources and hydropower developments in mountainous areas.
... Covariates were incorporated until no further improvement of the model was obtained. Non-significant variables were eliminated to avoid over-parameterization (Hocking, 1976), which could dilute other effects. The model that minimized the variance of the residuals was chosen as the most appropriate and was considered a robust estimate when there was suspicion of heteroscedasticity. ...
Article
Full-text available
International reports show more positive academic and drop-out results in neighboring Portugal than in Spain, but comparisons should be considered carefully. Data which reflect students’ own perceptions of pedagogical and psychological variables significant for learning are needed. The goal of this study was to compare two similar groups of students in Portugal and Spain in relation to their academic self-efficacy, self-regulated learning, and cooperative learning. An ex post facto research design was followed. A total of 1619 students (816 Portuguese, 795 Spanish) enrolled in 27 different schools in Spain and Portugal participated. Ages varied between 12 and 17 years. The only condition to participate was having experienced cooperative learning in the last six months. The multivariate general linear model showed significant differences based on country, sex and age. Portuguese students scored significantly higher in interpersonal skills, group processing and positive interdependence, while Spanish students scored higher in individual accountability, academic self-efficacy and self-regulated learning prior, during and after. Women scored significantly higher in all the variables except academic self-efficacy, where there were no differences. Regarding age, as it increases, the scores decrease in promotive interaction, academic self-efficacy and self-regulated learning prior, during and after. Finally, the generalized linear model showed that group processing and the three dimensions of self-regulated learning predicted academic self-efficacy. In conclusion, Portuguese students perceived that cooperative learning was more intensely promoted in their classes. The Spanish students showed stronger academic self-efficacy and self-regulated learning, which contradicts the poorer results obtained in the latest PISA reports. These students could be subject to the “Dunning-Kruger” effect and not be aware of the knowledge they lack.
... 2) Backward stepwise regression starts with all variables and gradually removes independent variables in order to find a reduced model that best explains the data. It is the reverse of forward stepwise regression [19]. 3) A combination of forward and backward stepwise regression is frequently used in practice. ...
Article
Growing evidence shows that there is an increased risk of cardiovascular diseases among gout patients, especially coronary heart disease (CHD). Screening for CHD in gout patients based on simple clinical factors is still challenging. Here we aim to build a diagnostic model based on machine learning so as to avoid missed diagnoses or excessive examinations as much as possible. Over 300 patient samples collected from Jiangxi Provincial People's Hospital were divided into two groups (gout and gout+CHD). The prediction of CHD in gout patients has thus been modeled as a binary classification problem. A total of eight clinical indicators were selected as features for machine learning classifiers. A combined sampling technique was used to overcome the imbalance in the training dataset. Eight machine learning models were used, including logistic regression, decision tree, ensemble learning models (random forest, XGBoost, LightGBM, GBDT), support vector machine (SVM) and neural networks. Our results showed that stepwise logistic regression and SVM achieved better AUC values, while the random forest and XGBoost models achieved better performance in terms of recall and accuracy. Furthermore, several high-risk factors were found to be effective indices for predicting CHD in gout patients, which provide insights into the clinical diagnosis.
... These methods typically under-perform when a high number of features is provided (due to problem under-specification and curse of dimensionality), such as in the high-throughput biological data, and therefore need to be preceded by some form of feature engineering method. Some methods build signatures by means of single-feature scoring methods 8,9 (e.g. inferential testing for two-class comparison) but these approaches could fail even in simple 2-dimensional situations. ...
Article
Full-text available
One of the main objectives of high-throughput genomics studies is to obtain a low-dimensional set of observables—a signature—for sample classification purposes (diagnosis, prognosis, stratification). Biological data, such as gene or protein expression, are commonly characterized by an up/down regulation behavior, for which discriminant-based methods could perform with high accuracy and easy interpretability. To get the most out of these methods, feature selection is even more critical, but it is known to be an NP-hard problem, and thus most feature selection approaches focus on one feature at a time (k-best, Sequential Feature Selection, recursive feature elimination). We propose DNetPRO, Discriminant Analysis with Network PROcessing, a supervised network-based signature identification method. This method implements a network-based heuristic to generate one or more signatures out of the best performing feature pairs. The algorithm is easily scalable, allowing efficient computing for a high number of observables (10^3–10^5). We show applications on real high-throughput genomic datasets in which our method outperforms existing results, or is compatible with them but with a smaller number of selected features. Moreover, the geometrical simplicity of the resulting class-separation surfaces allows a clearer interpretation of the obtained signatures in comparison to nonlinear classification models.
... Therefore, as can be seen in Table 6, the same set of predictors has been used to ensure fair comparisons between the zero-inflated models under evaluation. However, the apparent real-world application has been finalized using the model-building technique of backward stepwise regression (Efroymson, 1960; Hocking, 1976), performed on the proposed ZIPA model, which facilitates interpretation of the results. In this manner, our strategy comprises three basic steps. ...
... After determining the variables in the model, the parameters of the linear regression function are estimated, and the quality of the regression is assessed by the determination index. Additional variables are gradually added to the model as the proportion of explained variability in the values of the quantity increases (Hocking, 1976;Christensen, 2002). ...
... This is a stepwise regression method that starts with a full (saturated) model and gradually removes variables to find a reduced model that best fits the data at each step (Hocking, 1976). Backward Elimination is also known as Backward Stepwise ...
Research
Full-text available
An unpublished MSc. thesis submitted to School of Engineering, Computing and Mathematics, University of Plymouth, Plymouth, United Kingdom. September 2022.
... This is a stepwise regression method that starts with a full (saturated) model and gradually removes variables to find a reduced model that best fits the data at each step (Hocking, 1976). Backward Elimination is also known as Backward Stepwise ...
... A low value of these metrics implies less information loss and therefore higher model quality. A set of five main statistical criteria was employed: the Akaike Information Criterion (AIC) (Akaike, 1969), the Bayesian Information Criterion (BIC) (Findley, 1991), the Amemiya Prediction Criterion (APC) (Amemiya, 1980), Hocking's Sp (HSP) (Hocking, 1976), and Sawa's Bayesian Information Criterion (SBIC) (Sawa, 1978). ...
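To make the listed criteria concrete, the sketch below computes AIC, BIC, the Amemiya Prediction Criterion, and Hocking's Sp from the residuals of a least-squares fit. The formulas follow common textbook forms (Sawa's BIC is omitted); constant terms and whether p counts the intercept vary across references and software packages, so the exact values may differ from any particular implementation.

```python
import numpy as np

def selection_criteria(y, y_hat, p):
    """Common model-selection criteria for a regression with p fitted parameters
    (including the intercept) and n observations.

    Definitions follow common textbook forms; conventions differ across packages.
    """
    n = len(y)
    sse = np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2)   # residual sum of squares
    aic = n * np.log(sse / n) + 2 * p                        # Akaike Information Criterion
    bic = n * np.log(sse / n) + p * np.log(n)                # Bayesian Information Criterion
    apc = (sse / n) * (n + p) / (n - p)                      # Amemiya Prediction Criterion
    hsp = sse / ((n - p) * (n - p - 1))                      # Hocking's Sp
    return {"AIC": aic, "BIC": bic, "APC": apc, "HSP": hsp}
```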
Article
It is unquestionable that time series forecasting is of paramount importance in many fields. The most used machine learning models to address time series forecasting tasks are Recurrent Neural Networks (RNNs). Typically, those models are built using one of the three most popular cells, ELMAN, Long Short-Term Memory (LSTM), or Gated Recurrent Unit (GRU) cells; each cell has a different structure and implies a different computational cost. However, it is not clear why and when to use each RNN-cell structure. Actually, there is no comprehensive characterization of all the possible time series behaviors and no guidance on what RNN cell structure is the most suitable for each behavior. The objective of this study is two-fold: it presents a comprehensive taxonomy of all time series behaviors (deterministic, random-walk, nonlinear, long-memory, and chaotic), and it provides insights into the best RNN cell structure for each time series behavior. We conducted two experiments: (1) The first experiment evaluates and analyzes the role of each component in the LSTM-Vanilla cell by creating 11 variants based on one alteration in its basic architecture (removing, adding, or substituting one cell component). (2) The second experiment evaluates and analyzes the performance of 20 possible RNN-cell structures. To evaluate, compare, and select the best model, different statistical metrics were used: error-based metrics, information criterion-based metrics, a naïve-based metric, and a direction change-based metric. To further improve our confidence in the models’ interpretation and selection, the Friedman Wilcoxon-Holm signed-rank test was used. Our results showed that the MUT2, SCRN, and ELMAN cells are the most recommended to forecast time series data with deterministic, random-walk, and nonlinear behaviors, respectively, whereas the MGU-SLIM2 and LSTM-SLIM3 cells are the most suitable for the long-memory and chaotic behaviors, respectively.
... A number of dimension reduction approaches have been used in ecology and related fields to reduce many potential predictor variables to a subset of variables with high explanatory and predictive power. Popular examples include stepwise regression (Hocking, 1976) and all-subsets regression (Miller, 2002), and both are widely available in several R packages (R Core Team, 2022); examples include 'step' in stats, 'stepAIC' in MASS (Venables & Ripley, 2002), 'dredge' in MuMIn (Bartoń, 2020), and 'regsubsets' in leaps (Miller, 2020). Both stepwise and all-subsets regression have widely documented shortcomings, including violating assumptions about multiple hypothesis testing (Whittingham et al., 2006; Mundry & Nunn, 2009) and the potential to identify spurious correlations (Olden & Jackson, 2000; Anderson et al., 2001), but they continue to be widely used. ...
Article
Full-text available
Using multi-species time series data has long been of interest for estimating inter-specific interactions with vector autoregressive models (VAR) and state space VAR models (VARSS); these methods are also described in the ecological literature as multivariate autoregressive models (MAR, MARSS). To date, most studies have used these approaches on relatively small food webs where the total number of interactions to be estimated is relatively small. However, as the number of species or functional groups increases, the length of the time series must also increase to provide enough degrees of freedom with which to estimate the pairwise interactions. To address this issue, we use Bayesian methods to explore the potential benefits of using regularized priors, such as the Laplace and regularized horseshoe priors, for estimating interspecific interactions with VAR and VARSS models. We first perform a large-scale simulation study, examining the performance of alternative priors across various levels of observation error. Results from these simulations show that for sparse matrices, the regularized horseshoe prior minimizes the bias and variance across all inter-specific interactions. We then apply the Bayesian VAR model with regularized priors to output from a large marine food web model (37 species) from the west coast of the USA. Results from this analysis indicate that regularization improves the predictive performance of the VAR model, while still identifying important inter-specific interactions.
... Although various methodologies have been developed to make statistical inference on the aforementioned simplex mixed-effects models, little work has been performed for the variable selection of simplex mixed-effects models. Classical model-selection methods, such as the step-wise selection method [11], the model comparison via Bayes factor [12], the Akaike information criterion [13] and Deviance information criterion [14], are often used to identify the important covariates in regression analysis; however, these approaches are generally computationally intensive and unstable for complicated mixed models with many covariates. On the other hand, the regularization (penalization) method has increasingly become a popular tool for conducting variable selection in regression analysis. ...
Article
Full-text available
In the development of simplex mixed-effects models, the random effects are generally assumed to follow a normal distribution. The normality assumption may be violated in an analysis of skewed and multimodal longitudinal data. In this paper, we adopt the centered Dirichlet process mixture model (CDPMM) to specify the random effects in the simplex mixed-effects models. Combining the block Gibbs sampler and the Metropolis–Hastings algorithm, we extend the Bayesian Lasso (BLasso) to simultaneously estimate the unknown parameters of interest and select important covariates with nonzero effects in semiparametric simplex mixed-effects models. Several simulation studies and a real example are employed to illustrate the proposed methodologies.
... These criteria are based on the principle of parsimony, which suggests selecting a model with a small residual sum of squares and as few parameters as possible. Hocking (1976) reviewed eight model selection criteria, while Bendel and Afifi (1977) also compared eight criteria, though not all the same as Hocking's. A selection criterion is an index that can be computed for each candidate model and used to compare models (Kleinbaum et al. 1987). ...
... The method used was the standard stepwise regression, which is a combination of the stepwise forward and the stepwise backward procedures (Draper and Smith, 1998;Hocking, 1976). It essentially consists of a series of iterations in which, step-by-step, the explanatory variables are included in or excluded from the model, according to a statistical criterion (i.e., F-test statistical significance), to achieve an optimal model (Reimann et al, 2008). ...
Article
Bioavailability of some major and trace elements was evaluated in 1,993 topsoil samples collected across the Campania region (Southern Italy). A main focus was placed on Al, Ca, K, Mg, Cu, and Tl since they are linked, for different reasons, to agriculture. Bioavailability was assessed by an extraction with ammonium nitrate, and the data were compared with the pseudo-total concentration determined by Aqua Regia digestion. Geochemical maps of the pseudo-total and bioavailable concentrations were generated using a multifractal inverse distance weighted (MIDW) interpolation. In addition, the spatial distribution patterns of the percent bioavailability of elements, based on the ratio between the bioavailable and pseudo-total fractions, were also determined. The median value of the percent bioavailability showed the order Ca>K>>Mg≃Tl>>Cu>>Al, which represents a positive finding in terms of both agricultural productivity and environmental quality. Further, a multiple linear regression was finally applied to the data to unveil any dependence of the bioavailable fraction on the pseudo-total content of elements. The grain size distribution and organic matter content of the samples were later included to evaluate their possible role in promoting the environmental availability of elements. The pseudo-total concentrations of Al, Ca, K, and Mg alone proved poorly able to predict the variability of the bioavailable fraction. The addition of the grain size distribution and organic matter content to the models expanded the predictive capability for Ca, K, and Mg, whereas only a marginal improvement was shown for Al, Cu, and Tl. This study represents a methodological contribution to a better understanding of the processes underlying the spatial variability of chemical elements in soil. Considering the positive outcomes obtained, further research is planned to include more variables (e.g. soil pH, redox potential, content of iron and manganese oxides, etc.) in the predictive models.
... In machine learning, neural networks have emerged as one of the most important methods for prediction. Applying neural networks to prediction on real-life data is challenging due to the massive size of the data, its high dimensionality, and the presence of seasonal variations [28]. ...
Article
Full-text available
Among the challenges in Industrial Revolution 4.0 is managing organizations' talents, especially ensuring that the right person can be selected for the position. This study introduces a predictive approach for talent identification in the sport of netball using individual player qualities in terms of physical fitness, mental capacity, and technical skills. A data mining approach is proposed using three data mining algorithms: Decision Tree (DT), Neural Network (NN), and Linear Regression (LR). All the models are then compared based on the Relative Absolute Error (RAE), Mean Absolute Error (MAE), Relative Square Error (RSE), Root Mean Square Error (RMSE), and Coefficient of Determination (R²). The findings are presented and discussed in light of early talent spotting and selection. Generally, LR has the best performance in terms of MAE and RMSE, as it has the lowest values among the three models.
... Instead of regressing on the original variables, in principal component regression the PCs, which are orthogonal by construction, are used as predictors to address multicollinearity. There also exist approaches to reduce the number of PCs based on principal component regression (see, e.g., [9,16]). ...
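A minimal sketch of principal component regression as described above, using scikit-learn; the synthetic data and the number of retained components are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=200)     # deliberately collinear predictors
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=200)

# Regress on orthogonal principal components instead of the raw (collinear) variables.
pcr = make_pipeline(StandardScaler(), PCA(n_components=4), LinearRegression())
pcr.fit(X, y)
print("R^2 on training data:", pcr.score(X, y))
```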
Article
Full-text available
Principal loading analysis is a dimension reduction method that discards variables which have only a small distorting effect on the covariance matrix. As a special case, principal loading analysis discards variables that are not correlated with the remaining ones. In multivariate linear regression, on the other hand, predictors that are correlated neither with the remaining predictors nor with the dependent variables have regression coefficients equal to zero. Hence, if the goal is to select a number of predictors, variables that do not correlate are discarded, as is also done in principal loading analysis. However, the two methods select the same variables not only in the special case of zero correlation. We contribute conditions under which both methods share the same variable selection. Further, we extend those conditions to provide a choice for the threshold in principal loading analysis, which so far only follows recommendations based on simulation results.
... The association of these variables (independent variables) with case/control status (dependent dichotomous variable; value labels: case = 1, control = 0) was ascertained by binomial logistic regression. The model was adjusted using a stepwise procedure (method: forward; Wald test); the significance levels for the variables to enter and to be removed were p ≤ 0.05 and p ≥ 0.10, respectively [37]. Nagelkerke R 2 was used to estimate how much variation in the dependent variable can be explained by the model. ...
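A hedged sketch of the stepwise procedure described above (forward entry at p ≤ 0.05, removal at p ≥ 0.10, Wald tests) for a binomial logistic regression, using statsmodels; this is a generic illustration, not the exact routine used in the cited study.

```python
import statsmodels.api as sm

def forward_wald_stepwise(X, y, p_enter=0.05, p_remove=0.10):
    """Forward selection with backward checks, based on Wald-test p-values."""
    selected = []
    remaining = list(X.columns)
    while remaining:
        # Try each remaining variable and record its Wald p-value when added.
        pvals = {}
        for col in remaining:
            model = sm.Logit(y, sm.add_constant(X[selected + [col]])).fit(disp=0)
            pvals[col] = model.pvalues[col]
        best = min(pvals, key=pvals.get)
        if pvals[best] > p_enter:
            break
        selected.append(best)
        remaining.remove(best)
        # Removal step: drop any entered variable that is no longer significant.
        model = sm.Logit(y, sm.add_constant(X[selected])).fit(disp=0)
        drop = [c for c in selected if model.pvalues[c] >= p_remove]
        for c in drop:
            selected.remove(c)
            remaining.append(c)
    return selected
```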
Article
Full-text available
Blastocystis sp. is known to be the most prevalent parasite in fecal samples of humans worldwide. In the present report, a case–control study (1:9.89 (≈10)) was performed, by analyzing data from 3682 patients who attended a public hospital in the northern area of Spain showing gastrointestinal symptoms. Diagnosis was performed in human fecal samples by means of optical microscopy. The prevalence of Blastocystis sp. in patients with gastrointestinal symptoms was 9.18% (338/3682). Most of the Blastocystis sp.-infected patients tested negative for protozoa and helminths, and were underweight and foreign-born (26.4%), mainly from Africa and Central/South America. Gastrointestinal symptoms, such as abdominal pain, anorexia, halitosis, plus relative eosinophilia, as well as co-infections with pathogenic bacteria were associated with Blastocystis sp. infection. Both type 2 diabetes and treatment with immunosuppressive medicines at the time of Blastocystis sp. detection were associated with a higher proportion of infected patients. This is the first case–control study of Blastocystis sp. in humans in northern Spain and may contribute to surveillance and intervention strategies by public health authorities.
... In this model, saccade amplitude, saccade velocity, fixation duration, proportion of regressive saccades, microsaccade rate, microsaccade amplitude, and line-changing duration were entered as independent variables, with reading speed being the dependent variable. Specifically, we performed stepwise linear regression (Hocking, 1976) to determine the degree to which these independent variables contribute to prediction of reading speed. Independent variables were added to the model based on the degree to which they explained the dependent variable (all p < 0.001). ...
Article
Full-text available
Degraded viewing conditions caused by either natural environments or visual disorders lead to slow reading. Here, we systematically investigated how eye movement patterns during reading are affected by degraded viewing conditions in terms of spatial resolution, contrast, and background luminance. Using a high-speed eye tracker, binocular eye movements were obtained from 14 young normally sighted adults. Images of text passages were manipulated with varying degrees of background luminance (1.3-265 cd/m2), text blur (severe blur to no blur), or text contrast (2.6%-100%). We analyzed changes in key eye movement features, such as saccades, microsaccades, regressive saccades, fixations, and return-sweeps across different viewing conditions. No significant changes were observed for the range of tested background luminance values. However, with increasing text blur and decreasing text contrast, we observed a significant decrease in saccade amplitude and velocity, as well as a significant increase in fixation duration, number of fixations, proportion of regressive saccades, microsaccade rate, and duration of return-sweeps. Among all, saccade amplitude, fixation duration, and proportion of regressive saccades turned out to be the most significant contributors to reading speed, together accounting for 90% of variance in reading speed. Our results together showed that, when presented with degraded viewing conditions, the patterns of eye movements during reading were altered accordingly. These findings may suggest that the seemingly deviated eye movements observed in individuals with visual impairments may be in part resulting from active and optimal information acquisition strategies operated when visual sensory input becomes substantially deprived.
... In order to complete this approach, a Stepwise Regression method with Backward Elimination is applied [25]. This method is based on the successive and iterative elimination of features, but the elimination is performed on the basis of a model fit criterion. ...
Article
Full-text available
Growth models of uneven-aged forests on the diameter class level can support silvicultural decision making. Machine learning brings added value to the modeling of dynamics at the stand or individual tree level based on data from permanent plots. The objective of this study is to explore the potential of machine learning for modeling growth dynamics in uneven-aged forests at the diameter class level based on inventory data from practice. Two main modeling approaches are conducted and compared: (i) fine-tuned linear models differentiated per diameter class, (ii) an artificial neural network (multilayer perceptron) trained on all diameter classes. The models are trained on the inventory data of the Canton of Neuchâtel (Switzerland), which are area-wide data without individual tree-level growth monitoring. Both approaches produce convincing results for predicting future diameter distributions. The linear models perform better at the individual diameter class level with test R2 typically between 50% and 70% for predicting increments in the numbers of stems at the diameter class level. From a methodological perspective, the multilayer perceptron implementation is much simpler than the fine-tuning of linear models. The linear models developed in this study achieve sufficient performance for practical decision support.
... Stepwise Linear Regression, or Stepwise Multiple Linear Regression, is similar in approach to ordinary MLR, but differs in how the independent variables enter the overall equation [30]. In this technique, the predicting variables are either added to or removed from the overall group of explanatory variables [31]. ...
Thesis
Full-text available
The continuously escalating climate and environmental problems force stakeholders to switch to renewable and zero-emission energy sources very rapidly. This introduces much uncertainty and complexity into power system operations. Thus, the need for electric load forecasting increases at a high rate. The advent of smart technologies and infrastructures, such as the smart grid and smart meters, floods power grid operators with previously unseen amounts of information. That data can then be used beneficially for power demand prediction by applying Artificial-Intelligence-based computation technologies. Machine learning, being one such method, can be purposefully implemented for load forecasting. This master thesis reviews the possibilities of using machine learning techniques for load forecasting on real data taken from smart meters. As it happens, the dataset has real-life imperfections, such as missing data, outliers and errors. Taking all of that into consideration, the data processing is presented. This paper begins with Chapter 1, introducing the field of Artificial Intelligence in power system applications. It covers the main notions, such as the smart grid, smart meters, machine learning, deep learning, Big Data and other relevant technologies. The motivation and relevance are discussed in this and the beginning of the next section. Chapter 2 covers the theoretical basics of the methods used for machine learning load forecasting. The learning algorithms and their working principles are explained, as well as the related techniques. The focus of this section is to give the necessary background information on the implemented machine learning methods, in order to better relate to the reader their application in the prediction of power demand. In other words, this Chapter gives the state of the art of machine learning load forecasting. Chapter 3 explains the related work and research on the topic of this thesis. The ideas introduced by researchers worldwide are related in this section. They are classified based on the main finding and the model highlighted the most in the related works. The final Chapter 4 shows the experimental part of this master thesis. All models, built upon the contributions of the previous sections, are implemented on the real data. The experiments are conducted in two ways, based on the data splitting method. The results of the load forecasting are discussed and an outlook is given. This thesis ends by summarizing the findings, giving the bibliography of the used literature, and providing the Appendix with the programming code in MATLAB(TM) and Python written for this research.
... Stepwise regression (SR) uses an automatic process for choosing predictive variables 19,20. Here, the main approaches include forward selection, backward elimination and bidirectional elimination. ...
Article
Full-text available
This research introduces a new combined modelling approach for mapping soil salinity in the Minab plain in southern Iran. This study assessed the uncertainty (with 95% confidence limits) and interpretability of two deep learning (DL) models, a deep Boltzmann machine (DBM) and a one-dimensional convolutional neural network (1DCNN)-long short-term memory (LSTM) hybrid model (1DCNN-LSTM), for mapping soil salinity by applying DeepQuantreg and game theory (Shapley Additive exPlanations (SHAP) and the permutation feature importance measure (PFIM)), respectively. Based on stepwise forward regression (SFR), a technique for controlling factor selection, 18 of 47 potential controls were selected as effective factors. Inventory maps of soil salinity were generated based on 476 surface soil samples collected for measuring electrical conductivity (ECe). Based on Taylor diagrams, both DL models performed well (RMSE < 20%), but the 1DCNN-LSTM hybrid model performed slightly better than the DBM model. The uncertainty ranges associated with the ECe values predicted by both models, estimated using DeepQuantreg, were similar (0–25 dS/m for the 1DCNN-LSTM hybrid model and 2–27 dS/m for the DBM model). Based on the SFR and the PFIM (permutation feature importance measure), a measure in game theory, four controls (evaporation, sand content, precipitation and vertical distance to channel) were selected as the most important factors for soil salinity in the study area. The results of SHAP (Shapley Additive exPlanations), the second measure used in game theory, suggested that five factors (evaporation, vertical distance to channel, sand content, cation exchange capacity (CEC) and digital elevation model (DEM)) have the strongest impact on model outputs. Overall, the methodology used in this study is recommended for applications in other regions for mapping environmental problems.
... To assess the statistical significance of predictors of perineal dose, multiple linear regression analysis was used. Stepwise regression, in the forward, backward and both directions (R function stepAIC), was used to identify potential predictors in the model [16]. Decisions to include/exclude model predictors at each step were made by comparing the Akaike Information Criterion (AIC) values between models with and without the relevant predictor. ...
Article
Full-text available
Introduction This study investigated the relationship between anatomical compression introduced via ultrasound probe pressure and maximum perineum dose in prostate radiotherapy patients using the Clarity transperineal ultrasound (TPUS) system. Methods 115 patient ultrasound and computed tomography scans were retrospectively analysed. The probe to prostate apex distance (PPA), probe to inferior corpus spongiosum distance (PICS) and maximum perineum dose were calculated. Compression was represented by the PICS and the calculated corpus to prostate ratio (CPR). Demographics included treatment technique, image quality, body mass index (BMI) and age. Multiple linear regression analysis assessed the relationship between compression measures and perineum dose. Results The maximum dose to the perineum ranged from 1.81 to 45.56 Gy, with a median of 5.87 Gy (interquartile range (IQR) 3.17). The PICS distance and CPR recorded were 1.67 cm (IQR 0.63) and 0.51 (range 0.29–0.85), respectively. Regression analysis demonstrated that both PICS and CPR were significant predictors of maximum dose to the perineum (p < 0.001). Patient-specific factors, including age, BMI, treatment technique and ultrasound image quality, did not significantly impact the maximum perineum dose. Conclusion There was a statistically significant association between increased anatomical compression and perineal dose measurements. A PICS of 1.2 cm or greater is recommended, with compression reduced as much as possible without losing anatomical US definition. Future investigations would be beneficial to evaluate the optimal balance between ultrasound image quality and transducer compression, considering the perineum dose.
... At this stage, we are only interested in whether the regressors are statistically significant or not. This approach has a point of contact with the classical stepwise procedure of Efroymson [31], which starts with no regressors and adds them one at a time according to their partial F-statistics (see Hocking [32]) until either all regressors are included or no excluded regressor's partial F-statistic is statistically significant. This procedure converges (see Miller [33]). ...
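For contrast with the AIC-based search above, here is a hedged sketch of the Efroymson-style forward step described in this excerpt, where the entry criterion is the partial F-statistic of each excluded regressor. The variable names and the F-to-enter rule (a fixed significance level alpha) are illustrative assumptions; the full Efroymson procedure also tests whether previously entered variables should be dropped.

```python
# Forward selection driven by the partial F-statistic: at each step the excluded
# variable with the largest partial F is added, provided its F-to-enter is significant.
import numpy as np
from scipy import stats

def rss(X, y):
    """Residual sum of squares of an OLS fit with intercept (X may have zero columns)."""
    Z = np.column_stack([np.ones(len(y)), X]) if X.shape[1] else np.ones((len(y), 1))
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return float(np.sum((y - Z @ beta) ** 2))

def forward_partial_f(X, y, alpha=0.05):
    n, k = X.shape
    selected = []
    while len(selected) < k:
        rss_cur = rss(X[:, selected], y)
        best = None
        for j in set(range(k)) - set(selected):
            cols = selected + [j]
            rss_new = rss(X[:, cols], y)
            df_resid = n - len(cols) - 1
            F = (rss_cur - rss_new) / (rss_new / df_resid)     # partial F for adding x_j
            if best is None or F > best[1]:
                best = (j, F, stats.f.sf(F, 1, df_resid))
        if best is None or best[2] > alpha:                    # no significant F-to-enter
            break
        selected.append(best[0])
    return selected
```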
Article
Full-text available
For the linear model y = Xβ + ε, where the number of regressors (p) exceeds the number of observations (n), the Elastic Net (EN) was proposed, in 2005, to estimate β. The EN uses both the Lasso, proposed in 1996, and ordinary Ridge Regression (RR), proposed in 1970, to estimate β. However, when p > n, using only RR to estimate β has not been considered in the literature thus far. Because RR is based on the least-squares framework, only using RR to estimate β is computationally much simpler than using the EN. We propose a generalized ridge regression (GRR) algorithm, a superior alternative to the EN, for estimating β as follows: partition the regressors from left to right so that every partition, but the last one, has 3 observations per regressor; for each partition, estimate β with the regressors in that partition using ordinary RR; retain the regressors with statistically significant t-ratios and the corresponding RR tuning parameter, by partition; use the retained regressors and tuning-parameter values to re-estimate β by GRR across all partitions, which yields the final estimate. Algorithmic efficacy is compared using 4 metrics by simulation, because the algorithm is mathematically intractable. Three metrics, with their probabilities of RR's superiority over EN in parentheses, are: the proportion of true regressors discovered (99%); the squared distance, from the true coefficients, of the significant coefficients (86%); and the squared distance, from the true coefficients, of estimated coefficients that are both significant and true (74%). The fourth metric is the probability that none of the regressors discovered are true, which for RR and EN is 4% and 25%, respectively. This indicates the additional advantage RR has over the EN in terms of discovering causal regressors.
... A multiple linear regression was conducted using the forward stepwise method to determine whether EF tasks accounted for significant variance in gross motor skills (locomotor, OC, and GMQ), adjusting for sex, age, maternal education, socioeconomic status, quality of the home environment, and quality of the school environment. Forward stepwise regression was used because it is an appropriate analysis when you have many variables and are interested in identifying a useful subset of the predictors [98]. The F Statistic (probability of F) was used to determine whether a variable should be included in the model. ...
Article
Full-text available
Background: Preschool age (3-5 years old) is a crucial period for children to acquire gross motor skills and develop executive functions (EFs). However, the association between the qualitative gross motor skills and EFs remains unknown in preschoolers, especially among overweight and obese children. Methods: This was a cross-sectional, exploratory, and quantitative study carried out on 49 preschool children, divided into two subgroups according to their body mass index (overweight/obese: 24; eutrophic [normal weight]: 25). The mean age was 4.59 years. More than half of the sample were boys (55%) and most of the mothers had completed high school (67%) and were class C socioeconomic level (63%). Gross motor skills were assessed using the Test of Gross Motor Development-2, while EFs were evaluated using Semantic verbal fluency (SVF), Tower of Hanoi (TH), Day/Night Stroop, and Delayed Gratification tests. Multiple linear regression models adjusted for sex, age, maternal education, socioeconomic status, quality of the home environment, and quality of the school environment using the stepwise method were executed, considering the cognitive tasks as independent variables and gross motor skills as dependent variable. Results: The overweight/obese preschoolers showed worse locomotor skills than their eutrophic peers and below average gross motor quotient (GMQ). Overweight/obese girls performed worse in OC skills than boys with excess weight. SVF (number of errors) and TH (rule breaks) explained 57.8% of the variance in object control (OC) skills and 40.5% of the variance in GMQ (p < .05) in the overweight/obese children. Surprisingly, there was no significant association between any of the EF tasks and gross motor skills in the eutrophic children. Conclusion: A relationship between EF tasks (number of errors in SVF and rule breaks in TH) and gross motor skills (OC and GMQ) was demonstrated in the overweight/obese preschoolers, indicating that worse cognitive flexibility, working memory, planning, and problem solving are associated with worse gross motor skills in this population when compared to eutrophic children.
... Another aspect of this study is to select the best variables that explain the obtained model. To do this, for the Random Forest method, the weight of each variable in the meteorological visibility prediction model is evaluated by the "Forward Selection" method [18]. This method tests the effect of adding a new variable on the prediction accuracy, using a regression criterion. ...
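A minimal sketch, assuming a recent scikit-learn, of the kind of forward selection the excerpt describes: candidate predictors are added one at a time and kept only while they improve a cross-validated regression score. The estimator settings, tolerance and variable matrix are illustrative placeholders, not the authors' actual setup.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SequentialFeatureSelector

rf = RandomForestRegressor(n_estimators=300, random_state=0)
selector = SequentialFeatureSelector(
    rf,
    n_features_to_select="auto",     # grow the subset...
    tol=1e-3,                        # ...until the CV score stops improving by more than tol
    direction="forward",
    scoring="neg_mean_squared_error",
    cv=5,
)
# X: matrix of microwave-link attenuation and weather variables, y: visibility distance
# selector.fit(X, y)
# kept = selector.get_support()      # boolean mask of the retained variables
```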
Article
Full-text available
Accurately measuring meteorological visibility is an important factor in road, sea, rail, and air transportation safety, especially under visibility-reducing weather events. This paper deals with the application of Machine Learning methods to estimate meteorological visibility in dusty conditions, from the power levels of commercial microwave links and weather data including temperature, dew point, wind speed, wind direction, and atmospheric pressure. Three well-known Machine Learning methods are investigated: Decision Trees, Random Forest, and Support Vector Machines. The correlation coefficient and the mean square error between the visibility distances estimated by the Machine Learning methods and those provided by the Burkina Faso weather services are computed. Except for the SVM method, all the other methods give a correlation coefficient greater than 0.90. The Random Forest method presents the best result both in terms of correlation coefficient (0.97) and mean square error (0.60). For this last method, the best variables that explain the model are selected by evaluating the weight of each variable in the model. The best performance is obtained by considering the attenuation of the microwave signal and the dew point.
... This included France, Québec and Belgium. Alternately for each dichotomous wellbeing indicator (job satisfaction, poor work/life balance, poor subjective health, poor mental health, life satisfaction), and for each country sample, we ran forward stepwise regression analyses to automatically select a parsimonious set of important wellbeing factors according to the national context [55]. It is noteworthy that the results of any stepwise regression procedure should not be overstated because of methodological issues [56], but as our approach consists of a preliminary scoping of teachers' wellbeing determinants, this intuitive procedure remains informative. ...
Article
Full-text available
To highlight effective levers to promote teachers’ wellbeing worldwide, particularly during difficult times such as the COVID-19 pandemic, we investigated work-related factors associated with teacher wellbeing, across borders and cultures. In six countries/territories, we examined the factors that were most consistently and strongly associated with two indicators of wellbeing at work: (i) job satisfaction; and (ii) work/life balance, and three indicators of general wellbeing: (i) subjective health; (ii) mental health; and (iii) life satisfaction. Between May and July 2021, after 18 months of the pandemic, 8000 teachers answered the first edition of the International Barometer of Education Personnel’s Health and Wellbeing (I-BEST): 3646 teachers from France, 2349 from Québec, 1268 from Belgium, 302 from Morocco, 222 from The Gambia, and 215 from Mexico. For each country/territory and each wellbeing indicator, we used a forward stepwise regression procedure to identify important determinants among a carefully selected set of 31 sociodemographic, private, and professional life factors. Aside from healthcare access, the factors most consistently and strongly associated with teacher wellbeing in France, Québec and Belgium (samples whose sizes were ≥1000) were related to the psychosocial and the organizational dimensions of work, namely: feeling of safety at school, autonomy at work, and the quality of relationships with superiors and with students. In the smaller samples of teachers from the three remaining countries (Morocco, The Gambia and Mexico), exploratory analyses showed that the feeling of safety and autonomy at work were, there too, consistently associated with wellbeing indicators. During the COVID-19 pandemic, the factors most consistently associated with teachers’ wellbeing across countries were related to security and autonomy at work, supporting the importance of considering these aspects in a continuous, structural way at school. Factors associated with teachers’ wellbeing in very different contexts require further cross-cultural study.
... Training details: To refine the manually designed features, we perform feature selection in a similar fashion to forward and backward stepwise selection [24]. In the forward method, we add features if doing so improves the negative log-likelihood on the validation set. ...
Preprint
Full-text available
Recent developments in self-supervised learning give us the possibility to further reduce human intervention in multi-step pipelines where the focus revolves around particular objects of interest. In the present paper, the focus lies on the nuclei in histopathology images. In particular, we aim at extracting cellular information in an unsupervised manner for a downstream task. As nuclei present themselves in a variety of sizes, we propose a new scale-dependent convolutional layer to bypass scaling issues when resizing nuclei. On three nuclei datasets, we benchmark the following methods: handcrafted, pre-trained ResNet, supervised ResNet and self-supervised features. We show that the proposed convolution layer boosts performance and that this layer combined with Barlow Twins allows for better nuclei encoding compared to the supervised paradigm in the low-sample setting and outperforms all other proposed unsupervised methods. In addition, we extend the existing TNBC dataset to incorporate nuclei class annotation in order to enrich and publicly release a small-sample-setting dataset for nuclei segmentation and classification.
... The root mean square error (RMSE) equation and the VAF value are given below. In these equations, y is the measured value, ŷ is the predicted value, and N is the number of data points (Hocking, 1976). The results of this study showed that the determination of brittleness indices B3, B4 and BI is possible using n and Is50 (bivariate regression) and PMP (univariate regression). ...
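The two formulas referenced in the excerpt are not reproduced there; the standard forms of the RMSE and the variance accounted for (VAF), consistent with the description of y, ŷ and N above, are sketched here (the cited paper's exact typesetting may differ).

```latex
% Standard forms of the two goodness-of-fit measures named above (a sketch;
% the cited paper's exact typesetting is not reproduced here).
\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i-\hat{y}_i\right)^{2}},
\qquad
\mathrm{VAF}=\left[1-\frac{\operatorname{var}\left(y-\hat{y}\right)}{\operatorname{var}\left(y\right)}\right]\times 100\%
```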
Article
Brittleness is an important parameter that controls the mechanical behavior and fracture characteristics of rocks in drilling and rock bursting. A lack of accurate evaluation of rock brittleness can lead to many risks related to rock mechanics. In this paper, the relationship between brittleness and the ratio of point load index to porosity (PMP) was investigated for Hamedan limestone. In addition, existing estimation methods for the brittleness index of rock are summarized and their application is briefly discussed. In order to estimate brittleness indices and the ratio of point load index to porosity, 18 blocks of Abshineh and Sarab Gyan limestone were chosen; the Abshineh limestones are of Oligo-Miocene age, while the Sarab Gyan limestones are extracted from Cretaceous rocks. First, thin sections of the limestone were examined. Then the physical (porosity, n) and mechanical properties of the limestones were determined. The samples were subjected to point load index (Is50), uniaxial compressive strength (UCS), and Brazilian tensile strength (BTS) tests, and the ratio of point load to porosity (PMP) was calculated. The relationship between brittleness indices and PMP (univariate regression) was then checked, and the relationship between the brittleness indices and the two input variables porosity and point load index (bivariate regression) was also determined. Finally, the results of the different types of relationships were compared. The results illustrate that using the PMP parameter to predict the values of brittleness indices gives more reliable results than the bivariate regression on n and Is50. The experiments also showed that the strongest agreement is between brittleness parameters B3 and B4 and the PMP parameter, with coefficients of determination (r²) of 0.89 and 0.90, respectively.
... Using a hybrid version of the forward and backward stepwise method (Hocking, 1976; Tabachnick and Fidell, 2001), in addition to FI, smoking habit, alcohol consumption, age, sex, and MMSE score were considered. We evaluated several logistic regression models with the following dependent variables: the presence of VBD, its degree, and its predominance in specific brain regions. ...
Article
Purpose An association between frailty and vascular brain damage (VBD) has been described in older adults. However, most studies have identified frailty according to the phenotypic model. It is less clear whether frailty, operationalized as an accumulation of health deficits, is associated with the presence and severity of VBD. The present study was therefore undertaken to verify whether a 50-item frailty index (FI) is related to VBD in a large and relatively unselected cohort of attendees of a memory clinic. Materials and methods The TREDEM (Treviso Dementia) registry includes retrospective observational data of 1584 participants. A modified FI was calculated from 50 variables comprising diseases, disability, behavioral disorders, and blood biochemistry. The presence and severity of VBD, including leukoaraiosis, lacunes, larger infarctions and the hierarchical vascular rating scale (HVRS), were determined based on brain computerized tomography imaging. Multiple logistic regression models were built according to the stepwise method. Results Mean age of the 1584 participants was 79.6 ± 7.5 years and 1033 (65.2 %) were females. The average number of health deficits was 11.6 ± 6.2, corresponding to an FI of 0.23 ± 0.12 (range: 0.00–0.56). Each 0.01-point increase in the FI was associated with an increased probability of leukoaraiosis (+2.3 %) and severe leukoaraiosis (+5 %), lacunas in the basal ganglia (+1.73 %), occipital lobes (+2.7 %), parietal lobes (+3 %), frontal lobes (+3.6 %), temporal lobes (+4.2 %), and thalamus (+4.4 %). Moreover, an increase of 0.01 points in the FI was associated with a 3.1 % increase in the probability of HVRS score (≥2). Conclusion An FI based on routine clinical and laboratory variables was associated with the presence, degree, and some localizations of VBD in a population of older adults with cognitive decline. This frailty assessment tool may therefore be used to identify individuals at risk of developing cerebrovascular disease and, consequently, to implement strategies for vascular risk factor control.
Article
Background: Current predictive tools for TKA focus on clinicians rather than patients as the intended user. The purpose of this study was to develop a patient-focused model to predict health-related quality of life outcomes at 1-year post-TKA. Methods: Patients who underwent primary TKA for osteoarthritis from a tertiary institutional registry after January 2006 were analysed. The primary outcome was improvement after TKA defined by the minimal clinically important difference in utility score at 1-year post-surgery. Potential predictors included demographic information, comorbidities, lifestyle factors, and patient-reported outcome measures. Four models were developed, including both conventional statistics and machine learning (artificial intelligence) methods: logistic regression, classification tree, extreme gradient boosted trees, and random forest models. Models were evaluated using discrimination and calibration metrics. Results: A total of 3755 patients were included in the study. The logistic regression model performed the best with respect to both discrimination (AUC = 0.712) and calibration (intercept = -0.083, slope = 1.123, Brier score = 0.202). Less than 2% (n = 52) of the data were missing and therefore removed for complete case analysis. The final model used age (categorical), sex, baseline utility score, and baseline Veterans-RAND 12 responses as predictors. Conclusion: The logistic regression model performed better than machine learning algorithms with respect to AUC and calibration plot. The logistic regression model was well calibrated enough to stratify patients into risk deciles based on their likelihood of improvement after surgery. Further research is required to evaluate the performance of predictive tools through pragmatic clinical trials. Level of evidence: Level II, decision analysis.
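A hedged sketch of how the discrimination and calibration figures quoted above (AUC, calibration intercept and slope, Brier score) are commonly computed for a vector of predicted probabilities p and binary improvement outcomes y; this is a generic recipe using statsmodels and scikit-learn, not the authors' code.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score, brier_score_loss

def calibration_metrics(y, p, eps=1e-8):
    p = np.clip(p, eps, 1 - eps)
    logit = np.log(p / (1 - p))
    # Calibration slope: logistic regression of the outcome on the logit of p.
    slope_fit = sm.Logit(y, sm.add_constant(logit)).fit(disp=0)
    # Calibration intercept (calibration-in-the-large): offset model with slope fixed at 1.
    intercept_fit = sm.GLM(y, np.ones(len(y)), family=sm.families.Binomial(),
                           offset=logit).fit()
    return {
        "auc": roc_auc_score(y, p),
        "brier": brier_score_loss(y, p),
        "cal_slope": slope_fit.params[1],
        "cal_intercept": intercept_fit.params[0],
    }
```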
Thesis
One-off or unexpected events cause levels of political and economic uncertainty to rise, generating investor uncertainty, which spills over into financial markets and consequently affects company returns. The development of quantitative analysis and the proliferation of financial data have made it possible to investigate the effects of uncertainty in markets better than ever before. Therefore, this dissertation aims to contribute to the study of unique social and political events, such as referendums, elections or even pandemics, by measuring and explaining their stock market impact. Specifically, we conduct three analyses of abnormal returns in the Spanish, European and global capital markets. To do so, we draw on the event study literature, recovering and updating the method of seemingly unrelated regressions, a simultaneous equations model whose main advantages are that it considers how assets are related to each other, providing more reliable tests, and that it allows for joint hypothesis testing by means of linear restrictions. In addition, political and social events are characterised by a large dispersion in returns, meaning that they do not affect all stocks equally. From the abnormal returns obtained and using cross-sectional analysis, we try to determine which characteristics may have caused some companies to be more adversely affected than others, thus trying to quantify their exposure to the risk derived from the event. In short, we statistically analyse the stock market returns following an extreme event to determine their significance and then identify which elements aggravate or diminish the impact. Using the above framework, we present the study of three single events that form the core of this research: chapter 1 is a study of the Spanish stock market after the failed attempt at independence in Catalonia, chapter 2 is an analysis of three European markets after the unexpected arrival of the far-right in their respective governments, and chapter 3 is a global study of the effects of the COVID-19 pandemic. Our results suggest the existence of negative price effects following the events, directly related to the increase in political and economic uncertainty. Good examples are the results in the Spanish continuous market after the illegal referendum in Catalonia or in the Milan stock market after the formation of the government of La Liga and the 5 Star Movement. We can also highlight the vastly different reactions to the same event and the identification of characteristics that partially explain the reasons for these differences. For example, the location of the firm or its level of internationalisation after the Catalan independence attempt, the relationship of the firm with the European Union during the far-right government negotiations, or the level of competitiveness of the country or its income inequality after the outbreak of COVID-19. Our work contributes to the field of finance with the quantitative analysis of three extraordinary and highly relevant cases, but we also provide specific economic and financial factors that can help institutions and professionals to reduce their risk exposure in future events.
Chapter
This work aims to identify the critical production costs, related to raw materials and labor, of ordered inflatable-based products without standardization in order to develop a quantitative model to predict these costs accurately in the early project stage, within the budget step. In order to achieve this goal, it was necessary to understand the production processes and the raw materials, as well as to study the principal theoretical aspects related to cost estimating techniques and methods, cost estimating models, model selection, and validation. Therefore, it is intended to develop a multiple linear regression model, applied to historical quantitative data, to estimate each critical variable concerning the quantity of the main raw material and the labor times for critical processes. Six models were analyzed, in which two models are identified for each critical variable such as the linear meters value of the main raw material used in the product, the main raw material cut time involved in the product and the sew time required by the product. The models were evaluated, selected, and validated, defining the best model for each critical variable. The model parameters were obtained using a train dataset and, afterwards, the results of the selected models were validated using a test dataset. The obtained results, through the proposed methodology, were evaluated and proved to be reliable for use in the early stage of product development within the budget step.
Article
Full-text available
Unemployment rate forecasting has become a particularly promising domain of comparative studies in recent years because it is a major issue facing the economic forecasting process. Time-series data are rarely purely linear or purely nonlinear; they often contain both components jointly. Therefore, this study introduces a hybrid model that combines two commonly used models, namely the linear Autoregressive Moving Average with exogenous variable (ARMAX) model and the nonlinear Generalized Autoregressive Conditional Heteroskedasticity with exogenous variable (GARCHX) model with a generalized error distribution (GED). That is, we build a hybrid (ARMAX-GARCHX-GED) model for modeling bivariate time-series data of the unemployment rate and the exchange rate. Forecasting performance evaluation based on the common classical accuracy criteria, such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percent Error (MAPE), has some specific limitations when used to choose the optimal forecasting model. Therefore, in this paper we employ an evaluation criterion based on the methodology advocated by Diebold and Mariano, known as the DM test, which rests on statistical hypothesis testing. The DM test is applied in this study to detect significant differences in forecasting accuracy between the hybrid (ARMAX-GARCHX-GED) and the individual ARMAX models. From the case study results, and according to the DM test, the differences between the forecasting performances of the models are significant and the hybrid (ARMAX-GARCHX-GED) model is more efficient than the individual competing ARMAX model for unemployment rate forecasting.
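As a rough illustration of the Diebold-Mariano comparison described above, here is a minimal sketch for one-step-ahead forecasts under squared-error loss; e1 and e2 are the two models' forecast-error series. A real application, or multi-step forecasts, would use a HAC-corrected variance for the mean loss differential.

```python
import numpy as np
from scipy import stats

def diebold_mariano(e1, e2):
    """DM statistic and two-sided p-value for equal predictive accuracy (h = 1, squared loss)."""
    d = np.asarray(e1) ** 2 - np.asarray(e2) ** 2    # loss differential series
    T = len(d)
    dm = d.mean() / np.sqrt(d.var(ddof=1) / T)       # standardized mean differential
    p_value = 2 * stats.norm.sf(abs(dm))             # asymptotically N(0, 1) under the null
    return dm, p_value
```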
Article
For insect radar observations, exploiting radar echoes from insects to accurately estimate size parameters such as body mass, length, and width of insects can help to identify insect species. At present, the commonly used method for estimating insect body size parameters in insect radar is to use the monotonic mapping relationship between insect RCS parameters of a single frequency (mainly 9.4 GHz) and body size, and obtain the empirical formula for body size estimation by polynomial fitting. However, the useful information used by the traditional methods is limited (1 to 2 features), and these retrieval methods are simple and with limited estimation accuracy. This paper proposed a feature-selection-based machine learning method for insect body size estimation, which could effectively improve the body size parameter estimation accuracy of insect radars. First of all, based on the published insect scattering dataset (9.4GHz, 366 specimens of 76 species), stepwise regression was used to select the optimal feature combinations for body size estimation, then three machine learning methods, Random Forest Regression (RFR), Support Vector Regression (SVR) and Multilayer Perceptron (MLP), were adopted to achieve estimation of insect body size. Among them, RFR has the best performance (mass 18.83%, length 11.37%, width 16.87%). Subsequently, based on the measured dataset of migratory insects (5532 specimens of 23 species), the influence of the estimation error of insect body size on the identification accuracy of migratory insect species was analyzed. When incorporating the estimation error of the feature-selection-based RFR method, the insect identification rate of 83.68% was reached.
Article
Mendelian randomization (MR) is the use of genetic variants to assess the existence of a causal relationship between a risk factor and an outcome of interest. Here, we focus on two-sample summary-data MR analyses with many correlated variants from a single gene region, particularly on cis-MR studies which use protein expression as a risk factor. Such studies must rely on a small, curated set of variants from the studied region; using all variants in the region requires inverting an ill-conditioned genetic correlation matrix and results in numerically unstable causal effect estimates. We review methods for variable selection and estimation in cis-MR with summary-level data, ranging from stepwise pruning and conditional analysis to principal components analysis, factor analysis, and Bayesian variable selection. In a simulation study, we show that the various methods have comparable performance in analyses with large sample sizes and strong genetic instruments. However, when weak instrument bias is suspected, factor analysis and Bayesian variable selection produce more reliable inferences than simple pruning approaches, which are often used in practice. We conclude by examining two case studies, assessing the effects of low-density lipoprotein-cholesterol and serum testosterone on coronary heart disease risk using variants in the HMGCR and SHBG gene regions, respectively.
Article
Full-text available
Integrated water vapour (IWV) measurements from similar or different techniques are often inter-compared for calibration and validation purposes. Results are usually assessed in terms of bias (difference of the means), standard deviation of the differences, and linear fit slope and offset (intercept) estimates. When the instruments are located at different elevations, a correction must be applied to account for the vertical displacement between the sites. Empirical formulations are traditionally used for this correction. In this paper we show that the widely used correction model based on a standard, exponential, profile for water vapour cannot properly correct the bias, slope, and offset parameters simultaneously. Correcting the bias with this model degrades the slope and offset estimates and vice versa. This paper proposes an improved correction method that overcomes these limitations. It implements a multiple linear regression method where the slope and offset parameters are provided from a radiosonde climatology. It is able to predict monthly mean IWVs with a bias smaller than 0.1 kg m−2 and a root-mean-square error smaller than 0.5 kg m−2 for height differences up to 500 m. The method is applied to the inter-comparison of GPS IWV data in a tropical mountainous area and to the inter-validation of GPS and satellite microwave radiometer data. This paper also emphasizes the need for using a slope and offset regression method that accounts for errors in both variables and for correctly specifying these errors.
Article
Historical analyses are inevitably based on data – documents, fossils, drawings, oral traditions, artifacts, and more. Recently, historians have been urged to embrace the data deluge (Guldi and Armitage 2014) and teams are now systematically assembling large digital collections of historical data that can be used for rigorous statistical analysis (Slingerland and Sullivan 2017; Turchin et al. 2015; Whitehouse et al. 2019; Slingerland et al. 2018–2019). The promise of large, widely accessible databases is the opportunity for rigorous statistical testing of plausible historical models. The peril is the temptation to ransack these databases for heretofore unknown statistical patterns. Statisticians bearing algorithms are a poor substitute for expertise.
Chapter
Carnivore species are believed to exert strong competitive pressure on each other, resulting in adaptations to allow for niche separation through resource partitioning. However, factors that promote ecological separation among species in tropical forests are difficult to explain and are poorly understood because robust field studies are lacking. We examined spatial, temporal and morphological segregation between tropical carnivores in a protected forest in north‐central Thailand. Sympatric spatial overlap was calculated from radio‐telemetry data of 38 individuals from six species (5 yellow‐throated martens, Martes flavigula , 20 leopard cats, Prionailurus bengalensis , 2 Asiatic golden cats, Catopuma temminckii , 4 clouded leopards, Neofelis nebulosa , 5 binturongs, Arctictis binturong , and 2 dholes, Cuon alpinus ) in the same study area. Spatial overlap was then correlated with 14 independent variables (i.e. skull and dental morphology, body mass, habitat use and activity patterns) compared among the six species. We predicted that carnivores with differing morphology and activity patterns would exhibit more spatial overlap because these species would compete less for prey resources. Our statistical analyses indicated that lower mean carnassial length and activity patterns in closed habitat cover were significantly correlated ( p < 0.05) with species spatial overlap. Binturongs appeared to have the greatest amount of spatial overlap with other species of carnivores, whereas dholes had the least spatial overlap; also, dholes and yellow‐throated martens tended to be more active in open habitats and during diurnal time periods, whereas clouded leopards and Asiatic golden cats were more active in closed cover and were more arrhythmic in activity. Although these results provide useful information on carnivore coexistence, we recommend that future studies monitor larger sample sizes of carnivore species over the same time period to provide more robust statistical analyses. In addition, we suggest that future research on carnivore coexistence evaluates the impacts of anthropogenic activity on study results.
Article
Unobserved heterogeneity causing overdispersion and an excessive number of zeros take a prominent place in methodological development on count modeling. Insight into the mechanisms that induce heterogeneity is required for a better understanding of the phenomenon of overdispersion. When the heterogeneity arises from the stochastic component of the model, the use of a heterogeneous Poisson distribution for this part emerges as an elegant solution. A hierarchical study design is also responsible for heterogeneity, as unobservable effects at various levels contribute to the overdispersion. Zero-inflation, heterogeneity and the multilevel nature of count data each present special challenges in their own right; the presence of all of them in one study adds further challenges for modeling strategies. This study is therefore designed to merge the attractive features of the separate strands of solutions in order to face such a comprehensive challenge. It differs from previous attempts in the choice of two recently developed heterogeneous distributions, namely the Poisson–Lindley (PL) and Poisson–Ailamujia (PA) distributions, for the truncated part. Using generalized linear mixed modeling settings, the predictive performance of the multilevel PL and PA models and their hurdle counterparts was assessed within a comprehensive simulation study in terms of bias, precision and accuracy measures. The multilevel models were applied to two separate real-world examples to assess the practical implications of the new models proposed in this study.
Article
Full-text available
It is known that regression results can be misleading when the predictor variables (x's) are highly correlated (nonorthogonal). The objective of this paper is to present some guidelines for deciding when the correlations among the x's are so large that the numerical accuracy and/or physical interpretation of regression results should be questioned. A measure of nonorthogonality is presented and the effects of correlated x's and poor model formulation on the estimated coefficients are discussed. Emphasis is placed on the practical interpretation of regression results. Two illustrative examples are presented.
Article
This paper describes the use of elements in regression analysis (Newton and Spurrell, 1967) to clarify the problems which arose in two industrial studies. The use of elements contributed significantly to a better understanding of the processes under investigation.
Article
In an earlier paper, Garside (1965) gave an algorithm for calculating all subsets in multiple-regression analysis, thereby obtaining, for each given size, the subset with the minimum residual sum of squares. This paper develops further the facilities in the earlier algorithm and introduces a new algorithm which, although slightly restrictive, is very much faster in execution.
Article
The multiple correlation coefficient, R (or r, in the simple case), is frequently used in evaluating regression models. Statistical significance is the usual criterion for judging R. Various views of its practical significance should also be considered. One practical measure is the percent reduction in the standard deviation of the response variable achieved by the model. A graph giving this as a function of R and percent loss in degrees of freedom is presented. This measure is compared with others which are sometimes appropriate.
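One common way to make the "percent reduction in standard deviation" measure concrete, sketched here in standard notation (the cited paper's exact chart and conventions are not reproduced): with n observations, p fitted parameters, sample standard deviation s_y of the response, and residual standard deviation s_e from the model,

```latex
\frac{s_e}{s_y}=\sqrt{(1-R^{2})\,\frac{n-1}{n-p}},
\qquad
\text{percent reduction}=100\left(1-\frac{s_e}{s_y}\right),
% so a given R translates into a smaller practical gain as the loss in
% degrees of freedom (p relative to n) grows.
```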
Article
Selection of explanatory variables in the regression equation has been a prime problem in constructing a predicting equation. This paper describes and gives an illustration of a selection technique which makes use of the orthogonality among factors extracted from the correlation matrix. Using the factors not as new variables, but merely as the reference frame, we can identify a near orthogonal subset of explanatory variables. It is indicated that this approach provides the model builder with the flexibility that is not available in the conventional, purely mechanical, selection methods.
Article
Properties of some currently used reduced-rank regression procedures are investigated. Difficulties associated with the selection of a predictor subset on the basis of the calibration sample are discussed. It is shown that the largest principal components regression procedure of Horst [1941] and Burket [1964] may be regarded as a means for obtaining estimates of regression weights under the assumption of the formal factor analysis model. The use of alternative weighting procedures in principal components regression is discussed. It is pointed out that a factor analysis regression procedure suggested by Horst [1941] may be used in conjunction with the maximum likelihood factor analytic solution to provide maximum likelihood estimates of the regression weights under the assumption of the formal factor analysis model. A superficially different procedure proposed by Scott [1966] is shown to be equivalent to this factor analysis regression procedure. Some properties of factor analysis regression using a maximum likelihood solution are given. A description is given of a series of artificial experiments which was carried out to throw light on some properties of the reduced-rank regression procedures. Implications of the results are discussed.
Article
This paper presents a technique using quantities termed "elements" which are derived from a limited number of regression equations, and assist in the computations leading to tests of a regression model. It will be seen that special attention is given to "secondary elements" in choosing variables either for predicting the result arising from certain specified circumstances or for explaining the operation in terms of recognizably useful parameters. The meaning of these elements is argued geometrically, and their advantages in computation can easily be demonstrated; they have already helped to clarify complex industrial data.
Article
In this article, Mr Garside gives a procedure for comparing all sub-sets in multiple regression analysis and thereby obtaining the best sub-set of a given size in the sense of the minimum residual sum of squares. The author also points out that this is a special case of a more general problem.
Article
This paper derives a stochastic linear equation from factor analysis called factor analysis regression which is suggested as an alternative to classical least squares regression whenever least squares estimation is questionable or breaks down because of errors in the variables or multicollinearity. Statistical tests for the factor analysis regression are also suggested and an empirical example comparing factor-analysis regression with least squares is shown.
Article
In a multiple regression problem, let the p×1 vector x consist of the dependent variable and p−1 predictor variables. The correlation matrix of x is reduced to principal components. The components corresponding to low eigenvalues may be useful in suggesting possible alternative subregressions. This possibility is analyzed, and formulas are derived for obtaining subregressions from the principal components.
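A small numpy sketch of the device this abstract describes: eigen-decompose the correlation matrix of (y, x1, ..., x_{p-1}) and inspect the eigenvectors attached to near-zero eigenvalues, since each such component defines an approximate linear relation among the variables that can suggest an alternative subregression. The variable layout and the tolerance are illustrative assumptions.

```python
import numpy as np

def small_component_relations(data, tol=0.05):
    """data: (n, p) array whose first column is y and remaining columns are the x's."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)          # eigenvalues in ascending order
    relations = []
    for lam, vec in zip(eigvals, eigvecs.T):
        if lam < tol:                                # near-singular direction
            relations.append((lam, vec))             # coefficients of the approximate relation
    return relations
```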
Article
This paper is concerned with the analysis of data from a multiple regression of a single variable, y, on a set of independent variables, x1,x2,...,xr. It is argued that the form of the analysis should depend on the use that is to be made of the regression, and that therefore an approach employing ideas from decision theory may be worth while. Two situations are analysed in this way: in the first it is desired to predict a future value of y; in the second we wish to control y at a preassigned value. The two analyses are found to be different: in particular, the standard errors of the regression coefficients are found to be irrelevant in the prediction problem, but not in the control problem. In the former it is shown that, under rather special assumptions on the multiple regression experiment, the analysis is similar to that recommended by other writers. If the costs of control do not depend on the values at which the control takes place, a similar analysis holds for the second problem. The approach throughout is Bayesian: there is no discussion of this point, I merely ask the non‐Bayesian reader to examine the results and consider whether they provide sensible and practical answers.
Article
It is shown that for any design matrix X in the model y = Xβ + e there is a symmetric matrix C, which commutes with X′X, such that the elements of (C + X′X)⁻¹X′Y have smaller mean square error than the variances of the corresponding elements of (X′X)⁻¹X′Y. C need only have small positive eigenvalues. Further, (C + X′X)⁻¹X′Y can be a quite different estimator than (X′X)⁻¹X′Y under these conditions.
Article
The objectives of this paper are to examine the mean square error criterion for rejecting or adopting restrictions on the parameter space in a regression model, and to develop a uniformly most powerful testing procedure for the criterion. We present a tabulation of critical points for the test for one restriction and selected points of the power function. The mean square error criterion suggests a framework for thinking about the problem of multicollinearity in a linear model. To this end we present some examples to illustrate the linkage of the mean square error criterion with multicollinearity.
Article
For the standard linear model containing several explanatory variables, the precision of estimation of linear parametric functions is analysed in terms of latent roots and vectors of X'X, where X is the matrix of values of explanatory variables. This analysis provides a practical method for detecting multicollinearity, and it is demonstrated that it is also useful in solving problems of optimum choice of new values of explanatory variables.
Article
Two methods are given for computing all possible regressions for a set of independent variables. One produces all the usual regression statistics; the other gives only the sums of squares of residuals. Both require less computation than other methods of computing all possible regressions and the second may compare favorably with the procedure suggested by Hocking and Leslie for finding the best subset without evaluating all possible subsets.
Article
The mean square error of prediction is proposed as a criterion for selecting variables. This criterion utilizes the values of the predictor variables associated with the future observation and the magnitude of the estimated variance. Mean square error is believed to be more meaningful than the commonly used criterion, the residual sum of squares.
Article
A principal objective of this paper is to discuss a class of biased linear estimators employing generalized inverses. A second objective is to establish a unifying perspective. The paper exhibits theoretical properties shared by generalized inverse estimators, ridge estimators, and corresponding nonlinear estimation procedures. From this perspective it becomes clear why all these methods work so well in practical estimation from nonorthogonal data.
Article
This paper is an exposition of the use of ridge regression methods. Two examples from the literature are used as a base. Attention is focused on the RIDGE TRACE which is a two-dimensional graphical procedure for portraying the complex relationships in multifactor data. Recommendations are made for obtaining a better regression equation than that given by ordinary least squares estimation.
Article
In this paper we explore the similarity between the minimum mean square error estimator and Hoerl and Kennard's ridge regression.
Article
The linear models selection-of-variables problem is formulated and the integrated mean square error (IMSE) is discussed as a parametric measure of the “distance” between a true, unknown function, f, and a linear estimating or “substitute” function, f̂, determined from data. The IMSE is a weighted integral over R, a region of interest—a set of x values for which f̂ is to be used as a substitute for f—where W(x) is a function which assigns weights to the values of x in R; the weight at x quantifies the importance that f̂(x) be close to f(x). The IMSE, a parameter, cannot be calculated from the data. A statistic which more or less successfully mimics the IMSE in model selection problems is the AEV. The AEV is introduced, its first two moments are displayed, and for linear functions a simple form of the AEV is derived which uses the second-order moment matrix of R and W, where s² is a biased estimate of σ². The use of the AEV in the linear models selection-of-variables problem is discussed and illustrated with a problem which has previously been used to illustrate the use of the Cp statistic.
Article
Component-plus-residual plots, plots of the estimated effect of each independent variable at each observed data point plus the corresponding residuals, are used as an aid (1) to choose an appropriate form of the equation, (2) to observe the distribution of the observations over the range of each independent variable and (3) to estimate the influence of each observation on each component of the equation. With the help of indicator variables and the Cp search technique, far-out observations can be tested to see if their responses are compatible with those of the remainder of the data. Examples are given from studies of manufacturing and marketing data in which either a new form of the equation is found or a new insight is gained into the limitations and strengths of the fitted equation.
Article
An algorithm is proposed for computing statistics for all possible subsets of variables for a discriminant analysis. Optimal subsets of any given size can then be determined. A comparison with a stepwise procedure is also presented through two examples.
Article
Least squares estimates of parameters of a multiple linear regression model are known to be highly variable when the matrix of independent variables is near singular. Using the latent roots and latent vectors of the “correlation matrix” of the dependent and independent variables, a modified least squares estimation procedure is introduced. This technique enables one to determine whether the near singularity has predictive value and to examine alternate prediction equations in which the effect of the near singularity has been removed from the estimates of the regression coefficients. In addition, a method for performing backward elimination of variables using standard least squares or the modified procedure is presented.
Article
This paper describes several algorithms for computing the residual sums of squares for all possible regressions with what appears to be a minimum of arithmetic (less than six floating-point operations per regression) and shows how two of these algorithms can be combined to form a simple leap and bound technique for finding the best subsets without examining all possible subsets. The result is a reduction of several orders of magnitude in the number of operations required to find the best subsets.
Article
This is an expository paper, pointing out explicitly the pseudoness of the “F-statistic” used in stepwise procedures for determining the independent variables to be used in a linear prediction equation. Unfortunately this pseudoness prevents one from obtaining any probabilistic measure of the goodness of the final prediction equation. The use of the distribution of an order statistic is discussed as an aid to understanding the problem as well as furnishing a “better” (although still unsatisfactory) approach.
Article
When the number of independent variables for a regression analysis is not too large, calculation of all possible regressions is an alternative to the various stepwise procedures available. However, the amount of information available from all possible regressions is quite large even for a moderate number of variables, and screening out most of the regressions can be done simply on the basis of information contained in the residual sum of squares. This paper presents an efficient procedure for the calculation of the residual sum of squares for each regression and some suggestions for screening procedures.
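A brute-force illustration (practical only for a handful of variables) of the quantity these papers organize their screening around: the residual sum of squares of every possible subset regression, on which most selection criteria are monotone. The cited algorithms compute the same quantities far more cheaply than the naive refitting done here.

```python
from itertools import combinations
import numpy as np

def all_subset_rss(X, y):
    """Residual sum of squares of an intercept-plus-subset OLS fit, for every subset."""
    n, k = X.shape
    out = {}
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            Z = np.column_stack([np.ones(n), X[:, subset]])
            beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            out[subset] = float(np.sum((y - Z @ beta) ** 2))
    return out

# Screening: for each subset size s, keep the subset with the smallest RSS, e.g.
#   rss = all_subset_rss(X, y)
#   best = {s: min((sub for sub in rss if len(sub) == s), key=rss.get)
#           for s in range(1, X.shape[1] + 1)}
```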
Article
Recent reviews have dealt with the subject of which variables to select and which to discard in multiple regression problems. Lindley (1968) emphasized that the method to be employed in any analysis should be related to the use intended for the finally fitted regression. In the report by Beale et al. (1967), the emphasis is on selecting the best subset for any specified number of retained independent variables. Here we will be concerned with pointing out the advantages of the variable selection scheme in which independent variables are successively discarded one at a time from the original full set. While these advantages are not unknown to workers in this field, they are however not appreciated by the statistical community in general. For the purposes of this demonstration it is assumed that we are in the nonsingular case so that the number of observations exceeds the number of regressor variables. Let us begin by considering economy of effort. Suppose that we were using a step-up regression procedure, ignoring for the while its theoretical deficiencies (to be discussed later). We should then first fit k simple regressions, one for each of the k regressor variables considered, selecting the single most significant individual regressor variable. Having made this selection we would proceed with k - 1 additional fits to determine which of the remaining variables in conjunction with the first selected yielded the greatest reduction in residual variation. This process is continued on so as to provide a successive selection and ordering of variables. We may even require the ordering of all k variables, leaving for later decision what critical juncture is to be employed in determining which of the k variables to retain, which to reject-if we do so we shall have made a total of k(k + 1)/2 fits, albeit they may have differed greatly in their degree of complexity. A complete stepdown regression procedure however requires but k fits, as will now be indicated. Suppose we have done a multiple regression on all k variables and wish to consider the k possible multiple regressions on all sets of k - 1 variables, that is where 1 variable has been deleted. The results for these k possible multiple regressions are implicit in the initial k-variable regression, provided we have secured the inverse matrix, or at least its diagonal, necessary for testing the significance of the fitted partial regression coefficients. The case
Article
Mantel (1970) has pointed out that many procedures are now available for selecting variables in multiple regression analyses. This note reviews the more important ones briefly, and suggests that Mantel exaggerates the advantages of the backward elimination or “stepdown” procedure.
Article
C. Mallows proposed a statistic, Cp, for variable selection in multiple regression. Gorman and Toman (1966) published the statistic, its derivation, and several examples of its use. The purpose of this note is to show that there is a one-to-one correspondence between Cp and the older, well established criterion known as adjusted R², e.g. Ezekiel (1930).
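For reference, the standard definitions behind the correspondence the note discusses, written here in common notation rather than the note's own: for a subset model with p fitted parameters, residual sum of squares RSS_p, full-model residual mean square σ̂², and total sum of squares TSS on n observations,

```latex
C_p=\frac{\mathrm{RSS}_p}{\hat{\sigma}^{2}}-(n-2p),
\qquad
\bar{R}^{2}=1-\frac{\mathrm{RSS}_p/(n-p)}{\mathrm{TSS}/(n-1)} .
% For a fixed subset size p, both quantities are monotone in RSS_p, which is
% what ties the two criteria together.
```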
Article
A number of criteria have been proposed for selecting the best subset or subsets of independent variables in linear regression analysis. Applying these criteria to all possible subsets is, in general, not feasible if the number of variables is large. Many of the criteria are monotone functions of the residual sum of squares; hence the problem is reduced to identifying subsets for which this quantity is small. In an earlier paper (Selection of the Best Subset in Regression Analysis by R. R. Hocking and R. N. Leslie, 1967) a method was described for identifying such subsets without considering all possible subsets. However, the amount of computation required if more than fifteen independent variables were considered was excessive. The present paper extends the basic ideas in that paper so that moderately large problems can now be treated with what appears to be a minimum of computation.
Article
Multiple regression analysis can be used to analyze data from undesigned, non-orthogonal experiments, to provide a prediction equation which may be adequate for many purposes. However, it is impossible to separate the effects of those independent variables which are highly correlated with each other. Thus, it is often desirable to augment, in an efficient manner, the existing data with a fixed number of experimental runs such that the independent variables are made more orthogonal to each other. In this study, such methods have been developed for a linear model with no interactions among the independent variables. A solution is obtained for augmenting data within a region defined by rectangular limits. A theoretical solution is obtained such that the volume of the simultaneous confidence region for the estimates of the regression coefficients is minimized and the maximum variance of a predicted response in the region is minimized, under the restriction that the correlations among the independent variables are eliminated, in order that their effects can be estimated and tested separately. With a moderate number of additional data points, these augmented designs can be constructed such that the desired conditions are approximately fulfilled, with a considerable gain over the information contained in the original data.
Article
The problem of selecting the best subset or subsets of independent variables in a multiple linear regression analysis is two-fold. The first, and most important, problem is the development of criteria for choosing between two contending subsets. Applying these criteria to all possible subsets, if the number of independent variables is large, may not be economically feasible, and so the second problem is concerned with decreasing the computational effort. This paper is concerned with the second question, using the Cp-statistic of Mallows as the basic criterion for comparing two regressions. A procedure is developed which will indicate ‘good’ regressions with a minimum of computation.
Article
Biased estimators of the coefficients in the linear regression model have been the subject of considerable discussion in the recent literature. The purpose of this paper is to provide a unified approach to the study of biased estimators in an effort to determine their relative merits. The class of estimators includes the simple and the generalized ridge estimators proposed by Hoerl and Kennard [9], the principal component estimator with extensions such as that proposed by Marquardt [19], and the shrunken estimator proposed by Stein [23]. The problem of estimating the biasing parameters is considered and illustrated with two examples.