New Multicollinearity Indicators in Linear Regression Models
ABSTRACT Correlation is an important statistical issue for the Ordinary Least Squares estimates and for data-reduction techniques, such as the Factor and the Principal Components analyses. In this paper we propose new indicators for the multicollinearity problem in the multiple linear regression model. Copyright 2007 The Authors. Journal compilation (c) 2007 International Statistical Institute.
ABSTRACT: This work aims to predict the air-to-water partitioning of 96 organic pesticides by means of Quantitative Structure–Property Relationships theory. After structural feature selection with the Genetic Algorithms and Replacement Method linear approaches, the Moriguchi octanol–water partition coefficient is found to be among the most important molecular features, and higher lipophilicities would lead to compounds having higher Henry’s law constants. We also compare the statistical performance achieved by four fully-connected Feed-Forward Multilayer Perceptron Artificial Neural Networks. The statistical results reveal that the best performing model uses the Levenberg–Marquardt with Bayesian regularization (BR) weighting function to achieve the most accurate predictions.
Atmospheric Environment 01/2010; · 3.11 Impact Factor
Article: The corrected VIF (CVIF)
ABSTRACT: In this paper, we propose a new corrected variance inflation factor (VIF) measure to evaluate the impact of the correlation among the explanatory variables on the variance of the ordinary least squares estimators. We show that the real impact on variance can be overestimated by the traditional VIF when the explanatory variables contain no redundant information about the dependent variable, and that a corrected version of this multicollinearity indicator becomes necessary.
Journal of Applied Statistics. 01/2011; 38(7):1499-1507.
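For readers unfamiliar with the traditional VIF that this abstract refers to, the following is a minimal sketch of how it is commonly computed. It covers only the standard measure, VIF_j = 1 / (1 - R_j^2), where R_j^2 is obtained by regressing the j-th explanatory variable on the others; it does not implement the corrected measure, whose formula is not given in the abstract. The function name `vif` and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np


def vif(X):
    """Traditional variance inflation factors for the columns of X.

    Uses the standard identity that the VIFs equal the diagonal of the
    inverse of the correlation matrix of the explanatory variables,
    which matches VIF_j = 1 / (1 - R_j^2).
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)  # columns are variables
    return np.diag(np.linalg.inv(corr))


# Hypothetical data: x3 is nearly a copy of x1, x2 is unrelated to both.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
print(vifs)  # large values for x1 and x3, a value near 1 for x2
```

A common rule of thumb treats a VIF above roughly 10 as a sign of problematic collinearity; the abstract's point is that such a reading can overstate the actual effect on the variance of the estimators.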
ABSTRACT: In modeling approaches, artificial neural networks (ANNs) have a special place in addressing nonlinear phenomena or curved manifolds. Often a feature selection approach is used before the ANN to supply the input variables for its models. The role of ‘selected’ versus ‘arbitrary’ features in the outcome of ANN models is investigated with a variety of objectively selected and arbitrarily chosen variables from chemical databases, namely thiazolidinones, anilinoquinolines and piperazinoquinolines. For each database, the biological activity is considered as the dependent variable and the molecular descriptors from DRAGON software are used as explanatory variables. The selection sets are obtained from feature selection approaches, namely the combinatorial protocol in multiple linear regression, stepwise regression and a genetic algorithm. Apart from these, a large number of arbitrary sets have been created by randomly picking descriptors from the corresponding databases. The features of all sets have shown a variety of inter- and intra-set diversities. A three-layer back-propagation ANN with the Levenberg-Marquardt optimization algorithm has been used for modeling the phenomena. Regardless of the origin of the feature sets, the ANN models from a very large number of sets have explained the activity well and qualified as predictive models. Also, no specific pattern is apparent between the quality of an ANN model and the origin of its input feature set. Since these results are unusual, the study is extended to a few more databases. All the results emphasize the innate ability of ANNs to develop a complex network of relations among features to estimate the target variable. This has prompted us to suggest that prior feature selection is not essential for ANNs, although it remains a desirable option for meaningful outputs in terms of the rationale behind the inputs.
QSAR & Combinatorial Science 11/2009; 28(11‐12):1487 - 1499. · 1.55 Impact Factor