
Regularized Multiple Regression Methods to Deal with Severe Multicollinearity

International Journal of Statistics and Applications 2018, 8(4): 167-172
DOI: 10.5923/j.statistics.20180804.02
N. Herawati*, K. Nisa, E. Setiawan, Nusyirwan, Tiryono
Department of Mathematics, Faculty of Mathematics and Natural Sciences, University of Lampung, Bandar Lampung, Indonesia
Abstract This study aims to compare the performance of Ordinary Least Squares (OLS), Least Absolute Shrinkage and Selection Operator (LASSO), Ridge Regression (RR) and Principal Component Regression (PCR) methods in handling severe multicollinearity among explanatory variables in multiple regression analysis using simulated data. In order to select the best method, a Monte Carlo experiment was carried out in which the simulated data contain severe multicollinearity among all explanatory variables (ρ = 0.99) with different sample sizes (n = 25, 50, 75, 100, 200) and different numbers of explanatory variables (p = 4, 6, 8, 10, 20). The performances of the four methods are compared using the Average Mean Square Error (AMSE) and the Akaike Information Criterion (AIC). The results show that PCR has the lowest AMSE of the four methods, indicating that PCR is the most accurate estimator of the regression coefficients for every sample size and number of explanatory variables studied. PCR also performs as the best estimation model since it gives the lowest AIC values compared to OLS, RR, and LASSO.
Keywords Multicollinearity, LASSO, Ridge Regression, Principal Component Regression
1. Introduction
Multicollinearity is a condition that arises in multiple
regression analysis when there is a strong correlation or
relationship between two or more explanatory variables.
Multicollinearity can create inaccurate estimates of the
regression coefficients, inflate the standard errors of the
regression coefficients, deflate the partial t-tests for the
regression coefficients, give false, nonsignificant, p-values,
and degrade the predictability of the model [1, 2]. Since multicollinearity is a serious problem when we need to make inferences or look for predictive models, it is very important to find the most suitable method to deal with multicollinearity [3].
There are several methods for detecting multicollinearity. Common approaches include using pairwise scatter plots of the explanatory variables to look for near-perfect relationships, examining the correlation matrix for high correlations, computing the variance inflation factors (VIF), inspecting the eigenvalues of the correlation matrix of the explanatory variables, and checking the signs of the regression coefficients [4, 5].
Several solutions for handling the multicollinearity problem
have been developed depending on the sources of
multicollinearity. If the multicollinearity has been created by
the data collection, collect additional data over a wider
X-subspace. If the choice of the linear model has increased
the multicollinearity, simplify the model by using variable
selection techniques. If an observation or two has induced
the multicollinearity, remove those observations. When these steps are not possible, one might try ridge regression (RR), suggested by [6], as an alternative to the OLS method in regression analysis.
Ridge Regression is a technique for analyzing multiple
regression data that suffer from multicollinearity. By adding
a degree of bias to the regression estimates, RR reduces the
standard errors and obtains more accurate estimates of the regression coefficients than OLS. Other techniques, such as LASSO and principal component regression (PCR), are also commonly used to overcome multicollinearity. This study explores which of LASSO, RR, and PCR performs best as a method for handling the multicollinearity problem in multiple regression analysis.
2. Parameter Estimation in Multiple
Regression
2.1. Ordinary Least Squares (OLS)
The multiple linear regression model and its estimation by the OLS method allow us to estimate the relation between a dependent variable and a set of explanatory variables. If the data consist of n observations, where each observation i includes a scalar response $y_i$ and a vector of p explanatory variables (regressors) $x_{ij}$, $j = 1, \ldots, p$, the multiple linear regression model can be written as $Y = X\beta + \varepsilon$, where $Y$ is the vector of the dependent variable, $X$ represents the explanatory variables, $\beta$ is the vector of regression coefficients to be estimated, and $\varepsilon$ represents the errors or residuals. The OLS estimator $\hat{\beta}_{OLS} = (X'X)^{-1}X'Y$ is obtained by minimizing the squared distances between the observed and the predicted dependent variable [1, 4].
To have unbiased OLS estimates in the model, some assumptions should be satisfied. These assumptions are that the errors have an expected value of zero, that the explanatory variables are non-random, that the explanatory variables are linearly independent, and that the disturbances are homoscedastic and not autocorrelated. Explanatory variables subject to multicollinearity produce imprecise estimates of the regression coefficients in a multiple regression. There are several regularized methods to deal with such problems, among them RR, LASSO and PCR. Many studies on these three methods have been carried out over the decades; nevertheless, investigation of RR, LASSO and PCR is still an interesting topic and has attracted authors in recent years, see e.g. [7-12] for recent studies on the three methods.
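As a concrete illustration, the following R sketch (R being the software used for the simulations in Section 3) computes $\hat{\beta}_{OLS} = (X'X)^{-1}X'Y$ both from the normal equations and via lm(); the simulated data, sample size, and coefficient values are illustrative assumptions, not the paper's simulation design.

```r
# Minimal OLS sketch: beta_hat = (X'X)^{-1} X'Y on illustrative simulated data.
set.seed(1)
n <- 50; p <- 4
X <- matrix(rnorm(n * p), n, p)            # explanatory variables (well-conditioned here)
beta <- rep(1, p)                          # illustrative true coefficients
y <- drop(X %*% beta + rnorm(n))           # response with N(0, 1) errors and zero intercept

beta_ols <- solve(t(X) %*% X, t(X) %*% y)  # normal equations
fit <- lm(y ~ X - 1)                       # same estimates via lm(), no intercept
cbind(manual = drop(beta_ols), lm = unname(coef(fit)))
```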
2.2. Regularized Methods
a. Ridge regression (RR)
Computation of the regression coefficients $\hat{\beta}$ requires $X$ to be a centered and scaled matrix; the cross-product matrix $X'X$ is nearly singular when the columns of $X$ are highly correlated. It is often the case that $X'X$ is "close" to singular, a phenomenon called multicollinearity. In this situation $\hat{\beta}$ can still be obtained, but small perturbations of the data lead to significant changes in the coefficient estimates [13]. One way to detect multicollinearity in the regression data is to use the variance inflation factors (VIF), defined for the $j$-th explanatory variable as $(\mathrm{VIF})_j = 1/(1 - R_j^2)$, where $R_j^2$ is the coefficient of determination from regressing $X_j$ on the remaining explanatory variables.
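For concreteness, a small R sketch of this VIF computation is given below; the function name vif_manual is our own, and the usual rule of thumb reads values above 10 as signalling serious multicollinearity.

```r
# VIF_j = 1 / (1 - R_j^2), with R_j^2 the R-squared from regressing the j-th
# column of X on the remaining columns.
vif_manual <- function(X) {
  sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
    1 / (1 - r2)
  })
}
# Example, with X an n x p matrix as in the OLS sketch above:  vif_manual(X)
```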
The ridge regression technique is based on adding a ridge parameter ($\lambda$) to the diagonal of the $X'X$ matrix, forming a new matrix $(X'X + \lambda I)$. It is called ridge regression because the diagonal of ones in the correlation matrix can be described as a ridge [6]. The ridge estimator of the coefficients is
$$\hat{\beta}_{ridge} = (X'X + \lambda I)^{-1} X'Y, \qquad \lambda \ge 0.$$
When $\lambda = 0$, the ridge estimator reduces to the OLS estimator. If all $\lambda$'s are the same, the resulting estimators are called the ordinary ridge estimators [14, 15]. It is often convenient to rewrite ridge regression in Lagrangian form:
$$\hat{\beta}_{ridge} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - x_i'\beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}.$$
Ridge regression has the ability to overcome this multicollinearity by constraining the coefficient estimates; hence, it can reduce the estimator's variance but introduces some bias [16].
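As a sketch of how such a fit can be obtained in R, the glmnet package (our choice of implementation; the paper does not name one) fits the penalized problem with alpha = 0 and selects λ by cross-validation, matching the tuning strategy described in Section 3.

```r
# Ridge regression sketch: in glmnet, alpha = 0 gives the pure L2 (ridge) penalty
# and cv.glmnet() chooses lambda by cross-validation.
# Assumes X (an n x p matrix) and y (a numeric vector), e.g. from the OLS sketch.
library(glmnet)
ridge_cv   <- cv.glmnet(X, y, alpha = 0)
beta_ridge <- coef(ridge_cv, s = "lambda.min")   # coefficients at the selected lambda

# Closed form at a fixed lambda (on centred and scaled X):
# beta_ridge <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
```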
b. The LASSO
The LASSO estimates $\hat{\beta}$ by the optimization problem
$$\hat{\beta}_{lasso} = \arg\min_{\beta} \frac{1}{2} \sum_{i=1}^{n} (y_i - x_i'\beta)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t$$
for some $t \ge 0$. By Lagrangian duality, there is a one-to-one correspondence between the constrained problem and the Lagrangian form
$$\hat{\beta}_{lasso} = \arg\min_{\beta} \left\{ \frac{1}{2} \sum_{i=1}^{n} (y_i - x_i'\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}.$$
For each value of $t$ in the range where the constraint $\sum_{j} |\beta_j| \le t$ is active, there is a corresponding value of $\lambda$ that yields the same solution from the Lagrangian form. Conversely, the solution $\hat{\beta}_\lambda$ of the Lagrangian problem solves the bound problem with $t = \sum_{j} |\hat{\beta}_{\lambda,j}|$ [17, 18].
Like ridge regression, penalizing the absolute values of
the coefficients introduces shrinkage towards zero. However,
unlike ridge regression, some of the coefficients are
shrunken all the way to zero; such solutions, with multiple
values that are identically zero, are said to be sparse. The
penalty thereby performs a sort of continuous variable
selection.
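A corresponding R sketch with glmnet (alpha = 1 for the L1 penalty) makes this sparsity visible; as before, the use of glmnet and of the lambda.min rule are our illustrative choices rather than the paper's stated implementation.

```r
# LASSO sketch: alpha = 1 gives the L1 penalty; some coefficients are shrunk
# exactly to zero (a sparse solution).
# Assumes X (an n x p matrix) and y (a numeric vector), e.g. from the OLS sketch.
library(glmnet)
lasso_cv   <- cv.glmnet(X, y, alpha = 1)
beta_lasso <- coef(lasso_cv, s = "lambda.min")
sum(beta_lasso[-1] == 0)   # number of slope coefficients set exactly to zero
```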
c. Principal Component Regression (PCR)
Let $V = [V_1, \ldots, V_p]$ be the $p \times p$ matrix whose columns are the normalized eigenvectors of $X'X$, and let $\lambda_1, \ldots, \lambda_p$ be the corresponding eigenvalues. Let $W = [W_1, \ldots, W_p] = XV$. Then $W_j = XV_j$ is the $j$-th sample principal component of $X$. The regression model can be written as
$$Y = X\beta + \varepsilon = XVV'\beta + \varepsilon = W\alpha + \varepsilon,$$
where $\alpha = V'\beta$. Under this formulation, the least squares estimator of $\alpha$ is
$$\hat{\alpha} = (W'W)^{-1} W'Y = \Lambda^{-1} W'Y,$$
with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$, and hence the principal component estimator of $\beta$ is defined by $\hat{\beta}_{PCR} = V_k \hat{\alpha}_k$, where $V_k$ and $\hat{\alpha}_k$ keep only the $k$ components associated with the large eigenvalues [19-21]. Calculation of OLS estimates via principal component regression may be numerically more stable than direct calculation [22]. Severe multicollinearity is detected as very small eigenvalues. To rid the data of the multicollinearity, principal component regression omits the components associated with the small eigenvalues.
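The construction above can be sketched in R as follows: principal components are obtained from the standardized X, the response is regressed on the leading k components, and the estimates are mapped back through the loadings. The 95% variance rule used here to pick k is an illustrative assumption; with severe multicollinearity one would drop the components with very small eigenvalues, as described in the text.

```r
# Principal component regression sketch.
# Assumes X (an n x p matrix) and y (a numeric vector), e.g. from the OLS sketch.
pca <- prcomp(X, center = TRUE, scale. = TRUE)   # rotation = V, x = scores W = XV
k   <- which(cumsum(pca$sdev^2) / sum(pca$sdev^2) >= 0.95)[1]   # retained components
W   <- pca$x[, 1:k, drop = FALSE]                # first k principal components
alpha_hat <- coef(lm(y ~ W))                     # least squares fit on the components
beta_pcr  <- pca$rotation[, 1:k, drop = FALSE] %*% alpha_hat[-1]
beta_pcr   # coefficients on the standardized explanatory variables
```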
2.3. Measurement of Performances
To evaluate the performances of the methods studied, the Average Mean Square Error (AMSE) of the estimated regression coefficients $\hat{\beta}$ is measured. The AMSE is defined by
$$\mathrm{AMSE}(\hat{\beta}) = \frac{1}{100} \sum_{l=1}^{100} \big(\hat{\beta}^{(l)} - \beta\big)'\big(\hat{\beta}^{(l)} - \beta\big),$$
where $\hat{\beta}^{(l)}$ denotes the estimated parameter vector in the $l$-th of the 100 simulation runs. An AMSE value close to zero indicates that the slope and intercept are correctly estimated. In addition, the Akaike Information Criterion (AIC) is used as a performance criterion, with formula $\mathrm{AIC} = 2k - 2\ln(\hat{L})$, where $\hat{L} = p(x \mid \hat{\theta}, M)$ is the maximized value of the likelihood function of model $M$, $x$ is the observed data, $n$ is the number of data points in $x$, and $k$ is the number of parameters estimated by the model [23, 24]. The best model is indicated by the lowest value of AIC.
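Under one natural reading of the AMSE formula above (the squared error of the whole coefficient vector averaged over the simulation runs), it can be computed in R as sketched below; beta_hats and beta_true are hypothetical names for the stacked estimates and the true coefficients. For a fitted lm object, R's built-in AIC() returns 2k − 2 ln(L̂) directly.

```r
# AMSE sketch: beta_hats is an L x p matrix (one row of estimates per simulation
# run) and beta_true the true coefficient vector of length p.
amse <- function(beta_hats, beta_true) {
  mean(rowSums(sweep(beta_hats, 2, beta_true)^2))   # average squared estimation error
}

# AIC = 2k - 2*ln(L_hat); for an OLS fit this is simply:
# AIC(lm(y ~ X))
```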
3. Methods
In this study, we consider the true model $Y = X\beta + \varepsilon$. We simulate data sets with sample sizes n = 25, 50, 75, 100, 200 containing severe multicollinearity among all explanatory variables (ρ = 0.99) using R, with 100 iterations. Following [25], the explanatory variables are generated by
$$x_{ij} = (1 - \rho^2)^{1/2} z_{ij} + \rho z_{i(p+1)}, \qquad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, p,$$
where the $z_{ij}$ are independent standard normal pseudo-random numbers and $\rho$ is specified so that the theoretical correlation between any two explanatory variables is $\rho^2$. The dependent variable $Y$ for each set of explanatory variables is generated from $Y = X\beta + \varepsilon$, with the $\beta$ parameter vector chosen arbitrarily (intercept equal to 0 and $\beta_j = 1$ otherwise) for p = 4, 6, 8, 10, 20 and $\varepsilon \sim N(0, 1)$. To measure the amount of multicollinearity in the data sets, the variance inflation factor (VIF) is examined. The performances of the OLS, LASSO, RR, and PCR methods are compared based on the values of AMSE and AIC. Cross-validation is used to find the value of $\lambda$ for RR and LASSO.
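A sketch of this data-generating scheme in R is given below for one combination of n, p and ρ; the seed and the particular n and p shown are arbitrary choices, and β is set with a zero intercept and unit slopes as described above.

```r
# Simulated explanatory variables with severe multicollinearity, following
# x_ij = sqrt(1 - rho^2) * z_ij + rho * z_i(p+1) as in McDonald and Galarneau [25].
set.seed(2018)
n <- 100; p <- 4; rho <- 0.99
Z <- matrix(rnorm(n * (p + 1)), n, p + 1)           # independent N(0, 1) draws
X <- sqrt(1 - rho^2) * Z[, 1:p] + rho * Z[, p + 1]  # pairwise correlation about rho^2
beta <- rep(1, p)                                   # slopes 1, intercept 0
y <- drop(X %*% beta + rnorm(n))                    # errors ~ N(0, 1)

round(cor(X), 2)   # correlations close to rho^2 = 0.98
# vif_manual(X)    # VIFs far above 10 (helper sketched in Section 2.2)
```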
4. Results and Discussion
The existence of severe multicollinearity in the explanatory variables is examined through the VIF values for all given cases. The analysis of the simulated data sets with p = 4, 6, 8, 10, 20 and n = 25, 50, 75, 100, 200 shows that the VIF values among all the explanatory variables lie between 40 and 110. This indicates that severe multicollinearity among all explanatory variables is present in the simulated data generated from the specified model, and that all the regression coefficients appear to be affected by collinearity. The LASSO method chooses which covariates to include in the model, acting as a continuous counterpart of stepwise selection. In this study, however, LASSO cannot overcome severe multicollinearity among all explanatory variables, since it reduces the VIF values only slightly. In every simulated case studied, RR reduces the VIF values to below 10 and PCR reduces them to 1. Using these data, we compute the alternative estimation methods alongside OLS. The experiment is repeated 100 times to obtain accurate estimates, and the AMSE of each estimator is recorded. The results of the simulations are given in Table 1.
To compare the four methods easily, the AMSE results in Table 1 are presented as graphs in Figures 1-5. From those figures, it is seen that OLS has the highest AMSE in every case studied, followed by LASSO. Both OLS and LASSO are unable to resolve the severe multicollinearity problem. On the other hand, RR gives lower AMSE than OLS and LASSO, but still higher than PCR. Ridge regression and PCR seem to improve prediction accuracy by shrinking large regression coefficients in order to reduce overfitting. The lowest AMSE is given by PCR in every case.
This clearly indicates that PCR is the most accurate estimator when severe multicollinearity is present. The results also show that sample size affects the AMSE values: the larger the sample size, the lower the AMSE of each estimator. The number of explanatory variables does not seem to affect the accuracy of PCR.
Table 1. Average Mean Square Error of OLS, LASSO, RR, and PCR

p    n     OLS        LASSO      RR        PCR
4    25    5.7238     3.2880     0.5484    0.0169
     50    3.2870     2.5210     0.3158    0.0035
     75    2.3645     2.0913     0.2630    0.0029
     100   1.7750     1.6150     0.2211    0.0017
     200   0.8488     0.8438     0.1512    0.0009
6    25    15.3381    6.5222     0.5235    0.0078
     50    5.3632     4.0902     0.4466    0.0051
     75    4.0399     3.4828     0.3431    0.0031
     100   2.8200     2.5032     0.2939    0.0020
     200   1.3882     1.3848     0.2044    0.0013
8    25    20.4787    8.7469     0.5395    0.0057
     50    8.2556     5.9925     0.4021    0.0037
     75    5.6282     4.7016     0.3923    0.0018
     100   3.8343     3.4771     0.3527    0.0017
     200   1.9906     1.9409     0.2356    0.0008
10   25    27.9236    12.3202    1.2100    0.0119
     50    12.1224    7.8290     0.5129    0.0089
     75    7.0177     5.8507     0.4293    0.0035
     100   4.7402     4.3165     0.3263    0.0022
     200   2.5177     2.4565     0.2655    0.0013
20   25    396.6900   33.6787    1.0773    0.0232
     50    33.8890    16.4445    0.7861    0.0065
     75    18.5859    13.1750    0.6927    0.0052
     100   12.1559    9.7563     0.5670    0.0032
     200   5.5153     5.2229     0.4099    0.0014
Figure 1. AMSE of OLS, LASSO, RR and PCR for p=4
Figure 2. AMSE of OLS, LASSO, RR and PCR for p=6
Figure 3. AMSE of OLS, LASSO, RR and PCR for p=8
Figure 4. AMSE of OLS, LASSO, RR and PCR for p=10
Figure 5. AMSE of OLS, LASSO, RR and PCR for p=20
To choose the best model, we use the Akaike Information Criterion (AIC) of the models obtained with the four methods being studied. The AIC values for all methods with different numbers of explanatory variables and sample sizes are presented in Table 2 and displayed as bar graphs in Figures 6-10.
Figures 6-10 show that the larger the sample size, the lower the AIC value; in contrast to sample size, the number of explanatory variables does not seem to affect the AIC values. OLS has the highest AIC values at every level of explanatory variables and every sample size. Among the regularized methods, LASSO has the highest AIC values compared to RR and PCR. The differences in AIC values between PCR and RR are small. PCR is the best of the selected methods based on the AIC values as well. This is consistent with the result in Table 1, where PCR has the smallest AMSE among all the methods applied in the study. PCR is effective and efficient for both small and large numbers of regressors. This finding is in accordance with a previous study [20].
Table 2. AIC values for OLS, LASSO, RR, and PCR with different numbers of explanatory variables and sample sizes

p    Method    n=25      n=50      n=75      n=100     n=200
4    OLS       8.4889    8.2364    8.2069    8.1113    8.0590
     LASSO     8.4640    8.2320    8.2056    8.1108    8.0589
     RR        8.3581    8.1712    8.1609    8.0774    8.0439
     PCR       8.2854    8.1223    8.1173    8.0439    8.0239
6    OLS       8.7393    8.3541    8.2842    8.1457    8.0862
     LASSO     8.6640    8.3449    8.2806    8.1443    8.0861
     RR        8.4434    8.2333    8.1995    8.0868    8.0598
     PCR       8.3257    8.1521    8.1327    8.0355    8.0281
8    OLS       8.8324    8.3983    8.3323    8.2125    8.1060
     LASSO     8.7181    8.3816    8.3259    8.2104    8.1058
     RR        8.3931    8.2039    8.2062    8.1247    8.0660
     PCR       8.2488    8.1069    8.1162    8.0550    8.0254
10   OLS       9.0677    8.4906    8.3794    8.2595    8.1142
     LASSO     8.9011    8.4556    8.3711    8.2570    8.1140
     RR        8.4971    8.2275    8.2120    8.1446    8.0608
     PCR       8.2405    8.0969    8.1035    8.0674    8.0104
20   OLS       11.3154   9.1698    8.7443    8.5138    8.2652
     LASSO     9.8490    9.0324    8.7055    8.4968    8.2638
     RR        8.5775    8.4475    8.3195    8.2202    8.1390
     PCR       8.2628    8.2138    8.1375    8.0759    8.0535
Figure 6. Bar-graph of AIC for p=4
Figure 7. Bar-graph of AIC for p=6
Figure 8. Bar-graph of AIC for p=8
Figure 9. Bar-graph of AIC for p=10
Figure 10. Bar-graph of AIC for p=20
5. Conclusions
Based on the simulation results for p = 4, 6, 8, 10, and 20 and sample sizes n = 25, 50, 75, 100, and 200 with severe multicollinearity among all explanatory variables, it can be concluded that the RR and PCR methods are capable of overcoming the severe multicollinearity problem. In contrast, the LASSO method does not resolve the problem well when all variables are severely correlated, even though LASSO does better than OLS. Overall, PCR performs best in estimating the regression coefficients on data containing severe multicollinearity among all explanatory variables.
REFERENCES
[1] Draper, N.R. and Smith, H. Applied Regression Analysis. 3rd
edition. New York: Wiley, 1998.
[2] Gujarati, D. Basic Econometrics. 4th ed. New York: McGraw-Hill, 1995.
[3] Judge, G.G., Introduction to Theory and Practice of Econometrics. New York: John Wiley and Sons, 1988.
[4] Montgomery, D.C. and Peck, E.A., Introduction to Linear Regression Analysis. New York: John Wiley and Sons, 1992.
[5] Kutner, M.H. et al., Applied Linear Statistical Models. 5th
Edition. New York: McGraw-Hill, 2005.
[6] Hoerl, A.E. and Kennard, R.W., 2000, Ridge Regression:
Biased Estimation for nonorthogonal Problems.
Technometrics, 42, 80-86.
[7] Melkumova, L.E. and Shatskikh, S.Ya. 2017. Comparing Ridge and LASSO estimators for data analysis. Procedia Engineering, 201, 746-755.
[8] Boulesteix, A-L., R. De Bin, X. Jiang and M. Fuchs. 2017.
IPF-LASSO: Integrative-Penalized Regression with Penalty
Factors for Prediction Based on Multi-Omics Data.
Computational and Mathematical Methods in Medicine, 2017,
14 p.
[9] Helton, K.H. and N.L. Hjort. 2018. Fridge: Focused
fine-tuning of ridge regression for personalized predictions.
Statistical Medicine, 37(8), 1290-1303.
[10] Abdel Bary, M.N. 2017. Robust Regression Diagnostic for Detecting and Solving Multicollinearity and Outlier Problems: Applied Study by Using Financial Data. Applied Mathematical Sciences, 11(13), 601-622.
[11] Usman, U., D. Y. Zakari, S. Suleman and F. Manu. 2017. A
Comparison Analysis of Shrinkage Regression Methods of
Handling Multicollinearity Problems Based on Lognormal
and Exponential Distributions. MAYFEB Journal of
Mathematics, 3, 45-52.
[12] Slawski, M. 2017. On Principal Components Regression, Random Projections, and Column Subsampling. arXiv:1709.08104v2 [math.ST].
[13] Wethrill, H., 1986, Evaluation of ordinary Ridge Regression.
Bulletin of Mathematical Statistics, 18, 1-35.
[14] Hoerl, A.E., 1962, Application of ridge analysis to regression
problems. Chem. Eng. Prog., 58, 54-59.
[15] Hoerl, A.E., R.W. Kannard and K.F. Baldwin, 1975, Ridge
regression: Some simulations. Commun. Stat., 4, 105-123.
[16] James, G., Witten, D., Hastie, T. and Tibshirani, R., An Introduction to Statistical Learning: With Applications in R. New York: Springer Publishing Company, Inc., 2013.
[17] Tibshirani, R., 1996, Regression shrinkage and selection via
the LASSO. J Royal Stat Soc, 58, 267-288.
[18] Hastie, T., Tibshirani, R. and Wainwright, M., 2015, Statistical Learning with Sparsity: The Lasso and Generalizations. USA: Chapman and Hall/CRC Press.
[19] Coxe, K.L., 1984, Multicollinearity, principal component regression and selection rules for these components. ASA Proceedings of the Business and Economic Statistics Section, 222-227.
[20] Jackson, J.E., A User's Guide to Principal Components. New York: Wiley, 1991.
[21] Jolliffe, I.T., Principal Component Analysis. New York: Springer-Verlag, 2002.
[22] Flury, B. and Riedwyl, H., Multivariate Statistics. A Practical
Approach, London: Chapman and Hall, 1988.
[23] Akaike, H. 1973. Information theory and an extension of the maximum likelihood principle. In B.N. Petrov and F. Csaki (eds), Second International Symposium on Information Theory (pp. 267-281). Budapest: Akademiai Kiado.
[24] Akaike, H. 1974. A new look at the statistical model
identification. IEEE Transactions on Automatic Control, 19,
716-723.
[25] McDonald G.C. and Galarneau, D.I., 1975, A Monte Carlo
evaluation of some ridge type estimators. J. Amer. Statist.
Assoc., 20, 407-416.
[26] Zhang, M., Zhu, J., Djurdjanovic, D. and Ni, J. 2006, A
comparative Study on the Classification of Engineering
Surfaces with Dimension Reduction and Coefficient
Shrinkage Methods. Journal of Manufacturing Systems, 25(3):
209-220.