The R package islasso:
estimation and hypothesis testing
in lasso regression
Gianluca Sottile∗, Giovanna Cilluffo†, Vito M.R. Muggeo∗
Abstract
In this short note we present and briefly discuss the R package islasso, dealing
with regression models having a large number of covariates. Estimation is
carried out by penalizing the coefficients via a quasi-lasso penalty, wherein
the nonsmooth lasso penalty is replaced by its smooth counterpart determined
iteratively by the data according to the induced smoothing idea. The package
includes functions to estimate the model and to test linear hypotheses on
linear combinations of relevant coefficients. We illustrate the R code through
a worked example, intentionally omitting methodological details and an
extended bibliography.
1 Introduction
Let y = Xβ + ε be the linear model of interest, with the usual zero-mean and
homoscedastic errors ε. As usual, y = (y_1, ..., y_n)^T is the response vector and X is
the n × p design matrix (with p quite large) associated with the regression coefficient vector β.
When interest lies in selecting the non-noise covariates and estimating the relevant
effects, one assumes the lasso penalized objective function (Tibshirani, 1996),
$$\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 \tag{1}$$
to be minimized at fixed λ > 0. As is well known, the lasso penalty ||β||_1
rules out the noise covariates by returning exactly zero estimates in the model
output.
Model estimation via the aforementioned penalized objective does not come
without a price. The non-null estimates are shrunken towards zero and, probably
more importantly, inference on the model is complicated and not straightforward.
In other words, no confidence intervals or p-values on (linear combinations of) β
are easily obtained. The package islasso aims at filling this gap, at least partially.
More specifically, at the time of writing, the package returns point estimates, reliable
standard errors and corresponding p-values for the regression coefficients and any
linear combination of them. We do not provide details of the methodology, which
can be found in Cilluffo et al. (2019). Rather, we describe the R functions in the
package through a worked example.
∗Dip. Scienze Economiche, Aziendali e Statistiche, Università di Palermo, Italy. Email:
gianluca.sottile@unipa.it, vito.muggeo@unipa.it
†Istituto per la Ricerca e l'Innovazione Biomedica (IRIB), Consiglio Nazionale delle Ricerche
(CNR), Palermo, Italy. Email: giovanna.cilluffo@ibim.cnr.it
2 The R functions
The main function of the package is islasso(), where the user supplies the model
formula as in the usual lm or glm functions, i.e.
islasso(formula, family, lambda, alpha, data, weights, subset,
offset, unpenalized, contrasts, control = is.control())
family accepts the specification of the family and link function as reported in
Table 1, lambda is the tuning parameter, and unpenalized allows one to indicate
covariates whose coefficients are left unpenalized; a minimal call is sketched right
after Table 1.
Table 1: Families and link functions allowed in islasso

family     link
gaussian   identity
binomial   logit, probit
poisson    log
gamma      identity, log, inverse
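For instance, a minimal call on simulated data could be as follows; this is just a
sketch, where the data, the value lambda = 2 and the Gaussian family are purely
illustrative and the remaining arguments are left at their defaults.

> library(islasso)
> set.seed(1)
> n <- 100; p <- 10 #small illustrative dimensions
> X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
> y <- drop(X %*% c(2, -2, rep(0, p - 2))) + rnorm(n) #only x1 and x2 matter
> d <- data.frame(y, X)
> o <- islasso(y ~ ., family = gaussian, lambda = 2, data = d)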
The fitter function is islasso.fit(), which reads as
islasso.fit(X, y, family, lambda, alpha = 1, intercept = FALSE,
weights = NULL, offset = NULL, unpenalized = NULL,
control)
which actually implements the estimating algorithm as described in the paper.
The lambda argument in islasso.fit and islasso specifies the positive tuning
parameter in the penalized objective. Any non-negative value can be provided, but
if missing, it is computed via K-fold cross validation by the function cv.glmnet()
from package glmnet (Friedman et al., 2010). The number of folds being used can
be specified via the argument nfolds of the auxiliary function is.control().
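Continuing the sketch above, for instance, λ could be selected via 5-fold cross
validation as follows (again a sketch, with all other is.control() settings left at
their defaults):

> o <- islasso(y ~ ., data = d, control = is.control(nfolds = 5))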
3 A worked example: the Diabetes data set
We use the well-known diabetes dataset available in the lars package. The data refer
to n = 442 patients enrolled to investigate a measure of disease progression one
year after baseline. There are ten covariates: age, sex, bmi (body mass
index), map (average blood pressure) and six blood serum measurements (tc,
ldl, hdl, tch, ltg, glu). The matrix x2 in the dataframe also includes second-order
terms, namely first-order interactions between covariates and quadratic terms for
the continuous variables.
To select the important terms in the regression equation we apply the lasso
> library(glmnet)
> library(lars)
> data(diabetes)
> a1 <- with(diabetes, cv.glmnet(x2, y))
> n <- nrow(diabetes)
> a1$lambda.min*n #the lambda value of (1)
[1] 1344.186
>
> b <- drop(coef(a1, "lambda.min")) #coeffs at the optimum lambda
> length(b[b != 0])
[1] 15
Ten-fold cross validation ‘selects’ λ = 1344.2, corresponding to 15 non-null
coefficients; just to illustrate, the last few of them are
> tail(b[b != 0])
glu^2 age:sex age:map age:ltg age:glu bmi:map
69.599081 107.479925 29.970061 8.506032 11.675332 85.530937
A reasonable question is whether all the ‘selected’ coefficients are significant in the
model. Unfortunately, lasso regression does not return standard errors due to the
nonsmoothness of the objective, and some alternative approaches have been proposed.
One of them is the ‘covariance test’ (Lockhart et al., 2014), as implemented in the
package covTest
> library(covTest)
>
> o <- with(diabetes, lars(x2, y))
> with(diabetes, covTest(o, x2, y))
$results
Predictor_Number Drop_in_covariance P-value
3 20.1981 0.0000
9 52.5964 0.0000
4 5.7714 0.0034
7 4.0840 0.0176
37 1.3310 0.2655
20 0.3244 0.7232
.....................
The covTest approach suggests that only the terms corresponding to columns 3,
9, 4, and 7 of the matrix x2 are significant, namely
> colnames(diabetes$x2)[c(3, 9, 4, 7)]
[1] "bmi" "ltg" "map" "hdl"
However, covTest returns p-values across the λ path. This means that such p-values
are not matched to the corresponding point estimates obtained at the optimal λ
value (λ = 1344.2 in this example). As a consequence, some discrepancies between
the results of covTest and glmnet/lars are likely to occur: for instance, out of the
15 selected non-null coefficients, just 4 are assessed as significant.
The R package islasso provides an alternative to covTest by implementing the
recent ‘quasi-lasso’ approach based on the induced smoothing idea (Brown and
Wang, 2005), as discussed in Cilluffo et al. (2019). Point estimates and p-values are
returned within the same framework. While the optimal lambda could be selected
automatically (by not supplying any value to lambda), we use the same value as
above to facilitate comparisons
> library(islasso)
> out <- islasso(y ~ x2, data=diabetes, lambda=1344.186)
The summary method quickly returns the main output of the fitted model, in-
cluding point estimates, standard errors and p-values
> summary(out)
Call:
islasso(formula = y ~ x2, lambda = 1344.186, data = diabetes)
Residuals:
Min 1Q Median 3Q Max
-138.74 -40.18 -4.53 34.45 143.43
Estimate Std. Error Df z value Pr(>|z|)
(Intercept) 1.521e+02 2.554e+00 1.000 59.570 < 2e-16 ***
x2age 1.873e-01 2.408e+01 0.005 0.008 0.99379
x2sex -1.149e+02 5.377e+01 0.891 -2.137 0.03258 *
x2bmi 4.952e+02 7.058e+01 1.000 7.016 2.29e-12 ***
x2map 2.514e+02 6.447e+01 0.999 3.899 9.64e-05 ***
x2tc -4.514e-01 2.837e+01 0.012 -0.016 0.98730
......................
x2tch:glu 2.848e-01 2.546e+01 0.006 0.011 0.99107
x2ltg:glu 2.712e-01 3.611e+01 0.005 0.008 0.99401
......................
Visualizing estimates for all covariates can be somewhat inconvenient, especially
when the number of covariates is large; thus one can opt to print estimates
only if their p-value is less than a specified threshold. We use 0.10 as the threshold.
> summary(out, pval = .1)
Call:
islasso(formula = y ~ x2, lambda = 1344.186, data = diabetes)
Residuals:
Min 1Q Median 3Q Max
-138.74 -40.18 -4.53 34.45 143.43
Estimate Std. Error Df z value Pr(>|z|)
(Intercept) 152.133 2.554 1.000 59.570 < 2e-16 ***
x2sex -114.923 53.773 0.891 -2.137 0.03258 *
x2bmi 495.168 70.581 1.000 7.016 2.29e-12 ***
x2map 251.409 64.473 0.999 3.899 9.64e-05 ***
x2hdl -189.213 67.826 0.978 -2.790 0.00528 **
x2ltg 466.026 70.701 1.000 6.592 4.35e-11 ***
x2age:sex 109.177 50.732 0.904 2.152 0.03139 *
x2bmi:map 86.476 47.404 0.812 1.824 0.06812 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 2855.198)
Null deviance: 2621009 on 441.0 degrees of freedom
Residual deviance: 1222391 on 428.1 degrees of freedom
AIC: 4786.9
Lambda: 1344.2
Number of Newton-Raphson iterations: 40
In addition to the usual information printed by the summary method, the output
also includes the column Df, representing the degrees of freedom of each coefficient.
Negligible coefficients (i.e., those with an approximately null estimate) exhibit almost
zero degrees of freedom; see the previous output of summary(out). The sum of all
degrees of freedom is used to quantify the model complexity
> sum(out$internal$hi)
[1] 13.87174
and the corresponding residual degrees of freedom (442 − 13.87 = 428.1) are printed
next to the residual deviance, as reported above. The Wald statistics (column z value)
and p-values can be used to assess important or significant covariates. In addition
to the terms flagged by covTest ("bmi", "ltg", "map", "hdl"), islasso() also returns
‘small’ p-values for the terms "sex", "age:sex", and "bmi:map". Simulation studies
in Cilluffo et al. (2019) have shown good performance of islasso with respect to
some alternative approaches.
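Since standard errors are returned, approximate 95% Wald confidence intervals
follow directly from the printed output; for instance, using the estimate and
standard error of bmi (a by-hand sketch, not a dedicated package function):

> 495.168 + c(-1, 1) * qnorm(0.975) * 70.581 #estimate +/- z * SE
[1] 356.8318 633.5042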
As an alternative to cross validation, it is also possible to select the tuning
parameter λ by means of the Bayesian or Akaike information criterion. The function
aic.islasso requires an islasso fit object and the specification of the criterion to
be used (AIC/BIC). Hence
> lmb.bic <- aic.islasso(out, "bic")
> out1 <- update(out, lambda = lmb.bic) #fit with a BIC-based lambda
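The AIC-based choice works in the same way; a sketch, assuming the criterion
label is passed analogously:

> lmb.aic <- aic.islasso(out, "aic")
> out2 <- update(out, lambda = lmb.aic) #fit with an AIC-based lambda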
Comparisons among methods for selecting the tuning parameter, and further
discussion, are beyond our goals.
We conclude this short note by emphasizing that islasso also accepts the so-called
elastic-net penalty, namely
$$\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\left\{\alpha\|\beta\|_1 + \frac{1-\alpha}{2}\|\beta\|_2^2\right\}$$
where 0 ≤ α ≤ 1 is the mixing parameter to be specified in islasso() and
islasso.fit() via the argument alpha.
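For instance, an equal mix of the two penalties in the diabetes example could be
requested as follows (the value alpha = 0.5 being purely illustrative):

> out.en <- islasso(y ~ x2, data = diabetes, lambda = 1344.186, alpha = 0.5)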
4 Concluding remarks
The package islasso provides an alternative to the ‘plain’ lasso regression. The
main disadvantage with respect to lasso lies in the point estimates: islasso does not
perform variable selection, in that the point estimates will be never exactly zero,
however differences in terms of findings will be typically negligible. However unlike
the plain lasso, islasso is able to return reliable standard errors and p-values which
can be used to assess significance of coefficients.
References
Brown B and Wang Y. Standard errors and covariance matrices for smoothed rank
estimators. Biometrika 2005; 92: 149–158.

Cilluffo G, Sottile G, La Grutta S and Muggeo VMR. The induced smoothed lasso:
a practical framework for hypothesis testing in high dimensional regression.
Statistical Methods in Medical Research 2019; online first.
doi: 10.1177/0962280219842890.

Friedman J, Hastie T and Tibshirani R. Regularization paths for generalized linear
models via coordinate descent. Journal of Statistical Software 2010; 33(1): 1–22.

Lockhart R, Taylor J, Tibshirani R, et al. A significance test for the lasso. Annals
of Statistics 2014; 42: 413–468.

Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society, Series B 1996; 58: 267–288.