
The R package islasso:

estimation and hypothesis testing

in lasso regression

Gianluca Sottile∗, Giovanna Cilluffo†, Vito M.R. Muggeo∗

Abstract

In this short note we present and briefly discuss the R package islasso, which deals with regression models having a large number of covariates. Estimation is carried out by penalizing the coefficients via a quasi-lasso penalty, wherein the nonsmooth lasso penalty is replaced by a smooth counterpart determined iteratively from the data according to the induced smoothing idea. The package includes functions to estimate the model and to test linear hypotheses on linear combinations of the relevant coefficients. We illustrate the R code through a worked example, intentionally avoiding methodological details and an extended bibliography.

1 Introduction

Let y = Xβ + ε be the linear model of interest, with the usual zero-mean and homoscedastic errors. As usual, y = (y1, . . . , yn)^T is the response vector and X is the n × p design matrix (with p quite large) associated with the regression coefficients β.

When interest lies in selecting the non-noise covariates and estimating the relevant effects, one assumes the lasso penalized objective function (Tibshirani, 1996),

½ ||y − Xβ||₂² + λ ||β||₁        (1)

to be minimized at fixed λ > 0. As is well known, the lasso penalty ||β||₁ makes it possible to rule out the noise covariates by returning exactly zero estimates in the model output.
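It is worth recalling why the ℓ1 penalty produces exact zeros: the coordinate-wise lasso update is a soft-thresholding operation, which maps any sufficiently small input to exactly zero. A minimal base-R sketch (the name soft_threshold is ours, for illustration only):

```r
# Soft-thresholding operator: for a standardized single-predictor problem
# the lasso solution is sign(z) * max(|z| - lambda, 0), so any |z| <= lambda
# is mapped to exactly zero -- the mechanism behind the lasso's sparsity.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

soft_threshold(c(-3, -0.5, 0.2, 2), lambda = 1)
# -> -2  0  0  1   (small inputs zeroed, large ones shrunk towards zero)
```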

Model estimation via the aforementioned penalized objective does not come without a price. The non-null estimates are shrunk towards zero and, probably more importantly, inference on the model is complicated and not straightforward.

In other words, no confidence intervals or p-values on (linear combinations of) β are easily obtained. The islasso package aims at filling this gap, at least partially. More specifically, at the time of writing, the package returns point estimates, reliable standard errors and corresponding p-values for the regression coefficients and any linear combination of them. We do not provide details of the methodology, which can be found in Cilluffo et al. (2019). Rather, we describe the R functions in the package through a worked example.

∗Dip. Scienze Economiche, Aziendali e Statistiche, Università di Palermo, Italy. Email: gianluca.sottile@unipa.it, vito.muggeo@unipa.it

†Istituto per la Ricerca e l'Innovazione Biomedica (IRIB), Consiglio Nazionale delle Ricerche (CNR), Palermo, Italy. Email: giovanna.cilluffo@ibim.cnr.it


2 The R functions

The main function of the package is islasso(), where the user supplies the model formula as in the usual lm or glm functions, i.e.

islasso(formula, family, lambda, alpha, data, weights, subset,

offset, unpenalized, contrasts, control = is.control())

family accepts the specification of family and link function as in Table 1, lambda is the tuning parameter, and unpenalized allows the user to indicate covariates with unpenalized coefficients.

Table 1: Families and link functions allowed in islasso

family      link
gaussian    identity
binomial    logit, probit
poisson     log
gamma       identity, log, inverse

The fitter function is islasso.fit(), which reads as

islasso.fit(X, y, family, lambda, alpha = 1, intercept = FALSE,

weights = NULL, offset = NULL, unpenalized = NULL,

control)

This function actually implements the estimating algorithm described in the paper.

The lambda argument in islasso.fit and islasso specifies the positive tuning parameter in the penalized objective. Any non-negative value can be provided but, if missing, it is computed via K-fold cross validation by the function cv.glmnet() from the package glmnet (Friedman et al., 2010). The number of folds can be specified via the argument nfolds of the auxiliary function is.control().

3 A worked example: the Diabetes data set

We use the well-known diabetes dataset available in the lars package. The data refer to n = 442 patients enrolled to investigate a measure of disease progression one year after baseline. There are ten covariates: age, sex, bmi (body mass index), map (average blood pressure) and several blood serum measurements (tc, ldl, hdl, tch, ltg, glu). The matrix x2 in the dataframe also includes second-order terms, namely first-order interactions between covariates and quadratic terms for the continuous variables.
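The x2 matrix ships precomputed with the lars package, but the same kind of design matrix can be assembled from any covariate set with base R alone. A hedged sketch with toy columns a, b and c (illustrative names, not the diabetes variables):

```r
# Build an x2-style matrix from a plain data frame of covariates:
# ~ .^2 expands to all main effects plus pairwise interactions,
# and the I(...) terms add the quadratic columns.
set.seed(3)
d <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20))
x2_toy <- model.matrix(~ .^2 + I(a^2) + I(b^2) + I(c^2), data = d)[, -1]
colnames(x2_toy)
# 3 main effects, 3 squared terms and the interactions a:b, a:c, b:c
```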

To select the important terms in the regression equation we apply the lasso:

> library(glmnet)

> library(lars)

> data(diabetes)

> a1 <- with(diabetes, cv.glmnet(x2, y))

> n <- nrow(diabetes)

> a1$lambda.min*n #the lambda value of (1)

[1] 1344.186

>

> b <- drop(coef(a1, "lambda.min")) #coeffs at the optimum lambda

> length(b[b != 0])

[1] 15


Ten-fold cross validation 'selects' λ = 1344.2, corresponding to 15 non-null coefficients, the last few of which are, just to illustrate,

> tail(b[b != 0])

glu^2 age:sex age:map age:ltg age:glu bmi:map

69.599081 107.479925 29.970061 8.506032 11.675332 85.530937

A reasonable question is whether all the 'selected' coefficients are significant in the model. Unfortunately, lasso regression does not return standard errors due to the nonsmoothness of the objective, and some alternative approaches have been proposed. One of them is the 'covariance test' (Lockhart et al., 2014), as implemented in the package covTest

> library(covTest)

>

> o <- with(diabetes, lars(x2, y))

> with(diabetes, covTest(o, x2, y))

$results

Predictor_Number Drop_in_covariance P-value

3 20.1981 0.0000

9 52.5964 0.0000

4 5.7714 0.0034

7 4.0840 0.0176

37 1.3310 0.2655

20 0.3244 0.7232

.....................

The covTest approach suggests that only the terms corresponding to columns 3, 9, 4 and 7 of the matrix x2 are significant, namely

> colnames(diabetes$x2)[c(3, 9, 4, 7)]

[1] "bmi" "ltg" "map" "hdl"

However, covTest returns p-values across the λ path. This means that such p-values are not matched to the corresponding point estimates obtained at the optimal λ value (λ = 1344.2 in this example). As a consequence, some discrepancies between the results of covTest and glmnet/lars are likely to occur. For instance, out of the 15 selected non-null coefficients, just 4 are assessed as significant.

The R package islasso provides an alternative to covTest by implementing the recent 'quasi-lasso' approach based on the induced smoothing idea (Brown and Wang, 2005), as discussed in Cilluffo et al. (2019). Point estimates and p-values are returned within the same framework. While the optimal lambda could be selected automatically (by not supplying any value to lambda), we use the value obtained above to facilitate comparisons

> library(islasso)

> out <- islasso(y ~ x2, data=diabetes, lambda=1344.186)

The summary method quickly returns the main output of the fitted model, including point estimates, standard errors and p-values

> summary(out)

Call:

islasso(formula = y ~ x2, lambda = 1344.186, data = diabetes)


Residuals:

Min 1Q Median 3Q Max

-138.74 -40.18 -4.53 34.45 143.43

Estimate Std. Error Df z value Pr(>|z|)

(Intercept) 1.521e+02 2.554e+00 1.000 59.570 < 2e-16 ***

x2age 1.873e-01 2.408e+01 0.005 0.008 0.99379

x2sex -1.149e+02 5.377e+01 0.891 -2.137 0.03258 *

x2bmi 4.952e+02 7.058e+01 1.000 7.016 2.29e-12 ***

x2map 2.514e+02 6.447e+01 0.999 3.899 9.64e-05 ***

x2tc -4.514e-01 2.837e+01 0.012 -0.016 0.98730

......................

x2tch:glu 2.848e-01 2.546e+01 0.006 0.011 0.99107

x2ltg:glu 2.712e-01 3.611e+01 0.005 0.008 0.99401

......................

Visualizing estimates for all covariates can be inconvenient, especially when the number of covariates is large, so one can opt to print estimates only when their p-value is smaller than a specified threshold. We use 0.10 as the threshold.

> summary(out, pval = .1)

Call:

islasso(formula = y ~ x2, lambda = 1344.186, data = diabetes)

Residuals:

Min 1Q Median 3Q Max

-138.74 -40.18 -4.53 34.45 143.43

Estimate Std. Error Df z value Pr(>|z|)

(Intercept) 152.133 2.554 1.000 59.570 < 2e-16 ***

x2sex -114.923 53.773 0.891 -2.137 0.03258 *

x2bmi 495.168 70.581 1.000 7.016 2.29e-12 ***

x2map 251.409 64.473 0.999 3.899 9.64e-05 ***

x2hdl -189.213 67.826 0.978 -2.790 0.00528 **

x2ltg 466.026 70.701 1.000 6.592 4.35e-11 ***

x2age:sex 109.177 50.732 0.904 2.152 0.03139 *

x2bmi:map 86.476 47.404 0.812 1.824 0.06812 .

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 2855.198)

Null deviance: 2621009 on 441.0 degrees of freedom

Residual deviance: 1222391 on 428.1 degrees of freedom

AIC: 4786.9

Lambda: 1344.2

Number of Newton-Raphson iterations: 40

In addition to the usual information printed by the summary method, the output also includes the column Df, representing the degrees of freedom of each coefficient. Negligible coefficients (i.e. those with an approximately null estimate) exhibit almost zero degrees of freedom; see the previous output of summary(out). The sum of all the degrees of freedom is used to quantify the model complexity


> sum(out$internal$hi)

[1] 13.87174

and the corresponding residual degrees of freedom (428.1) are printed next to the residual deviance, as reported above. The Wald statistics (column z value) and p-values can be used to assess important or significant covariates. In addition to the terms flagged by covTest ("bmi", "ltg", "map", "hdl"), islasso() also returns 'small' p-values for the terms "sex", "age:sex" and "bmi:map". Simulation studies in Cilluffo et al. (2019) have shown good performance of islasso with respect to some alternative approaches.
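The Df column can be read through a ridge-type analogy. Although islasso's smoothed penalty is not a plain ridge penalty, for a generic quadratic penalty the per-coefficient degrees of freedom arise as the diagonal of (X'X + λI)⁻¹X'X, and their sum plays the role of the effective model dimension. A minimal sketch under this simplifying assumption (not islasso's exact internals):

```r
# Per-coefficient degrees of freedom for a ridge-type smoother:
# each diagonal entry of (X'X + lambda*I)^{-1} X'X lies in (0, 1),
# and the trace gives the effective model complexity.
set.seed(1)
n <- 50; p <- 4
X <- matrix(rnorm(n * p), n, p)
lambda <- 10
H <- solve(crossprod(X) + lambda * diag(p)) %*% crossprod(X)
df_j <- diag(H)   # one value per coefficient, shrunk below 1 by the penalty
sum(df_j)         # effective degrees of freedom, strictly less than p
```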

As an alternative to cross validation, it is also possible to select the tuning parameter λ by means of the Bayesian or Akaike information criterion. The function aic.islasso requires an islasso fit object and the specification of the criterion to be used (AIC/BIC). Hence

> lmb.bic <- aic.islasso(out, "bic")

> out1 <- update(out, lambda = lmb.bic) #fit with a BIC-based lambda
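The criterion minimized by aic.islasso can be sketched generically: scan a grid of λ values and keep the one minimizing BIC = n log(RSS/n) + log(n) · df, with df the trace of the hat matrix. The ridge-type fit below is a simplified stand-in for islasso's estimator, used only to illustrate the selection logic:

```r
# Grid search for lambda by BIC on a ridge-type fit (illustrative only).
set.seed(2)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(2, -1, 0, 0, 0) + rnorm(n)
bic <- function(lambda) {
  H <- X %*% solve(crossprod(X) + lambda * diag(p)) %*% t(X)  # hat matrix
  rss <- sum((y - H %*% y)^2)
  n * log(rss / n) + log(n) * sum(diag(H))  # BIC with effective df
}
grid <- 10^seq(-2, 3, length.out = 30)
lmb_bic <- grid[which.min(sapply(grid, bic))]  # BIC-optimal lambda
```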

Comparisons between methods for selecting the tuning parameter, and further discussion, are beyond our goals.

We conclude this short note by emphasizing that islasso also accepts the so-called elastic-net penalty, namely

½ ||y − Xβ||₂² + λ {α ||β||₁ + ½ (1 − α) ||β||₂²}

where 0 ≤ α ≤ 1 is the mixing parameter, to be specified in islasso() and islasso.fit() via the argument alpha.
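Written out as a base-R function, the penalty above makes the role of alpha explicit (enet_penalty is our illustrative name, not a package function):

```r
# Elastic-net penalty: alpha = 1 gives the lasso term only,
# alpha = 0 the ridge term only, intermediate values mix the two.
enet_penalty <- function(beta, lambda, alpha) {
  lambda * (alpha * sum(abs(beta)) + 0.5 * (1 - alpha) * sum(beta^2))
}
beta <- c(1, -2, 0)
enet_penalty(beta, lambda = 2, alpha = 1)  # -> 6  (2 * ||beta||_1)
enet_penalty(beta, lambda = 2, alpha = 0)  # -> 5  (2 * 0.5 * ||beta||_2^2)
```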

4 Concluding remarks

The package islasso provides an alternative to 'plain' lasso regression. Its main disadvantage with respect to the lasso lies in the point estimates: islasso does not perform variable selection, in that the point estimates will never be exactly zero; however, differences in terms of findings will typically be negligible. On the other hand, unlike the plain lasso, islasso is able to return reliable standard errors and p-values, which can be used to assess the significance of the coefficients.

References

Brown B and Wang Y. Standard errors and covariance matrices for smoothed rank estimators. Biometrika 2005; 92: 149–158.

Cilluffo G, Sottile G, La Grutta S and Muggeo VMR. The induced smoothed lasso: A practical framework for hypothesis testing in high dimensional regression. Statistical Methods in Medical Research 2019; online first, doi: 10.1177/0962280219842890.

Friedman J, Hastie T and Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 2010; 33(1): 1–22.

Lockhart R, Taylor J, Tibshirani R, et al. A significance test for the lasso. Annals of Statistics 2014; 42: 413–468.

Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 1996; 58: 267–288.
