Technical Report

# The R package islasso: estimation and hypothesis testing in lasso regression

Authors: Gianluca Sottile, Giovanna Cilluffo, Vito M.R. Muggeo

## Abstract

In this short note we present and briefly discuss the R package islasso, which deals with regression models having a large number of covariates. Estimation is carried out by penalizing the coefficients via a quasi-lasso penalty, wherein the nonsmooth lasso penalty is replaced by a smooth counterpart determined iteratively from the data according to the induced smoothing idea. The package includes functions to estimate the model and to test linear hypotheses on linear combinations of relevant coefficients. We illustrate the R code through a worked example, intentionally omitting details and an extended bibliography.
1 Introduction
Let $y = X\beta + \epsilon$ be the linear model of interest with the usual zero-mean and homoscedastic errors. As usual, $y = (y_1, \ldots, y_n)^T$ is the response vector and $X$ is the $n \times p$ design matrix (with $p$ quite large) with regression coefficients $\beta$.
When interest lies in selecting the non-noise covariates and estimating the relevant effects, one assumes the lasso penalized objective function (Tibshirani, 1996),

$$\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 \qquad (1)$$

to be minimized at fixed $\lambda > 0$. As is well known, the lasso penalty $\|\beta\|_1$ allows the noise covariates to be ruled out by returning estimates that are exactly zero.
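To make objective (1) concrete, the following base-R sketch evaluates it and computes its minimizer in one special case. It assumes a design matrix with orthonormal columns, an assumption made here only because it yields a closed form (soft-thresholding of the least-squares estimates); it is not required by the lasso or by islasso. The function names lasso_objective and soft_threshold and the simulated data are illustrative.

```r
# Objective (1): 0.5 * ||y - X b||_2^2 + lambda * ||b||_1
lasso_objective <- function(b, X, y, lambda) {
  0.5 * sum((y - X %*% b)^2) + lambda * sum(abs(b))
}

# Coordinate-wise minimizer of (1) when t(X) %*% X = I (assumption):
# shrink the least-squares estimate z towards zero by lambda, truncating at zero
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

set.seed(1)
n <- 50; p <- 4
X <- qr.Q(qr(matrix(rnorm(n * p), n, p)))      # n x p matrix with orthonormal columns
y <- drop(X %*% c(3, -2, 0, 0) + rnorm(n, sd = 0.5))

lambda  <- 1
b_ols   <- drop(crossprod(X, y))               # least squares, since t(X) %*% X = I
b_lasso <- soft_threshold(b_ols, lambda)

print(rbind(ols = b_ols, lasso = b_lasso))
# value of (1) at the two estimates; the lasso one is never larger
print(c(ols   = lasso_objective(b_ols, X, y, lambda),
        lasso = lasso_objective(b_lasso, X, y, lambda)))
```

In this setting any least-squares estimate smaller than lambda in absolute value is set exactly to zero, while the remaining ones are shrunk towards zero by lambda.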
Model estimation via the aforementioned penalized objective does not come without a price. The non-null estimates are shrunken towards zero and, probably more importantly, inference on the model is not straightforward. In other words, confidence intervals and p-values for (linear combinations of) β are not easily obtained. The islasso package aims to fill this gap, at least partially. More specifically, at the time of writing, the package returns point estimates, reliable standard errors and corresponding p-values for the regression coefficients and any linear combination of them. We do not provide details of the methodology, which can be found in Cilluffo et al. (2019). Rather, we describe the R functions in the package through a worked example.
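The induced smoothing idea mentioned in the abstract can be illustrated on a single coefficient. The sketch below assumes the standard construction of Brown and Wang (2005), in which a non-smooth term is replaced by its expectation under a Gaussian perturbation of the parameter; for the absolute value this expectation is the mean of a folded normal. The exact penalty used by islasso is described in Cilluffo et al. (2019) and is not reproduced in this note, so the functions smooth_abs and smooth_sign and the perturbation scale s are illustrative assumptions, not the package's internals.

```r
# E|b + s * Z| for Z ~ N(0, 1): a smooth approximation of |b|
# (s > 0 is the perturbation scale; smaller s gives a closer approximation)
smooth_abs <- function(b, s) b * (2 * pnorm(b / s) - 1) + 2 * s * dnorm(b / s)

# Derivative of smooth_abs with respect to b: a smooth replacement for sign(b)
smooth_sign <- function(b, s) 2 * pnorm(b / s) - 1

b <- c(-2, -0.5, 0, 0.5, 2)
print(rbind(abs         = abs(b),
            smooth_s1   = smooth_abs(b, s = 1),
            smooth_s0.1 = smooth_abs(b, s = 0.1)))
print(rbind(sign        = sign(b),
            smooth_s0.1 = smooth_sign(b, s = 0.1)))
```

Unlike |b|, the smoothed version is differentiable at zero, which is the kind of regularity that standard-error calculations rely on. The price is that a minimizer of a smoothed objective is generally small rather than exactly zero.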
Affiliations: Dip. Scienze Economiche, Aziendali e Statistiche, Università di Palermo, Italy (email: gianluca.sottile@unipa.it, vito.muggeo@unipa.it); Istituto per la Ricerca e l'Innovazione Biomedica (IRIB), Consiglio Nazionale delle Ricerche (CNR), Palermo, Italy (email: giovanna.cilluffo@ibim.cnr.it).
2 The R functions
The main function of the package is islasso() where the user supplies the model
formula as in the usual lm or glm functions, i.e.
islasso(formula, family, lambda, alpha, data, weights, subset,
offset, unpenalized, contrasts, control = is.control())
The argument family accepts the specification of family and link function as in Table 1, lambda is the tuning parameter, and unpenalized indicates the covariates whose coefficients are left unpenalized.
Table 1: Families and link functions allowed in islasso

| Family   | Link functions         |
|----------|------------------------|
| gaussian | identity               |
| binomial | logit, probit          |
| poisson  | log                    |
| gamma    | identity, log, inverse |
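Since family is specified as in glm, each row of Table 1 corresponds to a standard family object from base R's stats package. The sketch below only builds such objects and inspects their link functions. It does not call islasso, and whether islasso accepts every form shown here is not stated in this note. Note that the base-R constructor for the gamma family is spelled Gamma.

```r
# Family/link combinations listed in Table 1, built with base R (stats)
families <- list(
  gaussian = gaussian(link = "identity"),
  binomial = binomial(link = "probit"),
  poisson  = poisson(link = "log"),
  gamma    = Gamma(link = "inverse")
)

# Each object carries the link g() and its inverse, mapping mean <-> linear predictor
for (nm in names(families)) {
  f <- families[[nm]]
  cat(sprintf("%-8s link = %-8s g(0.5) = %7.4f\n", nm, f$link, f$linkfun(0.5)))
}
```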
The fitter function is islasso.fit(), which reads as
islasso.fit(X, y, family, lambda, alpha = 1, intercept = FALSE,
            weights = NULL, offset = NULL, unpenalized = NULL,
            control)
and which implements the estimating algorithm described in the paper.
The lambda argument in islasso.fit and islasso specifies the tuning parameter in the penalized objective. Any non-negative value can be provided; if it is missing, it is computed via K-fold cross validation by the function cv.glmnet() from the package glmnet (Friedman et al., 2010). The number of folds can be specified via the argument nfolds of the auxiliary function is.control().
3 A worked example: the Diabetes data set
We use the well-known diabetes dataset available in the lars package. The data refer to n = 442 patients enrolled to investigate a measure of disease progression one year after the baseline. There are ten covariates: age, sex, bmi (body mass index), map (average blood pressure) and six blood serum measurements (tc, ldl, hdl, tch, ltg, glu). The matrix x2 in the data frame also includes second-order terms, namely first-order interactions between covariates, and quadratic terms for the continuous variables.
To select the important terms in the regression equation we apply the lasso
> library(glmnet)
> library(lars)
> data(diabetes)
> a1 <- with(diabetes, cv.glmnet(x2, y))
> n <- nrow(diabetes)
> a1$lambda.min*n #the lambda value of (1); glmnet divides the squared loss by n
[1] 1344.186
>
> b <- drop(coef(a1, "lambda.min")) #coeffs at the optimum lambda
> length(b[b != 0])
[1] 15
Ten-fold cross validation ‘selects’ λ = 1344.2, corresponding to 15 non-null coefficients, the last of which are, just to illustrate,
> tail(b[b != 0])
     glu^2    age:sex    age:map    age:ltg    age:glu    bmi:map
 69.599081 107.479925  29.970061   8.506032  11.675332  85.530937
A reasonable question is whether all the ‘selected’ coefficients are significant in the model. Unfortunately, lasso regression does not return standard errors because of the nonsmoothness of the objective, and some alternative approaches have been proposed. One of them is the ‘covariance test’ (Lockhart et al., 2014), as implemented in the package covTest
> library(covTest)
>
> o <- with(diabetes, lars(x2, y))
> with(diabetes, covTest(o, x2, y))$results
Predictor_Number Drop_in_covariance P-value
3 20.1981 0.0000
9 52.5964 0.0000
4 5.7714 0.0034
7 4.0840 0.0176
37 1.3310 0.2655
20 0.3244 0.7232
.....................
The covTest approach suggests that only the terms corresponding to columns 3, 9, 4, and 7 of the matrix x2 are significant, namely
> colnames(diabetes$x2)[c(3, 9, 4, 7)]
[1] "bmi" "ltg" "map" "hdl"
However, covTest returns p-values across the λ path. This means that such p-values are not matched to the corresponding point estimates obtained at the optimal λ value (λ = 1344.2 in this example). As a consequence, some discrepancies between the results from covTest and glmnet/lars are likely to occur. For instance, out of the 15 selected non-null coefficients, just 4 are assessed as significant.
The R package islasso provides an alternative to covTest by implementing the recent ‘quasi’ lasso approach based on the induced smoothing idea (Brown and Wang, 2005), as discussed in Cilluffo et al. (2019). Point estimates and p-values are returned within the same framework. While the optimal lambda could be selected automatically (by not supplying any value to lambda), we use the value obtained above to facilitate comparisons
> library(islasso)
> out <- islasso(y ~ x2, data=diabetes, lambda=1344.186)
The summary method quickly returns the main output of the fitted model, including point estimates, standard errors and p-values
> summary(out)

Call:
islasso(formula = y ~ x2, lambda = 1344.186, data = diabetes)

Residuals:
    Min      1Q  Median      3Q     Max
-138.74  -40.18   -4.53   34.45  143.43

              Estimate Std. Error    Df z value Pr(>|z|)
(Intercept)  1.521e+02  2.554e+00 1.000  59.570  < 2e-16 ***
x2age        1.873e-01  2.408e+01 0.005   0.008  0.99379
x2sex       -1.149e+02  5.377e+01 0.891  -2.137  0.03258 *
x2bmi        4.952e+02  7.058e+01 1.000   7.016 2.29e-12 ***
x2map        2.514e+02  6.447e+01 0.999   3.899 9.64e-05 ***
x2tc        -4.514e-01  2.837e+01 0.012  -0.016  0.98730
......................
x2tch:glu    2.848e-01  2.546e+01 0.006   0.011  0.99107
x2ltg:glu    2.712e-01  3.611e+01 0.005   0.008  0.99401
......................
Visualizing estimates for all covariates could be somewhat inconvenient, especially when the number of covariates is large; one could therefore opt to print estimates only if their p-value is less than a specified value. We use 0.10 as a threshold.
> summary(out, pval = .1)

Call:
islasso(formula = y ~ x2, lambda = 1344.186, data = diabetes)

Residuals:
    Min      1Q  Median      3Q     Max
-138.74  -40.18   -4.53   34.45  143.43

             Estimate Std. Error    Df z value Pr(>|z|)
(Intercept)   152.133      2.554 1.000  59.570  < 2e-16 ***
x2sex        -114.923     53.773 0.891  -2.137  0.03258 *
x2bmi         495.168     70.581 1.000   7.016 2.29e-12 ***
x2map         251.409     64.473 0.999   3.899 9.64e-05 ***
x2hdl        -189.213     67.826 0.978  -2.790  0.00528 **
x2ltg         466.026     70.701 1.000   6.592 4.35e-11 ***
x2age:sex     109.177     50.732 0.904   2.152  0.03139 *
x2bmi:map      86.476     47.404 0.812   1.824  0.06812 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 2855.198)

    Null deviance: 2621009 on 441.0 degrees of freedom
Residual deviance: 1222391 on 428.1 degrees of freedom
AIC: 4786.9
Lambda: 1344.2

Number of Newton-Raphson iterations: 40

In addition to the usual information printed by the summary method, the output also includes the column Df, representing the degrees of freedom of each coefficient. Negligible coefficients (i.e. with an approximately null estimate) exhibit almost zero degrees of freedom; see the previous output of summary(out). The sum of all degrees of freedom is used to quantify the model complexity
> sum(out$internal$hi)
[1] 13.87174
and the corresponding residual degrees of freedom (428.1) are printed next to the residual deviance, as reported above.
The Wald test (column z value) and p-values can be used to assess important or significant covariates. In addition to the terms flagged as significant by covTest ("bmi", "ltg", "map", "hdl"), islasso() also returns ‘small’ p-values for the terms "sex", "age:sex", and "bmi:map". Simulation studies in Cilluffo et al. (2019) have shown good performance of islasso with respect to some alternative approaches.
As an alternative to cross validation, it is also possible to select the tuning parameter λ by means of the Bayesian or Akaike Information Criterion.
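Several of the quantities printed by summary(out, pval = .1) can be cross-checked by hand from the numbers reported above. The following base-R sketch recomputes the Wald statistic and two-sided normal p-value for x2sex, the residual degrees of freedom, the dispersion and the AIC. The AIC line assumes the usual glm convention for the gaussian family, with the sum of the Df column playing the role of the number of parameters; that islasso computes it this way is an assumption, although the result agrees with the printed value.

```r
# Values copied from the printed output above
est <- -114.923; se <- 53.773          # x2sex: Estimate and Std. Error
n   <- 442                             # number of patients
dev <- 1222391                         # residual deviance (residual sum of squares)
df  <- 13.87174                        # sum(out$internal$hi)

z <- est / se                          # Wald statistic
p <- 2 * pnorm(-abs(z))                # two-sided p-value from the standard normal

res_df <- n - df                       # residual degrees of freedom
phi    <- dev / res_df                 # dispersion estimate

# Gaussian AIC as in stats::glm (assumed to carry over to islasso):
# -2 * loglik at the ML variance, plus 2 * (df + 1), the extra 1 being the variance
aic <- n * (log(2 * pi * dev / n) + 1) + 2 * (df + 1)

print(round(c(z = z, p = p, res_df = res_df, dispersion = phi, AIC = aic), 4))
```

Small differences in the last digits are expected, because the inputs are the rounded values shown in the printout rather than the stored ones.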
The function aic.islasso, which selects λ according to an information criterion, requires an islasso fit object and the specification of the criterion to be used (AIC/BIC). Hence
> lmb.bic <- aic.islasso(out, "bic")
> out1 <- update(out, lambda = lmb.bic) #fit with a BIC-based lambda
Comparisons between methods to select the tuning parameter and further discussions are beyond our goals.
We conclude this short note by emphasizing that islasso also accepts the so-called elastic-net penalty, such that the objective becomes

$$\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\left\{\alpha\|\beta\|_1 + \frac{1}{2}(1-\alpha)\|\beta\|_2^2\right\}$$

where $0 \le \alpha \le 1$ is the mixing parameter to be specified in islasso() and islasso.fit() via the argument alpha.
4 Concluding remarks
The package islasso provides an alternative to the ‘plain’ lasso regression. The main disadvantage with respect to the lasso lies in the point estimates: islasso does not perform variable selection, in that the point estimates will never be exactly zero, although differences in terms of findings will typically be negligible. Unlike the plain lasso, however, islasso is able to return reliable standard errors and p-values, which can be used to assess the significance of coefficients.
References
Brown B and Wang Y. Standard errors and covariance matrices for smoothed rank estimators. Biometrika 2005; 92: 149–158.
Cilluffo G, Sottile G, La Grutta S and Muggeo VMR. The Induced Smoothed lasso: A practical framework for hypothesis testing in high dimensional regression. Statistical Methods in Medical Research 2019; online, doi: 10.1177/0962280219842890.
Friedman J, Hastie T and Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 2010; 33(1): 1–22.
Lockhart R, Taylor J, Tibshirani R, et al. A significance test for the lasso. Ann Stat 2014; 42: 413–468.
Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc: Series B 1996; 58: 267–288.