A note on regression with log Normal errors:
linear and piecewise linear modelling in R.
Vito M.R. Muggeo
Universit`a di Palermo, Italy
Abstract
We provide some comments about regression with log Normal or log Normal-type data. R code with some examples is presented to illustrate the fitting of linear and segmented linear regression models. Details, including references, are intentionally skipped.
Background
Many environmental measurements are non-negative and their distributions are frequently skewed to the right. From a statistical perspective, skewed data can be thought of as coming from the following data generating process
$$Y_i = \mu_i \exp\{\epsilon_i\} \qquad (1)$$
where $E[Y_i] = \mu_i$ and the $\exp\{\epsilon_i\}$ are right-skewed random noise terms with positive support. Expressing the actual disturbance via exponentiation of a different random term is not restrictive in any way, but is useful in what follows.
In regression contexts, when the covariate vector $x_i^T$ is observed along with the responses $y_i$ and the interest lies in modelling $E[Y|x_i] = \mu_i$, most of the time a scatter plot like the one in Figure 1 is produced first.
[Figure omitted: scatterplot of the covariate X (horizontal axis, 0-10) versus the response Y (vertical axis, 0-25).]
Figure 1: Hypothetical covariate - response scatterplot with evident heteroscedasticity.
Email: vito.muggeo@unipa.it
Minimum distance estimation
If data exhibit heteroscedasticity as in Figure 1, practitioners typically take the log values and minimize
$$\sum_i (\log y_i - x_i^T\beta)^2 \qquad (2)$$
to obtain estimates of the regression coefficients.
Our guess is that model fitting via (2) is quite widespread in practice (and prob-
ably purposeless sometimes), and thus some comments are probably noteworthy.
Taking the logs, (1) can be written as
$$\log Y_i = \log \mu_i + \epsilon_i.$$
If $E[\epsilon_i] = 0$ and the multiplicative model for the mean holds, i.e. $\mu_i = \exp\{x_i^T\beta\}$, minimization of (2) is sound and it yields unbiased estimates of the log relative change of $\mu$ corresponding to unit covariate increases.
Rather than taking logs, one could minimize the loss function based on differences on the original scale,
$$\sum_i (y_i - \exp\{x_i^T\beta\})^2. \qquad (3)$$
However, if $E[\exp\{\epsilon_i\}] \neq 1$, it follows that $E[Y_i] \neq \mu_i$, causing the least squares objective (3) to produce biased estimates. Obviously, if $\mu_i = \exp\{x_i^T\beta\}$, the ordinary loss $\sum_i (y_i - x_i^T\beta)^2$ would be meaningless here.
On the other hand, if the response-covariate relationship is understood to be linear, that is $\mu_i = x_i^T\beta$, the regression equation for the logs takes the form
$$\log Y_i = \log(x_i^T\beta) + \epsilon_i,$$
and if $E[\epsilon_i] = 0$ still holds, the following loss function
$$\sum_i (\log y_i - \log(x_i^T\beta))^2 \qquad (4)$$
leads to unbiased estimates of $\beta$. Interpretation is on the additive scale, i.e. the same as in the usual linear regression model: the average change in the response per unit covariate increase.
Finally, note that if in the data generating process (1) the errors are such that $E[\exp\{\epsilon_i\}] = 1$, then $E[Y_i] = \mu_i$ and ordinary least squares, $\sum_i (y_i - x_i^T\beta)^2$, would still produce unbiased but inefficient estimates, since the heteroscedasticity in the $Y_i$s would be ignored in the estimation process.
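For concreteness, both losses above can be minimized with general-purpose R tools; the following is a minimal sketch assuming a single covariate x and a positive response y (object names are illustrative, and the starting values are assumed to keep the linear predictor positive):
#loss (3): squared differences on the original scale, exponential mean
fit3 <- nls(y ~ exp(b0 + b1*x), start = list(b0 = 1, b1 = 0.1))
#loss (4): squared differences of the logs, linear mean
loss4 <- function(b, x, y) sum((log(y) - log(b[1] + b[2]*x))^2)
fit4 <- optim(c(1, 1), loss4, x = x, y = y)
coef(fit3); fit4$par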
Likelihood-based estimation
In the previous section, the first moment of the random term has been exploited to obtain the different loss objectives. In a likelihood framework, asymmetric distributions have to be chosen to model skewed data, and the log Normal distribution is one of them. Applications may be found in Ecology and Environmetrics in particular (rainfall and abundance of species), but also in Epidemiology and Medicine (exposure and biomarkers) and in health economics (health care expenditure, length of stay).
For the continuous random variable $Y$, the log Normal density is
$$f(y; \theta, \sigma^2) = \frac{1}{y\sigma\sqrt{2\pi}} \exp\left\{-\frac{(\log y - \theta)^2}{2\sigma^2}\right\},$$
where $\theta$ and $\sigma$ are location and scale parameters of the normally distributed transformation $\log Y$, i.e. $\log Y \sim N(\theta, \sigma^2)$. The expected value and variance of the outcome $Y$ are $E[Y] = \mu = \exp\{\theta + \sigma^2/2\}$ and $V[Y] = (\exp\{\sigma^2\} - 1)\exp\{2\theta + \sigma^2\}$, with constant coefficient of variation given by $\sqrt{e^{\sigma^2} - 1}$.
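These moment expressions are easily checked by simulation (a quick sanity check, not part of the package):
theta <- 1; sigma <- 0.5
set.seed(123)
yy <- rlnorm(1e6, meanlog = theta, sdlog = sigma)
c(mean(yy), exp(theta + sigma^2/2))                       #E[Y]
c(var(yy), (exp(sigma^2) - 1) * exp(2*theta + sigma^2))   #V[Y]
c(sd(yy)/mean(yy), sqrt(exp(sigma^2) - 1))                #coefficient of variation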
In order to set up a log likelihood in a regression context where the $\mu_i$s depend on $\beta$, we simply write $\theta_i = \log \mu_i(\beta) - \sigma^2/2$ and plug it into the density to obtain
$$\ell(\beta, \sigma^2) = -\frac{n}{2}\log \sigma^2 - \frac{\sum_i (\log y_i - \log \mu_i(\beta) + \sigma^2/2)^2}{2\sigma^2}. \qquad (5)$$
After specification of the appropriate link function, e.g. $\mu_i = x_i^T\beta$, maximization of (5) yields the ML estimates of the regression coefficients and the error variance.
Implementation in R
Curiously, at the time of writing, a search on the web did not return any R package aimed at fitting simple linear regression models with log Normal errors. The package gamlss allows one to specify several distributions, including the log Normal one, possibly with nonparametric terms and random effects, but its use is probably not immediate for the beginner. Furthermore, we experienced some discrepancies in the results, and, finally, the implementation of segmented terms, as discussed below, would require some more work.
Fitting a log Normal regression is straightforward, since it suffices to write down the log likelihood and to use any optimizer, for instance optim or nlminb. For example, assuming a single covariate x and response y, the log likelihood to maximize may be written as
lik.v <- function(b, x, y){
  #b = (intercept, slope, response standard deviation)
  mu <- b[1] + b[2]*x             #linear mean on the original scale
  theta <- log(mu) - b[3]^2/2     #location of log Y, so that E[Y] = mu
  #log Normal log density at each observation
  li <- -log(y) - .5*log(2*pi*b[3]^2) - (log(y) - theta)^2/(2*b[3]^2)
  sum(li)
}
where b represents the whole model parameter vector, i.e. the regression coefficients with the response standard deviation as the last component.
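Assuming vectors x and y are available (and that the linear predictor stays positive during the search), the log likelihood can then be maximized directly, for instance via optim with fnscale = -1; a sketch with arbitrary starting values:
st <- c(mean(y), 0, sd(log(y)))   #rough starting values: intercept, slope, sd
o0 <- optim(st, lik.v, x = x, y = y, control = list(fnscale = -1))
o0$par                            #regression coefficients and response sd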
However, to facilitate the implementation and use of multiple regression models with log Normal errors and a linear (i.e. $\mu_i = x_i^T\beta$) regression function through a formula interface, the new (at the time of writing) R package logNormReg can be used. A preliminary version of the package is available on ResearchGate at
https://www.researchgate.net/publication/326271533
and subsequent updates will be released on CRAN (https://cran.r-project.org/) as usual.
The main function is lognlm,
> library(logNormReg)
> args(lognlm)
function (formula, data, subset, weights, na.action, y = TRUE,
start, model = TRUE, lik = TRUE, opt = c("nlminb", "optim"),
...)
where two arguments are worth stressing. opt specifies the optimizer to be used, with optional arguments passed via .... The default is nlminb, which uses the analytical gradient and Hessian.
lik indicates the objective to be optimized: if lik=TRUE, a likelihood approach is followed and the likelihood (5) is maximized; otherwise, when lik=FALSE, the loss (4) is minimized.
To illustrate the package, we use simulated data
> n=200
> s=.4
> x<-seq(0,10,l=n) #covariate
> mu<- 10+2*x #the true regression function
> set.seed(1234) #just to get reproducible results..
> y<-rlnorm(n, log(mu)-s^2/2, s) #exp(log(mu)-s^2/2 +rnorm(n,0,s))
A plot of such data is reported in Figure 1. As discussed in the previous section,
people tend to log transform data to gain homoskedasticity in order to fit a simple
linear model
> coef(lm(log(y)~x)) #far from true (10,2)
(Intercept) x
2.2085667 0.1276675
which yields estimates of the regression parameters $\beta$, and accordingly of the conditional means $\mu_i$, that are far from the truth and essentially meaningless here.
Instead, a linear log Normal regression model keeping the observations on the original scale should be fitted:
> o<-lognlm(y~x)
or, if the iterative process has to be monitored for instance,
> o<-lognlm(y~x, control=list(trace=1))
0: 113.54315: 14.8345 2.65552 2.14266
1: 28.586808: 14.8460 2.70753 1.23320
..........
11: -43.593889: 7.70307 2.29063 -0.392217
where the list specified in the control argument is passed to the optimizer in use (nlminb, the default). Results may be summarized via the print or the summary methods, the latter also reporting asymptotic standard errors and the corresponding p-values:
> summary(o)
Call:
lognlm(formula = y ~ x)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.1427 0.6801 11.97 <2e-16 ***
x 2.3648 0.1780 13.29 <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Standard deviation estimate: 0.40126 (St.Err = 0.02006)
Log Likelihood: 82.628
The reported standard errors come from vcov.lognlm(), wherein the argument sandw (default TRUE) specifies whether the standard errors are computed via the sandwich formula or via the Hessian only. Finally, the log likelihood value printed in the last row refers to the kernel only, i.e. formula (5) omitting the constants depending on the data. That is fine for model comparisons, but the full log likelihood may be obtained via logLik():
> logLik(o, full=TRUE)
[1] -670.5405
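For instance, Hessian-only standard errors may be obtained by passing sandw=FALSE through the vcov method (a sketch, assuming the argument is forwarded by the generic as described above):
> sqrt(diag(vcov(o, sandw=FALSE)))  #Hessian-based standard errors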
On the other hand, when the fit has been obtained by setting lik=FALSE, logLik() returns the minimized loss objective (4), and full=TRUE is meaningless.
The package logNormReg fits linear models, but it can be used in conjunction with the package segmented to include piecewise linear relationships in the model, with the regression equation written as $\mu_i = \beta_0 + x_i^T\beta + \delta(z_i - \psi)_+$, where $z_i$ is the continuous covariate having a piecewise linear effect with breakpoint $\psi$. More specifically, once the function for fitting the simple linear model has been set, estimation of the segmented regression model is straightforward via the function segmented.default() in the package segmented.
> mu1<-10+2*x-2*pmax(x-6,0)           #piecewise linear mean with breakpoint at x=6
> y1<-rlnorm(n, log(mu1)-s^2/2, s)    #log Normal responses around mu1
As usual in segmented, the model is fitted in two steps
> library(segmented)
>
> o <-lognlm(y1~x)
> os<-segmented(o, ~x) #segmented.default() is being used..
In order to perform bootstrap restarting to escape local optima, the function segmented.default() needs an objective function to be minimized, which has to be passed via the argument fn.obj of seg.control(). The default is to take minus the log likelihood, i.e. via the function logLik. Therefore the following call produces the same fit as os above.
> segmented.default(o, ~x, control=seg.control(fn.obj="-logLik(x)"))
where the string fn.obj represents the objective (as a function of x) to be minimized. The returned object is not of class 'segmented', therefore the summary and plot methods, say, have to be called explicitly. For instance, to display the data with the segmented line superimposed:
> plot(x, y1)
> plot.segmented(os, add=TRUE, col=2, lwd=2)
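Likewise, the breakpoint estimate can be inspected by calling the summary method of the segmented package explicitly (a sketch, assuming the fit os obtained above):
> summary.segmented(os)  #reports the estimated breakpoint psi along with the slopes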
However, if the model has been fitted by minimizing (4), logLik() returns the loss objective value at the solution; therefore the function to be minimized is simply logLik(x), without the minus sign, namely
> a<-lognlm(y1 ~ x, lik=FALSE)
> as<-segmented(a, ~x, control=seg.control(fn.obj="logLik(x)"))