A note on regression with log Normal errors:
linear and piecewise linear modelling in R.
Vito M.R. Muggeo∗
Università di Palermo, Italy
Abstract
We provide some comments about regression with log Normal or log Normal-type data. R code with some examples is presented to illustrate the fitting of linear and segmented linear regression models. Details, including references, are intentionally omitted.
Background
Many environmental measurements are non-negative and their distributions are frequently skewed to the right. From a statistical perspective, skewed data can be thought of as coming from the following data generating process,

$$Y_i = \mu_i \exp\{\epsilon_i\} \qquad (1)$$

where $E[Y_i] = \mu_i$, provided $E[\exp\{\epsilon_i\}] = 1$, and the $\exp\{\epsilon_i\}$ are right-skewed random noise terms with positive support. Expressing the actual disturbance by exponentiating a different random term is not restrictive in any way, but it is useful in what follows.
In regression contexts, when the covariate vector $x_i^T$ is observed along with the responses $y_i$ and the interest lies in modelling $E[Y|x_i] = \mu_i$, most of the time a scatter plot like the one in Figure 1 is produced first.
[Figure 1: Hypothetical covariate-response scatterplot with evident heteroscedasticity; covariate X on the horizontal axis (0 to 10) and Response Y on the vertical axis (0 to 25).]
∗Email: vito.muggeo@unipa.it
Minimum distance estimation
If data exhibit heteroscedasticity as in Figure 1, practitioners tend to take the log values and to minimize

$$\sum_i (\log y_i - x_i^T \beta)^2 \qquad (2)$$

to obtain estimates of the regression coefficients.
Our guess is that model fitting via (2) is quite widespread in practice (and probably purposeless at times), and thus some comments seem worthwhile.
Taking logs, (1) can be written as

$$\log Y_i = \log \mu_i + \epsilon_i.$$

If $E[\epsilon_i] = 0$ and the multiplicative model for the mean holds, i.e. $\mu_i = \exp\{x_i^T \beta\}$, minimization of (2) is sound and yields unbiased estimates of the log relative change of $\mu$ corresponding to unit covariate increases.
Rather than taking logs, one could minimize the loss function based on differences on the original scale,

$$\sum_i (y_i - \exp\{x_i^T \beta\})^2. \qquad (3)$$

However, if $E[\exp\{\epsilon_i\}] \neq 1$, it follows that $E[Y_i] \neq \mu_i$, causing the least squares objective (3) to yield biased estimates. Obviously, if $\mu_i = \exp\{x_i^T \beta\}$, the ordinary loss $\sum_i (y_i - x_i^T \beta)^2$ would be meaningless here.
On the other hand, if the response-covariate relationship is understood to be linear, that is $\mu_i = x_i^T \beta$, the regression equation for the logs takes the form

$$\log Y_i = \log(x_i^T \beta) + \epsilon_i,$$

and, if $E[\epsilon_i] = 0$ still holds, the loss function

$$\sum_i (\log y_i - \log(x_i^T \beta))^2 \qquad (4)$$

leads to unbiased estimates of $\beta$. Interpretation is on the additive scale, i.e. the same as in the usual linear regression model: the average change in the response per unit covariate increase.
Finally, note that if in the data generating process (1) the errors are such that $E[\exp\{\epsilon_i\}] = 1$, then $E[Y_i] = \mu_i$ and ordinary least squares, $\sum_i (y_i - x_i^T \beta)^2$, would still produce unbiased but inefficient estimates, since the heteroscedasticity in the $Y_i$ would be ignored in the estimation process.
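As a concrete illustration, the following R snippet simulates data from (1) with a linear mean and compares the estimates obtained from losses (2) and (4). This is a minimal sketch: all object names are illustrative and not part of any package.

set.seed(1)
n <- 200; s <- 0.4
x <- seq(0, 10, l = n)
mu <- 10 + 2*x                     #linear mean on the original scale
y <- mu*exp(rnorm(n, -s^2/2, s))   #model (1), with E[exp(eps)] = 1
coef(lm(log(y) ~ x))               #loss (2): targets log relative changes
loss4 <- function(b) sum((log(y) - log(b[1] + b[2]*x))^2)
optim(c(1, 1), loss4)$par          #loss (4): close to the true (10, 2)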
Likelihood-based estimation
In the previous section, only the first moment of the random term was exploited to derive the different loss objectives. In a likelihood framework, asymmetric distributions have to be chosen to model skewed data, and the log Normal distribution is one of them. Applications may be found in Ecology and Environmetrics in particular (rainfall and species abundance), but also in Epidemiology and Medicine (exposures and biomarkers) and in health economics (health care expenditure, length of stay).
For the continuous random variable $Y$, the log Normal density is

$$f(y; \theta, \sigma^2) = \frac{1}{y \sigma \sqrt{2\pi}} \exp\left\{-\frac{(\log y - \theta)^2}{2\sigma^2}\right\},$$

where $\theta$ and $\sigma$ are the location and scale parameters of the normally distributed transformation $\log Y$, i.e. $\log Y \sim N(\theta, \sigma^2)$. The expected value and variance of the outcome $Y$ are $E[Y] = \mu = \exp\{\theta + \sigma^2/2\}$ and $V[Y] = (\exp\{\sigma^2\} - 1) \exp\{2\theta + \sigma^2\}$, with constant coefficient of variation $\sqrt{e^{\sigma^2} - 1}$.
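These formulas are easy to check by simulation; the following snippet is a quick Monte Carlo verification with arbitrary, illustrative parameter values.

theta <- 1; s2 <- 0.25
yy <- rlnorm(1e6, meanlog = theta, sdlog = sqrt(s2))
c(mean(yy), exp(theta + s2/2))                #E[Y] and its formula
c(var(yy), (exp(s2) - 1)*exp(2*theta + s2))   #V[Y] and its formula
c(sd(yy)/mean(yy), sqrt(exp(s2) - 1))         #coefficient of variation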
To set up the log likelihood in a regression context where the $\mu_i$ depend on $\beta$, we simply write $\theta_i = \log \mu_i(\beta) - \sigma^2/2$ and plug it into the density, obtaining

$$\ell(\beta, \sigma^2) = -\frac{n}{2} \log \sigma^2 - \frac{\sum_i (\log y_i - \log \mu_i(\beta) + \sigma^2/2)^2}{2\sigma^2}. \qquad (5)$$

After specification of the appropriate link function, e.g. $\mu_i = x_i^T \beta$, maximization of (5) yields the ML estimates of the regression coefficients and of the error variance.
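For completeness, a one-step derivation from the density above gives the full log likelihood,

$$\ell_{\mathrm{full}}(\beta, \sigma^2) = -\sum_i \log y_i - \frac{n}{2} \log(2\pi) - \frac{n}{2} \log \sigma^2 - \frac{\sum_i (\log y_i - \log \mu_i(\beta) + \sigma^2/2)^2}{2\sigma^2},$$

of which (5) is the kernel, i.e. the part depending on the parameters; the omitted terms are constant given the data, a point we return to below when discussing logLik().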
Implementation in R
Curiously, at the time of writing, a web search did not return any R package aimed at fitting simple linear regression models with log Normal errors. The package gamlss allows several distributions to be specified, including the log Normal one, possibly with nonparametric terms and random effects, but its use is probably not immediate for the beginner. Furthermore, we experienced some discrepancies in the results and, finally, the implementation of segmented terms, as discussed below, would require some additional work.
Fitting a log Normal regression is straightforward, since it suffices to write down the log likelihood and to use any optimizer, for instance optim or nlminb. Assuming a single covariate x and response y, the log likelihood to be maximized may be written as
lik.v <- function(b, x, y){
  #b[1], b[2]: regression coefficients; b[3]: log-scale standard deviation
  mu <- b[1] + b[2]*x           #mean on the original scale
  theta <- log(mu) - b[3]^2/2   #location parameter of the log Normal
  #pointwise log Normal log density evaluated at the observed y
  li <- -log(y) - .5*log(2*pi*b[3]^2) - (log(y) - theta)^2/(2*b[3]^2)
  sum(li)                       #log likelihood to be maximized
}
where b represents the whole vector of model parameters, i.e. the regression coefficients with the response standard deviation as the last component.
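Given a response y and covariate x, this function may be passed, for instance, to optim; the following is a usage sketch in which the starting values are an arbitrary, illustrative choice.

st <- c(coef(lm(y ~ x)), sd(log(y)))       #rough starting values
o0 <- optim(st, lik.v, x = x, y = y,
            control = list(fnscale = -1))  #fnscale = -1: maximize
o0$par                                     #intercept, slope and sigma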
However, to facilitate the implementation and use of multiple regression models with log Normal errors and a linear regression function (i.e. $\mu_i = x_i^T \beta$) through a formula interface, the new (at the time of writing) R package logNormReg can be used. A preliminary version of the package is available on ResearchGate at
https://www.researchgate.net/publication/326271533
and subsequent updates will be released on CRAN (https://cran.r-project.org/) as usual.
The main function is lognlm,
> library(logNormReg)
> args(lognlm)
function (formula, data, subset, weights, na.action, y = TRUE,
start, model = TRUE, lik = TRUE, opt = c("nlminb", "optim"),
...)
where two arguments are worth stressing. opt specifies the optimizer to be used, with optional arguments passed via .... The default is nlminb, which uses the analytical gradient and Hessian.
lik indicates the objective to be optimized: if lik=TRUE, a likelihood approach is followed and the likelihood (5) is maximized; otherwise, when lik=FALSE, the loss (4) is minimized.
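For instance, with a response y and covariate x (such as the data simulated below), the two objectives and the alternative optimizer can be requested as follows; this is a usage sketch based on the argument list above.

> o1 <- lognlm(y ~ x)                 #maximize likelihood (5) via nlminb
> o2 <- lognlm(y ~ x, lik = FALSE)    #minimize loss (4) instead
> o3 <- lognlm(y ~ x, opt = "optim")  #use optim rather than nlminb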
To illustrate the package, we use simulated data
> n=200
> s=.4
> x<-seq(0,10,l=n) #covariate
> mu<- 10+2*x #the true regression function
> set.seed(1234) #just to get reproducible results..
> y<-rlnorm(n, log(mu)-s^2/2, s) #exp(log(mu)-s^2/2 +rnorm(n,0,s))
A plot of these data is shown in Figure 1. As discussed in the previous section, people tend to log transform the data to gain homoscedasticity and then fit a simple linear model
> coef(lm(log(y)~x)) #far from true (10,2)
(Intercept) x
2.2085667 0.1276675
which yields potentially meaningless estimates of the regression parameters $\beta$ and, accordingly, of the conditional means $\mu_i$: the log-scale fit targets log relative changes, which do not match the linear mean model that generated the data.
Instead, the linear log Normal regression model, keeping the observations on the original scale, should be fitted
> o<-lognlm(y~x)
or, if, for instance, the iterative process has to be monitored,
> o<-lognlm(y~x, control=list(trace=1))
0: 113.54315: 14.8345 2.65552 2.14266
1: 28.586808: 14.8460 2.70753 1.23320
..........
11: -43.593889: 7.70307 2.29063 -0.392217
where the list specified in the control argument is passed to the chosen optimizer (nlminb, the default). Results may be summarized via the print or summary methods, the latter also reporting asymptotic standard errors and the corresponding p-values,
> summary(o)
Call:
lognlm(formula = y ~ x)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.1427 0.6801 11.97 <2e-16 ***
x 2.3648 0.1780 13.29 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Standard deviation estimate: 0.40126 (St.Err = 0.02006)
Log Likelihood: 82.628
The reported standard errors come from vcov.lognlm(), wherein the argument sandw (default TRUE) specifies whether the standard errors are computed via the sandwich formula or via the Hessian only. Finally, the log likelihood value printed in the last row refers to the kernel only, i.e. formula (5) omitting the constant depending on the data. That is fine for model comparisons, but the full log likelihood may be obtained via logLik()
> logLik(o, full=TRUE)
[1] -670.5405
On the other hand, when the fit has been obtained by setting lik=FALSE, logLik() returns the minimized loss objective (4), and full=TRUE is meaningless.
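Returning to the standard errors, the two flavours can be compared by calling the variance-covariance method directly; a usage sketch, with sandw as documented above:

> vcov(o)                 #sandwich (robust) covariance, the default
> vcov(o, sandw = FALSE)  #Hessian-based covariance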
The package logNormReg fits linear models, but it can be used in conjunction with the package segmented to include piecewise linear relationships in the model, with the regression equation written as $\mu_i = \beta_0 + x_i^T \beta + \delta(z_i - \psi)_+$, where $z_i$ is the continuous covariate having a piecewise linear effect with breakpoint $\psi$ and $(\cdot)_+$ denotes the positive part. More specifically, once the function fitting the simple linear model has been set, estimation of the segmented regression is straightforward via the function segmented.default() in the package segmented. To illustrate, we simulate responses with a piecewise linear mean,
> mu1<-10+2*x-2*pmax(x-6,0)         #slope 2, then 0 after the breakpoint at 6
> y1<-rlnorm(n, log(mu1)-s^2/2, s)  #log Normal responses around mu1
As usual in segmented, the model is fitted in two steps
> library(segmented)
>
> o <-lognlm(y1~x)
> os<-segmented(o, ~x) #segmented.default() is being used..
In order to perform bootstrap restarting to escape local optima, the function segmented.default() needs an objective function to be minimized, which has to be passed via the argument fn.obj of seg.control(...). The default is to take minus the log likelihood by means of the function logLik. Therefore, the following call produces the same fit as os above,
> segmented.default(o, ~x, control=seg.control(fn.obj="-logLik(x)"))
where the string fn.obj represents the objective (as a function of x) to be minimized. The returned object is not of class 'segmented'; therefore the summary and plot methods, say, have to be called explicitly. For instance, to display the data with the fitted segmented line superimposed,
> plot(x, y1)
> plot.segmented(os, add=TRUE, col=2, lwd=2)
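Likewise, the breakpoint estimate may be inspected by calling the summary method explicitly; a usage sketch, with output omitted:

> summary.segmented(os)  #explicit call, as os is not of class 'segmented'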
However, if the model has been fitted by minimizing (4), logLik() returns the loss objective value at the solution; therefore the function to be minimized is simply logLik(x), without the minus sign, namely
> a<-lognlm(y1 ~ x, lik=FALSE)
> as<-segmented(a, ~x, control=seg.control(fn.obj="logLik(x)"))