
Journal of Statistical Software
October 2015, Volume 67, Issue 1. doi: 10.18637/jss.v067.i01

Fitting Linear Mixed-Effects Models Using lme4

Douglas Bates

University of Wisconsin-Madison

Martin Mächler

ETH Zurich

Benjamin M. Bolker

McMaster University

Steven C. Walker

McMaster University

Abstract

Maximum likelihood or restricted maximum likelihood (REML) estimates of the parameters in linear mixed-effects models can be determined using the lmer function in the lme4 package for R. As for most model-fitting functions in R, the model is described in an lmer call by a formula, in this case including both fixed- and random-effects terms. The formula and data together determine a numerical representation of the model from which the profiled deviance or the profiled REML criterion can be evaluated as a function of some of the model parameters. The appropriate criterion is optimized, using one of the constrained optimization functions in R, to provide the parameter estimates. We describe the structure of the model, the steps in evaluating the profiled deviance or REML criterion, and the structure of classes or types that represents such a model. Sufficient detail is included to allow specialization of these structures by users who wish to write functions to fit specialized linear mixed models, such as models incorporating pedigrees or smoothing splines, that are not easily expressible in the formula language used by lmer.

Keywords: sparse matrix methods, linear mixed models, penalized least squares, Cholesky decomposition.

1. Introduction

The lme4 package (Bates, Maechler, Bolker, and Walker 2015) for R (R Core Team 2015) provides functions to fit and analyze linear mixed models, generalized linear mixed models and nonlinear mixed models. In each of these names, the term "mixed" or, more fully, "mixed effects", denotes a model that incorporates both fixed- and random-effects terms in a linear predictor expression from which the conditional mean of the response can be evaluated. In this paper we describe the formulation and representation of linear mixed models. The techniques used for generalized linear and nonlinear mixed models will be described separately, in a future paper.

At present, the main alternative to lme4 for mixed modeling in R is the nlme package (Pinheiro, Bates, DebRoy, Sarkar, and R Core Team 2015). The main features distinguishing lme4 from nlme are (1) more efficient linear algebra tools, giving improved performance on large problems; (2) simpler syntax and more efficient implementation for fitting models with crossed random effects; (3) the implementation of profile likelihood confidence intervals on random-effects parameters; and (4) the ability to fit generalized linear mixed models (although in this paper we restrict ourselves to linear mixed models). The main advantage of nlme relative to lme4 is a user interface for fitting models with structure in the residuals (various forms of heteroscedasticity and autocorrelation) and in the random-effects covariance matrices (e.g., compound symmetric models). With some extra effort, the computational machinery of lme4 can be used to fit structured models that the basic lmer function cannot handle (see Appendix A).

The development of general software for fitting mixed models remains an active area of research with many open problems. Consequently, the lme4 package has evolved since it was first released, and continues to improve as we learn more about mixed models. However, we recognize the need to maintain stability and backward compatibility of lme4 so that it continues to be broadly useful. In order to maintain stability while continuing to advance mixed-model computation, we have developed several additional frameworks that draw on the basic ideas of lme4 but modify its structure or implementation in various ways. These descendants include the MixedModels package (Bates 2015) in Julia (Bezanson, Karpinski, Shah, and Edelman 2012), the lme4pureR package (Bates and Walker 2013) in R, and the flexLambda development branch of lme4. The current article is largely restricted to describing the current stable version of the lme4 package (1.1-10), with Appendix A describing hooks into the computational machinery that are designed for extension development. The gamm4 (Wood and Scheipl 2014) and blme (Dorie 2015; Chung, Rabe-Hesketh, Dorie, Gelman, and Liu 2013) packages currently make use of these hooks.

Another goal of this article is to contrast the approach used by lme4 with previous formulations of mixed models. The expressions for the profiled log-likelihood and profiled REML (restricted maximum likelihood) criteria derived in Section 3.4 are similar to those presented in Bates and DebRoy (2004) and, indeed, are closely related to "Henderson's mixed-model equations" (Henderson Jr. 1982). Nonetheless there are subtle but important changes in the formulation of the model and in the structure of the resulting penalized least squares (PLS) problem to be solved (Section 3.6). We derive the current version of the PLS problem (Section 3.2) and contrast this result with earlier formulations (Section 3.5).

This article is organized into four main sections (Sections 2, 3, 4, and 5), each of which corresponds to one of the four largely separate modules that comprise lme4. Before describing the details of each module, we describe the general form of the linear mixed model underlying lme4 (Section 1.1); introduce the sleepstudy data that will be used as an example throughout (Section 1.2); and broadly outline lme4's modular structure (Section 1.3).

1.1. Linear mixed models

Just as a linear model is described by the distribution of a vector-valued random response variable, Y, whose observed value is yobs, a linear mixed model is described by the distribution of two vector-valued random variables: Y, the response, and B, the vector of random effects. In a linear model the distribution of Y is multivariate normal,

\[ \mathcal{Y} \sim \mathcal{N}(X\beta + o,\ \sigma^2 W^{-1}), \tag{1} \]

where n is the dimension of the response vector, W is a diagonal matrix of known prior weights, β is a p-dimensional coefficient vector, X is an n × p model matrix, and o is a vector of known prior offset terms. The parameters of the model are the coefficients β and the scale parameter σ.

In a linear mixed model it is the conditional distribution of Y given B = b that has such a form,

\[ (\mathcal{Y} \mid \mathcal{B} = b) \sim \mathcal{N}(X\beta + Zb + o,\ \sigma^2 W^{-1}), \tag{2} \]

where Z is the n × q model matrix for the q-dimensional vector-valued random-effects variable, B, whose value we are fixing at b. The unconditional distribution of B is also multivariate normal with mean zero and a parameterized q × q variance-covariance matrix, Σ,

\[ \mathcal{B} \sim \mathcal{N}(0, \Sigma). \tag{3} \]

As a variance-covariance matrix, Σ must be positive semidefinite. It is convenient to express the model in terms of a relative covariance factor, Λθ, which is a q × q matrix, depending on the variance-component parameter, θ, and generating the symmetric q × q variance-covariance matrix, Σ, according to

\[ \Sigma_\theta = \sigma^2 \Lambda_\theta \Lambda_\theta^\top, \tag{4} \]

where σ is the same scale factor as in the conditional distribution (2).

Although Equations 2, 3, and 4 fully describe the class of linear mixed models that lme4 can fit, this terse description hides many important details. Before moving on to these details, we make a few observations:

• This formulation of linear mixed models allows for a relatively compact expression for the profiled log-likelihood of θ (Section 3.4, Equation 34).

• The matrices associated with random effects, Z and Λθ, typically have a sparse structure with a sparsity pattern that encodes various model assumptions. Sections 2.3 and 3.7 provide details on these structures, and how to represent them efficiently.

• The interface provided by lme4's lmer function is slightly less general than the model described by Equations 2, 3, and 4. To take advantage of the entire range of possibilities, one may use the modular functions (Section 1.3 and Appendix A) or explore the experimental flexLambda branch of lme4 on Github.

1.2. Example

Throughout our discussion of lme4, we will work with a data set on the average reaction time per day for subjects in a sleep deprivation study (Belenky et al. 2003). On day 0 the subjects had their normal amount of sleep. Starting that night they were restricted to 3 hours of sleep per night. The response variable, Reaction, represents average reaction times in milliseconds (ms) on a series of tests given each Day to each Subject (Figure 1).

Figure 1: Average reaction time versus days of sleep deprivation by subject. Subjects ordered (from left to right starting on the top row) by increasing slope of subject-specific linear regressions. (Each panel is labeled by subject number; horizontal axes show days 0-8 and vertical axes show average reaction time, 200-450 ms.)

R> str(sleepstudy)

'data.frame': 180 obs. of 3 variables:
 $ Reaction: num 250 259 251 321 357 ...
 $ Days    : num 0 1 2 3 4 5 6 7 8 9 ...
 $ Subject : Factor w/ 18 levels "308","309","310",..: 1 1 1 1 1 1 1..

Each subject's reaction time increases approximately linearly with the number of sleep-deprived days. However, subjects also appear to vary in the slopes and intercepts of these relationships, which suggests a model with random slopes and intercepts. As we shall see, such a model may be fitted by minimizing the REML criterion (Equation 39) using

R> fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)

The estimates of the standard deviations of the random effects for the intercept and the slope are 24.74 ms and 5.92 ms/day. The fixed-effects coefficients, β, are 251.4 ms and 10.47 ms/day for the intercept and slope. In this model, one interpretation of these fixed effects is that they are the estimated population mean values of the random intercept and slope (Section 2.2).
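These quantities can be extracted directly from the fitted object; a minimal sketch using standard lme4 accessors:

R> fixef(fm1)    # fixed-effects coefficients (intercept and Days slope)
R> VarCorr(fm1)  # random-effects standard deviations and correlation
R> sigma(fm1)    # residual standard deviation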

We have chosen the sleepstudy example because it is a relatively small and simple example to illustrate the theory and practice underlying lmer. However, lmer is capable of fitting more complex mixed models to larger data sets. For example, we direct the interested reader to RShowDoc("lmerperf", package = "lme4") for examples that more thoroughly exercise the performance capabilities of lmer.

Module                        R function     Description
Formula module (Section 2)    lFormula       Accepts a mixed-model formula, data, and other user inputs, and returns a list of objects required to fit a linear mixed model.
Objective function module
  (Section 3)                 mkLmerDevfun   Accepts the results of lFormula and returns a function to calculate the deviance (or restricted deviance) as a function of the covariance parameters, θ.
Optimization module
  (Section 4)                 optimizeLmer   Accepts a deviance function returned by mkLmerDevfun and returns the results of the optimization of that deviance function.
Output module (Section 5)     mkMerMod       Accepts an optimized deviance function and packages the results into a useful object.

Table 1: The high-level modular structure of lmer.

1.3. High-level modular structure

The lmer function is composed of four largely independent modules. In the first module, a mixed-model formula is parsed and converted into the inputs required to specify a linear mixed model (Section 2). The second module uses these inputs to construct an R function which takes the covariance parameters, θ, as arguments and returns negative twice the log profiled likelihood or the REML criterion (Section 3). The third module optimizes this objective function to produce maximum likelihood (ML) or REML estimates of θ (Section 4). Finally, the fourth module provides utilities for interpreting the optimized model (Section 5).

To illustrate this modularity, we recreate the fm1 object by a series of four modular steps; the formula module,

R> parsedFormula <- lFormula(formula = Reaction ~ Days + (Days | Subject),

+ data = sleepstudy)

the objective function module,

R> devianceFunction <- do.call(mkLmerDevfun, parsedFormula)

the optimization module,

R> optimizerOutput <- optimizeLmer(devianceFunction)

and the output module,

R> mkMerMod(rho = environment(devianceFunction), opt = optimizerOutput,

+ reTrms = parsedFormula$reTrms, fr = parsedFormula$fr)
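The value returned by this final step is the fitted model itself; as a quick consistency check (a sketch; the object name fm1Mod is ours, not part of lme4):

R> fm1Mod <- mkMerMod(rho = environment(devianceFunction),
+    opt = optimizerOutput, reTrms = parsedFormula$reTrms,
+    fr = parsedFormula$fr)
R> all.equal(fixef(fm1Mod), fixef(fm1))  # should be TRUE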


2. Formula module

2.1. Mixed-model formulas

Like most model-fitting functions in R, lmer takes as its first two arguments a formula specifying the model and the data with which to evaluate the formula. This second argument, data, is optional but recommended and is usually the name of an R data frame. In the R lm function for fitting linear models, formulas take the form resp ~ expr, where resp determines the response variable and expr is an expression that specifies the columns of the model matrix. Formulas for the lmer function contain special random-effects terms,

R> resp ~ FEexpr + (REexpr1 | factor1) + (REexpr2 | factor2) + ...

where FEexpr is an expression determining the columns of the fixed-effects model matrix, X, and the random-effects terms, (REexpr1 | factor1) and (REexpr2 | factor2), determine both the random-effects model matrix, Z (Section 2.3), and the structure of the relative covariance factor, Λθ (Section 2.3). In principle, a mixed-model formula may contain arbitrarily many random-effects terms, but in practice the number of such terms is typically low.

2.2. Understanding mixed-model formulas

Before describing the details of how lme4 parses mixed-model formulas (Section 2.3), we provide an informal explanation and then some examples. Our discussion assumes familiarity with the standard R modeling paradigm (Chambers 1993).

Each random-effects term is of the form (expr | factor). The expression expr is evaluated as a linear model formula, producing a model matrix following the same rules used in standard R modeling functions (e.g., lm or glm). The expression factor is evaluated as an R factor. One way to think about the vertical bar operator is as a special kind of interaction between the model matrix and the grouping factor. This interaction ensures that the columns of the model matrix have different effects for each level of the grouping factor. What makes this a special kind of interaction is that these effects are modeled as unobserved random variables, rather than unknown fixed parameters. Much has been written about important practical and philosophical differences between these two types of interactions (e.g., Henderson Jr. 1982; Gelman 2005). For example, the random-effects implementation of such interactions can be used to obtain shrinkage estimates of regression coefficients (e.g., Efron and Morris 1977), or account for lack of independence in the residuals due to block structure or repeated measurements (e.g., Laird and Ware 1982).

Table 2 provides several examples of the right-hand-sides of mixed-model formulas. The first example, (1 | g), is the simplest possible mixed-model formula, where each level of the grouping factor, g, has its own random intercept. The mean and standard deviation of these intercepts are parameters to be estimated. Our description of this model incorporates any nonzero mean of the random effects as fixed-effects parameters. If one wishes to specify that a random intercept has a priori known means, one may use the offset function as in the second model in Table 2. This model contains no fixed effects, or more accurately the fixed-effects model matrix, X, has zero columns and β has length zero.
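A minimal sketch of such an offset-only fit, using simulated data (the data frame d and its variables are hypothetical, not part of lme4):

R> set.seed(1)
R> d <- data.frame(g = gl(10, 5), o = rep(300, 50))
R> d$y <- d$o + rep(rnorm(10, sd = 20), each = 5) + rnorm(50, sd = 10)
R> fmOff <- lmer(y ~ 0 + offset(o) + (1 | g), data = d)
R> fixef(fmOff)  # numeric(0): no fixed-effects coefficients are estimated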

We may also construct models with multiple grouping factors. For example, if the observations are grouped by g2, which is nested within g1, then the third formula in Table 2 can be used to model variation in the intercept. A common objective in mixed modeling is to account for such nested (or hierarchical) structure. However, one of the most useful aspects of lme4 is that it can be used to fit random effects associated with non-nested grouping factors. For example, suppose the data are grouped by fully crossing two factors, g1 and g2; then the fourth formula in Table 2 may be used. Such models are common in item response theory, where subject and item factors are fully crossed (Doran, Bates, Bliese, and Dowling 2007). In addition to varying intercepts, we may also have varying slopes (e.g., the sleepstudy data, Section 1.2). The fifth example in Table 2 gives a model where both the intercept and slope vary among the levels of the grouping factor.

Formula                    Alternative                      Meaning
(1 | g)                    1 + (1 | g)                      Random intercept with fixed mean.
0 + offset(o) + (1 | g)    -1 + offset(o) + (1 | g)         Random intercept with a priori means.
(1 | g1/g2)                (1 | g1) + (1 | g1:g2)           Intercept varying among g1 and g2 within g1.
(1 | g1) + (1 | g2)        1 + (1 | g1) + (1 | g2)          Intercept varying among g1 and g2.
x + (x | g)                1 + x + (1 + x | g)              Correlated random intercept and slope.
x + (x || g)               1 + x + (1 | g) + (0 + x | g)    Uncorrelated random intercept and slope.

Table 2: Examples of the right-hand-sides of mixed-effects model formulas. The names of grouping factors are denoted g, g1, and g2, and covariates and a priori known offsets as x and o.

Specifying uncorrelated random effects

By default, lme4 assumes that all coefficients associated with the same random-effects term are correlated. To specify an uncorrelated slope and intercept (for example), one may either use double-bar notation, (x || g), or equivalently use multiple random-effects terms, x + (1 | g) + (0 + x | g), as in the final example of Table 2. For example, if one examined the results of model fm1 of the sleepstudy data (Section 1.2) using summary(fm1), one would see that the estimated correlation between the slope for Days and the intercept is fairly low (0.066). (See Section 5.2 below for more on how to extract the random-effects covariance matrix.) We may use double-bar notation to fit a model that excludes a correlation parameter:

R> fm2 <- lmer(Reaction ~ Days + (Days || Subject), sleepstudy)
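Whether dropping the correlation parameter is justified can be assessed with a likelihood-ratio comparison; a brief sketch (anova refits both models with ML before comparing):

R> anova(fm2, fm1)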

Although mixed models where the random slopes and intercepts are assumed independent are commonly used to reduce the complexity of random-slopes models, they do have one subtle drawback. Models in which the slopes and intercepts are allowed to have a nonzero correlation (e.g., fm1) are invariant to additive shifts of the continuous predictor (Days in this case). This invariance breaks down when the correlation is constrained to zero; any shift in the predictor will necessarily lead to a change in the estimated correlation, and in the likelihood and predictions of the model. For example, we can eliminate the correlation in fm1 simply by adding an amount equal to the ratio of the estimated among-subject standard deviations multiplied by the estimated correlation (i.e., σ_slope/σ_intercept · ρ_slope:intercept) to the Days variable. The use of models such as fm2 should ideally be restricted to cases where the predictor is measured on a ratio scale (i.e., the zero point on the scale is meaningful, not just a location defined by convenience or convention), as is the case here.

Symbol           Size
n                Length of the response vector, Y.
p                Number of columns of the fixed-effects model matrix, X.
q = Σᵢ qᵢ        Number of columns of the random-effects model matrix, Z.
pᵢ               Number of columns of the raw model matrix, Xᵢ.
ℓᵢ               Number of levels of the grouping factor indices, iᵢ.
qᵢ = pᵢℓᵢ        Number of columns of the term-wise model matrix, Zᵢ.
k                Number of random-effects terms.
mᵢ = pᵢ(pᵢ+1)/2  Number of covariance parameters for term i.
m = Σᵢ mᵢ        Total number of covariance parameters.

Table 3: Dimensions of linear mixed models. The subscript i = 1, . . . , k denotes a specific random-effects term.

Symbol   Size      Description
Xᵢ       n × pᵢ    Raw random-effects model matrix.
Jᵢ       n × ℓᵢ    Indicator matrix of grouping factor indices.
Xᵢⱼ      pᵢ × 1    Column vector containing the jth row of Xᵢ.
Jᵢⱼ      ℓᵢ × 1    Column vector containing the jth row of Jᵢ.
iᵢ       n         Vector of grouping factor indices.
Zᵢ       n × qᵢ    Term-wise random-effects model matrix.
θ        m         Covariance parameters.
Tᵢ       pᵢ × pᵢ   Lower triangular template matrix.
Λᵢ       qᵢ × qᵢ   Term-wise relative covariance factor.

Table 4: Symbols used to describe the structure of the random-effects model matrix and the relative covariance factor. The subscript i = 1, . . . , k denotes a specific random-effects term.
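To see the shift sensitivity discussed above in action, one might refit fm1 with a shifted predictor (a sketch; Days2 and fm1shift are our own names):

R> fm1shift <- lmer(Reaction ~ Days2 + (Days2 | Subject),
+    data = transform(sleepstudy, Days2 = Days - 4))
R> c(logLik(fm1), logLik(fm1shift))  # equal: the correlated fit is shift-invariant
R> VarCorr(fm1shift)                 # but the estimated correlation itself changes

For fm2, by contrast, the analogous shift would change the fit itself, not just the reported correlation.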

2.3. Algebraic and computational account of mixed-model formulas

The fixed-effects terms of a mixed-model formula are parsed to produce the fixed-effects model matrix, X, in the same way that the R lm function generates model matrices. However, a mixed-model formula incorporates k ≥ 1 random-effects terms of the form (r | f) as well. These k terms are used to produce the random-effects model matrix, Z (Equation 2), and the structure of the relative covariance factor, Λθ (Equation 4), which are matrices that typically have a sparse structure. We now describe how one might construct these matrices from the random-effects terms, considering first a single term, (r | f), and then generalizing to multiple terms. Tables 3 and 4 summarize the matrices and vectors that determine the structure of Z and Λθ.

The expression, r, is a linear model formula that evaluates to an R model matrix, Xᵢ, of size n × pᵢ, called the raw random-effects model matrix for term i. A term is said to be a scalar random-effects term when pᵢ = 1, otherwise it is vector-valued. For a simple, scalar random-effects term of the form (1 | f), Xᵢ is the n × 1 matrix of ones, which implies a random intercept model.

The expression f evaluates to an R factor, called the grouping factor, for the term. For the ith term, we represent this factor mathematically with a vector iᵢ of factor indices, which is an n-vector of values from 1, . . . , ℓᵢ.¹

¹ In practice, fixed-effects model matrices and random-effects terms are evaluated with respect to a model frame, ensuring that any expressions for grouping factors have been coerced to factors and any unused levels of these factors have been dropped. That is, ℓᵢ, the number of levels in the grouping factor for the ith random-effects term, is well-defined.

Let Jᵢ be the n × ℓᵢ matrix of indicator columns for iᵢ. Using the Matrix package (Bates and Maechler 2015) in R, we may construct the transpose of Jᵢ from a factor vector, f, by coercing f to a 'sparseMatrix' object. For example,

R> (f <- gl(3, 2))

[1] 1 1 2 2 3 3
Levels: 1 2 3

R> (Ji <- t(as(f, Class = "sparseMatrix")))

6 x 3 sparse Matrix of class "dgCMatrix"
     1 2 3
[1,] 1 . .
[2,] 1 . .
[3,] . 1 .
[4,] . 1 .
[5,] . . 1
[6,] . . 1

When k > 1 we order the random-effects terms so that ℓ₁ ≥ ℓ₂ ≥ · · · ≥ ℓₖ; in general, this ordering reduces "fill-in" (i.e., the proportion of elements that are zero in the lower triangle of $\Lambda_\theta^\top Z^\top W Z \Lambda_\theta + I$ but not in the lower triangle of its left Cholesky factor, Lθ, described below in Equation 18). This reduction in fill-in provides more efficient matrix operations within the penalized least squares algorithm (Section 3.2).

Constructing the random-effects model matrix

The ith random-effects term contributes qᵢ = ℓᵢpᵢ columns to the model matrix Z. We group these columns into a matrix, Zᵢ, which we refer to as the term-wise model matrix for the ith term. Thus q, the number of columns in Z and the dimension of the random variable, B, is

\[ q = \sum_{i=1}^{k} q_i = \sum_{i=1}^{k} \ell_i p_i. \tag{5} \]

Creating the matrix Zᵢ from Xᵢ and Jᵢ is a straightforward concept that is, nonetheless, somewhat awkward to describe. Consider Zᵢ as being further decomposed into ℓᵢ blocks of pᵢ columns. The rows in the first block are the rows of Xᵢ multiplied by the 0/1 values in the first column of Jᵢ, and similarly for the subsequent blocks. With these definitions we may define the term-wise random-effects model matrix, Zᵢ, for the ith term as a transposed Khatri-Rao product,

\[ Z_i = \left( J_i^\top * X_i^\top \right)^\top = \begin{bmatrix} J_{i1}^\top \otimes X_{i1}^\top \\ J_{i2}^\top \otimes X_{i2}^\top \\ \vdots \\ J_{in}^\top \otimes X_{in}^\top \end{bmatrix}, \tag{6} \]

where ∗ and ⊗ are the Khatri-Rao² (Khatri and Rao 1968) and Kronecker products, and Jᵢⱼᵀ and Xᵢⱼᵀ are row vectors of the jth rows of Jᵢ and Xᵢ. These rows correspond to the jth sample in the response vector, Y, and thus j runs from 1, . . . , n. The Matrix package for R contains a KhatriRao function, which can be used to form Zᵢ. For example, if we begin with a raw model matrix,

R> (Xi <- cbind(1, rep.int(c(-1, 1), 3L)))

[,1] [,2]

[1,] 1 -1

[2,] 1 1

[3,] 1 -1

[4,] 1 1

[5,] 1 -1

[6,] 1 1

then the term-wise random-effects model matrix is,

R> (Zi <- t(KhatriRao(t(Ji), t(Xi))))

6 x 6 sparse Matrix of class "dgCMatrix"

[1,] 1 -1 . . . .

[2,] 1 1 . . . .

[3,] . . 1 -1 . .

[4,] . . 1 1 . .

[5,] . . . . 1 -1

[6,] . . . . 1 1

In particular, for a simple, scalar term, Zᵢ is exactly Jᵢ, the matrix of indicator columns. For other scalar terms, Zᵢ is formed by element-wise multiplication of the single column of Xᵢ by each of the columns of Jᵢ.

Because each Zᵢ is generated from indicator columns, its cross-product, ZᵢᵀZᵢ, is block-diagonal, consisting of ℓᵢ diagonal blocks each of size pᵢ.³ Note that this means that when k = 1 (i.e., there is only one random-effects term, and Zᵢ = Z), ZᵀZ will be block-diagonal. These block-diagonal properties allow for more efficient sparse matrix computations (Section 3.7).

² Note that the original definition of the Khatri-Rao product is more general than the definition used in the Matrix package, which is the definition we use here.

³ To see this, note that by the properties of Kronecker products we may write the cross-product matrix ZᵢᵀZᵢ as $\sum_{j=1}^{n} J_{ij} J_{ij}^\top \otimes X_{ij} X_{ij}^\top$. Because Jᵢⱼ is a unit vector along a coordinate axis, the cross-product JᵢⱼJᵢⱼᵀ is an ℓᵢ × ℓᵢ matrix of all zeros except for a single 1 along the diagonal. Therefore, the cross-products, XᵢⱼXᵢⱼᵀ, will be added to one of the ℓᵢ blocks of size pᵢ × pᵢ along the diagonal of ZᵢᵀZᵢ.

The full random-effects model matrix, Z, is constructed from k ≥ 1 blocks,

\[ Z = \begin{bmatrix} Z_1 & Z_2 & \cdots & Z_k \end{bmatrix}. \tag{7} \]

By transposing Equation 7 and substituting in Equation 6, we may represent the structure of the transposed random-effects model matrix as follows,

\[ Z^\top = \begin{bmatrix} J_{11} \otimes X_{11} & J_{12} \otimes X_{12} & \cdots & J_{1n} \otimes X_{1n} \\ J_{21} \otimes X_{21} & J_{22} \otimes X_{22} & \cdots & J_{2n} \otimes X_{2n} \\ \vdots & \vdots & \ddots & \vdots \end{bmatrix}, \tag{8} \]

where the jth column of blocks corresponds to the jth sample and the ith row of blocks corresponds to the ith random-effects term.

Note that the proportion of elements of Zᵀ that are structural zeros is

\[ \frac{\sum_{i=1}^{k} p_i (\ell_i - 1)}{\sum_{i=1}^{k} p_i \ell_i}. \tag{9} \]

Therefore, the sparsity of Zᵀ increases with the number of grouping factor levels. As the number of levels is often large in practice, it is essential for speed and efficiency to take account of this sparsity, for example by using sparse matrix methods, when fitting mixed models (Section 3.7).
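This sparsity is easy to inspect on the fitted sleepstudy model; a small sketch using the getME accessor exported by lme4:

R> Zt <- getME(fm1, "Zt")        # transposed random-effects model matrix
R> dim(Zt)                       # 36 rows (q) by 180 columns (n)
R> length(Zt@x) / prod(dim(Zt))  # fraction of entries that are nonzero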

Constructing the relative covariance factor

The q × q covariance factor, Λθ, is a block-diagonal matrix whose ith diagonal block, Λᵢ, is of size qᵢ, i = 1, . . . , k. We refer to Λᵢ as the term-wise relative covariance factor. Furthermore, Λᵢ is a homogeneous block-diagonal matrix with each of the ℓᵢ lower-triangular blocks on the diagonal being a copy of a pᵢ × pᵢ lower-triangular template matrix, Tᵢ. The covariance parameter vector, θ, of length mᵢ = pᵢ(pᵢ + 1)/2, consists of the elements in the lower triangle of Tᵢ, i = 1, . . . , k. To provide a unique representation we require that the diagonal elements of the Tᵢ, i = 1, . . . , k, be non-negative.

The template, Tᵢ, can be constructed from the number pᵢ alone. In R code we denote pᵢ as nc. For example, if we set nc <- 3, we could create the template for term i as

R> (rowIndices <- rep(1:nc, 1:nc))

[1] 1 2 2 3 3 3

R> (colIndices <- sequence(1:nc))

[1] 1 1 2 1 2 3

R> (template <- sparseMatrix(rowIndices, colIndices,

+ x = 1 * (rowIndices == colIndices)))


3 x 3 sparse Matrix of class "dgCMatrix"

[1,] 1 . .

[2,] 0 1 .

[3,] 0 0 1

Note that the rowIndices and colIndices fill the entire lower triangle, which contains the initial values of the covariance parameter vector, θ,

R> (theta <- template@x)

[1] 1 0 0 1 0 1

(because the @x slot of the sparse matrix template is a numeric vector containing the nonzero elements). This template contains three types of elements: structural zeros (denoted by .), off-diagonal covariance parameters (initialized at 0), and diagonal variance parameters (initialized at 1). The next step in the construction of the relative covariance factor is to repeat the template once for each level of the grouping factor to construct a sparse block-diagonal matrix. For example, if we set the number of levels, ℓᵢ, to two, nl <- 2, we could create the transposed term-wise relative covariance factor, Λᵢᵀ, using the .bdiag function in the Matrix package,

R> (Lambdati <- .bdiag(rep(list(t(template)), nl)))

6 x 6 sparse Matrix of class "dgTMatrix"
[1,] 1 0 0 . . .
[2,] . 1 0 . . .
[3,] . . 1 . . .
[4,] . . . 1 0 0
[5,] . . . . 1 0
[6,] . . . . . 1

For a model with a single random-effects term, Λᵢᵀ would be the initial transposed relative covariance factor Λθᵀ itself.

The transposed relative covariance factor, Λθᵀ, that arises from parsing the formula and data is set at the initial value of the covariance parameters, θ. However, during model fitting, it needs to be updated to a new θ value at each iteration. This is achieved by constructing a vector of indices, Lind, that identifies which elements of theta should be placed in which elements of Lambdat,

R> LindTemplate <- rowIndices + nc * (colIndices - 1) - choose(colIndices, 2)

R> (Lind <- rep(LindTemplate, nl))

[1] 1 2 4 3 5 6 1 2 4 3 5 6

For example, if we randomly generate a new value for theta,


R> thetanew <- round(runif(length(theta)), 1)

we may update Lambdati as follows,

R> Lambdati@x <- thetanew[Lind]

Section 3.6 describes the process of updating the relative covariance factor in more detail.

3. Objective function module

3.1. Model reformulation for improved computational stability

In our initial formulation of the linear mixed model (Equations 2, 3, and 4), the covariance parameter vector, θ, appears only in the marginal distribution of the random effects (Equation 3). However, from the perspective of computational stability and efficiency, it is advantageous to reformulate the model such that θ appears only in the conditional distribution for the response vector given the random effects. Such a reformulation allows us to work with singular covariance matrices, which regularly arise in practice (e.g., during intermediate steps of the nonlinear optimizer, Section 4).

The reformulation is made by defining a spherical⁴ random-effects variable, U, with distribution

\[ \mathcal{U} \sim \mathcal{N}(0,\ \sigma^2 I_q). \tag{10} \]

If we set

\[ \mathcal{B} = \Lambda_\theta \, \mathcal{U}, \tag{11} \]

then B will have the desired N(0, Σθ) distribution (Equation 3). Although it may seem more natural to define U in terms of B, we must write the relationship as in Equation 11 to allow for singular Λθ. The conditional distribution (Equation 2) of the response vector given the random effects may now be reformulated as

\[ (\mathcal{Y} \mid \mathcal{U} = u) \sim \mathcal{N}(\mu_{\mathcal{Y} \mid \mathcal{U} = u},\ \sigma^2 W^{-1}), \tag{12} \]

where

\[ \mu_{\mathcal{Y} \mid \mathcal{U} = u} = X\beta + Z \Lambda_\theta u + o \tag{13} \]

is a vector of linear predictors, which can be interpreted as a conditional mean (or mode). Similarly, we also define µ_{U|Y=yobs} as the conditional mean (or mode) of the spherical random effects given the observed value of the response vector. Note also that we use the u symbol throughout to represent a specific value of the random variable, U.

⁴ N(µ, σ²I) distributions are called "spherical" because contours of the probability density are spheres.
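The relationship in Equation 11 can be checked on a fitted model; a minimal sketch using getME accessors:

R> Lt <- getME(fm1, "Lambdat")  # transposed relative covariance factor
R> u  <- getME(fm1, "u")        # spherical conditional modes
R> b  <- getME(fm1, "b")        # conditional modes of B
R> all.equal(as.numeric(crossprod(Lt, u)), as.numeric(b))  # b = Lambda_theta u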

3.2. Penalized least squares

Our computational methods for maximum likelihood fitting of the linear mixed model involve repeated applications of the PLS method. In particular, the PLS problem is to minimize the penalized weighted residual sum-of-squares,

\[ r^2(\theta, \beta, u) = \rho^2(\theta, \beta, u) + \|u\|^2, \tag{14} \]

over $\begin{bmatrix} u \\ \beta \end{bmatrix}$, where

\[ \rho^2(\theta, \beta, u) = \left\| W^{1/2} \left[ y_{\mathrm{obs}} - \mu_{\mathcal{Y} \mid \mathcal{U} = u} \right] \right\|^2 \tag{15} \]

is the weighted residual sum-of-squares. This notation makes explicit the fact that r² and ρ² both depend on θ, β, and u. The reason for the word "penalized" is that the term, ‖u‖², in Equation 14 penalizes models with larger magnitude values of u.

In the so-called "pseudo-data" approach we write the penalized weighted residual sum-of-squares as the squared length of a block matrix equation,

\[ r^2(\theta, \beta, u) = \left\| \begin{bmatrix} W^{1/2}(y_{\mathrm{obs}} - o) \\ 0 \end{bmatrix} - \begin{bmatrix} W^{1/2} Z \Lambda_\theta & W^{1/2} X \\ I_q & 0 \end{bmatrix} \begin{bmatrix} u \\ \beta \end{bmatrix} \right\|^2. \tag{16} \]

This pseudo-data approach shows that the PLS problem may also be thought of as a standard least squares problem for an extended response vector, which implies that the minimizing value, $\begin{bmatrix} \mu_{\mathcal{U} \mid \mathcal{Y} = y_{\mathrm{obs}}} \\ \widehat{\beta}_\theta \end{bmatrix}$, satisfies the normal equations,

\[ \begin{bmatrix} \Lambda_\theta^\top Z^\top W (y_{\mathrm{obs}} - o) \\ X^\top W (y_{\mathrm{obs}} - o) \end{bmatrix} = \begin{bmatrix} \Lambda_\theta^\top Z^\top W Z \Lambda_\theta + I & \Lambda_\theta^\top Z^\top W X \\ X^\top W Z \Lambda_\theta & X^\top W X \end{bmatrix} \begin{bmatrix} \mu_{\mathcal{U} \mid \mathcal{Y} = y_{\mathrm{obs}}} \\ \widehat{\beta}_\theta \end{bmatrix}, \tag{17} \]

where µ_{U|Y=yobs} is the conditional mean of U given that Y = yobs. Note that this conditional mean depends on θ, although we do not make this dependency explicit in order to reduce notational clutter.

The cross-product matrix in Equation 17 can be Cholesky decomposed,

\[ \begin{bmatrix} \Lambda_\theta^\top Z^\top W Z \Lambda_\theta + I & \Lambda_\theta^\top Z^\top W X \\ X^\top W Z \Lambda_\theta & X^\top W X \end{bmatrix} = \begin{bmatrix} L_\theta & 0 \\ R_{ZX}^\top & R_X^\top \end{bmatrix} \begin{bmatrix} L_\theta^\top & R_{ZX} \\ 0 & R_X \end{bmatrix}. \tag{18} \]

We may use this decomposition to rewrite the penalized weighted residual sum-of-squares as

\[ r^2(\theta, \beta, u) = r^2(\theta) + \left\| L_\theta^\top (u - \mu_{\mathcal{U} \mid \mathcal{Y} = y_{\mathrm{obs}}}) + R_{ZX} (\beta - \widehat{\beta}_\theta) \right\|^2 + \left\| R_X (\beta - \widehat{\beta}_\theta) \right\|^2, \tag{19} \]

where we have simplified notation by writing $r^2(\theta, \widehat{\beta}_\theta, \mu_{\mathcal{U} \mid \mathcal{Y} = y_{\mathrm{obs}}})$ as $r^2(\theta)$. This is an important expression in the theory underlying lme4. It relates the penalized weighted residual sum-of-squares, r²(θ, β, u), with its minimum value, r²(θ). This relationship is useful in the next section where we integrate over the random effects, which is required for maximum likelihood estimation.

3.3. Probability densities

The residual sums-of-squares discussed in the previous section can be used to express various probability densities, which are required for maximum likelihood estimation of the linear mixed model⁵:

\[ f_{\mathcal{Y} \mid \mathcal{U}}(y_{\mathrm{obs}} \mid u) = \frac{|W|^{1/2}}{(2\pi\sigma^2)^{n/2}} \exp\!\left[ \frac{-\rho^2(\theta, \beta, u)}{2\sigma^2} \right], \tag{20} \]

\[ f_{\mathcal{U}}(u) = \frac{1}{(2\pi\sigma^2)^{q/2}} \exp\!\left[ \frac{-\|u\|^2}{2\sigma^2} \right], \tag{21} \]

\[ f_{\mathcal{Y},\mathcal{U}}(y_{\mathrm{obs}}, u) = \frac{|W|^{1/2}}{(2\pi\sigma^2)^{(n+q)/2}} \exp\!\left[ \frac{-r^2(\theta, \beta, u)}{2\sigma^2} \right], \tag{22} \]

\[ f_{\mathcal{U} \mid \mathcal{Y}}(u \mid y_{\mathrm{obs}}) = \frac{f_{\mathcal{Y},\mathcal{U}}(y_{\mathrm{obs}}, u)}{f_{\mathcal{Y}}(y_{\mathrm{obs}})}, \tag{23} \]

where

\[ f_{\mathcal{Y}}(y_{\mathrm{obs}}) = \int f_{\mathcal{Y},\mathcal{U}}(y_{\mathrm{obs}}, u) \, du. \tag{24} \]

⁵ These expressions only technically hold at the observed value, yobs, of the response vector, Y.

The log-likelihood to be maximized can therefore be expressed as

\[ L(\theta, \beta, \sigma^2 \mid y_{\mathrm{obs}}) = \log f_{\mathcal{Y}}(y_{\mathrm{obs}}). \tag{25} \]

The integral in Equation 24 may be more explicitly written as

\[ f_{\mathcal{Y}}(y_{\mathrm{obs}}) = \frac{|W|^{1/2}}{(2\pi\sigma^2)^{(n+q)/2}} \exp\!\left[ \frac{-r^2(\theta) - \left\| R_X (\beta - \widehat{\beta}_\theta) \right\|^2}{2\sigma^2} \right] \int \exp\!\left[ \frac{-\left\| L_\theta^\top (u - \mu_{\mathcal{U} \mid \mathcal{Y} = y_{\mathrm{obs}}}) + R_{ZX} (\beta - \widehat{\beta}_\theta) \right\|^2}{2\sigma^2} \right] du, \tag{26} \]

which can be evaluated with the change of variables,

\[ v = L_\theta^\top (u - \mu_{\mathcal{U} \mid \mathcal{Y} = y_{\mathrm{obs}}}) + R_{ZX} (\beta - \widehat{\beta}_\theta). \tag{27} \]

The Jacobian determinant of the transformation from u to v is |Lθ|. Therefore we are able to write the integral as

\[ f_{\mathcal{Y}}(y_{\mathrm{obs}}) = \frac{|W|^{1/2}}{(2\pi\sigma^2)^{(n+q)/2}} \exp\!\left[ \frac{-r^2(\theta) - \left\| R_X (\beta - \widehat{\beta}_\theta) \right\|^2}{2\sigma^2} \right] \int \exp\!\left[ \frac{-\|v\|^2}{2\sigma^2} \right] |L_\theta|^{-1} \, dv, \tag{28} \]

which by the properties of exponential integrands becomes,

\[ \exp L(\theta, \beta, \sigma^2 \mid y_{\mathrm{obs}}) = f_{\mathcal{Y}}(y_{\mathrm{obs}}) = \frac{|W|^{1/2} |L_\theta|^{-1}}{(2\pi\sigma^2)^{n/2}} \exp\!\left[ \frac{-r^2(\theta) - \left\| R_X (\beta - \widehat{\beta}_\theta) \right\|^2}{2\sigma^2} \right]. \tag{29} \]

3.4. Evaluating and profiling the deviance and the REML criterion

We are now in a position to understand why the formulation in Equations 2 and 3 is particularly useful. We are able to explicitly profile β and σ out of the log-likelihood (Equation 25), to find a compact expression for the profiled deviance (negative twice the profiled log-likelihood) and the profiled REML criterion as a function of the relative covariance parameters, θ, only. Furthermore these criteria can be evaluated quickly and accurately.

To estimate the parameters, θ, β, and σ², we minimize negative twice the log-likelihood, which can be written as

\[ -2 L(\theta, \beta, \sigma^2 \mid y_{\mathrm{obs}}) = \log \frac{|L_\theta|^2}{|W|} + n \log(2\pi\sigma^2) + \frac{r^2(\theta)}{\sigma^2} + \frac{\left\| R_X (\beta - \widehat{\beta}_\theta) \right\|^2}{\sigma^2}. \tag{30} \]

It is very easy to profile out β, because it enters into the ML criterion only through the final term, which is zero if β = β̂θ, where β̂θ is found by solving the penalized least-squares problem in Equation 16. Therefore we can write a partially profiled ML criterion as

\[ -2 L(\theta, \sigma^2 \mid y_{\mathrm{obs}}) = \log \frac{|L_\theta|^2}{|W|} + n \log(2\pi\sigma^2) + \frac{r^2(\theta)}{\sigma^2}. \tag{31} \]

This criterion is only partially profiled because it still depends on σ². Differentiating this criterion with respect to σ² and setting the result equal to zero yields

\[ 0 = \frac{n}{\widehat{\sigma}^2_\theta} - \frac{r^2(\theta)}{\widehat{\sigma}^4_\theta}, \tag{32} \]

which leads to a maximum profiled likelihood estimate,

\[ \widehat{\sigma}^2_\theta = \frac{r^2(\theta)}{n}. \tag{33} \]

This estimate can be substituted into the partially profiled criterion to yield the fully profiled ML criterion,

\[ -2 L(\theta \mid y_{\mathrm{obs}}) = \log \frac{|L_\theta|^2}{|W|} + n \left[ 1 + \log\!\left( \frac{2\pi r^2(\theta)}{n} \right) \right]. \tag{34} \]

This expression for the profiled deviance depends only on θ. Although q, the number of columns in Z and the size of Σθ, can be very large indeed, the dimension of θ is small, frequently less than 10. The lme4 package uses generic nonlinear optimizers (Section 4) to optimize this expression over θ to find its maximum likelihood estimate.
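The deviance function constructed in Section 1.3 evaluates exactly this kind of profiled criterion (the REML variant, Equation 41, under the default REML = TRUE); a quick sketch of evaluating it at trial values of θ, which for fm1 has length 3:

R> devianceFunction(c(1, 0, 1))           # criterion at the usual starting value
R> devianceFunction(getME(fm1, "theta"))  # criterion at the converged estimate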

The REML criterion

The REML criterion can be obtained by integrating the marginal density for Y with respect to the fixed effects (Laird and Ware 1982),

\[ \int f_{\mathcal{Y}}(y_{\mathrm{obs}}) \, d\beta = \frac{|W|^{1/2} |L_\theta|^{-1}}{(2\pi\sigma^2)^{n/2}} \exp\!\left[ \frac{-r^2(\theta)}{2\sigma^2} \right] \int \exp\!\left[ \frac{-\left\| R_X (\beta - \widehat{\beta}_\theta) \right\|^2}{2\sigma^2} \right] d\beta, \tag{35} \]

which can be evaluated with the change of variables,

\[ v = R_X (\beta - \widehat{\beta}_\theta). \tag{36} \]

The Jacobian determinant of the transformation from β to v is |R_X|. Therefore we are able to write the integral as

\[ \int f_{\mathcal{Y}}(y_{\mathrm{obs}}) \, d\beta = \frac{|W|^{1/2} |L_\theta|^{-1}}{(2\pi\sigma^2)^{n/2}} \exp\!\left[ \frac{-r^2(\theta)}{2\sigma^2} \right] \int \exp\!\left[ \frac{-\|v\|^2}{2\sigma^2} \right] |R_X|^{-1} \, dv, \tag{37} \]

which simplifies to

\[ \int f_{\mathcal{Y}}(y_{\mathrm{obs}}) \, d\beta = \frac{|W|^{1/2} |L_\theta|^{-1} |R_X|^{-1}}{(2\pi\sigma^2)^{(n-p)/2}} \exp\!\left[ \frac{-r^2(\theta)}{2\sigma^2} \right]. \tag{38} \]

Minus twice the log of this integral is the (unprofiled) REML criterion,

\[ -2 L_R(\theta, \sigma^2 \mid y_{\mathrm{obs}}) = \log \frac{|L_\theta|^2 |R_X|^2}{|W|} + (n - p) \log(2\pi\sigma^2) + \frac{r^2(\theta)}{\sigma^2}. \tag{39} \]

Note that because β gets integrated out, the REML criterion cannot be used to find a point estimate of β. However, we follow others in using the maximum likelihood estimate, β̂_θ̂, at the optimum value of θ = θ̂. The REML estimate of σ² is

\[ \widehat{\sigma}^2_\theta = \frac{r^2(\theta)}{n - p}, \tag{40} \]

which leads to the profiled REML criterion,

\[ -2 L_R(\theta \mid y_{\mathrm{obs}}) = \log \frac{|L_\theta|^2 |R_X|^2}{|W|} + (n - p) \left[ 1 + \log\!\left( \frac{2\pi r^2(\theta)}{n - p} \right) \right]. \tag{41} \]

3.5. Changes relative to previous formulations

We compare the PLS problem as formulated in Section 3.2 with earlier versions and describe why we use this version. What have become known as "Henderson's mixed-model equations" are given as Equation 6 of Henderson Jr. (1982) and would be expressed as

\[ \begin{bmatrix} X^\top X / \sigma^2 & X^\top Z / \sigma^2 \\ Z^\top X / \sigma^2 & Z^\top Z / \sigma^2 + \Sigma^{-1} \end{bmatrix} \begin{bmatrix} \widehat{\beta}_\theta \\ \mu_{\mathcal{B} \mid \mathcal{Y} = y_{\mathrm{obs}}} \end{bmatrix} = \begin{bmatrix} X^\top y_{\mathrm{obs}} / \sigma^2 \\ Z^\top y_{\mathrm{obs}} / \sigma^2 \end{bmatrix}, \tag{42} \]

in our notation (ignoring weights and offsets, without loss of generality). The matrix written as R in Henderson Jr. (1982) is σ²Iₙ in our formulation of the model.

Bates and DebRoy (2004) modified the PLS equations to

\[ \begin{bmatrix} Z^\top Z + \Omega & Z^\top X \\ X^\top Z & X^\top X \end{bmatrix} \begin{bmatrix} \mu_{\mathcal{B} \mid \mathcal{Y} = y_{\mathrm{obs}}} \\ \widehat{\beta}_\theta \end{bmatrix} = \begin{bmatrix} Z^\top y_{\mathrm{obs}} \\ X^\top y_{\mathrm{obs}} \end{bmatrix}, \tag{43} \]

where $\Omega_\theta = \left( \Lambda_\theta^\top \Lambda_\theta \right)^{-1} = \sigma^2 \Sigma^{-1}$ is the relative precision matrix for a given value of θ. They also showed that the profiled log-likelihood can be expressed (on the deviance scale) as

\[ -2 L(\theta) = \log\!\left( \frac{|Z^\top Z + \Omega|}{|\Omega|} \right) + n \left[ 1 + \log\!\left( \frac{2\pi r^2(\theta)}{n} \right) \right]. \tag{44} \]

The primary difference between Equation 42 and Equation 43 is the order of the blocks in the system matrix. The PLS problem can be solved using a Cholesky factor of the system matrix with the blocks in either order. The advantage of using the arrangement in Equation 43 is to allow for evaluation of the profiled log-likelihood. To evaluate |ZᵀZ + Ω| from the Cholesky factor, that block must be in the upper-left corner, not the lower right. Also, Z is sparse whereas X is usually dense. It is straightforward to exploit the sparsity of ZᵀZ in the Cholesky factorization when the block containing this matrix is the first block to be factored. If XᵀX is the first block to be factored, it is much more difficult to preserve sparsity.

The main change from the formulation in Bates and DebRoy (2004) to the current formulation is the use of a relative covariance factor, Λθ, instead of a relative precision matrix, Ωθ, and solving for the mean of U | Y = yobs instead of the mean of B | Y = yobs. This change improves stability, because the solution to the PLS problem in Section 3.2 is well-defined when Λθ is singular, whereas the formulation in Equation 43 cannot be used in these cases because Ωθ does not exist.

It is important to allow for Λθ to be singular because situations where the parameter estimates, θ̂, produce a singular Λ_θ̂ do occur in practice. And even if the parameter estimates do not correspond to a singular Λθ, it may be desirable to evaluate the estimation criterion at such values during the course of the numerical optimization of the criterion.

Bates and DebRoy (2004) also provided expressions for the gradient of the profiled log-likelihood expressed as Equation 44. These expressions can be translated into the current formulation. From Equation 34 we can see that (again ignoring weights),

\[ \begin{aligned} \nabla\left( -2 L(\theta) \right) &= \nabla \log(|L_\theta|^2) + \nabla\, n \log(r^2(\theta)) \\ &= \nabla \log(|\Lambda_\theta^\top Z^\top Z \Lambda_\theta + I|) + n \, \nabla r^2(\theta) / r^2(\theta) \\ &= \nabla \log(|\Lambda_\theta^\top Z^\top Z \Lambda_\theta + I|) + \nabla r^2(\theta) / \widehat{\sigma}^2. \end{aligned} \tag{45} \]

The first gradient is easy to express but difficult to evaluate for the general model. The individual elements of this gradient are

\[ \begin{aligned} \frac{\partial \log(|\Lambda_\theta^\top Z^\top Z \Lambda_\theta + I|)}{\partial \theta_i} &= \operatorname{tr}\!\left[ \frac{\partial \left( \Lambda_\theta^\top Z^\top Z \Lambda_\theta \right)}{\partial \theta_i} \left( \Lambda_\theta^\top Z^\top Z \Lambda_\theta + I \right)^{-1} \right] \\ &= \operatorname{tr}\!\left[ \left( L_\theta L_\theta^\top \right)^{-1} \left( \Lambda_\theta^\top Z^\top Z \frac{\partial \Lambda_\theta}{\partial \theta_i} + \frac{\partial \Lambda_\theta^\top}{\partial \theta_i} Z^\top Z \Lambda_\theta \right) \right]. \end{aligned} \tag{46} \]

The second gradient term can be expressed as a linear function of the residual, with individual elements of the form

\[ \frac{\partial r^2(\theta)}{\partial \theta_i} = -2 u^\top \frac{\partial \Lambda_\theta^\top}{\partial \theta_i} Z^\top \left( y - Z \Lambda_\theta u - X \widehat{\beta}_\theta \right), \tag{47} \]

using the results of Golub and Pereyra (1973). Although we do not use these results in lme4, they are used for certain model types in the MixedModels package for Julia and do provide improved performance.

Name/description                   Pseudocode   Math                     Type
Mapping from covariance
  parameters to relative
  covariance factor                mapping                               function
Response vector                    y            yobs (Section 1.1)       double vector
Fixed-effects model matrix         X            X (Equation 2)           double denseᵃ matrix
Transposed random-effects
  model matrix                     Zt           Zᵀ (Equation 2)          double sparse matrix
Square-root weights matrix         sqrtW        W^{1/2} (Equation 2)     double diagonal matrix
Offset                             offset       o (Equation 2)           double vector

ᵃ In previous versions of lme4 a sparse X matrix, useful for models with categorical fixed-effect predictors with many levels, could be specified; this feature is not currently available.

Table 5: Inputs into a linear mixed model.

3.6. Penalized least squares algorithm

For efficiency, in lme4 itself, PLS is implemented in compiled C++ code using the Eigen (Guennebaud, Jacob, and others 2015) templated C++ package for numerical linear algebra. Here, however, in order to improve readability we describe a version in pure R. Section 3.7 provides greater detail on the techniques and concepts for computational efficiency, which is important in cases where the nonlinear optimizer (Section 4) requires many iterations.

The PLS algorithm takes a vector of covariance parameters, θ, as inputs and returns the profiled deviance (Equation 34) or the REML criterion (Equation 41). This PLS algorithm consists of four main steps:

1. Update the relative covariance factor.
2. Solve the normal equations.
3. Update the linear predictor and residuals.
4. Compute and return the profiled deviance.

PLS also requires the objects described in Table 5, which define the structure of the model. These objects do not get updated during the PLS iterations, and so it is useful to store various matrix products involving them (Table 6). Table 7 lists the objects that do get updated over the PLS iterations. The symbols in this table correspond to a version of lme4 that is implemented entirely in R (i.e., no compiled code as in lme4 itself). This implementation is called lme4pureR and is currently available on Github (https://github.com/lme4/lme4pureR/).

Pseudocode   Math
ZtW          Zᵀ W^{1/2}
ZtWy         Zᵀ W yobs
ZtWX         Zᵀ W X
XtWX         Xᵀ W X
XtWy         Xᵀ W yobs

Table 6: Constant symbols involved in penalized least squares.

Name/description                   Pseudocode   Math                        Type
Relative covariance factor         lambda       Λθ (Equation 4)             double sparse lower-triangular matrix
Random-effects Cholesky factor     L            Lθ (Equation 18)            double sparse triangular matrix
Intermediate vector in the
  solution of the normal
  equations                        cu           cᵤ (Equation 48)            double vector
Block in the full Cholesky
  factor                           RZX          R_ZX (Equation 18)          double dense matrix
Cross-product of the
  fixed-effects Cholesky factor    RXtRX        R_Xᵀ R_X (Equation 50)      double dense matrix
Fixed-effects coefficients         beta         β (Equation 2)              double vector
Spherical conditional modes        u            u (Section 3.1)             double vector
Non-spherical conditional modes    b            b (Equation 2)              double vector
Linear predictor                   mu           µ_{Y|U=u} (Equation 13)     double vector
Weighted residuals                 wtres        W^{1/2}(yobs − µ)           double vector
Penalized weighted residual
  sum-of-squares                   pwrss        r²(θ) (Equation 19)         double
Twice the log determinant of the
  random-effects Cholesky factor   logDet       log |Lθ|²                   double

Table 7: Quantities updated during an evaluation of the linear mixed model objective function.
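A sketch of how the constant products in Table 6 might be precomputed in pure R, assuming the inputs of Table 5 (Zt, X, y, and the diagonal matrix sqrtW) are in hand:

R> Wy   <- sqrtW %*% y        # W^(1/2) y
R> WX   <- sqrtW %*% X        # W^(1/2) X
R> ZtW  <- Zt %*% sqrtW       # Z'W^(1/2)
R> ZtWy <- ZtW %*% Wy         # Z'Wy
R> ZtWX <- ZtW %*% WX         # Z'WX
R> XtWX <- crossprod(WX)      # X'WX
R> XtWy <- crossprod(WX, Wy)  # X'Wy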

PLS step I: Update relative covariance factor

The first step of PLS is to update the relative covariance factor, Λθ, from the current value of the covariance parameter vector, θ. The updated Λθ is then used to update the random-effects Cholesky factor, Lθ (Equation 18). The mapping from the covariance parameters to the relative covariance factor can take many forms, but in general involves a function that takes θ into the values of the nonzero elements of Λθ.

If Λθ is stored as an object of class 'dgCMatrix' from the Matrix package for R, then we may update Λθ from θ by,

R> Lambdat@x[] <- mapping(theta)

where mapping is an R function that both accepts and returns a numeric vector. The nonzero elements of sparse matrix classes in Matrix are stored in a slot called x.

In the current version of lme4 (1.1-10) the mapping from θ to Λθ is represented as an R integer vector, Lind, of indices, so that

R> mapping <- function(theta) theta[Lind]

The index vector Lind is computed during the model setup and stored in the function's environment. Continuing the example from Section 2.3, consider a new value for theta,

R> thetanew <- c(1, -0.1, 2, 0.1, -0.2, 3)

To put these values in the appropriate elements in Lambdati, we use mapping,

R> Lambdati@x[] <- mapping(thetanew)

R> Lambdati

6 x 6 sparse Matrix of class "dgTMatrix"

[1,] 1 -0.1 2.0 . . .

[2,] . 0.1 -0.2 . . .

[3,] . . 3.0 . . .

[4,] . . . 1 -0.1 2.0

[5,] . . . . 0.1 -0.2

[6,] . . . . . 3.0

This Lind approach can be useful for extending the capabilities of lme4 by using the modular approach to fitting mixed models. For example, Appendix A.1 shows how to use Lind to fit a model where two random slopes are uncorrelated, but both slopes are correlated with an intercept.

The mapping from the covariance parameters to the relative covariance factor is treated differently in other implementations of the lme4 approach to linear mixed models. At the other extreme, the flexLambda branch of lme4 and the lme4pureR package provide the capabilities for a completely general mapping. This added flexibility has the advantage of allowing a much wider variety of models (e.g., compound symmetry, auto-regression). However, the disadvantage of this approach is that it becomes possible to fit a much wider variety of ill-posed models. Finally, if one would like such added flexibility with the current stable version of lme4, it is always possible to use the modular approach to wrap the Lind-based deviance function in a general mapping function taking a parameter to be optimized, say φ, into θ. However, this approach is likely to be inefficient in many cases.

The update method from the Matrix package efficiently updates the random-effects Cholesky factor, Lθ, from a new value of θ and the updated Λθ.

R> L <- update(L, Lambdat %*% ZtW, mult = 1)

The mult = 1 argument corresponds to the addition of the identity matrix to the upper-left block on the left-hand-side of Equation 18.

PLS step II: Solve normal equations

With the new covariance parameters installed in Λθ, the next step is to solve the normal equations (Equation 17) for the current estimate of the fixed-effects coefficients, β̂θ, and the conditional mode, µ_{U|Y=yobs}. We solve these equations using a sparse Cholesky factorization (Equation 18). In a complex model fit to a large data set, the dominant calculation in the evaluation of the profiled deviance (Equation 34) or REML criterion (Equation 41) is this sparse Cholesky factorization (Equation 18). The factorization is performed in two phases: a symbolic phase and a numeric phase. The symbolic phase, in which the fill-reducing permutation P is determined along with the positions of the nonzeros in Lθ, does not depend on the value of θ. It only depends on the positions of the nonzeros in ZΛθ. The numeric phase uses θ to determine the numeric values of the nonzeros in Lθ. Using this factorization, the solution proceeds by the following steps,

solution proceeds by the following steps,

Lθcu=PΛ>

θZ>W y (48)

LθRZX =PΛ>

θZ>W X (49)

R>

XRX=X>W X −R>

ZX RZ X (50)

R>

XRXb

βθ=X>W y −RZX cu(51)

L>

θP u =cu−RZX b

βθ(52)

which can be solved using the Matrix package with,

R> cu[] <- as.vector(solve(L, solve(L, Lambdat %*% ZtWy, system = "P"),

+ system = "L"))

R> RZX[] <- as.vector(solve(L, solve(L, Lambdat %*% ZtWX, system = "P"),

+ system = "L"))

R> RXtRX <- as(XtWX - crossprod(RZX), "dpoMatrix")

R> beta[] <- as.vector(solve(RXtRX, XtWy - crossprod(RZX, cu)))

R> u[] <- as.vector(solve(L, solve(L, cu - RZX %*% beta,

+ system = "Lt"), system = "Pt"))

Notice the nested calls to solve. The inner calls of the first two assignments determine and apply the permutation matrix (system = "P"), whereas the outer calls actually solve the equation (system = "L"). In the assignment to u[], the nesting is reversed in order to return to the original permutation.

PLS step III: Update linear predictor and residuals

The next step is to compute the linear predictor, µ_{Y|U} (Equation 13), and the weighted residuals with new values for β̂θ and µ_{B|Y=yobs}. In lme4pureR these quantities can be computed as

R> b[] <- as.vector(crossprod(Lambdat, u))

R> mu[] <- as.vector(crossprod(Zt, b) + X %*% beta + offset)

R> wtres <- sqrtW * (y - mu)

where b represents the current estimate of µ_{B|Y=yobs}.

PLS step IV: Compute profiled deviance

Finally, the updated linear predictor and weighted residuals can be used to compute the profiled deviance (or REML criterion), which in lme4pureR proceeds as

R> pwrss <- sum(wtres^2) + sum(u^2)

R> logDet <- 2 * determinant(L, logarithm = TRUE)$modulus

R> if (REML) logDet <- logDet + determinant(RXtRX, logarithm = TRUE)$modulus

R> attributes(logDet) <- NULL

R> profDev <- logDet + degFree * (1 + log(2 * pi * pwrss) - log(degFree))

The profiled deviance consists of three components: (1) the log-determinant(s) of the Cholesky factorization (logDet), (2) the degrees of freedom (degFree), and (3) the penalized weighted residual sum-of-squares (pwrss).

3.7. Sparse matrix methods

In fitting linear mixed models, an instance of the PLS problem (17) must be solved at each evaluation of the objective function during the optimization (Section 4) with respect to θ. Because this operation must be performed many times it is worthwhile considering how to provide effective evaluation methods for objects and calculations involving the sparse matrices associated with random-effects terms (Section 2.3).

The CHOLMOD library of C functions (Chen, Davis, Hager, and Rajamanickam 2008), on which the sparse matrix capabilities of the Matrix package for R and the sparse Cholesky factorization in Julia are based, allows for separation of the symbolic and numeric phases. Thus we perform the symbolic phase as part of establishing the structure representing the model (Section 2). Furthermore, because CHOLMOD functions allow for updating Lθ directly from the matrix ΛθᵀZᵀ without actually forming ΛθᵀZᵀZΛθ + I, we generate and store Zᵀ instead of Z (note that we have ignored the weights matrix, W, for simplicity). We can update ΛθᵀZᵀ directly from θ without forming Λθ and multiplying two sparse matrices. Although such a direct approach is used in the MixedModels package for Julia, in lme4 we first update Λθᵀ and then form the sparse product ΛθᵀZᵀ. A third alternative, which we employ in lme4pureR, is to compute and save the cross-products of the model matrices, X and Z, and the response, y, before starting the iterations. To allow for case weights, we save the products XᵀWX, XᵀWy, ZᵀWX, ZᵀWy and ZᵀWZ (see Table 6).

We wish to use structures and algorithms that allow us to take a new value of θ and evaluate the Lθ (Equation 18) efficiently. The key to doing so is the special structure of ΛθᵀZᵀW^{1/2}. To understand why this matrix, and not its transpose, is of interest we describe the sparse matrix structures used in Julia and in the Matrix package for R.

Dense matrices are stored in R and in Julia as a one-dimensional array of contiguous storage locations addressed in column-major order. This means that elements in the same column are in adjacent storage locations whereas elements in the same row can be widely separated in memory. For this reason, algorithms that work column-wise are preferred to those that work row-wise.

Although a sparse matrix can be stored in a triplet format, where the row position, column position and element value of the nonzeros are recorded as triplets, the preferred storage forms for actual computation with sparse matrices are compressed sparse column (CSC) or compressed sparse row (CSR; Davis 2006, Chapter 2). The CHOLMOD library (and, more generally, the SuiteSparse package of C libraries; Davis et al. 2015) uses the CSC storage format. In this format the nonzero elements in a column are in adjacent storage locations and access to all the elements in a column is much easier than access to those in a row (similar to dense matrices stored in column-major order).

The matrices Z and ZΛθ have the property that the number of nonzeros in each row, Σᵢ₌₁ᵏ pᵢ, is constant. For CSC matrices we want consistency in the columns rather than the rows, which is why we work with Zᵀ and ΛθᵀZᵀ rather than their transposes.

An arbitrary m × n sparse matrix in CSC format is expressed as two vectors of indices, the row indices and the column pointers, and a numeric vector of the nonzero values. The elements of the row indices and the nonzeros are aligned and are ordered first by increasing column number and then by increasing row number within column. The column pointers are a vector of size n + 1 giving the location of the start of each column in the vectors of row indices and nonzeros.
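These three components are visible directly in the slots of a 'dgCMatrix'; a small sketch reusing the Ji matrix from Section 2.3 (note that the Matrix package stores 0-based row indices):

R> Ji@i  # row indices of the nonzeros, ordered by column
R> Ji@p  # column pointers: where each column starts within i and x
R> Ji@x  # the nonzero values, aligned with the row indices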

4. Nonlinear optimization module

The objective function module produces a function which implements the penalized least squares algorithm for a particular mixed model. The next step is to optimize this function over the covariance parameters, θ, which is a nonlinear optimization problem. The lme4 package separates the efficient computational linear algebra required to compute profiled likelihoods and deviances for a given value of θ from the nonlinear optimization algorithms, which use general-purpose nonlinear optimizers.

One benefit of this separation is that it allows for experimentation with different nonlinear optimizers. Throughout the development of lme4, the default optimizers and control parameters have changed in response to feedback from users about both efficiency and convergence properties. lme4 incorporates two built-in optimization choices: an implementation of the Nelder-Mead simplex algorithm adapted from Steven Johnson's NLopt C library (Johnson 2014) and a wrapper for Powell's BOBYQA algorithm, implemented in the minqa package (Bates, Mullen, Nash, and Varadhan 2014) as a wrapper around Powell's original Fortran code (Powell 2009). More generally, lme4 allows any user-specified optimizer that (1) can work with an objective function (i.e., does not require a gradient function to be specified), (2) allows box constraints on the parameters, and (3) conforms to some simple rules about argument names and structure of the output. An internal wrapper allows the use of the optimx package (Nash and Varadhan 2011), although the only optimizers provided via optimx that satisfy the constraints above are the nlminb and L-BFGS-B algorithms that are themselves wrappers around the versions provided in base R. Several other algorithms from Steven Johnson's NLopt package are also available via the nloptr wrapper package (e.g., alternate implementations of Nelder-Mead and BOBYQA, and Powell's COBYLA algorithm).
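For example, a different built-in optimizer can be selected through the control argument of lmer; a minimal sketch, using the sleepstudy fit fm1 from Section 1.2:

R> fm1_bobyqa <- update(fm1, control = lmerControl(optimizer = "bobyqa"))
R> fm1_NM <- update(fm1, control = lmerControl(optimizer = "Nelder_Mead"))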

This flexibility assists with diagnosing convergence problems – it is easy to switch among several algorithms to determine whether the problem lies in a failure of the nonlinear optimization stage, as opposed to a case of model misspecification or unidentifiability or a problem with the underlying PLS algorithm. To date we have only observed PLS failures, which arise if $X^\top WX - R_{ZX}^\top R_{ZX}$ becomes singular during an evaluation of the objective function, with badly scaled problems (i.e., problems with continuous predictors that take a very large or very small numerical range of values).

The requirement for optimizers that can handle box constraints stems from our decision to parameterize the variance-covariance matrix in a constrained space, in order to allow for singular fits. In contrast to the approach taken in the nlme package (Pinheiro et al. 2015), which goes to some lengths to use an unconstrained variance-covariance parameterization (the log-Cholesky parameterization; Pinheiro and Bates 1996), we instead use the Cholesky parameterization but require the elements of $\theta$ corresponding to the diagonal elements of the Cholesky factor to be non-negative. With these constraints, the variance-covariance matrix is singular if and only if any of the diagonal elements is exactly zero. Singular fits are common in practical data-analysis situations, especially with small- to medium-sized data sets and complex variance-covariance models, so being able to fit a singular model is an advantage: when the best fitting model lies on the boundary of a constrained space, approaches that try to remove the constraints by fitting parameters on a transformed scale will often give rise to convergence warnings as the algorithm tries to find a maximum on an asymptotically flat surface (Bolker et al. 2013).

In principle the likelihood surfaces we are trying to optimize over are smooth, but in practice using gradient information in optimization may or may not be worth the effort. In special cases, we have a closed-form solution for the gradients (Equations 45-47), but in general we will have to approximate them by finite differences, which is expensive and has limited accuracy. (We have considered using automatic differentiation approaches to compute the gradients more efficiently, but this strategy requires a great deal of additional machinery, and would have drawbacks in terms of memory requirements for large problems.) This is the primary reason for our switching to derivative-free optimizers such as BOBYQA and Nelder-Mead in the current version of lme4, although, as discussed above, derivative-based optimizers based on finite differencing are available as an alternative.

There is most likely further room for improvement in the nonlinear optimization module; for example, some speed-up could be gained by using parallel implementations of derivative-free optimizers that evaluate several trial points at once (Klein and Neira 2013). In practice users most often have optimization difficulties with poorly scaled or centered data – we are working to implement appropriate diagnostic tests and warnings to detect these situations.

5. Output module

Here we describe some of the methods in lme4 for exploring fitted linear mixed models (Table 8), which are represented as objects of the S4 class 'lmerMod'. We begin by describing the theory underlying these methods (Section 5.1) and then continue the sleepstudy example introduced in Section 1.2 to illustrate these ideas in practice.

5.1. Theory underlying the output module

Covariance matrix of the fixed-effect coefficients

In the lm function, the variance-covariance matrix of the coefficients is the inverse of the cross-product of the model matrix, times the residual variance (Chambers 1993). The inverse cross-product matrix is computed by first inverting the upper triangular matrix resulting from the QR decomposition of the model matrix, and then taking its cross-product,

\[
\mathrm{Var}_{\theta,\sigma}\left(\begin{bmatrix} \mu_{\mathcal{U}|\mathcal{Y}=y_{\mathrm{obs}}} \\ \widehat{\beta} \end{bmatrix}\right)
= \sigma^2 \begin{bmatrix} L_\theta^\top & R_{ZX} \\ 0 & R_X \end{bmatrix}^{-1}
\begin{bmatrix} L_\theta & 0 \\ R_{ZX}^\top & R_X^\top \end{bmatrix}^{-1}. \tag{53}
\]

Because of normality, the marginal covariance matrix of $\widehat{\beta}$ is just the lower-right $p \times p$ block of $\mathrm{Var}_{\theta,\sigma}\left(\begin{bmatrix} \mu_{\mathcal{U}|\mathcal{Y}=y_{\mathrm{obs}}} \\ \widehat{\beta} \end{bmatrix}\right)$. This lower-right block is

\[
\mathrm{Var}_{\theta,\sigma}(\widehat{\beta}) = \sigma^2 R_X^{-1} \left(R_X^\top\right)^{-1}, \tag{54}
\]

which follows from the Schur complement identity (Horn and Zhang 2005, Theorem 1.2). This matrix can be extracted from a fitted 'merMod' object as

R> RX <- getME(fm1, "RX")
R> sigma2 <- sigma(fm1)^2
R> sigma2 * chol2inv(RX)

          [,1]      [,2]
[1,] 46.574573 -1.451097
[2,] -1.451097  2.389463

which could be computed with lme4 as vcov(fm1).

The square-root diagonal of this covariance matrix contains the estimates of the standard errors of the fixed-effects coefficients. These standard errors are used to construct Wald confidence intervals with confint(., method = "Wald"). Such confidence intervals are approximate, and depend on the assumption that the likelihood profile of the fixed effects is linear on the $\zeta$ scale.

Profiling

The theory of likelihood profiles is straightforward: the deviance (or likelihood) profile, $-2\mathcal{L}_p(\cdot)$, for a focal model parameter $P$ is the minimum value of the deviance conditioned on a particular value of $P$. For each parameter of interest, our goal is to evaluate the deviance profile for many points – optimizing over all of the non-focal parameters each time – over a wide enough range and with high enough resolution to evaluate the shape of the profile (in particular, whether it is quadratic, which would allow use of Wald confidence intervals and tests) and to find the values of $P$ such that $-2\mathcal{L}_p(P) = -2\mathcal{L}(\widehat{P}) + \chi^2(1 - \alpha)$, which represent the profile confidence intervals. While profile confidence intervals rely on the asymptotic distribution of the minimum deviance, this is a much weaker assumption than the log-quadratic likelihood surface required by Wald tests.

An additional challenge in profiling arises when we want to compute profiles for quantities of interest that are not parameters of our PLS function. We have two problems in using the deviance function defined above to profile linear mixed models. First, a parameterization of the random-effects variance-covariance matrix in terms of standard deviations and correlations, or variances and covariances, is much more familiar to users, and much more relevant as output of a statistical model, than the parameters, $\theta$, of the relative covariance factor – users are interested in inferences on variances or standard deviations, not on $\theta$. Second, the fixed-effects coefficients and the residual standard deviation, both of which are also of interest to users, are profiled out (in the sense used above) of the deviance function (Section 3.4), so we have to find a strategy for estimating the deviance for values of $\beta$ and $\sigma^2$ other than the profiled values.

To handle the first problem we create a new version of the deviance function that first takes a vector of standard deviations (and correlations), and a value of the residual standard deviation, maps them to a $\theta$ vector, and then computes the PLS as before; it uses the specified residual standard deviation rather than the PLS estimate of the standard deviation (Equation 33) in the deviance calculation. We compute a profile likelihood for the fixed-effects parameters, which are profiled out of the deviance calculation, by adding an offset to the linear predictor for the focal element of $\beta$. The resulting function is not useful for general nonlinear optimization – one can easily wander into parameter regimes corresponding to infeasible (non-positive semidefinite) variance-covariance matrices – but it serves for likelihood profiling, where one focal parameter is varied at a time and the optimization over the other parameters is likely to start close to an optimum.

In practice, the profile method systematically varies the parameters in a model, assessing the best possible fit that can be obtained with one parameter fixed at a specific value and comparing this fit to the globally optimal fit, which is the original model fit that allowed all the parameters to vary. The models are compared according to the change in the deviance, which is the likelihood ratio test statistic. We apply a signed square root transformation to this statistic and plot the resulting function, which we call the profile zeta function or $\zeta$, versus the parameter value. The signed aspect of this transformation means that $\zeta$ is positive where the deviation from the parameter estimate is positive and negative otherwise, leading to a monotonically increasing function which is zero at the global optimum. A $\zeta$ value can be compared to the quantiles of the standard normal distribution. For example, a 95% profile deviance confidence interval on the parameter consists of the values for which $-1.96 < \zeta < 1.96$. Because the process of profiling a fitted model can be computationally intensive – it involves refitting the model many times – one should exercise caution with complex models fit to large data sets.
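For the sleepstudy fit, for example, the profile can be computed once and then reused for intervals and plots (a minimal sketch):

R> pr1 <- profile(fm1)
R> confint(pr1, level = 0.95)  # parameter values with -1.96 < zeta < 1.96
R> xyplot(pr1)                 # profile zeta plot (see Figure 2)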

The standard approach to this computational challenge is to compute $\zeta$ at a limited number of parameter values, and to fill in the gaps by fitting an interpolation spline. Often one is able to invert the spline to obtain a function from $\zeta$ to the focal parameter, which is necessary in order to construct profile confidence intervals. However, even if the points themselves are monotonic, it is possible to obtain non-monotonic interpolation curves. In such a case, lme4 falls back to linear interpolation, with a warning.

The last part of the technical specification for computing profiles is deciding on a strategy for choosing points to sample. In one way or another one wants to start at the estimated value for each parameter and work outward either until a constraint is reached (i.e., a value of 0 for a standard deviation parameter or a value of $\pm 1$ for a correlation parameter), or until a sufficiently large deviation from the minimum deviance is attained. lme4's profiler chooses a cutoff $\phi$ based on the $1 - \alpha_{\mathrm{max}}$ critical value of the $\chi^2$ distribution with a number of degrees of freedom equal to the total number of parameters in the model, where $\alpha_{\mathrm{max}}$ is set to 0.05 by default. The profile computation initially adjusts the focal parameter $p_i$ by an amount $\epsilon = 1.01\widehat{p}_i$ from its ML or REML estimate $\widehat{p}_i$ (or by $\epsilon = 0.001$ if $\widehat{p}_i$ is zero, as in the case of a singular variance-covariance model). The local slope of the likelihood profile, $(\zeta(\widehat{p}_i + \epsilon) - \zeta(\widehat{p}_i))/\epsilon$, is used to pick the next point to evaluate, extrapolating the local slope to find a new $\epsilon$ that would be expected to correspond to the desired step size $\Delta\zeta$ (equal to $\phi/8$ by default, so that 16 points would be used to cover the profile in both directions if the log-likelihood surface were exactly quadratic). Some fail-safe testing is done to ensure that the step chosen is always positive, and less than a maximum; if a deviance is ever detected that is lower than that of the ML deviance, which can occur if the initial fit was wrong due to numerical problems, the profiler issues an error and stops.

Parametric bootstrapping

To avoid the asymptotic assumptions of the likelihood ratio test, at the cost of greatly increased computation time, one can estimate confidence intervals by parametric bootstrapping – that is, by simulating data from the fitted model, refitting the model, and extracting the new estimated parameters (or any other quantities of interest). This task is quite straightforward, since there is already a simulate method, and a refit function which re-estimates the (RE)ML parameters for new data, starting from the previous (RE)ML estimates and re-using the previously computed model structures (including the fill-reducing permutation) for efficiency. The bootMer function is thus a fairly thin wrapper around a simulate/refit loop, with a small amount of additional logic for parallel computation and error-catching. (Some of the ideas of bootMer are adapted from merBoot (Sánchez-Espigares and Ocaña 2009), a more ambitious framework for bootstrapping lme4 model fits which unfortunately seems to be unavailable at present.)
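A minimal sketch (the number of simulations is arbitrary and would typically be larger in practice):

R> boo1 <- bootMer(fm1, FUN = fixef, nsim = 250)  # bootstrap the fixed effects
R> confint(fm1, method = "boot", nsim = 250)      # bootstrap confidence intervals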

Conditional variances of random effects

It is useful to clarify that the conditional covariance concept in lme4 is based on a simplification of the linear mixed model. In particular, we simplify the model by assuming that the quantities $\beta$, $\Lambda_\theta$, and $\sigma$ are known (i.e., set at their estimated values). In fact, the only way to define the conditional covariance is at fixed parameter values. Our approximation here is using the estimated parameter values for the unknown "true" parameter values. In this simplified case, $\mathcal{U}$ is the only quantity in the statistical model that is both random and unknown.

Given this simplified linear mixed model, a standard result in Bayesian linear regression modeling (Gelman et al. 2013) implies that the conditional distribution of the spherical random effects given the observed response vector is Gaussian,

\[
(\mathcal{U}|\mathcal{Y}=y_{\mathrm{obs}}) \sim \mathcal{N}\left(\mu_{\mathcal{U}|\mathcal{Y}=y_{\mathrm{obs}}},\, \widehat{\sigma}^2 V\right), \tag{55}
\]

where

\[
V = \left(\Lambda_{\widehat{\theta}}^\top Z^\top W Z \Lambda_{\widehat{\theta}} + I_q\right)^{-1} = \left(L_{\widehat{\theta}}^{-1}\right)^\top L_{\widehat{\theta}}^{-1} \tag{56}
\]

is the unscaled conditional variance and

\[
\mu_{\mathcal{U}|\mathcal{Y}=y_{\mathrm{obs}}} = V \Lambda_{\widehat{\theta}}^\top Z^\top W \left(y_{\mathrm{obs}} - o - X\widehat{\beta}\right) \tag{57}
\]

is the conditional mean/mode. Note that in practice the inverse in Equation 56 does not get computed directly, but rather an efficient method is used that exploits the sparse structures.

The random-effects coefficient vector, $b$, is often of greater interest. The conditional covariance matrix of $\mathcal{B}$ may be expressed as

\[
\widehat{\sigma}^2 \Lambda_{\widehat{\theta}} V \Lambda_{\widehat{\theta}}^\top. \tag{58}
\]
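These conditional variances can be requested when extracting the conditional modes; a minimal sketch (the attribute name reflects the older "posterior variance" terminology):

R> rr <- ranef(fm1, condVar = TRUE)
R> dim(attr(rr$Subject, "postVar"))  # one 2 x 2 covariance block per subject

[1]  2  2 18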

The hat matrix

The hat matrix, $H$, is a useful object in linear model diagnostics (Cook and Weisberg 1982). In general, $H$ relates the observed and fitted response vectors, but we specifically define it as

\[
\mu_{\mathcal{Y}|\mathcal{U}=u} - o = H\left(y_{\mathrm{obs}} - o\right). \tag{59}
\]

To find $H$ we note that

\[
\mu_{\mathcal{Y}|\mathcal{U}=u} - o = \begin{bmatrix} Z\Lambda & X \end{bmatrix} \begin{bmatrix} \mu_{\mathcal{U}|\mathcal{Y}=y_{\mathrm{obs}}} \\ \widehat{\beta}_\theta \end{bmatrix}. \tag{60}
\]

Next we get an expression for $\begin{bmatrix} \mu_{\mathcal{U}|\mathcal{Y}=y_{\mathrm{obs}}} \\ \widehat{\beta}_\theta \end{bmatrix}$ by solving the normal equations (Equation 17),

\[
\begin{bmatrix} \mu_{\mathcal{U}|\mathcal{Y}=y_{\mathrm{obs}}} \\ \widehat{\beta}_\theta \end{bmatrix}
= \begin{bmatrix} L_\theta^\top & R_{ZX} \\ 0 & R_X \end{bmatrix}^{-1}
\begin{bmatrix} L_\theta & 0 \\ R_{ZX}^\top & R_X^\top \end{bmatrix}^{-1}
\begin{bmatrix} \Lambda_\theta^\top Z^\top \\ X^\top \end{bmatrix} W \left(y_{\mathrm{obs}} - o\right). \tag{61}
\]

By the Schur complement identity (Horn and Zhang 2005),

\[
\begin{bmatrix} L_\theta^\top & R_{ZX} \\ 0 & R_X \end{bmatrix}^{-1}
= \begin{bmatrix} \left(L_\theta^\top\right)^{-1} & -\left(L_\theta^\top\right)^{-1} R_{ZX} R_X^{-1} \\ 0 & R_X^{-1} \end{bmatrix}. \tag{62}
\]

Therefore, we may write the hat matrix as

\[
H = C_L^\top C_L + C_R^\top C_R, \tag{63}
\]

where

\[
L_\theta C_L = \Lambda_\theta^\top Z^\top W^{1/2} \tag{64}
\]

and

\[
R_X^\top C_R = X^\top W^{1/2} - R_{ZX}^\top C_L. \tag{65}
\]

The trace of the hat matrix is often used as a measure of the effective degrees of freedom (e.g., Vaida and Blanchard 2005). Using a relationship between the trace and vec operators, the trace of $H$ may be expressed as

\[
\mathrm{tr}(H) = \left\|\mathrm{vec}(C_L)\right\|^2 + \left\|\mathrm{vec}(C_R)\right\|^2. \tag{66}
\]
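For a fit with unit weights and no offset, Equations 64-66 can be sketched from the components returned by getME; this is an illustration rather than the internal implementation, and it relies on the fill-reducing permutation stored in the sparse Cholesky factor canceling in the trace:

R> L <- getME(fm1, "L")
R> CL <- solve(L, solve(L, getME(fm1, "Lambdat") %*% getME(fm1, "Zt"),
+    system = "P"), system = "L")                                   # Equation 64
R> CR <- solve(t(getME(fm1, "RX")),
+    t(getME(fm1, "X")) - as.matrix(t(getME(fm1, "RZX")) %*% CL))   # Equation 65
R> sum(CL^2) + sum(CR^2)                                            # Equation 66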

5.2. Using the output module

The user interface for the output module largely consists of a set of methods (Table 8) for objects of class 'merMod', which is the class of objects returned by lmer. In addition to these methods, the getME function can be used to extract various objects from a fitted mixed model in lme4. Here we illustrate the use of several of these methods.

Updating fitted mixed models

To illustrate the update method for 'merMod' objects we construct a random intercept only model. This task could be done in several ways, but we choose to first remove the random-effects term (Days | Subject) and add a new term with a random intercept,

R> fm3 <- update(fm1, . ~ . - (Days | Subject) + (1 | Subject))
R> formula(fm3)

Reaction ~ Days + (1 | Subject)


Generic        Brief description of return value
anova          Decomposition of fixed-effects contributions or model comparison.
as.function    Function returning profiled deviance or REML criterion.
coef           Sum of the random and fixed effects for each level.
confint        Confidence intervals on linear mixed-model parameters.
deviance       Minus twice maximum log-likelihood. (Use REMLcrit for the REML criterion.)
df.residual    Residual degrees of freedom.
drop1          Drop allowable single terms from the model.
extractAIC     Generalized Akaike information criterion.
fitted         Fitted values given conditional modes (Equation 13).
fixef          Estimates of the fixed-effects coefficients, $\widehat{\beta}$.
formula        Mixed-model formula of fitted model.
logLik         Maximum log-likelihood.
model.frame    Data required to fit the model.
model.matrix   Fixed-effects model matrix, $X$.
ngrps          Number of levels in each grouping factor.
nobs           Number of observations.
plot           Diagnostic plots for mixed-model fits.
predict        Various types of predicted values.
print          Basic printout of mixed-model objects.
profile        Profiled likelihood over various model parameters.
ranef          Conditional modes of the random effects.
refit          A model (re)fitted to a new set of observations of the response variable.
refitML        A model (re)fitted by maximum likelihood.
residuals      Various types of residual values.
sigma          Residual standard deviation.
simulate       Simulated data from a fitted mixed model.
summary        Summary of a mixed model.
terms          Terms representation of a mixed model.
update         An updated model using a revised formula or other arguments.
VarCorr        Estimated random-effects variances, standard deviations, and correlations.
vcov           Covariance matrix of the fixed-effects estimates.
weights        Prior weights used in model fitting.

Table 8: List of currently available methods for objects of the class 'merMod'.


Model summary and associated accessors

The summary method for 'merMod' objects provides information about the model fit. Here we consider the output of summary(fm1) in four steps. The first few lines of output indicate that the model was fitted by REML as well as the value of the REML criterion (Equation 39) at convergence (which may also be extracted using REMLcrit(fm1)). The beginning of the summary also reproduces the model formula and the scaled Pearson residuals,

Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (Days | Subject)
   Data: sleepstudy
REML criterion at convergence: 1743.6

Scaled residuals:
    Min      1Q  Median      3Q     Max
-3.9536 -0.4634  0.0231  0.4634  5.1793

This information may also be obtained using standard accessor functions,

R> formula(fm1)

R> REMLcrit(fm1)

R> quantile(residuals(fm1, "pearson", scaled = TRUE))

The second piece of the summary output provides information regarding the random effects and residual variation,

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Subject  (Intercept) 612.09   24.740
          Days         35.07    5.922   0.07
 Residual             654.94   25.592
Number of obs: 180, groups: Subject, 18

which can also be accessed using,

R> print(vc <- VarCorr(fm1), comp = c("Variance", "Std.Dev."))

R> nobs(fm1)

R> ngrps(fm1)

R> sigma(fm1)

The print method for objects returned by VarCorr hides the internal structure of these 'VarCorr.merMod' objects. The internal structure of an object of this class is a list of matrices, one for each random-effects term. The standard deviations and correlation matrices for each term are stored as attributes, stddev and correlation, respectively, of the variance-covariance matrix, and the residual standard deviation is stored as attribute sc. For programming use, these objects can be summarized differently using their as.data.frame method,

R> as.data.frame(VarCorr(fm1))

       grp        var1 var2       vcov       sdcor
1  Subject (Intercept) <NA> 612.089963 24.74045195
2  Subject        Days <NA>  35.071661  5.92213312
3  Subject (Intercept) Days   9.604306  0.06555113
4 Residual        <NA> <NA> 654.941028 25.59181564

which contains one row for each variance or covariance parameter. The grp column gives the grouping factor associated with this parameter. The var1 and var2 columns give the names of the variables associated with the parameter (var2 is <NA> unless it is a covariance parameter). The vcov column gives the variances and covariances, and the sdcor column gives these numbers on the standard deviation and correlation scales.
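The attributes themselves can also be read directly from the object computed above (a minimal sketch):

R> attr(vc$Subject, "stddev")       # per-term standard deviations
R> attr(vc$Subject, "correlation")  # per-term correlation matrix
R> attr(vc, "sc")                   # residual standard deviation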

The next chunk of output gives the fixed-effects estimates,

Fixed effects:
            Estimate Std. Error t value
(Intercept)  251.405      6.825   36.84
Days          10.467      1.546    6.77

Note that there are no $p$ values. The fixed-effects point estimates may be obtained separately via fixef(fm1). Conditional modes of the random-effects coefficients can be obtained with ranef (see Section 5.1 for information on the theory). Finally, we have the correlations between the fixed-effects estimates,

Correlation of Fixed Effects:
     (Intr)
Days -0.138

The full variance-covariance matrix of the fixed-effects estimates can be obtained in the usual R way with the vcov method (Section 5.1). Alternatively, coef(summary(fm1)) will return the full fixed-effects parameter table as shown in the summary.

Diagnostic plots

lme4 provides tools for generating most of the standard graphical diagnostic plots (with the exception of those incorporating influence measures, e.g., Cook's distance and leverage plots), in a way modeled on the diagnostic graphics of nlme (Pinheiro and Bates 2000). In particular, the familiar plot method in base R for linear models (objects of class 'lm') can be used to create the standard fitted vs. residual plots,

R> plot(fm1, type = c("p", "smooth"))

scale-location plots,

R> plot(fm1, sqrt(abs(resid(.))) ~ fitted(.), type = c("p", "smooth"))

and quantile-quantile plots (from lattice),

R> qqmath(fm1, id = 0.05)

In contrast to the plot method for 'lm' objects, these scale-location and Q-Q plots are based on raw rather than standardized residuals.

In addition to these standard diagnostic plots, which examine the validity of various assumptions (linearity, homoscedasticity, normality) at the level of the residuals, one can also use the dotplot and qqmath methods for the conditional modes (i.e., 'ranef.mer' objects generated by ranef(fit)) to check for interesting patterns and normality in the conditional modes. lme4 does not provide influence diagnostics, but these (and other useful diagnostic procedures) are available in the dependent packages HLMdiag and influence.ME (Loy and Hofmann 2014; Nieuwenhuis, Te Grotenhuis, and Pelzer 2012).
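For example (a minimal sketch):

R> dotplot(ranef(fm1, condVar = TRUE))  # conditional modes with intervals
R> qqmath(ranef(fm1, condVar = TRUE))   # normality check of conditional modes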

Finally, posterior predictive simulation (Gelman and Hill 2006) is a generally useful diagnostic tool, adapted from Bayesian methods, for exploring model fit. Users pick some summary statistic of interest. They then compute the summary statistic for an ensemble of simulations, and see where the observed data falls within the simulated distribution; if the observed data is extreme, we might conclude that the model is a poor representation of reality. For example, using the sleep study fit and choosing the interquartile range of the reaction times as the summary statistic:

R> iqrvec <- sapply(simulate(fm1, 1000), IQR)

R> obsval <- IQR(sleepstudy$Reaction)

R> post.pred.p <- mean(obsval >= c(obsval, iqrvec))

The (one-tailed) posterior predictive $p$ value of 0.78 indicates that the model represents the data adequately, at least for this summary statistic. In contrast to the full Bayesian case, the procedure described here does not allow for the uncertainty in the estimated parameters. However, it should be a reasonable approximation when the residual variation is large.

Sequential decomposition and model comparison

Objects of class 'merMod' have an anova method which returns $F$ statistics corresponding to the sequential decomposition of the contributions of fixed-effects terms. In order to illustrate this sequential ANOVA decomposition, we fit a model with orthogonal linear and quadratic Days terms,

R> fm4 <- lmer(Reaction ~ polyDays[ , 1] + polyDays[ , 2] +
+    (polyDays[ , 1] + polyDays[ , 2] | Subject),
+    within(sleepstudy, polyDays <- poly(Days, 2)))
R> anova(fm4)

Analysis of Variance Table
              Df  Sum Sq Mean Sq F value
polyDays[, 1]  1 23874.5 23874.5 46.0757
polyDays[, 2]  1   340.3   340.3  0.6567

The relative magnitudes of the two sums of squares indicate that the quadratic term explains much less variation than the linear term. Furthermore, the magnitudes of the two $F$ statistics strongly suggest significance of the linear term, but not the quadratic term. Notice that this is only an informal assessment of significance, as there are no $p$ values associated with these $F$ statistics, an issue discussed in more detail in the subsection "Computing p values" below.

To understand how these quantities are computed, let $R_i$ contain the rows of $R_X$ (Equation 18) associated with the $i$th fixed-effects term. Then the sum of squares for term $i$ is

\[
SS_i = \widehat{\beta}^\top R_i^\top R_i \widehat{\beta}. \tag{67}
\]

If $\mathrm{DF}_i$ is the number of columns in $R_i$, then the $F$ statistic for term $i$ is

\[
F_i = \frac{SS_i}{\widehat{\sigma}^2\, \mathrm{DF}_i}. \tag{68}
\]
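As an illustrative sketch (not the internal implementation), the $F$ statistic for the linear polyDays term of fm4 can be reassembled from these pieces:

R> Ri <- getME(fm4, "RX")[2, , drop = FALSE]  # row of RX for the linear term
R> SSi <- sum((Ri %*% fixef(fm4))^2)          # Equation 67
R> SSi / (sigma(fm4)^2 * 1)                   # Equation 68, with DF = 1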

For multiple arguments, the anova method returns model comparison statistics.

R> anova(fm1, fm2, fm3)

refitting model(s) with ML (instead of REML)

Data: sleepstudy
Models:
fm3: Reaction ~ Days + (1 | Subject)
fm2: Reaction ~ Days + ((1 | Subject) + (0 + Days | Subject))
fm1: Reaction ~ Days + (Days | Subject)
    Df    AIC    BIC  logLik deviance   Chisq Chi Df Pr(>Chisq)
fm3  4 1802.1 1814.8 -897.04   1794.1
fm2  5 1762.0 1778.0 -876.00   1752.0 42.0754      1  8.782e-11
fm1  6 1763.9 1783.1 -875.97   1751.9  0.0639      1     0.8004

The output shows $\chi^2$ statistics representing the difference in deviance between successive models, as well as $p$ values based on likelihood ratio test comparisons. In this case, the sequential comparison shows that adding a linear effect of time uncorrelated with the intercept leads to an enormous and significant drop in deviance ($\Delta\mathrm{deviance} \approx 42$, $p \approx 10^{-10}$), while the further addition of correlation between the slope and intercept leads to a trivial and non-significant change in deviance ($\Delta\mathrm{deviance} \approx 0.06$). For objects of class 'lmerMod' the default behavior is to refit the models with ML if fitted with REML = TRUE, which is necessary in order to get sensible answers when comparing models that differ in their fixed effects; this can be controlled via the refit argument.

Computing p values

One of the more controversial design decisions of lme4 has been to omit the output of $p$ values associated with sequential ANOVA decompositions of fixed effects. The absence of analytical results for null distributions of parameter estimates in complex situations (e.g., unbalanced or partially crossed designs) is a long-standing problem in mixed-model inference. While the null distributions (and the sampling distributions of non-null estimates) are asymptotically normal, these distributions are not $t$ distributed for finite size samples – nor are the corresponding null distributions of differences in scaled deviances $F$ distributed. Thus approximate methods for computing the approximate degrees of freedom for $t$ distributions, or the denominator degrees of freedom for $F$ statistics (Satterthwaite 1946; Kenward and Roger 1997), are at best ad hoc solutions.

However, computing finite-size-corrected $p$ values is sometimes necessary. Therefore, although the package does not provide them (except via parametric bootstrapping, Section 5.1), we have provided a help page to guide users in finding appropriate methods:

R> help("pvalues")


This help page provides pointers to other packages that provide machinery for calculating $p$ values associated with 'merMod' objects. It also suggests framing the inference problem as a likelihood ratio test, achieved by supplying multiple 'merMod' objects to the anova method, as well as alternatives to $p$ values such as confidence intervals.

Previous versions of lme4 provided the mcmcsamp function, which generated a Markov chain Monte Carlo sample from the posterior distribution of the parameters (assuming flat priors). Due to difficulty in constructing a version of mcmcsamp that was reliable even in cases where the estimated random-effects variances were near zero, mcmcsamp has been withdrawn.

Computing confidence intervals

As described above (Section 5.1), lme4 provides confidence intervals (using confint) via Wald approximations (for fixed-effects parameters only), likelihood profiling, or parametric bootstrapping (the boot.type argument selects the bootstrap confidence interval type).

As is typical for computationally intensive profile confidence intervals in R, the intervals can be computed either directly from fitted 'merMod' objects (in which case profiling is done as an interim step), or from a previously computed likelihood profile (of class 'thpr', for "theta profile"). Parametric bootstrap confidence intervals use a thin wrapper around the bootMer function that passes the results to boot.ci from the boot package (Canty and Ripley 2015; Davison and Hinkley 1997) for confidence interval calculation.
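A sketch of the three approaches, re-using the stored profile pr1 computed earlier (the nsim value is arbitrary):

R> confint(fm1, method = "Wald")              # Wald (fixed effects only)
R> confint(pr1)                               # from the stored likelihood profile
R> confint(fm1, method = "boot", nsim = 500)  # parametric bootstrap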

In the running sleep study example, the 95% confidence intervals estimated by all three methods are quite similar. The largest change is a 26% difference in confidence interval widths between profile and parametric bootstrap methods for the correlation between the intercept and slope random effects ($\{-0.54, 0.98\}$ vs. $\{-0.48, 0.68\}$).

General profile zeta and related plots

lme4 provides several functions (built on lattice graphics; Sarkar 2008) for plotting the profile zeta functions (Section 5.1) and other related quantities.

• The profile zeta plot (Figure 2; xyplot) is simply a plot of the profile zeta function for each model parameter; linearity of this plot for a given parameter implies that the likelihood profile is quadratic (and thus that Wald approximations would be reasonably accurate).

• The profile density plot (Figure 3; densityplot) displays an approximation of the probability density function of the sampling distribution for each parameter. These densities are derived by setting the cumulative distribution function (c.d.f.) to be $\Phi(\zeta)$, where $\Phi$ is the c.d.f. of the standard normal distribution. This is not quite the same as evaluating the distribution of the estimator of the parameter, which for mixed models can be very difficult, but it gives a reasonable approximation. If the profile zeta plot is linear, then the profile density plot will be Gaussian.

• The profile pairs plot (Figure 4; splom) gives an approximation of the two-dimensional profiles of pairs of parameters, interpolated from the univariate profiles as described in Bates and Watts (1988, Chapter 6). The profile pairs plot shows two-dimensional 50%, 80%, 90%, 95% and 99% marginal confidence regions based on the likelihood ratio, as well as the profile traces, which indicate the conditional estimates of each parameter for fixed values of the other parameters. While panels above the diagonal show profiles with respect to the original parameters (with random-effects parameters on the standard deviation/correlation scale, as for all profile plots), the panels below the diagonal show plots on the $(\zeta_i, \zeta_j)$ scale. The below-diagonal panels allow us to see distortions from an elliptical shape due to nonlinearity of the traces, separately from the one-dimensional distortions caused by a poor choice of scale for the parameters. The $\zeta$ scales provide, in some sense, the best possible set of single-parameter transformations for assessing the contours. On the $\zeta$ scales the extent of a contour on the horizontal axis is exactly the same as the extent on the vertical axis and both are centered about zero.

Figure 2: Profile zeta plot: xyplot(prof.obj). [One panel per parameter ($\sigma_1$, $\sigma_2$, $\sigma_3$, $\sigma$, (Intercept), Days), each showing $\zeta$ against the parameter value.]

For users who want to build their own graphical displays of the profile, there is a method for as.data.frame that converts profile ('thpr') objects to a more convenient format.

Computing fitted and predicted values; simulating

Because mixed models involve random coefficients, one must always clarify whether predictions should be based on the marginal distribution of the response variable or on the distribution that is conditional on the modes of the random effects (Equation 12). The fitted method retrieves fitted values that are conditional on all of the modes of the random effects; the predict method returns the same values by default, but also allows for predictions to be made conditional on different sets of random effects. For example, if the re.form argument is set to NULL (the default), then the predictions are conditional on all random effects in the model; if re.form is ~ 0 or NA, then the predictions are made at the population level (all random-effects values set to zero). In models with multiple random effects, the user can give re.form as a formula that specifies which random effects are conditioned on.
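For example (a minimal sketch):

R> p1 <- predict(fm1)               # conditional on the random-effects modes
R> p2 <- predict(fm1, re.form = NA) # population level (random effects set to zero)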

Figure 3: Profile density plot: densityplot(prof.obj). [One panel per parameter ($\sigma_1$, $\sigma_2$, $\sigma_3$, $\sigma$, (Intercept), Days), each showing the approximate probability density of the sampling distribution.]

lme4 also provides a simulate method, which allows similar flexibility in conditioning on random effects; in addition it allows the user to choose (via the use.u argument) between conditioning on the fitted conditional modes or choosing a new set of simulated conditional modes (zero-mean Normal deviates chosen according to the estimated random-effects variance-covariance matrices). Finally, simulate also has a method for 'formula' objects, which allows for de novo simulation in the absence of a fitted model. In this case, the user must specify the random effects ($\theta$), fixed effects ($\beta$), and residual standard deviation ($\sigma$) parameters via the newparams argument. The standard simulation method (based on a 'merMod' object) is the basis of parametric bootstrapping (Section 5.1) and posterior predictive simulation (Section 5.2); de novo simulation based on a formula provides a flexible framework for power analysis.
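A sketch of de novo simulation, re-using the estimated parameters from fm1 (any parameter values of matching dimension could be substituted, e.g., for a power analysis):

R> simdat <- simulate(~ Days + (Days | Subject), nsim = 1,
+    newdata = sleepstudy, family = gaussian,
+    newparams = list(theta = getME(fm1, "theta"), beta = fixef(fm1),
+      sigma = sigma(fm1)))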

6. Conclusion

Mixed modeling is an extremely useful but computationally intensive technique. Computational limitations are especially important because mixed models are commonly applied to moderately large data sets ($10^4$-$10^6$ observations). By developing an efficient, general, and (now) well-documented platform for fitting mixed models in R, we hope to provide both a practical tool for end users interested in analyzing data and a reusable, modular framework for downstream developers interested in extending the class of models that can be easily and efficiently analyzed in R.

We have learned much about linear mixed models in the process of developing lme4, both from our own attempts to develop robust and reliable procedures and from the broad community of lme4 users; we have attempted to describe many of these lessons here. In moving forward, our main priorities are (1) to maintain the reference implementation of lme4 on the Comprehensive R Archive Network (CRAN), developing relatively few new features; and (2) to improve the flexibility, efficiency and scalability of mixed-model analysis across multiple compatible implementations, including both the MixedModels package for Julia and the experimental flexLambda branch of lme4.

Figure 4: Profile pairs plot: splom(prof.obj). [Scatter-plot matrix of pairwise profiles for .sig01, .sig02, .sig03, .sigma, (Intercept), and Days, with profiles on the original scale above the diagonal and on the $(\zeta_i, \zeta_j)$ scale below.]

Acknowledgments

Rune Haubo Christensen, Henrik Singmann, Fabian Scheipl, Vincent Dorie, and Bin Dai contributed ideas and code to the current version of lme4; the large community of lme4 users has exercised the software, discovered bugs, and made many useful suggestions. Søren Højsgaard participated in useful discussions and Xi Xia and Christian Zingg exercised the code and reported problems. We would like to thank the Banff International Research Station for hosting a working group on lme4, and an NSERC Discovery grant and NSERC postdoctoral fellowship for funding.


References

Bates D (2015). MixedModels: A Julia Package for Fitting (Statistical) Mixed-Effects Models. Julia package version 0.3-22, URL https://github.com/dmbates/MixedModels.jl.

Bates D, Maechler M (2015). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 1.2-2, URL http://CRAN.R-project.org/package=Matrix.

Bates D, Maechler M, Bolker B, Walker S (2015). lme4: Linear Mixed-Effects Models Using Eigen and S4. R package version 1.1-10, URL http://CRAN.R-project.org/package=lme4.

Bates D, Mullen KM, Nash JC, Varadhan R (2014). minqa: Derivative-Free Optimization Algorithms by Quadratic Approximation. R package version 1.2.4, URL http://CRAN.R-project.org/package=minqa.

Bates D, Walker S (2013). lme4pureR: lme4 in Pure R. R package version 0.1-0, URL https://github.com/lme4/lme4pureR.

Bates DM, DebRoy S (2004). "Linear Mixed Models and Penalized Least Squares." Journal of Multivariate Analysis, 91(1), 1-17. doi:10.1016/j.jmva.2004.04.013.

Bates DM, Watts DG (1988). Nonlinear Regression Analysis and Its Applications. John Wiley & Sons, Hoboken, NJ. doi:10.1002/9780470316757.

Belenky G, Wesensten NJ, Thorne DR, Thomas ML, Sing HC, Redmond DP, Russo MB, Balkin TJ (2003). "Patterns of Performance Degradation and Restoration During Sleep Restriction and Subsequent Recovery: A Sleep Dose-Response Study." Journal of Sleep Research, 12(1), 1-12. doi:10.1046/j.1365-2869.2003.00337.x.

Bezanson J, Karpinski S, Shah VB, Edelman A (2012). "Julia: A Fast Dynamic Language for Technical Computing." arXiv:1209.5145 [cs.PL], URL http://arxiv.org/abs/1209.5145.

Bolker BM, Gardner B, Maunder M, Berg CW, Brooks M, Comita L, Crone E, Cubaynes S, Davies T, de Valpine P, Ford J, Gimenez O, Kéry M, Kim EJ, Lennert-Cody C, Magnusson A, Martell S, Nash J, Nielsen A, Regetz J, Skaug H, Zipkin E (2013). "Strategies for Fitting Nonlinear Ecological Models in R, AD Model Builder, and BUGS." Methods in Ecology and Evolution, 4(6), 501-512. doi:10.1111/2041-210x.12044.

Canty A, Ripley B (2015). boot: Bootstrap R (S-PLUS) Functions. R package version 1.3-17, URL http://CRAN.R-project.org/package=boot.

Chambers JM (1993). "Linear Models." In JM Chambers, TJ Hastie (eds.), Statistical Models in S, chapter 4, pp. 95-144. Chapman & Hall.

Chen Y, Davis TA, Hager WW, Rajamanickam S (2008). "Algorithm 887: CHOLMOD, Supernodal Sparse Cholesky Factorization and Update/Downdate." ACM Transactions on Mathematical Software, 35(3), 22:1-22:14. doi:10.1145/1391989.1391995.