A Simulation-based Goodness-of-fit Test for Random Effects in Generalized Linear Mixed Models
ABSTRACT The goodness-of-fit of the distribution of random effects in a generalized linear mixed model is assessed using a conditional simulation of the random effects conditional on the observations. Provided that the specified joint model for random effects and observations is correct, the marginal distribution of the simulated random effects coincides with the assumed random effects distribution. In practice, the specified model depends on some unknown parameter which is replaced by an estimate. We obtain a correction for this by deriving the asymptotic distribution of the empirical distribution function obtained from the conditional sample of the random effects. The approach is illustrated by simulation studies and data examples. Copyright 2006 Board of the Foundation of the Scandinavian Journal of Statistics.
Institute of Mathematical Sciences, Aalborg University
Keywords: conditional simulation, empirical distribution function, general-
ized linear mixed model, goodness-of-fit, random effects.
Running header: A simulation-based goodness-of-fit test.
1 Introduction

This paper is concerned with the assessment of the distributional assumptions for
the random effects in a generalized linear mixed model (GLMM). Since the
random effects are not observed, one approach would be to consider random
effects predictions like conditional expectations or modes. However, except
for linear mixed models, the distributional properties of such predictions are
unknown. It is thus difficult to judge whether a sample of random effects
predictions is consistent with the assumed random effects distribution.
In this paper we pursue instead the following simple idea: consider a pair of random vectors (A, Y) which is assumed to follow some fully specified distribution, where A represents the unobserved random effects and Y the observations. If the specified joint model for (A, Y) is correct, then the observed data y is a realization from the marginal distribution of Y. Suppose next that we generate a simulation A∗ from the conditional distribution of A given Y = y. Then, marginally, A∗ and A are identically distributed (and correlated through Y). We can thus base goodness-of-fit testing on the sample A∗, proceeding just as if A itself had been observed.
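This idea can be sketched in a toy Gaussian special case, where the conditional distribution of A given Y = y is available in closed form. The model, parameter values, and the use of a Kolmogorov-Smirnov test below are illustrative assumptions, not the paper's method:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy linear mixed model: A_i ~ N(0,1), Y_ij | A_i ~ N(sigma*A_i, tau^2).
# The Gaussian case is chosen so that A_i | Y_i is normal in closed form;
# sigma and tau are assumed known here.
sigma, tau, n_groups, n_per = 1.5, 1.0, 200, 5
A = rng.standard_normal(n_groups)
Y = sigma * A[:, None] + tau * rng.standard_normal((n_groups, n_per))

# Conditional distribution of A_i given Y_i = y_i:
prec = 1.0 + n_per * sigma**2 / tau**2          # conditional precision
mean = (sigma / tau**2) * Y.sum(axis=1) / prec  # conditional mean
A_star = mean + rng.standard_normal(n_groups) / np.sqrt(prec)

# Marginally A* ~ N(0,1) when the model is correct, so we may test
# the sample A* just as if A itself had been observed.
ks = stats.kstest(A_star, "norm")
```

Each A∗_i is drawn from a different conditional distribution, yet the draws together form an iid N(0, 1) sample marginally; this is the key property the test exploits.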
The situation becomes more complicated when the joint distribution of
(A,Y ) depends on some unknown parameter θ. If a proper prior is specified
for θ in a Bayesian framework, one may consider a simulation (A∗,θ∗) from
the posterior of (A,θ) given Y . Again, the distribution of (A∗,θ∗) coincides
with the specified distribution of (A,θ) provided the assumed joint model for
(θ,A,Y ) is correct. In practice, however, one often uses very vague or even
improper priors which are not regarded as bona fide components of a joint model for (θ, A, Y).
In this paper we replace θ by a point estimate θ̂. Let P|y,θ denote the conditional distribution of A given Y = y, indexed by θ. We then base goodness-of-fit tests on a simulation Ã from P|y,θ̂. In this case Ã only approximately follows the specified distribution for A, due to the effect of replacing the unknown θ with θ̂. Inspired by Ritz (2004), we derive the asymptotic distribution of the empirical distribution function obtained from the conditional sample Ã of the random effects. This provides an asymptotic correction for the effect of replacing θ with an estimate. In Section 5 we comment briefly on the alternative of using a parametric bootstrap.
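The plug-in step can be sketched in a toy Gaussian random-intercept model; the model and the simple moment estimators below are illustrative assumptions, not the paper's estimation method:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model: Y_ij = sigma0 * A_i + tau0 * eps_ij with A_i, eps_ij ~ N(0,1).
# sigma0 and tau0 are now treated as unknown and replaced by crude moment
# estimates before the conditional simulation (no correction applied here).
sigma0, tau0, n_groups, n_per = 1.5, 1.0, 500, 5
A = rng.standard_normal(n_groups)
Y = sigma0 * A[:, None] + tau0 * rng.standard_normal((n_groups, n_per))

tau2_hat = Y.var(axis=1, ddof=1).mean()  # within-group variance -> tau^2
sigma2_hat = max(Y.mean(axis=1).var(ddof=1) - tau2_hat / n_per, 1e-8)
sigma_hat = np.sqrt(sigma2_hat)          # between-group variance -> sigma^2

# Conditional simulation with the estimates plugged in place of theta:
prec = 1.0 + n_per * sigma2_hat / tau2_hat
mean = (sigma_hat / tau2_hat) * Y.sum(axis=1) / prec
A_tilde = mean + rng.standard_normal(n_groups) / np.sqrt(prec)
```

Because the estimates are fitted to the same data used in the conditional simulation, the empirical distribution of Ã tends to lie closer to N(0, 1) than that of a genuine N(0, 1) sample, which is why a naive test ignoring the estimation step would be miscalibrated and a correction is needed.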
Our goodness-of-fit test is targeted at the random effects, but a rejection could also be due to a misspecified conditional distribution of Y given A. So
the test should be accompanied by an assessment of the conditional distribu-
tion of the observations given the random effects. In Example 1 of Section 4.1
we briefly comment on the possibility of using simulated residuals.
When the objective is to estimate fixed effects in a GLMM one may argue
that the assumptions concerning the shape of the random effects distribu-
tion are not critical. However, in many applications, e.g. in animal breeding,
the random effects themselves and their distributional characteristics are the
focal objects of the statistical analysis, see e.g. Sorensen & Waagepetersen
(2003). A thorough assessment of the goodness-of-fit of the random effects
distribution then seems mandatory. Moreover, the approach in this paper
is not confined to providing a p-value for a goodness-of-fit test. In the examples in Section 4, exploratory plots of the simulated random effects disclose, for example, patterns of heterogeneity or correlation among the individuals with which the random effects are associated. Such patterns should also be taken into account in an analysis of fixed effects.
In Section 2 we obtain the asymptotic distribution of the empirical distribution function for simulated random effects within the framework of GLMMs with iid normal random intercepts. Section 3 is concerned with the practical implementation of a goodness-of-fit test based on the asymptotic result. Simulation studies and applications are considered in Section 4.
Section 5 contains a concluding discussion.
2 Convergence of the empirical distribution function for a conditional sample of random effects
The asymptotic result in this section is derived within the set-up of generalized linear models (GLMs) with iid normal random intercepts. Maximum likelihood inference for such models is implemented in e.g. the SAS procedure nlmixed (Wolfinger, 1999) or the Stata program gllamm (Rabe-Hesketh, Skrondal & Pickles, 2004).
2.1 Set-up and notation
Denote by Y = (Y_i)_{i≥1} a sequence of observation vectors Y_i = (Y_i1, ..., Y_iN_i) with associated covariates X_i = (X_i1, ..., X_iN_i), X_ij ∈ R^p, p ≥ 1, and random effects A_i. The (N_i, X_i, A_i, Y_i), i ≥ 1, are assumed to be independent, where N_i is integer valued, (N_i, X_i) follows some unspecified distribution, and, given (N_i, X_i), A_i is N(0, 1). The linear predictor for the observation Y_ij is η_ij = X_ij β^T + σ A_i, where β ∈ R^p and σ > 0. Conditional on A_i = a_i, N_i = n_i, X_i = x_i, the observations Y_i1, ..., Y_in_i are conditionally independent with densities of GLM type (see e.g. McCullagh & Nelder, 1989), i.e. the conditional density of Y_ij is of the form

f(y_ij | ψ_ij, φ) = exp((y_ij ψ_ij − b(ψ_ij))/φ + c(y_ij, φ))

where φ > 0, b and c are certain functions, and ψ_ij = h(η_ij) for some one-to-one function h.
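As a concrete instance of this density, a Poisson response with log link has φ = 1, b(ψ) = exp(ψ), c(y, φ) = −log y!, and h equal to the identity, so that ψ_ij = η_ij. A minimal sketch of the resulting conditional log-likelihood for one cluster (the function name and arguments are illustrative):

```python
import numpy as np
from scipy.special import gammaln

def poisson_cond_loglik(y_i, x_i, a_i, beta, sigma):
    """Log of prod_j f(y_ij | psi_ij, phi) for one cluster i, in the
    Poisson/log-link case: phi = 1, b(psi) = exp(psi), c(y, 1) = -log(y!),
    and psi_ij = eta_ij = x_ij beta^T + sigma * a_i (identity h)."""
    psi = x_i @ beta + sigma * a_i
    return np.sum(y_i * psi - np.exp(psi) - gammaln(y_i + 1.0))
```

This recovers the usual Poisson log-pmf sum, y_ij ψ_ij − exp(ψ_ij) − log y_ij!, written in the exponential-family form above.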
The joint distribution of (N_i, X_i, A_i, Y_i), i ≥ 1, is parametrized by θ = (β, σ, φ), which belongs to an open set Θ ⊂ R^{p+2}. We assume that Y is generated under the joint distribution P_θ0 corresponding to a specific parameter value θ_0 ∈ Θ. Henceforth, probabilities, expectations, and variances are computed with respect to P_θ0 unless otherwise stated.
With a slight abuse of notation, let F_i|y_i,θ denote the distribution function of A_i given (N_i, X_i, Y_i) = (n_i, x_i, y_i), and let θ̂_n denote an estimate of θ based on Y_1, ..., Y_n. For each n, (Ã_1n, ..., Ã_nn) denotes a sample where Ã_in is generated from F_i|Y_i,θ̂_n and the Ã_in, i = 1, ..., n, are independent given Y, N = (N_i)_i, and X = (X_i)_i. The empirical distribution function based on Ã_1n, ..., Ã_nn is denoted F̃_n. For a finite index set I ⊆ R, the asymptotic distribution of (F̃_n(t))_{t∈I} is given in the following Section 2.2.
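The empirical distribution function F̃_n evaluated on a finite grid I is elementary to compute; a minimal sketch (the grid values and the sample standing in for Ã_1n, ..., Ã_nn are arbitrary):

```python
import numpy as np

def edf_on_grid(a_tilde, grid):
    """F_tilde_n(t) = (1/n) * #{i : a_tilde_i <= t} for each t in grid."""
    a = np.asarray(a_tilde)
    return np.array([(a <= t).mean() for t in grid])

# Example: EDF of a stand-in conditional sample on a small grid I.
grid = np.array([-1.5, 0.0, 1.5])
rng = np.random.default_rng(2)
f_n = edf_on_grid(rng.standard_normal(1000), grid)
```

Under the null, each F̃_n(t) should be close to Φ(t); the next subsection quantifies the joint fluctuations of these grid values.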
Assume that h and c are continuously differentiable and (for the sake of the second result (10) in the Appendix) assume that h′ is bounded, and that |h(A_1)|, |b(h(A_1))| and |A_1 b′(h(A_1))| have finite expectation. All these assumptions are valid for the common examples of GLMMs considered in Section 4. Assuming in addition that θ̂_n is asymptotically normal and efficient, we obtain
Theorem 1. Under the above set-up and assumptions, (G̃_n(t))_{t∈I} = (√n(F̃_n(t) − Φ(t)))_{t∈I} converges in distribution to a zero-mean Gaussian vector (G(t))_{t∈I} with covariances given by

EG(s)G(t) = Φ(s ∧ t) − Φ(s)Φ(t) − h(s, θ_0) V(θ_0) h(t, θ_0)^T,  s, t ∈ I  (1)

where h(t, θ_0) = E(dF_1|Y_1,θ(t)/dθ |_{θ=θ_0}), Φ(·) is the standard normal distribution function, and V(θ_0) is the asymptotic covariance matrix of θ̂_n.
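A Monte Carlo p-value can in principle be read off the limit in Theorem 1 by simulating the Gaussian vector with the corrected covariance (1). In the sketch below, the grid, the matrix h_mat of derivatives h(t, θ_0), the matrix V, and the sup-type statistic are illustrative placeholders, not quantities prescribed by the paper:

```python
import numpy as np
from scipy.stats import norm

def limit_pvalue(g_obs, grid, h_mat, V, n_sim=100_000, seed=0):
    """P(max_t |G(t)| >= max_t |g_obs(t)|) under the limit of Theorem 1,
    with covariance Phi(s ^ t) - Phi(s)Phi(t) - h(s) V h(t)^T on the grid."""
    s, t = np.meshgrid(grid, grid, indexing="ij")
    cov = norm.cdf(np.minimum(s, t)) - norm.cdf(s) * norm.cdf(t)
    cov -= h_mat @ V @ h_mat.T          # plug-in correction term from (1)
    rng = np.random.default_rng(seed)
    G = rng.multivariate_normal(np.zeros(len(grid)), cov, size=n_sim)
    stat = np.abs(G).max(axis=1)        # sup-type statistic over the grid
    return (stat >= np.abs(g_obs).max()).mean()
```

With h_mat set to zero the covariance reduces to the classical Brownian-bridge form Φ(s ∧ t) − Φ(s)Φ(t), recovering the case where θ is known.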
Proof. Without loss of generality we can assume Ã_in = F_i^{−1}|Y_i,θ̂_n(U_i) where U_1, U_2, ... are iid uniform on [0, 1] and independent of N, X, and Y. Similarly, we let A∗_i = F_i^{−1}|Y_i,θ_0(U_i) and let F∗_n denote the empirical distribution function based on A∗_1, ..., A∗_n. We now split G̃_n(t) as follows:

G̃_n(t) = G∗_n(t) + Z_n(t)

where

G∗_n(t) = √n(F∗_n(t) − Φ(t)),  Z_n(t) = √n(F̃_n(t) − F∗_n(t)).

Note that, marginally, the A∗_i are independent standard normal variables. Hence, weak convergence of (G∗_n(t))_{t∈R} to a zero-mean Gaussian process with covariance function Φ(s ∧ t) − Φ(s)Φ(t) is a classical result, see e.g. Van der Vaart.

Consider now the term Z_n(t) and let d_in(t) = 1[U_i ≤ F_i|Y_i,θ̂_n(t)] − 1[U_i ≤ F_i|Y_i,θ_0(t)], m_in(t) = E(d_in(t) | N, X, Y) = F_i|Y_i,θ̂_n(t) − F_i|Y_i,θ_0(t), and v_in(t) = Var(d_in(t) − m_in(t)) = E(|m_in(t)|(1 − |m_in(t)|)). Then

Z_n(t) = (1/√n) Σ_{i=1}^n m_in(t) + (1/√n) Σ_{i=1}^n (d_in(t) − m_in(t)).  (2)

Note that Cov(d_in(t) − m_in(t), d_jn(t) − m_jn(t)) = 0 when i ≠ j. Hence, using