Fitting Generalized Estimating Equation (GEE) Regression Models in Stata
ABSTRACT Researchers are often interested in analyzing data that arise from a longitudinal or clustered design. While there are a variety of standard likelihood-based approaches to analysis when the outcome variables are approximately multivariate normal, models for discrete-type outcomes generally require a different approach. Liang and Zeger formalized an approach to this problem using Generalized Estimating Equations (GEEs) to extend Generalized Linear Models (GLMs) to a regression setting with correlated observations within subjects. In this talk, I will briefly review the GEE methodology, introduce some examples, and provide a tutorial on how to fit models using "xtgee" in Stata.
3/16/2001 Nicholas Horton, BU SPH1
Fitting generalized estimating
equation (GEE) regression models
in Stata
Nicholas Horton
horton@bu.edu
Dept of Epidemiology and Biostatistics
Boston University
School of Public Health
Outline
• Regression models for clustered or longitudinal data
• Brief review of GEEs
– mean model
– working correlation matrix
• Stata GEE implementation
• Example: Mental health service utilization
• Summary and conclusions
Regression models for clustered or
longitudinal data
• Longitudinal, repeated measures, or clustered data
commonly encountered
• Correlations between observations on a given subject may
exist, and need to be accounted for
• If outcomes are multivariate normal, then established
methods of analysis are available (Laird and Ware,
Biometrics, 1982)
• If outcomes are binary or counts, likelihood-based
inference is less tractable
Generalized estimating equations
• Described by Liang and Zeger (Biometrika, 1986) and
Zeger and Liang (Biometrics, 1986) to extend the
generalized linear model to allow for correlated
observations
• Characterize the marginal expectation (average response
for observations sharing the same covariates) as a function
of covariates
• Method accounts for the correlation between observations
in generalized linear regression models by use of empirical
(sandwich/robust) variance estimator
• Posits model for the working correlation matrix
The marginal mean model
• We assume the marginal regression model:
$$ g(E[Y_{ij} \mid x_{ij}]) = x_{ij}' \beta $$
• Where $x_{ij}$ is a p times 1 vector of covariates, $\beta$ consists of
the p regression parameters of interest, g(.) is the link
function, and $Y_{ij}$ denotes the jth outcome (for j=1,…,J) for
the ith subject (for i=1,…,N)
• Common choices for the link function include:
g(a)=a (identity link)
g(a)=log(a) [for count data]
g(a)=log(a/(1-a)) [logit link for binary data]
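As a small illustration of the link functions on this slide, the sketch below evaluates a marginal mean under the logit link. Python is used only for illustration (the talk itself uses Stata), and the covariate vector and coefficients are hypothetical values, not estimates from the example.

```python
import math

def identity(a):
    """Identity link: g(a) = a."""
    return a

def log_link(a):
    """Log link, commonly used for count data: g(a) = log(a)."""
    return math.log(a)

def logit(a):
    """Logit link for binary data: g(a) = log(a / (1 - a))."""
    return math.log(a / (1.0 - a))

def inv_logit(eta):
    """Inverse of the logit link, mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def marginal_mean(x, beta, inverse_link):
    """Return E[Y_ij | x_ij] = g^{-1}(x_ij' beta)."""
    eta = sum(xk * bk for xk, bk in zip(x, beta))
    return inverse_link(eta)

# Hypothetical covariates (intercept plus two indicators) and coefficients.
x_ij = [1.0, 1.0, 0.0]
beta = [-2.0, 0.5, 1.0]
mu_ij = marginal_mean(x_ij, beta, inv_logit)
print(round(mu_ij, 4))
```

Applying logit to the resulting mean recovers the linear predictor, which is the sense in which g links the mean to the covariates.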
Model for the correlation
• Assuming no missing data, the J x J covariance matrix for
$Y_i$ is modeled as:
$$ V_i = \phi \, A_i^{1/2} R(\alpha) A_i^{1/2} $$
• Where $\phi$ is a glm dispersion parameter, $A_i$ is a diagonal
matrix of variance functions, and $R(\alpha)$ is the working
correlation matrix of $Y_i$
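The covariance model on this slide can be sketched numerically. The following Python fragment assumes binary outcomes, so the variance function is mu(1 - mu); the means, the working correlation and the function name are hypothetical choices made for illustration.

```python
import math

def working_covariance(mu, R, phi=1.0):
    """Return V_i = phi * A^(1/2) R A^(1/2), with A = diag(mu_j * (1 - mu_j)).

    The binomial variance function is an assumption of this sketch;
    other families would use a different variance function.
    """
    sd = [math.sqrt(m * (1.0 - m)) for m in mu]
    J = len(mu)
    return [[phi * sd[j] * R[j][k] * sd[k] for k in range(J)] for j in range(J)]

# Hypothetical marginal means for one subject with J = 3 outcomes.
mu_i = [0.5, 0.2, 0.1]
# Hypothetical exchangeable working correlation with alpha = 0.3.
R = [[1.0, 0.3, 0.3],
     [0.3, 1.0, 0.3],
     [0.3, 0.3, 1.0]]

V_i = working_covariance(mu_i, R)
for row in V_i:
    print([round(v, 4) for v in row])
```

The diagonal of V_i holds the variances implied by the mean model, and the off-diagonal terms scale the working correlations by the corresponding standard deviations.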
Model for the correlation (cont.)
• If the mean model is correct, the correlation structure may be
misspecified, but parameter estimates remain consistent
• Liang and Zeger showed that modeling correlation may
boost efficiency
• But this is a large sample result; there must be enough
clusters to estimate these parameters
• A variety of working correlation models are supported in Stata
Model for the correlation (cont.)
• Independence
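$$ R(\alpha) = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} $$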
• Number of parameters: 0
Model for the correlation (cont.)
• Exchangeable (compound symmetry)
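$$ R(\alpha) = \begin{pmatrix} 1 & \alpha & \cdots & \alpha \\ \alpha & 1 & \cdots & \alpha \\ \vdots & \vdots & \ddots & \vdots \\ \alpha & \alpha & \cdots & 1 \end{pmatrix} $$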
• Number of parameters: 1
Model for the correlation (cont.)
• Unstructured
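$$ R(\alpha) = \begin{pmatrix} 1 & \alpha_{12} & \cdots & \alpha_{1J} \\ \alpha_{12} & 1 & \cdots & \alpha_{2J} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_{1J} & \alpha_{2J} & \cdots & 1 \end{pmatrix} $$
• Number of parameters: J(J-1)/2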
Model for the correlation (cont.)
• Autoregressive
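$$ R(\alpha) = \begin{pmatrix} 1 & \alpha & \cdots & \alpha^{J-1} \\ \alpha & 1 & \cdots & \alpha^{J-2} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha^{J-1} & \alpha^{J-2} & \cdots & 1 \end{pmatrix} $$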
• Number of parameters: 1
Model for the correlation (cont.)
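• Stationary (g-dependent)
$$ R(\alpha) = \begin{pmatrix} 1 & \alpha_1 & \cdots & \alpha_{J-1} \\ \alpha_1 & 1 & \cdots & \alpha_{J-2} \\ \vdots & \vdots & \ddots & \vdots \\ \alpha_{J-1} & \alpha_{J-2} & \cdots & 1 \end{pmatrix} $$
• Number of parameters: g, with 0 < g <= J-1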
Model for the correlation (cont.)
• Fixed
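$$ R(\alpha) = \begin{pmatrix} 1 & c_{12} & \cdots & c_{1J} \\ c_{12} & 1 & \cdots & c_{2J} \\ \vdots & \vdots & \ddots & \vdots \\ c_{1J} & c_{2J} & \cdots & 1 \end{pmatrix} $$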
• Number of parameters: 0 (user specified)
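The working correlation structures described on these slides can be written out for a small J to see how the parameter counts arise. This is a minimal Python sketch with hypothetical function names and alpha values. For the stationary case it assumes that correlations at lags beyond g are set to zero.

```python
def independence(J):
    """Identity matrix: no correlation parameters."""
    return [[1.0 if j == k else 0.0 for k in range(J)] for j in range(J)]

def exchangeable(J, alpha):
    """One common correlation alpha between every pair of outcomes."""
    return [[1.0 if j == k else alpha for k in range(J)] for j in range(J)]

def autoregressive(J, alpha):
    """AR(1): correlation alpha ** |j - k|, so it decays with separation."""
    return [[alpha ** abs(j - k) for k in range(J)] for j in range(J)]

def stationary(J, alphas):
    """g-dependent: alphas[m - 1] is the lag-m correlation, g = len(alphas).

    Assumption of this sketch: lags greater than g have correlation zero.
    """
    g = len(alphas)

    def entry(lag):
        if lag == 0:
            return 1.0
        return alphas[lag - 1] if lag <= g else 0.0

    return [[entry(abs(j - k)) for k in range(J)] for j in range(J)]

def n_unstructured_params(J):
    """Number of distinct off-diagonal correlations: J(J-1)/2."""
    return J * (J - 1) // 2

# Hypothetical values for illustration.
for name, R in [("independence", independence(3)),
                ("exchangeable", exchangeable(3, 0.3)),
                ("autoregressive", autoregressive(3, 0.5)),
                ("stationary (g=1)", stationary(3, [0.4]))]:
    print(name, R)
print("unstructured parameters for J=3:", n_unstructured_params(3))
```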
Model for the correlation (cont.)
• If J is small and data are balanced and complete, then an
unstructured matrix is recommended
• If observations are mistimed, then use a structure that
accounts for correlation as a function of time (stationary or
autoregressive)
• If observations are clustered (i.e. no logical ordering) then
exchangeable may be appropriate
• If the number of clusters is small, independence may be best
• Issues discussed further in Diggle, Liang and Zeger (1994,
book)
Missing data
• Standard GEE models assume that missing observations
are Missing Completely at Random (MCAR) in the sense
of Little and Rubin (book, 1987)
• Robins, Rotnitzky and Zhao (JASA, 1995) proposed
methods to allow for data that are missing at random (MAR)
• These methods are not yet implemented in standard software
(they require estimation of weights and a more complicated
variance formula)
Variance estimators
• Empirical (aka sandwich or robust/semi-robust)
– consistent when the mean model is correctly specified
(if no missing data)
• Model-based (aka naïve) [default in Stata]
– consistent when both the mean model and the
covariance model are correctly specified
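For reference, the two variance estimators can be written in the usual GEE notation. The slides do not define $D_i$, $B$ and $M$; they are introduced here as assumptions, with $D_i = \partial \mu_i / \partial \beta'$ and $\mu_i = E[Y_i]$. $V_i$ is the working covariance defined earlier.

```latex
% Model-based (naive) estimator
\widehat{\mathrm{Var}}_{\mathrm{model}}(\hat\beta) = B^{-1},
\qquad
B = \sum_{i=1}^{N} D_i' V_i^{-1} D_i

% Empirical (sandwich) estimator
\widehat{\mathrm{Var}}_{\mathrm{empirical}}(\hat\beta) = B^{-1} M B^{-1},
\qquad
M = \sum_{i=1}^{N} D_i' V_i^{-1} (Y_i - \hat\mu_i)(Y_i - \hat\mu_i)' V_i^{-1} D_i
```

When the working covariance is correct, $M$ estimates the same quantity as $B$ and the two estimators agree in large samples. Otherwise only the sandwich form remains consistent.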
Syntax for xtgee
xtgee depvar varlist, family(family) link(link) corr(corr)
i(idvar) t(timevar) robust
Family: binomial, gaussian, gamma, igaussian, nbinomial,
poisson
Link: identity, cloglog, log, logit, nbinomial, opower, power,
probit, reciprocal
Correlation: independent, exchangeable, ar#, stationary#,
nonstationary#, unstructured, fixed
Also options to change the scale parameter, use weighted
equations, specify offsets
Example: Mental Health Service
Utilization
• Connecticut child studies (Zahner et al, AJPH, 1997)
• Outcome: use of general health, school, or mental health
services (dichotomous report)
• Sample: 2,519 children
• Other dichotomous predictors: age, gender, academic
problems
Data format and variables
Long format: one row per child (ID) per service setting (SETTING = 0, 1, 2)
Variables: OBS, ID, OLD, BOY, ACADPRO, SETTING, GENERAL, SCHOOL, MENTAL, SERV
• OLD, BOY, ACADPRO: child-level 0/1 covariates (constant across a child's three rows)
• GENERAL, SCHOOL, MENTAL: 0/1 indicators of the setting (exactly one equals 1 in each row)
• SERV: 0/1 outcome

OBS  ID        SETTING
1    90111502  0
2    90111502  1
3    90111502  2
4    80111206  0
5    80111206  1
6    80111206  2
7    40111608  0
8    40111608  1
9    40111608  2
Stata code to fit model
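* declare the panel (child) and time (service setting) variables
iis id
tis setting
* describe the pattern of observations per child
xtdes
* logistic GEE, unstructured working correlation, empirical (robust) standard errors
xi: xtgee serv i.old*mental i.old*school ///
    i.boy*mental i.boy*school ///
    i.acadpro*mental i.acadpro*school, ///
    link(logit) corr(unst) family(binomial) robust
* display the estimated working correlation matrix
xtcorr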
Describe cross-sectional data (xtdes)
      id:  1, 2, ..., 2519                              n =       2519
 setting:  0, 1, ..., 2                                 T =          3
           Delta(type) = 1; (2-0)+1 = 3
           (id*setting uniquely identifies each observation)

Distribution of T_i:   min      5%     25%     50%     75%     95%     max
                         3       3       3       3       3       3       3

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+---------
     2519    100.00  100.00 |  111
 ---------------------------+---------
     2519    100.00         |  XXX
(No missing data!)
GEE population-averaged model               Number of obs      =      7557
Group and time vars:         id setting     Number of groups   =      2519
Link:                             logit     Obs per group: min =         3
Family:                        binomial                    avg =       3.0
Correlation:               unstructured                    max =         3
                                            Wald chi2(11)      =    605.12
Scale parameter:                      1     Prob > chi2        =    0.0000

                     (standard errors adjusted for clustering on id)

             |             Semi-robust
        serv |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
     _Iold_1 |   .1233576   .1441123     0.86   0.392    -.1590973    .4058124
      mental |  -.3520988   .1933698    -1.82   0.069    -.7310967    .0268992
_IoldXment~1 |   .2905076    .189558     1.53   0.125    -.0810192    .6620344
      school |   .1850487   .1734874     1.07   0.286    -.1549804    .5250778
_IoldXscho~1 |    .330549    .162133     2.04   0.041     .0127742    .6483239
     _Iboy_1 |   .3652564   .1464068     2.49   0.013     .0783043    .6522084
_IboyXment~1 |  -.2779134   .1894824    -1.47   0.142    -.6492921    .0934654
_IboyXscho~1 |  -.1538587   .1650033    -0.93   0.351    -.4772592    .1695418
 _Iacadpro_1 |   .7239641   .1445971     5.01   0.000      .440559    1.007369
_IacaXment~1 |   .1843236   .1911094     0.96   0.335    -.1902441    .5588912
_IacaXscho~1 |   1.136088   .1669423     6.81   0.000     .8088873    1.463289
       _cons |  -2.944382   .1489399   -19.77   0.000    -3.236298   -2.652465
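The z statistic, p-value and confidence interval in each row follow from the coefficient and its semi-robust standard error. The Python sketch below recomputes them for the _Iboy_1 row, assuming a two-sided Wald test with normal critical values. The function name is hypothetical and Python is used only for illustration.

```python
import math
from statistics import NormalDist

def wald_summary(coef, se, level=0.95):
    """Return (z, two-sided p-value, lower limit, upper limit) for one coefficient."""
    z = coef / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return z, p, coef - crit * se, coef + crit * se

# Coefficient and standard error taken from the _Iboy_1 row of the output.
z, p, lo, hi = wald_summary(0.3652564, 0.1464068)
print(f"z = {z:.2f}, p = {p:.3f}, 95% CI = ({lo:.7f}, {hi:.7f})")

# Exponentiating a logit-scale coefficient gives an odds ratio.
print(f"odds ratio = {math.exp(0.3652564):.3f}")
```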
Estimates of working correlation (xtcorr)
Estimated within-id correlation matrix R
school mental general
        c1      c2      c3
r1  1.0000
r2  0.1646  1.0000
r3  0.1977  0.2270  1.0000
Multidimensional test of OLD effect
test _IoldXmenta_1=0
 ( 1)  _IoldXmenta_1 = 0.0
           chi2(  1) =    2.35
         Prob > chi2 =    0.1254
test _IoldXschoo_1=0, accumulate
 ( 1)  _IoldXschoo_1 = 0.0
 ( 2)  _IoldXmenta_1 = 0.0
           chi2(  2) =    4.55
         Prob > chi2 =    0.1029
test _Iold_1=0, accumulate
 ( 1)  _IoldXschoo_1 = 0.0
 ( 2)  _IoldXmenta_1 = 0.0
 ( 3)  _Iold_1 = 0.0
           chi2(  3) =   20.61
         Prob > chi2 =    0.0001
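The reported p-values are upper-tail chi-square probabilities of the Wald statistics. The Python sketch below reproduces them from the rounded statistics shown above, using closed forms that hold for 1, 2 and 3 degrees of freedom only. The function name is hypothetical, and small differences from the Stata output reflect rounding of the chi2 values.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variate, df in {1, 2, 3}."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2.0))
                + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))
    raise ValueError("this sketch only handles df = 1, 2 or 3")

# Wald statistics from the three accumulated tests of the OLD effect.
for stat, df in [(2.35, 1), (4.55, 2), (20.61, 3)]:
    print(f"chi2({df}) = {stat:5.2f}   Prob > chi2 = {chi2_sf(stat, df):.4f}")
```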
Results from Example
• There is a significant interaction between service setting
and academic problems (df=2,p<0.0001), but not for age
and setting (df=2,p=0.10) or gender and setting
(df=2,p=0.33)
• Overall, a higher proportion of boys use services
(df=3,p=0.04) and older children use them more than
younger children (df=3,p=0.0001)
More resources
• Generalized estimating equations: an annotated
bibliography (Ziegler, Kastner and Blettner, Biometrical
Journal, 1998)
• Review of software to fit Generalized Estimating Equation
regression models (Horton and Lipsitz, The American
Statistician, 1999, article online at
http://www.biostat.harvard.edu/~horton/geereview.pdf)