Simultaneous Coefficient Penalization
and Model Selection in Geographically
Weighted Regression: The Geographically
Weighted Lasso
by
David C. Wheeler
Technical Report 07-08
October 2007
Department of Biostatistics
Rollins School of Public Health
Emory University
Atlanta, Georgia
Corresponding Author: Dr. David Wheeler
Telephone: (404) 727-8059; Fax: (404) 727-1370
e-mail: dcwheel@sph.emory.edu
Simultaneous Coefficient Penalization and Model Selection in Geographically
Weighted Regression: The Geographically Weighted Lasso
Abstract. In the field of spatial analysis, the interest of some researchers in modeling relationships between variables locally has led to the development of regression models with spatially varying coefficients. One such model that has been widely applied is geographically weighted regression (GWR). In the application of GWR, marginal inference on the spatial pattern of regression coefficients is often of interest, as is, less typically, prediction and estimation of the response variable. Empirical research and simulation studies have demonstrated that local correlation in explanatory variables can lead to estimated regression coefficients in GWR that are strongly correlated and, hence, problematic for inference on relationships between variables. We introduce in this paper a penalized form of GWR called the geographically weighted lasso (GWL) that adds a constraint on the magnitude of the estimated regression coefficients to limit the effects of explanatory variable correlation. The geographically weighted lasso also performs local model selection by potentially shrinking some of the estimated regression coefficients to zero in some locations of the study area. We introduce two versions of GWL, one designed to improve prediction of the response variable and one more oriented toward constraining regression coefficients for inference. The results of applying GWL to simulated and real datasets show that this method stabilizes regression coefficients in the presence of collinearity and produces lower prediction and estimation error of the response variable than does GWR and another constrained version of GWR, geographically weighted ridge regression.
Key Words: geographically weighted regression, penalized regression, lasso, model
selection, collinearity, ridge regression
1 Introduction
In the field of spatial analysis, the interest of some researchers in modeling
relationships between variables locally has led to the development of regression models
with spatially varying coefficients. This is evidenced by the spatial expansion method
(Casetti, 1992), geographically weighted regression (GWR) designed to model spatial
parametric nonstationarity (Brunsdon et al, 1996; Fotheringham et al, 2002), and
geographically weighted regression designed to model variance heterogeneity (Páez et al,
2002). Of these, GWR as a model for spatial parametric nonstationarity has experienced
the widest application to date, at least partly due to readily available software for this
technique. One can see the similarities of GWR to nonparametric local, or locally
weighted, regression models that were first developed in the field of statistics (Cleveland,
1979; see also Loader, 1999 and Hastie et al, 2001 for more details). A clear
methodological link between local regression and GWR is found in the similarity of the
estimation procedures for loess smoothing, which is synonymous with local regression, in
Martinez and Martinez (2002, p. 292-293) and the GWR model in Fotheringham et al
(2002), which suggests viewing GWR as a local smoothing method. A key difference
between GWR and locally weighted regression is that in GWR weights arise from a
spatial kernel function applied to observations in a series of related local weighted
regression models across the study area, whereas the weights in locally weighted
regression are from a kernel function applied in variable space. Historically, GWR arose by replacing the attribute space used in locally weighted regression for curve fitting with geographical space, in order to model potentially spatially varying relationships. GWR also differs from local regression in the focus of its
typical application. Most published applications of GWR are concerned with measuring
statistically significant variation in the estimated regression coefficients and then
visualizing and interpreting the varying regression coefficients, as is in line with the
primary proposed benefit of GWR (Fotheringham et al, 2002). In contrast, local
regression is concerned with fitting a curve to the response variable (Loader, 1999, p. 19).
This difference in objectives may be summarized as one of inference on relationships in
GWR and estimation and prediction of the response variable in local regression. The
discrepancy between the principal applied focus of GWR and its methodological origins appears to be a noteworthy one, and perhaps a more appropriate use of GWR, in line with its theoretical statistical origins, is for estimation and prediction of the
response variable.
One issue of concern with GWR models expressed in the literature is with
correlation in the estimated coefficients, at least partly due to collinearity in the
explanatory variables of each local model. Wheeler and Tiefelsdorf (2005) show that
while GWR coefficients can be correlated when there is no explanatory variable
collinearity, the coefficient correlation increases systematically with increasing collinearity. The collinearity in explanatory variables can apparently be increased by the
GWR spatial kernel weights, and moderate collinearity of locally weighted explanatory
variables can lead to potentially strong dependence in the local estimated coefficients
(Wheeler and Tiefelsdorf, 2005), which makes interpreting individual coefficients
problematic. As an additional example, Wheeler (2007) applies collinearity diagnostic
tools to a Columbus, Ohio crime dataset to clearly link local collinearity to strong GWR
coefficient correlation and increased coefficient variability for two covariates at
numerous data locations with counter-intuitive regression coefficient signs.
Another issue in GWR is with the customary standard error calculations
associated with regression coefficient estimates. The standard error calculations in GWR
are only approximate due to reuse of the data for estimation at multiple locations
(Congdon, 2003; LeSage, 2004) and due to using the data to estimate both the kernel
bandwidth and the regression coefficients (Wheeler and Calder, 2007). In addition, local
collinearity can increase variances of estimated regression coefficients in the general
regression setting (Neter et al, 1996). The issue with the standard errors implies that the
confidence intervals for estimated GWR coefficients are only approximate and are not
entirely reliable for local model selection via significance tests. An issue related to
inference on the regression coefficients is that of multiple testing in GWR, where tests of
coefficient significance are carried out at many locations using the same data. One
potential solution is to use a Bonferroni adjustment to adjust the significance level of
individual tests to achieve an overall significance level.
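As a minimal illustration of such an adjustment in R (the p-values are hypothetical; p.adjust is the base R multiple-testing helper):

```r
# Hypothetical vector of local coefficient p-values, one per calibration location
p.values <- runif(49, 0, 0.10)

# Bonferroni: test each location at alpha / n to hold the overall level near alpha
alpha <- 0.05
significant <- p.values < alpha / length(p.values)

# Equivalently, adjust the p-values themselves and compare to alpha
significant.adj <- p.adjust(p.values, method = "bonferroni") < alpha
```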
There are methods in the statistical literature that attempt to circumvent
collinearity in traditional linear regression models with constant coefficients. These
methods include ridge regression, the lasso, principal components regression, and partial
least squares. Hastie et al (2001) and Frank and Friedman (1993) independently provide
performance comparisons of these methods. Ridge regression and the lasso are both
penalization, or regularization, methods that place a constraint on the regression
coefficients, and principal components regression and partial least squares are both
variable subset selection methods that use linear combinations of the explanatory
variables in the regression model. Ridge regression was designed specifically to reduce
collinearity effects by penalizing the size of regression coefficients and decreasing the
influence in the model of variables with relatively small variance in the design matrix.
The lasso is a more recent development that also shrinks the regression coefficients, but
shrinks the least significant variable coefficients to zero, thereby simultaneously
performing coefficient penalization and model selection. The name for the lasso
technique is derived from its function as a “least absolute shrinkage and selection
operator” (Tibshirani, 1996). Ridge regression and the lasso are deemed better
candidates than principal components regression and partial least squares to address
collinearity in local spatial regression models because they more directly reduce the
variance in the regression coefficients while retaining interpretability of covariate effects.
To address the issue of collinearity in the GWR framework, Wheeler (2007)
implemented a ridge regression version of GWR, called GWRR, and found it was able to
constrain the regression coefficients to counter local correlation present in an existing
dataset. Another finding was a reduced prediction error for the response variable in
GWRR compared to that from GWR. The lasso has not yet been introduced into the
GWR framework in the literature, and its implementation in GWR is the goal of this
paper. The lasso is appealing in the GWR framework due to its ability to carry out
coefficient shrinkage and local model selection, as well as for its potential to improve on
the performance of GWR for estimating the response variable, in terms of lower
prediction and estimation errors. While ridge regression in GWR has the potential to
control the variability in estimated regression coefficients, the lasso in theory should be
able to constrain the coefficients and additionally perform local model selection by
eliminating covariates from individual local models. Thus, the lasso offers a key advantage over ridge regression in the GWR framework and should lessen the reliance on
approximate confidence intervals in GWR for identification of insignificant local effects.
In this paper, we first review the GWR and lasso methods and then introduce the lasso in
the GWR framework. We then demonstrate the benefit of using the geographically
weighted lasso (GWL) through a comparative analysis with GWR and GWRR of two
existing crime datasets and simulated data.
2 Methods
Geographically Weighted Regression
In the application of GWR, data are often mean measures of aggregate data at
fixed points with associated spatial coordinates; for example, see the Georgia county
example in Fotheringham et al (2002), although this need not be the case. The spatial
coordinates of the data are used in calculation of distances that are input into a kernel
function to determine weights for spatial dependence between observations. Typically, a
regression model is fitted at each point location in the dataset, called a model calibration
location. Local regression models are related through sharing data, but the dependence
between regression coefficients at different model calibration locations is not specified in
the model. For each calibration location $i = 1, \ldots, n$, the GWR model at location $i$ is

$$y(i) = \mathbf{X}(i)\,\boldsymbol{\beta}(i) + \varepsilon(i), \qquad (1)$$

where $y(i)$ is the dependent variable at location $i$, $\mathbf{X}(i)$ is the row vector of explanatory variables at location $i$, $\boldsymbol{\beta}(i)$ is the column vector of regression coefficients at location $i$, and $\varepsilon(i)$ is the random error at location $i$. The vector of estimated regression coefficients at location $i$ is

$$\hat{\boldsymbol{\beta}}(i) = [\mathbf{X}^{T}\mathbf{W}(i)\,\mathbf{X}]^{-1}\,\mathbf{X}^{T}\mathbf{W}(i)\,\mathbf{y}, \qquad (2)$$

where $\mathbf{X} = [\mathbf{X}^{T}(1); \mathbf{X}^{T}(2); \ldots; \mathbf{X}^{T}(n)]^{T}$ is the design matrix of explanatory variables, which typically includes a column of 1's for the intercept; $\mathbf{W}(i) = \mathrm{diag}[w_{1}(i), \ldots, w_{n}(i)]$ is the diagonal weights matrix that is calculated for each calibration location $i$ and applies weights to observations $j = 1, \ldots, n$, with typically more weight applied to proximate or neighboring observations; $\mathbf{y}$ is the $n \times 1$ vector of dependent variables; and $\hat{\boldsymbol{\beta}}(i) = (\hat{\beta}_{i0}, \hat{\beta}_{i1}, \ldots, \hat{\beta}_{ip})^{T}$ is the vector of $p + 1$ local regression coefficients at location $i$ for $p$ explanatory variables and an intercept term.
The weights matrix, $\mathbf{W}(i)$, is calculated from a kernel function that places more emphasis on observations that are closer to the model calibration location $i$. There are numerous choices for the kernel function, including the Gaussian function, the bi-square nearest neighbor function, and the exponential function. The exponential kernel function is utilized in this paper. The weight from the exponential kernel function between any location $j$ and the model calibration location $i$ is calculated as

$$w_{j}(i) = \exp(-d_{ij}/\phi), \qquad (3)$$

where $d_{ij}$ is the distance between the calibration location $i$ and location $j$, and $\phi$ is the kernel bandwidth parameter.
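A minimal sketch of equations (2) and (3) in R (all names here are illustrative; no existing GWR package is assumed):

```r
# Exponential kernel weights (equation 3) between calibration location i
# and all n observations; coords is an n x 2 matrix of spatial coordinates
gwr.weights <- function(coords, i, phi) {
  d <- sqrt((coords[, 1] - coords[i, 1])^2 + (coords[, 2] - coords[i, 2])^2)
  exp(-d / phi)
}

# Local weighted least squares coefficients (equation 2) at location i;
# X is the n x (p + 1) design matrix, including a column of 1's
gwr.coef <- function(X, y, coords, i, phi) {
  w <- gwr.weights(coords, i, phi)
  XtW <- t(X * w)                  # t(X) %*% diag(w) without forming diag(w)
  solve(XtW %*% X, XtW %*% y)      # [X' W(i) X]^{-1} X' W(i) y
}
```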
To fit the GWR model, the kernel bandwidth is first estimated, often in practice by leave-one-out cross-validation (CV) across all the calibration locations. Cross-validation is an iterative process that finds the kernel bandwidth with the lowest associated prediction error of all the responses $y(i)$. For each calibration location $i$, it removes the data for observation $i$ in the model calibration at location $i$ and predicts $y(i)$ using the other data points and the kernel weights associated with the current bandwidth. An alternative to CV in kernel bandwidth estimation is the Akaike Information Criterion (AIC), as discussed in Fotheringham et al (2002). CV and the AIC are tools used in model selection, and more general information on the AIC and model selection is available elsewhere (Burnham and Anderson, 2004). It is currently unclear whether CV and the AIC will generally return the same solution, or whether one method should be favored in certain situations; the need for more research in this area is stressed by Farber and Páez (2007). Next, the kernel weights are calculated at each calibration location using the estimated bandwidth in the kernel function. Then, the regression coefficients are estimated at each model calibration location, and, finally, the responses are estimated by the expression $\hat{y}(i) = \mathbf{X}(i)\,\hat{\boldsymbol{\beta}}(i)$.
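A minimal leave-one-out CV sketch for the bandwidth, reusing the hypothetical helpers above (a simple grid search stands in for the binary search the paper uses later):

```r
# Leave-one-out CV prediction error (RMSPE) for a candidate bandwidth phi
gwr.cv.error <- function(X, y, coords, phi) {
  n <- length(y)
  pred <- numeric(n)
  for (i in 1:n) {
    w <- gwr.weights(coords, i, phi)
    w[i] <- 0                          # remove observation i from its own fit
    XtW <- t(X * w)
    b <- solve(XtW %*% X, XtW %*% y)
    pred[i] <- X[i, ] %*% b
  }
  sqrt(mean((y - pred)^2))
}

# Grid search over candidate bandwidths
phis <- seq(0.1, 5, by = 0.1)
phi.hat <- phis[which.min(sapply(phis, function(p) gwr.cv.error(X, y, coords, p)))]
```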
The Lasso
Shrinkage methods such as ridge regression and the lasso introduce a constraint on the regression coefficients. The ridge regression coefficients minimize the sum of a penalty on the size of the squared coefficients and the residual sum of squares (see Wheeler, 2007 for details). The lasso takes the shrinkage of ridge regression a step further by potentially shrinking the regression coefficients of some variables to zero. The lasso specification is similar to ridge regression, but it has an $L_1$ coefficient penalty in place of the ridge $L_2$ penalty, where $L_1$ denotes a sum of absolute values and $L_2$ denotes a sum of squared values. The lasso is defined as

$$\hat{\boldsymbol{\beta}}_{R} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n}\left(y_{i} - \beta_{0} - \sum_{k=1}^{p} x_{ik}\beta_{k}\right)^{2} \quad \text{subject to} \quad \sum_{k=1}^{p}\left|\beta_{k}\right| \le s. \qquad (4)$$
Tibshirani (1996) notes that the lasso constraint $\sum_{k}\left|\beta_{k}\right| \le s$ is equivalent to adding the penalty term $\lambda \sum_{k}\left|\beta_{k}\right|$ to the residual sum of squares, hence there is a direct correspondence between the parameters $s$ and $\lambda$ that control the amount of shrinkage of the regression coefficients. The equivalent statement for the lasso coefficients is

$$\hat{\boldsymbol{\beta}}_{R} = \arg\min_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{n}\left(y_{i} - \beta_{0} - \sum_{k=1}^{p} x_{ik}\beta_{k}\right)^{2} + \lambda \sum_{k=1}^{p}\left|\beta_{k}\right| \right\}. \qquad (5)$$
The absolute value constraint on the regression coefficients makes the problem nonlinear
and a typical way to solve this type of problem is with quadratic programming.
There are, however, ways to estimate the lasso coefficients outside of the mathematical programming framework. Tibshirani (1996) provides an algorithm that finds the lasso solutions by treating the problem as a least squares problem with $2^{p}$ inequality constraints, one for each possible sign pattern of the $\beta_{k}$'s, and applying the constraints sequentially. An even more attractive way to solve the lasso problem is proposed by Efron et al (2004a), who solve the lasso problem with a small modification to the least angle regression (LARS) algorithm, which is a variation of the classic forward selection algorithm in linear regression. The modification ensures that the sign of any non-zero estimated regression coefficient is the same as the sign of the correlation coefficient between the corresponding explanatory variable and the current residuals. Grandvalet (1998) shows that the lasso is equivalent to adaptive ridge regression and develops an EM algorithm to compute the lasso solution.
It is worthwhile to describe in more detail the LARS and lasso algorithms of Efron et al (2004a) because these methods have not been previously introduced in the geography literature at the time of this writing. The LARS algorithm is similar in spirit to forward stepwise regression, which we now describe. The forward stepwise regression algorithm is:

(1) Start with all coefficients $\beta_{k}$ equal to zero and set $\mathbf{r} = \mathbf{y}$, where $\mathbf{r}$ is the residual vector and $\mathbf{y}$ is the dependent variable vector.

(2) Find the predictor $\mathbf{x}_{k}$ most correlated with the residuals $\mathbf{r}$ and add it to the model.

(3) Calculate the residuals $\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}}$.

(4) Continue steps 2-3 until all predictors are in the model.

While the LARS algorithm is described in detail algebraically in Efron et al (2004a), Efron et al (2004b) restate the LARS algorithm as a purely statistical one with repeated fitting of the residuals, similar to the forward stepwise regression algorithm. The statistical statement of the LARS algorithm is:

(1) Start with all coefficients $\beta_{k}$ equal to zero and set $\mathbf{r} = \mathbf{y}$.

(2) Find the predictor $\mathbf{x}_{k}$ most correlated with the residuals $\mathbf{r}$.

(3) Increase the coefficient $\beta_{k}$ in the direction of the sign of its correlation with $\mathbf{r}$, calculating the residuals $\mathbf{r} = \mathbf{y} - \hat{\mathbf{y}}$ at each increase, and continue until some other predictor $\mathbf{x}_{m}$ has as much correlation with the current residual vector $\mathbf{r}$ as does predictor $\mathbf{x}_{k}$.

(4) Update the residuals and increase $(\beta_{k}, \beta_{m})$ in the joint least squares direction for the regression of $\mathbf{r}$ on $(\mathbf{x}_{k}, \mathbf{x}_{m})$ until some other predictor $\mathbf{x}_{j}$ has as much correlation with the current residual $\mathbf{r}$.

(5) Continue steps 2-4 until all predictors are in the model. Stop when $\mathrm{corr}(\mathbf{r}, \mathbf{x}_{j}) = 0 \;\; \forall j$, which is the OLS solution.
As with ridge regression, typically the response variable is centered and the explanatory variables are centered and scaled to have equal (unit) variance prior to starting the LARS algorithm. In other words, $\sum_{i=1}^{n} y_{i} = 0$, $\sum_{i=1}^{n} x_{ij} = 0$, and $\sum_{i=1}^{n} x_{ij}^{2} = 1$ for $j = 1, \ldots, m$. Efron et al (2004a) show that a small modification to the LARS algorithm yields the lasso solutions. In a lasso solution, the sign of any nonzero coefficient $\beta_{k}$ must agree with the sign of the current correlation of $\mathbf{x}_{k}$ and the residual. The LARS algorithm does not enforce this, but Efron and coauthors modify the algorithm to do so by removing $\beta_{k}$ from the lasso solution if its sign changes from the sign of the correlation of $\mathbf{x}_{k}$ and the current residual. This modification means that in the lasso solution, the active set of variables does not necessarily increase monotonically as the routine progresses. Therefore, the LARS algorithm typically takes fewer iterations than does the lasso algorithm. The modified LARS algorithm produces the entire range of possible lasso solutions, from the initial solution with all coefficients equal to zero to the final solution, which is also the OLS solution.
In some of the lasso algorithms, such as the modified LARS algorithm and the algorithm Tibshirani describes, the shrinkage parameter $s$ (or $t$) must be estimated before finding the lasso solutions. Hastie et al (2001) estimate the parameter

$$s = \frac{t}{\sum_{k=1}^{p} \left|\hat{\beta}_{k}^{\,ols}\right|} \qquad (6)$$

through ten-fold cross-validation, where $t$ is some positive scalar that reduces the ordinary least squares coefficient estimates. Tibshirani (1996) uses five-fold cross-validation, generalized cross-validation, and a risk minimizer to estimate the parameter $t$, with the computational cost of the three methods decreasing in the same order. Efron et al (2004a) also recommend using cross-validation to estimate the lasso parameter. If $s$ is one or more, there is no shrinkage and the lasso solutions for the coefficients are the least squares solutions. One can also define the lasso shrinkage parameter as

$$s = \frac{\sum_{k=1}^{p} \left|\hat{\beta}_{k}\right|}{\sum_{k=1}^{p} \left|\hat{\beta}_{k}^{\,ols}\right|}, \qquad (7)$$

where $s$ ranges from 0 to 1: 0 corresponds to the initial lasso solution with all regression coefficients shrunk to 0, and 1 corresponds to the final lasso solution, which is also the OLS solution. Then $s$ can be viewed as the fraction of the OLS solution that is the lasso solution. This is the definition of the lasso shrinkage parameter that we will use in the subsequent work in this paper.
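For concreteness, a minimal sketch of the lasso path and the fraction parameterization of equation (7), using the lars package in R on synthetic data:

```r
library(lars)

set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(3, 1.5, 0, 0, 2) + rnorm(n))

fit <- lars(X, y, type = "lasso")     # entire path of lasso solutions

# Coefficients at shrinkage fraction s = 0.5 of the full (OLS) solution;
# s = 1 recovers OLS and s = 0 shrinks all coefficients to zero
coef(fit, s = 0.5, mode = "fraction")

# Ten-fold cross-validation over the fraction s, as in Hastie et al (2001)
cv <- cv.lars(X, y, K = 10, mode = "fraction", plot.it = FALSE)
s.hat <- cv$index[which.min(cv$cv)]
```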
Geographically Weighted Lasso
The lasso can be implemented in GWR relatively easily, and the result is here called the geographically weighted lasso (GWL). An efficient implementation of the GWL outlined here uses the lars function from the package of the same name written in the R language by Hastie and Efron (see the R Project web site: http://cran.r-project.org/). The lars function implements the LARS and lasso methods, where the lasso is the default method, and details are described in Efron et al (2004a; 2004b). To make use of the lars function in the GWR framework, the $\mathbf{x}$ and $\mathbf{y}$ variables input to the function must be weighted by the kernel weights at each model calibration location. The lars function must be run at each model calibration location. This can be done in one of two ways: separate models with local scaling of the explanatory variables (GWL-local) or one model with global scaling of the explanatory variables (GWL-global). The first way, local scaling, requires $n$ calls of the lars function, one for each location; the weighted $\mathbf{x}$ and $\mathbf{y}$ are centered and the $\mathbf{x}$ variables are scaled by the norm in the lars function. This effectively removes the intercept and equates the scales of the explanatory variables to avoid the problem of different scales. The local scaling version estimates the lasso parameter to control the amount of coefficient shrinkage at each calibration location, so there is a shrinkage parameter $s_{i}$ estimated at each location $i$. Since we are working here in the GWR framework, we will estimate the model shrinkage and kernel bandwidth parameters using leave-one-out cross-validation while minimizing the root mean square prediction error (RMSPE) of the response variable. Therefore, the $n$ parameters $s_{i}$ and the kernel bandwidth $\phi$ must be estimated in GWL with CV before the final lasso coefficient solutions are estimated. We have chosen to estimate these parameters simultaneously, as the lasso solution will likely depend on the kernel bandwidth. The algorithm to estimate the local scaling GWL parameters using cross-validation is:
For each attempted bandwidth $\phi$ in the binary search for the lowest RMSPE:
  - Calculate the $n \times n$ weights matrix $\mathbf{W}$ using an $n \times n$ inter-point distance matrix $\mathbf{D}$ and $\phi$.
  - For each location $i$ from $1, \ldots, n$:
    - Set $\mathbf{W}^{1/2}(i) = \mathrm{sqrt}(\mathrm{diag}(\mathbf{W}(i)))$ and $\mathbf{W}^{1/2}_{ii}(i) = 0$, that is, set the $(i, i)$ element of the square root of the diagonal weights matrix to 0 to effectively remove observation $i$.
    - Set $\mathbf{X}_{w} = \mathbf{W}^{1/2}(i)\,\mathbf{X}$ and $\mathbf{y}_{w} = \mathbf{W}^{1/2}(i)\,\mathbf{y}$ using the square root of the kernel weights $\mathbf{W}(i)$ at location $i$.
    - Call lars($\mathbf{X}_{w}$, $\mathbf{y}_{w}$), save the series of lasso solutions, find the lasso solution that minimizes the error for $i$, and save this solution.
Stop when there is only a small change in the estimated $\phi$. Save the estimated $\phi$.

In the previous algorithm, saving the lasso solution entails saving the estimated shrinkage fraction $s_{i}$ at each location, as well as an indicator vector $\mathbf{b}$ of which variable coefficients are shrunken to zero. The algorithm uses a binary search to find the $\phi$ that minimizes the RMSPE. The small change in $\phi$ is set exogenously. The square root of the weights is used to weight the data because this is how the weights are applied to the data in the estimation of GWR regression coefficients in equation (2).
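A minimal sketch in R of one leave-one-out step of this algorithm at a single location, reusing the hypothetical gwr.weights helper (the binary search over the bandwidth and the bookkeeping for the indicator vector b are omitted; here X excludes the intercept column, since lars centers and scales internally):

```r
library(lars)

# One CV fit of GWL-local at calibration location i for a given bandwidth phi
gwl.local.cv.fit <- function(X, y, coords, i, phi) {
  w <- gwr.weights(coords, i, phi)
  w[i] <- 0                              # zero the (i, i) weight to drop obs i
  sw <- sqrt(w)                          # square root of the kernel weights
  Xw <- X * sw                           # rows of X scaled by sqrt(w)
  yw <- y * sw
  fit <- lars(Xw, yw, type = "lasso")    # series of lasso solutions
  # Pick the shrinkage fraction s_i minimizing the error for observation i
  s.grid <- seq(0, 1, by = 0.01)
  pred <- predict(fit, newx = X[i, , drop = FALSE], s = s.grid,
                  mode = "fraction")$fit
  s.i <- s.grid[which.min((y[i] - pred)^2)]
  list(s = s.i, fit = fit)
}
```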
The algorithm to estimate the final local scaling GWL solutions after cross-validation estimation of the shrinkage and kernel bandwidth parameters is:

Calculate the $n \times n$ weights matrix $\mathbf{W}$ using an $n \times n$ inter-point distance matrix $\mathbf{D}$ and $\phi$.
For each location $i$ from $1, \ldots, n$:
  - Set $\mathbf{W}^{1/2}(i) = \mathrm{sqrt}(\mathrm{diag}(\mathbf{W}(i)))$.
  - Set $\mathbf{X}_{w} = \mathbf{W}^{1/2}(i)\,\mathbf{X}$ and $\mathbf{y}_{w} = \mathbf{W}^{1/2}(i)\,\mathbf{y}$ using the square root of the diagonal weights matrix $\mathbf{W}(i)$ at location $i$.
  - Call lars($\mathbf{X}_{w}$, $\mathbf{y}_{w}$) and save the series of lasso solutions.
  - Select the lasso solution that matches the cross-validation solution according to the fraction $s_{i}$ and the indicator vector $\mathbf{b}$.
The second GWL method, global scaling, calls the lars function only one time, using specially structured input data matrices. This method fits all the local models at once, using global scaling of the $\mathbf{x}$ variables. It also estimates only one lasso parameter to control the amount of coefficient shrinkage. The weighted design matrix for the global version is an $(n \cdot n) \times (n \cdot p)$ matrix and the weighted response vector is $(n \cdot n) \times 1$. This results in an $(n \cdot p) \times 1$ vector of estimated regression coefficients. The weighted design matrix is such that the design matrix is repeated $n$ times, shifting $p$ columns in its starting position each time it is repeated. The kernel weights for the 1st location are applied to the first $n$ rows of the matrix, the weights for the 2nd location are applied to the next $n$ rows of the matrix, and so forth. The weighted response vector has the response vector repeated $n$ times, with the weights for the 1st location applied to the first $n$ elements of the vector, and so on. The algorithm to estimate the global scaling GWL parameters using cross-validation is:
For each attempted bandwidth $\phi$ in the binary search for the lowest RMSPE:
  - Calculate the $n \times n$ weights matrix $\mathbf{W}$ using an $n \times n$ inter-point distance matrix $\mathbf{D}$ and $\phi$.
  - Set the diagonal of $\mathbf{W}$ to 0.
  - Set $\mathbf{y}_{w}^{G} = \mathbf{W}^{1/2} \times (\mathbf{y}\,\mathbf{1}^{T})$ using the square root of each element of the weights matrix $\mathbf{W}$ and the column unity vector $\mathbf{1}$ of length $n$, where the operator $\times$ indicates element-by-element multiplication here. Set $k = 1$ and $m = 1$.
  - For each location $i$ from $1, \ldots, n$:
    - Set $j = k \cdot n - (n - 1)$ and $l = m \cdot p - (p - 1)$.
    - Set $\mathbf{X}_{w} = \mathbf{W}^{1/2}(i)\,\mathbf{X}$ using the square root of the kernel weights $\mathbf{W}(i)$ at location $i$. Set $\mathbf{X}_{w}^{G}(j : n \cdot k,\; l : p \cdot m) = \mathbf{X}_{w}$.
    - Set $k = k + 1$ and $m = m + 1$.
  - Call lars($\mathbf{X}_{w}^{G}$, vec($\mathbf{y}_{w}^{G}$)) and save the series of lasso solutions, where the vec() operator turns a matrix into a vector by sequentially stacking its columns, starting with the first.

In the previous algorithm, saving the lasso solution entails saving the estimated overall shrinkage fraction $s$, as well as a vector $\mathbf{b}$ that indicates which of the variable coefficients are shrunken to zero. The algorithm uses a binary search to find the $\phi$ that minimizes the RMSPE. The small change in $\phi$ is set exogenously.
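A minimal sketch in R of the stacked inputs that the single lars call receives (illustrative only; W is the n x n kernel weights matrix, with its diagonal zeroed during cross-validation, X is n x p, and y has length n):

```r
# Build the (n*n) x (n*p) block design matrix and length n*n response for GWL-global
build.global.inputs <- function(X, y, W) {
  n <- nrow(X); p <- ncol(X)
  Xg <- matrix(0, n * n, n * p)
  yg <- numeric(n * n)
  for (i in 1:n) {
    sw <- sqrt(W[, i])                    # sqrt kernel weights for location i
    rows <- ((i - 1) * n + 1):(i * n)     # block of n rows for location i
    cols <- ((i - 1) * p + 1):(i * p)     # block of p columns for location i
    Xg[rows, cols] <- X * sw
    yg[rows] <- y * sw
  }
  list(Xg = Xg, yg = yg)
}

# inputs <- build.global.inputs(X, y, W)
# fit <- lars(inputs$Xg, inputs$yg, type = "lasso")  # one call fits all locations
```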
The algorithm to estimate the final global scaling GWL solutions after cross-validation estimation of the shrinkage and kernel bandwidth parameters is:

Calculate the $n \times n$ weights matrix $\mathbf{W}$ using an $n \times n$ inter-point distance matrix $\mathbf{D}$ and $\phi$.
Set $\mathbf{y}_{w}^{G} = \mathbf{W}^{1/2} \times (\mathbf{y}\,\mathbf{1}^{T})$ using the square root of each element of the weights matrix $\mathbf{W}$ and the column unity vector $\mathbf{1}$ of length $n$, where the operator $\times$ indicates element-by-element multiplication here. Set $k = 1$ and $m = 1$.
For each location $i$ from $1, \ldots, n$:
  - Set $j = k \cdot n - (n - 1)$ and $l = m \cdot p - (p - 1)$.
  - Set $\mathbf{X}_{w} = \mathbf{W}^{1/2}(i)\,\mathbf{X}$ using the square root of the kernel weights $\mathbf{W}(i)$ at location $i$. Set $\mathbf{X}_{w}^{G}(j : n \cdot k,\; l : p \cdot m) = \mathbf{X}_{w}$.
  - Set $k = k + 1$ and $m = m + 1$.
Call lars($\mathbf{X}_{w}^{G}$, vec($\mathbf{y}_{w}^{G}$)) and save the series of lasso solutions, where vec() turns the matrix into a vector.
Select the lasso solution that matches the cross-validation solution according to the fraction $s$ and the indicator vector $\mathbf{b}$.
In comparing the local and global scaling GWL algorithms, the global GWL algorithm requires more computational time due to the matrix inversion of a much larger matrix. The global GWL algorithm must invert an $(n \cdot p) \times (n \cdot p)$ matrix, while the local GWL algorithm must invert a $p \times p$ matrix $n$ times, which is clearly faster. Considering that calculating the inverse of a general $j \times j$ matrix takes between $O(j^{2})$ and $O(j^{3})$ time (Banerjee et al, 2004), there can be quite a difference in the computation time for the two versions of GWL. In general, global GWL can take between two and three times more computation time than local GWL. In fact, global GWL may not be possible for large datasets, where large is defined relative to the computing environment, as the memory requirements of the method could exceed available computer system memory. In terms of expected model performance, the local GWL method should produce lower prediction error of the response variable than the global GWL method, as adding more shrinkage parameters generally increases model stability and hence lowers prediction error. The benefit of global GWL may be in lower estimation error of the regression coefficients, as the one shrinkage parameter may control excessive coefficient variation in GWR without stabilizing the model to the degree of local GWL. In summary, the local GWL should be faster than the global GWL and should have lower prediction error. The local and global versions of GWL will be compared empirically to each other and to GWR in the data example and simulation study in the next two sections.
3 Houston and Columbus Crime Examples
In this section, we demonstrate the use of the GWL methodology with two
existing data sets dealing with crime in Houston, TX and Columbus, OH and compare the
GWL results with those from both GWR and GWRR. Waller et al (2007) previously
analyzed violent crime incidence related to alcohol sales and drug law violations in the
Houston dataset using GWR and a Bayesian hierarchical model. The Columbus crime
dataset has been analyzed in spatial analysis work (Anselin, 1988) and in GWR-related
work (LeSage, 2004; Wheeler, 2007). Wheeler (2007) demonstrated with diagnostic tools
the presence of collinearity in a GWR model for Columbus neighborhood crime rates
using median income and housing values. We use the Columbus crime dataset here as an
illustrative example to compare model performance and select it for its problem with
collinearity in the GWR model. In analyzing the Columbus crime data, Wheeler (2007)
used a nearest neighbor bi-square kernel function with cross-validation to estimate the
GWR kernel bandwidth. In this work, we use an exponential kernel function with cross-
validation to demonstrate that the collinearity issue persists with a different kernel
function. All subsequent GWR-related models presented here use this kernel function.
Wheeler (2007) introduced the collinearity diagnostics of variance-decomposition
proportions, condition indexes, and variance inflation factors for GWR and applied them
to the Columbus crime data to illustrate collinearity issues with the GWR model. The
details for the diagnostics are available in that paper and are omitted here for brevity.
Instead, we briefly summarize the results of applying the variance-decomposition
diagnostic tool to the Columbus crime data. The GWR model is

$$y(i) = \beta_{0}(i) + \beta_{1}(i)\,x_{1}(i) + \beta_{2}(i)\,x_{2}(i) + \varepsilon(i), \qquad (8)$$

where $y$ is residential and vehicle thefts combined per thousand people for 1980, $x_{1}$ is mean income, $x_{2}$ is mean housing value, and $i$ is the index for neighborhoods. Through cross-validation, the estimated GWR kernel bandwidth is $\hat{\phi} = 1.26$. This estimated
bandwidth is used in the variance-decomposition of the kernel weighted design matrix to
assess the collinearity in the model. The variance-decomposition is done through singular
value decomposition and it has an associated condition index, which is the ratio of the
largest singular value to the smallest singular value. In diagnosing collinearity, the larger
the condition index, the stronger is the collinearity among the columns of the GWR
weighted design matrix. Belsley (1991) recommends a conservative value of 30 for a
condition index that indicates collinearity, but suggests the threshold value could be as
low as 10 when there are large variance proportions for the same component. The
variance-decomposition proportion is the proportion of the variance of a regression
coefficient that is affiliated with one component of its decomposition. In addition, the
presence of two or more variance proportions greater than 0.5 in one component of the
variance-decomposition indicates that collinearity exists between at least two regression
terms, one of which may be the intercept. Of the 49 records in the data, 6 have a
condition index above 30, 12 have a condition index above 20, and 45 have a condition
index above 10 and have large shared variances for the same component. There are many
observations with large variance proportions (> 0.5) from the same component, with the
shared component being between a covariate and the intercept for some records and
between the two covariates for other records. Of the 47 records with a large shared
variance component, 23 are with the intercept and income, 4 are with the intercept and
housing value, and 20 are between income and housing value. Overall, the diagnostic
values indicate local collinearity in the GWR model.
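A minimal sketch of these diagnostics for one location in R, assuming the decomposition is taken from the singular value decomposition of the kernel-weighted design matrix with columns scaled to unit length (following Belsley, 1991), and reusing the hypothetical gwr.weights helper:

```r
# Condition index and variance-decomposition proportions at calibration location i
gwr.collin.diag <- function(X, coords, i, phi) {
  Xw <- X * sqrt(gwr.weights(coords, i, phi))
  Xs <- sweep(Xw, 2, sqrt(colSums(Xw^2)), "/")  # columns scaled to unit length
  sv <- svd(Xs)
  k <- max(sv$d) / min(sv$d)                    # condition index
  Phi <- sv$v^2 %*% diag(1 / sv$d^2)            # Phi[j, l] = v[j, l]^2 / d[l]^2
  P <- sweep(Phi, 1, rowSums(Phi), "/")         # rows: terms, cols: components
  list(condition.index = k, variance.proportions = P)
}
```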
Due to the collinearity in the GWR model, it is beneficial to apply the GWL
models to these data and compare their performance to the GWR and GWRR models in
terms of prediction and estimation error of the response variable. The accuracy of the
estimated and predicted responses is measured by calculating the root mean square error
(RMSE) and the root mean square prediction error (RMSPE), respectively. The RMSE is the square root of the mean of the squared deviations of the estimates from the true values and should be small for accurate estimators; a minimal computational sketch is given below.
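These measures are straightforward to compute (helper names are illustrative):

```r
# Root mean square error of estimates (or predictions) against observed values
rmse <- function(estimate, observed) sqrt(mean((estimate - observed)^2))

# rmse(y.hat, y)       # RMSE(y), from in-sample estimates
# rmse(y.cv.pred, y)   # RMSPE(y), from leave-one-out predictions
```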
The results of fitting all four models to the data provide the error values in Table 1. The lowest prediction error and estimation error among the four models are listed in bold font. In this case, the constrained versions of GWR do substantially better than GWR at predicting the dependent variable, and GWL-local performs better than GWRR and GWL-global. The RMSPE for the GWL-local model is 32% lower than for GWR and 24% lower than for GWRR. For estimating the dependent variable, GWL-global performs best and substantially better than the other models. The RMSE for the GWL-global model is 17% lower than for the GWR model.
Overall, GWL performs better than both GWR and GWRR. Figure 1 shows the estimated GWR coefficients and the GWL-local coefficients for income ($\beta_{1}$) and housing value ($\beta_{2}$). The figure shows the nature of the shrinkage in the estimated GWL coefficients and how GWL enforces local model selection by shrinking some estimated coefficients to zero. In some neighborhoods, either the income or housing value has been effectively removed from the model. The estimated shrinkage parameter is $\hat{s} = 0.75$ for the GWL-global model, and the mean estimated shrinkage parameter is $\hat{s} = 0.76$ for the GWL-local model.
The Houston crime data consist of 439 census tracts in the City of Houston with
attributes from year 2000. The number of violent crimes per person in each census tract is
displayed in Figure 2. There are a few census tracts with a total number of violent crimes
that exceeds the population size. For the Houston crime data, the GWR model notation is
the same as in equation (8), but where $y$ is the number of violent crimes (murder, robbery, rape, and aggravated assault) per person, $x_{1}$ is the number of drug law violations per person, $x_{2}$ is the number of alcohol outlets per person, and $i$ is the index for census tracts. Since the distribution of the response variable is positively skewed, we use the natural logarithm of violent crimes in the model and also use the natural logarithm of both covariates to maintain linear relationships with violence rates. The estimated GWR kernel bandwidth found through cross-validation is $\hat{\phi} = 0.89$. To assess collinearity in the GWR model, we use the variance-decomposition diagnostic. The variance-decomposition
proportions and condition indexes are listed in Table 2 for records with the largest
condition indexes. These 10 records are labeled in the left plot of estimated GWR
coefficients for the drug and alcohol covariates in Figure 3. These labeled records
comprise many of the more extreme points in the plot. Observation 153 is clearly the
most extreme of the points, as it has the largest value for the drug rate effect and the
smallest value for the alcohol rate effect. In Table 2, this record has large variance
proportions for the same component for all three regression terms. Of the 439 records in
the dataset, 5 have a condition index above 30, 10 have a condition index above 20, and
41 have a condition index above 10. There are 411 records in the data with large variance
proportions (> 0.5) from the same component, with the shared component being between
a covariate and the intercept for some records and between the two covariates for other
records. Overall, the variance-decomposition proportions and condition index values
indicate the presence of local collinearity in the GWR model.
Given the presence of local collinearity in the GWR model for violent crime in
Houston, we also fit the constrained versions of GWR and compare them to GWR in
terms of model performance. The RMSE and RMSPE values for the response variable are
listed in Table 3 for the GWR, GWRR, GWL-global, and GWL-local models. As with
the Columbus crime data, the constrained versions of GWR improve on GWR in
prediction of the response variable. The GWL-local model again produces the lowest
RMSPE, 18% lower than the GWR model. In estimating violent crime, the GWL models
improve upon the GWR model. The GWL-global model produces the lowest RMSE and
its RMSE is 14% lower than in the GWR model. The estimated regression coefficients
for the GWL-global model in the right plot in Figure 3 show that the GWL model has
penalized some of the most extreme coefficients in the GWR model in the left side of
Figure 3, particularly record 153. Figure 4 displays the estimated regression coefficients
for the drug covariate from the GWL-local model plotted against the estimated
coefficients from the GWR model. This figure shows the effective shrinkage of the
GWL-local model, where the GWL-local model shrinks certain larger GWR coefficients,
some to zero. The large estimated regression coefficient for record 153 is greatly reduced in the GWL-local model. The estimated shrinkage parameter is $\hat{s} = 0.92$ for the GWL-global model, and the mean estimated shrinkage parameter is $\hat{s} = 0.65$ for the GWL-local model. The correlation in the estimated regression coefficients for the drug and alcohol covariates is -0.41 with GWR, -0.39 with GWRR, -0.37 with GWL-global, and 0.03 with GWL-local. The results with the crime data examples consistently show that the
constrained versions of GWR improve on the performance of GWR and that the GWL-
local model produces the lowest prediction error and the GWL-global model produces the
lowest estimation error.
4 Simulation Study
In this section, we use a simulation study to evaluate and compare the accuracy of
the predicted and estimated responses and the estimated regression coefficients from the
GWR, GWRR, and GWL models. We assess the accuracy of the models both when there
is no collinearity in the explanatory variables and when there is collinearity, expressed at
various levels. The expectation is that the GWL model will improve on GWR for
regression coefficient estimation when there is collinearity in the model. Another
expectation is that the GWL model will improve on GWR for prediction and estimation
of the response variable. While it has been conventional for researchers to apply a newly
introduced method to an existing dataset as a demonstration of the utility of the method,
we make use of simulated data here to learn about the performance of the method in a
comparative setting. It is necessary to use simulation in order to set the “true” values of
the regression coefficients, which are unknown with existing data, so that it is possible to
measure the deviation from the truth of the estimates from competing models. The
simulation study presented here is not intended to be exhaustive, but rather is an
appealing alternative to existing data for demonstrating the performance of the introduced
method in a certain situation.
The data-generating model in the simulation study has four explanatory variables,
with the true coefficients used to generate the data set equal to nearly zero for one
explanatory variable. The model to generate the data for this simulation study is
$$y(i) = \beta_{1}^{*}(i)\,x_{1}(i) + \beta_{2}^{*}(i)\,x_{2}(i) + \beta_{3}^{*}(i)\,x_{3}(i) + \beta_{4}^{*}(i)\,x_{4}(i) + \varepsilon(i), \qquad (9)$$

where $x_{1}, x_{2}, x_{3}, x_{4}$ are the first four principal components from a random sample drawn from a multivariate normal distribution of dimension ten with a mean vector of zeros and an identity covariance matrix, the errors $\varepsilon$ are sampled independently from a normal distribution with mean 0 and variance $\tau^{2*}$, and $i$ denotes the location. The star notation denotes the true values of the parameters used to generate the data. Note that there is no true intercept in the model used to generate the data and we do not fit an intercept in the simulation study. The data points are equally spaced on a $14 \times 14$ grid, for a total of 196 observations. The goal of the simulation study is to use the model in equation (9) to generate the data and see if the regression coefficient estimates match $\boldsymbol{\beta}^{*}$ and if the estimated and predicted responses approximate $\mathbf{y}$ for the GWR, GWRR, and GWL models. To produce comparable summary measures of deviance of the estimates and responses from the true values, we generate 100 realizations of the coefficient process, estimate the model parameters and responses for each data realization, measure the error in the estimates, and then produce average errors over the many realizations of the data. Using 100 realizations of the data-generating process is advantageous compared to one dataset because it allows us to assess model performance over 100 datasets.
Each realization of the true regression coefficients, $\boldsymbol{\beta}^{*}$, is sampled from the distribution

$$\boldsymbol{\beta} \mid \boldsymbol{\mu}_{\beta}, \boldsymbol{\Sigma}_{\beta} = N\!\left(\mathbf{1}_{n \times 1} \otimes \boldsymbol{\mu}_{\beta},\; \boldsymbol{\Sigma}_{\beta}\right), \qquad (10)$$

where the vector $\boldsymbol{\mu}_{\beta} = (\mu_{\beta_{0}}, \ldots, \mu_{\beta_{p}})^{T}$ contains the means of the regression coefficients corresponding to each of the explanatory variables, and spatial dependence in the coefficients is specified through the covariance, $\boldsymbol{\Sigma}_{\beta}$. We assume a separable covariance matrix (Gelfand et al, 2003) for $\boldsymbol{\beta}$ of the form

$$\boldsymbol{\Sigma}_{\beta} = \mathbf{H}(\gamma) \otimes \mathbf{T}, \qquad (11)$$

where $\mathbf{H}(\gamma)$ is the $n \times n$ correlation matrix that captures the spatial association between the $n$ locations, $\gamma$ is the spatial dependence parameter, $\mathbf{T}$ is a positive-definite $p \times p$ matrix for the covariance of the regression coefficients at any spatial location, and $\otimes$ denotes the Kronecker product operator, which is the multiplication of every element in $\mathbf{H}(\gamma)$ by $\mathbf{T}$. In the specification of the variance in the distribution for $\boldsymbol{\beta}$ (equation 11), the Kronecker product results in an $np \times np$ positive definite covariance matrix, since $\mathbf{H}(\gamma)$ and $\mathbf{T}$ are both positive definite. The elements of the correlation matrix $\mathbf{H}(\gamma)$, $H(\gamma)_{jk} = \rho(i_{j} - i_{k}; \gamma)$, are calculated from the exponential function $\rho(d; \gamma) = \exp(-d/\gamma)$.
For this simulation study, the true values used to generate the data are $\boldsymbol{\mu}_{\beta}^{*} = (1, 5, 5, 0)$, $\tau^{2*} = 1$, $\gamma^{*} = 10$, and $\mathbf{T}^{*} = \mathrm{diag}(0.1, 0.5, 0.5, 0.0000001)$, where diag() makes a diagonal matrix with the input numbers on the diagonal. The mean of 0 and the small variance for the fourth type of regression coefficient produce a variable effect that is effectively zero across the study area. More information regarding drawing samples from the coefficient distribution utilized here is available from Wheeler and Calder (2007). In general, as $\gamma^{*}$ increases, there are more consistent and clear patterns in the true regression coefficients. The range is the distance beyond which the spatial association becomes insignificant and is approximately $3\gamma^{*}$ with the covariance function parameterization used here, so there is some dependence in the coefficients for each covariate throughout the study area. Figure 5 illustrates the pattern in the true coefficients for two covariates for one realization of the coefficient process, and shows that there is some smoothness and spatial variation in the true coefficients when $\gamma^{*} = 10$. This pattern reflects a situation where there is spatial parametric nonstationarity, in other words, one in which GWR is intended to be applied.
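A minimal sketch of one draw from equations (10) and (11) in R (MASS::mvrnorm is one of several ways to sample; the grid and parameter values follow the text):

```r
library(MASS)

# 14 x 14 grid of locations, n = 196
coords <- as.matrix(expand.grid(x = 1:14, y = 1:14))
n <- nrow(coords); p <- 4

# Exponential spatial correlation H(gamma) and within-location covariance T
D <- as.matrix(dist(coords))
H <- exp(-D / 10)                          # gamma* = 10
Tm <- diag(c(0.1, 0.5, 0.5, 1e-7))         # T* = diag(.1, .5, .5, .0000001)

# Separable covariance via the Kronecker product, equation (11)
Sigma <- kronecker(H, Tm)                  # np x np covariance matrix
mu <- rep(c(1, 5, 5, 0), times = n)        # 1 (Kronecker) mu_beta

# One realization of the np x 1 coefficient vector, reshaped to n x p
beta.star <- matrix(mvrnorm(1, mu, Sigma), n, p, byrow = TRUE)
```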
In this simulation study, we start with no substantial collinearity in the model and systematically increase it until the explanatory variables are nearly perfectly collinear. This is done by replacing one of the original explanatory variables with one created from a weighted linear combination of the original explanatory variables, where the weight determines the amount of correlation of the variables. The formula for the new weighted variable is

$$x_{2}^{c} = c \cdot x_{1} + (1 - c) \cdot x_{2}, \qquad (12)$$
where $x_{2}^{c}$ replaces $x_{2}$ in the model in equation (9) and $c$ is a weight between 0 and 1. The simulation study is carried out with four levels of explanatory variable collinearity. The weights used in equation (12) to create the collinearity are $c = (0.0, 0.5, 0.7, 0.9)$, which coincide with explanatory variable correlations of $r = (0.0, 0.74, 0.93, 0.99)$. These levels of correlation correspond to no collinearity as a baseline, and then moderate, strong, and nearly perfect collinearity.
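A minimal sketch of equation (12) in R (x1 and x2 stand for the original principal-component variables):

```r
# Replace x2 with a weighted combination of x1 and x2 to induce collinearity
make.collinear <- function(x1, x2, c) c * x1 + (1 - c) * x2

x2c <- make.collinear(x1, x2, c = 0.7)
cor(x1, x2c)   # correlation rises with c; the paper reports r = 0.93 at c = 0.7
```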
In this study, the model parameters and responses are estimated for each realization for each of the following models: GWR, GWRR, GWL-global, and GWL-local. The kernel bandwidth is estimated for each data realization using cross-validation and is thus potentially different for each realization. To measure the accuracy of the estimated regression coefficients and estimated responses, the RMSPE and RMSE are calculated for the responses $\hat{\mathbf{y}}$ and the RMSE is calculated for the coefficients $\hat{\boldsymbol{\beta}}$ for each data realization. The average RMSE for $\hat{\boldsymbol{\beta}}$ and $\hat{\mathbf{y}}$ and the average RMSPE for $\hat{\mathbf{y}}$ are then calculated by averaging the individual RMSEs and RMSPEs from the 100 realizations of the coefficients.
The average RMSPE and RMSE for $\hat{\mathbf{y}}$ and the average RMSE for $\hat{\boldsymbol{\beta}}$ for each
model are listed in Table 4. The lowest value for each error measure for each level of
variable correlation (column) is in bold font. The results in the table show that the GWL-
local model produces the lowest prediction error of the response. The GWL-local model
prediction error is approximately 20% lower on average than the GWR error. This is not
an unexpected result, as the GWL-local model adds the most local penalization
parameters to the GWR model, which should lower the prediction error by stabilizing the
model. The next best performer in terms of RMSPE of the response is the GWL-global
model. GWR has the highest average prediction error of the response at each level of
collinearity. These results demonstrate that adding penalization terms for the regression
coefficients in GWR results in lower prediction error of the response than with GWR.
The RMSE results in the table show that the GWL-local model produces the
lowest estimation error of the response at all levels of collinearity. The GWL-local
estimation error is approximately 20% lower on average than the GWR error. Overall, the
simulation study shows that the GWL models perform better than GWR in explaining the
response variable. The better performance of the two versions of GWL relative to the
GWR is not unexpected, given that the GWL methods can shrink the regression
coefficients to zero to match the true values for one of the variables in an effort to
estimate the response variable. Taken together, the results from Table 4 indicate that the
GWL-local model is best for predicting and estimating the response variable in the
presence of an insignificant explanatory variable.
The RMSE results for $\hat{\boldsymbol{\beta}}$ in the table show that the GWL-global model produces
the lowest average estimation error of the regression coefficients. The GWR model
performs next best, except when there is nearly perfect collinearity and the GWRR model
outperforms GWR considerably. An explanation for the leading performance of the
GWL-global model is that it applies moderate shrinkage to the coefficients towards zero
for the variable with true coefficients set to zero to effectively remove its effect from the
model. It strikes a balance between the stronger shrinkage of GWL-local and the weaker shrinkage of GWRR. The RMSE results for $\hat{\boldsymbol{\beta}}$ suggest that an improvement in marginal
inference on the regression coefficients in the presence of collinearity or insignificant
explanatory variables is possible with the GWL-global model used in place of GWR.
An example of the difference in the estimated coefficients from GWR and GWL-
local is illustrated in Figure 6, which displays the estimated coefficients for $\beta_{4}$ from the
GWR and GWL-local models for one realization of the coefficient process when there is
no collinearity in the model. The true coefficients for this variable are all approximately
zero, so a plot of them would be a constant white surface. Figure 6 shows that the GWL
model estimates more of the coefficients near zero for this variable through coefficient
shrinkage than does GWR. This results in lower prediction and estimation error of the
response variable.
Many times in traditional regression analyses, researchers only consider using
penalization methods, such as the lasso and ridge regression, when there are many
explanatory variables to include in the model. However, the results from this simulation
study show that one can improve on GWR in terms of prediction and estimation of the
response and estimation of the regression coefficients for even relatively small models.
There may be situations, however, where it is beneficial to use GWR without
penalization when prediction is not of primary interest, particularly for quick descriptive
analyses of spatially varying relationships in data where collinearity is not present.
However, we anticipate that the benefits of penalization in GWR for prediction will
increase with an increasing number of potentially correlated explanatory variables.
5 Conclusions
There has been an increasing interest in spatially varying relationships between
variables in recent years in the spatial analysis literature. Recent attempts at modeling
these relationships have resulted in numerous forms of geographically weighted
regression, which has technical origins in locally weighted regression. While GWR offers
the promise of an understanding of the spatially varying relationships between variables,
local collinearity in the weighted explanatory variables used in GWR can produce
unstable models and dependence in the local regression coefficients, which can interfere
with conclusions about these relationships. While GWR has been applied to numerous
real world datasets in the literature, there has been inadequate consideration of the
accuracy of inferences derived from this model and an unclear distinction as to its use for
prediction and estimation of the response variable versus its role in inference on the
relationships between variables. The work in this paper uses real and simulated data to
evaluate the accuracy of the response variable estimates and predictions provided from
GWR and constrained versions of GWR, namely geographically weighted ridge
regression and the newly introduced geographically weighted lasso models. It also
evaluates the accuracy of regression coefficients from GWR, GWRR, and the GWL
models using simulated data, while considering the presence of collinearity and an
insignificant variable.
The work presented here shows that it is possible to implement the lasso in the
geographically weighted regression framework to perform regression coefficient
shrinkage while simultaneously performing local model selection and reducing prediction
and estimation error of the response variable. The data example and simulation study
results show that the penalized versions of GWR can outperform GWR in terms of
response variable prediction and estimation, both when there is no collinearity and where
there are various levels of collinearity in the model. In both the real and simulated data,
the GWL-local model produces the lowest prediction error of the response variable
among the methods considered. For the actual data, the GWL-global model produced the
lowest response variable estimation error. Other related preliminary work (Wheeler,
2006) suggests that the geographically weighted lasso may perform better at dependent
variable estimation than a Bayesian spatially varying coefficient process (SVCP) model
(Gelfand et al, 2003) that may be viewed as an alternative to GWR. Wheeler and Calder
(2007) recently demonstrated that the SVCP model can offer more accurate coefficient
inference and lower response variable estimation error than GWR, although at a greater
computational cost. A theoretical comparison of the performance of GWR, all penalized
versions of GWR, and the SVCP model is planned for future work. In summary, the
penalized versions of GWR introduced in this paper extend the method of GWR to
improve prediction and estimation of the response variable, which is in agreement with
its statistical theoretical origins.
References
Anselin L, 1988 Spatial Econometrics: Methods and Models (Kluwer, Dordrecht)
Banerjee S, Carlin B P, Gelfand A E, 2004 Hierarchical Modeling and Analysis for
Spatial Data (Chapman & Hall, Boca Raton)
Belsley D A, 1991 Conditioning Diagnostics: Collinearity and Weak Data in Regression
(John Wiley, New York)
Brunsdon C, Fotheringham A S, Charlton M, 1996, “Geographically weighted regression:
a method for exploring spatial nonstationarity” Geographical Analysis 28(4) 281 -
298
Burnham K, Anderson D, 2004, Model Selection and Multi-Model Inference: A Practical
Information-Theoretic Approach (Springer-Verlag, Berlin)
Casetti E, 1992, “Generating models by the expansion method: applications to
geographic research” Geographical Analysis 4 81 - 91
Cleveland W S, 1979, “Robust locally-weighted regression and smoothing scatterplots”
Journal of the American Statistical Association 74 829 - 836
Cleveland W S, Devlin S J, 1988, "Locally-weighted regression: an approach to
regression analysis by local fitting" Journal of the American Statistical
Association 83(403) 596 - 610
Congdon P, 2003, “Modelling spatially varying impacts of socioeconomic predictors on
mortality outcomes” Journal of Geographical Systems 5 161 - 184
Efron B, Hastie T, Johnstone I, Tibshirani R, 2004a, “Least angle regression” Annals of
Statistics 32(2) 407 - 451
Efron B, Hastie T, Johnstone I, Tibshirani R, 2004b, “Rejoinder to least angle regression”
Annals of Statistics 32(2) 494 - 499
Farber S, Páez A, 2007, “A systematic investigation of cross-validation in GWR model
estimation: empirical analysis and Monte Carlo simulations” Journal of
Geographical Systems, forthcoming
Fotheringham A S, Brunsdon C, Charlton M, 2002 Geographically Weighted Regression:
The Analysis of Spatially Varying Relationships (John Wiley & Sons, West
Sussex)
Frank I E, Friedman J H, 1993, “A statistical view of some chemometrics regression
tools” Technometrics 35(2) 109 - 148
36
Gelfand A E, Kim H, Sirmans C F, Banerjee S, 2003, “Spatial modeling with spatially
varying coefficient processes” Journal of the American Statistical Association 98
387 - 396
Grandvalet Y, 1998, “Least absolute shrinkage is equivalent to quadratic penalization”, in
ICANN'98, Volume 1 of Perspectives in Neural Computing Eds L Niklasson, M
Boden, T Ziemske (Springer-Verlag, Berlin) pp 201 - 206
Hastie T, Tibshirani R, Friedman J, 2001 The Elements of Statistical Learning: Data
Mining, Inference, and Prediction (Springer-Verlag, New York)
LeSage J P, 2004, “A family of geographically weighted regression models” in Advances
in Spatial Econometrics. Methodology, Tools and Applications Eds L Anselin, R J
G M Florax, S J Rey (Springer Verlag, Berlin) pp 241 - 264
Loader C, 1999 Local Regression and Likelihood (Springer, New York)
Martinez W L, Martinez A R, 2002 Computational Statistics Handbook with Matlab
(Chapman & Hall, Boca Raton)
Neter J, Kutner M H, Nachtsheim C J, Wasserman W, 1996 Applied Linear Regression
Models (Irwin, Chicago)
37
Páez A, Uchida T, Miyamoto K, 2002, “A general framework for estimation and
inference of geographically weighted regression models: 1. location-specific
kernel bandwidths and a test for locational heterogeneity” Environment and
Planning A 34 733 - 754
Tibshirani R, 1996, “Regression shrinkage and selection via the lasso” Journal of the
Royal Statistical Society B 58(1) 267 - 288
Waller L, Zhu L, Gotway C, Gorman D, Gruenewald P, 2007, “Quantifying geographic
variations in associations between alcohol distribution and violence: a comparison
of geographically weighted regression and spatially varying coefficient models”
Stochastic Environmental Research and Risk Assessment 21(5) 573 - 588
Wheeler D, 2007, “Diagnostic tools and a remedial method for collinearity in
geographically weighted regression” Environment and Planning A 39(10)
Wheeler, D, Calder C, 2007, “An assessment of coefficient accuracy in linear regression
models with spatially varying coefficients” Journal of Geographical Systems 9(2)
145 - 166
38
Wheeler D, Tiefelsdorf M, 2005, “Multicollinearity and correlation among local
regression coefficients in geographically weighted regression” Journal of
Geographical Systems 7 161 - 187
39
Tables
Method RMSPE(y) RMSE(y)
GWR 11.074 2.640
GWRR 9.808 2.800
GWL - global 9.946 2.197
GWL - local 7.483 2.687
Table 1. RMSPE and RMSE of the response variable for the GWR, GWRR, GWL-
global, and GWL-local models using the Columbus crime data
ID k p1 p2 p3
1 27.60 0.996 0.995 0.136
2 87.66 0.992 0.992 0.001
5 21.29 0.995 0.993 0.188
27 24.25 0.997 0.690 0.947
33 35.45 0.865 0.949 0.045
67 29.49 0.994 0.982 0.371
114 40.58 0.739 0.988 0.283
116 39.45 0.579 0.996 0.922
153 38.38 0.737 0.999 0.609
158 21.94 0.955 0.942 0.006
Table 2. Record number, condition index (k), and variance-decomposition proportions (p1
= intercept, p2 = drug, p3 = alcohol) for the Houston crime data
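The condition indices and variance-decomposition proportions in Table 2 follow the Belsley (1991) construction. The Python sketch below shows one plausible way to compute them for a local (kernel-weighted) design matrix; the function name and the choice to report only the proportions associated with the weakest component are illustrative assumptions.

```python
import numpy as np

def belsley_diagnostics(Xw):
    """Condition number and variance-decomposition proportions for a
    (kernel-weighted) design matrix Xw that includes an intercept column.

    Returns the largest condition index k and, for each coefficient, the
    proportion of its variance tied to the weakest component."""
    # Column-equilibrate: scale each column to unit length
    Xs = Xw / np.linalg.norm(Xw, axis=0)
    _, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    k = s.max() / s.min()                      # largest condition index
    phi = Vt.T ** 2 / s ** 2                   # phi[j, l]: coef j, component l
    pi = phi / phi.sum(axis=1, keepdims=True)  # rows sum to 1
    return k, pi[:, np.argmin(s)]
```

As a rough reading guide, proportions above about 0.5 on a component with a large condition index flag the variables involved in a near collinear relationship, which is the pattern visible for most of the records in Table 2.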
Method RMSPE(y) RMSE(y)
GWR 0.720 0.342
GWRR 0.713 0.349
GWL - global 0.714 0.300
GWL - local 0.590 0.311
Table 3. RMSPE and RMSE of the response variable for the GWR, GWRR, GWL-
global, and GWL-local models using the Houston crime data
Correlation
Method r = 0.00 r = 0.74 r = 0.93 r = 0.99
RMSPE(y)
GWR 1.187 1.154 1.158 1.174
GWRR 1.187 1.153 1.158 1.168
GWL - global 1.181 1.144 1.148 1.158
GWL - local 0.928 0.932 0.954 0.959
RMSE(y)
GWR 0.856 0.873 0.869 0.860
GWRR 0.856 0.873 0.871 0.877
GWL - global 0.849 0.862 0.858 0.821
GWL - local 0.662 0.675 0.669 0.700
RMSE(B)
GWR 0.503 0.553 0.689 1.586
GWRR 0.504 0.554 0.691 1.515
GWL - global 0.499 0.549 0.686 1.513
GWL - local 1.815 2.147 2.101 1.991
Table 4. RMSPE and RMSE of the response variable and RMSE of the regression
coefficients for each model used in the simulation study at four levels of explanatory
variable correlation
Figures
Figure 1. GWR estimated coefficients (left) and GWL-local estimated coefficients (right)
for the income (B1) and housing value (B2) covariates in the Columbus crime data
Figure 2. Number of violent crimes per person in Houston in year 2000
[Scatterplots of Beta2 versus Beta1; points are labeled with the record IDs listed in Table 2]
Figure 3. GWR estimated coefficients (left) and GWL-global estimated coefficients
(right) for the drug (Beta1) and alcohol (Beta2) covariates in the Houston crime data
Figure 4. GWR estimated coefficients (x-axis) and GWL-local estimated coefficients (y-
axis) for the drug (Beta1) covariate in the Houston crime data
Figure 5. Coefficient patterns for the first two β* parameters for one realization of the coefficient process in the simulation study. The left plot is β*1 and the right plot is β*2.
Figure 6. Coefficient estimates for β4 from GWR (left) and GWL-local (right) for one realization of the coefficient process in the simulation study