The Colocation Quotient: A New Measure of Spatial Association Between Categorical Subsets of Points



This article presents a new metric we label the colocation quotient (CLQ), a measurement designed to quantify (potentially asymmetrical) spatial association between categories of a population that may itself exhibit spatial autocorrelation. We begin by explaining why most metrics of categorical spatial association are inadequate for many common situations. Our focus is on where a single categorical data variable is measured at point locations that constitute a population of interest. We then develop our new metric, the CLQ, as a point-based association metric most similar to the cross-k-function and join count statistic. However, it differs from the former in that it is based on distance ranks rather than on raw distances and differs from the latter in that it is asymmetric. After introducing the statistical calculation and underlying rationale, a random labeling technique is described to test for significance. The new metric is applied to economic and ecological point data to demonstrate its broad utility. The method expands upon explanatory powers present in current point-based colocation statistics.
Timothy F. Leslie,
Barry J. Kronenfeld
Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA,
Department of Geology/Geography, Eastern Illinois University, Charleston, IL
Geographers have long considered the relationship between the characteristics of
an object and its neighbors. Tobler’s statement relating all things, near more than
far, remains one of the few statements geographers can claim as ‘‘law’’ (Tobler
1970). Tobler’s law applies to qualitative concepts such as culture and ecological
process, as well as to quantified measures like patenting rates and potential
evapotranspiration (Galiano 1986; Ó hUallacháin and Leslie 2005). Quantifying spatial
relationships has become a hallmark of geographic analysis.
Among the various Tobleresque relationships that are of interest to a geogra-
pher, the spatial relationship between distinct populations or distributions is one of
the most fundamental. This type of spatial relationship may be denoted by the term
spatial association. Aspects of spatial association are captured in the concepts of
‘‘spatial overlay,’’ ‘‘cross-correlation,’’ and ‘‘colocation’’ (Wartenberg 1985; de
Smith, Goodchild, and Longley 2009). The term spatial association is used else-
where to refer to pattern either within a single population (i.e., spatial autocorre-
lation) or between two or more populations. Here, we confine usage of the term to
the latter situation only, corresponding to what Lee (2001) refers to as ‘‘bivariate
spatial association.’’ In contrast to spatial autocorrelation, analysis of spatial asso-
ciation requires simultaneous consideration of multiple patterns and processes. The
autocorrelative structure of a joint population, and of each distinct subpopulation,
should be considered when selecting a metric for spatial analysis, as these aspects
of pattern may influence the observed association between populations.
Our interest lies in the situation where a single categorical data variable is
measured at point locations that constitute a population of interest. Because the
values of interest are nominal in nature, measures of spatial association developed
for ratio point data, such as the cross-variogram (Vallejos 2008), are not suitable.
Other measures, such as the join count statistic (Cliff and Ord 1981), are typically
applied to polygon rather than point data; this is also true of Moran’s coefficient,
which can be applied to nominal data (Griffith 2010). The most similar measure to
what we propose is the cross-k-function (Cressie 1991), but because it measures
spatial association between two populations, the null hypothesis that it tests is not
appropriate to the situation in which categorized individuals come from a single
population.
To analyze this situation, we develop a generalized method to determine
whether categories within a population are spatially correlated and, if so, how, in
what direction, and by how much. This situation typifies a variety of problems in
both human and physical geography. For example, one might wish to determine the
colocation preferences of businesses of different types within a metropolitan area or
examine the relationship between pairs of tree species in a forest setting in order to
identify possible interspecies relationships.
In either case, the underlying distribution is the result of two conceptually dis-
tinct spatial processes. First, the spatial structure of the overall population may
cause point locations to be clustered or dispersed. Second, nested within the overall
spatial pattern, relationships between categories may result in some categories be-
ing more or less likely to occur near others. Failure to distinguish between these two
hierarchical processes can result in spurious findings. In addition to separating
spatial association between categories from overall population clustering, recog-
nizing that categorical effects may be asymmetric is also important. Asymmetric
relationships in ecology include obligatory predatorism and parasitism, in which a
predator or parasite is confined to locations where the prey or host is found, but
the reverse is not necessarily true. In logistics, businesses further down a supply
chain often are dependent on (and therefore located near) their suppliers, while
suppliers locate based on natural resources and other inputs. Any metric of pairwise
categorical spatial correlation should be able to deal with this asymmetry. In some
cases, points in category A may prefer category B, points in category B may prefer
category C, and points in C may prefer being close to category A. In such cases, a
symmetrical spatial association metric could potentially find ‘‘significance’’ in all
or none of the bidirectional pairwise associations it measures (e.g., A ↔ B,
B ↔ C, and C ↔ A).
Our metric, the colocation quotient (CLQ), quantifies spatial relationships
between categories by building on the concept of the location quotient used by
geographers and economists to judge a region’s degree of specialization in a
particular industry (Blair 1995; Stimson, Stough, and Roberts 2006). The CLQ is
defined with respect to two categories (e.g., types A and B), and provides a measure
of the degree to which one categorical subset is spatially dependent on the other.
Specifically, $CLQ_{A \to B}$ measures the degree to which type A events are spatially
attracted to type B events. The CLQ is calculated as a ratio of observed versus ex-
pected points of one type among the set of nearest neighbors of points of another
type. It also may be viewed as a modification of traditional measures of spatial
correlation between categories, including the join count statistic and the
cross-k-function.
In the next section, we explain why these existing metrics of categorical spatial
association are inadequate for many common situations. We then develop a new
statistical measure and a corresponding significance-testing framework. Finally, we
present applications of the metric in socioeconomic and physical geography con-
texts, and then conclude with suggestions for implementation.
Categorical variables present an interesting challenge to measuring Tobler’s law
due to the multiplicity of relationships. In particular, the presence of multiple sub-
categories of a single type of entity results in two complicating factors, which are
illustrated in Fig. 1. First, the interaction between any given pair of categories often
is asymmetrical. In Fig. 1(1), asymmetry results from a unidirectional dependency,
in which individuals of category B are found only in close proximity to category A,
but individuals of category A may be found in any location independent of the
presence of category B. As demonstrated, and despite its similarity to Fig. 1(1), the
attraction between A and B in Fig. 1(2) is symmetric; a spatial association metric
must be capable of distinguishing between these two patterns. Second, spatial
relationships between categories often are confounded by spatial autocorrelation of
the joint population. Fig. 1(3) illustrates a situation in which spatial autocorrelation
exists in the overall population, but little further categorical association is evident.
A metric that cannot distinguish between these two types of correlative processes
has only limited practical value.
Figure 1. Illustrations of spatial patterns exhibiting (1, 2) asymmetry in pairwise categorical
spatial associations and (3) spatial autocorrelation in the overall population.
In the following section, we review the two most commonly used metrics of
spatial association between categories: the join count statistic and Ripley’s cross-k-
function. We argue that each metric has shortcomings when applied to the afore-
mentioned situations.
Join count statistic
The join count statistic is an area-level measure of spatial association, that is, of
correlation between categories on a k-color map (Dacey 1965). The statistic op-
erates by comparing the number of times a pair of categories occurs in adjacent
positions with the expectation of randomness (Iyer 1949; David 1971; Cliff and Ord
1981). It often is applied to binary grid data, which are typically conceptualized as
a black-and-white checkerboard or as an irregular polygon tessellation, in which
case the statistic becomes a measure of spatial autocorrelation. Links between each
polygon are counted as color-same (black touching black or white touching white)
or color-different (black touching white). The resulting counts are tallied, and they
can be used to determine if the data are significantly autocorrelated. These counts
are compared with the expectations for a binomial random variable under the null
hypothesis using a χ² test.
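The counting step described above can be sketched in Python for the simplest case, a binary grid with rook (edge-sharing) adjacency. This is a minimal illustration rather than a full join count test: the function name `join_counts`, the 0/1 coding of white and black cells, and the rook neighborhood are assumptions made here, and the significance test is omitted.

```python
def join_counts(grid):
    """Tally black-black, white-white, and black-white joins on a 0/1 grid.

    Assumes rook adjacency (cells sharing an edge); 1 = black, 0 = white.
    """
    rows, cols = len(grid), len(grid[0])
    counts = {"BB": 0, "WW": 0, "BW": 0}
    for r in range(rows):
        for c in range(cols):
            # Look only right and down so that each join is counted once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    a, b = grid[r][c], grid[rr][cc]
                    if a == b:
                        counts["BB" if a == 1 else "WW"] += 1
                    else:
                        counts["BW"] += 1
    return counts


checkerboard = [[1, 0, 1],
                [0, 1, 0],
                [1, 0, 1]]
print(join_counts(checkerboard))  # {'BB': 0, 'WW': 0, 'BW': 12}
```

A perfect checkerboard produces only color-different joins, the extreme case of negative spatial autocorrelation.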
Despite being conceptually simple, the join count statistic has not been im-
plemented in most popular geographic information system software packages. In
spatial settings where source data come as points rather than as areal values, com-
putation requires a geometrical association of point pairs. Traditionally, this asso-
ciation is accomplished by drawing Thiessen polygons around each point and
treating the resulting diagram as a polygon tessellation (Upton and Fingleton 1985).
This implementation introduces a certain degree of arbitrariness to the pairing of
points with their ‘‘neighbors.’’ Furthermore, point pairs are defined in a symmetrical
manner, so that counts of A → B and B → A joins are equal by definition.
The join count statistic, by nature, cannot detect asymmetry such as that shown in
Fig. 1(1).
The emphasis on binary measures also results in the join count statistic
rarely being used to measure categorical spatial association. Although the under-
lying theory for moving the join count statistic to more than two colors is well
developed (Haining 2003), this sort of analysis is rarely done in practice. Instead,
scholars primarily use this statistic to examine the degree of spatial autocorrelation
within individual categories, even when their data contain multiple categories
(e.g., Stevens and Jenkins 2000). Another spatial binary conceptualization, the
autologistic model, measures the likelihood of a point’s neighbor having a value
of one given that the point itself has a value of one (Cressie 1991). The autologistic
model provides a framework for quantifying the probability of occurrence
given neighborhood relations but does not provide theoretical guidance about
how neighborhood relations should be defined. Multivariate developments for the
autologistic model are very recent (Kavousi, Meshkani, and Mohammadzadeh
Researchers who have applied join count measures to multivariate data have
not implemented an ‘‘overall’’ statistic, so analysis can be done only by examining
each pair of variables separately. The original join count statistic has no need of an
overall statistic: either a binary pair is significantly autocorrelated negatively or
positively, or it is insignificant. In the multiple-category situation, the matrix of
pairwise significance in each pair of categories should be augmented by an overall
count of same–same linkages and an associated significance value, similar to the
work by Dixon (2002) and Ceyhan (2008).
Cross-k-function
The cross-k-function is an extension of Ripley’s k-function for two distributions
(Cressie 1991). It measures the overall density of category B within a prescribed
neighborhood around individuals of category A. This measure is compared with the
overall density of B, which is equivalent to the probability of finding B in a random
area. Results are presented as a graph of clustering over the range of distance radii,
similar to presentations of Ripley’s k.
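As a rough illustration of the quantity being described, the following Python sketch computes a naive cross-k estimate at a single radius: the mean number of category B points within the radius of each category A point, divided by the overall density of B. The function name `cross_k`, the `area` argument, and the absence of any edge correction are simplifying assumptions made here; practical implementations correct for edge effects and evaluate a range of radii.

```python
import math


def cross_k(points_a, points_b, radius, area):
    """Naive cross-k estimate at one radius (no edge correction)."""
    # Count every (A, B) pair separated by no more than the radius.
    pairs = sum(
        1
        for a in points_a
        for b in points_b
        if math.dist(a, b) <= radius
    )
    intensity_b = len(points_b) / area        # overall density of B
    mean_b_near_a = pairs / len(points_a)     # B points per A neighborhood
    return mean_b_near_a / intensity_b


# One A point with one of two B points inside a unit radius, study area of 10.
k_value = cross_k([(0.0, 0.0)], [(0.5, 0.0), (3.0, 3.0)], radius=1.0, area=10.0)
print(k_value)
```

Because the estimate depends on raw distances, any clustering of the joint population raises it for every pair of categories, which is the limitation discussed next.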
The cross-k-function is asymmetrical and can account for differences in the
complementary relationships between two categories. However, metrical distance
is used rather than topological neighborhood distance. As a consequence, the
effects of the spatial pattern of an overall population are comingled with the effects
of cross-correlation between categories. The cross-k-function, while providing a
graph of results, can lead to erroneous conclusions because of effects occurring at
multiple scales within a data set. Given the pattern shown in Fig. 1(3), for example,
the cross-k-function would report highly significant positive spatial correlations
between every pair of categories because the density of individuals in each cate-
gory is higher in the vicinity of individuals of any other category. However, if A, B,
and C are businesses and if each cluster represents a city, then this is a trivial result
in most types of analysis because businesses clustering within cities is already well-
known. Of greater interest is the question of whether specific pairs of categories are
more mutually clustered than would be expected given the spatial pattern of a
parent population. A variation of the cross-k-function developed for network situ-
ations partially corrects this problem, but the correction is only applicable for net-
work-based analyses (Okabe and Yamada 2001).
Other related metrics
A number of papers and articles propose or examine methods of analyzing spatial
association between values measured at points. Clifford, Richardson, and Hemon
(1989) and Dutilleul (1993) examine the effect of spatial autocorrelation on the
significance value of the standard correlation coefficient (r) between two geocoded
variables. Wartenberg’s (1985) multivariate spatial correlation expands Moran’s I to
examine quantitative multivariate geographic distributions and shows analogies
to principal components analysis. Lee (2001) builds on Wartenberg’s (1985) work to
decompose Moran’s I into a spatial smoothing scalar and correlation (Pearson’s r)
between spatially lagged (smoothed) values of observed variables, and uses this
decomposition to create a bivariate measure of spatial association. Vallejos (2008)
and Rukhin and Vallejos (2008) use a normalized cross-variogram, treating point
data as samples of a continuous spatial process. These measures all derive from a
conceptualization of two or more ratio variables distributed on a continuous spatial
domain. Although it may be possible to adapt these methods to the measurement
of spatial association between nominal values distributed on a discontinuous
(point) domain, such adaptation is not straightforward and is beyond the scope of
this article.
Other metrics have been developed to describe spatial association between
categories of points, but none handle the problems of asymmetry and nested pattern
correlation. Galiano (1986) calculates conditional probabilities within distance
neighborhoods, similar to the cross-k-function, to study relationships between tree
species. Dale (1999) dismisses Galiano’s conditional probabilities as being equiv-
alent to the paired quadrat covariance, a symmetrical measure of cross-correlation
based on fixed-area quadrats that is affected by spatial pattern in a joint population.
Leslie and Ó hUallacháin (2006) present the nearest establishment with asymmetrical
relationships (NEAR) statistic, a preliminary version of the CLQ, but they
exclude same-category associations, discount multiple observations at a single lo-
cation, and do not provide a basis for understanding the statistical significance of
their results. The need remains for an asymmetrical topological measure that can
work with categorical data constrained by a parent population that may itself be
clustered (or dispersed). The use of nearest neighbors is apposite, as the closest
individuals are generally expected to have the greatest influence (Ord 1990).
Within the field of ecology, a few investigations use nearest neighbor contingency
tables to investigate point patterns (Dixon 1994, 2002; Ceyhan 2008). However,
these investigations lack a solid semantic foundation from which other researchers
can choose when to use their statistics versus other statistics. The CLQ was devel-
oped to fill this need and is explained in the following section.
Although a number of metrics exist to quantify spatial association (Cressie 1991;
Okabe and Yamada 2001; Dixon 2002; Leslie and Ó hUallacháin 2006), a
conceptual framework to distinguish these patterns and processes has not been
articulated. As an impetus for our methods, we seek to do three things: (1) to
consolidate our proposed analytic into a small number of accessible equations; (2) to
discuss the existence and causes of asymmetry in measures of spatial association;
and (3) to create and test a null hypothesis that captures the analytical power and
explains the limits of the statistic. In developing our methodological framework, we
build upon the importance of nearest neighbors. This construction is done to ad-
dress the problems that arise from the clustering of joint point patterns for reasons
other than colocation.
In developing a statistical metric that distinguishes between the effects of
autocorrelation of a joint population and the specific associative relationships
between pairs of categorical subsets, identification of the appropriate null hypoth-
esis is important. While research implementing the cross-k-function uses a null
hypothesis of ‘‘there is no spatial association between any pair of categorical sub-
sets,’’ the null hypothesis for a CLQ-based analysis is ‘‘given the clustering of the
joint population, there is no spatial association between pairs of categorical sub-
sets.’’ That is, we take the geometric pattern of a joint population as a given and
search for associations that cannot be explained by this joint pattern alone.
Let $P$ denote a point population within which each individual is assigned
uniquely to one of $k$ categories in a classification system $X$, and let $A \in X$ and
$B \in X$ denote (possibly the same) categories in $X$. $CLQ_{A \to B}$ is defined as the
ratio of observed to expected proportions of B among A’s nearest neighbors. Formally, this
calculation is given by
\[ CLQ_{A \to B} = \frac{C_{A \to B} / N_A}{N'_B / (N - 1)} \tag{1} \]
where $N$ denotes the population size of the set of categories under analysis; $N_A$
denotes the population size of A; $N'_B$ denotes the population size of B (if $A \neq B$) or
the population size of B minus 1 (if $A = B$); and $C_{A \to B}$ denotes the count of type A
points whose nearest neighbor is a type B point, defined more rigorously in equation
(7) below. The numerator of $CLQ_{A \to B}$ is the proportion of type B points among
A’s nearest neighbors (i.e., the observed proportion), while the denominator is the
proportion of type B points that could be a nearest neighbor to each type A point
(i.e., the expected proportion). To calculate the expected proportion, $N - 1$ rather
than $N$ is used in the denominator, because a point cannot be its own nearest
neighbor (Dixon 2002). Similarly, in the calculation of the same–same category
$CLQ_{A \to A}$, $N'_B$ is defined as the count of type B (= A) points minus one because each
point of category A can have all other points of type A as neighbors except itself.
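The ratio of observed to expected proportions described above can be written as a few lines of Python. This is a minimal sketch that takes the nearest-neighbor count as already known; the function name `clq` and its argument names are illustrative choices made here, not part of the original method description.

```python
def clq(c_ab, n_a, n_b, n_total, same_category=False):
    """Colocation quotient of A toward B from summary counts.

    c_ab:          count of A points whose nearest neighbor is a B point
    n_a, n_b:      population sizes of categories A and B
    n_total:       size of the whole population
    same_category: True when A and B are the same category
    """
    # A point cannot be its own neighbor, so one B is removed when A = B.
    n_b_prime = n_b - 1 if same_category else n_b
    observed = c_ab / n_a                    # share of A's nearest neighbors that are B
    expected = n_b_prime / (n_total - 1)     # share of B among all other points
    return observed / expected


# 10 of 20 A points have a B nearest neighbor; B is 25 of the 100 other points.
print(clq(c_ab=10, n_a=20, n_b=25, n_total=101))
```

In this example half of the A points have a B nearest neighbor while B makes up a quarter of the remaining population, so the quotient is 2.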
Semantically, $CLQ_{A \to B}$ denotes the spatial attraction of A to B or, alternatively,
the degree to which B attracts A. For example, $CLQ_{A \to B} = 2$ indicates
that A is twice as likely to have B as its nearest neighbor (i.e., to locate near a point
of type B) as would be expected by chance. The attraction expressed by $CLQ_{A \to B}$
is unidirectional because it is dependent on nearest neighbor relationships that may
be asymmetric. If many cases exist where A’s nearest neighbor is B but B’s nearest
neighbor is not A, then $C_{A \to B} > C_{B \to A}$ and, therefore, $CLQ_{A \to B} > CLQ_{B \to A}$,
logically expressing that A is more attracted to B than B is to A. Same-category
CLQs are interpreted in a similar manner, such that $CLQ_{A \to A} = 0.67$ indicates
that A is only two-thirds as likely to be its own nearest neighbor as would be
expected given A’s proportion in the overall parent population; in this case, the
attraction is bidirectional.
The CLQ can be viewed as a simple modification of either the join count sta-
tistic or the cross-k-function. With regard to the join count statistic, the CLQ is
derived by replacing pairwise joins with nearest neighbor counts. From the cross-k-
function, the CLQ is derived by substituting neighbor ranks for absolute distances as
the basis for determining relative probabilities. The CLQ also may be considered an
extension of metrics used to measure the degree to which categories are associated
with specific locations or types of locations. Economic geographers are familiar
with the location quotient, which measures the ratio of a local economy’s propor-
tion of economic activity in a particular sector to the proportion of activity in the
country and/or region that encompasses it (Blair 1995; Stimson, Stough, and Rob-
erts 2006). The location quotient assigns values greater (or less) than one to places
with greater (or less) than average activity in a particular sector. A similar measure
used in forestry and other ecological applications is fidelity, which describes the
degree to which a given species is associated with a particular community type
(e.g., Dyer 2006). Though a conceptual descendant of these metrics, the CLQ de-
scribes spatial association of one category of objects with another category rather
than with a region or set of regions.
Like classical location quotients, a CLQ value of one has semantic importance.
The value of one occurs when the proportion of category B individuals among
category A individuals’ closest neighbors equals the proportion of category B in its
overall population (excluding one individual of category A). A $CLQ_{A \to B}$ greater
than one shows a higher number of nearest neighbors of category B than expected
given the relative counts in its population, whereas a value less than one indicates
that points in group B are closest neighbors to points in group A less frequently than
expected. The lowest possible value is zero, which occurs when no points in cat-
egory B are the closest neighbor to any points of category A. Every integer value
above unity indicates a multiple of ‘‘closeness’’ more than expected. The same-
category CLQ is undefined if any category has fewer than two points.
The CLQ does have a maximum value that depends on the proportion within
the overall population of the category under examination as well as certain
geometrical constraints. Ignoring these geometrical constraints, the proportional
maximum $CLQ_{A \to B}$ is found when all of A’s neighbors are B ($C_{A \to B} = N_A$),
resulting in
\[ \mathrm{Max}_{\mathrm{prop}}(CLQ_{A \to B}) = \frac{N_A / N_A}{N'_B / (N - 1)} = \frac{N - 1}{N'_B} \tag{2} \]
This formula shows that the maximum degree to which A can be attracted to B
depends on the population of B and, somewhat counterintuitively, that this rela-
tionship is an inverse one. In other words, the larger the population of B, the less the
attractive force it can exert on A. The reason for this is that if B constitutes a sig-
nificant proportion of the overall population, then one would expect a large pro-
portion of type A individuals to locate near a type B individual due to chance alone.
For example, suppose that type B individuals made up 500 (nearly half) of an
overall population of 1001; regardless of the population of A, one would expect
half of all type A individuals to have a type B individual as their nearest neighbor.
Even if every type A individual had a type B individual as its nearest neighbor, this
would only be twice as many as expected, and the maximum value of $CLQ_{A \to B}$
would be equal to two. This issue becomes important only when dealing with
a large number of categories with substantial variance in their relative populations,
especially when one category makes up a large percentage of the parent population.
Geometry also limits the maximum CLQ value. Maximum values occur when
each point of category B has a point of category A as its nearest neighbor. This
maximum increases as B becomes a larger share of the overall population. A geo-
metric limit occurs when every point of category A is surrounded by a ring of five
category B points that each have the category A point as their nearest neighbor. In
this situation, any additional category B point will be just as close or closer to an
existing category B point as it is to the central A point, and so the proportion of
category B points that have A as their nearest neighbor cannot increase any further.
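Therefore, $5N_A$ is substituted for $C_{B \to A}$ in equation (1) to determine the geometric
maximum
\[ \mathrm{Max}_{\mathrm{geom}}(CLQ_{B \to A}) = \frac{5N_A / N_B}{N_A / (N - 1)} = \frac{5(N - 1)}{N_B} \tag{3} \]
for nonsame category analysis. Exchanging the roles of A and B gives the corresponding
limit $5(N - 1)/N_A$ for $CLQ_{A \to B}$. For the same–same category analysis, the geometric
maximum does not apply, as higher values are not achieved when a set of points is
closer to a central point than to other points in a ring. In situations where $N_B$ is more
than five times the size of $N_A$, the geometric maximum rather than the proportional
maximum is the maximum CLQ value.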
Considering both numerical and geometric constraints, equation (4) furnishes
the formula for the maximum CLQ, which holds for both the multivariate and
bivariate situations:
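\[ \mathrm{Max}(CLQ_{A \to B}) = \mathrm{Min}\left( \frac{N - 1}{N'_B}, \; \frac{5(N - 1)}{N_A} \right) \tag{4} \]
This maximum expresses two constraints on the value of $CLQ_{A \to B}$. First, as the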
relative population of B increases, the degree to which A is attracted to B is limited
numerically. This restriction arises because having a category B point as a nearest
neighbor is less ‘‘surprising’’ when category B points are more common. Second, as
the proportion of A increases, the degree to which A is attracted to B is limited
geometrically. This constraint occurs because only so many category A points can
be packed around each category B point.
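The two constraints can be combined in a short Python sketch. It assumes, following the ring-of-five argument, that at most five category A points can share a single category B point as their nearest neighbor, so the geometric limit for A toward B is 5(N - 1)/N_A; the function name `max_clq` and its arguments are illustrative choices made here.

```python
def max_clq(n_a, n_b, n_total, same_category=False):
    """Upper bound on the colocation quotient of A toward B.

    Takes the smaller of the proportional limit (B cannot be more than
    every nearest neighbor) and the geometric limit (assumed here to be
    at most five A points per B point).
    """
    if same_category:
        # The geometric limit is not applied to same-category quotients.
        return (n_total - 1) / (n_a - 1)
    proportional = (n_total - 1) / n_b
    geometric = 5 * (n_total - 1) / n_a
    return min(proportional, geometric)


# B is 500 of 1001 points: even if every A point is nearest to a B point,
# that is only twice the expected share.
print(max_clq(n_a=100, n_b=500, n_total=1001))  # 2.0
```

When A outnumbers B by more than five to one the geometric limit becomes the binding one, for example `max_clq(600, 100, 1001)`.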
Because the CLQ semantically indicates the ratio of observed to expected,
demonstrating that its expectation is unity is important. Given the values of $N$, $N_A$,
and $N'_B$, a random allocation of categories dictates that the expected count $C_{A \to B}$
of type A points’ nearest neighbors that are of type B is simply $N_A$ multiplied by the
conditional probability of selecting a point of type B given that one point of type A
has already been selected:
\[ E(C_{A \to B}) = N_A \cdot \frac{N'_B}{N - 1} \tag{5} \]
Substituting this expected count of nearest neighbors $E(C_{A \to B})$ into equation
(1), the expectation of the CLQ becomes
\[ E(CLQ_{A \to B}) = \frac{E(C_{A \to B}) / N_A}{N'_B / (N - 1)} = \frac{N'_B / (N - 1)}{N'_B / (N - 1)} = 1 \tag{6} \]
Therefore, the expected value of the CLQ is one if categories are randomly
allocated across a fixed point pattern.
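This expectation can be checked numerically by holding the point locations fixed and repeatedly shuffling the category labels. The Python sketch below is illustrative only: the helper names, the 60 uniformly random points, the 20/40 split between categories, and the number of trials are all assumptions chosen for the demonstration, and ties between nearest neighbors are ignored.

```python
import math
import random


def nearest_neighbor_index(points):
    """Index of each point's single nearest neighbor (first one wins a tie)."""
    nn = []
    for i, p in enumerate(points):
        best_j, best_d = None, math.inf
        for j, q in enumerate(points):
            if i != j:
                d = math.dist(p, q)
                if d < best_d:
                    best_j, best_d = j, d
        nn.append(best_j)
    return nn


def clq_ab(labels, nn, a, b):
    """Colocation quotient of a toward b for one labeling of the points."""
    n = len(labels)
    n_a = labels.count(a)
    n_b = labels.count(b) - (1 if a == b else 0)
    c_ab = sum(1 for i, lab in enumerate(labels) if lab == a and labels[nn[i]] == b)
    return (c_ab / n_a) / (n_b / (n - 1))


def mean_clq_under_random_labeling(points, labels, a, b, trials=2000, seed=0):
    """Average quotient over random relabelings of a fixed point pattern."""
    rng = random.Random(seed)
    nn = nearest_neighbor_index(points)   # geometry never changes
    shuffled = list(labels)
    total = 0.0
    for _ in range(trials):
        rng.shuffle(shuffled)             # category counts are preserved
        total += clq_ab(shuffled, nn, a, b)
    return total / trials


rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(60)]
labels = ["A"] * 20 + ["B"] * 40
estimate = mean_clq_under_random_labeling(points, labels, "A", "B")
print(estimate)  # should be close to 1
```

The same shuffling loop, with the simulated values stored rather than averaged, yields a reference distribution against which an observed quotient can be compared.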
In the preceding discussion, each category A point is assumed to have exactly
one nearest neighbor. However, often a point has several equidistant nearest
neighbors. This could occur when multiple points coexist or appear to coexist at
a single location, as, for example, when several retail stores share the same postal
address. Equidistant nearest neighbors also are common when points are arranged
on a regular grid. By making explicit the definition of $C_{A \to B}$ in equation (1), the
CLQ can easily accommodate such situations. Because $CLQ_{A \to B}$ expresses the
degree to which A is attracted to B, the weight of each category A point is made
equal by formally defining $C_{A \to B}$ as
\[ C_{A \to B} = \sum_{i=1}^{N_A} \sum_{j=1}^{n_i} \frac{B_{ij}}{n_i} \tag{7} \]
where $n_i$ is the number of equidistant nearest neighbors that point $i$ has, and $B_{ij}$
is a 0–1
decision rule variable of whether the jth equidistant nearest neighbor of the ith
category A point is of category B. Thus, the contribution of each category A point to
the total nearest neighbor count is exactly one and is not determined by the number
of equidistant neighbors.
Asymmetry is defined by the condition that $CLQ_{A \to B} \neq CLQ_{B \to A}$. Equation (1)
shows that this occurs only when $C_{A \to B} \neq C_{B \to A}$. This means that asymmetry in
the CLQ results if and only if there exist asymmetrical spatial configurations in
which the nearest neighbor of an individual does not have that individual as its own
nearest neighbor. In Fig. 1(1), for example, several clusters are present in which
more than one individual of type B is clustered around a single individual of type A.
This results in asymmetry because there are many type B individuals whose nearest
neighbor is an individual of type A but who are not the nearest neighbors of that
type A individual. Specifically in this example, $CLQ_{A \to B} = 0.95$ (not statistically
significantly different from one), but $CLQ_{B \to A} = 1.9$: B is the closest neighbor of A
just slightly less than would be expected for a completely random mixture, while A
is the closest neighbor of B much more than a random mixture would suggest. This
result indicates that the occurrence of B is strongly dependent on the occurrence of
A, but not vice versa. The maximum CLQ value for both sectors is 1.9, which is
reached by $CLQ_{B \to A}$. Also notable is that $CLQ_{B \to B}$ is 0, while $CLQ_{A \to A}$ is 1.05.
These same-category CLQs show that B appears to avoid itself as its own nearest
neighbor (as noted previously, it appears to strongly prefer A as its nearest neigh-
bor), while A has itself as its own nearest neighbor only slightly more than average.
In Fig. 1(2), half of the points of category B have been removed. This new pattern
shares very little in common with Fig. 1(1) because the two figures differ significantly
in how often A is B’s nearest neighbor but not the reverse.
$CLQ_{A \to B}$ and $CLQ_{B \to A}$ are the same value: 1.4. These CLQ values semantically
indicate that A and B prefer the opposite category 40% more than would be
expected for a random distribution. The cross-category CLQs are not the only values
to change with the removal of category B points: while $CLQ_{B \to B}$ remains zero,
$CLQ_{A \to A}$ decreases to 0.77. In Fig. 1(2), the number of As is a much larger portion
of the population, and the expectation of same-category links increases. However,
as the actual number of same-category nearest neighbors from category A to category
A remains the same, $CLQ_{A \to A}$ decreases between Figs. 1(1) and 1(2). This
example illustrates how spatial attraction of one categorical subset for another is
influenced not only by the number of observed nearest neighbors but also by the
proportions of the category types within the parent population that influence the
expected number of nearest neighbors.
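The interplay between observed nearest-neighbor counts and category proportions can be illustrated with a minimal Python sketch. It assumes the pairwise form CLQ_{A→B} = (N_{A→B}/N_A) / (N'_B/(N − 1)), where N_{A→B} is the number of A points whose nearest neighbor is of category B and N'_B equals N_B, reduced by one when A = B. The function names and the toy coordinates are hypothetical, nearest neighbors are found by brute force, and equidistant neighbors are resolved arbitrarily rather than shared.

```python
from collections import Counter
from math import dist

def nearest_neighbor_indices(points):
    """Index of each point's nearest neighbor (brute force, O(n^2))."""
    return [min((j for j in range(len(points)) if j != i),
                key=lambda j: dist(points[i], points[j]))
            for i in range(len(points))]

def pairwise_clq(labels, nn, a, b):
    """CLQ_{a->b}: observed share of a-points whose nearest neighbor is b,
    divided by the share expected under random labeling (assumed form)."""
    n = len(labels)
    counts = Counter(labels)
    observed = sum(1 for i, lab in enumerate(labels)
                   if lab == a and labels[nn[i]] == b)
    # A point cannot be its own neighbor, so exclude it when a == b.
    available_b = counts[b] - (1 if a == b else 0)
    expected_share = available_b / (n - 1)
    return (observed / counts[a]) / expected_share

# Toy pattern: two A points, each with two B points clustered around it.
points = [(0, 0), (0.1, 0), (-0.1, 0.05), (5, 5), (5.1, 5), (4.9, 5.05)]
labels = ["A", "B", "B", "A", "B", "B"]
nn = nearest_neighbor_indices(points)

clq_ab = pairwise_clq(labels, nn, "A", "B")
clq_ba = pairwise_clq(labels, nn, "B", "A")
clq_bb = pairwise_clq(labels, nn, "B", "B")
print(clq_ab, clq_ba, clq_bb)
```

In this toy pattern every B has an A as its nearest neighbor and no B has a B, so CLQ_{B→A} is twice CLQ_{A→B} and CLQ_{B→B} is zero, the same qualitative asymmetry described for Fig. 1(1).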
While the pairwise CLQ matrix provides a means of identifying important categorical
relationships, a global statistic facilitates significance testing. The global
CLQ is defined as the ratio of the observed number of same-category nearest
neighbor pairs to the number expected under the null hypothesis of no spatial
association between categories. The global CLQ is given by
CLQ_{Global} = \frac{\sum_{A} N_{A \to A}}{\sum_{A} N_A (N_A - 1) / (N - 1)}
The denominator represents the expected number of same-category nearest
neighbors under the null hypothesis of no spatial association.
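A minimal Python sketch of this ratio follows. It assumes that, under random labeling, each of the N_A points of category A draws its nearest neighbor from the other N − 1 points, of which N_A − 1 share its category, so that the expected same-category count is the sum of N_A (N_A − 1)/(N − 1) over categories. The function name and the input format (a list `nn` in which `nn[i]` is the index of point i's nearest neighbor) are hypothetical.

```python
from collections import Counter

def global_clq(labels, nn):
    """Observed same-category nearest-neighbor pairs divided by the number
    expected when labels are assigned at random (assumed null expectation)."""
    n = len(labels)
    observed = sum(1 for i, lab in enumerate(labels) if labels[nn[i]] == lab)
    # Expected count: N_A * (N_A - 1) / (N - 1), summed over categories.
    expected = sum(c * (c - 1) / (n - 1) for c in Counter(labels).values())
    return observed / expected

# Hypothetical input in which every point's nearest neighbor shares its label.
labels = ["A", "A", "A", "B", "B", "B"]
nn = [1, 0, 1, 4, 3, 4]
clq_global = global_clq(labels, nn)
print(clq_global)
```

Here all six nearest-neighbor links are same-category while 2.4 would be expected, so the sketch returns a value well above one.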
To determine which patterns are statistically different from random, we need to
know the likelihood of an observed CLQ occurring within a given spatial pattern if
categorical assignments are random. Research on spatial autocorrelation statistics
such as the join count shows that the assumptions of a simple t-test of the difference
of proportions do not hold and that nonnormality varies with the size of a data set
(Cliff and Ord 1981). We follow recent developments in spatial statistics, such as
the LISA (Anselin 1995, 2003), the network cross-k-function (Okabe and Yamada
2001), and Ripley’s k-function (Marcon and Puech 2003), in using Monte Carlo
simulation, which makes no assumptions about the expected distribution except
that location behaviors are similar across a study area. In each simulation trial, the
proportion of the total population assigned to each category is held constant, but
these category assignments are randomly redistributed within a population, and
each pairwise CLQ as well as the CLQ_{Global} is recalculated. After a predetermined
number of permutations, the simulated sample distributions for the pairwise and
global CLQs are used to determine the significance of observed CLQs. Two-tailed
significance is determined by taking the lesser of the number of trials in which the
simulated CLQ was greater than or equal to, or less than or equal to, the observed
CLQ and multiplying by two.
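The random labeling procedure can be sketched in Python as follows, shown here for the global CLQ only (pairwise CLQs would be recalculated inside the same loop). The function name, the `nn` input (index of each point's nearest neighbor), and the default trial count are hypothetical. Expressing the doubled tail count as a proportion of the trials, capped at one, is an assumption of this sketch, as is the form of the expected same-category count.

```python
import random
from collections import Counter

def global_clq_permutation_test(labels, nn, trials=999, seed=0):
    """Observed global CLQ and a two-tailed pseudo p-value under random labeling."""
    n = len(labels)
    # Category counts never change, so the expected count is computed once.
    expected = sum(c * (c - 1) / (n - 1) for c in Counter(labels).values())

    def clq(lab):
        same = sum(1 for i, x in enumerate(lab) if lab[nn[i]] == x)
        return same / expected

    observed = clq(labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    ge = le = 0
    for _ in range(trials):
        rng.shuffle(shuffled)        # relabel points; locations and nn are reused
        simulated = clq(shuffled)
        ge += simulated >= observed
        le += simulated <= observed
    # Lesser tail count, doubled, as a proportion of trials (assumed normalization).
    p_value = min(1.0, 2 * min(ge, le) / trials)
    return observed, p_value

# Hypothetical pattern: 20 points in mutual nearest-neighbor pairs,
# with both members of every pair sharing a category.
labels = ["A"] * 10 + ["B"] * 10
nn = [i + 1 if i % 2 == 0 else i - 1 for i in range(20)]
observed, p_value = global_clq_permutation_test(labels, nn, trials=999, seed=42)
print(observed, p_value)
```

Because the nearest-neighbor list is fixed, each trial only reshuffles labels, which is what keeps the per-trial cost linear in the number of points.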
Calculation of the CLQ and Monte Carlo simulation to determine significance
were implemented in a Visual Basic .NET (Microsoft Corp., Redmond, WA) stand-alone
program. A spatial index was created to facilitate efficient computation of
nearest neighbors, which is performed in O(n log n) time including index creation
(Friedman, Bentley, and Finkel 1977), where n is the number of points in the
population. Each Monte Carlo simulation requires an O(n) allocation of categories
among the existing points but not recomputation of nearest neighbors. The overall
computational cost is therefore the greater of O(n log n) and
O(mn), where m is the number of simulations.
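As a rough illustration of which term dominates, consider the sizes reported for the Phoenix application in this article (n = 36,909 points, m = 10,000 simulations). Constant factors are ignored and the base-2 logarithm is chosen purely for illustration:

```latex
\[
n \log_2 n \approx 36{,}909 \times 15.2 \approx 5.6 \times 10^{5},
\qquad
mn = 10{,}000 \times 36{,}909 \approx 3.7 \times 10^{8}.
\]
```

Under these assumptions the relabeling term exceeds the nearest-neighbor term by a factor of several hundred, so the number of simulations, not the spatial index, governs run time.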
Finally, a substantial advantage of the CLQ is that it readily accommodates point-based
data sets: it does not require the use of predefined areal units
that were devised for purposes other than identifying the phenomenon in
question, such as traffic analysis zones to examine metropolitan employment (Giuliano
and Small 1991) or municipal townships to analyze ecological data (Cogbill,
Burk, and Motzkin 2002). Point data do not have the typical location quotient
sensitivity to scale (Mulligan and Schmidt 2005). However, caution is warranted
when applying the CLQ to categories with a small count. We do not recommend
the use of this method when a category has fewer than 10 individuals because the
power of the test will be low. Aggregating categories, if theoretically appropriate,
would likely solve this problem.
Two applications of the CLQ
To illustrate the proposed metric, we determined global and pairwise CLQs for two
data sets from very different domains. The first data set consists of 36,909 business
establishments in the Phoenix metropolitan region, classified according to economic
sector. The second data set consists of 368,122 trees tallied in a 50 ha plot on
Barro Colorado Island in the Panama Canal, classified according to health and
health-related physiognomic characteristics. For each data set, we ran 10,000
simulations to determine significance values for the global and pairwise CLQs. Global
CLQs for both data sets are greater than one and highly significant (P < 0.001).
Establishments in Phoenix
Which types of businesses locate near one another? Do establishments colocate
near establishments in their own sector or in different sectors? While this topic is
addressed at the intersectoral level (no same-category associations) in Leslie and
Ó hUallacháin (2006), the addition of same-category associations alongside
intersectoral links provides a comparison and a more in-depth look at the postmodern
metropolitan area of Phoenix. Phoenix is a flat city with few natural barriers,
developed around the automobile, and decentralized for suburbanizing residents
(Gammage 2003). Intersectorally, in 2002 Phoenix had three groupings: secondary
sector, wholesale and transportation, and administrative support; finance, insurance,
and producer services; and retailers with entertainment and accommodation
establishments (Leslie and Ó hUallacháin 2006). In the Leslie and Ó hUallacháin
(2006) analysis, the effect of same-category associations is mentioned but not
A point-level establishment data set created by the Maricopa Association of
Governments (MAG) is used here. As a regional planning alliance, MAG conducts a
regular survey of nongovernmental employers in the region, most recently in 2004.
Business category groupings identified by this survey appear in Table 1. Maps
of these data appear in Leslie and Ó hUallacháin (2006). The data are spatially
clustered and have a nearest neighbor R-value of 0.24 (significantly nonrandom at
the 0.01 level). Conducting a cross-k-function analysis on this data set would likely
find many pairwise categorical associations simply because the data set itself is
highly clustered.
Table 1 Descriptive Categorical Information for Phoenix Economic Analysis, 2004
NAICS   Sector                                                            N
11–23   Agriculture, mining, utilities, construction                  3,705
31–33   Manufacturing                                                 3,021
42–43   Wholesale trade                                               2,732
44–45   Retail                                                        5,279
48–49   Transport                                                       802
51      Information                                                     786
52      Finance and insurance                                         1,931
53      Real estate                                                   1,608
54–55   Professional, scientific, technical, management services      3,631
56      Administrative support                                        1,954
61      Education                                                     1,185
62      Health care and social assistance                             3,045
71      Arts, entertainment, and recreation                             622
72      Accommodation and food services                               3,199
81      Other services                                                2,886
92      Public administration                                           519
Total                                                                36,905
As noted, the global CLQ is strongly positive (CLQ_{Global} = 2.53). This finding is
highlighted by the presence of same-category pairwise CLQs (Table 2, diagonal)
that are significant and greater than one, which indicates that businesses of all
categories have strong preferences for colocating with other businesses of the same
category. This effect is strongest in public administration establishments (NAICS
92), which are 14 times more likely to have a nearest neighbor of the same category
than would be expected for a random distribution. This effect is weakest for
administrative support establishments (NAICS 56), which are just one-and-a-half
times more likely to have another administrative support establishment as their
nearest neighbor. Same-category values indicate that most establishment types are,
in general, two-and-a-half times more likely to locate next to a similar
establishment than would be expected from a random mixture.
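To make the multiplicative interpretation concrete, the public administration value can be converted back into a share of nearest neighbors. The calculation below is a sketch that assumes the expected same-category share is (N_A − 1)/(N − 1) and uses the counts from Table 1 (519 establishments out of 36,905) together with the same-category CLQ of 14.39 reported in Table 2:

```latex
\[
\frac{N_{92} - 1}{N - 1} = \frac{518}{36{,}904} \approx 0.014,
\qquad
14.39 \times 0.014 \approx 0.20 .
\]
```

Under that assumption, roughly one in five public administration establishments has another public administration establishment as its nearest neighbor, against an expectation of about 1.4% for a random mixture.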
A sector-by-sector inspection reveals that, in general, location preferences do
tend to be symmetric and reveal substantial category groupings. The classic grouping
of natural resources and subsequent processing is present. The primary sector
(agriculture, mining, utilities, and construction [NAICS 11–23]), manufacturing
sector (NAICS 31–33), wholesale sector (NAICS 42–43), and transportation and
warehousing sector (NAICS 48–49) all have high preferences for each other, with
the unexpected exception that the wholesale sector and transportation and
warehousing sector have no associative links with the primary sector. The primary sector
also has a mutually high CLQ with administrative support (NAICS 56) and education
(NAICS 61), although the manufacturing, wholesale, and transportation and
warehousing sectors do not. The retail sector (NAICS 44–45) has CLQs significantly less
than one for almost every category except itself and the accommodation and food
services sector (NAICS 72), likely a result of the placement of Phoenix retail
establishments in strip and shopping malls. Other services (NAICS 81) tend to have
retail as a nearest neighbor more often than expected, although the reverse is not
true. Producer services also form a grouping: information (NAICS 51), finance and
insurance (NAICS 52), and professional, scientific, technical, and management
services (NAICS 54–55) have mutually reciprocated high CLQs with each other.
Left out of this mix is the real estate sector (NAICS 53), which has strong location
preferences only with the finance and insurance sector but not with information or
professional, scientific, technical, and management services. Professional, scientific,
technical, and management services are mutually close to administrative support,
but information and finance and insurance services do not share this
association. Finally, accommodation and food services have a link with other services,
also likely to be a result of colocation in strip malls throughout the
metropolitan area.
Table 2 CLQs for Phoenix Establishments by NAICS, 2004
NAICS 11–23 31–33 42–43 44–45 48–49 51 52 53 54–55 56 61 62 71 72 81 92 GEO MAX
11–23 2.16 1.25 1.20 0.61 0.74 0.86 1.34 1.36 0.56 0.44 0.63 49.80
31–33 1.33 2.59 2.19 0.69 1.60 0.42 0.69 0.75 0.56 0.35 0.49 0.48 0.77 0.49 61.08
42–43 2.28 2.33 0.73 1.48 0.61 0.74 0.75 0.65 0.43 0.56 0.56 0.75 0.57 67.54
44–45 0.53 0.62 0.64 2.41 0.61 0.49 0.82 0.82 0.48 0.64 0.52 0.58 1.70 0.51 34.95
48–49 1.34 1.90 1.68 0.72 5.11 0.43 0.58 0.44 0.52 0.70 230.07
51 0.76 0.73 2.27 1.7 1.63 0.69 0.75 234.76
52 0.52 0.45 0.55 0.75 0.31 1.9 3.31 1.38 1.73 1.30 0.61 0.64 95.56
53 0.53 0.73 1.37 1.68 1.81 114.75
54–55 0.88 0.67 0.76 0.55 0.47 1.69 1.87 2.19 1.5 0.76 0.60 0.76 50.82
56 1.18 0.82 0.72 1.39 1.52 1.63 0.70 0.63 94.43
61 1.29 0.61 0.60 0.71 0.69 1.37 3.52 1.65 0.64 155.71
62 0.47 0.42 0.41 0.67 0.47 0.69 0.81 0.72 4.19 0.74 60.60
71 0.61 0.61 1.5 2.48 1.34 1.38 296.66
72 0.35 0.41 0.45 1.73 0.45 0.70 0.64 0.62 0.67 0.74 2.60 1.29 0.56 57.68
81 0.72 0.86 1.20 0.81 0.70 0.85 1.19 1.77 0.57 63.94
92 0.73 0.42 0.60 0.42 0.49 14.39 355.53
NUM MAX 9.96 12.22 13.51 6.99 46.01 46.95 19.11 22.95 10.16 18.89 31.14 12.12 59.33 11.54 12.79 71.11
Note: Values indicate the likelihood of a point’s nearest neighbor belonging to the column category given that the point belongs to the row
category, as compared with a random distribution (CLQ_{row→column}). Only colocation values significantly different from one at the 0.05 level or
below are shown. Global CLQ = 2.53, P = 0.001.
Important asymmetries are present, aside from those previously mentioned. The
primary and other services sectors both have a large number of asymmetric associations.
Public administration (NAICS 92) also has several of these asymmetric associations.
Asymmetries appear to reflect not supply-chain mechanics but rather
category sets where one sector may be the basis of a cluster (such as a medical
[NAICS 62], retail, or government complex) and other establishment types are arrayed
around it. This pattern causes high levels of same-category associations, with
asymmetries in the support services (administrative support, other services). Some sectors
appear to have very few colocation requirements in either direction. The public
administration and arts, entertainment, and recreation (NAICS 71) sectors have only
three or four significant associations outside of their same-sector preferences.
Tree conditions in Barro Colorado Island
Data for the Barro Colorado Island example were obtained from the Center for
Tropical Forest Science (Hubbell, Condit, and Foster 2005). Previously connected
with the larger forest, this island was formed when the surrounding area was
flooded during the construction of the Panama Canal in the early 20th century. All
trees within the plot have been tallied since 1982. The most recent data, collected
in 2005, include observations about health and growth conditions. Live trees are
noted if they are buttressed (i.e., have enlarged trunks at the base), multistemmed,
or leaning significantly from a vertical position, or if the main trunk is broken below
the crown. Buttressed trees have noticeably widened trunks near the ground, which
may be caused by wet or unstable soil or internal rot. In addition, trees that died
since the previous census are noted as still standing, down, or missing. More than
one condition is recorded for many trees; in these cases, we used the most severe
condition for analysis. In this manner, we assigned trees to one of eight possible
categories (Table 3).
Table 3 Tree Condition Categories Used in Analysis of the Barro Colorado Data
Condition    Definition                                                  Count
Normal       No special code recorded                                  175,077
Buttressed   Buttressed to at least 1.3 m, but not leaning or broken     2,462
Multistem    Multistemmed plant, not buttressed                         20,542
Leaning      Leaning, but not broken or dead                            12,757
Broken       Broken above 1.3 m, but not dead                           11,816
Dead         Dead, not downed or missing                                86,491
Downed       Dead, trunk lying on ground                                 7,408
Missing      Tree recorded in earlier survey, but not found in 2005     51,569
Note: Categories were aggregated from the original data, in which a single tree could be
assigned multiple codes.
Several hypotheses relating to spatial patterns of growth and mortality naturally
arise from these data. For example, one hypothesis is that all types of mortality are
colocated. An alternative hypothesis is that wind-controlled mortality events
exhibit a spatial pattern distinct from senescence, disease, and other types of mortality
that leave a tree standing. Other hypotheses relate to the causes of certain
morphological characteristics. For example, buttressing and multistemmed growth may be
the result of favorable growing conditions or, conversely, a response to adverse
conditions. Spatial patterns of colocation provide both direct and indirect evidence
for these types of hypotheses.
Fig. 2 portrays a portion of the data. The nearest neighbor statistic indicates that
the overall spatial pattern is slightly clustered (R = 0.982; P < 0.01). By itself, this
minor departure from randomness is not a great concern. However, careful examination
of Fig. 2 reveals that the pattern is more nuanced: apparent clustering is the
result of large gaps containing few or no trees, likely caused by the presence of
roads, rivers, or recent disturbances. If these gaps are excluded, the tree pattern in
the remaining areas is likely to exhibit a slightly dispersed pattern. When the CLQ
matrix is computed for these data, several significant patterns stand out.
Figure 2. Portion of the Barro Colorado tree data set.
First, same-category CLQs are significantly greater than one for seven of the eight
categories. This tendency for trees of the same category to colocate results in a weak
but highly significant global CLQ (CLQ_{Global} = 1.13). Second, patterns of colocation
suggest common disturbances, perhaps from wind: live trees that are leaning or
broken are strongly colocated with each other and also with downed dead trees.
The strongest measures of same-category spatial autocorrelation also occur among
these three categories, which also are somewhat colocated with missing trees but
not with standing dead trees. Standing dead trees are not colocated with any other
category, a pattern that suggests that most deaths are caused by senescence or local
disease outbreaks rather than by windthrow. Also, both buttressing and multistem
growth appear to be negatively associated with disturbance. Indeed, the absence
of buttressed trees is notable in the vicinity of leaning and downed trees and
of standing dead trees. Even normal live trees show a slight tendency toward
colocation with buttressed and multistem trees. Consequently, healthy trees living in
fertile, sheltered locations appear to be more capable of growing multiple stems
and buttressed bases.
Although most relationships in this example are symmetrical, a notable asymmetry
exists in relationships that involve buttressed trees. Leaning and downed trees
avoid locating near buttressed trees (CLQ_{leaning→buttressed} = 0.56, P < 0.01;
CLQ_{downed→buttressed} = 0.71, P = 0.04). In contrast, buttressed trees locate near
leaning and downed trees only slightly less than would be expected by chance
(CLQ_{buttressed→leaning} = 0.91, P = 0.45; CLQ_{buttressed→downed} = 0.95, P = 0.78).
This finding suggests that buttressed trees tend to exclude other trees and prefer
Final considerations
The CLQ is a reenvisioning of spatial association at the point level that provides an
overall and pairwise method of describing degrees of spatial association. The pairwise
information is descriptive of the strength of a relationship and has a natural
interpretation as a multiplicative factor on probability of occurrence. Comparison
of bidirectional CLQ pairs further provides information about the potential asymmetry
among types of points. Each pair’s linkages are described simply, and the
final results describe spatial associations after controlling for the clustering level of
an overall data set, an improvement over cross-k-function analysis. Finally, the
addition of same-category associations, the ability to calculate properly with multiple
points in the same location, and a quantification of statistical significance are
substantial improvements over the NEAR statistic of Leslie and Ó hUallacháin (2006). The
CLQ statistic is an extension of work by Dixon (1994, 2002) in its evaluation
of maximum ratio values and the production of a comparison variable that has a
semantic meaning.
While one of our goals for the CLQ is to capture asymmetrical relationships,
over the course of developing the CLQ we found that the definition of asymmetry is
more nuanced than we had at first anticipated. The CLQ specifically captures
asymmetry in spatial configuration, which occurs when the nearest neighbor of the
nearest neighbor of a point is not the original point. The CLQ describes how much more
or less often than in a random distribution points of one category locate next to
points of another. It thereby supports discovery of asymmetrical relationships in which
one category appears to have a greater chance of locating next to category X, while
category X points do not reciprocate. Such relationships can lead to meaningful
insights about the underlying pattern.
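The configurational asymmetry described here can be checked directly on a point set. The following minimal Python sketch lists the points whose nearest neighbor's nearest neighbor is some other point; the function names and coordinates are hypothetical, and nearest neighbors are found by brute force.

```python
from math import dist

def nearest_neighbor_indices(points):
    """Index of each point's nearest neighbor (brute force, O(n^2))."""
    return [min((j for j in range(len(points)) if j != i),
                key=lambda j: dist(points[i], points[j]))
            for i in range(len(points))]

def non_reciprocal_points(points):
    """Indices i whose nearest neighbor's nearest neighbor is not i."""
    nn = nearest_neighbor_indices(points)
    return [i for i in range(len(points)) if nn[nn[i]] != i]

# Hypothetical cluster: a hub at the origin with three satellites around it.
points = [(0.0, 0.0), (1.0, 0.0), (-1.2, 0.0), (0.0, 1.4)]
asymmetric = non_reciprocal_points(points)
print(asymmetric)
```

All three satellites have the hub as nearest neighbor, but the hub can reciprocate with only one of them, so the other two links are one-directional. When hub and satellites belong to different categories, links of this kind are what produce unequal CLQ values in the two directions.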
Application of the CLQ to two very different data sets illustrates its general
applicability. Empirically, pairwise CLQs reveal three types of information, which
are similar for both data sets. First, significant same-type autocorrelation is strong in
both data sets. Both business establishments in Phoenix and trees on Barro Colorado
Island are more likely to have neighbors of the same sector or category than
indicated by a random distribution. Second, symmetrical associations between
categories reveal major groupings or data clusters. In Phoenix, two major groupings
are revealed: one of establishments in resource- and transportation-based sectors,
and a second of high-level services involving information, finance and insurance,
and professional, scientific, technical, and management services, though not the real
estate sector. On Barro Colorado Island, two major groupings are also revealed: one
of healthy, buttressed, and multistemmed trees, and a second of leaning, broken,
downed, and missing trees, but not standing dead trees. Some categories, such as
retail in Phoenix and standing dead trees on Barro Colorado Island, do not show such
group affinities as they have relatively weak or no positive colocation tendencies
with any other category. Third, asymmetries in pairwise CLQs reveal inequities in the
degree of influence exerted by each category. In Phoenix, asymmetries surprisingly
do not appear to follow supply-chain mechanics but instead indicate that certain
industries form core clusters with a mix of other sector types around this core. On
Barro Colorado Island, asymmetry suggests that the forces that cause trees to buttress
themselves at the base also isolate these trees from others.
The CLQ appears to be robust and stable. We propose that the CLQ be used to
examine a point data set consisting of multiple subcategories of a single type of
entity, such as trees or businesses. The cross-k-function, in contrast, should be used to
quantify the relationship between conceptually distinct entity types whose joint
population does not form a semantically meaningful unit. The CLQ should be used in
place of the cross-k-function in situations where clustering of a joint population could
confound results, though we realize that the distinction between conceptual category
types and subcategories is not always so clear-cut. In many real-world situations, the
joint population shares traits that suggest similar spatial distributions, and the purpose
of analysis is to identify pairwise categorical relationships beyond those expected
from a joint population. In the realm of human geography, the spatial patterns of
cities, homes, businesses, and political institutions are controlled by the overall
population distribution and transportation infrastructure, and exhibit tendencies to locate
in varying degrees of proximity to other human activity centers. Similarly, natural
categories, such as lichen, birds, and igneous rocks, have distinct spatial patterns that
can be further decomposed into subcategories that might exhibit varying degrees of
spatial correlation with each other. In these cases, the CLQ can reveal interesting
patterns of association that may shed light on underlying processes of attraction,
repulsion, dependency, and resource requirements.
A BSD-licensed software tool to calculate the CLQ for any point shapefile is
available at (accessed April 18, 2011).
