BioMed Central
International Journal of Health
Geographics
Open Access
Research
On the use of ZIP codes and ZIP code tabulation areas (ZCTAs) for
the spatial analysis of epidemiological data
Tony H Grubesic*1 and Timothy C Matisziw2
Address: 1Department of Geography, Indiana University, Bloomington, IN 47405-7100, USA and 2Center for Urban and Regional Analysis, The
Ohio State University, Columbus, OH 43210-1361, USA
Email: Tony H Grubesic* - tgrubesi@indiana.edu; Timothy C Matisziw - matisziw.1@osu.edu
* Corresponding author
Abstract
Background: While the use of spatially referenced data for the analysis of epidemiological data is
growing, issues associated with selecting the appropriate geographic unit of analysis are also
emerging. A particularly problematic unit is the ZIP code. Because ZIP codes and ZIP code
tabulation areas (ZCTAs) lack standardization and are highly dynamic in structure, their use for the
spatial analysis of disease presents a unique challenge to researchers. Problems associated with these units
for detecting spatial patterns of disease are explored.
Results: A brief review of ZIP codes and their spatial representation is conducted. Though
frequently represented as polygons to facilitate analysis, ZIP codes are actually defined at a
narrower spatial resolution reflecting the street addresses they serve. This research shows that
their generalization as continuous regions is an imposed structure that can have serious
implications for the interpretation of research results. ZIP code areas and Census-defined ZCTAs,
two commonly used polygonal representations of ZIP code address ranges, are examined in an
effort to identify the spatial statistical sensitivities that emerge given differences in how these
representations are defined. Here, comparative analysis focuses on the detection of patterns of
prostate cancer in New York State. Of particular interest for studies utilizing local spatial statistical
tests is that differences in the topological structures of ZIP code areas and ZCTAs give rise to
different spatial patterns of disease. These differences are related to the different methodologies
used in the generalization of ZIP code information. Given the difficulty associated with generating
ZIP code boundaries, both ZIP code areas and ZCTAs contain numerous representational errors
which can have a significant impact on spatial analysis. While the use of ZIP code polygons for spatial
analysis is relatively straightforward, ZCTA representations contain additional topological features
(e.g. lakes and rivers) and contain fragmented polygons that can hinder spatial analysis.
Conclusion: Caution must be exercised when using spatially referenced data, particularly that
which is attributed to ZIP codes and ZCTAs, for epidemiological analysis. Researchers should be
cognizant of representational errors associated with both geographies and their resulting spatial
mismatch, especially when comparing the results obtained using different topological
representations. While ZCTAs can be problematic, topological corrections are easily implemented
in a geographic information system to remedy erroneous aggregation effects.
Published: 13 December 2006
International Journal of Health Geographics 2006, 5:58 doi:10.1186/1476-072X-5-58
Received: 16 October 2006
Accepted: 13 December 2006
This article is available from: http://www.ij-healthgeographics.com/content/5/1/58
© 2006 Grubesic and Matisziw; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background
As the production and consumption of spatial data continue
to increase, the use and abuse of spatially
referenced data are also on the rise. Jacquez [1]
provides a timely review of the key issues, outlining a
number of limitations to working with spatial and tempo-
ral data. For example, one of the major issues confronting
analysts is spatiotemporal mismatch. Broadly defined,
this occurs when data collected in both space and time do
not coincide. For example, Jacquez [1] highlights a recent
study of lung cancer on Long Island that used cancer data
collected at the ZIP+4 level reported for 1994–97 [2].
Cancer incidence was then compared to air toxics data
from the Environmental Protection Agency for 1996. In
this particular instance, the mismatch is both spatial and
temporal.
A second concern highlighted by Jacquez [1] and others
[3-5] is the issue of granularity in epidemiological data. In
sum, granularity deals with the spatial and temporal reso-
lution of data. Because human health applications must
adhere to patient privacy protocols, individual level data
is frequently aggregated to larger spatial units for analysis.
For instance, rather than utilizing geocoded household
data corresponding to individual patients, these records
are aggregated to the ZIP code level for analysis. This proc-
ess prevents unwanted disclosure or reconstruction of
patient identity [1]. However, it also reduces the ability for
analysts to compare data across spatial units. For example,
if one set of data is aggregated to census tracts and another
set to ZIP codes, issues relating to the modifiable areal
unit problem emerge [6].
A third major issue of interest is more technical in nature,
that of polygons, topology and computational geometry.
As noted by Jacquez [1], many spatial statistical tech-
niques are predicated on the accurate representation of
areal units (polygons), points and lines. If there are prob-
lems with areal units, such as self intersection, the result-
ing statistical analyses can be interlaced with errors.
As with most technical issues, epidemiologists, geogra-
phers and other analysts are aware of the limitations and
caveats of working with spatial data. For example, in a
study of cerebrovascular disease in New York State, Han et
al. [7] note:
"[t]here may be some bias related to spatial mismatch,
since we have used ZIP-code level hospitalization data
and ZCTA-level population and income data in our anal-
ysis.... Unfortunately, we could not find any empirical
study that validates this issue of spatial mismatch."
Of particular interest in the previous statement is the issue
of bias and spatial mismatch between ZIP code areas and
ZIP code tabulation areas (ZCTA). In fact, the problems of
spatiotemporal mismatches between these two units have
largely gone unnoticed. While Krieger et al. [8] provide a
brief overview regarding many of the technical differences
between ZIP codes and ZCTAs, a full treatise of the differ-
ences, particularly how these differences may bias empiri-
cal analysis, is not available.
The purpose of this study is to 1) reexamine the use and
misuse of ZIP codes and ZCTAs for epidemiological anal-
ysis, 2) provide enough technical detail on the construc-
tion of ZIP code and ZCTA boundaries, and their
associated characteristics, to supply analysts with a more
complete picture of their utility for spatial analysis, 3) pro-
vide an empirically based analysis of the spatial and statis-
tical mismatch between ZIP code areas and ZCTAs,
highlighting their relative weaknesses, and 4) develop a
methodological approach for rectifying the problems
inherent to ZCTA topologies, so that more direct compar-
isons between ZCTA and ZIP code-based analysis may be
performed.
Results and discussion
Issues of spatial misrepresentation and mismatch
In the context of longitudinal spatial analyses, the ability
to match spatial units through time is important. Fortu-
nately, the hierarchically nested spatial units provided by
the Census Bureau (e.g. blocks, block groups, tracts, coun-
ties, etc.) simplify this task. In most cases, changes to the
spatial structure of Census tracts and even block groups,
can be tracked between the decennial surveys. As a result,
accurate longitudinal analyses are much easier to perform.
However, for temporally and spatially dynamic areal units
that are not hierarchically nested, the problems of spatio-
temporal mismatch are significant. Not surprisingly, the
ZIP code and its spatial characteristics are of concern.
Exceedingly popular for epidemiological analysis, the ZIP
code has become a de facto spatial unit for the study of
disease distribution and etiology [9-13].
Zone Improvement Plan codes, or ZIP codes as they are
commonly known, originated as a way of classifying street
segments, address ranges and delivery points to expedite
the delivery of mail. Given that ZIP codes can be associ-
ated with most places of human habitation in the United
States, they present researchers with an alternative means
of collecting, visualizing, and analyzing spatial informa-
tion. However, given their use in directing the distribution
of mail, ZIP codes are not attributed to space in general,
but rather to roads, post offices, and other facilities within
the U.S. postal system. For instance, if an area does not
have a recognized delivery point or address range, no ZIP
code is assigned. Geographically, the best examples of this
are in desolate and uninhabited places such as the Sonoran
Desert in Arizona, the Mojave Desert in California and the
Klamath Mountains in Oregon. Simply put, if no residen-
tial areas or business establishments exist, there is no need
to deliver mail or assign a five-digit ZIP code. The process
of making ZIP codes accessible for spatial analysis has
involved their generalization into polygonal units repre-
senting the spatial extent of ZIP code delivery areas
(referred to here as ZIP code areas). In large part, the tiling
of the United States with ZIP code areas has been accom-
plished by various private data vendors. More recently, the
U.S. Census Bureau has produced its own ZIP code topol-
ogy for area based representations – ZIP Code Tabulation
Areas (ZCTAs).
The use of ZIP codes for applications other than postal
delivery can present many challenges and there are several
major issues worth summarizing. First, the United States
Postal Service (USPS) makes updates to its ZIP codes reg-
ularly [14], providing this information in the biweekly
Postal Bulletin. However, for analysts unfamiliar with a
particular area, understanding the magnitude and nature
of these changes is a challenge. For example, it is not
uncommon for postal delivery routes to be realigned or
for ZIP codes to be split. More importantly, ZIP codes can
be discontinued, added or expanded between months/
years. Thus, where longitudinal studies are concerned,
even the slightest modification in ZIP codes and their
associated coverage can create a spatiotemporal disconti-
nuity [8]. Many private data vendors update ZIP code area
databases quarterly. However, even this relatively short
time-lag between updates can be problematic for areas
where significant changes were made, particularly for syn-
dromic surveillance or infectious outbreaks. Further, if
analysts fail to make use of available updates, problems
can also emerge. Another difficulty associated with ZIP
code areas is the significant variation in geographic extent
[8,10]. Grubesic [15] notes that the average size of a ZIP
code area in Wyoming is 1,430 km², while
the average size of a ZIP code area in New Jersey is 12.8
km². The USPS does not attempt to optimize the size or population
allocation of ZIP codes, given that the sole purpose
of the ZIP code is to expedite the delivery of mail. As a
result, ZIP codes can range in size from a single building
to a delivery zone spanning hundreds of square miles and
crossing several political jurisdictions [16].
As mentioned earlier, ZCTAs were developed as spatial
units by the U.S. Census Bureau for the 2000 decennial
census. In fact, ZCTAs were specifically designed to "meet
requests by data users for statistical data by ZIP Code area"
[17]. Given the Census Bureau's motivations, Krieger et al.
[8] note that there are significant differences in the technical
definitions of ZIP code areas and ZCTAs. Table 1
highlights the technical details of ZCTAs. First, ZCTAs can
be discontiguous. By definition, spatial contiguity refers to
the ability to travel from any point in a polygon to any
other internal point without leaving it. Where two or
more polygons are considered, spatial contiguity is the
property of sharing a common boundary or vertex [18].
The lack of spatial contiguity can have a dramatic impact
on spatial statistical analysis, particularly if ZCTAs with a
common identifier are split into different non-adjacent
polygons. Second, ZCTAs are compiled based on census
block topology. In the generation of a ZCTA, each under-
lying block is assigned one, and only one, ZCTA code –
regardless of its location. Therefore, a block that straddles
more than one ZIP code is still assigned to a single ZCTA. This
can be problematic when aggregating population data to
both units.
To provide some perspective on the extent of these prob-
lems, consider the following. Table 2 highlights the
numerical differences between unedited ZIP code and
ZCTA geographic base files (GBF) available for New York
State. In addition to there being 851 additional entries/
polygons in the ZCTA file, the average size of these polygons
is significantly smaller (51.90 km² vs. 70.26 km²)
than those found in the ZIP code GBF. While the numerical
characteristics of these files are certainly different,
these statistics only hint at the severity of spatial mismatch
present between these two geographies.
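The polygon counts quoted here can be checked with simple arithmetic. The short sketch below uses only the figures reported in Table 2; the variable names are illustrative and not part of the original analysis.

```python
# Figures reported in Table 2 for New York State (year 2000 files).
zcta_polygons = 2450
zcta_unique_codes = 1676
zip_polygons = 1599
zip_unique_codes = 1599

# Additional entries in the ZCTA file relative to the ZIP code file.
extra_polygons = zcta_polygons - zip_polygons

# Polygons beyond the first for a given code. This is zero for the
# ZIP code file, where every polygon carries a distinct code.
zcta_repeat_polygons = zcta_polygons - zcta_unique_codes
zip_repeat_polygons = zip_polygons - zip_unique_codes

print(extra_polygons, zcta_repeat_polygons, zip_repeat_polygons)
```

The first value reproduces the 851 additional entries noted in the text. The second shows that 774 ZCTA polygons repeat a code already used by another polygon, which is the fragmentation examined later in the paper.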
As discussed earlier, ZIP code information is often used to
generate polygonal representations of ZIP code delivery
areas. During this conversion process, the vast majority of
the spatial mismatch problems begin to arise. In large
part, this can be attributed to attempts to generalize linear
features (i.e. street segments) into zones for representa-
tional convenience [15]. For instance, Figure 1 illustrates
ZIP code 14225 in Buffalo, New York. In this example, the
ZIP code boundary is clearly demarcated as a discrete unit
by polygonal boundaries [19]. However, because ZIP
codes are, in fact, associated with linear features, the
actual boundaries of 14225 are not so clear-cut. As displayed
in Figure 1, there are a total of seven other streets
in the 14225 polygon that actually belong to alternative
ZIP codes.
Table 1: A Summary of Census ZCTA Characteristics
1. ZCTAs are linked to Census blocks and every tabulation block has a single ZCTA code
2. ZCTAs cover all tabulation blocks in the United States and Puerto Rico
3. ZCTAs may consist of two or more discontiguous areas
4. A ZCTA code represents a five-digit ZIP code where possible
5. In large undeveloped areas where there are no master address file (MAF) addresses with five-digit ZIP codes, the ZCTA code assigned is based on the three-digit ZIP code (e.g. XX for tracts of undeveloped land and HH for water features)
The implications for such spatial misrepresentations
can be problematic, particularly if one considers
the application of geocoded data for epidemiological
analysis [20]. When individual records are geocoded to a
street address, point-based representations of latitude and
longitude coordinates are assigned to a street centerline,
and then placed at an appropriate offset distance to repre-
sent the location of a household or business [21,22].
However, if the actual location of the street segment and
its associated centerline deviates from its "native" ZIP
code polygon, both uncertainty and error can be intro-
duced to the analysis, even if the geocode is a perfect
match. For example, a geocoded point might be assigned
to the correct ZIP code, based on the underlying network
data, but the ZIP code area or ZCTA covering its actual
location could be different. In other words, the network
data and the ZIP polygons are not in correspondence.
Therefore, although the data was accurately aggregated to
the appropriate ZIP code, its spatial representation will
not be accurately accounted for in the analysis. Similarly,
if patients' ZIP codes are collected and attributed to poly-
gons based on an obsolete ZIP code topology, error is also
introduced. Further, even when public health agencies
avoid more traditional geocoding routines (i.e. point-
based representation of latitude and longitude coordi-
nates) problems may emerge. For example, situations
exist where geocoding based on the street network can
fail. In these cases, analysts may attribute ZIP code
information based on visual inspection, possibly resulting
in a misclassification. While one or two of these errors
might not make a significant difference to a local study,
the accumulation of error for statewide or national-level
analyses can be significant.
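The mismatch described in this paragraph can be screened for by comparing the ZIP code carried on each address record with the code of the polygon that contains its geocoded point. The following is a minimal sketch of that check, not the procedure used in this study. It assumes simple polygons given as vertex lists and uses a standard ray-casting test. The function names, the toy geometry and the records are hypothetical; the codes 12345 and 12347 are borrowed from the example neighborhood in Figure 3.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # vertices in order, first vertex not repeated


def point_in_polygon(pt: Point, poly: Polygon) -> bool:
    """Ray-casting test; points lying exactly on an edge get no special handling."""
    x, y = pt
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        # Does this edge cross the horizontal ray running right from pt?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside


def containing_zip(pt: Point, zip_polygons: Dict[str, Polygon]) -> Optional[str]:
    """Code of the first polygon that contains pt, or None."""
    for code, poly in zip_polygons.items():
        if point_in_polygon(pt, poly):
            return code
    return None


def find_mismatches(records, zip_polygons):
    """Records whose address ZIP differs from the polygon they fall in."""
    mismatches = []
    for record_id, address_zip, pt in records:
        polygon_zip = containing_zip(pt, zip_polygons)
        if polygon_zip != address_zip:
            mismatches.append((record_id, address_zip, polygon_zip))
    return mismatches


# Toy geometry: two adjacent squares standing in for ZIP code areas.
zip_polygons = {
    "12345": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "12347": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
# Hypothetical geocoded records: (id, ZIP on the address, point).
records = [
    ("a", "12345", (1.0, 1.0)),  # address ZIP and polygon agree
    ("b", "12347", (1.5, 0.5)),  # a non-native segment inside 12345
]
mismatches = find_mismatches(records, zip_polygons)
```

Record "b" is flagged because its address belongs to 12347 while its point falls inside the 12345 polygon, which is the situation illustrated for 14225 in Figure 1.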
In an effort to diagnose the local level of uncertainty asso-
ciated with the problem of non-native street segments
within ZIP code polygons, consider Figure 2. Displayed
are the results of a calculation developed for this paper
called the Coefficient of ZIP Code Uncertainty, or CZUi.
CZUi measures the local concentration of non-native
street segments within a ZIP code area relative to the
number of non-native segments for all ZIP codes in New
York State. As a diagnostic, the resulting index values pro-
vide a baseline measure of spatial uncertainty and poten-
tial representational error associated with each ZIP code.
The interpretation of CZUi is as follows:
CZUi < 1: decreased level of uncertainty
CZUi = 1: average level of uncertainty
CZUi > 1: increased level of uncertainty
Figure 2 suggests that while many of the GDT ZIP codes in
New York State include fewer than expected numbers of
non-native street segments, many others display an
increased level of uncertainty. Clearly, this suggests the
presence of a relatively substantial gap between the ZIP
codes assigned to linear features and their location relative
to interpolated ZIP code areas. Interestingly, much of this
uncertainty can be attributed to the process of ZIP code
polygon interpolation, which is outlined in the next sec-
tion.
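The formula for CZUi is not given in this section. The sketch below shows one plausible reading of the verbal definition, and it is an assumption rather than the authors' stated calculation. It treats CZUi as a location quotient: the share of non-native segments in ZIP code area i divided by the statewide share, so that a value of 1 corresponds to the average level of uncertainty. The function name and the segment counts are hypothetical.

```python
def czu(non_native, total):
    """Location-quotient form (an assumption, not the paper's stated formula).

    non_native: {code: count of non-native street segments in the area}
    total:      {code: count of all street segments in the area}
    """
    statewide_share = sum(non_native.values()) / sum(total.values())
    return {
        code: (non_native[code] / total[code]) / statewide_share
        for code in total
    }


# Hypothetical segment counts for three ZIP code areas labeled A, B and C.
non_native_segments = {"A": 2, "B": 6, "C": 4}
all_segments = {"A": 40, "B": 40, "C": 40}
scores = czu(non_native_segments, all_segments)
```

With these toy counts, area A scores below 1, area C scores about 1 and area B scores above 1, matching the three interpretation bands listed above. A different normalization, such as dividing by the mean count of non-native segments per ZIP code, would also fit the verbal definition.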
ZIP code polygon interpolation
The process for developing ZIP code area polygons is rel-
atively laborious. As mentioned previously, these areal
units are not developed and distributed by the USPS [15].
Rather, private data vendors, such as GDT/TeleAtlas [19]
and Caliper [23] generate these boundaries. Boundaries
are created by using several important pieces of informa-
tion. First, data vendors leverage mail-stop (i.e. residential
and business addresses) information from the USPS and
their associated street segments. Second, other non-street
features are also analyzed, including water bodies, parks,
and large tracts of undeveloped land. Third, ZIP+4 state
directories are used to differentiate delivery zones and the
corresponding boundaries for areas that might not have a
clear-cut group of street segments. Finally, technicians
make telephone inquiries to area post offices in an effort
to determine predominant ZIP codes [24]. Once all of this
information is collected, ZIP code polygons are manually
digitized. This process, particularly the use of manual dig-
itizing routines, can lead to polygon generalization and a
"smoother" geographic boundary file.
Table 2: Numerical Differences between ZCTA and ZIP Code Geographic Base Files in New York State
                              ZCTA (2000)    ZIP Code (GDT 2000)
Number of Polygons            2,450          1,599
Number of Unique Records      1,676          1,599
Average Size                  51.90 km²      70.26 km²
Minimum Size                  0.003 km²      0.006 km²
Maximum Size                  1,054 km²      1,217 km²
Standard Deviation in Size    80.34 km²      102.71 km²
The process for developing ZCTAs by the U.S. Census
Bureau is much different. As highlighted in Table 1,
ZCTAs have some relatively distinct features that ZIP
codes do not. Many of these features relate to the charac-
teristics of the Census blocks on which they are based.
There is no standard spatial extent of Census blocks. Some
blocks are relatively small (i.e. those located in a city),
while others are large and irregular, covering many square
miles. Utilizing Census block boundaries, USPS ZIP code
data and the 2000 Master Address File (MAF)[25], the
Census Bureau calculated the numbers of addresses asso-
ciated with each ZIP code represented in each tabulation
block and then assigned the ZCTA that represented the
most frequently occurring ZIP code with preference given
to residential addresses. If no ZIP code data were availa-
ble, ZCTA codes were assigned from an adjoining block.
Finally, it is important to remember that because the size of
Census blocks varies widely over space, zone delineation is
guided more by the Census geographies than by the distri-
bution of ZIP coded addresses.
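The block-level assignment rule described above can be sketched as a plurality vote. This is an illustration under stated assumptions, not the Census Bureau's implementation. In particular, the reading of "preference given to residential addresses" (use residential counts whenever a block has any, otherwise count all addresses) and the tie-breaking behavior are assumptions. The function name, block identifiers and address lists are hypothetical.

```python
from collections import Counter


def assign_zcta_codes(block_addresses, adjoining):
    """Assign one code per block from (zip_code, is_residential) address lists."""
    assigned = {}
    for block, addresses in block_addresses.items():
        residential = Counter(z for z, is_res in addresses if is_res)
        everything = Counter(z for z, _ in addresses)
        # Assumed reading of the residential preference: use residential
        # counts when any exist, otherwise fall back to all addresses.
        counts = residential or everything
        if counts:
            # Ties go to the code encountered first (an arbitrary choice).
            assigned[block] = counts.most_common(1)[0][0]

    # Blocks without ZIP data borrow the code of an adjoining block
    # that has data of its own.
    with_data = dict(assigned)
    for block in block_addresses:
        if block not in with_data:
            for neighbor in adjoining.get(block, []):
                if neighbor in with_data:
                    assigned[block] = with_data[neighbor]
                    break
    return assigned


# Toy blocks: b2 has more non-residential 12346 addresses than
# residential 12347 addresses; b3 has no address data at all.
blocks = {
    "b1": [("12345", True), ("12345", True), ("12347", True)],
    "b2": [("12347", True), ("12346", False), ("12346", False)],
    "b3": [],
}
codes = assign_zcta_codes(blocks, {"b3": ["b2"]})
```

Every address in a block inherits the winning code, so the minority 12347 address in block b1 ends up in ZCTA 12345. This is how a block-based ZCTA can diverge from the ZIP codes of the addresses it contains.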
Figure 3 displays an example neighborhood that graphi-
cally highlights a few of these quirks. For instance, the
United States Postal Service assigns a ZIP code of 12345 to
both sides of Park Ave, but assigns a ZIP code of 12347 to
segments south of Park Ave, including Rogers St. While
this appears to be an oddity, the USPS often utilizes rear
property lines for assigning ZIP codes [17]. Therefore, the
resulting ZIP code polygon that straddles both sides of
Park Ave. is not surprising. However, this geographic
quirk is not characteristic of ZCTAs, because blocks are
assigned one, and only one, ZCTA code.
Figure 1: Non-native ZIP code segments within 14225 (Buffalo, NY).
Therefore, because Park Ave. is the dividing segment between two
blocks, the entire south side of Park Ave. inherits an erro-
neous ZCTA code of 12347, instead of its correct ZIP code
of 12345. A second interesting example is illustrated by
the factory located in ZCTA 12345, which is assigned a ZIP
code of 12346. In many instances, USPS customers that
receive an extraordinarily high volume of mail are
assigned their own ZIP code. This might be a large corpo-
rate campus or other institution. Because these locations
are treated as delivery points by the USPS, they are system-
atically excluded by the Census Bureau and do not appear
in the ZCTA boundary file. This is understandable since
these delivery points do not have any spatial boundaries
nor are they associated with any census related demo-
graphic or socioeconomic information. Also, the inability
to precisely locate structures and a lack of available block
boundaries for many of these locales influences the deci-
sion to exclude many of these features. Finally, the Census
Bureau assigns three digit ZIP codes (e.g. 123 HH) to areas
associated with water features and where no entries exist
within the Master Address File (MAF). However, because
Census blocks were developed before ZCTAs, the resulting
ZCTA boundaries had to conform to tabulation block
boundaries. As a result, any attempts to assign water bod-
ies, such as a river, to a ZCTA would result in a polygon
with a tail-like feature. In an effort to avoid these prob-
lems, the Census Bureau designated these areas with the
alphanumeric code rather than a five-digit ZCTA. In other
cases (not displayed in Figure 3), this might include a
123XX code. The XX codes are assigned to large tracts of
land where no mailing addresses are located and no ZIP
codes are maintained by the USPS.
Figure 2: Coefficient of ZIP Code Uncertainty. Shown is a map of GDT ZIP code areas and the level of uncertainty associated with each.
The decision to assign a three-digit wildcard ZCTA code (e.g. HH or XX) to some
areas in the United States is a complex and speculative one
[17]. Given that ZCTA geographies incorporate these
additional landscape features, problems often arise in
assessing ZCTA contiguity.
To illustrate the topological problems that
water features create in the ZCTA geographic base file,
consider Figure 4, which shows Blossvale, NY (13308), a
small community near Syracuse, located north of Inter-
state 90 and about 2 miles northeast of Lake Oneida. The
13308 ZIP code (as merged by the New York State Depart-
ment of Health), also includes the communities of Sylvan
Beach, North Bay, Verona Beach and McConnellsville.
The standard GDT (2000) ZIP code boundaries for Bloss-
vale are highlighted in yellow. The ZCTA boundaries for
the same ZIP code and the neighboring Lake Oneida are
displayed in red. There are several critical points worth
addressing here. First, the 13308 ZCTA and GDT ZIP code
area representations are not in complete spatial corre-
spondence, given that there are a number of slight devia-
tions between these two areal units. Clearly, this
represents a spatial mismatch. Second, notice that a small
water feature, Fish Creek, cuts the 13308 ZCTA in half.
When one examines the raw geographic base files for
ZCTAs, 13308 actually appears twice. That is, there are two
separate and distinct entries in the geographic base file for
the 13308 ZCTA. Thus, if the ZCTA remains uncorrected,
data assigned to the ZCTA will be represented twice. Addi-
tionally, if an adjacency matrix is constructed, as is often
necessary in spatial statistical analysis, the 13308 ZCTAs
are not treated as neighbors because they are split by the
130 HH water feature polygon. Therefore, inclusion of
these polygons can muddle spatial relationships between
ZCTAs that have socioeconomic, demographic and epide-
miologic data associated with them. Clearly, any lack of
adjustment to the ZCTA geographic base file incorporates
these types of errors into the subsequent analysis.
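The double-counting problem can be reproduced with a few lines of code. The sketch below is illustrative only: the base-file rows mimic the Blossvale example, while the neighboring code 99999 and all case counts are hypothetical.

```python
from collections import Counter

# One row per polygon in an unedited base file (toy data).
# "13308" and "130HH" come from the Blossvale example; "99999" is a
# hypothetical neighboring code.
base_file = [
    ("13308", "part west of Fish Creek"),
    ("13308", "part east of Fish Creek"),
    ("130HH", "Fish Creek water feature"),
    ("99999", "hypothetical neighbor"),
]

# Hypothetical case counts, keyed by five-digit code.
cases_by_code = {"13308": 10, "99999": 4}

# Codes that occur on more than one polygon.
code_counts = Counter(code for code, _ in base_file)
repeated = {code for code, n in code_counts.items() if n > 1}

# A naive attribute join copies the count onto every matching polygon.
joined = [(code, cases_by_code.get(code, 0)) for code, _ in base_file]
naive_total = sum(count for _, count in joined)
true_total = sum(cases_by_code.values())

print(repeated, naive_total, true_total)
```

Because 13308 appears on two polygons, its 10 cases are attached twice, and the joined file reports 24 cases against a true total of 14. Listing the repeated codes before any join is a quick way to find the entries that need correction.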
Given this background in ZIP code area interpolation and
ZCTA development, there are several questions remaining
to be answered. First, how do these potential spatial
inconsistencies manifest in the real world? Second, what
kind of impact would these problems have on spatial-sta-
tistical analysis? Third, how does one correct these prob-
lems to ensure consistency and accuracy in an analysis?
Mitigating topological anomalies in the ZCTA geographic base file
To illustrate some of the issues associated with use of ZIP
code areas and ZCTAs in spatial analysis, both topologies
for New York State were obtained for analysis. In order to
compare ZIP code areas with ZCTAs in New York, several
important steps must be undertaken to mitigate the topo-
logical anomalies between these two geographic base
files. Based on year 2000 ZIP code data from GDT, New
York is covered by 1,599 ZIP code areas. Conversely,
2,450 Census ZCTAs cover the state (Table 2). In part, this
high number of ZCTAs is a product of the 398 water fea-
tures found in the state that fragment the ZCTAs. To bring
these two geographies into greater accord, several steps
must be taken to adjust the ZCTA file for the presence of
these features [15]:
1. In order to rectify the topological anomalies in the
ZCTA file, one must remove all ZCTAs with HH codes.
This eliminates all water features in the file. While the fea-
tures are still visible, they are no longer entities in the geo-
graphic base file. It is not as critical to remove features
with XX codes, because these actually do represent land
masses with no formal addresses in the system, rarely
splitting a ZCTA into multiple features like a river or creek
might (See Figure 4).
2. All five-digit ZCTA entries that consist of multiple pol-
ygons (e.g. split by a water feature) must be dissolved on
a common attribute ID. In virtually every case, this can be
the ZCTA code. The dissolve process merges polygons into
single features, removing double or triple entries in the
geographic base file and ignoring any splits in polygon
continuity that may have been created by water features.
3. Cancer incident cases, population, or whatever varia-
bles of interest are being analyzed, must be reaggregated
back to the topologically rectified ZCTA geographic base
file for analysis. This effectively removes the aggregation
errors (e.g. double counting) from the original file.
4. Finally, if one is conducting a spatial statistical analysis
that relies on neighborhood information, the adjacency
matrix must be recalculated using the rectified ZCTA file.
Again, because the water features are removed, and ZCTA
polygons are now dissolved on a common attribute, the
newly calculated adjacency matrix will represent a more
realistic and accurate snapshot of spatial relationships
between polygons.
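The four rectification steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the GIS workflow used in the study: polygons are stood in for by sets of grid cells, the records and counts are hypothetical, and the function names (`rectify`, `reaggregate`, `queen_adjacency`) are invented for the example.

```python
from collections import defaultdict

# Toy stand-in for a ZCTA file: each record is one polygon, described by the
# grid cells it covers (hypothetical data, not real New York geography).
polygons = [
    {"zcta": "13308", "cells": {(0, 0), (0, 1)}},
    {"zcta": "133HH", "cells": {(1, 0), (1, 1)}},  # water feature
    {"zcta": "13308", "cells": {(2, 0)}},          # same ZCTA, split by water
    {"zcta": "13309", "cells": {(2, 1), (3, 1)}},
    {"zcta": "133XX", "cells": {(3, 0)}},          # land without addresses
]

# Block-level counts keyed by the ZCTA each block belongs to (hypothetical).
blocks = [
    {"zcta": "13308", "cases": 3, "males": 400},
    {"zcta": "13308", "cases": 1, "males": 150},
    {"zcta": "13309", "cases": 2, "males": 300},
]


def rectify(polys):
    """Steps 1-2: drop HH (water) polygons, then dissolve on the ZCTA code."""
    dissolved = defaultdict(set)
    for p in polys:
        if p["zcta"].endswith("HH"):
            continue
        dissolved[p["zcta"]] |= p["cells"]
    return dict(dissolved)


def reaggregate(block_rows, areas):
    """Step 3: sum each block exactly once into its rectified ZCTA."""
    totals = {code: {"cases": 0, "males": 0} for code in areas}
    for b in block_rows:
        if b["zcta"] in totals:
            totals[b["zcta"]]["cases"] += b["cases"]
            totals[b["zcta"]]["males"] += b["males"]
    return totals


def queen_adjacency(areas):
    """Step 4: two areas are neighbors if any of their cells touch,
    including diagonally (a grid analogue of queen's contiguity)."""
    def touches(a, b):
        return any(abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1
                   for r1, c1 in a for r2, c2 in b)
    codes = sorted(areas)
    return {i: [j for j in codes if j != i and touches(areas[i], areas[j])]
            for i in codes}


areas = rectify(polygons)               # one entry per remaining ZCTA code
totals = reaggregate(blocks, areas)     # counts per rectified ZCTA
neighbors = queen_adjacency(areas)      # adjacency rebuilt after rectification
```

In a real workflow the dissolve and adjacency steps would be carried out on actual polygon geometry in a GIS package; the grid cells here only make the logic of each step visible.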
After correcting for the water polygons, the ZCTA and ZIP
code area boundary files are in nearly complete corre-
spondence. For the analysis that follows, ZIP code based
prostate cancer incidence data were obtained from the New York
State Department of Health (NYSDOH) [26]. As discussed
in the methodology section, data for some ZIP code areas
were aggregated in this particular dataset. In an attempt to
accurately represent this data, both the New York ZIP code
area and ZCTA geographies used in this analysis were sub-
ject to similar aggregation of areas where necessary. Given
this aggregation, the GDT ZIP code areas, subsequently
modified to meet confidentiality requirements by the
International Journal of Health Geographics 2006, 5:58 http://www.ij-healthgeographics.com/content/5/1/58
NYSDOH, numbered 1,384 while the topologically
adjusted ZCTA file now includes 1,389 areas – yielding a
difference of only 5 polygons. This small difference can be
attributed to five partitions of land with no five-digit ZIP
codes – areas maintained by the Census Bureau in the
ZCTA file (i.e. XX codes).
Statistical mismatch
Figure 5 displays incidence of prostate cancer in New York
State for 1999–2003 which was collected from the New
York State Cancer Registry [26]. Specifically, Figure 5a
illustrates the prostate cancer rates using the ZIP code pol-
ygons based on modified GDT data. In contrast, Figure 5b
illustrates prostate cancer rates using ZCTA polygons from
the year 2000 distributed from the U.S. Census Bureau.
Cartographically, there is little discernable difference
between these two maps. Given this distribution of rates,
a formal epidemiological analysis might seek an approach
that facilitates the identification of high-risk ZIP codes or
groups of ZIP codes for intervention. Such analysis might
also benefit from the identification of low-risk ZIP codes
or groups of ZIP codes for additional exploration. For
example, Han et al. [7] utilized ZCTAs and cluster analysis
to explore the geographical variation of cerebrovascular
disease in New York State, while Moonan et al. [27] used
ZCTAs and basic cartographic analysis to examine areas of
tuberculosis transmission and incidence.
Figure 3. Example ZCTA Neighborhood. [Diagram showing Census Blocks 1 to 4; ZCTAs 12345, 12347, 12348, 12349 and 123HH; ZIP Codes 12345, 12346 and 12347; Park Ave, Rogers St and Durbin Ct; and Lake Suitland. Annotations: ZIP codes are based on street segments and address ranges, not block boundaries; P.O. Box serving 209 Durbin Ct. (12340); high volume mail receiver with a unique ZIP code not appearing in the ZCTA universe; uninhabited area, no Master Address File (MAF) records.]
For the purposes of this study, our goal does not include
a formal epidemiological analysis of prostate cancer, per
se. We are primarily interested in identifying the potential
spatial and statistical mismatches between results
obtained through the use of ZIP code area and ZCTA
geographies. Interestingly, Figure 5 indicates relatively
substantial differences in prostate cancer rates when com-
paring the descriptive statistics between ZIP code areas
and ZCTAs. As noted in the introductory section, topolog-
ical issues associated with these areal units are critical
when conducting spatial statistical analyses. In an effort to
illustrate the problem of spatial mismatch and the impact
of topology, consider Figure 6. Figure 6a illustrates statis-
tically derived prostate cancer clusters for New York State,
generated using a local indicator of spatial association
(Moran's I) [28,29], based here upon a first order queen's
contiguity. Specifically, the areas represented in Figure 6
correspond to one of five classifications generated
through the test of local spatial association. For example,
areas denoted in the darker red are indicative of ZIP code
areas of high prostate cancer rates which are surrounded
by other high rate ZIP code areas. Conversely, ZIP codes
denoted in the darker blue color are indicative of low rate
areas surrounded by other low rate areas. The remaining
classifications are high-low, low-high and not significant
(significance assessed at p <= 0.05). It is important to note that Figure 6a utilizes
the GDT ZIP code areas while Figure 6b utilizes ZCTAs.
When comparing these two figures, there are some
remarkable differences in the statistical results. Even the
simplest visual inspection suggests that these patterns of
spatial association between ZIP code areas and ZCTA data
do not match, even though the underlying data on prostate
cancer incidence is identical. For example, Figure 6a
displays a relatively large area of high-high ZIP codes in
the Adirondacks and several low-low clusters in portions
of Western New York and Long Island. This is not corrob-
orated by the pattern generated using ZCTAs as units of
analysis. Statistically, the differences are also relatively
obvious. For instance, there are 108 ZIP code areas classi-
fied as low-low clusters in New York. Conversely, only 96
ZCTAs are classified as low-low.
Figure 4. Topological Disruption: Shown is ZIP code area 13308 and its companion 13308 ZCTA for Blossvale, NY. Also displayed is Lake Oneida and the 133HH code for the corresponding water bodies in the ZCTA geographic base file. [Map labels: Lake Oneida; Fish Creek.]
Figure 5. Prostate Cancer Rates in New York State, 1999–2003. (a) GDT ZIP code boundaries; (b) Census 2000 ZCTA boundaries. Descriptive statistics printed on the maps, in the order given: Mean = 885.13, Max = 7432.43, Min = 0.0, SD = 481.26; and Mean = 860.52, Max = 4950.5, Min = 0.0, SD = 406.80.
There are four major reasons these differences in patterns
emerge. First, although all of the ZIP code areas and
ZCTAs share identical identifier codes (e.g. 12065), this
does not guarantee that they share the same geographic
boundary or extent. For example, Figure 7 illustrates a
composite map of four ZIP code areas and ZCTAs in
Upstate New York. In this case, there is a clear difference
in spatial extent and bounding between the two geo-
graphic base files. As a result, when a spatial weights
matrix is constructed, the local neighborhoods for each of
these ZIP codes will be different. Further, once a statistical
test is constructed for examining local spatial association,
the derived results will also be different (see Figure 6). A
second factor relates to the inclusion of XX coded ZCTAs
in the spatial adjacency matrix. While it is possible to
remove these polygons, the resulting map does not con-
vey the true geography of New York State. Moreover,
because these polygons do represent a landmass, it is
important to include them to assure the continuity of the
spatial weights matrix. A third problem relates to how
other spatial data can be associated with the
ZIP code areas and ZCTAs. For example, in this study,
Census block population data are used to calculate pros-
tate cancer rates. Specifically, male population for each
block was aggregated to each ZIP code area and ZCTA,
ensuring that each block was only counted once. Clearly,
if the ZIP code area and ZCTA polygons are different in
spatial extent, the results of this aggregation process will
differ. As Figures 5 and 6 suggest, these differences can
substantially impact the resulting analysis. Finally, many
of the more obvious spatial mismatches in New York are
in sparsely populated areas such as the Adirondack Moun-
tains. In part, this can be attributed to the sensitivity of the
local Moran's I test to low population counts. In these
instances, cluster results can fluctuate dramatically based
on small differences in observed cases [30]. That said,
there are still numerous cases of spatial mismatch in heav-
ily populated areas, particularly Long Island.
In summary, ZIP code areas and ZCTAs are not directly
comparable units of observation. In addition to display-
ing significant differences in size and extent, there is a
major disconnect in the way these units are generated.
These differences stem from the fact that ZIP codes are
based on address ranges developed for mail delivery, and
their representation as polygons does not accurately portray
all of the linear features in a ZIP code. Given the
methods by which these areal units are generated, there
are many instances where ZIP ranges are misclassified by
ZIP code areas and ZCTAs. Our research also suggests that
ZCTAs present some challenges that analysts must
address, particularly in their spatial representation. As
noted previously, Census blocks are used for building
ZCTA boundaries. In addition to the errors introduced by
representing linear features with polygons, each block is
assigned a single ZCTA code. While this is good for look-
ing at census data, if there is overlap or underlap between
ZIP code segments, the ZCTA zoning scheme is unable to
accurately portray these differences. Further, the incorpo-
ration of water features and uninhabited areas into the
ZCTA geographic base file can also complicate spatial
analysis.
In conclusion, the problem of spatiotemporal mismatch
is significant for ZIP codes and ZCTAs. Caution must be
used when attempting to compare statistical results across
both time and space when these units are used. More
importantly, analysts must also weigh the cost/time bene-
fits of rectifying ZCTA topology for conducting epidemio-
logical analysis. While this certainly involves more work
and GIS processing time, the benefits of these modifica-
tions are significant.
Methods
Data
Observed values of prostate cancer incidence were
retrieved from the New York State Cancer Registry. ZIP
code boundaries were created by Geographic Data Tech-
nology for the year 2000 and subsequently modified by
the NYSDOH [26]. These modifications include the fol-
lowing:
1. Some adjacent ZIP codes were combined due to confidentiality
requirements because an insufficient number
of cases of prostate cancer was reported.
2. A subset of residential point ZIP codes with no defined
delivery area and ZIPs too small to be included in the GDT
file were also combined with adjacent ZIP code areas.
3. NYSDOH also eliminated uninhabited islands from the
ZIP code area file.
ZCTA boundaries were delineated by the U.S. Census
Bureau for the year 2000. The street network used for calculating
CZUi was based on TIGER 2000 data [23].
Modeling
The coefficient of ZIP code uncertainty, CZUi, is calculated as
follows (Equation 1):
Figure 6. Clustering of Prostate Cancer Rates in New York State: 1999–2003. Shown are maps that display cluster memberships derived by a local indicator of spatial association (Moran's I). (a) GDT ZIP code boundaries; (b) Census 2000 ZCTA boundaries.
Figure 7. Spatial Mismatch Between ZIP Code Areas and ZCTAs. (a) GDT ZIP code area boundaries; (b) Census 2000 ZCTA boundaries; (c) Shown is a composite map of 13420, 13360, 13338 and 13343 in New York State. [Map labels: 13690; 13327; 133XX; "No border with 13327".]
Where
xi = the number of non-native ZIP code street segments in ZIP code i
yi = the number of street segments in ZIP code i
As mentioned previously, CZUi measures the local con-
centration of non-native street segments within a ZIP code
area relative to the number of non-native segments for a
larger spatial unit (e.g. a metropolitan area or a state). Seg-
ments with no ZIP codes were not included in this com-
putation given that there is no way of telling whether or
not they actually contained an address and which ZIP it
was attributed to. It is also important to remember that
CZUi says nothing about the length of these street seg-
ments. However, with a slight adjustment to both the
numerator and denominator, the magnitude of uncer-
tainty, as measured by the distance associated with each
non-native street segment could be quantified.
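As a rough illustration of the calculation described here, the sketch below computes the coefficient as the share of non-native segments in each ZIP code divided by the share for the whole study region, i.e. (xi / yi) / (Σ xi / Σ yi). The function name `czu` and the segment counts are hypothetical, and the sketch assumes every ZIP code has at least one street segment and that the region contains at least one non-native segment.

```python
def czu(non_native, total):
    """Coefficient of ZIP code uncertainty for each ZIP code area.

    non_native[i] is x_i, the count of non-native street segments in area i;
    total[i] is y_i, the count of all street segments in area i.
    """
    if len(non_native) != len(total):
        raise ValueError("x and y must have one entry per ZIP code area")
    # Share of non-native segments across the larger spatial unit.
    regional_share = sum(non_native) / sum(total)
    # Local share relative to the regional share.
    return [(x / y) / regional_share for x, y in zip(non_native, total)]


# Hypothetical segment counts for three ZIP code areas.
x = [5, 20, 0]
y = [100, 100, 50]
values = czu(x, y)
```

Values above 1 indicate a ZIP code area with a higher concentration of non-native segments than the region as a whole, and values below 1 indicate a lower concentration.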
ZIP code and ZCTA contiguity measurements were quantified
through the use of a spatial weights matrix, W. Elements
of W are specified as (Equation 2):
Where cij = 1 if i and j share a common boundary or vertex;
0 otherwise. For the purposes of this study, first order
properties include only those vertices and boundaries that
are contiguous to the observation (ZIP code or ZCTA) in
question (viz. a Queen's contiguity matrix). While there
are alternatives to this spatial weight matrix (e.g. rook, or
distance based), the selection of a queen's based measure
provided an effective approach for highlighting the topo-
logical complexities of the ZCTA geographic base layer. A
more robust contiguity matrix, using other spatial lags, or
polygon boundary lengths would be appropriate for a for-
mal analysis of cancer incidence and clustering.
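A minimal sketch of a first order queen's contiguity matrix follows, assuming a topologically clean boundary file in which any two polygons that touch share at least one exactly coincident vertex. The function name `queen_weights` and the three unit squares are hypothetical, and each row is divided by its number of neighbors, which is one common way of standardizing the binary cij values.

```python
def queen_weights(polygons):
    """Row-standardized queen's contiguity weights.

    polygons maps an area ID to a list of (x, y) boundary vertices.
    Two areas are treated as neighbors when they share at least one vertex.
    """
    ids = list(polygons)
    vertex_sets = {i: set(polygons[i]) for i in ids}
    weights = {}
    for i in ids:
        # c_ij = 1 for a shared vertex (or boundary), 0 otherwise.
        c = {j: 1 if j != i and vertex_sets[i] & vertex_sets[j] else 0
             for j in ids}
        row_sum = sum(c.values())
        # Areas with no neighbors keep a row of zeros.
        weights[i] = {j: (c[j] / row_sum if row_sum else 0.0) for j in ids}
    return weights


# Hypothetical squares: A and B share an edge, B and C share only a corner.
squares = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
    "C": [(2, 1), (3, 1), (3, 2), (2, 2)],
}
w = queen_weights(squares)
```

Under a rook's definition B and C would not be neighbors, since they meet only at a corner; the queen's definition counts them, which is why it is more sensitive to small topological differences between boundary files.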
The statistical analysis of local spatial association was conducted
by using a local Moran's I test statistic. The local
Moran's I [28] is defined as (Equation 3):
Where
xi and xj are observations for locations i and j (with mean μ),
zi = (xi - μ),
zj = (xj - μ), and
wij = spatial weights matrix with values of 0 or 1.
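The definitions above can be turned into a short computation, taking the statistic in the unscaled form Ii = zi Σj wij zj. The sketch below is illustrative only: the function name `local_morans_i`, the four rates and the chain of neighbors are hypothetical, and the significance testing needed to classify areas as high-high, low-low and so on is not included.

```python
def local_morans_i(values, weights):
    """Local Moran's I for each location, in the form I_i = z_i * sum_j w_ij z_j.

    values[i] is the observation x_i; weights[i][j] is w_ij (0 or 1 here).
    """
    n = len(values)
    mu = sum(values) / n
    z = [v - mu for v in values]  # deviations from the mean
    return [z[i] * sum(weights[i][j] * z[j] for j in range(n))
            for i in range(n)]


# Hypothetical rates for four areas arranged in a chain: 0-1, 1-2, 2-3.
rates = [10, 12, 2, 4]
w = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
local_i = local_morans_i(rates, w)
```

Positive values mark areas whose deviation from the mean has the same sign as that of their neighbors (here the two ends of the chain), while negative values mark areas that differ from their neighbors. Because the result depends on which areas appear in each row of the weights matrix, changing the neighbor sets changes the statistic even when the observations are identical.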
Authors' contributions
THG designed the study, conducted the analysis, drafted
the manuscript and developed the coefficient of ZIP code
uncertainty. TCM collaborated on the design of the analy-
sis, manuscript revisions and coded several of the proc-
esses in TransCad and ArcGIS.
References
1. Jacquez GM: Current practices in the spatial analysis of can-
cer: flies in the ointment. International Journal of Health Geographics
2004, 3(22):.
2. Jacquez GM, Grieling DA: Local clustering in breast, lung and
colorectal cancer in Long Island, New York. International Jour-
nal of Health Geographics 2003, 2(3):.
3. Boscoe FP, Ward MH, Reynolds P: Current practices in spatial
analysis of cancer data: data characteristics and data sources
for geographic studies of cancer. International Journal of Health
Geographics 2004, 3(28):.
4. Miller HJ, Wentz EA: Representation and spatial analysis in geo-
graphic information systems. Annals of the Association of American
Geographers 2003, 93:574-594.
5. Johnson GD: Small area mapping of prostate cancer incidence
in New York State (USA) using fully Bayesian hierarchical
modeling. International Journal of Health Geographics 2004, 3(29):.
6. Openshaw S: The modifiable areal unit problem. In Concepts and
techniques in modern geography Volume 38. Norwich: Geobooks; 1984.
7. Han D, Carrow SS, Rogerson PA, Munschauer FE: Geographical
variation of cerebrovascular disease in New York State: the
correlation with income. International Journal of Health Geographics
2005, 4(25):.
8. Krieger N, Waterman P, Chen JT, Soobader MJ, Subramanian SV,
Carson R: ZIP code caveat: bias due to spatiotemporal mis-
matches between ZIP codes and US census-defined geo-
graphic areas – the Public Health Disparities Geocoding
Project. Am J Public Health 2002, 92:1100-1102.
9. Wang F: Spatial clusters of cancers in Illinois 1986–2000. J Med
Syst 2004, 28(3):237-56.
10. Cook WH, Grala K, Wallis RC: Avian GIS models to signal
human risk for West Nile virus in Mississippi. International Jour-
nal of Health Geographics 2006, 5(36):.
11. Acevedo GD: ZIP code-level risk factors for tuberculosis:
neighborhood environment and residential segregation in
New Jersey, 1985–1992. Am J Public Health 2001, 91(5):734-741.
12. Luo W, Wang F: Measures of spatial accessibility to healthcare
in a GIS environment: Synthesis and a case study in Chicago
region. Env Plan B 2003, 30(6):865-884.
13. Dohn MN, White ML, Vigdorth EM, Ralph Buncher C, Hertzberg VS,
Baughman RP, George Smulian A, Walzer PD: Geographic cluster-
ing of Pneumocystis carinii pneumonia in patients with HIV
infection. Am J Respir Crit Care Med 162(5):1617-1621.
14. ZIP Code Frequently Asked Questions [http://www.usps.com/
ncsc/ziplookup/zipcodefaqs.htm]
15. Grubesic TH: ZIP codes and spatial analysis: Problems and
prospects. Socio-Economic Planning Sciences in press.
16. ZIP code tabulation areas (ZCTA) frequently asked ques-
tions [http://www.census.gov/geo/ZCTA/zctafaq.html]
17. Census 2000 ZIP code tabulation areas technical documen-
tation [http://www.census.gov/geo/ZCTA/zcta_tech_doc.pdf]
Equation 1:
\[ CZU_i = \frac{x_i / y_i}{\sum_{i=1}^{n} x_i \,\big/\, \sum_{i=1}^{n} y_i} \]
Equation 2:
\[ w_{ij} = \frac{c_{ij}}{\sum_{j=1}^{n} c_{ij}} \]
Equation 3:
\[ I_i = z_i \sum_{j} w_{ij} z_j \]
18. Cova TJ, Church RL: Contiguity constraints for single-region
site search problems. Geographical Analysis 2000, 32(4):306-329.
19. Geographic Data Technology/TeleAtlas: ZIP code boundary
files (year 2000). Lebanon. 2001.
20. Krieger N, Waterman P, Lemieux K, Zierler S, Hogan JW: On the
wrong side of the tracts? Evaluating accuracy of geocoding
for public health research. Am J Public Health 2001, 91:1114-1116.
21. Ratcliffe JH: On the accuracy of TIGER type geocoded address
data in relation to cadastral and census areal units. Interna-
tional Journal of Geographical Information Science 2001, 15(5):473-485.
22. Grubesic TH, Murray AT: Assessing the locational uncertainties
of geocoded data. Proceedings from the 24th Urban Data Manage-
ment Symposium. Chioggia . 27–29 October 2004
23. Caliper Corporation [http://www.caliper.com]
24. Geographic Data Technology/TeleAtlas: Ohio ZIP Code
areas [http://www.co.warren.oh.us/warrengis/metadata/
ohZIP.htm]
25. U.S. Census Bureau: Master Address File (MAF) Basics. [http:/
/www.census.gov/geo/mod/maf_basics.pdf].
26. New York State Department of Health (NYSDOH): New York
State Cancer Registry. [http://www.health.state.ny.us/statistics/
cancer/registry/nyscr.htm].
27. Moonan PK, Bayona M, Quitagua TN, Oppong J, Dunbar D, Jost KC,
Burgess G, Singh KP, Weis SE: Using GIS technology to identify
areas of tuberculosis transmission and incidence. International
Journal of Health Geographics 2004, 3(23):.
28. Anselin L: Local Indicators of Spatial Association – LISA. Geo-
graphical Analysis 1995, 27(2):93-115.
29. Anselin L, Syabri I, Kho Y: GeoDa: An Introduction to Spatial
Data Analysis. Geographical Analysis 2006, 38(1):5-22.
30. McLaughlin CC, Boscoe FP: Effects of randomization methods
on statistical inference in disease cluster detection. Health
and Place 2007, 13(1):152-163.
... Postal codes can be problematic for epidemiological studies since area boundaries are not standardized and are constantly redrawn. There are GIS procedures for topological correction to reduce errors from postal code aggregation (Grubesic and Matisziw 2006;Grubesic 2008). Ideally, data aggregation should not exceed postal codes or their spatial equivalent, as larger areas produce increasingly imprecise conclusions regarding health outcomes (Fig. 1). ...
Chapter
This chapter explores the role of geospatial science in health research. First, geospatial technology and data considerations are discussed, focusing on spatial resolution and geocoding techniques. Due to privacy concerns, health data are often aggregated into areal units, resulting in limitations related to data aggregation and the Modifiable Areal Unit Problem (MAUP). Geographic Information Systems (GIS) have tools for addressing the MAUP and other challenges, including spatial autocorrelation that arises due to the tendency of spatial data to cluster. Part two covers mapping considerations for health-related data, focusing on thematic maps for various data types. In part three, applications in health research are discussed in relation to two goals: (i) disease mapping and descriptive analysis and (ii) the assessment of spatial relationships. A primary goal in geospatial health analyses is to identify the patterning of diseases. Disease mapping and cluster detection are well-cited approaches in disease surveillance and descriptive analysis. Another goal in geospatial health research is identifying relationships that influence spatial patterns. Place-based social and environmental factors are often analyzed alongside health data using geospatial statistical analysis. Throughout analysis, attention must be placed on protecting patient privacy. Methods for preserving point-level data privacy while producing meaningful results are discussed.
... Zone Improvement Plan (ZIP) Codes and their associated ZIP Code Tabulation Areas (ZCTAs) are often used for geospatial analysis but are not ideal for this purpose as ZIP Codes often change, do not align well with U.S. Census Bureau or other administrative boundaries, and are designed for mail delivery efficiency, not for public health and epidemiological analysis. 3,4 In February 2022, the DCR began participating in the National Cancer Institute (NCI)/North American Association of Central Cancer Registries (NAACCR) Zone Design Project to create cancer reporting zones (to be referred to as "zones") that could be used for sub-county reporting in the state. The goals of the NCI/NAACCR Zone Design Project are to work with individual central cancer registries to create zones that will reduce suppression of data for small counties, increase spatial resolution for large counties, and create geographies that are more meaningful to cancer registries and stakeholders for cancer reporting and analysis. ...
Article
Full-text available
Objective To describe the Delaware Cancer Registry (DCR)’s participation in the National Cancer Institute (NCI)/North American Association of Central Cancer Registries (NAACCR) Zone Design Project to create sub-county geographic areas (“zones”) for use in cancer reporting and geospatial analysis. Methods DCR and other stakeholders reviewed up to ten unique zone configurations for each of Delaware’s three counties. The zone configurations were created using AZTool and were set to optimize three objectives: create zones that have a minimum and target population of 50,000; are homogenous based on the variables of percent minority, percent below poverty, and percent urban; and are as compact as possible. The DCR sent a survey to stakeholders to provide input on their preferred zone configuration for each county. Following the final selection of zones, the DCR utilized the geographies for calculating overall and late-stage breast cancer incidence statistics and created choropleth maps to visualize the rates by quintiles. Results The final selections resulted in a total of 15 zones for Delaware, with three in Kent County, nine in New Castle County, and three in Sussex County. The zones ranged in population size from 54,013 to 67,693 people. Zones with higher late-stage breast cancer incidence rates included those near the areas of Wilmington, Middletown, and between Milford and Georgetown. Comparing results of overall breast cancer incidence rates by zone with late-stage rates by zone, there were areas that had lower relative overall breast cancer incidence rates but were relatively higher for late-stage rates by zones or vice versa. Conclusions Aggregating census tracts into zones allows for reporting reliable cancer rates at sub-county levels, which is instrumental in conveying meaningful information about regional cancer trends to stakeholders and the public. 
Delaware will be able to utilize zone-level cancer information to provide targeted interventions and outreach initiatives.
Article
BACKGROUND AND OBJECTIVES To examine the effects of racial and socioeconomic disparities on clinical outcomes: in-hospital mortality, discharge dispositions, and hospital length of stay (LOS) among patients with traumatic brain injury (TBI) stratified by race and socioeconomic status (SES). METHODS We conducted a retrospective analysis by analyzing the 1995-2015 Nationwide Inpatient Sample database. Adjusted logistic regressions and multinomial logistic regression models with and without propensity score matching were applied to investigate the effects of disparities on clinical outcomes. RESULTS African American and Hispanic patients with TBI had a lower risk of in-hospital mortality, longer hospital LOS, and lower likelihood of being discharged to rehabilitation compared with White patients. The TBI patients with poor SES (pSES) had lower in-hospital mortality and were more likely to leave against medical advice compared with non-pSES TBI patients. CONCLUSION Racial and socioeconomic disparities had significant influences on in-hospital mortality, discharge dispositions, and hospital LOS among the TBI population. Our study observed pSES TBI patients had a lower likelihood of in-hospital mortality than non-pSES patients, which may be partially attributed to the fact that most of the pSES TBI patients were hospitalized in urban teaching hospitals and hospitals with large bed size. In effect, our data suggest that the Social Safety Net of the United States is effective in preventing mortality in patients with TBI.
Article
Introduction Pancreatic adenocarcinoma (PDAC) is a challenging disease, with outcomes influenced by several factors including socioeconomic status. The area deprivation index (ADI) has been used to understand how neighborhood disadvantages affect healthcare outcomes. Prior research has indicated that a higher ADI, reflective of a greater neighborhood disadvantage, is associated with an increased risk of major complications and unplanned readmission following PDAC resection. This study aimed to extend this investigation to the Northwell Health System in New York and explore the association between neighborhood ADI and surgical outcomes in patients with PDAC. Methods A retrospective analysis of the Northwell Health multicenter pancreatic cancer database from 2014 to 2023 included patients who underwent PDAC resection. The ADI scores were divided into low (1–3), moderate (4–6), and high (7–10), as previously described. Multinomial regression models and Kaplan–Meier log‐rank tests were used to compare differences in surgical outcomes between the patients in each ADI group. Results Out of 314 PDAC patients who underwent resection and had available ADI data, 116 (36.9%) were in the low, 163 (51.9%) in the moderate, and 35 (11.2%) in the high ADI category. The median ADI score was 4 (IQR: 3–5). Adjusted multinomial regression analysis revealed the following disparities: compared to the low ADI group, patients in the moderate ADI group demonstrated a significantly higher risk of diabetes (RR: 1.76, 95% CI 1.06–2.90, p = 0.028); high ADI was associated with a poorer response to neoadjuvant therapy (RR 3.13, 95% CI 1.11–8.82, p = 0.031), higher incidence of microscopic positive margins (RR 1.87, 95% CI 1.11–5.17, p = 0.028), increased severe complications (Clavien–Dindo class III−IV) (RR 1.36, 95% CI 1.04–1.80, p = 0.027), and a higher failure‐to‐rescue (FTR) rate (RR 1.44, 95% CI 1.12–1.85, p = 0.048). 
Although readmission and mortality rates at 30 and 90 days did not show significant differences ( p > 0.05), the Kaplan–Meier log‐rank test indicated a marked disparity in survival probabilities among ADI ranks ( p = 0.0025). Conclusion This study underscores a pronounced survival disparity across ADI categories among PDAC patients, suggesting an association between socioeconomic status and postoperative survival. Consideration of patient ADI may guide tailored healthcare strategies, such as the distribution of navigation and resources, to bridge the gap in survival outcomes and ensure equitable care for all socioeconomic strata.
Article
Importance Elevated ambient fine particulate matter (PM 2.5 ) air pollution exposure has been associated with poor health outcomes across several domains, but its associated outcomes among lung transplant recipients are poorly understood. Objective To investigate whether greater PM 2.5 exposure at the zip code of residence is associated with a higher hazard for mortality and graft failure in patients with lung transplants. Design, Setting, and Participants This retrospective cohort study used panel data provided by the United Network for Organ Sharing, which includes patients receiving transplants across all active US lung transplant programs. Adult patients who received lung transplants between May 2005 and December 2016 were included, with a last follow-up of September 10, 2020. Data were analyzed from September 2022 to May 2023. Exposure Zip code–level annual PM 2.5 exposure was constructed using previously published North American estimates. Main Outcomes and Measures The primary outcome was time to death or lung allograft failure after lung transplant. A gamma shared frailty Cox proportional hazards model was used to produce unadjusted and adjusted hazard ratios (HRs) to estimate the association of zip code PM 2.5 exposure at the time of transplant with graft failure or mortality. Results Among 18 265 lung transplant recipients (mean [SD] age, 55.3 [13.2] years; 7328 female [40.2%]), the resident zip code’s annual PM 2.5 exposure level was greater than or equal to the Environmental Protection Agency (EPA) standard of 12μg/m ³ for 1790 patients (9.8%) and less than the standard for 16 475 patients (90.2%). In unadjusted analysis, median graft survival was 4.87 years (95% CI, 4.57-5.23 years) for recipients living in high PM 2.5 areas and 5.84 years (95% CI, 5.71-5.96 years) for recipients in the low PM 2.5 group. 
Having an annual PM 2.5 exposure level greater than or equal to the EPA standard 12 μg/m ³ was associated with an increase in the hazard of death or graft failure (HR, 1.11; 95% CI, 1.05-1.18; P < .001) in the unadjusted analysis and after adjusting for covariates (HR, 1.08; 95% CI, 1.01-1.15; P = .02). Each 1 μg/m ³ increase in exposure was associated with an increase in the hazard of death or graft failure (adjusted HR, 1.01; 95% CI, 1.00-1.02; P = .004) when treating PM 2.5 exposure as a continuous variable. Conclusions and Relevance In this study, elevated zip code–level ambient PM 2.5 exposure was associated with an increased hazard of death or graft failure in lung transplant recipients. Further study is needed to better understand this association, which may help guide risk modification strategies at individual and population levels.
Article
Background To our knowledge, no agreed-upon best practices exist for joining U.S. Census ZIP Code Tabulation Areas (ZCTAs) and U.S. Postal Service ZIP Codes (ZIPs). One-to-one linkage using 5-digit ZCTA identifiers excludes ZIPs without direct matches. “Crosswalk” linkage may match a ZCTA to multiple ZIPs, avoiding losses. Methods We compared non-crosswalk and crosswalk linkages nationally and for mortality and health insurance in California. To elucidate selection implications, generalized additive models related sociodemographics to whether ZCTAs contained non-matching ZIPs. Results Nationwide, 15% of ZCTAs had non-matching ZIPs, i.e., ZIPs dropped under non-crosswalk linkage. ZCTAs with non-matching ZIPs were positively associated with metropolitan core location, lower socioeconomics, and non-white population. In California, 34% of ZIPs in the mortality and 25% in the health insurance data had ZCTAs with non-matching ZIPs; however, these ZIPs constitute only 0.03% of total mortality and 0.44% of total insurance enrollees. Conclusions Our study findings support the use of crosswalk linkages and ZCTAs as a unit of analysis. One-to-one linkage may cause bias by differentially excluding ZIPs with more disadvantaged populations, although affected population sizes appear small.
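The difference between the two linkage strategies described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ZIP codes, counts, ZCTA set, and crosswalk table below are all made up, and the function names `link_one_to_one` and `link_crosswalk` are hypothetical, not taken from the study.

```python
from collections import defaultdict

# Hypothetical ZIP-level counts (for example, deaths). All values are invented.
zip_counts = {"90001": 12, "90002": 7, "90009": 3}

# Hypothetical ZCTA identifiers; "90009" has no ZCTA with the same 5 digits.
zctas = {"90001", "90002"}

# Hypothetical ZIP -> ZCTA crosswalk; the non-matching ZIP is assigned to a ZCTA.
crosswalk = {"90001": "90001", "90002": "90002", "90009": "90001"}


def link_one_to_one(counts, zcta_ids):
    """Keep only ZIPs whose 5-digit code is itself a ZCTA identifier."""
    return {z: n for z, n in counts.items() if z in zcta_ids}


def link_crosswalk(counts, xwalk):
    """Aggregate ZIP counts to ZCTAs, so one ZCTA may collect several ZIPs."""
    totals = defaultdict(int)
    for z, n in counts.items():
        if z in xwalk:  # ZIPs absent from the crosswalk are still lost
            totals[xwalk[z]] += n
    return dict(totals)


one_to_one = link_one_to_one(zip_counts, zctas)
crosswalked = link_crosswalk(zip_counts, crosswalk)

# Records lost when the non-matching ZIP is dropped under one-to-one linkage.
dropped = sum(zip_counts.values()) - sum(one_to_one.values())
```

In this toy case one-to-one linkage silently loses the records of the non-matching ZIP, while the crosswalk folds them into the ZCTA it maps to.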
Article
Importance Many youths experience mental health challenges. Identifying which neighborhood and community factors may influence mental health may guide health policy and practice. Objective To explore associations between community assets (eg, schools, parks, libraries, and barbershops) and past-year mental health symptoms among youths. Design, Setting, and Participants This cross-sectional study leveraged 3 datasets, which were linked by 26 zip codes: the Western Pennsylvania Regional Data Center, the Child Opportunity Index 2.0 database, and the Allegheny County Youth Risk Behavior Survey (YRBS). The YRBS was administered during the study period in 2018 to youths across 13 high schools in Allegheny County, Pennsylvania; the study dates were from October 15 to October 19, 2018. Dates of analysis were from August 1, 2023, to July 15, 2024. Exposures Asset density in each zip code across 8 asset categories (transportation, education, parks and recreation, faith-based entities, health services, food resources, personal care services, and social infrastructure) was calculated. Main Outcomes and Measures The main outcomes were mental health measures covering the past 12 months, which comprised feelings of hopelessness (feeling so sad or hopeless that you stopped doing activities), nonsuicidal self-injury (hurt yourself on purpose without wanting to die), and suicidal ideation (seriously considered attempting suicide). All were operationalized as any or none. Data were analyzed using multivariable generalized linear mixed models and were adjusted for age, sex assigned at birth, race and ethnicity, and identification as sexually or gender diverse. Results Among 6306 students who were eligible for the YRBS based on their enrollment in participating high schools, 4487 students completed surveys, and 2162 were included in the analytic sample (mean [SD] age, 15.8 [1.2] years; 1245 [57.6%] were assigned female sex at birth).
Over one-third of the participants (811 [37.5%]) reported past-year feelings of hopelessness; 587 (27.2%), past-year nonsuicidal self-injury; and 450 (20.8%), past-year suicidal ideation. High total asset population density (adjusted odds ratio [AOR], 0.85 [95% CI, 0.75-0.97]; P = .01), as well as population density of transportation assets (AOR, 0.77 [95% CI, 0.66-0.90]; P < .001), educational resources (AOR, 0.78 [95% CI, 0.67-0.92]; P = .002), and health services (AOR, 0.74 [95% CI, 0.60-0.91]; P = .006), were associated with lower odds of past-year hopelessness after adjusting for covariates. There were no correlations between asset density, Child Opportunity Index, and other mental health measures. Conclusions and Relevance The findings of this cross-sectional study suggest that access to certain community assets was associated with lower odds of feelings of hopelessness among youths. Ongoing work is needed to characterize other forms of social and cultural capital, which may mitigate negative mental health outcomes among adolescent youths.
Article
American Indian/Alaska Natives (AI/ANs) disproportionately suffer from diabetes compared to non-Hispanic whites (NHWs). In 2013, diabetes caused 69% of end-stage kidney disease (ESKD) in AI/ANs (ESKD-D) but accounted for only 44% of ESKD diagnoses in the overall US population. Moreover, the diagnosis of diabetes and ESKD-D may be significantly related to social determinants of health. The purpose of this study was to conduct a survival analysis of AI/ANs and NHWs diagnosed with ESKD-D nationally and by Indian Health Service region and correlate the survival analysis to the Area Deprivation Index® (ADI®). This manuscript reports a retrospective cohort analysis of 2021 United States Renal Data System data. Eligible patient records were those of AI/ANs and NHWs with diabetes as the primary cause of ESKD who started dialysis on January 1, 2014, or later. A total of 81,862 patient records were included in this analysis, of which 1798 (2.2%) were AI/AN. AI/ANs survive longer, with an 18.4% decrease in risk of death compared to NHWs. However, AI/ANs are diagnosed with ESKD-D and start dialysis earlier than NHWs. ADI® variables became significant as ADI® ratings increased, meaning persons with greater social disadvantage had worse survival outcomes. The findings reveal that AI/ANs have better survival outcomes than NHWs, explained in part by initiating dialysis earlier than NHWs. Additional research is needed to explore factors (e.g., social determinants; cultural; physiologic) that contribute to earlier diagnosis of ESKD-D in AI/ANs and the impact of prolonged dialysis on quality of life of those with ESKD-D.
Article
Full-text available
This paper proposes an explicit set of constraints as a general approach to the contiguity problem in site search modeling. Site search models address the challenging problem of identifying the best area in a study region for a particular land use, given that there are no candidate sites. Criteria that commonly arise in a search include a site's area, suitability, cost, shape, and proximity to surrounding geographic features. An unsolved problem in this modeling arena is the identification of a general set of mathematical programming constraints that can guarantee a contiguous solution (site) for any 0-1 integer-programming site search formulation. The constraints proposed herein address this problem, and we evaluate their efficacy and efficiency in the context of a regular and irregular tessellation of geographic space. An especially efficient constraint form is derived from a more general form and similarly evaluated. The results demonstrate that the proposed constraints represent a viable, general approach to the contiguity problem.
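The property those constraints are meant to guarantee can be illustrated with a simple check. The sketch below is not the paper's mathematical programming formulation; it is a breadth-first search that tests whether a given 0-1 selection of spatial units forms a single connected site. The function name `is_contiguous`, the adjacency dictionary, and the four-cell strip are all assumptions made for illustration.

```python
from collections import deque


def is_contiguous(selected, neighbors):
    """Return True if the selected units form one connected component.

    `neighbors` maps each unit to the units it shares a boundary with.
    An empty selection is treated as trivially contiguous (an assumption).
    """
    selected = set(selected)
    if not selected:
        return True
    start = next(iter(selected))
    seen = {start}
    queue = deque([start])
    while queue:
        unit = queue.popleft()
        for nb in neighbors.get(unit, ()):
            # Only walk through units that belong to the candidate site.
            if nb in selected and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return seen == selected


# Hypothetical strip of four units: 0 - 1 - 2 - 3
strip = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

connected_site = is_contiguous([0, 1, 2], strip)
split_site = is_contiguous([0, 2], strip)
```

A check like this works on regular and irregular tessellations alike, because it relies only on the adjacency relation and not on cell geometry.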
Article
Full-text available
The assignment of geo-referenced coordinates to individual households, known as geocoding, is a fundamental and important task for urban data management. Not surprisingly, geographic information systems (GIS) play a critical role in processing the spatial information used for coordinate assignment. There are a wide variety of applications where geocoded information is relied upon for the efficient and effective delivery of services. For example, emergency planning/response efforts, crime mapping and analysis, public facility location modeling and the calculation of municipal tax exposure utilize geocoded data. However, because the results of the geocoding process are largely dependent on the quality of the TIGER-based street files used for geo-referencing, as well as the assumptions built into them, there is the potential for introducing significant spatial inaccuracies during address conversion. The purpose of this paper is to explore and document the problems associated with the geographic base files used for coordinate assignment. We provide an extended empirical example that highlights the relative uncertainty of household location in geographic space. This includes an assessment of positional uncertainty through the spatial perturbation of geocoded points within an established bound, allowing for alternative, yet equally accurate, spatial realizations of geocoded data. Sensitivity analysis is then conducted to evaluate the significance of locational uncertainties in geocoded data and spatial analysis.
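One way to picture perturbation within a bound is sketched below. The abstract does not state how the perturbation is drawn, so the uniform draw from a disk, the 50-unit bound, the projected coordinates, and the names `perturb_point` and `realizations` are all assumptions chosen for illustration.

```python
import math
import random


def perturb_point(x, y, bound, rng):
    """Return a point drawn uniformly from the disk of radius `bound` around (x, y)."""
    # The square root makes the density uniform over the disk's area.
    radius = bound * math.sqrt(rng.random())
    angle = 2.0 * math.pi * rng.random()
    return x + radius * math.cos(angle), y + radius * math.sin(angle)


def realizations(points, bound, n, seed=0):
    """Generate n alternative realizations of a set of geocoded points."""
    rng = random.Random(seed)  # fixed seed so the runs are repeatable
    return [[perturb_point(x, y, bound, rng) for x, y in points] for _ in range(n)]


# Hypothetical geocoded households in projected coordinates (for example, meters).
geocoded = [(1000.0, 2000.0), (1500.0, 2500.0)]
sims = realizations(geocoded, bound=50.0, n=100)
```

Each realization could then be passed through the same spatial analysis, and the spread of the results gives a rough sense of how sensitive the analysis is to positional uncertainty.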
Article
Full-text available
In many applications of Geographical Information Systems (GIS) a common task is the conversion of addresses into grid coordinates. In many countries this is usually accomplished using address range TIGER-type files in conjunction with geocoding packages within a GIS. Improvements in GIS functionality and the storage capacity of large databases mean that the spatial investigation of data at the individual address level is now commonly performed. This process relies on the accuracy of the geocoding mechanism and this paper examines this accuracy in relation to cadastral records and census tracts. Results from a study of over 20000 addresses in Sydney, Australia, using a TIGER-type geocoding process suggest that 5-7.5% (depending on geocoding method) of addresses may be misallocated to census tracts, and more than 50% may be given coordinates within the land parcel of a different property.
Article
Full-text available
This article synthesizes two GIS-based accessibility measures into one framework, and applies the methods to examining spatial accessibility to primary health care in the Chicago ten-county region. The floating catchment area (FCA) method defines the service area of physicians by a threshold travel time while accounting for the availability of physicians relative to the surrounding demand. The gravity-based method considers a nearby physician more accessible than a remote one and discounts a physician's availability by a gravity-based potential. The former is a special case of the latter. Based on the 2000 Census and primary care physician data, this research assesses the variation of spatial accessibility to primary care in the Chicago region, and analyzes the sensitivity of results by experimenting with ranges of threshold travel times in the FCA method and travel friction coefficients in the gravity model. The methods may be used to help the US Department of Health and Human Services and state health departments improve designation of Health Professional Shortage Areas.
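The two measures can be sketched as follows. This is a minimal illustration, assuming a two-step form of the FCA method and a power-function distance decay for the gravity model; the abstract does not give either formula, and the locations, populations, physician counts, travel times, and function names are all hypothetical.

```python
def two_step_fca(supply, demand, travel_time, threshold):
    """Floating catchment accessibility with a hard travel-time threshold."""
    # Step 1: physician-to-population ratio within each physician site's catchment.
    ratios = {}
    for j, physicians in supply.items():
        pop = sum(p for k, p in demand.items() if travel_time[(k, j)] <= threshold)
        ratios[j] = physicians / pop if pop > 0 else 0.0
    # Step 2: each population site sums the ratios of the physician sites it can reach.
    return {
        i: sum(r for j, r in ratios.items() if travel_time[(i, j)] <= threshold)
        for i in demand
    }


def gravity_access(supply, demand, travel_time, beta):
    """Gravity-based accessibility with decay travel_time ** -beta (assumed form)."""
    # Demand potential competing for each physician site.
    potential = {
        j: sum(p * travel_time[(k, j)] ** -beta for k, p in demand.items())
        for j in supply
    }
    return {
        i: sum(s * travel_time[(i, j)] ** -beta / potential[j] for j, s in supply.items())
        for i in demand
    }


# Hypothetical data: physician counts, populations, travel times in minutes.
supply = {"A": 2, "B": 1}
demand = {"x": 100, "y": 300}
travel_time = {("x", "A"): 10, ("x", "B"): 40, ("y", "A"): 20, ("y", "B"): 15}

fca = two_step_fca(supply, demand, travel_time, threshold=30)
grav = gravity_access(supply, demand, travel_time, beta=1.0)
```

Under these assumptions, replacing the smooth decay with a weight of 1 inside the threshold and 0 outside it turns the gravity expression into the two-step FCA, which is one way to read the statement that the former is a special case of the latter. In both functions the population-weighted sum of accessibility equals the total physician supply, provided every physician site has some demand in reach.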
Book
Considers the nature of the modifiable areal unit problem. A survey is made of the prevailing ambivalent attitudes that geographers display and the general absence of any sense of verisimilitude is emphasized. A critical review is made of several alternative approaches to handling the problem.
Article
This article presents an overview of GeoDa™, a free software program intended to serve as a user-friendly and graphical introduction to spatial analysis for non-geographic information systems (GIS) specialists. It includes functionality ranging from simple mapping to exploratory data analysis, the visualization of global and local spatial autocorrelation, and spatial regression. A key feature of GeoDa is an interactive environment that combines maps with statistical graphics, using the technology of dynamically linked windows. A brief review of the software design is given, as well as some illustrative examples that highlight distinctive features of the program in applications dealing with public health, economic development, real estate analysis, and criminology.
Article
A common—perhaps modal—representation of geography in spatial analysis and geographic information systems is native (unexamined) objects interacting based on simple distance and connectivity relationships within an empty Euclidean space. This is only one possibility among a large set of geographic representations that can support quantitative analysis. Through the vehicle of GIS, many researchers are adopting this representation without realizing its assumptions or its alternatives. Rather than locking researchers into a single representation, GIS could serve as a toolkit for estimating and exploring alternative geographic representations and their analytical possibilities. The article reviews geographic representations, their associated analytical possibilities and relevant computational tools in the combined spatial analysis and GIScience literatures. The discussion identifies several research and development frontiers, including analytical gaps in current GIS software.
Article
The capabilities for visualization, rapid data retrieval, and manipulation in geographic information systems (GIS) have created the need for new techniques of exploratory data analysis that focus on the “spatial” aspects of the data. The identification of local patterns of spatial association is an important concern in this respect. In this paper, I outline a new general class of local indicators of spatial association (LISA) and show how they allow for the decomposition of global indicators, such as Moran's I, into the contribution of each observation. The LISA statistics serve two purposes. On one hand, they may be interpreted as indicators of local pockets of nonstationarity, or hot spots, similar to the Gi and G*i statistics of Getis and Ord (1992). On the other hand, they may be used to assess the influence of individual locations on the magnitude of the global statistic and to identify “outliers,” as in Anselin's Moran scatterplot (1993a). An initial evaluation of the properties of a LISA statistic is carried out for the local Moran, which is applied in a study of the spatial pattern of conflict for African countries and in a number of Monte Carlo simulations.
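The decomposition of global Moran's I into local contributions can be checked numerically. The sketch below uses one common standardized form of the local Moran statistic, I_i = (z_i / m2) * sum_j w_ij z_j with m2 = sum_i z_i^2 / n, for which the local values sum to S0 times the global statistic (S0 being the sum of all weights). The four values and the binary adjacency weights are invented for illustration.

```python
def local_moran(values, weights):
    """Local Moran statistics I_i = (z_i / m2) * sum_j w_ij * z_j."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]          # deviations from the mean
    m2 = sum(d * d for d in z) / n          # second moment of the deviations
    return [
        (z[i] / m2) * sum(weights[i][j] * z[j] for j in range(n))
        for i in range(n)
    ]


def global_moran(values, weights):
    """Global Moran's I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)   # sum of all spatial weights
    cross = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in z)


# Hypothetical data: four units along a line, neighbors share an edge.
values = [1.0, 2.0, 8.0, 9.0]
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

local_i = local_moran(values, weights)
global_i = global_moran(values, weights)
s0 = sum(sum(row) for row in weights)
```

Here every local value is positive because low values sit next to low values and high next to high, and dividing the sum of the local values by S0 recovers the global statistic.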
Article
The use of zip codes for spatial, demographic, and socio-economic analysis is growing. As of August 2005, 193 articles were indexed by “zip code” in the Social Sciences Citation Index, while 386 were indexed in PubMed. All of these articles were published since 1989. While the treatment of zip codes as units of analysis varies widely in epidemiology, marketing, geography, and the socio-economic planning sciences, there are a number of common “errors” that could be avoided if analysts retained a better understanding of zip code characteristics. The purpose of this paper is to outline the problems and prospects of utilizing zip codes for spatial analysis. Issues associated with spatial contiguity, data aggregation, and boundary definitions are addressed. Results suggest that, although zip codes are not the most robust spatial units of analysis available, they retain a modest degree of utility for specialized applications. Recommendations for future research regarding zip codes and their use in socio-economic applications are offered.