Data-driven geography
Harvey J. Miller •Michael F. Goodchild
Published online: 10 October 2014
© Springer Science+Business Media Dordrecht 2014
Abstract The context for geographic research has
shifted from a data-scarce to a data-rich environment,
in which the most fundamental changes are not just the
volume of data, but the variety and the velocity at
which we can capture georeferenced data; trends often
associated with the concept of Big Data. A data-driven
geography may be emerging in response to the wealth
of georeferenced data flowing from sensors and people
in the environment. Although this may seem revolu-
tionary, in fact it may be better described as evolu-
tionary. Some of the issues raised by data-driven
geography have in fact been longstanding issues in
geographic research, namely, large data volumes,
dealing with populations and messy data, and tensions
between idiographic and nomothetic knowledge.
The belief that spatial context matters is a major theme
in geographic thought and a major motivation behind
approaches such as time geography, disaggregate
spatial statistics and GIScience. There is potential to
use Big Data to inform both geographic knowledge-
discovery and spatial modeling. However, there are
challenges, such as how to formalize geographic
knowledge to clean data and to ignore spurious
patterns, and how to build data-driven models that
are both true and understandable.
Keywords Big data · GIScience · Spatial statistics ·
Geographic knowledge discovery · Geographic
thought · Time geography
Introduction
A great deal of attention is being paid to the potential
impact of data-driven methods on the sciences. The
ease of collecting, storing, and processing digital data
may be leading to what some are calling the fourth
paradigm of science, following the millennia-old
tradition of empirical science describing natural
phenomena, the centuries-old tradition of theoretical
science using models and generalization, and the
decades-old tradition of computational science sim-
ulating complex systems. Instead of looking through
telescopes and microscopes, researchers are increas-
ingly interrogating the world through large-scale,
complex instruments and systems that relay observa-
tions to large databases to be processed and stored as
information and knowledge in computers (Hey et al.
2009).
This fundamental change in the nature of the data
available to researchers is leading to what some call
Big Data. Big Data refer to data that outstrip our
capabilities to analyze them. This has three dimensions,
the so-called "three Vs": (1) volume—the amount of data
that can be collected and stored; (2) velocity—the
speed at which data can be captured; and (3) variety—
encompassing both structured (organized and stored in
tables and relations) and unstructured (text, imagery)
data (Dumbill 2012). Some of these data are generated
from massive simulations of complex systems such as
cities (e.g., TRANSIMS; see Cetin et al. 2002), but a
large portion of the flood is from sensors and software
that digitize and store a broad spectrum of social,
economic, political, and environmental patterns and
processes (Graham and Shelton 2013; Kitchin 2014).
Sources of geographically (and often temporally)
referenced data include location-aware technologies
such as the Global Positioning System and mobile
phones; in situ sensors carried by individuals in
phones, attached to vehicles, and embedded in infra-
structure; remote sensors carried by airborne and
satellite platforms; radiofrequency identification
(RFID) tags attached to objects; and georeferenced
social media (Miller 2007, 2010; Sui and Goodchild
2011; Townsend 2013).
Yet despite the enthusiasm over Big Data and data-
driven methods, the role they can play in scholarly
research, and specifically in geographic research, may
not be immediately apparent. Are theory and expla-
nation archaic when we can measure and describe so
much, so quickly? Does data velocity really matter in
research, with its traditions of careful reflection? Can
the obvious problems associated with variety—lack of
quality control, lack of rigorous sampling design—be
overcome? Can we make valid generalizations from
ongoing, serendipitous (instead of carefully designed
and instrumented) data collection? In short, can Big
Data and data-driven methods lead to significant
discoveries in geographic research? Or will the
research community continue to rely on what for the
purposes of this paper we will term Scarce Data: the
products of public-sector statistical programs that
have long provided the major input to research in
quantitative human geography?
Our purpose in this paper is to explore the impli-
cations of these tensions—theory-driven versus data-
driven research, prediction versus discovery, law-
seeking versus description-seeking—for research in
geography. We anticipate that geography will provide
a distinct context for several reasons: the specific issues
associated with location, the integration of the social
and the environmental, and the existence within the
discipline of traditions with very different approaches
to research. Moreover, although data-driven geogra-
phy may seem revolutionary, in fact it may be better
described as evolutionary since its challenges have
long been themes in the history of geographic thought
and the development of geographical techniques.
The next section of this paper discusses the
concepts of Big Data and data-driven geography,
addressing the question of what is special about the
new flood of georeferenced data. The "Data-driven
geography: challenges" section of this paper dis-
cusses major challenges facing data-driven geogra-
phy; these include dealing with populations (not
samples), messy (not clean) data, and correlations
(not causality). The "Theory in data-driven geogra-
phy" section discusses the role of theory in data-
driven geography. "Approaches to data-driven geog-
raphy" identifies ways to incorporate Big Data into
geographic research. The final section concludes this
paper with a summary and some cautions on the
broader impacts of data-driven geography on society.
Big data and data-driven geography
Humanity’s current ability to acquire, process, share,
and analyze huge quantities of data is without prec-
edent in human history. It has led to the coining of such
terms as the "exaflood" and the metaphor of "drinking
from a firehose" (Sui et al. 2013; Waldrop 1990). It has
also led to the suggestion that we are entering a new,
fourth phase of science that will be driven not so much
by careful observation by individuals, or theory
development, or computational simulation, as by this
new abundance of digital data (Hey et al. 2009).
It is worth recognizing immediately, however, that
the firehose metaphor has a comparatively long history
in geography, and that the discipline is by no means
new to an abundance of voluminous data. The Landsat
program of satellite-based remote sensing began in the
early 1970s by acquiring data at rates that were well in
excess of the analytic capacities of the computational
systems of the time; subsequent improvements in
sensor resolution and the proliferation of military and
civilian satellites have meant that four decades later
data volumes continue to challenge even the most
powerful computational systems.
Volume is clearly not the only characteristic that
distinguishes today’s data supply from that of previous
eras. Today, data are being collected from many
sources, including social media, crowd sourcing,
ground-based sensor networks, and surveillance cam-
eras, and our ability to integrate such data and draw
inferences has expanded along with the volume of the
supply. The phrase Big Data implies a world in which
predictions are made by mining data for patterns and
correlations among these new sources, and some very
compelling instances of surprisingly accurate predic-
tions have surfaced in the past few years with respect
to the results of the Eurovision song contest (O’Leary
2012), the stock market (Preis et al. 2013), and the flu
(Butler 2008). The theme of Big Data is often
associated not only with volume but with variety,
reflecting these multiple sources, and velocity, given
the speed with which such data can now be analyzed to
make predictions in close-to-real time.
Ubiquitous, ongoing data flows are a big deal
because they allow us to capture spatio-temporal
dynamics directly (rather than inferring them from
snapshots) and at multiple scales. The data are
collected on an ongoing basis, meaning that both
mundane and unplanned events can be captured. To
borrow Nassim Taleb’s metaphor for probable and
inconsequential versus improbable but consequential
events (Taleb 2007): we do not need to sort the white
swans from the black swans before collecting data: we
can measure all swans and then figure out later which
are white or black. White swans may also combine in
surprising ways to form black-swan events.
Big Data is leading to new approaches to research
methodology. Fotheringham (1998) defines geocom-
putation as quantitative spatial analysis where the
computer plays a pivotal role. The use of the computer
drives the form of the analysis rather than just being a
convenient vehicle: analysts design geocomputational
techniques with the computer in mind. Similarly, data
play a pivotal role in data-driven methods. From this
perspective data are not just a convenient way to
calibrate, validate, and test but rather the driving force
behind the analysis. Consequently, analysts design
data-driven techniques with data in mind, and not just
large volumes of data, but a wider spectrum of data
flowing at higher speeds from the world. In this sense
we may indeed be entering a fourth scientific paradigm
where scientific methods are configured to satisfy data
rather than data configured to satisfy methods.
Data-driven geography: challenges
In Big Data: A Revolution That Will Transform How
We Live, Work, and Think, Mayer-Schonberger and
Cukier (2013) identify three main challenges of Big
Data in science: (1) populations, not samples; (2)
messy, not clean data; and (3) correlations, not
causality. We discuss these three challenges for
geographic research in the following subsections.
Populations, not samples
Back when analysis was largely performed by hand
rather than by machines, dealing with large volumes of
data was impractical. Instead, researchers developed
methods for collecting representative samples and for
generalizing to inferences about the population from
which they were drawn. Random sampling was thus a
strategy for dealing with information overload in an
earlier era. In statistical programs such as the US Census
of Population it was also a means for controlling costs.
Random sampling works well, but it is fragile: it
works only as long as the sampling is representative. A
sampling rate of one in six (the rate previously used by
the US Bureau of the Census for its more elaborate
Long Form) may be adequate for some purposes, but
becomes increasingly problematic when analysis
focuses on comparatively rare subcategories. Random
sampling also requires a process for enumerating and
selecting from the population (a sampling frame),
which is problematic if enumeration is incomplete.
Sample data also lack extensibility for
secondary uses. Because randomness is so critical, one
must carefully plan for sampling, and it may be
difficult to re-analyze the data for purposes other than
those for which it was collected (Mayer-Schonberger
and Cukier 2013).
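The fragility of sampling for rare subcategories is easy to see numerically. The following is a minimal sketch in Python; the tract size, subgroup share, and one-in-six rate are illustrative assumptions, not figures from the text, and it simply shows how widely repeated samples of a small area can disagree about a rare subgroup.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical census tract of 4,000 residents; 1% belong to a rare subgroup.
tract = rng.random(4_000) < 0.01

# Draw repeated one-in-six samples (roughly the old Long Form rate) and see
# how much the estimated subgroup share varies from draw to draw.
estimates = np.array([
    rng.choice(tract, size=tract.size // 6, replace=False).mean()
    for _ in range(1_000)
])

print(f"true share          : {tract.mean():.2%}")
print(f"sampled share, mean : {estimates.mean():.2%}")
print(f"sampled share, range: {estimates.min():.2%} to {estimates.max():.2%}")
```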
In contrast, many of the new data sources consist of
populations, not samples: the ease of collecting,
storing, and processing digital data means that instead
of dealing with a small representation of the popula-
tion we can work with the entire population and thus
escape one of the constraints of the past. But one
problem with populations is that they are often self-
selected rather than sampled: for example, all people
who signed up for Facebook, all people who carry
smartphones, or all cars that happened to travel within
the City of London between 8:00 a.m. and 11:00 a.m. on
2 September 2013. Geolocated tweets are an attractive
source of information on current trends (e.g., Tsou
et al. 2013), but only a small fraction of tweets are
accurately geolocated using GPS. Since we do not
know the demographic characteristics of any of these
groups, it is impossible to generalize from them to any
larger populations from which they might have been
drawn.
Yet geographers have long had to contend with the
issues associated with samples and their parent
populations. Consider, for example, an analysis of
the relationship between people over 65 years old and
people registered as Republicans, the case studied by
Openshaw and Taylor in their seminal article on the
modifiable areal unit problem (Openshaw and Taylor
1979). The 99 counties of Iowa (their source of data)
are all of the counties that exist in Iowa. They are not
therefore a random sample of Iowa counties, or even a
representative sample of counties of the US, so the
methods of inferential statistics that assume random
and independent sampling are not applicable. In
remote sensing it is common to analyze all of the
pixels in a given scene; again, these are not a random
sample of any larger population.
However, the cases discussed above are ones where we
can be assured that the entire population of interest is
included: we are interested in all of the land cover in a
scene, or all of the people over 65 and Republicans in
Iowa. This is often not true with many new sources of
data. A challenge is how to identify the niches to
which monitored population data can be applied with
reasonable generality. This inverts the classic sam-
pling problem where we identify a question and collect
data to answer that question. Instead, we collect the
data and determine what questions we can answer.
Another issue concerns what people are volunteer-
ing when they volunteer geographic and other infor-
mation (Goodchild 2007). Social media such as
Facebook may have high penetration rates with
respect to population, but do not necessarily have
high penetration rates into peoples’ lives. Checking in
at an orchestra concert or lecture provides a noble
image that a person would like to promote, while
checking in at a bar at 10 a.m. is an image that a person
may be less keen to share. In the classic sociology text
The Presentation of Self in Everyday Life, Erving
Goffman uses theater as a metaphor and distinguishes
between stage and backstage behaviors, with stage
behaviors being consistent with the role people wish to
play in public life and backstage behaviors being
private actions that people wish to keep private
(Goffman 1959). While there are certainly cases of
over-sharing behavior (especially among celebrities),
we cannot be assured that the information people
volunteer is an accurate depiction of their complete
lives rather than just of the lives they wish to present to the
social sphere. Several geographic questions follow
from these observations. What is the geography of
stage versus backstage realms in a city or region? Does
this distribution vary by age, gender, socioeconomic
status, or culture? What do these imply for what we
can know about human spatial behavior?
In addition to selective volunteering of information
about their lives, there also may be selection biases in
the information people volunteer about environments.
OpenStreetMap (OSM) is often identified as a
successful crowdsourced mapping project: many cities
of the world have been mapped by people on a voluntary
basis to a remarkable degree of accuracy. However,
some regions get mapped more quickly than others, such as
tourist locations, recreation areas, and affluent neigh-
borhoods, while locations of less interest to those who
participate in OSM (such as poorer neighborhoods)
receive less attention (Haklay 2010). While biases exist
in official, administrative maps (e.g., governments in
developing nations often do not map informal settle-
ments such as favelas), the biases in crowdsourced maps
are likely to be more subtle. Similarly, the rise of civic
hacking, in which citizens generate data, maps, and tools to
solve social problems, tends to focus on the problems
that citizens with laptops, fast internet connections,
technical skills, and available time consider to be
important (Townsend 2013).
Messy, not clean
The new data sources are often messy, consisting of
data that are unstructured, collected with no quality
control, and frequently accompanied by no documen-
tation or metadata. There are at least two ways of
dealing with such messiness. On the one hand, we can
restrict our use of the data to tasks that do not attempt
to generalize or to make assumptions about quality.
Messy data can be useful in what one might term the
softer areas of science: initial exploration of study
areas, or the generation of hypotheses. Ethnography,
qualitative research, and investigations of Grounded
Theory (Glaser and Strauss 1967) often focus on using
interviews, text, and other sources to reveal what was
otherwise not known or recognized, and in such
contexts the kinds of rigorous sampling and docu-
mentation associated with Scarce Data are largely
unnecessary. We discuss this option in greater detail
later in the paper.
On the other hand, we can attempt to clean and verify
the data, removing as much as possible of the messi-
ness, for use in traditional scientific knowledge con-
struction. Goodchild and Li (2012) discuss this
approach in the context of crowdsourced geographic
information. They note that traditional production of
geographic information has relied on multiple sources,
and on the expertise of cartographers and domain
scientists to assemble an integrated picture of the
landscape. For example, terrain information may be
compiled from photogrammetry, point measurements
of elevation, and historic sources; as a result of this
process of synthesis the published result may well be
more accurate than any of the original sources.
Goodchild and Li (2012) argue that this traditional
process of synthesis, which is largely hidden from
popular view and not apparent in the final result, will
become explicit and of critical importance in the new
world of Big Data. They identify three strategies for
cleaning and verifying messy data: (1) the crowd
solution; (2) the social solution; and (3) the knowledge
solution. The crowd solution is based on Linus's Law,
named in honor of the developer of Linux, Linus
Torvalds: "Given enough eyeballs, all bugs are
shallow" (Raymond 2001). In other words, the more
people who can access and review your code, the
greater the accuracy of the final product. Geographic
facts that can be synthesized from multiple original
reports are likely to be more accurate than single
reports. This is of course the strategy used by
Wikipedia and its analogs: open contributions and
open editing are evidently capable of producing
reasonably accurate results when assisted by various
automated editing procedures.
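A minimal sketch of the synthesis idea follows, in Python with synthetic numbers of our own choosing (the coordinates, error levels, and number of reports are assumptions): combining many independent volunteered reports of a feature's position, here with a simple median, is typically more accurate than trusting any single report.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical true position of a mapped feature (lon, lat in degrees).
true_position = np.array([-83.0106, 40.0015])

# Simulate independent volunteered reports, each with positional error of
# roughly 50 m (~0.0005 degrees), plus one careless outlier.
reports = true_position + rng.normal(scale=0.0005, size=(20, 2))
reports[0] += 0.01  # one badly georeferenced contribution

single_report_error = np.linalg.norm(reports[0] - true_position)
crowd_estimate = np.median(reports, axis=0)  # robust to the outlier
crowd_error = np.linalg.norm(crowd_estimate - true_position)

print(f"error of a single report : {single_report_error:.5f} degrees")
print(f"error of crowd synthesis : {crowd_error:.5f} degrees")
```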
In the geographic case, however, several issues
arise that limit the success of the crowd solution.
Reports of events at some location may be difficult to
compare if the means used to specify location (place
names, street address, GPS) are uncertain, and if the
means used to describe the event are ambiguous.
Geographic facts may be obscure, such as the names
of mountains in remote parts of the world, and the
crowd may therefore have little interest or ability to
edit errors.
Goodchild and Li (2012) describe the social
solution as implementing a hierarchical structure of
volunteer moderators and gatekeepers. Individuals are
nominated to roles in the hierarchy based on their track
record of activity and the accuracy of their contribu-
tions. Volunteered facts that appear questionable or
contestable are referred up the hierarchy, to be
accepted, queried, or rejected as appropriate. Schemes
such as this have been implemented by many projects,
including OSM and Wikipedia. Their major disad-
vantage is speed: since humans are involved, the
solution is best suited to applications where time is not
critical.
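A toy sketch of such an escalation rule is shown below; the reputation threshold, routing labels, and data structure are illustrative assumptions rather than the workflow of any particular project.

```python
from dataclasses import dataclass

@dataclass
class Contributor:
    name: str
    accepted_edits: int  # track record of prior accepted contributions

def route_edit(contributor: Contributor, looks_questionable: bool) -> str:
    """Toy escalation rule for a volunteered geographic edit.

    Trusted contributors' routine edits are accepted outright; anything
    questionable, or anything from a newcomer, is referred up the
    hierarchy of volunteer moderators and gatekeepers.
    """
    if not looks_questionable and contributor.accepted_edits >= 100:
        return "accept"
    if not looks_questionable:
        return "review by local moderator"
    return "escalate to senior gatekeeper"

print(route_edit(Contributor("newcomer", 2), looks_questionable=True))
print(route_edit(Contributor("veteran", 450), looks_questionable=False))
```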
The third, the knowledge solution, asks how one
might know if a purported fact is false, or likely to be
false. Spelling errors and mistakes of syntax are simple
indicators which all of us use to triage malicious email.
In the geographic case, one can ask whether a
purported fact is consistent with what is already
known about the geographic world, in terms both of
facts and theories. Moreover such checks of consis-
tency can potentially be automated, allowing triage to
occur in close-to real time; this approach has been
implemented, although on a somewhat unstructured
basis, by companies that daily receive thousands of
volunteered corrections to their geographic databases.
A purported fact can deviate from established
geographic knowledge in either syntax or semantics,
or both. Syntax refers to the rules by which the world is
constructed, while semantics refers to the meaning of
those facts. Syntactical knowledge is often easier to
check than semantic knowledge. For example, Fig. 1
illustrates an example of syntactical geographic
knowledge. We know from engineering specifications
that an on-ramp can only intersect a freeway at a small
angle (typically 30 degrees or less). If a road-network
database appears to have on-ramp intersections of more
than 30 degrees, we know that the data are likely to be
wrong; in the case of Fig. 1, many of the apparent
intersections of the light-blue segments are more likely
to be overpasses or underpasses. Such errors have been
termed errors of logical consistency in the literature of
geographic information science (e.g., Guptill and
Morrison 1995).

[Fig. 1 Syntactical geographic knowledge: highway on-ramp feature geometry]
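The angle rule above is the kind of syntactic check that can be automated for close-to-real-time triage. Below is a minimal Python sketch under that rule; the function names, threshold constant, and test coordinates are illustrative, not part of any production system.

```python
import math

def intersection_angle(seg_a, seg_b):
    """Acute angle (degrees) between two road segments, each given as
    ((x1, y1), (x2, y2)) in a projected coordinate system."""
    (ax1, ay1), (ax2, ay2) = seg_a
    (bx1, by1), (bx2, by2) = seg_b
    va = (ax2 - ax1, ay2 - ay1)
    vb = (bx2 - bx1, by2 - by1)
    cos_t = abs(va[0] * vb[0] + va[1] * vb[1]) / (math.hypot(*va) * math.hypot(*vb))
    return math.degrees(math.acos(min(1.0, cos_t)))

MAX_RAMP_ANGLE = 30.0  # engineering rule of thumb quoted in the text

def flag_suspect_ramp(ramp_segment, freeway_segment):
    """Return True if a purported on-ramp/freeway intersection violates the
    syntactic rule and is more likely an overpass or underpass."""
    return intersection_angle(ramp_segment, freeway_segment) > MAX_RAMP_ANGLE

# A near-perpendicular "intersection" is flagged as likely erroneous.
print(flag_suspect_ramp(((0, 0), (10, 90)), ((0, 50), (100, 50))))   # True
print(flag_suspect_ramp(((0, 0), (100, 20)), ((0, 10), (100, 10))))  # False
```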
In contrast, Fig. 2 illustrates semantic geographic
knowledge: a photograph of a lake that has been linked
to the Google Earth map of The Ohio State University
campus. However, this photograph seems to be located
incorrectly: we recognize the scene as Mirror Lake, a
campus icon to the southeast of the purported location
indicated on the map. The purported location must be
wrong, but can we be sure? Perhaps the university
moved Mirror Lake to make way for a new Geography
building? Or perhaps Mirror Lake was so popular that
the university created a mirror Mirror Lake to handle
the overflow? We cannot immediately and with
complete confidence dismiss this empirical fact with-
out additional investigation since it does not violate
any known rules by which the world is constructed:
there is nothing preventing Mirror Lake from being
moved or mirrored. Of course, there are some
semantic facts that can be dismissed confidently as
absurd—one would not expect to see a lake scene on
the top of Mt. Everest or in the Sahara Desert.
Nevertheless, there is no firm line between clearly
absurd and non-absurd semantic facts—e.g., one
would not expect to see Venice or New York City in
the Mojave Desert, but Las Vegas certainly exists.
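Semantic triage can still be partially automated if, as the Mirror Lake example suggests, surprising facts are flagged for investigation rather than rejected outright. The sketch below compares a geotagged photo against a gazetteer entry for the named feature; the coordinates, tolerance, and gazetteer are hypothetical values of our own, not authoritative data.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical gazetteer entry and photo tag; coordinates are illustrative,
# not authoritative values for the actual campus.
gazetteer = {"Mirror Lake": (39.9986, -83.0136)}
photo = {"label": "Mirror Lake", "lat": 40.0021, "lon": -83.0179}

distance = haversine_m(photo["lat"], photo["lon"], *gazetteer[photo["label"]])

# A large displacement only triggers investigation, not automatic rejection,
# because a semantically surprising fact is not necessarily false.
TOLERANCE_M = 250.0
status = "needs investigation" if distance > TOLERANCE_M else "consistent"
print(f"{photo['label']}: {distance:.0f} m from gazetteer entry -> {status}")
```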
A major task for the knowledge solution is formal-
izing knowledge to support automated triage of
asserted facts and automated data fusion. Knowledge
can be derived empirically or as predictions from
theories, models, and simulations. In the latter case,
we may be looking for data at variance with predic-
tions as part of the knowledge-discovery and con-
struction processes.
There are at least two major challenges to
formalizing geographic knowledge. First, geographic
concepts such as neighborhood, region, the Midwest,
and developing nations can be vague, fluid, and
contested. A second challenge is the development of
explicit, formal, and computable representations of
geographic knowledge. Much geographic knowledge
is buried in formal theories, models, and equations
that must be solved or processed, or in informal
language that must be interpreted. In contrast,
knowledge-discovery techniques require explicit
representations such as rules, hierarchies, and con-
cept networks that can be accessed directly without
processing (Miller 2010).
[Fig. 2 Semantic geographic knowledge: Where is Mirror Lake? (Google Earth; last accessed 24 September 2013, 10:00 a.m. EDT)]
Correlations, not causality
Traditionally, scholarly research concerns itself with
knowing why something occurs. Correlations alone
are not sufficient, because the existence of correlation
does not imply that change in either variable causes
change in the other. In the correlation explored by
Openshaw and Taylor cited earlier (Openshaw and
Taylor 1979), the existence of a correlation between
the number of registered Republicans in a county and
the number of people aged 65 and over does not imply
that either one has a causal effect on the other. Over the
years, science has adopted pejorative phrases to
describe research that searches for correlations with-
out concern for causality or explanation: ‘‘curve-
fitting’’ comes to mind. Nevertheless correlations may
be useful for prediction, especially if one is willing to
assume that an observed correlation can be general-
ized beyond the specific circumstances in which it is
observed.
But while they may be sufficient, explanation and
causality are not necessary conditions for scientific
research: much research, especially in such areas as
spatial analysis, is concerned with advancing method,
whether its eventual use is for explanation or for
prediction. The literature of geographic information
science is full of tools that have been designed not for
finding explanations but for more mundane activities
such as detecting patterns, or massaging data for
visualization. Such tools are clearly valuable in an era
of data-driven science, where questions of "why" may
not be as important. In the next section we extend this
argument by taking up the broader question of the role
of theory in data-driven geography.
Theory in data-driven geography
In a widely discussed article published in Wired
magazine, Anderson called for the end of science as
we know it, claiming that the data deluge is making the
scientific method obsolete (Anderson 2008). Using
physics and biology as examples, he argued that as
science has advanced it has become apparent that
theories and models are caricatures of a deeper
underlying reality that cannot be easily explained.
However, explanation is not required for continuing
progress: as Anderson states, "Correlation supersedes
causation, and science can advance even without
coherent models, unified theories, or really any
mechanistic explanation at all."
Duncan Watts makes a similar argument about
theory in the social sciences, stating that unprece-
dented volumes of social data have the potential to
revolutionize our understanding of society, but this
understanding will not be in the form of general laws
of social science or cause-and-effect social relation-
ships. Although Watts suggests the limitations of
theory in the era of data-driven science, he does not
call for the end of theory but rather for a more modest
type of theory that would include general propositions
(such as what interventions work for particular social
problems) or how more obvious social facts fit
together to generate less obvious outcomes. Watts
links this approach to calls by sociologist Robert
Merton in the mid-twentieth century for middle-range
theories: theories that address identifiable social
phenomena instead of abstract entities such as the
entire social system (Watts 2011). Middle-range
theories are empirically grounded: they are based in
observations, and serve to derive hypotheses that can
be investigated. However, they are not endpoints:
rather, they are temporary stepping-stones to general
conceptual schemes that can encompass multiple
middle-range theories (Merton 1967).
Data-driven science seems to entail a shift away
from the general and towards the specific—away from
attempts to find universal laws that encompass all
places and times and towards deeper descriptions of
what is happening at particular places and times. There
are clearly some benefits to this change: as Batty
(2012) points out, urban science and planning in the
era of Scarce Data focused on radical and massive
changes to cities over the long-term, with little
concern for small spaces and local movements.
Data-driven urban science and planning can rectify
some of the consequent urban ills by allowing greater
focus on the local and routine. However, over longer
time spans and wider spatial domains the local and
routine merges into the long-term; a fundamental
scientific challenge is how local and short-term Big
Data can inform our understanding of processes over
longer temporal and spatial horizons; in short, the
problem of generalization.
Geography has long experience with partner-
ships—and tensions—between nomothetic (law-seek-
ing) and idiographic (description-seeking) knowledge
(Cresswell 2013). Table 1 provides a summary. The
early history of geography in the time of Strabo (64/63
BCE–24 CE) and Ptolemy (90–168 CE) involved both
generalizations about the Earth and intimate descrip-
tions of specific places and regions; these were two
sides of the same coin. Bernhardus Varenius
(1622–1650) conceptualized geography as consisting
of general (scientific) and special (regional) knowl-
edge, although he considered the latter to be subsidiary
to the former (Warntz 1989; Goodchild et al. 1999).
Alexander von Humboldt (1769–1859) and Carl Ritter
(1779–1859), often regarded as the founders of
modern geography, tried to derive general laws
through careful measurement of geographic phenom-
ena at particular locations and times. In more recent
times, the historic balance between nomothetic and
idiographic geographic knowledge has become more
unstable. The early twentieth century witnessed the
dominance of nomothetic geography in the guise of
environmental determinism in the early 1900s,
followed by a backlash against its abuses and the
subsequent rise of idiographic geography in the form
of areal differentiation: Richard Hartshorne famously
declared in The Nature of Geography that the only law
in geography is that all areas are unique (Hartshorne
1939). The dominance of idiographic geography and
the concurrent crisis in American academic geography
(in particular, the closing of Harvard’s geography
program in 1948; Smith 1992) led to the Quantitative
Revolution of the 1950s and 1960s, with geographers
such as Fred Schaefer, William Bunge, Peter Haggett,
and Edward Ullman asserting that geography should
be a law-seeking science that answers the question
‘‘why?’’ rather than building a collection of facts
describing what is happening in particular regions.
Physical geographers have—perhaps wisely—disen-
gaged themselves from these debates, but the tension
between nomothetic and idiographic approaches per-
sists in human geography (see Cresswell 2013;
DeLyser and Sui 2013; Schuurman 2000; Sui 2004;
Sui and DeLyser 2012).
However, attempts to reconcile nomothetic and
idiographic knowledge did not die with Humboldt and
Ritter. Approaches such as time geography seek to
capture context and history and recognize the roles of
both agency and structure in human behavior (Cres-
swell 2013). In spatial analysis, the trend towards local
statistics, exemplified by Geographically Weighted
Regression (Fotheringham et al. 2002) and Local
Indicators of Spatial Association (Anselin 1995),
represents a compromise in which the general princi-
ples of nomothetic geography are allowed to express
themselves differently across geographic space.
Goodchild (2004) has characterized GIS as combining
the nomothetic, in its software and algorithms, with
the idiographic in its databases.
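The spirit of such local statistics can be conveyed with a short numerical sketch: a toy, distance-weighted regression in Python, with synthetic data, a Gaussian kernel, and an arbitrary bandwidth that are all our own assumptions and not the GWR method of Fotheringham et al. It shows a general relationship whose coefficient is allowed to vary across space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illustration: 200 locations where the "effect" of x on y drifts
# smoothly across space, so a single global coefficient misses the pattern.
coords = rng.uniform(0, 100, size=(200, 2))
x = rng.normal(size=200)
true_beta = 1.0 + coords[:, 0] / 100.0          # coefficient varies west-east
y = 2.0 + true_beta * x + rng.normal(scale=0.2, size=200)

def local_coefficients(target, coords, x, y, bandwidth=20.0):
    """Weighted least squares at one location with a Gaussian distance
    kernel, in the spirit of geographically weighted regression."""
    d = np.linalg.norm(coords - target, axis=1)
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    X = np.column_stack([np.ones_like(x), x])
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # [intercept, slope]

west = local_coefficients(np.array([5.0, 50.0]), coords, x, y)
east = local_coefficients(np.array([95.0, 50.0]), coords, x, y)
print(f"local slope in the west: {west[1]:.2f}")   # close to 1.0
print(f"local slope in the east: {east[1]:.2f}")   # close to 2.0
```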
In a sense, the paths to geographic knowledge
engendered by data-intensive approaches such as time
geography, disaggregate spatial statistics, and GIScience
are a return to the early foundation of geography,
where neither law-seeking nor description-seeking
was privileged. Geographic generaliza-
tions and laws are possible but space matters: spatial
dependency and spatial heterogeneity create local
context that shapes physical and human processes as
they evolve on the surface of the Earth. Geographers
have believed this for a long time, but this belief is also
supported by recent breakthroughs in complex sys-
tems theory, which suggests that patterns of local
interactions lead to emergent behaviors that cannot be
understood in isolation at either the local or global
levels. Understanding the interactions among agents
within an environment is the scientific glue that binds
the local with the global (Flake 1998).
In short, data-driven geography is not necessarily a
radical break with the geographic tradition: geography
has a longstanding belief in the value of idiographic
knowledge by itself as well as its role in constructing
nomothetic knowledge. Although this belief has been
tenuous and contested at times, data-driven geography
may provide the paths between idiographic and
nomothetic knowledge that geographers have been
seeking for two millennia. However, while complexity
theory supports this belief, it also suggests that this
knowledge may have inherent limitations: emergent
behavior is by definition surprising.

Table 1 A brief history of partnerships and tensions between
nomothetic (law-seeking) and idiographic (description-seeking)
knowledge in geographic thought

Path to geographic knowledge    Advocates
Nomothetic ↔ idiographic        Strabo; Ptolemy
Nomothetic → idiographic        Varenius
Nomothetic ← idiographic        Humboldt; Ritter
Idiographic                     Hartshorne
Nomothetic                      Schaefer
Nomothetic ↔ idiographic        Hägerstrand (time geography);
                                Fotheringham and Anselin (local spatial statistics);
                                Tomlinson and Goodchild (GIScience)
Approaches to data-driven geography
If we accept the premise—at least until proven
otherwise—that Big Data and data-driven science
harmonize with longstanding themes and beliefs in
geography, the question that follows is: how can data-
driven approaches fit into geographic research? Data-
driven approaches can support both geographic
knowledge-discovery and spatial modeling. However,
there are some challenges and cautions that must be
recognized.
Data-driven geographic knowledge discovery
Geographic knowledge-discovery refers to the initial
stage of the scientific process where the investigator
forms his or her conceptual view of the system,
develops hypotheses to be tested, and performs
groundwork to support the knowledge-construction
process. Geographic data facilitate this crucial phase
of the scientific process by supporting activities such
as study-site selection and reconnaissance, ethnogra-
phy, experimental design, and logistics.
Perhaps the most transformative impact of data-
driven science on geographic knowledge-discovery
will be through data-exploration and hypothesis
generation. Similar to a telescope or microscope,
systems for capturing, storing, and processing massive
amounts of data can allow investigators to augment
their perceptions of reality and see things that would
otherwise be hidden or too faint to perceive. From this
perspective, data-driven science is not necessarily a
radically new approach, but rather a way to enhance
inference for the longstanding processes of explora-
tion and hypothesis generation prior to knowledge-
construction through analysis, modeling, and verifi-
cation (Miller 2010).
Data-driven knowledge-discovery has a philo-
sophical foundation: abductive reasoning, a form of
inference articulated by astronomer and mathemati-
cian C. S. Peirce (1839–1914). Abductive reasoning
starts with data describing something and ends with
a hypothesis that explains the data. It is a weaker
form of inference relative to deductive or inductive
reasoning: deductive reasoning shows that X must
be true, inductive reasoning shows that X is true,
while abductive reasoning shows only that X may be
true. Nevertheless, abductive reasoning is critically
important in science, particularly in the initial
discovery stage that precedes the use of deductive
or inductive approaches to knowledge-construction
(Miller 2010).
Abductive reasoning requires four capabilities: (1)
the ability to posit new fragments of theory; (2) a
massive set of knowledge to draw from, ranging from
common sense to domain expertise; (3) a means of
searching through this knowledge collection for
connections between data patterns and possible expla-
nations, and; (4) complex problem-solving strategies
such as analogy, approximation, and guesses. Humans
have proven to be more successful than machines in
performing these complex tasks, suggesting that data-
driven knowledge-discovery should try to leverage
these human capabilities through methods such as
geovisualization rather than try to automate the
discovery process. Gahegan (2009) envisions a
human-centered process where geovisualization
serves as the central framework for creating chains
of inference among abductive, inductive, and deduc-
tive approaches in science, allowing more interactions
and synergy among these approaches to geographic
knowledge building.
One of the problems with Big Data is the size and
complexity of the information space implied by a
massive multivariate database. A good data-explora-
tion system should generate all of the interesting
patterns in a database, but only the interesting ones to
avoid overwhelming the analyst. Two ways to manage
the large number of potential patterns are background
knowledge and interestingness measures. Background
knowledge guides the search for patterns by repre-
senting accepted knowledge about the system to focus
the search for novel patterns. In contrast, we can use
interestingness measures a posteriori to filter spurious
patterns by rating each pattern based on dimensions
such as simplicity, certainty, utility, and novelty.
Patterns with ratings below a user-specified threshold
are discarded or ignored (Miller 2010). Both of these
approaches require formalization of geographic
knowledge, a challenge discussed earlier in this paper.
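An interestingness filter of the kind described above can be sketched in a few lines of Python. The patterns, scoring dimensions, weights, and threshold below are all illustrative assumptions; the point is simply that ratings on dimensions such as certainty, novelty, and utility can be combined and used a posteriori to discard obvious or spurious patterns.

```python
# A toy a posteriori filter: each discovered pattern is scored on several
# interestingness dimensions, and patterns whose combined rating falls
# below a user-specified threshold are dropped.
patterns = [
    {"name": "retail clusters near transit stops", "certainty": 0.9, "novelty": 0.2, "utility": 0.7},
    {"name": "tweets mention rain when it rains",   "certainty": 0.99, "novelty": 0.05, "utility": 0.1},
    {"name": "unexpected weekday flows to exurbs",  "certainty": 0.7, "novelty": 0.9, "utility": 0.8},
]

WEIGHTS = {"certainty": 0.3, "novelty": 0.4, "utility": 0.3}
THRESHOLD = 0.5  # user-specified cut-off

def interestingness(p):
    return sum(WEIGHTS[k] * p[k] for k in WEIGHTS)

kept = [p["name"] for p in patterns if interestingness(p) >= THRESHOLD]
print(kept)  # the obvious 'rain' pattern is filtered out
```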
Data-driven modeling
Traditional approaches to modeling are deductive: the
scientist develops (or modifies or borrows) a theory
and derives a formal representation that can be
manipulated to generate predictions about the real
world that can be tested with data. Theory-free
modeling, on the other hand, builds models based on
induction from data rather than through deduction
from theory.
The field of economics has flirted with data-driven
modeling in the form of general-to-specific modeling
(Miller 2010). In this strategy, the researcher starts
with the most complex model possible and reduces it
to a more elegant one based on data, founded on the
belief that, given enough data, only the true specifi-
cation will survive a sufficiently stringent battery of
statistical tests designed to pare variables from the
model. This contrasts with the traditional specific-to-
general strategy where one starts with a spare model
based on theory and conservatively builds a more
complex model (Hoover and Perez 1999). However,
this approach is controversial, with some arguing that
given the enormous number of potential models one
would have to be very lucky to encompass the true
model within the initial, complex model. Therefore,
these critics argue, predictive performance is the only
relevant criterion; explanation is irrelevant (Hand 1999).
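A stylized sketch of the general-to-specific strategy is shown below. It uses synthetic data and a simple fit-degradation criterion of our own choosing in place of a formal battery of statistical tests: starting from every candidate variable, it repeatedly drops the one whose removal hurts the fit least.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: ten candidate predictors, only two actually matter.
n, k = 500, 10
X = rng.normal(size=(n, k))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=n)

def fit_sse(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

# General-to-specific: start with the most complex model and repeatedly drop
# the variable whose removal degrades fit the least, stopping when any
# further drop would raise the error noticeably (the 1% threshold is an
# assumption, not a recommendation).
active = list(range(k))
while len(active) > 1:
    base = fit_sse(X[:, active], y)
    trials = [(fit_sse(X[:, [j for j in active if j != i]], y), i) for i in active]
    best_sse, drop = min(trials)
    if (best_sse - base) / base > 0.01:
        break
    active.remove(drop)

print("surviving predictors:", active)  # expect [0, 3]
```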
Geography has also witnessed attempts at theory-
free modeling, also not without controversy. Stan
Openshaw is a particularly strong advocate for using
the power of computers to build models from data:
examples include the Geographical Analysis Machine
(GAM) for spatial clustering of point data, and
automated systems for spatial interaction modeling.
GAM uses a technique that generates local clusters or
‘‘hot spots’’ without requiring a priori theory or
knowledge about the underlying statistical distribu-
tion. GAM searches for clusters by systematically
expanding circular search regions from locations within a
lattice. The system saves circles with observed counts
greater than expected and then systematically varies
the radii and lattice resolution to begin the search
again. The researcher does not need to hypothesize or
have any prior expectations regarding the spatial
distribution of the phenomenon: the system searches,
in a brute-force manner, all possible (or reasonable, at
least) spatial resolutions and neighborhoods (Charlton
2008; Openshaw et al. 1987).
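A brute-force scan loosely in the spirit of GAM can be sketched as follows; the synthetic point pattern, lattice spacing, radii, and the simple excess-over-expectation rule are all assumptions made for illustration and do not reproduce Openshaw's actual significance testing.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic point data: background events plus one dense cluster near (70, 30).
background = rng.uniform(0, 100, size=(400, 2))
cluster = rng.normal(loc=(70, 30), scale=2.0, size=(60, 2))
points = np.vstack([background, cluster])

def gam_like_scan(points, radii=(5, 10, 15), spacing=10, excess=3.0):
    """Lay a lattice over the study area, count points in circles of several
    radii, and keep circles whose counts exceed the expectation under
    spatial uniformity by a chosen factor."""
    area = 100.0 * 100.0
    density = len(points) / area
    hits = []
    for r in radii:
        expected = density * np.pi * r * r
        for cx in np.arange(0, 101, spacing):
            for cy in np.arange(0, 101, spacing):
                count = np.sum(np.hypot(points[:, 0] - cx, points[:, 1] - cy) <= r)
                if count > excess * expected:
                    hits.append((cx, cy, r, int(count)))
    return hits

for cx, cy, r, count in gam_like_scan(points):
    print(f"possible cluster at ({cx:.0f}, {cy:.0f}), radius {r}: {count} points")
```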
GAM is arguably an exploratory technique, while
Openshaw’s automated system for exploring a uni-
verse of possible spatial interaction models leaps more
into the traditional realm of deductive modeling. The
automated system uses genetic programming to breed
spatial interaction models from basic elements such as
the model variables (e.g., origin inflow and destination
outflow totals, travel cost, intervening opportunities),
functional forms (e.g., square root, exponential),
parameterizations, and binary operators (add, subtract,
multiply and divide) using goodness-of-fit as a
criterion (Diplock 1998; Openshaw 1988).
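Openshaw's system used genetic programming; the toy sketch below substitutes plain random search over a tiny universe of gravity-type forms simply to show what "breeding" models by goodness of fit means in practice. The synthetic flows, candidate decay functions, and parameter ranges are all assumptions and are far simpler than the model grammar Openshaw actually explored.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic interaction data: flows between 12 origins and 12 destinations
# generated from a known gravity model with exponential distance decay.
n = 12
O = rng.uniform(50, 500, n)          # origin outflow propensities
D = rng.uniform(50, 500, n)          # destination attractiveness
cost = rng.uniform(1, 20, (n, n))    # travel costs
flows = 0.02 * np.outer(O, D) * np.exp(-0.15 * cost)

decay_forms = {
    "exponential": lambda c, b: np.exp(-b * c),
    "power":       lambda c, b: c ** (-b),
    "square root": lambda c, b: np.sqrt(c) ** (-b),
}

def evaluate(form, alpha, gamma, beta):
    basis = (O[:, None] ** alpha) * (D[None, :] ** gamma) * decay_forms[form](cost, beta)
    k = np.sum(flows * basis) / np.sum(basis * basis)   # best scaling constant
    return float(np.sum((flows - k * basis) ** 2))

best = None
for _ in range(5_000):
    candidate = (rng.choice(list(decay_forms)),
                 rng.uniform(0.2, 2.0), rng.uniform(0.2, 2.0), rng.uniform(0.01, 0.5))
    sse = evaluate(*candidate)
    if best is None or sse < best[0]:
        best = (sse, candidate)

print("best goodness of fit (SSE):", round(best[0], 2))
print("best model form found     :", best[1])
```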
One challenge in theory-free modeling is that it
takes away a powerful mechanism for improving the
effectiveness of a search for an explanatory model—
namely, theory. Theory tells us where to look for
explanation, and (perhaps more importantly) where
not to look. In the specific case of spatial interaction
modeling, for example, the need for models to be
dimensionally consistent can limit the options, though
the possibility of dimensional analysis (Gibbings
2011) was not employed in Openshaw’s work. The
information space implied by a universe of potential
models can be enormous even in a limited domain
such as spatial interaction. Powerful computers and
clever search techniques can certainly improve our
chances (Gahegan 2000). But as the volume, variety,
and velocity of data increase, the size of the informa-
tion spaces for possible models also increases, leading
to a type of arms race with perhaps no clear winner.
A second challenge in data-driven modeling is that
the data drive the form of the model, meaning there is
no guarantee that the same model will result from a
different data set. Even given the same data set, many
different models could be generated that fit the data,
meaning that slight alterations in the goodness-of-fit
criterion used to drive model selection can produce
very different models (Fotheringham 1998). This is
essentially the problem of statistical overfitting, a
well-known problem with inductive techniques such
as artificial neural networks and machine learning.
However, despite methods and strategies to avoid
overfitting, it appears to be endemic: some estimate
that three-quarters of the published scientific papers in
machine learning are flawed due to overfitting (The
Economist 19 October 2013).
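Overfitting is easy to demonstrate with a held-out test set, as in the small Python illustration below; the synthetic data and polynomial degrees are our own assumptions, and the point is only that a model flexible enough to fit noise in the training data tends to predict held-out data worse than a simpler one.

```python
import numpy as np

rng = np.random.default_rng(5)

# A smooth underlying relationship observed with noise.
x = np.sort(rng.uniform(-1, 1, 40))
y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=40)

# Hold out half the data; whatever the model "learns" from noise in the
# training half tends to show up as worse error on the held-out half.
x_train, x_test = x[::2], x[1::2]
y_train, y_test = y[::2], y[1::2]

for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_err:.3f}, test MSE {test_err:.3f}")
```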
A third challenge in theory-free modeling is the
complexity of resulting models. Traditional model
building in science uses parsimony as a guiding
principle: the best model is the one that explains the
most with the least. This is sometimes referred to as
"Occam's Razor": given two models with equal
validity, the simpler model is better. Model interpre-
tation is an informal but key test: the model builder
must be able to explain what the model results say
about reality. Models derived computationally from
data and fine-tuned based on feedback from predic-
tions can generate reliable predictions from processes
that are too complex for the human brain to comprehend (Townsend
2013; Weinberger 2011). For example, Openshaw’s
automated system for breeding spatial interaction
models has been known to generate very complex,
non-intuitive models (Fotheringham 1998), many of
which are also dimensionally inconsistent. Figure 3
illustrates some of the spatial interaction models
generated by Openshaw's automated system; as can
be seen, they defy easy comprehension.

[Fig. 3 Three of the spatial interaction models generated by Openshaw's automated modeling system (Openshaw 1988)]
The knowledge from data-driven models can be
complex and non-compressible: the data are the
explanation. But if the explanation is not understand-
able, do we really have an explanation? Perhaps the
nature of explanation is evolving. Perhaps computers
are fundamental in data-driven science not only for
discovering but also for representing complex patterns
that are beyond human comprehension. Perhaps this is
a temporary stopgap until we achieve convergence
between human and machine intelligence as some
predict (Kurzweil 1999). While we cannot hope to
resolve this question (or its philosophical implica-
tions) within this paper, we can add a cautionary note
from Nate Silver: telling stories about data instead of
reality is dangerous and can lead to mistaking noise for
signal (Silver 2012).
A final challenge in data-driven spatial modeling is
de-skilling: a loss of modeling and analysis skills.
While allocating mundane tasks to computers frees
humans to perform sophisticated activities, there are
times when mundane skills become crucial. For
example, there are documented cases of airline pilots
who, due to a lack of manual flying experience, reacted
badly in emergencies when the autopilot shut off
(Carr 2013). Although rarely life-threatening, one
could make a similar argument about automatic model
building: if a data-driven modeling process generates
anomalous results, will the analyst be able to deter-
mine if they are artifacts or genuine? With Open-
shaw’s automated spatial interaction modeling
system, the analyst may become less skilled at spatial
interaction modeling and more skilled at combinato-
rial optimization techniques. While these skills are
valuable and may allow the analyst to reach greater
scientific heights, they are another level removed from
the empirical system being modeled. However, the
more anomalous the results, the deeper the thinking
required.
A solution to de-skilling is to force the skill: require
it as part of education and certification, or design
software that encourages or requires analysts to
maintain some basic skills. However, this is a difficult
case to make compared to the hypnotic call of
sophisticated methods with user-friendly interfaces
(Carr 2013). Re-reading Jerry Dobson’s prescient
essay on automated geography thirty years later
(Dobson 1983), one is impressed by the number of
activities in geography that used to be painstaking
but are now push-button. Geographers of a certain age
may recall courses in basic and production cartogra-
phy without much nostalgia. What skills that we
consider essential today will be considered the pen,
ink, and lettering kits of tomorrow? What will we
lose?
Conclusion
The context for geographic research has shifted from a
data-scarce to a data-rich environment, in which the
most fundamental changes are not the volume of data,
but the variety and the velocity at which we can
capture georeferenced data. A data-driven geography
may be emerging in response to the wealth of
georeferenced data flowing from sensors and people
in the environment. Some of the issues raised by data-
driven geography have in fact been longstanding
issues in geographic research, namely, large data
volumes, dealing with populations and messy data,
and tensions between idiographic and nomothetic
knowledge. However, the belief that spatial context
matters is a major theme in geographic thought and a
major motivation behind approaches such as time
geography, disaggregate spatial statistics, and GIScience.
There is potential to use Big Data to inform
both geographic knowledge-discovery and spatial
modeling. However, there are challenges, such as
how to formalize geographic knowledge to clean data
and to ignore spurious patterns, and how to build data-
driven models that are both true and understandable.
Cautionary notes need to be sounded about the
impact of data-driven geography on broader society
(see Mayer-Schonberger and Cukier 2013). We must
be cognizant of where this research is occurring—
in the open light of scholarly research, where peer
review and reproducibility are possible, or behind the
closed doors of private-sector companies and govern-
ment agencies, as proprietary products without peer
review and without full reproducibility. Privacy is a
vital concern, not only as a human right but also as a
potential source of backlash that will shut down data-
driven research. We must be careful to avoid pre-
crimes and pre-punishments (Zedner 2010):
categorizing and reacting to people and places based
on potentials derived from correlations rather than
actual behavior. Finally, we must avoid a data
dictatorship: data-driven research should support, not
replace, decision-making by intelligent and skeptical
humans. Some of the other papers in this special issue
explore these challenges in depth.
References
Anderson, C. (2008). The end of theory: The data deluge makes
the scientific method obsolete. Wired, 16(7).
Anselin, L. (1995). Local indicators of spatial association:
LISA. Geographical Analysis, 27(2), 93–115.
Batty, M. (2012). Smart cities, big data. Environment and
Planning B, 39(2), 191–193.
Butler, D. (2008). Web data predict flu. Nature, 456, 287–288.
Carr, N. (2013) The great forgetting. The Atlantic, pp. 77–81.
Cetin, N., Nagel, K., Raney, B., & Voellmy, A. (2002). Large-
scale multi-agent transportation simulations. Computer
Physics Communications, 147(1–2), 559–564.
Charlton, M. (2008). Geographical Analysis Machine (GAM).
In K. Kemp (Ed.), Encyclopedia of Geographic Informa-
tion Science (pp. 179–180). London: Sage.
Cresswell, T. (2013). Geographic thought: A critical introduc-
tion. New York: Wiley-Blackwell.
DeLyser, D., & Sui, D. (2013). Crossing the qualitative-quan-
titative divide II: Inventive approaches to big data, mobile
methods, and rhythmanalysis. Progress in Human Geog-
raphy, 37(2), 293–305.
Diplock, G. (1998). Building new spatial interaction models by
using genetic programming and a supercomputer. Envi-
ronment and Planning A, 30(10), 1893–1904.
Dobson, J. E. (1983). Automated geography. The Professional
Geographer, 35, 135–143.
Dumbill, E. (2012). What is big data? An introduction to the big
data landscape, http://strata.oreilly.com/2012/01/what-is-
big-data.html. Last accessed 17 April 2014.
Flake, G. W. (1998). The computational beauty of nature:
computer explorations of fractals, chaos, complex systems,
and adaptation. Cambridge: MIT Press.
Fotheringham, A. S. (1998). Trends in quantitative methods II:
Stressing the computational. Progress in Human Geogra-
phy, 22(2), 283–292.
Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002).
Geographically weighted regression: The analysis of
spatially varying relationships. Chichester: Wiley.
Gahegan, M. (2000). On the application of inductive machine
learning tools to geographical analysis. Geographical
Analysis, 32(1), 113–139.
Gahegan, M. (2009). Visual exploration and explanation in
geography: Analysis with light. In H. J. Miller & J. Han
(Eds.), Geographic data mining and knowledge discovery
(2nd ed., pp. 291–324). London: Taylor and Francis.
Gibbings, J. C. (2011). Dimensional analysis. New York:
Springer.
Glaser, B. G., & Strauss, A. L. (1967). The discovery of
grounded theory. Chicago: Aldine.
Goffman, E. (1959). The presentation of self in everyday life.
New York: Anchor Books.
Goodchild, M. F. (2004). GIScience, geography, form, and
process. Annals of the Association of American Geogra-
phers, 94(4), 709–714.
Goodchild, M. F. (2007). Citizens as sensors: The world of
volunteered geography. GeoJournal, 69(4), 211–221.
Goodchild, M. F., Egenhofer, M. J., Kemp, K. K., Mark, D. M.,
& Sheppard, E. (1999). Introduction to the Varenius pro-
ject. International Journal of Geographical Information
Science, 13(8), 731–745.
Goodchild, M. F., & Li, L. (2012). Assuring the quality of
volunteered geographic information. Spatial Statistics, 1,
110–120. doi:10.1016/j.spasta.2012.03.002.
Graham, M., & Shelton, T. (2013). Geography and the future of
big data, big data and the future of geography. Dialogues in
Human Geography, 3(3), 255–261.
Guptill, S. C., & Morrison, J. L. (Eds.). (1995). Elements of
spatial data quality. Oxford: Elsevier.
Haklay, M. (2010). How good is volunteered geographical
information? A comparative study of OpenStreetMap and
Ordnance Survey datasets. Environment and Planning B:
Planning and Design, 37(4), 682–703.
Hand, D. J. (1999). Discussion contribution on ‘data mining
reconsidered: Encompassing and the general-to-specific
approach to specification search’ by Hoover and Perez.
Econometrics Journal, 2(2), 241–243.
Hartshorne, R. (1939). The nature of geography: A critical
survey of current thought in the light of the past. Wash-
ington, DC: Association of American Geographers.
Hey, T., Tansley, S., & Tolle, K. (Eds.). (2009). The fourth
paradigm: Data-intensive scientific discovery. Redmond, WA: Microsoft Research.
Hoover, K. D., & Perez, S. J. (1999). Data mining reconsidered:
Encompassing and the general-to-specific approach to
specification search. Econometrics Journal, 2(2), 167–191.
Kitchin, R. (2014). Big data and human geography: Opportu-
nities, challenges and risks. Dialogues in Human Geog-
raphy, 3(3), 262–267.
Kurzweil, R. (1999). The age of spiritual machines: when
computers exceed human intelligence. New York: Vintage.
Mayer-Schonberger, V., & Cukier, K. (2013). Big data: A revo-
lution that will transform how we live, work, and think. Boston: Houghton Mifflin Harcourt.
Merton, R. K. (1967). On sociological theories of the middle
range. In R. K. Merton (Ed.), On theoretical sociology (pp.
39–72). New York: The Free Press.
Miller, H. J. (2007). Place-based versus people-based geo-
graphic information science. Geography Compass, 1(3),
503–535.
Miller, H. J. (2010). The data avalanche is here. Shouldn’t we be
digging? Journal of Regional Science, 50(1), 181–201.
O’Leary, M. (2012). Eurovision statistics: post-semifinal
update, Cold Hard Facts (May 23). Available: http://
mewo2.com/nerdery/2012/05/23/eurovision-statistics-
post-semifinal-update/. Accessed October 25, 2013.
Openshaw, S. (1988). Building an automated modeling system
to explore a universe of spatial interaction models. Geo-
graphical Analysis, 20(1), 31–46.
Openshaw, S., Charlton, M., Wymer, C., & Craft, A. (1987).
A Mark I geographical analysis machine for the automated
analysis of point data sets. International Journal of Geo-
graphical Information Systems, 1(4), 335–358.
Openshaw, S., & Taylor, P. J. (1979). A million or so correlation
coefficients: three experiments on the modifiable areal unit
problem. In N. Wrigley (Ed.), Statistical methods in the
social sciences (pp. 127–144). London: Pion.
Preis, T., Moat, H. S., & Stanley, H. E. (2013). Quantifying
trading behavior in financial markets using Google Trends.
Scientific Reports, 3(1684). doi:10.1038/srep01684.
Raymond, E. S. (2001). The cathedral and the bazaar: Musings
on linux and open source by an accidental revolutionary.
Sebastopol: O’Reilly Media.
Schuurman, N. (2000). Trouble in the heartland: GIS and its
critics in the 1990s. Progress in Human Geography, 24(4),
569–589.
Silver, N. (2012). The signal and the noise: Why most predic-
tions fail—but some don't. New York: Penguin Press.
Smith, N. (1992). History and philosophy of geography: Real
wars, theory wars. Progress in Human Geography, 16(2),
257–271.
Sui, D. (2004). GIS, cartography, and the "Third Culture":
Geographic imaginations in the computer age. Profes-
sional Geographer, 56(1), 62–72.
Sui, D., & DeLyser, D. (2012). Crossing the qualitative-quan-
titative chasm I: Hybrid geographies, the spatial turn, and
volunteered geographic information (VGI). Progress in
Human Geography, 36(1), 111–124.
Sui, D., & Goodchild, M. F. (2011). The convergence of GIS and
social media: Challenges for GIScience. International
Journal of Geographical Information Science, 25(11),
1737–1748.
Sui, D., Goodchild, M. F., & Elwood, S. (2013). Volunteered
geographic information, the exaflood, and the growing
digital divide. In D. Sui, S. Elwood, & M. F. Goodchild
(Eds.), Crowdsourcing geographic knowledge (pp. 1–12).
New York: Springer.
Taleb, N. N. (2007). The black swan: The impact of the highly
improbable. New York: Random House.
The Economist. (19 October 2013). Trouble at the lab,
pp. 26–30.
Townsend, A. (2013). Smart cities: Big data, civic hackers, and
the quest for a new utopia. New York: Norton.
Tsou, M. H., Yang, J. A., Lusher, D., Han, S., Spitzberg, B.,
Gawron, J. M., et al. (2013). Mapping social activities and
concepts with social media (Twitter) and web search
engines (Yahoo and Bing): a case study in 2012 US Pres-
idential Election. Cartography and Geographic Informa-
tion Science, 40(4), 337–348.
Waldrop, M. M. (1990). Learning to drink from a fire hose.
Science, 248(4956), 674–675.
Warntz, W. (1989). Newton, the Newtonians, and the Geogra-
phia Generalis Varenii. Annals of the Association of
American Geographers, 79(2), 165–191.
Watts, D. J. (2011). Everything is obvious: Once you know the
answer. New York: Crown Business.
Weinberger, D. (2011). The machine that would predict the
future, Scientific American, November 15, 2011. http://
www.scientificamerican.com/article.cfm?id=the-machine-
that-would-predict.
Zedner, L. (2010). Pre-crime and pre-punishment: a health
warning. Criminal Justice Matters, 81(1), 24–25.