Sticky data – context and friction in the use of urban data proxies

Dietmar Offenhuber
What makes data meaningful? Is meaning hidden in the values and attributes of a data set or in
the circumstances in which the data were collected and generated? In his general definition of
information, philosopher Luciano Floridi (2011) describes information as data plus meaning,
referring to a single datum as a lack of uniformity in the broadest sense. This definition implies
that a data set by itself is not necessarily meaningful, since well-formed data can be generated
from random numbers devoid of meaning. Only the imperfections of a random generator lead to
biases and artefacts that can become meaningful when forensically analysed for breaking codes
or identifying machines.
How does meaning emerge from data? A reasonable assumption would be that meaning
is grounded in the numbers and symbols in a data set and the extent to which they serve as a
useful representation of a phenomenon in the world. However, many scholars in the humanities
will also point to the circumstances and conditions in which data were collected and which are
usually not completely represented in data and metadata records. They remind us that data are
already interpreted expressions (Drucker 2011), always already cooked, never raw (Bowker
2005, 183; Gitelman 2013). If data are understood as systematic observations that have been
symbolically encoded and stored in some material form, it becomes clear that assembling a data
set involves many human decisions. These decisions concern the aspects of a phenomenon to be
observed, the method of observation, the rubrics and classification systems to encode the
observations, and finally, how these encodings are to be stored or transmitted in a physical
medium. Some of these decisions might have been provisional, and their justifications might
have been forgotten once the collection mechanism is in place. When the circumstances of data
generation are lost and we only have a data set without contextual information, we might end up
with little in our hands.
Cities and public institutions generally keep accurate documentation specifying the
circumstances under which they collect, encode, and store their data records. But this is not
always the case when data are the by-product of automatic mechanisms or are collected by private
companies that do not disclose their data sources and methods. The value of data is also
diminished when data and metadata are separated, as we are beginning to see in open data
portals, where data sets are often presented without explanation of their institutional context.
Analysts, administrators and the public might not always be aware of the implicit
assumptions embedded in a data set and may be tempted to take data at face value. For example, block-
level socio-economic data from the American Community Survey (ACS) is frequently used
without considering the significant error ranges resulting from the small sample size,
conveniently ignoring the values in the error column. In many cases, however, researchers are
very familiar with these issues, but choose to live with the limitations, biases, and uncertainties
of a data set in the absence of more reliable sources. Coining the phrase “data friction,” historian
Paul Edwards describes the struggles that ensue when attempting to move data and metadata
from one format, organisation, scale, or context to another (Edwards 2010).
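The ACS reliability problem mentioned above can be made concrete. Margins of error in ACS tables are published at the 90% confidence level, and a common rule of thumb converts them into a coefficient of variation to judge whether an estimate is usable. The function, the figures, and the 15% threshold below are illustrative conventions, not part of the survey itself:

```python
# Sketch: judging the reliability of an ACS estimate from its margin of error.
# ACS margins of error are published at the 90% confidence level; dividing by
# 1.645 converts them to a standard error. Numbers and threshold are invented.

def coefficient_of_variation(estimate, margin_of_error, z=1.645):
    """Return the coefficient of variation (in percent) of an ACS estimate."""
    if estimate == 0:
        return float("inf")
    standard_error = margin_of_error / z
    return 100 * standard_error / estimate

# A block-group median income of $45,000 with a published MOE of +/- $12,000:
cv = coefficient_of_variation(45000, 12000)
print(f"CV = {cv:.1f}%")   # roughly 16%
if cv > 15:                # a common, but arbitrary, rule of thumb
    print("estimate is too noisy for small-area comparisons")
```

A researcher who "conveniently ignores the error column" is, in effect, skipping this check.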
Data frictions concern all forms of data, including those relating to cities and urban life,
and manifest in various ways. Due to differing definitions of, for example, urban areas, data sets
from international agencies such as the UN and the World Bank often remain incompatible. Even
within the same program, methodological details can change over time and lead to data frictions.
Such is the case in the US Census, where changing geographic boundaries require considerable
effort to harmonise data from different decades (companies such as Geolytics provide harmonized
data products based on the US census). Finally, the assumptions and definitions underlying urban
data can be shaped by political agendas that are not always obvious (Litman, 2014).
Despite these difficulties and limitations, data sets, once collected, can develop a life of
their own, and may become useful in ways initially not anticipated. After all, if the value of data
were strictly determined by the circumstances and original purposes of data generation, there
would be little reason to collect data in the first place. Besides this internal context, the external
context is also relevant: how observations relate to various phenomena in the city beyond those of
immediate interest to the observation. The use of proxy data allows studying issues indirectly
through data describing related phenomena. Considering that social science works with abstract
constructs that usually cannot be directly observed, one could argue that most urban data sources
are in some sense data proxies. Like data, proxies are imperfect by definition, representing some
aspect of a phenomenon, omitting others, often conflating multiple issues that are hard to
disentangle.
Twitter as sticky data
Among social media services, Twitter has become a favourite data source for investigating a
broad range of urban phenomena. Several reasons explain this popularity: tweets are a public
form of communication, programmatically accessible through a public API (application programming interface).
The data are well-
structured, available in large quantities, and easy to manipulate even for researchers with limited
technical background. Twitter is a medium that is widely used on mobile devices in a variety of
different social contexts, activities, and situations. Containing annotated textual content of a
defined length, a timestamp, and sometimes a geographic location, tweets lend themselves to
quantitative, qualitative and spatial modes of analysis. Due to all of these properties, Twitter has
been used as a proxy for studying a broad range of phenomena. In the domain of urban research,
studies that take advantage of Twitter data can be categorised into three groups depending on
how the research objective relates to the context of data generation.
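As a rough illustration of why tweets lend themselves to these three modes of analysis, consider the minimal fields of a tweet record. The record below only mimics the general shape of such an object; real API payloads are far richer and differ in detail:

```python
# Sketch: the fields that make a tweet amenable to quantitative, qualitative,
# and spatial analysis. The record below is invented; real tweet objects
# contain many more fields and may encode these differently.
import json
from datetime import datetime

raw = '''{"text": "Stuck in traffic again #commute",
          "created_at": "2017-03-01T08:15:00",
          "coordinates": [-71.089, 42.339]}'''

tweet = json.loads(raw)
when = datetime.fromisoformat(tweet["created_at"])   # temporal analysis
lon, lat = tweet["coordinates"]                      # spatial analysis
hashtags = [w for w in tweet["text"].split() if w.startswith("#")]  # content

print(when.hour, (lat, lon), hashtags)
```

The same three fields — annotated text, timestamp, location — are what allow a single data source to serve such different research agendas.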
The first group includes studies that investigate Twitter in its original role as a social
platform rather than as a proxy, focusing on its communication processes as they take
place in physical space. This group includes analyses of topics discussed on the network and
their spatial (Lansley and Longley 2016) and temporal structures (Naaman et al. 2012).
Subjects include social network and community formation, attitudes and sentiments (Hollander
and Renski 2015), representation of identity (Bailey 2016), or the use of the platform for activism and
collective action (Jackson and Welles 2016). The use of Twitter as a mode of surveillance
also falls into this category.
The second group uses Twitter as a proxy for communication, interaction, or information
production in a broader sense. It investigates spatial phenomena that previously could not be
studied due to unavailable data or were limited to small-scale qualitative studies. This group
includes studies investigating the perception of neighborhood boundaries (Wakamiya et al.,
2013), the spatial distribution of languages (Hong et al., 2011) and regional dialects (Huang et
al., 2016), as well as the global cultural boundaries inferred from these distributions (Mocanu et
al., 2013). Conversely, this group also includes work that investigates the blind spots or absences
of information production, for example as a proxy for the inequalities of digital labour (Graham, 2014).
The third group of studies uses Twitter data as a proxy for phenomena not directly related
to communication, for example, to estimate where people are at a given moment. In this case,
Twitter is often merely a choice of convenience, interchangeable with any other location-based
media service capable of generating time-stamped geotags. In this group, Twitter data are often
used in conjunction with other data sources to describe and predict phenomena that were
previously modelled through other means. The group includes studies of the demographic
structure of urban populations (Longley et al., 2015), estimations of the number of pedestrians in
public space (Lansley 2014), predictions of influenza infections (Broniatowski et al., 2013),
transport behavior (Hawelka et al. 2014; Wang and Taylor 2016), or land use (Frias-Martinez
and Frias-Martinez 2014).
While the first two approaches are directly or indirectly related to the context of online
communication, the third group uses Twitter as an opportunistic data source unrelated to its
purpose. In this third group, questions of validity and accuracy become especially pertinent, as
data from different sources need to be aligned and inherent biases have to be controlled for.
However, this does not always happen; issues of internal validity of Twitter data are ignored in
many studies (Perng et al., 2016).
At present, it seems that data from Twitter and other social media sources resist
appropriation as data proxies. Many studies share a similar conclusion: due to the limitations of
the data source, no significant results can be reported at the moment, but great future potential is
to be expected.
Social media platforms are used in a multitude of different contexts by different
user groups with a demographic composition that is poorly known and constantly changing. (It
should be noted that the number of active Twitter users has stagnated over the past few years.)
The inequality in the use of social media among the various urban neighbourhoods is bigger than
the inequality along socio-demographic and economic dimensions (Indaco and Manovich 2016).
Twitter data seem to be sticky, to introduce a provisional characterization: meaningful when
discussed in their original context, but difficult to separate from this context, requiring the use of
sharp statistical instruments. Tweets are meaningful in their own right, but it is not clear which
aspects of Twitter data can be extrapolated or generalised. Even as a proxy for human presence,
the local context remains sticky, inseparable from the data. Eric Fisher’s maps from his project
See Something or Say Something, which compare where in major cities people tweet and where
they take pictures, show distinct spatial patterns that can only be explained by the motivations
for using the medium in a given situation. People tend to take photos rather than tweet on the
Golden Gate Bridge or Alcatraz Island, but they tend to tweet rather than take pictures in
suburban residential neighbourhoods. These are interesting, perhaps generalizable findings, but
they also generate significant data friction.
Stickiness does not prevent marketers, entrepreneurs or social activists from using Twitter data,
but for urban researchers it is important to consider how each aspect of a Twitter data set relates to
different phenomena in the city and to devise methods to disentangle those aspects.
OLS city lights as non-sticky data
Other data sources, however, seem to involve less friction and serve as reliable proxies for
phenomena that are seemingly unrelated to the original purpose of data collection. A case where
a data set has entirely transcended its original context is the accidental history and the
widespread use of satellite mosaics generated from the Operational Linescan System (OLS) of
the U.S. Air Force’s Defense Meteorological Satellite Program (DMSP). The satellite
composites, provided by the National Geophysical Data Center of the US National Oceanic and
Atmospheric Administration (NOAA) for every calendar year since 1992, contain the brightness
levels of nocturnal city lights, accompanied by the flares of oil and gas fields, wildfires, and the
lights of fishing fleets.
During the past 20 years, OLS/DMSP data have become a “workhorse” for geographers
and economists alike. The radiance values from the satellite mosaics have found their way into
data sets and studies that estimate population density (Sutton 1997, 2003), urbanization and
suburban sprawl (Campanella 2012; Sutton 2003), economic productivity (Henderson et al.,
2009), rural poverty (Jean et al. 2016; Elvidge et al. 2009), resource footprints and electrification
rates (Elvidge et al. 2011), measles outbreaks (Bharti et al. 2011), and average wages (Mellander
et al. 2013).
Considering the range of phenomena the night-time city lights data have been used to
predict and explain, it is remarkable that even the capability of recording city lights was an
accidental by-product of a system designed for different purposes. The Defense Meteorological
Satellite Program was launched in 1961 by the US Air Force after it became clear that
effective photo-surveillance by satellites requires an accurate prediction of cloud cover over the
target area (Hall 2001). At the time, photo-reconnaissance satellites depended on photographic
film, which had to be arduously recovered from re-entry capsules (“film buckets”) dropped from
the satellite back to earth, only to discover, as was often the case, that the images contained
only clouds.
As detailed by the military historian Cargill Hall, the development of optical instruments
used for the DMSP satellites involved many iterations, and led to several discoveries such as the
value of the infrared band for detecting clouds. In 1966, development began on the OLS imaging
module, which continuously recorded digital luminance data and transmitted them wirelessly back
to earth (Hall 2001).
A year after DMSP data became available to certain civilian agencies in 1972, Thomas
A. Croft, a researcher at the Stanford Center for Radar Astronomy, expressed his amazement about
the images in Nature: “The lights of cities are clearly visible, as are the aurora, surface
features illuminated by moonlight, and fires such as those caused by burning gas from oil fields
and refineries” (Croft 1973). At the height of the 1973 oil crisis, Croft read the data as a
testament to the global waste of energy. Meanwhile, the military had also found different uses for
the city lights: to locate and calibrate the recorded night-time scenes accurately, and to estimate
the thickness and density of particles in the atmosphere by measuring the diffusion of the lights’
contours (Air Weather Service 1974).
By 1977, Croft had compiled a first global atlas of city lights, developing methods for
digitisation, processing both original films and digital facsimiles, and using pattern recognition
to align different viewpoints (Croft and Colvocoresses 1979). A year later, he published the first
global composite of night-time images in Scientific American (Croft 1978). In 1992, DMSP
data were opened to the general public, and the NASA Black Marble illustrations have become
one of the most popular motifs of space imagery.
Despite its wide use, OLS/DMSP data are limited in many ways. As a proxy for
estimating human presence and activity, the data are strongly biased, with the brightness values
for a specific region dependent on many socio-economic, political, and cultural factors. OLS data
models therefore always use available country-level data as controls to allow predictions for
places where no such data are available. Furthermore, since OLS was designed to identify clouds
rather than measure illumination, the brightness values do not allow estimations of luminance at
the source; the values are therefore dimensionless. Fully saturated pixels covering brightly lit
urban agglomerations are a further concern (the island of Singapore, for example, appears as a
fully saturated blob of light in all yearly mosaics since 1992), as are blooming artefacts spilling
into neighbouring pixels. While the imaging sensor operates autonomously and stably, stitching the recorded
observations together introduces additional issues that require human decisions. A set of
heuristics regulates how to ensure the best resolution, avoid sunlight and moonlight, and exclude
unusable scenes. Some of these rules can be implemented algorithmically, such as removing flares by
normalising brightness over time; others require a human touch, such as the identification and
exclusion of aurora borealis. Some populated regions are rarely ever cloud-free, reducing data
quality. Studies based on OLS data have managed to control for some of these limitations; other
limitations are accepted simply due to the lack of alternative sources that have comparable
spatial and temporal coverage. For recent years, better alternatives exist: since 2012, the Visible
Infrared Imaging Radiometer Suite (VIIRS) has superseded the OLS instrument, providing data
of superior quality and resolution.
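The algorithmic rule mentioned above — removing flares by normalising brightness over time — can be sketched roughly as follows: stable city lights recur in every yearly composite, while gas flares and fires flicker, so a simple variability test separates the two. The threshold and pixel values are illustrative, not NOAA's actual procedure:

```python
# Sketch: transient sources such as fires and gas flares vary strongly between
# yearly composites, while city lights are comparatively stable. Threshold and
# sample values are invented for illustration.
import statistics

def stable_light(yearly_brightness, max_cv=0.25):
    """Keep a pixel's mean brightness only if it is persistent across years."""
    mean = statistics.mean(yearly_brightness)
    if mean == 0:
        return 0.0
    cv = statistics.pstdev(yearly_brightness) / mean  # relative variability
    return mean if cv <= max_cv else 0.0              # reject flickering sources

city_pixel = [55, 58, 60, 57, 59]   # stable urban lighting -> kept
flare_pixel = [0, 63, 5, 0, 61]     # intermittent gas flare -> rejected

print(stable_light(city_pixel))
print(stable_light(flare_pixel))    # 0.0
```

The human-touch cases mentioned in the text, such as aurora borealis, defeat exactly this kind of rule, since the aurora can be both bright and persistent over a composite period.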
OLS mosaics are powerful visual artefacts. At first sight, the correlation of OLS data with
human population density appears to be self-evident, yet without additional statistical controls, it
is much smaller than expected (Elvidge et al. 1997). The level of trust inspired by OLS satellite
mosaics might be explained by their obvious realism: their similarity with photographic material
from the Apollo missions and other examples of space photography. However, this assumption
of realism is shaky, as anyone who has worked with raw satellite scenes can confirm; raw scenes
usually look nothing like their vibrant published versions. Rob Simmon, map designer at
NASA’s Earth Observatory, eloquently demonstrates that satellite composites are elaborate
information visualisations, designed to evoke the impression of photographs (Simmon 2011).
What appears as the translucence of shallow coastal waters in delicate shades of blue is, in fact,
the rendering of a dataset of oceanic chlorophyll activity. The colourful transitions between lush
forests and arid regions were never captured by a photographic lens, but are determined by
carefully crafted colour palettes.
Considering the long way from taking pictures of clouds to measuring urban economies,
OLS data appear to be a non-sticky data source. However, as described above, their mobilisation
for research involves a considerable amount of data friction. The successful use of OLS data as a
proxy is only possible because the data source and its methods are extensively documented and
the behaviour of the sensor, with all its limitations, is well understood: prerequisites for
overcoming data friction.
Deconstructing stickiness
At first glance, data from OLS/DMSP and Twitter seem to be very different forms of data. The
one an exemplar of mechanical objectivity (Daston and Galison 2007), continuously recording
under stable conditions, the other reflecting human communication in its irreducible richness. On
closer inspection, many of these apparent differences disappear. Data from both sources are
indices of human behaviour, yet on vastly different spatial and temporal scales. OLS feeds and
geo-located tweets both indicate where people are at a given time, each with their own
representations and subject to their own biases. While tweets are initiated by the user, OLS data
sets are not free from ad-hoc human decisions either. Such decisions involve, for example, trade-
offs between cloudless coverage and data quality for some parts of the globe. The provisional
categorization of sticky and non-sticky data seems increasingly untenable.
Yet, stickiness remains palpable in certain challenges that are unique to social media,
which exist in many different contexts and are used by various groups for various purposes. While
the procedures of generating OLS data are explicated and their biases are mostly known, less
information exists about the demographic composition of social media users. More importantly,
the contexts of social media are not stable but perpetually evolving, as new
platforms get adopted and the use of existing ones changes. As David Lazer and his colleagues
demonstrated with the example of the declining prediction quality of Google flu trends, a model
that accurately predicts a phenomenon at a given time may quickly become obsolete (Lazer et al.
2014). Social media are feedback systems; the behaviour of their users adapts in response to how
they perceive the system (Offenhuber 2014). As the contexts of data generation on social media
platforms are ephemeral, the only constants are the users themselves. For online marketers,
“sticky data” include user IDs, email addresses, and other indices for personal identification,
which allow them to track users across the multiple contexts of a constantly shifting social media
landscape.
As proxies for urban phenomena, both data sources offer only partial perspectives. They
are susceptible to what journalist Joe Cortright describes as the “drunk under the streetlamp”
fallacy, expressed in the dialogue: “Did you lose your keys here? No, but the light is much better
here” (Cortright 2016). To correct for their inherent biases, both sources require triangulation
with other data sets. In this context, the law of large numbers, or “more trumps better” (Mayer-
Schönberger and Cukier 2013) is only partially helpful and should be contrasted with the
disclaimer “garbage in - garbage out.” Large data volumes allow for more statistical controls to
take biases into account but do not compensate for missing information. Combining, comparing,
and integrating multiple data sources seems the most promising way to go. Many seemingly
unrelated data sources can complete each other as they are already linked in various hidden
ways. This explains why, for example, a survey of noise exposure of a marginalised community
can predict the impacts of air pollution on the same community (Franklin and Fruin).
Social media data offer new lenses for observing urban phenomena. They can
complement existing data sources to provide a more fine-grained view into spatial and temporal
processes. As cultural expressions, their value and richness go beyond narrow measures of
accuracy or validity. But just as nocturnal city lights illustrate the unequal distribution of people,
the landscape of social media includes data deserts as well as hotspots of activity. The data
footprints of equivalent media services rarely align. When matching and contrasting the data
artefacts of social media with other sources, the frictions and shifting contexts of data generation
continue to play an elementary role.
The stickiness of social media data resists operationalization in automatic pipelines
for knowledge extraction and manifests itself in false positives that can only be identified and
resolved by a close reading of the source. This has consequences for the use of big data in urban
governance, urban operation centers, and predictive policing: applications that often rely on
decontextualized data and reductive modes of analysis, such as text mining based on trigger
words or dictionary-based sentiment analysis. Ignoring the stickiness of context can lead to cases
where a terrorism suspect identified by unsupervised text analysis turns out to be the journalist
who reported on the issue (Currier et al., 2015). In this sense, stickiness points to issues of
privacy even within the realm of publicly accessible data sources. As social media scholar Judith
Donath notes, privacy fails when something that is intended for a particular context gets shown
in another where it acquires a different meaning (2014, 212). Ignoring stickiness can also
increase the susceptibility to various forms of manipulation and hacking, such as the practice of
feeding fake information to crowd-sourced traffic systems like Waze in an attempt to create
virtual traffic jams and re-route traffic flows.
The complications of ambiguous data or missing context can rarely be avoided, since the
best proxy is often simply a data source for which no alternative exists. In the case of OLS data,
new remote sensing instruments may allow for more accuracy and resolution, but lack the
historical reach of OLS composites that cover four decades of global urbanisation. Nevertheless,
OLS mosaics afford only a single perspective on urban phenomena that are reflected in many
different representations. Working with stickiness means integrating multiple partial perspectives
rather than simply relying on a larger amount of data and reducing them to the lowest common
denominator.
Air Weather Service. 1974. “DMSP User’s Guide.” USAF.
Bailey, Moya. 2016. “Redefining Representation: Black Trans and Queer Women’s Digital
Media Production.” Screen Bodies 1 (2). doi:10.3167/screen.2016.010105.
Bharti, N., A. J. Tatem, M. J. Ferrari, R. F. Grais, A. Djibo, and B. T. Grenfell. 2011.
“Explaining Seasonal Fluctuations of Measles in Niger Using Nighttime Lights Imagery.”
Science 334 (6061): 1424–27.
Bowker, G. C. 2005. Memory Practices in the Sciences. Cambridge, Mass.: MIT Press.
Broniatowski, David A., Michael J. Paul, and Mark Dredze. 2013. “National and Local Influenza
Surveillance through Twitter: An Analysis of the 2012-2013 Influenza Epidemic.” PloS One
8 (12): e83672.
Campanella, T. J. 2012. The Concrete Dragon: China’s Urban Revolution and What It Means
for the World. Princeton Architectural Press.
Cortright, Joe. 2016. “The Dark Side of Data-Based Transportation Planning.” CityLab.
Croft, Thomas A. 1973. “Burning Waste Gas in Oil Fields.” Nature 245 (5425): 375–76.
———. 1978. “Nighttime Images of the Earth from Space.” Scientific American, July.
Croft, Thomas A., and A. P. Colvocoresses. 1979. “The Brightness of Lights on Earth at Night,
Digitally Recorded by DMSP Satellite.” US Geological Survey.
Currier, Cora, Glenn Greenwald, and Andrew Fishman. 2015. “U.S. Government Designated Prominent
Al Jazeera Journalist as ‘Member of Al Qaeda.’” The Intercept. May 8.
Daston, Lorraine, and Peter Galison. 2007. Objectivity. Zone Books.
Donath, Judith. 2014. The Social Machine: Designs for Living Online. Cambridge, Mass.: MIT Press.
Drucker, Johanna. 2011. “Humanities Approaches to Graphical Display.” Digital Humanities
Quarterly 5 (1).
Edwards, Paul N. 2010. A Vast Machine: Computer Models, Climate Data, and the Politics of
Global Warming. Cambridge, MA: MIT Press.
Elvidge, Christopher D., Kimberly E. Baugh, Eric A. Kihn, Herbert W. Kroehl, and Ethan R.
Davis. 1997. “Mapping City Lights with Nighttime Data from the DMSP Operational
Linescan System.” Photogrammetric Engineering and Remote Sensing 63 (6): 727–34.
Elvidge, Christopher D., Kimberly E. Baugh, Paul C. Sutton, Budhendra Bhaduri, Benjamin T.
Tuttle, Tilotamma Ghosh, Daniel Ziskin, and Edward H. Erwin. 2011. “Who’s in the
Dark—Satellite Based Estimates of Electrification Rates.” In Urban Remote Sensing, 211–24.
John Wiley & Sons, Ltd.
Elvidge, Christopher D., Paul C. Sutton, Tilottama Ghosh, Benjamin T. Tuttle, Kimberly E.
Baugh, Budhendra Bhaduri, and Edward Bright. 2009. “A Global Poverty Map Derived
from Satellite Data.” Computers & Geosciences 35 (8): 1652–60.
Floridi, L. 2011. The Philosophy of Information. OUP Oxford.
Frias-Martinez, Vanessa, and Enrique Frias-Martinez. 2014. “Spectral Clustering for Sensing
Urban Land Use Using Twitter Activity.” Engineering Applications of Artificial Intelligence
35 (October): 237–45.
Gitelman, Lisa. 2013. “Raw Data” Is an Oxymoron. Cambridge Mass.: MIT Press.
Graham, Mark. 2014. “Internet Geographies: Data Shadows and Digital Divisions of Labour.”
SSRN Scholarly Paper ID 2448222. Rochester, NY: Social Science Research Network.
Hall, R. Cargill. 2001. “A History of the Military Polar Orbiting Meteorological Satellite
Program.” DTIC Document.
Hawelka, Bartosz, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and
Carlo Ratti. 2014. “Geo-Located Twitter as Proxy for Global Mobility Patterns.”
Cartography and Geographic Information Science 41 (3): 260–71.
Henderson, J. Vernon, Adam Storeygard, and David N. Weil. 2009. “Measuring Economic
Growth from Outer Space.” Working Paper 15199. National Bureau of Economic Research.
Hong, Lichan, Gregorio Convertino, and Ed H. Chi. 2011. “Language Matters In Twitter: A
Large Scale Study.” In Fifth International AAAI Conference on Weblogs and Social Media.
Huang, Yuan, Diansheng Guo, Alice Kasakoff, and Jack Grieve. 2016. “Understanding U.S.
Regional Linguistic Variation with Twitter Data Analysis.” Computers, Environment and
Urban Systems 59: 244–55.
Indaco, Agustin, and Lev Manovich. 2016. “Urban Social Media Inequality: Definition,
Measurements, and Application.” arXiv [cs.SI]. arXiv.
Jackson, Sarah J., and Brooke Foucault Welles. 2016. “#Ferguson Is Everywhere: Initiators in
Emerging Counterpublic Networks.” Information, Communication and Society 19 (3): 397
Jean, Neal, Marshall Burke, Michael Xie, W. Matthew Davis, David B. Lobell, and Stefano
Ermon. 2016. “Combining Satellite Imagery and Machine Learning to Predict Poverty.”
Science 353 (6301): 790–94.
Hollander, Justin B., and Henry Renski. 2015. “Measuring Urban Attitudes Using Twitter: An
Exploratory Study.”
Lansley, G. 2014. “Evaluating the Utility of Geo-Referenced Twitter Data as a Source of
Reliable Footfall Insight,” April.
Lansley, Guy, and Paul A. Longley. 2016. “The Geography of Twitter Topics in London.”
Computers, Environment and Urban Systems 58: 85–96.
Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “Big Data. The
Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (6176): 1203–5.
Litman, Todd. 2014. “What Is a ‘House’? Critiquing the Demographia International Housing
Affordability Survey.” Planetizen. August 15.
Longley, Paul A., Muhammad Adnan, and Guy Lansley. 2015. “The Geotemporal
Demographics of Twitter Usage.” Environment and Planning A 47 (2): 465–84.
Mayer-Schönberger, Viktor, and Kenneth Cukier. 2013. Big Data: A Revolution That Will
Transform How We Live, Work, and Think. 1 edition. An Eamon Dolan Book. Boston:
Eamon Dolan/Houghton Mifflin Harcourt.
Mellander, Charlotta, Kevin Stolarick, Zara Matheson, and José Lobo. 2013. “Night-Time Light
Data: A Good Proxy Measure for Economic Activity?” Working Paper 315. Royal Institute of Technology,
CESIS - Centre of Excellence for Science and Innovation Studies.
Mocanu, Delia, Andrea Baronchelli, Nicola Perra, Bruno Gonçalves, Qian Zhang, and
Alessandro Vespignani. 2013. “The Twitter of Babel: Mapping World Languages through
Microblogging Platforms.” PloS One 8 (4): e61981.
Naaman, Mor, Amy Xian Zhang, Samuel Brody, and Gilad Lotan. 2012. “On the Study of
Diurnal Urban Routines on Twitter.” In ICWSM.
Offenhuber, Dietmar. 2014. “Infrastructure Legibility - a Comparative Study of Open311 Citizen
Feedback Systems.” Cambridge Journal of Regions, Economy and Society.
Perng, Sung-Yueh, Rob Kitchin, and Leighton Evans. 2016. “Locative Media and Data-Driven
Computing Experiments.” Big Data & Society 3: 1–12.
Simmon, Robert. 2011. “Crafting the Blue Marble.” Elegant Figures. NASA Earth Observatory.
October 6.
Sutton, Paul. 1997. “Modeling Population Density with Night-Time Satellite Imagery and GIS.”
Computers, Environment and Urban Systems 21 (3): 227–44.
Sutton, Paul C. 2003. “A Scale-Adjusted Measure of ‘Urban Sprawl’ Using Nighttime Satellite
Imagery.” Remote Sensing of Environment 86 (3): 353–69.
Wakamiya, Shoko, Ryong Lee, and Kazutoshi Sumiya. 2013. “Social-Urban Neighborhood
Search Based on Crowd Footprints Network.” In Social Informatics, edited by Adam Jatowt,
Ee-Peng Lim, Ying Ding, Asako Miura, Taro Tezuka, Gaël Dias, Katsumi Tanaka, Andrew
Flanagin, and Bing Tian Dai, 429–42. Lecture Notes in Computer Science 8238. Springer
International Publishing.
Wang, Qi, and John E. Taylor. 2016. “Patterns and Limitations of Urban Human Mobility
Resilience under the Influence of Multiple Types of Natural Disaster.” PloS One 11 (1).