International Journal of Geographical Information Science
ISSN: 1365-8816 (Print) 1362-3087 (Online) Journal homepage: http://www.tandfonline.com/loi/tgis20
Using provenance to disambiguate locational references in social network posts
Thomas Hervey & Werner Kuhn
To cite this article: Thomas Hervey & Werner Kuhn (2018): Using provenance to disambiguate
locational references in social network posts, International Journal of Geographical Information
Science, DOI: 10.1080/13658816.2018.1459627
To link to this article: https://doi.org/10.1080/13658816.2018.1459627
Published online: 26 Apr 2018.
RESEARCH ARTICLE
Using provenance to disambiguate locational references in
social network posts
Thomas Hervey and Werner Kuhn
Department of Geography, University of California, Santa Barbara, Santa Barbara, CA, USA
ABSTRACT
Location data from social network posts are attractive for answering all sorts of questions by
spatial analysis. However, it is often unclear what this information locates. Is it a point of
interest (POI), the device at the time of posting, or something else? As a result, locational
references in posts may get misinterpreted. For example, a restaurant check-in on Facebook only
locates that POI. But check-ins have been used to locate their poster, their poster's home, or
where the posting event occurred. Furthermore, post metadata terms like place and location are
ambiguous, making information integration difficult. Consequently, analysts may not be using the
correct locational references pertinent to their questions. In this paper, we attempt to clarify
and systematize what can be located within social network post metadata. We examine locational
references in post metadata documentation from several social networks. We identify three common
groups of locatable things: places recorded in a poster's profile, devices, and points of
interest. We posit that these groups can be described using the World Wide Web Consortium's (W3C)
provenance ontology (PROV), in particular PROV's agent, activity, and entity concepts. Next, we
encode example post metadata with these descriptions, and show how they support answering
questions such as which country's citizens take the most Flickr photos of the Eiffel Tower? The
theoretical contribution of this work is a taxonomy of locatable things derived from social
network posts, and a tool-supported method for describing them to users.
ARTICLE HISTORY
Received 2 September 2017
Accepted 28 March 2018
KEYWORDS
Provenance; reference; posts
CONTACT Thomas Hervey thomasahervey@umail.ucsb.edu
© 2018 Informa UK Limited, trading as Taylor & Francis Group
1. Introduction
Social network posts are increasingly being used as research data (Conole et al. 2011)
and are particularly attractive for use in spatial analysis. Posts often contain rich content
and locational references, such as addresses, toponyms, and coordinates. This locative
information is applied widely to study crisis aid (De Longueville et al. 2009, Abel et al.
2012), social demonstrations (Stefanidis et al. 2013), population distributions
(Scellato et al. 2011), voting and political races (Gayo-Avello 2011, Wetstein 2016), citizen
science (Sakaki et al. 2010), and human mobility (Chang and Sun 2011), among other topics.
When asking spatial questions, researchers must also ask: what can I locate with my data?
Answering this question is critical for choosing appropriate data, and so data should be
clear about what they locate. However, it is not always clear what things, or referents,
can be located from references in a post's metadata. Can a poster or the poster's home
be located, or perhaps only a point of interest (POI)? This uncertainty is problematic. If a
locational reference in a post is misinterpreted, it may no longer be applicable, which
compromises the post's fit for purpose. For example, if a post's metadata contains a
locational reference to a POI, such as the address of a restaurant, the post would be an ill
fit for use in a study of human mobility.
We attempt to address this uncertainty by answering the question: what are the
possible locational referents in posts? Our approach is to use provenance descriptions to
model a post's creation as a posting event. We believe each element of this event,
including the poster, the post, and the event itself, is associated with a limited set of
locatable things, or referents. And so, if users have a false presumption of accuracy about
their data, provenance descriptions should help present a more objective, proper context
of their data (Lauriault et al. 2007). Describing data provenance is a valued method of
improving data clarity, and ensuring scientific trustworthiness and scholarship (Tan 2004,
Simmhan et al. 2005, Hills et al. 2015). Additionally, provenance is inherently spatial.
Etymologically, provenance means 'forth' and 'to come from', denoting a locatable source.
This connotation suggests that provenance is a 'where' question at the metadata level.
In the following section, we give examples of how locational references can be
misinterpreted. In Section 3, we discuss the evolution of posts and related work on clarifying
location information. In Section 4, we review post documentation and discuss W3C PROV, the
provenance model we apply to test our argument. In Section 5, we characterize example
post metadata using PROV descriptions. Section 6 is an evaluation of our approach, where
we pose competency questions and compare existing and new means for querying data to
answer them. We conclude with future work and limitations in Section 7.
2. Misinterpreting locational references in social network posts
In this work, we examine locational references and their referents within one form of
user-generated content: social network posts. We focus on posts colloquially labelled
statuses or updates, which we define as short public messages issued online by social
network users. Posts contain text bodies, attachable content including photos, videos,
and links, and enrichments such as a check-in. Examples include Facebook's status,
Twitter's tweet, and Flickr's photo post.
2.1 Examples of locational reference misuse
A post's metadata can often contain several locational references. For example, a poster
on Facebook can create a post that includes a check-in to a venue, such as a restaurant.
Their post's metadata will contain previously generated information about the restaurant,
such as its name and address. It can also include information from the poster's user
profile, such as the name of the neighbourhood where they live. These two locational
references, an address and a place name, are distinguishable by what real-world entity
they locate. The former locates a restaurant POI, and the latter locates a poster's home
neighbourhood. This distinction, while simple, is not trivial. Numerous studies show that
locational references are not well understood and are inappropriately conflated or
interchanged. In the following examples, we have only informally tested the degree to
which misinterpretations affect study outcomes. A formal test is outside the scope of
this article, but should provide fruitful results for further motivation.
An initial literature review reveals several recurring misinterpretations. First, check-ins
have been used to proximally locate a poster's home. Scellato et al. (2011) suggested
that a Foursquare poster lives near the majority of their posts' check-in venues. We
argue that this is a problematic assumption. A poster's home is not necessarily close to
the venues they have checked into. It is possible for someone to check in most
frequently to New York restaurants even though they live in California. Check-ins only
locate a point of interest, that is, the venue that is being checked into. Check-ins should
not be conflated with or used to determine a home location.
Second, check-ins have been used to determine a poster's current location. This is often
called the location of a post, and where the posting occurred. Kumar et al. (2017) reviewed
the social impacts of Facebook check-ins and assumed that 'Any user who makes a check-in
is already making his presence public in that place.' Kim (2016) also assumed that check-ins
are a means of sharing where someone is. Furthermore, quoting Facebook Places, Gössling
and Stavrinidi (2015) suggested that check-ins 'at locations [are] to "remember where you
were in your favourite photos," to "let friends know where you are so they can meet you
there," or to "share where you're going to get tips and advice from friends who've been."'
Indeed, posters may intend to use check-ins for any of these purposes. However, we argue
that these assertions are risky, and we suggest that check-ins are ambiguous in what they
locate. In fact, many social network sites allow check-ins to be made to any place or venue,
regardless of the poster's location. A poster can be in California when they check into a New
York restaurant. Both Cheng et al. (2011) and Carbunar and Potharaju (2012) recognized this
lack of restriction and, with limited success, attempted to filter out fake check-ins from their
data. Furthermore, Huck et al. (2015) highlighted several works that inappropriately try to
locate a tweet (presumably the location where a tweet posting occurred) by using whichever
location metadata attribute is seen as most reliable. This often means falling back to
locations in a user's profile. This is seen as well in Fink et al. (2012), Dredze et al. (2013), and
Bao et al. (2016). Check-ins should not be used to locate a poster or a posting event.
Third, a posting device's coordinates, which are often called a geotag, have been used
to locate a poster's home. In their influential work on event detection, Sakaki et al. (2010)
assumed that user profile locations (UPLs) (e.g. a home location recorded in a profile)
and geotags are comparable and substitutable. Yet, Johnson et al. (2016) have found
that in only 75% of cases is a poster's home located accurately using check-ins. We
argue that a post's geotag only refers to the location of a posting device. Since a
geotag can be created from anywhere that a device is (provided that location-based
services (LBS) are enabled), that device is not necessarily near a home location.
2.2 Convoluted post documentation
Misinterpreting locational references can also be attributed to convoluted post
documentation. To access post data from a social network, a user typically makes a request to a
publicly facing application programming interface (API) endpoint. Social networks usually
provide accompanying documentation for each endpoint, describing what resources are
available, and how to query them. This includes documentation on post metadata.
Unfortunately, documentation rarely describes what locational references in post metadata
locate. For example, Facebook's <graph/user> API endpoint serves data about a specific
user's profile. Query parameters include location-indicative terms like locations, location,
and check-ins. The documentation vaguely describes the term locations as a list of posts and
photos that a user is tagged in that contain location information. Facebook suggests using this
field to chronologize where someone has visited. This suggestion is dangerous and exemplifies our
previous argument: check-ins are improperly used to locate posters. Also, being tagged in
a post or a photo by someone else is not indicative of one's location.
Documentation can also be volatile since post features change often. For example, in
2014, Foursquare Labs, Inc. split product features between its existing social network
Foursquare, and a new one called Swarm. Foursquare no longer focuses on users
creating posts with check-ins, nor releases check-in data for public consumption. This
makes information integration over time difficult. Information integration and alignment
across social networks is also difficult because locational reference terms like place and
location are overloaded. For one, place has a debated theoretical meaning and is a hot
research topic in GIS. It is therefore not surprising that social network platforms use the
term place to describe different technical things. Alternatively, describing locational
references using provenance can promote API endpoint transparency, and elevate
user understanding above the implementation level.
3. Background and related work
Over the last decade, social networks have increased in number and popularity (Mislove
et al. 2008, Conole et al. 2011), user connectivity (Kumar et al. 2010), and functionality (Roick
and Heuser 2013). As of this writing, several social networks, including Facebook and
Twitter, boast hundreds of millions of monthly users who produce and consume billions
of posts daily (Steiger et al. 2015). Posts are one type of user-generated content (UGC),
and are attractive as research data in part because of their volume and ease of access. It is
important to note that posts, like most UGC, almost always require wrangling and
pre-processing before being useful (Furche et al. 2006). Within the GIS community, attempts to
wrangle and manage spatial data are evident from projects on organization (e.g. INSPIRE,
NSDI), discovery (e.g. SPIRIT (Jones et al. 2004) and the Alexandria Digital Library (Lafia et al.
2016)), and commercial cartography (e.g. Mapbox, Cartodb, Mapzen, and Stamen).
Projects like these have improved the manageability of spatial data. But research gaps
remain for improving the understandability of spatial data, especially within UGC. This includes
clarifying locational referents in posts. For further details on the semantics of UGC, VGI, and
related terms, we refer readers to survey work including See et al. (2016) and Goodchild
(2007). The remainder of this section is a brief review of work related to post understandability.
3.1 Evolution of locational references in posts
We use Twitter to show one example of locational reference evolution. Twitter
announced a geotagging feature in late 2009. With the help of third-party applications,
this feature allowed users to geotag their tweets, theoretically displaying the location
from where a tweet was posted. Shortly afterward, mobile.twitter.com adopted geotagging
natively, and in early 2010 introduced tweeting with location, which allowed users
to tag specific place names besides geotags. Users could tag cities or places with a
similar spatial granularity. Tweeting with location was similar to the check-in feature
found in Foursquare and Gowalla. In mid-2010, Twitter shifted its focus from geotags to
tweeting with location, and afterwards other social networks followed. Since then,
location tagging and recommendation features have grown.
3.2 Clarifying location in user-generated content
Sui and Goodchild (2011) discussed how social networks are becoming location-based, and
congruous to a geographic information system (GIS). They divide location-based social media
into three categories: '(1) Social check-in sites (e.g. Foursquare, Gowalla, etc.); (2) Social
review sites (e.g. Yelp, Geodelic, etc.); (3) Social scheduling/events sites (e.g. Loopt, Plancast,
Meetup, etc.).' Several notable works have highlighted the importance of distinguishing between
a user profile location and a posting's location. Alex et al. (2016) emphasized locating users by
their user profile locations (UPLs) and tested the Edinburgh Geoparser's ability to disambiguate
place names in UPLs. Wilken (2014) discussed the history of Twitter and geographical location. He
specifies means for extracting various types of locational references 'from the tweet itself; from
the Twitter profile location field; . . . or, from the geocode functionalities associated with the
Twitter search API or the Twitter Geolocation API', but lacks details on what these references
are locating.
3.3 Describing provenance of user-generated content
To our knowledge, there has been little work on describing the provenance of user-generated
content, particularly locational references in posts. Sui and Goodchild (2011) recognized the
importance of provenance in posts, stating '[p]rovenance and uncertainty of different sources
should be maintained in synthesis, which is still a challenging issue in the network-science,
database, and GIScience communities.' Taxidou et al. (2015) constructed a W3C Provenance
Data Model (PROV-DM) extension called PROV-SAID to describe information diffusion from
Twitter retweets. The Data Mining and Machine Learning (DMML) Laboratory at Arizona State
University has several projects on describing social media provenance. For example, Barbier
and Liu (2011) discussed provenance paths, which describe information diffusion during user
communication. Lauriault et al. (2007) discussed how, from an archivist perspective,
provenance is important for maintaining the quality of scientific data. Schuurman and Leszczynski
(2006) discussed how ontologically described metadata are a sustainable means of adding
context and clarity to data.
4. Metadata survey and event modeling
To answer our question, what are the possible locational referents in posts, we first survey
post documentation and posting procedures. From the survey results, we extract patterns
of locatable things and attempt to organize them around a posting event. We
conclude this section with an explanation of how descriptions from W3C's provenance
model, PROV, can clarify locatable things associated with a posting event.
4.1 Documentation survey
To understand how posts are created, organized, queried, and described, and what they locate,
we review post documentation from six social networks: Facebook, Twitter, Flickr,
Foursquare, Instagram, and Google+. While we recognize Foursquare and Flickr are no
longer as popular, we believe they remain useful for diverse spatial problems, such as
locating vague places (Gao 2017). We chose these social networks for several reasons. First,
each social network has a popular feature for creating status-like or update-like posts often
used as research data. Second, each is flexible, supporting several modes of interaction,
including desktop and mobile web browsers, and native mobile device applications. During
documentation review, we pay particular attention to locational referent patterns. Table 1
summarizes our survey results.
For each social network chosen, we read the complete official developer API documentation.
We record available query parameters and response data (i.e. returned post metadata) for
every endpoint that includes locative terminology (e.g. location, current location, place). This
totals over 45 endpoints, with Facebook making up close to half of them. We reduce our list to
eleven endpoints after removing those that do not query post data directly. Next, we attempt
to understand posting procedures and how a poster includes location information in a post.
For each social network, we create several example posts using every modality available,
ensuring that the posts are queryable through an endpoint on our list. For example, using a
personal Facebook account, we post several statuses using different devices (i.e. desktop
computer and mobile devices), different applications and web browsers (i.e. Android and iOS
native applications, facebook.com and mobile.facebook.com in Google Chrome and Mozilla
Firefox web browsers), and with varying settings for privacy and LBS. Each combination yields
a post queryable through Facebook's </user/feed> endpoint. While generating the posts, we
record posting procedures, any additional locative terminology encountered, likely purpose
for posting, intended audience, and any confusion we have. Next, we query our example posts
using hand-coded and tool-based methods.
Table 1. A summary of social networks surveyed, the three groups of locatable things that we have
derived, and the API endpoints used to query their locational references. 'X' indicates that a
locational reference was found referring to a referent in the corresponding column group. '-'
indicates that no locational reference was found.
Social Network   Profile Place    Device   Point of Interest   API Endpoint
Twitter [f]      X                X        X                   GET <search/tweets>
Facebook [g]     (multiple) [b]   -        -                   <graph/user>
                 -                X        X                   <graph/user/feed>
Flickr [h]       (multiple) [b]   -        -                   <flickr.people.getInfo>
                 -                X [c]    X                   <flickr.photos.getinfo>
                 -                X [c]    -                   <flickr.photos.getExif>
Foursquare [i]   X [a]            X        X [d]               <users/USER_ID>
Instagram [j]    X                -        -                   <user-id>
                 -                X        X                   <users/user-id/media/recent>
Google+ [k]      (multiple) [b]   -        -                   <user_ID>
                 -                X [e]    X [e]               <user_ID/activities/public>
[a] Free text location field with suggestions provided (called hometown in API documentation).
[b] Defined as your hometown, city you live in now, country, and places lived in Google+.
[c] Determined from device's EXIF GPS information, if available.
[d] Not visible to non-authenticated user.
[e] Allows users to either attach a POI, or drop a pin on a location, and attach its coordinates.
[f] API documentation found at https://dev.twitter.com/rest/reference/get/search/tweets
[g] API documentation found at https://developers.facebook.com/docs/graph-api/reference/user/
[h] API documentation found at https://www.flickr.com/services/api/
[i] API documentation found at https://developer.foursquare.com/docs/users/users
[j] API documentation found at https://www.instagram.com/developer/endpoints/users/
[k] API documentation found at https://developers.google.com/+/web/api/rest/latest/people/
As we expect, we notice several incongruences in both posting and query procedures. First,
locative terminology in documentation differs between social networks. Google+ uses the
term places to denote a UPL where a poster currently lives and has lived previously.
Alternatively, Twitter and Facebook use the term place to denote a POI, while Instagram,
Flickr, and Google+ call this a location. Furthermore, foursquare.com has a UPL called location,
but their API labels this field hometown. Second, the importance of LBS varies by social
network and posting modality. For example, only Twitter for mobile devices allows
simultaneous inclusion of a precise location (i.e. a geotag), and a nearby POI check-in. What is more,
twitter.com, facebook.com, Facebook for mobile devices, Google+ for mobile devices and
several others do not geographically restrict POI check-ins. Most but not all social networks
provide suggestions for check-ins, such as previously used or nearby POIs. Google+ allows
mobile users to check-in anywhere by dropping a pin but restricts this feature to a browser.
Third, an inconsistent number of endpoints is needed to query the same information across
different social networks. For example, Twitter's <search/tweets> endpoint alone supplies
three different locational references, while querying the same from Flickr requires three
different endpoints.
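The terminology differences listed above can be made explicit with a small lookup table. The following is a minimal sketch, not part of the surveyed APIs or of our tool: the (network, term) pairs restate the observations in this section, while the names ReferentGroup, TERM_TO_GROUP, and referent_group are hypothetical and chosen only for illustration.

```python
from enum import Enum
from typing import Optional


class ReferentGroup(Enum):
    """The three groups of locatable things (labels are illustrative)."""
    PROFILE_PLACE = "place recorded in a poster's profile (UPL)"
    DEVICE = "device with LBS enabled"
    POI = "point of interest"


# (network, documentation term) -> group, restating the survey text above.
# No term in that paragraph names a device, so DEVICE has no entry here.
TERM_TO_GROUP = {
    ("google+", "places"): ReferentGroup.PROFILE_PLACE,
    ("twitter", "place"): ReferentGroup.POI,
    ("facebook", "place"): ReferentGroup.POI,
    ("instagram", "location"): ReferentGroup.POI,
    ("flickr", "location"): ReferentGroup.POI,
    ("google+", "location"): ReferentGroup.POI,
    ("foursquare", "hometown"): ReferentGroup.PROFILE_PLACE,
}


def referent_group(network: str, term: str) -> Optional[ReferentGroup]:
    """Return the group a term locates, or None if the text above does not say."""
    return TERM_TO_GROUP.get((network.lower(), term.lower()))
```

Keying the table on the network as well as the term reflects the point of the paragraph: near-identical words (places in Google+, place in Twitter) can locate different kinds of things, so a term alone is not enough to decide what is located.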
4.2 Event modeling
Even with these incongruences, three distinct groups of locatable things emerge from
post metadata: (personal) places recorded in a poster's profile, devices with LBS enabled,
and points of interest. These groups represent real-world entities, such as a poster's
apartment, their mobile device, and a restaurant. We suggest that these three groups
are what can be located from a post's metadata and that they are inherently distinct.
Locational referents form these groups in part because of how social networks control the
generation of user accounts and posts. To elaborate, UPLs tend to be created from freeform
text inputs during sign-up, and are less frequently updated. Setting aside data quality issues,
UPLs are common in profiles across social networks and rely on users specifying a location
personal to them (Hecht et al. 2011). Next, a device is locatable when LBS are enabled, and a
social network allows geotagging. When a post includes a geotag, it is a coordinate reference
to that device's position. It is also likely the poster's position, assuming they are holding the
device at the time of the posting. Finally, POIs are located by their pre-created place names
and addresses, which are attached to a post from a check-in.
Our survey results confirm the issues we discussed in the introduction. It is often
unclear what locational references locate. There are no clear mappings between
documentation descriptions of locational references and what they locate. Therefore, we
propose creating a model to clarify these mappings. To do this, we argue it is valuable to
model a post's origin as an event. As shown by Peuquet and Duan (1995), event
modeling is valuable for organizing locatable things. Furthermore, Kuhn's (2012)
taxonomy of spatial content notes that an event, bounded in time, is located by its
participants. For example, a landslide is an event located by participating collapsing slopes,
collapsing houses, onlookers, and the individuals who talk about it online.
Similarly, the act of posting can be seen as an event. A poster who writes, submits, and thus
creates a post has quickly started and finished a posting event. This event has locatable
participants, including the poster, the post, and the places associated with each of them (e.g.
the poster's home and a restaurant). Figure 1 depicts a proposed mapping between references
and their referents, and associations with event participants. Personal places recorded in a
poster's profile (i.e. UPLs) are most associated with the poster, as they are places they frequent
or have a personal attachment to. An LBS enabled device is most associated with the posting
event itself because it defines where the posting event occurred. And finally, POIs are most
related to a post. Like other attachments and enrichments, check-in venues (i.e. POIs) can be
stand-alone but are likely related to the content of a post. In section 4.3, we discuss these
participants using provenance terms. To summarize, a posting event has participants
discoverable within post metadata, and those participants have an associated location (see Figure 1).
Our model is designed to help users understand these associations.
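The associations described above can be sketched as a small data structure. This is only an illustration under our own naming assumptions: the classes Poster, Post, and PostingEvent and their fields are hypothetical and are not taken from any social network API or from our tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Poster:
    name: str
    # Places recorded in the poster's profile (UPLs).
    profile_places: List[str] = field(default_factory=list)


@dataclass
class Post:
    text: str
    # Check-in venue attached to the post, if any (a POI).
    poi: Optional[str] = None


@dataclass
class PostingEvent:
    poster: Poster
    post: Post
    # Geotag of the LBS enabled device as (latitude, longitude), if any.
    device_coordinates: Optional[Tuple[float, float]] = None

    def locatable_things(self) -> Dict[str, object]:
        """Map each event participant to the locational reference tied to it."""
        return {
            "poster": self.poster.profile_places,
            "posting event": self.device_coordinates,
            "post": self.post.poi,
        }


# A poster whose profile says California checks into a New York restaurant
# without a geotag: only the POI and the profile place can be located.
event = PostingEvent(
    poster=Poster("example_user", ["California"]),
    post=Post("Dinner!", poi="New York restaurant"),
)
```

Keeping the three references on three different participants means that a question about where a poster lives cannot accidentally be answered with a check-in venue.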
At this point, we feel it is important to raise a note about these association assertions. We do
not attempt to directly locate a poster, nor a post, which is in fact digital. We are also not
interested in investigating locations within a post's text body. There are numerous other
projects that use a multitude of context clues from text to locate posters and improve other
locational referent challenges. Alternatively, we focus on post metadata and stress that they
must be clearly interpreted to remain valuable.
4.3 The W3C provenance recommendation
To our knowledge, no adequate attempts describe locational referents in posts using
provenance. We find this surprising since provenance models are useful for understanding the
origin, organization, and trustworthiness of data and metadata. After reviewing provenance
models and metadata standards including Open Provenance Model and ISO 19115, we decide
to leverage the W3C provenance model, PROV. PROV is a W3C recommendation with several
flexible serializations. We believe its vocabulary naturally describes our three groups of
locatable things, and interoperates well with other controlled vocabularies, such as Dublin
Core.
Figure 1. A summary of our survey findings depicts the relations between locational references
within posts, the locational referents they locate, and the PROV types that each locational referent
can be associated with. Our event model (see Figure 4) can be used to organize locational references
by what they refer to.
The PROV documents are a family of provenance specifications for the web produced by
the W3C Provenance Working Group. PROV describes the use and production of entities by
activities which may be influenced in various ways by agents (Moreau and Missier 2013).
Figure 2 depicts a graph diagramming PROV's ten core concepts of types and relations. PROV's
foundation is a conceptual data model (PROV-DM), which describes a simple provenance
vocabulary. Figure 3 depicts the mapping from PROV's ten core concepts to PROV's data
model (PROV-DM) core vocabulary types and relations. PROV has several serializations
including PROV-O (the PROV ontology), an OWL2 ontology that maps the PROV data model to
RDF. For more details on PROV including an example application, we point readers to
the work in progress PROV Primer (Gil et al. 2013). In this work, we model the production of a
post during posting events by a poster. Some may see applying a formal ontology and a
controlled vocabulary to posts as excessive and difficult for users to work with. However, posts
will evolve, and we are attempting to build a theory of groupable locatable things. Therefore,
logical descriptions are a good means of conveying the long-term intended meaning of
locational references (Lauriault et al. 2007).
Figure 2. A unified modeling diagram of PROV's ten core concepts. https://www.w3.org/TR/2013/
REC-prov-dm-20130430/ Copyright © 2011-2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights
Reserved. W3C liability, trademark and document use rules apply.
Figure 3. A mapping between PROV's concepts, types, and relations to core concepts. https://www.
w3.org/TR/2013/REC-prov-dm-20130430/ Copyright © 2011-2013 W3C® (MIT, ERCIM, Keio, Beihang),
All Rights Reserved. W3C liability, trademark and document use rules apply.
5. Experimentation
In this section, we test our hypothesis that provenance descriptions can help clarify
locational referents in posts. To do this, we apply PROV descriptions to our example
posts' metadata from section 4.1 and attempt to query them by these descriptions. This
application is accomplished by a metadata processing workflow. In this workflow,
metadata from inputs are examined and compared against template locational referents,
then wrapped with PROV descriptions, and then encoded as linked data to be
semantically queried using the SPARQL Protocol and RDF Query Language (SPARQL).
5.1 Using PROV to organize and describe locational referents
In section 4.2 we discussed why it is beneficial to model a posting as an event. In section
4.3 we introduced the PROV model. We now discuss how PROV terms can describe a
posting event. From PROV-DM's core type descriptions, a poster can be described as a
PROV agent, a posting event as a PROV activity, and a post as a PROV entity. PROV-DM
core relation descriptions link PROV types. For example, a post entity was generated by a
posting event activity and was attributed to the poster agent. The posting activity was
associated with the poster agent. As discussed in 3.2 and summarized in Figure 1, each
PROV core type has a group of associated locatable things. UPLs are associated with a
poster agent, an LBS enabled device with a posting event activity, and a POI with a post
entity. Figure 4 depicts this association with a model of a tweet posting event.
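The associations above can be sketched as plain subject-predicate-object triples. The following is a minimal illustration in Python, not the implementation described in this paper. The prov: terms are those of PROV-O, while the local: identifiers, the function names, and the use of plain strings as location values are assumptions made for the example.

```python
# A posting event as (subject, predicate, object) triples.
# prov: terms come from PROV-O; local: names are illustrative assumptions.
def posting_event_triples(upl, device_location, poi):
    """Return triples linking a poster, a posting event, and a post."""
    return [
        ("local:poster", "rdf:type", "prov:Agent"),
        ("local:posting", "rdf:type", "prov:Activity"),
        ("local:post", "rdf:type", "prov:Entity"),
        # Core relations between the three PROV types.
        ("local:post", "prov:wasGeneratedBy", "local:posting"),
        ("local:post", "prov:wasAttributedTo", "local:poster"),
        ("local:posting", "prov:wasAssociatedWith", "local:poster"),
        # Each PROV type carries the locatable thing associated with it.
        ("local:poster", "prov:atLocation", upl),
        ("local:posting", "prov:atLocation", device_location),
        ("local:post", "prov:atLocation", poi),
    ]


def location_of(triples, subject):
    """Return the first prov:atLocation value recorded for a subject, or None."""
    for s, p, o in triples:
        if s == subject and p == "prov:atLocation":
            return o
    return None
```

Asking for the location of local:poster, local:posting, or local:post then returns the UPL, the device position, or the POI respectively, which is the distinction the model is meant to preserve.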
Once we prototype the model, we check its alignment with current documentation.
Figure 5 depicts an example truncated metadata response from Twitter's <search/tweets>
endpoint. In this response, one can identify all three groups of locatable things from three
locational references. Specifically, we see the locational reference "Los Angeles, CA" within
<user/location>. This value refers to a personal place. Also, we see the value [38.8981861,
-77.1469772] within <geo/coordinates>, which refers to a device. Finally, we see a
reference to a POI, "Arlington, VA", within <place/full_name>.
5.2 Tool implementation and architecture
We next apply our model to example metadata. To do this, we create a workflow that
examines post metadata and enriches it with provenance descriptions. The architecture is
divided into three main processes summarized in Figure 6. The workflow does not currently
support querying resources from API endpoints and so requires manual input from a user.
Therefore, we query and download each of our example posts, and combine results where
necessary (e.g. combining results from three Flickr endpoints to create a post result with all
three groups of locatable things). In the first step of the workflow, a reader function traverses
input metadata. The reader records the key, value, and nested path of elements that match a
template set of requirements. These requirements are terms in a vocabulary of locational
references, specific to each social network. The template is updated with matched values.
Next, a provenance document template is created and passed to an encoder function. The
provenance template has predefined core types and relations as specified in section 4.3 and
elaborated on in section 5.1. A mapping function then takes the updated vocabulary of
requirements and maps values to the provenance template's atLocation property.
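The reader and mapping steps described above can be sketched as follows. This is a hedged illustration rather than the workflow's actual code: the function names, the dictionary-based vocabulary, and the choice to match on slash-joined paths are assumptions. The paths themselves follow the Twitter example in section 5.1.

```python
# Vocabulary of locational references for one network: nested path -> PROV type.
# The paths mirror the Twitter example in the text; the mapping is an assumption.
TWITTER_VOCABULARY = {
    "user/location": "local:poster",
    "geo/coordinates": "local:posting",
    "place/full_name": "local:post",
}


def read_references(node, vocabulary, path=()):
    """Walk nested JSON and record key, value, and path of vocabulary matches."""
    matches = []
    if isinstance(node, dict):
        for key, value in node.items():
            here = path + (key,)
            joined = "/".join(here)
            if joined in vocabulary:
                matches.append({"key": key, "value": value, "path": joined})
            else:
                matches.extend(read_references(value, vocabulary, here))
    elif isinstance(node, list):
        # List items share their parent's path.
        for item in node:
            matches.extend(read_references(item, vocabulary, path))
    return matches


def map_to_prov(matches, vocabulary):
    """Attach each matched value to the atLocation slot of its PROV type."""
    return {vocabulary[m["path"]]: {"prov:atLocation": m["value"]} for m in matches}
```

Matching on the full nested path rather than on the key alone avoids confusing, for example, a coordinates element under geo with one nested elsewhere in a response.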
Next, the updated provenance document is serialized to RDF, producing a Turtle (.ttl) file.
The output file contains the original input metadata along with RDF provenance descriptions
wrapped around locational references. An example of several triples describing our event
model's relations can be seen in Figure 7. We have chosen the PROV-Ontology (PROV-O)
serialization so that the results can be semantically queried using SPARQL. This way, a user can
ask questions of their data based on the relations between locational referents. We discuss this
in further detail in section 6.
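To give a sense of the shape of such an output, the sketch below writes a few triples as Turtle text using only the Python standard library. It is a simplified stand-in, not the serializer used in our implementation: the local namespace IRI is a placeholder, the function name is hypothetical, and only string literals and prefixed names are handled.

```python
PREFIXES = {
    "prov": "http://www.w3.org/ns/prov#",
    "local": "http://example.org/posting#",  # placeholder namespace (assumption)
}


def to_turtle(triples):
    """Serialize (subject, predicate, object) triples as Turtle text.

    Objects that look like prefixed names (e.g. prov:Entity) are written as-is;
    everything else is written as a quoted string literal.
    """
    lines = [f"@prefix {name}: <{iri}> ." for name, iri in PREFIXES.items()]
    lines.append("")
    for subject, predicate, obj in triples:
        text = str(obj)
        if text.split(":", 1)[0] in PREFIXES and " " not in text:
            rendered = text  # prefixed name such as prov:Entity
        else:
            escaped = text.replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'
        lines.append(f"{subject} {predicate} {rendered} .")
    return "\n".join(lines) + "\n"


example = to_turtle([
    ("local:photo_post", "a", "prov:Entity"),
    ("local:photo_post", "prov:wasGeneratedBy", "local:photo_posting"),
    ("local:photo_posting", "prov:atLocation", "48.858257,2.294535"),
])
```

A production serializer would also need typed literals, multi-line strings, and proper prov:Location resources, which this sketch leaves out.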
Our implementation is written in Python and built on the Flask web micro-framework. It
accepts a JSON input, which is recursively traversed by an object walk algorithm. The prov
package from the University of Southampton Provenance Suite is used to generate and
serialize Python PROV documents. Apache Jena Fuseki, a SPARQL server and RDF database,
is used to read, store, and reason on RDF data, such as our workflow's output Turtle files.
Figure 4. An example tweet posting event described using PROV terms and relations. prov: indicates
the PROV vocabulary namespace (including the location term atLocation), and local: indicates the
posting vocabulary namespace terms that we created. The orange pentagon represents a PROV
agent, the blue square represents a PROV activity, and the yellow circle represents a PROV entity.
Figure 6. Our workflow contains three main steps. After a user inputs JSON data gathered from an
API response, first a reader function parses the incoming data. Second, a provenance mapping
function encodes input data with provenance terms. Third, an RDF file is output. The file can then be
queried using SPARQL. Input directly from API endpoints is not yet supported.
Figure 5. A truncated sample response from Twitter's <search/tweets> endpoint. Three locational
references can be seen including: user/location, which refers to a personal place, geo/coordinates,
which refers to a location-based services (LBS) enabled device, and place/full_name which refers to a
POI. Twitter labels this particular POI a city.
6. Evaluation
Our work aims to give users a more precise means of asking location questions of social
network postings. In essence, our model's theory informs our implementation, which acts as
a recommendation system, suggesting to users what data to use based on their questions.
To prove the value of our model and our implementation, two things must be accom-
plished. First, users must be able to query post data based on their questions, not based on
how post data is structured. Second, this query process must be simpler than existing
methods. To test this, we compose two competency questions and compare the queries
necessary for answering them with our example posts. Each competency question involves
at least two locational referents. If our model is able to respond with the correct metadata
(that is, if the correct metadata including locational references is returned), our model will be
successful. Each question is best answered using a particular social network's data but is
generalizable by altering the questions. For example, location questions concerning photos
are best answered using metadata from Flickr or Instagram, but not Foursquare.
6.1 Competency questions
Our first question simulates a user asking about photo posts on Flickr. We ask, "of people
who take photos at the Eiffel Tower, which countries are most of them from?" This question
is simple but requires an understanding of two different locational referents: the location
where a photo was captured, and the location where a photo's author is from. This
question necessitates querying data about LBS enabled devices and personal places. Our
second question asks, "which US states are people from that are most interested in showing
support for The Standing Rock Sioux?" During the height of the Dakota Access Pipeline
protest in the United States in 2016–17, supporters of the Standing Rock Sioux tribe
were encouraged to check in to Standing Rock, North Dakota. This question requires
querying data about a place where check-ins were made, and where the check-in
authors are from.
Figure 7. An example RDF output from an executed workflow with an input of photo data from
Flickr. This result shows a photo_post entity, a photo_posting activity, and a poster agent, with a
corresponding label, location, and PROV relation. The local prefix indicates our defined PROV terms.
6.2 Query comparison
To answer our first question using existing methods, a user must first compile and coerce
responses from three Flickr API endpoints. They need the <flickr.photos.getInfo> endpoint
to get a list of posts and corresponding metadata, the <flickr.people.getInfo> endpoint to
get a poster's personal place name, and the <flickr.photos.getExif> endpoint to get a camera
device's coordinates at the time of photo capture. For each photo post collected, a user
would have to manually filter out all posts that are not within a certain proximity of the Eiffel
Tower in Paris (the approximate coordinate pair is [2.294535 longitude, 48.858257 latitude]).
The locational reference metadata attributes of a photo capture are called GPS Longitude
and GPS Latitude, accessible by the <flickr.photos.getExif> endpoint. With a filtered set of
posts, the user would then have to look up each post's corresponding poster by a unique
identifier and perform a count distinct country aggregation on posters' home locations. The
locational reference for a poster's home location is a UPL labelled location. It is accessible by
the <flickr.people.getInfo> endpoint. Throughout this process, if a locational reference is
misinterpreted, the results could be incorrect, such as if check-in locations are believed to be
synonymous with photo capture locations. Were this the case, there may be many false
positives where users check in to the Eiffel Tower, but were never actually there.
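The manual steps above amount to a proximity filter followed by a distinct count. A minimal sketch, assuming the three endpoint responses have already been merged into simple records: the field names owner, lat, and lon, the 0.5 km radius, and the pre-resolved home countries are all assumptions for illustration, not part of the Flickr API.

```python
import math
from collections import Counter

EIFFEL_TOWER = (48.858257, 2.294535)  # (latitude, longitude), as given in the text


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def posters_by_country(photos, home_country, radius_km=0.5):
    """Count distinct posters per home country for photos captured near the tower.

    photos: iterable of dicts with 'owner', 'lat', 'lon' (capture position).
    home_country: dict mapping an owner id to a country name derived from the UPL.
    """
    nearby_owners = {
        p["owner"] for p in photos
        if haversine_km((p["lat"], p["lon"]), EIFFEL_TOWER) <= radius_km
    }
    return Counter(home_country[o] for o in nearby_owners if o in home_country)
```

Note that the analyst must know that lat and lon here hold capture positions and that home_country comes from the profile. Nothing in the data structure itself prevents the two from being swapped or replaced with a check-in location.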
Alternatively, using our approach, a user need only supply the workflow with data queried
from the three endpoints. The workflow examines the input resources, extracts locational
references, and organizes them using our provenance model (see section 5.2). Within the
workflow, the reader function examines the inputs, and the encoder function extracts
the GPS Latitude and GPS Longitude references. The provenance mapping function wraps
these references in the PROV term prov:atLocation and associates them with the
local:photo_posting event/activity. The local: syntax indicates a vocabulary:term that we have created.
The UPL location reference is wrapped with the term prov:atLocation, and is associated with
the local:poster agent. Figure 7 is an example RDF output from the workflow. It depicts
semantic relations between locational references of a single post. This output can then be
queried using SPARQL. Figure 8 is a simplified example query that runs the same
count distinct aggregation. The result is an aggregated count of posters grouped by country.
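The logic of such a query can be imitated in plain Python over a list of triples. The sketch below is a stand-in for the triple-pattern matching that a SPARQL engine performs, not the query shown in Figure 8. The function names are hypothetical, and it assumes that device coordinates have already been resolved to a place label and UPLs to a country name.

```python
from collections import Counter


def objects(triples, subject, predicate):
    """Return all objects of triples matching a subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def posters_by_home(triples, capture_place):
    """Count distinct posters by home location for posts whose posting
    activity took place at capture_place."""
    posters = set()
    for post, predicate, activity in triples:
        if predicate != "prov:wasGeneratedBy":
            continue
        # The device location hangs off the posting activity, not the post.
        if capture_place not in objects(triples, activity, "prov:atLocation"):
            continue
        posters.update(objects(triples, post, "prov:wasAttributedTo"))
    homes = Counter()
    for poster in posters:
        # The personal place hangs off the poster agent.
        for home in objects(triples, poster, "prov:atLocation"):
            homes[home] += 1
    return homes
```

Because each location is reached through its referent (activity or agent), the question selects the locational reference, rather than the analyst having to recall which metadata attribute holds which kind of location.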
To answer our second question, a user must first compile and coerce responses from two
Facebook API endpoints. They need the <graph.users> endpoint to get the poster's personal
place name, and the <graph.users.feed> endpoint to get all posts where a check-in has
occurred. For each post collected, a user first removes all posts that do not have a check-in
to the Standing Rock, ND POI. This requires determining the locational reference for the POI,
which in this case is called place, accessible by the <graph.users.feed> endpoint. Next, as in
the previous question, the user has to look up each post's poster by a unique identifier and
perform a count distinct aggregation on poster home locations. Also like the previous
question, the locational reference for a poster's home location is a UPL labelled Location,
which is accessible by the <graph.users> endpoint. Our alternative approach to answering
this question is similar to our last approach, except we are interested in a POI location, not
device locations. A user first supplies the workflow with data from the two endpoints. Then,
the reader function examines the inputs and the encoder function isolates the check-in and
UPL locational references. Each is wrapped with a prov:atLocation term and associated
with its respective PROV core term. The check-in place reference is associated with local:post
and the UPL location reference with local:poster. The output results can be queried
similarly to the last query (see Figure 8). The query differs by altering the aggregation to the US state
level and changing the posting-poster relation to a post-poster relation within the where
clause. As a result, using our approach, a user does not risk misinterpreting locational
references, and can more easily extract the data useful to their question.
7. Conclusion
In this paper, we have indicated that locational references in social network post metadata can
be misinterpreted. We have explained that a primary reason for misinterpretation is a lack of
clarity about what a locational reference locates. Since post metadata are used as research
data, we have shown that it is critical to clarify what can be located from a post's metadata. To
do this, we have attempted to answer the question: what are the possible locational referents in
posts? Our approach was to first survey existing post metadata documentation from several
social networks, and note patterns and discrepancies in locational references and what they
can reliably locate. We determined that there are three groups of things locatable from post
metadata: personal places recorded in a poster's profile (i.e. UPLs), location-based services
(LBS) enabled devices, and points of interest (i.e. POIs). Next, we organized locatable things by
mapping them to locational references, and associating them with participants during a
posting event. We described this event using W3C's provenance description terms. Once we
checked that our model aligned with our survey findings, we developed a workflow to encode
post metadata with the provenance descriptions. This workflow takes post metadata as an
input, encodes its locational references with provenance descriptions (based on what they
refer to), and produces an RDF output. Users can then ask questions about particular locational
referents through a semantic query. To evaluate our model, we posed two competency
questions and compared the current and improved methods for answering them. As a result,
the contribution of this work is threefold: a taxonomy of locatable things in posts, a method for
encoding post metadata with provenance descriptions, and a means for clarifying to users
what data to use depending on their particular questions.
Figure 8. An example SPARQL query to be executed against our example Flickr RDF output (see
Figure 7). This query returns a poster count by home country of Flickr photo posts that were taken at
the Eiffel Tower. Our event model's structure makes it easy to query locational references (e.g.
country) based on their referents (e.g. poster).
Our model has limitations. For one, our three groups of locatable things are only derived
from metadata, but there are more references found in post text bodies. We also recognize
that as social networks evolve, the pattern of locatable things also changes, and therefore our
vocabulary, if not general enough, may need updating. An intentional caveat of our results is
that we neglected to include data quality measures. We concern ourselves with clarifying
locational references and not deducing the accuracy of a locational reference. We do however
feel that this caveat is important because other work suggests that some locational references
are highly unreliable, regardless of their referent (Hecht et al. 2011). Furthermore, as discussed
in section 2.2, we recognize that many posters do not intend to include false or even
misleading information in their posts. Yet, due to the flexibility of social networks, as discussed
in section 4.1, we reiterate our argument that it is a risky assumption that references can be
interchanged.
When we began this work, we intended to connect our workflow directly to API endpoints.
This would afford users of post metadata a complete workflow to query data and filter what is
relevant based on their desired input questions. Successful extract-transform-load (ETL) tools
such as Esri D.C.'s Koop are inspirational for future work in attaching our workflow directly to
API endpoints. However, at this point, an ETL approach is not generalizable for this type of data,
and may only mirror other mature tools like Temboo. We also plan to make our model more
robust by including more data quality checks that evaluate input data for missing attributes
and relevance.
Acknowledgments
We thank our three peer reviewers for their criticism and feedback. We thank Dr. Daniel R Montello, Dr.
Krzysztof Janowicz, Behzad Vahedi, Sara Lafia, and Jingyi Xaio for their edits and suggestions.
Disclosure statement
No potential conflict of interest was reported by the authors.
ORCID
Thomas Hervey http://orcid.org/0000-0003-3803-0937
Werner Kuhn http://orcid.org/0000-0002-4491-0132
References
Abel, F., et al., 2012. Twitcident. In: 21st International Conference on World Wide Web, 16–20 April
2012 Lyon, New York: ACM, 305–308. doi:10.1145/2187980.2188035
Alex, B., et al., 2016. Homing in on Twitter users: evaluating an enhanced geoparser for user profile
locations. In: 10th Language Resources and Evaluation Conference, 23–28 May 2017 Portoroz,
Slovenia, Paris: European Language Resources Association, 3936–3944.
Bao, J., et al., 2016. Geo-social media data analytic for user modeling and location-based services.
SIGSPATIAL Special, 7 (3), 11–18. doi:10.1145/2876480.2876484
Barbier, G. and Liu, H., 2011. Information provenance in social media. In: J. Salerno, S.J. Yang, D.
Nau, and S.-K. Chai, eds. Social Computing, Behavioral-Cultural Modeling and Prediction: 4th
International Conference, SBP 2011, 29–31 March 2011 College Park, MD, Berlin: Springer
Berlin Heidelberg, 276–283. doi:10.1007/978-3-642-19656-0_39
Carbunar, B. and Potharaju, R., 2012. You unlocked the Mt. Everest badge on foursquare! Countering
location fraud in GeoSocial networks. In: IEEE 9th International Conference on Mobile Ad-Hoc and Sensor
Systems (MASS 2012), 8–11 October 2012 Las Vegas: IEEE, 182–190. doi:10.1109/MASS.2012.6502516
Chang, J. and Sun, E., 2011. Location3: how users share and respond to location-based data on
social networking sites. In: 5th AAAI Conference on Weblogs and Social Media, 17–21 July 2011
Menlo Park: AAAI Press, 74–80.
Cheng, Z., et al., 2011. Exploring millions of footprints in location sharing services. In: 4th
International AAAI Conference on Weblogs and Social Media, 23–26 May 2010, Washington, DC,
Menlo Park: AAAI Press, 81–88.
Conole, G., Galley, R., and Culver, J., 2011. Social networking barcelona, site for academic practice.
International Review of Research in Open and Distance Learning, 12 (3), 119–138.
doi:10.1111/j.1083-6101.2007.00393.x
De Longueville, B., Smith, R.S., and Luraschi, G., 2009. "OMG, from here, I can see the flames!": a use
case of mining location based social networks to acquire spatio-temporal data on forest fires. In:
2009 International Workshop on Location Based Social Networks, 03 November 2009 Seattle, New
York: ACM Press, 73–80. doi:10.1145/1629890.1629907
Dredze, M., et al., 2013. Carmen: a twitter geolocation system with applications to public health. In:
27th AAAI Conference on Artificial Intelligence, AAAI Workshop on Expanding the Boundaries of
Health Informatics Using AI (HIAI), 14–15 July 2013, Bellevue, WA, Menlo Park: AAAI Press, 20–24.
Fink, C., et al., 2012. Mapping the twitterverse in the developing world: an analysis of social media
use in Nigeria. In: S.J. Yang, A.M. Greenberg, and M. Endsley, eds. Social Computing, Behavioral -
Cultural Modeling and Prediction: 5th International Conference, SBP 2012, 3–5 April 2012 College
Park, MD. Berlin: Springer Berlin Heidelberg, 164–171. doi:10.1007/978-3-642-29047-3_20
Furche, T., et al., 2016. Data wrangling for big data: challenges and opportunities. In: 19th
International Conference on Extending Database Technology, 15–18 March 2016, Bordeaux,
France, Saarbrücken, Germany: Dagstuhl Publishing, 473–478. doi:10.5441/002/edbt.2016.44
Gao, S., 2017. Extracting Computational Representations of Place with Social Sensing. Thesis (Ph.D.).
University of California, Santa Barbara.
Gayo-Avello, D., 2011. Don't turn social media into another 'Literary Digest' poll. Communications
of the ACM, 54 (10), 121–128. doi:10.1145/2001269.2001297
Gil, Y., et al., 2013. PROV model primer [online]. PROV Model Primer. Available from:
www.w3.org/TR/prov-primer/. [Accessed 8 May 2017].
Goodchild, M.F., 2007. Citizens as sensors: the world of volunteered geography. GeoJournal, 69 (4),
211–221. doi:10.1007/s10708-007-9111-y
Gössling, S. and Stavrinidi, I., 2015. Social networking, mobilities, and the rise of liquid identities.
Mobilities, 101 (1), 1–21. doi:10.1080/17450101.2015.1034453
Hecht, B., et al., 2011. Tweets from Justin Bieber's heart: the dynamics of the Location field in user
profiles. In: CHI 2011 - 29th Annual CHI Conference on Human Factors in Computing Systems, Conference
Proceedings and Extended Abstracts, 7–12 May 2011, Vancouver, BC, New York: ACM Press, 237–246.
Hills, D., et al., 2015. The importance of data set provenance for science. EOS [online], 96 (12).
doi:10.1029/2015EO040557
Huck, J., Whyatt, D., and Coulton, P., 2015. Visualizing patterns in spatially ambiguous point data.
Journal of Spatial Information Science, 10 (10), 47–66. doi:10.5311/JOSIS.2015.10.211
Johnson, I.L., et al., 2016. The geography and importance of localness in geotagged social media.
In: 34th Annual CHI Conference on Human Factors in Computing Systems, 07–12 May 2016, San
Jose, CA, New York: ACM Press, 515–526. doi:10.1145/2858036.2858122
Jones, C.B., et al., 2004. The SPIRIT Spatial Search Engine: architecture, Ontologies and Spatial
Indexing. In: M.J. Egenhofer, C. Freksa, and H.J. Miller, eds. Proceedings of Geographic Information
Science: Third International Conference, GIScience 2004, October 20–23 2004, Adelphi, MD, Berlin:
Springer Berlin Heidelberg, 125–139. doi:10.1007/978-3-540-30231-5_9
Kim, H.S., 2016. What drives you to check in on Facebook? Motivations, privacy concerns, and
mobile phone involvement for location-based information sharing. Computers in Human
Behavior, 54C (1), 397–406. doi:10.1016/j.chb.2015.08.016
Kuhn, W., 2012. Core concepts of spatial information for transdisciplinary research. International
Journal of Geographical Information Science, 26 (12), 2267–2276. doi:10.1080/13658816.2012.722637
Kumar, H., et al., 2017. Impact of Facebook's check-in feature on users of social networking sites. In:
S.C. Satapathy, et al., eds. 3rd Springer International Conference on Computer & Communication
Technologies, 28–29 October 2016, Vijayawada, Andhra Pradesh, India, Singapore: Springer
Singapore, Vol. 5, 611–620. doi:10.1007/978-981-10-3226-4_63
Kumar, R., Novak, J., and Tomkins, A., 2010. Structure and evolution of online social networks. In: P.S. Yu, J.
Han, and C. Faloutsos, eds. Link mining: models, algorithms, and applications. Springer New York: New
York, 337–357. doi:10.1007/978-1-4419-6515-8_13
Lafia, S., et al., 2016. Spatial discovery and the research library. Transactions in GIS, 20 (3), 399–412.
doi:10.1111/tgis.12235
Lauriault, T.P., et al., 2007. Today's data are part of tomorrow's research: archival issues in the
sciences. Archivaria, 64 (Fall), 123–179.
Mislove, A., et al., 2008. Growth of the Flickr social network. In: 1st Workshop on Online Social Networks
(WOSN '08), 18 August 2008, Seattle, WA, New York: ACM Press, 25–30. doi:10.1145/1397735.1397742
Moreau, L. and Missier, P., eds., 2013. PROV-DM: the PROV data model [online]. W3C recommendation.
Available from: http://www.w3.org/TR/2013/REC-prov-dm-20130430/. [Accessed 10 December 2016].
Peuquet, D.J. and Duan, N., 1995. An event-based spatiotemporal data model (ESTDM) for
temporal analysis of geographical data. International Journal of Geographical Information
Systems, 9 (1), 7–24. doi:10.1080/02693799508902022
Roick, O. and Heuser, S., 2013. Location based social networks - definition, current state of the art
and research agenda. Transactions in GIS, 17 (5), 763–784. doi:10.1111/tgis.12032
Sakaki, T., Okazaki, M., and Matsuo, Y., 2010. Earthquake shakes twitter users: real-time event
detection by social sensors. In: 19th International Conference on World Wide Web, 26–30 April
2010, Raleigh, NC. New York: ACM Press, 851–860. doi:10.1145/1772690.1772777
Scellato, S., et al., 2011. Socio-spatial properties of online location-based social networks. In: 5th
International AAAI Conference on Weblogs and Social Media, 17–21 July 2011, Barcelona, Menlo
Park, CA: AAAI Press, 329–336.
Schuurman, N. and Leszczynski, A., 2006. Ontology-based metadata. Transactions in GIS, 10 (5),
709–726. doi:10.1111/tgis.2006.10.issue-5
See, L., et al., 2016. Crowdsourcing, citizen science or volunteered geographic information? The
current state of crowdsourced geographic information. ISPRS International Journal of
Geo-Information, 5 (5), 55. doi:10.3390/ijgi5050055
Simmhan, Y.L., Plale, B., and Gannon, D., 2005. A survey of data provenance techniques. Science,
47405 (3), 1–25. doi:10.1145/1084805.1084812
Stefanidis, A., Crooks, A., and Radzikowski, J., 2013. Harvesting ambient geospatial information
from social media feeds. GeoJournal, 78 (2), 319–338. doi:10.1007/s10708-011-9438-2
Steiger, E., De Albuquerque, J.P., and Zipf, A., 2015. An advanced systematic literature review on
spatiotemporal analyses of Twitter data. Transactions in GIS, 19 (6), 809–834. doi:10.1111/tgis.12132
Sui, D. and Goodchild, M., 2011. The convergence of GIS and social media: challenges for
GIScience. International Journal of Geographical Information Science, 25 (11), 1737–1748.
doi:10.1080/13658816.2011.604636
Tan, W.C., 2004. Research problems in data provenance. IEEE Data Engineering Bulletin, 27 (4), 45–52.
Taxidou, I., et al., 2015. Modeling information diffusion in social media as provenance with W3C
PROV. In: 24th International Conference on World Wide Web, 18–22 May 2015, Florence. New
York: ACM Press, 819–824. doi:10.1145/2740908.2742475
Wetstein, S., 2016. Beating election polls with Twitter. A visualization study. Thesis. University in
Amsterdam, Netherlands.
Wilken, R., 2014. Twitter and geographical location. In: K. Weller, A. Bruns, J. Burgess, M. Mahrt, and
C. Puschmann, eds. Twitter and society. New York: Peter Lang, 155–167.
18 T. HERVEY AND W. KUHN
... This editorial highlights how these issues are discussed and addressed by the articles of this special issue and how the papers highlight emerging technologies, concepts, platforms, debates, and methodologies and techniques within VGI and suggest future research directions. This special issue gathered papers on the topics of crowdsourced geospatial data quality (Ballatore and Arsanjani 2018), thematic uncertainty and consistency across data sources (Hervey and Kuhn 2018), spatial biases (Millar et al., 2018), trust issues within VGI (Severinsen et al. 2019), and contributors behaviour and interactions (Truong et al. 2018). ...
... Ballatore and Arsanjani (2018) looked at the origin and development of Wikimapia and discussed some aspect of Wikimapia, including the project's intellectual property and strategies for quality management. Hervey and Kuhn (2018) explored uncertainty with locational data obtained from social networks. They presented a taxonomy of things that can be located from social network posts and a means to describe them to users. ...
... Geoparsing is a procedure to detect the geographic information in texts and link with gazetteers, a database storing place names and their attributes, including coordinates, population, size, and type [4]. This process generally involves geotagging that recognizes place names in text and geocoding that transforms place names into coordinates [5][6][7]. Geotagging commonly recognizes place names in a text by constructing geographical language models trained on massive corpora of geotagged annotations, such as river, city, etc. [8]. The goal of geocoding is to select the correct coordinate for the place name from a list of candidate coordinates from a gazetteer such as GeoNames [9]. ...
Article
Full-text available
Geocoding is an essential procedure in geographical information retrieval to associate place names with coordinates. Due to the inherent ambiguity of place names in natural language and the scarcity of place names in textual data, it is widely recognized that geocoding is challenging. Recent advances in deep learning have promoted the use of the neural network to improve the performance of geocoding. However, most of the existing approaches consider only the local context, e.g., neighboring words in a sentence, as opposed to the global context, e.g., the topic of the document. Lack of global information may have a severe impact on the robustness of the model. To fill the research gap, this paper proposes a novel global context embedding approach to generate linguistic and geospatial features through topic embedding and location embedding, respectively. A deep neural network called LGGeoCoder, which integrates local and global features, is developed to solve the geocoding as a classification problem. The experiments on a Wikipedia place name dataset demonstrate that LGGeoCoder achieves competitive performance compared with state-of-the-art models. Furthermore, the effect of introducing global linguistic and geospatial features in geocoding to alleviate the ambiguity and scarcity problem is discussed.
Article
This paper provides an overview of possibilities to localize acts of communication and their agents based on digital traces and scrutinizes their advantages and disadvantages. It shows, (i) what types of geographic information exist in social media data and to what extent they are available to researchers, (ii) which approaches exist to classify locations and (iii) what the advantages and disadvantages of the various approaches are. Introducing an approach to automatically classify location information based on the location information in users’ profiles and a multi-step cross-validation with time zone information, we show that the less resource-intensive approach yields high precision comparable to the “gold standard” of human coding while recall is comparatively low. The discussion of advantages and limitations of all approaches shows that – depending on the research question – the specific research context and its presumed effect, the aspired granularity of location classification and resource considerations can guide a researchers’ decision.
Article
Full-text available
Academic libraries have always supported research across disciplines by integrating access to diverse contents and resources. They now have the opportunity to reinvent their role in facilitating interdisciplinary work by offering researchers new ways of sharing, curating, discovering, and linking research data. Spatial data and metadata support this process because location often integrates disciplinary perspectives, enabling researchers to make their own research data more discoverable, to discover data of other researchers, and to integrate data from multiple sources. The Center for Spatial Studies at the University of California, Santa Barbara (UCSB) and the UCSB Library are undertaking joint research to better enable the discovery of research data and publications. The research addresses the question of how to spatially enable data discovery in a setting that allows for mapping and analysis in a GIS while connecting the data to publications about them. It suggests a framework for an integrated data discovery mechanism and shows how publications may be linked to associated data sets exposed either directly or through metadata on Esri's Open Data platform. The results demonstrate a simple form of linking data to publications through spatially referenced metadata and persistent identifiers. This linking adds value to research products and increases their discoverability across disciplinary boundaries.
Article
Full-text available
Citizens are increasingly becoming an important source of geographic information, sometimes entering domains that had until recently been the exclusive realm of authoritative agencies. This activity has a very diverse character as it can, amongst other things, be active or passive, involve spatial or aspatial data and the data provided can be variable in terms of key attributes such as format, description and quality. Unsurprisingly, therefore, there are a variety of terms used to describe data arising from citizens. In this article, the expressions used to describe citizen sensing of geographic information are reviewed and their use over time explored, prior to categorizing them and highlighting key issues in the current state of the subject. The latter involved a review of ~100 Internet sites with particular focus on their thematic topic, the nature of the data and issues such as incentives for contributors. This review suggests that most sites involve active rather than passive contribution, with citizens typically motivated by the desire to aid a worthy cause, often receiving little training. As such, this article provides a snapshot of the role of citizens in crowdsourcing geographic information and a guide to the current status of this rapidly emerging and evolving subject.
Conference Paper
Geotagged tweets and other forms of social media volunteered geographic information (VGI) are becoming increasingly critical to many applications and scientific studies. An important assumption underlying much of this research is that social media VGI is "local," or that its geotags correspond closely with the general home locations of its contributors. We demonstrate through a study on three separate social media communities (Twitter, Flickr, Swarm) that this localness assumption holds in only about 75% of cases. In addition, we show that the geographic contours of localness follow important sociodemographic trends, with social media in, for instance, rural areas and older areas, being substantially less local in character (when controlling for other demographics). We demonstrate through a case study that failure to account for non-local social media VGI can lead to misrepresentative results in social media VGI-based studies. Finally, we compare the methods for determining localness, finding substantial disagreement in certain cases, and highlight new best practices for social media VGI-based studies and systems.
Article
Data do not exist in a vacuum. To be useful, data must be accompanied by context on how they are captured, processed, analyzed, and validated and other information that enables interpretation and use.
Chapter
Social media is infiltrating the planet. Facebook has been the leader in this industry for almost a decade. Social media sites have made many changes over the years regarding security and advertising that have, frankly, miffed a lot of people. This paper mainly describes the impact of the Facebook Check-in feature. It also proposes a new feature, Check-in Checker, which has not yet been introduced in Facebook. Check-in Checker is an extension of the existing Facebook check-in feature. It is a helpful tool for knowing who is going where and at what time. In addition, the popularity of a common place visited by friends will increase. Data were gathered from 550 people who gave their views on how they use Facebook. Their opinions on the existing ‘Check-in’ feature provided by Facebook were also taken into account. Based on an analysis of the data, the new Check-in Checker feature suits the demands and requirements of today’s users and should be added to Facebook’s feature list. This feature does not violate any user’s privacy, and there are a good number of positive reasons for it to be in the pool of Facebook’s exciting features.
Article
Scientific data are essential for training in science and informed decision-making regarding health, the environment, and the economy. Cumulative data sets assist with understanding trends, frequencies and patterns, and can form a baseline upon which we can develop predictions. This paper discusses the preservation of scientific data, providing an overview of the characteristics of scientific data and scientific-data portals from a variety of fields, with a focus on data quality, particularly accuracy, reliability and authenticity, and how these are captured in metadata. These concepts are broadly defined from both scientific and archival perspectives. Based on an extensive literature review of publications from national and international scientific organizations, government and research funding bodies, and empirical evidence from a selection of InterPARES 2 Case Studies and General Study 10, which investigated thirty-two scientific-data portals, the paper includes a brief examination of machine-based "knowledge representation" (KR) and the potential implications for the preservation of scientific data, with a particular focus on formal ontologies. The paper also discusses the concept of record in the context of Web 2.0 environments, the paucity of scientific data archives, and the lack of funding priorities in this area. It is argued that archivists will have to work closely with scientific-data creators to understand their practices, that data portals are mechanisms that archivists can use to extend their preservation practices, and that it is not technology that is impeding progress regarding the preservation of scientific data; it is a lack of funding, policy, prioritizing, and vision allowing our scientific national resources to be lost.
Conference Paper
In recent years, research in information diffusion in social media has attracted a lot of attention, since the produced data is fast, massive and viral. Additionally, the provenance of such data is equally important because it helps to judge the relevance and trustworthiness of the information enclosed in the data. However, social media currently provide insufficient mechanisms for provenance, while models of information diffusion use their own concepts and notations, targeted to specific use cases. In this paper, we propose a model for information diffusion and provenance, based on the W3C PROV Data Model. The advantage is that PROV is a Web-native and interoperable format that allows easy publication of provenance data, and minimizes the integration effort among different systems making use of PROV.
Article
More and more geo-tagged social media data is generated nowadays, from geo-tagged tweets and geo-tagged photos to check-ins. Analyzing this flourishing data makes it possible to discover users' daily mobility patterns, profiles and preferences. As a result, based on the analyzed results, new types of location-based services emerge. In this article, we first introduce the recent advances in location-based user preference modeling, which include: 1) inferring users' demographics, 2) identifying users' novelty-seeking characteristics and 3) discovering users' shopping impulsiveness. After that, we present a comprehensive summary of the state of the art in location-based services that take advantage of geo-social media, including: 1) location-based recommendations, 2) location-based prediction.
Article
As technologies permitting both the creation and retrieval of data containing spatial information continue to develop, so does the number of visualizations using such data. This spatial information will often comprise a place name that may be "geocoded" into coordinates, and displayed on a map, frequently using a "heatmap-style" visualization to reveal patterns in the data. Across a dataset, however, there is often ambiguity in the geographic scale to which a place-name refers (country, county, town, street etc.), and attempts to simultaneously map data at a multitude of different scales will result in the formation of "false hotspots" within the map. These form at the centers of administrative areas (countries, counties, towns etc.) and introduce erroneous patterns into the dataset whilst obscuring real ones, resulting in misleading visualizations of the patterns in the dataset. This paper therefore proposes a new algorithm to intelligently redistribute data that would otherwise contribute to these "false hotspots," moving them to locations that likely reflect real-world patterns at a homogeneous scale, and so allow more representative visualizations to be created without the negative effects of "false hotspots" resulting from multi-scale data. This technique is demonstrated on a sample dataset taken from Twitter, and validated against the "geotagged" portion of the same dataset.