Chronotopic Information Interaction:
Integrating Temporal and Spatial Structure for
Historical Indexing and Interactive Search
Benjamin Adams
Department of Computer Science and Software Engineering
University of Canterbury, New Zealand
benjamin.adams@canterbury.ac.nz
Abstract. Domain-based learning and research are important applica-
tions driving the development of exploratory search systems. A wealth of
historical information about events from around the world resides within
documents on the web, yet contemporary search engines do not take ad-
vantage of the closely integrated temporal and spatial information found
within these web pages for indexing and design of search user inter-
faces. This gap limits the use of the web as a resource for historical and
geohistorical information seeking. In this paper we propose chronotopic
information interaction as a new interaction concept for web search that
explicitly links temporal and spatial entities to keywords using a space-
time grid index and a paired search user interface. The space-time grid
index allows different modes of interaction between spatial, temporal,
and keyword-based views in the search user interface. We demonstrate
use of the space-time grid index and chronotopic information interaction
concept with the development of Pteraform, a prototype of a search en-
gine that enables users to explore information in the English version of
Wikipedia through a geo-historical lens.
Keywords: exploratory search, web search, historical information re-
trieval, geographic information retrieval, information seeking, informa-
tion interaction
1 Introduction
Historical thinking and inquiry is a crucial component of active citizenship in
a civil society [30,52]. While doing history is a complex endeavour of critical
thinking that involves many tasks beyond the discovery of content, information
retrieval systems could do a much better job to help facilitate the process. The
web is a tremendous resource for historical information with millions of pages
that describe rich and varied knowledge about events and processes that have
occurred throughout human history. This historical information is present within
the text of documents and provides an implicit structure for indexing and pre-
senting search results. In the cataloguing systems of both digital and physical
libraries, resources are occasionally organized historically or based on world
geography. However, neither existing libraries nor web-based search engines provide
a systematic means for tapping into the historical and geographic content in the
vast majority of documents and books that are not already organized in that
way. By mining this structure from digital documents, we can build search engines
that facilitate historical research and learning through the search process,
providing new ways of exploring and finding connections between historical events
and places that are mentioned in heterogeneous web collections.
[Pre-print]
This article has been accepted for publication in Digital Scholarship in the Humanities
published by Oxford University Press.
DOI: 10.1093/llc/fqaa049
Published version available online at https://doi.org/10.1093/llc/fqaa049.
In this paper we propose a new concept of temporal and geographic informa-
tion search called chronotopic information interaction, after the concept
of the chronotope [2]. M. M. Bakhtin introduced the idea of the chronotope
(time-space in Greek) in literary theory for analyzing different categories of how
integrated concepts of time and space are configured in narrative texts. We have
adapted this notion to the idea of using the integrated connections between time
and space in web documents to build an index for interactive search. We can
frame chronotopic information interaction in the context of exploratory search
systems, where the goal is not primarily one of fact-finding, but rather to develop
search tools that support constructive learning activities [41]. The purpose then
of the search engine is to facilitate integrating, synthesizing, comparing, and
general discovery of historical and geohistorical information in iterative search
sessions.
Time and space are fundamental dimensions of information that are refer-
enced using a diverse set of entities found in many kinds of texts across myriad
subject areas. Figure 1 illustrates the spatial and temporal references in one
such text, the English Wikipedia article for the History of Montreal. Temporal
references can include dates as well as event entities, and spatial references in-
clude named places and other real locations on the Earth. A vast number of
documents—from literary texts to non-fiction books, scientific articles, newspa-
per articles, encyclopedia entries, and special interest websites—contain these
kinds of spatial and temporal references, especially at historical and geographic
scales. Unlike the example in Figure 1, the documents do not need to be primarily
historical in nature. For example, it can be a fictional novel (e.g., Cryptonomicon
or Anna Karenina) that references real world locations at different times in his-
tory. Or it can be a primary source that describes a record of historical events as
they happened. The temporal and spatial references can be present at any point
within the text of the document, and they provide a context for comparison with
other documents on the web. Thus, space-time provides an implicit, crosscutting
structure to a document collection that can enable us to find relevant documents
and discover relationships based on geographic and historical context.
In this paper we present a new space-time grid indexing data structure to sup-
port the integrated search of information along temporal, spatial, and thematic
dimensions. We follow with the system design for Pteraform¹, a fully integrated
geo-historical search engine that allows one to explore a document corpus using
the implicit references to places, times, and events in documents. The prototype
was developed using the English Wikipedia data set, and the system design can
be extended to other data sets.
¹ https://pteraform.csse.canterbury.ac.nz
Fig. 1. A sample web document about the History of Montreal from Wikipedia that
illustrates the kinds of temporal and spatial references that can be found within a text.
Green highlighted words are temporal references and red highlighted words are spatial
references. Note that these references vary a great deal in terms of their spatial and
temporal granularity.
2 Related work
Over the last 25 years, geographic information retrieval (GIR), as a sub-field
of information retrieval, has been concerned with developing systems that can
leverage geographic references in texts to help organize information [29]. The first
fully-integrated spatial and text-based (thematic) index in GIR was developed
for the SPIRIT project and has been subsequently extended [24,14,31,25]. More
recently, the Frankenplace prototype introduced a discrete global grid system
and indexing scheme that combined notions of spatial and textual granularity,
but the index has no temporal components [4].
Document geocoding remains an active research area within the GIR litera-
ture, and this includes spatio-temporal information [35,50]. The advent of mobile
search has shifted the focus of research in GIR toward developing systems that
allow us to search for things in the world [13] as opposed to the original ambi-
tion to develop geographically-based search for text. In contrast to the growth
in mobile search applications in industry, the map-based spatial search systems
developed in research have not seen the same degree of uptake by commercial
web companies.
However, the growth of the spatial humanities, the spatial turn in digital
humanities and the history of sciences, has led to efforts to create large-scale
databases of geographic references in historical documents [18,54,23]. Further-
more, spatial search has been proposed as a way of organizing and discovering
scientific research objects [28]. A significant body of research has also focused on
modeling locations with web text data, especially shorter microblog text, and
ranking locations given a query [45,53,27,3].
Despite this research activity, surprisingly little lasting advancement has been
made toward implementing fully-operational web search user interfaces that ex-
ploit geographic content within documents. Furthermore, spatial and tempo-
ral references are often intrinsically related in texts—important events occur
in places at specific times and spatial and temporal references co-occur often
in texts [2]. Thus, the utility of geographic and historical information retrieval
combined in one system has not been adequately explored; the temporal dimen-
sion has largely been ignored in GIR research in terms of indexing, relevance
ranking, and search user interface design.
The utility of extracting temporal information from documents to support
ad hoc search, exploratory search, and top result clustering has motivated re-
search on temporal information retrieval [5]. Visualization of historical informa-
tion within a document corpus has been explored to differing degrees. Pfoser
et al. [39] developed a database of information in history textbooks using only
thematic metadata (not a true text index). A conceptual model of temporal,
geographic, and thematic search using ontologies was developed by Mata and
Claramunt [32]. The TimeTrails project used time and space to visualize doc-
uments within a corpus but does not use a text index [49]. Chasin et al. [12]
developed methods for visualizing the spatio-temporal entities found within the
texts. Temporal clustering of search results based on temporal expressions in
documents has also been explored by Alonso et al. [6].
There is strong indication of positive outcomes from pairing digital tech-
nologies with historical learning in classroom settings [51,8]. In the remainder
of this paper we describe the development of a complete spatio-temporal (geo-
historical) information retrieval system that combines work on geographic and
temporal parsing through to indexing and search user interface development,
with the research contribution emphasis on the latter two components. Such a
system can support learning by putting web resources in an historical frame.
The prototype was built using information from the English Wikipedia but the
methods are not particular to that data set and can be applied to any collection.
It is novel with respect to the strong interconnection between time and space in
the development of both the indexing methods and the design of the user interface.
3 Space-time grid index
In order to efficiently search for information in large text collections we need
to index the documents. An inverted index that maps words and phrases to
documents is a commonly used data structure in information retrieval, but a
traditional inverted index does not capture any spatial and temporal structure
of the information [43]. In this section we introduce a space-time grid index that
is a data structure designed to organize document content along spatial and
temporal dimensions at multiple spatial and temporal scales. It is capable of
returning scores for space-time grid cells given an ad hoc keyword query.
A space-time grid is defined as a set of space-time cell pairs. A space-time
cell pair, $c$, is a tuple $\langle g, t \rangle$, where $g$ is a cell in a two- or three-dimensional
spatial grid and $t$ is a cell in a temporal grid. This can be a 2+1 or 3+1
representation of space-time [20]. A regular space-time grid is one where the spatial
grid is a regular spatial tessellation and the temporal grid consists of fixed-
sized temporal cells. A regular space-time grid for organizing information that
is world-wide and covering history since the 1500s, for example, might be cre-
ated using a hexagonal tessellation of the surface of the Earth and a fixed-sized
timeline at the granularity of decades.
A space-time grid system is a set of space-time grids that are organized
in a two-dimensional matrix of coarser-to-finer spatial and temporal grids. For
global data, a hierarchical discrete global grid system can be used to specify
spatial grids at multiple granularities. Irregularly shaped regions, such as ad-
ministrative units, can easily be recovered from the more precise global grid
[1]. Meanwhile, the temporal grids are defined as one-dimensional segmentations
that are progressively decomposed by centuries, decades, years, months, and so
on depending on the underlying temporal characteristics of the data.
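The definitions above can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation: the names `SpaceTimeCell`, `temporal_cell`, and `space_time_cell` are hypothetical, spatial cells are assumed to be identified by opaque string ids, and temporal cells are assumed to be fixed-size bins obtained by floor division of the year (so the bin labelled 16 covers 1600 to 1699, a simplification of how centuries are conventionally counted).

```python
from dataclasses import dataclass

# Assumed bin widths (in years) for each temporal grid in the grid system.
YEARS_PER_UNIT = {"century": 100, "decade": 10, "year": 1}


@dataclass(frozen=True)
class SpaceTimeCell:
    """A space-time cell pair <g, t>."""
    g: str  # id of a spatial grid cell (hypothetical id format)
    t: int  # index of a fixed-size temporal cell


def temporal_cell(year: int, granularity: str) -> int:
    """Return the index of the temporal cell containing `year`.

    Floor division keeps negative (BCE) years in contiguous bins.
    """
    return year // YEARS_PER_UNIT[granularity]


def space_time_cell(g: str, year: int, granularity: str) -> SpaceTimeCell:
    """Pair a spatial cell id with the temporal cell for `year`."""
    return SpaceTimeCell(g, temporal_cell(year, granularity))


# The same reference lands in different cells of the coarser and finer grids.
coarse = space_time_cell("hex-42", 1642, "century")
fine = space_time_cell("hex-42", 1642, "year")
```

A space-time grid system then amounts to choosing one spatial resolution and one entry of `YEARS_PER_UNIT` per grid.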
A space-time grid index is a word-level inverted index that matches space-time
cell pairs to terms, so that keyword searches can provide ranking scores
for each space-time cell. The rankings can then be represented using a variety of
interactive spatio-temporal visualisations, such as integrated maps and timelines.
To build the index we need to define a chronotopic mapping, which is a
mapping from a segment of the text within a web document to a set of space-time
grid cells. A document is defined as an ordered set of words $\langle w_1, w_2, \ldots, w_m \rangle$.
A chronotopic mapping is an ordered triple $\langle W, C, E \rangle$, where $W$ is a set of
terms, $C$ is a set of space-time cell pairs, and $E$ is a set of edges, where
$E \subseteq \{\{x, y, \gamma\} \mid (x, y) \in W \times C,\ \gamma \in \mathbb{R}\}$. The choice of what words from a document
are mapped to a given space-time cell (i.e., how the document is decomposed
into segments), as well as the choice of the weighting, γ, will be dependent on
the algorithm used to match terms within the document to spatial and temporal
references or on other semantic metadata associated with the document, such
as time of creation or generic information about the era or places described
within the text. Ultimately, this depends on the application for the index—the
underlying corpus and the kinds of queries that the system should support.
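As a rough illustration of how a chronotopic mapping could feed a word-level inverted index, the sketch below represents each edge as a `(term, cell, gamma)` tuple and inverts the edges into a term-to-cell table. The function name `build_index` and the choice to sum the weights of repeated edges are assumptions made for this example, not details taken from the prototype.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Cell = Tuple[str, int]          # space-time cell pair <g, t>
Edge = Tuple[str, Cell, float]  # (term x, cell y, weight gamma)


def build_index(edges: List[Edge]) -> Dict[str, Dict[Cell, float]]:
    """Invert chronotopic edges into term -> {cell: summed gamma weight}."""
    index: Dict[str, Dict[Cell, float]] = defaultdict(lambda: defaultdict(float))
    for term, cell, gamma in edges:
        # Assumption: repeated (term, cell) edges accumulate their weights.
        index[term][cell] += gamma
    return index


edges = [
    ("fur", ("hex-a", 16), 1.0),
    ("trade", ("hex-a", 16), 1.0),
    ("fur", ("hex-a", 16), 0.5),
    ("fur", ("hex-b", 17), 1.0),
]
index = build_index(edges)
```

A keyword lookup such as `index["fur"]` then returns every space-time cell the term is mapped to, together with the weighted occurrence total that a scoring function can consume.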
4 Data preparation
Prior to building a space-time grid index, we must first identify the place names
and dates within the texts and match them to space-time grid cells. Then we need
a mechanism for deciding what words are associated with those places and dates,
that is, to establish the chronotopic mappings. In this section we describe how
we performed this requisite step for the corpus used in the Pteraform prototype.
4.1 Pre-processing
The source data for Pteraform is the English Wikipedia dump from July 20,
2017. Each of the Wikipedia articles has been pre-processed and stored in a
database with each paragraph of each article stored in a separate row. This is be-
cause the paragraph-level is the granularity we have chosen for doing chronotopic
mappings. All of the data has been processed to extract geographic references
and temporal references. For this we used a number of open data sets, includ-
ing geographic linked data from DBpedia, Wikidata, and digital gazetteers; and
temporal parsing software, including Heideltime [10,48,17,1].
The workflow involved in verifying and cleaning the results of the entity
recognition tools was extensive. An initial set of named places was identified
using DBpedia and Wikidata; however, the spatial data in those sources is poor
(point data primarily), so the spatial representation was enriched by matching
to named places in the Wāhi discrete global grid gazetteer [1]. This gave us a
starting point to know what articles in Wikipedia are about places, and when
other articles link to those pages about places. However, in Wikipedia, only the
first reference to another Wikipedia page concept is explicitly linked in the data,
so we needed to match additional mentions to named places throughout the re-
mainder of the document. Furthermore, many articles make no explicit links to
place articles at all, even when references to places exist in the text. We per-
formed further entity recognition on the articles to make those additional links
explicit in the database. We used a variety of methods in an iterative manner to
perform this entity recognition starting with syntactic matching on all named
places from the digital gazetteer. For ambiguous place names we then utilized
multiple open source geoparsing tools [22,16,26], choosing only places where the
tools agreed, prioritizing precision over recall. Finally, we performed some san-
ity checks on many of the more common place names from around the world.
The manual curation involved in these steps should not be understated. This
process worked well for Wikipedia; however, other corpora such as collections of
primary sources with historical place names, for example, would likely have re-
quired even more manual intervention. Drawing in tailored sources of geographic
linked open data and using semantic annotation tools to generate new data (e.g.,
Recogito [47]) could play an important role in those cases.
Compared with the procedure used to identify place references, identifying
dates within the articles was relatively simple. We ran the Heideltime temporal
parser on all the documents to match references to centuries, decades, years,
months, and days [48]. With both the geographic and temporal parsing there
were cases of incorrectly matched places and dates, respectively. Throughout
the development of the prototype system the data has been manually cleaned
as errors are discovered in the data. It is important to note that the decisions
made during data collection and cleaning processes, which help to define the
chronotopic mappings that underlie the space-time grid index, are important
design decisions when building any chronotopic search engine.
Utilizing the spatial representation from the discrete global grid gazetteer [4],
we used a simple heuristic to create mappings between the terms within each
paragraph and grid cells in the discrete global grid. More sophisticated natural
language processing-based methods could be developed in future to better match
the relationships between event references and the surrounding text, which in
turn could lead to better chronotopic mappings. However, adopting a heuristic
window size of individual paragraphs proved sufficient for the development of the
prototype, even though in some cases the connection between a place or date
and other terms within the paragraph might be tenuous. For example, in the
historical summary article shown in Figure 1 most references to individual years
are thematically related only to other words found within the same sentence.
However, the paragraphs are roughly written in such a way that, at the century-
level granularity, all the words of the paragraph can be lumped together. Thus,
an index which contains this article that is built at year-level granularity will
likely have some false positive results, whereas at century-level it will have fewer.
In Wikipedia, most articles are not historical summaries and do not have this
density of individual dates, hence we have settled on the paragraph heuristic for
simplicity's sake. In other data sources it might be appropriate to associate an
entire document with a single date or place based on metadata information.
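The paragraph heuristic can be illustrated with a short sketch. The exact rule used in the prototype is not spelled out in the text, so the version below makes explicit assumptions: every alphabetic token in a paragraph is linked to every combination of the paragraph's spatial cells and temporal cells, each with a uniform weight. The function name `paragraph_mapping`, the tokenizer, and the cell ids are all hypothetical.

```python
import re
from itertools import product
from typing import List, Tuple

Cell = Tuple[str, int]
Edge = Tuple[str, Cell, float]


def paragraph_mapping(text: str, place_cells: List[str], years: List[int],
                      years_per_unit: int = 100, gamma: float = 1.0) -> List[Edge]:
    """Link every term in one paragraph to every <g, t> pair it mentions.

    Assumption: a cross product of places and dates with a uniform weight.
    """
    terms = re.findall(r"[a-z]+", text.lower())
    temporal_cells = sorted({year // years_per_unit for year in years})
    cells = list(product(sorted(set(place_cells)), temporal_cells))
    return [(term, cell, gamma) for term in terms for cell in cells]


paragraph = "Fur trade grew in 1642 and 1701."
century_edges = paragraph_mapping(paragraph, ["hex-a"], [1642, 1701])
year_edges = paragraph_mapping(paragraph, ["hex-a"], [1642, 1701], years_per_unit=1)
```

At year granularity every term is linked to both 1642 and 1701 even if it relates to only one of them, which is the kind of false positive described above; at century granularity the two dates fall into coarser cells where the lumping is less likely to mislead.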
The space-time grid we implemented is based on an ISEA aperture 4 hexago-
nal hierarchical discrete global grid system (DGGS) [42]. This DGGS represents
a hierarchy of tessellations that increase in resolution by a factor of four (the
aperture) with each level. For example, at resolution level 8 the hexagonal grid has
655,362 equal area cells that cover the Earth. The majority of the hexagonal
cells in the grid do not end up contributing to the index, however, because there
are many areas of the Earth that do not have any documents associated with
them (e.g., large sections of the oceans). Two temporal grids were used: one at
the granularity of centuries and another at the level of individual years. Other
granularities are possible and the data described in the previous section has
been pre-processed to extract additional temporal entities—such as by decade,
year-month, and year-month-day—but indexes were not implemented at those
levels.
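The cell count quoted above can be checked with the usual closed form for icosahedron-based grids, in which a grid of aperture $a$ at resolution $r$ has $10 a^r + 2$ cells. The helper below is a small worked example; the function name `dggs_cell_count` is hypothetical.

```python
def dggs_cell_count(resolution: int, aperture: int = 4) -> int:
    """Number of cells in an icosahedral DGGS at the given resolution."""
    return 10 * aperture ** resolution + 2


# Resolution 8 of the aperture 4 grid gives the figure quoted in the text.
cells_at_level_8 = dggs_cell_count(8)
```

Each additional level multiplies the cell count by roughly four, so the choice of resolution is a direct trade-off between spatial precision and index size.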
Pre-processing the data and storing the intermediary data in this format is
not required to build a space-time grid index, but the data has been stored in
this manner to facilitate quick development of new iterations of the prototype
as well as re-use of the data for other projects. There are very few large scale
datasets available for comparative analysis of GIR systems, which has hindered
progressive development of new systems that can be easily compared with exist-
ing systems. The pre-processed data developed for this study is freely available
for download. The data is packaged as a PostgreSQL database using the PostGIS
extension [36] and contains the following tables:
– pages: page information, including page id, title, and PageRank (in the
Wikipedia article graph) [38].
– sections: structure of the key sections in the Wikipedia page, including page
id, section title (e.g., abstract), and header level.
– paragraphs: contains the text of each paragraph and pre-processed infor-
mation: page id, section id, and extracted entities, including place links,
references to days, years, decades, centuries.
– cells: hexagon cell ids and geometry in GeoJSON format.
– cell mappings: contains mappings between hexagon cell ids, temporal ids,
and paragraph ids, with weightings.
The appendix has more information on the schema for each type of database
table.
4.2 Topics
In addition to the pre-processing required for indexing, we used the Mallet toolkit
to perform latent Dirichlet allocation on the entire Wikipedia corpus to gener-
ate 1024 topics [33,11]. After manually cleaning up the topics to remove ‘junk’
topics, a vector containing the topic distribution for each article was stored to
be included as an extra field in the document index. The topic vectors for the
document results are aggregated by the user interface to generate a list of related
terms at search time (see Section 6.2).
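One way the aggregation of topic vectors into related terms could work is sketched below. The text does not state the aggregation rule, so this example assumes a plain sum of the topic distributions of the result documents followed by picking the top terms of the highest-weighted topics. The function name `related_terms`, its parameters, and the toy topics are hypothetical.

```python
from typing import List


def related_terms(doc_topic_vectors: List[List[float]],
                  topic_terms: List[List[str]],
                  top_topics: int = 2,
                  terms_per_topic: int = 3) -> List[str]:
    """Sum topic distributions over the result documents, then list the
    leading terms of the highest-weighted topics (assumed rule)."""
    totals = [0.0] * len(topic_terms)
    for vector in doc_topic_vectors:
        for i, weight in enumerate(vector):
            totals[i] += weight
    best = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
    terms: List[str] = []
    for i in best[:top_topics]:
        for term in topic_terms[i][:terms_per_topic]:
            if term not in terms:  # avoid listing a term twice
                terms.append(term)
    return terms


topic_terms = [["battle", "army", "siege"],
               ["river", "valley", "lake"],
               ["church", "bishop", "abbey"]]
vectors = [[0.7, 0.1, 0.2], [0.5, 0.1, 0.4]]
suggestions = related_terms(vectors, topic_terms)
```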
5 Query scoring and implementation
For the prototype that is described later in Section 6.2, we created a space-time
grid index using the ElasticSearch indexing software, which is in turn built on
the open source Apache Lucene project [34,21]. By building off ElasticSearch, we
were able to use a mature code base for query parsing and fast parallel search.
For the index we implemented four new scoring models which are described in
the following sub-sections.
5.1 Space-time cell scoring
The space-time grid index uses an information-based model to score space-time
grid cells based on a query [15]. The retrieval function $RSV_c$ is defined in
Equation 1.
$$RSV_c(q, c) = \sum_{w \in q \cap c} -x^q_w \log\left(\frac{\lambda_w^{\frac{t^c_{\gamma w}}{t^c_{\gamma w} + 1}} - \lambda_w}{1 - \lambda_w}\right) \tag{1}$$
In the retrieval function, $x^q_w$ is a boosting factor for the word, $w$, in the query,
$q$. $t^c_{\gamma w}$ is a normalized version of the sum of occurrences of the word $w$ multiplied
by the $\gamma$ weighting in the chronotopic mapping to cell $c$. H2 term frequency
normalization is used: $tfn = tf \cdot \ln\left(1 + \frac{avg\_l}{l(c)}\right)$, which normalizes inversely to
$l(c)$, the number of words mapped to cell $c$ [7]. Let $\lambda_w = \frac{N_w}{N}$, where
$N$ is the number of grid cells indexed and $N_w$ is the number of cells where word
$w$ occurs. Thus, $\lambda_w$ is the fraction of grid cells in which the word $w$ occurs.
This normalization prevents locations and times that are over-represented in the
data set from dominating the search result rankings.
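To make the behaviour of the space-time cell score concrete, the following sketch evaluates it for a single cell in plain Python. It assumes the smoothed power-law form of the information-based model, with a leading minus sign so that scores are positive, and H2 normalization without an extra tuning constant. The function names `h2_normalize` and `rsv_cell`, their arguments, and the toy statistics are hypothetical; the prototype itself delegates scoring to ElasticSearch.

```python
import math
from typing import Dict, List, Optional


def h2_normalize(tf: float, cell_length: float, avg_length: float) -> float:
    """H2 normalization: tfn = tf * ln(1 + avg_l / l(c))."""
    return tf * math.log(1.0 + avg_length / cell_length)


def rsv_cell(query_terms: List[str], cell_tf: Dict[str, float],
             cell_length: float, avg_length: float,
             cells_with_term: Dict[str, int], n_cells: int,
             boost: Optional[Dict[str, float]] = None) -> float:
    """Score one space-time cell for a query (assumed smoothed power-law form)."""
    boost = boost or {}
    score = 0.0
    for w in query_terms:
        tf = cell_tf.get(w, 0.0)          # gamma-weighted occurrences of w in the cell
        n_w = cells_with_term.get(w, 0)   # number of cells containing w
        if tf <= 0 or n_w == 0 or n_w >= n_cells:
            continue  # word not in the cell, or lambda_w = 1 (log undefined)
        lam = n_w / n_cells
        t = h2_normalize(tf, cell_length, avg_length)
        prob = (lam ** (t / (t + 1.0)) - lam) / (1.0 - lam)
        score += -boost.get(w, 1.0) * math.log(prob)
    return score


df = {"roman": 100, "empire": 500}
tf = {"roman": 3.0, "empire": 3.0}
rare = rsv_cell(["roman"], tf, 200, 200, df, 1000)
common = rsv_cell(["empire"], tf, 200, 200, df, 1000)
```

With equal term frequencies, the word that occurs in fewer cells contributes the larger score, and a cell that does not contain a query word contributes nothing for it.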
5.2 Map cell scoring
The space-time cell score is a score for a specific place and time. A map cell score
is an aggregation of space-time cell scores based on a fixed spatial (hexagon) cell,
across one or more temporal units. Let $S^q_c$ be a set of space-time cell, score tuples
$\langle c, s \rangle$ based on $RSV_c(q, c)$, and $T$ be a set of temporal units corresponding to
cells in the temporal grid of the index (e.g., the range from 18th to 21st century,
or the years 1941 and 1942). The retrieval function $RSV_m$ for map cell, $g$, is
defined in Equation 2.
$$RSV_m(S^q_c, g) = \sum_{t \in T,\ t \in S^q_c} RSV_c(q, \langle g, t \rangle) \tag{2}$$
5.3 Timeline unit scoring
Similarly, the timeline unit score is calculated as an aggregation of space-time cell
scores based on a fixed timeline unit, across one or more spatial grid cells. Letting
$G$ be a set of spatial grid cells, the retrieval function $RSV_{tl}$ for a timeline unit,
$t$, is defined in Equation 3.
$$RSV_{tl}(S^q_c, t) = \sum_{g \in G,\ g \in S^q_c} RSV_c(q, \langle g, t \rangle) \tag{3}$$
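Both aggregations in Equations 2 and 3 reduce to grouped sums over the scored space-time cells. The sketch below assumes the scored cells arrive as a dictionary keyed by `(g, t)` pairs; the function names `map_cell_scores` and `timeline_unit_scores` and the toy scores are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Cell = Tuple[str, int]


def map_cell_scores(cell_scores: Dict[Cell, float],
                    selected_times: Set[int]) -> Dict[str, float]:
    """Sum space-time cell scores per spatial cell g over the selected times."""
    totals: Dict[str, float] = defaultdict(float)
    for (g, t), score in cell_scores.items():
        if t in selected_times:
            totals[g] += score
    return dict(totals)


def timeline_unit_scores(cell_scores: Dict[Cell, float],
                         selected_cells: Set[str]) -> Dict[int, float]:
    """Sum space-time cell scores per temporal unit t over the selected cells."""
    totals: Dict[int, float] = defaultdict(float)
    for (g, t), score in cell_scores.items():
        if g in selected_cells:
            totals[t] += score
    return dict(totals)


scores = {("a", 0): 2.0, ("a", 1): 1.0, ("b", 0): 0.5}
heatmap_all = map_cell_scores(scores, {0, 1})
heatmap_first = map_cell_scores(scores, {0})
bars_for_a = timeline_unit_scores(scores, {"a"})
```

Because both functions only re-sum an existing set of cell scores, a change of selection can be handled without issuing a new query.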
5.4 Document scoring
Relevance scores for document segments are calculated using a separate, stan-
dard inverted index of words to document segments as defined by the chronotopic
mappings. The retrieval function in Equation 4 filters based on a set of selected spatial grid
cells, $G$, and temporal grid cells, $T$, and the resulting scores are then aggregated
by document id to generate scores for individual documents. Let $RSV(q, p)$ be a
relevance score value for a query and document segment, $p$ (any scoring mechanism
can be used here, such as information-based, divergence from randomness,
or language modeling).
$$RSV_d(q, d, T, G) = \sum_{t \in T \cap d,\ g \in G \cap d,\ p \in d} RSV(q, p) \tag{4}$$
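The filter-then-aggregate step of Equation 4 might look like the sketch below. It makes a simplifying assumption that differs slightly from a literal reading of the sum: each matching document segment is counted once if any of its space-time cells falls inside the selection. The function name `document_scores` and the tuple layout of a segment are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Cell = Tuple[str, int]
Segment = Tuple[str, float, Set[Cell]]  # (document id, RSV(q, p), mapped cells)


def document_scores(segments: Iterable[Segment],
                    selected_times: Set[int],
                    selected_cells: Set[str]) -> List[Tuple[str, float]]:
    """Keep segments mapped to a selected <g, t> pair and sum their scores per document."""
    totals: Dict[str, float] = defaultdict(float)
    for doc_id, score, cells in segments:
        # Assumption: a segment counts once if any of its cells is selected.
        if any(g in selected_cells and t in selected_times for g, t in cells):
            totals[doc_id] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


segments = [
    ("d1", 1.0, {("a", 0)}),
    ("d1", 0.5, {("a", 1)}),
    ("d2", 2.0, {("b", 0)}),
    ("d3", 4.0, {("c", 5)}),
]
top_k = document_scores(segments, {0, 1}, {"a", "b"})
```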
6 Search user interface
In this section we introduce a set of chronotopic interaction view dependencies
based on the holistic presentation of spatial, temporal, and textual information.
We follow with a description of an implemented prototype of one of the view
dependencies.
6.1 Chronotopic interaction paradigms
There are effectively three dimensions of information that define the state of our
chronotopic search user interface: keyword input, temporal selection, and spatial
selection. These inputs work in tandem to set the views for three
main components of the search user interface: the map (M), the timeline (T),
and the document search results (K).
The map is an interactive web map that consists of three main components.
The first is a base map that shows the geographic frame of reference, which helps
to both contextualize the search in space and understand better the geographic
distribution of the search over space. The second is a hexagonal grid that repre-
sents the spatial grid cells that match the current search. The third is a heatmap
overlay derived from a score for each cell.
The timeline is an interactive timeline that shows a bar graph that corre-
sponds to the scores for the temporal grid cells that match the current search.
It also has interaction tools to allow the user to zoom the timeline (i.e., switch
between century and year granularity).
The document search result is a top-k list of documents for the current
search (Wikipedia articles in the prototype). Additional information such as
related searches can be found here as well.
These three views have dependencies, which influence the input options in the
other views and thus the type of visual information seeking that we want the user
to perform. In other words, these view dependencies are different model-based
presentations of navigational cues for users to efficiently discover and explore
knowledge in the web corpus [19]. The choice of dependency dictates what view
operates to create an overview of search results, and where the other views allow
the user to hierarchically explore contextual details on demand. For example, a
selection on the timeline can change the view on the map, or a selection on the
map can change the view on the document search result.
K:M, K:T—The top-k results shown in the document search result are based
purely on the document index. Selection events on the search results update the
map and the timeline independently according to the places and dates that are
referenced within the selected document. The information seeking behavior that
this dependency provides is to give the user an understanding of the temporal
and spatial context of the top-k results. Thus, it allows the user to learn about
spatial and temporal content of specific documents, but not at the corpus-level.
M:K, T:K—The map and timeline serve to provide an overview of the
results for a keyword search, and selection events on both the map and the
timeline affect the top-k results shown in the document search result. However,
the map does not affect the timeline view, or vice-versa. This approach uses the
map and timeline to provide a visual overview of how the keyword is represented
across space and time within the full corpus.
The following two variants of M:K, T:K introduce dependencies between the
map and timeline views, which allows the user to use these views to successively
refine the search and drill down to detailed results.
T:M:K—Selections on the timeline affect the map display, and the combi-
nation of selections on both the timeline and the map change the top-k results.
M:T:K—Similar to T:M:K, but in this case the selection on the map updates
the timeline view.
All of these view dependencies can be supported by the same space-time grid
index using different aggregated map cell and timeline unit scores.
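The four view dependencies described above can be written down as small directed graphs that record which views must be refreshed when a selection is made in another view. The table and the function name `views_to_refresh` below are a hypothetical encoding for illustration, using the M, T, and K labels from the text.

```python
from typing import Dict, List

# For each dependency model: selected view -> views that depend on it.
DEPENDENCIES: Dict[str, Dict[str, List[str]]] = {
    "K:M, K:T": {"K": ["M", "T"]},              # results drive map and timeline
    "M:K, T:K": {"M": ["K"], "T": ["K"]},       # map and timeline filter results
    "T:M:K": {"T": ["M", "K"], "M": ["K"]},     # timeline also updates the map
    "M:T:K": {"M": ["T", "K"], "T": ["K"]},     # map also updates the timeline
}


def views_to_refresh(model: str, selected_view: str) -> List[str]:
    """Return the views to recompute after a selection in `selected_view`."""
    return DEPENDENCIES[model].get(selected_view, [])


after_timeline_selection = views_to_refresh("T:M:K", "T")
after_map_selection = views_to_refresh("T:M:K", "M")
```

Under this encoding, switching dependency models changes only the table lookup, which reflects the point that the same index can serve all of them.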
6.2 Prototype
Pteraform is a chronotopic search engine prototype developed using space-time
grid and document indexes of the English Wikipedia, and utilizes a T:M:K view
dependency model. The indexes exist on a web server and are used to generate
the space-time cell scores and document scores at query time, via a web socket
connection from the browser client. The aggregate map cell and timeline scores
are generated on the client, which enables real-time interactivity.
Figures 2–7 show the basic layout of the Pteraform system with the following
main components: 1) a search box at the top for ad hoc queries, 2) a dynamic map
view, 3) a timeline representation, and 4) the top-k results window showing the
most relevant documents given the current state of the system. Four versions
of the search query roman empire are shown based on different states. The
results shown in these figures are built from the top-50,000 space-time grid cells
(hexagonal aperture 4 level 8 DGGS) that match the query. The total number
of unique space-time grid cell pairs that exist in the index (and the resulting
index size) depends on the spatial and temporal granularity and the density of
spatial and temporal references within the documents. For example, for the century
granularity, the index consists of 375,026 space-time grid cell pairs (29.4 GB on
disk) and 62,915,903 document segments are indexed for document scoring (35.5
GB).
Fig. 2. The heatmap shows a geographic overview of “Roman empire” references in
Wikipedia without any filter on dates (spanning from 3000 BCE to present). The
timeline shows a temporal overview of the same references. The document results (shown
in the upper right) are based on the user’s map selection on the Balkan coast.
Fig. 3. The green circle overlay on the heatmap shows all “Roman empire” locations
(i.e., grid cells) that also contain a reference to a date in the 1st century. The user
makes this selection by hovering the mouse over the 1st century bar in the timeline.
Fig. 4. After making a selection on the 1st century (by clicking and dragging) the
heatmap is updated to reflect only those locations that contain a reference to a date
in the 1st century. The document results are likewise updated.
Fig. 5. Zooming into a local area shows the underlying hexagonal grid cells.
Fig. 6. Clicking and selecting a grid cell renders a set of related terms overlaid on the
map. The terms are derived from the topics associated with the 6 document results.
These terms reflect the intersection of the keyword (“battle”), the selected location (off
the coast of Newcastle, England), and date selection (8th to 12th century).
Fig. 7. By changing the date selection to 18th to 21st century, 22 documents are
returned and the related terms overlaid on the map change.
In Figure 2 the timeline shows the timeline unit scores in a bar graph for-
mat. The default view shows a range of centuries from the 10th century BCE
to the 21st century. The map view is dependent on the timeline, so the map
cell scores for all hexagons in the view window are based on aggregated values
across all the centuries. The map cell scores are visualized with a density surface
(a.k.a. heatmap) to give a geographic overview of “roman empire” references
in Wikipedia. Map interaction then allows the user to select a single hexagonal
grid cell (in this case on the Balkan coast). The top-k results shown in the docu-
ment search result are based on the relevance scores given the timeline selection
and map selection. 50 results are shown (the maximum number in the current
version). The top result is an article about Julius Nepos, the ruler of Roman
Dalmatia in the 5th century.
Following the visual information seeking mantra (“overview first, zoom and
filter, then details on demand”) [46], interaction with the timeline creates real-
time feedback for the user. Moving the mouse over the timeline bars highlights on
the map which hexagonal cells have references to the time unit. Figure 3 shows
the result of the user highlighting the 1st century with green circles overlaying
the heatmap. This allows the user to compare the geographic distribution at the
highlighted time unit to the overall distribution from the selected time range.
This is quick visual feedback for the user but highlighting a time unit does not
alter the top-k results.
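The hover behaviour can be sketched as a simple lookup over the space-time cell scores already held on the client. The data layout and the function name below are hypothetical, and the sketch deliberately leaves the selection state (and hence the top-k results) untouched.

```python
def highlighted_cells(scores, hovered_unit):
    """Return the map cells with at least one reference to the hovered time unit."""
    return {
        cell
        for (cell, unit), score in scores.items()
        if unit == hovered_unit and score > 0
    }


# Hypothetical toy data: (map cell id, century) -> score for the current query.
toy_scores = {(10, 1): 0.5, (11, 2): 0.7, (12, 1): 0.0}
cells_to_circle = highlighted_cells(toy_scores, hovered_unit=1)
```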
However, if the user chooses to select a subset of the timeline (e.g., 1st century
as shown in Figure 4) then the heatmap is updated and becomes less spatially
distributed, and subsequently the document results are updated, with only
15 results now shown. Because the 1st century is selected, the top article becomes one
on Illyria, a geographic region in the Balkans from antiquity, and the article on
Julius Nepos does not appear as it is no longer relevant. The map visualization
also updates based on zooming in and out. Discrete grid cells replace the heatmap
as the user zooms to finer grained resolution (Figure 5).
The view dependency T:M:K presents a hierarchical model for visual orga-
nization of the search results, and it is possible to overlay additional views on the
display. Figures 6 and 7 show how related search terms based on the document
search results overlay the map view as the user zooms in on individual hexagon
cells. Here the related searches derived from topic modeling are represented as
a word-cloud centered on the hexagon cell [11].
6.3 Heuristic evaluation
For new kinds of complex user interfaces that are not directly comparable to
existing technologies, a heuristic evaluation of the kinds of situations, tasks, and
users that the system supports can be preferable to a traditional usability study
performed in a controlled setting [37]. The value of search systems that use
chronotopic information interactions will only truly become apparent through
observation of the complex interactions and learning processes that such systems
foster in real-world use cases. Here we focus on highlighting the strengths of our
system and what differentiates it from traditional search.
Two types of users benefit from the chronotopic
interaction-based design. The first are “developers”, among whom we also
count information architects, such as librarians, who want to organize large
collections of unstructured information. By conceptualizing the information in
terms of how it will be presented along map, timeline, and more thematic di-
mensions and indexing it appropriately, developers are able to develop general-
purpose tools for history that can flexibly index and present the collections that
they possess. The second group are the “end-users” such as students and re-
searchers who want to learn about geographic and historical information. The
importance of designing search interfaces to support historical learning has been
detailed above. Yet, no existing general-purpose search interfaces inherently sup-
port historical research tasks. Discovering evidence of continuity and change is
one key aspect of historical thinking [44]. This can be supported by mechanisms
that allow for the quick comparison between places and dates through the lens
of a keyword search, something that is easy with a system like Pteraform. The
system is generalizable to any kind of historical or geographic scope as well as
any kind of raw textual information that might be of interest.
Although we have chosen to implement one prototype of a system with
chronotopic information interaction, the space-time grid index data structure
is highly flexible in that it supports a number of different ways to present the
information and build view dependencies. Thus, the building of one index can
allow the design of multiple front-end solutions that support different kinds of
users performing different kinds of tasks. For example, it would be possible to
extend the Pteraform system to allow the user to select a different view de-
pendency. Likewise, it is extensible in that other kinds of semantic information
can be used to provide additional facets on the search results. In addition, the
map and timeline views provide a background canvas for overlaying all kinds of
additional thematic information.
Finally, a system using map and timeline-based visualization has an expres-
sive match to historical research tasks. Users who bring contextual background
knowledge about history or geography can leverage the user interface to hone
their search parameters. Thus, a user who is familiar with the system will be
able to use the chronotopic features of the system to discover relevant
information more efficiently. Further work on understanding how chronotopic search
is used during real-world research tasks will help us to refine the best ways of
designing the map and timeline based elements of the system.
7 Conclusion
Millions of primary and secondary sources exist online for historical research.
The indexing of these sources using an explicit spatio-temporal model could
revolutionize how people discover information and learn about the past. In this
paper we presented a novel spatio-temporal grid cell index data structure that
can be used to develop a variety of different kinds of geo-historical search engines.
We demonstrated how this index can be used to develop a search user interface
that supports chronotopic information interaction, a technique for using the
integrated temporal and spatial structure in a corpus to interactively navigate
and explore the document space.
Exploratory search systems can be difficult to evaluate because traditional
information retrieval metrics do not measure the kind of performance that ex-
ploratory search systems are meant to optimize, and task or goal-based usability
evaluations must be carefully constructed to reflect appropriate proxy measures
for learning in complex environments [9,41]. In this paper we used an imple-
mented demonstration prototype along with a heuristic evaluation based on the
principles from Olsen [37] to show the potential of the chronotopic information
interaction paradigm.
Future work on chronotopic information interaction will first and foremost fo-
cus on developing appropriate evaluation methods, including how to measure the
efficacy of such a system to support research tasks as well as critical and creative
learning. Furthermore, the development of finer-grained chronotopic mappings
between web page content and temporal and geographic entities is dependent on
innovation in natural language processing methods such as extraction of event
entities, co-reference resolution, and narrative analysis. We will explore making
chronotopic mapping methods more robust across a variety of web sources beyond
Wikipedia, toward the end goal of making chronotopic information interaction
possible across large web collections, digital libraries, and other online cultural
resources. Developing infrastructure to ingest structured linked open data into a
chronotopic indexing pipeline would help streamline the development of bespoke
applications based on published data. In addition, although our motivation has
been indexing full text document collections, chronotopic information interac-
tion could be applied to other kinds of scientific and humanities data sets that
contain references to places and dates, and have unstructured text fields—the
interaction of these three dimensions is pervasive in data. For example, the view
dependencies that we describe could be used to design search engines for brows-
ing the relationships in geographic and historical linked open data.
Acknowledgments
I would like to thank the anonymous reviewers for their insightful reviews, which
helped to improve the content of this paper. The New Zealand eScience Infrastruc-
ture (NeSI) high performance computing system was used in the pre-processing
stage of this research.
References
1. Adams, B.: Wāhi, a discrete global grid gazetteer built using linked open data.
International journal of digital earth 10(5), 490–503 (2017)
2. Adams, B., Gahegan, M.: Exploratory chronotopic data analysis. In: The An-
nual International Conference on Geographic Information Science. pp. 243–258.
Springer (2016)
3. Adams, B., Janowicz, K.: On the geo-indicativeness of non-georeferenced text. In:
Sixth International AAAI Conference on Weblogs and Social Media. pp. 375–378
(2012)
4. Adams, B., McKenzie, G., Gahegan, M.: Frankenplace: interactive thematic map-
ping for ad hoc exploratory search. In: Proceedings of the 24th International Con-
ference on World Wide Web. pp. 12–22. International World Wide Web Conferences
Steering Committee (2015)
5. Alonso, O., Gertz, M., Baeza-Yates, R.: On the value of temporal information in
information retrieval. In: ACM SIGIR Forum. vol. 41, pp. 35–41. ACM New York,
NY, USA (2007)
6. Alonso, O., Gertz, M., Baeza-Yates, R.: Clustering and exploring search results
using timeline constructions. In: Proceedings of the 18th ACM conference on In-
formation and knowledge management. pp. 97–106. ACM (2009)
7. Amati, G., Van Rijsbergen, C.J.: Probabilistic models of information retrieval
based on measuring the divergence from randomness. ACM Transactions on Infor-
mation Systems (TOIS) 20(4), 357–389 (2002)
8. Angeli, C., Tsaggari, A.: Examining the effects of learning in dyads with computer-
based multimedia on third-grade students’ performance in history. Computers &
Education 92, 171–180 (2016)
9. Athukorala, K., Głowacka, D., Jacucci, G., Oulasvirta, A., Vreeken, J.: Is ex-
ploratory search different? a comparison of information search behavior for ex-
ploratory and lookup tasks. Journal of the Association for Information Science
and Technology 67(11), 2635–2651 (2016)
10. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia:
A nucleus for a web of open data. In: The Semantic Web, pp. 722–735. Springer
(2007)
11. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of machine
Learning research 3(Jan), 993–1022 (2003)
12. Chasin, R., Woodward, D., Witmer, J., Kalita, J.: Extracting and displaying tem-
poral and geospatial entities from articles on historical events. The Computer Jour-
nal 57(3), 403–426 (2013)
13. Chen, L., Cong, G., Jensen, C.S., Wu, D.: Spatial keyword query processing: an
experimental evaluation. In: Proceedings of the VLDB Endowment. vol. 6, pp.
217–228. VLDB Endowment (2013)
14. Chen, Y.Y., Suel, T., Markowetz, A.: Efficient query processing in geographic web
search engines. In: Proceedings of the 2006 ACM SIGMOD international conference
on Management of data. pp. 277–288. ACM (2006)
15. Clinchant, S., Gaussier, E.: Information-based models for ad hoc IR. In: Proceed-
ings of the 33rd international ACM SIGIR conference on Research and development
in information retrieval. pp. 234–241. ACM (2010)
16. D’Ignazio, C., Bhargava, R., Zuckerman, E., Beck, L.: CLIFF-CLAVIN: Deter-
mining geographic focus for news articles. In: NewsKDD: Data Science for News
Publishing, at KDD 2014 (2014)
17. Erxleben, F., Günther, M., Krötzsch, M., Mendez, J., Vrandečić, D.: Introducing
Wikidata to the linked data web. In: International Semantic Web Conference. pp.
50–65. Springer (2014)
18. Finnegan, D.A.: The spatial turn: Geographical approaches in the history of sci-
ence. Journal of the History of Biology 41(2), 369–388 (2008)
19. Fu, W.T., Kannampallil, T.G., Kang, R.: Facilitating exploratory search by model-
based navigational cues. In: Proceedings of the 15th international conference on
Intelligent user interfaces. pp. 199–208. ACM (2010)
20. Galton, A.: Fields and objects in space, time, and space-time. Spatial cognition
and computation 4(1), 39–68 (2004)
21. Gormley, C., Tong, Z.: Elasticsearch: the definitive guide: a distributed real-time
search and analytics engine. O’Reilly Media, Inc. (2015)
22. Grover, C., Tobin, R., Byrne, K., Woollard, M., Reid, J., Dunn, S., Ball, J.: Use of
the Edinburgh geoparser for georeferencing digitized historical collections. Philo-
sophical Transactions of the Royal Society A: Mathematical, Physical and Engi-
neering Sciences 368(1925), 3875–3889 (2010)
23. Isaksen, L., Simon, R., Barker, E.T., de Soto Cañamares, P.: Pelagios and the
emerging graph of ancient world data. In: Proceedings of the 2014 ACM conference
on Web science. pp. 197–201. ACM (2014)
24. Jones, C.B., Purves, R., Ruas, A., Sanderson, M., Sester, M., Van Kreveld, M.,
Weibel, R.: Spatial information retrieval and geographical ontologies an overview
of the SPIRIT project. In: Proceedings of the 25th annual international ACM
SIGIR conference on Research and development in information retrieval. pp. 387–
388. ACM (2002)
25. Jones, C.B., Purves, R.S.: Geographical information retrieval. International Jour-
nal of Geographical Information Science 22(3), 219–228 (2008)
26. Karimzadeh, M., Pezanowski, S., MacEachren, A.M., Wallgrün, J.O.: GeoTxt: A
scalable geoparsing system for unstructured text geolocation. Transactions in GIS
23(1), 118–136 (2019)
27. Kinsella, S., Murdock, V., O’Hare, N.: I’m eating a sandwich in Glasgow: modeling
locations with tweets. In: Proceedings of the 3rd international workshop on Search
and mining user-generated contents. pp. 61–68. ACM (2011)
28. Lafia, S., Jablonski, J., Kuhn, W., Cooley, S., Medrano, F.A.: Spatial discovery
and the research library. Transactions in GIS 20(3), 399–412 (2016)
29. Larson, R.R.: Geographic information retrieval and spatial browsing. In: Smith,
Gluck, M. (eds.) Geographic Information Systems and Libraries: Patrons and Maps
and Spatial Information. pp. 81–124 (1996)
30. Lévesque, S.: Thinking historically: Educating students for the twenty-first century.
University of Toronto Press (2008)
31. Lieberman, M.D., Samet, H., Sankaranarayanan, J., Sperling, J.: STEWARD: ar-
chitecture of a spatio-textual search engine. In: Proceedings of the 15th annual
ACM international symposium on Advances in geographic information systems.
p. 25. ACM (2007)
32. Mata, F., Claramunt, C.: GeoST: geographic, thematic and temporal information
retrieval from heterogeneous web data sources. Web and Wireless Geographical
Information Systems pp. 5–20 (2011)
33. McCallum, A.K.: Mallet: A machine learning for language toolkit. http://mallet.
cs.umass.edu (2002)
34. McCandless, M., Hatcher, E., Gospodnetić, O.: Lucene in action,
vol. 2. Manning Greenwich (2010)
35. Melo, F., Martins, B.: Automated geocoding of textual documents: A
survey of current approaches. Transactions in GIS 21(1), 3–38 (2017).
https://doi.org/10.1111/tgis.12212
36. Obe, R., Hsu, L.: PostGIS in action. GEOInformatics 14(8), 30 (2011)
37. Olsen Jr, D.R.: Evaluating user interface systems research. In: Proceedings of the
20th annual ACM symposium on User interface software and technology. pp. 251–
258 (2007)
38. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking:
Bringing order to the web. Tech. rep., Stanford InfoLab (1999)
39. Pfoser, D., Efentakis, A., Hadzilacos, T., Karagiorgou, S., Vasiliou, G.: Providing
universal access to history textbooks: a modified GIS case. Web and Wireless
Geographical Information Systems pp. 87–102 (2009)
40. Pustejovsky, J., Castano, J.M., Ingria, R., Sauri, R., Gaizauskas, R.J., Setzer,
A., Katz, G., Radev, D.R.: TimeML: Robust specification of event and temporal
expressions in text. New directions in question answering 3, 28–34 (2003)
41. Rieh, S.Y., Collins-Thompson, K., Hansen, P., Lee, H.J.: Towards searching as a
learning process: A review of current perspectives and future directions. Journal
of Information Science 42(1), 19–34 (2016)
42. Sahr, K., White, D., Kimerling, A.J.: Geodesic discrete global grid systems. Car-
tography and Geographic Information Science 30(2), 121–134 (2003)
43. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing.
Communications of the ACM 18(11), 613–620 (1975)
44. Seixas, P., Morton, T., Colyer, J., Fornazzari, S.: The big six: Historical thinking
concepts. Nelson Education (2013)
45. Serdyukov, P., Murdock, V., Van Zwol, R.: Placing Flickr photos on a map. In:
Proceedings of the 32nd international ACM SIGIR conference on Research and
development in information retrieval. pp. 484–491. ACM (2009)
46. Shneiderman, B.: The eyes have it: A task by data type taxonomy for information
visualizations. In: Proceedings 1996 IEEE symposium on visual languages. pp.
336–343. IEEE (1996)
47. Simon, R., Barker, E., Isaksen, L., de Soto Cañamares, P.: Linking early geospatial
documents, one place at a time: annotation of geographic documents with recogito.
e-Perimetron 10(2), 49–59 (2015)
48. Strötgen, J., Gertz, M.: HeidelTime: High quality rule-based extraction and nor-
malization of temporal expressions. In: Proceedings of the 5th International Work-
shop on Semantic Evaluation. pp. 321–324. Association for Computational Lin-
guistics (2010)
49. Strötgen, J., Gertz, M.: TimeTrails: a system for exploring spatio-temporal infor-
mation in documents. Proceedings of the VLDB Endowment 3(1-2), 1569–1572
(2010)
50. Strötgen, J., Gertz, M., Popov, P.: Extraction and exploration of spatio-temporal
information in documents. In: Proceedings of the 6th Workshop on Geographic
Information Retrieval. p. 16. ACM (2010)
51. Swan, K., Locascio, D.: Evaluating alignment of technology and primary source
use within a history classroom. Contemporary Issues in Technology and Teacher
Education 8(2), 175–186 (2008)
52. Wineburg, S.: Historical thinking and other unnatural acts. Phi delta kappan 92(4),
81–94 (2010)
53. Wing, B.P., Baldridge, J.: Simple supervised document geolocation with geodesic
grids. In: Proceedings of the 49th annual meeting of the association for computa-
tional linguistics: Human language technologies-volume 1. pp. 955–964. Association
for Computational Linguistics (2011)
54. Withers, C.W.: Place and the “spatial turn” in geography and in history. Journal
of the History of Ideas 70(4), 637–658 (2009)
8 Appendix: Database tables
The pre-processed Wikipedia data that was used to build the Pteraform in-
dexes can be downloaded as a database dump from
https://www.dropbox.com/s/4z22w14fajzkma3/pteraform-enwiki-data-20200730.sql?dl=0.
Note that the file size is 15.77 GB. This appendix describes the schema for each table.
8.1 Pages table
This table contains the primary key (page id) for each article in the English
Wikipedia (7,955,791 rows). This is data that was extracted directly from the
Wikipedia dump file. In addition, the pagerank for each article has been calcu-
lated in the context of the Wikipedia article graph. The following columns are
defined:
page id (integer) – primary key and Wikipedia page id
original length (integer) – length of the article before markup removed
new length (integer) – length of the article after markup removed
stub (integer) – 1 if a stub article, otherwise 0
disambig (integer) – 1 if a disambiguation article, otherwise 0
category (integer) – 1 if a category page, otherwise 0
image (integer) – 1 if an image resource, otherwise 0
title (text) – page title
categories (integer[]) – list of category page ids that this article belongs to
links (integer[]) – list of page ids linked from this article
related (integer[]) – list of page ids for related articles
external links (integer[]) – list of external links from this article
pagerank (double precision) – PageRank of the article in the Wikipedia
article graph
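To make the pagerank column more concrete, the following is a minimal power-iteration sketch over a link structure shaped like the links column (page id mapped to the page ids it links to). The damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without outgoing links are assumptions; the settings used to compute the stored values are not stated here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over {page_id: [linked page_ids]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly (assumption).
        dangling = sum(rank[page] for page in pages if not links.get(page))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {page: base for page in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph: page 1 links to 2 and 3, 2 links to 3, 3 links to 1.
toy_links = {1: [2, 3], 2: [3], 3: [1]}
toy_rank = pagerank(toy_links)
```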
8.2 Sections table
This table contains all the header information for each of the pages in the
database (34,478,680 rows). The header level is 0 if it is the abstract for the
article. The following columns are defined.
id (integer) – primary key
page id (integer) – page id of the article containing this section
section title (text) – section title (blank if abstract)
header level (integer) – header level of the section (increases for each sub-
section)
8.3 Paragraphs table
This table contains the text for each article divided into paragraphs (one for each
row, totalling 97,352,848 rows). The ids for the paragraphs of a given page (or
section) are in the same order as in the original text. The words field contains
the text for the paragraph, including link information to other Wikipedia pages
which are specified by <a> tags. The place links column is an integer array
containing the page ids for all places that are explicitly linked in the paragraph.
The stripped words column contains the words but stripped of punctuation
and link tags. The temporal words column shows the output of the HeidelTime
tagger on the words as a textual representation of a JSON array of XML snippets
using the TIMEX3 format [40]. This output has been further distilled into
year refs (integer array), decade refs (integer array), century refs (integer
array), year month refs (text array), and year month day refs
(text array). The format for these is described in more detail below. The
pct total words records what proportion of the total number of words for an
article are found in this paragraph.
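A reader working with the dump might extract the normalized values from the temporal words column along the following lines. This is a sketch under stated assumptions: it assumes that each element of the JSON array is a single well-formed TIMEX3 element, and the two sample snippets are invented for illustration rather than taken from the data.

```python
import json
import xml.etree.ElementTree as ET


def timex3_values(temporal_words):
    """Return (type, value) pairs from a JSON array of TIMEX3 XML snippets."""
    results = []
    for snippet in json.loads(temporal_words):
        element = ET.fromstring(snippet)
        if element.tag == "TIMEX3":
            results.append((element.get("type"), element.get("value")))
    return results


# Invented sample in the assumed layout (not copied from the database dump).
sample = json.dumps([
    '<TIMEX3 tid="t1" type="DATE" value="1978-05-15">May 15, 1978</TIMEX3>',
    '<TIMEX3 tid="t2" type="DATE" value="197">the 1970s</TIMEX3>',
])
pairs = timex3_values(sample)
```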
The century references are integers from -99 to 25. 20 stands for 20th cen-
tury, 19 for 19th, and so on. Negative numbers refer to BCE centuries. Decade
references are three-digit numbers such as 197, for the 1970s. All references are
collapsed so that a reference to the day May 15, 1978 will be represented as
“1978-05-15” in year month day refs, “1978-05” in year month refs, 1978 in
year refs, 197 in decade refs, and 19 in century refs.
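The collapsing of a full date into coarser references can be sketched as follows. The underscored field names are assumed spellings of the column names, only CE dates are handled, and the century value is computed by integer division of the year by 100 so that it reproduces the 19 given in the worked example; if the column instead stores ordinal centuries, one would need to be added.

```python
from datetime import date


def collapse_date(d):
    """Collapse a CE calendar date into the coarser reference fields."""
    return {
        "year_month_day_refs": d.isoformat(),
        "year_month_refs": f"{d.year:04d}-{d.month:02d}",
        "year_refs": d.year,
        "decade_refs": d.year // 10,
        # Assumption: follows the worked example (1978 -> 19).
        "century_refs": d.year // 100,
    }


refs = collapse_date(date(1978, 5, 15))
```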
8.4 Discrete global grid cell tables
Each level of a hexagonal discrete global grid system is stored as an individual
table in the database. The geometry of the cells is defined using the ISEA
aperture 4 hexagonal tessellation and stored using the geography PostGIS type
in the table. The table names have the format isea4h*, where the * indicates
the resolution level in the hierarchy. Thus, the grid cells at level
6 are stored in the isea4h6 table. The table has two columns: gid (primary
key) and geog (geography(Polygon, 4326)). To recover the GeoJSON form of
each hexagonal cell one can execute a query similar to the following: SELECT
ST_AsGeoJSON(geog) FROM hexgrid.isea4h6 LIMIT 1;.
A similar set of tables exists (isea4h*p) with the same ids for each hexagon
and containing the geometry for the centroid of each hexagon in the geog column
(geography(Point,4326)).
8.5 Cell mappings tables
Each page in Wikipedia associated with a geographic place (with spatial coor-
dinates) has been mapped to corresponding grid cell ids based on the data from
the Wāhi discrete global grid gazetteer [1]. A table exists for each level in the
global grid system in the form place page hexgrid isea4h*. The table has two
columns: page id (integer primary key) and gids (integer[]), an integer array of
grid ids that define the shape of the place.
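Combining the mapping tables with the grid cell tables, a query for the cell geometries of a given place could be assembled as in the sketch below. The underscored table and column names, the placement of the mapping tables in the hexgrid schema, and the %s placeholder style (as used by DB-API drivers such as psycopg2) are assumptions and should be checked against the actual dump.

```python
def cells_for_page_query(level, schema="hexgrid"):
    """Build a parameterized SQL query for the grid cells of one place page."""
    if type(level) is not int or level < 0:
        raise ValueError("level must be a non-negative integer")
    grid = f"{schema}.isea4h{level}"
    # Assumed table name and schema for the cell mappings at this level.
    mapping = f"{schema}.place_page_hexgrid_isea4h{level}"
    return (
        f"SELECT g.gid, ST_AsGeoJSON(g.geog) "
        f"FROM {mapping} AS m "
        f"JOIN {grid} AS g ON g.gid = ANY(m.gids) "
        f"WHERE m.page_id = %s;"
    )


query = cells_for_page_query(6)
```

The page id would be passed separately as a query parameter rather than interpolated into the string.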
... ITHBase and IHBase appeared in earlier versions of HBase, and they have been extended from the source code level, but the implementation effect is not satisfactory and it is no longer updated. The third-party independent engines mainly include ElasticSearch [10] and Solr [11], both of which are full-text search servers based on Lucene [12]. The implementation method is to build an index in Elas-ticSearch or Solr from the indexed data in the data table. ...
Article
Full-text available
With the rapid development of the Internet of Things and cloud computing, HBase has become a good choice for massive data storage, and is efficient in reading and writing data. However, HBase is not supportive for multi-dimensional query of non-rowkey data, unconducive to data analysis and processing. To address this issue, we first analyze the constitution principle and deficiency of secondary index and clustering index, and select clustering index as the basis of optimization. Then, we choose the Hilbert curve in the space filling curve as the linearization technology, design the pre-partition algorithm and subspace partition algorithm, and realize the Hilbert-curve-based clustering index (HCIndex) which supports multi-dimensional point query and range query. Finally, the performance of HCIndex is verified by comparison experiments with HBase Scan, HiBase and CCIndex. The experimental results show that the query efficiency of HCIndex has been greatly improved at the expense of very limited storage space, which is necessary for storing index data and only 1.7 times the size of the original data table of HBase. Compared with HBase scan, the query efficiency of HCIndex’s multi-dimensional point query and range query has been increased to more than 4 times and more than 2 times, respectively. Therefore, the proposed HCIndex is well suited for efficient multi-dimensional and complex queries of massive data in cloud storage systems.
... Chronotopic information interaction is a design paradigm that uses the inherent spatio-temporal structure found in a heterogeneous document collection to support information seeking behavior. This structure can be derived from document metadata as well as the references to places and dates within the text of the documents [Adams, 2020]. A search engine that uses this form of interaction visually emplaces search results within an integrated geographical and temporal frame of reference, which provides context to explore and discover information. ...
Preprint
Full-text available
The vast amount of research produced at institutions world-wide is extremely diverse, and coarse-grained quantitative measures of impact often obscure the individual contributions of these institutions to specific research fields and topics. We show that by applying an information retrieval model to index research articles which are faceted by institution and time, we can develop tools to rank institutions given a keyword query. We present an interactive atlas, Quoka, designed to enable a user to explore these rankings contextually by geography and over time. Through a set of use cases we demonstrate that the atlas can be used to perform sensemaking tasks to learn and collect information about the relationships between institutions and scholarly knowledge production.
Article
Full-text available
Wikidata has been widely used in Digital Humanities (DH) projects. However, a focused discussion regarding the current status, potential, and challenges of its application in the field is still lacking. A systematic review was conducted to identify and evaluate how DH projects perceive and utilize Wikidata, as well as its potential and challenges as demonstrated through use. This research concludes that: (1) Wikidata is understood in the DH projects as a content provider, a platform, and a technology stack; (2) it is commonly implemented for annotation and enrichment, metadata curation, knowledge modelling, and Named Entity Recognition (NER); (3) Most projects tend to consume data from Wikidata, whereas there is more potential to utilize it as a platform and a technology stack to publish data on Wikidata or to create an ecosystem of data exchange; and (4) Projects face two types of challenges: technical issues in the implementations and concerns with Wikidata’s data quality. In the discussion, this article contributes to addressing three issues related to coping with the challenges in the specific context of the DH field based on the research findings: the relevance and authority of other available domain sources; domain communities and their practices; and workflow design that coordinates technical and labour resources from projects and Wikidata.
Article
Due to their historical nature, humanistic data encompass multiple sources of uncertainty. While humanists are accustomed to handling such uncertainty with their established methods, they are cautious of visualizations that appear overly objective and fail to communicate this uncertainty. To design more trustworthy visualizations for humanistic research, therefore, a deeper understanding of its relation to uncertainty is needed. We systematically reviewed 126 publications from digital humanities literature that use visualization as part of their research process, and examined how uncertainty was handled and represented in their visualizations. Crossing these dimensions with the visualization type and use, we identified that uncertainty originated from multiple steps in the research process from the source artifacts to their datafication. We also noted how besides known uncertainty coping strategies, such as excluding data and evaluating its effects, humanists also embraced uncertainty as a separate dimension important to retain. By mapping how the visualizations encoded uncertainty, we identified four approaches that varied in terms of explicitness and customization. This work contributes with two empirical taxonomies of uncertainty and it's corresponding coping strategies, as well as with the foundation of a research agenda for uncertainty visualization in the digital humanities. Our findings further the synergy among humanists and visualization researchers, and ultimately contribute to the development of more trustworthy, uncertainty-aware visualizations.
Chapter
Full-text available
Natural Language Processing (NLP) has experienced explosive growth in recent years. While the field has been around for decades, recent advances in NLP techniques as well as advanced computational resources have re-engaged academics, industry, and the general public. The field of Geographic Information Science has played a small but important role in the growth of this domain. Combining NLP techniques with existing geographic methodologies and knowledge has contributed substantially to many geospatial applications currently in use today. In this entry, we provide an overview of current application areas for natural language processing in GIScience. We provide some examples and discuss some of the challenges in this area.
Article
In this article we present GeoTxt, a scalable geoparsing system for the recognition and geolocation of place names in unstructured text. GeoTxt offers six named entity recognition (NER) algorithms for place name recognition, and utilizes an enterprise search engine for the indexing, ranking, and retrieval of toponyms, enabling scalable geoparsing for streaming text. GeoTxt offers a flexible application programming interface (API), allowing for customized attribute and/or spatial ranking of retrieved toponyms. We evaluate the system on a corpus of manually geo‐annotated tweets. First, we benchmark the performance of the six NERs that GeoTxt provides access to. Second, we assess GeoTxt toponym resolution accuracy incrementally, demonstrating improvements in toponym resolution achieved (or not achieved) by adding specific heuristics and disambiguation methods. Compared to using the GeoNames web service, GeoTxt's toponym resolution demonstrates a 20% accuracy gain. Our results show that places mentioned in the same tweet do not tend to be geographically proximate.
Article
Discrete global grid systems have become an important component of Digital Earth systems. However, previously there has not existed an easy way to map between named places (toponyms) and the cells of a discrete global grid system. The lack of such a tool has limited the opportunities to synthesize social place-based data with the more standard Earth and environmental science data currently being analyzed in Digital Earth applications. This paper introduces Wāhi, the first gazetteer to map entities from the GeoNames database to multiple discrete global grid systems. A gazetteer service is presented that exposes the grid system and the associated gazetteer data as Linked Data. A set of use cases for the discrete global grid gazetteer is discussed.
Article
Academic libraries have always supported research across disciplines by integrating access to diverse contents and resources. They now have the opportunity to reinvent their role in facilitating interdisciplinary work by offering researchers new ways of sharing, curating, discovering, and linking research data. Spatial data and metadata support this process because location often integrates disciplinary perspectives, enabling researchers to make their own research data more discoverable, to discover data of other researchers, and to integrate data from multiple sources. The Center for Spatial Studies at the University of California, Santa Barbara (UCSB) and the UCSB Library are undertaking joint research to better enable the discovery of research data and publications. The research addresses the question of how to spatially enable data discovery in a setting that allows for mapping and analysis in a GIS while connecting the data to publications about them. It suggests a framework for an integrated data discovery mechanism and shows how publications may be linked to associated data sets exposed either directly or through metadata on Esri's Open Data platform. The results demonstrate a simple form of linking data to publications through spatially referenced metadata and persistent identifiers. This linking adds value to research products and increases their discoverability across disciplinary boundaries.
Conference Paper
The intrinsic connection between place, space, and time in narrative texts is the subject of chronotopic literary analysis. We take the notion of the chronotope and apply it to exploratory analysis of unstructured big data. Exploratory chronotopic data analysis provides a data-driven perspective on how place, space, and time are connected in large, crowdsourced text collections. In this study, we processed the English Wikipedia text to find all co-occurrences of named places and dates and discovered that times are linked to places in a large majority of cases. We analyzed these millions of connections between places and dates and discovered a number of interesting trends. Because of the scale of the data involved, we suggest that chronotopic data analysis will lead to the development of new data models and methods for geographic information science and related fields, such as digital humanities.
Conference Paper
Ad hoc keyword search engines built using modern information retrieval methods do a good job of handling fine-grained queries. However, they perform poorly at facilitating spatial and spatially-embedded thematic exploration of the results, despite the fact that many queries, e.g. "civil war," refer to different documents and topics in different places. This is not for lack of data: geographic information, such as place names, events, and coordinates, is common in unstructured document collections on the web. The associations between geographic and thematic contents in these documents can provide a rich groundwork to organize information for exploratory research. In this paper we describe the architecture of an interactive thematic map search engine, Frankenplace, designed to facilitate document exploration at the intersection of theme and place. The map interface enables a user to zoom the geographic context of their query in and out, and quickly explore thousands of search results in a meaningful way. By combining topic models with geographically contextualized search results, users can also discover related topics based on geographic context. Frankenplace utilizes a novel indexing method called geoboost for boosting terms associated with cells on a discrete global grid. The resulting index factors in the geographic scale of the place or feature mentioned in related text, the relative textual scope of the place reference, and the overall importance of the containing document in the document network. The system is currently indexed with over 5 million documents from the web, including the English Wikipedia and online travel blog entries. We demonstrate that Frankenplace can support four distinct types of exploratory search tasks while being adaptive to scale and location of interest.
Article
We critically review literature on the association between searching and learning and contribute to the formulation of a research agenda for searching as learning. The paper begins by reviewing current literature that tends to characterize search systems as tools for learning. We then present a perspective on searching as learning that focuses on the learning that occurs during the search process, as well as search outputs and learning outcomes. The concept of ‘comprehensive search’ is proposed to describe iterative, reflective and integrative search sessions that facilitate critical and creative learning beyond receptive learning. We also discuss how search interaction data can provide a rich source of implicit and explicit features through which to assess search-related learning. In conclusion, we summarize opportunities and challenges for future research with respect to four agendas: developing a search system that supports sense-making and enhances learning; supporting effective user interaction for searching as learning; providing an inquiry-based literacy tool within a search system; and assessing learning from online searching behaviour.
Conference Paper
Wikidata is the central data management platform of Wikipedia. By the efforts of thousands of volunteers, the project has produced a large, open knowledge base with many interesting applications. The data is highly interlinked and connected to many other datasets, but it is also very rich, complex, and not available in RDF. To address this issue, we introduce new RDF exports that connect Wikidata to the Linked Data Web. We explain the data model of Wikidata and discuss its encoding in RDF. Moreover, we introduce several partial exports that provide more selective or simplified views on the data. This includes a class hierarchy and several other types of ontological axioms that we extract from the site. All datasets we discuss here are freely available online and updated regularly.
Article
This survey article describes previous research addressing text-based document geocoding, i.e. the task of predicting the geospatial coordinates (latitude and longitude) that best correspond to an entire document, based on its textual contents. We describe (1) early document geocoding systems that use heuristics over place names mentioned in the text (e.g. names of cities and states), (2) probabilistic language modeling approaches, where generative models are built for different regions in the world (usually considering a discretization based on a rectangular grid) from the words occurring in a set of georeferenced training documents, which are then used to predict per-region probabilities for previously unseen test documents, (3) combinations of different models and heuristics, including clustering procedures, feature selection approaches, and/or language models built from different sources, and (4) recent approaches based on discriminative classification models.