Conference Paper

Basreh or Basra? Geoparsing Historical Locations in the Svoboda Diaries

... First, the process of corpus creation using HTR involves semi-supervised or unsupervised steps that predict the words on the page, sometimes resulting in misspellings, particularly of proper names (Kapan et al. 2023; Zhou et al. 2024). Additionally, orthography in IOR File 5 is inherently unstable. ...
Article
Full-text available
In this article we analyze a corpus related to manumission and slavery in the Arabian Gulf in the late nineteenth and early twentieth centuries that we created using Handwritten Text Recognition (HTR). The corpus comes from India Office Records (IOR) R/15/1/199 File 5. Spanning the period from the 1890s to the early 1940s and composed of 977K words, it contains a variety of perspectives on manumission and slavery in the region, from manumission requests to administrative documents relevant to colonial approaches to the institution of slavery. We use word2vec with the WordVectors package in R to highlight how the method can uncover semantic relationships within historical texts, demonstrating some exploratory semantic queries, investigation of word analogies, and vector operations using the corpus content. We argue that advances in applied computer vision such as HTR are promising for historians working in colonial archives and that, while our method is reproducible, there are still issues related to language representation and limitations of scale within smaller datasets. Even though HTR corpus creation is labor intensive, word vector analysis remains a powerful tool of computational analysis for corpora where HTR error is present.
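The kind of queries described above can be reproduced in a few lines. A minimal sketch, using Python and gensim rather than the R WordVectors package the article describes; the corpus file name and the query terms are placeholders, not the authors' data.

```python
# Train word2vec on a plain-text corpus (one document per line) and run
# nearest-neighbour, analogy, and vector-arithmetic queries. Real HTR output
# would need extra cleaning for spelling variants before this step.
from gensim.models import Word2Vec

sentences = [line.lower().split()
             for line in open("ior_file5.txt", encoding="utf-8")]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

# Exploratory semantic query: terms closest to a target word.
print(model.wv.most_similar("manumission", topn=10))

# Analogy-style vector operation: add and subtract term vectors.
print(model.wv.most_similar(positive=["slave", "freedom"],
                            negative=["master"], topn=5))
```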
Article
Full-text available
The diaries of Joseph Mathia Svoboda capture over 40 years of trade on the Tigris, describing his daily life and regular journeys as a steamboat purser during the late nineteenth and early twentieth centuries, specifically between the cities of Basra and Baghdad. They offer a unique perspective on daily life, community structure, and social relations. However, with over 600 pages of transcribed material and many more diaries still in the process of being transcribed, it is difficult to track patterns and changes in Joseph Svoboda’s social relationships and daily life by way of reading and inference alone. This article employs natural language processing (NLP) and network analysis to facilitate study of Svoboda’s social interactions, as well as his observations of his broader social milieu. Inspection of the networks and accompanying visualizations showed that Svoboda’s close interactions were primarily with kin, but his position as a steamship purser gave him a unique vantage point to encounter a wide range of persons of diverse backgrounds. Additionally, decomposing networks by time illustrated how significant life events facilitated change in social interactions.
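As an illustration of the approach (not the authors' actual pipeline), a co-mention network over diary entries can be sketched with spaCy and networkx; the sample entry, the PERSON-based heuristic, and the pairwise co-mention rule are assumptions made for the sketch.

```python
# Build a weighted co-mention graph: people named in the same diary entry
# are linked, and edge weights count how often they co-occur.
import itertools
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")   # any English NER model will do
entries = ["Joseph dined with Alexander and Eliza before the steamer left Baghdad."]

G = nx.Graph()
for text in entries:
    people = {ent.text for ent in nlp(text).ents if ent.label_ == "PERSON"}
    for a, b in itertools.combinations(sorted(people), 2):
        weight = G.get_edge_data(a, b, default={}).get("weight", 0)
        G.add_edge(a, b, weight=weight + 1)

# Most-connected persons, a rough proxy for the diarist's closest contacts.
print(sorted(G.degree, key=lambda kv: kv[1], reverse=True)[:5])
```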
Preprint
Full-text available
After decades of massive digitisation, an unprecedented number of historical documents are available in digital format, along with their machine-readable texts. While this represents a major step forward with respect to preservation and accessibility, it also opens up new opportunities in terms of content mining, and the next fundamental challenge is to develop appropriate technologies to efficiently search, retrieve and explore information from this 'big data of the past'. Among semantic indexing opportunities, the recognition and classification of named entities are in great demand among humanities scholars. Yet, named entity recognition (NER) systems are heavily challenged by diverse, historical and noisy inputs. In this survey, we present the array of challenges posed by historical documents to NER, inventory existing resources, describe the main approaches deployed so far, and identify key priorities for future developments.
Article
Full-text available
In recent years, dense word embeddings for text representation have been widely used since they can model complex semantic and morphological characteristics of language, such as meaning in specific contexts and applications. Contrary to sparse representations, such as one-hot encoding or frequencies, word embeddings provide computational advantages and improved results in many natural language processing tasks, such as the automatic extraction of geospatial information. Computer systems capable of discovering geographic information from natural language involve a complex process called geoparsing. In this work, we explore the use of word embeddings for two NLP tasks: Geographic Named Entity Recognition and Geographic Entity Disambiguation, both as an effort to develop the first Mexican Geoparser. Our study shows that relationships between geographic and semantic spaces arise when we apply word embedding models over a corpus of documents in Mexican Spanish. Our models achieved high accuracy for geographic named entity recognition in Spanish.
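One way to picture the disambiguation side: score each candidate referent by the cosine similarity between the averaged embedding of the mention's context words and the averaged embedding of words describing the candidate. A hedged sketch follows; `wv` stands for any trained gensim KeyedVectors, and the candidate descriptors are invented for illustration.

```python
# Embedding-based toponym disambiguation by context/candidate similarity.
import numpy as np

def mean_vector(wv, words):
    vecs = [wv[w] for w in words if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def disambiguate(wv, context_words, candidates):
    ctx = mean_vector(wv, context_words)
    scores = {}
    for name, descriptor_words in candidates.items():
        cand = mean_vector(wv, descriptor_words)
        denom = (np.linalg.norm(ctx) * np.linalg.norm(cand)) or 1.0
        scores[name] = float(ctx @ cand / denom)
    return max(scores, key=scores.get), scores

# e.g. "Peru" in a sentence about Indiana vs. the South American country:
# disambiguate(model.wv, ["county", "indiana", "railroad"],
#              {"Peru, Indiana": ["indiana", "county", "town"],
#               "Peru (country)": ["lima", "andes", "republic"]})
```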
Article
Full-text available
The automatic extraction of geospatial information is an important aspect of data mining. Computer systems capable of discovering geographic information from natural language involve a complex process called geoparsing, which includes two important tasks: geographic entity recognition and toponym resolution. The first task can be approached with machine learning, in which case a model is trained to recognize a sequence of characters (words) corresponding to geographic entities. The second task consists of assigning such entities to their most likely coordinates. Frequently, the latter process involves solving referential ambiguities. In this paper, we propose an extensible geoparsing approach including geographic entity recognition based on a neural network model and disambiguation based on what we have called dynamic context disambiguation. Once place names are recognized in an input text, they are resolved using a grammar in which a set of rules specifies how ambiguities should be resolved, much as a person would, by considering the context. As a result, we obtain an assignment of the most likely geographic properties of the recognized places. We propose an assessment measure based on a ranking of closeness relative to the predicted and actual locations of a place name. On this measure, our method outperforms OpenStreetMap Nominatim. We include other measures to assess the recognition of place names and the prediction of what we have called geographic levels (the administrative jurisdiction of places).
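The closeness-based evaluation idea can be made concrete with a distance threshold: count a prediction as correct when the geodesic distance between predicted and gold coordinates falls under a tolerance. A sketch under assumptions (the 161 km cut-off is a common choice in toponym-resolution work, not necessarily the authors' measure; coordinates are illustrative).

```python
# Accuracy within a distance threshold, using geopy's geodesic distance.
from geopy.distance import geodesic

def accuracy_at_km(predictions, gold, threshold_km=161):
    hits = sum(1 for name, pred in predictions.items()
               if geodesic(pred, gold[name]).km <= threshold_km)
    return hits / len(predictions)

preds = {"Basra": (30.515, 47.810)}
gold = {"Basra": (30.508, 47.783)}
print(accuracy_at_km(preds, gold))   # 1.0, the prediction is a few km off
```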
Conference Paper
Full-text available
In this paper we use network analysis to identify qualitative "neighbors" for toponyms in an eighteenth-century French encyclopedia; the method could apply to any entry-based text with annotated toponyms. It draws on relations in a corpus of articles, which improves disambiguation at a later stage with an external resource. We suggest the network as an alternative to geospatial representation, a useful proxy when no historical gazetteer exists for the source material's period. Our first experiments have shown that this approach goes beyond simple text analysis and is able to find relations between toponyms that do not co-occur in the same documents. Network relations are also usefully compared with disambiguated toponyms to evaluate geographical coverage, and the ways that geographical discourse is expressed, in historical texts.
Article
Full-text available
While a reasonable amount of work has gone into automatically geoparsing text at the city or higher levels of granularity for different types of texts in different domains, there is relatively little research on geoparsing fine-grained locations such as buildings, green spaces and street names in text. This paper reports on how the Edinburgh Geoparser performs on this task for different types of literary text set in Edinburgh, the first UNESCO City of Literature. The non-copyrighted gold standard datasets created for this purpose are released along with this article.
Conference Paper
Full-text available
This paper discusses the challenges of applying named entity linking in a rich, complex domain – specifically, the linking of (1) military units, (2) places and (3) people in the context of interlinked Second World War data. Multiple sub-scenarios are discussed in detail through concrete evaluations, analyzing the problems faced, and the solutions developed. A key contribution of this work is to highlight the heterogeneity of problems and approaches needed even inside a single domain, depending on both the source data as well as the target authority.
Article
Full-text available
Place name mentions in text may have more than one potential referent (e.g. Peru, the country vs. Peru, the city in Indiana). The Edinburgh Language Technology Group (LTG) has developed the Edinburgh Geoparser, a system that can automatically recognise place name mentions in text and disambiguate them with respect to a gazetteer. The recognition step is required to identify location mentions in a given piece of text. The subsequent disambiguation step, generally referred to as georesolution, grounds location mentions to their corresponding gazetteer entries with latitude and longitude values, for example, to visualise them on a map. Geoparsing is not only useful for mapping purposes but also for making document collections more accessible as it can provide additional metadata about the geographical content of documents. Combined with other information mined from text such as person names and date expressions, complex relations between such pieces of information can be identified. The Edinburgh Geoparser can be used with several gazetteers including Unlock and GeoNames to process a variety of input texts. The original version of the Geoparser was a demonstrator configured for modern text. Since then, it has been adapted to georeference historic and ancient text collections as well as modern-day newspaper text. Currently, the LTG is involved in three research projects applying the Geoparser to historical text collections of very different types and for a variety of end-user applications. This paper discusses the ways in which we have customised the Geoparser for specific datasets and applications relevant to each project.
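The two-step recognise-then-resolve pipeline described above can be imitated in a few lines; the sketch below is explicitly not the Edinburgh Geoparser, just a stand-in that uses spaCy for recognition and a Nominatim lookup for georesolution, taking the first gazetteer hit rather than ranking candidates.

```python
# Minimal recognition + georesolution stand-in for the pipeline described.
import spacy
from geopy.geocoders import Nominatim

nlp = spacy.load("en_core_web_sm")
geocoder = Nominatim(user_agent="svoboda-geoparsing-demo")

def geoparse(text):
    results = []
    for ent in nlp(text).ents:
        if ent.label_ in {"GPE", "LOC"}:           # recognition step
            loc = geocoder.geocode(ent.text)       # georesolution step
            if loc is not None:
                results.append((ent.text, loc.latitude, loc.longitude))
    return results

print(geoparse("The steamer left Baghdad for Basra on the Tigris."))
```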
Conference Paper
Full-text available
Geoparsing and geocoding are two essential middleware services to facilitate final user applications such as location-aware searching or different types of location-based services. The objective of this work is to propose a method for establishing a processing chain to support the geoparsing and geocoding of text documents describing events strongly linked with space and making frequent use of fine-grained toponyms. The geoparsing part is a Natural Language Processing approach which combines part-of-speech information with syntactico-semantic patterns (a cascade of transducers). However, the real novelty of this work lies in the geocoding method. The geocoding algorithm is unsupervised and takes advantage of clustering techniques to provide a solution for disambiguating the toponyms found in gazetteers, and at the same time estimating the spatial footprint of those fine-grained toponyms not found in gazetteers. The feasibility of the proposal has been tested with a corpus of hiking descriptions in French, Spanish and Italian.
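The clustering idea can be caricatured as follows: pool every gazetteer candidate for every toponym in a document, cluster the coordinates, and keep for each toponym the candidate that falls in the largest cluster. The sketch below uses DBSCAN with a placeholder epsilon in degrees; it is a rough illustration, not the paper's algorithm.

```python
# Disambiguate toponyms by keeping candidates from the densest coordinate cluster.
import numpy as np
from sklearn.cluster import DBSCAN

def pick_by_cluster(candidates, eps_deg=0.5):
    # candidates: {toponym: [(lat, lon), ...]} for every toponym in the text
    names, coords = [], []
    for name, points in candidates.items():
        for point in points:
            names.append(name)
            coords.append(point)
    labels = DBSCAN(eps=eps_deg, min_samples=2).fit(np.array(coords)).labels_
    sizes = {lab: int((labels == lab).sum()) for lab in set(labels) if lab != -1}
    best = max(sizes, key=sizes.get) if sizes else None
    chosen = {}
    for name, point, lab in zip(names, coords, labels):
        if lab == best and name not in chosen:
            chosen[name] = point
    return chosen
```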
Conference Paper
Full-text available
Large corpora are ubiquitous in today’s world and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). In this paper, we identify a gap in existing implementations of many of the popular algorithms, which is their scalability and ease of use. We describe a Natural Language Processing software framework which is based on the idea of document streaming, i.e. processing corpora document after document, in a memory independent fashion. Within this framework, we implement several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size. Particular emphasis is placed on straightforward and intuitive framework design, so that modifications and extensions of the methods and/or their application by interested practitioners are effortless. We demonstrate the usefulness of our approach on a real-world scenario of computing document similarities within an existing digital library DML-CZ.
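This abstract describes the document-streaming design behind gensim; the pattern itself is easy to reproduce. A sketch, assuming a corpus file with one document per line (file name and topic count are placeholders).

```python
# Memory-independent corpus processing: documents are read one at a time,
# so the corpus never has to fit in RAM.
from gensim import corpora, models

class StreamedCorpus:
    def __init__(self, path, dictionary):
        self.path, self.dictionary = path, dictionary
    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:                            # one document at a time
                yield self.dictionary.doc2bow(line.lower().split())

dictionary = corpora.Dictionary(
    line.lower().split() for line in open("diaries.txt", encoding="utf-8"))
corpus = StreamedCorpus("diaries.txt", dictionary)

tfidf = models.TfidfModel(corpus)                     # single pass over the stream
lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=50)
print(lsi.print_topics(3))
```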
Article
Full-text available
Many approaches have been proposed in recent years in the context of Geographic Information Retrieval (GIR), mostly in order to deal with geographically constrained information in unstructured texts. Most of these approaches share a common scheme: in order to disambiguate a toponym t with n possible referents in a document d, they find a certain number of context toponyms c_0, ..., c_k that are contained in d. A score for each referent is calculated according to the context toponyms, and the referent with the highest score is selected. According to the method used to calculate the score, Toponym Disambiguation (TD) methods may be grouped into three main categories, as proposed by [7]:
  • map-based: methods that use an explicit representation of toponyms on a map, for instance to calculate the average distance of unambiguous context toponyms from referents;
  • knowledge-based: methods that exploit external knowledge sources such as gazetteers, Wikipedia or ontologies to find disambiguation clues;
  • data-driven or supervised: methods based on machine learning techniques.
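The map-based category can be illustrated directly: score each candidate referent of an ambiguous toponym by its mean great-circle distance to the unambiguous context toponyms, then pick the closest. A sketch with a hand-rolled haversine distance; the referents and context coordinates are invented.

```python
# Map-based toponym disambiguation: the closest-on-average candidate wins.
from math import asin, cos, radians, sin, sqrt

def haversine_km(a, b):
    (lat1, lon1), (lat2, lon2) = ((radians(a[0]), radians(a[1])),
                                  (radians(b[0]), radians(b[1])))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))      # Earth radius ~6371 km

def map_based_pick(referents, context_points):
    scores = {name: sum(haversine_km(pt, c) for c in context_points) / len(context_points)
              for name, pt in referents.items()}
    return min(scores, key=scores.get), scores

# Ambiguous "Basra" given unambiguous context toponyms Baghdad and Amarah.
print(map_based_pick(
    {"Basra (Iraq)": (30.51, 47.81), "Basra (hypothetical elsewhere)": (40.0, -75.0)},
    [(33.31, 44.36), (31.84, 47.14)]))
```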
Article
Full-text available
We report on two JISC-funded projects that aimed to enrich the metadata of digitized historical collections with georeferences and other information automatically computed using geoparsing and related information extraction technologies. Understanding location is a critical part of any historical research, and the nature of the collections makes them an interesting case study for testing automated methodologies for extracting content. The two projects (GeoDigRef and Embedding GeoCrossWalk) have looked at how automatic georeferencing of resources might be useful in developing improved geographical search capacities across collections. In this paper, we describe the work that was undertaken to configure the geoparser for the collections as well as the evaluations that were performed.
Article
Toponym resolution, or grounding names of places to their actual locations, is an important problem in analysis of both historical corpora and present-day news and web content. Recent approaches have shifted from rule-based spatial minimization methods to machine learned classifiers that use features of the text surrounding a toponym. Such methods have been shown to be highly effective, but they crucially rely on gazetteers and are unable to handle unknown place names or locations. We address this limitation by modeling the geographic distributions of words over the earth's surface: we calculate the geographic profile of each word based on local spatial statistics over a set of geo-referenced language models. These geo-profiles can be further refined by combining in-domain data with background statistics from Wikipedia. Our resolver computes the overlap of all geo-profiles in a given text span; without using a gazetteer, it performs on par with existing classifiers. When combined with a gazetteer, it achieves state-of-the-art performance for two standard toponym resolution corpora (TR-CoNLL and Civil War). Furthermore, it dramatically improves recall when toponyms are identified by named entity recognizers, which often (correctly) find non-standard variants of toponyms.
Conference Paper
Named entities (NEs) are among the most relevant types of information that can be used to efficiently index and retrieve digital documents. Furthermore, the use of Entity Linking (EL) to disambiguate and relate NEs to knowledge bases provides supplementary information which can be useful to differentiate ambiguous elements such as geographical locations and people's names. In historical documents, the detection and disambiguation of NEs is a challenge. Most historical documents are converted into plain text using an optical character recognition (OCR) system at the expense of some noise. Documents in digital libraries will, therefore, be indexed with errors that may hinder their accessibility. OCR errors affect not only document indexing but also the detection, disambiguation, and linking of NEs. This paper analyses the performance of different EL approaches on two multilingual historical corpora, CLEF HIPE 2020 (English, French, German) and NewsEye (Finnish, French, German, Swedish), and proposes several techniques for alleviating the impact of historical data problems on the EL task. Our findings indicate that the proposed approaches not only outperform the baseline on both corpora but also considerably reduce the impact of historical document issues across different subjects and languages.
Conference Paper
This paper describes the development of an annotated corpus which forms a challenging testbed for geographical text analysis methods. This dataset, the Corpus of Lake District Writing (CLDW), consists of 80 manually digitised and annotated texts (comprising over 1.5 million word tokens). These texts were originally composed between 1622 and 1900, and they represent a range of different genres and authors. Collectively, the texts in the CLDW constitute an indicative sample of writing about the English Lake District between the early seventeenth century and the early twentieth century. The corpus is annotated more deeply than is currently possible with vanilla Named Entity Recognition, Disambiguation and geoparsing. This is especially true of the geographical information the corpus contains, since we have undertaken not only to link different historical and spelling variants of place-names, but also to identify and to differentiate geographical features such as waterfalls, woodlands, farms or inns. In addition, we illustrate the potential of the corpus as a gold standard by evaluating the results of three different NLP libraries and geoparsers on its contents. In this evaluation, the standard NER processing of the text by the different NLP libraries produces many false positive and false negative results, showing the strength of the gold standard.
Conference Paper
Historians are often interested in the locations mentioned in digitized collections. However, place names are highly ambiguous and may change over time, which makes it especially hard to automatically ground mentions of places in historical texts to their real-world referents. Toponym disambiguation is a challenging problem in natural language processing, and has been approached in two different yet related tasks: toponym resolution and entity linking. In this paper, we propose a weakly-supervised method that combines the strengths of both approaches by exploiting both geographic and semantic features. We tested our method against a historical toponym resolution benchmark and improved the state of the art. We also created five datasets and tested the performance of two state-of-the-art out-of-the-box entity linking methods and also improved on their performance when only locations are considered.
Conference Paper
Interpretability and discriminative power are the two most basic requirements for an evaluation metric. In this paper, we report the mention identification effect in the B³, CEAF, and BLANC coreference evaluation metrics that makes it impossible to interpret their results properly. The only metric which is insensitive to this flaw is MUC, which, however, is known to be the least discriminative metric. It is a known fact that none of the current metrics are reliable. The common practice for ranking coreference resolvers is to use the average of three different metrics. However, one cannot expect to obtain a reliable score by averaging three unreliable metrics. We propose LEA, a Link-based Entity-Aware evaluation metric that is designed to overcome the shortcomings of the current evaluation metrics. LEA is available as branch LEA-scorer in the reference implementation of the official CoNLL scorer.
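For orientation, the link-based scoring that gives LEA its name can be sketched as follows (paraphrased, with singleton handling left aside): an entity k contributes link(k) = |k|(|k|-1)/2 mention pairs, each key entity is weighted by its size, and

$$\mathrm{LEA_{recall}} = \frac{\sum_{k_i \in K} |k_i| \cdot \sum_{r_j \in R} \frac{\mathrm{link}(k_i \cap r_j)}{\mathrm{link}(k_i)}}{\sum_{k_i \in K} |k_i|},$$

with precision obtained by swapping the roles of the key entities K and the response entities R.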
Article
It has long been known that 'variety' is one of the key challenges and opportunities of big data. This is especially true when we consider the variety of content in historical corpora resulting from large-scale digitisation activities. Collections such as Early English Books Online (EEBO) and the British Library 19th Century Newspapers are extremely large and heterogeneous data sources containing a variety of content in terms of time, location, topic, style and quality. The range of geographical locations referenced in these corpora poses a difficult challenge for state-of-the-art geoparsing tools. In the context of our work on Spatial Humanities analyses, we present our solution for dealing with the variety and scale of these corpora.
Article
This paper presents a machine learning method for disambiguating place references in text. Solving this task can have important applications in the digital humanities and computational social sciences, by supporting the geospatial analysis of large document collections. We combine multiple features that capture the similarity between candidate disambiguations, the place references, and the context where the place references occur, in order to rank and choose from a set of candidate disambiguations, obtained from a knowledge base containing geospatial coordinates and textual descriptions for different places from all around the world. The proposed method was evaluated through English corpora used in previous work in this area, and also with a subset of the English Wikipedia. Experimental results demonstrate that the proposed method is indeed effective, showing that out-of-the-box learning algorithms and relatively simple features can obtain a high accuracy in this task.
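A generic version of this ranking setup (not the paper's feature set or learner): turn each (mention, candidate) pair into a small feature vector, train a classifier on correct versus incorrect pairs, and pick the highest-scoring candidate at prediction time. The features and training rows below are invented placeholders.

```python
# Learned ranking of candidate disambiguations with a simple classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def features(candidate):
    # Hypothetical features: name similarity, log population, context overlap.
    return [candidate["name_sim"], np.log1p(candidate["population"]),
            candidate["ctx_overlap"]]

# Toy training data: each row is one (mention, candidate) pair, y marks the
# correct candidate.
X = np.array([[0.9, 12.1, 0.4], [0.9, 6.3, 0.1], [0.5, 10.0, 0.6], [0.4, 4.0, 0.0]])
y = np.array([1, 0, 1, 0])
clf = LogisticRegression().fit(X, y)

def pick(candidates):
    scores = clf.predict_proba([features(c) for c in candidates])[:, 1]
    return candidates[int(np.argmax(scores))]
```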
Conference Paper
We introduce the brat rapid annotation tool (BRAT), an intuitive web-based tool for text annotation supported by Natural Language Processing (NLP) technology. BRAT has been developed for rich structured annotation for a variety of NLP tasks and aims to support manual curation efforts and increase annotator productivity using NLP techniques. We discuss several case studies of real-world annotation projects using pre-release versions of BRAT and present an evaluation of annotation assisted by semantic class disambiguation on a multicategory entity mention annotation task, showing a 15% decrease in total annotation time. BRAT is available under an open-source license from: http://brat.nlplab.org
Conference Paper
Geographic interfaces provide natural, scalable visualizations for many digital library collections, but the wide range of data in digital libraries presents some particular problems for identifying and disambiguating place names. We describe the toponym-disambiguation system in the Perseus digital library and evaluate its performance. Name categorization varies significantly among different types of documents, but toponym disambiguation performs at a high level of precision and recall with a gazetteer an order of magnitude larger than most other applications.
Chris Veness. 2022. Calculate distance and bearing between two Latitude/Longitude points using haversine formula in JavaScript.
Rosa Filgueira, Claire Grover, Melissa Terras, and Beatrice Alex. 2020. Geoparsing the historical Gazetteers of Scotland: accurately computing location in mass digitised texts. In Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora, pages 24-30, Marseille, France. European Language Resources Association.
Milan Gritta, Mohammad Taher Pilehvar, and Nigel Collier. 2020. A pragmatic guide to geoparsing evaluation. Language Resources and Evaluation, 54(3):683-712.
Isaac Caswell and Bowen Liang. 2020. Recent Advances in Google Translate.
Gary Munnelly, Harshvardhan J. Pandit, and Séamus Lawless. 2018. Exploring Linked Data for the Automatic Enrichment of Historical Archives. In The Semantic Web: ESWC 2018 Satellite Events, Lecture Notes in Computer Science, pages 423-433, Cham. Springer International Publishing.
Abdusalam F. Ahmad Nwesri and Nabila Al-Mabrouk S. Shinbir. 2009. Capturing Variants of Transliterated Arabic Names in English Text.