Fine-Tuning NER with spaCy for Transliterated
Entities Found in Digital Collections From the
Multilingual Persian Gulf
Almazhan Kapan1, Suphan Kirmizialtin2, Rhythm Kukreja2 and David Joseph Wrisley2
1New York University Shanghai, Pudong New District, Shanghai, China
2New York University Abu Dhabi, Saadiyat Island, Abu Dhabi, United Arab Emirates
Abstract
Text recognition technologies increase access to global archives and make possible their computational
study using techniques such as Named Entity Recognition (NER). In this paper, we present an approach
to extracting a variety of named entities (NE) in unstructured historical datasets from open digital
collections dealing with a space of informal British empire: the Persian Gulf region. The sources are
largely concerned with people, places and tribes as well as economic and diplomatic transactions in the
region. Since models in state-of-the-art NER systems function with limited tag sets and are generally
trained on English-language media, they struggle to capture entities of interest to the historian and do not
perform well with entities transliterated from other languages. We build custom spaCy-based NER models
trained on domain-specific annotated datasets. We also extend the set of named entity labels provided by
spaCy and focus on detecting entities of non-Western origin, particularly from Arabic and Farsi. We test
and compare performance of the blank, pre-trained and merged spaCy-based models, suggesting further
improvements. Our study makes an intervention into thinking beyond Western notions of the entity in
digital historical research by creating more inclusive models using non-metropolitan corpora in English.
Keywords
Named Entity Recognition, Gulf Studies, Colonial Archives, Persian Gulf, spaCy, Transliterated Names.
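As a minimal illustration of the model comparison mentioned in the abstract, the following sketch computes entity-level precision, recall and F1 per label from exact span matches. It is not the authors' evaluation code: the example sentence, the offsets, the label TRIBE and the names prf_by_label, model_a and model_b are assumptions made for illustration. Spans are written as (start_char, end_char, label) triples, the convention spaCy uses for entity offsets in training data.

```python
from collections import Counter


def prf_by_label(gold_docs, pred_docs):
    """Per-label precision, recall and F1 using exact span-and-label matches.

    Each argument is a list with one collection of (start, end, label)
    triples per document.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(gold_docs, pred_docs):
        gold, pred = set(gold), set(pred)
        for _, _, label in gold & pred:   # predicted and correct
            tp[label] += 1
        for _, _, label in pred - gold:   # predicted but not in gold
            fp[label] += 1
        for _, _, label in gold - pred:   # in gold but not predicted
            fn[label] += 1

    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        p = tp[label] / n_pred if n_pred else 0.0
        r = tp[label] / n_gold if n_gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"p": p, "r": r, "f": f}
    return scores


# Hypothetical sentence and annotations (not taken from the paper's data).
text = "Shaikh Isa of Bahrain wrote to the Bani Yas."
gold = [{(0, 10, "PERSON"), (14, 21, "GPE"), (35, 43, "TRIBE")}]

# Hypothetical model A lacks the custom label and tags the tribe as a person.
model_a = [{(0, 10, "PERSON"), (14, 21, "GPE"), (35, 43, "PERSON")}]
# Hypothetical model B knows TRIBE but truncates the personal name.
model_b = [{(7, 10, "PERSON"), (14, 21, "GPE"), (35, 43, "TRIBE")}]

scores_a = prf_by_label(gold, model_a)
scores_b = prf_by_label(gold, model_b)

for name, scores in (("A", scores_a), ("B", scores_b)):
    for label in sorted(scores):
        s = scores[label]
        print(f"model {name} {label:7s} P={s['p']:.2f} R={s['r']:.2f} F={s['f']:.2f}")
```

Exact matching is a deliberately strict choice here: a truncated name such as "Isa" for "Shaikh Isa" counts as both a false positive and a false negative. A more lenient overlap-based criterion would be a different design and would yield different scores.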
1. Introduction
With the increase in digitization and transcription of historical archives, Named Entity Recognition
(NER) is often regarded as an important step in text processing, ensuring scaled access to
layers of information found in text, such as names of people, places or currencies [1]. In addition
to the possibility of creating linked data and building gazetteers, identifying relevant entities in
unstructured text enables scholarly examination of broader patterns in archival collections. This
potential of NER has been demonstrated in the spatial humanities and the study of historical
networks, with notable challenges [2, 3]. Cultural heritage collections span long periods of
time, and historical text contains named entities (NE) which often have changed over time. In
The 6th Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2022), Uppsala, Sweden, March 15-18, 2022
aa5456@nyu.edu (A. Kapan); suphan@nyu.edu (S. Kirmizialtin); rk3781@nyu.edu (R. Kukreja); djw12@nyu.edu (D. J. Wrisley)
0000-0002-1064-8199 (A. Kapan); 0000-0001-5020-0578 (S. Kirmizialtin); 0000-0002-4424-1100 (R. Kukreja); 0000-0002-0355-1487 (D. J. Wrisley)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Digital Humanities in the Nordic and Baltic Countries Publications - ISSN: 2704-1441
2. Related Work
NER with Historical Collections
Using spaCy for Custom NER with Historical Documents
Table 1
3. Datasets
The Handwritten And Typewritten Bushire Political Residency Ledgers
Lorimer’s Gazetteer
Figure 1:
4. Data Annotation
Annotation workflow
Tag selection and customization
5. Methods
System architecture
Table 2
SM and LG, BLK-F, DEF-F, UPD-F, REP-F, DOB-F
Resampling training data
6. Evaluation and Results
Table 3
Figure 2:
Figure 3:
Figure 4:
7. Conclusion and Future work
References