Fine-Tuning NER with spaCy for Transliterated
Entities Found in Digital Collections From the
Multilingual Persian Gulf
Almazhan Kapan¹, Suphan Kirmizialtin², Rhythm Kukreja² and David Joseph Wrisley²
1New York University Shanghai, Pudong New District, Shanghai, China
2New York University Abu Dhabi, Saadiyat Island, Abu Dhabi, United Arab Emirates
Abstract
Text recognition technologies increase access to global archives and make possible their computational
study using techniques such as Named Entity Recognition (NER). In this paper, we present an approach
to extracting a variety of named entities (NE) in unstructured historical datasets from open digital
collections dealing with a space of informal British empire: the Persian Gulf region. The sources are
largely concerned with people, places and tribes as well as economic and diplomatic transactions in the
region. Since models in state-of-the-art NER systems function with limited tag sets and are generally
trained on English-language media, they struggle to capture entities of interest to the historian and do not
perform well with entities transliterated from other languages. We build custom spaCy-based NER models
trained on domain-specific annotated datasets. We also extend the set of named entity labels provided by
spaCy and focus on detecting entities of non-Western origin, particularly from Arabic and Farsi. We test
and compare performance of the blank, pre-trained and merged spaCy-based models, suggesting further
improvements. Our study makes an intervention into thinking beyond Western notions of the entity in
digital historical research by creating more inclusive models using non-metropolitan corpora in English.
Keywords
Named Entity Recognition, Gulf Studies, Colonial Archives, Persian Gulf, spaCy, Transliterated Names.
1. Introduction
With the increase in digitization and transcription of historical archives, Named Entity Recognition (NER) is often regarded as an important step in text processing, ensuring scaled access to layers of information found in text, such as names of people, places or currencies [1]. In addition
to the possibility of creating linked data and building gazetteers, identifying relevant entities in
unstructured text enables scholarly examination of broader patterns in archival collections. This
potential of NER has been demonstrated in the spatial humanities and the study of historical
networks, with notable challenges [2, 3]. Cultural heritage collections span long periods of
time, and historical text contains named entities (NE) which often have changed over time.
The 6th Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2022), Uppsala, Sweden, March 15-18,
2022
aa5456@nyu.edu (A. Kapan); suphan@nyu.edu (S. Kirmizialtin); rk3781@nyu.edu (R. Kukreja); djw12@nyu.edu
(D. J. Wrisley)
0000-0002-1064-8199 (A. Kapan); 0000-0001-5020-0578 (S. Kirmizialtin); 0000-0002-4424-1100 (R. Kukreja);
0000-0002-0355-1487 (D. J. Wrisley)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
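Off-the-shelf English pipelines illustrate the problem the abstract raises: trained largely on English-language media and limited to a fixed tag set, they frequently miss or mis-type entities transliterated from Arabic and Farsi. A minimal baseline sketch, assuming spaCy 3.x, the en_core_web_sm model and an invented example sentence (not drawn from the corpus):

import spacy

nlp = spacy.load("en_core_web_sm")  # generic English model with a fixed tag set
doc = nlp("Shaikh Isa bin Ali Al Khalifa sent pearls from Bushire to Lingah.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Transliterated personal names and tribal affiliations are frequently missed
# or mis-typed by such generic pipelines, motivating the custom models built here.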
Gazetteer of the Persian Gulf, Central Arabia and Oman
2. Related Work
NER with Historical Collections
Using spaCy for Custom NER with Historical Documents
Table 1
3. Datasets
The Handwritten And Typewritten Bushire Political Residency Ledgers
Lorimer’s Gazetteer
Figure 1
4. Data Annotation
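Whatever annotation interface is used, the output for spaCy 3.x training is a set of character-offset spans with labels, serialized to the library's binary format. A hedged sketch of that conversion step, in which the TRIBE label, the example sentence and the file name train.spacy are illustrative assumptions rather than the project's actual annotation scheme:

import spacy
from spacy.tokens import DocBin

# Hypothetical annotated example: character offsets plus a custom label.
TRAIN_DATA = [
    ("The Bani Yas tribe traded pearls at Bushire.",
     {"entities": [(4, 12, "TRIBE"), (36, 43, "GPE")]}),
]

nlp = spacy.blank("en")
doc_bin = DocBin()
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    spans = [doc.char_span(start, end, label=label)
             for start, end, label in annotations["entities"]]
    doc.ents = [s for s in spans if s is not None]  # drop misaligned spans
    doc_bin.add(doc)
doc_bin.to_disk("train.spacy")  # consumed by spaCy's training workflow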
Annotation workflow
Tag selection and customization
5. Methods
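The abstract compares blank, pre-trained and merged spaCy-based models. As a hedged illustration of the difference between starting from a blank pipeline and updating a pre-trained one, a fine-tuning loop in spaCy 3.x might look as follows; the TRIBE label, the training sentence, the number of iterations and the output path are assumptions for illustration only:

import random
import spacy
from spacy.training import Example

# Start from a pre-trained pipeline; spacy.blank("en") with a freshly added
# "ner" component would be the blank-model counterpart.
nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
ner.add_label("TRIBE")  # extend the default label set

TRAIN_DATA = [
    ("The Qawasim controlled shipping near Ras al-Khaimah.",
     {"entities": [(4, 11, "TRIBE"), (37, 51, "GPE")]}),
]

other_pipes = [p for p in nlp.pipe_names if p != "ner"]
with nlp.select_pipes(disable=other_pipes):  # update only the NER component
    optimizer = nlp.resume_training()
    for _ in range(20):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer, drop=0.3)

nlp.to_disk("models/updated")  # output path is an assumption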
System architecture
Table 2
SM and LG
BLK-F
DEF-F
UPD-F
REP-F
DOB-F
Resampling training data
6. Evaluation and Results
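NER performance of this kind is conventionally reported as precision, recall and F1 over a held-out annotated set, per label and overall. A minimal evaluation sketch with spaCy 3.x, where the model path and the dev.spacy file are assumptions rather than the project's actual artifacts:

import spacy
from spacy.tokens import DocBin
from spacy.training import Example

nlp = spacy.load("models/updated")         # a fine-tuned pipeline (path assumed)
doc_bin = DocBin().from_disk("dev.spacy")  # held-out annotated documents

examples = []
for gold in doc_bin.get_docs(nlp.vocab):
    pred = nlp(gold.text)                  # re-predict entities on the raw text
    examples.append(Example(pred, gold))

scores = nlp.evaluate(examples)
print(scores["ents_p"], scores["ents_r"], scores["ents_f"])  # overall P / R / F1
print(scores["ents_per_type"])             # per-label breakdown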
               
             
               
             
             
            
        
Table 3
Figure 2
Figure 3
Figure 4
7. Conclusion and Future Work
References