Named Entity Recognition output from en-core-web-sm model tested on gold data

Source publication
Article
Full-text available
Text recognition technologies increase access to global archives and make possible their computational study using techniques such as Named Entity Recognition (NER). In this paper, we present an approach to extracting a variety of named entities (NE) in unstructured historical datasets from open digital collections dealing with a space of informal...

Citations

... The balanced performance of both RTC-NER models was thus assured. For corpus annotation, 'NER-Annotator', a user-friendly web interface for manually annotating entities for spaCy model training, was used [25]. We defined a set of custom tags/labels relevant to RTC incidents, as presented in Table 2. ...
... A screenshot of the NER-Annotator web interface is shown in Figure 3. 3.2.2. NER Training with spaCy 3.6.1: spaCy 3.6.1 is a Python-based open-source library for high-level NLP; it ships several multilingual, pre-trained, transformer-integrated models, such as 'en_core_web_sm', which can identify up to 18 entity types, ranging from people, dates and cities to organisations [22], [25], [26], [27], [28], [29]. In addition, spaCy provides features such as the entity-recogniser NER pipeline component for developing custom NER models on domain-specific entities, since its pretrained models fail to identify such entities in text [30], [31]. ...
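The snippet above describes training a custom spaCy NER model; spaCy expresses NER training data as character-offset entity annotations over raw text. A minimal sketch of building and validating such annotations (the sentence and the labels ROAD and DATE are invented illustrations, not the paper's actual tag set, and no spaCy installation is required for this step):

```python
# Sketch of spaCy-style NER training annotations: each example pairs raw
# text with character-offset entity spans. Labels here are hypothetical
# stand-ins for domain-specific RTC tags.

def make_example(text, spans):
    """Build a (text, annotations) pair and validate the offsets."""
    for start, end, label in spans:
        assert 0 <= start < end <= len(text), f"bad span for {label}"
    return (text, {"entities": list(spans)})

text = "A crash occurred on the Lagos-Ibadan expressway on Monday."
road = "Lagos-Ibadan expressway"
example = make_example(
    text,
    [
        (text.index(road), text.index(road) + len(road), "ROAD"),
        (text.index("Monday"), text.index("Monday") + len("Monday"), "DATE"),
    ],
)
print(example[1]["entities"])
```

Validating offsets up front matters in practice, because misaligned spans are a common source of silent errors when converting annotation-tool exports into spaCy training corpora.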
... The performances of the custom RTC-NER models were evaluated using the standard metrics of precision, recall and F1-score [25], [27]. The Nigerian test data used for evaluation comprised 44 RTC reports for the RTC-NER Baseline model and 844 sentences for the RTC-NER model, while the international test data comprised 70 RTC reports for both models. ...
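The precision, recall and F1-score metrics mentioned in this snippet follow the usual definitions over predicted versus gold entity spans. A minimal illustration (the spans below are invented, not taken from the paper's test data):

```python
def prf1(gold, predicted):
    """Precision, recall and F1 over sets of (start, end, label) spans.
    A prediction counts as correct only on an exact span-and-label match."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 5, "LOC"), (10, 16, "DATE"), (20, 27, "ORG")]
pred = [(0, 5, "LOC"), (10, 16, "DATE"), (30, 34, "PER")]
p, r, f = prf1(gold, pred)
print(round(p, 2), round(r, 2), round(f, 2))
```

Here two of three predictions match the gold spans exactly, so precision, recall and F1 all come out to 2/3.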
Article
Full-text available
Road traffic crashes (RTCs) are a major public health concern worldwide, particularly in Nigeria, where road transport is the most common mode of transportation. This study presents the geo-parsing approach for geographic information extraction (IE) of RTC incidents from news articles. We developed two custom, spaCy-based, RTC domain-specific named entity recognition (NER) models: RTC-NER Baseline and RTC-NER. These models were trained on a dataset of Nigerian RTC news articles. Evaluation of the models’ performances shows that the RTC-NER model outperforms the RTC-NER Baseline model on both Nigerian and international test data across all three standard metrics of precision, recall and F1-score. The RTC-NER model exhibits precision, recall and F1-score values of 93.63, 93.61 and 93.62, respectively, on the Nigerian test data, and 91.9, 87.88 and 89.84, respectively, on the international test data, thus showing its versatility in IE from RTC reports irrespective of country. We further applied the RTC-NER model for feature extraction using geo-parsing techniques to extract RTC location details and retrieve corresponding geographical coordinates, creating a structured Nigeria RTC dataset for exploratory data analysis. Our study showcases the use of the RTC-NER model in IE from RTC-related reports for analysis aimed at identifying RTC risk areas for data-driven emergency response planning.
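The geo-parsing step described in this abstract maps extracted location strings to geographical coordinates. A toy sketch of the idea using a hard-coded gazetteer lookup (the place names and coordinates are illustrative placeholders; the paper's pipeline would rely on a real geocoding service rather than this table):

```python
# Toy gazetteer mapping place names to (latitude, longitude).
# Entries are illustrative placeholders, not authoritative coordinates.
GAZETTEER = {
    "lagos": (6.45, 3.39),
    "abuja": (9.06, 7.49),
}

def geoparse(locations):
    """Resolve extracted location strings to coordinates where possible.
    Unresolvable names are simply skipped."""
    resolved = {}
    for name in locations:
        coords = GAZETTEER.get(name.strip().lower())
        if coords is not None:
            resolved[name] = coords
    return resolved

print(geoparse(["Lagos", "Unknown Town"]))
```

Names missing from the lookup fall through silently, mirroring the practical situation where some NER-extracted locations cannot be geocoded and must be handled downstream.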
... NER fine-tuning can be done for all kinds of diverse applications. A paper by Almazhan Kapan et al [5], attempts to fine tune numerous spacy models for identifying transliterated entities, which are entities translated or directly pulled from other languages. As the spacy models are language specific, the base English models had a difficult time identifying entities with a sampled f1 score of 0.36. ...
Conference Paper
Full-text available
Named Entity Recognition (NER) is one of the most popular natural language processing tools, used for detecting entities in text in production environments or as an intermediate step in more complicated NLP tasks. Popular NER models struggle to pick up names from certain demographics due to the lack of training on domain-specific data. The following paper presents a novel approach to fine-tuning a Named Entity Recognition model to detect names from different areas or ethnic groups. The proposed solution uses an oracle to annotate data for fine-tuning the model, while also replacing generic entities with a custom set of entities from the target demographic, in this case India, in an attempt to improve domain-specific performance, for example for an organization that operates in India and primarily serves Indian clients.
Article
Full-text available
In this article we analyze a corpus related to manumission and slavery in the Arabian Gulf in the late nineteenth and early twentieth centuries that we created using Handwritten Text Recognition (HTR). The corpus comes from India Office Records (IOR) R/15/1/199 File 5. Spanning the period from the 1890s to the early 1940s and comprising 977K words, it contains a variety of perspectives on manumission and slavery in the region, from manumission requests to administrative documents relevant to colonial approaches to the institution of slavery. We use word2vec with the WordVectors package in R to highlight how the method can uncover semantic relationships within historical texts, demonstrating some exploratory semantic queries, investigation of word analogies, and vector operations using the corpus content. We argue that advances in applied computer vision such as HTR are promising for historians working in colonial archives, and that while our method is reproducible, there are still issues related to language representation and limitations of scale within smaller datasets. Even though HTR corpus creation is labor-intensive, word vector analysis remains a powerful tool of computational analysis for corpora where HTR error is present.
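The word-analogy and vector-operation queries this abstract mentions reduce to vector arithmetic plus cosine similarity. A toy sketch with hand-made three-dimensional vectors (the vocabulary and values are invented for illustration; the article itself uses trained word2vec vectors via the WordVectors package in R):

```python
import math

# Hand-made toy vectors; real word2vec vectors are learned from a corpus.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.5, 0.9, 0.0],
    "woman": [0.5, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def analogy(a, b, c):
    """Return the vocabulary word closest to vec(a) - vec(b) + vec(c),
    excluding the query words themselves (the standard analogy setup)."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = {w: v for w, v in VECTORS.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

print(analogy("king", "man", "woman"))
```

With this toy vocabulary the classic king - man + woman query resolves to "queen"; on a real HTR-derived corpus the same operation surfaces semantic relationships learned from the documents themselves.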
Chapter
Research question: While generic named entity recognition (NER) models perform well on general tasks, custom NER models can provide more efficient and accurate solutions for specific domains. The chapter proposes the development of a custom NER model to classify technoscientific persons according to their professional expertise. Previous efforts to identify occupations have been limited due to the absence of precise annotation guidelines and reliable Gold Standard corpora.

Methodology: This chapter aims to address this challenge by proposing a hybrid method. The method combines rule- and dictionary-based approaches to capture domain-specific knowledge and to automatically annotate data. Bootstrapping is employed to improve the generalizability of the model and reduce overfitting. By training the model on different variations of the data and testing it on new validation sets, a more robust evaluation of the model's performance is possible. Finally, the efficiency and accuracy of the NER model are improved by using transfer learning with RoBERTa.

Findings: The first model, trained on the initial subcorpus, provided accurate results for almost all categories. However, when the model was validated on the next subcorpus, it showed a dramatic decline in performance, implying overfitting. To address this issue, bootstrapping was employed by cumulatively adding different subcorpora and reviewing and correcting the annotations. The model was retrained at each step until a satisfactory level of performance was achieved. The final model performs well on all categories except for social sciences, environmental sciences, and life sciences.

Significance: The proposed approach offers several benefits, including more efficient use of resources, improved accuracy, generalizability, and scalability.
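The rule- and dictionary-based automatic annotation described in the methodology can be sketched as a simple lookup tagger over a term dictionary (the occupation terms and expertise labels below are invented examples, not the chapter's actual resources):

```python
import re

# Toy occupation dictionary mapping surface forms to expertise labels.
# Entries are illustrative; a real system would use curated gazetteers
# and annotation guidelines.
OCCUPATIONS = {
    "chemist": "NATURAL_SCIENCES",
    "engineer": "ENGINEERING",
    "historian": "HUMANITIES",
}

def auto_annotate(text):
    """Return sorted (start, end, label) spans for dictionary matches.
    Matching is case-insensitive and restricted to whole words."""
    spans = []
    lowered = text.lower()
    for term, label in OCCUPATIONS.items():
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            spans.append((m.start(), m.end(), label))
    return sorted(spans)

text = "The chemist met an engineer in Vienna."
print(auto_annotate(text))
```

Annotations produced this way would then be reviewed, corrected, and cumulatively extended across subcorpora, which is the bootstrapping loop the chapter describes before the final transfer-learning step with RoBERTa.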