Getting annotated sentences from Wikipedia.

Source publication
Article
As machine learning techniques are increasingly employed for text processing tasks, the need for training data has become a major bottleneck for their application. Manual generation of large-scale training datasets tailored to each task is a time-consuming and expensive process, which necessitates their automated generation. In this work, we...

Contexts in source publication

Context 1
... of the outgoing link was identical to the title of the artwork. These sentences were extracted, and the anchor texts of the sentences were tagged as artworks, serving as accurate annotations for this category. In this stage, a total of 1,628 sentences were added as silver-standard annotation data to the training set. The process is illustrated in Fig. 4. This data provided correct and precise textual patterns that were highly indicative of artwork titles and led to a considerable boost in training data quality. This dataset was merged with the best-performing dataset obtained from the previous stages (WPI-WD-CONA-ULAN (Snorkel)) to generate a combined annotated dataset as the ...
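The heuristic described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `ARTWORK_TITLES` set and the simplified wikitext link pattern are assumptions made for the example.

```python
import re

# Hypothetical artwork-title gazetteer; in the paper these titles
# come from knowledge bases such as Wikidata, CONA, and ULAN.
ARTWORK_TITLES = {"Mona Lisa", "The Card Players"}

# Simplified wikitext link: [[Target]] or [[Target|anchor text]]
LINK_RE = re.compile(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]")

def silver_annotations(sentence):
    """Return (clean_sentence, spans). A span is kept only when the
    anchor text is identical to the linked article's title and that
    title is a known artwork -- the silver-standard heuristic."""
    spans, parts, pos = [], [], 0
    for m in LINK_RE.finditer(sentence):
        target = m.group(1).strip()
        anchor = (m.group(2) or target).strip()
        parts.append(sentence[pos:m.start()])
        start = sum(len(p) for p in parts)
        parts.append(anchor)
        pos = m.end()
        # keep only links whose anchor text equals the article title
        if anchor == target and target in ARTWORK_TITLES:
            spans.append((start, start + len(anchor), "WORK_OF_ART"))
    parts.append(sentence[pos:])
    return "".join(parts), spans
```

A sentence like `"Cezanne painted [[The Card Players]] in the 1890s."` would yield one `WORK_OF_ART` span over the anchor text, while a piped link whose anchor differs from the title is skipped.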
Context 2
... on the annotations that are fed to it during the training phase. Based on this fact, the third stage of our framework incorporates the silver-standard sentences from Wikipedia so as to provide clean and precise artwork annotations. From such annotations, the model could learn the textual patterns that are indicative of the mention of an artwork title (Fig. 45; e.g., ". . . where the furniture is overturned, one chair projecting to the very picture surface, and the cards are strewn . . ."). An evaluation of the annotations performed by the model on our test dataset shows that the model was indeed able to learn such patterns. For example, in Text 1 from an exhibition catalogue, the model was able to ...

Similar publications

Preprint
Textual entailment recognition is one of the basic natural language understanding (NLU) tasks. Understanding the meaning of sentences is a prerequisite before applying any natural language processing (NLP) techniques to automatically recognize textual entailment. A text entails a hypothesis if and only if the truth value of the hypothesis follows...

Citations

... NER, which is the focus of this paper, has been well studied in the literature (Ehrmann et al. 2023; Moscato et al. 2023). The use of NER and other term-extraction tools within cultural heritage organisations has been well documented for close to a decade (Aejas et al. 2021; Jain et al. 2022). However, these tools are dependent on the data they have been trained on (van Hooland et al. 2015). ...
Article
Keywords are essential to the searchability, and therefore discoverability, of museum and archival collections in the modern world. Without them, the collection management systems (CMS) and online collections these cultural organisations rely on to record, organise, and make their collections accessible do not operate efficiently. However, generating these keywords manually is time-consuming for these already resource-strapped organisations. Artificial intelligence (AI), particularly generative AI and Large Language Models (LLMs), could hold the key to generating, even automating, this key data and as such be considered a co-creative add-on. This study contributes to the literature by introducing the use of Meta’s open-source LLM, Llama, to generate keywords from curator/archivist-written descriptions of museum and archival collection items. Our findings suggest that these technologies add significant value compared to current manual methods for keyword generation. In particular, we find that through using carefully crafted prompts, successful keyword augmentations could be established, making museum and archival collections much more accessible to wider and more diverse audiences. However, the results also showed that generative AI has biases (e.g., hallucinations, over-generalisations, outdated language), though the frequency of occurrence was not as high as general perception might suggest. Hence, we also discuss mitigation strategies to address these, and how cultural institutions can recognise the risks and errors while getting the most from the systems. Finally, we discuss options to achieve structured results which allow easier ingestion of data back into CMS. Ultimately, LLMs hold significant potential to enhance accessibility to museum and archival collections, yet they are not without imperfection, as we extensively discuss.
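The prompt-based keyword generation described above can be sketched without committing to any particular LLM API: build a tightly constrained prompt, then parse the line-per-keyword reply into a structured list for CMS ingestion. The prompt wording and the line-delimited output format are illustrative assumptions, not the study's actual prompts.

```python
def keyword_prompt(description, n=10):
    """Build a scoped prompt asking an LLM (e.g. Llama) for keywords
    from a curator-written description. Constraining the output format
    makes the reply easier to ingest back into a CMS."""
    return (
        "You are a museum cataloguer. Extract up to "
        f"{n} concise keywords from the description below. "
        "Use only terms grounded in the text; do not invent names or dates. "
        "Return one keyword per line, with no numbering.\n\n"
        f"Description: {description}"
    )

def parse_keywords(llm_response):
    """Parse a line-per-keyword reply into a deduplicated,
    lowercased list, preserving first-seen order."""
    seen, keywords = set(), []
    for line in llm_response.splitlines():
        kw = line.strip().lower()
        if kw and kw not in seen:
            seen.add(kw)
            keywords.append(kw)
    return keywords
```

Lowercasing and deduplicating on parse is one cheap mitigation for the over-generalisation and repetition issues the study reports.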
... The resulting dataset was employed to test multiple models. Another approach to dataset creation focuses on identifying artworks, as demonstrated in [31]. This study also relied on Wikipedia articles, with dataset generation based on analyzing the references provided within the articles. ...
Article
Developing robust and reliable models for Named Entity Recognition (NER) in the Russian language presents significant challenges due to the linguistic complexity of Russian and the limited availability of suitable training datasets. This study introduces a semi-automated methodology for building a customized Russian NER dataset specifically designed for literary purposes. The paper provides a detailed description of the methodology employed for collecting and proofreading the dataset, outlining the pipeline used for processing and annotating its contents. A comprehensive analysis highlights the dataset’s richness and diversity. Central to the proposed approach is the use of a voting system to facilitate the efficient elicitation of entities, enabling significant time and cost savings compared to traditional methods of constructing NER datasets. The voting system is described theoretically and mathematically to highlight its impact on enhancing the annotation process. Testing the voting system with various thresholds shows that it increases overall precision by 28% compared to using only the state-of-the-art model for auto-annotation. The dataset is meticulously annotated and thoroughly proofread, ensuring its value as a high-quality resource for training and evaluating NER models. Empirical evaluations using multiple NER models underscore the dataset’s importance and its potential to enhance the robustness and reliability of NER models for the Russian language.
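The voting idea can be sketched as a simple span-level ensemble: keep an entity span only when enough auto-annotators agree on it, trading recall for precision. The span representation and threshold below are illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter

def vote_entities(annotations, threshold=2):
    """Combine entity predictions from several auto-annotators.
    `annotations` is a list of span collections, one per model, where a
    span is (start, end, label). A span is kept only if at least
    `threshold` models proposed exactly that span -- the voting scheme
    that boosts precision over any single auto-annotator."""
    counts = Counter(
        span
        for model_spans in annotations   # one entry per model
        for span in set(model_spans)     # each model votes once per span
    )
    return sorted(span for span, c in counts.items() if c >= threshold)
```

Raising the threshold makes the silver annotations cleaner but sparser, which is the trade-off the paper's threshold experiments explore.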
... Generative Adversarial Networks (GANs) have been effectively applied in artwork restoration, utilizing a modified U-Net architecture for the generator and pre-trained residual networks for the encoder, demonstrating superior performance [60]. A heuristic-based framework has been proposed for generating training data for Named Entity Recognition (NER) of artwork titles, significantly improving NER performance [66]. Spatiotemporal Deep Neural Networks (STDNNs) have been used for defect identification in artworks through infrared thermography, achieving outstanding performance with a high mean F1 score [67]. ...
Article
Artificial intelligence (AI) techniques have been increasingly applied to assist various cultural heritage (CH)-related tasks. The aim of this study is to examine the research trends and current applications of AI in this vast domain. After obtaining a dataset from the Web of Science and Scopus databases, a scientometric analysis of research publications from 2019 to 2023 related to the use of AI in CH was conducted. The trending topics based on the authors’ keywords were identified by using the ScientoPy v2.1.3 software. Through this approach, five main topics were identified: classification, computer vision, 3D reconstruction, recommender systems, and intangible cultural heritage. The analysis highlights the upward trend in publications in this field since 2019, indicating a growing interest in the application of AI techniques in CH. The latest research in the field shows that AI techniques are mostly applied to assist CH in discovery, description, classification, and preservation tasks. The study offers important information about the key research areas and emerging trends related to the use of AI techniques in the CH field, helping to recognize the potential, development, and increasing influence of these technologies within the CH domain. The findings of this study contribute to the future development of AI applications in CH, enabling professionals to leverage the advantages of these technologies.
... In addition, some researchers have created datasets that can be used for named entity recognition for their own research purposes. For example, Jain et al. [43] observed that there are no named entity recognition datasets for the art domain and therefore created a dataset for artwork recognition based on the extensive digitized art-historical documents provided by the Wildenstein Plattner Institute (WPI). Similarly, Sahin et al. [44] found that current datasets for named entity recognition and text classification are mainly in English, with very few in Turkish, and, drawing on previous datasets as reference, created the largest Turkish dataset available for named entity recognition and text classification. ...
Article
Named entity recognition, as a fundamental task, plays a crucial role in many tasks and applications in natural language processing. In the age of Internet information, a huge proportion of the information handled by computer applications is stored in structured and unstructured forms and used for language and text processing. Before neural networks were widely used in natural language processing tasks, research in named entity recognition usually focused on leveraging lexical and syntactic knowledge to improve the performance of models and methods. To promote the development of named entity recognition, researchers have for many years been creating named entity recognition datasets through conferences, projects, and competitions, based on various research goals, and training increasingly accurate entity recognition models on this basis. However, the datasets themselves have not been explored in much depth. In particular, many datasets have appeared since the introduction of the named entity recognition task, but there is no clear framework summarizing the development of these seemingly independent datasets. A closer look at the context in which each dataset was developed and the features it contains reveals that these datasets share some common features to varying degrees. In this work, we review the development of named entity recognition datasets over the years and describe them in terms of dataset language, research domain, entity type, entity granularity, and entity annotation. Finally, we provide an idea for the creation of subsequent named entity recognition datasets.
Article
[Purpose/Significance] Annotating a natural-language corpus not only helps researchers extract knowledge from it, but also enables deeper mining of the corpus. Annotated corpora in the humanities domain, however, are scarce, and the semantic annotation of humanities texts is difficult: it demands a strong domain background from researchers and may even require the participation of domain experts. This study therefore proposes a method for detecting entities and relations in domains that lack annotated corpora, and provides a transferable approach for constructing conceptual models from textual instances. [Method/Process] Based on syntactic and semantic features, this study proposes SPO triple recognition rules that give priority to predicates, together with generalization rules based on a triple's content and the meaning of its predicate. The recognition rules are used to extract descriptive, predicate-centered SPO triples from text. After the triples are clustered and adjusted, the generalization rules are applied to obtain coarse-grained entities and relations, which then form a conceptual model. [Results/Conclusions] The method recognizes SPO triples from descriptive texts with high precision and good coverage, generalizes them, and forms a domain conceptual model. It offers a research approach for entity-relation detection in domains with missing annotated corpora, and the resulting conceptual model provides a reference for building a domain Linked Data graph. The feasibility of the method is verified in practice on texts related to the four traditional Chinese festivals.
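The predicate-first idea behind the SPO recognition rules can be sketched as a toy rule over part-of-speech-tagged tokens: locate each predicate (verb), then take the nearest noun to its left as subject and to its right as object. This is a deliberately simplified stand-in for the syntactic/semantic rules in the study, assuming pre-tagged input.

```python
def extract_spo(tagged_tokens):
    """Toy predicate-first SPO extraction over pre-tagged tokens
    [(word, pos), ...]: for each VERB, take the nearest preceding NOUN
    as subject and the nearest following NOUN as object."""
    triples = []
    for i, (word, pos) in enumerate(tagged_tokens):
        if pos != "VERB":
            continue
        # nearest noun to the left of the predicate -> subject
        subj = next((w for w, p in reversed(tagged_tokens[:i]) if p == "NOUN"), None)
        # nearest noun to the right of the predicate -> object
        obj = next((w for w, p in tagged_tokens[i + 1:] if p == "NOUN"), None)
        if subj and obj:
            triples.append((subj, word, obj))
    return triples
```

Generalizing the extracted triples, as the study does, would then cluster them and lift subjects, predicates, and objects to coarser-grained concept and relation labels.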