Jan Ehmueller’s research while affiliated with Hasso Plattner Institute and other places


Publications (6)


Fig. 1. Example of a scanned page.
Fig. 2. Example of digitized text.
Fig. 3. Overview of the framework showing the progressive improvements of the training datasets (as described in Section 4 and summarized in Table 2). Each stage illustrates the enhancements by referring to the datasets obtained with the corresponding steps.
Fig. 4. Getting annotated sentences from Wikipedia.
Fig. 5. NER performance with different training data sizes.

Generation of training data for named entity recognition of artworks
  • Article

August 2022 · 82 Reads · 6 Citations · Semantic Web · Jan Ehmueller

As machine learning techniques are increasingly employed for text processing tasks, the need for training data has become a major bottleneck for their application. Manual generation of large-scale training datasets tailored to each task is a time-consuming and expensive process, which necessitates their automated generation. In this work, we turn our attention to the creation of training datasets for named entity recognition (NER) in the context of the cultural heritage domain. NER plays an important role in many natural language processing systems. Most NER systems are limited to a few common named entity types, such as person, location, and organization. However, for cultural heritage resources, such as digitized art archives, the recognition of fine-grained entity types such as titles of artworks is of high importance. Current state-of-the-art tools are unable to adequately identify artwork titles due to the unavailability of relevant training datasets. We analyse the particular difficulties presented by this domain and motivate the need for quality annotations to train machine learning models for the identification of artwork titles. We present a framework with a heuristic-based approach to create high-quality training data by leveraging existing cultural heritage resources from knowledge bases such as Wikidata. Experimental evaluation shows significant improvement over the baseline in NER performance for artwork titles when models are trained on the dataset generated using our framework.


Sense Tree: Discovery of New Word Senses with Graph-based Scoring

August 2020 · 147 Reads

Language is dynamic and constantly evolving: both the usage context and the meaning of words change over time. Identifying words that acquired new meanings and the point in time at which new word senses emerged is essential for word sense disambiguation and entity linking in historical texts. For example, "cloud" once stood mostly for the weather phenomenon and only recently gained the new sense of cloud computing. We developed a simple graph-based language model to detect and visualize these changes. We evaluate our approach using COHA, a corpus of books and magazines spanning two centuries. Further, we provide a list of words that were annotated by 16 linguists. https://hpi.de/naumann/s/language-evolution



Figure 1: System Architecture
Figure 2: Entity Landscape Explorer
Figure 3: Graph view of the Curation Interface
CurEx: A System for Extracting, Curating, and Exploring Domain-Specific Knowledge Graphs from Text

October 2018 · 441 Reads · 11 Citations

The integration of diverse structured and unstructured information sources into a unified, domain-specific knowledge base is an important task in many areas. A well-maintained knowledge base enables data analysis in complex scenarios, such as risk analysis in the financial sector or investigating large data leaks, such as the Paradise or Panama Papers. Both the creation of such knowledge bases and their continuous maintenance and curation involve many complex tasks and considerable manual effort. With CurEx, we present a modular system that allows structured and unstructured data sources to be integrated into a domain-specific knowledge base. In particular, we (i) enable the incremental improvement of each individual integration component; (ii) enable the selective generation of multiple knowledge graphs from the information contained in the knowledge base; and (iii) provide two distinct user interfaces tailored to the needs of data engineers and end-users respectively. The former has curation capabilities and controls the integration process, whereas the latter focuses on the exploration of the generated knowledge graph.


Citations (5)


... NER, which is the focus of this paper, has been well studied in the literature (Ehrmann et al. 2023; Moscato et al. 2023). The use of NER and other term extraction tools within cultural heritage organisations has been well documented for close to a decade (Aejas et al. 2021; Jain et al. 2022). However, these tools are subjective to the data they have been trained on (van Hooland et al. 2015). ...

Reference:

Large Language Models to make museum archive collections more accessible
Generation of training data for named entity recognition of artworks

Semantic Web

... And in the case of KGs, where issues like inaccuracy and incompleteness widely exist, new evaluation metrics are being proposed, such as those in [170], [171]. Many works suggest having domain experts check the extracted rules before applying them, and a recent work [172] introduces the idea of human-in-the-loop and designs a few-shot knowledge validation framework for interactive quality assessment of rules, which takes rule validation one step forward. ...

Few-Shot Knowledge Validation using Rules

... Domain-specific NER typically needs to introduce domain-specific (sub-)categories of the established named entity (NE) categories or entirely new categories. This is because domain-specific texts contain NE categories that are (1) detailed variants of the standard NE categories, e.g., "Person" is replaced with the domain-specific sub-categories "Players" and "Coaches" [27], (2) standard NE categories extended with a small number of new categories, e.g., "Trigger of a traffic jam" [11,19,22], and (3) domain-derived NE categories, e.g., "Proteins" in biology or "Reactions" in chemistry domains [9,18,25,30]. Most domain-derived NE categories originate from structured classifications or dictionaries [9,12,25] or are derived by manually unifying multiple of them [5]. ...

CurEx: A System for Extracting, Curating, and Exploring Domain-Specific Knowledge Graphs from Text

... Zehra et al. [14] present a financial knowledge graph-based financial report query system utilizing Wikidata and DBpedia, but the construction process lacks repeatability. Loster et al. [15] highlight challenges in the entity resolution, maintenance, and exploration of such graphs. Cheng et al. [16] propose a multi-modality graph neural network (MAGNN) but do not provide sufficient details on the construction process. ...

The Challenges of Creating, Maintaining and Exploring Graphs of Financial Entities