Marieke van Erp's research while affiliated with Netherlands Institute of Ecology (NIOO-KNAW) and other places

Publications (64)

Conference Paper
Full-text available
The growing interest in named entity recognition (NER) in various domains has led to the creation of different benchmark datasets, often with slightly different annotation guidelines. To better understand the different NER benchmark datasets for the domain of English literature and their impact on the evaluation of NER tools, we analyse two existin...
Article
Full-text available
This paper presents an overview of the LL(O)D and NLP methods, tools and data for detecting and representing semantic change, with its main application in humanities research. The paper’s aim is to provide the starting point for the construction of a workflow and set of multilingual diachronic ontologies within the humanities use case of the COST A...
Poster
Full-text available
The calculation of environmental impacts from recipes remains a barrier to effective uptake of sustainable diets. In our project, we use pilot digital humanities methods to explore digitised recipe texts from websites in English, Dutch and German. Using the natural language processing toolkit GATE [1], we have developed customised tools to automati...
Preprint
Full-text available
Digital sources are more prevalent than ever but effectively using them can be challenging. One core challenge is that digitized sources are often distributed, thus forcing researchers to spend time collecting, interpreting, and aligning different sources. A knowledge graph can accelerate research by providing a single connected source of truth tha...
Preprint
Full-text available
In this work, we fill the gap in the Semantic Web in the context of Cultural Symbolism. Building upon earlier work, we introduce the Simulation Ontology, an ontology that models the background knowledge of symbolic meanings, developed by combining the concepts taken from the authoritative theory of Simulacra and Simulations of Jean Baudrillard w...
Chapter
Current AI technologies and data representations often reflect the popular or majority vote. This is an inherent artefact of the frequency bias of many statistical analysis methods that are used to create for example knowledge graphs, resulting in simplified representations of the world in which diverse perspectives are underrepresented. With the u...
Article
Full-text available
In this paper, we discuss the use of natural language processing (NLP) and artificial intelligence (AI) to analyse nutritional and sustainability aspects of recipes and food. We present the state of the art and some use cases, followed by a discussion of challenges. Our perspective on addressing these is that while they typically have a technical n...
Chapter
In this paper, we present our work on semantic deep mapping at scale by combining information from various sources and disciplines to study historical Amsterdam. We model our data according to semantic web standards and ground them in space and time such that we can investigate what happened at a particular time and place from a linguistics, socio-...
Preprint
Full-text available
One of the grand challenges discussed during the Dagstuhl Seminar "Knowledge Graphs: New Directions for Knowledge Representation on the Semantic Web" and described in its report is that of a: "Public FAIR Knowledge Graph of Everything: We increasingly see the creation of knowledge graphs that capture information about the entirety of a class of ent...
Chapter
One of the most important goals of digital humanities is to provide researchers with data and tools for new research questions, either by increasing the scale of scholarly studies, linking existing databases, or improving the accessibility of data. Here, the FAIR principles provide a useful framework. Integrating data from diverse humanities domain...
Preprint
Full-text available
Environmental factors determine the smells we perceive, but societal factors shape the importance, sentiment and biases we give to them. Descriptions of smells in text, or as we call them 'smell experiences', offer a window into these factors, but they must first be identified. To the best of our knowledge, no tool exists to extract referen...
Preprint
Full-text available
One of the most important goals of digital humanities is to provide researchers with data and tools for new research questions, either by increasing the scale of scholarly studies, linking existing databases, or improving the accessibility of data. Here, the FAIR principles provide a useful framework as these state that data needs to be: Findable,...
Conference Paper
Full-text available
The Web of Data has grown explosively over the past few years, and as with any dataset, there are bound to be invalid statements in the data, as well as gaps. Natural Language Processing (NLP) is gaining interest to fill gaps in data by transforming (unstructured) text into structured data. However, there is currently a fundamental mismatch in appr...
Article
Full-text available
The analysis of literary works has experienced a surge in computer-assisted processing. To obtain insights into the community structures and social interactions portrayed in novels, the creation of social networks from novels has gained popularity. Many methods rely on identifying named entities and relations for the construction of these networks,...
Preprint
Full-text available
Linked Open Data (LOD) is the publicly available RDF data on the Web. Each LOD entity is identified by a URI and accessible via HTTP. LOD encodes global-scale knowledge potentially available to any human as well as artificial intelligence that may want to benefit from it as background knowledge for supporting their tasks. LOD has emerged as the backb...
Preprint
Full-text available
The analysis of literary works has experienced a surge in computer-assisted processing. To obtain insights into the community structures and social interactions portrayed in novels, the creation of social networks from novels has gained popularity. Many methods rely on identifying named entities and relations for the construction of these networks,...
Preprint
The analysis of literary works has experienced a surge in computer-assisted processing. To obtain insights into the community structures and social interactions portrayed in novels, the creation of social networks from novels has gained popularity. Many methods rely on identifying named entities and relations for the construction of these networks,...
Conference Paper
Recognising entities in a text and linking them to an external resource is a vital step in creating a structured resource (e.g. a knowledge base) from text. This allows semantic querying over a dataset, for example selecting all politicians or football players. However, traditional named entity recognition systems only distinguish a limited number...
Conference Paper
The task of entity linking (EL) is often perceived as an algorithmic problem, where the novelty of systems lies in the decision making process, while the knowledge is relatively fixed. As a consequence, we lack an understanding about the importance and the relevance of diverse knowledge types in EL. However, knowledge and relevance are crucial: fol...
Conference Paper
Many entity recognition approaches classify recognised entities into a limited set of coarse-grained entity types. However, for deeper natural language analysis and end-user tasks, fine-grained entity types are more useful. For example, while standard named entity recognition may determine that an entity is a person, knowing whether that entity is a...
Article
The large number of tweets generated daily is providing decision makers with means to obtain insights into recent events around the globe in near real-time. The main barrier for extracting such insights is the impossibility of manual inspection of a diverse and dynamic amount of information. This problem has attracted the attention of industry and...
Book
This book constitutes the combined refereed proceedings of ISWC Satellite Workshops KEKI and NLP&DBpedia 2016 which were held in conjunction with ISWC 2016 in Kobe, Japan, in October 2016. The 9 papers presented were carefully selected and reviewed from 20 submissions. They focus on the use of linguistic linked open data, the linguistic aspects of...
Poster
Full-text available
Entities and events in the world have no frequency, but our communication about them and the words we use to refer to them do have a strong frequency profile. Language expressions and their meanings follow a Zipfian distribution, featuring a small amount of very frequent observations and a very long tail of low frequent observations. Since our NLP...
Article
Full-text available
In this article, we describe a system that reads news articles in four different languages and detects what happened, who is involved, where and when. This event-centric information is represented as episodic situational knowledge on individuals in an interoperable RDF format that allows for reasoning on the implications of the events. Our system...
Conference Paper
Finding relevant resources on the Semantic Web today is a dirty job: no centralized query service exists and the support for natural language access is limited. We present LOTUS: Linked Open Text UnleaShed, a text-based entry point to a massive subset of today's Linked Open Data Cloud. Recognizing the use case dependency of resource retrieval, LO...
Conference Paper
Full-text available
More and more knowledge bases are publicly available as linked data. Since these knowledge bases contain structured descriptions of real-world entities, they can be exploited by entity linking systems that anchor entity mentions from text to the most relevant resources describing those entities. In this paper, we investigate adaptation of the entit...
Conference Paper
Full-text available
We describe a novel modular system for cross-lingual event extraction for English, Spanish, Dutch and Italian texts. The system consists of a ready-to-use modular set of advanced multilingual Natural Language Processing (NLP) tools. The pipeline integrates modules for basic NLP processing as well as more advanced tasks such as cross-lingual Named...
Article
Knowledge graphs have gained increasing popularity in the past couple of years, thanks to their adoption in everyday search engines. Typically, they consist of fairly static and encyclopedic facts about persons and organizations–e.g. a celebrity’s birth date, occupation and family members–obtained from large repositories such as Freebase or Wikiped...
Conference Paper
Full-text available
It is difficult to find resources on the Semantic Web today, in particular if one wants to search for resources based on natural language keywords and across multiple datasets. In this paper, we present LOTUS: Linked Open Text UnleaShed, a full-text lookup index over a huge Linked Open Data collection. We detail LOTUS' approach, its implementation,...
Article
For biodiversity research, the field of study that is concerned with the richness of species of our planet, it is of the utmost importance that the location of an animal specimen find is known with high precision. Due to specimens often having been collected over the course of many years, their accompanying geographical data is often ambiguous or m...
Article
Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Information extraction fr...
Conference Paper
Full-text available
Aligning named entity taxonomies for comparing or combining different named entity extraction systems is a difficult task. Often taxonomies are mapped manually onto each other or onto a standardized ontology but at the loss of subtleties between different class extensions and domain specific uses of the taxonomy. In this paper, we present an approa...
Article
Full-text available
During the nineties of the last century, historians and computer scientists together created a research agenda around the life cycle of historical information. It comprised the tasks of creation, design, enrichment, editing, retrieval, analysis and presentation of historical information with the help of information technology. They also identified a nu...
Conference Paper
Full-text available
Named entity recognition and disambiguation are important for information extraction and populating knowledge bases. Detecting and classifying named entities has traditionally been taken on by the natural language processing community, whilst linking of entities to external resources, such as DBpedia and GeoNames, has been the domain of the Semant...
Conference Paper
Full-text available
Repeating experiments is an important instrument in the scientific toolbox to validate previous work and build upon existing work. We present two concrete use cases involving key techniques in the NLP domain for which we show that reproducing results is still difficult. We show that the deviation that can be found in reproduction efforts leads...
Article
Full-text available
In this paper, we present an evaluation framework for online access to cultural heritage. The framework enables the assessment of online cultural heritage applications in terms of their provision and support of information and interpretation. It is anchored in digital hermeneutics: the study and theory of the Web as a vehicle of (self)-interpretati...
Conference Paper
Semantic web applications are integrating data from more and more different types of sources about events. However, most data annotation frameworks do not translate well to semantic web. We describe the grounded annotation framework (GAF), a two-layered framework that aims to build a bridge between mentions of events in a data source such as a text...
Article
Full-text available
There is an abundance of semi-structured reports on events being written and made available on the World Wide Web on a daily basis. These reports are primarily meant for human use. A recent movement is the addition of RDF metadata to make automatic processing by computers easier. A fine example of this movement is the Open Government Data initia-...
Article
Sports events data is often compiled manually by companies who rarely make it available for free to third parties. However, social media provide us with large amounts of data that discuss these very same matches for free. In this study, we investigate to what extent we can accurately extract sports data from tweets talking about soccer matches. We...
Article
There is a need to share linguistic resources, but reuse is impaired by a number of constraints including lack of common formats, differences in conceptual notions, and unsystematic metadata. In this contribution, the five most important constraints and the tasks necessary to overcome these issues are detailed. These constraints lie in the design...
Chapter
Full-text available
The natural history domain is rich in information. For hundreds of years, biodiversity researchers have collected specimens and samples, and meticulously recorded the how, what, and where of these objects of research. To retrace this information, however, deep knowledge of the collection and patience is necessary. Whereas traditional access methods...
Article
Full-text available
Cultural heritage institutions are currently rethinking access to their collections to allow the public to interpret and contribute to their collections. In this work, we present the Agora project, an interdisciplinary project in which Web technology and theory of interpretation meet. This we call digital hermeneutics. The Agora project facilitate...
Conference Paper
Full-text available
Within cultural heritage collections, objects are often grounded in a particular historical setting. This setting can currently not be made explicit, as structured descriptions of events are either missing or not marked up explicitly. This paper reports a study on automatic extraction of an historical event thesaurus from unstructured texts. We sho...
Conference Paper
Full-text available
There is an abundance of semi-structured reports on events being written and made available on the World Wide Web on a daily basis. These reports are primarily meant for human use. A recent movement is the addition of RDF metadata to make automatic processing by computers easier. A fine example of this movement is the Open Government Data initiativ...
Article
Full-text available
The amount and type of errors found in cultural heritage databases were assessed by a study in which a random sample of a database was manually checked for errors. The database from the Dutch National Museum of Natural History, Naturalis, was used that contained information about reptile and amphibian specimens in the museum's collection. A specifi...
Article
Importing large amounts of data into databases does not always go without the loss of important information. In this work, methods are presented that aim to rediscover this information by inferring it from the information that is available in the database. From an animal specimen database, the information to which expedition an animal that was...
Article
Full-text available
Within cultural heritage collections, objects are often grounded in a particular historical setting. This setting can currently not be made explicit, as structured descriptions of events are either missing or not marked up explicitly. This poster reports a study on automatic extraction of an historical event thesaurus from unstructured texts. We...

Citations

... The need for an ontology that can enable us to be consistent in the annotation of olfactory information across studies was reported by Tonelli and Menini et al. (2021) [89] and [90]. On the basis of this ontology, the multilingual Odeuropa benchmark dataset was released [91]. The Odeuropa benchmark dataset is multilingual and consists of historical texts. ...
... The concept discussed in [25] of a polyvocal and contextualised SW draws attention to the fact that these knowledge sources often represent simplified views of the world, in which diverse perspectives may be underrepresented. In this light, the identification, representation and usage of different views or voices constitutes one of the main challenges in addressing that SW technologies often reflect the popularity or majority vote. ...
... In use case 3, both personal information such as person names and names of the state offices may differ throughout the time. Alternatively, the meaning of some concepts evolves without changing the name [12]. A typical example of such changes is occupation names that occur in use cases 1 and 3. Occupation functions may change over time, so their definition will vary across different time layers, even though the name stays the same. ...
... For example, NER helps extract named entities such as person names, organizations, locations, etc. from the text. In our context, this can help identify different characters of a story [18]. Dependency parsing finds relationships between words. ...
... Transforming information into knowledge requires Machine Learning/Artificial Intelligence (ML/AI) techniques, such as Named Entity Recognition (NER) and Representative Learning for describing information in a form of knowledge (e.g., as a knowledge graph). Each of these technologies, however, carries a potential risk of introducing errors as a result of improper digitization or storage of information, and the errors introduced due to the geographic and cultural differences in the meaning of the data and its interpretation (29). Furthermore, by its very nature, ML can perpetuate the existing biases in the data, misrepresenting aspects of reality. ...
... Theretofore, in Chapter 1 we presented the domain description and the state of the art study based on the life-cycle of ontology evolution proposed by [Zablith et al., 2015]. We presented the current limitations for the state of the art research work. ...
... The digital humanities domain, and in particular historical documents and literary criticism, are perhaps the closest scenarios to our use case that come to mind. Several works in this domain like the historical documents from the Impresso collection (Ehrmann et al., 2020), the multilingual news corpora MeanTime (Minard et al., 2016) and Dekker's work on extracting small snippets of literary criticism from social media (Dekker et al., 2018) have served as a starting point in our journey, helping us to define our annotation guidelines. ...
... However, OCR struggles with noisy document images. 44,45 In Ref. 44, for example, lexicons were used to classify recipes in digitized historical newspapers, and the performance of the classifier dropped because those relatively clean lexicons could not address or cover the various distortions in the digital texts caused by noise. Similarly, Lansdall-Welfare et al. 45 sought to identify and extract words to classify and represent major historical British events in digitized historical newspapers. ...
... In [6] language technology was applied for extracting entities and relations in RDF using Dutch biographies in the BiographyNet. This work was part of the larger NewsReader project extracting data from news [46]. This line of research is similar to ours, based on the idea of extracting RDF data from unstructured biographical texts. ...
... Although the named entity recognition task is well studied, it still faces multiple challenges (Li et al., 2022a), namely, NER in domain-specific areas (Weber et al., 2021), NER from noisy data (Derczynski et al., 2017), code-mixed data (Fetahu et al., 2021), and detection of fine-grained and nested named entities (Kim and Kim, 2021; Ringland et al., 2019; Loukachevitch et al., 2021). This is caused by several issues: defining boundaries of compound terms; recognition of whether a lexical unit is part of a compound term; identification of a lexical unit as a term depending on the context and topic of the text in which this lexical unit is used etc. ...