Christian Chiarcos
Goethe University Frankfurt · Angewandte Computerlinguistik (ACoLi), Institut für Informatik
Prof. Dr.
116 Publications
34,575 Reads
1,419 Citations
Introduction
Focus areas:
(1) Natural Language Semantics, with a focus on semantic parsing (Information Extraction / Text Analytics) and deep (discourse) semantics,
(2) Data Science and Knowledge Representation, with a focus on information integration and interoperability for knowledge processing and natural language processing,
(3) Artificial Intelligence and Machine Learning, with a focus on Semantic Web technologies and neural networks, and
(4) Digital Humanities and Computational Philology as an important area of application for such methods, e.g., with respect to historical or low resource languages.
Publications (116)
Limited accessibility to language resources and technologies represents a challenge for the analysis, preservation, and documentation of natural languages other than English. Linguistic Linked (Open) Data (LLOD) holds the promise to ease the creation, linking, and reuse of multilingual linguistic data across distributed and heterogeneous resources....
Using language models to detect or predict the presence of language phenomena in text has become a mainstream research topic. With the rise of generative models, experiments using deep learning and transformer models trigger intense interest. Aspects like precision of predictions, portability to other languages or phenomena, scale have been c...
Linguistic Linked Open Data (LLOD) are technologies that provide a powerful instrument for representing and interpreting language phenomena on a web-scale. The main objective of this paper is to demonstrate how LLOD technologies can be applied to represent and annotate a corpus composed of multiword discourse markers, and what the effects of this a...
This article provides a comprehensive and up-to-date survey of models and vocabularies for creating linguistic linked data (LLD) focusing on the latest developments in the area and both building upon and complementing previous works covering similar territory. The article begins with an overview of some recent trends which have had a significant im...
Discourse markers carry information about discourse structure and organization, and also signal local dependencies or the epistemological stance of the speaker. They provide instructions on how to interpret the discourse, and their study is paramount to understanding the mechanism underlying discourse organization. This paper presents a new language resou...
In this paper, we present cqp4rdf, a set of tools for creating and querying corpora with linguistic annotations. cqp4rdf builds on CQP, an established corpus query language widely used in the areas of computational lexicography and empirical linguistics, and allows it to be applied to corpora represented in RDF. This is in line with the emerging trend o...
We describe the use of linguistic linked data to support a cross-lingual transfer framework for sentiment analysis in the pharmaceutical domain. The proposed system dynamically gathers translations from the Linked Open Data (LOD) cloud, particularly from Apertium RDF, in order to project a deep learning-based sentiment classifier from one language...
With regard to the wider area of AI/LT platform interoperability, we concentrate on two core aspects: (1) cross-platform search and discovery of resources and services; (2) composition of cross-platform service workflows. We devise five different levels (of increasing complexity) of platform interoperability that we suggest implementing in a wider...
We introduce the Flexible and Integrated Transformation and Annotation eNgineering (Fintan) platform for converting heterogeneous linguistic resources to RDF. With its modular architecture, workflow management and visualization features, Fintan facilitates the development of complex transformation pipelines by integrating generic RDF converters and...
In this paper we describe the contributions made by the European H2020 project "Prêt-à-LLOD" ('Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors') to the further development of the Linguistic Linked Open Data (LLOD) infrastructure. Prêt-à-LLOD aims to develop a new methodology for building data value chains applic...
LD technologies allow metadata of datasets to be exposed on the Web in order to improve their automated discovery, sharing and reuse by humans and software agents. In this chapter we deal with the representation of metadata for LRs, with the idea of enabling their cataloguing, discovery and later reuse. We will distinguish two types of metadata: ge...
This chapter introduces the Lexicon Model for Ontologies (lemon) as defined by the Ontolex W3C community group. The model was originally developed to enrich ontologies with lexical information expressing how the elements of the ontology including classes, properties and individuals are referred to in a given language. In this chapter we cover the c...
In previous chapters, we discussed how to model linguistic data sets using the Resource Description Framework as a basis to publish them as linked data on the Web. In this chapter, we describe a methodology that can be followed in the transformation of legacy linguistic datasets into linked data. The methodology comprises different tasks, includ...
This chapter introduces the Linguistic Linked Open Data (LLOD) Cloud. In recent years, there has been increasing interest in publishing linguistic datasets following linked data principles. A number of community-driven activities, foremost organized by the Open Linguistics Working Group (OWLG), have fostered and supported the publication of open li...
This chapter introduces preliminaries that are essential to follow the content in the remainder of this book. First of all, we introduce the core data model of the Semantic Web and linked data, that is the Resource Description Framework, RDF. This format was designed in the 1990s and its core purpose is to represent data and knowledge in a Web-comp...
In this chapter we address the question of how links can be discovered between different datasets published as Linguistic Linked Open Data. We describe common patterns to represent links both between data that are in the same language (monolingual scenario) and between data in different languages (cross-lingual scenario). Further, we describe techn...
The (re-)usability of NLP tools and language resources has long been recognized as a key challenge in the language resource and NLP communities. Reuse of resources, however, requires a minimum level of interoperability, and in this chapter, we focus on conceptual interoperability, i.e. harmonization between different annotation schemas by means of...
This chapter describes how linguistic annotations can be represented in RDF. Web Annotation and NIF provide the means to reference text segments on the web. Yet, representing linguistic annotations requires appropriate vocabularies. We discuss relevant vocabularies and illustrate how they can be applied to support annotation at different levels.
Text annotation consists in defining markables (elements to be annotated), their features (attributes and values of annotations) and relations between markables (e.g. syntactic dependencies or semantic links). In this chapter we describe the principles for annotating text data using RDF-compliant formalisms. These principles provide the basis for m...
The Linguistic Linked Data (LLD) paradigm was introduced about 8 years ago by the Open Linguistics Working Group (OWLG). The original mission of this group was to (1) promote the use of open standards in linguistics; (2) act as a central point of reference and provide support for those interested in open linguistic data; (3) develop best practices...
In this chapter we describe principles and architectures that support the development of NLP workflows and pipelines based on linked data technology. The benefit of NLP workflows that build on linked data standards is that they build on an open set of data models and Web technologies that can be implemented with standard functionality not requiring...
Finding appropriate language resources for a particular research purpose or task is of crucial importance and represents a significant challenge at the same time. Currently, there are a number of distributed data repositories which contain metadata about many language resources. However, the metadata formats and metadata content are not harmonized a...
In recent years, Digital Humanities (DH) has become an increasingly flourishing field of research, often posing novel research challenges that require extensions or revisions of existing technologies. One characteristic of this area is the great heterogeneity of scientific disciplines and user communities involved. This leads to heterogeneity of da...
Wordnets are the most widely used lexical resources in natural language processing (NLP). There exist wordnets in more than 40 languages by now and all of these are connected to the original Princeton WordNet. The origins of linguistic linked data (LD) can thus in some sense be traced to the WordNet project. The implementation of the linking, howev...
Making diverse data in linguistics and the language sciences open, distributed, and accessible: perspectives from language/language acquisition researchers and technical LOD (linked open data) researchers.
This volume examines the challenges inherent in making diverse data in linguistics and the language sciences open, distributed, integrated, and...
This is the first monograph on the emerging area of linguistic linked data. Presenting a combination of background information on linguistic linked data and concrete implementation advice, it introduces and discusses the main benefits of applying linked data (LD) principles to the representation and publication of linguistic resources, arguing that...
This open access volume (https://direct.mit.edu/books/book/4618/Development-of-Linguistic-Linked-Open-Data) examines the challenges inherent in making diverse data in linguistics and the language sciences open, distributed, integrated, and accessible, thus fostering wide data sharing and collaboration. It is unique in integrating the perspectives o...
This paper describes work on the morphological and syntactic annotation of Sumerian cuneiform as a model for low resource languages in general. Cuneiform texts are invaluable sources for the study of history, languages, economy, and cultures of Ancient Mesopotamia and its surrounding regions. Assyriology, the discipline dedicated to their study, ha...
The adaptation of novel techniques and standards in computational lexicography is taking place at an accelerating pace, as manifested by recent extensions beyond the traditional XML-based paradigm of electronic publication. One important area of activity in this regard is the transformation of lexicographic resources into (Linguistic) Linked Open D...
This paper presents an endeavor to transform a scholarly text edition (of a medical treatise written in Middle French) into a digital edition enriched with references to an on-line dictionary. Hitherto published as a book, the resulting digital edition will use RDFa to interlink its vocabulary with the corresponding lexical entries of the Dictionna...
The physical formats used to represent linguistic data and its annotations have evolved over the past four decades, accommodating different needs and perspectives as well as incorporating advances in data representation generally. This chapter provides an overview of representation formats with the aim of surveying the relevant issues for represent...
Understanding the differences underlying the scope, usage and content of language data requires the provision of a clarifying terminological basis which is integrated in the metadata describing a particular language resource. While terminological resources such as the SIL Glossary of Linguistic Terms, ISOcat or the GOLD ontology provide a considera...
We introduce CoNLL-RDF, a direct rendering of the CoNLL format in RDF, accompanied by a formatter whose output mimics CoNLL’s original TSV-style layout. CoNLL-RDF represents a middle ground that accounts for the needs of NLP specialists (easy to read, easy to parse, close to conventional representations), but that also facilitates LLOD integration...
Interlinear glossed text (IGT) is a notation used in various fields of linguistics to provide readers with a way to understand the linguistic phenomena. We describe the representation of IGT data in RDF, the conversion from two popular tools, and their automated linking with resources from the Linguistic Linked Open Data (LLOD) cloud. We argue that...
We introduce an attention-based Bi-LSTM for Chinese implicit discourse relations and demonstrate that modeling argument pairs as a joint sequence can outperform word order-agnostic approaches. Our model benefits from a partial sampling scheme and is conceptually simple, yet achieves state-of-the-art performance on the Chinese Discourse Treebank. We...
This book constitutes the combined refereed proceedings of the ISWC Satellite Workshops KEKI and NLP&DBpedia 2016, which were held in conjunction with ISWC 2016 in Kobe, Japan, in October 2016. The 9 papers presented were carefully selected and reviewed from 20 submissions. They focus on the use of linguistic linked open data, the linguistic aspects of...
This book constitutes the proceedings of the First International Conference on Language, Data and Knowledge, LDK 2017, held in Galway, Ireland, in June 2017. The 14 full papers and 19 short papers included in this volume were carefully reviewed and selected from 68 initial submissions. They deal with language data; knowledge graphs; applications in...
The Open Linguistics Working Group (OWLG) brings together researchers from various fields of linguistics, natural language processing, and information technology to present and discuss principles, case studies, and best practices for representing, publishing and linking linguistic data collections. A major outcome of our work is the Linguistic Link...
We present experiments on word segmentation for Akkadian cuneiform, an ancient writing system and a language used for about 3 millennia in the ancient Near East. To the best of our knowledge, this is the first study of this kind applied to either the Akkadian language or the cuneiform writing system. As a logosyllabic writing system, cuneiform structurall...
This paper describes the Ontologies of Linguistic Annotation (OLiA) as one of the data sets currently available as part of the Linguistic Linked Open Data (LLOD) cloud. Within the LLOD cloud, the OLiA ontologies serve as a reference hub for annotation terminology for linguistic phenomena across a broad range of languages; they have been used to facili...
We introduce lemonUby, a new lexical resource integrated in the Semantic Web which is the result of converting data extracted from the existing large-scale linked lexical resource UBY to the lemon lexicon model. The following data from UBY were converted: WordNet, FrameNet, VerbNet, English and German Wiktionary, the English and German entries of O...
With our poster and the accompanying demo, we present current progress on the information-technological support for scholars and students of cuneiform. For a period of about 3000 years, cuneiform was the dominant writing system of the Ancient Near East, with a rich literary tradition in several languages, and an extensive amount of texts preserved...
We describe a minimalist approach to shallow discourse parsing in the context of the CoNLL 2015 Shared Task. Our parser integrates a rule-based component for argument identification and data-driven models for the classification of explicit and implicit relations. We place special emphasis on the evaluation of implicit sense labeling; we present di...
We propose a generic, memory-based approach for the detection of implicit semantic roles. While state-of-the-art methods for this task combine hand-crafted rules with specialized and costly lexical resources, our models use large corpora with automated annotations for explicit semantic roles only to capture the distribution of predicates and their...
We provide an overview of on-going efforts to facilitate the study of older Germanic languages currently pursued at the Goethe-University Frankfurt, Germany. We describe created resources, such as a parallel corpus of Germanic Bibles and a morphosyntactically annotated corpus of Old High German (OHG) and Old Saxon, a lexicon of OHG in XML and a mul...
For the study of historical language varieties, the sparsity of training data imposes immense problems on syntactic annotation and the development of NLP tools that automate the process. In this paper, we explore strategies to compensate for the lack of training data by including data from related varieties in a series of annotation projection experi...
‘Open Data’ has become very important in a wide range of fields. However, in linguistics, much data is still published in proprietary, closed formats and is not made available on the web. We propose the use of linked data principles to enable language resources to be published and interlinked openly on the web, and we describe the application of th...
We describe ongoing community efforts to create a Linked Open Data (sub-)cloud of linguistic resources, with an emphasis on resources that are specific to linguistic research, namely annotated corpora and linguistic databases. We argue that for both types of resources, the application of the Linked Open Data paradigm and the representation in RDF...
This paper describes a novel approach towards the empirical approximation of discourse relations between different utterances in texts. Following the idea that every pair of events comes with preferences regarding the range and frequency of discourse relations connecting both parts, the paper investigates whether these preferences are manifested in...
This paper describes POWLA, a generic formalism to represent linguistic annotations in an interoperable way by means of OWL/DL. Unlike other approaches in this direction, POWLA is not tied to a specific selection of annotation layers, but it is designed to support any kind of text-oriented annotation.
The Open Linguistics Working Group (OWLG) is an initiative of experts from different fields concerned with linguistic data, including academic linguistics (e.g. typology, corpus linguistics), applied linguistics (e.g. computational linguistics, lexicography and language documentation) and NLP (e.g. from the Semantic Web community). The primary goal...
The contributions of this part have described recent activities of the OWLG as a whole and of individual OWLG members aiming to provide linguistic resources as Linked Data. Here, we describe how linguistic resources can be linked with each other, and we illustrate possible use cases of information integration from various sources with example queri...
This paper announces the release of the Ontologies of Linguistic Annotation (OLiA). The OLiA ontologies represent a repository of annotation terminology for various linguistic phenomena across a broad range of languages. This paper summarizes the results of five years of research and describes recent developments and directions for further researc...
The explosion of information technology in the last two decades has led to a substantial growth in quantity, diversity and complexity of web-accessible linguistic data. These resources become even more useful when linked with each other, and the last few years have seen the emergence of numerous approaches in various disciplines concerned with ling...
This paper describes the application of OWL and RDF to address the interoperability of linguistic corpora and linguistic annotations within such corpora. Interoperability of linguistic corpora involves two aspects: Structural interoperability (annotations of different origin are represented using the same formalism) and conceptual interoperability...
Given the contemporary trend towards modular NLP architectures and multiple annotation frameworks, the existence of concurrent tokenizations of the same text represents a pervasive problem in everyday NLP practice and poses a non-trivial theoretical problem for the integration of linguistic annotations and their interpretability in general. This paper...
We describe the application of a framework for salience metrics and linguistic variability with respect to the contextually adequate choice of referring expressions and grammatical roles: Where multiple meaning-equivalent candidate realizations are available that differ in one of these aspects, NLG systems can apply salience metrics to predict c...
In this paper, we describe tools and resources for the study of African languages developed at the Collaborative Research Centre "Information Structure". These include deeply annotated data collections of 25 sub-Saharan languages that are described together with their annotation scheme, and further, the corpus tool ANNIS that provides a unifie...
This paper describes the modeling of the morphosyntactic annotations of the MULTEXT-East corpora and lexicons as an OWL/DL ontology. Formalizing annotation schemes in OWL/DL has the advantages of enabling formally specifying interrelationships between the various features and making logical inferences based on the relationships between them. We sho...
The Open Linguistics Working Group (OWLG) is an initiative of experts from different fields concerned with linguistic data, including academic linguistics (e.g. typology, corpus linguistics), applied linguistics (e.g. computational linguistics, lexicography and language documentation), and NLP (e.g. from the Semantic Web community). The primary goa...
This paper describes the creation of a resource of German sentences with multiple automatically created alternative syntactic analyses (parses) for the same text, and how qualitative and quantitative investigations of this resource can be performed using ANNIS, a tool for corpus querying and visualization. Using the example of PP attachment, we...
This paper describes a series of experiments to test the hypothesis that the parallel application of multiple NLP tools and the integration of their results improves the correctness and robustness of the resulting analysis. It is shown how annotations created by seven NLP tools are mapped onto tool-independent descriptions that are defined with ref...
A crucial step in the development of NLP systems is a detailed error analysis. Our system demonstration presents the infrastructure and the workflow for training classifiers for different NLP tasks and the verification of their predictions on annotated corpora. We describe an enhancement cycle of subsequent steps of classification and context-sensi...
Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories. Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti. NEALT Proceedings Series, Vol. 9 (2010), 91-102. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronic...
ANNIS (see Dipper & Götze 2005; Chiarcos et al. 2008) is a flexible web-based corpus architecture for search and visualization of multi-layer linguistic corpora. By multi-layer we mean that the same primary datum may be annotated independently with (i) annotations of different types (spans, DAGs with labelled edges and arbitrary pointing relations...
The present paper reports on the development and evaluation of a historical corpus designed to support detailed empirical studies on the interaction of information structure and syntax in Old High German (OHG). The creation and exploration of this corpus are part of a more general investigation concerning the role of information-structural facto...