Abstract
We report on completed work in a project concerned with providing methods, tools, best-practice guidelines, and solutions for sustainable linguistic resources. The article discusses several general aspects of sustainability and introduces an approach to normalizing corpus data and metadata records. Moreover, the architecture of the sustainability platform implemented by the authors is described.
... Similarly, sustainability is an issue for the tools that are used in digital humanities (see Burdick et al. 2012; and, for related points in corpus linguistics, e.g. Rehm et al. 2009; Kesäniemi et al. 2018). Often, digitised materials are made accessible and searchable through online interfaces. ...
The corpus linguistic study of narrative fiction is not a simple application of existing corpus methods to just another set of texts. It requires consideration of the properties of the texts under analysis, as well as the nature of the questions that can be addressed. The focus of this chapter is on novels as a specific type of narrative fiction. The chapter is particularly concerned with how corpus methods can be used to study novels as fiction, i.e. with an emphasis on the fictional worlds in the texts rather than exclusively on the linguistic features that define a register compared to other registers. The chapter outlines a variety of approaches to fiction by relating corpora to other digital resources and considers how to narrow down a starting point for a corpus linguistic study. To understand what corpus linguistics can do for the study of novels, the chapter reflects on what is special about narrative fiction and discusses patterns and functions of the verb form 'looking' as an example of body language descriptions of fictional characters. The chapter concludes by considering directions for the future of corpus research and its relationship to the wider digital humanities.
... (4) Similarly, other community-maintained vocabularies are linked with OLiA, e.g., the CLARIN Concept Registry (Chiarcos et al., 2020). OLiA was developed as part of an infrastructure for the sustainable maintenance of linguistic resources (Wörner et al., 2006; Schmidt et al., 2006; Rehm et al., 2008b; Witt et al., 2009; Rehm et al., 2009). Its field of application included the formalization of annotation schemes and concept-based querying over heterogeneously annotated corpora (Rehm et al., 2008a). ...
With regard to the wider area of AI/LT platform interoperability, we concentrate on two core aspects: (1) cross-platform search and discovery of resources and services; (2) composition of cross-platform service workflows. We devise five different levels (of increasing complexity) of platform interoperability that we suggest implementing in a wider federation of AI/LT platforms. We illustrate the approach using the five emerging AI/LT platforms AI4EU, ELG, Lynx, QURATOR and SPEAKER.
... The user can either click on the tree or modify the LISP expression to generalize the query. SPLICR also contains a graphical tree editor tool (Rehm et al., 2009). According to Shneiderman and Plaisant (2010), query-by-example has largely fallen out of favor as a user interface design approach. ...
A common task in qualitative data analysis is to characterize the usage of a linguistic entity by issuing queries over syntactic relations between words. Previous interfaces for searching over syntactic structures require programming-style queries. User interface research suggests that it is easier to recognize a pattern than to compose it from scratch; therefore, interfaces for non-experts should show previews of syntactic relations. What these previews should look like is an open question that we explored with a 400-participant Mechanical Turk experiment. We found that syntactic relations are recognized with 34% higher accuracy when contextual examples are shown than a baseline of naming the relations alone. This suggests that user interfaces should display contextual examples of syntactic relations to help users choose between different relations.
... The combined requirements of source preservation and cumulative annotations lead to the principle of stand-off annotation, the separation of annotations and source materials [15]. For long-term access, all data must be available in well-understood, preferably open, formats [16,17]. It is essential that access to and processing of the files is not restricted to specific applications or computer platforms, as this would compromise the long-term access and integrity of the corpus. ...
Research into spoken language has become more visual over the years. Both fundamental and applied research have progressively included gestures, gaze, and facial expression. Corpora of multimodal conversational speech are rare and frequently difficult to use due to privacy and copyright restrictions. In contrast, Free-and-Libre corpora would allow anyone to add incremental annotations and improvements, distributing the cost of construction and maintenance. A freely available annotated corpus is presented with high-quality video recordings of face-to-face conversational speech. An effort has been made to remove copyright and use restrictions. Annotations have been processed into RDBMS tables that allow SQL queries and direct connections to statistical software. A few simple examples are presented to illustrate the use of a database of annotated speech. From our experiences, we would like to advocate the formulation of "best practices" for both the legal handling and database storage of recordings and annotations.
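The RDBMS export mentioned above can be pictured with a small relational sketch. The schema, column names, and toy data below are assumptions for illustration, not the corpus's actual tables:

```python
import sqlite3

# Hypothetical schema for annotations exported to an RDBMS: one row per
# annotation interval. Table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE annotation (
    tier    TEXT,   -- e.g. 'words', 'gesture', 'gaze'
    t_start REAL,   -- interval start in seconds
    t_end   REAL,   -- interval end in seconds
    label   TEXT)""")
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?)", [
    ("words",   0.0, 0.4, "hello"),
    ("words",   0.5, 0.9, "there"),
    ("gesture", 0.3, 1.0, "wave"),
])

# Cross-tier query: which words temporally overlap a gesture? This is the
# kind of question that motivates storing annotations in SQL tables.
rows = conn.execute("""
    SELECT w.label, g.label
    FROM annotation w JOIN annotation g
      ON w.tier = 'words' AND g.tier = 'gesture'
     AND w.t_start < g.t_end AND g.t_start < w.t_end
""").fetchall()
print(rows)  # -> [('hello', 'wave'), ('there', 'wave')]
```

Because every tier lives in the same table, the overlap predicate works uniformly for any pair of annotation layers, which is what makes the direct SQL connection to statistical software attractive.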
In this paper, we overview the ways in which computational methods can serve the goals of analysis and theory development in linguistics, and encourage the reader to become involved in the emerging cyberinfrastructure for linguistics. We survey examples from diverse subfields of how computational methods are already being used, describe the current state of the art in cyberinfrastructure for linguistics, sketch a pie-in-the-sky view of where the field could go, and outline steps that linguists can take now to bring about better access to and use of linguistic data through cyberinfrastructure.
Information technology boosts the development of database retrieval in the Chinese digital humanities domain. However, most database providers adopt a system-oriented design pattern, which fails to handle the problem of query gaps in users' retrieval process. This issue seriously hinders the effective use of database retrieval functionalities, particularly among historical and humanities researchers. To address it, we propose UFTDRDH, a novel user-oriented solution based on automatic query formulation (AQF) technologies, which integrates a human–machine interactive module for the selection of new query-related expansion terms and a powerful query expansion algorithmic component (UFTDRDH-QEV) optimised by a topic-enhancing relevance feedback model approach (ToQE). To verify the effectiveness of UFTDRDH, several comparative experiments are conducted, including quantitative evaluation of retrieval efficiency and user satisfaction, as well as qualitative studies of interpretative traceability. The empirical results are multidimensional and robust: they not only show the positive effects of different AQFs on gap reduction, especially the importance of query expansion as the most effective technology, but also underline the remarkably advantageous performance of UFTDRDH compared with traditional system-oriented automatic query expansion in different task contexts. We believe the application of UFTDRDH can further strengthen the research focus on user-centred design and improve the level of current full-text database retrieval in the field of Chinese digital humanities. Broadly speaking, this solution can also be extended to full-text database retrieval in other languages and digital humanities domains.
The annotation of textual information is a fundamental activity in Linguistics and Computational Linguistics. This article presents various observations on annotations. It approaches the topic from several angles, including Hypertext, Computational Linguistics and Language Technology, Artificial Intelligence, and Open Science. Annotations can be examined along different dimensions. In terms of complexity, they can range from trivial to highly sophisticated; in terms of maturity, from experimental to standardised. Annotations can themselves be annotated using more abstract annotations. Primary research data, such as text documents, can be annotated on different layers concurrently; these layers are independent but can be exploited using multi-layer querying. Standards guarantee interoperability and reusability of data sets. The article concludes with four final observations, formulated as research questions or rather provocative remarks on the current state of annotation research.
This article shows that the TEI tag set for feature structures can be adopted to represent a heterogeneous set of linguistic corpora. The majority of corpora are annotated using markup languages that are based on the Annotation Graph framework, the upcoming Linguistic Annotation Format ISO standard, or according to tag sets defined by or based upon the TEI guidelines. A unified representation comprises the separation of conceptually different annotation layers contained in the original corpus data (e.g. syntax, phonology, and semantics) into multiple XML files. These annotation layers are linked to each other implicitly by the identical textual content of all files. A suitable data structure for the representation of these annotations is a multi-rooted tree, which in turn can be represented by the TEI and ISO tag set for feature structures. The mapping process and representational issues are discussed, as well as the advantages and drawbacks associated with the use of the TEI tag set for feature structures as a storage and exchange format for linguistically annotated data.
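The implicit linking of layers through identical textual content can be sketched as follows. The element names and the simple identity check are illustrative assumptions, not the actual TEI feature-structure serialization:

```python
import xml.etree.ElementTree as ET

# Two illustrative annotation layers over the same primary text; the element
# names are invented for this sketch and are not real TEI markup.
syntax_layer = ET.fromstring(
    "<layer type='syntax'><np>the dog</np> <vp>barks</vp></layer>")
phon_layer = ET.fromstring(
    "<layer type='phonology'>"
    "<syll>the</syll> <syll>dog</syll> <syll>barks</syll></layer>")

def text_content(root):
    """Concatenate all character data of a layer, ignoring the markup."""
    return "".join(root.itertext())

# The layers are linked only implicitly: their textual content is identical,
# so a multi-rooted tree over the shared text can be reconstructed by
# aligning character positions across files.
assert text_content(syntax_layer) == text_content(phon_layer)
print(text_content(syntax_layer))  # -> the dog barks
```

Each layer remains a well-formed tree on its own; overlap between layers (here NP/VP versus syllables) only appears when the trees are superimposed over the shared text.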
This paper discusses work on the sustainability of linguistic resources as it was conducted in various projects, including the work of a three-year project, Sustainability of Linguistic Resources, which finished in December 2008; a follow-up project, Sustainable Linguistic Data; and initiatives related to the work of the International Organization for Standardization (ISO) on developing standards for linguistic resources. The individual projects have been conducted at German collaborative research centres at the Universities of Potsdam, Hamburg, and Tübingen, where the sustainability work was coordinated.
In this chapter, two different ways of grouping information represented in document markup are examined: annotation levels, referring to conceptual levels of description, and annotation layers, referring to the technical realisation of markup using, e.g., document grammars. In many current XML annotation projects, multiple levels are integrated into one layer, often leading to the problem of having to deal with overlapping hierarchies. As a solution, we propose a framework of multiple, independent XML annotation layers for one text, based on an abstract representation of XML documents with logical predicates. Two realisations of the abstract representation are presented: a Prolog fact base format together with an application architecture, and a specification for XML native databases. We conclude with a discussion of projects that have currently adopted this framework.
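The abstract representation with logical predicates might be pictured roughly like this in Python. The predicate shape (`element(layer, tag, start, end)` over character positions) and the sample spans are invented for the sketch, not the paper's actual Prolog format:

```python
# Each annotation layer is reduced to logical predicates over character
# positions of the shared primary text. Because layers only refer to the
# text, overlapping hierarchies across layers cause no markup conflicts.
text = "the dog barks"

# element(layer, tag, start_char, end_char) -- illustrative facts
facts = [
    ("syntax", "np", 0, 7),
    ("syntax", "vp", 8, 13),
    ("morph",  "w",  0, 3),
    ("morph",  "w",  4, 7),
    ("morph",  "w",  8, 13),
]

def contained_in(s1, e1, s2, e2):
    """True if span (s1, e1) lies inside span (s2, e2): possible dominance."""
    return s2 <= s1 and e1 <= e2

# Cross-layer query: which morphological words fall inside the syntactic NP?
np = facts[0]
words = [text[s:e] for (layer, tag, s, e) in facts
         if tag == "w" and contained_in(s, e, np[2], np[3])]
print(words)  # -> ['the', 'dog']
```

In the Prolog realisation the same query would be a conjunction of goals over the fact base; the point of the abstraction is that both the Prolog store and an XML native database can answer it over the same predicates.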
'Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions - audio, video and/or physiological recordings - or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, 'named entity' identification, coreference annotation, and so on. While there are several ongoing efforts to provide formats and tools for such annotations and to publish annotated linguistic databases, the lack of widely accepted standards is becoming a critical problem. Proposed standards, to the extent they exist, have focused on file formats. This paper focuses instead on the logical structure of linguistic annotations. We survey a wide variety of existing annotation formats and demonstrate a common conceptual core, the annotation graph. This provides a formal framework for constructing, maintaining and searching linguistic annotations, while remaining consistent with many alternative data structures and file formats.
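The annotation-graph core described above can be sketched minimally: nodes are (optionally time-stamped) anchors and arcs carry typed labels. The plain-tuple encoding below is an assumption for illustration, not the Annotation Graph Toolkit's API:

```python
# A minimal annotation graph: node ids map to time offsets, and each arc
# connects two nodes with a type and a label. Arcs of different types may
# span the same stretch of signal without any nesting constraint.
nodes = {0: 0.0, 1: 0.42, 2: 0.95}      # node id -> time offset (seconds)
arcs = [
    (0, 1, "word",   "hello"),          # (from_node, to_node, type, label)
    (1, 2, "word",   "world"),
    (0, 2, "phrase", "greeting"),       # a phrase arc spanning both words
]

def labels(arc_type):
    """All labels of a given annotation type, in arc order."""
    return [lab for (_, _, t, lab) in arcs if t == arc_type]

print(labels("word"))    # -> ['hello', 'world']
print(labels("phrase"))  # -> ['greeting']
```

The same structure accommodates transcription, tagging, and discourse layers uniformly, which is why the paper proposes it as the common conceptual core beneath the surveyed file formats.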
Our goal is to provide a web-based platform for the long-term preservation and distribution of a heterogeneous collection of linguistic resources. We discuss the corpus preprocessing and normalisation phase that results in sets of multi-rooted trees. At the same time we transform the original metadata records, just like the corpora annotated using different annotation approaches and exhibiting different levels of granularity, into the all-encompassing and highly flexible format eTEI for which we present editing and parsing tools. We also discuss the architecture of the sustainability platform. Its primary components are an XML database that contains corpus and metadata files and an SQL database that contains user accounts and access control lists. A staging area, whose structure, contents, and consistency can be checked using tools, is used to make sure that new resources about to be imported into the platform have the correct structure.
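A staging-area consistency check of the kind described could look roughly like this. The expected layout (one directory per resource containing a `metadata.xml` record plus corpus files) is a hypothetical convention for the sketch, not the platform's actual rules:

```python
import pathlib
import xml.etree.ElementTree as ET

def check_staging_area(root: str) -> list[str]:
    """Report structural problems in a staging area before import.

    Assumed layout (illustrative): one subdirectory per resource, each
    holding a metadata.xml record and well-formed XML corpus files.
    """
    problems = []
    for resource in pathlib.Path(root).iterdir():
        if not resource.is_dir():
            problems.append(f"{resource}: stray file at top level")
            continue
        if not (resource / "metadata.xml").exists():
            problems.append(f"{resource}: missing metadata record")
        for xml_file in resource.glob("*.xml"):
            try:
                ET.parse(xml_file)  # well-formedness check only
            except ET.ParseError as err:
                problems.append(f"{xml_file}: {err}")
    return problems
```

Only resources for which `check_staging_area` reports no problems would be moved on into the XML database; deeper validation (schema conformance, access-control entries in the SQL database) would follow in later steps.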
The process of documenting and describing the world's languages is undergoing radical transformation with the rapid uptake of new digital technologies for capture, storage, annotation and dissemination. However, uncritical adoption of new tools and technologies is leading to resources that are difficult to reuse and which are less portable than the conventional printed resources they replace. We begin by reviewing current uses of software tools and digital technologies for language documentation and description. This sheds light on how digital language documentation and description are created and managed, leading to an analysis of seven portability problems under the following headings: content, format, discovery, access, citation, preservation and rights. After characterizing each problem we provide a series of value statements, and this provides the framework for a broad range of best practice recommendations.
This paper describes the development and design of an ontology of linguistic annotations, primarily word classes and morphosyntactic features, based on existing standardization approaches (e.g. EAGLES), a set of annotation schemes (e.g., for German, STTS and morphological annotations), and existing terminological resources (e.g. GOLD). The ontology is intended to be a platform for terminological integration, integrated representation, and ontology-based search across existing linguistic resources with terminologically heterogeneous annotations. Further, it can be applied to augment the semantic analysis of a given text with an ontological interpretation of its morphosyntactic analysis.
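Concept-based search across heterogeneously tagged corpora, as enabled by such an ontology, can be sketched as follows. The tag-to-concept mappings shown are tiny hand-picked examples for illustration, not the actual OLiA linking model:

```python
# Two annotation schemes are mapped onto one shared concept inventory, so a
# single conceptual query finds hits in corpora tagged with either scheme.
# The mappings below are illustrative assumptions, not OLiA's real axioms.
to_concept = {
    ("STTS", "NN"):  "CommonNoun",   # German STTS tagset
    ("PTB",  "NN"):  "CommonNoun",   # Penn Treebank tagset
    ("PTB",  "NNS"): "CommonNoun",
    ("STTS", "NE"):  "ProperNoun",
}

# (token, tagging scheme, scheme-specific tag)
corpus = [("Hund", "STTS", "NN"), ("dogs", "PTB", "NNS"), ("Paris", "STTS", "NE")]

# Conceptual query: all common nouns, regardless of the underlying tagset.
hits = [tok for (tok, scheme, tag) in corpus
        if to_concept.get((scheme, tag)) == "CommonNoun"]
print(hits)  # -> ['Hund', 'dogs']
```

The query never mentions a scheme-specific tag; adding a corpus with a third tagset only requires extending the mapping, not rewriting queries.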
This article reports on a survey that was conducted among 16 projects of a collaborative research centre to learn about the requirements of a web-based corpus query interface. This interface is to be created for a collection of corpora that are heterogeneous with respect to their languages, levels of annotation, and their users' research interests. Based on the survey and a comparison of three existing corpus query interfaces, we compiled a set of requirements. In the context of sustainable strategies of corpus storage and accessibility, we point out how to design an interface that is general enough to cover multiple corpora and at the same time suitable for a wide range of users.
Linguistic corpora have been annotated by means of SGML-based markup languages for almost 20 years. We can, very roughly, differentiate between three distinct evolutionary stages of markup technologies. (1) Originally, single SGML tree-based document instances were deemed sufficient for the representation of linguistic structures. (2) Linguists began to realize that alternatives and extensions to the traditional model were needed. Formalisms such as, for example, NITE were proposed: the NITE Object Model (NOM) consists of multi-rooted trees. (3) We are now on the threshold of the third evolutionary stage: even NITE's very flexible approach is not suited for all linguistic purposes. As some structures cannot be modeled by multi-rooted trees, an even more flexible approach is needed in order to provide a generic annotation format that is able to represent genuinely arbitrary linguistic data structures.
Comprehensive data repositories are an essential part of practically all research carried out in the digital humanities nowadays. For example, library science, literary studies, and computational and corpus linguistics strongly depend on online archives that are highly sustainable and that contain not only digitized texts but also audio and video data as well as additional information such as metadata and arbitrary annotations. Current Web technologies, especially those that are related to what is commonly referred to as the Web 2.0, provide a number of novel functions such as multiuser editing or the inclusion of third-party content and applications that are also highly attractive for research applications in the areas mentioned above. Hand in hand with this development goes a high degree of legal uncertainty. The special nature of the data entails that, in quite a few cases, there are multiple holders of personal rights (mostly copyright) to different layers of data that often have different origins. This article discusses the legal problems of multiple authorships in private, commercial, and research environments. We also introduce significant differences between European and U.S. law with regard to the handling of this kind of data for scientific purposes.
The World Wide Web has the potential to become a primary source for storing and accessing linguistic data, including data of the sort that are routinely collected by field linguists. Having large amounts of linguistic data on the Web will give linguists, indigenous communities, and language learners access to resources that have hitherto been difficult to obtain. For linguists, scientific data from the world's languages will be just as accessible as information in on-line newspapers. For indigenous communities, the Web will be a powerful instrument for maintaining language as a cultural resource. For students and educators, a new tool will be available for teaching and learning minority and endangered languages. For linguists in particular, having linguistic data on the Web means that data from different languages can be automatically searched and compared. Furthermore, the Web would provide ready computational resources for the development of machine translation and other
The purpose of this paper is to describe the TüBa-D/Z treebank of written German and to compare it to the independently developed TIGER treebank (Brants et al., 2002). Both treebanks, TIGER and TüBa-D/Z, use an annotation framework that is based on phrase structure grammar and that is enhanced by a level of predicate-argument structure. The comparison between the annotation schemes of the two treebanks focuses on the different treatments of free word order and discontinuous constituents in German, as well as on differences in phrase-internal annotation.
The NITE Object Model Library is an implemented set of routines for loading, accessing, manipulating, and serializing linguistic data. It is similar in spirit to the data handling provided by the Annotation Graph Toolkit, but is aimed at data that is heavily cross-annotated with structured information, and thus chooses higher expressivity at the cost of processing speed. We describe our open-source implementation and the XML-based data storage format that it assumes, and discuss the circumstances under which it is a useful addition to previous data handling techniques.