
Thorsten Trippel
University of Tübingen (EKU Tübingen) · Department of Linguistics
Doctor of Philosophy
About
57 Publications · 5,943 Reads · 217 Citations
Publications (57)
In this contribution, we report on ongoing efforts in the German national research infrastructure consortium Text+ to make research data and services for text- and language-oriented disciplines FAIR, that is findable, accessible, interoperable, and reusable, as well as compliant with the CARE principles for language resources.
What does it mean to share a national infrastructure for research data?
To share data, it needs to be FAIR in the sense of the FAIR Data Principles. Implicitly, this requires standardised data. Standards concern both the data format, i.e. the technical specification, and, from the semantic viewpoint, the categories we use to sort, structu...
CLARIN is a European Research Infrastructure Consortium developing and providing a federated and interoperable platform to support scientists in the field of the Social Sciences and Humanities in carrying out language-related research. This contribution provides an overview of the entire infrastructure with a particular focus on tool interoperabili...
On behalf of the Federal Ministry for Economic Affairs and Climate Action, DIN and DKE started work on the second edition of the German Standardization Roadmap on Artificial Intelligence in January 2022. In a broad participation process, involving more than 570 experts from industry, academia, the public sector, and civil societ...
This chapter will present lessons learned from CLARIN-D, the German CLARIN national consortium. Members of the CLARIN-D communities and of the CLARIN-D consortium have been engaged in innovative, data-driven, and community-based research, using language resources and tools in the humanities and neighbouring disciplines. We will present different...
Text+ aims to develop a research data infrastructure for Humanities disciplines and beyond whose primary research focus is on language and text. Text+ will be flexible, scalable, and thus open for different discipline-specific requirements. By offering easy access to high quality research data, Text+ will support a maximum of methodological diversi...
An ERIC (European Research Infrastructure Consortium) operates and provides research infrastructure offerings, usually within a disciplinary scope, for researchers regardless of their national or institutional background. Apart from direct funding, an ERIC is built upon the national contributions from partner countries, notably cash and in-kind con...
The Component Metadata Infrastructure (CMDI) is a discipline-independent metadata framework, though it is currently mainly used within CLARIN and by initiatives in the humanities and social sciences. CMDI allows flexible modelling of metadata schemas that are adjusted to the type of data. The model has built-in functionality for semantic interopera...
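Since CMDI records are XML documents whose components vary per profile, reading one programmatically mostly means namespace-aware tree traversal. The following is a minimal sketch, assuming a simplified CMDI-style record; the profile ID and component names here are hypothetical, not a real registered profile.

```python
import xml.etree.ElementTree as ET

# A simplified, hypothetical CMDI-style record. Real CMDI 1.1 records use
# the envelope namespace http://www.clarin.eu/cmd/ with a Header and a
# profile-specific Components section.
record = """<CMD xmlns="http://www.clarin.eu/cmd/">
  <Header>
    <MdProfile>clarin.eu:cr1:p_example</MdProfile>
  </Header>
  <Components>
    <TextCorpus>
      <Name>Sample Corpus</Name>
      <Language>deu</Language>
    </TextCorpus>
  </Components>
</CMD>"""

NS = {"cmd": "http://www.clarin.eu/cmd/"}
root = ET.fromstring(record)

# Component structure varies per profile, so search generically by element
# name rather than assuming a fixed path.
name = root.find(".//cmd:Name", NS).text
lang = root.find(".//cmd:Language", NS).text
print(name, lang)  # Sample Corpus deu
```

In practice a consumer would first read `MdProfile` from the header and then dispatch to profile-specific handling, which is what makes the framework flexible.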
The transfer of research data management from one institution to another infrastructural partner is anything but trivial, but can be required, for instance, when an institution faces reorganization or closure. In a case study, we describe the migration of all research data, identify the challenges we encountered, and discuss how we addressed them. It sh...
We present the CMDI Explorer, a tool that empowers users to easily explore the contents of complex CMDI records and to process selected parts of them with little effort. The tool allows users, for instance, to analyse virtual collections represented by CMDI records, and to send collection items to other CLARIN services such as the Switchboard for s...
An implementation of CMDI-based signposts and its use is presented in this paper. Arnold et al. 2020 present Signposts as a solution to challenges in long-term preservation of corpora, especially corpora that are continuously extended and subject to modification, e.g., due to legal injunctions, but also may overlap with respect to constituents, and...
Making diverse data in linguistics and the language sciences open, distributed, and accessible: perspectives from language/language-acquisition researchers and technical LOD (linked open data) researchers.
This volume examines the challenges inherent in making diverse data in linguistics and the language sciences open, distributed, integrated, and...
Abstract
For language-based research in the humanities and social sciences, CLARIN provides a research infrastructure tailored to the highly heterogeneous research data in these fields of scholarship. With tools for finding data, preparing it in conformance with standards, and preserving it sustainably...
The Component MetaData Infrastructure (CMDI) provides a lego-brick framework for the creation, use, and re-use of self-defined metadata formats. The design of CMDI can be a force for good, but history shows that it has often been misunderstood or badly executed. Consequently, it has led the community towards the dark ages of metadata clutter rather...
The Component MetaData Infrastructure (CMDI) is a framework for the creation and usage of metadata formats to describe all kinds of resources in the CLARIN world. To better connect to the library world, and to allow librarians to enter metadata for linguistic resources into their catalogues, a crosswalk from CMDI-based formats to bibliographic stan...
The Component MetaData Infrastructure (CMDI) is the dominant framework for describing language resources according to ISO 24622 (ISO/TC 37/SC 4, 2015). Within the CLARIN world, CMDI has become a huge success. The Virtual Language Observatory (VLO) now holds over 800,000 resources, all described with CMDI-based metadata. With the metadata being harv...
To optimize the sharing and reuse of existing data, many funding organizations now require researchers to specify a management plan for research data. In such a plan, researchers are supposed to describe the entire life cycle of the research data they are going to produce, from data creation to formatting, interpretation, documentation, short-term...
Measuring the quality of metadata is only possible by assessing the quality of the underlying schema and the metadata instance. We propose some factors that are measurable automatically for metadata according to the CMD framework, taking into account the variability of schemas that can be defined in this framework. The factors include among others...
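One family of automatically measurable factors mentioned above can be illustrated with a completeness score: the share of schema fields that a metadata instance actually fills. This is a minimal sketch under assumed field names; it is not the paper's actual factor set or scoring formula.

```python
# Hypothetical schema: the fields a metadata instance is expected to fill.
SCHEMA_FIELDS = ["title", "creator", "language", "licence", "date"]

def completeness(instance: dict) -> float:
    """Fraction of schema fields with a non-empty value."""
    filled = sum(1 for field in SCHEMA_FIELDS if instance.get(field))
    return filled / len(SCHEMA_FIELDS)

# Two of five fields carry a non-empty value ("date" is present but empty).
record = {"title": "Sample Corpus", "language": "deu", "date": ""}
print(completeness(record))  # 0.4
```

Because CMD allows many different schemas, a real implementation would derive `SCHEMA_FIELDS` from the schema of each record rather than hard-coding them, which is exactly the variability the paper has to account for.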
A single abstract from the DHd-2014 Book of Abstracts.
The linguistics community is building a metadata-based infrastructure for the description of its research data and tools. At its core is the ISOcat registry, a collaborative platform to hold a (to be standardized) set of data categories (i.e., field descriptors). Descriptors have definitions in natural language and little explicit interrelations. W...
The chapter on formats and models for lexicons deals with different available data formats of lexical resources. It elaborates on their structure and possible uses. Motivated by the restrictions in merging different lexical resources based on widespread formalisms and international standards, a formal lexicon model for lexical resources is devel...
This paper discusses work on the sustainability of linguistic resources as it was conducted in various projects, including the work of a three-year project, Sustainability of Linguistic Resources, which finished in December 2008, a follow-up project, Sustainable Linguistic Data, and initiatives related to the work of the International Organization of...
Lexicon schemas and their use are discussed in this paper from the perspective of lexicographers and field linguists. A variety of lexicon schemas have been developed, with goals ranging from computational lexicography (DATR) through archiving (LIFT, TEI) to standardization (LMF, FSR). A number of requirements for lexicon schemas are given. The l...
The semantic properties associated with larger text units, typically morphemes, words, phrases, and sentences, have been treated computationally in innumerable studies. Smaller text units have largely been ignored to date in the context of computational semantics research. We argue in this paper that although the forms of smaller units at the ortho...
In this paper we discuss the explicit representation of character features pertaining to written language resources, which we argue are critically necessary in the long term of archiving language data. Much focus on the creation of language resources and their associated preservation is at the level of the corpus itself; however it is generally acc...
We report on international standardization work in the area of language resources. The standards are developed by international working groups within the International Organization for Standardization (ISO) and are each accompanied nationally by corresponding groups, coordinated in Germany by the Deutsches Institut für Normung (DIN), and dis...
Analysis and knowledge representation of linguistic objects tends to focus on larger units (e.g. words) than print medium characters. We analyse characters as linguistic objects in their own right, with meaning, structure and form. Characters have meaning (the symbols of the International Phonetic Alphabet denote phonetic categories, the character...
For a detailed description of time aligned corpora, for example spoken language corpora and multimodal corpora, specific metadata categories are necessary, extending the scope of traditional metadata categories. We argue that it is necessary to allow metadata on all levels of annotation, i.e. on a general level for catalogues, on the session level...
The extraction of lexical information for machine-readable lexica from multilevel annotations is addressed in this paper. Relations between these levels of annotation are used for sub-classification of lexical entries. A method for relating annotation units is presented, based on a temporal calculus. Relating the annotation units manually is err...
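A temporal calculus over annotation units can be sketched as interval comparison on a shared timeline. The following is an illustrative fragment in the spirit of Allen's interval relations, with several relations collapsed for brevity; the function name and the tier example are hypothetical, not the paper's actual method.

```python
# Annotation units as (start, end) intervals in seconds on one timeline.
# Relating units across annotation levels (e.g. word tier vs. phrase tier)
# reduces to classifying the temporal relation between their intervals.

def relate(a: tuple, b: tuple) -> str:
    """Classify the temporal relation of interval a to interval b."""
    a_start, a_end = a
    b_start, b_end = b
    if a_end < b_start:
        return "before"
    if a == b:
        return "equal"
    if a_start == b_start and a_end < b_end:
        return "starts"
    if a_start > b_start and a_end == b_end:
        return "finishes"
    if a_start > b_start and a_end < b_end:
        return "during"
    return "overlaps"  # remaining cases, collapsed for brevity

# A word on one tier related to a phrase on another tier:
print(relate((0.5, 0.9), (0.5, 2.0)))  # starts
print(relate((1.0, 1.4), (0.5, 2.0)))  # during
```

Computing these relations automatically is what replaces the error-prone manual linking of annotation units: a word whose interval is `during` a phrase interval can, for instance, be sub-classified by that phrase's category.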
Currently no standardised gesture annotation systems are available. As a contribution towards solving this problem, CoGesT, a machine processable and human usable computational model for the annotation of a subset of conversational gestures is presented, its empirical and formal properties are detailed, and application areas are discussed. 1. A ges...
We describe PAX, the "Portable Audio Concordance System", a proof-of-concept prototype of a multipurpose, multilingual audio concor...
Concordancing is one of the oldest corpus analysis tools, especially for written corpora. In NLP, concordancing appears in the training of speech-recognition systems. Additionally, comparative studies of different languages result in parallel corpora. Concordancing for these corpora in an NLP context is a new approach. We propose to combine these fi...
With MetaLex we introduce a framework for metadata management where information can be inferred from different areas of metadata coding, such as metadata for catalogue descriptions, linguistic levels, or tiers. This is done for consistency and efficiency in metadata recording and applies the same inference techniques that are used for lexical infer...
New generations of integrated multimodal speech and language systems with dictation, readback or talking face facilities require multiple sources of lexical information for development and evaluation. Recent developments in hyperlexicon development offer new perspectives for the development of such resources which are at the same time practically u...
This paper describes the restructuring process of a large corpus of historical documents and the system architecture that is used for accessing it. The initial challenge of this process was to get the most out of existing material, normalizing the legacy markup and harvesting the inherent information using widely available standards. This resulted...
In order to create reusable and sustainable multimodal resources, a transcription model for hand and arm gestures in conversation is needed. We argue that transcription systems so far developed for sign language transcription and psychological analysis are not suitable for the linguistic analysis of conversational gesture. Such a model must adhere...
This paper proposes a methodology for querying linguistic data represented in different corpus formats. Examples of the need for queries over such heterogeneous resources are the corpus-based analysis of multimodal phenomena like the interaction of gestures and prosodic features, or syntax-related phenomena like information structure which exceed t...
The Basic Language Resource Kit (BLARK) proposed by Krauwer is designed for the creation of initial textual resources. There are a number of toolkits for the development of spoken language resources and systems, but tools for second-level resources, that is, resources which are the result of processing primary-level speech resources such as speech...
XML has been designed for creating structured documents, but the information that is encoded in these structures is, by definition, out of scope for XML. Additional sources, normally not easily interpretable by computers, such as documentation, are needed to determine the intention of specific tags in a tag-set. The Component Metadata Infrastructur...
The Lexicon Graph Model provides a model and framework for lexicons that can be corpus based and contain multimodal information. The focus is more from the lexicon theory perspective, looking at the underlying data structures that are part of existing lexicons and corpora. The term lexicon in linguistics and artificial intelligence is used in diffe...