
Thomas Schmidt, Leibniz Institute for the German Language
73 publications, 10,803 reads
958 citations
Publications (73)
Research projects that work with spoken data either select existing speech corpora or plan to record new data. In both cases, recordings need to be transcribed to make them accessible to analysis. Underestimating the effort of transcribing can be risky. Automatic Speech Recognition (ASR) holds the promise to considerably reduce...
This article gives an overview of the DFG-funded project Zugänge zu multimodalen Korpora gesprochener Sprache - Vernetzung und zielgruppenspezifische Ausdifferenzierung (ZuMult). It first addresses the language data and the technical basis of the applications on which the project is built...
The presentation of transcripts and the work with them play an important role in many research-oriented and application-oriented studies of spoken language data. The prototype ZuViel (Zugang zu Visualisierung von Transkripten), developed in the ZuMult project, builds on established methods of transcript presentation and extends them with new...
This contribution illustrates the use of the Forschungs- und Lehrkorpus Gesprochenes Deutsch (FOLK) for questions in interactional linguistics by means of an exemplary study. First, the stratification (data composition) of the corpus, the underlying data model and its annotation layers, and types of research interests are...
For many reasons, Mennonite Low German is a language whose documentation and investigation are of great importance for linguistics. To date, most research projects that deal with this language and/or its speakers have had a relatively narrow focus, with many of the data cited being of limited relevance beyond the projects for which they were collec...
Older adults are often exposed to elderspeak, a specialized speech register linked with negative outcomes. However, previous research has mainly been conducted in nursing homes without considering multiple contextual conditions. Based on a novel contextually-driven framework, we examined elderspeak in an acute general versus geriatric German hospit...
This paper addresses long-term archival for large corpora. It focuses on three aspects specific to language resources, namely (1) the removal of resources for legal reasons, (2) versioning of (unchanged) objects in constantly growing resources, especially where objects can be part of multiple releases but also part of different collections, and (3) t...
The newest generation of speech technology has caused a huge increase in audio-visual data that is now enhanced with orthographic transcripts, for example through automatic subtitling on online platforms. Research data centers and archives contain a range of new and historical data, which are currently only partially transcribed and therefore only partially a...
This paper describes the corpus Deutsch in Namibia (DNam, 'German in Namibia'), which will be openly accessible via the Datenbank für Gesprochenes Deutsch (DGD, 'Database for Spoken German'). This corpus is a new digital resource that comprehensively and systematically documents the language use of the German-speaking minority in Namibia and relate...
ICC: International Comparable Corpus
We present a method for detecting and reconstructing separated particle verbs in a corpus of spoken German by following an approach suggested for written language. Our study shows that the method can be applied successfully to spoken language, compares different ways of dealing with structures that are specific to spoken language corpora, analyses...
Researchers interested in the sounds of speech or the physical gestures of speakers make use of audio and video recordings in their work. Annotating these recordings presents a different set of requirements to the annotation of text. Special purpose tools have been developed to display video and audio signals and to allow the creation of time-align...
We present an approach to making existing CLARIN web services usable for spoken language transcriptions. Our approach is based on a new TEI-based ISO standard for such transcriptions. We show how existing tool formats can be transformed to this standard, how an encoder/decoder pair for the TCF format enables users to feed this type of data through...
This paper presents practices in the compilation of FOLK, the Research and Teaching Corpus of Spoken German, a large collection of spontaneous verbal interaction from diverse discourse domains. After introducing the aims and organisational circumstances of the construction of FOLK, the general idea is discussed that good practices cannot be develop...
This contribution presents the background, design and results of a study of users of three oral corpus platforms in Germany. Roughly 5,000 registered users of the Database for Spoken German (DGD), the GeWiss corpus and the corpora of the Hamburg Centre for Language Corpora (HZSK) were asked to participate in a user survey. This quantitative approac...
In this paper, we present a GOLD standard of part-of-speech tagged transcripts of spoken German. The GOLD standard data consists of four annotation layers – transcription (modified orthography), normalization (standard orthography), lemmatization and POS tags – all of which have undergone careful manual quality control. It comes with guidelines for...
This contribution presents the Forschungs- und Lehrkorpus Gesprochenes Deutsch (FOLK) and the Datenbank für Gesprochenes Deutsch (DGD) as instruments for conversation-analytic work. After a general introduction to FOLK and DGD in the second section, the methodological relations between corpus linguistics and conversation research are, in the third section, ...
The Database for Spoken German (Datenbank für Gesprochenes Deutsch, DGD2, http://dgd.ids-mannheim.de) is the central platform for publishing and disseminating spoken language corpora from the Archive of Spoken German (Archiv für Gesprochenes Deutsch, AGD, http://agd.ids-mannheim.de) at the Institute for the German Language in Mannheim. The corpora...
FOLK is the "Forschungs- und Lehrkorpus Gesprochenes Deutsch" (English: Research and Teaching Corpus of Spoken German). The project has set itself the aim of building a corpus of German conversations which a) covers a broad range of interaction types in private, institutional and public settings, b) is sufficiently large and diverse and of suffi...
Transcription, the paradoxical task of representing spoken language in the written medium, has always been an interesting challenge for data visualisation. Long before the computer became the tool of choice for transcribing audio or video recordings, researchers from such diverse fields as conversation analysis, dialectology or phonology (to name j...
This volume deals with different aspects of the creation and use of multilingual corpora. The term 'multilingual corpus' is understood in a comprehensive sense, meaning any systematic collection of empirical language data enabling linguists to carry out analyses of multilingual individuals, multilingual societies or multilingual communication. The...
This paper presents two toolsets for transcribing and annotating spoken language: the EXMARaLDA system, developed at the University of Hamburg, and the FOLK tools, developed at the Institute for the German Language in Mannheim. Both systems are targeted at users interested in the analysis of spontaneous, multi-party discourse. Their main user commu...
This paper formulates a proposal for standardising spoken language transcription, as practised in conversation analysis, sociolinguistics, dialectology and related fields, with the help of the TEI guidelines. Two areas relevant to standardisation are identified and discussed: first, the macro structure of transcriptions, as embodied in the data mod...
High word frequency and neighborhood density contribute to the accuracy and speed of word production in English adults (e.g., Vitevitch & Sommers 2003), and characterize early words in child English (e.g., Storkel 2004). The present study investigated a speech corpus of child German (ages 2;00-3;00) to further the understanding of the influence of...
Words from dense neighborhoods in the mental lexicon, such as cat (with many phonological neighbors, that is, phonologically similar words, e.g., mat, at, cab, rat, pat) are produced more accurately and quickly by adult native speakers of English than words from sparse neighborhoods, such as wolf (with fewer neighbors, e.g., woof, wooly, wool). Hig...
Please note: This is the PowerPoint/presentation version of the extended LSA abstract I also uploaded. Presented at the Linguistic Society of America (LSA) Meeting 2011.
Description: High word frequency and neighborhood density contribute to the accuracy and speed of word production in English adults (e.g., Vitevitch & Sommers 2003), and character...
This paper presents EXMARaLDA, a system for the computer-assisted creation and analysis of spoken language corpora. The first part contains some general observations about technological and methodological requirements for doing corpus-based pragmatics. The second part explains the system’s architecture and gives an overview of its most important so...
This paper proposes a method for deriving visualisations of linguistic documents from an encoding of their logical structure. The method is based on an extension of the stylesheet processing metaphor as applied, for instance, in XSLT transformations of XML documents. The paper discusses the method using a piece of discourse transcription in musical...
This paper presents the results of a joint effort of a group of multimodality researchers and tool developers to improve the interoperability between several tools used for the annotation and analysis of multimodality. Each of the tools has specific strengths so that a variety of different tools, working on the same data, can be desirable for pro...
This paper discusses issues that arise in the transformation of electronic language data from outdated to modern, sustainable formats. We first describe the problem and then present four different cases in which corpora of spoken language were converted from legacy formats to an XML-based representation. For each of the four cases, we describe the...
This paper deals with questions that arise in connection with the archiving and public provision of conversation-analytic data (audio or video recordings and their transcriptions). It first gives an overview of the research perspectives that an improved practice of data archiving for conversation...
This paper gives an overview of EXMARaLDA, a system consisting of a data model, data formats and software tools for the computer-assisted creation and analysis of spoken language corpora. The focus is on the use of the various software tools: a Partitur-Editor for creating transcriptions,...
This paper presents some concepts and principles used in the development of a database of multilingual spoken discourse at the University of Hamburg. The emphasis of the first part is on general considerations for the handling of heterogeneous data sets: After showing that diversity in transcription data is partly conceptually and partly tech...
This paper attempts a new look at computer assisted transcription as it is commonly practised within the fields of discourse analysis and language acquisition studies. The first part proposes a bridge between discourse analytical methodology and text technological methods with the concept of modelling as its central idea. The second part demonstra...
This paper describes EXMARaLDA, a system for computer transcription of spoken discourse developed and used by the SFB "Mehrsprachigkeit" at the University of Hamburg. EXMARaLDA consists of several DTDs for XML coding of transcription data and some input and output tools for these formats. Apart from being a transcription system in its own right, E...
EXMARaLDA is a system for computer transcription of spoken discourse that is being developed at the SFB 'Mehrsprachigkeit' as a basis of a multilingual discourse database into which the transcriptions in use at the SFB will be integrated at a later point in time. The present paper describes the theoretical background of the development – a formal m...
Although the use of computers for transcribing natural conversations is widespread in practice, the rapid development of computer technology has led to a situation in which different systems often stand side by side, seemingly unconnected, without their commonalities and differences being the subject of a comprehensive theoret...
1. Introduction. This statement responds to Wolfgang Schneider's article "Annotationsstrukturen in Transkripten". As the developer of the EXMARaLDA system, which Schneider discusses as an example for the concepts developed in his paper, I felt called upon to describe some matters from my point of view...
This paper describes a new research initiative addressing the issue of sustainability of linguistic resources. This initiative is a cooperation between three linguistic collaborative research centres in Germany, which comprise more than 40 individual research projects altogether. These projects are involved in creating manifold language resources,...
Conversation-analytic transcription is nowadays carried out almost without exception with the help of computers. The methodological foundations of transcription, however, in particular the transcription guidelines and conventions still in use today, date from a time when transcribing was an activity carried out with the help of a pencil (or...
We define collaborative commentary as the involvement of a research community in the interpretive annotation of electronic records. The goal of this process is the evaluation of competing theoretical claims. The process requires commentators to link their comments and related evidentiary materials to specific segments of either transcripts or elect...
This paper presents the Kicktionary, a multilingual (English, German, French) electronic lexical resource of the language of football. It explains how a corpus of football match reports was analysed according to the FrameNet and WordNet approaches and how the result of this analysis is presented to a dictionary user via a website.