Thomas Schmidt

  • Leibniz Institute for the German Language

About

73 Publications
10,803 Reads
958 Citations


Publications (73)
Conference Paper
Full-text available
Research projects incorporating spoken data either select existing speech corpora or plan to record new data. In both cases, recordings need to be transcribed to make them accessible to analysis. Underestimating the effort of transcribing can be risky. Automatic Speech Recognition (ASR) holds the promise to considerably reduce...
Article
Full-text available
This article gives an overview of the DFG-funded project Zugänge zu multimodalen Korpora gesprochener Sprache – Vernetzung und zielgruppenspezifische Ausdifferenzierung (ZuMult, 'Access to multimodal corpora of spoken language: networking and target-group-specific differentiation'). It first addresses the language data and the technical basis of the applications on which the project builds. ...
Article
Full-text available
Displaying and working with transcripts plays an important role in much research-oriented and application-oriented work with spoken language data. The prototype ZuViel (Zugang zu Visualisierung von Transkripten, 'Access to visualisation of transcripts'), developed in the ZuMult project, builds on established methods of transcript display and extends them with new ...
Chapter
This chapter illustrates the use of the Forschungs- und Lehrkorpus Gesprochenes Deutsch (FOLK, 'Research and Teaching Corpus of Spoken German') for questions in interactional linguistics by means of an exemplary study. First, the stratification (data composition) of the corpus, the underlying data model and its annotation layers, and types of research interests are ...
Chapter
For many reasons, Mennonite Low German is a language whose documentation and investigation are of great importance for linguistics. To date, most research projects that deal with this language and/or its speakers have had a relatively narrow focus, with many of the data cited being of limited relevance beyond the projects for which they were collec...
Article
Full-text available
Older adults are often exposed to elderspeak, a specialized speech register linked with negative outcomes. However, previous research has mainly been conducted in nursing homes without considering multiple contextual conditions. Based on a novel contextually-driven framework, we examined elderspeak in an acute general versus geriatric German hospit...
Conference Paper
Full-text available
This paper addresses long-term archival for large corpora. It focuses on three aspects specific to language resources, namely (1) the removal of resources for legal reasons, (2) versioning of (unchanged) objects in constantly growing resources, especially where objects can be part of multiple releases but also part of different collections, and (3) t...
Conference Paper
Full-text available
The newest generation of speech technology has caused a huge increase in audio-visual data that is now enhanced with orthographic transcripts, as in automatic subtitling on online platforms. Research data centers and archives contain a range of new and historical data, which are currently only partially transcribed and therefore only partially a...
Preprint
Full-text available
This paper describes the corpus Deutsch in Namibia (DNam, 'German in Namibia'), which will be openly accessible via the Datenbank für Gesprochenes Deutsch (DGD, 'Database for Spoken German'). This corpus is a new digital resource that comprehensively and systematically documents the language use of the German-speaking minority in Namibia and relate...
Chapter
Full-text available
Chapter
Full-text available
We present a method for detecting and reconstructing separated particle verbs in a corpus of spoken German by following an approach suggested for written language. Our study shows that the method can be applied successfully to spoken language, compares different ways of dealing with structures that are specific to spoken language corpora, analyses...
Chapter
Researchers interested in the sounds of speech or the physical gestures of speakers make use of audio and video recordings in their work. Annotating these recordings presents a different set of requirements to the annotation of text. Special purpose tools have been developed to display video and audio signals and to allow the creation of time-align...
Conference Paper
Full-text available
We present an approach to making existing CLARIN web services usable for spoken language transcriptions. Our approach is based on a new TEI-based ISO standard for such transcriptions. We show how existing tool formats can be transformed to this standard, how an encoder/decoder pair for the TCF format enables users to feed this type of data through...
Article
This paper presents practices in the compilation of FOLK, the Research and Teaching Corpus of Spoken German, a large collection of spontaneous verbal interaction from diverse discourse domains. After introducing the aims and organisational circumstances of the construction of FOLK, the general idea is discussed that good practices cannot be develop...
Conference Paper
Full-text available
This contribution presents the background, design and results of a study of users of three oral corpus platforms in Germany. Roughly 5,000 registered users of the Database for Spoken German (DGD), the GeWiss corpus and the corpora of the Hamburg Centre for Language Corpora (HZSK) were asked to participate in a user survey. This quantitative approac...
Conference Paper
Full-text available
In this paper, we present a GOLD standard of part-of-speech tagged transcripts of spoken German. The GOLD standard data consists of four annotation layers – transcription (modified orthography), normalization (standard orthography), lemmatization and POS tags – all of which have undergone careful manual quality control. It comes with guidelines for...
Article
Full-text available
This article presents the Forschungs- und Lehrkorpus Gesprochenes Deutsch (FOLK) and the Datenbank für Gesprochenes Deutsch (DGD) as instruments for conversation-analytic work. After a general introduction to FOLK and DGD in the second section, the third section concerns the methodological relations between corpus linguistics and ...
Conference Paper
Full-text available
The Database for Spoken German (Datenbank für Gesprochenes Deutsch, DGD2, http://dgd.ids-mannheim.de) is the central platform for publishing and disseminating spoken language corpora from the Archive of Spoken German (Archiv für Gesprochenes Deutsch, AGD, http://agd.ids-mannheim.de) at the Institute for the German Language in Mannheim. The corpora...
Conference Paper
Full-text available
FOLK is the "Forschungs- und Lehrkorpus Gesprochenes Deutsch" (English: Research and Teaching Corpus of Spoken German). The project has set itself the aim of building a corpus of German conversations which a) covers a broad range of interaction types in private, institutional and public settings, b) is sufficiently large and diverse and of suffi...
Poster
Full-text available
Transcription, the paradoxical task of representing spoken language in the written medium, has always been an interesting challenge for data visualisation. Long before the computer became the tool of choice for transcribing audio or video recordings, researchers from such diverse fields as conversation analysis, dialectology or phonology (to name j...
Chapter
This volume deals with different aspects of the creation and use of multilingual corpora. The term 'multilingual corpus' is understood in a comprehensive sense, meaning any systematic collection of empirical language data enabling linguists to carry out analyses of multilingual individuals, multilingual societies or multilingual communication. The...
Chapter
Full-text available
This volume deals with different aspects of the creation and use of multilingual corpora. The term 'multilingual corpus' is understood in a comprehensive sense, meaning any systematic collection of empirical language data enabling linguists to carry out analyses of multilingual individuals, multilingual societies or multilingual communication. The...
Chapter
Full-text available
This volume deals with different aspects of the creation and use of multilingual corpora. The term 'multilingual corpus' is understood in a comprehensive sense, meaning any systematic collection of empirical language data enabling linguists to carry out analyses of multilingual individuals, multilingual societies or multilingual communication. The...
Conference Paper
Full-text available
This paper presents two toolsets for transcribing and annotating spoken language: the EXMARaLDA system, developed at the University of Hamburg, and the FOLK tools, developed at the Institute for the German Language in Mannheim. Both systems are targeted at users interested in the analysis of spontaneous, multi-party discourse. Their main user commu...
Article
Full-text available
This paper formulates a proposal for standardising spoken language transcription, as practised in conversation analysis, sociolinguistics, dialectology and related fields, with the help of the TEI guidelines. Two areas relevant to standardisation are identified and discussed: first, the macro structure of transcriptions, as embodied in the data mod...
Article
Full-text available
High word frequency and neighborhood density contribute to the accuracy and speed of word production in English adults (e.g., Vitevitch & Sommers 2003), and characterize early words in child English (e.g., Storkel 2004). The present study investigated a speech corpus of child German (ages 2;00-3;00) to further the understanding of the influence of...
Article
Full-text available
Words from dense neighborhoods in the mental lexicon, such as cat (with many phonological neighbors, that is, phonologically similar words, e.g., mat, at, cab, rat, pat) are produced more accurately and quickly by adult native speakers of English than words from sparse neighborhoods, such as wolf (with fewer neighbors, e.g., woof, wooly, wool). Hig...
Presentation
Full-text available
Please note: This is the PowerPoint/presentation version of the extended LSA abstract I also uploaded. Presented at the Linguistic Society of America (LSA) Meeting 2011. Description: High word frequency and neighborhood density contribute to the accuracy and speed of word production in English adults (e.g., Vitevitch & Sommers 2003), and character...
Article
Full-text available
This paper presents EXMARaLDA, a system for the computer-assisted creation and analysis of spoken language corpora. The first part contains some general observations about technological and methodological requirements for doing corpus-based pragmatics. The second part explains the system’s architecture and gives an overview of its most important so...
Article
Full-text available
This paper proposes a method for deriving visualisations of linguistic documents from an encoding of their logical structure. The method is based on an extension of the stylesheet processing metaphor as applied, for instance, in XSLT transformations of XML documents. The paper discusses the method using a piece of discourse transcription in musical...
Conference Paper
This paper presents the results of a joint effort of a group of multimodality researchers and tool developers to improve the interoperability between several tools used for the annotation and analysis of multimodality. Each of the tools has specific strengths so that a variety of different tools, working on the same data, can be desirable for pro...
Article
Full-text available
This paper discusses issues that arise in the transformation of electronic language data from outdated to modern, sustainable formats. We first describe the problem and then present four different cases in which corpora of spoken language were converted from legacy formats to an XML-based representation. For each of the four cases, we describe the...
Article
Full-text available
This paper deals with questions that arise in connection with the archiving and public provision of conversation-analytic data (audio or video recordings and their transcriptions). It first gives an overview of the research perspectives that an improved practice of data archiving ...
Article
Full-text available
This paper gives an overview of EXMARaLDA, a system comprising a data model, data formats and software tools for the computer-assisted creation and analysis of spoken language corpora. The presentation focuses on the use of the various software tools: a Partitur-Editor (score editor) for creating transcriptions, ...
Article
Full-text available
This paper presents some concepts and principles used in the development of a database of multilingual spoken discourse at the University of Hamburg. The emphasis of the first part is on general considerations for the handling of heterogeneous data sets: After showing that diversity in transcription data is partly conceptually and partly tech...
Article
Full-text available
This paper attempts a new look at computer-assisted transcription as it is commonly practised within the fields of discourse analysis and language acquisition studies. The first part proposes a bridge between discourse analytical methodology and text technological methods with the concept of modelling as its central idea. The second part demonstra...
Article
Full-text available
This paper describes EXMARaLDA, a system for computer transcription of spoken discourse developed and used by the SFB "Mehrsprachigkeit" at the University of Hamburg. EXMARaLDA consists of several DTDs for XML coding of transcription data and some input and output tools for these formats. Apart from being a transcription system in its own right, E...
Article
Full-text available
EXMARaLDA is a system for computer transcription of spoken discourse that is being developed at the SFB "Mehrsprachigkeit" as a basis of a multilingual discourse database into which the transcriptions in use at the SFB will be integrated at a later point in time. The present paper describes the theoretical background of the development – a formal m...
Article
Full-text available
Although the use of computers for transcribing natural conversations is widespread in practice, the rapid development of computer technology has meant that different systems often stand side by side, seemingly unconnected, without their commonalities and differences being the subject of a comprehensive ...
Article
Full-text available
1. Introduction. This statement responds to Wolfgang Schneider's article "Annotationsstrukturen in Transkripten" ('Annotation structures in transcripts'). As the developer of the EXMARaLDA system, which Schneider discusses as an example for the concepts developed in his paper, I felt called upon to describe some matters from my point of view. ...
Article
Full-text available
This paper describes a new research initiative addressing the issue of sustainability of linguistic resources. This initiative is a cooperation between three linguistic collaborative research centres in Germany, which comprise more than 40 individual research projects altogether. These projects are involved in creating manifold language resources,...
Article
Full-text available
Conversation-analytic transcription is today carried out almost exclusively with the help of the computer. However, the methodological foundations of transcription, in particular the transcription guidelines and conventions still in use today, originate from a time when transcribing was done with pencil (or ...
Article
Full-text available
We define collaborative commentary as the involvement of a research community in the interpretive annotation of electronic records. The goal of this process is the evaluation of competing theoretical claims. The process requires commentators to link their comments and related evidentiary materials to specific segments of either transcripts or elect...
Article
Full-text available
This paper presents the Kicktionary, a multilingual (English, German, French) electronic lexical resource of the language of football. It explains how a corpus of football match reports was analysed according to the FrameNet and WordNet approaches and how the result of this analysis is presented to a dictionary user via a website.
