About
142
Publications
16,664
Reads
1,848
Citations
Publications (142)
In this contribution, we report on ongoing efforts in the German national research infrastructure consortium Text+ to make research data and services for text- and language-oriented disciplines FAIR, that is findable, accessible, interoperable, and reusable, as well as compliant with the CARE principles for language resources.
Language data are essential for any scientific endeavor. However, unlike numerical data, language data are often protected by copyright, as they easily meet the threshold of originality. The role of research infrastructures (such as CLARIN, DARIAH, and Text+) is to bridge the gap between uses allowed by statutory exceptions and the requirements of Ope...
This chapter will present lessons learned from CLARIN-D, the German CLARIN national consortium. Members of the CLARIN-D communities and of the CLARIN-D consortium have been engaged in innovative, data-driven, and community-based research, using language resources and tools in the humanities and neighbouring disciplines. We will present different...
Text+ aims to develop a research data infrastructure for Humanities disciplines and beyond whose primary research focus is on language and text. Text+ will be flexible, scalable, and thus open for different discipline-specific requirements. By offering easy access to high quality research data, Text+ will support a maximum of methodological diversi...
Composition models of distributional semantics are used to construct phrase representations from the representations of their words. Composition models are typically situated on two ends of a spectrum. They either have a small number of parameters but compose all phrases in the same way, or they perform word-specific compositions at the cost of a f...
Abstract
For language-based research in the humanities and social sciences, CLARIN provides a research infrastructure that is adapted to the highly heterogeneous research data in these fields. With tools for finding data, for preparing them in conformance with standards, and for preserving them sustainably ...
(1a) shows that the finite auxiliary in a German subordinate clause normally appears in clause-final position. However, when the finite auxiliary governs a modal such as können or müssen, as in (1b), then the finite auxiliary is placed at the left periphery of the verbal complex. The ungrammaticality of (1c) shows that in such cases Oberfeld placem...
This paper presents a novel model that learns and exploits embeddings of phone ngrams for word segmentation in child language acquisition. Embedding-based models are evaluated on a phonemically transcribed corpus of child-directed speech, in comparison with their symbolic counterparts using the common learning framework and features. Results show t...
This paper proposes an embedding matching approach to Chinese word segmentation, which generalizes the traditional sequence labeling framework and takes advantage of distributed representations. The training and prediction algorithms have linear-time complexity. Based on the proposed model, a greedy segmenter is developed and evaluated on benchmark...
Sense definitions are a crucial component for wordnets and enhance the usability of wordnets for a wide variety of NLP applications. Many wordnets for languages other than English - including the German wordnet GermaNet - lack comprehensive coverage of such definitions. The purpose of this paper is to automatically align sense descriptions from the...
Verbal word formation processes involving prefixes and particles are highly productive in Germanic languages. The compositional semantics of such prefix and particle verbs requires an in-depth analysis of the interdependence of their constituent parts for adequately representing these types of complex verbs in lexical-semantic networks. The present...
A comparison and alignment of lexical resources brings about considerable mutual benefits for all resources involved. For all sense distinctions that are completely parallel in two resources, such an alignment provides supporting external evidence for the validity of sense distinction and allows enriching word senses by information contained in the...
This paper provides a deduction-based approach for automatically classifying compound-internal relations in GermaNet, the German version of the Princeton WordNet for English. More specifically, meronymic relations between simplex and compound nouns provide the necessary input to the deduction patterns that involve different types of compound-intern...
This paper describes the manual construction of a sense-annotated corpus for German with the goal of providing a gold standard for word sense disambiguation. The underlying textual resource, the TüBa-D/Z treebank, is a German newspaper corpus already manually enriched with high-quality, manual annotations at various levels of grammar. The sense inv...
Finding coordinations provides useful information for many NLP endeavors. However, the task has not received much attention in the literature. A major reason for that is that the annotation of major treebanks does not reliably annotate coordination. This makes it virtually impossible to detect coordinations in which two conjuncts are separated by p...
This paper describes an automatic method for creating a domain-independent sense-annotated corpus harvested from the web. As a proof of concept, this method has been applied to German, a language for which sense-annotated corpora are still in short supply. The sense inventory is taken from the German wordnet GermaNet. The web-harvesting relies on a...
Treebanks are language resources that provide annotations at various levels of linguistic structure starting from the word level. They typically provide syntactic constituent or dependency structures for sentences, but increasingly extend to annotation beyond syntactic structure, including semantic, pragmatic and rhetorical annotation, or go beyond...
This paper describes the TüBa-D/DC, a diachronic corpus of German that uses selected materials from the German Gutenberg Project and enriches them with different linguistic annotation layers, including part-of-speech, lemmata, and constituent structure. Linguistic annotation is performed automatically by using statistical tools that have been train...
This document proposes an overview of the current (at the time of writing) scene towards an Interoperability Framework and acts as a reference point for the standards that our community supports. This initiative is in close synchronization with other relevant initiatives such as CLARIN, ELRA, ISO and TEI and META-Share. The document builds on th...
In this paper we present the development process of NLP-QT, a question treebank that will be used for data-driven parsing in the context of a domain-specific QA system for querying NLP resource metadata. We motivate the need to build NLP-QT as a resource in its own right, by comparing the Penn Treebank-style annotation scheme used for QuestionBan...
In order to be able to systematically link compounds in GermaNet to their constituent parts, compound splitting needs to be applied recursively and has to identify the immediate constituents at each level of analysis. Existing tools for compound splitting for German only offer an analysis of all component parts of a compound at once without any gro...
This paper describes a CoNLL-style chunk representation for the Tübingen Treebank of Written German, which assumes a flat chunk structure so that each word belongs to at most one chunk. For German, such a chunk definition causes problems in cases of complex prenominal modification. We introduce a flat annotation that can handle these structures v...
eScience - enhanced science - is a new paradigm of scientific work and research. In the humanities, eScience environments can be helpful in establishing new workflows and lifecycles of scientific data. WebLicht is such an eScience environment for linguistic analysis, making linguistic tools and resources available network-wide. Today, most digital...
This paper introduces GernEdiT (short for: GermaNet Editing Tool), a new graphical user interface for the lexicographers and developers of GermaNet, the German version of the Princeton WordNet. GermaNet is a lexical-semantic net that relates German nouns, verbs, and adjectives. Traditionally, lexicographic work for extending the coverage of GermaNe...
This software demonstration presents WebLicht (short for: Web-Based Linguistic Chaining Tool), a web-based service environment for the integration and use of language resources and tools (LRT). WebLicht is being developed as part of the D-SPIN project. WebLicht is implemented as a web application so that there is no need for users to install any s...
It has been recognized for quite some time that sustainable data formats play an important role in the development and curation of linguistic resources. The purpose of this paper is to show how GermaNet, the German version of the Princeton WordNet, can be converted to the Lexical Markup Framework (LMF), a published ISO standard (ISO-24613) for enco...
GernEdiT (short for: GermaNet Editing Tool) offers a graphical interface for the lexicographers and developers of GermaNet to access and modify the underlying GermaNet resource. GermaNet is a lexical-semantic wordnet that is modeled after the Princeton WordNet for English. The traditional lexicographic development of GermaNet was error-prone and t...
The paper presents a computational analysis of Bulgarian dialect variation, concentrating on pronunciation differences. It describes the phonetic data set compiled during the project 'Measuring Linguistic Unity and Diversity in Europe' that consists of the pronunciations of 157 words collected at 197 sites from all over Bulgaria. We also present...
This paper introduces the EU-FP7 project CLARIN, a joint effort of over 150 institutions in Europe, aimed at the creation of a sustainable language resources and technology infrastructure for the humanities and social sciences research community. The paper briefly introduces the vision behind the project and how it relates to speech research with a...
This article shows that the TEI tag set for feature structures can be adopted to represent a heterogeneous set of linguistic corpora. The majority of corpora is annotated using markup languages that are based on the Annotation Graph framework, the upcoming Linguistic Annotation Format ISO standard, or according to tag sets defined by or based upon...
We report on finished work in a project that is concerned with providing methods, tools, best practice guidelines, and solutions for sustainable linguistic resources. The article discusses several general aspects of sustainability and introduces an approach to normalizing corpus data and metadata records. Moreover, the architecture of the sustainab...
The present paper is concerned with statistical parsing of constituent structures in German. The paper presents four experiments that aim at improving parsing performance of coordinate structure: 1) reranking the n-best parses of a PCFG parser, 2) enriching the input to a PCFG parser by gold scopes for any conjunct, 3) reranking the parser outp...
The field of dialectology relies on lexical knowledge in the form of pronunciation and lexical data. The present paper focuses on the recently developed approach of computational dialectometry, particularly on the scientific visualization techniques that have been developed within this approach. Existing visualization software packages are mature e...
Within the CLARIN e-science infrastructure project it is foreseen to develop a component-based registry for metadata for Language Resources and Language Technology. With this registry it is hoped to overcome the problems of the current available systems with respect to inflexible fixed schema, unsuitable terminology and interoperability problems. T...
This paper presents a corpus-based study of the discourse connective 'in contrast'. The corpus data are drawn from the British National Corpus (BNC) and are analyzed at the levels of syntax, discourse structure, and compositional semantics. Following Webber et al. (2003), the paper argues that 'in contrast' crucially involves discourse anaphora and, th...
In Theory and Evidence in Semantics, editors Erhard W. Hinrichs and John Nerbonne present a series of state-of-the-art papers that investigate the interface of natural language semantics with other modules of grammar (such as morphology, syntax, and pragmatics) and pursue applications of semantic theory in computational linguistics. Written by so...
This paper reports on a hybrid architecture for computational anaphora resolution (CAR) of German that combines a rule-based pre-filtering component with a memory-based resolution module (using the Tilburg Memory Based Learner – TiMBL). The data source is provided by the TüBa-D/Z treebank of German newspaper text (Telljohann et al. 04) that is an-...
A novel unsupervised learning approach to computational dialectometry is presented which uses hard clustering. The approach relies on vector analysis over two-dimensional arrays of word lists collected for different geographical sites. The paper presents the underlying theory and applies the approach to a Bulgarian data set. The results of these ex...
In many theoretical and applied areas of computational linguistics researchers operate with a notion of linguistic distance or, conversely, linguistic similarity, which is the focus of the present workshop. While many CL areas make frequent use of such notions, it has received little focused attention, an honorable exception being Lebart & Rajman...
This paper presents a comparative study of probabilistic treebank parsing of German, using the Negra and TüBa-D/Z treebanks. Experiments with the Stanford parser, which uses a factored PCFG and dependency model, show that, contrary to previous claims for other parsers, lexicalization of PCFG models boosts parsing performance for both t...
This paper reports on the SYN-RA (SYNtax-based Reference Annotation) project, an on-going project of annotating German newspaper texts with referential relations. The project has developed an inventory of anaphoric and coreference relations for German in the context of a unified, XML-based annotation scheme for combining morphological, syntactic, s...
This paper profiles significant differences in syntactic distribution and differences in word class frequencies for two treebanks of spoken and written German: the TüBa-D/S, a treebank of transliterated spontaneous dialogs, and the TüBa-D/Z treebank of newspaper articles published in the German daily newspaper 'die tageszeitung' (taz). The approach...
The research reported here is part of a larger project on the development of a robust dependency parsing scheme GRIP (GeRman Incremental Parsing) that uses the Xerox Incremental Deep Parsing System and provides syntactic annotation in an incremental fashion. It is shown that morphological disambiguation is a crucial step in narrowing down the searc...
Chunk parsing has focused on the recognition of partial constituent structures at the level of individual chunks. Little attention has been paid to the question of how such partial analyses can be combined into larger structures for complete utterances.
this paper is to investigate alternative methods of automatic, morphosyntactic annotation with a large tagset that can be effectively trained on manually annotated data of moderate size. We will argue that standard n-gram taggers such as TnT are inadequate for this task, but that probabilistic context-free grammars provide a suitable alternative th...
The purpose of this paper is to describe the TüBa-D/Z treebank of written German and to compare it to the independently developed TIGER treebank (Brants et al., 2002). Both treebanks, TIGER and TüBa-D/Z, use an annotation framework that is based on phrase structure grammar and that is enhanced by a level of predicate-argument structure. The compari...
Rule-based and statistical approaches constitute the two leading paradigms in computational linguistics. This paper applies the two types of approaches to the task of assigning morpho-syntactic categories to words in German, a language with rich inflectional morphology. The rule-based approach uses the Xerox Incremental Deep Parsing System and provid...
A growing number of theoretical and computational linguists in recent years have voiced scepticism about the merit of insights gained in one field for the other. With suitable distance and perspective, however, we are confident that the sometimes temperamental arguments about the value of theoretical research in linguistics for computational lingui...
This stylebook describes the design principles and the annotation scheme for the German treebank TüBa-D/Z developed by the Division of Computational Linguistics (Lehrstuhl Prof. Hinrichs) at the Department of Linguistics (Seminar für Sprachwissenschaft, SfS) of the Eberhard-Karls-Universität Tübingen, Germany. The guidelines focus on the synta...
Lexical rules are used in constraint-based grammar formalisms such as Head-Driven Phrase Structure Grammar (HPSG) (Pollard and Sag 1994) to express generalizations among lexical entries. This paper discusses a number of lexical rules from recent HPSG analyses of German (Hinrichs and Nakazawa 1994) and shows that the grammar in some cas...
In this paper we develop a functor-driven approach to natural language generation which pairs logical forms, expressed in first-order predicate logic, with syntactically well-formed English sentences. Grammatical knowledge is expressed in the framework of categorial unification grammars developed by Karttunen (1986), Wittenburg (1986), Uszkoreit (1...
The paper argues that morphological disambiguation is a crucial step for assignment of dependency structures. Quantitative evaluation on a German corpus shows that morphological disambiguation of NPs together with syntactic heuristics yields unique morphological analyses for the assignment of dependency relations to German NPs in 77.08% of all case...
This paper provides an overview of current research on a hybrid and robust parsing architecture for the morphological, syntactic and semantic annotation of German text corpora. The novel contribution of this research lies not in the individual parsing modules, each of which relies on state-of-the-art algorithms and techniques. Rather what is new ab...
This paper describes a compositional semantics for temporal expressions as part of the meaning representation language (MRL) of the JANUS system, a natural language understanding and generation system under joint development by BBN Labs and ISI. The analysis is based on a higher-order intensional logic described in detail in Hinrichs (1987a). Tempo...
The automated translation of one natural language to another, known as machine translation (MT), typically requires successful modeling of the grammars of the languages and the relationship between them. Rather than hand-coding these grammars and relationships, some machine translation efforts employ data-driven methods, where the goal is to learn f...
This paper describes a compositional semantics for temporal expressions as part of the meaning representation language (MRL) of the JANUS system, a natural language understanding and generation system under joint development by BBN Laboratories and the Information Sciences Institute. The analysis is based on a higher-order intensional logic describ...
Case-matching effects in German VP coordination and German free relatives have received a fair amount of attention in recent syntactic theorizing and have been cited by Ingria (1990) as a potential challenge to constraint-based and unification-based approaches to syntax such as HPSG and LFG. This paper considers another construction in question:...
The so-called was-w-construction in German has received a fair amount of attention in recent syntactic theorizing. Most of the discussion has focused on the properties of 'was'. One line of research maintains that 'was' is a scope marker that indicates the semantic scope of the wh-phrase in the embedded interrogative clause. The alternative view, usual...
Chunk parsing has focused on the recognition of partial constituent structures at the level of individual chunks. Little attention has been paid to the question of how such partial analyses can be combined into larger structures for complete utterances. Such larger structures are not only desirable for a deeper syntactic analysis. They also constit...
This book explores a wide variety of theoretically central issues in the framework of Head-Driven Phrase Structure Grammar (HPSG), a major theory of syntactic representation, particularly in the domain of natural language computation. HPSG is a strongly lexicon-driven theory, like several others on the scene, but unlike the others it also relies he...
The Verbmobil treebanks of spoken German, English, and Japanese are part of the Verbmobil project, which has the overriding goal to develop a speaker-independent system for the translation of spontaneous speech. In the framework of this language technology project, the treebanks provide training data for a variety of language technology modules. Th...