Koenraad De Smedt
University of Bergen | UiB
About
85
Publications
11,039
Reads
827
Citations
Publications (85)
To help realise its potential as the research infrastructure for language as social and cultural data, CLARIN is supporting the training of students and scholars in using its language data, tools and services. Lecturers and teachers in the CLARIN network have integrated CLARIN language resources into higher education programmes and other training a...
The CLARINO Bergen Centre, which provides scholars with access to digital language data and processing services, has in recent years provided substantial services to research and development in lexicography. This chapter describes the interplay between three major lexicography efforts and the centre. Easy access to large corpora in CLARINO and powe...
A guide to principles and methods for the management, archiving, sharing, and citing of linguistic research data, especially digital data.
“Doing language science” depends on collecting, transcribing, annotating, analyzing, storing, and sharing linguistic research data. This volume offers a guide to linguistic data management, engaging with current...
While corpora are increasingly used in grammar studies, LFG treebanks have been underused, despite their high level of detail and solid theoretical grounding. The INESS platform provides access to LFG treebanks for several languages, as well as tools to construct and explore LFG treebanks. We present the main features of treebank building and searc...
Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade h...
Automatic syntactic analysis of a corpus requires detailed lexical and morphological information that cannot always be harvested from traditional dictionaries. Therefore the development of a treebank presents an opportunity to simultaneously enrich the lexicon. In building NorGramBank, we use an incremental parsebanking approach, in which a corpus...
We present NorGramBank, a treebank for Norwegian with highly detailed LFG analyses. It is one of many treebanks made available through the INESS treebanking infrastructure. NorGramBank was constructed as a parsebank, i.e. by automatically parsing a corpus, using the wide coverage grammar NorGram. One part consisting of 350,000 words has been manual...
This is an unpublished paper which should not be published and should not be cited.
This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative's work throughout Europe in order to boost progress a...
This paper discusses the construction of a parallel treebank currently involving ten languages from six language families. The treebank is based on deep LFG (Lexical-Functional Grammar) grammars that were developed within the framework of the ParGram (Parallel Grammar) effort. The grammars produce output that is maximally parallelized across langua...
This study investigates how bilinguals use sublexical language membership information to speed up their word recognition process in different task situations. Norwegian-English bilinguals performed a Norwegian-English language decision task, a mixed English lexical decision task, or a mixed Norwegian lexical decision task. The mixed lexical decisio...
As part of the META-NORD project, the state of affairs in language technology in the Nordic and Baltic countries is being described in a set of eight reports. Each language report describes the situation of a language community and the position of the language service and language technology industry for that language. This position paper present...
The TREPIL project (Norwegian treebank pilot project 2004-2008) is aimed at developing and testing methods for the construction of a Norwegian parsed corpus. Annotation of c-structures, f-structures and mrs-structures is based on automatic parsing with human validation and disambiguation. Parsing is done with a large LFG grammar and the...
The META-NORD project has contributed to an open infrastructure for language resources (data and tools) under the META-NET umbrella. This paper presents the key objectives of META-NORD and reports on the results achieved in the first year of the project. META-NORD has mapped and described the national language technology landscape in the Nordic and...
This paper introduces the META-NORD project, which develops the Nordic and Baltic part of the European open language resource infrastructure. META-NORD works on assembling, linking across languages, and making widely available the basic language resources used by developers, professionals and researchers to build specific products and applications....
Proceedings of the NODALIDA 2009 workshop Nordic Perspectives on the CLARIN Infrastructure of Language Resources. Editors: Rickard Domeij, Kimmo Koskenniemi, Steven Krauwer, Bente Maegaard, Eiríkur Rögnvaldsson and Koenraad de Smedt. NEALT Proceedings Series, Vol. 5 (2009), i-ii. © 2009 The editors and contributors. Published by Northern European A...
Parallel grammars and parallel treebanks can be a useful method for studying linguistic diversity and commonality. We use this approach to study how arguments to similar predicates are realized across languages. To that end, we formulate formal principles for aligning at phrase and word levels based on translational correspondences at predicate-arg...
We present the LFG PARSEBANKER, a comprehensive toolkit for interactive incremental construction of a treebank as a parsed corpus. This web-based toolkit offers an environment for batch and interactive parsing, versioning, inspection of structures, discriminant-based disambiguation, and statistics. It has recently been extended with a structural...
The present paper reports on an investigation to find an answer to the question to what extent subjects base their judgments of linguistic distances on actual dialect data presented in a listening experiment and to what extent they involve previous knowledge of the dialects when making their judgments. The point of departure for our investigation w...
Segment Grammar (SG) is a grammar formalism which is closely related to TAG but operates on smaller increments called syntactic segments. Both use a similar unification operation as a general tree construction mechanism. Whereas TAG is committed to the distinction of syntactic constructions such as Subj-V-Obj, WH(Obj)-Subj-V, to-V(infinitive)-Obj i...
We extend discriminant-based disambiguation techniques to LFG grammars. We present the design and implementation of lexical, morphological, c-structure and f-structure discriminants for an LFG-based parser. Chief considerations in the computation of discriminants are capturing all distinctions between analyses and relating linguistic...
Objects are representations of entities in a domain which is modeled in a computer. Each object encapsulates the knowledge relevant to one abstract concept or physical object in the real world. This knowledge may consist of data but also of procedures which are applicable to the object. Using the metaphor of a society of communicating entities, the...
Semantic annotation of natural language text requires a certain degree of understanding of the document in question. Especially the resolution of unclear reference is a major challenge when detecting relevant information units in a document. The ongoing KunDoc project examines how domain specific ontologies can support the task of Coreference chain...
Dreistadt is an educational MOO (Multi User Domain, Object Oriented) for language learning. It presents a virtual world in which learners of German communicate with their fellow learners, teachers and native language users in other locations via the Internet. While the original Dreistadt had an artificial command language for interaction with the...
Most existing systems for the correction of word level errors are oriented toward either typographical or orthographical errors. Triphone analysis is a new correction strategy which combines phonemic transcription with trigram analysis. It corrects both kinds of errors (also in combination) and is superior for orthographical errors.
In any academic field, research advances tend to percolate naturally to higher education in that field. In recent years, there has been a slow but steady increase in the number of courses and degree programmes in humanities computing. This paper presents some reflections on the status of humanities computing in higher education, in terms of curricu...
Innovation in humanities education and research is stimulated by new technologies for the processing of language, speech, music, visual arts, and other expressions of the human mind. Three main topics will be discussed: the role of large-scale resources such as text corpora and digital archives; advanced methods and tools for processing and simulat...
We describe an object-oriented approach to the representation of linguistic knowledge. Rather than devising a dedicated grammar formalism, we explore the use of powerful but domain-independent object-oriented languages. We use default inheritance to organize regular and exceptional behavior of linguistic categories. Examples from our work in the ar...
Segment Grammar (SG) is a grammar formalism which is especially suited to model the incremental generation of sentences. SG is characterized by a dual level of syntactic description: f-structures, which are unordered functional structures composed out of syntactic segments, and c-structures, which represent left-to-right order of constituents. True...
The Seventh International Workshop on Natural Language Generation was held from 21 to 24 June 1994 in Kennebunkport, Maine. Sixty-seven people from 13 countries attended this 4-day meeting on the study of natural language generation in computational linguistics and AI. The goal of the workshop was to introduce new, cutting-edge work to the communit...
Current research in natural language generation is situated in a computational linguistics tradition that was founded several decades ago. We critically analyse some of the architectural assumptions underlying existing systems and point out some problems in the domains of text planning and lexicalization. Guided by the identification of major gener...
Action mode interfaces, in which the user achieves his goals by manipulating representations, suffer from some fundamental disadvantages. In this paper, we present a working prototype of a system for Continuous Linguistic Feedback Generation (CLFG), a facility that addresses some of the major disadvantages. CLFG generates natural language descripti...
In this introduction to the special issues, we begin by outlining a concrete example that indicates some of the motivations leading to the widespread use of inheritance networks in computational linguistics. This example allows us to illustrate some of the formal choices that have to be made by those who seek network solutions to natural language p...
A realistic model for natural language generation must account for overt revisions of the syntactic structure (self-corrections) as well as covert revisions (backtracking on syntactic options). This paper presents the preliminaries of a hybrid architecture for grammatical encoding (the 'tactical' phase in sentence generation) which allows such revi...
Incremental sentence generation imposes special constraints on the representation of the grammar and the design of the formulator (the module which is responsible for constructing the syntactic and morphological structure). In the model of natural speech production presented here, a formalism called Segment Grammar is used for the representation of...
A computer simulation model of the human speaker is presented which generates sentences in a piecemeal way. The module responsible for Grammatical Encoding (the tactical component) is discussed in detail. Generation is conceptually and lexically guided and may proceed from the bottom of the syntactic structure upwards as well as from the top downwa...
IPF (Incremental Parallel Formulator) is a computer model in which the formulation stage in sentence generation is distributed among a number of parallel processes. Each conceptual fragment which is passed on to the Formulator gives rise to a new process, which attempts to formulate only that fragment and then exits. The task of each formulation pr...
The accessibility of lexical information stored on computers is not only important for the human computer user, but also for programs that process natural language. The requirements with respect to the content and structure of a computer dictionary are different than for a printed dictionary and depend on the specific function of the language proce...
Since Garrett’s (1975, 1980) seminal work on speech error phenomena, it has become customary to distinguish four levels of representation within the sentence production process: a message level, a functional level, a positional level, and a phonetic level (see also Bock, this volume). Garrett’s model has been further elaborated and modified by Kemp...
The use of object-oriented programming techniques for the representation of morphological and syntactic knowledge is explored. The class/subclass and the rule/exception distinctions are captured by the inheritance mechanism, which allows overwriting inherited information. Morphological and syntactic functions are represented by a function/inverse f...
This position paper presents the META-NORD project, which develops the Nordic and Baltic part of the European open language resource infrastructure. META-NORD works on assembling, linking across languages, and making widely available the basic language resources used by developers, professionals and researchers to build specific products and application...
Current trends in language technology require treebanks that do not stop at the level of constituent structure, but include deeper and richer levels of analysis, including appropriate meaning structures. Capturing sufficient detail at different levels of linguistic description is too complex a task to be practically achievable by manual annotation...
Incremental sentence generation imposes special constraints on the representation of the grammar and the design of the formulator (the module which is responsible for constructing the syntactic and morphological structure). In the model of natural speech production presented here, a formalism called Segment Grammar (SG; Kempen, 1987) is used for th...
The KunDoc project investigates coreference chaining with ontology-based methods. In this paper, we discuss knowledge-based methods for coreference chaining and in particular the use of ontologies and their acquisition from a corpus. We present the KunDoc methodology and its implementation. We use concepts and their interrelations extracted from a...
We present the case for an extensive scientific effort to build up large treebanks for the Nordic and Baltic languages, as a step towards developing advanced multilingual communication technologies for these languages in the future. Nordic language technology is urgent. Language and speech processing is rap...
Chapter prepared for: De Smedt, K. (1996). Computational models of incremental grammatical encoding. In A. Dijkstra & K. de Smedt (Eds.), Computational psycholinguistics: AI and connectionist models of human language processing (pp. 279-307). London: Taylor & Francis.
Proceedings of the 16th Nordic Conference of Computational Linguistics NODALIDA-2007. Editors: Joakim Nivre, Heiki-Jaan Kaalep, Kadri Muischnek and Mare Koit. University of Tartu, Tartu, 2007. ISBN 978-9985-4-0513-0 (online) ISBN 978-9985-4-0514-7 (CD-ROM) pp. 152-159.
Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories. Editors: Koenraad De Smedt, Jan Hajič and Sandra Kübler. NEALT Proceedings Series, Vol. 1 (2007), i-ii. © 2007 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically publ...
Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories. Editors: Koenraad De Smedt, Jan Hajič and Sandra Kübler. NEALT Proceedings Series, Vol. 1 (2007), v. © 2007 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically publish...
Proceedings of the NODALIDA 2009 workshop Nordic Perspectives on the CLARIN Infrastructure of Language Resources. Editors: Rickard Domeij, Kimmo Koskenniemi, Steven Krauwer, Bente Maegaard, Eiríkur Rögnvaldsson and Koenraad de Smedt. NEALT Proceedings Series, Vol. 5 (2009), iv. © 2009 The editors and contributors. Published by Northern European Ass...
Proceedings of the NODALIDA 2009 workshop Nordic Perspectives on the CLARIN Infrastructure of Language Resources. Editors: Rickard Domeij, Kimmo Koskenniemi, Steven Krauwer, Bente Maegaard, Eiríkur Rögnvaldsson and Koenraad de Smedt. NEALT Proceedings Series, Vol. 5 (2009), v+45 pp. © 2009 The editors and contributors. Published by Northern Europea...