Ineke Schuurman
KU Leuven · Department of Linguistics
Additional affiliations
January 1989 - April 2016
Publications (63)
In this pilot study, we investigate the potential of pictograph translation technologies for facilitating communication and integration in the context of migration. We incorporate a new pictograph set in an existing text-to-pictograph translation system and carry out evaluations on three sets of authentic data (language classes, news articles, webs...
In order to enable or facilitate online communication for people with an intellectual disability, the Text-to-Pictograph translation system automatically translates Dutch written text into a series of Sclera or Beta pictographs. The baseline system presents the reader with a more or less verbatim pictograph-per-word translation. As a result, long a...
"But I don’t know how to work with [name of tool or resource]" is something one often hears when researchers in Human and Social Sciences (HSS) are confronted with language technology, be it written or spoken, tools or resources. The TTNWW project shows that these researchers do not need to be experts in language or speech technology, or to know al...
In order to enable or facilitate online communication for people with Intellectual Disabilities, the Text-to-Pictograph translation system automatically translates Dutch written text into a series of Sclera or Beta pictographs. The baseline system presents the reader with a more or less verbatim pictograph-per-word translation. As a result, long an...
The Able to Include project aims at improving the living conditions of people with intellectual or developmental disabilities (IDD) in key areas of society. One of its focus points concerns improving the integration of people with IDD in the workplace by introducing accessible Web-based tools. This paper describes one of the tools developed as resu...
In present-day society, we communicate over the Internet in several media forms. We put videos and images online, listen to music made by famous bands or by our friends, and read and write a lot of text. Never in the history of mankind have we produced more text than at this present moment, so being able to read and write is an important way of tak...
This paper presents a pictograph interface for Pictograph-to-Text translation, which facilitates the construction of written text on social media platforms for users with Intellectual Disabilities. For the design of the interface, a user-centred approach was adopted. Results show that the target group can appreciate accessing social media through p...
Information and Communication Technologies have radically changed the way in which we access and share information. However, accessibility for all is still far from being a reality. People with Intellectual or Developmental Disabilities (IDD) currently have very limited access to the information society and, in particular, to social media...
Compared to well-resourced languages such as English and Dutch, natural language processing (NLP) tools for Afrikaans are still not abundant. In the context of the AfriBooms project, KU Leuven and North-West University collaborated to develop a first, small treebank, a dependency parser, and an easy-to-use online linguistic search engine for Afr...
We describe the implementation of a Word Sense Disambiguation (WSD) tool in a Dutch Text-to-Pictograph translation system, which converts textual messages into sequences of pictographic images. The system is used in an online platform for Augmentative and Alternative Communication (AAC). In the original translation process, the appropriate sense of...
We describe and evaluate a text-to-pictograph translation system that is used in an online platform for Augmentative and Alternative Communication, which is intended for people who are not able to read and write, but who still want to communicate with the outside world. The system is set up to translate from Dutch into Sclera and Beta, two publicly...
We describe how a Dutch Text-to-Pictograph translation system, designed to augment written text for people with Intellectual or Developmental Disabilities (IDD), was adapted in order to be usable for English and Spanish. The original system has a language-independent design. As far as the textual part is concerned, it is adaptable to all natural la...
We describe the implementation and evaluation of a word sense disambiguation (WSD) tool in a translation system that converts English text messages into sequences of pictographic images. The Text-to-Picto tool for Dutch, English, and Spanish is used on the online communication platform “WAI-NOT” by people who have tro...
The traditional overviews of copular verbs are incomplete and draw poorly motivated distinctions, such as the one between the ‘real’ or prototypical copular verbs and their semantic equivalents. This contribution shifts the focus from the verbs to the constructions in which they occur. We call a construction copular when...
This chapter presents the Lassy Small and Lassy Large treebanks, as well as related tools and applications. Lassy Small is a corpus of written Dutch texts (1,000,000 words) which has been syntactically annotated with manual verification and correction. Lassy Large is a much larger corpus (over 500,000,000 words) which has been syntactically annotat...
The construction of a large and richly annotated corpus of written Dutch was identified as one of the priorities of the STEVIN programme. Such a corpus, sampling texts from conventional and new media, is invaluable for scientific research and application development. The present chapter describes how in two consecutive STEVIN-funded projects, viz....
Although several syntactically annotated corpora (or treebanks) exist for Dutch, they are seldom used for descriptive linguistic research because there are no easy-to-use exploitation tools available. This demonstration paper describes GrETEL, a linguistic search engine (http:// that enables non-technical us...
The ISOcat Data Category Registry provides a community computing environment for creating, storing, retrieving, harmonizing and standardizing data category specifications (DCs), used to register linguistic terms used in various fields. This chapter recounts the history of DC documentation in TC 37, beginning from paper-based lists created for lexic...
Standards and the need for standards, for example for annotation purposes, only emerge after a period of time. Before, people just did what they thought was right. This may have resulted in large amounts of data in a format that in the end did not turn out to be on speaking terms with the (new) standard. This format may even have become a de facto...
In this paper we want to point out some issues arising when a natural language processing task involves several languages (like multilingual, multidocument summarization and the machine translation aspects involved) which are often neglected. These issues are of a more cultural nature, and may even come into play when several documents in a singl...
This paper reports on the annotation of a corpus of 1 million words with four semantic annotation layers, including named entities, coreference relations, semantic roles and spatial and temporal expressions. These semantic annotation layers can benefit from the manually verified part-of-speech tagging, lemmatization and syntactic analysis (depend...
7th International Workshop on Treebanks and Linguistic Theories (TLT 7). Groningen
METIS-II was an EU-FET MT project running from October 2004 to September 2007, which aimed at translating free text input without resorting to parallel corpora. The idea was to use “basic” linguistic tools and representations and to link them with patterns and statistics from the monolingual target-language corpus. The METIS-II project has four par...
We are currently developing MiniSTEx, a spatiotemporal annotation system to handle temporal and/or geospatial information directly and indirectly expressed in texts. In the end, the aim is to locate all eventualities in a text on a time axis and/or a map to ensure an optimal base for automatic temporal and geospatial reasoning. A first version of Min...
In this paper we describe the METIS-II system and its evaluation on each of the language pairs: Dutch, German, Greek, and Spanish to English. The METIS-II system envisaged developing a data-driven approach in which no parallel corpus is required and in which no full parser or extensive rule sets are needed. We describe the evaluation on a developme...
The computational linguistics community in The Netherlands and Belgium has long recognized the dire need for a major reference corpus of written Dutch. In part to answer this need, the STEVIN programme was established. To pave the way for the effective building of a 500-million-word reference corpus of written Dutch, a pilot project was established...
We present the Narrator, a Natural Language Generation component used in a digital storytelling system. The system takes as input a formal representation of a story plot, in the form of a causal network relating the actions of the characters to their motives and their consequences. Based on this input, the Narrator generates a narrative in Dutch, b...
In this paper, we test the METIS-II MT system, from Dutch to English, under several experimental conditions: a verbatim condition in which word by word dictionary translations are used, a condition in which the effect of adding a target language corpus lookup is measured, and the effect of adding a few transfer rules to this. The results indicate t...
Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories. Editors: Koenraad De Smedt, Jan Hajič and Sandra Kübler. NEALT Proceedings Series, Vol. 1 (2007), 151-162. © 2007 The editors and contributors. Published by Northern European Association for Language Technology (NEALT). Electronically p...
In this paper we describe a machine translation prototype in which we use only minimal resources for both the source and the target language. A shallow source language analysis, combined with a translation dictionary and a mapping system of source language phenomena into the target language and a target language corpus for generation are all the r...
The construction of a 500-million-word reference corpus of written Dutch has been identified as one of the priorities in the Dutch/Flemish STEVIN programme. For part of this corpus, manually corrected syntactic annotations will be provided. The paper presents the background of the syntactic annotation efforts, the Alpino parser which is used as a...
In this paper, we combine techniques from rule-based and corpus-based MT in a hybrid approach. We only use a dictionary, basic analytical resources and a monolingual target-language corpus in order to enable the construction of an MT system for lesser-resourced languages. Statistical and example-based systems usually do not involve a lot of lingui...
For the METIS-II project (IST, start: 10-2004 end: 09-2007) we are working on an example-based machine translation system, making use of minimal resources and tools for both source and target language, i.e. making use of a target language corpus, but not of any parallel corpora.
In the current paper, we present the results of the first experiments...
The METIS-II project is an example-based machine translation system, making use of minimal resources and tools for both source and target language, making use of a target-language (TL) corpus, but not of any parallel corpora. In the current paper, we discuss the view of our team on the general philosophy and outline of the METIS-II system.
In this paper, we combine techniques from rule-based and corpus-based MT in a hybrid approach. We only use a dictionary, basic analytical resources and a monolingual target-language corpus in order to enable the construction of an MT system for lesser-resourced languages. Statistical and example-based systems usually do not involve a lot of linguis...
Although there are two variants of Dutch, the northern variant being the one used in the Netherlands and the southern variant in Flanders (Belgium), one corpus of spoken Dutch is under construction, the Spoken Dutch Corpus (CGN). In this paper first the principles of this corpus will be discussed, thereafter a few small case studies will show what...
The paper discusses the syntactic annotation for the Spoken Dutch Corpus, a Dutch/Flemish cooperation project to build an annotated corpus of about one thousand hours of continuous speech, which amounts to 10 million words. After a brief introduction to the project, we discuss the kind of syntactic annotations we envisage (dependency structures) an...
In this paper, we report on quantitative research into certain word order phenomena in Dutch. In our research, we use the Spoken Dutch Corpus (CGN), a major new resource for research into contemporary spoken Dutch. After briefly introducing the primary data, the annotations added, and some of the tools to explore the primary data and the annotation...
The paper describes the syntactic annotation of the Spoken Dutch Corpus ("Corpus Gesproken Nederlands" or CGN), the Dutch-Flemish project (1998-2003) aiming at the collection, description and annotation of ten million words of spoken Dutch. In the first part, the background of the parsing strategy is discussed, as well as some details concerning th...
eval; there is a new section 2.3.5 on prepositional expressions; the section (2.4.4) on complements within the nominal domain has been extended with a passage on complements of non-verbal nouns; section 5.1 on discourse markers has been extended with additional examples; there is a new section 5.4 on the treatment of anacolutha...
In this paper, we report on quantitative research into certain word order phenomena in Dutch. In our research, we use the Spoken Dutch Corpus (CGN), a major new resource for research into contemporary spoken Dutch. After briefly introducing the primary data, the annotations added, and some of the tools to explore the primary data and the annota- ti...
1. STATISTICAL METHODS. Antal van den BOSCH: Instance Families in Memory-Based Language Learning; Gert DURIEUX, Walter DAELEMANS & Steven GILLIS: On the Arbitrariness of Lexical Categories; Peter KLEIWEG and John NERBONNE: An FGREP Investigation into Phonotactics; Ivelin STOIANOV and John NERBONNE: Exploring Phonotactics with Simple Recurrent Networks...
Of the ten million words of contemporary standard Dutch in the Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN), a selection of one million words of natural spoken language will be annotated syntactically. In the present paper we discuss the tag sets and the annotation procedures that are currently being developed and tested. The annotation...
In this paper the ANNO Project ("Een Geannoteerde Publieke Gegevensbank voor het Geschreven Nederlands/An Annotated Database for Written Dutch") is reported on. The project aims at laying the foundations for the compilation and linguistic annotation of a large multi-functional Flemish text corpus. The corpus available now consists of language wr...
The European Union has been actively stimulating fundamental and applied research on language and speech technology for more than a decade. A substantial number of European projects aimed at the development of resources and infrastructure. While the development of general standards and multilingual tools and resources is a concern of the European...
The aim of MiniSTEx, a system for automatic spatiotemporal annotation, is to locate eventualities on a time-axis and to disambiguate geospatial information in such a way that geospatial entities can be located on a map. Therefore all kinds of spatiotemporal (geospatial, temporal and geotemporal) expressions are disambiguated. In doing so, the...
Certain Dutch construction types are much more frequent in spoken than in written language. Topic drop sentences such as (1) and constructions as in (2) (which we might call 'mirror sentences') are relatively common in spoken language but rare in written prose (with the exception, perhaps, of the most informal of text types such as Internet chat)...
In this paper we report on the experiences gained in the recent construction of the SoNaR corpus, a 500-million-word reference corpus of contemporary, written Dutch. It shows what can realistically be done within the confines of a project setting where there are limitations to the duration in time as well as to the budget, employing current state-of-the-art too...
In this paper on-going work of creating an extensive multilingual parallel corpus of movie subtitles is presented. The corpus currently contains roughly 23,000 pairs of aligned subtitles covering about 2,700 movies in 29 languages. Subtitles mainly consist of transcribed speech, sometimes in a very condensed way. Insertions, deletions and paraphra...
Most documents researched in the human and social sciences will be enriched one way or another, at least with metadata. Sometimes documents are also enriched with one or more types of annotation. Often the notions used can be interpreted in several ways, which raises the question: "What is meant in a particular case?" ISOcat is an ISO 12620:2009 co...